Exact mechanism behind the hash() function?

davesanders · March 6, 2024, 9:37pm

We've got some data that we've run through an EM process that turns a bunch of numbers into their hashed, base64-encoded equivalents. However, we're trying to write code in another system (Python) and can't for the life of us get the hashes to match.

Can someone confirm which hashing algorithm the hash() function uses? And then I assume it's just a straight base64 encoding of that value, without any seeding or anything?

Basically I need to compute the same hash in another system for comparisons later. It's not being used as an id, just some secure data.

Thanks!

dgudkov · March 6, 2024, 9:56pm

It's MurmurHash3. You can try the mmh3 package on PyPI.

Yes, then it's just base64.

ckononenko · March 7, 2024, 1:08pm

Note that MurmurHash3 is a non-cryptographic hash.
Use hashhex() or even hmachex() if you need a cryptographic hash.

davesanders · March 8, 2024, 8:03pm

Well, unfortunately, I can't recreate the same hashes. I can see the values in EM and set up a really quick bit of Python to generate the values using mmh3, but the results are nowhere near the same.

Is there any way to see the MurmurHash value in EM before it's base64 encoded? I'm trying to figure out if my problem is in the hashing or the encoding.

Here's my code and the results:

import mmh3
import base64

v1 = mmh3.hash("11122811")
v1_encoded = base64.b64encode(str(v1).encode())
print("v1:", v1, "v1_encoded:", v1_encoded)

v2 = mmh3.hash("0")
v2_encoded = base64.b64encode(str(v2).encode())
print("v2:", v2, "v2_encoded:", v2_encoded)

11122811 = v1: -736067490 v1_encoded: b'LTczNjA2NzQ5MA=='
0 = v2: -764297089 v2_encoded: b'LTc2NDI5NzA4OQ=='

EM reports these values:

11122811: WmyDDpiEEtPmFuCfRWYLKA
0: 1hipffIbvUu2HHnN7KlltA

dgudkov · March 8, 2024, 8:38pm

I suspect the difference is that Python encodes strings in UTF-8, but .NET encodes in UTF-16 (little endian).
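To make the difference concrete, here is a small sketch using only the Python standard library. The same text yields different byte sequences under the two encodings, so any hash computed over those bytes will differ as well. The sample value is the one from the post above.

```python
# Same text, two encodings: a byte-oriented hash such as MurmurHash3 sees different input.
text = "11122811"

utf8_bytes = text.encode("utf-8")        # one byte per ASCII character
utf16_bytes = text.encode("utf-16-le")   # two bytes per character, low byte first, no BOM

print(len(utf8_bytes), utf8_bytes.hex())
print(len(utf16_bytes), utf16_bytes.hex())
```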

Try something like below (I'm not a Python expert):

import mmh3
import base64

txt = "11122811"
v1 = mmh3.hash(txt.encode('UTF-16LE'))
v1_encoded = base64.b64encode(str(v1).encode())
print("v1:", v1, "v1_encoded:", v1_encoded)

I'll ask one of our software developers to take a look at it.

davesanders · March 8, 2024, 9:55pm

Thanks Dmitry, that's a good lead. Sadly, I tried out a couple different encodings and some different ways of generating the hash and b64, but no luck.

I'm also not a Python expert, so I'm sort of feeling my way through this.

Thanks for checking with your team, let me know if any revelations come up. I'm sure I'm doing something wrong, I just don't know which part.

ckononenko · March 9, 2024, 12:04am

You need to use the 128-bit x64 implementation of MurmurHash3.
For strings, use UTF-16 (little endian).
However, number values must be normalized and converted to a byte array via a special function.
After that, encode the result of MurmurHash3 as base64 and take the first 22 characters.
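
One way to tell whether a mismatch comes from the hashing step or the encoding step is to reverse the last step: restore the base64 padding and decode an EM value back into raw bytes. This is a minimal sketch using only the standard library; `em_hash_to_bytes` is a hypothetical helper name, and the sample value is the EM output for 11122811 quoted earlier in the thread.

```python
import base64

def em_hash_to_bytes(em_hash: str) -> bytes:
    """Decode a 22-character EM hash back into the raw hash bytes."""
    # 16 bytes encode to 24 base64 characters ending in '=='; EM keeps the first 22.
    padded = em_hash + "=" * (-len(em_hash) % 4)
    return base64.b64decode(padded)

raw = em_hash_to_bytes("WmyDDpiEEtPmFuCfRWYLKA")
print(len(raw), raw.hex())  # length and hex form, for comparison with a 128-bit digest
```

A 16-byte result means the value is a 128-bit digest, which the 32-bit mmh3.hash() cannot reproduce regardless of the input encoding.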

ckononenko · March 9, 2024, 8:21pm

Here is some sample Python 3 code. It works for strings. Let us know if you need code for numeric values.

import mmh3
import base64

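def hash_str(text):
    # Hash the UTF-16LE bytes of the string with the 128-bit x64 MurmurHash3 (seed 0)
    h = mmh3.hash128(text.encode('UTF-16LE'), seed=0, x64arch=True, signed=False)
    # Base64-encode the 16 hash bytes, then drop the trailing '==' padding
    hash_encoded = base64.b64encode(h.to_bytes(16, byteorder='little', signed=False)).decode()
    return hash_encoded[:22]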


if __name__ == '__main__':
    text = "Vive la République!"
    print(f'hash={hash_str(text)}')

hash=kRADzn6YbcT84bBvvParnQ


davesanders · March 11, 2024, 5:49pm

Thanks, yeah we are primarily using numeric values for this matching. I used ChatGPT to try to figure that out and it came up with this implementation, which works for strings, but gives a different answer for numbers than what EM is giving us.

import mmh3
import base64
import struct

def hash_value(value):
    # Determine the type and pack accordingly
    if isinstance(value, int):
        # For a 64-bit integer
        value_bytes = struct.pack('<q', value)  # Little-endian 8-byte integer
    elif isinstance(value, float):
        # For a double-precision float
        value_bytes = struct.pack('<d', value)  # Little-endian 8-byte float
    else:
        value_bytes = str(value).encode('UTF-16LE')
    
    h = mmh3.hash128(value_bytes, seed=0, x64arch=True, signed=False)
    hash_encoded = base64.b64encode(h.to_bytes(16, byteorder='little', signed=False)).decode()
    return hash_encoded

if __name__ == '__main__':
    value = 11122811  # Can be an int or float
    print(f'{value} hash = {hash_value(value)}')

If I use the string "111228811", I get "FrnIdQEyv6GNAGyNCYe9FA==", which is what your Python code also gave me.

If I use the number 111228811 with the above code, I get "Bre5wUQKCyhkuSKopSyQtg==", but an EM hash() of that value gives us "WmyDDpiEEtPmFuCfRWYLKA".

ckononenko · March 11, 2024, 11:00pm

Hi
There are some limitations. This code works only with integer values up to 28 digits.
Fractional numbers are not supported.


import mmh3
import base64

def hash_num(num):
    # Lay the magnitude out as 16 little-endian bytes
    n = bytearray(abs(num).to_bytes(16, byteorder='little', signed=False))
    if num < 0:
        # Negative values set the top bit of the last byte as a sign flag
        n[15] |= 128

    h = mmh3.hash128(bytes(n), seed=0, x64arch=True, signed=False)
    hash_encoded = base64.b64encode(h.to_bytes(16, byteorder='little', signed=False)).decode()
    return hash_encoded[:22]



if __name__ == '__main__':
    d = 11122811
    dm = -11122811
    d0 = 0
    dl = 9234567890123456789012345678

    print(f'v={d} hash={hash_num(d)}')
    print(f'v={dm} hash={hash_num(dm)}')
    print(f'v={d0} hash={hash_num(d0)}')
    print(f'v={dl} hash={hash_num(dl)}')

v=11122811 hash=WmyDDpiEEtPmFuCfRWYLKA
v=-11122811 hash=QL1qRgMOKDpUevdBMOvxVg
v=0 hash=1hipffIbvUu2HHnN7KlltA
v=9234567890123456789012345678 hash=GH3KE3G663suGCB91Vx0tg
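
The byte layout that hash_num feeds into MurmurHash3 can be isolated and guarded against the limitations stated above. This is only a sketch, not EM's actual normalization routine: `num_to_em_bytes` is a hypothetical name, and reading "up to 28 digits" as abs(num) < 10**28 is an assumption.

```python
def num_to_em_bytes(num: int) -> bytes:
    """Build the 16-byte input used by hash_num above (integers only)."""
    if isinstance(num, bool) or not isinstance(num, int):
        raise TypeError("only integer values are supported")
    if abs(num) >= 10**28:  # assumed reading of the 28-digit limit
        raise ValueError("integer has more than 28 digits")
    # Magnitude as 16 little-endian bytes
    buf = bytearray(abs(num).to_bytes(16, byteorder="little", signed=False))
    if num < 0:
        buf[15] |= 0x80  # sign flag in the top bit of the last byte
    return bytes(buf)

for value in (11122811, -11122811, 0):
    print(value, num_to_em_bytes(value).hex())
```

Printing the bytes this way makes it easy to compare the hash input against another implementation before MurmurHash3 is involved at all.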
