Exact mechanism behind the hash() function?

We've got some data that we've run through an EM process which hashes a bunch of numbers into their hashed, base64'd equivalents. However, we're trying to write some other code in another system (Python) and can't for the life of us get the hashes to match.

Can someone confirm what kind of hashing algorithm is being used for the hash() function? And then I assume it's just a straight base64 encoding happening on that value, without any seeding or anything?

Basically I need to compute the same hash in another system for comparisons later. It's not being used as an id, just some secure data.

Thanks!

It's MurmurHash3. You can try the mmh3 package on PyPI.

Yes, then it's just base64.

Pay attention to the fact that MurmurHash3 is a non-cryptographic hash.
You should use hashhex() or even hmachex() if you need a cryptographic hash.

Well, unfortunately, I can't recreate the same hashes. I can see the values in EM and set up a really quick bit of Python to generate the values using mmh3, but the results are nowhere near the same.

Is there any way to see the murmurhashed value in EM before it's base64 encoded? I'm trying to figure out if my problem is in the hashing, or the encoding.

Here's my code and the results:

import mmh3
import base64

v1 = mmh3.hash("11122811")
v1_encoded = base64.b64encode(str(v1).encode())
print("v1:", v1, "v1_encoded:", v1_encoded)

v2 = mmh3.hash("0")
v2_encoded = base64.b64encode(str(v2).encode())
print("v2:", v2, "v2_encoded:", v2_encoded)
11122811 = v1: -736067490 v1_encoded: b'LTczNjA2NzQ5MA=='
0 = v2: -764297089 v2_encoded: b'LTc2NDI5NzA4OQ=='

EM reports these values:

11122811: WmyDDpiEEtPmFuCfRWYLKA
0: 1hipffIbvUu2HHnN7KlltA
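
One way to narrow down hashing versus encoding is to decode the EM values back to raw bytes and look at their length. The sketch below assumes the EM strings are standard base64 with the trailing '=' padding stripped; decode_em_hash is a hypothetical helper name, not something EM provides.

```python
import base64

def decode_em_hash(em_hash: str) -> bytes:
    """Decode an unpadded base64 string (as shown by EM) back to raw bytes."""
    # Assumption: EM only strips the trailing '=' padding, so we restore it here.
    padding = "=" * (-len(em_hash) % 4)
    return base64.b64decode(em_hash + padding)

for em_hash in ("WmyDDpiEEtPmFuCfRWYLKA", "1hipffIbvUu2HHnN7KlltA"):
    raw = decode_em_hash(em_hash)
    print(em_hash, len(raw), raw.hex())
```

Both values decode to 16 bytes, i.e. 128 bits. mmh3.hash() returns a 32-bit integer, and the attempt above base64-encodes its decimal text rather than its bytes, so it cannot produce these values whatever the string encoding.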

I suspect the difference is that Python encodes strings in UTF-8, but .NET encodes in UTF-16 (little endian).

Try something like below (I'm not a Python expert):

import mmh3
import base64

txt = "11122811"
v1 = mmh3.hash(txt.encode('UTF-16LE'))
v1_encoded = base64.b64encode(str(v1).encode())
print("v1:", v1, "v1_encoded:", v1_encoded)
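
To illustrate why the encoding matters, here is a small sketch (standard library only) showing that the same text becomes different bytes under UTF-8 and UTF-16LE. Since a hash is computed over bytes, the two inputs would hash differently.

```python
txt = "11122811"

utf8_bytes = txt.encode("utf-8")       # 1 byte per ASCII character
utf16_bytes = txt.encode("utf-16-le")  # 2 bytes per ASCII character, low byte first

print(len(utf8_bytes), utf8_bytes)
print(len(utf16_bytes), utf16_bytes)
```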

I'll ask one of our software developers to take a look at it.

Thanks Dmitry, that's a good lead. Sadly, I tried out a couple different encodings and some different ways of generating the hash and b64, but no luck.

I'm also not a Python expert so I'm sort of feeling my way through this. :)

Thanks for checking with your team, let me know if any revelations come up. I'm sure I'm doing something wrong, I just don't know which part.

You need to use the 128-bit x64 implementation of MurmurHash3.
For strings, encode the text as UTF-16 (little endian).
Number values, however, must be normalized and converted to a byte array via a special function.
After that, encode the result of MurmurHash3 as base64 and take the first 22 characters.

Here is some sample Python 3 code. It works for strings. Let us know if you need code for numeric values.

import mmh3
import base64

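def hash_str(text):
    # Hash the UTF-16LE bytes with the 128-bit x64 variant of MurmurHash3.
    h = mmh3.hash128(text.encode('UTF-16LE'), seed=0, x64arch=True, signed=False)
    # Serialize the 128-bit result as 16 little-endian bytes, then base64-encode.
    hash_encoded = base64.b64encode(h.to_bytes(16, byteorder='little', signed=False)).decode()
    return hash_encoded[:22]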


if __name__ == '__main__':
    text = "Vive la République!"
    print(f'hash={hash_str(text)}')
hash=kRADzn6YbcT84bBvvParnQ
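
As a side note on the "first 22 characters" step: 16 bytes always encode to 24 base64 characters, of which the last two are '=' padding, so the slice simply drops the padding. A minimal sketch with a placeholder 16-byte value standing in for the hash (no mmh3 needed):

```python
import base64

digest = bytes(range(16))  # placeholder for the 16-byte MurmurHash3 result
encoded = base64.b64encode(digest).decode()

print(len(encoded), encoded)                  # 24 characters, ending in '=='
print(encoded[:22] == encoded.rstrip("="))    # the slice only removes the padding
```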


Thanks, yeah we are primarily using numeric values for this matching. I used ChatGPT to try to figure that out and it came up with this implementation, which works for strings, but gives a different answer for numbers than what EM is giving us.

import mmh3
import base64
import struct

def hash_value(value):
    # Determine the type and pack accordingly
    if isinstance(value, int):
        # For a 64-bit integer
        value_bytes = struct.pack('<q', value)  # Little-endian 8-byte integer
    elif isinstance(value, float):
        # For a double-precision float
        value_bytes = struct.pack('<d', value)  # Little-endian 8-byte float
    else:
        value_bytes = str(value).encode('UTF-16LE')
    
    h = mmh3.hash128(value_bytes, seed=0, x64arch=True, signed=False)
    hash_encoded = base64.b64encode(h.to_bytes(16, byteorder='little', signed=False)).decode()
    return hash_encoded

if __name__ == '__main__':
    value = 11122811  # Can be an int or float
    print(f'{value} hash = {hash_value(value)}')

If I use "11122811", then I get "FrnIdQEyv6GNAGyNCYe9FA==" which is what your python code also gave me.

If I use 11122811 with the above code, then I'm getting Bre5wUQKCyhkuSKopSyQtg==, but an EM hash() of that value gives us "WmyDDpiEEtPmFuCfRWYLKA"

Hi
There are some limitations. This code works only with integer values up to 28 digits.
Fractional numbers are not supported.


import mmh3
import base64


def hash_num(num):
    # Serialize the magnitude as 16 little-endian bytes.
    n = bytearray(abs(num).to_bytes(16, byteorder='little', signed=False))
    if num < 0:
        # Negative numbers set the highest bit of the last byte as a sign flag.
        n[15] |= 128

    h = mmh3.hash128(bytes(n), seed=0, x64arch=True, signed=False)
    hash_encoded = base64.b64encode(h.to_bytes(16, byteorder='little', signed=False)).decode()
    # 16 bytes encode to 24 base64 characters; the first 22 drop the '==' padding.
    return hash_encoded[:22]



if __name__ == '__main__':
    d = 11122811
    dm = -11122811
    d0 = 0
    dl = 9234567890123456789012345678

    print(f'v={d} hash={hash_num(d)}')
    print(f'v={dm} hash={hash_num(dm)}')
    print(f'v={d0} hash={hash_num(d0)}')
    print(f'v={dl} hash={hash_num(dl)}')

v=11122811 hash=WmyDDpiEEtPmFuCfRWYLKA
v=-11122811 hash=QL1qRgMOKDpUevdBMOvxVg
v=0 hash=1hipffIbvUu2HHnN7KlltA
v=9234567890123456789012345678 hash=GH3KE3G663suGCB91Vx0tg
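
For anyone comparing this with the struct.pack('<q', ...) attempt earlier in the thread, the sketch below shows only the byte layout that gets hashed (standard library, no mmh3). decimal_like_bytes is a hypothetical name for illustration; it reproduces the serialization used in hash_num, namely a 16-byte little-endian magnitude with the sign stored in the top bit of the last byte.

```python
import struct

def decimal_like_bytes(num: int) -> bytes:
    """16-byte little-endian magnitude; sign flag in the top bit of byte 15."""
    buf = bytearray(abs(num).to_bytes(16, byteorder="little", signed=False))
    if num < 0:
        buf[15] |= 0x80
    return bytes(buf)

pos = decimal_like_bytes(11122811)
neg = decimal_like_bytes(-11122811)

print(pos.hex())                            # 16 bytes
print(neg.hex())                            # same magnitude, sign bit set
print(struct.pack("<q", 11122811).hex())    # only 8 bytes
print(struct.pack("<q", -11122811).hex())   # two's complement, not sign-magnitude
```

The 8-byte struct.pack form matches the first half of the 16-byte form for non-negative values, but the hash runs over all 16 bytes, so the results differ. For negative values the two layouts differ even in the low bytes.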


Thank you very much. This all worked, but I have problems with hashing floats. Could you please explain how EM hashes this data type?

I tried this:

import mmh3, base64, pandas as pd
import struct


def _hash_core(buf: bytes) -> str:
    h = mmh3.hash128(buf, seed=0, signed=False, x64arch=True)
    return base64.b64encode(h.to_bytes(16, byteorder="little", signed=False)).decode()[:22]


def em_mmh3(val):
    if pd.isna(val):  # None or NaN
        return val

    if isinstance(val, bool):                   # 1-byte rule
        return _hash_core(b"\x01" if val else b"\x00")

    if isinstance(val, int):
        if val < 0:
            n = val.__abs__().to_bytes(16, byteorder='little', signed=False)
            n = bytearray(n)
            n[15] = n[15] | 128
            n = bytes(n)
        else:
            n = val.to_bytes(16, byteorder='little', signed=False)
        return _hash_core(n)

    if isinstance(val, float):
        value_bytes = struct.pack('<d', val)
        return _hash_core(value_bytes)

    return _hash_core(str(val).encode("utf-16le"))

But floats still hash incorrectly.

EasyMorph represents numbers as fixed-point 128-bit decimals, not floats. This type of decimal doesn't exist in Python as far as I know.

If possible, try to multiply your floats by 10^N and cast them to integers for hashing.

N depends on how many fractional decimal digits you want to preserve.
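
A minimal sketch of that scaling step on the Python side, assuming half-up rounding is acceptable (the thread does not say how EM rounds, so apply the same rule on both sides). scale_to_int is a hypothetical helper; it goes through decimal.Decimal because multiplying a binary float directly and truncating with int() may be off by one.

```python
from decimal import Decimal, ROUND_HALF_UP

def scale_to_int(value, digits: int) -> int:
    """Multiply by 10**digits and round to an integer without binary float error."""
    # str(value) gives the shortest decimal text for a float, e.g. '0.1599'.
    scaled = Decimal(str(value)) * (Decimal(10) ** digits)
    # Assumption: half-up rounding; use whatever rule the EM side uses.
    return int(scaled.to_integral_value(rounding=ROUND_HALF_UP))

print(scale_to_int(0.1599, 4))  # 1599
```

The resulting integer can then be passed to the hash_num function posted above.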

Thank you very much for the quick response. Let me explain my issue. My EM script produces some results and hashes them using the EM hash function. I then check those results in Python code. So if we have a Number value, for example 0.1599, and I hash it with the EM function, I want to be able to reproduce the same hash in Python. How does the EM hash function compute a hash from the Number 0.1599, so that I can do the same in Python and be sure the hashes are equal? Do I understand correctly that this is impossible, and that I need to change my EM logic to multiply the value by 10^N before hashing?

Yes, that's correct. If the hashes are supposed to be equal, both of them should be calculated from numbers multiplied by 10^N and cast to an integer.

Alternatively, you can convert the numbers to text and calculate a hash from the text (float->text, decimal->text). But in any case, it has to be done on both sides.
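
A rough sketch of the text route on the Python side. The important assumption is that the text produced here is character-for-character identical to what EM produces when it converts the number to text, which this thread does not specify (for example, whether 1.0 becomes "1" or "1.0"), so check a few values by hand first. number_to_hash_bytes is a hypothetical helper name.

```python
from decimal import Decimal

def number_to_hash_bytes(value) -> bytes:
    """Format a number as plain decimal text and encode it as UTF-16LE."""
    # format(..., 'f') avoids scientific notation such as '1e-07'.
    text = format(Decimal(str(value)), "f")
    return text.encode("utf-16-le")

print(number_to_hash_bytes(0.1599))
```

These bytes correspond to what the hash_str function posted earlier hashes for the text "0.1599".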

Thank you very much for the quick response. I will use your advice on converting to text.