Skip to content

det_hash is not consistent with __eq__ #602

@apoorvkh

Description

@apoorvkh

🐛 Describe the bug

Problem

I found that det_hash can sometimes hash equal objects differently. This was causing me a lot of grief when trying to manually initializing a step and retrieve its cached results. I feel that the default hashing in Tango should (at least) respect __eq__ for built-in types and fallback to Pickle serialization for more complex and custom data structures.

Examples

from tango.common.det_hash import det_hash

x = ["bert", "bert"]
y = ["bert", "bert "[:-1]]
assert x == y
assert det_hash(x) == det_hash(y)  # AssertionError

# Also, suppose you invoke this script as `python script.py bert bert`
assert x == sys.argv[1:]
assert det_hash(x) == det_hash(sys.argv[1:])  # AssertionError

# I understand that this is because the object hierarchies may be different for `x` and `y`.

x_addr = [id(i) for i in x]
y_addr = [id(i) for i in y]
assert x_addr[0] == x_addr[1]
assert y_addr[0] == y_addr[1]  # AssertionError

# Therefore, `x` and `y` pickle differently.

import dill
assert dill.dumps(x) == dill.dumps(y)  # AssertionError

Solutions

In Python, object hashes (hash or __hash__) are indeed expected to respect equality: i.e.

x == y implies both that x is y and hash(x) == hash(y)

However, there are a few problems with using __hash__:

  1. mutable data types are not supposed to implement __hash__

  2. not all classes will implement __hash__

  3. determinism can only be toggled when initializing the interpreter

I propose the following implementation:

  1. At hash time in Tango, we just want the object's instantaneous hash and do not actually care whether objects are immutable. So we can manually hash the mutable built-in types. Also, since mutable types may be nested, we can hash recursively.

  2. We can simply fall back to pickle serialization for objects that are not hashable. I am assuming that hash collisions between __hash__ and pickle are negligible.

  3. Two suggestions:

    3.1. We could write a simple C extension to temporarily disable hash determinism during Tango's hashing function. (Note: I'm not personally familiar with cpython, am just assuming that this variable can be changed.) Then, we will have deterministic hashing for all types that implement __hash__.
    3.2. Alternatively, we could write a custom function with deterministic hashing logic for all built-in Python types. This would be recursive (like rec_hash above) to handle nested data structures. But custom classes that do implement __hash__ will not be supported.

Together (with 3.1) that's:

def rec_hash(o: Any) -> str:
    if isinstance(o, collections.abc.Sequence):  # tuples, lists, ranges
        return hash((rec_hash(x) for x in o))
    elif isinstance(o, set):
        # set elements are guaranteed to be hashable
        return hash(frozenset(o))
    elif isinstance(o, dict):
        # dict keys are guaranteed to be hashable
        return hash(sorted(hash((k, rec_hash(v)) for k,v in o.items())))
    elif isinstance(o, collections.abc.Hashable):
        # nested types may be un-hashable, could raise TypeError
        return hash(o)
    raise TypeError(f"unhashable type: '{type(o).__name__}'")


def det_hash(o: Any) -> str:
    try:
        with hash_seed(0):  # TODO: need to implement this
            h = rec_hash(o)
        return base58.b58encode_int(h)
    except TypeError:
        pass

    # Fallback to pickling
    m = hashlib.blake2b()
    with io.BytesIO() as buffer:
        pickler = _DetHashPickler(buffer)
        pickler.dump(o)
        m.update(buffer.getbuffer())
        return base58.b58encode(m.digest()).decode()

Please let me know what you think, thanks!

Versions

Python 3.9.16
ai2-tango==1.2.1
base58==2.1.1
black==23.7.0
boto3==1.28.54
botocore==1.31.54
cached-path==1.4.0
cachetools==5.3.1
certifi==2023.7.22
charset-normalizer==3.2.0
click==8.1.7
click-help-colors==0.9.2
dill==0.3.7
filelock==3.12.4
fsspec==2023.9.2
gitdb==4.0.10
GitPython==3.1.37
glob2==0.7
google-api-core==2.12.0
google-auth==2.23.0
google-cloud-core==2.3.3
google-cloud-storage==2.11.0
google-crc32c==1.5.0
google-resumable-media==2.6.0
googleapis-common-protos==1.60.0
huggingface-hub==0.16.4
idna==3.4
jmespath==1.0.1
markdown-it-py==3.0.0
mdurl==0.1.2
more-itertools==9.1.0
packaging==23.1
petname==2.6
protobuf==4.24.3
pyasn1==0.5.0
pyasn1-modules==0.3.0
Pygments==2.16.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
rich==13.5.3
rjsonnet==0.5.3
rsa==4.9
s3transfer==0.6.2
six==1.16.0
smmap==5.0.1
sqlitedict==2.1.0
tqdm==4.66.1
typing_extensions==4.8.0
urllib3==1.26.16
xxhash==3.3.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions