-
Notifications
You must be signed in to change notification settings - Fork 53
Description
🐛 Describe the bug
Problem
I found that det_hash can sometimes hash equal objects differently. This was causing me a lot of grief when trying to manually initializing a step and retrieve its cached results. I feel that the default hashing in Tango should (at least) respect __eq__ for built-in types and fallback to Pickle serialization for more complex and custom data structures.
Examples
from tango.common.det_hash import det_hash
x = ["bert", "bert"]
y = ["bert", "bert "[:-1]]
assert x == y
assert det_hash(x) == det_hash(y) # AssertionError
# Also, suppose you invoke this script as `python script.py bert bert`
assert x == sys.argv[1:]
assert det_hash(x) == det_hash(sys.argv[1:]) # AssertionError
# I understand that this is because the object hierarchies may be different for `x` and `y`.
x_addr = [id(i) for i in x]
y_addr = [id(i) for i in y]
assert x_addr[0] == x_addr[1]
assert y_addr[0] == y_addr[1] # AssertionError
# Therefore, `x` and `y` pickle differently.
import dill
assert dill.dumps(x) == dill.dumps(y) # AssertionErrorSolutions
In Python, object hashes (hash or __hash__) are indeed expected to respect equality: i.e.
x == yimplies both thatx is yandhash(x) == hash(y)
However, there are a few problems with using __hash__:
-
mutable data types are not supposed to implement
__hash__ -
not all classes will implement
__hash__ -
determinism can only be toggled when initializing the interpreter
I propose the following implementation:
-
At hash time in Tango, we just want the object's instantaneous hash and do not actually care whether objects are immutable. So we can manually hash the mutable built-in types. Also, since mutable types may be nested, we can hash recursively.
-
We can simply fall back to
pickleserialization for objects that are not hashable. I am assuming that hash collisions between__hash__andpickleare negligible. -
Two suggestions:
3.1. We could write a simple C extension to temporarily disable hash determinism during Tango's hashing function. (Note: I'm not personally familiar with cpython, am just assuming that this variable can be changed.) Then, we will have deterministic hashing for all types that implement
__hash__.
3.2. Alternatively, we could write a custom function with deterministic hashing logic for all built-in Python types. This would be recursive (likerec_hashabove) to handle nested data structures. But custom classes that do implement__hash__will not be supported.
Together (with 3.1) that's:
def rec_hash(o: Any) -> str:
if isinstance(o, collections.abc.Sequence): # tuples, lists, ranges
return hash((rec_hash(x) for x in o))
elif isinstance(o, set):
# set elements are guaranteed to be hashable
return hash(frozenset(o))
elif isinstance(o, dict):
# dict keys are guaranteed to be hashable
return hash(sorted(hash((k, rec_hash(v)) for k,v in o.items())))
elif isinstance(o, collections.abc.Hashable):
# nested types may be un-hashable, could raise TypeError
return hash(o)
raise TypeError(f"unhashable type: '{type(o).__name__}'")
def det_hash(o: Any) -> str:
try:
with hash_seed(0): # TODO: need to implement this
h = rec_hash(o)
return base58.b58encode_int(h)
except TypeError:
pass
# Fallback to pickling
m = hashlib.blake2b()
with io.BytesIO() as buffer:
pickler = _DetHashPickler(buffer)
pickler.dump(o)
m.update(buffer.getbuffer())
return base58.b58encode(m.digest()).decode()Please let me know what you think, thanks!
Versions
Python 3.9.16
ai2-tango==1.2.1
base58==2.1.1
black==23.7.0
boto3==1.28.54
botocore==1.31.54
cached-path==1.4.0
cachetools==5.3.1
certifi==2023.7.22
charset-normalizer==3.2.0
click==8.1.7
click-help-colors==0.9.2
dill==0.3.7
filelock==3.12.4
fsspec==2023.9.2
gitdb==4.0.10
GitPython==3.1.37
glob2==0.7
google-api-core==2.12.0
google-auth==2.23.0
google-cloud-core==2.3.3
google-cloud-storage==2.11.0
google-crc32c==1.5.0
google-resumable-media==2.6.0
googleapis-common-protos==1.60.0
huggingface-hub==0.16.4
idna==3.4
jmespath==1.0.1
markdown-it-py==3.0.0
mdurl==0.1.2
more-itertools==9.1.0
packaging==23.1
petname==2.6
protobuf==4.24.3
pyasn1==0.5.0
pyasn1-modules==0.3.0
Pygments==2.16.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
requests==2.31.0
rich==13.5.3
rjsonnet==0.5.3
rsa==4.9
s3transfer==0.6.2
six==1.16.0
smmap==5.0.1
sqlitedict==2.1.0
tqdm==4.66.1
typing_extensions==4.8.0
urllib3==1.26.16
xxhash==3.3.0