I am running into some unexpected behavior when trying to hash a dataclass and I’m wondering if anyone can explain it.
The below script reproduces the problem. First, we need to run export PYTHONHASHSEED='0' to disable hash randomization so we can compare the hash across runs.
import os
from dataclasses import dataclass
from typing import Optional

assert os.getenv("PYTHONHASHSEED", None) == "0"


@dataclass(frozen=True)
class Foo:
    x = 1
    y = None


@dataclass(frozen=True)
class Bar:
    x: Optional[int] = 1
    y = None


@dataclass(frozen=True)
class Foobar:
    x = 1
    y: Optional[int] = None


print("hash(Foo()):", hash(Foo()))
print("hash(Bar()):", hash(Bar()))
print("hash(Foobar()):", hash(Foobar()))
Here’s the result of running the script twice:
>>> py temp.py
hash(Foo()): 5740354900026072187
hash(Bar()): -6644214454873602895
hash(Foobar()): 582415153292506125
>>> py temp.py
hash(Foo()): 5740354900026072187
hash(Bar()): -6644214454873602895
hash(Foobar()): -8226650923609135754
Note that the hash for the first two classes is the same across runs, but the hash of the last class is different each time. It seems to be the combination of the type annotation with the value None in the class Foobar that causes the hash to change. (Incidentally, if I replace Optional[int] with int I get the same behavior.)
I tried with both Python 3.9 and 3.10 and got similar results each time.
Can anyone explain what is going on?
Answer
Dataclass fields must be annotated. The annotation is how the dataclass machinery determines that something is a field. All 3 of your dataclasses are broken due to missing annotations.
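To check which attributes the dataclass machinery actually treats as fields, you can ask dataclasses.fields(). This is a small sketch using the three classes from the question; field_names is just a helper name made up for this illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Foo:
    x = 1      # no annotation: a plain class attribute, not a field
    y = None   # no annotation: a plain class attribute, not a field


@dataclass(frozen=True)
class Bar:
    x: Optional[int] = 1   # annotated: a field
    y = None               # not a field


@dataclass(frozen=True)
class Foobar:
    x = 1                     # not a field
    y: Optional[int] = None   # annotated: a field


def field_names(cls):
    """Hypothetical helper: names of the fields the dataclass decorator found."""
    return [f.name for f in fields(cls)]


for cls in (Foo, Bar, Foobar):
    print(cls.__name__, field_names(cls))
# Foo []
# Bar ['x']
# Foobar ['y']
```

So Foo has no fields at all, Bar only has x, and Foobar only has y.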
Disabling hash randomization isn’t supposed to make hashes deterministic. It just disables one specific security feature that deliberately randomizes some types’ hashes to mitigate hash collision-based denial-of-service attacks.
The default CPython object.__hash__ is nondeterministic. It’s based on an object’s address, which is not consistent from run to run. In the Python versions you tested (3.9 and 3.10), None uses this default hash, so hash(None) is nondeterministic, and your dataclass hashes are based on their fields’ hashes, so the hash of a dataclass with a None field value is also nondeterministic. However, since your dataclasses are broken, Foobar is the only one where y is actually a field.
Bar()’s hash seems to be deterministic because it only depends on the hashes of ints and tuples (the frozen dataclass __hash__ implementation builds a tuple of field values and hashes that), and the int and tuple hash algorithms happen to be close to deterministic. They’re not actually deterministic, though; they depend on whether you’re on a 32-bit or 64-bit Python build, and while the int hashing algorithm is mostly specified, the tuple hash algorithm is all implementation details.
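You can see the tuple-of-fields behavior directly by comparing against a hand-built tuple. This is only a sketch of what current CPython happens to do; the generated __hash__ is an implementation detail, so don’t rely on these equalities in real code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Foo:
    x = 1      # not a field
    y = None   # not a field


@dataclass(frozen=True)
class Bar:
    x: Optional[int] = 1   # the only field
    y = None               # not a field


# Foo has no fields, so its hash is the hash of an empty tuple.
print(hash(Foo()) == hash(()))

# Bar's only field is x, so its hash is the hash of the 1-tuple (1,).
print(hash(Bar()) == hash((1,)))
```

On CPython both comparisons print True, which is why neither hash involves None.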
hash is not designed to be deterministic, no matter what settings you use. If you need deterministic hashing, do not use hash.
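If you do need a value that is stable across runs, one option is to feed a stable serialization of the fields into a cryptographic digest from hashlib. The following is a minimal sketch, not the only way to do it: stable_digest is a hypothetical helper, Foobar is redefined here with both attributes annotated, and the approach assumes every field value has a repr that doesn’t change between runs (ints, strings, None and tuples of those are fine; sets or objects with the default repr are not).

```python
import hashlib
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass(frozen=True)
class Foobar:
    x: int = 1                # annotated, so this is a real field
    y: Optional[int] = None   # annotated, so this is a real field


def stable_digest(obj) -> str:
    """Hypothetical helper: SHA-256 over the class name and field values.

    Assumes the repr of every field value is the same on every run.
    """
    payload = repr((type(obj).__qualname__, astuple(obj))).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


print(stable_digest(Foobar()))
```

Equal instances get the same digest, and the digest doesn’t depend on PYTHONHASHSEED or on object addresses.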