
dataclass hash() of field with type annotation and default value = None is always nondeterministic

I am running into some unexpected behavior when trying to hash a dataclass and I’m wondering if anyone can explain it.

The script below reproduces the problem. Before running it, set export PYTHONHASHSEED='0' to disable hash randomization so the hashes can be compared across runs.

import os
from dataclasses import dataclass
from typing import Optional

assert os.getenv("PYTHONHASHSEED", None) == "0"


@dataclass(frozen=True)
class Foo:
    x = 1
    y = None


@dataclass(frozen=True)
class Bar:
    x: Optional[int] = 1
    y = None


@dataclass(frozen=True)
class Foobar:
    x = 1
    y: Optional[int] = None


print("hash(Foo()):", hash(Foo()))
print("hash(Bar()):", hash(Bar()))
print("hash(Foobar()):", hash(Foobar()))

Here’s the result of running the script twice:

>>> py temp.py 
hash(Foo()): 5740354900026072187
hash(Bar()): -6644214454873602895
hash(Foobar()): 582415153292506125
>>> py temp.py 
hash(Foo()): 5740354900026072187
hash(Bar()): -6644214454873602895
hash(Foobar()): -8226650923609135754

Note that the hash for the first two classes is the same across runs, but the hash of the last class is different each time. It seems to be the combination of the type annotation with the value None in the class Foobar that causes the hash to change. (Incidentally, if I replace Optional[int] with int I get the same behavior.)

I tried with both Python 3.9 and 3.10 and got similar results each time.

Can anyone explain what is going on?


Answer

Dataclass fields must be annotated. The annotation is how the dataclass machinery determines that something is a field. All 3 of your dataclasses are broken due to missing annotations.
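
To see which attributes the dataclass machinery actually treats as fields, you can ask dataclasses.fields(). The sketch below reuses the three classes from the question; the helper field_names and the class Fixed are names made up for illustration only.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Foo:
    x = 1      # no annotation: plain class attribute
    y = None   # no annotation: plain class attribute


@dataclass(frozen=True)
class Bar:
    x: Optional[int] = 1
    y = None   # no annotation: plain class attribute


@dataclass(frozen=True)
class Foobar:
    x = 1      # no annotation: plain class attribute
    y: Optional[int] = None


@dataclass(frozen=True)
class Fixed:
    # Hypothetical repaired version: every field is annotated.
    x: Optional[int] = 1
    y: Optional[int] = None


def field_names(cls):
    """Return the names of the attributes that dataclasses sees as fields."""
    return [f.name for f in fields(cls)]


print("Foo:   ", field_names(Foo))
print("Bar:   ", field_names(Bar))
print("Foobar:", field_names(Foobar))
print("Fixed: ", field_names(Fixed))
```

Only the annotated attributes show up, so Foo has no fields at all, Bar has only x, and Foobar has only y.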


Disabling hash randomization isn’t supposed to make hashes deterministic. It just disables one specific security feature that deliberately randomizes some types’ hashes to mitigate hash collision-based denial-of-service attacks.
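
As a rough illustration of what PYTHONHASHSEED does control, the sketch below starts fresh interpreters with a fixed seed and compares the hash of a str, one of the types the randomization applies to. The helper name hash_in_fresh_interpreter is made up for this example, and it assumes sys.executable can be launched as a subprocess.

```python
import os
import subprocess
import sys


def hash_in_fresh_interpreter(expr: str, seed: str) -> int:
    """Evaluate hash(<expr>) in a new Python process with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({expr}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# With the same fixed seed, the str hash is the same in both processes.
first = hash_in_fresh_interpreter("'abc'", "0")
second = hash_in_fresh_interpreter("'abc'", "0")
print("str hash stable with seed 0:", first == second)
```

The seed only pins down the hashes that the randomization feature touches. It says nothing about objects whose hash comes from somewhere else, such as a memory address.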

The default CPython object.__hash__ is nondeterministic. It’s based on the object’s memory address, which is not consistent from run to run. In the Python versions you tested (3.9 and 3.10), None uses this default hash, so hash(None) is nondeterministic. A dataclass hash is computed from its fields’ hashes, so the hash of a dataclass with a None field value is also nondeterministic. However, since your dataclasses are broken, Foobar is the only one where y is actually a field.

Bar()’s hash seems to be deterministic because it only depends on the hashes of ints and tuples (the frozen dataclass __hash__ implementation builds a tuple of field values and hashes that), and the int and tuple hash algorithms happen to be close to deterministic. Foo() has no fields at all, so its hash is just the hash of an empty tuple. Neither result is actually guaranteed, though: both depend on whether you’re on a 32-bit or 64-bit Python build, and while the int hashing algorithm is mostly specified, the tuple hash algorithm is entirely an implementation detail.
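
You can check the tuple relationship directly. This is a small sketch that relies on the CPython dataclass implementation described above (an implementation detail, not a documented guarantee), using two of the classes from the question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bar:
    x: Optional[int] = 1
    y = None   # not a field, so it does not take part in the hash


@dataclass(frozen=True)
class Foobar:
    x = 1      # not a field, so it does not take part in the hash
    y: Optional[int] = None


# Within a single process, each dataclass hash matches the hash of a
# tuple holding its field values (only the annotated attributes).
print("Bar matches (1,):      ", hash(Bar()) == hash((1,)))
print("Foobar matches (None,):", hash(Foobar()) == hash((None,)))
```

So whatever makes hash(None) vary between runs carries straight through to hash(Foobar()).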

hash is not designed to be deterministic, no matter what settings you use. If you need deterministic hashing, do not use hash.
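
If you do need a value that is stable across runs, one option is to serialize the fields and feed them to a real hash function. The following is only a sketch under some assumptions: the fields are JSON-serializable, the name stable_digest is made up for this example, and SHA-256 is just one reasonable choice.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Foobar:
    x: int = 1
    y: Optional[int] = None


def stable_digest(obj) -> str:
    """Return a SHA-256 hex digest of a dataclass instance's field values.

    Assumes every field value can be serialized by json.dumps.
    """
    # sort_keys and fixed separators keep the serialized form canonical.
    payload = json.dumps(asdict(obj), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


print(stable_digest(Foobar()))
print(stable_digest(Foobar()) == stable_digest(Foobar(x=1, y=None)))
```

The digest depends only on the serialized field values, not on memory addresses or PYTHONHASHSEED, so it is the same on every run. It is not a replacement for __hash__ in dicts and sets; it is meant for things like cache keys or comparing results between processes.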

User contributions licensed under: CC BY-SA