Skip to content
Advertisement

how to hash a “json” nested dictionary identically in Python and JavaScript?

What’s the best way to consistently hash an object/dictionary that’s limited to what JSON can represent, in both JavaScript and Python? What about in many different languages?

Of course there are hash functions implemented consistently in many different languages that take a string, but to hash an object you have to convert it to a string representation first.

I want a hash function that will always return the same value for the same dictionary in any language, but the JSON spec doesn’t guarantee anything about the order of keys in the serialized representation.

Do json.dumps() and JSON.stringify() behave identically? How would you verify this?

If not, is there a serialization format with libraries in many languages (I’m practically interested in Python and JavaScript but also curious about all languages) that doesn’t require any additional processing by the caller to produce consistent results?

Advertisement

Answer

I would split this into two problems.

  1. How do you get the same serialized string in both JavaScript and Python?
  2. Which byte array hash function should you use? It must be an established algorithm with identical implementations in both JavaScript and Python.

Use (1) to get two strings, then UTF8 encode, then use (2) to get hashes.

Since (2) is straightforward, I’ll only address (1).

There are multiple facets to the problem of making sure the two JSON strings you generate are identical.

  • You’ll want to used unformatted JSON (no extraneous spaces, tabs, or newlines).
  • null values must be treated identically. Some serializers will by default throw away a dictionary key-value pair if the value is null.
  • Ordering of key-value pairs within a dictionary must be consistent.
  • JSON number serialization should be consistent. For example, you can’t have integer one serialize as 1 on one side and 1.0 on the other. (This probably won’t be as big of an issue however.)
  • The string encoding should be the same for both. JSON allows serialization to Unicode text, only mandating that " and be backslash-escaped in JSON strings. Most serializers do more than necessary, however, and reduce almost all Unicode characters to the uXXXX equivalent. See json.org for the details on JSON string encoding. One way to remove all ambiguity is to only escape when absolutely necessary.

You’ll want to make sure all of these are matched between JavaScript and Python. Most JSON serialization libraries I’ve used provide configuration hooks for all of the things I mention in the list above. Unfortunately, I’m not very familiar with the JavaScript or Python libraries.

Advertisement