Generate unique id using strings

I am parsing data from multiple sources and I want to assign a unique (string) id to each entry. Each entry contains a title (string), a url (string) and a body (string). The same title can come from multiple sources, but those entries will have different urls, and in that case I would like to store both items. I am thinking of hashing the title and url together and using the hash as the id. That way, if I get the same title and url from different sources, the id will be the same and I will be able to identify the entry as a duplicate.

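import hashlib

title, url = "Some title", "https://example.com/some-title"  # example values
# Hash the actual field values, joined by a separator ("\0") that should not occur in either field
entry_id = hashlib.sha256(f"{title}\0{url}".encode("utf-8")).hexdigest()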

But I think there can be a case where two different title/url combinations generate the same hash, and I am not sure how to handle such a clash. Can someone suggest a way of generating a unique identifier from strings? I don't want to use a timestamp, because I might get the same row from different sources at different times.


Answer

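In practice you're safe: collisions are theoretically possible with any hash, but no SHA-256 collision has ever been found, so two different title/url combinations will not realistically generate the same hash.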
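
As a minimal sketch of the idea, assuming the id should depend only on the title and url: the function name make_entry_id and the JSON encoding of the pair are illustrative choices, not something from the question. Encoding the two fields as a JSON list keeps the boundary between them unambiguous, so ("ab", "c") and ("a", "bc") cannot end up as the same input to the hash.

```python
import hashlib
import json


def make_entry_id(title: str, url: str) -> str:
    """Return a hex SHA-256 digest derived from the (title, url) pair.

    Hypothetical helper: the name and the JSON encoding are assumptions
    made for this example.
    """
    # Serialize the pair as a JSON list so the field boundary is explicit.
    payload = json.dumps([title, url])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same title and url from two sources give the same id (a duplicate).
a = make_entry_id("Some title", "https://example.com/a")
b = make_entry_id("Some title", "https://example.com/a")
# Same title with a different url gives a different id.
c = make_entry_id("Some title", "https://example.com/b")
print(a == b, a == c)
```

The body is deliberately left out of the hash, so the same title and url always map to the same id even if the body text differs between sources.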


SHA-256 is a cryptographic hash function from the SHA-2 family, which NIST first published in 2001.

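The probability that two given distinct inputs produce the same output is about 1/(2^256), roughly 9e-78. By the birthday bound, you would need on the order of 2^128 (about 3.4e38) entries before a collision anywhere in the dataset becomes likely.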
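
To get a feel for the numbers across a whole dataset, here is a rough sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)) with N = 2^256 possible digests. It assumes SHA-256 output behaves like a uniformly random value; the function name collision_probability is made up for this example.

```python
import math

DIGEST_SPACE = 2 ** 256  # number of possible SHA-256 outputs


def collision_probability(n: int) -> float:
    """Approximate chance that at least two of n distinct inputs collide.

    Uses the birthday approximation and assumes uniformly random digests.
    """
    exponent = n * (n - 1) / (2 * DIGEST_SPACE)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exponent)


# A billion entries: the probability is astronomically small.
p_billion = collision_probability(10 ** 9)
# Around 2**128 entries the probability finally becomes substantial.
p_birthday = collision_probability(2 ** 128)
print(p_billion, p_birthday)
```

For any realistic number of parsed entries the result is far below the chance of a hardware fault corrupting the data, which is why the hash can be treated as unique in practice.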


See: SHA-256 collisions on crypto.stackexchange

User contributions licensed under: CC BY-SA