I am parsing data from multiple sources and I want to assign a unique (string) id to each entry. Each entry contains a title (string), a url (string) and a body (string). The same title can come from multiple sources, but those entries will have different urls, and in that case I would like to store both items. I am thinking of hashing the title and url together and using the hash as the id. That way, if I get the same title and url from different sources, the id will be the same and I will be able to tell that it's a duplicate.
import hashlib
title, url = "Some title", "https://example.com/a"  # placeholder values for illustration
hashlib.sha256(f"{title} {url}".encode('utf-8')).hexdigest()
But I think there can be a case where two different title/url combinations generate the same hash, and I'm not sure how to avoid that clash. Can someone suggest a way of generating a unique identifier from strings? I don't want to use a timestamp, because I might get the same row from different sources at different times.
Answer
You're safe: in practice you won't get two different title/url combinations that produce the same hash with SHA-256.
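Here is a minimal sketch of how the id could be built and used to spot duplicates. The function name `make_entry_id`, the sample entries and the `seen` dictionary are assumptions for illustration, not part of the question. It length-prefixes each field before hashing, which is one way (not the only one) to make sure two different (title, url) pairs never serialize to the same bytes.

```python
import hashlib

def make_entry_id(title: str, url: str) -> str:
    """Return a hex SHA-256 digest derived from title and url (hypothetical helper)."""
    parts = []
    for field in (title, url):
        data = field.encode("utf-8")
        # Prefix each field with its byte length so ("ab", "c") and
        # ("a", "bc") cannot produce the same input to the hash.
        parts.append(len(data).to_bytes(8, "big") + data)
    return hashlib.sha256(b"".join(parts)).hexdigest()

# Made-up sample data: the first and third entries share title and url.
entries = [
    ("Same title", "https://example.com/a", "body from source 1"),
    ("Same title", "https://example.com/b", "body from source 2"),
    ("Same title", "https://example.com/a", "body from source 3"),
]

seen = {}
for title, url, body in entries:
    entry_id = make_entry_id(title, url)
    if entry_id in seen:
        continue  # same title and url already stored: treat as a duplicate
    seen[entry_id] = (title, url, body)
```

With this sample data, the two entries with different urls are both kept and the repeated title/url pair is skipped, whichever source it arrives from and whenever it arrives.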
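SHA-256 is a cryptographic hash function from the SHA-2 hash family, which NIST first published in 2001. For any two given distinct inputs, the probability that they produce the same output is about 1/(2^256). By the birthday bound, you would need on the order of 2^128 hashed entries before a collision anywhere in the set becomes likely, and 1/(2^128) is about 2.9e-39.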
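To put a number on this for a realistic dataset, here is a rough sketch using the standard birthday approximation. It assumes SHA-256 outputs behave like uniformly random 256-bit values. The function name `collision_probability` is made up for this example.

```python
import math

def collision_probability(n_entries: int, bits: int = 256) -> float:
    """Approximate chance that any two of n_entries distinct inputs share a digest.

    Assumes the hash behaves like a uniformly random function (an idealization).
    """
    # Birthday approximation: p ~= 1 - exp(-n * (n - 1) / 2^(bits + 1))
    pairs = n_entries * (n_entries - 1) / 2
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(-pairs / 2.0**bits)

# Estimate for one billion stored entries.
p = collision_probability(10**9)
print(p)
```

For a billion entries the estimate comes out at roughly 4e-60, so an accidental clash is not something you need to design around. The probability only approaches even odds at around 2^128 entries.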