I have a folder of 5000+ images in jpeg/png etc. How can I check if any of the images are the same. The images were collected through web scraping and have been sequentially renamed so I cannot compare file names.
I am currently checking if the hashes are the same however this is a very long process. I am currently using:
def sameIm(file_name1,file_name2):
hash = imagehash.average_hash(Image.open(path + file_name1))
otherhash = imagehash.average_hash(Image.open(path + file_name2))
return (hash == otherhash)
Then nested loops. Comparing 1 image to 5000+ others takes about 5mins so comparing each to each would take days to compute.
Is there a faster way to do this in python. I was thinking parallel processing but would that still take a long time?
or is there another way to compare files which is faster?
Thanks
Advertisement
Answer
There is indeed a much faster way of doing this:
import collections
import glob
import os
def dupDetector(dirpath, ext):
hashes = collections.defaultdict(list)
for fpath in glob.glob(os.path.join(dirpath, "*.{}".format(ext))):
h = imagehash.average_hash(Image.open(fpath))
hashes[h].append(fpath)
for h,fpaths in hashes.items():
if len(fpaths) == 1:
print(fpaths[0], "is one of a kind")
continue
print("The following files are duplicates of each other (with the hash {}): nt{}".format(h, 'nt'.join(fpaths)))
Using the dictionary with the file hash as a key gives you O(1) lookups, which means you don’t need to do the pair-wise comparisons. You therefore go from a quadratic runtime, to a linear runtime (yay!)