I have a folder of 5000+ images in JPEG, PNG, and other formats. How can I check whether any of the images are duplicates? The images were collected through web scraping and have been sequentially renamed, so I cannot compare file names.
I am currently checking whether the hashes are the same, but this is a very slow process. I am currently using:
import imagehash
from PIL import Image

def sameIm(file_name1, file_name2):
    # Compare two images by their perceptual (average) hashes
    hash = imagehash.average_hash(Image.open(path + file_name1))
    otherhash = imagehash.average_hash(Image.open(path + file_name2))
    return (hash == otherhash)
Then I compare in nested loops. Comparing one image to the 5000+ others takes about 5 minutes, so comparing every image to every other one would take days to compute.
Is there a faster way to do this in Python? I was thinking of parallel processing, but would that still take a long time?
Or is there another, faster way to compare the files?
Thanks
Answer
There is indeed a much faster way of doing this:
import collections
import glob
import os

import imagehash
from PIL import Image

def dupDetector(dirpath, ext):
    # Group every file in dirpath (with the given extension) by its perceptual hash
    hashes = collections.defaultdict(list)
    for fpath in glob.glob(os.path.join(dirpath, "*.{}".format(ext))):
        h = imagehash.average_hash(Image.open(fpath))
        hashes[h].append(fpath)
    # Any hash that maps to more than one path is a set of duplicates
    for h, fpaths in hashes.items():
        if len(fpaths) == 1:
            print(fpaths[0], "is one of a kind")
            continue
        print("The following files are duplicates of each other (with the hash {}):\n\t{}".format(h, '\n\t'.join(fpaths)))
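For example, you could call it like this (the directory path and extension below are just placeholders, not from the original question):

# Example call -- replace the path and extension with your own
dupDetector("/path/to/scraped/images", "png")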
Using a dictionary with the image hash as the key gives you O(1) lookups, which means you don't need to do the pairwise comparisons. You therefore go from quadratic runtime to linear runtime (yay!).
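Regarding the parallel-processing idea from the question: if computing the 5000+ hashes is still the slow part, that step can also be spread across processes before the grouping is done. The sketch below is one possible way to do that, not part of the original answer; the path and extension are placeholders and the function names are my own.

import collections
import glob
import multiprocessing
import os

import imagehash
from PIL import Image

def hash_one(fpath):
    # Hash a single image; runs in a worker process
    return fpath, imagehash.average_hash(Image.open(fpath))

def dupDetectorParallel(dirpath, ext):
    fpaths = glob.glob(os.path.join(dirpath, "*.{}".format(ext)))
    hashes = collections.defaultdict(list)
    # Compute the hashes in parallel, then group them in the parent process
    with multiprocessing.Pool() as pool:
        for fpath, h in pool.map(hash_one, fpaths):
            hashes[h].append(fpath)
    return hashes

if __name__ == "__main__":
    # Placeholder path and extension -- adjust to your own folder
    groups = dupDetectorParallel("/path/to/scraped/images", "jpg")
    for h, paths in groups.items():
        if len(paths) > 1:
            print("Duplicates (hash {}): {}".format(h, ", ".join(paths)))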