
Check if files in dir are the same

I have a folder of 5000+ images in JPEG/PNG etc. How can I check if any of the images are the same? The images were collected through web scraping and have been sequentially renamed, so I cannot compare file names.

I am currently checking if the hashes are the same, but this is a very slow process. I am currently using:

import imagehash
from PIL import Image

def sameIm(file_name1, file_name2):
    # Compute a perceptual (average) hash for each image and compare them.
    # `path` is the directory containing the images.
    hash = imagehash.average_hash(Image.open(path + file_name1))
    otherhash = imagehash.average_hash(Image.open(path + file_name2))

    return hash == otherhash

Then I compare every pair with nested loops. Comparing one image to the 5000+ others takes about 5 minutes, so comparing each image to every other would take days to compute.
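For reference, the pairwise driver implied above looks roughly like this (a sketch only; it assumes the `sameIm` helper and `path` from the snippet above, since the actual loop isn't shown in the question):

import os

# Sketch of the quadratic pairwise comparison described above.
file_names = sorted(os.listdir(path))
duplicates = []
for i in range(len(file_names)):
    for j in range(i + 1, len(file_names)):  # each unordered pair once
        if sameIm(file_names[i], file_names[j]):
            duplicates.append((file_names[i], file_names[j]))
# ~5000 files -> ~12.5 million pairs, hence the multi-day runtime.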

Is there a faster way to do this in Python? I was thinking of parallel processing, but would that still take a long time?

Or is there another way to compare the files that is faster?

Thanks


Answer

There is indeed a much faster way of doing this:

import collections
import glob
import os

import imagehash
from PIL import Image


def dupDetector(dirpath, ext):
    # Group the files by their perceptual hash: one pass over the directory.
    hashes = collections.defaultdict(list)
    for fpath in glob.glob(os.path.join(dirpath, "*.{}".format(ext))):
        h = imagehash.average_hash(Image.open(fpath))
        hashes[h].append(fpath)

    # Any hash bucket with more than one file is a group of duplicates.
    for h, fpaths in hashes.items():
        if len(fpaths) == 1:
            print(fpaths[0], "is one of a kind")
            continue
        print("The following files are duplicates of each other (with the hash {}):\n\t{}".format(h, '\n\t'.join(fpaths)))

Using a dictionary with the file hash as the key gives you O(1) lookups, which means you don't need to do the pairwise comparisons. You therefore go from a quadratic runtime to a linear runtime (yay!)
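For example, a call like the following would print the duplicate groups for the JPEG and PNG files in a folder (the directory name "images" is just a placeholder for your scraped folder):

# Example usage of dupDetector from the answer above.
dupDetector("images", "jpg")
dupDetector("images", "png")

This computes one hash per file, about 5000 hash computations in total, instead of millions of pairwise comparisons.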
