
Check if files in dir are the same

I have a folder of 5000+ images in JPEG/PNG etc. How can I check if any of the images are the same? The images were collected through web scraping and have been sequentially renamed, so I cannot compare file names.

I am currently checking if the hashes are the same, but this is a very slow process. I am currently using:

import imagehash
from PIL import Image

def sameIm(file_name1, file_name2):
    # Compute a perceptual (average) hash for each image and compare them.
    # `path` is the directory containing the images.
    hash = imagehash.average_hash(Image.open(path + file_name1))
    otherhash = imagehash.average_hash(Image.open(path + file_name2))

    return hash == otherhash

Then I compare every pair with nested loops. Comparing one image to the 5000+ others takes about 5 minutes, so comparing each image to every other would take days to compute.
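For reference, the pairwise driver implied above looks roughly like this (a sketch only; it assumes the `sameIm` helper and `path` from the snippet above, since the actual loop isn't shown in the question):

import os

# Sketch of the quadratic pairwise comparison described above.
file_names = sorted(os.listdir(path))
duplicates = []
for i in range(len(file_names)):
    for j in range(i + 1, len(file_names)):  # each unordered pair once
        if sameIm(file_names[i], file_names[j]):
            duplicates.append((file_names[i], file_names[j]))
# ~5000 files -> ~12.5 million pairs, hence the multi-day runtime.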

Is there a faster way to do this in Python? I was thinking of parallel processing, but would that still take a long time?

Or is there another way to compare the files that is faster?

Thanks


Answer

There is indeed a much faster way of doing this:

import collections
import glob
import os

import imagehash
from PIL import Image


def dupDetector(dirpath, ext):
    # Group the files by their perceptual hash: one pass over the directory.
    hashes = collections.defaultdict(list)
    for fpath in glob.glob(os.path.join(dirpath, "*.{}".format(ext))):
        h = imagehash.average_hash(Image.open(fpath))
        hashes[h].append(fpath)

    # Any hash bucket with more than one file is a group of duplicates.
    for h, fpaths in hashes.items():
        if len(fpaths) == 1:
            print(fpaths[0], "is one of a kind")
            continue
        print("The following files are duplicates of each other (with the hash {}):\n\t{}".format(h, '\n\t'.join(fpaths)))

Using a dictionary with the file hash as the key gives you O(1) lookups, which means you don't need to do the pairwise comparisons. You therefore go from a quadratic runtime to a linear runtime (yay!)
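For example, a call like the following would print the duplicate groups for the JPEG and PNG files in a folder (the directory name "images" is just a placeholder for your scraped folder):

# Example usage of dupDetector from the answer above.
dupDetector("images", "jpg")
dupDetector("images", "png")

This computes one hash per file, about 5000 hash computations in total, instead of millions of pairwise comparisons.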
