Skip to content
Advertisement

Check if files in dir are the same

I have a folder of 5000+ images in jpeg/png etc. How can I check if any of the images are the same. The images were collected through web scraping and have been sequentially renamed so I cannot compare file names.

I am currently checking if the hashes are the same however this is a very long process. I am currently using:

JavaScript

Then nested loops. Comparing 1 image to 5000+ others takes about 5mins so comparing each to each would take days to compute.

Is there a faster way to do this in python. I was thinking parallel processing but would that still take a long time?

or is there another way to compare files which is faster?

Thanks

Advertisement

Answer

There is indeed a much faster way of doing this:

JavaScript

Using the dictionary with the file hash as a key gives you O(1) lookups, which means you don’t need to do the pair-wise comparisons. You therefore go from a quadratic runtime, to a linear runtime (yay!)

Advertisement