
Finding identical numbers in large files in Python

I have two data files that I am processing in Python, each containing two columns of numbers.


There are about 10M entries in each file (~400Mb).

I have to go through each file and check whether any number in the first column of one file matches any number in the first column of the other file.

The code I currently have converts the files (call them ch1.txt and ch2.txt) to lists, roughly like this:

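# read each file into a list of (number, value) pairs;
# the two columns are assumed to be whitespace-separated integers
list1 = []
with open('ch1.txt') as f1:
    for line in f1:
        a, b = line.split()
        list1.append((int(a), int(b)))

list2 = []
with open('ch2.txt') as f2:
    for line in f2:
        a, b = line.split()
        list2.append((int(a), int(b)))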

I then iterate through both of the lists looking for a match. When a match is found, I wish to add the sum of the right-hand columns to a new list ‘coin’:

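# naive approach: compare every entry of list1 against every entry of list2
coin = []
for n1, v1 in list1:
    for n2, v2 in list2:
        if n1 == n2:
            coin.append(v1 + v2)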

The issue is that this takes a very long time and/or crashes. Is there a more efficient way of doing this?


Answer

There are lots of ways to improve this; for example:

  • Since you only scan through the contents of ch1.txt once, you don’t need to read it into a list; that should take up less memory, but it probably won’t speed things up all that much.

  • If you sort each of your lists, you can check for matches much more efficiently by walking through both lists together instead of comparing every pair. Something like:

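# sort both lists by their first column, then walk through them in step
# (assumes each number appears at most once per file)
list1.sort()
list2.sort()

coin = []
i = j = 0
while i < len(list1) and j < len(list2):
    n1, v1 = list1[i]
    n2, v2 = list2[j]
    if n1 < n2:
        i += 1                     # list1 is behind, advance it
    elif n1 > n2:
        j += 1                     # list2 is behind, advance it
    else:
        coin.append(v1 + v2)       # matching numbers: record the sum
        i += 1
        j += 1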

If the data in the files are already sorted, you can combine both ideas: instead of advancing an index, you simply read in the next line of the corresponding file. This should improve efficiency in time and space.
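For example (a rough sketch; ch2.txt again stands in for the second file, and the columns are assumed to be whitespace-separated):

coin = []
with open('ch1.txt') as f1, open('ch2.txt') as f2:
    line1, line2 = f1.readline(), f2.readline()
    while line1 and line2:                 # stop at the end of either file
        n1, v1 = map(int, line1.split())
        n2, v2 = map(int, line2.split())
        if n1 < n2:
            line1 = f1.readline()          # first file is behind, read its next line
        elif n1 > n2:
            line2 = f2.readline()          # second file is behind, read its next line
        else:
            coin.append(v1 + v2)           # matching numbers: record the sum
            line1 = f1.readline()
            line2 = f2.readline()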
