I have two data files that I'm processing in Python, each containing two-column data as below:
3023084 5764
9152549 5812
18461998 5808
45553152 5808
74141469 5753
106932238 5830
112230478 5795
135207137 5800
148813978 5802
154818883 5798
There are about 10M entries in each file (~400 MB).
I have to sort through each file and check if any number in the first column of one file matches any number in the first column of the other file.
The code I currently have converts the files to lists:
ch1 = []
with open('ch1.txt', 'r+') as file:
    for line in file:
        if ':' not in line:
            line = line.split()
            ch1.append([line[0], line[1]])

ch2 = []
with open('ch2.txt', 'r+') as file:
    for line in file:
        if ':' not in line:
            line = line.split()
            ch2.append([line[0], line[1]])
I then iterate through both of the lists looking for a match. When a match is found, I wish to add the sum of the right-hand columns to a new list, coin:
coin = []
for item1 in ch1:
    for item2 in ch2:
        if item1[0] == item2[0]:
            coin.append(int(item1[1]) + int(item2[1]))
The issue is that this is taking a very long time and/or crashing. Is there a more efficient way of doing this?
Answer
There are lots of ways to improve this; for example:

- Since you only scan through the contents of ch1.txt once, you don't need to read it into a list, which should take up less memory, but probably won't speed things up all that much.
- If you sort each of your lists, you can check for matches much more efficiently, in a single pass over both. Something like:
i1, i2 = 0, 0
while i1 < len(ch1) and i2 < len(ch2):
    if ch1[i1][0] == ch2[i2][0]:
        # Do what you do for matches
        ...
        # Advance both indices
        i1 += 1
        i2 += 1
    elif ch1[i1][0] < ch2[i2][0]:
        # Advance index of the smaller value
        i1 += 1
    else:  # ch1[i1][0] > ch2[i2][0]
        i2 += 1
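Note that this loop assumes both lists are already sorted by their first column, using the same ordering as the < comparison. Since the columns were read in as strings, one way to keep the sort and the comparisons consistent (a sketch, assuming the lists were built as in the question) is to convert both columns to integers and sort each list first:

# Convert both columns to int so that the sort order and the < / >
# comparisons in the merge loop agree, then sort each list by its
# first column.
ch1 = sorted([int(a), int(b)] for a, b in ch1)
ch2 = sorted([int(a), int(b)] for a, b in ch2)

With integer keys, the match branch can then simply do coin.append(ch1[i1][1] + ch2[i2][1]).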
If the data in the files are already sorted, you can combine both ideas: instead of advancing an index, you simply read the next line of the corresponding file. This should improve efficiency in both time and space.
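For illustration, here is a minimal sketch of that combined approach. It assumes both files are already sorted numerically by their first column, and it reuses the question's filenames and the ':' check that skips non-data lines:

def rows(path):
    # Yield (key, value) integer pairs, skipping any lines containing ':'.
    with open(path) as f:
        for line in f:
            if ':' not in line:
                a, b = line.split()
                yield int(a), int(b)

coin = []
it1, it2 = rows('ch1.txt'), rows('ch2.txt')
r1, r2 = next(it1, None), next(it2, None)
while r1 is not None and r2 is not None:
    if r1[0] == r2[0]:
        coin.append(r1[1] + r2[1])  # matching keys: sum the second columns
        r1, r2 = next(it1, None), next(it2, None)
    elif r1[0] < r2[0]:
        r1 = next(it1, None)        # advance the file with the smaller key
    else:
        r2 = next(it2, None)

Since neither file is held in memory, this needs only a single pass over each file and constant memory beyond the coin list.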