
Finding identical numbers in large files in Python

I have two data files that I am processing in Python, each containing two columns of numbers.


There are about 10M entries in each file (~400Mb).

I have to go through each file and check whether any number in the first column of one file matches any number in the first column of the other file.

The code I currently have converts the files (call them ch1.txt and ch2.txt) to lists, roughly like this:

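# read each file into a list of (number, value) pairs;
# the two columns are assumed to be whitespace-separated integers
list1 = []
with open('ch1.txt') as f1:
    for line in f1:
        a, b = line.split()
        list1.append((int(a), int(b)))

list2 = []
with open('ch2.txt') as f2:
    for line in f2:
        a, b = line.split()
        list2.append((int(a), int(b)))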

I then iterate through both of the lists looking for a match. When a match is found, I wish to add the sum of the right-hand columns to a new list ‘coin’:

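# naive approach: compare every entry of list1 against every entry of list2
coin = []
for n1, v1 in list1:
    for n2, v2 in list2:
        if n1 == n2:
            coin.append(v1 + v2)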

The issue is that this takes a very long time and/or crashes. Is there a more efficient way of doing this?


Answer

There are lots of ways to improve this; for example:

  • Since you only scan through the contents of ch1.txt once, you don’t need to read it into a list; that should take up less memory, but it probably won’t speed things up all that much.

  • If you sort each of your lists, you can check for matches much more efficiently by walking through both lists together instead of comparing every pair. Something like:

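# sort both lists by their first column, then walk through them in step
# (assumes each number appears at most once per file)
list1.sort()
list2.sort()

coin = []
i = j = 0
while i < len(list1) and j < len(list2):
    n1, v1 = list1[i]
    n2, v2 = list2[j]
    if n1 < n2:
        i += 1                     # list1 is behind, advance it
    elif n1 > n2:
        j += 1                     # list2 is behind, advance it
    else:
        coin.append(v1 + v2)       # matching numbers: record the sum
        i += 1
        j += 1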

If the data in the files are already sorted, you can combine both ideas: instead of advancing an index, you simply read in the next line of the corresponding file. This should improve efficiency in time and space.
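For example (a rough sketch; ch2.txt again stands in for the second file, and the columns are assumed to be whitespace-separated):

coin = []
with open('ch1.txt') as f1, open('ch2.txt') as f2:
    line1, line2 = f1.readline(), f2.readline()
    while line1 and line2:                 # stop at the end of either file
        n1, v1 = map(int, line1.split())
        n2, v2 = map(int, line2.split())
        if n1 < n2:
            line1 = f1.readline()          # first file is behind, read its next line
        elif n1 > n2:
            line2 = f2.readline()          # second file is behind, read its next line
        else:
            coin.append(v1 + v2)           # matching numbers: record the sum
            line1 = f1.readline()
            line2 = f2.readline()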
