
Operating on a large .csv file with pandas/dask in Python

I’ve got a large .csv file (5GB) from the UK Land Registry. I need to find all real estate that has been bought/sold two or more times.

Each row of the table looks like this:

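The values below are invented, but the column layout follows the published Price Paid Data CSV (no header row): transaction ID, price, date, postcode, property type, old/new, duration, PAON, SAON, street, locality, town, district, county, PPD category, record status.

```
"{1A2B3C4D-1234-5678-ABCD-1234567890AB}","250000","2017-05-12 00:00","SW1A 2AA","F","N","L","10","FLAT 2","DOWNING STREET","","LONDON","CITY OF WESTMINSTER","GREATER LONDON","A","A"
```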

I’ve never used pandas or any data science library. So far I’ve come up with this plan:

  1. Load the .csv file and add headers and column names

  2. Drop unnecessary columns

  3. Create hashmap of edited df and find duplicates

  4. Export duplicates to a new .csv file

  5. From my research I found that pandas struggles with very large files, so I used dask:

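In outline, the loading step was something like this (the file path and column names are placeholders; the names follow the published Price Paid column layout):

```python
import dask.dataframe as dd

# The Price Paid CSV has no header row, so supply column names explicitly.
# These names are placeholders based on the published column layout.
cols = [
    "transaction_id", "price", "date", "postcode", "property_type",
    "old_new", "duration", "paon", "saon", "street", "locality",
    "town", "district", "county", "ppd_category", "record_status",
]

df = dd.read_csv("pp-complete.csv", header=None, names=cols, dtype=str)
```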
After that, I tried to delete the unnecessary columns:
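Something along these lines (the column names are placeholders):

```python
# drop() returns a new dataframe; the result was not assigned to anything,
# so df was left unchanged
df.drop(columns=["transaction_id", "price", "property_type", "old_new"])
```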

I also tried:

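For example, a variant like this (again with placeholder column names):

```python
# same outcome: the returned dataframe was never assigned back
df.drop(["transaction_id", "price"], axis=1)
```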

Nothing worked.

The task is to show the properties that have been bought/sold two or more times. I decided to use only the address columns, because the data in the other columns isn’t consistent between transactions (ID is the unique code of a transaction, then there’s the date, the type of offer, etc.).

I need to do this task with minimal memory and CPU usage, which is why I went with a hashmap.
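By “hashmap” I mean something along these lines, sketched here with plain pandas for clarity (the address column names are placeholders):

```python
# duplicated() uses a hash table internally, which is essentially the
# hashmap idea: mark every row whose address combination appears more
# than once, i.e. properties sold two or more times
address_cols = ["paon", "saon", "street", "town", "postcode"]

dupes = df[df.duplicated(subset=address_cols, keep=False)]
dupes.to_csv("sold_more_than_once.csv", index=False)
```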

I don’t know whether there’s an easier or more efficient way to do this.


Answer

Some minor suggestions:

  • If 5GB is the full dataset, it’s best to use plain pandas. The strategy you outlined might involve communication across partitions, so it’s going to be computationally more expensive (or will require some work to make it more efficient). With pandas, all the data will be in memory, so the sorting/duplication check will be fast.

  • In the code, make sure to assign the modified dataframe; drop() returns a new dataframe rather than changing the existing one in place. Typically the result is assigned back to replace the existing dataframe:

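(Column names here are placeholders.)

```python
# assign the result back, otherwise the original df is left unchanged
df = df.drop(columns=["transaction_id", "price", "property_type"])
```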
  • If memory is a big constraint, then consider loading only the data you need (as opposed to loading everything and then dropping specific columns). For this, provide the list of columns to the usecols kwarg of pd.read_csv. Here’s the rough idea:
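(The column names below are illustrative, based on the published Price Paid column layout; adjust them to your file.)

```python
import pandas as pd

# the file has no header row, so name all columns up front...
all_cols = [
    "transaction_id", "price", "date", "postcode", "property_type",
    "old_new", "duration", "paon", "saon", "street", "locality",
    "town", "district", "county", "ppd_category", "record_status",
]

# ...but only load the address-related ones
address_cols = ["postcode", "paon", "saon", "street", "locality", "town"]

df = pd.read_csv(
    "pp-complete.csv",
    header=None,
    names=all_cols,
    usecols=address_cols,
)

# addresses that appear in two or more transactions
dupes = df[df.duplicated(keep=False)]
dupes.to_csv("sold_more_than_once.csv", index=False)
```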