
Operating on a large .csv file with pandas/dask in Python

I've got a large .csv file (5 GB) from the UK Land Registry. I need to find every property that has been bought/sold two or more times.

Each row of the table looks like this:

{F887F88E-7D15-4415-804E-52EAC2F10958},"70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A"

I’ve never used pandas or any data science library. So far I’ve come up with this plan:

  1. Load the .csv file and add headers and column names

  2. Drop unnecessary columns

  3. Create a hashmap of the edited dataframe and find duplicates

  4. Export duplicates to a new .csv file

  5. From my research I found that pandas is bad with very big files, so I used dask:

import dask.dataframe as dd

df = dd.read_csv('pp-complete.csv', header=None, dtype={7: 'object', 8: 'object'}).astype(str)
df.columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
df.head()

After that, I tried to delete the unnecessary columns:
df.drop('ID', axis=1).head()

I also tried:

indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
for index in indexes_to_remove:
    df.drop(df.index[index], axis=1)

Nothing worked.

The task is to show every property that has been bought/sold two or more times. I decided to use only the address columns, because the data in the other columns isn't consistent (ID is a unique transaction code, Date, type of offer, etc.).

I need to do this task with minimal memory and CPU usage, which is why I went with a hashmap.

I don't know if there's another method that would be easier or more efficient.
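
Roughly, the hashmap approach I have in mind would look like this in plain Python (column positions 8–13 are the address fields from the schema above; this is only a sketch, not code I have run):

import csv
from collections import Counter

# count how many times each address combination appears
counts = Counter()
with open('pp-complete.csv', newline='') as f:
    for row in csv.reader(f):
        counts[tuple(row[8:14])] += 1

# addresses that were bought/sold two or more times
repeated = [addr for addr, n in counts.items() if n >= 2]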


Answer

Some minor suggestions:

  • if 5GB is the full dataset, it's best to use plain pandas. The strategy you outlined might involve communication across partitions, so it's going to be computationally more expensive (or will require some work to make it more efficient). With pandas all the data will be in memory, so a sorting/duplicate check will be fast.
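
As a toy illustration of that duplicate check (made-up rows and example column names, not the actual Land Registry data):

import pandas as pd

df = pd.DataFrame({'Str': ['ALDRICH DRIVE', 'ALDRICH DRIVE', 'HIGH STREET'],
                   'Town': ['WILLEN', 'WILLEN', 'LEEDS']})

# keep=False marks every occurrence of a repeated address, not just the extras
dupes = df[df.duplicated(subset=['Str', 'Town'], keep=False)]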

  • In the code, make sure to assign the modified dataframe: drop returns a new dataframe rather than modifying the existing one in place. Typically the result is assigned back to replace the existing dataframe:

# without "df = " part, the modification is not stored
df = df.drop(columns=['ID'])

  • If memory is a big constraint, then consider loading only the data you need (as opposed to loading everything and then dropping specific columns). For this, pass the list of columns to keep (names or positions) to the usecols kwarg of pd.read_csv. Here's the rough idea:
import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress', 'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15]
indexes_to_keep = [i for i in range(len(column_names)) if i not in indexes_to_remove]
column_names_to_keep = [n for i, n in enumerate(column_names) if i in indexes_to_keep]

# header=None because the file has no header row; names supplies the full schema,
# and usecols tells pandas to load only the address columns we keep
df = pd.read_csv('some_file.csv', header=None, names=column_names,
                 usecols=column_names_to_keep)
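
From there, the remaining steps of your plan (find the properties sold two or more times and export them) could be sketched like this, continuing from the df loaded above; the groupby/size approach and the output filename are just one possible choice:

# count transactions per address combination
sale_counts = df.groupby(column_names_to_keep, dropna=False).size().reset_index(name='n_sales')

# keep addresses that appear two or more times and write them to a new .csv
repeats = sale_counts[sale_counts['n_sales'] >= 2]
repeats.to_csv('repeat_sales.csv', index=False)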
