I’ve got a large .csv file (5GB) from the UK Land Registry. I need to find all real estate that has been bought/sold two or more times.
Each row of the table looks like this:
{F887F88E-7D15-4415-804E-52EAC2F10958},"70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A"
I’ve never used pandas or any data science library. So far I’ve come up with this plan:
- Load the .csv file and add headers/column names
- Drop the unnecessary columns
- Create a hashmap of the edited df and find the duplicates
- Export the duplicates to a new .csv file
From my research I found that pandas is bad with very big files, so I used Dask:
import dask.dataframe as dd

df = dd.read_csv('pp-complete.csv', header=None, dtype={7: 'object', 8: 'object'}).astype(str)
df.columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress',
              'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
df.head()
- After that I tried to delete the unnecessary columns:
df.drop('ID', axis=1).head()
I also tried:
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
for index in indexes_to_remove:
    df.drop(df.index[index], axis=1)
Nothing worked.
The task is to show every property that has been bought/sold two or more times. I decided to use only the address columns, because the data in every other column isn’t consistent across sales (the ID is a unique transaction code, and the Date, type of offer, etc. differ between transactions).
I need to do this task with minimal memory and CPU usage, which is why I went with a hashmap.
I don’t know if there’s another method that would be easier or more efficient.
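In case it helps, this is roughly what I mean by the hashmap step, in plain Python (just an untested sketch, not code I’ve actually run; the column positions are the ones left over after indexes_to_remove above):

import csv
from collections import defaultdict

# Count how many transactions share the same address key, using a dict as the hashmap.
# Columns 8-13 (Sadress, Str, Locality, Town, District, County) are the columns kept
# after dropping indexes_to_remove above.
counts = defaultdict(int)
with open('pp-complete.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        counts[tuple(row[8:14])] += 1

repeated = [addr for addr, n in counts.items() if n >= 2]
print(len(repeated), 'addresses appear two or more times')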
Answer
Some minor suggestions:
- If 5GB is the full dataset, it’s best to use plain pandas. The strategy you outlined might involve communication across partitions, so it’s going to be computationally more expensive (or will require some work to make it more efficient). With pandas all the data will be in memory, so the sorting/duplication check will be fast (see the sketch at the end of these suggestions).
- In the code, make sure to assign the modified dataframe. Typically the modification is assigned back to replace the existing dataframe:
# without "df = " part, the modification is not stored df = df.drop(columns=['ID'])
- If memory is a big constraint, then consider loading only the data you need (as opposed to loading everything and then dropping specific columns). For this we will need to provide the list of columns to the usecols kwarg of pd.read_csv. Here’s the rough idea:
import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress',
                'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
indexes_to_keep = [i for i in range(len(column_names)) if i not in indexes_to_remove]
column_names_to_keep = [n for i, n in enumerate(column_names) if i in indexes_to_keep]
# header=None because the file has no header row; names labels the selected columns
df = pd.read_csv('some_file.csv', header=None, names=column_names_to_keep, usecols=indexes_to_keep)
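For the duplicate check itself, it could look roughly like this (a sketch, not tested against the real file; 'repeat_sales.csv' is just a placeholder output name, and the address columns are the ones you chose to keep above):

import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress',
                'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
address_positions = [8, 9, 10, 11, 12, 13]  # Sadress, Str, Locality, Town, District, County
address_names = [column_names[i] for i in address_positions]

# Load only the address columns as strings
df = pd.read_csv('pp-complete.csv', header=None, names=address_names,
                 usecols=address_positions, dtype=str)

# Keep every row whose address appears two or more times
duplicates = df[df.duplicated(subset=address_names, keep=False)]
duplicates.to_csv('repeat_sales.csv', index=False)

keep=False marks every occurrence of a repeated address rather than only the later ones, which matches “bought/sold two or more times”.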