I’ve got a large .csv file (5GB) from the UK Land Registry. I need to find all real estate that has been bought/sold two or more times.
Each row of the table looks like this:
{F887F88E-7D15-4415-804E-52EAC2F10958},"70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A"
I’ve never used pandas or any data science library. So far I’ve come up with this plan:
- Load the .csv file and add headers/column names
- Drop the unnecessary columns
- Create a hashmap of the edited df and find the duplicates
- Export the duplicates to a new .csv file
From my research I found that pandas is bad with very big files, so I used Dask:
import dask.dataframe as dd

df = dd.read_csv('pp-complete.csv', header=None, dtype={7: 'object', 8: 'object'}).astype(str)
df.columns = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress',
              'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
df.head()
- After that I tried to delete the unnecessary columns:
df.drop('ID', axis=1).head()
I also tried:
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
for index in indexes_to_remove:
    df.drop(df.index[index], axis=1)
Nothing worked.
The task is to show every property that has been bought/sold two or more times. I decided to use only the address columns, because the data in every other column isn’t consistent across sales (the ID is a unique transaction code, and the Date, type of offer, etc. differ between transactions).
I need to do this task with minimal memory and CPU usage, which is why I went with a hashmap.
I don’t know if there’s another method that would be easier or more efficient.
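In case it helps, this is roughly what I mean by the hashmap step, in plain Python (just an untested sketch, not code I’ve actually run; the column positions are the ones left over after indexes_to_remove above):

import csv
from collections import defaultdict

# Count how many transactions share the same address key, using a dict as the hashmap.
# Columns 8-13 (Sadress, Str, Locality, Town, District, County) are the columns kept
# after dropping indexes_to_remove above.
counts = defaultdict(int)
with open('pp-complete.csv', newline='', encoding='utf-8') as f:
    for row in csv.reader(f):
        counts[tuple(row[8:14])] += 1

repeated = [addr for addr, n in counts.items() if n >= 2]
print(len(repeated), 'addresses appear two or more times')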
Answer
Some minor suggestions:
- If 5GB is the full dataset, it’s best to use plain pandas. The strategy you outlined might involve communication across partitions, so it’s going to be computationally more expensive (or will require some work to make it more efficient). With pandas all the data will be in memory, so the sorting/duplication check will be fast (see the sketch at the end of these suggestions).
- In the code, make sure to assign the modified dataframe. Typically the modification is assigned back to replace the existing dataframe:
# without "df = " part, the modification is not stored df = df.drop(columns=['ID'])
- If memory is a big constraint, then consider loading only the data you need (as opposed to loading everything and then dropping specific columns). For this we will need to provide the list of columns to the usecols kwarg of pd.read_csv. Here’s the rough idea:
import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress',
                'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
indexes_to_remove = [0, 1, 2, 3, 4, 5, 6, 7, 14, 15, 16]
indexes_to_keep = [i for i in range(len(column_names)) if i not in indexes_to_remove]
column_names_to_keep = [n for i, n in enumerate(column_names) if i in indexes_to_keep]
# header=None because the file has no header row; names labels the selected columns
df = pd.read_csv('some_file.csv', header=None, names=column_names_to_keep, usecols=indexes_to_keep)
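For the duplicate check itself, it could look roughly like this (a sketch, not tested against the real file; 'repeat_sales.csv' is just a placeholder output name, and the address columns are the ones you chose to keep above):

import pandas as pd

column_names = ['ID', 'Price', 'Date', 'ZIP', 'PropType', 'Old/new', 'Duration', 'Padress', 'Sadress',
                'Str', 'Locality', 'Town', 'District', 'County', 'PPDType', 'Rec_Stat']
address_positions = [8, 9, 10, 11, 12, 13]  # Sadress, Str, Locality, Town, District, County
address_names = [column_names[i] for i in address_positions]

# Load only the address columns as strings
df = pd.read_csv('pp-complete.csv', header=None, names=address_names,
                 usecols=address_positions, dtype=str)

# Keep every row whose address appears two or more times
duplicates = df[df.duplicated(subset=address_names, keep=False)]
duplicates.to_csv('repeat_sales.csv', index=False)

keep=False marks every occurrence of a repeated address rather than only the later ones, which matches “bought/sold two or more times”.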