
Dealing with huge pandas data frames

I have a huge database (of 500GB or so) and was able to load it into pandas. The database contains something like 39705210 observations. As you can imagine, Python struggles even to open it. Now I am trying to use Dask to export it to csv in 20 partitions, like this:

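Roughly, the export step is the following (df is the pandas DataFrame already in memory, and the output name is just an example):

```python
import dask.dataframe as dd

# wrap the in-memory pandas DataFrame in a Dask dataframe with 20 lazy partitions
ddf = dd.from_pandas(df, npartitions=20)

# write one csv file per partition: export-0.csv, export-1.csv, ..., export-19.csv
ddf.to_csv("export-*.csv", index=False)
```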

However, when I try to drop some of the rows, e.g. by doing:

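The dropping attempt is along these lines (the column name and condition are placeholders; the relevant part is the pandas-style inplace=True):

```python
# pandas-style row removal passed to the Dask dataframe with inplace=True
ddf.drop(ddf[ddf["col"] < 0].index, inplace=True)
```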

the kernel suddenly stops. So my questions are:

  1. is there a way to drop the desired rows using Dask (or another approach that prevents the kernel from crashing)?
  2. do you know a way to lighten the dataset or deal with it in Python (e.g. doing some basic descriptive statistics in parallel) other than dropping observations?
  3. do you know a way to export the pandas DataFrame to csv in parallel without saving the n partitions separately (as Dask does)?

Thank you


Answer

Dask dataframes do not support the inplace kwarg, since each partition and subsequent operations are delayed/lazy. However, just like in Pandas, it’s possible to assign the result to the same dataframe:

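For example, with a placeholder column name and value, and assuming the Dask dataframe is called ddf:

```python
# build a boolean mask and rebind the name to the filtered dataframe,
# instead of passing inplace=True
mask = ddf["col"] != 0
ddf = ddf[mask]
```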

If there are multiple conditions, mask can be redefined, for example to test two values:

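For instance, to keep only the rows whose (placeholder) column differs from two given values:

```python
# element-wise conditions are combined with & / | and parentheses, as in pandas
mask = (ddf["col"] != 0) & (ddf["col"] != 1)
ddf = ddf[mask]
```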

Dask will keep track of the operations, but, crucially, will not launch computations until they are requested/needed. For example, the syntax to save a csv file is very much pandas-like:

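A sketch, with a placeholder output pattern:

```python
# this line triggers the actual computation and writes one csv per partition:
# filtered-0.csv, filtered-1.csv, ...
ddf.to_csv("filtered-*.csv", index=False)

# to_csv also accepts single_file=True to produce one csv instead of n files
# (question 3), at the cost of the parallel per-partition writes:
# ddf.to_csv("filtered.csv", single_file=True, index=False)
```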