
Dealing with huge pandas data frames

I have a huge database (of 500GB or so) and was able to load it into pandas. The database contains something like 39705210 observations. As you can imagine, Python struggles even to open it. Now I am trying to use Dask to export it to csv in 20 partitions, like this:

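Roughly, the export step is the following (df is the pandas DataFrame already in memory, and the output name is just an example):

```python
import dask.dataframe as dd

# wrap the in-memory pandas DataFrame in a Dask dataframe with 20 lazy partitions
ddf = dd.from_pandas(df, npartitions=20)

# write one csv file per partition: export-0.csv, export-1.csv, ..., export-19.csv
ddf.to_csv("export-*.csv", index=False)
```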

However, when I try to drop some of the rows, e.g. by doing:

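The dropping attempt is along these lines (the column name and condition are placeholders; the relevant part is the pandas-style inplace=True):

```python
# pandas-style row removal passed to the Dask dataframe with inplace=True
ddf.drop(ddf[ddf["col"] < 0].index, inplace=True)
```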

the kernel suddenly stops. So my questions are:

  1. is there a way to drop the desired rows using Dask (or another approach that prevents the kernel from crashing)?
  2. do you know a way to lighten the dataset or deal with it in Python (e.g. doing some basic descriptive statistics in parallel) other than dropping observations?
  3. do you know a way to export the pandas DataFrame to csv in parallel without saving the n partitions separately (as Dask does)?

Thank you


Answer

Dask dataframes do not support the inplace kwarg, since each partition and subsequent operations are delayed/lazy. However, just like in Pandas, it’s possible to assign the result to the same dataframe:

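For example, with a placeholder column name and value, and assuming the Dask dataframe is called ddf:

```python
# build a boolean mask and rebind the name to the filtered dataframe,
# instead of passing inplace=True
mask = ddf["col"] != 0
ddf = ddf[mask]
```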

If there are multiple conditions, mask can be redefined, for example to test two values:

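For instance, to keep only the rows whose (placeholder) column differs from two given values:

```python
# element-wise conditions are combined with & / | and parentheses, as in pandas
mask = (ddf["col"] != 0) & (ddf["col"] != 1)
ddf = ddf[mask]
```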

Dask will keep track of the operations, but, crucially, will not launch computations until they are requested/needed. For example, the syntax to save a csv file is very much pandas-like:

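A sketch, with a placeholder output pattern:

```python
# this line triggers the actual computation and writes one csv per partition:
# filtered-0.csv, filtered-1.csv, ...
ddf.to_csv("filtered-*.csv", index=False)

# to_csv also accepts single_file=True to produce one csv instead of n files
# (question 3), at the cost of the parallel per-partition writes:
# ddf.to_csv("filtered.csv", single_file=True, index=False)
```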