I have a huge database (of 500GB or so) and was able to load it into pandas. The database contains something like 39705210 observations. As you can imagine, Python has a hard time even opening it. Now, I am trying to use Dask in order to export it to csv in 20 partitions like this:
import dask.dataframe as dd

dask_merge_bodytextknown5 = dd.from_pandas(merge_bodytextknown5, npartitions=20)  # Dask DataFrame has 20 partitions
dask_merge_bodytextknown5.to_csv('df_complete_emakg_*.csv')
# merge_bodytextknown5.to_csv('df_complete.zip', compression={'method': 'zip', 'archive_name': 'df_complete_emakg.csv'})
However, when I try to drop some of the rows, e.g. by doing:
merge_bodytextknown5.drop(merge_bodytextknown5.index[merge_bodytextknown5['confscore'] == 3], inplace = True)
the kernel suddenly stops. So my questions are:
- is there a way to drop the desired rows using Dask (or another way that prevents the kernel from crashing)?
- do you know a way to lighten the dataset or deal with it in Python (e.g. computing some basic descriptive statistics in parallel) other than dropping observations?
- do you know a way to export the pandas dataframe as a csv in parallel without saving the n partitions separately (as done by Dask)?
Thank you
Answer
Dask dataframes do not support the inplace kwarg, since each partition and subsequent operations are delayed/lazy. However, just like in Pandas, it’s possible to assign the result to the same dataframe:
df = merge_bodytextknown5  # this line is for easier readability
mask = df['confscore'] != 3  # note the inversion of the requirement
df = df[mask]
If there are multiple conditions, mask can be redefined, for example to test two values:
mask = ~df['confscore'].isin([3,4])
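If the conditions involve several columns, boolean masks can be combined with the element-wise operators & and |, just as in pandas. A minimal sketch; the second column name (year) is hypothetical and used only for illustration:

# combine conditions across columns with element-wise boolean operators
# ('year' is a hypothetical column name, not from the original dataset)
mask = (df['confscore'] != 3) & (df['year'] >= 2000)
df = df[mask]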
Dask will keep track of the operations but, crucially, will not launch computations until they are requested/needed. For example, the syntax to save a csv file is very much pandas-like:
df.to_csv('test.csv', index=False, single_file=True)  # this saves to one file
df.to_csv('test_*.csv', index=False)  # this saves one file per dask dataframe partition
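The same lazy behaviour covers the descriptive-statistics question: aggregations are only executed, in parallel across the partitions, once .compute() is called. A minimal sketch, assuming the filtered df from above:

# reductions are evaluated in parallel across partitions when .compute() is called
print(df['confscore'].value_counts().compute())  # counts per value of confscore
print(df.describe().compute())                   # basic descriptive statistics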