I’m loading a large parquet dataframe with Dask, but I can’t seem to do anything with it without the system crashing on me, or getting a million errors and no output.
The data is about 165 MB compressed, or 13 GB once loaded into pandas (it fits comfortably in the 45 GB of RAM available).
    import pandas as pd

    df = pd.read_parquet('data_simulated.parquet')
    df.memory_usage(deep=True).sum() * 1e-9  # returns 13.09
    df.head()  # prints the head of the dataframe properly
Instead, if I use Dask:
    from dask.distributed import Client
    import dask.dataframe as dataframe

    client = Client()
    # prints: <Client: 'tcp://127.0.0.1:38576' processes=7 threads=28, memory=48.32 GB>
    df = dataframe.read_parquet('data_simulated.parquet')
    df.memory_usage(deep=True).sum().compute() * 1e-9
this prints
    distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
    distributed.nanny - WARNING - Restarting worker
    distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
    distributed.nanny - WARNING - Restarting worker
    distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
    distributed.nanny - WARNING - Restarting worker
    distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
    distributed.nanny - WARNING - Restarting worker
    distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
    distributed.nanny - WARNING - Restarting worker
    distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
    distributed.nanny - WARNING - Restarting worker
    distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
    distributed.nanny - WARNING - Restarting worker
    [a large traceback]
    KilledWorker: ("('series-groupby-sum-chunk-memory_usage-dc8dab46de985e36d76a85bf3abaccbf', 0)", <Worker 'tcp://127.0.0.1:36882', name: 2, memory: 0, processing: 1>)
The same happens if I try df.head(), df.set_index(…), or any other operation that actually computes anything on the dataframe. I’ve tried reducing the number of workers, so that each has more memory. I’ve also tried repartitioning the dataframe, but it fails with the same error. If I set memory_limit on the client’s LocalCluster to zero, the system just crashes outright.
What am I doing wrong?
Edit: Here’s some extra info on the data (obtained by loading it with pandas):
    In [2]: print(df.dtypes)
    market_id          uint32
    choice_id          uint64
    attribute_1          bool
    attribute_2          bool
    attribute_3          bool
    income            float32
    is_urban             bool
    distance          float32
    weight            float32
    quarter            uint32
    product_id          int64
    price             float64
    size              float32
    share             float32
    market_quarter      int64
    product_type       object
    outside_option      int64
    dtype: object

    In [3]: print(df.shape)
    (89429613, 17)
The product_type column (object dtype) contains strings.
Answer
Dask works by loading and processing your data chunk-wise. In the case of parquet, that chunking comes from the data files themselves: internally, parquet is organised into “row-groups”, sets of rows that are meant to be read together.
It sounds like, in this case, the entire dataset consists of a single row-group in a single file. This means Dask has no opportunity to split the data into chunks: you get one task, which puts the full memory pressure (probably the total data size plus some temporary values) on one worker, and that worker has only been allocated a fraction of the total system memory. Hence the errors.
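If you want to verify this, and assuming the file can be opened with pyarrow, a quick sketch to inspect the row-group layout:

    import pyarrow.parquet as pq

    # How many row-groups does the file contain? A single row-group means
    # Dask gets one partition and one very large task.
    pf = pq.ParquetFile('data_simulated.parquet')
    print(pf.num_row_groups)
    print(pf.metadata.row_group(0).num_rows)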
Note that you can turn off memory monitoring to prevent workers from getting killed, either in the configuration or directly with keywords like memory_limit=0. In this case, you know that only one worker will be doing the load.
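A minimal sketch of that setup (assuming a LocalCluster on the same machine; n_workers=1 keeps all of the memory with the one worker that will do the load):

    from dask.distributed import Client
    import dask.dataframe as dataframe

    # memory_limit=0 disables the nanny's memory monitoring, so the worker
    # is not restarted when it crosses the 95% threshold.
    client = Client(n_workers=1, memory_limit=0)
    df = dataframe.read_parquet('data_simulated.parquet')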
In some very specific situations (no nested/list/map types), it would be possible to split row-groups, but the code for this does not exist, and it would be inefficient because of the compression and encoding of the data.
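Since the data does fit in your RAM, one workaround is a one-off pass through pandas to rewrite the file with many smaller row-groups, so that Dask has something to chunk on. A sketch, assuming the pyarrow engine; row_group_size is a pyarrow write option and the value is only illustrative:

    import pandas as pd

    df = pd.read_parquet('data_simulated.parquet')
    # Write roughly 5 million rows per row-group; Dask can then read the
    # file as multiple partitions instead of one giant task.
    df.to_parquet('data_simulated_chunked.parquet', row_group_size=5_000_000)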