I have a huge database (500 GB or so) and was able to put it in pandas. The database contains something like 39,705,210 observations. As you can imagine, Python has a hard time even opening it. Now I am trying to use Dask in order to export it to CSV in 20 partitions like this: However, when I am trying to …
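A minimal sketch of the export step, assuming the table is already sitting in a pandas DataFrame called df (the variable name and the export-*.csv output pattern are placeholders, not the asker's actual code):

```python
import dask.dataframe as dd

# Wrap the existing pandas DataFrame (assumed to be named df) and
# split it into 20 partitions.
ddf = dd.from_pandas(df, npartitions=20)

# The "*" in the path is replaced by the partition number, so this
# writes one CSV file per partition (export-0.csv ... export-19.csv).
ddf.to_csv("export-*.csv", index=False)
```

If loading the data into pandas is itself the struggle, it is usually better to let Dask read the source directly (for example with dd.read_csv or dd.read_sql_table) so the full 500 GB never has to fit in memory at once.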
Dask “Column assignment doesn’t support type numpy.ndarray”
I’m trying to use Dask instead of pandas since the data size I’m analyzing is quite large. I wanted to add a flag column based on several conditions, but then I got the following error message. The above code works perfectly when using np.where with a pandas dataframe, but it didn’t work with dask.array.where. Answer: If numpy works and the operation is …
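The error appears because np.where returns a plain numpy array, and a Dask dataframe cannot take a numpy array as a new column. A minimal sketch of two ways around it, assuming a column named amount and a threshold of 100 (both placeholders for the real conditions):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

ddf = dd.read_csv("data.csv")  # placeholder input

# Option 1: stay in Dask expressions, so the result is a lazy Dask series
# that can be assigned as a column.
ddf["flag"] = (ddf["amount"] > 100).astype(int)

# Option 2: keep np.where, but apply it partition by partition so each
# call sees an ordinary pandas Series and returns one.
ddf["flag2"] = ddf["amount"].map_partitions(
    lambda s: pd.Series(np.where(s > 100, 1, 0), index=s.index),
    meta=("flag2", "int64"),
)
```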
Operating on a large .csv file with pandas/Dask in Python
I’ve got a large .csv file (5 GB) from the UK Land Registry. I need to find all real estate that has been bought or sold two or more times. Each row of the table looks like this: I’ve never used pandas or any data science library. So far I’ve come up with this plan: load the .csv file and add headers and column …
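A minimal sketch of the repeat-sale lookup, assuming headers have already been added as the plan describes and that a property is identified by its address fields (the file path and column names are placeholders):

```python
import dask.dataframe as dd

# Read lazily in ~64 MB chunks instead of loading the 5 GB file at once.
ddf = dd.read_csv("pp-complete.csv", dtype=str, blocksize="64MB")

# Assumed property identifier: the address columns.
key = ["postcode", "paon", "saon", "street"]

# Count transactions per property, then keep those sold two or more times.
counts = ddf.groupby(key).size()
repeat_sales = counts[counts >= 2].compute()  # pandas Series indexed by address
```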
Dask dataframe crashes
I’m loading a large parquet dataframe using Dask but can’t seem to do anything with it without the system crashing on me, or getting a million errors and no output. The data weighs about 165 MB compressed, or 13 GB once loaded in pandas (it fits well within the 45 GB of RAM available). Instead, when using Dask, it prints the same …
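Crashes like this are often a partitioning problem: if the parquet dataset loads as one or a few huge partitions, a single task can blow past the available memory even though the whole frame would fit. A minimal sketch of checking and fixing that before computing anything (the path and column name are placeholders):

```python
import dask.dataframe as dd

# Read lazily and inspect the partitioning first.
ddf = dd.read_parquet("data/", engine="pyarrow")
print(ddf.npartitions)

# Split into bounded-size pieces so no single task holds many gigabytes.
ddf = ddf.repartition(partition_size="256MB")

# Then ask for a small aggregate rather than materialising everything.
print(ddf["some_column"].mean().compute())
```

Given that the data fits comfortably in 45 GB of RAM, plain pandas.read_parquet is also a reasonable fallback here.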
Using set_index() on a Dask Dataframe and writing to parquet causes memory explosion
I have a large set of Parquet files that I am trying to sort on a column. Uncompressed, the data is around 14 GB, so Dask seemed like the right tool for the job. All I’m doing with Dask is: (1) reading the Parquet files, (2) sorting on one of the columns (called “friend”), and (3) writing Parquet files to a separate directory. I …
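A minimal sketch of that read / sort / write pipeline, repartitioning first so the shuffle behind set_index works on bounded-size pieces (the directory paths are placeholders; “friend” is the column named in the question):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("input/", engine="pyarrow")

# Smaller input partitions keep the per-task memory of the shuffle bounded.
ddf = ddf.repartition(partition_size="128MB")

# set_index triggers a full shuffle and sorts the frame by "friend".
ddf = ddf.set_index("friend")

# Write the sorted result to a separate directory.
ddf.to_parquet("sorted/", engine="pyarrow")
```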