I have a huge database (500 GB or so) and was able to put it in pandas. The database contains something like 39,705,210 observations. As you can imagine, Python has a hard time even opening it. Now I am trying to use Dask in order to export it to CSV in 20 partitions like this: However, when I am trying to …
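A minimal sketch of the export step, assuming the table is already sitting in a pandas DataFrame called df (the variable name and the export-*.csv output pattern are placeholders, not the asker's actual code):

```python
import dask.dataframe as dd

# Wrap the existing pandas DataFrame (assumed to be named df) and
# split it into 20 partitions.
ddf = dd.from_pandas(df, npartitions=20)

# The "*" in the path is replaced by the partition number, so this
# writes one CSV file per partition (export-0.csv ... export-19.csv).
ddf.to_csv("export-*.csv", index=False)
```

If loading the data into pandas is itself the struggle, it is usually better to let Dask read the source directly (for example with dd.read_csv or dd.read_sql_table) so the full 500 GB never has to fit in memory at once.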
Dask “Column assignment doesn’t support type numpy.ndarray”
I’m trying to use Dask instead of pandas since the data size I’m analyzing is quite large. I wanted to add a flag column based on several conditions, but then I got the following error message. The above code works perfectly when using np.where with a pandas dataframe, but it didn’t work with dask.array.where. Answer: If numpy works and the operation is …
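The error appears because np.where returns a plain numpy array, and a Dask dataframe cannot take a numpy array as a new column. A minimal sketch of two ways around it, assuming a column named amount and a threshold of 100 (both placeholders for the real conditions):

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

ddf = dd.read_csv("data.csv")  # placeholder input

# Option 1: stay in Dask expressions, so the result is a lazy Dask series
# that can be assigned as a column.
ddf["flag"] = (ddf["amount"] > 100).astype(int)

# Option 2: keep np.where, but apply it partition by partition so each
# call sees an ordinary pandas Series and returns one.
ddf["flag2"] = ddf["amount"].map_partitions(
    lambda s: pd.Series(np.where(s > 100, 1, 0), index=s.index),
    meta=("flag2", "int64"),
)
```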
Operating on a large .csv file with pandas/Dask in Python
I’ve got a large .csv file (5 GB) from the UK Land Registry. I need to find all real estate that has been bought or sold two or more times. Each row of the table looks like this: I’ve never used pandas or any data science library. So far I’ve come up with this plan: load the .csv file and add headers and column …
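A minimal sketch of the repeat-sale lookup, assuming headers have already been added as the plan describes and that a property is identified by its address fields (the file path and column names are placeholders):

```python
import dask.dataframe as dd

# Read lazily in ~64 MB chunks instead of loading the 5 GB file at once.
ddf = dd.read_csv("pp-complete.csv", dtype=str, blocksize="64MB")

# Assumed property identifier: the address columns.
key = ["postcode", "paon", "saon", "street"]

# Count transactions per property, then keep those sold two or more times.
counts = ddf.groupby(key).size()
repeat_sales = counts[counts >= 2].compute()  # pandas Series indexed by address
```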
Dask dataframe crashes
I’m loading a large parquet dataframe using Dask but can’t seem to do anything with it without the system crashing on me, or getting a million errors and no output. The data weighs about 165 MB compressed, or 13 GB once loaded in pandas (it fits well within the 45 GB of RAM available). Instead, when using Dask, it prints the same …
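Crashes like this are often a partitioning problem: if the parquet dataset loads as one or a few huge partitions, a single task can blow past the available memory even though the whole frame would fit. A minimal sketch of checking and fixing that before computing anything (the path and column name are placeholders):

```python
import dask.dataframe as dd

# Read lazily and inspect the partitioning first.
ddf = dd.read_parquet("data/", engine="pyarrow")
print(ddf.npartitions)

# Split into bounded-size pieces so no single task holds many gigabytes.
ddf = ddf.repartition(partition_size="256MB")

# Then ask for a small aggregate rather than materialising everything.
print(ddf["some_column"].mean().compute())
```

Given that the data fits comfortably in 45 GB of RAM, plain pandas.read_parquet is also a reasonable fallback here.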
Using set_index() on a Dask Dataframe and writing to parquet causes memory explosion
I have a large set of Parquet files that I am trying to sort on a column. Uncompressed, the data is around 14 GB, so Dask seemed like the right tool for the job. All I’m doing with Dask is: (1) reading the Parquet files, (2) sorting on one of the columns (called “friend”), and (3) writing Parquet files to a separate directory. I …
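A minimal sketch of that read / sort / write pipeline, repartitioning first so the shuffle behind set_index works on bounded-size pieces (the directory paths are placeholders; “friend” is the column named in the question):

```python
import dask.dataframe as dd

ddf = dd.read_parquet("input/", engine="pyarrow")

# Smaller input partitions keep the per-task memory of the shuffle bounded.
ddf = ddf.repartition(partition_size="128MB")

# set_index triggers a full shuffle and sorts the frame by "friend".
ddf = ddf.set_index("friend")

# Write the sorted result to a separate directory.
ddf.to_parquet("sorted/", engine="pyarrow")
```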