
Tag: dask

High memory allocation when using

I am using dask for extending dask bag items by information from an external, previously computed object arg. Dask seems to allocate memory for arg for each partition at once in the beginning of the computation process. Is there a workaround to prevent Dask from duplicating the arg multiple times (and allocating a lot of memory)? Here is a simplified

Dealing with huge pandas data frames

I have a huge database (of 500GB or so) and was able to put it in pandas. The database contains something like 39705210 observations. As you can imagine, Python has a hard time even opening it. Now, I am trying to use Dask in order to export it to CSV in 20 partitions like this: However when I am trying to

Dask Df convert All Dtype using dictionary

Is there an easy equivalent way to convert all columns in a dask df (converted from a pandas df) using a dictionary? I have a dictionary as follows: and would like to convert the pandas|dask df dtypes all at once to the suggested dtypes in the dictionary. Answer: Not sure if I understand the question correctly, but the conversion of dtypes
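As a rough illustration of the dictionary-based conversion asked about above: the asker's dictionary is not shown in the excerpt, so the column names and the `dtype_map` below are hypothetical. The sketch uses pandas, whose `DataFrame.astype` accepts a column-to-dtype mapping; dask's `DataFrame.astype` accepts the same kind of mapping.

```python
import pandas as pd

# Hypothetical data: the original frame and dictionary are not shown in the excerpt.
df = pd.DataFrame({"a": ["1", "2"], "b": ["1.5", "2.5"], "c": [1, 0]})

# Hypothetical mapping of column name -> target dtype.
dtype_map = {"a": "int64", "b": "float64", "c": "bool"}

# astype with a dict converts every listed column in one call
# and leaves unlisted columns unchanged.
converted = df.astype(dtype_map)
```

Passing the whole mapping at once avoids looping over columns, and on a dask dataframe the same call stays lazy until the result is computed.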

1D netcdf to 2D lat lon using xarray and Dask

I have a large netcdf dataset which has two dimensions – ‘time’ and a single spatial dimension ‘x’. There is also a ‘lat’ and ‘lon’ coord for each ‘x’. This needs to be mapped onto a global half degree 2D grid, such that the dimensions are ‘time’, ‘lat’ and ‘lon’. Not all the points on the global half degree grid

Retrieving data from multiple parquet files into one dataframe (Python)

I want to start by saying this is the first time I work with Parquet files. I have a list of 2615 parquet files that I downloaded from an S3 bucket and I want to read them into one dataframe. They follow the same folder structure and I am putting an example below: /Forecasting/as_of_date=2022-02-01/type=full/export_country=Spain/import_country=France/000.parquet The file name 000.parquet is always
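The example path above uses `key=value` directory names. A minimal sketch, assuming every file follows that layout, of pulling those values out of a path with the standard library so they could be attached as columns when the files are combined. The helper name `partition_values` is hypothetical and not part of any library.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Collect key=value directory segments from a path (hypothetical helper)."""
    values = {}
    # Skip the last component, which is the file name (e.g. 000.parquet).
    for part in PurePosixPath(path).parts[:-1]:
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


# The example path quoted in the question.
example = (
    "/Forecasting/as_of_date=2022-02-01/type=full/"
    "export_country=Spain/import_country=France/000.parquet"
)
parts = partition_values(example)
```

Parquet readers that understand partitioned directory layouts may recover these columns on their own; the helper only shows what information the folder names carry.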