I am using Dask to extend dask bag items with information from an external, previously computed object arg. Dask seems to allocate memory for arg once per partition at the beginning of the computation. Is there a workaround to prevent Dask from duplicating arg multiple times (and allocating a lot of memory)? Here is a simplified
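One common workaround, sketched below under the assumption that arg behaves like a dict-style lookup: wrapping it in dask.delayed puts a single copy into the task graph, so every partition task references that one key rather than embedding its own serialized copy. The enrich function and the sample data are placeholders.

```python
import dask
import dask.bag as db

# Hypothetical stand-in for the large, previously computed object `arg`
arg = {i: i * 2 for i in range(1_000_000)}

def enrich(item, lookup):
    # Extend each bag item with information looked up in `arg`
    return (item, lookup.get(item))

bag = db.from_sequence(range(1000), npartitions=8)

# Wrapping `arg` in dask.delayed stores a single copy of it in the graph;
# each partition task then references that one key instead of carrying
# its own serialized copy of `arg`.
arg_delayed = dask.delayed(arg)

result = bag.map(enrich, lookup=arg_delayed).compute()
```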
Tag: dask
Testing string membership using the in keyword in Python is very slow
I have the following text dataset: 4 million paragraphs, each between 10 and 60 words long. I also have a set of 30,000 unique sentences. I want to check whether ANY of the sentences in the set appear in those 4 million paragraphs. If any of those 30,000 sentences are in one of those paragraphs, I want to keep that particular
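A hedged sketch of parallelizing the naive scan with dask.bag (the sample paragraphs and sentences below are tiny placeholders); a dedicated multi-pattern matcher such as Aho-Corasick would cut the per-paragraph cost further, but partitioning alone already spreads the work across all cores.

```python
import dask.bag as db

# Hypothetical stand-ins for the 4M paragraphs and 30k sentences
paragraphs = [
    "the quick brown fox jumps over the lazy dog",
    "dask scales pandas and numpy workflows",
    "completely unrelated text about gardening",
]
sentences = {"quick brown fox", "scales pandas"}

bag = db.from_sequence(paragraphs, npartitions=2)

# Keep a paragraph if ANY of the target sentences occurs in it as a substring
kept = bag.filter(lambda p: any(s in p for s in sentences)).compute()
print(kept)
```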
Dealing with huge pandas data frames
I have a huge database (of 500GB or so) and was able to put it in pandas. The database contains something like 39705210 observations. As you can imagine, Python has a hard time even opening it. Now I am trying to use Dask in order to export it to CSV in 20 partitions like this: However when I am trying to
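A minimal sketch of the export step, assuming the data can be read as a CSV file; the filename and blocksize here are illustrative:

```python
import dask.dataframe as dd

# Read the source lazily in manageable blocks (path is hypothetical)
ddf = dd.read_csv("huge_database.csv", blocksize="256MB")

# Repartition into 20 pieces and write one CSV file per partition
ddf = ddf.repartition(npartitions=20)
ddf.to_csv("export/part-*.csv", index=False)
```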
Dask Distributed: Reducing Multiple Dimensions into a Distance Matrix
I want to calculate a large distance matrix based on higher-dimensional vectors. For instance, I have 1000 instances, each represented by 20 vectors of length 10. The distance between any two instances is given by the mean distance between the 20 vectors associated with each instance. So I want to go from a 1000 by 20
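One way to express this with dask.array, assuming the data fits an array of shape (1000, 20, 10) and that the per-vector distance is Euclidean (the excerpt does not say which metric is used):

```python
import dask.array as da
import numpy as np

# Hypothetical data: 1000 instances x 20 vectors x 10 dimensions
rng = np.random.default_rng(0)
x = da.from_array(rng.standard_normal((1000, 20, 10)), chunks=(100, 20, 10))

# Pairwise differences via broadcasting: shape (1000, 1000, 20, 10), chunked
diff = x[:, None, :, :] - x[None, :, :, :]

# Euclidean norm per vector pair, then mean over the 20 vectors per instance pair
dist = da.sqrt((diff ** 2).sum(axis=-1)).mean(axis=-1)

result = dist.compute()   # (1000, 1000) distance matrix
print(result.shape)
```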
Dask DataFrame: convert all dtypes using a dictionary
Is there an easy, equivalent way to convert all columns in a Dask DataFrame (converted from a pandas DataFrame) using a dictionary? I have a dictionary as follows: and would like to convert the pandas/dask df dtypes all at once to the suggested dtypes in the dictionary. Answer: Not sure if I understand the question correctly, but the conversion of dtypes
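For reference, astype accepts a column-to-dtype dictionary on both pandas and Dask DataFrames, so the conversion can be done in one call; the frame and mapping below are made up for illustration:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical frame and dtype mapping
pdf = pd.DataFrame({"a": ["1", "2"], "b": [1.5, 2.5], "c": ["x", "y"]})
dtype_map = {"a": "int64", "b": "float32", "c": "category"}

ddf = dd.from_pandas(pdf, npartitions=1)

# astype takes the whole column -> dtype dictionary at once
ddf = ddf.astype(dtype_map)
print(ddf.dtypes)
```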
Dask “Column assignment doesn’t support type numpy.ndarray”
I’m trying to use Dask instead of pandas since the data size I’m analyzing is quite large. I wanted to add a flag column based on several conditions, but then I got the following error message. The above code works perfectly when using np.where with a pandas DataFrame, but didn’t work with dask.array.where. Answer: If numpy works and the operation is
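The usual cause of that error is assigning a plain numpy array (the output of np.where) as a column of a Dask DataFrame. One hedged alternative, with made-up column names and conditions, is to build the flag inside map_partitions so each partition stays a pandas DataFrame:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(10)})
ddf = dd.from_pandas(pdf, npartitions=2)

def add_flag(df):
    # Runs on each pandas partition, so ordinary pandas logic applies
    df = df.copy()
    df["flag"] = (df["x"] > 5).astype(int)
    return df

ddf = ddf.map_partitions(add_flag)
print(ddf.compute())
```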
1D netcdf to 2D lat lon using xarray and Dask
I have a large netCDF dataset which has two dimensions: ‘time’ and a single spatial dimension ‘x’. There is also a ‘lat’ and a ‘lon’ coordinate for each ‘x’. This needs to be mapped onto a global half-degree 2D grid, such that the dimensions are ‘time’, ‘lat’ and ‘lon’. Not all the points on the global half-degree grid
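A small sketch of the usual xarray recipe for this, with toy coordinates standing in for the real half-degree grid: turn lat/lon into a MultiIndex on x and unstack it, leaving NaN where the grid has no source point.

```python
import numpy as np
import xarray as xr

# Hypothetical 1D dataset: each point along x carries its own lat/lon coordinate
ds = xr.Dataset(
    {"tas": (("time", "x"), np.random.rand(3, 4))},
    coords={
        "time": range(3),
        "lat": ("x", [10.25, 10.25, 10.75, 10.75]),
        "lon": ("x", [0.25, 0.75, 0.25, 0.75]),
    },
)

# Make (lat, lon) a MultiIndex on x, then unstack to a 2D lat/lon grid;
# grid cells with no source point become NaN.
ds2d = ds.set_index(x=["lat", "lon"]).unstack("x")
print(ds2d)
```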
Fastest way to filter csv using pandas and create a matrix
I have large csv files in the below format (basename_AM1.csv). Now I need to create a similarity dict like below for the given input_dict by searching/filtering the csv files. I have come up with the below logic, but for an input_dict of 100 samples this takes too long,
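Without the real column layout or the structure of input_dict, only a generic sketch is possible; the column names, the glob pattern and the shape of the resulting dict below are all hypothetical placeholders for the idea of filtering the CSVs with dask.dataframe instead of looping file by file:

```python
import dask.dataframe as dd

# Read all matching files at once (glob pattern is illustrative)
ddf = dd.read_csv("basename_AM*.csv")

# Hypothetical keys taken from input_dict, and hypothetical columns
wanted = {"sampleA", "sampleB"}
filtered = ddf[ddf["sample"].isin(list(wanted))]

# Collapse the filtered rows into a plain Python dict
similarity = filtered.compute().groupby("sample")["score"].apply(list).to_dict()
print(similarity)
```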
Retrieving data from multiple parquet files into one dataframe (Python)
I want to start by saying this is the first time I’ve worked with Parquet files. I have a list of 2615 parquet files that I downloaded from an S3 bucket and I want to read them into one dataframe. They follow the same folder structure and I am putting an example below: /Forecasting/as_of_date=2022-02-01/type=full/export_country=Spain/import_country=France/000.parquet The file name 000.parquet is always
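Since the folders follow a Hive-style key=value layout, one option (sketched below with a relative path based on the example) is to point dask.dataframe at the top-level directory and let it expose the folder names as columns; the filters argument is optional and only narrows the read:

```python
import dask.dataframe as dd

# Read the whole partitioned directory; as_of_date/type/export_country/
# import_country become ordinary columns derived from the folder names.
ddf = dd.read_parquet("Forecasting/")

# ...or restrict the read using the partition columns
ddf_es_fr = dd.read_parquet(
    "Forecasting/",
    filters=[("export_country", "==", "Spain"), ("import_country", "==", "France")],
)

df = ddf_es_fr.compute()   # single pandas DataFrame
print(df.head())
```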
Operating on a large .csv file with pandas/Dask in Python
I’ve got a large .csv file (5GB) from the UK Land Registry. I need to find all real estate that has been bought/sold two or more times. Each row of the table looks like this: I’ve never used pandas or any data science library. So far I’ve come up with this plan: Load the .csv file and add headers and column
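A rough sketch of that plan with Dask, under heavy assumptions: the filename and column names below are made up, since the excerpt cuts off before showing the real row layout. The idea is to count transactions per property and keep the properties that appear at least twice:

```python
import dask.dataframe as dd

# Hypothetical file and column names; the real export has no header row
cols = ["transaction_id", "price", "date", "postcode", "property_id"]
ddf = dd.read_csv("prices.csv", header=None, names=cols)

# Count transactions per property, then keep properties sold two or more times
counts = ddf.groupby("property_id").size().compute()
repeat_ids = counts[counts >= 2].index

resold = ddf[ddf["property_id"].isin(list(repeat_ids))].compute()
print(resold.head())
```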