Skip to content

Tag: dask

Data Quality check with Python Dask

Currently trying to write code to check for data quality of a 7 gb data file. I tried googling exactly but to no avail. Initially, the purpose of the code is to check how many are nulls/NaNs and later on to join it with another datafile and compare the quality between each. We are expecting the second is the …

Error comparing dask date month with an integer

The dask map_partitions function in the code below has a dask date field where its month is compared to an integer. This comparison fails with the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). What is this error and how to fix it? A…

Unable to expand cluster by dask

I am very new to kubernetes & dask and trying to implement some kube cluster and have created minikube cluster with some services, further want to expand it with flexible dask functionality. I am planning to deploy it to gcloud somehow later, so i am trying to initialize dask cluster (scheduler and worker…

Build a dask dataframe from a list of dask delayed objects

I have a list of dask delayed objects Portfolio_perfs: Each delayed object is a numpy array of length 2 I want to build the following dataframe without using dask.compute: How can I build this dask dataframe without going through dask.compute? Thank you Answer Since each delayed object is a numpy array, you a…

Dask dataframe crashes

I’m loading a large parquet dataframe using Dask but can’t seem to be able to do anything with it without the system crashing on me or getting a million errors and no output. The data weighs about 165M compressed, or 13G once loaded in pandas (it fits well in the 45G RAM available). Instead, if us…