
Tag: dask

Data quality check with Python Dask

I am currently trying to write code to check the data quality of a 7 GB data file. I tried googling for exactly this, but to no avail. Initially, the purpose of the code is to count how many values are nulls/NaNs; later on the file will be joined with another data file so the quality of each can be compared. We are expecting the second to be the more …
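A minimal sketch of that first step, assuming a CSV input: count nulls/NaNs per column lazily with dask.dataframe. The file name, blocksize, and the percentage summary below are assumptions, not details from the question.

```python
import dask.dataframe as dd

# Lazily read the 7 GB file in chunks; "data.csv" and the blocksize
# are placeholder assumptions.
df = dd.read_csv("data.csv", blocksize="256MB")

# Count nulls/NaNs per column; nothing is read until .compute().
null_counts = df.isna().sum().compute()
print(null_counts)

# Null percentage per column, handy when comparing two files later.
null_pct = (df.isna().mean() * 100).compute()
print(null_pct.round(2))
```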

Error comparing dask date month with an integer

The dask map_partitions function in the code below operates on a dask date field whose month is compared to an integer. This comparison fails with the following error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). What is this error and how do I fix it? Answer By using .map_partitions, each dask dataframe …
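The error means a whole pandas Series ended up where Python expected a single True/False, typically an `if` test or other scalar context inside the partition function. A minimal sketch of the vectorized fix; the column and function names here are illustrative, not from the original post.

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(
    pd.DataFrame({"date": pd.date_range("2021-01-01", periods=6, freq="MS")}),
    npartitions=2,
)

def flag_january(partition):
    # Compare the whole column at once; writing
    # `if partition["date"].dt.month == 1:` instead is what raises
    # the "truth value of a Series is ambiguous" error.
    partition["is_january"] = partition["date"].dt.month == 1
    return partition

print(ddf.map_partitions(flag_january).compute())
```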

Unable to expand a cluster with dask

I am very new to Kubernetes and dask and am trying to implement a kube cluster. I have created a minikube cluster with some services and now want to expand it with flexible dask functionality. I am planning to deploy it to gcloud somehow later, so I am trying to initialize a dask cluster (scheduler and workers) on my minikube cluster from a pod …
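For reference, a minimal sketch of starting and scaling a Dask cluster on Kubernetes with the dask-kubernetes operator. It assumes the operator is already installed in the minikube cluster; the cluster name, image, and worker counts are illustrative.

```python
from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster

# Spawn a Dask scheduler and workers as pods in the current cluster;
# all values here are placeholder assumptions.
cluster = KubeCluster(
    name="demo-cluster",
    image="ghcr.io/dask/dask:latest",
    n_workers=2,
)
cluster.scale(4)  # expand the cluster to 4 workers

client = Client(cluster)
print(client.dashboard_link)
```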

Build a dask dataframe from a list of dask delayed objects

I have a list of dask delayed objects, Portfolio_perfs. Each delayed object is a numpy array of length 2. I want to build the following dataframe without using dask.compute: how can I build this dask dataframe without going through dask.compute? Thank you Answer Since each delayed object is a numpy array, you are interested in da.from_delayed(). Alternatively, it’s possible to …
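A sketch of the da.from_delayed() route under the stated assumptions (each delayed object yields a length-2 float array; the producing function and column names are made up):

```python
import numpy as np
import dask.array as da
import dask.dataframe as dd
from dask import delayed

@delayed
def portfolio_perf(i):
    # Stand-in for the real computation; returns a length-2 array.
    return np.array([float(i), float(i) ** 2])

Portfolio_perfs = [portfolio_perf(i) for i in range(10)]

# Wrap each delayed array with its known shape/dtype, then stack.
arrays = [da.from_delayed(p, shape=(2,), dtype=float) for p in Portfolio_perfs]
stacked = da.stack(arrays)  # dask array of shape (10, 2)

# Convert to a dask dataframe without ever calling dask.compute.
df = dd.from_dask_array(stacked, columns=["return", "volatility"])
print(df)
```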

Dask dataframe crashes

I’m loading a large Parquet dataframe using Dask but can’t seem to do anything with it without the system crashing on me, or getting a million errors and no output. The data weighs about 165 MB compressed, or 13 GB once loaded in pandas (it fits well within the 45 GB of RAM available). Instead, using Dask prints the same …
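A common mitigation, sketched below with assumed paths and limits: run a local cluster with capped per-worker memory so data spills to disk instead of crashing, split large row groups into smaller partitions, and compute a reduced result rather than the whole frame.

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Cap each worker's memory so partitions spill to disk rather than
# exhausting RAM; the worker count and limit are assumptions.
cluster = LocalCluster(n_workers=4, memory_limit="10GB")
client = Client(cluster)

# split_row_groups=True breaks oversized row groups into smaller
# partitions; "data.parquet" is a placeholder path.
df = dd.read_parquet("data.parquet", split_row_groups=True)

# Compute a reduced result instead of materializing the whole frame.
print(df.memory_usage(deep=True).sum().compute())
```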
