I can impute the mean and the most frequent value using dask-ml like so, and this works fine: But what if I have 100 million rows of data? It seems that Dask would make two passes over the data when it could have done only one. Is it possible to run both imputers simultaneously and/or in parallel instead of sequentially? What would be a sample
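A minimal sketch of the single-pass idea using plain dask.dataframe rather than the dask-ml imputers, with made-up column names: build both reductions lazily, then evaluate them in one dask.compute call so the two graphs share a single scan of the data.

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Made-up example data: one numeric and one categorical column
pdf = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["NY", "LA", None, "NY"],
})
df = dd.from_pandas(pdf, npartitions=2)

# Build both reductions lazily, then evaluate them together; dask.compute
# merges the two graphs, so the data is only scanned once
mean_age, city_mode = dask.compute(
    df["age"].mean(),
    df["city"].value_counts().idxmax(),  # most frequent value
)

# Fill each column from the precomputed statistics
df["age"] = df["age"].fillna(mean_age)
df["city"] = df["city"].fillna(city_mode)
print(df.compute())
```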
Using set_index() on a Dask Dataframe and writing to parquet causes memory explosion
I have a large set of Parquet files that I am trying to sort on a column. Uncompressed, the data is around 14 GB, so Dask seemed like the right tool for the job. All I’m doing with Dask is: reading the Parquet files, sorting on one of the columns (called “friend”), and writing the result as Parquet files in a separate directory. I
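Those three steps boil down to something like the following sketch; the paths are placeholders, and the column name “friend” is taken from the question.

```python
import dask.dataframe as dd

# Placeholder input directory of Parquet files
df = dd.read_parquet("input_parquet/")

# set_index performs the global sort: it shuffles rows between partitions,
# which is the step where memory pressure usually shows up
df = df.set_index("friend")

# Placeholder output directory
df.to_parquet("sorted_parquet/")
```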
ERROR: Could not find a version that satisfies the requirement dask-cudf (from versions: none)
Describe the bug: When I try to import dask_cudf I get the following ERROR: I have Dask and RAPIDS installed with pip. When I search for pip install dask_cudf, the original site no longer exists: https://pypi.org/project/dask-cudf/ Google’s cached copy of the page: https://webcache.googleusercontent.com/search?q=cache:8in7y2jQFQIJ:https://pypi.org/project/dask-cudf/+&cd=1&hl=en&ct=clnk&gl=uk I am trying to install it with the following command in a Google Colab cell: %pip install dask-cudf
Dask: concatenate two DataFrames into a single DataFrame
Objective: To merge df_labelled, a file with a portion of labelled points, into df, which contains all the points. What I have tried: Referring to Simple way to Dask concatenate (horizontal, axis=1, columns), I tried the code below, but I get the error ValueError: Not all divisions are known, can’t align partitions. Please use set_index to set the index. Another thing
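A minimal sketch of what the error message asks for, with small made-up frames standing in for df and df_labelled: set the index on a shared key so divisions are known, then concatenate along axis=1.

```python
import dask.dataframe as dd
import pandas as pd

# Hypothetical frames standing in for df (all points) and df_labelled (a labelled subset)
df = dd.from_pandas(
    pd.DataFrame({"id": [1, 2, 3, 4], "x": [10.0, 20.0, 30.0, 40.0]}), npartitions=2
)
df_labelled = dd.from_pandas(
    pd.DataFrame({"id": [2, 4], "label": ["a", "b"]}), npartitions=1
)

# Column-wise concatenation (axis=1) needs known, aligned divisions,
# so set the index on the shared key first
df = df.set_index("id")
df_labelled = df_labelled.set_index("id")

merged = dd.concat([df, df_labelled], axis=1)
print(merged.compute())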
Dask distributed.scheduler - ERROR - Couldn't gather keys
I created a Dask cluster using two local machines. I am trying to find the best parameters using Dask’s GridSearchCV, and I am facing the following error. I hope someone can help solve this issue. Thanks in advance. Answer: I also met the same issue, and I found it is likely caused by a firewall. Suppose we have two machines, 191.168.1.1
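For context, a minimal sketch of the kind of setup described, assuming dask-ml’s GridSearchCV and using the answer’s example machine as the scheduler address; 8786 is the default scheduler port that a firewall between the machines would need to allow.

```python
from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Placeholder scheduler address, taken from the answer's example machine;
# the client and workers must be able to reach this host and port
client = Client("tcp://191.168.1.1:8786")

# Toy data and a small grid, just to show the shape of the search
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```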
Writing dask bag to DB using custom function
I’m running a function on a Dask bag to dump data into a NoSQL DB, like: Now when I look at the Dask task graph, after each partition completes the write_to_db function, it is shown as “memory” instead of “released”. My questions: How do I tell Dask that there is no return value, and hence mark the memory as released? For example in the
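A minimal sketch of the pattern with a hypothetical write_to_db: returning only a small per-partition count, rather than the records themselves, means almost nothing is held in worker memory once a partition has been written.

```python
import dask.bag as db

def write_to_db(partition):
    # Hypothetical writer: push the records of one partition to a NoSQL store.
    # client = SomeNoSQLClient(...)  # assumed connection, not a real API
    count = 0
    for record in partition:
        # client.insert(record)
        count += 1
    # Return a tiny iterable (one count per partition) so the task's result,
    # which Dask keeps until the computation finishes, is negligible in size
    return [count]

bag = db.from_sequence(range(100_000), npartitions=8)
written = bag.map_partitions(write_to_db).compute()  # triggers the writes
print(sum(written))
```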
Load an Oracle DataFrame into a Dask DataFrame
I have worked with pandas and cx_Oracle until now, but I have to switch to Dask due to RAM limitations. I tried to do it similarly to how I used cx_Oracle with pandas, but I receive an AttributeError named: Any ideas whether it’s just a problem with the package? Answer: Please read the Dask documentation on SQL: you
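A minimal sketch along the lines of the Dask SQL documentation: dd.read_sql_table takes a SQLAlchemy connection URI rather than a cx_Oracle connection object. The URI, table, and column names below are placeholders.

```python
import dask.dataframe as dd

# Placeholder SQLAlchemy URI for an Oracle database via cx_Oracle
uri = "oracle+cx_oracle://user:password@host:1521/?service_name=MYDB"

# Placeholder table and index column; npartitions controls how the table
# is split into chunks read in parallel
df = dd.read_sql_table("my_table", uri, index_col="id", npartitions=8)
print(df.head())
```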
Losing index information when using dask.dataframe.to_parquet() with partitioning
When I was using dask=1.2.2 with pyarrow 0.11.1 I did not observe this behavior. After updating (dask=2.10.1 and pyarrow=0.15.1), I cannot save the index when I use the to_parquet method with the partition_on and write_index arguments. Here I have created a minimal example which shows the issue: Which gives: I did not see this described anywhere in the Dask documentation. Does
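A hedged reconstruction of that kind of minimal example, with made-up column and index names: write with partition_on and write_index=True, then read the dataset back and check whether the index survived.

```python
import pandas as pd
import dask.dataframe as dd

# Small frame with a named index, so we can see whether it is preserved
pdf = pd.DataFrame(
    {"group": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]},
    index=pd.Index([10, 11, 12, 13], name="my_index"),
)
ddf = dd.from_pandas(pdf, npartitions=2)

# Write partitioned by "group", explicitly asking for the index to be written
ddf.to_parquet("out_parquet", engine="pyarrow", partition_on=["group"], write_index=True)

# Read back and inspect the index
back = dd.read_parquet("out_parquet", engine="pyarrow")
print(back.compute().index)  # check whether "my_index" was preserved
```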