I can impute the mean and the most frequent value using dask-ml like so; this works fine:
```python
import numpy as np
import pandas as pd
from dask_ml import impute

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')

data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns=['Weight', 'Age', 'Height'])
df.iloc[:, [0, 1]] = mean_imputer.fit_transform(df.iloc[:, [0, 1]])
df.iloc[:, [2]] = most_frequent_imputer.fit_transform(df.iloc[:, [2]])
print(df)
```

```
   Weight  Age  Height
0   100.0  2.0     5.0
1    85.0  4.5     5.0
2    70.0  7.0     5.0
```
But what if I have 100 million rows of data? It seems that dask would make two passes when it could have done only one. Is it possible to run both imputers simultaneously and/or in parallel instead of sequentially? What would sample code to achieve that look like?
Answer
You can use `dask.delayed`, as suggested in the docs and the Dask Tutorial, to parallelise the computation when the pieces of work are independent of one another.
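As a minimal sketch of the idea (a toy example, not your imputation code): wrapping a function in `delayed` defers its execution and records it in a task graph, so independent calls can be scheduled in parallel.

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

a = inc(1)          # no work happens yet; a is a lazy Delayed object
b = inc(2)          # independent of a, so the scheduler may run it in parallel
total = add(a, b)   # builds the task graph linking a and b

print(total.compute())  # executes the whole graph → 5
```

Here `a` and `b` have no dependency on each other, so when `compute()` walks the graph, the scheduler is free to evaluate them concurrently before combining them in `add`.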
Your code would look like:
```python
from dask.distributed import Client
from dask import delayed
import numpy as np
import pandas as pd
from dask_ml import impute

client = Client(n_workers=4)

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')

def fit_transform_mi(d):
    return mean_imputer.fit_transform(d)

def fit_transform_mfi(d):
    return most_frequent_imputer.fit_transform(d)

def setdf(a, b, df):
    df.iloc[:, [0, 1]] = a
    df.iloc[:, [2]] = b
    return df

data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns=['Weight', 'Age', 'Height'])

# The two imputations are independent of each other,
# so the scheduler can run them in parallel.
a = delayed(fit_transform_mi)(df.iloc[:, [0, 1]])
b = delayed(fit_transform_mfi)(df.iloc[:, [2]])
c = delayed(setdf)(a, b, df)

df = c.compute()
print(df)

client.close()
```
The `c` object is a lazy `Delayed` object. It holds everything needed to compute the final result, including references to all of the required functions, their inputs, and their relationships to one another.