I can impute the mean and the most frequent value using dask-ml like so; this works fine:
```python
import numpy as np
import pandas as pd
from dask_ml import impute

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')

data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns=['Weight', 'Age', 'Height'])
df.iloc[:, [0, 1]] = mean_imputer.fit_transform(df.iloc[:, [0, 1]])
df.iloc[:, [2]] = most_frequent_imputer.fit_transform(df.iloc[:, [2]])
print(df)
```

```
   Weight  Age  Height
0   100.0  2.0     5.0
1    85.0  4.5     5.0
2    70.0  7.0     5.0
```
But what if I have 100 million rows of data? It seems that dask would make two passes when it could have done only one. Is it possible to run both imputers simultaneously and/or in parallel instead of sequentially? What would sample code to achieve that look like?
Answer
You can use `dask.delayed`, as suggested in the docs and the Dask Tutorial, to parallelise the computation when the pieces of work are independent of one another.
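As a minimal sketch of the idea (a toy example, not your imputation code): wrapping a function in `delayed` defers its execution and records it in a task graph, so independent calls can be scheduled in parallel.

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

a = inc(1)          # no work happens yet; a is a lazy Delayed object
b = inc(2)          # independent of a, so the scheduler may run it in parallel
total = add(a, b)   # builds the task graph linking a and b

print(total.compute())  # executes the whole graph → 5
```

Here `a` and `b` have no dependency on each other, so when `compute()` walks the graph, the scheduler is free to evaluate them concurrently before combining them in `add`.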
Your code would look like:
```python
from dask.distributed import Client
from dask import delayed
import numpy as np
import pandas as pd
from dask_ml import impute

client = Client(n_workers=4)

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')

def fit_transform_mi(d):
    return mean_imputer.fit_transform(d)

def fit_transform_mfi(d):
    return most_frequent_imputer.fit_transform(d)

def setdf(a, b, df):
    df.iloc[:, [0, 1]] = a
    df.iloc[:, [2]] = b
    return df

data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns=['Weight', 'Age', 'Height'])

# The two imputations are independent of each other,
# so the scheduler can run them in parallel.
a = delayed(fit_transform_mi)(df.iloc[:, [0, 1]])
b = delayed(fit_transform_mfi)(df.iloc[:, [2]])
c = delayed(setdf)(a, b, df)

df = c.compute()
print(df)

client.close()
```
The `c` object is a lazy `Delayed` object. It holds everything needed to compute the final result, including references to all of the required functions, their inputs, and their relationships to one another.