I want to use UMAP in my sklearn Pipeline, and I would like to cache that step to speed things up. However, since I have a custom transformer, the suggested method doesn't work.
Example code:
from sklearn.preprocessing import FunctionTransformer
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from umap import UMAP
from hdbscan import HDBSCAN
import seaborn as sns

iris = sns.load_dataset("iris")
X = iris.drop(columns='species')
y = iris.species

@FunctionTransformer
def transform_something(iris):
    iris = iris.copy()
    iris['sepal_sum'] = iris.sepal_length + iris.sepal_width
    return iris

cachedir = mkdtemp()
pipe = Pipeline([
    ('transformer', transform_something),
    ('umap', UMAP()),
    ('hdb', HDBSCAN()),
], memory=cachedir)
pipe.fit_predict(X)
If you run this, you will get a PicklingError saying it cannot pickle the custom transformer. But I only need to cache the UMAP step. Any suggestions to make it work?
Answer
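Not the cleanest, but you could nest pipelines: put the UMAP and HDBSCAN steps in an inner Pipeline with memory=cachedir, and leave the outer pipeline, which holds the custom transformer, without caching.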
pipe = Pipeline([
    ('transformer', transform_something),
    ('the_rest', Pipeline([
        ('umap', UMAP()),
        ('hdb', HDBSCAN()),
    ], memory=cachedir)),
])
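As a side note, the PicklingError itself most likely comes from using FunctionTransformer as a decorator: the module-level name then points at the wrapper object rather than at the function, so pickle cannot look the function up by name. Here is a minimal standard-library sketch of that effect. `Wrapper` is a hypothetical stand-in for FunctionTransformer (it only stores the function), and `decorated`, `plain`, and `can_pickle` are illustrative names, not part of any library.

```python
import pickle


class Wrapper:
    """Hypothetical stand-in for FunctionTransformer: it only stores a function."""

    def __init__(self, func):
        self.func = func


# Decorator form: the name `decorated` now refers to a Wrapper instance,
# so the underlying function can no longer be found under its own name.
@Wrapper
def decorated(x):
    return x + 1


# Plain form: the function keeps its module-level name and is wrapped separately.
def plain(x):
    return x + 1


wrapped_plain = Wrapper(plain)


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


decorated_ok = can_pickle(decorated)
plain_ok = can_pickle(wrapped_plain)
print(decorated_ok, plain_ok)
```

If the same reasoning carries over to the pipeline, defining the function normally and passing it as FunctionTransformer(transform_something_func) should make the step picklable, though I have not verified that against the exact UMAP and HDBSCAN setup in the question.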