I want to use UMAP in my sklearn Pipeline, and I would like to cache that step to speed things up. However, since I have a custom transformer, the suggested method (passing a cache directory via the memory argument) doesn’t work.
Example code:
from sklearn.preprocessing import FunctionTransformer
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from umap import UMAP
from hdbscan import HDBSCAN
import seaborn as sns
iris = sns.load_dataset("iris")
X = iris.drop(columns='species')
y = iris.species
@FunctionTransformer
def transform_something(iris):
    iris = iris.copy()
    iris['sepal_sum'] = iris.sepal_length + iris.sepal_width
    return iris
cachedir = mkdtemp()
pipe = Pipeline(
    [
        ('transformer', transform_something),
        ('umap', UMAP()),
        ('hdb', HDBSCAN()),
    ],
    memory=cachedir,
)
pipe.fit_predict(X)
If you run this, you will get a PicklingError, saying it cannot pickle the custom transformer. But I only need to cache the UMAP step. Any suggestions to make it work?
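A likely cause of the PicklingError, sketched with the standard library only: using a class as a decorator rebinds the function's name to the wrapper object, so pickle can no longer find the original function under that name. The Wrapper class and can_pickle helper below are hypothetical stand-ins for illustration (Wrapper plays the role of FunctionTransformer); this is a reading of the error, not something the traceback above confirms.

```python
import pickle


class Wrapper:
    """Hypothetical stand-in for FunctionTransformer: it only stores a function."""

    def __init__(self, func):
        self.func = func


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False on PicklingError."""
    try:
        pickle.dumps(obj)
    except pickle.PicklingError:
        return False
    return True


# Decorator form: the name `decorated` now refers to the Wrapper instance,
# so pickle cannot look up the wrapped function by its own name.
@Wrapper
def decorated(x):
    return x


# Plain form: the function keeps its module-level name and is wrapped afterwards.
def plain(x):
    return x


wrapped_plain = Wrapper(plain)

print("decorator form picklable:", can_pickle(decorated))
print("plain form picklable:", can_pickle(wrapped_plain))
```

If the same reasoning applies here, defining the function normally and then writing transform_something = FunctionTransformer(plain_function) might make the transformer picklable, but that is an assumption that has not been checked against this pipeline.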
Answer
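Not the cleanest, but you could nest pipelines, so that only the inner pipeline uses the cache. A Pipeline with memory set caches every step except its last one, so here only the UMAP step is cached and the custom transformer in the outer pipeline is never pickled: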
pipe = Pipeline(
    [
        ('transformer', transform_something),
        ('the_rest', Pipeline([
            ('umap', UMAP()),
            ('hdb', HDBSCAN()),
        ], memory=cachedir)),
    ]
)
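To try the nesting idea without installing umap or hdbscan, here is a minimal sketch that swaps in scikit-learn's PCA and KMeans as stand-ins for UMAP and HDBSCAN, and random data for iris. The add_row_sum function, the step names and the data are assumptions made for illustration; only the nesting pattern comes from the answer above.

```python
import os
from shutil import rmtree
from tempfile import mkdtemp

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_row_sum(X):
    """Illustrative custom step: append the row-wise sum as an extra column."""
    return np.hstack([X, X.sum(axis=1, keepdims=True)])


cachedir = mkdtemp()
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))  # random stand-in for the iris features

pipe = Pipeline([
    # Outer pipeline has no memory, so this step is never cached or pickled.
    ("transformer", FunctionTransformer(add_row_sum)),
    # Inner pipeline caches its transformers (here PCA) but not its last step.
    ("the_rest", Pipeline([
        ("reduce", PCA(n_components=2)),  # stand-in for UMAP
        ("cluster", KMeans(n_clusters=3, n_init=10, random_state=0)),  # stand-in for HDBSCAN
    ], memory=cachedir)),
])

labels = pipe.fit_predict(X)

# The cache directory should now contain joblib's store for the cached step.
cache_entries = os.listdir(cachedir)
print(len(labels), cache_entries)

# Remove the temporary cache once it is no longer needed.
rmtree(cachedir)
```

The outer pipeline simply hands the transformed data to the inner one, so the custom step is recomputed on every fit while the expensive reduction step can be reused from the cache when its inputs and parameters are unchanged.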