Cache only a single step in sklearn’s Pipeline

I want to use UMAP in an sklearn Pipeline, and I would like to cache that step to speed things up. However, since the pipeline also contains a custom transformer, the suggested method (passing memory to the Pipeline) doesn't work.

Example code:

from sklearn.preprocessing import FunctionTransformer
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from umap import UMAP
from hdbscan import HDBSCAN
import seaborn as sns

iris = sns.load_dataset("iris")
X = iris.drop(columns='species')
y = iris.species

@FunctionTransformer
def transform_something(iris):
    iris = iris.copy()
    iris['sepal_sum'] = iris.sepal_length + iris.sepal_width
    return iris

cachedir = mkdtemp()
pipe = Pipeline(
    [
        ('transformer', transform_something),
        ('umap', UMAP()),
        ('hdb', HDBSCAN()),
    ],
    memory=cachedir,
)

pipe.fit_predict(X)

If you run this, you get a PicklingError saying the custom transformer cannot be pickled. But I only need to cache the UMAP step. Any suggestions to make it work?

Answer

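Not the cleanest, but you could nest pipelines: keep the custom transformer in an outer pipeline without memory, and put UMAP and HDBSCAN in an inner pipeline with memory=cachedir. Only the inner pipeline's transformers go through the cache, so the custom transformer is never handed to it. UMAP gets cached; HDBSCAN does not, because the last step of a pipeline is never cached.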

pipe = Pipeline(
    [
        ('transformer', transform_something),
        ('the_rest', Pipeline([
            ('umap', UMAP()),
            ('hdb', HDBSCAN()),
        ], memory=cachedir))
    ]
)
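
As for why the original fails, my best guess is this: using @FunctionTransformer as a decorator rebinds the name transform_something to the transformer object. pickle stores plain functions by module and name, so it can no longer find the wrapped function under that name. Below is a minimal stdlib-only sketch of that effect. Wrapper is a hypothetical stand-in for FunctionTransformer and can_pickle is a helper made up for this demo; neither is part of sklearn.

```python
import pickle


class Wrapper:
    """Hypothetical stand-in for FunctionTransformer: it only stores a function."""

    def __init__(self, func):
        self.func = func


# Decorator form: the module-level name `decorated` now refers to the Wrapper
# instance, not to the function stored inside it.
@Wrapper
def decorated(x):
    return x + 1


# Explicit form: the function keeps its own name, the wrapper gets another one.
def plain(x):
    return x + 1


wrapped_plain = Wrapper(plain)


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds (helper for this demo only)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


print(can_pickle(decorated))      # decorator form
print(can_pickle(wrapped_plain))  # explicit form
```

Run as a script, the decorator form should fail to pickle and the explicit form should succeed. If that really is the cause here, defining the function under its own name and passing FunctionTransformer(that_function) to the pipeline might let the original single pipeline with memory=cachedir work without nesting. That is an assumption, not something verified with UMAP.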
