I want to use UMAP in an sklearn Pipeline, and I would like to cache that step to speed things up. However, since I have a custom transformer, the suggested method doesn't work.
Example code:
```python
from sklearn.preprocessing import FunctionTransformer
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from umap import UMAP
from hdbscan import HDBSCAN
import seaborn as sns

iris = sns.load_dataset("iris")
X = iris.drop(columns='species')
y = iris.species

@FunctionTransformer
def transform_something(iris):
    iris = iris.copy()
    iris['sepal_sum'] = iris.sepal_length + iris.sepal_width
    return iris

cachedir = mkdtemp()
pipe = Pipeline([
        ('transformer', transform_something),
        ('umap', UMAP()),
        ('hdb', HDBSCAN()),
    ],
    memory=cachedir
)

pipe.fit_predict(X)
```
If you run this, you will get a `PicklingError` saying it cannot pickle the custom transformer. But I only need to cache the UMAP step. Any suggestions to make it work?
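For context, the error most likely comes from how the `@FunctionTransformer` decorator rebinds the module-level name: pickle serializes plain functions by reference (module plus qualified name), and after decoration that name points at the transformer instance rather than the original function. A minimal stdlib-only sketch of the same failure, using a hypothetical `shadow` decorator in place of `FunctionTransformer`:

```python
import pickle

def shadow(f):
    # Hypothetical decorator: rebinds the name to a wrapper object,
    # just as @FunctionTransformer rebinds it to a transformer instance.
    return ("wrapped", f)

@shadow
def transform_something(x):
    return x

# The module-level name now refers to the tuple; the original function
# survives only inside it.
original_func = transform_something[1]

# Pickle looks up the function by name, finds the tuple instead of the
# same function object, and refuses to serialize it:
try:
    pickle.dumps(original_func)
    print("pickled OK")
except pickle.PicklingError as e:
    print("PicklingError:", e)
```

Defining the function under a different name and wrapping it explicitly (`transform_something = FunctionTransformer(my_func)`) avoids the rebinding, which is why the decorator form in particular trips up caching.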
Answer
Not the cleanest, but you could nest pipelines?
```python
pipe = Pipeline(
    [
        ('transformer', transform_something),
        ('the_rest', Pipeline([
            ('umap', UMAP()),
            ('hdb', HDBSCAN()),
        ], memory=cachedir))
    ]
)
```
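This works because `memory` only pickles the steps of the pipeline it is set on, so the unpicklable transformer stays in the outer, uncached pipeline while UMAP and HDBSCAN are cached inside. A runnable sketch of the same pattern using sklearn-only stand-ins (PCA for UMAP, KMeans for HDBSCAN, and a lambda-based `FunctionTransformer`, which is likewise unpicklable), so it runs without umap/hdbscan installed:

```python
from tempfile import mkdtemp

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

X = np.random.RandomState(0).rand(100, 4)

# A lambda cannot be pickled, mirroring the decorated transformer above.
noop = FunctionTransformer(lambda X: X)

cachedir = mkdtemp()
pipe = Pipeline([
    # Outer pipeline: no memory, so the lambda is never pickled.
    ('transformer', noop),
    # Inner pipeline: only its (picklable) steps are cached.
    ('the_rest', Pipeline([
        ('reduce', PCA(n_components=2)),
        ('cluster', KMeans(n_clusters=3, n_init=10)),
    ], memory=cachedir)),
])

labels = pipe.fit_predict(X)
print(labels.shape)  # one cluster label per row
```

`fit_predict` delegates to the last step, so it passes through the nested pipeline to the clusterer exactly as it would in the flat version.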