
Extracting feature names from sklearn column transformer

I’m using sklearn.pipeline to transform my features and fit a model, so my general flow looks like this: column transformer –> general pipeline –> model. I would like to be able to extract feature names from the column transformer (since the following step, the general pipeline, applies the same transformation to all columns, e.g. nan_to_num) and use them for model explainability (e.g. feature importance). I’d also like it to work with custom transformer classes.

Here is the set up:

import numpy as np
import pandas as pd
from sklearn import compose, pipeline, preprocessing

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": ["x", "y", "z"]})
column_transformer = compose.make_column_transformer(
   (preprocessing.StandardScaler(), ["a", "b"]),
   (preprocessing.KBinsDiscretizer(n_bins=2, encode="ordinal"), ["a"]),
   (preprocessing.OneHotEncoder(), ["c"]),
)
pipe = pipeline.Pipeline([
   ("transform", column_transformer),
   ("nan_to_num", preprocessing.FunctionTransformer(np.nan_to_num, validate=False))
])
pipe.fit_transform(df)  # returns a numpy array

So far I’ve tried using get_feature_names_out, e.g.:

pipe.named_steps["transform"].get_feature_names_out()

But I’m running into the error get_feature_names_out() takes 1 positional argument but 2 were given. I’m not sure what’s going on, and this entire process doesn’t feel right. Is there a better way to do it?

EDIT: A big thank you to @amiola for answering the question; that was indeed the problem. I just wanted to add another important point for posterity. I was having other problems with my own custom pipeline, which also produced the error get_feature_names_out() takes 1 positional argument but 2 were given. It turns out that, aside from the KBinsDiscretizer, there was another bug in my custom transformer classes: I had implemented the get_feature_names_out method, but it did not accept any parameters, and that was the problem. If you run into similar issues, make sure that this method has the following signature: get_feature_names_out(self, input_features) -> List[str].
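To illustrate the point about the signature, here is a minimal sketch of a custom transformer that plays nicely with get_feature_names_out (DoubleTransformer is a made-up example class, not from the original question):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class DoubleTransformer(BaseEstimator, TransformerMixin):
    """Toy transformer that doubles every value."""

    def fit(self, X, y=None):
        # remember the input column names when fitting on a DataFrame
        self.feature_names_in_ = getattr(X, "columns", None)
        return self

    def transform(self, X):
        return np.asarray(X) * 2

    def get_feature_names_out(self, input_features=None):
        # the method MUST accept input_features, even if it goes unused;
        # omitting it triggers "takes 1 positional argument but 2 were given"
        if input_features is not None:
            return np.asarray(input_features, dtype=object)
        return np.asarray(self.feature_names_in_, dtype=object)
```

A ColumnTransformer passes the selected column names as input_features when it collects output names, which is why the parameter is required.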


Answer

It seems the problem is caused by the encode="ordinal" parameter passed to the KBinsDiscretizer constructor. The bug is tracked in GitHub issues #22731 and #22841 and fixed by PR #22735.

Indeed, you can see that by specifying encode="onehot" you get a consistent result:

import numpy as np
import pandas as pd
from sklearn import compose, pipeline, preprocessing

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": ["x", "y", "z"]})
column_transformer = compose.make_column_transformer(
   (preprocessing.StandardScaler(), ["a", "b"]),
   (preprocessing.KBinsDiscretizer(n_bins=2, encode="onehot"), ["a"]),
   (preprocessing.OneHotEncoder(), ["c"]),
)
pipe = pipeline.Pipeline([
   ("transform", column_transformer),
   ("nan_to_num", preprocessing.FunctionTransformer(np.nan_to_num, validate=False))
])
pipe.fit_transform(df) 

pipe.named_steps['transform'].get_feature_names_out()

# array(['standardscaler__a', 'standardscaler__b', 'kbinsdiscretizer__a_0.0',
#        'kbinsdiscretizer__a_1.0', 'onehotencoder__c_x', 'onehotencoder__c_y',
#        'onehotencoder__c_z'], dtype=object)

Besides this, everything seems fine to me.

In the end, it appears that even after installing the nightly builds, I still get the same error.

User contributions licensed under: CC BY-SA