I want to get the names of the most important features for a logistic regression model after the transformation.
```
columns_for_encoding = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p']
columns_for_scaling = ['aa', 'bb', 'cc', 'dd', 'ee']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')
```
I know that I can do this:
```
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = LogisticRegression(max_iter=5000, class_weight={1: 3.5, 0: 1})
model = clf.fit(x_train, y_train)

importance = model.coef_[0]

# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
```
But with this I only get feature 1, feature 2, feature 3, etc., and after the transformation I have around 45k features.
How can I get the list of the most important features (before transformation)? I want to know which features are best for the model. I have a lot of categorical features with 100+ different categories, so after encoding I have more features than rows in my dataset. I want to find out which features I can exclude and which features are best for my model.
IMPORTANT
I have other features that are used but not transformed; because of that I set remainder='passthrough'.
Answer
As you are probably already aware, the whole idea of feature importances is a bit tricky in the case of LogisticRegression. You can read more about it in these posts (a rough sketch of the usual coefficient-based approach follows the list):
- How to find the importance of the features for a logistic regression model?
- Feature Importance in Logistic Regression for Machine Learning Interpretability
- How to Calculate Feature Importance With Python
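What those posts mostly boil down to is pairing the magnitude of the fitted coefficients with the names of the transformed features. A minimal sketch of that idea, reusing the names from your snippet (`model`, `transformerVectoriser`) and assuming a scikit-learn version where `ColumnTransformer.get_feature_names_out()` is available (roughly 1.0+); note that coefficient magnitude is only a rough proxy for importance and is only comparable when the inputs are on similar scales:

```
import pandas as pd

# names of the columns produced by the fitted ColumnTransformer
# (one name per encoded / scaled / passthrough output column)
feature_names = transformerVectoriser.get_feature_names_out()

coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': model.coef_[0],
})
coef_df['abs_coefficient'] = coef_df['coefficient'].abs()

# largest |coefficient| first; only a crude proxy for importance
print(coef_df.sort_values('abs_coefficient', ascending=False).head(20))
```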
I personally found these and other similar posts inconclusive, so I am going to avoid that part in my answer and address your main question about splitting the features and aggregating the feature importances (assuming they are available for the split features) using a RandomForestClassifier. I am also assuming that the importance of a parent feature is the sum of the importances of its child (encoded) features.
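For instance, if one-hot encoding turns `species` into `species_Adelie`, `species_Chinstrap` and `species_Gentoo`, the importance assigned to `species` is simply the sum of the three child importances. A tiny illustration of that aggregation, with made-up numbers purely to show the mechanics:

```
import pandas as pd

# invented child importances, for illustration only
toy = pd.DataFrame({
    'feature': ['species_Adelie', 'species_Chinstrap', 'species_Gentoo', 'island_Biscoe'],
    'importance': [0.10, 0.05, 0.15, 0.20],
})

# parent name = part before the first '_' (works only if original names contain no '_')
toy['feature'] = toy['feature'].str.split('_').str[0]
print(toy.groupby('feature').sum())   # species -> 0.30, island -> 0.20
```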
Under these assumptions, we can use the code below to obtain the importances of the original features. I am using the Palmer Archipelago (Antarctica) penguin data for the illustration.
```
import numpy as np
import pandas as pd

df = pd.read_csv('./data/penguins_size.csv')
df = df.dropna()

# to comply with the assumption later that column names don't contain '_'
df.columns = [c.replace('_', '-') for c in df.columns]

X = df.iloc[:, :-1]
y = np.asarray(df.iloc[:, 6] == 'MALE').astype(int)

pd.options.display.width = 0
print(X.head())
```
| species | island | culmen-length-mm | culmen-depth-mm | flipper-length-mm | body-mass-g |
|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 |
| Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 |
| Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 |
| Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 |
| Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 |
```
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer, OneHotEncoder

columns_for_encoding = ['species', 'island']
columns_for_scaling = ['culmen-length-mm', 'culmen-depth-mm']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = RandomForestClassifier(max_depth=5)
model = clf.fit(x_train, y_train)
importance = model.feature_importances_

# feature names derived from the encoded columns and their individual importances
# encoded cols
enc_col_out = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names_out()
enc_col_out_imp = importance[transformerVectoriser.output_indices_['Vector Cat']]

# normalized cols
norm_col = transformerVectoriser.named_transformers_['Normalizer'].feature_names_in_
norm_col_imp = importance[transformerVectoriser.output_indices_['Normalizer']]

# remainder cols, require a quick lookup as no transformer object exists for this case
rem_cols = []
for (tname, _, cs) in transformerVectoriser.transformers_:
    if tname == 'remainder':
        rem_cols = X.columns[cs]
        break
rem_col_imp = importance[transformerVectoriser.output_indices_['remainder']]

# storing them in a df for easy manipulation
imp_df = pd.DataFrame({'feature': (list(enc_col_out) + list(norm_col) + list(rem_cols)),
                       'importance': (list(enc_col_out_imp) + list(norm_col_imp) + list(rem_col_imp))})

# aggregating, assuming that column names don't contain '_' just to keep it simple
imp_df['feature'] = imp_df['feature'].apply(lambda x: x.split('_')[0])
imp_agg = imp_df.groupby(by=['feature']).sum()

print(imp_agg)
print(f'Sum of feature importances: {imp_df["importance"].sum()}')
```
Output: