
Get names of the most important features for Logistic Regression after transformation

I want to get the names of the most important features for a logistic regression model after the transformation.

columns_for_encoding = ['a', 'b', 'c', 'd', 'e', 'f',
                        'g','h','i','j','k','l', 'm', 
                        'n', 'o', 'p']

columns_for_scaling = ['aa', 'bb', 'cc', 'dd', 'ee']

transformerVectoriser = ColumnTransformer(transformers=[('Vector Cat', OneHotEncoder(handle_unknown = "ignore"), columns_for_encoding),
                                                        ('Normalizer', Normalizer(), columns_for_scaling)],
                                          remainder='passthrough') 

I know that I can do this:

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size = 0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = LogisticRegression(max_iter = 5000, class_weight = {1: 3.5, 0: 1})
model = clf.fit(x_train, y_train)

importance = model.coef_[0]

# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
# plot feature importance
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()

But with this I only get feature 1, feature 2, feature 3, etc., and after the transformation I have around 45k features.

How can I get the list of the most important features in terms of the original (pre-transformation) columns? I want to know which features work best for the model. I have a lot of categorical features with 100+ different categories, so after encoding I end up with more features than rows in my dataset. I want to find out which features I can exclude from my dataset and which features are best for my model.

IMPORTANT: I have other features that are used but not transformed; because of that I set remainder='passthrough'.


Answer

As you are probably already aware, the whole idea of feature importance is a bit tricky in the case of LogisticRegression. You can read more about it in these posts:

  1. How to find the importance of the features for a logistic regression model?
  2. Feature Importance in Logistic Regression for Machine Learning Interpretability
  3. How to Calculate Feature Importance With Python

I personally found these and other similar posts inconclusive, so I am going to avoid that part in my answer and address your main question about splitting features and aggregating their importances (assuming importances are available for the split features), using a RandomForestClassifier for illustration. I am also assuming that the importance of a parent feature is the sum of the importances of its child features.

Under these assumptions, we can use the code below to obtain the importances of the original features. I am using the Palmer Archipelago (Antarctica) penguin data for the illustration.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer, OneHotEncoder

df = pd.read_csv('./data/penguins_size.csv')
df = df.dropna()
# to comply with the later assumption that column names don't contain '_'
df.columns = [c.replace('_', '-') for c in df.columns]

X = df.iloc[:, :-1]                                    # everything except the last column ('sex')
y = np.asarray(df.iloc[:, 6] == 'MALE').astype(int)    # binary target: 1 = MALE

pd.options.display.width = 0
print(X.head())
species     island  culmen-length-mm  culmen-depth-mm  flipper-length-mm  body-mass-g
 Adelie  Torgersen              39.1             18.7              181.0       3750.0
 Adelie  Torgersen              39.5             17.4              186.0       3800.0
 Adelie  Torgersen              40.3             18.0              195.0       3250.0
 Adelie  Torgersen              36.7             19.3              193.0       3450.0
 Adelie  Torgersen              39.3             20.6              190.0       3650.0
columns_for_encoding = ['species', 'island']
columns_for_scaling = ['culmen-length-mm', 'culmen-depth-mm']

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown="ignore"), columns_for_encoding),
                  ('Normalizer', Normalizer(), columns_for_scaling)],
    remainder='passthrough')

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

clf = RandomForestClassifier(max_depth=5)
model = clf.fit(x_train, y_train)

importance = model.feature_importances_

# feature names derived from the encoded columns and their individual importances
# encoded cols
enc_col_out = transformerVectoriser.named_transformers_['Vector Cat'].get_feature_names_out()
enc_col_out_imp = importance[transformerVectoriser.output_indices_['Vector Cat']]
# normalized cols
norm_col = transformerVectoriser.named_transformers_['Normalizer'].feature_names_in_
norm_col_imp = importance[transformerVectoriser.output_indices_['Normalizer']]
# remainder cols require a quick lookup, since no fitted transformer object exists for them
rem_cols = []
for (tname, _, cs) in transformerVectoriser.transformers_:
    if tname == 'remainder':
        rem_cols = X.columns[cs]
        break
rem_col_imp = importance[transformerVectoriser.output_indices_['remainder']]

# store them in a DataFrame for easy manipulation
imp_df = pd.DataFrame({'feature': list(enc_col_out) + list(norm_col) + list(rem_cols),
                       'importance': list(enc_col_out_imp) + list(norm_col_imp) + list(rem_col_imp)})

# aggregate child importances back onto the parent feature; to keep it simple,
# this assumes the original column names don't contain '_'
imp_df['feature'] = imp_df['feature'].apply(lambda x: x.split('_')[0])
imp_agg = imp_df.groupby(by=['feature']).sum()
print(imp_agg)
print(f'Sum of feature importances: {imp_df["importance"].sum()}')

Output:

(The output is shown as an image in the original post: the aggregated importance per original column from print(imp_agg), followed by the printed sum of the importances.)
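As a follow-up: once imp_agg is available you can sort it and keep only the strongest original features, which answers the "what can I exclude" part directly. The sketch below is a minimal example building on the variables defined above (imp_agg, transformerVectoriser, x_train, y_train); the top_k cut-off is just an arbitrary illustration value. It also shows how the same aggregation could be applied to your LogisticRegression by treating the absolute coefficient values as a rough importance proxy (with all the caveats from the posts linked earlier), using ColumnTransformer.get_feature_names_out(), which on scikit-learn 1.0+ names every output column, including the passthrough ones.

from sklearn.linear_model import LogisticRegression

# rank the original features by their aggregated importance and keep the top ones
top_k = 3  # arbitrary example cut-off
ranked = imp_agg.sort_values(by='importance', ascending=False)
keep_cols = ranked.head(top_k).index.tolist()
print(ranked)
print(f'Top {top_k} original features: {keep_cols}')

# the same aggregation applied to logistic regression coefficients,
# using |coef| as a rough importance proxy (see the caveats linked above)
log_clf = LogisticRegression(max_iter=5000)
log_clf.fit(x_train, y_train)
log_importance = np.abs(log_clf.coef_[0])

# scikit-learn >= 1.0: the fitted ColumnTransformer can name all output columns,
# including the passthrough ones, e.g. 'Vector Cat__species_Adelie'
out_names = transformerVectoriser.get_feature_names_out()
parents = [n.split('__')[-1].split('_')[0] for n in out_names]  # map back to the original column
coef_df = pd.DataFrame({'feature': parents, 'importance': log_importance})
print(coef_df.groupby('feature').sum().sort_values(by='importance', ascending=False))

Note that coefficient magnitudes are only comparable when the inputs are on comparable scales, so this proxy is more defensible for the one-hot encoded and normalized columns than for any untouched passthrough columns.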
