
How to remove features from a regression model using Bonferroni correction results?

I implemented a regression model using

formula= "cost ~ C(state) + group_size + C(homeowner) + car_age + C(car_value) + 
risk_factor + age_oldest + age_youngest + C(married_couple) + c_previous + 
duration_previous + C(a) + C(b) + C(c) + C(d) + C(e) + C(f) + C(g)"

model_a = smf.ols(formula = formula, data = train).fit()
model_a.summary()

After fitting the regression model, I ran a Bonferroni correction using

import statsmodels.stats.multitest as smt

smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni',
                  is_sorted=False, returnsorted=False)

And I get the following result:

(array([ True, False,  True,  True,  True,  True,  True, False,  True,
     True,  True, False,  True,  True,  True,  True, False, False,
    False, False,  True, False,  True,  True,  True,  True,  True,
     True,  True, False,  True,  True, False,  True,  True, False,
     True,  True,  True,  True,  True,  True,  True,  True, False,
     True,  True,  True, False, False, False, False, False, False,
     True,  True,  True,  True,  True, False,  True, False,  True,
    False,  True,  True,  True,  True]),
 array([0.00000000e+00, 1.00000000e+00, 1.45352365e-03, 2.14422252e-21,
    5.68726115e-13, 4.81466313e-12, 1.22517937e-05, 3.36565323e-01,
    4.81396354e-45, 1.51138583e-05, 4.27572151e-04, 1.00000000e+00,
    5.91690245e-10, 2.62041907e-16, 3.12129589e-18, 9.88879325e-13,
    1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 6.85853188e-01,
    8.94886169e-07, 1.00000000e+00, 3.55801455e-12, 5.35987286e-54,
    7.77655333e-03, 5.45090922e-04, 5.15690091e-03, 7.40791788e-04,
    1.24797586e-07, 1.00000000e+00, 2.91991310e-04, 1.75502703e-07,
    1.00000000e+00, 2.57023089e-26, 2.34824045e-10, 1.00000000e+00,
    2.79360586e-87, 5.26115182e-09, 4.94812967e-08, 3.36073545e-07,
    5.06333547e-07, 4.44900552e-07, 1.06078148e-05, 1.42866234e-03,
    1.00000000e+00, 3.72074539e-10, 1.38294896e-74, 1.39540646e-69,
    1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
    1.00000000e+00, 1.00000000e+00, 2.78538149e-18, 3.74576314e-22,
    1.12111501e-19, 1.14698339e-04, 9.34411232e-18, 1.00000000e+00,
    4.10430857e-02, 1.00000000e+00, 5.35030644e-23, 1.00000000e+00,
    7.61651080e-20, 9.49735915e-56, 7.90523832e-66, 8.15390766e-94]),
 0.0007540287301109894,
 0.0007352941176470588)
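
The four values returned by multipletests are easier to work with if they are unpacked into named variables. A minimal sketch (the variable names are illustrative), which also uses the fact that model_a.pvalues is a pandas Series indexed by parameter name:

reject, pvals_corrected, alpha_sidak, alpha_bonf = smt.multipletests(
    model_a.pvalues, alpha=0.05, method='bonferroni')

# reject is a boolean array aligned with model_a.pvalues.index,
# so the terms that survive the correction can be listed by name:
print(model_a.pvalues.index[reject])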

I want to use these arrays to remove the features in model_a that are False and create a new, simplified training set, train_simplified.

I'm using the following manual approach, but I want to know if there's a more efficient way to do it.

train_simplified = train.drop(train.columns[[0, 1, 2, 4, 10, 16, 25, 27, 28, 30, 36, 38, 
41, 44, 47, 55, 61, 62, 63, 64, 65, 66, 67, 68, 69, 75, 78]], axis=1)
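
As an aside, the hard-coded index list can also be derived from the boolean array itself instead of being typed by hand. A minimal sketch, assuming the reject array unpacked above and that, as in the drop call, the positions in that array line up with the columns of train:

import numpy as np

drop_idx = np.where(~reject)[0]   # positions of the False entries
train_simplified = train.drop(train.columns[drop_idx], axis=1)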


Answer

You could use pandas .loc to select only the features (columns) for which the boolean array from the Bonferroni correction is True.

.loc[] is primarily label based, but may also be used with a boolean array.

import numpy as np
import pandas as pd

train = pd.DataFrame(np.random.rand(5, 68))
print(train)

          0         1         2         3  ...        63        64        65        66        67
0  0.637557  0.887213  0.472215  0.119594  ...  0.908266  0.239562  0.144895  0.489453  0.985650
1  0.242055  0.672136  0.761620  0.237638  ...  0.649633  0.849223  0.657613  0.568309  0.093675
2  0.367716  0.265202  0.243990  0.973011  ...  0.465598  0.542645  0.286541  0.590833  0.030500
3  0.037348  0.822601  0.360191  0.127061  ...  0.070569  0.642419  0.026511  0.585776  0.940230
4  0.575474  0.388170  0.643288  0.458253  ...  0.091206  0.494420  0.057559  0.549529  0.441531

[5 rows x 68 columns]
keep_columns = np.array([ # array from smt.multipletests
    True, False,  True,  True,  True,  True,  True, False,  True,
    True,  True, False,  True,  True,  True,  True, False, False,
    False, False,  True, False,  True,  True,  True,  True,  True,
    True,  True, False,  True,  True, False,  True,  True, False,
    True,  True,  True,  True,  True,  True,  True,  True, False,
    True,  True,  True, False, False, False, False, False, False,
    True,  True,  True,  True,  True, False,  True, False,  True,
    False,  True,  True,  True,  True])
np.sum(keep_columns) # 47 (keep 47 columns)

train_simplified = train.loc[:, keep_columns]

Output from train_simplified

          0         2         3         4  ...        62        64        65        66        67
0  0.637557  0.472215  0.119594  0.713245  ...  0.278646  0.239562  0.144895  0.489453  0.985650
1  0.242055  0.761620  0.237638  0.728216  ...  0.746491  0.849223  0.657613  0.568309  0.093675
2  0.367716  0.243990  0.973011  0.393098  ...  0.035942  0.542645  0.286541  0.590833  0.030500
3  0.037348  0.360191  0.127061  0.522243  ...  0.162934  0.642419  0.026511  0.585776  0.940230
4  0.575474  0.643288  0.458253  0.545617  ...  0.789618  0.494420  0.057559  0.549529  0.441531

[5 rows x 47 columns]
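
The keep_columns array does not have to be copied by hand either; it can be taken straight from the multipletests result so the selection stays in sync with the model. A minimal sketch, assuming (as in the example above) that the entries of the boolean array line up one-to-one with the columns of train:

reject, *_ = smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni')
train_simplified = train.loc[:, np.asarray(reject)]

Note that with a formula containing C(...) terms, model_a.pvalues indexes fitted coefficients (the intercept and individual dummy levels) rather than raw columns, so in that case the significant parameter names from model_a.pvalues.index[reject] are the more useful starting point.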