I implemented a regression model using
formula = "cost ~ C(state) + group_size + C(homeowner) + car_age + C(car_value) + risk_factor + age_oldest + age_youngest + C(married_couple) + c_previous + duration_previous + C(a) + C(b) + C(c) + C(d) + C(e) + C(f) + C(g)"
model_a = smf.ols(formula=formula, data=train).fit()
model_a.summary()
After fitting the regression model, I ran a Bonferroni correction using
smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni', is_sorted=False, returnsorted=False)
And I get the following result:
(array([ True, False, True, True, True, True, True, False, True, True, True, False, True, True, True, True, False, False, False, False, True, False, True, True, True, True, True, True, True, False, True, True, False, True, True, False, True, True, True, True, True, True, True, True, False, True, True, True, False, False, False, False, False, False, True, True, True, True, True, False, True, False, True, False, True, True, True, True]), array([0.00000000e+00, 1.00000000e+00, 1.45352365e-03, 2.14422252e-21, 5.68726115e-13, 4.81466313e-12, 1.22517937e-05, 3.36565323e-01, 4.81396354e-45, 1.51138583e-05, 4.27572151e-04, 1.00000000e+00, 5.91690245e-10, 2.62041907e-16, 3.12129589e-18, 9.88879325e-13, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 6.85853188e-01, 8.94886169e-07, 1.00000000e+00, 3.55801455e-12, 5.35987286e-54, 7.77655333e-03, 5.45090922e-04, 5.15690091e-03, 7.40791788e-04, 1.24797586e-07, 1.00000000e+00, 2.91991310e-04, 1.75502703e-07, 1.00000000e+00, 2.57023089e-26, 2.34824045e-10, 1.00000000e+00, 2.79360586e-87, 5.26115182e-09, 4.94812967e-08, 3.36073545e-07, 5.06333547e-07, 4.44900552e-07, 1.06078148e-05, 1.42866234e-03, 1.00000000e+00, 3.72074539e-10, 1.38294896e-74, 1.39540646e-69, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 2.78538149e-18, 3.74576314e-22, 1.12111501e-19, 1.14698339e-04, 9.34411232e-18, 1.00000000e+00, 4.10430857e-02, 1.00000000e+00, 5.35030644e-23, 1.00000000e+00, 7.61651080e-20, 9.49735915e-56, 7.90523832e-66, 8.15390766e-94]), 0.0007540287301109894, 0.0007352941176470588)
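For reference, the returned tuple holds, in order, the reject flags, the corrected p-values, and the Šidák- and Bonferroni-corrected alpha levels. Below is a minimal sketch of what the Bonferroni method computes, written with plain NumPy rather than statsmodels. The array `raw_pvalues` is a made-up stand-in for `model_a.pvalues`, not values from the model above.

```python
import numpy as np

# Hypothetical raw p-values; stand-ins for model_a.pvalues.
raw_pvalues = np.array([1e-6, 0.0004, 0.02, 0.3])
alpha = 0.05
m = raw_pvalues.size  # number of tests

# Bonferroni-corrected alpha: the per-test significance threshold.
alpha_bonferroni = alpha / m

# Reject the null hypothesis where the raw p-value is at or below alpha / m.
reject = raw_pvalues <= alpha_bonferroni

# Corrected p-values: multiply by the number of tests and cap at 1.
pvals_corrected = np.minimum(raw_pvalues * m, 1.0)

print(reject)
print(pvals_corrected)
print(alpha_bonferroni)
```

With the 68 p-values in the result above, alpha / m is 0.05 / 68 ≈ 0.000735, which matches the last element of the returned tuple.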
I want to use these arrays to remove the features in model_a that are False and create a new model ‘train_simplified’.
I’m using the following manual approach, but I want to know if there’s a more efficient way to do it.
train_simplified = train.drop(train.columns[[0, 1, 2, 4, 10, 16, 25, 27, 28, 30, 36, 38, 41, 44, 47, 55, 61, 62, 63, 64, 65, 66, 67, 68, 69, 75, 78]], axis=1)
Answer
You could use pandas .loc to select only the features in model_a that are flagged True.
.loc[] is primarily label based, but may also be used with a boolean array.
train = pd.DataFrame(np.random.rand(5, 68))

          0         1         2         3  ...        63        64        65        66        67
0  0.637557  0.887213  0.472215  0.119594  ...  0.908266  0.239562  0.144895  0.489453  0.985650
1  0.242055  0.672136  0.761620  0.237638  ...  0.649633  0.849223  0.657613  0.568309  0.093675
2  0.367716  0.265202  0.243990  0.973011  ...  0.465598  0.542645  0.286541  0.590833  0.030500
3  0.037348  0.822601  0.360191  0.127061  ...  0.070569  0.642419  0.026511  0.585776  0.940230
4  0.575474  0.388170  0.643288  0.458253  ...  0.091206  0.494420  0.057559  0.549529  0.441531

[5 rows x 68 columns]
keep_columns = np.array([  # array from smt.multipletests
    True, False, True, True, True, True, True, False, True, True,
    True, False, True, True, True, True, False, False, False, False,
    True, False, True, True, True, True, True, True, True, False,
    True, True, False, True, True, False, True, True, True, True,
    True, True, True, True, False, True, True, True, False, False,
    False, False, False, False, True, True, True, True, True, False,
    True, False, True, False, True, True, True, True])
np.sum(keep_columns)  # 47 (keep 47 columns)
train_simplified = train.loc[:, keep_columns]
Output from train_simplified
          0         2         3         4  ...        62        64        65        66        67
0  0.637557  0.472215  0.119594  0.713245  ...  0.278646  0.239562  0.144895  0.489453  0.985650
1  0.242055  0.761620  0.237638  0.728216  ...  0.746491  0.849223  0.657613  0.568309  0.093675
2  0.367716  0.243990  0.973011  0.393098  ...  0.035942  0.542645  0.286541  0.590833  0.030500
3  0.037348  0.360191  0.127061  0.522243  ...  0.162934  0.642419  0.026511  0.585776  0.940230
4  0.575474  0.643288  0.458253  0.545617  ...  0.789618  0.494420  0.057559  0.549529  0.441531

[5 rows x 47 columns]
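One caveat worth checking: the mask from multipletests has one entry per fitted coefficient (the intercept plus one per dummy level created by C(...)), so it lines up with model_a.pvalues.index and not necessarily with train.columns. The sketch below shows one possible way to see which terms survive and which original variables they belong to. It assumes a small, made-up Series in place of model_a.pvalues; the term names and the helper `base_variable` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical stand-in for model_a.pvalues: a Series indexed by term name.
pvalues = pd.Series(
    [1e-10, 0.5, 1e-5, 0.2, 1e-8],
    index=["Intercept", "C(homeowner)[T.1]", "car_age", "C(a)[T.1]", "C(a)[T.2]"],
)

alpha = 0.05
# Bonferroni decision rule: compare each p-value against alpha / number of tests.
reject = (pvalues <= alpha / len(pvalues)).to_numpy()

# Names of the terms that stay significant after the correction.
kept_terms = pvalues.index[reject].tolist()


def base_variable(term: str) -> str:
    """Map a dummy-level name such as 'C(a)[T.2]' back to the variable 'a'."""
    if term.startswith("C(") and ")" in term:
        return term[2:term.index(")")]
    return term


# Original variables with at least one surviving term (intercept excluded).
kept_variables = sorted({base_variable(t) for t in kept_terms if t != "Intercept"})

print(kept_terms)
print(kept_variables)
```

A list like kept_variables could then be used to build a shorter formula or to select columns by name with train[kept_variables], which avoids relying on positional indices.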