I implemented a regression model using:

import statsmodels.formula.api as smf

# Adjacent string literals are joined, so the formula can span several lines.
formula = (
    "cost ~ C(state) + group_size + C(homeowner) + car_age + C(car_value) + "
    "risk_factor + age_oldest + age_youngest + C(married_couple) + c_previous + "
    "duration_previous + C(a) + C(b) + C(c) + C(d) + C(e) + C(f) + C(g)"
)

model_a = smf.ols(formula=formula, data=train).fit()
model_a.summary()
After fitting the model, I ran a Bonferroni correction using:

import statsmodels.stats.multitest as smt

smt.multipletests(model_a.pvalues, alpha=0.05, method='bonferroni', is_sorted=False,
                  returnsorted=False)
And I get the following result:
(array([ True, False,  True,  True,  True,  True,  True, False,  True,
         True,  True, False,  True,  True,  True,  True, False, False,
        False, False,  True, False,  True,  True,  True,  True,  True,
         True,  True, False,  True,  True, False,  True,  True, False,
         True,  True,  True,  True,  True,  True,  True,  True, False,
         True,  True,  True, False, False, False, False, False, False,
         True,  True,  True,  True,  True, False,  True, False,  True,
        False,  True,  True,  True,  True]),
 array([0.00000000e+00, 1.00000000e+00, 1.45352365e-03, 2.14422252e-21,
        5.68726115e-13, 4.81466313e-12, 1.22517937e-05, 3.36565323e-01,
        4.81396354e-45, 1.51138583e-05, 4.27572151e-04, 1.00000000e+00,
        5.91690245e-10, 2.62041907e-16, 3.12129589e-18, 9.88879325e-13,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 6.85853188e-01,
        8.94886169e-07, 1.00000000e+00, 3.55801455e-12, 5.35987286e-54,
        7.77655333e-03, 5.45090922e-04, 5.15690091e-03, 7.40791788e-04,
        1.24797586e-07, 1.00000000e+00, 2.91991310e-04, 1.75502703e-07,
        1.00000000e+00, 2.57023089e-26, 2.34824045e-10, 1.00000000e+00,
        2.79360586e-87, 5.26115182e-09, 4.94812967e-08, 3.36073545e-07,
        5.06333547e-07, 4.44900552e-07, 1.06078148e-05, 1.42866234e-03,
        1.00000000e+00, 3.72074539e-10, 1.38294896e-74, 1.39540646e-69,
        1.00000000e+00, 1.00000000e+00, 1.00000000e+00, 1.00000000e+00,
        1.00000000e+00, 1.00000000e+00, 2.78538149e-18, 3.74576314e-22,
        1.12111501e-19, 1.14698339e-04, 9.34411232e-18, 1.00000000e+00,
        4.10430857e-02, 1.00000000e+00, 5.35030644e-23, 1.00000000e+00,
        7.61651080e-20, 9.49735915e-56, 7.90523832e-66, 8.15390766e-94]),
 0.0007540287301109894,
 0.0007352941176470588)
I want to use the boolean array to remove the features in model_a that are False and create a new dataset, 'train_simplified'.
I'm currently doing this manually, but I want to know if there's a more efficient way to do it.

train_simplified = train.drop(train.columns[[0, 1, 2, 4, 10, 16, 25, 27, 28, 30, 36, 38,
                                             41, 44, 47, 55, 61, 62, 63, 64, 65, 66, 67, 68, 69, 75, 78]], axis=1)

Answer
You could use pandas .loc to select only the features in model_a that are True.
.loc[] is primarily label based, but may also be used with a boolean array.
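
As a minimal sketch of the boolean-array form, the mask can also be computed rather than pasted in by hand. The p-values, column names, and the `demo` frame below are made up for illustration; the comparison is the plain Bonferroni rule (p <= alpha / number of tests). In statsmodels, the first element of the tuple returned by `multipletests` is this same kind of boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical p-values, one per column of a small demo frame.
pvalues = pd.Series([1e-6, 0.20, 0.004, 0.03], index=["a", "b", "c", "d"])
alpha = 0.05

# Bonferroni rule: reject when p <= alpha / number_of_tests.
reject = (pvalues <= alpha / len(pvalues)).to_numpy()

demo = pd.DataFrame(np.arange(12).reshape(3, 4), columns=pvalues.index)

# A boolean mask passed to .loc must have exactly one entry per column.
assert len(reject) == demo.shape[1]
demo_simplified = demo.loc[:, reject]
print(list(demo_simplified.columns))
```

The length check matters: `.loc` only accepts a boolean array whose length matches the axis it is applied to.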
For example, with a random 68-column frame standing in for train:

import numpy as np
import pandas as pd

train = pd.DataFrame(np.random.rand(5, 68))

          0         1         2         3  ...        63        64        65        66        67
0  0.637557  0.887213  0.472215  0.119594  ...  0.908266  0.239562  0.144895  0.489453  0.985650
1  0.242055  0.672136  0.761620  0.237638  ...  0.649633  0.849223  0.657613  0.568309  0.093675
2  0.367716  0.265202  0.243990  0.973011  ...  0.465598  0.542645  0.286541  0.590833  0.030500
3  0.037348  0.822601  0.360191  0.127061  ...  0.070569  0.642419  0.026511  0.585776  0.940230
4  0.575474  0.388170  0.643288  0.458253  ...  0.091206  0.494420  0.057559  0.549529  0.441531

[5 rows x 68 columns]

keep_columns = np.array([  # array from smt.multipletests
    True, False, True, True, True, True, True, False, True,
    True, True, False, True, True, True, True, False, False,
    False, False, True, False, True, True, True, True, True,
    True, True, False, True, True, False, True, True, False,
    True, True, True, True, True, True, True, True, False,
    True, True, True, False, False, False, False, False, False,
    True, True, True, True, True, False, True, False, True,
    False, True, True, True, True])
np.sum(keep_columns)  # 47 (keep 47 columns)

train_simplified = train.loc[:, keep_columns]

Output of train_simplified:

          0         2         3         4  ...        62        64        65        66        67
0  0.637557  0.472215  0.119594  0.713245  ...  0.278646  0.239562  0.144895  0.489453  0.985650
1  0.242055  0.761620  0.237638  0.728216  ...  0.746491  0.849223  0.657613  0.568309  0.093675
2  0.367716  0.243990  0.973011  0.393098  ...  0.035942  0.542645  0.286541  0.590833  0.030500
3  0.037348  0.360191  0.127061  0.522243  ...  0.162934  0.642419  0.026511  0.585776  0.940230
4  0.575474  0.643288  0.458253  0.545617  ...  0.789618  0.494420  0.057559  0.549529  0.441531

[5 rows x 47 columns]
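
One caveat worth checking before applying the mask to train: `model_a.pvalues` has one entry per fitted parameter (the intercept plus one per dummy level of each `C(...)` term), so its length may not match the number of columns in train. A possible workaround, sketched below, is to group the decisions by formula term and rebuild the formula. The parameter names follow the `C(state)[T.NY]` pattern statsmodels uses for treatment-coded categoricals; the values, the `term_of` helper, and the "keep a term if any of its levels is significant" rule are assumptions for illustration, not part of the original answer.

```python
import re
import pandas as pd

# Hypothetical Bonferroni decisions, indexed the way statsmodels names parameters.
reject = pd.Series(
    [True, True, False, False, True, False],
    index=["Intercept", "C(state)[T.FL]", "C(state)[T.NY]",
           "C(homeowner)[T.1]", "car_age", "group_size"],
)

def term_of(param_name: str) -> str:
    """Strip a trailing '[...]' level suffix, e.g. 'C(state)[T.NY]' -> 'C(state)'."""
    return re.sub(r"\[[^\]]*\]$", "", param_name)

# Assumed rule: keep a term if at least one of its parameters is significant.
keep_by_term = reject.drop("Intercept").groupby(term_of, sort=False).any()
kept_terms = list(keep_by_term[keep_by_term].index)

# Rebuild a reduced formula from the surviving terms.
formula_simplified = "cost ~ " + " + ".join(kept_terms)
print(formula_simplified)
```

The reduced formula can then be passed to `smf.ols` with the original train data, which avoids dropping columns by position altogether.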