I have a dataframe like this:
    import numpy as np
    import pandas as pd

    features_df = pd.DataFrame({
        'group': np.array([0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1]),
        'variable': ['var1'] * 8 + ['var2'] * 8,
        'value': np.array([5.582443, 7.855871, 9.843828, 8.331354, 1.593624, 2.151113, 1.403245, 3.495429,
                           5.361531, 6.739888, 4.120531, 9.931341, 1.121117, 0.730207, 0.931132, 3.001303])
    })

    features_df

        group  variable     value
    0       0      var1  5.582443
    1       0      var1  7.855871
    2       0      var1  9.843828
    3       0      var1  8.331354
    4       1      var1  1.593624
    5       1      var1  2.151113
    6       1      var1  1.403245
    7       1      var1  3.495429
    8       0      var2  5.361531
    9       0      var2  6.739888
    10      0      var2  4.120531
    11      0      var2  9.931341
    12      1      var2  1.121117
    13      1      var2  0.730207
    14      1      var2  0.931132
    15      1      var2  3.001303

I want to calculate the t-test p-value between the two groups for each variable. I can calculate each p-value manually like this:
    from scipy.stats import ttest_ind

    # Values of each variable, split by group
    var1_0 = features_df.query('variable == "var1" & group == 0').value.values
    var1_1 = features_df.query('variable == "var1" & group == 1').value.values

    var2_0 = features_df.query('variable == "var2" & group == 0').value.values
    var2_1 = features_df.query('variable == "var2" & group == 1').value.values

    # ttest_ind returns (statistic, pvalue); index 1 is the p-value
    var1_pvalue = ttest_ind(var1_0, var1_1)[1]
    var1_pvalue
    #0.0012163722443546759

    var2_pvalue = ttest_ind(var2_0, var2_1)[1]
    var2_pvalue
    #0.00946879342461542

So the question is: how can I automatically get a result dataframe like the one shown below for all variables?
      variables  ttest_pvalue
    0      var1      0.001216
    1      var2      0.009469

Answer
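There are several ways to do this; the core idea is to use groupby on the variable column. Setting group as the index first means that, within each variable, g[0] and g[1] select the values belonging to group 0 and group 1.
Here is one example: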
    from scipy.stats import ttest_ind

    (features_df
       .set_index('group')
       .groupby('variable', as_index=False)['value']
       .apply(lambda g: ttest_ind(g[0], g[1])[1])
    )

output:
      variable     value
    0     var1  0.001216
    1     var2  0.009469
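
If you want the exact column names from the question (variables and ttest_pvalue), a more explicit variant of the same groupby idea is sketched below. The helper name ttest_pvalue_by_variable is made up for illustration, and the sketch assumes there are exactly two groups labelled 0 and 1, as in the sample data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Same sample data as in the question
features_df = pd.DataFrame({
    'group': np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]),
    'variable': ['var1'] * 8 + ['var2'] * 8,
    'value': np.array([5.582443, 7.855871, 9.843828, 8.331354,
                       1.593624, 2.151113, 1.403245, 3.495429,
                       5.361531, 6.739888, 4.120531, 9.931341,
                       1.121117, 0.730207, 0.931132, 3.001303]),
})


def ttest_pvalue_by_variable(df):
    """Hypothetical helper: one row per variable with the t-test p-value
    between group 0 and group 1 (assumes exactly these two group labels)."""
    rows = []
    for name, sub in df.groupby('variable'):
        # Split the values of this variable by group
        a = sub.loc[sub['group'] == 0, 'value']
        b = sub.loc[sub['group'] == 1, 'value']
        rows.append({'variables': name,
                     'ttest_pvalue': ttest_ind(a, b).pvalue})
    return pd.DataFrame(rows)


result = ttest_pvalue_by_variable(features_df)
print(result)
```

The p-values should match the ones computed manually in the question (about 0.001216 for var1 and 0.009469 for var2). The loop is more verbose than the chained version, but it makes the group split explicit and lets you name the output columns directly.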