I have a dataframe like this:
import numpy as np
import pandas as pd

features_df = pd.DataFrame({
    'group': np.array([0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1]),
    'variable': ['var1'] * 8 + ['var2'] * 8,
    'value': np.array([5.582443, 7.855871, 9.843828, 8.331354,
                       1.593624, 2.151113, 1.403245, 3.495429,
                       5.361531, 6.739888, 4.120531, 9.931341,
                       1.121117, 0.730207, 0.931132, 3.001303])
})

features_df

    group variable     value
0       0     var1  5.582443
1       0     var1  7.855871
2       0     var1  9.843828
3       0     var1  8.331354
4       1     var1  1.593624
5       1     var1  2.151113
6       1     var1  1.403245
7       1     var1  3.495429
8       0     var2  5.361531
9       0     var2  6.739888
10      0     var2  4.120531
11      0     var2  9.931341
12      1     var2  1.121117
13      1     var2  0.730207
14      1     var2  0.931132
15      1     var2  3.001303
I want to calculate the t-test p-value for each variable, comparing group 0 against group 1. I can calculate each p-value manually like this:
from scipy.stats import ttest_ind

# Pull out the values for each (variable, group) combination
var1_0 = features_df.query('variable == "var1" & group == 0').value.values
var1_1 = features_df.query('variable == "var1" & group == 1').value.values
var2_0 = features_df.query('variable == "var2" & group == 0').value.values
var2_1 = features_df.query('variable == "var2" & group == 1').value.values

# ttest_ind returns (statistic, pvalue); index 1 is the p-value
var1_pvalue = ttest_ind(var1_0, var1_1)[1]
var1_pvalue
# 0.0012163722443546759

var2_pvalue = ttest_ind(var2_0, var2_1)[1]
var2_pvalue
# 0.00946879342461542
So the question is: how can I automatically get a result dataframe like the one below for all variables?
  variables  ttest_pvalue
0      var1      0.001216
1      var2      0.009469
Answer
There are several ways to do this; the core idea is to use groupby on the variable column and run the t-test within each group of rows.
Here is one example:
from scipy.stats import ttest_ind

(features_df
   .set_index('group')   # index by group so g[0] / g[1] select rows by group label
   .groupby('variable', as_index=False)['value']
   .apply(lambda g: ttest_ind(g[0], g[1])[1])   # [1] is the p-value
)
output:
  variable     value
0     var1  0.001216
1     var2  0.009469
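
If you want the exact column names from the question (variables, ttest_pvalue), here is a minimal sketch of another way to do it: loop over the groupby object and build the result rows yourself. The helper name group_pvalue is made up for this illustration, and it assumes the group column only ever contains 0 and 1, as in the sample data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Sample data copied from the question
features_df = pd.DataFrame({
    'group': np.array([0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1]),
    'variable': ['var1'] * 8 + ['var2'] * 8,
    'value': np.array([5.582443, 7.855871, 9.843828, 8.331354,
                       1.593624, 2.151113, 1.403245, 3.495429,
                       5.361531, 6.739888, 4.120531, 9.931341,
                       1.121117, 0.730207, 0.931132, 3.001303])
})

def group_pvalue(sub):
    """Hypothetical helper: t-test p-value between group 0 and group 1.

    Assumes `sub` holds the rows of a single variable and that the
    'group' column contains only the labels 0 and 1.
    """
    a = sub.loc[sub['group'] == 0, 'value']
    b = sub.loc[sub['group'] == 1, 'value']
    return ttest_ind(a, b).pvalue

# Iterating a groupby yields (key, sub-dataframe) pairs, sorted by key
rows = [
    {'variables': name, 'ttest_pvalue': group_pvalue(sub)}
    for name, sub in features_df.groupby('variable')
]
result = pd.DataFrame(rows)
print(result)
```

On the sample data this should give the same p-values as the manual calculation in the question (about 0.001216 for var1 and 0.009469 for var2). The explicit loop is a bit more verbose than apply, but it makes it easy to add more columns later, for example the test statistic.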