I am trying to calculate the covariance between two columns by group. I am doing the following:
A = pd.DataFrame({'group':['A','A','A','A','B','B','B'], 'value1':[1,2,3,4,5,6,7], 'value2':[8,5,4,3,7,8,8]})
B = A.groupby('group')
B['value1'].cov(B['value2'])
Ideally, I would like to get the covariance between X and Y and not the whole variance-covariance matrix, since I only have two columns.
Thank you,
Answer
You are almost there; you just have not fully understood the groupby object yet. See Pandas-GroupBy for more details.
For your problem, if I understand correctly, you would like to calculate the covariance between two columns within the same group.
The simplest way is to use the groupby.cov function, which gives the pairwise covariance between columns within each group:
A.groupby('group').cov()

                value1    value2
group
A     value1  1.666667 -2.666667
      value2 -2.666667  4.666667
B     value1  1.000000  0.500000
      value2  0.500000  0.333333
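If the full matrix has already been computed, the single value1/value2 entry for each group can also be pulled out of it. The following is only a sketch: it assumes the result of groupby.cov is indexed by (group, column name), as in the output above, and the names cov_matrix and pair_cov are made up for illustration.

```python
import pandas as pd

A = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'value1': [1, 2, 3, 4, 5, 6, 7],
    'value2': [8, 5, 4, 3, 7, 8, 8],
})

# Full covariance matrix per group, indexed by (group, column name).
cov_matrix = A.groupby('group').cov()

# Keep only the 'value1' row of each group's matrix (second index level),
# then read off its 'value2' column: one covariance per group.
pair_cov = cov_matrix.xs('value1', level=1)['value2']
print(pair_cov)
```

This does more work than necessary when there are many columns, since every pairwise covariance is computed before one is selected.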
If you only need cov(grouped_v1, grouped_v2), compute it directly with apply:
grouped = A.groupby('group')
grouped.apply(lambda x: x['value1'].cov(x['value2']))

group
A   -2.666667
B    0.500000
Here, grouped is a groupby object. The grouped.apply function takes a callback function as its argument and calls it once per group, passing that group in as the argument. In this case the callback is a lambda function, and its argument x is a single group (a DataFrame).
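To make the callback mechanism concrete, here is a rough sketch that spells out the same computation as an explicit loop over the groups and compares it with apply. The names result, by_loop and by_apply are illustrative only. Selecting the two value columns before calling apply is an optional extra step, added here on the assumption that some recent pandas versions warn when apply also receives the grouping column.

```python
import pandas as pd

A = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'value1': [1, 2, 3, 4, 5, 6, 7],
    'value2': [8, 5, 4, 3, 7, 8, 8],
})
grouped = A.groupby('group')

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs.
# Each sub-DataFrame is what the apply callback receives as x.
result = {}
for key, frame in grouped:
    result[key] = frame['value1'].cov(frame['value2'])
by_loop = pd.Series(result)

# The same computation with apply, restricted to the two value columns.
by_apply = grouped[['value1', 'value2']].apply(
    lambda x: x['value1'].cov(x['value2'])
)

print(by_loop)
print(by_apply)
```

Both Series hold one covariance per group, which is why Series.cov works inside the callback even though it cannot be called on the groupby object itself.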
Hope this will be helpful for your understanding of groupby.