I have a pandas DataFrame of the following format:
Input:
X [OTHER_COLUMNS] version branch v0 overall 2475.0 -1 . v1 overall 2475.0 -1 . A 1712.5 1 . B 257.5 2 . C 392.5 2 D 112.5 3 v2 overall 2475.0 -1 A 2341.5 1 B 95.0 2 C 38.5 2 v3 overall 2475.0 -1 A 2000.0 1 B 475.0 2 v4 overall 2475.0 -1 A 2341.5 1 B 133.5 1
where (version, branch)
is a MultiIndex.
PROBLEM DESCRIPTION:
I want to groupby
version
and set the values in the column X
with branch
overall
to the sum of the values in the column X
for the remaining branches (having the same version
), weighted by the values in the column N
. For groups (i.e. version
s) which have only one branch (named overall
), I want X
to be set to 1
.
EXAMPLE:
For version
v2
, the value in the cell with column X
and branch
overall
should be
(2341.5 * 1 + 95.0 * 2 + 38.5 * 2) / 2475.0 = 1.05393939394
,
and in pseudo-code:
(A_N * A_X + B_N * B_X) / overall_N
.
Note: For a given version
, the value in column N
and branch
overall
will always be equal to the sum of the values in column N
for the other branch
‘es.
IDEA AND QUESTION:
I think I have to do the following:
df.loc[pd.IndexSlice[:, 'overall'], 'X'] = df.groupby('version').apply(...)
where df
is the DataFrame and where ...
is to be replaced by a custom function.
I am looking for help in constructing such a function.
Expected output:
N X version branch v0 overall 2475.0 1 v1 overall 2475.0 1.35353535354 A 1712.5 1 B 257.5 2 C 392.5 2 D 112.5 3 v2 overall 2475.0 1.05393939394 A 2341.5 1 B 95.0 2 C 38.5 2 v3 overall 2475.0 1.19191919192 A 2000.0 1 B 475.0 2 v4 overall 2475.0 1 A 2341.5 1 B 133.5 1
Explaination of expected output:
(1712.5 * 1 + 257.5 * 2 + 392.5 * 2 + 112.5 * 3) / 2475.0 = 1.35353535354 (2341.5 * 1 + 95.0 * 2 + 38.5 * 2) / 2475.0 = 1.05393939394 (2000.0 * 1 + 475.0 * 2) / 2475.0 = 1.19191919192 (2341.5 * 1 + 133.5 * 1) / 2475.0 = 1
CODE TO CREATE DATAFRAME:
import numpy as np import pandas as pd df = pd.DataFrame( data=np.array( [ [2475.0, 2475.0, 1712.5, 257.5, 392.5, 112.5, 2475.0, 2341.5, 95.0, 38.5, 2475.0, 2000.0, 475.0, 2475.0, 2341.5, 133.5], [-1, -1, 1, 2, 2, 3, -1, 1, 2, 2, -1, 1, 2, -1, 1, 1] ] ).T, index=pd.MultiIndex.from_tuples( tuples=[ ('v0', 'overall'), ('v1', 'overall'), ('v1', 'A'), ('v1', 'B'), ('v1', 'C'), ('v1', 'D'), ('v2', 'overall'), ('v2', 'A'), ('v2', 'B'), ('v2', 'C'), ('v3', 'overall'), ('v3', 'A'), ('v3', 'B'), ('v4', 'overall'), ('v4', 'A'), ('v4', 'B'), ], names=['version', 'branch'], ), columns=['N', 'X'], )
print (df) N X version branch v0 overall 2475.0 -1.0 v1 overall 2475.0 -1.0 A 1712.5 1.0 B 257.5 2.0 C 392.5 2.0 D 112.5 3.0 v2 overall 2475.0 -1.0 A 2341.5 1.0 B 95.0 2.0 C 38.5 2.0 v3 overall 2475.0 -1.0 A 2000.0 1.0 B 475.0 2.0 v4 overall 2475.0 -1.0 A 2341.5 1.0 B 133.5 1.0
Advertisement
Answer
Use:
#select overalls only overall = df['N'].xs('overall', level=1) #select all rows without overalls df1 = df.drop('overall', level=1) #multiple and aggregate sum, divide overalls s = df1['N'].mul(df1['X']).groupby(level=0).sum().div(overall) #create MultiIndex and assign back df.loc[pd.IndexSlice[:, 'overall'], 'X'] = pd.concat({'overall':s}).swaplevel(0,1)
print (df) N X version branch v1 overall 2475.0 1.353535 A 1712.5 1.000000 B 257.5 2.000000 C 392.5 2.000000 D 112.5 3.000000 v2 overall 2475.0 1.053939 A 2341.5 1.000000 B 95.0 2.000000 C 38.5 2.000000 v3 overall 2475.0 1.191919 A 2000.0 1.000000 B 475.0 2.000000 v4 overall 2475.0 1.000000 A 2341.5 1.000000 B 133.5 1.000000