Skip to content
Advertisement

Pandas – Groupby and Standardize

I have tried to tackle this for quite some time, but haven’t been able to get a pythonic way around it by using the built-in groupby and transform methods from pandas.

The goal is to group the data by columns ex_date and id, then within the groups identified, standardize the column called ref_value_1 against the value found in the row where df['calc_date'] == df['ex_date']

Here’s a sample input:

df = pd.DataFrame({'calc_date': ['1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021'], 'ex_date': ['2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021'], 'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], 'ref_value_1': [1.5, 3.0, 4.5, 5.0, 10.0, 15.0, 15.0, 40.0, 60.0, 75.0, 100.0, 120.0], 'bins': [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]})

which looks like:

calc_date ex_date id ref_value_1 bins
1/1/2021 2/1/2021 1 1.5 1
2/1/2021 2/1/2021 1 3 1
3/1/2021 2/1/2021 1 4.5 1
1/1/2021 2/1/2021 2 5 1
2/1/2021 2/1/2021 2 10 1
3/1/2021 2/1/2021 2 15 1
1/1/2021 2/1/2021 3 15 2
2/1/2021 2/1/2021 3 40 2
3/1/2021 2/1/2021 3 60 2
1/1/2021 2/1/2021 4 75 3
2/1/2021 2/1/2021 4 100 3
3/1/2021 2/1/2021 4 120 3

And expected output:

calc_date ex_date id ref_value_1 bins standardized_val
1/1/2021 2/1/2021 1 1.5 1 0.5
2/1/2021 2/1/2021 1 3 1 1
3/1/2021 2/1/2021 1 4.5 1 1.5
1/1/2021 2/1/2021 2 5 1 0.5
2/1/2021 2/1/2021 2 10 1 1
3/1/2021 2/1/2021 2 15 1 1.5
1/1/2021 2/1/2021 3 15 2 0.375
2/1/2021 2/1/2021 3 40 2 1
3/1/2021 2/1/2021 3 60 2 1.5
1/1/2021 2/1/2021 4 75 3 0.75
2/1/2021 2/1/2021 4 100 3 1
3/1/2021 2/1/2021 4 120 3 1.2

Advertisement

Answer

You can mask the non matching values and fill per group using groupby+transform to get the reference. Then simply divide your data with the reference.

ref = df['ref_value_1'].where(df['calc_date'].eq(df['ex_date'])).groupby(df['id']).transform('first')

df['standardized_val'] = df['ref_value_1'].div(ref)

Output:

   calc_date   ex_date  id  ref_value_1  bins  standardized_val
0   1/1/2021  2/1/2021   1          1.5     1             0.500
1   2/1/2021  2/1/2021   1          3.0     1             1.000
2   3/1/2021  2/1/2021   1          4.5     1             1.500
3   1/1/2021  2/1/2021   2          5.0     1             0.500
4   2/1/2021  2/1/2021   2         10.0     1             1.000
5   3/1/2021  2/1/2021   2         15.0     1             1.500
6   1/1/2021  2/1/2021   3         15.0     2             0.375
7   2/1/2021  2/1/2021   3         40.0     2             1.000
8   3/1/2021  2/1/2021   3         60.0     2             1.500
9   1/1/2021  2/1/2021   4         75.0     3             0.750
10  2/1/2021  2/1/2021   4        100.0     3             1.000
11  3/1/2021  2/1/2021   4        120.0     3             1.200
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement