I have tried to tackle this for quite some time, but haven’t been able to get a pythonic way around it by using the built-in groupby
and transform
methods from pandas.
The goal is to group the data by columns ex_date
and id
, then within the groups identified, standardize the column called ref_value_1
against the value found in the row where df['calc_date'] == df['ex_date']
Here’s a sample input:
df = pd.DataFrame({'calc_date': ['1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021', '1/1/2021', '2/1/2021', '3/1/2021'], 'ex_date': ['2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021', '2/1/2021'], 'id': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], 'ref_value_1': [1.5, 3.0, 4.5, 5.0, 10.0, 15.0, 15.0, 40.0, 60.0, 75.0, 100.0, 120.0], 'bins': [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]})
which looks like:
calc_date | ex_date | id | ref_value_1 | bins |
---|---|---|---|---|
1/1/2021 | 2/1/2021 | 1 | 1.5 | 1 |
2/1/2021 | 2/1/2021 | 1 | 3 | 1 |
3/1/2021 | 2/1/2021 | 1 | 4.5 | 1 |
1/1/2021 | 2/1/2021 | 2 | 5 | 1 |
2/1/2021 | 2/1/2021 | 2 | 10 | 1 |
3/1/2021 | 2/1/2021 | 2 | 15 | 1 |
1/1/2021 | 2/1/2021 | 3 | 15 | 2 |
2/1/2021 | 2/1/2021 | 3 | 40 | 2 |
3/1/2021 | 2/1/2021 | 3 | 60 | 2 |
1/1/2021 | 2/1/2021 | 4 | 75 | 3 |
2/1/2021 | 2/1/2021 | 4 | 100 | 3 |
3/1/2021 | 2/1/2021 | 4 | 120 | 3 |
And expected output:
calc_date | ex_date | id | ref_value_1 | bins | standardized_val |
---|---|---|---|---|---|
1/1/2021 | 2/1/2021 | 1 | 1.5 | 1 | 0.5 |
2/1/2021 | 2/1/2021 | 1 | 3 | 1 | 1 |
3/1/2021 | 2/1/2021 | 1 | 4.5 | 1 | 1.5 |
1/1/2021 | 2/1/2021 | 2 | 5 | 1 | 0.5 |
2/1/2021 | 2/1/2021 | 2 | 10 | 1 | 1 |
3/1/2021 | 2/1/2021 | 2 | 15 | 1 | 1.5 |
1/1/2021 | 2/1/2021 | 3 | 15 | 2 | 0.375 |
2/1/2021 | 2/1/2021 | 3 | 40 | 2 | 1 |
3/1/2021 | 2/1/2021 | 3 | 60 | 2 | 1.5 |
1/1/2021 | 2/1/2021 | 4 | 75 | 3 | 0.75 |
2/1/2021 | 2/1/2021 | 4 | 100 | 3 | 1 |
3/1/2021 | 2/1/2021 | 4 | 120 | 3 | 1.2 |
Advertisement
Answer
You can mask the non matching values and fill per group using groupby
+transform
to get the reference. Then simply divide your data with the reference.
ref = df['ref_value_1'].where(df['calc_date'].eq(df['ex_date'])).groupby(df['id']).transform('first') df['standardized_val'] = df['ref_value_1'].div(ref)
Output:
calc_date ex_date id ref_value_1 bins standardized_val 0 1/1/2021 2/1/2021 1 1.5 1 0.500 1 2/1/2021 2/1/2021 1 3.0 1 1.000 2 3/1/2021 2/1/2021 1 4.5 1 1.500 3 1/1/2021 2/1/2021 2 5.0 1 0.500 4 2/1/2021 2/1/2021 2 10.0 1 1.000 5 3/1/2021 2/1/2021 2 15.0 1 1.500 6 1/1/2021 2/1/2021 3 15.0 2 0.375 7 2/1/2021 2/1/2021 3 40.0 2 1.000 8 3/1/2021 2/1/2021 3 60.0 2 1.500 9 1/1/2021 2/1/2021 4 75.0 3 0.750 10 2/1/2021 2/1/2021 4 100.0 3 1.000 11 3/1/2021 2/1/2021 4 120.0 3 1.200