I have the following dataframe with a tricky problem:
df = pd.DataFrame({'weight': [[200, 190, 188, 180, 170], [181, 175, 172, 165, 150]], 'days_since_gym': [[0, 87, 174, 205, 279], [43, 171, 241, 273, 300]]}) print(df) weight days_since_gym 0 [200, 190, 188, 180, 170] [0, 91, 174, 205, 279] 1 [181, 175, 172, 165, 150] [93, 171, 241, 273, 300]
I have to make 4 columns (0-90 days, 91-180 days, 181-270 days, 271-360 days) based on the following conditions:
1) If there are multiple weights in a specific time duration, get the maximum weight in that time duration column. 2) If no weight is present in that time duration, the value for that duration would be 0.
Desired output:
weight days_since_gym 0-90 91-180 181-270 271-360 0 [200, 190, 188, 180, 170] [0, 87, 174, 205, 279] 200 188 180 170 1 [181, 175, 172, 165, 150] [93, 171, 241, 273, 300] 0 181 172 165
What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!
Advertisement
Answer
You can write a custom function that takes in a list of weights, start day, end day — then apply this function rowwise to create each of your new columns using the pandas apply function. If you haven’t used apply before, the basic structure is something like: df.apply(lambda x: custom_function(...), axis=1)
. The argument axis=1 ensures your custom function is applied rowwise.
Since the names of your new columns are also the start and end days, you can loop through these start and end day ranges.
Also I noticed that in your question there seem to be some mismatches between the DataFrame you created and your desired output, so I am taking the the desired output to be the DataFrame.
import numpy as np import pandas as pd df = pd.DataFrame({'weight': [[200, 190, 188, 180, 170], [181, 175, 172, 165, 150]], 'days_since_gym': [[0, 87, 174, 205, 279], [93, 171, 241, 273, 300]]}) def return_max_weight(weights, days, start_day, end_day): ## get the indices where weights are between start and end days days = np.array(days) weights_idx = list(np.where((days >= start_day) & (days <= end_day))[0]) if len(weights_idx) == 0: return 0 else: weight_between_start_and_end = [weights[idx] for idx in weights_idx] return max(weight_between_start_and_end) for start_day, end_day in zip([0, 91, 181, 271],[90, 180, 270, 360]): col_name = f"{start_day}-{end_day}" df[col_name] = df[['weight','days_since_gym']].apply( lambda x: return_max_weight(x['weight'], x['days_since_gym'], start_day, end_day), axis=1 )
Output:
>>> df weight days_since_gym 0-90 91-180 181-270 271-360 0 [200, 190, 188, 180, 170] [0, 87, 174, 205, 279] 200 188 180 170 1 [181, 175, 172, 165, 150] [93, 171, 241, 273, 300] 0 181 172 165