I have a DataFrame that looks like this:
df = pd.DataFrame.from_dict({'id': [1, 2, 1, 1, 2, 3], 'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45], 'time': ['10:00:00', '12:00:00', '10:00:05', '10:00:07', '12:00:03', '15:00:00']} )
What I want to get is:
out = pd.DataFrame.from_dict({'id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3], 'reward': [0.1, 0, 0, 0, 0, 0.15, 0.0, 0.05, 0.25, 0.0, 0.0, 0.4, 0.45], 'time': ['10:00:00', '10:00:01', '10:00:02', '10:00:03', '10:00:04', '10:00:05', '10:00:06', '10:00:07', '12:00:00', '12:00:01', '12:00:02', '12:00:03', '15:00:00']} )
In short, for each id, add the missing time rows with a reward of 0. How do I do this? I wrote something with a loop, but it's going to be prohibitively slow for my use case, which has several million rows.
Answer
Here's one way using groupby.apply, where we use date_range to add the missing times. Then merge it back to df and fill in the missing values of the other columns:
df['time'] = pd.to_datetime(df['time'])
out = df.merge(
    df.groupby('id')['time']
      .apply(lambda x: pd.date_range(x.iat[0], x.iat[-1], freq='S'))
      .explode(),
    how='right',
)
out['id'] = out['id'].ffill().astype(int)
out['reward'] = out['reward'].fillna(0)
Output:
    id  reward                time
0    1    0.10 2022-04-23 10:00:00
1    1    0.00 2022-04-23 10:00:01
2    1    0.00 2022-04-23 10:00:02
3    1    0.00 2022-04-23 10:00:03
4    1    0.00 2022-04-23 10:00:04
5    1    0.15 2022-04-23 10:00:05
6    1    0.00 2022-04-23 10:00:06
7    1    0.05 2022-04-23 10:00:07
8    2    0.25 2022-04-23 12:00:00
9    2    0.00 2022-04-23 12:00:01
10   2    0.00 2022-04-23 12:00:02
11   2    0.40 2022-04-23 12:00:03
12   3    0.45 2022-04-23 15:00:00
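The snippet above merges on time alone and forward-fills id, which works here because the ids' time ranges do not overlap. If they might overlap in your real data, a variation on the same idea (date_range per id, then a merge) is to key the merge on both id and time. The following is only a sketch under a few assumptions, not the answer's original code: fill_missing_seconds is a hypothetical helper name, the times are parsed with an explicit format so the date part is a fixed placeholder, min/max are used instead of the first and last row so the input does not need to be sorted, and the lowercase 's' frequency alias is used because recent pandas versions deprecate the uppercase 'S'.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 1, 1, 2, 3],
    'reward': [0.1, 0.25, 0.15, 0.05, 0.4, 0.45],
    'time': ['10:00:00', '12:00:00', '10:00:05', '10:00:07', '12:00:03', '15:00:00'],
})

# Assumption: an explicit format, so the date part is a fixed placeholder
# (1900-01-01) rather than the day the script happens to run.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')


def fill_missing_seconds(frame):
    """Hypothetical helper: one row per second between each id's first and last time."""
    # First and last timestamp per id; min/max do not depend on row order.
    spans = frame.groupby('id')['time'].agg(['min', 'max'])

    # One small frame per id holding every second in that id's span.
    pieces = [
        pd.DataFrame({'id': key, 'time': pd.date_range(start, end, freq='s')})
        for key, start, end in spans.itertuples()
    ]
    grid = pd.concat(pieces, ignore_index=True)

    # Left-merge on both keys so each id only picks up its own rewards,
    # then treat seconds with no recorded reward as 0.
    out = grid.merge(frame, on=['id', 'time'], how='left')
    out['reward'] = out['reward'].fillna(0)
    return out[['id', 'reward', 'time']]


out = fill_missing_seconds(df)
print(out)
```

The Python-level loop here runs once per id, not once per row, so its cost grows with the number of distinct ids. I have not benchmarked it against the groupby.apply version, so treat any speed difference as something to measure on your own data.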