I am new in python and pandas and also in stackoverflow so I apologize for any mistakes I make in advance.
I have this dataframe
df = pd.DataFrame( data=[['Donald Trump', 'German', '2021-9-23 14:28:00','2021-9-23 14:58:00', 1800 ], ['Donald Trump', 'German', '2021-9-23 14:58:01','2021-9-23 15:00:05', 124 ], ['Donald Trump', 'German', '2021-9-24 10:05:00','2021-9-24 10:15:30', 630 ], ['Monica Lewinsky', 'German', '2021-9-24 10:05:00','2021-9-24 10:05:30', 30 ]], columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration'] ) df['interval_start'] = pd.to_datetime(df['interval_start']) df['interval_end'] = pd.to_datetime(df['interval_end'])
output is
specialist language interval_start interval_end status_duration 0 Donald Trump German 2021-09-23 14:28:00 2021-09-23 14:58:00 1800 1 Donald Trump German 2021-09-23 14:58:01 2021-09-23 15:00:05 125 2 Donald Trump German 2021-09-24 10:05:00 2021-09-24 10:15:30 630 3 Monica Lewinsky German 2021-09-24 10:05:00 2021-09-24 10:15:30 630
and my desired outcome is to have something like in below
specialist language interval status_duration 0 Donald Trump German 2021-9-23 14:15:00 120 1 Donald Trump German 2021-9-23 14:30:00 900 2 Donald Trump German 2021-9-23 14:45:00 899 3 Donald Trump German 2021-9-23 15:00:00 5 4 Donald Trump German 2021-9-24 10:00:00 600 5 Donald Trump German 2021-9-24 10:15:00 30 6 Monica Lewinsky German 2021-9-24 10:15:00 30
I have this code from another topic link
ref = (df.groupby(["specialist", "Language", pd.Grouper(key="Interval Start", freq="D")], as_index=False) .agg(status_duration=("status_duration", lambda d: [*([900]*(d.iat[0]//900)), d.iat[0]%900]), Interval=("Interval Start", "first")) .explode("status_duration")) ref["Interval"] = ref["Interval"].dt.floor("15min")+pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit="sec")
But it does not take “interval_start” into consideration, I need to check first if the status_duration will remain on same 15 mins interval or not. Hope somebody can help as it is a very advanced problem for me and i am working on it for more than 10 days.
Advertisement
Answer
After learning a bit more, I came up with another (better) solution using groupby()
and explode()
. I add this as a second answer since my first one, while maybe a bit complicated, still works and I am also referencing a part of it in this answer.
I first added a few new columns to split up the status_duration
into the first slice and the rest and replaced the original value of status_duration
with an according 2-element list:
df['first'] = ((df['interval_start']+ pd.Timedelta('1sec')).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int) df['rest'] = df['status_duration'] - df['first'] df['status_duration'] = df[['first','rest']].values.tolist() df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x),0])
This gives you the following prepared dataframe:
specialist language interval_start ... status_duration first rest 0 Donald Trump German 2021-09-23 14:28:00 ... [120, 1680] 120 1680 1 Donald Trump German 2021-09-23 14:58:01 ... [119, 5] 119 5 2 Donald Trump German 2021-09-24 10:05:00 ... [600, 30] 600 30 3 Monica Lewinsky German 2021-09-24 10:05:00 ... [30, 0] 600 -570
On this, you can now perform a groupby()
and explode()
similar to the code in your question. Afterwards you round the intervals and group again to merge the intervals that have multiple entries now because of the explode()
. To clean up, I dropped the rows with duration 0
and reset the index:
ref = df.groupby(['specialist', 'language', pd.Grouper(key='interval_start', freq='T')], as_index=False) .agg(status_duration=('status_duration', lambda d: [d.iat[0][0],*([900]*(d.iat[0][1]//900)), d.iat[0][1]%900]),interval_start=('interval_start', 'first')) .explode('status_duration') ref['interval_start'] = ref['interval_start'].dt.floor('15min')+pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit='sec') ref = ref.groupby(['specialist', 'language', 'interval_start']).sum() ref = ref[ref.status_duration != 0].reset_index()
This gives you your desired output:
specialist language interval_start status_duration 0 Donald Trump German 2021-09-23 14:15:00 120 1 Donald Trump German 2021-09-23 14:30:00 900 2 Donald Trump German 2021-09-23 14:45:00 899 3 Donald Trump German 2021-09-23 15:00:00 5 4 Donald Trump German 2021-09-24 10:00:00 600 5 Donald Trump German 2021-09-24 10:15:00 30 6 Monica Lewinsky German 2021-09-24 10:00:00 30
Note: The problem I described in the other answer, that the final grouping step could result in a status_duration
> 900 should not be possible with real data, since a specialist shouldn’t be able to start a second interval before the first one ends. So this is a case you do not need to handle after all.