In the MWE below, I show my attempt to line-plot trips (from my df
aggregated per month):
import pandas as pd import matplotlib.pyplot as plt import seaborn as sns d = {'user': [62, 62, 154, 154, 154, 73, 73, 73, 73], 'start': ['2008-06-20 04:21:40','2008-06-20 05:40:31','2007-05-01 10:10:00', '2007-05-01 10:36:00','2007-05-01 15:30:00','2008-05-19 16:25:54', '2008-05-21 02:21:37','2008-05-22 01:30:09','2008-05-29 01:55:59'], 'end': ['2008-06-20 05:33:46','2008-06-20 05:53:11', '2007-05-01 10:36:00', '2007-05-01 11:00:00','2007-05-01 15:55:00','2008-05-19 16:29:22', '2008-05-21 02:25:04','2008-05-22 01:33:51','2008-05-29 01:59:25'], 'mode': ['bus','walk','bus','walk','taxi','walk','walk','walk','walk'],} df = pd.DataFrame(d) df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime) df['Date'] = df['start'].dt.strftime('%Y-%m') df.sort_values(['Date' ], ascending=True, inplace=True) df.reset_index(drop=True, inplace=True) # aggregate monthly trips df = df.groupby(['Date']).size().reset_index().rename( columns={0:'Trips'}) df Date Trips 0 2007-05 3 1 2008-05 4 2 2008-06 2 # the I can plot trips like so: sns.lineplot(data=df, x = df['Date'], y='Trips')
I realised in my df
, some trips contains jump (maybe due to data log), so they should be merged into single trip before aggregation. In the given df
example above (before grouping).
df user start end mode Date 0 154 2007-05-01 10:10:00 2007-05-01 10:36:00 bus 2007-05 1 154 2007-05-01 10:36:00 2007-05-01 11:00:00 walk 2007-05 2 154 2007-05-01 15:30:00 2007-05-01 15:55:00 taxi 2007-05 3 73 2008-05-19 16:25:54 2008-05-19 16:29:22 walk 2008-05 4 73 2008-05-21 02:21:37 2008-05-21 02:25:04 walk 2008-05 5 73 2008-05-22 01:30:09 2008-05-22 01:33:51 walk 2008-05 6 73 2008-05-29 01:55:59 2008-05-29 01:59:25 walk 2008-05 7 62 2008-06-20 04:21:40 2008-06-20 05:33:46 bus 2008-06 8 62 2008-06-20 05:40:31 2008-06-20 05:53:11 walk 2008-06
User 154
does undertake 2-trips, not 3. First started at 10:10:00
on bus then changes travel mode to at 10:36:00
to walk
. Also user 62
does only a trip, started at 04:21:40 - 05:33:46
on a bus, then stopped, waited for about 6-minutes then continued to at about 05:40:31
.
Question
I want to merge such trips into a single trip so that if the time interval between the end of a user’s trip to beginning of the next is less than 30 minutes, that should be considered a single trip. Then I can plot,
In the above MWE, the aggregated trips are “over-estimated”, the expected final df
should be:
Date Trips 0 2007-05 2 1 2008-05 4 2 2008-06 1
Advertisement
Answer
You can use a custom function:
offset = pd.DateOffset(minutes=30) merge_trip = lambda x: x['start'].ge(x['end'].shift() + offset).cumsum().add(1) df['Trips'] = df.groupby('user').apply(merge_trip).droplevel('user') out = df.groupby('Date', as_index=False)['Trips'].max()
Output:
>>> out Date Trips 0 2007-05 2 1 2008-05 4 2 2008-06 1 >>> df user start end mode Date Trips 0 154 2007-05-01 10:10:00 2007-05-01 10:36:00 bus 2007-05 1 1 154 2007-05-01 10:36:00 2007-05-01 11:00:00 walk 2007-05 1 2 154 2007-05-01 15:30:00 2007-05-01 15:55:00 taxi 2007-05 2 3 73 2008-05-19 16:25:54 2008-05-19 16:29:22 walk 2008-05 1 4 73 2008-05-21 02:21:37 2008-05-21 02:25:04 walk 2008-05 2 5 73 2008-05-22 01:30:09 2008-05-22 01:33:51 walk 2008-05 3 6 73 2008-05-29 01:55:59 2008-05-29 01:59:25 walk 2008-05 4 7 62 2008-06-20 04:21:40 2008-06-20 05:33:46 bus 2008-06 1 8 62 2008-06-20 05:40:31 2008-06-20 05:53:11 walk 2008-06 1