In the MWE below, I show my attempt to line-plot trips (from my df aggregated per month):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
d = {'user': [62, 62, 154, 154, 154, 73, 73, 73, 73],
     'start': ['2008-06-20 04:21:40','2008-06-20 05:40:31','2007-05-01 10:10:00',
         '2007-05-01 10:36:00','2007-05-01 15:30:00','2008-05-19 16:25:54',
          '2008-05-21 02:21:37','2008-05-22 01:30:09','2008-05-29 01:55:59'],
    'end': ['2008-06-20 05:33:46','2008-06-20 05:53:11', '2007-05-01 10:36:00',
         '2007-05-01 11:00:00','2007-05-01 15:55:00','2008-05-19 16:29:22',
         '2008-05-21 02:25:04','2008-05-22 01:33:51','2008-05-29 01:59:25'],
 'mode': ['bus','walk','bus','walk','taxi','walk','walk','walk','walk'],}
df = pd.DataFrame(d)
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)
df['Date'] = df['start'].dt.strftime('%Y-%m')
df.sort_values(['Date' ], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# aggregate monthly trips
df = df.groupby(['Date']).size().reset_index().rename(
    columns={0:'Trips'})
df
    Date   Trips
0   2007-05 3
1   2008-05 4
2   2008-06 2
# the I can plot trips like so:
sns.lineplot(data=df, x = df['Date'], y='Trips')
I realised in my df, some trips contains jump (maybe due to data log), so they should be merged into single trip before aggregation. In the given df example above (before grouping).
df user start end mode Date 0 154 2007-05-01 10:10:00 2007-05-01 10:36:00 bus 2007-05 1 154 2007-05-01 10:36:00 2007-05-01 11:00:00 walk 2007-05 2 154 2007-05-01 15:30:00 2007-05-01 15:55:00 taxi 2007-05 3 73 2008-05-19 16:25:54 2008-05-19 16:29:22 walk 2008-05 4 73 2008-05-21 02:21:37 2008-05-21 02:25:04 walk 2008-05 5 73 2008-05-22 01:30:09 2008-05-22 01:33:51 walk 2008-05 6 73 2008-05-29 01:55:59 2008-05-29 01:59:25 walk 2008-05 7 62 2008-06-20 04:21:40 2008-06-20 05:33:46 bus 2008-06 8 62 2008-06-20 05:40:31 2008-06-20 05:53:11 walk 2008-06
User 154 does undertake 2-trips, not 3. First started at 10:10:00 on bus then changes travel mode to at 10:36:00 to walk. Also user 62 does only a trip, started at 04:21:40 - 05:33:46  on a bus, then stopped, waited for about 6-minutes then continued to at about 05:40:31.
Question
I want to merge such trips into a single trip so that if the time interval between the end of a user’s trip to beginning of the next is less than 30 minutes, that should be considered a single trip. Then I can plot,
In the above MWE, the aggregated trips are “over-estimated”, the expected final df should be:
Date Trips 0 2007-05 2 1 2008-05 4 2 2008-06 1
Advertisement
Answer
You can use a custom function:
offset = pd.DateOffset(minutes=30)
merge_trip = lambda x: x['start'].ge(x['end'].shift() + offset).cumsum().add(1)
df['Trips'] = df.groupby('user').apply(merge_trip).droplevel('user')
out = df.groupby('Date', as_index=False)['Trips'].max()
Output:
>>> out
      Date  Trips
0  2007-05      2
1  2008-05      4
2  2008-06      1
>>> df
   user               start                 end  mode     Date  Trips
0   154 2007-05-01 10:10:00 2007-05-01 10:36:00   bus  2007-05      1
1   154 2007-05-01 10:36:00 2007-05-01 11:00:00  walk  2007-05      1
2   154 2007-05-01 15:30:00 2007-05-01 15:55:00  taxi  2007-05      2
3    73 2008-05-19 16:25:54 2008-05-19 16:29:22  walk  2008-05      1
4    73 2008-05-21 02:21:37 2008-05-21 02:25:04  walk  2008-05      2
5    73 2008-05-22 01:30:09 2008-05-22 01:33:51  walk  2008-05      3
6    73 2008-05-29 01:55:59 2008-05-29 01:59:25  walk  2008-05      4
7    62 2008-06-20 04:21:40 2008-06-20 05:33:46   bus  2008-06      1
8    62 2008-06-20 05:40:31 2008-06-20 05:53:11  walk  2008-06      1
