I have a dataframe with messages sent and received. I want to calculate the time it took for someone to reply to the message.
JavaScript
x
12
12
1
import numpy as np
2
import pandas as pd
3
from datetime import datetime
4
5
df = pd.DataFrame({'sent':[78,18,94,55,68,57,78,8],
6
'received':[18,78,35,14,57,68,57,17],
7
'time':['2017-01-01T12','2017-01-01T13',
8
'2017-01-02T12','2017-02-01T13',
9
'2017-01-01T14','2017-01-01T15',
10
'2017-01-01T16','2017-01-01T17']})
11
df['time'] = pd.to_datetime(pd.Series(df['time']))
12
The method I thought of using was identifying pairs, so if sent =A and received =B, then there should be another entry with sent=B and received =A.
JavaScript
1
2
1
df["pairs"] = df.apply(lambda x: not df[(df["sent"] == x["received"]) & (df["received"] == x["sent"]) & (df.index != x.name)].empty, axis=1)
2
Then once I identify the pairs, I can calculate the time it took to respond with
JavaScript
1
10
10
1
sent_time = datetime.strptime('2017-01-01 12:00:00', fmt)
2
recieved_time = datetime.strptime('2017-01-01 13:00:00', fmt)
3
4
if sent_time > recieved_time:
5
td = sent_time - recieved_time
6
else:
7
td = recieved_time - sent_time
8
9
time = int(round(td.total_seconds() / 60))
10
I feel like I can do these separately, but I can’t seem to put them together.
EDIT
So as for the output, I guess I would need a separate dataframe that would list the senders and the time it took for someone to reply to the e-mail.
so with the example,
the message was sent by 78 and it took 60 min to respond. then a message was sent by 68 and it took 60 min to respond
Sender | time_to_respond |
---|---|
78 | 60 |
68 | 60 |
Advertisement
Answer
JavaScript
1
16
16
1
#Sort row values to create unique group
2
df[['s','t']] = np.sort(df[['sent','received']], axis=1)
3
4
#Subset duplicated groups
5
6
s = df[df.duplicated(subset=['s','t'], keep=False)]
7
8
#Compute time difference between duplicated groups, drop duplicated rows and unwanted columns
9
s=s.assign(time_to_respond=s.groupby(['s','t'])['time'].transform(lambda x:x.diff().bfill().dt.total_seconds()/60)).drop_duplicates(subset=['s','t'])[['sent','time_to_respond']]
10
11
12
13
sent time_to_respond
14
0 78 60.0
15
4 68 60.0
16