Please find below my input/desired output :
INPUT
JavaScript
x
11
11
1
Id Status Date
2
0 Id001 off 01/01/2022
3
1 Id001 online 01/06/2022
4
2 Id002 off 05/12/2021
5
3 Id002 off 30/08/2021
6
4 Id003 running 02/03/2021
7
5 Id004 online 21/10/2021
8
6 Id004 running 01/04/2022
9
7 Id005 online 01/02/2022
10
8 Id005 online 02/07/2022
11
OUTPUT (desired)
JavaScript
1
7
1
Id Status Date
2
0 Id001 online 01/06/2022
3
1 Id002 off 05/12/2021
4
2 Id003 running 02/03/2021
5
3 Id004 online 21/10/2021
6
4 Id005 online 01/02/2022
7
The goal is firstly to have one line per Id in the output.
The output will be made based on a this simple statement :
JavaScript
1
8
1
if a group of Id has «online» as a Status then:
2
if «online» occurences > 1 then:
3
Choose the oldest Id that has «online»
4
else:
5
choose the Id that has «online»
6
else:
7
Choose the most recent Id
8
This is what I’ve tried so far :
JavaScript
1
9
1
import pandas as pd
2
import numpy as np
3
4
df = pd.DataFrame({'Id': ['Id001','Id001','Id002','Id002','Id003','Id004','Id004','Id005','Id005'],
5
'Status': ['off','online','off','off','running','online','running','online','online'],
6
'Date': ['01/01/2022','01/06/2022','05/12/2021','30/08/2021','02/03/2021','21/10/2021','01/04/2022','01/02/2022','02/07/2022']})
7
8
df.groupby(['Id', 'Status'], as_index=False).agg({'Date' : [np.min, np.max]}).reset_index()
9
Do you have any suggestion/propositions, please ?
Any help we be so much appreciated !
Advertisement
Answer
Answer is completely edited – first filter only online
rows, sorting by Date
and remove duplicates by first Id
:
JavaScript
1
7
1
df1 = df[df['Status'].eq('online')].sort_values('Date').drop_duplicates('Id')
2
print (df1)
3
Id Status Date
4
5 Id004 online 2021-10-21
5
7 Id005 online 2022-02-01
6
1 Id001 online 2022-06-01
7
Then filter not matched Id
and sorting descending:
JavaScript
1
6
1
df2 =df[~df['Id'].isin(df1['Id'])].sort_values('Date',ascending=False).drop_duplicates('Id')
2
print (df2)
3
Id Status Date
4
2 Id002 off 2021-12-05
5
4 Id003 running 2021-03-02
6
Last join both Dataframes:
JavaScript
1
9
1
df = pd.concat([df1, df2]).sort_values('Id', ignore_index=True)
2
print (df)
3
Id Status Date
4
0 Id001 online 2022-06-01
5
1 Id002 off 2021-12-05
6
2 Id003 running 2021-03-02
7
3 Id004 online 2021-10-21
8
4 Id005 online 2022-02-01
9
Original solution should be changed:
JavaScript
1
23
23
1
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
2
3
df1 = (df.assign(s = df['Status'].eq('online')).groupby(['Id','s'])
4
.agg(Date_min=('Date','idxmin'), Date_max=('Date','idxmax')))
5
6
df1 = df1[~df1.index.get_level_values(0).duplicated(keep='last')].reset_index()
7
print (df1)
8
Id s Date_min Date_max
9
0 Id001 True 1 1
10
1 Id002 False 3 2
11
2 Id003 False 4 4
12
3 Id004 True 5 5
13
4 Id005 True 7 8
14
15
df = df.loc[np.where(df1['s'], df1['Date_min'], df1['Date_max'])]
16
print (df)
17
Id Status Date
18
1 Id001 online 2022-06-01
19
2 Id002 off 2021-12-05
20
4 Id003 running 2021-03-02
21
5 Id004 online 2021-10-21
22
7 Id005 online 2022-02-01
23