Please find below my input/desired output :
INPUT
Id Status Date 0 Id001 off 01/01/2022 1 Id001 online 01/06/2022 2 Id002 off 05/12/2021 3 Id002 off 30/08/2021 4 Id003 running 02/03/2021 5 Id004 online 21/10/2021 6 Id004 running 01/04/2022 7 Id005 online 01/02/2022 8 Id005 online 02/07/2022
OUTPUT (desired)
Id Status Date 0 Id001 online 01/06/2022 1 Id002 off 05/12/2021 2 Id003 running 02/03/2021 3 Id004 online 21/10/2021 4 Id005 online 01/02/2022
The goal is firstly to have one line per Id in the output.
The output will be made based on a this simple statement :
if a group of Id has «online» as a Status then: if «online» occurences > 1 then: Choose the oldest Id that has «online» else: choose the Id that has «online» else: Choose the most recent Id
This is what I’ve tried so far :
import pandas as pd import numpy as np df = pd.DataFrame({'Id': ['Id001','Id001','Id002','Id002','Id003','Id004','Id004','Id005','Id005'], 'Status': ['off','online','off','off','running','online','running','online','online'], 'Date': ['01/01/2022','01/06/2022','05/12/2021','30/08/2021','02/03/2021','21/10/2021','01/04/2022','01/02/2022','02/07/2022']}) df.groupby(['Id', 'Status'], as_index=False).agg({'Date' : [np.min, np.max]}).reset_index()
Do you have any suggestion/propositions, please ?
Any help we be so much appreciated !
Advertisement
Answer
Answer is completely edited – first filter only online
rows, sorting by Date
and remove duplicates by first Id
:
df1 = df[df['Status'].eq('online')].sort_values('Date').drop_duplicates('Id') print (df1) Id Status Date 5 Id004 online 2021-10-21 7 Id005 online 2022-02-01 1 Id001 online 2022-06-01
Then filter not matched Id
and sorting descending:
df2 =df[~df['Id'].isin(df1['Id'])].sort_values('Date',ascending=False).drop_duplicates('Id') print (df2) Id Status Date 2 Id002 off 2021-12-05 4 Id003 running 2021-03-02
Last join both Dataframes:
df = pd.concat([df1, df2]).sort_values('Id', ignore_index=True) print (df) Id Status Date 0 Id001 online 2022-06-01 1 Id002 off 2021-12-05 2 Id003 running 2021-03-02 3 Id004 online 2021-10-21 4 Id005 online 2022-02-01
Original solution should be changed:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True) df1 = (df.assign(s = df['Status'].eq('online')).groupby(['Id','s']) .agg(Date_min=('Date','idxmin'), Date_max=('Date','idxmax'))) df1 = df1[~df1.index.get_level_values(0).duplicated(keep='last')].reset_index() print (df1) Id s Date_min Date_max 0 Id001 True 1 1 1 Id002 False 3 2 2 Id003 False 4 4 3 Id004 True 5 5 4 Id005 True 7 8 df = df.loc[np.where(df1['s'], df1['Date_min'], df1['Date_max'])] print (df) Id Status Date 1 Id001 online 2022-06-01 2 Id002 off 2021-12-05 4 Id003 running 2021-03-02 5 Id004 online 2021-10-21 7 Id005 online 2022-02-01