I have a dataframe like as shown below
person_id source_system r_diff 1 O NULL 1 O 0 1 O 9 1 O NULL 2 O 574 2 I 20 2 O 135 2 O 0 2 I 21 2 O 2 2 O 0 2 O 0 2 I 12
I would like to select rows based on the criteria below
criteria 1 – pick all rows where source-system = I
criteria 2 – pick prior row (n-1) only when source-system of (n-1)th is O
and diff is zero.
This criteria 2 should be applied only when nth row has source-system = I
. If source-system of (n-1)th is I
, we don’t have to do anything because criteria 1 will handle that.
We have to apply both the criteria each person
I tried the below based on SO suggestion but not sure how to make it work
m1 = df['visit_source_value'] == 'I' m2 = df['diff'] <= 0 m3 = df.groupby('person_id')['diff'].shift(-1) <= 0 df = df1[m1 | m2 | m3]
I expect my output to be like as shown below
2 I 20 2 O 0 2 I 21 2 O 0 2 I 12
Advertisement
Answer
I prefer not one line solution, because hard readable if more complicated code, so better is use:
m1 = df['visit_source_value'] == 'I' m2 = df['r_diff'] <= 0 m3 = df.groupby('person_id')['visit_source_value'].shift(-1) == 'I' df = df[m1 | (m2 & m3)] print (df) person_id visit_source_value r_diff 5 2 I 20.0 7 2 O 0.0 8 2 I 21.0 11 2 O 0.0 12 2 I 12.0