
Remove rows in a group by until the last row meets some condition

I have the following df:

    id type
0    1    A
1    1    B
2    1    A
3    2    A
4    2    B
5    3    A
6    3    B
7    3    A
8    3    B
9    3    A
10   3    A

We can assume that this data is already sorted. For every id, I need to remove rows so that the following conditions hold:

1. the first entry for every id is type A
2. the last entry for every id is type B
3. that last B is the last B that appears for the id (the data is already sorted)
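
To make the three conditions concrete, they could be checked with something like the sketch below. The helper name `check_result` and the `original`/`expected` frames are made up for illustration; `expected` is simply the set of rows I want to keep.

```python
import pandas as pd

def check_result(original: pd.DataFrame, result: pd.DataFrame) -> bool:
    """Hypothetical helper: True if `result` satisfies conditions 1-3."""
    by_id = result.groupby("id")["type"]
    # Conditions 1 and 2: each id starts with 'A' and ends with 'B'.
    first_ok = by_id.first().eq("A").all()
    last_ok = by_id.last().eq("B").all()
    # Condition 3: the final kept row of each id is that id's last 'B'
    # in the original (already sorted) data.
    last_b_index = original[original["type"] == "B"].groupby("id").tail(1).index
    kept_last_index = result.groupby("id").tail(1).index
    same_last_b = kept_last_index.isin(last_b_index).all()
    return bool(first_ok and last_ok and same_last_b)

# Same data as in the question, written compactly.
original = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3],
    "type": list("ABAABABABAA"),
})
# The rows I expect to remain.
expected = original.loc[[0, 1, 3, 4, 5, 8]]

print(check_result(original, original))  # the unmodified data fails the check
print(check_result(original, expected))  # the desired result passes
```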

I’ve accomplished 1. with the following:

df = df.groupby('id').filter(lambda x: x['type'].iloc[0] == 'A')

This removes an id entirely if its first type isn’t A.

However, for 2. and 3., I don’t want to remove the id if the last type isn’t B. Instead, I just want to remove everything between the first A and the last B, as well as anything after the last B.
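
For reference, `groupby(...).filter` keeps a group only when the function returns True for it, and it always keeps or drops the whole group, which is why it does not fit 2. and 3. A small sketch with a made-up `demo` frame (not my real data), in which id 2 starts with B:

```python
import pandas as pd

# Made-up data for illustration: id 1 starts with 'A', id 2 starts with 'B'.
demo = pd.DataFrame({
    "id":   [1, 1, 2, 2],
    "type": ["A", "B", "B", "A"],
})

# The lambda is evaluated once per id; a False result drops every row
# of that id, so id 2 disappears in full.
kept = demo.groupby("id").filter(lambda g: g["type"].iloc[0] == "A")
print(kept)
```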

Resulting df:

   id type
0   1    A
1   1    B
3   2    A
4   2    B
5   3    A
8   3    B

Example code:

import pandas as pd

d = {'id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 3, 8: 3, 9: 3, 10: 3},
 'type': {0: 'A',
  1: 'B',
  2: 'A',
  3: 'A',
  4: 'B',
  5: 'A',
  6: 'B',
  7: 'A',
  8: 'B',
  9: 'A',
  10: 'A'}}

df = pd.DataFrame.from_dict(d)


Answer

It seems you could use drop_duplicates with a different rule depending on the type, keeping the first A and the last B for each id:

out = pd.concat([df.query("type=='A'").drop_duplicates(subset=['id','type'], keep='first'), 
                 df.query("type=='B'").drop_duplicates(subset=['id','type'], keep='last')]).sort_index()


Output:

   id type
0   1    A
1   1    B
3   2    A
4   2    B
5   3    A
8   3    B
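
The same selection can probably also be written as a single boolean mask with `DataFrame.duplicated`, which avoids the `concat` and `sort_index` steps. This is only a sketch under the question's assumptions (the data is sorted and every id starts with A and has at least one B after it); the names `first_a` and `last_b` are just illustrative.

```python
import pandas as pd

# Same data as in the question, written compactly.
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3],
    "type": list("ABAABABABAA"),
})

# duplicated(keep="first") is False only on the first (id, type) occurrence,
# duplicated(keep="last") is False only on the last one.
first_a = df["type"].eq("A") & ~df.duplicated(subset=["id", "type"], keep="first")
last_b = df["type"].eq("B") & ~df.duplicated(subset=["id", "type"], keep="last")

# Keep the first A and the last B of every id; the original row order is preserved.
out = df[first_a | last_b]
print(out)
```

Because the mask is applied to `df` directly, the original index and row order are kept without re-sorting.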
