Skip to content
Advertisement

Filter column for multiple values but only select the last one for one criteria

I have a dataframe similar to this one

df = pd.DataFrame({'date':[20220101,20220102,20220103,20220101,20220102,20220101], 'id':[1,1,1,2,2,3], 'value':[11,22,33,44,55,66], 'categorie':['a','a','c','a','c','c']})

       date  id  value categorie
   20220101   1     11         a
   20220102   1     22         a
   20220103   1     33         c
   20220101   2     44         a
   20220102   2     55         c
   20220101   3     66         c

I would now like to slice the df based on multiple values from column ‘categorie’ and am currently using

df = df[df['categorie'].isin(['a','c'])]

In addition to that I would like to be able to only get the [-1] row back for categorie ‘a’

    date  id  value categorie
20220102   1     22         a
20220103   1     33         c
20220101   2     44         a
20220102   2     55         c
20220101   3     66         c

instead of

    date  id  value categorie
20220101   1     11         a
20220102   1     22         a 
20220103   1     33         c
20220101   2     44         a
20220102   2     55         c
20220101   3     66         c

I think the closest would be to think about it as a groupby max value on id and categorie but I am curious if there is a more pythonic way.

Advertisement

Answer

‘a’ and ‘c’ are the only categories in your data, if you just need the latest then drop the duplicates

# drop duplicates and keep the last
df.drop_duplicates(subset=['id','categorie'], keep='last')

or

# select the categories of 'a' and 'c' and drop the duplicates from among them
(df.loc[df['categorie'].isin(['a','c'])]
 .drop_duplicates(subset=['id','categorie'], keep='last'))
    date       id   value   categorie
1   20220102    1      22   a
2   20220103    1      33   c
3   20220101    2      44   a
4   20220102    2      55   c
5   20220101    3      66   c
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement