Skip to content
Advertisement

How to extract ids and rows only if a column has all of the designated values

I have the following dataframe

id  date     other variables..  
A   2019Q4      
A   2020Q4        
A   2021Q4 
B   2018Q4
B   2019Q4
B   2020Q4
B   2021Q4
C   2020Q4
C   2021Q4
D   2021Q4
E   2018Q4
E   2019Q4
E   2020Q4
E   2021Q4
.       .      

I want to group by id and keep those ids if it contains all of the designated values (i.e. 2019Q4, 2020Q4, 2021Q4) then extract rows that correspond to those values. isin() won’t work because it won’t drop C and D.

desired output

A   2019Q4      
A   2020Q4        
A   2021Q4 
B   2019Q4
B   2020Q4
B   2021Q4
E   2019Q4
E   2020Q4
E   2021Q4
.       .      

Advertisement

Answer

You can use set operations to filter the id and isin for the date:

target = {'2019Q4', '2020Q4', '2021Q4'}

id_ok = df.groupby('id')['date'].agg(lambda x: target.issubset(x))

df2 = df[df['date'].isin(target) & df['id'].map(id_ok)]

or, using transform:

target = {'2019Q4', '2020Q4', '2021Q4'}

mask = df.groupby('id')['date'].transform(lambda x: target.issubset(x))

df2 = df[df['date'].isin(target) & mask]

output:

   id    date  other
0   A  2019Q4    NaN
1   A  2020Q4    NaN
2   A  2021Q4    NaN
4   B  2019Q4    NaN
5   B  2020Q4    NaN
6   B  2021Q4    NaN
11  E  2019Q4    NaN
12  E  2020Q4    NaN
13  E  2021Q4    NaN

id_ok:

id
A     True
B     True
C    False
D    False
E     True
Name: date, dtype: bool
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement