I have the following dataframe
id date other variables.. A 2019Q4 A 2020Q4 A 2021Q4 B 2018Q4 B 2019Q4 B 2020Q4 B 2021Q4 C 2020Q4 C 2021Q4 D 2021Q4 E 2018Q4 E 2019Q4 E 2020Q4 E 2021Q4 . .
I want to group by id and keep those ids if it contains all of the designated values (i.e. 2019Q4, 2020Q4, 2021Q4) then extract rows that correspond to those values. isin() won’t work because it won’t drop C and D.
desired output
A 2019Q4 A 2020Q4 A 2021Q4 B 2019Q4 B 2020Q4 B 2021Q4 E 2019Q4 E 2020Q4 E 2021Q4 . .
Advertisement
Answer
You can use set operations to filter the id and isin for the date:
target = {'2019Q4', '2020Q4', '2021Q4'}
id_ok = df.groupby('id')['date'].agg(lambda x: target.issubset(x))
df2 = df[df['date'].isin(target) & df['id'].map(id_ok)]
or, using transform:
target = {'2019Q4', '2020Q4', '2021Q4'}
mask = df.groupby('id')['date'].transform(lambda x: target.issubset(x))
df2 = df[df['date'].isin(target) & mask]
output:
id date other 0 A 2019Q4 NaN 1 A 2020Q4 NaN 2 A 2021Q4 NaN 4 B 2019Q4 NaN 5 B 2020Q4 NaN 6 B 2021Q4 NaN 11 E 2019Q4 NaN 12 E 2020Q4 NaN 13 E 2021Q4 NaN
id_ok:
id A True B True C False D False E True Name: date, dtype: bool