I have a dataframe that looks like this:
dict = {'companyId': {0: 198236, 1: 198236, 2: 900814, 3: 153421, 4: 153421, 5: 337815}, 'region': {0: 'Europe', 1: 'Europe', 2: 'Asia-Pacific', 3: 'North America', 4: 'North America', 5:'Africa'}, 'value': {0: 560, 1: 771, 2: 964, 3: 217, 4: 433, 5: 680}, 'type': {0: 'actual', 1: 'forecast', 2: 'actual', 3: 'forecast', 4: 'actual', 5: 'forecast'}} df = pd.DataFrame(dict) companyId region value type 0 198236 Europe 560 actual 1 198236 Europe 771 forecast 2 900814 Asia-Pacific 964 actual 3 153421 North America 217 forecast 4 153421 North America 433 actual 5 337815 Africa 680 forecast
I can’t seem to figure out a way to filter out certain rows based on the following condition:
If there are two entries under the same companyId
, as is the case for 198236
and 153421
, I want to keep only the entry where type
is actual
.
If there is only one entry under a companyId
, as is the case for 337815
and 900814
, I want to keep that row, irrespective of the value in column type
.
Does anyone have an idea how to go about this?
Advertisement
Answer
You can use a groupby and transform to create boolean indexing:
#Your condition i.e. retain the rows which are not duplicated and those # which are duplicated but only type==actual. Lets express that as a lambda. to_filter = lambda x: (len(x) == 1) | ((len(x) > 1) & (x == 'actual')) #then create a boolean indexing mask as below m = df.groupby('companyId')['type'].transform(to_filter) #then filter your df with that m: df[m]: companyId region value type 0 198236 Europe 560 actual 2 900814 Asia-Pacific 964 actual 4 153421 North America 433 actual 5 337815 Africa 680 forecast