Suppose I have this dataframe
d = {'fruit': ['apple', 'apple', 'apple', 'orange'], 'names': ['bob', 'bob', 'bob', 'joe']} df = pd.DataFrame(data=d)
Now I want to go through each column and filter out the low percentiles keeping only values that are contained in the 90th percentile.
Thus since apple
and bob
are each within their associated 90th percentiles I would have this dataframe
d = {'fruit': ['apple', 'apple', 'apple'], 'names': ['bob', 'bob', 'bob']} df = pd.DataFrame(data=d)
How do I achieve this?
Advertisement
Answer
Hope this helps:
- Calculate 90th percentile and keep them in a dictionary:
perc_dic = df.apply(lambda x : np.percentile(x.value_counts().tolist(), 90) , axis= 0).to_dict() perc_dic {'fruit': 2.8, 'names': 2.8}
2.Calculate value counts of each value:
value_dic = df.apply(lambda x : x.value_counts() , axis= 0).to_dict() value_dic {'fruit': {'apple': 3.0, 'bob': nan, 'joe': nan, 'orange': 1.0}, 'names': {'apple': nan, 'bob': 3.0, 'joe': 1.0, 'orange': nan}}
- Compare value counts and percentile:
df_result = df.apply(lambda row : row if value_dic['fruit'][row['fruit']] >= perc_dic['fruit'] and value_dic['names'][row['names']] >= perc_dic['names'] else np.nan , axis= 1) df_result.dropna() fruit names 0 apple bob 1 apple bob 2 apple bob
Note: There are some nan values in value_dic but it won’t hurt the performance.