Below is the requirement.
There are 2 tables: brand_df
(Brands’ value) and score_df
(containing Subject score for each brand). [Generating samples below]
brand_dict = {'Brand': ['a', 'b', 'c', 'g', 'e', 'i', 'd', 'h', 'f'], 'Value': [2, 1, 3, 1, 4, 1, 3, 2, 3]} brand_df = pd.DataFrame(brand_dict, columns=['Brand', 'Value']) score_dict = {'a' : [4,4,3,4,0], 'b':[0,2,0,1,0], 'c':[4,0,0,4,3], 'd':[3,2,0,3,1], 'e':[0,0,2,1,0], 'f':[2,0,3,0,0], 'g':[2,3,0,0,1], 'h':[3,0,0,1,3], 'i':[0,3,3,1,0] } score_df = pd.DataFrame(score_dict, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']) score_df.index = ['sub1', 'sub2', 'sub3', 'sub4', 'sub5'] # brand_df output # Brand Value # 0 a 2 # 1 b 1 # 2 c 3 # 3 g 1 # 4 e 4 # 5 i 1 # 6 d 3 # 7 h 2 # 8 f 3 # score_df output # a b c d e f g h i # sub1 4 0 4 3 0 2 2 3 0 # sub2 4 2 0 2 0 0 3 0 3 # sub3 3 0 0 0 2 3 0 0 3 # sub4 4 1 4 3 1 0 0 1 1 # sub5 0 0 3 1 0 0 1 3 0
What is being done :-
- Pick only the top brands that make 75% of the cumulative value
brand_df.sort_values("Value", axis=0, ascending=False, inplace=True, na_position='first') # ordered Desc to get biggest brands brand_df['cum_percent'] = (brand_df.Value.cumsum() / brand_df.Value.sum()) * 100 # brand_df Output # Brand Value cum_percent # e 4 20.0 # c 3 35.0 # d 3 50.0 # f 3 65.0 # a 2 75.0 # h 2 85.0 # b 1 90.0 # g 1 95.0 # i 1 100.0 selbrand = [] for index, row in brand_df.iterrows(): if row['cum_percent'] <= 75: selbrand.append(row['Brand']) # selbrand output # ['e', 'c', 'd', 'f', 'a']
- Pick the subjects where 75% of the selected brand has a score (i.e > 0)
# Setting threshold for subject selection. [75% percent of the selected brands must have a score] threshold = math.trunc(len(selbrand)*0.75) # In this case 75% of 5 is 3.75 and hence rounded to 3 selsub = [] for index, row in score_df.iterrows(): count = 0 for col in selbrand: if row[col] > 0: count += 1 if count >= threshold: selsub.append(index)
I get the below output using the above scripts.
print(selsub) ['sub1', 'sub3', 'sub4'] # Only these subjects have score for at least 3 of the selected brands
What is a more efficient or shorter way to achieve this?
Advertisement
Answer
You creating cum_percent
is good so far. What’s next is to remove the two loops:
selbrand = brand_df.loc[brand_df['cum_percent']<=75,'Brand'] # I tend to not use `math` package threshold = int(len(selbrand)*0.75) # extract the brands with [selbrand] # compare with 0 and count along the rows # then compare with threshold s = score_df[selbrand].gt(0).sum(1) >= threshold selsub = s[s].index # print(selsub) Index(['sub1', 'sub3', 'sub4'], dtype='object')