Below is the requirement.
There are 2 tables: brand_df
(Brands’ value) and score_df
(containing Subject score for each brand). [Generating samples below]
JavaScript
x
27
27
1
brand_dict = {'Brand': ['a', 'b', 'c', 'g', 'e', 'i', 'd', 'h', 'f'], 'Value': [2, 1, 3, 1, 4, 1, 3, 2, 3]}
2
brand_df = pd.DataFrame(brand_dict, columns=['Brand', 'Value'])
3
4
score_dict = {'a' : [4,4,3,4,0], 'b':[0,2,0,1,0], 'c':[4,0,0,4,3], 'd':[3,2,0,3,1], 'e':[0,0,2,1,0], 'f':[2,0,3,0,0], 'g':[2,3,0,0,1], 'h':[3,0,0,1,3], 'i':[0,3,3,1,0] }
5
score_df = pd.DataFrame(score_dict, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
6
score_df.index = ['sub1', 'sub2', 'sub3', 'sub4', 'sub5']
7
8
# brand_df output
9
# Brand Value
10
# 0 a 2
11
# 1 b 1
12
# 2 c 3
13
# 3 g 1
14
# 4 e 4
15
# 5 i 1
16
# 6 d 3
17
# 7 h 2
18
# 8 f 3
19
20
# score_df output
21
# a b c d e f g h i
22
# sub1 4 0 4 3 0 2 2 3 0
23
# sub2 4 2 0 2 0 0 3 0 3
24
# sub3 3 0 0 0 2 3 0 0 3
25
# sub4 4 1 4 3 1 0 0 1 1
26
# sub5 0 0 3 1 0 0 1 3 0
27
What is being done :-
- Pick only the top brands that make 75% of the cumulative value
JavaScript
1
22
22
1
brand_df.sort_values("Value", axis=0, ascending=False, inplace=True, na_position='first') # ordered Desc to get biggest brands
2
brand_df['cum_percent'] = (brand_df.Value.cumsum() / brand_df.Value.sum()) * 100
3
4
# brand_df Output
5
# Brand Value cum_percent
6
# e 4 20.0
7
# c 3 35.0
8
# d 3 50.0
9
# f 3 65.0
10
# a 2 75.0
11
# h 2 85.0
12
# b 1 90.0
13
# g 1 95.0
14
# i 1 100.0
15
selbrand = []
16
for index, row in brand_df.iterrows():
17
if row['cum_percent'] <= 75:
18
selbrand.append(row['Brand'])
19
20
# selbrand output
21
# ['e', 'c', 'd', 'f', 'a']
22
- Pick the subjects where 75% of the selected brand has a score (i.e > 0)
JavaScript
1
13
13
1
# Setting threshold for subject selection. [75% percent of the selected brands must have a score]
2
threshold = math.trunc(len(selbrand)*0.75) # In this case 75% of 5 is 3.75 and hence rounded to 3
3
4
selsub = []
5
6
for index, row in score_df.iterrows():
7
count = 0
8
for col in selbrand:
9
if row[col] > 0:
10
count += 1
11
if count >= threshold:
12
selsub.append(index)
13
I get the below output using the above scripts.
JavaScript
1
3
1
print(selsub)
2
['sub1', 'sub3', 'sub4'] # Only these subjects have score for at least 3 of the selected brands
3
What is a more efficient or shorter way to achieve this?
Advertisement
Answer
You creating cum_percent
is good so far. What’s next is to remove the two loops:
JavaScript
1
14
14
1
selbrand = brand_df.loc[brand_df['cum_percent']<=75,'Brand']
2
3
# I tend to not use `math` package
4
threshold = int(len(selbrand)*0.75)
5
6
# extract the brands with [selbrand]
7
# compare with 0 and count along the rows
8
# then compare with threshold
9
s = score_df[selbrand].gt(0).sum(1) >= threshold
10
selsub = s[s].index
11
12
# print(selsub)
13
Index(['sub1', 'sub3', 'sub4'], dtype='object')
14