Skip to content
Advertisement

A shorter & more efficient pandas code for cumulative based data selection & column based data selection

Below is the requirement.

There are 2 tables: brand_df (Brands’ value) and score_df (containing Subject score for each brand). [Generating samples below]

brand_dict = {'Brand': ['a', 'b', 'c', 'g', 'e', 'i', 'd', 'h', 'f'], 'Value': [2, 1, 3, 1, 4, 1, 3, 2, 3]}
brand_df = pd.DataFrame(brand_dict, columns=['Brand', 'Value'])

score_dict = {'a' : [4,4,3,4,0], 'b':[0,2,0,1,0], 'c':[4,0,0,4,3], 'd':[3,2,0,3,1], 'e':[0,0,2,1,0], 'f':[2,0,3,0,0], 'g':[2,3,0,0,1], 'h':[3,0,0,1,3], 'i':[0,3,3,1,0] }
score_df = pd.DataFrame(score_dict, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
score_df.index = ['sub1', 'sub2', 'sub3', 'sub4', 'sub5']

# brand_df output
#   Brand  Value
# 0     a      2
# 1     b      1
# 2     c      3
# 3     g      1
# 4     e      4
# 5     i      1
# 6     d      3
# 7     h      2
# 8     f      3

# score_df output 
#       a  b  c  d  e  f  g  h  i
# sub1  4  0  4  3  0  2  2  3  0
# sub2  4  2  0  2  0  0  3  0  3
# sub3  3  0  0  0  2  3  0  0  3
# sub4  4  1  4  3  1  0  0  1  1
# sub5  0  0  3  1  0  0  1  3  0

What is being done :-

  1. Pick only the top brands that make 75% of the cumulative value
brand_df.sort_values("Value", axis=0, ascending=False, inplace=True, na_position='first')  # ordered Desc to get biggest brands
brand_df['cum_percent'] = (brand_df.Value.cumsum() / brand_df.Value.sum()) * 100

# brand_df Output
# Brand  Value  cum_percent
#    e      4         20.0
#    c      3         35.0
#    d      3         50.0
#    f      3         65.0
#    a      2         75.0
#    h      2         85.0
#    b      1         90.0
#    g      1         95.0
#    i      1        100.0
selbrand = []
for index, row in brand_df.iterrows():
    if row['cum_percent'] <= 75:
        selbrand.append(row['Brand'])

# selbrand output
# ['e', 'c', 'd', 'f', 'a']
  1. Pick the subjects where 75% of the selected brand has a score (i.e > 0)
# Setting threshold for subject selection. [75% percent of the selected brands must have a score] 
threshold = math.trunc(len(selbrand)*0.75) # In this case 75% of 5 is 3.75 and hence rounded to 3

selsub = []

for index, row in score_df.iterrows():
    count = 0
    for col in selbrand:
        if row[col] > 0:
            count += 1
    if count >= threshold:
        selsub.append(index)

I get the below output using the above scripts.

print(selsub)
['sub1', 'sub3', 'sub4']  # Only these subjects have score for at least 3 of the selected brands

What is a more efficient or shorter way to achieve this?

Advertisement

Answer

You creating cum_percent is good so far. What’s next is to remove the two loops:

selbrand = brand_df.loc[brand_df['cum_percent']<=75,'Brand']

# I tend to not use `math` package
threshold = int(len(selbrand)*0.75)

# extract the brands with [selbrand]
# compare with 0 and count along the rows
# then compare with threshold
s = score_df[selbrand].gt(0).sum(1) >= threshold
selsub = s[s].index

# print(selsub)
Index(['sub1', 'sub3', 'sub4'], dtype='object')
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement