I have a dataframe like as shown below
import numpy as np import pandas as pd from numpy.random import default_rng rng = default_rng(100) cdf = pd.DataFrame({'Id':[1,2,3,4,5], 'year':[2017,2017,2018,2019,2018], 'customer': rng.choice(list('ACD'),size=(5)), 'region': rng.choice(list('PQRS'),size=(5)), 'dumeel': rng.choice(list('QWER'),size=(5)), 'dumma': rng.choice((1234),size=(5)), 'target_at50': rng.choice([0,1],size=(5)), 'target_at60': rng.choice([1,1],size=(5)), 'target_at70': rng.choice([0,0],size=(5))})
My objective is to do the below
a) Group columns based on multiple criteria (as shown in below code)
b) Assign a default value based on target column. (ex: if target_at50, then assign value 50, if target_at60, then assign 60. if target_at70, then assign 70)
b) Repeat the same group by criteria for different target column (target_at60, target_at70)
c) write the result of each group statement for each target to a new sheet in the single excel file.
I was trying something like below but it’s not efficient or elegant. You can see that am using the same code but repeated thrice for different target columns (target_at50. target_at60, target_at70
)
cdf.groupby(['region','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet1') cdf.groupby(['region','customer','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet2') cdf.groupby(['region','dumeel','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet3') cdf.groupby(['region','year','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet4') cdf.groupby(['region','year','customer','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet5') cdf.groupby(['region','year','dumeel','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet6') cdf.groupby(['region','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet7') cdf.groupby(['region','customer','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet8') cdf.groupby(['region','dumeel','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet9') cdf.groupby(['region','year','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet10') cdf.groupby(['region','year','customer','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet11') cdf.groupby(['region','year','dumeel','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet12') cdf.groupby(['region','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet13') cdf.groupby(['region','customer','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet14') cdf.groupby(['region','dumeel','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet15') cdf.groupby(['region','year','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet16') cdf.groupby(['region','year','customer','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet17') cdf.groupby(['region','year','dumeel','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet18')
Though all the above group by statements works fine, I would like to know is there any efficient and better approach to perform the above task?
I expect my output to just have an excel with multiple sheets (18 sheets) (which will have the output from 18 groupby statements)
Advertisement
Answer
First generate all combinations passed to groupby
:
from itertools import combinations targets = cdf.filter(like='target').columns cols = ['customer', 'year', 'dumeel'] tups = [('region', *c, t) for t in targets for i in range(0, len(cols)+1) for c in combinations(cols, i)] print (tups)
Ant then write aggregate values in loop:
with pd.ExcelWriter('Values.xlsx') as writer: for i, val in enumerate(tups, 1): threshold = re. df = (cdf.groupby(val).size().reset_index(name='Count') .sort_values(by=['region','Count'],ascending=False) .assign(threshold=int(re.findall('d+',val[-1])[0])) df.to_excel(writer,sheet_name=f'sheet{i}')