Skip to content
Advertisement

Pandas groupby, assign and to_excel – on loop/repeat

I have a dataframe like as shown below

import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
                    'year':[2017,2017,2018,2019,2018],
                   'customer': rng.choice(list('ACD'),size=(5)),
                   'region': rng.choice(list('PQRS'),size=(5)),
                   'dumeel': rng.choice(list('QWER'),size=(5)),
                   'dumma': rng.choice((1234),size=(5)),
                   'target_at50': rng.choice([0,1],size=(5)),
                   'target_at60': rng.choice([1,1],size=(5)),
                   'target_at70': rng.choice([0,0],size=(5))})

My objective is to do the below

a) Group columns based on multiple criteria (as shown in below code)

b) Assign a default value based on target column. (ex: if target_at50, then assign value 50, if target_at60, then assign 60. if target_at70, then assign 70)

b) Repeat the same group by criteria for different target column (target_at60, target_at70)

c) write the result of each group statement for each target to a new sheet in the single excel file.

I was trying something like below but it’s not efficient or elegant. You can see that am using the same code but repeated thrice for different target columns (target_at50. target_at60, target_at70)

cdf.groupby(['region','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet1')
cdf.groupby(['region','customer','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet2')
cdf.groupby(['region','dumeel','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet3')
cdf.groupby(['region','year','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet4')
cdf.groupby(['region','year','customer','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet5')
cdf.groupby(['region','year','dumeel','target_at50']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=50).to_excel(writer,'sheet6')


cdf.groupby(['region','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet7')
cdf.groupby(['region','customer','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet8')
cdf.groupby(['region','dumeel','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet9')
cdf.groupby(['region','year','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet10')
cdf.groupby(['region','year','customer','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet11')
cdf.groupby(['region','year','dumeel','target_at60']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=60).to_excel(writer,'sheet12')


cdf.groupby(['region','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet13')
cdf.groupby(['region','customer','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet14')
cdf.groupby(['region','dumeel','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet15')
cdf.groupby(['region','year','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet16')
cdf.groupby(['region','year','customer','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet17')
cdf.groupby(['region','year','dumeel','target_at70']).size().reset_index(name='Count').sort_values(by=['region','Count'],ascending=False).assign(threshold=70).to_excel(writer,'sheet18')

Though all the above group by statements works fine, I would like to know is there any efficient and better approach to perform the above task?

I expect my output to just have an excel with multiple sheets (18 sheets) (which will have the output from 18 groupby statements)

Advertisement

Answer

First generate all combinations passed to groupby:

from  itertools import combinations

targets = cdf.filter(like='target').columns
cols = ['customer', 'year', 'dumeel']

tups = [('region', *c, t) for t in targets 
                          for i in range(0, len(cols)+1)
                          for c in combinations(cols, i)]
print (tups)

Ant then write aggregate values in loop:

with pd.ExcelWriter('Values.xlsx') as writer: 
    for i, val in enumerate(tups, 1):
        threshold = re.
        df = (cdf.groupby(val).size().reset_index(name='Count')
                 .sort_values(by=['region','Count'],ascending=False)
                 .assign(threshold=int(re.findall('d+',val[-1])[0]))

        df.to_excel(writer,sheet_name=f'sheet{i}')
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement