Pandas: Is better aggregation possible?

Ddate       Test Name  T/F
01-01-2019  Haem       T
02-01-2019  Haem       T
03-01-2019  Haem       T
04-01-2019  Haem       F
05-01-2019  Haem       F
01-01-2019  CBC        T
02-01-2019  CBC        T
03-01-2019  CBC        T
04-01-2019  CBC        F
05-01-2019  CBC        F
01-01-2019  KFT        T
02-01-2019  KFT        T
03-01-2019  KFT        T
04-01-2019  KFT        F
05-01-2019  KFT        F

I have the sample dataframe above and want to calculate the percentage of True values for each date. The code below works, but I feel it could be done more directly with groupby + agg. Is that possible?

My attempt:

import pandas as pd
import numpy as np

df['T/F'] = df['T/F'].map({'T':1, 'F':0})
df['T'] = np.where(df['T/F']==1, 1, 0)
df['F'] = 1 - df['T']

df = df.groupby('Ddate').agg(
    {
        'T': sum,
        'F': sum,
    }
)

df['Total'] = df['T'] + df['F']
df['per T'] = df['T'] / df['Total'] * 100

Answer

You can do the groupby like this (the date column is called Date here; use Ddate if that is its name in your frame):

df['T/F'].eq('T').groupby([df['Date']]).mean()

Output:

Date
01-01-2019    1.0
02-01-2019    1.0
03-01-2019    1.0
04-01-2019    0.0
05-01-2019    0.0
Name: T/F, dtype: float64
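
As a self-contained check of the one-liner above, here is a minimal sketch that rebuilds the sample frame from the question. The frame construction is an assumption made for illustration (the question does not show how df is created), and the date column is named Date as in the answer.

```python
import pandas as pd

# Rebuild the sample data from the question (construction is illustrative).
dates = ['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019', '05-01-2019']
flags = ['T', 'T', 'T', 'F', 'F']
df = pd.DataFrame({
    'Date': dates * 3,
    'Test Name': ['Haem'] * 5 + ['CBC'] * 5 + ['KFT'] * 5,
    'T/F': flags * 3,
})

# eq('T') yields a boolean Series; the mean of booleans is the fraction of True.
share_true = df['T/F'].eq('T').groupby(df['Date']).mean()
print(share_true)
```

This works because True counts as 1 and False as 0 when averaged, so no helper columns are needed.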

You can get both percentages for T and F with crosstab:

pd.crosstab(df.Date, df['T/F'], normalize='index')

Output:

T/F           F    T
Date                
01-01-2019  0.0  1.0
02-01-2019  0.0  1.0
03-01-2019  0.0  1.0
04-01-2019  1.0  0.0
05-01-2019  1.0  0.0

Note 1: A comment on your code: the T and F counts per date can be obtained directly with:

counts = pd.crosstab(df['Date'], df['T/F'])

The fraction of T per date is then:

counts['per T'] = counts['T']/counts.sum(axis=1)
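
Put together, Note 1 might look like the sketch below. The sample frame is rebuilt here only so the snippet runs on its own, and the row total is restricted to the T and F columns so that re-running the last line does not fold the new per T column into the sum; both choices are assumptions added for illustration.

```python
import pandas as pd

# Illustrative reconstruction of the sample data.
dates = ['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019', '05-01-2019']
flags = ['T', 'T', 'T', 'F', 'F']
df = pd.DataFrame({'Date': dates * 3, 'T/F': flags * 3})

# One row per date, one column per distinct value of 'T/F', cells are counts.
counts = pd.crosstab(df['Date'], df['T/F'])

# Divide the T count by the row total of the two count columns only.
counts['per T'] = counts['T'] / counts[['T', 'F']].sum(axis=1)
print(counts)
```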

Note 2: Avoid groupby().agg({'col1': sum, 'col2': sum}) because:

  1. sum here is the Python built-in rather than a pandas method, and can be slow
  2. agg is slow(er), and is only useful when you want to apply different operations to different columns.

Do this instead: groupby()[['col1','col2']].sum()
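
To illustrate Note 2, the sketch below compares the two spellings on the question's T and F helper columns and checks that they give the same frame. The data setup is an illustrative assumption, and the agg call uses the string 'sum' rather than the built-in so that it stays within pandas' own implementation.

```python
import pandas as pd

# Illustrative reconstruction of the sample data plus the question's helper columns.
dates = ['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019', '05-01-2019']
flags = ['T', 'T', 'T', 'F', 'F']
df = pd.DataFrame({'Date': dates * 3, 'T/F': flags * 3})
df['T'] = df['T/F'].eq('T').astype(int)
df['F'] = 1 - df['T']

# Dict-based agg: the same operation spelled out per column.
via_agg = df.groupby('Date').agg({'T': 'sum', 'F': 'sum'})

# Column selection followed by a single sum(): shorter, same result.
via_sum = df.groupby('Date')[['T', 'F']].sum()

pd.testing.assert_frame_equal(via_agg, via_sum)
print(via_sum)
```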

Note 3: All of the solutions above give the result as a fraction between 0 and 1. If you want a percentage between 0 and 100, multiply the result by 100.
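
If you still want the groupby + agg form the question asked about, one possible sketch uses pandas named aggregation to get the T count, the total, and the percentage on a 0-100 scale in one pass. The helper column is_T and the output names T, Total and per_T are hypothetical names chosen for this example, and the frame construction is again illustrative.

```python
import pandas as pd

# Illustrative reconstruction of the sample data.
dates = ['01-01-2019', '02-01-2019', '03-01-2019', '04-01-2019', '05-01-2019']
flags = ['T', 'T', 'T', 'F', 'F']
df = pd.DataFrame({'Date': dates * 3, 'T/F': flags * 3})

out = (
    df.assign(is_T=df['T/F'].eq('T'))      # hypothetical boolean helper column
      .groupby('Date')
      .agg(
          T=('is_T', 'sum'),               # number of True rows per date
          Total=('is_T', 'count'),         # number of rows per date
          per_T=('is_T', 'mean'),          # fraction of True rows (0-1)
      )
)

# Rescale the fraction to a 0-100 percentage.
out['per_T'] = out['per_T'] * 100
print(out)
```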

User contributions licensed under: CC BY-SA