
How to get unique counts based on a different column, with pandas groupby

I have the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
df

      user effortduration  callbacks  applications
0  user122        2 weeks          0             0
1  user122            NaN          0             0
2  user124        2 weeks          0             1
3  user125        3 weeks          0             0
4  user125            NaN          0             0
5  user126        2 weeks          1             1
6  user126        2 weeks          1             1

I would like to group by effortduration and get the count of each column in terms of unique users (for example, how many distinct users had a callback). This is what I have tried so far:

function = {"user": pd.Series.nunique,
            "callbacks": lambda x: x.nunique(),
            "applications": lambda x: x.isin(['1']).nunique(),}

df.groupby('effortduration').agg(function)

                  user  callbacks  applications
effortduration                               
2 weeks            3          2             2
3 weeks            1          1             1

However, that is again not what I am looking for because the values of callbacks and applications are not based on the user column. My result should be something like this:

                   user  callbacks  applications
effortduration                               
2 weeks            3          1             2
3 weeks            1          0             0

Is there any way to do such a thing? If yes, is it also possible to generalize it? My original dataframe has many more columns, and it would be painful to write all the functions by hand.


Answer

  • This works with the sample data; I'm not sure about real data.
  • Replace 0 with NaN, and then drop rows where 'effortduration', 'callbacks', and 'applications' are all NaN.
  • Drop all duplicates.
    • Based on the desired result, it only matters whether a user called/applied at least once.
  • Group by and count.
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})

# replace 0 with NaN, then drop rows where all three value columns are NaN
df = df.replace(0, np.nan).dropna(how='all', subset=['effortduration', 'callbacks', 'applications'])

# drop duplicate rows; only whether a user called/applied at least once matters
df = df.drop_duplicates()

# groupby and count
dfg = df.groupby(['effortduration']).count()

# dfg
                user  callbacks  applications
effortduration                               
2 weeks            3          1             2
3 weeks            1          0             0
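
Because .count() simply counts the non-NaN entries in every remaining column, no per-column functions are needed, and the same pipeline should generalize to a frame with many more columns. A minimal sketch of that generalization; the interviews column is invented here purely for illustration:

import numpy as np
import pandas as pd

# hypothetical wider frame; 'interviews' is an invented column for illustration
df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1],
    'interviews': [0, 0, 0, 0, 0, 1, 1]})

# every column except 'user' takes part in the NaN check, so extra
# columns are picked up automatically instead of being listed by hand
value_cols = df.columns.difference(['user'])

dfg = (df.replace(0, np.nan)
         .dropna(how='all', subset=value_cols)
         .drop_duplicates()
         .groupby('effortduration')
         .count())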

nunique

  • As already noted, this option returns the number of unique values in each column, so it doesn't produce the desired output.
import numpy as np
import pandas as pd

# recreate the sample data (df was modified by the steps above)
df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})

# using nunique
dfg = df.groupby('effortduration').nunique()

# dfg
                user  callbacks  applications
effortduration                               
2 weeks            3          2             2
3 weeks            1          1             1
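
To see why nunique reports 2 for callbacks in the '2 weeks' group, inspect the raw values: the group contains both 0 and 1, and nunique counts distinct values regardless of how many distinct users produced them.

# the '2 weeks' group holds the two distinct values 0 and 1
df.loc[df['effortduration'] == '2 weeks', 'callbacks'].unique()
# array([0, 1])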