I have the following dataframe:
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
                   'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
                   'callbacks': [0, 0, 0, 0, 0, 1, 1],
                   'applications': [0, 0, 1, 0, 0, 1, 1]})

df
      user effortduration  callbacks  applications
0  user122        2 weeks          0             0
1  user122            NaN          0             0
2  user124        2 weeks          0             1
3  user125        3 weeks          0             0
4  user125            NaN          0             0
5  user126        2 weeks          1             1
6  user126        2 weeks          1             1
```
I would like to group by `effortduration` and count each of the other columns based on the number of unique users. This is what I have tried so far:
```python
function = {"user": pd.Series.nunique,
            "callbacks": lambda x: x.nunique(),
            "applications": lambda x: x.isin(['1']).nunique()}

df.groupby('effortduration').agg(function)

                user  callbacks  applications
effortduration
2 weeks            3          2             2
3 weeks            1          1             1
```
However, that is again not what I am looking for, because the `callbacks` and `applications` counts are not based on the `user` column. My result should be something like this:
```python
                user  callbacks  applications
effortduration
2 weeks            3          1             2
3 weeks            1          0             0
```
Is there any way to do such a thing? If yes, is it also possible to generalize it because my original dataframe has many more columns and it would be painful to write all the functions by hand?
Answer
- This works with the sample data; I'm not sure it will with real data.
- Replace `0` with `NaN`, and then drop rows where `'effortduration'`, `'callbacks'`, and `'applications'` are all `NaN`.
- Drop all duplicates.
  - Based on the desired result, it only matters whether a user called/applied at least once.
- Group by and count.
```python
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
                   'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
                   'callbacks': [0, 0, 0, 0, 0, 1, 1],
                   'applications': [0, 0, 1, 0, 0, 1, 1]})

# replace 0 with NaN and drop rows where all three columns are NaN
df = df.replace(0, np.nan).dropna(how='all', subset=['effortduration', 'callbacks', 'applications'])

# drop duplicate rows
df = df.drop_duplicates()

# groupby and count the remaining non-NaN values
dfg = df.groupby(['effortduration']).count()

dfg
                user  callbacks  applications
effortduration
2 weeks            3          1             2
3 weeks            1          0             0
```
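As for generalizing to many more columns: since the same replace/dropna/drop_duplicates steps apply to every column, one sketch (assuming every column other than `user` should be treated the same way; `subset_cols` is a name introduced here for illustration) is to build the subset list from the columns themselves:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
                   'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
                   'callbacks': [0, 0, 0, 0, 0, 1, 1],
                   'applications': [0, 0, 1, 0, 0, 1, 1]})

# derive the subset list from the frame instead of writing the columns by hand
subset_cols = [c for c in df.columns if c != 'user']

dfg = (df.replace(0, np.nan)
         .dropna(how='all', subset=subset_cols)
         .drop_duplicates()
         .groupby('effortduration')
         .count())
```

This produces the same result as the step-by-step version above, and adding new columns to the dataframe requires no changes to the pipeline.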
nunique
- As already noted in the question, this option returns the count of unique values in each column, so it doesn't return the desired output.
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
                   'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
                   'callbacks': [0, 0, 0, 0, 0, 1, 1],
                   'applications': [0, 0, 1, 0, 0, 1, 1]})

# using nunique
dfg = df.groupby('effortduration').nunique()

dfg
                user  callbacks  applications
effortduration
2 weeks            3          2             2
3 weeks            1          1             1
```