I have the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
df
      user effortduration  callbacks  applications
0  user122        2 weeks          0             0
1  user122            NaN          0             0
2  user124        2 weeks          0             1
3  user125        3 weeks          0             0
4  user125            NaN          0             0
5  user126        2 weeks          1             1
6  user126        2 weeks          1             1
I would like to group by effortduration and get the count of each column based on the number of unique users. This is what I have tried so far:
function = {"user": pd.Series.nunique,
            "callbacks": lambda x: x.nunique(),
            "applications": lambda x: x.isin(['1']).nunique()}

df.groupby('effortduration').agg(function)
                user  callbacks  applications
effortduration
2 weeks            3          2             2
3 weeks            1          1             1
However, that is again not what I am looking for because the values of callbacks and applications are not based on the user column. My result should be something like this:
                user  callbacks  applications
effortduration
2 weeks            3          1             2
3 weeks            1          0             0
Is there any way to do such a thing? If yes, is it also possible to generalize it because my original dataframe has many more columns and it would be painful to write all the functions by hand?
Answer
- This works with the sample data; I'm not sure about the real data.
- Replace 0 with NaN, and then drop rows where 'effortduration', 'callbacks', and 'applications' are all NaN.
- Drop all duplicates.
- Based on the desired result, it only matters whether a user called/applied at least once.
- Groupby and count.
import numpy as np
import pandas as pd

# sample data
df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
# replace 0 and drop nan
df = df.replace(0, np.nan).dropna(how='all', subset=['effortduration', 'callbacks', 'applications'])
# drop duplicates
df = df.drop_duplicates()
# groupby and count
dfg = df.groupby(['effortduration']).count()
# dfg
                user  callbacks  applications
effortduration
2 weeks            3          1             2
3 weeks            1          0             0
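To the generalization question: since `dropna` accepts a `subset`, the column list can be built from `df.columns` instead of typed out by hand. A minimal sketch with the sample data (on a real frame, every column other than `'user'` would be picked up the same way):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})

# every column except the grouping identifier 'user' participates in the dropna
value_cols = df.columns.difference(['user']).tolist()

dfg = (df.replace(0, np.nan)
         .dropna(how='all', subset=value_cols)
         .drop_duplicates()
         .groupby('effortduration')
         .count())
```

This produces the same result as the explicit version above, and adding more columns to the original dataframe requires no changes to the code.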
nunique
- As already noted, this option returns the count of the number of unique values in the column, so doesn’t return the desired output.
import numpy as np
import pandas as pd

# same sample data as above
df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})

# using nunique
dfg = df.groupby('effortduration').nunique()
# dfg
                user  callbacks  applications
effortduration
2 weeks            3          2             2
3 weeks            1          1             1
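The mismatch is easy to see on the '2 weeks' group: its callbacks column contains both 0 and 1, so `nunique` returns 2 (the number of distinct values), even though only one user (user126) ever called back:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})

grp = df[df['effortduration'] == '2 weeks']

# two distinct values (0 and 1) in the column -> nunique() == 2
print(grp['callbacks'].nunique())

# but only one distinct user actually has callbacks == 1
print(grp.loc[grp['callbacks'] == 1, 'user'].nunique())
```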