I have the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
df

      user effortduration  callbacks  applications
   user122        2 weeks          0             0
   user122            NaN          0             0
   user124        2 weeks          0             1
   user125        3 weeks          0             0
   user125            NaN          0             0
   user126        2 weeks          1             1
   user126        2 weeks          1             1
I would like to groupby effortduration and get the count of each column based on the unique count of the user column. This is what I have tried so far:
function = {"user": pd.Series.nunique,
            "callbacks": lambda x: x.nunique(),
            "applications": lambda x: x.isin(['1']).nunique(),}

df.groupby('effortduration').agg(function)

                user  callbacks  applications
effortduration
2 weeks            3          2             2
3 weeks            1          1             1
However, that is again not what I am looking for because the values of callbacks and applications are not based on the user column. My result should be something like this:
                user  callbacks  applications
effortduration
2 weeks            3          1             2
3 weeks            1          0             0
Is there any way to do such a thing? If yes, is it also possible to generalize it because my original dataframe has many more columns and it would be painful to write all the functions by hand?
Answer
- This works with the sample data; I'm not sure about real data.
- Replace 0 with NaN, and then drop rows where 'effortduration', 'callbacks', and 'applications' are all NaN.
- Drop all duplicates. Based on the desired result, it only matters whether a user called/applied at least once.
- Groupby and count.
import pandas as pd
import numpy as np

# sample data
df = pd.DataFrame({'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
                   'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
                   'callbacks': [0, 0, 0, 0, 0, 1, 1],
                   'applications': [0, 0, 1, 0, 0, 1, 1]})

# replace 0 with NaN and drop rows where all three columns are NaN
df = df.replace(0, np.nan).dropna(how='all', subset=['effortduration', 'callbacks', 'applications'])

# drop duplicate rows; a user only needs to have called/applied once
df = df.drop_duplicates()

# groupby and count the remaining non-NaN values in each column
dfg = df.groupby(['effortduration']).count()

# dfg
                user  callbacks  applications
effortduration
2 weeks            3          1             2
3 weeks            1          0             0
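To address the generalization part of the question, the same pipeline can be applied without writing any per-column functions. This is a minimal sketch, assuming every column other than user and effortduration is a 0/1 flag; the flag_cols name is only for illustration:

import pandas as pd
import numpy as np

# sample data; the real frame may have many more flag columns
df = pd.DataFrame({'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
                   'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
                   'callbacks': [0, 0, 0, 0, 0, 1, 1],
                   'applications': [0, 0, 1, 0, 0, 1, 1]})

# everything except the identifier columns is treated as a 0/1 flag
flag_cols = df.columns.difference(['user', 'effortduration']).tolist()

# same steps as above, with the column list built programmatically
dfg = (df.replace(0, np.nan)
         .dropna(how='all', subset=['effortduration'] + flag_cols)
         .drop_duplicates()
         .groupby('effortduration')
         .count())

On the sample data this gives the same output as above, and it scales to any number of flag columns.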
nunique
- As already noted, this option returns the number of unique values in each column, so it doesn't produce the desired output.
# sample data (re-created, since df was modified above)
df = pd.DataFrame({'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
                   'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
                   'callbacks': [0, 0, 0, 0, 0, 1, 1],
                   'applications': [0, 0, 1, 0, 0, 1, 1]})

# using nunique
dfg = df.groupby('effortduration').nunique()

# dfg
                user  callbacks  applications
effortduration
2 weeks            3          2             2
3 weeks            1          1             1
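If the agg approach from the question is still preferred, the per-column functions can count unique users rather than unique values. This is only a sketch, not part of the answer above; it assumes the flag columns contain 0/1 values, and the flag_cols and funcs names are illustrative:

import pandas as pd
import numpy as np

df = pd.DataFrame({'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
                   'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
                   'callbacks': [0, 0, 0, 0, 0, 1, 1],
                   'applications': [0, 0, 1, 0, 0, 1, 1]})

# for each flag column, count the unique users whose flag is 1 within the group
flag_cols = ['callbacks', 'applications']
funcs = {'user': pd.Series.nunique}
funcs.update({c: (lambda s: df.loc[s[s == 1].index, 'user'].nunique()) for c in flag_cols})

dfg = df.groupby('effortduration').agg(funcs)

# dfg
                user  callbacks  applications
effortduration
2 weeks            3          1             2
3 weeks            1          0             0

On the sample data this matches the desired result, and the dict comprehension avoids writing each function by hand.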