I have a data frame that records responses of 19717 people’s choice of programing languages through multiple choice questions. The first column is of course the gender of the respondent while the rest are the choices they picked. The data frame is shown below, with each response being recorded as the same name as column. If no response is selected, then this results in a NaN
.
JavaScript
x
6
1
ID Gender Python Bash R JavaScript C++
2
0 Male Python nan nan JavaScript nan
3
1 Female nan nan R JavaScript C++
4
2 Prefer not to say Python Bash nan nan nan
5
3 Male nan nan nan nan nan
6
What I want is a table that returns the count based on Gender
. Hence if 5000 men coded in Python and 3000 women in JS, then I should get this:
JavaScript
1
5
1
Gender Python Bash R JavaScript C++
2
Male 5000 1000 800 1500 1000
3
Female 4000 500 1500 3000 800
4
Prefer Not To Say 2000 860
5
I have tried some of the options:
JavaScript
1
20
20
1
df.iloc[:, [*range(0, 13)]].stack().value_counts()
2
3
Male 16138
4
Python 12841
5
SQL 6532
6
R 4588
7
Female 3212
8
Java 2267
9
C++ 2256
10
Javascript 2174
11
Bash 2037
12
C 1672
13
MATLAB 1516
14
Other 1148
15
TypeScript 389
16
Prefer not to say 318
17
None 83
18
Prefer to self-describe 49
19
dtype: int64
20
And it’s not what is required as described above. Can this be done in pandas?
Advertisement
Answer
Another idea would be to apply
join
values along axis 1, get_dummies
then groupby
:
JavaScript
1
5
1
(df.loc[:, 'Python':]
2
.apply(lambda x: '|'.join(x.dropna()), axis=1)
3
.str.get_dummies('|')
4
.groupby(df['Gender']).sum())
5
[out]
JavaScript
1
6
1
Bash C++ JavaScript Python R
2
Gender
3
Female 0 1 1 0 1
4
Male 0 0 1 1 0
5
Prefer not to say 1 0 0 1 0
6