I have a data frame that records responses of 19717 people’s choice of programing languages through multiple choice questions. The first column is of course the gender of the respondent while the rest are the choices they picked. The data frame is shown below, with each response being recorded as the same name as column. If no response is selected, then this results in a NaN
.
ID Gender Python Bash R JavaScript C++ 0 Male Python nan nan JavaScript nan 1 Female nan nan R JavaScript C++ 2 Prefer not to say Python Bash nan nan nan 3 Male nan nan nan nan nan
What I want is a table that returns the count based on Gender
. Hence if 5000 men coded in Python and 3000 women in JS, then I should get this:
Gender Python Bash R JavaScript C++ Male 5000 1000 800 1500 1000 Female 4000 500 1500 3000 800 Prefer Not To Say 2000 ... ... ... 860
I have tried some of the options:
df.iloc[:, [*range(0, 13)]].stack().value_counts() Male 16138 Python 12841 SQL 6532 R 4588 Female 3212 Java 2267 C++ 2256 Javascript 2174 Bash 2037 C 1672 MATLAB 1516 Other 1148 TypeScript 389 Prefer not to say 318 None 83 Prefer to self-describe 49 dtype: int64
And it’s not what is required as described above. Can this be done in pandas?
Advertisement
Answer
Another idea would be to apply
join
values along axis 1, get_dummies
then groupby
:
(df.loc[:, 'Python':] .apply(lambda x: '|'.join(x.dropna()), axis=1) .str.get_dummies('|') .groupby(df['Gender']).sum())
[out]
Bash C++ JavaScript Python R Gender Female 0 1 1 0 1 Male 0 0 1 1 0 Prefer not to say 1 0 0 1 0