pd.get_dummies() not converting categorical data to one hot encoded vectors when multiple features are used

I just got started on Kaggle and for my first project I was working on the Titanic dataset.

I ran the following code block:

ndf = pd.concat([pd.get_dummies(df[["Pclass", "SibSp", "Parch", "Sex"]]), df[["Age", "Fare"]]], axis=1)

However, this is the output I get:

     Pclass  SibSp  Parch  Sex_female  Sex_male   Age     Fare
0         3      1      0           0         1  22.0   7.2500
1         1      1      0           1         0  38.0  71.2833
2         3      0      0           1         0  26.0   7.9250
3         1      1      0           1         0  35.0  53.1000
4         3      0      0           0         1  35.0   8.0500
..      ...    ...    ...         ...       ...   ...      ...
886       2      0      0           0         1  27.0  13.0000
887       1      0      0           1         0  19.0  30.0000
888       3      1      2           1         0   NaN  23.4500
889       1      0      0           0         1  26.0  30.0000
890       3      0      0           0         1  32.0   7.7500

The Pclass, SibSp and Parch columns were not converted to one-hot encoded vectors, although the Sex column was.

I don't understand why, because when I run pd.get_dummies() on the Pclass column alone, the result is one-hot encoded as expected:

    1   2   3
0   0   0   1
1   1   0   0
2   0   0   1
3   1   0   0
4   0   0   1
...     ...     ...     ...
886     0   1   0
887     1   0   0
888     0   0   1
889     1   0   0
890     0   0   1

However, the column names here are just the bare category values ("1", "2" and "3") with no "Pclass" prefix, which is not ideal either.

But how can I fix the problem? I want all the features to be converted to one-hot encoded vectors.

Answer

Use OneHotEncoder from sklearn, which encodes every column you pass it, numeric ones included:

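import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data standing in for the Titanic columns
df = pd.DataFrame({'Pclass': [0, 1, 2], 'SibSp': [3, 1, 0],
                   'Parch': [0, 2, 2], 'Sex': [0, 1, 1]})

# fit_transform returns a sparse matrix with one column per (feature, value) pair
ohe = OneHotEncoder()
data = ohe.fit_transform(df[['Pclass', 'SibSp', 'Parch', 'Sex']])
# get_feature_names_out() requires scikit-learn 1.0 or later
df1 = pd.DataFrame(data.toarray(), columns=ohe.get_feature_names_out(), dtype=int)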

Output:

>>> df
   Pclass  SibSp  Parch  Sex
0       0      3      0    0
1       1      1      2    1
2       2      0      2    1

>>> df1
   Pclass_0  Pclass_1  Pclass_2  SibSp_0  SibSp_1  SibSp_3  Parch_0  Parch_2  Sex_0  Sex_1
0         1         0         0        0        0        1        1        0      1      0
1         0         1         0        0        1        0        0        1      0      1
2         0         0         1        1        0        0        0        1      0      1
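
A pandas-only alternative may also work here. It is a sketch, not part of the answer above. When pd.get_dummies() is given a DataFrame, by default it only encodes columns with object, string or category dtype, which would explain why the integer columns Pclass, SibSp and Parch passed through unchanged while Sex was encoded. Passing the columns= argument names the columns to encode regardless of dtype, and each new column gets the original column name as a prefix. The sample values below are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the Titanic columns from the question.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 2, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 27.0],
    "Fare": [7.25, 71.2833, 7.925, 13.0],
})

categorical = ["Pclass", "SibSp", "Parch", "Sex"]

# columns= forces encoding of the listed columns whatever their dtype;
# columns not listed (Age, Fare) are kept unchanged.
# dtype=int gives 0/1 values instead of booleans on newer pandas versions.
ndf = pd.get_dummies(df, columns=categorical, dtype=int)

print(ndf.columns.tolist())
```

With this approach no pd.concat is needed, since Age and Fare stay in the result. Converting the columns first with df[categorical].astype("category") should have a similar effect.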
User contributions licensed under: CC BY-SA