I just got started on Kaggle and for my first project I was working on the Titanic dataset.
I ran the following code block:

    ndf = pd.concat([pd.get_dummies(df[["Pclass", "SibSp", "Parch", "Sex"]]), df[["Age", "Fare"]]], axis=1)

However, the output I get is:

         Pclass  SibSp  Parch  Sex_female  Sex_male   Age     Fare
    0         3      1      0           0         1  22.0   7.2500
    1         1      1      0           1         0  38.0  71.2833
    2         3      0      0           1         0  26.0   7.9250
    3         1      1      0           1         0  35.0  53.1000
    4         3      0      0           0         1  35.0   8.0500
    ..
    886       2      0      0           0         1  27.0  13.0000
    887       1      0      0           1         0  19.0  30.0000
    888       3      1      2           1         0   NaN  23.4500
    889       1      0      0           0         1  26.0  30.0000
    890       3      0      0           0         1  32.0   7.7500

The Pclass, SibSp and Parch variables were not converted to one-hot encoded vectors, though the Sex attribute was.
I don't understand why, because when I run the pd.get_dummies() function on the Pclass variable alone, the result is perfectly fine:

         1  2  3
    0    0  0  1
    1    1  0  0
    2    0  0  1
    3    1  0  0
    4    0  0  1
    ..
    886  0  1  0
    887  1  0  0
    888  0  0  1
    889  1  0  0
    890  0  0  1

The column names, however, have become just "1", "2" and "3", which of course is not ideal either.
But how can I fix the problem? I want all the features to be converted to one-hot encoded vectors.
Answer
Use OneHotEncoder from sklearn:
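
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Small example frame standing in for the categorical Titanic columns
    df = pd.DataFrame({'Pclass': [0, 1, 2], 'SibSp': [3, 1, 0],
                       'Parch': [0, 2, 2], 'Sex': [0, 1, 1]})

    # The encoder treats every distinct value in each column as a category
    ohe = OneHotEncoder()
    data = ohe.fit_transform(df[['Pclass', 'SibSp', 'Parch', 'Sex']])
    # fit_transform returns a sparse matrix, so densify it before building the frame
    df1 = pd.DataFrame(data.toarray(), columns=ohe.get_feature_names_out(), dtype=int)
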
Output:

    >>> df
       Pclass  SibSp  Parch  Sex
    0       0      3      0    0
    1       1      1      2    1
    2       2      0      2    1

    >>> df1
       Pclass_0  Pclass_1  Pclass_2  SibSp_0  SibSp_1  SibSp_3  Parch_0  Parch_2  Sex_0  Sex_1
    0         1         0         0        0        0        1        1        0      1      0
    1         0         1         0        0        1        0        0        1      0      1
    2         0         0         1        1        0        0        0        1      0      1
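
As for why the original call behaved that way: when pd.get_dummies() is given a DataFrame without a columns argument, it only encodes columns of object, string or category dtype, so the text column Sex was converted while the integer columns Pclass, SibSp and Parch were passed through unchanged. Calling it on a single Series encodes that Series whatever its dtype, but without a prefix, which is where the bare column names came from. If you would rather stay in pandas, a minimal sketch is to name the columns explicitly. The small frame below is a hypothetical stand-in for the Titanic data, and the variable name categorical is only illustrative:

```python
import pandas as pd

# Hypothetical miniature of the Titanic data; the values are illustrative only.
df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# Columns to treat as categorical, regardless of their dtype.
categorical = ["Pclass", "SibSp", "Parch", "Sex"]

# columns= forces one-hot encoding of the listed columns and prefixes each
# new column with the original name; Age and Fare are left untouched.
ndf = pd.get_dummies(df, columns=categorical, dtype=int)

print(ndf.columns.tolist())
```

With this approach there is no need for pd.concat, since the columns that are not listed are carried over in the same call. Note that only the category values actually present in the data get a column, so a value missing from a test set will produce a different set of columns than the training set; OneHotEncoder fitted on the training data avoids that.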