I just got started on Kaggle and for my first project I was working on the Titanic dataset.
I ran the following code:
ndf = pd.concat([pd.get_dummies(df[["Pclass", "SibSp", "Parch", "Sex"]]), df[["Age", "Fare"]]], axis=1)
But the output I get is:
     Pclass  SibSp  Parch  Sex_female  Sex_male   Age     Fare
0         3      1      0           0         1  22.0   7.2500
1         1      1      0           1         0  38.0  71.2833
2         3      0      0           1         0  26.0   7.9250
3         1      1      0           1         0  35.0  53.1000
4         3      0      0           0         1  35.0   8.0500
..      ...    ...    ...         ...       ...   ...      ...
886       2      0      0           0         1  27.0  13.0000
887       1      0      0           1         0  19.0  30.0000
888       3      1      2           1         0   NaN  23.4500
889       1      0      0           0         1  26.0  30.0000
890       3      0      0           0         1  32.0   7.7500
The Pclass, SibSp and Parch columns were not converted to one-hot encoded vectors, although the Sex column was.
I don't understand why, because when I run pd.get_dummies() on the Pclass column alone, the result looks perfectly fine:
     1  2  3
0    0  0  1
1    1  0  0
2    0  0  1
3    1  0  0
4    0  0  1
..  .. .. ..
886  0  1  0
887  1  0  0
888  0  0  1
889  1  0  0
890  0  0  1
That said, the column names have become just “1”, “2” and “3”, with nothing to show which feature they came from, which of course is not really fine either…
But how can I fix the problem? I want all the features to be converted to one-hot encoded vectors.
Answer
Use OneHotEncoder from sklearn:
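import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Pclass': [0, 1, 2], 'SibSp': [3, 1, 0], 'Parch': [0, 2, 2], 'Sex': [0, 1, 1]})

# Fit the encoder and transform every listed column, numeric or not
ohe = OneHotEncoder()
data = ohe.fit_transform(df[['Pclass', 'SibSp', 'Parch', 'Sex']])

# fit_transform returns a sparse matrix, so densify it before building the DataFrame
df1 = pd.DataFrame(data.toarray(), columns=ohe.get_feature_names_out(), dtype=int)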
Output:
>>> df
   Pclass  SibSp  Parch  Sex
0       0      3      0    0
1       1      1      2    1
2       2      0      2    1

>>> df1
   Pclass_0  Pclass_1  Pclass_2  SibSp_0  SibSp_1  SibSp_3  Parch_0  Parch_2  Sex_0  Sex_1
0         1         0         0        0        0        1        1        0      1      0
1         0         1         0        0        1        0        0        1      0      1
2         0         0         1        1        0        0        0        1      0      1
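If you would rather stay in pandas, it may help to know why the original call skipped Pclass, SibSp and Parch: when pd.get_dummies() is given a DataFrame without a columns argument, it only encodes columns of object, string or category dtype and passes numeric columns through untouched. Naming the columns explicitly should force the encoding and also gives each new column a prefix. The sketch below is only an illustration: the small DataFrame is made-up stand-in data, not the real Titanic file, and the variable name `categorical` is my own.

```python
import pandas as pd

# Made-up stand-in for the Titanic frame (illustrative values only).
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 2, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 27.0],
    "Fare": [7.25, 71.2833, 23.45, 13.0],
})

# Hypothetical helper list: the columns we want one-hot encoded.
categorical = ["Pclass", "SibSp", "Parch", "Sex"]

# Listing the columns explicitly encodes them regardless of dtype.
# Columns not listed (Age, Fare) are kept as they are, so no concat is needed.
# dtype=int asks for 0/1 values instead of booleans.
ndf = pd.get_dummies(df, columns=categorical, dtype=int)

print(ndf.columns.tolist())
```

Each encoded column is named after its source column and value (for example Pclass_3 or Sex_female), which also avoids the bare “1”, “2”, “3” headers you get when encoding a single Series.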