label-encoder encoding missing values

I am using the label encoder to convert categorical data into numeric values.

How does LabelEncoder handle missing values?

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)

Output:

array([1, 2, 3, 0, 4, 1])

In the example above, LabelEncoder turned the NaN value into its own category. How can I tell which category represents missing values?

Answer

Don’t use LabelEncoder with missing values. I don’t know which version of scikit-learn you’re using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().

As you can see in the source, it calls numpy.unique on the data to be encoded, which raises a TypeError when the data mixes strings with missing values (NaN is a float, and the sort cannot compare str with float). If you want to encode missing values, first replace them with a string:

a[pd.isnull(a)] = 'NaN'
User contributions licensed under: CC BY-SA