I'm trying to learn scikit-learn but am stuck on the error "Encoders require their input to be uniformly strings or numbers"




I have been learning Python from YouTube videos and am just a beginner. I saw this code in a video and tried it, but I get an error that I don't know how to solve. Below is the part of the code that gives me trouble; I didn't include the entire code because it is too long.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
%matplotlib inline


wine = pd.read_csv('wine_quality.csv')
wine.head()
wine.info()
wine.isnull().sum()

#Preprocessing
bins=(2,6.5,8)
group_names=['bad','good']
wine['quality'] = pd.cut(wine['quality'], bins=bins, labels=group_names)
wine['quality'].unique()

label_quality=LabelEncoder()
wine['quality']=label_quality.fit_transform(wine['quality'])
# After this line I get the following error:

'''TypeError                                 Traceback (most recent call last)
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
    112         try:
--> 113             res = _encode_python(values, uniques, encode)
    114         except TypeError:

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
     60     if uniques is None:
---> 61         uniques = sorted(set(values))
     62         uniques = np.array(uniques, dtype=values.dtype)

TypeError: '<' not supported between instances of 'float' and 'str'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-14-8e211b2c4bf8> in <module>
----> 1 wine['quality'] = label_quality.fit_transform(wine['quality'])

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in fit_transform(self, y)
    254         """
    255         y = column_or_1d(y, warn=True)
--> 256         self.classes_, y = _encode(y, encode=True)
    257         return y
    258 

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
    115             types = sorted(t.__qualname__
    116                            for t in set(type(v) for v in values))
--> 117             raise TypeError("Encoders require their input to be uniformly "
    118                             f"strings or numbers. Got {types}")
    119         return res

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['float', 'str']'''

Please help me fix this error. It would be great if you could tell me exactly what I should do.

Answer

I checked the Wine Quality dataset, and upon running:

wine['quality'].unique()

I got the following output:

array([6, 5, 7, 8, 4, 3, 9], dtype=int64)

Since the data contains values that exceed the upper bound you provided in your bins for pd.cut(), those out-of-range values are replaced by NaN. I checked this on my machine too; after performing your preprocessing:

#Preprocessing
bins=(2,6.5,8)
group_names=['bad','good']
wine['quality'] = pd.cut(wine['quality'], bins=bins, labels=group_names)
wine['quality'].unique()

The result I get for wine['quality'].unique() is:

['bad', 'good', NaN]
Categories (2, object): ['bad' < 'good']
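
The NaN entries can be reproduced without the CSV file. The following is only a minimal sketch: the Series `quality` is a made-up stand-in for wine['quality'], holding the unique scores listed above, and the names `binned`, `n_missing` and `out_of_range` are chosen here for illustration.

```python
import pandas as pd

# Made-up stand-in for wine['quality']; 9 lies above the top bin edge of 8.
quality = pd.Series([6, 5, 7, 8, 4, 3, 9])

# Same bins and labels as in the question: (2, 6.5] -> 'bad', (6.5, 8] -> 'good'.
binned = pd.cut(quality, bins=(2, 6.5, 8), labels=["bad", "good"])

# Scores that fall in no bin come back as NaN.
n_missing = int(binned.isna().sum())
out_of_range = quality[binned.isna()].tolist()

print(n_missing)      # how many rows became NaN
print(out_of_range)   # which original scores caused them
```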

This happens because all values that exceed 8 (the upper bound you provided) are changed to NaN. Because NaN is a float, the column then mixes float and str values, which is what the TypeError reports. The pd.cut() documentation mentions this out-of-bounds behavior too:

Out of bounds values will be NA in the resulting Series or Categorical object.

Therefore I would suggest increasing the upper bound in your bins to 9. I tried that, and the function works without any issues:

#Preprocessing
bins=(2,6.5,9)
group_names=['bad','good']
wine['quality'] = pd.cut(wine['quality'], bins=bins, labels=group_names)
wine['quality'].unique()

And the output for wine['quality'].unique() is now:

['bad', 'good']
Categories (2, object): ['bad' < 'good']

So, we do not have NaN values anymore, and your Label Encoder should now work fine.
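
As a variation on the same fix, the upper edge can be left open with infinity, so that no score can exceed it. This is a sketch under assumptions, not part of the original code: `quality` is again a made-up stand-in for wine['quality'], and the open upper edge is my suggestion rather than what the video uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for wine['quality'].
quality = pd.Series([6, 5, 7, 8, 4, 3, 9])

# An infinite upper edge means no score can fall above the last bin.
binned = pd.cut(quality, bins=(2, 6.5, float("inf")), labels=["bad", "good"])

# Check for leftover NaN before encoding; this is what triggered the TypeError.
assert binned.notna().all()

encoder = LabelEncoder()
encoded = encoder.fit_transform(binned)

print(list(encoder.classes_))  # labels in the order they were encoded
print(encoded.tolist())        # integer code for each row
```

Note that scores at or below the lower edge of 2 would still become NaN, so the check before encoding is worth keeping.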



Source: stackoverflow