I can easily train and test a classifier using the code below.
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# Step 1: Create the data set
# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location", "wheel_base",
           "length", "width", "height", "curb_weight", "engine_type", "num_cylinders",
           "engine_size", "fuel_system", "bore", "stroke", "compression_ratio",
           "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?")
df.head()
df.columns

# Replace every column with its categorical codes
df_fin = pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
df_fin

X = df_fin[['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
            'num_doors', 'body_style', 'drive_wheels', 'engine_location', 'wheel_base',
            'length', 'width', 'height', 'curb_weight', 'engine_type', 'num_cylinders',
            'engine_size', 'fuel_system', 'bore', 'stroke', 'compression_ratio',
            'horsepower', 'peak_rpm']]
y = df_fin['city_mpg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a Decision Tree model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)
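As an aside, if I understand the encoding correctly, the "?" entries are read in as NaN and .cat.codes stores them as -1, which is why the tree fits without complaining about missing values. A quick sanity check on one column that has gaps:

df['normalized_losses'].isna().sum()        # count of values read in as NaN
(df_fin['normalized_losses'] == -1).sum()   # the same rows, now coded as -1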
Now, how can I make a prediction of the target variable (dependent variable) based on the independent variables?
Something like this should work, I think, but it doesn’t…
clf.predict([[2, 164, 'audi', 'gas', 'std', 'four', 'sedan', 'fwd', 'front',
              99.8, 176.6, 66.2, 54.3, 2337, 'ohc', 'four', 109, 'mpfi',
              3.19, 3.4, 10, 102, 5500, 24, 30, 13950]])
I would like to leave the numeric values as numbers and put quotes around the labels, then predict the dependent variable, but I can't because of the labeled data. If the data were all numeric and this were a regression problem, it would work. My question is: how can we feed in numbers and labels the way a person would read them, rather than the numeric codes the labels get converted into? I assume the labels are converted into numerics (one-hot encoding, categorical codes, or whatever) before the training and testing is done, right?
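For example, this is roughly what I mean by the conversion, using the make column (the values in the comments are just what I'd expect from the first rows):

df['make'].head().tolist()                                # e.g. ['alfa-romero', 'alfa-romero', 'alfa-romero', 'audi', 'audi']
df['make'].astype('category').cat.codes.head().tolist()   # e.g. [0, 0, 0, 1, 1]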
Here is the error message that I’m getting.
clf.predict([[2,164,'audi','gas','std','four','sedan','fwd','front',99.8,176.6,66.2,54.3,2337,'ohc','four',109,'mpfi',3.19,3.4,10,102,5500,24,30,13950,]])
C:\Users\ryans\anaconda3\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
  warnings.warn(
Traceback (most recent call last):
  Input In [20] in <cell line: 1>
    clf.predict([[2,164,'audi','gas','std','four','sedan','fwd','front',99.8,176.6,66.2,54.3,2337,'ohc','four',109,'mpfi',3.19,3.4,10,102,5500,24,30,13950,]])
  File ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py:505 in predict
    X = self._validate_X_predict(X, check_input)
  File ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py:471 in _validate_X_predict
    X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)
  File ~\anaconda3\lib\site-packages\sklearn\base.py:577 in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File ~\anaconda3\lib\site-packages\sklearn\utils\validation.py:856 in check_array
    array = np.asarray(array, order=order, dtype=dtype)
ValueError: could not convert string to float: 'audi'
Answer
You can create a map from each column's original values to the corresponding categorical codes:
# For every column, build a dictionary mapping the original value to its category code
col_dictionary = {}
for col in df:
    dictionary = dict(enumerate(df[col].astype('category').cat.categories))
    col_dictionary[col] = {v: k for k, v in dictionary.items()}
obtaining:
{'symboling': {-2: 0, -1: 1, 0: 2, ..., 3: 5},
 'normalized_losses': {65.0: 0, 74.0: 1, ..., 197.0: 48, 231.0: 49, 256.0: 50},
 'make': {'alfa-romero': 0, 'audi': 1, 'bmw': 2, 'chevrolet': 3, 'dodge': 4, ..., 'volkswagen': 20, 'volvo': 21},
 'fuel_type': {'diesel': 0, 'gas': 1},
 'aspiration': {'std': 0, 'turbo': 1},
 'num_doors': {'four': 0, 'two': 1},
 'body_style': {'convertible': 0, 'hardtop': 1, 'hatchback': 2, 'sedan': 3, 'wagon': 4},
 'drive_wheels': {'4wd': 0, 'fwd': 1, 'rwd': 2},
 'engine_location': {'front': 0, 'rear': 1},
 'wheel_base': {86.6: 0, 88.4: 1, ..., 115.6: 51, 120.9: 52},
 'length': {141.1: 0, 144.6: 1, ..., 202.6: 73, 208.1: 74},
 'width': {60.3: 0, 61.8: 1, ..., 59.1: 47, 59.8: 48},
 'curb_weight': {1488: 0, 1713: 1, 1819: 2, ..., 4066: 170},
 'engine_type': {'dohc': 0, 'dohcv': 1, 'l': 2, 'ohc': 3, 'ohcf': 4, 'ohcv': 5, 'rotor': 6},
 'num_cylinders': {'eight': 0, 'five': 1, 'four': 2, 'six': 3, 'three': 4, 'twelve': 5, 'two': 6},
 'engine_size': {61: 0, 70: 1, 79: 2, ..., 304: 41, 308: 42, 326: 43},
 'fuel_system': {'1bbl': 0, '2bbl': 1, '4bbl': 2, 'idi': 3, 'mfi': 4, 'mpfi': 5, 'spdi': 6, 'spfi': 7},
 'bore': {2.54: 0, 2.68: 1, ..., 3.94: 37},
 'stroke': {2.07: 0, 2.19: 1, ..., 3.9: 34, 4.17: 35},
 'compression_ratio': {7.0: 0, 7.5: 1, ..., 23.0: 31},
 'horsepower': {48.0: 0, 52.0: 1, ..., 288.0: 58},
 'peak_rpm': {4150.0: 0, ..., 6600.0: 22},
 'city_mpg': {13: 0, 14: 1, 15: 2, ..., 49: 28},
 'highway_mpg': {16: 0, ..., 53: 28, 54: 29},
 'price': {5118.0: 0, 5151.0: 1, ..., 41315.0: 184, 45400.0: 185}}
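For example, a single lookup returns the code the model was trained on (values taken from the map above):

col_dictionary['make']['audi']         # 1
col_dictionary['fuel_type']['gas']     # 1
col_dictionary['body_style']['sedan']  # 3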
And then use this map to convert the array you want to predict:
prediction_values = [2, 164, 'audi', 'gas', 'std', 'four', 'sedan', 'fwd', 'front',
                     99.8, 176.6, 66.2, 54.3, 2337, 'ohc', 'four', 109, 'mpfi',
                     3.19, 3.4, 10, 102, 5500, 30, 13950]

# Translate each human-readable value into its category code;
# zip pairs the values with the 23 feature columns in X, so any extra trailing values are ignored
to_predict = []
for (column, value) in zip(X.columns, prediction_values):
    to_predict.append(col_dictionary[column][value])

to_predict_df = pd.DataFrame([to_predict], columns=X.columns)
clf.predict([to_predict_df.iloc[0].values])
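Two small notes beyond the snippet above: predicting on the DataFrame itself (rather than .iloc[0].values) keeps the column names and avoids the "X does not have valid feature names" warning, and the value that comes back is itself a category code for city_mpg, so it has to be mapped back through the same dictionary to get an actual mpg figure. A sketch of both ideas wrapped in a helper (predict_city_mpg is just an illustrative name, not part of the original code):

def predict_city_mpg(raw_values):
    # Encode the human-readable values into the codes the model was trained on
    encoded = [col_dictionary[column][value]
               for column, value in zip(X.columns, raw_values)]
    row = pd.DataFrame([encoded], columns=X.columns)
    # Predicting on a DataFrame keeps feature names, so no UserWarning is raised
    code = clf.predict(row)[0]
    # Map the predicted city_mpg code back to the original mpg value
    reverse_mpg = {c: v for v, c in col_dictionary['city_mpg'].items()}
    return reverse_mpg[code]

predict_city_mpg([2, 164, 'audi', 'gas', 'std', 'four', 'sedan', 'fwd', 'front',
                  99.8, 176.6, 66.2, 54.3, 2337, 'ohc', 'four', 109, 'mpfi',
                  3.19, 3.4, 10, 102, 5500])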