I’m trying to build a Voting Ensemble model, with a data transformation pipeline. I still need to put the transformation of the response variable into the pipeline. I’m trying to use GridSearchCV to evaluate the best parameters for each algorithm, but when I try to run the last code block, I get an error.
    dummy = pd.get_dummies(df['over_30'])
    df = pd.concat((df, dummy), axis=1)
    df = df.drop(['over_30', 'N'], axis=1)
    df = df.rename(columns={'Y': 'over_30'})

    X, y = df.drop(['over_30'], axis=1), df[['over_30']]

    categorical = ['business_sector', "state"]
    numerical = ['valor_contrato', 'prazo', 'num_avalistas', 'annual_revenue', 'risk',
                 'carteira_vencer_curto_prazo', 'carteira_vencer_longo_prazo',
                 'risk_fintech_fidc', 'risk_pos_money', 'alavancagem_rate',
                 'patrimonio_socios', 'target_amount', 'score', 'pib',
                 'company_month', 'week_month']

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    variable_transformer = ColumnTransformer(
        transformers=[
            ('numeric', numeric_transformer, numerical),
            ('categorical', categorical_transformer, categorical)],
        remainder='passthrough')

    classifiers = [
        XGBClassifier(),
        LGBMClassifier(),
        RandomForestClassifier()
    ]

    xgbclassifier_parameters = {
        'classifier__eta': [0.001, 0.3],
        'classifier__gamma': [0],
        'classifier__max_depth': [3, 7],
        'classifier__grow_policy': ['lossguide', 'deptwise'],
        'classifier__objective': ['reg:logistic'],
        'classifier__reg_lambda': [1.25, 1],
        'classifier__subsample': [0.5, 0.6, 0.7],
        'classifier__tree_method': ['auto', 'hist'],
        'classifier__colsample_bytree': [0.7, 0.8, 0.9, 1.0],
        'classifier__max_leaves': [0, 7]
    }

    randomforest_paramenters = {
        'classifier__n_estimators': [200, 500],
        'classifier__max_features': ['auto', 'sqrt', 'log2'],
        'classifier__max_depth': [4, 5, 6, 7, 8],
    }

    lightgbm_parameters = {
        'classifier__num_leaves': [31, 127],
        'classifier__reg_alpha': [0.1, 0.5],
        'classifier__min_data_in_leaf': [30, 50, 100, 300, 400],
        'classifier__lambda_l1': [0, 1, 1.5],
        'classifier__lambda_l2': [0, 1]
    }

    parameters = [
        xgbclassifier_parameters,
        randomforest_paramenters,
        lightgbm_parameters
    ]
    estimators = []

    # iterate through each classifier and use GridSearchCV
    for i, classifier in enumerate(classifiers):
        # create a Pipeline object
        pipe = Pipeline(steps=[
            ('transformer', variable_transformer),
            ('classifier', classifier)
        ])
        clf = GridSearchCV(pipe,
                           param_grid=parameters[i],
                           scoring=['f1_weighted', 'f1_macro', 'recall', 'roc_auc', 'precision'],
                           refit='recall',
                           cv=8)
        clf.fit(X, y)
        print("Tuned Hyperparameters :", clf.best_params_)
        print("Recall:", clf.best_score_)
        # add the clf to the estimators list
        estimators.append((classifier.__class__.__name__, clf))
But when I run this last cell, I get this error:
min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False). Check the list of available parameters with `estimator.get_params().keys()`.
Can someone help me?
Answer
Please always post the full stack trace of the error so that people can understand the problem.
There are multiple mistakes in your code:
- You are creating a Pipeline object using variable_transformer; where are you fitting it?
- What are X and y?
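The error text in the question ends by pointing at `estimator.get_params().keys()`. A minimal sketch of that check follows, assuming a small two-step pipeline built only from scikit-learn classes. The step names and the `candidate_grid` dictionary are hypothetical and chosen for illustration; they are not the asker's full setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-step pipeline; the step names are illustrative.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier()),
])

# Every name that a param_grid for this pipeline is allowed to use.
valid_names = sorted(pipe.get_params().keys())

# Check a candidate grid against the valid names before running a search.
candidate_grid = {
    "classifier__n_estimators": [200, 500],  # a RandomForest parameter
    "classifier__num_leaves": [31, 127],     # a LightGBM parameter, unknown here
}
unknown = [name for name in candidate_grid if name not in valid_names]
print("Names this pipeline does not accept:", unknown)
```

Running a check like this before `GridSearchCV.fit` can show quickly whether a grid contains names that the estimator it is paired with does not know.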
Solution:
- Separate X -> the input features needed for training, and y -> the output variable values which the model has to learn.
- Create the pipeline object. It is a wrapper that does the preprocessing for you, so fit it first before giving the input features to the model.
- After fitting the pipeline object, give the resulting NumPy array to the classifier as X, together with the corresponding y, to fit the model/classifier.
I am showing an example of regressors with the data that I had handy.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from xgboost import XGBRegressor

    # median_house_value is what I am trying to estimate.
    # input_features = ['longitude','latitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income']
    df = pd.read_csv("/filepath/california_housing_train.csv")
    X = df.drop("median_house_value", axis=1)
    y = df["median_house_value"]

    categorical = []
    numerical = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
                 'total_bedrooms', 'population', 'households', 'median_income']

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    variable_transformer = ColumnTransformer(
        transformers=[
            ('numeric', numeric_transformer, numerical),
            ('categorical', categorical_transformer, categorical)],
        remainder='passthrough')

    regressors = [XGBRegressor(), RandomForestRegressor()]

    xgbregressor_parameters = {
        'regressor__grow_policy': ['lossguide', 'deptwise'],
        'regressor__objective': ['reg:squarederror'],
        'regressor__colsample_bytree': [0.7, 0.8, 0.9, 1.0],
        'regressor__max_leaves': [0, 7]}

    randomforest_parameters = {
        'n_estimators': [200, 500],
        'max_features': ['auto', 'sqrt', 'log2'],
        'max_depth': [4, 5, 6, 7, 8]}

    parameters = [xgbregressor_parameters, randomforest_parameters]

    estimators = []

    pipe = Pipeline(steps=[('transformer', variable_transformer)])

    # fit the pipeline with input features for preprocessing
    prepared_data = pipe.fit_transform(X)

    # iterate through each regressor and use GridSearchCV
    for i, regressor in enumerate(regressors):
        clf = GridSearchCV(regressor,
                           param_grid=parameters[i],
                           scoring=['neg_mean_squared_error', 'r2', 'explained_variance'],
                           refit='neg_mean_squared_error',
                           cv=2)
        clf.fit(prepared_data, y)
        print("Tuned Hyperparameters :", clf.best_params_)
        # add the clf to the estimators list
        estimators.append((regressor.__class__.__name__, clf))

Output:

    Tuned Hyperparameters : {'regressor__colsample_bytree': 0.7, 'regressor__grow_policy': 'lossguide', 'regressor__max_leaves': 0, 'regressor__objective': 'reg:squarederror'}
    Tuned Hyperparameters : {'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 200}
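Since the question's goal is a voting ensemble, here is a rough sketch of how a list of `(name, estimator)` pairs like `estimators` could feed one. It assumes synthetic data from `make_regression`, scikit-learn models only (`Ridge` stands in for a second model), and deliberately tiny grids; none of these come from the original post. It stores `best_estimator_` rather than the search object, on the assumption that the ensemble should not repeat each grid search when it is fitted.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not the housing data from the answer).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Each model sits next to its own (illustrative) grid.
candidates = [
    (RandomForestRegressor(n_estimators=30, random_state=0), {"max_depth": [4, 6]}),
    (Ridge(), {"alpha": [0.1, 1.0]}),
]

estimators = []
for regressor, grid in candidates:
    search = GridSearchCV(regressor, param_grid=grid,
                          scoring="neg_mean_squared_error", cv=2)
    search.fit(X, y)
    # Keep the tuned model itself, not the whole search object.
    estimators.append((regressor.__class__.__name__, search.best_estimator_))

# Combine the tuned models; VotingRegressor averages their predictions.
ensemble = VotingRegressor(estimators=estimators)
ensemble.fit(X, y)
predictions = ensemble.predict(X[:5])
print(predictions.shape)
```

For a classification problem like the one in the question, `VotingClassifier` takes the same kind of `(name, estimator)` list.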
Note: In your case, if you pass RandomForestClassifier to GridSearchCV directly (not wrapped in a Pipeline), delete the classifier__ prefix prepended to its parameter names.
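The rule behind this note is that names in `param_grid` need a `<step name>__` prefix only when the estimator handed to `GridSearchCV` is a `Pipeline`; a bare estimator takes unprefixed names. It may also be worth checking that each estimator is paired with its own grid. In the question, `classifiers` lists the random forest third while `parameters` lists its grid second, which would be consistent with the error text listing `RandomForestClassifier` parameters. Below is a minimal sketch of keeping the two together while leaving the preprocessing inside the pipeline. The synthetic data, the tiny grids, and the use of `GradientBoostingClassifier` as a second model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption, not the asker's dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Keep every estimator next to its own grid so the two cannot drift apart.
# The grid values are illustrative only.
search_space = [
    (RandomForestClassifier(random_state=0),
     {"classifier__n_estimators": [20, 40], "classifier__max_depth": [3, 5]}),
    (GradientBoostingClassifier(n_estimators=20, random_state=0),
     {"classifier__learning_rate": [0.05, 0.1], "classifier__max_depth": [2, 3]}),
]

estimators = []
for classifier, grid in search_space:
    # The model lives in a step named "classifier", so every grid key
    # carries the "classifier__" prefix.
    pipe = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("classifier", classifier),
    ])
    search = GridSearchCV(pipe, param_grid=grid, scoring="recall", cv=3)
    search.fit(X, y)
    print(classifier.__class__.__name__, search.best_params_)
    estimators.append((classifier.__class__.__name__, search))
```

With the preprocessing kept inside the pipeline, `GridSearchCV` refits the transformer on the training part of each fold, so the prefixed names remain the right ones in this variant.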