I’m trying to build a Voting Ensemble model with a data transformation pipeline. I still need to add the transformation of the response variable to the pipeline. I’m using GridSearchCV to evaluate the best parameters for each algorithm, but when I run the last code block, I get an error.
```python
dummy = pd.get_dummies(df['over_30'])
df = pd.concat((df, dummy), axis=1)
df = df.drop(['over_30', 'N'], axis=1)
df = df.rename(columns={'Y': 'over_30'})

X, y = df.drop(['over_30'], axis=1), df[['over_30']]

categorical = ['business_sector', 'state']
numerical = ['valor_contrato', 'prazo', 'num_avalistas', 'annual_revenue',
             'risk', 'carteira_vencer_curto_prazo', 'carteira_vencer_longo_prazo',
             'risk_fintech_fidc', 'risk_pos_money', 'alavancagem_rate', 'patrimonio_socios',
             'target_amount', 'score', 'pib', 'company_month', 'week_month']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

variable_transformer = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical),
        ('categorical', categorical_transformer, categorical)],
    remainder='passthrough')

classifiers = [
    XGBClassifier(),
    LGBMClassifier(),
    RandomForestClassifier()
]

xgbclassifier_parameters = {
    'classifier__eta': [0.001, 0.3],
    'classifier__gamma': [0],
    'classifier__max_depth': [3, 7],
    'classifier__grow_policy': ['lossguide', 'depthwise'],
    'classifier__objective': ['reg:logistic'],
    'classifier__reg_lambda': [1.25, 1],
    'classifier__subsample': [0.5, 0.6, 0.7],
    'classifier__tree_method': ['auto', 'hist'],
    'classifier__colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'classifier__max_leaves': [0, 7]
}

randomforest_parameters = {
    'classifier__n_estimators': [200, 500],
    'classifier__max_features': ['auto', 'sqrt', 'log2'],
    'classifier__max_depth': [4, 5, 6, 7, 8],
}

lightgbm_parameters = {
    'classifier__num_leaves': [31, 127],
    'classifier__reg_alpha': [0.1, 0.5],
    'classifier__min_data_in_leaf': [30, 50, 100, 300, 400],
    'classifier__lambda_l1': [0, 1, 1.5],
    'classifier__lambda_l2': [0, 1]
}

parameters = [
    xgbclassifier_parameters,
    randomforest_parameters,
    lightgbm_parameters
]

estimators = []

# iterate through each classifier and use GridSearchCV
for i, classifier in enumerate(classifiers):
    # create a Pipeline object
    pipe = Pipeline(steps=[
        ('transformer', variable_transformer),
        ('classifier', classifier)
    ])
    clf = GridSearchCV(pipe,
                       param_grid=parameters[i],
                       scoring=['f1_weighted',
                                'f1_macro',
                                'recall',
                                'roc_auc',
                                'precision'],
                       refit='recall',
                       cv=8)
    clf.fit(X, y)
    print("Tuned Hyperparameters :", clf.best_params_)
    print("Recall:", clf.best_score_)
    # add the clf to the estimators list
    estimators.append((classifier.__class__.__name__, clf))
```
91
But when I run this last cell, I get this error:
```
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False). Check the list of available parameters with `estimator.get_params().keys()`.
```
Can someone help me?
Answer
Please always post the full stack trace of the error, so that people can understand it.
There are multiple mistakes in your code:

- You are creating the Pipeline object using `variable_transformer`, but where are you fitting it?
- What are `X` and `y`?

Solution:

- Separate `X` (the input features needed for training) and `y` (the output variable values the model has to learn).
- Create the pipeline object; it is a wrapper that does the preprocessing for you, so fit it first before giving the input features to the model.
- After fitting the pipeline object, give the resulting numpy array to the classifier as `X`, together with the corresponding `y`, to fit the model/classifier.
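The error message itself also points at the quickest check: before building a `param_grid`, list the parameter names the estimator actually accepts with `get_params().keys()`. A minimal sketch (the step name `classifier` here just mirrors the question's pipeline; the snippet is illustrative, not part of the original code):

```python
# As the error message suggests, get_params().keys() lists every parameter
# name an estimator accepts -- useful for validating param_grid keys.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rf = RandomForestClassifier()
print(sorted(rf.get_params().keys()))    # plain names: 'max_depth', 'n_estimators', ...

# Inside a Pipeline, the same parameters appear with the step-name prefix:
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('classifier', rf)])
print(sorted(pipe.get_params().keys()))  # prefixed: 'classifier__max_depth', ...
```

If a grid key is not in this list (for example, a `classifier__` prefix used on a bare estimator), GridSearchCV raises exactly the "Invalid parameter" error shown above.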
I am showing an example with regressors, using data that I had handy.
```python
# median_house_value is what I am trying to estimate.
# input_features = ['longitude','latitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income']

df = pd.read_csv("/filepath/california_housing_train.csv")
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

categorical = []
numerical = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
             'total_bedrooms', 'population', 'households',
             'median_income']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

variable_transformer = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical),
        ('categorical', categorical_transformer, categorical)],
    remainder='passthrough')

regressors = [XGBRegressor(), RandomForestRegressor()]

xgbregressor_parameters = {
    'regressor__grow_policy': ['lossguide', 'depthwise'],
    'regressor__objective': ['reg:squarederror'],
    'regressor__colsample_bytree': [0.7, 0.8, 0.9, 1.0],
    'regressor__max_leaves': [0, 7]}

randomforest_parameters = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8]}

parameters = [xgbregressor_parameters, randomforest_parameters]

estimators = []

pipe = Pipeline(steps=[
    ('transformer', variable_transformer)])

# fit the pipeline with input features for preprocessing
prepared_data = pipe.fit_transform(X)

# iterate through each regressor and use GridSearchCV
for i, regressor in enumerate(regressors):

    clf = GridSearchCV(regressor,
                       param_grid=parameters[i],
                       scoring=['neg_mean_squared_error',
                                'r2',
                                'explained_variance',
                                ],
                       refit='neg_mean_squared_error',
                       cv=2)
    clf.fit(prepared_data, y)
    print("Tuned Hyperparameters :", clf.best_params_)

    # add the clf to the estimators list
    estimators.append((regressor.__class__.__name__, clf))

# Output:
# Tuned Hyperparameters : {'regressor__colsample_bytree': 0.7, 'regressor__grow_policy': 'lossguide', 'regressor__max_leaves': 0, 'regressor__objective': 'reg:squarederror'}
# Tuned Hyperparameters : {'max_depth': 4, 'max_features': 'sqrt', 'n_estimators': 200}
```
Note: delete the `classifier__` tag prepended to the parameter names for RandomForestClassifier in your case.
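As an aside, your original pipeline-with-classifier layout is also valid (and it refits the preprocessor inside each CV fold, so the held-out fold does not leak into the scaling), as long as every `param_grid` key carries the step-name prefix. A minimal sketch with synthetic data; the column names are illustrative only and borrowed from your question:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'valor_contrato': rng.normal(size=200),             # numeric feature
    'prazo': rng.normal(size=200),                      # numeric feature
    'state': rng.choice(['SP', 'RJ', 'MG'], size=200),  # categorical feature
})
y = rng.integers(0, 2, size=200)                        # binary target

preprocess = ColumnTransformer(transformers=[
    ('numeric', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())]), ['valor_contrato', 'prazo']),
    ('categorical', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['state']),
])

# Preprocessing and classifier live in ONE pipeline, so GridSearchCV refits
# the transformer on each training fold during cross-validation.
pipe = Pipeline(steps=[
    ('transformer', preprocess),
    ('classifier', RandomForestClassifier(random_state=0))])

# Every grid key must be '<step name>__<parameter>' -- here the step is 'classifier'.
grid = GridSearchCV(pipe,
                    param_grid={'classifier__n_estimators': [50, 100],
                                'classifier__max_depth': [3, 5]},
                    scoring='recall',
                    cv=3)
grid.fit(X, y)
print("Tuned Hyperparameters :", grid.best_params_)
```

With this layout you keep the prefixes exactly as in your question; the "Invalid parameter" error only appears when the prefixed names are handed to a bare estimator instead of the pipeline.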