I have a GridSearchCV, with a pipeline that looks something like this:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
])

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(solver='lbfgs'))
])
My GridSearchCV uses 5-fold cross-validation and looks like this:
search = GridSearchCV(clf, param_grid, cv=5, scoring="roc_auc", error_score=0.0)
So, how do I ensure that the data is split first and only then imputed with the most frequent value?
Answer
GridSearchCV will run roughly like this:
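from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Assumes X and y are NumPy arrays; with pandas objects use .iloc[...] instead
for train_index, val_index in StratifiedKFold(n_splits=5).split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    clf = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(solver='lbfgs'))
    ])
    # Imputer, scaler and classifier are all fitted on the training part only
    clf.fit(X_train, y_train)
    # The validation part is only transformed and scored (roc_auc, as in the question)
    score = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])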
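Because SimpleImputer and StandardScaler are steps inside the pipeline, you can be sure they are re-fitted for each fold: .fit() sees only that fold's training part, and the validation part only goes through .transform(). The split therefore always happens before the most frequent value is computed.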
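If you want to check this yourself, here is a minimal sketch. It rests on a few assumptions that are not in the question: the toy data is made up, the ColumnTransformer is dropped so the imputer is a direct pipeline step, and cross_validate is used instead of GridSearchCV because it can hand back the fitted estimator of every fold via return_estimator=True. It compares the fill value each fold's imputer learned with the mode of that fold's training rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): one numeric feature with missing values
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(100, 1)).astype(float)
X[::10] = np.nan
y = np.tile([0, 1], 50)

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(solver='lbfgs')),
])

# Without shuffling, StratifiedKFold yields the same splits every time it is iterated
cv = StratifiedKFold(n_splits=5)
result = cross_validate(pipe, X, y, cv=cv, return_estimator=True)

# For each fold, compare the imputer's learned fill value with the
# most frequent value computed from that fold's training rows alone
fold_checks = []
for (train_index, _), fitted in zip(cv.split(X, y), result['estimator']):
    learned = fitted.named_steps['imputer'].statistics_[0]
    expected = SimpleImputer(strategy='most_frequent').fit(X[train_index]).statistics_[0]
    fold_checks.append(learned == expected)
    print(learned, expected)
```

If every entry in fold_checks is True, each imputer was fitted on its own training split and never saw the validation rows.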