How do I make sure GridSearchCV does the cross-validation split first and the imputing afterwards?

Question

I have a GridSearchCV wrapped around a pipeline, with cross-validation set to 5 folds. How do I ensure that the data is split first, and that missing values are only then imputed with the most frequent value?

Accepted Answer

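GridSearchCV will run roughly like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline

# Rough sketch for one parameter setting; preprocessor, X and y come from the question.
# X[train_index] assumes NumPy arrays; use X.iloc[train_index] for a DataFrame.
for train_index, val_index in StratifiedKFold(n_splits=5).split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    clf = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(solver='lbfgs'))
    ])

    clf.fit(X_train, y_train)  # imputer and scaler are fitted on the training fold only
    clf.score(X_val, y_val)    # the validation fold is only transformed, then scored
```

You can be sure that SimpleImputer and StandardScaler are refitted for each fold: .fit() and .transform() run on the training split, and only .transform() is applied to the validation split. The most frequent value is therefore always computed after the split, from the training data alone.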
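If you want to check this yourself, here is a minimal sketch that records how many rows the imputer sees each time it is fitted. The data is synthetic, and `RecordingImputer`, the step names and the `C` grid are assumptions made up for illustration; they are not taken from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingImputer(SimpleImputer):
    """Hypothetical helper: a SimpleImputer that logs the row count of every fit()."""

    # Class-level list, so it is shared by the clones GridSearchCV creates.
    fit_sizes = []

    def fit(self, X, y=None):
        RecordingImputer.fit_sizes.append(X.shape[0])
        return super().fit(X, y)


# Synthetic data: 100 rows, 3 discrete-valued features, about 10% missing values.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 3)).astype(float)
X[rng.random(X.shape) < 0.1] = np.nan
y = np.array([0, 1] * 50)  # balanced binary target

pipe = Pipeline(steps=[
    ("imputer", RecordingImputer(strategy="most_frequent")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(solver="lbfgs")),
])

# Two candidate values of C and 5 folds; n_jobs is left at its default so that
# all fits run in this process and the shared list is filled in.
search = GridSearchCV(pipe, param_grid={"classifier__C": [0.1, 1.0]}, cv=5)
search.fit(X, y)

print(RecordingImputer.fit_sizes)
```

With 100 rows and 5 folds, every cross-validation fit should see 80 rows, never the full 100. The single 100-row entry at the end of the list comes from the final refit on the whole dataset (refit=True is the default), which happens after all the scoring is done.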

