I have created a pipeline using sklearn so that multiple models will go through it. Since there is a vectorization step before the model is fitted, I wonder whether this vectorization is performed every time a model is fitted. If so, maybe I should move this preprocessing out of the pipeline.
log_reg = LogisticRegression()
rand_for = RandomForestClassifier()
lin_svc = LinearSVC()
svc = SVC()

# The pipeline contains both the vectorization step and the classifier
pipe = Pipeline([
    ('vect', tfidf),
    ('classifier', log_reg)
])

# Example params dictionary
params_log_reg = {
    'classifier__penalty': ['l2'],
    'classifier__C': [0.01, 0.1, 1.0, 10.0, 100.0],
    'classifier__class_weight': ['balanced', class_weights],
    'classifier__solver': ['lbfgs', 'newton-cg'],
    # 'classifier__verbose': [2],
    'classifier': [log_reg]
}

# Param dictionaries for each model
params = [params_log_reg, params_rand_for, params_lin_svc, params_svc]

# Grid search to combine it all
grid = GridSearchCV(pipe, params, cv=skf, scoring='f1_weighted')
grid.fit(features_train, labels_train[:, 0])
Answer
When you run a GridSearchCV, the pipeline steps are refit for every combination of hyperparameters. So yes, this vectorization process is performed every time the pipeline is fitted.
Have a look at the sklearn user guide on Pipelines and composite estimators.
To quote:
Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This feature is used to avoid computing the fit transformers within a pipeline if the parameters and input data are identical. A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration.
So you can set the memory parameter to cache the transformers.
from tempfile import mkdtemp

cachedir = mkdtemp()
pipe = Pipeline(estimators, memory=cachedir)
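Putting it together, here is a minimal runnable sketch of a cached pipeline inside a grid search. The toy corpus and the single logistic-regression grid are stand-ins for your features_train / labels_train and model list, which are not shown in the question; only the memory argument differs from an ordinary Pipeline. Because the vectorizer's parameters and input data are identical across candidates, its fitted state is reused from the cache instead of being recomputed for each value of C:

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for features_train / labels_train
texts = ["good movie", "bad movie", "great film", "awful film"] * 10
labels = [1, 0, 1, 0] * 10

cachedir = mkdtemp()  # temporary directory holding the transformer cache
pipe = Pipeline(
    [("vect", TfidfVectorizer()), ("classifier", LogisticRegression())],
    memory=cachedir,  # cache fitted transformers keyed on params + input data
)

# Only classifier params vary, so the cached TF-IDF fit can be reused
params = {"classifier__C": [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipe, params, cv=2, scoring="f1_weighted")
grid.fit(texts, labels)
print(grid.best_params_)

rmtree(cachedir)  # remove the cache directory when done
```

Note that caching only pays off when the transformer's own parameters are fixed; if your grid also searches over vectorizer parameters (e.g. vect__ngram_range), each distinct setting still has to be fitted once.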