I have a corpus of documents like the following: 'C C C 0 0 0 X 0 1 0 0 0 0', 'C C C 0 0 0 X 0 1 0 0 0 0', 'C C C 0 0 0 X 0 1 0 0 0 0', 'X X X', 'X X X', 'X X X'. I would like to use a count and a TF-IDF vectorizer along with logistic regression as a classifier. I adapted the code below from sklearn's samples.
from pprint import pprint
from time import time
import logging
import pickle

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

print(__doc__)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer(analyzer='char', lowercase=False)),
    ('tfidf', TfidfVectorizer(analyzer='char', lowercase=False)),
    ('clf', LogisticRegression()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (1000,),
    'clf__C': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    corpus = [
        'C C C 0 0 0 X 0 1 0 0 0 0',
        'C C C 0 0 0 X 0 1 0 0 0 0',
        'C C C 0 0 0 X 0 1 0 0 0 0',
        'X X X',
        'X X X',
        'X X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 X',
        'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
        'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
        'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
        'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
        'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
        'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
        'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
        'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
        'X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 0',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
        'C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C X X X X 0 0 0 X 0 X X',
    ]
    y_train = [0, 0, 0, 0, 0, 0,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    print(len(corpus), len(y_train))

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    #print(type(data.data),type(data.target))
    #print(data.data[:1])
    #print(data.data[:2])
    grid_search.fit(corpus, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
My stack trace is as follows:
Automatically created module for IPython interactive environment
50 50
Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__C': (1e-05, 1e-06),
 'clf__max_iter': (1000,),
 'clf__penalty': ('l2', 'elasticnet'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__ngram_range': ((1, 1), (1, 2))}
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    0.1s finished
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-114-0d47590b1279> in <module>
    107 #print(data.data[:2])
    108
--> 109 grid_search.fit(corpus,y_train)
    110 print("done in %0.3fs" % (time() - t0))
    111 print()

E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
    737             refit_start_time = time.time()
    738             if y is not None:
--> 739                 self.best_estimator_.fit(X, y, **fit_params)
    740             else:
    741                 self.best_estimator_.fit(X, **fit_params)

E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
    348             This estimator
    349         """
--> 350         Xt, fit_params = self._fit(X, y, **fit_params)
    351         with _print_elapsed_time('Pipeline',
    352                                  self._log_message(len(self.steps) - 1)):

E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
    313                 message_clsname='Pipeline',
    314                 message=self._log_message(step_idx),
--> 315                 **fit_params_steps[name])
    316             # Replace the transformer of the step with the fitted
    317             # transformer. This is necessary when loading the transformer

E:\anaconda\envs\appliedaicourse\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
    350
    351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
    353
    354     def call_and_shelve(self, *args, **kwargs):

E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    726     with _print_elapsed_time(message_clsname, message):
    727         if hasattr(transformer, 'fit_transform'):
--> 728             res = transformer.fit_transform(X, y, **fit_params)
    729         else:
    730             res = transformer.fit(X, y, **fit_params).transform(X)

E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1857         """
   1858         self._check_params()
-> 1859         X = super().fit_transform(raw_documents)
   1860         self._tfidf.fit(X)
   1861         # X is already a transformed view of raw_documents so

E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1218
   1219         vocabulary, X = self._count_vocab(raw_documents,
-> 1220                                           self.fixed_vocabulary_)
   1221
   1222         if self.binary:

E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1129         for doc in raw_documents:
   1130             feature_counter = {}
-> 1131             for feature in analyze(doc):
   1132                 try:
   1133                     feature_idx = vocabulary[feature]

E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in _analyze(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)
    108             doc = ngrams(doc, stop_words)
    109         else:
--> 110             doc = ngrams(doc)
    111     return doc
    112

E:\anaconda\envs\appliedaicourse\lib\site-packages\sklearn\feature_extraction\text.py in _char_ngrams(self, text_document)
    255         """Tokenize text_document into a sequence of character n-grams"""
    256         # normalize white spaces
--> 257         text_document = self._white_spaces.sub(" ", text_document)
    258
    259         text_len = len(text_document)

TypeError: expected string or bytes-like object
I ran the TfidfVectorizer alone and got the following results:
vectorizer = TfidfVectorizer(analyzer='char', lowercase=False, ngram_range=(6, 6))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.shape)
print(X)
Results
<class 'list'>
[' 0 0 0', ' 0 0 X', ' 0 1 0', ' 0 X 0', ' 0 X X', ' 1 0 0', ' C 0 0', ' C C 0', ' C C C', ' C C X', ' C X X', ' X 0 0', ' X 0 1', ' X 0 X', ' X X 0', ' X X X', '0 0 0 ', '0 0 X ', '0 1 0 ', '0 X 0 ', '1 0 0 ', 'C 0 0 ', 'C C 0 ', 'C C C ', 'C C X ', 'C X X ', 'X 0 0 ', 'X 0 1 ', 'X 0 X ', 'X X 0 ', 'X X X ']
(50, 31)
  (0, 20)	0.31810783213188626
  (0, 5)	0.31810783213188626
  (0, 18)	0.31810783213188626
  (0, 2)	0.31810783213188626
  (0, 27)	0.31810783213188626
  (0, 12)	0.31810783213188626
  (0, 19)	0.16116825632411622
  (0, 3)	0.16116825632411622
  (0, 17)	0.16116825632411622
  (0, 1)	0.11378963445554637
  (0, 16)	0.22757926891109273
  (0, 0)	0.3413689033666391
  (0, 21)	0.17370780684495662
  (0, 6)	0.17370780684495662
  (0, 22)	0.17370780684495662
  (0, 7)	0.17370780684495662
  (0, 23)	0.11378963445554637
  (1, 20)	0.31810783213188626
  (1, 5)	0.31810783213188626
  (1, 18)	0.31810783213188626
  ...	...	...
  (49, 1)	0.01436413072356797
  (49, 16)	0.01436413072356797
  (49, 0)	0.01436413072356797
  (49, 23)	0.6894782747312626
My Question
Why does the standalone vectorizer work, but when it is placed inside the pipeline used by GridSearchCV I get the TypeError?
Answer
By default, both CountVectorizer and TfidfVectorizer expect a sequence of items of type string or bytes. In your pipeline, the CountVectorizer receives the raw corpus, but what it passes on to the TfidfVectorizer is a sparse representation of the counts (a scipy.sparse.csr_matrix). Since the input to TfidfVectorizer is then not of the expected type, you get the error "TypeError: expected string or bytes-like object".
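To see this concretely, here is a small standalone check (not part of the original code) that fits a CountVectorizer with the same settings on two of the corpus strings and inspects what it returns:

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer='char', lowercase=False)
counts = cv.fit_transform(['C C C 0 0 0 X 0 1 0 0 0 0', 'X X X'])

# fit_transform returns a scipy.sparse matrix of token counts rather than
# an iterable of strings, so a downstream TfidfVectorizer, whose 'char'
# analyzer tries to regex-normalize each "document", raises the TypeError.
print(type(counts))  # a scipy.sparse CSR matrix class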
Your pipeline works if you use either vectorizer, but not both. For example,

pipeline = Pipeline([
    #('vect', CountVectorizer(analyzer='char', lowercase=False)),
    ('tfidf', TfidfVectorizer(analyzer='char', lowercase=False)),
    ('clf', LogisticRegression()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    #'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    #'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams or bigrams
    'tfidf__use_idf': [True, False],
    'tfidf__norm': ['l1', 'l2'],
    'clf__max_iter': [1000],
    'clf__C': [0.00001, 0.000001],
    'clf__penalty': ['l2'],
    # 'clf__max_iter': (10, 50, 80),
}
produces the following output:
50 50
Performing grid search...
pipeline: ['tfidf', 'clf']
parameters:
{'clf__C': [1e-05, 1e-06],
 'clf__max_iter': [1000],
 'clf__penalty': ['l2'],
 'tfidf__norm': ['l1', 'l2'],
 'tfidf__use_idf': [True, False]}
Fitting 5 folds for each of 8 candidates, totalling 40 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
done in 0.347s

Best score: 0.680
Best parameters set:
	clf__C: 1e-05
	clf__max_iter: 1000
	clf__penalty: 'l2'
	tfidf__norm: 'l1'
	tfidf__use_idf: True
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    0.2s finished
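If you do want the counting and the TF-IDF weighting as two separate pipeline steps, the usual alternative is to follow CountVectorizer with a TfidfTransformer, which operates on a count matrix rather than on raw documents (a TfidfVectorizer is equivalent to a CountVectorizer followed by a TfidfTransformer). A minimal sketch of that variant, keeping your vectorizer settings:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # consumes the raw string documents and produces a sparse count matrix
    ('vect', CountVectorizer(analyzer='char', lowercase=False)),
    # reweights the count matrix, so every step receives the input type it expects
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])

With this layout, the 'vect__*' and 'tfidf__*' entries of the parameter grid can be searched together again.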