Pipeline with count and tfidf vectorizer produces TypeError: expected string or bytes-like object

I have a corpus like the following: ‘C C C 0 0 0 X 0 1 0 0 0 0’, ‘C C C 0 0 0 X 0 1 0 0 0 0’, ‘C C C 0 0 0 X 0 1 0 0 0 0’, ‘X X X’, ‘X X X’, ‘X X X’. I would like to use a count vectorizer and a tfidf vectorizer together with logistic regression as the classifier. I adapted the code below from sklearn’s examples.

JavaScript

My stack trace is as follows:

JavaScript

I ran the tfidf vectorizer alone and got the following results:

JavaScript

Results

JavaScript

My Question

Why does the standalone vectorizer work, while the same vectorizer raises the TypeError when it is placed in a pipeline that is used by the grid search?

Answer

By default, both CountVectorizer and TfidfVectorizer expect an iterable of raw documents, each of which is a string or bytes object. In your pipeline, CountVectorizer receives the corpus and passes its output, a sparse count matrix (scipy.sparse.csr_matrix), on to TfidfVectorizer. Because the rows of that matrix are not strings, TfidfVectorizer fails while trying to tokenize them, and you get “TypeError: expected string or bytes-like object”. Your pipeline works if you use either vectorizer, but not both. For example,

JavaScript

produces the following output:

JavaScript
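
As a minimal sketch of the point above (not the original code from the question), the snippet below first shows that CountVectorizer returns a sparse matrix rather than strings, then builds two pipelines that each avoid chaining the two vectorizers. The second option relies on TfidfTransformer, which accepts a count matrix as input; with default settings, CountVectorizer followed by TfidfTransformer is equivalent to TfidfVectorizer. The `labels` list, the `token_pattern` value (needed so single-character tokens such as `C` and `0` are kept) and the step names are assumptions made for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Corpus from the question; the labels are hypothetical.
corpus = ["C C C 0 0 0 X 0 1 0 0 0 0"] * 3 + ["X X X"] * 3
labels = [0, 0, 0, 1, 1, 1]

# Assumption: keep one-character tokens, which the default pattern drops.
token_pattern = r"(?u)\b\w+\b"

# CountVectorizer turns strings into a sparse count matrix, so a second
# text vectorizer placed after it would no longer receive strings.
counts = CountVectorizer(token_pattern=token_pattern).fit_transform(corpus)
print(issparse(counts), counts.shape)

# Option A: a single text vectorizer that produces tf-idf directly.
pipe_a = Pipeline([
    ("tfidf", TfidfVectorizer(token_pattern=token_pattern)),
    ("clf", LogisticRegression()),
])

# Option B: counts first, then TfidfTransformer, which takes a count matrix.
pipe_b = Pipeline([
    ("vect", CountVectorizer(token_pattern=token_pattern)),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])

pred_a = pipe_a.fit(corpus, labels).predict(corpus)
pred_b = pipe_b.fit(corpus, labels).predict(corpus)
print(pred_a.tolist(), pred_b.tolist())
```

Either pipeline can be passed to a grid search in place of the one that chains both vectorizers; option B is convenient when you want to tune the counting and the tf-idf weighting steps separately.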
User contributions licensed under: CC BY-SA