I have a dataframe with two columns of texts and only the POS tags (of the same texts), which I want to use for language classification. I am trying to use both features as part of my model. This is what the data looks like: X_train.head()
This is what the shape of the data looks like:
print(X_train.shape) print(y_train.shape) print(X_test.shape) print(y_test.shape) X_train.shape[0] != y_train.shape[0]
(11000, 2) (11000,) (1100, 2) (1100,) False
When I run my estimator on either one of the coulmns in my training set individually, it works fine. But as soon as I include both columns together and run my estimator:
scaler = MaxAbsScaler() count_vect = CountVectorizer(lowercase = False, max_features = 1000) clf = SVC() pipe = make_pipeline(count_vect, scaler, clf) params = [{ 'countvectorizer__analyzer': ['word', 'char'], 'countvectorizer__ngram_range': [(1, 1), (1, 2)], 'svc__kernel': ['linear', 'rbf', 'poly'] }] gs = GridSearchCV(pipe, params, cv=3, scoring='accuracy', n_jobs=-1, refit=True, verbose=1) gs.fit(X_train, y_train) print(gs.best_score_) print(gs.best_params_)
I get this error:
UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan nan nan nan] ValueError: Found input variables with inconsistent numbers of samples: [2, 11000]
I have tried changing the type from a series to string, and running a .transpose()
function, but neither have worked. I don’t understand what is causing the Nan. Can you please help?
Advertisement
Answer
I think the problem is that CountVectorizer
expects 1D inputs. You can get around that by using a ColumnTransformer
, with two copies of the vectorizer, one for each column.
For example, assuming X_train
is a frame with columns text
and pos
:
scaler = MaxAbsScaler() count_vect = CountVectorizer(lowercase=False, max_features=1000) vectorizer = ColumnTransformer([ ('vec_txt', count_vect, 'text'), ('vec_pos', count_vect, 'pos'), ]) clf = SVC() pipe = make_pipeline(vectorizer, scaler, clf) params = { 'columntransformer__vec_txt__analyzer': ['word', 'char'], 'columntransformer__vec_txt__ngram_range': [(1, 1), (1, 2)], 'columntransformer__vec_pos__analyzer': ['word', 'char'], 'columntransformer__vec_pos__ngram_range': [(1, 1), (1, 2)], 'svc__kernel': ['linear', 'rbf', 'poly'], } gs = GridSearchCV(pipe, params, cv=3, scoring='accuracy', n_jobs=-1, refit=True, verbose=1) gs.fit(X_train, y_train)