
Scikit-learn pipeline: Non-finite test scores error / Inconsistent number of samples

I have a dataframe with two columns: the texts themselves and the POS tags of the same texts. I want to use both columns as features in a language classification model. This is what the data looks like: X_train.head()

This is what the shape of the data looks like:

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

X_train.shape[0] != y_train.shape[0]

(11000, 2)
(11000,)
(1100, 2)
(1100,)
False

When I run my estimator on either one of the columns in my training set individually, it works fine. But as soon as I include both columns together and run my estimator:

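from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

scaler = MaxAbsScaler()
count_vect = CountVectorizer(lowercase=False, max_features=1000)
clf = SVC()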

pipe = make_pipeline(count_vect, scaler, clf)

params = [{
    'countvectorizer__analyzer': ['word', 'char'],
    'countvectorizer__ngram_range': [(1, 1), (1, 2)],
    'svc__kernel': ['linear', 'rbf', 'poly']
    }]

gs = GridSearchCV(pipe, params, cv=3, scoring='accuracy', n_jobs=-1, refit=True, verbose=1)
gs.fit(X_train, y_train)

print(gs.best_score_)
print(gs.best_params_)

I get this error:

UserWarning: One or more of the test scores are non-finite: [nan nan nan nan nan nan nan nan nan nan nan nan]
ValueError: Found input variables with inconsistent numbers of samples: [2, 11000]

I have tried converting the data from a Series to strings and calling .transpose() on it, but neither has worked. I don't understand what is causing the NaN. Can you please help?


Answer

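The problem is that CountVectorizer expects a 1D iterable of strings, one document per element. When the pipeline passes it the whole two-column DataFrame, it iterates over the column labels instead of the rows, so it produces only 2 samples instead of 11000; that is the mismatch the ValueError reports. Every fit in the grid search fails for this reason, and GridSearchCV records a failed fit as NaN by default, hence the non-finite test scores warning. You can get around that by using a ColumnTransformer with two copies of the vectorizer, one for each column. Pass each column name as a plain string rather than a list, so that each vectorizer receives a 1D column.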
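To make the shape mismatch concrete, here is a minimal sketch on a toy frame. The frame X_demo and its contents are made up for illustration; only the column names text and pos mirror the assumption used in this answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy frame standing in for X_train (contents are invented).
X_demo = pd.DataFrame({
    "text": ["the cat sat", "a dog ran", "birds fly high"],
    "pos": ["DET NOUN VERB", "DET NOUN VERB", "NOUN VERB ADV"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
print(list(X_demo))  # ['text', 'pos']

# So the vectorizer treats the two labels as the only "documents".
whole_frame = CountVectorizer(lowercase=False).fit_transform(X_demo)
print(whole_frame.shape[0])  # one row per column label, not per sample

# A single column is a 1D sequence of strings: one document per row.
one_column = CountVectorizer(lowercase=False).fit_transform(X_demo["text"])
print(one_column.shape[0])  # one row per sample
```

The first matrix has as many rows as the frame has columns, while the second has as many rows as the frame has samples, which is what the downstream scaler and classifier need.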

For example, assuming X_train is a frame with columns text and pos:

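from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

scaler = MaxAbsScaler()
# One vectorizer instance per column, so their parameters can be tuned independently.
vectorizer = ColumnTransformer([
    ('vec_txt', CountVectorizer(lowercase=False, max_features=1000), 'text'),
    ('vec_pos', CountVectorizer(lowercase=False, max_features=1000), 'pos'),
])
clf = SVC()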

pipe = make_pipeline(vectorizer, scaler, clf)

params = {
    'columntransformer__vec_txt__analyzer': ['word', 'char'],
    'columntransformer__vec_txt__ngram_range': [(1, 1), (1, 2)],
    'columntransformer__vec_pos__analyzer': ['word', 'char'],
    'columntransformer__vec_pos__ngram_range': [(1, 1), (1, 2)],
    'svc__kernel': ['linear', 'rbf', 'poly'],
}

gs = GridSearchCV(pipe, params, cv=3, scoring='accuracy', n_jobs=-1, refit=True, verbose=1)
gs.fit(X_train, y_train)
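
If a grid search reports only NaN scores, it can help to surface the underlying exception instead of the warning. The sketch below uses the error_score="raise" option of GridSearchCV on a tiny invented dataset (X_demo, y_demo and the reduced parameter grid are all assumptions for illustration) to show the original pipeline failing and the ColumnTransformer version fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

# Hypothetical toy data; column names "text" and "pos" are assumptions.
X_demo = pd.DataFrame({
    "text": ["the cat sat", "le chat dort", "a dog ran",
             "le chien court", "birds fly high", "les oiseaux volent"],
    "pos": ["DET NOUN VERB", "DET NOUN VERB", "DET NOUN VERB",
            "DET NOUN VERB", "NOUN VERB ADV", "DET NOUN VERB"],
})
y_demo = ["en", "fr", "en", "fr", "en", "fr"]
grid = {"svc__kernel": ["linear"]}

# Pipeline as in the question: the vectorizer receives the whole frame.
broken = make_pipeline(CountVectorizer(lowercase=False), MaxAbsScaler(), SVC())
gs_broken = GridSearchCV(broken, grid, cv=2, error_score="raise")
try:
    gs_broken.fit(X_demo, y_demo)
    failure = None
except ValueError as exc:
    # With error_score="raise" the real cause is raised instead of a NaN score.
    failure = str(exc)
print(failure)

# Pipeline with one vectorizer per column, selected by scalar column name.
fixed = make_pipeline(
    ColumnTransformer([
        ("vec_txt", CountVectorizer(lowercase=False), "text"),
        ("vec_pos", CountVectorizer(lowercase=False), "pos"),
    ]),
    MaxAbsScaler(),
    SVC(),
)
gs_fixed = GridSearchCV(fixed, grid, cv=2, error_score="raise")
gs_fixed.fit(X_demo, y_demo)
print(gs_fixed.best_score_)
```

Once the search runs cleanly, error_score can be set back to its default so that a single failing parameter combination does not abort the whole search.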