Scikit-learn pipeline: Non-finite test scores error / Inconsistent number of samples

Question

I have a dataframe with two columns, the texts and the POS tags of the same texts, which I want to use for language classification. I am trying to use both columns as features in my model. This is what the data looks like: X_train.head() This is what the shape of the data looks like: When I run my

Accepted Answer

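I think the problem is that CountVectorizer expects 1D input (an iterable of strings), and a two-column frame is not 1D. You can get around that by using a ColumnTransformer with two copies of the vectorizer, one for each column.

For example, assuming X_train is a frame with columns text and pos:

    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MaxAbsScaler
    from sklearn.svm import SVC

    scaler = MaxAbsScaler()
    count_vect = CountVectorizer(lowercase=False, max_features=1000)

    # A string selector ('text', not ['text']) passes each vectorizer a 1D column.
    vectorizer = ColumnTransformer([
        ('vec_txt', count_vect, 'text'),
        ('vec_pos', count_vect, 'pos'),
    ])

    clf = SVC()
    pipe = make_pipeline(vectorizer, scaler, clf)

    # make_pipeline names each step after its lowercased class name.
    params = {
        'columntransformer__vec_txt__analyzer': ['word', 'char'],
        'columntransformer__vec_txt__ngram_range': [(1, 1), (1, 2)],
        'columntransformer__vec_pos__analyzer': ['word', 'char'],
        'columntransformer__vec_pos__ngram_range': [(1, 1), (1, 2)],
        'svc__kernel': ['linear', 'rbf', 'poly'],
    }

    gs = GridSearchCV(pipe, params, cv=3, scoring='accuracy', n_jobs=-1, refit=True, verbose=1)
    gs.fit(X_train, y_train)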
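To see why the 1D requirement matters, here is a minimal sketch on made-up data. X_toy and its contents are hypothetical; only the column names text and pos follow the assumption in this answer. It compares the number of rows CountVectorizer produces when handed the whole frame with the number it produces per column inside a ColumnTransformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, not the asker's real frame.
X_toy = pd.DataFrame({
    "text": ["the cat sat", "le chat dort", "a dog ran", "un chien court"],
    "pos": ["DET NOUN VERB", "DET NOUN VERB", "DET NOUN VERB", "DET NOUN VERB"],
})

# Iterating over a DataFrame yields its column labels, not its rows, so the
# vectorizer treats "text" and "pos" as the documents.
n_rows_direct = CountVectorizer().fit_transform(X_toy).shape[0]

# With a string column selector, each vectorizer receives a 1D column and
# returns one row per sample.
ct = ColumnTransformer([
    ("vec_txt", CountVectorizer(lowercase=False), "text"),
    ("vec_pos", CountVectorizer(lowercase=False), "pos"),
])
n_rows_ct = ct.fit_transform(X_toy).shape[0]

print(n_rows_direct, n_rows_ct, len(X_toy))
```

If the first number differs from the number of samples, that mismatch would be a plausible source of an "inconsistent number of samples" error once the labels are compared against the transformed features.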


Answer