
Scikit-learn pipeline: Non-finite test scores error / Inconsistent number of samples

I have a DataFrame with two columns: the texts, and the POS tags of those same texts. I want to use it for language classification, with both columns as features in my model. This is what the data looks like: X_train.head()

This is what the shape of the data looks like:


When I run my estimator on either one of the columns in my training set individually, it works fine. But as soon as I include both columns together and run my estimator:


I get this error:


I have tried changing the type from a Series to a string, and calling .transpose(), but neither has worked. I don't understand what is causing the NaN. Can you please help?


Answer

I think the problem is that CountVectorizer expects 1D input (an iterable of strings), so it cannot take a two-column frame directly. You can get around that by using a ColumnTransformer with two copies of the vectorizer, one for each column.
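To illustrate why the two-column case can end up with an inconsistent number of samples, here is a small sketch on made-up data (the frame contents are hypothetical, not the asker's). Iterating over a DataFrame yields its column labels, so the vectorizer treats the labels themselves as the documents and returns one row per column instead of one row per sample:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for X_train
df = pd.DataFrame({
    "text": ["the cat sat", "le chat dort", "der Hund bellt"],
    "pos": ["DET NOUN VERB", "DET NOUN VERB", "DET NOUN VERB"],
})

# Iterating over a DataFrame gives the column labels, not the rows
labels = list(df)

# Whole frame: the vectorizer sees the two labels as two "documents"
two_col = CountVectorizer().fit_transform(df)

# Single column (a 1D Series): one row per sample, as intended
one_col = CountVectorizer().fit_transform(df["text"])

print(labels)
print(two_col.shape[0], "rows from the frame")
print(one_col.shape[0], "rows from the Series")
```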

For example, assuming X_train is a frame with columns text and pos:
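A minimal sketch of that approach follows. The toy data, the step names, and the LogisticRegression classifier are assumptions for illustration only; substitute your own estimator. Note that each column is given as a plain string rather than a list, so that ColumnTransformer passes a 1D Series to each vectorizer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data with the assumed column names "text" and "pos"
X_train = pd.DataFrame({
    "text": ["the cat sat", "a dog barked", "le chat dort", "un chien aboie"],
    "pos": ["DET NOUN VERB", "DET NOUN VERB", "DET NOUN VERB", "DET NOUN VERB"],
})
y_train = ["en", "en", "fr", "fr"]

# One vectorizer per column; a string selector (not ["text"]) yields 1D input
preprocessor = ColumnTransformer([
    ("text_vec", CountVectorizer(), "text"),
    ("pos_vec", CountVectorizer(), "pos"),
])

# The classifier here is a stand-in, not the asker's estimator
pipe = Pipeline([
    ("features", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
preds = pipe.predict(X_train)
print(preds)
```

The two bag-of-words matrices are stacked side by side, so the classifier receives one row per sample with the features from both columns.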

User contributions licensed under: CC BY-SA