I would like to train a DecisionTree using an sklearn Pipeline. My goal is to predict the 'language' column, using the 'tweet' column as n-gram-transformed features. However, I am not able to make the LabelEncoder transformation work for the 'language' column inside a pipeline. I saw that this is a common error, but even when I try the suggested reshape method, I am still not able to overcome the problem. This is my df:
                                                  tweet language
0     kann sein grund europ regulierung rat tarif bu...       ge
1     willkommen post von zdfintendant schächter ein...       ge
2     der neue formel1weltmeister kann es selbst noc...       ge
3     ruf am besten mal die hotline an unter 0800172...       ge
4     ups musikmontag verpasst hier die deutsche lis...       ge
...                                                 ...      ...
9061  hoe smaakt je kerstdiner nog lekkerder sms uni...       nl
9062  femke halsema een partijvernieuwer met lef thi...       nl
9063      een lijst van alle vngerelateerde twitteraars       nl
9064  vanmiddag vanaf 1300 uur delen we gratis warme...       nl
9065                 staat hier het biermeisje van 2011       nl
target_features = ['language']
text_features = ['tweet']
ngram_size = 2

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OrdinalEncoder(), 'language'),
        ('vect', CountVectorizer(ngram_range=(ngram_size, ngram_size), analyzer='char'), text_features)
    ])

X_train, X_test, y_train, y_test = train_test_split(d.tweet, d.language, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()
clf_ngram = Pipeline(steps=[('pre', preprocessor), ('clf', clf)])
clf_ngram.fit(X_train.values, y_train.values)

print('Test accuracy computed using cross validation:')
scores = cross_val_score(clf_ngram, X_test, y_test, cv=2)
I also tried using:

y_train = y_train.values.reshape(-1, 1)
X_train = X_train.values.reshape(-1, 1)
But the error is still the same.
clf_ngram.fit()
IndexError: tuple index out of range
Many thanks!
Answer
In my opinion, there are a couple of main issues linked to the way you're dealing with your CountVectorizer instance.

First off, CountVectorizer requires 1D input, and with such transformers ColumnTransformer requires the column parameter to be passed as a scalar string or int; you can find a detailed explanation in sklearn .fit transformers, IndexError: tuple index out of range. Therefore, you should pass 'tweet' rather than ['tweet'] (or 0 rather than [0], in case you specify columns positionally) to the ColumnTransformer instance.
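A minimal sketch of that scalar-column requirement, on toy data (the column name and strings here are illustrative, not your real dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'tweet': ['hallo welt', 'hallo wereld']})

# A scalar column name hands CountVectorizer a 1D Series, as it expects;
# passing ['tweet'] instead would hand it a 2D DataFrame and trigger
# the "tuple index out of range" error.
ct = ColumnTransformer([('vect', CountVectorizer(analyzer='char'), 'tweet')])
X = ct.fit_transform(df)
print(X.shape[0])  # one row per document
```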
Then, according to the documentation of CountVectorizer:
input: {‘filename’, ‘file’, ‘content’}, default=’content’
If ‘filename’, the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.
If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch the bytes in memory.
If ‘content’, the input is expected to be a sequence of items that can be of type string or byte.
you'd better pass its instance an input which is a sequence of items of type string or byte. That's the point of passing X_train.values rather than X_train: you move away from a pandas Series, which would not respect the specification given in the docs. Eventually, reshaping with X_train.values.reshape(-1, 1) is needed because it turns the 1D array of strings into a 2D array, from which the ColumnTransformer can then select column 0 positionally; that should be your way to go in such a case.
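A small sketch of what the reshape does, using a toy array rather than the real data:

```python
import numpy as np

tweets = np.array(['hallo welt', 'hallo wereld', 'guten tag'])
print(tweets.shape)   # (3,) -- a 1D array of strings

X = tweets.reshape(-1, 1)
print(X.shape)        # (3, 1) -- 2D, a single column

# Positional column 0 of the 2D array is again the 1D sequence of strings
# that ColumnTransformer hands to CountVectorizer:
print(X[:, 0].shape)  # (3,)
```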
Last point: personally, I wouldn't deal with target transformation in a ColumnTransformer or in a Pipeline, as they're meant to transform the features only. Some details can be found in my answer to Why should LabelEncoder from sklearn be used only for the target variable?.
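A sketch of encoding the target separately, before it ever reaches the pipeline (the labels here are a toy subset; OrdinalEncoder is used to stay consistent with the full example below, though LabelEncoder is the more conventional choice for targets):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'language': ['ge', 'ge', 'nl', 'nl']})

oe = OrdinalEncoder()
# OrdinalEncoder expects 2D input, hence the double brackets; ravel()
# flattens the result back to the 1D target vector a classifier expects.
y = oe.fit_transform(df[['language']]).ravel()
print(y)  # [0. 0. 1. 1.] -- categories are sorted, so 'ge' -> 0, 'nl' -> 1
```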
All in all, this might be one way of solving your problem:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    'tweet': ['kann sein grund europ regulierung rat tarif bu...',
              'willkommen post von zdfintendant schächter ein...',
              'der neue formel1weltmeister kann es selbst noc...',
              'ruf am besten mal die hotline an unter 0800172...',
              'ups musikmontag verpasst hier die deutsche lis...',
              'hoe smaakt je kerstdiner nog lekkerder sms uni...',
              'femke halsema een partijvernieuwer met lef thi...',
              'een lijst van alle vngerelateerde twitteraars',
              'vanmiddag vanaf 1300 uur delen we gratis warme...',
              'staat hier het biermeisje van 2011'],
    'language': ['ge', 'ge', 'ge', 'ge', 'ge', 'nl', 'nl', 'nl', 'nl', 'nl']
})

# encode the target outside the pipeline
oe = OrdinalEncoder()
df['tar'] = oe.fit_transform(df[['language']])

ngram_size = 2
preprocessor = ColumnTransformer(
    transformers=[
        # scalar column 0 (not [0]) so that CountVectorizer gets 1D input
        ('vect', CountVectorizer(ngram_range=(ngram_size, ngram_size), analyzer='char'), 0),
    ])

X_train, X_test, y_train, y_test = train_test_split(df.tweet, df.tar, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()
clf_ngram = Pipeline(steps=[('pre', preprocessor), ('clf', clf)])

# reshape to (n, 1) so the ColumnTransformer can select column 0
clf_ngram.fit(X_train.values.reshape(-1, 1), y_train)

print('Test accuracy computed using cross validation:')
scores = cross_val_score(clf_ngram, X_test.values.reshape(-1, 1), y_test, cv=2)