
TypeError: expected string or bytes-like object – with Python/NLTK word_tokenize

I have a dataset with ~40 columns, and am using .apply(word_tokenize) on 5 of them like so: df['token_column'] = df.column.apply(word_tokenize).

I’m getting a TypeError for only one of the columns, which I’ll call problem_column.


Here’s the full error (with the df and column names and any PII stripped). I’m new to Python and am still trying to figure out which parts of the error messages are relevant.


The 5 columns are all character/string (as verified in SQL Server, SAS, and using .select_dtypes(include=[object])).

For good measure I used .to_string() to make sure problem_column is really and truly not anything besides a string, but I continue to get the error. If I process the columns separately, good_column1 through good_column4 continue to work and problem_column still generates the error.
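One thing worth checking here: an object-dtype column in pandas can hold any Python object, so missing values (None or NaN) can sit alongside real strings without changing the dtype. A minimal sketch of counting the Python types actually present, using a plain list as a hypothetical stand-in for the contents of problem_column (the helper name type_counts and the sample values are assumptions, not from the question):

```python
from collections import Counter


def type_counts(values):
    """Count how many values of each Python type appear in an iterable."""
    return Counter(type(v).__name__ for v in values)


# Hypothetical stand-in for the values in problem_column:
# mostly strings, plus the kinds of missing values a SQL import can produce.
sample = ["first note", None, "room 12", float("nan")]

counts = type_counts(sample)
print(counts)
```

With a real DataFrame, something like df['problem_column'].map(type).value_counts() should give the same kind of breakdown. Anything other than str in that output is a likely source of the TypeError.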

I’ve googled around and aside from stripping any numbers from the set (which I can’t do, because those are meaningful) I haven’t found any additional fixes.


Answer

This is what got me the desired result.

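The answer's original code is not reproduced here. As a rough sketch of one common way to get past this error (not necessarily what the answer used), a small wrapper can treat missing values as empty and coerce everything else to str before tokenizing. The name safe_tokenize and the sample rows are hypothetical, and the default tokenizer is str.split only so the sketch runs without NLTK installed:

```python
import math


def safe_tokenize(value, tokenizer=str.split):
    """Tokenize value, returning an empty list for missing values.

    tokenizer defaults to str.split so this runs without NLTK;
    pass nltk.word_tokenize to match the setup in the question.
    """
    # None and float NaN are the usual stand-ins for missing text.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    # Coerce anything else (e.g. an int) to str so the tokenizer accepts it.
    return tokenizer(str(value))


# Hypothetical rows mixing strings, missing values, and a number.
rows = ["call back at 5 pm", None, float("nan"), 42]
tokens = [safe_tokenize(v) for v in rows]
print(tokens)
```

In the question's setting this would presumably be used as df['token_column'] = df['problem_column'].apply(lambda v: safe_tokenize(v, word_tokenize)), assuming NLTK and its tokenizer data are installed. Returning an empty list for missing values is a design choice; filling them with an empty string first would work as well.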
User contributions licensed under: CC BY-SA