
How to create a list of tokenized words from a dataframe column using spaCy?

I’m trying to apply spaCy’s tokenizer to a dataframe column to get a new column containing a list of tokens. Assume we have the following dataframe:

JavaScript

The code below aims to tokenize the Text column:

JavaScript

The result looks like:

[Screenshot of the resulting dataframe]

Now we have a new column, tokens, which contains a Doc object for each sentence.

How could we change the code to get a Python list of tokenized words instead?

I’ve tried the following line:

JavaScript

but I get the following error:

JavaScript

Thank you in advance!

UPDATE: I have a solution, but I still have another problem. I want to count words using the standard-library class Counter, which takes a list as input and can be updated incrementally with the list of tokens from another document via its update method. The code below should return the number of occurrences of each word in the dataframe:

JavaScript

However, the output is:

JavaScript

The expected output should look like:

JavaScript

Thank you again :)


Answer

You can use

JavaScript
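
For illustration, here is a minimal sketch of one way to get a plain Python list of token strings per row. It assumes a blank English pipeline created with `spacy.blank("en")` (tokenizer only, no trained model) and uses made-up sample sentences rather than the data from the question, so treat it as a starting point, not the exact code.

```python
import pandas as pd
import spacy

# Assumption: a blank English pipeline, which only runs the tokenizer.
nlp = spacy.blank("en")

# Hypothetical sample data standing in for the dataframe in the question.
df = pd.DataFrame({"Text": ["Hello world!", "This is a test."]})

# nlp(text) returns a Doc; iterating over it yields Token objects,
# and token.text is the plain string for each token.
df["tokens"] = df["Text"].apply(lambda text: [token.text for token in nlp(text)])

print(df["tokens"].tolist())
```

Each cell of the tokens column is then an ordinary list of str, which is usually easier to pass on to other tools than a Doc object.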

See the Pandas test:

JavaScript

Output:

JavaScript
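
For the word count from the update, a rough sketch of the incremental `Counter.update` pattern is shown below. It assumes each row already holds a list of plain strings (not Doc or Token objects); the `token_lists` variable and its contents are hypothetical stand-ins for the tokens column.

```python
from collections import Counter

# Hypothetical token lists standing in for the rows of the tokens column.
token_lists = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
]

word_counts = Counter()
for tokens in token_lists:
    # update() adds one count per element of the iterable, so a list of
    # strings is counted word by word. Passing a single string instead
    # would count its characters.
    word_counts.update(tokens)

print(word_counts)
```

With a dataframe, the same loop can run over the column directly, for example `for tokens in df["tokens"]:`, as long as the cells contain lists of strings.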