I’m trying to apply spaCy’s tokenizer to a DataFrame column to get a new column containing a list of tokens. Assume we have the following DataFrame:
import pandas as pd

details = {
    'Text_id': [23, 21, 22, 21],
    'Text': ['All roads lead to Rome',
             'All work and no play makes Jack a dull buy',
             'Any port in a storm',
             'Avoid a questioner, for he is also a tattler'],
}

# creating a DataFrame object
example_df = pd.DataFrame(details)
The code below aims to tokenize the Text column:
import spacy

nlp = spacy.load("en_core_web_sm")
example_df["tokens"] = example_df["Text"].apply(lambda x: nlp.tokenizer(x))
example_df
The result looks like this:
Now we have a new column, tokens, which holds a Doc object for each sentence. How could we change the code to get a Python list of tokenized words?
I’ve tried the following line:
example_df["tokens"] = example_df["Text"].apply(token.text for token in (lambda x: nlp.tokenizer(x)))
but I get the following error:
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_33/3712416053.py in <module>
     14 nlp = spacy.load("en_core_web_sm")
     15
---> 16 example_df["tokens"] = example_df["Text"].apply(token.text for token in (lambda x: nlp.tokenizer(x)))
     17
     18 example_df

TypeError: 'function' object is not iterable
Thank you in advance!
UPDATE: I have a solution, but I still have another problem. I want to count words using the built-in Counter class, which takes a list as input and can be incrementally updated with the token list of another document via its update method. The code below should return the number of occurrences of each word in the DataFrame:
from collections import Counter

# instantiate a Counter object
counter_df = Counter()

# call the counter's update method to update the counts
example_df["tokens"].map(counter_df.update)
However, the output is:
0    None
1    None
2    None
3    None
Name: tokens, dtype: object
The expected output should look like:
Counter({'All': 2, 'roads': 1, 'lead': 1, 'to': 1, 'Rome': 1, 'work': 1, 'and': 1, 'no': 1, 'play': 1, 'makes': 1, 'a': 4, 'dull': 1, 'buy': 1, 'Any': 1, 'port': 1, 'in': 1, 'storm': 1, 'Avoid': 1, 'questioner': 1, ',': 1, 'for': 1, 'he': 1})
Thank you again :)
Answer
You can use a list comprehension over the tokenizer output. (Your attempt raised TypeError: 'function' object is not iterable because the generator expression tries to iterate over the lambda itself instead of calling it; apply expects a callable.)
example_df["tokens"] = example_df["Text"].apply(lambda x: [t.text for t in nlp.tokenizer(x)])
See the Pandas test:
import pandas as pd

details = {
    'Text_id': [23, 21, 22, 21],
    'Text': ['All roads lead to Rome',
             'All work and no play makes Jack a dull buy',
             'Any port in a storm',
             'Avoid a questioner, for he is also a tattler'],
}

# creating a DataFrame object
example_df = pd.DataFrame(details)

import spacy

nlp = spacy.load("en_core_web_sm")
example_df["tokens"] = example_df["Text"].apply(lambda x: [t.text for t in nlp.tokenizer(x)])
print(example_df.to_string())
Output:
   Text_id                                          Text                                                     tokens
0       23                        All roads lead to Rome                               [All, roads, lead, to, Rome]
1       21    All work and no play makes Jack a dull buy      [All, work, and, no, play, makes, Jack, a, dull, buy]
2       22                           Any port in a storm                                  [Any, port, in, a, storm]
3       21  Avoid a questioner, for he is also a tattler   [Avoid, a, questioner, ,, for, he, is, also, a, tattler]
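As for the Counter part of the update: Counter.update returns None because it modifies the counter in place, so map collects a column of Nones. The counts still accumulate in counter_df itself. A minimal sketch, assuming example_df["tokens"] from the test above:

from collections import Counter

counter_df = Counter()

# update() mutates counter_df in place and returns None,
# which is why the mapped Series shows None for every row;
# the word counts are accumulated in counter_df itself
example_df["tokens"].map(counter_df.update)
print(counter_df)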