
How to use word_tokenize on a DataFrame

I have recently started using the nltk module for text analysis and am stuck at one point. I want to apply word_tokenize to a dataframe, so as to obtain all the words used in a particular row of the dataframe.

data example:
       text
1.   This is a very good site. I will recommend it to others.
2.   Can you please give me a call at 9983938428. have issues with the listings.
3.   good work! keep it up
4.   not a very helpful site in finding home decor. 

expected output:

1.   'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
2.   'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
3.   'good','work','!','keep','it','up'
4.   'not','a','very','helpful','site','in','finding','home','decor'

Basically, I want to separate all the words and find the length of each text in the dataframe.

I know word_tokenize works on a single string, but how do I apply it to the entire dataframe?

Please help!

Thanks in advance…


Answer

You can use the apply method of the DataFrame API:

import pandas as pd
import nltk

# nltk.download('punkt')  # run once if the Punkt tokenizer models are not yet installed
df = pd.DataFrame({'sentences': [
    'This is a very good site. I will recommend it to others.',
    'Can you please give me a call at 9983938428. have issues with the listings.',
    'good work! keep it up']})
# apply word_tokenize to the 'sentences' column, row by row
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)

Output:

>>> df
                                           sentences  
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  
0  [This, is, a, very, good, site, ., I, will, re...  
1  [Can, you, please, give, me, a, call, at, 9983...  
2                      [good, work, !, keep, it, up]
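As a side note, since only a single column is involved here, the same result can be obtained a bit more concisely with Series.apply, which avoids the row-wise lambda:

df['tokenized_sents'] = df['sentences'].apply(nltk.word_tokenize)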

To find the length of each text, use apply with a lambda function again:

# count the tokens produced for each row
df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)

>>> df
                                           sentences  
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  sents_length  
0  [This, is, a, very, good, site, ., I, will, re...            14  
1  [Can, you, please, give, me, a, call, at, 9983...            15  
2                      [good, work, !, keep, it, up]             6  
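If all you need is the count, pandas can also do this without a lambda: Series.str.len works element-wise on a column of lists as well as strings, so the same column can be built with:

df['sents_length'] = df['tokenized_sents'].str.len()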