I have recently started using the nltk module for text analysis, and I am stuck. I want to use word_tokenize on a dataframe to obtain all the words used in each row of the dataframe.
data example:
text
1. This is a very good site. I will recommend it to others.
2. Can you please give me a call at 9983938428. have issues with the listings.
3. good work! keep it up
4. not a very helpful site in finding home decor.

expected output:
1. 'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
2. 'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
3. 'good','work','!','keep','it','up'
4. 'not','a','very','helpful','site','in','finding','home','decor'
Basically, I want to separate all the words and find the length of each text in the dataframe.
I know word_tokenize can do it for a string, but how do I apply it to the entire dataframe?
Please help!
Thanks in advance…
Answer
You can use the apply method of the DataFrame API:
import pandas as pd
import nltk

df = pd.DataFrame({'sentences': ['This is a very good site. I will recommend it to others.',
                                 'Can you please give me a call at 9983938428. have issues with the listings.',
                                 'good work! keep it up']})
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)
Output:
>>> df
                                           sentences
0  This is a very good site. I will recommend it ...
1  Can you please give me a call at 9983938428. h...
2                              good work! keep it up

                                     tokenized_sents
0  [This, is, a, very, good, site, ., I, will, re...
1  [Can, you, please, give, me, a, call, at, 9983...
2                      [good, work, !, keep, it, up]
To find the length of each text, use apply with a lambda function again:
df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)

>>> df
                                           sentences
0  This is a very good site. I will recommend it ...
1  Can you please give me a call at 9983938428. h...
2                              good work! keep it up

                                     tokenized_sents  sents_length
0  [This, is, a, very, good, site, ., I, will, re...            14
1  [Can, you, please, give, me, a, call, at, 9983...            15
2                      [good, work, !, keep, it, up]             6
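Since only one column is involved in each step, the same result can probably be reached by calling apply on the column itself rather than on whole rows. The following is a minimal sketch of that idea, not a definitive implementation. The helper name add_token_columns and its parameters are made up for illustration, and str.split is used only as a stand-in tokenizer so the example runs without NLTK's tokenizer data; it splits on whitespace and does not separate punctuation the way word_tokenize does.

```python
import pandas as pd


def add_token_columns(df, tokenizer, text_col="sentences"):
    """Return a copy of df with token-list and token-count columns added.

    Hypothetical helper for illustration; `tokenizer` is any callable
    that maps a string to a list of tokens.
    """
    out = df.copy()
    # Series.apply calls the tokenizer once per cell, so no row-wise lambda is needed.
    out["tokenized_sents"] = out[text_col].apply(tokenizer)
    # .str.len() also works on list-valued cells and gives the number of tokens.
    out["sents_length"] = out["tokenized_sents"].str.len()
    return out


df = pd.DataFrame({"sentences": ["good work! keep it up",
                                 "not a very helpful site"]})

# str.split is a stand-in tokenizer (assumption made for this sketch).
result = add_token_columns(df, str.split)
print(result)
```

To get NLTK's tokens instead, pass nltk.word_tokenize as the tokenizer, i.e. add_token_columns(df, nltk.word_tokenize), assuming nltk is imported and its tokenizer data has been downloaded.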