I recently started using the nltk module for text analysis and I am stuck. I want to apply word_tokenize to a dataframe, so as to obtain all the words used in each row of the dataframe.
data example:
text
1. This is a very good site. I will recommend it to others.
2. Can you please give me a call at 9983938428. have issues with the listings.
3. good work! keep it up
4. not a very helpful site in finding home decor.

expected output:

1. 'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
2. 'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
3. 'good','work','!','keep','it','up'
4. 'not','a','very','helpful','site','in','finding','home','decor'
Basically, I want to separate all the words and find the length of each text in the dataframe.
I know word_tokenize can do it for a string, but how do I apply it to the entire dataframe?
Please help!
Thanks in advance…
Answer
You can use the apply method of the DataFrame API:
import pandas as pd
import nltk

df = pd.DataFrame({'sentences': ['This is a very good site. I will recommend it to others.', 'Can you please give me a call at 9983938428. have issues with the listings.', 'good work! keep it up']})
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)
Output:
>>> df
   sentences
0  This is a very good site. I will recommend it
1  Can you please give me a call at 9983938428. h
2  good work! keep it up

   tokenized_sents
0  [This, is, a, very, good, site, ., I, will, re
1  [Can, you, please, give, me, a, call, at, 9983...
2  [good, work, !, keep, it, up]
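Since only one column is needed, the same result can probably be obtained by calling apply on that column instead of on the whole dataframe. The sketch below compares the two forms. To keep it runnable without downloading NLTK's tokenizer data, it uses a hypothetical stand-in tokenizer, simple_tokenize, which is a plain regex and not part of NLTK; passing nltk.word_tokenize in its place should work the same way.

```python
import re
import pandas as pd

# Stand-in tokenizer (an assumption for illustration, not NLTK):
# a run of word characters, or a single punctuation mark.
def simple_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

df = pd.DataFrame({"sentences": [
    "This is a very good site. I will recommend it to others.",
    "good work! keep it up",
]})

# Row-wise, as in the answer: the lambda receives each row as a Series.
row_wise = df.apply(lambda row: simple_tokenize(row["sentences"]), axis=1)

# Column-wise: the tokenizer receives each string of the column directly.
col_wise = df["sentences"].apply(simple_tokenize)

df["tokenized_sents"] = col_wise
print(row_wise.tolist() == col_wise.tolist())
```

The column-wise form avoids building a row object for every record, which tends to matter on larger dataframes.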
To find the length of each text, use apply with a lambda function again:
df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)

>>> df
   sentences
0  This is a very good site. I will recommend it
1  Can you please give me a call at 9983938428. h
2  good work! keep it up

   tokenized_sents                                     sents_length
0  [This, is, a, very, good, site, ., I, will, re      14
1  [Can, you, please, give, me, a, call, at, 9983...    15
2  [good, work, !, keep, it, up]                       6
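The length step can also be written without a lambda, because len can be mapped over the column of token lists directly. This is a minimal sketch: the token lists are copied from the expected output in the question, and the word_count column is a hypothetical extra for the case where punctuation tokens should not be counted.

```python
import pandas as pd

# Token lists copied from the expected output in the question (rows 1 and 3).
df = pd.DataFrame({"tokenized_sents": [
    ["This", "is", "a", "very", "good", "site", ".",
     "I", "will", "recommend", "it", "to", "others", "."],
    ["good", "work", "!", "keep", "it", "up"],
]})

# len is applied to each list, so no lambda or axis=1 is needed.
df["sents_length"] = df["tokenized_sents"].map(len)

# Hypothetical extra column: count only alphanumeric tokens,
# skipping punctuation such as '.' and '!'.
df["word_count"] = df["tokenized_sents"].map(
    lambda tokens: sum(token.isalnum() for token in tokens)
)

print(df[["sents_length", "word_count"]])
```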