I am currently trying to solve this homework question.
My task is to implement a function that returns a vector of word counts in a given text. I am required to split the text into words, then use NLTK's tokeniser to tokenise each sentence.
This is the code I have so far:
```python
import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')

def word_counts(text, words):
    """Return a vector that represents the counts of specific words in the text
    >>> word_counts("Here is sentence one. Here is sentence two.", ['Here', 'two', 'three'])
    [2, 1, 0]
    >>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    >>> word_counts(emma, ['the', 'a'])
    [4842, 3001]
    """
    from nltk.tokenize import TweetTokenizer
    text = nltk.sent_tokenize(text)
    words = nltk.sent_tokenize(words)

    wordList = []

    for sen in text, words:
        for word in nltk.word_tokenize(sen):
            wordList.append(text, words).split(word)

    counter = TweetTokenizer(wordList)
    return counter
```
There are two doctests that should give the results `[2, 1, 0]` and `[4842, 3001]`.
This is the error message I am getting from my code
I've spent all day trying to tackle this and I feel I'm getting close, but I don't know what I'm doing wrong; the script gives me an error every time.
Any help would be very much appreciated. Thank you.
Answer
This is how I would use nltk to get to the result your homework wants:
```python
import nltk
import collections
from nltk.tokenize import TweetTokenizer
# nltk.download('punkt')
# nltk.download('gutenberg')
# nltk.download('brown')

def word_counts(text, words):
    """Return a vector that represents the counts of specific words in the text
    word_counts("Here is one. Here is two.", ['Here', 'two', 'three'])
    [2, 1, 0]
    emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    word_counts(emma, ['the', 'a'])
    [4842, 3001]
    """
    textTok = nltk.word_tokenize(text)
    counts = nltk.FreqDist(textTok)  # this counts ALL word occurrences

    return [counts[x] for x in words]  # this returns what was counted for *words*

r1 = word_counts("Here is one. Here is two.", ['Here', 'two', 'three'])
print(r1)  # [2, 1, 0]

emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
r2 = word_counts(emma, ['the', 'a'])
print(r2)  # [4842, 3001]
```
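Incidentally, since the code imports `collections` without using it, the counting step can equally be done with `collections.Counter`, which, like `FreqDist`, returns 0 for missing keys. A minimal sketch (using a naive regex split as a stand-in for `nltk.word_tokenize` so it runs without NLTK; `word_counts_counter` is just an illustrative name):

```python
import collections
import re

def word_counts_counter(text, words):
    """Count occurrences of each word in *words* within *text*.

    Uses a simple regex token split as a stand-in for nltk.word_tokenize.
    """
    tokens = re.findall(r"\w+", text)     # naive tokenisation on word characters
    counts = collections.Counter(tokens)  # counts ALL tokens, like FreqDist
    return [counts[w] for w in words]     # missing words count as 0

print(word_counts_counter("Here is one. Here is two.", ['Here', 'two', 'three']))
# [2, 1, 0]
```

Note that the regex split handles punctuation differently from NLTK's tokeniser, so the counts on a large corpus like `emma` may differ slightly.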
Your code does multiple things that look just wrong:

```python
for sen in text, words:
    for word in nltk.word_tokenize(sen):

        wordList.append(text, words).split(word)
```

- `sent_tokenize()` takes a string and returns a list of sentences from it – you store the results in the two variables `text` and `words`, and then you try to iterate over a tuple of them?
- `words` is not a text with sentences to begin with, so this makes not much sense to me.
- `wordList` is a list; if you use `.append()` on it, `append()` returns `None`, and `None` has no `.split()` function.
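That last pitfall is easy to demonstrate in isolation (a tiny sketch, independent of NLTK):

```python
wordList = []
result = wordList.append("hello")  # append() mutates the list in place...
print(result)                      # ...and returns None
print(wordList)                    # ['hello']

# chaining .split() on the return value therefore raises:
try:
    wordList.append("x").split("x")
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'split'
```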