
Counting specific words in a sentence

I am currently trying to solve this homework question.

My task is to implement a function that returns a vector of word counts in a given text. I am required to split the text into words, then use NLTK's tokeniser to tokenise each sentence.

This is the code I have so far:

import nltk
import collections
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('brown')

def word_counts(text, words):
    """Return a vector that represents the counts of specific words in the text
    >>> word_counts("Here is sentence one. Here is sentence two.", ['Here', 'two', 'three'])
    [2, 1, 0]
    >>> emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    >>> word_counts(emma, ['the', 'a'])
    [4842, 3001]
    """

    from nltk.tokenize import TweetTokenizer
    text = nltk.sent_tokenize(text)
    words = nltk.sent_tokenize(words)

    wordList = []

    for sen in text, words:
        for word in nltk.word_tokenize(sen):

            wordList.append(text, words).split(word)

    counter = TweetTokenizer(wordList)
    return counter

There are two doctests that should give the results [2, 1, 0] and [4842, 3001].
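For reference, doctests like the two in the docstring can be run with Python's built-in doctest module; a minimal sketch of how that is usually invoked:

```python
import doctest

# doctest.testmod() scans the current module's docstrings for ">>>"
# examples (like the two in word_counts) and runs them as tests,
# reporting any output that does not match the expected lines.
results = doctest.testmod()
print(results)  # e.g. TestResults(failed=0, attempted=0) when nothing fails
```

Alternatively, `python -m doctest yourfile.py -v` runs the same checks from the command line.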

This is the error message I am getting from my code (screenshot of the traceback not reproduced here).

I’ve spent all day trying to tackle this and I feel I’m getting close, but I don’t know what I’m doing wrong; the script gives me an error every time.

Any help will be very appreciated. Thank you.


Answer

This is how I would use nltk to get the result your homework wants:

import nltk
import collections
from nltk.tokenize import TweetTokenizer
# nltk.download('punkt')
# nltk.download('gutenberg')
# nltk.download('brown')

def word_counts(text, words):
    """Return a vector that represents the counts of specific words in the text
    word_counts("Here is one. Here is two.", ['Here', 'two', 'three'])
    [2, 1, 0]
    emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
    word_counts(emma, ['the', 'a'])
    [4842, 3001]
    """  

    textTok = nltk.word_tokenize(text)
    counts = nltk.FreqDist(textTok)    # this counts ALL word occurrences

    return [counts[x] for x in words]  # this returns just the counts for *words*

r1 = word_counts("Here is one. Here is two.", ['Here', 'two', 'three'])
print(r1) #    [2, 1, 0]

emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
r2 = word_counts(emma, ['the', 'a'])
print(r2) # [4842, 3001]
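For comparison, the unused `collections` import could fill the same counting role as `FreqDist`; a stdlib-only sketch using `collections.Counter`, with a simple regex as a crude stand-in for `nltk.word_tokenize` (so exact counts may differ slightly from NLTK's on real corpora like Emma):

```python
import re
from collections import Counter

def word_counts_counter(text, words):
    """Return counts of *words* in *text* (stdlib-only sketch)."""
    # \w+ splits on any non-word characters; close enough to
    # word_tokenize for this small example, but not identical.
    tokens = re.findall(r"\w+", text)
    counts = Counter(tokens)            # counts every token, like FreqDist
    return [counts[w] for w in words]   # Counter returns 0 for missing keys

print(word_counts_counter("Here is one. Here is two.", ['Here', 'two', 'three']))
# [2, 1, 0]
```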

Your code does several things that look wrong:

for sen in text, words:
    for word in nltk.word_tokenize(sen):

        wordList.append(text, words).split(word)

  • sent_tokenize() takes a string and returns a list of sentences from it – you store the results in the two variables text and words and then try to iterate over a tuple of them? words is not a text with sentences to begin with; this does not make much sense to me.
  • wordList is a list; if you call .append() on it, append() returns None, and None has no .split() method.
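The second point is easy to demonstrate in isolation:

```python
wordList = []
result = wordList.append("token")   # list.append mutates in place and returns None
print(result)    # None
print(wordList)  # ['token']

# Calling .split() on that return value is exactly what blows up:
try:
    wordList.append("x").split("x")
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'split'
```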
User contributions licensed under: CC BY-SA