
Count occurrences of a couple of specific words

I have a list of words, let's say ["foo", "bar", "baz"], and a large string in which these words may occur.

For every word in the list I currently call the "string".count("word") method. This works OK, but seems rather inefficient: for every extra word added to the list, the entire string must be iterated over an extra time.
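For reference, the current approach looks roughly like the sketch below. The names `words` and `text` are made up for illustration; the point is only that `str.count` scans the whole string once per word.

```python
# Hypothetical names; one full scan of the text per word in the list.
words = ["foo", "bar", "baz"]
text = "foo bar baz bar quux foo bla bla"

# str.count returns the number of non-overlapping occurrences of the substring.
counts = {word: text.count(word) for word in words}
print(counts)
```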

Is there a better way to do this, or should I implement a custom method that iterates over the large string a single time, checking at each character whether one of the words in the list has been reached?

To be clear:

  • I want the number of occurrences per word in the list.
  • The string to search in is different each time and consists of about 10000 chars
  • The list of words is constant
  • The words in the list of words can contain whitespace


Answer

Make a dict-typed frequency table for your words, then iterate over the words in your string.

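import re

vocab = ["foo", "bar", "baz"]
s = "foo bar baz bar quux foo bla bla"

# Start every vocabulary word at zero so absent words still appear in the table.
wordcount = dict((x, 0) for x in vocab)
# \w+ matches each run of word characters, so the string is scanned only once.
for w in re.findall(r"\w+", s):
    if w in wordcount:
        wordcount[w] += 1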

Edit: if the “words” in your list contain whitespace, you can instead build an RE out of them:

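import re
from collections import Counter

vocab = ["foo bar", "baz"]
# \b anchors each alternative at word boundaries; s is the string from the first snippet.
r = re.compile("|".join(r"\b%s\b" % w for w in vocab))
wordcount = Counter(re.findall(r, s))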

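Explanation: this builds the RE r'\bfoo bar\b|\bbaz\b' from the vocabulary. findall then returns the list ['foo bar', 'baz'] for the string s above, and the Counter (Python 2.7+) counts the occurrences of each distinct element in it. Note that words which never occur are simply missing from the Counter, although looking them up still returns 0. Watch out that your list of words should not contain characters that are special to REs, such as ()[], unless you escape them first, for example with re.escape.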
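Since the word list is constant, the pattern can be compiled once and reused for every new string. The following is a minimal sketch of that idea, not a definitive implementation: `build_counter` is a hypothetical helper name, each word is passed through `re.escape` so special characters are matched literally, and longer words are tried first on the assumption that a phrase like "foo bar" should win over "foo" when both are in the list.

```python
import re
from collections import Counter

def build_counter(vocab):
    """Return a function that counts the vocabulary words in a given text."""
    # Try longer alternatives first so "foo bar" is preferred over "foo".
    alternatives = sorted(vocab, key=len, reverse=True)
    # re.escape makes characters such as ( ) [ ] . match literally.
    pattern = re.compile(
        "|".join(r"\b%s\b" % re.escape(w) for w in alternatives)
    )

    def count(text):
        # Seed with zeros so words that never occur are still reported.
        counts = Counter(dict.fromkeys(vocab, 0))
        # The pattern has no groups, so findall returns the full matches.
        counts.update(pattern.findall(text))
        return counts

    return count

count_words = build_counter(["foo bar", "baz", "a.b"])
result = count_words("foo bar baz bar quux foo bla a.b axb")
print(result)
```

One caveat: \b only matches between a word character and a non-word character, so a vocabulary entry that starts or ends with punctuation (such as "c++") would need different boundary handling.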

User contributions licensed under: CC BY-SA