
Count the number of times a group of words appears in a text

I have four lists of words, each representing a category, and a text that has been tokenised into words.

animals = ["cat", "dog", "fish"]
colours = ["blue", "red", "green"]
food = ["pasta", "chips", "beef"]
sport = ["football", "basketball", "tennis"]

text = ["Once","upon","a","time",.......]

I would like to count how often the words from these lists occur in a given text, summed per list. For example, the result might show 10 animal words, 20 colour words, 6 food words and 13 sport words across the whole text.

The data I'm actually working on is quite large, so I need something that runs quickly.

Thanks for any help!


Answer

You could change your categories to a dict of set objects (which allows average-case O(1) membership tests):

categories = {'animals': {'cat', 'dog', 'fish'},
              'colours': {'blue', 'green', 'red'},
              'food': {'beef', 'chips', 'pasta'},
              'sport': {'basketball', 'football', 'tennis'}}

Then iterate over the words and perform membership tests for each category set:

def count_words(text, categories):
    # Start every category at zero.
    counts = dict.fromkeys(categories, 0)
    for word in text:
        for cat_name, cat_words in categories.items():
            # A bool is an int in Python, so True adds 1 and False adds 0.
            counts[cat_name] += word in cat_words
    return counts

Usage:

In [19]: text = "Once upon a time there was a proper minimal reproducible example given by the OP without anybody having to ask for it".split()

In [20]: count_words(text, categories)
Out[20]: {'animals': 0, 'colours': 0, 'food': 0, 'sport': 0}

In [21]: text = ("cat dog fish "*3).split()

In [22]: count_words(text, categories)
Out[22]: {'animals': 9, 'colours': 0, 'food': 0, 'sport': 0}
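
Since the question mentions large data, here is a possible variant, offered as a sketch rather than a benchmarked improvement. It counts every distinct token once with collections.Counter and then sums the counts per category, so the per-category work depends on the size of the category sets rather than the length of the text. The function name count_words_by_category and the sample text are made up for illustration, and matching is assumed to be exact and case-sensitive, as in the function above.

```python
from collections import Counter

def count_words_by_category(text, categories):
    # One pass over the text: how often does each distinct token occur?
    word_counts = Counter(text)
    # A Counter returns 0 for words it has never seen, so no KeyError here.
    return {name: sum(word_counts[word] for word in words)
            for name, words in categories.items()}

categories = {'animals': {'cat', 'dog', 'fish'},
              'colours': {'blue', 'green', 'red'},
              'food': {'beef', 'chips', 'pasta'},
              'sport': {'basketball', 'football', 'tennis'}}

# Hypothetical sample text, only for demonstration.
text = ("cat dog fish " * 3).split() + ["red", "tennis", "tennis"]

print(count_words_by_category(text, categories))
# {'animals': 9, 'colours': 1, 'food': 0, 'sport': 2}
```

Whether this is actually faster than the nested loop depends on your data, so it is worth timing both on a realistic sample.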
User contributions licensed under: CC BY-SA