Count the number of times a group of words appears in a text

Question

I have 4 lists of words that each categorise something, and a text tokenised by word. I would like to count how often the words in these lists occur in a given text, summed per list. The results would therefore show, for example, an occurrence of 10 animal words, 20 colour words, 6 food

Accepted Answer

You could change your categories to a dict of set objects (which will allow for O(1) membership tests):

    categories = {'animals': {'cat', 'dog', 'fish'},
                  'colours': {'blue', 'green', 'red'},
                  'food': {'beef', 'chips', 'pasta'},
                  'sport': {'basketball', 'football', 'tennis'}}

Then iterate over the words and perform membership tests for each category set:

    def count_words(text, categories):
        counts = dict.fromkeys(categories, 0)
        for word in text:
            for cat_name, cat_words in categories.items():
                counts[cat_name] += word in cat_words
        return counts

Usage:

    In [19]: text = "Once upon a time there was a proper minimal reproducible example given by the OP without anybody having to ask for it".split()

    In [20]: count_words(text, categories)
    Out[20]: {'animals': 0, 'colours': 0, 'food': 0, 'sport': 0}

    In [21]: text = ("cat dog fish "*3).split()

    In [22]: count_words(text, categories)
    Out[22]: {'animals': 9, 'colours': 0, 'food': 0, 'sport': 0}
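As a possible variation on the same idea, the text can be tallied once with `collections.Counter` and the per-category totals summed from that tally. This is only a sketch, not part of the answer above: the function name `count_words_by_category`, the sample sentence, and the choice to lower-case the text before splitting are all assumptions made for illustration.

```python
from collections import Counter


def count_words_by_category(tokens, categories):
    """Return {category name: number of tokens that belong to that category}."""
    # Tally every distinct token once up front.
    token_counts = Counter(tokens)
    # A Counter returns 0 for missing keys, so absent words contribute nothing.
    return {
        name: sum(token_counts[word] for word in words)
        for name, words in categories.items()
    }


categories = {
    'animals': {'cat', 'dog', 'fish'},
    'colours': {'blue', 'green', 'red'},
    'food': {'beef', 'chips', 'pasta'},
    'sport': {'basketball', 'football', 'tennis'},
}

# Hypothetical sample text; lower-casing first is an assumption so that
# "Tennis" matches the lower-case entry in the 'sport' set.
tokens = "The red cat chased the blue fish past the Tennis court".lower().split()
result = count_words_by_category(tokens, categories)
print(result)  # {'animals': 2, 'colours': 2, 'food': 0, 'sport': 1}
```

This version loops over the category words rather than over every token for every category, which may help when the text is much longer than the word lists. For short texts either approach should behave the same.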

Answer