I have four lists of words, each defining a category, and a text that has been tokenised into words.
animals = ["cat", "dog", "fish"]
colours = ["blue", "red", "green"]
food = ["pasta", "chips", "beef"]
sport = ["football", "basketball", "tennis"]
text = ["Once","upon","a","time",.......]
I would like to count how often the words from these lists occur in a given text, summed per list rather than per word. For example, the result might show 10 animal words, 20 colour words, 6 food words and 13 sport words across the whole text.
The data I'm actually working on is quite large, so I need a solution that runs quickly.
Thanks for any help!
Answer
You could change your categories to a dict of set objects (which will allow for O(1) membership tests):
categories = {
    'animals': {'cat', 'dog', 'fish'},
    'colours': {'blue', 'green', 'red'},
    'food': {'beef', 'chips', 'pasta'},
    'sport': {'basketball', 'football', 'tennis'},
}
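If the categories already exist as separate lists, as in the question, the dict does not need to be retyped by hand. A minimal sketch, assuming the four list names from the question:

```python
# The four category lists as given in the question.
animals = ["cat", "dog", "fish"]
colours = ["blue", "red", "green"]
food = ["pasta", "chips", "beef"]
sport = ["football", "basketball", "tennis"]

# Convert each list to a set once, keyed by its category name.
categories = {
    "animals": set(animals),
    "colours": set(colours),
    "food": set(food),
    "sport": set(sport),
}
```

The conversion happens once up front, so its cost is negligible compared with scanning a large text.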
Then iterate over the words and perform membership tests for each category set:
def count_words(text, categories):
    # Start every category at zero.
    counts = dict.fromkeys(categories, 0)
    for word in text:
        for cat_name, cat_words in categories.items():
            # The membership test yields True/False, which adds as 1/0.
            counts[cat_name] += word in cat_words
    return counts
Usage:
In [19]: text = "Once upon a time there was a proper minimal reproducible example given by the OP without anybody having to ask for it".split()

In [20]: count_words(text, categories)
Out[20]: {'animals': 0, 'colours': 0, 'food': 0, 'sport': 0}

In [21]: text = ("cat dog fish "*3).split()

In [22]: count_words(text, categories)
Out[22]: {'animals': 9, 'colours': 0, 'food': 0, 'sport': 0}
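Since speed on large data was a concern, one possible variation is to tally the text once with collections.Counter and then sum the tallies per category. This is only a sketch and is not part of the original answer; the name count_words_counter is made up for illustration. It may help when the text is long and repeats many words, because the per-category work then depends on the size of the category sets rather than on the length of the text, but it is worth timing on the real data.

```python
from collections import Counter


def count_words_counter(text, categories):
    """Hypothetical variant: tally each distinct word once, then sum per category."""
    word_counts = Counter(text)  # single pass over the tokenised text
    return {
        # A Counter returns 0 for words that never occurred in the text.
        cat_name: sum(word_counts[word] for word in cat_words)
        for cat_name, cat_words in categories.items()
    }


categories = {
    "animals": {"cat", "dog", "fish"},
    "colours": {"blue", "green", "red"},
    "food": {"beef", "chips", "pasta"},
    "sport": {"basketball", "football", "tennis"},
}

text = ("cat dog fish " * 3).split() + ["blue"]
result = count_words_counter(text, categories)
print(result)
```

Like the loop version, this counts a word towards every category that contains it, so overlapping categories behave the same way.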