I know that similar questions have been asked several times, but my problem is a bit different and I am looking for a time-efficient solution, in Python.
I have a set of words, some of which end with an asterisk (“*”) and some of which don’t:
words = set(["apple", "cat*", "dog"])
I have to count their total occurrences in a text, considering that anything can go after an asterisk (“cat*” means all the words that start with “cat”). Search has to be case insensitive. Consider this example:
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
I would like to get a final score of 4 (= cat* x 2 + dog + apple). Please note that “cat*” has been counted twice, because its plural also matches, whereas “apple” has been counted just once, because its plural is not considered (it has no asterisk at the end).
I have to repeat this operation on a large set of documents, so I need a fast solution. I don’t know whether regex or flashtext would be fast enough. Could you help me?
I forgot to mention that some of my words contain punctuation, for example:
words = set(["apple", "cat*", "dog", ":)", "I've"])
This seems to create additional problems when compiling the regex. Is there a way to extend the code you already provided so that it also works for these two additional words?
You can do this with a regex built from the set of words: put word boundaries (\b) around each word, but leave the trailing word boundary off words that end with *. Compiling the regex once should help performance:
import re

words = set(["apple", "cat*", "dog"])
text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"

# Words ending in "*" keep only the leading \b (prefix match);
# all other words get \b on both sides (whole-word match).
pattern = '|'.join(
    r'\b' + w[:-1] if w.endswith('*') else r'\b' + w + r'\b'
    for w in words
)
regex = re.compile(pattern, re.I)
matches = regex.findall(text)
print(len(matches))
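For the follow-up about words such as “:)” and “I've”, here is a minimal sketch of one possible extension. It is not part of the code above, and the helper names `build_counter` and `count_matches` are made up for illustration. It assumes two things: every word should be passed through `re.escape` so that characters like `)` are treated literally, and a boundary check only makes sense on an edge of the word that is itself a word character (so “:)” is allowed to touch neighbouring text). The lookarounds `(?<!\w)` and `(?!\w)` stand in for `\b`, which cannot work next to punctuation.

```python
import re


def build_counter(words):
    """Compile one case-insensitive pattern for a set of words.

    A trailing "*" marks a prefix match; any other word must match whole.
    (Hypothetical helper, written for this example.)
    """
    parts = []
    # Longest words first, so a longer alternative is tried before a
    # shorter one that starts at the same position.
    for w in sorted(words, key=len, reverse=True):
        is_prefix = w.endswith("*")
        raw = w[:-1] if is_prefix else w
        if not raw:
            continue  # skip a bare "*", which would match everything
        # Only guard an edge with a lookaround if that edge is a word
        # character; punctuation such as ":)" has no word boundary to check.
        lead = r"(?<!\w)" if re.match(r"\w", raw[0]) else ""
        trail = r"(?!\w)" if not is_prefix and re.match(r"\w", raw[-1]) else ""
        parts.append(lead + re.escape(raw) + trail)
    return re.compile("|".join(parts), re.IGNORECASE)


def count_matches(pattern, text):
    """Count non-overlapping matches of the compiled pattern in text."""
    return sum(1 for _ in pattern.finditer(text))


words = {"apple", "cat*", "dog", ":)", "I've"}
pattern = build_counter(words)  # compile once, reuse for every document

text = "My cat loves apples, but I never ate an apple. My dog loves them less than my CATS"
print(count_matches(pattern, text))
```

Because the pattern is compiled once and reused, the per-document cost is a single scan with `finditer`. Whether “:)” should also count when it is glued to a word (as in “great:)”) is a design choice; this sketch counts it.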