Skip to content
Advertisement

How to search a string for a long list of patterns

I’m writing a tool to index a document. I have a long list of hundreds, perhaps thousands of fixed patterns. E.g. my index might look like {"cat training":"p.27", "cat handling":"p.29", "cat":"p.15", "dog training":"p.62", "dog":"p.60"} and so on.

Now I want to search my text (for the sake of argument, each paragraph is a single string) for all instances of any substring in my index. (During the search, I’ll sort my keys by length, as shown, so that “cat training” would match before “cat”).

One more complication is that I want the matches to occur on word boundaries. I.e. I don’t want “catch” to match “cat”.

Is there a pythonic way to do this? My current solution is to scan through the source string, word-by-word, and then try to match the start of the string against my entire index. It works, but it’s pretty slow.

Advertisement

Answer

Aho-Corasick algorithm was developed to tackle this type of problem.

It was used to answer a previous Stack Overflow question about matching a large number of patterns.

Python library for Aho–Corasick.

Procedure for modifying Aho-Corasick algorithm for word boundaries

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement