I have a string S
and a list of strings allItems
, allItems
contains strings that may have common “sub-words” but one element is never an extension of another:
//good, they both contain Fuzzy but Fuzzy isn't in allItems const allItems = "FuzzyBunny", "FuzzyBear" //not good BCS allItems[i] = whatever1+allItems[j]+whatever2 const allItems = "Fuzzy", "FuzzyBear", "FuzzyBunny" , "FuzzyBearBunny"
My goal is to find every match or approximate match of a string in allItems
in S
alongside their index(can be start or end, or ideally both). I’ve been searching for some algorithms to do this, similar to the aho-corasick algorithm, but that doesn’t do approximate string matching.
Example:
S = "I love FuzzyBears and FuzzyDucks" allItems = ["FuzzyBear", "FuzzyDuck"]
->
[ { match: "FuzzyBear", matchIndex: 0, startIndex: 7, endIndex: 16 }, { match: "FuzzyDuck", matchIndex: 1, startIndex: 22, endIndex: 31 } ]
I’m still pretty new to pattern matching so I’d appreciate some resources on how to code any of the recommended algorithms.
UPDATE: I’ve found a Fuzzified Aho-Corasick automata as described here, but I have very little idea on how to implement this in JS.I also have no issues with the code being slow as I use this for a one time run, and rarely need to do this often.
Advertisement
Answer
Python version (fuzzy-matching) [splits string by spaces]:
import difflib s = "I love FuzzyBears and FuzzyDucks" allitems = ["FuzzyBear", "FuzzyDuck"] lst = [] for x in allitems: d = {} match = difflib.get_close_matches(x, s.split()) if match: d["match"] = x d["matchIndex"] = allitems.index(x) d["startIndex"] = s.find(x) d["endIndex"] = d["startIndex"] + len(x) lst.append(d) print(lst) >>> [{'match': 'FuzzyBear', 'matchIndex': 0, 'startIndex': 7, 'endIndex': 16}, {'match': 'FuzzyDuck', 'matchIndex': 1, 'startIndex': 22, 'endIndex': 31}]
Python version (fuzzy-matching) [split by all possible substrings]:
import difflib from itertools import combinations s = "I love FuzzyBears and FuzzyDucks" allitems = ["FuzzyBear", "FuzzyDuck"] lst = [] for x in allitems: d = {} match = difflib.get_close_matches(x, [x[i:j] for i, j in combinations(range(len(x) + 1), r=2)]) if match: d["match"] = x d["matchIndex"] = allitems.index(x) d["startIndex"] = s.find(x) d["endIndex"] = d["startIndex"] + len(x) lst.append(d) print(lst) >>> [{'match': 'FuzzyBear', 'matchIndex': 0, 'startIndex': 7, 'endIndex': 16}, {'match': 'FuzzyDuck', 'matchIndex': 1, 'startIndex': 22, 'endIndex': 31}]
After a bunch of online searching:
JS version (exact-matching):
const s = "I love FuzzyBears and FuzzyDucks"; const allitems = ["FuzzyBear", "FuzzyDuck"]; const lst = []; for (const i of allitems) { const x = {}; if (s.includes(i)) { x["match"] = i; x["matchIndex"] = allitems.indexOf(i); x["startIndex"] = s.indexOf(i); x["endIndex"] = s.indexOf(i) + i.length; } lst.push(x); } console.log(lst); >>> [ { match: 'FuzzyBear', matchIndex: 0, startIndex: 7, endIndex: 16 }, { match: 'FuzzyDuck', matchIndex: 1, startIndex: 22, endIndex: 31 } ]