I have list of thousand of strings sub_strings
, and list of millions of strings strings
I want to check if strings[i]
have substring from sub_strings
or substring that’s different in 1 character
JavaScript
x
9
1
sub_strings = ['hello']
2
strings = ['hell dude whats good',
3
'hllo',
4
'hallo',
5
'hello',
6
'dude whats good']
7
8
is_substring_no_more_then_1_differnce(strings , sub_strings)
9
expected
JavaScript
1
2
1
[True, True, True, True, False]
2
Advertisement
Answer
Here’s a route for solving:
JavaScript
1
16
16
1
from Levenshtein import distance as lev
2
3
sub_strings = ['hello', 'bob']
4
strings = ['hell dude whats good',
5
'hllo',
6
'hallo',
7
'hello',
8
'dude whats good',
9
'bobo wants some food']
10
11
distance_list = []
12
for sentence in strings:
13
distance_list.append(min([lev(word, substring) for word in sentence.split() for substring in sub_strings]))
14
15
print([x <= 1 for x in distance_list])
16
That will spit out [True, True, True, True, False, True]
But this is going to get very slow as you add elements to either list. Every individual word inside a string inside strings
has to be checked against every word in substrings
. That’s a lot of checking when you have millions of words in each string in strings
and thousands of subtrings
.