I’ve come across a problem that similar to the Longest Common Substring problem but with a modification. It’s as follows:
A list of strings lst
and a string text
are provided. The string may or may not contain substrings that are present in the list. I need the first longest substring of text
that is inside lst
, considering that you start checking text
from the back. What I mean by first and from the back is that you start iterating over text
from the last word, match the longest substring, and return after you encounter a character that breaks the substring match.
For example, if
lst = ['abcd', 'x', 'xy', 'xyz', 'abcdxyz'] text = 'abcd abcd xyz xyz'
Then the answer would be the last xyz
in the text because you start checking from the back of text
, it’s inside lst
, and it is a substring of text
.
'abcd'
is not the answer because it comes beforexyz
intext
'abcdxyz'
is not the answer because it’s a subsequence intext
Also, in text
, substrings can be separated by any character that isn’t inside [A-Za-z]
, but usually they are separated by spaces.
I need an algorithm that solves this. Pseudocode or a Python program would do.
Some test cases
lst = ['brisbane', 'east brisbane', '2 street east'] text = '2 street east brisbane' # answer is east brisbane lst = ['brisbane', 'east brisbane', '2 street east'] text = '2 street east east brisbane brisbane xyz' # answer is brisbane lst = ['sale', 'yarrabilba'] text = 'sale yarrabilba' # answer is yarrabilba lst = ['sale', 'yarrabilba'] text = 'abc fgh xyz' # answer is None text = 'A Street Name Some Words Suburb Name' lst = ['A Street Name', 'Suburb Name'] # answer is 'Suburb Name' text = 'A Street Name Some Words Name Suburb Name' lst = ['A Street Name', 'Suburb Name', 'Name'] # answer is 'Suburb Name'
Advertisement
Answer
Let’s use your better example of the comments:
In the original problem, text can be a string like
'A Street Name Some Words Suburb Name'
, and lst can be['A Street Name', 'Suburb Name']
, then I would like to match'Suburb Name'
only. It is not possible that'A Street Name'
comes after'Suburb Name'
The task would be easy if you had to find the first match of the sentence, you could use a regex and re.finditer
. Then, let’s rework the input by inverting the words and do this!
text = 'A Street Name Some Words Suburb Name' lst = ['A Street Name', 'Suburb Name'] import re # define a helper function to reverse words rev = lambda x: ' '.join(reversed(x.split())) # invert words in the query txet = rev(text) # 'Name Suburb Words Some Name Street A' # invert words in the searched strings tsl = [rev(e) for e in sorted(lst, key=len, reverse=True)] # ['Name Street A', 'Name Suburb'] # find "first" match m = re.finditer('|'.join(tsl), txet) try: out = rev(next(m).group()) except StopIteration: out = None
output: 'Suburb Name'
example #2:
lst = ['brisbane', 'east brisbane', '2 street east'] text = '2 street east east brisbane xyz'
output #2: 'east brisbane'