Skip to content
Advertisement

Longest common substring in a text that is inside a list of strings

I’ve come across a problem that similar to the Longest Common Substring problem but with a modification. It’s as follows:

A list of strings lst and a string text are provided. The string may or may not contain substrings that are present in the list. I need the first longest substring of text that is inside lst, considering that you start checking text from the back. What I mean by first and from the back is that you start iterating over text from the last word, match the longest substring, and return after you encounter a character that breaks the substring match.

For example, if

lst = ['abcd', 'x', 'xy', 'xyz', 'abcdxyz']
text = 'abcd abcd xyz xyz'

Then the answer would be the last xyz in the text because you start checking from the back of text, it’s inside lst, and it is a substring of text.

  • 'abcd' is not the answer because it comes before xyz in text
  • 'abcdxyz' is not the answer because it’s a subsequence in text

Also, in text, substrings can be separated by any character that isn’t inside [A-Za-z], but usually they are separated by spaces.

I need an algorithm that solves this. Pseudocode or a Python program would do.

Some test cases

lst = ['brisbane', 'east brisbane', '2 street east']
text = '2 street east brisbane'
# answer is east brisbane

lst = ['brisbane', 'east brisbane', '2 street east']
text = '2 street east east brisbane brisbane xyz'
# answer is brisbane

lst = ['sale', 'yarrabilba']
text = 'sale yarrabilba'
# answer is yarrabilba

lst = ['sale', 'yarrabilba']
text = 'abc fgh xyz'
# answer is None

text = 'A Street Name Some Words Suburb Name' 
lst = ['A Street Name', 'Suburb Name']
# answer is 'Suburb Name'

text = 'A Street Name Some Words Name Suburb Name' 
lst = ['A Street Name', 'Suburb Name', 'Name']
# answer is 'Suburb Name'

Advertisement

Answer

Let’s use your better example of the comments:

In the original problem, text can be a string like 'A Street Name Some Words Suburb Name', and lst can be ['A Street Name', 'Suburb Name'], then I would like to match 'Suburb Name' only. It is not possible that 'A Street Name' comes after 'Suburb Name'

The task would be easy if you had to find the first match of the sentence, you could use a regex and re.finditer. Then, let’s rework the input by inverting the words and do this!

text = 'A Street Name Some Words Suburb Name'
lst  = ['A Street Name', 'Suburb Name']

import re

# define a helper function to reverse words
rev = lambda x: ' '.join(reversed(x.split()))

# invert words in the query
txet = rev(text)
# 'Name Suburb Words Some Name Street A'

# invert words in the searched strings
tsl  = [rev(e) for e in sorted(lst, key=len, reverse=True)]
# ['Name Street A', 'Name Suburb']

# find "first" match    
m = re.finditer('|'.join(tsl), txet)
try:
    out = rev(next(m).group())
except StopIteration:
    out = None

output: 'Suburb Name'

example #2:

lst = ['brisbane', 'east brisbane', '2 street east']
text = '2 street east east brisbane xyz'

output #2: 'east brisbane'

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement