Skip to content
Advertisement

Python: What is an efficient way to loop over a list of strings and group substrings in the list?

Background

mylist = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789', 'def4567', 'ghi78910', 'abc123cvz']

I would like to find and group the substrings in the list into a list of tuples where the first element of the tuple would be the substring and the second element would be the larger string that contains the substring. The expected output is given below

[('abc123', 'abc123456'), ('abc123', 'abc12355'), ('abc123', 'abc123cvz'), ('def456', 'def4567'), ('ghi789', 'ghi78910')]

I’ve written the following code which achieves the desired outcome

substring_superstring_list = []
for sub in mylist:
   substring_superstring_pair = [(sub, s) for s in mylist if sub in s and s != sub]
   if substring_superstring_pair:
       substring_superstring_list.append(substring_superstring_pair)

flat_list = [item for sublist in substring_superstring_list for item in sublist]

Is there a more efficient way to do this? I’ll eventually need to loop over a list containing 80k strings and do the above. I appreciate any suggestions/help

Advertisement

Answer

Combining suggestions in the comments and @ZabielskiGrabriel’s answer, you can do it by first sorting the list and then comparing each element in the sorted list with those that follow it in a list comprehension:

my_list = sorted(my_list, key=len)
[(x, y) for i, x in enumerate(my_list, 1) for y in my_list[i:] if x in y]

Benchmarks (with supplied test list):

%timeit op(my_list)
%timeit zabiel(my_list)
%timeit nin17(my_list)

Output:

3.92 µs ± 31 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.76 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.25 µs ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement