Background
mylist = ['abc123', 'abc123456', 'abc12355', 'def456', 'ghi789', 'def4567', 'ghi78910', 'abc123cvz']
I would like to find and group the substrings in the list into a list of tuples where the first element of the tuple would be the substring and the second element would be the larger string that contains the substring. The expected output is given below
[('abc123', 'abc123456'), ('abc123', 'abc12355'), ('abc123', 'abc123cvz'), ('def456', 'def4567'), ('ghi789', 'ghi78910')]
I’ve written the following code which achieves the desired outcome
substring_superstring_list = [] for sub in mylist: substring_superstring_pair = [(sub, s) for s in mylist if sub in s and s != sub] if substring_superstring_pair: substring_superstring_list.append(substring_superstring_pair) flat_list = [item for sublist in substring_superstring_list for item in sublist]
Is there a more efficient way to do this? I’ll eventually need to loop over a list containing 80k strings and do the above. I appreciate any suggestions/help
Advertisement
Answer
Combining suggestions in the comments and @ZabielskiGrabriel’s answer, you can do it by first sorting the list and then comparing each element in the sorted list with those that follow it in a list comprehension:
my_list = sorted(my_list, key=len) [(x, y) for i, x in enumerate(my_list, 1) for y in my_list[i:] if x in y]
Benchmarks (with supplied test list):
%timeit op(my_list) %timeit zabiel(my_list) %timeit nin17(my_list)
Output:
3.92 µs ± 31 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each) 2.76 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each) 2.25 µs ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)