I have a list of strings and I want to extract all pairs of strings such that the first string is a substring of the second string. However, I do not want to include pairs where the first string contains another string in that list (other than itself). I would like the output to be returned as a dataframe.
To give a simple example, consider the below list:
names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
I expect the output to look like this:
Note, the pair (‘big dog’, ‘big brown dog’) is not included because ‘dog’ is a substring of ‘big dog’.
Answer
Does this work?
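import pandas as pd

names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
# Process shorter strings first, so a base such as 'dog' claims its matches
# before a longer string such as 'big dog' is considered as a base.
names = sorted(names, key=len)
df = pd.DataFrame(columns=['Base String', 'String'])
used = set()  # strings that have already been paired with a shorter base
i = 0
for name in names:
    for candidate in names:
        # Skip self-pairs and strings that already have a base.
        if name != candidate and name in candidate and candidate not in used:
            df.loc[i] = [name, candidate]
            used.add(candidate)
            i += 1
print(df)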
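One caveat with the `used` set: each longer string is paired only with the first base that claims it, so with a list like ['a', 'b', 'ab'] the pair ('b', 'ab') would be dropped. If that matters for your data, here is a minimal sketch of a more direct approach that first finds the strings containing no other string from the list, then pairs each of them with every string that contains it. The function name `minimal_base_pairs` is just something I made up for illustration, and I am assuming self-pairs and duplicate entries should be ignored.

```python
def minimal_base_pairs(names):
    """Return (base, string) pairs where base is a proper substring of string
    and base itself contains no other string from the list."""
    unique = list(dict.fromkeys(names))  # drop duplicates, keep input order
    # A string qualifies as a base only if no other listed string occurs inside it.
    bases = [s for s in unique
             if not any(t != s and t in s for t in unique)]
    # Pair every base with every other string that contains it.
    return [(b, s) for b in bases for s in unique if b != s and b in s]


names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
pairs = minimal_base_pairs(names)
print(pairs)
```

To get a dataframe, the list of tuples can be passed straight to pandas, for example `pd.DataFrame(pairs, columns=['Base String', 'String'])`. This compares every string against every other one, so it is quadratic in the list length, which should be fine for small lists.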
