I have a list of strings and I want to extract all pairs of strings in which the first string is a substring of the second. However, I do not want to include pairs in which the first string itself contains another string from the list (other than itself). I would like the output to be returned as a dataframe.
To give a simple example, consider the below list:
names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
I expect the output to look like this:
Note, the pair (‘big dog’, ‘big brown dog’) is not included because ‘dog’ is a substring of ‘big dog’.
Answer
Does this work?
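import pandas as pd

names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']

# Sort by length so that shorter strings claim their containers first.
names = sorted(names, key=len)
df = pd.DataFrame(columns=['Base String', 'String'])

base_strings = list(names)  # every name is a candidate container
used = set()  # strings already paired with a shorter string
i = 0
for name in names:
    for base in base_strings:
        # `name` is the shorter string; `base` is the string that contains it.
        # A string contains itself, so 'dog' and 'cat' are also paired with themselves.
        if name in base and base not in used:
            df.loc[i] = [name, base]
            used.add(base)
            i += 1

print(df)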
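If the `used` set turns out to be too restrictive (a string that contains two different base strings is only paired with the first one), here is a possible variant. It is only a sketch under my own assumptions: the helper name `substring_pairs` is made up for illustration, self-pairs such as ('dog', 'dog') are assumed to be unwanted, and a "base string" is taken to mean a string that contains no other string from the list.

```python
import pandas as pd


def substring_pairs(names):
    """Hypothetical helper: pair each base string with the strings containing it."""
    # Assumption: a base string contains no other string from the list.
    bases = [s for s in names
             if not any(o != s and o in s for o in names)]
    # Pair every base with every other string that contains it.
    rows = [(b, s) for b in bases for s in names if b != s and b in s]
    return pd.DataFrame(rows, columns=['Base String', 'String'])


names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
pairs = substring_pairs(names)
print(pairs)
```

This compares every string with every other string, so it is quadratic in the length of the list, which should be fine for short lists like the one in the question.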