I have a list of strings and I want to extract all pairs of strings such that the first string is a substring of the second string. However, I do not want to include pairs whose first string contains another string in the list (other than itself). I would like the output to be returned as a dataframe.
To give a simple example, consider the list below:
```python
names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
```
I expect the output to look like this:
Note that the pair (‘big dog’, ‘big brown dog’) is not included because ‘dog’ is a substring of ‘big dog’.
Answer
Does this work?
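```python
import pandas as pd

names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
# Sort by length so that shorter strings are tried as substrings first.
names = sorted(names, key=len)
df = pd.DataFrame(columns=['Base String', 'String'])
# `x in x` is always true, so this is simply a copy of names.
base_strings = [x for x in names if x in x]
# Strings that have already been written to the dataframe.
used = set()

i = 0
for name in names:
    for base in base_strings:
        # `name` is the candidate substring and `base` the string searched;
        # each `base` is recorded once, for the first `name` found in it.
        if name in base and base not in used:
            df.loc[i] = [name] + [base]
            used.add(base)
            i += 1

print(df)
```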
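One thing to watch for: the loop above also pairs a string with itself when nothing shorter is found in it (for example 'dog' with 'dog'), and it records each longer string against only one shorter string. If neither is wanted, a variant along the following lines may be closer to what the question describes. This is only a sketch under those assumptions. The helper names `substring_pairs` and `pairs_frame` are made up for illustration, and the column names are carried over from the code above rather than specified by the question.

```python
import pandas as pd


def substring_pairs(names):
    """Return (base, longer) pairs where `base` contains no other entry of `names`."""
    # Drop duplicate entries while keeping the original order.
    unique = list(dict.fromkeys(names))
    # A base string is one that contains no *other* entry of the list.
    bases = [s for s in unique if not any(t != s and t in s for t in unique)]
    # Pair every base with each different string that contains it.
    return [(b, s) for b in bases for s in unique if s != b and b in s]


def pairs_frame(names):
    # Column names follow the code above; the question does not specify them.
    return pd.DataFrame(substring_pairs(names), columns=['Base String', 'String'])


names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
df = pairs_frame(names)
print(df)
```

Because the pairs are built in one pass and handed to `pd.DataFrame` together, there is no row counter to maintain, and a string such as 'ab' in `['a', 'b', 'ab']` is paired with both 'a' and 'b'.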