
Obtain all Pairs of Strings which are Contained in Each Other [closed]

I have a list of strings and I want to extract all pairs of strings such that the first string is a substring of the second string. However, I do not want to include pairs in which the first string itself contains another string from the list (other than itself). I would like the output to be returned as a dataframe.

To give a simple example, consider the below list:

names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']

I expect the output to look like this:

[image: expected output table]

Note, the pair (‘big dog’, ‘big brown dog’) is not included because ‘dog’ is a substring of ‘big dog’.


Answer

Does this work?

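import pandas as pd

names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
# Sort shortest first so that shorter names get the first chance to claim a string.
names = sorted(names, key=len)
df = pd.DataFrame(columns=['Base String', 'String'])
# Every name is a candidate for the longer string of a pair.
base_strings = list(names)
used = set()  # strings that have already been paired with some name

i = 0
for name in names:
    for base in base_strings:
        # A name is also a substring of itself, so an unclaimed name
        # is first paired with itself.
        if name in base and base not in used:
            df.loc[i] = [name, base]
            used.add(base)
            i += 1

print(df)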

[image: output example]
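
If it helps, here is a minimal sketch that applies the rule from the question more literally: first find the names that contain no other name from the list, then pair each of those with every longer name that contains it. The function names `minimal_substring_pairs` and `pairs_to_frame` are made up for illustration, and I am assuming that self-pairs such as ('dog', 'dog') should be left out; drop the `base != s` check if you want them. The column names are reused from the code above.

```python
def minimal_substring_pairs(names):
    """Return (base, string) tuples where base is a substring of string
    and base itself contains no other name from the list."""
    unique = list(dict.fromkeys(names))  # drop duplicates, keep input order
    # A name is "minimal" if no *other* name in the list is a substring of it.
    minimal = [n for n in unique
               if not any(o != n and o in n for o in unique)]
    # Pair every minimal name with each different name that contains it.
    return [(base, s) for base in minimal for s in unique
            if base != s and base in s]


def pairs_to_frame(pairs):
    """Wrap the tuples in a two-column dataframe (requires pandas)."""
    import pandas as pd  # imported here so the helper above needs no pandas
    return pd.DataFrame(pairs, columns=['Base String', 'String'])


names = ['dog', 'big dog', 'big brown dog', 'cat', 'small cat', 'small white cat']
pairs = minimal_substring_pairs(names)
print(pairs)
```

Calling `pairs_to_frame(pairs)` then gives the dataframe. This compares every name against every other name, so it is quadratic in the length of the list, which should be fine for lists of modest size.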

User contributions licensed under: CC BY-SA