
Identifying common elements in a list of words

I have a list of words in a column, and I need to find the elements they have in common. For example, the list contains words such as:

sinazz31 sinazz12 45sinazz sinazz_84

As you can see, the common element is “sinazz”. Is there a way to develop an algorithm in Python to identify such common elements? Words shorter than 4 characters can be ignored.

Answer

You could search for substrings that occur in more than one of the source strings. The code below takes its candidates from the longest word, starting with that word's full length and working down to a minimum substring length:

string = 'sinazz31 sinazz12 45sinazz sinazz_84'
min_substring_length = 3

words = string.split()
longest_word = max(filter(None, words), key=len)
matches = {}

for sub_length in range(len(longest_word), min_substring_length - 1, -1):
    # + 1 so the last window, ending at the final character, is also checked
    for x in range(len(longest_word) - sub_length + 1):
        substring = longest_word[x:x + sub_length]  # candidate substring to check
        check = sum(substring in word for word in words)  # number of words containing substring
        if check > 1:
            matches[substring] = check

# results
if matches:
    # substrings ordered by the number of words that contain them, most frequent first
    match_list = sorted(matches, key=matches.get, reverse=True)

    if matches[match_list[0]] == len(words):  # top substring occurs in every word
        print('best match for all words:', match_list[0])
    print('best to worst:', match_list)
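
If you only need the single longest substring shared by every word, a shorter variant is possible. The following is a minimal sketch, not part of the code above: the function name longest_common_substring and its min_length parameter are hypothetical, and it assumes that words shorter than 4 characters should be dropped, as the question allows. It takes candidates from the shortest word, since a substring common to all words cannot be longer than that.

```python
def longest_common_substring(words, min_length=4):
    """Return the longest substring found in every word, or None."""
    # Assumption: words shorter than min_length are ignored entirely.
    words = [w for w in words if len(w) >= min_length]
    if not words:
        return None
    shortest = min(words, key=len)
    # Try candidate lengths from longest to shortest so the first hit wins.
    for length in range(len(shortest), min_length - 1, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in w for w in words):
                return candidate
    return None


words = 'sinazz31 sinazz12 45sinazz sinazz_84'.split()
print(longest_common_substring(words))  # sinazz
```

Because the search stops at the first substring found in all words, this does less work than collecting every match, but it reports nothing about substrings shared by only some of the words.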
User contributions licensed under: CC BY-SA