I have a list of words in a column, and I need to find the elements they have in common. For example, the list contains words such as:
sinazz31 sinazz12 45sinazz sinazz_84
As you can see, the common element is “sinazz”. Is there a way to develop an algorithm in Python to identify such common elements? Words shorter than 4 characters can be ignored.
Answer
You could search for substrings that occur in the source strings. Take the candidates from the longest word, starting at its full length and working down to a minimum substring length, and count how many words contain each one:
    string = 'sinazz31 sinazz12 45sinazz sinazz_84'
    min_substring_length = 3

    words = string.split()
    longest_word = max(filter(None, words), key=len)
    matches = {}

    # try every substring of the longest word, longest candidates first
    for sub_length in range(len(longest_word), min_substring_length - 1, -1):
        # + 1 so the substring ending at the last character is included
        for x in range(len(longest_word) - sub_length + 1):
            substring = longest_word[x:x + sub_length]  # create substring to check
            check = sum(substring in word for word in words)  # number of words containing substring
            if check > 1:
                matches[substring] = check

    # results
    if matches:
        # matches by frequency; ties keep the longest-first insertion order
        match_list = sorted(matches, key=matches.get, reverse=True)
        if matches[match_list[0]] == len(words):  # prints substring if matches all words
            print('best match for all words:', match_list[0])
        print('best to worst:', match_list)
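If you only need the single longest substring shared by every word, a more compact variant is possible. The following is a minimal sketch, not part of the original answer: the function name `longest_common_substring` and its parameters are hypothetical, it takes candidates from the shortest word (any substring common to all words must occur there), and it assumes the "shorter than 4" rule from the question means dropping short words before searching.

```python
def longest_common_substring(words, min_word_length=4, min_substring_length=3):
    """Return the longest substring shared by every kept word, or None."""
    # Assumption: words shorter than min_word_length are ignored entirely.
    kept = [w for w in words if len(w) >= min_word_length]
    if not kept:
        return None
    # Any substring common to all words must occur in the shortest one.
    shortest = min(kept, key=len)
    # Try the longest candidates first and stop at the first hit.
    for size in range(len(shortest), min_substring_length - 1, -1):
        for start in range(len(shortest) - size + 1):
            candidate = shortest[start:start + size]
            if all(candidate in w for w in kept):
                return candidate
    return None


words = 'sinazz31 sinazz12 45sinazz sinazz_84'.split()
print(longest_common_substring(words))  # sinazz
```

Unlike the code above, this stops at the first match and does not rank substrings that appear in only some of the words, so it fits best when every word is expected to share the same element.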