How would I make clusters from a Levenshtein similarity matrix?

Question

I have a similarity matrix of words and would like to apply an algorithm that can put the words in clusters. Here's the example I have so far: Obviously this is a very simple dummy example, but what I would expect the output to be is 2 clusters, one with 'The Bachelor','The Bachelorette','The Bachelor Special', and the other with 'SportsCenter','SportsCenter

Accepted Answer

Using the matrix that your code generates, the matrix can be looped through to find the similarity between each word and the words under a certain threshold can be grouped. The code should look similar to the following:from Levenshtein import distanceimport numpy as npwords = ['The Bachelor','The Bachelorette','The Bachelor Special','SportsCenter',     'SportsCenter 8 PM','SportsCenter Sunday']list1 = wordslist2 = wordsmatrix = np.zeros((len(list1),len(list2)),dtype=np.int64)for i in range(0,len(list1)):    for j in range(0,len(list2)):        matrix[i,j] = distance(list1[i],list2[j])        difference_threshold = 10 #this variable is the maximum difference that 2 words can have to be grouped togetherclusters = [] #list of clusters#loop through the matrix previously generatedfor col in matrix:    cluster=[] #cluster generated for this row    for row in range(len(col)):        if(col[row] <= difference_threshold):            cluster.append(words[row])    #if the current cluster is a NOT duplicate, add it    if cluster not in clusters:        clusters.append(cluster)The threshold variable must be altered depending on the desired sensitivity for similarity. The code assumes that words may be repeated in separate clusters as long as they meet the similarity threshold. I recommend changing the matrix values to a percentage as well so that the length of the words will have less impact on sensitivity threshold. Also, if the matrix does not need to be computed before the clusters are created, the clusters could be created as the Levenshtein distances are computed.Happy Coding!

Advertisement

Answer