Skip to content
Advertisement

How would I make clusters from a Levenshtein similarity matrix?

I have a similarity matrix of words and would like to apply an algorithm that can put the words in clusters.

Here’s the example I have so far:

from Levenshtein import distance
import numpy as np

words = ['The Bachelor','The Bachelorette','The Bachelor Special','SportsCenter',
         'SportsCenter 8 PM','SportsCenter Sunday']

list1 = words
list2 = words

matrix = np.zeros((len(list1),len(list2)),dtype=np.int)

for i in range(0,len(list1)):
    for j in range(0,len(list2)):
        matrix[i,j] = distance(list1[I],list2[j])

Obviously this is a very simple dummy example, but what I would expect the output to be is 2 clusters, one with ‘The Bachelor’,’The Bachelorette’,’The Bachelor Special’, and the other with ‘SportsCenter’,’SportsCenter 8 PM’,’SportsCenter Sunday’.

Can anyone help me with this?

Advertisement

Answer

Using the matrix that your code generates, the matrix can be looped through to find the similarity between each word and the words under a certain threshold can be grouped. The code should look similar to the following:

from Levenshtein import distance
import numpy as np

words = ['The Bachelor','The Bachelorette','The Bachelor Special','SportsCenter',
     'SportsCenter 8 PM','SportsCenter Sunday']

list1 = words
list2 = words

matrix = np.zeros((len(list1),len(list2)),dtype=np.int64)

for i in range(0,len(list1)):
    for j in range(0,len(list2)):
        matrix[i,j] = distance(list1[i],list2[j])
    
    
difference_threshold = 10 #this variable is the maximum difference that 2 words can have to be grouped together
clusters = [] #list of clusters

#loop through the matrix previously generated
for col in matrix:
    cluster=[] #cluster generated for this row
    for row in range(len(col)):
        if(col[row] <= difference_threshold):
            cluster.append(words[row])
    #if the current cluster is a NOT duplicate, add it
    if cluster not in clusters:
        clusters.append(cluster)

The threshold variable must be altered depending on the desired sensitivity for similarity. The code assumes that words may be repeated in separate clusters as long as they meet the similarity threshold. I recommend changing the matrix values to a percentage as well so that the length of the words will have less impact on sensitivity threshold. Also, if the matrix does not need to be computed before the clusters are created, the clusters could be created as the Levenshtein distances are computed.

Happy Coding!

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement