I have a similarity matrix of words and would like to apply an algorithm that can put the words in clusters.
Here’s the example I have so far:
from Levenshtein import distance import numpy as np words = ['The Bachelor','The Bachelorette','The Bachelor Special','SportsCenter', 'SportsCenter 8 PM','SportsCenter Sunday'] list1 = words list2 = words matrix = np.zeros((len(list1),len(list2)),dtype=np.int) for i in range(0,len(list1)): for j in range(0,len(list2)): matrix[i,j] = distance(list1[I],list2[j])
Obviously this is a very simple dummy example, but what I would expect the output to be is 2 clusters, one with ‘The Bachelor’,’The Bachelorette’,’The Bachelor Special’, and the other with ‘SportsCenter’,’SportsCenter 8 PM’,’SportsCenter Sunday’.
Can anyone help me with this?
Advertisement
Answer
Using the matrix that your code generates, the matrix can be looped through to find the similarity between each word and the words under a certain threshold can be grouped. The code should look similar to the following:
from Levenshtein import distance import numpy as np words = ['The Bachelor','The Bachelorette','The Bachelor Special','SportsCenter', 'SportsCenter 8 PM','SportsCenter Sunday'] list1 = words list2 = words matrix = np.zeros((len(list1),len(list2)),dtype=np.int64) for i in range(0,len(list1)): for j in range(0,len(list2)): matrix[i,j] = distance(list1[i],list2[j]) difference_threshold = 10 #this variable is the maximum difference that 2 words can have to be grouped together clusters = [] #list of clusters #loop through the matrix previously generated for col in matrix: cluster=[] #cluster generated for this row for row in range(len(col)): if(col[row] <= difference_threshold): cluster.append(words[row]) #if the current cluster is a NOT duplicate, add it if cluster not in clusters: clusters.append(cluster)
The threshold variable must be altered depending on the desired sensitivity for similarity. The code assumes that words may be repeated in separate clusters as long as they meet the similarity threshold. I recommend changing the matrix values to a percentage as well so that the length of the words will have less impact on sensitivity threshold. Also, if the matrix does not need to be computed before the clusters are created, the clusters could be created as the Levenshtein distances are computed.
Happy Coding!