
K-means clustering: top terms in each cluster

I am using the Python K-means clustering algorithm to cluster documents. I have created a term-document matrix:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    vectorizer = TfidfVectorizer(tokenizer=tokenize, encoding='latin-1',
                                 stop_words='english')
    X = vectorizer.fit_transform(token_dict.values())

Then I applied K-means clustering using the following code:

    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    km.fit(X)

My next task is to see the top terms in each cluster. Searching on Google suggested that many people have used km.cluster_centers_.argsort()[:, ::-1] to find the top terms in the clusters, with the following code:

    print("Top terms per cluster:")
    order_centroids = km.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names()
    for i in range(true_k):
        print("Cluster %d:" % i, end='')
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind], end='')
        print()

Now, my question: to my understanding, km.cluster_centers_ returns the coordinates of the cluster centers. For example, if there are 100 features and three clusters, it would return a matrix of 3 rows and 100 columns, one centroid per cluster. What I wish to understand is how this is used in the above code to determine the top terms in each cluster. Thanks, any comments are appreciated. Nadeem


Answer

You’re correct about the shape and meaning of the cluster centers. Because you’re using a TF-IDF vectorizer, your “features” are the words in the vocabulary (and each document is its own vector). Thus, when you cluster the document vectors, each “feature” of a centroid represents the relevance of that word to the cluster: “word” (in the vocabulary) = “feature” (in your vector space) = “column” (in your centroid matrix).
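A minimal sketch to confirm the shape (the toy documents and cluster count here are made up for illustration; the point is that cluster_centers_ has one row per cluster and one column per vocabulary word):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cats purr and sleep", "dogs bark and fetch", "dogs chase cats"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # shape: (n_documents, n_features)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# One centroid row per cluster, one column per vocabulary word
print(km.cluster_centers_.shape)       # (2, n_features)
print(len(vectorizer.vocabulary_))     # n_features
```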

The get_feature_names call returns the mapping from column index to the word it represents (so it seems from the documentation; if that doesn’t work as expected, just invert the vocabulary_ dictionary to get the same result). Note that in recent scikit-learn versions get_feature_names has been replaced by get_feature_names_out.

Then the .argsort()[:, ::-1] line converts each centroid into a sorted (descending) list of the column indices most “relevant” (most highly valued) in it, and hence the words most relevant (since words = columns).
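You can see this on a tiny made-up centroid matrix: argsort returns column indices in ascending order of value, and [:, ::-1] reverses each row so the highest-valued columns come first.

```python
import numpy as np

centers = np.array([[0.1, 0.9, 0.4],    # cluster 0
                    [0.7, 0.2, 0.5]])   # cluster 1

# argsort gives ascending column indices per row; [:, ::-1] flips to descending
order = centers.argsort()[:, ::-1]
print(order)
# row 0: column 1 has the largest value (0.9), then column 2, then column 0
```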

The rest of the code is just printing; I’m sure that doesn’t need any explaining. All the code really does is sort each centroid in descending order of the features/words it values most, map those columns back to their original words, and print them.

User contributions licensed under: CC BY-SA