
Why does Python's scikit-learn K-Means text clustering algorithm always give different results?

I have a list of documents and the class below to operate on that list. morphed_documents is a list of strings, and at the end the algorithm returns a cluster for each document. Why do the results and the model's labels differ from run to run?

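from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


class Vectorizer:

    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def fit_transform(self, morphed_documents):
        # Learn the vocabulary and return the TF-IDF document-term matrix.
        matrix = self.vectorizer.fit_transform(morphed_documents)
        return matrix

    def fit(self, number_of_clusters, matrix):
        model = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=100)
        model.fit(matrix)
        return model

    def print_terms(self, model, number_of_clusters):
        # Sort each centroid's term weights in descending order.
        order_centroids = model.cluster_centers_.argsort()[:, ::-1]
        # get_feature_names() was removed in scikit-learn 1.2.
        terms = self.vectorizer.get_feature_names_out()

        for i in range(number_of_clusters):
            print("Cluster %d:" % i)
            # Print the 100 highest-weighted terms of this cluster.
            for ind in order_centroids[i, :100]:
                print(' %s' % terms[ind])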


Answer

The K-Means algorithm starts from a random initialization of the cluster centroids (init='k-means++' also picks its starting centroids at random). That selection differs each time you run KMeans, so the results may differ too. To get reproducible results, pass the random_state argument to KMeans, which fixes the seed used for the initial selection of cluster centroids:

model = KMeans(n_clusters=number_of_clusters, 
               init='k-means++', 
               max_iter=100, 
               n_init=100, 
               random_state=123)
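
To check the effect of random_state, a minimal sketch along these lines may help. The toy corpus and the cluster() helper are hypothetical stand-ins for morphed_documents and the class above, not part of the original code; n_init is lowered to 10 only to keep the run short.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for morphed_documents.
documents = [
    "cats purr and chase mice",
    "dogs bark and chase cats",
    "kittens and puppies play",
    "stocks fell as markets closed",
    "investors sold shares and bonds",
    "markets rallied on bank earnings",
]

matrix = TfidfVectorizer().fit_transform(documents)


def cluster(seed):
    """Fit K-Means with a fixed seed and return one label per document."""
    model = KMeans(n_clusters=2, init="k-means++", max_iter=100,
                   n_init=10, random_state=seed)
    return model.fit_predict(matrix)


first = cluster(123)
second = cluster(123)

# With the same seed, both runs should assign identical labels.
print((first == second).all())
```

Note that the cluster numbers themselves are arbitrary: two runs with different seeds can find the same grouping of documents yet number the clusters differently, so comparing raw labels between unseeded runs can overstate how much the clustering actually changed.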
User contributions licensed under: CC BY-SA