I have a list of documents and this class to perform actions on that list. morphed_documents is a list of strings, and at the end the algorithm returns the cluster for each document. But why are the results and the model's labels not the same?
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


class Vectorizer:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def fit_transform(self, morphed_documents):
        # Build the TF-IDF matrix from the list of document strings.
        matrix = self.vectorizer.fit_transform(morphed_documents)
        return matrix

    def fit(self, number_of_clusters, matrix):
        model = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=100)
        model.fit(matrix)
        return model

    def print_terms(self, model, number_of_clusters):
        # Sort each centroid's term weights in descending order.
        order_centroids = model.cluster_centers_.argsort()[:, ::-1]
        # get_feature_names() on scikit-learn versions before 1.0.
        terms = self.vectorizer.get_feature_names_out()
        for i in range(number_of_clusters):
            print("Cluster %d:" % i)
            for ind in order_centroids[i, :100]:
                print(' %s' % terms[ind])
Answer
The K-Means algorithm starts with a random initialization of the cluster centroids. This selection will be different each time you run KMeans and may produce different results. To get reproducible results, you can use the random_state argument of KMeans, which fixes the initial selection of cluster centroids:
model = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=100, random_state=123)
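To check this yourself, here is a minimal sketch that fits KMeans twice with the same seed and compares the labels. The toy corpus, the `cluster` helper, and the smaller `n_init=10` are assumptions made for illustration; they are not taken from the question.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for morphed_documents.
documents = [
    "cats purr and chase mice",
    "dogs bark and chase cats",
    "kittens and cats nap all day",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
    "markets rallied and bonds rose",
]

matrix = TfidfVectorizer().fit_transform(documents)


def cluster(seed):
    """Fit KMeans with the given seed and return one label per document."""
    model = KMeans(n_clusters=2, init="k-means++", max_iter=100,
                   n_init=10, random_state=seed)
    return model.fit_predict(matrix)


first = cluster(123)
second = cluster(123)

# Same data and same seed, so the two label arrays should be identical.
print((first == second).all())
```

If you drop random_state (or pass different seeds), the two runs may or may not agree, and even when the grouping is the same the cluster numbers themselves can be swapped.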