I have a list of documents and the class below to perform actions on that list. Basically, morphed_documents is a list of strings, and at the end the algorithm returns the cluster for each document. But why are the results and the model's labels not the same?
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


class Vectorizer:
    def __init__(self):
        self.vectorizer = TfidfVectorizer()

    def fit_transform(self, morphed_documents):
        # Build the TF-IDF matrix: one row per document, one column per term
        matrix = self.vectorizer.fit_transform(morphed_documents)
        return matrix

    def fit(self, number_of_clusters, matrix):
        model = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=100)
        model.fit(matrix)
        return model

    def print_terms(self, model, number_of_clusters):
        # Sort term indices of each centroid by descending weight
        order_centroids = model.cluster_centers_.argsort()[:, ::-1]
        # get_feature_names() was removed in scikit-learn 1.2
        terms = self.vectorizer.get_feature_names_out()
        for i in range(number_of_clusters):
            print("Cluster %d:" % i)
            for ind in order_centroids[i, :100]:
                print(' %s' % terms[ind])
Answer
The K-Means algorithm starts from a random initialization of the cluster centroids (the 'k-means++' seeding is randomized too). That initial selection differs each time you run KMeans, so the resulting clusters, and the numbers assigned to them, may differ between runs. To get reproducible results, pass the random_state argument to KMeans, which fixes the initial selection of cluster centroids:
model = KMeans(n_clusters=number_of_clusters,
               init='k-means++',
               max_iter=100,
               n_init=100,
               random_state=123)
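As a rough check of this, here is a minimal sketch that fits KMeans twice with the same seed and compares the labels. The toy corpus, the `cluster` helper, and the seed values are made up for illustration and are not from the question; `n_init` is lowered to 10 only to keep the run short. Cluster ids are arbitrary, so two runs with different seeds can describe the same grouping under different numbers. `adjusted_rand_score` compares the groupings themselves and ignores the numbering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy corpus standing in for morphed_documents
docs = [
    "cat dog pet animal",
    "dog pet puppy animal",
    "stock market trading price",
    "market price shares trading",
    "football match goal team",
    "team goal league football",
]

matrix = TfidfVectorizer().fit_transform(docs)


def cluster(seed):
    """Fit KMeans with a fixed seed and return the label of each document."""
    model = KMeans(n_clusters=3, init="k-means++", max_iter=100,
                   n_init=10, random_state=seed)
    return model.fit(matrix).labels_


labels_a = cluster(123)
labels_b = cluster(123)   # same seed: same initialization, same labels
labels_c = cluster(7)     # different seed: cluster ids may be permuted

same_seed_identical = bool(np.array_equal(labels_a, labels_b))
print("same seed, identical labels:", same_seed_identical)

# Compares the partitions regardless of how the clusters are numbered
print("ARI between seeds 123 and 7:", adjusted_rand_score(labels_a, labels_c))
```

If the labels still look different from what you expect with a fixed seed, compare the groupings with a permutation-invariant score like the one above instead of comparing the raw label numbers.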