I’m running a k-means algorithm (k=5) to cluster my data. To check the stability of my algorithm, I first run it once on my whole dataset, and afterwards I run it multiple times on 2/3 of my dataset (using a different random state for each split). I use the results to predict the cluster of the remaining 1/3 of my data. Finally, I want to compare the predicted clusters with the clusters I get when I run k-means on the whole dataset. This is where I get stuck.
Since k-means can assign different labels to the (more or less) same clusters on every run, I can’t just compare them. I tried using .value_counts() to reassign the labels 0 to 4 based on their frequency. But because I run this check multiple times, I need something that works in a loop.
Basically, when I use .value_counts() I get something like this:
PredictedCluster
4    55555
0    44444
2    33333
1    22222
3    11111
I wish I could turn this into an array, where the labels are sorted by size:
a = [[4, 55555], [0, 44444], ..., [3, 11111]]
Can anyone please tell me how to do this, or what other approaches I could use to solve my problem?
Answer
Something like the one-liner below could work:
a = list(map(list, df["PredictedCluster"].value_counts().items()))
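To make this usable inside a loop, here is a minimal sketch of how the one-liner could be combined with a relabeling step. The toy DataFrame and the helper name relabel_by_frequency are made up for illustration, and the sketch assumes the cluster sizes differ enough that ranking by frequency identifies the same cluster in every run:

```python
import pandas as pd

# Hypothetical toy data standing in for the predicted labels.
df = pd.DataFrame({"PredictedCluster": [4, 4, 4, 0, 0, 2]})


def relabel_by_frequency(labels: pd.Series) -> pd.Series:
    """Map the most frequent label to 0, the next most frequent to 1, and so on."""
    counts = labels.value_counts()  # sorted by count, descending
    mapping = {old: new for new, old in enumerate(counts.index)}
    return labels.map(mapping)


# [label, count] pairs, sorted by cluster size (the one-liner from above).
a = list(map(list, df["PredictedCluster"].value_counts().items()))

# Frequency-ranked labels that can be compared across runs.
df["Relabeled"] = relabel_by_frequency(df["PredictedCluster"])
```

Calling relabel_by_frequency on the labels of every run (the full-data run and each 2/3 split) should give labels that are comparable, as long as no two clusters have nearly the same size. If they do, the ranking can flip between runs and the comparison would be misleading.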