Python: Convert a pandas Series into an array and keep the index

I’m running a k-means algorithm (k=5) to cluster my data. To check the stability of my algorithm, I first run it once on my whole dataset, and afterwards I run it multiple times on 2/3 of my dataset (using different random states for the splits). I use the results to predict the cluster of the remaining 1/3 of my data. Finally, I want to compare the predicted clusters with the clusters I get when I run k-means on the whole dataset. This is where I get stuck.

Since k-means can assign different labels to (more or less) the same clusters on each run, I can’t compare the labels directly. I tried using .value_counts() to reassign the labels 0 to 4 based on their frequency. But because I run this check multiple times, I need something that works in a loop.
Basically, when I use .value_counts() I get something like this:

     PredictedCluster  
4              55555  
0              44444
2              33333
1              22222
3              11111

I would like to turn this into an array of [label, count] pairs, sorted by cluster size:

a = [[4, 55555],[0,44444],...,[3,11111]]

Can anyone please tell me how to do this, or what other approaches I could use to solve my problem?

Answer

Something like the one-liner below could work:

a = list(map(list, df["PredictedCluster"].value_counts().items()))
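
For use inside a loop, the same value_counts() result can also be turned into a mapping from each original label to its frequency rank. The following is only a minimal sketch: the helper name relabel_by_frequency and the tiny example DataFrame are made up for illustration, and the approach assumes the cluster sizes are clearly different from each other. If two clusters have similar sizes, ranking by frequency may pair them up incorrectly.

```python
import pandas as pd


def relabel_by_frequency(labels: pd.Series) -> pd.Series:
    """Map each label to its frequency rank (0 = most frequent label).

    Hypothetical helper for illustration; assumes cluster sizes are distinct.
    """
    # value_counts() sorts by count in descending order by default,
    # so the position of a label in the index is its frequency rank.
    counts = labels.value_counts()
    mapping = {old: new for new, old in enumerate(counts.index)}
    return labels.map(mapping)


# Made-up example data standing in for the real cluster predictions.
df = pd.DataFrame({"PredictedCluster": [4, 4, 4, 0, 0, 2]})

# The one-liner from above: [label, count] pairs sorted by count.
pairs = list(map(list, df["PredictedCluster"].value_counts().items()))

# Labels rewritten so that 0 is the largest cluster, 1 the next, and so on.
relabeled = relabel_by_frequency(df["PredictedCluster"])
```

Applying the same helper to the labels from the full-data run and to the labels from each 2/3 run would put both on the same 0 to 4 scale, so they can then be compared row by row.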
User contributions licensed under: CC BY-SA