sklearn Clustering: Fastest way to determine the optimal number of clusters on large data sets

I use KMeans and silhouette_score from sklearn in Python to compute my clusters, but with >10,000 samples and >1,000 clusters, calculating the silhouette_score is very slow.

  1. Is there a faster method to determine the optimal number of clusters?
  2. Or should I change the clustering algorithm? If yes, which is the best (and fastest) algorithm for a data set with >300,000 samples and lots of clusters?

Answer

The most common method for finding the number of clusters is the elbow curve method. However, it requires running the KMeans algorithm multiple times, once per candidate number of clusters, to plot the curve. The Wikipedia page https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set describes some other common methods for determining the number of clusters.
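A minimal sketch of the elbow scan described above, under some assumptions that are mine and not part of the answer: it uses MiniBatchKMeans in place of plain KMeans to keep each fit cheap, synthetic data from make_blobs stands in for the real data set, and the helper name elbow_scan and the candidate values in k_values are purely illustrative. It also shows the sample_size argument of silhouette_score, which scores a random subset of the samples instead of all of them and may help if the silhouette is still wanted.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def elbow_scan(X, k_values, random_state=0):
    """Fit one model per candidate k and return {k: inertia}.

    Hypothetical helper. Inertia is the within-cluster sum of squared
    distances, which is the quantity plotted in an elbow curve.
    """
    inertias = {}
    for k in k_values:
        # Assumption: MiniBatchKMeans is swapped in for KMeans for speed.
        model = MiniBatchKMeans(
            n_clusters=k, batch_size=1024, n_init=3, random_state=random_state
        )
        model.fit(X)
        inertias[k] = model.inertia_
    return inertias


# Synthetic stand-in for the real data set (assumption).
X, _ = make_blobs(n_samples=5000, centers=5, n_features=10, random_state=0)

k_values = [2, 3, 4, 5, 6, 8, 10]
inertias = elbow_scan(X, k_values)
for k in k_values:
    # Look for the k after which the inertia stops dropping sharply.
    print(k, round(inertias[k], 1))

# Silhouette on a random subset of 1000 samples rather than on all of them.
labels = MiniBatchKMeans(
    n_clusters=5, batch_size=1024, n_init=3, random_state=0
).fit_predict(X)
score = silhouette_score(X, labels, sample_size=1000, random_state=0)
print("sampled silhouette:", round(score, 3))
```

For thousands of candidate clusters, scanning a coarse grid of k values first and refining around the bend keeps the number of fits small. Whether the mini-batch approximation is accurate enough is something to check on the real data.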
