I use KMeans and silhouette_score from sklearn in Python to compute my clusters, but with more than 10,000 samples and more than 1,000 clusters, calculating the silhouette_score is very slow.
- Is there a faster method to determine the optimal number of clusters?
- Or should I change the clustering algorithm? If so, which is the best (and fastest) algorithm for a data set with more than 300,000 samples and lots of clusters?
Answer
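The most common method for finding the number of clusters is the elbow method. It still requires running the KMeans algorithm multiple times, once per candidate number of clusters, so that you can plot the within-cluster sum of squares against the number of clusters and look for the bend in the curve. The Wikipedia page https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set lists some common methods for determining the number of clusters.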
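A rough sketch of how the elbow method could look in sklearn follows. It is not a tuned solution. The synthetic data from make_blobs, the list of candidate cluster counts, and the batch and sample sizes are all assumptions chosen for illustration. Two substitutions are also made here that go beyond the plain elbow method: MiniBatchKMeans is swapped in for KMeans because it is usually faster on large data sets, and silhouette_score is called with its sample_size argument so that it is computed on a random subset instead of all samples, which may help with the slowness described in the question.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption): replace with your own feature matrix.
X, _ = make_blobs(n_samples=5000, centers=8, n_features=10, random_state=0)

# Candidate numbers of clusters to try (assumption: pick a range that fits your data).
candidate_ks = [2, 4, 8, 16, 32]

inertias = []     # within-cluster sum of squares, used for the elbow curve
silhouettes = []  # silhouette estimated on a random subsample

for k in candidate_ks:
    # MiniBatchKMeans fits on small random batches instead of the full data set.
    model = MiniBatchKMeans(n_clusters=k, batch_size=1024, n_init=3, random_state=0)
    labels = model.fit_predict(X)
    inertias.append(model.inertia_)

    # sample_size makes silhouette_score use only 1000 randomly chosen samples.
    silhouettes.append(
        silhouette_score(X, labels, sample_size=1000, random_state=0)
    )

for k, inertia, sil in zip(candidate_ks, inertias, silhouettes):
    print(f"k={k:3d}  inertia={inertia:12.1f}  silhouette(sampled)={sil:.3f}")
```

Plotting inertias against candidate_ks gives the elbow curve. The sampled silhouette is only an estimate, so it is worth repeating it with a different random_state before trusting small differences between values of k.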