How to use `Dirichlet Process Gaussian Mixture Model` in Scikit-learn? (n_components?)

Question

My understanding of "an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters" is that the number of clusters is determined by the data as they converge to a certain amount of clusters. This R Implementation https://github.com/jacobian1980/ecostates de…

Accepted Answer

As mentioned by @maxymoo in the comments, `n_components` is a truncation parameter: it sets an upper bound on the number of clusters, not the number of clusters itself.

In the context of the Chinese Restaurant Process, which is related to the stick-breaking representation used in sklearn's DP-GMM, the n-th data point joins an existing cluster k with probability |k| / (n - 1 + alpha) and starts a new cluster with probability alpha / (n - 1 + alpha), where |k| is the number of points already in cluster k. The parameter alpha can be interpreted as the concentration parameter of the Dirichlet Process, and it influences the final number of clusters.

Unlike the R implementation, which uses Gibbs sampling, sklearn's DP-GMM implementation uses variational inference. This difference in inference method may account for the difference in results.

A gentle Dirichlet Process tutorial can be found here.
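To make the effect of alpha concrete, here is a minimal sketch that simulates the Chinese Restaurant Process probabilities given above, using only the Python standard library. It is not sklearn's implementation (which uses variational inference on a truncated stick-breaking representation); the function name `simulate_crp` and the chosen values of alpha and n are assumptions made purely for illustration.

```python
import random


def simulate_crp(n_points, alpha, seed=0):
    """Assign n_points one at a time under a CRP; return the cluster sizes.

    Hypothetical helper for illustration only, not part of sklearn.
    """
    rng = random.Random(seed)
    sizes = []  # sizes[k] = |k|, the number of points currently in cluster k
    for i in range(n_points):
        # i points are already assigned, so the normaliser is i + alpha,
        # i.e. n - 1 + alpha for the n-th point (n = i + 1).
        u = rng.random() * (i + alpha)
        cumulative = 0.0
        for k, size in enumerate(sizes):
            cumulative += size
            if u < cumulative:
                sizes[k] += 1  # join cluster k with probability |k| / (i + alpha)
                break
        else:
            sizes.append(1)  # new cluster with probability alpha / (i + alpha)
    return sizes


if __name__ == "__main__":
    # Average number of clusters over a few seeds for several alpha values.
    for alpha in (0.1, 1.0, 10.0):
        counts = [len(simulate_crp(500, alpha, seed=s)) for s in range(20)]
        print(f"alpha={alpha}: mean clusters = {sum(counts) / len(counts):.1f}")
```

Larger values of alpha should yield more clusters on average, and the count is not fixed in advance. In sklearn the analogous count can never exceed `n_components`, because that is where the model is truncated. In recent sklearn versions the corresponding model is, as far as I know, `BayesianGaussianMixture` with `weight_concentration_prior_type="dirichlet_process"`, where `weight_concentration_prior` plays the role of alpha.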

Answer