I have the dataset that generates the following code.
X_moons, y_moons = datasets.make_moons(n_samples=1000, noise=.07, random_state=42)
The case is that I would like to make a dendrogram (bottom-up) in Python and I must select a linkage criterion. If you consult the documentation of the function you can see the existing methods. https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
Any suggestions on how I can move forward? Is there a foolproof way to determine the best linkage?
I have tested the cophenetic distance for my dataset with each of the methods.
Advertisement
Answer
There is no direct way to know which linkage is best. However, by looking at spread of data we can best guess. For your case, single linkage will produce best result.
- Single linkage works best if cluster is in form of a chain. Complete linkage is more appropriate for data with globules/spherical clusters.
- If your data has categorical variables, then average/centroid/ward may not work properly. Single/Complete linkage is better for data with categorical variables.
from sklearn.cluster import AgglomerativeClustering fig, ax = plt.subplots(1,4,figsize=(20,5)) link =['single','complete','average','ward'] for i in range(4): model = AgglomerativeClustering(n_clusters=2, linkage=link[i]) labels = model.fit_predict(X_moons) ax[i].scatter(X_moons[:,0],X_moons[:,1], c=labels) ax[i].set_title(link[i]) fig.show()
Further Reading: https://www.youtube.com/watch?v=VMyXc3SiEqs