I need to cluster groups of points with the same linear relationship, as per the code and figure below.
import numpy as np import matplotlib.pyplot as plt x = np.linspace(0, 30, 100) y1 = 3*x -50 + 20*np.random.random(size=len(x)) y2 = 3*x -20 + 20*np.random.random(size=len(x)) y3 = 3*x +10 + 20*np.random.random(size=len(x)) plt.plot(x, y1, 'o') plt.plot(x, y2, 'o') plt.plot(x, y3, 'o') plt.xlim([-50,125]) plt.ylim([-50,125])
Obviously, I wouldn’t have the points that way; I would just have the following x and y.
x_final = np.concatenate((x,x,x)) y_final = np.concatenate((y1,y2,y3)) plt.plot(x_final, y_final, 'o') plt.xlim([-50,125]) plt.ylim([-50,125])
Note the following: the points respect linear relationships with high slope, they present a slight separation from each other, and they all have the same slope, just with different intercepts.
How would you suggest I cluster these points? I thought about using PCA and clustering the main components with k-means, but I don’t know if there would be a more efficient way. In my real case I have more than three clusters and they have different distances from each other, even though they all have the same slope.
Advertisement
Answer
Take a look at all clustering algorithms scikit-learn offers :
Spectral clustering and Gaussian mixture should work for your use case.