How to properly cluster a 1D dataset with HDBSCAN?

My dataset below shows product sales per price (link to download dataset csv):

JavaScript

What I want to achieve is clustering the dense regions (rectangles below) using HDBSCAN and sklearn. There are four regions, but regions 3 and 4 could also be grouped into one larger region, which would leave only 3 regions in the entire dataset, depending on the min_cluster_size and min_samples parameters passed in the function call.

And here is my code:

JavaScript

JavaScript

The problem is the result: the clustering did not work as expected (compare the picture above with the one below). It clustered the amplitudes, not the dense regions that the algorithm is supposed to find. What am I missing in the code?

I’ve tried the following things: normalizing the data (both axes) and swapping the axes before calling the HDBSCAN class. Any help would be appreciated. I’m kind of lost in this code, but from reading the documentation I thought it would be straightforward for this particular problem, as HDBSCAN deals well with density and noise.

Answer

The way you’ve implemented this, you are actually trying to cluster 2-D data: each (price, quantity) row is treated as one point in the plane. This makes more sense when you visualize the result of your clustering as a scatter plot:

scatter_clustering

In order to cluster the 1-D data as I believe you’re intending, you could reshape the data. Essentially, you want a single list of prices where each price value is repeated in the list quantity times. This is pretty straightforward with numpy:
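A minimal sketch of that reshaping step, assuming the data is available as two parallel arrays. The names `prices` and `quantities` and the sample values are made up for illustration and are not taken from the original dataset:

```python
import numpy as np

# Hypothetical sample: one entry per price, with the units sold at that price.
prices = np.array([10.0, 10.5, 11.0, 20.0])
quantities = np.array([3, 0, 2, 1])

# Repeat each price `quantity` times, so prices that sold more
# contribute more points and therefore form denser regions.
samples = np.repeat(prices, quantities)
print(samples)
```

Prices with a quantity of zero simply disappear from the expanded list, which is what lets empty price ranges act as gaps between dense regions.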

JavaScript

which gives

JavaScript

Then you can cluster on this numpy array directly, but you need to significantly increase min_cluster_size and min_samples because you have way more values to cluster now:

JavaScript

Finally, we can combine the cluster labels, pick the label that occurs most frequently*** for each price, and group by price:
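One possible way to do this with pandas is sketched below. The column names `price` and `label` and the toy values are assumptions made for the example:

```python
import pandas as pd

# Hypothetical input: one row per repeated sample with the label it received.
per_sample = pd.DataFrame({
    "price": [10.0, 10.0, 10.0, 11.0, 11.0, 20.0],
    "label": [0, 0, 1, 0, 0, -1],
})

# For each price, keep the label that occurs most often among its samples.
# Series.mode() returns sorted values, so ties resolve to the smallest label.
label_per_price = (
    per_sample.groupby("price")["label"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index()
)
print(label_per_price)
```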

JavaScript

To verify that we got what we expected, let’s plot:
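A minimal plotting sketch, assuming matplotlib is installed and that one label per price is already available. The arrays below are invented for illustration and the output file name is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical per-price data with one cluster label each (-1 = noise).
prices = np.array([10.0, 10.5, 11.0, 20.0, 20.5, 35.0])
quantities = np.array([30, 45, 25, 50, 40, 2])
labels = np.array([0, 0, 0, 1, 1, -1])

# One color per cluster; noise is drawn in light gray.
cmap = plt.get_cmap("tab10")
noise_color = (0.8, 0.8, 0.8, 1.0)
colors = [noise_color if lab == -1 else cmap(int(lab) % 10) for lab in labels]

fig, ax = plt.subplots()
ax.bar(prices, quantities, width=0.4, color=colors)
ax.set_xlabel("price")
ax.set_ylabel("quantity")
fig.savefig("clusters.png")
```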

JavaScript

Looks like the clusters generated by HDBSCAN with otherwise default parameters are largely similar to what you expected, though I’m sure you could tweak these a bit if you need fewer clusters for your final application.

*** Using the ‘mode’ or the most commonly occurring cluster label may be a bit lazy on my part. You could also consider taking a mean and rounding, or finding the lowest and highest price with each label and using those as cluster endpoints, or something else entirely!
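The endpoint alternative mentioned in this note could look roughly like the following sketch, again assuming a pandas DataFrame with hypothetical `price` and `label` columns:

```python
import pandas as pd

# Hypothetical per-price labels (-1 = noise).
per_price = pd.DataFrame({
    "price": [10.0, 10.5, 11.0, 20.0, 20.5, 35.0],
    "label": [0, 0, 0, 1, 1, -1],
})

# Drop noise, then take the lowest and highest price carrying each label.
endpoints = (
    per_price[per_price["label"] != -1]
    .groupby("label")["price"]
    .agg(["min", "max"])
)
print(endpoints)
```

Every price between a label's `min` and `max` can then be treated as part of that region, which gives contiguous ranges even if a few prices in between were labeled differently.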


Full code to copy-paste for those wishing to replicate:

JavaScript