Clustering with Python — HDBSCAN

An effective clustering method that combines DBSCAN with hierarchical clustering

Anakin
5 min read · Nov 7, 2020

https://hdbscan.readthedocs.io/en/latest/index.html

HDBSCAN Intuition

HDBSCAN starts by transforming the space: it defines a new distance metric between points, which we will call the mutual reachability distance. We define mutual reachability distance as follows:

For this we need 3 things:

the core distance of point A, the core distance of point B, and the distance between A and B

1 — Pick a point A and look at its k nearest neighbours (say k = 6, or whatever you want to start with); the core distance of A is the distance from A to its k-th nearest neighbour.

2 — Do the same for another point B; its core distance may be larger (a bigger circle around it is needed to reach its k-th neighbour).

Progressively we compute the core distance of every point this way.

3 — Combine these into the mutual reachability distance between A and B:

mutual_reachability(A, B) = max( core(A), core(B), distance(A, B) )

If the plain distance between A and B is smaller than the core distance of A or of B, the mutual reachability distance is pushed up to the larger of the two core distances. Conversely, if the distance between A and B is greater than both core distances, the mutual reachability distance is simply the distance between A and B.
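The definition above can be sketched in a few lines of NumPy. The toy points, the value of k, and the variable names below are illustrative assumptions, not part of HDBSCAN's API:

```python
import numpy as np
from scipy.spatial.distance import cdist

# toy 2-D points (illustrative)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
k = 2  # neighbourhood size used for the core distance

D = cdist(X, X)                   # pairwise Euclidean distances
core = np.sort(D, axis=1)[:, k]   # distance to each point's k-th nearest neighbour
# mutual reachability: max(core(A), core(B), d(A, B)) for every pair
mreach = np.maximum(D, np.maximum(core[:, None], core[None, :]))
```

Note that `mreach` is symmetric and never smaller than the plain distance matrix, which matches the definition.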

Using the mutual reachability distance, we build the minimum spanning tree over the points.

Now consider a threshold value, starting high, and steadily being lowered. Drop any edges with weight above that threshold. As we drop edges we will start to disconnect the graph into connected components. Eventually we will have a hierarchy of connected components (from completely connected to completely disconnected) at varying threshold levels.
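The edge-dropping idea above can be sketched with SciPy. The two blobs of points, the value of k, and the threshold value are assumptions for illustration, not part of the library's internals:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# two illustrative blobs of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])

# mutual reachability distances, as in the previous step
k = 3
D = cdist(X, X)
core = np.sort(D, axis=1)[:, k]
mreach = np.maximum(D, np.maximum(core[:, None], core[None, :]))

# minimum spanning tree over the mutual reachability graph
mst = minimum_spanning_tree(mreach).toarray()

# drop every edge heavier than a threshold, then count connected components
threshold = 1.0
pruned = csr_matrix(np.where((mst > 0) & (mst <= threshold), mst, 0.0))
n_components, comp_labels = connected_components(pruned, directed=False)
```

Lowering `threshold` removes more edges and splits the graph into more components; sweeping it over all edge weights produces exactly the hierarchy described above.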

Build the cluster hierarchy, using a minimum cluster size (min_cluster_size) to decide which splits count as real clusters

Any group that falls below the minimum cluster size simply drops out of its parent cluster as noise, rather than forming a new cluster

Condense the clusters and extract the points

The circled regions of the condensed tree are the selected clusters. To make this concrete, we need a notion of minimum cluster size, which we take as a parameter to HDBSCAN.

More Reading : https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html

Python Implementation


Install hdbscan

!pip install hdbscan

import hdbscan
import numpy as np

# name the model `clusterer` so it does not shadow the hdbscan module
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=5)
labels = clusterer.fit_predict(data)  # `data` is your (n_samples, n_features) array
clusterer.condensed_tree_.plot(select_clusters=True)
np.unique(labels)

Note the horizontal bars in the condensed tree plot; here, 6 clusters are selected.

min_cluster_size is guessed by looking at the scatter plot: it is the smallest grouping you are willing to call a cluster

min_samples is the number of points in the k-neighbourhood used to compute the core distance

condensed_tree_.plot(select_clusters=True) plots the condensed tree and circles the clusters that HDBSCAN selects

In the returned label array, a label of -1 marks an outlier (noise point); the remaining non-negative labels are the clusters.

single_linkage_tree_.plot()

This plots the full single-linkage dendrogram. Avoid computing it on large datasets: it takes a very long time and can lock up your machine.

Fine tune the model:

To get a reasonable number of labels, increase or decrease min_cluster_size and min_samples until the clustering looks sensible.

e.g. the number of unique non-negative labels is the number of clusters found

Plot the results

This clearly does not look correct, so we need fewer clusters.

Predict New Data Points with HDBSCAN

1 — In the clusterer, add the parameter prediction_data=True

clusterer = hdbscan.HDBSCAN(min_cluster_size=12, min_samples=3, prediction_data=True)

2 — Use hdbscan.approximate_predict(clusterer, [[34, 90]])

>>> labels, strengths = hdbscan.approximate_predict(clusterer, new_points)

Read more : https://hdbscan.readthedocs.io/en/latest/prediction_tutorial.html

Considerations: the data used originally does not have any clear clusters, and HDBSCAN accurately detected that. Consider a dataset that does have clusters, so we can see clearer results:

Comparisons between different algorithms
