Clustering with Python — HDBSCAN
A robust density-based clustering method that combines DBSCAN with hierarchical clustering
https://hdbscan.readthedocs.io/en/latest/index.html
HDBSCAN Intuition
The simple way to do this is to define a new distance metric between points, which we will call the mutual reachability distance. We define mutual reachability distance as follows:
For this we need three things:
the core distance of point A, the core distance of point B, and the distance between A and B
1- Choose k, the number of neighbours (say k = 6, or whatever you want to start with), take a point A, and compute its core distance: the distance from A to its k-th nearest neighbour
2- Do the same for another point B; a point in a sparser region has a larger circle, i.e. a larger core distance
Progressively we compute the core distance of every point
3- Combine these into the mutual reachability distance between A and B:
mutual_reachability(A, B) = max(core_distance(A), core_distance(B), distance(A, B))
In other words, if the plain distance between A and B is smaller than either core distance, the larger of the two core distances is used instead; if it is larger than both core distances, the plain distance itself is kept
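A minimal NumPy sketch of this definition (assuming Euclidean distance; the function name and toy points are invented for illustration, this is not the library's internal implementation):

```python
import numpy as np

def mutual_reachability(X, k=2):
    # pairwise Euclidean distances between all points
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # core distance of each point: distance to its k-th nearest neighbour
    # (column 0 of each sorted row is the distance to the point itself)
    core = np.sort(d, axis=1)[:, k]
    # mutual reachability distance: max(core_A, core_B, distance(A, B))
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(mutual_reachability(X, k=2))
```

Note how the close pair (0, 0) and (1, 0) ends up with a mutual reachability distance larger than their plain distance of 1, because the core distance of (1, 0) dominates.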
Using the Mutual Reachability Distance we build the minimum spanning tree
Now consider a threshold value, starting high, and steadily being lowered. Drop any edges with weight above that threshold. As we drop edges we will start to disconnect the graph into connected components. Eventually we will have a hierarchy of connected components (from completely connected to completely disconnected) at varying threshold levels.
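The MST-and-threshold step can be sketched with SciPy's graph routines; the small symmetric matrix below is a made-up stand-in for real mutual reachability distances:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# toy mutual-reachability matrix: points 0 and 1 are close,
# points 2 and 3 are close, and the two pairs are far apart
M = np.array([[0.0,  1.0, 10.0, 10.0],
              [1.0,  0.0, 10.0, 10.0],
              [10.0, 10.0, 0.0,  1.0],
              [10.0, 10.0, 1.0,  0.0]])

mst = minimum_spanning_tree(M).toarray()   # keeps 3 edges for 4 points
pruned = np.where(mst > 5.0, 0.0, mst)     # drop edges above threshold 5
n, labels = connected_components(pruned, directed=False)
print(n)  # 2 components: {0, 1} and {2, 3}
```

Lowering the threshold further (below 1.0 here) would disconnect everything, which is exactly the hierarchy HDBSCAN walks through.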
Build the cluster hierarchy by defining a minimum cluster size
Any cluster that falls below this size is treated as noise, and its points drop out of the hierarchy
Condense the tree and extract the clusters
The circled regions in the condensed tree plot are the selected clusters. To make this concrete we need a notion of minimum cluster size, which we take as a parameter to HDBSCAN.
More Reading : https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html
Python Implementation
Install hdbscan
!pip install hdbscan
import numpy as np
import hdbscan
# name the model something other than "hdbscan", so it does not shadow the module
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=5)
labels = clusterer.fit_predict(data)
clusterer.condensed_tree_.plot(select_clusters=True)
np.unique(labels)
Note the horizontal bars in the condensed tree plot: here six clusters are selected
min_cluster_size = a first guess, made by looking at the scatter plot
min_samples = the number of neighbours k used for the core distance
clusterer.condensed_tree_.plot(select_clusters=True) shows the condensed tree with the selected clusters circled
In the returned label array, -1 marks outliers (noise); the remaining values are the cluster ids
clusterer.single_linkage_tree_.plot()
this plots the full single-linkage dendrogram; avoid it on large datasets, as it can take a very long time to compute and render
Fine-tune the model:
To get a reasonable number of labels, increase or decrease min_cluster_size and min_samples
e.g. the number of unique labels corresponds to the number of clusters found (plus -1 for noise)
Plot the results
This clearly does not look correct, so we need fewer clusters
Predict New Data Points with HDBSCAN
1- When creating the clusterer, add the parameter prediction_data=True
clusterer = hdbscan.HDBSCAN(min_cluster_size=12, min_samples=3, prediction_data=True)
2- Use approximate_predict(clusterer, [[34, 90]])
approximate_predict(clusterer, new_points) returns the predicted cluster labels and the membership strengths
Read more : https://hdbscan.readthedocs.io/en/latest/prediction_tutorial.html
Considerations: the data used originally does not have any clear clusters, and HDBSCAN accurately detected that. Consider a dataset which does have clusters, so we can see clearer results: