Reference: sklearn.cluster.AgglomerativeClustering
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Hierarchical Clustering Algorithm
Hierarchical clustering, also called hierarchical cluster analysis (HCA), is an unsupervised clustering algorithm that builds clusters with a predominant ordering from top to bottom.
Types (the two are the same idea run in opposite directions):
- Agglomerative Hierarchical Clustering (bottom-up: each point starts as its own cluster, and clusters are merged step by step)
- Divisive Hierarchical Clustering (top-down: all points start in one cluster, which is split recursively)
Linkage Methods (the distance between two clusters)
There are several ways to measure the distance between clusters in order to decide the rules for merging, and they are often called linkage methods. Some of the common ones are listed below, followed by a short SciPy comparison:
- Complete-linkage: the distance between two clusters is defined as the longest distance between two points in each cluster.
- Single-linkage: the distance between two clusters is defined as the shortest distance between two points in each cluster. This linkage can help flag outliers: points far from everything else stay in their own cluster and are only merged at the very end.
- Average-linkage: the distance between two clusters is defined as the average distance between each point in one cluster to every point in the other cluster.
- Centroid-linkage: finds the centroid of cluster 1 and centroid of cluster 2, and then calculates the distance between the two before merging.
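A minimal sketch comparing these linkage methods with SciPy; the data array here is a hypothetical toy dataset, purely for illustration:
import numpy as np
import scipy.cluster.hierarchy as sch
rng = np.random.default_rng(0)
data = rng.random((20, 2))  # hypothetical toy data: 20 points in 2-D
# Each call returns a linkage matrix whose rows describe the merges:
# [cluster_i, cluster_j, merge_distance, new_cluster_size]
for method in ('complete', 'single', 'average', 'centroid'):
    Z = sch.linkage(data, method=method)
    print(method, '-> final merge distance:', Z[-1, 2])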
What is a Dendrogram?
A Dendrogram is a type of tree diagram showing hierarchical relationships between different sets of data.
As noted above, a dendrogram preserves the memory of the hierarchical clustering algorithm, so just by looking at it you can tell how each cluster was formed and at what distance the merges happened.
The merging process stops once every point has been joined into a single cluster (in the figure, once everything sits inside the one big circle).
Forming the clusters (choosing the number of dissimilar clusters)
You cut the dendrogram with a horizontal line at a height where the line can travel the maximum vertical distance without intersecting a merge point.
For example, in the figure below, the line L3 can travel the maximum distance up and down without intersecting any merge point. So we draw the horizontal line there, and the number of vertical lines it intersects is the optimal number of clusters.
1. Mark all the vertical lines; the length of each one is the distance between the two clusters it merges.
2. Extend the horizontal merge lines across the plot.
3. Find the vertical lines that are not cut by any extended horizontal line; each marks a distinct cluster.
4. Among those, find the tallest line that remains uncut.
5. Cut through it horizontally; the number of clusters here is 3, since the cut intersects 3 vertical lines. A code sketch of this heuristic appears after the figure below.
[Figure: dendrogram, with the length of each vertical line showing the distance between the merged clusters]
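A rough sketch of this "tallest uncut line" heuristic, assuming the toy data array from the earlier snippet: the largest gap between consecutive merge heights in the linkage matrix marks the best place to cut.
import numpy as np
import scipy.cluster.hierarchy as sch
Z = sch.linkage(data, method='ward')   # linkage matrix: one row per merge
merge_heights = Z[:, 2]                # heights of the horizontal merge lines
gaps = np.diff(merge_heights)          # vertical span between consecutive merges
i = int(np.argmax(gaps))               # index of the tallest uncut stretch
# Cutting inside that gap leaves len(merge_heights) - i clusters
print('suggested number of clusters:', len(merge_heights) - i)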
Python implementation
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt
# Build the linkage matrix with Ward's method and draw the dendrogram
dendrogram = sch.dendrogram(sch.linkage(data, method='ward'))
plt.show()
Note: 'ward' is the default linkage in sklearn's AgglomerativeClustering, but SciPy's linkage defaults to 'single', so pass method='ward' explicitly here.
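If you want flat cluster labels straight from SciPy rather than sklearn, a minimal sketch using fcluster (assuming the same data array):
Z = sch.linkage(data, method='ward')
# criterion='maxclust' cuts the tree at whatever height yields t clusters
scipy_labels = sch.fcluster(Z, t=5, criterion='maxclust')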
Fit the model and predict the results
import numpy as np
from sklearn.cluster import AgglomerativeClustering
# Note: the affinity parameter was renamed to metric in scikit-learn 1.2
# (and removed in 1.4); Ward linkage only supports Euclidean distances.
agglomerative = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')
labels = agglomerative.fit_predict(data)
np.unique(labels)  # the distinct cluster labels, 0 through 4
Plot the results
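A minimal plotting sketch, assuming data is 2-D (for higher-dimensional data, project it first, e.g. with PCA):
import matplotlib.pyplot as plt
# Colour each point by its assigned cluster label
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='rainbow')
plt.title('Agglomerative clustering results')
plt.show()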
Predict New Data Points
AgglomerativeClustering has no predict method in sklearn; fit_predict only labels the data it was fitted on, so unseen points cannot be scored directly.
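One common workaround (an assumption here, not part of this estimator's API) is to train a simple classifier on the cluster labels and use it to assign new points, for example with nearest neighbours:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
# Hypothetical workaround: learn a point -> cluster-label mapping
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(data, labels)
new_points = np.array([[0.5, 0.5]])  # hypothetical new observation
print(knn.predict(new_points))       # cluster label(s) for the new point(s)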