August Coding Dojo: Choosing the Optimal Number of Clusters
At the last coding dojo, the question we got was the following:
Is it possible to create a function that automatically determines the optimal number of clusters?
As usual, the answer with R is: there is a package for that.
Training data set
First, we generate some fake data:
Not too well separated, but not too messy. It is a simulation, not real life :)
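The exact simulation is not reproduced here, so the following is only a minimal sketch of how such fake data could be generated. The number of groups, their centres, the spread, the seed and the name `fake` are all assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Assumption: three 2-D Gaussian blobs of 50 points each.
n <- 50
centres <- rbind(c(0, 0), c(3, 3), c(0, 4))  # hypothetical group centres

fake <- do.call(rbind, lapply(seq_len(nrow(centres)), function(i) {
  cbind(x = rnorm(n, mean = centres[i, 1], sd = 1),
        y = rnorm(n, mean = centres[i, 2], sd = 1))
}))
fake <- as.data.frame(fake)

# Quick look at the simulated points
plot(fake, pch = 19, main = "Simulated data")
```

Moving the centres closer together or increasing `sd` makes the groups overlap more, which is a simple way to make the problem harder.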
Our main inspiration is this post on Stack Overflow:
The first way we were taught at school to determine a reasonable number of clusters was the elbow plot.
The concept is to plot, for each candidate number of clusters, the total within-cluster sum of squared distances between each point and the centroid of its cluster.
The curve looks like an elbow, and the classic rule is to take the number of clusters at which it begins to flatten. Beyond that point, each additional cluster is not really separated from the others.
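As an illustration, an elbow plot can be sketched with base R's `kmeans()`. The stand-in data, the range of 1 to 10 clusters and `nstart = 20` are assumptions chosen for the example, not necessarily what was used at the dojo.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Stand-in data (assumption): two Gaussian blobs in two dimensions
fake <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 4), ncol = 2))

# Total within-cluster sum of squares for each candidate number of clusters
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(fake, centers = k, nstart = 20)$tot.withinss
})

plot(k_values, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

The elbow is read by eye, which is exactly why an automatic rule is tempting.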
The function NbClust
The function NbClust tests a large set of methods to determine the optimal number of clusters.
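A call could look like the sketch below. The stand-in data, the Euclidean distance, the k-means method and the 2 to 10 range are assumptions for the example; with `index = "all"` the function also draws the graphical indices and prints a summary to the console.

```r
library(NbClust)  # install.packages("NbClust") if needed

set.seed(42)  # arbitrary seed, only for reproducibility

# Stand-in data (assumption): two Gaussian blobs in two dimensions
fake <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 4), ncol = 2))

res <- NbClust(data = fake, distance = "euclidean",
               min.nc = 2, max.nc = 10,
               method = "kmeans", index = "all")

# Best.nc has one column per index; its first row holds the
# number of clusters proposed by that index
votes <- res$Best.nc[1, ]
votes
```

The graphical indices do not propose a number by themselves and show up as 0 in that first row.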
The different methods used (minus the graphical ones) and the number of clusters picked by each:
Most common value (ignoring 0):
In the end, the median of all these methods is chosen. In this case, 2.
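The last two steps can be sketched in base R. The `votes` vector below is made up purely for illustration and is not the output obtained in this post; the variable names are hypothetical as well.

```r
# Made-up votes for illustration only, NOT the results from this post.
# A 0 stands for a graphical index that proposes no number of clusters.
votes <- c(KL = 3, CH = 3, Hartigan = 4, Silhouette = 3,
           Hubert = 0, Dindex = 0, Gap = 5)

kept <- votes[votes > 0]  # drop the graphical indices

# Most common value (the mode)
counts <- table(kept)
most_common <- as.integer(names(counts)[which.max(counts)])

# Median of the proposed numbers of clusters
median_k <- median(kept)

most_common
median_k
```

With an even number of votes the median can fall between two integers, so it may need rounding before being used as a number of clusters.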
There is another approach we didn’t have time to look at, but which seems promising:
The BHC package, which does Bayesian hierarchical clustering, could also give us some insight into the best number of clusters.