Hi everyone! In this tutorial we are going to perform a clustering algorithm, step by step. K-Means is the best-known and probably the most used clustering algorithm, so we are going to implement it from scratch.
Do you know what clustering means?
If not, don’t worry. It’s not so complicated once you look into it. Basically, clustering algorithms aim to identify groups inside a given dataset. Technically, we call these groups clusters. Some people call them patterns, which is completely acceptable since you are trying to extract some knowledge from your data.
However, how can we connect our data in order to say: “Okay, group 1 is composed of points B, C and E, and group 2 is composed of points A and D”? In order to do that we have to use a proximity or distance metric! Let’s adopt the term distance metric.
The distance metric is responsible for calculating the distance between a given point of your dataset and each cluster. The point will belong to the closest cluster, according to the distance metric. Isn’t that easy?
Just to let you know, the best-known and most used distance metric is named Euclidean distance, and the maths behind it is really simple: imagine that the points of your dataset are composed of two dimensions, x and y, so we could consider points A and B like: A(1,1) and B(2,5).
The Euclidean distance between A and B is given by: the square root of (Xa − Xb)² + (Ya − Yb)².
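Using the points A(1,1) and B(2,5) from above, the formula can be checked in a few lines of Python (the function name `euclidean_distance` is just illustrative, not part of the tutorial’s final code):

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Points from the example above: A(1, 1) and B(2, 5)
A = (1, 1)
B = (2, 5)
print(euclidean_distance(A, B))  # sqrt((1-2)^2 + (1-5)^2) = sqrt(17) ≈ 4.123
```

Note that by iterating over the coordinate pairs, the same function also works for points with more than two dimensions.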
Also, you will realise that we need a method to define the central point of each cluster; this central point is also known as the cluster centroid. In order to define new cluster centroids after each iteration of K-Means, we are going to use a method known as average linkage.
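To make that concrete, here is a small sketch of the centroid update, assuming the new centroid is simply the coordinate-wise average of the cluster’s points (the helper name `centroid` and the sample cluster are my own, for illustration):

```python
def centroid(points):
    """New centroid: the coordinate-wise average of the cluster's points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# A hypothetical cluster of three two-dimensional points
cluster = [(1, 1), (2, 5), (3, 3)]
print(centroid(cluster))  # (2.0, 3.0)
```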
Thus far we know the components that compose a clustering method: algorithm, distance metric and linkage method. So, keep in mind that you can perform clustering using different combinations of algorithms, distance metrics and linkage methods!
It’s not too late to say that clustering methods belong to the field of unsupervised learning (Google it), and clustering algorithms don’t provide us with the meaning of each cluster; I mean, they just find the groups.
Furthermore, each attribute added to perform clustering is a dimension (an attribute is also known as a variable; for example, in the previous example points A and B have two attributes: X and Y). So A(x,y) has two dimensions and A(x,y,z) has three dimensions. The more dimensions you have, the more complicated the algorithm becomes. In this tutorial we are going to use a two-dimensional dataset.
Okay, now everybody has the necessary weapons, so let’s go ahead!
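Before we dive into the step-by-step build, here is a minimal sketch of how the three components fit together: assignment with Euclidean distance, then centroid update by averaging. The function names, the choice of k=2, the fixed iteration count and the toy dataset are all illustrative assumptions, not the tutorial’s final code:

```python
import math
import random

def euclidean(a, b):
    # Distance metric: Euclidean distance between two points.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def kmeans(points, k, iterations=100, seed=0):
    # Initialize centroids by sampling k distinct points from the dataset.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the average of its cluster's points.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

# A hypothetical two-dimensional dataset with two visible groups
data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
```

This sketch runs a fixed number of iterations for simplicity; a more careful version would stop once the centroids no longer move between iterations.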