Unsupervised classification (also called clustering) is the problem of classifying a dataset without any labels for training. An unsupervised learning algorithm is expected to find regularities in the data, such as similarities between feature vectors. The goal is to group the data into clusters such that the members of a cluster are as similar as possible to one another and as different as possible from the members of other clusters.
The following video shows an application of clustering to unsupervised sorting of objects by a robotics system.
The principle of the $K$-means clustering algorithm is the following. Starting from a given dataset, the $N$-dimensional feature space is partitioned by centroids, which are simply points in this space. A data point belongs to the class represented by the centroid closest to it according to the chosen norm (usually the Euclidean norm). Hence, there must be as many centroids as desired classes. The successive steps executed by $K$-means are the following.
Centroid initialization: The $K$-means algorithm works iteratively. The first step consists in initializing the centroids. If we want to classify the data into $K$ classes, then $K$ points (with the same dimension as the feature vectors) must be introduced. There are different methods for positioning these points; one of them is to choose $K$ random points from the dataset and use them as the initial centroids.
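The random initialization described above could be sketched as follows. This is only an illustration under stated assumptions: it uses NumPy, the function name `init_centroids`, the `seed` parameter and the toy array `X` are hypothetical, and the points are drawn without replacement so that the initial centroids are distinct rows of the dataset.

```python
import numpy as np

def init_centroids(X, k, seed=0):
    """Pick k distinct rows of X at random as the initial centroids."""
    rng = np.random.default_rng(seed)
    # Sampling indices without replacement avoids picking the same point twice.
    idx = rng.choice(X.shape[0], size=k, replace=False)
    return X[idx].copy()

# Toy dataset (made up for illustration): 4 points in 2 dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids = init_centroids(X, k=2)
print(centroids.shape)  # (2, 2)
```

Other initialization schemes exist; this sketch only covers the one mentioned in the text.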
Class allocation: Once the centroids are defined, the next step is to label each point according to the current centroids. For each point, we compute its distance to each centroid and label it with the class of the closest one.
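As an illustration of this step, here is a minimal NumPy sketch, assuming the Euclidean norm. The function name `assign_labels` and the toy arrays are hypothetical; class $j$ is simply identified with the row index $j$ of the centroid array.

```python
import numpy as np

def assign_labels(X, centroids):
    """Return, for each row of X, the index of its closest centroid."""
    # Broadcasting gives an array of shape (n_points, k, n_features).
    diffs = X[:, None, :] - centroids[None, :, :]
    # Euclidean distance from every point to every centroid: shape (n_points, k).
    dists = np.linalg.norm(diffs, axis=2)
    return np.argmin(dists, axis=1)

# Toy data (made up for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_labels(X, centroids)
print(labels)  # [0 0 1 1]
```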
Centroid update: Once every point in the dataset has a label, the centroids are updated. For each of the current classes, the mean vector of its points is computed and the class centroid is set to this mean vector.
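The update step might look like the sketch below. The function name `update_centroids` and the toy arrays are hypothetical. The text does not say what to do when a class receives no points, so this sketch assumes that such a class simply keeps its previous centroid.

```python
import numpy as np

def update_centroids(X, labels, centroids):
    """Move each centroid to the mean of the points currently labelled with it."""
    new_centroids = centroids.copy()
    for j in range(centroids.shape[0]):
        members = X[labels == j]
        # Assumption: an empty class keeps its old centroid.
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)
    return new_centroids

# Toy data (made up for illustration).
X = np.array([[0.0, 0.0], [0.2, 0.2], [5.0, 5.0], [5.2, 4.8]])
labels = np.array([0, 0, 1, 1])
old_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
new_centroids = update_centroids(X, labels, old_centroids)
print(new_centroids)  # approximately [[0.1 0.1] [5.1 4.9]]
```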
The class allocation and centroid update steps are repeated in turn until neither the labels nor the centroids change anymore. Once the process is over, the data are classified. The following figures show the different steps of $K$-means clustering on a two-dimensional dataset with two classes.
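Putting the three steps together, the whole procedure could be sketched as below. This is one possible arrangement, not a reference implementation: the function name `kmeans`, the `max_iter` safety cap, the `seed` parameter and the synthetic two-blob dataset are assumptions made for the example. The loop stops as soon as the labels no longer change, since the centroids, being means of the labelled points, then stay fixed as well.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k classes; return (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialization: k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):  # max_iter is a safety cap (assumption)
        # Class allocation: closest centroid in Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when the allocation is identical to the previous one.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Centroid update: mean of the points in each class.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: empty classes keep their centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Synthetic data (made up): two well-separated groups of 20 points each.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(0.0, 0.3, size=(20, 2)),
               data_rng.normal(5.0, 0.3, size=(20, 2))])
labels, centroids = kmeans(X, k=2)
print(labels)
```

On data this well separated, the first 20 points should end up in one class and the last 20 in the other, although which class gets index 0 depends on the random initialization.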
Exercise ($K$-means clustering implementation)
Implement the $K$-means clustering algorithm on the Iris dataset without using the labels (the labels can be used afterwards to check the clustering results).
Do the same thing with the "Wine" dataset, which can be found at this link
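When checking clustering results against the true labels, keep in mind that the class indices produced by $K$-means are arbitrary: cluster 0 need not correspond to label 0. One possible way to compare them, sketched below, is to try every mapping from cluster indices to labels and keep the best agreement. The function name `best_match_accuracy` and the toy arrays are hypothetical, and the brute-force search over permutations is only reasonable for a small number of classes.

```python
from itertools import permutations

import numpy as np

def best_match_accuracy(true_labels, cluster_ids, k):
    """Fraction of points that agree under the best cluster-to-label mapping."""
    best = 0.0
    for perm in permutations(range(k)):
        # perm[c] is the label assigned to cluster c under this mapping.
        mapped = np.array(perm)[cluster_ids]
        best = max(best, float(np.mean(mapped == true_labels)))
    return best

# Toy example (made up): 6 points, 3 classes, one point in the wrong cluster.
true_labels = np.array([0, 0, 1, 1, 2, 2])
cluster_ids = np.array([2, 2, 0, 0, 1, 0])
accuracy = best_match_accuracy(true_labels, cluster_ids, k=3)
print(accuracy)  # 5 of the 6 points agree under the best mapping
```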