
# Unsupervised Classification - Clustering

A clustering algorithm groups the given samples, each represented as a vector in the $N$-dimensional feature space, into a set of clusters according to their spatial distribution in the $N$-D space. Clustering is an unsupervised classification, as no a priori knowledge (such as samples of known classes) is assumed to be available. A typical method is the K-means algorithm, which consists of the following steps:

• Step 1. Randomly choose $K$ initial cluster centers $\mathbf{m}_1^{(0)},\cdots,\mathbf{m}_K^{(0)}$ from the given sample set (e.g., the first $K$ samples of the sample set). Set the iteration index $l=0$;

• Step 2. Assign each of the samples $\mathbf{x}$ to one of the $K$ clusters according to the distance between the sample and the mean vectors of the clusters:

$$\mathbf{x}\in C_i^{(l)} \quad\text{if}\quad \|\mathbf{x}-\mathbf{m}_i^{(l)}\| \le \|\mathbf{x}-\mathbf{m}_j^{(l)}\|,\quad \forall\, j\ne i$$

where $C_i^{(l)}$ denotes the $i$th cluster of samples whose center is $\mathbf{m}_i^{(l)}$ at the $l$th iteration;

• Step 3. Update the cluster centers to get $\mathbf{m}_i^{(l+1)}$ so that the sum of the squared distances from all points in $C_i^{(l)}$ to the new center is minimized. To find this new center, we let

$$\frac{\partial}{\partial \mathbf{m}_i}\sum_{\mathbf{x}\in C_i^{(l)}} \|\mathbf{x}-\mathbf{m}_i\|^2 = -2\sum_{\mathbf{x}\in C_i^{(l)}}(\mathbf{x}-\mathbf{m}_i)=\mathbf{0}$$

Solving this we get

$$\mathbf{m}_i^{(l+1)}=\frac{1}{N_i^{(l)}}\sum_{\mathbf{x}\in C_i^{(l)}} \mathbf{x}$$

where $N_i^{(l)}$ is the number of samples currently in $C_i^{(l)}$ at the $l$th iteration;

• Step 4. Terminate if the algorithm has converged, i.e., the membership of each pattern is no longer changed:

$$C_i^{(l+1)}=C_i^{(l)},\quad i=1,\cdots,K$$

or a preset maximum number of iterations is exceeded.

Otherwise, set $l \leftarrow l+1$ and go to Step 2.
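The four steps above can be sketched in Python (a minimal illustration, not a production implementation; the convergence check on cluster membership follows the text):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal sketch of the K-means steps above.
    X: (n, N) array of n samples in the N-dimensional feature space."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K initial cluster centers from the sample set.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each sample to the cluster with the nearest mean.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        # Step 4: terminate when no membership changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its current members.
        for i in range(K):
            members = X[labels == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return centers, labels
```

On two well-separated groups of points, the algorithm converges to the group means within a few iterations regardless of which samples are picked in Step 1.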

This method is simple, but it has the main drawback that the number of clusters $K$ needs to be estimated based on some prior knowledge, and it stays fixed throughout the clustering process, even if it may turn out later that more or fewer clusters would fit the data better. One way to resolve this is to carry out the algorithm multiple times with different values of $K$, and then evaluate each result by some separability criterion.
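The multiple-$K$ strategy requires a numerical score for each clustering result. Below is a sketch using the total within-cluster squared error as that score; this is an illustrative choice, not necessarily the separability criterion intended in the text:

```python
import numpy as np

def within_cluster_sse(X, centers, labels):
    """Total within-cluster squared error: the sum, over all clusters,
    of squared distances from each member sample to its cluster center.
    Lower values indicate more compact clusters."""
    return sum(float(np.sum((X[labels == i] - c) ** 2))
               for i, c in enumerate(centers))

# Toy comparison: the same 4 points scored as one cluster vs. two.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
one = within_cluster_sse(X, np.array([[5.0, 0.5]]),
                         np.array([0, 0, 0, 0]))
two = within_cluster_sse(X, np.array([[0.0, 0.5], [10.0, 0.5]]),
                         np.array([0, 0, 1, 1]))
# The two-cluster partition fits this data far better (lower error).
```

Note that this particular score always decreases as $K$ grows, so in practice it is traded off against a penalty on $K$ or combined with a between-cluster separation term.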

## The ISODATA Algorithm

In the K-means method, the number of clusters remains the same throughout the iteration, although it may turn out later that more or fewer clusters would fit the data better. This drawback can be overcome in the ISODATA Algorithm (Iterative Self-Organizing Data Analysis Technique Algorithm), which allows the number of clusters to be adjusted automatically during the iteration by merging similar clusters and splitting clusters with large standard deviations. The algorithm is highly heuristic and based on the following pre-specified parameters:

• $K$: desired number of clusters;
• $N_{min}$: minimum number of samples in each cluster (for discarding clusters);
• $\sigma_{max}^2$: maximum variance (for splitting clusters);
• $D_{min}$: minimum pairwise distance (for merging clusters).

Here are the steps of the algorithm:

1. Randomly choose $k$ initial mean vectors $\mathbf{m}_1,\cdots,\mathbf{m}_k$ from the data set (initially $k=K$).
2. Assign each data point $\mathbf{x}$ to the cluster $C_i$ with the closest mean:

$$\mathbf{x}\in C_i \quad\text{if}\quad \|\mathbf{x}-\mathbf{m}_i\| \le \|\mathbf{x}-\mathbf{m}_j\|,\quad \forall\, j\ne i$$

3. Discard clusters containing too few members, i.e., if $N_i < N_{min}$, then discard $C_i$ and reassign its members to the remaining clusters. Set $k \leftarrow k-1$.
4. For each cluster $C_i$ ($i=1,\cdots,k$), update the mean vector

$$\mathbf{m}_i=\frac{1}{N_i}\sum_{\mathbf{x}\in C_i}\mathbf{x}$$

and the covariance matrix:

$$\mathbf{\Sigma}_i=\frac{1}{N_i}\sum_{\mathbf{x}\in C_i}(\mathbf{x}-\mathbf{m}_i)(\mathbf{x}-\mathbf{m}_i)^T$$

The diagonal elements $\sigma_{ij}^2$ ($j=1,\cdots,N$) are the variances along the $N$ dimensions.

5. If $k \le K/2$ (too few clusters), go to Step 6 for splitting;

else if $k \ge 2K$ (too many clusters), go to Step 7 for merging;

else go to Step 8.

6. (split) For each cluster $C_i$, find the greatest variance along the $N$ dimensions:

$$\sigma_{i,max}^2=\max_j\ \sigma_{ij}^2$$

If $\sigma_{i,max}^2 > \sigma_{max}^2$ and $N_i > 2N_{min}$, then split $C_i$ into two new clusters with centers

$$\mathbf{m}_i^{\pm}=\mathbf{m}_i \pm \delta\,\mathbf{e}_j$$

where $\mathbf{e}_j$ is the unit vector along the $j$th dimension of greatest variance and $\delta>0$ is a small offset (e.g., proportional to $\sigma_{i,max}$). Alternatively, carry out PCA to find the variance corresponding to the greatest eigenvalue and split the cluster along the direction of the corresponding eigenvector.

Set $k \leftarrow k+1$.

Go to Step 8.

7. (merge) Compute the pairwise Bhattacharyya distances between every two clusters, based on their means and covariances:

$$D_B(i,j)=\frac{1}{8}(\mathbf{m}_i-\mathbf{m}_j)^T\left[\frac{\mathbf{\Sigma}_i+\mathbf{\Sigma}_j}{2}\right]^{-1}(\mathbf{m}_i-\mathbf{m}_j)+\frac{1}{2}\ln\frac{\left|(\mathbf{\Sigma}_i+\mathbf{\Sigma}_j)/2\right|}{\sqrt{|\mathbf{\Sigma}_i|\,|\mathbf{\Sigma}_j|}}$$

For each of the distances satisfying $D_B(i,j) < D_{min}$, merge the corresponding clusters $C_i$ and $C_j$ to form a new one with center

$$\mathbf{m}=\frac{N_i\mathbf{m}_i+N_j\mathbf{m}_j}{N_i+N_j}$$

Delete $C_i$ and $C_j$, and set $k \leftarrow k-1$.

8. Terminate if the maximum number of iterations is reached; otherwise go to Step 2.
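The split and merge operations (Steps 6 and 7) can be sketched as stand-alone helpers. This is a minimal illustration under stated assumptions: the split offset `delta` and the size-weighted merge are common heuristic choices, not mandated by the text.

```python
import numpy as np

def bhattacharyya(m1, S1, m2, S2):
    """Bhattacharyya distance between two clusters modeled as Gaussians
    with means m1, m2 and covariance matrices S1, S2 (Step 7)."""
    S = (S1 + S2) / 2.0
    dm = (m1 - m2).reshape(-1, 1)
    term1 = (dm.T @ np.linalg.inv(S) @ dm).item() / 8.0
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

def split_cluster(m, S, delta=1.0):
    """Split a cluster along its highest-variance dimension (the simpler
    of the two options in Step 6); delta scales the offset from the
    old center in units of that dimension's standard deviation."""
    j = int(np.argmax(np.diag(S)))          # dimension of greatest variance
    offset = np.zeros_like(m, dtype=float)
    offset[j] = delta * np.sqrt(S[j, j])
    return m + offset, m - offset           # the two new centers

def merge_clusters(m1, n1, m2, n2):
    """Merge two clusters of sizes n1, n2 into one center, weighting
    each old center by its cluster size (Step 7)."""
    return (n1 * m1 + n2 * m2) / (n1 + n2)
```

For two identical clusters the Bhattacharyya distance is zero, and it grows as the means separate or the covariances differ, which is what makes it a usable merge criterion against the threshold $D_{min}$.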

As the number of clusters can be dynamically adjusted during the iteration, the ISODATA algorithm is more flexible than the K-means algorithm. However, all of the additional parameters listed above have to be chosen empirically.

Ruye Wang 2016-04-04