Details, Explanation and Meaning About Data clustering

Data clustering Guide, Meaning , Facts, Information and Description

Data clustering is a common technique for data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering consists of partitioning a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often similarity or proximity for some defined distance measure.

Data clustering algorithms can be hierarchical or partitional, and hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down).

Table of contents
1 Hierarchical clustering
2 K-means and derivatives
3 Applications
4 References
5 See also

Hierarchical clustering

Introduction

Hierarchical clustering builds a hierarchy of clusters from individual elements. The traditional representation of this hierarchy is a tree, with individual elements at one end and a single cluster with every element at the other.

Cutting the tree at a given height will give a clustering at a selected precision. In the above example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, but with fewer clusters.

Agglomerative hierarchical clustering

This method builds the hierarchy from the individual elements by progressively merging clusters. Again, we have six elements {a} {b} {c} {d} {e} and {f}. the first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, therefore we must define a distance between elements.

Suppose we have merged the two closest elements {b} and {c}, we now have the following clusters {a} {b c} {d} {e} and {f}, and want to merge them further. But to do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters and is one of the following:

Each of the agglomeration occurs at a certain distance between clusters and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).

K-means and derivatives

k-means clustering

The k-means algorithm assigns each point to the cluster which center (or centroid) is nearest, the centroid being defined as the point which is the average of all the points in the cluster.

As one can see, this is not a complete definition, because the .clusters depend on the centroid and the centroid depend on the cluster, so the following algorithm is used:

The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Yet it does not systematically yield the same result with each run of the algorithm. Rather, the resulting clusters depend on the initial assignments. The k-means algorithm maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the solution given is not a local minimum of variance.

fuzzy c-means clustering

One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say that each point is attributed to one and only one cluster. But points on the edge of the cluster, or near another cluster may not be as much in the cluster as point in the center of cluster.

Therefore, in fuzzy clustering, each point does not pertain to a given cluster, but has a degree of belonging to a certain cluster, as in fuzzy logic. For each point x we have a coefficient giving the degree of being in the k-th cluster . Usually, the sum of those coefficients has to be one, so that denotes a probability of belonging to a certain cluster.

With fuzzy c-means, the centroid of a cluster is computed as being the mean of all points, weighted by their degree of belonging to the cluster, that is .

The degree of being in a certain cluster is the inverse of the distance to the cluster , then the coefficients are normalized so that their sum is 1.

The fuzzy c-means algorithm is greatly similar to the k-means algorithm:

  • Choose a number of clusters
  • Assign randomly to each point coefficients for being in the clusters
  • Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than , the given sensitivity threshold) :
    • Compute the centroid for each cluster, using the formula above
    • For each point, compute its coefficients of being in the clusters, using the formula above

The fuzzy c-means algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is local minimum, and the results depend on the initial choice of weights.

Applications

In biology has two main applications in the fields of computational biology and bioinformatics.

References

See also

k-means,
ANN


This is an Article on Data clustering. Page Contains Information, Facts Details or Explanation Guide About Data clustering


Google
 
Web www.E-paranoids.com

Search Anything