Cluster Analysis

Cluster analysis is a collective term for a broad variety of statistical methods and procedures that share the purpose of reducing the complexity of large collections of elements by arranging them into groups. Clustering procedures try to identify homogeneous groups (clusters) of entities within an empirical data set by comparing their attributes. The elements being clustered are predominantly the objects of an observation (cases, arranged in the rows of a data matrix), but nothing prevents clustering the variables (the columns of a data matrix) in the same way. In this respect cluster analysis is closely related to methods of factor analysis.

The groups to be identified consist of elements that are similar to one another, while the resulting groups ought to be as different as possible. Thus the main goal of all measures and algorithms employed in cluster analysis is to optimize this equilibrium of homogeneity (of the elements within groups) and heterogeneity (of all identified groups).

Applications Of Cluster Analysis

Early developments of methods for classification or cluster analysis can be found from the late 1930s on. The techniques and algorithms used today (and included in popular statistical software packages like SAS and SPSS) were developed mainly in the 1960s and 1970s, with the growing power of electronic data processing, which is essential for cluster analysis. The output of literature on classification and clustering is still much higher than in other fields of quantitative data analysis.

The main fields of application for clustering techniques, early on and still today, have been biology, psychology, marketing, and medical science (Mirkin 1996, 4–17; Mirkin 2005, xi–xxii). In media and communication studies the number of publications referring to this method is relatively moderate (if the much more frequently used factor analysis is not regarded as a variation of a clustering technique). Nevertheless, cluster analysis can be very helpful in exploratory studies as well as in hypothesis testing in the field of communication studies: it can contribute substantially to the identification and description of types of media users, print publications, and patterns of media use.

Elements Of A Cluster Analysis

A cluster analysis usually comprises several steps (Milligan 1996): (1) the entities to be clustered and the variables by which the clusters will be defined have to be selected; (2) the similarity of the objects of interest according to the selected variables has to be determined by measures of proximity; (3) the objects have to be sorted into groups on the basis of these measures and by using specific clustering methods; and (4) a decision has to be made as to the appropriate number of clusters, followed by description and interpretation of the final cluster solution.

The first step is usually predetermined by the available data set. The entities for clustering in the field of communication research are primarily persons (survey data), media products, statements, and images (content analysis), or even complex aggregates like states or communities. There are no special requirements for the entities, but some clustering methods cope better with outliers than others. One solution is to eliminate outliers from the data set; alternatively, the literature offers specific clustering methods that are less affected by extreme cases.

Much more crucial, though seldom discussed in applied research, is the selection of the variables for clustering. There is no upper limit for the number of variables that can be included in a cluster analysis. Measures and methods exist for binary data as well as for metric scales, and no normal distribution is required. Nevertheless, the outcome of a cluster analysis may be severely affected by mistakes in choosing variables. The selection should be made on the basis of theoretical assumptions, because including just one or two “noise variables” might have a severe influence on the resulting solution (Milligan 1996, 361–365). It should also be recognized that a large number of variables of one kind has a weighting effect on cluster formation (if, for example, TV use is measured by 10 variables and newspaper use by only two, the resulting clusters will be strongly determined by TV use); different scale ranges and correlations among the variables have similar effects. These problems can be addressed by standardization or weighting procedures; however, a researcher might also deliberately impose an imbalance of variables of different kinds as a way of weighting.
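The standardization mentioned above can be illustrated with a minimal sketch. It assumes z-standardization (subtracting each variable's mean and dividing by its standard deviation), which is only one of several possible procedures; the function name and the example data are hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-standardize every variable (column) of a cases-by-variables matrix."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    if any(sd == 0 for sd in sds):
        raise ValueError("a constant variable cannot be standardized")
    return [
        [(value - m) / sd for value, m, sd in zip(row, means, sds)]
        for row in rows
    ]


# Hypothetical cases: minutes of TV per day, newspaper days per week.
# Unstandardized, the TV variable would dominate any distance because of
# its much larger scale range.
media_use = [[120, 1], [60, 5], [180, 3]]
for row in standardize_columns(media_use):
    print([round(value, 2) for value in row])
```

After this transformation every variable has mean 0 and standard deviation 1, so each contributes on a comparable scale to the proximity measures discussed next.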

The second step concerns the selection of an appropriate measure of proximity. Proximity measures determine the similarity or distance between two objects on one or more variables, and many different measures are available to deal with all sorts of scale types. With binary data (e.g., persons are newspaper readers or not, have online access or not), there are four possible relations between two objects on each attribute: both carry the attribute (a); only the first object carries it (b); only the second object carries it (c); or neither of them does (d). To compute the simple matching coefficient (SMC), the number of matches (a + d) is related to the total number of comparisons:

$$\mathrm{SMC} = \frac{a + d}{a + b + c + d} \tag{1.1}$$
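As a minimal sketch, the simple matching coefficient, SMC = (a + d) / (a + b + c + d), can be computed for two binary profiles as follows. The function name and the example respondents are hypothetical and serve only as illustration.

```python
def simple_matching_coefficient(x, y):
    """Share of attributes on which two binary profiles agree."""
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    if any(value not in (0, 1) for value in list(x) + list(y)):
        raise ValueError("profiles must contain only 0 and 1")
    pairs = list(zip(x, y))
    a = sum(1 for xi, yi in pairs if xi == 1 and yi == 1)  # both carry it
    b = sum(1 for xi, yi in pairs if xi == 1 and yi == 0)  # only the first
    c = sum(1 for xi, yi in pairs if xi == 0 and yi == 1)  # only the second
    d = sum(1 for xi, yi in pairs if xi == 0 and yi == 0)  # neither
    return (a + d) / (a + b + c + d)


# Hypothetical attributes: newspaper reader, online access, TV owner, radio listener
person_1 = [1, 1, 0, 0]
person_2 = [1, 0, 0, 1]
print(simple_matching_coefficient(person_1, person_2))  # 0.5
```

Here the two persons match on two of four attributes (one shared presence, one shared absence), which gives a coefficient of 0.5.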

Distances between objects measured on metric scales can be determined by the Minkowski metrics, among which the Euclidean distance is the most common. All these measures have specific effects on the results, which might be unwanted in the light of the data or of the research question. The decision as to what measure to use should be made on the basis of a careful examination of the requirements of the study (Arabie & Hubert 1996, 13).
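The family of Minkowski metrics can be sketched in a few lines. The general form sums the absolute differences raised to a power p and takes the p-th root; p = 1 gives the city-block distance and p = 2 the Euclidean distance. The function name and the example values are hypothetical.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance between two metric profiles of equal length."""
    if len(x) != len(y):
        raise ValueError("profiles must be of equal length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)


# Hypothetical users: daily hours of TV use and of newspaper reading
user_a = [3.0, 1.0]
user_b = [0.0, 5.0]
print(minkowski_distance(user_a, user_b, p=1))  # 7.0 (city-block)
print(minkowski_distance(user_a, user_b, p=2))  # 5.0 (Euclidean)
```

The same pair of users is 7 units apart under one metric and 5 under the other, which illustrates why the choice of measure can affect the resulting clusters.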

The third step, and the core procedure of a cluster analysis, is the clustering method, which puts the objects into groups. The main distinction in the field of clustering algorithms is that between hierarchical and nonhierarchical methods. Hierarchical methods generally try to separate the whole set of cases stepwise into groups, while nonhierarchical procedures start with a specific partition and try to optimize the cluster representation by moving single objects between the given clusters. As the latter group of methods requires an idea of the possible resulting groups, very solid and theoretically based assumptions about the pre-formulated groups are essential. Thus nonhierarchical methods are usually employed in a second step to evaluate or optimize a cluster solution that has previously been produced by the use of hierarchical methods.
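The nonhierarchical idea of improving a given partition can be sketched as follows. This is a simplified, k-means-style variant that assumes cluster centroids and squared Euclidean distances, and that reassigns all objects in each pass rather than moving them one at a time; the function names and data are hypothetical.

```python
def centroid(points):
    """Mean profile of a group of objects."""
    return [sum(values) / len(points) for values in zip(*points)]


def squared_distance(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def relocate(points, labels, max_iter=100):
    """Reassign objects to the nearest centroid until nothing changes."""
    labels = list(labels)
    for _ in range(max_iter):
        clusters = sorted(set(labels))
        centers = {
            k: centroid([p for p, label in zip(points, labels) if label == k])
            for k in clusters
        }
        new_labels = [
            min(clusters, key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # partition is stable
            break
        labels = new_labels
    return labels


points = [[1, 1], [1, 2], [8, 8], [9, 8], [2, 1]]
start = [0, 0, 1, 1, 1]         # the last object starts in the "wrong" group
print(relocate(points, start))  # [0, 0, 1, 1, 0]
```

The starting partition here plays the role of the pre-formulated groups, for example a solution taken from a previous hierarchical analysis.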

In order to identify groups of objects in a data set without any preconceptions, hierarchical methods are appropriate and common (Gordon 1996, 65). Here too a large variety of methods is available, of which agglomerative (bottom-up) and divisive (top-down) methods are the most common. Agglomerative algorithms begin by regarding every case as one cluster, and continue by merging the two most similar clusters step by step, until all cases are integrated in one final cluster. Some examples are single link, complete link, centroid, and the Ward algorithm. Divisive methods work the other way around: they start with all the objects in one cluster and then separate them step by step. Within these groups of methods there are many algorithms to identify distinct groups of objects, and these have, like the underlying proximity measures, certain consequences that may be positive or negative depending on the type of data and the research objective. Some of the algorithms tend to form long chains of cases, while others tend to produce more compact clusters; some tend to build clusters of a comparable size, while others produce clusters of different sizes. Generally speaking, the results of agglomerative methods are strongly dependent on the starting point: a second run through the same data set, beginning with a different object, might easily lead to more or less different group sizes or even different optimal cluster solutions.
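The agglomerative procedure described above can be sketched in plain Python. This is an unoptimized illustration that assumes Euclidean distances and supports only the single-link and complete-link rules; the function names and data are hypothetical.

```python
from itertools import combinations
from math import dist


def cluster_distance(points, c1, c2, linkage):
    """Distance between two clusters given as lists of case indices."""
    pairwise = [dist(points[i], points[j]) for i in c1 for j in c2]
    # single link: closest pair of cases; complete link: farthest pair
    return min(pairwise) if linkage == "single" else max(pairwise)


def agglomerate(points, linkage="single"):
    """Merge the two closest clusters step by step; return the merge history."""
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    clusters = [[i] for i in range(len(points))]  # every case is a cluster
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(
                points, clusters[ab[0]], clusters[ab[1]], linkage
            ),
        )
        height = cluster_distance(points, clusters[a], clusters[b], linkage)
        merges.append((sorted(clusters[a]), sorted(clusters[b]), height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


points = [(0,), (1,), (3,), (7,)]
print([h for _, _, h in agglomerate(points, "single")])    # [1.0, 2.0, 4.0]
print([h for _, _, h in agglomerate(points, "complete")])  # [1.0, 3.0, 7.0]
```

Both rules merge the same cases in the same order here, but at different distances, which hints at why different algorithms can lead to different cluster shapes and solutions on less tidy data.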

In the fourth step the last problem is to be assessed: the decision about the final cluster solution, which is mainly a decision about the number of clusters that best represents the structure of the population. Hierarchical algorithms provide the whole scale of steps, from all cases in a single cluster to each case forming its own cluster, leaving the researcher to decide on the result. Additionally, in a typical social science data set of about 1,000 or 2,000 cases and several dozen variables it is impossible, despite technological progress in software and hardware development, to calculate all the potential distances, combinations, and clustering orders in order to identify the one solution that represents the optimum in similarities and dissimilarities of cases and clusters according to a specific measurement. Last but not least, there may be theoretical implications that lead a researcher to prefer, for example, a two-cluster solution to a six- or seven-cluster landscape. Statistics provide several “stopping rules,” which help to determine the optimal point at which to stop the cluster integration before the end (either all cases in one cluster or every case a cluster) is reached (Bock 1994, 9–10). Following the Ward algorithm, for example, a significant increase of the error sum of squares from one step to another would be a criterion for stopping the process and accepting the previous solution as final.
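A stopping rule of this kind can be sketched as follows. The sketch assumes that the error sum of squares is already known for each number of clusters, and it simplifies "a significant increase" to "the largest increase", which is an assumption made for illustration and not a formal test. The function name and the values are hypothetical.

```python
def suggest_cluster_count(ess_by_k):
    """Return the number of clusters just before the largest jump in the
    error sum of squares (ESS) along the merging sequence."""
    ks = sorted(ess_by_k, reverse=True)  # merging order, e.g. 6, 5, ..., 1
    if len(ks) < 2:
        raise ValueError("at least two cluster solutions are needed")
    best_k, best_jump = None, float("-inf")
    for previous_k, next_k in zip(ks, ks[1:]):
        jump = ess_by_k[next_k] - ess_by_k[previous_k]
        if jump > best_jump:
            # accept the solution that existed before this merge
            best_k, best_jump = previous_k, jump
    return best_k


# Hypothetical ESS values for solutions with 6 down to 1 clusters
ess = {6: 10.0, 5: 12.0, 4: 15.0, 3: 19.0, 2: 60.0, 1: 95.0}
print(suggest_cluster_count(ess))  # 3
```

In this example, merging from three to two clusters raises the error sum of squares by 41, more than any other step, so the three-cluster solution would be retained, subject to the theoretical considerations mentioned above.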

In none of the four steps of cluster analysis is there a single standard method (as the limited options of standard software packages may sometimes suggest). The variables, proximity measures, and clustering algorithms must be selected carefully, and the choice of procedure depends to a great extent on the individual research question.

References:

Arabie, P., & Hubert, L. J. (1996). An overview of combinatorial data analysis. In P. Arabie, L. J. Hubert, & G. De Soete (eds.), Clustering and classification. Singapore and River Edge, NJ: World Scientific, pp. 5–63.
Bock, H. H. (1994). Classification and clustering: Problems for the future. In E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, & B. Burtschy (eds.), New approaches in classification and data analysis. Berlin and New York: Springer, pp. 9–24.
Gordon, A. D. (1996). Hierarchical classification. In P. Arabie, L. J. Hubert, & G. De Soete (eds.), Clustering and classification. Singapore and River Edge, NJ: World Scientific, pp. 65–121.
Milligan, G. W. (1996). Clustering validation: Results and implications for applied analysis. In P. Arabie, L. J. Hubert, & G. De Soete (eds.), Clustering and classification. Singapore and River Edge, NJ: World Scientific, pp. 341–375.
Mirkin, B. (1996). Mathematical classification and clustering. Dordrecht and Boston: Kluwer.
Mirkin, B. (2005). Clustering for data mining: A data recovery approach. Boca Raton, FL: Chapman and Hall/CRC.