The original purpose of statistics was collecting data for government and administration. Thus, the term is used for, e.g., employment data or for censuses in general that provide data (statistics) about a population or nation. Scientifically, the term stands for different forms of presenting empirical data in charts, diagrams, or tables as well as for an academic discipline. In this respect, statistics is part of mathematics, but also an auxiliary discipline for other academic disciplines like communication research. Statistics serves to analyze, present, and interpret empirical data that have been collected by applying quantitative methods like surveys, content analyses, or observations in experimental designs or field studies. Statistics comprises descriptive and explanatory procedures.
Sample, Variables, and Scales
Descriptive statistics deals with describing and characterizing empirical data from representative samples, i.e., mostly random samples. A sample is called random when all elements have the same chance of becoming or not becoming a member of it. This is a crucial condition for all statistical analyses, especially when it comes to explanatory procedures. Observations or elements of a random sample can be, for instance, persons, newspaper articles, or television ads. In statistics, the attributes of these elements are called “variables.” They can take different values. An interviewee in a representative survey, for instance, can be male or female; his or her political interest can be weak, medium, or strong; and he or she can be 14 years, 52 years, or 79 years old. Gender, political interest, and age are variables, whereas “male/female,” “weak/medium/strong,” and {1, 2, 3 . . . n} are the values of these variables.
Variables can have different scales or levels of measurement, so one might say scales represent certain accuracies in measuring a variable. Usually, four types of measurement or scales are distinguished: nominal, ordinal, interval, and ratio. (1) Nominal scales or variables (e.g., gender, color) represent a difference or equality of values (e.g., “male” and “female” or “red,” “blue,” and “yellow”) but with no meaningful rank order among values. Dummy variables that can take 1 and 0 as values are often considered a special form of nominal measurement. (2) Ordinal scales (e.g., military rank, school grades) also represent difference/equality, but additionally include a rank order like “bigger/smaller” or “more/less.” There are, however, no precise differences between consecutive values. Thus, one cannot state that a strong political interest would be twice as much as a weak political interest. (3) Interval scales (e.g., temperature in degrees Fahrenheit) do have meaningful, i.e., equal, distances between values: the difference between 120 and 130 degrees equals the difference between 230 and 240 degrees. Because the zero point of an interval scale is arbitrary, however, ratio statements remain meaningless; 240 degrees Fahrenheit is not twice as hot as 120 degrees Fahrenheit.
(4) Finally, a ratio scale (e.g., weight, prices) is like interval measurement, but additionally has a meaningful, i.e., absolute, zero value. Humans and animals, for instance, cannot have a negative weight. Therefore ratio scales represent not only equal distances, but also equal proportions of values: 100 kilograms is twice as heavy as 50 kilograms. Dummy, nominal, and ordinal scales are called nonmetric, whereas interval and ratio scales are called metric measurements. For computer programs, non-numeric values of measured attributes (e.g., “male/female”) are usually turned into numeric values (e.g., 1 and 2), i.e., they are coded so that statistical procedures can be applied.
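As a minimal illustration of such coding, the following Python sketch maps invented gender responses to arbitrary numeric codes; the variable names and code numbers are assumptions for the example only.

```python
# Illustrative coding of a nominal variable: non-numeric values are
# mapped to arbitrary numeric codes so that software can process them.
# The code numbers carry no rank order or distance information.
gender_codes = {"male": 1, "female": 2}

responses = ["female", "male", "female", "female", "male"]
coded = [gender_codes[r] for r in responses]
print(coded)  # [2, 1, 2, 2, 1]
```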
After having collected data, e.g., in a survey, we ask questions like these: How old are the interviewees in the survey sample? How often do they watch television? Is there any connection between media use and political interest? To answer these questions we can describe empirical data in two ways: on the one hand, by making charts, diagrams, and tables; on the other hand, by computing sample characteristics (e.g., means) or coefficients (e.g., correlation coefficients) that represent a certain feature of the sample (e.g., an average) in one value. Both approaches comprise univariate and multivariate perspectives.
Univariate Statistics
Univariate statistics deals with single variables. One can, for instance, examine the distribution of age among all interviewees in a survey. The original data show the values of a variable (e.g., age) for each element in the sample, e.g., each interviewee. By grouping all elements (e.g., interviewees) sharing a certain value (e.g., 25 years) we obtain a frequency table: it lists all values (e.g., ages) in the first column and the corresponding frequencies (e.g., numbers of interviewees) in the second column. Tables of categorical data list classes of values in the first column (e.g., “10–19 years,” “20–29 years,” etc.) and frequencies in the second column. Tables of cumulated data – whether categorical or not – add up frequencies starting from the top of the column and ending with the sum of all frequencies (100 percent) in the last row. All types of tables can be transformed into corresponding charts following the same logic, but giving a better impression of the distribution of a variable in the sample. Another way of characterizing a sample is calculating sample characteristics. Here, univariate statistics distinguishes between measures of central tendency (means) and measures of dispersion (deviations). Both depend on the level of measurement of a variable.
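The following Python sketch illustrates how such a frequency table with grouped and cumulated frequencies might be computed; the ages are invented example data, and the decade-wide classes are an arbitrary choice for the illustration.

```python
from collections import Counter

# Invented example data: ages of interviewees in a small sample
ages = [14, 23, 25, 25, 31, 37, 37, 37, 52, 64, 71, 79]

# Simple frequency table: each observed value and its frequency
freq = Counter(ages)
for value in sorted(freq):
    print(value, freq[value])

# Categorical (grouped) and cumulated frequencies by decade of age
classes = Counter((age // 10) * 10 for age in ages)
cumulated = 0
n = len(ages)
for lower in sorted(classes):
    cumulated += classes[lower]
    print(f"{lower}-{lower + 9} years: {classes[lower]} "
          f"(cumulated: {100 * cumulated / n:.0f} percent)")
```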
Strictly speaking, the term “mean” refers to an average of values, i.e., to the arithmetic mean: here, we sum up all values of all interviewees in our survey sample and divide this sum by the number of interviewees. This type of mean value can only be calculated for metric measurement. The permissible central value for the nominal level of measurement is called the “mode.” It is the most frequent value in a sample: if “37 years” is the most frequent answer to the survey question regarding the age of the interviewees, then the mode is 37. Of course, the arithmetic mean can be computed for age as well. The central value for ordinal variables is the median. It splits a distribution in half: 50 percent of all values rank below the median and 50 percent rank above it. The median can also be considered a special form of a percentile: the percentiles dividing a distribution into quarters – at 25 percent, 50 percent, and 75 percent of all values – are called quartiles, and the median is simply the 50th percentile, i.e., the second quartile.
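As a brief illustration, the permissible central values and the quartiles could be computed with Python's standard statistics module along these lines; the ages are invented example data.

```python
import statistics

# Invented example data: ages of interviewees
ages = [14, 23, 25, 25, 31, 37, 37, 37, 52, 64, 71, 79]

print(statistics.mean(ages))    # arithmetic mean (metric level)
print(statistics.median(ages))  # median, i.e., the 50th percentile
print(statistics.mode(ages))    # mode: the most frequent value (37)

# Quartiles: the three cut points at 25, 50, and 75 percent
print(statistics.quantiles(ages, n=4))
```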
Measures of dispersion differ with the level of measurement as well. For nominal variables we cannot compute a deviation from a mean value. For ordinal measurement the permissible measures are the range and the interquartile range. The range is the difference between the highest and the lowest value in the sample, whereas the interquartile range represents the difference between the third and the first quartile. For metric measurement we calculate the variance or its square root, called the standard deviation. An example of a metric variable is the duration of watching television measured in minutes. The variance in television viewing is the average of the squared differences between the interviewees’ individual values and the mean value of all interviewees, i.e., the sum of squared deviations divided by the number of cases (or by n − 1 for a sample estimate).
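A minimal sketch of these measures of dispersion, again with invented viewing times in minutes; pvariance and pstdev describe the sample at hand rather than estimating population values.

```python
import statistics

# Invented example data: daily television viewing in minutes
viewing = [0, 15, 30, 30, 45, 60, 60, 90, 120, 180]

# Range and interquartile range (ordinal level and above)
q1, q2, q3 = statistics.quantiles(viewing, n=4)
print(max(viewing) - min(viewing))  # range
print(q3 - q1)                      # interquartile range

# Variance and standard deviation (metric level):
# average squared deviation from the mean and its square root
print(statistics.pvariance(viewing))
print(statistics.pstdev(viewing))
```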
Multivariate Statistics
Multivariate statistics describes correlations between variables. For instance, we can ask whether the time subjects spend watching televised debates increases with growing political interest. In bivariate statistics the term “covariance” refers to the extent to which the values of one variable vary together with the values of the other variable, e.g., in the sense of “the more X, the more Y.” In other words, we focus on the joint distribution of two variables. Similar to univariate statistics, correlations between two or more variables can be described by charts, diagrams, and tables on the one hand and by calculating correlation coefficients on the other.
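The following sketch computes the sample covariance of two invented variables directly from its definition, i.e., as the averaged product of the two variables' deviations from their means.

```python
import statistics

# Invented example data: political interest (1-5) and minutes spent
# watching televised debates for ten interviewees
interest = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
debates = [0, 10, 5, 20, 15, 25, 30, 35, 50, 60]

# Covariance: do above-average values of one variable go together
# with above-average values of the other ("the more X, the more Y")?
mean_x = statistics.mean(interest)
mean_y = statistics.mean(debates)
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(interest, debates)) / (len(interest) - 1)
print(cov)
```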
The correlation between two variables is called asymmetric if one variable can be assumed to be the impact factor of the other variable; otherwise the relation is called symmetric. Age has a possible impact on political sophistication, for example – but not vice versa. Correspondingly, some coefficients indicate only the strength while others also represent the direction of a correlation: Here, a positive (negative) value of the coefficient indicates that an increase in value of the first variable is accompanied by an increase (decrease) in value of the second variable. In cross-tabulation, i.e., in contingency tables with two variables, the columns usually represent the values of the assumed impact factor, which is also called the predictor or independent variable. And the rows of the contingency table represent the values of the assumed affected variable, which is also called the response or dependent variable.
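A cross-tabulation of two invented nominal variables might be built along these lines, with the assumed predictor in the columns and the assumed response in the rows; the variable names and values are assumptions for the example.

```python
from collections import Counter

# Invented example data: age group (assumed predictor, columns) and
# political sophistication (assumed response, rows)
age_group = ["young", "young", "young", "old", "old", "old", "old", "young"]
sophistication = ["low", "low", "high", "high", "high", "low", "high", "high"]

# Cross-tabulation: joint frequencies of the two variables
table = Counter(zip(sophistication, age_group))
for row in ("low", "high"):
    print(row, [table[(row, col)] for col in ("young", "old")])
```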
Correlation coefficients in the strict sense rely on the general linear model, which stipulates a linear relation between variables. Strictly speaking, the term “correlation” should therefore only be applied to metric measurement. At the nominal level correlations are called “contingency,” and with ordinal measurement they are called “association.” For nominal measurement one computes contingency coefficients. Here, the permissible coefficient is the phi coefficient φ for a four-cell table (two variables with two values each), and Cramer’s V for a table with more than four cells. Both are based on the chi-square statistic χ², i.e., on the differences between theoretically expected frequencies (based on probability theory) and observed frequencies in the joint distribution of two variables. The values of both coefficients range from 0 (no contingency at all) to 1 (perfect contingency).
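As an illustration, phi and Cramer's V can be derived from the chi-square statistic of an invented four-cell table; the sketch assumes SciPy and NumPy, which the text itself does not mention.

```python
import math
import numpy as np
from scipy.stats import chi2_contingency

# Invented 2x2 contingency table (four-cell table): rows are the values
# of the response variable, columns the values of the predictor
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)

n = observed.sum()
phi = math.sqrt(chi2 / n)              # phi for a 2x2 table
k = min(observed.shape) - 1
cramers_v = math.sqrt(chi2 / (n * k))  # Cramer's V for larger tables
print(phi, cramers_v)                  # identical for a 2x2 table
```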
At the ordinal level different coefficients can be applied. In contrast to Kendall’s τ (Tau-b/c) the gamma coefficient γ does not make an adjustment for either table size or ties. A more popular coefficient especially in agenda-setting research is Spearman’s rank correlation coefficient ρ (rho). In many agenda-setting studies several issues are ranked for all media on the one hand and for the aggregate of recipients on the other hand. Here, Spearman’s ρ indicates strength and direction of the association between media agenda and recipients’ agenda. Thus, values range from −1 for perfect negative correspondence to +1 for perfect positive correspondence, with 0 for no correspondence at all. At the metric level of measurement one can compute correlation coefficients or conduct a regression analysis. Pearson’s product–moment coefficient r indicates the strength and direction of a correlation between two metric measurements. The coefficient whose values range from −1 to +1 is obtained by dividing the covariance of both variables by the product of their standard deviations.
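A short sketch of these coefficients for invented agenda-setting style rankings, using SciPy (an assumption for the example; any statistics package would do).

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# Invented example: ranks of eight issues on the media agenda and on
# the recipients' agenda
media_agenda = [1, 2, 3, 4, 5, 6, 7, 8]
recipients_agenda = [2, 1, 3, 5, 4, 7, 6, 8]

rho, _ = spearmanr(media_agenda, recipients_agenda)   # ordinal: Spearman's rho
tau, _ = kendalltau(media_agenda, recipients_agenda)  # ordinal: Kendall's tau-b
r, _ = pearsonr(media_agenda, recipients_agenda)      # metric: Pearson's r
print(rho, tau, r)
```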
In bivariate regression, the product–moment coefficient can also be considered the square root of the so-called explained variance. Linear regression examines the impact of one or more assumedly independent or explanatory variables X1, . . . , Xn (predictors, regressors) on an assumed dependent variable Y (response variable, regressand). The response variable must be metric, whereas the explanatory variables can be of lower measurement levels. The determination coefficient R² indicates the proportion (in percent) of variance in the response variable that is explained by X1, . . . , Xn. Values of R² range from 0 (no impact of the explanatory variables) to 1, or 100 percent (full impact). The “error” 1 − R² is also called the unexplained variance and corresponds to the sum of squared residuals relative to the total variation. Since regression keeps these residuals, i.e., the unexplained variation, to a minimum by minimizing the sum of their squares, it is usually called OLS regression (ordinary least squares). Additionally, beta values β can be computed for each explanatory variable. A beta value is a standardized regression coefficient that represents the strength and direction of the relation between a regressor and the response variable; in the bivariate case it equals Pearson’s r and ranges from −1 to +1. Since beta values are standardized, one can compare them across different regressors within one regression analysis, or for the same regressor across different regression models.
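A minimal OLS sketch with invented data and NumPy illustrates the determination coefficient and standardized beta values; the variable names and figures are assumptions for the example only.

```python
import numpy as np

# Invented example data: political interest and age as predictors (X),
# minutes spent watching televised debates as the response (y)
interest = np.array([1, 2, 2, 3, 3, 4, 4, 5, 5, 5], dtype=float)
age = np.array([20, 25, 31, 35, 42, 47, 52, 58, 63, 70], dtype=float)
debates = np.array([5, 10, 8, 20, 18, 30, 28, 45, 50, 55], dtype=float)

X = np.column_stack([np.ones_like(interest), interest, age])
coef, *_ = np.linalg.lstsq(X, debates, rcond=None)  # OLS estimates

residuals = debates - X @ coef
r_squared = 1 - residuals.var() / debates.var()     # explained variance
print(r_squared)

# Standardized (beta) coefficients: refit with z-standardized variables
Z = np.column_stack([(interest - interest.mean()) / interest.std(),
                     (age - age.mean()) / age.std()])
z_y = (debates - debates.mean()) / debates.std()
betas, *_ = np.linalg.lstsq(Z, z_y, rcond=None)
print(betas)  # comparable across regressors
```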