Explanatory statistics is also called inferential statistics or statistical induction and deals with inferences about the population from the characteristics of a random sample, i.e., with making (probability) statements about usually unknown parameters of a population. For instance, when taking a random sample (e.g., n = 1,000) of television viewers from the population of all TV viewers (e.g., N = 1,000,000), we want to know if the average time of TV viewing in the sample (e.g., 184 minutes/weekday) comes close to the average time in the population from which the sample was taken.

## Roots In Probability Theory

Explanatory statistics includes point and interval estimation as well as hypothesis tests for statistical significance. They are based on probability theory (e.g., the work of Richard von Mises, Thomas Bayes, and Pierre-Simon Laplace). On the one hand, probability can be interpreted as the ratio of the number of favorable outcomes to the number of possible outcomes. For instance, if one expects a “6” when throwing a single die once, the probability of getting a “6” is the ratio of this favorable outcome (n = 1) divided by the possible outcomes (n = 6) – thus p(“6”) = 1/6. Throwing a single die is an example of a random experiment. It is called random since we do not know the outcome before having thrown the die. On the other hand, we can assume that the relative frequency of cases in which we get a “6” comes nearer to the probability p(“6”) when throwing the single die many, many times.

Asking an interviewee how old he or she is can be considered a random experiment, since we do not know what he or she will answer. All possible outcomes are called the *sample space* Ω of the random experiment. We do not know the outcome for the next interviewee, but we can tell the probability of a certain age. For another example: we cannot predict the outcome of “throwing a 5” for sure when playing dice, but we know that the probability of this outcome can be obtained by dividing the number of positive outcomes (throwing a 5) by the number of possible outcomes (throwing a 1, 2, 3, 4, 5, or 6) – thus the probability of “throwing a die once and obtaining 5” is p = 1/6. Probability values range from 0 to 1. This is the first of the *Kolmogorov axioms*. The second postulates that the probability of the sample space Ω is 1 – in other words, the probability that any of the possible outcomes will occur is 1. The third axiom claims that the probability of a sum of disjoint, i.e., “exclusive” events or outcomes is the sum of the single probabilities of each event or outcome: e.g., p(“throwing a 1, 2, or 3”) = 1/6 + 1/6 + 1/6 = 1/2. From these axioms one can deduce lemmas or rules for calculating other probabilities. One lemma tells us that the probability of the impossible event (e.g., “throwing a 7 with one die”) is 0. A second lemma tells us that the probability p(A) of an event A and the probability p(A^{C}) of the complementary event A^{C} (“not A”) sum up to 1.
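The die example and the three axioms can be checked mechanically. A minimal sketch using exact fractions, with the fair-die probabilities from the text:

```python
from fractions import Fraction

# Sample space of a single fair die throw; each outcome equally likely.
omega = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in omega}

# Axiom 1: every probability lies between 0 and 1.
assert all(0 <= p[o] <= 1 for o in omega)

# Axiom 2: the probability of the whole sample space is 1.
assert sum(p.values()) == 1

# Axiom 3: disjoint events add, e.g. p("1, 2, or 3") = 1/6 + 1/6 + 1/6.
p_low = sum(p[o] for o in {1, 2, 3})
print(p_low)  # 1/2

# Lemma: an event and its complement sum to 1, e.g. p("6") + p("not 6").
print(p[6] + sum(p[o] for o in omega - {6}))  # 1
```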

Attributes of sample elements, i.e., variables and values, can be treated in terms of probability theory: when interviewing someone, we do not know the answer to the question “How old are you?” before we ask. Thus, age is a random variable. It can take different values, which can be considered as outcomes of a random experiment with Ω = {1, 2, 3, . . . , n}. Each outcome or event has a certain probability. Listing the outcomes consecutively, one obtains a table or diagram showing the probability distribution (or function), which is comparable to the sample distribution. And as cumulated frequencies in sample statistics do, the cumulated probabilities form the cumulated probability distribution, also called the (cumulative) distribution function. As in sample statistics, we can characterize probability distributions by parameters. The expected value E(X) is the average or mean value of a random variable X. Additionally, we can compute the variance var(X) or standard deviation of a random variable X.
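The expected value and variance can be computed directly from a discrete probability distribution. A small sketch, using the fair die from the earlier example as the random variable:

```python
# Expected value and variance of a discrete random variable,
# illustrated with a die throw X in {1, ..., 6}, p(x) = 1/6 each.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# E(X) = sum over x of x * p(x)
e_x = sum(x * p for x, p in zip(values, probs))

# var(X) = E[(X - E(X))^2]
var_x = sum((x - e_x) ** 2 * p for x, p in zip(values, probs))

print(e_x)    # 3.5
print(var_x)  # about 2.917 (exactly 35/12)
```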

Distributions differ with the level of measuring a random variable. Discrete random variables take finite or countably infinite values – like age: Ω = {1, 2, 3, . . . , n}. Examples of discrete distributions are the Bernoulli or the binomial distribution. Continuous random variables can take an infinite number of real numbers within the limits of an interval – like “time spent watching a half-hour TV soap”: Ω = {x | 0 ≤ x ≤ 30}. The most prominent example of a continuous distribution is the *normal or Gaussian distribution* (named after Carl Friedrich Gauss). Normal distributions N(µ,σ^{2}) vary in position due to the expected value E(X) = µ and in form due to the variance var(X) = σ^{2}. All normal distributions can be standardized, i.e., z-transformed into a standard normal distribution (SND) with µ = 0 and σ^{2} = 1. The SND is a key feature of statistics since many distributions tend toward a normal distribution when numbers get large. This is described by the central limit theorem, which postulates that the standardized sum of many stochastically independent random variables X_{n} – and hence also their mean – is approximately normally distributed.
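Both the z-transform and the tendency described by the central limit theorem can be illustrated numerically. The sketch below uses invented values (µ = 184, σ = 20) for the z-transform and simulated die throws for the limit behavior:

```python
import random
import statistics

random.seed(42)

# z-transform: any value from N(mu, sigma^2) maps to the SND
# via z = (x - mu) / sigma. The numbers here are invented.
mu, sigma = 184.0, 20.0
x = 204.0
z = (x - mu) / sigma
print(z)  # 1.0 -> one standard deviation above the mean

# Central limit theorem: means of many die throws cluster around
# E(X) = 3.5 in an approximately normal way, even though a single
# throw is uniformly distributed.
sample_means = [statistics.mean(random.randint(1, 6) for _ in range(100))
                for _ in range(2000)]
print(round(statistics.mean(sample_means), 2))  # close to 3.5
```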

## Point And Interval Estimation

Empirical distributions are often assumed to be normally distributed for practical reasons. One can take the SND as a template for drawing inferences about population parameters from sample characteristics. For instance, one can compare the arithmetic mean of a random sample with the SND and then make a probability statement about the unknown mean value in the population from which the sample was taken. The SND provides a standardized template with µ = 0 and σ^{2} = 1. In a diagram, the area below the curve of the SND represents 100 percent of the values. More important are the standardized intervals: Within the limits of ±1 standard deviation we find roughly 68 percent of all values of the SND. Within ±1.96 (±2.58; ±3.29) standard deviations we find 95 percent (99 percent; 99.9 percent) of all values.
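These standardized intervals can be reproduced from the SND's distribution function, which is available in closed form via the error function. A minimal sketch:

```python
from math import erf, sqrt

def snd_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def central_area(z):
    """Share of SND values falling within +/- z standard deviations."""
    return snd_cdf(z) - snd_cdf(-z)

print(round(central_area(1.0), 4))   # ~0.6827
print(round(central_area(1.96), 4))  # ~0.95
print(round(central_area(2.58), 4))  # ~0.99
print(round(central_area(3.29), 4))  # ~0.999
```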

These three limits represent the *levels of confidence*. If the mean value of TV viewing in a random sample is 184 minutes, for example, a confidence of 95 percent can be interpreted like this: if we took 100 random samples of the same sample size, we would expect a mean close to 184 minutes for about 95 of the 100 random samples. Yet for 5 of the 100 random samples we would find a clearly different average. Thus, we make a 5 percent error when stipulating that the result of our single random sample holds “for sure.” This error probability is denoted α and also called the level of significance. Following the second lemma of probability theory, confidence and error probability are complementary, i.e., sum up to p = 1 or 100 percent. Thus, corresponding to the three levels of confidence there are three levels of significance: α < 5 percent (significant), α < 1 percent (very significant), and α < 0.1 percent (highly significant). For an example: the sample average of 184 minutes is called very significant (α < 1 percent) when there is a probability of at least 99 percent that we would get a similar value when taking another sample from the same population. Or in other words, with a probability of 99 percent the population average is close to 184 minutes.

This statement is an example of point estimation. Here, the sample statistics (e.g., mean value) are directly used as a point estimator for population parameters (e.g., mean value). Interval estimations are a similar but more “precise” inference about population parameters from sample characteristics. Here, one gives an interval based on confidence and thus calls it the confidence interval. With the example above one would say: starting from the sample mean and α < 1 percent, we can assume that the mean of watching TV for all subjects from the population (from which the sample was taken) lies within the boundaries of, e.g., 180 and 188 minutes. Logically speaking, this is the same statement as the probability statement in the last paragraph, since both are based on the same confidence level.
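A confidence interval of this kind can be computed from the sample mean and its standard error. The sketch below uses simulated (invented) viewing times scattered around the article's example mean of 184 minutes:

```python
from math import sqrt
import random
import statistics

random.seed(1)

# Hypothetical sample of daily TV viewing minutes; the values are
# invented for illustration around the article's example mean of 184.
sample = [random.gauss(184, 40) for _ in range(1000)]

n = len(sample)
mean = statistics.mean(sample)
se = statistics.stdev(sample) / sqrt(n)   # standard error of the mean

# 95 percent confidence interval: mean +/- 1.96 standard errors.
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(round(lower, 1), round(upper, 1))
```

With a larger α (say 1 percent, i.e., ±2.58 standard errors) the interval widens, which is the trade-off between confidence and precision described above.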

## Testing Statistical Hypotheses

Explanatory statistics also deals with statistical hypothesis testing or *significance testing*. Here, one asks questions like “How significant is the sample average of 184 minutes?” (one-sample test) or “Does the mean in TV viewing in the male group differ significantly from the mean in the female group?” (two-sample test). For significance testing, the SND and other distributions like the Student-t-distributions serve as test statistics. In practice, the statistical hypothesis test comprises several steps, which can be illustrated with the example of the two-sample (un)pooled t-test. First, one proposes a null and an alternative hypothesis. The null hypothesis H_{0} postulates “no significance” or “equality”: for instance, H_{0} might claim that the mean value in the first sample (e.g., people from New York) and the mean value in the second sample (e.g., inhabitants of Los Angeles) do not differ significantly, but by chance. Statistically speaking, the two samples were taken from the same population (µ_{NY }= µ_{LA}). The opposite, i.e., alternative hypothesis H_{A}, claims significance by postulating that the two samples are taken from different populations (µ_{NY }≠ µ_{LA}).

Second, we set the level of significance. With medicine testing, for instance, strict (small) significance levels should be chosen to “ensure” that a medicine has no side-effects. If we choose α = 0.05, we test on the 95 percent level of confidence. It is important to mention that one tests H_{0 }and not H_{A}. The third step is choosing a test statistic (test distribution). With the two-sample t-test one chooses the Student-t-distribution as a template for the test decision. Strictly speaking, the t-distribution is a group of distributions differing with the degrees of freedom (df) that represent sample size (e.g., df = n_{NY} + n_{LA} – 2 for a two-sample t-test). The fourth step comprises calculating the empirical value of the test statistic from the sample data. The empirical t-value t_{emp} is compared to a critical, i.e., theoretical t-value t_{crit} depending on α and df. If t_{emp} ≤ t_{crit}, we accept H_{0}. But if t_{emp} > t_{crit}, we reject H_{0} and accept H_{A}. So we can state that the group means differ significantly (α = 0.05) and claim that the groups are samples from different populations – with a probability of 95 percent.
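The four steps can be sketched end to end. The two samples below are invented for illustration, and the critical value t_crit = 2.101 is the standard t-table value for α = 0.05 (two-tailed) and df = 18:

```python
from math import sqrt
import statistics

# Step 1 (hypotheses) is implicit: H0 says the two group means are equal.
# The samples are invented viewing times for New York and Los Angeles.
ny = [170, 185, 190, 200, 175, 182, 195, 188, 178, 192]
la = [160, 172, 168, 180, 165, 175, 170, 178, 162, 174]

n1, n2 = len(ny), len(la)
m1, m2 = statistics.mean(ny), statistics.mean(la)
v1, v2 = statistics.variance(ny), statistics.variance(la)

# Steps 2-3: alpha = 0.05 and the t-distribution with df = n1 + n2 - 2.
df = n1 + n2 - 2
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df   # pooled variance
t_emp = (m1 - m2) / sqrt(pooled * (1 / n1 + 1 / n2))

# Step 4: compare t_emp to the critical value from a standard t-table
# (alpha = 0.05, two-tailed, df = 18).
t_crit = 2.101
print(round(t_emp, 2), "reject H0" if abs(t_emp) > t_crit else "accept H0")
```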

Statistical tests all follow this logic, but vary with the test situation and test statistics being applied. If we ask for the significance of a single mean value, we usually choose t-statistics. This test situation is called a one-sample t-test and is applied if the population parameter σ is unknown – as is mostly the case. For a large sample size (usually n ≥ 30) and known σ, however, one can choose z-statistics based on the SND, since under these conditions t-distributions are approximately normally distributed. This test situation is called a one-sample z-test. Comparing two means is the above-mentioned two-sample test situation. Another example is χ^{2}-statistics, which is based on the differences between theoretically expected frequencies (derived from probability theory) and observed frequencies in the joint distribution of two variables. On the one hand, χ^{2}-statistics is chosen if we ask for independence (H_{0}) or association or contingency (H_{A}) of two nominal variables (one-sample test). On the other hand, χ^{2}-statistics is chosen for the two-sample test, i.e., when asking for equality (H_{0}) or difference (H_{A}) of distributions in two samples. Here, for instance, we might examine whether the *New York Times* and the *Washington Post* differ in weighting political issues.
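The χ^{2}-statistic follows the same compare-to-critical-value logic. The sketch below tests independence in a 2×2 table; the observed counts are invented (rows: newspaper, columns: political vs. other articles), and 3.841 is the standard table value for α = 0.05 and df = 1:

```python
# Chi-square test of independence for a 2x2 table (invented counts):
# rows = newspaper (e.g., NYT, WP), columns = topic (politics, other).
observed = [[60, 40],
            [45, 55]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected frequencies under H0 (independence):
# row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / total
        chi2 += (obs - exp) ** 2 / exp

# Critical value from a standard chi-square table
# (alpha = 0.05, df = (2 - 1) * (2 - 1) = 1).
chi2_crit = 3.841
print(round(chi2, 2), "dependent" if chi2 > chi2_crit else "independent")
```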
