Communication researchers often gather quantitative data – for example from surveys, content analyses, or experiments – and then generate mathematical models to represent or summarize those data. These models are used in two basic ways: first, to generate predictions about certain variables; and second, to study relationships among some number of variables (Allison 1999). In some cases, the goal is to obtain as accurate a prediction as possible, without much regard to the particular variables used to make the prediction. In most cases, at least in communication research, the focus is on the pattern of relationships among particular variables, and especially on testing hypotheses made in advance about how variables are expected to interrelate.
Models
Models represent the relationship between a dependent or outcome variable and one or more independent or predictor variables, by stipulating the former as some mathematical function of the latter. Models contain parameters that express the function, usually a set of weights applied to independent variables.
The vast majority of the models used in communication research are linear models, which represent the functional relationship as a line (in two dimensions) or as a linear surface (in three or more dimensions). For example, a simple linear model explaining the number of times a child behaves aggressively (the dependent variable) by the hours of violent television watched (the independent variable) would be a line giving the predicted number of aggressive acts for each number of hours viewed. In some cases, we may find that a straight line or linear surface is inadequate to represent the functional relationship. For instance, the outcome could grow exponentially with increases in a predictor variable, or the outcome could at first grow with increases in a predictor up to some point, but thereafter stop increasing, or begin to decline, with further increases in the predictor. In such instances, curvilinear or nonlinear models would better represent the data.
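As a minimal sketch of this example, a line can be fit to viewing/aggression data with NumPy's least-squares polynomial fit. All data values here are invented for illustration:

```python
import numpy as np

# Hypothetical data: hours of violent TV viewed per week, and the
# number of aggressive acts observed for each child
hours = np.array([0., 1., 2., 3., 4., 5., 6.])
acts = np.array([0.5, 1.2, 1.9, 3.1, 3.8, 5.2, 5.9])

# Fit the linear model acts = b0 + b1 * hours by least squares;
# polyfit returns coefficients from highest degree down
b1, b0 = np.polyfit(hours, acts, deg=1)

# b1 is the predicted change in aggressive acts per additional hour viewed
predicted = b0 + b1 * hours
```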
While both are commonly termed “nonlinear,” curvilinear and nonlinear models are distinguished mathematically. Curvilinear models, like polynomial functions (for instance, where aggressive behavior is modeled as an additive function of both hours viewed and the square of hours viewed), predict outcomes falling on a curved line, but the predictions are still actually a simple linear function of the weights (or parameters) assigned to each variable. These sorts of curvilinear models are “nonlinear in the variables, but linear in the parameters.” Nonlinear models, on the other hand, posit more complex functional relationships, stipulating that variable weights or other parameters themselves depend on the values of predictor variables. These models are “nonlinear in the parameters” (Cohen et al. 2003).
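The distinction can be made concrete with a small sketch (hypothetical data): a quadratic model predicts points on a curve, yet its weights enter additively, so ordinary least squares still applies:

```python
import numpy as np

# Hypothetical inverted-U data: y peaks at a moderate value of x
x = np.linspace(0., 6., 13)
y = 6. * x - x ** 2   # a perfect quadratic: curved in x

# Design matrix [1, x, x^2]: the predictions are curved in the
# VARIABLE x, but they are a linear combination of the PARAMETERS
X = np.column_stack([np.ones_like(x), x, x ** 2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# w recovers the generating weights (0, 6, -1): linear in the parameters
```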
Model Estimation
A model, then, specifies a particular functional relationship (e.g., linear, curvilinear, or nonlinear) between a dependent variable and some set of independent variables. These models are then fit to empirical data using some estimation method, that is, a particular algorithm used to estimate model parameters or coefficients (such as the slope of a prediction line). Estimation methods typically define some mathematical quantity that is to be minimized, which is known as a loss function. For instance, one of the most commonly used estimation methods, known as ordinary least squares (OLS), fits a predictive model to a set of observed data by minimizing squared deviations of each observed data point from the points predicted by the model.
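A sketch of the idea with invented data: the OLS loss sums squared deviations of observed from predicted values, and the familiar closed-form slope and intercept are the values that minimize it:

```python
import numpy as np

def squared_loss(b0, b1, x, y):
    """OLS loss function: sum of squared deviations from the model line."""
    return np.sum((y - (b0 + b1 * x)) ** 2)

# Hypothetical data lying exactly on y = 1 + 2x
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1., 3., 5., 7., 9.])

# Closed-form OLS estimates: slope = cov(x, y) / var(x)
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()
# Here the fit is perfect, so the minimized loss is (numerically) zero
```

Perturbing either coefficient away from the OLS solution can only increase the loss, which is what "minimizing squared deviations" means in practice.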
The same model can be fit to data using different estimation methods. Whatever method is used, the process usually involves calculating standard errors, which are estimates of the degree of random variation one would expect in coefficient estimates for samples of a given size. These standard errors are critical in hypothesis testing, because they offer an estimate of the chances that certain hypothesized relationships, as reflected in calculated model coefficients, may have been generated by random processes alone. Evaluating these chances is known generally as significance testing.
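For simple regression the slope's standard error can be computed directly; a sketch with invented data, using the standard OLS formula with n − 2 degrees of freedom:

```python
import numpy as np

# Hypothetical data close to y = 1 + 2x, with small noise
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])

b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)

# Standard error of the slope: residual variance (n - 2 degrees of
# freedom) scaled by the spread of the predictor
n = len(x)
s2 = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))

# A large |t| means a slope this size is unlikely under chance alone
t = b1 / se_b1
```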
To make sound interpretations of model estimates, researchers must also assess how well the stipulated model fits the data. Appraisal of fit is often based on the proportion of variance in the dependent variable predicted by the model, or on comparison of the model’s prediction errors to those generated by some baseline, such as a random model (Lewis-Beck 1980).
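The variance-accounted-for idea can be sketched directly (values invented): compare the model's squared prediction errors to those of a baseline that always predicts the mean:

```python
import numpy as np

# Hypothetical observed values and the model's predictions for them
y = np.array([2., 4., 5., 4., 5.])
y_hat = np.array([2.2, 3.6, 4.8, 4.4, 5.0])

# Baseline: predict the mean of y for every case
ss_res = np.sum((y - y_hat) ** 2)      # model's squared errors
ss_tot = np.sum((y - y.mean()) ** 2)   # baseline's squared errors

# Proportion of variance in y accounted for by the model
r_squared = 1 - ss_res / ss_tot
```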
The Linear Regression Model
Multiple-variable linear regression, using OLS estimation methods, is one of the most popular ways of fitting a linear model to observed data. It allows researchers to estimate how much variation in a dependent variable is uniquely predicted (or “accounted for”) by any one of a set of independent variables. For example, we may be interested in the effect of viewing violent television on aggressive behavior, but recognize the possible influence of other factors, such as gender or socio-economic status, on both the amount of viewing and aggressive behavior. In generating the slope estimate for aggression on hours of viewing, linear regression removes from consideration the portions of the variance in aggression that can be attributed to other variables in the model. The effects of numerous variables can thus be disentangled, and if comparisons are desired, standardized slopes or coefficients, which express change in a standard-deviation metric, are used. Each has an associated standard error, which can be used to conduct statistical tests of differences between slopes for different variables.
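A sketch of this partialling logic with simulated data (all names and effect sizes invented): when a confound drives both viewing and aggression, the simple slope is inflated, while the multiple-regression slope recovers the unique effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated confound (say, a socio-economic indicator) that raises
# both viewing hours and aggression
ses = rng.normal(size=n)
hours = 0.8 * ses + rng.normal(size=n)
aggression = 1.0 * hours + 2.0 * ses + rng.normal(size=n)

# Simple regression slope of aggression on hours: inflated,
# because it absorbs the confound's influence
simple = np.polyfit(hours, aggression, deg=1)[0]

# Multiple regression removes variance attributable to the confound
X = np.column_stack([np.ones(n), hours, ses])
coef, *_ = np.linalg.lstsq(X, aggression, rcond=None)
partial = coef[1]  # slope for hours, controlling for the confound
```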
Curvilinear Regression Models
Standard regression results can be misleading if a straight-line relationship does not properly characterize the data. Consider, for instance, two variables perfectly related in a U-shaped pattern. Fitting a straight line might lead one to conclude, quite incorrectly, that no relationship exists.
Researchers can adapt the standard linear regression model to account for such curvilinear relationships. One approach involves using dummy variables (Allison 1999); that is, variables that categorize observations according to where they fall along the distribution of a given independent variable (e.g., low or high). If we think, for example, that with increasing amounts of exposure to children’s TV programming, young viewers gain verbal knowledge, but that this relationship reverses and becomes negative once viewing time passes a particular threshold, we might include dummies that identify the appropriate viewing ranges and capture the change in slope that occurs relative to verbal scores.
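A sketch of the dummy-variable approach (data invented): verbal scores rise up to a viewing threshold and decline beyond it, and a dummy marking the high range lets the slope change there:

```python
import numpy as np

# Hypothetical data: verbal scores rise until 3 hours of viewing,
# then decline with further viewing
hours = np.array([0., 1., 2., 3., 4., 5., 6.])
verbal = np.array([10., 12., 14., 16., 15., 14., 13.])

# Dummy identifying the high-viewing range; its product with
# (hours - 3) captures the change in slope past the threshold
high = (hours > 3).astype(float)
X = np.column_stack([np.ones_like(hours), hours, high * (hours - 3.)])
w, *_ = np.linalg.lstsq(X, verbal, rcond=None)

# w[1] is the slope below the threshold; w[1] + w[2] is the slope above it
```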
Another common approach is to specify a curved functional form, and then to apply transformations of the variables (e.g., by taking their logs, reciprocals, squares, or cubes), such that the model remains linear in its parameters, although it is nonlinear in the variables (Berry 1993). Such transformations allow the use of OLS linear estimation with a few complications that can be mitigated. Sometimes researchers will fit a variety of functional forms and test for improvements in overall model fit (Kleinbaum et al. 1998).
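For instance (hypothetical data), an exponential growth curve becomes linear in its parameters after a log transformation of the outcome, so OLS can fit it:

```python
import numpy as np

# Hypothetical exponential growth: y = 2 * exp(0.5 * x)
x = np.linspace(0., 4., 9)
y = 2.0 * np.exp(0.5 * x)

# Taking logs yields log(y) = log(2) + 0.5 * x: nonlinear in the
# variable y, but linear in the parameters, so OLS applies
slope, intercept = np.polyfit(x, np.log(y), deg=1)
# slope recovers 0.5; exp(intercept) recovers 2
```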
The Generalized Linear Model
Variable transformations are an important feature of the generalized linear model (GLM), a generalization of the standard linear regression model to cover many different contexts, including curvilinear models and some, but not all, nonlinear models. There are two components of the GLM: a link function transforming variables so that the dependent variable becomes a linear function of the independent variables; and an error distribution reflecting the assumed behavior of a model’s prediction errors (Dunteman & Ho 2006). The standard linear regression model is just a special case of the GLM, in which the link function is the identity, since there is no transformation of the dependent variable, and in which errors are assumed to be normal. Other GLMs rely on different link functions and error distributions, enabling researchers to capture a wide variety of functions.
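As a small illustration of link functions (a sketch, not a full GLM fit): the logit link maps a probability bounded in (0, 1) onto the whole real line, where it can be modeled as a linear function of predictors, and its inverse maps predictions back:

```python
import numpy as np

def logit(p):
    """Link: map a probability in (0, 1) to the whole real line."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """Inverse link: map a linear prediction back to a probability."""
    return 1 / (1 + np.exp(-z))

# Round-tripping through the link and its inverse recovers the mean,
# which is what lets a bounded outcome be modeled linearly
p = np.array([0.1, 0.5, 0.9])
recovered = inv_logit(logit(p))
```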
Analysis Of Variance Models
Another special case of the GLM, which also assumes the identity link and the normal error distribution, is the analysis of variance (ANOVA). Commonly employed with planned experiments, ANOVA models differences among group means (Keppel & Wickens 2004). ANOVA is estimated with OLS and is so closely related to standard linear regression that almost any ANOVA design can be analyzed in a linear regression model (Allison 1999). Still, many find it easier to apply ANOVA to certain data, particularly those gathered in randomized experimental trials.
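The equivalence can be sketched with a hypothetical two-group experiment: a dummy-coded regression reproduces exactly what a one-way ANOVA estimates, the group means and their difference:

```python
import numpy as np

# Hypothetical two-condition experiment (0 = control, 1 = treatment)
scores = np.array([3., 4., 5., 6., 7., 8.])
group = np.array([0., 0., 0., 1., 1., 1.])

# Dummy-coded regression: the intercept is the control-group mean,
# and the slope is the treatment-minus-control mean difference
X = np.column_stack([np.ones_like(scores), group])
b, *_ = np.linalg.lstsq(X, scores, rcond=None)
```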
Nonlinear Models
Although many communication processes and effects can be well described using linear or curvilinear models, others are inherently nonlinear. For instance, research on the ways in which ideas diffuse or gather strength through communication over time strongly suggests that these processes are nonlinear. In some cases, the appropriate nonlinear model can be transformed and analyzed within the GLM by way of appropriate link functions and error distributions, but many others cannot.
Methods for estimating nonlinear models differ from those used in linear regression and ANOVA, where the loss function can be minimized analytically. Nonlinear estimation instead requires iterative methods: the researcher chooses “start values” for the model coefficients, the algorithm adjusts them to improve the loss (or likelihood) function, and the cycle repeats until stable estimates are reached. There are a number of such iterative methods, such as nonlinear least squares or maximum likelihood (ML), which estimates model parameters that maximize the probability of generating the observed data (Dunteman & Ho 2006).
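A sketch of iterative nonlinear estimation using SciPy (data and parameter values invented): an S-shaped diffusion curve is fit by nonlinear least squares, starting from rough start values and iterating toward stable estimates:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_curve(t, k, r, t0):
    """Hypothetical diffusion model: adoption grows to a ceiling k."""
    return k / (1 + np.exp(-r * (t - t0)))

# Simulated adoption data generated from known parameter values
t = np.linspace(0., 10., 21)
y = logistic_curve(t, 100.0, 1.2, 5.0)

# curve_fit iterates from the start values p0, adjusting the
# coefficients until the squared-error loss stops improving
params, _ = curve_fit(logistic_curve, t, y, p0=[80.0, 1.0, 4.0])
```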
Logit, Probit, And Poisson Models
Several widely used nonlinear models can be reformulated within the GLM and estimated using ML or other iterative methods. A number of these predict categorical or binary variables; for example, whether or not a person is able to recognize a particular news story. In such cases, OLS linear regression is problematic, since neither a linear relationship nor a normal distribution of prediction errors can be safely assumed (Hosmer & Lemeshow 2000). These issues can be addressed in the GLM by using a logit link function. The underlying relationship is an S-shaped curve in which the dependent variable rises monotonically toward an asymptote over the range of a predictor; the logit transforms this function into one that is linear in its parameters, with a binomial distribution modeling prediction errors. Closely related to the logit model are the logistic regression model, which predicts continuous outcomes ranging between a lower and an upper limit; multinomial or polytomous logits, for nominal dependent variables with more than two categories; and ordered logit models, for ordinal data. Probit models are a slight variation that may be appropriate when the outcome represents a continuous latent construct rather than a truly binary variable. For count data, such as the number of times a particular event occurs, researchers might use Poisson regression.
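A sketch of ML estimation for a logit model (data invented), using the Newton-Raphson iterations that many statistical packages run under the hood:

```python
import numpy as np

# Hypothetical binary outcome: whether a person recognizes a news
# story (1) or not (0), as a function of hours of news exposure
x = np.array([0., 1., 2., 3., 4., 5., 6., 7.])
y = np.array([0., 0., 0., 1., 0., 1., 1., 1.])
X = np.column_stack([np.ones_like(x), x])

# Newton-Raphson: a fixed number of steps is enough here for the
# coefficient estimates to stabilize at the maximum likelihood
b = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ b))               # predicted probabilities
    grad = X.T @ (y - p)                       # gradient of log-likelihood
    hess = X.T @ (X * (p * (1 - p))[:, None])  # information matrix
    b = b + np.linalg.solve(hess, grad)

# b[1] > 0: recognition becomes more likely with greater exposure
```

At the converged estimates the gradient of the log-likelihood is (numerically) zero, which is the iterative analogue of the analytic minimization used in OLS.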
Nonlinear Regression
Most communication researchers rely on linear or curvilinear models estimated through standard linear regression, ANOVA, or nonlinear models such as logistic regression that can be handled within GLM. Additional nonlinear models, including many that cannot be transformed so that they are linear in the parameters, can also be stipulated and estimated using iterative, nonlinear optimization techniques specifying an appropriate loss function. David Fan’s ideodynamic model (1985) is one such model. It predicts opinion change in a particular population as a dynamic, nonlinear function of the amount of media information available, the “decay rate” of that information, the proportion of people already on different sides of an issue, and other factors.
References:
- Agresti, A. (1996). An introduction to categorical data analysis. New York: John Wiley.
- Allison, P. D. (1999). Multiple regression: A primer. Thousand Oaks, CA: Pine Forge Press.
- Berry, W. D. (1993). Understanding regression assumptions. London: Sage.
- Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences, 3rd edn. Hillsdale, NJ: Lawrence Erlbaum.
- Dunteman, G. H., & Ho, M. R. (2006). An introduction to generalized linear models. London: Sage.
- Fan, D. P. (1985). Ideodynamics: The kinetics of the evolution of ideas. Journal of Mathematical Sociology, 11, 1–23.
- Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression, 2nd edn. New York: John Wiley.
- Keppel, G., & Wickens, T. D. (2004). Design and analysis: A researcher’s handbook. London: Prentice Hall.
- Kleinbaum, D. G., Kupper, L. L., Muller, K. E., & Nizam, A. (1998). Applied regression analysis and other multivariable methods. Pacific Grove, CA: Duxbury Press.
- Lewis-Beck, M. S. (1980). Applied regression: An introduction. London: Sage.