In psychology, psychometric tests are standardized data collection methods. To provide meaningful and interpretable results in an empirical study, a test must meet specific requirements that are laid down by test theory. Only if these preconditions are met can reliable and valid conclusions be drawn with respect to the “real” value of the trait or state under measurement. For standardized tests, the raw score obtained during administration is transformed into a test score (e.g., an IQ or an anxiety score), which represents the intensity or strength of the measured trait or state on an interval scale.
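As a brief illustration of such a transformation (the formula is a standard convention, not spelled out in the text above): on the widespread IQ metric the norm mean is set to 100 and the standard deviation to 15, so a raw score x is first standardized against the norm sample and then rescaled:

```latex
z = \frac{x - \bar{x}}{s}, \qquad \mathrm{IQ} = 100 + 15\,z
```

where \bar{x} and s denote the mean and standard deviation of the raw scores in the norm sample.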
Classical test theory is influenced by measurement theory in physics. It is based on the assumption that every obtained test value consists of a “true” value (i.e., the “real” intensity of the person’s trait or state) plus an error value. The error part of measured values should be minimized or at least be known in size. To this end, classical test theory assumes that error values balance out over repeated testing, that the error value is independent of the strength or intensity of the personal state or trait (the “real” value), and that the error values of separate test administrations are independent of each other.
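In the usual notation (a standard formalization of these assumptions, not given explicitly in the text above), an observed value X decomposes into a true value T and an error E:

```latex
X = T + E, \qquad \mathbb{E}(E) = 0, \qquad \mathrm{Cov}(T, E) = 0, \qquad \mathrm{Cov}(E_1, E_2) = 0
```

Here \mathbb{E}(E) = 0 expresses that errors balance out over repeated testing, \mathrm{Cov}(T, E) = 0 that the error is independent of the “real” value, and \mathrm{Cov}(E_1, E_2) = 0 that the errors of two separate administrations are independent of each other.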
The goodness of a test is indicated by three criteria: objectivity, reliability, and validity. A test is objective if different testers obtain identical results with respect to application, data analysis, and interpretation. Ideally, the objectivity coefficient of a test is as high as 1.0. Objectivity during application can be increased by providing a standardized instruction. Objectivity of data analysis mainly depends on item construction: objectivity is higher if a person chooses answers from a set of alternatives (i.e., multiple-choice items). Objectivity of interpretation is high if conclusions about the measured trait or state can be drawn directly from a person’s test score.
Reliability is an indicator of the exactness of measurement: lower reliability corresponds to larger error values during testing. For tests, a reliability greater than 0.80 should be aimed at. Reliability can be quantified in three different ways. First, the test can be administered twice; the correlation between the values of the first and second measurements is an indicator of retest stability. Second, reliability can be determined by administering two parallel versions of a test and computing the correlation coefficient between them, which requires an additional pool of homogeneous test items. Third, reliability can be estimated by computing internal consistency. For that, the test items are divided into two halves (at random or by odd and even item numbers), and reliability is indicated by the correlation between the two halves. A better estimate of internal consistency is provided by the Kuder–Richardson formula or Cronbach’s alpha. Whereas internal consistency should always be as high as possible, a high stability value is reasonable only for traits that are invariant over time. With respect to (variable) states, the parallel-test method is to be preferred.
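A minimal sketch of how Cronbach’s alpha can be computed from a persons-by-items score matrix (the function name and data are hypothetical; the formula itself is the standard one):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a persons x items matrix of item scores."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 persons answering 4 items on a 1-5 rating scale
scores = np.array([
    [4, 5, 4, 5],
    [2, 3, 2, 3],
    [5, 4, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
])
print(round(cronbach_alpha(scores), 2))  # high alpha: items measure consistently
```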
A validity score indicates to what extent a test measures the aspect it is supposed to measure. Criterion validity is based on the correlation between test scores and an external criterion that represents the trait or state as closely as possible. For example, if a test is constructed to measure viewers’ preferences for film genres, subjects can be asked to record all programs they have seen on TV in a media diary. The preferences subjects specified by answering the test questions can then be related to the diary data. However, it is not always possible to define an unambiguous criterion for the trait under investigation; there is, for example, no single criterion for human intelligence. In such a case, measurement with an intelligence test is integrated into a network of other variables related to intelligence (e.g., other intelligence tests, grades, teachers’ judgments, problem-solving behavior, etc.). Within this network of interdependencies, the construct of intelligence can be validated (i.e., construct validity).
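A minimal sketch of a criterion validity check for the film-genre example, correlating test scores with diary counts (all values hypothetical):

```python
import numpy as np

# Hypothetical data: each viewer's genre-preference score from the test,
# and the number of programs of that genre recorded in the viewer's media diary.
test_scores = np.array([12, 18, 7, 22, 15, 9])
diary_counts = np.array([3, 6, 1, 8, 5, 2])

# Criterion validity: correlation between test score and external criterion.
r = np.corrcoef(test_scores, diary_counts)[0, 1]
print(f"criterion validity r = {r:.2f}")
```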
Item response theory differs from classical test theory in that it is not based on the distinction between “true” and error values. Instead, the personal trait to be measured is conceived of as a “latent dimension,” whose strength is indicated, with a certain probability, by the scored test items. Item difficulty and the person’s ability jointly determine the probability that the person will solve a specific test item: the easier an item and the more able a person, the higher the probability that this person will solve the item. Based on Rasch’s (1980) logistic model, item difficulty and personal ability can be estimated independently of each other. Consequently, in contrast to classical test theory, the assumptions underlying a test can be tested empirically.
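A minimal sketch of Rasch’s logistic model (the function name and example values are illustrative): the probability that a person with ability theta solves an item with difficulty b is exp(theta − b) / (1 + exp(theta − b)).

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of solving an item under Rasch's logistic model.

    theta: person ability, b: item difficulty (both on the same latent scale).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# The easier the item (lower b) and the abler the person (higher theta),
# the higher the probability of a correct response.
print(rasch_probability(theta=1.0, b=-0.5))   # able person, easy item: ~0.82
print(rasch_probability(theta=-1.0, b=0.5))   # less able person, hard item: ~0.18
```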
Computing coefficients for reliability or validity is not reasonable within the model of item response theory, because there the accuracy of measurement and the person’s ability are not independent of each other. Within the item response approach, tests may be developed for personal traits or states that have been studied intensively in other research and for which the measurement items can be formulated precisely (Bortz 1984, 143). For developing new methods for the assessment of personal variables, however, a test in the framework of classical test theory is to be preferred, which in addition saves time and effort.
References:
- Allen, M. J., & Yen, W. M. (2002). Introduction to measurement theory. Long Grove, IL: Waveland Press.
- Bortz, J. (1984). Lehrbuch der empirischen Forschung [Textbook of empirical research]. Berlin: Springer.
- De Boeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer.
- Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago, IL: University of Chicago Press.