5
Conclusions
The Committee on Equivalency and Linkage of Educational Tests was created to answer a relatively straightforward question: Is it feasible to establish an equivalency scale that would enable commercial and state tests to be linked to one another and to the National Assessment of Educational Progress (NAEP)? In this report we have attempted to answer this question by examining the fundamentals of tests and the nature of linking; reviewing the literature on linking, including previous attempts to link different tests; surveying the landscape of tests and testing programs in the United States; and looking at the unique characteristics and qualities of NAEP.
Factors that Affect the Validity of Links
Test Content
A test is a sample of a much larger, more complex body of content, a domain. Test developers must make choices about the knowledge, skills, and topics from the domain they want to emphasize. The choices are numerous in a vast domain like reading or mathematics, where there are differing opinions about what should be taught, how it should be taught, and how it should be tested. Therefore, two state tests labeled "4th-grade reading" may cover very different parts of the domain. One test might ask students to read simple passages and answer questions about the facts and vocabulary of what they read, thereby testing simple recall and comprehension; another test might ask students to read multiple texts and make inferences that relate them, thereby testing analytic and interpretive reading skills.
Tests with different content may measure different aspects of performance and may produce different rankings or score patterns among test takers. For example, students who have trouble with algebra—or who have not yet studied it in their mathematics classes—may do poorly on a mathematics test that places heavy emphasis on algebra. But these same students might earn a high score on a test that emphasizes computation, estimation, and number concepts, such as prime numbers and least common multiples. When content differences are significant, scores from one test provide poor estimates of scores on another test: any calculated linkage between them would have little practical meaning and would be misleading for many uses.
Test Format
Tests are becoming more varied in their formats. In addition to multiple-choice questions, many state assessments now include more open-ended questions that require students to develop their own responses, and some include performance items that ask students to demonstrate knowledge by performing a complex task. Computer-based testing is another alternative format that has gained in popularity in recent years. The effects of format differences on linkages are not always predictable, and they are sometimes large (see, e.g., Shavelson et al., 1992).
Measurement Error
Every test is only a sample of a person's performance. If a test taker took an equivalent, but not identical, test on a different day in a different place, her score would be unlikely to be the same. That is, a test score always has some margin of error (which testing professionals call the standard error of measurement). Measurement error plays a role in the interpretation and use of scores on linked tests. If test A, with a large margin of error, is linked with test B, which is much more precise, the score of a person who took test A still has the margin of error of test A, even when reported in terms of the scale of test B. Students and test users can be misled by this difference in precision. A short test with unreliable (i.e., less precise) scores can seem to have more precision than it actually has if it is reported on the scale of the more reliable test.
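The point about precision can be illustrated with a small numerical sketch. It assumes the classical test theory formula for the standard error of measurement (score standard deviation times the square root of one minus reliability) and a simple linear linking function; the report does not prescribe either choice, and all means, standard deviations, and reliabilities below are hypothetical values chosen for illustration.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)


def linear_link(score_a, mean_a, sd_a, mean_b, sd_b):
    """Map a test A score onto the test B scale by matching means and SDs."""
    return mean_b + (sd_b / sd_a) * (score_a - mean_a)


# Hypothetical values: test A is short and less reliable, test B is more reliable.
MEAN_A, SD_A, REL_A = 50.0, 10.0, 0.75
MEAN_B, SD_B, REL_B = 250.0, 50.0, 0.91

sem_a = standard_error_of_measurement(SD_A, REL_A)  # error on test A's own scale
sem_b = standard_error_of_measurement(SD_B, REL_B)  # error of a real test B score

# A linked score inherits test A's error, stretched by the slope of the link.
sem_a_on_b_scale = (SD_B / SD_A) * sem_a

linked_score = linear_link(60.0, MEAN_A, SD_A, MEAN_B, SD_B)

print(f"Linked score on B scale: {linked_score:.0f}")
print(f"Margin of error of the linked score: {sem_a_on_b_scale:.1f}")
print(f"Margin of error of a genuine test B score: {sem_b:.1f}")
```

Under these assumed values, the linked score looks like any other score on the test B scale, but its margin of error is noticeably larger than that of a score earned on test B itself.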
Test Uses and Consequences
Variations in how tests are used, especially their consequences, can affect the stability of linkages over time. Many states are using or planning to use tests for high-stakes decisions, such as determining graduation for students, compensation for teachers, and ratings for schools or districts (National Research Council, 1999c). In contrast, other assessments, like NAEP, often have lower stakes for test takers, with no important consequences for individuals or others. When the stakes are low for students, they may have little incentive to take the test seriously; when they have reason to worry about the consequences of their scores, they are usually more motivated to try harder. When stakes are high, teachers are likely to alter instruction to try to produce higher scores, through such strategies as focusing on the specific knowledge, skills, and formats in that particular test. The strengths and weaknesses of these and other test-based accountability practices are controversial, and they are not the subject of this report. The important point for this report is that when a high-stakes test is linked with a low-stakes test, the relative difficulty of the two tests is likely to change (i.e., the high-stakes test will appear to become easier as the curriculum becomes aligned with it), and this can affect the stability of a linkage over time.
Evaluating Linkages
All of these factors (content emphases, difficulty, format, measurement error, and uses and consequences) point to the difficulty of establishing trustworthy links among different tests. But the extent to which any of these factors affects linkage can be determined only by a case-by-case evaluation of specific tests in a specific context. Developers of linkages should look carefully at the differences in content emphases, format, and intended uses of tests before deciding to link them. They should also set targets for the level of accuracy that will be required to support the intended uses of the linkage. Developers of linkages should also conduct empirical studies to determine the accuracy and stability of the linkage. In this report the committee suggests some criteria to be considered as part of this process. One noteworthy criterion is the similarity or dissimilarity of linkage functions developed using data from different subgroups (e.g., gender, ethnicity, race) of students.
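One way to examine the subgroup criterion is sketched below. It is only an illustration under stated assumptions: it uses a simple linear (mean and standard deviation matching) linking function, which is one of several possible methods and not one the committee prescribes, and the function names, subgroup labels, and score data are all hypothetical.

```python
from statistics import mean, pstdev


def fit_linear_link(scores_a, scores_b):
    """Return (slope, intercept) that maps test A onto the test B scale
    by matching the mean and standard deviation of the two score sets."""
    slope = pstdev(scores_b) / pstdev(scores_a)
    intercept = mean(scores_b) - slope * mean(scores_a)
    return slope, intercept


def max_link_gap(groups, score_points):
    """Largest disagreement between subgroup linkages at the given test A scores."""
    links = [fit_linear_link(a, b) for a, b in groups.values()]
    gap = 0.0
    for x in score_points:
        converted = [slope * x + intercept for slope, intercept in links]
        gap = max(gap, max(converted) - min(converted))
    return gap


# Hypothetical data: (test A scores, test B scores) for two subgroups.
groups = {
    "group_1": ([40, 50, 60], [200, 250, 300]),
    "group_2": ([40, 50, 60], [220, 250, 280]),
}

gap = max_link_gap(groups, score_points=[40, 50, 60])
print(f"Largest subgroup disagreement on the B scale: {gap:.1f}")
```

A gap that is large relative to the accuracy target set for the linkage would suggest that a single linking function does not serve all subgroups equally well.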
Finally, since linkage relationships can change relatively quickly, especially in high-stakes situations, developers need to continue to monitor linkages regularly to make necessary adjustments to the linking function over time. The research literature is rife with examples of linkages that looked good at first but failed to hold up over time.
NAEP Achievement Levels
Even if two or more tests satisfy the appropriate criteria and prove to be amenable to linkage, linking any or all of them to NAEP poses unique challenges. This is particularly true when the goal of the linkage is to report individual student scores in terms of the NAEP achievement levels—basic, proficient, advanced—established and defined by the National Assessment Governing Board. Problems arise for several reasons.
First, NAEP is designed to estimate and report distributions of student scores by state, region, or the nation as a whole, but it is not designed to report individual student scores. It uses a matrix sampling technique in which each student answers a relatively small number of items from the total set, and it then aggregates the students' scores in order to report group results. Such data are quite imprecise at the student level, and they are not well suited for use in standard procedures for linking individual scores (see, e.g., Beaton and Gonzalez, 1995). Most studies that have obtained links with the NAEP scale have prepared a test made from NAEP items, which was then given to individual students who had also taken the test being linked (see, e.g., Williams et al., 1995). Such NAEP stand-in tests must reflect the full content of the NAEP assessment and must also maintain the specific combination of item formats. They must also be administered in a way as nearly like the NAEP procedure as possible. Linking a test to a variant of NAEP that has a different mix of item formats, or a different balance of content, could produce a link whose validity is suspect (see, e.g., Linn et al., 1992).
Unique challenges arise in linking any other test with NAEP when the goal of the linkage is to report individual student scores in terms of the achievement levels. First, all test scores, including a NAEP score inferred from a linked test, have associated measurement error: even if a student took a different form of the same basic test, her score on that form might be somewhat higher or lower than the score she obtained on the form she actually took. The margin of error problem is not usually significant for students whose scores fall in the middle of an achievement category. It may be a problem, however, for students whose scores are near the border of two adjacent levels. Some of these students could easily deserve to be in an adjacent category. Every teacher knows that a high B and a low A could easily be reversed on another occasion. When NAEP estimates the proportion of students in each category for its reports, such potential classification errors are accounted for. If the linked test is not a close match to NAEP, the classification differences can be substantial. This challenge might be addressed through a special administration of a longer version of NAEP, perhaps by testing students with many more items than they complete in a standard NAEP assessment.
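The borderline problem can be made concrete with a rough sketch. It rests on a simplifying assumption that is not taken from the report: the score a student would earn on another form is treated as normally distributed around her observed score, with a spread equal to the standard error of measurement. The cut score and margin of error below are hypothetical and are not actual NAEP values.

```python
from statistics import NormalDist


def chance_of_other_side(observed_score, cut_score, sem):
    """Probability that a score on a parallel form would fall on the
    opposite side of the cut score from the observed score."""
    below = NormalDist(mu=observed_score, sigma=sem).cdf(cut_score)
    return below if observed_score >= cut_score else 1.0 - below


# Hypothetical cut score between two adjacent levels, and a hypothetical SEM.
CUT_SCORE = 300.0
SEM = 15.0

near_border = chance_of_other_side(303.0, CUT_SCORE, SEM)  # just above the cut
mid_level = chance_of_other_side(330.0, CUT_SCORE, SEM)    # two SEMs above the cut

print(f"Score just above the cut: {near_border:.0%} chance of the lower level")
print(f"Score well above the cut: {mid_level:.0%} chance of the lower level")
```

Under these assumptions a student just above the cut has a substantial chance of being placed in the lower level on another occasion, while a student well inside the level does not. A linked score with a larger margin of error widens the range of scores affected.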
Second, differences in formats or combinations of formats used in different tests are a special concern. Changing the proportion of multiple-choice items to constructed-response items could place a student in a different achievement level. Any special variant of NAEP designed for use in a linking study must maintain the mix of formats used in NAEP (as specified in the NAEP test specifications).
Overall, the committee urges caution in attempting to link achievement tests to NAEP and to report individual student scores on those tests in terms of the NAEP achievement levels.
Conclusions
Our findings, as summarized above, lead us to the following conclusions:
Comparing the full array of currently administered commercial and state achievement tests to one another, through the development of a single equivalency or linking scale, is not feasible.
Reporting individual student scores from the full array of state and commercial achievement tests on the NAEP scale and transforming individual scores on these various tests and assessments into the NAEP achievement levels are not feasible.
Under limited conditions it may be possible to calculate a linkage between two tests, but multiple factors affect the validity of inferences that may be drawn from the linked scores. These factors include the content, format, and margin of error of the tests; the intended and actual uses of the tests; and the consequences attached to the results of the tests. When tests differ on any of these factors, some limited interpretations may be defensible, while others would not.
Links between most existing tests and NAEP, for the purpose of reporting individual students' scores on the NAEP scale and in terms of the NAEP achievement levels, will be problematic. Unless the test to be linked to NAEP is very similar to NAEP in content, format, and uses, the resulting linkage could be unstable and potentially misleading. (The committee notes that it is theoretically possible to develop an expanded version of NAEP that could be used in conducting linkage experiments, which would make it possible to establish a basis for reporting achievement test scores in terms of the NAEP achievement levels. However, the few such efforts that have been made thus far have yielded limited and mixed results.)
The committee arrived at these conclusions notwithstanding the fact that we believe that the goal of bringing greater coherence to the reporting of student achievement data, without compromising the increasingly rich and innovative tapestry of tests in the United States today, is an understandable one. We respect both the judgments of states and districts that have produced the diverse array of tests and the desire for more information than current tests can provide. Furthermore, the committee was disposed, as are large segments of the measurement and educational policy communities, to seek a technological solution to the challenge of linking.
Future Research
Despite our pessimism, we believe there are a number of areas where further research could prove fruitful and could help advance the idea of linkage of educational tests. First, we suggest research on the criteria for evaluating the quality of linkages. In its deliberations, the committee identified several such criteria, but we were unable to determine which were the most critical, and we cannot claim to have developed an exhaustive or definitive set of criteria. Additional study, for example on methods for assessing content congruence, could prove beneficial. The work of Kenney and Silver (1997) and of Bond and Jaeger (1993) represents important approaches to the problem. These researchers had to invent methods for establishing the extent to which test contents match; those methods need additional research and development, especially with respect to providing quantitative estimates of congruence that could be used in evaluating (predicting) the validity of proposed linkages.
Second, we suggest further research to determine the level of precision needed to make valid inferences about linked tests. We know that two tests that are built to different content frameworks, or to different test specifications, are looking at the test taker in two different ways. Each perspective may yield valid information, although not the same information. How important are the differences? Are they so minor that the differences can be overlooked? Are the biases sufficiently large to lead to misleading interpretations, or are they so small that they are inconsequential, although statistically detectable? And how can one determine what is "consequential"? What kind of guidelines do policy makers need in order to determine an acceptable level of error? In addressing these questions, the research community could make an important contribution to the policy debate by focusing on the marginal decrements in validity or precision of inferences that can be attributed to linkage, independent of the imprecision or invalidity attributable to the tests themselves. More research on methods of assessing the quality of linked assessment information would go a long way in making these important judgments.
Finally, we urge further research on the reporting of linked assessment information. The committee found that one way of reporting a student's performance in terms of NAEP achievement levels is to state that, among 100 students who performed at the same level as the student (call her Sally), 10 are likely to be in the below basic category; 60 are likely to be basic; 28 are likely to be proficient; and 2 are likely to be in the highest, or advanced, category.
While such information may be statistically valid, its utility is questionable. More research might point to ways in which reports from linking tests could provide information that is useful to students, parents, teachers, administrators, and policy makers.