3
Tests as Performance Measures
As Chapter 2 discusses, the performance measures that are used with
incentives are critically important in determining how incentives
operate. Specifically, performance measures need to be aligned
with the desired outcomes so that behavior that increases the measures
also increases the desired outcomes. In this chapter we look at the use of
tests as performance measures for incentive systems in education.
We have noted above that tests fall short as a complete measure of
desired educational outcomes. Most obviously, the typical tests of academic
subjects that are used in test-based accountability provide direct
measures of performance only in the tested subjects and grade levels. In
addition, less tangible characteristics (such as curiosity, persistence,
collaboration, or socialization) are not tested. Nor are subsequent
achievements, such as success in work, civic, or personal life, which are examples
of the long-term outcomes that education aims to improve.
In this chapter, we turn to some important limitations of tests that are
not obvious, specifically the ways they fall short in providing a direct
measure of performance even in the tested subjects and grades. We begin
by looking at an essential characteristic of tests themselves and then
turn to review the ways that test results can be turned into performance
measures that can be used with incentives. Finally, we look at the use of
multiple measures in incentive systems in which there is an attempt to
overcome the limitations of any single measure by using a set of complementary
measures.
38 INCENTIVES AND TEST-BASED ACCOUNTABILITY IN EDUCATION
TESTS AS ESTIMATES FROM A SUBSET OF A DOMAIN
Although large-scale tests can provide a relatively objective and efficient
way to gauge the most valued aspects of student achievement, they
are neither perfect nor comprehensive measures. Many policy makers in
education are familiar with the concept of test reliability and understand
that the test score for an individual is measured with uncertainty. Test
scores will typically differ from one occasion to another even when there
has been no change in a test taker’s proficiency because of chance differ-
ences in the interaction of the test questions, the test taker, and the testing
context. Researchers think of these fluctuations as measurement error and
so treat test results as estimates of test takers’ “true scores” and not as “the
truth” in an absolute sense.
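The idea of a score as an estimate can be illustrated with a short simulation. The sketch below assumes the simple model in which each observed score equals a fixed true score plus random error; the normal error distribution, the score scale, and the function name simulate_observed_scores are illustrative assumptions and are not drawn from any actual testing program.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, occasions, seed=0):
    """Return observed scores: a fixed true score plus random error.

    Assumption: measurement error is normally distributed with mean zero.
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(occasions)]


# Hypothetical test taker whose proficiency does not change across occasions.
scores = simulate_observed_scores(true_score=250.0, error_sd=10.0, occasions=1000)

# Individual scores fluctuate from occasion to occasion...
print("lowest and highest observed score:", round(min(scores), 1), round(max(scores), 1))
# ...while their average stays close to the underlying true score.
print("mean observed score:", round(statistics.mean(scores), 1))
```

Any single observed score in this sketch can differ noticeably from the true score even though proficiency is held constant, which is why a single result is best read as an estimate.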
In addition, tests are estimates in another way that has important
implications for the way they function when used as performance mea-
sures with incentives: they cover only a subset of the content domain that
is being tested. There are four key stages of selection and sampling that
occur when a large-scale testing program is created to test a particular
subject area. Each stage narrows the range of material that the test covers
(Koretz, 2002; Popham, 2000). First, the domain to be tested, when specifi-
cally defined, is typically only part of what might be reasonable to assess.
For example, there needs to be a decision about whether the material to
be tested in each grade and subject should include only material currently
taught in most schools in the state or whether it should include material
that people think should be taught in each grade and subject.
Second, the test maker crafts a framework that lists the content and
skills to be tested. For example, if history questions are to be part of the
eighth grade test, they might ask about names and the sequence of events
or they might ask students to relate such facts to abstractions, such as
rights and democracy. These decisions are partly influenced by practical
constraints. Some aspects of learning are more difficult or costly to assess
using standardized measures than others. In reading, for example, students’
general understanding of the main topic of a text is typically more
straightforward to assess than the extent to which a student has formed
connections among parts of the text or applied the text to other texts or
to real-world situations.
Third, the test maker develops specifications that dictate how many
test questions of certain types will constitute a test form. Such a docu-
ment describes the mix of item formats (such as multiple choice or short
answer), the distribution of test questions across different content and
skill areas (such as the number of test questions that will assess decimal
numbers or percentages), and whether additional tools will be allowed
(such as calculators or computers).
Fourth, specific test items (questions) are created to meet the test
specifications. After a set of test items of the correct types is created, the
items are pilot-tested with students to see whether they are at the appropriate
level of difficulty and are technically sound in other ways. On the
basis of the results of the pilot test and expert reviews, the best test items
are selected to be used on the final test. It is generally more difficult to
design items at higher levels of cognitive complexity and to have such
items survive pilot testing.
As a result of these necessary decisions about how to focus the content
and the types of questions, the resulting test will measure only a subset of
the domain being tested. Some material in the domain will be reflected in
the test and other material in the domain will not. If one imagines the full
range of material that might be appropriate to test for a particular subject—
such as eighth grade mathematics as it is taught in a particular state—then
the resulting test might include questions that reflect, for example, only
three-quarters of that material. The rest of the material—in this hypothetical
example, the remaining quarter of the subject that is excluded—would sim-
ply not be measured by the test, and this missing segment would typically
be the portion of the curriculum that deals with higher levels of cognitive
functioning and application of knowledge and skills.
Score Inflation
Although the example of a test covering only three-quarters of a
domain is hypothetical, it provides a useful way to think about what
can happen if instruction shifts to focus on test preparation in response
to test-based incentives. If teachers move from covering the full range of
material in eighth grade mathematics to focusing specifically on the por-
tion of the content standards included on the test, it is possible for test
scores to increase while learning in the untested portions of the subject
stays the same or even declines. That is, test preparation may improve
learning of the three-quarters of the domain that is included on the test
by increasing instruction time on that material, but that increase will
occur by reducing instruction time on the remaining one-quarter of the
material. The likely outcome is that performance on the untested material
will show less improvement or decrease, but this difference will be
invisible because the material is not covered by the test.
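The arithmetic behind this hypothetical example can be made concrete. In the sketch below, the three-quarters coverage figure comes from the example in the text, but the mastery percentages and the function name domain_mastery are invented purely for illustration. The test score is assumed to reflect only the tested portion of the domain.

```python
TESTED_SHARE = 0.75  # share of the domain covered by the test (from the hypothetical example)


def domain_mastery(tested_mastery, untested_mastery, tested_share=TESTED_SHARE):
    """Weighted mastery of the full domain; inputs are fractions between 0 and 1."""
    return tested_share * tested_mastery + (1.0 - tested_share) * untested_mastery


# Before test-focused instruction (illustrative numbers): equal mastery everywhere.
before_test_score = 0.60
before_domain = domain_mastery(tested_mastery=0.60, untested_mastery=0.60)

# After instruction time shifts toward tested material (illustrative numbers).
after_test_score = 0.70
after_domain = domain_mastery(tested_mastery=0.70, untested_mastery=0.30)

print("test score gain:", round(after_test_score - before_test_score, 3))
print("full-domain gain:", round(after_domain - before_domain, 3))
```

With these assumed numbers the test score rises by ten percentage points while mastery of the full domain does not change at all, because the loss on the untested quarter offsets the gain on the tested three-quarters.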
To this point, we have discussed problems with tests as accountability
measures even when best practices are followed. In addition, now that
tests are being widely used for high-stakes accountability, inappropriate
forms of test preparation are becoming more widespread and problematic
(Hamilton et al., 2007). Test results may become increasingly misleading
as measures of achievement in a domain when instruction is focused too
narrowly on the specific knowledge, skills, and test question formats that
are likely to appear on the test. Overly narrow instruction might include
such practices as drilling students on practice questions that were released
from prior years’ tests, focusing on the limited subset of skills, knowledge,
and question formats that are most likely to be tested, teaching test-taking
tricks (such as the process of elimination for multiple-choice items or
memorizing the two “common Pythagorean ratios” rather than learning
the Pythagorean theorem), or teaching students to answer open-ended
questions according to the test’s scoring rubric. When scores increase on
a test for which students have been “prepared” in these ways, it indicates
only that students have learned to correctly answer the specific kinds of
questions that are included on that particular test. It does not indicate that
students have also attained greater mastery of the broader domain
that the test is intended to represent (Koretz, 2002).
Changing teaching in at least some classrooms is one goal of test-based
incentives. Good test preparation is instruction that leads to students’
mastery of the full domain of knowledge and skills that a test is
intended to measure. This mastery will incidentally improve large-scale
test scores, but it will also be reflected elsewhere, for example, on other
tests and in the application of knowledge outside school.
It is an essential goal of education reform that instruction be tied to
the full set of intended learning goals, not just the tested sample of knowl-
edge, skills, and question formats. Bad or inappropriate test preparation
is instruction that leads to test score gains without increasing students’
mastery of the broader, intended domain, which can result from engaging
in the types of inappropriate strategies discussed above. These practices
are technically permissible and can even be appropriate to a limited
degree, but they will not necessarily help students understand the mate-
rial in a way that generalizes beyond the particular problems they have
practiced. Mastering content taught in test-like formats has been shown
not to generalize to mastery of the same content taught or tested in even
slightly different ways (Koretz et al., 1991). In this kind of situation, test
scores are likely to give an inflated picture of students’ understanding of
the broader domain.
If test score gains are meaningful, they must generalize to the intended
domain, and if they do, they should also generalize to a considerable
extent to other tests and nontest indicators of the same domain. For that
reason, trends in performance on the National Assessment of Educational
Progress (NAEP)—a broad assessment designed to reflect a national con-
sensus about important elements of the tested domains—are frequently
compared with trends on the tests that states use for accountability.
One study examined the extent to which the large performance gains
shown on the Kentucky Instructional Results Information System (KIRIS),
the state’s high-stakes test, represented real improvements in student
learning rather than inflation of scores (Koretz and Barron, 1998). The
study found evidence of score inflation. Even though KIRIS was designed
partially to reflect the frameworks of NAEP, very large and rapid KIRIS
gains in fourth grade reading from 1992 to 1994 were not matched by
gains in NAEP scores. Although large KIRIS gains in mathematics from
1992 to 1994 in the fourth and eighth grades were accompanied by gains
in NAEP scores, Kentucky’s NAEP gains were roughly one-fourth as
large as the KIRIS gains and were typical of gains shown in other states.
At the high school level, the large gains that students showed on KIRIS in
mathematics and reading were not reflected in their scores on the Ameri-
can College Testing (ACT) college admissions tests.1 A Texas study found
similar evidence of score inflation (Klein et al., 2000).
In a recent comparison of state test and NAEP results between 2003
and 2007, the Center on Education Policy (2008) found that trends in read-
ing and mathematics achievement on NAEP generally moved in the same
positive direction as trends on state tests, although gains on NAEP tended
to be smaller than those on state tests. The exception to the broad trend
of rising scores on both assessments occurred in eighth grade reading, in
which fewer states showed gains on NAEP than on state tests.
The average scores on state accountability tests tend to rise, some-
times dramatically, every year for the first 3 or 4 years of use and then
level off (Linn, 2000). When an existing test is then replaced with a new
test or test form, the scores on the new test rise while the scores on the old
test fall. Linn surmised that these initial gains reflect growing familiarity
with the specific format and content of the new test. This explanation was
supported by a study in which students were retested with an old test 4
years after a new test had been introduced in a large district (Koretz et
al., 1991): while students’ performance on the new test had increased,
their performance had dropped on the test no longer routinely used. This
result showed that the initial gains on the new test were specific to that
test and did not support a conclusion of improved learning in the subject
matter domain. A number of other studies provide persuasive evidence
that gains on high-stakes accountability tests do not always generalize
to other assessments given at approximately the same time in the same
subjects (Fuller et al., 2006; Ho and Haertel, 2006; Jacob, 2005, 2007; Klein
et al., 2000; Koretz and Barron, 1998; Lee, 2006; Linn and Dunbar, 1990).
There is also evidence that teachers themselves lack confidence in
the meaningfulness of the score gains in their own schools. A survey of
educators in Kentucky asked respondents how much each of seven fac-
tors had contributed to score gains in their schools (Koretz et al., 1996a).
Over half of the teachers said that “increased familiarity with KIRIS [the
state test]” and “work with practice tests and preparation materials” had
contributed a great deal. In contrast, only 16 percent reported that “broad
improvements in knowledge and skill” had contributed a great deal. Very
similar results were found in Maryland (Koretz et al., 1996b).
1 The two tests measure somewhat different constructs, but the overlap was sufficient that
one would expect a substantial echo of the KIRIS trends in ACT data.
Fundamentally, the score inflation that results from teaching to the
test is a problem with attaching incentives to performance measures that
do not fully reflect desired outcomes in a domain that is broader than the
test. It is unreasonable to implement incentives with narrow tests and
then criticize teachers for narrowing their instruction to match the tests.
When incentives are used, the performance measures need to be broad
enough to adequately align with desired outcomes. One route to doing
this is to use multiple measures, which we discuss later on in the chapter.
However, another important route to broadening the performance mea-
sures is to improve the tests themselves. Finally, given the inherent limits
in the breadth that can be achieved on tests, it is important to evaluate
test results for possible score inflation.2
Broadening Tests to Reflect the Domain of Interest
In an accountability context in which incentives have been attached to
the results, a test will not provide good information about students’
learning unless it samples well, in terms of both breadth and depth, from
the content that students have studied and asks questions in a variety of
ways to make sure that students’ performance covers the domain. The key
question is whether a test’s results can be generalized beyond that test.
In current practice, this concern is addressed in part by examining
the alignment of tests with content and performance standards. How-
ever, it is not enough to have the limited alignment obtained when test
publishers show that all of their multiple-choice items can be matched
somewhere within the categories of a state’s content standards (Shepard,
2003). Rather, what is needed is a more complete and substantive type of
alignment “that occurs when the tasks, problems, and projects in which
students are engaged represent the range and depth of what we say we
want students to understand and be able to do. Perhaps a better word
than alignment would be embodiment” (Shepard, 2003, p. 123). Shepard
goes on to warn that “when the conception of curriculum represented
by a state’s large-scale assessment is at odds with content standards and
curricular goals, then the ill effects of teaching to the external, high-stakes
test, especially curriculum distortion and nongeneralizable test score
gains, will be exaggerated” (p. 124). To the extent feasible, it is important
to broaden the range of material included on tests to better reflect the full
range of what we expect students to know and be able to do.
2 Such monitoring can be done by looking at low-stakes tests that are not attached to the
incentives. In addition, see the work by Koretz and Béguin (2010) on possibilities for designing
tests that include a component to self-monitor for score inflation.
In addition to broadening the range of material included on tests to
better reflect the content standards they are intended to measure, it is also
important to broaden the questions that are used to assess performance.
Currently, one can find many unnecessary recurrences in the characteristics
of many tests: unneeded similarities in content, format, other aspects
of presentation, and aspects of the responses demanded (Koretz, 2008a).
In some cases, one can find items that are near clones of items used in pre-
vious years, with only minor details changed. These unnecessary recur-
rences provide opportunities for coaching, and, indeed, test preparation
materials often focus on them. Reducing these recurrences would make
it harder to focus instructional time on tested details and thereby reduce
score inflation when incentives are attached to the tests.
CONSTRUCTING INDICATORS FROM TEST RESULTS
Incentives are rarely attached directly to individual test scores; rather,
they are usually attached to an indicator that summarizes those scores in
some way. The indicators that are constructed from test scores have a cru-
cial role in determining how the incentives operate. Different indicators
created from the same test can produce dramatically different incentives.
A choice of indicator is fundamentally a choice about what a policy
maker values and what pressures the policy maker wants to create by
the incentives of test-based accountability. Is the goal to affect particular
students, such as those who are high achievers, low achievers, or English
learners? Is the only goal to ensure that everyone reaches some minimum
performance level, or should progress that remains below the minimum, as
well as progress above the minimum, also be encouraged? It can be difficult
to talk about the trade-offs that these questions imply, but the indicators
used in test-based accountability implicitly include decisions about how
such trade-offs have been made.
For example, two commonly used ways of constructing indicators
from test scores—mean scores and minimum performance levels—result
in dramatically different incentives. A mean score places value on scores
at all levels of achievement: every student who improves raises the mean
and every student who declines lowers the mean. An incentive attached
to a mean score will focus efforts on the scores of students at any achieve-
ment level whose scores can be increased. In contrast, a performance stan-
dard for a specific minimum level of achievement focuses attention on the
scores near the cut score that represents the standard. When a standard
that defines a minimum performance level is set, efforts are focused on
raising the performance of students below the standard up to that level,
while keeping students just above it from falling below it. An incentive
attached to an indicator based on a minimum performance level will focus
instruction on students believed to be near the standard; there is no incen-
tive to improve the performance of students who are already well above
the standard or who are far below it.
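The contrast between these two indicators can be sketched with a small calculation. The scores, the cut score of 300, and the function names below are hypothetical and chosen only to show how the same test results respond differently under each indicator.

```python
from statistics import mean

CUT_SCORE = 300  # hypothetical minimum performance standard


def mean_score(scores):
    """Indicator 1: the average score, which responds to change at any level."""
    return mean(scores)


def percent_proficient(scores, cut=CUT_SCORE):
    """Indicator 2: percentage of students at or above the cut score."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)


baseline = [220, 260, 295, 305, 350]

# Scenario A: the lowest-scoring student gains 30 points but stays below the cut.
far_below_gain = [250, 260, 295, 305, 350]

# Scenario B: the student just below the cut gains 5 points and reaches it.
near_cut_gain = [220, 260, 300, 305, 350]

for label, scores in [("baseline", baseline),
                      ("A: far-below student +30", far_below_gain),
                      ("B: near-cut student +5", near_cut_gain)]:
    print(label, "| mean:", mean_score(scores),
          "| percent proficient:", percent_proficient(scores))
```

In this illustration the 30-point gain in scenario A moves the mean but leaves the percentage proficient unchanged, whereas the 5-point gain in scenario B barely moves the mean but raises the percentage proficient. An incentive tied to the second indicator therefore rewards only the second kind of gain.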
Research has demonstrated the effect that incentives based on per-
formance standards can have in focusing attention on students who are
near the standard. In a study that analyzed test scores before and after
the introduction of Chicago’s own accountability program in 1996, and
before and after the introduction of the No Child Left Behind (NCLB)
Act requirements in 2002 (Neal and Schanzenbach, 2010), the greatest
gains were shown by students in the middle deciles, particularly the
third and fourth. Little or no gain was shown in the top decile, and the
bottom two deciles showed no improvement or even a decline. A similar
pattern was found in Texas during the 1990s (Reback, 2008): “marginal”
students, meaning those on the cusp of passing or failing a state exam
used to judge the quality of schools, showed the greatest improvements
because the accountability system provided strong incentives for teachers
to focus on them. Two other studies (Booher-Jennings, 2005; Hamilton
et al., 2007) also found that teachers focused their efforts on students
near the proficiency cut score; teachers even reported being concerned
about the consequences of doing so for the instruction of high- and low-
achieving students.
Indicators based on performance standards were adopted to give
more interpretable summaries to policy makers and the public of how
groups of students are performing. There is some question whether the
use of performance standards actually accomplishes this goal of greater
interpretability. The simple performance labels that are shared across
many tests—basic, proficient, and advanced—mask substantial variation
within the categories. The reason for this variation is that standard-setting
is a judgmental process that can be affected by the particular process used,
the panelists who implement the process, and the political pressures that
may lead to adjustments for the final levels. Different standard-setting
methods often produce dramatically different results, as do different
groups of panelists (Buckendahl et al., 2002; Jaeger et al., 1980; Linn,
2003; Shepard, 1993). Despite improvements in standard setting methods
over time, performance standards vary greatly in rigor across the states
(McLaughlin et al., 2008). One prominent researcher concluded that this
variability is so great as to render characterizations such as “proficient”
meaningless—despite the rhetorical power of such terms (Linn, 2003). In
any case, it is important to realize that the use of performance standards
has additional implications when incentives are attached to indicators
that are based on those performance standards.
Another basic difference in types of indicators is the contrast between
indicators that look at the levels of test scores and indicators that look
at changes in those levels. There are several different ways of constructing
indicators that look at test score changes: cohort-to-cohort changes,
growth models, and value-added models.
Cohort-to-Cohort Changes. Some indicators of change look at the
test scores of successive cohorts of students in a particular grade to see
if the performance of the students in that grade is improving over time.
NCLB includes an indicator based on this kind of cohort-to-cohort change
in its “safe harbor” provision, which gives credit to schools that have
sufficiently improved the percentage of students meeting the proficient
performance standard in successive years, even if the percentage does not
yet meet the state’s target for that year (Center on Education Policy, 2003).
Growth Models. Some indicators of change look at the growth paths
of individual students using longitudinal data that has multiple test
scores for each student over time (see, e.g., Raudenbush, 2004). An indi-
cator based on growth can adjust for the point at which students start
in each grade and focus on how much they are progressing in that year.
Growth models are technically challenging, both because there are difficulties
in linking scores from year to year (especially when students
change schools), though many states are making substantial progress,
and because the models may require tests that are linked from grade
to grade, which is difficult to do (Doran and Cohen, 2005; Michaelides
and Haertel, 2004). Researchers have proposed an approach to modeling
growth that would structure both instruction and tests around “learning
progressions” that describe learning in terms of conceptual milestones in
each subject (National Research Council, 2006b), but such an approach is
not yet common.
Value-Added Models. There has been widespread interest in a special
type of growth model that attempts to control statistically for differences
across students to make it possible to quantify the portions of student
growth that are due to schools or teachers. The appeal of indicators based
on value-added models is the promise that they could be used to fairly
compare the effectiveness of different schools and teachers, despite the
substantial differences in the types of students at different schools and the
factors that determine how students are assigned to teachers and schools.
This is an active area of research, but the extent to which value-added
models can realize their promise has not yet been determined (see, e.g.,
Braun, 2005; National Research Council, 2010).
These different ways for deriving indicators from changes in test
scores focus on different questions and so can be used to provide different
incentives when consequences are attached to the indicators. Cohort-to-
cohort change indicators look at the change in successive cohorts and may
be especially useful during periods of reform when schools and teachers
are making substantial changes over a short period of time. In periods
when the education system is relatively stable, there is no reason to expect
cohort-to-cohort indicators to show any change. Growth indicators look at
changes for individual students and provide a way of isolating the learning
that occurs in a given year. Because one always expects students to be
learning—whether there is education reform or not—growth models need
to provide some sort of target to indicate what level of annual change is
appropriate. Indicators of growth based on learning progressions offer
a way to do this that is tied to the curriculum in a meaningful way, but
the necessary curricula and tests for this approach have not yet been
developed. Finally, indicators based on value added expand the focus
beyond student learning to the contributions of their teachers or schools
to learning, with the attempt to identify the portion of learning that can
be attributed to a teacher or school. As with growth models in general,
value-added models have no natural metric that defines how much value
added is appropriate or to be expected. These models have been used
to look at the distribution of results for different teachers and schools to
identify those that are apparently more effective or less effective in raising
test scores, as well as possible mechanisms for increasing effectiveness.
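The difference between a cohort-to-cohort indicator and a growth indicator can be sketched as follows. All scores and student identifiers are invented, the function names are hypothetical, and the growth calculation assumes that scores from adjacent grades sit on a common linked scale, which the text notes is difficult to achieve in practice. Value-added models require statistical controls beyond this sketch and are not shown.

```python
from statistics import mean


def cohort_to_cohort_change(previous_cohort, current_cohort):
    """Change in mean score between successive cohorts in the same grade."""
    return mean(current_cohort) - mean(previous_cohort)


def mean_individual_growth(prior_scores, current_scores):
    """Average one-year gain for students who have scores in both years.

    Both arguments are dicts mapping a student id to a score; scores are
    assumed to be on a scale linked across grades.
    """
    matched = prior_scores.keys() & current_scores.keys()
    return mean(current_scores[s] - prior_scores[s] for s in matched)


# Fourth grade scores for last year's and this year's cohorts (different students).
grade4_last_year = [240, 250, 260]
grade4_this_year = [238, 252, 260]

# The same three students followed from third grade to fourth grade.
grade3_prior = {"a": 225, "b": 240, "c": 246}
grade4_current = {"a": 238, "b": 252, "c": 260}

print("cohort-to-cohort change:", cohort_to_cohort_change(grade4_last_year, grade4_this_year))
print("mean individual growth:", mean_individual_growth(grade3_prior, grade4_current))
```

In this stable hypothetical system the cohort-to-cohort indicator shows no change even though every student is learning, which is why a growth indicator needs a separate target to say how much annual gain is appropriate.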
We also note the use of subgroup indicators, which have been an
important part of the test-based accountability structure of NCLB. If there
is concern that group measures may systematically mask the performance
of different subgroups, then it is possible to calculate an indicator using
the test scores of different subgroups of students rather than the test
scores for the entire student population. Attaching incentives to indicators
of test results for different subgroups focuses attention on how each
of those subgroups is doing.
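A brief sketch shows how an overall indicator can mask subgroup differences. The subgroup labels, scores, cut score, and function names are hypothetical and are used only to illustrate the calculation.

```python
def percent_proficient(scores, cut):
    """Percentage of scores at or above the cut score."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)


def subgroup_indicators(records, cut):
    """Compute percent proficient separately for each subgroup.

    records is a list of (subgroup_label, score) pairs.
    """
    groups = {}
    for label, score in records:
        groups.setdefault(label, []).append(score)
    return {label: percent_proficient(scores, cut) for label, scores in groups.items()}


# Invented records for two subgroups, labeled "A" and "B".
records = [("A", 310), ("A", 320), ("A", 305), ("A", 290),
           ("B", 280), ("B", 310), ("B", 270), ("B", 260)]
cut = 300

overall = percent_proficient([score for _, score in records], cut)
by_group = subgroup_indicators(records, cut)

print("overall percent proficient:", overall)
print("percent proficient by subgroup:", by_group)
```

Here the overall figure sits midway between two very different subgroup results, so an incentive attached only to the overall figure would not direct attention to the lower-performing subgroup.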
In summary, different indicators constructed from the same test can
provide very different types of information and very different pressures
for change when incentives are attached to them. When choosing an
indicator, it is necessary for policy makers to think carefully about the
changes they want to bring about, the actions that would bring about
those changes, and the people who could perform those actions. The
answers to these questions must guide the aggregation of students’ scores
into indicators so that the indicators highlight useful information that can
help bring about the desired changes.
Each type of indicator also brings its own technical challenges, which
may limit its ability to provide information that is fair, reliable, and valid.
It is important to address these technical issues, and we have mentioned
some of them briefly in our discussion. However, the message from our
review of the research—an assessment of the big picture about the use of
test-based incentives—is that different indicators result in very different
incentives. Consequently, it is important for policy makers to fully consider
possible indicators when they are designing a system of test-based
accountability.
MULTIPLE MEASURES
The tests that are typically used to measure performance in educa-
tion fall short of providing a complete measure of desired educational
outcomes in many ways. In addition, the indicators constructed from
tests highlight particular types of information. Given the broad outcomes
people want and expect for education, the necessarily limited coverage of
tests, and the ways that indicators constructed from tests focus on particu-
lar types of information, it is prudent to consider designing an incentives
system that uses multiple performance measures.
One of the basic research findings detailed in Chapter 2 is the impor-
tance of aligning performance measures with desired outcomes. As we
note in that chapter, incentive systems in other sectors tend to evolve
toward using increasing numbers of performance measures as experience
with the limitations of particular performance measures accumulates.
This evolution can be viewed as a search for a set of performance
measures that better covers the full range of desired outcomes and also
monitors behavior that merely inflates the measured performance without
actually improving outcomes. In this section we discuss the use of multiple
performance measures in education.
Professional standards for educational testing and guidelines for
using tests emphasize that important decisions should not be made on the
basis of a single test score and that other relevant information should be
taken into account (American Educational Research Association, American
Psychological Association, and National Council on Measurement in
Education, 1999; National Research Council, 1999). Adding information
about student performance from other sources can enhance the validity
and reliability of decisions. This standard was originally conceived with
individual students in mind, cognizant of the fact that tests are only samples
of what students know and can do. For example, when a student fails
a high school exit exam, taking into account other test scores or samples
of the student’s work can guard against denying a diploma to someone
who really has mastered the requisite material.
As the consequences of testing have become more serious for entire
schools, education stakeholders are increasingly advocating the use
of multiple measures for school accountability to help guard against
wrongly identifying schools as failing or needing intervention. Adopting
appropriate multiple measures is a design choice that satisfies professional
standards and can offer a better representation of the full range
of educational goals. Given the context of our focus on incentives, we are
particularly interested in the possibility that a set of multiple measures
may better reflect education goals and so can provide better incentives
when consequences are attached to those measures.
“Multiple measures” is often used loosely and can refer to many dif-
ferent things. Sometimes it is used to mean multiple opportunities with
the same measure: for example, in many states, students are allowed to
retake the high school exit exam until they pass. In our discussion here,
we specifically exclude this interpretation because repeated opportunities
to take the same test do not broaden the performance measure to better
reflect our goals. Rather, we focus on two other meanings of the term. One
is the use of more than one indicator of a student’s performance in one
subject area, such as by using both standardized test scores and teachers’
judgments to determine a student’s level of mathematics achievement.
The second meaning is assessing student achievement in multiple sub-
jects, such as reading, writing, mathematics, and science, and combining
indicators across domains. In both kinds of multiple measures, indicators
can be combined in a conjunctive or compensatory fashion, each of which
has implications when consequences are attached, as discussed below.
Conjunctive Models
Conjunctive models combine indicators but do not allow high perfor-
mance on one measure to compensate for low performance on another.
For example, NCLB uses a conjunctive or multiple-hurdle model. In order
to make adequate yearly progress, a school must meet each of a number
of conditions. The first is that 95 percent of students in each numerically
significant subgroup must be tested. Then, all students, as well as all
subgroups, must meet targets for percentage proficient. In addition, there are
targets for attendance and graduation rates. This combination of measures
is used to determine whether schools are making adequate yearly progress,
with consequences if they are not. The consequences attached to this
conjunctive system of measures send the message that each indicator is
important and schools are expected to meet each target. The result is that
there is only one way to pass—to meet all of the requirements—and many
ways to fail. For example, with NCLB, a school may have excellent test
scores, but a shortfall in attendance would still cause the school to fail to
make adequate yearly progress (Linn, 2007). With multiple ways to fail,
the consequences in this system focus attention on the areas that are in
danger of not meeting their targets.
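The multiple-hurdle logic can be sketched in a few lines of Python. This is only an illustration of the conjunctive rule, not NCLB's actual computation. The 95 percent participation target comes from the text, but the `Indicator` class, the other indicator names, and all other values and targets are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Indicator:
    name: str
    value: float   # observed value for the school (hypothetical data)
    target: float  # target the school must meet on this indicator


def conjunctive_pass(indicators):
    """Pass only if every indicator meets its own target.

    Returns (passed, failed_names) so the shortfalls are visible.
    """
    failed = [ind.name for ind in indicators if ind.value < ind.target]
    return (not failed, failed)


# Hypothetical school: strong test results, but attendance falls short.
school = [
    Indicator("participation_rate", 97.0, 95.0),
    Indicator("percent_proficient_all", 88.0, 70.0),
    Indicator("attendance_rate", 89.0, 93.0),
]

passed, failed = conjunctive_pass(school)
print(passed, failed)
```

In this sketch the school fails because of the single attendance shortfall, however high its other values are. This is the sense in which there is one way to pass and many ways to fail.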
Compensatory Models
In contrast to conjunctive models, compensatory models combine
multiple indicators so that a low score in one area can be offset by a high
score in another. This produces an overall picture of whether performance
targets are being achieved, across the range of areas, but it does not
require that each of the individual targets is achieved. Attaching conse-
quences to a system of multiple measures using a compensatory model
provides incentives to improve overall performance; the consequences in
this system focus attention on the areas where there are the most oppor-
tunities for improvement, not areas that are most in danger of failing to
meet their individual targets, because there are no individual targets.
Compensatory incentives are appropriate in cases where policy makers
want to ensure overall performance levels across a number of areas but
not where they have individual targets for each of those areas that they
view as critical.
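A compensatory rule can be sketched as a weighted average compared against a single overall target. The function names, subjects, scores, equal weights, and targets below are all hypothetical and chosen only to illustrate the offsetting behavior; actual state systems define their own indicators and weights.

```python
def compensatory_index(scores, weights):
    """Weighted average of the indicators; no indicator has its own target."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def compensatory_pass(scores, weights, overall_target):
    """Pass if the overall index meets the single overall target."""
    return compensatory_index(scores, weights) >= overall_target


# Hypothetical school: a low reading score is offset by a high mathematics score.
scores = {"reading": 62.0, "mathematics": 90.0, "science": 76.0}
weights = {"reading": 1.0, "mathematics": 1.0, "science": 1.0}

index = compensatory_index(scores, weights)
print(index, compensatory_pass(scores, weights, overall_target=75.0))
```

Here the school meets an overall target of 75 even though its reading score alone is below 75. Under a rule that set a separate target of 75 for each subject, the same school would not pass.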
In Ohio, for example, the system involves four indicators that are
combined in a compensatory way to classify its districts and schools
into five categories of performance. The four indicators are (1) the per-
formance indicators for each grade and subject area (reading, writing,
mathematics, science, and social studies); (2) a performance index that
is a composite score based on all tested grades and subjects, weighted so
that scores above proficient count more than those below proficient; (3) a
growth calculation; and (4) adequate yearly progress under NCLB. Each
indicator uses scores from the statewide testing program, and two of the
indicators also consider attendance and graduation rates. The way the
four different indicators complement each other to produce an aggregate
measure has been described by one expert (Chester, 2005) as better than
any single measure in capturing the varied outcomes that the state wants
to monitor and encourage. For example, rather than viewing NCLB’s
measure of adequate yearly progress as a substitute for the state’s entire
system, Ohio understood that the measure provides crucial monitoring
of subgroup performance that had previously been lacking in its
system. Thus the adequate yearly progress indicator provides important
additional information on the overall performance of the schools in the
states, even though it fails to capture crucial information—about other
subject areas, different levels of performance, and growth—that the other
indicators in the system provide.
In cases where compensatory systems bring together different independent
measures, they can have greater reliability than conjunctive systems
in a statistical sense because information about the overall performance
accumulates across indicators, and the random fluctuations that
affect any single indicator tend to offset each other; a chance positive on
one indicator can be offset by a chance negative on another, but
information about performance is present in all indicators (Chester, 2005;
Linn, 2007).
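The statistical point can be made concrete under a simplifying assumption that is ours, not the cited authors': suppose each of k indicators equals a school's true performance T plus an independent random error with mean zero and common variance sigma squared. The symbols X_i, T, e_i, sigma, and k are introduced here only for illustration.

```latex
X_i = T + e_i, \qquad \mathbb{E}[e_i] = 0, \qquad \operatorname{Var}(e_i) = \sigma^2, \qquad i = 1, \dots, k

\bar{X} = \frac{1}{k} \sum_{i=1}^{k} X_i = T + \frac{1}{k} \sum_{i=1}^{k} e_i

\operatorname{Var}(\bar{X} \mid T) = \frac{1}{k^2} \sum_{i=1}^{k} \operatorname{Var}(e_i) = \frac{\sigma^2}{k}
```

Under these assumptions the error variance of the average shrinks in proportion to the number of indicators, while the signal T is carried in full. Real indicators are rarely fully independent or equally noisy, so the actual gain in reliability would typically be smaller than this idealized case suggests.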
Compensatory systems can combine indicators either in a single sub-
ject area or across subject areas. Each version can be appropriate for some
objectives and inappropriate for others.
The structure of high school exit exams in many states provides an
example of the use of compensatory measures within a single subject area.
Although people commonly think of high school exit exam requirements
as requiring students to pass a single test, the actual requirement in many
states involves additional routes to meeting the target. These multiple
routes effectively create a compensatory system of multiple measures. In
2006, 16 of the 25 states with exit exams had policies in place for an alter-
nate route to a diploma for students who could not pass the exams, yet
had adequate attendance records and grades (Center on Education Policy,
2006b). For example, in a number of states students can use course grades,
a collection of classroom work, or the results from a different test in the
subject—such as an AP test—to make up for a failure to pass a subject on
the state’s high school exam. In states that allow these multiple routes, the
high school exit exam requirement provides an overall incentive to meet
the requirement but not to pass the test itself.
Similarly, there are examples of incentive systems that use compen-
satory models across subject areas. For example, at the individual level,
Maryland’s high school exit exam uses an overall score that combines
results for different subjects (Center on Education Policy, 2005). At the
state level, California’s accountability program uses an academic perfor-
mance index that combines indicators from four different tests: the state’s
standards-referenced test, a norm-referenced test, an alternate test, and
the state’s high school exit exam. The tested subjects are English,
mathematics, history/social science, and science. The indicators are weighted
on a scale that was determined by the state board of education and
combined to give a final metaindicator of school performance (California
Department of Education, 2011).
The essential principle in using a compensatory system of multiple
measures is that attaching consequences to an overall compensatory index
focuses the incentives at an overall level that uses a broader performance
measure than any one measure alone. If the compensatory system is used
for multiple indicators within a single subject area, then incentives will
focus attention more broadly across the full range of the subject than a
single test would. If the compensatory system is used for multiple
indicators across subject areas, then incentives will focus attention across the full
range of subject areas. In both cases, there are no targets for the individual
measures—which means no targets on the individual tests when com-
pensatory measures are used within a single subject area and no targets
on the individual subjects when compensatory measures are used across
subject areas. Attaching incentives to a compensatory system of multiple
measures within a subject area may be appropriate for a subject area that
is critical where there is concern about the necessarily limited coverage
of each of the available measures. Attaching incentives to a compensa-
tory system of multiple measures across subject areas may be appropriate
where there is more concern about tracking overall performance and less
concern about the relative performance in particular subject areas.
An Alternative Approach to Multiple Measures:
Using Test Scores as a Trigger
Another possible approach is to use large-scale test scores as a trigger
for a more in-depth evaluation, as proposed by Linn (2008). Under such a
system, teachers or schools with low scores on standardized tests would
not be subject to automatic sanctions. Instead, the results of
standardized tests would be used as descriptive information in order to identify
schools that may need a review of their organizational and instructional
practices. With such identification, the appropriate authority would begin
an intensive investigation to determine whether the poor performance
was reflected in other measures, possibly including subjective measures.
One way of thinking about the trigger approach is that it effectively
institutes a system of multiple measures in stages, incorporating
additional measures of school performance only when the test score measures
indicate a likelihood that there is a problem. The approach trades off
greater reliability and validity of a system of multiple measures applied
to all schools for a more detailed inspection carried out for those schools
identified as possibly in trouble. In addition, the approach combines the
step of obtaining additional information with the opportunity to provide
initial recommendations for improvement, if they seem to be warranted.
Variations of this approach are already being used in some places (see
Archer, 2006; McDonnell, 2008). For example, in Britain, teams of inspec-
tors visit schools periodically to judge the quality of their leadership and
ability to make improvements. The inspectors draw on test scores, school
self-evaluations, and input from parents, teachers, and students and then
issue a report on various aspects of the school’s performance.
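The staged structure of the trigger approach can be sketched as follows. This is a minimal illustration under our own assumptions, not a description of any system in use: the function names, the threshold, the school names and scores, and the `example_review` stand-in for an in-depth review are all hypothetical.

```python
def schools_to_review(test_scores, trigger_threshold):
    """Stage 1: flag schools whose test scores fall below the trigger."""
    return sorted(
        name for name, score in test_scores.items() if score < trigger_threshold
    )


def staged_evaluation(test_scores, trigger_threshold, review):
    """Stage 2: run the costlier in-depth review only for flagged schools.

    `review` is any callable that takes a school name and returns a report.
    No sanction is applied on the basis of the test scores alone.
    """
    flagged = schools_to_review(test_scores, trigger_threshold)
    return {name: review(name) for name in flagged}


def example_review(name):
    # Stand-in for an inspection drawing on other measures (hypothetical).
    return f"in-depth review scheduled for {name}"


test_scores = {"School A": 82.0, "School B": 55.0, "School C": 61.0}
reports = staged_evaluation(test_scores, trigger_threshold=60.0, review=example_review)
print(reports)
```

Only the flagged school receives the second-stage review, which reflects the trade-off described above: additional measures are gathered for fewer schools, but in more depth.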