4
Evidence on the Use of Test-Based Incentives
In Chapters 2 and 3, we discuss theory and research on incentives with
brief references to tests, and testing with brief references to incentives.
In this chapter we delve more fully into the intersection of tests and
incentives with the goal of providing an interpretive review of differ-
ent types of incentives in education in light of the basic research find-
ings about how incentives operate and how they should be evaluated.
We focus on rigorous studies that can provide guidance to policy mak-
ers about the effects of test-based incentives in education. Although our
review does not cover all the available research about the use of test-based
incentives in education, we have attempted to include all prominent stud-
ies from the past few years that satisfy the criteria we outline below.
In our descriptions of the structure of the test-based incentive pro-
grams, we provide information about the key elements that should be
considered in designing incentive systems (see Chapter 2): who receives
incentives (the targets of the incentive), what performance measures
are used, what consequences are attached, and whether supports for
improvement are provided. Unfortunately, the available program infor-
mation often fails to adequately address these elements, which limits our
ability to draw inferences about how they affect the outcomes.
In describing evidence about the effects of the incentive programs,
we provide information about relevant outcomes other than the tests that
are attached to the incentives in order to reduce the likelihood that our
conclusions are biased by any distortion that the incentives may cause. We
also offer information about changes on high-stakes tests, if it is available,
54 INCENTIVES AND TEST-BASED ACCOUNTABILITY IN EDUCATION
but our focus is on evidence from other measures of the same domain,
including both the results of low-stakes tests and other outcomes, such
as graduation.
Tables 4-1A, 4-1B, 4-2, and 4-3, presented at the end of the chapter,
summarize the descriptive and outcome information discussed in the
text below. The studies or groups of studies are referred to below and
in the tables as examples, identified by number and in some cases by an
additional letter designation. In both the text and tables, we divide the studies we
analyzed into three categories that are familiar to education policy makers
and researchers: school-level policies related to the No Child Left Behind
(NCLB) Act and its predecessors; high school exit exams; and experiments
with teachers and students that use rewards, such as performance pay.
Note that the first two categories address policies rather than experiments
and so involve larger numbers of students, teachers, and schools and
longer implementation periods, but they also present greater difficulties
in identifying appropriate comparison groups. NCLB, as the one federal
policy discussed in our review, involves particularly difficult challenges
in identifying a comparison group.
STUDIES INCLUDED AND FEATURES CONSIDERED
Criteria for Inclusion
Our literature review is limited to studies that allow us to draw causal
conclusions about the overall effects of incentive policies and programs.1
In some cases, programs were planned to include untreated control
groups for comparison; in other cases, researchers have carefully docu-
mented how to make appropriate comparisons. Because our purpose is to
draw causal conclusions about the overall effects of test-based incentives,
we exclude several kinds of studies that do not permit such conclusions:
• studies that omit a comparison group, including the evalua-
tions of NCLB carried out by the U.S. Department of Education
(Stulich et al., 2007), the Center on Education Policy (2008), and
the Northwest Evaluation Association (Cronin et al., 2005), in
addition to various well-known earlier studies (e.g., Klein et al.,
2000; Richards and Sheu, 1992);
• cross-sectional studies that compare results with and without
incentive programs but with no controls for selection into the
incentive programs, including well-known studies of exit exams
(e.g., Jacob, 2001) and teacher performance pay (e.g., Figlio and
Kenny, 2007); and
• studies that focus on contrasting results for students, teachers,
or schools that are immediately above or below the threshold for
receiving the consequences of the incentive programs,2 including
well-known studies of exit exams (e.g., Martorell, 2004; Papay et
al., 2010; Reardon et al., 2010) and school incentives (e.g., Ladd
and Lauen, 2009; Reback, 2008; Rouse et al., 2007).
1 For literature reviews that cover a broader range of related studies, see Figlio and Loeb
(2010) on school accountability, Podgursky and Springer (2006) on teacher performance pay,
and Holme et al. (2010) on high school exit examinations.
Finally, we exclude programs using incentives that are too new to
have meaningful results (e.g., Kemple, 2011; Springer and Winters, 2009).3
Particularly in the area of performance pay for teachers, there has been
strong recent interest in developing new incentive programs, and we
expect these will make important additions to the research base in the
near future.4
Policy and Program Features and Outcomes Considered
The features related to the structure of the incentive programs that
we selected for our analysis are derived from four of the five key elements
that should be considered in designing incentive programs (see
Chapter 2).
Target Our analysis primarily included studies with incentives that
were given to schools, teachers, or students, though one case provides
an example of incentives given to both students and parents. We coded
performance pay programs for teachers as being received by teachers
either individually or as a group (Teachers-I or Teachers-G), depending on
whether the incentives were based on the performance of each teacher’s
own students or on the performance of all students in the school.
2 Such regression discontinuity studies provide interesting causal information about the
effect of being above or below the threshold, but they do not provide information about
the overall effect of implementing an incentives program.
3 New York City has recently implemented a performance pay program for teachers in
about 200 schools using random assignment of eligible schools (see Springer and Winters,
2009). An initial analysis showed small and negative effects of the program on the tests
linked to the incentives, but none of the effects was statistically significant, and the initial
analysis used tests that were given less than 3 months after the program was instituted.
In addition, New York City’s reform effort since mayoral control of the schools began in
2002 includes a schoolwide performance bonus plan that began in the 2007-2008 school
year. Initial analysis suggests that scores on the tests attached to the incentives increased
faster during the reform period than occurred in comparable urban districts in New York
(Kemple, 2011).
4 See, for example, the various reports on the Texas performance pay program available
from the National Center on Performance Incentives (see
http://www.performanceincentives.org [June 2011]).
Performance Measures We used the limited information about the
performance measures to code two different features related to the cover-
age of the measures across subjects and within subjects. For most of the
incentive programs we reviewed, the performance measures included
only tests, but we noted other measures if they were used. We coded the
content coverage across subjects as either narrow or broad, depending
on whether the tests included only a portion of the curriculum or most
subjects. Usually programs with narrow coverage across subjects focused
on language and mathematics tests. When the studies compared results
across states where some states used performance measures with broad
coverage across subjects and others used performance measures with nar-
row coverage across subjects, we coded the coverage across subjects as
mixed. We also coded the content coverage of the performance measures
within subjects as either narrow or broad, depending on whether the test
and the performance indicator were sensitive to the full range of content
and skill within the subject or to only a portion of the content and skill.
For the tests, we looked for information that the tests covered higher-
order thinking skills within the subject area. For the performance indica-
tor, we looked for information that the indicator reflected gains across the
entire distribution of performance, such as by using a score average or a
measure of test score gains rather than a performance level. We coded the
coverage of the performance measure within subjects as broad only if both
the test and the performance indicator were sensitive to the full range of
content and skill.5
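The within-subject coding rule described above, together with the shortcut noted in footnote 5, can be written out as a small decision function. The Python sketch below is illustrative only: the function name, the argument names, and the "undetermined" label for cases where information is missing are assumptions introduced here, not part of the committee's coding scheme.

```python
from typing import Optional


def code_within_subject_coverage(
    test_is_broad: Optional[bool],
    indicator_is_broad: Optional[bool],
) -> str:
    """Return 'broad', 'narrow', or 'undetermined' (a hypothetical label).

    None means the study gave no usable information about that component.
    """
    # A single narrow component settles the question, because "broad"
    # requires BOTH the test and the performance indicator to be broad.
    if test_is_broad is False or indicator_is_broad is False:
        return "narrow"
    if test_is_broad and indicator_is_broad:
        return "broad"
    # Assumption: with missing information and no narrow component,
    # the coding cannot be decided from the study alone.
    return "undetermined"


# Indicator based on a performance level (narrow); test breadth unknown.
print(code_within_subject_coverage(None, False))  # narrow
print(code_within_subject_coverage(True, True))   # broad
```

The first call mirrors the situation in footnote 5: a narrow indicator is enough to code the measure as narrow without any information about the test.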
Consequences With respect to the basic structure of the programs, we
coded whether they were focused primarily on penalizing poor perfor-
mance with sanctions or rewarding performance that meets or exceeds
expectations. In the text, we also describe the nature of the consequences
and any available information about their frequency, but we did not
attempt to code the consequences as large or small because we lacked an
objective way of making such a determination.
5 It was often easier to obtain information from the studies about the breadth of the per-
formance indicators than it was to obtain information about the breadth of the tests. Since
we required both the test and the indicator to be broad in order to code a program as using
a broad performance measure within subjects, we were able to code many programs as using
a narrow performance measure within subjects by looking at the performance indicator
alone, without needing to obtain information about the test.
Supports To see whether the incentives program takes account of the
ability of people to influence their performance, we coded whether or not
resources or supports were provided to aid in the attainment of perfor-
mance goals as part of the incentives program.
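The four structural features coded above (target, performance measures, consequences, and supports) can be thought of as one record per program. The following Python sketch shows one possible representation; the class names, field names, and the example program are hypothetical and are not taken from Tables 4-1A and 4-1B.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Target(Enum):
    SCHOOLS = "Schools"
    TEACHERS_INDIVIDUAL = "Teachers-I"
    TEACHERS_GROUP = "Teachers-G"
    STUDENTS = "Students"
    PARENTS = "Parents"


class Coverage(Enum):
    NARROW = "narrow"
    BROAD = "broad"
    MIXED = "mixed"  # used only for coverage across subjects


class Consequences(Enum):
    SANCTIONS = "sanctions"
    REWARDS = "rewards"


@dataclass(frozen=True)
class ProgramCoding:
    name: str
    targets: Tuple[Target, ...]  # more than one entry if several groups receive incentives
    coverage_across_subjects: Coverage
    coverage_within_subjects: Coverage
    consequences: Consequences
    supports_provided: bool

    def __post_init__(self) -> None:
        # The text defines "mixed" only for coverage across subjects.
        if self.coverage_within_subjects is Coverage.MIXED:
            raise ValueError("'mixed' applies only to coverage across subjects")


# Hypothetical program, for illustration only.
example = ProgramCoding(
    name="Hypothetical program",
    targets=(Target.STUDENTS, Target.PARENTS),
    coverage_across_subjects=Coverage.NARROW,
    coverage_within_subjects=Coverage.NARROW,
    consequences=Consequences.REWARDS,
    supports_provided=False,
)
print(example.targets[0].value)  # Students
```

A record of this kind captures only the contrasts that were coded; it has no field for the size of the consequences, which the text explains were described but not coded.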
Our coding of the incentives structure captures the types of contrasts
reflected in the economics literature, but it does not reflect those in the
psychology literature about the way that incentives are framed and communicated.
In the experimental work discussed in Chapter 2, the contrast
between different conditions sometimes involved subtle differences in
wording. It is plausible that most of the incentive programs we discuss
could have been presented in ways that were either more positive or more
negative, depending on whether those in leadership positions character-
ized them as supporting a shared commitment to learning or as posing
an additional burden in already difficult circumstances. Even the contrast
between sanctions and rewards fails to measure the way incentives were
communicated in a district, school or classroom, since a skillful leader
could have described potential sanctions as reaffirming a shared com-
mitment to learning, and an inept leader could have described potential
rewards as an attempt to impose external control. In many situations, the
contrast between emphasizing one message or the other is subtle—just
as it was in the experiments discussed in Chapter 2. The lack of a good
measure of the way incentives are framed and communicated is an impor-
tant limitation in our description of the structure of the different incentive
programs.
The features in Table 4-1B related to the outcomes of the incentive pro-
grams reflect the importance of providing outcome measures other than
the tests that are attached to the incentives. In addition, we looked for
information about whether the program effects were distributed across
all content areas included in the program and whether they differed for
the relatively low- or high-performing students. Our analysis included
the following features:
• effect on high-stakes test: the effect of the incentives program on
the tests that were attached to the incentives in the program;
• effect on low-stakes test: the effect on tests that were in the same
subjects as the tests attached to the program’s incentives but that
were not themselves attached to those incentives;
• effect on other subject tests: the effect of the program on tests
in subjects other than those that were attached to the program’s
incentives;
• effect on graduation or certification: the effects of the program on
graduation or college-bound certification;
• effect on lower performing students: the statistically significant
effects of the program for students in the lower half of the
achievement distribution; and
• effect on higher performing students: the statistically signifi-
cant effects of the program for students in the upper half of the
achievement distribution.
In the tables, the outcomes columns summarize the outcomes as
positive, negative, or not statistically significantly different from zero.6
If a study provided multiple results, the discussion below and the table
entries summarize the overall tendency of the outcomes; if the results
diverged, the multiple outcomes are discussed and shown in order of
prevalence.
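The outcome coding described in this paragraph, with the significance threshold from footnote 6, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the outcome labels, and the example estimates and p-values are hypothetical, and the default threshold of 0.10 stands in for "the most lenient level provided in each study."

```python
from collections import Counter
from typing import Iterable, List, Tuple


def code_outcome(estimate: float, p_value: float, alpha: float = 0.10) -> str:
    """Code one result as positive, negative, or not significant."""
    if p_value >= alpha or estimate == 0:
        return "not significant"
    return "positive" if estimate > 0 else "negative"


def summarize_outcomes(
    results: Iterable[Tuple[float, float]], alpha: float = 0.10
) -> List[str]:
    """List the outcome codes for a study in order of prevalence."""
    counts = Counter(code_outcome(est, p, alpha) for est, p in results)
    return [code for code, _ in counts.most_common()]


# Hypothetical (estimate, p-value) pairs for a single study.
study_results = [(0.20, 0.01), (0.15, 0.04), (0.05, 0.40)]
print(summarize_outcomes(study_results))  # ['positive', 'not significant']
```

When the codes diverge, as in this example, the most common outcome is listed first, which matches the ordering rule stated in the text.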
As with our coding of the structural features of the incentive programs,
our coding of the outcomes of the programs failed to capture the
important outcome from the psychology literature related to changes in
dispositions. In general, the studies we analyzed did not provide informa-
tion about such outcomes; however, a few studies were exceptions to this
finding, and for these studies we note their findings related to changes in
dispositions in the text.
NCLB AND ITS PREDECESSORS
We identified causal studies related to three examples of school incen-
tives that are in the NCLB mold. Two related to the overall adoption of
school incentives across the United States: Example 1 reflects the ini-
tiatives in a number of individual states before NCLB, and Example 2
reflects the changes that came with NCLB. Example 3 is Chicago, for both
the initial district-level incentives in the 1990s and the implementation of
the succeeding NCLB incentives.
Examples 1 and 2: Nationwide School Incentives
A number of states instituted test-based incentives during the 1990s,
with consequences for schools that anticipated the consequences that
were implemented for all states in 2001 under NCLB (Dee and Jacob, 2007;
Hanushek and Raymond, 2005). Under NCLB, schools that do not show
adequate yearly progress face escalating consequences. The structure of
NCLB defines consequences for schools that involve increasing levels of
state intervention and support to bring about improvement. The initial
requirements are to file improvement plans, make curriculum changes,
and offer their students school choice or tutoring; if progress does not
improve as specified, they are required to restructure in various ways.
The consequences are based on state tests in reading and mathematics
that use state-defined targets for student proficiency. During 2006-2009,
the proportion of schools failing to show adequate yearly progress ranged
from 29 to 35 percent (Center on Education Policy, 2010). There is mixed
information about the implementation of the consequences prescribed
under NCLB, with frequent focus on making curriculum and instructional
changes, but fewer cases of implementing effective school choice or tutoring
options that students use (Center on Education Policy, 2006a).
6 We used the most lenient level of statistical significance provided in each study, generally
p < 0.10 or p < 0.05.
We treated the incentive programs adopted by many states in the
1990s as roughly similar to NCLB although there were many variations
in the incentive structures in the states that may have affected results. For
example, North Carolina’s school incentives, which were implemented in
1996 and continued alongside NCLB after 2001, are based on test score
gains rather than proficiency levels and so are targeted to a broad range
of performance rather than a narrow range near the proficiency cut point.
Under the two different performance criteria, there were different outcomes:
schools facing sanctions under NCLB improved the test scores of
lower performing students, while schools facing sanctions under the state
program improved the test scores of both lower and higher performing
students (Ladd and Lauen, 2009). Unfortunately, there were no studies
available that would have allowed us to contrast the overall effect of state
incentive programs predating NCLB by the committee’s key elements in
incentive structure.
We considered three studies that identified causal effects of school
incentive policies by comparing changes in states that did and did not use
those policies. The studies used the National Assessment of Educational
Progress (NAEP) to measure achievement in reading and mathematics
for fourth and eighth grade students. For the early period, we used a
meta-analysis of 14 studies that compared states that started test-based
incentives before NCLB with states that did not (Lee, 2008). For the later
period, we used two studies that each performed a complementary analy-
sis that compared states that started using school incentives under NCLB
to states that already had school incentives before NCLB (Dee and Jacob,
2009; Wong, Cook, and Steiner, 2009).
Example 1: Pre-NCLB Nationwide School Incentives
For the early period, the meta-analysis by Lee (2008) identified 14
studies that compared results across states with different test-based
accountability policies. Most of the studies used longitudinal NAEP data
from the 1990s to compare states with different levels of test-based school
accountability policy.7 The studies defined the policy contrasts in a vari-
ety of ways and used a variety of analytic strategies. Some of the studies
focused on mathematics, and others looked at both mathematics and
reading. Most of the studies looked at test results in grades 4 and 8. Across
the 76 effect sizes that were calculated from the studies, the average effect
size associated with a contrast between states with and without test-based
accountability was 0.08 standard deviations (Lee, 2008, p. 625); 66 were
positive, 2 were zero, and 8 were negative (pp. 631-638).8 The study did
not report how many of these effects were statistically significant. The
meta-analysis did not find significant differences in effect sizes between
school and student incentive policies (p. 616), between mathematics and
reading (p. 619), between different grade levels (p. 619), or between different
racial and ethnic groups (p. 621).
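As a quick consistency check on the tallies reported for Lee (2008), the counts of positive, zero, and negative effect sizes can be summed and converted to shares. The Python sketch below uses only the counts quoted above; the variable names are illustrative.

```python
# Counts of effect sizes by sign, as reported for Lee (2008, pp. 631-638).
effect_sign_counts = {"positive": 66, "zero": 2, "negative": 8}

total = sum(effect_sign_counts.values())
shares = {sign: count / total for sign, count in effect_sign_counts.items()}

print(total)                          # 76
print(f"{shares['positive']:.1%}")    # 86.8%
print(f"{shares['negative']:.1%}")    # 10.5%
```

The shares describe the direction of the estimates only. As the text notes, the study did not report how many of the effects were statistically significant.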
Example 2A: NCLB Nationwide School Incentives (Dee and Jacob)
For the NCLB period, Dee and Jacob (forthcoming) estimated that the
imposition of the NCLB requirements in states that had not yet adopted
school incentives increased achievement by 2007 in fourth grade math-
ematics by 7.2 points in the preferred model (Dee and Jacob, forthcom-
ing, Table 3, Panel B). This increase corresponds to an effect size of 0.23
standard deviations. The effects on eighth grade mathematics and fourth
grade reading were positive, and the effect on eighth grade reading was
negative; none of these other effects was statistically significant.9 The
paper did not provide a formal test of the statistical significance of the
subject or grade differences in the effect sizes. Over four combinations of
subject and grade, the average effect size was 0.08 standard deviations.10
The increase for fourth grade mathematics occurred for both lower and
higher performing students (Table 5). Finally, a check for changes in
NAEP science test scores showed no effect of NCLB in either fourth or
eighth grade on a subject without incentives (Table C4, Panel B), with
a small positive effect in grade 4 and a small negative effect in grade 8,
neither of which was statistically significant.
7 Given this generalization, the multiple studies in Lee (2008) can be thought of as effectively
providing multiple analyses of a single big experiment across states in the 1990s,
using different ways of analyzing the available NAEP data. Note that four studies included
in Lee (2008) do not fit the generalization in the text: two involve cross-sectional comparisons
(Bishop et al., 2001; Lee, 1998) and two focus exclusively on high school exit requirements
that are based on minimum competency testing rather than school accountability
(Fredericksen, 1994; Jacob, 2001), with one (Jacob, 2001) using the National Education Longitudinal
Study rather than NAEP.
8 The effect sizes are calculated in Lee (2008) from information provided in the original
papers. The figure reported in the text is for effect sizes calculated in terms of the standard
deviation of student scores. Note that many of the effect sizes reported in the paper are based
on the standard deviation of state scores and so are not comparable to the versions calculated
in terms of the standard deviation of student scores.
9 The study notes uncertainty about the reading estimates because the fourth grade data do
not follow the linear trend that the statistical model assumes and because the eighth grade
data include only two pre-NCLB observations. The results for eighth grade reading were
reported only in an appendix.
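The effect sizes reported for the Dee and Jacob estimates are score changes divided by the standard deviation of student scores, as footnote 10 describes. The Python sketch below works through that arithmetic with the figures quoted in the text. The standard deviation shown is back-calculated from the reported 7.2 points and 0.23 standard deviations; it is an implied value, not a number taken from the paper, and the function and variable names are illustrative.

```python
def effect_size(point_change: float, score_sd: float) -> float:
    """Express a score change in standard deviation units."""
    return point_change / score_sd


# Fourth grade mathematics: 7.2 NAEP points is reported as 0.23 SD.
reported_change = 7.2
reported_effect = 0.23
implied_sd = reported_change / reported_effect  # back-calculated, not from the paper
print(round(implied_sd, 1))  # 31.3

# Subject averages from footnote 10, each covering grades 4 and 8.
math_average = 0.17
reading_average = 0.00
overall_average = (2 * math_average + 2 * reading_average) / 4
print(round(overall_average, 3))  # 0.085
```

The recomputed overall average of 0.085 agrees with the reported 0.08 once rounding of the two subject averages is taken into account.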
Example 2B: NCLB Nationwide School Incentives, Public Schools (Wong,
Cook, and Steiner)
Wong, Cook, and Steiner (2009) found similar results for the NCLB
period for public schools, though with some differences in their approach.
In addition to the contrast between states with and without school incen-
tives before NCLB used by Dee and Jacob, they added a contrast between
states with high and low standards. Although high standards did not
appear to interact with incentives,11 the results suggested that the sepa-
rate effects of the two policies combined in grade 4 reading to produce a
statistically significant change. Across three combinations of subject and
grade, the average effect size associated with incentives was 0.12 (Wong,
Cook, and Steiner, 2009, Table 14).12 The effect size was statistically sig-
nificant only for fourth grade mathematics (Table 13). The paper omitted
eighth grade reading, the one test for which Dee and Jacob found negative
effects.
10 We computed the average from the coefficients on the “Total effect by 2007” line of
Table 3 in Dee and Jacob (forthcoming) dividing each by the standard deviation of the
scores for the different tests provided at the bottom of the table. The results for eighth
grade reading were taken from the corresponding line of appendix Table C2. Despite the
authors’ uncertainty about the reading estimates (see fn. 9), our analysis included them in
the overall average in order to provide the best available average of the effect of NCLB that
reflects a balance across subjects and grades. When the subjects were considered separately,
the average effect for mathematics was 0.17 standard deviations, and the average effect for
reading was 0.00 standard deviations.
11 In the case of fourth grade mathematics, in one specification there was an interaction
effect of standards and incentives with borderline statistical significance that suggests that
either high standards or incentives alone produced the same effect as the two policies together
(Wong, Cook, and Steiner, 2009, Table 13).
12 We averaged the effect sizes in the “Diff. in Total Δ (2007 or 2009) CA” line of Table 14
of Wong, Cook, and Steiner (2009).
Example 2C: NCLB Nationwide School Incentives, Public and Private Schools
(Wong, Cook, and Steiner)
Wong, Cook, and Steiner (2009) also used a comparison between pub-
lic and private (mostly Catholic) schools as a way to estimate the effects
of NCLB, though Dee and Jacob rejected this approach because of the
decline in Catholic school enrollment that occurred around the start of
NCLB (because of the sex abuse scandal). In addition to comparing public
and Catholic schools, the study also compared public and non-Catholic
private schools. Over six combinations of subject, grade, and private
school type, there was an average effect size of 0.22 standard deviations
associated with the change in public school NAEP scores by 2007 or
2009.13 Although all of the effect sizes were positive, the only one that
was marginally significant was for fourth grade mathematics for Catholic
private schools (Wong, Cook, and Steiner, 2009, Table 6).
Related Studies About School Incentives
There have been a number of studies of the instructional changes
that have accompanied the implementation of school incentives (e.g.,
Center on Education Policy, 2007a; Hamilton et al., 2007; Rouse et al.,
2007; Stecher, 2002; White and Rosenbaum, 2008). In general, these studies
found shifts in instruction that were consistent with the performance
measures that were attached to the incentives. Some of these changes
were aimed at improving achievement broadly, such as increasing total
instruction time, improving the alignment of instruction with standards,
or adding professional development for teachers. Other changes were
focused on the specific structure of the incentives system, such as shifting
instruction to focus on aspects that count in the system and away from
aspects that do not count: these changes involved an increased focus on
tested subjects, on lower performing students at the threshold of attaining
proficiency, and on material that closely mimics the tests. These findings
about instructional shifts underline the necessity of evaluating the effect
of incentives with information from low-stakes tests in the same subjects
as the tests attached to incentives, on students at different performance
levels, and on subjects not attached to incentives.
In addition to changes in instruction in the subject area, there is evi-
dence of attempts to increase scores in ways that are completely unrelated
to improving learning. The attempts included teaching test-taking skills,
excluding low-performing students from tests, feeding students high-calorie
meals on testing days, providing help to students during a test,
and even changing student answers on tests after they were finished (e.g.,
Cullen and Reback, 2006; Figlio and Getzler, 2006; Figlio and Winicki,
2005; Jacob and Levitt, 2003; Stecher, 2002). The evidence about behaviors
that were likely to distort test results again underlines the importance of
evaluating the effects of incentives using measures of the same domain
that are different than the results of the tests attached to the incentives.
It is also important to note, however, that some of the changes that can
distort high-stakes tests—such as a focus on the portions of the subject
that are easy to test—can also distort low-stakes tests.
13 We averaged the effect sizes in the “Diff. in Total Δ (2007 or 2009)” lines of Table 7 of
Wong, Cook, and Steiner (2009) for the “Public vs. Catholic (Main NAEP)” and “Public vs.
Non-Catholic (Main NAEP)” sections of the table.
Example 3: Chicago School Incentives
The incentives that Chicago introduced in 1996 included sanctions for
both schools and students (Jacob, 2005). The school sanctions involved
the possibility of reconstituting schools with a high percentage of low-
performing students. The student sanctions involved mandatory summer
school and retention for students unable to pass exams in the third, sixth,
and eighth grades. If students were unable to pass the exams after sum-
mer school, they had an additional opportunity to rejoin their class if they
could pass the exams in January of the following year. During the first 3
years of the program, retention rates in these grades increased to 10-20
percent, far above the prior level of 1-2 percent (Jacob and Lefgren, 2009).
Jacob (2005) used longitudinal data for Chicago that included the
period before the policy took effect and controls for both prior test trends
and changes in student demographics. For the 4 years after the start of
school incentives, scores on the high-stakes tests in the three grades had
increased above predicted trends by about 0.2 standard deviations in
reading and 0.3 standard deviations in mathematics (Jacob, 2005, Table 1).
Similar results were obtained by comparing the change in Chicago’s test
score trends when incentives were introduced with the test score trends
in other large, midwestern cities (Table 2). Looking across students, there
were generally positive effects for both lower and higher performing students
in mathematics; for reading, the effects occurred primarily for lower
performing students (Table 3). In the lowest decile of students, however,
there was some indication that incentives decreased performance. Neal
and Schanzenbach (2010) obtained similar results on the distribution of
effects across students.
Jacob (2005, Table 5) replicated a version of his analysis with data
on low-stakes tests in reading and mathematics. The analysis showed
an effect of about 0.2 standard deviations in both subjects 2 years after
implementation, but only for the eighth grade; the effect on the low-stakes
tests for the third and sixth grade was either negative or was small and
[...]
of the program, but starting in the third year, enrollment increased by 34
percent (Table 3, column 1). There was an increase of 1.2 percent in the
graduation rate, but the result was not statistically significant (Table 2,
model 16). However, the number of students attending college increased
by 5.3 percent (Table 2, model 34).
CONCLUSIONS
In this section we synthesize the results across the different incentive
programs discussed above and summarized at the end of this chapter in
Tables 4-1A, 4-1B, 4-2, and 4-3. We focus specifically on summarizing the
types of incentive programs investigated and analyzing the effect of those
programs on student achievement and on high school graduation and
certification. We then consider the relative costs and benefits of incentive
programs.
Types of Incentive Programs Investigated in the Literature
As summarized in Tables 4-1A and 4-1B, researchers and policy mak-
ers have explored incentive programs with a relatively wide range of
variation in key structural features. Across the 15 examples we analyzed,
there are substantial differences in who receives incentives, the breadth of
the performance measures across and within subjects that are attached to
the incentives, the nature of the consequences that the program attaches
to the performance level, and whether extra support is provided by the
program. In addition, there are differences in the nature and frequency of
the consequences attached to the performance measures that are summarized
in the text describing the programs, though not coded in the table.
The research literature we reviewed (see Chapters 2 and 3) suggests
that these key structural features could be critical to the successful
operation of an incentive program, so it is notable that the literature includes
examples of different options for the different features. Looking at the
feature options one at a time, the studies we review provide examples of
major contrasts that could potentially be important, and for each contrast-
ing feature option in the table, there are at least several strong studies that
investigate programs containing that option.
When we consider the feature options in combination, however, it
is clear that many possible combinations of the basic structural features
do not appear (see Tables 4-1A and 4-1B). Some unexplored combinations
are likely to seem uninteresting to implement as actual programs. For
example, a program that combined consequences in the form of sanctions
with no additional support would likely prove to be politically untenable.
However, a number of unexplored feature combinations are potentially
interesting and seem promising for implementation and study.
In the current policy context, there are at least two such unexplored
combinations of structural features that are salient: the combination of
incentives for schools and broad performance measures within subjects,
and the combination of incentives for individual teachers and sanctions.
The first combination is a frequently mentioned possible change that
might be introduced with the next reauthorization of the Elementary
and Secondary Education Act (ESEA)—school accountability with per-
formance measures that have broader coverage within subjects by using
tests that better reflect higher order thinking skills and indicators that are
sensitive to changes across a broader range of performance than a single
proficiency level.
The second combination is a frequently mentioned possible change
in discussions about teacher quality—incentives for individual teach-
ers in the form of sanctions that require teachers whose students do not
meet some test-based level of performance to leave the profession (see,
e.g., Lang, 2010; Staiger and Rockoff, 2010). Proposals to use the results
of student tests as an input into teacher tenure decisions—which can be
interpreted as subjecting teachers to a strong sanction if their students
perform poorly—are an example of this combination. We do not take a
position on either of these proposals here or on other unexplored com-
binations that may be proposed. Instead, we note the twin points that
the existing research literature contains information about the effects of
incentive programs that use these features in other combinations, but it
does not contain information about the effects of programs with these
particular combinations of features.
Effects on Student Achievement and
High School Graduation and Certification
We summarize the effects of the incentive programs on student
achievement and high school graduation and certification in Tables 4-2
and 4-3. We discuss these effects in terms of four groupings of programs:
NCLB and its predecessors, high school exit exams, programs using
rewards in other countries, and programs using rewards in the United
States.
NCLB and Its Predecessors
The four studies that we analyzed all provided information about the
achievement effects of test-based incentives targeted at schools that are
in the NCLB mold.42 The studies showed average incentive effects on the
low-stakes tests ranging from 0.04 to 0.22 standard deviations. Across the
studies there were a number of individual effect estimates that were posi-
tive and statistically significant, though there were also many that were
not statistically significant and some that were negative.
At first blush, the evidence from these studies for an effect of incentives
on student achievement appears substantial. However, there are two important
caveats. First, the statistically significant effects were concentrated
in fourth grade math; in contrast, the results for eighth grade math and
for reading for both grades were often not statistically significant and
sometimes negative.
Second, the highest two estimates—0.22 and 0.12 standard
deviations—were problematic. Both estimates came from analyses that
excluded results for eighth grade reading, giving an unbalanced over-
all picture of the effects of the incentives on achievement. In addition,
the highest estimate of 0.22 standard deviations came from comparisons
between public and private schools that may have been affected by move-
ment away from Catholic schools that occurred during the early years of
NCLB. Without these two problematic estimates, the effects estimated by
the research range from 0.04 to only 0.08 standard deviations.
Given these two caveats, the evidence related to the effects on
achievement of test-based incentives to schools appears to be modest,
limited in both size and applicability. Our preferred estimate for these
programs is 0.08 standard deviations, reflecting the national results
for both the pre-NCLB period by Lee (2008) and the NCLB period by
Dee and Jacob (2011). A program with an effect size of 0.08 standard
deviations would raise the achievement of students currently at the 50th
percentile to the 53rd percentile. This gain is small, both by itself and
in comparisons across nations: the highest achieving countries on inter-
national tests often perform a full standard deviation above the United
States, measured in terms of the distribution of performance within
the United States (see, e.g., Gonzales et al., 2008, Figure 14 for TIMSS
2007 mathematics). To achieve an increase of the magnitude needed to
match the high performing countries would mean that students cur-
rently at the 50th percentile in the United States would have to increase
their scores to the current 84th percentile. For underachieving groups,
far more improvement would be needed because of the large achievement
gaps in the United States (Hill et al., 2008, Table 2). Although an
effect size of 0.08 standard deviations is small in comparison with the
improvements the nation hopes to achieve, it is comparable to the effect
sizes found for other promising interventions that have been evaluated
using standardized tests with relatively broad subject coverage (Hill
et al., 2008, Table 4). The influential Tennessee STAR experiment with
class-size reduction was notable for achieving effect sizes ranging from
0.15 to 0.25 standard deviations (Finn and Achilles, 1999), though the
gains from class-size reduction have been much smaller when it was
instituted on a statewide basis (e.g., Stecher et al., 2001).
42 One of the research papers was a meta-analysis covering 14 studies, many of which
would meet our inclusion criteria if we had considered them separately.
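The percentile comparisons in this paragraph (a gain of 0.08 standard deviations moving a student from the 50th to roughly the 53rd percentile, and a gain of a full standard deviation moving the same student to roughly the 84th) can be reproduced with a short calculation. The sketch below assumes that test scores are normally distributed, which the text does not state explicitly; the function name percentile_after_gain is illustrative and not taken from the report.

```python
from statistics import NormalDist


def percentile_after_gain(effect_size_sd, start_percentile=50.0):
    """Percentile reached after a gain of effect_size_sd standard deviations.

    Assumes scores follow a normal distribution (an assumption made for
    this illustration only).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    start_z = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(start_z + effect_size_sd)


for gain in (0.08, 1.0):
    new_pct = percentile_after_gain(gain)
    print(f"gain of {gain:.2f} SD: 50th percentile -> percentile {new_pct:.1f}")
# gain of 0.08 SD: 50th percentile -> percentile 53.2
# gain of 1.00 SD: 50th percentile -> percentile 84.1
```

Under the same assumption, the function can be called with a different start_percentile to see how a given effect size would shift students who begin below or above the median.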
High School Exit Exams
One of the three studies on the effects of high school exit exam require-
ments provided estimates of the effects on achievement on a low-stakes
test: it found an average effect of 0.00 standard deviations (see Table 4-2).
The other two studies provided estimates of the effects on graduation:
they found average effects of −2.1 and −0.6 percentage points (see Table
4-3). A number of the negative effects are statistically significant. The
smaller estimate came from a study that counted GEDs as equivalent to high
school diplomas; excluding this study leaves an estimate of the graduation
effect of −2.1 percentage points.
Incentive Programs That Use Rewards in Other Countries
The committee’s analysis included six studies of incentive programs
that used rewards in other countries: India, Israel, and Kenya. The
Kenya study measured the effect of incentives on achievement using
low-stakes tests, while the studies in India and Israel measured the
achievement effect using the tests attached to the incentives (see Table 4-2). The
six studies found average estimates of the effect on achievement ranging
from 0.01 to 0.19 standard deviations, and most of the high positive effects
are statistically significant. Two of the Israel studies found effects on high
school certification that averaged 2.2 and 5.4 percentage points (see Table
4-3). The Israel studies found that the effects on both achievement and
certification were concentrated on lower-performing students.
As with the studies on NCLB and its predecessors, the studies on for-
eign reward programs suggest substantial benefits of incentive programs
that must be considered in light of important caveats. First, the programs
in India and Israel measured achievement using the high-stakes tests
attached to the incentives. The problems with this measure are discussed
above, and it is not clear how much change in achievement would be
shown on low-stakes tests.
Second, the programs in India and Kenya were in developing countries
that have quite a different context for education than that in developed
countries. In particular, the high level of teacher absenteeism and the
high rate of student dropout in middle school suggest that the incentives
for both teachers and students may operate differently in developing
countries.
Given these caveats, it is not clear what can be learned from these stud-
ies that would be applicable to the use of incentives in the United States.
For all three countries, there are difficulties in drawing conclusions about
the ability of such programs to increase achievement in the United States. In
addition, although the ability of the Israel programs to increase high school
certification with incentives is potentially promising, it is hard to evaluate
the value of the increase without knowing whether it is accompanied by
increased learning beyond that measured by the high-stakes test.
U.S. Incentive Programs That Use Rewards
Six of the seven studies that provided information about U.S. incen-
tive programs that use rewards showed average effects on achievement
that ranged from −0.02 to 0.06 standard deviations (see Table 4-2). Many
effects were positive, and some were statistically significant, but there
were also a number of negative effects. The estimates of achievement
effects included a number that were based on the tests attached to the
incentives; when these are eliminated, two studies remain, both of
which found an effect of 0.01 standard deviations. One study showed an effect of
incentives on high school graduation of 0.9 percentage points, but the
effect was not statistically significant (see Table 4-3).
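The filtering step described in this paragraph can be sketched in a few lines. The values below are copied from the five U.S. reward-program rows of Table 4-2 (programs 11 through 14B); the variable and function names are illustrative only and do not come from the report.

```python
# (program, stakes of the test used to measure achievement, overall effect size in SD)
# Values copied from the U.S. reward-program rows of Table 4-2.
US_REWARD_STUDIES = [
    ("11. Nashville", "High", 0.04),
    ("12. New York", "Low", 0.01),
    ("13. Ohio Student", "High", 0.06),
    ("14A. TAP-Chicago", "High", -0.02),
    ("14B. TAP-2 states", "Low", 0.01),
]


def effect_range(studies, stakes=None):
    """Return (min, max) effect size, optionally restricted to one stakes type."""
    sizes = [size for _, s, size in studies if stakes is None or s == stakes]
    return min(sizes), max(sizes)


print(effect_range(US_REWARD_STUDIES))                # (-0.02, 0.06)
print(effect_range(US_REWARD_STUDIES, stakes="Low"))  # (0.01, 0.01)
```

Restricting the list to low-stakes measures leaves only the New York and TAP-2 states rows, which is why the range collapses to a single value of 0.01.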
On the basis of our synthesis of the evidence, summarized above,
we reached two conclusions about the effect of test-based incentives on
student achievement and high school completion.
Conclusion 1: Test-based incentive programs, as designed and
implemented in the programs that have been carefully studied,
have not increased student achievement enough to bring the
United States close to the levels of the highest achieving coun-
tries. When evaluated using relevant low-stakes tests, which
are less likely to be inflated by the incentives themselves, the
overall effects on achievement tend to be small and are effec-
tively zero for a number of programs. Even when evaluated
using the tests attached to the incentives, a number of programs
show only small effects. Programs in foreign countries that
show larger effects are not clearly applicable in the U.S. context.
School-level incentives like those of the No Child Left Behind
Act produce some of the larger estimates of achievement effects,
with effect sizes around 0.08 standard deviations, but the measured
effects to date tend to be concentrated in elementary
grade mathematics and the effects are small compared to the
improvements the nation hopes to achieve.
Conclusion 2: The evidence we have reviewed suggests that
high school exit exam programs, as currently implemented in
the United States, decrease the rate of high school graduation
without increasing achievement. The best available estimate
suggests a decrease of 2 percentage points when averaged over
the population. In contrast, several experiments with providing
incentives for graduation in the form of rewards, while keeping
graduation standards constant, suggest that such incentives
might be used to increase high school completion.
Balancing the Benefits and Costs of Test-Based Incentives
The research to date suggests that the benefits of test-based incentive
programs over the past two decades have been quite small. Although the
available evidence is limited, it is not insignificant. The incentive pro-
grams that have been tried have involved a number of different incentive
designs and substantial numbers of schools, teachers, and students. We
focused on studies that allowed us to draw conclusions about the causal
effects of incentive programs and found a significant body of evidence
that was carefully constructed. Unfortunately, the guidance offered by
this body of evidence is not encouraging about the ability of incentive
programs to reliably produce meaningful increases in student achievement,
except in mathematics for elementary school students.
Although the evidence to date about the effectiveness of incentive
programs has not been encouraging, the basic research findings suggest
a number of features that are likely to be important to the effective-
ness of incentive programs and that can provide guidance in the design
of new models. Some proposals for new models of incentive programs
involve combinations of features that have not yet been tried to a signifi-
cant degree, such as school-based incentives using broader performance
measures and teacher incentives using sanctions related to tenure. Other
proposals involve more sophisticated versions of the basic features we
have described, such as the “trigger” systems discussed in Chapter 3 that
use the more narrow information from tests to start an intensive school
evaluation that considers a much broader range of information and then
provides more focused supports to aid in school improvement.
It is also likely to be important to consider potential programs that
focus more on the informational role that tests can play. Our study has
specifically not focused on policies and programs that rely solely on
information about educational achievement that tests provide to drive
improvement through educator motivation and public pressure. Our focus for the
study was chosen because so much of the educational policy discussion
over the past decade has been driven by the conclusion that mere infor-
mation without explicit consequences is insufficient to drive change. And
yet the guidance coming from the basic research in psychology suggests
that the purely informational uses of test results may be more effective in
some situations than incentives that attach explicit consequences to those
results. As policy makers and educators continue to look for successful
routes to improving education in the years ahead, the exploration should
include more subtle incentives that rely on the informational role of test
results and broader types of accountability.
In continuing to explore promising routes to using test-based incen-
tives, however, policy makers and educators should take into account the
costs of doing so. Over the past two decades, the education policy and
research communities have invested substantial attention and resources
in exploring the use of test-based incentives as a way to improve education.
This investment seemed to be worthwhile because it appeared to
offer a promising route for improvement. Further investment in test-
based incentives still seems to be worthwhile because there are now
more sophisticated proposals for using test-based incentives that offer
hope for improvement and deserve to be tried. However, in choosing
how much attention and investment to devote to the exploration of new
forms of test-based incentives, it is important to remember that there
are other aspects of improving education that also would benefit from
development. In addition to test-based incentives, investments to improve
standards, curriculum, instructional methods, and educator capacity are
all likely to be necessary for improving educational outcomes. Although
these other aspects of the system are likely to be complements to test-
based incentives in improving education, they are competitors for fund-
ing and policy attention. Further research and development of promising
new approaches to test-based incentives need to be balanced against the
research and development needs of promising new approaches in other
areas related to improving education. We have not considered those trade-
offs in our examination of test-based incentives, but those trade-offs are
the most important costs that need to be considered by the policy makers
who will decide which new incentive programs to support.
TABLE 4-1A Overview of Results from All Studies of Test-Based
Incentive Programs Using Causal Analyses
                      Structure of Incentives System (a)
                      Target: Who                 Perf Measure  Perf Measure
                      Receives                    Across        Within
Incentive Programs    Incentives                  Subjects      Subjects      Consequences  Support
Studies of NCLB and Its Predecessors
1. U.S. pre-NCLB      Schools                     Mixed         Mixed         Mixed         Mixed
2A. U.S. NCLB         Schools                     Narrow        Narrow        Sanction      Yes
2B. U.S. NCLB         Schools                     Narrow        Narrow        Sanction      Yes
2C. U.S. NCLB         Schools                     Narrow        Narrow        Sanction      Yes
3. Chicago pre-NCLB   Schools and Students        Narrow        Narrow        Sanction      Yes
Studies of High School Exit Exams
4. U.S. HS Exit       Students                    Mixed         Narrow        Sanction      Yes
Studies of Incentive Experiments Using Rewards
5. India              Teachers-I or Teachers-G    Narrow        Broad         Reward        No
6. Israel Teachers-G  Teachers-G                  Broad         Narrow        Reward        No
7. Israel Teachers-I  Teachers-I                  Broad         Narrow        Reward        No
8. Israel Student     Students                    Broad         Narrow        Reward        No
9. Kenya Teachers-G   Teachers-G                  Broad         Narrow        Reward        No
10. Kenya Student     Students and Parents        Broad         Narrow        Reward        No
11. Nashville         Teachers-I                  Narrow        Narrow        Reward        No
12. New York          Students                    Narrow        Broad         Reward        No
13. Ohio Student      Students                    Broad         Narrow        Reward        No
14A. TAP-Chicago      Teachers-I and Teachers-G   Broad         Broad         Reward        Yes
14B. TAP-2 states     Teachers-I and Teachers-G   Broad         Broad         Reward        Yes
15. Texas AP          Teachers-I and Students     Narrow        Narrow        Reward        Yes
NOTE: Teachers-G = Teachers-Group, Teachers-I = Teachers-Individually.
(a) The features related to the structure of incentive programs that should be considered
when designing the programs are (1) the target for the incentives (schools, teachers,
or students in these examples); (2) the extent to which the performance measures are
aligned with the outcomes desired (broad or narrow), both across and within subjects;
(3) the consequences that the incentives provide (reward or sanction); (4) the support
provided to reach the performance goals; and (5) the way the incentives are framed and
communicated. The last feature is not included in the table because no studies consider it.
TABLE 4-1B Overview of Results from All Studies of Test-Based
Incentive Programs Using Causal Analyses
Outcome columns (a): Effect on High-Stakes Tests; Effect on Low-Stakes Tests;
Effect on Other Subject Tests; Effect on HS Grad or Cert;
Effect on Lower Perf Students; Effect on Higher Perf Students
Incentive Programs
Studies of NCLB and Its Predecessors
1. U.S. pre-NCLB      +
2A. U.S. NCLB         0/+ 0 +/0 +/0
2B. U.S. NCLB         0/+
2C. U.S. NCLB         0/+
3. Chicago pre-NCLB   + 0/+/− + + +/0
Studies of High School Exit Exams
4. U.S. HS Exit       0 −/0 test 0 test 0
Studies of Incentive Experiments Using Rewards
5. India              + + + +
6. Israel Teachers-G  + +/0 + 0
7. Israel Teachers-I  + + 0
8. Israel Student     + + 0
9. Kenya Teachers-G   +/0 0
10. Kenya Student     + + + +
11. Nashville         0/+ 0/+
12. New York          0
13. Ohio Student      +/0 +/0 +/0
14A. TAP-Chicago      0
14B. TAP-2 states     +/−/0
15. Texas AP          + 0 +
NOTE: Teachers-G = Teachers-Group, Teachers-I = Teachers-Individually.
(a) Results of studies are characterized here as positive (+), negative (−), or not statistically
significantly different from zero (0). The most lenient level of significance provided in the
study is used, generally p < 0.10 or p < 0.05.
TABLE 4-2 Summary of Average Effects of Incentive Programs on
Student Achievement Tests
                      Test Outcome            Distribution of Test Outcome Effects Across Analyses
                      Type of   Overall Effect
Incentive Programs    Stakes    Size (a)      +Sig      +Nonsig   −Nonsig   −Sig
Studies of NCLB and Its Predecessors
1. U.S. pre-NCLB      Low       0.08               87%                 11%
2A. U.S. NCLB         Low       0.08          25%       50%       25%       0%
2B. U.S. NCLB         Low       0.12 (b)      33%       67%       0%        0%
2C. U.S. NCLB         Low       0.22 (c)      17%       83%       0%        0%
3. Chicago pre-NCLB   Low       0.04          83%       22%       22%       22%
Studies of High School Exit Exams
4A. U.S. HS Exit      Low       0.00          0%        50%       50%       0%
Studies of Incentive Experiments Using Rewards
5. India              High      0.19          100%      0%        0%        0%
6. Israel Teachers-G  High      0.11          75%       13%       13%       0%
7. Israel Teachers-I  High      0.19          100%      0%        0%        0%
9. Kenya Teachers-G   Low       0.01          0%        50%       50%       0%
10. Kenya Student     Low       0.19          100%      0%        0%        0%
11. Nashville         High      0.04          17%       42%       42%       0%
12. New York          Low       0.01          0%        50%       50%       0%
13. Ohio Student      High      0.06          29%       64%       7%        0%
14A. TAP-Chicago      High      −0.02         0%        50%       50%       0%
14B. TAP-2 states     Low       0.01          39%       11%       17%       33%
NOTE: Teachers-G = Teachers-Group, Teachers-I = Teachers-Individually.
(a) Effect size is presented in standard deviation units.
(b) Omits eighth grade reading.
(c) Omits eighth grade reading; uses comparison to private schools during period of
fluctuating enrollment.
TABLE 4-3 Average Effects of Test-Based Incentive Programs on
High School Graduation/Certification Rates
                      HS Grad/Cert      Distribution of Rate Changes Across Analyses
Incentive Programs    Rate Changes      +Sig      +Nonsig   −Nonsig   −Sig
Studies of High School Exit Exams
4B. U.S. HS Exit      −2.1%             0%        0%        0%        100%
4C. U.S. HS Exit      −0.6%             0%        0%        33%       67%
Studies of Incentive Experiments Using Rewards
6. Israel Teachers-G  2.2%              0%        75%       25%       0%
8. Israel Student     5.4%              0%        100%      0%        0%
15. Texas AP          0.9%              0%        50%       50%       0%
NOTE: Teachers-G = Teachers-Group.