INCENTIVES AND TEST-BASED ACCOUNTABILITY IN EDUCATION

4 Evidence on the Use of Test-Based Incentives

In Chapters 2 and 3, we discuss theory and research on incentives with brief references to tests, and testing with brief references to incentives. In this chapter we delve more fully into the intersection of tests and incentives with the goal of providing an interpretive review of different types of incentives in education in light of the basic research findings about how incentives operate and how they should be evaluated. We focus on rigorous studies that can provide guidance to policy makers about the effects of test-based incentives in education. Although our review does not cover all the available research about the use of test-based incentives in education, we have attempted to include all prominent studies from the past few years that satisfy the criteria we outline below.

In our descriptions of the structure of the test-based incentive programs, we provide information about the key elements that should be considered in designing incentive systems (see Chapter 2): who receives incentives (the targets of the incentive), what performance measures are used, what consequences are attached, and whether supports for improvement are provided. Unfortunately, the available program information often fails to adequately address these elements, which limits our ability to draw inferences about how they affect the outcomes.

In describing evidence about the effects of the incentive programs, we provide information about relevant outcomes other than the tests that are attached to the incentives in order to reduce the likelihood that our conclusions are biased by any distortion that the incentives may cause. We also offer information about changes on high-stakes tests, if it is available, but our focus is on evidence from other measures of the same domain, including both the results of low-stakes tests and other outcomes, such as graduation.

Tables 4-1A, 4-1B, 4-2, and 4-3, presented at the end of the chapter, summarize the descriptive and outcome information discussed in the text below. The studies or groups of studies are referred to below and in the tables as examples, identified by number and in some cases additionally by letter. In both the text and tables, we divide the studies we analyzed into three categories that are familiar to education policy makers and researchers: school-level policies related to the No Child Left Behind (NCLB) Act and its predecessors; high school exit exams; and experiments with teachers and students that use rewards, such as performance pay. Note that the first two categories address policies rather than experiments and so involve larger numbers of students, teachers, and schools and longer implementation periods, but they also present greater difficulties in identifying appropriate comparison groups. NCLB, as the one federal policy discussed in our review, involves particularly difficult challenges in identifying a comparison group.

STUDIES INCLUDED AND FEATURES CONSIDERED

Criteria for Inclusion

Our literature review is limited to studies that allow us to draw causal conclusions about the overall effects of incentive policies and programs.¹ In some cases, programs were planned to include untreated control groups for comparison; in other cases, researchers have carefully documented how to make appropriate comparisons. Because our purpose is to draw causal conclusions about the overall effects of test-based incentives, we exclude several kinds of studies that do not permit such conclusions:

• studies that omit a comparison group, including the evaluations of NCLB carried out by the U.S. Department of Education (Stulich et al., 2007), the Center on Education Policy (2008), and the Northwest Evaluation Association (Cronin et al., 2005), in addition to various well-known earlier studies (e.g., Klein et al., 2000; Richards and Sheu, 1992);
• cross-sectional studies that compare results with and without incentive programs but with no controls for selection into the incentive programs, including well-known studies of exit exams (e.g., Jacob, 2001) and teacher performance pay (e.g., Figlio and Kenny, 2007); and
• studies that focus on contrasting results for students, teachers, or schools that are immediately above or below the threshold for receiving the consequences of the incentive programs,² including well-known studies of exit exams (e.g., Martorell, 2004; Papay et al., 2010; Reardon et al., 2010) and school incentives (e.g., Ladd and Lauen, 2009; Reback, 2008; Rouse et al., 2007).

Finally, we exclude programs using incentives that are too new to have meaningful results (e.g., Kemple, 2011; Springer and Winters, 2009).³ Particularly in the area of performance pay for teachers, there has been strong recent interest in developing new incentive programs, and we expect these will make important additions to the research base in the near future.⁴

¹ For literature reviews that cover a broader range of related studies, see Figlio and Loeb (2010) on school accountability, Podgursky and Springer (2006) on teacher performance pay, and Holme et al. (2010) on high school exit examinations.

² Such regression discontinuity studies provide interesting causal information about the effect of being above or below the threshold, but they do not provide information about the overall effect of implementing an incentives program.

³ New York City has recently implemented a performance pay program for teachers in about 200 schools using random assignment of eligible schools (see Springer and Winters, 2009). An initial analysis showed small and negative effects of the program on the tests linked to the incentives, but none of the effects was statistically significant, and the initial analysis used tests that were given less than 3 months after the program was instituted. In addition, New York City’s reform effort since mayoral control of the schools began in 2002 includes a schoolwide performance bonus plan that began in the 2007-2008 school year. Initial analysis suggests that scores on the tests attached to the incentives increased faster during the reform period than occurred in comparable urban districts in New York (Kemple, 2011).

⁴ See, for example, the various reports on the Texas performance pay program available from the National Center on Performance Incentives (see http://www.performanceincentives.org [June 2011]).

Policy and Program Features and Outcomes Considered

The features related to the structure of the incentive programs that we selected for our analysis are derived from four of the five key elements that should be considered in designing incentive programs (see Chapter 2).

Target. Our analysis primarily included studies with incentives that were given to schools, teachers, or students, though one case provides an example of incentives given to both students and parents. We coded performance pay programs for teachers as being received by teachers either individually or as a group (Teachers-I or Teachers-G), depending on whether the incentives were based on the performance of each teacher’s own students or on the performance of all students in the school.

Performance Measures. We used the limited information about the performance measures to code two different features related to the coverage of the measures across subjects and within subjects. For most of the incentive programs we reviewed, the performance measures included only tests, but we noted other measures if they were used. We coded the content coverage across subjects as either narrow or broad, depending on whether the tests included only a portion of the curriculum or most subjects. Usually programs with narrow coverage across subjects focused on language and mathematics tests. When the studies compared results across states where some states used performance measures with broad coverage across subjects and others used performance measures with narrow coverage across subjects, we coded the coverage across subjects as mixed. We also coded the content coverage of the performance measures within subjects as either narrow or broad, depending on whether the test and the performance indicator were sensitive to the full range of content and skill within the subject or to only a portion of the content and skill. For the tests, we looked for information that the tests covered higher-order thinking skills within the subject area. For the performance indicator, we looked for information that the indicator reflected gains across the entire distribution of performance, such as by using a score average or a measure of test score gains rather than a performance level.
We coded the coverage of the performance measure within subjects as broad only if both the test and the performance indicator were sensitive to the full range of content and skill.⁵

Consequences. With respect to the basic structure of the programs, we coded whether they were focused primarily on penalizing poor performance with sanctions or rewarding performance that meets or exceeds expectations. In the text, we also describe the nature of the consequences and any available information about their frequency, but we did not attempt to code the consequences as large or small because we lacked an objective way of making such a determination.

⁵ It was often easier to obtain information from the studies about the breadth of the performance indicators than it was to obtain information about the breadth of the tests. Since we required both the test and the indicator to be broad in order to code a program as using a broad performance measure within subjects, we were able to code many programs as using a narrow performance measure within subjects by looking at the performance indicator alone, without needing to obtain information about the test.
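The within-subject coding rule described above, including the shortcut noted in footnote 5, can be written as a small decision function. The sketch below is illustrative only: the function name `code_within_subject_coverage` and the use of `None` to stand for missing information are assumptions made for this example and are not part of the committee's coding procedure.

```python
from typing import Optional


def code_within_subject_coverage(
    test_is_broad: Optional[bool],
    indicator_is_broad: Optional[bool],
) -> Optional[str]:
    """Code within-subject coverage as 'broad' or 'narrow'.

    Each argument is True (broad), False (narrow), or None (the study
    gave no usable information). Returns None when the available
    information is insufficient to decide. (Hypothetical helper.)
    """
    # One narrow component is enough to code the measure as narrow,
    # even when nothing is known about the other component.
    if test_is_broad is False or indicator_is_broad is False:
        return "narrow"
    # Broad requires positive evidence for both the test and the indicator.
    if test_is_broad is True and indicator_is_broad is True:
        return "broad"
    return None


# A narrow indicator (e.g., a single proficiency level) settles the coding
# without any information about the test.
print(code_within_subject_coverage(None, False))
# Both components are broad.
print(code_within_subject_coverage(True, True))
# A broad indicator with an undocumented test cannot be coded.
print(code_within_subject_coverage(None, True))
```

The asymmetry in the function mirrors the point of footnote 5: narrow coding can rest on the indicator alone, whereas broad coding needs information about both components.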
Supports. To see whether the incentives program takes account of the ability of people to influence their performance, we coded whether or not resources or supports were provided to aid in the attainment of performance goals as part of the incentives program.

Our coding of the incentives structure captures the types of contrasts reflected in the economics literature, but it does not reflect those in the psychology literature about the way that incentives are framed and communicated. In the experimental work discussed in Chapter 2, the contrast between different conditions sometimes involved subtle differences in wording. It is plausible that most of the incentive programs we discuss could have been presented in ways that were either more positive or more negative, depending on whether those in leadership positions characterized them as supporting a shared commitment to learning or as posing an additional burden in already difficult circumstances. Even the contrast between sanctions and rewards fails to measure the way incentives were communicated in a district, school, or classroom, since a skillful leader could have described potential sanctions as reaffirming a shared commitment to learning, and an inept leader could have described potential rewards as an attempt to impose external control. In many situations, the contrast between emphasizing one message or the other is subtle, just as it was in the experiments discussed in Chapter 2. The lack of a good measure of the way incentives are framed and communicated is an important limitation in our description of the structure of the different incentive programs.

The features in Table 4-1B related to the outcomes of the incentive programs reflect the importance of providing outcome measures other than the tests that are attached to the incentives.
In addition, we looked for information about whether the program effects were distributed across all content areas included in the program and whether they differed for the relatively low- or high-performing students. Our analysis included the following features:

• effect on high-stakes test: the effect of the incentives program on the tests that were attached to the incentives in the program;
• effect on low-stakes test: the effect on tests that were in the same subjects as the tests attached to the program’s incentives but that were not themselves attached to those incentives;
• effect on other subject tests: the effect of the program on tests in subjects other than those that were attached to the program’s incentives;
• effect on graduation or certification: the effects of the program on graduation or college-bound certification;
• effect on lower performing students: the statistically significant effects of the program for students in the lower half of the achievement distribution; and
• effect on higher performing students: the statistically significant effects of the program for students in the upper half of the achievement distribution.

In the tables, the outcomes columns summarize the outcomes as positive, negative, or not statistically significantly different from zero.⁶ If a study provided multiple results, the discussion below and the table entries summarize the overall tendency of the outcomes; if the results diverged, the multiple outcomes are discussed and shown in order of prevalence.

As with our coding of the structural features of the incentive programs, our coding of the outcomes of the programs failed to capture the important outcome from the psychology literature related to changes in dispositions. In general, the studies we analyzed did not provide information about such outcomes; however, a few studies were exceptions to this finding, and for these studies we note their findings related to changes in dispositions in the text.

⁶ We used the most lenient level of statistical significance provided in each study, generally p < 0.10 or p < 0.05.

NCLB AND ITS PREDECESSORS

We identified causal studies related to three examples of school incentives that are in the NCLB mold. Two related to the overall adoption of school incentives across the United States: Example 1 reflects the initiatives in a number of individual states before NCLB, and Example 2 reflects the changes that came with NCLB. Example 3 is Chicago, for both the initial district-level incentives in the 1990s and the implementation of the succeeding NCLB incentives.

Examples 1 and 2: Nationwide School Incentives

A number of states instituted test-based incentives during the 1990s, with consequences for schools that anticipated the consequences that were implemented for all states in 2001 under NCLB (Dee and Jacob, 2007; Hanushek and Raymond, 2005). Under NCLB, schools that do not show adequate yearly progress face escalating consequences. The structure of NCLB defines consequences for schools that involve increasing levels of state intervention and support to bring about improvement. The initial requirements are to file improvement plans, make curriculum changes, and offer their students school choice or tutoring; if progress does not improve as specified, they are required to restructure in various ways. The consequences are based on state tests in reading and mathematics that use state-defined targets for student proficiency. During 2006-2009, the proportion of schools failing to show adequate yearly progress ranged from 29 to 35 percent (Center on Education Policy, 2010). There is mixed information about the implementation of the consequences prescribed under NCLB, with frequent focus on making curriculum and instructional changes, but fewer cases of implementing effective school choice or tutoring options that students use (Center on Education Policy, 2006a).

We treated the incentive programs adopted by many states in the 1990s as roughly similar to NCLB although there were many variations in the incentive structures in the states that may have affected results. For example, North Carolina’s school incentives, which were implemented in 1996 and continued alongside NCLB after 2001, are based on test score gains rather than proficiency levels and so are targeted to a broad range of performance rather than a narrow range near the proficiency cut point. Under the two different performance criteria, there were different outcomes: schools facing sanctions under NCLB improved the test scores of lower performing students, while schools facing sanctions under the state program improved the test scores of both lower and higher performing students (Ladd and Lauen, 2009). Unfortunately, there were no studies available that would have allowed us to contrast the overall effect of state incentive programs predating NCLB by the committee’s key elements in incentive structure.
We considered three studies that identified causal effects of school incentive policies by comparing changes in states that did and did not use those policies. The studies used the National Assessment of Educational Progress (NAEP) to measure achievement in reading and mathematics for fourth and eighth grade students. For the early period, we used a meta-analysis of 14 studies that compared states that started test-based incentives before NCLB with states that did not (Lee, 2008). For the later period, we used two studies that each performed a complementary analysis that compared states that started using school incentives under NCLB to states that already had school incentives before NCLB (Dee and Jacob, 2009; Wong, Cook, and Steiner, 2009).

Example 1: Pre-NCLB Nationwide School Incentives

For the early period, the meta-analysis by Lee (2008) identified 14 studies that compared results across states with different test-based accountability policies. Most of the studies used longitudinal NAEP data from the 1990s to compare states with different levels of test-based school accountability policy.⁷ The studies defined the policy contrasts in a variety of ways and used a variety of analytic strategies. Some of the studies focused on mathematics, and others looked at both mathematics and reading. Most of the studies looked at test results in grades 4 and 8. Across the 76 effect sizes that were calculated from the studies, the average effect size associated with a contrast between states with and without test-based accountability was 0.08 standard deviations (Lee, 2008, p. 625); 66 were positive, 2 were zero, and 8 were negative (pp. 631-638).⁸ The study did not report how many of these effects were statistically significant. The meta-analysis did not find significant differences in effect sizes between school and student incentive policies (p. 616), between mathematics and reading (p. 619), between different grade levels (p. 619), or between different racial and ethnic groups (p. 621).

⁷ Given this generalization, the multiple studies in Lee (2008) can be thought of as effectively providing multiple analyses of a single big experiment across states in the 1990s, using different ways of analyzing the available NAEP data. Note that four studies included in Lee (2008) do not fit the generalization in the text: two involve cross-sectional comparisons (Bishop et al., 2001; Lee, 1998) and two focus exclusively on high school exit requirements that are based on minimum competency testing rather than school accountability (Fredericksen, 1994; Jacob, 2001), with one (Jacob, 2001) using the National Education Longitudinal Study rather than NAEP.

⁸ The effect sizes are calculated in Lee (2008) from information provided in the original papers. The figure reported in the text is for effect sizes calculated in terms of the standard deviation of student scores. Note that many of the effect sizes reported in the paper are based on the standard deviation of state scores and so are not comparable to the versions calculated in terms of the standard deviation of student scores.

Example 2A: NCLB Nationwide School Incentives (Dee and Jacob)

For the NCLB period, Dee and Jacob (forthcoming) estimated that the imposition of the NCLB requirements in states that had not yet adopted school incentives increased achievement by 2007 in fourth grade mathematics by 7.2 points in the preferred model (Dee and Jacob, forthcoming, Table 3, Panel B). This increase corresponds to an effect size of 0.23 standard deviations. The effects on eighth grade mathematics and fourth grade reading were positive, and the effect on eighth grade reading was negative; none of these other effects was statistically significant.⁹ The paper did not provide a formal test of the statistical significance of the subject or grade differences in the effect sizes. Over four combinations of subject and grade, the average effect size was 0.08 standard deviations.¹⁰ The increase for fourth grade mathematics occurred for both lower and higher performing students (Table 5). Finally, a check for changes in NAEP science test scores, a subject without incentives, showed no statistically significant effect of NCLB in either fourth or eighth grade (Table C4, Panel B): the estimate was small and positive in grade 4 and small and negative in grade 8.

⁹ The study notes uncertainty about the reading estimates because the fourth grade data do not follow the linear trend that the statistical model assumes and because the eighth grade data include only two pre-NCLB observations. The results for eighth grade reading were reported only in an appendix.

¹⁰ We computed the average from the coefficients on the “Total effect by 2007” line of Table 3 in Dee and Jacob (forthcoming), dividing each by the standard deviation of the scores for the different tests provided at the bottom of the table. The results for eighth grade reading were taken from the corresponding line of appendix Table C2. Despite the authors’ uncertainty about the reading estimates (see fn. 9), our analysis included them in the overall average in order to provide the best available average of the effect of NCLB that reflects a balance across subjects and grades. When the subjects were considered separately, the average effect for mathematics was 0.17 standard deviations, and the average effect for reading was 0.00 standard deviations.

Example 2B: NCLB Nationwide School Incentives, Public Schools (Wong, Cook, and Steiner)

Wong, Cook, and Steiner (2009) found similar results for the NCLB period for public schools, though with some differences in their approach. In addition to the contrast between states with and without school incentives before NCLB used by Dee and Jacob, they added a contrast between states with high and low standards. Although high standards did not appear to interact with incentives,¹¹ the results suggested that the separate effects of the two policies combined in grade 4 reading to produce a statistically significant change. Across three combinations of subject and grade, the average effect size associated with incentives was 0.12 (Wong, Cook, and Steiner, 2009, Table 14).¹² The effect size was statistically significant only for fourth grade mathematics (Table 13). The paper omitted eighth grade reading, the one test for which Dee and Jacob found negative effects.

¹¹ In the case of fourth grade mathematics, in one specification there was an interaction effect of standards and incentives with borderline statistical significance that suggests that either high standards or incentives alone produced the same effect as the two policies together (Wong, Cook, and Steiner, 2009, Table 13).

¹² We averaged the effect sizes in the “Diff. in Total Δ (2007 or 2009) CA” line of Table 14 of Wong, Cook, and Steiner (2009).
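The effect-size arithmetic reported for Example 2A can be checked with a few lines of Python. This is a minimal sketch that uses only figures quoted in the text (7.2 points, 0.23 standard deviations, and the subject averages of 0.17 and 0.00). The standard deviation it derives is implied by those rounded figures rather than taken from Dee and Jacob's table, and the variable names are illustrative assumptions.

```python
# Figures quoted in the text for fourth grade mathematics (Example 2A).
points_gain = 7.2             # NAEP scale points
reported_effect_size = 0.23   # standard deviations

# An effect size is the point gain divided by the standard deviation of
# student scores, so the two reported figures imply a standard deviation.
# This value is inferred here, not read from the original paper.
implied_sd = points_gain / reported_effect_size

# Subject averages quoted in the text; each covers two grades (4 and 8),
# so the four-cell average weights the two subjects equally.
math_average = 0.17
reading_average = 0.00
overall_average = (2 * math_average + 2 * reading_average) / 4

print(f"implied SD: {implied_sd:.1f} points")
print(f"overall average effect size: {overall_average:.3f}")
```

Because the inputs are already rounded to two decimal places, the overall average computed this way (0.085) matches the reported 0.08 only up to rounding.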
Example 2C: NCLB Nationwide School Incentives, Public and Private Schools (Wong, Cook, and Steiner)

Wong, Cook, and Steiner (2009) also used a comparison between public and private (mostly Catholic) schools as a way to estimate the effects of NCLB, though Dee and Jacob rejected this approach because of the decline in Catholic school enrollment that occurred around the start of NCLB (because of the sex abuse scandal). In addition to comparing public and Catholic schools, the study also compared public and non-Catholic private schools. Over six combinations of subject, grade, and private school type, there was an average effect size of 0.22 standard deviations associated with the change in public school NAEP scores by 2007 or 2009.¹³ Although all of the effect sizes were positive, the only one that was marginally significant was for fourth grade mathematics for Catholic private schools (Wong, Cook, and Steiner, 2009, Table 6).

¹³ We averaged the effect sizes in the “Diff. in Total Δ (2007 or 2009)” lines of Table 7 of Wong, Cook, and Steiner (2009) for the “Public vs. Catholic (Main NAEP)” and “Public vs. Non-Catholic (Main NAEP)” sections of the table.

Related Studies About School Incentives

There have been a number of studies of the instructional changes that have accompanied the implementation of school incentives (e.g., Center on Education Policy, 2007a; Hamilton et al., 2007; Rouse et al., 2007; Stecher, 2002; White and Rosenbaum, 2008). In general, these studies found shifts in instruction that were consistent with the performance measures that were attached to the incentives. Some of these changes were aimed at improving achievement broadly, such as increasing total instruction time, improving the alignment of instruction with standards, or adding professional development for teachers. Other changes were focused on the specific structure of the incentives system, such as shifting instruction to focus on aspects that count in the system and away from aspects that do not count: these changes involved an increased focus on tested subjects, on lower performing students at the threshold of attaining proficiency, and on material that closely mimics the tests. These findings about instructional shifts underline the necessity of evaluating the effect of incentives with information from low-stakes tests in the same subjects as the tests attached to incentives, on students at different performance levels, and on subjects not attached to incentives.

In addition to changes in instruction in the subject area, there is evidence of attempts to increase scores in ways that are completely unrelated to improving learning. The attempts included teaching test-taking skills, excluding low-performing students from tests, feeding students high-calorie meals on testing days, providing help to students during a test, and even changing student answers on tests after they were finished (e.g., Cullen and Reback, 2006; Figlio and Getzler, 2006; Figlio and Winicki, 2005; Jacob and Levitt, 2003; Stecher, 2002). The evidence about behaviors that were likely to distort test results again underlines the importance of evaluating the effects of incentives using measures of the same domain that are different from the results of the tests attached to the incentives. It is also important to note, however, that some of the changes that can distort high-stakes tests (such as a focus on the portions of the subject that are easy to test) can also distort low-stakes tests.

Example 3: Chicago School Incentives

The incentives that Chicago introduced in 1996 included sanctions for both schools and students (Jacob, 2005). The school sanctions involved the possibility of reconstituting schools with a high percentage of low-performing students. The student sanctions involved mandatory summer school and retention for students unable to pass exams in the third, sixth, and eighth grades. If students were unable to pass the exams after summer school, they had an additional opportunity to rejoin their class if they could pass the exams in January of the following year. During the first 3 years of the program, retention rates in these grades increased to 10-20 percent, far above the prior level of 1-2 percent (Jacob and Lefgren, 2009).

Jacob (2005) used longitudinal data for Chicago that included the period before the policy took effect and controls for both prior test trends and changes in student demographics. For the 4 years after the start of school incentives, scores on the high-stakes tests in the three grades had increased above predicted trends by about 0.2 standard deviations in reading and 0.3 standard deviations in mathematics (Jacob, 2005, Table 1).
Similar results were obtained by comparing the change in Chicago’s test score trends when incentives were introduced with the test score trends in other large, midwestern cities (Table 2). Looking across students, there were generally positive effects for both lower and higher performing students in mathematics; for reading, the effects occurred primarily for lower performing students (Table 3). In the lowest decile of students, however, there was some indication that incentives decreased performance. Neal and Schanzenbach (2010) obtained similar results on the distribution of effects across students.

Jacob (2005, Table 5) replicated a version of his analysis with data on low-stakes tests in reading and mathematics. The analysis showed an effect of about 0.2 standard deviations in both subjects 2 years after implementation, but only for the eighth grade; the effect on the low-stakes tests for the third and sixth grade was either negative or was small and
[Pages 64-79 are omitted from this excerpt.]
of the program, but starting in the third year, enrollment increased by 34 percent (Table 3, column 1). There was an increase of 1.2 percent in the graduation rate, but the result was not statistically significant (Table 2, model 16). However, the number of students attending college increased by 5.3 percent (Table 2, model 34).

CONCLUSIONS

In this section we synthesize the results across the different incentive programs discussed above and summarized at the end of this chapter in Tables 4-1A, 4-1B, 4-2, and 4-3. We focus specifically on summarizing the types of incentive programs investigated and analyzing the effect of those programs on student achievement and on high school graduation and certification. We then consider the relative costs and benefits of incentive programs.

Types of Incentive Programs Investigated in the Literature

As summarized in Tables 4-1A and 4-1B, researchers and policy makers have explored incentive programs with a relatively wide range of variation in key structural features. Across the 15 examples we analyzed, there are substantial differences in who receives incentives, the breadth of the performance measures across and within subjects that are attached to the incentives, the nature of the consequences that the program attaches to the performance level, and whether extra support is provided by the program. In addition, there are differences in the nature and frequency of the consequences attached to the performance measures that are summarized in the text describing the programs, though not coded in the table. The research literature we reviewed (see Chapters 2 and 3) suggests that these key structural features could be critical to the successful operation of an incentive program, so it is notable that the literature includes examples of different options for the different features.
Looking at the feature options one at a time, the studies we review provide examples of major contrasts that could potentially be important, and for each contrasting feature option in the table, there are at least several strong studies that investigate programs containing that option. When we consider the feature options in combination, however, it is clear that many possible combinations of the basic structural features do not appear: see Tables 4-1A and 4-1B. Some unexplored combinations are likely to seem uninteresting to implement as actual programs, such as an incentive program that combines consequences in the form of sanctions with no additional support, which would likely prove to be politically untenable. However, there are a number of unexplored feature combinations that are potentially interesting and seem promising for implementation and study.

In the current policy context, there are at least two such unexplored combinations of structural features that are salient: the combination of incentives for schools and broad performance measures within subjects, and the combination of incentives for individual teachers and sanctions.

The first combination is a frequently mentioned possible change that might be introduced with the next reauthorization of the Elementary and Secondary Education Act (ESEA): school accountability with performance measures that have broader coverage within subjects, using tests that better reflect higher order thinking skills and indicators that are sensitive to changes across a broader range of performance than a single proficiency level.

The second combination is a frequently mentioned possible change in discussions about teacher quality: incentives for individual teachers in the form of sanctions that require teachers whose students do not meet some test-based level of performance to leave the profession (see, e.g., Lang, 2010; Staiger and Rockoff, 2010). Proposals to use the results of student tests as an input into teacher tenure decisions, which can be interpreted as subjecting teachers to a strong sanction if their students perform poorly, are an example of this combination. We do not take a position on either of these proposals here or on other unexplored combinations that may be proposed. Instead, we note the twin points that the existing research literature contains information about the effects of incentive programs that use these features in other combinations, but it does not contain information about the effects of programs with these particular combinations of features.
Effects on Student Achievement and High School Graduation and Certification

We summarize the effects of the incentive programs on student achievement and high school graduation and certification in Tables 4-2 and 4-3. We discuss these effects in terms of four groupings of programs: NCLB and its predecessors, high school exit exams, programs using rewards in other countries, and programs using rewards in the United States.

NCLB and Its Predecessors

The four studies that we analyzed all provided information about the achievement effects of test-based incentives targeted at schools that are in the NCLB mold.[42] The studies showed average incentive effects on the low-stakes tests ranging from 0.04 to 0.22 standard deviations. Across the studies there were a number of individual effect estimates that were positive and statistically significant, though there were also many that were not statistically significant and some that were negative.

At first blush, the evidence from these studies of incentive effects on student achievement appears substantial. However, there are two important caveats. First, the statistically significant effects were concentrated in fourth grade math; in contrast, the results for eighth grade math and for reading for both grades were often not statistically significant and sometimes negative. Second, the two highest estimates, 0.22 and 0.12 standard deviations, were problematic. Both estimates came from analyses that excluded results for eighth grade reading, giving an unbalanced overall picture of the effects of the incentives on achievement. In addition, the highest estimate of 0.22 standard deviations came from comparisons between public and private schools that may have been affected by movement away from Catholic schools that occurred during the early years of NCLB. Without these two problematic estimates, the effects estimated by the research range from 0.04 to only 0.08 standard deviations.

Given these two caveats, the evidence related to the effects on achievement of test-based incentives to schools appears to be modest, limited in both size and applicability. Our preferred estimate for these programs is 0.08 standard deviations, reflecting the national results for both the pre-NCLB period by Lee (2008) and the NCLB period by Dee and Jacob (2011). A program with an effect size of 0.08 standard deviations would raise the achievement of students currently at the 50th percentile to the 53rd percentile.
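The conversion from an effect size to a percentile rank can be checked with a short calculation. The sketch below assumes that test scores are approximately normally distributed, which the text does not state explicitly; the function name `percentile_after_shift` is hypothetical and introduced only for illustration.

```python
from statistics import NormalDist


def percentile_after_shift(effect_size_sd: float,
                           start_percentile: float = 50.0) -> float:
    """Percentile rank reached by a student who starts at start_percentile
    and gains effect_size_sd standard deviations.

    Assumption (not from the source): scores follow a normal distribution.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Convert the starting percentile to a z-score, shift it by the
    # effect size, and convert the result back to a percentile.
    start_z = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(start_z + effect_size_sd)


if __name__ == "__main__":
    # Effect size of 0.08 standard deviations, starting at the median.
    print(round(percentile_after_shift(0.08)))  # 53
    # A gain of one full standard deviation, starting at the median.
    print(round(percentile_after_shift(1.0)))   # 84
```

Under the same normality assumption, a gain of one full standard deviation from the median corresponds to roughly the 84th percentile, which gives a sense of scale for an effect of 0.08.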
This gain of about 3 percentile points is small, both by itself and in comparisons across nations: the highest achieving countries on international tests often perform a full standard deviation above the United States, measured in terms of the distribution of performance within the United States (see, e.g., Gonzales et al., 2008, Figure 14 for TIMSS 2007 mathematics). An increase of the magnitude needed to match the high-performing countries would mean that students currently at the 50th percentile in the United States would have to increase their scores to the current 84th percentile. For underachieving groups, far more improvement would be needed because of the large achievement gaps in the United States (Hill et al., 2008, Table 2). Although an effect size of 0.08 standard deviations is small in comparison with the improvements the nation hopes to achieve, it is comparable to the effect sizes found for other promising interventions that have been evaluated using standardized tests with relatively broad subject coverage (Hill et al., 2008, Table 4). The influential Tennessee STAR experiment with class-size reduction was notable for achieving effect sizes ranging from 0.15 to 0.25 standard deviations (Finn and Achilles, 1999), though the gains from class-size reduction have been much smaller when it was instituted on a statewide basis (e.g., Stecher et al., 2001).

[42] One of the research papers was a meta-analysis covering 14 studies, many of which would meet our inclusion criteria if we had considered them separately.

High School Exit Exams

One of the three studies on the effects of high school exit exam requirements provided estimates of the effects on achievement on a low-stakes test: it found an average effect of 0.00 standard deviations (see Table 4-2). The other two studies provided estimates of the effects on graduation: they found average effects of −2.1 and −0.6 percentage points (see Table 4-3). A number of the negative effects are statistically significant. The smaller estimate was for a study that counted GEDs as equivalent to high school diplomas; excluding this study leaves an estimate of the graduation effect of −2.1 percentage points.

Incentive Programs That Use Rewards in Other Countries

The committee’s analysis included six studies of incentive programs that used rewards in other countries: India, Israel, and Kenya. The Kenya study measured the effect of incentives on achievement using low-stakes tests, while the studies in India and Israel measured the achievement effect using the tests attached to the incentives (see Table 4-2). The six studies found average estimates of the effect on achievement ranging from 0.01 to 0.19 standard deviations, and most of the high positive effects are statistically significant. Two of the Israel studies found effects on high school certification that averaged 2.2 and 5.4 percentage points (see Table 4-3).
The Israel studies found that the effects on both achievement and certification were concentrated on lower-performing students.

As with the studies on NCLB and its predecessors, the studies on foreign reward programs suggest substantial benefits of incentive programs that must be considered in light of important caveats. First, the programs in India and Israel measured achievement using the high-stakes tests attached to the incentives. The problems with this measure are discussed above, and it is not clear how much change in achievement would be shown on low-stakes tests.

Second, the programs in India and Kenya were in developing countries that have quite a different context for education than that in developed countries. In particular, the high level of teacher absenteeism and the high rate of student dropout in middle school suggest that the incentives for both teachers and students may operate differently in developing countries.

Given these caveats, it is not clear what can be learned from these studies that would be applicable to the use of incentives in the United States. For all three countries, there are difficulties in drawing conclusions about the ability of such programs to increase achievement in the United States. In addition, although the ability of the Israel programs to increase high school certification with incentives is potentially promising, it is hard to evaluate the value of the increase without knowing whether it is accompanied by increased learning beyond that measured by the high-stakes test.

U.S. Incentive Programs That Use Rewards

Six of the seven studies that provided information about U.S. incentive programs that use rewards showed average effects on achievement that ranged from −0.02 to 0.06 standard deviations (see Table 4-2). Many effects were positive, and some were statistically significant, but there were also a number of negative effects. The estimates of achievement effects included a number that were based on the tests attached to the incentives; when these are eliminated, there are two studies, both of which found effects of 0.01 standard deviations. One study showed an effect of incentives on high school graduation of 0.9 percentage points, but the effect was not statistically significant (see Table 4-3).

On the basis of our synthesis of the evidence, summarized above, we reached two conclusions about the effect of test-based incentives on student achievement and high school completion.

Conclusion 1: Test-based incentive programs, as designed and implemented in the programs that have been carefully studied, have not increased student achievement enough to bring the United States close to the levels of the highest achieving countries.
When evaluated using relevant low-stakes tests, which are less likely to be inflated by the incentives themselves, the overall effects on achievement tend to be small and are effectively zero for a number of programs. Even when evaluated using the tests attached to the incentives, a number of programs show only small effects. Programs in foreign countries that show larger effects are not clearly applicable in the U.S. context. School-level incentives like those of the No Child Left Behind Act produce some of the larger estimates of achievement effects, with effect sizes around 0.08 standard deviations, but the measured effects to date tend to be concentrated in elementary grade mathematics and the effects are small compared to the improvements the nation hopes to achieve.

Conclusion 2: The evidence we have reviewed suggests that high school exit exam programs, as currently implemented in the United States, decrease the rate of high school graduation without increasing achievement. The best available estimate suggests a decrease of 2 percentage points when averaged over the population. In contrast, several experiments with providing incentives for graduation in the form of rewards, while keeping graduation standards constant, suggest that such incentives might be used to increase high school completion.

Balancing the Benefits and Costs of Test-Based Incentives

The research to date suggests that the benefits of test-based incentive programs over the past two decades have been quite small. Although the available evidence is limited, it is not insignificant. The incentive programs that have been tried have involved a number of different incentive designs and substantial numbers of schools, teachers, and students. We focused on studies that allowed us to draw conclusions about the causal effects of incentive programs and found a significant body of evidence that was carefully constructed. Unfortunately, the guidance offered by this body of evidence is not encouraging about the ability of incentive programs to reliably produce meaningful increases in student achievement, except in mathematics for elementary school students.

Although the evidence to date about the effectiveness of incentive programs has not been encouraging, the basic research findings suggest a number of features that are likely to be important to the effectiveness of incentive programs and that can provide guidance in the design of new models.
Some proposals for new models of incentive programs involve combinations of features that have not yet been tried to a significant degree, such as school-based incentives using broader performance measures and teacher incentives using sanctions related to tenure. Other proposals involve more sophisticated versions of the basic features we have described, such as the “trigger” systems discussed in Chapter 3 that use the more narrow information from tests to start an intensive school evaluation that considers a much broader range of information and then provides more focused supports to aid in school improvement.

It is also likely to be important to consider potential programs that focus more on the informational role that tests can play. Our study has specifically not focused on policies and programs that rely solely on information about educational achievement that tests provide to drive improvement through educator motivation and public pressure. Our focus for the study was chosen because so much of the educational policy discussion over the past decade has been driven by the conclusion that mere information without explicit consequences is insufficient to drive change. And yet the guidance coming from the basic research in psychology suggests that the purely informational uses of test results may be more effective in some situations than incentives that attach explicit consequences to those results. As policy makers and educators continue to look for successful routes to improving education in the years ahead, the exploration should include more subtle incentives that rely on the informational role of test results and broader types of accountability.

In continuing to explore promising routes to using test-based incentives, however, policy makers and educators should take into account the costs of doing so. Over the past two decades, the education policy and research communities have invested substantial attention and resources in exploring the use of test-based incentives as a way to improve education. This investment seemed to be worthwhile because it appeared to offer a promising route for improvement. Further investment in test-based incentives still seems to be worthwhile because there are now more sophisticated proposals for using test-based incentives that offer hope for improvement and deserve to be tried. However, in choosing how much attention and investment to devote to the exploration of new forms of test-based incentives, it is important to remember that there are other aspects of improving education that also would benefit from development.
In addition to test-based incentives, investments to improve standards, curriculum, instructional methods, and educator capacity are all likely to be necessary for improving educational outcomes. Although these other aspects of the system are likely to be complements to test-based incentives in improving education, they are competitors for funding and policy attention. Further research and development of promising new approaches to test-based incentives need to be balanced against the research and development needs of promising new approaches in other areas related to improving education. We have not considered those trade-offs in our examination of test-based incentives, but those trade-offs are the most important costs that need to be considered by the policy makers who will decide which new incentive programs to support.
TABLE 4-1A Overview of Results from All Studies of Test-Based Incentive Programs Using Causal Analyses

Structure of Incentives System[a]

| Incentive Programs | Target: Who Receives Incentives | Perf Measure Across Subjects | Perf Measure Within Subjects | Consequences | Support |
|---|---|---|---|---|---|
| Studies of NCLB and Its Predecessors | | | | | |
| 1. U.S. pre-NCLB | Schools | Mixed | Mixed | Mixed | Mixed |
| 2A. U.S. NCLB | Schools | Narrow | Narrow | Sanction | Yes |
| 2B. U.S. NCLB | Schools | Narrow | Narrow | Sanction | Yes |
| 2C. U.S. NCLB | Schools | Narrow | Narrow | Sanction | Yes |
| 3. Chicago pre-NCLB | Schools and Students | Narrow | Narrow | Sanction | Yes |
| Studies of High School Exit Exams | | | | | |
| 4. U.S. HS Exit | Students | Mixed | Narrow | Sanction | Yes |
| Studies of Incentive Experiments Using Rewards | | | | | |
| 5. India | Teachers-I or Teachers-G | Narrow | Broad | Reward | No |
| 6. Israel Teachers-G | Teachers-G | Broad | Narrow | Reward | No |
| 7. Israel Teachers-I | Teachers-I | Broad | Narrow | Reward | No |
| 8. Israel Student | Students | Broad | Narrow | Reward | No |
| 9. Kenya Teachers-G | Teachers-G | Broad | Narrow | Reward | No |
| 10. Kenya Student | Students and Parents | Broad | Narrow | Reward | No |
| 11. Nashville | Teachers-I | Narrow | Narrow | Reward | No |
| 12. New York | Students | Narrow | Broad | Reward | No |
| 13. Ohio Student | Students | Broad | Narrow | Reward | No |
| 14A. TAP-Chicago | Teachers-I and Teachers-G | Broad | Broad | Reward | Yes |
| 14B. TAP-2 states | Teachers-I and Teachers-G | Broad | Broad | Reward | Yes |
| 15. Texas AP | Teachers-I and Students | Narrow | Narrow | Reward | Yes |

NOTE: Teachers-G = Teachers-Group, Teachers-I = Teachers-Individually.

[a] The features related to the structure of incentive programs that should be considered when designing the programs are (1) the target for the incentives (schools, teachers, or students in these examples); (2) the extent to which the performance measures are aligned with the outcomes desired (broad or narrow), both across and within subjects; (3) the consequences that the incentives provide (reward or sanction); (4) the support provided to reach the performance goals; and (5) the way the incentives are framed and communicated. The last feature is not included in the table because no studies consider it.
TABLE 4-1B Overview of Results from All Studies of Test-Based Incentive Programs Using Causal Analyses

Outcomes[a]

Columns: Incentive Programs; Effect on High-Stakes Tests; Effect on Low-Stakes Tests; Effect on Other Subject Tests; Effect on HS Grad or Cert; Effect on Lower Perf Students; Effect on Higher Perf Students

Studies of NCLB and Its Predecessors
1. U.S. pre-NCLB: +
2A. U.S. NCLB: 0/+ 0 +/0 +/0
2B. U.S. NCLB: 0/+
2C. U.S. NCLB: 0/+
3. Chicago pre-NCLB: + 0/+/− + + +/0

Studies of High School Exit Exams
4. U.S. HS Exit: 0 −/0 test 0 test 0

Studies of Incentive Experiments Using Rewards
5. India: + + + +
6. Israel Teachers-G: + +/0 + 0
7. Israel Teachers-I: + + 0
8. Israel Student: + + 0
9. Kenya Teachers-G: +/0 0
10. Kenya Student: + + + +
11. Nashville: 0/+ 0/+
12. New York: 0
13. Ohio Student: +/0 +/0 +/0
14A. TAP-Chicago: 0
14B. TAP-2 states: +/−/0
15. Texas AP: + 0 +

NOTE: Teachers-G = Teachers-Group, Teachers-I = Teachers-Individually.

[a] Results of studies are characterized here as positive (+), negative (−), or not statistically significantly different from zero (0). The most lenient level of significance provided in the study is used, generally p < 0.10 or p < 0.05.
TABLE 4-2 Summary of Average Effects of Incentive Programs on Student Achievement Tests

The last four columns give the distribution of test outcome effects across analyses.

| Incentive Programs | Type of Test Stakes | Overall Effect Size[a] | +Sig | +Nonsig | −Nonsig | −Sig |
|---|---|---|---|---|---|---|
| Studies of NCLB and Its Predecessors | | | | | | |
| 1. U.S. pre-NCLB | Low | 0.08 | 87% 11% | | | |
| 2A. U.S. NCLB | Low | 0.08 | 25% | 50% | 25% | 0% |
| 2B. U.S. NCLB | Low | 0.12[b] | 33% | 67% | 0% | 0% |
| 2C. U.S. NCLB | Low | 0.22[c] | 17% | 83% | 0% | 0% |
| 3. Chicago pre-NCLB | Low | 0.04 | 83% | 22% | 22% | 22% |
| Studies of High School Exit Exams | | | | | | |
| 4A. U.S. HS Exit | Low | 0.00 | 0% | 50% | 50% | 0% |
| Studies of Incentive Experiments Using Rewards | | | | | | |
| 5. India | High | 0.19 | 100% | 0% | 0% | 0% |
| 6. Israel Teachers-G | High | 0.11 | 75% | 13% | 13% | 0% |
| 7. Israel Teachers-I | High | 0.19 | 100% | 0% | 0% | 0% |
| 9. Kenya Teachers-G | Low | 0.01 | 0% | 50% | 50% | 0% |
| 10. Kenya Student | Low | 0.19 | 100% | 0% | 0% | 0% |
| 11. Nashville | High | 0.04 | 17% | 42% | 42% | 0% |
| 12. New York | Low | 0.01 | 0% | 50% | 50% | 0% |
| 13. Ohio Student | High | 0.06 | 29% | 64% | 7% | 0% |
| 14A. TAP-Chicago | High | −0.02 | 0% | 50% | 50% | 0% |
| 14B. TAP-2 states | Low | 0.01 | 39% | 11% | 17% | 33% |

NOTE: Teachers-G = Teachers-Group, Teachers-I = Teachers-Individually.

[a] Effect size is presented in standard deviation units.
[b] Omits eighth grade reading.
[c] Omits eighth grade reading; uses comparison to private schools during period of fluctuating enrollment.
TABLE 4-3 Average Effects of Test-Based Incentive Programs on High School Graduation/Certification Rates

The last four columns give the distribution of rate changes across analyses.

| Incentive Programs | HS Grad/Cert Rate Changes | +Sig | +Nonsig | −Nonsig | −Sig |
|---|---|---|---|---|---|
| Studies of High School Exit Exams | | | | | |
| 4B. U.S. HS Exit | −2.1% | 0% | 0% | 0% | 100% |
| 4C. U.S. HS Exit | −0.6% | 0% | 0% | 33% | 67% |
| Studies of Incentive Experiments Using Rewards | | | | | |
| 6. Israel Teachers-G | 2.2% | 0% | 75% | 25% | 0% |
| 8. Israel Student | 5.4% | 0% | 100% | 0% | 0% |
| 15. Texas AP | 0.9% | 0% | 50% | 50% | 0% |

NOTE: Teachers-G = Teachers-Group.