| Copyright © 2010. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 1
Division of Behavioral and Social Sciences and Education 500 Fifth Street, NW
Board on Testing and Assessment Washington, DC 20001
Phone: 202 334 2353
Fax: 202 334 1294
E-mail: bota1@nas.edu
www.nationalacademies.org
October 5, 2009
The Honorable Arne Duncan
Secretary of Education
U.S. Department of Education
400 Maryland Avenue, SW, Room 3W329
Washington, DC 20202
Dear Mr. Secretary:
This letter offers comments concerning the Department’s Proposed Regulations on the
Race to the Top (RTT) fund of the American Recovery and Reinvestment Act of 2009 (74 Fed.
Reg. 37804, proposed July 29, 2009) from the Board on Testing and Assessment of the National
Research Council. (See Attachment A for a list of members.) The comments reflect a consensus
of the Board.
Under National Academies procedures, any letter report must be reviewed by an
independent group of experts before it can be publicly released, which made it impossible to
complete the letter within the public comment period of the Federal Register notice.1 However,
we hope that the Department will still find these comments helpful in revising the RTT plans.
The Board on Testing and Assessment stands ready to assist the federal government,
Congress, and the states in addressing issues concerning the use of evidence to improve
educational opportunities for the nation’s young people.
Sincerely yours,
Edward H. Haertel, Chair
Board on Testing and Assessment
cc: Carmel Martin, Assistant Secretary for Planning, Evaluation, and Policy Development
Thelma Melendez, Assistant Secretary for Elementary and Secondary Education
Joanne Weiss, Senior Advisor to the Secretary and Director, Race to the Top
Marshall S. Smith, Senior Counselor to the Secretary, Director, International Affairs
John Q. Easton, Director, Institute of Education Sciences
1
The reviewers for the report are acknowledged in Attachment B.
OCR for page 2
COMMENTS ON THE DEPARTMENT OF EDUCATION’S PROPOSAL
ON THE RACE TO THE TOP FUND
This letter offers comments concerning the Department’s Proposed Regulations on the
Race to the Top (RTT) fund of the American Recovery and Reinvestment Act (ARRA) of 2009
(74 Fed. Reg. 37804, proposed July 29, 2009) from the Board on Testing and Assessment
(BOTA) of the National Research Council. (See Attachment A for a list of members.) The
comments reflect a consensus of the Board.
The Board held an open session at its meeting on July 30, 2009, to discuss the importance
of evaluating RTT spending. This meeting had been planned before the Department’s posting in
the Federal Register. In deciding to hold this session, BOTA hoped to ensure that the unusual
opportunity offered by RTT to invest in educational innovations be fully exploited through
evaluation of the supported programs. Although RTT represents a substantial amount of money
in absolute terms, it is small in comparison with the total spending on education in the United
States. It is only through careful evaluations from RTT-supported innovations that the
investment in RTT is likely to be leveraged to support improvements that can affect the entire
educational system. Without the learning that results from careful evaluation, any benefits of this
one-time spending on innovation are likely to end when the funding ends.
Tests often play an important role in evaluating educational innovations, but an evaluation
requires much more than tests alone. A rigorous evaluation plan typically involves
implementation and outcome data that need to be collected throughout the course of a project. In
addition, a rigorous evaluation plan may affect the way that participants are selected to be
included in the project.
Originally, BOTA’s July 30 meeting was planned to include several members of the
Department, to allow a discussion of possible approaches for requiring and conducting
evaluations of RTT innovations. However, as a result of the Federal Register filing on the
preceding day, most of the Department’s representatives were unable to attend BOTA’s meeting.
In addition, the availability of the Department’s full proposal caused the Board to consider and
discuss a number of other testing-related elements in the proposal. As a result, the Board
concluded that it would be important and appropriate to communicate directly with the
Department in the form of a letter, both to underline the importance of requiring evaluations for
RTT-supported innovations and to address several other concerns about the use of tests
suggested in the Federal Register notice.
The comments we offer in this letter are based on more than 15 years of work by BOTA on
a wide range of topics concerning the use of testing and assessment. We draw from that body of
work and from our collective knowledge about accepted practices in educational measurement to
offer our comments here. (See Attachment C for a selected list of BOTA reports.)
We begin our comments with a general discussion about the role of testing and assessment
in educational reform. This discussion underlines some important aspects of testing and
assessment that are widely understood but not always considered in the design of educational
policies that use tests. We then discuss six testing-related topics that are reflected in the
Department’s proposal in the Federal Register:
1. the need for states to include evaluations in their RTT program activities;
2. the use of NAEP to evaluate RTT outcomes;
3. the use of student growth data to evaluate teachers and principals;
4. the use of data to improve instruction;
2
OCR for page 3
Comments on Race to the Top Proposal
5. challenges in developing common assessments; and
6. challenges in creating longitudinal data systems.
We focus our comments on these six issues because we have concerns in each case about how
they will be addressed and the resulting implications for the way that tests and assessments will
be used in implementing education policy. Throughout, the Departments’ specific proposed
language is in italics, with the page numbers taken from the notice in the Federal Register.
RELIANCE ON TESTING AND ASSESSMENT IN EDUCATIONAL REFORM
The proposed regulations rely heavily on the use of testing and assessment both to drive
and to measure education reform. BOTA strongly supports the Department’s desire to
incorporate objective measures of student performance to inform the process of educational
reform. Standardized assessments have much to offer in this regard. BOTA has noted in Lessons
Learned (p. 5):
In many situations, standardized tests provide the most objective way to compare the
performance of a large group of examinees across places and times.… Similarly, in
K-12 education, statewide standardized tests are useful for comparing student
achievement across classrooms, schools or districts, or across different times. When
combined with other sources of information, these comparisons can help educators
and policy makers decide how to target resources.
However, the Lessons Learned report (p. 5) also cautions that “A test score is an estimate
rather than an exact measure of what a person knows and can do.” The items on any test are a
sample from some larger universe of knowledge and skills, and scores for individual students are
affected by the particular questions included. A student may have done better or worse on a
different sample of questions. In addition, guessing, motivation, momentary distractions, and
other factors also introduce uncertainty into individual scores. When scores are averaged at the
classroom, school, district, or state level, some of these sources of measurement error (e.g.,
guessing or momentary distractions) may average out, but other sources of error become much
more salient. Average scores for groups of students are affected by exclusion and
accommodation policies (e.g., for students with disabilities or English learners), retest policies
for absentees, the timing of testing over the course of the school year, and by performance
incentives that influence test takers’ effort and motivation. Pupil grade retention policies may
influence year-to-year changes in average scores for grade-level cohorts.
Moreover, test results are affected by the degree to which curriculum and instruction are
aligned with the knowledge and skills measured by the test. Educators understandably try to
align curriculum and instruction with knowledge and skills measured on high-stakes tests, and
they similarly focus curriculum and instruction on the kinds of tasks and formats used by the test
itself. For these reasons, as research has conclusively demonstrated, gains on high-stakes tests
are typically larger than corresponding gains on concurrently administered “audit” tests, and
sometimes they are much larger.2 Improvements on the necessarily limited content of a high-
states test may be offset by losses on other, equally valuable content that happens to be left
untested.
2
For references and further discussion of score inflation, see Koretz (2008, Chapter 10).
3
OCR for page 4
Comments on Race to the Top Proposal
These issues do not mean that test scores are unimportant or generally invalid. They do
mean that test use should comport with relevant professional standards. A list of relevant
professional and technical standards applicable to the use of educational assessments is included
in Attachment D. Three core issues reflected in these documents are (1) the need to develop
multiple measures of key outcomes, ideally using multiple assessment formats; (2) the need to
validate these assessments for specific uses; and (3) the need to consider the populations
involved and any associated validity and fairness issues. These documents also point to several
technical issues (e.g., the reliability or precision of the measures, scaling and equating issues)
and operational issues (e.g., data collection and processing) that can undermine the
interpretability and usefulness of the test results if not handled appropriately. The documents
provide guidelines on how to avoid these kinds of problems. In addition, a thorough account of
the steps needed to improve the gathering and use of data by state assessment systems can be
found in the BOTA report, Systems for State Science Assessment (2006).
We encourage the Department to pursue vigorously the use of multiple indicators of what
students know and can do. A single test should not be relied on as the sole indicator of program
effectiveness. This caveat applies as well to other targets of measurement, such as teacher quality
and effectiveness and school progress in closing achievement gaps. Development of an
appropriate system of multiple indicators involves thinking about the objectives of the system
and the nature of the different information that different indicators can provide. Such a system
should be constructed from a careful consideration of the complementary information that is
provided by different measures.3
SPECIFIC AREAS OF CONCERN
Need for States to Include Evaluations in their RTT Program Activities
The Department specifically requests comment on whether states should be required to
conduct evaluations of their support programs and indicates that the final notice for RTT
applications will specify the nature of the evaluations that will be required:
The State and its participating LEAs must use funds under this program to
participate in a national evaluation of the program, if the Department chooses to
conduct one. In addition, the Department is seeking comment on whether a State
should, instead of or in addition to a national evaluation, be required to conduct its
own evaluation of its program activities using funds under this program. The
Department will announce in the notice inviting applications the evaluation
approach(es) that will be required. (Section II.D.a, p. 37808)
We strongly affirm the importance of independent evaluation of all RTT initiatives. These
evaluations should be conducted at all levels: national, state, and local. Such evaluations will
help foster a culture of continuous improvement in the nation’s educational systems. As part of
the evaluation, all consequences—both positive and negative, and intended and unintended—
should be carefully monitored and weighed.
At BOTA’s July 30 meeting to discuss the Department’s proposed regulations, there was
strong support for rigorous evaluation requirements to help ensure both the success of the states’
3
For a thoughtful description of the creation of such a system of multiple indicators, see Chester (2005).
4
OCR for page 5
Comments on Race to the Top Proposal
RTT policy and program implementations and to increase scientific understanding of effective
school reform. Strong evaluation requirements can also help to establish a culture of evidence-
based reform at the state and local levels, which will pay dividends long after the short-term RTT
stimulus funding has ended.
As part of its discussion, BOTA considered the example of welfare reform in 1990s, which
was informed by a history of state experimentation, in which states received waivers from
federal requirements under the condition that they conduct evaluations of those experiments
(Harvey, Camasso, and Jagannathan, 2000). The evaluation requirements from the federal
government led to the scientific study of the impact of crucial features of welfare policy that
were tried out in these state experiments. The RTT fund offers an analogous opportunity in
education: a careful choice of evaluation requirements can result in substantial opportunities for
research and learning about the educational innovations that are supported.
BOTA’s discussion on evaluation focused on six key principles: (1) theory of action, (2)
appropriateness of design, (3) documentation of implementation, (4) formative and summative
components, (5) independent evaluators, and (6) valid outcome measures. These principles are
commonly accepted as best practice in evaluation.4 We briefly discuss each of these
requirements below.
Theory of Action The design of each evaluation will benefit from a clear description by
the applicant of the mechanism by which the reform initiative is expected to improve student
learning outcomes. This “theory of action” will typically take the form of a connected set of
propositions—a logical chain of reasoning that explains how the reform expenditures will lead to
improved schooling practices. A theory of action “connects the dots,” explaining in
commonsense terms which program features are expected to produce which results in order to
reach the final desired outcome. A theory of action gives shape to any evaluation plan. The
evaluation is framed as an investigation of each link in the chain, focusing on those links for
which there may be only limited prior evidence. If the evaluation demonstrates that the reform
initiative failed to meet its objectives, the investigation of the individual links will facilitate the
redesign of the reform initiative to make it effective.
For example, if the theory of action for an accountability program based on tests assumes
that performance will improve because educators will use test results to refine their instruction in
areas in which students perform poorly, then an evaluation of that accountability program should
look at the ways that the tests are being used to influence teaching. If the test results are used to
refocus the curriculum towards areas in which students perform poorly, that would confirm the
result expected by the theory of action; in contrast, if the test results lead educators to shift their
attention from teaching to test preparation, then that would not confirm the result expected by the
theory of action and suggest the possibility of unintended consequences.
Appropriateness of Design There is no one best design for program or policy evaluation.
Whatever design is chosen must be appropriate to the program or policy being evaluated. A
strong design will typically involve the collection of both quantitative and qualitative
information. The choice of a design should be guided by the theory of action of the intervention
or activity, by best practices in the field, and by a clear sense of the purpose of the evaluation:
that is, what questions are to be answered and how they are to be prioritized. An evaluation
4
For further information, see The Program Evaluation Standards (Joint Committee on Standards for Educational
Evaluation, 1994) listed in Attachment D.
5
OCR for page 6
Comments on Race to the Top Proposal
should be designed before the program is implemented so there is an opportunity to collect
baseline data and for the evaluation plan to influence crucial decisions about the selection of
participants for the program and the types of implementation and outcome information to be
collected.
Documentation of Implementation The evaluation must include documentation of the
extent to which the policy or program was in fact implemented as intended. Failure to implement
a program adequately can be one explanation for a program’s failure to produce the desired
effect. Any significant problems encountered along the way should be documented so that future
efforts can profit by lessons learned from those problems. Implementation evaluation can also
help in spotting potential negative, unintended effects of reforms.
Formative and Summative Components Ideally, evaluation plans will provide for both
short-term, “formative” feedback for program improvement, and longer-term, “summative”
information to judge program impact. The formative component of an evaluation can be
relatively informal, featuring quick turnaround, so that the local educators or state officials in
charge of the program can adjust implementation and fine-tune policies as needed. Formative
evaluation identifies ways to adjust or improve program implementation in real time. A
summative evaluation is more formal, longer term, and typically requires more advance
planning. This kind of evaluation does not influence the implementation of the policy or
program; rather, it documents the extent of intended and unintended outcomes, using a rigorous
research design and statistical methods.
For RTT initiatives, both formative and summative evaluation components should be
included. Both components will benefit from documentation about implementation. When
possible, summative evaluations should be designed in ways that provide generalizable
knowledge to guide future education reform.
Independent Evaluators Evaluations should employ independent and well-qualified
external evaluators. The field of evaluation is well developed, and much is known about sound
evaluation practice that should be incorporated in an evaluation. Outside (external) evaluations
will be more credible than evaluations conducted in-house by the states or LEAs themselves.
Valid Outcome Measures Evaluations must use student outcome measures that are valid
for the intended use. Outcomes must reflect intended program or policy goals. An otherwise
sound evaluation may be undermined by reliance on a test that is poorly aligned with the
intervention. An outcome measure should not be chosen simply because it is inexpensive or
readily available. Even a test’s alignment to content standards, as typically implemented, may be
an insufficient criterion for the choice of an outcome measure. Alignment of a test to a state’s
academic content standards typically means that the items on the test can each be matched to one
or more specific content standards, in accordance with the test specification. But the standards
are often much broader and more complex than any of the corresponding items. Taken together,
the items representing a standard may be a pale reflection of the intent of that standard.
Moreover, some standards are typically omitted altogether from the test specification. For these
reasons, “alignment,” as generally implemented, is not sufficient to ensure that a test is
appropriate as an outcome measure for an evaluation.
6
OCR for page 7
Comments on Race to the Top Proposal
Use of NAEP to Evaluate RTT Outcomes
Although BOTA is a strong advocate for the importance of NAEP in measuring U.S.
educational progress, NAEP cannot play a primary role in evaluating RTT initiatives, a role that
might be mistakenly inferred from the language in the Department’s proposal.
We propose using the NAEP to monitor overall increases in student achievement and
decreases in the achievement gap over the course of this grant because the NAEP
provides a way to report consistently across Race to the Top grantees as well as
within a State over time. . . . (Section I, footnote 1, p. 37805)
It is necessary to be clear about the distinction between the requirements of an evaluation
and the kind of “monitoring” that NAEP can provide. For the purposes of evaluating RTT
initiatives, there are at least four critical limitations with regard to inferences that can be drawn
from NAEP.
1. NAEP is intended to survey the knowledge of students across the nation with respect
to a broad range of content and skills: it was not designed to be aligned to a specific
curriculum. Because states differ in the extent to which their standards and curricula
overlap with the standards assessed by NAEP, it is unlikely that NAEP will be able to
fully reflect improvements taking place at the state level.5
2. Although NAEP can provide reliable information for states and certain large school
districts, it cannot, as presently designed, fully reflect the effects of any interventions
targeted at the local level or on a small portion of a state’s students, such as are likely
to be supported with RTT initiatives.
3. States are likely to undertake multiple initiatives under RTT, and NAEP results, even
at the state level, cannot disaggregate the contributions of different RTT initiatives to
state educational progress.
4. The specific grade levels included in NAEP (grades 4, 8, and 12) may not align with
the targeted populations for some RTT interventions.
Consequently, NAEP will be of limited value in judging the success or failure of individual
initiatives under RTT, even at the state level. The availability of NAEP does not in any way
obviate the need to plan rigorous evaluations—at national, state, and local levels—that are
appropriately designed to assess the implementation and outcomes of RTT initiatives.
These concerns do not diminish the importance of NAEP as an independent measure of
educational progress at the state and national levels. NAEP has served and should continue to
serve an essential “audit” function for evaluating states’ reform efforts, using objective
measurements that are comparable across states and over time. In order that it continue to serve
that function, it is critically important that high-stakes decisions not be attached directly to
NAEP results. What makes NAEP so valuable is precisely the fact that there are no incentives
attached to it that might lead educators to “teach to the test.” Consequently, NAEP is relatively
immune to the pressures of score inflation that can distort any high-stakes measure. NAEP can
5
Continued progress toward common standards across the states, as expected by the Department’s proposal, could
lead to a shift in NAEP standards to align more closely with those emerging state standards. However, that will not
have occurred during the period of RTT initiatives.
7
OCR for page 8
Comments on Race to the Top Proposal
only be useful for the monitoring function envisioned in the regulations to the extent that it is not
used for any high-stakes decision.
Use of Student Growth Data to Evaluate Teachers and Principals
We applaud the Department’s desire to move educational data collection systems forward
by supporting and encouraging the construction of data systems that can link students and their
teachers.
We propose that to be eligible under this program, a State must not have any legal,
statutory, or regulatory barriers to linking student achievement or student growth
data to teachers for the purpose of teacher and principal evaluation. (Section II.A., p.
37806)
Such data systems are essential for conducting research related to the full range of potential
approaches for evaluating educators and for developing pilot programs for evaluation approaches
that will one day become operational. However, BOTA has significant concerns that the
Department’s proposal places too much emphasis on measures of growth in student achievement
(1) that have not yet been adequately studied for the purposes of evaluating teachers and
principals and (2) that face substantial practical barriers to being successfully deployed in an
operational personnel system that is fair, reliable, and valid.
Differentiating teacher and principal effectiveness based on performance.…The
extent to which the State, in collaboration with its participating LEAs, has a high-
quality plan and ambitious yet achievable annual targets to (a) Determine an
approach to measuring student growth (as defined in this notice); (b) employ
rigorous, transparent, and equitable processes for differentiating the effectiveness of
teachers and principals using multiple rating categories that take into account data
on student growth (as defined in this notice) as a significant factor; (c) provide to
each teacher and principal his or her own data and rating; and (d) use this
information when making decisions. (Section III.C.(C)(2), p. 37809)
The most prominent way to measure student growth involves “value added” approaches
that use results from earlier tests, as well as other information, to adjust new test scores for pre-
existing differences across students. The objective of these statistical techniques is to produce a
measure of the “value added” to a student’s achievement by a teacher or a school in a given year.
The term “value-added model” (VAM) has been applied to a range of approaches, varying
in their data requirements and statistical complexity. Although the idea has intuitive appeal, a
great deal is unknown about the potential and the limitations of alternative statistical models for
evaluating teachers’ value-added contributions to student learning. BOTA agrees with other
experts who have urged the need for caution and for further research prior to any large-scale,
high-stakes reliance on these approaches (e.g., Braun, 2005; McCaffrey and Lockwood, 2008;
McCaffrey et al., 2003).
BOTA and the National Academy of Education conducted a joint workshop (November
13-14, 2008) to obtain expert judgments and assessments of issues related to the use of value-
added methodologies in education. The workshop topics included both the potential utility of
VAM approaches and the limitations of VAM techniques, particularly in making high-stakes
8
OCR for page 9
Comments on Race to the Top Proposal
individual, institutional, or district accountability decisions. (A report of the workshop is
forthcoming; the workshop papers are available at www7.nationalacademies.org/bota.) The
considerable majority of experts at the workshop cautioned that although VAM approaches seem
promising, particularly as an additional way to evaluate teachers, there is little scientific
consensus about the many technical issues that have been raised about the techniques and their
use.
Prominent testing expert Robert Linn concluded in his workshop paper: “As with any
effort to isolate causal effects from observational data when random assignment is not feasible,
there are reasons to question the ability of value-added methods to achieve the goal of
determining the value added by a particular teacher, school, or educational program” (Linn,
2008, p. 3). Teachers are not assigned randomly to schools, and students are not assigned
randomly to teachers. Without a way to account for important unobservable differences across
students, VAM techniques fail to control fully for those differences and are therefore unable to
provide objective comparisons between teachers who work with different populations. As a
result, value-added scores that are attributed to a teacher or principal may be affected by other
factors, such as student motivation and parental support.
VAM also raises important technical issues about test scores that are not raised by other
uses of those scores. In particular, the statistical procedures assume that a one-unit difference in a
test score means the same amount of learning—and the same amount of teaching—for low-
performing, average, and high-performing students. If this is not the case, then the value-added
scores for teachers who work with different types of students will not be comparable. One
common version of this problem occurs for students whose achievement levels are too high or
too low to be measured by the available tests. For such students, the tests show “ceiling” or
“floor” effects and cannot be used to provide a valid measure of growth. It is not possible to
calculate valid value-added measures for teachers with students who have achievement levels
that are too high or too low to be measured by the available tests.
In addition to these unresolved issues, there are a number of important practical difficulties
in using value-added measures in an operational, high-stakes program to evaluate teachers and
principals in a way that is fair, reliable, and valid. Those difficulties include the following:
1. Estimates of value added by a teacher can vary greatly from year to year, with many
teachers moving between high and low performance categories in successive years
(McCaffrey, Sass, and Lockwood, 2008).
2. Estimates of value added by a teacher may vary depending on the method used to
calculate the value added, which may make it difficult to defend the choice of a
particular method (e.g., Briggs, Weeks, and Wiley, 2008).
3. VAM cannot be used to evaluate educators for untested grades and subjects.
4. Most data bases used to support value-added analyses still face fundamental
challenges related to their ability to correctly link students with teachers by subject.
5. Students often receive instruction from multiple teachers, making it difficult to
attribute learning gains to a specific teacher, even if the data bases were to correctly
record the contributions of all teachers.
6. There are considerable limitations to the transparency of VAM approaches for
educators, parents and policy makers, among others, given the sophisticated statistical
methods they employ.
9
OCR for page 10
Comments on Race to the Top Proposal
Many of these difficulties could be addressed in time—with further research and
development of VAM statistical approaches, expansion of testing programs into more grades and
subjects, improvement of data bases, and careful development of personnel evaluation systems
that use multiple measures. However, it is unlikely that any state at this time could make a
proposal for using VAM approaches in an operational program for teacher or principal
evaluation that adequately addresses all of these concerns.
The use of test data for teacher and educator evaluation requires the same types of cautions
that are stressed when test data are used to evaluate students: “Tests are one objective and
efficient way to measure what people know and can do, and they can help make comparisons
across large groups of people. However, test scores are not perfect measures: they should be
considered with other sources of information when making important decisions about
individuals” (Lessons Learned, p. 15). This caution is even more important when applied to
complex statistics—like value-added analyses—derived from tests.
In sum, value-added methodologies should be used only after careful consideration of their
appropriateness for the data that are available, and if used, should be subjected to rigorous
evaluation. At present, the best use of VAM techniques is in closely studied pilot projects. Even
in pilot projects, VAM estimates of teacher effectiveness should not be used as the sole or
primary basis for making operational decisions because the extent to which the measures reflect
the contribution of teachers themselves, rather than other factors, is not understood. Even in pilot
projects, VAM estimates of teacher effectiveness should not be used to make operational
decisions for teachers with students who have achievement levels that are too high or too low to
be measured by the available tests because the estimates for such teachers will be essentially
meaningless. Even in pilot projects, VAM estimates of teacher effectiveness that are based on
data for a single class of students should not used to make operational decisions because such
estimates are far too unstable to be considered fair or reliable.
Use of Data to Improve Instruction
We applaud the Department’s support of “instructional improvement systems” that include
the use of assessments, as well as student work, to help teachers focus instruction and evaluate
the success of their instructional efforts. It is appropriate that the Department has included the
use of such systems in its criteria for evaluating state proposals for the use of RTT funds.
The extent to which the State, in collaboration with its participating LEAs, has a
high-quality plan to-- (i) Increase the use of instructional improvement systems (as
defined in this notice) that provide teachers, principals, and administrators with the
information they need to inform and improve their instructional practices, decision-
making, and overall effectiveness. (Section III.B.(B)(3), p. 37809)
The choice of appropriate assessments for use in instructional improvement systems is
critical. Because of the extensive focus on large-scale, high-stakes, summative tests, policy
makers and educators sometimes mistakenly believe that such tests are appropriate to use to
provide rapid feedback to guide instruction. This is not the case.
Tests that mimic the structure of large-scale, high-stakes, summative tests, which lightly
sample broad domains of content taught over an extended period of time, are unlikely to provide
the kind of fine-grained, diagnostic information that teachers need to guide their day-to-day
instructional decisions. In addition, an attempt to use such tests to guide instruction encourages a
10
OCR for page 11
Comments on Race to the Top Proposal
narrow focus on the skills used in a particular test—“teaching to the test”—that can severely
restrict instruction. Some topics and types of performance are more difficult to assess with large-
scale, high-stakes, summative tests, including the kind of extended reasoning and problem-
solving tasks that show that a student is able to apply concepts from a domain in a meaningful
way. The use of high-stakes tests already leads to concerns about narrowing the curriculum
towards the knowledge and skills that are easy to assess on such tests; it is critical that the choice
of assessments for use in instructional improvement systems not reinforce the same kind of
narrowing.
In applying its criteria to evaluate state proposals for RTT funds, BOTA urges the
Department to clarify that assessments that simply reproduce the formats of large-scale, high-
stakes, summative tests are not sufficient for instructional improvement systems. The multiple-
choice format in particular lends itself more easily to measuring declarative knowledge than
complex “higher-order” cognitive skills. Instructional improvement systems that rely heavily on
such item formats may reinforce a tendency to narrow instruction to reflect little more than tested
content and formats.
In addition, BOTA urges a great deal of caution about the nature of assessments that could
meet the Department’s definition for inclusion in a “rapid-time” turnaround system.
Rapid-time, in reference to reporting and availability of school- and LEA-level data,
means that data is available quickly enough to inform current lessons, instruction,
and related supports; in most cases, this will be within 72 hours of an assessment or
data gathering in classrooms, schools, and LEAs. (Section IV, p. 37811)
If the Department is referring to informal, classroom assessment methods that can be
scored and interpreted by the teacher, a 72-hour turnaround is a reasonable goal. It is important
to provide teachers with a better understanding about the types of assessment they can use
informally in their classrooms.
However, if the Department is referring to more formal assessment methods, then a 72-
hour turnaround is very difficult to attain. Even for multiple-choice items, it would be hard to
provide a 72-hour turnaround. Assessment of complex reasoning and problem-solving skills
typically demands assessment formats that require students to generate their own extended
responses rather than selecting a word or phrase from a short list of options. Automated scoring
of such “constructed responses” is an active field of psychometric research: to date, however,
except for automated scoring of written essays, the “state of the art” extends only to small
demonstration projects not large-scale applications. Likewise, the logistics of human scoring for
large-scale assessments would make a 72-hour turnaround extremely costly. It is premature to
specify such a required period of time within which assessment results must be returned to
teachers. When seeking to provide quick turnaround of information, it is essential that the quality
of that information not be sacrificed in the process. The first priority must be that the right
information is produced and that the information meets professional standards for technical
adequacy so that the information can accurately guide decision making.
11
OCR for page 12
Comments on Race to the Top Proposal
Challenges in Developing Common Assessments
We applaud the Department’s plan to invest in state consortia that will work toward
developing high quality assessments. It is appropriate that the Department has included the
commitment toward improving the quality of state assessment in its criteria for evaluating state
proposals for the use of RTT funds.
Whether the State has demonstrated a commitment to improving the quality of its
assessments by participating in a consortium of States that is working toward jointly
developing and implementing common, high-quality assessments (as defined in this
notice) aligned with the consortium’s common set of K-12 standards (as defined in
this notice) that are internationally benchmarked and that build toward college and
career readiness by the time of high school graduation, and the extent to which this
consortium includes a significant number of States. (Section III.A.(A)(2), p. 37808)
The opportunity of working toward common assessments of common standards raises the
possibility of significant economies of scale in the development of high-quality assessments.
Although BOTA supports the value of a joint development effort toward common
assessments, we want to stress that the requirements are quite high for producing common
assessments that would truly allow comparisons in student achievement across states in the same
way that NAEP currently does. BOTA’s previous work on potential approaches to developing
common standards and assessments (Uncommon Measures, 1999; Embedding Questions, 1999)
concluded that this aspiration is very difficult to accomplish in practice. The fundamental
problem relates to dissimilarities across states in their standards, instruction and curriculum, and
uses of test scores, as well as the assessments themselves: these dissimilarities ultimately make
assessment results incomparable.
If states indeed adopt fully common standards and develop common assessments, these
concerns would be reduced, but even seemingly small deviations from fully common standards
and assessments will introduce important limitations in the degree of comparability of outcomes.
For instance, to be fully comparable, the assessments would need to be administered under
standardized conditions across the country. This means that time limits, the length of testing
sessions, the instructions given to test takers, test security provisions, and the use of
manipulatives and calculators would need to be the same everywhere. The test administration
would need to be scheduled at an appropriate time during the school year such that all the
students who will take the test have been exposed to and have an opportunity to learn the content
covered on the test. The stakes attached to the assessment results would also need to be constant
across the country to ensure that students are equally motivated to perform well. Including state-
specific content on the assessments—that is, content that differs across states—introduces the
issue of context effects. These state-specific items would need to be placed on the assessment in
a location where they would not influence how students perform on the common items. See
Embedding Questions (1999, pp. 20-36) for a fuller discussion.
In addition to the aspiration of creating common assessments, the Department’s proposal
also notes the objective of creating assessments that are “internationally benchmarked.” There
are several different ways this phrase might be interpreted. However, for assessment results that
could be directly compared to their international counterparts, we note that the difficulties that
arise in comparing test results from different states apply even more strongly for comparing test
12
OCR for page 13
Comments on Race to the Top Proposal
results from different countries. For making comparisons internationally, the problems with
differing standards, assessments, instruction, curricula, and testing regimes are magnified. In
addition, international test comparisons raise difficult problems related to language translation.
Because of these challenges, the Department should think carefully about the kind of
“international benchmarking” that it wants to encourage states to pursue.
The aspiration for higher quality in standards and assessments is one that BOTA members
share, and over the coming months, BOTA will be conducting a series of workshops on
improving state assessment systems, with particular attention to the opportunities offered by the
current interest in moving toward common assessments and by the ARRA funding set aside for
this purpose.
Challenges in Creating Longitudinal Data Systems
We applaud the Department’s desire to support statewide longitudinal data systems. It is
appropriate that the Department has included the commitment toward development such a
system in its plan for evaluating state proposals for the use of RTT funds.
The extent to which the State has a high-quality plan to ensure that data from the
State's statewide longitudinal data system are accessible to, and used to inform and
engage, as appropriate, key stakeholders (e.g., parents, students, teachers, princi-
pals, LEA leaders, community members, unions, researchers, and policymakers);
that the data support decision-makers in the continuous improvement of instruction,
operations, management, and resource allocation. (Section III,B,(B)(2), p. 37809)
The development and application of such longitudinal data systems are considerably more
difficult than many people realize. State proposals for RTT funds should indicate a familiarity
with the full complexity of the challenges and describe appropriate plans to begin to address
them. The workshop on VAM approaches discussed above highlighted the challenges associated
with creating data systems that produce understandable and useful information for all
stakeholders, including the challenge of accurately connecting students to their teachers. An
ongoing set of evaluation activities, as described above, should inform states’ efforts in this
regard. Another upcoming joint report from BOTA and the National Academy of Education on
calculating dropout and graduation rates will also offer advice on building longitudinal data
systems and making use of the data to inform decision making. Issues such as dealing with
transfer students and students who drop out, some of whom re-enroll and drop out repeatedly,
pose challenges in building data systems that accurately code and track students’ status. We hope
to address the qualities for best practices in state assessment systems in the workshop series we
will be conducting over the coming year.
CONCLUSION
In closing, we return to the beginning of this letter, with the importance of rigorously
evaluating the innovations supported by RTT funds. Careful evaluation of this spending should
not be seen as optional; it is likely to be the only way that this substantial investment in
educational innovation can have a lasting impact on the U.S. education system. BOTA urges the
Department to carefully craft a set of requirements for rigorous evaluation of all initiatives
support by RTT funds.
13
OCR for page 14
Comments on Race to the Top Proposal
Attachment A Membership of the Board on Testing and Assessment
Attachment B Reviewer Acknowledgments
Attachment C Selected Reports of the Board on Testing and Assessment
Attachment D Key Professional and Technical Standards Documents Relevant to the Use
of Educational Assessments
Attachment E References
14