High Stakes: Testing for Tracking, Promotion, and Graduation

High Stakes

TESTING FOR TRACKING, PROMOTION, AND GRADUATION




Jay P. Heubert and Robert M. Hauser, Editors

Committee on Appropriate Test Use



Board on Testing and Assessment

Commission on Behavioral and Social Sciences and Education

National Research Council



NATIONAL ACADEMY PRESS
Washington, D.C. 1999





NATIONAL ACADEMY PRESS • 2101 Constitution Avenue, N.W. • Washington, D.C. 20418

NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine. The members of the committee responsible for the report were chosen for their special competences and with regard for appropriate balance.

The study was supported by Contract/Grant No. ED-98-CO-0005 between the National Academy of Sciences and the U.S. Department of Education. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the view of the organizations or agencies that provided support for this project.

Library of Congress Cataloging-in-Publication Data

High stakes : testing for tracking, promotion, and graduation / Jay
P. Heubert and Robert M. Hauser, editors ; Committee on Appropriate
Test Use.
      p. cm.
    Includes bibliographical references and index.
    ISBN 0-309-06280-2 (pbk.)
    1. Educational tests and measurements—United States. 2.
Educational accountability—United States. 3. Education and
state—United States. I. Heubert, Jay Philip. II. Hauser, Robert
Mason. III. National Research Council (U.S.). Committee on
Appropriate Test Use.
LB3051 .H475 1999
371.26'0973—dc21 98-40215

Additional copies of this report are available from National Academy Press, 2101 Constitution Avenue, N.W., Washington, D.C. 20418

Call (800) 624-6242 or (202) 334-3313 (in the Washington metropolitan area)

This report is also available on line at http://www.nap.edu

Printed in the United States of America

Copyright 1999 by the National Academy of Sciences. All rights reserved.


COMMITTEE ON APPROPRIATE TEST USE


ROBERT M. HAUSER (Chair), Department of Sociology, University of Wisconsin, Madison

LIZANNE DeSTEFANO, Department of Education, University of Illinois, Urbana-Champaign

PASQUALE J. DeVITO, Office of Assessment and Information Services, Rhode Island Department of Education, Providence

RICHARD P. DURÁN, Graduate School of Education, University of California, Santa Barbara

JENNIFER L. HOCHSCHILD, Woodrow Wilson School of Public and International Affairs, Princeton University

STEPHEN P. KLEIN, RAND Corporation, Santa Monica

SHARON LEWIS, Council of the Great City Schools, Washington, D.C.

LORRAINE M. McDONNELL, Department of Political Science, University of California, Santa Barbara

SAMUEL MESSICK, Educational Testing Service, Princeton, New Jersey

ULRIC NEISSER, Department of Psychology, Cornell University

ANDREW C. PORTER, Wisconsin Center for Education Research, University of Wisconsin, Madison

AUDREY L. QUALLS, Iowa Testing Program, University of Iowa, Iowa City

PAUL R. SACKETT, Department of Psychology, University of Minnesota, Minneapolis

CATHERINE E. SNOW, Graduate School of Education, Harvard University

WILLIAM T. TRENT, Department of Educational Policy Studies, University of Illinois, Urbana-Champaign


ROBERT L. LINN, ex officio, Board on Testing and Assessment; School of Education, University of Colorado, Boulder


JAY P. HEUBERT, Study Director

MICHAEL J. FEUER, Director, Board on Testing and Assessment

PATRICIA MORISON, Senior Program Officer

NAOMI CHUDOWSKY, Senior Program Officer

ALLISON M. BLACK, Research Associate

MARGUERITE CLARKE, Technical Consultant

EDWARD MILLER, Editorial Consultant

VIOLA C. HOREK, Administrative Associate

KIMBERLY D. SALDIN, Senior Project Assistant



BOARD ON TESTING AND ASSESSMENT


ROBERT L. LINN (Chair), School of Education, University of Colorado, Boulder

CARL F. KAESTLE (Vice Chair), Department of Education, Brown University

RICHARD C. ATKINSON, President, University of California

IRALINE BARNES, The Superior Court of the District of Columbia

PAUL J. BLACK, School of Education, King's College, London

RICHARD P. DURÁN, Graduate School of Education, University of California, Santa Barbara

CHRISTOPHER F. EDLEY, JR., Harvard Law School, Harvard University

PAUL W. HOLLAND, Graduate School of Education, University of California, Berkeley

MICHAEL W. KIRST, School of Education, Stanford University

ALAN M. LESGOLD, Learning Research and Development Center, University of Pittsburgh

LORRAINE McDONNELL, Department of Political Science, University of California, Santa Barbara

KENNETH PEARLMAN, Lucent Technologies, Inc., Warren, New Jersey

PAUL R. SACKETT, Department of Psychology, University of Minnesota, Minneapolis

RICHARD J. SHAVELSON, School of Education, Stanford University

CATHERINE E. SNOW, Graduate School of Education, Harvard University

WILLIAM L. TAYLOR, Attorney at Law, Washington, D.C.

WILLIAM T. TRENT, Associate Chancellor, University of Illinois, Urbana-Champaign

JACK WHALEN, Xerox Palo Alto Research Center, Palo Alto, California

KENNETH I. WOLPIN, Department of Economics, University of Pennsylvania


MICHAEL J. FEUER, Director

VIOLA C. HOREK, Administrative Associate




Foreword

 

President Clinton's 1997 proposal to create voluntary national tests in reading and mathematics catapulted testing to the top of the national education agenda. The proposal turned up the volume on what had already been a contentious debate and drew intense scrutiny from a wide range of educators, parents, policy makers, and social scientists. Recognizing the important role science could play in sorting through the passionate and often heated exchanges in the testing debate, Congress and the Clinton administration asked the National Research Council, through its Board on Testing and Assessment (BOTA), to conduct three fast-track studies over a 10-month period.

This report and its companions—Uncommon Measures: Equivalence and Linkage Among Educational Tests and Evaluation of the Voluntary National Tests: Phase 1—are the result of truly heroic efforts on the part of the BOTA members, the study committee chairs and members, two co-principal investigators, consultants, and staff, who all understood the urgency of the mission and rose to the challenge of a unique and daunting timeline. Michael Feuer, BOTA director, deserves the special thanks of the board for keeping the effort on track and shepherding the report through the review process. His dedicated effort, long hours, sage advice, and good humor were essential to the success of this effort. Robert Hauser deserves our deepest appreciation for his superb leadership of the committee that produced this report.

These reports are exemplars of the Research Council's commitment to scientific rigor in the public interest: they provide clear and compelling statements of the underlying issues, cogent answers to nettling questions, and highly readable findings and recommendations. These reports will help illuminate the toughest issues in the ongoing debate over the proposed voluntary national tests. But they will do much more as well. The issues addressed in this and the other two reports go well beyond the immediate national testing proposal: they have much to contribute to knowledge about the way tests—all tests—are planned, designed, implemented, reported, and used for a variety of education policy goals.

I know the whole board joins me in expressing our deepest gratitude to the many people who worked so hard on this project. These reports will advance the debate over the role of testing in American education, and I am honored to have participated in this effort.

Robert L. Linn, Chair
Board on Testing and Assessment







Dedication

 

In early October 1998, after the public release of this report but before its formal publication, we were saddened to learn of the death of our fellow committee member, Samuel Messick. Sam spent almost all of his career at the Educational Testing Service, and he made legendary contributions to the science and profession of educational measurement. Even had he not been a member of the committee, Sam would have guided the committee's deliberations through his earlier National Research Council work on the use of tests to make decisions about students with mental retardation—which provided the overarching framework of our report—and his creative reconstruction of the concept of test validity. As it was, Sam made even greater contributions to the project through his drafts of major sections of the text as well as his cordial, but ever crisp, incisive, and often wryly humorous contributions to our discussions. Sam was a wonderful scholar, intellect, and friend, and we dedicate this book to him.




Acknowledgments

 

The Committee on Appropriate Test Use wishes to thank the many people who helped make possible the preparation of this report on an accelerated schedule.

An important part of the committee's work was to gather data about testing research, policy, and practice in states and school districts. Many people gave generously of their time, at meetings and workshops of the committee, in interviews with committee staff, and by drafting short papers to assist the committee's thinking.

Lorrie A. Shepard, University of Colorado, Boulder, provided an excellent overview of educational issues in high-stakes testing of individual students. Floraline Stevens, of Los Angeles, provided insights into state and local high-stakes test policies. At a workshop on testing of English-language learners, Jamal Abedi, University of California, Los Angeles, shared his experimental findings on effects of question wording and format among English-language learners. Toni Marsnik, Language Acquisition and Bilingual Development Branch, Los Angeles Unified School District, and Lynn Winters, assistant superintendent for research, planning, and evaluation, Long Beach Unified School District, offered perspectives on practices for testing English-language learners in their districts and in California more generally.

At a committee workshop in Washington, D.C., six leading educational policymakers offered local, state, and national perspectives on the use of high-stakes tests for promotion or retention; the presenters included Arlene Ackerman, superintendent of schools, Washington, D.C.; Philip Hansen, chief accountability officer, Chicago Public Schools; Nancy Grasmick, superintendent of schools, State of Maryland; Jim Watts, vice president for state services, Southern Regional Education Board; Michael Cohen, special assistant to the president for educational policy; and Bella Rosenberg, assistant to the president, American Federation of Teachers.

The committee also commissioned short papers to assist in deliberations about alternate strategies for promoting appropriate test use. Those who prepared such papers include: Tyler Cowen, George Mason University; Ernest House, University of Colorado, Boulder; Don Kettl, University of Wisconsin, Madison; Henry Levin, Stanford University; Theodore Marmor, Yale University; and Anne Schneider, Arizona State University. We are grateful to David Klahr, Carnegie Mellon University, for his insights.

Jennifer C. Day, Population Division, U.S. Bureau of the Census, provided access to unpublished tabulations of school enrollment data from the October Current Population Survey. In addition, staff of several state education agencies provided valuable information about state retention rates: Alabama, Arizona, California, Delaware, District of Columbia, Florida, Georgia, Indiana, Kentucky, Louisiana, Maryland, Massachusetts, Michigan, Mississippi, New Mexico, New York, North Carolina, Ohio, South Carolina, Tennessee, Texas, Vermont, Virginia, West Virginia, and Wisconsin.

We are also grateful to those who served as consultants to the committee. Marguerite Clarke, research associate at Boston College, provided invaluable contributions during all phases of the study, especially on psychometric issues. Edward Miller joined the project midway as editor, and he skillfully, tirelessly pulled our bits, scraps, and—sometimes—avalanches of text into clear, concise prose. Diane August provided important advice and assistance on the testing of English-language learners and prepared early drafts of Chapter 9 of the report. Susan E. Phillips, Michigan State University, and William L. Taylor, a member of the Board on Testing and Assessment, provided valuable advice on legal issues in testing. Taissa S. Hauser volunteered to collect and assemble statistical data on school retention and age-grade retardation, and her good company and quiet advice were a source of support to all on the project staff.

We owe an important debt of gratitude to the scientific and professional staff of the Commission on Behavioral and Social Sciences and Education (CBASSE), without whose guidance, support, and hard work we could not conceivably have completed this report. Barbara B. Torrey, executive director of the commission, and Sandy Wigdor, director of the Division on Education, Labor, and Human Performance, have been enthusiastic supporters of the project and a timely source of gracious reminders that we keep our priorities in line. Michael J. Feuer, director of the Board on Testing and Assessment (BOTA), brought our research team together, created staff support and resources whenever we needed them, and was our most valuable guide, sounding board, and humorist as we pondered the complexities of educational policy analysis. Patricia Morison made major contributions to our work on students with disabilities and English-language learners and was a constant source of support and thoughtful ideas. Allison Black contributed to many phases of the project; she developed many of the background materials for the committee, and her structured interviews with school administrators were a key source of information about local testing policies and practices. Naomi Chudowsky took major responsibility for the investigation of high school graduation and also contributed to the presentation of psychometric concepts, and Robert Rothman made important contributions to the analysis of policy alternatives. During her summer internship, Yale University doctoral student Marilyn Dabady was a careful and critical in-house reader of our drafts. National Research Council (NRC) staff were always available to pitch in when expertise or energy were called for. They were key members of the study team, and it is hard to see how the study could have been completed without their expert help.

Kimberly Saldin served unflappably and flawlessly as the committee's senior project assistant. She dealt smoothly with the logistics of our four committee meetings in five months, with our voluminous collections and distributions of published and unpublished research materials, and with a seemingly endless stream of text files, e-mail file attachments, and file revisions in seemingly incompatible word-processing formats.

Other BOTA staff—Steve Baldwin, Alix Beatty, Meryl Bertenthal, Cadelle Hemphill, Lee Jones, Karen Mitchell—offered advice, help, and support at key stages of the process. Kimberly Saldin received support when she needed it from other wonderful project assistants to the board: Lisa Alston, Dorothy Majewski, Jane Phillips, and Holly Wells. Viola Horek, administrative associate to BOTA, was always there, instrumental in seeing that the entire project ran smoothly.

We are deeply grateful to Eugenia Grohman, associate director for reports of CBASSE. Genie has and shares enormous knowledge and experience in keeping a committee on track and putting a report together from beginning to end. We also appreciate the superb work of Christine McShane, to whom fell the responsibility for final editing of the full report. We are indebted, also, to the whole CBASSE staff for indulging our scheduling exigencies. Thanks also to Sally Stanfield and the whole Audubon team at the National Academy Press for their creative and speedy support.

Several members of the Board on Testing and Assessment were not members of the committee but attended our meetings ex officio and were constant sources of wisdom and encouragement: Robert L. Linn, University of Colorado at Boulder, chair of the Board on Testing and Assessment, and committee member ex officio; William L. Taylor, Attorney at Law; and Carl F. Kaestle, Brown University.

Individual committee members have made outstanding contributions to the study. Several of them drafted sections on particular topics, prepared background materials, or helped to organize workshops and committee discussions. Everyone contributed constructive, critical thinking, serious concern about the difficult and complex issues that we faced, and an open-mindedness that was essential to the success of the project.

A word of acknowledgment to the sponsors of this study. We have benefited from supportive and collegial relations with members of the various House and Senate committee staffs—on both sides of the aisle—for whom the results of our work have such important implications. We thank them all for understanding and respecting the process of the NRC. Our contracting officer's technical representative, Holly Spurlock, of the U.S. Department of Education, has been a most effective project officer; we thank her for her patience and guidance throughout. Many other officials in the department, the National Assessment Governing Board, and in numerous private and public organizations involved in testing also deserve our thanks and recognition for their cooperation in providing information.

This report has been reviewed by individuals chosen for their diverse perspectives and technical expertise, in accordance with procedures approved by the NRC's Report Review Committee. The purpose of this independent review is to provide candid and critical comments that will assist the authors and the NRC in making the published report as sound as possible and to ensure that the report meets institutional standards for objectivity, evidence, and responsiveness to the study charge. The content of the review comments and draft manuscript remains confidential to protect the integrity of the deliberative process.

We wish to thank the following individuals, who are neither officials nor employees of the NRC, for their participation in the review of this report: Lloyd Bond, School of Education, University of North Carolina, Greensboro; Wayne J. Camara, The College Board, New York, New York; John Fremer, Educational Testing Service, Princeton, New Jersey; Adam Gamoran, Wisconsin Center for Education Research, University of Wisconsin; Arthur S. Goldberger, Department of Economics, University of Wisconsin; Lyle V. Jones, L.L. Thurstone Psychometric Laboratory, University of North Carolina, Chapel Hill; Jeannie Oakes, Graduate School of Education and Information Studies, University of California, Los Angeles; Diana Pullin, School of Education, Boston College; and Henry W. Riecken, Professor of Behavioral Sciences (emeritus), University of Pennsylvania School of Medicine.

Although the individuals listed above have provided many constructive comments and suggestions, responsibility for the final content of this report rests solely with the authoring committee and the NRC.

The two of us were unacquainted when we began the project, and—one a legal scholar and the other a demographer—we had little in common beyond our shared belief in the importance of our mandate. Each of us has benefited from the other's strengths, and working together has been an unalloyed pleasure.

Jay Heubert, Study Director
Robert M. Hauser, Chair
Committee on Appropriate Test Use






The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Bruce M. Alberts is president of the National Academy of Sciences.

The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievements of engineers. Dr. William A. Wulf is president of the National Academy of Engineering.

The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Kenneth I. Shine is president of the Institute of Medicine.

The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy's purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Bruce M. Alberts and Dr. William A. Wulf are chairman and vice chairman, respectively, of the National Research Council.




Contents

 

Executive Summary

PART I
BACKGROUND AND CONTEXT

1   Introduction

2   Assessment Policy and Politics

3   Legal Frameworks

4   Tests as Measurements

PART II
USES OF TESTS TO MAKE HIGH-STAKES DECISIONS
ABOUT INDIVIDUALS

5   Tracking

6   Promotion and Retention

7   Awarding or Withholding High School Diplomas

8   Students with Disabilities

9   English-Language Learners

10   Use of Voluntary National Test Scores for Tracking, Promotion, or Graduation Decisions

PART III
ENSURING APPROPRIATE USES OF TESTS

11   Potential Strategies for Promoting Appropriate Test Use

12   Findings and Recommendations

Biographical Sketches

Index




Public Law 105-78, enacted November 13, 1997

SEC. 309. (a) STUDY—The National Academy of Sciences shall conduct a study and make written recommendations on appropriate methods, practices, and safeguards to ensure that—

(1) existing and new tests that are used to assess student performance are not used in a discriminatory manner or inappropriately for student promotion, tracking or graduation; and

(2) existing and new tests adequately assess student reading and mathematics comprehension in the form most likely to yield accurate information regarding student achievement of reading and mathematics skills.

(b) REPORT TO CONGRESS—The National Academy of Sciences shall submit a written report to the White House, the National Assessment Governing Board, the Committee on Education and the Workforce of the House of Representatives, the Committee on Labor and Human Resources of the Senate, and the Committees on Appropriations of the House and Senate not later than September 1, 1998.





Executive Summary

 

The use of large-scale achievement tests as instruments of educational policy is growing. In particular, states and school districts are using such tests in making high-stakes decisions with important consequences for individual students. Three such high-stakes decisions involve tracking (assigning students to schools, programs, or classes based on their achievement levels), whether a student will be promoted to the next grade, and whether a student will receive a high school diploma. These policies enjoy widespread public support and are increasingly seen as a means of raising academic standards, holding educators and students accountable for meeting those standards, and boosting public confidence in the schools.

Because the stakes are high, the Congress wants to ensure that tests are used properly and fairly, and it asked the National Academy of Sciences, through its National Research Council, to "conduct a study and make written recommendations on appropriate methods, practices and safeguards to ensure that—

    A. existing and new tests that are used to assess student performance are not used in a discriminatory manner or inappropriately for student promotion, tracking or graduation; and

    B. existing and new tests adequately assess student reading and mathematics comprehension in the form most likely to yield accurate information regarding student achievement of reading and mathematics skills."

This study focuses on tests with high stakes for individual students. The committee recognizes that accountability for students is related in important ways to accountability for educators, schools, and school districts. Indeed, the use of tests for accountability of educators, schools, and school districts has significant consequences for individual students, for example, by changing the quality of instruction or affecting school management and budgets. Such indirect effects of large-scale assessment are worth studying in their own right. By focusing on the congressional interest in high-stakes decisions about individual students, this report does not address accountability at those other levels, apart from the issue of participation of all students in large-scale assessments.


BASIC PRINCIPLES OF TEST USE

The use of tests in decisions about student tracking, promotion, and graduation is intended to serve educational policy goals, such as setting high standards for student learning, raising student achievement levels, ensuring equal educational opportunity, fostering parental involvement in student learning, and increasing public support for the schools. The committee recognizes that test use may have negative consequences for individual students even while serving important social or educational policy purposes. The development of a comprehensive testing policy should therefore be sensitive to the balance among the individual and collective benefits and costs of various uses of tests.

Determining whether high-stakes testing of students produces better overall educational outcomes requires that its potential benefits be weighed against its potential unintended negative consequences. The value of tests should also be weighed against that of the other information used in making high-stakes decisions about students, because tracking, promotion, and graduation decisions will be made with or without tests.

The committee adopted three principal criteria, developed from earlier work by the National Research Council, for determining whether a test use is appropriate:

    (1) measurement validity—whether a test is valid for a particular purpose, and whether it accurately measures the test taker's knowledge in the content area being tested;

    (2) attribution of cause—whether a student's performance on a test reflects knowledge and skill based on appropriate instruction or is attributable to poor instruction or to such factors as language barriers or disabilities unrelated to the skills being tested; and

    (3) effectiveness of treatment—whether test scores lead to placements and other consequences that are educationally beneficial.

These criteria, based on established professional standards, lead to the following basic principles of appropriate test use for educational decisions:

    • The important thing about a test is not its validity in general, but its validity when used for a specific purpose. Thus, tests that are valid for influencing classroom practice, "leading" the curriculum, or holding schools accountable are not appropriate for making high-stakes decisions about individual student mastery unless the curriculum, the teaching, and the test(s) are aligned.

    • Tests are not perfect. Test questions are a sample of possible questions that could be asked in a given area. Moreover, a test score is not an exact measure of a student's knowledge or skills. A student's score can be expected to vary across different versions of a test—within a margin of error determined by the reliability of the test—as a function of the particular sample of questions asked and/or transitory factors, such as the student's health on the day of the test. Thus, no single test score can be considered a definitive measure of a student's knowledge.

    • An educational decision that will have a major impact on a test taker should not be made solely or automatically on the basis of a single test score. Other relevant information about the student's knowledge and skills should also be taken into account.

    • Neither a test score nor any other kind of information can justify a bad decision. Research shows that students are typically hurt by simple retention and repetition of a grade in school without remedial and other instructional support services. In the absence of effective services for low-performing students, better tests will not lead to better educational outcomes.
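The second principle above, that a score varies within a margin of error determined by the test's reliability, can be illustrated with a short worked sketch. The formula used here is the standard error of measurement from classical test theory, SEM = SD × sqrt(1 − reliability); it is a common textbook convention and is not quoted from this report. The score scale, reliability, cut score, and function names are all hypothetical values chosen only for illustration.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory SEM: SD * sqrt(1 - reliability).

    Assumption: reliability is a coefficient between 0 and 1.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed: float, sem: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate band around an observed score (z = 1.96 gives roughly 95%)."""
    return (observed - z * sem, observed + z * sem)


def cut_score_within_band(observed: float, cut: float, sem: float, z: float = 1.96) -> bool:
    """True when the cut score falls inside the band around the observed score,
    meaning the pass/fail outcome could plausibly differ on another test form."""
    low, high = score_band(observed, sem, z)
    return low <= cut <= high


if __name__ == "__main__":
    # Hypothetical test: score SD of 10 points, reliability of 0.91.
    sem = standard_error_of_measurement(score_sd=10.0, reliability=0.91)
    low, high = score_band(observed=48.0, sem=sem)
    print(f"SEM = {sem:.2f}; band = ({low:.2f}, {high:.2f})")
    # Hypothetical cut score of 50 for a promotion decision.
    print("cut score inside band:", cut_score_within_band(48.0, 50.0, sem))
```

Under these assumed values, a student who scores 48 against a cut score of 50 falls within the band of measurement error, which is one way to see why a single score should be supplemented with other information before a high-stakes decision is made.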

The committee has considered how these principles apply to the appropriate use of tests in decisions about tracking, promotion, and graduation, to increasing the participation of students with disabilities and English-language learners in large-scale assessments, and to possible uses of the proposed voluntary national tests in making high-stakes decisions about individual students. The committee has also examined existing and potential strategies for promoting appropriate test use.


USES AND MISUSES OF TESTS

Blanket criticisms of tests are not justified. When tests are used in ways that meet relevant psychometric, legal, and educational standards, students' scores provide important information that, combined with information from other sources, can lead to decisions that promote student learning and equality of opportunity. For example, tests can identify learning differences among students that the education system needs to address. Because decisions about tracking, promotion, and graduation will be made with or without testing, proposed alternatives to the use of test scores should be at least equally accurate, efficient, and fair.

It is also a mistake to accept observed test scores as either infallible or immutable. When test use is inappropriate, especially in making high-stakes decisions about individuals, it can undermine the quality of education and equality of opportunity. For example, the lower achievement test scores of racial and ethnic minorities and students from low-income families reflect persistent inequalities in American society and its schools, not inalterable realities about those groups of students. The improper use of test scores can reinforce these inequalities. This lends special urgency to the requirement that test use with high-stakes consequences for individual students be appropriate and fair.

Decisions about tracking, promotion, and graduation differ from one another in important ways. They differ most importantly in the role that mastery of past material and readiness for new material play. Thus, the committee has considered the role of large-scale high-stakes testing in relation to each type of decision separately in this report. But tracking, promotion, and graduation decisions also share common features that pertain both to appropriate test use and to their educational and social consequences.

Members of some minority groups, English-language learners, and students from low socioeconomic backgrounds are overrepresented in lower-track classes and among those denied promotion or graduation on the basis of test scores. Moreover, these same groups of students are underrepresented in high-track classes, "exam" schools, and "gifted and talented" programs. In some cases, such as courses for English-language learners, such disproportions are logical: one would not expect to find native English speakers in classes designed to teach English to English-language learners. In other circumstances, such disproportions raise serious questions. For example, grade retardation (enrollment below the usual grade for a child's age) cumulates rapidly among children after age 6, and it occurs disproportionately among males and minority group members. These disproportions are especially disturbing in view of other evidence that, as typically practiced, grade retention and assignment to low tracks have little educational value. For example, assignment to low tracks is typically associated with an impoverished curriculum, poor teaching, and low expectations. It is also important to note that group differences in test performance do not necessarily indicate problems in a test, because test scores may reflect real differences in achievement. These, in turn, may be due to a lack of access to a high-quality curriculum and instruction. Thus, a finding of group differences calls for a careful effort to determine their cause.


RECOMMENDATIONS

In Chapter 12, the committee offers more detailed recommendations about the appropriate uses of tests. Those recommendations cover cross-cutting issues that affect testing generally; specific issues and problems pertaining to the uses of tests in tracking, promotion, and graduation; and the inclusion of students with disabilities and students who are English-language learners. The organization of the recommendations in Chapter 12 follows the logic of the chapters in this report. In this executive summary, we present overarching recommendations and discuss the possible use of the proposed voluntary national tests for high-stakes decisions about individual students.

    • Accountability for educational outcomes should be a shared responsibility of states, school districts, public officials, educators, parents, and students. High standards cannot be established and maintained merely by imposing them on students. Moreover, if parents, educators, public officials, and others who share responsibility for educational outcomes are to discharge their responsibility effectively, they should have access to information about the nature and interpretation of tests and test scores. Such information should be freely available to the public and should be incorporated into teacher education and into educational programs for principals, administrators, public officials, and others.

    • Tests should be used for high-stakes decisions about individual mastery only after implementing changes in teaching and curriculum that ensure that students have been taught the knowledge and skills on which they will be tested. Some school systems are already doing this by planning a gap of several years between the introduction of new tests and the attachment of high stakes to individual student performance, during which schools may achieve the necessary alignment among tests, curriculum, and instruction. But others may see attaching high stakes to individual student test scores as a way of leading curricular reform, not recognizing the danger that such uses of tests may lack the "instructional validity" required by law—that is, a close correspondence between test content and instructional content.

    • The consequences of high-stakes testing for individual students are often posed as either-or propositions, but this need not be the case. For example, "social promotion" and repetition of a grade are only two of many educational strategies available to educators when test scores and other information indicate that students are experiencing serious academic difficulty. Neither social promotion nor retention alone is an effective treatment for low achievement, and schools can reduce the need for these either-or choices through other strategies, such as coupling early identification of such students with effective remedial education.

    • Some large-scale assessments are used to make high-stakes decisions about individual students, but most often in combination with other information, as recommended by the major professional and scientific organizations concerned with testing. For example, most school districts say they base promotion decisions on a combination of grades, achievement test scores, developmental factors, attendance, and teacher recommendations. As our study has shown, however, a number of jurisdictions have adopted policies that rely exclusively on achievement test scores to make high-stakes decisions. A test score, like other sources of information, is not exact. It is an estimate of the student's understanding or mastery at a particular time. Therefore, high-stakes educational decisions should not be made solely or automatically on the basis of a single test score but should also take other relevant information into account.

    • The preparation of students plays a key role in appropriate test use. It is not proper to expose students ahead of time to items that will actually be used on their test or to give students the answers to those questions. Test results may also be invalidated by teaching so narrowly to the objectives of a particular test that scores are raised without actually improving the broader set of academic skills that the test is intended to measure. The desirability of "teaching to the test" is affected by test design. For example, it is entirely appropriate to prepare students by covering all the objectives of a test that represents the full range of the intended curriculum. We therefore recommend that test users respect the distinction between genuine remedial education and teaching narrowly to the specific content of a test. At the same time, all students should receive sufficient preparation for the specific test so their performance will not be adversely affected by unfamiliarity with its format or by ignorance of appropriate test-taking strategies.

    • Accurate assessment of students with disabilities and English-language learners presents complex technical and policy challenges, in part because these students are particularly vulnerable to potential negative consequences when high-stakes decisions are based on tests. We recommend that policymakers pursue two key policy objectives in modifying tests and testing procedures for these special populations:

      (1) to increase such students' participation in large-scale assessments, in part so that school systems can be held accountable for their educational progress; and

      (2) to test each such student in a manner that provides appropriate accommodation for the effect of a disability or of limited English proficiency on the subject matter being tested, while maintaining the validity and comparability of test results among all students.

    These objectives are sometimes in tension, and the goals of full participation and valid measurement thus present serious technical and operational challenges to test developers and users.

    • The purpose of the proposed voluntary national tests (VNT) is to inform students (and their parents and teachers) about their performance in 4th-grade reading and 8th-grade mathematics relative to the standards of the National Assessment of Educational Progress and to performance in the Third International Mathematics and Science Study. The proposal does not suggest any direct use of VNT scores to make decisions about the tracking, promotion, or graduation of individual students, and the tests are not being developed to support those uses. However, states and school districts would be free to use scores on the voluntary national tests for these purposes. Given their design, the proposed voluntary national tests should not be used for decisions about the tracking, promotion, or graduation of individual students. The committee takes no position on whether the voluntary national tests are practical or appropriate for their primary stated purposes.

    • The committee sees a strong need for better evidence on the intended benefits and unintended negative consequences of using high-stakes tests to make decisions about individuals. A key question is whether the consequences of a particular test use are educationally beneficial for students—for example, by increasing academic achievement or reducing dropout rates. It is also important to develop statistical reporting systems of key indicators that will track both intended effects (such as higher test scores) and other effects (such as changes in dropout or special education referral rates). Indicator systems could include measures such as retention rates, special education identification rates, rates of exclusion from assessment programs, number and type of accommodations, high school completion credentials, dropout rates, and indicators of access to high-quality curriculum and instruction.


PROMOTING APPROPRIATE TEST USE

At present, professional norms and legal action (through administrative enforcement or litigation) are the principal mechanisms available to enforce appropriate test use. These mechanisms are inadequate. Compliance with provisions of the Joint Standards for Educational and Psychological Testing and the Code of Fair Testing Practices in Education is largely voluntary, and enforcement is often weak. Legal action is typically adversarial, time-consuming, and expensive, and applicable law can vary by jurisdiction, making enforcement uneven.

New methods, practices, and safeguards could take any of several forms, but in general they would appear at various points on a continuum between professional norms and legal enforcement, some less coercive, some more so. Deliberative forums, an independent oversight body, labeling, and federal regulation represent a range of possible options that could supplement professional standards and litigation as means of promoting and enforcing appropriate test use.

The committee is not recommending adoption of any particular strategy or combination of strategies, nor does it suggest that these four approaches are the only possibilities. We do think, however, that ensuring proper test use will require multiple strategies. Given the inadequacy of current methods, practices, and safeguards, there should be further research on these and other policy options to illuminate their possible effects on test use. In particular, we would suggest empirical research on the effects of these strategies, individually and in combination, on testing products and practice, and an examination of the associated potential benefits and risks.

Large-scale assessments, used properly, can improve teaching, learning, and equality of educational opportunity. That tests are sometimes used improperly should not discourage policymakers, teachers, and parents. Rather, it should motivate action to ensure that educational tests are used fairly and effectively. This report is a contribution to that essential work.

