
Suggested Citation:"7 Issues in Phasing Out Trend NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.



7 Issues in Phasing Out Trend NAEP

Michael J. Kolen

This paper considers ways in which the long-term trend National Assessment of Educational Progress (NAEP) can be phased out and replaced by the main NAEP while still maintaining a long-term trend line. Relevant history of NAEP is presented with a focus on those aspects that led to separating long-term trend NAEP and main NAEP. Differences between the two assessments are discussed, including differences in content, operational procedures, examinee subgroup definitions, analysis procedures, and results. Four designs for assessing long-term trends with NAEP are considered. Evaluation of these designs addresses how their implementation would affect main NAEP and the assessment of long-term trends. The paper concludes with recommendations for research and recommendations for the designs that should receive further consideration.

The recommendations focus on two designs. In one promising design, long-term trends are monitored with main NAEP, and overlapping main NAEP assessments are used whenever an assessment is modified. Implementation of this design requires extensive research. Because long-term trends are assessed with main NAEP in this design, modifications of main NAEP to reflect curricular changes must be tightly constrained. In another promising design, a separate long-term trend assessment is used that is periodically updated. This design can continue to provide long-term trends without an extensive research program. It also allows for main NAEP to change, as necessary, to reflect curricular changes. Drawbacks of this second design are that it requires continuing both the main NAEP and the long-term trend NAEP programs and it allows for only small changes in long-term trend NAEP.

INTRODUCTION

NAEP "is mandated by Congress to survey the educational accomplishments of U.S. students and to monitor changes in those accomplishments" (Ballator, 1996:1). Originally, NAEP surveyed educational accomplishments and long-term trends with a single assessment. Because of continual changes in the assessments, NAEP has evolved into a collection of state and national assessments. The main NAEP is designed to be flexible enough to adapt to changes in assessment approaches. The long-term trend NAEP is intentionally constructed and administered to be stable so that trends in student performance can be examined over time. Whereas both main NAEP and long-term trend NAEP focus on assessing achievement for the nation and for various subgroups of students, state NAEP, which is the most recent addition to NAEP, focuses on achievement of students by state. Main NAEP and long-term trend NAEP have distinct assessment exercises and administration procedures.

The National Assessment Governing Board (NAGB) oversees policy for the NAEP program and has called for NAEP to be redesigned (NAGB, 1996). One of NAGB's concerns involves the apparent inefficiency in continuing to maintain both main NAEP and long-term trend NAEP. To address this concern, NAGB (1996:10) has stated that "it may be impractical and unnecessary to operate two separate assessment programs. . . . A carefully planned transition shall be developed to enable 'the main National Assessment' to become the primary way to measure trends in reading, writing, mathematics, and science in the National Assessment program." Many individuals and committees have expressed concern that the transition suggested by NAGB might result in losing the currently available long-term trends (e.g., Jones, 1996; Glaser et al., 1996, 1997; National Research Council, 1996). In response to this concern, NAGB no longer plans to use main NAEP as the primary way to measure trends; however, there might be inefficiencies in having two programs.

This paper was commissioned by the National Research Council to discuss ways in which long-term trend NAEP could be phased out and replaced by the main NAEP assessments while still maintaining a long-term trend line. One significant question to be addressed is the following: How can a single assessment be developed that is stable enough to provide long-term trends while still being flexible enough to adapt to changes in assessment approaches? Another significant question is: How can such an assessment be implemented without losing the current long-term trend line?

The history of NAEP is considered with a focus on those aspects that led to separating long-term trend NAEP and main NAEP. Those aspects include changes to the NAEP purpose with the first redesign in the mid-1980s and problems that were encountered in measuring trends with the redesigned assessment, such as those involving the NAEP reading anomaly (Beaton and Zwick, 1990; Zwick, 1991). Relevant components of the current redesign effort are summarized, and characteristics of the current main NAEP and long-term trend NAEP assessments are compared on their content and administration procedures. This comparison facilitates a discussion of how the two assessments might be replaced by a single assessment.

Different designs for assessing long-term trends with NAEP are discussed next. These designs include ones that involve overlapping trend lines, such as those suggested by Glaser et al. (1997) and Forsyth et al. (1996). The evaluation of these designs includes considering how their implementation would affect the measurement of long-term trends as well as the effect on the main NAEP assessments. The paper concludes with recommendations for research to further evaluate the different design possibilities, along with recommendations about which designs should receive further consideration.

RELEVANT NAEP HISTORY

Jones (1996) presented a history of NAEP with a focus on procedural changes that occurred at various stages of its evolution. These stages include the original development of NAEP in the early 1960s, the first operational NAEP in 1969, the first redesign in the early 1980s, and the current redesign effort. The portions of Jones's discussion that are relevant to the relationship between main NAEP and long-term trend NAEP are summarized here.

The original "goals of NAEP were to report what the nation's citizens know and can do and then to monitor changes over time" (Jones, 1996:15). From the beginning, NAEP was intended to be a group-level assessment in which scores were not reported for individuals. However, significant changes have occurred in the assessment over time. Originally, performance was reported exercise by exercise, but by the time of the first redesign it was being reported on groups of exercises, often by objective.
Following the first redesign, exercises were scaled using item response theory (IRT) procedures, and average scale scores were reported instead of percentages correct by exercise or groups of exercises.

In the initial assessments, matrix sampling procedures were used in which different sets of exercises were given to different examinees. With these procedures, exercises were read aloud using tape-recorded presentations that minimized the effects of reading ability and served to pace the presentation of exercises to examinees. With the first redesign, a more efficient sampling design was used in which examinees in a given room were administered different sets of exercises, which resulted in elimination of tape-recorded presentations.

In the initial assessments, nearly everyone of ages 9, 13, and 17 was included in the sampling frame, but by the time of the redesign only individuals who were in school and of ages 9, 13, and 17 were assessed. Following the redesign, school grades 4, 8, and 12 replaced ages 9, 13, and 17 as the primary basis for sampling and reporting. Also, the procedures used for classifying individuals into population groups differed considerably over time (Barron and Koretz, 1996).

Jones (1996:17) reported that the content of the assessments began to change after the redesign, and "as curricular reform took center stage, NAEP began to be viewed as an agent for change. Exercises began to focus on desired curricula rather than on curricula already in place." In addition, Jones speculated that the use of the IRT scaling procedures following the redesign affected the content of the assessments. Following the redesign, fewer extremely easy or extremely difficult exercises were chosen for the assessment, so that booklets did not necessarily contain some very easy and very difficult exercises. A greater proportion of exercises were multiple choice. Also, there was pressure to restrict exercises to those with unidimensional properties to meet the assumptions of IRT.

To help maintain trend lines, "special 'bridge samples' were maintained when operational changes were introduced [to NAEP in the 1982, 1984, and 1986 assessments]. For bridge samples, conditions deliberately were kept similar to those of earlier assessments to appraise change in achievement from earlier assessments" (Jones, 1996:17). With the 1985-1986 assessment the reading achievement of 9- and 17-year-olds appeared to decline by more than a plausible amount from 1984 to 1986. Upon further study it was found that several changes in NAEP procedures, rather than actual changes in reading achievement, were responsible for the decline. The apparent decline in reading achievement is now known as the NAEP reading anomaly (Beaton and Zwick, 1990; Zwick, 1991).

Because of these problems, the main NAEP and long-term trend NAEP programs were separated following the 1985-1986 NAEP assessment. Main NAEP is allowed to adapt to changes in assessment approaches. Attempts are made to track short-term trends with main NAEP only when procedures are comparable from one assessment to the next.
Since 1985-1986, the long-term trend NAEP assessments have been similar to those of earlier years, using the same booklets, administration procedures, and definitions of examinee groups. Long-term trend NAEP has allowed for tracking of important trends by "studiously maintaining conditions of assessment that are sufficiently comparable over time to provide valid evidence about achievement change" (Jones, 1996:18).

The main NAEP and long-term trend NAEP assessments are not designed to produce state-level data. In 1990, 1992, and 1994 voluntary trial state NAEP assessments were conducted that produced state-level data to compare states to one another and to the nation. These assessments were considered to be trial assessments because of concerns about their usefulness. Potential benefits of state-level NAEP data have been summarized by Phillips (1991) and potential problems by Koretz (1991). The National Academy of Education (1993) panel that evaluated trial state NAEP recommended that it be continued but with ongoing evaluation and congressional oversight. In 1996 the term trial was removed from the title, and the assessments are now referred to as state NAEP.

The state NAEP assessments use representative subsets of main NAEP booklets. The two programs differ in administration procedures and other operational procedures, such as who is included in the assessments. In addition, state NAEP assesses only fourth- and eighth-grade students. Although NAGB (1996) has considered combining the state NAEP and main NAEP assessments to increase efficiency, these differences make combining them challenging (Mullis, 1997; Rust, 1996; Spencer, 1996). The use of state NAEP likely will increase pressure for changing the assessment's content because a wider group of people in states and school districts have a stake in NAEP. For this reason and because of operational complexities, a decision to combine state NAEP and main NAEP would complicate combining main NAEP and long-term trend NAEP.

DIFFERENCES BETWEEN MAIN NAEP AND LONG-TERM TREND NAEP

The main and long-term trend assessments administered between 1986 and 1997 and those planned for administration after 1997 are summarized in Table 7-1. As is evident from this table, main NAEP covers many more subject areas than long-term trend NAEP. From 1988 until the present, long-term trend NAEP has used nearly the same procedures and exercises in each assessment. In addition, there has been sufficient stability in content frameworks and procedures for long-term trend NAEP to allow for reporting long-term trends as far back as 1970 (Campbell et al., 1997). In contrast, main NAEP assessments have been allowed to differ from administration to administration so that results on one administration of main NAEP often are not comparable to those from previous administrations. In addition, the main NAEP and long-term trend NAEP assessments in the same subject-matter areas differ considerably in assessment content and operational procedures. Thus, results from the main NAEP and long-term trend NAEP assessments for the same subject area are not directly comparable.

TABLE 7-1 Main NAEP and Long-Term Trend NAEP Assessments by Year Since 1986^a

Year  Main NAEP                                            Long-Term Trend NAEP
1986  Reading, Mathematics, Science, Computer Competence   Reading, Mathematics, Science
1988  Reading, Writing, Civics, U.S. History               Reading, Writing, Mathematics, Science, Civics (ages 13 and 17 only)
1990  Reading, Mathematics, Science                        Reading, Writing, Mathematics, Science
1992  Reading, Writing, Mathematics                        Reading, Writing, Mathematics, Science
1994  Reading, U.S. History, Geography                     Reading, Writing, Mathematics, Science
1996  Mathematics, Science                                 Reading, Writing, Mathematics, Science
1997  Arts (grade 8 only)
1998  Reading, Writing, Civics
1999                                                       Reading, Writing, Mathematics, Science
2000  Mathematics, Science
2001  U.S. History, Geography
2002  Reading, Writing
2003  Civics, Foreign Language (grade 12 only)             Reading, Writing, Mathematics, Science
2004  Mathematics, Science
2005  World History, Economics
2006  Reading, Writing
2007  Arts                                                 Reading, Writing, Mathematics, Science
2008  Mathematics, Science

^a Assessments administered from 1986 to 1994 were adapted from Allen et al. (1996); small special-interest assessments are not shown. Assessments administered from 1996 to 2008 are from NAGB (1997). Future assessments reflect plans.

Barron and Koretz (1996) have summarized many of the differences between main NAEP and long-term trend NAEP in content, operational procedures, examinee subgroup definitions, analysis procedures, and results. Some of their major findings are discussed here. They reported that the content of the two assessments is different:

The trend assessments are based on content frameworks that were developed for the 1983-84 assessments in reading and writing or the 1985-86 assessments in mathematics and science. Since the development of these frameworks, substantial changes have occurred in the objectives that content experts believe teachers should emphasize. The current practice is to make the changes in the main NAEP assessment called for by content experts and supported by the National Assessment Governing Board, but to leave the trend assessment frameworks undisturbed. (Barron and Koretz, 1996:215)

They also reported that the exercise formats for long-term trend NAEP are mainly multiple choice, whereas main NAEP includes a much larger proportion of constructed-response exercises.

They reported that there are also many differences between the two assessments in operational procedures and definitions of examinee subgroups.
Main NAEP oversamples minority populations to allow for relatively precise subgroup comparisons. Oversampling of minorities is not done with long-term trend NAEP, which leads to "insufficiently precise" assessment of trends in minority-group performance (Barron and Koretz, 1996:214). In addition, main NAEP primarily uses grade-based sampling and reporting at grades 4, 8, and 12, whereas long-term trend NAEP primarily uses age-based sampling and reporting at ages 9, 13, and 17. Procedures for identifying minority groups differ in the two assessments. For example, for race "the variable used in the main assessment, called derived race because it combines information from multiple sources, gives priority to student-reported information about race and ethnicity. . . . The variable used in the trend assessment, called observed race . . . is simply the exercise administrator's judgment as to the racial-ethnic background of each student" (Barron and Koretz, 1996:226). In addition, the main NAEP assessments use a focused design, in which an examinee is administered exercises from a single subject area. In long-term trend NAEP an examinee is administered exercises from more than one subject-matter area. This difference in administration design leads to each student spending less time on a particular subject area in long-term trend NAEP than in main NAEP.

Similar analysis procedures are used for the two programs; however, "in recent years, the main assessment has used a far greater number of background variables in its conditioning" (Barron and Koretz, 1996:220). Furthermore, different score scales are used, which can create difficulties in comparing the two assessments. Performance levels are used in reporting performance for main NAEP but not for long-term trend NAEP.

The many differences between the two assessments could influence conclusions about student achievement in the United States, both at a given time and in trends over time. For example, Barron and Koretz (1996:241-242) have speculated that "trends likely would have been somewhat different if the trend assessment had more closely resembled the current main assessment [in content]," that "the use of age-defined rather than grade-defined samples appears to be influencing both the overall trend line and the trends for specific population groups," that "differences in the method for grouping students into population groups . . . had major effects on the classification of Hispanic students," and that "overall trends for populations as a whole might be different if the trend assessment had a mix of formats more similar to that of the main NAEP assessment."

In certain situations main NAEP has given different results than long-term trend NAEP. In an example provided by Barron and Koretz (1996), the main NAEP assessments indicated greater relative gains in writing achievement in high school than did the long-term trend NAEP writing assessment.
To provide a more recent example, the difference between males' and females' scores on the 1996 main NAEP science assessment is compared to the difference on the 1996 long-term trend NAEP science assessment at selected percentiles. Tables 7-2 and 7-3 provide the results used to make this comparison. Because the two assessments are reported on different metrics, the differences were standardized using the semi-interquartile range for the total group, Q = (P75 - P25)/2. (The standard deviation could not be used to standardize the differences because it was not reported by O'Sullivan et al. [1997]. In addition, Q may be preferable to the standard deviation for standardizing percentiles because it is a percentile-based statistic.) As shown in Figure 7-1, the standardized differences are larger on long-term trend NAEP than on main NAEP at all percentiles and grades. Although it is difficult to determine the cause of this difference, it is possible that the greater use of multiple-choice exercises on the long-term trend NAEP assessment than on the main NAEP assessment is partly responsible.

In summary, main NAEP and long-term trend NAEP differ in content, exercise types, subgroup definitions, operational procedures, and analysis procedures. Although these differences likely affect assessment results, it is difficult to tell exactly how.

TABLE 7-2 Differences in Selected Percentiles Between Males and Females in Main NAEP Science^a

            All     Male    Female   Difference   Difference/Q
Grade 4
  P10       105     105     105       0            .000
  P25       130     130     129       1            .046
  P50       153     154     152       2            .093
  P75       173     175     172       3            .140
  P90       190     191     188       3            .140
Grade 8
  P10       104     103     104      -1           -.043
  P25       128     128     128       0            .000
  P50       153     154     151       3            .130
  P75       174     175     172       3            .130
  P90       192     194     190       4            .174
Grade 12
  P10       104     103     105      -2
  P25       128     129     127       2
  P50       152     155     150       5
  P75       174     178     171       7
  P90       192     196     187       9

^a Percentiles are from O'Sullivan et al. (1997); Q = (P75 - P25)/2.

TABLE 7-3 Differences in Selected Percentiles Between Males and Females in Long-Term Trend NAEP Science^a

            All     Male    Female   Difference   Difference/Q
Age 9
  P10       174.5   176.5   172.3     4.2
  P25       201.3   202.9   200.0     2.9
  P50       231.0   232.7   229.6     3.1
  P75       258.9   262.1   256.2     5.9
  P90       283.6   286.9   279.0     7.9
Age 13
  P10       205.3   208.9   202.4     6.5          .248
  P25       230.4   233.9   227.6     6.3          .240
  P50       257.7   262.4   253.6     8.8          .335
  P75       282.9   288.6   277.3    11.3          .430
  P90       304.4   309.3   298.4    10.9          .415
Age 17
  P10       235.1   234.0   235.8    -1.8         -.059
  P25       265.9   268.9   263.3     5.6          .182
  P50       298.2   303.9   293.3    10.6          .345
  P75       327.3   333.2   321.7    11.5          .375
  P90       351.7   358.6   344.1    14.5          .472

^a Percentiles are from Campbell et al. (1997); Q = (P75 - P25)/2.
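The standardization behind the Difference/Q column of Tables 7-2 and 7-3 can be sketched in a few lines. This is a minimal illustration, not the author's own computation: the function names are invented for this example, and the inputs are the rounded Grade 8 percentiles printed in Table 7-2, so recomputed values may differ in the third decimal place from published figures that were based on unrounded percentiles.

```python
def semi_interquartile_range(p25, p75):
    """Q = (P75 - P25) / 2, computed from the total-group percentiles."""
    return (p75 - p25) / 2.0


def standardized_differences(all_pcts, male_pcts, female_pcts):
    """Return (male - female) / Q at each tabled percentile.

    Each argument maps a percentile rank (10, 25, ...) to a scale score.
    Q always comes from the total group, as in the tables.
    """
    q = semi_interquartile_range(all_pcts[25], all_pcts[75])
    return {p: (male_pcts[p] - female_pcts[p]) / q for p in sorted(all_pcts)}


# Grade 8 main NAEP science percentiles as printed in Table 7-2.
grade8_all = {10: 104, 25: 128, 50: 153, 75: 174, 90: 192}
grade8_male = {10: 103, 25: 128, 50: 154, 75: 175, 90: 194}
grade8_female = {10: 104, 25: 128, 50: 151, 75: 172, 90: 190}

grade8_result = standardized_differences(grade8_all, grade8_male, grade8_female)
for percentile, value in grade8_result.items():
    print(f"P{percentile}: {value:+.3f}")
```

Here Q = (174 - 128)/2 = 23, so the P90 entry is 4/23, or about .174, which agrees with the tabled value. Because Q is built from percentiles of each assessment's own scale, the resulting ratios are unit-free, which is what allows the main NAEP and long-term trend results to be compared side by side.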

FIGURE 7-1 Standardized male-female differences in 1996 long-term trend NAEP and main NAEP selected percentiles. [Figure not reproduced. Three panels (Age 9 or Grade 4; Age 13 or Grade 8; Age 17 or Grade 12) plot the standardized male - female difference, on a vertical axis running from -0.10 to 0.50, against selected percentiles (10, 25, 50, 75, 90), with one line for long-term trend NAEP and one for main NAEP.]

ALTERNATIVE DESIGNS FOR MAIN NAEP AND LONG-TERM TREND

The current design for NAEP involves using main NAEP to assess current achievement and long-term trend NAEP to monitor trends. Main NAEP is allowed to change to reflect current thinking in education. Long-term trend NAEP has remained the same since the mid-1980s; even the same exercises are used from one long-term trend NAEP assessment to the next. In this section, alternative designs are discussed for the main NAEP and long-term trend NAEP assessments.

Design 1: Keep the Current Design

One possibility is to continue with the current design, which for long-term trend NAEP uses the same exercises and operational procedures from one assessment to the next. Even this tightly constrained design runs the risk, over time, that certain exercises will change in how they function. When such changes occur, the assessment of long-term trends in proficiency is threatened. As Zwick (1992:207-208) has pointed out, one "pitfall of preserving portions of the assessment is that, in the case of some items, the relation of item performance to overall proficiency . . . may be altered because of curricular and societal changes." She discussed an example of an exercise on acid rain from the NAEP science assessment that was included on the 1978, 1982, 1986, and 1988 assessments. Presumably, because of the increased exposure of the problem of acid rain in the news media, rather than increases in science proficiency, this exercise became easier. Situations might also occur in which the content of certain exercises becomes dated, resulting in exercises becoming more difficult over time, even though the proficiency being measured by the assessment does not decrease. Zwick (1992:208) concluded that "an item that remains the same across assessments in a superficial sense may nevertheless function differently as a measure of proficiency."
In addition, the content of an assessment can become less relevant as a measure of achievement in current curricula. As curricula change, certain aspects that are reflected in a particular assessment might come to be emphasized less or not at all. In addition, new aspects may be introduced that could not possibly have been included in an earlier assessment. Presumably, these sorts of changes in curricular emphasis have been behind the frequent changes that have occurred in main NAEP, which often have made it difficult to measure even short-term trends with main NAEP.

Goldstein (1983) concluded that it is difficult to separate changes in particular exercises from changes in the proficiency being measured. He reasoned that, if certain exercises become easier over time and other exercises more difficult (which is likely to be the case with almost any assessment over a long enough period), measuring absolute trends in achievement might not be useful. Due to these difficulties, Goldstein concluded that, over time, focusing on relative comparisons would be more useful than focusing on absolute comparisons. For example, the differences between males and females in science achievement might be examined to ascertain if the gap is narrowing. Such a comparison could be made, even if the assessments given at different times are not directly comparable in their content.

Despite the concerns discussed by Goldstein, NAEP has continued to track what he refers to as absolute trends in achievement. Jones (1996:20) concluded that "the primary worth of NAEP has been as a monitor of changes in achievement for the nation." To maintain trend lines with long-term trend NAEP, the exercises have remained the same. However, for each assessment, analyses are conducted to ascertain whether the exercises are functioning in the same way as in previous assessments. Exercises are excluded from long-term trend assessments for reasons that include being very difficult, having poor fit to the IRT model, and showing large changes in parameter estimates from previous assessments (Allen et al., 1996). Although these procedures can help maintain trend lines, it can become difficult to separate actual changes in proficiency from changes in the functioning of particular exercises. Also, as stated previously, curricula might change so much that the relevance of the long-term trend assessment to current curricula becomes questionable. For these reasons, some changes in the long-term trend assessments are inevitable if the assessments are to provide educationally relevant information.

One other concern about long-term trend NAEP is that it does not take into account recent advances in data analysis procedures, such as the extensive use of conditioning variables and updated subgroup definitions. To maintain stability, long-term trend NAEP continues to use procedures developed in the 1980s. Zwick (1992:206) asked, "How can NAEP maintain continuity while staying current?"
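The idea of screening exercises for large changes in parameter estimates between assessments can be illustrated with a very simple check. The sketch below is an assumption-laden toy, not the procedure documented by Allen et al. (1996): it assumes IRT difficulty estimates from two assessments are already on a common scale, the function name and the 0.5 threshold are hypothetical, and the exercise labels and numbers are made up for illustration (the "acid_rain" label merely echoes Zwick's example of an exercise that became easier).

```python
def flag_drifting_exercises(previous_b, current_b, threshold=0.5):
    """Flag exercises whose difficulty estimate moved by more than `threshold`.

    previous_b, current_b: dicts mapping an exercise label to its IRT
    difficulty estimate in the earlier and later assessment (same scale).
    Returns a list of (label, change) pairs; a negative change means the
    exercise became easier. The threshold is a hypothetical cutoff.
    """
    flagged = []
    for label in sorted(previous_b):
        if label not in current_b:
            continue  # exercise was not readministered; nothing to compare
        change = current_b[label] - previous_b[label]
        if abs(change) > threshold:
            flagged.append((label, change))
    return flagged


# Made-up difficulty estimates for three exercises in two assessments.
previous = {"acid_rain": 0.40, "exercise_02": -0.20, "exercise_03": 1.10}
current = {"acid_rain": -0.35, "exercise_02": -0.15, "exercise_03": 1.05}

flagged = flag_drifting_exercises(previous, current)
for label, change in flagged:
    print(f"{label}: difficulty changed by {change:+.2f}")
```

A screen like this only identifies which exercises moved; it cannot say whether the movement reflects a change in the exercise or a change in proficiency, which is the difficulty Goldstein raised.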
As suggested in this section, addressing Zwick's question should take into account the content of the assessments, the operational procedures for administering the assessments, and the societal context in which the assessments are made. This paper now explores alternative designs that might be used.

Design 2: Periodically Update Long-Term Trend NAEP While Maintaining Main NAEP

One possible change in the design of NAEP allows for relatively small modifications in the content of long-term trend NAEP while still maintaining both long-term trend NAEP and the main NAEP. With this design, main NAEP could continue to evolve to reflect curricular trends, unimpeded by the necessity to maintain long-term trends. However, unlike the current design, periodic modest changes are allowed in the content of long-term trend NAEP in an attempt to avoid problems associated with "the relation of item performance to overall proficiency . . . [being] altered because of curricular and societal changes" (Zwick, 1992:208). In this design the current long-term trend NAEP would continue to be used, with small modifications allowed, and the operational conditions of the long-term trend NAEP assessment would remain consistent over time. However, this design allows for replacement of some of the exercises used in the long-term trend NAEP assessment to avoid many of the problems identified by Zwick.

This design for long-term NAEP has many similarities to the designs of other large-scale assessment programs that use alternate forms of assessments for reasons of security. The ACT Assessment (ACT, 1997) and SAT (Donlon, 1984), which are used for college admissions purposes, are among the many assessments that use alternate forms. In these assessments, different exercises are used on each administration. Careful development procedures involving tight specifications are used to ensure that the alternate forms each measure the same constructs in similar ways. Although efforts are made to build alternate forms to be approximately equal in difficulty, equating procedures are used to adjust for the small differences in difficulty that are present (Kolen and Brennan, 1995). The procedures in these assessment programs are used to ensure that scores on the alternate forms can be used interchangeably regardless of the time at which the examinee is assessed or the particular alternate form that is administered. Used in tandem, the assessment development and equating procedures allow for comparing scores and assessing trends, even when completely different assessment exercises are used at different times. The general concepts of developing alternate forms of an assessment and equating could be used in a new long-term trend NAEP assessment design.

One difference between NAEP and assessments that routinely use equating processes is that NAEP uses a set of booklets, with different students administered different booklets.
This type of design is made possible because group- level scores are reported, with no scores being reported to individual examiners. To consider equating processes with NAEP, an alternate form of NAEP assess- ment is defined as the set of booklets that are administered to examiners in an assessment. Using this idea, assessment specifications for NAEP are defined at the level of the set of booklets. To use an equating process with alternate NAEP forms (i.e., alternate sets of NAEP booklets), content specifications need to be developed and defined at the level of a set of NAEP booklets. Such specifica- tions present the content, skills, and exercise types to be included in sufficient detail to ensure that the alternate NAEP forms measure the same educational constructs in the same way. Statistical specifications need to be developed so that the alternate NAEP forms are of nearly the same difficulty. An equating process for long-term NAEP could involve randomly assigning students to take old and new assessments. Alternatively, a set of exercises from a previous assessment could be used as part of the new assessment. If used, this set of common exercises fully represents the content of the total assessment so that it serves to link one assessment to the next. By treating sets of NAEP

booklets as alternate forms, the procedures for designing equating studies discussed in Kolen and Brennan (1995) apply.

An equating process could accommodate removing exercises from a long-term trend NAEP assessment when they become dated or when, as Zwick (1992) has pointed out, the relationship of exercise performance to overall achievement changes over time. Also, exercises could be removed if security concerns arise pertaining to particular exercises on a NAEP assessment. An equating process can tolerate periodic updating of content as long as the updating does not affect the constructs being measured. For example, with the ACT assessment, "curriculum study is ongoing . . . . ACT assessment tests are reviewed on a periodic basis" (ACT, 1997:4). ACT accommodates some changes to the content of the assessments within the context of the process of equating alternate forms.

The measurement of long-term trends in NAEP using an equating process depends on developing tight assessment specifications that allow for the development of alternate forms of long-term trend NAEP. The specifications should remain stable over time, with only modest updating of the specifications allowed. The context in which the common exercises appear needs to be constant from one assessment to the next, and the operational procedures used for the assessment need to be preserved from one assessment to the next. In addition, with this design, sample sizes for minorities should be increased to address the concern expressed by Barron and Koretz (1996) that assessment of trends for minorities is not sufficiently precise.

One major limitation is that this design cannot directly accommodate major changes in specifications or frameworks. For example, if the frameworks for long-term trend NAEP were updated to be much more similar to those for the current main NAEP, it would not be possible to equate the resulting long-term trend NAEP to the previous one.
In this event, special studies would be required to link the two assessments if long-term trends were to be followed from one long-term trend assessment to another.

Design 3: Eliminate Long-Term Trend NAEP and Use Main NAEP for Trend Assessment

NAEP faces two formidable challenges if long-term trend NAEP is eliminated. First, the existing long-term trend comparisons for NAEP need to be preserved. As described earlier, main NAEP has evolved substantially and now is quite different from long-term trend NAEP. A study that links main NAEP to long-term trend NAEP might be used to preserve trends. Second, if main NAEP is used to assess trends in NAEP, it needs to be more stable than it has been in the past. For long-term trends to be preserved when substantial revisions are made to main NAEP, the revised assessment needs to be linked to the previous ones. These linking studies are much more challenging to conduct than equating studies because the assessments differ. In the Mislevy (1992) and Linn (1993) terminology, the processes of projection or statistical moderation would be used in these

linking studies. Suggestions for how the data might be collected to conduct these linkages are described later in this section. The linkages that result from these processes are considerably weaker than equatings because of the substantial differences in the content of the assessments.

The major differences between the current long-term trend NAEP and main NAEP assessments that were described by Barron and Koretz (1996) and summarized earlier in this paper present significant challenges to linking these assessments. Along with differences in content and exercise types, these include differences between the two assessments in operational and analysis procedures. For example, as discussed earlier, the main NAEP assessment uses "derived race," whereas the long-term trend assessment uses "observed race." Barron and Koretz (1996) suggested that differences in subgroup definitions could affect the classification of examinees to subgroups. Another related issue is that main NAEP assesses students at fourth, eighth, and twelfth grades, whereas long-term trend NAEP assesses individuals at ages 9, 13, and 17.

The first step in eliminating long-term trend NAEP using this design is to estimate the effect of subgroup and age/grade definitions on long-term trend NAEP. In a single year, long-term trend NAEP would be conducted using both the current long-term trend NAEP subgroup definitions and the current main NAEP subgroup definitions. Independent examinee samples could be used for this study. This linking study estimates the effect of changes in subgroup definitions on long-term NAEP trends. For example, this study estimates what the long-term trends would have been had "derived race" been used instead of "observed race" with long-term trend NAEP. Similar estimates are made of long-term trends for grade groups instead of for the age groups typically used with long-term trend NAEP.

In a second study, long-term trend NAEP is linked to main NAEP.
In the same year as the first study, the main NAEP assessment could be conducted using a group of examinees that is independent of the group used in the long-term trend linking study. This study would be used to adjust for the effects of content differences, differences in exercise types, and differences in administration conditions (e.g., tape-recorded, paced administrations in long-term trend NAEP as compared to main NAEP conditions that are self-paced).

The results of these studies could be analyzed in two ways. In one, main NAEP is placed on the long-term trend NAEP scale, with trends continuing to be reported on the long-term trend NAEP scale. Following this process, main NAEP is reported on two scales: the main NAEP scale to report current NAEP performance and the long-term trend NAEP scale to report long-term trends. The other possibility is to place previous NAEP trend assessments on the current main NAEP scale. This second possibility involves reporting both long-term trends and current proficiency on a single scale, which might cause less confusion in assessment interpretation. This design is summarized in Figure 7-2, with the

arrows going from left to right to suggest that the long-term trend NAEP is placed on the main NAEP scale.

FIGURE 7-2 Studies for linking long-term trend NAEP to main NAEP. [Diagram: Linking Study 1 connects long-term trend NAEP with original subgroup definitions to long-term trend NAEP with main NAEP subgroup definitions; Linking Study 2 connects the latter to main NAEP.]

Even if these studies were conducted, certain conceptual issues need resolution. For example, effects of content differences between the two assessments are estimated using linking study 2. Implicit in this study is an assumption that the effects of content differences estimated in the year the study is conducted also hold for previous years (at least after controlling for year-to-year differences in distributions of examinees within subgroups). It is possible that substantive changes in education that occur between assessment cycles could affect the results of the linking. This assumption could be assessed only by repeating the design over multiple years. A decision needs to be made about whether interest is in estimating subgroup differences on main NAEP or subgroup differences on the previous long-term trend NAEP. If, as implied by Figure 7-2, all NAEP data are reported on the scale of the current main NAEP assessment, the trends are estimated for the current main NAEP assessment. Clearly, estimating such trends entails strong statistical assumptions, since main NAEP was not administered in previous years. As suggested by the results of Barron and Koretz (1996) and the NAEP science data results presented in Figure 7-1 here, the decisions that are made could affect the trends reported for various subgroups.

The analyses associated with these designs are complicated, methodology for analyzing the data and estimating trends needs to be developed, and an extensive research program is required.
The research program might be initiated using data that already exist from years in which main NAEP and long-term trend NAEP were administered in the same subject-matter area in the same year. However, only preliminary studies of methodology could be conducted, unless data exist that allow for assessing the effects of changes in subgroup definitions, as would be investigated in linking study 1 of Figure 7-2.
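
As a rough illustration of how linking studies 1 and 2 might fit together, the following Python sketch places hypothetical past long-term trend subgroup means on a main NAEP scale. It rests on two simplifying assumptions that are not taken from the paper: that the subgroup-definition effect from linking study 1 can be summarized as a simple mean shift, and that linking study 2 can be approximated by a linear link that matches means and standard deviations across independent samples. The methodology that would actually be needed remains to be developed, and all scores, years, and function names below are hypothetical.

```python
from statistics import mean, pstdev


def linear_link(from_scores, to_scores):
    """Return a function mapping the 'from' scale onto the 'to' scale
    by matching the mean and standard deviation of the two samples."""
    slope = pstdev(to_scores) / pstdev(from_scores)
    intercept = mean(to_scores) - slope * mean(from_scores)
    return lambda x: slope * x + intercept


# Hypothetical single-year data for one subgroup (independent samples).
ltt_original_defs = [280.0, 290.0, 300.0, 310.0, 320.0]  # LTT, original subgroup definitions
ltt_main_defs = [282.0, 292.0, 302.0, 312.0, 322.0]      # LTT, main NAEP subgroup definitions
main_naep = [140.0, 145.0, 150.0, 155.0, 160.0]          # main NAEP, same year

# Linking study 1 (assumed form): effect of the subgroup definition as a mean shift.
definition_shift = mean(ltt_main_defs) - mean(ltt_original_defs)

# Linking study 2 (assumed form): linear link from the LTT scale to the main NAEP scale.
ltt_to_main = linear_link(ltt_main_defs, main_naep)

# Hypothetical past LTT subgroup means reported under the original definitions.
past_ltt_means = {1990: 296.0, 1996: 300.0}

# Apply both adjustments, assuming the estimated effects also hold in earlier years.
linked_means = {
    year: ltt_to_main(value + definition_shift)
    for year, value in past_ltt_means.items()
}

for year, value in sorted(linked_means.items()):
    print(year, round(value, 1))
```

Applying adjustments estimated in a single year to earlier years is exactly the kind of strong assumption that would have to be checked by repeating the studies over multiple years.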

For this design, linking studies 1 and 2 are conducted once. According to NAGB (1996:15), "test frameworks and test specifications developed for the National Assessment generally shall remain stable for at least ten years." When major changes are made, however, a linking study, similar to linking study 2 in Figure 7-2, is needed to link the new main assessment to the previous one. NAGB (1996:15) also stated that "in rare circumstances, such as where significant changes in curricula have occurred, the National Assessment Governing Board may consider making changes to test frameworks and specifications before ten years have elapsed." In such circumstances, linking studies are needed more often than every 10 years.

Linkings such as those described above are much weaker than equatings. Similar linkings have produced useful results in other assessment programs. For example, when ACT revised the ACT assessment in 1989, the new version was linked to the previous one (Brennan, 1989) for the English, mathematics, and composite scores. The linking was used to maintain trend lines and to help colleges update cut scores. However, linking studies require strong statistical assumptions, and it is always possible that the tracking of long-term trends could be disrupted if the assumptions fail to hold. An extensive research program that involves development of methodology and empirical research is needed before NAEP adopts this linking design.

Design 4: Eliminate Long-Term Trend NAEP and Maintain Two Main NAEPs for Trend Assessment

Zwick (1992) discussed maintaining an old and new main NAEP assessment for some time whenever the NAEP was substantially revised. Forsyth et al. (1996) and Glaser et al. (1997) expanded on Zwick's idea and suggested that at least two main assessments be used at a time so as not to lose trends developed with the previous assessment.
In the design suggested by Forsyth et al., the different main assessments are linked in some way to help maintain long-term trends, although they did not describe how to conduct the linking. Compared to the previous design that uses a single main NAEP assessment, the use of overlapping assessments with overlapping trends provides some insurance against problems with links. If the linking methodology does not work properly, a few administrations could be used to establish the linkages.

In most other respects, however, this design has the same problems as the previous one. Main NAEP is still linked to long-term trend NAEP. New main assessments are still linked to previous main assessments whenever there is a major change in the assessments. The same sorts of conceptual issues remain, such as the reporting metric for trends, and how to estimate subgroup trends. As with the previous design, an extensive research program is needed to study procedures for conducting the linking. Unlike the previous design, this one requires

that multiple assessments be maintained, and it has the potential to create confusion because multiple reporting metrics will be used at any given time.

CONCLUSIONS AND RECOMMENDATIONS

Regardless of which design is used, changes in the context of NAEP continue to threaten any long-term trend NAEP assessment. For example, if NAEP were to become a high-stakes assessment, widespread teaching to NAEP might threaten long-term trend assessment (Zwick, 1992). In a similar vein, Jones (1996:19) expressed concern that, with the adoption of state NAEP, "if NAEP materials were to be used for high-stakes assessment at the level of districts or schools within states, [could] threaten not only the comparability of national and state results with earlier findings, but also the integrity of findings from any current assessment." Jones also expressed concern that measurement of NAEP trends could be made impossible if ways were found to increase student motivation on NAEP. The proposed Voluntary National Test could have similar effects. These sorts of changes in the context of NAEP would directly affect main NAEP, but might not affect a separate long-term trend assessment. Therefore, designs that assess trends using a separate long-term trend NAEP (Designs 1 and 2 presented here) could be more robust to changes in the context of NAEP than are the designs that use main NAEP to assess long-term trends (Designs 3 and 4 presented here).

Two questions were posed in the first section of this paper: How can a single assessment be developed that is stable enough to provide long-term trends while still being flexible enough to adapt to changes in assessment approaches? How can such an assessment be implemented without losing the current long-term trend line?
Only two of the four designs discussed address both of these questions: Design 3 (eliminate long-term trend NAEP and use main NAEP for trend assessment) and Design 4 (eliminate long-term trend NAEP and maintain two main NAEPs for trend assessment).

Both designs require conducting complex linking studies, making strong statistical assumptions, and being supported by an extensive research program for developing linking procedures that work in the NAEP context. The outcome of this research program is difficult to predict. Possibly, procedures could be developed that allow for linking assessments as different as long-term trend NAEP and main NAEP or as different as new and old main NAEP. However, it is also possible that the results of the research will indicate that changes in main NAEP assessments need to be much more tightly constrained than is presently the case. A research program could begin with existing long-term trend NAEP data and main NAEP data for those years in which the two assessments were administered during the same year in the same subject areas. However, special data collections certainly are needed in the process of developing the necessary linking procedures. Although safer than Design 3 in that trends will not be lost as easily

because of the use of overlapping assessments, Design 4 requires that two assessments be maintained.

Another potentially useful alternative is Design 2: periodically update long-term trend NAEP while maintaining main NAEP. In this design, main NAEP is allowed to change, in small ways, to better reflect current curricula. The design requires that assessment specifications be developed to ensure that the alternate forms of long-term trend NAEP measure the same constructs in similar ways. It improves on current procedures by allowing for the introduction of new exercises but still provides stable estimation of long-term trends. No extensive research program is required to develop and evaluate new linking methodology. Instead, equating designs that have been used extensively in a variety of assessment programs are used to ensure that long-term trends can be maintained. As suggested earlier in this section, this design might be more robust than Designs 3 or 4 to changes in the context of NAEP assessments. For these reasons, even though it does not eliminate the separate long-term trend NAEP and though it requires maintaining the current long-term trend NAEP, Design 2 deserves further consideration.

ACKNOWLEDGMENTS

The author thanks Roderick Little and two anonymous reviewers for their comments on a draft of this paper.

REFERENCES

ACT. 1997. ACT Assessment Technical Manual. Iowa City, Iowa: ACT.
Allen, N.L., D.L. Kline, and C.A. Zelenak. 1996. The NAEP 1994 Technical Report. Washington, D.C.: National Center for Education Statistics.
Ballator, N. 1996. The NAEP Guide, Revised Edition. Washington, D.C.: National Center for Education Statistics.
Barron, S.I., and D.M. Koretz. 1996. An evaluation of the robustness of the National Assessment of Educational Progress trend estimates for racial-ethnic subgroups. Educational Assessment 3(3):209-248.
Beaton, A.E., and R. Zwick. 1990. The Effect of Changes in the National Assessment: Disentangling the NAEP 1985-86 Reading Anomaly. No. 17-TR-21. Princeton, N.J.: Educational Testing Service.
Brennan, R.L., ed. 1989. Methodology Used in Scaling the ACT Assessment and P-ACT+. Iowa City, Iowa: ACT.
Campbell, J.R., K.E. Voelkl, and P.L. Donahue. 1997. NAEP 1996 Trends in Academic Progress. Washington, D.C.: National Center for Education Statistics.
Donlon, T., ed. 1984. The College Board Technical Handbook for the Scholastic Aptitude Test and Achievement Tests. New York: College Entrance Examination Board.

Forsyth, R., R. Hambleton, R. Linn, R. Mislevy, and W. Yen. 1996. Design Feasibility Team Report to the National Assessment Governing Board. Washington, D.C.: National Assessment Governing Board.
Glaser, R., R. Linn, and G. Bohrnstedt. 1996. Letter to Roy Truby from the National Academy of Education panel on the evaluation of the NAEP trial state assessment project. February 23.
Glaser, R., R. Linn, and G. Bohrnstedt. 1997. Assessment in Transition: Monitoring the Nation's Educational Progress. Stanford, Calif.: National Academy of Education.
Goldstein, H. 1983. Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement 20(4):369-377.
Jones, L.V. 1996. A history of the National Assessment of Educational Progress and some questions about its future. Educational Researcher 25(7):15-22.
Kolen, M.J., and R.L. Brennan. 1995. Test Equating: Methods and Practices. New York: Springer-Verlag.
Koretz, D.M. 1991. State comparisons using NAEP: Large costs, disappointing benefits. Educational Researcher 20(3):19-21.
Linn, R.L. 1993. Linking results of distinct assessments. Applied Measurement in Education 6:83-102.
Mislevy, R.J. 1992. Linking Educational Assessments: Concepts, Issues, Methods, and Prospects. Princeton, N.J.: ETS Policy Information Center.
Mullis, I.V.S. 1997. Optimizing State NAEP: Issues and Possible Improvements. Report commissioned by the NAEP Validity Studies Panel. Palo Alto, Calif.: American Institutes for Research.
National Academy of Education. 1993. The Trial State Assessment: Prospects and Realities. Stanford, Calif.: National Academy of Education.
National Assessment Governing Board (NAGB). 1996. Policy Statement on Redesigning the National Assessment of Educational Progress. Washington, D.C.: NAGB.
National Assessment Governing Board (NAGB). 1997. Schedule for the National Assessment of Educational Progress. Washington, D.C.: NAGB.
National Research Council. 1996. Evaluation of "Redesigning the National Assessment of Educational Progress." Committee on Evaluation of National and State Assessments of Educational Progress, Board on Testing and Assessment. Washington, D.C.: National Academy Press.
O'Sullivan, C.Y., C.M. Reese, and J. Mazzeo. 1997. NAEP 1996 Science Report Card for the Nation and the States. Washington, D.C.: National Center for Education Statistics.
Phillips, G.W. 1991. Benefits of state-by-state comparisons. Educational Researcher 20(3):17-19.
Rust, K. 1996. Sampling issues for redesign. Memorandum to Mary Lyn Bourque, NAGB, May 9.
Spencer, B. 1996. Combining State and National NAEP. Paper prepared for the evaluation of state NAEP conducted by the National Academy of Education.

Zwick, R. 1991. Effects of item order and context on estimation of NAEP reading proficiency. Educational Measurement: Issues and Practice 10:10-16.
Zwick, R. 1992. Statistical and psychometric issues in the measurement of educational achievement trends: Examples from the National Assessment of Educational Progress. Journal of Educational Statistics 17(2):205-218.


