8
Issues in Combining State NAEP and
Main NAEP
Michael J. Kolen
Separate data collections are used in the main National Assessment of Edu-
cational Progress (NAEP) and the state NAEP. To address concerns that the
separate data collections might place too large a burden on the states, this paper
examines options for combining main and state NAEP designs. State NAEP is
described, and important differences between main NAEP and state NAEP are
highlighted. Designs are discussed that have been proposed for merging the two
data collections. The focus of these discussions is on how the sample designs
interact with operational and measurement concerns. Conclusions and recom-
mendations are presented. Significant administration differences between main
NAEP and state NAEP exist, which make combining difficult. These differences
currently are addressed by adjusting state NAEP scores. It is argued that even
with these adjustments contradictory findings and complications are apparent,
especially when making the criterion-referenced interpretations of NAEP scores.
The administration differences also make implementation of any of the designs
for combining main NAEP and state NAEP questionable. Suggestions are made
to consider using the same recruitment and administration conditions for main
NAEP and state NAEP. The strengths and weaknesses of various designs for
combining main and state NAEP are discussed.
INTRODUCTION
NAEP "is mandated by Congress to survey the educational accomplishments
of U.S. students and to monitor changes in those accomplishments" (Ballator,
1996:1). Originally, NAEP surveyed educational accomplishments and long-term
trends with a single assessment. Because of continual changes in the assessments,
NAEP has evolved into a collection of state and national assessments.
Main NAEP is designed to be flexible enough to adapt to changes in assessment
approaches. Long-term trend NAEP is intentionally constructed and adminis-
tered to be stable so that trends in student performance can be examined over
time. Whereas main NAEP and long-term trend NAEP focus on assessing
achievement for the nation and for various subgroups of students, state NAEP,
which is the most recent addition to NAEP, focuses on achievement of students
by state.
The National Assessment Governing Board (NAGB) oversees policy for the
NAEP program and has called for NAEP to be redesigned (NAGB, 1996). NAGB
has expressed concern about the burden placed on states involved in having
separate state NAEP and main NAEP data collections. To address this concern,
NAGB (1996:7) has stated that, "where possible, changes in national and state
sampling procedures shall be made that will reduce [the] burden on states, increase
efficiency, and save costs." As part of its evaluation of NAEP, the National
Research Council commissioned this paper to examine options for combining
main and state NAEP designs.
This paper starts by describing state NAEP and highlighting important dif-
ferences between main NAEP and state NAEP. A discussion follows of designs
that have been proposed for merging the two data collections either by first
selecting a national sample and then building state samples or by selecting state
samples and then determining which subset of those data could serve as the
national sample. The focus of these discussions is on how the sample designs
interact with operational and measurement concerns. Finally, conclusions and
recommendations are presented.
COMPARISON OF MAIN NAEP AND STATE NAEP
The main NAEP and long-term trend NAEP assessments were not designed
to produce state-level data. To explore the possibility of NAEP providing data at
the state level, in 1990, 1992, and 1994 voluntary trial state NAEP assessments
were conducted that produced state-level data to compare states to one another
and to the nation as a whole. These assessments were considered to be trial
assessments because of concerns about their usefulness. Potential benefits of
state-level NAEP data are summarized by Phillips (1991) and potential problems
by Koretz (1991) and Jones (1996). The National Academy of Education Panel
(1993) that evaluated trial state NAEP recommended that it be continued but with
ongoing evaluation and congressional oversight. In 1996 the term trial was
removed from the title, and the assessments are now referred to as state NAEP.
Recently, others have discussed issues in combining state NAEP and main NAEP,
including Forsyth et al. (1996), Glaser et al. (1997), Mullis (1997), Rust (1996),
Rust and Shaffer (1997), and Spencer (1996).
Content of the Assessments
The state NAEP and main NAEP assessment administrations since 1986 and
those planned through 2008 are listed in Table 8-1. The table indicates that,
although main NAEP typically is administered in grades 4, 8, and 12, state NAEP
typically is administered only in grades 4 and 8. In addition, main NAEP is
administered in more subject-matter areas. The subject areas for the early state
NAEP assessments were only loosely related to those for main NAEP. However,
beginning in 1996 and in future plans, state NAEP mathematics and science
assessments are to be given in the same years as the main NAEP mathematics and
science assessments. A similar statement can be made about the reading and
writing assessments.
TABLE 8-1 Main NAEP and State NAEP Assessments by Year Since 1986 (a)

Year  Main NAEP (grades 4, 8, and 12         State (or Trial State) NAEP
      except where noted)
1986  Reading, Mathematics, Science,
      Computer Competence
1988  Reading, Writing, Civics, U.S. History
1990  Reading, Mathematics, Science          Mathematics (grade 8)
1992  Reading, Writing, Mathematics          Mathematics (grades 4 and 8),
                                             Reading (grade 4)
1994  Reading, U.S. History, Geography       Reading (grade 4)
1996  Mathematics, Science                   Mathematics (grades 4 and 8),
                                             Science (grade 8)
1997  Arts (grade 8)
1998  Reading, Writing, Civics               Reading (grades 4 and 8),
                                             Writing (grade 8)
1999
2000  Mathematics, Science                   Mathematics (grades 4 and 8),
                                             Science (grades 4 and 8)
2001  U.S. History, Geography
2002  Reading, Writing                       Reading (grades 4 and 8),
                                             Writing (grades 4 and 8)
2003  Civics, Foreign Language
      (grade 12 only)
2004  Mathematics, Science                   Mathematics (grades 4 and 8),
                                             Science (grades 4 and 8)
2005  World History, Economics
2006  Reading, Writing                       Reading (grades 4 and 8),
                                             Writing (grades 4 and 8)
2007  Arts
2008  Mathematics, Science                   Mathematics (grades 4 and 8),
                                             Science (grades 4 and 8)

(a) Assessments administered from 1986 to 1994 are adapted from Allen et al. (1996); small special-
interest assessments are not shown. Assessments administered from 1996 to 2008 are from National
Assessment Governing Board (1997). Future assessments reflect plans.
In recent state NAEP assessments (Allen and Mazzeo, 1997; Allen et al.,
1997) the assessment exercises used in state NAEP have been identical to ones
used in main NAEP. In addition, the scores from state NAEP have been reported
on the NAEP proficiency scale.
Administration Procedures
State NAEP and main NAEP differ in administration procedures. According
to Allen et al. (1997:13):
The state assessments differed from the national assessment in one important
regard: Westat [NAEP contractor] staff collected the data for the national
assessment while, in accordance with the NAEP legislation, data collection
activities for the state assessment were the responsibility of each participating
jurisdiction. These activities included ensuring the participation of selected
schools and students, assessing students according to standardized procedures,
and observing procedures for test security.
Linking State NAEP to Main NAEP
Recognizing that these differences in administration procedures might cause
differences in assessment results, linking studies have been conducted by the
National Center for Education Statistics (NCES) and its contractors to estimate
the effects of administration differences and to adjust scale scores for any effects
that exist. The rationale for these studies has been described by Yamamoto and
Mazzeo (1992:168) and is summarized here:
Because the assessment instruments for [trial state NAEP and main NAEP]
were identical, one of the common-item approaches to linking the scales might
have been considered. However, the rationale for such an approach is based on
an assumption that the item response functions for the . . . items were the same
under the [trial state NAEP and main NAEP] . . . test administration conditions.
The aforementioned considerations [differences in administration conditions],
as well as data from the assessment itself, suggest otherwise.
Thus, although the same items are used in state NAEP and main NAEP, concerns
about the effects of differences in administration procedures led to the decision to
independently scale the two assessments.
Linking studies that have been conducted use a common person design, in
which a sample of examinees from main NAEP is matched to the state NAEP
sample. These linking studies have not only estimated the size of the effects of
differences in administration conditions but also attempted to adjust for them.
Allen et al. (1997:16) described the linking study for the 1996 state assessment in
mathematics, which is typical of these linking studies, as follows:
The results from the state assessment program were linked to those from the
national assessment through linking functions determined by comparing the
results for the aggregate of all fourth- and eighth-grade public-school students
assessed in the state assessment with the results for public-school students of
the matching grade within a subsample (the National Linking sample) of the
national NAEP sample. The National Linking sample for a given grade is a
representative sample of the population of all grade-eligible public-school stu-
dents within the aggregate of the 45 participating states and the District of
Columbia (excluding Guam and the two DoDEA jurisdictions). Specifically,
the grade 4 National Linking sample consists of all fourth-grade students in
public schools in the states and the District of Columbia who were assessed in
the national mathematics assessment. The grade 8 National Linking sample is
equivalently defined for eighth-grade students who participated in the national
assessment.... Each mathematics content strand scale was linked by matching
the mean and standard deviation of the scale score averages across all fourth- or
eighth-grade students in the matching grade National Linking sample.
Thus, the linking sample for main NAEP is a subset of main NAEP that is
matched as closely as possible with the state samples. Such linking studies
appear to have been successful in adjusting for administration differences to the
extent that the distribution of scale scores for the matched sample for state NAEP
was found to be acceptably close to the distribution of scale scores for the main
NAEP matched sample (see, e.g., Allen and Mazzeo, 1997, and Allen et al.,
1997).
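
The mean-and-standard-deviation matching described in the quotation can be sketched as a linear transformation. The sketch below is only an illustration under stated assumptions: the function name linear_link, the provisional state scores, and the target mean and standard deviation are all hypothetical, and the operational NAEP linking is carried out separately for each content strand scale on weighted data.

```python
from statistics import mean, pstdev

def linear_link(source_scores, target_mean, target_sd):
    """Return a function that maps source-scale scores onto the target scale
    so that the transformed scores have the target mean and SD.
    (Hypothetical helper; not the operational NAEP procedure.)"""
    src_mean = mean(source_scores)
    src_sd = pstdev(source_scores)
    slope = target_sd / src_sd                  # matches the standard deviations
    intercept = target_mean - slope * src_mean  # matches the means
    return lambda x: slope * x + intercept

# Hypothetical provisional-scale scores for the aggregate state sample.
state_scores = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.6]

# Hypothetical mean and SD of the National Linking sample on the reporting scale.
link = linear_link(state_scores, target_mean=266.0, target_sd=36.0)
linked = [link(x) for x in state_scores]

print(round(mean(linked), 6), round(pstdev(linked), 6))
```

Because one slope and one intercept are estimated from the aggregate of all participating jurisdictions, every state's scores are shifted by the same function.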
Magnitude of the Effects of Administration Differences
Because the main and state NAEP assessments used exactly the same exer-
cises, the effects of the different administration procedures can be investigated
directly by comparing the proportion correct on items from the two matched
samples. If there were no administration differences between the two assess-
ments, the proportion correct, apart from sampling error, for each item and on
average, over all items, would be the same for the two assessments. However,
when these linking studies have been conducted, it has been found repeatedly that
the average proportion correct on state NAEP tends to be higher than the average
proportion correct on main NAEP for the matched samples. This finding sug-
gests that, on average, students can be expected to correctly answer more items
when an assessment is administered under state NAEP administration procedures
than when an assessment with identical questions is administered under main
NAEP administration conditions. Yamamoto and Mazzeo (1992) reported that
for the matched samples in the 1990 trial state assessment in mathematics the
average proportion correct was .02 higher on the trial state NAEP than on the
main NAEP assessment. In another example, based on linking studies for the
NAEP reading assessment, Spencer (1996) reported nearly a .01 difference in
average proportion correct in 1992 and a difference of .03 in 1994.
Administration Differences Responsible
for Differences in Assessment Results
Results of the linking studies indicate that some aspects of the differences in
administration of the two assessments are resulting in systematic differences in
the average proportion correct on the two assessments. Hartka and McLaughlin
(1993) identified motivational differences as one possible explanation and specu-
lated that:
One condition that might lead to higher scores on the TSA [trial state NAEP] is
higher motivation among students. In the TSA, quality control monitors
recorded instances of local school personnel giving students incentives to par-
ticipate.... Another possibility is that different personnel administering the
assessments (Westat staff for national [main] NAEP and local school personnel
for the TSA) created different climates in the schools and that this contributed
to the difference in performance between the national and TSA samples.
Spencer (1996) reported that there may be differences in participation rates for
the two assessments. He presented data for the 1994 trial state assessment indi-
cating that the overall percentage of sampled schools that participated was lower
than for main NAEP in 1994. For this assessment the percentage of students
participating in school was higher for the trial state NAEP than for main NAEP.
Hartka and McLaughlin (1993) found differences in some of the background
characteristics of students participating in state NAEP and main NAEP. Although
many possible reasons for the differences might exist, Spencer pointed out that it
can be difficult to assess the importance of each aspect of these administration
differences.
Implications of Differences for Score Interpretation
Apparently, the linking studies that adjust for differences in administration
conditions have the following as their goal: the scale scores reported for a par-
ticular state should reflect the scale scores that state would have received had the
state assessment been administered under the conditions used to administer the
main NAEP assessment. Various assumptions are implicit in conducting these
linking studies, and a single set of linking constants is applied for all jurisdictions.
This procedure seems sensible insofar as administration differences between main
NAEP and state NAEP are the same from state to state. However, it seems likely
that administration conditions differ across states. If so, the assessments would
be more accurate for some states than for others. The overall adjustment would
be unable to correct for these differences in accuracy.
Consider the following hypothetical illustration. States 1 and 2 have the
same mean scale scores as the nation if the assessment is administered under
main NAEP administration conditions. This common average scale score is 270,
and the average percentage of the exercises correct is 60 percent for the two states
and the nation. When state NAEP is actually administered, state 1 carefully
follows the prescribed administration conditions, and the average percentage of
the exercises correct for state 1 is 60 percent. State 2 is not so careful in follow-
ing the administration procedures, and its average percentage of the exercises
correct is 64 percent. Also, over all states the average percentage of items correct
in state NAEP is 62 percent.
State NAEP is then linked to main NAEP. Based on this study, a state with
an average percentage of items correct in state NAEP of 62 percent will have an
average scale score of 270. Following this linking study, state 1 earns an average
scale score below 270, which is below the average for the nation and below the
average for state 2. State 2 earns an average scale score above 270, which is
above the average for the nation and above the average for state 1. In effect, state
1 has been penalized for carefully following the administration procedures. State
2 has been rewarded for not taking as much care. This sort of situation, while
presented in a hypothetical example, is bound to occur if there is variation across
states in the effects of administration procedures on NAEP performance. An
overall adjustment, like the one currently applied, is unable to remove these sorts
of inequities that are a result of administration differences from one state to
another.
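
The hypothetical illustration can be worked through numerically. The sketch below assumes, purely for illustration, that near the mean each percentage point correct corresponds to 2.5 scale score points; this slope and the function name linked_scale_score are hypothetical and appear nowhere in the linking studies. Only the percentages (60, 62, and 64) and the scale score of 270 come from the illustration itself.

```python
# Hypothetical conversion rate: scale score points per percentage point correct.
SLOPE = 2.5

def linked_scale_score(pct_correct, anchor_pct=62.0, anchor_scale=270.0):
    """Single linking function applied to all states: the all-state average
    of 62 percent correct is mapped to the national scale score of 270."""
    return anchor_scale + SLOPE * (pct_correct - anchor_pct)

# Both states would earn 270 under main NAEP administration conditions.
true_scale = 270.0

# Observed state NAEP percentages correct from the illustration.
state_pct = {"state 1": 60.0, "state 2": 64.0}

reported = {state: linked_scale_score(pct) for state, pct in state_pct.items()}
for state, score in reported.items():
    print(state, score, score - true_scale)
```

Under these assumptions state 1 is reported below 270 and state 2 above 270, even though both would earn 270 under main NAEP conditions. The size of the gap depends entirely on the assumed slope; the direction does not.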
The conditions that require a study for linking state NAEP to main NAEP
can also lead to apparent contradictions in statistics that are reported with state
NAEP. These contradictions are apparent when comparing the states to the
nation on statistics that are based on percentages of items correct. Table 8-2
presents scale scores and percentages correct for the nation and for various states
for the 1992 NAEP trial state assessment in mathematics for eighth grade. Statis-
tics are presented for the nation and for the states of New York, Delaware, and
Arizona. Scale scores are presented in the top portion of the table. New York has
the same average overall scale score as the nation; the average overall scale score
for Arizona is slightly below that for the nation; and the average overall scale
score for Delaware is four points below that for the nation. Comparisons of the
five subscales lead to similar conclusions about how the states compare to the
nation. These scale score averages incorporate the adjustments from the study
that linked state NAEP to the main NAEP scale.
Average percentages correct over multiple-choice and constructed-response
items are given in the bottom portion of the table. On average, students in New
York correctly answered 2 percentage points more of the items than students in
the nation (56 versus 54 percent). Thus, based on the bottom portion of the table, New York appears to
be higher performing than the nation. Although Delaware performed more poorly
than the nation based on scale scores, the state performed similarly based on
average percentage correct. Some contradictory conclusions result from inspec-
tion of this table. Arizona is below the national average on scale scores but is, on
average, able to answer more items correctly than answered in the nation. New
York is at the national average based on scale scores but above the national
average based on percentage correct. Delaware is below the national average on
scale scores but at the national average based on percentage correct.

TABLE 8-2 Main NAEP and State NAEP Mean Scale Scores and Average
Percentage Correct for the Nation and Three States in the 1992 State and Main
NAEP Mathematics Assessments

Index                                       Nation  New York  Arizona  Delaware
Scale Score
  Overall                                      266       266      265       262
  Numbers and operations                       270       270      269       267
  Measurement                                  264       262      264       258
  Geometry                                     262       261      260       257
  Data analysis, statistics, and probability   267       268      265       262
  Algebra and functions                        266       265      264       263
Percentage Correct (Multiple-Choice and Constructed-Response)
  Overall                                       54        56       55        54
  Numbers and operations                        62        64       63        62
  Measurement                                   51        52       52        50
  Geometry                                      52        54       53        52
  Data analysis, statistics, and probability    48        51       48        48
  Algebra and functions                         51        53       51        50

Source: National Center for Education Statistics (1993:43, 126, 341).
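
The contradictions described for the three states can be checked directly from the overall rows of Table 8-2. The short sketch below uses only those published values; the helper names (sign, gaps) and the rule for flagging a contradiction (the two gaps from the nation differ in sign) are illustrative choices rather than anything defined in the source.

```python
# Overall rows of Table 8-2: mean scale score and average percentage correct.
overall = {
    "Nation":   {"scale": 266, "pct": 54},
    "New York": {"scale": 266, "pct": 56},
    "Arizona":  {"scale": 265, "pct": 55},
    "Delaware": {"scale": 262, "pct": 54},
}

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

nation = overall["Nation"]
gaps = {}
for state, row in overall.items():
    if state == "Nation":
        continue
    scale_gap = row["scale"] - nation["scale"]  # adjusted by the linking study
    pct_gap = row["pct"] - nation["pct"]        # not adjusted
    gaps[state] = (scale_gap, pct_gap, sign(scale_gap) != sign(pct_gap))

for state, (scale_gap, pct_gap, contradictory) in gaps.items():
    print(f"{state}: scale {scale_gap:+d}, percent correct {pct_gap:+d}, "
          f"contradictory={contradictory}")
```

For all three states the two statistics place the state differently relative to the nation, which is the pattern the text describes.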
In National Center for Education Statistics (1993:46), of the 44 jurisdictions
shown, 50 percent are above the national average in scale score. However, for
these 44 jurisdictions, over 61 percent are above the national average based on
percentage correct. These contradictions arise because scale score statistics
reported for states are adjusted for administration differences, whereas percent-
age-correct scores are not adjusted. Such contradictions and other related issues
that result from the need to conduct linking studies are particularly troublesome
in the more criterion-referenced uses of NAEP. One of the related issues is that
IRT (item response theory) parameter estimates for a given item could differ
considerably from the main NAEP to the state NAEP assessment.
Implications of Differences for Item Maps and Achievement Levels
Item maps and achievement levels are two of the procedures used to help
policy makers and the public better understand NAEP results. In item maps,
various scale score levels are chosen and items found that discriminate between
pairs of adjacent levels. The following example, based on the 1996 NAEP
mathematics assessment, is taken from Reese et al. (1997:9):
To better illustrate the NAEP mathematics scale, questions from the assessment
are mapped onto the 0-to-500 scale at each grade level. These item maps are
visual representations that compare questions with ability, and they indicate
which questions a student can likely solve at a given performance level as
measured on the NAEP scale. . . . As an example of how to interpret the item
maps, consider a multiple-choice question that requires students to identify
cylindrical shapes and maps at a scale score of 208 for grade 4. ... Mapping
a question at a score of 208 implies that students performing at or above this
level on the NAEP mathematics scale have a 74 percent or greater chance of
correctly answering this particular question. Students performing at a level
lower than 208 would have less than a 74 percent chance of correctly answering
the question.... As another example, consider a constructed-response ques-
tion that requires students to partition the area of a rectangle and maps at a score
of 272 for grade 8. . . . Scoring of this response allows for partial credit by
using a four-point scoring guide. Mapping a question at a score of 272 implies
that students performing at or above this level have a 65 percent or greater
chance of receiving a score of 3 (Satisfactory) or 4 (Complete) on the question.
Students performing at a level lower than 272 would have less than a 65
percent chance of receiving such a score.
Reese et al. (1997:9, fn. 6) go on to say that:
For constructed-response questions a criterion of 65 percent was used. For
multiple-choice questions with four or five alternatives, the criteria were 74 and
72 percent, respectively. The use of a higher criteria for multiple-choice ques-
tions reflected students' ability to "guess" the correct answer from among the
alternatives.
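
The response-probability criteria quoted above can be illustrated with a small calculation. The sketch below assumes a three-parameter logistic item response function with the conventional scaling constant 1.7 and entirely hypothetical item parameters; it works on the latent ability metric rather than the 0-to-500 reporting scale, and it treats the constructed-response item as a dichotomous item with no guessing, which simplifies the partial-credit scoring described in the quotation.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption here)

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def mapping_point(rp, a, b, c):
    """Ability at which the probability of success equals the criterion rp."""
    if not c < rp < 1.0:
        raise ValueError("criterion must lie between the guessing level and 1")
    return b + math.log((rp - c) / (1.0 - rp)) / (D * a)

# Hypothetical four-option multiple-choice item (guessing level 0.25),
# mapped with the 74 percent criterion.
theta_mc = mapping_point(0.74, a=1.0, b=0.0, c=0.25)

# Hypothetical constructed-response item with no guessing,
# mapped with the 65 percent criterion.
theta_cr = mapping_point(0.65, a=1.0, b=0.0, c=0.0)

print(round(p_correct(theta_mc, 1.0, 0.0, 0.25), 2))
print(round(p_correct(theta_cr, 1.0, 0.0, 0.0), 2))
```

Because the mapping point depends on the item parameter estimates, any change in those estimates moves the point at which the item is mapped.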
Main NAEP data are used to construct the item maps. Recall that students tend to
score higher when using state NAEP administration than when using main NAEP
administration conditions. So on state NAEP students at a particular ability
would tend to have a greater chance of correctly answering particular multiple-
choice items and a greater chance of receiving higher scores on constructed-
response items than the item maps would imply. Alternatively, if the item maps
had been constructed using state NAEP data, the items would have tended to have
been mapped at a higher score level than they were mapped using main NAEP
data.
Also, the parameter estimates for individual items on state NAEP differ from
those on main NAEP. Therefore, if the item maps had been constructed using
state NAEP item parameter estimates instead of main NAEP parameter estimates,
the item mapping for particular items could differ considerably, possibly in either
direction.
Achievement levels are another means used to enhance the interpretability of
NAEP results. As stated in Reese et al. (1997:42), a judgmental process is used
to set achievement levels:
The result of the achievement level-setting process is a set of achievement level
descriptions and a set of achievement level cutpoints on the 500-point NAEP
scale. The cutpoints are minimum scores that define Basic, Proficient, and
Advanced performance at grades 4, 8, and 12. . . . The results are based on the
judgments of panels, approved by NAGB, of what Basic, Proficient, and Ad-
vanced students should know and be able to do in mathematics, as well as on
their judgments regarding what percent of students at the borderline for each
level should answer each question correctly. The latter information is used in
translating the achievement level descriptions into cutpoints on the NAEP scale.
As with the item maps, achievement levels are set using main NAEP data. It is
likely that somewhat different achievement descriptions and cutpoints would
emerge from the achievement-level-setting process if state NAEP data were used
instead of main NAEP data.
For score-reporting purposes, the percentages of examinees in a state who are
reported to score at or above a particular achievement level are based on score
distributions that have been adjusted in the study in which state NAEP was linked
to main NAEP. To the extent that students earn higher scores on state NAEP than
on main NAEP, the effect of this adjustment is to lower the percentages at or
above each cutpoint for state NAEP. That is, on state NAEP there is a tendency
for a greater proportion of students to score at or above each achievement level
than the proportions reported in the state NAEP program.
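
The effect of the adjustment on the percentage at or above a cutpoint can be illustrated with a simple calculation. The sketch below assumes, only for illustration, that scale scores are normally distributed and that the adjustment lowers a state's mean by two points; the cutpoint, means, and standard deviation are hypothetical values, not figures from the linking studies.

```python
from statistics import NormalDist

def pct_at_or_above(cut, mean, sd):
    """Percentage of a normal score distribution at or above the cutpoint."""
    return 100.0 * (1.0 - NormalDist(mu=mean, sigma=sd).cdf(cut))

cut = 299.0  # hypothetical achievement-level cutpoint

# Hypothetical state distribution before and after the linking adjustment.
unadjusted = pct_at_or_above(cut, mean=272.0, sd=35.0)
adjusted = pct_at_or_above(cut, mean=270.0, sd=35.0)

print(round(unadjusted, 1), round(adjusted, 1))
```

Under these assumptions the adjusted distribution places a smaller percentage of students at or above the cutpoint than the unadjusted one.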
To handle the effects on reported scores of the administration differences
between state NAEP and main NAEP, a decision was made to adjust the state
NAEP scores. While understandable and possibly the best decision given the
circumstances, this decision can lead to potential misinterpretations and inaccu-
racies in interpreting scores from state NAEP. These problems seem most serious
when attempting to make criterion-referenced interpretations of scores, such as
those made with item maps and achievement levels.
DESIGNS FOR COMBINING STATE AND MAIN NAEP SAMPLES
In this section, issues in developing designs for combining state and main
NAEP samples are discussed. Currently, sampling, administration, and analysis
(other than the study used to adjust for administration differences) are done
separately for state and main NAEP. Rust (1996) suggested three general ap-
proaches to combining state and main NAEP. In one approach the sampling and
administration continue to be separate, but the analyses are based on pooled data.
In another approach a national sample is drawn and supplemented as necessary to
obtain an adequate state sample. Finally, samples are drawn from each state and
supplemented as necessary to obtain an adequate national sample. Specific pro-
posals presented by Rust and Shaffer (1997) and Spencer (1996) for implement-
ing these general approaches are discussed here.
This discussion of sampling procedures relies heavily on work by sampling
statisticians, including Rust (1996), Rust and Johnson (1992), Rust and Shaffer
(1997), and Spencer (1996). The designs suggested in these papers are reviewed
here. The designs are summarized, and their interaction with various administrative
and measurement issues is evaluated. The focus is on practical design issues;
there is no intent to provide a sampling statistician's perspective on these issues.
Independent samples of schools are used in main and state NAEP, and differ-
ent designs currently are used for selecting samples in the two programs. Efforts
are made to ensure that no one school is included in both samples. In addition, as
is discussed, the sampling designs used in the two programs have important
differences.
In the schedule for future assessments, as shown in Table 8-1, more subject
areas and more grades will be included in main NAEP than state NAEP. How-
ever, in the future main and state NAEP will assess grade 4 and grade 8 math-
ematics and science in the same years (e.g., 2000, 2004, and 2008) and grade 4
and grade 8 reading and writing in the same years (e.g., 2002 and 2006). The
following discussion of combining the state and main NAEP samples pertains
only to these combinations of grade, test, and year.
As stated by Rust and Johnson (1992:127), "the NAEP sampling and weighting
procedures are designed to obtain sample data that permit estimates of subpopulation
characteristics of reasonably high precision." The precision targets
are stated ahead of time, and samples are designed to meet these targets.
Current Design for Main NAEP
The goal of the main NAEP sample design is to adequately represent the
population of students in the United States in a particular grade as well as certain
subpopulations. According to Rust and Johnson (1992), the main NAEP samples
are drawn using a multistage probability sampling design with three stages of
selection. The three stages are summarized as follows:
Stage 1. The United States is divided into approximately 1,000 geographical
areas. A sample of these geographical areas is selected.
Stage 2. A sample of schools is selected from within the selected geographical areas.
Stage 3. A sample of students is selected from within the selected schools.
According to Rust and Johnson (1992:112), Stage 1 is used "to make feasible the
task of recruiting and training staff to administer the tests in a cost effective
manner" because the assessments will be given in only a small number of geo-
graphical areas (e.g., Rust and Johnson, 1992, reported that in main NAEP in
1990 only 94 of the geographical areas were selected). Stratification and weight-
ing procedures are used to ensure that the sample is representative and that the
desired levels of precision are attained. In addition, procedures are used to deal
with schools that are selected but decline to participate. Recruiting of schools
and test administration are done centrally by a single NAEP contractor. Data
analysis for main NAEP is conducted using the national data only.
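
The three stages of selection summarized above can be sketched in code. The sketch below is a deliberately simplified illustration: it uses simple random sampling at every stage and a small hypothetical frame, whereas the actual main NAEP design relies on stratification and weighting. The function name three_stage_sample and all sizes are assumptions made for the example.

```python
import random

def three_stage_sample(frame, n_areas, n_schools, n_students, seed=0):
    """Select areas, then schools within areas, then students within schools.
    frame maps area -> school -> list of student identifiers."""
    rng = random.Random(seed)
    sample = {}
    # Stage 1: sample geographical areas.
    for area in rng.sample(sorted(frame), n_areas):
        schools = frame[area]
        # Stage 2: sample schools within each selected area.
        chosen = rng.sample(sorted(schools), min(n_schools, len(schools)))
        # Stage 3: sample students within each selected school.
        sample[area] = {
            school: rng.sample(schools[school],
                               min(n_students, len(schools[school])))
            for school in chosen
        }
    return sample

# Hypothetical frame: 20 areas, 5 schools per area, 30 students per school.
frame = {
    f"area{a}": {
        f"a{a}-school{s}": [f"a{a}-s{s}-student{i}" for i in range(30)]
        for s in range(5)
    }
    for a in range(20)
}

sample = three_stage_sample(frame, n_areas=4, n_schools=2, n_students=10)
print({area: list(schools) for area, schools in sample.items()})
```

Restricting the first stage to a few areas is what concentrates the field work, which corresponds to the cost rationale quoted from Rust and Johnson (1992).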
Current Design for State NAEP
The goal of the state NAEP sample design is to adequately represent the
population of students in a given state in a particular grade as well as certain
subpopulations. To reduce the burden on schools, efforts are made to ensure that
schools chosen for state NAEP are not in main NAEP. The two-stage probability
sample used in each state that participates in state NAEP is summarized as
follows:
Stage 1. A sample of schools is selected from within the state.
Stage 2. A sample of students is selected from within the selected schools.
Stratification and weighting procedures are used to ensure that the sample is
representative and that the desired levels of precision are attained. In addition,
procedures are used to deal with schools that are selected but decline to partici-
pate. See Rust and Johnson (1992) for more detail. Recruiting of schools and test
administration are conducted by personnel in the state.
As indicated earlier, a linking study is used to adjust state NAEP results for
differences in administration conditions between state NAEP and main NAEP.
Recall that a single set of linking functions is developed and used to adjust the
results for all states. Apart from using main NAEP data to estimate linking
functions, data analysis for state NAEP is conducted using the state data only.
Some possibilities for combining the main and state NAEP sample designs and/or
data analyses follow.
Spencer's (1996) Approaches
One way to combine the two assessments, referred to here as Spencer's
Approach 1, uses the current designs and administration procedures for both
assessments and then pools the data during analysis. The potential benefit of
using this procedure is that sampling error could be reduced for national and
regional statistics by including the state data along with the main NAEP data. In
addition, the sampling error for the state statistics could be reduced by using main
NAEP data from a state along with the state NAEP data from that state.
However, combining the main and state NAEP data relies heavily on the
linking study used to adjust state NAEP scores for differences in administration
conditions between main and state NAEP. As Spencer (1996) pointed out, the
linking adjustment introduces error, and it would be necessary to ensure that the
random error and bias due to linking are negligible; otherwise, this approach
could increase error. Spencer also pointed out that there would be some addi-
tional costs associated with conducting the analyses, creating new weights, and
estimating standard errors. He recommended further study of this possibility.
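The trade-off between reduced sampling error and added linking error can be illustrated numerically. The sketch below assumes an inverse-variance pooling of a main NAEP estimate with a linked state-based estimate, with weights set as if linking were error free. The formula, the function name, and the numbers are assumptions made for illustration and are not taken from Spencer (1996).

```python
def pooled_mse(var_main, var_state, link_var=0.0, link_bias=0.0):
    """Mean squared error of an inverse-variance pooled estimate.

    Weights are chosen as if linking were error free (an assumption made
    here for illustration), so linking variance and bias are ignored when
    the weights are set but still contribute to the error of the result.
    """
    w_state = var_main / (var_main + var_state)   # weight on the linked state estimate
    w_main = 1.0 - w_state                        # weight on the main NAEP estimate
    variance = w_main**2 * var_main + w_state**2 * (var_state + link_var)
    bias_sq = (w_state * link_bias) ** 2
    return variance + bias_sq


var_main, var_state = 4.0, 1.0                    # hypothetical sampling variances
no_link_error = pooled_mse(var_main, var_state)
with_link_error = pooled_mse(var_main, var_state, link_var=1.0, link_bias=2.5)
```

With these hypothetical values, pooling lowers the error below that of main NAEP alone when linking is error free, but the pooled estimate is worse than main NAEP alone once the assumed linking variance and bias are included.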
Spencer considered a second possibility, referred to here as Spencer's
Approach 2, intended to save money and increase precision by combining the
sampling designs for main and state NAEP into one integrated design. He pre-
sented the following possibility: "Select the national sample and see how many
schools fall in each state. Then draw an additional sample of schools in each state
in state NAEP to meet the target precision for that state" (Spencer, 1996:54). In
this design, therefore, the current main NAEP sampling plan is used, but the state
plan is modified. For main NAEP, recruiting of schools and test administration
are still done centrally by a single NAEP contractor. For the additional schools in
each state that are selected, recruiting of schools and test administration are still
done by state personnel. Spencer also suggested that, as with Spencer's Ap-
proach 1, sampling error for main NAEP might be reduced if data from main
NAEP and state NAEP were pooled for main NAEP analyses.
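The supplementation step in Spencer's Approach 2 can be sketched as follows. This is a simplified illustration that uses a count of schools as a stand-in for target precision, which is an assumption; the state labels, counts, targets, and function name are hypothetical.

```python
def supplemental_schools(national_counts, state_targets):
    """Schools to add in each state after the national sample is drawn.

    national_counts: number of national-sample schools that fall in each state.
    state_targets: number of schools assumed necessary to reach each state's
        target precision (a simplifying proxy used only for this sketch).
    """
    return {
        state: max(0, target - national_counts.get(state, 0))
        for state, target in state_targets.items()
    }


national_counts = {"IA": 6, "NE": 2, "CA": 110}
state_targets = {"IA": 100, "NE": 100, "CA": 100, "VT": 100}
extra = supplemental_schools(national_counts, state_targets)
```

A state with no schools in the national sample needs its full target drawn as a supplement, whereas a state already well represented in the national sample needs few or no additional schools.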
Preliminary analyses conducted by Spencer suggested that Spencer's
Approach 2 procedures lead to approximately a 6 percent reduction in the sample
size for state NAEP, which results in significant cost savings in test materials,
booklet processing, test scoring, and other administration costs. As with
Spencer's Approach 1, there are some (relatively small) additional costs associ-
ated with conducting the analyses. Note that under this design, to meet target
precision for the states, it is necessary to pool data from the state and main NAEP
samples. These precision targets could be met only if the random error and bias
due to linking are negligible. Spencer recommended that this design be studied
further.
Spencer also considered a third possibility, referred to here as Spencer's
Approach 3, that saves even more money and reduces the sample size for main
NAEP. He suggested the following possibility: "Select the state NAEP sample
first and then draw a supplemental sample to yield a national sample meeting the
target levels of precision overall and for subgroups. These target levels of preci-
sion would be met both for the subjects and grades covered and state NAEP and
also for those not covered" (Spencer, 1996:55).
Spencer demonstrated that this design leads to substantial savings, beyond
those for Spencer's Approach 2. However, he pointed out that implementing this
possibility requires that "decisions about what states will participate in state
NAEP and what subjects will be covered must be made before combined NAEP
can be designed. . . . Success would seem unlikely" (Spencer, 1996:55). The
concerns regarding linking error in this design are even more severe than they are
for Spencer's Approach 2, because for Spencer's Approach 3 it is necessary to
pool data from state and main NAEP samples to meet target precision for main
NAEP.
Rust and Shaffer's (1997) Sampling Possibilities
Rust and Shaffer (1997) compared three sample designs. The first design,
referred to here as Rust and Shaffer's Approach 1, involves combining the separate
samples that are currently used in main and state NAEP. This design is
essentially the same as Spencer's Approach 1. In their second proposed design,
referred to here as Rust and Shaffer's Approach 2, they moved away from use of
the 1,000 geographical areas that are currently used for main NAEP.1 They
proposed using the following procedures:
Stage 1. A sample of schools is selected from within each state that results in
precision comparable to current state NAEP.
Stage 2a. Among the selected schools in each state, designate a subset as
national schools (with a minimum of two schools per state). Over all states the
results from just these schools would result in precision comparable to current
national NAEP. The number of national schools selected in this way is compa-
rable to the current number of national schools.
Stage 2b. Among the selected schools, those not designated as national
schools are designated as state schools.
Stage 3a. A sample of students is selected from the selected national schools.
Stage 3b. Only if a state agrees to participate, a sample of students is
selected from the selected state schools.
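Stages 2a and 2b above can be sketched in code. The example assumes that national schools are designated by simple random selection of a fixed fraction of each state's sampled schools, subject to the two-school minimum; the selection rule, the fraction, and all names are hypothetical and are not taken from Rust and Shaffer (1997).

```python
import random


def designate_schools(state_samples, national_fraction, min_national=2, seed=0):
    """Split each state's sampled schools into national and state schools.

    state_samples: maps a state to its list of sampled school ids (Stage 1).
    national_fraction: assumed share of sampled schools designated national.
    """
    rng = random.Random(seed)
    national, state_only = {}, {}
    for state, schools in state_samples.items():
        n_nat = max(min_national, round(national_fraction * len(schools)))
        n_nat = min(n_nat, len(schools))          # cannot designate more than were sampled
        picked = set(rng.sample(schools, n_nat))
        national[state] = [s for s in schools if s in picked]          # Stage 2a
        state_only[state] = [s for s in schools if s not in picked]    # Stage 2b
    return national, state_only


state_samples = {
    "IA": [f"IA-{i}" for i in range(100)],
    "VT": [f"VT-{i}" for i in range(12)],
}
national, state_only = designate_schools(state_samples, national_fraction=0.1)
```

In this sketch the smaller state receives the two-school minimum even though the assumed fraction alone would yield only one national school.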
Stratification and weighting procedures are used to ensure that the sample is
representative and that the desired levels of precision are attained. In addition,
procedures are used to deal with schools that are selected but decline to participate.
As is currently done, administration in national schools is conducted by an
NCES contractor, and state administration is conducted by state staff. In a
departure from current procedures, recruitment is done by staff in states
participating in state NAEP.2
Rust and Shaffer (1997) suggested that the analyses for main NAEP be based
on all participating schools (both national and state), although the designed
precision could be obtained from national data. State NAEP analyses are based on all
participating schools in the state (both national and state) to meet state precision
targets. This design has some potential significant benefits. Preliminary analyses
conducted by Rust and Shaffer suggested that these procedures lead to an
approximate 10 percent reduction in the sample size for state NAEP, compared to current
procedures, which leads to significant cost savings. The precision of national
statistics is comparable to current precision if the national data are used alone.
The national statistics are more precise if the state and national data are pooled
for main NAEP. Rust and Shaffer (1997:6-11) also discussed the benefits to
recruitment from the "synergism in the recruitment process for state and national
components" if states do all of the recruitment.
1Recall that Rust and Johnson (1992:112) indicated that these geographical areas were used as a
first stage of sampling to "make feasible the task of recruiting and training staff to administer the
tests in a cost effective manner." Rust and Shaffer (1997) did not indicate why it is now possible to
move away from the use of geographical areas as a first-stage sampling unit. Note that the use of
these geographical areas as a first-stage sampling unit results in more sampling error than if schools
were sampled at the first stage (ACT, 1997).
2Rust and Shaffer (1997) suggested that this change would enhance participation in main NAEP.
However, they did not discuss how this enhanced participation, if it did exist, might affect the
comparability of main NAEP scores between current main NAEP and main NAEP after the change
in recruitment procedures was made. It seems, however, that who recruits schools is not really an
integral part of their design in that the design could be followed with the NCES contractor continuing
to recruit schools. Clearly, this issue would require further study before a change in recruitment
procedures is made.
As with the other designs that involve an integration of main and state NAEP
data, a major issue concerning this design is that it requires a linking study to
adjust state results for differences in state and national administration conditions.
The gain in precision for main NAEP and the state precision targets likely could
be achieved only if the random error and bias due to linking are negligible. In
addition, this design requires considerable coordination of state and national
NAEP.
The final proposed design, referred to here as Rust and Shaffer's Approach 3,
dropped the requirement of Rust and Shaffer's Approach 2 that the target precision
for the national statistics be attainable using only the national data. A major
effect of dropping this requirement is to reduce the number of test administrations
that are done by the NCES contractor. The stages provided earlier for Rust and
Shaffer's Approach 2 would still be followed, except that Stage 2a would be
replaced by the following:
Stage 2a. Among the selected schools in each state, designate a subset as
national schools (with a minimum of two schools per state). Over all states the
results from just these schools do not result in precision comparable to current
main NAEP. The number of national schools selected in this way is around one-
half of the current number of national schools.
As in Rust and Shaffer's Approach 2, stratification and weighting procedures are
used to ensure that the sample is representative and that the desired levels of
precision are attained; procedures are used to deal with schools that are selected
but decline to participate; administration in national schools is conducted by an
NCES contractor, whereas state administration is conducted by state staff; and all
recruitment is conducted by state staff.
Unlike Rust and Shaffer's Approach 2, the analyses for main NAEP to
achieve target precision are based on all participating schools (both national and
state). Like Rust and Shaffer's Approach 2, state NAEP analyses are based on all
participating schools in the state (both national and state) to meet state precision
targets.
Preliminary analyses by Rust and Shaffer (1997:6-10) indicated that this
design has all the potential benefits of Rust and Shaffer's Approach 2, with the
addition that the sample size that requires administration by the NCES adminis-
tration contractor is reduced and the overall sample size is reduced even further.
However, these analyses also indicated that benefits depend heavily on the de-
gree of participation in state NAEP. In addition, this design requires use of the
results of the linking study to achieve the desired precision for main NAEP. For
these reasons Rust and Shaffer recommended further consideration of Rust and
Shaffer's Approach 2 but not Rust and Shaffer's Approach 3 because the former
design is "considerably more robust to the vagaries of the outcome of the state
participation process."
Rust and Shaffer (1997:6-25) concluded that Rust and Shaffer's Approach 2
should be considered because "this approach will lead to much more useful data
at the national and regional levels. It will enhance participation in centrally
administered schools. It will have little impact on cost. The approach is robust to
the level of state participation in NAEP."
Discussion and Comparison of the Approaches
Spencer's Approach 1 and Rust and Shaffer's Approach 1 involve no changes
in the sample designs. These approaches have the potential to increase precision.
The additional costs associated with these approaches involve further analyses,
which likely are small compared to the administrative costs. The major potential
drawback of either of these approaches is that they rely on there being little
random error or bias when adjusting state NAEP results for operational differ-
ences between state NAEP and main NAEP. The sources of these operational
differences and their degree of stability should be thoroughly understood before
these approaches are used.
Spencer's Approach 2 continues to use geographical area as the first stage in
a multistage sampling procedure, whereas Rust and Shaffer's Approach 2 elimi-
nates this first stage. This elimination might cause some operational difficulties
in that administration of main NAEP would occur in more diverse geographical
areas. However, if this first stage is eliminated, fewer schools would need to be
sampled for main NAEP, which is true whether or not the samples are combined
(ACT, 1997). Thus, if the first stage can be eliminated, at least in this aspect,
Rust and Shaffer's Approach 2 seems preferable to Spencer's Approach 2. How-
ever, it is unclear why the first stage can be eliminated, whereas it was deemed
necessary in the past. This issue needs to be addressed before further consideration
of Rust and Shaffer's Approach 2.
A major issue with both Spencer's Approach 2 and Rust and Shaffer's Ap-
proach 2 is that both rely heavily on there being little random error or bias in
adjusting state NAEP results for operational differences between state NAEP and
main NAEP. The sources of these operational differences and their degree of
stability should be thoroughly understood before these approaches are used.
Forsyth et al. (1996) also indicated that it will be important to design the ap-
proaches so that last-minute withdrawals of states do not affect the main NAEP
samples.
Given problems that accrue from the need for the linking study, Forsyth et al.
suggested it might be possible to design NAEP so that the same administration
conditions are used for main and state versions. In particular, they suggested
using local administrators for main NAEP (as well as for state NAEP), with an
increase in the monitoring and degree of training of the administrators. If this
approach is considered, however, they suggest that the effects of such a change be
monitored on participation rates among schools selected for main NAEP in states
not participating in state NAEP. In addition, such a significant change in main
NAEP could affect comparability of national statistics before and after the change
is made.
Combining main and state NAEP sampling has the potential for a modest
reduction in the number of schools involved in NAEP. However, much more
work is needed to detail and evaluate the approaches before they are imple-
mented. A significant problem in each approach arises from the operational
differences between main and state NAEP that cause complications potentially
difficult to overcome. Unless the operational procedures for main NAEP and
state NAEP can be made much more similar to one another, the potential compli-
cations caused by these approaches might lead to severe problems in combining
NAEP samples.
CONCLUSIONS
Future plans are for state NAEP to be administered at approximately the
same time as main NAEP and for the content of state NAEP to be a subset of the
content of main NAEP. These plans suggest that now there might be a greater
chance of combining main and state NAEP samples than in the past. However,
current plans still result in significant administration differences between main
and state NAEP. These differences currently are addressed by adjusting state
NAEP scores. Even so, contradictory findings and complications are apparent,
especially when making the criterion-referenced interpretations of NAEP scores
that seem to be gaining prominence through the use of item maps, achievement
levels, and now market basket reporting (Forsyth et al., 1996; National Center for
Education Statistics, 1996). The conditions that make the linking studies
necessary create confusion when attempting to make criterion-referenced
interpretations with state NAEP.
The administration differences also make implementation of any of the
designs for combining main and state NAEP questionable. Much more needs to
be known about the effects of the administration differences. A starting point for
further investigation would be to address the following questions:
Question 1: To what extent are the linking constants equal across states?
Differences among states in ability, participation rates, and recruitment proce-
dures should be investigated as variables that might influence linking constants.
Question 2: How large is the random error component in estimating the
linking constants?
Question 3: To what extent does bias or systematic error influence the
linking constants?
Question 4: To what extent would results from state NAEP be affected if the
administration and recruitment conditions for state NAEP were changed to be
consistent with those for main NAEP?
Question 5: Do the differences in administration and recruitment conditions
affect the constructs that are being measured by the NAEP assessments?
These questions should be thoroughly addressed before any design for combining
the state and main NAEP samples is implemented under current recruitment and
administration conditions. Note that even after conducting the extensive research
that addressing these questions entails, the analyses presented in Spencer (1996)
and Rust and Shaffer (1997) suggest that combining the samples for state and
main NAEP would result in only a modest decrease in sample size.
Another approach is to use administration and recruitment procedures that
are the same for main and state NAEP, such as those suggested by Forsyth et al.
(1996). One possibility is to use the centralized administration and recruitment
procedures currently used with main NAEP. Using these procedures for both
main and state NAEP is optimal from the perspectives of combining samples,
having comparable results for the two assessments, combining reporting and
analyses, and being able to compare main NAEP results from before and after
changes were made in recruitment and administration procedures. Although
these procedures might be prohibitive from a cost perspective, they should be
thoroughly investigated.
Another possibility suggested by Forsyth et al. is to use the current state
administration procedures for main NAEP but possibly with more central over-
sight and standardization than is currently used with state NAEP. This type of
change in recruitment and administration procedures would require a study to
link main NAEP under these new administration conditions to main NAEP under
the previous administration conditions. Conducting this study could be costly
and difficult to implement.
If the issues regarding linking and administration conditions are addressed
sufficiently, Spencer's Approach 2 and Rust and Shaffer's Approach 2 would be
good places to start in developing a combined sampling plan. Spencer's Ap-
proach 2 might be preferable if the first-stage sampling is by geographical area.
Rust and Shaffer's Approach 2 might be preferable if, from an operational per-
spective, this first stage is unnecessary.
ACKNOWLEDGMENTS
The author thanks Karen Mitchell and two anonymous reviewers for com-
ments on a draft of this paper.
REFERENCES
ACT
1997 ACT's NAEP Redesign Project: Assessment Design Is the Key to Useful and Stable
Assessment Results. Final Report. Iowa City, Iowa: ACT.
Allen, N.L., and J. Mazzeo
1997 Technical Report of the NAEP 1996 State Assessment Program in Science. Washington,
D.C.: National Center for Education Statistics.
Allen, N.L., D.L. Kline, and C.A. Zelenak
1996 The NAEP 1994 Technical Report. Washington, D.C.: National Center for Education
Statistics.
Allen, N.L., F. Jenkins, E. Kulick, and C.A. Zelenak
1997 Technical Report of the NAEP 1996 State Assessment Program in Mathematics. Wash-
ington, D.C.: National Center for Education Statistics.
Ballator, N.
1996 The NAEP Guide, Revised Edition. Washington, D.C.: National Center for Education
Statistics.
Forsyth, R., R. Hambleton, R. Linn, R. Mislevy, and W. Yen
1996 Design Feasibility Team Report to the National Assessment Governing Board. Washing-
ton, D.C.: National Assessment Governing Board.
Glaser, R., R. Linn, and G. Bohrnstedt
1997 Assessment in Transition: Monitoring the Nation's Educational Progress. Stanford,
Calif.: National Academy of Education.
Hartka, E., and D.H. McLaughlin
1993 A Study of the Administration of the 1992 National Assessment of Educational Progress
Trial State Assessment Program. Palo Alto, Calif.: American Institutes for Research.
Jones, L.V.
1996 A history of the National Assessment of Educational Progress and some questions about
its future. Educational Researcher 25(7):15-22.
Koretz, D.M.
1991 State comparisons using NAEP: Large costs, disappointing benefits. Educational
Researcher 20(3):19-21.
Mullis, I.V.S.
1997 Optimizing State NAEP: Issues and Possible Improvements. Paper commissioned by the
NAEP Validity Studies Panel.
National Academy of Education
1993 The Trial State Assessment: Prospects and Realities. Stanford, Calif.: National Acad-
emy of Education.
National Assessment Governing Board (NAGB)
1996 Policy Statement on Redesigning the National Assessment of Educational Progress.
Washington, D.C.: NAGB.
1997 Schedule for the National Assessment of Educational Progress. Washington, D.C.:
NAGB.
National Center for Education Statistics
1993 Data Compendium for the NAEP 1992 Mathematics Assessment of the Nation and the
States. Washington, D.C.: National Center for Education Statistics.
1996 An Operational Vision for NAEP-Year 2000 and Beyond. Washington, D.C.: National
Center for Education Statistics.
Phillips, G.W.
1991 Benefits of state-by-state comparisons. Educational Researcher 20(3):17-19.
Reese, C.M., K.E. Miller, J. Mazzeo, and J.A. Dossey
1997 NAEP 1996 Mathematics Report Card for the Nation and the States. Washington, D.C.:
National Center for Education Statistics.
Rust, K.F.
1996 Sampling Issues for Redesign. Memorandum to Mary Lyn Bourque, NAGB, May 9.
Rust, K.F., and E.G. Johnson
1992 Sampling and weighting in the national assessment. Journal of Educational Statistics
17(2):111-129.
Rust, K.F., and J.P. Shaffer
1997 Sampling. In NAEP Reconfigured: An Integrated Redesign of the National Assessment
of Educational Progress, E.G. Johnson, S. Lazer, and C.Y. O'Sullivan, eds. Working
Paper No. 97-31. Washington, D.C.: National Center for Education Statistics.
Spencer, B.
1996 Combining State and National NAEP. Paper prepared for the evaluation of state NAEP
conducted by the National Academy of Education.
Yamamoto, K., and J. Mazzeo
1992 Item response theory linking in NAEP. Journal of Educational Statistics 17:155-173.