Copyright © 2009. National Academy of Sciences. All rights reserved.
4
An External Evaluation of the 1996 Grade 8
NAEP Science Framework
Stephen G. Sireci, Frederic Robin, Kevin Meara,
H. Jane Rogers, and Hariharan Swaminathan
The National Assessment of Educational Progress (NAEP) is the most com-
prehensive evaluation of the educational achievement of U.S. students in history.
Laudable features of the more recent NAEP tests are their breadth in terms of the
content domains measured and the manner in which students are tested. For
example, on the 1996 NAEP science assessment, the focus of this paper, three
"fields" of science are measured (earth, life, and physical science), and students
are required to perform "hands-on" science experiments, report the results of
their experiments in written form, and respond to multiple-choice questions.
Thus, the structure of the current NAEP science assessment is complex.
This study examined the content validity1 of the 1996 grade 8 NAEP science
assessment to determine how well items composing the assessment represent the
framework that governed the test development process. This appraisal is impor-
tant for determining whether the inferences derived from NAEP scores can be
linked to the science content and skill domains the test is designed to measure.
To accomplish the goals of this study, 10 carefully selected science teachers were
recruited to review items from the 1996 grade 8 NAEP science assessment and
provide judgments regarding the knowledge and skills measured by these items.
1. Some measurement specialists (e.g., Messick, 1989) argue against use of the term content validity
because it does not directly describe score-based inferences. Although this position has theoretical
appeal, in practice, content validity is a widely endorsed notion of test quality (Sireci, 1998b). Thus,
the position taken here is similar to Ebel (1977:59), who claimed "content validity is the only basic
foundation for any kind of validity.... One should never apologize for having to exercise judgment
in validating a test. Data never substitute for good judgment."
These judgments were compared to the knowledge and skill domains the items
were intended to measure.
OVERVIEW OF THE GRADE 8
SCIENCE ASSESSMENT FRAMEWORK
The 1996 grade 8 science assessment comprised 189 items. The intended
structure of the assessment is characterized in the content frameworks, which
specify four dimensions (National Assessment Governing Board, 1996). The first
dimension is a content dimension comprising three separate "fields of science":
earth science, life science, and physical science. The committees involved in
creating the test specifications concluded that these three fields of science are
sufficiently distinct to warrant separate scales. Thus, for all 1996 NAEP
science assessments, the results were to be reported along four separate scales:
one for each of the three fields of science and a composite score scale summariz-
ing science proficiency across the three fields.
The second dimension of the science framework is a cognitive dimension
described as "ways of knowing and doing science." There are also three compo-
nents to this dimension: conceptual understanding, practical reasoning, and sci-
entific investigation. Separate score scales are not derived for these cognitive
skills; however, these skill areas were critical in defining the domains measured
on the assessment and in governing the item (task) development process. Every
item on a NAEP science assessment is targeted to one of the three fields of
science and one of the three ways of knowing and doing science.
Only some of the items were linked to the other two dimensions underlying
the content frameworks. These two dimensions are described as a "themes of
science" dimension and a "nature of science" dimension. The "themes" dimen-
sion comprised three areas: patterns of change, models, and systems. The nature
of science dimension comprised two areas: nature of science and nature of
technology. For the grade 8 assessment, 93 items (49 percent) corresponded to a
"theme" dimension and 31 items (16 percent) to a "nature" dimension. The
content, cognitive, theme, and nature test specifications are presented in Table 4-1.
Another conspicuous aspect pertinent to the content structure of the assess-
ment is the diversity of item formats used. Students were required to both read
assessment material and perform hands-on scientific experiments. The item
formats tied to these tasks were multiple-choice items (with two to four response
options per item); short constructed-response items (where students were required
to write a short answer, usually a single word or a sentence or two); and extended
constructed-response items (requiring students to supply a detailed response to
the item). There were 73 multiple-choice and 116 constructed-response items on
the grade 8 assessment.
TABLE 4-1 Cross-Tabulation of Item Specifications for 1996 Grade 8 NAEP
Science Assessment

                      Ways of Knowing and Doing Science
Field of              Conceptual       Practical     Scientific
Science               Understanding    Reasoning     Investigation    Total
Earth science         35               13            14               62 (33%)
  (Theme)             (23)             (7)           (7)              (37)
  [Nature]            [2]              [4]           [5]              [11]
Life science          42               14            9                65 (34%)
  (Theme)             (29)             (5)           (4)              (38)
  [Nature]            [2]              [3]           [0]              [5]
Physical science      32               16            14               62 (33%)
  (Theme)             (9)              (6)           (3)              (18)
  [Nature]            [1]              [9]           [5]              [15]
Totals                109 (57.7%)      43 (22.8%)    37 (19.6%)       189
  (Theme)             (61)             (18)          (14)             (93)
  [Nature]            [5]              [16]          [10]             [31]
Note: Entries in the table are the number of items in each cell of the framework.
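
The marginal totals and percentages in Table 4-1 can be re-derived from the cell counts. The following is a minimal sketch of that arithmetic, not part of the original study; the names TABLE_4_1, marginals, and percent are illustrative assumptions.

```python
# Item counts copied from Table 4-1. Each tuple holds the counts for
# (conceptual understanding, practical reasoning, scientific investigation).
TABLE_4_1 = {
    "Earth science": (35, 13, 14),
    "Life science": (42, 14, 9),
    "Physical science": (32, 16, 14),
}


def marginals(table):
    """Return (row_totals, column_totals, grand_total) for a cross-tabulation."""
    row_totals = {field: sum(counts) for field, counts in table.items()}
    column_totals = tuple(sum(column) for column in zip(*table.values()))
    grand_total = sum(row_totals.values())
    return row_totals, column_totals, grand_total


def percent(count, total, digits=1):
    """Express count as a percentage of total, rounded to `digits` places."""
    return round(100 * count / total, digits)


row_totals, column_totals, grand_total = marginals(TABLE_4_1)

for field, total in row_totals.items():
    print(field, total, percent(total, grand_total, digits=0))
print("Column totals:", column_totals, "Grand total:", grand_total)
```

Running the sketch reproduces the row totals (62, 65, 62), the column totals (109, 43, 37), and the grand total of 189 items reported in the table.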
METHOD
Ten science teachers were recruited to scrutinize a carefully selected sample
of items from the 1996 grade 8 science assessment and provide judgments
regarding the content characteristics of the items. As described below, these
teachers provided both ratings of the content similarities among the items and
ratings linking each item to the content, cognitive, nature, and theme dimensions
defined in the frameworks.
Participants
The 10 science teachers who served as the subject-matter experts (SMEs) in
this study were selected by contacting the state assessment directors in states that
are currently active in developing state standards and assessments in science.
The teachers were nominated by their state assessment director because of their
involvement in science assessment movements in their state. Three of the teachers
previously served on a national working group, convened by the National Assess-
ment Governing Board, that helped clarify the achievement-level standards set on
the 1996 science assessment. Seven of the 10 SMEs were women. All had
extensive experience teaching science. These SMEs represented the following
states: California, Delaware, Florida, Kentucky, Maryland, Ohio, South Carolina,
Texas, Virginia, and Washington. The data from these SMEs were gathered
during a two-day workshop in Washington, D.C. All SMEs received an hono-
rarium for their participation.
Items Selected for Analysis
As noted above, the grade 8 science assessment comprised 189 items. Sixty
items were selected for the purposes of this study to represent the test specifica-
tions in terms of the content and cognitive dimensions as well as item format
(multiple choice, short constructed response, extended constructed response). In
addition, items were selected that represented a theme or nature of science area.
These items came from 9 of the 15 blocks comprising the grade 8 item pool.
Item-objective congruence ratings (described below) were obtained for all 60
items. However, because of time and subject fatigue limitations, a subset of 45 of
these items was chosen for the item similarity ratings (also described below).
Table 4-2 presents the test specifications for the 60-item subset, and Table 4-3
presents the test specifications for the 45-item subset. A comparison of Tables 4-1
through 4-3 reveals that the percentages of items from each science field were
relatively comparable across the item pool and the item subsets but that the two
subsets had slightly more items measuring practical reasoning and scientific in-
vestigation.
Procedure
SME Training
Almost half (29) of the 60 NAEP items used in this study were associated
with one of the four hands-on science tasks. Twelve of these 29 items were
included in the similarity rating task involving the 45-item subset, and all 29 were
included in the item-objective congruence rating task. Training of the SMEs
began with a description of these hands-on tasks. The material kits for these tasks
were presented to the SMEs, and an oral description of the experiments was
provided. The descriptions focused on tasks the students were required to com-
plete in conducting their experiments. Next, the judges were asked to complete a
block of 14 test items as if they were students being tested. After completing the
items, the judges were given the answer keys and asked to check their answers.
Finally, the judges were given the operational test booklet sections for the nine
item blocks (i.e., all 60 items). The 45 items that were later used in the similarity
ratings were highlighted. The SMEs were given time to familiarize themselves with the items and
the scoring protocols.
TABLE 4-2 Cross-Tabulation of Specifications for 60-Item Subset Used in
Item-Objective Congruence Study

                      Ways of Knowing and Doing Science
Field of              Conceptual       Practical     Scientific
Science               Understanding    Reasoning     Investigation    Total
Earth science         10               6             6                22 (37%)
  (Theme)             (8)              (1)           (1)              (10)
  [Nature]            [0]              [1]           [1]              [2]
Life science          10               7             4                21 (35%)
  (Theme)             (8)              (3)           (4)              (15)
  [Nature]            [0]              [1]           [3]              [4]
Physical science      7                4             6                17 (28%)
  (Theme)             (0)              (2)           (1)              (3)
  [Nature]            [1]              [2]           [3]              [6]
Totals                27 (45.0%)       17 (28.3%)    16 (26.7%)       60
  (Theme)             (16)             (6)           (6)              (28)
  [Nature]            [1]              [4]           [7]              [12]
Note: Entries in the table are the number of items in each cell of the framework.
TABLE 4-3 Cross-Tabulation of Specifications for 45-Item Subset Used in
Item Similarity Rating Study

                      Ways of Knowing and Doing Science
Field of              Conceptual       Practical     Scientific
Science               Understanding    Reasoning     Investigation    Total
Earth science         9                3             3                15 (33%)
  (Theme)             (8)              (1)           (0)              (9)
  [Nature]            [0]              [1]           [0]              [1]
Life science          7                6             3                16 (36%)
  (Theme)             (0)              (3)           (3)              (6)
  [Nature]            [0]              [1]           [2]              [3]
Physical science      6                3             5                14 (31%)
  (Theme)             (1)              (1)           (1)              (3)
  [Nature]            [1]              [2]           [3]              [6]
Totals                22 (48.9%)       12 (26.7%)    11 (24.4%)       45
  (Theme)             (9)              (5)           (4)              (18)
  [Nature]            [1]              [4]           [5]              [10]
Note: Entries in the table are the number of items in each cell of the framework.
Item Similarity Ratings
Following these item familiarization steps, instructions for completing the
item similarity ratings were provided. The SMEs were informed that they would
be required to review pairs of NAEP items and provide a judgment regarding the
similarity of the items in each pair to one another in terms of the science knowl-
edge and skills tested. These instructions were intentionally general so that the
SMEs' ratings were not influenced by anyone else's preconceived notions of
what the items were measuring. Therefore, the content specifications for these
items, and the content frameworks for the test, were not described to the SMEs.
To facilitate understanding of the item similarity rating task, three "practice"
item pairs were distributed to the judges. The first pair involved two multiple-
choice items; the second pair involved a short constructed-response item and an
extended constructed-response item; and the third pair involved two extended
constructed-response items. Each item pair was printed on a single page, and an
eight-point similarity rating scale was printed at the bottom of each page. The
numeral "1" on the scale was labeled "very similar," and the numeral "8" was
labeled "very different." The SMEs rated the similarities among these three item
pairs individually and then discussed the ratings as a group. The SMEs with the
highest and lowest ratings for each item pair described the characteristics of the
items that influenced their ratings. Common factors cited were cognitive com-
plexity of the item and science content area the item was measuring. The SMEs
were told that they were on task and were each given an item similarity rating
booklet. The pages of these booklets each contained one item pair, with the same
eight-point rating scale printed at the bottom of each page. A sample item
similarity rating page is presented in Figure 4-1.
Consideration of all possible item pairings among the 45 items involved 990
item comparisons (45 x 44 / 2). Given the time constraints of the study, the
judges were required to rate only 700 of these 990 possible item pairings. Ten
separate booklets were created. Each booklet represented a different ordering of
the item similarity pairs to control for a systematic item order effect. The 700
ratings required of each SME were selected such that for each item pair seven
independent ratings would be provided. Five of the SMEs finished relatively
early and completed some of the "missing" 290 ratings. In addition to the 700
required ratings, six specific item pairs were repeated in each booklet. These
repetitions were included to provide an estimate of the reliability of the SMEs'
ratings. The six replicated item pairs were placed near the end of each booklet,
when the deleterious effects of fatigue and boredom were most likely to be
present. Thus, the error associated with the similarity ratings as measured by
these replicated item pairs most likely represents a worst-case scenario.
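
The size of the rating task follows from simple counting. The sketch below is illustrative only and not part of the original study; the constant and function names are assumptions. It enumerates the unordered item pairs and shows that 10 SMEs each providing 700 ratings yields, on average, about seven ratings per pair, which is in line with the design described above.

```python
from itertools import combinations

N_ITEMS = 45             # items in the similarity rating subset
N_RATERS = 10            # subject-matter experts (SMEs)
RATINGS_PER_RATER = 700  # required ratings per SME


def all_pairs(n_items):
    """Return every unordered pair of item numbers 1..n_items."""
    return list(combinations(range(1, n_items + 1), 2))


pairs = all_pairs(N_ITEMS)
total_required = N_RATERS * RATINGS_PER_RATER
mean_ratings_per_pair = total_required / len(pairs)

print(len(pairs))                       # number of distinct item pairs
print(round(mean_ratings_per_pair, 2))  # average required ratings per pair
```

How the 700 pairs were actually assigned to each booklet is not specified in the text, so the sketch stops at the counts.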
Upon completion of the item similarity ratings, the SMEs responded to a
short questionnaire on which they listed the criteria they used in making the item
similarity ratings. The questionnaire asked the SMEs how long they took to
2. The instrument shown is used to measure
   wind direction
   wind speed
   air pressure
   relative humidity
IL001078

5. A space station is to be located between the Earth and the Moon at
the place where the Earth's gravitational pull is equal to the Moon's
gravitational pull. On the diagram below, circle the letter indicating
the approximate location of the space station.
[Diagram: Earth and Moon, with the letters A, B, and C marked]
Explain your answer.
HE001703

Very Similar                    Very Different
1    2    3    4    5    6    7    8
FIGURE 4-1 Sample item similarity rating sheet. Items are from National Center for
Education Statistics, U. S. Department of Education, 1996 National Assessment of Edu-
cational Progress in Science released items; available at http://nces.ed.gov/naep.
complete the similarity ratings and listed seven item characteristics that were
anticipated to influence their ratings: science discipline measured by each item,
cognitive level measured by each item, item format, item difficulty, item length,
item themes, and historical origin of each item. Space on the questionnaire was
also provided for the SMEs to add any additional criteria they used that were not
included on the list.
Item-Objective Congruence Ratings
The purpose of the item similarity rating task was to obtain the SMEs'
"independent" appraisal of the knowledge and skills measured by the items (i.e.,
independent of knowledge of the content, cognitive, nature, and theme dimen-
sions that governed item development). In this manner it was hoped that the
content specifications for these 45 items would be "recovered" rather than con-
firmed. Thus, the similarity rating task tested the adequacy of the dimensions
underlying the framework, given the items that were developed.
For the item-objective2 congruence ratings, the SMEs were given an oral
presentation describing the NAEP science frameworks as well as the public
documentation of these frameworks (NAGB, 1996). The SMEs were then presented
with a new booklet that listed the item numbers for each block (60 items total)
and a series of columns under which they were to provide ratings for each item.
The task presented to the SMEs was to indicate their opinion regarding the "field
of science," "way of knowing and doing science," "theme of science," and "na-
ture of science" classification of each item. They were informed that each item
was classified by the test developers into one of the three "fields" and into one of
the three "ways" dimensions but that only some items were classified as a "na-
ture" item or a "theme" item. These data provided a check on whether the SMEs
would classify the items in a manner congruent with their test specifications. A
sample item-objective congruence rating page is presented in Figure 4-2.
Exit Survey
Upon completion of the item-objective congruence ratings, the SMEs were
given a brief survey. This survey asked them about their confidence in the
similarity and congruence ratings they provided and asked them to provide sug-
gestions for future research in this area. In addition, the survey asked the SMEs
about their experience with science assessment standards at the local, state, and
national levels and asked them to describe how well the NAEP science materials
matched national, state, and local science standards.
2. The term objective is used here in a general sense to describe the specific field of science, way of
knowing and doing science, theme of science, and nature of science designations for each item.
[FIGURE 4-2 Sample item-objective congruence rating page.]
Data Analysis
The item similarity ratings were analyzed using multidimensional scaling
(MDS). The purpose of MDS is to portray the similarities among objects visually,
as in a map (Schiffman et al., 1981). This visual portrayal is accomplished by
scaling the items along as many continuous dimensions as are necessary to
adequately represent the similarity ratings. Each stimulus dimension in an MDS
solution corresponds to an attribute or characteristic of the objects being scaled.
The purpose of this analysis was to determine whether dimensions, such as those
specified in the NAEP frameworks, would be perceived by the SMEs and whether
the items would be configured in the MDS space in a manner congruent with the
test specifications.
The model used was an "individual differences" or weighted MDS model.
Weighted models allow for the scaling of SMEs in the same MDS space in which
the items are configured. Thus, by using a weighted MDS model, similarities and
differences among the SMEs, as well as among the items, could be observed. The
weighted MDS model used was the INDSCAL model (Carroll and Chang, 1970)
implemented in the ALSCAL procedure in SPSS, version 7.5 (Young and Harris,
1993). The distances among items and the dimensional weights for the SMEs are
computed using the weighted distance formula developed by Carroll and Chang
(1970). In the INDSCAL model the similarity data for each subject are
transformed to derive coordinates on dimensions that are used to scale the items in
Euclidean space. The perceptual space for each subject is related to a common
"group space" by weighting the dimensions of the group space separately for
each subject. That is, each subject's coordinate matrix is multiplied by a vector
of weights (w) consisting of elements wka that represent the relative emphasis
subject k places on dimension a. The distances between stimuli are computed by
incorporating this weighting factor into the Euclidean distance formula used by
classical MDS. The INDSCAL model defines the distance between two objects
i and j as:

\[ d_{ijk} = \sqrt{\sum_{a=1}^{r} w_{ka}\,(x_{ia} - x_{ja})^{2}} \]

where d_{ijk} is the Euclidean distance between points i and j for subject k, w_{ka} is
the weight subject k places on dimension a, x_{ia} is the coordinate of point i on
dimension a, and r is the number of specified dimensions.
The INDSCAL analysis provides a multidimensional configuration of the attributes
rated (the stimulus, or item space) and a multidimensional configuration of the
subjects (the group, or SME space).
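
The weighted distance just described (the square root of the weighted sum of squared coordinate differences) can be written as a short function. This is a minimal sketch for illustration, assuming coordinates and weights are supplied as equal-length sequences; the function name indscal_distance is an assumption and does not correspond to any SPSS or ALSCAL routine.

```python
import math


def indscal_distance(x_i, x_j, weights_k):
    """Weighted Euclidean distance between items i and j for subject k.

    x_i, x_j   -- coordinates of the two items on each of r dimensions
    weights_k  -- subject k's weight on each of the same r dimensions
    """
    if not (len(x_i) == len(x_j) == len(weights_k)):
        raise ValueError("coordinates and weights must have the same length")
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights_k, x_i, x_j))
    )


# With unit weights the formula reduces to the ordinary Euclidean distance.
print(indscal_distance([0.0, 0.0], [3.0, 4.0], [1.0, 1.0]))
# A subject who ignores the second dimension sees only the first difference.
print(indscal_distance([0.0, 0.0], [3.0, 4.0], [1.0, 0.0]))
```

A weight of zero removes a dimension from that subject's perceived distances, which is why the subject weights can be read as the relative emphasis each SME placed on each dimension.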
To facilitate interpretation of the MDS solutions, external information on the
items was analyzed together with the MDS item coordinates. These external data
included item difficulties, the item-objective congruence ratings, and dichotomous
"dummy variables" reflecting the item content specifications (i.e., field, ways,
theme, and nature designations for each item). These data were correlated with
the coordinates from the MDS solution to determine whether the dimensions
were related to these item attributes.
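
As an illustration of this step, the sketch below correlates a dichotomous dummy variable with item coordinates on one dimension. It is not the authors' analysis code; the function name pearson_r is an assumption, and the six coordinates and format codes are hypothetical values made up for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: coordinates of six items on one MDS dimension and a
# dummy variable coded 1 for multiple-choice items, 0 otherwise.
coordinates = [1.2, 0.9, 1.1, -0.8, -1.0, -1.4]
is_multiple_choice = [1, 1, 1, 0, 0, 0]

r = pearson_r(coordinates, is_multiple_choice)
print(round(r, 2))
```

A correlation near 1 or -1 between a dummy variable and a dimension suggests that the dimension separates items with that attribute from the rest.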
RESULTS
Although the SMEs completed the item similarity ratings before they
completed the item-objective congruence ratings, the results of the item-objective
congruence ratings are presented first. These results involve all 60 items used in
this study and are helpful for subsequent interpretation of the MDS results.
Item-Objective Congruence Ratings
Tables 4-4 through 4-7 summarize the results of the item-objective congru-
ence ratings. An item was considered to be "correctly" matched to its framework
designation if at least 7 of the 10 SMEs placed it in the same category that was
specified in the test blueprint. In addition to providing the percentages of items
correctly classified by the SMEs, these tables present the number of "unanimous"
matches (i.e., all 10 SMEs correctly classified the item) and stem-and-leaf plots
of the SMEs' ratings.
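
The "7 of 10" matching rule and the resulting congruence index can be sketched as follows. This is an illustrative example under stated assumptions rather than the study's scoring code: the function names are hypothetical, and the list of agreement counts is made-up data (one entry per item, giving the number of SMEs who chose the blueprint category).

```python
N_SMES = 10     # number of subject-matter experts rating each item
THRESHOLD = 7   # minimum number of agreeing SMEs for a "correct" match


def is_matched(n_agreeing, threshold=THRESHOLD):
    """True if enough SMEs placed the item in its blueprint category."""
    return n_agreeing >= threshold


def congruence_index(agreement_counts, threshold=THRESHOLD):
    """Percentage of items matched to their blueprint category."""
    matched = sum(1 for count in agreement_counts if is_matched(count, threshold))
    return 100 * matched / len(agreement_counts)


def unanimous_count(agreement_counts, n_smes=N_SMES):
    """Number of items that every SME placed in the blueprint category."""
    return sum(1 for count in agreement_counts if count == n_smes)


# Hypothetical agreement counts for ten items.
example_counts = [10, 10, 9, 7, 6, 10, 8, 4, 10, 7]

print(congruence_index(example_counts))
print(unanimous_count(example_counts))
```

Under this rule, 51 matched items out of 60 would give an index of 85 percent.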
Those ratings pertaining to the field of science dimension of the NAEP
framework are presented in Table 4-4. More than half of the items (31, or 52
percent) were unanimously matched to their fields of science specified in the test
blueprint. Only nine items failed to be correctly matched to their corresponding
fields by at least seven SMEs, yielding an item-objective congruence index of 85
percent for the 60 items. Three of the "misclassified" items were earth science
items that were classified as physical science by at least eight SMEs. Four other
items were physical science items, three of which were predominantly rated as
earth science and one as life science. The two remaining misclassified items
were life science items, one of which nine SMEs classified as earth science; the
other item was classified as life science by only six SMEs. The percentages of
correct classifications for the earth, life, and physical science fields were 86, 90,
and 76 percent, respectively. These results indicate that in general the SMEs
supported the field of science designations of the items. However, they did not
"agree" with the operational content classifications for 15 percent of the 60 items.
The results for the cognitive dimension (ways of knowing and doing science) are
presented in Table 4-5. The correct classifications were relatively lower for this
dimension than for the field of science dimension. Using the same "7 of 10"
SME criterion, only 60 percent of the items were matched to their cognitive area
specified in the test blueprint. Unanimous ratings were observed for only eight
items, all of which were conceptual understanding items. The percentages of
correct classifications for the conceptual understanding, practical reasoning, and
[Pages 85 through 89 are not included in this excerpt.]
TABLE 4-9 Summary of SME Fit Statistics and Dimension Weights

                        Subject Weights on Dimension
SME    Stress    R2      1      2      3      4      5      Weirdness
1      .153      .655    .48    .28    .33    .33    .36    .15
2      .125      .803    .39    .39    .60    .28    .26    .31
3      .137      .749    .54    .54    .16    .29    .24    .28
4      .140      .717    .45    .53    .27    .29    .28    .14
5      .123      .797    .63    .46    .25    .27    .25    .21
6      .121      .812    .70    .34    .28    .22    .29    .27
7      .153      .658    .32    .53    .31    .30    .30    .17
8      .129      .818    .27    .29    .73    .26    .28    .47
9      .163      .615    .27    .31    .13    .49    .43    .40
10     .098      .853    .47    .48    .21    .47    .37    .22
...more, all dimensions from the five-dimensional solution were interpretable (see
below), but the sixth dimension in the six-dimensional solution was not readily
interpretable. Thus, the five-dimensional solution was selected as the appropriate
model for these data. As indicated in Table 4-8, the five-dimensional solution
accounted for 75 percent of the variance in the SMEs' (transformed) similarity
rating data. The total variance in these data accounted for by each dimension was
22, 18, 14, 11, and 10 percent, respectively, for dimensions one through five.
SME Congruence
The model-data fit values for each SME are presented in Table 4-9. The
model fit the data for SMEs 1, 7, and 9 least well (R2 less than .7 and STRESS
greater than .15); however, these levels of fit are on par with those found in
previous research (e.g., Deville, 1996; Sireci and Geisinger, 1992, 1995). The
congruence among the SMEs was evaluated by inspecting the individual subject
weights and the subject weirdness indexes.3 Although differences were observed
in the weighting of the dimensions across SMEs, all SMEs appeared to be using
all five dimensions in making their similarity ratings. Figure 4-3 presents separate
two-dimensional subspaces from the five-dimensional SME weight space. These
two subspaces highlight the differences among the SMEs. SME #8 exhibited the
3. The weirdness index describes the relative weightings of the dimensions for each subject in
proportion to the average dimension weights across all subjects. A subject with a large weight on
one dimension and small weights on the other dimensions would have a weirdness index near one,
which is the maximum value. Subjects with dimension weights proportional to the average weights
have weirdness indexes near zero, which is the minimum value (see Young and Harris, 1993, for the
full details).
[Figure 4-3 panels: (A) SME weights with Dimension 1 (Conceptual Understanding) on the horizontal axis; (B) SME weights with Dimension 4 (Life vs. Earth) on the horizontal axis. Points are labeled with SME numbers 1 through 10; axes run from 0.0 to .8.]
FIGURE 4-3 Two-dimensional subject weight subspaces. (a) dimensions 1 and 3;
(b) dimensions 4 and 5.
largest weirdness index, due to his relatively large emphasis on dimension 3 (see
Figure 4-3a). SME #9 and #10 had relatively larger weights on dimensions four
and five (see Figure 4-3b). As described below, these two dimensions corre-
sponded to the field of science characteristics of the items. Thus, these two SMEs
emphasized content characteristics in their similarity ratings, whereas the other
SMEs tended to emphasize cognitive characteristics of the items. Although these
differences are interesting, the subject weights indicate that all five dimensions
were used by all SMEs in making their ratings. Thus, we turn now to interpreta-
tion of these five dimensions.
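
The observations about individual SMEs can be checked directly against the values in Table 4-9. The sketch below is illustrative and not part of the original analysis; the data are transcribed from the table, while the dictionary layout and function names are assumptions.

```python
# Values transcribed from Table 4-9:
# SME -> (stress, r_squared, weights on dimensions 1-5, weirdness)
TABLE_4_9 = {
    1: (.153, .655, (.48, .28, .33, .33, .36), .15),
    2: (.125, .803, (.39, .39, .60, .28, .26), .31),
    3: (.137, .749, (.54, .54, .16, .29, .24), .28),
    4: (.140, .717, (.45, .53, .27, .29, .28), .14),
    5: (.123, .797, (.63, .46, .25, .27, .25), .21),
    6: (.121, .812, (.70, .34, .28, .22, .29), .27),
    7: (.153, .658, (.32, .53, .31, .30, .30), .17),
    8: (.129, .818, (.27, .29, .73, .26, .28), .47),
    9: (.163, .615, (.27, .31, .13, .49, .43), .40),
    10: (.098, .853, (.47, .48, .21, .47, .37), .22),
}


def least_well_fit(table, max_r2=0.7, min_stress=0.15):
    """SMEs whose R2 is below max_r2 and whose STRESS exceeds min_stress."""
    return sorted(
        sme for sme, (stress, r2, _, _) in table.items()
        if r2 < max_r2 and stress > min_stress
    )


def largest_weirdness(table):
    """SME with the largest weirdness index."""
    return max(table, key=lambda sme: table[sme][3])


def top_weighted(table, dimension, n=2):
    """The n SMEs with the largest weights on a dimension (numbered from 1)."""
    ranked = sorted(table, key=lambda sme: table[sme][2][dimension - 1], reverse=True)
    return ranked[:n]


print(least_well_fit(TABLE_4_9))
print(largest_weirdness(TABLE_4_9))
print(top_weighted(TABLE_4_9, dimension=4), top_weighted(TABLE_4_9, dimension=5))
```

The results agree with the text: SMEs 1, 7, and 9 fall below the stated fit levels, SME 8 has the largest weirdness index, and SMEs 9 and 10 carry the largest weights on dimensions 4 and 5.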
Interpreting the Dimensions
The dimensions were interpreted visually and with the assistance of statisti-
cal analyses comparing known item characteristics with the item coordinates
from the MDS solution. Visual interpretations were made separately by the first
author and by a science content expert from the National Academy of Sciences.
The statistical analyses involved computing correlations among the MDS item
coordinates and content, cognitive, and format item attributes.
Because of the overlap of item characteristics (e.g., more of the practical
reasoning items were also extended constructed-response items and most of the
nature of science items were scientific investigation items), the visual interpreta-
tions were able to clarify some of the multiple interpretations that could be
attributed to the dimensions using only the statistical results. Based on the
(subjective) visual and (objective) statistical information, the following interpre-
tations were given to the dimensions: dimension 1 is a "conceptual understand-
ing" cognitive dimension that separates the "lower-order" cognitive skill items
(e.g., factual recognition items) from those items requiring higher-order skills
(e.g., design an experiment, interpret results); dimension 2 is an item format
dimension that separates the multiple-choice items from the constructed-response
items; dimension 3 is a "practical/applied reasoning" cognitive dimension that
separates the practical reasoning items from the scientific investigation items;
dimension 4 is a content dimension that separates the life science items from the earth
science items; and dimension 5 is a content dimension that separates the physical
science items from the life science items. Thus, the first three dimensions are
related to cognitive item attributes, and the fourth and fifth dimensions are related
to content item attributes.
Figure 4-4 presents the two-dimensional item subspace for dimensions 1 and
2. A conspicuous "chasm" can be seen above the origin of dimension 1 (horizon-
tal). This chasm roughly separates the lower cognitive level "conceptual under-
standing" (C) items (positive coordinates, or right side of the figure) from the
higher-level "scientific investigation" (S) items (negative coordinates). Three
conceptual understanding items have negative coordinates on this dimension;
however, these same three items were rated as measuring higher-level cognitive
[Figure 4-4 plot: items labeled C, P, or S; horizontal axis, Dimension 1 (Conceptual Understanding); vertical axis, dimension 2; both axes run from -2.0 to 2.0.]
FIGURE 4-4 Two-dimensional MDS stimulus subspace: items plotted along dimensions
1 and 2 using cognitive classification symbols. C, conceptual understanding; P, practical
reasoning; S, scientific investigation.
areas by the SMEs in the item-objective ratings, as described earlier. Similarly,
the two scientific investigation items with positive coordinates on this dimension
tended to be "misclassified" with respect to cognitive area by the SMEs. Dimen-
sion 2 (vertical) separates the practical reasoning items from the others; however,
all of the practical reasoning items, except one, were also constructed-response
items. Figure 4-5 presents the same configuration but labels the items according
to item format. As can be seen from this figure, all of the multiple-choice items
have negative coordinates on dimension 2.
Figure 4-6 presents the item configuration for the two-dimensional subspace
formed by dimensions 1 and 3. All but two of the scientific investigation items
have negative or near-zero coordinates on dimension 3. Both of these items
exhibited low item-objective congruence for scientific investigation. Similarly,
all but two of the practical reasoning items had positive coordinates on dimension
3, both of which also had low item-objective congruence ratings for the practical
reasoning cognitive area.
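
The sign-based reading of the dimensions used above can be sketched in code. The following is a minimal illustration only: the item identifiers, coordinates, and the function name flag_off_side_items are hypothetical and are not taken from the study. It flags items that fall on the opposite side of zero from the side expected for their cognitive area on a single dimension.

```python
from typing import Dict, List, Tuple


def flag_off_side_items(
    items: List[Tuple[str, str, float]],
    expected_sign: Dict[str, int],
    tol: float = 0.0,
) -> List[str]:
    """Return ids of items lying on the unexpected side of a dimension.

    items: (item_id, area_label, coordinate on one MDS dimension)
    expected_sign: +1 or -1 for each area label the dimension separates
    tol: coordinates with |value| <= tol are treated as near zero and skipped
    """
    flagged = []
    for item_id, area, coord in items:
        sign = expected_sign.get(area)
        # Skip areas this dimension does not separate, and near-zero items.
        if sign is None or abs(coord) <= tol:
            continue
        if coord * sign < 0:
            flagged.append(item_id)
    return flagged


# Hypothetical coordinates on dimension 1 (illustrative, not study data).
dim1 = [
    ("item01", "C", 1.2),
    ("item02", "C", -0.4),
    ("item03", "S", -1.1),
    ("item04", "S", 0.6),
    ("item05", "P", 0.3),
]

# Assumed reading of dimension 1: C items positive, S items negative.
flagged = flag_off_side_items(dim1, {"C": +1, "S": -1})
print(flagged)
```

Items flagged in this way are candidates for comparison with the item-objective congruence ratings, which is the check the text performs by inspection.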
Figure 4-7 presents a three-dimensional subspace comprising the first three
dimensions, which were related to cognitive area. Although some cognitive area
overlap is evident, clusters of items from the same cognitive area occupy
segregated regions of the subspace. In particular, the conceptual understanding
items are primarily arranged in the left side of the figure (a tight cluster of
these items appears in the lower left), and the practical reasoning items are
configured near the top of the space.

FIGURE 4-5 Two-dimensional MDS stimulus subspace: items plotted along dimensions
1 and 2 using item format symbols. E, extended constructed-response; M, multiple-choice;
S, short constructed-response.
[Figure not reproduced. Horizontal axis: Dimension 1 (Conceptual Understanding).]

FIGURE 4-6 Two-dimensional MDS stimulus subspace: items plotted along dimensions
1 and 3 using cognitive classification symbols. C, conceptual understanding; P, practical
reasoning; S, scientific investigation.
[Figure not reproduced. Horizontal axis: Dimension 3 (Practical/Applied Reasoning).]

FIGURE 4-7 Three-dimensional MDS stimulus space illustrating cognitive groupings
among grade 8 NAEP science items. C, conceptual understanding; P, practical reasoning;
S, scientific investigation.
[Figure not reproduced. Axes: Dimen. 1, Dimen. 2, Dimen. 3.]

Figure 4-8 illustrates the two-dimensional "content" subspace formed by
dimensions 4 and 5. Dimension 4 (horizontal) tended to segregate the earth
science (E) items (positive coordinates) and life science (L) items (negative coor-
dinates). All but one of the life science items had negative coordinates on dimen-
sion 4. This item was classified as a life science item by seven of the 10 SMEs.
Dimension 5 (vertical) appears to account for the degree to which the items
measured physical science. Most physical science items had relatively large
negative coordinates on this dimension; only one physical science item had a
large positive coordinate. This item was classified as an earth science item by 8
of the 10 SMEs. Although some overlap among content areas is evident, in
general the items comprising the three different fields of science tend to be
segregated in the subspace. In particular, most of the life science items are
configured more closely to one another than they are to items from other content
areas.

FIGURE 4-8 Two-dimensional MDS stimulus subspace: items plotted along dimensions
4 and 5 using content classification symbols. E, earth science; L, life science; P, physical
science.
[Figure not reproduced. Horizontal axis: Dimension 4 (Life vs. Earth).]

To assist in verifying the visual interpretations given to the dimensions,
correlations were computed between the MDS coordinates and external data on
the items. These external data included the item-objective congruence ratings;
item format information; and the content, cognitive, nature, and theme designa-
tions of the items. The content, cognitive, nature, and theme designations were
"dummy" coded for this analysis. For example, an earth science dummy variable
was created by coding all earth science items "1" and all other items "0." The
cognitive, theme, and nature areas were also dummy coded, as well as an item
format variable (multiple-choice/constructed-response). Two separate correla-
tional analyses were conducted. The first analysis correlated the item-objective
congruence ratings with the item coordinates. To conduct this analysis, the
number of SMEs categorizing an item in each content, cognitive, nature, or theme
area was calculated. These sums were then correlated with the MDS coordinates.
The second analysis correlated the dummy variables with the item coordinates.
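
The two correlational analyses described above can be sketched as follows. This is a minimal illustration under stated assumptions: the field labels, dimension 4 coordinates, and SME counts are invented for the example, and the helper names dummy_code and pearson_r are hypothetical. It shows an ordinary Pearson correlation computed once with a 0/1 dummy variable and once with the number of SMEs (out of 10) placing each item in the earth science field.

```python
import math
from typing import List, Sequence


def dummy_code(labels: Sequence[str], target: str) -> List[int]:
    """Code items in the target category as 1 and all other items as 0."""
    return [1 if label == target else 0 for label in labels]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for six items (illustrative, not study data).
fields = ["E", "E", "L", "L", "P", "P"]          # intended field of science
dim4 = [1.4, 0.9, -1.2, -0.8, 0.1, -0.2]         # coordinates on dimension 4
sme_counts_earth = [9, 8, 1, 0, 2, 3]            # SMEs (of 10) choosing earth science

# Second analysis: dummy variable correlated with the coordinates.
earth_dummy = dummy_code(fields, "E")
r_dummy = pearson_r(earth_dummy, dim4)

# First analysis: SME counts correlated with the coordinates.
r_counts = pearson_r(sme_counts_earth, dim4)

print(round(r_dummy, 2), round(r_counts, 2))
```

Because the SME counts carry graded information about how clearly an item belongs to an area, whereas the dummy variable is all-or-none, the two correlations for the same dimension need not be equal.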
The results of the correlation analyses are presented in Table 4-10
(item-objective congruence correlations) and Table 4-11 (dummy variable
correlations). Both sets of correlations lead to similar conclusions regarding
the item characteristics defining each dimension. However, the correlations based
on the item-objective congruence data tended to be larger. The largest correlations for the
coordinates on the first dimension were with the conceptual understanding and
scientific investigation cognitive areas. The largest correlation for the second
dimension was for the item format variable. For the third dimension, large
correlations with the practical reasoning and scientific investigation cognitive
areas were observed. The nature of science dummy variable also exhibited a
large correlation with this dimension, but the nature of science item-objective
congruence ratings did not. This finding probably stems from the fact that 5 of
the 10 nature of science items were also scientific investigation items. The
coordinates from the fourth and fifth dimensions exhibited large correlations with
the variables associated with the field of science designations of the items. Thus,
in general, the correlation analyses supported the visual interpretations given
earlier. The first three dimensions correspond to cognitive and item format
attributes, and the fourth and fifth dimensions correspond to fields of science
attributes.

TABLE 4-10 Correlations Among MDS Item Coordinates and Item-Objective
Congruence Ratings

                                        Dimension
Item Variable                  1       2       3       4       5
Fields
  Earth science               -.04    -.15    -.01     .61*    .21
  Life science                 .06     .22    -.04    -.65*    .48*
  Physical science            -.02    -.09     .07    -.07    -.75*
Ways of Knowing and Doing
  Conceptual understanding     .80*   -.51*   -.17     .01     .18
  Practical reasoning         -.27     .56*   -.43*   -.12     .16
  Scientific investigation    -.71*    .05     .66*    .11    -.38
Themes
  Models                       .07    -.03    -.08     .68*    .20
  Patterns                    -.57*    .10     .10    -.14    -.02
  Systems                      .43*   -.02    -.08    -.49*    .22
Nature
  Yes                         -.72*    .55*    .27     .14    -.11
  No                           .71*   -.52*   -.13    -.13     .17
*p < .01.

TABLE 4-11 Correlations Among INDSCAL Item Coordinates and Item
Dummy Variables

                                        Dimension
Item Variable                  1       2       3       4       5
Fields
  Earth science                .02    -.04     .03     .61*    .02
  Life science                -.02     .14    -.14    -.58*    .52*
  Physical science             .00    -.10     .11    -.02    -.57*
Ways of Knowing and Doing
  Conceptual understanding     .57*   -.43*   -.03     .05     .08
  Practical reasoning         -.16     .57*    .40*   -.18     .17
  Scientific investigation    -.49*   -.12    -.40*    .14    -.28
Themes
  Models                      -.11     .06    -.11     .61*    .17
  Patterns                    -.36     .02     .02    -.10     .40*
  Systems                      .28     .09    -.14    -.23     .25
Nature
  Science                     -.44*    .29    -.46*    .08    -.04
  Technology                   .08     .24     .29    -.08     .01
Multiple Choice (Yes/No)       .40*   -.76*    .06     .12     .1 A
Difficulty                     .09    -.52*    .28     .01     .02
*p < .01.

In summary, analysis of the item similarities data using MDS uncovered
cognitive- and content-related dimensions that were congruent with those dimen-
sions specified in the National Assessment Governing Board frameworks. Items
that did not tend to group together with other items in their content or cognitive
area tended to be the same items that were identified as problem items from
analysis of the item-objective congruence ratings.
DISCUSSION
A fundamental requirement in educational assessment is operationally defining
the construct(s) measured. Content validation involves determining whether
a test actually represents the intended construct. Thus, it is an important step in
evaluating the validity of inferences derived from test scores. As Sireci
(1998b:106) has stated, "if the sample of tasks comprising a test is not
representative of the content domain tested, the test scores and item response data used in
studies of construct validity are meaningless."
Tests used in NAEP are operationally defined using test frameworks. This
study sought to evaluate the content validity of a particular test in the NAEP
battery: the 1996 grade 8 science assessment. An independent panel of science
educators was convened, and these experts provided judgments of the content
characteristics of items from this test over a two-day period. Two distinct methods
for evaluating content validity were used, and both methods provided similar
conclusions regarding how well a carefully selected subset of items represented
the framework dimensions.
Does the grade 8 1996 NAEP science assessment measure what it purports to
measure? The results from this study suggest that, in general, the two major
dimensions composing the framework were supported by the SMEs' judgments.
The majority of the items studied (85 percent) were judged to be measuring the
content areas they were designed to measure. Although less congruence was
observed for the cognitive classifications of the items, it was clear the SMEs
thought that both higher- and lower-order thinking skills were measured across
all three fields of science. These two major dimensions ("fields of science" and
"ways of knowing and doing science") were also uncovered from the SMEs' item
similarity ratings taken before the SMEs were aware of these dimensions. Sireci
(1998a, 1998b) argues that this type of rating task provides a more rigorous
appraisal of content validity. Thus, the results of the item-objective congruence
and MDS analyses provide strong evidence that the content and cognitive dimen-
sions of the framework were represented well by the actual items composing the
assessment. However, given the fact that 15 percent of the studied items were
classified differently by the SMEs with respect to field of science, a concern
remains regarding which items to include in which field of science scale when the
data are scored, calibrated, and reported. It is also interesting that the SMEs saw
cognitive distinctions among the items first and foremost, before distinguishing
among the items in terms of the fields of science content areas.
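
The kind of tally behind a congruence percentage such as the one reported above can be sketched as follows. This is a minimal illustration, not the study's procedure: the counts are invented, the function name congruence_summary is hypothetical, and the criterion of at least 7 of 10 SMEs agreeing with the intended classification is an assumption made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def congruence_summary(
    intended: Sequence[str],
    sme_counts: Sequence[Dict[str, int]],
    min_agree: int,
) -> Tuple[float, List[int]]:
    """Return the share of congruent items and the indices of the others.

    intended: the test developers' classification of each item
    sme_counts: per item, the number of SMEs choosing each category
    min_agree: SMEs required to match the intended category (an assumption)
    """
    if len(intended) != len(sme_counts) or not intended:
        raise ValueError("need one non-empty count record per item")
    congruent = [
        counts.get(target, 0) >= min_agree
        for target, counts in zip(intended, sme_counts)
    ]
    rate = sum(congruent) / len(congruent)
    problem_items = [i for i, ok in enumerate(congruent) if not ok]
    return rate, problem_items


# Hypothetical ratings from 10 SMEs for four items (not study data).
intended = ["E", "L", "P", "L"]
sme_counts = [
    {"E": 9, "P": 1},
    {"L": 10},
    {"P": 2, "E": 8},   # developers said physical; most SMEs said earth
    {"L": 7, "E": 3},
]

rate, problem_items = congruence_summary(intended, sme_counts, min_agree=7)
print(rate, problem_items)
```

Items returned in problem_items are the ones whose scale assignment would need review before scoring and reporting, which is the concern raised in the paragraph above.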
The item-objective congruence ratings, and the dimensions observed in the
SME-derived MDS solution, did not strongly support the themes of science or
nature of science dimensions of the framework. However, like the ways of
knowing and doing science dimension, separate scores are not reported for these
dimensions, and including them in the frameworks probably enhanced item devel-
opment and contributed to the overall quality of the item pool. The lack of
congruence between the SMEs and test developers regarding these two dimen-
sions may be due to problems in the item classifications or to a lack of clarity
regarding the descriptions of these dimensions. Thus, the utility of these two
dimensions deserves further study.
Although the results of this study are encouraging, they are limited only to
the 1996 grade 8 science assessment. Similar studies are recommended for other
tests in the NAEP battery.
ACKNOWLEDGMENTS
The authors thank Karen Mitchell, Lee Jones, and Holly Wells for their
invaluable assistance with this research and an anonymous reviewer for helpful
comments on a draft of this paper.
REFERENCES
Carroll, J.D., and J.J. Chang
1970 An analysis of individual differences in multidimensional scaling via an N-way
generalization of "Eckart-Young" decomposition. Psychometrika 35:283-319.
Deville, C.W.
1996 An empirical link of content and construct validity evidence. Applied Psychological
Measurement 20:127-139.
Dong, H.
1985 Chance baselines for INDSCAL's goodness of fit index. Applied Psychological Mea-
surement 9:27-30.
Ebel, R.L.
1977 Comments on some problems of employment testing. Personnel Psychology 30:55-63.
Kruskal, J.B., and M. Wish
1978 Multidimensional Scaling. Newbury Park, Calif.: Sage.
MacCallum, R.
1981 Evaluating goodness of fit in nonmetric multidimensional scaling by ALSCAL. Applied
Psychological Measurement 5:377-382.
Messick, S.
1989 Validity. Pp. 13-103 in Educational Measurement, 3rd ed., R. Linn, ed. Washington,
D.C.: American Council on Education.
National Assessment Governing Board (NAGB)
1996 Science Framework for the 1996 National Assessment of Educational Progress. Wash-
ington, D.C.: NAGB.
Schiffman, S.S., M.L. Reynolds, and F.W. Young
1981 Introduction to Multidimensional Scaling. New York: Academic Press.
Sireci, S.G.
1998a Gathering and analyzing content validity data. Educational Assessment 5:299-321.
1998b The construct of content validity. Social Indicators Research 45:83-117.
Sireci, S.G., and K.F. Geisinger
1992 Analyzing test content using cluster analysis and multidimensional scaling. Applied
Psychological Measurement 16:17-31.
1995 Using subject matter experts to assess content representation: A MDS analysis. Applied
Psychological Measurement 19:241-255.
Young, F.W., and D.F. Harris
1993 Multidimensional scaling. Pp. 155-222 in SPSS for Windows: Professional Statistics,
computer manual, version 6.0, M.J. Norusis, ed. Chicago: SPSS.