Read "Methodological Advances in Cross-National Surveys of Educational Achievement" at NAP.edu

Page 25 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

Part I
Study Design

Page 26 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

This page in the original is blank.

Page 27 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

2
The Measurement of Student Achievement in International Studies

Robert L. Linn*

The measurement of student achievement is a challenging undertaking regardless of the scope of the domain of measurement, the student population to be assessed, or the purposes of the assessment. The more specific the purpose, the more homogeneous the population of students, and the narrower the domain of measurement, however, the easier is the task of developing measures that will yield results that support valid interpretations and uses.

A teacher who prepares an end-of-unit test in algebra faces a task with a fairly clearly defined content domain and knows a great deal about the common experiences of the students who will take the test. There are still variations in purpose (e.g., grade assignment, formative feedback to students, feedback to the teacher) that need to be considered, but the purposes are reasonably circumscribed. There are also issues of the item types (e.g., multiple-choice, short-answer, extended-response problems) to be used and the cognitive demands of the items. For example, how much emphasis should be given to routine application of algorithms, how much to conceptual understanding, how much to solving new problems that require students to make generalizations, how much to communication, and how much to making connections to earlier concepts and assign

*	Robert Linn is a distinguished professor in the School of Education at the University of Colorado. He is co-director of the National Center for Research on Evaluation, Standards, and Student Testing.

Page 28 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

ments? In the individual classroom setting, however, much is known about the familiarity that students have with different item formats, and that familiarity is relatively homogeneous for all students taking the test. Moreover, instructional goals can be used to guide decisions about emphasis given to different cognitive processes.

Large-scale assessments, be they a norm-referenced test designed for use nationally, a state assessment, or an assessment such as the National Assessment of Educational Progress (NAEP), face many of the issues involved in an end-of-unit test for use in a single classroom. Issues of item types and the cognitive demands of the items, for example, remain important, but there is greater diversity in the familiarity that students have with different formats and with items that make different levels of cognitive demands. The delineation of purpose and scope of the content domain are considerably more complicated for the large-scale assessment development than for the classroom test. Moreover, the definition of the target population is no longer a given, and even when defined will be more heterogeneous in background in curriculum exposure and in instruction directed to the content of the assessment. These complications exacerbate the challenges of developing assessments that yield results that support valid interpretations and uses.

Not surprisingly, the challenges are greater still for international assessments of student achievement. An immediately apparent complication is that assessments have to be translated into the multiple languages of participating countries. Variations among countries in educational systems, cultures, and traditions of assessment add to the complexity of the problems of international assessments.

PURPOSES

Consideration of measurement issues for any assessment should start with the identification of the purpose of the assessment. Millman and Greene (1989, p. 335) note that “The first and most important step in educational test development is to delineate the purpose of the test or the nature of the inferences intended from test scores.” They justify this claim by noting that “A clear statement of purpose provides the test developer with an overall framework for test specification and for item development, tryout and review” (Millman & Greene, 1989, p. 335). Most assessments, of course, serve multiple purposes, only some of which are intended and clearly specified in advance. Nonetheless, the delineation of purpose(s) is an important undertaking that provides not only a logical starting point, but also the touchstone for evaluating the other measurement decisions throughout the process of assessment development, administration, and interpretation of results.

Page 29 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

The purposes of international assessments are manifold. The purpose that attracts the most attention in the press is the horse race aspect of the studies, that is, the tendency to report the relative standing of country average total test scores. Although it is recognized that international competition inevitably draws “attention of policymakers and the general public to what has been referred to as the ‘Olympic Games’ aspect of the research” (Husen, 1987, p. 131), researchers associated with the conduct of studies under the auspices of the International Association for the Evaluation of Educational Achievement (IEA) have consistently argued that there are many other purposes that are more important than the horse race comparisons.

Mislevy (1995) began his discussion of purposes of international assessments as follows: “In the broadest sense, international assessment is meant to gather information about schooling in a number of countries and somehow use it to improve students learning” (p. 419). In keeping with this broad purpose, the introduction to the report of middle school mathematics results for the Third International Mathematics and Science Study (TIMSS) gives the following statement of purpose: “The main purpose of TIMSS was to focus on educational policies, practices, and outcomes in order to enhance mathematics and science learning within and across systems of education” (Beaton et al., 1996b, p. 7). An implicit assumption is that comparisons of student performances for different countries will contribute toward this end in some way. Otherwise, there would be no need for the involvement of countries other than one’s own in the assessment. Thus, it is not surprising that comparing achievement in a specified subject or subjects across countries is a purpose that is common to all of the international studies of achievement.

The objective of comparing relative achievement of students at a target age or grade level by country and subject immediately raises a host of measurement questions. At the most general level, there is the question of whether to limit the measurement domain to the intersection of the content coverage intended by the curricula of participating countries or having it encompass the union of content covered. Or should the domain boundaries fall somewhere between those extremes (Linn, 1988; Porter, 1991)? The union is almost surely too expansive to be practical, while the intersection would restrict the coverage to an unreasonable degree. Hence, the domains defined for international assessments have negotiated limits that fall between the extremes. Once the boundaries have been agreed on, questions remain about the relative emphasis to be given to topics within the domain, about the relative importance of different levels of cognitive demands of the assessment tasks within each topic, about the length of the assessment, and about the mix of item types.

Page 30 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

The comparative results obtained on an assessment depend on the degree to which the assessment reflects the curriculum and instruction of the groups of students whose performance is being compared (Linn, 1988; Linn & Baker, 1995; Porter, 1991). In any evaluation of educational programs, “if a test does not correspond to important program goals, the evaluation will be considered unfair” (Linn, 1987, p. 6). This is true for assessments within a nation, but becomes critically important in considering comparisons of performance of nations because there are such large differences between countries in curriculum and instructional emphases. For individual countries the fairness of the assessment necessarily varies as a function of the degree of correspondence between each country’s curriculum and the content boundaries and the relative emphasis given to covered topics of the assessment.

SPECIFICATIONS

The particulars of the definition of the domain can have a significant impact on the relative position of nations on the assessment. Heavy weight given to one subdomain can advantage some nations and disadvantage others. Multiple-choice formats familiar to students in some nations may be less so to students in others. Conversely, extended-answer problems are standard fare for students in some nations, but not for students in all nations participating in the study. As Mislevy (1995, p. 423) has noted, “The validity of comparing students’ capabilities from their performance on standard tasks erodes when the tasks are less related to the experience of some of the students.” Because of the sensitivity of the relative performance of nations to the details of the specification of the assessments, considerable effort must go into negotiating the details of the specifications and to review and signoff on the actual items administered.

Messick (1989, p. 65) has noted that

[I]ssues of content relevance and representativeness arise in connection with both the construction and the application of tests. In the former instance, content relevance and representativeness are central to the delineation of test specifications as a blueprint to guide test development. In the latter instance, they are critical to the evaluation of a test for its appropriateness for a specific applied purpose.

Details of the approaches used to develop specifications for the assessments have varied somewhat in previous international assessments, but the general nature of the approaches have had a great deal in common. Generally, the approach has been to define a two-way table of specifications, beginning with one dimension defined by content. The topic and subtopic grain size has varied considerably, due in part to the subject

Page 31 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

matter of the assessment and the grade level; it also has varied from one assessment to the next within the same subject area. The First International Mathematics Study (FIMS) placed the 174 items used across the different age populations assessed into one of 14 topics, ranging from basic arithmetic to calculus (Thorndike, 1967, p. 105). The content dimension was primary, and considerable effort went into defining the topics and obtaining items for them. Despite the emphasis on content, some reviewers of the FIMS results (e.g., Freudenthal, 1975) were sharply critical of the assessments for what was seen as an overemphasis on psychometrics and a lack of involvement of subject-matter experts who were familiar with curricula and teaching practices in the participating countries.

In the Second International Mathematics Study (SIMS), the main emphasis continued to be placed on content categories, but there was substantially greater involvement of mathematics educators and much greater salience was given to the mathematics curricula of the participating countries. SIMS maintained links to FIMS by including a sizable fraction of items from FIMS, but used a different set of topical categories. SIMS had 133 content categories under five broad topics (arithmetic, algebra, geometry, probability and statistics, and measurement) for the eighth-grade population and 150 content categories under nine broad topics for the twelfth-grade population (Romberg, 1985, p. 9). Other international studies have divided the content domain using fewer broad topic areas.

A rather different approach was taken in the International Assessment of Educational Progress (IAEP) studies conducted by the Educational Testing Service (Lapointe, Askew, & Mead, 1992; Lapointe, Mead, & Askew 1992), using frameworks more in keeping with the ones developed for NAEP. In mathematics for 9- and 13-year-olds, the IAEP framework had five broad content categories. Those content categories were crossed with three cognitive process categories to yield a framework with the 15 cells shown in Table 2-1. The broad categories used by IAEP stand in sharp contrast to the fine-grained breakdown in SIMS.

The TIMSS assessments also were based on tables of specifications with characteristics that had some similarity to the frameworks used in the IAEP studies, but had greater specificity of content. For example, the eighth-grade science assessment had eight broad content areas (earth sciences; life sciences; physical sciences; science; technology and mathematics; environmental issues; nature of science; and science and other disciplines). Those categories were crossed with five cognitive process categories called performance expectations in the TIMSS reports (understanding; theorizing, analyzing, and solving problems; using tools, routine procedures, and science processes; investigating the natural world; and communicating) (Beaton et al., 1996a, p. A-6). Finer breakdowns of

Page 32 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

TABLE 2-1 IAEP Mathematics Framework for 9- and 13-Year-Olds

	Numbers and Operations	Measurement	Geometry	Data Analysis, Statistics, and Probability	Algebra and Functions
Conceptual understanding
Procedural knowledge
Problem solving
SOURCE: Based on Educational Testing Service (1991, p. 13).

content also were available and used for some analyses. For example, Schmidt, Raizen, Britton, Bianchi, and Wolfe (1997) reported results for 17 science content areas.

In contrast to the relatively fine breakdown of content categories in mathematics and science, the IEA study of reading literacy identified three major domains or types of reading literacy materials: narrative prose (“texts in which the writer’s aim is to tell a story—whether fact or fiction”), expository prose (“texts designed to describe, explain, or otherwise convey factual information or opinion to the reader”), and documents (“structured displays presented in the form of charts, tables, maps, graphs, lists or sets of instructions”) (Elley, 1992, p. 4).

In addition to variation from one international study to another in the grain size used in the specification of content, there is variation among content categories within a single study. Mesa and Kilpatrick (1998) commented on the lack of uniformity across topics. The lack of uniformity was acknowledged for TIMSS by Schmidt, McKnight, and Raizen (1997, p. 128) as follows: “No claim is made that the ‘grain size’—the level of specificity for each aspect’s categories—is the same throughout the framework.” Mesa and Kilpatrick (1998, p. 8) argue that the lack of uniformity of grain size is problematic, noting, for example, that this means “[s]mallgrained topics such as properties of whole number operations are counted on a par with large-grained topics such as patterns, relations, and functions. Such variation in grain size can result in disproportionate numbers of items for some clusters of topics relative to the intended emphasis in relation to the whole content domain.”

The content domains for the international studies have been defined in practice to be somewhere between the intersection and the union of the content domains covered by the curricula of the participating countries, but are closer to the intersection than the union. Because of the promi-

Page 33 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

nence of English-speaking countries, especially the United States, in contributing items to the pools of items contributed or developed to make up assessments in line with the specifications, there appears to be a better match to the curricula of English-speaking countries than to the curricula of countries with different languages.

COGNITIVE PROCESSES

As noted, the content dimension of test specification tables has been primary in international assessments. The second dimension of the framework or table of specifications for the assessments generally has focused on the cognitive processes those items or assessment tasks are intended to measure. The well-known breakdown of tasks into six major categories of performance (knowledge, comprehension, application, analysis, synthesis, and evaluation) in Bloom’s (1956) taxonomy of educational objectives illustrates one approach to specifying distinct categories of cognitive processes that have been applied to a variety of content domains. The rows of Table 2-1 illustrate another formulation of process categories. For FIMS a table of specifications crossed mathematical topics (e.g., basic arithmetic and elementary algebra) with the following five intellectual process categories:

Knowledge and information: definitions, notation, concepts.
Techniques and skills: solutions.
Translation of data into symbols or schema and vice versa.
Comprehension: capacity to analyze problems to follow reasoning.
Inventiveness: reasoning creatively with mathematics (Thorndike, 1967, p. 94).

In a similar vein, the First International Science Study (FISS) crossed a content dimension with a “behavioral objectives dimension consisting of four categories: information, comprehension, application, higher processes” (Comber & Keeves, 1973). The Second International Science Study (SISS) used a substantial number of items (nearly half the total) from FISS and supplemented those items with new items for SISS that were categorized into just three of the four behavioral objectives used in FISS (the higher order process category was not used for the new items) (Keeves, 1992). As was true of the contrast of the first and second mathematics studies, the second science study placed greater emphasis on the curricula of the participating countries than had been done in the first science study. Items for the test were selected not from the most common topics, but rather based on the emphasis of topics in each country as defined in the country’s intended curriculum.

Page 34 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

For SIMS the second dimension was called the “behaviors dimension” and distinguished “four levels of cognitive complexity expected of students—computation, comprehension, application, and analysis” (Romberg, 1985, p. 9). As Romberg notes, the four levels used in SIMS mapped partially, albeit imperfectly, into the Bloom categories. The second dimension for TIMSS made similar distinctions, but was referred to as the “expectations” dimension. As is true of the IAEP process categories in Table 2-1, more recent specification tables have moved farther away from the Bloom taxonomic categories. In the TIMSS mathematics assessment, for example, four categories of expectations or cognitive processes were distinguished: knowing, performing routine procedures, using complex procedures, and solving problems (Beaton et al., 1996b, p. A-7). In science the TIMSS performance expectations dimension consisted of five categories: understanding simple information; understanding complex information; theorizing, analyzing, and solving problems; using tools, routine procedures, and science processes; and investigating the natural world (Beaton et al., 1996a).

The specification of topics of the content domain involves judgments of subject-matter experts that have been informed in international studies by cross-national curriculum analysis. Agreements require negotiated compromises between desires to be comprehensive in coverage, the goal of fairly assessing the achieved curriculum of all participating countries, and issues of feasibility. Determining whether an item fits a content topic area is relatively straightforward once the topics have been defined. Determining the types of cognitive processes required to answer an item is far less straightforward. There is widespread agreement that assessments should tap more than simple knowledge of facts and procedures. The assessment also should measure a student’s ability to apply knowledge, skills, and concepts to solve problems and communicate in academic and nonacademic problem settings. Furthermore, it should measure the ability to communicate concepts, make connections, provide explanations appropriate for the subject matter, interpret findings, and evaluate problem solutions and arguments (Glaser, Linn, & Bohrnstedt, 1997). Measuring such higher order cognitive processes and achievement outcomes is more challenging than measuring factual knowledge and skills at applying routine algorithms. The fact that nearly any test development effort that solicits items from a broad range of subject-matter experts, as has been done in the IEA studies, will find an overabundance of items is symptomatic of the greater difficulty in writing items that will tap the higher level problem solving, analysis, explanation, and interpretation skills sought for the assessments.

Although, as will be described, considerable effort has gone into the development of items that do more than measure factual knowledge and

Page 35 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

low-level skills, critics continue to fault international assessments for falling short of the goal of measuring more complex understanding and problem-solving skills. For example, “Jan de Lange. . . argued that the TIMSS items are primarily useful for testing low-level knowledge and do not necessarily represent anyone’s idea of a desirable curriculum” (National Research Council [NRC], 1997, p. 17). The criticism by de Lange is due, at least in part, to a preference for assessments that present students with substantial problems requiring multiple steps to solve and that allow for a variety of solution paths and, sometimes, multiple solutions that would be judged to be of high quality. It is also based on a belief that the multiple-choice and short-answer formats used in the international assessments can only measure factual knowledge and low-level skills. More will be said about that in the following section, but here it is worth recalling an observation made by Thorndike (1967, p. 96) in his chapter describing the FIMS tests.

Time limitations together with the need to sample widely from the content of mathematics dictated another decision. It was agreed, somewhat reluctantly, that it would be necessary to keep the single problems brief. Much as one might like to explore the students’ ability to work through an involved sequence of steps, or develop a complex proof, this seemed impossible. Such a task would exhaust too large (and too variable) a fraction of the limited time that was available.

ITEM FORMATS

The criticism of international assessment on the grounds that they assess only relatively low-level cognitive processes reflects, in part, the difficulty of writing items that tap higher level skills and understanding. Many, like de Lange, would argue that the multiple-choice item formats that are most used in international assessments make it infeasible to assess some of the more complex cognitive processes that correspond to ambitious curriculum aspirations. Multiple-choice and short-answer items are obviously efficient and make it possible to assess a wide range of content in a relatively short period of time. Such items are quite effective at measuring knowledge of fact, procedures, and concepts. Skilled item writers also can and do use these formats effectively to measure understanding and the ability to apply concepts and procedures to solve problems, provide explanations, interpret, and evaluate findings or arguments. There are limits to these formats, however. Nonetheless, for the reasons articulated by Thorndike, multiple-choice has been the dominant item format, supplemented by some short-answer and a smaller number of

Page 36 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

extended-response items because of considerations of efficiency and feasibility.

The reliance on multiple-choice items was a target of criticism of several of the international studies prior to TIMSS. Stedman (1994), for example, provided the following critique of the earlier science assessments. “In most assessments science has been tested solely with multiple-choice items (Keeves, 1992, p. 59; Lapointe, Mead, & Phillips, 1989, p. 83). This seems particularly inappropriate given how essential experimentation is in science, although it may measure how well students have learned basic curricular facts” (p. 26).

Considerable effort went into expanding the types of items used in TIMSS, the most recent of the IEA studies. Roughly 300 constructed-response items were included across the subject areas and populations assessed in TIMSS. Approximately one-third of the assessment administration time was allocated for responding to the constructed-response items (Mullis & Smith, 1996). In terms of number of items, multiple-choice items were still dominant in TIMSS. TIMSS assessed three populations of students, 9-year-olds, 13-year-olds, and students in the last year of secondary school. Students in the two grades where most students of the target age were enrolled were assessed for the two younger populations. Students in the last year of secondary school were broken down into a mathematics and science literacy subpopulation and subpopulations of students taking advanced mathematics and taking an advanced physics course. Table 2-2 displays the number of items on the TIMSS assessments by item type for each population and subject.

The preponderance of multiple-choice items is evident in Table 2-2. The short-answer and extended-response items had more than the two score points of a multiple-choice item. Even if the score points of short-answer and extended-response items are taken into account, the multiple-

TABLE 2-2 TIMSS Number of Items by Item Type

Population	Subject	Multiple Choice	Short Answer	Extended Response	Total
9-year-olds	Mathematics	79	15	8	102
	Science	74	13	10	97
13-year-olds	Mathematics	125	19	7	151
	Science	102	22	11	135
Last year literacy	Math and science	52	17	7	76
Advanced	Mathematics	47	10	8	65
Advanced	Physics	42	15	8	65
SOURCE: Based on Adams and Gonzalez (1996).

Page 37 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

choice items still dominate the assessments. For example, for the 9-year-olds in mathematics, the total number of possible points was 116, 79 of which were based on the multiple-choice items. Only the physics and advanced mathematics assessments for the last year of secondary school population had close to half their points from short-answer and extended-response items. The total points for physics was 81, 39 of which came from the 23 short-answer and extended-response items. The corresponding numbers for advanced mathematics were 82 and 35.

Efficiency of the assessment is enhanced not only by the fact that multiple-choice items can be machine scored, but also by the allocation of time for these items. For all but the advanced students in the oldest population, one minute per item was assumed for the multiple-choice items (three minutes per item was assumed for advanced students). The short-answer items were not much more time consuming, with one minute per item again assumed for the 9-year-olds, two minutes per item for the 13-year-olds and the oldest population in the mathematics and science literacy subpopulation, and three minutes per item for the advanced subpopulations. Even the extended-response items were restricted in amount of testing time assumed—three minutes per item for the 9-year-olds and five minutes per item for all others.

The international study of reading literacy (Elley, 1992) also had tests that were dominated by multiple-choice items. Of the 66 items on the 9-year-olds’ test, there were four short-answer items and two items requiring a paragraph-length answer. The test for the 14-year-olds had 20 short-answer items and two items requiring a paragraph-length answer out of the 89 items on that test (Elley, 1992, p. 5).

Measuring some of the higher order cognitive processes that the IEA studies have aspired to measure is exceedingly difficult given the amount of time allowed per item. Measuring some of the aspects of understanding and problem solving—such as problem identification and representation, the use of comprehension and problem-solving strategies, and the development of coherent explanations and interpretations (Glaser, Linn, & Bohrnstedt, 1997)—requires the use of a wider array of extended-answer items and performance assessment tasks. Finding a good balance between the needs for efficiency and the desire to measure a full range of cognitive processes poses a continuing challenge for international assessments.

CURRICULUM AND INSTRUCTIONAL MATERIALS AND DEFINITION OF TEST CONTENT

One of the prominent features of international studies of achievement conducted under the auspices of the IEA has been the emphasis on both

Page 38 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

the intended and the implemented curricula in participating countries. Emphasis on the curricula of participating countries was present, albeit only to a modest degree, in the earliest IEA studies. This emphasis on the match to what is taught in participating countries was behind the effort made in FIMS, that is, to get countries to identify the topics that the assessment was expected to cover that were taught to students in the populations to be assessed.

National centers rated items as to universality of inclusion in its education system using the following categories:

U	Universal, i.e., the topic is taught or assumed by all types of schools at this level.
R	Restricted, i.e., the topic is taught only in certain types of schools or courses.
E	Experimental, i.e., the topic is not normally taught in any part of the system up to this level, but occurs sporadically in an experimental program.
N	Nil, i.e., the topic is not taught at all in the educational system at this level, and is not assumed as known from previous teaching (Thorndike, 1967, p. 95).

Using this system of ratings for universality of topic coverage for ten countries that provided ratings, it was found, for example, that all ten countries returned a rating of “U” for a 13-year-old’s “ability to carry out simple operations involving simple vulgar fractions.” “Notions of square roots,” on the other hand, received a rating of “U” from four countries, an “R” from one country, and an “N” from the remaining five countries. “The theorem of Pythagoras for solving simple practical problems” had three “U” ratings, one “R” rating, and six “N” ratings (Husen, 1967, pp. 284-286).

The emphasis on relevance to the curricula of participating countries is also apparent in the SISS report, where the basis for the test development is described by Postlethwaite and Wiley (1992, p. 49) as follows:

The tests were constructed on the basis of the common intended curriculum in all of the participating countries. The intended curriculum is that content which is included in national or state syllabi, the major science textbooks used by students and—where applicable—national examinations. A first analysis was conducted in the late 1960s for the First IEA Science Study. This was repeated for the second science study. Sufficient items had to be the same for the first and second studies to allow comparisons between the two times of testing.

The centrality of curriculum has, if anything, increased in the more recent studies, culminating in the highly elaborate data collections and

Page 39 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

analyses conducted as part of TIMSS (Schmidt, McKnight, & Raizen, 1997; Schmidt, McKnight, Valverde, Houang, & Wiley, 1997; Schmidt, Raizen, Britton, Bianchi & Wolfe, 1997). The TIMSS curriculum studies included analyses of textbooks and curriculum guides for approximately 50 countries. These analyses revealed considerable between-country variation in topic coverage and relative emphasis. Although the curriculum analyses were used in developing the TIMSS assessment, the wide range in topics and great variation in coverage and emphasis precluded the development of an assessment that would provide comprehensive coverage of the union of topics found in the curricula of all the countries. An assessment with reasonable depth covering all the topics of the union would have required collections of items several times as large as the ones administered by TIMSS. Feasibility considerations forced a more limited scope. Preliminary results of the curriculum analyses were used to guide the development of the initial frameworks used to classify items and to specify the content domain to be assessed in TIMSS. Considerable effort was required to negotiate refinements in the framework, where it could be used as a detailed table of specifications for the assessment, giving topics and intended cognitive processes to be measured as well as numbers of items in different categories to reflect negotiated agreements of coverage and emphasis.

Important distinctions have been made in the IEA studies between the explicit and implicit goals of a nation’s curriculum, known as the “intended curriculum,” and the content that is actually taught, known as the “implemented curriculum” (e.g., Schmidt & McKnight, 1995). The distinction between intended and implemented curriculum is relevant to specifications of the content for an assessment to be used in international studies and the valid interpretation of between-country differences in student achievement.

The intended curriculum can be characterized by review of official curriculum guides, analysis of widely used textbooks, and reviews by subject-matter experts from each country, all of which can be used in developing test specifications and in writing and selecting items. The implemented curriculum cannot be defined with as much specificity prior to test development and collection of data because it is defined by teacher ratings of topics taught and student ratings of opportunity to learn (OTL) generally obtained at the time the achievement test data are collected. The distinction between the intended and the implemented curriculum has been a prominent feature of the IEA studies. The implemented curriculum has been determined by using teacher estimates of the percentage of their students who would answer an item correctly without guessing. Teachers also have been asked to provide OTL ratings, which identified whether their students had been taught the material needed to answer

Page 40 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

each item correctly, or whether that material had been reviewed, was taught in prior years, would be taught in subsequent years, or would not be taught. OTL ratings have been found to be a strong predictor of student achievement on the assessments (the attained curriculum). One might question, however, the degree to which OTL ratings are pure measures of the implemented curriculum, or are contaminated to some degree as reflections of informal predictions of teachers of how students will respond to particular items. Nonetheless, the IEA studies have had an important influence on the development of the concept of opportunity to learn and approaches to measuring it (see Floden, this volume).

CONSENSUS AND SIGNOFF BY PARTICIPATING NATIONS

An international consensus needs to be developed among participating nations for any international assessment. Consensus is needed about the tables of specifications, including agreements about item types, and relative emphasis to be given to categories in each of the dimensions of the table of specifications. Consensus is also needed about the acceptability of individual items that are the instantiation of the cells of the table of specifications. Although relatively large pools of items can be assembled drawing on previous international assessments and on items contributed by participating nations based on their own national assessments or written specifically for the international assessment, the quality of the items and the distribution relative to the requirements of the specifications are more problematic.

SISS sent a matrix of science topics by teaching objectives to national centers with a request for items to measure the cells of the matrix. Although approximately 2,000 items were returned, the quality and appropriateness of the items were quite uneven, many items could not be scored objectively, and relatively few items were judged to be applicable as measures of the higher order cognitive processes (Comber & Keeves, 1973, p. 20). Contributions of potential items were also far from uniform across countries.

The uneven distribution has been characteristic of all the international assessments. FIMS, for example, initially obtained items for the item pool from only five countries, Japan, the Netherlands, Scotland, Sweden, and the United States. Moreover, the assembly of the pool of items that were field tested depended to a very great extent on items from only one country, the United States. “Relying heavily on the stock of items made available by the Educational Testing Service and items from test files at the University of Chicago Examiner’s Office and items written specifically for the test a pool of 640 items was assembled” (Thorndike, 1967, p. 98).

Page 41 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

The influential role of the United States in contributing items for the assessments is a feature of most of the international assessments. As noted earlier, the IAEP studies used the framework from NAEP to categorize items. Although more than half of the countries participating submitted some items, the IAEP studies also made use of items from NAEP. The reliance on the United States for a disproportionate share of the items in international assessments is undoubtedly expedient. Indeed, such reliance may be the only way to meet the challenges of the creation of an item pool from which the items for the assessments could be selected and assembled into viable tests. This is so, in part, because the United States has a much larger national infrastructure for the development and refinement of tests at the national level than any other country. It is also the case that the United States is relatively unique in the extent of its use of multiple-choice test items, and as will be discussed, this is the format that has been used most in international assessments due to considerations of cost, breadth of coverage, and efficiency.

The dominant role of the United States and, to a lesser extent, Canada in the contribution of items to the assessments raises a question about the degree to which this introduces a North American or U.S. bias into the assessments. There is reason to believe that some bias exists in favor of North America in general, and the United States in particular, because of the disproportionate impact on the item pools used to construct the assessment. The bias is likely to be exaggerated by the heavy reliance on multiple-choice items, a format that is more familiar to students in North America than in many other countries. Although these considerations may have enhanced the relative performance of students in North America, it is impossible to know how much, if any, difference this potential bias had on the actual results of any of the international assessments. Certainly the review and approval of the content domains and the assessment specifications by participating countries were intended to minimize any such bias.

The following description by Garden and Orpwood (1996, pp. 2-3) illustrates the difficulties encountered at the item assembly and identification stage:

Although large pools of items had been assembled, a disproportionate number were found to assess computation, recall, or simple application in limited content areas. For some content areas an adequate number of potentially good items were available, but for others there were too few items of good quality. Also, because most items had been written for use within particular countries, the panel had to reject many for use in TIMSS because of cultural bias, or because translation was likely to lead to ambiguity or misunderstanding.

Page 42 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

Extensive reviews of items by international panels of subject-matter and measurement experts as well as by designated panels within each participating country are needed to assure proper coverage of all cells of the table of specifications and to identify potential shortcomings due to lack of appropriateness for different cultures or likely difficulties in translation. Systematic reviews by panels of experts from participating nations can provide much-needed information to international committees making the final selection. Procedures used in TIMSS, for example, included obtaining ratings from national committees of each item on 1-to-4 scales for each of four characteristics: the extent to which the content of the item was taught and emphasized in the country, the familiarity of the implied approach to teaching by the item, the proportion of students in the country who would answer the item correctly, and overall quality independent of the appropriateness for the country’s curriculum (Garden & Orpwood, 1996).

Selection of items for field testing can be guided by ratings from national committees. For example, a criterion might be established that the item was considered appropriate for the curriculum of at least X percent of the countries and that fewer than Y percent of the countries recommend deletion of the item. Once a tentative pool of items is agreed on, they can be assembled into booklets for field testing, a stage that is critical in the evaluation of item quality for any assessment.

TRANSLATION

Before test items for an international assessment can be evaluated by representatives of participating countries, much less be field tested, they must be translated from the language in which the item was originally written into all the languages needed for use of the assessment in the participating countries. Test translation is a demanding enterprise. It is well known that student performance on an item can vary greatly as the result of seemingly minor edits within the language in which the item was originally written. This problem of sensitivity of results to subtle changes in items is compounded when items have to be translated from one language to another. Consideration of the many issues that arise in test translation, particularly where direct comparisons of achievement are desired across versions of the test in different languages, as is required for international studies of achievement, is beyond the scope of this paper. Fortunately, a separate chapter has been prepared by Hambleton (this volume) on the complex issues of translating and adapting tests of comparative international studies of educational achievement, and the reader interested in this topic is referred to that chapter.

Page 43 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

FIELD TESTING

Although careful editing by subject matter and measurement experts can detect many problems with items, there is no substitute for actual tryout of the items with students from the populations that will take the operational assessments. Ambiguities in the items, or problems with answer keys or with incorrect options on multiple-choice questions, often defy detection despite numerous rounds of editing and review by measurement and subject-matter experts. For short-answer and extended-response items, the scoring rubric is an essential aspect of the item, one that can be best evaluated and refined in the context of a field trial. Only after student performance on items is obtained during field testing do some item flaws or problems with scoring rubrics become apparent. Field-test data also provide estimates of item statistics that can be useful in the final selection of items for the operational test forms. Even if one takes a criterion-referenced view of test construction that would not rely on item discrimination and item difficulty statistics, item analyses can be quite useful in identifying flawed items and items that display differential item functioning for students from different countries or for male and female students. Although other characteristics such as socioeconomic status or community characteristics also might have been of interest, they are difficult to define in a uniform way across countries. Item difficulty and item discrimination statistics are arguably useful even from a criterion-referenced perspective, if for no other reason than one of efficiency. It is simply not very informative to administer items that almost no one can answer correctly or that are so easy that essentially everyone answers them correctly. In a similar vein, items that do not discriminate or that have negative discrimination will not make useful contributions to the measurement on the main dimensions used for reporting results.

Consequently, the more recent international studies generally have established certain guidelines for field-test item statistics. Items that do not meet the guidelines may be flagged as a caution to those constructing the assessments to use those items only if necessary to fulfill the requirements of the test specifications. If the item pool is large enough, items that do not meet the guidelines may be eliminated from further consideration. The earliest IEA studies also made use of item difficulty and discrimination statistics, but the detailed guidelines on the use to be made of the statistics apparently were not used. Rather, item difficulties and discrimination indices for each country participating in the field tests were presented to the committee responsible for assembling the operational forms, but it is unclear what use the committee made of those statistics, if any, in selecting items for the operational forms (see, for example, Thorndike, 1967, pp. 101-103).

Page 44 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

The role of item statistics was more explicit for TIMSS. TIMSS set the criteria that the items have a proportion correct on multiple-choice items of at least 0.25 for four-choice items and 0.20 for five-choice items. Items were also required to have positive point-biserial correlations with the total score for the correct answer and negative point-biserial correlations with the incorrect options (Garden & Orpwood, 1996). “With few exceptions, the selected items had mean field trial p-values between 0.3 and 0.8, discrimination indices (point-biserial correlation between item and booklet scores) above 0.3, and mean review ratings above 2.5 in each of the four review categories” (coverage, the extent the item was taught and emphasized in each country; familiarity; difficulty; and appeal—each rated on a 1-to-4 score by the national committees) (Garden & Orpwood, 1996, pp. 2-16).

Field testing and the extensive review of items by measurement and subject-matter experts have resulted in collections of items that are of relatively high quality and free of major faults. Of course, as has been noted previously, the item sets have been criticized for not doing a better job of measuring higher cognitive processes, and for limitations of coverage of the curricula of participating countries. A few problematic items also have been identified that count as correct responses that are either incorrect or not as good as an alternative response that is treated as incorrect. An extreme example of the former problem is a mathematics items on TIMSS that was identified by Wang (1998b). The item is “Find x if 10x – 15 = x + 20” (Lie, Taylor, & Harmon, 1996, p. 10). The answer clearly is that x = 35/9, but the response that is keyed according to Wang (1998b), who relied on the report of Lie et al. as correct, is 7. An error like this on such a simple problem could only be a misprint. Indeed, a check of the grades 7 and 8 released mathematics items show that item L-16 is: “10x – 15 = 5x + 20.” And, the correct answer is indeed 7. The multiplicative constant of 5 was mistakenly left off on the right-hand side of the equation in Lie et al., but the actual item was correctly scored. Some other problems identified by Wang (1998a, 1998b), however, are not so easily explained. They reveal the difficulty in assuring the answers keyed to receive full credit are in fact correct for students who know more than the typical student or who think deeply about the problem. One such problem identified by Wang (1998b) is the following: “A glass of water with ice cubes in it has a mass of 300 grams. What will the mass be immediately after the ice has melted? Explain your answer” (Lie et al., 1996, p. 11). The keyed answer is 300 grams, supported by an explanation such as “The ice changes into the same amount of water.” As Wang notes, this answer and explanation is correct if evaporation is ignored, but if a student takes evaporation into account, an answer of “Less than 300 grams” is clearly defensible, but would not receive credit.

Page 45 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

Another problematic item asks 9-year-old students to explain how the sun and the moon can appear to be about the same size when the sun is much larger than the moon. The keyed response is that the sun is farther away. The response that the sun is higher than the moon is not credited, even though, as Wang (1998a) argues, higher and farther may be used interchangeably by 9-year-olds to convey the same level of understanding of the scientific principle in question. For a few additional examples of problematic items, see Wang (1998a, 1998b). Although questionable items such as the ones identified by Wang represent only a tiny fraction of all the items—not enough to seriously affect the overall validity of the assessment—they illustrate the difficulty of developing items and associated scoring rubrics that are unambiguous and unassailable.

CULTURAL BIAS AND DIFFERENTIAL ITEM FUNCTIONING

All of the international studies have attempted to address problems created by items that put members of one country at a disadvantage as the result of cultural differences in the ways in which particular items might be interpreted. Distinguishing between factors that are the result of cultural differences across countries and those that are due to differences in instructional practices or student opportunity to learn the content of the item is, of course, a difficult and imperfect undertaking. The primary way in which the IEA studies have approached the question of cultural bias in items is through the expert judgment of national committees that reviewed the items. Procedures used to ensure the quality of translations and adaptations needed for items to make them appropriate in different language and cultural contexts were also an important part of the way in which issues of potential cultural bias were addressed.

The most recent studies conducted by IAEP and IEA also have included some statistical analyses of the item responses as a means of flagging items that might be judged to be problematic. The IAEP studies included differential item functioning (DIF) analyses using an omnibus statistic based on Mantel-Haenzel statistics that was developed by Johnson (1992). As Johnson describes, DIF statistics were computed for each item using the United States as the reference group separately for each other participating country, then results were combined into the omnibus statistic. The latter statistic was then used to identify items that were outliers and therefore likely to be problematic. Using this approach, only three items—one age nine mathematics item, one age 13 mathematics item, and one age 13 science item—were identified as outliers with very high DIF (Bertrand, Dupuis, Johnson, Blais, & Jones, 1992). Although the statistical sophistication and magnitude of the analysis are quite impressive, the return for the effort appears quite meager if one expects the

Page 46 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

approach to be a valuable tool in the identification of items that should be discarded from consideration because of cultural bias. On the other hand, the analyses provide some reassurance that the items generally function in a similar fashion for students from different countries.

TIMSS computed within-country item statistics of various kinds and used them to identify items that might be problematic for particular countries. In addition to the usual within-country difficulty and discrimination statistics, multiple-choice items were flagged if an incorrect option had positive point-biserial correlations with the total score, or if an item had a poor Rasch fit statistic. Item-by-country interaction statistics indicating that an item was easier or harder than would be expected based on the cross-country item difficulty and the overall performance of students from the country were computed. Items with statistically significant item-by-country interactions were flagged (Mullis & Martin, 1998).

The use of item flags as was done in either IAEP or TIMSS is useful for calling special attention to items. However, the flag by itself does not necessarily mean that the item is a biased indicator of student achievement in the country in question. Judgmental review is still needed to determine whether it is reasonable to discount the item for a given country or whether the unusual difficulty may simply reflect differences in instruction. Flagged items were reviewed and judged to be satisfactory. They did not play any further role in the analyses or presentation of results.

TEST DESIGN

Traditional tests are designed to provide information about the performance of individual students. For such tests there are substantial advantages in having all students take the same set of items. International assessments, however, are not designed to report scores for individual students. Instead all that is needed is to obtain estimates of performance for large groups of students (e.g., all students at a grade level in the country, or all students within a broad category defined by other variables such as gender, community type, curriculum strand, or race/ethnicity). For such assessments there are great advantages to administering different subsets of items to different subsamples of students. There is always an interest in administering more items than can be administered in a single sitting to any one student. By administering different subsets of items to different subsamples of students, broad coverage can be achieved with a reasonable amount of testing time for each student in the sample.

A number of designs are available for this purpose. Collectively the designs are referred to as matrix sampling designs to denote the simultaneous sampling of both items and students. Designs used in past interna-

Page 47 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

tional assessments include the administration of different test forms to subsamples of students from each population, as was done in FIMS, and the administration of a common core of items to all students together with one of several unique subsets of items, sometimes referred to as rotated forms, to different subsamples, as was done in SIMS and SISS. Yet another variation is to administer two or more blocks of items to students, with blocks administered together in various combinations. The IAEP studies used blocks of items to make up test booklets or forms, albeit only two booklets were used per subject at age nine and just a single booklet was used at age 13. The limited number of items used in IAEP made it unnecessary to have a larger number of booklets in which blocks of items would be placed. However, experience with NAEP has shown that the use of balanced-incomplete-block designs for the allocation of items can be an effective approach to administering larger numbers of items to students while limiting the administration time for any given student.

The use of a common core together with rotated forms is illustrated by SISS (Postlethwaite & Wiley, 1992). Three student populations—10-year-olds, 14-year-olds, and students in the final year of secondary education—were studied in SISS. Variations of a common core and rotated forms were used for each of the three populations. A core test of 24 items and four rotated forms of eight items each were used for the 10-year-olds. The 70 items for the 14-year-olds were divided into a core of 30 items administered to all students and four rotated forms of ten items each. For students in the last year of secondary school, the items were distinguished by subject area (biology, chemistry, or physics). Three rotated forms of 30 items each, consisting of items in one of the three content areas, were administered to subsamples of approximately one-third of the students together with a 26-core form consisting of nine biology items, nine chemistry items, and eight physics items (Postlethwaite & Wiley, 1992, p. 49).

A more complicated matrix sampling design was used in TIMSS. This can be illustrated by a brief description of the overall assessment design for 199 items administered to the 9-year-old population. Each item was placed into one of 26 item clusters. Cluster A was designated the core cluster. It contained a total of five mathematics and five science multiple-choice items. A total of eight separate test booklets, each consisting of cluster A and six of the remaining 25 clusters of items, were administered. Seven of the noncore clusters of items were designated focus clusters. Focus clusters were included in either three or four of the eight booklets, thereby assuring substantial numbers of students for those items. Ten of the remaining clusters were labeled either mathematics breadth or science breadth. Breadth clusters were included in only a single booklet, and hence were administered to only about one-eighth of the sampled students. The remaining eight clusters consisted of either mathematics or

Page 48 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

science free-response items. Each of those clusters was included in two of the eight booklets (Adams & Gonzalez, 1996). The IEA and IAEP studies have made effective use of matrix sampling designs to allow for a broader coverage of content domains than otherwise would have been possible.

SUMMARY SCORES

Results of early international assessments were commonly reported in terms of total number of correct scores or average percentage of correct scores. Such scores are reasonable as long as they are based on a common set of items. With a core and two rotated forms used, for example in SISS, total scores for the core and for the core plus the two rotated forms are readily produced, and with proper sampling weights can be used to produce various descriptive statistics. Though not essential to produce results, the more complicated assessment designs of the more recent IAEP studies (e.g., Educational Testing Service, 1991) and TIMSS have relied on scaling based on item response theory (IRT) in addition to percentage correct scores for summarizing results. The IAEP studies used a three-parameter logistic IRT model (Blais et al., 1992). The one-parameter Rasch IRT model was used in TIMSS (Martin & Kelly, 1996).

IRT models the probability that a given student will answer an item correctly based on a single latent proficiency dimension along which both items and persons are placed by person and item parameters. In the case of the Rasch model, the probability of a correct response is determined by the difference of the location of the person on the dimension (the person’s proficiency) and the location of the item on the same dimension (the item’s difficulty). The dimension or proficiency scale summarizes the achievement level of students and the relative difficulty of items. The three-parameter IRT also has a single-person parameter to locate the person on the proficiency scale, but uses three parameters to characterize items: one for the relative difficulty, one for the discriminating power of the item, and one (the pseudoguessing parameter) to account for the fact that the probability of a correct response on a multiple-choice item is always greater than zero, no matter how low the person’s level of proficiency is.

IRT provides a basis for estimating performance on a common scale even when students are given different subsets of items. This is a great advantage over simple number-right scoring for assessments such as those used in international studies where different students are administered different subsets of items. It means, for example, that performance on rotated forms can be compared on a common scale. Thus, when the assumptions of IRT are met to a reasonable approximation by the item response data obtained for a sample of test takers, proficiency estimates

Page 49 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

for test takers can be defined in relation to the pool of items administered, but do not depend on the particular subset of items taken by a given individual. Moreover, the item statistics do not depend on the particular subsample of test takers who responded to a given item. These two properties often are referred to as “item free ability estimates” and “sample-free or person-free item parameter estimates” (Hambleton, 1989, p. 148).

The assumptions of IRT about the dimensionality of the assessment, local independence (i.e., that test-taker responses to items are statistically independent after proficiency level is taken into account), and the specific mathematical form of the item response function (e.g., the one-parameter Rasch model) are, of course, only approximated in practice. Hence, the properties of item-free proficiency estimates and sample-free item parameter estimates hold only approximately. Nonetheless, IRT, even when a unidimensional model is used, has been found to be relatively robust as long as there is a strong dominant factor or underlying dimension for the set of items.

The TIMSS reports summarized the preference for IRT scaling as follows:

The IRT methodology was preferred for developing comparable estimates of performance for all students since students answered different test items depending upon which of the eight test booklets they received. The IRT analysis provides a common scale in which performance can be compared across countries. In addition to providing a basis for estimating mean achievement, scale scores permit estimates of how students within countries vary and provide information on percentiles of performance. (Beaton et al., 1996a, p. A-27; the quoted summary statement is also included in other TIMSS reports)

There are considerable advantages provided by a scale for which percentiles and variability can be computed. Clearly there is much more to characterizing the achievement of students in a country even if one is satisfied with a single summary dimension than reporting the mean. Earlier treatments of international assessment results could, of course, report information on variability and percentile points for the core set of items taken by all students simply using number-right scores. Reporting results for the full assessment, including rotated forms, was more complicated and involved forms of linking rotated forms that were less theory based (see, for example, Miller & Linn, 1989) and somewhat problematic because the forms were not comparable in difficulty or content coverage.

A single scale provides an overall summary of student achievement within the subject-matter domain of the assessment. Such summary information is useful for making overall comparisons. Policy makers, the media, and the public like the apparent simplicity that is provided by reports

Page 50 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

of results on a single scale. The rank order of the average performance of a nation’s students can be seen at a glance, and comparisons of the highest or lowest performing 5 or 10 percent of a nation’s students to other nations is not much more complicated. For those concerned about curriculum issues, however, the single scale leaves too many questions unanswered. As Mislevy (1995, p. 427) notes, “Because no single index of achievement can tell the full story and each suffers its own limitations, we increase our understanding of how nations compare by increasing the breadth of vision—as Consumers Reports informs us more fully by rating scores of attributes of automobiles.”

As Black (1996, p. 19) argues, with international summary score comparisons of test performance,

. . . like most statistics, what they conceal is as important as what they reveal. This is because pupils’ performances on a particular question depend strongly on the extent to which its demands are familiar to the pupil and on the opportunities the pupil has had to learn about responding to such demands. Thus, the effectiveness of teaching or the commitment of the pupils are only two of several important determinants of test outcomes. The curricula and the inter-related practices of teaching, learning, and testing to which the pupils are accustomed are of equal, arguably greater, importance.

MULTIPLE SCORES

As has been discussed, the international studies of achievement conducted by IEA traditionally have placed considerable emphasis on issues of curriculum differences and student opportunity to learn. Consistent with this emphasis, reports of results generally have included more than reports of performance on a single global score scale. For example, SISS reported results for 10-year-olds for collections of items under the headings of biology, chemistry, earth science, physics, information, comprehension, and application in addition to an overall score (Postlethwaite & Wiley, 1992). Similarly, SIMS reported topic scores for sets and relations, number systems, algebra, geometry, elementary functions/calculus, and probability/statistics (McKnight et al., 1987). Similar breakdowns by content area were given in both the mathematics and science reports for TIMSS.

The relative standing of a country is often quite different when subscores are used than when rankings are based on total scores. This is apparent in TIMSS, for example, where the eighth-grade results for subscores and total were summarized as follows: “In math, the number of countries outperforming the U.S. in the total score is 17. In the subscales, this ranges from 9 in data representation and analysis to 30 in measure-

Page 51 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

ment. In science, the number of countries outperforming the U.S. on the total score is nine. In the subscales this ranges from 1 in environment to 13 in physics” (Jakwerth et al., 1997, p. 11).

The TIMSS reports also allowed national research coordinators to specify subsets of items that were judged to be appropriate for the country’s curriculum. Each national research coordinator was asked to indicate whether items were or were not part of the country’s intended curriculum. Items that were judged to be in the intended curriculum for 50 percent or more of the students in a country were considered appropriate for that country. Such items were then included in the score derived to match that country’s curriculum. Scores were then produced based only on those items. Scores were obtained and summarized not only for the items appropriate for a particular country, but for those appropriate for each of the other countries as well. The number of items judged appropriate by national research coordinators varied substantially by country. For example, on the eighth-grade science test, the number of possible score points for the set of items judged appropriate for an individual country’s curriculum ranged from a low of 58 for Belgium to a high of 146 for Spain (Beaton et al., 1996a, p. B-3).

Even with some wide variability in the number of science items judged to be appropriate for the different countries, the impact on the relative performance of countries was only modest. “[T]he selection of items for the participating countries varied somewhat in average difficulty, ranging from 55–59 percent correct at the eighth grade and from 49–56 percent at the seventh grade. Despite these differences, the overall picture provided . . . reveals that different item selections do not make a major difference in how well countries do relative to each other” (Beaton et al., 1996a, p. B-5). The results for mathematics when items were selected to match the curriculum of each country were quite similar to those for science. The fact that the country-selected item sets did not show greater variability in results is not surprising because all countries selected a sizable fraction of the total set of items and the country-specific sets of items all spanned multiple content areas.

When separate scores are produced for each content area (e.g., algebra and geometry within mathematics or earth science and physics within science), some potentially useful distinctions in relative performance of countries are revealed. For example, in eighth grade, the student average percentage correct in overall mathematics was the same for England and the United States (53 percent correct). England outperformed the United States in geometry (54 versus 48 percent correct) and measurement (50 versus 40 percent correct) whereas the United States did relatively better than England in fractions and number sense (59 versus 54 percent correct) (Beaton et al., 1996b, p. 41). Comparable variations in patterns across con-

Page 52 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

tent areas also occur for other pairs of countries with the same overall performance, and for other grades and for science as well as mathematics. For people concerned with curriculum and instruction, the variations in patterns are more revealing than the comparisons of countries in terms of overall scores.

Still greater variation is obtained in country score patterns when topic scores within content areas are used. Schmidt, McKnight, and Raizen (1997) divided the TIMSS mathematics items into 20 topics using content framework categories. The science items were divided into 17 topics. The range of differences in the topic average percentage-correct scores between the topic with the highest and the one with the lowest average within a country was from 20–55 percent. The relative standing of a country varied greatly across topics. For example, 17 of the 42 countries had an average score among the highest five countries on at least one of the 20 mathematics topics, and 31 of the countries had ranks that fell in at least three quartiles. In science, 30 of the countries had topic area scores that fell in at least three quartiles. The average difference between a country’s maximum and minimum ranks across topic areas was 18 for mathematics and 23 for science (Schmidt, McKnight, & Raizen, 1997). Although there are far too few items to support reliable scores for individual students at the level of 17 or 20 topics, such a fine breakdown does yield useful information at the level of aggregation of countries. Some caution is nonetheless needed in interpreting the results for individual topics, however, because the generalization to other potential items within a topic is questionable due to the small number of items per topic. Hence, the results are probably best thought of as illustrating the degree to which the ranking of a country would be subject to change if much greater emphasis were given to some topics than to others.

SUMMARY AND CONCLUSIONS

The international studies of educational achievement sponsored by the IEA have faced great challenges in producing tests that yield a valid basis of comparing the achievement of students from a wide array of nations. Over the course of more than three decades, the studies have shown great promise and produced better measures with each successive study. Starting from scratch, the FIMS and FISS accomplished remarkable feats to assemble a pool of items that passed muster with national committees of reviewers, and produced reliable measures covering relatively broad content domains. Certainly, those studies were subjected to considerable criticism, mostly about issues other than the quality of the measures, such as the comparability of populations of students in different countries that retained widely variable fractions of the age cohorts, and

Page 53 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

about the quality of the samples of students. There were also, however, some criticisms of the quality of the tests, particularly their relevance to the curricula of different countries and the heavy reliance on multiple-choice items.

The second round of studies made substantial strides to improve the alignment of the tests to the curricula of the participating countries, and went to great lengths to get information about the intended curriculum in each country and to develop measures of opportunity to learn so that the implemented curriculum could be related to the attained curriculum as measured by the SIMS and SISS assessments. Between the second and third round of studies, the IAEP studies were undertaken. Those studies contributed to advances in analytical techniques, using IRT models and conditioning procedures that had proven useful in the context of NAEP. The IAEP studies also introduced differential item functioning techniques as an approach to flagging items deserving closer scrutiny.

TIMSS benefitted from the experience and advances made in the earlier studies. It also moved ahead substantially on several measurement fronts. The analysis of the intended curriculum of participating countries was more sophisticated and complete than anything that came before it. That analysis provided a solid basis for the construction of assessments with a high likelihood of being relevant, valid, and fair to the countries involved in the study. Greater use of short-answer and extended-response items was made. The use of more sophisticated matrix sampling procedures made it possible to achieve broader coverage of the content domains within the constraints of the administration time allowed for each student. Item response theory provided an effective way of producing scores across the whole set of items within a content area. Analyses of subscores for broad topics such as geometry as well as for narrower subsets of items allowed researchers to convey the idea that achievement is more than an overall total score and is better understood in terms of the peaks and valleys that are found for every country. Those patterns of performance and their relationships to curricular emphasis are more likely to suggest educational policy to improve achievement than simple comparisons in terms of total scores.

Another innovation of TIMSS was the provision of an opportunity for countries to choose the items that were most appropriate for their particular curriculum. Results based on the country-specific selections of items, while not greatly different from those for the total scores, provided a basis for testing how different the results would be if the assessment were more closely tailored to fit the curriculum of a particular country. Finally, the within-country item analyses and item-by-country interaction analyses resulted in an effective means of flagging items in need of more careful consideration for individual countries.

Page 54 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

Mathematics assessments have been used to illustrate many of the points in this chapter. Although the details vary from subject to subject in considerations such as the relative reliability of results and the details of topic coverage within a domain, the major points apply to subjects other than mathematics.

All in all, the quality of the measurement of achievement is one of the greatest strengths of the IEA studies. Certainly there is room for improvement, but the assessments bear up well under close scrutiny. Improvements that can be anticipated for future international studies are likely to depend on advances in the technology of testing. Computer-based test administration may be too futuristic to consider in the short run for an international study. However, the use of computers has considerable appeal as a means of enabling the administration of problems that are not feasible to administer or score efficiently in a paper-and-pencil mode (see, for example, Duran, 2000; NRC, 1999).

Other suggested improvements are more incremental in nature. The curriculum analyses that were conducted for TIMSS represent a valuable, albeit aging, resource. Those analyses might be put to good use to review, and possibly revise, the specifications for the assessments. Even in the face of changes in the curricula of participating countries between the time of TIMSS and the time of a new study, the curriculum frameworks provide a broad representation of the topics in the domains of mathematics and science that could be an empirical basis for structuring a table of specifications.

REFERENCES

Adams, R. J., & Gonzalez, E. J. (1996). The TIMSS test design. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development (Chapter 3, pp. 1-36). Chestnut Hill, MA: Boston College.

Beaton, A. E., Martin, M. O., Mullis, I. V. S., Gonzalez, E. J., Smith, T. A., & Kelly, D. L. (1996a). Science achievement in the middle school years: IEA’s Third International Mathematics and Science Study (TIMSS). Chestnut Hill, MA: Boston College.

Beaton, A. E., Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., Kelly, D. L., & Smith, T. A. (1996b). Mathematics achievement in the middle school years: IEA’s Third International Mathematics and Science Study (TIMSS). Chestnut Hill, MA: Boston College.

Bertrand, R., Dupuis, F. A., Johnson, E. G., Blais, J. G., & Jones, R. (Eds.). (1992). A world of differences: An international assessment of mathematics and science technical report. Princeton, NJ: Educational Testing Service.

Black, P. (1996). Commentary. In E. D. Britton & S. A Raizen (Eds.), Examining the examinations: An international comparison of science and mathematics examinations for college-bound students (pp. 19-21). Boston: Kluwer Academic.

Blais, J. G., Johnson, E. G., Mislevy, R. J., Pashley, P. J., Sheehan, K. M., & Zwick, R. J. (1992). IAEP technical report, Vol. two: Part IV. IRT analysis. Princeton, NJ: Educational Testing Service.

Page 55 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

Bloom, B. S. (1956). Taxonomy of educational objectives: Handbook I, cognitive domain. New York: David McKay.

Comber, L. V., & Keeves, J. P. (1973). Science education in nineteen countries: An empirical study. Stockholm: International Association for the Evaluation of Educational Achievement.

Duran, R. P. (2000). Implications of electronic technology for the NAEP assessment (NVS Validity Studies Rep.). Palo Alto, CA: American Institutes for Research.

Educational Testing Service. (1991). The IAEP assessment: Objectives for mathematics, science, and geography. Princeton, NJ: Educational Testing Service.

Elley, W. B. (1992). How in the world do students read? IEA study of reading literacy. Hamburg, Germany: The International Association for the Evaluation of Educational Achievement.

Freudenthal, H. (1975). Pupils’ achievement internationally compared – The IEA. Educational Studies in Mathematics, 6, 127-186.

Garden, R. A., & Orpwood, G. (1996). Development of the TIMSS achievement tests. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development (Chapter 2, pp. 1-19). Chestnut Hill, MA: Boston College.

Glaser, R., Linn, R., & Bohrnstedt, G. (1997). Assessment in transition: Monitoring the nation’s educational progress. Stanford, CA: Stanford University, The National Academy of Education.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3^rd ed., pp. 147-200). New York: Macmillan.

Husen, T. (Ed.) (1967). International study of achievement in mathematics: A comparison of twelve countries (Vol. I.). New York: John Wiley & Sons.

Husen, T. (1987). Policy impact of IEA research. Comparative Education Review, 31(1), 129-136.

Jakwerth, P., Bianchi, L., Houang, R., Schmidt, W., Valverde, B., Wolfe, R., & Yang, W. (1997, April). Validity in cross-national assessments: Pitfalls and possibilities. Paper presented at the Annual Meeting of the American Educational Research Association. Chicago.

Johnson, E. G. (1992). Theoretical justification for the omnibus measure of differential item functioning. In R. Bertrand, F. A. Dupuis, E. G. Johnson, J. G. Blais, & R. Jones (Eds.). A world of differences: An international assessment of mathematics and science technical report. Princeton, NJ: Educational Testing Service.

Keeves, J. P. (1992). The design and conduct of the second science survey. In J. P. Keeves (Ed.), The IEA study of science II: Change in science education and achievement. 1970 to 1984. Elmsford, NY: Pergamon Press.

Lapointe, A. E., Askew, J. M., & Mead, N. A. (1992). Learning science. The International Assessment of Educational Progress (Rep. No. 22-CAEP-02). Princeton, NJ: Educational Testing Service.

Lapointe, A. E., Mead, N. A., & Askew, J. M. (1992). Learning mathematics. The International Assessment of Educational Progress. (Rep. No. 22-CAEP-01). Princeton, NJ: Educational Testing Service.

Lapointe, A. E., Mead, N. A., & Phillips, G. W. (1989). A world of differences: An international assessment of mathematics and science. (Rep. No. 19-CAEP-01). Princeton, NJ: Educational Testing Service.

Lie, S., Taylor, A., & Harmon, A. (1996). Scoring techniques and criteria. In M. Martin & D. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS): Technical report, Vol. I: Design and development (Chapter 7, pp. 1-16). Chestnut Hill, MA: Boston College.

Page 56 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

Linn, R. L. (1987). State-by-state comparisons of student achievement: The definition of the content domain for assessment (Technical Rep. No. 275). Los Angeles: UCLA, Center for Research on Evaluation, Standards, and Student Testing.

Linn, R. L. (1988). Accountability: The comparison of educational systems and the quality of test results. Educational Policy, 1, 181-198.

Linn, R. L., & Baker, E. L. (1995). What do international assessments imply for world-class standards? Educational Evaluation and Policy Analysis, 17, 405-418.

Martin, M. O., & Kelly, D. L. (Eds.). (1996). Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development. Chestnut Hill, MA: Boston College.

McKnight, C. C., Crosswhite, F. J., Dossey, J. A., Kifer, E., Swafford, J. O., Travers, K. J., & Cooney, T. J. (1987). The underachieving curriculum: Assessing U.S. school mathematics from an international perspective. Champaign, IL: Stipes.

Mesa, V., & Kilpatrick, J. (1998, September). The content of mathematics education around the world. Paper prepared for the second Mathematics and Science Education Around the World: Continuing to Learn from TIMSS committee meeting, Woods Hole, MA.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3^rd ed., pp. 13-103). New York: Macmillan.

Miller, M. D., & Linn, R. L. (1989). International achievement and retention rates. Journal of Research in Mathematics Education, 20, 28-40.

Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (3^rd ed., pp. 335-366). New York: Macmillan.

Mislevy, R. J. (1995). What can we learn from international assessments? Educational Evaluation and Policy Analysis, 17, 410-437.

Mullis, I. V. S., & Martin, M. O. (1998). Item analysis and review. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS), Vol. II: Implementation and analysis (primary and middle school years) (pp. 101-110). Chestnut Hill, MA: Boston College.

Mullis, I. V. S., & Smith, T. (1996). Quality control steps for free-response scoring. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study: Quality assurance in data collection (Chapter 5, pp. 1-32). Chestnut Hill, MA: Boston College.

National Research Council. (1997). Learning from TIMSS: Results of the Third International Mathematics and Science Study: Summary of a symposium. A. Beatty (Ed.). Board on International Comparative Studies in Education, Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.

National Research Council. (1999). Grading the nation’s report card: Evaluating NAEP and transforming the assessment of educational progress. Committee on the Evaluation of National and State Assessments of Educational Progress. J. W. Pellegrino, L. R. Jones, & K. J. Mitchell (Eds.). Board on Testing and Assessment, Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.

Porter, A. C. (1991). Assessing national goals: Some measurement dilemmas. In Educational Testing Service, The assessment of national goals (pp. 21-43). Princeton, NJ: Educational Testing Service.

Postlethwaite, T. N., & Wiley, D. E. (1992). The IEA study of science II: Science achievement in twenty-three countries. Oxford, England: Pergamon Press.

Romberg, R. A. (1985, October). The content validity of the mathematics subscores and items for the Second International Mathematics Study. Paper prepared for Committee on National Statistics, National Research Council of the National Academies.

Schmidt, W. E., & McKnight, C. C. (1995). Surveying educational opportunity in mathematics and science. Educational Evaluation and Policy Analysis, 17, 337-353.

Page 57 Cite

Suggested Citation:"2. The Measurement of Student Achievement in International Studies." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

×

Schmidt, W. E., McKnight, C. C., & Raizen, S. A. (1997). A splintered vision: An investigation of U.S. science and mathematics education. Dordrecht, Netherlands: Kluwer Academic.

Schmidt, W. E., McKnight, C. C., Valverde, G. A., Houang, R. T., & Wiley, D. E. (1997). Many visions, many aims: A cross-national investigation of curricular intention in school mathematics. Dordrecht, Netherlands: Kluwer Academic.

Schmidt, W. E., Raizen, S. A., Britton, E. D., Bianchi, L. J., & Wolfe, R. G. (1997). Many visions, many aims: A cross-national investigation of curricular intention in science education. Dordrecht, Netherlands: Kluwer Academic.

Stedman, L. (1994). Incomplete explanations: The case of U.S. performance on international assessments of education. Educational Researcher, 23(7), 24-32.

Thorndike, R. L. (1967). The mathematics tests. In T. Husen (Ed.), International study of achievement in mathematics: A comparison of twelve countries (Vol. I, pp. 90-108). New York: John Wiley & Sons.

Wang, J. (1998a). A content examination of the TIMSS items. Phi Delta Kappan, 80(1), 36-38.

Wang, J. (1998b). International achievement comparison: Interesting debates on inconclusive findings. School Science and Mathematics, 98(7), 376-382.