Part I
Study Design




2
The Measurement of Student Achievement in International Studies

Robert L. Linn*

* Robert Linn is a distinguished professor in the School of Education at the University of Colorado. He is co-director of the National Center for Research on Evaluation, Standards, and Student Testing.

The measurement of student achievement is a challenging undertaking regardless of the scope of the domain of measurement, the student population to be assessed, or the purposes of the assessment. The more specific the purpose, the more homogeneous the population of students, and the narrower the domain of measurement, however, the easier is the task of developing measures that will yield results that support valid interpretations and uses. A teacher who prepares an end-of-unit test in algebra faces a task with a fairly clearly defined content domain and knows a great deal about the common experiences of the students who will take the test. There are still variations in purpose (e.g., grade assignment, formative feedback to students, feedback to the teacher) that need to be considered, but the purposes are reasonably circumscribed. There are also issues of the item types (e.g., multiple-choice, short-answer, extended-response problems) to be used and the cognitive demands of the items. For example, how much emphasis should be given to routine application of algorithms, how much to conceptual understanding, how much to solving new problems that require students to make generalizations, how much to communication, and how much to making connections to earlier concepts and assignments?

In the individual classroom setting, however, much is known about the familiarity that students have with different item formats, and that familiarity is relatively homogeneous for all students taking the test. Moreover, instructional goals can be used to guide decisions about the emphasis given to different cognitive processes.

Large-scale assessments, be they a norm-referenced test designed for use nationally, a state assessment, or an assessment such as the National Assessment of Educational Progress (NAEP), face many of the issues involved in an end-of-unit test for use in a single classroom. Issues of item types and the cognitive demands of the items, for example, remain important, but there is greater diversity in the familiarity that students have with different formats and with items that make different levels of cognitive demands. The delineation of purpose and the scope of the content domain are considerably more complicated for large-scale assessment development than for the classroom test. Moreover, the definition of the target population is no longer a given, and even when defined the population will be more heterogeneous in background, in curriculum exposure, and in instruction directed to the content of the assessment. These complications exacerbate the challenges of developing assessments that yield results that support valid interpretations and uses.

Not surprisingly, the challenges are greater still for international assessments of student achievement. An immediately apparent complication is that assessments have to be translated into the multiple languages of participating countries. Variations among countries in educational systems, cultures, and traditions of assessment add to the complexity of the problems of international assessments.

PURPOSES

Consideration of measurement issues for any assessment should start with the identification of the purpose of the assessment. Millman and Greene (1989, p. 335) note that "The first and most important step in educational test development is to delineate the purpose of the test or the nature of the inferences intended from test scores." They justify this claim by noting that "A clear statement of purpose provides the test developer with an overall framework for test specification and for item development, tryout and review" (Millman & Greene, 1989, p. 335). Most assessments, of course, serve multiple purposes, only some of which are intended and clearly specified in advance. Nonetheless, the delineation of purpose(s) is an important undertaking that provides not only a logical starting point, but also the touchstone for evaluating the other measurement decisions throughout the process of assessment development, administration, and interpretation of results.

The purposes of international assessments are manifold. The purpose that attracts the most attention in the press is the horse race aspect of the studies, that is, the tendency to report the relative standing of country average total test scores. Although it is recognized that international competition inevitably draws "attention of policymakers and the general public to what has been referred to as the 'Olympic Games' aspect of the research" (Husen, 1987, p. 131), researchers associated with the conduct of studies under the auspices of the International Association for the Evaluation of Educational Achievement (IEA) have consistently argued that there are many other purposes that are more important than the horse race comparisons.

Mislevy (1995) began his discussion of purposes of international assessments as follows: "In the broadest sense, international assessment is meant to gather information about schooling in a number of countries and somehow use it to improve students' learning" (p. 419). In keeping with this broad purpose, the introduction to the report of middle school mathematics results for the Third International Mathematics and Science Study (TIMSS) gives the following statement of purpose: "The main purpose of TIMSS was to focus on educational policies, practices, and outcomes in order to enhance mathematics and science learning within and across systems of education" (Beaton et al., 1996b, p. 7). An implicit assumption is that comparisons of student performances for different countries will contribute toward this end in some way. Otherwise, there would be no need for the involvement of countries other than one's own in the assessment. Thus, it is not surprising that comparing achievement in a specified subject or subjects across countries is a purpose that is common to all of the international studies of achievement.

The objective of comparing relative achievement of students at a target age or grade level by country and subject immediately raises a host of measurement questions. At the most general level, there is the question of whether to limit the measurement domain to the intersection of the content coverage intended by the curricula of participating countries or to have it encompass the union of content covered. Or should the domain boundaries fall somewhere between those extremes (Linn, 1988; Porter, 1991)? The union is almost surely too expansive to be practical, while the intersection would restrict the coverage to an unreasonable degree. Hence, the domains defined for international assessments have negotiated limits that fall between the extremes. Once the boundaries have been agreed on, questions remain about the relative emphasis to be given to topics within the domain, about the relative importance of different levels of cognitive demands of the assessment tasks within each topic, about the length of the assessment, and about the mix of item types.

The comparative results obtained on an assessment depend on the degree to which the assessment reflects the curriculum and instruction of the groups of students whose performance is being compared (Linn, 1988; Linn & Baker, 1995; Porter, 1991). In any evaluation of educational programs, "if a test does not correspond to important program goals, the evaluation will be considered unfair" (Linn, 1987, p. 6). This is true for assessments within a nation, but becomes critically important in considering comparisons of performance of nations because there are such large differences between countries in curriculum and instructional emphases. For individual countries the fairness of the assessment necessarily varies as a function of the degree of correspondence between each country's curriculum and the content boundaries and the relative emphasis given to covered topics of the assessment.

SPECIFICATIONS

The particulars of the definition of the domain can have a significant impact on the relative position of nations on the assessment. Heavy weight given to one subdomain can advantage some nations and disadvantage others. Multiple-choice formats familiar to students in some nations may be less so to students in others. Conversely, extended-answer problems are standard fare for students in some nations, but not for students in all nations participating in the study. As Mislevy (1995, p. 423) has noted, "The validity of comparing students' capabilities from their performance on standard tasks erodes when the tasks are less related to the experience of some of the students." Because of the sensitivity of the relative performance of nations to the details of the specification of the assessments, considerable effort must go into negotiating the details of the specifications and into review of and sign-off on the actual items administered. Messick (1989, p. 65) has noted that

    [I]ssues of content relevance and representativeness arise in connection with both the construction and the application of tests. In the former instance, content relevance and representativeness are central to the delineation of test specifications as a blueprint to guide test development. In the latter instance, they are critical to the evaluation of a test for its appropriateness for a specific applied purpose.

Details of the approaches used to develop specifications for the assessments have varied somewhat in previous international assessments, but the general nature of the approaches has had a great deal in common. Generally, the approach has been to define a two-way table of specifications, beginning with one dimension defined by content. The topic and subtopic grain size has varied considerably, due in part to the subject matter of the assessment and the grade level; it also has varied from one assessment to the next within the same subject area.

The First International Mathematics Study (FIMS) placed the 174 items used across the different age populations assessed into one of 14 topics, ranging from basic arithmetic to calculus (Thorndike, 1967, p. 105). The content dimension was primary, and considerable effort went into defining the topics and obtaining items for them. Despite the emphasis on content, some reviewers of the FIMS results (e.g., Freudenthal, 1975) were sharply critical of the assessments for what was seen as an overemphasis on psychometrics and a lack of involvement of subject-matter experts who were familiar with curricula and teaching practices in the participating countries.

In the Second International Mathematics Study (SIMS), the main emphasis continued to be placed on content categories, but there was substantially greater involvement of mathematics educators and much greater salience was given to the mathematics curricula of the participating countries. SIMS maintained links to FIMS by including a sizable fraction of items from FIMS, but used a different set of topical categories. SIMS had 133 content categories under five broad topics (arithmetic, algebra, geometry, probability and statistics, and measurement) for the eighth-grade population and 150 content categories under nine broad topics for the twelfth-grade population (Romberg, 1985, p. 9). Other international studies have divided the content domain using fewer broad topic areas.

A rather different approach was taken in the International Assessment of Educational Progress (IAEP) studies conducted by the Educational Testing Service (Lapointe, Askew, & Mead, 1992; Lapointe, Mead, & Askew, 1992), using frameworks more in keeping with the ones developed for NAEP. In mathematics for 9- and 13-year-olds, the IAEP framework had five broad content categories. Those content categories were crossed with three cognitive process categories to yield a framework with the 15 cells shown in Table 2-1. The broad categories used by IAEP stand in sharp contrast to the fine-grained breakdown in SIMS.

The TIMSS assessments also were based on tables of specifications with characteristics that had some similarity to the frameworks used in the IAEP studies, but had greater specificity of content. For example, the eighth-grade science assessment had eight broad content areas (earth sciences; life sciences; physical sciences; science; technology and mathematics; environmental issues; nature of science; and science and other disciplines). Those categories were crossed with five cognitive process categories called performance expectations in the TIMSS reports (understanding; theorizing, analyzing, and solving problems; using tools, routine procedures, and science processes; investigating the natural world; and communicating) (Beaton et al., 1996a, p. A-6).

TABLE 2-1 IAEP Mathematics Framework for 9- and 13-Year-Olds

Content categories (columns): Numbers and Operations | Measurement | Geometry | Data Analysis, Statistics, and Probability | Algebra and Functions
Cognitive process categories (rows): Conceptual understanding | Procedural knowledge | Problem solving

SOURCE: Based on Educational Testing Service (1991, p. 13).

Finer breakdowns of content also were available and used for some analyses. For example, Schmidt, Raizen, Britton, Bianchi, and Wolfe (1997) reported results for 17 science content areas.

In contrast to the relatively fine breakdown of content categories in mathematics and science, the IEA study of reading literacy identified three major domains or types of reading literacy materials: narrative prose ("texts in which the writer's aim is to tell a story—whether fact or fiction"), expository prose ("texts designed to describe, explain, or otherwise convey factual information or opinion to the reader"), and documents ("structured displays presented in the form of charts, tables, maps, graphs, lists or sets of instructions") (Elley, 1992, p. 4).

In addition to variation from one international study to another in the grain size used in the specification of content, there is variation among content categories within a single study. Mesa and Kilpatrick (1998) commented on the lack of uniformity across topics. The lack of uniformity was acknowledged for TIMSS by Schmidt, McKnight, and Raizen (1997, p. 128) as follows: "No claim is made that the 'grain size'—the level of specificity for each aspect's categories—is the same throughout the framework." Mesa and Kilpatrick (1998, p. 8) argue that the lack of uniformity of grain size is problematic, noting, for example, that this means "[s]mall-grained topics such as properties of whole number operations are counted on a par with large-grained topics such as patterns, relations, and functions. Such variation in grain size can result in disproportionate numbers of items for some clusters of topics relative to the intended emphasis in relation to the whole content domain."

The content domains for the international studies have been defined in practice to be somewhere between the intersection and the union of the content domains covered by the curricula of the participating countries, but are closer to the intersection than the union. Because of the prominence of English-speaking countries, especially the United States, in contributing items to the pools of items contributed or developed to make up assessments in line with the specifications, there appears to be a better match to the curricula of English-speaking countries than to the curricula of countries with different languages.

COGNITIVE PROCESSES

As noted, the content dimension of test specification tables has been primary in international assessments. The second dimension of the framework or table of specifications for the assessments generally has focused on the cognitive processes those items or assessment tasks are intended to measure. The well-known breakdown of tasks into six major categories of performance (knowledge, comprehension, application, analysis, synthesis, and evaluation) in Bloom's (1956) taxonomy of educational objectives illustrates one approach to specifying distinct categories of cognitive processes that have been applied to a variety of content domains. The rows of Table 2-1 illustrate another formulation of process categories. For FIMS a table of specifications crossed mathematical topics (e.g., basic arithmetic and elementary algebra) with the following five intellectual process categories:

- Knowledge and information: definitions, notation, concepts.
- Techniques and skills: solutions.
- Translation of data into symbols or schema and vice versa.
- Comprehension: capacity to analyze problems, to follow reasoning.
- Inventiveness: reasoning creatively with mathematics (Thorndike, 1967, p. 94).

In a similar vein, the First International Science Study (FISS) crossed a content dimension with a "behavioral objectives dimension consisting of four categories: information, comprehension, application, higher processes" (Comber & Keeves, 1973). The Second International Science Study (SISS) used a substantial number of items (nearly half the total) from FISS and supplemented those items with new items for SISS that were categorized into just three of the four behavioral objectives used in FISS (the higher order process category was not used for the new items) (Keeves, 1992). As was true of the contrast of the first and second mathematics studies, the second science study placed greater emphasis on the curricula of the participating countries than had been done in the first science study. Items for the test were selected not from the most common topics, but rather based on the emphasis of topics in each country as defined in the country's intended curriculum.

For SIMS the second dimension was called the "behaviors dimension" and distinguished "four levels of cognitive complexity expected of students—computation, comprehension, application, and analysis" (Romberg, 1985, p. 9). As Romberg notes, the four levels used in SIMS mapped partially, albeit imperfectly, into the Bloom categories. The second dimension for TIMSS made similar distinctions, but was referred to as the "expectations" dimension. As is true of the IAEP process categories in Table 2-1, more recent specification tables have moved farther away from the Bloom taxonomic categories. In the TIMSS mathematics assessment, for example, four categories of expectations or cognitive processes were distinguished: knowing, performing routine procedures, using complex procedures, and solving problems (Beaton et al., 1996b, p. A-7). In science the TIMSS performance expectations dimension consisted of five categories: understanding simple information; understanding complex information; theorizing, analyzing, and solving problems; using tools, routine procedures, and science processes; and investigating the natural world (Beaton et al., 1996a).

The specification of topics of the content domain involves judgments of subject-matter experts that have been informed in international studies by cross-national curriculum analysis. Agreements require negotiated compromises between desires to be comprehensive in coverage, the goal of fairly assessing the achieved curriculum of all participating countries, and issues of feasibility. Determining whether an item fits a content topic area is relatively straightforward once the topics have been defined. Determining the types of cognitive processes required to answer an item is far less straightforward.

There is widespread agreement that assessments should tap more than simple knowledge of facts and procedures. The assessment also should measure a student's ability to apply knowledge, skills, and concepts to solve problems and communicate in academic and nonacademic problem settings. Furthermore, it should measure the ability to communicate concepts, make connections, provide explanations appropriate for the subject matter, interpret findings, and evaluate problem solutions and arguments (Glaser, Linn, & Bohrnstedt, 1997). Measuring such higher order cognitive processes and achievement outcomes is more challenging than measuring factual knowledge and skills at applying routine algorithms. The fact that nearly any test development effort that solicits items from a broad range of subject-matter experts, as has been done in the IEA studies, will find an overabundance of items is symptomatic of the greater difficulty in writing items that will tap the higher level problem solving, analysis, explanation, and interpretation skills sought for the assessments.

Although, as will be described, considerable effort has gone into the development of items that do more than measure factual knowledge and low-level skills, critics continue to fault international assessments for falling short of the goal of measuring more complex understanding and problem-solving skills. For example, "Jan de Lange . . . argued that the TIMSS items are primarily useful for testing low-level knowledge and do not necessarily represent anyone's idea of a desirable curriculum" (National Research Council [NRC], 1997, p. 17). The criticism by de Lange is due, at least in part, to a preference for assessments that present students with substantial problems requiring multiple steps to solve and that allow for a variety of solution paths and, sometimes, multiple solutions that would be judged to be of high quality. It is also based on a belief that the multiple-choice and short-answer formats used in the international assessments can only measure factual knowledge and low-level skills. More will be said about that in the following section, but here it is worth recalling an observation made by Thorndike (1967, p. 96) in his chapter describing the FIMS tests.

    Time limitations together with the need to sample widely from the content of mathematics dictated another decision. It was agreed, somewhat reluctantly, that it would be necessary to keep the single problems brief. Much as one might like to explore the students' ability to work through an involved sequence of steps, or develop a complex proof, this seemed impossible. Such a task would exhaust too large (and too variable) a fraction of the limited time that was available.

ITEM FORMATS

The criticism of international assessments on the grounds that they assess only relatively low-level cognitive processes reflects, in part, the difficulty of writing items that tap higher level skills and understanding. Many, like de Lange, would argue that the multiple-choice item formats that are most used in international assessments make it infeasible to assess some of the more complex cognitive processes that correspond to ambitious curriculum aspirations. Multiple-choice and short-answer items are obviously efficient and make it possible to assess a wide range of content in a relatively short period of time. Such items are quite effective at measuring knowledge of facts, procedures, and concepts. Skilled item writers also can and do use these formats effectively to measure understanding and the ability to apply concepts and procedures to solve problems, provide explanations, interpret, and evaluate findings or arguments. There are limits to these formats, however. Nonetheless, for the reasons articulated by Thorndike, multiple-choice has been the dominant item format, supplemented by some short-answer and a smaller number of

tional assessments include the administration of different test forms to subsamples of students from each population, as was done in FIMS, and the administration of a common core of items to all students together with one of several unique subsets of items, sometimes referred to as rotated forms, to different subsamples, as was done in SIMS and SISS. Yet another variation is to administer two or more blocks of items to students, with blocks administered together in various combinations. The IAEP studies used blocks of items to make up test booklets or forms, albeit only two booklets were used per subject at age nine and just a single booklet was used at age 13. The limited number of items used in IAEP made it unnecessary to have a larger number of booklets in which blocks of items would be placed. However, experience with NAEP has shown that the use of balanced-incomplete-block designs for the allocation of items can be an effective approach to administering larger numbers of items to students while limiting the administration time for any given student.

The use of a common core together with rotated forms is illustrated by SISS (Postlethwaite & Wiley, 1992). Three student populations—10-year-olds, 14-year-olds, and students in the final year of secondary education—were studied in SISS. Variations of a common core and rotated forms were used for each of the three populations. A core test of 24 items and four rotated forms of eight items each were used for the 10-year-olds. The 70 items for the 14-year-olds were divided into a core of 30 items administered to all students and four rotated forms of ten items each. For students in the last year of secondary school, the items were distinguished by subject area (biology, chemistry, or physics). Three rotated forms of 30 items each, consisting of items in one of the three content areas, were administered to subsamples of approximately one-third of the students together with a core form of 26 items consisting of nine biology items, nine chemistry items, and eight physics items (Postlethwaite & Wiley, 1992, p. 49).

A more complicated matrix sampling design was used in TIMSS. This can be illustrated by a brief description of the overall assessment design for 199 items administered to the 9-year-old population. Each item was placed into one of 26 item clusters. Cluster A was designated the core cluster. It contained a total of five mathematics and five science multiple-choice items. A total of eight separate test booklets, each consisting of cluster A and six of the remaining 25 clusters of items, were administered. Seven of the noncore clusters of items were designated focus clusters. Focus clusters were included in either three or four of the eight booklets, thereby assuring substantial numbers of students for those items. Ten of the remaining clusters were labeled either mathematics breadth or science breadth. Breadth clusters were included in only a single booklet, and hence were administered to only about one-eighth of the sampled students. The remaining eight clusters consisted of either mathematics or science free-response items. Each of those clusters was included in two of the eight booklets (Adams & Gonzalez, 1996).

The IEA and IAEP studies have made effective use of matrix sampling designs to allow for a broader coverage of content domains than otherwise would have been possible.

SUMMARY SCORES

Results of early international assessments were commonly reported in terms of total number-correct scores or average percentage-correct scores. Such scores are reasonable as long as they are based on a common set of items. With a core and two rotated forms used, for example in SISS, total scores for the core and for the core plus the two rotated forms are readily produced, and with proper sampling weights can be used to produce various descriptive statistics. Though not essential for producing results, the more complicated assessment designs of the more recent IAEP studies (e.g., Educational Testing Service, 1991) and TIMSS have relied on scaling based on item response theory (IRT) in addition to percentage-correct scores for summarizing results. The IAEP studies used a three-parameter logistic IRT model (Blais et al., 1992). The one-parameter Rasch IRT model was used in TIMSS (Martin & Kelly, 1996).

IRT models the probability that a given student will answer an item correctly in terms of a single latent proficiency dimension along which both persons and items are located by person and item parameters. In the case of the Rasch model, the probability of a correct response is determined by the difference between the location of the person on the dimension (the person's proficiency) and the location of the item on the same dimension (the item's difficulty). The dimension or proficiency scale summarizes the achievement level of students and the relative difficulty of items. The three-parameter logistic model also has a single person parameter to locate the person on the proficiency scale, but uses three parameters to characterize items: one for the relative difficulty, one for the discriminating power of the item, and one (the pseudoguessing parameter) to account for the fact that the probability of a correct response on a multiple-choice item is always greater than zero, no matter how low the person's level of proficiency is.

IRT provides a basis for estimating performance on a common scale even when students are given different subsets of items. This is a great advantage over simple number-right scoring for assessments such as those used in international studies where different students are administered different subsets of items. It means, for example, that performance on rotated forms can be compared on a common scale. Thus, when the assumptions of IRT are met to a reasonable approximation by the item response data obtained for a sample of test takers, proficiency estimates for test takers can be defined in relation to the pool of items administered, but do not depend on the particular subset of items taken by a given individual. Moreover, the item statistics do not depend on the particular subsample of test takers who responded to a given item. These two properties often are referred to as "item free ability estimates" and "sample-free or person-free item parameter estimates" (Hambleton, 1989, p. 148).

The assumptions of IRT about the dimensionality of the assessment, local independence (i.e., that test-taker responses to items are statistically independent after proficiency level is taken into account), and the specific mathematical form of the item response function (e.g., the one-parameter Rasch model) are, of course, only approximated in practice. Hence, the properties of item-free proficiency estimates and sample-free item parameter estimates hold only approximately. Nonetheless, IRT, even when a unidimensional model is used, has been found to be relatively robust as long as there is a strong dominant factor or underlying dimension for the set of items. The TIMSS reports summarized the preference for IRT scaling as follows:

    The IRT methodology was preferred for developing comparable estimates of performance for all students since students answered different test items depending upon which of the eight test booklets they received. The IRT analysis provides a common scale in which performance can be compared across countries. In addition to providing a basis for estimating mean achievement, scale scores permit estimates of how students within countries vary and provide information on percentiles of performance. (Beaton et al., 1996a, p. A-27; the quoted summary statement is also included in other TIMSS reports)

There are considerable advantages provided by a scale for which percentiles and variability can be computed. Clearly there is much more to characterizing the achievement of students in a country, even if one is satisfied with a single summary dimension, than reporting the mean. Earlier treatments of international assessment results could, of course, report information on variability and percentile points for the core set of items taken by all students simply using number-right scores. Reporting results for the full assessment, including rotated forms, was more complicated and involved forms of linking rotated forms that were less theory based (see, for example, Miller & Linn, 1989) and somewhat problematic because the forms were not comparable in difficulty or content coverage.

A single scale provides an overall summary of student achievement within the subject-matter domain of the assessment. Such summary information is useful for making overall comparisons. Policy makers, the media, and the public like the apparent simplicity that is provided by reports of results on a single scale.

The rank order of the average performance of a nation's students can be seen at a glance, and comparisons of the highest or lowest performing 5 or 10 percent of a nation's students to other nations are not much more complicated. For those concerned about curriculum issues, however, the single scale leaves too many questions unanswered. As Mislevy (1995, p. 427) notes, "Because no single index of achievement can tell the full story and each suffers its own limitations, we increase our understanding of how nations compare by increasing the breadth of vision—as Consumers Reports informs us more fully by rating scores of attributes of automobiles." As Black (1996, p. 19) argues, with international summary score comparisons of test performance,

    . . . like most statistics, what they conceal is as important as what they reveal. This is because pupils' performances on a particular question depend strongly on the extent to which its demands are familiar to the pupil and on the opportunities the pupil has had to learn about responding to such demands. Thus, the effectiveness of teaching or the commitment of the pupils are only two of several important determinants of test outcomes. The curricula and the inter-related practices of teaching, learning, and testing to which the pupils are accustomed are of equal, arguably greater, importance.

MULTIPLE SCORES

As has been discussed, the international studies of achievement conducted by IEA traditionally have placed considerable emphasis on issues of curriculum differences and student opportunity to learn. Consistent with this emphasis, reports of results generally have included more than reports of performance on a single global score scale. For example, SISS reported results for 10-year-olds for collections of items under the headings of biology, chemistry, earth science, physics, information, comprehension, and application in addition to an overall score (Postlethwaite & Wiley, 1992). Similarly, SIMS reported topic scores for sets and relations, number systems, algebra, geometry, elementary functions/calculus, and probability/statistics (McKnight et al., 1987). Similar breakdowns by content area were given in both the mathematics and science reports for TIMSS.

The relative standing of a country is often quite different when subscores are used than when rankings are based on total scores. This is apparent in TIMSS, for example, where the eighth-grade results for subscores and the total were summarized as follows: "In math, the number of countries outperforming the U.S. in the total score is 17. In the subscales, this ranges from 9 in data representation and analysis to 30 in measurement. In science, the number of countries outperforming the U.S. on the total score is nine. In the subscales this ranges from 1 in environment to 13 in physics" (Jakwerth et al., 1997, p. 11).

The TIMSS reports also allowed national research coordinators to specify subsets of items that were judged to be appropriate for the country's curriculum. Each national research coordinator was asked to indicate whether items were or were not part of the country's intended curriculum. Items that were judged to be in the intended curriculum for 50 percent or more of the students in a country were considered appropriate for that country. Such items were then included in the score derived to match that country's curriculum, and scores were produced based only on those items. Scores were obtained and summarized not only for the items appropriate for a particular country, but for those appropriate for each of the other countries as well.

The number of items judged appropriate by national research coordinators varied substantially by country. For example, on the eighth-grade science test, the number of possible score points for the set of items judged appropriate for an individual country's curriculum ranged from a low of 58 for Belgium to a high of 146 for Spain (Beaton et al., 1996a, p. B-3). Even with wide variability in the number of science items judged to be appropriate for the different countries, the impact on the relative performance of countries was only modest. "[T]he selection of items for the participating countries varied somewhat in average difficulty, ranging from 55–59 percent correct at the eighth grade and from 49–56 percent at the seventh grade. Despite these differences, the overall picture provided . . . reveals that different item selections do not make a major difference in how well countries do relative to each other" (Beaton et al., 1996a, p. B-5). The results for mathematics when items were selected to match the curriculum of each country were quite similar to those for science. The fact that the country-selected item sets did not show greater variability in results is not surprising because all countries selected a sizable fraction of the total set of items and the country-specific sets of items all spanned multiple content areas.

When separate scores are produced for each content area (e.g., algebra and geometry within mathematics or earth science and physics within science), some potentially useful distinctions in relative performance of countries are revealed. For example, in eighth grade, the student average percentage correct in overall mathematics was the same for England and the United States (53 percent correct). England outperformed the United States in geometry (54 versus 48 percent correct) and measurement (50 versus 40 percent correct), whereas the United States did relatively better than England in fractions and number sense (59 versus 54 percent correct) (Beaton et al., 1996b, p. 41). Comparable variations in patterns across content areas also occur for other pairs of countries with the same overall performance, and for other grades and for science as well as mathematics.

For people concerned with curriculum and instruction, the variations in patterns are more revealing than comparisons of countries in terms of overall scores. Still greater variation is obtained in country score patterns when topic scores within content areas are used. Schmidt, McKnight, and Raizen (1997) divided the TIMSS mathematics items into 20 topics using content framework categories. The science items were divided into 17 topics. Within a country, the difference in average percentage-correct scores between the highest and the lowest topic ranged from 20 to 55 percent. The relative standing of a country varied greatly across topics. For example, 17 of the 42 countries had an average score among the highest five countries on at least one of the 20 mathematics topics, and 31 of the countries had ranks that fell in at least three quartiles. In science, 30 of the countries had topic area scores that fell in at least three quartiles. The average difference between a country's maximum and minimum ranks across topic areas was 18 for mathematics and 23 for science (Schmidt, McKnight, & Raizen, 1997).

Although there are far too few items to support reliable scores for individual students at the level of 17 or 20 topics, such a fine breakdown does yield useful information at the level of aggregation of countries. Some caution is nonetheless needed in interpreting the results for individual topics, because the generalization to other potential items within a topic is questionable due to the small number of items per topic. Hence, the results are probably best thought of as illustrating the degree to which the ranking of a country would be subject to change if much greater emphasis were given to some topics than to others.

SUMMARY AND CONCLUSIONS

The international studies of educational achievement sponsored by the IEA have faced great challenges in producing tests that yield a valid basis for comparing the achievement of students from a wide array of nations. Over the course of more than three decades, the studies have shown great promise and produced better measures with each successive study. Starting from scratch, the FIMS and FISS accomplished remarkable feats to assemble a pool of items that passed muster with national committees of reviewers, and produced reliable measures covering relatively broad content domains. Certainly, those studies were subjected to considerable criticism, mostly about issues other than the quality of the measures, such as the comparability of populations of students in different countries that retained widely variable fractions of the age cohorts, and about the quality of the samples of students.

There were also, however, some criticisms of the quality of the tests, particularly their relevance to the curricula of different countries and the heavy reliance on multiple-choice items. The second round of studies made substantial strides to improve the alignment of the tests to the curricula of the participating countries, and went to great lengths to get information about the intended curriculum in each country and to develop measures of opportunity to learn so that the implemented curriculum could be related to the attained curriculum as measured by the SIMS and SISS assessments.

Between the second and third rounds of studies, the IAEP studies were undertaken. Those studies contributed to advances in analytical techniques, using IRT models and conditioning procedures that had proven useful in the context of NAEP. The IAEP studies also introduced differential item functioning techniques as an approach to flagging items deserving closer scrutiny.

TIMSS benefitted from the experience and advances made in the earlier studies. It also moved ahead substantially on several measurement fronts. The analysis of the intended curriculum of participating countries was more sophisticated and complete than anything that came before it. That analysis provided a solid basis for the construction of assessments with a high likelihood of being relevant, valid, and fair to the countries involved in the study. Greater use of short-answer and extended-response items was made. The use of more sophisticated matrix sampling procedures made it possible to achieve broader coverage of the content domains within the constraints of the administration time allowed for each student. Item response theory provided an effective way of producing scores across the whole set of items within a content area. Analyses of subscores for broad topics such as geometry as well as for narrower subsets of items allowed researchers to convey the idea that achievement is more than an overall total score and is better understood in terms of the peaks and valleys that are found for every country. Those patterns of performance and their relationships to curricular emphasis are more likely to suggest educational policy to improve achievement than simple comparisons in terms of total scores.

Another innovation of TIMSS was the provision of an opportunity for countries to choose the items that were most appropriate for their particular curriculum. Results based on the country-specific selections of items, while not greatly different from those for the total scores, provided a basis for testing how different the results would be if the assessment were more closely tailored to fit the curriculum of a particular country. Finally, the within-country item analyses and item-by-country interaction analyses resulted in an effective means of flagging items in need of more careful consideration for individual countries.

Mathematics assessments have been used to illustrate many of the points in this chapter. Although the details vary from subject to subject in considerations such as the relative reliability of results and the details of topic coverage within a domain, the major points apply to subjects other than mathematics. All in all, the quality of the measurement of achievement is one of the greatest strengths of the IEA studies. Certainly there is room for improvement, but the assessments bear up well under close scrutiny.

Improvements that can be anticipated for future international studies are likely to depend on advances in the technology of testing. Computer-based test administration may be too futuristic to consider in the short run for an international study. However, the use of computers has considerable appeal as a means of enabling the administration of problems that are not feasible to administer or score efficiently in a paper-and-pencil mode (see, for example, Duran, 2000; NRC, 1999). Other suggested improvements are more incremental in nature. The curriculum analyses that were conducted for TIMSS represent a valuable, albeit aging, resource. Those analyses might be put to good use to review, and possibly revise, the specifications for the assessments. Even in the face of changes in the curricula of participating countries between the time of TIMSS and the time of a new study, the curriculum frameworks provide a broad representation of the topics in the domains of mathematics and science that could be an empirical basis for structuring a table of specifications.

REFERENCES

Adams, R. J., & Gonzalez, E. J. (1996). The TIMSS test design. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development (Chapter 3, pp. 1-36). Chestnut Hill, MA: Boston College.
Beaton, A. E., Martin, M. O., Mullis, I. V. S., Gonzalez, E. J., Smith, T. A., & Kelly, D. L. (1996a). Science achievement in the middle school years: IEA's Third International Mathematics and Science Study (TIMSS). Chestnut Hill, MA: Boston College.
Beaton, A. E., Mullis, I. V. S., Martin, M. O., Gonzalez, E. J., Kelly, D. L., & Smith, T. A. (1996b). Mathematics achievement in the middle school years: IEA's Third International Mathematics and Science Study (TIMSS). Chestnut Hill, MA: Boston College.
Bertrand, R., Dupuis, F. A., Johnson, E. G., Blais, J. G., & Jones, R. (Eds.). (1992). A world of differences: An international assessment of mathematics and science technical report. Princeton, NJ: Educational Testing Service.
Black, P. (1996). Commentary. In E. D. Britton & S. A. Raizen (Eds.), Examining the examinations: An international comparison of science and mathematics examinations for college-bound students (pp. 19-21). Boston: Kluwer Academic.
Blais, J. G., Johnson, E. G., Mislevy, R. J., Pashley, P. J., Sheehan, K. M., & Zwick, R. J. (1992). IAEP technical report, Vol. two: Part IV. IRT analysis. Princeton, NJ: Educational Testing Service.

Bloom, B. S. (1956). Taxonomy of educational objectives: Handbook I, cognitive domain. New York: David McKay.
Comber, L. V., & Keeves, J. P. (1973). Science education in nineteen countries: An empirical study. Stockholm: International Association for the Evaluation of Educational Achievement.
Duran, R. P. (2000). Implications of electronic technology for the NAEP assessment (NVS Validity Studies Rep.). Palo Alto, CA: American Institutes for Research.
Educational Testing Service. (1991). The IAEP assessment: Objectives for mathematics, science, and geography. Princeton, NJ: Educational Testing Service.
Elley, W. B. (1992). How in the world do students read? IEA study of reading literacy. Hamburg, Germany: The International Association for the Evaluation of Educational Achievement.
Freudenthal, H. (1975). Pupils' achievement internationally compared – The IEA. Educational Studies in Mathematics, 6, 127-186.
Garden, R. A., & Orpwood, G. (1996). Development of the TIMSS achievement tests. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development (Chapter 2, pp. 1-19). Chestnut Hill, MA: Boston College.
Glaser, R., Linn, R., & Bohrnstedt, G. (1997). Assessment in transition: Monitoring the nation's educational progress. Stanford, CA: Stanford University, The National Academy of Education.
Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). New York: Macmillan.
Husen, T. (Ed.). (1967). International study of achievement in mathematics: A comparison of twelve countries (Vol. I). New York: John Wiley & Sons.
Husen, T. (1987). Policy impact of IEA research. Comparative Education Review, 31(1), 129-136.
Jakwerth, P., Bianchi, L., Houang, R., Schmidt, W., Valverde, B., Wolfe, R., & Yang, W. (1997, April). Validity in cross-national assessments: Pitfalls and possibilities. Paper presented at the annual meeting of the American Educational Research Association, Chicago.
Johnson, E. G. (1992). Theoretical justification for the omnibus measure of differential item functioning. In R. Bertrand, F. A. Dupuis, E. G. Johnson, J. G. Blais, & R. Jones (Eds.), A world of differences: An international assessment of mathematics and science technical report. Princeton, NJ: Educational Testing Service.
Keeves, J. P. (1992). The design and conduct of the second science survey. In J. P. Keeves (Ed.), The IEA study of science II: Change in science education and achievement, 1970 to 1984. Elmsford, NY: Pergamon Press.
Lapointe, A. E., Askew, J. M., & Mead, N. A. (1992). Learning science. The International Assessment of Educational Progress (Rep. No. 22-CAEP-02). Princeton, NJ: Educational Testing Service.
Lapointe, A. E., Mead, N. A., & Askew, J. M. (1992). Learning mathematics. The International Assessment of Educational Progress (Rep. No. 22-CAEP-01). Princeton, NJ: Educational Testing Service.
Lapointe, A. E., Mead, N. A., & Phillips, G. W. (1989). A world of differences: An international assessment of mathematics and science (Rep. No. 19-CAEP-01). Princeton, NJ: Educational Testing Service.
Lie, S., Taylor, A., & Harmon, A. (1996). Scoring techniques and criteria. In M. Martin & D. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS): Technical report, Vol. I: Design and development (Chapter 7, pp. 1-16). Chestnut Hill, MA: Boston College.

Linn, R. L. (1987). State-by-state comparisons of student achievement: The definition of the content domain for assessment (Technical Rep. No. 275). Los Angeles: UCLA, Center for Research on Evaluation, Standards, and Student Testing.
Linn, R. L. (1988). Accountability: The comparison of educational systems and the quality of test results. Educational Policy, 1, 181-198.
Linn, R. L., & Baker, E. L. (1995). What do international assessments imply for world-class standards? Educational Evaluation and Policy Analysis, 17, 405-418.
Martin, M. O., & Kelly, D. L. (Eds.). (1996). Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development. Chestnut Hill, MA: Boston College.
McKnight, C. C., Crosswhite, F. J., Dossey, J. A., Kifer, E., Swafford, J. O., Travers, K. J., & Cooney, T. J. (1987). The underachieving curriculum: Assessing U.S. school mathematics from an international perspective. Champaign, IL: Stipes.
Mesa, V., & Kilpatrick, J. (1998, September). The content of mathematics education around the world. Paper prepared for the second Mathematics and Science Education Around the World: Continuing to Learn from TIMSS committee meeting, Woods Hole, MA.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Miller, M. D., & Linn, R. L. (1989). International achievement and retention rates. Journal of Research in Mathematics Education, 20, 28-40.
Millman, J., & Greene, J. (1989). The specification and development of tests of achievement and ability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 335-366). New York: Macmillan.
Mislevy, R. J. (1995). What can we learn from international assessments? Educational Evaluation and Policy Analysis, 17, 410-437.
Mullis, I. V. S., & Martin, M. O. (1998). Item analysis and review. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS), Vol. II: Implementation and analysis (primary and middle school years) (pp. 101-110). Chestnut Hill, MA: Boston College.
Mullis, I. V. S., & Smith, T. (1996). Quality control steps for free-response scoring. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study: Quality assurance in data collection (Chapter 5, pp. 1-32). Chestnut Hill, MA: Boston College.
National Research Council. (1997). Learning from TIMSS: Results of the Third International Mathematics and Science Study: Summary of a symposium. A. Beatty (Ed.). Board on International Comparative Studies in Education, Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.
National Research Council. (1999). Grading the nation's report card: Evaluating NAEP and transforming the assessment of educational progress. Committee on the Evaluation of National and State Assessments of Educational Progress. J. W. Pellegrino, L. R. Jones, & K. J. Mitchell (Eds.). Board on Testing and Assessment, Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.
Porter, A. C. (1991). Assessing national goals: Some measurement dilemmas. In Educational Testing Service, The assessment of national goals (pp. 21-43). Princeton, NJ: Educational Testing Service.
Postlethwaite, T. N., & Wiley, D. E. (1992). The IEA study of science II: Science achievement in twenty-three countries. Oxford, England: Pergamon Press.
Romberg, R. A. (1985, October). The content validity of the mathematics subscores and items for the Second International Mathematics Study. Paper prepared for the Committee on National Statistics, National Research Council of the National Academies.
Schmidt, W. E., & McKnight, C. C. (1995). Surveying educational opportunity in mathematics and science. Educational Evaluation and Policy Analysis, 17, 337-353.

Schmidt, W. E., McKnight, C. C., & Raizen, S. A. (1997). A splintered vision: An investigation of U.S. science and mathematics education. Dordrecht, Netherlands: Kluwer Academic.
Schmidt, W. E., McKnight, C. C., Valverde, G. A., Houang, R. T., & Wiley, D. E. (1997). Many visions, many aims: A cross-national investigation of curricular intention in school mathematics. Dordrecht, Netherlands: Kluwer Academic.
Schmidt, W. E., Raizen, S. A., Britton, E. D., Bianchi, L. J., & Wolfe, R. G. (1997). Many visions, many aims: A cross-national investigation of curricular intention in science education. Dordrecht, Netherlands: Kluwer Academic.
Stedman, L. (1994). Incomplete explanations: The case of U.S. performance on international assessments of education. Educational Researcher, 23(7), 24-32.
Thorndike, R. L. (1967). The mathematics tests. In T. Husen (Ed.), International study of achievement in mathematics: A comparison of twelve countries (Vol. I, pp. 90-108). New York: John Wiley & Sons.
Wang, J. (1998a). A content examination of the TIMSS items. Phi Delta Kappan, 80(1), 36-38.
Wang, J. (1998b). International achievement comparison: Interesting debates on inconclusive findings. School Science and Mathematics, 98(7), 376-382.