4
Sampling Issues in Design, Conduct, and Interpretation of International Comparative Studies of School Achievement

James R. Chromy*

Cochran (1977) outlines eleven steps in the planning of a survey: (1) a statement of the survey objectives, (2) the definition of the population to be sampled, (3) the data to be collected, (4) the degree of precision required, (5) the methods of measurement, (6) the frame or the partitioning of the population into sampling units, (7) the sample selection methods, (8) the pretest, (9) the fieldwork organization, (10) the summary and analysis of the data, and (11) a review of the entire process to see what can be learned for future surveys. Sound sampling practice must operate within all of these steps. Mathematically, the major concerns of sample design have focused on the sample selection procedures and the associated estimation procedures that yield precise estimates. Optimization of a sample design involves obtaining the best possible precision for a fixed cost, or minimizing survey costs subject to one or more constraints on the precision of estimates. Optimized designs sometimes are called efficient designs.

The mathematical presentation of sampling theory often focuses on obtaining efficient sample designs with precision measured in terms of sampling error only, although Cochran (1977) and earlier texts (e.g., Deming, 1950; Hansen, Hurwitz, & Madow, 1953) also discuss nonsampling errors in surveys. A more recent text by Lessler and Kalsbeek (1992) is devoted entirely to nonsampling errors in surveys, classified as frame error, nonresponse error, and measurement error.





Designing surveys that control both sampling errors and nonsampling errors remains a serious challenge. Sample designers also cannot avoid some of the conceptual issues in total survey design, such as defining the survey objectives, defining the target population to be measured, or limiting the resources that can be devoted to data collection. Decisions reached on these issues can lead to serious tradeoffs among sample design options.

The framework and principles document (National Research Council [NRC], 1990) of the Board on International Comparative Studies in Education (BICSE) identifies key sample design issues in the broader context just described. The objective of measuring achievement or performance to permit comparisons across school systems in different countries is clear. Explaining differences is more problematic and may require collection of additional data. Even with these additional data, the approach to analysis and interpretation of differences may be exploratory at best because there are many potential explanatory factors, and only some will be measured. When differences are observed, they properly form the basis for additional studies that would be designed to better understand the differences. The framework makes clear that the objectives of both descriptive and explanatory studies will require rigorous sampling procedures and the capacity to produce national estimates of the variables studied.

Conceptual problems of defining comparable student populations in different countries also are addressed. For students enrolled in school, the problem of defining the study population in terms of age or grade must be resolved. Problems exist with both methods because children start school at different ages, so even first graders may be five, six, or seven years old. Different countries follow different grade progression policies. At the upper grade levels, there may be a much broader representation of ages within a single grade. Different national policies about the legal age of leaving school, either to drop out or to enter specialized training, may alter the composition of classes completing normal schooling. The guidance document recognizes the difficulty of consistent population definition, but does not recommend one approach over another.

Survey populations also must be defined temporally. The value of national and cross-national data to meet the objectives of trend measurement and trend comparisons requires regular data collection on an established schedule. If too many cross-national studies are carried out simultaneously, both the educational process itself and the success of the surveys can be adversely affected. The administration of surveys disrupts the educational process in the schools involved. Schools requested to participate in several surveys (national, cross-national, and others) may be less likely to participate in any of them or may have to select among them. Consequently, school response rates will suffer.

The BICSE framework provides several principles for sampling and access to schools for both descriptive and explanatory studies:

- Samples must be drawn from the full population of teachers, administrators, students (at an age or a grade), or policy makers.
- Valid estimation of population parameters requires strict adherence to an explicit sample design. Plans should discuss the frame and the approach to selecting the sample.
- Planned exclusions of subgroups (the disabled or persons who do not speak the language in which the test is administered) must be documented. Information should be provided about the size of the excluded population subgroup and the direction of the bias due to the exclusion.
- The extent of participation in education may create differences in the student populations of different countries. The sample design should support reasonably accurate inferences about an age or grade cohort and capture the existing range of individual, school, and classroom variations.
- Within-country subpopulations may be defined. The total population and subpopulation samples must be explicitly delineated.
- An international sampling manual is essential. The board encourages the appointment of an experienced and expert sampling consultant to review and approve all country samples before testing takes place.
- The achieved sample design is usually different from the planned sample design. Advance arrangements with school officials should be made to ensure high participation rates.
- A maximum acceptable nonresponse rate should be specified for inclusion of a country's data in the international analyses.
- Subnational units that have separate autonomous school systems may be included in international studies.

The BICSE framework also specifies test administration procedures to control the measurement error component. These include standardized procedures over time and across nations, pilot testing in each participating country, and a meeting with study coordinators between the pilot study and the full-scale study to review procedures and adjust them if necessary.

The report also recommends (ideally) that “suitably trained [test administrators] from outside the [school system] be in charge of test administration” and that “people from different countries . . . supervise the implementation of the procedures to be followed (previously agreed on by the countries involved) by being present on site when the field work is conducted” (NRC, 1990, p. 9). The Board framework also requires that “standard errors be calculated and reported for all reported statistics” and encourages the use of a single recognized expert consultant for this technically complex process. The Board further recommends audit and evaluation procedures for all aspects of the survey, including participation rates, attrition, and absentee followup.

More recently, a technical standards and guidelines document was published by the International Association for the Evaluation of Educational Achievement (IEA) (Martin, Rust, & Adams, 1999). These standards include (among others) standards for drawing a sample, for minimizing burden and nonresponse, for developing sampling weights, and for reporting sampling and nonsampling errors, reinforcing the principles in the BICSE framework. There is a strong emphasis on documenting all steps of sampling and data collection and on submitting a written record for evaluating each survey.

Sample selection guidelines specify that replacements for nonparticipating schools should be identified when the school sample is drawn. Guidelines for minimizing response burden and nonresponse emphasize simplicity and reasonable approaches to working with respondents; minimum acceptable response rates are not specified. Weighting guidelines require base weights derived from the selection probability and adjustments for nonresponse, applied at each stage of sample selection. Procedures for trimming outlier weights are recommended to control the impact of unusually large weights. The guidelines require calculation of standard errors, coefficients of variation, or confidence intervals that reflect the complexities of the sampling design, and data files and documentation should permit proper calculation of sampling errors. Participation rates at each sampling stage and across all stages should be reported, as well as other measures that indicate potential nonsampling error.
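These weighting steps can be illustrated with a minimal sketch. The fragment below is not code from any IEA study: the function names, the single adjustment class, and the particular trimming rule (capping at a multiple of the mean weight) are assumptions made here for concreteness.

```python
# Illustrative sketch of the weighting steps described above: base weights
# from selection probabilities, a nonresponse adjustment that transfers the
# weight of nonrespondents to respondents in the same adjustment class, and
# trimming of outlier weights. The trimming cap is an assumed rule.

def base_weight(selection_prob):
    """Base weight is the reciprocal of the selection probability."""
    return 1.0 / selection_prob

def nonresponse_adjust(weights, responded):
    """Scale respondent weights up so they carry the full weighted total
    of the adjustment class; nonrespondents get weight zero."""
    class_total = sum(weights)
    respondent_total = sum(w for w, r in zip(weights, responded) if r)
    factor = class_total / respondent_total
    return [w * factor if r else 0.0 for w, r in zip(weights, responded)]

def trim(weights, cap_multiple=3.0):
    """Cap weights at cap_multiple times the mean positive weight
    (a common rule of thumb, not a prescribed standard)."""
    positive = [w for w in weights if w > 0]
    cap = cap_multiple * sum(positive) / len(positive)
    return [min(w, cap) for w in weights]

# Four schools with known selection probabilities; the third school refuses.
weights = [base_weight(p) for p in (0.05, 0.10, 0.02, 0.05)]
weights = nonresponse_adjust(weights, [True, True, False, True])
weights = trim(weights)
```

A production system would form many adjustment classes, repeat the adjustment at the student stage, and usually redistribute trimmed weight; the sketch shows only the core arithmetic.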

This report reviews and comments on selected comparative studies of international education, with a focus on the student component. Many of the early studies had serious problems in both process and execution, and for some, the readily available documentation was not adequate to evaluate them properly. The documentation of quality issues (e.g., the documentation of the Second International Mathematics Study) led to the development of guidelines for future studies, including the BICSE framework. During the 1990s, the processes for conducting international assessments became much better defined, and their execution has continued to improve. The remainder of this report includes sections on selected comparative studies of education completed or planned, a discussion of other general appraisals of sampling issues in international comparative studies, and a section on possible remaining or continuing issues. I will argue that opportunities exist today to refine the specified processes and that execution of designs consistent with established guidelines remains a problem in many countries, including the United States.

REVIEW OF PUBLISHED DESCRIPTIONS AND CRITIQUES

This section summarizes key points about the sample designs and their execution for 15 studies or sets of international comparative studies in education conducted since the early 1960s. The discussion is mostly descriptive and provides background for the critiques presented in subsequent sections. A theme of this section is that improved documentation of the quality (or lack of quality) of surveys is a prerequisite to any improvement in the quality of future studies.

This period also coincides with tremendous advances in computational hardware and software. Early in the era, probability sampling, simple weighting procedures, and model-based variance estimation were adequate to define a high-quality sample design by the standards of the time. With the development of computing power and specialized software, direct estimation of survey sampling errors and routine monitoring of other quality measures, including response rates, became the norm in survey practice. The availability of computers also fostered the execution of more complex sampling plans and the development of comparable sampling approaches through the use of a common set of procedures and sample selection software.

International Comparative Studies Completed Since the 1960s

Table 4-1 summarizes the participation and timeframe of the major studies described by BICSE (NRC, 1995), beginning with the First International Mathematics Study (FIMS) conducted in 1964. The Six Subjects Study was conducted over the period 1970-71. The general group of IEA science and mathematics studies includes:

- First International Mathematics Study (FIMS);
- Second International Mathematics Study (SIMS);
- First International Science Study (FISS);
- Second International Science Study (SISS); and
- Third International Mathematics and Science Study (TIMSS).

Two international assessments of science and mathematics also were coordinated by the Educational Testing Service, with sponsorship of the coordination and U.S. components by the U.S. Department of Education and the National Science Foundation:

- First International Assessment of Educational Progress (IAEP-I); and
- Second International Assessment of Educational Progress (IAEP-II).

TABLE 4-1 Selected International Comparative Studies in Education: Scope and Timing

| Sponsor | Description | Countries | Year(s) Conducted |
|---|---|---|---|
| IEA | First International Mathematics Study (FIMS) | 12 countries | 1964 |
| IEA | Six Subjects Study | | 1970-71 |
| | - Science | 19 systems | |
| | - Reading comprehension | 15 countries | |
| | - Literature | 10 countries | |
| | - French as a foreign language | 8 countries | |
| | - English as a foreign language | 10 countries | |
| | - Civic Education | 10 countries | |
| IEA | First International Science Study (FISS) (part of Six Subjects Study) | 19 systems | 1970-71 |
| IEA | Second International Mathematics Study (SIMS) | 10 countries | 1982 |
| IEA | Second International Science Study (SISS) | 19 systems | 1983-84 |
| ETS | First International Assessment of Educational Progress (IAEP-I, Mathematics and Science) | 6 countries (12 systems) | 1988 |
| ETS | Second International Assessment of Educational Progress (IAEP-II, Mathematics and Science) | 20 countries | 1991 |
| IEA | Reading Literacy (RL) | 32 countries | 1990-91 |
| IEA | Computers in Education | 22 countries | 1988-89 |
| | | 12 countries | 1991-92 |
| Statistics Canada | International Adult Literacy Survey (IALS) | 7 countries | 1994 |
| IEA | Preprimary Project | | |
| | - Phase I | 11 countries | 1989-91 |
| | - Phase II | 15 countries | 1991-93 |
| | - Phase III (longitudinal followup of Phase II sample) | 15 countries | 1994-96 |
| IEA | Language Education Study | 25 interested countries | 1997 |
| IEA | Third International Mathematics and Science Study (TIMSS) | | |
| | - Phase I | 45 countries | 1994-95 |
| | - Phase II (TIMSS-R) | About 40 countries | 1997-98 |
| IEA | Civic Education Study | 28 countries | 1999 |
| OECD | Program for International Student Assessment | 32 countries | 2000 (reading), 2003 (mathematics), 2006 (science) |

The IAEP studies were designed to take advantage of the established procedures and instruments of the U.S. National Assessment of Educational Progress (NAEP).

Most of the studies shown in Table 4-1 address enrolled student populations. The Reading Literacy study provides an example outside the science and mathematics arena. The Adult Literacy study provides an example of a study of the general household population, which requires a household sample design as opposed to a school-based sample design. In addition, we examine plans for the Organization for Economic Cooperation and Development (OECD) Program for International Student Assessment (PISA) 2000.

In building Table 4-1, the numbers of participating countries sometimes disagreed among sources, because some sources were written at the planning stage and others reflect actual experience; the counting of systems, parts of countries, and whole countries also caused confusion. Where possible, actual experience is reflected. The table is provided to give an overview of the wide variety of studies in various stages of completion or planning. Data are sketchy for the early studies because of the passage of time and, for the most recent studies, because of the author's inability to locate completed reports.

Medrich and Griffith (1992) described and evaluated five international studies through the late 1980s: FIMS, FISS, SIMS, SISS, and IAEP-I. Their work is the primary source used here to review sampling issues for three of these studies. For SIMS, the report by Garden (1987) provides the most direct source, and the discussion of IAEP-I is supplemented by Lapointe, Mead, and Phillips (1989). More recent studies that will be discussed include IAEP-II, TIMSS, the Civic Education Study, and PISA.

First International Mathematics Study

FIMS was conducted in the mid-1960s in 12 countries. Two target populations were defined:

- Students at the grade level at which the majority of pupils were age 13 (11 educational systems).
- Students in the last year of secondary education (10 educational systems).

In the United States, these populations corresponded to grades 8 and 12. Two- or three-stage probability samples were used, with school districts (an optional stage used in the United States), schools, and students comprising the sampling stages. Multiple test forms were used. Medrich and Griffith (1992, p. 13) note that data on sample design details and response rates were largely unavailable in published sources, and the total sample was small in some of the countries with the highest means. Individual country reports may have contained this information.

Peaker's (1967) discussion of sampling issues makes a persuasive argument for probability sampling and explains the impact of intraclass correlation and cluster size decisions on the equivalent simple random sample size (often called the effective sample size). Approximation methods are developed for relating the true variance of estimates to variance estimates computed under the assumption of simple random sampling. The use of subsamples to generate a simple measure of sampling error also is discussed. Peaker presents data on achieved sample sizes by population studied, but does not present data on school or student response rates. The concept of an international sampling referee was already in place for FIMS, and Peaker served in this capacity.
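Peaker's equivalent simple random sample size rests on the standard design-effect approximation: for an average cluster size m and intraclass correlation rho, deff = 1 + (m - 1) * rho, and the effective sample size is the actual sample size divided by deff. A minimal sketch, with invented numbers:

```python
# Effective (equivalent simple random) sample size under cluster sampling,
# using the standard approximation deff = 1 + (m - 1) * rho. The inputs
# below are invented for illustration, not values from FIMS.

def design_effect(avg_cluster_size, rho):
    return 1.0 + (avg_cluster_size - 1.0) * rho

def effective_sample_size(n, avg_cluster_size, rho):
    return n / design_effect(avg_cluster_size, rho)

# 4,000 students tested in intact clusters of 25 with rho = 0.2 carry
# roughly the information of 690 simple-random students.
print(effective_sample_size(4000, 25, 0.2))  # about 689.7
```

This is why a seemingly large clustered sample can support far less precision than its raw size suggests, which is the point of Peaker's argument.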

First International Science Study

FISS was conducted in 1970-71 as part of the Six Subjects Study in 19 educational systems; not all of them participated in all target populations or reported results in the achievement component of the study. Target populations were:

- Students at age 10.
- Students at age 14.
- Students in the last year of secondary education.

Two test versions were used at ages 10 and 14, and three versions by subject (biology, chemistry, and physics) were used for the third population (Medrich & Griffith, 1992, p. 14).

Sample designs involved either two or three stages of sampling. An international referee approved each country's plan, although no IEA funds were available to monitor the sampling programs. Medrich and Griffith noted a few particular problems:

- At least three countries excluded students who were one or more years behind in grade for their age.
- Two countries excluded states or schools based on language.
- One country excluded students attending vocational schools.
- One country limited the sample to the area around its capital.
- Some countries sampled 10- and 14-year-olds by grade rather than age because of difficulty or cost.

Countries agreed to limit sampling to students enrolled in school. Medrich and Griffith note the controversy that arose over the impact of retention rates on estimates for the “last year of secondary education” population.

Response rates were reported for most countries. For the age 14 sample, 18 systems reported school response rates ranging from 34 to 100 percent and student response rates ranging from 22 to 98 percent. Ten of the 18 had school response rates exceeding 85 percent; only six of 18 had student response rates exceeding 85 percent.1

Second International Mathematics Study

SIMS was conducted in 1982 in 10 countries. Two target populations were defined for SIMS:

- Population A: Students in the modal grade for 13-year-olds when age is determined in the middle of the school year.
- Population B: Students in the terminal grade of secondary education who are studying mathematics as a substantial part of their academic program (approximately five hours per week).

Each country had to restate the definition in terms specific to its own situation and to identify any exclusions. Countries could make some judgments about whether the grade tested identified students who had been exposed to the mathematics curriculum covered in the test. Sample designs generally involved a one- or two-stage PPS sample of schools, with sampling of one or two intact classes per school. Multiple test forms were used. For Population A, all students completed a core set of items and one of four other tests. For Population B, each student was administered two out of a set of eight tests (Medrich & Griffith, 1992, p. 16).
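The one- or two-stage PPS school samples described here are conventionally drawn by systematic selection with probability proportional to size from an ordered frame. The sketch below illustrates that general technique with invented data; it is not the SIMS selection code, and a production design would handle schools larger than the sampling interval as certainty selections.

```python
import random

# Systematic PPS (probability proportional to size) selection of schools:
# a single random start and a fixed skip applied to the cumulated school
# sizes. Frame contents and the sample size are invented for illustration.

def pps_systematic(frame, n_sample):
    """frame: list of (school_id, size) in frame order.
    Returns the schools whose cumulative-size interval contains one of
    the n_sample equally spaced selection points."""
    total = sum(size for _, size in frame)
    skip = total / n_sample
    start = random.uniform(0, skip)
    points = [start + k * skip for k in range(n_sample)]
    selected, cum, i = [], 0.0, 0
    for school, size in frame:
        cum += size
        while i < n_sample and points[i] < cum:
            selected.append(school)
            i += 1
    return selected

frame = [(f"school_{j}", random.randint(50, 600)) for j in range(200)]
sample = pps_systematic(frame, n_sample=20)
```

Ordering the frame (e.g., by region or school type) before selection supplies the implicit stratification that the systematic method exploits.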

A cross-sectional sample was required for the international study, but individual countries had the option to conduct pretests and posttests during the same school year to measure the impact of the academic program.

An excellent evaluation of sampling procedures for SIMS was prepared by Garden (1987) of the New Zealand Department of Education. Before discussing some of the problems identified in his report, let me quote some remarks from his conclusions:

    Given the administrational challenges involved, both at international and at national level[s], and the difficulties of communication across cultures by correspondence the quality of the data collected is extraordinarily good. Most National Centers had little funding for the project and National Research Coordinators in many cases undertook national supervision of the project with minimal resources and with a minimal time allowance. (p. 138)

This conditional summary, although positive, certainly allows for improvement. He also states:

    There is no simple answer to the question “Is country X’s sample so poor that the data cannot be used?” If there was such an answer it would be “No” for all samples in the study. (p. 138)

He points out that the data must be evaluated in conjunction with information about the sample and other aspects of the study. SIMS had an international sampling manual (a copy is appended to the Garden report) and an international sampling committee. Some examples of situations related to the Population A sample that occurred in some countries are cited in the Garden (1987) report:

- An unknown number of schools used judgment rather than random sampling of classes.
- Simple random sampling of students was used rather than selection of intact classes. (Note that this would be an acceptable, perhaps better, alternative, but it does not conform to the international manual.)
- Private schools were excluded.
- Classes were selected directly without an intermediate school sample; this country had a very high sampling rate, making this feasible.
- Vocational education students were excluded, although they sometimes comprise 20 percent or more of the population.
- Logistic and financial constraints forced reduction of the sample geographically within the country, with a coverage reduction exceeding 10 percent.
- Small schools (fewer than 10 students in the grade) were excluded (estimated at 2 percent of the population).
- A followup sample from a previous study was used as the SIMS sample. Because of curriculum matching to the test, the country targeted a grade that contained about 10 percent of 13-year-olds and had an average age closer to 15.
- All schools were asked about willingness to participate, and about a third agreed. All but two of these were invited to participate, resulting in an essentially self-selected school sample.
- Target populations were limited by the language of instruction in several countries, sometimes amounting to a substantial (but unspecified) portion of the total.

The Population B definition required considerable judgment by each country involved. In many cases, the population defined consisted of less than 10 percent of the age cohort. In many countries, the age cohort coverage was not stated. Most coverage problems were defined away by the population definition.

Garden notes problems in computing response rates. The definition of response rates was problematic; rates were usually computed as the achieved sample compared with the executed sample. Although Garden's summary shows that 12 of 20 systems achieved response rates exceeding 90 percent and only two systems were below a 70-percent response rate, his examination of the country reports leaves some doubt about whether these reports account for the overall response rate when considering both school and student nonresponse, or whether the executed student sample size could even be determined. In some cases, the achieved sample exceeds the designed sample and no information is provided on the executed student sample size. How substitute schools count in the computation of response rates also is not clear. In the United States, a large sample of districts was drawn in advance in anticipation of a low response rate; 48 percent of districts participated. In addition, only 69 percent of selected schools participated. Finally, 77 percent of selected students (the executed student sample) participated. If substitution were used, similar results could occur but be masked in the rate calculation process.

In an attempt to identify sources of bias, Garden examined student sample distributions by gender, student age, father's occupation, teacher judgment about class rank, and other variables where comparisons with official statistics were feasible. Occupation was not coded consistently, and comparisons of fathers of 13-year-olds with all males in the official statistics were not necessarily comparable. The mean age for the selected grade was often much higher than 13.5 (16.7 at the time of testing in one country); some increase was expected because the population was defined at midyear and tested late in the year. The use of principal or de-

Horvitz (1992) addresses improving the quality of international education surveys and suggests the Deming philosophy for quality improvement, along with cooperative methodological experiments built into ongoing cross-national surveys, as a means to determine effective ways of reducing all types of survey error.

Medrich and Griffith (1992) discuss the completed mathematics and science studies sponsored by IEA and IAEP. They note:

    The surveys have not achieved the high degrees of statistical reliability across age groups sampled and among all of the participating countries. Thus, from a statistical point of view, there is considerable uncertainty as to the magnitude of measured differences in achievement. Inconsistencies in sample design and sampling procedures, the nature of the samples and their outcomes, and other problems have undermined data quality. (p. viii)

Nevertheless, they believe that these surveys have value and that the challenge is to improve their quality in the future. TIMSS shows improvement in consistent definition of comparable groups across countries, but as documented earlier in this chapter, a small minority of the 42 countries involved in TIMSS still took exception to recommended target population definitions, and some countries provided only partial information.

Goldstein (1995) reviews sampling and other issues in the IEA and IAEP studies conducted prior to TIMSS. He advocates consideration of longitudinal studies beyond those conducted as an option within a single academic year in some countries' science and mathematics studies. He believes that age cohorts might provide a better study definition for longitudinal followup purposes. He also sees age definition as a possible solution to defining a target population near the final year of secondary education among students attending different types of educational institutions. Although sampling procedures for these studies involved standard applications of sample survey methodology, he notes the difficulty of ensuring uniformity across countries. He also notes problems associated with restricted sampling frames and general nonresponse, both of which exhibit considerable variation across countries in the studies reviewed. He advocates obtaining characteristics of nonresponding schools and students and publishing comparisons of these characteristics for respondents and nonrespondents. He discusses the impacts of length of time in school, age, and grade and how they jointly influence achievement under different systems of education due to promotion policies or other factors; he notes that little effort has been devoted to age and grade standardization of results prior to publication.

Postlethwaite (1999) presents an excellent discussion of sampling issues relating to international studies of educational achievement. He reviews population definition issues from both policy and methodological viewpoints, paying particular attention to age versus grade definitions.

He also addresses guidelines for setting precision requirements, sample selection methodology, weighting of data and adjustment for nonresponse, standards for accepting or flagging low-quality surveys, and a general checklist for evaluating the sample design and the resulting data. He also points out that the issue of defining the populations to be tested is an “educational and political” decision. From a sample design perspective, we can leave the question to the education experts and policy makers. Their decisions, however, do affect the sampling and data collection operations. In some countries, students of the same age can be spread across several grades. Postlethwaite cites U.S. data showing 13-year-olds spread across grades 2 through 11, with most in grade eight and nearly 89 percent of enrolled students in grade eight or grade nine; this has serious implications for complete population coverage in the sampling frame when selecting samples defined by age. The grade-by-age distribution creates a more serious problem for test developers.

Quite specific standards and guidelines are provided by Martin et al. (1999) for IEA studies, as discussed in the introduction of this chapter. Their heavy focus on documentation is particularly noteworthy because it sets the stage for improvement of current procedures.

REMAINING ISSUES

It is a challenge to add to the critiques already presented and to offer any new thoughts. Lessons learned in the early studies have been applied to the design of more recent studies. Information about what actually occurred is more consistently organized and (currently) accessible for recent studies, particularly SIMS and TIMSS. An understanding of what has happened in prior studies is a fundamental requirement for improving future studies. The documentation and reporting procedures used in TIMSS and planned for PISA are excellent, but still leave room for new ideas.

The BICSE framework (NRC, 1990) and the IEA technical standards (Martin et al., 1999) now provide guidance for sample design and survey implementation. What more can be done? Certainly we need to make practice more consistent with plans. In addition, as we get better compliance with the prescribed sampling process, we need to examine the prescribed process itself in light of achievable current practices. I will address issues in the following areas:

- Population definitions.
- Sampling frame completeness.
- Designing the sample.
- Executing the sample design.
- Response rates.
- Other nonsampling errors.
- Annotation of published results.

Population Definitions

The issue of age versus grade definition has been discussed thoroughly by other authors, but remains an issue in recent studies. TIMSS used a two-grade span to define populations close to age nine or age 13 for its Populations 1 and 2, and the final year of secondary education to define Population 3. The plans for PISA call for using age 15, rather than a grade concept, to define persons near the end of secondary schooling but at an age where schooling is still largely universal. So different approaches to the same concept continue to be applied. The plans for PISA are much more thorough in describing what is meant by a person still in the country's education system, including part-time students, students in vocational training, and students in foreign schools within the country, as well as students in full-time academic study programs. This is not a sample design decision (as noted by Postlethwaite, 1999), but it has serious implications for defining the sampling frame and selecting the student sample.

The population also can be defined in the time dimension. Allowances have been made for testing at different times in the southern and northern hemispheres in recognition of different starting times for the normal academic year. Recent trends, at least in the United States, include moves to year-round schooling and home schooling. Timing an assessment within a short window could arbitrarily exclude a significant portion of students enrolled in year-round schools who have a break at a nontraditional period. Students schooled at home may be considered part of the educational system because they must obtain some exemption from required attendance in a formal school, but no known effort is made to test such students. At a minimum, countries (including the United States) need to quantify the extent of these alternate practices so that the exclusions from the target population defined in two dimensions (type of school enrollment and time of testing) can be better understood. Some allowance for alternate testing times could be effective in covering students in year-round schools; covering students participating in home schooling would be more challenging.

The population definition also includes the definition of exclusions for disability, language, or other reasons. Excellent guidelines have been developed for exclusions of both schools and students within schools and for documenting these exclusions in international studies.
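The requirement to document the size of excluded subgroups can be made concrete with a small sketch. Everything in it is invented for illustration: the birthdate window standing in for an age definition (roughly in the spirit of PISA's 15-year-olds tested in 2000) and the exclusion flags.

```python
from datetime import date

# Sketch of a two-part scope check for an age-defined population: first an
# age-eligibility filter expressed as a birthdate window, then a tally of
# within-scope exclusions (disability, language, etc.) so the exclusion
# rate can be documented, as the guidelines require. All records invented.

window_start, window_end = date(1984, 1, 1), date(1984, 12, 31)

students = [
    {"id": 1, "birthdate": date(1984, 5, 2),   "excluded": False},
    {"id": 2, "birthdate": date(1984, 11, 20), "excluded": True},   # language
    {"id": 3, "birthdate": date(1983, 12, 30), "excluded": False},  # too old
    {"id": 4, "birthdate": date(1984, 2, 14),  "excluded": False},
]

age_eligible = [s for s in students
                if window_start <= s["birthdate"] <= window_end]
in_scope = [s for s in age_eligible if not s["excluded"]]

exclusion_rate = 1 - len(in_scope) / len(age_eligible)
print(f"age-eligible: {len(age_eligible)}, excluded: {exclusion_rate:.1%}")
```

Reporting the excluded share alongside the results, country by country, is what allows readers to judge whether coverage differences threaten comparability.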

Some countries have continued to exclude geographic groups or major language-of-instruction groups for cost or political reasons. When cost is the only issue, stratification and sampling at a lower rate in the high-cost stratum might provide an alternative to arbitrary exclusion. Recent U.S. experience in state NAEP has identified potentially serious problems in implementing comparable exclusion rates across states with the development of new guidelines for accommodation (Mazzeo, Carlson, Voelkl, & Lutkus, 2000). The international standards discussed do not address accommodation for disability or language; we might anticipate additional complications in international assessments if similar accommodation policies are more broadly implemented in other countries or applied to the international assessment samples in the United States.

Sampling Frame Completeness

Sampling frame completeness can only be evaluated relative to the target population definition. Exclusion of schools for inappropriate reasons could be viewed as either a population definition issue or a sampling frame incompleteness issue, depending on the intended population of inference for the analysis. School sampling frames often are developed several months before the actual survey implementation. To the extent that the population is defined as of the survey date, procedures may be required to update the sampling frame for newly opened schools or other additions to the eligible school population (such as changes in the grade range taught) occurring since the school sample was selected. It is not clear that this has been attempted in any of the international studies reviewed. False positives in the sampling frame (schools thought to be eligible at the time of sampling that turn out not to be eligible) can be handled analytically by treating the eligible schools as a subpopulation and are less of a problem.

When the target population of schools includes both public and private schools as well as vocational education schools, the development of a complete school frame may become more difficult. Quality controls could be incorporated into advance data collection activities or into the survey process itself to check the completeness of the school sampling frame on a subsample basis, perhaps defined geographically. When sampling by age group, all schools that could potentially have students in the defined age range should be included in the sampling frame. If any arbitrary cutoffs are used to avoid schools with projected very low enrollments for an age group, these also should be checked for excluded schools on at least a sample basis. The feasibility of using arbitrary cutoffs to exclude a small proportion of age-eligible students depends on the dispersion of age-eligible students across grades.

The recent guidelines for developing student sampling frames differ depending on whether the population is defined by grade or by age. Age-defined population sampling requires listing all students in the school who meet the age (or birthdate range) definition; generally, the sample is then drawn as a simple random sample of students. For populations defined by grade, the sampling frame often is developed from a “complete” list of classrooms. When the focus is on a particular subject (e.g., mathematics or science), classrooms may be limited to the subject matter being studied. The classroom approach carries a potential problem: it can exclude students who are not currently enrolled in a target subject matter class at all or who are enrolled in a subject matter class at a different grade level. We may need to be more specific in defining what is meant by grade; that is, is it defined based on overall academic progress or only on progress in the subject being tested?

After the grade definition is resolved, the ideal approach would be to list all grade-eligible students (just as we list all age-eligible students). Then, if direct student sampling is prescribed, a simple random sample of grade-eligible students could be selected. If for logistical reasons a classroom sample is preferred, the list could be partitioned into classrooms and a sample of classrooms then selected. Any student not clearly associated with the type of classroom defined for administration purposes could be arbitrarily assigned to one of the classrooms before the classroom sample is selected, then tested with that classroom if it is selected.

Designing the Sample

Beyond the technical details of constructing complete sampling frames, the sample should be designed to provide the required precision for a minimum cost. Optimizing a sample design to meet precision requirements requires a reasonable model of the variance as a function of controllable sample design parameters (typically, the number of schools and the number of students per school). The variance models used in the guidance documents for TIMSS and PISA incorporate the clustering effect in terms of an assumed intracluster correlation coefficient. Empirical studies show wide variation in this population parameter; it is correctly noted that a large clustering effect is more likely with classroom sampling than with direct student sampling. Other population and sample characteristics also should be incorporated into the variance model, including stratification effects, unequal weighting effects, and expected cluster size variability. Stratification can be highly effective in reducing the school component of variance; intraclass correlation coefficients computed within strata are likely to be much smaller than those computed ignoring the strata.
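The planning calculation implied by this variance model can be sketched by inverting the design-effect approximation used earlier. All numeric inputs below are invented planning assumptions, not values from the TIMSS or PISA guidance documents.

```python
import math

# Required number of schools for a target standard error of a mean score,
# under the simple variance model var = (sd**2 / n) * deff with
# deff = 1 + (m - 1) * rho for m students per school. Stratification and
# unequal-weighting effects, which the text notes also belong in the
# model, are omitted here for brevity.

def schools_required(target_se, sd, rho, students_per_school):
    deff = 1.0 + (students_per_school - 1.0) * rho
    n_srs = (sd / target_se) ** 2        # simple random sample size needed
    n_students = n_srs * deff            # inflated for clustering
    return math.ceil(n_students / students_per_school)

# Target SE of 2 score points, sd = 100, rho = 0.2, 30 students per school:
print(schools_required(2.0, 100.0, 0.2, 30))  # 567 schools
```

Because the design effect grows linearly with cluster size, taking fewer students from more schools is usually the route to precision, cost permitting.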

More data analysis may be required to develop estimates of these values based on prior experience. The correct specification of the variance model is essential to the development of cost-effective sample designs that satisfy the study's precision requirements.

As pointed out by Olkin and Searls (1985), the sample design needs to be consistent with the intended analysis. If two subjects are being assessed in two different samples of students within each school and separate estimates are to be made for each subject, then the average cluster size in the variance model should be based on the school-subject sample size and not on the total school sample size; the same principle might apply to subtest scores. Procedures that simultaneously control modeled variances for several different estimates for the same defined population also can be implemented. This is another area where data need to be accumulated in a systematic manner across countries. The availability of microdata with the sample structure for strata, schools, classrooms, and students clearly labeled would make possible the estimation of the required sample design parameters in a consistent manner. These microdata sets also would provide a valuable resource for studying effective sample designs consistent with different analytic objectives.

Executing the Sample Design

With the development of procedures that include guidance from a respected national statistical organization and the resolution of particular issues by a similarly respected sampling referee, the execution of the sample design has not been and should not be a serious problem. The documentation of procedures following TIMSS or PISA guidelines and forms also helps guarantee correct implementation. These procedures may have room for improvement based on further experience, but must be viewed as excellent.

Two areas of sample design execution relate to dealing with initial nonresponse at the school and student levels. Substitution for nonresponding schools has been an allowed practice in most of the international assessments. Although substitution does not eliminate bias due to nonresponse, it does maintain the sample size required to control sampling error. If used with careful matching to the nonrespondent schools, the substitution method also can limit the extent of the potential bias introduced by nonresponding schools. A draft of the PISA sampling manual (OECD, 1999b) provides a reasonable way to implement substitution by identifying two schools in the ordered school sampling frame as potential replacements for each nonresponding school: the schools immediately preceding and immediately following the selected school (if they are not themselves selected schools).
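A minimal sketch of that neighbor-substitute rule, assuming the frame is held as a list in its sampling order. The priority given to the following school over the preceding one is an assumption made here; the manual's exact ordering is not quoted above.

```python
# PISA-style substitute assignment as described above: for each selected
# school, the schools immediately before and after it on the ordered frame
# are designated as potential replacements, skipping any that were
# themselves selected. Frame and selections are invented.

def assign_substitutes(frame_order, selected):
    """frame_order: school ids in frame order; selected: set of ids.
    Returns a dict mapping each selected school to its candidate
    substitutes in (assumed) priority order."""
    position = {school: i for i, school in enumerate(frame_order)}
    substitutes = {}
    for school in selected:
        i = position[school]
        candidates = []
        for j in (i + 1, i - 1):  # following school first: an assumption
            if 0 <= j < len(frame_order) and frame_order[j] not in selected:
                candidates.append(frame_order[j])
        substitutes[school] = candidates
    return substitutes

frame = [f"s{i}" for i in range(10)]
print(assign_substitutes(frame, {"s2", "s3", "s7"}))
# s2 -> ['s1'], s3 -> ['s4'], s7 -> ['s8', 's6']
```

Because neighbors on a well-ordered frame resemble the selected school, the substitutes inherit the implicit stratification of the design, which is the rationale the manual relies on.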

If the list is ordered by the administrative structure of the school system (e.g., by districts), it is likely that near neighbors in the list also might be nonrespondents; close matching or ordering on other school characteristics, however, may be quite effective in supplying replacements with similar characteristics that are not prejudiced by their neighbor's unwillingness to respond.

The practice of selecting substitutes for nonresponding schools needs further review. Different approaches are favored by different applied statisticians. Clearly, no method can totally eliminate the bias due to nonresponse; all methods simply try to maintain the respondent sample size. If possible, empirical studies of alternative approaches should be developed, conducted, and reviewed by a panel of survey experts to determine whether current substitution practices are the most appropriate ones for international comparative studies. The empirical research might simply be based on simulating nonresponse in completed studies from one or more countries.

The practice of routinely scheduling followup sessions for absent or unavailable students whenever response rates fall below set but relatively high levels within schools should be continued and formalized.5

Response Rates

The TIMSS response criteria specified 85 percent for schools and for students, or a combined rate exceeding 75 percent; the combined rate is the product of the stage-level rates, as the sketch at the end of this discussion illustrates. These criteria were used for flagging the results. The draft PISA sampling manual specified an 85-percent response rate for schools and 80 percent for students. Are these criteria generally achievable?

School response rates appear to be the more serious problem, particularly in the United States. The 1996 main NAEP did not achieve an 85-percent school participation rate for all session types at grades four and eight and did not achieve an 80-percent school participation rate for any session type at grade 12 (Wallace & Rust, 1999). The NAEP survey would be expected to be exemplary among studies undertaken in the United States.

Setting standards is not the solution to the problem of school nonresponse. Studies need to be undertaken in the United States and other countries to better understand why the problem exists. Data on this topic most likely exist, but need to be organized and reviewed to formulate better approaches. In the United States, many different surveys compete for testing-schedule time in the schools; assessments are carried out at the national, state, and local levels. International assessments just add to the burden. When studies are planned independently and samples are drawn independently, the chance of overburdening some schools is predictable and may contribute to poor participation in all of the studies. Most large school districts are asked to participate in all of these studies.
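The arithmetic behind the combined-rate criterion is simple multiplication across stages. The sketch below applies a simplified reading of the TIMSS-style criteria to the U.S. SIMS participation figures quoted earlier (districts, schools, students); the criteria as written address the school and student stages, so treating three stages this way is an assumption made here for illustration.

```python
# Combined (overall) participation rate as the product of stage-level
# rates, with a simplified version of the TIMSS-style check: either every
# stage meets 85 percent or the combined rate exceeds 75 percent.

def combined_rate(stage_rates):
    rate = 1.0
    for r in stage_rates:
        rate *= r
    return rate

def meets_criteria(stage_rates, per_stage=0.85, combined=0.75):
    return (all(r >= per_stage for r in stage_rates)
            or combined_rate(stage_rates) >= combined)

us_sims = (0.48, 0.69, 0.77)  # districts, schools, students (SIMS, U.S.)
print(f"combined rate: {combined_rate(us_sims):.1%}")  # 25.5%
print(meets_criteria(us_sims))                         # False
```

The multiplication is what makes seemingly moderate stage-level shortfalls compound into a respondent pool representing only a quarter of the intended sample.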

Once we better understand the school perspective on survey participation, we can develop strategies, including possible coordination among studies, to encourage participation while simultaneously limiting the burden on any particular school. Nevertheless, these strategies should be considered as possible options for improving the precision of estimates.

The other theme that has been relatively constant across all the critiques reviewed is the lack of resources for truly thorough study execution. Proper planning and the scheduling of the advance contacts required to obtain good study participation from schools require both time and adequate funding. These additional resources should be applied intelligently, based on what is learned about the reasons for school nonresponse.

The methods of adjusting analytic weights for nonresponse also should be reviewed. Many noneducation surveys standardize their estimates or poststratify to known population distributions (e.g., by age, race, and gender) or to distributions estimated from larger surveys. This is particularly difficult to do with the population of students enrolled in a country's educational system because the population is constantly changing and good enrollment data for the time of testing are difficult to obtain from other sources. If multiple forms are used or if more than one subject is assessed in a given year, the combined sample might provide a better estimate for standardizing the individual estimates developed by subject or by objective within subject. These types of methods add complexity to the weight development process and must be applied with good judgment.
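Poststratification itself is mechanically simple; as the passage notes, the difficulty lies in obtaining trustworthy control totals. A minimal sketch with invented strata and enrollment totals:

```python
from collections import defaultdict

# Poststratification sketch: respondent weights within each poststratum
# (here, grade by gender) are scaled so they sum to an assumed known
# enrollment total. Strata, totals, and weights are all invented.

controls = {("8", "F"): 5000, ("8", "M"): 5200}  # assumed enrollment totals

respondents = [
    {"stratum": ("8", "F"), "weight": 240.0},
    {"stratum": ("8", "F"), "weight": 260.0},
    {"stratum": ("8", "M"), "weight": 500.0},
]

weight_sums = defaultdict(float)
for r in respondents:
    weight_sums[r["stratum"]] += r["weight"]

for r in respondents:
    r["weight"] *= controls[r["stratum"]] / weight_sums[r["stratum"]]

# The ("8", "F") weights now sum to 5000 and the ("8", "M") weight to 5200.
```

When the control totals come from a larger survey rather than a census count, their own sampling error carries into the standardized estimates, which is one reason the text counsels good judgment.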

Other Nonsampling Errors

Frame errors and response errors have been discussed. This leaves measurement error. The conditions present at testing, the correlated errors induced by the behavior of test administrators, data processing, and other factors all can contribute to measurement error. Sample and survey design can help control such errors, but monitoring to identify and measure such sources of error is essential in deciding whether the cost of revised procedures is necessary or justified. As noted by Horvitz (1992), cooperative methodological experiments could be extremely valuable in identifying and reducing measurement error.

Annotation of Published Results

As data users become more sophisticated, they expect to be informed about the strengths and weaknesses of the statistical results. The flagging of results depending on participation rates employed by TIMSS is a good example of a way to warn users about data quality based on something other than sampling error. The development of technical reports addressing quality control can be of use to the data professional (e.g., the SIMS and TIMSS reports) and is strongly endorsed.

CONCLUSIONS

The sampling and survey design and execution of recent and planned international comparative studies have benefited greatly from analysis of the results of earlier studies. Challenges remain. We should anticipate that the cutting-edge technology of today will not necessarily be viewed favorably 10, 15, or 20 years from now. Our views today about the studies completed prior to 1980 might seem unfair to those who conducted those studies using the cutting-edge approaches of their times.

Just as the BICSE guidelines suggest focused studies to interpret differences in educational achievement, we need focused studies to understand and interpret the differences in background conditions and in the survey methodologies that are feasible in different countries. This applies particularly to conditions in the educational system and how they should influence the definition of the desired target populations. The concept of the final year of secondary education remains vague, especially in countries with alternative academic tracks; procedures designed to avoid double counting in the final-year population may be creating undercoverage of populations defined by subject matter specialization. These types of problems have solutions that begin with a clear understanding of the study requirements.

Longitudinal studies have not been a major focus of the studies reviewed, but have been a country option in some of them. The value of longitudinal measurements versus repeated cross-sectional measurements needs to be evaluated in terms of educational objectives and the types of country comparisons that are useful in evaluating the achievement of those objectives.

Finally, the focus on meeting tough standards for coverage and response rates should not lead us to solve the problem by defining it away. For example, there is always a temptation to simply rule that an excluded portion of a study population is not really part of the population of interest. This ruling immediately increases coverage measures in all participating countries, but may totally destroy the comparability of results across countries. It would be better to relax the standards somewhat (and continue to monitor them) than to essentially ignore them by defining them away.

NOTES

1. Table A.7 (p. 55) of Medrich and Griffith (1992), based on data from Peaker (1975).
2. Smaller values might have been obtained if the effects of stratification and unequal weighting had been removed before calculating the intraclass correlation; design effects would remain the same because these factors would need to be put back into the model for total variance.
3. Medrich and Griffith cite data obtained from IEA (1988).
4. The focus here is on populations defined by age group. Retaining fewer students in school leads to excluding the poor performers, which then leads to higher average scores for those remaining in school.
5. For PISA, these procedures are available in the National Project Managers Manual, but they were not reviewed for this chapter.

REFERENCES

Cochran, W. G. (1977). Sampling techniques. New York: John Wiley & Sons.

Deming, W. E. (1950). Some theory of sampling. New York: Dover.

Foy, P., Martin, M. O., & Kelly, D. L. (1996). Sampling. In M. O. Martin & I. V. S. Mullis (Eds.), Third International Mathematics and Science Study: Quality assurance in data collection (pp. 2-1 to 2-23). Chestnut Hill, MA: Boston College.

Foy, P., Rust, K., & Schleicher, A. (1996). Sample design. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development. Chestnut Hill, MA: Boston College.

Garden, R. A. (1987). Second IEA Mathematics Study: Sampling report. Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Goldstein, H. (1995). Interpreting international comparisons of student achievement. Paris: UNESCO.

Hansen, M. H., Hurwitz, W. N., & Madow, W. G. (1953). Sample survey methods and theory. New York: John Wiley & Sons.

Horvitz, D. (1992). Improving the quality of international education surveys (draft). Prepared for the Board on International Comparative Studies in Education.

International Association for the Evaluation of Educational Achievement (IEA). (1988). Student achievement in seventeen countries. Oxford, England: Pergamon Press.

Lapointe, A. E., Askew, J. M., & Mead, N. A. (1991). Learning science: The Second International Assessment of Educational Progress. Princeton, NJ: Educational Testing Service.

Lapointe, A. E., Mead, N. A., & Phillips, G. W. (1989). A world of differences. Princeton, NJ: Educational Testing Service.

Lessler, J. T., & Kalsbeek, W. D. (1992). Nonsampling error in surveys. New York: John Wiley & Sons.

Martin, M. O. (1996). Third International Mathematics and Science Study: An overview. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development. Chestnut Hill, MA: Boston College.

Martin, M. O., Rust, K., & Adams, R. J. (1999). Technical standards for IEA studies. Amsterdam: International Association for the Evaluation of Educational Achievement.

Mazzeo, J., Carlson, J. E., Voelkl, K. E., & Lutkus, A. D. (2000). Increasing the participation of special needs students in NAEP: A report on 1996 NAEP research activities. Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Medrich, E. A., & Griffith, J. E. (1992). International mathematics and science assessments: What have we learned? Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.

National Research Council. (1985). Summary report of conference on October 16-17, 1985 (draft). Committee on National Statistics, Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.

National Research Council. (1990). A framework and principles for international comparative studies in education (N. M. Bradburn & D. M. Gilford, Eds.). Board on International Comparative Studies in Education, Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.

National Research Council. (1995). International comparative studies in education: Descriptions of selected large-scale assessments and case studies. Board on International Comparative Studies in Education, Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.

Olkin, I., & Searls, D. T. (1985). Statistical aspects of international assessments of science education. Paper presented at the Conference on Statistical Standards for International Assessments in Precollege Science and Mathematics, Washington, DC.

Organization for Economic Cooperation and Development (OECD). (1999a). Measuring student knowledge and skills: A new framework for assessment. Paris: Author.

Organization for Economic Cooperation and Development (OECD). (1999b). PISA sampling manual, main study version 1. Paris: Author.

Peaker, G. F. (1967). Sampling. In T. Husén (Ed.), International study of achievement in mathematics: A comparison of twelve countries (pp. 147-162). New York: John Wiley & Sons.

Peaker, G. F. (1975). An empirical study of education in twenty-one countries: A technical report. New York: John Wiley & Sons.

Postlethwaite, T. N. (1999). International studies of educational achievement: Methodological issues. Hong Kong: University of Hong Kong.

Schleicher, A., & Siniscalco, M. T. (1996). Field operations. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development. Chestnut Hill, MA: Boston College.

Suter, L. E., & Phillips, G. (1989). Comments on sampling procedures for the U.S. sample of the Second International Mathematics Study. Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Wallace, L., & Rust, K. F. (1999). Sample design. In N. L. Allen, D. L. Kline, & C. A. Zelenak (Eds.), The NAEP 1994 technical report (pp. 69-86). Washington, DC: National Library of Education.