Suggested Citation:"4. Sampling Issues in Design, Conduct, and Interpretation of International Comparative Studies of School Achievement." National Research Council. 2002. Methodological Advances in Cross-National Surveys of Educational Achievement. Washington, DC: The National Academies Press. doi: 10.17226/10322.

4
Sampling Issues in Design, Conduct, and Interpretation of International Comparative Studies of School Achievement

James R. Chromy*

Cochran (1977) outlines eleven steps in the planning of a survey, and good sampling methods must work within the context of all of them. The steps are (1) a statement of the survey objectives, (2) the definition of the population to be sampled, (3) the data to be collected, (4) the degree of precision required, (5) the methods of measurement, (6) the frame or the partitioning of the population into sampling units, (7) the sample selection methods, (8) the pretest, (9) the fieldwork organization, (10) the summary and analysis of the data, and (11) a review of the entire process to see what can be learned for future surveys. Mathematically, the major concerns for sample design have focused on the sample selection procedures and the associated estimation procedures that yield precise estimates. Optimization of sample designs involves either obtaining the best possible precision for a fixed cost or minimizing survey costs subject to one or more constraints on the precision of estimates. Optimized designs sometimes are called efficient designs.

The mathematical presentation of sampling theory often focuses on obtaining efficient sample designs with precision measured in terms of sampling error only, although both Cochran (1977) and many earlier texts (e.g., Deming, 1950, or Hansen, Hurwitz, & Madow, 1953) discuss nonsampling errors in surveys. A more recent text by Lessler and Kalsbeek (1992) is devoted entirely to nonsampling errors in surveys, classified as frame error, nonresponse error, and measurement error.

Designing surveys that control both sampling errors and nonsampling errors remains a serious challenge. Sample designers also cannot avoid some of the conceptual issues in total survey design, such as defining the survey objectives, defining the target population to be measured, or limiting the resources that can be devoted to data collection. Decisions reached on these issues can lead to serious tradeoffs among sample design options.

The framework and principles document (National Research Council [NRC], 1990) of the Board on International Comparative Studies in Education (BICSE) identifies key sample design issues in the broader context just described. The objective of measuring achievement or performance to permit comparisons across school systems in different countries is clear. Explaining differences is more problematic and may require collection of additional data. Even with these additional data, the approach to analysis and interpretation of differences may be exploratory at best because there are many potential explanatory factors, and only some will be measured. When differences are observed, they properly form the basis for additional studies that would be designed to better understand the differences. The framework makes clear that the objectives of both descriptive and explanatory studies will require rigorous sampling procedures and the capacity to produce national estimates of the variables studied.

Conceptual problems of defining comparable student populations in different countries also are addressed. For students enrolled in school, the problem of defining the study population in terms of age or grade must be resolved. Problems exist with both methods because children start school at different ages, so even first graders may be five, six, or seven years old. Different countries follow different grade progression policies. At the upper grade levels, there may be a much broader representation of ages within a single grade. Different national policies about the legal age of leaving school either to drop out or to enter specialized training may alter the composition of classes completing normal schooling. The guidance document recognizes the difficulty of consistent population definition, but does not recommend one approach over another.

Survey populations also must be defined temporally. The value of national and cross-national data to meet the objectives of trend measurement and trend comparisons requires regular data collection on an established schedule. If too many cross-national studies are carried out simultaneously, both the educational process itself and the success of the surveys can be adversely affected. The administration of surveys disrupts the educational process in the schools involved in the survey. Schools requested to participate in several surveys (national, cross-national, and others) may be less likely to participate in any of them or may have to select among them. Consequently, school response rates will suffer.

The BICSE framework provides several principles for sampling and access to schools for both descriptive and explanatory studies:

Samples must be drawn from the full population of teachers, administrators, students (at an age or a grade), or policy makers.
Valid estimation of population parameters requires strict adherence to an explicit sample design.
Plans should discuss the frame and the approach to selecting the sample.
Planned exclusions of subgroups (the disabled or persons who do not speak the language in which the test is administered) must be documented. Information should be provided about the size of the population subgroup excluded and the direction of bias due to the exclusion.
The extent of participation in education may create differences in the student populations in different countries.
The sample design should support reasonably accurate inferences about an age or grade cohort and capture the existing range of individual, school, and classroom variations.
Within-country subpopulations may be defined.
The samples for the total population and for any subpopulations must be explicitly delineated.
An international sampling manual is essential.
The board encourages the appointment of an experienced and expert sampling consultant to review and approve all country samples before testing takes place.
The achieved sample design is usually different from the planned sample design.
Advance arrangements should be made with school officials to ensure high participation rates. A maximum acceptable nonresponse rate should be specified for inclusion of a country’s data in the international analyses.
Subnational units that have separate autonomous school systems may be included in international studies.
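
The principle on planned exclusions asks for the size of the excluded subgroup and the direction of the resulting bias. The arithmetic behind that request can be sketched as follows. The function name, the excluded share, and the subgroup means are hypothetical illustrations and do not come from the BICSE framework or from any study discussed here.

```python
def exclusion_bias(excluded_share, mean_covered, mean_excluded):
    """Bias of the covered-population mean relative to the full-population mean.

    full mean = (1 - p) * mean_covered + p * mean_excluded, so
    bias      = mean_covered - full mean = p * (mean_covered - mean_excluded).
    """
    if not 0.0 <= excluded_share <= 1.0:
        raise ValueError("excluded_share must be between 0 and 1")
    return excluded_share * (mean_covered - mean_excluded)


# Hypothetical values: 5 percent of the cohort excluded, and the excluded
# students assumed to score lower than the covered students.
bias = exclusion_bias(0.05, mean_covered=500.0, mean_excluded=440.0)
print(f"bias of covered mean: {bias:+.1f} score points")
```

A positive result means the reported mean overstates the mean of the full cohort. In practice the excluded group's mean is unknown, which is why the framework asks at least for the direction of the bias.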

The BICSE framework also specifies test administration procedures to control the measurement error component. These include standardized procedures over time and across nations, pilot testing in each participating country, and a meeting with study coordinators between the pilot study and the full-scale study to review procedures and adjust them if necessary. The report also recommends (ideally) that “suitably trained [test administrators] from outside the [school system] be in charge of test administration” and that “people from different countries . . . supervise the implementation of the procedures to be followed (previously agreed on by the countries involved) by being present on site when the field work is conducted” (NRC, 1990, p. 9).

The Board framework also requires that “standard errors be calculated and reported for all reported statistics” and encourages the use of a single recognized expert consultant for this technically complex process. It further recommends audit and evaluation procedures for all aspects of the survey, including participation rates, attrition, and absentee followup.

More recently, a technical standards and guidelines document was published by the International Association for the Evaluation of Educational Achievement (IEA) (Martin, Rust, & Adams, 1999). These standards include (among others) standards for drawing a sample, for minimizing burden and nonresponse, for developing sampling weights, and for reporting sampling and nonsampling errors, and they reinforce the principles in the BICSE framework. There is a strong emphasis on documentation of all steps of sampling and data collection and the submission of a written record for evaluating each survey.

Sample selection guidelines specify that replacements for nonparticipating schools should be identified when the school sample is drawn. Guidelines for minimizing response burden and nonresponse emphasize simplicity and reasonable approaches to working with respondents. Minimum acceptable response rates are not specified. Weighting guidelines require use of base weights based on the selection probability and adjustments for nonresponse. Nonresponse adjustments should be applied at each stage of sample selection. Procedures for trimming outlier weights are recommended to control the impact of unusually large weights. The guidelines require calculation of standard errors, coefficients of variation, or confidence intervals based on the complexities of the sampling design. Data files and documentation should permit proper calculation of sampling errors. Participation rates at each sampling stage and across all stages should be reported as well as other measures that indicate potential nonsampling error.
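
The weighting steps named in these guidelines (base weights from selection probabilities, nonresponse adjustment, and trimming) can be sketched as below. This is a minimal illustration for a single adjustment cell at a single sampling stage, not the procedure prescribed by the IEA document. The function names, the example probabilities, and the trimming rule (a cap at three times the median weight) are assumptions made for the example.

```python
from statistics import median


def base_weight(selection_probability):
    """Base weight: the inverse of a unit's overall selection probability."""
    if not 0.0 < selection_probability <= 1.0:
        raise ValueError("selection probability must be in (0, 1]")
    return 1.0 / selection_probability


def adjust_for_nonresponse(weights, responded):
    """Within one adjustment cell, inflate respondents' weights so that they
    carry the total weight of all selected units; nonrespondents get zero."""
    total = sum(weights)
    respondent_total = sum(w for w, r in zip(weights, responded) if r)
    if respondent_total == 0:
        raise ValueError("no respondents in this adjustment cell")
    factor = total / respondent_total
    return [w * factor if r else 0.0 for w, r in zip(weights, responded)]


def trim_weights(weights, cap_ratio=3.0):
    """Cap positive weights at cap_ratio times their median (an assumed rule),
    then rescale so the weight total is unchanged. This is a single pass, so
    a rescaled weight can still end up above the cap."""
    positive = [w for w in weights if w > 0]
    cap = cap_ratio * median(positive)
    capped = [min(w, cap) for w in weights]
    scale = sum(weights) / sum(capped)
    return [w * scale for w in capped]


# Hypothetical cell of four sampled students; the second did not respond.
probabilities = [0.10, 0.10, 0.05, 0.05]
weights = [base_weight(p) for p in probabilities]
adjusted = adjust_for_nonresponse(weights, [True, False, True, True])
trimmed = trim_weights(adjusted)
print(weights, adjusted, trimmed)
```

Because the guidelines call for nonresponse adjustments at each stage of selection, a full implementation would apply a step like `adjust_for_nonresponse` once for schools and again for students within schools.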

This report reviews and comments on selected comparative studies of international education with a focus on the student component. Many of the early studies had serious problems in both the process and the execution. For some, the easily available documentation was not adequate to properly evaluate them. The documentation of quality issues (e.g., documentation of the Second International Study of Mathematics) led to the development of guidelines for future studies, including the BICSE framework. During the 1990s, the processes for conducting international assessments became much better defined, and the execution has continued to improve. The remainder of this report includes sections on selected comparative studies of education completed or planned, a discussion of other general appraisals of sampling issues in international comparative studies, and a section on possible remaining or continuing issues. I will argue that opportunities exist today to refine the specified processes and that execution of designs consistent with established guidelines remains a problem in many countries, including the United States.

REVIEW OF PUBLISHED DESCRIPTIONS AND CRITIQUES

This report section summarizes key points about the sample designs and their execution for 15 studies or sets of international comparative studies in education conducted since the early 1960s. The discussion in this section is mostly descriptive and provides background for the critiques presented in subsequent sections. A theme of this section is that improved documentation of the quality (or lack of quality) of surveys is a prerequisite to achieving any improvement in the quality of future studies. This period of time also coincides with tremendous advances in computational hardware and software. Early in this era, probability sampling, simple weighting procedures, and model-based variance estimation were adequate to define a high-quality sample design by standards of the time. With development of computing power and specialized software, direct estimation of survey sampling errors and the ability to routinely monitor other quality measures, including response rates, became the norm in survey practice. The availability of computers also fostered the execution of more complex sampling plans and the development of comparable sampling approaches through the use of a common set of procedures and sample selection software.

International Comparative Studies Completed Since the 1960s

Table 4-1 summarizes the participation and timeframe of the major studies described by BICSE (NRC, 1995), beginning with the First International Mathematics Study (FIMS) conducted in 1964. The Six Subjects Study was conducted over the period of 1970-71. The general group of IEA science and mathematics studies includes:

First International Mathematics Study (FIMS);
Second International Mathematics Study (SIMS);
First International Science Study (FISS);
Second International Science Study (SISS); and
Third International Mathematics and Science Study (TIMSS).

Two international assessments of science and mathematics also were coordinated by the Educational Testing Service, with sponsorship of the coordination and U.S. components by the U.S. Department of Education and the National Science Foundation:

TABLE 4-1 Selected International Comparative Studies in Education: Scope and Timing

Sponsor	Description	Countries	Year(s) Conducted
IEA	First International Mathematics Study (FIMS)	12 countries	1964
IEA	Six Subjects Study:		1970-71
	Science	19 systems
	Reading comprehension	15 countries
	Literature	10 countries
	French as a foreign language	8 countries
	English as a foreign language	10 countries
	Civic Education	10 countries
IEA	First International Science Study (FISS) (part of Six Subjects Study)	19 systems	1970-71
IEA	Second International Mathematics Study (SIMS)	10 countries	1982
IEA	Second International Science Study (SISS)	19 systems	1983-84
ETS	First International Assessment of Educational Progress (IAEP-I, Mathematics and Science)	6 countries (12 systems)	1988
ETS	Second International Assessment of Educational Progress (IAEP-II, Mathematics and Science)	20 countries	1991
IEA	Reading Literacy (RL)	32 countries	1990-91
IEA	Computers in Education	22 countries	1988-89
		12 countries	1991-92
Statistics Canada	International Adult Literacy Survey (IALS)	7 countries	1994
IEA	Preprimary Project
	Phase I	11 countries	1989-91
	Phase II	15 countries	1991-93
	Phase III (longitudinal followup of Phase II sample)	15 countries	1994-96
IEA	Language Education Study	25 interested countries	1997
IEA	Third International Mathematics and Science Study (TIMSS)
	Phase I	45 countries	1994-95
	Phase II (TIMSS-R)	About 40	1997-98
IEA	Civic Education Study	28 countries	1999
OECD	Program for International Student Assessment (PISA)	32 countries	2000 (reading), 2003 (mathematics), 2006 (science)

First International Assessment of Educational Progress (IAEP-I); and
Second International Assessment of Educational Progress (IAEP-II).

The IAEP studies were designed to take advantage of the established procedures and instruments from the U.S. National Assessment of Educational Progress (NAEP).

Most of the studies shown in Table 4-1 address enrolled student populations. The Reading Literacy study provides an example outside the science and mathematics arena. The Adult Literacy study provides an example of a study of the general household population, which requires a household sample design as opposed to a school-based sample design. In addition, we examine plans for the Organization for Economic Cooperation and Development (OECD) Program for International Student Assessment (PISA) 2000.

In building Table 4-1, the numbers of participating countries sometimes disagreed among sources because some sources are written at the planning stages and some reflect actual experience; counting of systems, parts of countries, and whole countries also caused confusion. Where possible the actual experience is reflected. The table is provided to give an overview of the wide variety of studies in various stages of completion or planning. Data are sketchy for the early studies due to passage of time and, for the most recent studies, due to the author’s inability to locate completed reports.

Medrich and Griffith (1992) described and evaluated five international studies through the late 1980s:

FIMS;
FISS;
SIMS;
SISS; and
IAEP-I.

Their work is the primary source used to review sampling issues for three of these studies. For SIMS, the report by Garden (1987) provides the most direct source. The discussion of IAEP-I is supplemented by Lapointe, Mead, and Phillips (1989).

In addition, more recent studies that will be discussed include:

IAEP-II;
TIMSS;
Civic Education Study; and
PISA.

First International Mathematics Study

FIMS was conducted in the mid-1960s in 12 countries. Two target populations were defined:

Students at the grade level at which the majority of pupils were age 13 (11 educational systems).
Students in the last year of secondary education (10 educational systems).

In the United States, these populations corresponded to grades 8 and 12.

Two- or three-stage probability samples were used, with school districts (optional stage used in the United States), schools, and students comprising the sampling stages. Multiple test forms were utilized. Medrich and Griffith (1992, p. 13) note that data on the sample design details and response rates were largely unavailable in published sources, and the total sample was small in some of the countries with the highest means. Individual country reports may have contained this information. Peaker’s (1967) discussion on sampling issues makes a persuasive argument for probability sampling and explains the impact of the intraclass correlation and cluster size decisions on the equivalent simple random sample size (often called effective sample size). Approximation methods are developed for relating the true variance of estimates to variance estimates developed under the assumption of simple random sampling. The use of subsamples to generate a simple measure of sampling error also is discussed. Peaker presents data on achieved sample sizes by population studied, but does not present data on school or student response rates. The concept of an international sampling referee was already in place for FIMS, and Peaker served in this capacity.
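
The effect of intraclass correlation and cluster size on the effective sample size can be illustrated with the common approximation in which the design effect equals 1 + r(m - 1), the same expression used later in this chapter for the Suter and Phillips comparison. The sketch below is not Peaker's own derivation. The function names and the example values (3,000 students, 30 per school, r = 0.2) are hypothetical.

```python
def design_effect(cluster_size, rho):
    """Approximate design effect for equal-sized clusters: 1 + rho * (m - 1)."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return 1.0 + rho * (cluster_size - 1)


def effective_sample_size(n_students, cluster_size, rho):
    """Size of the simple random sample that would give the same precision."""
    return n_students / design_effect(cluster_size, rho)


# Hypothetical design: 3,000 students tested, 30 per school, rho = 0.2.
n_eff = effective_sample_size(3000, 30, 0.2)
print(f"effective sample size: {n_eff:.0f}")
```

Under these assumed values the 3,000 tested students carry roughly the information of a simple random sample of a few hundred, which is why the choice of cluster size matters as much as the total number of students.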

First International Science Study

FISS was conducted in 1970-71 as part of the Six Subjects Study in 19 educational systems; not all of them participated in all target populations or reported results in the achievement component of the study. Target populations were:

Students at age 10.
Students at age 14.
Students in the last year of secondary education.

Two test versions were used at ages 10 and 14, and three versions by subject (biology, chemistry, and physics) were used for the third population (Medrich & Griffith, 1992, p. 14).

Sample designs involved either two or three stages of sampling. An international referee approved each country’s plan, although no IEA funds were available to monitor the sampling programs. Medrich and Griffith noted a few particular problems:

At least three countries excluded students who were one or more years behind in grade for their age.
Two countries excluded states or schools based on language.
One country excluded students attending vocational schools.
One country limited the sample to the area around its capital.
Some countries sampled 10- and 14-year-olds by grade rather than age because of difficulty or cost.

Countries agreed to limit sampling to students enrolled in school. Medrich and Griffith note the controversy that arose over the impact of retention rates on estimates for the “last year of secondary education” population.

Response rates were reported from most countries. For the age 14 sample, 18 systems reported school response rates ranging from 34 to 100 percent and student response rates ranging from 22 to 98 percent. Ten of the 18 had school response rates exceeding 85 percent; only six of 18 had student response rates exceeding 85 percent.¹

Second International Mathematics Study

SIMS was conducted in 1982 in 10 countries. Two target populations were defined for SIMS:

Population A: Students in the modal grade for 13-year-olds when age is determined in the middle of the school year.
Population B: Students in the terminal grade of secondary education who are studying mathematics as a substantial part of their academic program (approximately five hours per week).

Each country had to restate the definition in terms specific to their own situation and to identify any exclusions. Countries could make some judgments about whether the grade tested identified students who had been exposed to the mathematics curriculum covered in the test.

Sample designs generally involved a one- or two-stage probability proportional to size (PPS) sample of schools, with sampling of one or two intact classes per school. Multiple test forms were used. For Population A, all students completed a core set of items and one of four other tests. For Population B, each student was administered two out of a set of eight tests (Medrich & Griffith, 1992, p. 16). A cross-sectional sample was required for the international study, but individual countries had the option to conduct pretests and posttests during the same school year to measure the impact of the academic program.
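
One common way to draw a PPS sample of schools is systematic selection along the cumulated measure of size. The sketch below illustrates that general technique. It is not taken from the SIMS sampling manual, and the function name and enrollment figures are hypothetical.

```python
import random


def pps_systematic(sizes, n, rng):
    """Systematic PPS selection: return the indices of n selected units.

    A random start is taken in the first sampling interval, and selection
    points are then spaced one interval apart along the cumulated sizes.
    A unit larger than the interval can be hit more than once.
    """
    interval = sum(sizes) / n
    start = rng.random() * interval
    points = [start + k * interval for k in range(n)]

    selected = []
    cumulated = 0.0
    next_point = 0
    for index, size in enumerate(sizes):
        cumulated += size
        while next_point < n and points[next_point] < cumulated:
            selected.append(index)
            next_point += 1
    return selected


# Hypothetical frame: enrollment in the target grade for six schools.
enrollment = [120, 80, 300, 50, 350, 200]
sample = pps_systematic(enrollment, 3, random.Random(1))
print(sample)
```

With this scheme a school's selection probability is n times its enrollment divided by total enrollment, so larger schools are more likely to be drawn. Sorting the frame by region or school type before selection adds implicit stratification.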

An excellent evaluation of sampling procedures for SIMS was prepared by Garden (1987) of the New Zealand Department of Education. Before discussing some of the problems identified in his report, let me quote some remarks from his conclusions:

Given the administrational challenges involved, both at international and at national level[s], and the difficulties of communication across cultures by correspondence the quality of the data collected is extraordinarily good. Most National Centers had little funding for the project and National Research Coordinators in many cases undertook national supervision of the project with minimal resources and with a minimal time allowance. (p. 138)

This conditional summary, although positive, certainly allows for improvement. He also states:

There is no simple answer to the question “Is country X’s sample so poor that the data cannot be used?” If there was such an answer it would be “No” for all samples in the study. (p. 138)

He points out that the data must be evaluated in conjunction with information about the sample and other aspects of the study.

SIMS had an international sampling manual (copy appended to the Garden report) and an international sampling committee.

Some examples of situations related to the Population A sample that occurred in some countries are cited in the Garden (1987) report:

An unknown number of schools used judgment rather than random sampling of classes.
Simple random sampling of students was used rather than selection of intact classes. (Note that this would be an acceptable, perhaps better, alternative, but it does not conform to the international manual.)
Private schools were excluded.
Classes were selected directly without an intermediate school sample; this country had a very high sampling rate, making this feasible.
Vocational education students were excluded, although they sometimes comprise 20 percent or more of the population.
Logistic and financial constraints forced reduction of the sample geographically within the country, with a coverage reduction exceeding 10 percent.
Small (fewer than 10 students in grade) schools were excluded (estimated at 2 percent of the population).
A followup to a previous study was used as the SIMS sample.
Because of curriculum matching to the test, the country targeted a grade that contained about 10 percent of 13-year-olds and had an average age closer to 15.
All schools were asked about willingness to participate and about a third agreed. All but two of these were invited to participate, resulting in an essentially self-selected school sample.
Target populations were limited by the language of instruction in several countries, sometimes amounting to a substantial (but unspecified) portion of the total.

The Population B definition required considerable judgment by each country involved. In many cases, the population defined consisted of less than 10 percent of the age cohort. In many countries, the age cohort coverage was not stated. Most coverage problems were defined away by the population definition.

Garden notes problems in computing response rates. The definition of response rates was problematic; rates were usually computed as the ratio of the achieved sample to the executed sample. Although Garden’s summary shows that 12 of 20 systems achieved response rates exceeding 90 percent and only two systems were below a 70-percent response rate, his examination of the country reports leaves some doubt about whether these reports account for the overall response rate when considering both school and student nonresponse or whether the executed student sample size could be determined. In some cases, the achieved sample exceeds the designed sample and no information is provided on the executed student sample size. How substitute schools count in the computation of response rates also is not clear. In the United States, a large sample of districts was drawn in advance in anticipation of a low response rate; 48 percent of districts participated. In addition, only 69 percent of schools selected participated. Finally, 77 percent of students selected (the executed student sample) participated. If substitution were used in other countries, similar results could occur but be masked in the rate calculation process.
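
The cumulative effect of the three U.S. participation rates just quoted can be made concrete by multiplying them. This is a simple unweighted product offered as an illustration. The function name is hypothetical, and the SIMS reports may have computed or weighted rates differently.

```python
def overall_response_rate(stage_rates):
    """Multiply stage-level participation rates into one overall rate."""
    overall = 1.0
    for rate in stage_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each rate must be between 0 and 1")
        overall *= rate
    return overall


# U.S. SIMS figures quoted above: districts, schools, students.
us_sims = overall_response_rate([0.48, 0.69, 0.77])
print(f"overall response rate: {us_sims:.1%}")  # about 25.5%
```

A student-level rate of 77 percent reported alone would therefore give a very different impression from the roughly one-in-four overall participation implied by all three stages.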

In an attempt to identify sources of bias, Garden examined student sample distributions by gender, student age, father’s occupation, teacher judgment about class rank, and other variables where comparisons with official statistics were feasible. Occupation was not coded consistently, and fathers of 13-year-olds were not necessarily comparable with all males in the official statistics. The mean age for the selected grade was often much higher than 13.5 (16.7 at time of testing in one country); some increase was expected because the population was defined at midyear and tested late in the year. The use of principal or department head judgment for selecting class samples was identified as a possible source of upward bias for that country.

The tests consisted of a core form plus rotated forms (four for Population A and eight for Population B). Ideally, each student would complete the core form plus two rotated forms, and with an appropriate rotation scheme, the sample of students would be divided equally across all possible pairs of rotated forms. This approach allows for estimation of basic statistics plus the study of relationships among all items (e.g., latent trait analyses). In two countries, rotation schemes did not conform to the desired pattern, but still permitted estimation of population means.
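
The balanced rotation described here, in which students are spread equally across all possible pairs of rotated forms, can be sketched as follows. This is an illustration of the idea rather than the SIMS rotation scheme itself. The function name and the example of 56 students are hypothetical, and an operational scheme would typically randomize or spiral the assignment within classes.

```python
from collections import Counter
from itertools import combinations, cycle


def assign_form_pairs(student_ids, n_rotated_forms):
    """Cycle through every unordered pair of rotated forms so that pair
    counts differ by at most one across the students supplied."""
    pairs = list(combinations(range(1, n_rotated_forms + 1), 2))
    return dict(zip(student_ids, cycle(pairs)))


# Eight rotated forms (as for Population B) give 28 possible pairs.
assignment = assign_form_pairs(range(56), 8)
pair_counts = Counter(assignment.values())
print(len(pair_counts), set(pair_counts.values()))
```

Because every pair of forms is taken by some students, every pair of items appears together for part of the sample, which is what makes analyses of relationships among all items possible.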

The SIMS samples were designed to be self-weighting under ideal execution and perfect school and student response conditions. All but two countries computed weights for their samples.

Sampling error estimates were computed for the core tests and for one form for Population A and two forms for Population B. Design effects and intraclass correlation coefficients also were estimated. Intact class sampling was thought to contribute to higher than expected intraclass coefficients (median of about .4); wide differences among schools within systems also were identified as possible causes.

Suter and Phillips (1989) examined U.S. components of several international studies, with emphasis on SIMS. They concluded that the “response rate to the U.S. SIMS was lower than would be expected for an important national survey that would be used to draw important policy conclusions” (p. 23). They also noted some departures from the estimated distributions by gender, region, and race when compared to other national estimates. With regard to the estimates, their paper “found no evidence that the results of the IEA Second International Mathematics Study would lead to grossly misleading interpretations about the status of U.S. achievement of eighth grade students when compared with other countries” (p. 23). They also examined design effects for five studies (FIMS, SIMS, FISS, SISS, and IAEP-I) and computed intraclass correlation coefficients, r, at the school level based on equating the design effect to 1 + r(m – 1), where m is the average number of students sampled per school. These estimates of the intraclass correlation coefficient were quite high, ranging from about 0.15 to in excess of 0.50.²
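
Equating the design effect to 1 + r(m - 1) and solving for r gives r = (design effect - 1) / (m - 1). A minimal sketch of that calculation follows. The function name and the input values are hypothetical and are not figures from Suter and Phillips.

```python
def intraclass_from_design_effect(deff, cluster_size):
    """Solve deff = 1 + r * (m - 1) for r, given the average cluster size m."""
    if cluster_size <= 1:
        raise ValueError("cluster_size must exceed 1 to identify r")
    return (deff - 1.0) / (cluster_size - 1.0)


# Hypothetical inputs: a design effect of 7 with 25 students per school.
r = intraclass_from_design_effect(7.0, 25)
print(f"implied intraclass correlation: {r:.2f}")
```

An r obtained this way absorbs every source of variance inflation in the design effect, including unequal weights, so it can overstate the clustering of achievement within schools.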

Second International Science Study

SISS was conducted in 1983-84 in 17 countries. Target populations were defined as:

Population 1: Either all 10-year-old students or all students in the grade in which most 10-year-olds are enrolled (typically grades four or five).
Population 2: Either all 14-year-old students or all students in the grade in which most 14-year-olds are enrolled (typically grades eight or nine).
Population 3: Students in the last year of secondary education. Students in the last year of secondary education had additional subpopulations defined as:
Population 3B: Students studying biology for examination purposes.
Population 3C: Students studying chemistry for examination purposes.
Population 3P: Students studying physics for examination purposes.
Population 3N: Students not studying a science subject in the test year.

Of 15 countries where Population 1 was tested, six tested at grade four, eight at grade five, and one at grades four, five, and six. Of 17 countries where Population 2 was tested, eight tested at grade eight, 10 at grade nine, one at grades nine and 10, and one at grades eight, nine, and 10. For Population 3, which was tested in 13 countries, the mean age ranged from 17 years, 3 months to 19 years. The percentage enrolled in school was reported at between 15 and 90 percent; some of the low-percentage enrollments were associated with students shifting to the vocational track (IEA, 1988).

Populations 1 and 2 received a set of core items plus one of four randomly assigned sets of items. Population 3 students were tested on core items plus subject-specific tests.

Two- or three-stage samples were utilized depending on the need for an initial geographic or district-based stage of sampling in large countries.

Particular problems cited by Medrich and Griffith (1992) included:

The Population 3 student sample was extremely difficult to draw, and only about half the countries were able to provide complete information on the sampling steps.
Subject matter subsamples were often extremely small.
Population exclusions were significant. Less developed countries had high levels of exclusion.
Enrollment in the sciences varied dramatically.

The U.S. sample suffered serious nonresponse. A new sample was drawn in 1986 for development of the official U.S. estimates.

The U.S. sample design incorporated the selection of a first-stage sample of districts about twice as large as needed to achieve the required sample. After district nonresponse of about one-half, the target sample sizes were achieved, but the bias associated with selective response (self-selection within a large sample) was not resolved.

Response rates were documented for SISS. Seventeen countries reported response rates ranging from 60 to 100 percent for schools and from 53 to 100 percent for students. Twelve countries achieved school response rates exceeding 85 percent; 11 countries achieved student response rates exceeding 85 percent.³

Olkin and Searls (1985) also provide a discussion on statistical aspects of the first two IEA science assessments and note problems with the response rates in these early studies. Their focus is on the U.S. component and their comments on SISS relate to the 1986 U.S. survey.

The more complete documentation of procedures, problems, and quality outcomes for studies completed in the early 1980s (SIMS and SISS) identified the need for better definitions of target populations, for more thorough specification of sampling and other procedures, for consistent measurement of response rates using accepted definitions, and for improved monitoring procedures.

First International Assessment of Educational Progress

IAEP-I was conducted in February 1988 in six countries (12 educational systems) and focused on mathematics and science. The U.S. study was conducted from January through mid-March. This study was modeled on the U.S. NAEP. The study target population was persons born in 1974 (ages 13 years, 1 month to 14 years, 1 month at time of testing). Two test booklets (one in each subject) were administered.

A two- or three-stage sample design was employed consisting of 50 pairs of schools (100 schools), with a sample of 20 students per school. Schools were to be selected with probability proportional to estimated size, and simple random samples of eligible students were selected in sample schools.

School response rates ranged from 70 to 100 percent; student response rates ranged from 73 to 97 percent. Eleven out of 12 systems achieved an 85-percent school response rate and eleven achieved an 85-percent student response rate for the science test (Lapointe et al., 1989).

Second International Assessment of Educational Progress

IAEP-II was conducted in 20 countries, focusing primarily on mathematics and science. The assessment was conducted in March 1991 in most of the countries; in three countries whose school year starts in March, the assessment was conducted in September 1990. Target populations were defined by age (year of birth):

Population 1: Nine-year-olds (born in 1981).
Population 2: Thirteen-year-olds (born in 1977).

The core assessment involved a science booklet and a mathematics booklet; selected students were administered one or the other. Countries could supplement the core Population 2 assessment with an additional block of geography questions and with performance-based assessment of ability to use equipment and materials to solve mathematics and science problems.

The sample design called for a representative sample of 3,300 students (1,650 per age group) from about 110 schools. Both public and private elementary and secondary schools were included. A two-stage stratified probability sampling design was used in most cases, with PPS sampling of schools and systematic sampling of students.

Manuals and software were provided for sampling. In addition, countries had the option of having their sample selected by Westat, Inc.; five countries exercised this option and most of the others used the prescribed design and software. Alternatives to the prescribed design were required to be reviewed and approved by Westat. Three-stage sampling was used in two countries; in one country, students were sampled using classrooms as sampling units.

Nine assessments were countrywide, with coverage at age 13 ranging from 93 to 100 percent. Eleven assessments involved parts of countries, with country coverage ranging from 3 to 96 percent at age 13 (one country did not report percentage of coverage).

Response rates were reported for all but one of the 20 countries. For the age 13 sample, the school response rate ranged from 77 to 100 percent, with 16 out of 20 countries exceeding 85 percent. Student response rates ranged from 92 to 99 percent for 19 countries reporting. Overall response rates (which factored in PSU participation in countries using three-stage designs) ranged from 48 to 99 percent, with 17 of 19 reporting countries meeting or exceeding a 75-percent response rate.

This study reported on the grade distribution within each age sample. Of 19 countries reporting these data for Population 2, the modal year of enrollment was 7 for 4 countries, 8 for 13 countries, and 9 for 2 countries. The dispersion across years varied considerably among countries.

Analysis weights were used in reporting. Sampling errors were computed using the jackknife procedure (Lapointe, Askew, & Mead, 1991).
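To illustrate the general idea of jackknife variance estimation for a weighted mean from a clustered sample, the following is a minimal sketch. It is not the procedure used in the IAEP-II reports: it assumes a simple delete-one-cluster jackknife with no stratification, and the function names and data layout are hypothetical.

```python
def weighted_mean(rows):
    """rows: list of (cluster_id, weight, score) tuples."""
    total_weight = sum(w for _, w, _ in rows)
    return sum(w * y for _, w, y in rows) / total_weight


def jackknife_se(rows):
    """Delete-one-cluster jackknife standard error of the weighted mean."""
    clusters = sorted({c for c, _, _ in rows})
    k = len(clusters)
    if k < 2:
        raise ValueError("need at least two clusters")
    full = weighted_mean(rows)
    # Recompute the estimate with each cluster (school) dropped in turn.
    replicates = [
        weighted_mean([r for r in rows if r[0] != c]) for c in clusters
    ]
    variance = (k - 1) / k * sum((t - full) ** 2 for t in replicates)
    return variance ** 0.5


# Made-up data: three schools, one student each, equal weights.
sample = [("a", 1.0, 1.0), ("b", 1.0, 2.0), ("c", 1.0, 3.0)]
print(jackknife_se(sample))
```

Dropping whole schools rather than individual students is what lets the replicate estimates reflect the between-school variation that a simple random sampling formula would miss.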

Compliance with the prescribed sample design was fostered with the provision of sampling software and technical assistance. The use of analysis weights and appropriate design-based variance estimation also was enhanced by applying the methods of the U.S. NAEP surveys.

Third International Mathematics and Science Study

TIMSS was conducted in 1994-95 in 45 countries. A repeat of the study (TIMSS-R) took place in 1998-99 in about 40 countries. The discussion here is limited to the first phase of TIMSS (1994-95). Three target populations and two optional subpopulations were defined for TIMSS:

Population 1: All students enrolled in the two adjacent grades that contain the largest proportion of 9-year-olds at the time of testing.
Population 2: All students enrolled in the two adjacent grades that contain the largest proportion of 13-year-olds at the time of testing.
Population 3: Students enrolled in their final year of secondary education.

Optional subpopulations within Population 3 were:

Students taking advanced courses in mathematics.
Students taking advanced courses in physics.

Note that the option of defining the study Populations 1 and 2 by age alone was not offered for TIMSS. Also, the grade coverage was expanded from one grade (as applied in SIMS) to two grades. The age used to identify the target grades was the standardizing factor across countries.

Population 3 definitions were refined to avoid possible double counting due to countries having multiple academic tracks or students completing the final year in more than one track at different times. The Population 3 definition was specified more precisely as “students taking the final year of one track of the secondary system for the first time.” Population 3 students would be expected to be between 15 and 19; to assess the coverage by country, the Population 3 enrollment was to be compared with official statistics on the total national population aged 15 to 19 in 1995, divided by 5. Note that students taking the final year in one track (say mathematics) a year earlier than their final year in the other track (say physics) would be eligible only for the advanced mathematics form because they no longer would be eligible for TIMSS in the year that they complete their final track in physics. This would systematically restrict the sample of students taking advanced courses in physics (or mathematics) to those who include it in their first academic track.

The rules for population exclusions of schools and students within schools were made more specific. Schools could be excluded if they:

Are located in geographically remote regions.
Have very small size (few students in target population).
Offer a curriculum, or school structure, different from the mainstream educational system.
Provide instruction only to students who meet the student exclusion criteria.

The target population enrollments in excluded schools were to be estimated.

Student exclusions within schools were limited to:

Educable mentally disabled students.
Functionally disabled students.
Nonnative language speakers.

These concepts were specified more fully for operational use. The effective target population was then defined as the defined target population (1, 2, or 3) less allowable exclusions. A criterion of limiting exclusions to 10 percent or less of the defined target populations was specified.

The sampling manual for TIMSS specifies a two-stage sample (schools and intact classes), with options for three-stage or four-stage sampling if a country opts to add geographic primary sampling units prior to selecting schools and/or if selected classes are to be subsampled. The TIMSS analytic requirements include estimates at the school and class level, so these analytic units also must be stages in the sampling process for Populations 1 and 2. The manual specifies standard minimum effective sample sizes for schools (150 schools) and students (400 students) and provides models (tables) for deciding on the nominal sample size given planned (minimum) cluster size and an assumed value of the intracluster correlation coefficient. A value of .3 is to be assumed if no prior data on intracluster correlation coefficient are available.
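The arithmetic behind such sample size tables can be sketched as follows. This is a minimal illustration, assuming the standard approximation that the design effect equals 1 + rho(m - 1) for clusters of size m, and treating the 150-school figure as a floor on the number of schools; the function name is hypothetical, and the manual's own tables may reflect additional considerations.

```python
import math


def nominal_sample(effective_n, cluster_size, rho, min_schools=150):
    """Return (design effect, nominal students, schools to sample)."""
    deff = 1.0 + rho * (cluster_size - 1)
    # Round first so floating-point noise does not push ceil() up by one.
    students = math.ceil(round(effective_n * deff, 6))
    schools = max(min_schools, math.ceil(students / cluster_size))
    return deff, students, schools


# Effective sample of 400 students, one class of 25 per school, rho = 0.3.
deff, students, schools = nominal_sample(400, 25, 0.3)
print(deff, students, schools)
```

With these inputs the cluster-based requirement works out to fewer than 150 schools, so the school minimum rather than the student precision target determines the number of schools.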

Options for stratification, handling of small schools and small classes, sampling options for designed self-weighting samples, and general detail are provided in the sample design specifications. Procedures for identifying replacement schools in the context of systematic PPS sampling also are specified.
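Systematic PPS selection of schools can be sketched as below. This is a generic textbook version, not the TIMSS sampling software: the function name and frame layout are hypothetical, the measure of size is assumed to be estimated enrollment in the target grades, and the handling of very large schools and of replacement schools is omitted.

```python
import random


def systematic_pps(frame, n, rng=None):
    """Select n schools with probability proportional to size.

    frame: list of (school_id, measure_of_size), already sorted in the
    order implied by any implicit stratification.
    A school whose size exceeds the sampling interval can be hit more
    than once; real designs usually treat such schools as certainties.
    """
    rng = rng or random.Random()
    total = sum(size for _, size in frame)
    interval = total / n
    start = rng.random() * interval  # random start in [0, interval)
    targets = [start + i * interval for i in range(n)]

    selected, cumulative, t = [], 0.0, 0
    for school_id, size in frame:
        cumulative += size
        # Select the school whose cumulative size range contains a target.
        while t < n and targets[t] < cumulative:
            selected.append(school_id)
            t += 1
    return selected


# Made-up frame of 10 schools with sizes 50, 60, ..., 140.
frame = [(f"S{i}", 50 + 10 * i) for i in range(10)]
print(systematic_pps(frame, 3, random.Random(1)))
```

If a fixed number of students or one intact class is then taken in each selected school, overall selection probabilities are approximately equal across students, which is the sense in which such a design is self-weighting.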

Consistent formulas for weighted and unweighted response rates at the school and student levels are provided. Standards are specified as 85-percent response for each component and 75-percent response overall (Foy, Rust, & Schleicher, 1996).

Within-school sampling procedures were specified for each population. Populations 1 and 2 were to be sampled by class, with each selected student matched to his or her mathematics and science teacher. In schools where some students were assigned to different class groups for science and mathematics, the mathematics groupings were used for forming the sampling units and special student-to-teacher matching procedures were required to identify all teachers involved in teaching science courses to the selected students. Special tracking forms were used to identify the teacher-student matches.

Because Population 3 involved a general population of all students in the final year of secondary education and two subpopulations based on enrollment in advanced courses in mathematics and science, it was sometimes necessary to partition the eligible students into as many as four groups:

Those enrolled in both advanced science and advanced mathematics courses.
Those enrolled in advanced mathematics only.
Those enrolled in advanced science only.
Those enrolled in neither advanced mathematics nor advanced science.

Because there was no analytic need to obtain teacher data related to each Population 3 student, simple random sampling from each of the four groups was the preferred sample selection procedure (Schleicher & Siniscalco, 1996).

The survey administration dates for TIMSS were set near the end of the school year. In the northern hemisphere, the prescribed dates were February to May 1995. In the southern hemisphere, Populations 1 and 2 were to be tested from September to November 1994 and Population 3 in August 1995 (Martin, 1996).

One of the major improvements implemented with TIMSS was the systematic collection of quantitative and descriptive information on the implementation of the sample design through a standard set of forms and reporting procedures. Submitted forms were reviewed and archived by Statistics Canada at the various stages of sample implementation. Foy, Martin, and Kelly (1996) use these archived data to evaluate the implementation of TIMSS sampling procedures in the participating countries. They conclude that the reporting and review process had a positive effect on the quality of the sampling enterprise and that most participants did an excellent job of carrying out their sampling tasks. Most countries were able to provide all of the requested information or sufficient information to certify the methods employed. Irregularities and exceptions to specified procedures were identified in a consistent manner and used to flag the data when reported.

TIMSS had an international sampling manual as well as optional sampling software. A TIMSS international referee was appointed. Statistics Canada, working with the TIMSS Technical Advisory Committee and the TIMSS sampling referee, provided advice and support in sampling to participating countries.

Forty-two countries participated in the Population 2 TIMSS. Comments noted on the definition of target Population 2 included:

Target grades varied by state (one country).
Students in selected grades were older than expected (four countries).
Total exclusions exceeded the 10-percent criterion (one country).
Only one target grade was selected (two countries).

Partial or incomplete reports were obtained from eight countries.

All participants provided data on the design structure and stratification, but 29 countries had partial or incomplete data on at least one item for Population 2. Comments related to sampling Population 2 were:

Sampled science classrooms (mathematics classrooms were the default).
Used a school sample for upper grade vocational track.
Included all schools in the sample (four countries).
Used stratified simple random sampling of schools (PPS sampling was specified for self-weighting design) (three countries).
Sampled students rather than classrooms.
Selected classrooms with PPS (two countries).
Employed a preliminary sampling stage (two countries).

Most of these items do not invalidate the sample for most purposes, but may create more problems for development of weights or for some special analyses on a comparable basis.

With regard to within-school sample execution, 24 countries provided complete information and all provided some information. Comments noted included:

Unapproved school sampling procedures.
Unapproved classroom sampling procedures (four countries).
School sampling frame not available.
Inadequate documentation to compute sampling weights.
Nonparticipating students not recorded.

Countries reported on their coverage of the international desired population. Failure to cover the desired international population involved geographic exclusions (three countries) and exclusion of school systems by language spoken in the schools (three countries). These redefined target populations were called the national desired populations. Additional exclusions occurred by school and by students within school. For Population 2, school exclusions ranged as high as 9.6 percent; student exclusions within schools ranged as high as 2.9 percent; and overall exclusions exceeded 10 percent in only one country, at 11.3 percent.

Thirty-one of 42 countries defined their Population 2 in terms of the seventh and eighth years of formal schooling. Two countries used only one grade; the remaining countries had split policies by region or system or other variations (higher or lower years of formal schooling).

All but three countries reported on their target grade coverage of 13-year-olds. Of the remaining 39 countries, 10 had fewer 13-year-olds in the lower grade and 29 had fewer 12-year-olds in the upper grade. The combined coverage of 13-year-olds over the two grades was reported at between 45 and 100 percent of all 13-year-olds, with most countries at the high end of the range.

School participation rates (usually weighted) were reported before and after school replacement. For the upper grade schools in Population 2, rates before replacement ranged from 24 to 100 percent; after replacement they ranged from 46 to 100 percent. Most countries were able to increase their participating school sample by using replacements, particularly those with low initial school response rates; one country with a low initial school response rate increased school participation by only one school after replacement sampling.

From the selected student samples, reductions were made to account for students withdrawn from the school or class and those excluded by the student exclusion rules. Weighted participation rates were then computed based on weighted ratios of students assessed to eligible students selected. For the upper grade students in Population 2, weighted student participation rates were generally high, ranging from 83 to 100 percent.

Overall weighted participation rates also were computed for each country. Looking at the upper grade for Population 2 only, 30 out of 42 countries had overall participation rates exceeding 75 percent before replacement of any nonresponding schools. Five additional countries moved above 75 percent after allowing replacements. In the remaining seven countries, overall weighted participation rates remained below 75 percent even after replacement of nonresponding schools.

For reporting purposes, countries were classified into three categories, as shown in Table 4-2. Note that based on the overall response rates alone for the upper grade of Population 2, most countries were in Category 1. This approach allows data to be reported for all participating countries with a warning to users about the quality of the data reported.
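The abbreviated criteria in Table 4-2 can be expressed as a small decision rule. The sketch below is illustrative only: the function name is hypothetical, rates are weighted percentages, and the combined rate is assumed to be the product of the school and student rates, which the table itself does not spell out.

```python
def reporting_category(school_before, school_after, student_rate):
    """Classify a country using the abbreviated Table 4-2 criteria.

    Arguments are weighted response rates in percent (0-100): school
    response before and after replacement, and student response.
    """
    def meets_category_1(school_rate):
        # Assumption: combined rate = school rate x student rate.
        combined = school_rate * student_rate / 100.0
        return (school_rate > 85 and student_rate > 85) or combined > 75

    if meets_category_1(school_before):
        return 1
    if school_before > 50 and meets_category_1(school_after):
        return 2
    return 3


# Hypothetical countries: (before, after, student) response rates.
for rates in [(90, 90, 95), (60, 90, 95), (24, 46, 90)]:
    print(rates, reporting_category(*rates))
```

Note that a country with a very low school response rate before replacement falls into Category 3 under this rule no matter how many replacement schools participate.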

The use of standardized forms to record the steps of the sampling process and to provide data on eligibility and response helped to identify additional potential problems. Without this information given freely with no fear of retribution, there would be no basis for future improvement. The information gathered as part of the process of selecting the sample and conducting each country’s assessment also helps to identify those issues that need to be resolved in future studies in order to improve comparability. The formal documentation provided for both school and student exclusions is an example of collecting important data for future planning.

TABLE 4-2 Reporting Categories Based on Response Rates

No.	Description	Criteria (Abbreviated and Approximate)	Designation in Reports
1	Acceptable sampling participation rates without replacement schools	Before replacement of schools: School and student response rates each exceed 85 percent, or the combined rate exceeds 75 percent	Appear without notation and may be ordered by achievement as appropriate
2	Acceptable sampling participation rates only with replacement schools	Not in Category 1 before replacement, but weighted school response rate before replacement exceeds 50 percent and after replacement the response rates meet the Category 1 requirements	Annotated with a dagger in tables and figures and may be ordered by achievement as appropriate
3	Unacceptable sampling participation rates even with replacement schools	Not in Category 1 or 2	Appear in separate section of reports ordered alphabetically

Program for International Student Assessment

PISA was conducted in 2000 for reading, and is planned for 2003 for mathematics, and 2006 for science. The study is sponsored by OECD and will focus on measuring the “cumulative yield of education systems at an age where schooling is still largely universal.” Because of this focus, the target population is defined as 15-year-olds enrolled in both school-based and work-based educational programs. Between 4,500 and 10,000 students will be assessed in each country (OECD, 1999a, p. 9).

The author reviewed Version 1 of the sampling manual (OECD, 1999b). Comments here are limited to the planned approach outlined in that version; the PISA consortium plans to elaborate on or adjust some of the approaches in subsequent versions of the manual. Each country will have a National Project Manager responsible for the following areas:

Establishing age definitions.
Defining exclusions, documenting them, and keeping them to a minimum.
Developing the school sampling frame.
Identifying suitable stratification variables.
Determining school and student sample sizes consistent with PISA requirements.
Selecting the school sample (or providing the sampling frame to Westat, which will select the sample).
Maintaining records on school participation and the use of replacements.

The PISA consortium (in particular, Westat) will be responsible for reviewing all sampling procedures and providing assistance.

The target population is defined more fully as 15-year-olds in a country’s education system, including:

Part-time students.
Students in vocational training.
Students in foreign schools within the country.

Residents who attend school in a foreign country are not included.

The assessment is to be scheduled over a one-month period, not within the first three months of the academic year. Within reason, the age group should be defined in terms of birth dates so that students are between 15 years, 3 months and 16 years, 2 months at the beginning of the testing period. This facilitates defining 15-year-olds in terms of a calendar year cohort. Forms based on TIMSS and TIMSS-R experience will be used to record the target population definition and to document decisions about testing periods and birth year cohort definitions.
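The age rule can be illustrated with a short eligibility check. This is a sketch under stated assumptions: age is measured in completed months at the start of the testing period, the testing start date is hypothetical, and the function names are not taken from the PISA manual.

```python
from datetime import date


def age_in_months(birth: date, on: date) -> int:
    """Completed months of age on a given date."""
    months = (on.year - birth.year) * 12 + (on.month - birth.month)
    if on.day < birth.day:
        months -= 1
    return months


def is_age_eligible(birth: date, testing_start: date) -> bool:
    """True if aged 15 years 3 months through 16 years 2 months."""
    lower = 15 * 12 + 3
    upper = 16 * 12 + 2
    return lower <= age_in_months(birth, testing_start) <= upper


# Hypothetical testing period beginning 1 April 2000.
start = date(2000, 4, 1)
print(is_age_eligible(date(1984, 6, 15), start))
print(is_age_eligible(date(1985, 6, 15), start))
```

With a testing period that begins on 1 April, the eligible window under these assumptions corresponds approximately to the 1984 birth-year cohort, which shows how the rule reduces to a calendar-year definition.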

Guidelines are provided for minimizing and justifying all exclusions. School exclusions to control costs (geographically inaccessible, extremely small size, or other nonfeasibility of PISA assessment) are permitted, but limited to .5 percent of enrolled 15-year-olds. Schools enrolling only students who qualify as student exclusions also may be excluded. Student exclusions are limited to educable mentally retarded students, functionally disabled students, and nonnative language speakers. Guidelines for defining these categories are provided. The estimated size of the total excluded population is to be documented and should not exceed 5 percent of the national desired population.

The sampling manual provides specific guidelines for selecting a two- or three-stage sample with at least 150 schools and at least 35 students per school. After allowing for student nonresponse, this should yield in excess of 4,500 students. The estimated or approximated intraclass correlation coefficient and its impact on effective sample size may be used to adjust the sample size requirements. If classrooms are used for sampling, allowances for a higher intraclass correlation coefficient are required.

Guidelines and forms for documenting participation at all levels are provided. Decision guidelines for scheduling makeup sessions needed to maintain acceptable student response are provided in a separate manual. Minimum levels of participation at the national level are prescribed as 85 percent for schools and 80 percent for students. Plans are to annotate results of countries that do not meet these minimums.

The PISA sampling manual clearly demonstrates the movement toward more specificity in definitions and procedures as well as some tightening of standards, such as limiting total exclusions.

SOME GENERAL APPRAISALS

This section summarizes some of the critiques provided by other authors. It is helpful to note the date of each critique and to relate it to three broad periods defined as before 1980, the 1980s, and 1990 and later. I believe major shifts occurred in the ways international comparative studies could be and were conducted over these three time periods. Rigor in sample design and execution was a stated goal over the entire period. The collection of detailed information about real or potential problems in sample design and execution began in the early 1980s and helped provide a basis for the BICSE framework and principles document published in 1990. One focus of this framework was the recognition of the need for even more data to help understand and clarify differences across school systems in different countries. The comments of reviewers presented in this section need to be placed in the timeframe of studies completed at the time of each reviewer’s comments. Later in this chapter, I present some of my own conclusions, with a focus on current (2000) status of sample design and execution in international comparative studies in education.

Olkin and Searls (1985), in a paper prepared for a National Academy of Sciences conference in October 1985, address the issue of standards related to nonresponse rates. They state:

We believe that standards need to be set not so much in terms of absolute acceptable non-response rates as on procedures for dealing with non-response—initial approaches, follow-up procedures, analytical approaches for adjusting or weighting the data, and the possible use of adjustments made on the basis of effort required to secure cooperation. (p. 4)

They also note the problem of nonretention in school and the variation among countries. They speculate (p. 4) that “average achievement would appear relatively worse than it would if a smaller proportion of the age group were retained.”⁴

The summary report of the conference (NRC, 1985) includes several statements relating to sampling, survey design, and, particularly, response rates. On coordination of survey design, measurement design, and analysis, it states:

It was recognized that progress toward better statistical standards for international assessments will necessarily involve a more thorough understanding of the interrelationships between the educational measurement aspects of instrument design and testing and of survey sampling design and implementation issues, together with a recognition of the need to make explicit the analytic framework within which the data from the assessments are ultimately to be used. (p. 5)

This comment remains true and, perhaps, provides the opportunity for further improvement in the process. With regard to the age versus grade definition of target population, it states that “consideration should be given in future studies to taking national probability samples of age cohorts of children or of a mixture of such sampling and class level sampling” (p. 11). On improving the overall quality of studies, it states that based on recent experience in science and mathematics studies, “higher quality results are attainable by directing more resources to pre-implementation planning and field arrangements and to vigorous follow up of non-respondents even at the cost of overall sample size” (p. 11). This is a continuing theme from many of the critiques and postsurvey discussions.

It is my understanding that the 1985 National Academy of Sciences conference was instrumental in the establishment of the Board on International Comparative Studies in Education.

The BICSE framework (NRC, 1990) provides general guidelines on sampling and access to schools, as discussed in the introduction.

Horvitz (1992) identifies a broad set of issues in improving the quality of international education surveys and suggests the Deming philosophy for quality improvement, along with cooperative methodological experiments built into ongoing cross-national surveys as a means to determine effective ways of reducing all types of survey error.

Medrich and Griffith (1992) discuss completed mathematics and science studies sponsored by IEA and IAEP. They note:

The surveys have not achieved the high degrees of statistical reliability across age groups sampled and among all of the participating countries. Thus, from a statistical point of view, there is considerable uncertainty as to the magnitude of measured differences in achievement. Inconsistencies in sample design and sampling procedures, the nature of the samples and their outcomes, and other problems have undermined data quality. (p. viii)

Nevertheless, they believe that these surveys have value and that the challenge is to improve their quality in the future. TIMSS shows improvement in consistent definition of comparable groups across countries, but as documented earlier in this chapter, a small minority of the 42 countries involved in TIMSS still took exception to recommended target population definitions, and some countries provided only partial information.

Goldstein (1995) reviews sampling and other issues in the IEA and IAEP studies conducted prior to TIMSS. He advocates consideration of longitudinal studies beyond those conducted as an option within a single academic year in some countries’ science and mathematics studies. He believes that age cohorts might provide a better study definition for longitudinal followup purposes. He also sees age definition as a possible solution to defining a target population near the final year of secondary education among students attending different types of educational institutions. Although sampling procedures for these studies involved standard applications of sample survey methodology, he notes the difficulty of ensuring uniformity across countries. He also notes problems associated with restricted sampling frames and general nonresponse, which both exhibit considerable variation across countries in the studies reviewed. He advocates obtaining characteristics of nonresponding schools and students, and publishing comparisons of these characteristics for respondents and nonrespondents. He discusses the impacts of length of time in school, age, and grade and how they jointly influence achievement under different systems of education due to promotion policies or other factors; he notes that little effort has been devoted to age and grade standardization of results prior to publication.

Postlethwaite (1999) presents an excellent discussion of sampling issues relating to international studies of educational achievement. He reviews population definition issues from both policy and methodological viewpoints, paying particular attention to age versus grade definitions. He also addresses guidelines for setting precision requirements, the sample selection methodology, weighting of data and adjustment for nonresponse, standards for accepting or flagging low-quality surveys, and a general checklist for evaluating the sample design and the resulting data. He also points out that the issue of defining the populations to be tested is an “educational and political” decision. From a sample design perspective, we can leave the question to the education experts and policy makers. Their decisions, however, do impact the sampling and data collection operations. In some countries, students of the same age can be spread across several grades. Postlethwaite cites U.S. data showing 13-year-olds spread across grades 2 through 11, with most in grade eight and nearly 89 percent of enrolled students in grade eight or grade nine; this has serious implications for complete population coverage in the sampling frame when selecting samples defined by age. The grade by age distribution creates a more serious problem for test developers.

Quite specific standards and guidelines are provided by Martin et al. (1999) for IEA studies discussed in the introduction of this chapter. Their heavy focus on documentation is particularly noted because this sets the stage for improvement of current procedures.

REMAINING ISSUES

It is a challenge to add new thoughts to the critiques already presented. It is also true that lessons learned in the early studies have been applied to the design of more recent studies. Information about what actually occurred is more consistently organized and (currently) accessible for recent studies, particularly SIMS and TIMSS. An understanding of what has happened in prior studies is a fundamental requirement for improving future studies. The documentation and reporting procedures used in TIMSS and planned for PISA are excellent, but still leave room for new ideas. The BICSE framework (NRC, 1990) and the IEA technical standards (Martin et al., 1999) now provide guidance for sample design and survey implementation. What more can be done? Certainly we need to make practice more consistent with plans. In addition, as we get better compliance with the prescribed sampling process, we need to examine the prescribed process itself in light of achievable current practices. I will address issues in the following areas:

Population definitions.
Sampling frame completeness.
Designing the sample.
Executing the sample design.
Response rates.
Other nonsampling errors.
Annotation of published results.

Population Definitions

The issue of age versus grade definition has been discussed thoroughly by other authors, but remains an issue in recent studies. TIMSS used a two-grade span to define populations close to age nine or age 13 for its Populations 1 and 2, and the final year of secondary education to define Population 3. The plans for PISA call for using age 15 rather than a grade concept to define persons near the end of secondary schooling, but at an age where schooling is still largely universal. So different approaches to the same concept continue to be applied. The plans for PISA are much more thorough in describing what is meant by a person still in the country’s education system by including part-time students, students in vocational training, and students in foreign schools within the country as well as students in full-time academic study programs. This is not a sample design decision (as noted by Postlethwaite, 1999), but it has serious implications for defining the sampling frame and selecting the student sample.

The population also can be defined in the time dimension. Allowances have been made for testing at different times in the southern and northern hemispheres in recognition of different starting times for the normal academic year. Recent trends, at least in the United States, include moves to year-round schooling and home schooling. The timing of an assessment during a short period of time could arbitrarily exclude a significant portion of students enrolled in year-round schools who have a break at a nontraditional period. Students schooled at home may be considered in the educational system because they must obtain some exemption from required attendance in a formal school, but no known effort is made to test such students. At a minimum, countries (including the United States) need to quantify the extent of these alternate practices so that the exclusions from the target population defined in two dimensions (type of school enrollment and time of testing) can be better understood. Some allowance for alternate testing times could be effective in covering students in year-round schools; covering students participating in home schooling would be more challenging.

The population definition also includes the definition of exclusions for disability, language, or other reasons. Excellent guidelines have been developed for exclusions of both schools and students within schools and for documenting these exclusions in international studies. Some countries have continued to exclude geographic groups or major language of instruction groups for cost or political reasons. When cost is the only issue, stratification and sampling at a lower rate in the high-cost stratum might provide an alternative to arbitrary exclusion.
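The idea of sampling a high-cost stratum at a lower rate, rather than excluding it, can be illustrated with the cost-weighted allocation result given in Cochran (1977), in which the stratum sample size is proportional to N_h * S_h / sqrt(c_h). The sketch below is only an illustration: the function name, the two strata, and every figure in it are hypothetical and are not taken from any of the studies reviewed.

```python
from math import sqrt

def cost_optimal_allocation(strata, budget):
    """Split a fixed budget across strata with n_h proportional to N_h * S_h / sqrt(c_h).

    strata maps a stratum name to (N_h, S_h, c_h): population size,
    within-stratum standard deviation, and cost per sampled unit.
    Returns the (unrounded) sample size for each stratum.
    """
    denom = sum(N * S * sqrt(c) for N, S, c in strata.values())
    return {
        name: budget * (N * S / sqrt(c)) / denom
        for name, (N, S, c) in strata.items()
    }

# Hypothetical example: a remote stratum that costs four times as much per unit.
strata = {
    "accessible": (9000, 100.0, 1.0),
    "remote": (1000, 100.0, 4.0),
}
allocation = cost_optimal_allocation(strata, budget=500.0)
rates = {name: allocation[name] / strata[name][0] for name in strata}
```

With equal variability in the two strata, the remote stratum is sampled at half the rate of the accessible stratum (one over the square root of the cost ratio), so it stays in the population of inference at a reduced cost.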

Recent U.S. experience in state NAEP has identified potentially serious problems in implementing comparable exclusion rates across states with the development of new guidelines for accommodation (Mazzeo, Carlson, Voelkl, & Lutkus, 2000). The international standards discussed do not address accommodation for disability or language; we might anticipate additional complications in international assessments if similar accommodation policies are more broadly implemented in other countries or applied to the international assessment samples in the United States.

Sampling Frame Completeness

Sampling frame completeness can only be evaluated relative to the target population definition. Exclusion of schools for inappropriate reasons could be viewed as either a population definition issue or a sampling frame incompleteness issue, depending on the intended population of inference for the analysis.

School sampling frames often are developed several months before the actual survey implementation. To the extent that the population is defined as of the survey date, procedures may be required to update the sampling frame for newly opened schools or other additions to the eligible school population (changes in grade range taught) occurring since the school sample was selected. It is not clear that this has been attempted in any of the international studies reviewed. False positives in the sampling frame (schools thought to be eligible at the time of sampling that turn out not to be eligible) can be handled analytically by treating the eligible schools as a subpopulation and are less of a problem. When the target population of schools includes both public and private schools as well as vocational education schools, the development of a complete school frame may become more difficult. Quality controls could be incorporated into advance data collection activities or into the survey process itself to check the completeness of the school sampling frame on a subsample basis, perhaps defined geographically. When sampling by age group, all schools that potentially could have students in the defined age range should be included in the sampling frame. If any arbitrary cutoffs are used to avoid schools with projected very low enrollments for an age group, these also should be checked for excluded schools on at least a sample basis. The feasibility of using arbitrary cutoffs to possibly exclude a small proportion of age-eligible students depends on the dispersion of age-eligible students across grades.

The recent guidelines for developing student sampling frames differ depending on whether the population is defined by grade or age. Age-defined population sampling requires listing all students in the school that meet the age (or birthdate range) definition; generally, the sample is then drawn as a sample of students using simple random sampling. For populations defined by grade, the sampling frame often is developed based on a “complete” list of classrooms. When the focus is on a particular subject (e.g., mathematics or science), classrooms may be limited to the subject matter being studied. There is a potential problem with the classroom approach of excluding students not currently enrolled in the target subject matter class at all or enrolled in a subject matter class at a different grade level. We may need to be more specific in defining what is meant by grade; that is, is it defined based on overall academic progress or only on progress in the subject being tested? After the grade definition is resolved, the ideal approach would be to list all grade-eligible students (just as we list all age-eligible students). Then if direct student sampling is prescribed, a simple random sample of grade-eligible students could be selected. If for logistical reasons a classroom sample is preferred, the list could be partitioned into classrooms and a sample of classrooms would then be selected. Any student not clearly associated with the type of classroom defined for administration purposes could be arbitrarily assigned to one of the classrooms before the classroom sample is selected, then tested with that classroom if it is selected.
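The classroom-based approach described above (list every grade-eligible student, partition the list into classrooms, attach any unaffiliated student to a classroom before selection, then sample classrooms) might be sketched as follows. The function name, the data layout, and the use of equal-probability classroom selection are assumptions made for illustration; they are not procedures prescribed by TIMSS or PISA.

```python
import random

def select_classrooms(students, n_classrooms, seed=0):
    """Build a classroom frame covering every eligible student, then sample classrooms.

    students is a list of (student_id, classroom) pairs; classroom is None for a
    grade-eligible student not enrolled in any target-subject class.
    """
    rng = random.Random(seed)
    classrooms = sorted({c for _, c in students if c is not None})
    frame = {c: [] for c in classrooms}
    for student_id, classroom in students:
        if classroom is None:
            # Arbitrary assignment made BEFORE selection, so the student
            # has a known chance of being tested.
            classroom = rng.choice(classrooms)
        frame[classroom].append(student_id)
    selected = rng.sample(classrooms, n_classrooms)
    return frame, selected

# Hypothetical school with three classes and two unaffiliated students.
students = [(1, "7A"), (2, "7A"), (3, "7B"), (4, None), (5, "7C"), (6, None)]
frame, selected = select_classrooms(students, n_classrooms=2, seed=1)
tested = [sid for c in selected for sid in frame[c]]
```

Because the assignment happens before the classroom draw, every listed student belongs to exactly one sampling unit, which is the property the frame needs for complete coverage.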

Designing the Sample

Other than technical details in constructing complete sampling frames, the sample should be designed to provide the required precision for a minimum cost. Optimizing a sample design to meet precision requirements requires a reasonable model of the variance as a function of controllable sample design parameters (typically, the number of schools and the number of students per school). The variance models used in the guidance documents for TIMSS and PISA incorporate the clustering effect into the variance model in terms of an assumed intracluster correlation coefficient. Empirical studies show wide variation in this population parameter; it is correctly noted that a large clustering effect is more likely with classroom sampling than with direct student sampling. Other population and sample characteristics also should be incorporated into the variance model, including stratification effects, unequal weighting effects, and expected cluster size variability. Stratification can be highly effective in reducing the school component of variance; intraclass correlation coefficients computed within strata are likely to be much smaller than those computed ignoring the strata. More data analysis may be required to develop estimates of these values based on prior experience. The correct specification of the variance model is essential to the development of cost-effective sample designs that satisfy the study’s precision requirements.
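A minimal sketch of how an assumed intracluster correlation feeds into sample size is given below. It uses the textbook design effect for equal-sized clusters, 1 + (m - 1) * rho, and deliberately omits the stratification, unequal weighting, and cluster size variability terms that a fuller variance model would include. The function names and all numeric inputs are hypothetical, and the sketch is not a reproduction of the TIMSS or PISA planning formulas.

```python
from math import ceil

def design_effect(cluster_size, rho):
    """Variance inflation from clustering, for equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * rho

def schools_needed(effective_n, cluster_size, rho):
    """Schools required so the clustered sample matches the precision of a
    simple random sample of effective_n students."""
    total_students = effective_n * design_effect(cluster_size, rho)
    return ceil(total_students / cluster_size)

# Hypothetical planning values: 25 students per school, target effective n of 400.
low_clustering = schools_needed(400, 25, 0.1)   # e.g., direct student sampling
high_clustering = schools_needed(400, 25, 0.3)  # e.g., intact classroom sampling
```

Under these illustrative inputs the larger intracluster correlation more than doubles the number of schools required, which is why a misjudged value of this parameter can make a design either wasteful or too imprecise.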

As pointed out by Olkin and Searls (1985), the sample design needs to be consistent with the intended analysis. If two subjects are being assessed in two different samples of students within each school and separate estimates are to be made for each subject, then the average cluster size in the variance model should be based on school-subject sample size and not on total school sample size; the same principle might apply to subtest scores. Procedures that simultaneously control modeled variances for several different estimates for the same defined population also can be implemented.

This is another area where data need to be accumulated in a systematic manner across countries. The availability of microdata with the sample structure for strata, schools, classrooms, and students clearly labeled would make possible the estimation of the required sample design parameters in a consistent manner. These microdata sets also would provide a valuable resource for studying effective sample design consistent with different analytic objectives.

Executing the Sample Design

With development of procedures that include guidance from a respected national statistical organization and the resolution of particular issues by a similarly respected sampling referee, the execution of the sample design has not been and should not be a serious problem. The documentation of procedures following TIMSS or PISA guidelines and forms also helps guarantee correct implementation. These procedures may have room for improvement based on further experience, but must be viewed as quite excellent.

Two areas of the sample design execution relate to dealing with initial nonresponse at the school and student levels. Substitution for nonresponding schools has been a practice allowed in most of the international assessments. Although substitution does not eliminate bias due to nonresponse, it does maintain the sample size required to control sampling error. If used with careful matching to the nonrespondent schools, the substitution method also can limit the extent of the potential bias introduced by nonresponding schools. A draft of the PISA sampling manual (OECD, 1999b) provides a reasonable way to implement the substitution by identifying two schools in the ordered school sampling frame as potential replacements for each nonresponding school: the schools immediately preceding and immediately following the selected school (if they are not also selected schools). If the list is ordered by administrative structure of the school system (e.g., by districts), it is likely that near neighbors in the list also might be nonrespondents; close matching or ordering on other school characteristics, however, may be quite effective in supplying replacements with similar characteristics who are not prejudiced by their neighbor’s unwillingness to respond.
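The neighbor-based rule for identifying potential replacement schools can be sketched in a few lines. The function name and the toy frame are hypothetical. The sketch also leaves open details the text above does not settle, such as which neighbor is tried first and what happens when one unselected school sits between two selected schools.

```python
def replacement_candidates(frame, selected):
    """Map each selected school to its frame neighbors that are not themselves selected.

    frame is the ordered list of school identifiers; selected is the set of
    sampled schools. Each selected school gets at most two candidates: the
    school immediately preceding it and the one immediately following it.
    """
    position = {school: i for i, school in enumerate(frame)}
    candidates = {}
    for school in selected:
        i = position[school]
        neighbors = [frame[j] for j in (i - 1, i + 1) if 0 <= j < len(frame)]
        candidates[school] = [s for s in neighbors if s not in selected]
    return candidates

# Hypothetical frame, already sorted on the matching characteristics.
frame = ["A", "B", "C", "D", "E", "F"]
spread_out = replacement_candidates(frame, {"B", "E"})
adjacent = replacement_candidates(frame, {"A", "B"})
```

How useful the candidates are depends entirely on the sort order of the frame: the rule only yields similar replacements when neighboring schools resemble each other on characteristics related to achievement.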

The practice of selection of substitutes for nonresponding schools needs further review. Different approaches are favored by different applied statisticians. Clearly, no method can totally eliminate the bias due to nonresponse, and all methods just try to maintain the respondent sample size. If possible, empirical studies of alternative approaches should be developed, conducted, and reviewed by a panel of survey experts to determine if current substitution practices are the most appropriate ones for international comparative studies. The empirical research simply might be based on simulation of nonresponse from completed studies in one or more countries.

The practice of routinely scheduling followup sessions for absent or unavailable students whenever response rates fall below set but relatively high levels within schools should be continued and formalized.⁵

Response Rates

The TIMSS response criteria specified 85 percent for schools and students or a combined rate exceeding 75 percent. These criteria were used for flagging the results. The draft PISA sampling manual specified 85 percent response for schools and 80 percent for students. Are these criteria generally achievable?
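One way to read these criteria as a flagging rule is sketched below. Two points are assumptions rather than statements from the source: the combined rate is taken to be the product of the school and student rates, and the 85 and 80 percent thresholds are treated as "at least." The function names are hypothetical.

```python
def meets_timss_criteria(school_rate, student_rate):
    """True if both rates reach 85 percent, or the combined rate exceeds 75 percent.

    The combined rate is ASSUMED here to be school_rate * student_rate.
    """
    combined = school_rate * student_rate
    return (school_rate >= 0.85 and student_rate >= 0.85) or combined > 0.75

def meets_draft_pisa_criteria(school_rate, student_rate):
    """True if schools reach 85 percent and students reach 80 percent."""
    return school_rate >= 0.85 and student_rate >= 0.80

# Hypothetical country results, expressed as proportions.
flag_free = meets_timss_criteria(0.80, 0.95)   # passes on the combined rate only
flagged = not meets_timss_criteria(0.80, 0.90)
```

Under this reading, a country with weak school participation can still avoid a flag through very high student participation, which is one reason the school-level rate deserves separate attention.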

School response rates appear to be the more serious problem, particularly in the United States. The 1996 main NAEP did not achieve the 85-percent school participation rate for all session types for grades four and eight and did not achieve an 80-percent school participation rate for any session type for grade 12 (Wallace & Rust, 1999). The NAEP survey would be expected to be exemplary among studies undertaken in the United States.

Setting standards is not the solution to the problem of school nonresponse. Studies need to be undertaken in the United States and other countries to better understand why the problem exists. Data on this topic most likely exist, but need to be organized and reviewed to formulate better approaches. In the United States, many different surveys compete for testing-schedule time in the schools; assessments are carried out at national, state, and local levels. International assessments just add to the burden. When studies are planned independently and samples are drawn independently, the chance of overburdening some schools is predictable and may contribute to poor participation in all of the studies. Most large school districts are asked to participate in all of these studies. Once we better understand the school perspective in survey participation, we can develop strategies, including possible coordination among studies, to encourage participation while simultaneously limiting the burden on any particular school. Nevertheless, these strategies should be considered as possible options for improving the precision of estimates.

The other theme that has been relatively constant across all the critiques reviewed has been the lack of resources for really thorough study execution. Proper planning and the scheduling of advance contacts required to obtain good study participation from schools require both time and adequate funding. These additional resources should be applied intelligently based on what is learned about the reasons for school nonresponse.

The methods of adjusting analytic weights for nonresponse also should be reviewed. Many noneducation surveys standardize their estimates or poststratify to known population distributions (e.g., age, race, and gender) or to distributions estimated from larger surveys. This is particularly difficult to do with the population of students enrolled in a country’s educational system because the population is constantly changing and good enrollment data for the time of testing are difficult to obtain from other sources. If multiple forms are used or if more than one subject is assessed in a given year, the combined sample might provide a better estimate for standardizing the individual estimates developed by subject or by objective within subject. These types of methods add complexity to the weight development process and must be applied with good judgment.
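For readers unfamiliar with poststratification, a minimal sketch follows: within each adjustment cell, the weights are rescaled so that their sum equals a control total. The function name, the cells, and the control totals are hypothetical; as noted above, obtaining trustworthy control totals for an enrolled student population is the hard part, and the sketch simply assumes they are available.

```python
def poststratify(weights, cells, control_totals):
    """Rescale weights so each cell's weighted total matches its control total.

    weights and cells are parallel lists (one entry per responding student);
    control_totals maps each cell label to a known or estimated population count.
    """
    cell_sums = {}
    for w, c in zip(weights, cells):
        cell_sums[c] = cell_sums.get(c, 0.0) + w
    return [w * control_totals[c] / cell_sums[c] for w, c in zip(weights, cells)]

# Hypothetical respondents in two cells with assumed control totals.
weights = [10.0, 10.0, 20.0, 20.0]
cells = ["a", "a", "b", "b"]
adjusted = poststratify(weights, cells, {"a": 30.0, "b": 30.0})
```

In this toy case cell "a" is weighted up and cell "b" is weighted down, so the adjusted weights sum to the control total in each cell. The same mechanics would apply if the control totals came from a combined sample across forms or subjects.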

Other Nonsampling Errors

Frame errors and response errors have been discussed. This leaves measurement error. The conditions present at testing, the correlated errors induced by the behavior of test administrators, data processing, and other factors all can contribute to measurement error. Sample and survey design can help control such errors, but monitoring to identify and measure such sources of error is essential in deciding whether the cost of revised procedures is necessary or justified. As noted by Horvitz (1992), cooperative methodological experiments could be extremely valuable in identifying and reducing measurement error.

Annotation of Published Results

As data users become more sophisticated, they expect to be informed about the strengths and weaknesses of the statistical results. The flagging of results depending on participation rates employed by TIMSS is a good example of a way to warn users about data quality based on something other than sampling error. The development of technical reports addressing quality control can be of use to the data professional (e.g., SIMS and TIMSS reports) and is strongly endorsed.

CONCLUSIONS

The sampling and survey design and execution of recent and planned international comparative studies have benefited greatly from the analysis of the results of earlier studies. Challenges remain. We should anticipate that the cutting-edge technology of today will not necessarily be viewed favorably 10, 15, or 20 years from now. Our views today about the studies completed prior to 1980 might seem unfair to those who conducted those studies using the cutting-edge approaches of those times.

Just as the BICSE guidelines suggest focused studies to interpret differences in educational achievement, we need focused studies to understand and interpret the differences in the background conditions and the feasible survey methodologies that apply to different countries. This applies particularly to the conditions in the educational system and how they should influence the definition of the desired target populations. The concept of final year of secondary education remains vague, especially in countries with alternative academic tracks; procedures designed to avoid double counting in the final year of secondary education population may be creating undercoverage of populations defined by subject matter specialization. These types of problems have solutions that begin with a clear understanding of the study requirements.

Longitudinal studies have not been a major focus of the studies reviewed, but have been a country option in some of them. The value of longitudinal measurements versus repeated cross-sectional measurements needs to be evaluated in terms of educational objectives and the types of country comparisons that are useful in evaluating the achievement of those objectives.

Finally, the focus on meeting tough standards for coverage and response rates should not lead us to solve the problem by defining it away. As an example, there is always a temptation to simply rule that an excluded portion of a study population is not really part of the population of interest. This ruling immediately increases coverage measures in all participating countries, but may totally destroy the comparability of results across countries. It would be better to relax the standards somewhat (and continue to monitor them) than to essentially ignore them by defining them away.

NOTES

1.	Table A.7 (p. 55) of Medrich and Griffith (1992), based on data from Peaker (1975).
2.	Smaller values might have been obtained if the effects of stratification and unequal weighting had been removed before calculating the intraclass correlation; design effects would remain the same because these factors would need to be put back into the model for total variance.
3.	Medrich and Griffith cite data obtained from IEA (1988).
4.	The focus here is on populations defined by age group. Retaining fewer students in school leads to excluding the poor performers, which then leads to higher average scores for those remaining in school.
5.	For PISA, these procedures are available in the National Project Managers Manual, but they were not reviewed for this chapter.

REFERENCES

Cochran, W. G. (1977). Sampling techniques. New York: John Wiley & Sons.

Deming, W. E. (1950). Some theory of sampling. New York: Dover.

Foy, P., Martin, M. O., & Kelly, D. L. (1996). Sampling. In M. O. Martin & I. A. Mullis (Eds.), Third International Mathematics and Science Study: Quality assurance in data collection (pp. 2-1 to 2-23). Chestnut Hill, MA: Boston College.

Foy, P., Rust, K., & Schleicher, A. (1996). Sample design. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development. Chestnut Hill, MA: Boston College.

Garden, R. A. (1987). Second IEA Mathematics Study, sampling report. Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Goldstein, H. (1995). Interpreting international comparisons of student achievement. Paris: UNESCO.

Hansen, M. H., Hurwitz, W. N., & Madow, W. G. (1953). Sample survey methods and theory. New York: John Wiley & Sons.

Horvitz, D. (1992). Improving the quality of international education surveys (draft). Prepared for the Board on International Comparative Studies in Education.

International Association for the Evaluation of Educational Achievement (IEA). (1988). Student achievement in seventeen countries. Oxford, England: Pergamon Press.

Lapointe, A. E., Askew, J. M., & Mead, N. A. (1991). Learning science: The Second International Assessment of Educational Progress. Princeton, NJ: Educational Testing Service.

Lapointe, A. E., Mead, N. A., & Phillips, G. W. (1989). A world of differences. Princeton, NJ: Educational Testing Service.

Lessler, J. T., & Kalsbeek, W. D. (1992). Nonsampling error in surveys. New York: John Wiley & Sons.

Martin, M. O. (1996). Third International Mathematics and Science Study: An overview. In M. O. Martin and D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development. Chestnut Hill, MA: Boston College.

Martin, M. O., Rust, K., & Adams, R. J. (1999). Technical standards for IEA studies. Amsterdam: International Association for the Evaluation of Educational Achievement.

Mazzeo, J., Carlson, J. E., Voelkl, K. E., & Lutkus, A. D. (2000). Increasing the participation of special needs students in NAEP: A report on 1996 NAEP research activities. Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Medrich, E. A., & Griffith, J. E. (1992). International mathematics and science assessments: What have we learned? Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.

National Research Council. (1985). Summary report of conference on October 16-17, 1985 (Draft). Committee on National Statistics, Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.

National Research Council. (1990). A framework and principles for international comparative studies in education. Board on International Comparative Studies in Education, Norman M. Bradburn & Dorothy M. Gilford, Editors. Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.

National Research Council. (1995). International comparative studies in education: Descriptions of selected large-scale assessments and case studies. Board on International Comparative Studies in Education. Commission on Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.

Olkin, I., & Searls, D. T. (1985). Statistical aspects of international assessments of science education. Paper presented at the conference on Statistical Standards for International Assessments in Precollege Science and Mathematics. Washington, DC.

Organization for Economic Cooperation and Development (OECD). (1999a). Measuring student knowledge and skills: A new framework for assessment. Paris: Author.

Organization for Economic Cooperation and Development (OECD). (1999b). PISA sampling manual, main study version 1. Paris: Author.

Peaker, G. (1975). An empirical study of education in twenty-one countries: A technical report. New York: John Wiley & Sons.

Peaker, G. F. (1967). Sampling. In T. Husen (Ed.), International study of achievement in mathematics: A comparison of twelve countries (pp. 147-162). New York: John Wiley & Sons.

Postlethwaite, T. N. (1999). International studies of educational achievement: Methodological issues. Hong Kong: University of Hong Kong.

Schleicher, A., & Siniscalco, M. T. (1996). Field operations. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS) technical report, Vol. I: Design and development. Chestnut Hill, MA: Boston College.

Suter, L. E., & Phillips, G. (1989). Comments on sampling procedures for the U.S. sample of the Second International Mathematics Study. Washington, DC: U.S. Department of Education, National Center for Education Statistics.

Wallace, L., & Rust, K. F. (1999). Sample design. In N. L. Allen, D. L. Kline, & C. A. Zelenak (Eds.), The NAEP 1994 technical report (pp. 69-86). Washington, DC: National Library of Education.