Read "Placing Children in Special Education: A Strategy for Equity" at NAP.edu

Suggested Citation:"3 Assessment: Issues and Methods." National Research Council. 1982. Placing Children in Special Education: A Strategy for Equity. Washington, DC: The National Academies Press. doi: 10.17226/9440.

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

3 Assessment: Issues and Methods

Most discussions of assessment in the context of special education placement for mildly mentally retarded students focus on proper classification and the avoidance of misclassification. These issues have been treated extensively by other panels and professional organizations (e.g., Hobbs, 1975). This panel was convened because of public concern about the possible misclassification of minority students and about the violations of civil rights that such misclassification might entail. As we argued in Chapter 1, however, issues of classification or valid assessment surrounding the educable mentally retarded (EMR) category are inextricably linked to issues of instruction. One major reason why misclassification is a policy concern is that it may lead to inappropriate educational treatments. Consequently, we focus our discussion of assessment instruments and procedures on their educational relevance and utility: their usefulness in identifying students who need and can profit from special forms of instruction or interventions and their usefulness as guides to the type of instruction or intervention that is needed.1

1 Although our discussion concentrates primarily on the direct contribution of assessment to classroom instruction, we recognize that other forms of intervention may be appropriate and necessary for some children before any program of classroom instruction can be effective. For example, the correction of defective vision or hearing, medical treatment, or even psychotherapy or family intervention might be needed before a child can function successfully in the classroom.

Assessment procedures and instruments may have many functions, of which guiding intervention is only one. They might be used to diagnose abnormal or debilitating organic conditions, to predict future academic performance, or, in theory, even to infer the underlying capacity to learn. Each of these functions would imply different assumptions about the nature of the instrument being used and about the entity being measured. Each would raise different scientific controversies. Each could contribute to intervention; for example, diagnosis could point to treatment, although there might be some conditions that can be diagnosed but not treated. The discussion below subordinates these other functions to that of facilitating effective educational intervention. For example, much of the debate surrounding IQ tests has to do with their use in inferring learning potential. Although we sketch the broad outlines of this debate, we base our conclusions about IQ tests primarily on their utility, or lack of utility, in helping educators select and design instructional programs.

Our decision to focus on the educational utility of various assessment devices and procedures, rather than on their role in classification and misclassification, is based primarily on the fact that we are analyzing assessment in an educational context, in which it is a means to the end of improving instruction. Two additional considerations reinforce our decision. First, as shown in Chapter 2, definitions of EMR originated with a particular instrument, the IQ test, and have shifted over time. Data on the prevalence of EMR are confounded with the assessment practices and instruments used in different states and localities (see Chapter 2 and the paper by Shonkoff in this volume). It is difficult to discuss cogently the contribution of different assessment practices to classification and misclassification in the face of this confusion and circularity.
Furthermore, it would be fruitless to cover the same ground as the far more extensive discussions of classification mentioned above.

Many scientific controversies about the validity of assessment techniques, notably the IQ test, are unresolved. To attempt to take sides on these issues would require a detailed, technical discussion that probably would neither settle the issues nor lead to useful recommendations for educational policy and practice.2 Decisions about policy and practice cannot await the final resolution of scientific debates. By focusing on educational utility, we hope to provide a framework for approaching these decisions despite the ambiguities in current understanding.

2 For a comprehensive discussion of the issues involved in ability testing generally, see the report of the National Research Council's Committee on Ability Testing (Wigdor and Garner, 1982).

This chapter has two major sections. The first section, the bulk of the chapter, reviews salient issues surrounding the instruments that comprise a comprehensive battery for assessing a child who has proved unable to learn normally in the classroom. The section covers IQ tests and other measures of intellectual functioning, biomedical measures, and measures of adaptive behavior: the child's ability to meet normal expectations appropriate to age and setting, with regard to self-help skills, independence, impulse control, cooperation, and the like. The second section describes an ideal assessment process in which the comprehensive assessment would be embedded. The process takes place in two phases. The first phase, prior to any attempt to find problems or deficiencies in the child, is a systematic investigation of the learning environment and the instruction the child receives. The purpose of this phase, which is almost nonexistent in current practice, is to be certain that the child cannot perform adequately in a well-designed instructional setting. Only after deficiencies in the environment have been ruled out, by showing that the child fails to learn under several reasonable programs of instruction, is it legitimate to expose the child to the risks of stigma and misclassification that are inherent in any individual assessment process. The second phase is the comprehensive individual assessment itself, which it is hoped would be applied to significantly fewer children than are affected under the current referral and placement system.

COMPREHENSIVE INDIVIDUAL ASSESSMENT

The purpose of comprehensive assessment is to locate the source of the child's difficulties in learning in the classroom. In many ways a comprehensive assessment represents an attempt to test, at the individual level, some of the hypotheses about the causes of deficient classroom functioning that were discussed in Chapter 1.
The causes may lie in physical malfunctions, emotional disturbances, deficient social skills (either specific to the school or encompassing the home as well), lack of relevant academic preparation, lack of more general cognitive skills, or a basic limitation in intellectual capacity. The causes may also lie in broader sociocultural factors of the kind discussed in Chapter 1, such as value systems antithetical to that of the school. Such factors may be manifested in the child's behavior in the classroom or during test situations and, to some degree, in measures of adaptive behavior.

As noted in Chapter 2, broad-based assessment is required under P.L. 94-142, its implementing regulations, and the regulations implementing Section 504. The regulations require, among other provisions, that assessments go beyond "a single general intelligence quotient" to include measures of "specific areas of educational need." They prohibit the use of any single procedure as the sole criterion for placing a child. They require that tests be selected in a manner designed to reflect a child's aptitude and achievement, rather than "the child's impaired sensory, manual or speaking skills." Further, the regulations for P.L. 94-142 require that a child be assessed in "all areas related to the suspected disability." In practice, as seen in Chapter 2, compliance with the law is far from complete. Whether or not other measures are administered, IQ and achievement tests tend to dominate EMR placement decisions (see Chapter 2 and the paper by Bickel in this volume).

We therefore begin this section with an examination of the major controversies surrounding IQ tests, arguing, however, that their relevance for educational practice is limited. The section also discusses attempts to develop better measures of intellectual functioning, whether by improving the IQ test or by developing supplementary or substitute measures. The section then surveys biomedical measures and measures of adaptive behavior. Both types of measure lie outside the intellectual domain, as it is usually defined; they are essential, however, to understanding the child's classroom performance and more general capabilities and limitations as well as to designing appropriate interventions.

IQ TESTING: CONTROVERSIES, IMPLICATIONS, AND ALTERNATIVES3

Of all the elements in the assessment process, standardized tests of "intelligence" have been the most controversial. They have been the subject of protracted litigation, as discussed in Chapter 2. They have been the focus of acrimonious debate in the academic community. Three related questions are at the heart of the debate as it is usually conducted: Are IQ scores4 determined primarily by genes or by the environment? Are IQ scores valid measures of academic ability? Are IQ tests culturally biased? These questions, though central to virtually all discussions of IQ testing, do not neatly divide proponents and opponents of testing in the schools.
There is considerable diversity of opinion within both camps, and there has been little attempt to spell out the practical implications of these scientific controversies.

3 Much of the information in this section is based on the paper by Travers in this volume.

4 We recognize that leaders in the field of educational assessment have long recommended against the use of single IQ scores and have urged the use of multiple instruments and careful consideration of performance profiles across subscales within tests for assessing an individual's mental abilities. Our focus on summary scores and use of the term "IQ test" rather than "test of mental abilities" or the like arises because of data cited in Chapter 2 and elsewhere in this report that show that summary scores are often accorded predominant weight in placement decisions. While the extent of this practice is uncertain, it is an important source of the controversy surrounding the use of such tests in educational placements.

Our discussion of the three issues bears primarily on widely used, individually administered IQ tests, notably the Stanford-Binet and the revised Wechsler Intelligence Scale for Children (WISC-R). Special issues raised by group ability testing and by the use of various substitutes for the major IQ tests are not discussed.

The Nature-Nurture Issue

Of all the questions surrounding IQ testing, the nature-nurture issue is the one most bitterly debated, although, as we argue below, it has little relevance for education policy or practice. In recent years the controversy has centered on the relative contributions of heredity and environment to the 15-point average difference usually found between the IQ scores of blacks and whites. Most of the existing scientific evidence bears on the contribution of genotypic variation to individual differences in measured (phenotypic) IQ within ethnic groups. For example, Arthur Jensen's controversial article (1969) examined correlations among IQs of persons in various biological kinship relationships and concluded that about 80 percent of the variation in IQ is genetically determined. Others (e.g., Jencks et al., 1972) have arrived at substantially lower estimates of heritability; however, a fairly recent review (Loehlin et al., 1975) offers a figure close to Jensen's for the heritability of individual differences in IQ within European and American Caucasian populations. The reviewers found less consistent evidence for American black populations; heritability is substantial for these populations but perhaps somewhat lower than for whites.

Numerous critics have attacked the assumptions, methods, and data that led Jensen to his high estimate of the heritability of IQ.
Among the many factors cited by the critics are the confounding of genes and environments, restriction in the range of environments studied, and the inappropriateness of the statistical techniques borrowed from population genetics that were used to estimate heritability.

The most controversial aspect of Jensen's work was his speculation that the average IQ difference between races in the United States is due partly to genetic factors. His critics have stressed that group differences in distributions of a trait can be due mostly or entirely to the environment, even if the heritability of the trait within groups is high. Loehlin et al. addressed the issue of between-group differences, primarily by examining studies relating IQ distributions to indices of racial mixture, such as blood types, skin color, and direct genealogical information. They concluded that the data "are consistent with either moderate hereditarian or environmentalist interpretations" but perhaps "more easily accommodated in an environmentalist framework" (p. 238). A similar statement could be made regarding other data, which show that the IQ gap between black and white children is inversely related to the black child's exposure to white, middle-class culture and schooling. These include studies of black families who migrated from the rural South to the urban North, studies of black children adopted by white parents, studies of the effects of early intervention programs, and studies of sociocultural variations within black and white populations.

In short, scientific controversy continues to exist with respect to the issue of heredity versus environment. Virtually everyone involved in the controversy agrees that both genetic and experiential factors influence IQ; what is at issue is the degree of influence and the mechanisms involved. The controversy has been carried into the courts, and several major judicial decisions on testing have reflected the judges' convictions that IQ tests fail to measure native intelligence (Bersoff, 1979). Yet on closer examination, we feel that the ultimate, substantive, scientific outcome of the controversy is less important for education policy and practice than it may appear, in particular for policies affecting placement of students in EMR classes.

There is a widespread assumption outside the field of special education that mental retardation is by definition an innate incapability to learn. (This belief is clearly reflected in the Larry P. decision; see also E. Smith, 1980.) It follows from this assumption that IQ must measure innate capacity if it is to be a legitimate index of mental retardation. These views are not shared, however, by medical and educational professionals concerned with mental retardation (see Goodman, 1977, for a forceful exposition of this point). Mental retardation is currently defined as a deficit in functioning and adaptive behavior, which may be due to a wide variety of factors, experiential as well as organic.
This purely functional definition is motivated by the fact that, within the limits of current knowledge, there are no differences in prognosis or indicated educational "treatment" that distinguish organically caused deficits from experientially caused deficits. That is, children at the same level of functional ability have about the same expected level of future performance and can be taught most effectively in about the same ways, regardless of whether their deficits have a known organic cause, such as Down's syndrome (see Chapter 4 for further discussion of educational treatment). If education practice is independent of etiology in these clear-cut cases, it is hard to see why practice should be affected by the heritability of IQ.

It is important to recognize that a wide range of academic performance can be achieved by children with any given IQ. Even if differences in academic ability or achievement are in large part genetically caused, proper instruction can do a great deal to ensure that children develop to their fullest potential. For example, children with Down's syndrome reportedly make significant gains under certain programs of instruction (Hayden and Haring, 1977). Although a teacher, administrator, or policy maker of the hereditarian persuasion might be pessimistic about the likelihood of change in underlying intellectual ability, this pessimism would be no justification for failing to provide conditions that allow each child to learn as much as possible. Decisions about curricula and teaching methods to be used with children at different levels of IQ or initial academic performance, as well as decisions about whether to teach these children separately or together, can and should be based on the demonstrated pedagogical effectiveness of the various approaches, not on preconceptions about the causes of initial differences in performance.

Finally, one's position on the nature-nurture question gives little or no guidance as to the degree of ethnic imbalance in special education placement that one should be willing to tolerate. As long as there are special programs for children who lack traditional academic skills, environmentalists and hereditarians alike would expect minority children to be overrepresented in such programs, at least for the immediate future.

If children are indeed being stigmatized or denied educational opportunity because of presumed native incapacity, such practices represent an inappropriate and unjustified use of IQ scores. The practices should be discontinued, but their discontinuation does not depend on proof that IQ has low heritability.

The Issue of Test Validity

Are IQ tests valid measures of "intelligence" or academic ability? Though often equated or confused with the nature-nurture issue, the issue of validity is in fact a separate one.
Many psychologists think of intelligence as an ability (or set of abilities) to absorb complex information and grasp and manipulate abstract concepts, an ability that is developed through the interaction of genetic endowment and experience. In this view, intelligence is not native capacity, but it is much more than knowledge of answers to the specific questions on the Stanford-Binet or the WISC-R. Almost all children could be taught to answer the specific questions correctly. The question is how to interpret their performance in the absence of instruction related directly to the test items.

The validity question thus posed has two parts: the first asks whether the skills measured by IQ tests are specific or general; the second asks whether the entity or entities measured by the tests can legitimately be interpreted as "developed ability."

There was a long debate in psychometrics over whether IQ tests measure "general intelligence" or differentiated abilities (verbal ability, perceptual ability, quantitative ability, etc.). Contemporary opinion holds that they measure both; there is variation shared by all items, and there are also clusters of items that are particularly closely related. The overriding conclusion, however, is that some variation is shared within clusters and across the whole test. The rather disparate items on different IQ tests seem to be measuring the same thing or a small number of things, not a miscellaneous collection of isolated facts and skills.

This conclusion is consistent with the interpretation that tests measure underlying abilities, which are manifested in the mastery of specific skills and knowledge. It is equally consistent with the interpretation that the common factor arising from shared variation across different tests and items is really the degree of exposure to middle-class culture and schooling.

There is no general resolution to this interpretive issue. All performance depends on both specific learning and broader abilities. For example, a child's performance on verbal analogies ("Tables are made of wood; windows are made of ____") depends on acquired vocabulary and familiarity with the named objects as well as a more general ability to perceive relationships. The relative contributions of ability and specific experience are not fixed properties of the item or test but depend on the ranges of ability and experience in the population tested. For example, English-speaking American children of elementary school age would presumably be familiar with the words in the above example, and their performance would probably be determined largely by their ability to perceive relationships. However, if children from non-English-speaking families or from cultures without windows and tables were tested, variations in familiarity with the vocabulary items would contribute significantly to performance.
Claims about the validity and meaning of test scores, then, are always population-specific.

Rather than addressing the interpretive issue directly, most proponents of testing in the schools place their faith in the empirical phenomenon of predictive validity. Many studies have shown that IQ scores correlate with later school grades and scores on standardized achievement tests (see the paper by Travers in this volume). These validity coefficients (correlations) clearly do not settle the interpretive question. They are consistent with the hypothesis that IQ tests measure general academic ability, which is later manifested in scholastic performance. But, again, they also can be interpreted as showing merely that IQ tests, achievement tests, and teacher-made tests all sample the same domain of acquired skills. The question of importance, once again, is how these conflicting interpretations bear on education policy or practice.

Critics of testing have argued vehemently that tests are invalid as measures of children's general ability and are therefore unfair devices to use for placement. However, few critics have attempted to spell out why tests would be fair if they did measure ability or why they are unfair if they measure only acquired skills. Defenders of testing have justified the use of tests on grounds of predictive validity, apparently believing that they are fair even if they measure primarily acquired skills. Yet few defenders have spelled out their criteria of fairness either. The argument is not really about the degree to which IQ tests measure ability versus acquired skills but about the legitimacy of using a test that mixes the two as a basis for educational programming and placement.

As Messick (1980) points out, when we begin to ask about the legitimacy of a particular use of a test, we must consider more than just what the test measures (validity, in traditional psychometric terms). We must also ask about the consequences of the intended use. In the context of educational decision making it is not enough to know that IQ tests predict future classroom performance, nor would it be enough even to know that they measure general ability. It is necessary to ask whether IQ tests provide information that leads to more effective instruction than would otherwise be possible. Specifically, is it the case that children whose IQs fall in the EMR range require or profit from special forms of instruction or special classroom settings? In the language of contemporary education research, is there an "aptitude-treatment interaction" (Cronbach and Snow, 1977) such that different instructional methods are effective for children with low IQs? An affirmative answer to these questions would constitute a good reason to use IQ scores in programming and placement decisions. (Of course, there might also be other offsetting considerations.)
If the answers are negative (and we argue in Chapter 4 that they probably are), then the IQ has limited usefulness5 in educational decision making, and debates about the meaning of IQ scores are of secondary interest from practical and policy standpoints.

5 This is not necessarily an argument that IQ testing should be abandoned entirely. There is at least one use on which professionals with very different interpretations of IQ scores agree: If a child who is failing in school proves to have an IQ in the normal range, this finding would point to the need for further diagnostic work, e.g., a search for physical disabilities, emotional difficulties, or the like. The argument in the text applies to the use of IQ cutoffs at the low end of the scale in deciding on educational programs and placements.

The Issue of Racial and Cultural Bias

Do IQ tests misrepresent the skills or abilities of minority children and those from low-income families? Are tests merely the bearers of bad news about genuine differences in educational potential or academic functioning, or are they the creators of false differences? To address these questions it is necessary to clarify some points of definition that have caused confusion and miscommunication between specialists in psychological measurement, on one hand, and educators, policy makers, and the public, on the other.

For many persons outside the field of psychometrics, tests are "biased" if group differences in test scores can plausibly be attributed to average differences in environmental advantage enjoyed by children from different ethnic or socioeconomic groups. From this perspective a test can be biased even if it captures genuine differences in knowledge, skill, or developed ability between groups. In effect, bias, cultural causation, and unfairness become equivalent concepts from this point of view: It seems unfair to categorize children or allocate educational opportunities on the basis of performance differences that are culturally caused, and it seems proper to characterize the instruments that effectuate this unfair categorization as biased.

For specialists in psychological measurement, questions of bias, fairness, and cultural causation are separate. From the specialist's perspective, bias is purely a measurement issue: If a test shows the same internal structure and the same pattern of correlations with other variables across cultural groups, the test is held to be unbiased, even if different groups have different performance profiles due to differential opportunity and experience. Given this conception of bias, it is not inconsistent to argue that the use of a particular test for a particular purpose may be unfair even if the test is, in the technical sense, unbiased.
Three potential sources of bias have received the lion's share of attention in the psychometric literature to date: (1) differences in performance induced by culturally sensitive features of the test situation, such as the race or dialect of the tester; (2) differences across cultural groups in the difficulty of particular items or in other internal features of the pattern of responses generated by test items; and (3) differences in the predictive validity of tests for different groups.

Bias in the Test Situation

Aspects of the test situation, aside from the child's actual skill or ability, that might influence test scores include familiarity with the particular test or type of test (coaching and practice); the race and sex of the tester; the language style or dialect of the tester; the tester's expectations about the child's performance; distortions in scoring; time pressure or lack thereof; and attitudinal factors such as test anxiety, achievement motivation, self-esteem, and countercultural motives to avoid conspicuously good performance.

Cases have been cited in the courts of minority children whose IQs were low when tested by a school psychologist but increased dramatically when the children were retested by persons of the same ethnic group under nonthreatening conditions. Most published research, however, finds little evidence that situational factors affect minority children differentially (Jensen, 1980: Chapter 12). Some situational factors have significant overall effects on test scores but show no interactions with ethnicity. For example, coaching and practice together can boost an individual's IQ score by about nine points, if the individual is retested after a fairly short time interval with a test that is similar to the one used for practice. Blacks and whites profit almost equally from coaching and practice. Thus, the reported data suggest that familiarization with tests cannot eliminate much of the IQ difference between the races. Not all of the other situational factors have significant overall effects on test scores, and none is as large as the effects of coaching and practice. More important, in no case is there a large interaction between a situational factor and ethnicity.

Item Bias

One approach to the analysis of item bias, which might be called "editorial," is to analyze the face content of items on logical or semantic grounds or on the basis of apparent or presumed connections to particular subcultural milieux. Judge John F. Grady's recent decision in Parents in Action on Special Education v. Hannon (1980) provides a dramatic and socially significant illustration of this approach. Setting aside a variety of statistical and empirical arguments for and against the use of tests in placing black children in EMR classes, the judge chose instead to examine test items individually and to decide in each case whether the item appeared, a priori, to present special difficulties for black children.
This "item analysis" led the judge to accept all but a few items on the Stanford-Binet and WISC-R and to uphold the use of these tests for educational placement by the Chicago schools. Others have drawn diametrically opposed conclusions from similar editorial item analyses. One obvious flaw in this approach is that it places bias in the eye of the "editor," and different editors disagree. More important is the fact that judgments about item content (even if there is agreement) are neither necessary nor sufficient to prove that particular items discriminate against minority children, in the sense of lowering their test scores. An apparently innocent item can be disproportionately difficult for minority children compared with whites, while an item that is problematic on its face can be equally difficult for all groups.

A more systematic and empirical approach to item bias is to examine the proportions of minorities and whites who get each item correct; when an item deviates markedly from the overall profile for any group, that item is assumed to confer an unacceptable advantage or disadvantage for one group or the other and is deemed to be biased in this precise and limited sense. Related psychometric approaches to assessing item bias focus on item-scale correlations and the factor loadings of items. If correlations or loadings for particular items differ conspicuously for minorities and whites, those items are suspect on the grounds that they do not appear to measure the same construct for different groups. None of these psychometric approaches has produced data suggesting that item bias is a major factor causing ethnic differences in test scores. Profiles of item difficulty are similar across ethnic groups (Sandoval, 1979), and factor structures show only minor differences (Reschly, 1978). If there is bias in IQ tests, it is pervasive and not linked to a few offending items.

Differential Prediction

Because the IQ test's primary claim to validity rests on prediction of future academic performance, differential prediction for different ethnic groups could potentially represent important evidence of bias. For example, if IQ tests measure academic ability more accurately for whites than for blacks, IQ might correlate more highly with measures of future school success for whites than for blacks. Or if IQ systematically underestimates the academic abilities of blacks relative to whites, blacks might do better academically than their IQ scores would suggest. Thus, investigations of differential predictive validity involve two questions: whether the margin of error in prediction is the same for different ethnic groups, and whether given test scores predict the same level of success for members of different ethnic groups.

Surprisingly few studies have used appropriate statistical techniques (regression analyses) to investigate these issues for elementary and secondary school children. Most studies present only correlations.
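
The two questions posed above (equal margin of error, and equal predicted level of success for a given score) correspond to quantities that a regression analysis reports and a bare correlation does not. The following is a minimal sketch of that idea, not a reconstruction of any study cited in this chapter. The data are invented, the group labels A and B are placeholders, and the function names fit_line and mean_prediction_error are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.
    Returns (slope, intercept, residual standard error)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # n - 2 degrees of freedom because two parameters were estimated
    see = (sum(r ** 2 for r in residuals) / (len(xs) - 2)) ** 0.5
    return slope, intercept, see

def mean_prediction_error(xs, ys, slope, intercept):
    """Average of (actual - predicted) under a given line.
    Negative values mean the line overpredicts for this group."""
    return mean(y - (intercept + slope * x) for x, y in zip(xs, ys))

# Invented data: test scores and a later achievement criterion for two
# placeholder groups. These numbers come from no real study.
scores_a = [70, 80, 90, 100, 110, 120]
crit_a = [2.0, 2.6, 2.9, 3.5, 3.9, 4.6]
scores_b = [70, 80, 90, 100, 110, 120]
crit_b = [1.8, 2.1, 2.7, 3.0, 3.6, 3.9]

# Question 1: is the margin of error (third element) similar across groups?
fit_a = fit_line(scores_a, crit_a)
fit_b = fit_line(scores_b, crit_b)

# Question 2: does one common line predict the same level for both groups?
pooled = fit_line(scores_a + scores_b, crit_a + crit_b)
err_a = mean_prediction_error(scores_a, crit_a, pooled[0], pooled[1])
err_b = mean_prediction_error(scores_b, crit_b, pooled[0], pooled[1])
```

In this made-up example the common line underpredicts the criterion for group A (err_a is positive) and overpredicts it for group B (err_b is negative), even though the scores correlate strongly with the criterion in both groups. Correlations alone would not reveal that difference, which is the reason regression information is needed to answer the second question.
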
As indicated earlier, correlations between IQ scores and scores on standardized achievement tests are generally high. Reported correlations are often .7 or higher for minority children (Sattler, 1974) as well as for whites. Correlations with grades are less consistent. Correlations reported for black children range as high as .6-.7 (Sattler, 1974). One large study, which was influential in the Larry P. decision, found correlations of only .27 for Anglo students and .12-.18 for black and Hispanic students (Goldman and Hartig, 1976). This study, however, has been criticized on methodological grounds having mainly to do with the limitations of grades as criterion variables (e.g., by Messe et al., 1979).

Three studies present full regression information (Farr et al., 1971; Mercer, 1979; Reschly and Sabers, 1979). The Farr et al. and Reschly studies produced complex patterns of results, varying with the ages of the children involved, and on balance indicated only minor differences in prediction for whites, blacks, and, in the Reschly study, Hispanics. When patterns did differ, they often (not always) indicated "overprediction" for blacks and "underprediction" for whites. That is, black children did less well in school and on achievement tests than their IQ scores predicted, whereas whites did better. The Mercer analysis, based on data drawn from a sample overlapping with that of Goldman and Hartig, was unique in finding poor overall prediction, worse prediction for blacks and Hispanics than for whites, and underprediction of grades for minority children with IQs below the mid-70s, the range likely to be found among children being evaluated for placement in EMR classes. Mercer's findings suggest that, if the same cutoff scores were used to place children of all ethnic groups in EMR classes, minority children in those classes would be more academically able than their white counterparts. Mercer points out, however, that her findings may be limited by technical factors (e.g., range restriction within the minority samples). In addition, some of the methodological problems raised in connection with Goldman and Hartig's data may also apply to Mercer's analysis, although Mercer has pointed out that essentially the same results are obtained when a semantic differential rating of student competence by teachers is used as the criterion variable rather than grade point average.

Conclusion. In short, the technical studies of bias surveyed in the foregoing paragraphs indicate at most a relatively modest amount of distortion in the test scores of minority children. There is limited evidence for bias in aspects of the test situation external to the test itself. There is little evidence that bias lodges in particular test items, but this fact does not preclude the possibility of generalized bias across all items. Some evidence suggestive of predictive bias at the low end of the IQ scale is reported in the Mercer study.
On balance, however, it appears that bias in the technical measurement sense contributes little to explaining ethnic differences in IQ and achievement.

It is important to recognize the limitations of this conclusion. These analyses of "cultural" bias are typically not informed by the participation or perspectives of academic specialists, such as comparative linguists and cultural anthropologists, who work with cultural data. Psychometric analyses may have neglected important sources or mechanisms of bias. Typical psychometric analyses use racial, language, or national designations as if they were equivalent to cultural categories, resulting in conceptual confusion and neglect of potentially important cultural differences within racial, language, or national groups.

In addition, psychometric investigations of bias do not address many concerns of other social scientists, educators, and policy makers regarding bias, as they use the term. For example, investigations of predictive bias ignore the problem of bias in the criteria: If school grades and/or achievement test scores understate the academic performance of minority students (as tests allegedly underestimate their abilities), it would be of no consequence, from a moral or policy standpoint, to find that prediction was perfect. Also, as noted at the beginning of this section, outside the field of psychological measurement, bias is often defined as the contribution of sociocultural factors that raise or lower the IQ scores of one group relative to another. Everyone, even the firmest believer in the genetic determination of IQ, admits that there is some cultural contribution, just as there is a cultural contribution to school success. Most important, even if there were no psychometric biases in IQ tests, questions raised earlier about the educational value of the tests would remain unanswered. Knowing that tests predict equally well for all cultural groups, or measure the same constructs for all groups, would not tell us whether instruction should differ as a function of IQ scores.

Alternative Measures of Intellectual Functioning

Standard IQ tests such as the WISC-R and the Stanford-Binet are not the only available means of measuring cognitive functioning. There have been a number of attempts to modify IQ tests, primarily with the intent of reducing or eliminating presumed cultural bias. There have also been several attempts to devise new measures, based on different assumptions about the nature and development of intelligence.

Among the approaches that have been tried in order to accommodate existing IQ tests to cultural differences are translation into other languages, altering procedures for administering and scoring tests, modifying items, and developing group-specific norms. Some of these changes have come about because of judicial or policy decisions.
For example, in the case of Diana v. State Board of Education, which challenged the administration in English of the Stanford-Binet to Spanish-speaking children, the California Department of Education agreed to a consent decree requiring bilingual testing, the elimination of "unfair" verbal items, and the development of a revised test reflecting Mexican-American culture and norms on a Mexican-American population (Bersoff, 1979).

In light of what was said earlier about the modest contribution (at most) of item bias and variations in test procedure to ethnic differences in IQ scores, it is not surprising that item deletions and procedural changes have failed to reduce the discrepancy to any significant extent. (These approaches have not been tried and studied extensively, however.)

One modification that is likely to make a difference is translation. The one source of bias that survived even Jensen's critical scrutiny (1980: Chapter 12) is the use of English-language tests with children of limited English-speaking ability. There appears to be no doubt that such children are at an unfair disadvantage. Translation, however, introduces a problem of "norming," i.e., of constructing appropriate group standards for judging the individual child's IQ. There is no guarantee that items will retain their levels of difficulty, even when accurately translated. New norms are needed, but these norms will necessarily be specific to the cultural group for whom the test is translated.

Translation is thus directly related to what is perhaps the most direct and radical approach to correcting the alleged cultural bias of IQ tests: constructing separate norms for each subcultural group. The logic of culture-specific norms is straightforward: If subcultural groups have qualitatively different "experience pools," leading to differences in average performance, the fairest comparison for any child would seem to be with members of his or her own group, not society at large. The difficulty with this approach is equally obvious: Since the different experience pools do not equip children equally well to function in a school system and society dominated by the white middle class, numerically equal scores based on separate norms may no longer entail equivalent predictions about educational success. Proponents and critics of group norms are sharply split on the question of whether this reduction in predictive power invalidates group-specific norms.

Another alternative to traditional IQ testing is provided by new measures based on Piaget's influential theory of cognitive development, which holds that intelligence undergoes a series of qualitative changes from infancy to adolescence, each marked by a reorganization of the child's system of logic and understanding of natural phenomena.
There is some evidence that this sequence occurs cross-culturally, although there are cultural variations in the rate of progress and the specific skills and knowledge that the child exhibits at each stage. Several investigators (e.g., Pinard and Laurendeau, 1964; Goldschmid and Bentler, 1968; Uzgiris and Hunt, 1975) have arranged Piagetian tasks in sequential order and collected age norms for performance, thus constructing scales by which an individual's level of development can be specified, both in terms of Piagetian theory and relative to others. These scales have proved to be extremely strong on traditional psychometric grounds of test-retest reliability and inter-item homogeneity. They also correlate highly with standard IQ tests (e.g., Kohlberg, 1968) and exhibit marked black-white differences in performance (Tuddenham, 1970), although there is one report that differences between Anglos and Hispanics are reduced (DeAvila and Havassy, 1974). Although the Piagetian tests have the virtue of a sophisticated theoretical rationale and a firm grounding in developmental research, their practical effects are likely to differ relatively little from those of standard IQ tests, with the possible important exception of use with Hispanic populations.

Another example is provided by attempts to construct culture-free and culture-fair tests. To use acquired skills and knowledge as a measure of intellectual capacity requires, among other assumptions, an assumption of roughly equal motivation and access to relevant experience throughout the tested population, an assumption that has repeatedly been challenged. In response, some investigators have attempted to build tests from items for which the assumption seems at least approximately tenable. The resulting tests typically include items heavily weighted toward perceptual or psychomotor performance and avoid verbal items. A few well-known examples include (1) the Ravens Matrices, in which respondents are shown a sequence of geometrical designs that exhibit a well-defined progression; the respondent's task is to identify the regularities in the sequence and predict the next pattern, choosing it from among several possibilities; (2) the Porteus Maze Test, which requires respondents to trace paths through a series of 28 mazes of increasing difficulty; and (3) the Goodenough-Harris Drawing Test, which requires the respondent to draw a man, a woman, and himself or herself; responses are scored to reflect developmental differences in depiction of body proportions, attachment of limbs and head, and inclusion of certain details of facial features, hands, and clothing. Developmental norms and conversions to IQ are available for all of the cited instruments. The verdict of many years of research on these and kindred tests is fairly clear and generally accepted: They have failed to yield the desired effect of substantially reducing or eliminating cultural differences in performance (Anastasi, 1976).
A final example is provided by new tests involving direct measures of learning. Almost 50 years ago, L. S. Vygotsky suggested that if one wants to measure children's ability to learn one should not test what they already know but rather put them in a situation in which there is something to learn and watch how they behave. Recently, Budoff (1968) and Feuerstein et al. (1979) devised approaches to testing that follow Vygotsky's long-ignored suggestion. Feuerstein's work is particularly relevant in the present context because he has tested many children and adolescents who would be labeled EMR by conventional test criteria. Feuerstein's Learning Potential Assessment Device (LPAD) is directly linked to remedial teaching. Children are tested on a wide variety of conceptual tasks involving analogies, seriation, logical classification, and the like. They are exposed to a highly structured instructional process involving explicit verbal explanation (mediation) and practice and feedback in a one-to-one interaction with a trained teacher. Children are then retested on the original tasks and on a set of related tasks designed to show how well newly learned concepts are generalized to similar problems. The measure of the child's potential is not his or her initial performance but the degree of progress made in response to instruction. More data on the validity of this approach, and in particular its transference to other learning situations, are needed.

Conclusions

The IQ test remains the most widely used, most influential (in terms of its effect on placement decisions), and most controversial of current measures. Much of the controversy centers on the adequacy of the tests as measures of innate capacity or learning potential, but this has little bearing on their adequacy as measures of developed cognitive abilities. We have also found reason to doubt that scientific resolution of the nature-nurture issue, even if it were possible, would dictate or justify different educational treatment of children with IQs in the EMR range. We have found little evidence for test bias, in the technical sense of the term, but we recognize that this null conclusion does not address many concerns about bias as the term is used in public discussion. The IQ test's claim to validity rests heavily on its predictive power. We find that prediction alone, however, is insufficient evidence of the test's educational utility. What is needed is evidence that children with scores in the EMR range will learn more effectively in a special program or placement. As argued in more detail in Chapter 4, we doubt that such evidence exists. Although we are not prepared, as a panel, to advocate discontinuation of IQ testing, we feel that the burden of justification lies with its proponents to show that in particular cases the tests have been used in a manner that contributes to the effectiveness of instruction for the children in question.
Attempts to modify or replace the IQ as a measure of intellectual functioning have in some cases clearly failed and in other cases remain promising but unproven. Thus, while we advocate further pursuit of the promising approaches, we cannot at present endorse any particular technique as a substitute or supplement to the IQ.

INDIVIDUAL MEASURES OUTSIDE THE INTELLECTUAL DOMAIN

Even if all the conceptual and technical problems involved in measuring intellectual functioning could be solved, the resulting instrument or instruments would constitute only a part of a fully adequate assessment battery. Many aspects of individual competence lie outside the intellectual domain, and these must be examined before an appropriate educational program and placement can be determined. In addition, a child's behavioral functioning must be understood in relation to the state of his or her physical development, nutrition, and physiological functioning; physical abnormalities and malfunctions, some of them correctable, may underlie apparent intellectual deficits and maladaptive behavior patterns.

The importance of both types of measures has been widely recognized. Virtually all authoritative discussions of educational assessment recommend inclusion of measures of adaptive behavior and biomedical screening devices. The following two sections examine some general characteristics of major existing measures and discuss salient issues surrounding their use in educational programming and placement. Although we concur with the widely accepted view that biomedical measures and measures of adaptive behavior deserve a place in a comprehensive assessment battery, we also believe that the use of such measures should be guided and evaluated by the same standards that we have applied to cognitive measures, i.e., their contribution to identifying functional needs and pointing toward effective interventions.

Biomedical Measures

The general purpose of biomedical assessment is to determine whether the child is an intact organism. In the context of a comprehensive assessment for EMR placement, biomedical measures have two more specific purposes: to ascertain whether the child's inability to learn in ordinary classes is due to sensory, motor, or other physical impairment and, whenever possible, to guide the selection of remedial approaches.

It is important to distinguish among the three different roles that physical factors may play with respect to categorization of a child as mentally retarded.
First, peripheral physical disabilities may impair an otherwise normal child's performance in class and on measures of intellectual functioning, such as IQ tests. For example, poor vision, poor hearing, psychomotor malfunctions, or hunger could have these effects. Detection of such conditions is obviously essential to prevent misclassification and often points to effective interventions.

Second, neurological conditions or endocrine malfunctions may create specific deficits in intellectual functioning (such as language disorders or dyslexia) or distortions of behavior. In the classroom, the cognitive or behavioral symptoms may be indistinguishable from similar behaviors with different causes; however, appropriate biomedical probes may identify the causes and in some cases point to corrective steps.

Third, physical trauma or deprivation, particularly in the earliest stages of life, may create global deficits of functioning. Some of these deficits may have neurological or other physical correlates in the school-age child; others may not. Shonkoff (in this volume) reviews a variety of genetic, prenatal, perinatal, and postnatal conditions that have among their sequelae global impairment of intellectual functioning. Many of these conditions, such as maternal malnutrition or lead intoxication, can be prevented; others, such as phenylketonuria (PKU), can be significantly ameliorated if detected early. In most cases, however, the damage cannot be corrected by known physical treatments when the child has reached school age. Remediation in these cases must address the symptom; that is, it must take the form of an educational program designed to meet the needs of an impaired learner. Within the limits of current knowledge there appear to be no differences between the educational treatments that work best for children who have global learning difficulties due to physical causes and those that work for other children with global deficits. Future research may lead to medical or educational interventions addressing physically based, global learning problems; if so, identification of long-term physical causes will become a major function of biomedical assessment in educational contexts. For now, however, its primary functions are the detection of physical impairments in mentally normal children and the detection of neuropsychological conditions that impair intellectual functioning but are distinct from mental retardation as it is usually conceived.

Another distinction is also important to understanding our view of biomedical assessment. Certain assessment procedures can be performed at relatively low costs; they give a preliminary indication of where a child's problem may lie. Other procedures are more extensive and require the services of highly trained professionals and are, therefore, costly.
Screening procedures of the first kind are appropriate to use with all children who have been referred for learning problems. Detailed diagnostic procedures of the second kind are appropriate for use in a small number of carefully targeted cases.

Screening procedures are exemplified by the biomedical portion of Mercer's System of Multicultural Pluralistic Assessment (SOMPA) (Mercer and Lewis, 1978), a battery of instruments designed for use in comprehensive educational assessment. SOMPA includes six biomedical measures: the Snellen test of visual acuity, a measure of auditory acuity, weight standardized by height, a set of physical dexterity tasks, a health history inventory, and the Bender Visual Motor Gestalt Test (a test that requires the child to copy a set of figures, which is regarded as indicative of perceptual maturity and neurological impairment). None of these measures is sufficient in itself to precisely pinpoint a disability or to specify the necessary remediation. Each is capable, however, of identifying a general area of disability, within which more precise measures can be taken. In some cases the screening measures may point to widely prevalent problems, for which more refined diagnosis and remediation are routine; detection of common visual problems is an obvious example. In other cases the measures may point to areas of disability for which further diagnostic work may be extensive and for which remediation may or may not be available.

When a preliminary screening indicates the possible existence of neurological problems, a variety of specialized cognitive, sensory, and motor tests come into play. Interpretation of the results, which requires the services of a specialist in neuropsychology, rests on a large body of data accumulated mainly during the last 15 years (Hecaen and Albert, 1978; Lezak, 1976; Reitan and Davison, 1974). Unlike traditional ability and intelligence testing, neuropsychological analysis depends on at least four different uses of testing results: the level of function, pathognomonic signs, patterns, and disparities between the left and right sides of the body.

Investigations of individuals whose IQs fall in the mildly mentally retarded range (Matthews, 1974) have shown that their performance is sometimes strongly suggestive of localized lesions in the brain. Initially, in the classroom, poor performance may appear to be global in nature, whereas on closer investigation it may be seen as part of a picture resulting from selective damage to the nervous system. For example, a child may demonstrate low verbal ability, which is itself due to a lateralized damage to the speech centers of the brain. Other tests, such as comparison of performances from the two sides of the body, may reveal that the lateralized damage appears in other areas besides speech and language.

Some performances on tests are pathognomonic; that is, in this context, diagnostic of cerebral damage.
For example, a partial hemiplegia may be revealed by unusual discrepancies between finger tapping of the left and right hands. Or abnormalities of the sensory pathways may be revealed by failures of recognition in tactual performance tests.

The application of neuropsychological analysis is by no means straightforward for young children and those whose verbal skills are impaired (Boll, 1974). Nevertheless, a thorough examination of neuropsychological integrity, based on knowledge of the structural features of the brain, can lead to the detection of specific genetic, traumatic, or pathophysiological conditions (Benson, 1974).

Adaptive Behavior Scales

As noted earlier, the AAMD as well as the federal government and many states define mental retardation as "significantly subaverage general intellectual functioning, existing concurrently with deficits in adaptive behavior, and manifested during the developmental period" (Grossman, 1977:5, emphasis added). The AAMD goes on to define adaptive behavior as "the effectiveness or degree to which the individual meets the standards of personal independence and social responsibility expected of his age or cultural group" (Grossman, 1977:11). This broad definition is consistent with numerous more specific definitions that have been proposed by theoreticians and researchers (Coulter and Morrow, 1978, Chapter 1).

Because the definition is so broad, it has given rise to a large number of instruments (at least 132, according to a review cited in Meyers et al., 1979) that stress different aspects of adaptation and have different metric properties. However, as Meyers et al. point out, most of these instruments share certain general characteristics that sharply distinguish them from intelligence tests: (1) they focus on behavior rather than thought processes; (2) they focus on common or typical behavior rather than on "potential"; that is, they are descriptive rather than necessarily implying the existence of underlying traits or capacities; and (3) they are based on reports of informants, usually parents or teachers, rather than on direct observation of a child's performance.

Most of these instruments have been designed specifically for use with mentally retarded people and are particularly appropriate for differentiating levels of functioning in individuals who are clearly below the normal range. However, a few are designed for use in the public school population and are intended to help discriminate "EMR" from "normal" children. Our discussion is particularly concerned with the latter instruments, of which the most widely used are the AAMD Adaptive Behavior Scale-Public School Version (ABS) (Lambert et al., 1975) and the Adaptive Behavior Inventory for Children (ABIC) (Mercer and Lewis, 1978; Mercer, 1979).
The two instruments have much in common, both in content and purpose, yet they also exhibit some important differences. Together they illustrate most of the major issues involved in the use of adaptive behavior scales in the schools.

The AAMD public school scale, which was derived from an earlier AAMD scale designed for mentally retarded people (Nihira et al., 1969), has two parts. The first contains 10 competence domains, each with one or more subscales: independent functioning (eating, toileting, etc.), physical development, economic activity (budgeting and shopping), language development, numbers and time, vocational activity, self-direction (initiative, perseverance, use of leisure time), and responsibility and socialization (cooperation, considerateness, interaction with others). The second part contains 12 domains of maladaptive behavior: violence and destruction, antisocial behavior, rebellion, untrustworthiness, withdrawal, stereotyped behavior and odd mannerisms, inappropriate manners, unacceptable vocalizations, unacceptable or eccentric habits, hyperactivity, psychological disturbance, and use of medication. The school version of the ABS is normally completed by a teacher, although at least one study has shown a high degree of agreement between parents and teachers in describing children's behavior with the ABS (Cole, 1976). The ABS school version has been standardized on a sample of 2,600 children, including normal children and children identified as EMR, trainable mentally retarded, and educationally handicapped. The standardization sample included a wide range of socioeconomic levels and ethnic backgrounds.

The ABIC is part of SOMPA, a comprehensive system for assessment of children from diverse cultural groups. This instrument includes 242 items, each referring to a specific practical or social skill or behavior. For example, can the child take a message on the telephone? Does the child cross the street with the traffic light? Does the child visit friends outside the neighborhood? Questions are answered by the child's mother or mother substitute. Most of the items are age graded, over the elementary school range from five to eleven; gradings are based on data from an extensive pretest and from the norm sample (described below). Items are organized into six competence areas or subscales: family, community, peer relations, nonacademic school roles, earner-consumer, and self-maintenance. Scores are normalized within each subscale and calibrated to yield a mean of 50 points and a standard deviation of 15. Subscale scores are averaged to yield an overall score. The instrument has been standardized on a sample of almost 2,100, including equal numbers of black, Hispanic, and white children, spanning a range of socioeconomic levels.

It is apparent that there is considerable overlap between the ABS and ABIC (and other adaptive behavior scales) in the types of behavior covered. There are differences as well.
The ABS is completed by the teacher and focuses on adaptive behavior within the school. It contains items with intellectual content of the sort found in IQ tests. In contrast, the ABIC is completed by the mother and concentrates more exclusively on practical skills and social behavior exhibited outside the school. It is not surprising, therefore, that some of the ABS subscales (numbers and time, economic activity, and language development) correlate about .6 with IQ, whereas other scales show modest correlations, generally below .2 (Lambert, 1978). The ABIC subscales show uniformly low correlations with the WISC-R (Mercer, 1979). As Meyers et al. (1979) note, there is a wide range of variation in correlations with IQ among adaptive behavior scales generally, depending on, among other factors, item content and the populations sampled.

Another important characteristic of the ABIC is that subscale scores and overall scores have almost identical distributions among black, white, and Hispanic children (Mercer, 1979). There is some evidence that ethnicity does not affect scores on the ABS within EMR and regular classes (Lambert, 1978). However, since ethnic proportions probably differed between EMR and regular classes in the ABS norm sample, distributions of ABS scores may have differed for the ethnic groups overall.

What are the implications of these characteristics of adaptive behavior scales for use in educational decision making? First, it is evident that adaptive behavior scores are not redundant with IQ. The ABIC and most subscales of the ABS yield information about domains of competence that are distinct from the cluster of abilities tapped by IQ tests. One implication of this fact is that adaptive behavior measures cannot simply be substituted for IQ as measures of general competence. A more important implication is that the use of adaptive behavior measures in assigning children to EMR classes (a practice that is mandatory given existing theoretical and legal definitions of mental retardation) will reduce the numbers of children assigned to such classes relative to the numbers that would be assigned on the basis of IQ alone. (This is so because many children with low IQs have adequate adaptive behavior scores.) As we saw in Chapter 2, this outcome has been observed in practice.

The latter implication raises the important question of how children with low IQs but high adaptive behavior scores will fare in regular classes. The answer depends in part on how well those classes are designed to match the pace of instruction to each child's individual needs, an issue to which we return in Chapter 4. It also depends on how much the social and practical skills measured by adaptive behavior scales contribute to school success.
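The effect of requiring deficits on both measures, noted above, can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the cutoff values, the records, and the function names are invented and do not come from the ABS, the ABIC, or any placement regulation. The sketch simply shows that a conjunctive rule (low IQ and low adaptive behavior) can only select a subset of the children selected by IQ alone.

```python
# Hypothetical cutoffs, chosen only for illustration.
IQ_CUTOFF = 70
ADAPTIVE_CUTOFF = 35

# Invented (IQ, adaptive behavior score) records.
children = [
    (65, 30), (68, 52), (72, 28), (60, 48), (69, 33), (85, 55), (66, 50),
]

def low_iq(child):
    """Rule 1: eligibility based on IQ alone."""
    iq, _ = child
    return iq < IQ_CUTOFF

def low_on_both(child):
    """Rule 2: eligibility requires low IQ and low adaptive behavior."""
    iq, adaptive = child
    return iq < IQ_CUTOFF and adaptive < ADAPTIVE_CUTOFF

iq_only = [c for c in children if low_iq(c)]
joint = [c for c in children if low_on_both(c)]

print("eligible on IQ alone:", len(iq_only))
print("eligible on both measures:", len(joint))
```

Because the second rule adds a condition to the first, the jointly eligible group can never be larger than the IQ-only group. How much smaller it is depends on how many low-IQ children have adequate adaptive behavior scores, which is an empirical matter.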
A second potential set of implications concerns the effects of adaptive behavior scales on ethnic disproportions in special education. Some have expressed the hope that the use of adaptive behavior measures will reduce the disproportionate representation of minorities in EMR classes. Logically, there is no necessity for such an outcome. As Coulter and Morrow (1978) point out, the use of one measure (adaptive behavior) that shows no ethnic differences does not affect the ethnic differences in another measure (IQ). If IQ and an ethnically neutral adaptive behavior measure, such as the ABIC, were jointly used to place children, the IQ could in effect control the ethnic composition of the group ultimately assigned to EMR classes, depending on the decision rules used to combine the measures. However, there is some evidence, cited in Chapter 2, that the use of adaptive behavior measures does in fact decrease ethnic disproportion in EMR placement.

A final set of implications concerns the utility of adaptive behavior data
in designing programs of instruction. As Coulter and Morrow (1978) point out, the distinction between using adaptive behavior measures as classificatory devices and using them as guides for programming is a critical one. Different measures may be appropriate for the two purposes. To date, the use of adaptive behavior measures in programming has been confined mainly to individuals whose deficiencies in functioning place them well below the EMR range. Measures geared to mildly mentally retarded populations have been used primarily for classification. It is easy to envision possible instructional applications of adaptive behavior scales in pinpointing areas of relative strength to be built on and areas of particular weakness to be remedied. Some areas needing remediation might be skills that are appropriate parts of the regular curriculum, e.g., telling time, mastering numbers, learning to handle money. Others might be the modification of practical skills, such as dressing and hygiene, which would not be part of the curriculum for most children but might well be included in a program for mentally retarded children. Still others might be the modification of maladaptive social behaviors that interfere with learning of any kind, e.g., destructiveness or withdrawal. However, these potentially promising applications remain largely unexplored.

COMPREHENSIVE ASSESSMENT IN CONTEXT: A TWO-PHASE PROCESS

Throughout our discussion of the elements of comprehensive individual assessment, we argue repeatedly that assessment should be linked to instruction: that it should discriminate among children who can profit from different modes of instruction or who require different forms of intervention before conventional instruction can work. This section suggests an even more fundamental link between assessment and instruction. The section is premised on the belief that what seem to be individual failures are often failures of the educational system.
Children may do poorly in class because they have not been taught or managed appropriately, and this may be disproportionately true of minority children. If this belief is correct, no assessment of the causes of learning failure would be complete without a systematic examination of the teaching and learning environment.

Moreover, there are good reasons to examine the learning environment before subjecting a child to a comprehensive individual assessment of the kind described above. Merely to be singled out as a learning failure and evaluated for placement in a category such as EMR may be distressing to a child and the child's parents and may affect the subsequent behavior of teachers and peers toward the child. And even with the most comprehensive and conscientious of assessments, there is some risk that the child will be misclassified. Given these risks of emotional damage, stigma, and misclassification, protection of the child's rights and interests would seem to require that possible deficiencies of the learning situation be examined and ruled out before comprehensive assessment begins.

We conclude that an ideal assessment process would take place in two phases, beginning with an assessment of the learning environment and proceeding to a comprehensive assessment of the individual child only after it has been established that he or she fails to learn in a variety of classroom settings under a variety of well-conceived instructional strategies.5

Our conclusion is very much in the spirit of P.L. 94-142 and the regulations implementing Section 504 and P.L. 94-142, which stipulate that students be placed in special education programs only when "the education of the person in the regular environment with the use of supplementary aids and services cannot be achieved satisfactorily" (34 CFR 104.34(a); see also 20 USC 1412(5)(B), 34 CFR 300.550).6 The main thrust of this provision has obviously been toward mainstreaming children already diagnosed as handicapped. However, a neglected implication of the provision is that there must be a systematic attempt to determine whether satisfactory progress can be achieved in a regular class. In the case of children who, under present circumstances, would be referred for possible placement in EMR classes, we suggest that there is much to be gained by making this determination without waiting until the label is assigned.

There are no universally established procedures for conducting the kind of two-phase assessment that we envision, nor is there a fully developed, widely used technology for conducting an assessment of the instructional environment.
It is, therefore, incumbent on us to suggest the broad outlines of a procedure and to point to some directions that development of technology might take.

5 One exception to the principle that environmental assessment should precede individual assessment is the case of biomedical screening for high-prevalence problems, such as vision defects. As suggested earlier, such screening is not stigmatizing and is appropriate for children who have not experienced classroom failure as well as for those who have.

6 After the split of the U.S. Department of Education from the U.S. Department of Health, Education, and Welfare, the Code of Federal Regulations was revised to transfer the education regulations from the Public Welfare Title (Title 45) to an independent title for education (Title 34). The citations of regulations for Section 504 and P.L. 94-142 in this report are to their new location in the Code of Federal Regulations.

What kinds of information might be included in an ideal phase-one assessment? First, there should be some evidence that schools are using curricula known to be effective for the student populations they serve. Such
evidence might be provided by publishers or independent researchers or, better yet, by the district's own data. It is important that the data show not only that the curriculum is effective for students in general but also that it is effective for the various ethnic, linguistic, and socioeconomic groups actually served by the school or district in question. Standardized achievement tests or criterion-referenced performance tests (see below) might serve as assessment devices.

Second, there should be evidence that the teacher has implemented the curriculum effectively for the student in question. Such evidence might include documentation that other children in the class are performing adequately and that the child in question has been adequately exposed to the curriculum, i.e., has not missed many lessons due to absence, disciplinary exclusions from class, etc. Such evidence might also include observational data collected by a school psychologist, educational consultant, or resource teacher, showing that the child's teacher is providing adequate classroom management and appropriate instruction in accord with the curriculum; that he or she is attending to the child in question and providing appropriate direction, feedback, and reinforcement; and that the child is participating adequately in the instructional process. Observational data could also be used to detect and document problems of management and/or misbehavior that interfere with the effectiveness of the curriculum, e.g., lack of attention, disruption of class, and the like.

Third, there should be objective evidence that the child has not learned what was taught. Again, standardized norm-referenced tests or criterion-referenced tests keyed to the curriculum itself might be used for this purpose. Assessment of the child's progress should, however, be frequent enough so that problems are detected early and so that the child is not allowed to spend weeks in the classroom, falling further and further behind, without the teacher noticing.

Finally and most important, there should be evidence that, when early problems were detected, systematic efforts were made to locate the source of the difficulty and to take corrective measures. Again, school psychologists or specially trained educators could play a role, acting as consultants to the teacher in suggesting remedial approaches. Under some circumstances it might be appropriate to change teachers or curricula, in an attempt to find a better match to the child's needs. Results of such attempts at improvements should be documented, and only after reasonable efforts have been exhausted should the child be referred formally for assessment.

What kinds of instruments are needed to support this two-phase assessment process? Some possible answers have already been suggested. Standardized achievement tests can play a role in evaluating strong and weak points in the curriculum as a whole; assuming that sufficiently reliable
tests are selected, they can also be used to assess the performance of individual children. The growing literature on "effective schools" suggests that these uses of standardized tests are among the distinguishing characteristics of schools that are particularly effective in teaching minority children from low-income families (see Chapter 4).

A developing technology that may have promise is criterion-referenced testing. Criterion-referenced tests are used to measure mastery of specific domains of subject matter. A child's performance is judged against some absolute standard; a typical measure might be the number of arithmetic problems of a specific sort that the child can solve. The child's performance is not scaled against that of other children, nor is the test used to draw inferences about broad intellectual abilities. Many informal, teacher-made tests are in effect criterion referenced, as are many of the tests included in packaged curricula and teachers' manuals accompanying standard textbooks. Recently, there have been advances in thinking about the design of such tests (e.g., Martuza, 1977; Harris et al., 1974), and improvements in their psychometric properties may be in the offing. Such tests are of interest in the context of this report because of their close link to instruction. They can be used at the beginning of an instructional sequence to determine whether the child has the prerequisite skills needed to profit from the instruction, and they can be used at the end of a sequence to determine whether the child has absorbed the material or needs further work to achieve mastery. Thus, they can potentially be used to evaluate the outcomes of the systematic variations in instruction that are part of a phase-one assessment.

Another technology that has some promise is systematic observation in the classroom.
Systems for analyzing and recording behavior in the classroom have a long history in educational research (Medley and Mitzel, 1963). Most of the instruments used are too costly, time-consuming, and demanding in terms of observer training to be practical for use in self-evaluation by schools. However, there have been recent suggestions that suitably simplified and focused instruments may be useful as diagnostic devices and guides for the remediation of specific behavior problems (e.g., Alessi, 1980; Baker and Tyne, 1980). Observations have also been used by researchers to measure the implementation of curricula (Stallings, 1977) and time devoted to academic activities (Rosenshine and Berliner, 1978). Again, simplified observation systems may be useful for similar purposes in assessing the quality of learning environments.

None of the above suggestions about procedures and instrumentation is novel. All have been tried, in varying combinations, in different school districts. A few large districts have gone far in implementing systematic procedures of instruction and closely linked assessment; some of these
districts have reported dramatic improvements in students' basic academic skills (Carnine et al., 1981; Monteiro, 1981) and, by implication, a decline in the rate of learning failures. These reports encourage us to believe that the suggestions above are both feasible to implement and potentially effective.

The two-phase assessment process clearly entails new costs: the costs of training and maintaining staff to conduct evaluations of the learning environment. The process also entails financial savings, by reducing the number of children referred for costly, comprehensive assessments and possibly also the number who must be maintained in costly special classroom settings.

SUMMARY AND CONCLUSIONS

The discussion in this chapter follows from the premise that the main purpose of assessment in education is to improve instruction and learning. Children are or should be assessed in order to identify strengths and weaknesses that necessitate specific forms of remediation or educational practice. Remediation may take the form of intervention outside the school, such as medical treatment or family intervention. We believe, however, that a significant portion of children who experience difficulties in the classroom can be treated effectively through improved instruction.

These basic assumptions lead to a perspective on assessment and its contribution to ethnic and sex disproportions in EMR classes that is different from the one with which the study began. A concern with disproportion per se dictates a focus on bias in assessment instruments and a search for instruments that will reduce disproportion. A concern with instructional utility leads to a search for assessment procedures and instruments that will aid in selecting or designing effective programs for all children.
We believe that better assessment and a closer link between assessment and instruction will in fact reduce disproportion, because minority children have disproportionately been the victims of poor instruction. We also believe that the problem should be attacked at its roots, which lie in the presumption that learning problems must imply deficiencies in the child and in consequent inattention to the role of education itself in creating and ameliorating these problems.

This viewpoint has led us to urge a greatly increased emphasis on systematic educational intervention before a child is referred for individual assessment. When poor instruction has been ruled out as a cause of learning failure, it becomes appropriate to look for problems within the child or in the child's environment outside the school, again with an eye toward problems that can be corrected; this is the purpose of individual assessment.
We believe, and have cited evidence to support our belief, that an assessment procedure like the one we outlined will significantly reduce the proportion of children whose failure to learn must be attributed to global intellectual deficits. The question remains whether it is necessary or useful to apply the label EMR to this residual group or to separate them from other children for instructional purposes. The answer, in our view, must hinge on another question: Do these children require and can they profit from modes of instruction that are different from those that work best with other children who have experienced learning difficulties? We turn to this question in the next chapter.
