| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
3
What Are Indicators?
DEFINING INDICATORS
Identifying the domains that need to be monitored is the first
step in developing indicators of the quality of science and
mathematics education. The next step is to define what indicators are and
how they should be distinguished from such other data as simple
descriptive statistics or various kinds of qualitative information. In
its earlier report (Raizen and Jones, 1985:27-28), the committee
defined an indicator as "a measure that conveys a general impression
of the state or nature of the structure or system being examined.
While it is not necessarily a precise statement, it gives sufficient
indication of a condition concerning the system of interest to be of
use in formulating policy." For a statistic or measure to be used as
an indicator, it must have a reference point so that a judgment can
be made whether the condition being described is getting better or
worse (Oakes, 1986). The notion of judgment has been integral to the
development of social indicators, as reflected in an early report by
the U.S. Department of Health, Education, and Welfare (1969:97):
[An indicator is a] statistic of direct normative interest which facilitates
concise, comprehensive and balanced judgement about the condition
of major aspects of society. It is, in all cases, a direct measure of
welfare and is subject to the interpretation that if it changes in the
"right" direction, while other things remain equal, things have gotten
better, or people are better off.
INDICATORS OF SCIENCE AND MATHEMATICS EDUCATION
The literature on indicators is huge (White, 1983), and so an
exhaustive treatment here of distinctions between indicators and
other types of information is impractical. But a recurring theme that
runs through much of this literature is that indicators usually imply
a causal theory or model of how some underlying process operates
to generate a particular value of the indicator. This distinction is
evident in the following definition (Carley, 1981:67-68):
Social indicators, virtually by definition, specify causal linkages or
connections between observable aspects of social phenomena, which
indicate, and other unobservable aspects or concepts, which are in-
dicated. This can only be accomplished by postulating, implicitly
or explicitly, some causal model or theory of social behavior which
serves to relate formally the variables under consideration. All social
indicator research represents, therefore, some social theory or model,
however simplistic. Much research to date laying claim to the term
"social indicators research" consists either of descriptive social
statistics, which some have argued are not social indicators at all, or of
implicit postulations of causal linkages.
To be sure, all indicators are in some sense statistics, although
the reverse is not so clear. Figures on crime rates are obviously
important social indicators, but are the "number of police officers
per capita" social indicators as well? Yes and no. They may be
indicators of the value a society places on security, they may indicate
the presence of an oppressive regime, they may indicate the extent
of patronage, and they may also indicate crime rates indirectly. The
point is that the theory connecting "number of police officers" to
some condition in society is considerably more tenuous and remote
than "number of murders" or "number of property thefts" per capita.
The same logic applies to changes in an indicator versus changes
in a statistic. There is virtually universal agreement on the right
direction of a change in crime rates, but the right direction of a
change in number of police officers (or any other group for that
matter) is open to debate.
How should indicators be used in policy formulation? To an-
swer this question requires knowledge about the goals of a society
as well as a theory about the nexus of causal linkages and processes
that combine to produce the indicator. An unfortunate limitation
of all indicators is that, while they can inform about the state of
their respective domains, they cannot tell how the observed changes
have come about. They cannot tell what, precisely, to do about the
situation. Once the choice has been made on what social condition to
assess, indicators are neutral, summary snapshots of that condition.
Their implications for policy and action derive not from some inher-
ent property they possess, but rather from the theory that the policy
maker has about the underlying processes. However, it is possible to
increase the utility of indicators to policy makers by ensuring that,
to the extent possible, they:
• consist of reliable and valid information that is as closely
  related to an important aspect of the educational system as possible,
• have reasonably direct policy implications,
• are small in number, and
• are easily understood by a broad audience.
In consideration of these criteria, the committee has grouped
its recommendations on indicators into three categories: (1) key
indicators that are or would be feasible given adequate investment
in experimentation and development and that should be included in
even the most parsimonious monitoring system, (2) supplementary
indicators that are presently feasible or might be developed, and
(3) research on hypothesized causal links among some important
but poorly understood aspects of education in order to create and
validate indicators related to these aspects.
INTERPRETING INDICATORS
Once a value has been established for a given indicator, there
are essentially three possible interpretations, all of which involve
comparisons of some sort. First, the value of the indicator might be
compared with some absolute standard. For example, professional
consensus might be used to establish a "minimum knowledge level"
of a new K-5 teacher. An indicator of this could be scores on a
pencil-and-paper test to measure the amount of knowledge attained
by teachers. Interpretation would involve comparison of the teachers'
scores with the absolute standard. (It should be noted in passing that
absolute or ideal values for most indicators are difficult to establish.)
A second interpretation involves comparison of a given indicator
value with its value at some prior time. For example, the percentage
of high school students who took a physics course in a given year
might be compared with the percentage who took a physics course
in some prior year.
Third, indicators can be presented as a basis for the compari-
son of instructional programs, demographic groups, states, regions,
countries, and so on. The proper interpretation of such compar-
isons is limited because of differences in social, political, economic,
cultural, and other characteristics. Nevertheless, when data are dis-
aggregated on any basis and presented side by side on a page, the
temptation to make evaluative comparisons, whether warranted or
not, is overwhelming and nearly universally succumbed to.
Problems in interpreting educational indicators fall into three
broad categories and are sufficiently pervasive to merit brief mention
here, together with suggestions for avoiding or at least minimizing
their adverse consequences. The problems are (1) choice of vari-
ables, (2) levels of aggregation, and (3) scale. These problems of
interpretation have to be faced before data collection can begin.
Choice of Variables
Even after the key domains to be monitored have been identified
(for our purpose, student learning, general scientific and mathematical
literacy, student behavior, teaching quality, curriculum quality,
and financial and leadership support), the number of possible
variables from which to choose in constructing indicators of science
and mathematics education remains large; a partial list could well
number over 100. According to the committee's formulation, vari-
ous teacher and student behaviors and the incentives and constraints
that influence them are presumed to be causally related; for example,
the quality of the curriculum and the use of it made by the teacher
affect student competence in science and mathematics and student
attitudes. To what extent will the conclusions one draws from one
combination of variables be similar to the conclusions one would
have drawn had a different set of variables of the same underlying
condition been used to construct the indicator? The answer to this
hypothetical question depends critically on the quality of the sets of
variables and the manner in which they were combined. One gets
an entirely different picture of the educational health of the nation
depending on whether one looks at high school dropout rates, results
from the National Assessment of Educational Progress (NAEP), stu-
dent career choices, or amount of homework assigned per pupil. Each
variable or combination of variables highlights a different aspect of
the complex construct "educational health." The accuracy and ap-
propriateness of interpretations and policy decisions are limited by
the quality of the indicators themselves and the manner in which
they are combined.
Problems of Aggregation
When data are aggregated from one level (e.g., students) to an-
other (e.g., classrooms, schools, or districts), numerous interpretive
difficulties arise. Data should be collected and aggregated according
to a clear conception of schooling and with a view of who will
use the information and for what purpose. Data aggregated to levels
that are inappropriate to relevant policy decisions may be quite
misleading; for example, statewide averages on teacher salaries may
not be useful information for a particular school district. In general,
data at the level of the individual student are most useful to that
student's teacher, classroom-level data are of most interest to principals,
school-level data are most useful to superintendents, and so
on.
Aggregation Effects and the Ecological Fallacy. Levels of aggregation
exert important effects on correlation coefficients. These effects
help to explain why the results of educational research vary so
much from study to study. What, for example, is the correlation
between socioeconomic background and achievement test scores? Is
it .3? .6? .9? All three are possible. The correlation depends on the
unit of analysis, the population sampled, and the way the two con-
structs are measured. If one takes these three factors into account,
the results are fairly consistent.
Using national samples of high school students, family income
correlates about .3 with achievement test results at the student level.
Aggregating to the school level, the correlation is between .5 and .6
among school means nationally. If, however, one looks within large
urban districts, the school-level relationship is between .8 and .9. The
district-level relationship varies from state to state (.2 to .6), and at
the state level the correlation between 1975 poverty rates and state
achievement estimates is .63 (N = 50 states). Table 3-1 summarizes
these results.
Other differences are found when looking at different grade lev-
els, or when indicators other than poverty are used to represent
home background. For example, in Project TALENT, an indicator
of socioeconomic environment based on home variables that were
hypothesized to exert a more direct effect on achievement (mother's
education, books in the home, child has own desk, etc.) correlated .5
at the student level for high school students (Flanagan and Cooley,
1966).
TABLE 3-1 Socioeconomic Background and Achievement

Unit of Analysis   Population Sampled     SES Indicator       Correlation
Student            National               Income              .2 to .4
Student            National               Home environment    .5
School             Large urban district   Income              .8 to .9
School             National               Income              .5 to .6
District           Within state           Income              .2 to .6
State              National               Income              .6

Source: Cooley et al. (1981).
What is the appropriate unit of analysis? It depends, of course,
on the question being asked. A scatterplot depicting the modest
relationship between socioeconomic status (SES) and achievement
at the student level is typically an oval-shaped swarm of points
with few outliers. Given this fact, inferring from the within-district
school-level correlation of .9 that most low-achieving students come
from poor homes is an excellent example of what sociologists call the
ecological fallacy: the error of using relationships at one level, such
as school, to describe relationships at a lower level, such as student
(Robinson, 1950).
Correlations at one level of analysis differ from correlations at another
because of the grouping effect. This occurs when membership
in the group (e.g., class or school) is related to either one or both
of the variables being correlated. For example, the socioeconomic
homogeneity of neighborhoods produces a relationship between SES
and school, and that relationship produces the larger correlation be-
tween SES and achievement at the school level than at the student
level.
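
The grouping effect can be illustrated with a small simulation. The sketch below is not drawn from the studies cited in this chapter; it assumes a hypothetical model in which each school draws its students from a neighborhood with its own average SES, and achievement depends only modestly on a student's own SES. All parameter values and the function name correlations_by_level are illustrative assumptions.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def correlations_by_level(n_schools=200, n_students=100, seed=0):
    """Return (student-level r, school-level r) for simulated data."""
    rng = random.Random(seed)
    ses, ach = [], []              # one entry per student
    ses_means, ach_means = [], []  # one entry per school
    for _ in range(n_schools):
        # Assumption: neighborhoods are socioeconomically homogeneous,
        # so school membership is related to SES.
        school_ses = rng.gauss(0, 1)
        s = [school_ses + rng.gauss(0, 1) for _ in range(n_students)]
        # Assumption: achievement depends modestly on the student's own
        # SES, plus a large amount of individual variation.
        a = [0.4 * x + rng.gauss(0, 1.2) for x in s]
        ses.extend(s)
        ach.extend(a)
        ses_means.append(statistics.fmean(s))
        ach_means.append(statistics.fmean(a))
    return pearson(ses, ach), pearson(ses_means, ach_means)


student_r, school_r = correlations_by_level()
print(f"student-level r = {student_r:.2f}, school-level r = {school_r:.2f}")
```

Under these assumptions the student-level correlation is modest while the correlation among school means is much larger, even though the same simulated students underlie both figures. Averaging within schools removes most of the individual variation but leaves the between-school SES differences intact.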
Many statisticians would argue that the proper procedure is not
to use correlations at all when, as in the case illustrated, regressions
are appropriate (see, e.g., Cain and Watts, 1970). However, the use
of correlations is so universal in analyzing and reporting educational
data that we consider it important to warn against misinterpre-
tations. Our brief discussion of problems in "ecological inference"
merely scratches the surface. A detailed and comprehensive (although
not too technical) treatment is provided by Langbein and
Lichtman (1978).
Inconsistent Aggregation and Self-Selection. Every student of
elementary statistics is warned early in instruction that teasing out
causal relations among any set of variables can be a tricky and often
misleading endeavor. It is surprising how often unwarranted causal
conclusions are drawn from summary indicators, whether or not
the persons involved have had training in data interpretation. The
temptation, for example, to judge the quality of education in a state
by the mean Scholastic Aptitude Test (SAT) scores of its graduates,
despite cautionary statements issued by the College Board (e.g.,
Hanford, 1986), is a case in point. This annual practice illustrates in
a nutshell most of the pitfalls considered in this section.
Why are mean SAT scores inappropriate indicators of the com-
parative quality of instruction in the various states? First, consider
the problem of sample representativeness. How representative are
students who take the SAT of the typical high school graduate? In
general, college-bound seniors (the SAT population) are better pre-
pared academically than their noncollege-bound counterparts. More-
over, there is wide variation in the percentage of students by state
who take the SAT. (Some state institutions of higher education re-
quire SAT scores for admission; some do not; others require scores
on tests administered by the American College Testing Program.)
For example, in 1984, the percentage of high school seniors by state
who took the SAT ranged from a low of 3 to over 65 percent. For
various reasons, including self-selection, the smaller the percentage
of students taking the SAT, the higher their mean SAT scores. Thus,
inconsistent aggregation leads to false and misleading comparisons.
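
A minimal sketch of the self-selection effect follows. It assumes, purely for illustration, that every state has the same underlying distribution of scores (mean 500, standard deviation 100, hypothetical values) and that the best-prepared seniors are the ones who take the test. Only the participation rate differs between the simulated states; the rates of 3 and 65 percent echo the 1984 range mentioned above.

```python
import random
import statistics


def mean_score_of_test_takers(participation, n_seniors=20000, seed=0):
    """Mean score among the seniors who opt to take the test.

    Hypothetical model: identical score distributions in every state,
    with the highest-scoring seniors being the ones who participate.
    """
    rng = random.Random(seed)
    scores = sorted((rng.gauss(500, 100) for _ in range(n_seniors)), reverse=True)
    n_takers = max(1, round(participation * n_seniors))
    return statistics.fmean(scores[:n_takers])


# Same simulated population each time; only the participation rate changes.
rates = [0.03, 0.20, 0.65, 1.00]
means = [mean_score_of_test_takers(p) for p in rates]
for p, m in zip(rates, means):
    print(f"participation {p:4.0%}: mean score of test takers {m:6.1f}")
```

In this sketch the mean score falls steadily as participation rises, although the quality of instruction (here, the underlying distribution) is identical across the simulated states. Real self-selection is less extreme than a strict top-down cutoff, so the sketch exaggerates the size of the effect but not its direction.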
Problems of Scale
The first interpretive problem in this category involves the ac-
tual scale itself. Should absolute values (number of science and
mathematics teachers, number of students taking at least two years
of mathematics, etc.) be used, or should various ratios (for exam-
ple, science teachers per 100 or 1,000 pupils or the ratio of science
and mathematics teachers to all teachers) be used? Often but not
always ratios and proportions are more informative reporting units.
A simple example illustrates why this is so. An absolute increase in
the number of unemployed persons who are actively seeking employ-
ment is generally agreed to be a move in the wrong direction. But
such an increase, by itself, may be misleading. If the entire labor
force has increased significantly, it is possible that an increase in
the absolute number of unemployed actually represents a decrease
in the unemployment rate, that is, the percentage of the labor force
that is unemployed. A counterexample from education involves increasing
the length of the elementary school day and introducing an
additional subject, say, health and family education. Under these
circumstances, the proportion of school time devoted to mathemat-
ics might decrease, an apparent move in the wrong direction, but
the number of minutes per day given to mathematics might actually
increase.
In many situations, it is wise to collect information on and
report both types of figures. For example, it may be important to
know both the absolute number of minutes per day a student spends
doing mathematics as well as the percentage this figure represents of
the student's total time spent on school work.
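
The two examples above can be worked through with a few lines of arithmetic. All figures in this sketch are hypothetical and chosen only to make the point; they are not data from this report.

```python
def share(part, whole):
    """Return part as a fraction of whole."""
    return part / whole


# Hypothetical labor-force figures: more people unemployed, lower rate.
unemployed = {"before": 8_000_000, "after": 8_400_000}
labor_force = {"before": 100_000_000, "after": 120_000_000}
rate_before = share(unemployed["before"], labor_force["before"])
rate_after = share(unemployed["after"], labor_force["after"])

# Hypothetical school-day figures: more minutes of mathematics,
# but a smaller share of a longer day.
math_minutes = {"before": 60, "after": 66}
day_minutes = {"before": 300, "after": 360}
math_share_before = share(math_minutes["before"], day_minutes["before"])
math_share_after = share(math_minutes["after"], day_minutes["after"])

print(f"unemployment rate: {rate_before:.1%} -> {rate_after:.1%}")
print(f"mathematics share of day: {math_share_before:.1%} -> {math_share_after:.1%}")
```

In both cases the absolute figure and the ratio move in opposite directions, which is why reporting only one of them can mislead.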
Another scale issue that, surprisingly, often goes overlooked is
the use of scale units that change over time or that have different
meanings in different locations. The most commonly used units are
those involving monetary values. Total school budgets, dollars spent
on laboratory equipment, and teacher salaries are all examples of
scale units that vary to the extent that the value of the dollar varies
over time and over locations. Results not adjusted for this variation
may seriously distort the picture. Thus, total school expenditure
should be adjusted to total expenditure per pupil, with perhaps an
additional adjustment for variations in the cost of living; teacher
salary should similarly be adjusted to account for local cost of living,
and so on.
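
One way to make such adjustments is sketched here. The function name, the two districts, and the cost-of-living index values are all hypothetical; in practice the choice of index is itself a substantive decision.

```python
def adjusted_spending_per_pupil(total_budget, enrollment, cost_index):
    """Per-pupil spending deflated by a local cost-of-living index.

    cost_index is assumed to equal 1.0 for a reference location and to be
    larger where the same goods and services cost more.
    """
    per_pupil = total_budget / enrollment
    return per_pupil / cost_index


# Hypothetical districts: A has the larger total budget, but also more
# pupils and a higher cost of living.
district_a = adjusted_spending_per_pupil(50_000_000, 10_000, 1.25)
district_b = adjusted_spending_per_pupil(30_000_000, 5_000, 1.00)

print(f"District A: {district_a:,.0f} adjusted dollars per pupil")
print(f"District B: {district_b:,.0f} adjusted dollars per pupil")
```

With these illustrative numbers, the district with the larger total budget turns out to spend less per pupil once enrollment and local costs are taken into account.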
INDICATORS FOR WHOM?
This chapter argues that an indicator is more than another layer
on a mound of statistics; rather, it can be used in a systematic
attempt to investigate the interaction among selected pieces of infor-
mation. Federal, state, and local education bureaucracies are awash
in numbers. The challenge taken up by the committee in this report
is to go beyond an endless parade of statistical tables and focus on
the key questions and subsequent indicators that will be credible
to policy makers in state and local education agencies, the major
decision makers, since education in the United States is overwhelmingly
a state and local legal and fiscal responsibility. The challenge
for state and local policy makers is to adopt and use the indicators
that, when combined, best represent a snapshot of what exists today
in mathematics and science education as well as point to promising
policy initiatives.
At all levels of the education system, there is recognition of the
need for a reliable and valid evaluation of how well students know,
understand, appreciate, and use information they have received in
their K-12 mathematics and science experience. And, as with any
evaluation, the initial temptation is to start to collect data before
the key questions have been asked. Once the questions are specified,
most of the data can probably be obtained without generating a new
national information system that may fall under its own weight (see
Appendix E). In this respect, a concern shared by the committee and
state administrators alike is feasibility. By feasibility, we mean that
collection, analysis, and reporting of valid data should be possible in
a timely manner, given reasonable resources. The design decisions
and availability of resources that affect the frequency of collecting
data, as well as methodology, may well be driven by timetables that
allow indicators to interact with and influence policy.
COLLECTING INFORMATION
Once decisions have been made on the type of indicator to be
used (e.g., student test scores, teacher salaries, judgments of curricu-
lar quality), there arises the question of how to collect the pertinent
information. This report argues that a wide range of data-collection
methods is necessary. Some of the recommended methods have been
used extensively in the past, such as surveys; others are less widely
used, such as time-use studies. The key challenge is to tailor the
proposed data-collection methods to the type of information that is
needed.
Comparability Versus Depth of Information
There is a difficult tension in the choice of data-collection meth-
ods between collecting comparable data and being open to unex-
pected responses. For example, closed-ended questionnaires produce
standardized information comparable across space and time and are
particularly suitable for collecting information on such matters as
salaries and defined fringe benefits, for which comparability is criti-
cal, and the nature of the desired information is relatively clear-cut.
Closed-ended questionnaires are poorly suited, however, to the
collection of information dealing with such topics as how teachers
and students spend their time outside school. The reason is that the
range of possible responses is much broader than can be captured by
a closed-ended questionnaire. Consequently, it is important to give
up standardization in favor of capturing diversity. Thus, time-use
studies are more appropriate for collecting this type of information.
A related issue arises in attempts to improve achievement tests,
questionnaires, and the like so that responses mirror more faithfully
and in greater depth, say, what students have learned and are able
to do. Two problems arise: first, to the extent that items, examples,
and questions are improved to capture more and better information,
comparability to earlier assessments is lost. Second, assessments
are likely to become more costly, and sample sizes may have to
be reduced. This may create loss of generalizability (as in studies
using classroom observation), although matrix sampling and other
techniques may partially overcome this problem. These problems
are not cited to argue against improving assessment instruments and
questionnaires (we argue quite the contrary in the next chapter)
but only to sensitize those using indicators to some of the difficulties
involved in designing the requisite collection of data and information.
Timing
How often should information be collected? There is tension
between the expense of collecting information often and the value
of up-to-date information that permits rapid discernment of changes
in trends. The choice of how often to collect data for a particular
indicator should depend on the importance of the indicator for informing
policy and on how rapidly changes are likely to occur in
the distribution of the behavior, incentive, or outcome reflected in
the indicator. Consequently, we argue for the assessment of student
learning at given grade levels every four years, except for science
achievement in elementary school, for which the current improve-
ment efforts warrant assessment every two years. No matter what
the frequency, it is important that each wave of information be col-
lected at the same time of the year so as to maintain consistency and
provide comparable data.
Design of Expert Panels for Assessment
At various places in this report, the committee recommends
the use of panels of experts as a method for assessing instructional
materials and performance when no suitable outcome measure is yet
available. Because the use of experts is an often-used mechanism, we
discuss the problems inherent in its application in some detail.
Based in part on our experience with difficulties encountered in
the experiment on reviewing the science content of science achieve-
ment tests (see Appendix B), we consider it important to make some
general comments about the use of expert panels as an assessment
method. First, there should be a clear understanding among the
panel members as to the intent and interpretation of the material to
be judged or rated. Second, if the tests or other materials are to be
used for various purposes, the panel members should understand and
the ratings should distinguish among these purposes. Third, there
should be agreement as to the rating criteria. Panels can meet these
three conditions by using rater "training" exercises or discussing
their procedures before the actual work begins. Discussion of the
ratings by panelists after they have completed their work may fur-
ther help to clarify whether purposes of the materials and rating
criteria were unambiguous. (However, it is not desirable that the
panel members change their ratings as a result of the post-rating
discussion, at the risk of reducing independence of the panelists' rat-
ings.) Such techniques help to improve the rating process and to
reduce the variability between raters.
Rater Variability. Variability between raters with regard to individual
items is one source of variability in panel assessments. However,
the scores of an individual rater on different items tend to be
correlated. This correlation is one quantification of frequently heard
comments, such as that one rater tends to give high scores and an-
other low scores. It is not generally recognized that, as a result, the
impact of rater variability on the variability of average scores or per-
centiles can be substantially greater than indicated by the variability
between raters item-by-item, perhaps by an order of magnitude. In
the experimental review of science achievement tests, this was true
not only of types of reviewers (teachers, scientists) but also of re-
viewers within type. It is not feasible to eliminate these sources of
rater variability. Thus, panel studies should be designed to provide
estimates of rater variability and correlated variability. Such infor-
mation has the potential for improving the design of expert panels,
for example, for deciding on the number of panel members needed to
yield acceptably reliable estimates of averages, percentiles, or other
statistics of interest. With a positive correlation between the ratings
of an individual reviewer by item, the use of a given number of re-
viewers, each rating every item, will yield less reliable statistics than
a larger number, each rating a randomly chosen subsample of the
items. This may be potentially useful when there is a large number
of items to be rated and the rating process is time-consuming. Ap-
preciation of the sources of rater variability will also help ensure that
standard errors of statistics derived from panel ratings are properly
computed.
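
The design argument above can be made concrete under a simple variance-components model. The sketch assumes, as a simplification not taken from the committee's experiment, that each rating is the sum of an item effect, a rater effect (the rater's general leniency or severity), and independent residual noise, and that items are assigned so that item effects contribute equally to both designs. The variance values and function names are hypothetical.

```python
def var_panel_mean(n_raters, items_per_rater, var_rater, var_resid):
    """Variance of the mean of all ratings under the assumed model.

    Each rater's leniency is shared by all of that rater's ratings, so it
    averages out only over raters, not over items.
    """
    return var_rater / n_raters + var_resid / (n_raters * items_per_rater)


def naive_var(n_raters, items_per_rater, var_rater, var_resid):
    """Variance one would compute by wrongly treating every rating as independent."""
    return (var_rater + var_resid) / (n_raters * items_per_rater)


VAR_RATER, VAR_RESID = 0.25, 0.75  # hypothetical variance components

# Same total workload of 300 ratings in both designs.
full = var_panel_mean(5, 60, VAR_RATER, VAR_RESID)    # 5 raters rate all 60 items
split = var_panel_mean(15, 20, VAR_RATER, VAR_RESID)  # 15 raters rate 20 items each
naive = naive_var(5, 60, VAR_RATER, VAR_RESID)

print(f"5 raters x 60 items:  variance of mean = {full:.4f}")
print(f"15 raters x 20 items: variance of mean = {split:.4f}")
print(f"naive (independent ratings) variance   = {naive:.4f}")
```

With these illustrative values, spreading the same number of ratings over more raters gives a noticeably smaller variance for the panel mean. Ignoring the within-rater correlation understates the variance of the five-rater design by more than a factor of ten, in line with the order-of-magnitude remark above.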
Validity and Reliability. The design of an expert panel should
consider the problems of both accuracy (validity) and precision (reliability).
The concept of accuracy implies that there is a "true" value
to be estimated. The true value may have a theoretical definition or
may be defined only operationally as that value resulting from a set
of carefully specified empirical measurement steps. A panel whose
assessments differ systematically, in either a positive or negative di-
rection, from the true values is "biased." In experiments such as the
science test review, the standards against which raters assign their
scores are critical since they affect the accuracy of the scores as mea-
sures of the relative value of alternative tests. Depending on their
biases, reviewers may give a poor test relatively high ratings and a
good test relatively low ratings so that two tests that differ widely in
their true value are judged, on the basis of average ratings, to be
equally effective. Similarly, ratings of teacher performance based on
classroom observation are likely to be strongly affected by the per-
sonal views of the observer regardless of the procedures established
for the assessment. The steps outlined above will help to minimize
biases due to misunderstandings on the part of panel members. They
will also improve the interpretation of the ratings. It may be possible
to design a questionnaire for potential panel members that would
help ensure ratings free of personal preference or provide a basis for
eliminating the ratings of particular individuals.
Coordination of Strategies for Collecting Data
In each of the chapters that follow, recommendations are made
for data to be collected or observations to be carried out or both.
Implementation of these recommendations will involve surveys and
other data collection strategies that should be coordinated. It is not
the committee's intention that whole, new data systems be set up to
carry out its recommendations. Instead, several existing mechanisms
currently undergoing review and reformulation should be used to
implement the recommended data collections and analyses, including
the redesigned elementary/secondary data collection of the Center
for Education Statistics, the Assessment Center of the Council of
Chief State School Officers, and the educational data improvement
effort intended to lead to common data collection by the states.
In Appendix E we discuss issues of coordination, pulling together
recommendations from throughout the report that imply surveys,
referring to ongoing efforts, and outlining suggestions for how de-
sirable new survey efforts might be implemented. More intensive
survey design planning (including issues of sample size) should be left
to agencies (national, state, or local) that assume or are assigned
responsibility for the indicators.