| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
6
The Theory of Validity Generalization
META-ANALYSIS
Meta-analysis is the combination of empirical evidence from diverse
studies. Although the term meta-analysis has emerged only in the past
two decades, formal methods for combining observations have a long
history. Astronomical observations at different sites and times have been
combined in order to draw general conclusions since the 1800s (Stigler,
1986). Statistical techniques for combining significance tests and
combining estimates of effects in agricultural experiments date from the 1930s
(Hedges and Olkin, 1985).
Several major programs of quantitative synthesis of research have
existed for decades in the physical sciences. For example, the Particle
Data Group, headquartered jointly at Berkeley and Centre Europeen de la
Recherche Nucleaire in Switzerland, conducts meta-analyses of the
results of experiments in elementary particle physics worldwide and
publishes the results every two years as the Review of Particle Properties.
In medicine, meta-analyses are becoming increasingly important as a
technique to systematize the results of clinical trials (Proceedings of the
Workshop on Methodological Issues in Overviews of Randomized
Clinical Trials, 1987), to collect research results in particular areas (the
Oxford Database of Perinatal Medicine), and in public health (Louis et
al., 1985).
In the social and behavioral sciences, meta-analysis has been used
primarily in psychology and education, for such diverse purposes as to
summarize research on the effectiveness of psychotherapy, the effects of
class size on achievement and attitudes, experimental expectancy effects,
and the social psychology of gender differences.
In studying a problem, every scientist must assimilate and assess the
results of previous studies of the same problem. In the absence of a
formal mechanism for combining the past results, it is always tempting
to assume that the present experiment is of prime originality and to
ignore or dismiss inconvenient or contradictory results from the past.
Many superfluous data are collected because it is too difficult or
confusing or unconvincing or unglamorous to assemble and examine
what is known already.
Meta-analysis attempts to provide a formal mechanism for doing so.
By combining information from different studies, meta-analysis in-
creases the precision with which effects can be estimated (or increases
the power of statistical tests of hypotheses). For example, many clinical
trials in medicine are too small for treatment effects to be estimated with
accuracy, but combining evidence across different studies can yield
estimates that are precise enough to be useful. In addition, meta-
analysis produces more robust evidence than any single study. The
convergence of evidence produced under differing conditions helps to
ensure that the effects observed are not the inadvertent result of some
unrecognized aspect of context, procedure, or measurement. And
finally, meta-analysis usually involves some explicit plan for sampling
from the available body of research evidence. Without controls for
selection, it is possible to obtain very different pictures of the evidence
by selecting, perhaps inadvertently, studies that favor one position or
another.
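The gain in precision from combining studies can be illustrated with a small sketch. It uses inverse-variance weighting, one common pooling scheme that this chapter does not itself prescribe, and it assumes independent studies that estimate the same effect. The function name and the study figures are hypothetical.

```python
import math

def pool_estimates(estimates, std_errors):
    """Inverse-variance weighted mean of study estimates and its standard error.

    Assumes the studies are independent and estimate a common effect.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * x for w, x in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

# Hypothetical data: four small studies, each with standard error 0.10.
pooled, pooled_se = pool_estimates([0.20, 0.35, 0.25, 0.30], [0.10] * 4)
print(round(pooled, 3), round(pooled_se, 3))
```

Under these assumptions the pooled standard error is half that of any single study, which is the sense in which small studies can jointly yield a usable estimate.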
Although there is no general prescription for carrying out a meta-
analysis, the procedure can be divided into four steps:
1. Identify relevant studies and collect results. It is important in this
step to ensure the representativeness of the studies used. One particularly
difficult source of bias to control for is called the file drawer problem,
which alludes to the tendency for statistically insignificant results to
repose unpublished and unknown in file drawers and thus not be available
for collection.
2. Evaluate individual studies for quality and relevance to the problem
of interest.
3. Identify relevant measurements, comparable across studies.
4. Combine relevant comparable measures across studies and project
these values onto the problem of interest.
VALIDITY GENERALIZATION
Validity generalization is a branch of meta-analysis that draws on
criterion-related validity evidence to extend the results of test research.
Our precise interest is the estimation of the validities of a test for
performance on new jobs, based on meta-analysis of the validities of the
test for studied jobs. There is a very substantial statistical and psychomet-
ric literature on estimating validities measured via correlation coefficients.
This chapter presents in broad outline the statistical analyses used in
validity generalization and focuses particularly on the work of John E.
Hunter and his frequent collaborator, Frank L. Schmidt, because of
Hunter's central role in applying validity generalization to the General
Aptitude Test Battery (GATB) (Hunter, 1986; Schmidt and Hunter, 1977,
1981; Schmidt et al., 1982; U.S. Department of Labor, 1983b,c).
The Theoretical Framework
The fundamental problem addressed by validity generalization is how
to characterize the generalizability of test validities across situations,
populations of applicants, and jobs. The most prominent approach treats
the problem as one of examining the variability across studies of validity
coefficients. The theoretical framework is as follows.
One wants to estimate the true validity of a test for given jobs. (By true
validity, we mean the validity that would obtain in studies conducted under
ideal conditions, with job performance assessed with perfect accuracy by the
criterion.) As a first proposition, it is assumed that there is some distribution
of true validities across a population of jobs, and that this distribution of
validities is then taken to apply to new jobs that have not undergone a
criterion-related validity study. The conclusion will be in the form: the
validity of the test for the new job lies between .3 and .5 with probability .9.
The questions remain: how are the observed validities to be used to
estimate the distribution of true validities across a population of jobs, and
thus what is the probable range of values that can be generalized to a new
job? There are a number of ways in which the correlation coefficient
obtained in any given study of the relation of test scores to job perfor-
mance is affected by situational factors, so that the validity estimate
differs from the true validity of the test for a new job:
¹The criterion-related validity of a test is a measure of the relationship between the test
score and a criterion of job performance (e.g., supervisor ratings). The relationship between
test score and job performance is measured by the product moment correlation. Following
standard practice, we refer to this correlation as validity, although this usage invites
confusion with other psychometric and legal uses of the word validity.
Sampling error. The observed validities are based on a sample of
workers; the true validities are based on a population of applicants. The
difference between sample and population is adjusted for by taking into
account sampling error of the observed validities. The major effect is that
the variability of the observed validities over jobs is greater than the
variability of true validities.
Restriction of range. The observed validities are based on a sample of
workers; the true validities are based on a population of applicants.
Because the worker group may be selected from the applicant group by
criteria correlated with the test score, the distribution of test scores within
the worker group may be different from that in the applicant group. There
will be a corresponding difference between true validities for workers and
applicants. For example, in a highly selective job, range restriction occurs
so that nearly all workers will have a narrow high range of test scores, and
the true validity will be lower than that for an unselected applicant group.
If the applicant and worker distributions can be estimated, it is possible to
correct for range restriction.
Reliability of supervisor ratings. The criterion of supervisor ratings is
assumed to be perfectly measured in computing the true validities.
Unreliable supervisor ratings will tend to make the observed validities
smaller than the true validities; if the reliability of supervisor ratings can
be estimated, an adjustment can be made for it in estimating the true
validity.
Connecting the sample to the population. The new job is different from
the jobs studied. If the jobs studied are assumed to be a random sample
from the population of all jobs, then the sample distribution can be
projected to the population distribution. This is the implicit assumption of
the Schmidt-Hunter validity generalization analyses. If this assumption
cannot be sustained, some other connection must be established between
the jobs studied and the new job.
Each of these factors is considered below in some detail.
Sampling Error
The true validity for a given job, population of subjects, and criterion is
the validity coefficient that would be obtained by conducting a validity
study involving the entire population. Any actual validity study will use
only a sample of subjects, typically a group of job incumbents chosen to
participate in the study, and will yield an observed validity (a sample
correlation), r, that differs from the true validity as a consequence of the
choice of sample. The observed validity r will deviate from the true
validity by a sampling error, e.
If several samples are taken from the same population, each would
have a different observed validity. It is the variability of these observed
validities about the true validity that tells us how confident to be in
estimating the true validity by the observed validity r.
Suppose, for example, there is a population of 1,000 individuals for
which the true validity of a test is .3. We draw a sample of 100 individuals
and compute an observed validity of .41. Other samples of 100 individuals
give validities of .22, .35, .42. The observed values vary around the true
value by a range of about .1.
Now suppose there is another population of 1,000 individuals for which
the true validity is unknown. We draw a sample of 100 individuals and
compute an observed validity of .25. What is the true validity? We think
that it lies somewhere in the range .15 to .35. Thus we use the distribution
of sampling error to indicate how close the true validity is likely to be to
the observed validity. (There may be other evidence such as prior
information about the true validity.)
The average of the sampling error, M, is very close to zero for modest
true validities. The variance of the sampling error, the average of
(e - M)^2, is close to 1/(n - 1), where n is the sample size, for modest true
validities. Thus for sample sizes of 100, the variance is about .01 and the
standard deviation is about .1; we expect the observed validity to differ by
about .1 from the true validity.
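The approximation can be checked numerically with a minimal sketch. The function name is hypothetical. The optional factor (1 - rho^2)^2 is the usual large-sample refinement under bivariate normality; it is an addition here and not part of the text's 1/(n - 1) rule.

```python
import math

def sampling_variance(n, true_validity=0.0):
    """Approximate variance of an observed correlation from a sample of size n.

    With true_validity=0 this reduces to the text's 1/(n - 1). The factor
    (1 - rho^2)^2 is an assumed refinement (bivariate normal scores).
    """
    return (1.0 - true_validity ** 2) ** 2 / (n - 1)

var_simple = sampling_variance(100)        # the 1/(n - 1) rule
sd_simple = math.sqrt(var_simple)          # close to .1
var_refined = sampling_variance(100, 0.3)  # somewhat smaller for validity .3

print(round(var_simple, 4), round(sd_simple, 3), round(var_refined, 4))
```

For a true validity of .3 the refinement shrinks the variance by less than a fifth, which is why the simpler rule is adequate for modest validities.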
Corrections for Sampling Error
To illustrate how corrections for sampling error fit into the estimation of
the distribution of true validities in a population of jobs, we offer a
hypothetical example. Assume, following Hunter and Schmidt, that the
jobs actually studied form a random sample of the population of jobs. For
each job studied a random sample of applicants is taken from the relevant
population for that job, and an observed validity is computed for the
random sample. Note that there are two levels of sampling, from the
universe of jobs and from the universe of applicants for each job.
Provided that the different studies are independent, the expected
variance of the observed validities is the sum of two components: the
variance of the true validities plus the average variance of the sampling
error.
Thus we estimate the mean true validity in the population of jobs by the
average of the observed validities, but we must estimate the variance of
true validities by the observed variance of observed validities less the
average sampling variance. This is the correction for sampling error. A
good practical estimate of the average sampling variance is the average
value of 1/(n - 1) where n is the sample size (Schmidt et al., 1982).
An Example
We have 11 jobs with 1,000 applicants each. If all applicants were tested
and evaluated on the job, the true validities would be (dropping the
decimal point): 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35.
We sample from the 11 jobs at random to get 4 jobs with true validities:
26, 28, 31, 32.
For each of the four jobs, we sample 101 from the 1,000 applicants. For
the four samples we compute observed validities: 34, 16, 26, 40.
We use these observed validities to estimate properties of the original
distribution of true validities. The mean true validity is estimated by the
mean of the sample validities, 29. The sample variance is (5^2 + 13^2 + 3^2
+ 11^2)/3 = 108, but this overestimates the variance of true validities
because of sampling error. For each sample, the sampling error
variance is 10,000/(n - 1) = 100 approximately (remember that the
decimal has been dropped, multiplying the scale by 100). Thus the
average sampling error variance is 100, and the estimated variance of true
validities is 108 - 100 = 8.
The mean true validity is 30, estimated by 29, and the variance of true
validities is 10 estimated by 8. These estimates are closer than we have a
right to expect, but the important point is that a drastic overestimate in
true validity variance may occur if the sampling error correction is not
made.
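The arithmetic of this example can be reproduced with a short sketch. The function name and the scale parameter are ours, introduced only for illustration; the estimator itself is the one described above (observed variance less average sampling variance).

```python
from statistics import mean, variance

def estimate_true_validity_distribution(observed, sample_sizes, scale=1.0):
    """Estimate the mean and variance of true validities across jobs.

    scale is 1 for ordinary correlations and 100 when the decimal point
    has been dropped, as in the example.
    """
    avg_sampling_var = mean(scale ** 2 / (n - 1) for n in sample_sizes)
    est_mean = mean(observed)
    # statistics.variance uses the divisor k - 1, matching the text.
    est_var = variance(observed) - avg_sampling_var
    return est_mean, est_var, avg_sampling_var

est_mean, est_var, avg_sampling_var = estimate_true_validity_distribution(
    [34, 16, 26, 40], [101, 101, 101, 101], scale=100
)
print(est_mean, est_var, avg_sampling_var)
```

With few studies the difference can come out negative; how to treat that case (for example, by reporting zero) is a choice the sketch leaves open.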
Note that these procedures do not make assumptions about the form of
the distribution from which the true validities are sampled (although
distributional estimates derived from the procedures frequently do).
However, the computation of an estimate of the sampling error variance
does require weak assumptions about the distribution of test and criterion
scores within studies. When the population validities are moderate, the
estimate 1/(n - 1) is satisfactory.
The corrections for sampling error in the Hunter-Schmidt analyses, all
in all, follow accepted statistical practice for estimating components of
variance.
Restriction of Range
Observed validities are based on a sample of workers, whereas the true
validities are based on a population of applicants. Since the worker group
presumably has been selected from the applicant group by criteria
correlated with the test score, the distribution of test scores within the
worker group should be different from that in the applicant group. There
will be a corresponding difference between "true" validities for workers
and applicants.
It is necessary to develop a mechanism to relate the validities of
workers and applicants. Since many applicants will never be employed on
the job, it is impossible to assemble job performance data on a typical
group of applicants. We must estimate, by some theoretical model, what
the job performance would have been if the applicants had been selected
for employment.
We make two assumptions. The first is that the linear function of test
score that best predicts job performance, when computed separately for
the population of applicants and the population of workers, has the same
coefficient of test score in both groups. This means that a given increase
in test score produces the same increase in predicted job performance in
both groups. Some such assumption cannot be avoided, because we have
data available only for the worker group but wish to use that data to make
predictions about the applicant group.
The second assumption is that the error of the linear prediction of job
performance by test score has the same variance in both groups. One might
argue against this assumption on the grounds that workers' job performance
will be predicted more accurately if the workers are rationally selected to
maximize job performance. But methods of prediction of job performance
are not so well developed that we would expect a very noticeable decrease
in error variance in the worker group (see Linn et al., 1981).
Under these assumptions there is a remarkable formula connecting the
theoretical validities in the two groups: the quantity (1 - 1/validity^2)
multiplied by test score variance is the same in both groups. When the
validities are moderate or small, this means that the ratio of the validities
in the two groups is very nearly the same as the ratio of the standard
deviations of test scores in the two groups. The ratio of the standard
deviation in the worker group to the standard deviation in the applicant
group will be called the restriction ratio. Thus if a worker group is thought
to have a standard deviation only half that of the applicant group, then the
restriction ratio is one-half, and the validity of the test for the applicant
group is close to twice that of the worker group.
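The relation between the two groups can be made concrete with a minimal sketch. It solves the invariance stated above, that (1 - 1/validity^2) times test score variance is the same in both groups, for the applicant validity; the function name is hypothetical.

```python
import math

def applicant_validity(worker_validity, restriction_ratio):
    """Applicant-group validity implied by the two assumptions.

    restriction_ratio = (worker SD of test scores) / (applicant SD).
    Solving (1 - 1/r_a^2) * var_a = (1 - 1/r_w^2) * var_w for r_a gives
    r_a = r_w / sqrt(u^2 + r_w^2 * (1 - u^2)), with u the restriction ratio.
    """
    u = restriction_ratio
    r = worker_validity
    return r / math.sqrt(u ** 2 + r ** 2 * (1.0 - u ** 2))

exact = applicant_validity(0.13, 0.5)  # exact relation
approx = 0.13 / 0.5                    # simple ratio rule for small validities
print(round(exact, 3), round(approx, 3))
```

For a worker validity of .13 and a restriction ratio of one-half, the exact relation and the simple doubling rule differ by less than .01; the gap widens as validities grow.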
The main problem in determining the correction for restriction of range
is identifying the appropriate population of applicants for a particular job
and estimating the variance of test scores for those applicants. The
validation study will use as subjects a set of workers on the job, but we
wish to estimate the validity for a set of applicants for the job who will
take the test through the Employment Service. Few data are available on
the distribution of test scores of applicants for particular jobs. It is not
even clear who should be regarded as applicants. Anyone who wishes to
apply for the job? Anyone who wishes to apply for the job and is willing
to take the test? Anyone who wishes to apply for the job and meets the
employer's minimum qualifications?
The pool of applicants for jobs as laborers or as university professors
may be considerably more restricted than the general population (because
of self-selection or qualifications required). Consequently, the correlation
between test score and job performance among the applicants to these
jobs may not be as high as would be the case if the general population
applied for and was employed in these occupations. Note also that the
pool of potential job applicants is not necessarily fixed across localities
and therefore across validity studies. For example, in localities with
chronically high unemployment, the pool of potential applicants for
low-paying jobs may include many people with high test scores who might
not be available (because they are employed) in localities with low
unemployment.
Corrections for Restrictions of Range
Suppose that the above assumptions about the relationship between
test score and job performance are satisfied for worker and applicant
groups. How can the observed validities be corrected for restriction of
range? The standard procedure is as follows: for each job studied, the
restriction ratio (the ratio of the standard deviation of test scores for
workers to that for applicants) is estimated. The sample validities computed on
the sample of workers are adjusted to give estimated validities for the
population of applicants for the job. The average of the true validities for
the population of jobs, and the variance of the true validities for the
population of jobs, with due adjustment for sampling error, are computed
from the estimated validities adjusted for restriction of range.
The principal effect of the restriction-of-range correction is to increase
or decrease the estimate of average true validity; for example, if the
average restriction ratio is one-half, the effect is to double the estimate of
mean true validity.
Let us trace the theoretical assumptions and the corresponding com-
putations from applicant population to sample of workers on the hypo-
thetical population considered previously.
We have 11 jobs with 1,000 applicants each. If all applicants were tested
and evaluated on the job, the true validities would be (dropping the
decimal point): 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35.
We sample from the 11 jobs at random to get 4 jobs with true applicant
validities: 26, 28, 31, 32.
The four jobs selected have restriction ratios of: 0.5, 0.5, 1, 1. Thus the
true validities for populations of workers in the four jobs are 13, 14, 31,
32.
For each of the four jobs, we sample 101 from 500 workers on the job.
For the four samples we compute observed validities: 21, 2, 26, 40.
The effect of the restriction of range is to lower the observed validities
whenever the restriction ratio is less than 1. Thus the first two observed
validities average 11 although the true validities average 27. When the
ratio of standard deviations varies between jobs, a secondary effect is to
increase the variance of the observed validities.
In practice, only the observed validities are known, and one wants to
infer properties of the true validities. To get from the worker sample back
to the applicant population, we must undo the various operations in going
from the population to the sample. The observed validities are corrected
for restriction of range; the corrected observed validities are 42, 4, 26, 40.
The new estimate of mean true validity is the corrected sample average
28; without the correction the estimate would be 22. The estimate of
variance of true validity is the sample variance 307 less the average error
variance in the four studies 300, yielding an estimated true variance of 7.
Note that the adjusted validities are more variable than the unadjusted
ones.
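The correction step in this example can be traced with a short sketch. It applies the simple ratio rule (dividing each observed validity by its restriction ratio) and takes the average error variance of 300 directly from the text rather than recomputing it; the variable names are ours.

```python
from statistics import mean, variance

observed = [21, 2, 26, 40]             # worker-sample validities, decimal point dropped
restriction_ratios = [0.5, 0.5, 1, 1]  # worker SD / applicant SD for each job

# Simple ratio rule: divide each observed validity by its restriction ratio.
corrected = [r / u for r, u in zip(observed, restriction_ratios)]

uncorrected_mean = mean(observed)      # estimate without the correction
corrected_mean = mean(corrected)       # estimate of mean true validity
corrected_var = variance(corrected)    # sample variance, divisor k - 1

avg_error_var = 300                    # figure quoted in the text, not recomputed here
est_true_var = corrected_var - avg_error_var

print(corrected, corrected_mean, round(corrected_var), round(est_true_var))
```

The corrected values spread much more widely than the observed ones because dividing by a ratio of one-half doubles both the validity and its sampling error.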
Estimating Restriction Ratios
In principle, it is possible to estimate the variances in test scores for
different applicant groups, the variances necessary for correcting for
restriction of range. However, in the GATB validity studies, which use
workers on the job, no information is available about applicant groups for
those jobs. It is not even clear how applicant groups should be defined for
those jobs. It could be all people who applied for the job over a period of
time, all people in the local labor market who met the requirements for the
job, or all registrants in the local Employment Service office. The last
definition might best fit the purpose of relating test scores to job
performance for Employment Service registrants.
Methods have been developed to correct for restriction of range in a
large sample of studies without knowing the restriction ratio for every
individual study. It is assumed that the restriction ratios for the various
studies have a known mean and variance, and that the distribution of
restriction ratios is independent of the true validities (Callender and
Osburn, 1980). The known mean and variance are sufficient to determine
the correction. For example, if the restriction ratios have average value
0.5, the average true validity is estimated to be about twice the average
observed validity. Similarly, if the restriction ratios have a large variance,
a reduction will occur in estimating the variance of true validities
compared with the observed variance of sample validities.
The model and calculations are as follows:
    sample validity = restriction ratio × true validity + error

Since the restriction ratio is assumed to have a distribution independent of
the true validity:

    average sample validity = average restriction ratio × average true validity

    variance of sample validity = average sampling variance
        + [restriction ratio variance × (average true validity)^2]
        + [true validity variance × (average restriction ratio)^2]
The variance calculation is only approximate, but the approximation is
good whenever the restriction ratio has small percentage variation.
The same model may be used if multiplicative factors other than the
restriction ratio are included; one need only know the mean and variance
of the multiplicative factor.
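Solving the two equations of this model for the mean and variance of true validities gives the following sketch. The function name and the input figures are hypothetical; the algebra simply inverts the approximate variance formula, so it inherits the same limitation when the multiplicative factor varies widely.

```python
def correct_for_multiplicative_factor(mean_obs, var_obs, avg_sampling_var,
                                      factor_mean, factor_var):
    """Recover the mean and variance of true validities.

    Inverts: mean_obs = factor_mean * mean_true
             var_obs  = avg_sampling_var
                        + factor_var * mean_true**2
                        + var_true * factor_mean**2
    """
    mean_true = mean_obs / factor_mean
    var_true = (var_obs - avg_sampling_var
                - factor_var * mean_true ** 2) / factor_mean ** 2
    return mean_true, var_true

# Hypothetical inputs: mean observed validity .15 with variance .02,
# average sampling variance .01, restriction ratios with mean .5 and variance .01.
mean_true, var_true = correct_for_multiplicative_factor(0.15, 0.02, 0.01, 0.5, 0.01)
print(round(mean_true, 3), round(var_true, 4))
```

The same function applies to any multiplicative factor, such as a reliability correction, once its mean and variance across studies are supplied.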
Can GATB Restriction Ratios be Estimated?
The crucial question remains: What is the average restriction ratio? The
simple option of using the variance derived from all workers who
appeared in the studies (U.S. Department of Labor, 1983c) will lead to
inflated corrections for restriction of range if this group is more variable
in test scores than a typical applicant group for a particular job. This
method of correction is also at odds with assertions made elsewhere by
Hunter (U.S. Department of Labor, 1983e) that the selection methods of
the Employment Service are "equivalent to random selection"; if indeed
that were true, there would be no difference between applicant groups and
worker groups in test score variance. In the absence of direct information
for particular jobs, the conservative response is to apply no correction for
restriction of range.
Lack of adequate reliable data about the variance of test scores in
realistically defined applicant populations is a major problem in validity
generalization from the GATB validity studies. The absence of direct data
is so pronounced that the committee has chosen the conservative re-
sponse of making no corrections for range restrictions in its analysis of
GATB validities.
Reliability of Supervisor Ratings
In each validity study, a worker's job performance is measured by a
supervisor rating. We distinguish between a true rating, done with
exhaustive study of the worker's job performance, and an observed
rating, performed under real conditions by the supervisor. We suppose
that the observed rating differs from the true rating by some error that is
uncorrelated with the true rating over the population of workers.
Reliability is measured by the ratio of the variance of the true ratings to
the variance of the observed ratings. If there is no measurement error, the
reliability would be 1; if the observed rating is unrelated to the true rating,
the reliability would be zero.
The reliability correction is the ratio of the standard deviation of the
true ratings to the standard deviation of the observed ratings. It is the
square root of the reliability. Just as with the restriction ratio, the validity
of test score with observed rating is divided by the reliability correction to
become a validity of test score with true rating.
The main effect of the reliability correction is to increase the estimate
of average true validity. A secondary effect, when reliabilities vary among
studies, is to reduce the estimate of variance of true validities compared
with the observed variance of sample validities.
Much the same things can be said about reliability corrections as for
restriction of range corrections. It is a sensible correction if the required
ratios of variances can be estimated, but in the GATB validity studies
the reliability of the ratings is rarely available. In the Hunter and
Schmidt validity generalization analysis, the mean and variance of the
distribution of reliabilities across studies are assumed, and the mean
and variance of true validities are corrected accordingly. If the reliabil-
ities are underestimated, then the correction will be an overcorrection.
The mean reliability of .60 assumed by Hunter and Schmidt causes a
reliability correction of about 0.77; the true validity estimate is increased by
30 percent. Given the dangers of overcorrecting, and given the observa-
tion of reliabilities higher than .60 in many studies, the more conservative
figure of .80 seems more appropriate to the committee and is used in its
calculations.
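The size of the reliability correction under the two assumed mean reliabilities, .60 and .80, can be compared with a minimal sketch. The function names are hypothetical; the computation follows the definition above, in which the correction is the square root of the reliability and the observed validity is divided by it.

```python
import math

def reliability_correction(reliability):
    """Ratio of true-rating SD to observed-rating SD: sqrt(reliability)."""
    return math.sqrt(reliability)

def correct_validity(observed_validity, reliability):
    """Validity of test score with true rating, given the observed validity."""
    return observed_validity / reliability_correction(reliability)

for rel in (0.60, 0.80):
    factor = reliability_correction(rel)
    increase = correct_validity(1.0, rel) - 1.0  # proportional increase
    print(rel, round(factor, 3), f"{increase:.1%}")
```

Moving from an assumed reliability of .60 to .80 cuts the proportional increase in estimated validity from roughly 29 percent to roughly 12 percent.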
Connecting the Sample to the Population
The data available about validities of the GATB consist of some 750
studies, conducted by the USES, in collaboration with employers, over
the period 1945-1985. We wish to draw conclusions about the validity of
the GATB for jobs in new settings, as well as about the population of
12,000 job types in many different settings. In order to justify the
extrapolation, we must establish a connection between the jobs studied
and the targeted population of jobs.
In USES validity generalization studies (U.S. Department of Labor,
1983e), it is asserted that the jobs studied in each of five job families may
be taken to be a sample from the set of all jobs in the corresponding job
family. Inference about population characteristics is then based on the
tacit assumption that the sample is random, that is, that all jobs in a job
family have equal chance of appearing in job studies.
There are a number of reasons to be skeptical about the assertion that
the jobs studied are representative of all jobs. The studies have been
carried out over a long period of time, and it is fair to question whether a
job study carried out in 1950 is as relevant in 1990 as it was then. Standard
job conditions may have changed, the literacy of the work force may have
changed, accepted selection procedures may have changed. There is
indeed evidence of a general decline of validities over time in the USES
data base.
Moreover, certain conditions must be met before a job appears in a
validity study. An employer must be found who is willing to have workers
spend time taking the GATB test and to have supervisors spend time
rating the workers. The employer must be persuaded that the test is of
some value in predicting job performance; why would the employer
participate in a futile exercise? If the test is then more valid for some jobs
and in some settings than others, and if we assume that either USES or
employers are able to identify the more fruitful jobs and settings, then
surely they would study such jobs first. The jobs thought to have low
validity will have less chance of being studied. The net effect is that the
average population validity will be lower than the observed sample
validity, but we do not know enough about the selection rules for
initiating and carrying out studies to estimate the size of the effect.
An example of such selection in GATB studies is provided by jobs
classified as agricultural, fishery, forestry, and related occupations. They
include 2 percent of the jobs in the Dictionary of Occupational Titles, but
only 0.4 percent (3 studies of 777) of the jobs in the USES data base.
A related selection problem in publishing the results of studies is known
as the file drawer problem. Results that show small validities may have
less chance of being written up formally and being included in the
available body of data. We do not have an estimate for the size of this
effect for the GATB studies.
THE INTERPRETATION OF SMALL VARIANCES IN
VALIDITY GENERALIZATION
Most writers in the area of validity generalization have argued that, if
the variance of the validity parameters is estimated to be small, then
validities are highly generalizable. Two justifications for this position are
advanced. The first is that, if most of the variability in the observed
validities can be accounted for by the artifacts of sampling error,
unreliability of test and criterion, and restriction of range, then it is
reasonable to assume that much of the rest can be accounted for by other
artifacts. The second argument is that, if the variance in validity
parameters is small, then the validities in all situations are quite similar.
There is little empirical evidence to aid in the evaluation of the first
argument. Although it seems sensible to many, reasonable people might
disagree on how much of the variation must be explained for the
argument to be persuasive. For example, Schmidt and Hunter's (1977)
75 percent rule, which suggests that, if the four artifact corrections
explain 75 percent of the variation, then the remaining 25 percent is
probably due to other artifacts (such as clerical errors), is not
universally accepted (see James et al., 1986, 1988; but see also Schmidt
et al., 1988).
The argument that small variance among validity parameters implies
that all validities are quite similar is more obviously problematic. Suppose
that the sample of studies actually consists of two distinct groups
(differing from one another in job or context), which have different
distributions of validity parameters. If one of the groups in the sample has
only a small number of studies and the other has a much larger number of
studies, then between-group differences in validities need not greatly
inflate the overall variance among validities.
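The point can be made concrete with a small calculation. The sketch below is illustrative only: the group proportions, group means, and within-group variance are hypothetical numbers chosen for the example, not GATB values, and the function name is invented here. It uses the standard decomposition of a two-group mixture variance into a within-group part and a between-group part.

```python
def mixture_variance(p_small, mean_small, mean_large, var_within):
    """Variance of validity parameters for a two-group mixture.

    p_small    : fraction of studies in the smaller group (hypothetical)
    mean_small : mean validity parameter in the smaller group
    mean_large : mean validity parameter in the larger group
    var_within : variance within each group (assumed equal in both groups)
    """
    between = p_small * (1.0 - p_small) * (mean_small - mean_large) ** 2
    return var_within + between


# Hypothetical case: 5 percent of studies have validities near .60,
# the other 95 percent near .25, with a within-group SD of .05.
total_var = mixture_variance(0.05, 0.60, 0.25, 0.05 ** 2)
total_sd = total_var ** 0.5
print(round(total_var, 5), round(total_sd, 3))
```

Under these assumed numbers the overall standard deviation is about .09, which might be read as "small" even though one group's mean validity differs from the other's by .35.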
Note also that, when studies in the sample are not representative of
the universe of all jobs or contexts, the size of the two groups in the
sample need not reflect their incidence in the universe. Thus jobs that
might be associated with unusually high validities might occur
infrequently in the sample of validity studies but occur with higher
frequency in the universe of all jobs or contexts. Moreover, the existence
of two groups of studies, each with a different distribution of validity
parameters, cannot be detected from the estimate of the overall mean
and variance of the validities alone. In general, omnibus procedures
designed to estimate the variance of validity parameters (or to test the
hypothesis that this variance is zero) are not well suited to detect the
possibility that validities are influenced by moderator variables that
may act on only a few studies in the sample. The reason is that such
omnibus procedures are sensitive to many kinds of departures from
absolute consistency among studies and are therefore not optimal for
detecting a specific pattern. To put this argument more precisely, the
omnibus statistical test that tests for any difference among validities
does not have as much power to detect a particular difference between
groups of studies as does a test designed to detect that specific
between-group contrast.
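The difference in power can be illustrated by simulation. The following is a minimal sketch under stated assumptions, not a procedure taken from the GATB studies: it supposes ten studies with a common, known sampling variance of .01, two of which belong to a group with mean validity .50 while the other eight have mean validity .25, and it assumes the group membership is known in advance. The omnibus statistic is the usual homogeneity chi-square on 9 degrees of freedom. The focused statistic is the squared z for the two-group contrast on 1 degree of freedom. All names and numbers are hypothetical.

```python
import random

CHI2_CRIT_DF1 = 3.841   # 5 percent critical value, 1 degree of freedom
CHI2_CRIT_DF9 = 16.919  # 5 percent critical value, 9 degrees of freedom


def omnibus_q(estimates, variance):
    """Homogeneity statistic for estimates sharing one sampling variance."""
    mean = sum(estimates) / len(estimates)
    return sum((y - mean) ** 2 for y in estimates) / variance


def contrast_chi2(group_a, group_b, variance):
    """Squared z statistic for the difference between two group means."""
    diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    var_diff = variance / len(group_a) + variance / len(group_b)
    return diff ** 2 / var_diff


def simulate_power(reps=2000, seed=1):
    """Share of simulated data sets in which each test rejects at 5 percent."""
    rng = random.Random(seed)
    variance = 0.01
    sd = variance ** 0.5
    hits_omnibus = hits_contrast = 0
    for _ in range(reps):
        a = [rng.gauss(0.50, sd) for _ in range(2)]  # small, atypical group
        b = [rng.gauss(0.25, sd) for _ in range(8)]  # large, typical group
        hits_omnibus += omnibus_q(a + b, variance) > CHI2_CRIT_DF9
        hits_contrast += contrast_chi2(a, b, variance) > CHI2_CRIT_DF1
    return hits_omnibus / reps, hits_contrast / reps


power_omnibus, power_contrast = simulate_power()
print(power_omnibus, power_contrast)
```

In this setup the focused contrast rejects more often than the omnibus test, because the omnibus test spreads the same between-group signal over nine degrees of freedom.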
CONCLUSIONS
1. The general thesis of the theory of validity generalization, that
validities established for some jobs are generalizable to some unexamined
jobs, is accepted by the committee.
Adjustments to Validity Coefficients
Sampling Error
The observed variance in validities is partly due to variance in the
"true" validities computed for a very large number of workers in each
job, and partly due to the differences between those true validities and the
sample validities computed for the actual groups of workers available in
each job.
2. For the GATB, the variance is justifiably adjusted by subtracting
from the observed variance an estimate of the contribution due to
sampling error.
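As an illustration of this subtraction, the sketch below uses one common estimator from the validity generalization literature, in which the expected sampling error variance of a correlation is approximated by (1 - r̄²)² / (n̄ - 1), with r̄ the sample-size-weighted mean validity and n̄ the average sample size. This particular formula is an assumption of the example rather than something stated in this section, and the validities and sample sizes are hypothetical.

```python
def adjust_for_sampling_error(validities, sample_sizes):
    """Subtract estimated sampling error variance from observed variance.

    Returns (weighted mean, observed variance, sampling error variance,
    residual variance). The residual is floored at zero.
    """
    total_n = sum(sample_sizes)
    mean_r = sum(n * r for r, n in zip(validities, sample_sizes)) / total_n
    observed = sum(
        n * (r - mean_r) ** 2 for r, n in zip(validities, sample_sizes)
    ) / total_n
    mean_n = total_n / len(sample_sizes)
    # Assumed approximation for the sampling variance of a correlation.
    sampling = (1.0 - mean_r ** 2) ** 2 / (mean_n - 1.0)
    residual = max(0.0, observed - sampling)
    return mean_r, observed, sampling, residual


# Hypothetical observed validities and sample sizes for five studies.
rs = [0.10, 0.25, 0.30, 0.40, 0.20]
ns = [200, 320, 480, 240, 360]
mean_r, observed, sampling, residual = adjust_for_sampling_error(rs, ns)
print(round(mean_r, 4), round(observed, 5), round(sampling, 5), round(residual, 5))
```

The residual is an estimate of the variance of the "true" validities; when the sampling error estimate exceeds the observed variance, the sketch reports zero rather than a negative variance.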
Range Restriction
The adjustments of average validity are designed to correct for two
deficiencies in the data. The first is that, although the correlation between
test score and job performance is based on workers actually on the job,
the prediction will be applied to applicants for the job. If workers have a
narrower range of test scores than applicants, then the worker correlation
will be lower than the applicant correlation; an adjustment for range
restriction produces an adjusted correlation larger than the observed
correlation.
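For illustration only, the direction and rough size of such an adjustment can be sketched with the standard textbook correction for direct range restriction on the predictor (often called Thorndike's Case II). The formula is an assumption of this example, not one quoted in this section, and the ratio of applicant to worker standard deviations used below is hypothetical.

```python
def correct_range_restriction(r_workers, sd_ratio):
    """Correct a correlation for direct range restriction on the test.

    r_workers : correlation observed among workers on the job
    sd_ratio  : SD of test scores among applicants divided by the SD
                among workers (values above 1 indicate restriction)
    """
    numerator = sd_ratio * r_workers
    denominator = (1.0 + r_workers ** 2 * (sd_ratio ** 2 - 1.0)) ** 0.5
    return numerator / denominator


# Hypothetical case: observed validity .25, applicant SD 25 percent larger.
adjusted = correct_range_restriction(0.25, 1.25)
print(round(adjusted, 3))
```

When the ratio equals 1 the correlation is unchanged, which makes plain that the size of the adjustment depends entirely on how well the applicant variance is known.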
3. Lack of adequate, reliable data about the variance of test scores in
realistically defined applicant populations appears to be a major
problem in validity generalization from the GATB validity studies.
Appropriate corrections remain to be determined by comparisons between
test score variability of workers and of applicants, and, in the
meantime, caution suggests that no corrections for restriction of range
be made.
Criterion Unreliability
A further deficiency in the data is that the criterion measure, usually
supervisory ratings, is measured inaccurately, which reduces the
observed correlation. Thus an adjustment is used that produces a
correlation between the test score and a theoretical criterion measured
with perfect precision, which may reasonably be taken to be a better
indicator of job performance than the observed criterion.
4. In the GATB validity studies, data on the reliability of the criterion
are rarely available. Correction for criterion unreliability with too low a
figure would inflate the adjusted validity. Given the observation of
reliabilities higher than .60 in many studies, the committee finds that a
conservative value of .80 would be more appropriate than the .60 value
contained in USES technical reports on validity generalization.
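The sensitivity of the adjusted validity to the assumed reliability can be seen in a short calculation. The sketch uses the classical correction for attenuation in the criterion only, dividing the observed correlation by the square root of the criterion reliability. The observed validity of .25 is a hypothetical value chosen for the example.

```python
def correct_criterion_unreliability(r_observed, criterion_reliability):
    """Disattenuate a validity coefficient for unreliability in the criterion."""
    return r_observed / criterion_reliability ** 0.5


# Hypothetical observed validity, adjusted under the two reliability values.
observed_validity = 0.25
adjusted_60 = correct_criterion_unreliability(observed_validity, 0.60)
adjusted_80 = correct_criterion_unreliability(observed_validity, 0.80)
print(round(adjusted_60, 3), round(adjusted_80, 3))
```

With these numbers the assumed reliability of .60 raises the validity by about 29 percent, and the value of .80 raises it by about 12 percent, which is why the lower figure risks inflating the adjusted validity.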
Connecting the Sample to the Population
The generalization of validities computed for 500 jobs in some 750
USES studies to the population of 12,000 jobs in the Dictionary of
Occupational Titles is justified only to the degree that these jobs are
similar to the other jobs not studied. Thus a necessary component of
validity generalization for the GATB is to establish links between the jobs
studied and the remainder. One way to do so is to select the jobs at
random from a general class. Failing randomness in selection, it is
necessary to establish important similarities between the studied jobs and
the target jobs.
5. The 500 jobs in the GATB data base were selected by unknown
criteria. They cannot be considered a representative sample of all jobs in
the U.S. economy. Nevertheless, the data suggest that a modest level of
validity (greater than .15) will hold for a great many jobs in the U.S.
economy.