| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
9
Differential Validity and
Differential Prediction
This chapter addresses the important question of whether the General
Aptitude Test Battery (GATB) functions in the same way for different
specified groups. Investigations of group differences in the correlations of
a test with a criterion measure are commonly referred to as differential
validity studies. Such studies can take a variety of forms, including
investigations of the possibility that validity coefficients may differ as a
function of the setting (e.g., from one job to another or from one location
to another) or the group (e.g., demographic group or groups formed on the
basis of prior work experience). Investigations of differential prediction,
which cover an equally broad range, focus on prediction equations rather
than correlation coefficients. A differential prediction study may be used
to investigate whether differences in setting or differences among demo-
graphic groups (e.g., racial or ethnic groups or gender) affect the
predictive meaning of the test scores. We are not concerned here with
setting. Our investigation is limited to the possibility that the GATB
functions differently for different population groups, and specifically that
correlations of GATB scores with on-the-job criterion measures may
differ by racial or ethnic group or gender, or that predictions of criterion
performance from GATB scores may differ for employees on a given job
who are of different racial or ethnic status or gender.
Although questions about differences in correlations and about differ-
ential prediction could be raised for groups formed on the basis of a wide
range of characteristics, these questions are of particular importance for
groups that are known to differ in average test performance. Some of the
policy issues regarding the use of tests for selection that are raised by the
existence of group differences in average test performance were discussed
in the report of the National Research Council's Committee on Ability
Testing, from which we quote (Wigdor and Garner, 1982:71-72):
If group differences on tests used for selection do not reflect actual differences in
practice in college or on the job, then using the test for selection may unfairly
exclude a disproportionately large number of members of the group with the lower
average test scores. Furthermore, even when the groups differ in average
performance on the job or in college as well as in average performance on the test,
the possible adverse impact on the lower-scoring group should be considered in
evaluating the use of the test.
Because the differences in average test scores for some groups are
relatively large, and because reliance on the scores without regard to
group membership can have substantial adverse impact, "it is important
to determine the degree to which the differences reflect differences in
performance . . . on the job" (Wigdor and Garner, 1982:73). That is, the
results of differential prediction studies are needed.
Studies have been conducted by David J. Synk, David Swarthout, and
William Goode, among others, comparing the predictive validities of
GATB scores obtained for black employees and white employees (e.g.,
U.S. Department of Labor, 1987), and for men and women (U.S.
Department of Labor, 1984a). Although these comparisons of correlation
coefficients for different groups are related to the issue of differential
prediction, they do not provide a direct answer to the question of whether
group differences in average test scores are reflected in differences in job
performance. It is possible, for example, for the correlations between a
test and a criterion measure to be identical for two groups when there are
substantial differences in the prediction equations for the two groups.
Thus, the use of a single prediction equation could lead to predictions that
systematically over- or underestimate the job performance of members of
one of the groups, even though the validity coefficients are the same.
Conversely, it is possible for two groups to have the same prediction
equations and the same variability of actual criterion scores about their
predicted values, and yet have different validity coefficients.
Prediction equations are usually based on a linear regression model and
are influenced by means and standard deviations of the test and criterion
measure as well as the correlation. Thus the equations for two groups may
differ as the result of differences in means or standard deviations as well
as differences in correlations.
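The point that equal correlations do not imply equal prediction equations can be illustrated with a minimal sketch. The summary statistics below are invented for illustration and are not GATB data; the only formulas used are the standard least-squares identities slope = r * (s_y / s_x) and intercept = mean_y - slope * mean_x.

```python
def regression_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the least-squares line of y on x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical summary statistics; both groups share the same correlation.
group_a = dict(mean_x=100.0, sd_x=20.0, mean_y=50.0, sd_y=10.0, r=0.20)
group_b = dict(mean_x=85.0, sd_x=20.0, mean_y=48.0, sd_y=10.0, r=0.20)

slope_a, intercept_a = regression_from_summary(**group_a)
slope_b, intercept_b = regression_from_summary(**group_b)

# Predicted criterion score for the same test score under each equation.
test_score = 90.0
pred_a = intercept_a + slope_a * test_score
pred_b = intercept_b + slope_b * test_score
print(f"group A: y = {intercept_a:.2f} + {slope_a:.3f}x, prediction {pred_a:.2f}")
print(f"group B: y = {intercept_b:.2f} + {slope_b:.3f}x, prediction {pred_b:.2f}")
```

In this made-up case the slopes coincide but the intercepts do not, so the two equations give different predictions for the same test score even though the validity coefficients are identical.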
Although differential prediction is the more important of the two topics,
differences in correlations between scores on the GATB and scores on a
criterion measure are also of interest. This is so because there is a
common expectation that a test, which may be known to have a useful
degree of validity for majority-group employees, may have no useful
degree of validity for minority-group employees. Therefore, the results of
the committee's investigations of differences in correlations between the
GATB and criterion measures are briefly reviewed before turning to a
consideration of differential prediction.
GROUP DIFFERENCES IN CORRELATIONS
David J. Synk and David Swarthout compared the validity coefficients
obtained for black and for nonminority employees in 113 Specific Apti-
tude Test Battery validation studies conducted since 1972 for which there
were at least 25 people in each of the two groups (U.S. Department of
Labor, 1987). For almost all of the 113 studies the criterion measure was
based on supervisor ratings, typically "the sum of the scores from two
administrations of the Standard Descriptive Rating Scale" (p. 2). The
weighted average of the validity coefficients across studies was reported
separately by group for each of the nine aptitude scores. Also reported
were the weighted average validities for the appropriate composites for
each of the five job families. The latter results are of greatest interest here
because it is the composites that would be used in the proposed
VG-GATB Referral System.
The weighted average job family correlations reported in Table 4 of the
Synk and Swarthout report are reproduced in Table 9-1. Also shown are
the number of studies and the number of employees on which each of the
weighted average correlations is based.
TABLE 9-1 Weighted Average Job Family Correlations for Black and
Nonminority Employees
                         Blacks                  Nonminorities
Job      Number of               Average                  Average
Family   Studies        N        Correlation     N        Correlation
I             5           196      -.01            624       .05
II            1            44       .11             81       .07
III           1            66       .19            291       .27
IV           62         3,886       .15          9,938       .19
V            44         3,662       .12          4,834       .20
SOURCE: Based on U.S. Department of Labor. 1987. Comparison of Black and
Nonminority Validities for the General Aptitude Test Battery. USES Test Research Report
No. 51. Prepared by David J. Synk and David Swarthout, Northern Test Development Field
Center, Detroit, Mich., for Division of Planning and Operations, Employment and Training
Administration. Washington, D.C.: U.S. Department of Labor, Table 4.
As the table shows, the average correlation for black employees is
smaller than the corresponding average correlation for nonminority
employees for all but Job Family II, in which case the results are based on
only one study with a relatively small sample of employees. The differ-
ence in the average correlations for black and nonminority employees is
statistically significant according to the critical ratio test reported by Synk
and Swarthout for Job Families IV and V.
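The exact form of the critical ratio test used by Synk and Swarthout is not given here. As a rough sketch only, the comparison can be approximated with the familiar Fisher z test for two independent correlations, treating each weighted average in Table 9-1 as if it were a single correlation based on the pooled N. That treatment is an assumption made for illustration, not the method of the original report.

```python
import math

def fisher_z_ratio(r1, n1, r2, n2):
    """Critical ratio for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se

# (nonminority r, nonminority N, black r, black N), taken from Table 9-1.
table_9_1 = {
    "I": (0.05, 624, -0.01, 196),
    "II": (0.07, 81, 0.11, 44),
    "III": (0.27, 291, 0.19, 66),
    "IV": (0.19, 9938, 0.15, 3886),
    "V": (0.20, 4834, 0.12, 3662),
}

ratios = {family: fisher_z_ratio(*values) for family, values in table_9_1.items()}
for family, z in ratios.items():
    verdict = "exceeds 1.96" if abs(z) > 1.96 else "does not exceed 1.96"
    print(f"Job Family {family}: critical ratio = {z:.2f} ({verdict})")
```

Under this simplified treatment only Job Families IV and V exceed the conventional two-sided .05 cutoff of 1.96, which agrees with the pattern reported in the text.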
Synk and Swarthout did not present more detailed information about
the distributions of the validity coefficients for the two groups within each
job family. However, the Northern Test Development Field Center of the
U.S. Department of Labor made data available to the committee that we
used to compute correlations between the job-specific GATB composite
and criterion measures. These correlations were computed separately for
each job with at least 50 black and 50 nonminority employees with GATB
scores and scores on the criterion measure. The data files overlap with
those used by Synk and Swarthout, differing mainly in the number of
studies, since only studies that included at least 50 people in each group
were used in the present analyses. As before, the criterion measure is
based on supervisor ratings in most cases, usually the Standard Descrip-
tive Rating Scale.
A total of 72 studies had at least 50 black and 50 nonminority
employees. The 72 studies included a total of 6,290 black and 11,923
nonminority employees, for an average of about 87 black and 166
nonminority employees per study. The number of black and nonminority
employees per study ranged from 50 to 321 and from 56 to 761,
respectively.
The correlation between the GATB composite and the criterion mea-
sure was larger for the sample of nonminority employees than for the
sample of black employees in 48 of the 72 studies. The average correlation
(weighted for sample size) of the job-appropriate GATB composite with
the criterion measure was .19 for nonminority employees. The corre-
sponding weighted average for black employees was .12. Thus, the finding
of Synk and Swarthout that the average correlation is smaller for blacks
than for nonminorities is confirmed in our analysis.
A more detailed comparison of the distributions of correlations be-
tween the GATB composite and the criterion measure for the two groups
is shown in the stem-and-leaf chart in Table 9-2. The stem-and-leaf chart
can be read like a bar chart. The numbers in the center between the
brackets give the first digit (i.e., tenths) of the correlation. The numbers
to the left give the hundredths digit for each of the 72 correlations based
on black employees, and the numbers to the right give the hundredths
digit of the 72 correlations based on nonminority employees. For example,
in one study the correlation for black employees was .42. That study
is depicted by the leaf of 2 to the left of the [.4]. The 1 to the right of the
[.4] represents a study where the correlation for nonminority employees
was .41.
TABLE 9-2 Stem-and-Leaf Chart of the Correlations of the
Job-Appropriate GATB Composite with the Criterion Measure for
Black and Nonminority Employees (stem = .1; leaf = .01)
Leaf for Blacks    Stem     Leaf for Nonminorities
                   [.5]     1
                   [.4]
             2     [.4]     1
            97     [.3]     57778
          4430     [.3]     00023
       8876665     [.2]     55567788
        442200     [.2]     022222333114444
99888776666655     [.1]     55557788889
     433222110     [.1]     022234444
      98877665     [.0]     5557789
  443311111000     [.0]     1133444
           300     [-.0]    44
           877     [-.0]    7
             0     [-.1]
            55     [-.1]
Median, blacks = .13
Median, nonminorities = .185
As the table shows, there is a general tendency for the distribution of
correlations to be higher for nonminorities than for blacks. The difference
in medians (.185 versus .13) is similar to the difference in sample-
size-weighted means (.19 versus .12). The 25th and 75th percentiles are
.11 and .25 for the distribution of correlations based on nonminority
employees; the corresponding figures for black employees are .03 and .21.
The greater spread in the correlations for blacks compared with nonmi-
norities is to be expected because the average number of black employees
per study (87) is smaller than that for nonminorities (166). Hence, the
correlations based on data for blacks have greater variability due to
sampling error. Nonetheless, for a quarter of the studies, the correlation
for blacks is .03 or less.
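The effect of sample size on the spread of observed correlations can be sketched with the common large-sample approximation SE(r) = (1 - r^2) / sqrt(n - 1). This formula is a textbook approximation, not one taken from the chapter, and the inputs below simply reuse the average per-study sample sizes (87 and 166) and the weighted mean correlations (.12 and .19) quoted above.

```python
import math

def approx_se_of_r(r, n):
    """Large-sample approximation to the standard error of a correlation."""
    return (1.0 - r * r) / math.sqrt(n - 1)

se_black = approx_se_of_r(0.12, 87)
se_nonminority = approx_se_of_r(0.19, 166)

print(f"approximate SE of r, black samples (n = 87):        {se_black:.3f}")
print(f"approximate SE of r, nonminority samples (n = 166): {se_nonminority:.3f}")
```

With a typical study size of 87, sampling error alone is roughly as large as the average correlation for black employees, so a wide scatter of study-level correlations is unsurprising.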
The above results give only a global picture for one of the minority
groups of interest. However, the results raise serious questions about the
degree of validity of the job family composites for blacks, especially in
Job Families IV and V for which the results are based on a sizable number
of studies and large samples of black employees. Not only are the average
validity coefficients lower for blacks than for nonminorities, but the level
of the correlation for blacks is also quite low.
Comparisons of validity coefficients for other racial or ethnic groups
would be of value but data are not presently available. Comparisons of
validity coefficients for men and women, however, have been reported by
Swarthout, Synk, and Goode (U.S. Department of Labor, 1984a).
Swarthout, Synk, and Goode analyzed the results of 122 Specific
Aptitude Test Battery validation studies conducted since 1972 in which
there were at least 25 men or 25 women. Only 37 of these studies had at
least 25 male and 25 female employees. For those 37 studies the weighted
average validity of the nine aptitude scores for men and women was
reported. Except for manual dexterity, for which the average validity was
.05 higher for women than for men (.14 versus .09), the average validities
for men and women did not differ by more than .02 on the remaining eight
aptitudes.
Unfortunately for present purposes, the comparisons of the validities of
the job family composites were reported for all studies that had the
minimum number of 25 men or 25 women. Thus the averages for men and
women are based on overlapping but not identical sets of studies. Since
the available studies in Job Families I, II, and III were all single-sex
studies, only the results from the Swarthout, Synk, and Goode research
for Job Families IV and V are summarized in Table 9-3. As the table
(which was taken from Table 6 of the Swarthout, Synk, and Goode
research) shows, the weighted average validity for women is quite similar
to the corresponding value for men in both job families. Although caution
is needed in interpreting these results because the averages for men and
women are not based on identical sets of studies, there does not seem to
be any indication that the GATB composites for Job Families IV and V
are any less valid for women than for men. It might be noted, however,
that the average validities reported here are higher for men and women
than the averages that were presented earlier for blacks and nonminorities.
Recall that the average weighted validities reported by Synk and
Swarthout for blacks in Job Families IV and V were only .15 (based on 62
studies) and .12 (based on 44 studies), respectively (U.S. Department of
Labor, 1987).
TABLE 9-3 Weighted Average Job Family Correlations for Male and
Female Employees
                     Men                              Women
Job      Number of            Average     Number of            Average
Family   Studies      N       Validity    Studies      N       Validity
IV          51       8,793      .24          37       7,101      .25
V           23       2,365      .20          41       6,262      .22
SOURCE: U.S. Department of Labor. 1984. The Effect of Sex on General Aptitude Test
Battery Validity and Test Scores. USES Test Research Report No. 49. Prepared by
Northern Test Development Field Center, Detroit, Mich., for Division of Counseling and
Test Development, Employment and Training Administration. Washington, D.C.: U.S.
Department of Labor, Table 6.
DIFFERENTIAL PREDICTION
As has already been noted, differences in validity coefficients are
related to differential prediction, but the two are not identical and the
latter concept is more relevant to determining if predictions based on test
scores are biased against or in favor of members of a particular group.
According to Standards for Educational and Psychological Testing
(American Educational Research Association et al., 1985:12):
There is differential prediction, and there may be selection bias, if different
algorithms (e.g., regression lines) are derived for different groups and if the
predictions lead to decisions regarding people from the individual groups that are
systematically different from those decisions obtained from the algorithm based
on the pooled groups.
The Standards (p. 12) go on to discuss differential prediction in terms of
selection bias:
[In the case of] simple regression analysis for selection using one predictor,
selection bias is investigated by judging whether the regressions differ among
identifiable groups in the population. If different regression slopes, intercepts, or
standard errors of estimate are found among different groups, selection decisions
will be biased when the same interpretation is made of a given score without
regard to the group from which a person comes. Differing regression slopes or
intercepts are taken to indicate that a test is differentially predictive of the groups
at hand.
Since the available reports comparing validities do not provide direct
evidence regarding the possibility of differential prediction, the committee
conducted analyses for this report. Data for these analyses were provided
by the Northern Test Development Field Center of the U.S. Department
of Labor. The data tape that was provided contained studies used in the
Synk and Swarthout comparison of validities for black and nonminority
employees (U.S. Department of Labor, 1987).
Although the data tape contains a variety of other information, only
one criterion measure and one test-based predictor were used in the
analyses reported here. The criterion measure is the same as the one
used by Synk and Swarthout. Thus, with the exception of a few studies,
the criterion measure is based on supervisor ratings, usually the
Standard Descriptive Rating Scale. The predictor is the job family
composite appropriate for the job family to which each study is assigned.
Group membership was indicated by a variable that identified the indi-
vidual as black, Native American, Asian, Hispanic, or nonminority. Only
individuals identified as either black or nonminority were included in the
analyses.
For each of the 72 Specific Aptitude Test Battery validation studies in
the data file that had data for 50 or more black and 50 or more nonminority
individuals, the following statistics were computed separately for each
group and for the total combined group: the mean and standard deviation
of the job family composite test score and criterion measure, the
correlation between the composite test score and the criterion measure,
the slope and intercept of the regression of the criterion measure on the
composite test score, and the standard error of prediction. Within each
study the regression equations were compared by testing the significance
of the difference between the slopes, and if the slopes were not signifi-
cantly different, the significance of the difference between the intercepts
of the regression equations.
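The per-study computations described above can be sketched as follows. The data are tiny made-up samples, the function names are hypothetical, and two choices are assumptions rather than details confirmed by the text: the standard error of prediction is taken as the square root of the residual sum of squares over n - 2, and the slope comparison uses the usual t-ratio with a pooled residual variance on n1 + n2 - 4 degrees of freedom.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y on x; returns a dict of summary statistics."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    sse = syy - slope * sxy  # residual sum of squares
    return {
        "n": n, "mean_x": mean_x, "mean_y": mean_y,
        "sd_x": math.sqrt(sxx / (n - 1)), "sd_y": math.sqrt(syy / (n - 1)),
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope, "intercept": mean_y - slope * mean_x,
        "sxx": sxx, "sse": sse,
        "se_prediction": math.sqrt(sse / (n - 2)),
    }

def slope_difference_t(fit1, fit2):
    """t-ratio and degrees of freedom for the difference between two slopes."""
    df = fit1["n"] + fit2["n"] - 4
    pooled_var = (fit1["sse"] + fit2["sse"]) / df
    se = math.sqrt(pooled_var * (1.0 / fit1["sxx"] + 1.0 / fit2["sxx"]))
    return (fit1["slope"] - fit2["slope"]) / se, df

# Invented data: x = composite test score, y = criterion rating.
x1, y1 = [80, 90, 100, 110, 120, 130], [41, 47, 46, 52, 55, 54]
x2, y2 = [70, 80, 90, 100, 110, 120], [40, 43, 42, 47, 46, 50]

fit1, fit2 = fit_line(x1, y1), fit_line(x2, y2)
t_ratio, df = slope_difference_t(fit1, fit2)
print(f"slopes: {fit1['slope']:.3f} vs {fit2['slope']:.3f}; t = {t_ratio:.2f} on {df} df")
```

With samples this small the slope difference is far from significant, which mirrors the low power of single-study slope tests.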
Standard Errors of Prediction
The standard error of prediction is based on the spread of the observed
scores on the criterion measure around the criterion scores that are
predicted from the test scores using the regression line. A larger standard
error of prediction indicates that there is more spread around the
regression line, and hence the prediction is less precise. If the standard
error of prediction was consistently larger for one group than for another,
then one could conclude that the errors of prediction were greater for the
group with the larger standard error, and hence that the predictor is less
useful for that group.
The standard error of prediction was larger for blacks than for nonmi-
norities in 40 of the 72 studies, whereas the converse was true in the
remaining 32 studies. Since the standard error of prediction increases as
the correlation decreases, one might have expected more of a tendency
for the standard error of prediction to be larger for blacks than for
nonminorities due to the previously discussed difference in correlations.
However, the standard error of prediction also depends on the standard
deviation of the criterion scores. Indeed, when the correlations are as low
as those typically found between the GATB composite and the criterion
measure, the standard error of prediction is dominated by the standard
deviation of the criterion scores. Thus, the fact that the standard error of
prediction is larger for blacks than for nonminorities only slightly more
often (56 percent of the studies) than it is smaller (44 percent) is not
inconsistent with the typical difference in correlations.
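A small numerical sketch shows why the criterion standard deviation dominates at low correlations. It uses the usual relation SE = s_y * sqrt(1 - r^2), ignoring the small degrees-of-freedom correction, with the weighted mean correlations reported earlier (.12 and .19). The criterion standard deviations of 10.0 and 10.5 are hypothetical values chosen only for illustration.

```python
import math

def se_prediction(sd_y, r):
    """Standard error of prediction from the criterion SD and the correlation."""
    return sd_y * math.sqrt(1.0 - r * r)

# Shrinkage factors sqrt(1 - r^2) at the two average correlations.
factor_black = math.sqrt(1.0 - 0.12 ** 2)
factor_nonminority = math.sqrt(1.0 - 0.19 ** 2)
print(f"sqrt(1 - r^2): {factor_black:.4f} (r = .12) vs {factor_nonminority:.4f} (r = .19)")

# A 5 percent difference in criterion SD (hypothetical) outweighs the gap in r.
se_low_r_small_sd = se_prediction(10.0, 0.12)
se_high_r_large_sd = se_prediction(10.5, 0.19)
print(f"SE with r = .12, SD = 10.0: {se_low_r_small_sd:.2f}")
print(f"SE with r = .19, SD = 10.5: {se_high_r_large_sd:.2f}")
```

The two shrinkage factors differ by only about one percent, so even a modest difference in criterion spread between groups can reverse the ordering of the standard errors.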
Slopes
The slopes of the regression of the criterion scores on the job family
GATB composite scores were significantly different at the .05 level in
only 2 of the 72 studies. The number of significant differences in slopes at
the .10 level was 6 of the 72; in 4 of the latter 6 studies, the slope was
greater for nonminorities than for blacks, whereas the converse was true
in the other 2 studies. Although these results suggest that slope differ-
ences are relatively rare, it should be noted that the test for differences in
slopes for the two groups in an individual study has relatively little power
for the typical sample sizes of the studies.
A more sensitive comparison of the slopes is provided by considering
the full distribution of the 72 t-ratios computed to test the difference
between the slopes obtained for the two groups on a study-by-study basis.
A positive t-ratio indicates that the slope for nonminority employees is
greater than the slope for black employees, albeit not necessarily
significantly greater. Conversely, a negative t-ratio indicates that the slope for
black employees is greater than that for nonminority employees.
The distribution of the t-ratios for the tests of differences between
slopes for the two groups is shown in the stem-and-leaf chart in Table
9-4. If the pairs of slopes differed only due to sampling error in the 72
studies, positive and negative t-ratios would be equally likely and the
mean of the 72 t-ratios would differ from zero only by chance. As can be
seen, positive t-ratios outnumber negative ones almost two to one (47
versus 25). The mean of the 72 t-ratios is .30, a value that is significantly
greater than zero. Thus, there is a tendency for the slope to be greater for
nonminorities than for blacks, but the differences are generally not large
enough to be detected reliably in an individual study because of relatively
small samples of people in each group.
TABLE 9-4 Stem-and-Leaf Chart of the t-Ratios for the Tests of
Differences Between the Slope of the Regression Based on Data for
Black and Nonminority Employees in 72 Studies (stem = 1; leaf = .1)
Stem   Leaf               Count
  2    01                   2
  1    5579                 4
  1    001112222333444     15
  0    555666667888999     15
  0    01122233444         11
 -0    0012334444          10
 -0    56667                5
 -1    1233444              7
 -1    589                  3
NOTE: Median = .45.
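The summary figures for the distribution of t-ratios can be checked approximately by decoding the leaves of Table 9-4. This is a sketch: the leaves carry only one decimal place, so the recovered mean is approximate, and the text does not say which test was used to show that the mean exceeds zero. The one-sample t statistic and the two-sided sign test below are assumptions chosen for illustration.

```python
import math
import statistics

# Rows of Table 9-4 as (sign, stem, leaves); each leaf is a tenths digit.
rows = [
    (+1, 2, "01"),
    (+1, 1, "5579"),
    (+1, 1, "001112222333444"),
    (+1, 0, "555666667888999"),
    (+1, 0, "01122233444"),
    (-1, 0, "0012334444"),
    (-1, 0, "56667"),
    (-1, 1, "1233444"),
    (-1, 1, "589"),
]

t_ratios = [sign * (stem + int(leaf) / 10)
            for sign, stem, leaves in rows for leaf in leaves]
n = len(t_ratios)
# Count by row sign so that leaves of 0 on the -0 stem stay negative.
n_positive = sum(len(leaves) for sign, _, leaves in rows if sign > 0)

mean_t = statistics.mean(t_ratios)
median_t = statistics.median(t_ratios)

# One-sample t statistic for the hypothesis that the mean t-ratio is zero.
t_stat = mean_t / (statistics.stdev(t_ratios) / math.sqrt(n))

# Two-sided sign test: are positive and negative t-ratios equally likely?
k = max(n_positive, n - n_positive)
p_sign = min(1.0, 2 * sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n)

print(f"n = {n}, positive = {n_positive}, negative = {n - n_positive}")
print(f"mean = {mean_t:.2f}, median = {median_t:.2f}")
print(f"one-sample t = {t_stat:.2f}, sign-test p = {p_sign:.4f}")
```

The decoded values reproduce the 47-to-25 split and the median of .45, and both checks point in the same direction as the conclusion stated in the text.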
The tendency for the slope to be somewhat greater for nonminorities
than for blacks is consistent with the finding that, on average, the
correlation between the GATB composite and the criterion measure is
higher for nonminorities than for blacks. When slopes are unequal, then
the difference between the predictions based on the equations for the two
groups will vary depending on the value of the score on the GATB
composite.
The practical implication of the difference in slopes depends on the
relationship of the two regression lines. When the regression line for
nonminorities not only has a steeper slope but also is above the regression
line for blacks throughout the range of GATB scores obtained by blacks,
then blacks will be predicted to have higher criterion scores if the
equation for nonminorities is used than if the equation based on the data
for blacks is used. However, the difference will be greater for blacks with
relatively high GATB scores than for blacks with relatively low scores.
Other combinations are, of course, possible when the slopes differ.
However, as we show below, the above pattern is most common.
Intercepts
For the 70 studies in which the slopes were not significantly different at
the .05 level, a pooled within-group slope was used and the difference in
intercepts for the two groups was tested. In 26 of the 70 studies the
intercepts were significantly different at the .05 level. Even with a
significance level of .01, 20 of the studies had significantly different
intercepts. In all 20 of the latter cases, the intercept for the nonminority
employees was greater than the intercept for the black employees. This
was also the case for five of the six studies in which the difference was
significant at the .05 level but not at the .01 level. Thus, in only 1 of the
26 studies in which the intercepts were significantly different was the
intercept greater for black than for nonminority employees.
To get a sense of the magnitude of the difference in intercepts, the
intercept for black employees was subtracted from the intercept for
nonminority employees and the difference was divided by the standard
deviation of the criterion scores based on the sample of black employees.
The latter step was taken, in part, to account for differences in the
criterion scale from one study to another and, in part, to express the
difference in a metric that is defined by the spread of the scores for one of
the groups. The distribution of these standardized differences in intercepts
is shown in the stem-and-leaf chart in Table 9-5. (Note that all 72
studies are included in the distribution, even though 2 of the studies had
significant differences in the slopes, suggesting that a pooled, within-
group slope is not entirely appropriate.)
TABLE 9-5 Stem-and-Leaf Chart of Standardized Differences in
Intercepts of Regression Lines for Black and Nonminority Employees
(stem = .1; leaf = .01)
Stem   Leaf           Count
 .8    11               2
 .7
 .6
 .5    0122499          7
 .4    2233499          7
 .3    012233457        9
 .2    13455678999     11
 .1    1111455999      10
 .0    55777999        12
-.0    0123349          7
-.1    126              3
-.2                     0
-.3    04               2
NOTE: Median = .235.
As the table shows, the difference is positive more often than it is
negative, with a median value roughly equal to one-quarter of the
standard deviation of criterion scores for black employees. Values of
these standardized differences in intercepts that are greater than zero
indicate that the performance that would be predicted for a given test
score would be higher if the equation with the pooled, within-group slope
but the intercept for nonminority employees were used than if the
equation with the intercept for black employees were used. With positive
values the nonminority equation would tend to overestimate the criterion
performance of black employees. The converse is true for standardized
differences that are less than zero.
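The standardized difference in intercepts described above can be sketched in a few lines. The data are invented, the function names are hypothetical, and group 1 merely plays the role of the nonminority sample and group 2 the role of the black sample. The pooled within-group slope is computed in the usual way, as the sum of the within-group cross-products divided by the sum of the within-group sums of squares of the predictor.

```python
import math

def centered_sums(x, y):
    """Return n, the two means, and the centered sums Sxx, Syy, Sxy."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return n, mean_x, mean_y, sxx, syy, sxy

def standardized_intercept_difference(x1, y1, x2, y2):
    """Intercept gap (group 1 minus group 2) in units of group 2's criterion SD."""
    n1, mx1, my1, sxx1, syy1, sxy1 = centered_sums(x1, y1)
    n2, mx2, my2, sxx2, syy2, sxy2 = centered_sums(x2, y2)
    pooled_slope = (sxy1 + sxy2) / (sxx1 + sxx2)
    intercept1 = my1 - pooled_slope * mx1
    intercept2 = my2 - pooled_slope * mx2
    sd_y2 = math.sqrt(syy2 / (n2 - 1))
    return pooled_slope, (intercept1 - intercept2) / sd_y2

# Invented data: x = composite test score, y = criterion rating.
x1, y1 = [80, 90, 100, 110, 120, 130], [41, 47, 46, 52, 55, 54]
x2, y2 = [70, 80, 90, 100, 110, 120], [40, 43, 42, 47, 46, 50]

pooled_slope, std_diff = standardized_intercept_difference(x1, y1, x2, y2)
print(f"pooled within-group slope = {pooled_slope:.3f}")
print(f"standardized intercept difference = {std_diff:.2f}")
```

A positive result here corresponds to the case discussed in the text, in which the group 1 equation would tend to overestimate the criterion performance of group 2.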
Predictions Based on the Total Group
In practice, if a single regression equation were to be used to predict the
criterion performance of applicants, presumably it would not be either of
the within-group equations that was used to test the differences in
intercepts. Rather, a total-group equation based on the combined groups
would be used. Therefore, the regression equation based on the combined
group of black and nonminority employees was estimated for each study.
TABLE 9-6 Stem-and-Leaf Chart of Standardized Difference in
Predicted Criterion Scores Based on the Total-Group and Black-Only
Regression Equations: GATB Composite Score = Black Mean Minus
One Standard Deviation
Stem   Leaf                        Count
 .3    01568                         5
 .2    01112236                      8
 .1    01244457889                  11
 .0    0112222333455566677889999    25
-.0    00111111223567788            17
-.1    0015                          4
-.2    3                             1
-.3    2                             1
NOTE: Median = .05.
The potential impact of using a total-group regression equation to
predict the criterion performance of black employees was evaluated by
computing the predicted scores that would be obtained using the total-
group equation and comparing those predictions to the values that would
be obtained using the corresponding equation based on black employees
only. More specifically, at each of three score values on the GATB job
family composite, two scores were obtained: the predicted criterion score
based on the total-group equation and the predicted criterion score based
on the equation for black employees only. The latter predicted value was
subtracted from the former and, as before, the difference was divided by
the standard deviation of the criterion scores for black employees to take
into account between-study differences in the metric of the criterion
measure. The three levels of GATB job family composite score that were
used were (1) the mean for black employees in the study minus one
standard deviation for those employees, (2) the mean for black employ-
ees, and (3) the mean plus one standard deviation.
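The comparison of total-group and subgroup predictions at the three score levels can be sketched as follows. The data are invented, the function names are hypothetical, and the "subgroup" stands in for the black-only sample while the "total group" is the two invented samples combined.

```python
import math

def fit(x, y):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sample_sd(values):
    """Sample standard deviation with an n - 1 denominator."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def standardized_prediction_gaps(x_total, y_total, x_sub, y_sub):
    """Total-group minus subgroup prediction, in subgroup criterion SD units."""
    slope_t, intercept_t = fit(x_total, y_total)
    slope_s, intercept_s = fit(x_sub, y_sub)
    mean_x, sd_x, sd_y = sum(x_sub) / len(x_sub), sample_sd(x_sub), sample_sd(y_sub)
    levels = {"mean - 1 SD": mean_x - sd_x, "mean": mean_x, "mean + 1 SD": mean_x + sd_x}
    return {
        label: ((intercept_t + slope_t * x0) - (intercept_s + slope_s * x0)) / sd_y
        for label, x0 in levels.items()
    }

# Invented data: x = composite test score, y = criterion rating.
x_other, y_other = [80, 90, 100, 110, 120, 130], [41, 47, 46, 52, 55, 54]
x_sub, y_sub = [70, 80, 90, 100, 110, 120], [40, 43, 42, 47, 46, 50]

gaps = standardized_prediction_gaps(x_other + x_sub, y_other + y_sub, x_sub, y_sub)
for label, gap in gaps.items():
    print(f"{label:12s} standardized difference = {gap:+.2f}")
```

Because the total-group slope in this invented example is steeper than the subgroup slope, the standardized difference grows from the low score level to the high one.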
The distributions of these standardized differences in predicted scores
are shown in the stem-and-leaf charts in Tables 9-6, 9-7, and 9-8, one for
each of the predictor score levels used in the calculations. Analogous to
the above intercept comparisons, a positive number indicates that the
predicted criterion performance of a black employee with the selected
GATB composite score is higher when the total-group equation is used
than when the equation for black employees only is used. In this case the
total-group equation is said to overpredict or to provide a prediction that
is biased in favor of black employees with that GATB composite score.
Conversely, negative numbers would be said to underpredict or to yield
predictions that are biased against black employees with that GATB
composite score.
TABLE 9-7 Stem-and-Leaf Chart of Standardized Difference in
Predicted Criterion Scores Based on the Total-Group and Black-Only
Regression Equations: GATB Composite Score = Black Mean
Stem   Leaf                        Count
 .4    011                           3
 .3    012358                        6
 .2    222345588                     9
 .1    0022223333334555666778999    25
 .0    012222334555666788           18
-.0    12223446                      8
-.1    138                           3
NOTE: Median = .13.
Although there is substantial variation from study to study, a large
amount of which would be expected simply on the basis of sampling
variability, there is some tendency for the standardized difference in
predicted criterion scores to be positive. The tendency is weakest at the
lowest predictor score value (median = .05) and strongest at the highest
predictor score value (median = .18). The latter difference is a
consequence of the total-group slope typically being slightly greater than the
slope for black employees only, and it is consistent with the finding noted
above that there is a tendency for the slope based on data for nonminorities
to be somewhat greater than the slope based on data for blacks.
TABLE 9-8 Stem-and-Leaf Chart of Standardized Difference in
Predicted Criterion Scores Based on the Total-Group and Black-Only
Regression Equations: GATB Composite Score = Black Mean Plus
One Standard Deviation
Stem   Leaf              Count
 .6    133                 3
 .5    011                 3
 .4    2567                4
 .3    01235588            8
 .2    00123456788999     14
 .1    00244567788889     14
 .0    1122233345677      13
-.0    0222445667         10
-.1    27                  2
-.2                        0
-.3    4                   1
NOTE: Median = .18.
The above results suggest that the use of a total-group regression
equation generally would not give predictions that were biased against
black applicants. If the total-group equation does give systematically
different predictions than would be provided by the equation based on
black employees only, it is somewhat more likely to overpredict than to
underpredict. These results are generally consistent with results that have
been reported for other tests. As was noted by Wigdor and Garner (1982:
77), for example:
Predictions based on a single equation (either the one for whites or for a combined
group of blacks and whites) generally yield predictions that are quite similar to, or
somewhat higher than, predictions from an equation based only on data from
blacks. In other words, the results do not support the notion that the traditional
use of test scores in a prediction yields predictions for blacks that systematically
underestimate their performance.
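The comparison behind these statements can be sketched in a few lines. The example below fits one least-squares line to a combined group and one to a subgroup, predicts the criterion at a chosen predictor score from each, and divides the difference by the subgroup's criterion standard deviation. The choice of standardizer, the sign convention (total-group prediction minus black-only prediction, so that positive values mean overprediction), and all of the data are assumptions made for illustration; they are not taken from the GATB studies.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def standardized_difference(x_total, y_total, x_sub, y_sub, score):
    """(Total-group prediction - subgroup prediction) / subgroup criterion SD."""
    slope_t, icpt_t = fit_line(x_total, y_total)
    slope_s, icpt_s = fit_line(x_sub, y_sub)
    pred_total = icpt_t + slope_t * score
    pred_sub = icpt_s + slope_s * score
    return (pred_total - pred_sub) / stdev(y_sub)

# Made-up predictor scores and ratings for two small groups (hypothetical).
x_black = [80, 90, 100, 110, 120]
y_black = [2.0, 2.5, 2.4, 2.9, 3.2]
x_other = [90, 100, 110, 120, 130]
y_other = [2.6, 2.9, 3.3, 3.4, 3.8]
x_total, y_total = x_black + x_other, y_black + y_other

at_mean = standardized_difference(
    x_total, y_total, x_black, y_black, mean(x_black))
at_plus_1sd = standardized_difference(
    x_total, y_total, x_black, y_black, mean(x_black) + stdev(x_black))
print(round(at_mean, 2), round(at_plus_1sd, 2))
```

In this toy data the total-group slope is steeper than the subgroup slope, so the standardized difference is positive at the subgroup mean and larger one standard deviation above it, the same pattern the text describes for the medians of Tables 9-7 and 9-8.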
In considering the implications of these results, it is important to note
that the criterion measure in most cases consisted of supervisor ratings.
Any interpretation of the results depends on the adequacy of the criterion
measure, including the lack of bias. In addition, it is important to recall
that the correlation of the GATB composite with criterion performance is
generally low for black employees (weighted averages of only .15 and .12
for Job Families IV and V, according to the summary reported by U.S.
Department of Labor, 1987).
Given the low correlation and the substantial difference in mean scores
of blacks and whites on the GATB, use of the test for selection of black
applicants without taking the applicant's race into account would yield
very modest gains in average criterion scores but would have substantial
adverse impact. It is within this context that the differential prediction
results need to be evaluated.
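The trade-off described here can be made concrete with a rough calculation. The sketch below assumes that predictor and criterion are bivariate normal, that selection is strictly top-down, and that the two groups differ by about one standard deviation on the predictor; the function names and the 50 percent selection ratio are hypothetical choices for illustration, not figures from the committee's analysis.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def expected_gain(validity, selection_ratio):
    """Mean criterion z-score of those selected top-down within a group.

    Assumes bivariate normality: the gain equals the validity coefficient
    times the mean predictor z-score of the selected fraction.
    """
    cutoff = STD_NORMAL.inv_cdf(1 - selection_ratio)
    mean_predictor_z = STD_NORMAL.pdf(cutoff) / selection_ratio
    return validity * mean_predictor_z

def lower_group_selection_rate(higher_group_rate, mean_gap_sd):
    """Share of the lower-scoring group above a cutoff set on the higher one."""
    cutoff = STD_NORMAL.inv_cdf(1 - higher_group_rate)
    return 1 - STD_NORMAL.cdf(cutoff + mean_gap_sd)

gain = expected_gain(validity=0.12, selection_ratio=0.5)
rate = lower_group_selection_rate(higher_group_rate=0.5, mean_gap_sd=1.0)
print(round(gain, 3), round(rate, 3))
```

Under these assumptions, a validity of .12 with the top half selected yields an expected gain of less than a tenth of a criterion standard deviation, while a common cutoff at the higher-scoring group's mean would admit roughly 16 percent of the lower-scoring group.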
Performance Evaluation and the Issue of Bias
It is often demonstrated in the psychological literature that supervisor
ratings are fallible indicators of job performance (e.g., Alexander and
Wilkins, 1982; Hunter, 1983). In order to combat some of the weaknesses
of the genre, a specially developed rating form, the Standard Descriptive
Rating Scale, is used for most of the GATB criterion-related validity
studies. Raters are told that the information is being elicited only for
research purposes, not for any operational decisions.
Nevertheless, the possibility of racial, ethnic, or gender bias contami-
nating this kind of criterion measure is an issue deserving attention.
Although common sense suggests that evaluations of the performance of
blacks or women might well be depressed to some degree by prejudice, it
is difficult to quantify this sort of intangible (and perhaps unconscious)
effect.
Two recent surveys draw together the efforts to date. Kraiger and Ford
(1985) and Ford et al. (1986) provide meta-analyses of the presence of race
effects in various types of performance measures. The first review
(Kraiger and Ford, 1985) examines the relation between race and (sub-
jective) performance ratings. A total of 74 studies were located, 14 of
them using black as well as white raters. The analysis reveals the
existence of a suggestive rater-ratee interaction: white raters rated the
average white ratee higher than 64 percent of the black ratees, and black
raters rated the average black ratee higher than 67 percent of the white
ratees.
For white raters, there was sufficient variability and a sufficient number
of studies to evaluate the effect of moderator variables. (Although there
was more variability for black raters, there were too few studies to
perform the moderator analysis.) The authors found a significant (p < .10)
inverse correlation between the percentage of blacks in a sample and the
difference in the average rating. The higher the percentage of blacks, the
less the difference. The three remaining moderator effects had nonsignif-
icant effects. Rater training (which may or may not have discussed race)
had no impact. The purpose for obtaining the ratings, either for real,
administrative reasons or for research only, had no impact. Although
there appeared to be a tendency for behaviorally based rating scales to
show a greater difference between blacks and whites than did trait scales,
it was not significant.
Because the 1985 study was limited to subjective ratings, the authors
could not attempt to estimate the relative contributions of ratee perfor-
mance and rater bias to the differences in ratings found for blacks and
whites. The second paper (Ford et al., 1986) represents a preliminary
attempt to address the issue of the extent to which race differences in
assessments of job performance are the product of meaningful perfor-
mance differences or the product of rater bias. Ideally, one would have a
perfect criterion, one without limitation or bias, that would provide a
perfectly accurate measure of job performance for blacks and whites. In
the absence of such an ultimate criterion, the authors seek to advance our
understanding by looking at the extent of racial differences for objective
and subjective ratings of performance.
Ford and colleagues identified 53 studies, published and unpublished,
that reported at least one objective performance measure and one
subjective rating for a sample of black and white workers. Comparisons
are reported for three types or aspects of performance: absenteeism and
tardiness; cognitive performance; and direct performance such as units
produced, accidents, or customer complaints. The meta-analysis cumulated
correlations between race and objective indices of performance and
subjective ratings of performance in order to compute mean effect sizes
and variances across studies.
For the purposes of the committee's study, Ford and colleagues (1986)
make a number of interesting observations. First, they report a relatively
high degree of consistency in overall effect sizes across multiple criterion
measures; in other words, there are similar magnitudes of difference
between blacks and whites no matter what kind of performance is
measured or what kind of measurement is used (the effect size ranges
from .11 to .34).
Second, contrary to conventional wisdom, they found that the effect
sizes for objective and subjective performance criteria were virtually
identical. They report (1986:Table 1) for the total sample a mean effect
size of the correlation (corrected for unequal sample sizes and attenua-
tion) between ratee race and performance measure of .21 for objective
criteria and .20 for subjective criteria. One conclusion that the authors
draw from this is that the race effects found in subjective ratings cannot
be attributed solely to rater bias.
Interestingly, the biggest reported differences in measured performance
between blacks and whites are associated not with the type of criterion
measure (objective or subjective) but with the type of performance
measured. The biggest mean effect sizes with both objective and subjective
measures were for cognitive performance: .34 and .23, respectively.
In comparison, the effect sizes for direct performance were .16 for
objective measures and .22 for subjective measures. Note that although
race differences are smaller when measures of direct performance are
used than when cognitive performance measures are used, all measures of
on-the-job performance produce much smaller differences in scores
between blacks and whites than do predictor tests such as the GATB, on
which blacks are typically found to score about one standard deviation
below whites. We will return to this subject in Chapter 13.
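To put the correlational effect sizes on the same footing as the one-standard-deviation gap on the predictor, a race-performance correlation can be converted to a standardized mean difference. The sketch below uses the textbook conversion d = 2r / sqrt(1 - r^2), which assumes two groups of equal size. Because the reported correlations are corrected for unequal sample sizes and attenuation, the results are only rough guides, and the function name is illustrative.

```python
from math import sqrt

def r_to_d(r):
    """Convert a point-biserial correlation to a standardized mean difference.

    Assumes equal group sizes: d = 2r / sqrt(1 - r^2).
    """
    return 2 * r / sqrt(1 - r ** 2)

# Mean effect sizes quoted in the text from Ford et al. (1986).
reported = {
    "objective, total sample": 0.21,
    "subjective, total sample": 0.20,
    "objective, cognitive": 0.34,
    "objective, direct": 0.16,
}
converted = {label: r_to_d(r) for label, r in reported.items()}
for label, d in converted.items():
    print(f"{label}: r = {reported[label]:.2f}, d = {d:.2f}")
```

On this conversion a correlation of .21 corresponds to a difference of about .43 standard deviations, and even the largest value, .34, to about .72, both below the difference of about one standard deviation cited for predictor tests.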
The studies reported here point up the need for more attention to the
matter of performance differences between blacks and whites and the
extent to which measured differences reflect meaningful differences in
employee performance or are the consequence of bias in the measurement
technique.
With regard to the immediate purpose of evaluating the GATB validation
research, the possibility of bias in the criterion measure adds further grounds
for caution in interpreting the validity of the GATB for minority applicants.
The U.S. Employment Service's long-term research agenda should include
the task of exploring the influence of bias on supervisor ratings.
CONCLUSIONS
Differential Validity by Race
1. Analysis of the 72 GATB validity studies that had at least 50 black
and 50 nonminority employees indicates that correlations are lower for
blacks than for whites. The average correlation of the GATB composite
with the criterion measure was .12 for black employees and .19 for
nonminority employees. Moreover, for a quarter of the studies, the
correlation for blacks is .03 or less.
These results raise serious questions about the degree of validity of the
job family composites for black applicants. Not only are the average
validity coefficients lower for blacks than for nonminorities, but also the
level of the correlation for blacks is quite low.
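When reading individual study results against these averages, it helps to see how imprecise a single correlation is at the minimum group size used here. The sketch below computes an approximate 95 percent confidence interval with the Fisher z transformation, for a hypothetical single study of 50 employees whose observed correlation equals the black-employee average of .12. The function name is illustrative, and the normal approximation is an assumption.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def correlation_interval(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z.

    The transformed value atanh(r) is treated as normal with standard
    error 1 / sqrt(n - 3); the limits are mapped back with tanh.
    """
    z = atanh(r)
    se = 1 / sqrt(n - 3)
    crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return tanh(z - crit * se), tanh(z + crit * se)

low, high = correlation_interval(r=0.12, n=50)
print(round(low, 2), round(high, 2))
```

For this hypothetical study the interval runs from roughly -.16 to .39, which illustrates why study-to-study variation in observed correlations is to be expected from sampling variability alone.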
Differential Prediction by Race
2. Are group differences in average test scores reflected in differences
in performance? Analysis of the same set of 72 validity studies shows that
use of a single prediction equation relating GATB scores to performance
criteria for the total group of applicants would not give predictions that
were biased against black applicants. That is, the test scores would not
systematically underestimate their performance. A total-group equation is
somewhat more likely to overpredict than to underpredict the perfor-
mance of black applicants.
Criterion Bias
3. The results of our differential prediction analysis could be qualified
by inadequacies in the criterion measure, including racial or ethnic bias.
Supervisor ratings are susceptible to bias. There is some evidence that
supervisors tend to rate employees of their own race higher than they rate
employees of another race. Real performance differences could thus be
confounded with spurious differences in the performance measure used to
judge the accuracy of prediction of GATB scores. This is an issue that
should be part of the U.S. Employment Service's long-term research agenda.
PART IV
ASSESSMENT OF THE VG-GATB PROGRAM
Part IV contains the committee's appraisal of the VG-GATB Referral
System and consideration of the potential effects of its widespread use
throughout the Public Employment Service. Chapter 10 lays out the plan
for the VG-GATB system as it is envisioned by the U.S. Employment
Service for its local office operations; discusses the claims that have been
made for the system; and assesses the evidence available on its imple-
mentation from a small number of pilot studies. Chapter 11 discusses the
likely effects of widespread adoption of the VG-GATB system on the
specific groups involved: employers, job seekers, in particular minority
job seekers, people with handicapping conditions, and veterans. Chapter
12 is the committee's assessment of the claims of potential economic
benefits that have been made for VG-GATB referral, including both gains
for individual firms and gains for the economy as a whole.
Representative terms from entire chapter:
criterion measure