7
Some Methodological Issues
in Making Predictions
John B. Copas and Roger Tarling
Methodological considerations are central to all quantitative or actuarial predictions, although each particular prediction study invariably presents its own special issues. At its most general level, a prediction study investigates the extent to which criterion measures (the dependent variables) can be predicted by one or more measures of other factors (the predictor or independent variables).
It is outside the scope of this paper to discuss all the important methodological steps in the process: the selection and measurement of appropriate information; the choice of statistical method; the practical application of a prediction instrument and its utility. Instead, we concentrate on four aspects. First, we examine in detail the Burgess and Glueck point-scoring methods, which have been used extensively in criminological prediction. Second, we consider the important topic of validating and calibrating the prediction instrument. Third, we review the various measures that have been proposed to assess an instrument's predictive power. Fourth, we describe methods for reusing samples to carry out a prospective validation. At each stage we attempt to synthesize some of the previous work in the area and present the results of our more recent statistical and methodological research.

John B. Copas is professor of statistics, University of Birmingham, England; Roger Tarling is deputy head, Home Office Research and Planning Unit, England.
POINT-SCORING METHODS
A variety of statistical methods have been used to construct prediction instruments. Chief among them are the Burgess and Glueck point-scoring methods, multiple regression, log-linear methods, and logistic regression. In addition, various clustering, classification, and segmentation techniques have been used. (The latter group of techniques are not discussed here; see Fielding, 1979; Tarling and Perry, 1985.)^1 For examples of the application of all these methods in criminological research, see the studies included in Farrington and Tarling (1985).

^1 The statistical methods listed above have severe limitations for much criminal career research, especially when the dependent variable is not binary and the focus of interest is on the time interval to some event, for example, the next offense. We would suggest that alternative statistical methods, stochastic point-process models, and failure-rate regression models are more appropriate in these situations and should receive more attention from criminologists.

Invariably, these methods have been used in studies in which the dependent variable is binary (e.g., reconvicted/not reconvicted). Many criminologists have found that simple point-scoring methods are more efficient or robust than more sophisticated methods and shrink less when applied to a validation sample. This seems especially so when the data contain measurement errors or "noise" (S. D. Gottfredson and Gottfredson, 1985; Wilbanks, 1985). This finding, plus the fact that point-scoring methods are simple in conception and administratively easy to use, has led to their being adopted in practice, particularly in studies of parole and sentencing decision making (D. M. Gottfredson, Wilkins, and Hoffman, 1978; Nuttall et al., 1977). However, some commentators have said that point-scoring methods are intolerably crude, have no statistical foundations, and do not result in any direct probabilistic interpretation. In this section we explore point-scoring methods to see if we can resolve some of these tensions and anomalies. In addition, we show how point-scoring methods, reconceptualized in the way we recommend, can be extended.

There are two basic point-scoring methods, one ascribed to Burgess (1928) and the other to Glueck and Glueck (1950). In the Burgess method each subject is given a score of either 0 or 1 on each of a number of predictors, depending on whether the subject falls into a category with a below- or above-average success rate. The Glueck method is more sophisticated in that, instead of contributing a score of 0 or 1, each category of each predictor is weighted according to the percentage of subjects in that category who are successes. The Glueck method can be applied to polychotomous independent variables, but in practice it has only been used for binary predictors. We keep to this simpler situation in our discussion.
Both the Burgess and the Glueck methods have their parallels in standard statistical theory: the "independence Bayes method." First, consider the Burgess method.
Let $x_i$ be a series of binary predictive factors, let $q$ be the overall success rate, and suppose that within the success (S) and failure (F) groups separately, the factors $x_i$ are statistically independent of each other. Let

$$h_i = P(x_i = 1 \mid S), \qquad g_i = P(x_i = 1 \mid F).$$

Assume the $x_i$'s are coded such that $h_i > g_i$. Then, by Bayes theorem,

$$P(S \mid x_i = 1) = \frac{h_i q}{p_i}, \qquad P(S \mid x_i = 0) = \frac{(1 - h_i)\,q}{1 - p_i},$$

where

$$p_i = P(x_i = 1) = h_i q + g_i (1 - q).$$

By independence and Bayes theorem again,

$$\frac{P(S \mid x)}{P(F \mid x)} = \frac{q}{1 - q} \prod_i \frac{P(x_i \mid S)}{P(x_i \mid F)},$$

and so the log odds for S after observing x is

$$\log\frac{q}{1 - q} + \sum_i \log\frac{P(x_i \mid S)}{P(x_i \mid F)}.$$

This can be written as

$$k + \sum_i w_i x_i,$$

where

$$w_i = \log\frac{h_i (1 - g_i)}{(1 - h_i)\, g_i},$$
which is just the log odds ratio for the 2 x 2 table classifying $x_i = 1$ or 0 against S and F. Given independence, these are therefore the optimum weights. By the Neyman-Pearson Lemma in statistical theory, any other set of weights must be less efficient (i.e., they do not use all the information available in the $x_i$'s).

The Burgess method has $w_i = 1$, or, since a scale factor in the score is irrelevant, $w_i$ = constant. Thus, the Burgess method is only optimum if the cross-product ratio is the same for each factor (i.e., each $x_i$ gives the same amount of information about S or F).
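For readers who want the intermediate step, the passage from the product form of the odds to the linear score can be written out as follows. This is a sketch that uses only the definitions $h_i = P(x_i = 1 \mid S)$ and $g_i = P(x_i = 1 \mid F)$; the explicit expression for the constant $k$ is not stated in the text and is derived here.

```latex
\begin{align*}
\log\frac{P(x_i \mid S)}{P(x_i \mid F)}
  &= x_i \log\frac{h_i}{g_i} + (1 - x_i)\log\frac{1 - h_i}{1 - g_i} \\
  &= x_i \log\frac{h_i (1 - g_i)}{(1 - h_i)\, g_i} + \log\frac{1 - h_i}{1 - g_i},
\end{align*}
so that, summing over $i$ and adding the prior log odds,
\begin{align*}
w_i = \log\frac{h_i (1 - g_i)}{(1 - h_i)\, g_i},
\qquad
k = \log\frac{q}{1 - q} + \sum_i \log\frac{1 - h_i}{1 - g_i}.
\end{align*}
```

Setting every $w_i$ to the same constant, as the Burgess method does, therefore amounts to assuming that every factor has the same odds ratio.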
The Glueck method is equivalent to

$$w_i = P(S \mid x_i = 1) - P(S \mid x_i = 0),$$

which, from above, simplifies to

$$w_i = \frac{q(1 - q)(h_i - g_i)}{p_i (1 - p_i)}.$$

Again, a constant multiple is irrelevant, so essentially

$$w_i = \frac{h_i - g_i}{p_i (1 - p_i)} \neq \text{log odds ratio for } x_i.$$

However, if $x_i$ has only modest predictive power, we can write

$$h_i = g_i + \varepsilon_i,$$

where $\varepsilon_i$ is small. We can then show that

$$\log\frac{h_i (1 - g_i)}{g_i (1 - h_i)} = \frac{h_i - g_i}{p_i (1 - p_i)} + \text{terms involving } \varepsilon_i^2.$$

Hence the Glueck method is approximately optimum if $\varepsilon_i$ is small, that is, if each individual $x_i$ contributes only a modest amount of information. In many practical cases the score may involve a relatively large number of $x_i$'s, none of which by itself is spectacular, but together they may be useful. This, we suggest, accounts for the apparent success of the Glueck method.
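The approximation can be checked numerically. The sketch below is illustrative only: the function names and the values of $g$, $q$, and $\varepsilon$ are hypothetical choices, not taken from the chapter. It compares the log odds ratio with the Glueck-type weight $(h_i - g_i)/[p_i(1 - p_i)]$ as $\varepsilon_i$ grows.

```python
import math


def log_odds_ratio(h, g):
    """Optimum (independence Bayes) weight: log odds ratio of the 2 x 2 table."""
    return math.log(h * (1 - g) / ((1 - h) * g))


def glueck_weight(h, g, q):
    """Glueck weight up to a constant multiple: (h - g) / (p (1 - p))."""
    p = h * q + g * (1 - q)  # overall proportion with x = 1
    return (h - g) / (p * (1 - p))


def compare(g, eps, q):
    """Return (log odds ratio, Glueck weight) for h = g + eps."""
    h = g + eps
    return log_odds_ratio(h, g), glueck_weight(h, g, q)


# Hypothetical values: g = 0.3, q = 0.5, increasing predictive power eps.
for eps in (0.01, 0.05, 0.2):
    lor, glu = compare(g=0.3, eps=eps, q=0.5)
    print(f"eps={eps:.2f}  log odds ratio={lor:.4f}  Glueck weight={glu:.4f}")
```

For small $\varepsilon$ the two weights nearly coincide; the gap widens as a single factor becomes more strongly predictive, which is the situation in which the Glueck weights depart from the optimum.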
As set out above, Burgess and Glueck are not separate and distinct models but are, in fact, simple log-linear models in which all the predictor variables are treated as independent, i.e., they are not correlated. We would advocate the use of the formal independence Bayes method in preference to the more ad hoc Burgess and Glueck approaches because it has several important advantages:
1. It is equally simple yet is based on a coherent theory and is optimum within the framework of that theory.
2. It provides a direct estimate of $P(S \mid x)$, whereas the scoring methods of Burgess and Glueck have to be separately calibrated on the data, that is, the probability of success given a certain score is estimated by calculating the proportion of all subjects with that score who succeeded.
3. Similarly, the value of the score is seen to be a log odds ratio. Hence if the score is $s$, the probability of success must be of the form

$$\frac{e^s}{1 + e^s}.$$
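The independence Bayes score for binary predictors can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the toy data, and the add-half smoothing of the proportions (used only to keep the logarithms finite) are assumptions introduced here.

```python
import math


def fit_independence_bayes(X, y):
    """Estimate k and the weights w_i from binary data.

    X: list of rows of 0/1 predictors; y: list of 0/1 outcomes (1 = success).
    Both outcome groups must be non-empty.
    """
    n_s = sum(y)
    n_f = len(y) - n_s
    q = n_s / len(y)                      # overall success rate
    k = math.log(q / (1 - q))             # prior log odds
    weights = []
    for i in range(len(X[0])):
        ones_s = sum(row[i] for row, yi in zip(X, y) if yi == 1)
        ones_f = sum(row[i] for row, yi in zip(X, y) if yi == 0)
        # Add-half smoothing is an assumption made here, not part of the text.
        h = (ones_s + 0.5) / (n_s + 1)    # estimate of P(x_i = 1 | S)
        g = (ones_f + 0.5) / (n_f + 1)    # estimate of P(x_i = 1 | F)
        weights.append(math.log(h * (1 - g) / ((1 - h) * g)))
        k += math.log((1 - h) / (1 - g))  # contribution when x_i = 0
    return k, weights


def predict_success(k, weights, x):
    """Probability of success e^s / (1 + e^s) for score s = k + sum w_i x_i."""
    s = k + sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + math.exp(-s))


# Hypothetical toy data: factor 0 separates the groups, factor 1 does not.
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0, 1, 0]
k, w = fit_independence_bayes(X, y)
print(w, predict_success(k, w, [1, 0]), predict_success(k, w, [0, 0]))
```

Because the score is already a log odds, no separate calibration table of score against observed success rate is needed, which is the second advantage listed above.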
There are two further advantages of the Bayes method that make it extendable in ways not possible for the Burgess and Glueck methods. (Extensions of this kind have been considered in the medical literature under the name of "computer-aided diagnosis models.") First, it can more readily accommodate $x_i$'s that are not binary. The formula is then

$$\text{log odds for } S \text{ given } x = \log\frac{q}{1 - q} + \sum_i \log\frac{f_i(x_i)}{g_i(x_i)},$$

where

$$f_i(x_i) = P(x_i \mid S) \quad \text{and} \quad g_i(x_i) = P(x_i \mid F).$$
Of course, all these probabilities are estimated from the data. Note that we need the proportions of the various values of $x_i$ within the F and S groups separately and not the proportions of S and F within the groups defined by various values of $x_i$ (a crucial distinction). The above formula is not necessarily linear in each $x_i$ (but there is no reason to expect it to be). Thus we avoid the need arbitrarily to dichotomize each predictor variable, the full information in each value of $x_i$ being retained in an optimum way. Of course, if the $x_i$'s are divided into too many categories, each term, such as $P(x_i \mid S)$, is estimated less accurately, and so, if there are too many categories (e.g., age measured in years), it is better to treat $x_i$ as a continuous variable and use a regression technique. Thus if some $x_i$'s are continuous, the term

$$\log\frac{f_i(x_i)}{g_i(x_i)}$$

can be estimated directly as a regression on $x_i$. Hence the method can accommodate mixed data in which some $x_i$'s are continuous, e.g., age, and some $x_i$'s are binary, e.g., sex (cf. analysis of covariance methods).
Second, the Bayes method can be generalized to take account of particular circumstances concerning the distribution of the $x_i$'s. For example, if the $x_i$'s are not independent but correlated to a roughly equal extent (e.g., they are all positively correlated), a modification simply involves multiplying $w_i$ by a constant, and so the relative weights remain essentially the same. Thus, if the Bayes formula is recalibrated on the data (which allows an appropriate linear transformation of the score to be estimated), it works well even when the $x_i$'s are moderately correlated with each other. If the $x_i$'s are correlated, but not all to the same degree, the so-called "Lancaster models" can be used, which are based on a second-order approximation to the joint distribution of the $x_i$'s. These models have been found useful in medical diagnosis applications; see review in Titterington et al. (1981).
Apart from the obvious simplicity, an important advantage of all these methods is the relative precision with which the weights (or coefficients, if viewed as a log-linear model) are estimated. This is because the assumption of independence allows each weight to be estimated separately, and any sampling effects in the intercorrelations of the x's have no effect. If the sample size is relatively small, and the correlations between the x's are, at most, modest, point-scoring methods do well. Larger correlations between the x's, but with a similar sample size, can be dealt with in an approximate way by one of the modifications mentioned above. For somewhat larger sample sizes, however (say several hundred), a prediction equation should make proper allowance for the dependence between the x's, and a logistic model or log-linear model (in the usual sense for categorical data) is the preferred alternative. In such models, each weight or coefficient is, of course, not just a function of the relevant $x_i$ but depends in a much more complicated way on the joint distribution of all the $x_i$'s. The complexity of the model affects the degree of shrinkage, which will be discussed later in the paper. If our suggestions for correcting for shrinkage are used, the increased shrinkage of these complicated models should not present a problem.
PREDICTIVE POWER, CALIBRATION,
AND SHRINKAGE OF PREDICTION
EQUATIONS
Much statistical work in criminology has been concerned with the construction and use of prediction equations. For each individual, some response y (a binary yes-no variable, a time to arrest, et cetera) is measured, along with values of explanatory variables $x_1, x_2, \ldots$, and on the basis of these x's a predicted value of y, say $\hat{y}$, is formulated. How good is $\hat{y}$ as a predictor of y? Issues related to this general question are to be discussed in this section. We are concerned here with the underlying methodology of the assessment of prediction equations, rather than with details of prediction equations in specific applications.
There are two contrasting, and yet complementary, approaches to the discussion of this question, corresponding roughly to the two philosophies of statistical inference and decision theory as understood in the statistical literature. The inference approach is taken up in the next section, where we ask: Given that an individual is described by $x = (x_1, x_2, \ldots)$, what information does that give us about y? A prediction equation, with value $\hat{y}$, is seen as an estimate of the expectation of y in some sense. The properties and behavior of a prediction instrument are studied in terms of the accuracy of $\hat{y}$ over the totality of all different values of y and x. We argue that a particular advantage of the inference approach is that a clear discussion of shrinkage is possible. Our discussion leads to a correction for shrinkage or to "preshrunk" prediction equations as we will call them.
The other approach is more pragmatic; it views a prediction equation as a means to an end, that of a decision instrument. All the issues are illustrated by a binary classification, conventionally labeled positive-negative. Each individual falls into one or other group (e.g., success-failure), the decision as to which is the true group being made on the basis of x. The discussion focuses entirely on the frequencies of correct and incorrect decisions. A confusing array of measures of predictive power has appeared in the criminological literature (and in the parallel literature on computer-aided diagnosis in medicine). We show that the more important of these are in fact very closely related to each other.
There is an obvious link between the two approaches. If y is an observed response, a binary classification could be: success if $y \ge k_1$ and failure if $y < k_1$. The classification from the prediction equation would by analogy be: success if $\hat{y} \ge k_2$ and failure if $\hat{y} < k_2$ (there is no reason to insist that $k_1 = k_2$). We would argue in favor of formulating $\hat{y}$ to optimize such properties as calibration and validation (discussed in the next section) and then choosing $k_2$ to secure desirable aspects of error rates and/or utility (discussed later).
It is worth noting, however, that prediction equations are sometimes useful as a research tool in their own right, not just as a means of implementing the positive-negative decision. For instance, to control for differences between cases in a study, the value of an appropriate prediction $\hat{y}$ could sensibly be used either as a covariate in statistical analysis using covariance adjustments or as a criterion for matching cases and controls in a matched-pairs design. An example of the former approach is in Bottoms and McClintock (1973:Chapter 11).
Validation and Shrinkage
It is almost universal experience that, when a prediction equation is fitted to data and then applied to some new cases or a new cohort, the usefulness and accuracy of the prediction are much more disappointing than expected. The term "shrinkage" has been used to describe this deterioration in predictive power. Although the effect is real enough, and noted in many studies, the term has never been given a precise definition. Quite independently of the experience of criminologists in using prediction equations, there has been the remarkable development in the statistical literature of so-called "shrinkage estimation," a technique whereby a set of related parameters can be estimated more accurately (on average) than by conventional techniques, such as least squares. The use of the same term in these different contexts has appeared at best coincidental and at worst grotesquely misleading. However, there are known to be close connections between them, as discussed in Copas (1983b). Using the theory described in that paper it is possible to (a) clarify the manifestations of shrinkage, (b) highlight the reasons for them, (c) derive alternative methods of fitting prediction equations that will eliminate some of the adverse effects of shrinkage, and (d) enable the extent of shrinkage in any given application to be estimated in advance from the original data. These points are discussed in this section, and a brief outline of Copas's theory is illustrated by a criminological example.
In fitting a prediction equation to data, we will have, as before, observations on some response y (e.g., the number of convictions in a long-term follow-up, or a binary factor describing whether some event, such as rearrest, has occurred) together with information on a number of predictive factors x (number of previous convictions, age, et cetera). The aim is to formulate a predictor $\hat{y} = f(x)$ for some function f [e.g., multiple regression, in which case $f(x) = \alpha + \beta'x$]. The fit of the equation relates to the proximity of $\hat{y}$ to the actual observed values of y. Two aspects of the prediction equation are distinguished:
1. Calibration. Here we group cases with the same or similar values of $\hat{y}$ and ask whether the average of the associated y's is equal to the predicted value $\hat{y}$. The greater the difference, the worse the calibration.
2. Efficacy. Here we ask whether values of $\hat{y}$ discriminate clearly between cases with different x's. A simple measure of this is the correlation between y and $\hat{y}$. (In the case of multiple regression this is just the multiple correlation coefficient, R, the square root of the coefficient of determination.) A large R shows that $\hat{y}$ changes substantially as x changes, while a small R means that $\hat{y}$ is almost the same for all x (and so is useless as a predictor).
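Both aspects are easy to compute once observed and predicted values are available. The following sketch is one possible way to do so; the function names, the equal-sized binning rule, and the toy data are assumptions made for illustration rather than anything prescribed in the text.

```python
from statistics import mean


def calibration_table(y, y_hat, n_bins=5):
    """Group cases by predicted value and compare averages.

    Returns a list of (mean prediction, mean observed outcome) pairs,
    one per group of cases with similar predictions.
    """
    pairs = sorted(zip(y_hat, y))
    size = max(1, len(pairs) // n_bins)
    table = []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        table.append((mean(p for p, _ in chunk), mean(o for _, o in chunk)))
    return table


def efficacy(y, y_hat):
    """Pearson correlation between observed and predicted values."""
    my, mp = mean(y), mean(y_hat)
    sxy = sum((a - my) * (b - mp) for a, b in zip(y, y_hat))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((b - mp) ** 2 for b in y_hat)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical example: a predictor that reproduces y exactly.
y_obs = [1, 2, 3, 4, 5, 6]
y_pred = [1, 2, 3, 4, 5, 6]
print(calibration_table(y_obs, y_pred, n_bins=3))
print(efficacy(y_obs, y_pred))
```

A well-calibrated predictor gives pairs lying close to the diagonal; efficacy is a separate question, since a predictor that always returns the overall mean calibrates perfectly yet discriminates not at all.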
The ideal predictor, never realized in practice, is one in which $\hat{y} = y$ for all x, which calibrates perfectly and has maximal efficacy (R = 1). In practice, if the model behind the prediction equation is correct, when judged by values of y and $\hat{y}$ in the data, $\hat{y}$ will calibrate well but have R somewhat less than 1 (this is essentially the Gauss-Markov theorem of least squares).
A second crucial distinction is between retrospective fit and validation fit. Retrospective fit concerns the comparison between values of y and $\hat{y}$ in the data on which the prediction equation is fitted. Validation fit envisages the prediction equation being applied to a new set of cases or subjects and compares the actual values of y in the new data with the predictions $f(x)$, calculated using the original prediction equation f but using the new values of x. The difference between the sets of data is emphasized by the terms "construction data" and "validation data." Shrinkage implies that validation fit is worse than retrospective fit. In practice, the predictions $\hat{y}$ calibrate well in the construction data but less well, and sometimes very badly, in the validation data. Efficacy is nearly always worse in the validation data than in the construction data. Copas's theory quantifies both these aspects of the deterioration of fit.
There are (at least) three possible causes of the deterioration in both these aspects of fit: (a) a purely statistical effect that is the inevitable result of unexplained (random) variation in the data; (b) changes in the population of x's from construction data to validation data (e.g., there might be some intermediate change of policy or other intervention that alters the range of subjects available for study); and (c) the underlying associations between y and x might change (e.g., a change in some latent factor that is not observed in x). Each of these causes of shrinkage is discussed below.
Shrinkage as a Statistical Effect: Cause (a)
Cause (a) will be illustrated in the case of multiple regression, in which the statistical model is

$$y = \alpha + \beta'x + \varepsilon,$$

$\varepsilon$ being the usual random error. Without loss of generality, we can assume the x's are standardized to have mean zero, so that $\alpha$ merely reflects the overall average value of y. Suppose causes (b) and (c) do not operate, so that we have a stable population of x's and constant true values of $\alpha$ and $\beta$ as we go from construction to validation data. This, therefore, represents the ideal situation as far as fitting and validating a prediction equation is concerned.

If $\hat{\alpha}$ and $\hat{\beta}$ are least squares estimates in the construction data, the prediction equation is

$$\hat{y} = \hat{\alpha} + \hat{\beta}'x.$$

Suppose we test this out on a very large validation sample, so that we compare $y = \alpha + \beta'x + \varepsilon$ with $\hat{\alpha} + \hat{\beta}'x$ over a population of new cases (y, x). To study calibration, we calculate the average y (i.e., $\alpha + \beta'x$) over those cases x that relate to a specific prediction $\hat{y}$. This is done by fitting a linear regression of y on $\hat{y}$, which can be shown to have slope

$$K = \frac{\hat{\beta}'V\beta}{\hat{\beta}'V\hat{\beta}},$$

where V is the variance-covariance matrix of the x's. The average of K, over statistical errors in $\hat{\beta}$, which is evaluated in Copas (1983b), is always less than 1. Hence large values of y tend to be overestimated and small values of y tend to be underestimated. This is because

$$E(\hat{\beta}'V\hat{\beta}) = \beta'V\beta + \frac{m\sigma^2}{n} > \beta'V\beta = E(\hat{\beta}'V\beta),$$

where n is the sample size in the construction data and m is the number of variables measured in x. By the same reasoning, $\beta'V\beta$ can be estimated by $\hat{\beta}'V\hat{\beta} - m\hat{\sigma}^2/n$, where $\hat{\sigma}^2$ is the usual residual mean square, and so K itself can be estimated by

$$\hat{K} = \frac{\hat{\beta}'V\hat{\beta} - m\hat{\sigma}^2/n}{\hat{\beta}'V\hat{\beta}} = 1 - \frac{1}{F},$$

where F is the usual F-ratio of multiple regression. A more thorough analysis, valid if $m \ge 3$, shows that the slightly modified estimate

$$\hat{K} = 1 - \frac{m - 2}{mF}$$

is unbiased in the sense that $E(\hat{K}) = E(K)$. Thus $\hat{K}$ measures the deterioration in calibration; in a set of validation data, the average value of y to be expected for a given $\hat{y}$ is not $\hat{y}$, as might be anticipated from the construction data, but

$$\tilde{y} = \bar{y} + \hat{K}(\hat{y} - \bar{y}),$$

where $\bar{y}$ is the overall observed average of y. The smaller $\hat{K}$ is, the greater the distortion in calibration. Of course, this is itself a prediction in the sense that $\tilde{y}$ is calculated from the construction data and cannot be expected to be invariably correct when applied to practical validation data. However, on average, and to an approximation examined in detail in Copas (1983b),

$$E(y \mid \tilde{y}) = \tilde{y}$$

for a typical validation case (y, x). Thus $\tilde{y}$ can be said to be preshrunk in the sense that it is expected to calibrate well (show no calibration shrinkage) on validation data. Of course $\tilde{y}$ will not calibrate well on the construction data (because it is $\hat{y}$ that does), but, from a pragmatic point of view, retrospective performance of a predictor is irrelevant.
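The shrinkage estimate and the preshrunk predictor are simple enough to compute by hand, but a short sketch makes the recipe explicit. The function names and the numerical values used here are hypothetical; only the two formulas, $\hat{K} = 1 - (m - 2)/(mF)$ and $\tilde{y} = \bar{y} + \hat{K}(\hat{y} - \bar{y})$, come from the text.

```python
def shrinkage_factor(m, F):
    """Estimated calibration slope K-hat = 1 - (m - 2) / (m F).

    m: number of predictor variables (the estimate is valid for m >= 3).
    F: the usual F-ratio of the fitted multiple regression.
    """
    if m < 3:
        raise ValueError("the modified estimate requires m >= 3")
    return 1 - (m - 2) / (m * F)


def preshrunk_prediction(y_hat, y_bar, k_hat):
    """Pull a least squares prediction toward the overall mean by K-hat."""
    return y_bar + k_hat * (y_hat - y_bar)


# Hypothetical illustration: 5 predictors and an F-ratio of 3.
k_hat = shrinkage_factor(m=5, F=3.0)      # 1 - 3/15
print(k_hat)
print(preshrunk_prediction(y_hat=10.0, y_bar=4.0, k_hat=k_hat))
```

A weakly significant regression (F close to 1) gives a small $\hat{K}$ and pulls every prediction strongly toward $\bar{y}$; a highly significant one leaves the least squares predictions almost unchanged.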
The pedigree of $\tilde{y}$ is confirmed in Copas (1983b), in that $\tilde{y}$ corresponds exactly to a "shrinkage estimator" in the sense of the term used in the statistical literature. It is proved that, within the assumptions outlined above, $\tilde{y}$ is uniformly better than $\hat{y}$ in the mean squared error sense, i.e.,

$$E(y - \tilde{y})^2 < E(y - \hat{y})^2$$

over validation data (y, x), provided $m \ge 3$, where m is the number of x variables. If m = 2, $\hat{K} = 1$ and so preshrinkage has no effect. If m = 1, the whole theory breaks down, since the expectations of quantities such as $\hat{K}$ cease to exist (the relevant infinite integrals diverge). In fact, it is shown that for m = 1 and m = 2 no uniform improvement on least squares is possible. The theory of preshrinking is therefore useful only if there are three or more predictive variables in x.
Turning to efficacy, but still in the multiple regression case, the deterioration in correlation is inevitable and cannot be removed by preshrinking. In fact

$$\mathrm{Corr}(y, \tilde{y}) = \mathrm{Corr}(y, \hat{y}),$$

and so the discrimination afforded by $\tilde{y}$ is identical to that of $\hat{y}$. The inevitable decline in correlation is simply due to the fact that in the construction data $\hat{y}$ has knowledge of the actual y's, whereas in validation data it does not. The above theory is immediately extended to predict the validation correlation of y and $\hat{y}$ (or $\tilde{y}$): it is

$$\tilde{R} = \frac{(n - 1)R^2 - m}{(n - m - 1)R},$$

where R is the multiple correlation coefficient in the construction data. Always we have $\tilde{R} < R$. For prediction, the retrospective R is irrelevant; efficacy should be measured by $\tilde{R}$, which on average will be (approximately) the correlation obtained if the predictor ($\hat{y}$ or $\tilde{y}$) were to be validated.
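As a quick check of the formula, the sketch below evaluates the predicted validation correlation for the figures quoted in the chapter's absconding example (n = 500, m = 22, R = 0.322). The function name is an assumption introduced for illustration.

```python
def validation_correlation(R, n, m):
    """Predicted validation correlation ((n - 1) R^2 - m) / ((n - m - 1) R).

    R: multiple correlation coefficient in the construction data.
    n: construction sample size; m: number of predictor variables.
    """
    return ((n - 1) * R ** 2 - m) / ((n - m - 1) * R)


# Figures quoted for the absconding study in this chapter.
r_tilde = validation_correlation(R=0.322, n=500, m=22)
print(round(r_tilde, 3))
```

The result agrees, to three decimal places, with the validation correlation of 0.194 reported for that study.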
A minor point to mention is that $\hat{K}$ can be negative, in which case $\tilde{y}$ inverts the predictions made by $\hat{y}$. However, in the worst case, in which x has no effect ($\beta = 0$), $E(F) > 1$ and so $E(\hat{K}) > 0$. Thus, if $\hat{K}$ is negative, the correlations between y and x are even worse than one would expect from pure random numbers, and it would be apparent that any prediction equation based on x is doomed to failure. The same comment applies to the circumstance that $\tilde{R} < 0$.
The multiple regression model being discussed implicitly assumes that y is a continuous variable. Models for discrete and categorical data are mentioned elsewhere in this paper, including the important case of binary data. Suppose that y is defined to be 1 if an event occurs (success) and 0 if it does not (failure), with the predictive factors x as before. A multiple regression of y on x can still be fitted, with E(y) being interpreted as the probability of success. All the above quantities in shrinkage theory can be calculated in the same way, although their mathematical validity can only be taken as an approximation (but often a reasonable one if n is large and the correlations between y and each x are not too close to 1). The more informative model is logistic regression, for which

$$f(x) = \frac{e^{\hat{\alpha} + \hat{\beta}'x}}{1 + e^{\hat{\alpha} + \hat{\beta}'x}}.$$

The overall significance of a fitted model of this kind is measured by a value of $\chi^2$ ("deviance" in computer output from the statistical package GLIM), and it can be shown that in many practical cases $\chi^2 \approx mF$, where F is the F-ratio in an ordinary multiple regression of y on x. Thus $\hat{K}$ becomes
$$\hat{K} = 1 - \frac{m - 2}{\chi^2}.$$
Calibration relates to the probability of success rather than to the average value of y. A binary predictor is well calibrated if, over all cases in which $f(x) = \hat{p}$, say, the proportion of successful cases is in fact $\hat{p}$. In a large validation sample, this proportion will be expected to be

$$\tilde{p} = \frac{e^{\hat{\alpha} + \hat{K}\hat{\beta}'x}}{1 + e^{\hat{\alpha} + \hat{K}\hat{\beta}'x}}$$

for the same reasons as in the multiple regression case. Thus $\tilde{p}$ is the preshrunk form of the predictor, by analogy with $\tilde{y}$ above.
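A preshrunk logistic prediction can be sketched as follows. The function names and the coefficient values are hypothetical; the sketch assumes, as in the multiple regression discussion, that the x's are centered at zero so that shrinking the slopes pulls predictions toward the probability implied by the intercept alone.

```python
import math


def logistic(z):
    """Inverse logit: e^z / (1 + e^z)."""
    return 1 / (1 + math.exp(-z))


def logistic_shrinkage_factor(m, chi2):
    """K-hat = 1 - (m - 2) / chi2, using the approximation chi2 = m F."""
    return 1 - (m - 2) / chi2


def preshrunk_probability(alpha_hat, beta_hat, x, k_hat):
    """Logistic prediction with the linear predictor's slopes scaled by K-hat.

    x is assumed to be centered (mean zero in the construction data).
    """
    eta = sum(b * xi for b, xi in zip(beta_hat, x))
    return logistic(alpha_hat + k_hat * eta)


# Hypothetical fitted model with one centered predictor.
alpha_hat, beta_hat, x = -1.0, [2.0], [1.0]
p_hat = preshrunk_probability(alpha_hat, beta_hat, x, k_hat=1.0)    # unshrunk
p_tilde = preshrunk_probability(alpha_hat, beta_hat, x, k_hat=0.6)  # preshrunk
print(p_hat, p_tilde)
```

With $\hat{K} < 1$ the preshrunk probability always lies between the unshrunk prediction and the baseline probability implied by $\hat{\alpha}$, so extreme predictions are the ones moderated most.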
This is illustrated in a particular application to the problem of predicting the probability of absconding from open borstals, taking into account known social and criminological indicators (using data kindly made available by the Prison Department's Young Offender Psychology Unit, Home Office, England). Here y = 1 if the trainee absconded during sentence, y = 0 otherwise, and m = 22 predictive factors were studied. A logistic regression on n = 500 cases gives $\chi^2 = 50.2$ on 22 degrees of freedom, which is highly significant; $\hat{K}$ is 0.602. Calibration was examined by using a nonparametric smoothing method to plot the actual proportion of absconding cases, say p, against the predicted proportions $\hat{p}$ [$= f(x)$]; the method is from Copas (1983a). This is shown in Figure 1, in which both axes are on logistic scales. The calibration is satisfactory in the construction data, in that the plotted curve (labeled "construction data") is tolerably close to the diagonal line $p = \hat{p}$. A further set of 1,500 cases was then used as validation data and the plotting process repeated. The shrinkage is very marked (Figure 1); the plotted curve is much shallower than the diagonal (large p's are overestimated by $\hat{p}$, small p's underestimated). The use of $\tilde{p}$ instead of $\hat{p}$ is equivalent to retaining the graph with $\hat{p}$ as the horizontal coordinate, but replacing the diagonal line with a line of slope $\hat{K} = 0.602$, shown as the dashed line. The reasonable fit of the validation curve to the dashed line confirms that the validation calibration of $\tilde{p}$ is satisfactory.
The ordinary multiple correlation coefficient between y and x for these data is R = 0.322, whereas the validation correlation discussed above is $\tilde{R} = 0.194$. The substantial shrinkage has almost halved the correlation, the efficacy of the predictor on validation being extremely modest. This magnitude of the drop in correlation is not at all unusual in practice (e.g., Simon, 1971).
The multiple and logistic regression models discussed above are fixed models in the sense that the variables in x are fixed in advance. In practice, prediction equations are often simplified by using stepwise regression or some other procedure for subset selection; the variables in x are then selected using the data, and only those x's showing reasonably strong correlation with y are retained. The usual theory of least squares is, of course, completely upset by such selection. A recent discussion in the Journal of the Royal Statistical Society ( [...]
be used. The formula for shrinkage of the correlation coefficient is modified to

$$\tilde{R} = \frac{(n - 1)R^2 - m}{(n - 1 - m)R^2}\,R^*,$$
where $R^*$ is the multiple correlation between y and the selected x's, and as before, R is the corresponding correlation for all the x's.

Since many x's in the absconding study appeared to be of low predictive value, a subset of just four x's was chosen for the logistic regression, with $\chi^2 = 29.0$ on four degrees of freedom. If selection is ignored, this would give $\hat{K} = 0.931$ (indicating very little shrinkage). For the full logistic regression $\hat{K} = 0.602$, as before. The validation fit of the reduced regression is shown in Figure 2, which was constructed in the same way as Figure 1. As can be seen the shrinkage is consistent with $\hat{K} = 0.602$ and much greater than that implied by the value $\hat{K} = 0.931$.

FIGURE 1 Shrinkage for absconding study (full regression). Horizontal axis: logit $\hat{p}$. Source: Derived from data provided by Prison Department's Young Offender Psychology Unit, Home Office, England.

FIGURE 2 Shrinkage for absconding study (stepwise regression). Horizontal axis: logit $\hat{p}$. Source: Derived from data provided by Prison Department's Young Offender Psychology Unit, Home Office, England.

Shrinkage in the Light of Changes in the Population: Cause (b)

The theory expounded so far accommodates cause (a), the purely statistical effect, but assumes that there are no changes in the distribution of x [cf. (b)] or the response function [cf. (c)]. Neither assumption will be exactly true, although each will often hold to a reasonable approximation. In this section we discuss the effect of changes in the population (i.e., in the distribution of x) on the validation performance of predictors. We suppose that x has mean $m_1$ in the construction sample and mean $m_2$ in the validation sample, with the variance-covariance matrices $V_1$ and $V_2$ defined in an analogous way. We therefore wish to study the case in which $m_1 \neq m_2$ and/or $V_1 \neq V_2$.
A number of approaches are discussed
in turn, corresponding to various ways in
which changes in distribution can occur,
and to different ways in which the per-
formance of predictors can be assessed.
Some of these correspond to well-estab-
lished results in the statistical literature,
others to work in progress.
Wishart Variation. Perhaps the simplest case is to assume that the construction and validation samples are both sampled randomly from the same underlying population. The matrices $V_1$ and $V_2$ will then be independent samples from the same Wishart distribution indexed by the (unknown) true variance-covariance matrix. Similarly, $m_1$ and $m_2$ will be independent with identical multivariate normal distributions. It can be shown that the uniform improvement of the shrinkage predictor over least squares continues to hold in this more general setting, i.e.,

$$E(\tilde{y} - y)^{2} < E(\hat{y} - y)^{2}$$

where the expectation is over (y, x) in the validation sample, over the distribution of regression parameters, as well as over sampling variation in the m's and V's. The only requirement is, as before, that $m \geq 3$. Again, the improvement holds over all possible true regression parameters, no matter what are the underlying parameters of the population. Thus differences in samples caused by sampling variation only do not affect the shrinkage arguments put forward in the last section.
Mathematical Conditions for Uniform Improvement. The Wishart variation case suggests that if $m_1 - m_2$ and $V_1 - V_2$ are small, shrinkage theory is unaffected. To investigate what happens when these differences are larger, define the matrix

context of a particular study, as will be discussed in a later section of this paper.
Two particular applications lend themselves to the monitoring of screening. First, a simulation study can be undertaken in which the prediction equation is fitted to a random subset of the data, and the remaining cases are screened in the appropriate way to form the validation sample. The random sampling of the construction data is repeated a large number of times to obtain expected values of prediction mean squared error or other measures of predictive performance. The second method involves the bootstrap: both construction and validation data are artificially sampled with replacement from the complete set of available data. The method of screening under study is applied to the validation cases before the prediction equation is evaluated. Again, some detailed results are given in Jones and Copas (1985); the general conclusion is similar to that made earlier, namely, that a moderate degree of screening does not usually affect the advantages of the shrinkage correction.
Shrinkage Correction Adapted for a Change in Population. Comments so far in this section have concerned robustness, i.e., the study of how the preshrunk predictor $\tilde{y}$ performs in the light of changes in the distribution of x. If some particular change in population is envisaged, can the shrinkage correction be designed to take account of it? A reworking of the theory leading to the correction K, explained above, leads to
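$$K^{*} = 1 - \frac{(m-2)\,\hat{\sigma}^{2}\,\mathrm{tr}(V_1^{-1}V_2)}{m\,n\,\hat{\beta}'V_2\hat{\beta}}.$$

Note that K* = K if $V_1 = V_2$. The corresponding form of the preshrunk predictor is

$$\tilde{y}^{*} = \left(\hat{\alpha} + \hat{\beta}'(m_2 - m_1)\right)(1 - K^{*}) + K^{*}\hat{y}.$$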
Unfortunately, the sampling theory of K* and $\tilde{y}^{*}$ is very much more complicated than that of K and $\tilde{y}$, and optimum mean squared error properties have yet to be proved. Presumably, if $m_2 - m_1$ and $V_2 - V_1$ are both fairly small, the favorable properties of $\tilde{y}$ will continue to hold, but the situation for large population changes is less clear.
An Adaptive Formulation of Shrinkage Based on Cross-validation. A very different approach is reported in Copas (1984). Here none of the usual assumptions of linear regression is made (e.g., constant variance of residuals), but instead a shrinkage correction K** is estimated directly from the available construction data. Following the sample-reuse approach mentioned above, the sampling distribution of the empirical slope of y on $\hat{y}$ for randomly chosen subsets of the data is studied mathematically, and an asymptotic approximation to the expected shrinkage is thereby obtained. The form of this approximation is applied to the whole set of data, giving the nonparametric shrinkage correction K**. It is shown in Copas (1984) that, as expected, K** is equal to K if and only if the usual assumptions of the underlying model hold. The correction K** is most sensitive to heteroskedasticity of the residuals; K** can shrink more or less than K according to the particular observed pattern of model residuals. Case studies carried out using this new approach suggest that only exceptionally will K** differ markedly from K, and the validation properties of the corresponding nonparametric shrinkage predictor will often be rather similar to those of $\tilde{y}$.
Changes in the Regression Relationship: Cause (c)

It is obvious that if the relationship between y and the x's changes dramatically between construction and validation data, the shrinkage will be equally dramatic and nothing in the way of useful prediction will be possible. Conversely, minor changes in the coefficients $\alpha$ and $\beta_1, \beta_2, \ldots$ will result in only small changes in predictive performance, and $\tilde{y}$ can still be regarded as an adequate approximation. Little work has been done in studying the effects of changes of intermediate size. As in the discussion of cause (b) in the previous section, if something is known in advance about the likely changes, corresponding modifications to the prediction equation can be made (e.g., a 10 percent rise or fall in values of y is anticipated). However, such circumstances will occur rarely, if ever, and so this remains an open research problem.
Some Concluding Remarks
We conclude this discussion of valida-
tion and shrinkage with a few comments
that may help in formulating guidelines
on the choice of prediction equation in
any given application.
First, a simple method shrinks less than a complex one. (This can be seen in the above algebra by noting that the denominator of K exceeds the numerator by $m\sigma^{2}/n$ on average; this quantity increases as m, the number of variables in the equation, increases.) However, this is not so when a preshrinking correction is applied; provided the model and assumptions hold true, a preshrunk predictor is always approximately well calibrated. Thus the argument that a simple model (e.g., point scoring) is preferable to a more complicated one (e.g., multiple regression) because of shrinkage effects alone cannot be sustained. Proper statistical principles should be used in assessing the fit between a given model and the data; any shrinkage problems that arise are allowed for by preshrinking rather than by distorting the model being fitted.
Second, in selecting from among several x variables using a stepwise procedure, it is often supposed that a small subset is better than a large one because the smaller number of coefficients causes less shrinkage. In general this argument is false. As explained above, the empirical selection effect itself leads to an increase in shrinkage. Again, a larger subset, with appropriate preshrinking correction, is better than an artificially small set with its own shrinkage correction. Usually, however, very little is gained by the later variables entering a stepwise regression procedure and so on the grounds of simplicity, with little loss of efficacy, a sensible subset (with preshrinking) will nearly always be used in the final prediction equation. For example, in the absconding study mentioned above, there is little basis for choosing on statistical grounds between the fits with the total of 22 x's and with a subset of just 4 x's (Figures 1 and 2).
Third, caution is needed if a prediction
equation is to be applied outside the
range of the construction data. The new
theory of robustness to changes in the
distribution of the x's, outlined above,
suggests that modest changes can be tol-
erated within the framework of the same
preshrinking method. However, if very
marked changes are anticipated, or if er-
ratic changes in the model are likely to
occur, no prediction equation can be ex-
pected to work well. These circum-
stances are perhaps the only ones in
which oversimplified methods (e.g.,
Glueck) can be justified on the grounds of
robustness, but a clear formulation of
such properties would be difficult.
Fourth, a prediction equation is essentially a statement of conditional expectation: if the x's are such and such, then the expectation of y is estimated to be such and such. In reality no particular model is exactly correct, and so an argument that one set of x's is "right" and another is "wrong" has no logical basis. One can imagine values of the response variable (y) and the explanatory variables (x's) being distributed jointly in some space, with each subset of x's, and each particular model, providing a separate form of conditional expectation of y. Choosing a prediction equation involves choosing which conditional expectation is closest to the actual values of y (has least conditional variance), such a choice being made over whatever set of candidates is available. It may be that y is most closely correlated with an x that cannot actually be used in routine prediction, and so no subset containing such an x can be entertained. Typically, the best subsets or models will be ones that act as the best proxies to the prohibited x. Such equations may do less well than others involving the sensitive variable, but they cannot be discredited on statistical grounds alone.
Practical Utility
Predictive Power
Our starting point in this section is the
familiar "risk classification," which com-
pares predicted and actual outcomes.
This approach to assessing the utility of
different prediction instruments is com-
pletely different from (yet complemen-
tary to) that discussed in the previous
section.
Risk classes can be defined as the range of the predicted probability of some event (e.g., $k_1$ = 0 to < 0.1, $k_2$ = 0.1 to < 0.2, et cetera); as a score, such as the Salient Factor Score calculated in parole prediction research (D. M. Gottfredson, Wilkins, and Hoffman, 1978); or by some other classification, such as low-, medium-, and high-rate offenders, as in Greenwood's (1982) study of criminal careers. The example adopted here to illustrate and develop the discussion of predictive power is taken from Copas and Whiteley (1976) as it was subsequently used by Tarling (1982) to show the relationship between various measures.

Copas and Whiteley's aim was to construct a prediction instrument to evaluate the effects of therapeutic treatment at the Henderson Hospital. The criterion of success was taken to be no further admission to a psychiatric hospital or no further conviction for a criminal offense during the 2 to 3 years following release. Table 1 sets out the results for their construction and validation samples.
TABLE 1 Predicted Success and Observed Outcome, Construction and Validation Samples

Risk    Probability   Construction Sample           Validation Sample
Class   of Success    Success  Failure  Total       Success  Failure  Total
(ki)    (P)           (si)     (fi)     (ti)        (si)     (fi)     (ti)
k1      0 to .3          5       33       38           7       18       25
k2      .3 to .5         7       12       19          14       15       29
k3      .5 to .7        21       12       33          12        9       21
k4      .7 to 1.0       11        3       14           8        4       12
Total                 Ns = 44  Nf = 60  T = 104     Ns = 41  Nf = 46  T = 87

                      MCR = .57                     MCR = .28
                      P(A) = .78                    P(A) = .64
                      $\tau_c$ = -.55               $\tau_c$ = -.28
                      $\gamma$ = -.71               $\gamma$ = -.40

SOURCE: Copas and Whiteley (1976) data as used by Tarling (1982).

Several summary statistics have been proposed to measure the predictive power of this and similar risk classifications, in particular mean cost rating (MCR) (Duncan et al., 1953) and P(A), the area under the receiver operating characteristic curve in signal detection theory (Fergusson, Fifield, and Slater, 1977). However, as the risk classification in Table 1 can be regarded as an ordered contingency table, Kendall's rank correlation coefficient tau c, $\tau_c$ (Kendall, 1970), and Goodman and Kruskal's gamma, $\gamma$ (Goodman and Kruskal, 1963), can also be used to measure the degree of association. There is as yet no consensus about the measure to be adopted, but Tarling (1982) has in fact shown that all four measures are related because all are functions of the statistic S (where S = P - Q, where P is the number of "concordant pairs" and Q is the number of "discordant pairs").
Expressing each as a function of S and using the notation of Table 1, the four measures can be defined as:

$$\mathrm{MCR} = \frac{-S}{N_s N_f}$$

$$P(A) = \frac{1}{2} - \frac{S}{2N_s N_f}$$

$$\tau_c = \frac{4S}{T^{2}}$$

$$\gamma = \frac{S}{P + Q}.$$
Two advantages follow from knowing that all four measures are a function of S. First, by calculating S the calculation of all four measures is greatly simplified. Second, as the distribution of S has long been known, a test of the null hypothesis, E(S) = 0, is a test that prediction is no better than chance.

The measures $\tau_c$ and $\gamma$ have a further advantage over MCR and P(A) in that the variance of both can be estimated, thus permitting tests of alternative hypotheses and facilitating comparison of alternative prediction instruments or their respective power in the construction and validation samples. For $\tau_c$, however, only an upper bound to the variance is available, so only a conservative test for the difference of two observed values is possible. On the other hand, the exact value of the variance of $\gamma$ is available (Goodman and Kruskal, 1963), which permits a more powerful test. For this reason Tarling (1982) recommended that $\gamma$ should generally be preferred.
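As a check on these definitions, the four measures can be recomputed from the construction-sample counts in Table 1. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes that a concordant pair is one in which the case in the lower risk class succeeded and the case in the higher risk class failed, which is the convention that reproduces the negative signs of $\tau_c$ and $\gamma$ reported in Table 1.

```python
# Construction-sample counts from Table 1, ordered from risk class k1 to k4.
successes = [5, 7, 21, 11]
failures = [33, 12, 12, 3]


def pair_counts(s, f):
    """Return (P, Q) for a k x 2 ordered table with columns (success, failure).

    Assumed convention: a pair is concordant when the lower risk class holds
    the success and the higher risk class holds the failure.
    """
    k = len(s)
    concordant = sum(s[i] * f[j] for i in range(k) for j in range(i + 1, k))
    discordant = sum(f[i] * s[j] for i in range(k) for j in range(i + 1, k))
    return concordant, discordant


def association_measures(s, f):
    """Compute S and the four measures expressed as functions of S."""
    n_s, n_f = sum(s), sum(f)
    total = n_s + n_f
    p, q = pair_counts(s, f)
    big_s = p - q
    return {
        "S": big_s,
        "MCR": -big_s / (n_s * n_f),
        "P(A)": 0.5 - big_s / (2 * n_s * n_f),
        "tau_c": 4 * big_s / total**2,
        "gamma": big_s / (p + q),
    }


measures = association_measures(successes, failures)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the construction-sample values agree with those printed in Table 1 (MCR = .57, P(A) = .78, $\tau_c$ = -.55, $\gamma$ = -.71).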
Prediction Errors

The four measures discussed above are still only indicators of overall fit and give only an indirect assessment of how a prediction instrument will perform in practice. It is essential, therefore, to calculate the number or proportion of correct and incorrect predictions that would result from the application of any rule.

Given the discussion of overfitting and shrinkage in the previous section, estimates should be derived from a validation sample. Before applying the Copas and Whiteley instrument to identify likely successes, a cutoff point must be chosen. From the risk classification, as it is presented above, there are three possible cutoff points: all subjects with a predicted probability of success of .7 or above; all those with a predicted probability of .5 or above; and all those with a predicted probability of .3 or above.

Figure 3 shows, for each cutoff point in the validation sample, the following:

1. the number of true positives (TP), that is, the number of subjects predicted to succeed who did in fact succeed;
2. the number of false positives (FP), that is, the number of subjects predicted to succeed who in fact failed;
3. the number of false negatives (FN), that is, the number of subjects predicted to fail who in fact succeeded; and
4. the number of true negatives (TN), that is, the number of subjects predicted to fail who did in fact fail.
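The four counts can be obtained directly from the validation half of Table 1 by summing the risk classes on either side of the cutoff. The following Python sketch illustrates this; the function and variable names are hypothetical, and each risk class is represented by the lower bound of its probability range.

```python
# Validation-sample counts from Table 1, risk classes k1..k4
# (lowest to highest predicted probability of success).
successes = [7, 14, 12, 8]
failures = [18, 15, 9, 4]
lower_bounds = [0.0, 0.3, 0.5, 0.7]  # lower end of each class's probability range


def outcome_counts(cutoff):
    """Classes at or above the cutoff are predicted to succeed, the rest to fail."""
    tp = fp = fn = tn = 0
    for low, s, f in zip(lower_bounds, successes, failures):
        if low >= cutoff:
            tp += s  # predicted success, actually succeeded
            fp += f  # predicted success, actually failed
        else:
            fn += s  # predicted failure, actually succeeded
            tn += f  # predicted failure, actually failed
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


tables = {cutoff: outcome_counts(cutoff) for cutoff in (0.7, 0.5, 0.3)}
for cutoff, counts in tables.items():
    print(cutoff, counts)
```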
A: Cutoff point .7 and above
                             Actual Outcome
Predicted Outcome        Success      Failure
Success                  TP = 8       FP = 4       NPs = 12
Failure                  FN = 33      TN = 42      NPf = 75
                         Ns = 41      Nf = 46
Base rate = .471         Selection ratio = .138

B: Cutoff point .5 and above
                             Actual Outcome
Predicted Outcome        Success      Failure
Success                  TP = 20      FP = 13      NPs = 33
Failure                  FN = 21      TN = 33      NPf = 54
                         Ns = 41      Nf = 46
Base rate = .471         Selection ratio = .379

C: Cutoff point .3 and above
                             Actual Outcome
Predicted Outcome        Success      Failure
Success                  TP = 34      FP = 28      NPs = 62
Failure                  FN = 7       TN = 18      NPf = 25
                         Ns = 41      Nf = 46
Base rate = .471         Selection ratio = .713

FIGURE 3 Correct predictions and errors for each cutoff point.

The two marginal distributions of these tables are usually defined as the base rate and the selection ratio. The base rate (or the prevalence or the incidence) is the proportion of the sample that actually succeeded. It can be seen that this is the same for all three cutoff points (i.e., 47.1 percent). The second marginal distribution, the selection ratio, is the proportion of the sample predicted to succeed. It can be seen that the selection ratio changes depending on the cutoff point: it is 13.8 percent when the cutoff point is set at .7 and above, 37.9 percent when the cutoff point is set at .5 and above, and 71.3 percent when the cutoff point is set at .3 and above.
Defining the base rate and the selection ratio in terms of the four outcomes:

$$\text{Base rate, } BR = \frac{TP + FN}{T} \quad\text{and}\quad (1 - BR) = \frac{FP + TN}{T}$$

$$\text{Selection ratio, } SR = \frac{TP + FP}{T} \quad\text{and}\quad (1 - SR) = \frac{FN + TN}{T}$$

where T = total sample = TP + FP + FN + TN.
Considering the relationship between the base rate and the selection ratio reveals several interesting properties. When the selection ratio is larger than the base rate, false positives exceed false negatives; conversely, when the base rate is larger than the selection ratio, false negatives exceed false positives. When the base rate equals the selection ratio, the number of false positives and false negatives is the same. Furthermore, when both the base rate and the selection ratio equal .5, prediction becomes most accurate and results in fewest total errors (FP + FN). However, when the base rate (which is fixed) is not .5, as is often the case in practice, total errors are minimized when the selection ratio is set to equal the base rate. These phenomena are revealed in Figure 3 and can be used to guide the choice of the appropriate cutoff point.
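These properties can be checked against the counts in Figure 3. The short Python sketch below (names are hypothetical) computes the base rate, the selection ratio, and the total errors for each cutoff point.

```python
# Counts for each cutoff point, as given in Figure 3 (validation sample).
figure3 = {
    0.7: {"TP": 8, "FP": 4, "FN": 33, "TN": 42},
    0.5: {"TP": 20, "FP": 13, "FN": 21, "TN": 33},
    0.3: {"TP": 34, "FP": 28, "FN": 7, "TN": 18},
}


def summarize(counts):
    """Base rate, selection ratio, and total errors for one 2 x 2 table."""
    total = sum(counts.values())
    return {
        "BR": (counts["TP"] + counts["FN"]) / total,
        "SR": (counts["TP"] + counts["FP"]) / total,
        "errors": counts["FP"] + counts["FN"],
    }


summary = {cutoff: summarize(counts) for cutoff, counts in figure3.items()}
for cutoff, row in summary.items():
    print(f"cutoff {cutoff}: BR={row['BR']:.3f} SR={row['SR']:.3f} errors={row['errors']}")
```

The total errors are 37, 34, and 35 for cutoff points .7, .5, and .3, respectively; the smallest total occurs at the .5 cutoff, whose selection ratio (.379) is the closest of the three to the base rate (.471).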
Dunn (1981) sets out the various measures that can be derived from the kind of information presented in Figure 3, for example, sensitivity and specificity, but they are not discussed in any detail here.

Loeber and Dishion (1982, 1983) also discuss the significance of the base rate and the selection ratio. They point out that the base rate and the selection ratio determine the maximum number of correct predictions that could be achieved by the prediction instrument but, further, that a certain number of correct predictions could be expected by chance alone. Loeber and Dishion therefore propose a measure, relative improvement over chance (RIOC), which attempts to assess how an instrument performs relative to its expected performance and its best possible performance given the base rate and the selection ratio.

They define RIOC as:

$$\mathrm{RIOC} = \frac{AC - RC}{MC - RC}$$

where AC = actual number of correct predictions, RC = randomly expected number of correct predictions, and MC = maximum possible number of correct predictions. In the notation of Figure 3 it can be seen that

$$AC = TP + TN$$

$$RC = \frac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{T}$$

$$MC = TN + TP + 2\min(FN, FP).$$
Substituting for AC, RC, and MC in the above equation, RIOC reduces to:

$$\mathrm{RIOC} = \frac{TP \cdot TN - FP \cdot FN}{[TP + \min(FN,FP)][TN + \min(FN,FP)]}$$

From the relationships presented earlier, RIOC can also be expressed in terms of the base rate and the selection ratio. Substituting in the denominator, RIOC reduces to:

$$\mathrm{RIOC} = \frac{TP \cdot TN - FP \cdot FN}{T^{2}[\min(BR,SR) - BR \cdot SR]}$$
A commonly used measure of association for 2 x 2 classifications such as Figure 3 is $\phi$, which is the product moment correlation coefficient for dichotomous variables. In the notation of Figure 3,

$$\phi = \frac{TP \cdot TN - FP \cdot FN}{[(TP + FP)(TP + FN)(FP + TN)(FN + TN)]^{1/2}}$$

Expressing the denominator in terms of BR and SR, $\phi$ reduces to:

$$\phi = \frac{TP \cdot TN - FP \cdot FN}{T^{2}(BR \cdot SR - BR \cdot SR^{2} - BR^{2} \cdot SR + BR^{2} \cdot SR^{2})^{1/2}}$$
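To see that the definitional and reduced forms of RIOC agree, and to compare RIOC with $\phi$, both can be evaluated for one of the tables in Figure 3. The Python sketch below uses the counts for the .5 cutoff point; the function names are hypothetical, and the code assumes that no margin of the table is zero (otherwise the denominators vanish).

```python
from math import isclose, sqrt


def rioc_from_definition(tp, fp, fn, tn):
    """RIOC = (AC - RC) / (MC - RC), with AC, RC, MC as defined in the text."""
    total = tp + fp + fn + tn
    ac = tp + tn
    rc = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total
    mc = tn + tp + 2 * min(fn, fp)
    return (ac - rc) / (mc - rc)


def rioc_reduced(tp, fp, fn, tn):
    """Reduced form: cross-product difference over its maximum given the margins."""
    smaller_error = min(fn, fp)
    return (tp * tn - fp * fn) / ((tp + smaller_error) * (tn + smaller_error))


def phi(tp, fp, fn, tn):
    """Product moment correlation for a 2 x 2 table."""
    return (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (fp + tn) * (fn + tn))


# Figure 3, panel B (cutoff point .5 and above).
tp, fp, fn, tn = 20, 13, 21, 33
rioc_a = rioc_from_definition(tp, fp, fn, tn)
rioc_b = rioc_reduced(tp, fp, fn, tn)
phi_b = phi(tp, fp, fn, tn)
print(f"RIOC (definition) = {rioc_a:.3f}")
print(f"RIOC (reduced)    = {rioc_b:.3f}")
print(f"phi               = {phi_b:.3f}")
```

Both forms of RIOC share the numerator of $\phi$ and differ from it only in the denominator, so the two indices always have the same sign.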
The relationship between RIOC and
say, rather than to minimize total errors, we could have used the approach outlined there to guide our choice of cutoff point. However, decision theory provides a more direct framework for taking into account the weights to be attached to different types of outcome. Although the decision-theory approach has been widely advocated in criminological applications (e.g., Loeber and Dishion, 1983), it has not been used to any great extent, except by Blumstein, Farrington, and Moitra (1985). While it is outside the scope of this paper to discuss decision theory in any detail, we would recommend that more attention be paid to it in prediction research, especially when the results are to be applied in practice.
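A very simple illustration of attaching weights to the two types of error is sketched below in Python, using the counts from Figure 3. The cost values are purely hypothetical and are not taken from any study; the sketch shows only how the preferred cutoff point can move once false positives and false negatives are weighted unequally.

```python
# Counts for each cutoff point, as given in Figure 3 (validation sample).
figure3 = {
    0.7: {"TP": 8, "FP": 4, "FN": 33, "TN": 42},
    0.5: {"TP": 20, "FP": 13, "FN": 21, "TN": 33},
    0.3: {"TP": 34, "FP": 28, "FN": 7, "TN": 18},
}


def best_cutoff(tables, cost_fp, cost_fn):
    """Return the cutoff with the lowest weighted error cost, plus all costs.

    cost_fp and cost_fn are hypothetical weights for a false positive and a
    false negative; correct predictions are assumed to cost nothing.
    """
    costs = {
        cutoff: cost_fp * counts["FP"] + cost_fn * counts["FN"]
        for cutoff, counts in tables.items()
    }
    return min(costs, key=costs.get), costs


equal_choice, equal_costs = best_cutoff(figure3, cost_fp=1, cost_fn=1)
fp_heavy_choice, fp_heavy_costs = best_cutoff(figure3, cost_fp=3, cost_fn=1)
fn_heavy_choice, fn_heavy_costs = best_cutoff(figure3, cost_fp=1, cost_fn=3)
print("equal weights:", equal_choice, equal_costs)
print("false positives weighted 3x:", fp_heavy_choice, fp_heavy_costs)
print("false negatives weighted 3x:", fn_heavy_choice, fn_heavy_costs)
```

With equal weights the .5 cutoff point is chosen; weighting false positives three times as heavily moves the choice to .7, and weighting false negatives three times as heavily moves it to .3.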
SAMPLE-REUSE METHODS
Previous sections of the paper have stressed the distinction between retrospective fit and prospective (or validation) fit of a prediction instrument. A simple way of carrying out a prospective validation, and the one most commonly used in criminology, is the split-half method, which divides the data into two halves (at random). The equation is fitted to the first half (the construction sample) and tested on the second (the validation sample). Although unbiased estimates of shrinkage and error rates result from this method, there are two obvious disadvantages: (a) construction of the prediction instrument does not use all available information, but only half the sample, and (b) the comparability of the two subsamples will always be open to doubt; for example, there is a 1-in-20 chance that the two subsamples will be significantly different at the 5 percent level. Various techniques have been developed in the statistical literature to overcome these two problems. The principle underlying them is to generate many subsamples rather than merely two.

The first, simple extension of the principle is cross-validation, of which the split-half method is merely a special case. To construct and validate the prediction instrument, the sample need not be split in half but could, instead, be split in many different ways; for example, 80 percent of the sample could be used for the construction sample and the remaining 20 percent could form the validation sample. Moreover, any number of construction and validation subsamples could be drawn. The jackknife and the bootstrap techniques are more formal developments of this latter idea. The jackknife (see, for example, R. G. Miller, 1974), or "hold-one-out," proceeds as follows. Suppose the sample has N members; delete one member and develop the prediction instrument on the remaining N - 1 and use it to predict y for the missing member. The procedure is repeated N times, a different member being omitted each time. By this means a set of independent values of y and $\hat{y}$ are obtained, and shrinkage and error rates can be calculated using the methods presented earlier as if these values related to a completely new sample of N cases.²

The bootstrap technique (Efron, 1982) proceeds slightly differently. If sampling with replacement is permitted, a large number of samples of size N can be drawn, 2N as opposed to only N by the jackknife procedure. The bootstrap replications can be used to assess the prediction instrument. The method is illustrated by an example given in Efron and Gong (1981, 1983) that is analogous to many criminological prediction studies. Efron and Gong were concerned to construct an instrument to predict whether patients suffering from acute hepatitis would survive or die. There were 155 patients in the sample, 33 of whom died. There were 19 independent variables available for analysis. A prediction instrument was developed in the usual way. First, only x variables associated at the 5 percent level were retained; this left 13 variables. Second, a kind of forward, stepwise, multiple-logistic-regression program was used, stopping when no additional variable achieved the 5 percent significance level. Four of the 13 variables were included in the final prediction instrument. The cutoff point c was set at c = log 33/122. Full information was available for 133 of the original 155 patients. When the prediction instrument was applied to the 133 patients, 21 were misclassified, giving an error rate of 21/133 = .158. The bootstrap technique was then used to assess how overoptimistic this error rate was or how much it could be expected to shrink. Five hundred bootstrap samples were drawn and the same procedure was used to construct a prediction instrument. On each occasion the "overoptimism random variable," R', was calculated, which is merely "the error rate for the bootstrap replication minus .158." The 500 values of R' were plotted and the mean of R' was found to be .045, which suggests that the expected overoptimism is about one-third as large as the apparent error rate .158. This gives the bias-corrected estimated error rate .158 + .045 = .203. In addition, the standard deviation of R' was .036. Another advantage of the bootstrap technique is illustrated by this example. At each replication a check was made of the variables included in the prediction instrument and this revealed, for example, that one variable was selected 37 percent of the time, another 59 percent of the time, and so on, giving an intuitive, if not theoretically rigorous, indication of the importance of the various predictor variables.

²These ideas can be extended to other problems relevant to the construction of prediction instruments; Mabbett, Stone, and Washbrook (1980), for instance, consider the stepwise choice of variables in forming a binary predictor.
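The mechanics of the bootstrap correction can be sketched in a few lines of Python. The example below is not a reconstruction of the Efron and Gong analysis: the data are synthetic, a toy one-variable threshold rule stands in for their stepwise logistic regression, and all names are hypothetical. It also assumes the usual form of the overoptimism calculation, in which the rule refitted to each bootstrap sample is scored both on the original data and on the bootstrap sample itself, and the difference is averaged.

```python
import random


def fit_rule(cases):
    """Fit a toy rule: predict 1 when x exceeds the midpoint of the two class means."""
    ones = [x for x, y in cases if y == 1]
    zeros = [x for x, y in cases if y == 0]
    if not ones:   # resample contains no 1s: never predict 1
        return float("inf")
    if not zeros:  # resample contains no 0s: always predict 1
        return float("-inf")
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2


def error_rate(threshold, cases):
    """Proportion of cases misclassified by the rule 'predict 1 if x > threshold'."""
    return sum((x > threshold) != (y == 1) for x, y in cases) / len(cases)


def mean_overoptimism(cases, n_boot=500, seed=0):
    """Average of (error on original data) - (apparent error on bootstrap sample)."""
    rng = random.Random(seed)
    n = len(cases)
    total = 0.0
    for _ in range(n_boot):
        boot = [cases[rng.randrange(n)] for _ in range(n)]  # sample with replacement
        rule = fit_rule(boot)
        total += error_rate(rule, cases) - error_rate(rule, boot)
    return total / n_boot


# Synthetic data (an assumption for illustration): 60 cases with y = 0, 40 with y = 1.
data_rng = random.Random(1)
outcomes = [0] * 60 + [1] * 40
cases = [(data_rng.gauss(1.0 if y == 1 else 0.0, 1.0), y) for y in outcomes]

apparent = error_rate(fit_rule(cases), cases)
optimism = mean_overoptimism(cases)
corrected = apparent + optimism
print(f"apparent {apparent:.3f}, overoptimism {optimism:.3f}, corrected {corrected:.3f}")
```

With a rule as simple as this one the estimated overoptimism will usually be small; a procedure that selects variables, as in the hepatitis example, would be expected to show more.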
Technical details of sample-reuse methods are given in Efron (1982), and simplified descriptions appear in Diaconis and Efron (1983) and Efron and Gong (1983). Comparing and contrasting the various methods, split-half or cross-validation methods are the simplest to perform but have certain limitations. The advent of computer power and the increasing availability of appropriate algorithms make the jackknife and the bootstrap methods more attractive and relatively easy to use. The jackknife and the bootstrap are in fact theoretically closely related: the jackknife is almost a bootstrap itself. The bootstrap is entirely nonparametric and is, therefore, more flexible. Efron (1982) suggests that the jackknife performs less well than the bootstrap in situations that he has investigated but it requires less computation. The close relation between sample-reuse methods and Copas's theory of shrinkage and validation was discussed earlier.
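The hold-one-out procedure described above is equally easy to sketch. The Python example below uses synthetic data and a one-variable least squares line in place of a real prediction instrument; the names are hypothetical. It compares the retrospective mean squared error (fit and test on all N cases) with the prospective one obtained by predicting each case from the other N - 1.

```python
import random


def fit_line(points):
    """Least squares intercept and slope for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def hold_one_out(points):
    """Predict each y from a line fitted to the remaining N - 1 cases."""
    predictions = []
    for i, (x, _) in enumerate(points):
        intercept, slope = fit_line(points[:i] + points[i + 1:])
        predictions.append(intercept + slope * x)
    return predictions


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Synthetic data (an assumption for illustration): y = 0.5 x + noise, N = 30.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(30)]
points = [(x, 0.5 * x + rng.gauss(0.0, 1.0)) for x in xs]
ys = [y for _, y in points]

intercept, slope = fit_line(points)
retrospective = mean_squared_error(ys, [intercept + slope * x for x, _ in points])
prospective = mean_squared_error(ys, hold_one_out(points))
print(f"retrospective MSE {retrospective:.3f}, hold-one-out MSE {prospective:.3f}")
```

For a least squares fit the hold-one-out error can never be smaller than the retrospective error, which is the shrinkage effect in miniature.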
CONCLUSIONS
At the beginning of this paper we showed how simple point-scoring methods could be incorporated within the framework of general linear models, along with regression, logistic regression, and log-linear models. In addition, we noted that point-scoring methods, reconceptualized in the way we suggest, permit certain extensions that have been found useful in medical diagnosis.
It has long been recognized and empirically demonstrated that a prediction instrument developed on one sample will perform less well when applied to a subsequent sample. The phenomenon of shrinkage has recently been subjected to rigorous theoretical investigation, which we outlined. The findings stemming from this work enable the researcher to understand and anticipate the degree of shrinkage that can be expected in any study and, where necessary, to make any adjustments to (or preshrink) the prediction equation.
To examine shrinkage in practice, researchers have tended to use split-half subsamples. We pointed out the range of other and superior "sample-reuse" methods, including the jackknife and the bootstrap.

The usefulness of a prediction instrument can also be gauged by the number of errors and correct decisions that result from its application. We pointed out the similarity between many of the indices that have been proposed to assess the utility of a risk classification. In addition, we showed the importance of the base rate and the selection ratio in determining false-positive and false-negative errors and how the selection ratio can be set to alter the balance between the two.
When predicting rare events it may be
the case that any prediction instrument
will not improve significantly over the
base rate. For example, a prediction
instrument developed to identify "dangerous"
offenders may result in more errors
than occur by merely classifying all of-
fenders as not dangerous. This has led
some commentators to eschew attempts
to predict these kinds of events. An anal-
ogous situation occurs in medical science,
where mass-screening programs are
costly and may result in large false-pos-
itive errors, causing considerable stress,
but where they are nevertheless consid-
ered to be worthwhile to detect the small
number of true positives who actually
have the rare disease. Therefore, the
worth of any prediction instrument de-
pends on the values to be attached to the
various outcomes emanating from its ap-
plication, not simply on the total number
of errors that may accrue. Decision theory
provides a framework for making these
assessments and could be used more
widely in prediction in criminology.
REFERENCES
Blumstein, A., Farrington, D. P., and Moitra, S.
1985 Delinquent careers: innocents, desisters and
persisters. Pp. 187-219 in M. Tonry and N.
Morris, eds., Crime and Justice. Vol. 6. Chi-
cago, Ill.: University of Chicago Press.
Bottoms, A. E., and McClintock, F. H.
1973 Criminals Coming of Age. London, England: Heinemann.
Brown, P. J., and Zidek, J. V.
1980 Adaptive multivariate ridge regression. An-
nals of Statistics 8:64-74.
Burgess, E. W.
1928 Factors determining success or failure on
parole. In A. A. Bruce, A. J. Harno, E. W.
Burgess, and J. Landesco, eds., The Work-
ings of the Indeterminate-Sentence Law and
the Parole System in Illinois. Springfield,
Ill.: Illinois State Board of Parole.
Copas, J. B.
1983a Plotting p against x. Applied Statistics 32:25-31.
1983b Regression, prediction and shrinkage (with discussion). Journal of the Royal Statistical Society, Series B 45:311-354.
1984 Cross-validation Shrinkage of Regression Predictors. Research Report, Department of Statistics. Birmingham, England: University of Birmingham.
Copas, J. B., and Whiteley, J. S.
1976 Predicting success in the treatment of psychopaths. British Journal of Psychiatry 129:388-392.
Diaconis, P., and Efron, B.
1983 Computer-intensive methods in statistics. Scientific American 248(5):9~108.
Duncan, O. D., Ohlin, L. E., Reiss, A. J., and Stanton, H. R.
1953 Formal devices for making selection decisions. American Journal of Sociology 58:573-584.
Dunn, C. S.
1981 Prediction problems and decision logic in longitudinal studies of delinquency. Criminal Justice and Behavior 8:439-476.
Efron, B.
1982 The Jackknife, the Bootstrap and Other
Resampling Plans. Philadelphia, Pa.: Society
for Industrial and Applied Mathematics.
Efron, B., and Gong, G.
1981 Statistical Theory and the Computer. Un-
published manuscript. Department of Statis-
tics, Stanford University, Calif.
1983 A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician 37(1):36-48.
Farrington, D. P., and Tarling, R., eds.
1985 Prediction in Criminology. Albany, N.Y.:
SUNY Press.
Fergusson, D. M., Fifield, J. K., and Slater, S. W.
1977 Signal detectability theory and the evaluation of prediction tables. Journal of Research in Crime and Delinquency 14:237-246.
Fielding, A.
1979 Binary segmentation. In C. A. O'Muirchear-
taigh and C. Payne, eds., Exploring Data
Structure. Vol. 1 of The Analysis of Survey
Data. London, England: John Wiley.
Glueck, S., and Glueck, E. T.
1950 Unraveling Juvenile Delinquency.
Cambridge, Mass.: Harvard University Press.
Goodman, L. A., and Kruskal, W. H.
1963 Measures of association for cross classifica-
tions III. Journal of the American Statistical
Association 58:310-364.
Gottfredson, D. M., Wilkins, L. T., and Hoffman,
P. B.
1978 Guidelines for Parole and Sentencing.
Lexington, Mass.: Lexington Books.
Gottfredson, S. D., and Gottfredson, D. M.
1985 Screening for risk among parolees. Pp. 54-77
in D. P. Farrington and R. Tarling, eds.,
Prediction in Criminology. Albany,
N.Y.: SUNY Press.
Greenwood, P. W.
1982 Selective Incapacitation. Santa Monica,
Calif.: Rand Corporation.
Jones, M. C., and Copas, J. B.
1985 On the Robustness of Shrinkage Predictors
in Regression: Exemplifying and Using the
Theory. Research report. Department of
Statistics, University of Birmingham, England.
In press On the Robustness of Shrinkage Predictors
in Regression: Some Theoretical
Considerations. Journal of the Royal Statistical
Society, Series B 48.
Kendall, M. G.
1970 Rank Correlation Methods. London: Griffin.
Loeber, R., and Dishion, T. J.
1982 Strategies for Identifying At-Risk Youths.
Unpublished report. Oregon Social Learning
Center, Eugene.
1983 Early predictors of male delinquency: a re-
view. Psychological Bulletin 94:68-99.
Mabbett, A., Stone, M., and Washbrook, J.
1980 Cross-validatory selection of binary variables
in differential diagnosis. Applied Statistics
29:198-204.
Miller, A. J.
1984 Selection of subsets of regression variables
(with discussion). Journal of the Royal
Statistical Society, Series A 147:389-425.
Miller, R. G.
1974 The jackknife - a review. Biometrika 61(1):
1-15.
Nuttall, C. P., et al.
1977 Parole in England and Wales. Home Office
Research Study No. 38. London, England:
Her Majesty's Stationery Office.
Simon, F. H.
1971 Prediction Methods in Criminology. Home
Office Research Study No. 7. London, En-
gland: Her Majesty's Stationery Office.
Tarling, R.
1982 Comparison of measures of predictive
power. Educational and Psychological Mea-
surement 42:479-487.
Tarling, R., and Perry, J. A.
1985 Statistical methods in criminological
prediction. Pp. 210-231 in D. P. Farrington and R.
Tarling, eds., Prediction in Criminology.
Albany, N.Y.: SUNY Press.
Titterington, D. M., Murray, G. D., Murray, L. S.,
Spiegelhalter, D. J., Skene, A. M., Habbema,
J. D. F., and Gelpke, G. J.
1981 Comparison of discrimination techniques
applied to a complex data set of head injured
patients. Journal of the Royal Statistical
Society, Series A 144:145-175.
Wilbanks, W. L.
1985 Predicting failures on parole. Pp. 78-94 in
D. P. Farrington and R. Tarling, eds., Predic-
tion in Criminology. Albany, N.Y.: SUNY
Press.