| ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 38
CHAPTER 3
THEORY
3.1 induction
This chapter emphasizes certain theoretical statistical
aspects of the techniques of disc~minant and cluster analyses dis-
cussed in Chapter 2. Section 3.2 written by R. A. Olshen, pertains
to discriminant analysis while § 3.3, written by J. A. Hartigan, is
concerned with the theory of clustering algorithms.
3.2 Theoretical Issues in Disenminant Analysis
3.2.t Introduction
The questions and techniques which are addressed in this
chapter are quite simple to state, but are rich in areas of applica-
tion. Problems of computer aided diagnosis in medicine, military
surveillance, and speech recognition, for example, can sometimes
be formulated in a common way. Observations are available from
a source that belongs to a unique population among a set of popu-
lations. For example, the observations might be gathered by radar
as a plane flies over a ship at sea. The ship is the source of the
data. It is assumed that the ship is one (the unique population) of
five ship types (the set of populations), and the classification ques-
tion is which one. Such data are often termed "test" data or "test
samples". The set of candidate populations is assumed to be finite.
Indeed, two is perhaps the most popular number. The values of a
set of features -- that is, covariates, or "independent" variables
such as the radar measurements on the ship -- are available for
each unit to be classified, and it is on the basis of these data that
assignments are made. In problems which are our focus here,
38
OCR for page 39
something is assigned known about the conditional distributions of
features given population (or perhaps more commonly "class")
membership. These distributions are generally known, or
assumed to be known from some previous experience -- in which
case the problem of class assignment can be rather simple -- or
learned from other data, a "yearnings or "training,' sample. Prior
probabilities of class membership and costs of misclassiltying candi-
date observations are implicit to most schemes for "classification"
or "discrimination". They will be made explicit in what follows.
But before that beginning of our more technical discussion, we
draw the principal distinction between the topic of this section and
the notion of "classification" which is usually associated with clus-
tering. Namely, here it is assumed without question that the
classes are well defined. Thus, what is discussed here applies to
the assignment of a cancer patient to one of several recognized
stages of illness on the basis of some data, and decidedly not to the
question of how many stages it makes sense to ascribe to the
cancer itself. Solution to the latter question, however fundamental
to science and technology, is a requisite preprocessing to the tasks
we confront here. Also, we will have little to say regarding the
fundamental question of feature selection, which has been so
prominent a part of recent literature on regression (see, e.g., Shi-
bata, 1981~. On the other hand, the probability distribution of the
data within a given class will not be assumed to be known in most
of our discussion. In fact, it will be evident that to do effective
discrimination one need not know these distributions -- only one
aspect of the rank ordering of certain linear combinations of their
(generalized) densities.
The formulation of "discrimination" which follows is adapted
from the recent book by Breiman et al. (19841. Suppose that the
variable Y can assume any Integer value between, say, ~ and
so and that the value of Y is unknown; Y represents "class".
Mote: in Chapter 2 the total number of cIas#es or gTOUpS was
denoted as G instead of J. ~ But suppose that features ~ of Y are
observed or are otherwise available. On the basis of ~ we wish to
infer Y. Assume further that there are J densities ff ~ Y =j)
with respect to some dominating measure ~ on a space X which is
the range of X. We use a dot to indicate the argument of the
39
OCR for page 40
conditional density f, and a vertical bar to denote "given" or "given
that". Generally #peaking, X can be taken to be Euclidean, but it
is the decided exception in practice for all the f 's to be absolutely
continuous (with respect to Lebe#gue measure) #ince discrete
features are common in applications. We denote by
~=P(Y=j) the "prior" probability that an observation whose
cias# membership is unknown is of clam j. While the use of prior
probabilities can be controversial in other settings, it seems
difficult to formulate discrimination in a satisfactory way without
them Also, in the present context they often have a compelling
frequentistic basis Applications of the technologies under ~iS~#-
sion are often in the content of new data like those of an existent
and at least moderately well-understood database
A more controversial aspect of our formulation is the set of
numbers C(i ~ j), the cost of classifying an observation to class i
given it is of class j It is typically convenient to take C (i ~ i ~ = 0
and C(i ~j)>0 for i 76j Consi~ler the problem of classifying a
patient who enters an emergency room with a complaint of chest
pain as to whether he has suffered a heart attack The patient
who desires that the most extensive medical resources be made
available to him just in case he has actually had a heart attack
may have C 's that are very different from those of the director of a
coronary care unit, who must make careful and responsible alloca-
tions of scarce, expensive resources (See Breiman et al, 1984, pp
176 and 177 ~
A (nonrandomized, measurable) decision rule ~ - such rules
are all that we need consider in discrimination - arises Tom ~ par-
tition of ~ into disjoint subsets If 2[ belongs to the JO of these, we
decide that Y = j, that is, d (2~) = j
The expected cost of the rule ~ is
J - J
~ ~) ~ C(i I j~(d(~=i 1~=i'
j=1 i=1
J
'=1 '~ ~ li) veldt) i} f(X I Y-j)~)
40
(18)
OCR for page 41
A rule SIB iS called a "Bayes ~e" if in cost is as small as ~-
sible. All Bayes rules have the same elected cost. One concludes
thatif d~x)=i implies
J
~ ~)C(i lit) f(x I Y =j)
j=1
~ ~'~(i'lj)f(xlY=j)
j=1
for all i', then ~ is a Bayes rule. Specialize to the case A = 2 to see
that for ~ almost all x , dB (X ~ = ~ implies
f(x ~ Y = 1) > ~(2)C(1 ~ 2)
-
f (x 1 Y = 2) - 7~(1)C(2 1 1)
.
(19)
So obviously we "know" a Bayer rue if we know the "densities"
f ( ~ Y = 1) and f ( ~ Y = 2). And this observation is precisely why
discrimination or classification as pursued here is a subject of seri-
ous inquiry. Imagine, as is the case in many applications, that 2[
is, say, twenty dimensional, and that we have available learning
samples of, say, hundreds or even thousands of observations from
f ~ ~ Y = 1) and f ~ ~ Y = 2), which are not assumed to be of any
particular functional form. Think of the crudest partitioning of the
axes of ~ into two parts each; the resulting product partition has a
total of 220 bins, or more than a million. Most will have no
members of the learning sample at all. The estimation of the den-
sities is thus hopeless. But from (19) it is clear that we need not
know the densities exactly. We only need to know when their ratio
exceeds a specifiM constant. Fortunately, it is far easier to know
when that occurs than it is to know the densities themselves.
One needs a benchmark from which to gauge the perfor-
mance of any classifier d; an obvious candidate is the no data
Bayes rule. Such a rule assigns every observation to a class i for
which
41
OCR for page 42
J
~ ~)C(i Ij)
j=1
is minimized. In case of ties, the usual convention is to take the
smallest minimizing index.
We have discussed leafing samples in casual terms, but by
so doing we obscure what can be a troubling distinction in theoreti-
cal work and an important difference between prospective and
retrospective studies. In the case of the former, one generally
assumes that the learning sample is of the form
(~1, YI) ,.~., CON, YN), where the pairs are independent and
independent of (2[, Y), and each (2[i, Yi) is distributed as (8, Y);
N is assumed to be a nonrandom constant. In retrospective stu-
dies it often happens that an existing database is searched for
preassigned numbers of pairs in each of the J classes. Then the
assumption of unconditional independence is not reasonable, and
the basis for choosing the prior probabilities may be somewhat
unclear. Yet it can be plausible to assume that Yn is 1 ,. . ., J
valued for each n = 1, 2 ,. . . and that conditioned on IN =in for
n 2 1, the random variables An, n 21 are independent. As Efron
(1975, p. 898) indicates, in most practice the distinction between
unconditional and conditional independence is ignored. He details
how in normal linear discrimination and in logistic regression -
both to be discussed - one can deal theoretically with the more
difficult case of conditional independence. In other contexts,
theoretical problems regarding the subtle issue at hand are raised
by Olshen in his discussion of Stone's (1977) landmark paper and
by Stone in his reply (p. 641) to the discussants. See also Gordon
and Olshen (1978), and especially Chapter 12 of Breiman et al.
(1984~. For ease of exposition, in what follows the point of view is
that the observations that make up the learning sample are
independent and identically distributed.
42
OCR for page 43
OCR for page 45
OCR for page 46
OCR for page 47
OCR for page 48
OCR for page 65
OCR for page 66
OCR for page 67
OCR for page 68
OCR for page 69
OCR for page 70
OCR for page 71
OCR for page 72
OCR for page 73
OCR for page 74
OCR for page 75
Representative terms from entire chapter:
learning sample
3.2.2 The Fisher Linear Discr~m~nant and Some of Its
Children
As with much of what is worthwhile in statistics, the starting
point for a discussion of explicit rules for discrimination is a result
of Fisher (19361. The Fisher linear discuminant and procedures to
which it has led have been remarkably useful in practice.
Suppose that J=2 and that given Yi=i,2[i~Np(Hi,~.
That is, 2[ has a p-dimensional normal distribution with mean hi
and covariance I:. Then, if I: is of full rank and prime denotes
transpose, one calculates that a Bayes rule
taxi= 1 if p0+,'x >0, where
,° = ]°g ~ ~2)C(~2. ~ 2) ~ - Yell 1' ]:-1 111 - 1l2' I-1 112), (20)
~ - (~l' - p2 ~ ~-t ~
Otherwise, ~ (x ~ = 2.
The coefficients p' in (20) have been described by Truett,
Cornfield, and Cannel (1967) as "the amount by which the logit of
risk increases for unit increase in the risk factor." The logit of nsk
is the log odds (given its features) that an observation of unknown
class actually is of class 1. There is no loss in assenting that ~ is
nonsingular, since singular cases can always be made nonsingular
by an appropriate reduction of dimension. In practice Hi, p2 and
are seldom assumed known, and must be estimated from the learn-
;ng sample. If Nt=#li AN :Yi =ll and N2=#li
H1- N ~ 2[i H2= ~ ~ 2[i (21)
1 in :Yi=1 N2 ion :Yi=2
£= N ~ 2; (2[i -plX25E' -~1)' + ~ (2[i -I -~2)'l;
I i~lV :Yi=1 ion :Yi--2 J
if, in addition, ~(1) and ~(2) are unknown, then
n(l)=Nl/N and ~2)=N2/N.
The maximum likelihood estimates can be substituted for their
corresponding parameters in (20~. There results one version of an
estimated linear di#criminant. This estimate of ~ and various
stepwise versions of it are widely used in practice, even when the
conditional distributions of ~ given Y not only do not share a com-
mon covanance, but also when the distributions are not normal at
an. Robustness of the Fisher linear discriminant has been noticed
by many users and studied by some. One reference on the #object
is Lachenbruch (1982~.
Even though the most important issues involve errors of
classification, one may ask from a demsion theoretic point of view
how the estimators defined by (20) and (21) perform when the
mode} is correct. Thus, for some positive definite matrix Q. one
might study the "risk"
'^ ~
E ~ (P- fly Q (~-~t,~2,~
(22)
and ask whether any other estimator does at least as well risk-
wise as p for all Ill, p2 and A, and better for some values. Me
44
phenomenon, which leads to the coordinates of a vector of
estimated parameters being pulled towards a pr~chosen value,
with salutary demsion-theoretic implications, surfaces here pro-
nded N. 2 2, N2 2 2, and N > p + 3. This idea figures in what has
come to be known as James-Stein estimation. See Stein (1956) and
James and Stein (1961~. Had (1986) has shown that in this case
the estimated Fisher linear discum~nant coefficients ~ can be
improved by improving the estimate EN of ]:~~. Basically, if one
adds a multiple of Q which depends on the trace of I:~i Q to I:
before inverting and substituting into (20), then the resulting esti-
mate of ~ improves upon A, at least in the sense given by (22~.
Another departure from the substitution of maximum likeli-
hood estimates into (20) is given by 'logistic regression". (See §
2.2.5.) Its starting point is the observation that if the stated nor-
mal mode] is correct, then
P(Y = ~ ~ x) = e~(pO ~ pI'8) / t1 + e~(pO + Pi'
P(Y=2~2~= 1-P(Y=1~2~.
(23)
To estimate (DO, is,'), one can maximize the conditional (binomial)
likelihood based on (23~. (This maximization must be done on a
computer by numerical maximization techniques (see § 4.2.3.), but
is relatively easy as such problems go since the likelihood is uni-
modal and logarithmically concave.) These estimates can be sub-
stituted into (201. Of course, if the original normal model is
correct, this estimated ~ must somehow be less efficient than
that based on (21), which utilizes the full likelihood function.
However, the '10gistic" likelihood function is valid under more gen-
eral assumptions of the exponential family than is the likelihood
function which leads to (211. Efron (1975) has investigated the
efficiency of logistic regression relative to estimated Fisher linear
discrimination when the normal model is correct and found it to be
'between one half and two thirds as effective as normal discrimi-
nation for statistically interesting values of the parameters."
45
Yet another departure from the mocle] with which this sec-
tion begins is this. Suppose that given Yi = j, hi ~ Np (pj, Ej),
where I:' may not equal ~2. (~ also § 2.2.4.) It is straightfor-
ward to compute a Bayes rule, but several individuals have taken
a different point of view. Suppose first that p = 1. Then in an
obvious notation and as Becker (1968) indicates,
~ p! - p? ~
O~ + 02
is a useful measure of the separation of the two conditional distri
butions. If p > 1, then since linear functions of normal vectors are
normal, one night consider b'(~ - p2) for various nonrandom vec-
tors b . If Y = j, b ' Xj ~ N(b ' pj, b ' Zj b ), and thus Becker's
measure extends to
=
S(b)= Ib (~1-~2)1
(6'~1 b)Y2+(b'E2 b)Y2
a criterion actually studied earlier by Anderson and Bahadur
(1962). Suppose now that one chooses to compute a linear function
of it, say bo'$, so that cl(2~)= 1 if bo'$ ~c, and otherwise
c'(~) = 2. If the goal is to minimize P (d (2~) ~ Y ~ Y) (that is, the
probability of amistake),then pick ho to maximize S(b),and
(6o' ~2 bo)Y2 be' p! + (ho' ~! boles ho ~2
(b 0 ~ 1 bo)~a + (b o' [2 b o)~2
.
This fact follows Tom work of Anderson and Bahadur (1962).
Chernoff (1972, 1973a) indicates how to compute ho and notes
that the common probability of a mistake is O(- S Ebb). Of course,
in order to implement the Anderson - Bahadur - Checkoff ideas in
practice, one must estimate those parameters which are not
known.
46
In case ~ >2, the slick appearance of the Fisher linear
discriminant Bayes rule disappears Generalizations of logistic
regression to this case have been studied by many authors
Chapters 4 through 6 of the recent book by by McCulIagh and
Nelder (19X3) are an excellent recent summary We choose not to
provide details here because the issues which arise pertain to
modeling conditional distributions of ~ given Y. and the main
aspects are not discussed in the explicit context of discrimination
3~2e3 Estimating Misciasmfication Costs
When a decision rule ~ is estimated from a learning sample,
then it follows from Fubini's theorem that the expression (18) for
the expected cost of c! is really the conditional expected cost given
the learning sample The unconditional expected cost is the expec-
tation of (18) with respect to the distribution of the learning sam-
ple n independent copies of the joint distribution of (2[, Y) Of
course, that distribution is assumed here to be unknown, and we
are left to estimate the unconditional expected cost from data.
(See earlier discussion in § 2.2.1.) In the best of all worlds, we
have available a genuinely independent test sample, taken to be
independent of the learning sample, distributed as (2[, Y), and
large enough to permit reasonable inferences. This happy situa-
tion occurred, for example, with the work of (Feldman et al. (1982),
but unfortunately seems to be the exception rather than the rule in
practice. And in the absence of such a test sample we are left to
estimate the misclassification cost associated with the prospective
use of d Tom the learning sample.
The starting point for the estimation of misclassification costs
from the learning sample is the so-called "resubstitution" estimate.
If the prior probabilities are assumed to be known and the learning
sample is of cardinality N. then the resubstitution estimate is
J
''=1 # li IN:
y _ } ~C(d(Xi ) lit)
i-J ion :Yi=i
47
· (24)
If the It's are no! assigned to be known, then the analogue to (24)
is
1 N
- ~ C(d (my ~ | Yk )
N i=1
.
(25)
These resubstitution estimates may occasionally be of practical
value with some parametric procedures such as the Fisher linear
discriminant However, they are subject to enormous over optimi#-
tic biases for nonparametric techniques such as those discussed in
the next section One approach to overcoming the biases is that of
cross-validation By m-fold cross-validation (see Breiman et al,
1984) is meant this The learning sample is divided at random into
m disjoint subsets of approximately equal size Call them
L~,...,Lm. Successively, the data of LV are deleted to yield
L(V) The classifier cl(V) is computed from those data in [(V)
according to the same algorithm by which d was calculated from
all of the data. Then Lo is used as a test sample. The process is
repeated for v = 1 ,..., m and the results averaged. In order to
simplify the exposition we suppose in what follows that (25)
applies. Thus, the estimated cost of misclassification is
m
. hi.
m ,,~1 m N-N i~itYi)~Lu
m ~
C(d(U)($i)l Yi)
.
The most widely advertised choice of m is N. the 'heave one out"
method. However, N-fold repetitions of computationally expen-
sive procedures is not practical, and seems not to be best for
theoretical reasons in some cases (see Efron, 1983, p. 3271.
The discussion thus far has been vague as to how the To are
chosen. For example, one might think of stratifying the sampling
by class even in the present context in which the It's are not
asstlmed known. With cross-vali~lation stratified, each class is as
nearly as possible equally represented in each of the m Lo. The
results on the effects of enforced stratification are skimpy and
48
minimum distance between the two clusters is as large as possible,
and continue dividing the clusters obtained, in the same way. This
produces single linkage clusters; the other agglomerative methods
have no simple divisive characterization, and so it is to be expected
that the large clusters they produce have no known desirable pro-
perties.
Replace each point by a sphere of radius ~ and consider the
maximal connected subsets of the union of spheres, for all d.
These are the single linkage clusters. Clusters of this type are stu-
died in percolation theory (Broadbent and Hammersley, 1957,
Smythe and Wierman, 1978) and so asymptotic results about sin-
gle linkage clusters may be obtained from percolation asymptotics.
The nearest neighbor density estimate is
flux) = C / iinf pP(x, Xi)
· it- ~
Clip cumenslons,
The high density clusters of In are the maximal connected subsets
of unions of spheres of the precarious paragraph, the single linkage
clusters. The nearest neighbor density estimate is a poor estimate,
not consistent for the true density; it is remarkable that single
linkage clusters retain approximate consistency. Following
Wishart (1969) and Ling (1973) eve should use high density clus-
ters corresponding to some form of kth nearest neighbor density
estimation, where for consistency ~ ~ ~ as n woo but k In ~ O.
For example, we define distance between clusters in a joining algo-
nthm by the kth smallest among distances between pairs of points
in the two clusters. This clustering method is the analogue of kth
nearest neighbor discnminant procedures, in which a new point is
allocated to the class that appears most frequently in its k nearest
neighbors.
Another characterization of single linkage clusters is through
the ultrametric
p*(x, Y ~ = inf sup phi, xi+~)
x=x~,x2...x~=y ~
65
.
The ultrametric satisfies p*(x, y ) < sup[p*(x, z ), ply, z )] and
determines a family of clusters {x | p*(x, y ) < C} for various C, y,
that turn out to be the single linkage clusters.
Finally, the minimum spanning tree is the graph of minimum
total length connecting the sample points. Gower and Ross (1969)
show that single linkage clusters are the connected sets obtained
by successively deleting, largest to smallest, the links of the
minimum spanning tree. Thus single linkage computation and
asymptotics are intimately related to minimum spanning tree com-
putation and asymptotics.
3~3.5 M=tures
The population density
k
f = ~Pifi
i=1
is a mixture of components fi in proportions Pi. This may be
newed as a model for k clusters. The fi and Pi are unidentified
without some further constraints. The usual assumption is that
each fi is a member of the same parametric family, the multivari-
ate normal (e.g., Wolfe, 1970; Day, 1969; Dick and Bowden, 1973~;
however the mature model may also be applied to general sample
spaces, not only to points in p dimensional Euclidean space. In
discriminant analysis, a random observation X is associated with a
classification I into one of k classes. Suppose that I takes the
value i with probability Pi, and that X given I = i has density fi .
Then the marginal density of 2[ is just f =[Pi fi . In dis~iminant
analysis we know ~ and I; in clustering we know only fig; thus the
mixture mode] for clustering corresponds to the marginal probabil-
ity model for discriminant analysis. Lack of knowledge of the
classification variable makes the general mixture model
unidentifiable however, so further constraints are needed for clus-
tering.
66
Let p(i I X)= pi fi(X) I ~ pi fi(x)
denote the posterior proba-
bility that the observation x belongs to class i. These posterior
probabilities are useful in maximum likelihood estimation for the
mode]
f (x ) = ~ Pifi(X, si
where the fi are known up to the. parameter Gi taking values in
r-dimensional Euclidean space. Assume the fi are differentiable
with respect to Oi. Then the maximum likelihood estimates for
Pi, Gi based on observations Ad, 82 ~ . . . ~ An satisfy
J
n
~ p (i | A ) d /d si log fi (by, si ) - O.
i=1
n
(ii): Pi = ~ p(il~j)ln.
j-1
The estimation proceeds in alternating steps: given p (i I 2[j ), esti-
mate Gi weighting the observations 2Ej by their probability of
h~lon~in~ to ' the ith component; then Even these estimates Hi,
~ _ ~ a,
estimate Pi and p (i | 2E,, ); then repeat the first step.
Rao (1948) appears to have been the first to use maximum
likelihood for normal mixtures. See also Day (1969), Wolfe (1970),
Hosmer (1973), Everitt and Hand (1981~. We can't show that the
alternating procedure leads to the true maximum likelihood esti-
mates, or even that the likelihood increases after each step, even
in the simple case of normal mixtures in one dimension. The stan-
dard madman likelihood regularity conditions do not hold in the
normal case, and the usual asymptotic consistency and distribu-
tion results do not always hold.
For well-separated clusters, the posterior probabilities
pti ~ 2[j) are all near 0 or 1, and the maximum likelihood solution
is approximated by dividing the observations into k clusters,
67
estimating ei by maximum likelihood #eparately within the clus-
ters, and finding that division into k clusters that maximizes the
product of the likelihoods. This procedure is the strict ma2nm~n
likelihood estimate for the mode] in which the observations {2[j}
are supposed drawn from components {IjI, and the components are
regarded as unknown parameters. The likelihood is
~ [~t ~ In ~ ll ~ · · · ~ In ~ 81' 02,. . ., 0~ ~ = n f~j<~j, Ail
.
In practice we can rarely adore to search all partitions, but we use
an alternating step algorithm:
(i I': Given Ij, select Gi to maximize Ink f i Jo, Gi
(ii )': Given it, select Ij to maximize f~j(~;, O'j)
.
Thus given the clusters we estimate Oi by maximum likelihood,
and given the Hi we specilty cluster membership to make hi most
likely. This is the alternating method for mixture maximum lilreli-
hood when the p (i ~ Ad ~ are all zero or one.
In the particular case when
fi(-~) = (2~,)~ /2 e2`p[- Your -'ii ~ t~ -~i ']
the above algorithm has a simplified form known in the clustering
literature as k-means. (See, e.g., Macqueen, 1967, and Hartigan,
1975~. We select pi to be the mean of the observations in the ith
cluster, and we allocate the observation 2[j to that cluster i that
minimizes the distance between 25Zj and lo
68
.
3~Be6 The Number of Clusters: Modes
In the high density clustering model, we associate a family of
cluster with each mode of the density f: f has a mode at m if
there is a neighborhood M of m such that f (x ~ ~ f (m ~ for x ~ M,
anil f (x ~ < f (m ~ for x in the boundary of M. There are disjoint
high density clusters only if f is multimodal. Thus we can test for
the presence of clusters by testing for multimodality.
For X one dimensional, unimodal and bimodal densities may
be fit by maximum likelihood giving a likelihood ratio test for
bimodality, but it is difficult to handle the large contributions to
the likelihood made by small intervals between neighboring obser-
vabons. A better test is the dip test, which measures the max-
imum difference between the empirical distribution function, and
the unimodal distribution function chosen to minimize that max-
imum difference. The dip approaches zero for unimodal distribu-
tions, and some non-zero value for multimodal distributions, as the
sample size increases. It is therefore consistent for distinguishing
unimodal from multimodal distributions. It is argued in Harkigan
and Hartigan (1985) that the uniform is the appropriate null uni-
modal distribution, Cause the dip is asymptotically stochastically
larger for the uniform than for other unimodal distributions; the
asymptotic distribution of the dip and some empirically deter-
mined distributions for finite sample #izes are given in that paper.
The dip does not generalize simply to many dimensions. The
minimum spanning tree provides a And of ordering of the n sam-
ple points that may be used to generate an analogue of the dip
statistics: select a particular sample point Lo to be the mode or
root, and consider probability distributions P supported by the
links of the minimum spanning tree. Define P to be unimodal if P
has a density, with respect to the uniform distribution on the tree,
that is a non-decreasing function of x as x moves toward the root.
At each point x on the tree define a distribution function value
F(x ~ to be the probability that a random point ~ is such that x lies
between '[ and no. Let Fn be the empirical distribution function
corresponding to the empirical distribution which gives each sam-
ple point probability 1 In ~ Define
69
d (F' Fn ) = #Up | Fn (X )-F(x ) |
D(Fn 9 no) = inf d (F. Fn ) where F is unimodal with mode x
DIP(Fn) = inf D(Fn ~ To)
so
.
This procedure locates an optimal mode no, and states how well
the data fit the un~modal hypothesis, DIP(Fn ). In the one dimen-
sional case the usual definition of dip gives the same value. The
asymptotic behavior of the multivanate version is unknown.
3.3~7 The Number of Clusters: Component
If the components of a multivariate normal mixture are
sufficiently well separated, there will be one mode for each com-
ponent. In this case the number of clusters is the number of com-
ponents or the number of modes, but in general the number of
modes is fewer than the number of components, so testing for the
presence of more than one component is less conservative than
testing for the presence of more than one mode.
Wolfe (1971) considers the likelihood ratio test for say one
component against two, but notes that the regularity conditions
which are usually required for the log likelihood ratio to be propor-
tional to a chisquare are not fulfilled. See also Binder (1978) and
Hartigan (19771.
Consider the simplest case: X~ , . . ., Xn
sample from N(O, 1)
under the null hypothesis, and from ( l - p ) N (O. l ) + pN ( it, 1 ) for
some O
For each fixed it, log L (p 9 if) iS a concave function of p that
has maximum value O if Nazi < 0 but maximum value appro~n-
mately ([, Zi ~ / 2~; zi2) otherwise Thus asymptotically,
L(~) --sup logL(p, p) is equal to zero with probability Y2, and to
p
(A ) / 2 with probability Y2 lithe (X2) / 2 would be expected fiom
usual likelihood asymptotics
If ~ and it' are widely separated, Zi(,u) and Zi(p') are nearly
uncoITelated, and so asymptotically L(~) and L(p') are nearly
independent. Thus sup L(,u) is greater than the maximum of k
nearly independent L (Il) for each k Thus sup L (~) is asymptoti
11
cally infinite.
r
The likelihood ratio test does not therefore follow the usual
asymptotics, and is not conservative: the usual sigrnficance test
will (with probability 1 asymptotically) reject the hypothesis of a
single component when only a single component is present For
each it, sup L (p, It) has asymptotically the same distribution, and
p
these distributions are nearly independent for well-separated it;
maximum likelihood computations will therefore be difficult; we
can expect to see local maxima of sup L (p, p) near every value of it.
p
If ~ has prior density normal with mean O and variance 1,
large values of ~ are inhibited and the maximum posterior density
will occur only with ~ moderate. The corresponding ratio test may
have better asymptotic behavior than the likelihood ratio test.
More generally, if the mixture model has components with means
ply p2 ,- . , pk we might assume the Ok to be a pnon a sample from
a normal; this prevents the artificially large separation of It's that
occurs in likelihood estimation and testing.
The behavior of the likelihood ratio statistic in the k-means
case has been examined in one dimension by Hartigan (1978) and
in higher dimensions by Pollard (1982~.
71
3.3.8 Ull;r~etric and Evolutionary Distances
Assume that there are N objects, and N(N -1) / 2 distances
between pairs of objects. From these distances we wish to form
clusters of close objects. One way to go about constructing the
clusters is to require that the distances satisfy certain properties
in the final clustering. For example, all distances within two dis-
joint clusters must be smaller than all distances between the clus-
ters. Or, each pair of points in the #ame cluster must be connect
by a chain of points such that neighboring points in the chain are
closer than some neighboring points in a chain connecting points
in different clusters. (This defimtion leads to single linkage.)
Another way is to suppose that the clusters correspond to some
ideal distance matrix, and to attempt to approximate the given dis-
tance matrm c! with a best fitting cluster distance D. For exam-
ple, hierarchical clustering might correspond to an ultrametric D,
a distance satisf ying D(i ,j ~ ~ sup[D(i ,k I, D(' ,k )] and we would
find the ultrameinc D closest to d. See Hartigan (1967), Johnson
(1967) and Jardine, Jardine and Sibson (1967~. Another plausible
definition, the evolutionary mode! Mom Fitch and Margoliash
(1967), is based on an evolutionary tree generating the objects.
The distance between any pair of objects is the sun, of links on the
unique path connecting them in the tree. If there exists an ances-
tor in the tree such that all points are equidistant (in sums of
links) from the ancestor, then this evolutionary distance reduces to
an ultrametric. Given the tree, the best fitting evolutionary dis-
tance or ultrametnc can be fitted by regression methods; the hard
part is searching for the best tree.
Baker (1974) has considered probability models in which an
observed distance matrix ~ vanes by some amount from an
ultramet~c D, and has investigated empirically how well the vari-
ous hierarchical techniques recover the true ultrametric D. The
results are opposite to those obtained using the high density
model: complete linkage does well, and single linkage poorly.
Euclidean distances in p-dimensional space mill form an
ultrametric distance matrix on at most (p + 1) points. For a den-
sity f, we can construct an ultrametric by
72
D(x,y) = min max 1 / f(u )
c ~ tic
where C is any path connecting x andL y Thus x and y are close if
they can be connected by a path of high density, or equivalently if
they lie together in a high density cluster In fitting such an
ultrametric to objects in p-dimensional space we would use only
the small distances between objects to obtain an estimate of the
density f Single linkage works only with the small distances,
whereas complete linkage depends on the large distances This
may be the explanation for Baker's results favoring complete link-
age, in that he requires the fits ultrametric to be close to the true
ultrametric when averaged over all distances, and the large dis-
tances are neglected by single linkage In practice, the large dis-
tances deviate most from the fitted ultrametnc (however fitted)
and it seems correct to downweight their contribution Theoreti-
cally, if we wish to allow clusters of arbitrary shape and size, it
also seems impossible to give large distances much weight
Perhaps we should fit an ultrametnc D to minimize
~ w(D)~(i,j) -D(i,j)] / ~ w(D)
where w (D ~ is small or zero for D large ~s moves single link-
age a little way towards average linkage More weight should be
given to the large distances in high-dimensional spaces
Let the objects 1, , n be generated by an evolutionary tree,
beginning at some ancestor, O For a particular measurement X
taking values Xi on the objects, assume that X changes in time
t, t ~ At on a particular link of the tree, by an amount that has
mean 0, variance a2At, and is uncoagulated with changes in dif-
ferent intervals or links
Then, letting BY denote the expected value of the random
variable, Y.
73
E (Xi - Xj)2 = 2~72t';
where lij is the time since i and j evolved from their most recent
ancestor, so E (Xi -Xj )2 is an ultramet~c!
If we had used different rates of evolution in the different
links of the tree, so that the changes in X had variance a'2At for
link i, then E (Xi -Xj )2 Womb be an evolutionary distance.
Suppose that X is normal, and there are p independent sam-
ples of X, namely, X1, x2 ,. . ., XP . (Here the number of objects is
fixed, and the measurements are assumed sampled from an
infinite population of possible measurements; it In require card
fix standardization to achieve something like this in practice.)
Then
I.` (Xi -X~)2 ~ Its 2tijX 2
r=1
d (i, j ) ~ 42~2tij %p2
Let D (i, j ) = 42~2tijp,
.
an ultrametric. For large p,
dfi,j) ~ D(i,j)[1+N(O,2/p)]
This suggests fitting the ultrametuc D by minimizing
~ ID(i,i)-d(i,j)12/D2(iti)
which downweights the large distances nicely, but probably not
enough. We have to take note also of the high correlation between
the large distances, arising from the high fraction of their paths
74
through the tree that they have in common. We can compute
these, but the criterion to be minimized is then a complex qua-
dratic in D -it.
75
-