Questions? Call 888-624-8373

PAPERBACK
list:$27.25
Web:$24.53
add to cart

Rights & Permissions

topleft topright

Discriminant Analysis and Clustering (1988)
Commission on Physical Sciences, Mathematics, and Applications (CPSMA)

Page
38
bottomleft bottomright
Page
38

Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 38
CHAPTER 3 THEORY 3.1 induction This chapter emphasizes certain theoretical statistical aspects of the techniques of disc~minant and cluster analyses dis- cussed in Chapter 2. Section 3.2 written by R. A. Olshen, pertains to discriminant analysis while § 3.3, written by J. A. Hartigan, is concerned with the theory of clustering algorithms. 3.2 Theoretical Issues in Disenminant Analysis 3.2.t Introduction The questions and techniques which are addressed in this chapter are quite simple to state, but are rich in areas of applica- tion. Problems of computer aided diagnosis in medicine, military surveillance, and speech recognition, for example, can sometimes be formulated in a common way. Observations are available from a source that belongs to a unique population among a set of popu- lations. For example, the observations might be gathered by radar as a plane flies over a ship at sea. The ship is the source of the data. It is assumed that the ship is one (the unique population) of five ship types (the set of populations), and the classification ques- tion is which one. Such data are often termed "test" data or "test samples". The set of candidate populations is assumed to be finite. Indeed, two is perhaps the most popular number. The values of a set of features -- that is, covariates, or "independent" variables such as the radar measurements on the ship -- are available for each unit to be classified, and it is on the basis of these data that assignments are made. In problems which are our focus here, 38

OCR for page 39
something is assigned known about the conditional distributions of features given population (or perhaps more commonly "class") membership. These distributions are generally known, or assumed to be known from some previous experience -- in which case the problem of class assignment can be rather simple -- or learned from other data, a "yearnings or "training,' sample. Prior probabilities of class membership and costs of misclassiltying candi- date observations are implicit to most schemes for "classification" or "discrimination". They will be made explicit in what follows. But before that beginning of our more technical discussion, we draw the principal distinction between the topic of this section and the notion of "classification" which is usually associated with clus- tering. Namely, here it is assumed without question that the classes are well defined. Thus, what is discussed here applies to the assignment of a cancer patient to one of several recognized stages of illness on the basis of some data, and decidedly not to the question of how many stages it makes sense to ascribe to the cancer itself. Solution to the latter question, however fundamental to science and technology, is a requisite preprocessing to the tasks we confront here. Also, we will have little to say regarding the fundamental question of feature selection, which has been so prominent a part of recent literature on regression (see, e.g., Shi- bata, 1981~. On the other hand, the probability distribution of the data within a given class will not be assumed to be known in most of our discussion. In fact, it will be evident that to do effective discrimination one need not know these distributions -- only one aspect of the rank ordering of certain linear combinations of their (generalized) densities. The formulation of "discrimination" which follows is adapted from the recent book by Breiman et al. (19841. Suppose that the variable Y can assume any Integer value between, say, ~ and so and that the value of Y is unknown; Y represents "class". Mote: in Chapter 2 the total number of cIas#es or gTOUpS was denoted as G instead of J. ~ But suppose that features ~ of Y are observed or are otherwise available. On the basis of ~ we wish to infer Y. Assume further that there are J densities ff ~ Y =j) with respect to some dominating measure ~ on a space X which is the range of X. We use a dot to indicate the argument of the 39

OCR for page 40
conditional density f, and a vertical bar to denote "given" or "given that". Generally #peaking, X can be taken to be Euclidean, but it is the decided exception in practice for all the f 's to be absolutely continuous (with respect to Lebe#gue measure) #ince discrete features are common in applications. We denote by ~=P(Y=j) the "prior" probability that an observation whose cias# membership is unknown is of clam j. While the use of prior probabilities can be controversial in other settings, it seems difficult to formulate discrimination in a satisfactory way without them Also, in the present context they often have a compelling frequentistic basis Applications of the technologies under ~iS~#- sion are often in the content of new data like those of an existent and at least moderately well-understood database A more controversial aspect of our formulation is the set of numbers C(i ~ j), the cost of classifying an observation to class i given it is of class j It is typically convenient to take C (i ~ i ~ = 0 and C(i ~j)>0 for i 76j Consi~ler the problem of classifying a patient who enters an emergency room with a complaint of chest pain as to whether he has suffered a heart attack The patient who desires that the most extensive medical resources be made available to him just in case he has actually had a heart attack may have C 's that are very different from those of the director of a coronary care unit, who must make careful and responsible alloca- tions of scarce, expensive resources (See Breiman et al, 1984, pp 176 and 177 ~ A (nonrandomized, measurable) decision rule ~ - such rules are all that we need consider in discrimination - arises Tom ~ par- tition of ~ into disjoint subsets If 2[ belongs to the JO of these, we decide that Y = j, that is, d (2~) = j The expected cost of the rule ~ is J - J ~ ~) ~ C(i I j~(d(~=i 1~=i' j=1 i=1 J '=1 '~ ~ li) veldt) i} f(X I Y-j)~) 40 (18)

OCR for page 41
A rule SIB iS called a "Bayes ~e" if in cost is as small as ~- sible. All Bayes rules have the same elected cost. One concludes thatif d~x)=i implies J ~ ~)C(i lit) f(x I Y =j) j=1 ~ ~'~(i'lj)f(xlY=j) j=1 for all i', then ~ is a Bayes rule. Specialize to the case A = 2 to see that for ~ almost all x , dB (X ~ = ~ implies f(x ~ Y = 1) > ~(2)C(1 ~ 2) - f (x 1 Y = 2) - 7~(1)C(2 1 1) . (19) So obviously we "know" a Bayer rue if we know the "densities" f ( ~ Y = 1) and f ( ~ Y = 2). And this observation is precisely why discrimination or classification as pursued here is a subject of seri- ous inquiry. Imagine, as is the case in many applications, that 2[ is, say, twenty dimensional, and that we have available learning samples of, say, hundreds or even thousands of observations from f ~ ~ Y = 1) and f ~ ~ Y = 2), which are not assumed to be of any particular functional form. Think of the crudest partitioning of the axes of ~ into two parts each; the resulting product partition has a total of 220 bins, or more than a million. Most will have no members of the learning sample at all. The estimation of the den- sities is thus hopeless. But from (19) it is clear that we need not know the densities exactly. We only need to know when their ratio exceeds a specifiM constant. Fortunately, it is far easier to know when that occurs than it is to know the densities themselves. One needs a benchmark from which to gauge the perfor- mance of any classifier d; an obvious candidate is the no data Bayes rule. Such a rule assigns every observation to a class i for which 41

OCR for page 42
J ~ ~)C(i Ij) j=1 is minimized. In case of ties, the usual convention is to take the smallest minimizing index. We have discussed leafing samples in casual terms, but by so doing we obscure what can be a troubling distinction in theoreti- cal work and an important difference between prospective and retrospective studies. In the case of the former, one generally assumes that the learning sample is of the form (~1, YI) ,.~., CON, YN), where the pairs are independent and independent of (2[, Y), and each (2[i, Yi) is distributed as (8, Y); N is assumed to be a nonrandom constant. In retrospective stu- dies it often happens that an existing database is searched for preassigned numbers of pairs in each of the J classes. Then the assumption of unconditional independence is not reasonable, and the basis for choosing the prior probabilities may be somewhat unclear. Yet it can be plausible to assume that Yn is 1 ,. . ., J valued for each n = 1, 2 ,. . . and that conditioned on IN =in for n 2 1, the random variables An, n 21 are independent. As Efron (1975, p. 898) indicates, in most practice the distinction between unconditional and conditional independence is ignored. He details how in normal linear discrimination and in logistic regression - both to be discussed - one can deal theoretically with the more difficult case of conditional independence. In other contexts, theoretical problems regarding the subtle issue at hand are raised by Olshen in his discussion of Stone's (1977) landmark paper and by Stone in his reply (p. 641) to the discussants. See also Gordon and Olshen (1978), and especially Chapter 12 of Breiman et al. (1984~. For ease of exposition, in what follows the point of view is that the observations that make up the learning sample are independent and identically distributed. 42

OCR for page 43
3.2.2 The Fisher Linear Discr~m~nant and Some of Its Children As with much of what is worthwhile in statistics, the starting point for a discussion of explicit rules for discrimination is a result of Fisher (19361. The Fisher linear discuminant and procedures to which it has led have been remarkably useful in practice. Suppose that J=2 and that given Yi=i,2[i~Np(Hi,~. That is, 2[ has a p-dimensional normal distribution with mean hi and covariance I:. Then, if I: is of full rank and prime denotes transpose, one calculates that a Bayes rule taxi= 1 if p0+,'x >0, where ,° = ]°g ~ ~2)C(~2. ~ 2) ~ - Yell 1' ]:-1 111 - 1l2' I-1 112), (20) ~ - (~l' - p2 ~ ~-t ~ Otherwise, ~ (x ~ = 2. The coefficients p' in (20) have been described by Truett, Cornfield, and Cannel (1967) as "the amount by which the logit of risk increases for unit increase in the risk factor." The logit of nsk is the log odds (given its features) that an observation of unknown class actually is of class 1. There is no loss in assenting that ~ is nonsingular, since singular cases can always be made nonsingular by an appropriate reduction of dimension. In practice Hi, p2 and are seldom assumed known, and must be estimated from the learn- ;ng sample. If Nt=#li AN :Yi =ll and N2=#li OCR for page 44
H1- N ~ 2[i H2= ~ ~ 2[i (21) 1 in :Yi=1 N2 ion :Yi=2 £= N ~ 2; (2[i -plX25E' -~1)' + ~ (2[i -I -~2)'l; I i~lV :Yi=1 ion :Yi--2 J if, in addition, ~(1) and ~(2) are unknown, then n(l)=Nl/N and ~2)=N2/N. The maximum likelihood estimates can be substituted for their corresponding parameters in (20~. There results one version of an estimated linear di#criminant. This estimate of ~ and various stepwise versions of it are widely used in practice, even when the conditional distributions of ~ given Y not only do not share a com- mon covanance, but also when the distributions are not normal at an. Robustness of the Fisher linear discriminant has been noticed by many users and studied by some. One reference on the #object is Lachenbruch (1982~. Even though the most important issues involve errors of classification, one may ask from a demsion theoretic point of view how the estimators defined by (20) and (21) perform when the mode} is correct. Thus, for some positive definite matrix Q. one might study the "risk" '^ ~ E ~ (P- fly Q (~-~t,~2,~ (22) and ask whether any other estimator does at least as well risk- wise as p for all Ill, p2 and A, and better for some values. Me 44

OCR for page 45
phenomenon, which leads to the coordinates of a vector of estimated parameters being pulled towards a pr~chosen value, with salutary demsion-theoretic implications, surfaces here pro- nded N. 2 2, N2 2 2, and N > p + 3. This idea figures in what has come to be known as James-Stein estimation. See Stein (1956) and James and Stein (1961~. Had (1986) has shown that in this case the estimated Fisher linear discum~nant coefficients ~ can be improved by improving the estimate EN of ]:~~. Basically, if one adds a multiple of Q which depends on the trace of I:~i Q to I: before inverting and substituting into (20), then the resulting esti- mate of ~ improves upon A, at least in the sense given by (22~. Another departure from the substitution of maximum likeli- hood estimates into (20) is given by 'logistic regression". (See § 2.2.5.) Its starting point is the observation that if the stated nor- mal mode] is correct, then P(Y = ~ ~ x) = e~(pO ~ pI'8) / t1 + e~(pO + Pi' P(Y=2~2~= 1-P(Y=1~2~. (23) To estimate (DO, is,'), one can maximize the conditional (binomial) likelihood based on (23~. (This maximization must be done on a computer by numerical maximization techniques (see § 4.2.3.), but is relatively easy as such problems go since the likelihood is uni- modal and logarithmically concave.) These estimates can be sub- stituted into (201. Of course, if the original normal model is correct, this estimated ~ must somehow be less efficient than that based on (21), which utilizes the full likelihood function. However, the '10gistic" likelihood function is valid under more gen- eral assumptions of the exponential family than is the likelihood function which leads to (211. Efron (1975) has investigated the efficiency of logistic regression relative to estimated Fisher linear discrimination when the normal model is correct and found it to be 'between one half and two thirds as effective as normal discrimi- nation for statistically interesting values of the parameters." 45

OCR for page 46
Yet another departure from the mocle] with which this sec- tion begins is this. Suppose that given Yi = j, hi ~ Np (pj, Ej), where I:' may not equal ~2. (~ also § 2.2.4.) It is straightfor- ward to compute a Bayes rule, but several individuals have taken a different point of view. Suppose first that p = 1. Then in an obvious notation and as Becker (1968) indicates, ~ p! - p? ~ O~ + 02 is a useful measure of the separation of the two conditional distri butions. If p > 1, then since linear functions of normal vectors are normal, one night consider b'(~ - p2) for various nonrandom vec- tors b . If Y = j, b ' Xj ~ N(b ' pj, b ' Zj b ), and thus Becker's measure extends to = S(b)= Ib (~1-~2)1 (6'~1 b)Y2+(b'E2 b)Y2 a criterion actually studied earlier by Anderson and Bahadur (1962). Suppose now that one chooses to compute a linear function of it, say bo'$, so that cl(2~)= 1 if bo'$ ~c, and otherwise c'(~) = 2. If the goal is to minimize P (d (2~) ~ Y ~ Y) (that is, the probability of amistake),then pick ho to maximize S(b),and (6o' ~2 bo)Y2 be' p! + (ho' ~! boles ho ~2 (b 0 ~ 1 bo)~a + (b o' [2 b o)~2 . This fact follows Tom work of Anderson and Bahadur (1962). Chernoff (1972, 1973a) indicates how to compute ho and notes that the common probability of a mistake is O(- S Ebb). Of course, in order to implement the Anderson - Bahadur - Checkoff ideas in practice, one must estimate those parameters which are not known. 46

OCR for page 47
In case ~ >2, the slick appearance of the Fisher linear discriminant Bayes rule disappears Generalizations of logistic regression to this case have been studied by many authors Chapters 4 through 6 of the recent book by by McCulIagh and Nelder (19X3) are an excellent recent summary We choose not to provide details here because the issues which arise pertain to modeling conditional distributions of ~ given Y. and the main aspects are not discussed in the explicit context of discrimination 3~2e3 Estimating Misciasmfication Costs When a decision rule ~ is estimated from a learning sample, then it follows from Fubini's theorem that the expression (18) for the expected cost of c! is really the conditional expected cost given the learning sample The unconditional expected cost is the expec- tation of (18) with respect to the distribution of the learning sam- ple n independent copies of the joint distribution of (2[, Y) Of course, that distribution is assumed here to be unknown, and we are left to estimate the unconditional expected cost from data. (See earlier discussion in § 2.2.1.) In the best of all worlds, we have available a genuinely independent test sample, taken to be independent of the learning sample, distributed as (2[, Y), and large enough to permit reasonable inferences. This happy situa- tion occurred, for example, with the work of (Feldman et al. (1982), but unfortunately seems to be the exception rather than the rule in practice. And in the absence of such a test sample we are left to estimate the misclassification cost associated with the prospective use of d Tom the learning sample. The starting point for the estimation of misclassification costs from the learning sample is the so-called "resubstitution" estimate. If the prior probabilities are assumed to be known and the learning sample is of cardinality N. then the resubstitution estimate is J ''=1 # li IN: y _ } ~C(d(Xi ) lit) i-J ion :Yi=i 47 · (24)

OCR for page 48
If the It's are no! assigned to be known, then the analogue to (24) is 1 N - ~ C(d (my ~ | Yk ) N i=1 . (25) These resubstitution estimates may occasionally be of practical value with some parametric procedures such as the Fisher linear discriminant However, they are subject to enormous over optimi#- tic biases for nonparametric techniques such as those discussed in the next section One approach to overcoming the biases is that of cross-validation By m-fold cross-validation (see Breiman et al, 1984) is meant this The learning sample is divided at random into m disjoint subsets of approximately equal size Call them L~,...,Lm. Successively, the data of LV are deleted to yield L(V) The classifier cl(V) is computed from those data in [(V) according to the same algorithm by which d was calculated from all of the data. Then Lo is used as a test sample. The process is repeated for v = 1 ,..., m and the results averaged. In order to simplify the exposition we suppose in what follows that (25) applies. Thus, the estimated cost of misclassification is m . hi. m ,,~1 m N-N i~itYi)~Lu m ~ C(d(U)($i)l Yi) . The most widely advertised choice of m is N. the 'heave one out" method. However, N-fold repetitions of computationally expen- sive procedures is not practical, and seems not to be best for theoretical reasons in some cases (see Efron, 1983, p. 3271. The discussion thus far has been vague as to how the To are chosen. For example, one might think of stratifying the sampling by class even in the present context in which the It's are not asstlmed known. With cross-vali~lation stratified, each class is as nearly as possible equally represented in each of the m Lo. The results on the effects of enforced stratification are skimpy and 48

OCR for page 65
minimum distance between the two clusters is as large as possible, and continue dividing the clusters obtained, in the same way. This produces single linkage clusters; the other agglomerative methods have no simple divisive characterization, and so it is to be expected that the large clusters they produce have no known desirable pro- perties. Replace each point by a sphere of radius ~ and consider the maximal connected subsets of the union of spheres, for all d. These are the single linkage clusters. Clusters of this type are stu- died in percolation theory (Broadbent and Hammersley, 1957, Smythe and Wierman, 1978) and so asymptotic results about sin- gle linkage clusters may be obtained from percolation asymptotics. The nearest neighbor density estimate is flux) = C / iinf pP(x, Xi) · it- ~ Clip cumenslons, The high density clusters of In are the maximal connected subsets of unions of spheres of the precarious paragraph, the single linkage clusters. The nearest neighbor density estimate is a poor estimate, not consistent for the true density; it is remarkable that single linkage clusters retain approximate consistency. Following Wishart (1969) and Ling (1973) eve should use high density clus- ters corresponding to some form of kth nearest neighbor density estimation, where for consistency ~ ~ ~ as n woo but k In ~ O. For example, we define distance between clusters in a joining algo- nthm by the kth smallest among distances between pairs of points in the two clusters. This clustering method is the analogue of kth nearest neighbor discnminant procedures, in which a new point is allocated to the class that appears most frequently in its k nearest neighbors. Another characterization of single linkage clusters is through the ultrametric p*(x, Y ~ = inf sup phi, xi+~) x=x~,x2...x~=y ~ 65 .

OCR for page 66
The ultrametric satisfies p*(x, y ) < sup[p*(x, z ), ply, z )] and determines a family of clusters {x | p*(x, y ) < C} for various C, y, that turn out to be the single linkage clusters. Finally, the minimum spanning tree is the graph of minimum total length connecting the sample points. Gower and Ross (1969) show that single linkage clusters are the connected sets obtained by successively deleting, largest to smallest, the links of the minimum spanning tree. Thus single linkage computation and asymptotics are intimately related to minimum spanning tree com- putation and asymptotics. 3~3.5 M=tures The population density k f = ~Pifi i=1 is a mixture of components fi in proportions Pi. This may be newed as a model for k clusters. The fi and Pi are unidentified without some further constraints. The usual assumption is that each fi is a member of the same parametric family, the multivari- ate normal (e.g., Wolfe, 1970; Day, 1969; Dick and Bowden, 1973~; however the mature model may also be applied to general sample spaces, not only to points in p dimensional Euclidean space. In discriminant analysis, a random observation X is associated with a classification I into one of k classes. Suppose that I takes the value i with probability Pi, and that X given I = i has density fi . Then the marginal density of 2[ is just f =[Pi fi . In dis~iminant analysis we know ~ and I; in clustering we know only fig; thus the mixture mode] for clustering corresponds to the marginal probabil- ity model for discriminant analysis. Lack of knowledge of the classification variable makes the general mixture model unidentifiable however, so further constraints are needed for clus- tering. 66

OCR for page 67
Let p(i I X)= pi fi(X) I ~ pi fi(x) denote the posterior proba- bility that the observation x belongs to class i. These posterior probabilities are useful in maximum likelihood estimation for the mode] f (x ) = ~ Pifi(X, si where the fi are known up to the. parameter Gi taking values in r-dimensional Euclidean space. Assume the fi are differentiable with respect to Oi. Then the maximum likelihood estimates for Pi, Gi based on observations Ad, 82 ~ . . . ~ An satisfy J n ~ p (i | A ) d /d si log fi (by, si ) - O. i=1 n (ii): Pi = ~ p(il~j)ln. j-1 The estimation proceeds in alternating steps: given p (i I 2[j ), esti- mate Gi weighting the observations 2Ej by their probability of h~lon~in~ to ' the ith component; then Even these estimates Hi, ~ _ ~ a, estimate Pi and p (i | 2E,, ); then repeat the first step. Rao (1948) appears to have been the first to use maximum likelihood for normal mixtures. See also Day (1969), Wolfe (1970), Hosmer (1973), Everitt and Hand (1981~. We can't show that the alternating procedure leads to the true maximum likelihood esti- mates, or even that the likelihood increases after each step, even in the simple case of normal mixtures in one dimension. The stan- dard madman likelihood regularity conditions do not hold in the normal case, and the usual asymptotic consistency and distribu- tion results do not always hold. For well-separated clusters, the posterior probabilities pti ~ 2[j) are all near 0 or 1, and the maximum likelihood solution is approximated by dividing the observations into k clusters, 67

OCR for page 68
estimating ei by maximum likelihood #eparately within the clus- ters, and finding that division into k clusters that maximizes the product of the likelihoods. This procedure is the strict ma2nm~n likelihood estimate for the mode] in which the observations {2[j} are supposed drawn from components {IjI, and the components are regarded as unknown parameters. The likelihood is ~ [~t ~ In ~ ll ~ · · · ~ In ~ 81' 02,. . ., 0~ ~ = n f~j<~j, Ail . In practice we can rarely adore to search all partitions, but we use an alternating step algorithm: (i I': Given Ij, select Gi to maximize Ink f i Jo, Gi (ii )': Given it, select Ij to maximize f~j(~;, O'j) . Thus given the clusters we estimate Oi by maximum likelihood, and given the Hi we specilty cluster membership to make hi most likely. This is the alternating method for mixture maximum lilreli- hood when the p (i ~ Ad ~ are all zero or one. In the particular case when fi(-~) = (2~,)~ /2 e2`p[- Your -'ii ~ t~ -~i '] the above algorithm has a simplified form known in the clustering literature as k-means. (See, e.g., Macqueen, 1967, and Hartigan, 1975~. We select pi to be the mean of the observations in the ith cluster, and we allocate the observation 2[j to that cluster i that minimizes the distance between 25Zj and lo 68 .

OCR for page 69
3~Be6 The Number of Clusters: Modes In the high density clustering model, we associate a family of cluster with each mode of the density f: f has a mode at m if there is a neighborhood M of m such that f (x ~ ~ f (m ~ for x ~ M, anil f (x ~ < f (m ~ for x in the boundary of M. There are disjoint high density clusters only if f is multimodal. Thus we can test for the presence of clusters by testing for multimodality. For X one dimensional, unimodal and bimodal densities may be fit by maximum likelihood giving a likelihood ratio test for bimodality, but it is difficult to handle the large contributions to the likelihood made by small intervals between neighboring obser- vabons. A better test is the dip test, which measures the max- imum difference between the empirical distribution function, and the unimodal distribution function chosen to minimize that max- imum difference. The dip approaches zero for unimodal distribu- tions, and some non-zero value for multimodal distributions, as the sample size increases. It is therefore consistent for distinguishing unimodal from multimodal distributions. It is argued in Harkigan and Hartigan (1985) that the uniform is the appropriate null uni- modal distribution, Cause the dip is asymptotically stochastically larger for the uniform than for other unimodal distributions; the asymptotic distribution of the dip and some empirically deter- mined distributions for finite sample #izes are given in that paper. The dip does not generalize simply to many dimensions. The minimum spanning tree provides a And of ordering of the n sam- ple points that may be used to generate an analogue of the dip statistics: select a particular sample point Lo to be the mode or root, and consider probability distributions P supported by the links of the minimum spanning tree. Define P to be unimodal if P has a density, with respect to the uniform distribution on the tree, that is a non-decreasing function of x as x moves toward the root. At each point x on the tree define a distribution function value F(x ~ to be the probability that a random point ~ is such that x lies between '[ and no. Let Fn be the empirical distribution function corresponding to the empirical distribution which gives each sam- ple point probability 1 In ~ Define 69

OCR for page 70
d (F' Fn ) = #Up | Fn (X )-F(x ) | D(Fn 9 no) = inf d (F. Fn ) where F is unimodal with mode x DIP(Fn) = inf D(Fn ~ To) so . This procedure locates an optimal mode no, and states how well the data fit the un~modal hypothesis, DIP(Fn ). In the one dimen- sional case the usual definition of dip gives the same value. The asymptotic behavior of the multivanate version is unknown. 3.3~7 The Number of Clusters: Component If the components of a multivariate normal mixture are sufficiently well separated, there will be one mode for each com- ponent. In this case the number of clusters is the number of com- ponents or the number of modes, but in general the number of modes is fewer than the number of components, so testing for the presence of more than one component is less conservative than testing for the presence of more than one mode. Wolfe (1971) considers the likelihood ratio test for say one component against two, but notes that the regularity conditions which are usually required for the log likelihood ratio to be propor- tional to a chisquare are not fulfilled. See also Binder (1978) and Hartigan (19771. Consider the simplest case: X~ , . . ., Xn sample from N(O, 1) under the null hypothesis, and from ( l - p ) N (O. l ) + pN ( it, 1 ) for some O

OCR for page 71
For each fixed it, log L (p 9 if) iS a concave function of p that has maximum value O if Nazi < 0 but maximum value appro~n- mately ([, Zi ~ / 2~; zi2) otherwise Thus asymptotically, L(~) --sup logL(p, p) is equal to zero with probability Y2, and to p (A ) / 2 with probability Y2 lithe (X2) / 2 would be expected fiom usual likelihood asymptotics If ~ and it' are widely separated, Zi(,u) and Zi(p') are nearly uncoITelated, and so asymptotically L(~) and L(p') are nearly independent. Thus sup L(,u) is greater than the maximum of k nearly independent L (Il) for each k Thus sup L (~) is asymptoti 11 cally infinite. r The likelihood ratio test does not therefore follow the usual asymptotics, and is not conservative: the usual sigrnficance test will (with probability 1 asymptotically) reject the hypothesis of a single component when only a single component is present For each it, sup L (p, It) has asymptotically the same distribution, and p these distributions are nearly independent for well-separated it; maximum likelihood computations will therefore be difficult; we can expect to see local maxima of sup L (p, p) near every value of it. p If ~ has prior density normal with mean O and variance 1, large values of ~ are inhibited and the maximum posterior density will occur only with ~ moderate. The corresponding ratio test may have better asymptotic behavior than the likelihood ratio test. More generally, if the mixture model has components with means ply p2 ,- . , pk we might assume the Ok to be a pnon a sample from a normal; this prevents the artificially large separation of It's that occurs in likelihood estimation and testing. The behavior of the likelihood ratio statistic in the k-means case has been examined in one dimension by Hartigan (1978) and in higher dimensions by Pollard (1982~. 71

OCR for page 72
3.3.8 Ull;r~etric and Evolutionary Distances Assume that there are N objects, and N(N -1) / 2 distances between pairs of objects. From these distances we wish to form clusters of close objects. One way to go about constructing the clusters is to require that the distances satisfy certain properties in the final clustering. For example, all distances within two dis- joint clusters must be smaller than all distances between the clus- ters. Or, each pair of points in the #ame cluster must be connect by a chain of points such that neighboring points in the chain are closer than some neighboring points in a chain connecting points in different clusters. (This defimtion leads to single linkage.) Another way is to suppose that the clusters correspond to some ideal distance matrix, and to attempt to approximate the given dis- tance matrm c! with a best fitting cluster distance D. For exam- ple, hierarchical clustering might correspond to an ultrametric D, a distance satisf ying D(i ,j ~ ~ sup[D(i ,k I, D(' ,k )] and we would find the ultrameinc D closest to d. See Hartigan (1967), Johnson (1967) and Jardine, Jardine and Sibson (1967~. Another plausible definition, the evolutionary mode! Mom Fitch and Margoliash (1967), is based on an evolutionary tree generating the objects. The distance between any pair of objects is the sun, of links on the unique path connecting them in the tree. If there exists an ances- tor in the tree such that all points are equidistant (in sums of links) from the ancestor, then this evolutionary distance reduces to an ultrametric. Given the tree, the best fitting evolutionary dis- tance or ultrametnc can be fitted by regression methods; the hard part is searching for the best tree. Baker (1974) has considered probability models in which an observed distance matrix ~ vanes by some amount from an ultramet~c D, and has investigated empirically how well the vari- ous hierarchical techniques recover the true ultrametric D. The results are opposite to those obtained using the high density model: complete linkage does well, and single linkage poorly. Euclidean distances in p-dimensional space mill form an ultrametric distance matrix on at most (p + 1) points. For a den- sity f, we can construct an ultrametric by 72

OCR for page 73
D(x,y) = min max 1 / f(u ) c ~ tic where C is any path connecting x andL y Thus x and y are close if they can be connected by a path of high density, or equivalently if they lie together in a high density cluster In fitting such an ultrametric to objects in p-dimensional space we would use only the small distances between objects to obtain an estimate of the density f Single linkage works only with the small distances, whereas complete linkage depends on the large distances This may be the explanation for Baker's results favoring complete link- age, in that he requires the fits ultrametric to be close to the true ultrametric when averaged over all distances, and the large dis- tances are neglected by single linkage In practice, the large dis- tances deviate most from the fitted ultrametnc (however fitted) and it seems correct to downweight their contribution Theoreti- cally, if we wish to allow clusters of arbitrary shape and size, it also seems impossible to give large distances much weight Perhaps we should fit an ultrametnc D to minimize ~ w(D)~(i,j) -D(i,j)] / ~ w(D) where w (D ~ is small or zero for D large ~s moves single link- age a little way towards average linkage More weight should be given to the large distances in high-dimensional spaces Let the objects 1, , n be generated by an evolutionary tree, beginning at some ancestor, O For a particular measurement X taking values Xi on the objects, assume that X changes in time t, t ~ At on a particular link of the tree, by an amount that has mean 0, variance a2At, and is uncoagulated with changes in dif- ferent intervals or links Then, letting BY denote the expected value of the random variable, Y. 73

OCR for page 74
E (Xi - Xj)2 = 2~72t'; where lij is the time since i and j evolved from their most recent ancestor, so E (Xi -Xj )2 is an ultramet~c! If we had used different rates of evolution in the different links of the tree, so that the changes in X had variance a'2At for link i, then E (Xi -Xj )2 Womb be an evolutionary distance. Suppose that X is normal, and there are p independent sam- ples of X, namely, X1, x2 ,. . ., XP . (Here the number of objects is fixed, and the measurements are assumed sampled from an infinite population of possible measurements; it In require card fix standardization to achieve something like this in practice.) Then I.` (Xi -X~)2 ~ Its 2tijX 2 r=1 d (i, j ) ~ 42~2tij %p2 Let D (i, j ) = 42~2tijp, . an ultrametric. For large p, dfi,j) ~ D(i,j)[1+N(O,2/p)] This suggests fitting the ultrametuc D by minimizing ~ ID(i,i)-d(i,j)12/D2(iti) which downweights the large distances nicely, but probably not enough. We have to take note also of the high correlation between the large distances, arising from the high fraction of their paths 74

OCR for page 75
through the tree that they have in common. We can compute these, but the criterion to be minimized is then a complex qua- dratic in D -it. 75 -

Representative terms from entire chapter:

learning sample