Questions? Call 888-624-8373

PAPERBACK
list:$27.25
Web:$24.53
add to cart

Rights & Permissions

topleft topright

Discriminant Analysis and Clustering (1988)
Commission on Physical Sciences, Mathematics, and Applications (CPSMA)

Page
76
bottomleft bottomright
Page
76

Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 76
CHAPTER 4 SOF1`WAElE AND ALGOlU~EIM IMPIE:MENTATION 4.1 Induction This chapter provides a summary of available software and algorithms for disc~minant and cluster analyses While it is intended to be up to date, the current pace of statistical software evolution is such that some of the more recent developments may be inadvertently excluded Section 4 2, which focuses on discrinnnant analysis, was writ- ten by P A Lachenbruch Section 4 3 on cluster analysis was prepared by R A Blashfield The anal section, 4 4, which tallies about software needs, combines materials written by both Blashfield and Lachenbruch 4.2 Discr~minant Analysis i 4.2.1 Linear and Quadratic Discriminant Functions Many packages are available for performing linear discrim- nant analyses Fewer are available for quadratic discriminant analyses and only one (to my knowledge) is available for perform- ing density estimate discriminant analysis These are reviewed in the section on packages I have not attempted to cover programs which are not widely available in the United States BMDP, GLIM, IMSL, Minitab, P-STAT, SAS, SPSS-X, and Statgraphics are now available in versions for MS-DOS compatible microcom- puters In addition, several statistical systems developed specifically for microcomputers have appeared on the market SYSTAT, CRISP, GAUSS, and STATA are examples 76

OCR for page 77
The linear discriminant function is well implemented for most applications The numerical techniques are standard, the equations and algorithms (inversion, solution of a set of linear equations) have been tested thoroughly and accuracy is not a major concern Estimating e'Tors in discnminant analysis is gen- erally (lone by reclassifying the training sample If the sample sizes are sufficiently large (say 3 to 5 times the number of vari- ables in each group), this method is satisfactory and has an approximately binomial distribution One package offers the jack- knife (or leaving-one-out) method This has a smaller bias than the resubstitution method, but because of the correlation among the pseudo-observations has a larger variance This method should only be nsed for small samples when the danger of the optimist/c bias of the resubstitution method is substantial Plots of the linear discriminant variables are available in most packages Weighting of cases is possible in SAS and SPSS, and it is not clear from the manuals whether it is possible in BMDP and P-STAT None of the packages offer the option of proportional covariance matrices This intermediate step between the full quadratic func- tion and the linear function involves estimation of only one addi- tional parameter for each additional group, rather than the full covariance matrix for the added group, and may be a satisfactory · ~ compromise In many cases 4~2.2 Review of Packages P-STAT The P-STAT discriminant analysis procedure is similar to the BMDP7M program It is a backward stepwise procedure and allows from 10 to 40 groups depending on the P-STAT size No warning on the sample size requirements for the many groups case is given This may lead some naive users astray Also, the assumption of common covariance is not discussed The resubsti- tution method is available for estimating error rates There are three types of runs 77

OCR for page 78
3 Analyze and cIassify a data set. 2. Classify a known data set using previously generated func- tions. This can be used as a validation method by holding out a fraction of the data. Classify an unknown data set. Output data sets contain the original group, the assigned group, the posterior probability the observation belongs to its origi- nal group, and the posterior probability it belongs to its highest probability group. The program does not automatically step in the batch mode but is stepwise when run interactively. SPS~X The DISCRIMINANT procedure in SPSS-X allows one to use a fixed set of variables or to select variables in a forward manner. Removal of variables is possible, but backward stepping does not appear to be possible. Five criteria are possible to select variables. Inclusion levels are available to force variables into the discrim- inant function. For the multiple groups problem, the canonical discriminant functions are computed rather than the likelihood ratio Unctions which minimize the total (weighted) error rate. Cases with missing values are excluded. It is possible to select a subset of cases to analyze and then test the performance of the rule on the remainder of the cases. Plots may be obtained which map the two-dimensional space of canonical functions and show the classification boundary, an all-groups plot which plots each case, or a separate-groups plot. A variety of matrix operations are possible on the discriminant coefficients. 78

OCR for page 79
BMDP7M The BMDP7M Stepwise Discnminant Analysis program is the descendent of the oldest discriminant analysis packaged pro- gram. It offers forward and backward stepping, forcing levels for inclusion or exclusion of variables, and two criteria for variable entry. It is possible to specify prior probabilities. For estimating error rates, the resubstitution method and the jackknife method are available. Plots of canonical variables are given either by group or for any subset of groups. The error rates may be printed at any set of steps in the variable selection process. The size of the problem is a function of the number of variables, groups and cases. It is not clear if there is an upper limit on groups or variables. Quadratic discrimination does not appear to be available. SAS PROCEDURES SAS offers four discrin~nant procedures: DISCRIM, NEIGH- BOR, CANDISC, and STEPDISC. CANDISC performs a canonical discriminant analysis and provides output for other SAS pro- cedures for plotting or printing. A number of statistics ar avail- able. The DISCRIM procedure computes a linear or a quadratic discnminant Unction on a fixed set of variables. Prior probabili- ties may be specified. Classification may be done on the training sample or on a test sample. Stratified analyses may be performed using a BY statement. The NEIGHBOR procedure performs a nearest-neighbor cliscnminant analysis. Either the single nearest-neighbor or the k-nearest-neighbor rule may be used. Prior and posterior probabilities are printed and an error matrix is given. The STEPDISC procedure performs a stepwise discrim- inant analysis. It is similar to the BMDP7M program. The Wilks' lambda criterion is used to determine which variable enters or is removed. 79

OCR for page 80
Other Packages MINITAB (Ryan et al., 1982) has no discnminant analysis procedure, although it is possible to use a linear regression pro- gram to obtain the discriminant coefficients. After using the regression procedure, one could calculate the resubstitution esti- mator of error rates by the MULTiply and ADD commands. Other packages would be preferred for discriminant analysis. IMSL has two subroutines for linear discnminant analysis, ODFISH and ODNORM. In ODFISH, the canonical discriminant functions are calculated. In ODNORM, the multivariate normal discriminant functions are computed. These subroutines do not print output; this becomes the user's responsibility. There are also routines which will estimate a density function using the kerned method. The user must supply a kerned function. The subroutine computes density estimates at a set of points requested by the user. Printing is the user's responsibility. These routines are NDKE:R and NMPLE which estimate the density for a one dimen- sional problem. ALLOC (Hermans et al., 1982) is a program which computes allocation rules based on density estimation. It uses multivariate normal kernels with a diagonal covariance matnx. The smoothing parameter is estimated by the program. A subsequent modification allows the program to use variable kernels to obtain better estimates of densities. 4.2.3 Logistic Regression The major statistical packages all offer some form of logistic regression analysis. Additionally, there are a number of other pro- grams available to perform these computations. The method was originally suggested by Cornfield (1962) in connection with the Framingham studies. Walker and Duncan (1967) suggested a weighted least squares method of estimating the parameters which has been widely used. Day and Kerridge (1967) discussed several properties of the method. Nelder and Wedderburn (1972) derived the theory of generalized linear models which has been the basis 80

OCR for page 81
for much further important work. The program, GLIM (General- ized Linear Interactive Modeling), is an outgrowth of this work and Is easily used for fitting these models which include the logis- tic regression model. BMDP offers a logistic regression program based on a method developed by Jennrich and Moore (1975~. This is a stepwise pro- gram and uses iteratively reweighted least squares. Conditional logistic regression is possible for matched pairs analyses. SAS includes a procedure, LOGIST, in their supplementary programs which performs logistic regression by computing ma2~- imum likelihood estimates of the parameters. Stepwise variable selection is possible. SPSS does not have a separate logistic regression procedure. One can get estimates of the regression coefficients if the observa- tions can be analyzed using a categorical analysis program. Thus, continuous variables cannot be handled by SPSS. The program, GLIM, was developed by the- Numerical Algo- nthms Group (NAG), in conjunction with the Royal Statistical Society, to estimate parameters from the Nelder-Wedderburn models. Special cases of this model include logistic regression, log-linear categorical models, analysis of variance, and multiple regression. This program fits these models using maximum likeli- hood. A new program, PRISM, has recently been issued by NAG which includes all facilities of GLIM. A general criticism of these packages is that they over little in the way of diagnostic computations for the detection of influential observations. Work by Pregibon (1981) is now available and new revisions of these programs should include these results. 4.2.4 Classification Trees Recent work on classification trees was summarized briefly in § 2.2.7. Batch and interactive versions of the CART methodology are available through California Statistical Software, Lafayette, CA. 81

OCR for page 82
4.3 Cluster Analysis The amount and diversity of cluster analysis software has been surprisingly large for a statistical method with effectively only a twenty year history. New methods are produced continu- ally, and there appears to be no end in sight to the process in inno- vation. Probably hundreds of software packages and programs are available to perform cluster analysis, and it is likely that many researchers have written their own "home-grown" versions of popular algorithms. That so much clustering software has been written can be explained by two factors: (1) unlike many statisti- cal procedures, clustering algorithms, which are often no more than heuristics, are relatively easy to program on a computer; and (2) since most sciences have different goals, analytical needs and methodological requirements, many different clustering methods have been developed to exploit these needs. Clustenng software can be placed into four major categories: (1) collections of subroutines and algorithms, (2) general statistical packages which contain clustering methods, (3) cluster analysis packages, and (4) simple programs which perform one type of clus- tering (Blashfield, Aldenderfer, Morey, 19821. Since a comprehen- sive renew of clustering software is beyond the scope of this report, the focus shall be only upon those programs and packages which are widely available. 4.3.1 Collections of Subroutines and Algorithms Three major collections of software are available today in this category; books by Anderberg (1973), Hartigan (19751, and Jamb u and Lebeaux (1983) plus the International Mathematical and Sta- tistical Library (IMSL, 1977~. Although much of this software is fairly sophisticated, it requires the user to supply all job control language of the computing system to link and subsequently run the routines. As a result, these programs are not very "user- ffiendly". The user must be familiar with the local control language as well as FORTRAN in order to be able to get the pro- grams running. In general the level of user support for these 82

OCR for page 83
routines is low; Hartigan's algorithms are described in a separate user's manual (DalIal, 1975), whereas Anderberg's algorithms are only documented in his book. While the IMSL clustering aigo- nthms are embedded within the documentation of the entire col lection of IMSL subroutines, this does not necessarily make them any easier to use. The FORTRAN programs in the Jambu and Lebea~ book (1983) are quite extensive and represent a consider- able effort by these two French writers. Like the routines in Harti- gan (1975), the algorithms in Jambu and Lebeaux are unique. Despite the breadth of methods available, algorithms in this category are not recommended for use by the novice unless exten- sive guidance is available. Statistical Packages Containing Clustering Software The most convenient cluster analysis available for general use is that contained within popular packages of statistical pro- grams such as BMDP (Dixon, 1981), SAS (SAS Institute, 1985), and SPSS-X (SPSS, 1986~. The philosophy of these packages is weH known; they provide non-programmers with relatively easy access to sophisticated statistical methods for a wide variety of research problems. The packages provide an "umbrella" of support for the user in that they use a consistent control language that communicates the needs of the user to the computing system with a minimum of effort. These packages also contain a full range of data screening and manipulation methods which help to make complex analyses simple and feasible. If the package contains the method of interest to the user, the advantages of using existing statistical packages are substantial. Until recently, the range of clustering options contained i most statistical packages has been severely limited. For instance, before 1980, SAS contained only one clustering method and SPSS had no clustering methods. However, this state of affairs has changed dramatically. Since 1979, BMDP has developed four pro- cedures devoted to cluster analysis: (1) a collection of single, com- plete and average linkage to cluster variables; (2) an average link- age (centroid sorting) method to cluster cases; (3) a block cluster- ing method (Hartigan, 1975) to simultaneously cluster cases and . 83

OCR for page 84
variables; and (4) an iterative k-means method which fonns parti- tions among the cases. The BMDP procedures are well annotated, have clear output and are relatively easy to use. The most serious limitations of this package are the limited range of hierarchical agglomerative methods, especially for clustering cases, and the choice of only a single similarity measure, Euclidean distance. Earlier versions of the second statistical package, SAS, con- tained one method of cluster analysis -- complete linkage. How- ever, a recent release of this package, SAS 1985, includes substan- tial additions. This version of SAS contains Ward's single linkage, complete linkage, average linkage plus seven other hierarchical agglomerative methods. Euclidean distance is still the only simi- lanty measure offered. In procedure I?ASTCLUS, a k-means method (Anderberg~s centroid sorting method) has been added, and finally, a factor analysis-type variable clustering method has been included (procedure VARCLUS). The diagnostics of the package has been expanded. In addition, the output provides a "Teat deal of information about the solutions. A major limitation, however, is that SAS continues to use "sly line" plots to represent hierarchical trees; these plots are difficult to use with large data sets. Of con- siderable interest in SAS is the inclusion of a new statistic, cubic clustering criterion, for the determination of the number of clus- ters. The 1986 version of SPSS-X contains two major clustering procedures: CLUSTER and QUICK CLUSTER. The former emphasizes hierarchical agglomerative methods including seven of the most commonly used techniques (single linkage, complete link- age, average linkage, Ward's method, etc.~. There are six distance measures and three types of plots available. The second pro- cedures, QUICK CLUSTER, uses a k-means method with limited options for starting partitions. Interesting aspects of this pro- cedure are provisions for missing data and the ability to handBe extremely large data sets. 84

OCR for page 85
4.3.2 Cluster Analysis Packages For the sophisticated and senous user of cluster analysis, cluster analysis packages represent the ultimate in flexibility and user convenience. These packages combine the advantages of gen- eral statistical packages, such as an integrated control language and data screerung and manipulation procedures, with features of interest to users of cluster analysis, such as a diversity of cluster- ing methods, special diagnostic features, and enhanced graphics. Of the greatest importance is that many of the packages contain hard-to-find clustering methods or analytical procedures which are appropriate for special problems. The first of these packages in NT-SYS which is and has been important because it adopts the terminology and methodology inherent in the most fiequently cited book on cluster analysis, Sneath and Sokal (1973~. This package has undergone numerous revisions and updates in its 15 year existence. Moreover, there now exists a micro-computer version, called NTSYS-pc which con- tains the standard hierarchical agglomerative methods, graph theoretic methods, and an eigenvector routine. This version can handle similarity matrices up to 400 x 400 plus it contains three graphics programs. The most versatile of the clustering packages is CLUSTAN. I]ke BMDP, SAS and SPSS-X, CLUSTAN contains procedures for hierarchical agglomerative and iterative partitioning methods. However, CLUSTAN also contains a number of other procedures including NORMIX to decompose multivariate normal mixtures Wolfe, 1971~; INVARIANT, which uses partitioning methods to optimize MANOVA statistics; DENDRITE, which is a minimum spanning tree method, plus others. In addition, CLUSTAN has cluster diagnostic and validation aids, including the procedures called RULES and COMPARE, which implement the stopping rules of Mojena (1977) and the cophenetic correlation coefficient of Mojena and Wishart (1980~. A fatal of 38 similarity measures are contained in procedure CORREL, and the package contains a util- ity procedure which permits the user to define any type of similar- ity coefficient (DEFINE). Other important utilities are those which prepare a number of cluster diagnostics or which produce a 85

OCR for page 86
wide variety of graphical output. The novice user of CLUSTAN should be aware that although this package contains a large number of methods, the package contains little guidance on which methods may be most appropriate for what types of data sets. There are three other packages which are devoted to cluster analysis. CLUS (Friedman and Rabin, 1967) is an old program which used a powerful set of iterative partitioning methods. A more modern version of CLUS is the procedure INVARIANT in CLUSTAN. Another large package is BC-TRY. Like CLUS, this program was written in the 1960's and contained the innovative ideas of Tryon who was one of the earliest writers about cluster analysis Rayon, 19391. Currently this program is being revised for re-distribution. Finally, a recent clustering package for use on m~cro-computers has been developed called MICRO-CLUSTER (Edmonston, 1985~. This package contains seven hierarchical agglomerative methods and an iterative partitioning method. 4.3.3 Simple Cluster Analysis Programs: Simple cluster analysis programs are just that: simple. These are programs written primarily in FORTRAN, and they implement one or two cluster analysis methods. In some ways, they strongly resemble the subroutine of the first category defined above, in that they require the user to be fully competent in the job control language of the computing system as well as the language in which the program is written. In general, these programs have no or few aids for checking programming errors, are poorly docu- mented, and pronde limited output information. These programs are important, however, because they have often been used within particular scientific areas, or they have been used for the basis for the algorithms presented in major packages such as SAS, IMSL, and OSIRIS. Some of the more popular of these simple programs are HGROUP, a method which implements Ward's minimwn vari- ance method (Veldman, 1967~; JCLUST, which implements single and complete linkage as discussed in the influential article by Johnson (1967~; and ISODATA, a flexible iterative partitioning method which has used extensively in engineering (Hall and Khanna, 19771. 86

OCR for page 87
Another category of cluster analysis programs consists of those that handle large data sets (N is greater than 500~. Unfor- {unately most clustering routines statistical packages are some- what limited in the number of cases which can be analyzed at one time. Epically, most have a practical upper limit of 200 cases. In response to this problem, a number of authors have extended the capabilEties of popular hierarchical agglomerative and iterative partitioning methods to deal with very large data sets. Among the most important of these is Sibson's (1973) single linkage algorithm (SLINKS. Note: SLINK is now incorporated within CLUSTAN 2.~.; CLUSTER (Levisohn and Funk, 1974) and QUICLUSTR (Bell and Korey, 1975) which implement Ward's methods; and programs by Defays (1977) and Rohlf (1977) for complete linkage clustering. Rohlf (1982) presented a number of different algorithms for single linkage that could be useful for large data sets. Lennington and Rossbach (1978) have developed CLASSY, an iterative partitioning method based upon the logic of ISODATA, for use with the very large data sets obtained LANDSAT satellite research. 4.4 Need`; For parametric (multivanate normal) discriminant problems, relatively little is needed. A variety of programs are available which over flexibility of use, adequate error rate estimation, and many variable selection options. A general shortcoming is advice on usage. For many users, the only place they will learn about dis~minant analysis is in the user's manual of a computer pack- age. Some discussion of the limitations (e.g., if you have many groups, you need many observations) and robustness (e.g., transform your data if a variable is badly skewed) is needed. Some of the programs seem to have the attitude, if it can be pro- grammed, include it. For example, in one program up to 40 groups can be included in discnminant analysis. A user with that many groups has probably not thought sufficiently about the problem. (Nevertheless, there are some problems, such as speaker recogni- tion, where the only interesting situation is having many groups.) 87

OCR for page 88
The quadratic discnrninant function, available only in SAS, has some senous robustness problems. These should be noted. The ability to enter previously coefficients or a set of means and covanances is useful. It is valuable to enter a simplified set of coefficients play integers) and compare the performance of the rue to the optimal rule. Other than the IMSL routines for unidimensional problems and ALLOC, no package has any programs for density estimation. This is a useful procedure, especially when distributions are rather far from normal. Nearest neighbor procedures, which are related to non-parametric density estimates, are available in SAS in the Neighbor Procedure. Concerning cluster analysis programs, the inclusion of k- means and hierarchical agglomerative methods in the SPSS and SAS packages have helped standardize the clustering methods that are used in applied research. The SAS manual is particularly helpful concerning the use of these techniques since it provides a skeptical perspective and references some of the best articles in the field. Nonetheless, none of the packages is successfi~1 in pro- nding sufficient cautions and indications of the practical problems that are of serious concern to new users of these procedures (e.g., - defines on the choice of methods, the number of cluster prob- lem, the issue of outliers, the choice of the si~nilanty measure). The preceding discussion has focused largely on mainframe and mini-computers since most users of these procedures have access to such computing equipment at the present time. Several of the programs available on microcomputers offer discriminant analysis routines. BMDP, SAS, P-STAT, and SPSS-PC offer discriminant analysis through the usual programs. SYSTAT pro- vides a discriminant analysis facility by using the module MGLH. Other programs offer regression capabilities which give an equivalent analysis, although not tailored exactly to the purposes of allocation. Development of graphical methods for allocation and their computer implementation is needed. The mainframe packages usually offer plots of the sample discriminant variables which often is adequate to determine differences among groups. 88

OCR for page 89
However, these variables are linear combinations of the observed variables and are not always easily interpretable. A "simple" exploratory graphical program would be welcome. Such a program would be interactive, with very good graphics (i.e., much better than the usual transcription of a page of text to graphical sym- bols). Graphic clustering procedures have also been neglected in generally available programs. There are no interactive programs for allocation or clustering that are generally available. Such a program would allow the user to specify the variables to be included in the allocation rule, to specify the form of the Me (e.g., linear, quadratic, tree structure for discrimination), to enter new variables or delete old ones, and to detect influential observations. Regression diagnostics have become increasingly important in statistical practice, but little in the way of diagnostics is avail- able for allocation rules. In a sense, the regression diagnostics suffice for classical linear discriminant theory, and Pregibon's work has large application in logistic regression. See also Landwehr, Pregibon and Shoemaker (1984) and Fowllres (19871. Diagnostic procedures are generally lacking in cluster analysis. However, this lack is primarily due to the problems in developing an adequate statistical theory for clustering rather than reflecting a programming deficiency. Nevertheless, a few procedures have been developed and appear to be useful; see § 2.3.3. 89

Representative terms from entire chapter:

logistic regression