(Sackler NAS Colloquium) Mapping Knowledge Domains (2004)
Proceedings of the National Academy of Sciences (PNAS)

Colloquium

User-controlled mapping of significant literatures

Howard D. White*, Xia Lin, Jan W. Buzydlowski, and Chaomei Chen

College of Information Science and Technology, Drexel University, Philadelphia, PA 19104

We apply a version of our web-based literature-mapping system to PNAS for 1971-2002, as indexed by the National Library of Medicine and the Institute for Scientific Information. Given a single input term from a user (a medical subject heading, a cocited author, or a cocited journal), PNASLINK rapidly displays views in which that term and the other 24 terms that most frequently co-occur with it in a bibliographic database are interrelated in ways suggesting fruitful combinations for document retrieval. The interrelationships are produced by two algorithms, pathfinder networks and Kohonen-style self-organizing maps. PNASLINK displays are themselves interactive interfaces that can retrieve documents from digital libraries (e.g., PNAS Online). This style of visualizing knowledge domains is called "localized" because it does not attempt to map the indexing of literatures in full but concentrates on the top terms in an "associative thesaurus" reflecting user interests. It also permits swift remappings, as the user recognizes terms worth pursuing. PNASLINK is illustrated with maps drawn from the literature of population genetics. Some comparative and evaluative comments are added, including one from a domain expert indicating that the face validity of the system may be tempered by insufficient specificity in the indexing terms being mapped.

We here present two ways of rapidly mapping literatures in terms of selected indexing vocabularies. Both ways are responsive to users, and either can serve as an interface for retrieval of documents from digital libraries. Either can also complement a work that focuses on the structure of a literature, such as a research review (1).
Our data are the contents of PNAS for 1971-2002, as described by medical subject headings from the National Library of Medicine (NLM) and by citation indexing from the Institute for Scientific Information (ISI).† Indexing by these organizations typifies the bibliographic control that is extended only to significant, that is, highly valued, literatures. Our software, called PNASLINK (available at http://project.cis.drexel.edu/pnas), is designed to amplify such control by enabling customized browsing on the basis of user input.

Both of our mapping techniques exploit co-occurrences of terms in NLM and ISI bibliographic records. The terms are systematically paired, and their co-occurrences are counted in matrices. Because people can easily assimilate numeric matrices only when they are recast as pictures of some kind (2), one of our techniques transforms the counts into Kohonen-style self-organizing maps (SOMs) (3), and the other transforms them into pathfinder networks (PFNETs) (4). SOMs show frequently co-occurring terms as nodes that are spatially close. PFNETs show them as nodes with explicit ties. The two kinds of maps will be exemplified here with medical subject headings (MeSH) and cocited authors in a specialty of genetics.

Other researchers have visualized bibliographic data with PFNETs and SOMs (2, 5), but we use them to map significant literatures in real time with retrieval capabilities built into the maps (refs. 6 and 7 and cf. ref. 8). The data are initially processed by our NOAH indexing engine, a specialized database application we designed for fast computations with verbal co-occurrence data (9). With NOAH, mapping time is determined by the size of the indexing vocabulary, not by the number of documents in the database. In CONCEPTLINK, a predecessor of PNASLINK, for example, we can almost instantly create maps of MeSH terms from >12 million MEDLINE records.

www.pnas.org/cgi/doi/10.1073/pnas.0307630100
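The pairing and counting step described above can be illustrated with a short Python sketch. It is not the NOAH engine; the function name `cooccurrence_counts` and the three toy records are hypothetical, and each record is assumed to be simply the list of indexing terms assigned to one document.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count, for every unordered pair of terms, the records that carry both."""
    counts = Counter()
    for terms in records:
        # sorted(set(...)) ignores duplicate terms within a record and gives
        # each pair a single canonical (alphabetical) order.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


# Toy records (hypothetical): each inner list is the indexing of one document.
records = [
    ["Alleles", "Gene Frequency", "Mutation"],
    ["Alleles", "Gene Frequency"],
    ["Evolution", "Mutation"],
]
counts = cooccurrence_counts(records)
print(counts[("Alleles", "Gene Frequency")])  # 2
```

Storing only pairs that actually occur keeps the structure sparse, which is one plausible reason mapping time could depend on vocabulary size rather than on the number of documents once the counts exist.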
(CONCEPTLINK maps the co-occurring MeSH indexing of the journals in NLM's PubMed. It is available at http://project.cis.drexel.edu/conceptlink.) Once the data are indexed, the user can map and manipulate them through a unified web interface.

The maps (Figs. 1 and 2) are based on term counts solely from PNAS records because we were set that task as participants in this colloquium. Elsewhere, we have mapped terms drawn from the NLM and ISI databases in full.‡ However, even by itself PNAS is a major interdisciplinary resource, and we can easily imagine PNAS Online or other journal-specific web sites offering domain visualizations like ours for the benefit of users.

In refs. 6 and 7, we discussed AUTHORLINK (http://project.cis.drexel.edu/authorlink), the version of our software that is used to map cocited authors from ISI's Arts and Humanities Citation Index for 1988-1997. The present article attests to the quick adaptability of the CONCEPTLINK/AUTHORLINK software to the authors, journals, and MeSH terms in the PNAS data. This in turn prompts us to offer a rationale for our general approach, the localized mapping of associative thesauri, along with different accounts of the two main algorithms. We describe certain interactive features of our system and conclude with some fresh comparative and evaluative data, including an expert's commentary on PNASLINK as applied to the domain of population genetics.

Localized vs. Global Mapping

In their extensive review, Börner et al. (5) emphasize that "painting a big picture" is a main goal in domain mapping. This may lead to a strategy of mapping very large co-occurrence matrices in their entirety. Indeed, system designers have made many significant developments in software for such global portrayals of literatures, e.g., THEMESCAPE and VXINSIGHT render literatures as landscapes; GALAXIES and STARRYNIGHT render them as astral bodies (10-12).
Ours, however, is an alternative way of visualizing knowledge domains, the localized mapping. Perhaps the chief difference is that the localized approach relinquishes scope to increase the user's control of the mapping process.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "Mapping Knowledge Domains," held May 9-11, 2003, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.

Abbreviations: NLM, National Library of Medicine; ISI, Institute for Scientific Information; SOM, self-organizing map; PFNET, pathfinder network; MeSH, medical subject headings.

*To whom correspondence should be addressed. E-mail: whitehd@drexel.edu.

†These data are extracted from Science Citation Index Expanded [Institute for Scientific Information, Inc. (ISI), Philadelphia, PA; Copyright ISI]. All rights reserved. No portion of these data may be reproduced or transmitted in any form or by any means without the prior written permission of ISI.

‡For restricted access to mapping of ISI's full databases, contact X.L. at xlin@drexel.edu or H.D.W.

© 2004 by The National Academy of Sciences of the USA

PNAS | April 6, 2004 | vol. 101 | suppl. 1 | 5297-5302

Fig. 1. (A) PFNET of Gene Frequency. (B) SOM of Gene Frequency.

Table 1 helps to sharpen this comparison. In global mapping, system designers present the user with a preformed view, often in 3D, of some sizeable literature. Within the panel of visualization, landscapes invite flyovers; star-fields or other constructs invite flythroughs. In the former, peaks representing major accretions of documents on some subject are likely to exert a powerful pull on the user; in the latter, document points coded as important, e.g., by differences in shape, size, or color, exert a similar pull. Essentially, the user is engaged in old-fashioned browsing, as of book titles in library stacks, but system designers may minimize or even eliminate labeling of objects in the map because labels clutter precious screen space and block the metaphorical presentation (see examples in ref. 12). The user explores the view by "visiting" or "homing in on" objects of interest, rather as in video games, but typically cannot remap the literature in pursuit of some new interest because a new map takes hours of computer time to create.
In contrast, our localized system of mapping more closely resembles online searching. The user starts the process by entering a single term at a web interface. This is consistent with the way most people search the web (13) and is intended to minimize cognitive demands on users. It is true that PNASLINK must be entered with MeSH or ISI-style terms instead of whatever word pops into the user's head, but our system includes guides that help one make the proper entries.

The system responds to the entry (or "seed") term by forming a list of the terms that co-occur with it, ranked high to low by frequency. The seed term and its 24 next-highest neighbors are then exhibited as a PFNET or a SOM, and the user can switch between the two. Each of the two modes of mapping in PNASLINK yields different insights into the relations of the indexing terms. Both modes place the user's seed term in the locale of a limited number of other terms that are guaranteed to co-occur with it, thus customizing browsing. Any of these terms, if selected by the user, will be automatically "ANDed" with the seed term in retrieving documents.

Fig. 2. (A) PFNET of Montgomery Slatkin. (B) SOM of Montgomery Slatkin.

Mapping only 25 terms at a time is an arbitrary design decision with several advantages. It allows PNASLINK to make maps on the fly in seconds.
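The seed-term step described above (rank the co-occurring terms, keep the top 24, AND a selected term with the seed) can be sketched in Python as follows. The function names, the toy records, and the query string syntax are all hypothetical illustrations; the actual query format sent to PNAS Online is not given in the text.

```python
from collections import Counter


def top_neighbors(records, seed, k=24):
    """Return the k terms that most often share a record with the seed term."""
    tally = Counter()
    for terms in records:
        unique = set(terms)
        if seed in unique:
            tally.update(unique - {seed})
    # Rank high to low by count; break ties alphabetically so output is stable.
    ranked = sorted(tally.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


def and_query(seed, selected):
    """Join the seed and any selected map terms with AND (assumed syntax)."""
    return " AND ".join([seed, *selected])


records = [
    ["Alleles", "Gene Frequency", "Mutation"],
    ["Alleles", "Gene Frequency"],
    ["Evolution", "Mutation"],
]
neighbors = top_neighbors(records, "Gene Frequency", k=2)
print(neighbors)  # [('Alleles', 2), ('Mutation', 1)]
print(and_query("Gene Frequency", ["Mathematics"]))  # Gene Frequency AND Mathematics
```

Because every neighbor is drawn from records that contain the seed, any term the user clicks is guaranteed to retrieve at least one document when ANDed with it.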
It affords the node labels, the indexing terms, enough room that they have little or no overlap, thus making them and their interrelationships the primary features of the display. It gives the user a rich, but not overwhelming, array of associations to work with. Finally, because of its speed, it permits users to create new maps on the basis of single or combined terms from an old map. Thus, instead of visiting different places in a global visualization, one moves locally from interest to interest by point-and-click remapping (which accords with Hearst's point in ref. 10 that an interactive system should let users change their search strategies as their goals change). One also moves by recognizing terms of interest rather than by having to guess them or look them up in a thesaurus.

Table 1. Contrasting styles of literature mapping

Global                            Localized
Designer initiated                User initiated
Full matrix mapped                Small subset of matrix mapped
Map created before inquiry        Map created by inquiry
Creation time: hours              Creation time: seconds
Labeling deemphasized             Labeling emphasized
Explore by relocating on map      Explore by generating new map
Surface objects not manipulable   Surface objects manipulable
User is visitor                   User is wielder

The Associative Thesaurus

If the indexing terms used in the mapping are indeed controlled by a formal thesaurus, our SOMs and PFNETs provide an alternative: they display the top listings in what is sometimes called a term's associative thesaurus (2). Formal thesauri are published in hard and soft copy; associative thesauri are created ad hoc within search software. A formal thesaurus, such as NLM's MeSH, brings out a term's standard linguistic features, e.g., its definition, synonyms, hypernyms, and hyponyms. In contrast, an associative thesaurus shows what terms co-occur with it when it has been used to index actual publications (cf. ref. 14). In NLM's MeSH, for example, the term Anthrax is related to Bacillus anthracis and subordinated to Bacterial Infections and Mycoses. But if Anthrax is mapped in our system, which covers the biomedical literature through 2002, its top co-occurring terms include Postal Service, a connection obviously never to be part of its entry in MeSH. Associative thesauri are shaped by historical contingencies, by what is being written about. That is why they may be useful for online retrieval in ways that formal thesauri are not.

Not all indexing uses subject headings and formal thesauri, of course. ISI's indexing, for example, allows searchers to retrieve the items that cite a given author. From that capability, online searchers with the right software can move to retrieving items that cite pairs of authors jointly. To people literate in a domain, frequently cocited authors may suggest nuances of meaning that are absent in standard subject indexing (for example, articles that cite both Derek de Solla Price and Diana Crane may bear on "invisible colleges" in science even if that phrase does not appear in their bibliographic records).
A map of cocited authors is, in effect, an associative thesaurus of authors linked by conjoint use of their works. Again, these linkages may permit useful retrievals that are not otherwise possible (1).

Additional Capabilities

PNASLINK can produce maps not only of associated MeSH terms and cocited authors but also of cocited journals. That is, if a user supplies the name of a seed journal, such as Gut or Cell, PNASLINK maps the top 24 journals cocited with it in PNAS. Journal maps are most likely to be of interest to professional literature managers, such as serials librarians, whereas maps of MeSH and cocited authors are intended more for users in general.

Guided by our emphasis on user control, we have implemented several interactive functions for PNASLINK. For example, the system lets the user regenerate the maps after removing some terms. This is helpful in journal mapping, when one may want to eliminate omnibus journals like Science and Nature from a map to focus on more specialized titles.

PNASLINK also has alternate data models to show term relationships from different perspectives. By default, the seed term is used to generate 24 other terms, but then the counts for these pairs are obtained without reference to their counts with the seed term. However, if the user chooses the "tri-citation" option, the seed term is always required to be present with the other two terms, and the maps are accordingly different.

Throughout the interaction process, the user can directly retrieve documents by subject through PNAS Online. Every time the user clicks on a MeSH term, it is added to a query list. When the user clicks on the find button, a separate window opens to show the documents retrieved from PNAS Online by the terms in the query list. The maps are thus a "live" interface that allows the user to interact with terms to see what documents they yield.
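The difference between the default data model and the tri-citation option described above can be shown with a small Python sketch. The function `pair_counts`, its `seed` parameter, and the three toy citing records are hypothetical; the sketch only assumes that each record lists the authors cited together in one article.

```python
from collections import Counter
from itertools import combinations


def pair_counts(records, terms, seed=None):
    """Count co-occurrences among `terms`.

    With seed=None (default model), every record is counted.
    With a seed (tri-citation model), only records that also contain
    the seed contribute to a pair's count.
    """
    wanted = set(terms)
    counts = Counter()
    for record in records:
        present = set(record)
        if seed is not None and seed not in present:
            continue  # tri-citation: the seed must accompany the pair
        for pair in combinations(sorted(present & wanted), 2):
            counts[pair] += 1
    return counts


# Toy citing records (hypothetical): authors cited together by one article.
records = [
    ["SLATKIN-M", "NEI-M", "SOKAL-RR"],
    ["NEI-M", "SOKAL-RR"],
    ["SLATKIN-M", "NEI-M"],
]
others = ["NEI-M", "SOKAL-RR"]
default_counts = pair_counts(records, others)
tri_counts = pair_counts(records, others, seed="SLATKIN-M")
print(default_counts[("NEI-M", "SOKAL-RR")])  # 2
print(tri_counts[("NEI-M", "SOKAL-RR")])      # 1
```

The tri-citation count can never exceed the default count for the same pair, which is why the two options can yield differently shaped maps from the same 25 terms.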
(PNAS Online lacks ISI-type indexing, which prevents the cocited-author retrieval possible in, e.g., our AUTHORLINK system.)

Two Modes of Mapping

PFNETs and SOMs are dimension-reduction techniques that have been used to visualize the structure of literatures for more than a decade. In the context of the movement joining bibliometrics with document retrieval (2, 5, 10), PFNETs have been described by Fowler and colleagues (15-17), McGreevy (18), and Chen (19, 20). Analogous accounts of SOMs have been done by Lin et al. (21), Roussinov and Chen (22), and Chen et al. (23).

PFNETs. Characterizing PFNETs, Börner et al. (5) write, "Pathfinder algorithms take estimates of the proximities between pairs of items as input and define a network representation of the items that preserves only the most important links." Our input is pairs of terms, and the pairs are linked as output only if their co-occurrence counts are the highest (or tied-highest) in their respective vectors. By emphasizing only the most prominent links, PFNETs reduce the user's cognitive load in interpreting the most important relationships depicted in the map. These relationships in particular are highlighted as potentially fruitful for retrievals.

PFNETs were developed to portray the results of studies in which subjects' judgments of the closest semantic items were represented by the lowest weights. That is, the algorithm selects the lowest-weight (also called minimum-distance or minimum-cost) paths to render the most salient ties. However, in our matrices the closest connections are signaled by the highest co-occurrence counts. The counts therefore require a transformation (subtraction from a constant) to convert them to a distance measure before PFNETs are actually plotted. In PFNETs, nodes represent terms, and the importance of links between them is measured by path weights, computed from term co-occurrence counts.
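The count-to-distance transformation mentioned above can be sketched as follows. The text says only that counts are subtracted from a constant; the choice of constant here (one more than the largest off-diagonal count, so that every distance stays positive) is an assumption, as are the function name and the toy matrix.

```python
def counts_to_distances(counts):
    """Turn a symmetric co-occurrence matrix into distances.

    Each off-diagonal count is subtracted from a constant so that the
    most frequently co-occurring pair gets the smallest distance.
    The constant (largest count + 1) is an assumed choice.
    """
    n = len(counts)
    ceiling = 1 + max(counts[i][j] for i in range(n) for j in range(n) if i != j)
    return [
        [0 if i == j else ceiling - counts[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy 3-term matrix (hypothetical counts); the diagonal is unused.
counts = [
    [0, 87, 10],
    [87, 0, 25],
    [10, 25, 0],
]
distances = counts_to_distances(counts)
print(distances)  # [[0, 1, 78], [1, 0, 63], [78, 63, 0]]
```

Because subtraction from a constant reverses the rank order of the counts without changing their ties, any constant larger than the maximum count gives the same ordering of distances.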
The PFNET algorithm compares these weights over both direct (one-link) and indirect (multilink) paths between nodes. It retains just those links that constitute minimum-weight paths. Such paths are required not to violate the triangle inequality $d(a,c) \le d(a,b) + d(b,c)$, where $d$ is the distance between any two of the points $a$, $b$, and $c$. These paths will be direct unless an indirect path is computed to be shorter.

The number of links in a PFNET is controlled by two parameters, r and q. These are set in our software so as to produce the sparsest possible network, which occurs when r equals infinity and q equals n - 1, where n is the number of nodes in the matrix.

The parameter r, which determines how path weights are computed, is lucidly explained by Fowler et al. (17): "Path weight, r, is computed according to the Minkowski r-metric. It is the rth root of the sum of each distance raised to the rth power for all links in a path between two nodes. Although the r-metric is continuously variable, simple interpretations exist only for r = 1 (path weight is the sum of the link weights in the path), r = 2 (path weight is the Euclidean distance), and r = infinity (path weight equals the maximum link weight in the path). One advantage of r = infinity is that one need only assume that the original distance estimates have ordinal properties. Another advantage is that the link structure will be preserved for any monotonic transformation of the data."§

The parameter q sets the maximum number of links in the paths that are examined in the test of the triangle inequality (24); a direct link is removed if it violates the inequality. The larger the value of q, the more extensive the triangle inequality constraint; therefore, a direct link is more likely to be found in violation of the rule and dropped. If q is one less than the number of nodes, then all of the potential violators are under scrutiny.
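As an illustration of the r and q settings described above, the following Python sketch computes a Minkowski path weight and prunes a small distance matrix to a PFNET with r = infinity and q = n - 1. It uses a Floyd-Warshall-style pass to find, for every pair of nodes, the smallest achievable maximum link weight over paths of any length, and keeps a direct link only if no indirect path beats it. This is a commonly used way to obtain such a network and is offered as an assumption, not as a description of how PNASLINK is implemented; the function names and the toy matrix are hypothetical.

```python
import math


def path_weight(link_weights, r):
    """Minkowski r-metric weight of a path given its link weights."""
    if math.isinf(r):
        return max(link_weights)  # r = infinity: the largest link weight
    return sum(w ** r for w in link_weights) ** (1.0 / r)


def pfnet_inf(dist):
    """Links kept by a pathfinder network with r = infinity, q = n - 1.

    dist is a symmetric matrix of positive distances with a zero diagonal.
    """
    n = len(dist)
    best = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Weight of the path i -> k -> j under r = infinity.
                via_k = max(best[i][k], best[k][j])
                if via_k < best[i][j]:
                    best[i][j] = via_k
    # Keep a direct link only if no indirect path has a smaller weight.
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if dist[i][j] <= best[i][j]
    ]


dist = [
    [0, 1, 78],
    [1, 0, 63],
    [78, 63, 0],
]
links = pfnet_inf(dist)
print(links)                          # [(0, 1), (1, 2)]
print(path_weight([1, 63], math.inf))  # 63
print(path_weight([3, 4], 2))          # 5.0
```

In the toy matrix, the direct link between nodes 0 and 2 (weight 78) is dropped because the indirect path through node 1 has weight max(1, 63) = 63, which is smaller.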
The settings r = infinity and q = n - 1 are widely used in pathfinder research because they tend to produce networks that are highly intelligible simplifications of the data. An algorithm called a spring embedder (25) is used to enhance the maps by minimizing unsightly features such as crossed links and overlapping nodes. The finished map appears almost instantly once a seed term is entered.

§Quoted with permission from ref. 17.

Table 2. Top terms associated with gene frequency in two databases

Common to PNAS and PubMed           Unique to PNAS            Unique to PubMed
Alleles                             Drosophila                Apolipoproteins E
Chromosome mapping                  Drosophila melanogaster   Caucasoid race
DNA                                 Evolution                 Ethnic groups
Gene frequency                      Genes, structural         Genes, MHC class II
Genetics, population                Genotype                  Genetic markers
Haplotypes                          Heterozygote              HLA antigens
Models, genetic                     Linkage, genetics         HLA-DQ antigens
Mutation                            Mathematics               HLA-DR antigens
Polymerase chain reaction           Models, biological        Microsatellite repeats
Polymorphism (genetics)             Phenotype                 Minisatellite repeats
Repetitive sequences, nucleic acid  Probability               Mongoloid race
Selection (genetics)                Recombination, genetic    Tandem repeat sequences
Variation (genetics)

SOMs. Unlike PFNETs, which explicitly join highly related terms, SOMs render semantic relationships through a distance metaphor. The more frequently co-occurring terms, which presumably have greater mutual relevance, occupy more proximate regions on the map. SOMs are designed to render not just the highest co-occurrence counts between terms, but rather relatively high co-occurrences across groups of terms. They are a softer-focus kind of mapping than PFNETs, but they, too, suggest specific combinations of terms on which the user might want to base retrievals.

The PNASLINK algorithm extracts the proximity relations of data in 25 dimensions, one for each of the input terms paired with all others, and seeks to preserve them as closely as possible in 2D. This process of self-organization (also known as unsupervised learning) runs over many iterative cycles.
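A minimal Python sketch of this kind of self-organization is given here, using the 8-by-8 output grid and 2,500 training cycles that the text goes on to specify. Everything else is an assumption made for illustration: the function names, the linearly decaying learning rate and neighborhood radius, the Gaussian neighborhood weighting, and the tiny 4-dimensional toy input standing in for the 25-by-25 co-occurrence matrix.

```python
import math
import random


def best_node(weights, row):
    """Grid position whose weight vector is closest (squared Euclidean) to row."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], row)),
    )


def train_som(rows, grid=8, cycles=2500, seed=0):
    """Train a grid x grid Kohonen-style map on the given input rows."""
    rng = random.Random(seed)
    dim = len(rows[0])
    # One weight vector per output node, initialised to small random numbers.
    weights = {
        (x, y): [rng.uniform(0.0, 0.1) for _ in range(dim)]
        for x in range(grid)
        for y in range(grid)
    }
    for t in range(cycles):
        remaining = 1.0 - t / cycles
        rate = 0.5 * remaining                      # assumed decay schedule
        radius = 1.0 + (grid / 2.0) * remaining     # assumed decay schedule
        row = rng.choice(rows)                      # randomly selected input row
        wx, wy = best_node(weights, row)            # the winning output node
        for (x, y), vector in weights.items():
            d2 = (x - wx) ** 2 + (y - wy) ** 2
            if d2 <= radius ** 2:
                # Winner and nearby nodes are pulled toward the input row.
                pull = rate * math.exp(-d2 / (2.0 * radius ** 2))
                for i in range(dim):
                    vector[i] += pull * (row[i] - vector[i])
    return weights


# Toy input (hypothetical): four terms forming two similar pairs.
rows = [
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
weights = train_som(rows, grid=8, cycles=2500)
for label, row in zip("ABCD", rows):
    print(label, best_node(weights, row))
```

After training, each term can be placed at the grid position of its best-matching node; terms with similar co-occurrence profiles would be expected to land in the same or neighboring regions.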
In each iteration, the images of term pairs that are strongly related in the high-dimensional space will be moved closer on the lower-dimensional space until stability is reached.

More specifically, the 2D grid of PNASLINK consists of 64 output nodes distributed in an 8-by-8 pattern. Each output node corresponds to a vector of 25 weights that are initially set as small random numbers. Each is also connected to 25 input nodes, and the latter correspond to vectors in the 25-by-25 matrix comprising all possible pairs of a seed term and the 24 terms most frequently co-occurring with it. [There are (25 × 24)/2 = 300 unique pairs in the matrix, and the main diagonal, consisting of terms paired with themselves, is not used.] This co-occurrence matrix is used to train the SOM.

The account of PNASLINK's parent AUTHORLINK (6) describes the iterative training process as follows. A row from the co-occurrence matrix "is randomly selected and compared to every output node to determine a winner. Weights of the winning output nodes then are updated so that the next time this input node is presented, this output node will likely be selected again as the winner. In the meantime, nodes surrounding the winning node are similarly adjusted. The number of iterations needed to train a SOM is often determined empirically (in our case, we optimize the number of training cycles to 2,500). After the training, input vectors closest in the input space will map to the same regions in the output map. The regions are delineated by areas of nodes in which the elements with the highest value on the vectors are the same."¶ SOMs, like PFNETs, usually take only a second or two to produce.

In interpreting SOMs, points in the same area are held to be closely related. Adjacent areas reflect stronger relationships than nonadjacent areas. Terms in large areas are more influential than terms in small areas.

Examples from Population Genetics

Fig. 1A is a PFNET, and Fig. 1B is a SOM formed with the MeSH term Gene Frequency as the seed. The result is a complex, yet still radically simplified, picture of term relations in population genetics as that subject has developed in PNAS. Fig. 2 repeats the same map types with a cocited author as seed, in this case, the population geneticist Montgomery Slatkin (University of California, Berkeley), a leading researcher in the study of gene frequencies and genetic drift. In Fig. 2A, the author cocitation counts have been toggled on so that they appear above the links, an option not exercised with the term co-occurrence counts in Fig. 1A.

The two map types suggest specific terms from the literature that can be used in document retrieval. The interface of which the maps are part has been cropped away to focus on terms that are related in ways that the literature searcher often does not know in advance. Someone interested in exploring the connection between, say, Gene Frequency and Mathematics or between, say, Slatkin and Luigi Cavalli-Sforza could click on the appropriate labels and retrieve documents in which those particular conjunctions occurred. They would be documents for which Gene Frequency and Mathematics co-occur as subject headings or in which Slatkin is cocited with Cavalli-Sforza. (Further terms may be added at will.)

In Fig. 1A the main nodes in the PFNET are (from left) Alleles, Mutation, Genes (Structural), DNA, and Evolution, a transition from relatively specific to relatively general terms as one moves rightward. The seed term Gene Frequency is seen to be an offshoot of the literature on Alleles. Indeed, if Gene Frequency is required to be present as a third term in all pairings in the map (the tri-citation option mentioned above), the new map has Alleles at the center with 19 of the other terms radiating directly from it. In the SOM in Fig. 1B, the most central term, the one whose region touches most others, is Mutation.
Gene Frequency is placed near the same terms it appeared with in the PFNET, and other connections between the PFNET and the SOM can be traced, but the SOM emphasizes different relations than the PFNET. For example, the two terms for fruit flies appear apart in the PFNET, whereas the SOM brings them together at lower left.

¶Quoted from ref. 6, Copyright 2003, with permission from Elsevier.

Because Fig. 1 shows term relationships solely within PNAS, the question arises whether a mapping of Gene Frequency would differ markedly across all of the journals covered by NLM's PubMed. The latter mapping is possible through our system CONCEPTLINK. It turns out that the two maps have 13 terms in common, which demonstrates the breadth of PNAS in representing topics in genetics. (However, the PFNETs have only four links in common.) Table 2 shows the common and the unique terms. Those unique to PubMed seem more specific and more oriented toward human genetics.

Many of the MeSH terms associated with Gene Frequency in Fig. 1 appear in chapters on population genetics in introductory genetics textbooks, and they are the sort of terms that turn up in textbook glossaries (e.g., Haplotypes, Heterozygotes). Ironically, beginners at the glossary stage may know too little to profit from maps like those in Fig. 1, whereas advanced students and experts may know too much. Asked to comment on Fig. 1 as a domain expert, Slatkin said that the terms and their groupings in the two maps were intelligible, but that the MeSH terms were at such a high level of generality (e.g., Evolution, Mutation, Mathematics) that almost any way of connecting them would make some sense. (He preferred the PFNET's tighter structure to the SOM's for this reason.) He thought only mappings based on a much more specific set of seed terms, e.g., the ecology of a particular species of African millipede, would have much value for him and his students. This is a criticism with which many people might agree, and progress in bibliographic visualizations like ours may well lie in adding capabilities to map specific natural-language "co-words" from the titles, abstracts, or full texts of documents (8, 26, 27).
Possibly the chief beneficiaries of MeSH (or other controlled-vocabulary) mapping will be neither beginners nor subject experts, but "in-between" persons, such as librarians, subject indexers, science writers, journal editors, and teachers as they browse the many research areas to which they come as outsiders.

Slatkin found his own cocited author maps readily interpretable. He was acquainted with every name that appears in Fig. 2. In the PFNET (which he again preferred), he identified the main structural feature, the clusters around himself and Masatoshi Nei, as representing two slightly different subject areas. Both the Nei group and the Slatkin group, he said, have contributed to the literature on genetic flow and population structure, but the Slatkin group has contributed relatively more to the literature on microsatellites (short, repetitive sequences of DNA). Hence, the PFNET was picking up a division he found meaningful.

Many combinations of linked names in Fig. 2A are coherent in the sense that they yield sensible internet retrievals. However, a stricter test for the coherence of a particular domain is whether an expert can rapidly and accurately predict why two authors are linked. Given a random pair from Fig. 2A (Nei and R. R. Sokal), Slatkin guessed that the link between them was caused by frequent cocitation of Sokal's book Biometry with Nei's works on the computation of standard genetic distance. A subsequent retrieval of the articles cociting the pair bore this out.

The SOM in Fig. 2B picks up some of the same dyadic structure as the PFNET, such as the connections between Ohta and Kimura, Tajima and Hudson, Takahata and Griffiths, Cavalli-Sforza and Jorde, and Valdes and Weber (which may reflect coauthorships as well as cocitation ties). Slatkin and Nei remain central figures, but are joined by Avise and Templeton. Interestingly, at the lower left the SOM conjoins Wright, Mayr, and Fisher, who represent the older, pioneering generation in statistical genetics. The SOM algorithm is able to bring this out solely on the basis of their overall cocitation profiles.

Other Reactions to the Map Types

If PFNETs seem directive about term relationships, SOMs are merely suggestive. However, their greater ambiguity is perhaps a virtue. Using AUTHORLINK, the forerunner of PNASLINK, Buzydlowski (9) found that SOMs outperformed PFNETs in capturing the mental models of 20 experts in selected fields of the humanities. These were SOMs and PFNETs devoted to cocited authors, exactly like those in Fig. 2.

The experts' mental models were elicited by having them sort cards bearing authors' names into intuitively meaningful piles. Their task was to show how they would group, first, the 24 authors most highly cocited with Plato (almost all quite famous) and, second, the 24 authors most highly cocited with an individual author of the expert's choice. The matrices of card-sort groupings were compared with matrices of the groupings produced by PFNET linkages and SOM positionings. For both the Plato trial, which all experts participated in, and the individual-author trials, which were unique to each expert, SOMs agreed with the card-sort data better than PFNETs. In the Plato trial, both SOMs and PFNETs were highly correlated with the pooled card-sort data (SOMs, r = 0.97; PFNETs, r = 0.78), but these correlations were significantly different at P < 0.001. In the individual-author trials, a t test of mean agreement scores favored SOMs significantly at P < 0.01. The experts were nevertheless about equally divided in their preferences for one map type over the other.

In other, less formal trials, we have found that some experts object when maps of either type differ from their mental models of how the subject headings or authors in their fields are connected. With respect to this criticism, it should be borne in mind that the maps are pictures of the database. They show term associations that have developed as authors and indexers actually create literatures, in the present case, solely within PNAS, and these will often differ from the terminological hierarchies one finds in individual heads, not to mention textbooks, thesauri, or other databases (compare Table 2). In fact, the maps should be taken as new information, not as "erroneous" attempts to generate preexisting hierarchies from bibliographic data. The ongoing task is to find which types of maps and which types of terms are most useful to particular clienteles.

1. White, H. D. (1990) in Scholarly Communication and Bibliometrics, ed. Borgman, C. (Sage, Newbury Park, CA), pp. 84-106.
2. White, H. D. & McCain, K. W. (1997) Annu. Rev. Inf. Sci. Technol. 32, 99-168.
3. Kohonen, T. (1997) Self-Organizing Maps (Springer, New York), 2nd Ed.
4. Schvaneveldt, R. W., ed. (1990) Pathfinder Associative Networks: Studies in Knowledge Organization (Ablex, Norwood, NJ).
5. Börner, K., Chen, C. & Boyack, K. W. (2003) Annu. Rev. Inf. Sci. Technol. 37, 179-255.
6. Lin, X., White, H. D. & Buzydlowski, J. (2003) Inf. Process. Manag. 39, 689-706.
7. Buzydlowski, J. W., White, H. D. & Lin, X. (2003) in Visual Interfaces to Digital Libraries, Lecture Notes in Computer Science 2539, eds. Börner, K. & Chen, C. (Springer, Berlin), pp. 133-144.
8. Chen, H., Lally, A. M., Zhu, B. & Chau, M. (2003) J. Am. Soc. Inf. Sci. Technol. 54, 683-694.
9. Buzydlowski, J. W. (2003) Ph.D. thesis (Drexel University, Philadelphia).
10. Hearst, M. (1999) in Modern Information Retrieval, eds. Baeza-Yates, R. & Ribeiro-Neto, B. (Addison-Wesley, New York), pp. 257-323.
11. Chen, C. (2003) Mapping Scientific Frontiers: The Quest for Knowledge Visualization (Springer, London).
12. Dodge, M. & Kitchin, R. (2001) Atlas of Cyberspace (Addison-Wesley, Harlow, U.K.).
13. Jansen, B. J., Spink, A. & Saracevic, T. (2000) Inf. Process. Manag. 36, 207-227.
14. Schatz, B. R., Johnson, E. H., Cochrane, P. A. & Chen, H. (1996) in Proceedings of the First ACM International Conference on Digital Libraries, eds. Fox, E. A. & Marchionini, G. (Association for Computing Machinery, New York), pp. 126-133.
15. Fowler, R. H. & Dearholt, D. W. (1990) in Pathfinder Associative Networks: Studies in Knowledge Organization, ed. Schvaneveldt, R. W. (Ablex, Norwood, NJ), pp. 165-178.

OCR for page 120
16. Fowler, R. H., Fowler, W. A. L. & Wilson, B. A. (1991) in Proceedings of the 14th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, ed. Bookstein, A. (Association for Computing Machinery, New York), pp. 142-151.
17. Fowler, R. H., Wilson, B. A. & Fowler, W. A. L. (1992) Information Navigator: An Information System Using Associative Networks for Display and Retrieval, Technical Report NAG9-551,92-1 (Department of Computer Science, University of Texas-Pan American, Edinburg).
18. McGreevy, M. W. (1995) A Relational Metric, Its Application to Domain Analysis, and an Example Analysis and Model of a Remote Sensing Domain, National Aeronautics and Space Administration Technical Memorandum 119358 (Ames Research Center, Moffett Field, CA).
19. Chen, C. (1998) J. Vis. Lang. Comput. 9, 267-286.
20. Chen, C. (1999) Inf. Process. Manag. 35, 401-420.
21. Lin, X., Soergel, D. & Marchionini, G. (1991) in Proceedings of the 14th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, ed. Bookstein, A. (Association for Computing Machinery, New York), pp. 262-269.
22. Roussinov, D. & Chen, H. (1998) Commun. Cognit. Artificial Intelligence 15, 81-112.
23. Chen, H., Houston, A. L., Sewell, R. R. & Schatz, B. R. (1998) J. Am. Soc. Inf. Sci. 49, 582-603.
24. Tversky, A. & Gati, I. (1982) Psychol. Rev. 89, 123-154.
25. Kamada, T. & Kawai, S. (1989) Inf. Process. Lett. 31, 7-15.
26. Ding, Y., Chowdhury, G. G. & Foo, S. (2000) Inf. Process. Manag. 37, 817-842.
27. Ibekwe-San Juan, F. & San Juan, E. (2002) Knowl. Org. 29, 181-197.

5302 | www.pnas.org/cgi/doi/10.1073/pnas.0307630100
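
The comparison of card-sort groupings with map groupings, described under Other Reactions to the Map Types, can be illustrated with a minimal sketch. It is not the procedure of Buzydlowski (9), whose details are not given in this text. It assumes that each expert's sort is a list of piles, that a map is likewise reduced to groups of names, that agreement is measured by a Pearson correlation over all pairs of names, and that every name appears in exactly one pile per sort. The function names and the four-item toy data (labels A to D) are hypothetical.

```python
from itertools import combinations
from math import sqrt


def comembership(piles, names):
    """Map each pair of names to 1 if they share a pile, else 0.

    Assumes every name in `names` appears in exactly one pile.
    """
    pile_of = {name: i for i, pile in enumerate(piles) for name in pile}
    return {
        (a, b): int(pile_of[a] == pile_of[b])
        for a, b in combinations(names, 2)
    }


def pooled(sorts, names):
    """Sum the co-membership matrices of several experts' card sorts."""
    total = {pair: 0 for pair in combinations(names, 2)}
    for piles in sorts:
        for pair, hit in comembership(piles, names).items():
            total[pair] += hit
    return total


def pearson(x, y):
    """Plain Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def agreement(sorts, map_groups, names):
    """Correlate pooled card-sort counts with a map's groupings."""
    pairs = list(combinations(names, 2))
    card = pooled(sorts, names)
    mapped = comembership(map_groups, names)
    return pearson([card[p] for p in pairs], [mapped[p] for p in pairs])


# Toy data, invented for illustration only (not from the study).
names = ["A", "B", "C", "D"]
expert_sorts = [
    [["A", "B"], ["C", "D"]],    # hypothetical expert 1
    [["A", "B", "C"], ["D"]],    # hypothetical expert 2
]
map_groups = [["A", "B"], ["C", "D"]]  # groups read off a hypothetical map

print(round(agreement(expert_sorts, map_groups, names), 3))
```

Running the same function once with groups taken from a PFNET and once with groups taken from a SOM would give two correlations of the kind reported for the Plato trial. How groups are read off each map type (linked names in a PFNET, neighboring positions in a SOM) is left open here.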

Representative terms from entire chapter:

seed term