Colloquium
User-controlled mapping of significant literatures
Howard D. White*, Xia Lin, Jan W. Buzydlowski, and Chaomei Chen
College of Information Science and Technology, Drexel University, Philadelphia, PA 19104
We apply a version of our web-based literature-mapping system to
PNAS for 1971-2002, as indexed by the National Library of Medi-
cine and the Institute for Scientific Information. Given a single
input term from a user, a medical subject heading, a cocited author,
or a cocited journal, PNASLINK rapidly displays views in which that
term and the other 24 terms that most frequently co-occur with it
in a bibliographic database are interrelated in ways suggesting
fruitful combinations for document retrieval. The interrelation-
ships are produced by two algorithms, pathfinder networks and
Kohonen-style self-organizing maps. PNASLINK displays are themselves
interactive interfaces that can retrieve documents from
digital libraries (e.g., PNAS Online). This style of visualizing knowl-
edge domains is called "localized" because it does not attempt to
map the indexing of literatures in full but concentrates on the top
terms in an "associative thesaurus" reflecting user interests. It also
permits swift remappings, as the user recognizes terms worth
pursuing. PNASLINK is illustrated with maps drawn from the litera-
ture of population genetics. Some comparative and evaluative
comments are added, one from a domain expert indicating that the
face validity of the system may be tempered by insufficient
specificity in the indexing terms being mapped.
We here present two ways of rapidly mapping literatures in
terms of selected indexing vocabularies. Both ways are
responsive to users, and either can serve as an interface for
retrieval of documents from digital libraries. Either can also
complement a work that focuses on the structure of a literature,
such as a research review (1). Our data are the contents of PNAS
for 1971-2002, as described by medical subject headings from the
National Library of Medicine (NLM) and by citation indexing
from the Institute for Scientific Information (ISI).† Indexing by
these organizations typifies the bibliographic control that is
extended only to significant, that is, highly valued, literatures.
Our software, called PNASLINK (available at
http://project.cis.drexel.edu/pnas), is designed to amplify such
control, by enabling customized browsing on the basis of user input.
Both of our mapping techniques exploit co-occurrences of
terms in NLM and ISI bibliographic records. The terms are
systematically paired, and their co-occurrences are counted in
matrices. Because people can easily assimilate numeric matrices
only when they are recast as pictures of some kind (2), one of our
techniques transforms the counts into Kohonen-style self-
organizing maps (SOMs) (3), and the other transforms them into
pathfinder networks (PFNETs) (4). SOMs show frequently
co-occurring terms as nodes that are spatially close. PFNETs
show them as nodes with explicit ties. The two kinds of maps will
be exemplified here with medical subject headings (MeSH) and
cocited authors in a specialty of genetics.
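The pairing-and-counting step can be illustrated with a minimal sketch. It is not the NOAH engine; the function name `cooccurrence_counts` and the three toy records are hypothetical, and each record is assumed to be simply a list of indexing terms.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how often each unordered pair of terms shares a record."""
    counts = Counter()
    for terms in records:
        # De-duplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy records, each a list of indexing terms.
records = [
    ["Alleles", "Gene Frequency", "Mutation"],
    ["Alleles", "Gene Frequency"],
    ["DNA", "Mutation"],
]
counts = cooccurrence_counts(records)
print(counts[("Alleles", "Gene Frequency")])
```

The resulting pair counts are the raw material that either mapping technique would then turn into a picture.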
Other researchers have visualized bibliographic data with
PFNETs and SOMs (2, 5), but we use them to map significant
literatures in real time with retrieval capabilities built into the
maps (refs. 6 and 7 and cf. ref. 8). The data are initially processed
by our NOAH indexing engine, a specialized database application
we designed for fast computations with verbal co-occurrence
data (9). With NOAH, mapping time is determined by the size of
the indexing vocabulary, not by the number of documents in the
database. In CONCEPTLINK, a predecessor of PNASLINK, for
example, we can almost instantly create maps of MeSH terms
from >12 million MEDLINE records. (CONCEPTLINK maps the
co-occurring MeSH indexing of the journals in NLM's PubMed.
It is available at http://project.cis.drexel.edu/conceptlink.)
Once the data are indexed, the user can map and manipulate
them through a unified web interface.
www.pnas.org/cgi/doi/10.1073/pnas.0307630100
The maps (Figs. 1 and 2) are based on term counts solely from
PNAS records because we were set that task as participants in
this colloquium. Elsewhere, we have mapped terms drawn from
the NLM and ISI databases in full.‡ However, even by itself
PNAS is a major interdisciplinary resource, and we can easily
imagine PNAS online or other journal-specific web sites offering
domain visualizations like ours for the benefit of users.
In refs. 6 and 7, we discussed AUTHORLINK (http://
project.cis.drexel.edu/authorlink), the version of our software
that is used to map cocited authors from ISI's Arts and Human-
ities Citation Index for 1988-1997. The present article attests to
the quick adaptability of the CONCEPTLINK/AUTHORLINK soft-
ware to the authors, journals, and MeSH terms in the PNAS
data. This in turn prompts us to offer a rationale for our general
approach, the localized mapping of association thesauri, along
with different accounts of the two main algorithms. We describe
certain interactive features of our system and conclude with
some fresh comparative and evaluative data, including an ex-
pert's commentary on PNASLINK as applied to the domain of
population genetics.
Localized vs. Global Mapping
In their extensive review, Borner et al. (5) emphasize that
"painting a big picture" is a main goal in domain mapping. This
may lead to a strategy of mapping very large co-occurrence
matrices in their entirety. Indeed, system designers have made
many significant developments in software for such global
portrayals of literatures, e.g., THEMESCAPE and VXINSIGHT render
literatures as landscapes; GALAXIES and STARRYNIGHT render
them as astral bodies (10-12). Ours, however, is an alternative
way of visualizing knowledge domains, the localized
mapping. Perhaps the chief difference is that the localized
approach relinquishes scope to increase the user's control of the
mapping process.
This paper results from the Arthur M. Sackler Colloquium of the National Academy of
Sciences, "Mapping Knowledge Domains," held May 9-11, 2003, at the Arnold and Mabel
Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.
Abbreviations: NLM, National Library of Medicine; ISI, Institute for Scientific Information;
SOM, self-organizing map; PFNET, pathfinder network; MeSH, medical subject headings.
*To whom correspondence should be addressed. E-mail: whitehd@drexel.edu.
†These data are extracted from Science Citation Index Expanded [Institute for Scientific
Information, Inc. (ISI), Philadelphia, PA; Copyright ISI]. All rights reserved. No portion of
these data may be reproduced or transmitted in any form or by any means without the
prior written permission of ISI.
‡For restricted access to mapping of ISI's full databases, contact X.L. at xlin@drexel.edu or
H.D.W.
© 2004 by The National Academy of Sciences of the USA
PNAS | April 6, 2004 | vol. 101 | suppl. 1 | 5297-5302
Fig. 1. (A) PFNET of Gene Frequency. (B) SOM of Gene Frequency.
Table 1 helps to sharpen this comparison. In global mapping,
system designers present the user with a preformed view, often
in 3D, of some sizeable literature. Within the panel of visual-
ization, landscapes invite flyovers; star-fields or other constructs
invite flythroughs. In the former, peaks representing major
accretions of documents on some subject are likely to exert a
powerful pull on the user; in the latter, document points coded
as important, e.g., by differences in shape, size, or color, exert a
similar pull. Essentially, the user is engaged in old-fashioned
browsing, as of book titles in library stacks, but system designers
may minimize or even eliminate labeling of objects in the map
because labels clutter precious screen space and block the
metaphorical presentation (see examples in ref. 12). The user
explores the view by "visiting" or "homing in on" objects of
interest, rather as in video games, but typically cannot remap the
literature in pursuit of some new interest because a new map
takes hours of computer time to create.
In contrast, our localized system of mapping more closely
resembles online searching. The user starts the process by
entering a single term at a web interface. This is consistent with
the way most people search the web (13) and is intended to
minimize cognitive demands on users. It is true that PNASLINK
must be entered with MeSH or ISI-style terms instead of
whatever word pops into the user's head, but our system includes
guides that help one make the proper entries. The system
responds to the entry (or "seed") term by forming a list of the
terms that co-occur with it, ranked high to low by frequency. The
seed term and its 24 next-highest neighbors are then exhibited as
a PFNET or a SOM, which the user can switch between. Each
of the two modes of mapping in PNASLINK yields different
insights into the relations of the indexing terms. Both modes
place the user's seed term in the locale of a limited number of
other terms that are guaranteed to co-occur with it, thus
customizing browsing. Any of these terms, if selected by the user,
will be automatically "ANDed" with the seed term in retrieving
documents.
Fig. 2. (A) PFNET of Montgomery Slatkin. (B) SOM of Montgomery Slatkin.
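The seed-term step and the "ANDed" retrieval can be sketched as follows. This is an illustration under stated assumptions, not PNASLINK's code: the function names, the toy records, and the alphabetical tie-breaking rule are all hypothetical.

```python
from collections import Counter


def top_neighbors(records, seed, k=24):
    """Rank the terms that co-occur with the seed, high to low, and keep k."""
    tally = Counter()
    for terms in records:
        unique = set(terms)
        if seed in unique:
            tally.update(unique - {seed})
    # Assumed tie-break: alphabetical order among equal counts.
    ranked = sorted(tally.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def and_query(records, *terms):
    """Return the positions of records that contain every requested term."""
    wanted = set(terms)
    return [i for i, rec in enumerate(records) if wanted <= set(rec)]


# Hypothetical toy records.
records = [
    ["Gene Frequency", "Alleles", "Mutation"],
    ["Gene Frequency", "Alleles"],
    ["Gene Frequency", "DNA"],
    ["DNA", "Mutation"],
]
neighbors = top_neighbors(records, "Gene Frequency", k=2)
hits = and_query(records, "Gene Frequency", "Alleles")
```

Because every neighbor is drawn from records that contain the seed, any seed-plus-neighbor query is guaranteed to retrieve at least one record.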
Mapping only 25 terms at a time is an arbitrary design decision
with several advantages. It allows PNASLINK to make maps on the
fly in seconds. It affords the node labels, the indexing terms,
enough room that they have little or no overlap, thus making
them and their interrelationships the primary features of the
display. It gives the user a rich, but not overwhelming, array of
associations to work with. Finally, because of its speed, it permits
users to create new maps on the basis of single or combined
terms from an old map. Thus, instead of visiting different places
in a global visualization, one moves locally from interest to
interest by point-and-click remapping (which accords with
Hearst's point in ref. 10 that an interactive system should let
users change their search strategies as their goals change). One
also moves by recognizing terms of interest rather than by having
to guess them or look them up in a thesaurus.
Table 1. Contrasting styles of literature mapping
Global                             Localized
Designer initiated                 User initiated
Full matrix mapped                 Small subset of matrix mapped
Map created before inquiry         Map created by inquiry
Creation time: hours               Creation time: seconds
Labeling deemphasized              Labeling emphasized
Explore by relocating on map       Explore by generating new map
Surface objects not manipulable    Surface objects manipulable
User is visitor                    User is wielder
The Associative Thesaurus
If the indexing terms used in the mapping are indeed controlled
by a formal thesaurus, our SOMs and PFNETs provide an
alternative: they display the top listings in what is sometimes
called a term's associative thesaurus (2). Formal thesauri are
published in hard and soft copy; associative thesauri are created
ad hoc within search software. A formal thesaurus, such as
NLM's MeSH, brings out a term's standard linguistic features,
e.g., its definition, synonyms, hypernyms, and hyponyms. In
contrast, an associative thesaurus shows what terms co-occur
with it when it has been used to index actual publications (cf.
ref. 14).
In NLM's MeSH, for example, the term Anthrax is related to
Bacillus anthracis and subordinated to Bacterial Infections and
Mycoses. But if Anthrax is mapped in our system, which covers
the biomedical literature through 2002, its top co-occurring
terms include Postal Service, a connection obviously never to be
part of its entry in MeSH. Associative thesauri are shaped by
historical contingencies, by what is being written about. That is
why they may be useful for online retrieval in ways that formal
thesauri are not.
Not all indexing uses subject headings and formal thesauri, of
course. ISI's indexing, for example, allows searchers to retrieve
the items that cite a given author. From that capability, online
searchers with the right software can move to retrieving items
that cite pairs of authors jointly. To people literate in a domain,
frequently cocited authors may suggest nuances of meaning that
are absent in standard subject indexing (for example, articles that
cite both Derek de Solla Price and Diana Crane may bear on
"invisible colleges" in science even if that phrase does not appear
in their bibliographic records). A map of cocited authors is, in
effect, an associative thesaurus of authors linked by conjoint use
of their works. Again, these linkages may permit useful retrievals
that are not otherwise possible (1).
Additional Capabilities
PNASLINK can produce maps not only of associated MeSH terms
and cocited authors but also of cocited journals. That is, if a user
supplies the name of a seed journal, such as Gut or Cell, PNASLINK
maps the top 24 journals cocited with it in PNAS. Journal maps
are most likely to be of interest to professional literature
managers, such as serials librarians, whereas maps of MeSH and
cocited authors are intended more for users in general.
Guided by our emphasis on user control, we have imple-
mented several interactive functions for PNASLINK. For example,
the system lets the user regenerate the maps after removing some
terms. This is helpful, for example, in journal mapping, when one
may want to eliminate omnibus journals like Science and Nature
from a map to focus on more specialized titles.
PNASLINK also has alternate data models to show term rela-
tionships from different perspectives. By default, the seed term
is used to generate 24 other terms, but then the counts for these
pairs are obtained without reference to their counts with the
seed term. However, if the user chooses the "tri-citation" option,
the seed term is always required to be present with the other two, and
the maps are accordingly different.
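One reading of the two data models can be sketched as follows: by default, pair counts come from all records, whereas under the "tri-citation" option only records that also contain the seed are counted. The function `pair_matrix` and the toy records are hypothetical, and the actual PNASLINK computation may differ in detail.

```python
from itertools import combinations


def pair_matrix(records, terms, seed=None):
    """Symmetric co-occurrence matrix for `terms`.

    If `seed` is given, only records containing the seed are counted
    (the assumed "tri-citation" reading); otherwise all records count.
    """
    index = {term: i for i, term in enumerate(terms)}
    n = len(terms)
    matrix = [[0] * n for _ in range(n)]
    for rec in records:
        present = set(rec)
        if seed is not None and seed not in present:
            continue
        hits = sorted(index[t] for t in present if t in index)
        for i, j in combinations(hits, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1
    return matrix


# Hypothetical toy data: the seed plus two neighbors.
terms = ["Gene Frequency", "Alleles", "Mutation"]
records = [
    ["Gene Frequency", "Alleles", "Mutation"],
    ["Alleles", "Mutation"],
    ["Gene Frequency", "Alleles"],
]
default_counts = pair_matrix(records, terms)
tri_counts = pair_matrix(records, terms, seed="Gene Frequency")
```

In the toy data, Alleles and Mutation co-occur twice overall but only once in the presence of the seed, which is why the two options can yield different maps.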
Throughout the interaction process, the user can directly
retrieve documents by subject through PNAS Online. Every time
the user clicks on a MeSH term, it is added to a query list. When
the user clicks on the find button, a separate window opens to
show the documents retrieved from PNAS Online by the terms
in the query list. The maps are thus a "live" interface that allows
the user to interact with terms to see what documents they yield.
(PNAS Online lacks ISI-type indexing, which prevents the cocited
author retrieval possible in, e.g., our AUTHORLINK system.)
Two Modes of Mapping
PFNETs and SOMs are dimension-reduction techniques that
have been used to visualize the structure of literatures for more
than a decade. In the context of the movement joining biblio-
metrics with document retrieval (2, 5, 10), PFNETs have been
described by Fowler and colleagues (15-17), McGreevy (18), and
Chen (19, 20). Analogous accounts of SOMs have been done by
Lin et al. (21), Roussinov and Chen (22), and Chen et al. (23).
PFNETs. Characterizing PFNETs, Borner et al. (5) write, "Path-
finder algorithms take estimates of the proximities between pairs
of items as input and define a network representation of the
items that preserves only the most important links." Our input
is pairs of terms, and the pairs are linked as output only if their
co-occurrence counts are the highest (or tied-highest) in their
respective vectors. By emphasizing only the most prominent
links, PFNETs reduce the user's cognitive load in interpreting
the most important relationships depicted in the map. These
relationships in particular are highlighted as potentially fruitful
for retrievals.
PFNETs were developed to portray the results of studies in
which subjects' judgments of the closest semantic items were
represented by the lowest weights. That is, the algorithm selects
the lowest-weight (also called minimum-distance or minimum-
cost) paths to render the most salient ties. However, in our
matrices the closest connections are signaled by the highest
co-occurrence counts. The counts therefore require a transfor-
mation (subtraction from a constant) to convert them to a
distance measure before PFNETs are actually plotted.
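The count-to-distance transformation can be sketched briefly. The text specifies only "subtraction from a constant"; the choice of constant here (one more than the largest count) and the function name are assumptions made for illustration.

```python
def counts_to_distances(counts):
    """Turn a symmetric co-occurrence matrix into distances.

    Each off-diagonal count is subtracted from a constant so that the
    highest counts become the shortest distances. The constant used here
    (largest count + 1) is an assumption; it keeps every distance positive.
    """
    ceiling = max(max(row) for row in counts) + 1
    n = len(counts)
    return [
        [0 if i == j else ceiling - counts[i][j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical 3-term count matrix.
distances = counts_to_distances([[0, 5, 1], [5, 0, 3], [1, 3, 0]])
```

With these toy counts the most frequent pair (count 5) receives the smallest distance (1), and the rank order of all pairs is simply reversed.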
In PFNETs, nodes represent terms, and the importance of
links between them is measured by path weights, computed from
term co-occurrence counts. The PFNET algorithm compares
these weights over both direct (one link) and indirect (multilink)
paths between nodes. It retains just those links that constitute
minimum-weight paths. Such paths are required not to violate
the triangle inequality d(a,c) ≤ d(a,b) + d(b,c), where d is the
distance between points a, b, and c. These paths will be direct
unless an indirect path is computed to be shorter.
The number of links in a PFNET is controlled by two
parameters, r and q. These are set in our software so as to
produce the sparsest possible network, which occurs when r
equals infinity and q equals n - 1, where n is the number of nodes
in the matrix.
The parameter r, which determines how path weights are
computed, is lucidly explained by Fowler et al. (17): "Path
weight, r, is computed according to the Minkowski r-metric. It is
the rth root of the sum of each distance raised to the rth power
for all links in a path between two nodes. Although the r-metric
is continuously variable, simple interpretations exist only for r =
1 (path weight is the sum of the link weights in the path), r = 2
(path weight is the Euclidean distance), and r = infinity (path
weight equals the maximum link weight in the path). One
advantage of r = infinity is that one need only assume that the
original distance estimates have ordinal properties. Another
advantage is that the link structure will be preserved for any
monotonic transformation of the data."§
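The three interpretable cases of the r-metric in the quotation can be checked with a short sketch; the function name `path_weight` and the example link weights are hypothetical.

```python
import math


def path_weight(link_weights, r):
    """Minkowski r-metric weight of a path, given its link weights."""
    if math.isinf(r):
        # Limiting case: the path weight is the largest single link weight.
        return max(link_weights)
    return sum(w ** r for w in link_weights) ** (1.0 / r)


links = [3, 4]  # a hypothetical two-link path
print(path_weight(links, 1))         # sum of the link weights
print(path_weight(links, 2))         # Euclidean combination
print(path_weight(links, math.inf))  # maximum link weight
```

With r = infinity only the ordering of the link weights matters, which is why the resulting link structure survives any monotonic transformation of the data.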
The parameter q sets the range within which all paths of length
q will be examined in the test of the triangle inequality (24) and
removed if they violate it. The larger the value of q, the more
extensive the triangle inequality constraint; therefore, links are
more likely on a path that violates the rule. If q is one less than
the number of nodes, then all of the potential violators are under
scrutiny.
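For the sparsest setting, r = infinity and q = n - 1, the pruning can be sketched compactly: a direct link survives only if no path of any length has a smaller maximum link weight. The sketch below is a generic illustration using a Floyd-Warshall-style pass, not the authors' implementation; it assumes a symmetric distance matrix with a zero diagonal, and the function name is hypothetical.

```python
def pfnet_inf(dist):
    """Links kept by a pathfinder network with r = infinity, q = n - 1."""
    n = len(dist)
    best = [row[:] for row in dist]
    # best[i][j] becomes the smallest achievable "maximum link weight"
    # over all paths from i to j (a minimax variant of Floyd-Warshall).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(best[i][k], best[k][j])
                if via_k < best[i][j]:
                    best[i][j] = via_k
    # Keep a direct link only if no indirect path beats it.
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if dist[i][j] <= best[i][j]
    ]


# Hypothetical distances among three terms.
links = pfnet_inf([[0, 1, 5], [1, 0, 3], [5, 3, 0]])
```

In the toy matrix the direct link between terms 0 and 2 (weight 5) is dropped because the indirect path through term 1 has a maximum link weight of only 3.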
The settings r = infinity and q = n - 1 are widely used in
pathfinder research because they tend to produce networks that
are highly intelligible simplifications of the data. An algorithm
called a spring embedder (25) is used to enhance the maps by
minimizing unsightly features such as crossed links and overlapping
nodes. The finished map is virtually instantaneous once a
seed term is entered.
§Quoted with permission from ref. 17.
Table 2. Top terms associated with gene frequency in two databases
Common to PNAS and PubMed           Unique to PNAS            Unique to PubMed
Alleles                             Drosophila                Apolipoproteins E
Chromosome mapping                  Drosophila melanogaster   Caucasoid race
DNA                                 Evolution                 Ethnic groups
Gene frequency                      Genes, structural         Genes, MHC class II
Genetics, population                Genotype                  Genetic markers
Haplotypes                          Heterozygote              HLA antigens
Models, genetic                     Linkage, genetics         HLA-DQ antigens
Mutation                            Mathematics               HLA-DR antigens
Polymerase chain reaction           Models, biological        Microsatellite repeats
Polymorphism (genetics)             Phenotype                 Minisatellite repeats
Repetitive sequences, nucleic acid  Probability               Mongoloid race
Selection (genetics)                Recombination, genetic    Tandem repeat sequences
Variation (genetics)
SOMs. Unlike PFNETS, which explicitly join highly related terms,
SOMs render semantic relationships through a distance meta-
phor. The more frequently co-occurring terms, which presum-
ably have greater mutual relevance, occupy more proximate
regions on the map. SOMs are designed to render not just the
highest co-occurrence counts between terms, but rather rela-
tively high co-occurrences across groups of terms. They are a
softer-focus kind of mapping than PFNETs, but they, too,
suggest specific combinations of terms on which the user might
want to base retrievals.
The PNASLINK algorithm extracts the proximity relations of
data in 25 dimensions, one for each of the input terms paired
with all others, and seeks to preserve them as closely as possible
in 2D. This process of self-organization (also known as unsu-
pervised learning) runs over many iterative cycles. In each
iteration, the images of term pairs that are strongly related in the
high-dimensional space will be moved closer on the lower-
dimensional space until stability is reached.
More specifically, the 2D grid of PNASLINK consists of 64
output nodes distributed in an 8-by-8 pattern. Each output node
corresponds to a vector of 25 weights that are initially set as small
random numbers. Each is also connected to 25 input nodes, and
the latter correspond to vectors in the 25-by-25 matrix compris-
ing all possible pairs of a seed term and the 24 terms most
frequently co-occurring with it. [There are 25(24)/2 = 300
unique pairs in the matrix, and the main diagonal, consisting of
terms paired with themselves, is not used.] This co-occurrence
matrix is used to train the SOM.
The account of PNASLINK's parent AUTHORLINK (6) describes
the iterative training process as follows. A row from the co-
occurrence matrix "is randomly selected and compared to every
output node to determine a winner. Weights of the winning
output nodes then are updated so that the next time this input
node is presented, this output node will likely be selected again
as the winner. In the meantime, nodes surrounding the winning
node are similarly adjusted. The number of iterations needed to
train a SOM is often determined empirically (in our case, we
optimize the number of training cycles to 2,500). After the
training, input vectors closest in the input space will map to the
same regions in the output map. The regions are delineated by
areas of nodes in which the elements with the highest value on
the vectors are the same."¶ SOMs, like PFNETs, usually take
only a second or two to produce.
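The training loop described in the quotation can be sketched in plain Python. Only the 8-by-8 grid, the small random initial weights, and the 2,500 cycles come from the text; the learning-rate and neighborhood-radius schedules, the Gaussian neighborhood, the Euclidean winner rule, and the scaling of inputs to the range 0 to 1 are assumptions, and the function names are hypothetical.

```python
import math
import random


def best_match(weights, row):
    """Return the grid coordinate whose weight vector is nearest to `row`."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], row)),
    )


def train_som(rows, grid=8, cycles=2500, seed=0):
    """Train a grid-by-grid map on the rows of a co-occurrence matrix."""
    rng = random.Random(seed)
    dim = len(rows[0])
    # Small random initial weights, one vector per output node.
    weights = {
        (x, y): [rng.uniform(0.0, 0.1) for _ in range(dim)]
        for x in range(grid)
        for y in range(grid)
    }
    for t in range(cycles):
        remaining = 1.0 - t / cycles
        rate = 0.01 + 0.5 * remaining             # assumed schedule
        radius = 0.5 + (grid / 2.0) * remaining   # assumed schedule
        row = rng.choice(rows)                    # randomly selected input row
        wx, wy = best_match(weights, row)         # the winning output node
        for (x, y), vec in weights.items():
            grid_d2 = (x - wx) ** 2 + (y - wy) ** 2
            # The winner moves most; surrounding nodes move less.
            pull = rate * math.exp(-grid_d2 / (2.0 * radius ** 2))
            for i in range(dim):
                vec[i] += pull * (row[i] - vec[i])
    return weights


# Hypothetical 4-term matrix, scaled to 0..1, on a small 4-by-4 grid.
toy = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
som = train_som(toy, grid=4, cycles=300)
placements = [best_match(som, row) for row in toy]
```

Each entry of `placements` is the grid position at which a term would be drawn; terms with similar rows tend to land on the same or neighboring nodes.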
In interpreting SOMs, points in the same area are held to be
closely related. Adjacent areas reflect stronger relationships
than nonadjacent areas. Terms in large areas are more influen-
tial than terms in small areas.
Examples from Population Genetics
Fig. 1A is a PFNET, and Fig. 1B is a SOM formed with the MeSH
term Gene Frequency as the seed. The result is a complex, yet
still radically simplified, picture of term relations in population
genetics as that subject has developed in PNAS. Fig. 2 repeats the
same map types with a cocited author as seed, in this case, the
population geneticist Montgomery Slatkin (University of Cali-
fornia, Berkeley), a leading researcher in the study of gene
frequencies and genetic drift. In Fig. 2A, the author cocitation
counts have been toggled on so that they appear above the links,
an option not exercised with the term co-occurrence counts in
Fig. 1A.
The two map types suggest specific terms from the literature
that can be used in document retrieval. The interface of which
the maps are part has been cropped away to focus on terms that
are related in ways that the literature searcher often does not
know in advance. Someone interested in exploring the connec-
tion between, say, Gene Frequency and Mathematics or be-
tween, say, Slatkin and Luigi Cavalli-Sforza could click on the
appropriate labels and retrieve documents in which those par-
ticular conjunctions occurred. They would be documents for
which Gene Frequency and Mathematics co-occur as subject
headings or in which Slatkin is cocited with Cavalli-Sforza.
(Further terms may be added at will.)
In Fig. 1A the main nodes in the PFNET are (from left)
Alleles, Mutation, Genes (Structural), DNA, and Evolution, a
transition from relatively specific to relatively general terms as
one moves rightward. The seed term Gene Frequency is seen to
be an offshoot of the literature on Alleles. Indeed, if Gene
Frequency is required to be present as a third term in all pairings
in the map (the tri-citation option mentioned above), the new
map has Alleles at the center with 19 of the other terms radiating
directly from it.
In the SOM in Fig. 1B, the most central term, the one whose
region touches most others, is Mutation. Gene Frequency is
placed near the same terms it appeared with in the PFNET, and
other connections between the PFNET and the SOM can be
traced, but the SOM emphasizes different relations than the
PFNET. For example, the two terms for fruit flies appear apart
in the PFNET, whereas the SOM brings them together at lower
left.
¶Quoted from ref. 6, Copyright 2003, with permission from Elsevier.
Because Fig. 1 shows term relationships solely within PNAS,
the question arises whether a mapping of Gene Frequency would
differ markedly across all of the journals covered by NLM's
PubMed. The latter mapping is possible through our system
CONCEPTLINK. It turns out that the two maps have 13 terms in
common, which demonstrates the breadth of PNAS in repre-
senting topics in genetics. (However, the PFNETs have only four
links in common.) Table 2 shows the common and the unique
terms. Those unique to PubMed seem more specific and more
oriented toward human genetics.
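The overlap between the two term lists can be recomputed from Table 2 with ordinary set operations. The term lists below are transcribed from Table 2; the Jaccard ratio is an added summary measure that does not appear in the original analysis.

```python
# Terms transcribed from Table 2.
common = {
    "Alleles", "Chromosome mapping", "DNA", "Gene frequency",
    "Genetics, population", "Haplotypes", "Models, genetic", "Mutation",
    "Polymerase chain reaction", "Polymorphism (genetics)",
    "Repetitive sequences, nucleic acid", "Selection (genetics)",
    "Variation (genetics)",
}
pnas_terms = common | {
    "Drosophila", "Drosophila melanogaster", "Evolution",
    "Genes, structural", "Genotype", "Heterozygote", "Linkage, genetics",
    "Mathematics", "Models, biological", "Phenotype", "Probability",
    "Recombination, genetic",
}
pubmed_terms = common | {
    "Apolipoproteins E", "Caucasoid race", "Ethnic groups",
    "Genes, MHC class II", "Genetic markers", "HLA antigens",
    "HLA-DQ antigens", "HLA-DR antigens", "Microsatellite repeats",
    "Minisatellite repeats", "Mongoloid race", "Tandem repeat sequences",
}

shared = pnas_terms & pubmed_terms
only_pnas = pnas_terms - pubmed_terms
only_pubmed = pubmed_terms - pnas_terms
# Added measure (not in the source): share of all distinct terms held in common.
jaccard = len(shared) / len(pnas_terms | pubmed_terms)
print(len(shared), len(only_pnas), len(only_pubmed))
```

Each map holds 25 terms (the seed plus 24 neighbors), so 13 shared terms leave 12 unique to each database.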
Many of the MeSH terms associated with Gene Frequency in
Fig. 1 appear in chapters on population genetics in introductory
genetics textbooks, and they are the sort of terms that turn up in
textbook glossaries (e.g., Haplotypes, Heterozygotes). Ironically,
beginners at the glossary stage may know too little to profit from
maps like those in Fig. 1, whereas advanced students and experts
may know too much. Asked to comment on Fig. 1 as a domain
expert, Slatkin said that the terms and their groupings in the two
maps were intelligible, but that the MeSH terms were at such a
high level of generality (e.g., Evolution, Mutation, Mathematics)
that almost any way of connecting them would make some sense.
(He preferred the PFNET's tighter structure to the SOM's for
this reason.) He thought only mappings based on a much more
specific set of seed terms, e.g., the ecology of a particular species
of African millipede, would have much value for him and his
students.
This is a criticism with which many people might agree, and
progress in bibliographic visualizations like ours may well lie in
adding capabilities to map specific natural-language "co-words"
from the titles, abstracts, or full texts of documents (8, 26, 27).
Possibly the chief beneficiaries of MeSH (or other controlled-
vocabulary) mapping will be neither beginners nor subject
experts, but "in-between" persons, such as librarians, subject
indexers, science writers, journal editors, and teachers as they
browse the many research areas to which they come as outsiders.
Slatkin found his own cocited author maps readily interpret-
able. He was acquainted with every name that appears in Fig. 2.
In the PFNET (which he again preferred), he identified the main
structural feature, the clusters around himself and Masatoshi
Nei, as representing two slightly different subject areas. Both the
Nei group and the Slatkin group, he said, have contributed to
the literature on genetic flow and population structure, but the
Slatkin group has contributed relatively more to the literature on
microsatellites (short, repetitive sequences of DNA). Hence, the
PFNET was picking up a division he found meaningful.
Many combinations of linked names in Fig. 2A are coherent
in the sense that they yield sensible internet retrievals. However,
a stricter test for the coherence of a particular domain is whether
an expert can rapidly and accurately predict why two authors are
linked. Given a random pair from Fig. 2A (Nei and R. R. Sokal),
Slatkin guessed that the link between them was caused by
frequent cocitation of Sokal's book Biometry with Nei's works on
the computation of standard genetic distance. A subsequent
retrieval of the articles cociting the pair bore this out.
1. White, H. D. (1990) in Scholarly Communication and Bibliometrics, ed.
Borgman, C. (Sage, Newbury Park, CA), pp. 84-106.
2. White, H. D. & McCain, K. W. (1997) Annu. Rev. Inf. Sci. Technol. 32, 99-168.
3. Kohonen, T. (1997) Self-Organizing Maps (Springer, New York), 2nd Ed.
4. Schvaneveldt, R. W., ed. (1990) Pathfinder Associative Networks: Studies in
Knowledge Organization (Ablex, Norwood, NJ).
5. Borner, K., Chen, C. & Boyack, K. W. (2003) Annu. Rev. Inf. Sci. Technol. 37,
179-255.
6. Lin, X., White, H. D. & Buzydlowski, J. (2003) Inf. Process. Manag. 39,
689-706.
7. Buzydlowski, J. W., White, H. D. & Lin, X. (2003) in Visual Interfaces to Digital
Libraries, Lecture Notes in Computer Science 2539, eds. Borner, K. & Chen,
C. (Springer, Berlin), pp. 133-144.
8. Chen, H., Lally, A. M., Zhu, B. & Chau, M. (2003) J. Am. Soc. Inf. Sci. Technol.
54, 683-694.
The SOM in Fig. 2B picks up some of the same dyadic
structure as the PFNET, such as the connections between Ohta
and Kimura, Tajima and Hudson, Takahata and Griffiths,
Cavalli-Sforza and Jorde, and Valdes and Weber (which may
reflect coauthorships as well as cocitation ties). Slatkin and Nei
remain central figures, but are joined by Avise and Templeton.
Interestingly, at the lower left the SOM conjoins Wright, Mayr,
and Fisher, who represent the older, pioneering generation in
statistical genetics. The SOM algorithm is able to bring this out
solely on the basis of their overall cocitation profiles.
Other Reactions to the Map Types
If PFNETs seem directive about term relationships, SOMs are
merely suggestive. However, their greater ambiguity is perhaps
a virtue. Using AUTHORLINK, the forerunner of PNASLINK, Buzy-
dlowski (9) found that SOMs outperformed PFNETs in captur-
ing the mental models of 20 experts in selected fields of the
humanities. These were SOMs and PFNETs devoted to cocited
authors, exactly like those in Fig. 2.
The experts' mental models were elicited by having them sort
cards bearing authors' names into intuitively meaningful piles.
Their task was to show how they would group, first, the 24
authors most highly cocited with Plato (almost all quite famous)
and, second, the 24 authors most highly cocited with an indi-
vidual author of the expert's choice. The matrices of card-sort
groupings were compared with matrices of the groupings pro-
duced by PFNET linkages and SOM positionings. For both the
Plato trial, which all experts participated in, and the individual-
author trials, which were unique to each expert, SOMs agreed
with the card-sort data better than PFNETs. In the Plato trial,
both SOMs and PFNETs were highly correlated with the pooled
card-sort data (SOMs, r = 0.97; PFNETs, r = 0.78), but these
correlations were significantly different at P < 0.001. In the
individual-author trials, a t test of mean agreement scores
favored SOMs significantly at P < 0.01. The experts were
nevertheless about equally divided in their preferences for one
map type over the other.
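
The matrix comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not say how the grouping matrices were coded or which correlation statistic was applied, so the sketch assumes binary co-grouping matrices (1 if two authors fall in the same group) and a Pearson correlation over their upper triangles. The author labels, the two groupings, and the helper names co_grouping_matrix, upper_triangle, and pearson_r are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def co_grouping_matrix(groups, names):
    """Symmetric 0/1 matrix: 1 where two names were placed in the same group.

    Assumption: binary coding; the study's actual coding is not given.
    """
    index = {name: i for i, name in enumerate(names)}
    size = len(names)
    matrix = [[0] * size for _ in range(size)]
    for group in groups:
        for a, b in combinations(group, 2):
            i, j = index[a], index[b]
            matrix[i][j] = matrix[j][i] = 1
    return matrix


def upper_triangle(matrix):
    """Flatten the cells above the diagonal so each pair is counted once."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i + 1, size)]


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: five placeholder authors, one expert's card-sort piles,
# and the groups read off a map (e.g., linked nodes or adjacent regions).
names = ["A", "B", "C", "D", "E"]
expert_piles = [["A", "B", "C"], ["D", "E"]]
map_groups = [["A", "B"], ["C", "D", "E"]]

expert = upper_triangle(co_grouping_matrix(expert_piles, names))
mapped = upper_triangle(co_grouping_matrix(map_groups, names))
agreement = pearson_r(expert, mapped)
print(f"agreement r = {agreement:.3f}")
```

A higher r means the map puts together the same pairs of authors that the expert did. Pooling several experts, as in the Plato trial, would under the same assumptions amount to summing their matrices before flattening.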
In other, less formal trials, we have found that some experts
object when maps of either type differ from their mental models
of how the subject headings or authors in their fields are
connected. With respect to this criticism, it should be borne in
mind that the maps are pictures of the database. They show term
associations that have developed as authors and indexers actually
create literatures, in the present case, solely within PNAS, and
these will often differ from the terminological hierarchies one
finds in individual heads, not to mention textbooks, thesauri, or
other databases (compare Table 2). In fact, the maps should be
taken as new information, not as "erroneous" attempts to
generate preexisting hierarchies from bibliographic data. The
ongoing task is to find which types of maps and which types of
terms are most useful to particular clienteles.
9. Buzydlowski, J. W. (2003) Ph.D. thesis (Drexel University, Philadelphia).
10. Hearst, M. (1999) in Modern Information Retrieval, eds. Baeza-Yates, R. &
Ribeiro-Neto, B. (Addison-Wesley, New York), pp. 257-323.
11. Chen, C. (2003) Mapping Scientific Frontiers: The Quest for Knowledge Visual-
ization (Springer, London).
12. Dodge, M. & Kitchin, R. (2001) Atlas of Cyberspace (Addison-Wesley, Harlow,
U.K.).
13. Jansen, B. J., Spink, A. & Saracevic, T. (2000) Inf. Process. Manag. 36, 207-227.
14. Schatz, B. R., Johnson, E. H., Cochrane, P. A. & Chen, H. (1996) in
Proceedings of the First ACM International Conference on Digital Libraries,
eds. Fox, E. A. & Marchionini, G. (Association for Computing Machinery,
New York), pp. 126-133.
15. Fowler, R. H. & Dearholt, D. W. (1990) in Pathfinder Associative Networks:
Studies in Knowledge Organization, ed. Schvaneveldt, R. W. (Ablex, Norwood,
NJ), pp. 165-178.
PNAS | April 6, 2004 | vol. 101 | suppl. 1 | 5301
16. Fowler, R. H., Fowler, W. A. L. & Wilson, B. A. (1991) in Proceedings of the
14th Annual International ACM/SIGIR Conference on Research and Develop-
ment in Information Retrieval, ed. Bookstein, A. (Association for Computing
Machinery, New York), pp. 142-151.
17. Fowler, R. H., Wilson, B. A. & Fowler, W. A. L. (1992) Information Navigator:
An Information System Using Associative Networks for Display and Retrieval,
Technical Report NAG9-551,92-1 (Department of Computer Science, Uni-
versity of Texas-Pan American, Edinburg).
18. McGreevy, M. W. (1995) A Relational Metric, Its Application to Domain
Analysis, and an Example Analysis and Model of a Remote Sensing Domain,
National Aeronautics and Space Administration Technical Memorandum
119358 (Ames Research Center, Moffett Field, CA).
19. Chen, C. (1998) J. Vis. Lang. Comput. 9, 267-286.
5302 | www.pnas.org/cgi/doi/10.1073/pnas.0307630100
20. Chen, C. (1999) Inf. Process. Manag. 35, 401-420.
21. Lin, X., Soergel, D. & Marchionini, G. (1991) in Proceedings of the 14th Annual
International ACM/SIGIR Conference on Research and Development in Information
Retrieval, ed. Bookstein, A. (Association for Computing Machinery, New
York), pp. 262-269.
22. Roussinov, D. & Chen, H. (1998) Commun. Cognit. Artificial Intelligence 15,
81-112.
23. Chen, H., Houston, A. L., Sewell, R. R. & Schatz, B. R. (1998) J. Am. Soc. Inf.
Sci. 49, 582-603.
24. Tversky, A. & Gati, I. (1982) Psychol. Rev. 89, 123-154.
25. Kamada, T. & Kawai, S. (1989) Inf. Process. Lett. 31, 7-15.
26. Ding, Y., Chowdhury, G. G. & Foo, S. (2000) Inf. Process. Manag. 37, 817-842.
27. Ibekwe-San Juan, F. & San Juan, E. (2002) Knowl. Org. 29, 181-197.
Representative terms from entire chapter:
seed term