(Sackler NAS Colloquium) Mapping Knowledge Domains (2004)
Proceedings of the National Academy of Sciences (PNAS)



Colloquium

Mixed-membership models of scientific publications

Elena Erosheva*†, Stephen Fienberg‡§, and John Lafferty

*Department of Statistics, School of Social Work, and Center for Statistics and the Social Sciences, University of Washington, Seattle, WA 98195; and ‡Department of Statistics, Computer Science Department, and §Center for Automated Learning and Discovery, Carnegie Mellon University, Pittsburgh, PA 15213

PNAS is one of the world's most cited multidisciplinary scientific journals. The PNAS official classification structure of subjects is reflected in topic labels submitted by the authors of articles, largely related to traditionally established disciplines. These include broad field classifications into physical sciences, biological sciences, social sciences, and further subtopic classifications within the fields. Focusing on biological sciences, we explore an internal soft-classification structure of articles based only on semantic decompositions of abstracts and bibliographies and compare it with the formal discipline classifications. Our model assumes that there is a fixed number of internal categories, each characterized by multinomial distributions over words (in abstracts) and references (in bibliographies). Soft classification for each article is based on proportions of the article's content coming from each category. We discuss the appropriateness of the model for the PNAS database as well as other features of the data relevant to soft classification.

The Proceedings is there to help bring new ideas promptly into play. New ideas may not always be right, but their prominent presence can lead to correction. We must be careful not to censor even those ideas which seem to be off beat.
Saunders MacLane (1)

Are there internal categories of articles in PNAS that we can obtain empirically with statistical data-mining tools based only on semantic decompositions of words and references used?
Can we identify MacLane's "off-beat" but potentially path-breaking PNAS articles by using these internal categories? Do these empirically defined categories correspond in some natural way to the classification by field used to organize the articles for publication, or does PNAS publish substantial numbers of interdisciplinary articles that transcend these disciplinary boundaries? These are examples of questions that our contribution to the mapping of knowledge domains represented by PNAS explores.

Mathematical and statistical techniques have been developed for analyzing complex data in ways that could reveal underlying data patterns through some form of classification. Computational advances have made some of these techniques extremely popular in recent years. For example, 2 of the 10 most cited articles from 1997-2001 PNAS publications are on applications of clustering for gene-expression patterns (2, 3).

The traditional assumption in most methods that aim to discover knowledge in underlying data patterns has been that each subject (object or individual) from the population of interest inherently belongs to only one of the underlying subpopulations (clusters, classes, aspects, or pure type categories). This implies that a subject shares all its attributes, usually with some degree of uncertainty, with the subpopulation to which it belongs. Given that a relatively small number of subpopulations is often necessary for a meaningful interpretation of the underlying patterns, many data collections do not conform to the traditional assumption. Subjects in such populations may combine attributes from several subpopulations simultaneously. In other words, they may have a mixed collection of attributes originating from more than one subpopulation.

5220-5227 | PNAS | April 6, 2004 | vol. 101 | suppl. 1

Several different disciplines have developed approaches that have a common statistical structure that we refer to as mixed membership.
In genetics, mixed-membership models can account for the fact that individual genotypes may come from different subpopulations according to (unknown) proportions of an individual's ancestry. Rosenberg et al. (4) use such a model to analyze genetic samples from 52 human populations around the globe, identifying major genetic clusters without using the geographic information about the origins of individuals. In the social sciences, such models are natural, because members of a society can exhibit mixed membership with respect to the underlying social or health groups for a particular problem being studied. Hence, individual responses to a series of questions may have mixed origins. Woodbury et al. (5) use this idea to develop medical classification. In text analysis and information retrieval, mixed-membership models have been used to account for different topical aspects of individual documents.

In the next section, we describe a class of mixed-membership models that unifies existing special cases (6). We then explain how this class of models can be adapted to analyze both the semantic content of a document and its citations of other publications. We fit this document-oriented mixed-membership model to a subcollection of the PNAS database supplied to the participants in the Arthur M. Sackler Colloquium Mapping Knowledge Domains. We focus in our analysis on a high-level description of the fields in biological sciences in terms of a small number of extreme or basis categories. Griffiths and Steyvers (7) use a related version of the model for abstracts only and attempt a finer level of description.

Mixed-Membership Models

The general mixed-membership model that we work with relies on four levels of assumptions: population, subject, latent variable, and sampling scheme. Population-level assumptions describe the general structure of the population that is common to all subjects.
Subject-level assumptions specify the distribution of observable responses given individual membership scores. Membership scores are usually unknown and hence can be viewed also as latent variables. The next assumption, at the latent-variable level, is whether the membership scores are treated as fixed or random in the model. Finally, the last level of assumptions specifies the number of distinct observed characteristics (attributes) and the number of replications for each characteristic. We describe each set of assumptions formally in turn.

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "Mapping Knowledge Domains," held May 9-11, 2003, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.
†To whom correspondence should be addressed. E-mail: elena@stat.washington.edu.
© 2004 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10.1073/pnas.0307760101

Population Level. Assume there are K original or basis subpopulations in the population of interest. For each subpopulation k, denote by $f(x_j \mid \theta_{kj})$ the probability distribution for response variable j, where $\theta_{kj}$ is a vector of parameters. Assume that, within a subpopulation, responses to observed variables are independent.

Subject Level. For each subject, membership vector $\lambda = (\lambda_1, \ldots, \lambda_K)$ provides the degrees of a subject's membership in each of the subpopulations. The probability distribution of observed responses $x_j$ for each subject is defined fully by the conditional probability $\Pr(x_j \mid \lambda) = \sum_k \lambda_k f(x_j \mid \theta_{kj})$ [...]
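The subject-level mixture above can be illustrated with a short Python sketch. It assumes the conditional probability takes the convex-combination form $\sum_k \lambda_k f(x_j \mid \theta_{kj})$; the function name `mixed_membership_prob` and the numbers below are hypothetical and are not taken from the paper.

```python
def mixed_membership_prob(lam, f_values):
    """Return sum_k lam[k] * f_values[k].

    lam      -- membership scores (nonnegative, summing to 1)
    f_values -- f(x_j | theta_kj) evaluated at the observed x_j,
                one value per subpopulation k
    """
    if len(lam) != len(f_values):
        raise ValueError("need one density value per subpopulation")
    if min(lam) < 0 or abs(sum(lam) - 1.0) > 1e-9:
        raise ValueError("membership scores must lie on the simplex")
    return sum(l * f for l, f in zip(lam, f_values))


# Hypothetical subject: mostly subpopulation 1, partly 2 and 3.
lam = [0.7, 0.2, 0.1]
# Hypothetical probabilities of the observed response under each subpopulation.
f_values = [0.9, 0.5, 0.1]
prob = mixed_membership_prob(lam, f_values)
```

A subject with a membership vector concentrated on one subpopulation recovers that subpopulation's response distribution, which is the sense in which the basis subpopulations act as extreme categories.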
can be interpreted also as a latent classification process in which an aspect of origin is drawn first for each word and for each reference in a document, according to a multinomial distribution parameterized by the document-specific membership scores $\lambda$, and words and references then are generated from corresponding distributions of the aspects of origin (6). Rather than a mixture of K latent classes, the model can be thought of as a "simplicial mixture" (13) because the word and reference probabilities range over a simplex with corners $\theta_{1k}$ and $\theta_{2k}$, respectively.

The likelihood function is thus

$$p(d \mid \theta, \alpha) = \int \mathrm{Dir}(\lambda \mid \alpha) \prod_w p_\lambda(w)^{n(w,d)} \prod_r q_\lambda(r)^{n(r,d)} \, d\lambda \qquad [7]$$

$$= \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \int \prod_{k=1}^{K} \lambda_k^{\alpha_k - 1} \prod_w p_\lambda(w)^{n(w,d)} \prod_r q_\lambda(r)^{n(r,d)} \, d\lambda, \qquad [8]$$

where integrals are over the (K - 1) simplex.

It is important to note that the assumption of exchangeability among words and references (conditional independence given the membership scores) does not imply joint independence among the observed characteristics. Instead, the assumption of exchangeability means that dependencies among words and references can be explained fully by the membership scores of the documents. For an extended discussion on exchangeability in this context, see ref. 16.

Alternative Model for References

For the analysis of PNAS publications in the next section, we assume multinomial sampling of words and references. Although multinomial sampling is computationally convenient, it is not a realistic model of the way in which authors select references for the bibliography of an article. We briefly describe an example of more realistic generative assumptions for references.

Suppose an article focuses on a sufficiently narrow scientific area. In this case, the authors may have essentially perfect knowledge of the literature, and thus they would pay separate attention to each article in their pool of references as they consider whether to include it in the bibliography.
Under these circumstances, given that the pool of references contains R articles, we assume that a document is represented as $d = (\{x_1^{(r_1)}\}, x_2, x_3, \ldots, x_{R+1})$, where $x_1^{(r_1)}$ is a word in the abstract, R is the number of references, and $x_2, \ldots, x_{R+1}$ are all references in the pool. Reference counts do not change: they are given by $n(r, d) = 1$ if the bibliography of d contains a reference to r and by $n(r, d) = 0$ otherwise. Then our model for generating documents would be to sample $\lambda$ and $x_1^{(r_1)}$ according to Eqs. 4 and 5 and sample $x_j$, $j = 2, \ldots, R + 1$, according to

$$x_j \sim \mathrm{Bernoulli}[q_\lambda(x_j)], \quad \text{where } q_\lambda(x_j) = \sum_{k=1}^{K} \lambda_k \theta_{jk}. \qquad [9]$$

The likelihood function based on this alternative model would not only take into account which documents contain which references, but it also would incorporate the information about which references documents do not contain.

Both the basic model for references and any alternatives still would need to reflect the time ordering on publications and include in the pool of possible references only those that have been published already, perhaps even with a short time lag. However, even such changes are unlikely to produce a "correct" model for citation practices.

Estimating the Model

The primary complication in using a mixed-membership model such as is shown in Eqs. 4-6, in which the membership probabilities are random rather than fixed, is that the integral in Eq. 7 cannot be computed explicitly and therefore must be approximated. Two approximation schemes have been investigated recently for this problem and the associated problem of fitting the model. In the variational approach (12), the mixture terms $p_\lambda(w) = \sum_{k=1}^{K} \lambda_k \theta_{1k}(w)$ are bounded from below in a product form that leads to a tractable integral; the lower bound is then maximized.
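To make the idea of bounding a mixture term from below in a product form concrete, here is a minimal Python sketch. It assumes the bound is the standard Jensen inequality with auxiliary weights `phi` (an assumption; the text does not spell out the exact form used in ref. 12): $\log \sum_k \lambda_k \theta_k \ge \sum_k \phi_k \log(\lambda_k \theta_k / \phi_k)$, whose exponential is a product over k of powers of $\lambda_k$. All names and numbers are hypothetical.

```python
import math


def log_mixture(lam, theta_w):
    """Exact log of the mixture term: log sum_k lam[k] * theta_w[k]."""
    return math.log(sum(l * t for l, t in zip(lam, theta_w)))


def jensen_lower_bound(lam, theta_w, phi):
    """Jensen bound: sum_k phi[k] * log(lam[k] * theta_w[k] / phi[k]).

    phi must be a probability vector; terms with phi[k] == 0 contribute 0.
    """
    return sum(
        p * math.log(l * t / p)
        for l, t, p in zip(lam, theta_w, phi)
        if p > 0
    )


def tight_phi(lam, theta_w):
    """Weights that maximize the bound: phi[k] proportional to lam[k] * theta_w[k]."""
    weights = [l * t for l, t in zip(lam, theta_w)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical membership scores and per-aspect probabilities of one word.
lam = [0.6, 0.3, 0.1]
theta_w = [0.02, 0.10, 0.001]

exact = log_mixture(lam, theta_w)
loose = jensen_lower_bound(lam, theta_w, [1 / 3, 1 / 3, 1 / 3])
tight = jensen_lower_bound(lam, theta_w, tight_phi(lam, theta_w))
```

Because the bound is a product of powers of the $\lambda_k$, its integral against a Dirichlet density has a closed form, which is what makes the approximation tractable; maximizing over `phi` tightens the bound for a fixed $\lambda$.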
A related approach, called expectation-propagation (13), also approximates each mixture term in a product form but chooses the parameters of the factors by matching first and second moments. Either of these approximations to the integral (Eq. 7) can be used in an approximate expectation-maximization (EM) algorithm to estimate the parameters of the models. It is shown in ref. 13 that expectation-propagation in general leads to better approximations than the simple variational method for mixed-membership models, although we obtained comparable results with both approaches on the PNAS collection. The results reported below use the variational approximation.

The PNAS Database

The National Academy of Sciences provided the database for the participants of the colloquium. We focused on a subset of all biological sciences articles in volumes 94-98 (Julian years 1997-2001) of PNAS, thereby ignoring articles published in the social and physical sciences unless they have official dual classifications with one classification in the biological sciences. The reason for this narrowing of focus is twofold. First, the major share of PNAS publications in recent years represents research developments in the biological sciences. Thus, of 13,008 articles published in volumes 94-98, 12,036 (92.53%) are in the biological sciences. The share of social and physical sciences articles in volumes 94-98 is a much more modest 7.47%. Second, we assume that a collection of articles is characterized by mixed membership in a number of internal categories, and social and physical sciences articles are unlikely to share the same internal categories with articles from the biological sciences. We also automatically ignore other types of PNAS publications such as corrections, commentaries, letters, and reviews, because these are not traditional research reports.
Among the biological sciences articles in our database, 11 articles were not processed because they did not have an abstract, and 1 article was not processed because it did not contain any references.

PNAS is one of the world's most cited multidisciplinary scientific journals. Historically, when submitting a research paper to PNAS, authors have had to select a major category from physical, biological, or social sciences and a minor category from the list of topics. PNAS permits dual classifications between major categories and, in exceptional cases, within a major category. The lists of topics change over time to reflect changes in the National Academy of Sciences sections. PNAS, in its information for authors (revised in June 2002), states that it classifies publications in biological sciences according to 19 topics; the numbers of published articles and numbers of dual-classified articles in each topic are shown in Table 1.

The topic labels provide a classification structure for published materials, and most of the articles are members of only a single topic. For our mixed-membership model, we assume that there is a fixed number of extreme internal categories or aspects, each of which is characterized by multinomial distributions over words (in abstracts) and references (in bibliographies). Aspects are determined from contextual decompositions in such a way

that a multinomial distribution of words and references in each document is a convex combination of the corresponding distributions from the aspects. The convex combination for each article is based on proportions of the article's content coming from each category. These proportions, or membership scores, [...]

Table 1. Biological sciences publications in PNAS volumes 94-98 by subtopic

     Topic                          n
 1   Biochemistry                   2,578 (33)
 2   Medical sciences               1,547 (13)
 3   Neurobiology                   [illegible]
 4   Cell biology                   [illegible]343 (9)
 5   Genetics                       980 (14)
 6   Immunology                     865 (9)
 7   Biophysics                     636 (40)
 8   Evolution                      510 (12)
 9   Microbiology                   498 (11)
10   Plant biology                  488 (4)
11   Developmental biology          366 (2)
12   Physiology                     340 (1)
13   Pharmacology                   188 (2)
14   Ecology                        133 (5)
15   Applied biological sciences    94 (6)
16   Psychology                     88 (1)
17   Agricultural sciences          43 (2)
18   Population biology             43 (5)
19   Anthropology                   10 (0)
     Total                          11,981 (179)

The numbers of articles with dual classifications are given in parentheses.

[...] the "stop list" before fitting the model. If the distribution of stop words is not uniform across the internal categories, this alternative approach may potentially produce different results.

The following interpretations are based on examination of 50 high-probability words for each aspect. Note that enumeration of the aspects is arbitrary.
The first aspect includes words such as Ca2+, kinase, phosphorylation, receptor, and G (protein) channel, which pertain to cell signaling and intracellular signal transduction. It is likely that, in this aspect, signal transduction

is considered as applied to neuron signaling, as indicated by the words synaptic, neurons, and voltage. It is interesting that Ca2+ in the first aspect is the highest-probability contextual word over all the aspects.

Table 3. High-probability references by aspect

Aspect 1
Author             Journal, Year                  C
HAMILL OP          PFLUG ARCH EUR J PHY, 1981     72
LAEMMLI UK         NATURE, 1970                   322
HILLE B            IONIC CHANNELS EXCIT, 1992     58
BLISS TVP          NATURE, 1993                   54
SUDHOF TC          NATURE, 1995                   33
GRYNKIEWICZ G      J BIOL CHEM, 1985              31
SAMBROOK J         MOL CLONING LAB MANU, 1989     764
SHERRINGTON R      NATURE, 1995                   33
ROTHMAN JE         NATURE, 1994                   27
SIMONS K           NATURE, 1997                   35
SOLLNER T          NATURE, 1993                   25
ROTHMAN JE         SCIENCE, 1996                  24
THINAKARAN G       NEURON, 1996                   23
TOWBIN H           P NATL ACAD SCI USA, 1979      86
BERMAN DM          CELL, 1996                     21

Aspect 2
Author             Journal, Year                  C
SAITOU N           MOL BIOL EVOL, 1987            96
THOMPSON JD        NUCLEIC ACIDS RES, 1994        147
ALTSCHUL SF        NUCLEIC ACIDS RES, 1997        160
SAMBROOK J         MOL CLONING LAB MANU, 1989     764
ALTSCHUL SF        J MOL BIOL, 1990               253
FELSENSTEIN J      EVOLUTION, 1985                51
KISHINO H          J MOL EVOL, 1989               31
STRIMMER K         MOL BIOL EVOL, 1996            31
KIMURA M           J MOL EVOL, 1980               34
EISEN MB           P NATL ACAD SCI USA, 1998      60
SWOFFORD DL        PAUP PHYLOGENETIC AN, 1993     25
KIMURA M           NEUTRAL THEORY MOL E, 1983     28
KUMAR S            MEGA MOL EVOLUTIONAR, 1993     26
HASEGAWA M         J MOL EVOL, 1985               24
NEI M              MOL EVOLUTIONARY GEN, 1987     28

Aspect 3
Author             Journal, Year                  C
SAMBROOK J         MOL CLONING LAB MANU, 1989     764
LAEMMLI UK         NATURE, 1970                   322
ALTSCHUL SF        J MOL BIOL, 1990               253
BRADFORD MM        ANAL BIOCHEM, 1976             209
SANGER F           P NATL ACAD SCI USA, 1977      140
MILLER JH          EXPT MOL GENETICS, 1972        102
ALTSCHUL SF        NUCLEIC ACIDS RES, 1997        160
THOMPSON JD        NUCLEIC ACIDS RES, 1994        147
CHOMCZYNSKI P      ANAL BIOCHEM, 1987             206
HARLOW E           ANTIBODIES LAB MANUA, 1988     129
BLATTNER FR        SCIENCE, 1997                  56
SCHENA M           SCIENCE, 1995                  40
KYTE J             J MOL BIOL, 1982               51
MURASHIGE T        PHYSL PLANTARUM, 1962          33
TOWBIN H           P NATL ACAD SCI USA, 1979      86

Aspect 4
Author             Journal, Year                  C
HOGAN B            MANIPULATING MOUSE E, 1994     68
CHOMCZYNSKI P      ANAL BIOCHEM, 1987             206
TALAIRACH J        COPLANAR STEREOTAXIC, 1988     60
PAXINOS G          RAT BRAIN STEREOTAXI, 1986     38
SAMBROOK J         MOL CLONING LAB MANU, 1989     764
NAGY A             P NATL ACAD SCI USA, 1993      39
MANSOUR SL         NATURE, 1988                   37
BRAND AH           DEVELOPMENT, 1993              46
HOGAN B            MANIPULATING MOUSE E, 1986     32
TYBULEWICZ VLJ     CELL, 1991                     46
KWONG KK           P NATL ACAD SCI USA, 1992      24
DUNLAP JC          CELL, 1999                     19
LI E               CELL, 1992                     35
ALTSCHUL SF        J MOL BIOL, 1990               253
EISEN MB           P NATL ACAD SCI USA, 1998      60

Aspect 5
Author             Journal, Year                  C
KRAULIS PJ         J APPL CRYSTALLOGR, 1991       202
JONES TA           ACTA CRYSTALLOGR A, 1991       174
OTWINOWSKI Z       METHOD ENZYMOL, 1997           140
BRUNGER AT         ACTA CRYSTALLOGR D 5, 1998     118
LASKOWSKI RA       J APPL CRYSTALLOGR, 1993       96
NICHOLLS A         PROTEINS, 1991                 85
NAVAZA J           ACTA CRYSTALLOGR A, 1994       81
SAMBROOK J         MOL CLONING LAB MANU, 1989     764
LAEMMLI UK         NATURE, 1970                   322
MERRITT EA         ACTA CRYSTALLOGR D, 1994       66
BRUNGER AT         NATURE, 1992                   48
BRADFORD MM        ANAL BIOCHEM, 1976             209
MERRITT EA         METHOD ENZYMOL, 1997           41
WUTHRICH K         NMR PROTEINS NUCL AC, 1986     40
KABSCH W           BIOPOLYMERS, 1983              39

Aspect 6
Author             Journal, Year                  C
SAMBROOK J         MOL CLONING LAB MANU, 1989     764
SIKORSKI RS        GENETICS, 1989                 102
DIGNAM JD          NUCLEIC ACIDS RES, 1983        68
LEVINE AJ          CELL, 1997                     57
ELDEIRY WS         CELL, 1993                     54
HARLOW E           ANTIBODIES LAB MANUA, 1988     129
HARPER JW          CELL, 1993                     50
FRIEDBERG EC       DNA REPAIR MUTAGENES, 1995     58
ALTSCHUL SF        J MOL BIOL, 1990               253
OGRYZKO VV         CELL, 1996                     41
WEINBERG RA        CELL, 1995                     40
KAMEI Y            CELL, 1996                     39
HOLLSTEIN M        SCIENCE, 1991                  41
FIELDS S           NATURE, 1989                   67
YANG XJ            NATURE, 1996                   37

Aspect 7
Author             Journal, Year                  C
DENG HK            NATURE, 1996                   46
DRAGIC T           NATURE, 1996                   45
DORANZ BJ          CELL, 1996                     45
FENG Y             SCIENCE, 1996                  43
ALKHATIB G         SCIENCE, 1996                  43
COCCHI F           SCIENCE, 1995                  41
CHOE H             CELL, 1996                     41
THOMPSON CB        SCIENCE, 1995                  38
ZOU H              CELL, 1997                     40
DARNELL JE         SCIENCE, 1994                  40
MUZIO M            CELL, 1996                     35
LI P               CELL, 1997                     36
XIA ZG             SCIENCE, 1995                  38
BOLDIN MP          CELL, 1996                     34
PEAR WS            P NATL ACAD SCI USA, 1993      57

Aspect 8
Author             Journal, Year                  C
CHOMCZYNSKI P      ANAL BIOCHEM, 1987             206
BRADFORD MM        ANAL BIOCHEM, 1976             209
LAEMMLI UK         NATURE, 1970                   322
LOWRY OH           J BIOL CHEM, 1951              73
ZHANG Y            NATURE, 1994                   31
KUIPER GGJM        P NATL ACAD SCI USA, 1996      27
SAMBROOK J         MOL CLONING LAB MANU, 1989     764
MONCADA S          PHARMACOL REV, 1991            25
PELLEYMOUNTER MA   SCIENCE, 1995                  23
CAMPFIELD LA       SCIENCE, 1995                  23
KUIPER GGJM        ENDOCRINOLOGY, 1997            22
HALAAS JL          SCIENCE, 1995                  21
BLIGH EG           CAN J BIOCH PHYSL, 1959        45
BROWN MS           CELL, 1997                     28
ZHANG SH           SCIENCE, 1992                  18

For each aspect, the top references are shown in order of decreasing probability, according to the model. The count of each reference in the PNAS collection is shown in the right column (C).

Frequent words for the second aspect indicate that its context is related to molecular evolution that deals with natural selection on the population and intraspecies level and mechanisms of acquiring genetic traits. Words in aspect 3 pertain mostly to the plant molecular biology area. High-probability words in aspect 4 relate to studies of neuronal responses in mice and humans, which identify this aspect as related to developmental biology and neurobiology. Aspect 5 contains words that can be associated with biochemistry and molecular biology. Words in aspect 6 point to genetics and molecular biology. Frequent words for aspect 7 contain such terms as immune, IL (or interleukin), antigen, (IFN) gamma, and MHC class II, which point to a relatively new area in immunology, namely, tumor immunology.
The presence of such words as HIV and virus in aspect 7 indicates a more general immunology content. For aspect 8, words such as increase or reduced, treatment, effect, fold, and P (assuming it stands for P value) correspond to general reporting of experimental results, likely in the area of endocrinology.

As for words, multinomial distributions are estimated for the references that are present in our collection. For estimation, we

only need unique indicators for each referenced article. After the model is fitted, attributes of high-probability references for each aspect provide additional information about its contextual interpretation. Table 3 provides attributes of 15 high-probability references for each aspect that were available in the database together with PNAS citation counts (number of times cited by PNAS articles in the database). Notice that, because the model draws from the contextual decomposition, having a high citation count is not necessary for having high aspect probability.

Fig. 1. Distributions by aspect of the posterior means of membership scores for articles published in evolution and genetics. [Histogram panels for aspects 1-8, shown separately for evolution and for genetics; horizontal axis ticks at 0.0, 0.4, and 0.8.]

In Table 3, high-probability references for aspect 1 are dominated by publications in Nature; references in aspect 7 are mostly Nature, Cell, and Science publications from the mid-1990s. Examining titles of the references (see Table 5, which is published as supporting information on the PNAS web site, www.pnas.org), we see that manuals, textbooks, and references to methodology articles seem to be prominent for many aspects. Thus, among the first 15 high-probability references, all 15 from aspect 3 and more than half from aspect 4 are of this methodological type. In contrast, most high-probability references for aspect 7 are those that report new findings.

Table 4. Mean decompositions of aspect membership scores (Lower), together with a graphical representation of this table (Upper)

[Upper: shaded grid with one row per topic (Biochemistry, Medical Sciences, Neurobiology, Cell Biology, Genetics, Immunology, Biophysics, Evolution, Microbiology, Plant Biology, Developmental Biology, Physiology, Pharmacology) and one column per aspect.]

                         Aspect
Topic                    1       2       3       4       5       6       7       8
Biochemistry             0.0469  0.0347  0.1810  0.0178  0.3838  0.2057  0.0477  0.0823
Medical sciences         0.0244  0.0502  0.0938  0.1274  0.0181  0.1075  0.3286  0.2500
Neurobiology             0.2875  0.0398  0.0722  0.3768  0.0196  0.0296  0.0441  0.1304
Cell biology             0.1691  0.0165  0.1420  0.0684  0.1097  0.2423  0.1637  0.0884
Genetics                 0.0141  0.3056  0.1422  0.1532  0.0487  0.2621  0.0395  0.0347
Immunology               0.0127  0.0593  0.1003  0.0413  0.0422  0.0915  0.6244  0.0283
Biophysics               0.0507  0.0295  0.2398  0.0162  0.5496  0.0542  0.0176  0.0423
Evolution                0.0042  0.7679  0.0465  0.0913  0.0289  0.0378  0.0101  0.0133
Microbiology             0.0158  0.1725  0.3431  0.0335  0.0647  0.1174  0.1870  0.0661
Plant biology            0.1333  0.0983  0.4400  0.0360  0.0462  0.0954  0.0166  0.1344
Developmental biology    0.0475  0.0288  0.1071  0.3729  0.0274  0.2558  0.0974  0.0631
Physiology               0.3179  0.0275  0.0712  0.1123  0.0258  0.0116  0.0595  0.3743
Pharmacology             0.2883  0.0161  0.0772  0.1965  0.0299  0.0349  0.0537  0.3033

For clarity, the six lowest-frequency topics, which make up 3.4% of the biological sciences articles, are not shown.

Titles of the references indicate neurobiology content for aspect 1, molecular evolution for aspect 2, and plant molecular biology for aspect 3, which is in agreement with our conclusions based on high-probability words. For other aspects, titles of high-probability references help us refine the aspects. Thus, aspect 4 mostly pertains to the study of brain development, in particular, via genetic manipulation of mouse embryo.
Aspect 5, identified as biochemistry and molecular biology by the words, can be described as protein structural biology by the references. Aspect 6 may be labeled in a more detailed way as "DNA repair, mutagenesis, and cell cycle." The references for aspects 7 and 8 shift the focus toward HIV infection and studies of molecular mechanisms of obesity, respectively.

Among frequent references for the eight aspects, there are seven PNAS articles that share a special feature: they were all either coauthored or contributed by a distinguished member of the National Academy of Sciences. In fact, one article was coauthored by a Nobel prize winner, and two were contributed by other Nobelists. Although these articles do not have the highest counts in the database, they are notable for various reasons; e.g., one is on clustering and gene expression (2), and it is also one of the two highly cited PNAS articles on clustering that we mentioned in the Introduction. These seven articles may not necessarily be off-beat, but they may be among those that fulfill MacLane's petition regarding the special nature of PNAS.

From our analysis of high-probability words, it is difficult to determine whether the majority of aspects correspond to a single topic from the official classifications in PNAS biological science publications. To investigate whether there is a correspondence between the estimated aspects and the given topics, we examine aspect loadings (means of posterior membership scores) for each article. Given estimated parameters of the model, the distribution of each article's loadings can be obtained by means of Bayes' theorem. The variational and expectation-propagation procedures provide Dirichlet approximations to the posterior distribution $p(\lambda \mid d, \theta)$ for each document d. We use the mean of this Dirichlet as an estimate of the weight of the document on each aspect.

Histograms of these loadings are provided in Fig. 1 for articles in evolution and genetics. Relatively high histogram bars near zero correspond to the majority of articles having small posterior membership scores for the given aspect. Among the articles published in genetics, some can be considered as full members in aspects 2, 3, 4, and 6, but many have mixed membership in these and other aspects. Articles published in evolution, on the other hand, show a somewhat different behavior: the majority of these articles come fully from aspect 2.

The sparsity of the loadings can be gauged also by the parameters of the Dirichlet distribution, which are estimated as $\alpha_1 = 0.0195$, $\alpha_2 = 0.0203$, $\alpha_3 = 0.0569$, $\alpha_4 = 0.0346$, $\alpha_5 = 0.0317$, $\alpha_6 = 0.0363$, $\alpha_7 = 0.0411$, and $\alpha_8 = 0.0255$. The estimated Dirichlet, which is the generative distribution of membership scores, is "bathtub-shaped" on the simplex; as a result, articles tend to have relatively high membership scores in only a few aspects.

To summarize the aspect distributions for each topic, we provide mean loadings and the graphical representation of these values in Table 4 Upper. Larger values correspond to darker colors, and the values below some threshold are not shown (white) for clarity. As an example, the mean loading of 0.2883 for pharmacology in the first aspect is the average of the posterior means of the membership scores for this aspect over all pharmacology publications in the database.
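The computation of mean loadings by topic can be sketched in a few lines of Python. This is a minimal illustration, assuming each article comes with the parameters of its approximate Dirichlet posterior (called `gamma` here, a hypothetical name) and a single topic label; the two-aspect numbers are invented for the example and are not from the PNAS database.

```python
def dirichlet_mean(gamma):
    """Mean of a Dirichlet with parameters gamma: gamma[k] / sum(gamma)."""
    total = sum(gamma)
    return [g / total for g in gamma]


def mean_loadings_by_topic(articles):
    """Average the per-article loadings (posterior means) within each topic.

    articles -- iterable of (topic, gamma) pairs, where gamma holds the
                Dirichlet posterior parameters for that article
    Returns a dict mapping topic -> list of mean loadings, one per aspect.
    """
    sums, counts = {}, {}
    for topic, gamma in articles:
        loading = dirichlet_mean(gamma)
        acc = sums.setdefault(topic, [0.0] * len(loading))
        for k, value in enumerate(loading):
            acc[k] += value
        counts[topic] = counts.get(topic, 0) + 1
    return {t: [v / counts[t] for v in acc] for t, acc in sums.items()}


# Hypothetical two-aspect example with three articles.
articles = [
    ("pharmacology", [3.0, 1.0]),  # loadings 0.75, 0.25
    ("pharmacology", [1.0, 3.0]),  # loadings 0.25, 0.75
    ("evolution", [1.0, 9.0]),     # loadings 0.10, 0.90
]
table = mean_loadings_by_topic(articles)
```

Each row of the resulting table sums to 1 because every article's loadings do, which matches the row sums of the mean decompositions reported for the topics.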
Note that this percentage is based on the assumption of mixed membership and can be interpreted as indicating that 29% of the words in pharmacology articles originate from aspect 1, according to our model. Examining the rows of Table 4, we see that most subtopics in biological sciences have major components from more than one aspect (extreme or basis category). Examining the columns, we can gain additional insights in interpretation of the extreme categories. Aspect 8, for example, is the aspect of origin for a combined 37% of physiology, 30% of pharmacology, and 25% of medical sciences articles, according to the mixed-membership model. The most prominent subtopic is evolution; it has the greatest influence in defining an extremal category, aspect 2. This is consistent with a special place that evolution holds among the biological sciences by standing apart both conceptually and methodologically. 1. MacLane, S. (1997) Proc. Natl. Acad. Sci. USA 94, 5983-5985. 2. Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998) Proc. Natl. Acad. Sci. USA 95,14863-14868. 3. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander E. S. & Golub, T. R. (1999) Proc. Natl. Acad. Sci. USA 96, 2907-2912. 4. Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A. & Feldman, M. W. (2002) Science 298, 2381-2385. 5. Woodbury, M. A., Clive, J. & Garson, A. (1978) Comput. Biomed. Res. 11, 277-298. 6. Erosheva, E. A. (2002) Ph.D. thesis (Carnegie Mellon University, Pittsburgh). 7. Griffiths, T. L. & Steyvers, M. (2004) Proc. Natl. Acad. Sci. USA 101, 5228-5235. 8. Manton, K. G., Woodbury, M. A. & Tolley, H. D. (1994) StatisticalApplications Using Fuzzy Sets (Wiley Interscience, New York), p. 312. 9. Potthoff, R. G., Manton, K. G., Woodbury, M. A. & Tolley, H. D. (2000) J. Classification 17, 315-353. 10. Pritchard, J. K., Stephens, M. & Donnelly, P. (2000) Genetics 155, 945-959. Erosheva et al. 
Finally, we compare the loadings (posterior means of the membership scores) of dual-classified articles to those that are singly classified. We consider two articles as similar if their loadings are equal for the first significant digit for all aspects. One might interpret singly classified articles that are similar to dual-classified as articles that should have had dual classification but did not. We find that, for 11% of the singly classified articles, there is at least one similar dual-classified article. For example, three biophysics dual-classified articles with loadings 0.9 for the second and 0.1 for the third aspect turned out to be similar to 86 singly classified articles from biophysics, biochemistry, cell biology, developmental biology, evolution, genetics, immunology, medical sciences, and microbiology.

Concluding Remarks

We have presented results from fitting a mixed-membership model to PNAS biological sciences publications, from 1997 to 2001, providing an implicit semantic decomposition of words and references in the articles. The model allows us to identify extreme internal categories of publications and to provide soft classifications of articles into these categories. Our results show that the traditional discipline classifications correspond to a mixed distribution over the internal categories. Our analyses and modeling were intended to capture a high-level description of a subset of PNAS articles. In an often-quoted statement, Box remarked: "all models are wrong" (17). In our case, the assumption of a bag of words and references in the mixed-membership model clearly oversimplifies reality; the model does not account for the general structure of the language, nor does it capture the compositional structure of bibliographies. Many interesting extensions of the basic model we have explored are possible, from hierarchical models of topics to more detailed models of citations and dynamic models of the evolution of scientific fields over time.
Nevertheless, as Box notes, even wrong models may be useful. Our results indicate that mixed-membership models can be useful for analyzing the implicit structure of scientific publications.

We thank Dr. Anna Lokshin (University of Pittsburgh, Pittsburgh) for help with interpreting model results from a biologist's perspective. E.E. was supported by National Institutes of Health Grants 1 R01 AG023141-01 and R01 CA94212-01; S.F. was supported by National Institutes of Health Grant 1 R01 AG023141-01. J.L. was supported by National Science Foundation Grant CCR-0122581 and Advanced Research and Development Activity Contract MDA904-00-C-2106.

11. Hofmann, T. (2001) Machine Learn. 42, 177-196.
12. Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1002.
13. Minka, T. P. & Lafferty, J. (2002) Uncertainty in Artificial Intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (Morgan Kaufmann, San Francisco), pp. 352-359.
14. Cohn, D. & Hofmann, T. (2001) Neural Information Processing Systems 13 (MIT Press, Cambridge, MA).
15. Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D. M. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 1107-1135.
16. Blei, D. M., Jordan, M. I. & Ng, A. Y. (2003) in Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, eds. Bernardo, J. M., Bayarri, M. J., Dawid, A. P., Berger, J. O., Heckerman, D., Smith, A. F. M. & West, M. (Oxford Univ. Press, Oxford), pp. 25-44.
17. Box, G. E. P. (1979) in Robustness in Statistics, eds. Launer, R. L. & Wilkinson, G. G. (Academic, New York), p. 202.

PNAS | April 6, 2004 | vol. 101 | suppl. 1 | 5227
