Questions? Call 888-624-8373

PAPERBACK
list:$33.75
Web:$30.38
add to cart

Rights & Permissions

topleft topright

(Sackler NAS Colloquium) Mapping Knowledge Domains (2004)
Proceedings of the National Academy of Sciences (PNAS)

Page
59
bottomleft bottomright
Page
59

Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 59
Colloquium A method for finding communities of related genes Dennis M. Wilkinson and Bernardo A. Hubermana Stanford University and HP Laboratories, 1501 Page Mill Road, Palo Alto, CA 94394 We present a method for creating a network of gene co-occur- rences from the literature and partitioning it into communities of related genes. The way in which our method identifies communi- ties makes it likely that the component genes of each community will be related by their function. The method processes a large database of article abstracts, synthesizing information from many sources to shed light on groups of genes that have been shown to interact. It is a tool to be used by researchers in the biomedical sciences to swiftly search for known interactions and to provide insight into unexplored connections. The partitioning procedure is designed to be particularly applicable to large networks in which individual nodes may play a role in more than one community. In this paper, we explain the details of the method, in particular the partitioning process. We also apply the method to produce com- munities of genes related to colon cancer and show that the results are useful. The automated analysis of biomedical text is useful in any form, because knowledge in the biomedical sciences is predominantly disseminated in the form of journal articles. However, when applied to the subject of human gene function, automated text analysis is critically important. There are ~15,000 currently known human genes and >1 million related articles in the Medline databases alone. Moreover, genes act in a complex interrelated way, so information from many experiments is necessary to explain the function of a typical gene. A comprehensive study of even a simple cellular process involving several genes might require a researcher to be familiar with hundreds of articles. Merely locating all relevant articles in a database by using a simple search utility would be time consuming, not to mention inefficient and difficult, because of shortcomings of the human gene nomenclature system. In contrast, our method indexes gene symbol occurrences in all articles of large database such as Medline in <1 days and then can produce a list of communities of functionally related genes in another half day.0 In this article, we present a method to find communities of related genes. The method creates a network of gene symbol co-occurrences from Medline article abstracts and partitions this network into communities. The genes in each community are likely to be functionally related because of the way in which the communities are identified, and because most recent research on genes and proteins has been devoted to their function This method can thus be a valuable tool that both summarizes available information and indicates possible directions of re- search. The format of the results is designed to make them easy to use. The results can easily include a list of the Medline PubMed identification numbers (PMID) for articles containing each gene and pair of genes to facilitate research. Varying the user-selected key words (see Method Overview) allows the method to be applied repeatedly and focused on particular topics of interest. We apply our method to the Medline database to identify communities of genes related to colon cancer. We show that genes placed together in a community that are not explicitly connected in any Medline article or in the Online Mendelian Inheritance in Man (OMIM)e listing for either gene can nevertheless be related by their function. The communities thereby imply connections among genes www.pnas.org/cgi/doi/10.1073/pnas.0307740100 that may otherwise be overlooked or that would require much time and effort to be found manually. We also show that our method separates genes that co-occur but are not functionally related into different communities. Finally, we demonstrate cases in which a node common to two communities indicates a link between two groups of related genes. It is important to note that the gene communities in the results are not meant to perfectly reproduce biological reality. The communities are simply interesting artifacts within the network that provide a powerful method for organizing and presenting information from the literature. Method Overview Gene symbol mentions are first extracted from almost auf 12.5 million Medline article titles and abstracts. We then select sets of genes found to be statistically correlated to a set of user- selected (related) key words. These two steps are performed following the procedure of ref. 1. This procedure includes steps to account for alias symbols and to distinguish gene symbol abbreviations from identical abbreviations referring to other concepts" (2~. Selecting genes correlated to certain key words ensures continuity of biological function of the genes considered and reduces the number of genes considered so the results can be readable and useful. Networks are then created from these sets of genes. In the networks, each node represents a gene, and an edge connects two genes if they co-occur in at least one article. The degree distribution of the networks follows a power law, as we show, so their clustering structure is scale-free and there is no typical community size. Therefore, to find communities, we partition the graph using a nonlocal process exploiting the concept of betweenness centrality (3~. This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "Mapping Knowledge Domains," held May 9-11, 2003, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA. Abbreviations: COX-2, cyclooxygenase 2; PTGS2, prostaglandin-endoperoxide synthase 2; ON, Girvan-Newman; PMID, PubMed identification number. aTo whom correspondence should be addressed. E-mail: huberman~?hpl.hp.com. bMedline is the foremost English-language database of biomedical articles. The search utility for Medline is PubMed (www.ncbi.nim.nih.gov/entrez/query.fcgi). 'The machine we used is a standard 1-GHz machine with an Intel Pentium 3 processor running RED HAT EINUX. The time to perform this step increases as nm2, where n is the number of genes in the network and m is the number of pairs of genes that co-occur. As we explain later, genes are selected to create a network, and if the network is too large, this step could be very slow. We found that a size of <1,000 genes is generally tractable for our method. ehttp://www3.ncbi.nim.nih.gov/omim. This web site provides detailed information about many genes, proteins, and other biological objects as well as references to related articles. fWe omitted the small fraction of abstracts published before 1990, because they very rarely discuss gene function and tend to use outmoded nomenclature. In addition, we neglect abstracts that mention more than four genes, because they are typically abstracts of survey-type articles that impair the community identification process. These two types of articles form a very small fraction of the Medline database. 9An example is DCC, which may be used to refer to the gene "deleted in colon cancer" or the cancer assay method "dextran-coated charcoal." Such ambiguous symbols are very common because of the frequent use of abbreviations in biological texts. 2004 by The National Academy of Sciences of the USA PNAS 1 April 6, 2004 1 vol. 101 1 suppl. 1 1 5241-5248

OCR for page 60
The partitioning process may be applied to any network, but it is particularly applicable to networks of several hundred to 1,000 nodes in which nodes may play a role in more than one community. It is based on the process of Girvan and Newman (GN) (ref. 4; for a faster algorithm for finding communities, see ref. 5), which was shown to give very good results for a variety of small graphs. The general idea of our process is the same as that of ON, but the details are significantly different. Our modifications allow nodes to be placed in several communities if the structure of the network indicates that the nodes belong there, and they provide a quantitative estimate of how strongly each node belongs to each community. This is important when single nodes play a role in several communities or when the source information is incomplete or flawed. It can also indicate a link between two communities that have one or more nodes in common, and it "smooths" the process of partitioning, which for any large network is somewhat arbitrary. The modifications also allow communities to be identified as discrete units. Identifying discrete communities is particularly useful when community sizes are not known in advance and makes the results easier to use if the network is large (6~. Motivation and Previous Work For the most part, biologists now understand the rules by which the system of genes, proteins, RNAs, and other cellular constit- uents operates; what remains is to determine the exact details of this system. A worldwide effort is underway in the biomedical community to identify and understand the cellular interactions at the root of human health. Given the enormous number of human genes and the complex interrelated nature of gene and protein interaction, this task is more than a little daunting, and accomplishing it will involve an unprecedented level of collab- oration and information exchange. However, the current con- dition of knowledge organization in the field makes extensive collaboration and complete information exchange difficult. As mentioned above, information pertinent to human gene function exists largely in the form of an astoundingly large number of journal articles. Medline yields 1.5 million hits when queried for "gene" or "protein" with "human," ~150,000 of which were published in 2002 alone. Our results, taking into account co-occurrences within the set of 682 genes we identified as correlated to colon cancer, were created from the 7,985 article abstracts from an astonishing 904h different journals. Given these numbers, it is easy to see that an expert, although familiar with many hundreds of articles, could nonetheless be unaware of developments related to his or her area of interest. And, whereas online biomedical databases provide easy access to abstracts, a manual literature survey would encounter difficulties beyond the large number of results, due to the nomenclature system for human genes. Both the existence of multiple alias symbols for many genes and the frequent occurrence of unrelated abbrevi- ations equivalent to gene symbols interfere with any simple search utility. Despite the impracticality of an exhaustive manual search, online databases of journal abstracts present a gold mine of available information. In fact, the ability to sift through millions of abstracts, extract pertinent information, and present it in a useful format is arguably essential to the understanding of human gene function. Accordingly, automated text analysis has been an area of focus in the field of bioinformatics. One approach has been to extract detailed information by using natural language-processing techniques (7-17~. Our method follows a different line of attack: only simple informa- hThis number was determined by comparing the International Standard Serial Numbers of the journals in the Medline listings of the 7,985 abstracts involved in creating the network of genes related to colon cancer. 5242 1 www.pnas.org/cgi/doi/10.1073/pnas.0307740100 tion, such as gene and protein names, is extracted from each article, and more detailed conclusions are then inferred from this information. Gene and protein term identification in particular has been simplified by the recent appearance of online libraries of gene and protein symbols (refs. 18-20 show this can otherwise be a major task). However, data obtained by simple term matching will be highly error-prone due to false positive iden- tifications of human gene symbols, unless carefully treated. A reasonable conclusion that can be drawn from gene occur- rence data is that genes mentioned in the same article are related in some way. This has been shown to be true both on large (21) and small (22, 23) scales. It is also possible to connect genes to key words found in articles and thus to biological processes, as in refs. 1 and 24. These results have been applied in conjunction with natural language-processing techniques to find related groups of genes, from among a restricted set of genes mentioned in a restricted set of articles, in refs. 25 and 26. Our method, while similar, has a very different way of finding communities that requires neither the preprocessing step of selecting genes or articles nor natural language processing. Obtaining Co-occurrence Data As stated above, the first step of the method is to identify literature co-occurrences of genes relevant to a disease by using the procedure of ref. 1. This section is simply a brief summary of this procedure; for more detail, please see the referenced article. Using a list of all official and alias symbols for human genes compiled from the Human Genome Organisation (HUGO) (www.gene.ucl.ac.uk/nomenclature), OMIM, and Locuslink (www.ncbi.nlm.nih.gov/LocusLink) web sites, we automatically extracted the gene name symbols and disease mentions from all Medline article titles and abstracts. Where possible, we replaced alias symbols with official ones. We also extracted key words related to a certain disease and used them to determine which genes were statistically correlated with this disease. To test a gene for statistical relevance to a disease, we simply compared the observed number of gene-disease co-occurrences to the number we would expect given no correlation. Because the distribution of co-occurrences of two uncorrelated terms follows a binomial distribution, a value of observed gene-disease co- occurrences more than one SD greater than the binomial expected value indicates correlation. This statistical method is preferable to the "term frequency, inverse document frequency" metric, because it accurately handles infrequently mentioned genes, which are very common. The final step in obtaining data was to remove false positives, which occur frequently because gene symbols generally coincide with other abbreviations having nothing to do with genes. For example, the symbol HDC, representing the gene histidine decarboxylase, was commonly used in the literature as an abbreviation for high dose chemotherapy. We disambiguated the data, using a method shown in ref. 2, which yielded unambiguous symbol identifications with a low error rate. Gene Graph The creation of gene graphs from the co-occurrence data was performed following a well known procedure (21, 23~. Each vertex in the graph represents a gene, and an edge exists between two vertices if the genes they represent co-occur at least once. We did not use weighted edges. In creating the graph, we neglected articles published before 1990 and articles that listed more than five genes, as mentioned in the Introduction. The resulting graph has a power law distribution in its degree. That is, the number of vertices of degree x is given by Ax-0, where ,B < 0. This is shown in Fig. 1, where we plot the data on a log-log scale for gene graphs corresponding to several diseases. The properties of such power law graphs have been extensively studied (27-29~. It has been shown that random graphs with Wilkinson and Huberman

OCR for page 61
PAL, b=2.25 beast me, b=2.38 Co1cri cancer, b=2.40 6 4 2 ma\ I,,, 1 2 3 'I 5 -2 1 2 3 '%,. 5 -2 . ~ - 1 2 3 \~ 5 Fig. 1. The number of vertices (y axis) is plotted against the degree of the vertex (x axis) for several diseases on a log-log scale. We followed the usual binning procedure in plotting the data. The deviation from the power law for low vertex degree is typical. AML, acute myelogenous leukemia. 2<~3<3.5 consist of one giant connected component and other small components of size O(ln(N)) (28~. Here N is the size of the graph, and I3 is the power law exponent. The component structure of the gene graphs agrees with the predictions of ref. 28 for random graphs, as shown in Table 1. Because the smaller components contain few genes with few neighbors, they are of limited interest. They usually consist of little-known genes that have not been related to other genes. In what follows, we focus exclusively on identifying communities within the giant component. Partitioning the Graph into Communities There is no formal definition for a community of vertices within a graph. A graph can be said to have community structure if it consists of subsets of genes, with many edges connecting vertices of the same subset but few edges lying between subsets (4~. Finding communities within a graph is an efficient way to identify groups of related vertices. As mentioned in the Introduction, the community discovery process we use is based on that of GN (the GN process or method), which has been shown to identify communities in graphs with known community structure to a high degree of accuracy (4~. Our modifications were necessary to make the method applicable to gene graphs, which are large and are created from source information that may by nature be incom- plete or flawed. In particular, we identify many possible com- munity structures and average them into a final list of commu- nities. The statistical character of this step provides a more accurate picture of the complicated nature of community struc- ture of a gene graph, without undermining the effectiveness of the basic principle of the algorithm. Table 1. Sizes of connected components in several gene graphs Components Disease (no. of statistically relevant genes) Acute myelogenous leukemia (488) Breast cancer (816) Colon cancer (682) Wilkinson and Huberman Size No. 460 1 4 1 3 4 2 6 686 1 6 2 5 1 4 5 3 9 3 33 561 1 4 4 3 15 2 30 A concept central to the community discovery process is the betweenness centrality (hereafter betweenness) of a vertex or edge. The betweenness of an edge AB (or a vertex A) is defined as the number of shortest paths between pairs of other vertices that contain AB (or A). As mentioned before, this concept was introduced (3) as a measure of influence of an individual, with respect to information flow, within a social network., However, it was noticed (4) that betweenness may also be used to identify communities within a graph, because intercommunity edges (those that lie between different communities) are much more likely to have a higher betweenness than intracommunity edges (edges that lie within one community). To explain the community discovery process, we consider as a first example the small graph shown in Fig. 2. This graph consists of two well defined communities: the four vertices denoted by squares, including vertex A, and the nine denoted by circles, including vertex B. In the graph of Fig. 2, edge AB has the highest betweenness. If we were to remove it, the graph would split into two connected components, the square and circle communities. This illustrates the idea behind the GN method of imposing community struc- ture on a graph. One repeatedly identifies intercommunity edges by the criterion that they have higher betweenness than intra- community edges and removes them. This procedure splits the giant component into many separate components, which coin- cide with the communities of the original graph. It is important to note that the removal of an edge strongly affects the betweenness of many others, so that one must repeatedly recalculate the betweenness of all edges. To do this quickly, we used the fast algorithm of ref. 29 or 30. At a certain point in our procedure, as opposed to the GN method, we stop removing edges from a component when we cannot further meaningfully subdivide it into communities; for example, as in Fig. 2, after removing edge AB. This allows us to obtain distinct communities of nodes, such as the circles and squares of Fig. 2. What criterion tells us when to stop? Structurally, a component of five or fewer vertices cannot consist of two viable communities. The smallest possible such component is size 6, consisting of two triangles linked by one edge (Fig. 3). Fig. 2. A graph consisting of two communities. PNAS 1 April 6, 2004 1 vol. 101 1 suppl. 1 1 5243

OCR for page 62
Fig. 3. The smallest possible graph consisting of two communities. Components of size '6 can also be individual communities, like the group of nine in Fig. 2. The criterion we used to identify this type of component as a community was that the largest betweenness of any edge in the component did not exceed N - 1, where N is the number of vertices in the component. This threshold is based on the betweenness of an edge connecting a leaf vertex, or vertex of degree one, to the rest of the graph. Consider the graph of Fig. 4 below. It is clear that it consists of just one community. Applying the Brandes algorithm, we find that edge XY has the highest betweenness, indicating that the size of the largest distinct community within the graph has size 1. That is, there are no distinct communities within the graph. In general, the single edge connecting a leaf vertex (such as X in Fig. 4) to the rest of a component of N vertices has a betweenness of N - 1, because it contains the shortest path from X to all N - 1 other vertices. If no edge's betweenness exceeds N - 1, therefore, we can identify the component as a community.) We can now explain the need to neglect survey-type articles that list many genes in creating our graph. The genes listed in these articles will all be linked to one another, forming a complete subgraph Kn. Such a grouping is very tightly knit and will likely not be split into different communities. This situation, due only to the survey article, may not accurately reflect the interactions between the genes. It is possible that a few articles mention many genes that are in fact functionally related, but in this case it is likely that the genes will be linked by other articles that discuss them three or four at a time. Communities Consist of Functionally Related Genes. The communi- ties thus created consist of genes that were strongly interrelated in the literature. Most, but not all, gene co-occurrences imply a functional relation; genes may also co-occur in an article abstract because of physical proximity, similarity of nomenclature or structure, historical association, or other reasons. However, because such nonfunctional edges are a minority, they are highly likely to be intercommunity, because the neighbors of two nonfunctionally related genes are unlikely to be linked. For example, genes S1OOA4 and S1OOA6 are members of the S100 family and co-occur twice in articles related to colon cancer, but they are not functionally related (Medline PMIDs 10389988 and 10952782~. In our results, S1OOA4 and S1OOA6 do not occur in a community together. The neighbors of one are not linked to the neighbors of the other, which causes them to be placed in separate communities. Further examples are given in Results. lit is not in general true that an intercommunity edge must have betweenness greater than N- 1, although such a situation is extremely unlikely in a power law graph. For a community of size m within a graph of size N. there is a total betweenness of m(N-m) divided among the edges connect) ng the community to the graph. So, if there are more than m such edges, it is possible that none of them will have betweenness greaterthan N. However, remember that few of these edges, or the extracommunity vertices they connect, should be adjacent, because otherwise m would not be a community. Even in GN's highly nonpower law .. . .. . . ~ _ college football graph, the criterion only occasionally fails when an intercommunity edge has a betweenness slightly less than N- 1. 5244 1 www.pnas.org/cgi/doi/10.1073/pnas.0307740100 ,: l ~ Fig. 4. A partitioning algorithm should not separate this graph into two communities. Multiple Community Structures. The process of assigning the nodes of a graph to communities may be called identifying a commu- nity structure on the graph. In the small examples given thus far in Figs. 2-4, there was only one reasonable community structure on each graph, because each node clearly belonged to only one community. In contrast, complex real-world graphs contain many "ambiguous" nodes that can be said to belong to two or more different communities due to their placement in the graph. An example, described in detail later, is the subgraph in Fig. 5, in which node B is ambiguous. Gene graphs include many ambiguous genes that belong to several communities, both in the context of the graph and in the context of biological function. Therefore, if we identify only one community structure on a real-world graph, such as a gene graph, we could only hope to be somewhat accurate in classifying the nodes. A large amount of information concerning ambiguous genes and communities re- lated through ambiguous genes would be lost. Our resolution to this problem is to identify many plausible community structures on the graph and compare them. To do this, we make a modification to the ON process that introduces an element of randomness into which edges of very high be- tweenness are removed early in the process. Tightly knit com- munities are not affected by the order of edge removal and will eventually be identified no matter which high-betweenness edges are removed first. However, the eventual placement of ambig- uous genes is strongly affected by which high-betweenness edges are removed first. By varying which high-betweenness edges are removed early in the process, we may therefore identify many community structures on a graph. By then comparing the structures, we can easily identify tightly knit communities, which do not vary from structure to structure, and ambiguous genes, which migrate from group to group.) The subgraph of Fig. 5 illustrates why the order of edge removal affects the placement of ambiguous genes and the need for multiple community structures. This subgraph consists of two communities, one on the left including vertex A and another on the right including C. Among its edges, BC initially has the highest betweenness, and AB's betweenness is also high. Once we remove BC, however, AB becomes an intracommunity edge with low betweenness, and it will never be removed. Gene B will eventually be placed in a community with gene A. Had we removed AB first, BC would be rendered intracommunity, and gene B would end up in the community with C. Moreover, in considering Fig. 5, it is not clear where ~ should end up. ~ Is This process is essentially a form of soft clustering, although it ciiffers significantly from existing methods of soft clustering. These methoc~s (see ref. 34 for an example) are essentiaityrestrictec]tociusteringobjectssuchasdocumentsthatcomprisemanyinciividuai elements (e.g., worcis). The words of one ciocument are compareci to the words of another, anci a relative closeness can be establisheci. The soft clustering presenteci here is affecter] only by a novels placement in the graph, not by a comparison of elements comprising neighboring nocies. Wilkinson and Huberman

OCR for page 63
W$' rest of graph Fig. 5. In this graph, it is unclear to which community node B belongs. ambiguous and could rightfully be considered to be a part of both communities. If we considered multiple community structures on this subgraph, we would see that B ended up in the commu- nity with A in some structures (the ones in which BC was removed early) and in the community with C in others (those where AB was removed). By comparing the structures, we could see that B really plays a role in both communities and (depending on the meaning of the links in the graph) possibly ties the communities together. To describe in detail our method of identifying multiple community structures, we briefly describe how the Brandes algorithm (31) computes the betweenness for all edges in a graph, and how the ON process decides which edge to remove at each step. We then explain our modification, which allows for the identification of multiple community structures. The first step of the Brandes algorithm is to find the shortest paths from all vertices to one "center" vertex using a breadth first search. In the second step, the contributions of these paths to the betweenness of each vertex or edge are added to a running total. The center is then switched and the above steps repeated. After every vertex in the component has been the center, we will have considered every shortest path twice, and the running totals for each vertex or edge will equal twice the betweenness of that vertex or edge. At this point in the ON procedure, one simply chooses the edge of highest betweenness and removes it. How- ever, this choice is somewhat arbitrary, because there are likely to be many intercommunity edges in the graph. Our modification is as follows. Instead of using every vertex as the "center" once in the Brandes algorithm, we cycle randomly through at least m centers (where m is some cutoffk) until the betweenness of at least one edge exceeds a threshold, again based on the betweenness of a "leaf" vertex. We then remove the edge whose betweenness is highest at that point and repeat. Because the betweenness of the edge we remove exceeds the threshold, it is very likely to be intercommunity. We continue removing edges from each component in this way, until all of the components become small.m We then perform the full Brandes kThe cutoff we used was m(N) = 1010g(N)-25, where N is the size of the component. This function has m(50) ~ 15, and m(800) ~ 41. We found that 15 was a reasonable number of centers to consider for a component of size 50, whereas 40 centers is more than enough for any component, however large. Basically, an intracommunity edge will be erroneously removed if we repeatedly choose centers from the same community. For a component of 50 vertices and 4 communities, the probability of choosing 8 of 15 centers from one community is ~1%. For a large component with many communities, the probability of error is very low for a cutoff of 40 centers. iThe value of the threshold in this case is (N + 1)/2-1, where N is the size of the component, and i is the number of centers that have been considered up to that point in the process. mWe never attempted to precisely define "small." We used values in the 35-50 node range and, as one might expect, it made little or no difference in the final result. An exact definition would depend on the community size, the graph size, and a desired probability of error (see discussion on this page). However, even when we used a number as large as 50, the randomness of the method was sufficient to produce a slightly different commu- nity structure every time. Wilkinson and Huberman algorithm and remove the edge of highest betweenness from each component until it is resolved into communities. The random nature of the modification allows us to change the order in which edges are removed, because the edge with highest betweenness after m centers have been considered will vary depending on which centers are considered. Our modified process can therefore be applied repeatedly to identify different plausible community structures on the graph. This process may erroneously remove an intracommunity edges, which can happen if a large percentage of the centers considered lies in one community. In a large graph with many small commu- nities, this probability is small, especially because we perform only the modified removal step in large components. Additionally, when we compare many different community structures, anomalous placements due to errors will be suppressed. Applying this modified process n times, we obtain n commu- nity structures imposed on the graph. We can then compare the different structures and identify communities, as well as the strength of each gene within the community. For example, after imposing 45 structures on our graph, we might find: a community of genes A, B. C, and D in 20 of the 45 structures; a community of genes A, B. C, D, and E in another 20; and one of genes A, B. C, D, E, and F in the remaining 5. We report this result in the following way: A(45) B(45) C(45) D(45) E(25) F(5), which signifies that A, B. C, and D form a well defined community, E is related to this community but also to some otherts), and F is only slightly, possibly erroneously, related to it. Aggregating Communities from Different Structures. To aggregate communities from different structures and obtain a final list of communities in the form {A(45) B(45) C(45) D(45) E(25) F(5~l, we use a procedure that is straightforward but rather tedious to explain because of the terminology. To summarize briefly, we create an initial "master list" Mi of communities by choosing one structure at random from among our set of N (in our experi- ments, n = 50~. We then perform N - 1 steps, each consisting of comparing one of the remaining N - 1 structures S to the master list, and based on the results of the comparison, aggre- gating S into the list. The final master list, obtained by aggre- gating all N structures, is the final result of the entire algorithm. Let us introduce the notation Mt to denote the state of the master list created from aggregating the first t structures (chosen in arbitrary order from the set of structures we found). At step t + 1, we select a structure S from among those we have not yet considered and compare its communities to the communities of M'. S is aggregated into Mt based on the results of the compar- ison, creating an updated master list Mt+~. Mt is a list of communities, and we will denote the kth community of Mt by Ark. The numbering system of communities in the list is arbitrary and serves only to distinguish them. Each community M`k of Mt is a collection of genes and associated weights (my, pow. The weight pa associated with gene A in a community Mik indicates how strongly it "belongs" to M(k. To be precise, pa is the number of structures, out of the t we have aggregated to form M`, in which A has been associated with the community that evolved (because structures were aggregated) into Ark. This will be clearer when we explain the aggregation step. The communities evolve very little as the structures are aggregated, because the structures are on the whole quite similar. Thus the weight as defined is an accurate indication of how strongly a gene belonged to a community. One might expect that, because the communities in the master list evolve, the final result would depend on the order of aggregation. To the contrary, we found the order of aggregation had little effect on the final result due to the similarity of the different structures. The details of the matching of communities S to the commu- nities of Mt-i, and of the aggregation of S into Mt-~ to form M`, are as follows. A basic metric to compare two communities A and PNAS 1 April 6, 2004 1 vol. 101 1 suppl. ~ 1 5245

OCR for page 64
B of genes Al, 0~2' . . . ~ An, and `81, I, . . . ~ Em, respectively, the traditional union/intersection metric A U B t, ~ T%\ Idai' I3j d(A' A) A Hi B n + m—Ibo~,9~ i,j Here n and m are the number of genes in communities A and B. so the sums run over i = 1,. . ., n end j = 1,. . ., m. This notation is overcomplicated but useful for comparison to the weighted metric below. The sum over 1` functions just means we are counting how many of the genes in A and B are the same. The metric d(A, B) has a value between 0 and 1 and will be larger for a closer match. To compare a community A of S to a weighted community M OCR for page 65
Table 3. A sample community of nine genes from our results for colon cancer Gene Weight in Overall mentions symbol community with colon cancer Neighbors with colon cancer PTGS2 50 263 PL42G2A PLA2G4 SPLA2 FACL4 NOS2A PDCD4 PGES LEFT 50 50 50 50 50 50 18 5 12 1 18 PTGS1 * OLD* MLH1 * BCL2* PLA2G2A PLA2G4 APC* ERBB2* PGES ERBB3* PLA2 ACL4 WNT1 * GRP* GRPR~ LEF DLR~ TCF4* TCF* MYB* VEGF* NOS2A TP53* MADH4* EGFR* S11* PDCD4 BRCA1* BRCA2* MSH2* ERBB4 APC* PTGS2 PLA2G4 TP53* NF2* DCC* MLH1 * SPLA2 PL42G2A PTGS2 PTGS2 PLA2G2A PTGS2 PTGS2 PTGS2 ERBB2* PTGS2 ERBB3* WNT1 * TCF* PTGS2 TCF4* APC* FRA 1 * PLAUR* MYC* MMP7* TCF7* Here score in community denotes the number of community structures, out of 50, in which each gene was placed in this community (Partitioning the Graph into Communities). Genes with a score of 50 were members of this community only; genes with a lower score were members of this community and others. *Neighbor not in community. FACL4 expression, although a Medline search of NSAID and FACL4 turned up no results. Absent Neighbors. In examining neighbors of PTGS2 (COX-2) not present in this community, we noticed in particular the similarly named gene PTGS1 (also known as COX-1~. These two genes are isoforms of cyclooxygenases (PMID 9099957, for example); they co-occurred in 70 articles related to colon cancer and 1,500 articles overall. However, they have been shown to regulate colon carcinoma-induced angiogenesis by two different mecha- nisms (Medline PMID 9630216~. COX-2 has also been shown to be expressed much more frequently than COX-1 in tumors and less frequently in normal tissue (PMID 7780968, for example; note the use of the alias PGHS-1 or -2 for COX-1 or -2 in this article) The separation of COX-1 and -2 into different commu- nities thus accurately reflects our current knowledge about how these genes function in relation to colon cancer. Although the enzymes they code for are structurally very similar, COX-2 plays a strong role in colon cancer, whereas COX-l's role is weaker and by a different mechanism. Several other neighbors of PTGS2, such as MLH1, BRCA1, BRCA2, and MSH2, also proved to be weakly or nonfunctionally related. However, a few of PTGS2's noncommunity neighbors have been tentatively identified as functionally related, such as GRP and GRPR (GRP receptor; PMID 11292836) and EGFR (PMID 9012840~. For this reason, we include a list of all neighbors of each gene in the results as a secondary list of p~ossible connections to explore. Links to Other Communities. We also looked for links to other communities through the genes PGES and LEFT, both of which show a weak connection to the COX-2 community and were often placed in other communities. Both searches yielded good results. PGES co-occurs with other genes only once, in an abstract with COX-2, ERBB2, and ERBB3. Examining this abstract, we find a link between the COX-2 pathway and autocrine/panacrine activation of HER2/ HER3 (also known as ERBB2 and ER~B3; 99271874. The ERBB genes are present in another community of 25 genes. In con- junction with the previous discussions about arachidonic acid, there is a possible link between not only COX-2 but all of the genes related to arachidonic acid (most of which never co-occur with ERBB2 or -3) to any gene related to the autocrine/ panacrine activation of ERBB2/ERBB3. This conclusion de- pends on knowledge of many articles, in particular PMID 9927187, and could easily escape notice in a manual search. LEFT was found with COX-2 in only one article (PMID Wilkinson and Huberman 10834941~. It states that "NO (nitric oxide) may be involved in PGHS-2 (`COX-2) overexpression in conditionally immortalized mouse colonic epithelial cells. Although the molecular mechanism of the link is still under investigation, this effect of NO appears directly or indirectly to be a result of the increase in free soluble ,B-catenin and the formation of nuclear ,B-catenin/LEF-1 DNA complex." This article indicates a possible connection between COX-2, NOS2A (nitric oxide synthase, responsible for the produc- tion of NO! and the very important colon cancer gene ,[3-catenin. Importance of Alias Symbols. As a last note, this community demonstrates the crucial importance of considering alias sym- bols when extracting gene names. The aliases COX-2, PGHS-2, NOX2, and cPL~A (2J were very commonly used in articles that tied this community together Other Results. Here we present similar results from two other communities: A connection between PXR (pregnane X receptor) and GP170 (P-glycoprotein) is indicated because they are placed together in a community. PXR is implicated in the induction of the MDR1 gene (PMID 11297522), whereas MDR1 expression has been associated with the expression of functional P-glycoprotein (PMID 10334913~. A Medline search turns up no results for GP170 or GP-170 with PXR or its aliases PAR, SXR, and NRli2. Another probable undocumented connection between GP200- MR6 and STAT6, via IL-4 and its receptor IL-4R is suggested by their placement together in a community. IL-4 induces STAT6, which is involved in mediating activation of IL-4R gene expression (PMID 8810328), whereas GP200-MR6 has been shown to be functionally associated with IL-4R (PMID 91788154. This example demonstrates the power of an automated method to bring together information from disparate, old sources (cited articles from J. Biol. Chem., Oct 11, 1996 and Int. J. Cancer, May 16, 1997~. Although large communities are more difficult to analyze for the nonexpert, we were nevertheless able to draw some conclu- sions. For example, we considered a 30-gene community largely concerned with apoptosis and genes related to BCL-2, contain- ing in particular the gene TRAIL. TRAIL has been shown to induce procaspase-8 activation, triggering caspase-dependent apoptosis in colon cancer cells (PMID 11245478~. It could thus be related to the function of genes such as BCLX, BCLXS, etc., which we find in this community but which do not co-occur with TRAIL via the genes BCL-2 and CASP8. A good example of nonfunctionally related genes with similar names that are placed in different communities is MMPll and MMP9 (PMID 8645587~. Often nonfunctionally related neigh- boring genes do appear together in one community in a small PNAS 1 Aprii 6, 2004 1 vol. 101 1 suppl. ~ 1 5247

OCR for page 66
number of structures (see Partitioning the Graph into Commu- nities) but appear in different communities in the majority of structures. Examples of this include CYP3A4 or CYP3A5 and CYP1A2 (PMID 9202751) as well as SMAD3 and SMAD5 (PMID 10446110 and 11196171, for example; SMAD2 and SMAD4 are aliases for MADH2 and MADH4, respectively). Conclusion We have presented a data-mining technique for biological literature that produces detailed results while extracting only very simple data from each article abstract and title. The method produces a list of communities of functionally related genes that are designed to summarize available information and indicate genes that are likely to be complementary in their function. The genes within a community are weighted, indicating how strongly they belong to the community. We show that the communities produced in the case of colon cancer have interesting features that give one insight into the function of the component genes. The identification of many similar community structures on each gene graph allows us to recognize those genes that belong in two or more different communities. In this sense, our method produces a richer result than previous methods that impose one rigid structure on the graph. This idea could be applied to social and other networks where individuals play a role in more than one community. We introduce two statistical components into the process, which lessen the inevitable errors of text mining in the biological literature, particularly severe in our case because of the complex young nomenclature system for genes. However, our method retains the ability to detect relations among rarely mentioned genes, one of its strongest features. To reiterate an important point from the Introduction, our results are not meant to perfectly model biological reality, only to function as a tool for biologists. It was not possible to compare our communities to a database or list of groups of related human genes, because such a list does not exist. The only justification we can provide that our communities were "accurate" is to cite ref. 4, in which the GN method was shown to be very effective in identifying communities. In fact, because genes within a com- munity are linked by edges from a co-occurrence, it is almost certain that they are related somehow. A much more interesting measure of the effectiveness of the method is whether it sepa- rates genes that should be separated. 1. Adamic, L., Wilkinson, D., Huberman, B. & Adar, E. (2002) in Proceedings of the IEEE Bioinformatics Conference (Institute of Electrical and Electronic Engineers, Los Alamitos, CA), pp. 109-117. The factor that most limits our results is the absence of many gene symbols from HUGO and other online databases. Hope- fully, these databases will soon be more complete. Related problems are the unorganized nomenclature system for human genes (see discussion in ref. 31) and small modifications to recognized symbols introduced by many authors, such as the addition of hyphens, parentheses, or spaces, which make the symbols difficult to detect. Efforts are being made to standardize the gene nomenclature system (33~. A less acute limiting factor was the placement of many genes in either large and very small communities in our results. Although still a step forward from raw co-occurrence data, such communities are of limited usefulness; they often did not provide much insight into the function of their component genes, other than that the genes were rarely related to others in the context of colon cancer. If such genes were more commonly mentioned in other contexts, a search using other diseases or key words would likely turn up more interesting communities with these genes. Large communities were difficult for us to analyze but nevertheless yielded some interesting results. These communities contained many of the most commonly mentioned genes in connection with colon cancer, such as APC and TP53. Strangely, a search for colon cancer genes is probably not the most efficient way to study these genes, which are simply too highly linked in this context. Instead, one could perform other searches with other key words, hoping to focus on particular aspects of these genes' function by confining them to smaller more informative communities. We believe that large communities are a product of graph topology, not of the threshold we use to stop subdividing a community or of the aggregation process. To further subdivide large communities, one could consider a weighted graph, where the weight corresponds to the (normalized) number of times the two genes co-occur. This could increase the "distance" between, for example, two commonly studied distantly related co-occurring genes. They would then not end up in the same community and, more importantly, would not glue a false community together. The simplest such weighting would be to neglect all links below some (normalized) threshold weight. Another resolution to the problem of large communities would be to refine the step that aggregates the community structures into one result. We thank Lada Adamic, Eytan Adar, and Melissa Wilkinson for many useful discussions. 17. Raychaudhuri, S., Chang, J., Sutphin, P. & Altman, R. (2002) Genome Res. 12, 203-214. 18. Andrade, M. & Valencia, A. (1997) in Intelligent Systems for Molecular Biology 2. Adar, E. (2004) Bioinformatics, in press. (American Association for Artificial Intelligence Press, Heidelberg), pp. 25-32. 3. Freeman, L. (1977) Sociometry 40, 35-41. 19. Fukuda, K, Tsunoda, T., Tamura, A. & Takagi, T. (1998) Pac. Symp. Biocompute3, 4. Girvan, M. & Newman, M. (2002) Proc. Natl. Acad. Sci. USA 99, 8271-8276. 705-716. 5. Wu, F. & Huberman, B. A. (2004) Eur. J. Phys. B. in press. 20. Proux, D., Rechenmann., F., Julliard, L., Pillet, V. & Jacq, B. (1998) in Genome 6. Tyler, J. R., Wilkinson, D. M. & Huberman, B. A. (2003) Proceedings of the Informatics Workshop (Universal Academic, Tokyo), pp. 72-80. International Conference on Communities and Technologies (Kluwer, Amsterdam). 7. Blaschke, C., Andrade, M., Ouzounis, C. & Valencia, A. (1999) in Proceedings of the AAAI Conference on Intelligent Systems in Molecular Biology (American Association for Artificial Intelligence Press, Heidelberg), pp. 60-67. 8. Ng, S. K. & Wong, M. (1999) Genome Inf. 10, 104-112. 9. Humphreys, K., Demetriou, G. & Gaizauskas, R. (2000) Pac. Symp. Biocomput. 5, 502-513. 10. Rindflesch, T. C., Tanabe, L., Weinstein, J. N. & Hunter, L. (2000) Pac. Symp. Biocomput. 5, 514-525. 11. Thomas, J., Milward, D., Ouzounis, C., Pulman, S. & Carroll, M. (2000) Pac. Symp. Biocomput. 5, 538-549. 12. Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. (2001) Bioinfor- matics 17, S74-S82. 13. Pustejovsky, J., Castano, J., Zhang, J., Cochran, B. & Kotecki, M. (2002) Pac. Symp. Biocomput. 7, 362-272. 14. Tanabe, L., Scheft, U., Smith, L., Lee, J., Hunter, L. & Weinstein, J. (1999) BioTechniques 27, 1210-1217. 15. Craven, M. & Kumlien, J. (1999) in Proceedings of the ISMB Conference (International Society for Computational Biology, Brisbane, Australia), pp. 77-86 16. Shatkay, H. & Wilbur, W. (2000) in Proceedings of the IEEE Conference on Advances in Digital Research (Institute of Electrical and Electronic Engineers, Los Alamitos, CA), pp. 183-192 5248 1 www.pnas.org/cgi/doi/10.1073/pnas.0307740100 21. Jenssen, T.-K., Laegrid, A., Komorowski, J. & Hovig, E. (2001) Nat. Genet. 28, 21-28. 22. Stephens, M., Palakal, M., Mukhopadhyay, S. & Raje, R. (2001) Pac. Symp. Biocomput. 6, 483-396. 23. Stapley, B. & Benoit, G. (2000) Pac. Symp. Biocomput. 5, 529-540. 24. Masys, D., Welsh, J., Fink, L., Gribskov, M., Klacansky, I. & Corbeil, J. (2001) Bioinformatics 17, 319-326. 25. Shatkay, H., Edwards, S. & Boguski, M. (2002) in IEEE Intelligent Systems, Special Issue on Intelligent Systems in Biology (Institute of Electrical and Electronic Engineers, Los Alamitos, CA), pp. 45-53. 26. Raychaudhuri, S., Schutze, H. & Altman, R. (2002) Genome Res. 12, 1582-1590. 27. Huberman, B. A. & Adamic, L. (1999) Nature 401, 131. 28. Aiello, W., Chung, F. & Lu, L. (2000) in Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (Association for Computing Machinery, New York), pp. 171-180. 29. Albert, R. & Barabasi, A.-L. (2002) Rev. Mod. Phys. 74, 47-97. 30. Newman, M. (2001) Phys. Rev. E 64, 026118-1-026118-17. 31. Brandes, U. (2001) J. Math. Soc. 25, 163-177. 32. Pearson, H., (2001) Nature 411, 631-632. 33. Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A., Dolinski, K., Dwight, S. & Eppig, J. (2000) Nat. Genet. 25, 25-29. 34. Tishby, N., Pereira, F. & Bialek, W. (1999) in Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (UIUC Press, Champaign- Urbana, IL), pp. 368-377. Wilkinson and Huberman

Representative terms from entire chapter:

community structures