4
The Potential Impact of HECC in Evolutionary Biology

INTRODUCTION

The dictum of Theodosius Dobzhansky (1964)—“nothing makes sense in biology except in the light of evolution”—has never been truer than it is today. With the rise of such fields as comparative genomics and bioinformatics, evolutionary developmental biology, and the expanded effort to build the tree of life, the discipline of biology has become increasingly dependent on the inferences, methods, and tools of evolutionary biology. These contributions from evolutionary biology have become standard in solving problems in comparative biology, the biomedical and applied sciences, agriculture and resource management, and biosecurity. Thus an understanding of evolutionary change in individuals and populations provides the foundation for advances in crop improvement and vaccines, for improved understanding of epidemiology and antibiotic resistance, and for managing threatened and endangered species, to name only a few (Meagher and Futuyma, 2001).

Likewise, at the level of species and multispecies lineages, our new understanding of the tree of life is providing a comparative framework for interpreting the similarities and differences among organisms. Using newly generated knowledge about the phylogenetic relationships of life on Earth, comparative biologists have been able to do interesting and useful things:

  • Identify wild relatives of domesticated plants and animals, leading to enhanced food security.

  • Create tools for the discovery of countless new life forms, many of which are economically important.

  • Establish a framework for comparative genomics and developmental biology, which speeds up the identification of emerging diseases (such as avian influenza and West Nile virus) and helps to locate their places of origin.

  • Advance such disparate fields as resource management and forensics (Cracraft et al., 2002).

The committee was charged with reviewing the important scientific questions and technological problems in evolutionary biology by drawing on survey documents. Unlike the other fields covered in
this report (astrophysics, the atmospheric sciences, and chemical separations), evolutionary biology has no definitive reports setting forth its scientific breadth and describing future challenges. Thus in developing its evaluation of the main scientific challenges facing evolutionary biology, the committee drew on workshop reports, primarily those prepared for the National Science Foundation (NSF, 1998, 2005a, 2005b, 2006), and on a document produced by eight scientific societies (Meagher and Futuyma, 2001), as well as on discussions among committee members and invited experts at a small workshop in December 2006, the agenda of which is included in Appendix B.

This chapter identifies the main challenges of evolutionary biology and evaluates the extent to which computational methods are impacting each of them. It describes the primary mathematical models that are currently available or being developed. On this basis, the committee then assesses the potential impact of HECC on the major challenges of evolutionary biology.

MAJOR CHALLENGES OF EVOLUTIONARY BIOLOGY

Major Challenge 1: Understanding the History of Life

The most fundamental question posed by Major Challenge 1 is this: How did life arise? Despite the large body of scientific literature, this question remains unanswered. Addressing it requires knowledge spanning the physical and biological sciences: chemistry, the Earth sciences, astrophysics, and cellular and molecular biology. A key piece of information would be knowing whether life is indigenous to Earth or exists elsewhere in our solar system. More generally, people in the field ask how the assembly of simple organic compounds led to complex macromolecules and then to self-replicating entities, and what role Earth-bound processes played.

Another unknown is how many species there are on Earth. Systematic biologists have discovered and described about 1.7 million living species.
How many more exist in Earth’s ecosystems has not been answered satisfactorily, even to within an order of magnitude. Without a better quantification of life’s diversity we have only a very incomplete understanding of the distribution of diversity and thus cannot characterize with precision ecosystem structure and function, extinction rates, and the amount of molecular and functional biodiversity. Lack of knowledge about Earth’s biodiversity also precludes our potential use of those species and their products.

Ultimately, Major Challenge 1 calls for us to develop an understanding of the tree of life and then to use it. With advances in methods of phylogenetic reconstruction and increasing amounts of new comparative data from DNA sequences, the last decade has seen an unprecedented increase in our knowledge about the phylogenetic relationships of organisms, which collectively constitute the tree of life. Since the beginning of the 1990s, the number of species represented in the gene sequence database GenBank has grown to more than 155,000. If this correlates roughly with the number of species that can be placed on phylogenetic trees using molecular data alone, then the combined number of extinct and living species currently included on phylogenetic trees may approximate 200,000 species. Assuming an increase in the number of researchers and technological advances, it seems safe to expect that between 750,000 and 1 million of Earth’s estimated 10 million to 100 million species will be placed on trees within the next decade.

Data sets from individual studies are becoming larger and larger, and many contain information on thousands of species and thousands of molecular and/or morphological characters (qualities or attributes). This situation creates well-known computational challenges when one searches existing phylogenetic trees or works to resolve relationships so as to add or clarify particular branches (Felsenstein, 2004). In addition, as more species are added, more comparative character data must be added in order to resolve relationships with confidence.

Resolving the tree of life has become a priority because the hierarchical pattern of relationships is a powerful predictive tool for comparative biology, with many applications in the fundamental and applied sciences and in industry (Bader et al., 2001; Cracraft et al., 2002; Cracraft and Donoghue, 2004). Beyond knowing how specific species are related to one another, phylogenetic methods themselves are now routinely incorporated into many fields of evolutionary biology, molecular and developmental biology, the health sciences (comparing viral sequences as well as describing and predicting molecular change), species discovery/description, natural resource management, and biosecurity (identification of invasive species, pathogens). In addition, they are used in the identification of microorganisms, in the development of vaccines, antibacterials, and herbicides, and by the pharmaceutical industry in the prediction of drug targets.

Computational Challenges

Standard phylogenetic analysis comparing the possible evolutionary relationships between two species can be done using the method of maximum parsimony, which assumes that the simplest answer is the best one, or using a model-based approach. The former entails counting character change on alternative phylogenetic trees in order to find the tree that minimizes the number of character transformations. The latter incorporates specific models of character change and uses a minimization criterion to choose among the sampled trees, which often involves finding the tree with the highest likelihood. Counting, or optimizing, change on a tree, whether in a parsimony or model-based framework, is a computationally efficient problem.
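To illustrate why counting change on a single fixed tree is cheap, the following is a minimal sketch of Fitch's small-parsimony algorithm, a standard textbook method that is not named in this chapter. The nested-tuple tree encoding, the function names, and the toy four-taxon alignment are all assumptions made for illustration. One pass over the tree per character suffices, so the cost grows only linearly with the number of taxa and characters.

```python
def fitch_score(tree, leaf_states):
    """Return (possible_states, changes) for one character on a rooted binary tree.

    `tree` is either a leaf name (str) or a pair (left_subtree, right_subtree).
    `leaf_states` maps each leaf name to its observed state for this character.
    Both encodings are hypothetical conventions chosen for this sketch.
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch_score(tree[0], leaf_states)
    right_set, right_cost = fitch_score(tree[1], leaf_states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: keep all candidates and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the Fitch change counts over every column of an alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        states = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch_score(tree, states)[1]
    return total


# Toy data (invented for illustration): four taxa, three characters.
alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "GCT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_length(tree_1, alignment))  # 3 changes
print(parsimony_length(tree_2, alignment))  # 5 changes
```

Under the parsimony criterion the first tree would be preferred because it requires fewer changes. The expensive part of a real analysis is not this scoring step but deciding which of the many candidate trees to score.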
But sampling all possible trees to find the optimal solution scales precipitously with the number of taxa (or sequences) being analyzed (Felsenstein, 2004) (Figure 4-1). Thus, it has long been appreciated that finding an exact solution to a phylogenetic problem of even moderate size is NP-complete (see, for example, Bader, 2004). Accordingly, numerous algorithms have been introduced to search heuristically across tree space and are widely employed by biological investigators using platforms that range from desktop workstations to supercomputers. These algorithms include methods for fusing and splitting taxa, swapping among branches, and moving through tree space stochastically to avoid becoming stranded at local suboptimal solution sets (Felsenstein, 2004; Yang, 2006). Such methods include simulated annealing, genetic (evolutionary) algorithmic searches, and Bayesian Markov chain Monte Carlo (MCMC) approaches (see, for example, Huelsenbeck et al., 2001), among others.

The accumulation of vast amounts of DNA sequence data and our expanding understanding of molecular evolution have led to the development of increasingly complex models of molecular evolutionary change. As a consequence, the enlarged parameter space required by these molecular models has increased the computational challenges confronting phylogeneticists, particularly in the case of data sets that combine numerous genes, each with their own molecular dynamics.

The growth of phylogenetic research and its empirical database presents computational challenges beyond those of pure tree building. Phylogeneticists are more and more concerned about having statistically sound measures of estimating branch support. In model-based approaches, in particular, such procedures are computationally intensive, and the model structure scales significantly with the number of taxa and the heterogeneity of the data.
In addition, more attention is being paid to statistical models of molecular evolution (see, for example, Nielsen, 2005; Yang, 2006), which are the backbone of reconstructing ancestral sequences across a tree. This type of analysis is a prime objective for many evolutionary biologists and has numerous applications, including reconstructing virulence in viruses, predicting the probabilities of genetic change, and the design of vaccines.

FIGURE 4-1 (table with columns Species and Number of Trees; values not reproduced in this text) The number of rooted, bifurcating, labeled trees for n species, for various values of n. The numbers for more than 20 species are approximate. SOURCE: Felsenstein (2004).

Theoretical molecular evolutionists and phylogeneticists have long simulated data sets and trees (to understand and compare phylogenetic methods and their statistical properties, for example, and to compare models of sequence change between simulated phylogenies and those found in the real world). The scale and efficacy of these studies are inherently limited by computational capability as investigators seek to make their simulations more sophisticated and realistic.

Finally, as phylogenetic studies increase in scope, visualization of the results becomes more computationally complex (NSF, 2005b). The problem of visualization has received less attention than other aspects of phylogenetics, but because the field is growing so rapidly, visualization will need to be addressed. The computational challenge associated with visualization calls not only for more computational capability but also for the development of visualization software for phylogenetics, which has received very little attention.
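The quantity tabulated in Figure 4-1, the number of rooted, bifurcating, labeled trees for n species, has a well-known closed form: the product of the odd integers 1 × 3 × 5 × ... × (2n − 3). The short sketch below evaluates it exactly with Python's arbitrary-precision integers. The function name is an assumption made for illustration, and the script is only meant to show how quickly exhaustive tree search becomes infeasible.

```python
def num_rooted_trees(n_species):
    """Number of rooted, bifurcating, labeled trees for n_species tips.

    Uses the product of odd integers 1 * 3 * 5 * ... * (2n - 3).
    """
    if n_species < 1:
        raise ValueError("n_species must be at least 1")
    count = 1
    # Odd factors from 3 up to 2n - 3; the range is empty for n <= 2.
    for factor in range(3, 2 * n_species - 2, 2):
        count *= factor
    return count


for n in (2, 3, 4, 5, 10, 20, 50):
    print(n, num_rooted_trees(n))
# For example, n = 10 already gives 34,459,425 distinct trees.
```

Each added species multiplies the count by the next odd number, so the growth is faster than exponential, which is why the heuristic searches described earlier are used in practice.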

Major Challenge 2: Understanding How Species Originate

The predominant view of how species originate is that speciation takes place through geographic isolation of populations, followed by differentiation, a process known as allopatric speciation. However, an increasing number of biologists are proposing that many species have arisen under local, nonallopatric conditions owing to rapid shifts in host preference, a process known as sympatric speciation, in which the diverging populations are not separated geographically. A third mode of speciation, parapatric, postulates taxonomic differentiation along steep environmental gradients in which divergence can occur under intense natural selection even in the presence of some exchange of genes between species.

A fundamental unknown is associated with Major Challenge 2, namely: What is the relative importance of these alternative modes of speciation across different taxa? The frequency of different modes of speciation across taxonomic groups is critical for proposing and testing general explanations about the genetic, developmental, and demographic conditions leading to speciation, as well as for understanding patterns of species diversity. Even though different modes may have multiple underlying mechanisms, knowledge about mode frequency will form the cornerstone of any general mechanistic theory of speciation. Although the factors leading to allopatry (vicariance¹ and dispersal) are well-known, the conditions under which sympatric and parapatric speciation can occur have received moderate theoretical treatment but not enough experimental and empirical study.

Of course, Major Challenge 2 cannot be clearly defined, let alone addressed, unless we understand the nature of species and their boundaries. Two associated questions are these: What changes take place in the genetic and developmental architectures of isolated populations? What mechanisms underlie those changes?

Few questions in evolutionary biology have generated as much debate as how species are to be defined. At a certain fundamental level, these debates are about the nature of species boundaries. One type of boundary involves reproductive relationships within and among populations. Traditionally associated with the notion of biological species, this boundary delimits populations that have evolved genotypic or phenotypic characteristics that make them reproductively incompatible with (isolated from) one another. A second form of boundary, generally associated with phylogenetic or cladistic species, circumscribes populations that are diagnosably distinct from one another, whether or not these character differences result in reproductive isolation.

Leaving aside the issue of which type of boundary should adjudicate the species debate, both types raise fundamental problems of interest to evolutionary biologists. In many taxonomic groups there are multispecies lineages in which viable offspring are produced by hybridization events in the wild, even between species that are distantly related. Although these hybridization events may be uncommon or rare due to pre-mating isolating barriers, they imply that distinct taxa are able to maintain cohesion of their developmental and genetic architectures in the face of some gene flow. Therefore, reproductive incompatibilities are extremely weak or nonexistent, and this situation can be maintained for millions of years. In other groups, by contrast, closely related and very similar species that were isolated only recently and have since become sympatric can have strong reproductive incompatibilities. Thus we are faced with the question of how to explain the continuum of responses to isolation and differentiation in genetic and developmental terms.
Addressing Major Challenge 2 will also necessitate fundamental advances in understanding how changes in genetic architecture translate into changes in the development of the organism and of its phenotypic (observable) characteristics as an adult. It is likely that many types of changes in these complex genetic-developmental pathways could lead to reproductive incompatibilities in behavior, physiology, or ecological preferences, but we currently do not know to what extent predictive regularities exist. Old debates about the relative importance and frequency of “micro” versus “macro” phenotypic effects of mutational change are, in many respects, still with us because simple genetic changes can potentially be amplified into significant phenotypic differences through complex developmental networks.

¹ Vicariance describes a situation in which a widespread population is subdivided into two or more allopatric populations by a newly formed physical barrier such as a river, mountain range, or change in habitat due to environmental change.

Evolutionary biologists have long believed that reproductive incompatibilities are more significant in genetic terms than are even striking phenotypic changes that do not result in those incompatibilities. This inference, however, is based more on looking at the world through a particular theoretical framework than on generalizable knowledge about the genetics of differentiation. The fact that it is possible to maintain the integrity of species even in the face of hybridization indicates those species’ genomes are porous but at the same time resistant to gene flow that might break down their boundaries. These considerations raise questions not only about the nature of species and their boundaries but also about the morphogenetic mechanisms responsible for maintaining phenotypic expression. Very little is known about this, and answers will require an approach calling on the expertise of numerous disciplines.

A third key component of Major Challenge 2 is learning about the history of populations and species, particularly the human species. As a consequence of the vast increase in genomic information, these histories can be probed in increasing detail. But populations and species are complex entities; every individual of every species has thousands of genes in its genome. Combining genomic data therefore becomes an issue: ribosomal DNA sequences from organelles like chloroplasts and mitochondria may yield different histories from those of the nuclear genome.
These genes have variable mutation rates and can provide varying amounts of information for demographic history. Whereas traditional phylogenetic analysis often treats each species as a single monomorphic entity that cannot be further decomposed, a genomic perspective of biodiversity acknowledges that each species is in fact a complex entity whose cohesion and trajectory through time could be constrained by many processes, including reproductive isolation from other species. The theoretical framework provided by coalescent theory (a body of population genetics theory that models the genealogical signatures of genetic lineages as they are passed down through generations) has further increased the resolving power of DNA sequence data.

Computational Challenges

Much current research on the above questions and problems can and does proceed without extensive computational requirements. The relative frequency of modes of speciation, for example, can often be determined by standard phylogenetic and biogeographic studies. And there have been numerous computational studies in population biology and genetics that attempt to model the conditions under which sympatric speciation is likely to operate (reviewed in Gavrilets [2003] and in Coyne and Orr [2004]).

As evolutionary biology in general and speciation research in particular become more mature and as new genomic information becomes available, there has been a shift to the use of mathematical models and methods. Currently, there is growing interest in developing mathematical models for particular cases in order to test well-defined hypotheses associated with speciation. For example, was speciation of cichlids in a Nicaraguan crater lake sympatric, or was it a result of double invasion? Did modern humans hybridize with Neanderthals during their colonization of Europe?
While many of the current studies do not require high-end computing, future advances in population evolution and speciation will be stymied unless they can scale, which will require new approaches, both algorithmic and computational.

Coalescent models now encompass a wide variety of demographic and genetic phenomena, including population bottlenecks, migration, change in population size over time, natural selection, gene flow between populations, gene conversion and recombination between alleles in a population, and complex mutation patterns. DNA sequence data are now routinely analyzed in terms of genealogical (phylogenetic) trees at various levels in the hierarchy, from individuals within a population to closely and distantly related species. These genealogical patterns will show signatures of various demographic processes, including migration, population size changes, and reproductive isolation events through time. However, the highly stochastic nature of the coalescent process means that the realization of any particular genealogical pattern in nature could be consistent with many different scenarios. As a result, genealogical signatures from many different genetic loci are required to accurately estimate demographic histories. Linking many different genealogical signals with models of various population demographics is computationally demanding, and advances to date have been made via a series of strong but useful approximations. Relaxing these assumptions and exploring the full range of demographic histories is a major computational challenge for the future (Beaumont, 2002, 2004).

Going hand in hand with these developments, however, is a pressing need for gathering more empirical data, which can provide the context for building more realistic models. These types of data would include the kinds and amount of genetic variation and selective pressures that might bear on the evolution of pre- and postzygotic reproductive isolation.

It seems clear that studies along the interface of population biology and efforts to unravel the origin of species will make significant use of computational resources. Increased understanding of how populations become spatially structured genetically will rely on large populational sampling and detailed descriptions of populational history.
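The stochasticity of the coalescent process noted above can be made concrete with a small simulation. The sketch below assumes the simplest textbook case, the standard neutral (Kingman) coalescent for a single constant-size population with no migration, selection, or recombination, which is far simpler than the models this chapter has in mind. Function names, the sample size, and the random seed are illustrative assumptions. Time is measured in coalescent units, in which each pair of lineages merges at rate 1.

```python
import random


def simulate_tmrca(n_samples, rng):
    """Draw one time to the most recent common ancestor (TMRCA) for one locus.

    While k lineages remain, the waiting time to the next merger is
    exponential with rate k * (k - 1) / 2 (standard neutral coalescent).
    """
    elapsed = 0.0
    lineages = n_samples
    while lineages > 1:
        rate = lineages * (lineages - 1) / 2.0
        elapsed += rng.expovariate(rate)
        lineages -= 1
    return elapsed


def expected_tmrca(n_samples):
    """Theoretical mean TMRCA in coalescent units: 2 * (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n_samples)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
loci = [simulate_tmrca(10, rng) for _ in range(20000)]
mean = sum(loci) / len(loci)
print("theory:", expected_tmrca(10), "simulated mean:", round(mean, 3))
print("range across loci:", round(min(loci), 3), "to", round(max(loci), 3))
```

Individual loci differ widely from one another even though every one was generated under the same demographic history, which is the reason many independent loci are needed before a history can be estimated with any confidence.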
Moreover, the integration of genetic and demographic information through complex models and simulations of populational histories will present profound computational challenges.

Major Challenge 3: Understanding Diversification of Life Across Space and Time

At a general level, it is well known that processes in the geosphere and biosphere have been tightly linked since the origin of life (NRC, 1995), but we have only partial understanding of the linkages across different spatial and temporal scales. At large scales, movements of continents and terrains, tectonic effects within continents, and long-term climate changes have had a profound influence on the distributions of organisms and the ecological associations they comprise, and such phenomena may be first-order drivers of biotic evolution. At smaller geographic and temporal scales, geological processes can be implicated in controlling the rates of speciation and extinction. At still smaller scales, geospheric processes influence local environmental change, which is one cause of microevolutionary change within and among populations. At none of these scales do we have enough theory on which to build strong models or the ability to simulate the coupled processes.

Beyond understanding the coupling, we would like to understand the intrinsic and extrinsic controls on the rate of speciation. Over long periods of time, diversity has increased, decreased, or remained relatively stable, yet at a mechanistic level the causes of these patterns are poorly known. Diversification is generally modeled as a birth/death process, with change in diversity over time being a function of the rate of speciation and extinction. Many factors have been implicated in the rate controls of each, but current models describe simple diversity-dependent processes that rely largely on biotic interactions as proximate causes of increases or decreases in speciation and extinction.
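As a minimal sketch of the birth/death idea, the code below simulates lineage diversity with constant per-lineage speciation and extinction rates and compares it with the expected value N0 × exp((speciation − extinction) × t). The constant-rate assumption, the function names, and the parameter values are all illustrative; this is deliberately the simplest member of the family, with no diversity dependence and no environmental drivers.

```python
import math
import random


def expected_diversity(n0, speciation_rate, extinction_rate, t):
    """Mean number of lineages at time t under constant per-lineage rates."""
    return n0 * math.exp((speciation_rate - extinction_rate) * t)


def simulate_diversity(n0, speciation_rate, extinction_rate, t_end, rng):
    """Simulate one birth/death trajectory and return diversity at t_end.

    Events are drawn one at a time: the waiting time is exponential with
    rate n * (speciation + extinction), and each event is a speciation
    with probability speciation / (speciation + extinction).
    """
    n = n0
    t = 0.0
    per_lineage = speciation_rate + extinction_rate
    while n > 0 and per_lineage > 0:
        t += rng.expovariate(n * per_lineage)
        if t > t_end:
            break
        if rng.random() < speciation_rate / per_lineage:
            n += 1  # one lineage splits in two
        else:
            n -= 1  # one lineage goes extinct
    return n


rng = random.Random(7)  # fixed seed so the sketch is repeatable
runs = [simulate_diversity(10, 0.3, 0.2, 10.0, rng) for _ in range(2000)]
print("expected:", round(expected_diversity(10, 0.3, 0.2, 10.0), 2))
print("simulated mean:", round(sum(runs) / len(runs), 2))
```

Like the simple models just described, this sketch holds both rates fixed through time, so nothing in it responds to the physical environment.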
Omitted are abiotic factors such as tectonically mediated changes in mountain building, large-scale alterations of river systems, or climate change, all of which are widely recognized as potential drivers of rate controls. There is a pressing need for more realistic models of diversification that can be applicable across different spatiotemporal scales. These might not only parameterize traditional biotic rate controls but also take into account causal linkages between Earth history and speciation/extinction rates. This also provides a foundation for understanding how communities and ecosystems are assembled across space and time.

Understanding the evolutionary assembly of communities and ecosystems is a fundamental problem that cuts across multiple disciplines, including systematics and historical biogeography, community and landscape ecology, and paleontology. It has application to conservation, resource management, and understanding the consequences of global change. The mechanisms governing the assembly and maintenance of species associations (communities, ecosystems) at different spatial scales have received considerable attention, especially in ecological science (see, for example, Ricklefs and Schluter, 1993), but many aspects of the evolutionary dynamics of these assemblages have been less studied. Increasingly, history is recognized as playing an important role in shaping taxonomic assembly at a wide range of scales.

The coevolutionary history of different groups, or clades, of organisms within biological communities can be analyzed using methods of historical biogeography, but there is considerable disagreement and little consensus on whether any of the current methods are sufficiently sophisticated to reconstruct the spatial history of moderate to large clades, let alone multiple clades simultaneously. Developing a more sophisticated understanding of community assembly over multiple timescales will necessitate the development of new models and algorithms to integrate multiple species histories and ecologies.

A related part of Major Challenge 3 is how to determine the evolutionary history of microorganismal community structure and function. This is a somewhat different challenge, because new methods in comparative genomics are giving us a better understanding of microbial community organization.
Termed “environmental genomics” or metagenomics, these new tools use advances in high-throughput sequencing to sample the genomes in environmental samples (Riesenfeld et al., 2004; NRC, 2007). Conventionally, microbial DNA is isolated from a sample, cloned, and then used to create metagenomic libraries, but newer technologies can access the environmental sample directly (NRC, 2007). The resulting sequences have many uses, including for phylogenetic studies, measures of taxonomic and genomic diversity, the discovery of new genes, functional analysis of specific genes, and for modeling large biochemical pathways, including community metabolism (Tyson et al., 2004; Rusch et al., 2007). Although current metagenomics research is primarily directed toward genome characterization and the structure and function of microbial communities, it is clear that the massive amounts of new sequence data being collected will substantially expand our knowledge of global diversity, the tree of life, and genome evolution. Additionally, genomic comparisons interpreted in the context of phylogenetic relationships can also be expected to reveal new insights into genome structure and function.

Computational Challenges

Many of the causal linkages between the geosphere and biosphere, such as how tectonically driven change might have influenced the speciation and extinction rates, have been inferred from correlations generated by empirical studies. There is a need for a better theoretical-mathematical foundation that can lead to predictive quantitative assessment. Moreover, these causal models linking Earth history with biotic evolution should be operable at different spatiotemporal scales. Some simulations have been run that couple climate models with environmental models and data about ecosystems, and modeling of species distribution is becoming increasingly common. Some simulations have been performed to reconstruct how taxonomic elements of communities and ecosystems have assembled over time by integrating the phylogenetic and spatial histories of many groups of organisms simultaneously. To date, however, few if any of these studies have taken advantage of high-capacity computing.

OCR for page 63
 THE POTENTIAL IMPACT OF HECC IN EVOLUTIONARY BIOLOGY Major Challenge 4: Understanding the Origin and Evolution of the Phenotype The phenotype describes the observable characteristics of an individual. Although an individual may have the genetic capability to express a trait, only the traits that manifest are considered phenotypic characters. Phenotypic characters can be broadly construed across different levels of organization, from genomic and developmental characteristics to external features, physiology, and behavior. The genetic architecture of individuals or populations includes all interactions and functional linkages among genes that have influence on the expression of traits (NSF, 1998). We have broad knowledge of the nature of genetic variation within and between populations, and the promise of large-scale genomic sequencing across many individuals promises to revolutionize that knowledge and allow much more sophisticated questions to be asked. Knowledge about the origin and evolution of phenotypes is built on an under- standing of genetic variation in all its components, from nucleotide polymorphisms and their frequency in populations to the interactions among genomic loci. A key question is, How does that variation in coding or regulatory genes relate to changes in phenotype? Recently, significant progress in understanding the evolution of phenotypes has come from inte- grating the fields of evolutionary and developmental biology. Both fields have long histories, but start- ing several decades ago they diverged. Evolutionary biologists focused increasingly on understanding evolution at the population level and developed sophisticated genetic models to understand changes in allele frequencies, while developmental biologists focused on experimental manipulations to uncover the mechanisms of development. 
More recently, however, developmental biologists have taken their analysis to very deep molecular and genetic levels, and this has led to a renewed interest in understanding the interplay between evolution and development (called “evo-devo”), as both fields now have a common language of genetics and genomics. Evo-devo has grown explosively in recent years, becoming an exciting area of investigation and attracting much popular attention through the work of, among others, S.B. Carroll (2005). These studies of developmental evolution have spanned all levels, from microevolution within populations to macroevolution among the major clades of life.

One of the remarkable outcomes of these initial studies is the discovery that individual genes and genetic pathways can have important evolutionary effects on development and morphology. For example, patterning along the anterior-posterior axis of animals is controlled by a set of genes known as the homeotic (Hox) genes. Although the Hox genes were initially characterized through the study of highly deleterious mutant alleles in model species such as Drosophila melanogaster (fruit fly), subsequent studies have shown that evolutionary changes in these genes and in how they are expressed play a clear role in animal evolution across the entire micro- to macroevolutionary spectrum. For example, the Hox gene Ultrabithorax (Ubx) has helped us to understand microevolutionary changes in the pattern of fly bristles as well as the macroevolutionary changes seen in the appendages of crustaceans (Stern, 1998; Averof and Patel, 1997). Similarly, Hox genes have been implicated in large-scale evolutionary changes in the vertebrate skeleton. The analysis of these genetic networks presents us with the opportunity to understand evolution in increasingly sophisticated ways and has allowed us to generate better models of how evolutionary change occurs.
For quite some time, evolutionary models suggested that the phenotypic differences between even very closely related species were due to variation at a large number of genomic loci, with any given individual mutation having only a minute effect. Recent experiments, however, suggest that this is not always the case. Increasingly we see that phenotypic variation can often be attributed to one or just a few genes of large effect. For example, recent studies in sticklebacks show that variation at a single gene, Pitx, has a very large effect on the pelvic spines found in these fish (Shapiro et al., 2004). At the molecular level, these changes can involve both protein coding and gene regulatory modifications, although current theories suggest that regulatory mutations in developmental genes may have a predominant role in underlying major evolutionary change in phenotype. The pace of such discoveries is ever increasing as new genomic and developmental tools are allowing us to decipher how developmental systems evolve.

One fundamental goal of Major Challenge 4 is to understand integrated phenotypes and how they evolve. Phenotypic features (such as morphological form, physiology, behavior, even biochemical pathways) are often integrated into functional groups based on their interaction with the environment. Such linkages may be tight or loose. An additional challenge is to understand the boundaries and strength of these linkages, how the components of these groups arise, coevolve, and perhaps become unlinked functionally, potentially to be incorporated into other functional groups. Thus, considerable attention is now being paid to establishing the boundaries of integrated suites of morphological, behavioral, and physiological traits of organisms that function and interact collectively with the environment. The concept of modularity within evolutionary developmental biology is playing a role in understanding the structural and functional organization of integrated phenotypes and how they arise in development and change during evolution (Schlosser and Wagner, 2004). Integrated phenotypes are being studied from the perspectives of developmental biologists, comparative functional biologists, and population biologists conducting ecological and genetic experiments in the wild or laboratory.

These evolutionary alterations in developmentally important genes also lead to changes in the genetic architecture of development in such a way as to control the range of phenotypic variation that is possible in subsequent generations.
In some situations this can constrain or limit future phenotypic evolution, while in other cases it can open up entirely new possibilities for subsequent evolutionary change and the appearance of totally novel morphologies and physiologies. An important future challenge is to integrate these developmental and evolutionary studies with ecological ones to understand how natural selection shapes the course of growth and development of phenotypes and the underlying genetic architecture.

Computational Challenges

A major computational challenge is to generate qualitative and quantitative models of development as a necessary prelude to applying sophisticated evolutionary models to understand how developmental processes evolve. Developmental biologists are just beginning to create the algorithms they need for such analyses, based on relatively simple reaction-rate equations, but progress is rapid, and this work will soon be able to take advantage of HECC resources.

Another important breakthrough in the field is the analysis of gene regulatory networks (Levine and Davidson, 2005). These networks describe the pathways and interactions that guide development. Although formulating them depends on intensive experimental data collection, once produced, the networks provide explicit models to test how perturbations affect all manner of developmental events. While they resemble metabolic pathways in overall structure, they are far more complex in their regulation and behavior. As these models grow to include more pathways and more organisms, they will increasingly benefit from greater computational capacity and will become vital to many evolutionary studies. Similarly, protein-interaction network analysis provides insight into an organism’s functional organization and evolutionary behavior (see, for example, http://www.hicomb.org/papers/HICOMB2007-03.pdf).
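As an illustration of what a "relatively simple reaction-rate equation" and a perturbation test might look like, the following minimal sketch integrates a single production/degradation equation, dP/dt = k_prod - k_deg * P, for one gene product. The function names, rate values, and the choice of forward-Euler integration are assumptions made for this example; they are not taken from the models cited above, which involve many coupled equations.

```python
def simulate_protein(k_prod, k_deg, p0=0.0, dt=0.01, steps=5000):
    """Forward-Euler integration of dP/dt = k_prod - k_deg * P.

    k_prod: production rate (hypothetical units of concentration per time)
    k_deg:  first-order degradation rate (per time)
    Returns the list of concentrations at every time step.
    """
    p = p0
    trajectory = [p]
    for _ in range(steps):
        p += dt * (k_prod - k_deg * p)
        trajectory.append(p)
    return trajectory


def steady_state(k_prod, k_deg):
    """Analytical steady state of the same equation: P* = k_prod / k_deg."""
    return k_prod / k_deg


# Baseline parameters versus a perturbation that doubles the degradation rate.
baseline = simulate_protein(k_prod=2.0, k_deg=0.5)
perturbed = simulate_protein(k_prod=2.0, k_deg=1.0)

print("baseline final level: ", round(baseline[-1], 4))
print("perturbed final level:", round(perturbed[-1], 4))
```

A realistic developmental model would couple many such equations, one per gene product, with regulatory terms linking them; the cost of repeating the integration over many parameter combinations is what pushes such analyses toward larger machines.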

Major Challenge 5: Understanding the Evolutionary Dynamics of the Phenotype–Environment Interface

Evolutionary biology has long sought to understand environmental selective effects by focusing on a single trait or trait complex, such as bill shape, body size, or body shape. Yet, selective regimes in the environment act on entire phenotypes that are the result of highly complex and linked (integrated) developmental and metabolic pathways.

Earth’s biota is a product of complex interactions between the abiotic and biotic realms (see Major Challenge 3). Life on Earth has evolved in the context of a dramatically and often rapidly changing environment, whose trajectory and long-term trends have themselves been modified by evolving life forms. Earth’s atmosphere and long-term climate trends have played an important role in the major transitions and increasing complexity of life, but the system is interactive, replete with complex positive and negative feedbacks between living and geological systems. The timing and causes of many of the major transitions in the origin of biotic complexities, such as the origin of oxygen-based metabolism, are still somewhat controversial, given the challenges of interpreting the chemical and morphological signals in the fossil record of the first 3 billion years of evolution.

The evolution of integrated complexes can also be investigated at different hierarchical levels. One critical approach is to link variation in these integrated complexes to environmental differences within and among populations in order to understand outcomes of selection. At a higher level, changes in integrated complexes can be analyzed across species, particularly those that are closely related, thus describing how the components of these complexes change at times of speciation.
Such analyses are critical for providing insight into how the tightness of integration “constrains” change in phenotypic-functional complexes.

Finally, we build on this knowledge to learn about the relationship between phenotypic change and adaptation. At a population level, variants in phenotype may have different consequences for survival or reproduction. Those variants that become fixed through natural selection (because of those consequences) are often referred to as adaptations. The nature of adaptations has been studied intensely from the viewpoint of population biology and ecology. Less attention has been paid to the molecular basis of population variation underlying phenotypic change and the linkages that might exist between genome evolution and phenotypic evolution. To what extent, for example, is convergence in phenotype related to convergence in the genetic and developmental pathways that produce those phenotypes? And, to what degree is similarity in presumptive adaptations constrained by those pathways, or is there flexibility in morphogenetic systems such that different ones can produce very similar phenotypic expressions? Answers to many of these questions and others in this field will require more empirical information about the amount and kind of genetic variation that underlies phenotypic variation and its response to selection. Such data are crucial for building more sophisticated and realistic models and simulations of the evolutionary process.

A second fundamental goal underlying Major Challenge 5 is to understand better the links between ecological and evolutionary processes. The conservation of biodiversity depends critically on our ability to predict the responses of populations to changes in their environment that occur on short- and medium-term timescales.
It is likely that models can be developed that are capable of predicting short- and medium-term population fluctuations in response to environmental change in greater detail than we can over long-term evolutionary time. Such models could, for instance, address biologists’ concerns about the effects of climate change in the recent past and in the short-term future. Both the increased detail of the environmental record and the increased sophistication of demographic models should enable biologists to understand the effect of environmental change on many of the key life-history components of population fluctuations, such as juvenile and adult survival, fecundity, and population age-structure.
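To make the role of these life-history components concrete, here is a minimal sketch of a two-stage (juvenile/adult) demographic projection of the kind such models build on. The stage structure, the parameter values, and the function names are illustrative assumptions, not a model taken from this report; the "changed environment" scenario simply lowers juvenile survival.

```python
def project(juveniles, adults, s_juv, s_adult, fecundity, years):
    """Project a two-stage population forward one year at a time.

    Each year, adults produce `fecundity` juveniles apiece, juveniles
    survive to adulthood with probability `s_juv`, and adults survive
    with probability `s_adult`. Returns the (juveniles, adults) history.
    """
    history = [(juveniles, adults)]
    for _ in range(years):
        juveniles, adults = (
            fecundity * adults,
            s_juv * juveniles + s_adult * adults,
        )
        history.append((juveniles, adults))
    return history


def growth_rate(history):
    """Ratio of total population size in the last two projected years."""
    (j0, a0), (j1, a1) = history[-2], history[-1]
    return (j1 + a1) / (j0 + a0)


# Hypothetical baseline environment versus one with reduced juvenile survival.
baseline = project(100.0, 100.0, s_juv=0.3, s_adult=0.5, fecundity=2.0, years=100)
stressed = project(100.0, 100.0, s_juv=0.2, s_adult=0.5, fecundity=2.0, years=100)

print("baseline annual growth rate:", round(growth_rate(baseline), 4))
print("stressed annual growth rate:", round(growth_rate(stressed), 4))
```

With these assumed numbers, the baseline population grows and the stressed one declines, showing how a change in a single vital rate can reverse the long-run trajectory. Realistic models add more age classes, stochastic environments, and spatial structure, which is where the computational cost rises.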

  • Heuristic pruning of the fragment alignment graph to eliminate experimentally inconsistent subpaths.

  • Signal processing of raw sequencing data to produce higher quality fragment sequences and better characterization of their error profiles.

  • Development of new representations of the sequence-assembly problem, for example, string graphs that represent data and assembly in terms of words within the dataset.

  • Alignment of error-prone resequencing data from a population of individuals against a reference genome to identify and characterize individual variations in the face of noisy data.

  • Demonstration that the new methodologies are feasible, by producing and analyzing suites of simulated data sets.

Once we have a reconstructed genomic or metagenomic sequence, a further challenge is to identify and characterize its functional elements: protein-coding genes; noncoding genes, including a variety of small RNAs; and regulatory elements that control gene expression, splicing, and chromatin structure. Algorithms to identify these functional regions use both statistical signals intrinsic to the sequence that are characteristic of a particular type of functional region and comparative analyses of closely and/or distantly related sequences. Signal-detection methods have focused on hidden Markov models and variations on them. Secondary structure calculations take advantage of stochastic, context-free grammars to represent long-range structural correlations. Comparative methods require the development of efficient alignment methods and sophisticated statistical models for sequence evolution that are often intended to quantitatively model the likelihood of a detected alignment given a specific model of evolution. While earlier models treated each position independently, the trend now that large data sets have become available is to incorporate correlations between sites.
To compare dozens of related sequences, phylogenetic methods must be integrated with signal detection.

Major Challenge 7: Understanding the Evolutionary Dynamics of Coevolving Systems

Individuals of the same species or of different species generally have either conflicting or cooperating (mutualistic) interactions. Increasingly, many of these interactions have been found to evolve in relation to one another, a process known as coevolution. These coevolving interactions take many forms, from the symbiosis of organelles that once invaded free-living microbes, to predator-prey, host-parasite, and plant-pollination systems, to cooperative breeders or sexually selected mate choice, and to competitive interactors within habitats and communities. Understanding the evolutionary biology of such interactions has broad implications for solving problems in many areas of applied biology, including human health, agriculture, and resource management. For this reason, there is great interest in understanding the genomic underpinnings of conflict and cooperation and how conflict and cooperation evolve.

Genetic mechanisms underlying conflict and cooperation have long been investigated empirically and theoretically, and considerable research has been undertaken on the genetics of behavior. It is widely recognized that behavior is influenced in complicated ways by numerous genes and their interactions, but we still have inadequate empirical knowledge of the genetic variation in the wild that is available to selection. The new tools of genomics promise to broaden the kinds of questions and approaches that have been standard in the field. To put the interconnections among genomic data, development, neurological function, and expressed behavior in an evolutionary context, studies will need to be comparative.
These studies will ask new questions about the numbers and kinds of genes and about differences in the genetic architectures that influence conflictual and cooperative behaviors, including the genetic basis of instinct.

Cross-species, coevolutionary comparisons of multispecies interactors will reveal new insights into the nature of coevolution itself: for example, How fast is genetic and phenotypic change in interactors? Are coevolving systems conservative over time? How are those systems shaped by genetic factors?

The evolution of conflict and cooperation can be studied at different hierarchical levels. For example, the history of species associations, such as hosts and their parasites, has been the focus of considerable phylogenetic coevolutionary analysis. At the level of populations, ecologists and population biologists have also built a large library of detailed field and laboratory studies.

Despite this large body of work on conflict, cooperation, and coevolution, many aspects of the roles that conflict and cooperation may play in evolution are still poorly understood, for example, in the evolution of adaptation and in the origin of species. There is also a need for studies that integrate causality at the genomic level with that at the population, ecological, and demographic levels.

Computational Challenges

Sophisticated modeling of conflict/cooperation systems has a long history, and the large body of literature investigating these systems integrates genetic and population approaches. Much of this modeling stems from game-theoretic approaches and from classic population dynamic models such as predator-prey. Although game theory originally dealt with economic problems, it has also had a profound impact on evolutionary biology (see, for example, Maynard Smith, 1982; Vincent and Brown, 2005). Most quantitative analyses of conflict/cooperation models have been carried out using desktop computing. Yet, as models subsume more parameters and include demographic or genetic information across space and multiple generations, access to advanced capability computing will become necessary (see, for instance, Nowak, 2006).
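A small desktop-scale example of the game-theoretic modeling mentioned above is the classic Hawk-Dove game analyzed with replicator dynamics. The sketch below is illustrative only: the payoff values, step size, and function names are assumptions chosen for this example, and the report does not single out this particular game. With resource value V and fight cost C (C > V), the payoff difference between hawks and doves at hawk frequency x is (V - xC)/2, so the frequency is expected to settle at V/C.

```python
def hawk_dove_replicator(x0, value, cost, dt=0.01, steps=10000):
    """Forward-Euler integration of the replicator equation for Hawk-Dove.

    x is the frequency of hawks. Expected payoffs at frequency x are
      hawk: x * (value - cost) / 2 + (1 - x) * value
      dove: (1 - x) * value / 2
    and the replicator equation is dx/dt = x * (1 - x) * (hawk - dove).
    """
    x = x0
    for _ in range(steps):
        hawk = x * (value - cost) / 2 + (1 - x) * value
        dove = (1 - x) * value / 2
        x += dt * x * (1 - x) * (hawk - dove)
    return x


# Two different starting frequencies with hypothetical payoffs V = 2, C = 4.
from_rare = hawk_dove_replicator(x0=0.1, value=2.0, cost=4.0)
from_common = hawk_dove_replicator(x0=0.9, value=2.0, cost=4.0)

print("hawk frequency starting from 0.1:", round(from_rare, 4))
print("hawk frequency starting from 0.9:", round(from_common, 4))
```

A model like this has one state variable and two parameters. Adding explicit genetics, spatial structure, and many interacting strategies or species multiplies both the state space and the number of runs needed, which is the growth in cost that the paragraph above anticipates.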
MAJOR CHALLENGES IN EVOLUTIONARY BIOLOGY THAT REQUIRE HECC

Progress in most areas of evolutionary biology has been very rapid over the past several decades. However, because desktop computing has continued to advance, much of quantitative and theoretical evolutionary biology has prospered without reliance on advanced computational capabilities. But this is likely to be a transient phase, because over the past decade, evolutionary biology has become increasingly multidisciplinary and integrative. This, and the rapid accumulation of genetic and other data on populations and species, has accelerated the transformation of evolutionary biology into a quantitative science. The study of microevolution, which requires genetic and demographic analyses of evolutionary change within populations, has a long history of theoretical and quantitative modeling and remains robust. In many other areas of evolutionary biology, however, the building of quantitative theoretical models has been neglected, and research is sorely needed.

One area that has made use of high-performance machines is phylogenetic research. As more members of the community become adept at using cutting-edge computational methods, there will inevitably be pressure to port models and codes to more powerful platforms and thereby address scientific questions in ways that mirror the complexity of natural systems. Evolutionary biology is already in transition, and to realize its potential fully will require progress on all capability-computing-related fronts: models, theory, data management, education and training, algorithms, and hardware.

It should be stressed that progress in some areas of evolutionary biology is being limited by a lack of computational power. Even in phylogenetic research, where HECC is being exploited, researchers could make use of additional computational resources for increased statistical testing, larger simulations, and advanced visualization tools.

Major Challenges 1 and 2 are probably where we will first feel the need to transition to HECC-enabled research. In both cases, access to advanced computational approaches will enable representing enough complexity to reveal new phenomena. To address Major Challenge 1, the phylogenetics community is building larger and larger trees, evaluating them statistically, and manipulating them visually. As was noted earlier in this chapter, however, the computational challenge scales superexponentially with the number of terminals on the trees. This makes it especially important for the community to take maximum advantage of whatever state-of-the-art computing exists at the time. Not doing so will seriously hamper future progress. In addition, research using phylogenetic methodologies is expanding rapidly in the biomedical sciences as well as for metagenomic studies investigating community structure/function, ecosystem metabolism, and global climate change. In all of these areas, producing results that can meaningfully answer practical questions calls for much greater complexity, which in turn demands advanced computing. Along with the greater complexity, these applications typically involve massive amounts of data, the management and analysis of which will require advanced computing.

For studies about speciation (Major Challenge 2), the simple mathematical models with only a few parameters that were of such importance for previous theoretical work are unable to make good use of the large amounts of data becoming available. They also are proving inadequate given the desire for higher resolution descriptions of genetic and/or population behavior.
Thus, although experimentation may continue to be the dominant research modality for Major Challenge 2, there is a need to move to simulations that are explicitly genetic and characterized by a large number of parameters, large populations (hundreds of thousands of individuals), long timescales (hundreds of thousands of generations), and significant stochasticity (which requires that the simulations be run multiple times to enable statistical analysis). For the research community to take the step from qualitative predictions and speculations to much more powerful and precise quantitative predictions and estimates, it must be able to perform such simulations, for which it will need HECC.

Moreover, advanced computing opens some new options for approaching Major Challenge 2. Evaluating demographic histories using genetic data is computationally challenging for several reasons. For any data set of DNA sequences (or other genetic markers), there are many potential gene trees (phylogenies) that are consistent with the data; in addition, for any given set of phylogenies, there are a number of often complex population histories that can be accommodated by this set of gene trees. The result is two levels of uncertainty. In theory, this challenge could be met by integrating across all possible gene trees and population histories:

\[
\Pr(X \mid \Theta) = \int_{G \in \psi} \Pr(X \mid G)\, p(G \mid \Theta)\, dG
\]

(Felsenstein, 1988; Hey and Nielsen, 2007). Here, X is the collected data (say, DNA sequence data sampled from multiple loci and multiple species or populations); Θ is the species history, which could be a complex demographic history involving bottlenecks, gene flow, and local extinction or a purely dichotomous history of population divergence, a phylogeny; G is a gene tree, an estimated genealogy of DNA sequences; and ψ is the set of all such genealogies, which include continuous branch lengths and very many topologies.
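To give a sense of how large the discrete part of ψ is, the following sketch counts rooted, bifurcating, labeled tree topologies using the standard double-factorial formula (2n - 3)!! for n sampled sequences. The function name is an assumption for this example, and the count ignores the continuous branch lengths, which enlarge the space further.

```python
def rooted_topology_count(n_tips):
    """Number of rooted, bifurcating, labeled topologies for n_tips leaves.

    Uses the product 3 * 5 * ... * (2 * n_tips - 3), i.e. (2n - 3)!!.
    For one or two tips there is exactly one topology.
    """
    count = 1
    for k in range(3, 2 * n_tips - 2, 2):
        count *= k
    return count


for n in (3, 4, 5, 10, 20, 50):
    # Scientific notation keeps the larger counts readable.
    print(f"{n:>3} tips: {rooted_topology_count(n):.3e} topologies")

# rooted_topology_count(10) returns 34459425
```

Even at ten sequences there are tens of millions of topologies, and the count grows faster than exponentially with every added sequence, so exhaustive evaluation of each G in the integral quickly becomes infeasible.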
However, this integration is not realizable in practice, not only because of the need for efficient estimation when the state space is large (for complex demographic models with many parameters) but also because, as discussed in connection with Major Challenge 1, the number of possible gene trees (phylogenies) increases superexponentially with the increasing number of taxa (in our case, genes or alleles sampled from a given species). This makes it impractical to evaluate the above integral by sampling a large number of genealogies at random. Thus even though the probabilities of mutation events for any one genetic locus can be multiplied by those for other loci because they are independent (conditional on the demographic history itself), they cannot be calculated analytically.

As a result, computational and statistical approximations have been used extensively. The most recent of these methods to be employed, Bayesian analysis, frequently utilizes MCMC methods to sample many possible gene trees and parameters associated with the demographic history. Various sampling and rejection schemes have been proposed, as well as means of proposing parameters via complex prior distributions. With a sufficient number of MCMC cycles, complex probability distributions can be approximated. Still, this exercise gets us only to the point where we can evaluate statistically the universe of trees that should be considered appropriate for an analysis. We are still left with deciding which population model (described by gene flows, population size changes, and so on) best fits the set of gene trees.

Many of the approaches to Major Challenge 3 could use HECC now or are moving inexorably in that direction. Irrespective of scale, models that link geosphere and biosphere could be highly parameterized and, if they are, will ultimately rely on HECC when fully implemented. As noted in Chapter 3, geoscientists are beginning to couple climate modeling with environmental modeling and satellite data on ecosystem distribution to reconstruct the environmental history of Earth’s ecosystems and to predict changes due to global warming. Concurrently, environmental modeling of species distributions is also becoming more common, and evolutionary biologists are beginning to use information about phylogenetic relationships to reconstruct the historical environmental envelopes of common ancestors down the tree.
One can imagine the possibility of linking these classes of models and extending the coupled system farther back in time to examine how geological and climatological changes at a global scale might have influenced the distributions of species and biotas in terrestrial and marine environments. The complexity and precision of these reconstructions will depend on massive computational power. Such calculations will continue to stimulate advances in data integration and theoretical analysis, and the difficulties of these challenges will tax even the next generation of HECC.

Reconstructing how taxonomic elements of communities and ecosystems are assembled over time involves integrating the phylogenetic and spatial histories of many groups of organisms simultaneously. Current analytical approaches to this problem, conventionally undertaken on desktop computers, are widely regarded as inadequate because the simplifying assumptions of the methods and the models of spatial change are lacking in realism. Biogeographers, phylogeneticists, and computer scientists are collaborating to develop algorithmic approaches that will be able to extract the complex spatial and temporal histories of multiple groups simultaneously. Because of the large parameter space of the solution set and the algorithmic complexity, HECC will play a major role in data analysis.

From the outset, metagenomic studies have been intensely computational because they involve the assembly and comparisons of millions of gene fragments for hundreds or thousands of different kinds of microbes, many of which are new. The scale of such studies is expanding rapidly (see, for example, Rusch, 2007), and analyzing the results to address evolutionary questions, which will necessarily involve computationally intensive phylogenetic approaches, means that studies in evolutionary biology will more than ever need to push the frontiers of HECC.
Metagenomic studies also reveal the remarkable diversity of protein families and their functional subdomains that have evolved. Predicting the functions of these domains is a highly complex computational problem that brings physical and chemical modeling together with evolutionary biology. No single algorithm has yet been established that works in all cases, but this is a subfield of very active research, with tools being developed and ongoing experimental validation of the predictions (Friedberg, 2006). The prediction of a protein’s three-dimensional structure from its sequence is an area that already makes extensive use of HECC, and structure information could now be applied to understand the evolutionary diversification of protein structures and of families of functionally related proteins.

Simulations capable of representing the process of development, both qualitatively and quantitatively, are necessary for addressing Major Challenge 4. As noted in that section, developmental biologists are just beginning to create the algorithms, but progress is rapid, and they will soon be able to take advantage of HECC. Many of these models are based on relatively simple reaction-rate equations, but the number of parameters is very large. In many cases, parameters such as the rate of protein production and degradation, the rate of ligand diffusion, and the rate of receptor turnover can be specified only within certain limits. This means that the number of possible outcomes is too large to compute; instead, the space of all possible outcomes is sampled to gain an understanding of the developmental pathway.[2] Access to HECC will allow these developmental analyses to be done more efficiently and for a wider range of parameters. Likewise, the growing application of models from chemistry and physics, such as for the diffusion of ligands through the embryo (Gregor et al., 2007), has opened up the possibility of truly predictive models of development, which in turn promise the opportunity to understand the evolutionary outcomes of changes to the system.

Also, as noted in the discussion surrounding Major Challenge 4, analysis of gene regulatory networks and protein interaction networks is an important tool for understanding the development and evolution of phenotypes, and the analysis of both sorts of network can be complex enough to require HECC. For instance, it will be important to test how perturbations affect all manner of developmental events, and this requires multiple large-scale simulations.
Also, as we develop a more detailed understanding of these networks and their effects on phenotypes, research will need to include more pathways and more organisms, thereby necessitating increased computational capability.

As noted in the discussion of Major Challenge 5, computational approaches are beginning to be used to recouple ecological and evolutionary dynamics. To date, most research has focused on simple cases, such as examining the dynamics within a single species with a very simplistic genetic architecture and genetic basis for phenotypic traits. Even so, such cases have contributed to some solid conceptual advances. Future success in this area, however, will depend heavily on analyses of models of communities of organisms with realistic ecological, genetic, and spatial structures, and the complexity of such models quickly brings us to the HECC domain.

The computational demands will be staggering, insofar as such analyses will involve the simultaneous tracking of thousands of genes in hundreds of interacting species, each with its own independent evolutionary history and genetic basis for traits. Extending this work to incorporate spatially explicit landscapes will further escalate the computational demands. One can imagine adding yet another set of equations to Lewontin’s equations (see Major Challenge 5) that would capture the complex trajectories of development (linking genotype to phenotype). Incorporating this last feature would truly integrate ecology and evolution by demonstrating the development of phenotypes from genotypes. At least for some model organism systems (sea urchins, Drosophila), developmental biologists have formulated equations that allow predicting phenotypic properties from the structure of complex gene networks (see Major Challenge 4).
Very little has been done to tie such developmental predictions quantitatively to evolution or ecology (Kingsolver et al., 2007), no doubt because of the computational challenges, but this is clearly the direction in which coupled models of ecological and evolutionary dynamics are heading.

[2] A good example of such an approach is seen for models of the Drosophila segment polarity network, which maintains a segmental pattern in the early embryo. Computational analysis led to the surprising result that the network was remarkably robust in the face of environmental and genetic perturbations (von Dassow et al., 2000).

The computational methods that are essential to addressing Major Challenge 6 are largely manageable today, as explained in that section. As described there, evolutionary comparisons play a critical role in genome annotation. In this way, we identify conserved protein coding regions and noncoding regions that presumably play a variety of regulatory roles. But the task of comparing and aligning genomes becomes increasingly difficult as we get more and more genomes and as we ask increasingly sophisticated questions. As this report is being written, few bioinformaticians have routine access to very powerful computers, which limits their research capability. Algorithms and data are changing rapidly, and bioinformaticians often want to run their analyses repeatedly, tweaking parameters on each iteration. At this stage of development, then, it is advantageous for a researcher to work closely with his or her own machine, even if that constrains the scale of the calculations. As algorithms are perfected and databases swell, the need for HECC will grow rapidly. Making access fast and easy will also be critical in getting the user community to switch to state-of-the-art capabilities. These capabilities will also play an important role in developing gene models that can find and annotate genes independent of evolutionary conservation, which is essential when looking for genes that are evolving rapidly or are unique to specific lineages.

As our understanding of comparative genomics in wild populations improves, we will be better able to look for evolutionary signatures of selection and thus tease apart genome-level events that lead to speciation and macroevolutionary changes. While we have sophisticated theories and algorithms for this analysis, we are challenged to test these theories rigorously through large-scale genome analysis. As described above, transposable elements and the genome alterations they cause have played an important role in evolution, but piecing together the course of events is difficult from the computational standpoint.
Ironically, the repeat structures created by transposable elements make the process of assembling whole genome sequences from raw data an even more complex computational problem. Right now, when we say a genome has been fully sequenced, that often applies only to its euchromatic region; the heterochromatic region, which often is rich with transposable elements, remains unassembled because computational methods are still lacking to make sense of the data.

While specific computational challenges and approaches in evolutionary biology have been discussed above, several additional observations can be made that apply across all levels of evolutionary research. All of the biological sciences are data rich. The data are rich not just in volume per se but also in complexity, uniqueness or individuality, and nonreducible character. They include such disparate materials as relatively simple DNA sequences, catalogs of museum specimens, photo images of collection materials, and movies of developing embryos. Thus the storage, organization, and dissemination of biological data present challenges. (While the challenges of storing and making available genomic data are significant, it is even more difficult to store and share digitized visual data such as photographs, movies, and images of museum collections.) This flood of data has led to a need for computational tools and computer hardware for database storage, management, and usage.

The torrent of biological data also has changed the evolutionary biology community’s ability to study speciation. For example, earlier work on sympatric speciation used simple mathematical models with few parameters. These models are inadequate for addressing the complexity found with the new genetic data as well as more detailed information about population structure and dynamics. HECC will soon allow evolutionary biologists to move to large-scale simulations of individual-based, dispersed populations.
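As a rough illustration of what an individual-based, spatially dispersed simulation involves, the following minimal sketch tracks a single two-allele locus in haploid individuals spread over a ring of local populations (demes). Everything specific here is an assumption made for illustration and is not drawn from this report: the Wright-Fisher-style resampling, the ring layout, the function name `simulate_demes`, and all parameter values.

```python
import random


def simulate_demes(num_demes=10, deme_size=100, generations=50,
                   migration=0.05, initial_freq=0.5, seed=0):
    """Toy individual-based model (hypothetical, for illustration only).

    Each individual is haploid and carries allele 1 or allele 0.
    Demes are arranged on a ring; every generation each offspring
    copies the allele of a random parent, drawn from a neighboring
    deme with probability `migration` and from its own deme otherwise.
    Returns the final frequency of allele 1 in each deme.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable

    # Initial state: one list of alleles per deme.
    demes = [
        [1 if rng.random() < initial_freq else 0 for _ in range(deme_size)]
        for _ in range(num_demes)
    ]

    for _ in range(generations):
        next_demes = []
        for d in range(num_demes):
            offspring = []
            for _ in range(deme_size):
                source = d
                if rng.random() < migration:
                    # Parent comes from the left or right neighbor.
                    source = (d + rng.choice((-1, 1))) % num_demes
                offspring.append(rng.choice(demes[source]))
            next_demes.append(offspring)
        demes = next_demes

    return [sum(deme) / deme_size for deme in demes]


if __name__ == "__main__":
    for i, freq in enumerate(simulate_demes()):
        print(f"deme {i}: frequency of allele 1 = {freq:.2f}")
```

The work in this sketch grows with the product of generations, demes, and individuals per deme. A realistic model would add many loci, many interacting species, and explicit landscapes on top of that, which is the kind of growth that pushes such simulations toward high-end computing.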
Through new technologies, especially for macromolecular analysis, biological investigations are becoming even more characterized by this data richness. In an era of high-throughput, high-information-content discovery in biological science, more and more research domains in evolutionary biology will be able to profit from high-end computing. To progress in exploiting the massive amounts of data, subdisciplines of evolutionary biology should strive to reach the point where HECC use is routine. Ultimately, this trajectory will unleash the potential for a very rich theoretical framework for evolutionary biology, just as exists today for the physical sciences. As this framework is built up and becomes robust, high-end computing will become pervasive within the community.

In general, access to computing (in the broadest sense, and at capabilities up to the state of the art), combined with the data revolution, has already transformed studies in evolutionary biology, and it will grow more enabling as theory and experimentation continue their rapid advance. Genomic and metagenomic data sets, whether individual genomes or entire population or community genomes, all require the methods of computational evolution if they are to be understood. For example, comparative analysis remains an essential tool for understanding biology. To utilize the explosion in genomics and metagenomics, efficient alignment methods and advanced statistical methods for characterizing sequence evolution are needed. Typically, a mathematical model (itself part of the framework of a specific model of evolution) is created to discern likely alignments. As ever-larger data sets are made available to the community, it may be possible to include correlation so that it no longer is necessary to treat each sequence position independently. Signal detection algorithms will need to be integrated into phylogenetic methods. HECC is needed to cope with the data flow, which includes growing numbers of sequences, characters, numbers of species, and so on. Each parameter can take on thousands to hundreds of thousands of values, yet even more character data have to be added to gain confidence for resolving trees, whether those trees depict patterns of species interrelationships or the evolutionary patterns of genes within populations.
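One reason that resolving ever larger trees strains local computing resources is that the number of candidate trees grows faster than exponentially with the number of species. As a small illustration that is not taken from this report, the sketch below uses the standard count of unrooted, strictly bifurcating topologies for n labeled taxa, (2n - 5)!! = 3 x 5 x ... x (2n - 5). The function name `num_unrooted_topologies` is hypothetical.

```python
def num_unrooted_topologies(num_taxa: int) -> int:
    """Count unrooted, strictly bifurcating trees on labeled taxa.

    Uses the standard double-factorial formula (2n - 5)!! for n >= 3.
    With one or two taxa there is only a single (trivial) tree.
    """
    if num_taxa < 1:
        raise ValueError("num_taxa must be at least 1")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty when n <= 3).
    for odd in range(3, 2 * num_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(f"{n:>3} taxa: {num_unrooted_topologies(n):.3e} topologies")
```

Four taxa give 3 topologies, five give 15, and ten already give 2,027,025, so exhaustive evaluation is out of reach well before data sets reach hundreds of species. This is why practical tree searches rely on heuristics and why larger searches benefit from more computing capability.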
The data richness of biology often leads biologists to simplify their questions to the level at which they can be addressed in a reasonable amount of time on local computing resources, for example by reducing the number of parameters in modeling an ecological network. Enabling evolutionary biologists to readily exploit supercomputing power would significantly change the aims and scope of many research programs. Even today we can see how progress is limited by the relative scarcity of substantial computing resources at the high end: Workstations require months to solve medium-sized problems for modeling molecular evolutionary change even though new algorithms have provided some improvements. For theory, experimentation, and modeling to work coherently to advance evolutionary biology, computing must be available to sustain correlated intellectual inquiry. Long compute times are especially limiting when the range of models required is so large and the field has to depend on models and their validation, since no exact analytical solutions are possible.

The trajectory of the phylogenetics community as it seeks to probe the history of life is necessarily aimed at building and understanding ever larger trees. Thus another opportunity exists if we can encourage the development of community codes and community efforts to advance our understanding of the tree of life. An immediate example is the NSF’s Assembling the Tree of Life project, which has nucleated a closely connected community. That community has the incentive and the common purpose to work together effectively to develop codes and use HECC extensively to improve models and their validation for the tree of life.

Many other biological computing applications today are developed locally.
Colleagues ask to use them, and at a certain point there is enough interest that it is worthwhile to invest resources in hardening them, bringing them up to standards that software engineers can live with, and making them more user-friendly. This bottom-up method of developing software works best for biology, where there is such a diversity of questions being asked, but work on developing, distributing, and especially making it possible for these programs to work together is not well supported by the current funding mechanisms.

In addition, greater access to advanced computing environments can forge extensive partnerships with mathematicians and computer scientists, leading to clearer definition of computational problems and establishment of new algorithms. Greater collaboration with the mathematical, computer science, and engineering communities, as has happened in climate change and environmental biology, would greatly accelerate progress in evolutionary biology. These collaborations can also lead to capabilities for advanced visualization, methods for the implementation and application of evolutionary models, and capabilities for the interactive analysis of large, very complex data sets, all tailored to the particular needs of evolutionary biologists.

These examples illustrate the many practical advantages of such access; the incentives for using advanced computing encompass more than just the classic NP-complete nature of generating and validating phylogenetic trees. But many steps must be taken before this vision can be realized.

The benefits of such access go beyond the ability to develop and use heuristic and approximate solutions for ever larger phylogenetic trees to advance our understanding of the history of life. They include the ability to apply this knowledge for the good of society. A deep understanding of evolution integrated into the fabric of biology provides the basis for all our understanding and knowledge of life and how living systems function. That, in turn, allows biology to contribute to developing new vaccines, antibiotics, and other medicines, predicting drug targets, managing natural resources, providing biosecurity through identification of pathogens and invasive species, and so on.

Evolutionary biology also presents challenges for the scientific computing community. That community has its roots more in the computational problems arising in physics, such as fluid flow, structural analyses, and molecular dynamics, and it has built up expertise and software for those sorts of problems. But many biological applications involve irregular data structures and unpredictable memory accesses (because the data come from strings, lists, trees, and networks), which place more demands on integer performance.
So the community of expert algorithmicists and code builders must also be built up if evolutionary biology is to replicate the computational successes of the physical sciences.

What would evolutionary biologists need in a HECC facility? At the very least, the community needs uniform access to large data sets (tera- to petabyte scale for image data, giga- to terabyte scale for genomic data) and a network infrastructure that allows remote access and sharing. More generally, the challenge of large, interrelated data sets is a new one for biology, and the research community does not yet have the habit of looking for patterns in those data or the theoretical framework for doing so. When it does, we will be able to ask entirely new questions.

REFERENCES

Averof, M., and N.H. Patel. 1997. Crustacean appendage evolution associated with changes in Hox gene expression. Nature 388: 682-686.
Bader, D.A. 2004. Computational biology and high-performance computing. Communications of the ACM 47(11): 34-41.
Bader, D.A., B.M.E. Moret, and L. Vawter. 2001. Industrial applications of high-performance computing for phylogeny reconstruction. In H.J. Siegel (ed.), Commercial Applications for High-Performance Computing. Bellingham, Wash.: SPIE, 159-168.
Bader, D.A., A. Snavely, and G. Jacobs. 2006. Petascale Computing in the Biological Sciences. National Science Foundation Workshop Report. Arlington, Va., August 29-30.
Barton, N.H., D.E.G. Briggs, J.A. Eisen, D.B. Goldstein, and N.H. Patel. 2007. Evolution. Cold Spring Harbor Laboratory Press.
Batzer, M., and P.L. Deininger. 2001. Alu repeats and human genomic diversity. Nature Reviews in Genetics 3: 370-379.
Beaumont, M.A., and B. Rannala. 2004. The Bayesian revolution in genetics. Nature Reviews in Genetics 5: 251-261.
Beaumont, M.A., W. Zhang, and D.J. Balding. 2002. Approximate Bayesian computation in population genetics. Genetics 162: 2025-2035.
Brosius, J. 1999. Genomes were forged by massive bombardments with retroelements and retrosequences. Genetica 107: 209-238.
Carroll, S.B. 2005. Endless Forms Most Beautiful. New York, N.Y.: W.W. Norton.
Chen, L., A.L. DeVries, and C.-H.C. Cheng. 1997. Evolution of antifreeze glycoprotein gene from a trypsinogen gene in Antarctic notothenioid fish. Proceedings of the National Academy of Sciences 94: 3811-3816.
Coyne, J.A., and H.A. Orr. 2004. Speciation. Sunderland, Mass.: Sinauer Associates.
Cracraft, J., and M.J. Donoghue (eds.). 2004. Assembling the Tree of Life. New York, N.Y.: Oxford University Press.
Cracraft, J., M.J. Donoghue, J. Dragoo, D. Hillis, and T. Yates (eds.). 2002. Assembling the Tree of Life: Harnessing life’s history to benefit science and society. Brochure produced for the National Science Foundation. Available at http://www.nsf.gov/bio/pubs/reports/atol.pdf/.
Deininger, P.L., and M.A. Batzer. 2002. Mammalian retroelements. Genome Research 12: 1455-1465.
Dobzhansky, T. 1964. Biology, molecular and organismic. American Zoologist 4(November): 49.
Felsenstein, J. 1988. Phylogenies from molecular sequences: Inference and reliability. Annual Review of Genetics 22: 521-565.
Felsenstein, J. 2004. Inferring Phylogenies. Sunderland, Mass.: Sinauer Associates.
Friedberg, I. 2006. Automated function prediction: The genomic challenge. Briefings in Bioinformatics 7(3): 225-242.
Gavrilets, S. 2003. Perspective: Models of speciation: What have we learned in 40 years? Evolution 57: 2197-2215.
Gregor, T., E.F. Wieschaus, A.P. McGregor, W. Bialek, and W. Tank. 2007. Stability and nuclear dynamics of the bicoid morphogen gradient. Cell 130: 141-152.
Hey, J., and R. Nielsen. 2007. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proceedings of the National Academy of Sciences 104: 2785-2790.
Huelsenbeck, J.P., F. Ronquist, R. Nielsen, and J.P. Bollback. 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294: 2310-2314.
Kingsolver, J.G., K.R. Massie, J.G. Shlichta, M.H. Smith, G.J. Ragland, and R. Gomulkiewicz. 2007. Relating environmental variation to selection on reaction norms: An experimental test. American Naturalist 169: 163-174.
Levine, M., and E.H. Davidson. 2005. Gene regulatory networks for development. Proceedings of the National Academy of Sciences 102: 4936-4942.
Lewontin, R.C. 1979. Fitness, survival, and optimality. Pages 3-21 in D.H. Horn, R. Mitchell, and G.R. Stairs, eds. Analysis of Ecological Systems. Columbus, Ohio: Ohio State University Press.
Lewontin, R.C. 2002. Directions in evolutionary biology. Annual Review of Genetics 36: 1-18.
Li, W.-H. 1997. Molecular Evolution. Sunderland, Mass.: Sinauer Associates.
Lindström, J., and H. Kokko. 2002. Cohort effects and population dynamics. Ecology Letters 5: 338-344.
Lynch, M., and J.S. Conery. 2003. The origins of genome complexity. Science 302: 1401-1404.
Lynch, M. 2007. The Origins of Genome Architecture. Sunderland, Mass.: Sinauer Associates.
Maynard-Smith, J. 1982. Evolution and the Theory of Games. Cambridge, England: Cambridge University Press.
Meagher, T.R., and D.J. Futuyma (eds.). 2001. Evolution, science, and society: Evolutionary biology and the national research agenda. American Naturalist 158 (Supplement): 1-46. Available at http://www.journals.uchicago.edu/ASN/meagher.html/, and at http://evonet.sdsc.edu/evoscisociety/.
Nielsen, R. (ed.). 2005. Statistical Methods in Molecular Evolution. New York, N.Y.: Springer Verlag.
Nowak, M.A. 2006. Evolutionary Dynamics: Exploring the Equations of Life. Cambridge, Mass.: Harvard University Press.
NRC (National Research Council). 1995. Effects of Past Global Change on Life. Washington, D.C.: National Academy Press.
NRC. 2007. The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. Washington, D.C.: The National Academies Press.
NSF (National Science Foundation). 1998. Frontiers in Population Biology. Workshop report from the Population Biology Task Force. Available at http://www.nsf.gov/publications/pub_summ.jsp?ods_key=biorpt1098.
NSF. 2005a. Frontiers in Evolutionary Biology. Workshop report, Document number biorpt080106. Available at http://www.nsf.gov/publications/ods/results.cfm?url_type=Reports&url_subtype=Biology&browse_type=org_type.
NSF. 2005b. Assembling the Tree of Life. Multiple workshop reports available at http://www.nsf.gov/publications/ods/results.cfm?url_type=Reports&url_subtype=Biology&browse_type=org_type.
Ricklefs, R.E., and D. Schluter (eds.). 1993. Species Diversity in Ecological Communities. Chicago, Ill.: University of Chicago Press.
Riesenfeld, C.S., P.D. Schloss, and J. Handelsman. 2004. Metagenomics: Genomic analysis of microbial communities. Annual Review of Genetics 38: 525-552.
Rusch, D.B., A.L. Halpern, G. Sutton, K.B. Heidelberg, S. Williamson, et al. 2007. The Sorcerer II Global Ocean Sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5(3): e77. doi:10.1371/journal.pbio.0050077.
Schlosser, G., and G.P. Wagner (eds.). 2004. Modularity in Development and Evolution. Chicago, Ill.: University of Chicago Press.
Shapiro, M.D., M.E. Marks, C.L. Peichel, B.K. Blackman, K.S. Nereng, B. Jónsson, D. Schluter, and D.M. Kingsley. 2004. Genetic and developmental basis of evolutionary pelvic reduction in threespine sticklebacks. Nature 428: 717-723.
Stern, D.L. 1998. A role of Ultrabithorax in morphological differences between Drosophila species. Nature 396: 463-466.
Tyson, G.W., J. Chapman, P. Hugenholtz, E.E. Allen, R.J. Ram, P.M. Richardson, V.V. Solovyev, E.M. Rudin, D.S. Rokhsar, and J.F. Banfield. 2004. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37-43.
van Dassow, G., E. Meir, E.M. Munro, and G.M. Odell. 2000. The segment polarity network is a robust developmental module. Nature 406: 188-192.
Vincent, T.L., and J.S. Brown. 2005. Evolutionary Game Theory, Natural Selection, and Darwinian Dynamics. Cambridge, England: Cambridge University Press.
Whitlock, M.C., and R. Gomulkiewicz. 2005. Probability of fixation in a heterogeneous environment. Genetics 171: 1407-1417.
Wilmers, C.C., E. Post, and A. Hastings. 2007. A perfect storm: The combined effects on population fluctuations of autocorrelated environmental noise, age structure, and density dependence. American Naturalist 169: 673-683.
Yang, Z. 2006. Computational Molecular Evolution. Oxford, England: Oxford University Press.