National Academies Press: OpenBook

The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering (2008)

Chapter: 4 The Potential Impact of HECC in Evolutionary Biology

Suggested Citation:"4 The Potential Impact of HECC in Evolutionary Biology." National Research Council. 2008. The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. Washington, DC: The National Academies Press. doi: 10.17226/12451.


4 The Potential Impact of HECC in Evolutionary Biology

INTRODUCTION

The dictum of Theodosius Dobzhansky (1964), “nothing makes sense in biology except in the light of evolution,” has never been truer than it is today. With the rise of such fields as comparative genomics and bioinformatics, evolutionary developmental biology, and the expanded effort to build the tree of life, the discipline of biology has become increasingly dependent on the inferences, methods, and tools of evolutionary biology. These contributions from evolutionary biology have become standard in solving problems in comparative biology, the biomedical and applied sciences, agriculture and resource management, and biosecurity. Thus an understanding of evolutionary change in individuals and populations provides the foundation for advances in crop improvement and vaccines, for improved understanding of epidemiology and antibiotic resistance, and for managing threatened and endangered species, to name only a few (Meagher and Futuyma, 2001). Likewise, at the level of species and multispecies lineages, our new understanding of the tree of life is providing a comparative framework for interpreting the similarities and differences among organisms.

Using newly generated knowledge about the phylogenetic relationships of life on Earth, comparative biologists have been able to do interesting and useful things:

• Identify wild relatives of domesticated plants and animals, leading to enhanced food security.
• Create tools for the discovery of countless new life forms, many of which are economically important.
• Establish a framework for comparative genomics and developmental biology, which speeds up the identification of emerging diseases (such as avian influenza and West Nile virus) and helps to locate their places of origin.
• Advance such disparate fields as resource management and forensics (Cracraft et al., 2002).
The committee was charged with reviewing the important scientific questions and technological problems in evolutionary biology by drawing on survey documents. Unlike the other fields covered in this report (astrophysics, the atmospheric sciences, and chemical separations), evolutionary biology has no definitive reports setting forth its scientific breadth and describing future challenges. Thus in developing its evaluation of the main scientific challenges facing evolutionary biology, the committee drew on workshop reports, primarily those prepared for the National Science Foundation (NSF, 1998, 2005a, 2005b, 2006), and on a document produced by eight scientific societies (Meagher and Futuyma, 2001), as well as on discussions among committee members and invited experts at a small workshop in December 2006, the agenda of which is included in Appendix B.

This chapter identifies the main challenges of evolutionary biology and evaluates the extent to which computational methods are impacting each of them. It describes the primary mathematical models that are currently available or being developed. On this basis, the committee then assesses the potential impact of HECC on the major challenges of evolutionary biology.

MAJOR CHALLENGES OF EVOLUTIONARY BIOLOGY

Major Challenge 1: Understanding the History of Life

The most fundamental question posed by Major Challenge 1 is this: How did life arise? Despite the large body of scientific literature, this question remains unanswered. Addressing it requires knowledge spanning the physical and biological sciences: chemistry, the Earth sciences, astrophysics, and cellular and molecular biology. A key piece of information would be knowing whether life is indigenous to Earth or exists elsewhere in our solar system. More generally, people in the field ask how the assembly of simple organic compounds led to complex macromolecules and then to self-replicating entities, and what role Earth-bound processes played.

Another unknown is how many species there are on Earth. Systematic biologists have discovered and described about 1.7 million living species.
How many more exist in Earth’s ecosystems has not been answered satisfactorily, even to within an order of magnitude. Without a better quantification of life’s diversity we have only a very incomplete understanding of the distribution of diversity and thus cannot characterize with precision ecosystem structure and function, extinction rates, and the amount of molecular and functional biodiversity. Lack of knowledge about Earth’s biodiversity also precludes our potential use of those species and their products.

Ultimately, Major Challenge 1 calls for us to develop an understanding of the tree of life and then to use it. With advances in methods of phylogenetic reconstruction and increasing amounts of new comparative data from DNA sequences, the last decade has seen an unprecedented increase in our knowledge about the phylogenetic relationships of organisms, which collectively constitute the tree of life. Since the beginning of the 1990s, the number of species represented in the gene sequence database GenBank has grown to more than 155,000. If this correlates roughly with the number of species that can be placed on phylogenetic trees using molecular data alone, then the combined number of extinct and living species currently included on phylogenetic trees may approximate 200,000 species. Assuming an increase in the number of researchers and technological advances, it seems safe to expect that between 750,000 and 1 million of Earth’s estimated 10 million to 100 million species will be placed on trees within the next decade.

Data sets from individual studies are becoming larger and larger, and many contain information on thousands of species and thousands of molecular and/or morphological characters (qualities or attributes). This situation creates well-known computational challenges when one searches existing phylogenetic trees or works to resolve relationships so as to add or clarify particular branches (Felsenstein, 2004). In addition, as more species are added, more comparative character data must be added in order to resolve relationships with confidence.

Resolving the tree of life has become a priority because the hierarchical pattern of relationships is a powerful predictive tool for comparative biology, with many applications in the fundamental and applied sciences and in industry (Bader et al., 2001; Cracraft et al., 2002; Cracraft and Donoghue, 2004). Beyond knowing how specific species are related to one another, phylogenetic methods themselves are now routinely incorporated into many fields of evolutionary biology, molecular and developmental biology, the health sciences (comparing viral sequences as well as describing and predicting molecular change), species discovery/description, natural resource management, and biosecurity (identification of invasive species, pathogens). In addition, they are used in the identification of microorganisms, in the development of vaccines, antibacterials, and herbicides, and by the pharmaceutical industry in the prediction of drug targets.

Computational Challenges

Standard phylogenetic analysis comparing the possible evolutionary relationships between two species can be done using the method of maximum parsimony, which assumes that the simplest answer is the best one, or using a model-based approach. The former entails counting character change on alternative phylogenetic trees in order to find the tree that minimizes the number of character transformations. The latter incorporates specific models of character change and uses an optimality criterion to choose among the sampled trees, which often involves finding the tree with the highest likelihood. Counting, or optimizing, change on a single tree, whether in a parsimony or model-based framework, is computationally efficient.
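As an illustration of why counting change on one fixed tree is cheap, the parsimony count for a single character can be obtained in one pass from the tips to the root. The sketch below uses Fitch's algorithm for unordered characters, which the report does not name; the function names, the nested-tuple tree encoding, and the four-taxon alignment are all assumptions introduced for this example, not material from the report.

```python
# Illustrative sketch: Fitch's algorithm for counting the minimum number of
# state changes on a fixed rooted binary tree (unordered characters).
# The tree encoding and the toy alignment are assumptions made for this example.

def fitch_changes(tree, states):
    """Minimum number of changes for one character.

    tree   -- a leaf name (str) or a 2-tuple of subtrees
    states -- dict mapping leaf name to its observed state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its state set is the observation
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                         # children agree: no change needed here
            return shared
        changes += 1                       # children disagree: one change is implied
        return left | right

    visit(tree)
    return changes


def parsimony_length(tree, alignment):
    """Sum the per-site change counts over all aligned characters."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical four-taxon alignment with three sites.
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CGT"}
tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))

for label, tree in [("((A,B),(C,D))", tree_ab_cd), ("((A,C),(B,D))", tree_ac_bd)]:
    print(label, parsimony_length(tree, alignment))
```

On this toy data the first tree requires 3 changes and the second 5, so parsimony would prefer the first. Each tree is scored in time proportional to the number of taxa times the number of sites; the expense in practice comes from the number of candidate trees, not from scoring any one of them.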
But the cost of sampling all possible trees to find the optimal solution grows precipitously with the number of taxa (or sequences) being analyzed (Felsenstein, 2004) (Figure 4-1). Thus, it has long been appreciated that finding an exact solution to a phylogenetic problem of even moderate size is computationally intractable, the underlying problem being NP-complete (see, for example, Bader, 2004). Accordingly, numerous algorithms have been introduced to search heuristically across tree space and are widely employed by biological investigators using platforms that range from desktop workstations to supercomputers. These algorithms include methods for fusing and splitting taxa, swapping among branches, and moving through tree space stochastically to avoid becoming stranded at locally optimal but globally suboptimal solutions (Felsenstein, 2004; Yang, 2006). Such methods include simulated annealing, genetic (evolutionary) algorithmic searches, and Bayesian Markov chain Monte Carlo (MCMC) approaches (see, for example, Huelsenbeck et al., 2001), among others.

The accumulation of vast amounts of DNA sequence data and our expanding understanding of molecular evolution have led to the development of increasingly complex models of molecular evolutionary change. As a consequence, the enlarged parameter space required by these molecular models has increased the computational challenges confronting phylogeneticists, particularly in the case of data sets that combine numerous genes, each with its own molecular dynamics.

The growth of phylogenetic research and its empirical database presents computational challenges beyond those of pure tree building. Phylogeneticists are more and more concerned about having statistically sound measures of branch support. In model-based approaches, in particular, such procedures are computationally intensive, and their cost grows significantly with the number of taxa and the heterogeneity of the data.
In addition, more attention is being paid to statistical models of molecular evolution (see, for example, Nielsen, 2005; Yang, 2006), which are the backbone of reconstructing ancestral sequences across a tree. This type of analysis is a prime objective for many evolutionary biologists and has numerous applications, including reconstructing virulence in viruses, predicting the probabilities of genetic change, and designing vaccines.

FIGURE 4-1 The number of rooted, bifurcating, labeled trees for n species (columns: Species, Number of Trees), for various values of n. The numbers for more than 20 species are approximate. SOURCE: Felsenstein (2004).

Theoretical molecular evolutionists and phylogeneticists have long simulated data sets and trees in order to understand and compare phylogenetic methods and their statistical properties, for example, and to compare models of sequence change between simulated phylogenies and those found in the real world. The scale and efficacy of these studies are inherently limited by computational capability as investigators seek to make their simulations more sophisticated and realistic.

Finally, as phylogenetic studies increase in scope, visualization of the results becomes more computationally complex (NSF, 2005b). The problem of visualization has received less attention than other aspects of phylogenetics, but because the field is growing so rapidly, visualization will need to be addressed. The computational challenge associated with visualization calls not only for more computational capability but also for the development of visualization software for phylogenetics.
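The growth in tree space summarized in Figure 4-1 can be reproduced from a standard closed form: the number of rooted, bifurcating, labeled trees for n species is the double factorial (2n - 3)!! = 3 × 5 × ... × (2n - 3). The short sketch below evaluates it; the function name is an assumption made for this example, and the values it prints are computed here, not copied from the report's figure.

```python
# Illustrative sketch: count rooted, bifurcating, labeled trees for n species
# using the double factorial (2n - 3)!!. The function name is hypothetical.

def num_rooted_trees(n_species):
    """Return (2n - 3)!! for n >= 2; one species has a single trivial tree."""
    if n_species < 1:
        raise ValueError("n_species must be at least 1")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n - 3 (an empty product for n <= 2).
    for factor in range(3, 2 * n_species - 2, 2):
        count *= factor
    return count


for n in (2, 3, 4, 5, 10, 20, 50):
    print(n, num_rooted_trees(n))
```

Because Python integers are arbitrary precision, the counts are exact even for large n. The value for 10 species already exceeds 34 million, which illustrates why exhaustive enumeration gives way to heuristic search well before data sets reach the thousands of species discussed in this chapter.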

Major Challenge 2: Understanding How Species Originate

The predominant view of how species originate is that speciation takes place through geographic isolation of populations, followed by differentiation, a process known as allopatric speciation. However, an increasing number of biologists are proposing that many species have arisen under local, nonallopatric conditions owing to rapid shifts in host preference, a process known as sympatric speciation, in which the diverging populations are not separated geographically. A third mode of speciation, parapatric, postulates taxonomic differentiation along steep environmental gradients in which divergence can occur under intense natural selection even in the presence of some exchange of genes between species.

A fundamental unknown is associated with Major Challenge 2, namely: What is the relative importance of these alternative modes of speciation across different taxa? The frequency of different modes of speciation across taxonomic groups is critical for proposing and testing general explanations about the genetic, developmental, and demographic conditions leading to speciation, as well as for understanding patterns of species diversity. Even though different modes may have multiple underlying mechanisms, knowledge about mode frequency will form the cornerstone of any general mechanistic theory of speciation. Although the factors leading to allopatry (vicariance and dispersal) are well-known, the conditions under which sympatric and parapatric speciation can occur have received moderate theoretical treatment but not enough experimental and empirical study.

Of course, Major Challenge 2 cannot be clearly defined, let alone addressed, unless we understand the nature of species and their boundaries. Two associated questions are these: What changes take place in the genetic and developmental architectures of isolated populations?
What mechanisms underlie those changes?

Few questions in evolutionary biology have generated as much debate as how species are to be defined. At a certain fundamental level, these debates are about the nature of species boundaries. One type of boundary involves reproductive relationships within and among populations. Traditionally associated with the notion of biological species, this boundary delimits populations that have evolved genotypic or phenotypic characteristics that make them reproductively incompatible with (isolated from) one another. A second form of boundary, generally associated with phylogenetic or cladistic species, circumscribes populations that are diagnosably distinct from one another, whether or not these character differences result in reproductive isolation.

Leaving aside the issue of which type of boundary should adjudicate the species debate, both types raise fundamental problems of interest to evolutionary biologists. In many taxonomic groups there are multispecies lineages in which viable offspring are produced by hybridization events in the wild, even between species that are distantly related. Although these hybridization events may be uncommon or rare due to pre-mating isolating barriers, they imply that distinct taxa are able to maintain cohesion of their developmental and genetic architectures in the face of some gene flow. In such lineages, reproductive incompatibilities are therefore extremely weak or nonexistent, and this situation can be maintained for millions of years. In other groups, by contrast, closely related and very similar species that were isolated only recently and have since become sympatric can have strong reproductive incompatibilities. Thus we are faced with the question of how to explain the continuum of responses to isolation and differentiation in genetic and developmental terms.
Addressing Major Challenge 2 will also necessitate fundamental advances in understanding how changes in genetic architecture translate into changes in the development of the organism and of its phenotypic (observable) characteristics as an adult. It is likely that many types of changes in these complex genetic-developmental pathways could lead to reproductive incompatibilities in behavior, physiology, or ecological preferences, but we currently do not know to what extent predictive regularities exist. Old debates about the relative importance and frequency of “micro” versus “macro” phenotypic effects of mutational change are, in many respects, still with us because simple genetic changes can potentially be amplified into significant phenotypic differences through complex developmental networks. Evolutionary biologists have long believed that reproductive incompatibilities are more significant in genetic terms than are even striking phenotypic changes that do not result in those incompatibilities. This inference, however, is based more on looking at the world through a particular theoretical framework than on generalizable knowledge about the genetics of differentiation.

Footnote: Vicariance describes a situation in which a widespread population is subdivided into two or more allopatric populations by a newly formed physical barrier such as a river, mountain range, or change in habitat due to environmental change.

The fact that it is possible to maintain the integrity of species even in the face of hybridization indicates those species’ genomes are porous but at the same time resistant to gene flow that might break down their boundaries. These considerations raise questions not only about the nature of species and their boundaries but also about the morphogenetic mechanisms responsible for maintaining phenotypic expression. Very little is known about this, and answers will require an approach calling on the expertise of numerous disciplines.

A third key component of Major Challenge 2 is learning about the history of populations and species, particularly the human species. As a consequence of the vast increase in genomic information, these histories can be probed in increasing detail. But populations and species are complex entities; every individual of every species has thousands of genes in its genome. Combining genomic data therefore becomes an issue: ribosomal DNA sequences from organelles like chloroplasts and mitochondria may yield different histories from those of the nuclear genome.
These genes have variable mutation rates and can provide varying amounts of information for demographic history. Whereas traditional phylogenetic analysis often treats each species as a single monomorphic entity that cannot be further decomposed, a genomic perspective of biodiversity acknowledges that each species is in fact a complex entity whose cohesion and trajectory through time could be constrained by many processes, including reproductive isolation from other species. The theoretical framework provided by coalescent theory (a body of population genetics theory that models the genealogical signatures of genetic lineages as they are passed down through generations) has further increased the resolving power of DNA sequence data.

Computational Challenges

Much current research on the above questions and problems can and does proceed without extensive computational requirements. The relative frequency of modes of speciation, for example, can often be determined by standard phylogenetic and biogeographic studies. And there have been numerous computational studies in population biology and genetics that attempt to model the conditions under which sympatric speciation is likely to operate (reviewed in Gavrilets [2003] and in Coyne and Orr [2004]).

As evolutionary biology in general and speciation research in particular become more mature and as new genomic information becomes available, there has been a shift to the use of mathematical models and methods. Currently, there is growing interest in developing mathematical models for particular cases in order to test well-defined hypotheses associated with speciation. For example, was speciation of cichlids in a Nicaraguan crater lake sympatric, or was it a result of double invasion? Did modern humans hybridize with Neanderthals during their colonization of Europe?
While many of the current studies do not require high-end computing, future advances in population evolution and speciation will be stymied unless the methods can scale, which will require both new algorithms and greater computational capability.

Coalescent models now encompass a wide variety of demographic and genetic phenomena, including population bottlenecks, migration, change in population size over time, natural selection, gene flow between populations, gene conversion and recombination between alleles in a population, and complex mutation patterns. DNA sequence data are now routinely analyzed in terms of genealogical (phylogenetic) trees at various levels in the hierarchy, from individuals within a population to closely and distantly related species. These genealogical patterns will show signatures of various demographic processes, including migration, population size changes, and reproductive isolation events through time. However, the highly stochastic nature of the coalescent process means that the realization of any particular genealogical pattern in nature could be consistent with many different scenarios. As a result, genealogical signatures from many different genetic loci are required to accurately estimate demographic histories. Linking many different genealogical signals with models of various population demographics is computationally demanding, and advances to date have been made via a series of strong but useful approximations. Relaxing these assumptions and exploring the full range of demographic histories is a major computational challenge for the future (Beaumont, 2002, 2004).

Going hand in hand with these developments, however, is a pressing need for gathering more empirical data, which can provide the context for building more realistic models. These types of data would include the kinds and amount of genetic variation and selective pressures that might bear on the evolution of pre- and postzygotic reproductive isolation.

It seems clear that studies along the interface of population biology and efforts to unravel the origin of species will make significant use of computational resources. Increased understanding of how populations become spatially structured genetically will rely on large populational sampling and detailed descriptions of populational history.
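The stochastic nature of the coalescent process noted above can be seen in even the simplest setting. The sketch below simulates the time to the most recent common ancestor under the basic (Kingman) coalescent for a single constant-size population with no migration, selection, or recombination. That model is far simpler than the ones the report has in mind, and the function names, sample size, and replicate count are assumptions chosen for illustration. Time is measured in coalescent units, so that k lineages coalesce at rate k(k - 1)/2.

```python
# Illustrative sketch: time to the most recent common ancestor (TMRCA) under
# the basic Kingman coalescent, constant population size, in coalescent units.
# All names and parameter values here are assumptions made for the example.
import random
import statistics


def simulate_tmrca(n_samples, rng):
    """Draw one TMRCA for n_samples lineages."""
    total_time = 0.0
    for k in range(n_samples, 1, -1):
        pair_rate = k * (k - 1) / 2           # rate at which some pair coalesces
        total_time += rng.expovariate(pair_rate)
    return total_time


def expected_tmrca(n_samples):
    """Analytical mean: the sum of 2 / (k (k - 1)) for k = 2..n, i.e. 2 (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n_samples)


rng = random.Random(2008)                      # fixed seed for repeatability
draws = [simulate_tmrca(10, rng) for _ in range(5000)]
print("simulated mean:", statistics.mean(draws))
print("analytical mean:", expected_tmrca(10))
print("spread (min, max):", min(draws), max(draws))
```

Individual replicates vary widely around the analytical mean, which is one way to see why a single locus constrains demographic history only weakly and why many independent loci are needed.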
Moreover, the integration of genetic and demographic information through complex models and simulations of populational histories will present profound computational challenges.

Major Challenge 3: Understanding Diversification of Life Across Space and Time

At a general level, it is well known that processes in the geosphere and biosphere have been tightly linked since the origin of life (NRC, 1995), but we have only partial understanding of the linkages across different spatial and temporal scales. At large scales, movements of continents and terrains, tectonic effects within continents, and long-term climate changes have had a profound influence on the distributions of organisms and the ecological associations they comprise, and such phenomena may be first-order drivers of biotic evolution. At smaller geographic and temporal scales, geological processes can be implicated in controlling the rates of speciation and extinction. At still smaller scales, geospheric processes influence local environmental change, which is one cause of microevolutionary change within and among populations. At none of these scales do we have enough theory on which to build strong models or the ability to simulate the coupled processes.

Beyond understanding the coupling, we would like to understand the intrinsic and extrinsic controls on the rate of speciation. Over long periods of time, diversity has increased, decreased, or remained relatively stable, yet at a mechanistic level the causes of these patterns are poorly known. Diversification is generally modeled as a birth/death process, with change in diversity over time being a function of the rate of speciation and extinction. Many factors have been implicated in the rate controls of each, but current models describe simple diversity-dependent processes that rely largely on biotic interactions as proximate causes of increases or decreases in speciation and extinction.
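The birth/death description of diversification can be sketched as a small stochastic simulation. The example below assumes constant per-lineage speciation and extinction rates, which is the simplest textbook case and does not include the diversity dependence mentioned above; the function names and rate values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: constant-rate birth/death model of lineage diversity,
# simulated event by event. Rates and names are assumptions for the example.
import math
import random


def simulate_birth_death(speciation, extinction, t_max, rng, n_start=1):
    """Return the number of lineages alive at time t_max."""
    n_lineages, t = n_start, 0.0
    while n_lineages > 0:
        # Waiting time to the next event anywhere in the clade.
        t += rng.expovariate(n_lineages * (speciation + extinction))
        if t > t_max:
            break
        # Choose the event type in proportion to its rate.
        if rng.random() < speciation / (speciation + extinction):
            n_lineages += 1
        else:
            n_lineages -= 1
    return n_lineages


def expected_diversity(speciation, extinction, t_max, n_start=1):
    """Mean diversity under constant rates: n_start * exp((b - d) * t)."""
    return n_start * math.exp((speciation - extinction) * t_max)


rng = random.Random(4)
runs = [simulate_birth_death(1.0, 0.5, 3.0, rng) for _ in range(5000)]
print("simulated mean diversity:", sum(runs) / len(runs))
print("expected mean diversity:", expected_diversity(1.0, 0.5, 3.0))
print("fraction of clades extinct:", runs.count(0) / len(runs))
```

Making the two rates functions of time or of an external covariate, rather than constants, is one way a model of this kind could begin to represent the abiotic drivers that the chapter says are missing; doing so is left outside this sketch.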
Omitted are abiotic factors such as tectonically mediated changes in mountain building, large-scale alterations of river systems, or climate change, all of which are widely recognized as potential drivers of rate controls. There is a pressing need for more realistic models of diversification that can be applicable across different spatiotemporal scales. These might not only parameterize traditional biotic rate controls but also take into account causal linkages between Earth history and speciation/extinction rates. This also provides a foundation for understanding how communities and ecosystems are assembled across space and time. Understanding the evolutionary assembly of communities and ecosystems is a fundamental problem that cuts across multiple disciplines, including systematics and historical biogeography, community and landscape ecology, and paleontology. It has application to conservation, resource management, and understanding the consequences of global change. The mechanisms governing the assembly and maintenance of species associations (communities, ecosystems) at different spatial scales have received considerable attention, especially in ecological science (see, for example, Ricklefs and Schluter, 1993), but many aspects of the evolutionary dynamics of these assemblages have been less studied. Increasingly, history is recognized as playing an important role in shaping taxonomic assembly at a wide range of scales. The coevolutionary history of different groups, or clades, of organisms within biological communities can be analyzed using methods of historical biogeography, but there is considerable disagreement and little consensus on whether any of the current methods are sufficiently sophisticated to reconstruct the spatial history of moderate to large clades, let alone multiple clades simultaneously. Developing a more sophisticated understanding of community assembly over multiple timescales will necessitate the development of new models and algorithms to integrate multiple species histories and ecologies. A related part of Major Challenge 3 is how to determine the evolutionary history of microorganismal community structure and function. This is a somewhat different challenge, because new methods in comparative genomics are giving us a better understanding of microbial community organization.
Termed “environmental genomics” or metagenomics, these new tools use advances in high-throughput sequencing to sample the genomes in environmental samples (Riesenfeld et al., 2004; NRC, 2007). Conventionally, microbial DNA is isolated from a sample, cloned, and then used to create metagenomic libraries, but newer technologies can access the environmental sample directly (NRC, 2007). The resulting sequences have many uses, including phylogenetic studies, measures of taxonomic and genomic diversity, the discovery of new genes, functional analysis of specific genes, and the modeling of large biochemical pathways, including community metabolism (Tyson et al., 2004; Rusch et al., 2007). Although current metagenomics research is primarily directed toward genome characterization and the structure and function of microbial communities, it is clear that the massive amounts of new sequence data being collected will substantially expand our knowledge of global diversity, the tree of life, and genome evolution. Additionally, genomic comparisons interpreted in the context of phylogenetic relationships can be expected to reveal new insights into genome structure and function.

Computational Challenges

Many of the causal linkages between the geosphere and biosphere, such as how tectonically driven change might have influenced speciation and extinction rates, have been inferred from correlations generated by empirical studies. There is a need for a better theoretical-mathematical foundation that can lead to predictive quantitative assessment. Moreover, these causal models linking Earth history with biotic evolution should be operable at different spatiotemporal scales. Some simulations have been run that couple climate models with environmental models and data about ecosystems, and modeling of species distribution is becoming increasingly common.
Some simulations have been performed to reconstruct how taxonomic elements of communities and ecosystems have assembled over time by integrating the phylogenetic and spatial histories of many groups of organisms simultaneously. To date, however, few if any of these studies have taken advantage of high-capacity computing.
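The birth/death view of diversification described under Major Challenge 3 can be illustrated with a small stochastic simulation. The sketch below is only a toy under stated assumptions: the linear decline of the per-lineage speciation rate with standing diversity, the constant extinction rate, and the names `simulate_diversity`, `lambda0`, `mu`, and `K` are all illustrative choices, not a model taken from this report. It deliberately omits the abiotic drivers (tectonics, river systems, climate) that the text argues more realistic models should include.

```python
import random


def simulate_diversity(lambda0, mu, K, n0, t_max, rng):
    """Gillespie-style simulation of a diversity-dependent birth/death process.

    Assumed (illustrative) rate model:
      per-lineage speciation rate  lambda(n) = max(0, lambda0 * (1 - n / K))
      per-lineage extinction rate  mu (constant)
    Returns a list of (time, number_of_lineages) pairs.
    """
    t, n = 0.0, n0
    history = [(t, n)]
    while n > 0:
        lam = max(0.0, lambda0 * (1.0 - n / K))
        total_rate = n * (lam + mu)
        if total_rate <= 0.0:
            break  # no further events are possible
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > t_max:
            break
        # Choose speciation or extinction in proportion to their rates.
        if rng.random() < lam / (lam + mu):
            n += 1
        else:
            n -= 1
        history.append((t, n))
    return history


if __name__ == "__main__":
    rng = random.Random(42)
    run = simulate_diversity(lambda0=1.0, mu=0.1, K=50, n0=2, t_max=100.0, rng=rng)
    print("events:", len(run) - 1, "final diversity:", run[-1][1])
```

A single run is one realization of the process; many independent replicates across a grid of rate parameters would be needed to characterize its behavior, which is an embarrassingly parallel workload.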

Major Challenge 4: Understanding the Origin and Evolution of the Phenotype

The phenotype describes the observable characteristics of an individual. Although an individual may have the genetic capability to express a trait, only the traits that manifest are considered phenotypic characters. Phenotypic characters can be broadly construed across different levels of organization, from genomic and developmental characteristics to external features, physiology, and behavior. The genetic architecture of individuals or populations includes all interactions and functional linkages among genes that have influence on the expression of traits (NSF, 1998). We have broad knowledge of the nature of genetic variation within and between populations, and large-scale genomic sequencing across many individuals promises to revolutionize that knowledge and allow much more sophisticated questions to be asked. Knowledge about the origin and evolution of phenotypes is built on an understanding of genetic variation in all its components, from nucleotide polymorphisms and their frequency in populations to the interactions among genomic loci. A key question is, How does that variation in coding or regulatory genes relate to changes in phenotype? Recently, significant progress in understanding the evolution of phenotypes has come from integrating the fields of evolutionary and developmental biology. Both fields have long histories, but starting several decades ago they diverged. Evolutionary biologists focused increasingly on understanding evolution at the population level and developed sophisticated genetic models to understand changes in allele frequencies, while developmental biologists focused on experimental manipulations to uncover the mechanisms of development.
More recently, however, developmental biologists have taken their analysis to very deep molecular and genetic levels, and this has led to a renewed interest in understanding the interplay between evolution and development (called “evo-devo”), as both fields now have a common language of genetics and genomics. Evo-devo has recently had explosive growth, has become an exciting area of investigation, and has attracted much popular attention through the work of, among others, S.B. Carroll (2005). These studies of developmental evolution have spanned all levels, from microevolution within populations to macroevolution among the major clades of life. One of the remarkable outcomes of these initial studies is the discovery that individual genes and genetic pathways can have important evolutionary effects on development and morphology. For example, patterning along the anterior-posterior axis of animals is controlled by a set of genes known as the homeotic (Hox) genes. While initially characterized through the study of highly deleterious mutant alleles in model species such as Drosophila melanogaster (fruit fly), subsequent studies have shown that evolutionary changes in the Hox genes and changes in how those genes are expressed play a clear role in animal evolution across the entire micro- to macroevolutionary spectrum. For example, the Hox gene Ultrabithorax (Ubx) has helped us to understand microevolutionary changes in the pattern of fly bristles as well as the macroevolutionary changes seen in the appendages of crustaceans (Stern, 1998; Averof and Patel, 1997). Similarly, Hox genes have also been implicated in large-scale changes in the vertebrate skeleton during evolution. The analysis of these genetic networks presents us with the opportunity to understand evolution in increasingly sophisticated ways and has allowed us to generate better models of how evolutionary change occurs.
For quite some time, evolutionary models suggested that the phenotypic differences between even very closely related species were due to variation at a large number of genomic loci, with any given individual mutation having only a minute effect. Recent experiments, however, suggest that this is not always the case. Increasingly we see that phenotypic variation can often be attributed to one or just a few genes of large effect. For example, recent studies in sticklebacks show that variation at a single gene, Pitx1, has a very large effect on the pelvic spines found in these fish (Shapiro et al., 2004). At the molecular level, these changes can involve both protein coding and gene regulatory modifications, although current theories suggest that regulatory mutations in developmental genes may have a predominant role in underlying major evolutionary change in phenotype. The pace of such discoveries is ever increasing as new genomic and developmental tools are allowing us to decipher how developmental systems evolve. One fundamental goal of Major Challenge 4 is to understand integrated phenotypes and how they evolve. Phenotypic features (such as morphological form, physiology, behavior, even biochemical pathways) are often integrated into functional groups based on their interaction with the environment. Such linkages may be tight or loose. An additional challenge is to understand the boundaries and strength of these linkages, how the components of these groups arise, coevolve, and perhaps become unlinked functionally, potentially to be incorporated into other functional groups. Thus, considerable attention is now being paid to establishing the boundaries of integrated suites of morphological, behavioral, and physiological traits of organisms that function and interact collectively with the environment. The concept of modularity within evolutionary developmental biology is playing a role in understanding the structural and functional organization of integrated phenotypes and how they arise in development and change during evolution (Schlosser and Wagner, 2004). Integrated phenotypes are being studied from the perspectives of developmental biologists, comparative functional biologists, and population biologists conducting ecological and genetic experiments in the wild or laboratory.
These evolutionary alterations in developmentally important genes also lead to changes in the genetic architecture of development in such a way as to control the range of phenotypic variation that is possible in subsequent generations. In some situations this can constrain or limit future phenotypic evolution, while in other cases it can open up entirely new possibilities for subsequent evolutionary change and the appearance of totally novel morphologies and physiologies. An important future challenge is to integrate these developmental and evolutionary studies with ecological ones to understand how natural selection shapes the course of growth and development of phenotypes and the underlying genetic architecture.

Computational Challenges

A major computational challenge is to generate qualitative and quantitative models of development as a necessary prelude to applying sophisticated evolutionary models to understand how developmental processes evolve. Developmental biologists are just beginning to create the algorithms they need for such analyses, based on relatively simple reaction-rate equations, but progress is rapid, and this work will soon be able to take advantage of HECC resources. Another important breakthrough in the field is the analysis of gene regulatory networks (Levine and Davidson, 2005). These networks describe the pathways and interactions that guide development, and while their formulation is dependent on intense experimental data collection, once produced, the networks provide explicit models to test how perturbations affect all manner of developmental events. While they resemble metabolic pathways in overall structure, they are far more complex in their regulation and behavior. As these models grow to include more pathways and more organisms, they will increasingly benefit from greater computational capacity and will become vital to many evolutionary studies. Similarly, protein-interaction network analysis provides insight into an organism’s functional organization and evolutionary behavior (see, for example, http://www.hicomb.org/papers/HICOMB2007-03.pdf).

Major Challenge 5: Understanding the Evolutionary Dynamics of the Phenotype–Environment Interface

Evolutionary biology has long sought to understand environmental selective effects by focusing on a single trait or trait complex, such as bill shape, body size, or body shape. Yet, selective regimes in the environment act on entire phenotypes that are the result of highly complex and linked (integrated) developmental and metabolic pathways. Earth’s biota is a product of complex interactions between the abiotic and biotic realms (see Major Challenge 3). Life on Earth has evolved in the context of a dramatically and often rapidly changing environment, whose trajectory and long-term trends have themselves been modified by evolving life forms. Earth’s atmosphere and long-term climate trends have played an important role in the major transitions and increasing complexity of life, but the system is interactive, replete with complex positive and negative feedbacks between living and geological systems. The timing and causes of many of the major transitions in the origin of biotic complexities, such as the origin of oxygen-based metabolism, are still somewhat controversial, given the challenges of interpreting the chemical and morphological signals in the fossil record of the first 3 billion years of evolution. The evolution of integrated complexes can also be investigated at different hierarchical levels. One critical approach is to link variation in these integrated complexes to environmental differences within and among populations in order to understand outcomes of selection. At a higher level, changes in integrated complexes can be analyzed across species, particularly those that are closely related, thus describing how the components of these complexes change at times of speciation.
Such analyses are critical for providing insight into how the tightness of integration “constrains” change in phenotypic-functional complexes. Finally, we build on this knowledge to learn about the relationship between phenotypic change and adaptation. At a population level, variants in phenotype may have different consequences for survival or reproduction. Those variants that become fixed through natural selection (because of those consequences) are often referred to as adaptations. The nature of adaptations has been studied intensely from the viewpoint of population biology and ecology. Less attention has been paid to the molecular basis of population variation underlying phenotypic change and the linkages that might exist between genome evolution and phenotypic evolution. To what extent, for example, is convergence in phenotype related to convergence in the genetic and developmental pathways that produce those phenotypes? And, to what degree is similarity in presumptive adaptations constrained by those pathways, or is there flexibility in morphogenetic systems such that different ones can produce very similar phenotypic expressions? Answers to many of these questions and others in this field will require more empirical information about the amount and kind of genetic variation that underlies phenotypic variation and its response to selection. Such data are crucial for building more sophisticated and realistic models and simulations of the evolutionary process. A second fundamental goal underlying Major Challenge 5 is to understand better the links between ecological and evolutionary processes. The conservation of biodiversity depends critically on our ability to predict the responses of populations to changes in their environment that occur on short- and medium-term timescales.
It is likely that models can be developed that are capable of predicting short- and medium-term population fluctuations in response to environmental change in greater detail than we can over long-term evolutionary time. Such models could, for instance, address biologists’ concerns about the effects of climate change in the recent past and in the short-term future. Both the increased detail of the environmental record and the increased sophistication of demographic models should enable biologists to understand the effect of environmental change on many of the key life-history components of population fluctuations, such as juvenile and adult survival, fecundity, and population age-structure.

From their humble beginnings as simple logistic growth curves, ecological models now routinely grapple with the effects of stochastic environmental change on demographic and population trends. But populations do not expand indefinitely in favorable environments; they inevitably hit the carrying capacity of the ecosystem and undergo declines due to increased competition. Many such density-dependent population crashes have been documented in real populations that have been the subject of long-term censuses. Thus, most models now acknowledge the ecological feedbacks intrinsic to the dynamics of populations, in particular the phenomenon of density dependence. Other demographic complexities are also being incorporated into ecological models, such as age-specific responses to the environment, variable levels of year-to-year autocorrelation in environmental variables, and cohort responses, in which the environmental conditions specific to the birth year of juveniles affect the shapes of their survival and fecundity curves (Lindstrom and Kokko, 2002). All of these variables impose additional complexities for predicting population responses to environmental change. Thus realistic models of population and demographic history must take into account stochastic changes in the environment, as well as the impacts of a population’s growth on the environment and associated feedbacks. For example, a recent simulation study (Wilmers et al., 2007) found that drastic fluctuations in populations, and hence increased chances of extinction, were more likely to be found in environments that were positively correlated from year to year.

Computational Challenges

Increasingly, evolutionary biologists are incorporating realistic models of population and demographic changes into their modeling of population genetic processes (Whitlock and Gomulkiewicz, 2005).
In many respects, ecological and evolutionary modeling have proceeded independently of one another, primarily because of the mathematical difficulties of linking the two holistically. One difficulty is that ecological and evolutionary processes are perceived to operate on vastly different timescales. In addition, the dynamical complexity of population responses to environmental change observed in many ecological models typically exceeds the complexity of the types of demographic changes modeled in population genetics. For example, relatively few population-genetic models cover situations in which populations are not at equilibrium. Even complex migration models assume long-term stability of the migration matrix and population sizes over time. Although models that attempt to estimate population-size changes and rates of growth and decline from molecular data do address a type of nonequilibrium situation, they are limited to estimating, for example, a constant rate of population growth. Typically, parameters of interest are estimated using computationally burdensome methods such as maximum likelihood, in which the optimal parameter values are searched for heuristically to optimize the likelihood. Alternatively, a faster approach involves sampling multiple parameter values stochastically, evaluating the likelihood of each, and accepting these in proportion to their likelihood or their posterior probability. This is the essence of Bayesian approaches. Evaluating the likelihood of a set of parameter values is not in itself time consuming, but finding the optimal set of values is. Recent approaches to inferring past population dynamics from multiple genetic loci now use simulation of gene genealogies within a specified population model and compare simulated patterns of DNA sequence diversity to the patterns observed in the data under study. Such simulations typically utilize a relatively simple population model.
The fit of these simulated DNA sequences to the data actually collected can be assessed using summary statistics, and the efficiency of simulations can be increased through statistical methods such as importance sampling. Typically, this involves first simulating gene trees within population histories and then simulating mutations on these trees in order to produce DNA sequences consistent with the model. Incorporating all the demographic events that have been proposed to influence populations is computationally challenging.
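The simulate-and-compare workflow described above (simulate gene trees under a population model, drop mutations on them, then compare a summary statistic with the observed data) can be sketched in a few lines. The example below is a minimal illustration under strong assumptions: a constant-size, single-population coalescent; the number of segregating sites as the only summary statistic; a uniform prior on the scaled mutation rate `theta`; and plain rejection sampling rather than the importance sampling mentioned in the text. All function names are hypothetical.

```python
import random


def simulate_segregating_sites(n_samples, theta, rng):
    """Simulate one locus under the standard constant-size coalescent.

    Time is measured in units of 2N generations, so the waiting time while
    k lineages remain is exponential with rate k(k-1)/2, and mutations fall
    on the tree as a Poisson process with rate theta/2 per unit branch length.
    Returns the number of segregating sites (infinite-sites assumption).
    """
    total_length = 0.0
    for k in range(n_samples, 1, -1):
        total_length += k * rng.expovariate(k * (k - 1) / 2.0)
    rate = theta / 2.0
    if rate <= 0.0:
        return 0
    # Count Poisson-process events along the total branch length.
    count = 0
    position = rng.expovariate(rate)
    while position < total_length:
        count += 1
        position += rng.expovariate(rate)
    return count


def mean_segregating_sites(n_samples, n_loci, theta, rng):
    """Summary statistic: mean number of segregating sites across loci."""
    total = sum(simulate_segregating_sites(n_samples, theta, rng)
                for _ in range(n_loci))
    return total / n_loci


def abc_rejection(observed_mean, n_samples, n_loci, prior_max,
                  tolerance, n_draws, rng):
    """Keep prior draws of theta whose simulated summary is near the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, prior_max)
        simulated = mean_segregating_sites(n_samples, n_loci, theta, rng)
        if abs(simulated - observed_mean) <= tolerance:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    rng = random.Random(1)
    kept = abc_rejection(observed_mean=14.1, n_samples=10, n_loci=20,
                         prior_max=20.0, tolerance=1.0, n_draws=2000, rng=rng)
    print("accepted draws:", len(kept))
    if kept:
        print("approximate posterior mean of theta:", sum(kept) / len(kept))
```

Each prior draw is simulated independently, so the loop in `abc_rejection` parallelizes trivially. The cost grows quickly once the population model includes migration, size changes, or recombination, which is the computational burden the text refers to.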

Lewontin (1979) expressed the dynamic interplay between ecological and evolutionary and population genetic processes through a set of interrelated differential equations: dN/dt = f(N, P) and dG/dt = g(G, W). Here the vector N represents the distribution and abundances of species in nature, G represents the genetic structures of those same species, and W represents the parameters of selection governing the microevolutionary trajectories of the species. In the first equation, P represents the parameters of population growth and interactions among species as mediated by morphology, physiology, dispersal, and so on. The key insight is that the parameters making up P are in turn a function of G; that is, P = h(G). In a similar fashion, microevolution (W) is determined in part by the abundances of species, N: W = H(N). That is, evolution is density dependent. These relationships between P and G and between W and N express the direct coupling of ecological and evolutionary processes. The integration of ecological and evolutionary modeling is still incomplete, though. As noted above, ecological theory (dN/dt) and evolutionary theory (dG/dt) are typically studied in isolation from each other today; thus one of the main objectives of computational research at the interface of ecology and evolution will be the “recoupling” of ecological and evolutionary dynamics. Such approaches have contributed to some solid conceptual advances, and they hint at how the reintegration of ecology and evolution can lead to results that would not be expected had they been studied in isolation. One such unexpected result is the prediction of mutational meltdown (extinction due to accumulation of deleterious mutations) in metapopulations in situations when population sizes become critically low, thereby accelerating the fixation of deleterious mutations.
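The coupled system dN/dt = f(N, P), dG/dt = g(G, W), with P = h(G) and W = H(N), can be made concrete with a deliberately small numerical sketch. The functional forms below are assumptions picked only for illustration and are not Lewontin's: N is a single population's density with logistic growth, G is the frequency of one allele, the growth rate P rises linearly with G, and the selection coefficient W weakens as N approaches the carrying capacity. The names `growth_parameter`, `selection_parameter`, and `simulate_coupled` are hypothetical.

```python
def growth_parameter(g, r_min=0.2, r_max=1.0):
    """P = h(G): assumed linear dependence of growth rate on allele frequency."""
    return r_min + (r_max - r_min) * g


def selection_parameter(n, k=1000.0, s_max=0.5):
    """W = H(N): assumed selection coefficient that vanishes at carrying capacity."""
    return s_max * (1.0 - n / k)


def simulate_coupled(n0, g0, k=1000.0, dt=0.01, steps=10000):
    """Forward-Euler integration of the toy coupled system.

      dN/dt = f(N, P) = P * N * (1 - N / k)   (logistic growth)
      dG/dt = g(G, W) = W * G * (1 - G)       (selection on one allele)
    Returns a list of (time, N, G) tuples.
    """
    n, g = n0, g0
    trajectory = [(0.0, n, g)]
    for i in range(1, steps + 1):
        p = growth_parameter(g)        # ecology depends on genetics
        w = selection_parameter(n, k)  # evolution depends on density
        dn = p * n * (1.0 - n / k)
        dg = w * g * (1.0 - g)
        n += dt * dn
        g += dt * dg
        trajectory.append((i * dt, n, g))
    return trajectory


if __name__ == "__main__":
    t_end, n_end, g_end = simulate_coupled(n0=10.0, g0=0.05)[-1]
    print("t =", t_end, "N =", round(n_end, 2), "G =", round(g_end, 4))
```

In this toy version the final allele frequency depends on how quickly the population fills its environment, because selection effectively switches off at carrying capacity. Holding either P or W fixed removes that dependence, which is the kind of difference between coupled and decoupled treatments that the passage describes.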
Major Challenge 6: Understanding the Patterns and Mechanisms of Genome Evolution

No tool kit has revolutionized evolutionary biology more than genomics. The foundation of large-scale genome analysis is the complete sequencing of genomes, whether from single-celled bacteria or unicellular eukaryotes or from more structurally and developmentally complex animals and plants. Such an approach allows systems-level comparisons of whole genomes across evolutionary time and across lineages that share few or no morphological traits. The advent of ever-more-rapid sequencing approaches promises that our ability to compare whole genomes to one another will only improve and provide an increasingly refined picture of the processes underlying genome evolution at all scales. Yet very fundamental questions remain, such as understanding how new genes arise. There are thought to be a large number of often serendipitous routes by which new genes can arise (Lynch, 2007). Duplication of individual genes, entire chromosomes, or entire genomes, followed by subsequent divergence, is probably the most common mechanism for the generation of novel genes. As a consequence of this mechanism, the new genes are usually closely related in sequence to their progenitor genes. The fate of these genes will depend on the selective environment at the time of duplication and the availability of favorable mutations. Newly created genes often experience a host of novel mutations: the expansion and contraction of repeated amino acid motifs; changes to gene regulatory domains, which control when and where the genes are expressed; additions and deletions of entire exons and domains; and even changes in reading frame. The role of whole-genome duplications is also an important consideration in organismal evolution, especially in vertebrates and certain plant lineages.
The fate of genes after duplication, either as part of whole genome duplications or just individual gene duplications, has been extensively modeled, but as we get more and more genomic data we will begin to get a better picture of the actual course of events and the role of genetic change in evolutionary change. Several large-scale surveys of gene duplications appeared in recent years, and the estimated rates of gene duplications are now thought to be about the same as, if not faster than, rates of point mutations in DNA sequences (Lynch, 2007). Transposable elements are a diverse class of autonomous mobile DNA elements that can proliferate within genomes, copying themselves from sites of origin to novel chromosomal positions at bewilderingly high rates. Transposable elements have been known since the mid-1900s, yet their full impact on the evolution of genomes and organisms has only recently been appreciated. Approximately 40 percent of the human genome is composed of easily recognized transposable elements, and if one includes ancient elements whose identity has eroded over time, the fraction affected is likely much greater. As such, they are probably the single most important long-term component of complex eukaryotic genomes, and their effects on evolution and disease are profound and widespread (Batzer and Deininger, 2002; Deininger and Batzer, 2002). Transposable elements have long been known as a source of mutation, and many of their early phenotypic and genetic effects were considered detrimental or at best neutral. However, in the last 5 years, the number of cases in which transposable elements have played a demonstrably adaptive role in the origin of new genes and genomic functions has dramatically increased, so that traces of transposable elements in the genome can no longer be assumed to be junk DNA. Transposition, often through an mRNA intermediate, is responsible for the many genes in the genome without introns, so-called retrotransposed genes (Brosius, 1999). These novel genes often assume novel functions unrelated to the function of the original gene.
As a result, the creation of novel genes via transposition is thought to underlie some of the most dramatic examples of adaptation to extreme environments, such as the ability of notothenioid ice fish to maintain their blood in liquid form, without freezing, while inhabiting the subfreezing Antarctic Ocean (Chen et al., 1997). Transposable elements are now known to provide some of the basis for novel cis-regulatory elements of genes, enabling enhanced versatility at the level of expression. The structures of many genes clearly retain vestiges of transposable elements, indicating that the genome frequently co-opts these elements for novel coding functions. Sometimes the breaking up and reregulation of genes by transposable elements clearly has adaptive value; it has been postulated that the vertebrate immune system was brought about by the insertion of a transposable element into an ancestral immunity gene. At the same time, it has been estimated that up to 10 percent of congenital diseases in mice are caused by transposable elements. The balance between adaptive and detrimental consequences for transposable element proliferation is thus a key unknown in evolutionary genomics. Transposable elements are also an important factor in the variability of total genome size of animals, plants, and fungi. For example, the evolution of the vertebrate genome over the past 600 million years has seen the origin, proliferation, and decline of several families of transposable elements. Why families of transposable elements increase and then decline, often in concert with changes in the numerical dominance of unrelated families, is still unclear. Are there global genomic regulations on the number of transposable elements that are driven by the deleterious effects of accumulation? If so, where do these regulations come from and can they be predicted? What is the ecology of transposable elements in the genomic community?
Large-scale variation in transposable elements will play an important role in explaining the 1,000-fold variation in genome size observed among living eukaryotes. As we build our knowledge of how genes arise, evolutionary biology can begin to understand how evolution is constrained by networks of interaction between genes and noncoding elements in the genome. Genes do not exist in isolation; they belong to complex networks of interacting genes, gene products, and environmental signals. There is increasing interest in the degree to which genetic networks are robust to perturbations from within and without. Perturbations from within include the generation of novel genes as a result of gene duplication, and the fate of such gene duplicates is known to depend on the novelty of function of the new gene and the structure of the network into which it is inserted.

Perturbations from without include environmental changes, and the effect of such perturbations on the stability of developmental modes and outcomes is key to understanding whether networks are adaptive or whether they have arbitrary and neutral aspects to their construction. Genetic networks are also known to influence rates of genomic change (Barton et al., 2007), and some possess a structure that buffers against the insertion of new genes and gene functions. In recent years biologists have added a host of novel interacting genomic elements to the already long list of well-annotated genes. Such novelties include “ultraconserved” noncoding elements in the human genome, which are short sequences, fewer than 200 base pairs (bp), that are 100 percent conserved between humans, other mammals, and in some cases chicken and fish; microRNAs, which are short sequences encoding RNAs of only ~22 bp that are now known to play critical roles in development and gene regulation (Bartel, 2004); and long-distance enhancers, short regions of the genome that regulate a set of constituent genes despite sometimes being located millions of base pairs away from their targets. Untangling the web of interactions among such diverse sets of genomic players and providing a seamless link between genomic data (such as large-scale gene expression data) and network models remain computational challenges for the future. Progress over the last 10 years has refined to an unprecedented degree our characterization of primary genome constituents in humans and other organisms while revealing new structures and forces operating within complex genomes that were unknown in the pregenomic era. Understanding how these myriad constituents interact and influence one another and how genomes and chromosomes function and evolve as hierarchical networks is a major challenge for evolutionary biology.
Many of the principles of population genetics and molecular evolution, laid down in the twentieth century, are still applicable to genome data despite the scaling up from single genes to entire genomes (Li, 1997; Lynch, 2003). By contrast, the computation and estimation of whole-genome parameters and the analysis of vastly larger data sets have required, and will continue to require, new computational and algorithmic tools.

Computational Challenges

Obtaining genomic sequences is becoming easier and less expensive each day, and we are becoming inundated with genomic data. It is predicted that within a few years the cost of sequencing a human genome will drop to less than $1,000 and that such information will be a routine component of personal medical information used for diagnostic purposes. Making sense of all these sequence data, however, requires a combination of computational and evolutionary approaches.

Modern sequencing technologies routinely yield relatively short fragments of a genomic sequence, from 25 to 1,000 bp. Whole genomes range in size from the typical microbial sequence, which has millions of base pairs, to plant and animal sequences, which often consist of billions of base pairs. Additionally, “metagenomic” sequencing from environmental samples often mixes fragments from dozens to hundreds of different species and/or ecotypes. The challenge is to take these short subsequences and assemble them to reconstruct the genomes of species and/or ecosystems (NRC, 2007). While the fragment assembly problem is NP-complete, heuristic algorithms have produced high-quality reconstructions of hundreds of genomes. The recent trend is toward methods of sequencing that can inexpensively generate large numbers (hundreds of millions) of ultrashort sequences (25-50 bp). Technical and algorithmic challenges include the following:

• Parallelization of all-against-all fragment alignment computations.
• Development of methods to traverse the resulting graphs of fragment alignments to maximize some feature of the assembly path.
• Heuristic pruning of the fragment alignment graph to eliminate experimentally inconsistent subpaths.
• Signal processing of raw sequencing data to produce higher quality fragment sequences and better characterization of their error profiles.
• Development of new representations of the sequence-assembly problem, for example, string graphs that represent data and assembly in terms of words within the dataset.
• Alignment of error-prone resequencing data from a population of individuals against a reference genome to identify and characterize individual variations in the face of noisy data.
• Demonstration that the new methodologies are feasible, by producing and analyzing suites of simulated data sets.

Once we have a reconstructed genomic or metagenomic sequence, a further challenge is to identify and characterize its functional elements: protein-coding genes; noncoding genes, including a variety of small RNAs; and regulatory elements that control gene expression, splicing, and chromatin structure. Algorithms to identify these functional regions use both statistical signals intrinsic to the sequence that are characteristic of a particular type of functional region and comparative analyses of closely and/or distantly related sequences. Signal-detection methods have focused on hidden Markov models and variations on them. Secondary structure calculations take advantage of stochastic context-free grammars to represent long-range structural correlations. Comparative methods require the development of efficient alignment methods and sophisticated statistical models for sequence evolution that are often intended to quantitatively model the likelihood of a detected alignment given a specific model of evolution. While earlier models treated each position independently, as large data sets have become available the trend is now to incorporate correlations between sites.
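As a rough illustration of the all-against-all fragment alignment step listed among the assembly challenges above, the following minimal sketch finds exact suffix-prefix overlaps between every ordered pair of reads and records them as edges of an overlap graph. The function names, the `min_overlap` threshold, and the toy reads are assumptions made for illustration; practical assemblers rely on indexed, error-tolerant alignment rather than this quadratic exact-match loop.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_overlap` characters exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first and shrink toward the threshold.
    for length in range(max_len, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_overlap=3):
    """All-against-all comparison: map (i, j) -> overlap length for i != j."""
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_overlap)
        if length:
            edges[(i, j)] = length
    return edges


if __name__ == "__main__":
    toy_reads = ["ATGCGT", "GCGTAC", "GTACCA"]  # hypothetical reads
    print(overlap_graph(toy_reads, min_overlap=3))
```

Each ordered pair is examined independently of every other pair, which is why this step lends itself to parallelization: the pairs can be divided among processors with no communication until the edges are merged.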
To compare dozens of related sequences, phylogenetic methods must be integrated with signal detection.

Major Challenge 7: Understanding the Evolutionary Dynamics of Coevolving Systems

Individuals of the same species or of different species generally have either conflicting or cooperating (mutualistic) interactions. Increasingly, many of these interactions have been found to evolve in relation to one another, a process known as coevolution. These coevolving interactions take many forms, from the symbiosis of organelles that once invaded free-living microbes, to predator-prey, host-parasite, and plant-pollination systems, to cooperative breeders or sexually selected mate choice, and to competitive interactors within habitats and communities. Understanding the evolutionary biology of such interactions has broad implications for solving problems in many areas of applied biology, including human health, agriculture, and resource management. For this reason, there is great interest in understanding the genomic underpinnings of conflict and cooperation and how conflict and cooperation evolve.

Genetic mechanisms underlying conflict and cooperation have long been investigated empirically and theoretically, and considerable research has been undertaken on the genetics of behavior. It is widely recognized that behavior is influenced in complicated ways by numerous genes and their interactions, but we still have inadequate empirical knowledge of the genetic variation in the wild that is available to selection. The new tools of genomics promise to broaden the kinds of questions and approaches that have been standard in the field. To put the interconnections among genomic data, development, neurological function, and expressed behavior in an evolutionary context, studies will need to be comparative.
These studies will ask new questions about the numbers and kinds of genes and about differences in the genetic architectures that influence conflictual and cooperative behaviors, including the genetic basis of instinct.

Cross-species, coevolutionary comparisons of multispecies interactors will reveal new insights into the nature of coevolution itself. For example: How fast is genetic and phenotypic change in interactors? Are coevolving systems conservative over time? How are those systems shaped by genetic factors?

The evolution of conflict and cooperation can be studied at different hierarchical levels. For example, the history of species associations, such as hosts and their parasites, has been the focus of considerable phylogenetic coevolutionary analysis. At the level of populations, ecologists and population biologists have also built a large library of detailed field and laboratory studies.

Despite this large body of work on conflict, cooperation, and coevolution, many aspects of the roles that conflict and cooperation may play in evolution are still poorly understood, for example, in the evolution of adaptation and in the origin of species. And there is a need for studies that integrate causality at the genomic level with that at the population, ecological, and demographic levels.

Computational Challenges

Sophisticated modeling of conflict/cooperation systems has a long history, and the large body of literature investigating these systems integrates genetic and population approaches. Much of this modeling stems from game-theoretic approaches and from classic population dynamic models such as predator-prey. Although game theory originally dealt with economic problems, it has also had a profound impact on evolutionary biology (see, for example, Maynard Smith, 1982; Vincent and Brown, 2005). Most quantitative analyses of conflict/cooperation models have been carried out using desktop computing. Yet, as models subsume more parameters and include demographic or genetic information across space and multiple generations, access to advanced capability computing will become necessary (see, for instance, Nowak, 2006).
MAJOR CHALLENGES IN EVOLUTIONARY BIOLOGY THAT REQUIRE HECC

Progress in most areas of evolutionary biology has been very rapid over the past several decades. However, because desktop computing has continued to advance, much of quantitative and theoretical evolutionary biology has prospered without reliance on advanced computational capabilities. But this is likely to be a transient phase, because over the past decade, evolutionary biology has become increasingly multidisciplinary and integrative. This, and the rapid accumulation of genetic and other data on populations and species, has accelerated the transformation of evolutionary biology into a quantitative science. The study of microevolution, which requires genetic and demographic analyses of evolutionary change within populations, has a long history of theoretical and quantitative modeling and remains robust. In many other areas of evolutionary biology, however, the building of quantitative theoretical models has been neglected, and research is sorely needed.

One area that has made use of high-performance machines is phylogenetic research. As more members of the community become adept at using cutting-edge computational methods, there will inevitably be pressure to port models and codes to more powerful platforms and thereby address scientific questions in ways that mirror the complexity of natural systems. Evolutionary biology is already in transition, and to realize its potential fully will require progress on all capability-computing-related fronts: models, theory, data management, education and training, algorithms, and hardware.

It should be stressed that progress in some areas of evolutionary biology is being limited by a lack of computational power. Even in phylogenetic research, where HECC is being exploited, researchers could make use of additional computational resources for increased statistical testing, larger simulations, and advanced visualization tools.

Major Challenges 1 and 2 are probably where we will first feel the need to transition to HECC-enabled research. In both cases, access to advanced computational approaches will enable representing enough complexity to reveal new phenomena. To address Major Challenge 1, the phylogenetics community is building larger and larger trees, evaluating them statistically, and manipulating them visually. As was noted earlier in this chapter, however, the computational challenge scales superexponentially with the number of terminals on the trees. This makes it especially important for the community to take maximum advantage of whatever state-of-the-art computing exists at the time. Not doing so will seriously hamper future progress. In addition, research using phylogenetic methodologies is expanding rapidly in the biomedical sciences as well as for metagenomic studies investigating community structure/function, ecosystem metabolism, and global climate change. In all of these areas, producing results that can meaningfully answer practical questions calls for much greater complexity, which in turn demands advanced computing. Along with the greater complexity, these applications typically involve massive amounts of data, the management and analysis of which will require advanced computing.

For studies about speciation (Major Challenge 2), the simple mathematical models with only a few parameters that were of such importance for previous theoretical work are unable to make good use of the large amounts of data becoming available. They also are proving inadequate given the desire for higher resolution descriptions of genetic and/or population behavior.
Thus, although experimentation may continue to be the dominant research modality for Major Challenge 2, there is a need to move to simulations that are explicitly genetic and characterized by a large number of parameters, large populations (hundreds of thousands of individuals), long timescales (hundreds of thousands of generations), and significant stochasticity (which requires that the simulations be run multiple times to enable statistical analysis). For the research community to take the step from qualitative predictions and speculations to much more powerful and precise quantitative predictions and estimates, it must be able to perform such simulations, for which it will need HECC.

Moreover, advanced computing opens some new options for approaching Major Challenge 2. Evaluating demographic histories using genetic data is computationally challenging for several reasons. For any data set of DNA sequences (or other genetic markers), there are many potential gene trees (phylogenies) that are consistent with the data; in addition, for any given set of phylogenies, there are a number of often complex population histories that can be accommodated by this set of gene trees. The result is two levels of uncertainty. In theory, this challenge could be met by integrating across all possible gene trees and population histories:

$$\Pr(X \mid \Theta) = \int_{G \in \psi} \Pr(X \mid G)\, p(G \mid \Theta)\, dG$$

(Felsenstein, 1988; Hey and Nielsen, 2007). Here, X is the collected data (say, DNA sequence data sampled from multiple loci and multiple species or populations); Θ is the species history, which could be a complex demographic history involving bottlenecks, gene flow, and local extinction or a purely dichotomous history of population divergence, a phylogeny; G is a gene tree, an estimated genealogy of DNA sequences; and ψ is the set of all such genealogies, which include continuous branch lengths and very many topologies.
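To give a sense of how large the set ψ is, the sketch below counts only the distinct unrooted, strictly bifurcating topologies for n labeled tips, using the standard double-factorial result (2n − 5)!!. Branch lengths, which make ψ continuous, are ignored, and the function name is chosen here purely for illustration.

```python
def num_unrooted_topologies(n_tips: int) -> int:
    """Count unrooted, strictly bifurcating trees on n labeled tips.

    Uses the closed form (2n - 5)!! = 3 * 5 * ... * (2n - 5), valid for n >= 3.
    """
    if n_tips < 3:
        raise ValueError("need at least 3 tips for an unrooted binary tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n == 3).
    for k in range(3, 2 * n_tips - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, num_unrooted_topologies(n))
```

Ten tips already give about two million topologies, and the count grows faster than any exponential in n, which is the sense in which the text calls the growth superexponential.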
However, this integration is not realizable in practice, not only because of the need for efficient estimation when the state space is large (for complex demographic models with many parameters) but also because, as discussed in connection with Major Challenge 1, the number of possible gene trees (phylogenies) increases superexponentially with the increasing number of taxa (in our case, genes or alleles sampled from a given species). This makes it impractical to evaluate the above integral by sampling a large number of genealogies at random. Thus even though the probabilities of mutation events for any one genetic locus can be multiplied by those for other loci because they are independent (conditional on the demographic history itself), they cannot be calculated analytically.

As a result, computational and statistical approximations have been used extensively. The most recent of these methods to be employed, Bayesian analysis, frequently utilizes MCMC methods to sample many possible gene trees and parameters associated with the demographic history. Various sampling and rejection schemes have been proposed, as well as means of proposing parameters via complex prior distributions. With a sufficient number of MCMC cycles, complex probability distributions can be approximated. Still, this exercise gets us only to the point where we can evaluate statistically the universe of trees that should be considered appropriate for an analysis. We are still left with deciding which population model (described by gene flows, population size changes, and so on) best fits the set of gene trees.

Many of the approaches to Major Challenge 3 could use HECC now or are moving inexorably in that direction. Irrespective of scale, models that link geosphere and biosphere could be highly parameterized and, if they are, will ultimately rely on HECC when fully implemented. As noted in Chapter 3, geoscientists are beginning to couple climate modeling with environmental modeling and satellite data on ecosystem distribution to reconstruct the environmental history of Earth’s ecosystems and to predict changes due to global warming. Concurrently, environmental modeling of species distributions is also becoming more common, and evolutionary biologists are beginning to use information about phylogenetic relationships to reconstruct the historical environmental envelopes of common ancestors down the tree.
One can imagine the possibility of linking these classes of models and extending the coupled system farther back in time to examine how geological and climatological changes at a global scale might have influenced the distributions of species and biotas in terrestrial and marine environments. The complexity and precision of these reconstructions will depend on massive computational power. Such calculations will continue to stimulate advances in data integration and theoretical analysis, and the difficulties of these challenges will tax even the next generation of HECC.

Reconstructing how taxonomic elements of communities and ecosystems are assembled over time involves integrating the phylogenetic and spatial histories of many groups of organisms simultaneously. Current analytical approaches to this problem, conventionally undertaken on desktop computers, are widely regarded as inadequate because the simplifying assumptions of the methods and the models of spatial change are lacking in realism. Biogeographers, phylogeneticists, and computer scientists are collaborating to develop algorithmic approaches that will be able to extract the complex spatial and temporal histories of multiple groups simultaneously. Because of the large parameter space of the solution set and the algorithmic complexity, HECC will play a major role in data analysis.

From the outset, metagenomic studies have been intensely computational because they involve the assembly and comparisons of millions of gene fragments for hundreds or thousands of different kinds of microbes, many of which are new. The scale of such studies is expanding rapidly (see, for example, Rusch, 2007), and analyzing the results to address evolutionary questions, which will necessarily involve computationally intensive phylogenetic approaches, means that studies in evolutionary biology will more than ever need to push the frontiers of HECC.
Metagenomic studies also reveal the remarkable diversity of protein families and their functional subdomains that have evolved. Predicting the functions of these domains is a highly complex computational problem that brings physical and chemical modeling together with evolutionary biology. No single algorithm has yet been established that works in all cases, but this is a subfield of very active research, with tools being developed and ongoing experimental validation of the predictions (Friedberg, 2006). The prediction of a protein’s three-dimensional structure from its sequence is an area that already makes extensive use of HECC, and structure information could now be applied to understand the evolutionary diversification of protein structures and of families of functionally related proteins.

Simulations capable of representing the process of development, both qualitatively and quantitatively, are necessary for addressing Major Challenge 4. As noted in that section, developmental biologists are just beginning to create the algorithms, but progress is rapid, and they will soon be able to take advantage of HECC. Many of these models are based on relatively simple reaction-rate equations, but the number of parameters is very large. In many cases, parameters such as the rate of protein production and degradation, the rate of ligand diffusion, and the rate of receptor turnover can be specified only within certain limits. This means that the number of possible outcomes is too large to compute; instead, the space of all possible outcomes is sampled to gain an understanding of the developmental pathway. Access to HECC will allow these developmental analyses to be done more efficiently and for a wider range of parameters. Likewise, the growing application of models from chemistry and physics, such as for the diffusion of ligands through the embryo (Gregor et al., 2007), has opened up the possibility of truly predictive models of development, which in turn promise the opportunity to understand the evolutionary outcomes of changes to the system.

Also, as noted in the discussion surrounding Major Challenge 4, analysis of gene regulatory networks and protein interaction networks is an important tool for understanding the development and evolution of phenotypes, and the analysis of both sorts of network can be complex enough to require HECC. For instance, it will be important to test how perturbations affect all manner of developmental events, and this requires multiple large-scale simulations.
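The parameter-sampling strategy described above can be sketched with a deliberately tiny model: a single protein produced at a constant rate and degraded in proportion to its concentration. The equation, the parameter ranges, and all names below are illustrative assumptions, not a model taken from the developmental literature; realistic network models couple many such equations and many more parameters.

```python
import random


def simulate_protein(k_prod, k_deg, t_end=50.0, dt=0.01, p0=0.0):
    """Forward-Euler integration of dP/dt = k_prod - k_deg * P."""
    p = p0
    for _ in range(int(t_end / dt)):
        p += dt * (k_prod - k_deg * p)
    return p


def sample_outcomes(n_samples, prod_range=(0.5, 2.0), deg_range=(0.1, 1.0), seed=0):
    """Draw rate constants uniformly within their (assumed) limits and simulate.

    Returns a list of (k_prod, k_deg, final_concentration) tuples.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        k_prod = rng.uniform(*prod_range)
        k_deg = rng.uniform(*deg_range)
        results.append((k_prod, k_deg, simulate_protein(k_prod, k_deg)))
    return results


if __name__ == "__main__":
    for k_prod, k_deg, final in sample_outcomes(5):
        print(f"k_prod={k_prod:.2f} k_deg={k_deg:.2f} final P={final:.3f}")
```

With two parameters a few hundred samples cover the space reasonably well; with dozens of loosely constrained parameters per gene and many interacting genes, the number of simulations needed for comparable coverage grows very quickly, which is where the demand for larger machines arises.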
Also, as we develop a more detailed understanding of these networks and their effects on phenotypes, research will need to include more pathways and more organisms, thereby necessitating increased computational capability.

As noted in the discussion of Major Challenge 5, computational approaches are beginning to be used to recouple ecological and evolutionary dynamics. To date, most research has focused on simple cases, such as examining the dynamics within a single species with a very simplistic genetic architecture and genetic basis for phenotypic traits. Even so, such cases have contributed to some solid conceptual advances. Future success in this area, however, will depend heavily on analyses of models of communities of organisms with realistic ecological, genetic, and spatial structures, and the complexity of such models quickly brings us to the HECC domain.

The computational demands will be staggering, insofar as such analyses will involve the simultaneous tracking of thousands of genes in hundreds of interacting species, each with its own independent evolutionary history and genetic basis for traits. Extending this work to incorporate spatially explicit landscapes will further escalate the computational demands. One can imagine adding yet another set of equations to Lewontin’s (Major Challenge 5) that would capture the complex trajectories of development (linking genotype to phenotype). Incorporating this last feature would truly integrate ecology and evolution by demonstrating the development of phenotypes from genotypes. At least for some model organism systems (sea urchins, Drosophila), developmental biologists have formulated equations that allow predicting phenotypic properties from the structure of complex gene networks (see Major Challenge 4).
Very little has been done to tie such developmental predictions quantitatively to evolution or ecology (Kingsolver et al., 2007), no doubt because of the computational challenges, but this is clearly the direction in which coupled models of ecological and evolutionary dynamics are heading.

(Footnote to the discussion of sampling the space of developmental outcomes: A good example of such an approach is seen in models of the Drosophila segment polarity network, which maintains a segmental pattern in the early embryo. Computational analysis led to the surprising result that the network was remarkably robust in the face of environmental and genetic perturbations (von Dassow et al., 2000).)

The computational methods that are essential to addressing Major Challenge 6 are largely manageable today, as explained in that section. As described there, evolutionary comparisons play a critical role in genome annotation. In this way, we identify conserved protein coding regions and noncoding regions that presumably play a variety of regulatory roles. But the task of comparing and aligning genomes becomes increasingly difficult as we get more and more genomes and as we ask increasingly sophisticated questions. As this report is being written, few bioinformaticians have routine access to very powerful computers, which limits their research capability. Algorithms and data are changing rapidly, and bioinformaticians often want to run their analyses repeatedly, tweaking parameters on each iteration. At this stage of development, then, it is advantageous for a researcher to work closely with his or her own machine, even if that constrains the scale of the calculations. As algorithms are perfected and databases swell, the need for HECC will grow rapidly. Making access fast and easy will also be critical in getting the user community to switch to state-of-the-art capabilities. These capabilities will also play an important role in developing gene models that can find and annotate genes independent of evolutionary conservation, which is essential when looking for genes that are evolving rapidly or are unique to specific lineages.

As our understanding of comparative genomics in wild populations improves, we will be better able to look for evolutionary signatures of selection and thus tease apart genome-level events that lead to speciation and macroevolutionary changes. While we have sophisticated theories and algorithms for this analysis, we are challenged to test these theories rigorously through large-scale genome analysis. As described above, transposable elements and the genome alterations they cause have played an important role in evolution, but piecing together the course of events is difficult from the computational standpoint. Ironically, the repeat structures created by transposable elements make the process of assembling whole genome sequences from raw data an even more complex computational problem. Right now, when we say a genome has been fully sequenced, that often applies only to its euchromatic region; the heterochromatic region, which often is rich with transposable elements, remains unassembled because computational methods are still lacking to make sense of the data.

While specific computational challenges and approaches in evolutionary biology have been discussed above, several additional observations can be made that apply across all levels of evolutionary research. All of the biological sciences are data rich. This is not just in terms of volume per se but also in complexity, uniqueness or individuality, and their nonreducible characteristics. The data include such disparate materials as relatively simple DNA sequences, catalogs of museum specimens, photo images of collection materials, and movies of developing embryos. Thus the data storage, organization, and dissemination of biological material present challenges. (While the challenges of storing and making available genomic data are significant, it is even more difficult to store and share digitized visual data such as photographs, movies, and images of museum collections.) This flood of data has led to a need for computational tools and computer hardware for database storage, management, and usage.

The torrent of biological data also has changed the evolutionary biology community’s ability to study speciation. For example, earlier work on sympatric speciation used simple mathematical models with few parameters. These models are inadequate for addressing the complexity found with the new genetic data as well as more detailed information about population structure and dynamics. HECC will soon allow evolutionary biologists to move to large-scale simulations of individual-based, dispersed populations.
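Why such simulations must be repeated many times can be illustrated with a minimal Wright-Fisher sketch of genetic drift at a single biallelic locus. This standard textbook model is used here only as a stand-in; it is far simpler than the individual-based, spatially dispersed simulations envisioned in the text, and the function names and parameter values are assumptions.

```python
import random


def wright_fisher(pop_size, p0, generations, rng):
    """Follow one allele's frequency among 2N gene copies under pure drift.

    Stops early if the allele is lost (frequency 0) or fixed (frequency 1).
    """
    n_copies = 2 * pop_size
    count = round(p0 * n_copies)
    for _ in range(generations):
        p = count / n_copies
        # Binomial sampling: each copy in the next generation is drawn
        # independently from the current allele frequency.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        if count in (0, n_copies):
            break
    return count / n_copies


def replicate_runs(n_reps, pop_size=50, p0=0.5, generations=200, seed=1):
    """Run independent replicates; outcomes differ only through chance."""
    rng = random.Random(seed)
    return [wright_fisher(pop_size, p0, generations, rng) for _ in range(n_reps)]


if __name__ == "__main__":
    print(replicate_runs(10))
```

Because each replicate ends at a different frequency, conclusions have to be drawn from the distribution over many runs. In this naive form the work grows with population size times generations times replicates, before any loci, spatial structure, or selection are added.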
Through new technologies, especially for macromolecular analysis, biological investigations are becoming even more characterized by this data richness. In an era of high-throughput, high-information-content discovery in biological science, more and more research domains in evolutionary biology will be able to profit from high-end computing. To progress in exploiting the massive amounts of data, subdisciplines of evolutionary biology should strive to reach the point where HECC use is routine. Ultimately, this trajectory will unleash the potential for a very rich theoretical framework for evolutionary biology, just as exists today for the physical sciences. As this framework is built up and becomes robust, high-end computing will become pervasive within the community. In general, access to computing (using computing in the broadest sense) at capabilities up to the state of the art, combined with the data revolution, has already transformed studies in evolutionary biology, and it will grow more enabling as theory and experimentation continue their rapid advance.

Genomics and metagenomics data sets, individual genomes, and entire population or community genomes (metagenomics) all require the methods of computational evolution if they are to be understood. For example, comparative analysis remains an essential tool for understanding biology. To utilize the explosion in genomics and metagenomics, efficient alignment methods and advanced statistical methods for characterizing sequence evolution are needed. Typically, a mathematical model (itself part of the framework of a specific model of evolution) is created to discern likely alignments. As ever-larger data sets are made available to the community, it may be possible to include correlation so that it no longer is necessary to treat each sequence position independently. Signal detection algorithms will need to be integrated into phylogenetic methods. HECC is needed to cope with the data flow, which includes growing numbers of sequences, characters, numbers of species, and so on. Each parameter can take on thousands to hundreds of thousands of values, yet even more character data have to be added to gain confidence for resolving trees, whether those trees depict patterns of species interrelationships or the evolutionary patterns of genes within populations.
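As a concrete instance of an alignment method, the sketch below computes a global alignment score with Needleman-Wunsch dynamic programming. This is a standard technique named here for illustration, not necessarily the method the report's authors have in mind, and the scoring values (match, mismatch, gap) are arbitrary assumptions. Like the earlier models mentioned in the text, it scores each position independently.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score for aligning sequences `a` and `b` end to end.

    Keeps only two rows of the dynamic-programming table, so memory is
    proportional to len(b) while time is proportional to len(a) * len(b).
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b` (all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # Column 0: prefix of `a` against an empty string.
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(
                max(
                    prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                    prev[j] + gap,               # gap in b
                    curr[j - 1] + gap,           # gap in a
                )
            )
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The quadratic cost per pair is modest for short sequences but adds up quickly when many long sequences must be compared, and it grows further once the model is extended to capture correlations between sites.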
The data richness of biology often leads biologists to simplify their questions to the level at which they can be addressed in a reasonable amount of time on local computing resources, for example, reducing the number of parameters in modeling an ecological network. Enabling evolutionary biologists to readily exploit supercomputing power would significantly change the aims and scope of many research programs. Even today we can see how progress is limited by the relative scarcity of substantial computing resources at the high end: Workstations require months to solve medium-sized problems for modeling molecular evolutionary change even though new algorithms have provided some improvements. For theory, experimentation, and modeling to work coherently to advance evolutionary biology, computing must be available to sustain correlated intellectual inquiry. Long compute times are especially limiting when the range of models required is so large and the field has to depend on models and their validation, since no exact analytical solutions are possible.

The trajectory of the phylogenetics community as it seeks to probe the history of life is necessarily aimed at building and understanding ever larger trees. Thus another opportunity exists if we can encourage the development of community codes and community efforts to advance our understanding of the tree of life. An immediate example is the NSF’s Assembling the Tree of Life project, which has nucleated a closely connected community. That community has the incentive and the common purpose to work together effectively to develop codes and use HECC extensively to improve models and their validation for the tree of life. Many other biological computing applications today are developed locally.
Colleagues ask to use them, and at a certain point there is enough interest that it is worthwhile investing resources into hardening them, bringing them up to standards that software engineers can live with, and making them more user-friendly. This bottom-up method of developing software works best for biology, where there is such a diversity of questions being asked, but work on developing, distributing, and especially making it possible for these programs to work together is not well supported by the current funding mechanisms.

In addition, greater access to advanced computing environments can forge extensive partnerships with mathematicians and computer scientists, leading to clearer definition of computational problems and establishment of new algorithms. Greater collaboration with the mathematical, computer science, and engineering communities, as has happened in climate change and environmental biology, would greatly accelerate progress in evolutionary biology. These collaborations can also lead to capabilities for advanced visualization, methods for the implementation and application of evolutionary models, and capabilities for the interactive analysis of large, very complex data sets, all tailored to the particular needs of evolutionary biologists. These examples illustrate the many practical advantages of such access; the incentives for using advanced computing encompass more than just the classic NP-complete nature of generating and validating phylogenetic trees. But many steps must be taken before this vision can be realized.

The benefits of such access go beyond the ability to develop and use heuristic and approximate solutions for ever larger phylogenetic trees to advance our understanding of the history of life. They include how to apply this knowledge for the good of society. A deep understanding of evolution integrated into the fabric of biology provides the basis for all our understanding and knowledge of life and how living systems function. That, in turn, allows biology to contribute to developing new vaccines, antibiotics, and other medicines, predicting drug targets, managing natural resources, providing biosecurity through identification of pathogens and invasive species, and so on.

Evolutionary biology also presents challenges for the scientific computing community. That community has its roots more in the computational problems arising in physics, such as fluid flow, structural analyses, and molecular dynamics, and it has built up expertise and software for those sorts of problems. But many biological applications involve irregular data structures and unpredictable memory accesses (because the data come from strings, lists, trees, and networks), which place more demands on integer performance.
So the community of expert algorithmicists and code builders must also be built up if evolutionary biology is to replicate the computational successes of the physical sciences.

What would evolutionary biologists need in a HECC facility? At the very least, the community needs uniform access to large data sets (tera- to petabyte scale for image data, giga- to terabyte scale for genomic data) and a network infrastructure that allows remote access and sharing. More generally, the challenge of large, interrelated data sets is a new one for biology, and the research community does not yet have the habit of looking for patterns in those data or the theoretical framework for doing so. When it does, we will be able to ask entirely new questions.
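
The combinatorial difficulty of phylogenetic tree search noted in this chapter can be made concrete with a small calculation. The minimal Python sketch below counts the distinct unrooted, fully bifurcating tree topologies for a given number of labeled taxa, using the standard closed form (2n - 5)!!. The function name `unrooted_tree_count` and the taxon counts in the loop are illustrative assumptions, not taken from the report; the sketch only suggests why exhaustive enumeration gives way to heuristic and approximate methods as trees grow.

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating topologies on n_taxa labeled tips.

    Uses the closed form (2n - 5)!! = 3 * 5 * ... * (2n - 5), valid for n >= 3.
    (Hypothetical helper written for illustration only.)
    """
    if n_taxa < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 taxa")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n - 5 (empty product when n == 3).
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


if __name__ == "__main__":
    # Illustrative taxon counts; real phylogenetic data sets are often far larger.
    for n in (4, 5, 10, 20, 50):
        print(f"{n:>3} taxa: {unrooted_tree_count(n):.3e} possible topologies")
```

Under these assumptions, ten taxa already allow about two million topologies and twenty taxa allow more than 10^20, which is one way to see why tree inference at scale relies on heuristics and substantial computing resources.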


Many federal funding requests for more advanced computer resources assume implicitly that greater computing power creates opportunities for advancement in science and engineering. This has often been a good assumption. Given stringent pressures on the federal budget, the White House Office of Management and Budget (OMB) and Office of Science and Technology Policy (OSTP) are seeking an improved approach to the formulation and review of requests from the agencies for new computing funds.

This book examines, for four illustrative fields of science and engineering, how one can start with an understanding of their major challenges and discern how progress against those challenges depends on high-end capability computing (HECC). The four fields covered are:

  1. atmospheric science
  2. astrophysics
  3. chemical separations
  4. evolutionary biology

This book finds that all four of these fields are critically dependent on HECC, but in different ways. The book characterizes the components that combine to enable new advances in computational science and engineering and identifies aspects that apply to multiple fields.
