The National Plant Genome Initiative: Objectives for 2003–2008

CHAPTER TWO
Sequencing: Generation of the Raw Material

The generation of new DNA sequence data will continue to be critical for the understanding of plant genomes in the near future. Although it would be desirable to sequence many plant genomes now, the cost of sequencing must drop substantially before immediate deep-draft or finished sequencing of more than a few genomes becomes feasible. By deep-draft we mean at least 6-fold coverage of the gene-rich regions, accompanied by other data, such as a physical map and enough sequence from bacterial artificial chromosome ends (BAC ends) to generate a scaffold of ordered, contiguous DNA sequence. Finishing includes filling gaps and raising the sequence accuracy so that there is no more than one error per 10,000 base pairs. Finishing is painstaking and hence more expensive than draft sequencing, and is not anticipated in every case. The current cost of finished DNA sequencing is about 9 cents per finished base, down from 50 cents per base 3 years ago, while the cost of all other high-throughput sequencing (ESTs, deep draft) is about $1.50 per sequencing run. Projects proposed under the NPGI should be competitive with these costs and should use them as benchmarks. The cost of sequencing will probably continue to drop during the next 5 years and beyond, and the NPGI must be positioned to take advantage of cost improvements as they occur over the next 5–10 years.
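
As a rough illustration of how these benchmarks translate into project budgets, the following sketch computes the cost of finishing a target region at 9 cents per base, the number and cost of sequencing runs needed for 6-fold draft coverage at $1.50 per run, and two idealized quality figures. The 200 Mb target size and the 500-base usable read length are illustrative assumptions, the function names are hypothetical, and the gap estimate uses the standard idealized Poisson (Lander-Waterman) model of random shotgun coverage rather than anything specific to plant genomes.

```python
import math

# Benchmarks quoted in the text (kept in cents to avoid float rounding).
FINISHED_CENTS_PER_BASE = 9   # finished sequence, per base
CENTS_PER_RUN = 150           # one high-throughput sequencing run


def finished_cost_dollars(target_bases: int) -> float:
    """Cost of finished sequence for a target region of the given size."""
    return target_bases * FINISHED_CENTS_PER_BASE / 100


def runs_for_coverage(target_bases: int, coverage: int, read_length: int) -> int:
    """Runs needed so that total bases read equal coverage x target (rounded up)."""
    total_bases = target_bases * coverage
    return (total_bases + read_length - 1) // read_length


def draft_cost_dollars(target_bases: int, coverage: int, read_length: int) -> float:
    """Raw read cost of a draft; excludes mapping, assembly, and analysis."""
    return runs_for_coverage(target_bases, coverage, read_length) * CENTS_PER_RUN / 100


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized Poisson estimate of the fraction of bases never sampled."""
    return math.exp(-coverage)


def expected_errors(finished_bases: int, bases_per_error: int = 10_000) -> float:
    """Maximum errors implied by a one-error-per-10,000-bases standard."""
    return finished_bases / bases_per_error


if __name__ == "__main__":
    target = 200_000_000  # ASSUMPTION: illustrative 200 Mb target region
    read_len = 500        # ASSUMPTION: usable bases per sequencing run

    print(f"Finished cost:       ${finished_cost_dollars(target):,.0f}")
    print(f"Runs for 6x draft:   {runs_for_coverage(target, 6, read_len):,}")
    print(f"Draft read cost:     ${draft_cost_dollars(target, 6, read_len):,.0f}")
    print(f"Unsampled at 6x:     {expected_uncovered_fraction(6):.4%}")
    print(f"Max errors finished: {expected_errors(target):,.0f}")
```

Under these assumptions, finishing a 200 Mb region would cost about $18 million, whereas the reads for 6-fold draft coverage would require 2.4 million runs at about $3.6 million. The comparison covers read generation only; it is meant to show the order of magnitude of the gap between draft and finished sequence, not to budget a real project.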

The explosive increase in understanding of biology over the last 20–30 years has been enabled by work on model genetic organisms, including Arabidopsis. The NPGI is best served by
a structure that effectively exploits the application of detailed knowledge gained from Arabidopsis and judiciously selected reference species to related crop species. To that end, we recommend a detailed characterization of genomes of a small number of reference species selected on the basis of criteria detailed below, to represent key plant taxa. This should be accompanied by parallel investment in genetics and genomics tools from related crop plants that are explicitly designed to transfer knowledge gained from research in model and reference species into agronomic development. Partnership among federal agencies that span the breadth from basic to applied plant research is essential.

Species to be considered for development into reference status should be chosen according to how well they fulfill the following criteria:

• Experimental tractability, including
    - Forward genetics: the ability to isolate mutants and the relevant genes,
    - Reverse genetics: the ability to target, or identify, mutations in a predefined gene,
    - Availability of a physical genetic map,
    - Short generation time,
    - Ease of transformation,
    - Ease of growth under defined conditions.
• Low genome complexity, including
    - Diploid genome,
    - Small genome size.
• Size, expertise, and ability of the research community meant to use the sequence and functional-genomics tools, including the opportunity for international collaboration.
• Suitability for translation to agronomically valuable plants.

The few plant species that meet those criteria should be selected to encompass a range of phylogenetic diversity and to include major plant processes not present in Arabidopsis. Such species should already have well-developed research communities and experimental resources and
should fall within key, agronomically relevant clades of the plant kingdom. In particular, we recommend that the reference species be chosen from the families Poaceae (grasses), Fabaceae (legumes), and Solanaceae (including tomato), in that order, as funds allow. Those three families contain species that are both important research organisms and critical crops. Of the roughly 250,000 plant species, humans have domesticated fewer than five thousand as crops, and roughly 20 species, primarily from two plant families (the grasses and legumes), provide the great majority of our food. Independent origins of agriculture were based on the domestication of cereals and legumes: rice and soybean in Asia, maize and beans in the Americas, and barley, wheat, pea, lentil, and chickpea in the Near East. The grasses and legumes are principal components of most terrestrial ecosystems. Poaceae includes sugar cane and all the major cereals: maize, wheat, rice, barley, sorghum, millet, and rye. Grass species account for about 50% of human caloric intake, provide all cereals and most of the world’s sugar, are the principal forage for animals, and occupy about 70% of the world’s farmland. Fabaceae supplies nearly 33% of the human nutritional requirement for nitrogen, with a protein content that is balanced in amino acids and roughly two to three times that of cereals. The high protein content of legumes is related to their capacity to symbiotically fix atmospheric nitrogen into ammonia; this property is a key factor in the global nitrogen cycle. Legumes are also important sources of fodder and forage for animals and of edible and industrial oils. Tomato and potato, both natives of South America and members of the family Solanaceae, have become increasingly important in the global food supply.

Thus, there are compelling candidates in each of the three plant families that, on the basis of the criteria for selection outlined above, could serve as reference species. We suggest that rice and maize in Poaceae, Medicago truncatula in Fabaceae, and tomato in Solanaceae are strong candidates for elevation to reference-species status. They meet essentially all of the criteria proposed above. These genomes should therefore be sequenced to at least the deep-draft stage and finished if scientifically warranted. For example, we believe that the rice draft sequences now or soon available should be
finished. Further, NPGI investment in functional genomics tool development, where appropriate as described in chapter 4, should be focused on these species.

The discovery of extensive conserved synteny (gene order on chromosomes) among grass genomes reflects their divergence from a common ancestral species within the last 50–60 million years. Thus, the grasses as a group constitute a unified genetic system and provide a collective model genome for the monocots. That concept holds for many plant families. Poaceae, Brassicaceae, Solanaceae, and Fabaceae are all poised to become model families in which genetic and genomic data can be extrapolated to a broad array of comparative and evolutionary studies. That potential is best exemplified in the grasses: rice and maize sequences together would permit development of a unified model system.

Developing genetics and genomics tools applicable to an entire plant family requires that the advantages and idiosyncrasies of individual species in the family be delineated and explored. For example, given the goal of defining taxon-specific characters, it will be important to sample-sequence the genomes of more than one species from each of the Poaceae, Fabaceae, Solanaceae, and Brassicaceae. The ability to compare the gene content and structure of fully sequenced BAC clones between confamilial species is already yielding dividends. In particular, such paired-species analyses allow inference of ancestral genome structure and should reveal major changes brought about by domestication. The combined efforts we propose for NPGI, in concert with related efforts around the world, will hasten identification of the evolutionary adaptations that characterize each plant family and will facilitate the transfer of knowledge to additional crop species in each family.

In addition to Arabidopsis and rice, complete drafts of the genome sequences of the single-celled alga Chlamydomonas (CGP 2002) and the poplar tree (PGP 2002) will be available by 2003, and should be exploited as scientifically appropriate. We propose that DNA sequencing consume a substantial part of the total expected funding allowance for the next 5-year phase of the NPGI (about 40% of the current total), but the cost of sequencing is
a legitimate issue to consider in deciding what to sequence, how deeply, and when to begin.

Strategic and cost-effective organization of genomic sequencing projects depends on some necessary prerequisites. The approaches and costs for genomic sequencing are likely to vary, but these projects will require adherence to most of the following minimum criteria: (1) high-quality physical maps and accompanying BAC-end sequences, (2) genetic maps integrated with the physical map, (3) knowledge of genome organization and complexity based on cytogenetic and pilot sequencing data, (4) existence of well-characterized transcript libraries to facilitate genome annotation, and (5) public relational databases that integrate whole-genome data with other data types.

Before a large-scale sequencing project for any reference species is launched, the community interested in that species should be well organized. These research communities should coordinate efforts on an international scale to make functional connections with communities of researchers working on related species. These communities should be asked to formally propose projects based on a community-endorsed white paper, as outlined in “Pre-project vetting” later in this chapter.

With the appropriate tools in hand, reference-species sequencing should also include low-pass random sequencing (or BAC-end sequencing, when a physical map is available) of related species, which may be other crops, as a means of gene and regulatory DNA annotation and as a mechanism for comparative genomics; and, critically, establishment of a user-friendly, integrated public database.

On the basis of the experimental and agricultural criteria discussed above and the expected cost of DNA sequencing during the next 5 years, we offer the following explicit recommendations for genomic DNA sequencing for 2003–2008:

• Finish the rice genome sequence except for heterochromatin. The current sequences available are draft sequences, and finishing is vital to establish a “gold standard” grass sequence. This project is continuing and international.
• In addition, sequence the gene-rich, low-complexity genomic DNA (known as the “gene space”) of Medicago truncatula (a legume), tomato, and maize. The unmethylated gene-rich portion of diploid higher plant genomes is typically about 200–250 Mb. This is therefore the expected size of the gene space of tomato and Medicago, while that of maize, which is a partial tetraploid, is about 400 Mb.
• Collect (~2x) sample genomic sequence from related and progenitor species of the reference species and of the model species Arabidopsis to use as germplasm resources and for comparative and evolutionary genomics. This will be useful for phylogenetic comparisons, single-nucleotide polymorphism (SNP) definition, and gene-model predictions; it will also provide data useful for studies of population genetics and evolution of development.

COMMUNITY STANDARDS FOR LARGE-SCALE SEQUENCING PROJECTS

Because the plant-research community will rely on the availability of the sequences of the reference species for their individual work, it is critical that sequencing be carried out effectively. In this regard, community standards for large-scale sequencing projects play an important role in moving science forward.

PRE-PROJECT VETTING

There are still serious funding constraints on DNA sequencing of large genomes, and there is a need to balance them against other goals in the NPGI. Therefore, it is vital that the research community for any species seeking NPGI funds for either genomic or deep EST sequencing prepare a community consensus white paper, both to justify the project scientifically and to demonstrate community unity and organization before the massive investment required is made.

ORGANIZATION

Large-scale sequencing projects should be done in a high-quality, inexpensive manner at any of the several large sequencing centers, public and private, around the world. We see no reason to build new sequencing centers as part of NPGI. Successful proposals must be competitively budgeted with respect to the per-base, per-EST, or per-cDNA cost of high-throughput sequencing. (Current estimates of sequencing costs are 9 cents per finished base, or approximately $1.50 per high-throughput sequence run.) However, assembly, finishing, and analysis require contributions from highly motivated experts. Because the most efficient finishing takes assembly and analysis into account, there is clearly a role for academic sequencing centers associated with plant biologists, even though companies and genome centers can offer competitive contracts for high-throughput sequencing. Hybrid strategies can be successful where responsibility for deliverables is held jointly by academic scientists and private-sector sequencing facilities. Whatever sequencing strategy is pursued, quality and cost metrics must be parts of the definition of deliverables to ensure maximal gene discovery per dollar invested.

DATA RELEASE

The NPGI should institute a uniform standard for sequence-data release that ensures rapid release of high-quality data (using the so-called Bermuda rules established for the human genome sequencing project). The standards may differ for different types of sequencing; they need to be carefully written and should be managed with effective oversight to serve the community in as timely a manner as possible.

METRICS OF SUCCESS

Large-scale DNA sequencing demands effective measures of success. The success of large-scale community service projects should be measured by the number of bases and clones generated and deposited into
GenBank and Stock Centers per unit of elapsed time and per dollar invested. The number of publications citing data provided by the sequencing group and the number of new research projects throughout the community that rely on a particular large-scale sequencing project are also useful measures of the public value of such projects. As measures of any sequencing center’s success in serving its community, those factors must be visibly advertised and regularly updated on every project Web page. The community needs to be able to understand easily what data to expect and when to expect them, so that researchers whose work requires access to the center’s information can plan their experiments and activities. Progress in large-scale community-service projects should not be presented only when they are up for renewal or at the end of a funding period, and scientific advisory boards and project reviewers should be rigorous in their monitoring of this feature of large-scale service projects.
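
The throughput measures named above (bases and clones deposited per unit of elapsed time and per dollar invested) are simple ratios, and one way a project Web page might compute them is sketched below. This is only an illustration: the `ProgressReport` class, its field names, the choice of months as the time unit, and the sample figures are all hypothetical and are not drawn from any actual sequencing project.

```python
from dataclasses import dataclass


@dataclass
class ProgressReport:
    """Hypothetical snapshot of a large-scale sequencing project's deliverables."""
    bases_deposited: int      # bases deposited in the public sequence database
    clones_deposited: int     # clones deposited in stock centers
    dollars_invested: float   # funds spent to date
    months_elapsed: float     # time since the project started

    def __post_init__(self) -> None:
        # Ratios are undefined without a positive denominator.
        if self.dollars_invested <= 0 or self.months_elapsed <= 0:
            raise ValueError("dollars_invested and months_elapsed must be positive")

    def bases_per_dollar(self) -> float:
        return self.bases_deposited / self.dollars_invested

    def bases_per_month(self) -> float:
        return self.bases_deposited / self.months_elapsed

    def clones_per_dollar(self) -> float:
        return self.clones_deposited / self.dollars_invested

    def clones_per_month(self) -> float:
        return self.clones_deposited / self.months_elapsed


def summarize(report: ProgressReport) -> dict:
    """Collect the four ratios in one place for regular public posting."""
    return {
        "bases_per_dollar": report.bases_per_dollar(),
        "bases_per_month": report.bases_per_month(),
        "clones_per_dollar": report.clones_per_dollar(),
        "clones_per_month": report.clones_per_month(),
    }


if __name__ == "__main__":
    # ASSUMPTION: made-up figures, used only to exercise the calculation.
    example = ProgressReport(
        bases_deposited=30_000_000,
        clones_deposited=1_200,
        dollars_invested=1_500_000,
        months_elapsed=12,
    )
    for name, value in summarize(example).items():
        print(f"{name}: {value:,.4f}")
```

Publishing such ratios at regular intervals, rather than only at renewal time, would let reviewers and users compare actual output with what was promised. Publication counts and dependent research projects would need to be tracked separately.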