CHAPTER SIX
Development of a National Strategy for Plant Bioinformatics

When the NPGI was launched, it was recognized that the long-term success of plant biology depended on researchers’ obtaining seamless access to the disparate and massive datasets arising from genomics research and to the tools needed to examine and analyze the data. There is now a flood of sequence and other plant data, and with it has come the need to expand access to the collective data being generated, so that biologists working on a wide array of plants can find answers to a diverse set of research questions. Making the data that are representative of the entire Kingdom of plant life available and usable to the scientific community is a major undertaking—one that requires a national strategy for plant bioinformatics.

Bioinformatics is a broad discipline that exploits the richness of large datasets to generate research findings. More than a set of tools, bioinformatics is a research approach that includes the engineering of information systems (such as the creation of databases), the development of analytic methods (such as data-mining tools to extract biologically significant patterns in sequence or other data), and the creation of computation-based, predictive models that use multiple types of data to understand how plant systems operate. As a framework that enables investigators to access, integrate, analyze, and compare large datasets, bioinformatics is central to genomics research.







The National Plant Genome Initiative: Objectives for 2003–2008

In the short term, a national strategy for bioinformatics requires the plant-research community to place greater emphasis on integrating bioinformatics approaches into its work. That includes training, collaboration with large data centers, and bioinformatics-oriented research itself, such as the creation of specialized databases or new views into genomic data that lead to novel insights. General databases will be needed to provide community services for the reference species, and they should be developed with community participation. The stewards of data and the creators of databases and tools should not act independently but should communicate and coordinate with each other and with public genome repositories to develop common platforms, standards, and interfaces.

In the long term, the common platforms and specifications will become the foundation of a “genomics grid” that will allow appropriately trained investigators to harness the power of a broad network of distributed databases, tools, and computing power from their desktops. That vision of the future requires investment in a computational infrastructure (hardware and software) needed not only for plant biology but for all of genomics research nationally. The NPGI should be a leader on the path to that goal.

To lay the groundwork for this vision, we offer the following specific recommendations for the next 5-year phase of the NPGI.

1. Support the development of community databases as tools to generate knowledge.

Scope and participation: In the context of the NPGI, bioinformatics must serve the unique information needs of diverse research groups focused on different plants and different research goals. The relevant research groups, nationally and internationally, must be active participants in the development of dynamic, interoperable, specialized databases.
Databases should provide an intellectual focus for the integration and interpretation of a wide spectrum of biologic data. If properly conceived and constructed, a dynamic, distributed database interrelating everything from nucleotide sequences to ecologic data will provide a research tool that will potentiate new kinds of discoveries in biology.

An investment in databases for reference species must be supported by an investment in interoperable species-specific databases. The databases may incorporate information from related species (comparative-genome databases) and should include core information for cross-species referencing. Thus, for instance, a rice database might provide a basic data model that could meet the database needs of all cereals if funds were available to curate nonrice data into a parallel version of the rice database. Such a model is being pursued by the Gramene database. In general, it is neither desirable nor economically feasible to support separate databases for all species; there must be other mechanisms, such as data warehousing for smaller projects in related community databases.

For community databases to succeed, data maintenance needs to be recognized as a valid activity and supported accordingly. This is especially true as a database grows and additional dedicated support personnel are needed.

The Arabidopsis Information Resource (TAIR) constitutes a model for some aspects of the scope and level of research and service desirable for all the other reference-species databases (TAIR 2002). Each of the reference species will need financial support at least comparable with that received by TAIR. Note that TAIR is underfunded (in budget and staff) relative to central databases dedicated to Drosophila and C. elegans (personal communication, Chris Somerville), a reality that gives an estimate of the support required for success, inasmuch as those model animal genomes are comparable in size with the Arabidopsis genome.

Database design: The long-term vision of a bioinformatics strategy is to create a decentralized collection of independent and specialized databases that are developed and maintained by different groups and communities but that operate as one large, distributed information resource with common controlled vocabularies, related user interfaces, and curation practices. An example of a collective effort to develop a common vocabulary is the Gene Ontology Consortium (2001). Other standards for interoperability are evolving in semantics and syntax, and these standards-developing activities can be enhanced by their adoption in the community databases and in cooperation with the national data repositories.

Databases might also be designed to incorporate information from related species; they would be comparative-genomics databases that would include core information for cross-species referencing. Examples of this cross-referencing mechanism are the distributed annotation system (DAS) and the developing distributed service registration environment, bioMoby (bioMoby 2002). Standards for the exchange of data and derived information between databases must be developed not only within the plant community but also in the international genomics community. Therefore, cooperation with national data resources, such as the National Center for Biotechnology Information (NCBI), is critically important. It is essential that the databases be available for participation in the international scientific community.

The current organization and operation of many of the community research databases for plant species will need to change dramatically if they are to take on this role and successfully accommodate the full sequence of a species’ complete genome. As a data resource, these databases should be prepared to handle huge volumes of incoming data, annotate them automatically, present them to the research community in a timely fashion, work with the national data resources, and develop or adopt a curation model for the data. In addition, the databases must become a platform for comparative studies with data from related species, and their managers must recognize their responsibility as members of the larger genomics community.
In this environment, even the technical details of managing the computer system will be more demanding because not only the species community but the global genomics community will depend on its availability 24 hours per day, 7 days a week. Hardware, software, and data redundancy capabilities will become major design considerations.
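
The idea of independent databases operating as one resource through a common controlled vocabulary can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the term identifiers, the gene identifiers, the two in-memory "databases," and the function `find_genes_by_term` are hypothetical and do not correspond to any real database, ontology release, or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_id: str  # identifier local to one community database (hypothetical)
    term_id: str  # identifier drawn from the shared controlled vocabulary


# Hypothetical shared vocabulary; a real project would adopt a community ontology.
SHARED_VOCABULARY = {
    "TERM:0001": "photosynthesis",
    "TERM:0002": "response to cold",
}

# Two independent, hypothetical community databases with illustrative records.
RICE_DB = [
    Annotation("rice_gene_A", "TERM:0001"),
    Annotation("rice_gene_B", "TERM:0002"),
]
MAIZE_DB = [
    Annotation("maize_gene_X", "TERM:0001"),
]


def find_genes_by_term(term_id, databases):
    """Return {database name: [gene ids]} for one shared vocabulary term."""
    if term_id not in SHARED_VOCABULARY:
        raise KeyError(f"unknown vocabulary term: {term_id}")
    return {
        name: [a.gene_id for a in records if a.term_id == term_id]
        for name, records in databases.items()
    }


# One query spans both databases because they annotate with the same term ids.
results = find_genes_by_term("TERM:0001", {"rice": RICE_DB, "maize": MAIZE_DB})
```

Because each database keeps its own gene identifiers and shares only the vocabulary, neither has to adopt the other's schema. That is the sense in which separately maintained resources could behave as a single distributed one.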

The Journal Concept for Community Database Curation

The annotation of genomics data maintained in support of biological research activity provides much of the value and success of the community database. This has been demonstrated in the model organism databases for Drosophila and C. elegans, where there is substantial support for curation activities. These model organisms have the advantage of small genomes and hence finite and limited data sets. In the plant community, where comparative genomics will become an essential tool to leverage related information, new models for annotation must be explored to accommodate the exponential growth of integrated comparative information. The real annotation of genomic information is in the published literature, and a new paradigm is needed to foster, as a curational activity, the incorporation of information derived from the literature into the database.

Community databases might also develop the analytical tools to enable launching, accomplishing, and even publishing primary research results. The implications of this direction are profound, allowing the community database to become a dynamic mechanism to lead, respond to, and integrate genomics-research efforts. When a database environment is capable of providing analytical services, the database also has the potential to become a vehicle for publication of those results.

Four types of curation activities could therefore be envisioned within a community database:

1. The algorithmic annotation of data;
2. The inclusion of literature related to genome information;
3. The publication of new methods and derived results; and
4. The potential publication of negative results.

The latter three areas fall into categories best supported by peer review and publication.
To accomplish those goals, the concept of structuring a database in concert with scientific journals is attractive; for example, databases could have editors and reviewers. Some of the information in the databases will, in fact, require peer review, and new mechanisms to support such publication can be developed in concert with the traditional means of publication. This curation-publication model builds on the strengths of both systems: the immediacy and community ties of the database, the need for timely and effective curation, and the peer review and recognition of the journal. The development of a model for the inclusion of the published literature as a scholarly activity alters the view of that activity and provides a check for the accuracy of interpretation. The publication of new insights developed from database services provides a closed loop for the database activities and again provides a direct mechanism for peer review. Finally, the publication of negative results gives added value to the database as a source of information that traditionally has no outlet but is essential to the progress of genomic activities.

The databases must be robust, extensible, scalable, and maintainable. When possible, plant databases should use off-the-shelf software for their infrastructure and for the development of major data-mining tools. All data models for the databases will need to be published in an electronic format, kept up to date, and documented in detail. Database-associated software (such as parsers and loaders) will need to be made available to the community. The methods used in the preparation of derived information (methods and standard operating procedures) must also be published and available for review and replication. Those strategies will encourage the bioinformatics and computational research efforts essential to address the challenges awaiting us in the next decade by minimizing the duplication of effort in database development and deployment.

Relationship with national data repositories: Currently, community databases often incorporate data that are not validated, because including them can provide additional insights for users. However, these data often contain errors and are frequently asynchronous with data in the national public repositories (such as GenBank). In developing the long view of bioinformatics, we must address the need to develop a gold standard for data quality in our national repositories. If national repositories can certify the correctness of the data they contain, then the essential role of community-oriented databases will be to present integrated and alternative views into the data. A clear understanding of this relationship and greater collaboration with the national repositories might result in more effective curation of plant data.
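
To make the asynchrony problem concrete, the following minimal Python sketch shows one way a community database might check its records against a national repository. It assumes, purely for illustration, that both sides expose a mapping from accession to an integer version number. The accessions, the versions, and the function `find_out_of_sync` are hypothetical and do not reflect the interface of GenBank or any other repository.

```python
def find_out_of_sync(community_records, repository_records):
    """Compare accession -> version maps from the two sources.

    Returns (stale, missing):
      stale   maps accession -> (community version, repository version)
              where the repository holds a newer version;
      missing lists accessions held locally but absent from the repository,
              for example unvalidated data that were never deposited.
    """
    stale = {}
    missing = []
    for accession, local_version in community_records.items():
        repo_version = repository_records.get(accession)
        if repo_version is None:
            missing.append(accession)
        elif repo_version > local_version:
            stale[accession] = (local_version, repo_version)
    return stale, sorted(missing)


# Illustrative, made-up records.
community = {"ACC0001": 1, "ACC0002": 3, "ACC0003": 1}
repository = {"ACC0001": 2, "ACC0002": 3}

stale, missing = find_out_of_sync(community, repository)
```

A routine check of this kind would tell curators which local records to refresh from the repository and which to flag as unvalidated.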
As a matter of efficiency and for the archival maintenance of reference genomic datasets, community-oriented databases should contribute to ensuring the quality of the data at the national repositories but not duplicate the services available at NCBI, which is charged and qualified to certify, update, and maintain plant-genome data and to augment the fundamental tools available to large genomics projects. Such tools may include services that coordinate identifiers across multiple databases and provide a critical link between the database-as-publication and the publications tracked by the National Library of Medicine and the National Agricultural Library. Increased interactions between the bioinformatics community and NCBI will potentiate an entirely new view of what can be derived from genomics data.

Oversight: It is imperative that plant databases be implemented and managed in such a way as to ensure responsiveness to present and future community needs. An essential component of any database-management structure will be advisory committees that can provide critical periodic evaluations of the success of the databases in meeting the needs of the plant-biology community and work in concert with representatives of NCBI. Because of the convergence of research in plant biology around common sets of goals and of reference and model organisms, the management and advisory committees for databases should include members from outside the immediate community served by the databases.

2. Support research on new algorithms and technologies.

Beyond the development of integrated information resources appropriate to the plant community, sophisticated analytical tools must be developed to handle the flow of large, multidimensional datasets and to allow biologists to analyze and interpret the data in an interactive fashion. Examples of this kind of specialized application are statistical analysis of microarrays, comparative sequence alignment and QTLs, and data mining. Computational resources need to be developed that apply the most advanced techniques in the domains of computer and computational science and that are only now being conceptualized in those fields. New research must be funded in database-management systems designed for native genomic information, algorithms for data mining, supervised and unsupervised machine learning, statistical analysis of multiple views of nontraditional data, and data visualization.
That kind of research needs to be conducted on a computational infrastructure appropriate to the scale of the problems. Generalized infrastructures capable of supporting such research are envisioned as a set of technologies that include globally distributed datasets, distributed and interoperable databases, and interconnected clusters of computers that could be used to solve computationally intense problems. The high-performance, distributed computational architecture can be provided by technologies such as those being developed for grid computing. In the future, the development and maintenance of a genomics grid will allow many more investigators to participate in exploiting genomics data by making a vast array of data resources and computational tools generally available, thus leveling the playing field for biologic researchers.

Like the databases themselves, algorithms, software, substantive scripts, and analytical methods developed and applied with support from the NPGI should be made freely available. A large community of computer-science and bioinformatics developers has embraced the open-source model of software development, which provides an environment for availability and cooperative development of tools. Just as the immediate release of genomic sequence data was considered an essential component at the initiation of the genomic sequencing efforts, so will the availability of high-quality software affect the development of bioinformatics. The impact of such source-sharing has already been dramatic in the furtherance of bioinformatics goals with such tools as BLAST and Ensembl. Such broad community efforts should be strongly encouraged.

3. Ensure that NPGI-funded community databases contain a substantive informatics training component.

There is a shortage of researchers with interdisciplinary training that spans biology and bioinformatics. Over the long term, the shortage will be addressed by undergraduate and graduate programs being developed at many universities.
In the meantime, there is a role for community databases in increasing these skills in their respective user communities. The databases established for the reference species should, as an element of their mission, develop and organize short courses and encourage exchange visits between investigators associated with the database and user sites. Training can also be integrated with database research and development. Bioinformaticists who are responsible for community databases must be able to meet the projected demands of the community to incorporate increasingly diverse information into databases. There should be some support for database-design brainstorming sessions and for short- and long-term visits at a database or computing center (for example, to examine critical needs in new database construction or develop strategies for migration to improved hardware and software platforms). Through training efforts, therefore, community databases can foster a collaborative interface between biologists and computational scientists.
