
Suggested Citation: "5 Data Access." National Research Council. 2002. Exploring Horizons for Domestic Animal Genomics: Workshop Summary. Washington, DC: The National Academies Press. doi: 10.17226/10487.


5 Data Access

The final issue tackled by the participants was how best to work with the tremendous amount of data that will be generated by domestic animal genome projects. The data create a number of challenges, said Daniel Drell of the U.S. Department of Energy (DOE). "These have to do with the interoperability of data, the sharing of data in some cases, but, principally, organizing it in such a way that others can come along and add value to it in some efficient ways." So far, he said, "the genome projects have been largely unsuccessful at dealing with many of these."

APPROPRIATE TOOLS AND THE IMPORTANCE OF DATA ACCESS

One can frame the issue in terms of access to data, said Claire Fraser. "When it comes to data access," she said, "there are two ways to think about it. One, are the data accessible in GenBank or someplace else? And the answer is yes. But individual sequence reads or assembled data are only so useful. What we really need in terms of data access, in order to empower all of the users that are interested in getting a hold of these data, are far better databases and tools to really exploit the information. And I think this is an area that so far has been more of an afterthought with these projects than it should have been."

The result, she said, is that some genomics researchers end up having easier access to the data than others. "We are seeing a bit of a genomics-divide being created between those groups that are involved in generating the data and have been forced to build the tools in order to manipulate it, and the more typical user who doesn't necessarily have access to the same tools, (and) who doesn't have bioinformatics expertise at his or her university. I think that's one of the real problems that we need to address."

The other problem, Fraser added, is that the various genome projects generally make no allowance for taking care of the data they generate once the project is finished. "For the most part, even for sequencing projects with bioinformatics support during the term of the project, that support ends when the sequence is completed. There's been no plan put in place for how to maintain and update all of this information."

"That problem is going to get even worse as we begin to accumulate more data. There have been all sorts of models proposed, from letting people in the community who are passionately interested in an organism do it on an ad hoc basis, to having this done in a more centralized facility, to having this done in a distributed way but with clear rules for interoperability. I've even heard some people go so far as to suggest that perhaps we need to come up with some sort of tax on genome projects that goes to fund a bioinformatics trust managed by an inter-agency group responsible for maintaining these databases."

Several participants pointed out that in order to maximize the value of the information generated by domestic animal genome projects, researchers and information technology specialists will have to pay more attention to data handling. In particular, programs need to be designed not only to maintain the data and make it accessible to any researcher who needs it but also to make sure the information can be integrated with new data and new understandings as they appear.

THE CHALLENGE OF SCALING UP IN RESPONSE TO INCREASES IN DATA

The biggest difficulty is the problem of scaling: A database must be designed so that it continues to work, and work well, when the amount of data in it is doubled or increased by a factor of ten or twenty.
That will be a challenging job, Fraser noted. "I'm not convinced," she said, "that any of the existing databases that have been built so far to handle sequence information are robust enough to scale to the level that we know we are going to need in going forward." The databases built to handle the sequence information are actually the easy part, she said. "We would like to begin to add in functional information, either directly or through links, to all of the existing gene and protein databases. When you start thinking about doing that, the challenge goes up by several orders of magnitude."

Owen White, of The Institute for Genomic Research (TIGR), made a similar point. "The National Center for Biotechnology Information (NCBI) is doing a heroic job," he said. "They are doing an amazing job managing sequence data and publication data. That's a specific data type, and they have a fighting chance of scaling up for just the raw sequence information.

"But there's another data type that a lot of us are familiar with, which is annotation. Annotation is kind of a generic term, but I usually mean identification of all the genes and trying to give functional assignments to those genes and trying to represent them well in a structured database. So if you've got 500 microbial genomes and people want to come in and work with the data, I would argue that we don't really have representation systems for that type of thing."

While the problem of scaling up the databases that hold basic information, such as sequences of base pairs, is challenging but seemingly solvable, no one has yet constructed databases that will be able to handle the amount of annotation that likely will proliferate in the years to come.

STRUCTURING GENOME DATABASES

Workshop participants had various perspectives on how a system of genome databases should be structured. White, for instance, offered a vision of large central repositories that would each handle all the data of one particular type (say, information on how genes are expressed) for many different species. He warned that it would not be feasible to have one mega-center handle all different types of data for every type of organism, but he argued that if each center focused on one type of data, it would work quite well.

"There are a number of reasons why I think this is a much more attractive model," he said. "Training becomes much easier, and there is reduced reinvention of the wheel. Once you instantiate those infrastructures, they are easy to apply to new organisms." Furthermore, he added, these data-specific centers should be able to expand easily enough to accommodate ever-growing amounts of data.
"I think they are the only things that had a chance of scaling."

Suppose, he said, that some individual research center had developed a good way to represent expression information for the particular organism studied at that center. "Hopefully they generalize their services enough so they can apply them to another organism. Then if they instantiate what the standard operational procedures are, they develop a relatively good training program, and they have a robust representation system going on in the database. That's the hard part. That is the energy of activation, so to speak. Then adding another organism is actually much simpler."

A member of the audience disagreed with White's suggestion, however. For him, it made more sense to keep smaller, individualized databases and develop standards so that the various databases could exchange information and work with each other almost as if they had a single database. "You don't have to bring things into gigantic warehouses or try to federate databases. You try to create a level of information that can be exchanged among databases. In part, this goes along the lines of the discussions about whether you sequence in a center only or distribute the work in order to create local communities of scientists and train graduate students. This is particularly true in bioinformatics. If you have only centers for collecting information, you develop no local skills and no local students to use that information."

"Centers like NCBI do an extraordinary job of archiving low-level information," he continued. "But in the plant community, for instance, there is an immense difference in the interests of, say, the cereal genomicists versus just the legume folks. The legume folks have a high interest in secondary metabolism, symbiosis, and nitrogen fixation. Those are all functions that fit within community exploration of data and creation of data models and data-mining mechanisms appropriate to those. But they don't map onto cereals, and if you try to force these into a one-size-fits-all model, you come down to a lowest common denominator of things that are done well."

In short, having different centers for different organisms allows each to specialize and take into account the areas of interest for that particular organism. It might make sense to accumulate certain types of information, generally the very basic, low-level information, in one large central repository, but the higher-level information, with its sensitivity to the type of genome being considered, is better handled at individual centers.

ALLOCATION OF RESOURCES FOR BIOINFORMATICS

No matter how the centers ultimately are organized, several participants expressed the view that more resources must be allocated toward bioinformatics if researchers are to be able to work with all the data that are being accumulated. "If you want a system," White said, "that can dynamically manage data that's coming in from several projects in parallel and have version dates and a help desk and just a well-engineered system, we are talking about a completely different magnitude of budget that's required to do that."
