Suggested Citation:"4 Success in Data Integration." National Research Council. 2010. Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/12916.

4
Success in Data Integration

As a domain matures, more scientists develop an interest in it, and progress starts to depend on the sharing of data. In the beginning such sharing is quite difficult, so a domain must develop ways to facilitate sharing as it matures. These include the setting of standards, which may slow progress in individual groups in order to achieve a greater good for all. While the discussions recounted above evinced skepticism about any global schema, there are places where standards have been quite successful, some of which are described below. The most successful standards tend to emerge bottom-up: individual scientists recognize the need and work to build consensus standards. Other standards are imposed top-down by some dominant force in an enterprise. According to several workshop participants, the top-down approach works only rarely, while bottom-up approaches have a much better chance of success. That said, standards are also easier to establish when a domain has a dominant player, as Dr. Stonebraker pointed out. In enterprise data, for example, Walmart has so much influence that it can specify standards and require all of its suppliers to conform if they wish to sell goods to Walmart. Google has similar influence in the Web search space.

The successes of the Sloan Digital Sky Survey and GenBank in sharing astronomy data and genomic data are well known in the scientific community. The National Spatial Data Infrastructure (NSDI), mentioned by Dr. Clarke, has been emulated worldwide as the global spatial data infrastructure and is another example of success. The NSDI was prompted by an Executive Order issued by President Clinton in 1994, which also called for “development of a National Geospatial Data Clearinghouse, spatial data standards, a National Digital Geospatial Data Framework and partnerships for data acquisition.”1 The NSDI enables sharing of geographical information, elimination of redundancies, and other significant benefits. Some other success stories, perhaps less well known, are presented here.

FREEBASE

Freebase is a large, collaboratively edited database of cross-linked data developed by Metaweb Technologies. Freebase has incorporated the contents of several large, openly accessible data sources, such as Wikipedia and MusicBrainz, and it allows users to add data and build structure by adding metadata tags that categorize or connect items.2 To date, most of the information in Freebase relates to people and places, though it can accommodate a wide range of data types, including research data.

Freebase is intended to be an important component of the Semantic Web, allowing automation of many Web search functions and communication between electronic devices (New York Times, 2007). However, Freebase has quality problems: omissions, errors, and redundant information mean that most of its content is not truly integrated. While Freebase is a success in some respects (community contributions have produced large volumes of information, and it is possible to get useful answers to some queries), it cannot guarantee accurate and complete answers. Overall, Freebase demonstrates a novel mechanism for data aggregation, but it has not yet solved many of the challenges of information integration.

MELBOURNE HEALTH

Melbourne Health, a healthcare provider in Melbourne, Australia, envisions building a generic informatics model that supports collaboration across organizations and expansion to other research areas (Bihammar and Chong, 2007). Melbourne Health’s original goal was to link the databases of seven hospitals and two research institutes for research on multiple diseases. The challenges in this work come from the large amount of data, the paucity of data standards, poor interoperability between databases, and the need to ensure compliance with ethical, privacy, and regulatory norms.

1 Quoted from http://www.fgdc.gov/nsdi/policyandplanning/executive_order. Accessed May 5, 2010.

2 Available at http://freebase.com. Accessed October 23, 2009.


Medical documents and research data come from files, Excel spreadsheets, and databases, and the various hospitals and clinics may run different systems. Documents are exchanged using the HL7 Clinical Document Architecture (CDA), an XML-based markup standard that specifies the encoding, structure, and semantics of clinical documents. According to the IDC case study (Bihammar and Chong, 2007), Melbourne Health has linked research databases in 16 organizations, allowing them to collaborate.
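The core of the CDA approach is simply that a clinical document is well-formed XML in the HL7 v3 namespace, so any standard XML toolkit can produce or consume one. The sketch below builds a minimal CDA-like fragment with Python's standard library; the element names and narrative text are illustrative placeholders, and a conformant CDA instance would require many additional mandatory header elements.

```python
# Sketch: assembling a minimal CDA-like XML fragment with Python's
# standard library. This is NOT a conformant HL7 CDA instance; it only
# illustrates the idea of an XML clinical document in the HL7 v3
# namespace (urn:hl7-org:v3, which real CDA documents use).
import xml.etree.ElementTree as ET

CDA_NS = "urn:hl7-org:v3"
ET.register_namespace("", CDA_NS)  # emit the namespace as the default

doc = ET.Element(f"{{{CDA_NS}}}ClinicalDocument")
title = ET.SubElement(doc, f"{{{CDA_NS}}}title")
title.text = "Discharge Summary"  # hypothetical document title
component = ET.SubElement(doc, f"{{{CDA_NS}}}component")
text = ET.SubElement(component, f"{{{CDA_NS}}}text")
text.text = "Patient admitted for observation."  # hypothetical narrative

xml_str = ET.tostring(doc, encoding="unicode")
print(xml_str)
```

Because every participating hospital emits documents in the same namespace and structure, a receiving system can parse them with one generic XML pipeline rather than one parser per source system.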

SCIENCE COMMONS AND NEUROCOMMONS

Science Commons (http://sciencecommons.org), launched in 2005, is an offshoot of Creative Commons, a not-for-profit organization that develops and disseminates free legal and technical tools to facilitate the finding and sharing of creative content (Garlick, 2005). Science Commons focuses on lowering the barriers that researchers face in sharing data, publications, and materials.

Its goals are to expand sharing, interoperability, and reuse of data, but these goals are hampered by legal and cultural barriers. Although research data themselves are not subject to copyright protection, the arrangement of data and the structure of databases may be protected (for a discussion of the legal context for sharing and accessing research data, see NRC, 2009). Specific rights to reuse or integrate data may be unclear, and integrating data collected under different jurisdictions may be problematic. Researchers in some fields may take proprietary approaches to data or may lack the motivation to make their data available proactively.

Science Commons has developed several programs and tools to lower these barriers. The Protocol for Implementing Open Access Data allows researchers to mark their data for machine-readable discovery in the public domain so that their databases can be legally integrated with others, including those collected in other jurisdictions.3

The NeuroCommons project, under the auspices of Science Commons, is developing an open-source knowledge management platform for biological research. The goal is to make all knowledge sources—including articles, knowledge bases, research data, and physical materials—interoperable and uniformly accessible by computational agents. NeuroCommons is a prototype framework for creating information artifacts that can provide lessons for future communities, particularly in reaching community consensus around technical standards and curation processes. The NeuroCommons framework utilizes URIs and RDF, making it part of the Semantic Web.4
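The interoperability that NeuroCommons aims for rests on the RDF data model: every statement is a (subject, predicate, object) triple whose terms are URIs, so statements from independent sources can be merged mechanically whenever they use the same URIs. The sketch below serializes two such triples in N-Triples syntax; all the URIs are hypothetical illustrations, not real NeuroCommons identifiers.

```python
# Sketch of the RDF idea behind NeuroCommons: knowledge is expressed as
# (subject, predicate, object) triples whose terms are URIs, so that
# statements from different sources merge mechanically when they share
# URIs. All URIs below are hypothetical, for illustration only.

def ntriple(subject, predicate, obj):
    """Serialize one triple in N-Triples syntax (URI terms only)."""
    return f"<{subject}> <{predicate}> <{obj}> ."

triples = [
    ntriple("http://example.org/article/12345",
            "http://example.org/vocab/mentions",
            "http://example.org/gene/BDNF"),
    ntriple("http://example.org/gene/BDNF",
            "http://example.org/vocab/expressedIn",
            "http://example.org/region/hippocampus"),
]
for t in triples:
    print(t)
```

Note that the first triple (from, say, a literature-mining source) and the second (from an expression database) connect through the shared gene URI; a computational agent can traverse from article to brain region without either source knowing about the other.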

3 Information drawn from http://www.sciencecommons.org. Accessed October 23, 2009.

4 Information drawn from http://neurocommons.org. Accessed October 23, 2009.


NeuroCommons borrows the model of open-source software distributions, in which independently built components are packaged by convention so that they can be selected and installed together. To apply this idea to scientific information artifacts, one creates a set of conventions for syntactic and semantic compatibility among components and a standard packaging mechanism that makes selecting and installing components easy. One starts with the primary sources (databases, knowledge bases, and the like), applies a script that performs the normalization, and ends up with a packaged component. The resulting “binary” may or may not be collected with others to make a distribution. Someone creating a local installation optimized for local queries obtains the needed components from one or more distributions and installs them into their own environment.
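The normalize-then-package pipeline described above can be sketched as follows. The record format, field names, and checksum choice are all hypothetical; the point is only the shape of the pipeline: source-specific records go in, a self-describing packaged component comes out.

```python
# Sketch of a NeuroCommons-style packaging pipeline: take records from a
# primary source, run a normalization step over them, and emit a packaged
# component that a local installation could select and install. The
# record fields and component format are hypothetical illustrations.
import hashlib
import json

def normalize(raw_records):
    """Normalization step: map source-specific fields onto shared keys."""
    return [{"id": r["identifier"].strip().lower(),
             "label": r["name"].strip()} for r in raw_records]

def package(name, records):
    """Packaging step: bundle normalized records with a checksum,
    playing the role of the 'binary' that goes into a distribution."""
    payload = json.dumps(records, sort_keys=True)
    return {"component": name,
            "checksum": hashlib.sha256(payload.encode()).hexdigest(),
            "records": records}

source = [{"identifier": " GENE-001 ", "name": "BDNF "}]
component = package("example-gene-db", normalize(source))
print(component["component"], len(component["records"]))
```

The checksum stands in for whatever integrity and versioning metadata a real distribution would carry, so that an installer can verify a component before loading it into the local query environment.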

Some two dozen components have been created and collected in the NeuroCommons framework. The components are independent and the architecture is open, so anyone may pick and choose the components they want without having to take all of them. One may create new components and either add them to the distribution (subject to quality control), create a new distribution, or simply use them privately. Currently the NeuroCommons distribution is delivered either as a set of RDF files or as a database dump.

Bio2RDF

Bio2RDF (http://bio2rdf.org) is an open-source project that aims to facilitate biomedical knowledge discovery using Semantic Web technologies. Bio2RDF is an important contributor to the Linked Data Web, integrating over 30 major biological databases whose content ranges from biological sequences (stored in UniProt, GenBank, RefSeq, and Entrez Gene), structures (the Protein Data Bank), pathways and interactions (cPath), and diseases (OMIM) to community-developed biomedical ontologies (OBO).

The project builds on W3C standards for sharing information over the existing Web architecture and for representing biomedical knowledge in standardized logic-based languages. Powered by open-source tools, Bio2RDF enables scientists not only to explore manually curated and computationally aggregated knowledge about biological entities but also to link their own data, enabling all scientists to ask fairly sophisticated questions across distributed but integrated biomedical resources. Bio2RDF-linked data are available today as N3 files, indexed Virtuoso databases, and SPARQL endpoints across three mirrors located in Canada and Australia.

With interest growing in the Bio2RDF data and services beyond the initial developers, the group is fielding requests to add more than 50 additional data sources in the areas of yeast and human biology, toxicogenomics, and drug discovery.

Steps Toward Large-Scale Data Integration in the Sciences summarizes a National Research Council (NRC) workshop to identify some of the major challenges that hinder large-scale data integration in the sciences and some of the technologies that could lead to solutions. The workshop was held August 19-20, 2009, in Washington, D.C.

The workshop examined a collection of scientific research domains, with application experts explaining the issues in their disciplines and current best practices. This approach allowed the participants to gain insights about both commonalities and differences in the data integration challenges facing the various communities. In addition to hearing from research domain experts, the workshop also featured experts working on the cutting edge of techniques for handling data integration problems. This provided participants with insights on the current state of the art. The goals were to identify areas in which the emerging needs of research communities are not being addressed and to point to opportunities for addressing these needs through closer engagement between the affected communities and cutting-edge computer science.
