4 Success in Data Integration
As a domain matures, more scientists take an interest in it, and progress starts to depend on the sharing of data. In the beginning such sharing is quite difficult, so a domain must develop ways to facilitate sharing as it matures. This includes the setting of standards, which may slow progress in individual groups in order to achieve a greater good for all. While the discussions recounted above evinced skepticism about any global schema, there are places where standards have been quite successful, some of which are described below.
The most successful standards tend to emerge bottom-up: individual scientists recognize the need and work to build consensus standards. Other standards are imposed top-down by some sort of dominant force in an enterprise. According to several workshop participants, top-down approaches work only rarely, and bottom-up approaches have a much better chance of success. However, standards are also easier to achieve when a domain has a dominant player, as pointed out by Dr. Stonebraker. In enterprise data, for example, Walmart has so much influence that it can specify standards and force all of its suppliers to conform if they wish to sell goods to Walmart. Google has this sort of influence in the Web search space.
The successes of the Sloan Digital Sky Survey and Genbank in sharing astronomy data and genomic data are well known in the scientific community. The National Spatial Data Infrastructure (NSDI), mentioned by Dr. Clarke, has been emulated worldwide as the global spatial data infrastructure and is another example of success. The NSDI was prompted by an Executive Order issued by President Clinton in 1994, which also called for “development of a National Geospatial Data Clearinghouse, spatial data standards, a National Digital Geospatial Data Framework and partnerships for data acquisition.”1 The NSDI enables sharing of geographical information, elimination of redundancies, and other significant benefits. Some other success stories, perhaps less well known, are presented here.
FREEBASE
Freebase is a large, collaboratively edited database of crosslinked data developed by Metaweb Technologies. Freebase has incorporated the contents of several large, openly accessible data sources, such as Wikipedia and Musicbrainz, allowing users to add data and build structure by adding metadata tags that categorize or connect items.2 To date, most of the information in Freebase relates to people and places, though it can accommodate a wide range of data types, including research data.
Freebase is intended to be an important component of the Semantic Web, allowing automation of many Web search functions and communication between electronic devices (New York Times, 2007). However, Freebase has quality issues, omissions, errors, and redundant information—most of its information is not truly integrated. While Freebase is a success in some respects (community contributions have led to large volumes of information and it is possible to get useful answers to some queries), it cannot guarantee accurate and complete answers. Overall, Freebase demonstrates a novel mechanism for data aggregation, but it has not yet solved many of the challenges of information integration.
MELBOURNE HEALTH
Melbourne Health, a healthcare provider in Melbourne, Australia, envisions building a generic informatics model for beneficial collaboration across organizations and expansion to other research areas (Bihammar and Chong, 2007). Melbourne Health’s original goal was to link the databases from seven hospitals and two research institutes for multiple disease research. The challenges in this work come from the large amount of data, the paucity of data standards, poor interoperability between databases, and the need to ensure compliance with ethical, privacy, and regulatory norms.
1 Quoted from http://www.fgdc.gov/nsdi/policyandplanning/executive_order. Accessed May 5, 2010.
2 Available at http://freebase.com. Accessed October 23, 2009.
Medical documents and research data come from files, Excel spreadsheets, and databases, and the hospitals and clinics may use different systems. To exchange documents across these systems, Melbourne Health uses the HL7 Clinical Document Architecture (CDA), an XML-based markup standard that specifies the encoding, structure, and semantics of clinical documents. According to the IDC case study (Bihammar and Chong, 2007), Melbourne Health has linked research databases in 16 organizations, allowing them to collaborate.
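To illustrate what an XML-based clinical document standard makes possible, the following is a minimal sketch of reading a CDA-style document with Python's standard library. The sample document is invented and heavily trimmed (real CDA documents carry many more required elements), the patient identifier is a placeholder, and the function name summarize_cda is hypothetical. Nothing here is drawn from Melbourne Health's actual implementation.

```python
import xml.etree.ElementTree as ET

# CDA documents are written in the HL7 v3 XML namespace.
HL7_NS = "urn:hl7-org:v3"
NS = {"hl7": HL7_NS}

# Invented, heavily trimmed sample; identifiers are placeholders.
SAMPLE_CDA = """\
<ClinicalDocument xmlns="urn:hl7-org:v3">
  <title>Discharge Summary</title>
  <recordTarget>
    <patientRole>
      <id root="2.999.1" extension="HYPOTHETICAL-001"/>
    </patientRole>
  </recordTarget>
  <component>
    <structuredBody>
      <component>
        <section>
          <title>Diagnosis</title>
          <text>Example narrative text.</text>
        </section>
      </component>
    </structuredBody>
  </component>
</ClinicalDocument>
"""


def summarize_cda(xml_text):
    """Extract the document title, patient identifier, and section titles."""
    root = ET.fromstring(xml_text)
    title = root.findtext("hl7:title", namespaces=NS)
    patient_id = root.find("hl7:recordTarget/hl7:patientRole/hl7:id", NS)
    # Walk every <section> element, wherever it is nested in the body.
    sections = [
        section.findtext("hl7:title", namespaces=NS)
        for section in root.iter("{" + HL7_NS + "}section")
    ]
    return {
        "title": title,
        "patient_id": patient_id.get("extension") if patient_id is not None else None,
        "sections": sections,
    }


summary = summarize_cda(SAMPLE_CDA)
```

Because every participating system agrees on the element names and nesting, the same extraction code can in principle be applied to documents from any of the linked organizations, whatever system produced them.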
SCIENCE COMMONS AND NEUROCOMMONS
Science Commons (http://sciencecommons.org), launched in 2005, is an offshoot of Creative Commons, a not-for-profit organization that develops and disseminates free legal and technical tools to facilitate the finding and sharing of creative content (Garlick, 2005). Science Commons focuses on lowering the barriers that researchers face in sharing data, publications, and materials.
The goals are to expand sharing, interoperability, and reuse of data, but these goals are hampered by legal and cultural barriers. Although research data are not subject to copyright protection, the arrangement of data and the structure of databases may be protected (for a discussion of the legal context for sharing and accessing research data, see NRC, 2009). Specific rights to reuse or integrate data may be unclear, and integrating data collected under different jurisdictions may be problematic. Researchers in some fields might take proprietary approaches to data or might lack the motivation to make their data available proactively.
Science Commons has developed several programs and tools to lower these barriers. The Protocol for Implementing Open Access Data allows researchers to mark their data for machine-readable discovery in the public domain so that their databases can be legally integrated with others, including those collected in other jurisdictions.3
The NeuroCommons project, under the auspices of Science Commons, is developing an open-source knowledge management platform for biological research. The goal is to make all knowledge sources—including articles, knowledge bases, research data, and physical materials—interoperable and uniformly accessible by computational agents. NeuroCommons is a prototype framework for creating information artifacts that can provide lessons for future communities, particularly in reaching community consensus around technical standards and curation processes. The NeuroCommons framework utilizes URIs and RDF, making it part of the Semantic Web.4
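The role of URIs and RDF can be sketched in a few lines of plain Python. RDF represents knowledge as subject-predicate-object triples whose terms are URIs, so two independently produced sources can be combined simply by taking the union of their triples. All URIs, vocabulary terms, and function names below are hypothetical and chosen only for illustration; a real deployment would typically use an RDF library and a triple store rather than Python sets.

```python
# Hypothetical base URI; not a NeuroCommons identifier.
EX = "http://example.org/"


def uri(local):
    """Build an illustrative URI from a local name."""
    return EX + local


# Two independently produced sources, each a set of (subject, predicate, object).
source_a = {
    (uri("gene/G1"), uri("vocab/encodes"), uri("protein/P1")),
}
source_b = {
    (uri("protein/P1"), uri("vocab/mentionedIn"), uri("article/A1")),
}


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def articles_for_gene(graph, gene):
    """Follow gene -> protein -> article links across the graph."""
    found = []
    for _, _, protein in match(graph, s=gene, p=uri("vocab/encodes")):
        for _, _, article in match(graph, s=protein, p=uri("vocab/mentionedIn")):
            found.append(article)
    return found


# Merging is a set union because both sources name the protein with the same URI.
merged = source_a | source_b
```

Neither source alone can answer the question "which articles mention the product of gene G1?", but the merged graph can, which is the sense in which shared identifiers make knowledge sources uniformly accessible to computational agents.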
3 Information drawn from http://www.sciencecommons.org. Accessed October 23, 2009.
4 Information drawn from http://neurocommons.org. Accessed October 23, 2009.
To apply this idea to scientific information artifacts, one creates a set of conventions for syntactic and semantic compatibility among components and a standard packaging mechanism to make selecting and installing components easy. One starts with the primary sources (databases, knowledge bases, and the like), applies a script to do the normalization, and comes up with a packaged component. The resulting “binary” may or may not be collected with others to make a distribution. Someone creating a local installation optimized for local query obtains needed components from one or more distributions and installs those into their own environment.
Some two dozen components have been created and collected in the NeuroCommons framework. The components are independent and the architecture is open, so that anyone may pick and choose the ones they like without having to take all of them. One may create new components and either add them to the distribution (subject to quality control), create a new distribution, or just use them privately. The NeuroCommons distribution is currently provided either as a set of RDF files or as a database dump.
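The workflow described above (normalize a primary source into a packaged component, collect components into a distribution, then install only the components one needs) can be sketched as follows. This is an illustrative model under stated assumptions, not the NeuroCommons tooling: the class names Component and Distribution, the functions normalize and install, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    """A packaged, normalized unit of data (hypothetical structure)."""
    name: str
    triples: tuple


def normalize(name, raw_rows):
    """Stand-in normalization script: trim fields, drop incomplete rows, deduplicate."""
    cleaned = set()
    for row in raw_rows:
        fields = tuple(f.strip() for f in row)
        if len(fields) == 3 and all(fields):
            cleaned.add(fields)
    return Component(name, tuple(sorted(cleaned)))


@dataclass
class Distribution:
    """A named collection of components that can be published together."""
    components: dict = field(default_factory=dict)

    def add(self, component):
        self.components[component.name] = component


def install(wanted, *distributions):
    """Build a local installation from only the requested components."""
    local = {}
    for dist in distributions:
        for name in wanted:
            if name in dist.components and name not in local:
                local[name] = dist.components[name]
    missing = set(wanted) - set(local)
    if missing:
        raise KeyError(f"components not found: {sorted(missing)}")
    return local


# Package two primary sources, then install just one of them locally.
dist = Distribution()
dist.add(normalize("genes", [(" g1 ", "encodes", "p1"),
                             ("g1", "encodes", "p1"),
                             ("g2", "", "p2")]))
dist.add(normalize("articles", [("p1", "mentionedIn", "a1")]))
local = install(["genes"], dist)
```

Keeping components independent is what allows the pick-and-choose behavior: install never needs to know about components that were not requested, and it accepts several distributions so that privately built components can sit alongside published ones.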
BIO2RDF
Bio2RDF (http://bio2rdf.org) is an open-source project that aims to facilitate biomedical knowledge discovery using Semantic Web technologies. Bio2RDF is an important contributor to the Linked Data Web, offering the integration of over 30 major biological databases. Their content ranges from biological sequences (such as those stored in UniProt, Genbank, RefSeq, and Entrez Gene), structures (from the Protein Data Bank), pathways and interactions (cPATHs), and diseases (OMIM) to community-developed biomedical ontologies (OBO).
This project builds on W3C standards for sharing information over existing Web architecture and for representing biomedical knowledge using standardized logic-based languages. Powered by open-source tools, Bio2RDF enables scientists not only to explore manually curated and computationally aggregated knowledge about biological entities but also to link their own data, so that all scientists can ask fairly sophisticated questions across distributed, but integrated, biomedical resources. Bio2RDF-linked data are available today as N3 files, indexed Virtuoso databases, and SPARQL endpoints across three mirrors located in Canada and Australia.
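As a rough illustration of how a SPARQL endpoint is queried, the sketch below builds a simple query and the corresponding HTTP GET URL, following the SPARQL protocol convention of passing the query text in a query parameter. The endpoint address and resource URI are placeholders, not actual Bio2RDF addresses, and the function names are hypothetical. The code only constructs the request and does not contact any server.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Placeholder endpoint; substitute the address of a real SPARQL service.
ENDPOINT = "http://example.org/sparql"


def build_query(resource_uri, limit=10):
    """Ask for every predicate and object attached to one resource."""
    return (
        "SELECT ?predicate ?object\n"
        "WHERE { <" + resource_uri + "> ?predicate ?object }\n"
        "LIMIT " + str(int(limit))
    )


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of an HTTP GET request."""
    return endpoint + "?" + urlencode({"query": query})


query = build_query("http://example.org/gene/G1", limit=5)
url = build_request_url(ENDPOINT, query)

# Round-trip check: decoding the URL recovers the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
```

Sending the request (for example with urllib.request and an Accept header asking for application/sparql-results+json) is left out so that the example runs without network access. Because every mirror speaks the same protocol, the same query text could be sent to any of them by changing only the endpoint.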
With interest growing in the Bio2RDF data and services beyond the initial developers, the group is fielding requests to add more than 50 additional data sources in the areas of yeast and human biology, toxicogenomics, and drug discovery.