1
Introduction

This report summarizes a National Research Council (NRC) workshop to identify some of the major challenges that hinder large-scale data integration in the sciences and some of the technologies that could lead to solutions. The workshop was held August 19-20, 2009, in Washington, D.C. The charge to the planning committee was as follows:

To plan and organize a cross-disciplinary public workshop to explore alternative visions for achieving large-scale data integration in fields of importance to the federal government. Large-scale data integration refers to the challenge of aggregating data sets that are so large that searching or moving them is nontrivial, or to the challenge of drawing selected information from a collection (possibly large, distributed, and heterogeneous) of such sets. The workshop will address the following questions:

  • What policy and technological trajectories are assumed by some different communities (climatology, biology, defense, and others to be decided by the committee) working on large-scale data integration?

  • What could be achieved if the assumed policy and technological advances are realized?

  • What are the threats to success? Who is working to address these threats?

The NRC Committee on Applied and Theoretical Statistics organized the activity, with the original impetus coming from discussions of the NRC’s Government-University-Industry Research Roundtable.

Advances in information technology have resulted in enormous
increases in the amount of data available to science and engineering researchers. This includes not only data from experiments and observations but also data generated by computer simulations. It is becoming common for research groups to quickly gather or generate terabytes of data, and a number of programs are accumulating petabytes of data. (One terabyte equals 10^12 bytes and 1 petabyte equals 10^15 bytes.) Data integration must overcome the challenge of finding disparate, distributed sources of data, which is often referred to as "data discovery," and the challenge of effectively utilizing the collective information in those sources to produce new insight—a process known as "data exploitation." The workshop on which this report is based did not try to characterize comprehensively the various ways in which data integration is useful or necessary for the advance of science.

The term "data integration" first emerged in connection with the need for organizations to provide data users "with a homogeneous logical view of data that is physically distributed over heterogeneous data sources" (Ziegler and Dittrich, 2004). The concept of data integration used here is a broad one, encompassing any technology, process, or policy that affects a scientist or engineer's ability to find, interpret, and aggregate/mine/analyze distributed sources of information. Data interoperability and knowledge discovery are both intended to be within the concept's scope.

All too often, data discovery depends on word of mouth: A researcher happens to have heard about a data set that might be useful in his or her own research or makes inquiries of colleagues in order to find relevant data.
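The "homogeneous logical view" that Ziegler and Dittrich describe can be illustrated with a minimal sketch. Everything below is hypothetical (the source formats, field names, and records are invented for illustration, not drawn from the workshop): a mediator translates each source's native records into one common schema, so that downstream analysis is written once against the unified view.

```python
# Minimal sketch of a mediated "homogeneous logical view" over two
# heterogeneous sources. All field names and records are hypothetical.

def from_station_format(record):
    # Source A stores temperatures in Celsius under its own keys.
    return {"site": record["station_id"], "temp_c": record["t_celsius"]}

def from_sensor_format(record):
    # Source B stores temperatures in Fahrenheit with different keys;
    # the mediator converts to the common unit on the way through.
    return {"site": record["sensor"], "temp_c": (record["temp_f"] - 32) * 5 / 9}

def unified_view(sources):
    """Yield records from every source, translated into one common schema."""
    for records, translate in sources:
        for record in records:
            yield translate(record)

source_a = [{"station_id": "KDCA", "t_celsius": 21.0}]
source_b = [{"sensor": "KDCA-2", "temp_f": 69.8}]

merged = list(unified_view([(source_a, from_station_format),
                            (source_b, from_sensor_format)]))
```

The translation functions are where most of the real-world effort goes; the point of the pattern is only that heterogeneity is confined to them.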
In fields where there are a limited number of large facilities (for example, high-energy physics and astronomy) or a predictable administrative structure for data storage (for example, national weather bureaus), the challenge may be manageable, although meeting it still often depends on a haphazard, serendipitous process. But in research fields where small groups can accumulate and store large amounts of data, valuable data sets can exist in many places. In particular, useful data might be held by someone who is outside the network of a researcher who is seeking those data. More problematic still are instances where a researcher seeks to integrate data from very different communities, such as geospatial data with sociological, medical, and other overlays. Such creative merging of knowledge can lead to very novel insights, but it is hindered by the data discovery challenge.

Once data sources have been found, data exploitation presents another set of challenges. A researcher must develop a clear understanding of the meaning of each of the data sets. Achieving such an understanding is difficult, because documentation of the conditions under which the data were collected can be spotty. Simple aspects such as the units of measure must be known definitively, and more subtle aspects such as environmental conditions, equipment calibrations, preprocessing algorithms, and so on can also be important. If data are being used for research outside the field for which they were collected, the risk of misinterpretation is severe, because research communities can have unstated assumptions about what to document or what to assume, and these assumptions can be overlooked during the integration process.

There are technical and policy challenges associated with the actual aggregation of data. If some data were collected with privacy guarantees, how should those guarantees be interpreted if only a subset of the data, or a summary of it, is used for a secondary analysis? There are also technical challenges in translating disparate data sets so that they can be merged: for example, putting maps into the same coordinate system, aligning data that were collected on different sampling grids, correcting for systematic differences among equipment, and so on.

For the purposes of the workshop, "large-scale data integration" was taken to refer to the aggregation of data sets that are so large that searching or moving them is nontrivial, a technical challenge that is becoming ever more common as it becomes easy to produce and store terabytes. Workshop participants were also aware that a growing number of opportunities require the aggregation of large numbers of modest-size data sets, and some of the workshop discussion reflects the challenges associated with those situations. To bound the discussion and produce the most useful outcomes, the workshop planning committee decided to focus on issues related to integrating scientific research data.1 The particular disciplines discussed include physics, biology, chemistry, Earth sciences, satellite imagery, astronomy, geospatial data, and research medical data. By and large, these are all structured data—that is, records of fairly rigidly formatted information.
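One of the merging steps mentioned above, aligning data collected on different sampling grids, can be sketched as linear interpolation of one series onto the other's grid. The instruments, sampling intervals, and values below are illustrative assumptions, not data from the workshop:

```python
from bisect import bisect_left

def resample(xs, ys, grid):
    """Linearly interpolate the series (xs, ys) onto new sample points.
    Assumes xs is sorted and every grid point lies within [xs[0], xs[-1]]."""
    out = []
    for x in grid:
        i = bisect_left(xs, x)
        if xs[i] == x:                      # exact hit: no interpolation needed
            out.append(ys[i])
        else:                               # interpolate between the neighbors
            x0, x1 = xs[i - 1], xs[i]
            y0, y1 = ys[i - 1], ys[i]
            out.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return out

# Hypothetical example: instrument A sampled hourly, instrument B every 90 min.
a_times = [0.0, 1.0, 2.0, 3.0]
b_times = [0.0, 1.5, 3.0]
b_vals  = [10.0, 13.0, 16.0]

# Put B on A's grid so the two series can be merged row by row.
b_on_a = resample(b_times, b_vals, a_times)   # [10.0, 12.0, 14.0, 16.0]
```

Real alignment problems also involve the systematic instrument differences and calibration issues noted above; interpolation only addresses the grid mismatch.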
In contrast, many data integration efforts outside scientific research deal more with unstructured data (text) and semistructured data (want ads, personnel records, and so on). Unstructured data and the needs of nonresearch users with an interest in data integration were not a focus of the workshop. Of course, there is a substantial gray area. For example, even when one is seeking and aggregating structured scientific data, tools designed for unstructured data might be necessary because structure may not be readily recognizable.

1 The statement of task and original work plan for the project documented in this report presumed two workshops and a committee consensus report. The project was scaled back to one workshop and a rapporteur-authored summary in order to align with available resources. The workshop planning committee decided that focusing the subject matter coverage on scientific research data and related communities would allow for the most productive discussion of issues and possible solutions during a single two-day workshop.

Michael Marron of the National Institutes of Health (NIH), a co-sponsor of the workshop, explained NIH's interest in the topic. The long-time predictions about the data deluge have come to pass: Many fields of science now have more data than they know what to do with. The amounts of data being collected are increasingly important to biomedical research. In addition, more and more research is now built on the analysis of data that were not collected by the researchers themselves, and many of the extant data have not been utilized to their full potential. Alex Szalay of Johns Hopkins University reported that analogous changes are underway in astronomy, with the collection of data increasingly separated from its subsequent analysis, which is a disruption from the way science has been practiced over the centuries. Increasingly, the connection between data and their analysis is facilitated through data archives and different sorts of federation services. This represents a new way of doing science, and the infrastructure must be able to support it.

Dr. Marron expressed concern about science's abilities to share, manage, and curate data; correct errors; and map the provenance of data. In short, he is concerned about all of the factors that go into ensuring the reliability of data and enabling their exploitation. Thus, NIH is exploring where to make investments in building those capabilities and generally developing parts of the information infrastructure. He pointed to the Biomedical Informatics Research Network (BIRN) as an example of an NIH investment in information. He said it is not the solution, but that it is an important contribution to an infrastructure that will help facilitate sharing data and tools. The Cancer Bioinformatics Grid (caBIG), a similar infrastructure for cancer-related research, is another example. Dr. Marron said that it is far from clear how one can find and access data.
He noted the common hope for a capability that would be as useful as Google and other search engines but that could also perform more of the exploration and filtering that is now left to researchers. This hoped-for tool could work with multidimensional data and could find not only the data that are deliberately made available ("published" or placed in repositories) but also the huge amounts of data that are less readily identified but nevertheless of value to people other than those who collected them. It would also have to have the ability to recognize data sets that are similar, redundant, or overlapping.

Ed Seidel of the National Science Foundation (NSF) explained that the tendency to collaborate is increasing in every single area of science. He gave the example of modeling the effects of hurricanes and storm surges, which requires bringing together a wide range of models and data, including satellite observations, atmospheric models, storm-surge models, wave models, levee models, traffic flow models, and so on. This increase in the prevalence of collaboration calls for cyberinfrastructure to support distributed teams of researchers who collaborate through sharing data.
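One simple notion of "overlapping" data sets, which the hoped-for discovery tool described above would need, is the fraction of record identifiers two collections share. The Jaccard similarity below is a standard measure for this; the survey names and identifiers are purely illustrative:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0          # two empty sets are trivially identical
    return len(a & b) / len(a | b)

# Hypothetical record identifiers from two independently collected data sets.
survey_2008 = {"site-001", "site-002", "site-003", "site-004"}
survey_2009 = {"site-003", "site-004", "site-005", "site-006"}

overlap = jaccard(survey_2008, survey_2009)   # 2 shared of 6 total
redundant = overlap > 0.9                     # flag near-duplicate data sets
```

In practice the tool would also need similarity over schemas and content, not just identifiers, but set overlap of keys is a common first-pass filter for redundancy.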

The workshop examined a collection of scientific research domains, with application experts explaining the issues in their disciplines and current best practices. This approach allowed the participants to gain insights about both commonalities and differences in the data integration challenges facing the various communities. In addition to hearing from research domain experts, the workshop also featured experts working on the cutting edge of techniques for handling data integration problems. This provided participants with insights on the current state of the art. The goals were to identify areas in which the emerging needs of research communities are not being addressed and to point to opportunities for addressing these needs through closer engagement between the affected communities and cutting-edge computer science.

The workshop also discussed policy barriers to widespread data sharing, considering the pros and cons of various ways forward.