Appendix C | Data for Science and Society: The Second National Conference on Scientific and Technical Data | U.S. National Committee for CODATA | National Research Council

U.S. National Committee for CODATA
National Research Council

Appendix C

Conference Abstracts

Plenary Session Abstracts

Plenary Session 1
Interdisciplinary and Intersectoral Data Applications:
A Focus on Environmental Observations

Session Chair: Susan Zevin, National Oceanic and Atmospheric Administration

     The quality of our lives and the health of our environment will be determined by the choices and decisions we make today. There are clear connections among the environment, the economy, and society--economic growth, maintenance of environmental quality, and wise use of resources must go hand-in-hand to ensure a rising standard of living for us all. National efforts to promote global environmental stewardship include describing, assessing, monitoring, and predicting Earth's environment. For our nation, we must develop a national decision support system in order to save lives and protect property, promulgate public policy, manage and conserve living resources, and enhance economic prosperity and quality of life. Critical to the success of these goals are the long-term stewardship of, access to, and use of environmental data. Our first session examines these issues, including some of the most creative applications of weather and climate data in the development of new businesses.

Gateway to the Earth: A Framework for a National Decision Support System

Barbara Ryan, U.S. Geological Survey

     "Gateway to the Earth" is a vision for information management at the U.S. Geological Survey (USGS) that could provide a framework for a National Decision Support System. It is an organizing principle, a system for accessing, integrating, managing, and delivering the information assets of the USGS to a wide range of users. These assets are the sum total of USGS natural science data, information, and knowledge (geospatial, temporal, and textual) regardless of communication media (analog or digital). The cost of developing these assets, which are not static, but continually growing, is currently estimated to be $20 billion, yet as an organization the USGS has not effectively exploited the full use of these assets. To promote their optimal use, these information assets must first be integrated. Yet even within the USGS, and certainly within the government, existing systems, processes, practices, and even culture result in disintegration rather than integration.

     Anyone who starts to build a geographic information system soon realizes there is no one place to go to acquire or access the full range of information assets the USGS, much less the broader Earth and natural science communities, have to offer; a fully implemented Gateway to the Earth will correct that. Access to these information assets must be made easier, both internally and externally. One should have easy and full access to the entire suite of information, whether the point of entry is discipline based (biology, geology, geography, or hydrology), theme based (hazards, resources, environment, or information), geospatially based (place name, latitude, or longitude), organizationally based (science center, district office, branch, or field station), or time based (date, time series).

     Historically the USGS has done an excellent job of integrating data and information horizontally as indicated by many national coverages of selected data sets (e.g., topography, surficial geology, bedrock geology, aquifers, aerial photography, land use and land characterization, ecosystem coverage). Now, these horizontal national coverages must be better integrated vertically.

     Gateway to the Earth also needs to have pointers to each partner's information, where available. Although the USGS has tremendous amounts of data, measured in terabytes and possibly petabytes, there are many other sources of data and information. Gateway to the Earth is about creating full and easy access to Earth and natural science information for addressing the needs of citizens, scientists, resource managers, and policy officials. It is a coherent set of interfaces that enable diverse users to find, get, and use natural science information in ways that are meaningful to them. Integrating information across sectors is a necessary next step in building a National Decision Support System.

Advances in Applied Interdisciplinary Research Using Environmental Observations

Thomas Gay, Vista Information Solutions

     VISTAinfo develops Internet and desktop technology solutions, and owns and maintains the largest database of location-specific real estate, environmental, and insurance underwriting information in the United States. VISTAinfo is continually developing new technologies for integration and analysis of its data; the newest project uses XML database technology and the Internet to bring a company's environmental and property information together. Advances in technology assist with the integration and distribution of company data and products, including VISTAScore for the insurance and banking industries, for consumers, and the NEPA layers of data.

The Intergovernmental Panel on Climate Change (IPCC) and the Futures Market: Common Climate Data Management Requirements?

Thomas Karl, National Climatic Data Center

     Over the past few decades climate scientists have focused on being able to discern climate variability and change from the instrumental records of the past century or more. The data that scientists have worked with have been contaminated by many types of biases. This has been the topic of numerous articles, and has been a major issue in each of the IPCC assessments. Recently, the futures market has started trading several climate financial instruments. Utilities, banks, insurance companies, re-insurance companies, commodity exchanges, and other businesses have taken a special interest not only in past climate variations and changes but also in the delivery of real-time, high-quality climate data to settle contracts. There are many areas of common interest in climate data between the IPCC and the futures market. We will explore the history up to this point, the challenges we expect in the future, and how we can exploit these unlikely partners in the management of climate data.

NYMEX Energy Markets and the Use of Environmental Data

Bradford Leach, New York Mercantile Exchange

     Mr. Leach did not submit an abstract. Please refer to Chapter 7 of these Proceedings for the full text of his presentation.

Weather Risk Management

Lynda Clemmons, Weather Risk Management Association

     This presentation examines how the weather impacts utility planning and budgeting, and how the industry's needs for reliable and accurate weather data are being addressed.

Teaching Our Kids about Science and the Natural Environment

Steven Richards, District 11 Weather Study Program, Bronx, New York

     This presentation explores the benefits of using near-real-time environmental data in K-12 education. The activities of New York City's District Eleven Weather Study Program (DEWS), grades 5-8, will be highlighted. The use of environmental information in DEWS over the last 16 years will be summarized. Recommendations will be offered to overcome obstacles that currently impede the wider implementation of environmental data-based programs and activities in our nation's schools.

Weather Data--Implications of Increased Privatization

Raymond Ban, The Weather Channel

     During the past decade, private organizations have deployed observational systems to gather a variety of weather and weather-related data. As the cost of technology continues to fall, it appears likely that this trend will continue and that by the end of the next decade, a significant percentage of critical weather data will be controlled by private industry. How will the community deal with weather data as intellectual property and resolve the potential conflict between competitive advantage and scientific advancement?

Plenary Session 2
Improving the Data Policy Framework

Session Chair: Paul Uhlir, National Research Council

     The advent of the digital information revolution has brought about an explosion of databases in all areas of research, with progress in many fields now highly dependent on the creation, access to, and use of highly specialized data. This vast increase in scientific and technical (S&T) database activity has resulted in competing pressures on the legal and public policy communities to provide appropriate guidance to the creators, disseminators, and users of these critical information resources. This session will examine recent legal and policy developments and trends impacting the S&T database activities in government, academia, and industry and the relationships among these sectors in the management of their respective rights and obligations.

Intellectual Property Law and Policy Issues in Interdisciplinary and Intersectoral Data Applications

Stephen Maurer, Attorney

     Society receives full benefit from its investment in science only when data are transmitted to all potential users, including those who work in different disciplines and/or sectors of the economy. One way to think about this transmission is to consider what economists call a "cumulative innovation model." Comparing this model to real-world S&T databases provides significant insights into how the current system works, where it has been successful, and what challenges remain. Efforts to reform the system will almost certainly rely on intellectual property law and related legal approaches. The strengths and weaknesses of these approaches will be discussed. The presentation will conclude by asking how database protection legislation currently pending in Congress would affect the flow of S&T data between and among disciplines.

Obtaining Descriptive Data to Describe Database Use and Users: Policy Issues and Strategies

Charles McClure, Florida State University

     Obtaining accurate and reliable information that describes use, users, access, downloads, visits, and other data related to scientific and technical information (STI) databases is often difficult if not impossible. These difficulties are magnified considerably when one tries to produce "industry-wide" descriptions of use and users of STI (and other) databases. Obtaining such information is important for improving the quality and responsiveness of public and private-sector policies for the management and regulation of online databases and for various research purposes. This presentation will identify and describe selected issues related to this problem and propose solutions by which we can obtain better and more accurate data about database use and users.

University Technology Transfer Practices: Reconciling the Academic and Commercial Interests in Data Access and Use

Lita Nelson, Massachusetts Institute of Technology

     Increasing interactions between industry and universities in research collaborations invariably bring about conflicts between industrial needs and academic principles. Experience has shown us, however, that sophisticated crafting of agreements, coupled with strong, inviolable policies, can hold the line and preserve academic freedom of action and of dissemination while still meeting our industrial partners' commercial needs. This presentation discusses the policies that guide such agreements at one major research university and reviews several examples of how these agreements work in practice.

Using Scientific and Technical Data in the National Interest

The Honorable Rush Holt (D-NJ)

     Representative Holt did not provide an abstract for his presentation. Please refer to Chapter 14 of these Proceedings for the full text of his presentation.

Plenary Session 3
Promoting Data Applications for Science and Society:
Technological Challenges and Opportunities

Session Chair: Julian Humphries, University of New Orleans

     Information systems used in S&T research, including such diverse areas as Earth observation information systems, computer-aided modeling applications, geographic information systems, and bioinformatic applications and databases, share the characteristic that they store, manage, and disseminate large amounts of relatively complex data. The complexity of these data is multiplied when content providers and managers are asked to share scientific data with novel audiences, or asked to provide means of integrating their data with those of other domains. Technological advances in access to commercial and public data and knowledge have significantly raised user expectations about access to scientific data. This session will examine these issues from the viewpoint of data discovery, access, integration, preservation, and standards. Particular attention will be given to the areas in which significant research or technological innovation remains to be done on how best to meet these goals.

Data Mining and Databases

Usama Fayyad, Microsoft Research

     Data mining is about finding interesting structure from databases, especially large data stores. Understanding how databases can accommodate data mining operations is fundamental to making these techniques convenient and easy to use. Operating under such scalability constraints poses interesting problems for how models can be built and what methods are practical. This presentation will outline the research challenges and opportunities and an approach to setting standards for mining data. An application in science data analysis will be covered for illustrative purposes.

Diverse Geospatial Information Integration

Daniel Gordon, Autometric, Inc.

     This presentation addresses issues in the integration and display of disparate data sets for science and technology users. The advantages of multisource, multiresolution visual fusion will be discussed and demonstrated. Specific examples will be used to demonstrate key advantages of visual fusion using information available from the National Oceanic and Atmospheric Administration and other government sources. Such applications are especially important for understanding technically challenging problems and will assist the environmental community in educating the public and scientists about key environmental issues.

Long-term Preservation of High-Quality Information

Walter Warnick, U.S. Department of Energy

     Information preservation in the digital age poses problems foreign to preservation in the paper age. The digital age rules out passive preservation, leaving proactive preservation as the only alternative. Proactive preservation relies on migration technologies that become outmoded every few years and on physical media and equipment of unknown but potentially short life span. The need for regular migration implies that the permanence of information is no greater than the permanence of the organizations hosting it. For government-sponsored information, which includes the great bulk of deliverables coming from basic research in the United States, government institutions with an information mission are the best solution for ensuring preservation. Beyond mere preservation, such organizations are also best suited to promote permanent public access. Today, digital national libraries allow federal agencies to envision searchable and comprehensive collections of information through which information can be preserved and made permanently accessible.

Developing Standards for Interdisciplinary Data Applications

John Rumble, Jr., Standard Reference Data Program, National Institute of Standards and Technology

     The triple revolutions caused by advanced computers, advanced informatics, and the Internet have changed forever the way scientific and technical data are stored, located, and disseminated. There is no turning back. Today almost every scientist uses the Internet to share data, sometimes just with close colleagues, other times through large-scale databases used by scientists in many fields. Standards are clearly going to be required to allow data exchange and data sharing to proceed smoothly and coherently, especially across disciplines and sectors. Some areas of science already have such standards, specifically crystallography, X-ray photoelectron spectroscopy, materials testing (in part), and neutron interactions. In addition, virtually every international scientific union has established standard nomenclature for its discipline, though almost never from the perspective of computer database building and dissemination. This talk will identify the common elements of all scientific data exchange standards, including those related to substance and object description, test and property data, test conditions, and data and database quality. It will discuss how data formats can be integrated together, thereby linking data from different disciplines. Finally, the opportunities for CODATA to develop specific guidelines for scientific data exchange standards and to provide support for standards development in a few key areas will be identified.

Experience with Metadata on the Internet

Jim Restivo, Blue Angel Technologies, Inc.

     This talk describes experiences with taking community-driven standards and providing solutions that gather, manage, and publish metadata over the Internet. A brief overview of metadata, related communities, and standards will be provided. Most of the talk will consist of high-level case study overviews, which show how organizations and groups are using metadata as a means of publishing, sharing, and using information on the Internet.

Plenary Session 4
Promoting Data Applications for Science and Society:
Organizational and Management Issues

Session Chair: Goetz Oertel, Association for University Research in Astronomy

     The final session addresses issues that derive from organizational and management challenges and opportunities in the use of S&T data for interdisciplinary research and for other purposes that may not have been anticipated when they were acquired. Several developments discussed in the previous sessions are making these issues timely and pressing: the rapid expansion in the capability to acquire, process, and transmit data; the emergence of new fields of investigation and entrepreneurism enabled by this expansion; and various related policy developments. This session examines management approaches to the use of agricultural and health data; focuses on several key issues in making S&T databases widely available and usable across disciplines and sectors; and discusses methods for evaluating data management productivity and performance.

Data Management Issues at the Agricultural Research Service

Floyd Horn, Agricultural Research Service

     The Agricultural Research Service (ARS) is the intramural research arm of the United States Department of Agriculture. The Agency's research programs are carried out by a permanent core of about 2,000 PhD scientists working at 104 locations across the country, supported by an annual budget of approximately $1 billion. Examples of key ARS research with enormous data management challenges include human nutrition; plant, animal, and microbial germplasm and genomics; food safety; natural resources; and international programs with countries of the former Soviet Union. Public access to research data, particularly those developed in the federal sector with public funds, is critical to the long-term availability of food and fiber for America.

Making Health Information Available to the Public: The Federal Opportunity

Donald Lindberg, National Library of Medicine

     The historic level of federal investment in medical research is resulting in a wealth of high-quality information for the public. This information is produced not only by the National Institutes of Health (NIH) but also by academic institutions, professional societies, and other nonprofit organizations. The National Library of Medicine, a part of the NIH, has created MEDLINEplus, a World Wide Web site that provides free access to much of this information for consumers. This new site, combined with the library's easy-to-use systems for accessing the MEDLINE database of 10 million medical journal article references and abstracts, has resulted in an unprecedented volume of authoritative health information being available to the public. Nonetheless, much further study of consumer use and comprehension of this information is needed, as is continued study of non-Internet users.

Evaluating Data Management Productivity and Performance in Government: The View from the Trenches

Thomas Mace, Data Management Working Group, CENR and SGCR

     As an interagency coordinating body for virtually all the federal government's environmental data, the Data Management Working Group (DMWG) does not fall under the Government Performance and Results Act requirements for reporting performance goals. Individual components in the agencies do respond to the Act's requirements, and the intent of the law is to encourage planning, review, and management by objective. These are useful practices for any entity that wishes to be successful, and they have been adopted by the working group in the development of the Global Change Data and Information System (GCDIS) for the Global Change Research Program, and its extension to other Committee on Environment and Natural Resources (CENR) data and information management requirements. GCDIS has developed into an operational system consisting of agency components, supported by interagency coordinating and user services functions serving both the scientific and nonscientific communities.

     There are two important guiding principles that help make GCDIS successful. The first is that the system is supported by professional agency data centers, which provide data and information far beyond that generated by the Global Change Research Program. The second principle is that data and information, to the largest extent possible, should be made available on a nondiscriminatory basis, at no more than the cost of reproduction and dissemination. This presentation will discuss a number of challenges to federal data management programs in meeting established goals and assessing progress.

Promoting Data Access for a Broad User Base

Matthew Schwaller, National Aeronautics and Space Administration

     Dr. Schwaller did not provide an abstract of his presentation. Please refer to Chapter 23 of these Proceedings for the full text of his presentation.

Providing Incentives for Interdisciplinary Data Activities Across Government-Academia-Industry Boundaries [1]

Henry Etzkowitz, Science Policy Institute, State University of New York

     Beyond the "Endless Frontier" of linear models lies a continuous series of experiments on the relationships among science, industry, and government in creating the conditions for future innovation: the Endless Transition. There is no fixed end point to transition, nor are the former Soviet Union and the Central and Eastern European countries the only ones in transition: so are the United States, Asia, Western Europe, and Latin America. In this transition, there is a need for a significant, but not dominating, role for the state in science and technology policy. In the United States, where laissez-faire ideology is strong, government plays a significant, if not always obvious, role in science, technology, and industrial policy.

     One reason government's role is not always obvious in the United States is that policy formation and execution often occur through university-industry-government relations, the triple helix. In addition to performing their traditional functions, the institutional spheres also take the role of the other, with universities creating an industrial penumbra, or performing a quasi-governmental role as a regional innovation organizer. Although the endless transition is an international phenomenon, it does not follow a single course. The common goal is to build on existing resources and new initiatives to create niches of technological innovation and secure a place in the division of labor in the global economy.

     In this context, data activities go beyond moving bits of information across organizational and institutional boundaries. Rather, they encompass the set of informal relationships and formal mechanisms through which joint activities occur. Interdisciplinary data activities can be encouraged from various institutional sources, individually and collaboratively. This presentation looks at examples of several such mechanisms.

[1] Dr. Etzkowitz's presentation is not included in these Proceedings.

Abstracts of Contributed Papers

(Arranged in alphabetical order by principal author)

Problems and Solutions in the Integration of Population Data with Other Disparate Data Sets

Deborah Balk and Gregory G. Yetman, Center for International Earth Science Information Network (CIESIN), Columbia University

     When creating databases that cross disciplines, units of analysis are often compromised. This paper examines three approaches to data integration, each of which considers the problems of varying and seemingly incompatible analytic units. We highlight the following issues associated with building, maintaining, and using a database: federally and commercially dictated data restrictions, confidentiality, database documentation and metadata, foreign-language translation, and cross-national variable compatibility.

     To integrate population data with other disparate data sets, CIESIN uses three approaches which differ in scope (national to global), scale (first- to fourth-level administrative boundaries), and thematic breadth (single variable to multivariate). These approaches include the creation of (1) a gridded global database of population (Gridded Population of the World--GPW); (2) a tool to visualize and export data across contiguous national boundaries (U.S.-Mexico Demographic Data Viewer--DDViewer); and (3) a tool to generate equivalencies between U.S. geographies (Geocorr).

     Each approach deals with data integration issues at the sub-national level; GPW and Geocorr also facilitate integration of data collected by administrative units with georeferenced biophysical data. The U.S.-Mexico DDViewer contains social, economic, and health behavioral data for three levels of boundaries. The approaches vary in the problems they address, but all are models highly applicable to other themes and scales.

A Framework for Science Data Access Using XML

Daniel Crichton, J. Steven Hughes, Jason Hyon, and Sean Kelly, Jet Propulsion Laboratory

     Science missions and instruments continue to produce volumes of useful data, and scientists depend on the data systems and tools that archive these data to access and analyze them. These legacy systems do not interoperate well, and scientists must access each data system and its corresponding science data independently through tools that have been custom built for the particular science data system or mission. The Object Oriented Data Technology Task is working on the Distributed Resource Location Service, which will allow geographically distributed data to be located and exchanged. Advances in Internet and distributed-object technologies provide an excellent framework for sharing data across multiple data systems. The Extensible Markup Language (XML) and the Common Object Request Broker Architecture (CORBA) provide support for electronic data interchange between heterogeneous data sources. CORBA provides over-the-wire exchange of XML-based profiles that contain descriptive information of science products archived at remote data systems. This paper will discuss a framework for data system interoperability that will not only benefit space science but also provide a cross-disciplinary solution for a next-generation data system architecture.
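     As a rough illustration of the idea of XML-based profiles that describe archived science products, the sketch below parses a small profile and looks up products by keyword. The element names, identifiers, and URLs are hypothetical and are not taken from the paper; the actual profile schema and the CORBA transport are not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: element names, ids, and URLs are illustrative only
# and do not reproduce the schema used by the authors.
PROFILE_XML = """
<profile>
  <resource id="img-001">
    <title>Surface image mosaic</title>
    <location>https://data.example.org/archive/img-001</location>
    <keyword>imaging</keyword>
    <keyword>surface</keyword>
  </resource>
  <resource id="spec-042">
    <title>Infrared spectra</title>
    <location>https://data.example.org/archive/spec-042</location>
    <keyword>spectroscopy</keyword>
  </resource>
</profile>
"""


def find_resources(profile_xml, keyword):
    """Return (id, location) pairs for resources tagged with the keyword."""
    root = ET.fromstring(profile_xml)
    matches = []
    for resource in root.findall("resource"):
        keywords = [k.text for k in resource.findall("keyword")]
        if keyword in keywords:
            matches.append((resource.get("id"), resource.findtext("location")))
    return matches
```

     A client that receives such a profile from a remote data system could call `find_resources(PROFILE_XML, "imaging")` to learn where the matching products are archived before requesting them.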

The High Altitude Observatory Data Service: Experience in Interdisciplinary Data Delivery

Peter Fox, Jose Garcia, and Patrick Kellogg, National Center for Atmospheric Research (NCAR)

     The High Altitude Observatory (HAO) division of NCAR investigates the sun and the Earth's space environment, focusing on the physical processes that govern the sun, the interplanetary environment, and the Earth's upper atmosphere. HAO is a focal point for two important programs: (1) the Coupling, Energetics and Dynamics of Atmospheric Regions (CEDAR) program designed to enhance the capability of ground-based instruments to measure the upper atmosphere and to coordinate instrument and model data for the benefit of the scientific community, and (2) the Radiative Inputs from Sun to Earth program designed to address causes of variations in the Sun's radiation as a star as well as the source of radiant energy at the Earth.

     In this paper, we detail a two-year effort at the HAO to access data services uniformly. The underlying technology uses common application programs: the Interactive Data Language, the Web, and the Distributed Oceanographic Data System (DODS). We will describe the design, implementation, and support of each component, including end-user search and access via the Web and applications, data transmission and subsetting, and data format support on servers. New support was added to DODS for the database format of the CEDAR program and for the Flexible Image Transport System. Since DODS uses URLs to locate data, several server-side functions were designed and implemented to simplify the URL syntax. Close attention is paid to evaluating the productivity and performance of each part of the systems. As a result, a new implementation of the DODS server architecture was developed using the Apache Web server API, allowing significant performance improvements in delivering large data sets.
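     To make the notion of locating and subsetting data through a URL concrete, the following minimal sketch assembles a request URL, assuming the bracketed `[start:stride:stop]` hyperslab form commonly used in DODS constraint expressions. The server address, file name, and variable name are hypothetical, and the sketch does not represent the server-side functions developed at the HAO.

```python
def dods_url(dataset_url, variable, slices, response="dods"):
    """Build a subsetting URL for one variable.

    slices is a list of (start, stride, stop) tuples, one per dimension.
    The hyperslab syntax is an assumption about the constraint-expression
    form; the dataset URL and variable name are supplied by the caller.
    """
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in slices
    )
    return f"{dataset_url}.{response}?{variable}{hyperslab}"


# Hypothetical server, file, and variable names, for illustration only.
example = dods_url(
    "http://server.example.org/dods/cedar/sample.nc",
    "temperature",
    [(0, 1, 9), (5, 1, 5)],
)
```

     Because the whole request is carried in the URL, any client that can fetch a Web address can ask for a subset, which is also why a long or awkward URL syntax is worth simplifying on the server.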

Increasing Access to Distributed, Multidisciplinary Data through Application of a Biological Species Names Thesaurus

Michael Frame and Michael Ruggiero, U.S. Geological Survey

     Much of the scientific and technical data relating to the natural world include references to the scientific and/or vernacular names of the species or higher taxonomic groups that are represented in the data. Biological names are thus the common denominator that can be used to link data from many distributed sources and across disciplines, from molecular biology and genetics to entire ecosystem-level studies. The U.S. National Biological Information Infrastructure Program is developing and implementing an innovative approach to using biological names to increase access to multidisciplinary environmental data. This system integrates a scientifically credible and dynamic biological names thesaurus with a suite of Web-based indexing and searching tools (including metadata content standards for data set documentation, metatags for Web page documentation, and specialized search engine configurations). The relative effectiveness of this approach versus more conventional strategies is demonstrated. The system relies on a partnership between biological scientists involved in building and maintaining the names thesaurus and information scientists interested in using these names to fuel data discovery and access tools. The advantages to both groups in working together on solutions are described.
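     The linking role of a names thesaurus can be sketched in a few lines: every known scientific or vernacular name resolves to one canonical identifier, and records from different sources are grouped by that identifier. The identifiers, source names, and thesaurus entries below are hypothetical and are not drawn from the program's actual thesaurus or tools.

```python
# Hypothetical thesaurus: each known name (scientific or vernacular) maps to
# one canonical identifier. The identifiers are made up for illustration.
THESAURUS = {
    "haliaeetus leucocephalus": "TAXON-0001",
    "bald eagle": "TAXON-0001",
    "american eagle": "TAXON-0001",
    "salmo salar": "TAXON-0002",
    "atlantic salmon": "TAXON-0002",
}


def canonical_id(name):
    """Normalize case and whitespace, then look the name up."""
    return THESAURUS.get(" ".join(name.lower().split()))


def link_records(records):
    """Group (source, name) records by canonical taxon identifier.

    Records whose names are not in the thesaurus are skipped.
    """
    linked = {}
    for source, name in records:
        taxon = canonical_id(name)
        if taxon is not None:
            linked.setdefault(taxon, []).append(source)
    return linked


records = [
    ("survey_db", "Bald  Eagle"),
    ("genetics_db", "Haliaeetus leucocephalus"),
    ("fisheries_db", "Atlantic salmon"),
    ("misc_db", "unknown bird"),
]
linked = link_records(records)
```

     Here a field survey that uses a vernacular name and a genetics database that uses the scientific name end up under the same identifier, which is the kind of cross-discipline link the abstract describes.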

Diverse Geospatial Information Integration

Daniel Gordon, Alfred Powell, and Phillip Zuzolo, Autometric, Inc.

     This paper addresses issues in the integration and display of disparate data sets for science and technology users. The advantages of multi-source, multi-resolution visual fusion will be discussed and demonstrated. Specific examples will be used to demonstrate key advantages of visual fusion using information available from NOAA and other government sources. Such applications are especially important for understanding technically challenging problems and will assist the environmental community in educating the public and scientists about key environmental issues.

A Data System to Integrate Data from Landscapes, Streams, and Estuaries for Determining the Condition of Estuaries on the U.S. Mid-Atlantic Coast

Stephen S. Hale and John F. Paul, U.S. Environmental Protection Agency

     Estuaries are natural integrators of substances and processes that occur internally and externally (watersheds, ocean, atmosphere). Watershed activities that contribute fresh water, nutrients, contaminants, and suspended solids have a strong effect on estuary health. Researchers trying to understand estuary conditions must do a similar integration, using data from many scientific disciplines. Because these data come from numerous databases, operated by different organizations in various formats, it is often a challenge to find and integrate them. The Mid-Atlantic Integrated Assessment (MAIA), a pilot for projects of the Committee on Environment and Natural Resources, gathered data from many sources in the U.S. mid-Atlantic coastal region and integrated them with MAIA data from landscapes, streams, and estuaries. The purpose was to assess current conditions and to establish a data system that will support continuing assessments. Problems in finding and using data from diverse sources were approached with a variety of data management tools including data directories, inventories of monitoring programs, analytical databases, geographic information systems, data clearinghouses, and data warehouses. Encouraging data owners to move toward common standards, directories, and data descriptions for databases with distributed ownership has made it easier to find, download, understand, and integrate data.

Practical Challenges in Creating an Integrated National-Level Environmental Data Set: Lessons Learned from the Environmental Sustainability Index

Marc Levy, Jessie Cherry, Alex de Sherbinin, and Francesca Pozzi, Center for International Earth Science Information Network, Columbia University

     There is a growing demand for data about environmental sustainability that covers most of the world's countries, is comparable across those countries, and integrates across physical, biological, and socioeconomic domains. Generating such data poses severe practical challenges. Many variables are missing data for many countries. Some data are point based; others are available only in gridded or vector format. Often, data are based on voluntary reporting. In many cases a variable desired on conceptual grounds is simply not available.

     This paper reports on lessons learned through the creation of the Environmental Sustainability Index, a prototype commissioned by the World Economic Forum in conjunction with the Yale Center for Environmental Law and Policy. A variety of strategies were employed to try to overcome these challenges, including

  • developing a nonarbitrary way to limit the number of countries so as to reduce missing-data problems;

  • using survey data to augment physical measurements;

  • developing weighting schemes to convert point-based measurements to national aggregates;

  • using GIS methods to aggregate gridded and vector data;

  • creating numerical data series from textual reports and other nonquantitative sources; and

  • developing a heuristic structural model of sustainability to permit meaningful integration across various categories of variables.

     In development of the index, it is critical to balance the immediate need for usable policy information with the longer-term potential for more accurate and complete data on sustainability.
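     One of the strategies listed above, converting point-based measurements to national aggregates through a weighting scheme, can be illustrated with a minimal sketch. The abstract does not say which weights were used for the index; the weights here (for example, the population represented by each measurement site), the function name, and the sample readings are all hypothetical.

```python
from collections import defaultdict

def national_aggregates(points):
    """Weighted mean of point measurements, grouped by country.

    points: iterable of (country, value, weight) tuples. The weight is a
    hypothetical stand-in, e.g. the population represented by each site.
    """
    weighted_sums = defaultdict(float)
    total_weights = defaultdict(float)
    for country, value, weight in points:
        if value is None or weight <= 0:
            continue  # skip missing values and unusable weights
        weighted_sums[country] += value * weight
        total_weights[country] += weight
    # Countries with no usable measurement are simply absent from the result.
    return {c: weighted_sums[c] / total_weights[c] for c in weighted_sums}

# Made-up readings for two countries; one value is missing.
readings = [
    ("A", 10.0, 1.0),
    ("A", 30.0, 3.0),
    ("B", 8.0, 2.0),
    ("B", None, 5.0),
]
aggregates = national_aggregates(readings)
```

     Leaving countries without usable measurements out of the result, rather than imputing a value, keeps the missing-data problem visible to whoever assembles the final index.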

Advanced Visualization of Scientific Metadata

Ion Mateescu, Center for International Earth Science Information Network, Columbia University

Lucy Nowell and Leigh Williams, Pacific Northwest National Laboratory

Karen L. Moe, National Aeronautics and Space Administration

     As more and more metadata catalogs become available online, researchers face the challenge of information overload. Many search tools, such as specialized thesauri and more sophisticated query interfaces, help users narrow their search when users know specifically what they want. In many cases, however, users are not familiar with what data are potentially available and how different types of data may relate to each other. They may need assistance in exploring the contents of one or more distributed data catalogs in order to better understand the universe of potentially relevant data sets and interrelationships among them.

     To help alleviate this problem and to increase efficiency in dealing with large metadata collections, the Center for International Earth Science Information Network (CIESIN) at Columbia University is applying research on text visualization to the world of scientific data catalogs. Researchers at Pacific Northwest National Laboratory and CIESIN have teamed up to apply the search, visualization, and analysis capabilities of WebTheme to collections of metadata documents retrieved using the Z39.50 protocol. This paper describes some of the difficulties and opportunities encountered in this prototype project.

The National Climatic Data Center's Policy on the Quality Assurance of Daily Temperature Observations

Matthew J. Menne and Michael Crowe, National Climatic Data Center

     The National Climatic Data Center has used a variety of quality assurance techniques to detect errors in temperature and other variables as data are operationally ingested and processed prior to archival. As part of a recent initiative to monitor the "health" of NOAA's observational networks, new quality assurance methods have recently been developed and added to the existing suite of data processing algorithms to improve timely error detection in temperature data from two of these networks: the National Weather Service Cooperative Observer Network and the Automated Surface Observation System. While enhanced error detection of observations from these and other NOAA observational platforms will benefit the scientific community in assessments of climate and global change, the quality assurance of weather and climate data has received increased attention with the addition of a growing private-sector constituency--weather risk management. In fact, the user needs of this major new constituency have provided the incentive for the NCDC to revisit such issues as the impact of evolving quality assurance methods on historic archives and its policy on supplying substitute values for missing values and for observations flagged as potential errors. This paper will report on progress towards the formulation of a consistent policy regarding these issues.

Integrating Data from the Internet: Are Metadata All We Need?

R.J. Olson and R.A. McCord, Oak Ridge National Laboratory

     Ecologists are mining data from the Internet in addition to collecting field measurements to address today's questions about the effects of human activities on regional and global processes. The Internet provides information on aspects of the environment at broader spatial and temporal perspectives than ecologists could easily acquire from their own fieldwork. Despite the ever-increasing computer power and freely available databases, integrating data from multiple sources is difficult. A major barrier is having adequate metadata; however, even with well-documented data from fully functional archives, these data often cannot be easily combined to produce credible results. When data are combined, new problems and opportunities arise. For example, in a project to evaluate the performance of regional ecosystem models, the model outputs were used to identify unreasonable combinations of climate, land cover, and productivity data that had been compiled from separate sources. A major effort was invested in cleaning up these data for the analysis, including evaluating the outliers by a diverse group of scientists. The new integrated data set was a significant product in itself. Based on our experience with this and other projects, our poster will illustrate pitfalls and ways to take advantage of combining information from multiple sources.

     Note: Oak Ridge National Laboratory is managed by Lockheed Martin Energy Research Corp. under contract DE-AC05-96OR22464 with the U.S. Department of Energy. The U.S. government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution or allows others to do so for U.S. government purposes.

Data for Assessment of Irrigated Soils Degradation and Management of Long-Term Soil Preservation

V. Prikhodko, Department of Land, Air, and Water Resources, University of California, Davis

M.J. Singer, Department of Land, Air, and Water Resources, University of California, Davis

E. Manakhova, Postgraduate student, Moscow University, Russia

     Irrigation is one of the most powerful anthropogenic factors that influences soils, water, and the environment. Seventy percent of the water used by humanity goes to irrigation. Irrigation provides favorable conditions for plants, but frequently results in soil deterioration. Degradation can vary from slight, with little change in soil functioning in the biosphere and a crop yield reduction of less than 10 percent, to very severe, in which soil functions are severely affected and crop yields are decreased by 75-90 percent. Both static and dynamic soil properties are affected. Soil degradation can be evaluated using one or many parameters simultaneously. Early diagnosis of negative transformations caused by irrigation is facilitated by the study of micro- and meso-scale soil properties. Reversible processes (e.g., changes in humus content, water-stable aggregates, gypsum, other salts, and nutrients) and irreversible processes (e.g., microstructure destruction, reduction of total and reactive organic content, and clay loss) are considered. Quantitative parameters that characterize favorable conditions of irrigated soils and measures of the degree of degradation will be presented. Examples of irrigation effects on soil carbon, soil structure, and porosity from the University of California Long Term Research in Agricultural Systems (LTRAS) plots will be presented to illustrate the kinds of management effects.

Better Access Through Federal and Private-Sector Data Integration

Hedy J. Rossmeissl, U.S. Geological Survey

     The U.S. Geological Survey (USGS) makes available a wide range of Earth and biological science data in the public domain. Tools are being developed to access these multiple data sets in an integrated manner. The Earth Explorer software will allow cross-inventory searches; the National Atlas is an integrated database of small-scale national data sets produced by numerous federal agencies.

     In addition, the USGS constantly examines its policies and actions to ensure that its activities are inherently governmental and that the private sector is being encouraged and enabled to reprocess and serve these data in formats, combinations, and products that meet the needs of their customers. Successful ventures with the private sector have been through business partner agreements, of which the USGS has several dozen relating to its digital data, and cooperative research and development agreements; for example, the TerraServer was developed under such an agreement with Microsoft. Lexon Technologies is developing consumer products from the National Atlas data. The USGS has an agreement with Pictometry to develop an image product that incorporates USGS digital data. The private sector is also being encouraged to add value to satellite data. These policies provide the public with far wider access to USGS data in ways that better meet their requirements.

A Case Study of Environmental Research Data Management

Trent G. Schade, U.S. Environmental Protection Agency

     To support EPA's ongoing research in watershed ecology and global climate change, we gather and analyze environmental data from several government agencies. This case study demonstrates a researcher's approach to accessing, organizing, and using intersectoral data. The research topic is an assessment of the potential impact of global climate change on engineered environmental systems.

     Data providers include government agencies, commercial contractors, and professional organizations. For example, NOAA provides precipitation data; USGS provides streamflow data and channel geometry; EPA provides point-source permit data; and cost data are provided by EPA and professional organizations.

     Each group has a unique access requirement and a unique data format. The researcher must force these data into a format appropriate for the model or method applicable to the problem. For this topic, we require the following models:

  • Water quality models--USGS and EPA

  • Climate change models--EPA

  • Statistical models--commercial software

     Synthesizing and analyzing the results from the models is another data management task. In the end, a key goal of the data management is to ease technology transfer; our research should adapt to the efforts of such technology-limited data users as local watershed groups. Our metrics for meeting this goal are economy and efficiency.

Providing a Common Search and Data Usage Facility for Independent Space Physics Data Centres

James Thieman and Edward Bell, National Space Science Data Center

Michael A. Hapgood, CLRC Rutherford Appleton Laboratory

Christopher C. Harvey, Centre de Données de la Physique des Plasmas, CNRS/CESR

J. David Winningham, Southwest Research Institute

     The space physics community has assembled large and diverse data holdings--catalogues, databases, simulations, and digital data archives--from both space- and ground-based facilities, most of which are available online in a wide variety of formats via network services. The goal of this initiative is to find a way to link these data resources so that space physicists can easily locate and use data of interest regardless of which of the many facilities actually holds the data. Any approach to providing this facility must impose minimal impact on data providers, use existing network access tools, and require little or no addition to project budgets for implementation. One approach to creating this facility is to model it after the Astrobrowse system presently used for data finding by the astronomy community. Similarly, it is proposed that the acquisition and intercomparison of data be done in a manner similar to the current Distributed Oceanographic Data System (DODS). There are many lessons to be learned from the Astrobrowse and DODS approaches that can be applied to the space physics problem. A prototype implementation planned for year 2000 will be described.

The DOE ARM Program Data Management System

Joyce Tichler, Wanda Ferrell, Raymond McCord, and Jimmy Voyles, U.S. Department of Energy

     The Atmospheric Radiation Measurement (ARM) Program is the largest global change research program in the Department of Energy (DOE). The program investigates the role of clouds in climate models, a critical scientific issue identified by the United States Global Change Research Program. The ARM Program established and operates field research sites in three climatically significant locations. Scientists collect and analyze data obtained over extended periods of time from large arrays of instruments to study the effects and interactions of sunlight, radiant energy, and clouds on temperatures, weather, and climate.

     A team of scientists and computational scientists from DOE national laboratories designed and built the ARM data management system. Data and accompanying metadata are collected at the three sites and from auxiliary sources, converted to a self-describing data format (netCDF or HDF) and transferred to the ARM archive for distribution to ARM scientists and to the general scientific community.

     The early years of the data management effort were dominated by implementation of the infrastructure necessary to collect, format, and archive the data; more recently dominant issues have turned to ensuring the quality and usefulness of ARM data. This paper will document the current status of the ARM data system and discuss future plans.

Calculational Data Base for Unimolecular Processes

Wing Tsang and Vladimir Mokrushin, National Institute of Standards and Technology

     Chemical kinetic and thermodynamic data represent essential inputs for the simulation of real-world processes. Such data have traditionally been presented in extensive tables or equations. The variety of conditions where such data are applied makes the traditional approach limiting. We describe an alternative approach in which theory and modern computational capabilities are combined to generate data as required. The application is for unimolecular reactions. The approach has general applicability and is in line with modern capabilities in science and technology. The results are in a recently completed user-friendly Windows program. Input data are the properties of the molecules and transition states and their interactions with the bath molecules. Rate constants for Boltzmann systems are then calculated on the basis of the partition functions. RRKM theory is used to derive specific rate constants as a function of energy and angular momentum. When combined with parameters that describe energy-transfer processes, rate constants for all conditions are derived through the solution of the time-dependent master equation. The database is reduced to the input parameters and can be applied to practically an infinite variety of conditions and easily updated.
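     For orientation, the three calculation steps named in this abstract correspond to standard textbook expressions, given here in simplified form. The abstract itself states no equations, so the notation is an assumption: Q and Q‡ are the partition functions of the reactant and the transition state, E_0 is the threshold energy, N‡ is the sum of states of the transition state, ρ is the density of states of the reactant, ω is the collision frequency with the bath gas, and P_ij is the probability that a collision transfers a molecule from energy level j to level i.

```latex
% Boltzmann (high-pressure-limit) rate constant from partition functions
k_\infty(T) = \frac{k_B T}{h}\,\frac{Q^{\ddagger}(T)}{Q(T)}\,
              \exp\!\left(-\frac{E_0}{k_B T}\right)

% RRKM specific rate constant as a function of energy E and angular momentum J
k(E, J) = \frac{N^{\ddagger}(E - E_0, J)}{h\,\rho(E, J)}

% Time-dependent master equation for the population n_i of energy level i
\frac{dn_i}{dt} = \omega \sum_j P_{ij}\, n_j \;-\; \omega\, n_i \;-\; k_i\, n_i
```

     The first two expressions need only molecular and transition-state properties, and the third adds the energy-transfer parameters, which matches the abstract's point that the database reduces to a compact set of input parameters.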

Mercury: Managing Distributed Multidisciplinary Scientific Data1

L.D. Voorhees, P. Kanciruk, B.T. Rhyne, S.E. Attenberger, Oak Ridge National Laboratory2

     Large-scale multidisciplinary field investigations typically involve many investigators and can result in thousands of data files. Often, sharing data from these investigations among researchers throughout the world does not occur for several years after the study has been completed, in part because of the effort required to document, organize, and present highly diverse, distributed data. In addition, traditional centralized data systems for storing and searching metadata are time consuming and costly to develop, require significant resources to operate, and are frequently out of date. We have developed a modern Web-based system, Mercury, which assists investigators in documenting data and allows them to maintain control of their data and its metadata. Mercury uses a combination of commercial off-the-shelf software, custom software, and metadata standards to provide an economical, dynamic, and rapidly deployable system. Using XML, metadata are coded into HTML documentation files on an investigator's server, which are periodically harvested by an HTTP retrieval program. The results are used to automatically build a searchable index of metadata. A user of this system searches this index through a Web-based interface, which provides links back to the documentation and data files located on investigator servers. This new way of sharing data and information among researchers throughout the world greatly facilitates the research process and can be applied to many kinds of projects, regardless of discipline.

1 This work is sponsored by the National Aeronautics and Space Administration.

2 Oak Ridge National Laboratory is managed by Lockheed Martin Energy Research Corporation for the U.S. Department of Energy under contract DE-AC05-96OR22464.
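     The harvest-and-index cycle described in the Mercury abstract can be sketched in a few lines. This is not Mercury's actual implementation, which the abstract does not detail. The sketch assumes, purely for illustration, that metadata appear as `<meta name="..." content="...">` tags, and it replaces the HTTP retrieval step with a dictionary of already-fetched pages; the class name, function names, and example URLs are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content:
            self.fields[name.lower()] = content

def build_index(pages):
    """Map each lowercase metadata word to the set of page URLs using it.

    pages: dict of url -> HTML text, standing in for harvested documents.
    """
    index = defaultdict(set)
    for url, html in pages.items():
        parser = MetaTagParser()
        parser.feed(html)
        for content in parser.fields.values():
            for word in content.lower().replace(",", " ").split():
                index[word].add(url)
    return index

def search(index, query):
    """Return the URLs whose metadata contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*[index.get(word, set()) for word in words])

# Two made-up documentation pages on different investigator servers.
pages = {
    "http://example.org/site-a/soil.html":
        '<html><head><meta name="keywords" content="soil carbon, forest">'
        '<meta name="title" content="Soil carbon survey"></head></html>',
    "http://example.org/site-b/flux.html":
        '<html><head><meta name="keywords" content="carbon flux, canopy">'
        '</head></html>',
}
index = build_index(pages)
results = search(index, "soil carbon")
```

     Because the index stores only words and links, the documentation and data files stay on the investigators' servers, which is the division of control the abstract emphasizes.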

Challenges in Managing Model-Generated Data: Supporting an Open International Scientific Assessment Process

Xiaoshi Xing and Robert S. Chen, Center for International Earth Science Information Network, Columbia University

     Data generated by computer-based simulation models have not generally received the same level of attention as observational data from a data management perspective; however, computer models in areas such as global climate change are increasingly being used in interdisciplinary research and assessment efforts and in national and international policy discussions. Intercomparison of different models and results, often developed by different research groups around the world, is vital, especially in a time frame that is difficult for traditional processes of scientific review and publication to accommodate.

     Working closely with Working Group III of the Intergovernmental Panel on Climate Change (IPCC), CIESIN developed an online World Wide Web site to support an open process of international scientific review and exchange for the Special Report on Emission Scenarios. This Web site has provided interactive access to a variety of scenarios and supporting materials developed by the working group and a means for the international scientific community to submit comments and new scenarios for working group consideration. The collaborative effort also gave CIESIN a unique opportunity to deal with the archiving and documentation needs of an unusual set of model-generated data. This paper describes key lessons learned in supporting the IPCC open process and in managing complex model-based data sets.

Abstracts of Technical Demonstrations

(Arranged in alphabetical order by principal author)

The U.S. Department of Agriculture and NASA's Global Change Master Directory's Collaboration in Writing and Sharing Metadata

Rebecca J. Bilodeau, Global Change Master Directory, National Aeronautics and Space Administration

     NASA's Global Change Master Directory (GCMD) and the U.S. Department of Agriculture have recently collaborated to establish the Agricultural Research Online System (AGROS). The collaboration has resulted in broader availability of metadata from USDA-funded research, including soil, crop and plant, forest, rangeland, and animal sciences, through both AGROS and the GCMD. The GCMD database includes descriptions of data sets covering climate change, the biosphere, hydrosphere and oceans, geology, geography, and human dimensions of global change. The directories, with more than 8,000 entries, are accessible over the Internet by various search methods, ranging from free-text and keyword searches to more complex queries. Searches can be refined using temporal and spatial constraints. In addition to search and retrieval software, the GCMD maintains metadata authoring tools and conversion software for different metadata standards. Metadata are written using the Directory Interchange Format content standard, which is compatible with the Federal Geographic Data Committee's Content Standard for Digital Geospatial Metadata and Dublin Core. The AGROS Data Directory is located at The GCMD is located at

Evolution of the CIESIN Gateway: Demonstration of a Working International Data Search and Access System

Robert S. Chen, Center for International Earth Science Information Network, Columbia University

     The CIESIN Gateway has evolved significantly from its origins as a data search tool based on relatively closed protocols. The current version uses the internationally recognized Z39.50 information retrieval protocol to search some 60 data catalogs and other information resources around the world in parallel. It recognizes several different metadata standards, including the Federal Geographic Data Committee's Content Standard for Digital Geospatial Metadata, the Government Information Locator Service, and the Directory Interchange Format. The search interface can be adapted flexibly using standard Web tools.

     Versions of the CIESIN Gateway have been deployed in support of the World Data Center for Human Interactions in the Environment, the Inter-American Institute for Global Change Research Data and Information System, the Earth Science and Technology Organization in Japan, the NASA Socioeconomic Data and Applications Center, the Global Change Research Information Office, and the World Bank. A current challenge is to help users efficiently search the large number of possible catalogs from diverse disciplines, institutions, and regions. CIESIN is exploring how to implement automated targeting of relevant catalogs based on a user's query and other ways to help users deal with what is, in essence, the successful implementation of data catalog interoperability.

Developing an ArcView Surface Water Integration Prototype for the Wisconsin Department of Natural Resources

Jim Cory, GeoAnalytics, Inc.

     The Surface Water Integration System Spatial Prototype for ArcView is designed to fulfill a number of criteria based on the needs identified by the Wisconsin Department of Natural Resources (WDNR) user community. The prototype will provide users with tools to query and analyze hydro-related data using spatial methods. The foundation for this functionality will be the WDNR 24k Hydro spatial data set. This data set incorporates advanced linear and areal features that provide an intelligent hydrographic framework. Data "events" can be attached to this geographic information system and related to one another spatially.

     Another major criterion for the prototype is that it be based on the data distribution and implementation effort being developed by the WDNR. This model provides a standard mechanism by which staff can easily access resource information throughout the state. The hydro data, in combination with other base layers, provide the framework on which the prototype associates disparate natural resource parameters, including fisheries, pollution, and dams.

Interdisciplinary Information Management Systems

Sara Graves and Rahul Ramachandran, University of Alabama in Huntsville

     The Information Technology and Systems Center at the University of Alabama in Huntsville will demonstrate one or more systems that address several multidisciplinary issues in managing and using scientific and technical data for both research and applications. The integrated approach of both information and physical scientists involved in the development and evolution of these systems has been the key to the success of these endeavors.

     The Event/Relationship Search System (E/RSS), HyDRO (Hydrology Search, Retrieval and Order System) and the Eureka Data Mining Toolkit are all systems that can be demonstrated. The E/RSS assists users in formulating requests for customized orders or data subsets. It is also used for coincidence or relationship testing between such factors as geographic regions, political boundaries, and phenomena for specific time periods. HyDRO allows users to search, retrieve, and order data at the Global Hydrology and Climate Center. The Eureka Data Mining Toolkit is used to provide technical support to multidisciplinary users wishing to apply data mining techniques to problem solving.

The Goddard Earth Sciences Distributed Active Archive Center (DAAC)

Steve Kempler, Goddard Earth Sciences Distributed Active Archive Center

     The Goddard Earth Sciences (GES) Distributed Active Archive Center (DAAC) is one of eight discipline-specific Earth science data centers that together archive NASA-held Earth-observing data. The GES DAAC is an active archive primarily for atmospheric, hydrologic, and ocean color data that provides data, information, and services for global change research, applications, and education. Its mission is to maximize the investment of NASA's Earth Science Enterprise by providing data and services that enable people to realize the scientific, educational, and applications potential of global climate data. The GES DAAC's aim is to be a facility for studying the natural and human processes that influence Earth's climate.

     The GES DAAC works closely with both the science teams who supply data to the archive and the science, applications, and education data users. Working with the data suppliers, the GES DAAC personnel ensure the integrity of the archived data and are prepared to support the data and the data systems. The GES DAAC provides tools to support data processing and preliminary science analysis. Providing user services is a crucial part of the GES DAAC work.

     Data are available electronically using Internet browsers, by file transfer protocol, and on digital tape. The various operational features of the GES DAAC will be demonstrated and explained.

The Global Land Cover Facility: Web-Based Tools for Land Cover Data Processing, Exploration, and Delivery

Frank Lindsay, Global Land Cover Facility, University of Maryland, College Park

     The Global Land Cover Facility (GLCF), a member of NASA's Earth Science Information Partnership, located at the University of Maryland, College Park, will demonstrate a number of Web-based tools for searching, viewing, and ordering land cover data and derived data products. The demonstration will include MOCHA (Middleware based On a Code SHipping Architecture), a system for querying and retrieving distributed earth-science data, as well as tools for real-time viewing and manipulating Landsat Thematic Mapper data. The purpose of the GLCF is to provide the global change and earth science communities with a portal to low-cost land cover data products.

Duluth: An Ontological Data Management System

Glen Newton, Scott Mellon, and Gordon Wood, National Research Council of Canada

     Duluth is a Web-based system allowing Web-naive domain experts to create parallel hierarchical taxonomies of Web resources (URLs). Duluth allows these experts to use these taxonomies to manage information and users to find it. Users have a read-only view, allowing them to navigate and search the taxonomy based on the classification labels (subjects), groupings (hierarchy-specific classification), and various URL resource metadata attributes, such as title, type, and keyword. Users can register interest in changes to parts of the taxonomy and have Duluth contact them periodically with these changes. The parallel taxonomies share a common metadata base of canonically classified URL resources. This allows for greater scaling of expensive and usually scarce human classification resources.

     Duluth presently supports two languages, English and French. Duluth is implemented using the Java Servlet architecture and is open source.
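     The central idea of this abstract, several parallel hierarchies drawing on one shared base of canonically classified URL resources, can be sketched as a small data structure. This is an illustration only, not Duluth's design: Duluth is implemented with Java Servlets, whereas the sketch is in Python, and every class, field, function name, and sample URL is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    """One URL, classified once with canonical subject labels."""
    url: str
    title: str
    subjects: frozenset

@dataclass
class Node:
    """One grouping in a single hierarchy; hierarchies never copy resources."""
    label: str
    subjects: frozenset = frozenset()
    children: list = field(default_factory=list)

def resources_under(node, resources):
    """URLs of resources whose subjects match this node or any descendant."""
    wanted = set()
    stack = [node]
    while stack:
        current = stack.pop()
        wanted |= current.subjects
        stack.extend(current.children)
    return sorted(r.url for r in resources if r.subjects & wanted)

# The shared base: each resource is classified by a person exactly once.
resources = [
    Resource("http://example.org/rivers", "River gauges",
             frozenset({"hydrology", "canada"})),
    Resource("http://example.org/soil", "Soil maps", frozenset({"soils"})),
]

# Two parallel taxonomies built over the same classified resources.
by_discipline = Node("Earth sciences", children=[
    Node("Water", frozenset({"hydrology"})),
    Node("Land", frozenset({"soils"})),
])
by_region = Node("Regions", children=[Node("Canada", frozenset({"canada"}))])

discipline_urls = resources_under(by_discipline, resources)
region_urls = resources_under(by_region, resources)
```

     Adding a third hierarchy in this sketch requires no new classification work, which is the scaling benefit the abstract attributes to the common metadata base.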

Data Access, Query, and Analysis in a Distributed Framework for Federated Earth Science Support

R. Yang, Center for Earth Observing and Space Research, George Mason University

M. Kafatos, Center for Earth Observing and Space Research, George Mason University

L. Chiu, Center for Earth Observing and Space Research, George Mason University; Distributed Active Archive Center, NASA Goddard Space Flight Center

X. Deng, Center for Earth Observing and Space Research, George Mason University

B. Doty, Center for Ocean, Land, Atmosphere Studies

Tarek El-Ghazawi, Center for Earth Observing and Space Research, George Mason University

A. Jearanai, Center for Earth Observing and Space Research, George Mason University

O. Kelley, Center for Earth Observing and Space Research, George Mason University; TRMM Science Data Information System

S. Kempler, Distributed Active Archive Center, NASA Goddard Space Flight Center

J. Kinter, Center for Ocean, Land, Atmosphere Studies

J. Kwiatkowski, Center for Earth Observing and Space Research, George Mason University; TRMM Science Data Information System

Z. Liu, Center for Earth Observing and Space Research, George Mason University; Distributed Active Archive Center, NASA Goddard Space Flight Center

C. Lynnes, Distributed Active Archive Center, NASA Goddard Space Flight Center

K. Matsuura, University of Delaware

J. McManus, Center for Earth Observing and Space Research, George Mason University; Distributed Active Archive Center, NASA Goddard Space Flight Center

Prachya, Center for Earth Observing and Space Research, George Mason University

P. Schopf, Institute for Computational Sciences and Informatics, George Mason University; Center for Ocean, Land, Atmosphere Studies

G. Serafino, Distributed Active Archive Center, NASA Goddard Space Flight Center

C. Wang, Center for Earth Observing and Space Research, George Mason University

X.S. Wang, Center for Earth Observing and Space Research, George Mason University

H. Weir, Center for Earth Observing and Space Research, George Mason University

C. Willmott, University of Delaware

K-S. Yang, Center for Earth Observing and Space Research, George Mason University

     The recent successful launch of NASA's Terra mission and the existence of other remote-sensing satellites already in orbit or to be launched over the next decade will provide an unprecedented opportunity for global coverage of Earth. Platforms will be observing the Earth's oceans, lands, and atmosphere and collecting data with volumes approaching a terabyte per day. It is expected that many different communities will wish to access these data sets, but with diverse goals and capabilities. To facilitate access to large volumes of data, users need to obtain information on the content of data before they proceed to order data sets that may or may not serve their needs. At GMU we have developed the concept and a working prototype for a Virtual Domain Application Data Center (VDADC) to facilitate data access and querying. The VDADC maintains global L3 data sets supporting interdisciplinary Earth science and provides online data analysis capabilities.

     As a follow-on and natural evolution of the prototype, we are developing a distributed data system designed to serve seasonal to interannual science communities, including those studying El Niño, monsoons, and teleconnection effects, as well as Tropical Rainfall Measuring Mission (TRMM) scientists and such regional experiments as the South China Sea Monsoon Experiment.

     Specifically, the Seasonal to Interannual Earth Science Information Partner (SIESIP) is a distributed data and information system consisting of several physically distributed nodes: George Mason University (GMU), the Center for Ocean, Land, Atmosphere Studies (COLA), the NASA Goddard Distributed Active Archive Center (GDAAC), and the University of Delaware. Support is also provided by GMU TRMM Science Data Information System staff. SIESIP is part of NASA's Earth Science Information Partners Program. The information technology implementation involves three nodes (GMU, COLA, and GDAAC) using a multitiered client-server architecture. The implementation allows for flexibility in data access by using different metadata, different ingest protocols, and different data access modes. A data access and querying system is being implemented that will provide online data access, data order, and data analysis and browsing capabilities. The popular GrADS analysis package is being enhanced and included in the three-phase process, as is TRMM data access via a web-accessible OrbView package developed for TRMM scientists. We will demonstrate features of the system including data access, data analysis, and interoperability in SIESIP and with other distributed data systems such as the Distributed Oceanographic Data System.

Copyright 2001 the National Academy of Sciences