V

Report of the Geoscience Data Panel

Ted Albert,* Shelton Alexander, Sara Graves, David Landgrebe, and Soroosh Sorooshian

CONTENTS

1 Introduction
2 Database Examples
3 Retention Criteria and Related Issues
4 Policy Considerations
5 Summary of Recommendations
Acknowledgments
Bibliography

1 INTRODUCTION

The concerns of the Geoscience Data Panel were the long-term retention of scientific and technical data and information related to the solid earth geosciences and their applications. Spatially, the domain covered by the geosciences extends from the Earth's core to the surface and into space (for example, the effects of cosmic rays on the Earth's surface). Temporally, it covers broad trends from the remote origins of the Earth to possible ultimate future scenarios, but it is also concerned with many rapidly varying, often short-lived phenomena.

Geoscience data may be characterized according to two broad categories. One is the observation and description of unique events, such as earthquakes, volcanic eruptions, and floods. In most cases, such data need to be archived for a long time period, regardless of their quality. The other category consists of observations of quantities continuous in space and time, e.g., gravity and the Earth's magnetism. Also included are the Earth's structure, seismic sampling, ground water distribution, and the like.

The quantity of such data obtained with public funding has increased dramatically in the past few decades. This growth is the result of the extremely varied types of observational data collected by the scientific community; the large volumes made available by better measurement techniques, more sophisticated instrumentation, and advancing computer technology; and increasing demand from not only the scientific community but also the general public, including engineers, lawyers, and statisticians.

* Panel chair. The authors' affiliations are, respectively, Consultant, Savannah, Georgia; Pennsylvania State University; University of Alabama in Huntsville; Purdue University; and University of Arizona.





The private sector is also a major collector and source of pertinent data, to the extent that private sector entities are willing to release data from proprietary claims and place them in the public domain.

A representative spectrum of geoscience data sources and types includes:

agronomic and soil reserves
metallic and nonmetallic mineral resources
paleontology
marine geology
fuels/energy resources—oil, coal, geothermal, gas, drilling records, etc.
hydrology—ground water, surface water, stream flow, water quality
tectonics—structural geology mapping on and below the surface, land classification
physical properties—elemental composition, thermodynamic properties, conductivity, magnetic properties, etc.
glaciology—snow, ice
earthquake prediction and studies
volcanic and landslide hazards

Geoscience data relate to the Earth's surface and below; they describe the things that are found on and in the earth (minerals, oil) and natural phenomena (earthquakes, floods), and they include descriptive material (maps). All of these data have a commonality: a locational aspect, a three-dimensional (X, Y, Z) coordinate system whether in space, on the Earth, or below the surface, and, in most cases, a temporal coordinate as well. Essentially, a common locational code or structure provides the basis for indexing, identifying, evaluating, and synthesizing the widely diverse information involved.

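To make the idea concrete, the following minimal Python sketch shows how a shared locational-plus-temporal key can index otherwise dissimilar geoscience records. The field names and example entries are entirely hypothetical; the panel prescribes no particular schema.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass(frozen=True)
    class GeoKey:
        """Common locational code: X, Y, Z plus time."""
        lat: float        # degrees north
        lon: float        # degrees east
        depth_km: float   # 0 at the surface, positive downward
        time: datetime    # when the observation applies

    # Records from very different sources can share one index structure.
    index = {
        GeoKey(34.05, -118.25, 10.0, datetime(1971, 2, 9, 14, 0)): "earthquake hypocenter",
        GeoKey(34.05, -118.25, 0.0, datetime(1985, 6, 1)): "Landsat frame",
        GeoKey(34.05, -118.25, 0.0, datetime(1990, 3, 15)): "stream discharge reading",
    }
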
The remainder of this report is organized according to the statement of task provided by the steering committee. Section 2 describes some representative examples of databases in the geosciences and identifies some lessons learned that are applicable to long-term archiving. Section 3 addresses retention criteria and related issues, supported by references to the examples discussed in the previous section. Section 4 briefly discusses the major policy considerations associated with archiving of data in the geosciences. The report concludes with a summary of the most important recommendations.

2 DATABASE EXAMPLES

Four geoscience database examples are described in this section: the Landsat archive at the U.S. Geological Survey (USGS) Earth Resources Observation Systems (EROS) Data Center; the Water Data Storage and Retrieval System/National Water Information System-II (WATSTORE/NWIS-II) operated by the USGS; seismic data held by several federal agencies and other institutions; and the National Snow and Ice Data Center (NSIDC) supported by the National Oceanic and Atmospheric Administration (NOAA). These data collections are illustrative of the scope of earth science databases. Moreover, many of the issues raised by these examples reflect the concerns of those involved in storing, utilizing, and archiving geoscience data.

The Landsat data archive is of great value as a continuing record of changes on the surface of the Earth. As an illustration of its value, in the mid-1980s the EROS Data Center decided that it could no longer afford to keep all of the data and intended to discard those from the first few years of collection. When the user community was notified, there was almost unanimous agreement that nothing should be destroyed, a strong indication of the value of these data and of the importance of their continuity.

The WATSTORE/NWIS-II is the largest hydrologic database in the United States. It is the basic source of such information, and it is continually being expanded. The database is structured so that it is easily accessible to primary users, and the metadata are appropriate for a broad range of users, not only the primary researchers. The volume of WATSTORE/NWIS-II will continue to grow over time, and its rate of growth is also expected to increase. The database is of enduring value because analyses of the nation's water supply will always require access to all of the historical data.

The seismic data example describes a broad range of research and operations applications. It shows that the diverse seismic databases are physically stored in a highly distributed manner, and that the accumulation of these data over the next several years will exceed all present holdings. It further illustrates a variety of dynamic databases in which the historical data are of great long-term value.

Some of these data are held by nongovernmental organizations with partial government funding, which raises the question of NARA's responsibilities vis-à-vis the private sector.

The National Snow and Ice Data Center is an example of a data repository that is also an originator of scientific projects. Much of its activity today concerns collecting, compiling, and distributing pertinent data collected and processed by others; as such, it functions primarily as a clearinghouse and packager of snow and ice data and information.

The Landsat Database

The Landsat database consists of multispectral images of the Earth's surface, which have been accumulating since the launch of Landsat 1 in July 1972. The archive includes digital tapes of multispectral image data in several formats, black-and-white film, and false-color composites of synoptic views of the Earth's surface from space. The database thus constitutes an important record of the evolving characteristics of the Earth's land surface, including that of the United States, its territories, and possessions. This record documents the results not only of various federal government policies and programs, but also of many state and local government and private sector activities. It further documents the impact of various large-scale episodic events, such as floods, storms, and volcanic eruptions.

Database Characteristics

Three principal sensors have been used on the five Landsat satellites successfully launched so far. The sensor systems and the operational periods of service of each are summarized in Table V.1.

TABLE V.1 Landsat Sensor Systems and Periods of Service

Spacecraft   Sensors(a)   Period of Service
Landsat 1    RBV & MSS    23 July 1972 to 6 January 1978
Landsat 2    RBV & MSS    22 January 1975 to 27 July 1983
Landsat 3    RBV & MSS    5 March 1978 to 7 September 1983
Landsat 4    MSS & TM     16 July 1982 to present
Landsat 5    MSS & TM     1 March 1984 to present

(a) RBV—Return Beam Vidicon; MSS—Multispectral Scanner; TM—Thematic Mapper.

The Return Beam Vidicon system was a coordinated set of three television-like cameras collecting frames of analog imagery in the blue, red, and near-infrared bands. The Multispectral Scanner (MSS) collects digital data with 80-m pixels in each of four spectral bands, with a signal-to-noise ratio supporting 6-bit precision (64 shades of intensity). The Thematic Mapper (TM) has 30-m pixels in six bands and 120-m pixels in a seventh band, with a signal-to-noise ratio supporting 8-bit precision (256 shades of intensity). The data are collected in swaths 185 km wide and are frequently divided arbitrarily into frames 185 km by 185 km for storage and access purposes. Landsat 6, which carried a slightly augmented version of the original Thematic Mapper, was lost during launch in October 1993. Landsat 7, which will be similar to Landsat 6, is expected to be launched in 1998. Landsat 8 is in the early planning stages.

As of January 1993, the Landsat database contained in excess of 100,000 tapes of varying-density digital data in several different formats, and over 2,850,000 frames of hard-copy imagery. Landsat data usually are delivered to users in the form of digital magnetic tapes, which may be read on any standard computer magnetic tape drive. There are indications that other media, such as CD-ROMs and streaming tapes, also may soon be used.

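The pixel sizes and precisions quoted above largely determine why the archive is so large. The following rough back-of-the-envelope sketch estimates the raw size of a single 185 km by 185 km frame, assuming one byte of storage per sample (6-bit MSS samples are often bit-packed in practice) and ignoring headers, calibration data, and duplicate formats, so the figures are indicative only:

    # Rough frame-size estimates for the two digital Landsat sensors.
    FRAME_KM = 185  # frames are 185 km x 185 km

    def frame_pixels(pixel_m: int) -> int:
        side = FRAME_KM * 1000 // pixel_m  # pixels along one frame edge
        return side * side

    # MSS: 80-m pixels, 4 bands, one byte per sample
    mss_bytes = frame_pixels(80) * 4
    # TM: 30-m pixels in 6 bands plus one 120-m band, one byte per sample
    tm_bytes = frame_pixels(30) * 6 + frame_pixels(120)

    print(f"MSS frame ~ {mss_bytes / 1e6:.0f} MB")  # ~21 MB
    print(f"TM frame  ~ {tm_bytes / 1e6:.0f} MB")   # ~230 MB

At roughly 21 MB per MSS frame and 230 MB per TM frame, millions of frames quickly outgrow any single-medium storage strategy.
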
Data requests occur most frequently as references to a particular geographic location, commonly expressed in terms of latitude and longitude, for a particular time of the year, and meeting certain cloud-cover limitations. This retrospective mode of access (as opposed to real-time or same-day access) is thought to have reduced substantially the breadth of activities that have been undertaken with the use of such data. Retrospective data retrieval became the common mode of data availability because of the initially very long delays between data collection and distribution.

Though it is now possible to request a priori that data be collected at a particular location on a particular satellite pass, this is not common because of the high cost currently charged for this type of data collection. Changes in the cost structure recently mandated by Congress will reduce significantly the cost of such data to users. This will likely expand the user base, thereby substantially increasing the value of the database to the nation.

The types of data processing vary almost as widely as does the user base. Typically the data receive at least a minimal amount of radiometric and geometric adjustment to account for known and measurable variations in the data collection process. From that point, depending upon the intended use of the data, there may be adjustments for the effects of the atmosphere, terrain, sun angle, and other such scene and observation variables. The data may be geometrically registered on a pixel-to-pixel basis to data from a previous pass over the site or to a particular geographic coordinate system and incorporated into a geographic information system. The data may be analyzed using any of a wide variety of algorithms with varying degrees of sophistication and efficacy, extending from simple spectral matching algorithms to those accounting for first- and second-order spectral variations and spatial relationships as well.

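As a minimal sketch of the first step in that chain: radiometric adjustment is commonly a linear rescaling of each recorded digital number. The gain and offset below are invented placeholders for the per-band calibration constants that accompany a real scene, so this is illustrative only:

    import numpy as np

    # Hypothetical per-band calibration constants; real values come from
    # the calibration metadata delivered with each scene.
    GAIN, OFFSET = 0.055, 1.2

    def dn_to_radiance(dn: np.ndarray) -> np.ndarray:
        """Linear radiometric adjustment: digital number -> at-sensor radiance."""
        return GAIN * dn.astype(np.float32) + OFFSET

    # Example: rescale one 6-bit MSS band (digital numbers 0-63).
    band = np.random.randint(0, 64, size=(512, 512))
    radiance = dn_to_radiance(band)
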
Analysis of the system in recent years has shown that early data management practices resulted in some deterioration of the media. Furthermore, changing technology made some of the data unreadable because the necessary hardware became obsolete. This emphasizes the necessity of periodically transcribing the database to the most current media.

Landsat Data Uses

As indicated above, the data are used quite widely across the spectrum of geoscience applications, in both civilian and military operational and research activities. These include such applications as assessing the impact of human activities upon the environment, land-use planning and resource-allocation decisions, disaster assessment, and renewable and nonrenewable resource measurement and assessment, among many others. The data also are used by the general public in any context where views of the Earth's surface are needed; examples include such diverse applications as visual aids in elementary and secondary education, backgrounds for highway maps, and illustrations for magazine articles about various regions of the world.

Landsat data currently are available in either image or digital form from the EROS Data Center in Sioux Falls, South Dakota. The Landsat satellites were originally under the control of NASA; in 1980, however, they became the responsibility of NOAA. Existing spacecraft are controlled by the EOSAT Company. Under EOSAT's control, the data are not in the public domain, are more expensive, and carry proprietary restrictions on their use. Beginning with Landsat 7, however, responsibility for the Landsat system will return to NASA, which will operate the systems and deliver the data to the EROS Data Center for processing and distribution. The data will once again be in the public domain. It is now recognized that the shift to private control of the Landsat system significantly reduced access to the data and their use.

The Landsat database is unique because data from any given area may be available at sampled instants over a period of more than 20 years, making possible for the first time the study of slowly varying phenomena taking place on the Earth. Even though data from the early 1970s may now have a low frequency of use, their potential value remains high, and they represent a very significant archival record.

The WATSTORE/NWIS-II Database

The U.S. Geological Survey's Water Resources Division (USGS-WRD) has responsibility for maintaining hydrologic and water resources data for the United States. Its latest system is the National Water Information System-II (NWIS-II). The primary objective of this new system is to maintain a highly flexible hydrologic data management system that can be easily changed and expanded in a rapidly changing technological environment. NWIS-II will be a single, integrated database distributed across the nation via a network of 32-bit microcomputers. Its automatic polling capability will allow the user to query multiple nodes of the network for available data, as well as to determine whether the desired data reside on a specific node. NWIS-II will replace the functions of USGS-WRD's current systems and will include expanded capabilities for processing and storing much more varied hydrologic data than the current systems.

One of the systems that NWIS-II will replace is the Water Data Storage and Retrieval System (WATSTORE), which was established in 1971 in response to USGS-WRD's need to improve the processing and management of its water data. The system is currently housed on a mainframe computer in Reston, Virginia.

Database Characteristics

In 1990, WATSTORE held nearly 4 gigabytes of data within its files, increasing at an estimated 160 megabytes per year (see Table V.2). At that time it contained over seven million records, with a growth rate of more than 336,000 records per year. The volume of NWIS-II is difficult to determine because the system has not yet been implemented. However, USGS-WRD has agreed to transfer data from its current systems (i.e., WATSTORE, NWIS-I, and others) for archiving of the two major types of data: water data and index data. Yorke and Williams (1991) estimated that NWIS-II will be capable of storing nearly five to six times as much data as WATSTORE.

As indicated in Table V.2, WATSTORE comprises seven major databases and files. Briefly, these may be characterized as follows:

The Header File contains information pertinent to the identification, location, and physical description of all the sites where data are collected. Data are collected by either automatic digital recorders or data collection platforms.

The Daily Values File is a general-purpose file for storing water data collected on a daily, or continuous, basis and then numerically reduced to daily values. Its attributes are an agency code, a station identifier, a parameter code, and a statistic code. A cross-section locator, a sampling depth, and a parameter code are examples of information found within WATSTORE's daily values records. The file contains one record type of fixed length.

The Unit Values File contains data such as rainfall, stream discharge, and temperature. A maximum of 1,440 unit values is possible for one record.

The Ground Water Site Inventory contains inventory data about wells, springs, and other sources of ground water, including site location and identification and geohydrologic characteristics. Files within the Ground Water Site Inventory are cross-referenced to the Water Quality File and the Daily Values File.

The Water Quality File holds analyses of water samples describing the chemical, physical, and biological characteristics of both surface and ground waters. The data elements for generating the record number are an agency code, a data category, a station number, and a sample date and time. The file contains one record type and has a maximum length of 4,200 bytes; a maximum of 250 occurrences is allowed.

The Peak Flow File contains the annual peak discharge and gauge stage values for surface water stations. The attributes of this file are an agency code, a station identifier, and the water year.

Finally, the station, descriptive name, data values, and whether the values were observed or computed are part of the Basin Characteristics database. The primary attribute of this file is the station identifier.

TABLE V.2 Major WATSTORE Databases and Files, Showing Sizes in Kilobytes and Numbers of Records for 1990

                              Actual Size,   Annual Growth   Actual 1990   Annual Growth
                              1990 (KB)      Rate (KB)       Records       Rate (Records)
Header File                   314,640        36,830          871,444       71,080
Daily Values File             1,210,754      24,393          756,721       15,246
Unit Values File              383,556        11,208          1,036,637     30,292
Ground Water Site Inventory   833,568        33,507          994,329       39,930
Water Quality File            1,082,026      47,618          3,182,428     140,053
Peak Flow File                75,810         5,701           529,753       39,815
Basin Characteristics         19,950         993             16,849        292

Source: The 1990 data are from S.B. Mathey, ed., Open-File Report 91-525.

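To illustrate the kind of fixed-length, attribute-keyed record the Daily Values File description implies, here is a minimal sketch. The field names are hypothetical and do not reproduce WATSTORE's actual layout:

    from dataclasses import dataclass

    @dataclass
    class DailyValueRecord:
        """One fixed-length daily-values record, keyed by its attributes."""
        agency_code: str      # collecting agency
        station_id: str       # site where the value was measured
        parameter_code: str   # e.g., discharge, temperature
        statistic_code: str   # e.g., daily mean, daily maximum
        date: str             # ISO date of the daily value
        value: float

        def key(self) -> tuple:
            # The attribute tuple under which the record is stored and found.
            return (self.agency_code, self.station_id,
                    self.parameter_code, self.statistic_code, self.date)
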

The databases and files are projected to be uploaded from the current USGS-WRD systems, such as WATSTORE, to NWIS-II sometime in 1994. It is therefore difficult at the time of this writing to describe completely all the characteristics of the new database. However, data will be collected by automated field instrumentation consisting of analog-to-digital recorders, electronic data loggers, portable field computers, analog recorders, satellite relay, telephone relay, and radio transmission.

NWIS-II will be more flexible than WATSTORE, which has only one or two tables (matrices) capable of storing data. WATSTORE is a one-dimensional, flat, hierarchical system; this limits the flexibility of storing new data and new metadata because the shape is fixed. Indeed, this has been the major drawback of WATSTORE: any change in the structure of the data or metadata has required an entire system overhaul. In contrast, NWIS-II comprises 50 relational tables that can be changed according to internal identifications, so its shape and metadata can be amended or extended easily, because the entire system need not be altered.

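The design difference can be sketched in a few lines of SQL, run here through Python's sqlite3 module. The tables and fields are hypothetical, not NWIS-II's actual schema; the point is that in a relational layout, new kinds of data or metadata become new tables referencing existing keys, leaving the tables already in place untouched:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE site  (site_id TEXT PRIMARY KEY, agency TEXT,
                        latitude REAL, longitude REAL);
    CREATE TABLE daily (site_id TEXT REFERENCES site(site_id),
                        date TEXT, parameter TEXT, value REAL);
    -- Adding, say, instrument metadata later needs only a new table;
    -- nothing about site or daily has to change.
    CREATE TABLE instrument (site_id TEXT REFERENCES site(site_id),
                             model TEXT, installed TEXT);
    """)

A flat, fixed-shape file offers no analogous escape hatch, which is why any structural change to WATSTORE forced a system-wide overhaul.
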
WATSTORE's function of processing water data was eliminated in 1983. Since then, other systems have been uploading their processed data into WATSTORE for archival purposes, and this has created many problems. For example, uploading data from NWIS-I to WATSTORE has required additional, sometimes redundant, editing of the NWIS-I data, and errors have not been easily detected. In some cases, data have been lost during their transfer to WATSTORE.

WATSTORE's output capabilities are computer-printed tables, computer-printed graphs, statistical analyses, digital plotting, and machine-readable data. This means that not only are condensed indexes of data available, but frequency distribution curves, map plots, regression and variance analyses, hydrographs, and contour plots can be obtained as well. WATSTORE can still provide data in the form of punched cards or card images on magnetic tape to those who require them.

NWIS-II will have the function of processing and managing multidisciplinary hydrologic data such as site location, water use, stream flow, sediment transport, water quality monitoring, and biological conditions. Examples of NWIS-II processing capabilities are computations of stage-discharge, slope-discharge, velocity index-discharge, reservoir content and level, sediment load, and rainfall. It will also have the added feature of data verification, which can be done manually or automatically; these verification checks can be system-wide, user-defined, or hydrologic event notifications. The status of the data during processing can be determined easily from flags indicating "original," "working," "in-review," "approved," and "published" (Mathey, 1991).

The upgraded NWIS-II will incorporate all of the WATSTORE capabilities (with the possible exception of punched cards and card images on magnetic tape) and add the ability to create a manuscript that meets publication standards. This feature will allow users to integrate text, tables, and figures into various report styles, extracting data from the database to create publication-ready reports. Additionally, output results will be selectable from the main menu: a screen display will permit the user to choose the time period, station, and constituent, allowing a narrowing of the data scope.

Uses of the WATSTORE Database

WATSTORE currently supplies (and NWIS-II will supply) approximately 70 percent of the water data used by federal, state, county, and city government agencies for all types of hydrologic applications, including understanding water supplies, planning flood control projects, and regulating sources of pollution. WATSTORE data also are released to the general public for a nominal fee; NWIS-II, moreover, will allow the general public on-line access to the data and will have a greater capacity for archiving data than WATSTORE.

In conclusion, WATSTORE has been the nation's water data archive since 1983. However, it has had limited capacity for storing voluminous amounts of hydrologic data, for upgrading its functions, and for producing publication-ready manuscripts. The new system, NWIS-II, is expected to meet the challenges of a rapidly changing technological environment not only by becoming a larger archive of hydrologic data, but also by becoming more capable in processing data. The greatest advantage of NWIS-II is that it will be an interactive, integrated network providing better access and support to USGS-WRD's users.

Seismic Data

In contrast to the two previous examples, seismic data are distributed widely rather than being located in one data center or system. This section focuses primarily on earthquake and explosion (both nuclear and chemical) seismic data, in which federal government agencies have significant involvement, although important exploration seismic data sets also are collected and archived by some federal agencies, notably the U.S. Geological Survey's National Earthquake Information Center (NEIC). In addition to the federal agencies, international private sector organizations now collect, distribute, and archive seismic data sets of long-term significance. To assure long-term retention of and access to these data, the government will need to make appropriate arrangements with all of these groups.

Among the federal agencies, the U.S. Geological Survey (USGS), the Department of Defense (DOD), the Department of Energy (DOE), the U.S. Nuclear Regulatory Commission (USNRC), and the Department of Commerce's National Oceanic and Atmospheric Administration (NOAA) have been and continue to be engaged in the collection and archiving of earthquake and explosion data. The USGS has operated an earlier version of the present Global Seismic Network and has archived its global digital seismic data since the late 1970s. The agency's National Earthquake Information Center produces, distributes, and archives a seismic bulletin giving the location, depth, origin time, magnitude, and damage information for all detected earthquakes. The NEIC also produces and distributes a CD-ROM containing the recorded signals from all earthquakes of Richter magnitude 5 and greater. Currently the USGS is deploying a National Seismic Network of approximately 50 digital seismic stations; the data from these stations will be transmitted in real time and archived by the NEIC.

The DOD, through the Advanced Research Projects Agency (ARPA) and the Air Force Technical Applications Center (AFTAC), and the DOE have collected a large amount of earthquake and explosion data in support of nuclear test ban monitoring research, which is archived at various locations (e.g., Teledyne-Geotech in Alexandria, Virginia; ARPA's Center for Seismic Studies; and DOE's Lawrence Livermore National Laboratory). A large portion of these data, which have long-term research value as well as significant historical value in documenting the U.S. and foreign nuclear testing programs, is in imminent danger of being discarded. A data rescue effort has been initiated in an attempt to save these unique data sets.

The USNRC has funded the operation of regional seismic networks over much of the United States since the 1970s in support of programs for the siting and safety of nuclear power plants. These data are archived in a highly distributed mode by the organizations (mostly universities) that collected them from the various networks. A significant portion of these data has long-term value for characterizing in detail the tectonic activity of seismogenic areas in the United States, particularly the eastern United States, where the available record of seismic activity is not adequate for long-term hazard assessments.

The Incorporated Research Institutions for Seismology (IRIS), a not-for-profit consortium of universities and private research organizations, is engaged in a major deployment of a global digital seismic network (GDSN) of approximately 100 continuously recording, three-component, broadband stations in cooperation with the USGS. A versatile portable seismic array of up to 1,000 stations also can be deployed for various intervals for special seismological studies.

The data are being permanently archived at the IRIS Data Management Center in Seattle, Washington. IRIS funding for this activity comes primarily from the National Science Foundation (NSF) and the Department of Defense. Finally, individual universities have important archives of earthquake data from years of seismic monitoring (e.g., the California Institute of Technology and the University of California at Berkeley for southern and northern California, respectively, and the University of Washington in Seattle).

Characteristics of the Seismological Data Sources

As noted above, the sources of seismological data are several federal agencies and nongovernmental organizations, both national and international. Global earthquake data have been acquired on a systematic basis since the early 1960s, when the Coast and Geodetic Survey of the Department of Commerce deployed a global seismic network of approximately 130 stations called the World-Wide Standardized Seismograph Network (WWSSN). The agency produced an archive of photographic film "chips" of the 24-hour/day recordings at all of these stations, and copies of these analog data were distributed to users at modest cost for research and other applications. The success of this precursor to today's global digital network cannot be overstated: the availability of a global data set in standard format from well-calibrated instruments permitted previously impossible studies of global seismicity patterns, earthquake source mechanisms, and earth structure. These studies have led to a vastly improved understanding of the dynamics of the Earth as a whole, including plate tectonics, the generation of new ocean floor, and the occurrence of destructive earthquakes and volcanic eruptions.

During the 1970s the USGS took responsibility for the WWSSN and began to upgrade the network and add a number of stations recording data digitally. This was called the Digital World-Wide Standardized Seismograph Network (DWWSSN). Users obtained the digital data on 1600-bpi round tapes, each containing a network-day of signals recorded by the network. The documentation provided in the headers was adequate to allow users to recover the true ground motions at each recording station. The additional dynamic range and spectral resolution of the digital data led to further advances in understanding of earthquake source mechanisms and whole-Earth structure.

In 1984, IRIS was founded as a major consortium of universities and other not-for-profit, nongovernmental research organizations; its membership has since increased to 79 member institutions. The goals of this consortium are to deploy and operate approximately 100 state-of-the-art, broadband, high-dynamic-range digital global stations; to develop a versatile, portable digital seismic array capability of approximately 1,000 elements that can be deployed in many different configurations for variable time intervals; and to operate a Data Management Center (DMC) to archive and distribute these data to the seismological community, including federal government scientists. The DMC, located in Seattle, Washington, now archives all IRIS data, all of the earlier GDSN digital data collected by the USGS, and substantial amounts of data recorded by foreign countries that operate digital seismic stations. IRIS funding comes mainly from the NSF, supplemented in recent years by substantial DOD funding for nuclear nonproliferation monitoring in the former Soviet Union, which has added several new permanent global stations and several portable array deployments, all of which have generated a significant volume of new digital data in the DMC archive.

The USGS also maintains the original GDSN digital data at its facilities in Albuquerque, New Mexico, and, together with the University of California at San Diego, deploys and operates the IRIS Global Seismic Network, as well as other stations funded by the Department of Defense in support of nuclear test ban monitoring research. The data collected from DOD-funded stations and arrays initially were archived at Teledyne-Geotech's Seismic Data Laboratory in Alexandria, Virginia; they currently are archived at ARPA's Center for Seismic Studies in Rosslyn, Virginia. Collectively, these data sets involve a variety of instrumentation, station locations, sampling rates, dynamic ranges, and metadata.

The USGS is now beginning to deploy a 50-station U.S. National Seismograph Network of digital stations, which will transmit continuous low-frequency and event-detected high-frequency signals to USGS facilities in Golden, Colorado, where they will be archived and distributed to users. Besides these operations, a large amount of local and regional digital seismic data is collected by the federal government (e.g., by the USGS in California and Alaska and by DOE in Nevada), and by industry and universities.

Typically these data sets are archived locally by the organization collecting the data, but they generally are made available to the seismological community. Significant volumes of exploration seismic data obtained through geophysical contractors are held by the Department of the Interior. These data are used by the federal government and by petroleum companies in preparing for lease sales for oil and gas exploration activities; there are, however, various proprietary restrictions on access to these data by other users.

In summary, the sources of seismic data are diverse, and their archiving is highly distributed. Moreover, data sets with long-term scientific and historical value reside in both federal and nongovernmental organizations, although in most of the latter cases federal funds have paid for their acquisition, archiving, and distribution.

Volume

The volume of digital data currently held and anticipated to be acquired by some of the above organizations is summarized in Table V.3. Although some data sets are complete because they are part of a project or program that has ended, most of the current operations continue to add large amounts of new data and to implement new technology for recording, storage, retrieval, and distribution, thereby creating a dynamic, highly distributed archive.

Metadata

Most of the data sets consist of time-sequential recordings (continuous or event windows) at each station or array. They typically are stored by station, although event data sets commonly are extracted from the database and are available as network event files. The coordinated universal time (UTC) associated with each sample is recorded.

TABLE V.3 Summary of Actual and Projected Data Volumes Archived in the IRIS Data Management Center

                Number of        Projected Data Volumes (gigabytes/year)
                Instruments(a)   1994    1995    1996    1997    1998    1999    2000
GSN             100              1,159   2,359   3,959   6,003   8,047   10,091  12,281
FDSN            146              370     670     1,070   1,530   2,050   2,670   3,416
JSP arrays      5                1,095   2,190   3,650   5,475   7,300   9,125   10,950
OSN             30               0       0       15      58      218     498     936
PASSCAL-BB      500              1,318   2,277   3,556   5,154   7,073   9,312   11,867
PASSCAL-RR      500              542     885     1,341   1,912   2,597   3,397   4,310
Regional-Trig   500              150     290     490     730     1,030   1,390   1,755
Total           1,781            4,634   8,671   14,081  20,862  28,315  36,483  45,515

Note: Abbreviations are as follows:
GSN            Global Seismic Network (IRIS)
FDSN           Federation of Digital Seismic Networks
JSP            Joint Seismic Program (with the former Soviet Union) (IRIS)
OSN            Ocean Seismic Network
PASSCAL-BB     Program for Array Studies of the Continental Lithosphere—Broadband (IRIS)
PASSCAL-RR     Program for Array Studies of the Continental Lithosphere—Regional Recordings (IRIS)
Regional-Trig  Regional Triggered Recordings
(a) Projected numbers by year 2000.
Source: IRIS Data Management Center, private communication, 1994.

A common time base is essential for using data from geographically distributed stations together, for both operational applications and research. Sampling rates, and hence frequency bandwidths, vary among and within the various data sets; exploration seismic data sets, which consist of multichannel linear or two-dimensional arrays, have sampling rates of up to 1,000 samples per second. For all observational seismic data, the header and metadata information represents only a small fraction of the total data set; the actual time series are orders of magnitude greater in volume. Segmentation of the time series into blocks is typical, but block length and structure vary from organization to organization, or in some cases from project to project.

The metadata associated with seismic databases vary in structure and format among the different organizations, and in many instances among projects within the same organization. At a minimum, however, all metadata files contain information about the station location (name, latitude, longitude, elevation); the instrumentation used to record the data, including instrument response and calibration data; channel identification; the date and universal time of each sample; the sampling rate; the conversion factors relating digital counts to units of ground motion (acceleration, velocity, or displacement); and the format of the data and metadata files. In some instances, additional comments describe the recording site or special circumstances associated with the recordings. Data sets provided to users also commonly include the software needed to read the data and restore them to uncompressed form.

Some 25 years ago, the Society of Exploration Geophysicists developed (with strong industry participation) a standard format for exploration seismic data called SEG-Y. This standard has been used almost exclusively by the petroleum industry and by most other organizations for controlled-source seismic recordings involving many channels of data. The metadata contained in the header files provide adequate information for processing and analysis using standard software packages, as well as user-specific algorithms.

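The minimum metadata list above maps naturally onto a small header structure. The sketch below mirrors that list; the field names are illustrative and do not reproduce any archive's actual format:

    from dataclasses import dataclass

    @dataclass
    class ChannelHeader:
        """Minimal station/channel metadata, per the list above."""
        station: str             # station name
        latitude: float          # degrees
        longitude: float         # degrees
        elevation_m: float
        channel: str             # channel identification
        start_utc: str           # UTC date/time of the first sample
        sample_rate_hz: float
        counts_to_motion: float  # digital counts -> ground motion units
        motion_units: str        # "acceleration" | "velocity" | "displacement"
        data_format: str         # format of the data and metadata files
        comments: str = ""       # recording site or special circumstances
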
Levels of Processing and Information Content

For earthquake and explosion data used in research, there is little or no processing other than the application of data compression algorithms when the data are recorded and of decompression algorithms at the user end. Of course, the usual over-sampling and anti-alias filtering is done when the data are recorded. Typically, quality control checks are made before the data are archived, to discover and correct errors or other problems with the digital data; in all cases, however, the original raw data are retained.

A rudimentary type of processing takes place in generating event data sets, which consist of selected time windows at each station that capture the signals from an event of interest; these event data sets are then provided to users electronically or on standard media such as magnetic tapes or CD-ROMs. Processing for operational purposes includes the application of event detection algorithms; the near-real-time determination of epicenters, magnitudes, and source mechanisms of earthquakes detected by multiple stations; and beam forming or frequency-wavenumber analysis of array recordings. Processing of exploration data includes stacking over repetitions of the source; merging of multiple channels of data to produce a two-dimensional record section (distance versus time); and migration of seismic reflection data to portray the true locations of subsurface reflectors in the seismic record section. Individual users may, and typically do, process the data in various ways to suit their particular needs; examples include band-pass filtering, matched filtering, frequency-wavenumber transformation, Fourier transforms, polarization filtering, and beam forming.

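As a sketch of one such user-side step, the following applies a zero-phase band-pass filter to a trace using SciPy. The corner frequencies, sampling rate, and synthetic trace are arbitrary choices for illustration, not values from any of the archives discussed here:

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def bandpass(trace: np.ndarray, fs: float, lo: float, hi: float) -> np.ndarray:
        """Zero-phase Butterworth band-pass, a typical user-side processing step."""
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        return sosfiltfilt(sos, trace)

    # Example: isolate the 0.5-5 Hz band in a trace sampled at 40 Hz.
    fs = 40.0
    t = np.arange(0, 60, 1 / fs)
    trace = np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.random.randn(t.size)
    filtered = bandpass(trace, fs, 0.5, 5.0)

Because the archives retain the raw data, any such filtering can always be redone later with different parameters.
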
Media and Formats

Various types of media are used for recording, archiving, and distribution. Until recent years, round tapes (1600 or 6250 bpi) were used for all three purposes. More recently, tape cartridges, exabyte tapes, CD-ROMs, optical disks, juke boxes, and other high-capacity media have been introduced; however, no standard media have been adopted, and researchers typically use several types of media in carrying out their analyses. Increasingly, electronic networks such as the Internet are being used to transmit large amounts of data and information, and this trend is likely to continue as available bandwidths increase; such networks are rapidly becoming the "medium of choice" for obtaining data from the various archives. As discussed earlier, the formats extant are numerous, but typically sufficient information (metadata) is provided to allow users to access the data and read them at their own institutions.

Users of Seismic Data

The users of seismic data are numerous and diverse. They include scientists and engineers in federal and state government agencies, universities, and private industry, particularly the petroleum industry; thousands of individuals are direct or indirect users of seismic data as part of their employment. Most of the seismic data sets discussed above have been or are now used both for operational purposes and for research, although for operational activities the data are used primarily immediately following their collection. Examples of operational uses include tsunami warnings and the rapid determination of the magnitude, location, and fault mechanism of destructive earthquakes and their aftershocks, both to inform the public and to assist in emergency response and special monitoring. On a longer time scale, the data are used for hazard reduction and seismic safety in seismogenic regions, including local zoning decisions for future development; for the siting and safety of critical facilities, such as nuclear power plants, including continuous monitoring of surrounding earthquake activity; and for global monitoring of threshold or comprehensive test bans on underground nuclear explosions.

A broad spectrum of research is conducted with the seismological data collected, including studies of the physics of earthquake and explosive sources, propagation effects on seismic signals, tomographic imaging of earth structures at all scales, seismicity patterns, and earthquake prediction and hazard estimation. Older data are very important and are commonly used for most of these types of research. For example, establishing the recurrence rate for larger-magnitude earthquakes requires decades to centuries or more of observations, even in the most seismically active areas, and some of the associated seismogenic processes have very long time scales as well.

Conclusion

Most seismic data are archived in a broadly distributed manner and have long-term value for scientific research, disaster mitigation, and various socioeconomic uses. However, only a fraction of the archived data are under the direct control of federal government agencies, and it appears that many of the data sets are not considered official federal records. Except for exploration seismic data, federal funds have paid for most of the instrumentation, station operation and maintenance, collection, storage, and distribution of earthquake and nuclear explosion data. Therefore, it is important for NARA, and especially for the funding federal agencies, to work cooperatively with all of these nongovernmental organizations to ensure that important seismic data sets are kept indefinitely in a form accessible to the scientific user community and the general public.

The National Snow and Ice Data Center

The World Data Center-A: Glaciology [Snow and Ice]/National Snow and Ice Data Center (WDC/NSIDC) provides a national and international focus for snow and ice data and information services. Snow and ice data are made available through specialized data reports and inventories in Glaciological Data, special data sets maintained in the center, tailored bibliographies from either the WDC internal database or other on-line search facilities, and access to the Snow and Ice Library. The center also provides a data and information clearinghouse service, using information from WDC resources and exchanges or referrals to other domestic and foreign agencies or individuals.

Characteristics of the Data Center

The data services provided by the NSIDC include creation of data products, referrals, and inventory assessment. The center also stores and retrieves data, assesses and improves data quality, and develops improved data handling and management techniques. Data from the NSIDC are used for a variety of applications, including input and validation for climate models and meteorological studies; resource and hazard assessments (e.g., snowfall, sea ice, icebergs, avalanches); paleoclimatic research (e.g., glacier fluctuations, ice core records); monitoring of environmental changes through snow and ice parameters; and numerous other scientific research purposes.

One of the variables by which NSIDC tracks users is type of organization. About 35 percent of the center's clientele are from academic institutions; 19 percent are private researchers; 20 percent are from the federal government; 4 percent are from state and local governments; and the remaining 22 percent are researchers from foreign nations. About 43 percent of user requests are for Defense Meteorological Satellite Program data; 19 percent are for data on snow cover, glaciers, avalanches, polar ice masses, ice cores, and fresh water ice; 17 percent are for data on sea ice; 12 percent are for publications; and the remaining 10 percent are for miscellaneous data services.

The data holdings of the NSIDC include data sets for which NOAA has responsibility for long-term preservation, data sets held as part of NASA's Earth Observing System program, and others from various government-sponsored scientific programs. Overall, most of the data sets are modest in size; however, satellite data dominate the volume, and the center anticipates that it will greatly expand its holdings by the year 2000. The data are distributed to users in a variety of forms: CD-ROMs, hard copy, magnetic tapes, and on-line. Each data set is described in a data announcement, which is distributed to potential users. All data are transcribed onto new media (e.g., from nine-track tapes to optical media) as time and resources permit. The National Geophysical Data Center's preservation guidelines, which are based on NARA guidelines, are used for the NOAA data segment.

Conclusion

The National Snow and Ice Data Center archives and distributes data that are part of NOAA's long-term database, NASA's Earth Observing System data, and other data sets and products generated by government-sponsored projects, both within federal agencies and in nongovernmental organizations. Some of the data sets are dynamic and growing, while others are the result of completed projects. The NSIDC relies on discipline-specific scientific expertise for its effective functioning. It represents the type of government-supported, active archive data center that can fulfill the long-term data access requirements of a specialized subdivision of the geosciences.

3 RETENTION CRITERIA AND RELATED ISSUES

Suggested Retention Criteria for Archiving Geoscience Data Sets

The four examples described in the previous section are representative of the types of data that should be archived. The panel proposes that data should be retained if they meet the following criteria:

1. The data must be unique; that is, the data cannot be recreated (e.g., via a mathematical model), and the same data cannot be collected again.

2. The data must be sufficiently documented to be understandable and useful.

3. The storage media must be in such condition that the data can be used or copied to newer media.

4. Hardware and software must exist, or be obtainable at reasonable cost, so that the data can be read and utilized.

What follows is a discussion of additional issues relating to the long-term retention and archiving of geoscience data.

Process and Schedule for Data Archiving

The Earth is an evolving and dynamic system. Changes of significance occur continually as a result of the activities of humankind and of natural processes. A record of these changes is of fundamental importance for achieving a better understanding of their driving forces and impacts, for anticipating them better, and for adapting humankind's activities to them. It is thus the panel's recommendation that, when the above retention criteria are met, such data should be kept in perpetuity. All retention decisions should be reviewed periodically, perhaps every decade or so, by an appropriate body of scientists and users to determine that the criteria continue to be met.

An important issue in this regard is which of the geoscience data archived by the federal agencies are official federal records subject to NARA's control of their disposition. A case in point, noted in the seismic data example above, is ARPA's decision to discard some 70,000 magnetic tapes containing over two decades of recordings of U.S. and foreign nuclear explosions and of earthquakes that occurred during this interval. NARA probably did not know about this large data archive, nor was NARA advised of this decision; in any case, ARPA's position was that the disposition of these records could be made at its own discretion. Such problems underscore the need for greater communication between NARA and the agencies, as well as for strengthened oversight of the process of scheduling data for archiving.

Supporting Documentation

The documentation, or metadata, accompanying a data set must be extensive enough to support accessing, assessing, analyzing, and generally using the data. Although geoscience data are highly diverse, the metadata should include the three-dimensional physical location and other pertinent georeferencing aspects; details of corrections and transformations applied to the data; the instrumentation involved in data collection and its calibration; the physical location of the data set; how to retrieve or access and use the data; and a general description. For a comprehensive discussion of metadata requirements for effective archiving, see the report of the Ocean Sciences Data Panel.

Technology Considerations

The storage medium for digital data should be of particular concern. The technology is changing continually, providing increased storage capacity in various forms, including floppy disks, magnetic tape (from reels to cassettes and cartridges of increasing capacity), and optical or compact disks. The three basic concerns are (1) the capacity of the storage device; (2) the ease, speed, and facility of retrieval over time, considering changes in technology; and (3) the length of time the integrity of the storage medium can be maintained.

What is desirable, as data sets continue to grow in size, is the greatest capacity; the longest integrity (to lessen the need to rewrite the data too often); ease of retrieval both now and in the future with the new hardware and software that will be developed; and a minimal physical storage requirement. Many data that were digitally stored using older technologies are becoming unreadable as the media begin to fail. For instance, a significant and well-known problem of this nature was encountered in the late 1980s by the EROS Data Center with its old Landsat magnetic tapes. If the data are to be preserved, they must be migrated periodically to new media. The migration of all data holdings to newer media should be a planned activity carried out on a regular basis (e.g., every 10 years or when a major new upgrade is implemented). If migration is factored into the planning, it may affect decisions on the new systems acquired. These decisions should be made by the concerned scientists, technicians, and management at the custodial agency, in consultation with NARA, and should be governed in part by the ease of migrating data from one medium to another as technology changes.

It is also important to point out that migration to new media does not imply that the data should be converted to a new format. For some types of data sets reformatting may pose no problems, but important data and information may be lost in any reformatting, and new errors may be introduced. This decision probably needs to be made on a case-by-case basis, but the conservative approach is simply to copy, not reformat, when an old data set is migrated to new media. (It is understood that techniques like data compression might cause the stored bytes to differ in the transcribed data, but if the reconstruction is lossless, the original form of the data is preserved.)

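When the copy is intended to be bit for bit (no compression or reformatting applied), the copy-only approach lends itself to simple verification. The sketch below, with hypothetical paths and an illustrative choice of digest, copies a holding to new media and confirms that the copy is identical before the old medium is retired; it is not any agency's actual procedure:

    import hashlib
    import shutil
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Digest of a file's byte stream, read in 1 MB chunks."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def migrate(src: Path, dst: Path) -> None:
        """Copy a holding to new media and verify the copy bit for bit."""
        shutil.copyfile(src, dst)
        assert sha256(src) == sha256(dst), f"copy of {src} is not bit-identical"
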
Of special interest for both active and archived data sets is a browse capability. The user should be able to browse through data sets to find those best suited for a given application, as well as within a data set, to examine and evaluate the data. An overall top-level directory, a more detailed second-level catalog, and a browse capability would be an optimum set of tools for all levels of users, including the archivist.

The agencies, including NARA, should consider developing a complete, electronically accessible directory of the long-term scientific data and information that exist within the federal government and in related nongovernmental archives (such as the seismic case). Such a comprehensive directory should include pointers that tell users how to access the particular data sets they want. A prototypical model for a comprehensive directory is NASA's Master Directory for global change data and information. If developed to its fullest extent, such a directory would let users determine whether particular data sets exist, where they are archived, how they can be acquired, and in what form. Then, if desired, specific data and metadata could be retrieved from the relevant archive(s) and transmitted to the user over an electronic network. Building such a comprehensive directory for all the electronic scientific databases would require a major effort. Interfacing with all of the diverse archives and carrying out the search, retrieval, and transmission steps needed to fill user requests would be even more challenging, but it is probably technically feasible, judging from NASA's recent efforts toward that goal for global change data. (A minimal sketch of such a two-level arrangement follows this discussion.)

Finally, the indexing method used with a data set is most often the one that suits the needs of the collectors, the principal investigators, and the primary users. It may not be the best for the secondary and tertiary users, or for the archivists. Archivists therefore need to provide an overlay of additional indexing to facilitate use by the broader audience.
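The following minimal sketch (in Python, with invented identifiers, holdings, and contact addresses) illustrates the two-level arrangement described above: a top-level master directory carrying high-level metadata and pointers, and second-level catalogs, one per archive, answering detailed queries. NASA's Master Directory and any operational system are, of course, far richer.

    # Top-level master directory: one short record per data set, with a
    # pointer (here, an archive name) to where detailed metadata live.
    MASTER_DIRECTORY = [
        {"id": "GEO-0001", "title": "Global seismic waveforms",
         "discipline": "seismology", "archive": "seismic_center"},
        {"id": "GEO-0002", "title": "Borehole gravity surveys",
         "discipline": "gravity", "archive": "geophysical_center"},
    ]

    # Second-level catalogs, one per archive, with detailed holdings.
    CATALOGS = {
        "seismic_center": {
            "GEO-0001": {"years": "1988-1994", "format": "SEED",
                         "access": "on-line", "contact": "data@example.gov"},
        },
        "geophysical_center": {
            "GEO-0002": {"years": "1975-1990", "format": "flat file",
                         "access": "batch", "contact": "gravity@example.gov"},
        },
    }

    def search_directory(keyword: str) -> list[dict]:
        """Narrow the search using only the high-level directory records."""
        kw = keyword.lower()
        return [rec for rec in MASTER_DIRECTORY
                if kw in rec["title"].lower() or kw in rec["discipline"]]

    def lookup_details(record: dict) -> dict:
        """Follow the pointer from a directory record to its archive catalog."""
        return CATALOGS[record["archive"]][record["id"]]

    for rec in search_directory("seismic"):
        print(rec["title"], "->", lookup_details(rec))

The design choice to keep directory records small is what makes a government-wide directory tractable: the expensive, detailed metadata stay with the archives that understand them.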

Subject Matter Expertise

Given the premise that numerous and diverse data sets need to be archived, the question of reference, meaningful access, and use of these data by researchers and the public must be addressed over time. Considering the generally complicated nature of geoscience data and their use, it is unlikely that a nonexpert could get much value from the archive without expert help, at least for a significant fraction of the data that need to be kept in perpetuity. If NARA were to become the repository for a substantial number of geoscience data sets, in-house personnel with adequate scientific backgrounds or experience in the various disciplines would be required to support a broad range of users in finding the right data, accessing them, and perhaps providing initial instruction in their use. The documentation and metadata adequate for the primary research community often are insufficient to make geoscience data understandable or fully usable by nonexperts. Thus, if NARA were to be a major scientific data repository, it would have to maintain a staff of scientifically knowledgeable people; otherwise the archived data would have significantly diminished value. NARA also could make agreements with the originating agencies to supply expertise as needed.

A much more practical approach for NARA to assure the long-term viability and accessibility of geoscience data would be to designate the various existing data centers that meet NARA standards as "Affiliated Archives" of NARA. These arrangements could be formalized through memoranda of understanding.

In summary, whether NARA archives scientific data in-house or at remote locations, it is clear that it will have to maintain some staff with appropriate scientific capability. Further, it will have to work closely with the various agencies to help them improve and provide adequate documentation for all data.

Proprietary Restrictions and Privacy

There are cases in which the dissemination of data may be expressly limited. These include data concerning government leasing of land and drilling permits, data provided in part by private (commercial) sources, and key statistical data involved in fuel or agricultural production estimates. Native American lands and national security raise issues requiring special consideration. Any such restrictions need to be weighed on a case-by-case basis when the data are archived.

4 POLICY CONSIDERATIONS

The panel is encouraged by several recent attempts to better coordinate and manage data in the earth sciences. Two notable examples are the National Spatial Data Infrastructure, coordinated by the Federal Geographic Data Committee, and the Global Change Data and Information System, coordinated by the Interagency Working Group on Data Management for Global Change. Nevertheless, there is no comprehensive plan or policy in the federal government relating to digital databases, information systems, or archives, even though they pervade our entire society. This is especially true for scientific and technical data across the physical sciences. There has been a concomitant reluctance on the part of the government to state any broad policy in these areas. Although various groups of scientists and professionals have attempted to encourage government action, they have not yet been successful. Funding for databases, information systems, and archiving is notoriously the first to be cut whenever there are budget constraints, which appear to be perennial. Many issues arise, as indicated in the previous chapter.
While the resolution of these issues may be difficult, one aspect appears clear: some type of comprehensive government policy needs to be developed. The panel's report is concerned specifically with the problems of long-term archiving of geoscience data by the government and with the role of NARA. On the one hand, NARA could use a clarification of its mandate relating to digital scientific and technical data in terms of its statutory, administrative, custodial, and institutional responsibilities. On the other hand, lacking a clearer stated mandate, NARA could develop a coherent program of its own, taking due account of its physical, personnel, and funding limitations. Such a plan, properly promulgated and carried out, could respond well to the needs of the scientific community, the government, and the nation.

Two general possibilities exist for a physical archive. One is a centralized installation containing adequate computer and storage facilities, along with appropriate software and in-house scientific expertise. The other is a distributed archive, that is, geographically dispersed locations with similar characteristics, connected by networks for remote access. In the distributed model, the optimal locations for data are where the principal collectors and subject matter specialists reside. In both cases, we assume an up-to-date, automated master directory that is remotely accessible for query through a network. The automated directory should contain sufficient data to allow a user accessing the system to quickly narrow the search to a pertinent set of data.
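As one way to picture a master directory that is remotely accessible for query through a network, the following minimal sketch (in Python, using a modern HTTP idiom purely for illustration; the records, hosts, and port are invented) runs a small service that answers keyword queries against high-level directory records. In a distributed archive, each storage location would run an analogous service over its own detailed catalog.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    # Hypothetical master-directory records (in practice, far richer).
    DIRECTORY = {
        "GEO-0001": {"title": "Global seismic waveforms",
                     "location": "seismic_center.example.gov"},
        "GEO-0002": {"title": "Borehole gravity surveys",
                     "location": "geophysical_center.example.gov"},
    }

    class DirectoryHandler(BaseHTTPRequestHandler):
        """Answer remote directory queries such as GET /search?q=seismic."""

        def do_GET(self):
            query = parse_qs(urlparse(self.path).query).get("q", [""])[0].lower()
            hits = {k: v for k, v in DIRECTORY.items()
                    if query in v["title"].lower()}
            body = json.dumps(hits).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        # A single process stands in for the master directory here; each
        # distributed location would run its own detailed catalog service.
        HTTPServer(("", 8080), DirectoryHandler).serve_forever()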

Each individual storage location should maintain a detailed catalog of all stored data, with the proper links to the main directory. It should support and provide the needed access to the stored data, and it should maintain and update the data as required. These functions also require consideration of security and database backup. If possible, a software browse function should be provided to give the user greater ability to scan the data and determine how useful a data set will be.

In the case of NARA, the panel believes that a combination of both types of archival arrangement would be best. Distributed locations could be a discipline center of excellence, a university, a national laboratory, an agency expert center, or even certain industrial organizations. Each archival location or center should have information science specialists available, with knowledge of the particular disciplines, to support all levels of users in access, general retrieval, the interface system, the possibilities for analysis, and perhaps some historical background. The optimal candidates for such positions would be knowledgeable scientists with a strong interest in data management and dissemination.

Given existing levels of support, it is certain that NARA would not be given sufficient resources to carry out a comprehensive plan alone. Nor would it be desirable for NARA to do so, even if the resources were made available to it. Rather, NARA must develop stronger ties to the scientific community and the pertinent government agencies, with due consideration of the agencies' specific mandates. In light of NARA's existing and projected capabilities for archiving electronic data, geoscience data should be archived at federal science agency data centers or at universities, the primary collectors and users of the data. At the same time, NARA should become more proactive in establishing and maintaining liaisons with the agencies, participating in a program of standards for archival storage, directories, and documentation, and maintaining up-to-date knowledge of all extant databases. NOAA and the other federal agencies should view NARA as a supporting agency, informing NARA on a timely basis of projects that could affect it in the future. A process should be established for NARA to keep informed about user needs. Further, it may be necessary to consider new financial resources so that NARA and the federal agencies can properly serve the secondary and tertiary users of scientific data.

The scientific community, and indeed the entire nation, would benefit if a government interagency coordination and policy group were established to provide direction and support for the data and information infrastructure of (at a minimum) the physical sciences. Such a group would build on the initial accomplishments of the Federal Geographic Data Committee and the Interagency Working Group on Data Management for Global Change. It would have broad representation from the agencies, NARA, the scientific community, the National Institute of Standards and Technology, and the Office of Science and Technology Policy. A major national resource deserves no less.

5 SUMMARY OF RECOMMENDATIONS

- Geoscience databases that meet retention criteria should be archived in perpetuity, but reviewed periodically.
- Data should be retained if they meet the following criteria:
  - The data must be unique; they cannot be re-created (e.g., via a mathematical model), nor can the same data be collected again.
  - Sufficient documentation must exist so that the data are understandable and useful.
  - The storage media must be in such condition that the data can be used or copied to newer media.
  - Hardware and software must exist or be obtainable at reasonable cost so that the data can be read and utilized.
- All data should be archived in the form in which they were maintained by the original science community.
- Geoscience data should be kept, to the maximum extent feasible, within the originating organizations that serve the broad user community and that have the requisite scientific expertise with regard to the data.
- NARA should monitor and provide oversight for archived databases held by other data centers or agencies. Such activities on NARA's part would ensure that, wherever located, archived databases are broadly accessible and are not discarded without NARA's approval.
- NARA should be proactive in working with agencies and other organizations to ensure the longevity and long-term accessibility of geoscience data.
- Non-federal government organizations should not be allowed to discard data funded wholly or in part by the federal government without first offering them to the funding agency or agencies.
- NARA must continue to maintain cognizance of all statutory archival requirements of the federal government. NARA may wish to establish written agreements for working with the agencies and with non-federal government organizations.

- Geoscience databases should be transferred, but not otherwise altered, on a periodic basis to new generations of reliable storage media.
- A federal interagency coordination and policy group should be established to provide direction and support for a comprehensive data and information infrastructure for the sciences. Such a group would provide direction for the long-term retention of geoscience data, among its other activities. The group would report to the Office of Science and Technology Policy and have representation from NARA, NOAA, NASA, the other relevant federal agencies, and the scientific community. One of its major functions would be the establishment of an electronic, interactive master directory for the physical sciences, including the geosciences. The master directory should be broadly accessible throughout the federal government, including NARA. It should contain high-level metadata relating to geoscience data sets and provide pointers to more detailed information.
- Metadata and other documentation, appropriate and adequate for primary users (and secondary users whenever feasible), should accompany every geoscience data set that is to be archived, no matter where the data set resides.
- NARA should have on its staff sufficient information specialists to interface effectively with the science agencies and the user community.
- NOAA, NASA, and the other federal science agencies should take the view that the geoscience data they collect are likely to have long-term value, and adequate provisions should be made for their effective management and use decades later. The panel wishes to emphasize that the science agencies should keep NARA fully informed about the databases that will be archived, and that the agencies should maintain active liaisons with NARA.

ACKNOWLEDGMENTS

The panel gratefully acknowledges the assistance of Julie Esanu, Alice Killian, and Paul Uhlir of the National Research Council staff in the preparation of this report, as well as the information provided at its meetings by the following individuals: Kenneth Thibodeau and John Dwyer, NARA; Helen Wood, NOAA/NESDIS; Donald Collins, Jet Propulsion Laboratory; Herbert Meyers, National Geophysical Data Center; William Draeger, Daniel Cavanaugh, and Thomas Yorke, U.S. Geological Survey; Katrin Douglass, Southern California Earthquake Center Data Center; and Roger Barry, Claire Henson, and Ronald Weaver, National Snow and Ice Data Center.

BIBLIOGRAPHY

Christman, J.D., and O.O. Williams. 1993. Second Release of the U.S. Geological Survey's National Water Information System II. Proceedings of the Federal Interagency Workshop, Hydrologic Modeling Demands for the 90s, Fort Collins, Colorado, June 6-9, 9 pp.

Committee on Environment and Natural Resources Research. In press. The U.S. Global Change Data and Information System Draft Implementation Plan. National Science and Technology Council, Washington, D.C.

Kiesler, J.L., and T.H. Yorke. 1993. National Water Information System II: An Integrated Hydrologic Data System. Presented at the Workshop on Development of Water Information Systems, Washington, D.C., May 19-20, 10 pp.

Mathey, S.B., ed. 1991. System Requirements Specification of the U.S. Geological Survey's National Water Information System II. U.S. Geological Survey Open-File Report 91-525, 622 pp.
National Aeronautics and Space Administration (NASA). 1982. NASA Reference Publication 1078, The Landsat Tutorial Workbook, by Nicholas M. Short. Goddard Space Flight Center, U.S. Government Printing Office, Washington, D.C.

National Aeronautics and Space Administration (NASA). 1988a. Directory Interchange Format Manual, No. 88-19. World Data Center A for Rockets and Satellites, National Space Science Data Center, Goddard Space Flight Center, Greenbelt, Md.

National Aeronautics and Space Administration (NASA). 1988b. Report on the Third Catalog Interoperability Workshop, November 16-18, No. 89-04. World Data Center A for Rockets and Satellites, National Space Science Data Center, Goddard Space Flight Center, Greenbelt, Md.

National Archives and Records Administration. 1990. Managing Electronic Records. Office of Records Administration, Washington, D.C.

National Research Council (NRC). 1982. Data Management and Computation, Volume I: Issues and Recommendations. Committee on Data Management and Computation, Space Science Board, National Academy Press, Washington, D.C.

National Research Council (NRC). 1985. Sharing Research Data. Committee on National Statistics, Commission on Behavioral and Social Sciences and Education, National Academy Press, Washington, D.C.

National Research Council (NRC). 1988. Geophysical Data: Policy Issues. Committee on Geophysical Data, Commission on Physical Sciences, Mathematics, and Resources, National Academy Press, Washington, D.C.

National Research Council (NRC). 1989. Information Technology and the Conduct of Research. Panel on Information Technology and the Conduct of Research, Committee on Science, Engineering, and Public Policy, National Academy Press, Washington, D.C.

National Research Council (NRC). 1990. Spatial Data Needs: The Future of the National Mapping Program. Mapping Science Committee, Board on Earth Sciences and Resources, Commission on Physical Sciences, Mathematics, and Resources, National Academy Press, Washington, D.C.

National Research Council (NRC). 1992. Toward a Coordinated Spatial Data Infrastructure for the Nation. Board on Earth Sciences and Resources, National Academy of Sciences, National Academy Press, Washington, D.C.

National Snow and Ice Data Center. 1992. Annual Report. National Snow and Ice Data Center, Boulder, Colo.

Yorke, T.H., and O.O. Williams. 1991. Design of a National Water Information System by the U.S. Geological Survey. Pp. 284-288 in Proceedings, Seventh International Conference on Interactive Information and Processing Systems for Meteorology, Hydrology, and Oceanography, New Orleans, Louisiana, January 14-18. American Meteorological Society, Boston, Mass.