II

Report of the Space Sciences Data Panel

Christopher Russell,* Giuseppina Fabbiano, Sarah Kadec, William Kurth, Steven Lee, and R. Stephen Saunders

* Panel chair. The authors' affiliations are, respectively, the University of California, Los Angeles; Harvard-Smithsonian Center for Astrophysics; Consultant, Williamsburg, Virginia; University of Iowa; University of Colorado; and Jet Propulsion Laboratory.

CONTENTS

1 Introduction
2 Overview of Space Science Data
3 Space Science Archive Model
4 Suggested Retention Criteria and Appraisal Guidelines
5 Summary of Findings, Conclusions, and Recommendations
Acknowledgments
Bibliography

1 INTRODUCTION

The purpose of this report is to broadly characterize the data holdings in the observational space sciences, review the current status and practices regarding the long-term archiving of those data, and provide advice on improving that process to the National Archives and Records Administration (NARA), the National Oceanic and Atmospheric Administration (NOAA), the National Aeronautics and Space Administration (NASA), and other responsible entities.

The flow of scientific data from the primary producers to the archives of NARA and other agencies is a long and tortuous path. This path in many ways resembles the course of rainfall gathered in creeks and ponds and passed through streams and lakes and eventually through rivers to the sea. The residence time on land is small compared to the residence time in the sea, and there may be much evaporation so that not all the water reaches its final destination. Moreover, sometimes the quality of the water, despite our best efforts, is degraded in the journey. Nevertheless, despite all the imperfections of our waterways, they are critical to our society and, especially so, the final repository of all that water, our oceans. So too the data pathways and archives of the nation are becoming increasingly important to our society from a number of perspectives.

The drainage systems, or the roots, that tap into the various sources of scientific data vary across the federal government and even within individual agencies. Efforts that involve discipline data centers working with individual investigators represent the most decentralized approach.


Other efforts adopt a project-oriented, more centralized approach that involves neither individual investigations nor discipline data centers. For example, NASA has both centralized and decentralized approaches to archiving. In planetary science, it is now customary for data first to flow from individual investigators and projects to distributed discipline data centers. These discipline data centers in turn provide data to the Planetary Data System, which manages the archiving process, fulfills data requests, develops standards, and provides directory information. Space physics is attempting to emulate this system, but other disciplines, such as the life sciences, seem to favor a more centralized approach, while astrophysics favors a hybrid approach with some centralized and some decentralized efforts. The data in NASA then traditionally flow to the National Space Science Data Center for more permanent archiving.

In other agencies the data pathways—dataways, in analogy to the nation's waterways—do not appear to be as highly developed as NASA's dendritic channels. For instance, NOAA's National Geophysical Data Center (NGDC) tends to work directly with investigators and not through intermediaries. Federal laboratories, such as Los Alamos National Laboratory (LANL) and Lawrence Livermore National Laboratory (LLNL), tend to produce their data internally and not seek them from external sources. Ultimately, the most important of this information should pass down the dataways to permanent archives, but little of it presently does. Thus, one of the panel's major objectives in this report is to provide some guidance on how the dataways can be opened.

2 OVERVIEW OF SPACE SCIENCE DATA

This section broadly characterizes the observational data collected in the three major space science disciplines: planetary sciences, astronomy and astrophysics, and space physics. It reviews the status of long-term data archiving in those disciplines and identifies the major issues associated with that process.

In planetary science, most of the data archiving effort is carried out by the Planetary Data System. The procedures used by the Planetary Data System provide perhaps the best model for life-cycle data management and archiving in the space sciences. The Venus data acquired through the recent Magellan mission are described below as one such successful example.

In astronomy and astrophysics, observations are obtained by both ground-based and space-based instruments. Greater emphasis on data archiving has been given to space-based observations, especially in the Hubble Space Telescope project and within the high-energy astrophysics community. The Astrophysics Data System, however, is at an earlier stage of development than the Planetary Data System.

Most archiving of space physics data is done at two centers: the NASA National Space Science Data Center and the NOAA National Geophysical Data Center. A fledgling Space Physics Data System has been initiated, but has not yet had sufficient time or resources to significantly influence the archiving efforts in this field.

Planetary Data

Planetary data are acquired by both ground-based and space-based observations. Planetary data include observations of the entire physical system and forces affecting a planet or other body, including the geology and geophysics, atmosphere, rings, and fields. The sensors used collect data across much of the electromagnetic spectrum. Currently, most planetary observations are supported by NASA, either as the direct result of planetary missions or ground-based observations that support a mission.

Over the past three decades, NASA has sent robotic spacecraft to every planet in the solar system except Pluto, to two asteroids, and to three comets. Men have walked on the moon, performed experiments there, and returned samples. The knowledge we have about the bodies in the solar system, with the exception of our own planet, is due in large part to space missions. In some cases, such as the gas giants Jupiter, Saturn, Uranus, and Neptune, robotic space probes have provided almost all of our current knowledge. Many of the moons of the other planets were no more than points of light with minimal spectral and light-curve measurements before the Voyager mission. Now each is recognized as a separate world with highly individual characteristics.

The Planetary Data System (PDS) was created by NASA to provide a cost-effective system to preserve the scientific results of past and present planetary exploration missions and to make those data readily accessible to the planetary science research community. The PDS is supported by NASA's Office of Space Science (OSS).1

1. Called the Office of Space Science and Applications until March 1993.

A fundamental philosophy of the PDS is that data sets are best cared for by the research community, which has been actively involved in their creation and use. Therefore, the PDS is based upon a distributed, electronically connected architecture that provides an organization intended to support the many disciplines comprising planetary science by facilitating access to planetary data and hence stimulating research. Developed by and for scientists during the 1980s, the PDS became an operational system in 1990 and is available for routine use by planetary scientists.

Structure of the PDS

The PDS is divided into several operational units. The primary components are a Central Node, seven Discipline Nodes, and a variable number of Data Nodes. In addition, the National Space Science Data Center (NSSDC) currently serves as the deep archive for all PDS data sets. The detailed duties and structure of these components are described below.

The Central Node

The Central Node provides overall project management and coordination of PDS activities related to the development and promotion of data standards, advanced technology evaluation, data restoration and ingestion, and interactions with planetary flight projects. In addition, the Central Node maintains and provides access to a catalog system, the Central Catalog, used for locating and obtaining information about planetary data sets.

For a user unfamiliar with the data sets available through the PDS or the location of particular data sets, the Central Catalog may provide the initial contact with the PDS. It contains information about the availability and location of PDS data sets and associated ancillary information (such as instrument and spacecraft descriptions), and provides one means of ordering data sets. If a user wants to take delivery of an entire data set, the order will be filled by the NSSDC (for large data volumes) or by a Discipline Node (for smaller data volumes or those orders requiring special processing). For some data sets, more detailed information is available in the Central Catalog, allowing predetermined portions of a data set to be ordered. In cases where more extensive technical expertise is required, the user will be referred to the Discipline Node responsible for the data set or field of interest.

Discipline Nodes

A guiding principle of the PDS is that data sets will be curated by institutions that have the expertise to care for the data as well as to aid (and educate) others in their use. The primary functions of the Discipline Nodes, therefore, are to manage the restoration and curation of data sets of interest to a particular planetary discipline, to provide access to those data sets, and to provide scientific expertise in their use. The collection of Discipline Nodes (and associated Subnodes) crosses the breadth of planetary science, serving the entire planetary community. The current Discipline Nodes are the Navigation and Ancillary Information Facility (NAIF), Planetary Atmospheres Node, Planetary Geosciences Node, Planetary Imaging Node, Planetary Plasma Interactions Node, Planetary Rings Node, and Small Bodies Node.

The Discipline Nodes are the primary mechanism by which the PDS supports the analysis of planetary data. In many cases, users may browse detailed catalogs or even online data sets that are curated by a particular Discipline Node. Depending on the capabilities of a particular Node, orders for data sets or subsets may be filled through creation of files on the Node's computer system (allowing immediate access from the user's host computer), or through delivery of magnetic tapes or compact disk read-only memory (CD-ROM) volumes. Some Nodes also provide software useful in the analysis of particular data sets. In all cases, Discipline Nodes are sources of expertise in the use of PDS data sets.

Data Nodes

The purpose of Data Nodes is to restore individual data sets from a past planetary mission, or to make significant improvements to recently obtained data sets. The restoration process includes documentation of the data set and associated instruments, assembly of ancillary data (such as pointing geometry and calibration files), and possibly reformatting of the data. In many cases, a Discipline Node performs all of the tasks necessary to restore a data set. In cases where the necessary expertise or resources are not available at a Node, a Data Node is selected through a competitive proposal process. Data Nodes are funded by the PDS and associated with the Discipline Node most appropriate for ultimate curation of the restored data. Once the data are submitted to and accepted by the PDS, the Data Node will cease to exist.
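Purely as an illustration of the division of labor described above, the following minimal Python sketch routes a hypothetical Central Catalog order the way the text describes: the NSSDC fills large whole-data-set orders, while a Discipline Node fills smaller orders or those requiring special processing. The node list reflects the Discipline Nodes named above, but the threshold, function names, and routing rule are assumptions made for illustration, not the actual PDS software.

```python
# Illustrative sketch only: routes a hypothetical data order the way the text
# describes (NSSDC fills large whole-data-set orders; Discipline Nodes fill
# smaller orders or those needing special processing). The volume cutoff and
# function names are assumptions, not part of any real PDS implementation.

DISCIPLINE_NODES = {
    "navigation": "Navigation and Ancillary Information Facility (NAIF)",
    "atmospheres": "Planetary Atmospheres Node",
    "geosciences": "Planetary Geosciences Node",
    "imaging": "Planetary Imaging Node",
    "plasma": "Planetary Plasma Interactions Node",
    "rings": "Planetary Rings Node",
    "small_bodies": "Small Bodies Node",
}

LARGE_ORDER_GB = 10  # assumed cutoff separating "large" from "small" orders


def route_order(discipline, volume_gb, entire_data_set, special_processing):
    """Return the unit that would fill an order placed via the Central Catalog."""
    if entire_data_set and volume_gb >= LARGE_ORDER_GB and not special_processing:
        return "NSSDC deep archive"
    return DISCIPLINE_NODES[discipline]


# Example: a small imaging subset would be filled by the Planetary Imaging Node.
print(route_order("imaging", volume_gb=0.5, entire_data_set=False,
                  special_processing=False))
```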

The National Space Science Data Center

The NSSDC, located at the Goddard Space Flight Center, was established by NASA in the 1960s to serve as an archive and distribution center for space science data. It provides a deep archive, with environmentally safe storage facilities, for both analog (e.g., film and paper) and digital media (e.g., magnetic tapes and CD-ROMs). Periodically, data sets are migrated to modern media to ensure their viability. Once a data set has been ingested into the PDS, as described below, a copy is submitted to the NSSDC deep archive.

Functions of the PDS

Data Peer Review and Ingestion

Before data sets can be made available to the community through the PDS, it is necessary to bring the data into the system. The foundation of the data ingestion process, and the basis for simplifying and streamlining the distribution of data, is the application of standards. The PDS has led the effort to develop standards needed to enhance the storage, the distribution, and, ultimately, the use of planetary data. PDS standards encompass the structure and contents of file labels, file formats, and documentation of data sets.

The PDS has instituted new methods for checking and validating the quality of submitted data sets. Prior to acceptance, every data set undergoes a peer review process. This involves assembling a committee of PDS personnel (those involved in restoring or documenting the data set) and non-PDS scientists familiar with the use of such data. The goal is to ensure that the submitted data are scientifically valid and useful and that the associated documentation is complete. In cases where software is required for accessing a data set, the documentation and operation of the software package are reviewed as well. When problems are noted with the data, documentation, or software, corrections are made whenever possible; if corrections cannot be made, the problems will be noted in the documentation package accompanying every data delivery.

Data Distribution

Ready access to data by the user community is the primary service provided by the PDS. Users unfamiliar with the available data sets will want to use the advanced cataloging system provided by the Central Catalog. Through use of the Central Catalog, orders may be placed for entire data sets (filled by the NSSDC or a Discipline Node) or for predetermined data subsets (filled by a Discipline Node). Such data deliveries will be in the form of magnetic tapes, CD-ROMs, or electronic files, and will include documentation pertinent to the use of the data.

As all data sets will be curated by a Discipline Node, more experienced users may want to access the data at the Node itself. Some Nodes provide software to browse the data or detailed catalogs, allowing the user to create customized data orders containing only the desired portions of a data set and to quickly take delivery of the data. In some cases, the Nodes provide basic analytical tools as well. In all cases, the Discipline Nodes will be able to provide expertise on the scientific use of the data.

The PDS has also led the development and application of new technologies to aid the archiving and distribution of planetary science data. One such example is the production of CD-ROM volumes. CD-ROMs have made it possible to store large quantities of data on small archival volumes and to readily distribute copies to numerous users.
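The peer review and labeling standards described above lend themselves to simple automated checks. The sketch below illustrates one such check: verifying that a submitted product label carries a minimal set of descriptive keywords. The keyword list, field values, and function are assumptions offered in the spirit of PDS labeling practice, not the PDS's actual validation tools.

```python
# Illustrative sketch only: a minimal completeness check on a product label,
# in the spirit of PDS label standardization. The required keywords and the
# example label values are assumptions for illustration.

REQUIRED_KEYWORDS = [
    "DATA_SET_ID", "PRODUCT_ID", "INSTRUMENT_NAME",
    "TARGET_NAME", "START_TIME", "RECORD_TYPE",
]


def check_label(label: dict) -> list:
    """Return a list of problems found in a label (an empty list means it passes)."""
    problems = [f"missing keyword: {key}" for key in REQUIRED_KEYWORDS
                if key not in label]
    problems += [f"empty value for: {key}" for key, value in label.items()
                 if value in ("", None)]
    return problems


# Example: a label missing its target and start time fails the check.
label = {"DATA_SET_ID": "EXAMPLE-DATA-SET-V1.0", "PRODUCT_ID": "EXAMPLE-PRODUCT-01",
         "INSTRUMENT_NAME": "EXAMPLE INSTRUMENT", "RECORD_TYPE": "FIXED_LENGTH"}
for problem in check_label(label):
    print(problem)
```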
Support of Active Missions

Data sets resulting from future missions should be readily ingested by the PDS, as the PDS project will work closely with mission science teams early in the definition phases of the flight projects. Such early contacts and the application of standards will ensure that any data sets to be archived will be delivered to the PDS with the needed formats and documentation.

The PDS will be involved in the bulk distribution of mission data sets to a large portion of the planetary community. An example is the distribution of numerous CD-ROMs containing Magellan radar images to members of the planetary geosciences community, which is discussed in more detail below. It is expected that such mass distributions will be the norm for active missions, such as the Galileo mission to Jupiter. Researchers not included in the initial distributions always will be able to order data in the normal manner, with the order being filled by a Discipline Node or the NSSDC.

Promotion of Research

An underlying function of the PDS is to support research activities of the planetary community. Improving the ease of access to data through use of standardized data labels and file structures is essential to this goal, as is providing access to the scientific expertise resident at the Discipline Nodes. In addition, the PDS promotes research activities by organizing topical scientific workshops, providing visiting scientist programs, and developing and making available basic data analysis tools.

Examples of Ingesting Data into the PDS

Restoration of Data from an Inactive Planetary Mission

The PDS undertakes restoration of data sets when any of the following criteria exist:

- A mission had no project data management plan, or the plan predated the PDS;
- Processing of data products is underway or incomplete;
- Archival products have not been produced to PDS standards;
- No archival products have been produced for some or most of the data;
- Archival products do not represent the most recently derived data sets; or
- Archival products are not on long-lived media.

The Pioneer Venus mission, comprising an orbiter, an entry-probe-carrying bus, and four atmospheric entry probes, operated from 1978 through 1992. Numerous data sets were submitted to the NSSDC throughout the mission, but many investigators continued to work on producing a "best and final" version of their data sets. The PDS undertook a restoration effort in 1991 to bring all Pioneer Venus data sets up to PDS standards and to widely distribute the data throughout the planetary science community.

Representatives of the PDS met with members of the Pioneer Venus Science Working Group to plan the restoration effort. The mission's principal investigators (PIs) agreed to prepare a list of archival products, to prepare high-quality data products, and to participate in peer reviews of their data products. It was agreed that the PDS would have the following responsibilities:

- Prepare documentation on all instruments and data sets;
- Prepare PDS labels for all data sets;
- Design archive volumes;
- Conduct peer reviews of all data sets;
- Publish data sets on CD-ROM;
- Archive copies of all data sets with the NSSDC; and
- Distribute data sets to the community.

Members of the PDS worked closely with the Pioneer Venus PIs to build archive products for all of the mission's instruments. This effort was completed in 1995. The process leading up to peer review and ingestion of the data sets followed the steps outlined above. The final Pioneer Venus/PDS data archive will total about 160 CD-ROM volumes (containing about 650 megabytes each) of processed and derived data from about 15 instruments. In the future, about 300 additional CD-ROM volumes will be created for the "raw" data sets.

Ingestion of Data from an Active Planetary Mission

For planetary missions initiated after the PDS became operational, plans are developed prior to launch for establishing standards and procedures for ingestion of all anticipated data volumes. Ideally, the PDS members and mission teams have been working together for several years prior to launch, allowing design of the mission's ground data system to be optimized with data archiving as a goal. Key steps needed to ensure a timely flow of data from the mission teams into the PDS archive are:

- Archive planning. The PDS assists the project in developing a Project Data Management Plan, an Archive Policy and Data Transfer Plan, and detailed Software Interface Specifications for each instrument and data set expected to be produced. These documents specify the responsibilities of both the project and the PDS, define the content and format of all data volumes, and specify the procedures for data volume production, validation, and ultimate transfer to the archive.

- Data preparation. The PDS assists the project in applying PDS standards for production of archival-quality data sets. Members of relevant PDS Discipline Nodes and the Central Node lend their expertise to instrument teams during the data volume design and production phases. The PDS cooperates with the mission to validate data products and to produce archive volumes.
- Data cataloging. The PDS provides active catalog databases for access by the planetary community. These catalogs provide easy identification of data archive volumes to assist in ordering.
- Data transfer. The PDS is the designated recipient of the mission's data archives. The PDS coordinates delivery of the final archive volumes to the NSSDC.
- Data distribution. The PDS acts as the distribution agent for the planetary science community. Large data volumes may be automatically distributed on CD-ROM to a predefined mailing list of investigators. Smaller volumes may be distributed via magnetic tapes or electronic file transfer after users place orders through one of the PDS on-line catalog systems. PDS Discipline Nodes provide science expertise where appropriate, and distribute data volumes as long as copies are available. The NSSDC provides data copies to users outside the planetary community and over the long term.

Magellan Venus Mission Case Study

The Magellan mission to map the planet Venus, conducted by the National Aeronautics and Space Administration and the Jet Propulsion Laboratory (JPL), is a typical example of a large data-collecting mission that included, from inception, an end-to-end plan for data formats, acquisition, processing, distribution, and archiving. The Magellan data set defines Venus for all future studies. This data set will remain important and valuable virtually forever, and will be the reference data set for all future exploration of the most Earth-like planet in our solar system. The Magellan data set is an example that can be used to design the Mars data acquisition and other future planetary exploration.

Magellan was cost-constrained from the start. Nevertheless, the project management resolved to provide a complete end-to-end system to deliver the objective data product. Mission operations were simplified to reduce costs. Among the organizational innovations that were developed was a single project data system, unlike other flight projects that have separate data management systems for each element of the project. This scheme proved to be simpler, cheaper, and more effective.

One of the important drivers of the Magellan plan was NASA Management Instruction 8030 (NMI 8030.3A), which mandated a Project Data Management Plan. Magellan was the first planetary mission to come under this requirement, although the California Institute of Technology, which manages JPL and hence is ultimately responsible for the Magellan project, had never agreed to this provision. The Project attempted to follow the requirements strictly.

Several documents were essential to the success of the Project's data management. NMI 8030.3A provided the first basic framework and policy guidance for a project archive plan. The Project Plan was a high-level document that provided the data policy guidelines. The Project Scientist and Program Scientist at NASA headquarters prepared the Science Requirements Document, which contained the details of the science management of the data. The Project Data Management Plan was the overall plan that met the requirements of NMI 8030.3A. A much more detailed document was the Archive Policy and Data Transfer Plan, which provided details on the archiving function. Finally, the PDS Magellan Mission Interface Plan defined the respective roles of the Project and the PDS.

Magellan accomplished three 243-day (one Venus rotation) cycles of imaging using a synthetic aperture radar system. Cycle one was the primary mission, and accomplished the mission objective of 70 percent coverage, actually mapping 87 percent of the planet. Cycle two was primarily devoted to opposite-side (right-looking) coverage and filling major gaps. Cycle three was cut short by radio transmitter problems, but obtained about 28 percent of Venus in left-looking stereo coverage when combined with cycle one. In all, 98 percent of Venus was imaged. Cycle four was devoted to radio tracking for Doppler gravity coverage. Magellan's elliptical cycle-four orbit, 180 km by 8000 km, could not be used effectively in the polar regions. For that reason, at the end of cycle four, a successful experiment was performed to use the upper atmosphere to aerobrake the spacecraft into a nearly circular orbit. This was accomplished over a period of 70 days and completed to a 290 km by 540 km orbit in early August 1993. High-resolution gravity data were then collected until the termination of the mission in October 1994.

In all, Magellan produced over 4 terabytes of data. More than 70 gigabytes of these data are on CD, representing the most usable part of the data set. The Magellan Project has completed the distribution and archiving of hundreds of data product types and many thousands of individual products.

Lessons Learned from the Magellan Data Management Experience

- Archive considerations and commitment must begin in the early stages of a project and be included in basic project documentation.
- A single individual, for example the Project Scientist, should be tasked with the authority and the responsibility for the complete life cycle of the data, up to and including permanent archiving.
- A single data management system can be efficient and cost-effective.
- Strict attention to and careful preparation of the Software Interface Specifications are essential. These documents describe the data formats.
- All data products must be reviewed by the science team before release for distribution.
- A Project Data Management Plan is essential and must contain detailed plans and schedules. This document should be updated as conditions and realities of budget and mission operations require.
- Appropriate and clear NASA policy is essential to the process. NASA must be committed to the completion of all data tasks for the full life cycle. Without continued NASA support for the extended mission, Magellan would have achieved only a fraction of its final success, mostly because of the resulting inability to complete the data processing, documentation, and distribution for even the first 243 days of the mission.

Summary of Findings and Conclusions

The fundamental premise of the Planetary Data System is that active archives should be managed by centers with appropriate scientific interest and expertise, while the NSSDC provides the deep archive for all planetary data. To date this has worked quite well. As all planetary data have been obtained by federal government agencies using federal government funding, all resulting archives will ultimately fall under the purview of NARA. The following are lessons learned from the PDS/NSSDC experience.

- Data are ingested best and made most accessible to the community through cognizant discipline scientists. The PDS generally provides the ingestion and access to the data through discipline data centers that are staffed by "working" scientists in the field.
- Scientific expertise is needed at all stages of the archiving process. Panels of experts may provide this expertise.
- A hierarchy of archiving stages is appropriate for most projects. A data set will be most frequently accessed at the earliest phases of the archive. At this stage, on-line access may be desirable and frequent consulting with cognizant scientists may be necessary. Later, when the rate of access declines, the data can be moved to a deeper archive such as the NSSDC, and eventually to NARA for permanent retention, if NASA becomes unwilling or unable to continue the archiving of those data.
- The data storage technology is in a constant state of flux, with new computer hardware and software tools continually being introduced. In order to properly service data archives, frequent updates to software, hardware, and media are required. This lesson should apply equally well to the PDS, the NSSDC, and NARA, but perhaps with differing time scales for change; a brief illustration follows this list.
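The last two lessons suggest routine, rule-based reviews of archive holdings. The following minimal Python sketch shows how an archive manager might flag holdings for media refresh or for transfer to a deeper archive; all thresholds, field names, and the example holding are assumptions for illustration only, not a description of PDS, NSSDC, or NARA practice.

```python
# Illustrative sketch only: operationalizing two of the lessons above --
# moving data sets to a deeper archive as access declines, and flagging
# holdings whose storage media are due for refresh. All thresholds and
# field names are assumptions for illustration.
from dataclasses import dataclass

MEDIA_REFRESH_YEARS = {"magnetic tape": 7, "CD-ROM": 10, "optical disk": 10}
LOW_ACCESS_PER_YEAR = 5   # assumed cutoff for moving data to a deeper archive


@dataclass
class Holding:
    name: str
    media: str
    media_age_years: float
    accesses_last_year: int
    tier: str  # "active archive", "deep archive", or "permanent archive"


def review(holding: Holding) -> list:
    """Return recommended actions for one archived holding."""
    actions = []
    if holding.media_age_years >= MEDIA_REFRESH_YEARS.get(holding.media, 5):
        actions.append("migrate to current media")
    if holding.tier == "active archive" and holding.accesses_last_year < LOW_ACCESS_PER_YEAR:
        actions.append("transfer to deep archive (e.g., NSSDC)")
    return actions


# Example: an aging, rarely accessed tape holding triggers both actions.
print(review(Holding("Example derived data set", "magnetic tape",
                     media_age_years=12, accesses_last_year=2,
                     tier="active archive")))
```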
Astronomy and Astrophysics Data

Like planetary data, astronomy and astrophysics data are acquired by both ground-based and space-based observatories. Ground-based observatories are operated by universities or other nonprofit organizations (e.g., the Association of Universities for Research in Astronomy (AURA) and the Smithsonian Institution). They are funded by these organizations or by the National Science Foundation (NSF). Space-based observations are carried out by NASA. The discussion that follows is organized according to the ground-based and space-based programs and related data-retention issues.

Ground-based Astronomy and Astrophysics

Astronomy is an observational science; that is, it is based on what the sky provides and we collect. By their very nature, many astronomical investigations offer no such thing as "repeating an experiment" with the same results.

This is because many objects may have properties changing with time either because of their intrinsic nature (e.g., variable stars), evolution (e.g., stars going supernova), or reasons yet unknown. It happens quite frequently that a highly variable object is found in satellite data, such as a flare in the X-ray data collected by the NASA Einstein Observatory, and subsequent archival research in optical plates allows identification with a given type of star.

Ground-based observatories traditionally have been used to study the sky at visual wavelengths. Since the Second World War, however, astronomers have used new technologies to observe at radio wavelengths, and now also in the infrared.

There are many optical observatories in the United States or operated by U.S. institutions abroad. They range from privately owned telescopes to university observatories to national facilities. Among the big centers are the National Optical Astronomy Observatories (NOAO), managed by AURA and funded by NSF, which operates the Kitt Peak telescopes in Arizona and the Cerro Tololo telescopes in Chile; the Mt. Hopkins telescopes in Arizona, operated jointly by the Smithsonian Astrophysical Observatory (SAO) and the University of Arizona; and a complex of telescopes on Mauna Kea in Hawaii, which are owned by different U.S. institutions, including the University of Hawaii, the University of California, and the California Institute of Technology (the new-generation 10-m Keck telescope), and by agencies in Canada, France, and the United Kingdom. This list is by no means exhaustive and omits many important observatories. New telescopes are being built by consortia of universities, including both U.S. and foreign institutions. These telescopes use new technology to build larger mirrors that will allow us to look deeper into the universe.

Most telescopes are meant to be used for individual observing programs, but some are dedicated to systematic sky surveys. The latter include the Sloan Digital Sky Survey (University of Chicago, Princeton University), the Spectroscopic Survey Telescope (Pennsylvania State University, University of Texas), and the Harvard/University of Cambridge (U.K.) telescope. Infrared (IR) ground-based observations are made at many optical telescopes. A dedicated IR telescope on Mauna Kea, the Infrared Telescope Facility (IRTF), is funded by NASA.

Radio observatories also range from smaller ones operated by universities to larger national facilities. Most of the latter are operated by the National Radio Astronomy Observatory (NRAO), funded by NSF. These include telescopes in Green Bank, West Virginia, the Very Large Array (VLA) in New Mexico, and a 12-m dish at Kitt Peak. A large telescope, mostly used for neutral hydrogen and other line observations in astronomy, is located in Arecibo, Puerto Rico, and is operated by the National Astronomy and Ionosphere Center with NSF funding. These observatories are used both for individual research programs and for survey work.

Data obtained from ground observations have traditionally been considered the property of the observer and, therefore, observatories have no standard policies for data archival. The exceptions are some big projects, such as the Palomar Sky Survey, the ESO Sky Survey, the Harvard Plates, and the Hubble Space Telescope Guide Star Survey, where data either are made public and sold (many copies are kept by researchers and at libraries), or are archived within the university or observatory. Some centers (e.g., NRAO, NOAO, SAO) have started to archive all or most data obtained from major telescopes (e.g., the VLA, the Canada-France-Hawaii Telescope, the Multiple Mirror Telescope). These archival data are valued and used broadly by astronomers. Nevertheless, archival activities remain of generally low priority.

Although the older astronomical data consist of photographic plates and other analog data, virtually all data today are collected digitally. There also have been major efforts to digitize large amounts of old photographic data to allow their analysis by computer. An example of this is the digitization of a whole-sky survey by the Space Telescope Science Institute (STScI), and this survey is now available for sale on CD-ROM from the Astronomical Society of the Pacific. Recently the astronomical and astrophysical community adopted the Flexible Image Transport System (FITS) as a standard format for digital files. With the advent of digital data, there also has been an evolution from individual data analysis packages to a few widely distributed packages (IRAF, AIPS, VISTA, XANADU), which provide standard tools for baseline analysis.
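For readers unfamiliar with FITS, the short sketch below writes and rereads a minimal FITS file using astropy, a present-day Python package not discussed in this report; the header keywords shown (OBJECT, TELESCOP, DATE-OBS) are standard FITS keywords, while the file name and values are invented for the example.

```python
# Illustrative sketch only: creating and rereading a minimal FITS file with
# the modern astropy package (not a tool discussed in this report). The file
# name and header values are made up for illustration.
import numpy as np
from astropy.io import fits

# A small image array standing in for detector data.
image = np.zeros((64, 64), dtype=np.float32)

hdu = fits.PrimaryHDU(data=image)
hdu.header["OBJECT"] = "EXAMPLE FIELD"        # observed object or field name
hdu.header["TELESCOP"] = "EXAMPLE TELESCOPE"  # telescope used
hdu.header["DATE-OBS"] = "1993-02-01"         # observation date
hdu.writeto("example.fits", overwrite=True)

# Any FITS-aware package can now recover both the data and the metadata.
with fits.open("example.fits") as hdul:
    print(hdul[0].header["OBJECT"], hdul[0].data.shape)
```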

Space Astronomy and Astrophysics

Because of the filtering and distortion effects produced by the Earth's atmosphere, the amount of energy emitted by celestial bodies that can be detected on the ground is significantly limited. Observations from space remove such limitations. For these reasons, space astrophysics emerged as an important field as soon as the technology for space observations became available. From its inception, space astronomy and astrophysics has been mostly under NASA's purview, although a few experiments have been financed by the Department of Defense (DOD). The data are collected through telescopes and detectors placed on airborne devices (balloons or planes), rockets, NASA's Space Shuttle, and orbiting satellites. The largest volume of data is collected by satellites, and most of these missions are international collaborations. The U.S. portion has always been managed by NASA.

Disciplines and Representative Data Sets

Within NASA, space astronomy and astrophysics are organized in different wavelength-based disciplines, reflecting the organization of the scientific community. These disciplines include the infrared, whose main data center is the Infrared Processing and Analysis Center in Pasadena, California, where the data from the Infrared Astronomical Satellite (IRAS) mission are archived; the optical and ultraviolet (UV), with data centers at the Space Telescope Science Institute (STScI) in Baltimore, Maryland, where the Hubble Space Telescope data are sent and archived, and at the NASA Goddard Space Flight Center (GSFC), where the International Ultraviolet Explorer (IUE) archive resides; and high-energy astrophysics, which will be discussed below in more detail. Table II.1 provides a representative sample of NASA astrophysics archives. Table II.2 lists some well-documented data products that should be archived.

Evolution of Archival Procedures

The earlier NASA astrophysics projects were called "principal investigator" (PI) missions. In these missions, a grant was awarded to a group of PIs, who built the hardware, received the data from the experiments, and performed data analysis and research, typically resulting in publication. These PIs had no clearly stated guidelines to archive data or to prepare data for archiving as part of their funded activities. As a result, at the end of funding, the data and some data products were typically sent to the NASA data repository at the NSSDC, in whatever state they happened to be at that time. Documentation generally was minimal, and these data were very difficult to retrieve for scientific use, even if they were adequately physically preserved. It has become fully apparent, however, that the unique character of the space data—a relatively small data volume obtained at great cost—makes their effective preservation and archiving a high priority.

At variance with the custom of ground-based observatories and reflecting the unique characteristics of the data, NASA funded data centers, which were originally linked with the PI groups. These data centers processed the data to eliminate the instrument signatures, produced software to facilitate scientific analysis, and supported guest observers in their projects. Even if the lifetime of a space observatory is very limited, the data are typically processed and used by many scientists for years after they have been collected and the satellite has ceased functioning.

TABLE II.1 A Representative Sample of NASA Astrophysics Archives, by Satellite Mission

- High Energy Astrophysical Observatory 2: X-ray data; launched 1978; duration 2.5 years; total data volume ~100 gigabytes; data center: Einstein Observatory Data Center, Cambridge, Massachusetts.
- International Ultraviolet Explorer: ultraviolet data; launched 1978; ongoing; total data volume ~100 gigabytes; data center: National Space Science Data Center, Greenbelt, Maryland.
- Infrared Astronomical Satellite: infrared data; launched 1983; duration 300 days; total data volume ~150 gigabytes; data center: Infrared Processing and Analysis Center, Pasadena, California.
- Hubble Space Telescope: optical/ultraviolet data; launched 1990; ongoing; total data volume ~5,500 gigabytes by year 2005; data center: Space Telescope Science Institute, Baltimore, Maryland.
- Compton Gamma Ray Observatory: gamma-ray data; launched 1990; ongoing; total data volume ~1,000 gigabytes by year 2000; data center: National Space Science Data Center, Greenbelt, Maryland.

TABLE II.2 Examples of "Mature" Data Products Ready for Long-term Archiving

- Einstein Observatory (HEAO-2): There are six CD-ROM sets: EOSCAT (Rev. 1, January 1990; Rev. 2, July 1991; 3 CD-ROMs); HRI Images (July 1990, 2 CD-ROMs); IPC Slew Survey (January 1991, 1 CD-ROM); HRI Event List (January 1992, 2 CD-ROMs); IPC Event List (June 1992, 4 CD-ROMs); and SSS, MPC, and FPCS Data Products (June 1992, 1 CD-ROM produced by HEASARC). The Einstein Observatory Catalog of IPC sources is a 7-volume set (hardcopy, paper), which includes the same information as the first CD-ROM set, Rev. 2 (July 1991).
- International Ultraviolet Explorer: Many paper catalogs have been developed and are stored at NASA.
- Infrared Astronomical Satellite: Many paper catalogs have been developed, as well as a set of CD-ROMs, which are stored at IPAC and the NASA NSSDC.
- Hubble Space Telescope: 101 compressed digitized sky survey CD-ROMs, which are made available by the Astronomical Society of the Pacific (ASP).

In the following sections the panel provides several examples that illustrate the activities of active data centers and archives in space astrophysics.

The Hubble Space Telescope Data Archive

The Hubble Space Telescope (HST) was the first of NASA's "Great Observatories" to be put into orbit. The HST observes astronomical objects (including planets, stars, galaxies, and quasars) in the optical-UV band at very high angular and spectral resolution. It returns about 1 gigabyte of data per day, and given an estimated telescope lifetime of 15 years, the HST data archive is expected to hold about 5.5 terabytes.

The HST archive is operated by the Space Telescope Science Institute. The Institute is also responsible for overseeing and managing the scientific program of HST for NASA; for being the principal point of contact for scientists using HST; for operating the telescope and developing and executing calibrations; for receiving, calibrating, archiving, and distributing to the scientific community all of the HST data; and for supporting users in their data analysis. When the STScI ceases at some future point to operate, the HST data archive holdings will be transferred to the NSSDC. At that time, presumably, there will be a decreased level of scientific use of the data.

Most HST data are proprietary for a period of one year, after which they become generally available to the community. The archive was officially opened for external use in February 1993. Users of the archive include funded and unfunded archival investigators; HST observers visiting the institute who retrieve their proprietary data for analysis; institute scientists and engineers who access the data for calibration purposes and for monitoring the operations of the HST; and scientists planning HST proposals who may want to check if a certain object was observed and in which mode, and to determine the results of the observation.
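The expected archive size follows directly from the daily data rate and assumed lifetime quoted above (this is simply the panel's figures restated, not an independent estimate):

$$1~\mathrm{GB/day} \times 365~\mathrm{days/year} \times 15~\mathrm{years} \approx 5{,}500~\mathrm{GB} \approx 5.5~\mathrm{terabytes}.$$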
The archive constitutes a central element in the architecture of the STScI data system. The STScI receives the data from the satellite and extracts the science data from the telemetry stream. These data are stored in GEIS format (IRAF VMS data format; IRAF is the data analysis system used at STScI), and calibrated using the STSDAS package of IRAF. STSDAS has been developed at the Institute and is the same software that is used by the scientific community to analyze the data. Metadata are stripped from file headers and catalogs are updated. The data are then separately archived on two optical disks: one copy forms the primary HST archive, and the second copy is used at the HST data analysis facility in Munich, Germany, which serves as a backup site. After one year, the nonproprietary data are copied onto optical disks and given to the Canadian HST data center.

Both raw and reduced data are archived. The raw data are kept as a safeguard against possible reduction problems, and to allow customized reprocessing by the user if necessary. All the calibration files are archived, but only the latest version of the calibrated files is kept in the archive. Besides the scientific instrument data files, the archive also includes metadata arranged in relational databases, the catalog of observations, the engineering database, and the calibration database. Proposal information, data analysis software, and a large amount of documentation are kept as well and made available to users.

Data quality is monitored through a detailed review of two percent of the data as they are archived, and through comments resulting from use of the data. An Archive Users Committee periodically reviews the archiving activities.

Users are assisted through on-line instructions (available through the following electronic mail address: archive@stsci.edu) and users' manuals. Archive scientists at STScI are responsible for monitoring the integrity of the archive, providing high-level scientific support, and producing the users' manuals. The current staffing for the operational archive consists of three Ph.D. scientists, three archive specialists, one archive manager, four operators, and two programmers.

The STScI, with input from the Archive Users Committee, is exploring ways to make the archive more efficient for scientific use. These will include adding available astronomical information to each HST target, as well as developing a mechanism to browse compressed images and providing intelligent scanning of help documentation. Major long-term issues include interfacing with the Astrophysics Data System (see below) and responding to the aging of the hardware without affecting access to the data by the archive's users.

High-Energy Astrophysics Data

High-energy astrophysics uses observations of the sky in the X-ray and gamma-ray regions of the electromagnetic spectrum. Because the Earth's atmosphere absorbs photons at these energies, high-energy astrophysics is based entirely on space missions. For data management and archiving purposes, NASA sponsors a number of active data centers and the High Energy Astrophysics Science Archive Research Center (HEASARC). A "deep," permanent data archive is provided by the NSSDC.

The active data centers are responsible for data processing, distribution, and archiving; user support; and developing and providing analysis software. These centers operate during the lifetime of a spacecraft and during the period when the data are still being used intensely by the scientific community. The centers are located with groups scientifically active in the given field, because it is recognized that scientific expertise is required to manage the data and to identify and correct anomalies. They also help scientists who may not have experience with high-energy data in their data analysis, by providing remote on-line help or on-site assistance. The centers are staffed by scientists, software engineers, and support personnel. Active data centers include the Einstein Observatory Data Center at SAO, now winding down its activities; the Roentgen Satellite (ROSAT) Data Center at GSFC, with a branch at SAO; and the Gamma Ray Observatory (GRO) and Advanced Satellite for Cosmology and Astrophysics (ASCA) Data Centers at GSFC. The Einstein Observatory Data Center, which gives a good example of the transition from a PI mission to an archiving center, and the HEASARC, which provides expert archiving support, are examined below.

The Einstein Observatory (HEAO-2) was the first X-ray satellite to be equipped with a high-resolution mirror, which allowed astronomers to obtain real images of the sky in this energy band. The satellite was operational from December 1978 to mid-1981. It was conceived as a PI mission, but the PI team decided to allow the community to use an increasing fraction of the observing time. Eventually all of the data became part of a public archive, which contains approximately 100 gigabytes of data.

The satellite data were received by NASA and sent to the Einstein Observatory Data Center at the SAO in Cambridge, Massachusetts. Here the data were processed and examined to monitor the health of the satellite and to verify the quality of the results of the processing. Following this verification the data were made available to the observer and then archived.

The Einstein experience demonstrates the importance of retaining all of the raw satellite data. Some problems in the original software and better instrument calibrations required subsequent reprocessing of all the data. Moreover, users frequently asked for additional processing of their data with a setup different from that of the original processing. The following three examples are representative. (1) A flare in Proxima Cen was captured in data that would normally have been screened out because of high background. (2) Data that would normally have been discarded were made available to an ionospheric physicist. He was interested in the data taken of a bright nonvarying source, when it was close to being occulted by the Earth, to measure the optical depth of the Earth's atmosphere. (3) A catalog of nearly 1000 stars was made from serendipitous data. In the original processing, only data obtained when the satellite was locked on a given target were processed. However, the instruments were not turned off while the satellite was scanning between targets. This resulted in data amounting to a "shallow," but all-sky, survey. These data were later retrieved and processed in a nonstandard way, resulting in a catalog of X-ray sources, which is still being exploited for scientific investigations. Because the level-0 data were retained at all times, these investigations could be supported.

The Einstein Observatory Data Center is completing its operations now, more than 10 years after it detected its last photon. The most recent years have been dedicated to reprocessing and archiving activities. The data have been

The issues that must be addressed in the archiving of these data and in making them available to the primary and secondary users include:

- The large amount of data collected, particularly those collected by satellite.
- The interpretation and analysis of this information, particularly in its raw form.
- The specialization of space science data. Its primary consumers will always be specialists in particular fields. How does this affect the ability of a general archives to house and service such data?
- Data continuity. The methodologies used in gathering the data may vary, and the time frames in which the data are collected may not be contiguous.
- Changes in media. Data require periodic migration to new media. How will archive hardware and software change over time?
- The varying formats of the archivable data. Data in repositories are in many formats, with different levels of documentation and different levels of review and validation.
- Metadata. How does one maintain knowledge about instruments and their calibration over time and provide adequate documentation for all prospective users?
- Technology tools. What technologies are likely to increase the productivity of archive researchers most dramatically?
- Standards versus flexibility. How important is the need for a "single" user interface for the community in relation to the special needs of users?
- Changes in use over time. Data will continue to support scientific endeavors, but may also be applied to social and economic problems and to historical research.
- Needs of the future users of the archive. What is the appropriate user paradigm for archival research?

Proposed Model

The Space Sciences Data Panel considered a number of options for an archival model of space science data. However, recognizing that major components of such a model have been developed at considerable cost to support data management activities, it proposes building on these. Particular attention has been paid in the existing components to data management plans, metadata, and data management standards. The proposed model consists of the following elements:

- Management controls developed early in the research project/mission plan to ensure data collection, handling, processing, and archiving of all data during the lifetime of the program; the development of a life-cycle data management plan.
- A series of subdisciplinary nodes for the processing and servicing of data. These nodes make up a discipline's data archive center, which may also be referred to as an active archive. An example of this archival level is the Planetary Data System with its atmospheric, geosciences, planetary plasma interactions, small bodies, rings, imaging and navigation, and ancillary information facility nodes.
- A central agency "deep" archive serving as the data repository for the discipline archival centers.
- The National Archives and Records Administration, serving as the permanent archival manager for space science and related scientific and technical data for the nation; establishing policies and standards for very long-term and permanent archives of space science data, regardless of their physical location.

An important element in the model is an electronic network that permits scientists to access the data managed at any level in the archival model.
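Purely to make the tiers concrete, the sketch below restates the proposed model as a simple Python data structure, with data moving one tier deeper as agency use ends; the tier labels paraphrase the elements above, and the code is illustrative rather than a specification of any actual system.

```python
# Illustrative sketch only: the panel's proposed archival tiers expressed as
# a simple data structure. Tier labels paraphrase the model above; the
# transfer rule (move data one tier deeper when the previous level no longer
# needs them) is a restatement of the text, offered for illustration.
TIERS = [
    "life-cycle data management plan (project/mission)",
    "discipline data archive center (active archive, e.g., PDS nodes)",
    "agency deep archive (e.g., NSSDC, NOAA data centers)",
    "NARA permanent archive",
]


def next_tier(current: str) -> str:
    """Return the next, deeper tier in the proposed model."""
    i = TIERS.index(current)
    return TIERS[min(i + 1, len(TIERS) - 1)]


# Example: a data set leaving an agency deep archive would pass to NARA.
print(next_tier("agency deep archive (e.g., NSSDC, NOAA data centers)"))
```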
Planning

Program plans incorporating new space missions or experiments establish the basis for observations, collection of data, instrumentation to be used, and the recording of findings. The purpose of the mission or scientific program, its time frame, and its boundaries are defined at this stage. The documentation created at this point is important to the long-term use and understanding of the data collected. Though the planning stage is not a service point within the data network, records generated at this time are essential to documenting the history of the mission, the event, or the experiment. This historical documentation always should be archived permanently.

At the initial planning stage, the Program Office must generate a data management plan that covers the entire project life cycle. The program manager, data manager, and associated researchers together should be responsible for ensuring the adequate development and implementation of this plan. A good example of careful preparation for a research project's data life cycle was presented in the Magellan case study.

Discipline Data Centers

Scientists are, and will remain, the primary users of space science data. To ensure that data are available to scientists for furthering research and scientific endeavor, a data center must be established and charged with the responsibility of maintaining important observational and experimental data. The data center must collect and organize the data, create the metadata structure that permits effective access and use, provide mechanisms for access and use, and create products that convey the research results. To provide access to primary and secondary users, communications networks must be in place and experts in the discipline and subdiscipline must be available to assist in the use of the data, particularly by nonscientists or those unable to interpret or understand the data. An example of a well-planned and implemented discipline data management program is provided in the PDS case study.

Data Archive Centers

The government agencies NASA, NOAA, NSF, DOD, and DOE generate or collect most space sciences data, often through contracts and grants with academic and not-for-profit research organizations in the field. Experts serve as PIs and design experiments, determine specific parameters for the data collection, and often analyze and maintain data in personal databases. Yet the data collected belong to the sponsoring organization, and an effort must be made to ensure their availability to other PIs, to the broader space science community, and to the nation as a whole. Thus, it is incumbent upon each agency to manage effectively the data for which it is responsible, as long as the data are still being used and of further importance to the agency's programs. The agency also must ensure the data are appropriately managed for transfer to a preservation agency when the agency no longer finds it necessary to maintain the data for its own use. NASA's NSSDC and NOAA's statutorily designated data centers are examples of such agency archival programs.

"Deep" Archives

As has been noted, space science is not the purview of a single agency, or even of a single nation. There is every reason to believe that as the volume of scientific data grows, the demand for a national scientific data archive will grow. Such a national archive would serve to provide management control over centralized or decentralized discipline-specific archives, one of which might be space science. Such an archive would permit scientists and other users access to the entire body of scientific data generated, collected, or received by the United States, including much provided by the scientific communities of foreign countries. This center would provide expertise in the use of data, as well as ensure their preservation and availability to primary and other users.

As the ultimate archival safety net, NARA plays a role in the archiving of scientific data from the initial stages, when the existence of the data is registered with NARA and archival controls are placed on the data. Its responsibility continues through the active collection, processing, and primary use phases of the data life cycle, in the form of audits of the status of archival practices, management, and preservation steps at regular intervals. At a time in the future when data are no longer considered useful to support current scientific activities, NARA will assume responsibility for those data to ensure their preservation and accessibility for all time. The assumption of this responsibility, however, does not require relocation of data to preserve archival standards and practices. The panel believes that NARA also may serve as a permanent back-up to large collections of data that are still being actively serviced at one of the other levels of the model, and that the data sets may be more easily migrated to new technology in a central technology translation facility operated by NARA.

As has been noted previously, existing communications capabilities and those expected to be developed can make the actual location of data irrelevant to their successful availability and use. Likewise, rapidly increasing storage capabilities of various electronic media and their decreasing costs make the storage of data in more than one location possible and frequently practical. The data collection and the expertise necessary to interpret and make the data useful remain the long-term, high-cost items.

Thus, the panel's proposed model is based on the need to permanently retain data acquired from costly missions and observations that provide snapshots in time, comparisons of space data over time, and documentation of solar, atmospheric, and other events. Availability of these data for future scientific endeavors can minimize costly duplication of effort and ensure that baseline data which cannot be recreated are always available.

Institutional Relationships

In many scientific areas, data and information sharing has become more common because of budget constraints, research that crosses disciplines, and national and global problems that require a broad range of data from many sources for their solution (e.g., global change). In the space sciences, sharing generally does occur, but there are often instances in which PIs maintain a proprietary interest in the data collected or generated, and data ownership becomes a question. The panel emphasizes the importance of national ownership of, and access to, scientific data paid for by the taxpayers for the benefit of all.

International and Interagency Relationships

Many problems are global in nature, and scientific data on our solar system and the universe are equally important to all peoples. Space science is expensive, and international cooperation can reduce the cost to any one nation. Thus, the space science archives must serve as a repository not only of data collected by U.S. scientists, but also of data that result from joint projects or that are shared among scientists from different countries. Many U.S. scientists participate in international projects and experiments, and space missions are increasingly cooperative undertakings. Some of these data have been archived by NASA and NOAA under the aegis of the World Data Center system; however, much more needs to be done in this regard.

The federal agencies collect data to support their individual missions. However, data holdings and management responsibilities may rest with a different agency, requiring that data be turned over to that agency following space missions, experiments, or data analysis programs. The cost of collecting observational data requires that agencies share the costs, as well as the data that result. Thus, we find Air Force data going to NOAA, NASA data going to NOAA, and NOAA data going to these agencies and others.

NARA's Relationship with Other Agencies

NARA has a unique role to play in the management of archived scientific data. It not only establishes archiving policy and procedures, but also is charged by statute to preserve and permanently store the important records of the federal government. Consequently, NARA is the safety net for the scientist of the future; it is the deep archive for data that the research and development agencies no longer wish to keep, but that meet NARA's retention criteria. To date, NARA has appraised and acquired almost no space science data from the science agencies, particularly data in electronic form. It currently has neither the resources to manage such data nor the expertise necessary to promote their use. The agencies, in turn, have given little consideration to NARA's potential role in their data management plans or in their efforts to archive and service space science data.
The panel concludes that this situation has left many of the older data sets on paper, film, charts, and the like at risk of loss through a lack of understanding of the role NARA can and should play in the preservation of scientific records created or received by the government, regardless of format. NARA and the agencies must work together to develop a clear understanding of the role each must play in archiving scientifically and historically valuable, costly, and generally nonrecreatable space science data. NARA should be involved early in decisions relating to data management over the life cycle. In many instances, it may serve as an offsite backup archive for some data. In others, it may register data and make their existence known to secondary users long before primary use has declined or NARA has taken archival ownership. In no instance should NARA assume responsibility for servicing primary users.

The panel believes that space science data are an immensely valuable national resource that must be preserved for use by future scientists. The panel therefore recommends a strengthened relationship among the agencies, with significant NARA participation, to ensure that:

- All data are readily available to scientists anywhere;
- Data are shared and not duplicated;
- Each agency fulfills its responsibility for quality controls; metadata structures; documentation of the analyses, forms, and systems designed to process the data; and production of data products and development of services and mechanisms for making the data available and usable by scientist and nonscientist alike; and
- Each agency participates in electronic networks that enable access, sharing, and transfer of data.

Documentation Requirements

Effective use of space science data well into the future requires comprehensive and accurate documentation of the program that generates the data, the data themselves, the analyses carried out using the data, and the system that maintains and stores the data. Documentation should follow the project life cycle, beginning with the initial plans for a project or mission. In the case of NASA, this results in a NASA Data Management Plan (NDMP), which defines the data management considerations at each stage of the life cycle: collection; processing; analyzing and peer reviewing; reprocessing, reformatting, and reanalyzing; storage; primary use; secondary use; and final disposition (archive permanently or destroy). For a comprehensive discussion of metadata requirements and a framework relevant to all observational data, see the report of the Ocean Sciences Data Panel.
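Because each stage of the life cycle must be addressed explicitly, a data management plan can be checked mechanically for coverage. The sketch below is illustrative only: the stage names are taken from the list above, while the class and field names are hypothetical and do not reflect the actual structure of an NDMP.

    from dataclasses import dataclass, field
    from enum import Enum, auto

    class LifeCycleStage(Enum):
        """Life-cycle stages a data management plan must address (from the text above)."""
        COLLECTION = auto()
        PROCESSING = auto()
        ANALYSIS_AND_PEER_REVIEW = auto()
        REPROCESSING_REFORMATTING_REANALYSIS = auto()
        STORAGE = auto()
        PRIMARY_USE = auto()
        SECONDARY_USE = auto()
        FINAL_DISPOSITION = auto()  # archive permanently or destroy

    @dataclass
    class DataManagementPlan:
        """Hypothetical plan record: one free-text provision per life-cycle stage."""
        mission: str
        data_manager: str
        provisions: dict = field(default_factory=dict)  # LifeCycleStage -> description

        def uncovered_stages(self) -> list:
            """Stages the plan does not yet address."""
            return [stage for stage in LifeCycleStage if stage not in self.provisions]

Under this reading, a plan would be considered complete for review purposes only when uncovered_stages() returns an empty list.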
4 SUGGESTED RETENTION CRITERIA AND APPRAISAL GUIDELINES

Retention Criteria

The panel identified the following retention criteria, ranked in priority order, based on its discussions and a review of the literature (a compact checklist sketch follows the list).

- Significant value of the data. Do the data contain fundamental information that will be of use to future researchers or future national programs? A consideration for retention of a data set is whether the data have resulted in significant scientific return; that is, have they been used in scientific analyses? A negative answer to this question will require a decision as to whether there is potential for future use of the data. In determining this, there are likely to be considerations unrelated to the primary research value, including whether the data document important characteristics of the program and the mission of the agency that produced them.
- Adequacy of documentation. Do the data sets have accompanying documentation containing data formats, conversion factors to physical units, and error assessment information? Are there ancillary data with dates, times, ephemerides, and the like? Are the necessary software and algorithms included?
- Cost of replacement. Could the data be reacquired if a future national need for them arose? If so, would they be costly to reacquire relative to the costs of preservation?
- Uniqueness of the data. Do the data exist in an accessible repository that meets NARA standards of permanence and security? If so, are they adequately backed up?
- Peer review. Has the data set undergone a formal peer review to certify its integrity and completeness, or is there documented evidence that use of the data set has led to publication of results in peer-reviewed journals? Have expert users provided evidence that the data set is as described in the documentation?
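The checklist below is a minimal sketch of how these five criteria might be recorded during an appraisal. It assumes a simple yes/no screening; the class and field names are hypothetical, and the priority ranking is carried only in the order of the fields.

    from dataclasses import dataclass

    @dataclass
    class RetentionScreening:
        """Hypothetical yes/no record of the five ranked retention criteria."""
        significant_value: bool        # fundamental information of future use?
        adequate_documentation: bool   # formats, conversion factors, error assessments, ancillary data?
        costly_to_replace: bool        # reacquisition impractical or expensive relative to preservation?
        unique: bool                   # not already preserved in a repository meeting NARA standards?
        peer_reviewed: bool            # formal review, or documented use in peer-reviewed publications?

        def criteria_met(self) -> list:
            """Names of the criteria satisfied, in the panel's priority order."""
            ordered = ["significant_value", "adequate_documentation", "costly_to_replace",
                       "unique", "peer_reviewed"]
            return [name for name in ordered if getattr(self, name)]

A screening of this kind records which criteria are satisfied; the weight given to each, and the final retention decision, remain matters of expert judgment.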

Ancillary Recommendations

- As much scientific data as possible should be preserved. In reviewing data for retention, no arbitrary percentage of the available information should be regarded as an adequate archive.
- All levels of processed data should be considered for archival storage, with priority accorded to "level-0" data. All levels of processed data should be considered for permanent archiving as long as they meet the other criteria for retention. Lower levels contain the most intrinsic information; higher-level data generally are more accessible. NASA defines "level-0" data as raw instrument data at the original resolution, time ordered, with duplicate data packets removed (a small illustration of this definition follows the list). Level-0 data are the most important to keep in perpetuity in order to correct possible processing errors, to take advantage of new considerations, and to support future scientific investigations. Successively higher levels of processing produce nonredundant instrument or experiment data records, and ultimately result in fully reduced data products such as images, topographic profiles, spectra, and the like. The highest levels are the most usable for drawing conclusions, but they also contain the most model-dependent assumptions and are subject to interpretive errors that may be unacceptable to future researchers. For this reason, it may be necessary to retain several levels of processed data, as well as the level-0 data. Judgments must be made by qualified researchers as to the value of each level of processing before any data are destroyed.
- Complete sets, rather than samples of data, should be accessioned. It must be made clear that the scientific value of a data set that has been deemed archivable depends on its completeness. For research purposes, a subsampled data set or an abbreviated example of the data set is generally not acceptable. In some cases portions of a data set may be lost or unreadable, but this should not prevent the salvaging of the usable portion if it is otherwise considered archivable.
- Data in any commonly used format should be accepted. The establishment of inflexible format or media specifications should be avoided. Such specifications may be difficult or impossible for agencies or individual scientists to meet when archiving data that are determined to fall under the above criteria.
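The level-0 definition cited above reduces to two mechanical steps, time ordering and duplicate removal, applied to raw packets. The sketch below is illustrative only; the packet fields are hypothetical and no particular telemetry format is assumed.

    from typing import NamedTuple

    class Packet(NamedTuple):
        """Hypothetical raw telemetry packet; real packet formats are mission specific."""
        spacecraft_clock: int  # acquisition time tag
        instrument_id: int     # source instrument
        payload: bytes         # measurement at the original resolution

    def to_level0(packets: list) -> list:
        """Time-order the packets and drop exact duplicates, per the level-0 definition."""
        seen = set()
        unique_packets = []
        for pkt in packets:
            if pkt not in seen:  # NamedTuples compare and hash by value, so repeats are caught directly
                seen.add(pkt)
                unique_packets.append(pkt)
        return sorted(unique_packets, key=lambda p: p.spacecraft_clock)

Anything beyond these two steps, such as calibration or resampling, belongs to the higher processing levels discussed above.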
Examples of Data Sets That May Be Suitable for Archiving

Several examples exist of data sets that are likely to meet, or come close to meeting, the above retention criteria in their present state. In no case have these data sets been subject to a retention review; hence, the panel cannot state definitively that they should be retained. However, these examples have received a great deal of attention within their respective agencies and discipline communities as data sets that should be kept for long periods.

Planetary Science

As discussed in Section 2, the Planetary Data System has pioneered the methodology of "publishing" data sets on CD-ROM. This process includes acquiring all related ancillary data; capturing and referencing relevant publications and documents relating to the data set or the instrument that produced the data; and describing the format of the information in a self-describing object description language that is both human- and machine-readable. Each data set is reviewed by peers, including members of an appropriate PDS discipline node as well as informed members of the scientific community. While only part of the process, the preparation of the data set and related files for mastering on the CD provides a valuable service, much like the final editing of a book, in verifying that all relevant information is available and easy to find. The PDS has published CDs in cooperation with a number of planetary programs, including the Voyager missions to Jupiter, Saturn, Uranus, and Neptune; the Magellan mission to Venus; and the Viking mission to Mars.
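The "self-describing" labels referred to above pair each data object with plain-text keyword = value statements that a person can read and a program can parse. The fragment below is a hypothetical, simplified label and parser; it is not the actual PDS object description language, only an illustration of the idea.

    # A hypothetical, PDS-like label: plain text, human-readable, trivially machine-parsable.
    label_text = """
    RECORD_TYPE      = FIXED_LENGTH
    TARGET_NAME      = VENUS
    INSTRUMENT_NAME  = RADAR_SYSTEM
    LINES            = 1024
    LINE_SAMPLES     = 1024
    SAMPLE_BITS      = 8
    """

    def parse_label(text: str) -> dict:
        """Turn keyword = value lines into a metadata dictionary."""
        metadata = {}
        for line in text.splitlines():
            if "=" in line:
                keyword, value = line.split("=", 1)
                metadata[keyword.strip()] = value.strip()
        return metadata

    print(parse_label(label_text)["TARGET_NAME"])  # prints VENUS

Because the label travels with the data and needs no special software to read, it remains interpretable even if the original processing systems are retired.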

Astronomy and Astrophysics

The Hubble Space Telescope project has acquired a large volume of observations of a wide range of astrophysical objects and phenomena. In order to serve a broad community of astronomers, the project has emphasized the maintenance of a robust database of observations along with the appropriate ancillary information, including documentation on the instruments and calibrations. The database includes both the raw data and the most up-to-date calibrated version of the data. Other data that may be suitable for long-term archiving include data from the Einstein Observatory, the International Ultraviolet Explorer, and the Infrared Astronomical Satellite, as suggested in Table II.2.

Space Physics

The Dynamics Explorer investigators have recently completed moving a large percentage of the data, documentation, and ancillary information onto optical platters and submitting the platters to the NSSDC. The magnetic field data from the ISEE-1 and ISEE-2 missions are another example of archivable data sets.

Appraisal Guidelines

The National Archives and Records Administration appraises records on the basis of their evidential and informational values. It is concerned with records of secondary or longer-term value, those that may be expected to retain value long after they cease to serve their more immediate or primary uses. Though scientific databases do provide evidence of the work of a department or agency, their value is primarily informational and rests on the content of the records rather than on the activities of the agency that collected or created them.

Scientific data create special problems for appraising their long-term value, particularly beyond the primary user community. They are voluminous, constantly increasing, difficult to label, and often difficult for all but scientists to access and use. In addition, they often are very expensive to collect, provide baselines for future collections, and enhance understanding of other data. They are of immense importance to those who are analyzing the data, for advancing scientific endeavors, and for educating new scientists. They also are important to an understanding of the world in which we live and are used by economists, historians, statisticians, politicians, and the general public. At the same time, it is impossible to ascertain the full value of the data to researchers and other users centuries from now. More important, it is difficult to ensure that technical experts will always be available to assist the nonscientist in the use of these data.

The retention criteria outlined above are important for the long-term or permanent retention of scientific data. The appraisal of the data should follow the guidelines presented below.

Who Should Perform the Appraisal?

A number of people should be involved in the appraisal process. These should include officials of the agency carrying out or sponsoring the scientific investigation or mission, the investigators themselves, peer reviewers who attest to the validity and value of the data, and NARA. At this time, most individual investigators and peer reviewers do not recognize their role as appraisers for archiving purposes, but the views of these experts should weigh heavily in decisions relating to the permanency or long-term value of the data obtained.

The principal investigators and the project manager who define the data that are to be collected often have the best sense of how long the data will be valuable for use by the agency and, in some instances, how much secondary use of the data can be anticipated. However, it is difficult for them or others to determine the longer-term value of the data, at least until the data have been collected and some analysis has been made of their accuracy and usability. The primary users can provide a detailed level of understanding regarding the uses of the data and their longer-term value for application to national problems or further research. A data management plan generally is, or should be, a part of any research project or mission plan.
The data manager has responsibility for implementing the plan and ensuring the accessibility and maintenance of the data. Thus, the data manager must play a key role in the appraisal process. The agency records manager and the information systems manager also should play important roles in the appraisal of a particular data set or database. While the information systems manager is mainly concerned with primary uses of the data, the records manager is more concerned with their long-term value and preservation in a usable state. The records manager, working with the project and systems managers, defines disposition requirements for the data and transmits these to NARA for approval.

Since many scientific endeavors require participation from a number of agencies, it is appropriate in those cases for an interagency team to be formed to coordinate data management activities and to assign responsibilities for the maintenance of the data during periods of primary use. This interagency team is also important to the appraisal process, since it ensures that the perspectives of a number of different user groups are considered.

NARA is responsible for the final appraisal of records and the determination of their value as accessions to the permanent National Archives. Thus, NARA is the final determinant in the appraisal process, and it must either appraise the scientific records with its own staff or rely on a community of appraisers, made up of experts in the subject field both in government and in the private sector, to assist in the appraisal.

It is questionable whether NARA can now, or could in the future, maintain sufficient expertise on its staff to make these appraisal decisions alone. However, it should maintain archive management control and, therefore, establish and promulgate guidelines for appraisal. The panel believes that it is impossible for NARA to maintain and adequately service large scientific databases and numerous data sets for the foreseeable future. Thus, it recommends the establishment, under NARA management controls, of satellite archives within the agencies responsible for the primary collection, creation, analysis, and use of the data.

The panel therefore recommends that NARA retain standing advisory committees with membership from the academic community, science archivists and historians, and researchers in related subject fields to address the retention of scientific data collections, or of those collections for which questions regarding long-term use have arisen. These committees also should work with NARA to refine standards and other requirements related to scientific data over time, ensuring that changing needs are reflected in appraisal decisions. The committees should annually identify data sets that should be accessioned and should monitor each agency's progress in providing data to NARA, or to its designated satellite archives, and in preserving and documenting the data. The committees would thus provide advice to NARA in its appraisal process, bringing a wide range of scientific and technical expertise to bear on appraisal decisions, particularly those concerning the secondary value of the data.

When Appraisal Should Take Place: The Periodicity of Appraisals

Appraisal of scientific data sets should begin at the initial stages of the investigation or mission, when the data to be collected are identified. It is essential that appraisals continue as data are collected, analyzed, and used by the primary users, consistent with the archival model proposed in Section 3. Appraisals might be considered under the following circumstances:

- Whenever determinations are made on the success of an experiment, investigation, or mission; the quality of the data collected; the extent of errors or other degradations in the data that would not support analysis or further research; and the use to be made of the data. These determinations should be based on whether the current state of the data makes their retention worthwhile over time, whether they supersede or enhance other data, whether they can serve as the basis for further research, and whether the level-0 data should be retained permanently. Often these appraisals are based more on the current value to scientists than on any consideration of secondary value or archival importance. This appraisal leads to acceptance of the records in the subdiscipline centers.
- Whenever data are moved to a formal discipline data center and the determination is made as to whether the data set duplicates other data; has superseded, or been superseded by, other data; or is useful in connection with other data. Analyses of the data submitted to the data center may not necessarily replace the level-0 data, but in fact may serve to make the level-0 data more valuable.
- Whenever a project, investigation, or mission is completed and data are no longer being collected and added to the data set. At this point, the data are most likely in an agency-wide data center that holds numerous other data sets and represents significant holdings from various other projects.
- Whenever a decision is being made as to whether data are to be transferred to NARA or to remain in a satellite archive within the agency. It is at this stage that NARA may wish to ask its advisory committee to review the data sets and determine whether they meet the retention criteria defined earlier, and whether their informational value requires that they be maintained for long periods of time and possibly permanently.

There are a number of other times at which appraisals may be undertaken by NARA, such as when data sets are being migrated to new technology (to ensure that the useful data remain once migration is completed), when problems develop and the data are at risk of degrading, or when a designated satellite archive is being disbanded because of budget constraints or reorganizations. In these appraisals, the decision will need to be made as to whether the data sets should be incorporated into another satellite archive or moved to the National Archives for permanent or long-term storage.

At each appraisal stage, the data will be reviewed to ensure that they continue to meet the retention criteria defined in this study. With the passage of time, clearer understandings of the value of present-day scientific data to researchers and other users will become possible. At the same time, it may become possible to ascertain whether the value of level-0 data is greater than that of higher levels of data analysis and aggregation, and whether this value offsets the costs of maintaining all levels of data.

How the Appraisal Should Be Done: Steps in the Appraisal Process

Many of the procedures currently followed in appraising other records are valid for scientific data. Descriptions of the data, the systems that contain the data, the project that collected the data, and the purposes of the project all add to the knowledge necessary to determine the informational value of a record. Documentation of the data and of the system hardware and software also is essential to the future use of the data and must be considered in the appraisal process. Sample outputs of the systems, useful in appraising other electronic records, may not be as valuable in determining the informational value of scientific records unless the reviewer is a scientist. The following scenario may be useful in defining how the appraisals may be carried out (a compact summary appears at the end of this subsection):

1. Identify the data to be collected in the scientific missions or investigations; prepare a data management plan; name a data manager; and define information systems that can adequately accept, process, and store the data. Involve the agency's records manager in the planning process.
2. Describe the data to be collected and the proposed disposition on an SF-115 form and send this to NARA for approval.
3. In the initial analysis of the data, focus on the success or failure of the project, the quality and completeness of the data, and their usability. Determining the validity of the data collection is in itself an appraisal process.
4. Process, organize, and fully document the data to make them accessible to primary users. Peer review procedures can assist the scientists and the data managers in determining whether the data have received careful and accurate analysis and need to be retained.
5. Appraise any data sets resulting from additional analyses.
6. Complete the appraisal of the data when they are in the agency's "deep archive." The location of the final archival set should be determined at this time.

The panel recommends that the initial appraisal of scientific data sets be made by the investigators, the information systems manager, and the agency's records manager. The agency's data manager, together with experts in the subject area, should verify the disposition determinations during peer review and as analysis of the data takes place. For larger data collection efforts that involve two or more agencies, an interagency team should review the data to determine their correlation with other agency data.

As scientific projects or missions are completed, procedures for data disposition are applied. Data are moved to archives at the agency level or to more centralized data repositories, such as NOAA's data centers, which serve a number of agencies. Appraisal decisions made at this time may affect the data's accessibility to researchers in generations to come. Thus, NARA, and possibly the advisory scientific data committee, should be brought into the review and decision process at this point. Based on the review at this time, earlier tentative disposition decisions may be overridden. Also at this time, NARA may decide to accept the records for storage, or it may determine that a designated satellite archive would be more appropriate for the long-term or permanent retention of the data.
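In summary, the scenario above can be read as an ordered set of checkpoints, each of which must be passed before the next is considered. The sketch below is only one possible rendering; the step names paraphrase the numbered list and the function is hypothetical.

    from enum import Enum, auto
    from typing import Optional, Set

    class AppraisalStep(Enum):
        """Hypothetical checkpoints paraphrasing the appraisal scenario above, in order."""
        PLAN_AND_NAME_DATA_MANAGER = auto()    # identify the data, prepare the plan, define systems
        FILE_SF115_WITH_NARA = auto()          # describe the data and the proposed disposition
        INITIAL_QUALITY_REVIEW = auto()        # success or failure, completeness, usability
        DOCUMENT_AND_PEER_REVIEW = auto()      # organize and document the data for primary users
        APPRAISE_DERIVED_DATA_SETS = auto()    # products of additional analyses
        FINAL_DEEP_ARCHIVE_APPRAISAL = auto()  # decide the final archival location

    def next_step(completed: Set[AppraisalStep]) -> Optional[AppraisalStep]:
        """Return the earliest checkpoint not yet completed, or None when the appraisal is done."""
        for step in AppraisalStep:  # Enum members iterate in definition order
            if step not in completed:
                return step
        return None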
5 SUMMARY OF FINDINGS, CONCLUSIONS, AND RECOMMENDATIONS

Findings and Conclusions

- Most space science data collected by space-based instruments may be presumed to be federal records. Ground-based data generally are not federal records.
- NARA currently holds no digital space science data, and it needs to develop a plan if it is to undertake the long-term retention of such data.
- Space science data are managed and archived in a distributed system, principally by the users of the data. The panel concludes that it is impossible for NARA to maintain and adequately service all archivable scientific databases and data sets for the foreseeable future.
- The policies and priorities relating to data archiving are highly variable across and even within agencies. There is concern that even agencies (e.g., NASA) with a clear charter to obtain scientific data and to make those data widely available are not fully meeting this responsibility.
- Many scientists and program managers at agencies responsible for federal space science data do not know their responsibilities vis-à-vis NARA.
- The data management policies, and the resources devoted to data after their acquisition by the R&D agencies, will largely determine whether data will ever become archivable and ready for long-term retention. Many space science data are not adequately documented or preserved, and many already have been lost or have become inaccessible.
- All observational space science data have unique (i.e., nonrepeatable) and significant informational content. Many space science data sets also have important evidentiary value. The data are the primary products of lasting value from the tens of billions of dollars spent on space science and operations programs to date. Nevertheless, the panel has found that the majority of the data collected by federal agencies since the dawn of the space age are not currently archivable, and these data are not likely ever to become archivable by NARA unless major changes are instituted at those agencies.
- The most valuable data for long-term retention are the level-0 data resulting from a mission, but higher-level data are generally more usable.
- Some disciplines (e.g., planetary science) and agency programs (e.g., the NASA Planetary Data System) have been successful in preparing archivable data sets. The most successful programs offer valuable lessons for the others.

Recommendations

The panel proposes the following recommendations for NARA regarding the long-term archiving of observational space science data:

- NARA should generally strengthen its liaison with each federal agency producing space science data. Toward that end, NARA should establish what data the agency intends to deposit with NARA and when, determine what other data may be suitable for archiving, and stipulate in detail what procedures should be followed in providing data to NARA.
- The panel suggests that NARA examine the following attributes, ranked in priority order, in determining whether a data set is archivable:
  - Significant value of the data.
  - Adequacy of documentation.
  - Cost of replacement.
  - Uniqueness of the data.
  - Peer review.
  In addition, the following ancillary recommendations should guide the application of these retention criteria:
  - As much scientific data as possible should be preserved.
  - All levels of processed data should be considered for archiving, with priority accorded to level-0 data.
  - Complete sets, rather than samples of data, should be accessioned.
  - Data in any commonly used format should be accepted.
- NARA should establish, under NARA management controls, satellite archives within the agencies responsible for the primary collection, creation, analysis, and use of data, thereby maintaining the data in the locations most likely to serve the needs of the primary scientific users. The existence of distributed permanent archives should not be seen as jeopardizing NARA's ability to meet its statutory obligations, but as enhancing or enabling them.
- NARA should retain standing advisory committees with membership from the academic community, science archivists and historians, and researchers in related subject fields to address the retention of scientific data collections, or of those collections for which questions regarding long-term use have arisen. These committees also should work with NARA to refine standards and other requirements related to scientific data over time, ensuring that changing needs are reflected in appraisal decisions. The committees should annually identify data sets that should be accessioned and should monitor each agency's progress in providing data to NARA, or to its designated satellite archives, and in preserving and documenting the data.
- NARA should be actively involved in the development of the Master Directory, which is being led by NASA in cooperation with other federal agencies.

The panel also makes the following recommendations to NOAA and the other federal agencies for space science data:

- NOAA should work more closely with NARA in documenting and establishing directories of its holdings and in providing access to them.
- NASA should continue to support and improve its active archive services, and it should strengthen its working relationships in those areas with the scientific community on the one hand and with NARA on the other.
- NSF should require that all appropriate level-0 data obtained under its auspices be properly documented and archived.
- All federal agencies producing space science data should have in place mechanisms that provide for a proper archive copy of their data.

Finally, the panel urges a strengthened relationship among all the R&D agencies, with significant NARA participation, to ensure that:

- All data are readily available to scientists anywhere;
- Data are shared and not duplicated;
- Each agency fulfills its responsibility for quality controls; metadata structures; documentation of the analyses, forms, and systems designed to process the data; and production of data products and development of services and mechanisms for making the data available and usable by scientist and nonscientist alike; and
- Each agency participates in electronic networks that enable access, sharing, and transfer of data.

ACKNOWLEDGMENTS

The panel gratefully acknowledges the assistance of Paul Uhlir and Julie Esanu in the preparation of this report, as well as the following individuals, who provided briefings and other information: Joseph Allen, NGDC; Joseph Bredekamp, NASA; Dean Bundy, Naval Research Laboratory; David deYoung, National Optical Astronomical Observatory; Robert Frederick, Air Force Space Forecast Center; Joseph King, NSSDC; Knox Long, Space Telescope Science Institute; Guenther Riegler, NASA Astrophysics Division; Jud Stailey and Thomas Smith, Air Force Environmental Technical Applications Center; Earl Tech and Steven Blair, Los Alamos National Laboratory; Raymond Walker, UCLA; and James Willett, NASA Space Physics Division.

BIBLIOGRAPHY

King, Joseph. 1993. "Saving the Right Data," NSSDC News (4)8, Winter 1992/93.

National Academy of Public Administration. 1991. The Archives of the Future: Archival Strategies for the Treatment of Electronic Databases, A Report for the National Archives and Records Administration, Washington, D.C.

National Aeronautics and Space Administration (NASA). Undated. Management Instruction 8030, NMI 8030.3A.

National Aeronautics and Space Administration (NASA). 1992a. OSSA Information Systems Program Annual Report 1992, Office of Space Science and Applications, Washington, D.C.

National Aeronautics and Space Administration (NASA). 1992b. State of the Data Union, Office of Space Science and Applications, Information Systems Branch, Washington, D.C.

National Aeronautics and Space Administration (NASA). 1992c. 1991 Annual Statistics and Highlights Report, National Space Science Data Center, Goddard Space Flight Center, Greenbelt, Md.

National Aeronautics and Space Administration (NASA). 1993. Magellan Archive Policy and Data Transfer Plan, Final Revision G, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, Calif.

National Archives and Records Administration (NARA). 1985. Saving the Right Stuff, Office of Records Administration, Washington, D.C.

National Archives and Records Administration (NARA). 1990. Managing Electronic Records, Office of Records Administration, Washington, D.C.

National Research Council (NRC). 1982. Data Management and Computation, Volume 1: Issues and Recommendations, Committee on Data Management and Computation, Space Science Board, National Academy Press, Washington, D.C.

National Research Council (NRC). 1984. Solar-Terrestrial Data Access, Distribution, and Archiving, Joint Panel of the Committee on Solar and Space Physics and the Committee on Solar-Terrestrial Research, National Academy Press, Washington, D.C.
National Research Council (NRC). 1986. Issues and Recommendations Associated with Distributed Computation and Data Management Systems for the Space Sciences, Committee on Data Management and Computation, Space Science Board, National Academy Press, Washington, D.C.

National Research Council (NRC). 1988. Selected Issues in Space Science Data Management and Computation, Committee on Data Management and Computation, Space Science Board, National Academy Press, Washington, D.C.

National Research Council (NRC). 1993. 1992 Review of the World Data Center-A for Rockets and Satellites, National Space Science Data Center, Committee on Geophysical and Environmental Data, Board on Earth Sciences and Resources, National Academy Press, Washington, D.C.

Smithsonian Astrophysical Observatory (SAO). 1993a. The Einstein Observatory CD-ROMs, Revision 2.0, Cambridge, Mass.

Smithsonian Astrophysical Observatory (SAO). 1993b. Quick Reference Guide to einline, Revision 2.3, Cambridge, Mass.