IV

Report of the Ocean Sciences Data Panel

Bruce Gritton,* Richard Dugdale, Thomas Duncan, Robert Evans, Terrence Joyce, and Victor Zlotnicki

CONTENTS

1 Introduction
2 Oceanographic Data
3 Metadata for Long-term Retention of Observational Oceanographic Data Sets
4 Major Elements in Life-cycle Management of Observational Data Sets: Creation Through Long-term Retention
5 Summary Conclusions and Recommendations

1 INTRODUCTION

The ocean is complex and vast; there are strong seasonal and interannual signals, and large regions are undersampled. Therefore, even in the age of satellite data products, the community of oceanographic data users needs more information. New sources of data usually are collected by agencies, groups, and individuals for a particular research project and remain in their respective hands for periods of time extending to several years before reaching data centers. Data often arrive at data centers without close scrutiny by anyone other than the originator. Deficiencies in data reporting, such as failure to eliminate or flag bad data and to provide necessary metadata (see Box IV.1 for a set of relevant general definitions), make the addition of new data into a center's system difficult even with well-defined data types. While data exchange among researchers generally has taken an informal route outside the purview of the data centers, large data sets (especially satellite data) are increasingly coming from data acquisition and analysis centers designated for particular data types. These centers may or may not be responsible for distribution of data outside the group of primary users of the data.

In the course of secondary usage by scientific peers, problems with data sets are often uncovered, yet are seldom reported. Furthermore, results of the above usage often are in the form of a new, “created” data set. In some cases, this new data set may be of greater general interest than the original data. These value-added data sets often migrate to data centers and are distributed as well. This process of usage by secondary users can be viewed as a type of peer review of data sets. This stage is particularly important considering the value and importance of data to tertiary users, who are not scientists.

* Panel chair. The authors' affiliations are, respectively, the Monterey Bay Aquarium Research Institute; University of Southern California; University of California at Berkeley; Rosenstiel School of Marine and Atmospheric Science; Woods Hole Oceanographic Institution; and Jet Propulsion Laboratory.




Box IV.1: General Definitions

Observational Data Set—An observational data set is an aggregate of measurements or observations of environmental properties or the properties of objects in the environment. Measurements are the direct or minimally processed outputs from sensors, and thus are often in engineering units, such as volts. Observations represent the best assessment of the state of the environment expressed in scientific units, such as degrees centigrade. Typically, both measurements and observations are associated with a time and space specification, and are aggregated into data sets on the basis of some common context criteria, such as all salinity observations from the same platform or all surface temperatures for a narrow range of time.

Metadata—A class of data used to describe the content, representation, structure, and context of some well-defined set of observational data. Content metadata define observational data items in terms of description, units of measurement, and legitimate value ranges. Representation metadata define the units of measurement for each observational data item and the physical format for the whole data set. Structure metadata categorize the groupings of data items into logical aggregates, which typically correspond to real-world entities. Context metadata define all ancillary information associated with the collection, processing, and use of the observational data. The metadata may include a quality assessment of an individual observation in a data set and an overall evaluation of the observational data set.

Information Model—An information model is the specification of the objects (things, people, events, concepts) about which one needs to maintain information. The specification should identify and define the objects, important attributes of the objects, and inherent relationships between the objects. The model may be expressed in a formal language or a narrative text with a graphical depiction that follows well-defined notation standards. Examples of pertinent real-world objects in the oceanographic domain include entities such as platforms, instruments, sensors, investigators, sampling plans, processing algorithms, stations, sections, projects, data collection runs, and calibrations.
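The distinctions drawn in Box IV.1 can be made concrete with a small data-structure sketch. The Python fragment below is purely illustrative and is not part of the panel's report; every class name, field name, and value is an assumption introduced here to show how content, representation, structure, and context metadata might travel with an observational data set. The example observation reuses the North Atlantic location cited later in this chapter, with an invented temperature value.

```python
# Illustrative sketch only: hypothetical classes showing how the Box IV.1
# metadata categories might be attached to an observational data set.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ContentMetadata:
    """What a data item means: description, units, legitimate value range."""
    description: str
    units: str
    valid_range: Tuple[float, float]

@dataclass
class RepresentationMetadata:
    """How values are physically encoded and formatted."""
    value_type: str      # e.g. "float32"
    file_format: str     # e.g. "netCDF" (assumed format, for illustration)

@dataclass
class StructureMetadata:
    """How data items group into logical aggregates (records)."""
    record_fields: List[str]

@dataclass
class ContextMetadata:
    """Ancillary collection and processing information, including quality."""
    platform: str
    instrument: str
    quality_note: str

@dataclass
class ObservationalDataSet:
    observations: List[dict]                              # the observations themselves
    content: dict = field(default_factory=dict)           # field name -> ContentMetadata
    representation: Optional[RepresentationMetadata] = None
    structure: Optional[StructureMetadata] = None
    context: Optional[ContextMetadata] = None

# Example: one sea-temperature observation (temperature value is invented).
sst = ObservationalDataSet(
    observations=[{"time": "1993-05-05T06:00Z", "lat": 30.0, "lon": -20.0,
                   "depth_m": 100.0, "temperature_c": 17.2}],
    content={"temperature_c": ContentMetadata("sea temperature",
                                               "degrees centigrade", (-2.0, 40.0))},
    representation=RepresentationMetadata("float32", "netCDF"),
    structure=StructureMetadata(["time", "lat", "lon", "depth_m", "temperature_c"]),
    context=ContextMetadata(platform="research vessel", instrument="CTD",
                            quality_note="originator quality control applied"),
)
```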
Environmental data are becoming increasingly important to scientists, policymakers, resource managers, educators, students, and the general public as technology brings them closer to data sets (and their manipulation and display) and because of the relevance of such data to understanding and managing the environment. Considering the importance of promptly providing these users with correct and properly documented data, there is an ever increasing need for the scientific review of oceanographic data, for the active distribution of and assistance with the proper use of the data, and for the safe, long-term archiving of the data.

The panel approached the study problem by identifying the items needed to aid the steering committee in synthesizing effective recommendations for the National Archives and Records Administration (NARA) and the National Oceanic and Atmospheric Administration (NOAA). The panel identified four needed items:

Rules for retention or deletion of observational data sets from the ocean sciences, and a mechanism of appraisal that would apply the rules;

A framework that specifies the types of metadata needed to make long-term observational data sets useful to primary, secondary, and tertiary users;

An architectural model to effectively handle observational data throughout its life cycle, from creation to long-term archival storage; and

Recommendations to existing organizations in the data management infrastructure to effect workable long-term retention of oceanographic records in electronic form.

The panel used the task statement provided by the steering committee to help guide its discussion and analysis.

2 OCEANOGRAPHIC DATA

Nature of Oceanographic Data

The oceans and atmosphere are turbulent fluids, constantly changing over many spatial and temporal scales. The numerous types of data that describe the oceans are often unrelated to one another, and even those that are related frequently have nonlinear and poorly understood interactions. For example, temperature data from the North Atlantic, at 30° North, 20° West, at 100 m depth, at 06:00 GMT, May 5, 1993, cannot be accurately predicted from data collected in the same place the year before, or even the week before, or from data collected at the same time 1,000 km or even 100 km away, or from salinity data collected at the same place and time. Each datum contributes unique information as long as it is accurate, corresponds to a different physical quantity, is obtained from a different time and place, and cannot be accurately computed from other existing data. Thus, observed oceanographic data are largely non-redundant in nature. Each observation is a unique datum that, due to the passage of time alone, cannot be replicated.

In oceanography, each observational data point records the state of a part of the Earth's oceans at the time and place of collection, and each data set provides related data that extend scientific knowledge across location, time, or other dimensions. Collectively, oceanographic data sets make up a record of the ocean's changes in time and space, and of its interactions with humankind, the atmosphere, and the rest of the Earth system. As such, all non-redundant oceanographic data will be needed by future generations to understand the planet they inherit. Conversely, any non-redundant data set that is destroyed can never be recovered, because we cannot turn back time in the ever-changing oceans and because it cannot be accurately reconstructed from other data.

The above logic dictates the preservation of data for use by scientists of future generations. However, a separate question arises: what oceanographic information will future historians need to study and assess today's policy decisions based on information available to today's policy makers? The panel believes that such information can best be obtained from sources such as published reports, journal articles, and books, rather than by consulting the original data or model output. Nonetheless, the preservation of oceanographic data sets will provide needed references in the future for all individuals needing to use the information.

Major Oceanographic Data Holdings in the United States

The principal federal agency ocean data holdings are at the NOAA National Oceanographic Data Center (NODC), at the NASA Physical Oceanography Distributed Active Archive Center (PO.DAAC), and at several Navy centers, which hold mostly classified data sets. In addition, significant amounts of data are held by universities.

Located in Washington, D.C., the NODC archives physical, chemical, and biological oceanographic data collected by other federal agencies; by principal investigators under grants from the National Science Foundation (NSF); and by state and local government agencies, universities and research institutions, and private industry. The center also obtains foreign data through bilateral exchanges with other nations and through the facilities of World Data Center A for Oceanography, which is operated by the NODC under the auspices of the National Academy of Sciences. The NODC provides a broad range of oceanographic data and information products and services to thousands of users worldwide, and increasingly, these data are being distributed on CD-ROMs and on the Internet.
Table IV.1 presents a summary of the NODC's data holdings.

The PO.DAAC is a major federally sponsored oceanographic data center, which is operated by the California Institute of Technology's Jet Propulsion Laboratory in Pasadena, California. As one element of the NASA Earth Observing System Data and Information System, the mission of the PO.DAAC is to archive and distribute data on the physical state of the oceans. Unlike the data at the NODC, most of the data sets at the PO.DAAC are derived from satellite observations. Data products include sea-surface height, surface-wind vector, surface-wind stress vector, surface-wind speed, integrated water vapor, atmospheric liquid water, sea-surface temperature, sea-ice extent and concentration, heat flux, and in situ data that are related to the satellite data. The satellite missions that have produced these data include the NASA Ocean Topography Experiment (TOPEX/Poseidon, done in cooperation with France), Geos-3, Nimbus-7, and Seasat; the NOAA Polar-orbiting Operational Environmental Satellite series; and the Department of Defense's Geosat and Defense Meteorological Satellite Program.

Classes of Oceanographic Data

Oceanographic data can be divided into two broad classes: small-volume and large-volume data sets. The majority of traditional oceanographic observations, such as temperature, salinity, and nutrient concentration, usually form small-volume data sets because they are based on individually conducted measurements or sample collections.

TABLE IV.1 National Oceanographic Data Center Data Holdings (as of October 1994)

Discipline                                        Volume (megabytes)

Physical/Chemical Data
  Master data files
    Buoy data (wind/waves)                                   9,679
    Currents                                                 4,290
    Ocean stations                                           1,645
    Salinity/temperature/depth                               1,557
    BT temperature profiles                                    872
    Sea level                                                  125
    Marine chemistry/marine pollutants                          89
    Other                                                       68
    Subtotal                                                18,325
  Individual data sets, for example
    Geosat data sets                                        12,841
    CoastWatch data                                         60,000
    Levitus Ocean Atlas 1994 data sets                       4,743
    Other (estimated)                                       11,000
    Subtotal                                                88,584
  Total Physical/Chemical                                  106,909

Marine Biological Data
  Master data files
    Fish/shellfish                                             115
    Benthic organisms                                           69
    Intertidal/subtidal organisms                               30
    Plankton                                                    32
    Marine mammal sighting/census                               21
    Primary productivity                                         7
    Subtotal                                                   274
  Individual data sets, for example
    Marine bird data sets                                       52
    Marine mammal data sets                                      4
    Marine pathology data sets                                   4
    Other (estimated)                                          200
    Subtotal                                                   260
  Total Biological                                             534

Total Data Holdings                                        107,443

Source: NOAA, private communication, 1994.

Satellite and other remotely sensed observations, such as radar measurements of wave heights or ocean infrared emissions, generally form large-volume data sets.

Small-volume Data Sets

Small-volume data sets may be defined as those with volumes that are small in relation to the capacity of low-cost, widely available storage media and related hardware. The hardware and software to write and produce CD-ROMs are generally available for less than $10,000, and personal computers capable of reading CD-ROMs are being marketed as home-use, consumer items. Specifically, the total volume of small-volume oceanographic data is projected to be less than 50 gigabytes by 1995, and thus the entire historical data set for all observations could be stored on fewer than 100 CD-ROMs. This is fewer discs than many people have in their compact disc music collections.

Issues such as archiving cost, longevity of media, and maintenance of the data holdings are not the dominant considerations with regard to retaining small-volume data sets. Rather, the major issue with respect to this class of data is the completeness of the descriptive information, or metadata. If a data set has been properly prepared and documented, the operations required to migrate the data between storage media should be amenable to significant automation and therefore pose only a minor challenge to the long-term maintenance of the archive. Further, these data may be widely distributed with simple replication of the media. For example, the NODC and various NASA data centers have provided copies of data sets from their holdings to many users for a number of years. The NODC ocean data holdings listed in Table IV.1 all may be characterized as small-volume data sets.
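As a rough check on the storage claim above, the arithmetic can be sketched as follows. This is an illustrative calculation only; the 650-megabyte capacity per CD-ROM is an assumed, period-typical figure rather than a number from the report.

```python
# Back-of-the-envelope check of the small-volume storage claim above.
total_small_volume_mb = 50_000                      # projected < 50 gigabytes by 1995
cd_capacity_mb = 650                                # assumed capacity of one CD-ROM
discs_needed = -(-total_small_volume_mb // cd_capacity_mb)   # ceiling division
print(discs_needed)                                 # 77 discs, comfortably under 100
```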
Large-volume Data Sets

A different problem is posed by large-volume data sets. The biggest data sets typically come from earth observation satellite sensors, such as altimeters, synthetic aperture radars, microwave radiometers, and “ocean color” instruments. As noted above, many of these data are held at the Jet Propulsion Laboratory. These data sets can be challenging to contemporary storage devices. However, it is clear that for the data set to exist at all, an adequate storage medium capable of capturing and maintaining the data for some time period must exist when the data are generated. Further, the time period for reliable, initial storage should cover at least the lifetime of the data at the organization acquiring and using the data before the records need to be migrated to new media or transferred to another organization, such as NOAA or NARA. In addition, during the initial storage period there are likely to be major increases in the density of mass storage accompanied by significant decreases in the cost of storage. Thus, data sets that are challenging today will be transformed to a manageable “small-volume status” in the future, as advancing technology increases the capacity and lowers the cost of storage devices. To date, this scenario has in fact been the case for oceanographic data sets as both data acquisition and data storage have made the transition from manual to electronic means and the volume of practical, inexpensive data storage has improved by orders of magnitude.

Redundant Data

While many of the data used in oceanography are unique and thus cannot be replaced, some data are in fact redundant and can be considered for disposal. An example is a data product that can be calculated from one or more sets of observational data. During the time when this type of derived data is being actively used, the least-cost strategy may be to store and distribute it in such a manner as to simplify timely access and use. However, use of such data may eventually decline to a point where very few accesses are made, and it becomes more cost-effective to recalculate the derived data set when needed than to continue to store it.

The panel believes it is important to have broad, but clear, guidelines for the types of data that may be considered for disposal, and proposes the following definitions of redundant data:

Different versions of the same basic data, calculated from the same primary data set using different algorithms. In this case, retention of the primary data set suffices, as it is the one from which all other versions can be reconstructed. However, sufficient information about the computational programs and algorithms should be kept permanently so that the “redundant” versions of the data can in fact be recomputed in the future.

Data that are oversampled in relation to the frequency response of the measuring instrument. In this case, a subset of the data points would suffice. For example, if a temperature sensor on a drilling platform takes two minutes to respond to a temperature change of one degree Celsius, recording the reported temperature every 30 seconds results in redundant data of no real use; the data recorded at the same corresponding point in each two-minute response interval could be kept and all others discarded (a small sketch of this kind of thinning follows this list). Care must be taken, however, to distinguish and retain data that are oversampled in relation only to our assumptions about the shortest scales of interest in the ocean. For example, valid, accurate surface temperatures to the nearest degree Celsius might be automatically recorded at a particular location on a drilling platform every 15 minutes. If the researchers who most commonly use this type of information are interested only in hourly observations (or believe that temperature can change only at less than one degree per hour), there may be a tendency to retain only the first reading each hour and dispose of the other three. This could prove shortsighted later if, for instance, other researchers wish to test a hypothesis regarding very rapid temperature changes under certain weather conditions, or there is a change in the understanding of time scales by the general oceanographic community.

Model output. Models of oceanographic processes rely on observational data sets to predict or model other information that is not available. As long as the model's input data and methods of calculation remain available, the actual output need not be placed in a long-term archive. However, sufficient documentation about the model and its algorithms and programming must be kept with the input data to allow recomputation in the future when the model data are needed for either scientific or historical research.

Poorly documented data. Data that are poorly documented also may be considered redundant in the sense that they contribute nothing to the meaning of the collection of oceanographic data sets. In fact, data from instruments with unknown or undocumented calibrations or response characteristics, or data produced by unknown processing methods, may be seriously misleading. In either case, if adequate metadata can be neither found nor reconstructed, the associated data set is of very little value and should be considered for elimination.
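The oversampling case above can be illustrated with a short sketch. This is a hypothetical example, not an algorithm from the report: it keeps one reading per instrument response interval (here two minutes) from a series recorded every 30 seconds, and it assumes regularly spaced readings with invented temperature values.

```python
# Hypothetical illustration of thinning data oversampled relative to the
# instrument's response time: keep one reading per response interval.
RESPONSE_INTERVAL_S = 120   # sensor takes ~2 minutes to register a 1 degree C change
SAMPLE_INTERVAL_S = 30      # data were recorded every 30 seconds

# readings are (seconds_since_start, temperature_c); values are invented
readings = [(t, 15.0 + 0.01 * t) for t in range(0, 600, SAMPLE_INTERVAL_S)]

keep_every = RESPONSE_INTERVAL_S // SAMPLE_INTERVAL_S   # = 4
thinned = readings[::keep_every]   # retain the reading at the same point in each interval

print(len(readings), len(thinned))  # 20 readings reduced to 5
```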
In the future, even after redundant data are eliminated, large volumes will remain because satellite remote sensing will be increasingly used for data collection. Data archive centers faced with budget difficulties may be tempted to destroy, or allow the decay of, nonredundant data sets. However, the following points should be considered carefully before any oceanographic data are destroyed or allowed to decay into an unusable state.

The cost of maintaining oceanographic data sets, whether from in situ or remote observations, for any 10-year period appears to average less than 1 percent of the cost of collecting the data (see Box IV.2). While the storage cost may seem significant relative to the budget of the archive, the nation spent much more to collect the data and, in all probability, the data still contain information that can lead to increased understanding of the world's oceans.

As has already been noted, no data set is collected that cannot be stored at the time of collection and for a few years thereafter. In other words, storage technology at the time of collection has allowed the data to be held and processed. Subsequent storage will actually become less expensive and less burdensome as electronic storage technologies continue to improve. For example, technology advances in the last two decades have resulted in the common availability of devices of ever-larger capacity and smaller cost, thus consistently shrinking the size of previously huge data sets relative to commonly available storage media. This improvement in technology is likely to continue for the foreseeable future.

Box IV.2: Example of Data Maintenance Costs

TOPEX/Poseidon is the joint U.S. and French satellite data system that produces accurate measurements of sea level, significant wave heights, and other related, derived data sets. It is expected to operate over a three- to five-year period. Total cost for the entire mission will be approximately $750,000,000. The Jet Propulsion Laboratory serves as the Distributed Active Archive Center (DAAC) for all mission data, interacting with data users, developing and distributing data products including several on CD-ROM, and ensuring the safe archiving of the data. The annual budget for this activity is significantly less than $500,000. Thus, the projected 10-year cost for processing, distributing, and archiving the data is less than $5,000,000, or less than one percent of the cost of collection.
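The cost ratio in Box IV.2 follows from simple arithmetic, sketched below using the rounded figures quoted in the box; the sketch is illustrative only.

```python
# Arithmetic behind Box IV.2, using the rounded figures quoted there.
mission_cost = 750_000_000        # total TOPEX/Poseidon mission cost, dollars
annual_archive_budget = 500_000   # upper bound on the annual DAAC archiving budget
ten_year_archive_cost = 10 * annual_archive_budget   # < $5,000,000
print(ten_year_archive_cost / mission_cost)          # < 0.0067, i.e. under 1 percent
```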

Budget and administrative policies should not be used to justify the abandonment of a data set in lieu of translating it to newer storage media. In particular, older storage equipment often becomes difficult and expensive to maintain, while the purchase of newer technology may be met with budgetary, administrative, and organizational resistance, including inadequately trained personnel. The proper solution is not to abandon data or allow the deterioration of access, but rather to modify organizational policies to encourage the movement to lower-cost media as they become available. At a time when electronic data storage capacities per unit cost double annually, planned migration to new media types should be a central part of planning for data storage.

3 METADATA FOR LONG-TERM RETENTION OF OBSERVATIONAL OCEANOGRAPHIC DATA SETS

The planning, implementation, and maintenance of an effective mechanism for long-term archiving of observational data sets must address three critical issues: storage management, accessibility, and assessability. Storage management focuses on various aspects of archiving, including the reliable storage of data for long periods of time, the transfer of data from old to new storage technology, physical data distribution to accommodate institutional policies regarding custodianship or the physical limitations of an institution, and retrieval performance requirements. Accessibility concerns include the provision of capabilities that provide a model of interaction and a mechanism for accepting input from a user on information needs; that locate all data relevant to those needs; and that retrieve, package, and deliver the needed data to the user. Assessability permits the user to clearly determine the significance, relevance, and quality of the data. This section defines a generalized framework for the minimal documentation of observational data that is necessary to ensure adequate accessibility and assessability.

Metadata Requirements

Metadata are generally considered to be the information necessary for someone who is not previously acquainted with a data set to make full and accurate use of that data set. At a minimum, the metadata associated with a data set must provide a consistent framework that accomplishes the following objectives:

permits assessment of the applicability of a data set to the question or problem at hand;

supports the assessment of the quality and accuracy of the data set;

provides all necessary information to permit a user to access or physically read the values in a data set;

permits the assignment of correct physical units to the values;

supports the translation of logical concepts and terminology between communities; and

supports the exchange of data stored in differing physical formats.
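The six objectives above can be read as a minimal completeness check on a data set's documentation. The fragment below is a hypothetical sketch of such a check; the required field names are assumptions chosen here to mirror the objectives and are not a standard defined by the panel.

```python
# Hypothetical completeness check mirroring the six metadata objectives above.
# Field names are illustrative assumptions, not a standard from the report.
REQUIRED_METADATA_FIELDS = {
    "coverage":        "spatial/temporal extent, to judge applicability",
    "quality":         "accuracy and quality-control history",
    "format":          "physical layout needed to read the values",
    "units":           "physical units for every parameter",
    "vocabulary":      "definitions that translate terminology across communities",
    "exchange_format": "information for converting between physical formats",
}

def missing_metadata(metadata: dict) -> list:
    """Return the objectives that a metadata record fails to address."""
    return [name for name in REQUIRED_METADATA_FIELDS if name not in metadata]

# Example: a record documenting only coverage and units is flagged as incomplete.
print(missing_metadata({"coverage": "North Atlantic, 1993",
                        "units": {"temperature": "degrees centigrade"}}))
```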
Adequacy of Metadata

The problem of supplying adequate metadata is receiving increased attention in the context of scientific data management. For example, global climate change research, along with general environmental concerns, has ignited interest in a more interdisciplinary and long-term approach to conducting science. Interdisciplinary collaboration requires more effective sharing of data and information among individual researchers, programs, institutions, and communities that may operate under different paradigms of knowledge organization or have different terminology for similar concepts. Further, long-term research requires that researchers be able to access and compare data sets that were created by past researchers and collected in different contexts by different technologies. Therefore, to support the interdisciplinary sharing and long-term usefulness of observational data, adequate metadata must be linked inextricably to the data.

Interdisciplinary sharing and long-term usefulness of observational data sets are important goals for any organization involved in the distribution or archiving of scientific data. Such organizations must become increasingly concerned with the provision of high-quality metadata, without which the usefulness of the associated observational data will be severely compromised. Existing information retrieval and data management technologies already provide the scientific and archiving communities the technical means to a satisfactory solution; the problems that exist in this area are more the consequence of the human tendency to ignore the value of metadata during the collection and production of the data, which is precisely the time at which metadata are easiest to produce.

The research or data production community and the user community must jointly improve their understanding and specification of metadata requirements and cooperate in establishing policies, procedures, and infrastructure to meet these improved requirements.

The ultimate solution for metadata handling problems will include an approach that supports the documentation of a data set throughout its life cycle, specifically including support for evolutionary documentation requirements. For example, early in the development and use of a typical new instrument system, the scientific community may not be able to completely specify what metadata will be important for the effective use of the observations produced by this system by individuals who are not a part of the research team. In this case, some of the metadata may need to include free-form narratives without the benefit of controlled vocabularies. Documentation of this nature is useful only to a limited audience that understands the specialized terminology related to the source instrument, discipline, or institution, but it may be all that is available when an instrument is placed in initial use. In addition, it is very difficult to make these uncontrolled narrative descriptions useful to an automated agent performing a search on behalf of a user.

As use of a particular instrument becomes more routine, its documentation should evolve to a more structured form. One useful approach constrains the textual descriptions to a well-defined, controlled vocabulary. If the clearly specified vocabulary and associated metadata are made easily available with the observational data, users beyond those closely associated with data set creation will be better able to use the information and to assess its relevance, significance, and reliability. Eventually, these more structured metadata elements likely will evolve into the specification of structured records with well-defined fields, standard value domains, and relationships with observational data set records. Metadata handling procedures will have to integrate both the unstructured and the structured approaches into a coherent system of policies, procedures, standards, and infrastructure for the entire scientific community.
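The progression described above, from free-form narrative to a controlled vocabulary to fully structured records, can be sketched as follows. This is a hypothetical illustration; the example vocabulary, field names, and narrative are assumptions introduced here, not content from the panel's report.

```python
# Hypothetical sketch of the three documentation stages described above.

# Stage 1: free-form narrative, useful mainly to the originating team.
narrative = "Temps from the new towed sensor; calibration per the PI's lab notes, cruise 12."

# Stage 2: the same information constrained to a controlled vocabulary.
CONTROLLED_PLATFORMS = {"ship", "mooring", "drifter", "satellite", "towed_body"}

def validate_platform(value: str) -> str:
    if value not in CONTROLLED_PLATFORMS:
        raise ValueError(f"'{value}' is not in the controlled platform vocabulary")
    return value

# Stage 3: a structured record with well-defined fields and value domains.
structured_record = {
    "platform": validate_platform("towed_body"),
    "parameter": "sea_temperature",
    "units": "degrees centigrade",
    "calibration_reference": "laboratory calibration, cruise 12",  # assumed content
}
print(structured_record)
```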
A Generalized Metadata Framework

An important component of the effort to improve metadata is the identification and detailed definition of classes of information that are critical to the complete and consistent documentation of observational data sets. Information modeling techniques can be used to synthesize these classes, or entities, from analyses of documentation needs by data set type and class of user. Some classes will have clear, concise definitions and a set of well-defined attributes that imply structured record documentation. Other classes will be identified, but will not have clearly defined attributes or boundaries with other classes, thus implying narrative documentation with a limited controlled vocabulary and free-form extensions. The resulting information model should present a technology-independent description of metadata entities and their relationships with observational data entities. It should identify entities that are general across all classifications of data sets and usage patterns, as well as specialized needs. This model should provide the basis of intelligent information policies, data management practices, and data content standards.

The remainder of this section presents a synthesis of the information classes (i.e., metadata entities) that are critical to the documentation of oceanographic observational data sets. A minimal set of entities is defined based on analyses of documentation needs for the data set classifications and usage classifications that appear most likely to require documentation differences. User and data set classifications are presented next, followed by a high-level information model for observational data sets and associated documentation (the Observational Metadata Model).

User Classifications

Users may be classified along several dimensions. The two that seem most useful to the panel's analyses are function and proximity to data creation. Functional classes are differentiated on the basis of the level of interpretation that must be applied to the observational data sets in order to make them useful to the consumers. The panel defines four classes of users: researcher, policymaker, educator, and general public. Proximity to data creation does not refer to geographic proximity. It differentiates among users who are associated with data creation activities or are members of the primary scientific user community; are members of the general scientific community, present and future; and are nonscientific users, present and future. These three classes are labeled here as primary, secondary, and tertiary, respectively, and the basic differences in their information needs are detailed below.

Researcher

Primary. These users collect observational data sets and perform tasks that require delivery of full-resolution observational data sets. Such users often include principal investigators and associated scientists. When accessing recent observations, these users are typically well aware of the critical limits, constraints, assumptions, and context of the observations. Simple keywords, names, and labels that are familiar to the community can be attached to the data set and serve as full documentation. Only special events or circumstances may require delivery of more detailed documentation.

Secondary. These users are peripheral to data set creation, but perform tasks that require full-resolution observational data sets. These users have both general and detailed knowledge of the scientific context in which the observations were made, but may not have detailed knowledge of specific sampling technology and protocols, experimental objectives and assumptions, practices of the principal investigator(s), and relevant events, all of which could affect the interpretation of the data set in a different context. These users require documentation beyond simple keywords and labels. For example, a user in this class who wants to compare environmental measurements of the same property made by different sensor technologies may need more than the name(s) of the applied algorithms. If the user is from another discipline or is using historical data, the need for complete metadata, or documentation, is exacerbated.

Policymaker

Secondary. These users are scientifically literate individuals who must make decisions based on analyses and interpretations of aggregates of observational data sets and predictive models. They do not need the full data set or detailed documentation on sensors, algorithms, or sampling protocols, but they do need an accounting of the lineage of the information on which to base their decisions and some assessments of the quality, reliability, and sensitivity of the information. Providing this information requires detailed documentation from the point of data creation onward.

Tertiary. These users are decisionmakers who have little or no knowledge of the underlying science, but must have access to standard information products derived from observational data sets and predictive models. Like the previous class of users, they have no need for full-resolution data sets or complete documentation. They do require access to the appropriate experts for explanations and interpretations as needed. In this case, much like the researcher, primary class, all they need is access to the keywords, names, or labels that refer them to the appropriate source.

Educator

Secondary. This class of users refers to university educators. They have the same data requirements as the researcher, secondary and policymaker, secondary classes.

Tertiary. This class of users includes those involved in K-12 education. The data requirements of these users overlap most strongly with the policymaker, tertiary class. They need access to analyses and interpretations drawn from observational data sets and their summaries. They also need to place this information into a higher-level scientific context for assimilation by their students. This context is typically not explicitly stored with the data sets, but must be derived by the educator.
This type of context should probably remain in the hands of the knowledge facilitator (i.e., the educator) and not be included as metadata. However, the facilitator should be given enough information to derive such a context.

General Public, Tertiary

These users have data needs similar to those of educator, tertiary users. There also generally will be an intermediary between the public and the observational data sets and their derived products.

Data Set Classifications

Data sets also may be analyzed simultaneously along several dimensions. Those that appear to have the most effect on metadata issues include volume, level of interpretation, collection source type, measurement source stability, discipline, and data structure. A summary of metadata issues for each class follows.

Volume

Large-volume. Data sets that stress the capabilities of current storage technology are included in this class; satellite image data and model data are examples. Metadata handling is often biased toward the documentation of the data set as a whole, rather than toward providing access to or documentation of smaller components within the data set. In addition, representative abstractions of these data sets often are created to facilitate access. These abstractions become part of the metadata for the observational data set and may have their own associated metadata.

Small-volume. Data sets that do not challenge current storage technology are included in this class; examples include standard hydrographic profiles, ocean station data, or surface observations from ships of opportunity. These data sets often are aggregates of observations of the same type and provide coverage of specific geographic regions (e.g., temperature data for the North Atlantic). Metadata then are developed for the aggregated data set, and there is no capability to document individual observations within the data set. For stable, quality-controlled observational data this is probably the appropriate approach.

Levels of Interpretation

Measurements. These data sets contain direct or minimally processed output from sensor systems. They represent the beginning of the life cycle of observational data sets. Users transform subsets of these data into scientific observations by using sensor algorithms. The subsets often are defined by time intervals, which may be regular or characterized by external events, such as sensor recalibration or the experimental sampling protocol. During this early stage, individual values within each subset may be modified on the basis of quality control procedures. Historically, documentation of this activity is performed at the individual data set level. Documentation must include items such as descriptions of sensors, algorithms, collection events and malfunctions, and quality control procedures. It may be valuable to associate documentation with relatively small data components in this class.

Observations. This class comprises data sets that contain assessments of the state of one or more environmental properties at specified locations in space or time. Typically, these data sets are aggregated by source (e.g., instrument, experiment, or expedition), and documentation is provided for the aggregate. Early in the life of these data sets, their active use may result in modifications or corrections to individual observations. Usually this results in the creation of a new version of the same basic data set. Allowing evolution and documentation of individual observations within a data set may provide more effective metadata management. For the long-term user, these data sets may be considered stable. Typical metadata should include, among other items, information about lineage to measurement data sets, quality control procedures, principal investigators, sampling protocols, data collection activities, and experimental objectives.

Derived observations. Data sets that have been derived from one or more collocated observational sets make up this class. Individual data points are not created through direct or remote measurement techniques, but instead are derived through theoretically or empirically determined relationships between the properties.
Typically, this process derives one output data set from one or more input data sets, rather than operating at the individual observation level. Metadata should document the source (i.e., the input data sets), the transformation algorithm, and the relationship that is implemented by the algorithm.

Synthetic observations. Broadly, this class refers to gridded data sets constructed by simple data interpolation schemes, sophisticated data assimilation techniques, or predictive models. In most cases, these data sets are not accessed or documented at the level of detail of the individual grid point. Data sets are usually aggregated and documented by model run, time and space cross section, or output parameter. Metadata for these data sets may include, for example, model version, data assimilation procedures, model-run settings of sensitivity factors, and data input quality control procedures. If model output is going to be saved in a long-term archive, then information about the hardware and software configurations of the host machine may be appropriate.

Interpretations. These data sets may include combinations of multiple observational data sets that are brought together to support a scientific interpretation of some natural phenomenon, such as the occurrence of an anomalous upwelling event. For this class, supporting metadata should identify the input data sets and associated subsetting criteria, additional data analysis procedures and processing, constraining assumptions, and the name of the interpreter.
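Lineage metadata of the kind described for derived observations can be sketched as a simple record linking an output data set to its inputs, the transforming algorithm, and the relationship applied. The fragment below is a hypothetical illustration; the density example and all identifiers are assumptions introduced here, not part of the panel's model.

```python
# Hypothetical lineage record for a derived-observation data set.
from dataclasses import dataclass
from typing import List

@dataclass
class DerivationRecord:
    output_data_set: str
    input_data_sets: List[str]
    algorithm: str            # name/version of the transformation program
    relationship: str         # the theoretical or empirical relationship implemented

lineage = DerivationRecord(
    output_data_set="north_atlantic_density_1993",
    input_data_sets=["north_atlantic_temperature_1993", "north_atlantic_salinity_1993"],
    algorithm="equation_of_state_v2.1",       # assumed identifier
    relationship="seawater density computed from temperature, salinity, and pressure",
)
print(lineage.output_data_set, "<-", ", ".join(lineage.input_data_sets))
```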

Collection Source Type

In situ and remotely sensed data. These classes of data sets are produced by the transformation of measurement data into observational data. As such, analysis of this aspect of the data source type dimension does not add anything new to the metadata requirements already identified. A possible exception would be a subtle difference between documentation for stable calibration algorithms for in situ sensors versus the possibility of multiple algorithms for remote sensors.

In vitro data. This class includes data sets that are produced by the laboratory analysis of samples collected from the ocean environment. Many biological and chemical properties of ocean water are determined in the laboratory, just like attributes of specimens collected in other disciplines. Documentation needs for these data present new requirements, including information about laboratory procedures and equipment, calibration standards, and laboratory runs. Another major difference is that typically these metadata are associated with single observations of the environment in time and space, rather than being associated at the data set level.

Measurement Source Stability—Experimental Versus Verified Instruments

Experimental instruments produce data sets that are not completely verified and have not been accepted by the general scientific community as producing useful observations. However, these data sets may hold valuable information that can be captured and used once processing algorithms are verified. Prior to this time, metadata entities cannot be populated with specific and reliable data for the associated observational data sets. Therefore, the metadata must contain assessments of their own reliability and quality, as well as assessments of the data set itself.

Discipline

The definition of required metadata is complicated by interdisciplinary considerations. For example, metadata for biological data sets are inherently more complex than those for physical oceanography because of differences in measuring techniques, lack of community agreement on naming standards, and the very process by which biology progresses. The metadata handling for biological and similar classes of data will have to accommodate multiple naming schemes and alternate taxonomies. The mechanism should include the capability for consistent reconciliation between different frames of reference.

Data Structure

Observational data sets can be differentiated according to their structure. Structured data sets (composed of alphanumeric fields) impose nothing new in this analysis of metadata needs. However, unstructured data sets, such as text, sound, image, and video, require different metadata. These unstructured classes of data are useful because of the information embedded in their content. Therefore, management of these data must accommodate structured representations of the extracted content, and linkages between the content data and the unstructured data item. Further, because there may be multiple interpretations of content from the same unstructured item, the mechanism must track the context of each interpretive activity. In some cases, the content data may be considered the primary data, while the interpretive context and the unstructured data item may be considered the metadata.

Observational Metadata Model

Data modeling, including metadata modeling, often begins by specifically identifying the major elements of an observational data set and the various activities and entities involved in the production of such a data set.
To help organize this discussion, the elements, activities, and entities are all referred to as objects within the model and are grouped into realms. The panel's preliminary metadata modeling discussions identified the following eight major realms, each with a number of important objects. These realms and objects form a generalized framework, or preliminary Observational Metadata Model, for the identification of the minimal metadata requirements of observational data sets in the oceanographic sciences.

Parameter realm. The following metadata objects in this realm are required to fully document the data parameter types that make up an observational data set.

Environmental or sensor measurement properties define and describe an environmental property, such as surface wind velocity, or a sensor measurement property, such as sensor voltage or current, that will be represented in the data set.

Parametric representations name, define, and describe the one or more parameters or measurements in the data set. For example, the property surface wind velocity commonly has two parameters: wind speed and wind direction (or, equivalently, the wind u-component and v-component).

Value domains name, define, and describe the legal values and engineering or scientific units that apply to environmental properties with specific parametric representations. For example, a data set could contain observations of surface wind velocity with a value domain for wind speed of 0 to 300 km per hour.

Process realm. Metadata objects in this realm document processing or procedural activities that were used in the creation of a data set.

Process type names, defines, and describes the types of processing involved in the creation of a data set. For example, a temperature sensor calibration algorithm could be documented as to its origin, form, input and output parameter types, constants, and limitations.

Process names, defines, and describes the actual implementation(s) of a process type. For instance, a series of process objects would be associated with a sensor system, recording the history of its calibration coefficients.

Program objects document the specific software implementation(s) of a process, including the source language version, source code version, and other attributes important for the recomputation of data sets using this process.

System realm. Metadata objects in this realm document the platforms, instruments, sensors, computers, and other devices associated with the creation of data sets.

System type names, defines, and describes broad classes of systems. Platform system types may include, for example, ships, remotely operated vehicles, moorings, aircraft, and satellites. For sensors, a system type object may record the attributes of all sensors of a specific make and model.

System names, defines, and describes an actual implementation of a system type. For example, a sensor system object can document the serial number and other specific performance attributes of a particular sensor. A ship platform system object would include the name, data center code, and operating agency of a particular ship.

System configuration is used to define and describe the specific operating form or architecture of a system during the collection of a data set. Because basic systems typically are built with some flexibility to modify their operating characteristics, a series of system configuration objects would be used to document the history of these modifications for a specific system. This also could provide the mechanism to document system or subsystem relationships.

System malfunction objects can be used to document the malfunction history of systems. System maintenance metadata objects can be used to document the maintenance activities associated with systems.

Data generation activity realm. Information objects in this realm are designed to document all activities that contribute to the creation of data sets.

Project metadata objects name, define, and describe the details of the project under which data sets are created.

Experiment objects name, define, and describe the experimental context for data sets. For example, these objects may record the hypothesis being tested and the sampling plan used to collect data or laboratory specimens.

Event objects describe any relevant events that occur during the collection or processing of samples or data that may have an impact on their interpretation.
Data collection run objects document some unit of data collection activity. Typical attributes include the beginning and end date-time groups, the spatial coverage, the standard site name where the collection activity occurred, and the type of collection performed.

Data generation objects document activities that produce synthetic observations, such as gridded model output. These objects may point to process and system descriptions that correspond to the hardware and software implementations of the model and should indicate any special adjustments applied to the specific run of the model that produced a specific data set.

Laboratory run objects describe the specific laboratory activity that analyzes a sample, such as an ocean water sample, to produce one or more observations of environmental properties or properties of objects in the environment. Such objects refer to the specific laboratory sample that was analyzed.

Laboratory sample objects identify the sample and document the history of the sample, which is subjected to one or more laboratory runs to extract properties of the environment. For example, for an ocean water sample, users of the laboratory results will need to know what sample was used, in what type of bottle the sample was collected, and what storage practices were employed to preserve the integrity of the sample.

Observational data set realm. Objects in this realm contain the body of the observational data sets. Observations or measurements are typically individually tagged regarding the time and location of measurement, or carry a specification of how to determine such tags for the whole data set. Observational objects also may describe the division of the data set into groups on the basis of common system or process sources, data generation runs, projects, experiments, or spatial coverage regions.

Quality assessment realm. Metadata objects in this realm document any quality assessments associated with individual observations or observational aggregates. Assessments may be in the form of standard quality codes or statistics, as well as free-form narratives. There are two primary types of quality assessment objects:

Primary quality assessments, performed by the person(s) charged with the creation of the data set.

User quality assessments, made by users of the data set in various contexts. Typically, these objects name the user, describe the context, and record the assessment in narrative form.

Locality realm. These metadata objects provide spatial specifications of places where repeated observational activity occurs and provide criteria for data set aggregation or selection.

Site names, defines, and describes a point, a two-dimensional region, or a volume that can contain one or more data collection runs or corresponds to the minimal spatial extent of an observational aggregate. For example, a site object may correspond to a point where a history of meteorological observations has been made.

Transect names, defines, and describes a line between two well-defined points in three-dimensional space along which a series of observations have been made. Typically, the process being studied is considered temporally homogeneous while the observations along the line are gathered.

Track documents the complete path of a platform, instrument, or sensor collecting a series of observations. The path may be arbitrarily complex, and thus may be described by a series of track segments, sites, or transects that are more easily specified.

Locality may be used to name spatial entities that have no well-defined spatial boundaries, but do provide a general description of a collection area. For instance, “North Atlantic” may have no specific boundaries, but may be used as a label on some data sets to indicate the general area of observation.

Descriptive realm. This realm contains documentation that describes the observational data or the metadata objects associated with the observational data. General comments, persons to contact, and format descriptions are examples of metadata objects in this realm.
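To show how objects from the realms above might hang together, the following sketch links one data set to a few of its realm objects. Every name and value is an assumption introduced for illustration, the realm coverage is deliberately partial, and nothing here is a schema defined by the panel.

```python
# Hypothetical, partial illustration of the Observational Metadata Model realms:
# one data set linked to parameter, system, data generation activity, quality,
# and locality objects. All names and values are assumptions for illustration.

value_domain = {                       # parameter realm: value domain object
    "property": "surface wind velocity",
    "parameter": "wind speed",
    "units": "km per hour",
    "legal_range": (0, 300),
}

sensor_system = {                      # system realm: a specific sensor instance
    "system_type": "anemometer",       # assumed system type class
    "serial_number": "A-1234",
    "configuration": "mast-mounted, 10 m height",
}

collection_run = {                     # data generation activity realm
    "begin": "1993-05-05T00:00Z",
    "end": "1993-05-06T00:00Z",
    "site": "North Atlantic mooring site",   # locality realm reference
}

quality_assessment = {                 # quality assessment realm (primary assessment)
    "assessed_by": "data set originator",
    "assessment": "spikes above 200 km/h flagged and removed",
}

data_set = {                           # observational data set realm
    "observations": [("1993-05-05T06:00Z", 42.0)],   # (time, wind speed) - invented
    "parameters": [value_domain],
    "systems": [sensor_system],
    "data_collection_runs": [collection_run],
    "quality_assessments": [quality_assessment],
}
print(sorted(data_set.keys()))
```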
The modeling effort could begin with the high-level model presented in this section and extend it by defining each metadata object and all relationships among objects with more rigor. Objects in each realm could be analyzed to define the detailed documentation attributes that are required. The analysis process would lead to the creation of generalized objects that span all subdisciplines of oceanography, as well as specialized objects that document specific elements of an oceanographic subdiscipline. The Observational Metadata Model effort needs to be endorsed by the oceanographic community as a whole. A small team of modelers should then be assembled to perform the analysis, with participation by oceanographers from all subdisciplines, probably through a series of interviews and workshops.

In addition to serving as a metadata template for the oceanographic community, the development of the Observational Metadata Model could provide numerous other benefits, including the following: a basis for improved scientific information policies and data management practices; better data and metadata content standards; a basis for database design activities by institutions or individuals using diverse information technology; and a model for user interface design.
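The distinction drawn above between generalized objects that span all subdisciplines and specialized objects for a single subdiscipline could be expressed, for example, through inheritance. The sketch below is a minimal Python illustration; the class names, fields, and the chemical-oceanography example are assumptions for illustration, not elements of the panel's model.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Observation:
    """Generalized object intended to apply to any oceanographic subdiscipline."""
    parameter: str        # what was measured
    value: float
    units: str
    time: datetime
    latitude: float
    longitude: float
    depth_m: float

@dataclass
class BottleNutrientObservation(Observation):
    """Hypothetical specialized object for a chemical-oceanography subdiscipline."""
    bottle_id: str = ""            # which sampling bottle the water came from
    filtration: str = ""           # pre-analysis filtration applied, if any
    detection_limit: float = 0.0   # method detection limit for the nutrient

# Usage: the specialized object carries everything the generalized one does,
# plus the subdiscipline-specific documentation attributes.
obs = BottleNutrientObservation(
    "nitrate", 12.4, "umol/kg", datetime(1993, 7, 1),
    36.5, -122.0, 500.0, bottle_id="B07",
)
print(obs.parameter, obs.bottle_id)
```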

Critical Issues for Implementation of a Metadata Model System

Comprehensive and consistent metadata must be maintained with all long-term observational data sets in order to support effective maintenance, access, and usage. Therefore, effective long-term retention of observational data sets requires an underlying compatibility among data management activities at the point of origin, data collection performed during the conduct of research funded by the agencies, discipline-specific data centers, and general archive centers. A mechanism must be defined to coordinate these activities over long periods of time while allowing for autonomous control of technology assimilation and data management approaches. Policies, procedures, and technology must be put into place and coordinated across all phases of the observational data life cycle.

The metadata requirements imposed by long-term retention significantly affect the data management activities at the point of origin and at short-term archive centers. This impact is a result of the requirement for complete and consistent metadata to support future primary and secondary uses of the observational data sets. Although the metadata requirements for the primary users and data originators are not as comprehensive as those for secondary users, it is the primary user group that must bear the burden of attaching the full documentation. Addressing full documentation later in the life cycle of the data will introduce prohibitive costs or may even be impossible.

A fully specified information model of metadata requirements will provide the baseline for intelligent and effective data management standards. However, this will not be enough unless the standards are enforced. Agencies funding data collection activities as part of scientific research must play a key role in ensuring the implementation of the standards. NARA needs to work closely with the federal agencies to communicate the dependence of eventual long-term archiving on effective data management from the point of origin. In many cases, existing research support agreements from these agencies do require grant-supported data to be submitted to a federal repository, such as the NODC. This agreement structure could enhance long-term retention efforts if it were better implemented, financially supported, and enforced.

The institutions that host multiple researchers also can play a key role by providing enhanced data management services as an infrastructure function to all affiliated researchers and users. Research institutions also should be involved in developing and externally promoting improved information technology, as well as data and metadata standards, to improve inter-institutional data exchange and cooperation. Such activities could offset any added burden on individual researchers resulting from increased enforcement of data and metadata standards by funding agencies. Experience with the continuous management of observational data sets has shown that although the underlying organization of the data may remain fairly constant, small changes in technology, such as software versions or schema evolution, require active planning for eventual long-term archiving throughout the life cycle of the data sets.
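As one illustration of what enforcing a minimum metadata standard at the point of origin might look like, the sketch below checks a submission against a set of required documentation fields and a schema-version tag. The field names, the version convention, and the check itself are assumptions made for illustration; the panel does not prescribe a specific mechanism.

```python
# Hypothetical minimum documentation fields a center might require at submission.
REQUIRED_KEYS = (
    "title", "originator", "contact", "parameters",
    "time_coverage", "spatial_coverage", "processing_history",
    "quality_assessment", "format_description", "schema_version",
)

def check_metadata(record: dict) -> list:
    """Return a list of problems; an empty list means the record meets the minimum."""
    problems = [f"missing field: {key}" for key in REQUIRED_KEYS if key not in record]
    version = str(record.get("schema_version", ""))
    # Assumed convention: the center currently accepts only 1.x schema versions,
    # so schema evolution can be planned for rather than discovered later.
    if version and not version.startswith("1."):
        problems.append(f"unsupported schema version: {version}")
    return problems

submission = {"title": "CTD casts, cruise X", "originator": "R/V Example", "schema_version": "1.0"}
for problem in check_metadata(submission):
    print(problem)
```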
4 MAJOR ELEMENTS IN LIFE-CYCLE MANAGEMENT OF OBSERVATIONAL DATA SETS: CREATION THROUGH LONG-TERM RETENTION

Successful data management centers and archives have the following characteristics:

They are “close” to the data originators. Thus, they learn what data are being collected, they know how the technology to measure specific ocean properties has evolved, they are trusted by data originators with their data, and they actively bring data into the archive.

They are “close” to their users, in physical proximity and in intellectual training. Therefore, they understand their users' needs for data access and respond in a timely manner to rapidly changing priorities.

They take advantage of evolving computer technologies to minimize the cost of their activity relative to the amount of information held. This requires continuous assessment of market offerings and continuous training of personnel.

They “exercise” their data holdings regularly. Usually in partnership with a researcher, data should be compared, analyzed, summarized, and gridded (a minimal gridding sketch follows this list). This ensures interest in the holdings and intimate knowledge of the current holdings' status.

They not only receive data, but can promptly deliver any holdings.
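As an example of what "exercising" a holding might involve, the sketch below bins scattered sea-surface temperature observations onto a one-degree grid and computes cell means. The observation arrays are synthetic placeholders, and the use of NumPy here is an illustrative assumption rather than a practice described by the panel.

```python
import numpy as np

# Synthetic scattered observations standing in for a holding being "exercised".
rng = np.random.default_rng(0)
lats = rng.uniform(-60, 60, 500)
lons = rng.uniform(-180, 180, 500)
sst = 28.0 - 0.3 * np.abs(lats) + rng.normal(0, 0.5, 500)   # toy temperature field

lat_edges = np.arange(-60, 61, 1.0)
lon_edges = np.arange(-180, 181, 1.0)

# Sum of values and count of observations per cell; their ratio is the gridded mean.
sums, _, _ = np.histogram2d(lats, lons, bins=[lat_edges, lon_edges], weights=sst)
counts, _, _ = np.histogram2d(lats, lons, bins=[lat_edges, lon_edges])
gridded_mean = np.divide(sums, counts, out=np.full_like(sums, np.nan), where=counts > 0)

print("cells containing data:", int((counts > 0).sum()))
```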

The state of communications and computing technology allows the uncoupling of the legal responsibility for custodianship and the responsibility for physical retention. The volume and diversity of data, in fact, mandate distributed physical retention. While data exchange among researchers usually has taken an informal route outside of the purview of the data centers, large data sets (especially satellite data) are increasingly coming from data acquisition and analysis centers designated for particular data types. They may or may not be responsible for distribution of data outside the group of primary users.

Secondary usage of data by scientific peers sometimes leads to the creation of a new data set. This new data set may be the result of problem correction during scientific usage. This process of usage by secondary users can be viewed as a type of peer review of data sets. This stage is particularly important considering the value of accurate data to tertiary users. Even the long-term, tertiary use of observational data sets requires discipline-specific expertise at archive centers for effective utilization. A significant barrier to the implementation of data archives is the lack of support by principal investigators, which is, in part, the result of past difficulties in retrieving data in a requested form and in a timely manner. Finally, many data centers and archives have technology acquisition policies and practices that contribute to the high cost and ineffective nature of current archive systems.

The panel's paradigm for the architecture of a life-cycle model of oceanographic data revolves around the activities and responsibilities of four major participants in the life of the data set: the data originators and primary users, short-term archives, long-term archives, and the National Archives and Records Administration.

Data Originators and Primary Users

Data originators, whether individual researchers, laboratories, or agencies, must be responsible for submitting data from federally funded programs to short-term data centers. While policies on data submission to the public domain may vary depending on the federal agency sponsoring the data collection, after some reasonable interval of time, perhaps less than two years from the time of measurement, all data sets should be reviewed and submitted to an appropriate active archive data center. In an increasingly electronic era, both data and metadata should be submitted in digital form whenever possible. Data originators should be responsive to the relevant archive center's requirements for the information content of the data, formats, and metadata. In addition, the originator should expect to engage in some interchange with the short-term center in the course of the submission.

Short-term Archives

A short-term, active archive center is a central site for the collection, examination, and distribution of oceanographic data sets. However, it need not be a single monolithic entity. Different types of data sets may take different pathways to different short-term centers. Such centers need not be under the direct control of federal agencies, although continuing assistance from federal sources likely will be required for support of this important infrastructure activity.
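The sketch below illustrates one way an originator's digital submission might be registered in a short-term center's searchable catalog, recording a fixity checksum so that later copies and media migrations can be verified against the original. The file layout, catalog format, and function name are assumptions for illustration; they are not prescribed by the panel.

```python
import hashlib
import json
from pathlib import Path

def register_submission(data_file: Path, metadata: dict, catalog_path: Path) -> dict:
    """Record a submitted data file and its metadata in a simple JSON catalog."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    entry = {
        "file": data_file.name,
        "sha256": digest,                       # later copies can be checked against this
        "size_bytes": data_file.stat().st_size,
        "metadata": metadata,                   # originator-supplied documentation
    }
    catalog = json.loads(catalog_path.read_text()) if catalog_path.exists() else []
    catalog.append(entry)
    catalog_path.write_text(json.dumps(catalog, indent=2))
    return entry
```

A center migrating its holdings to new media could recompute the checksum of each copied file and compare it with the cataloged value, giving a concrete test that no information was lost in the process.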
The short-term archive center should seek out data sets from various sources, perform quality assessments of the data, and digest information about the data (or the data set itself when of sufficient importance) into a searchable electronic catalog. If any reformatting of data is performed by the short-term archive, it is essential that no information be lost from the originator's data set. The quality assessment must be performed by knowledgeable individuals working either at the center or under contract with it. The center should distribute copies of data widely and encourage examination by secondary users. It must seek to resolve any data issues that arise during review and secondary use, and retain any value-added information, either as changes to the original data or in the form of derived data sets. Data are not seen as residing at such a center so much as passing through it, to users on the one hand and to the long-term archive on the other. The short-term archive center must ensure that if current copies of data sets are migrated to a different storage medium or format, no information is lost in the process. Finally, the center should seek the advice of primary and secondary research users of the data to help guide decisions about priorities and new initiatives.

Long-term Archives

A long-term archive center holds the originators' data, after they have been in use at the short-term archive for some suitable period of time, and an archival copy of the most recent version of the compilation of the data supplied by the short-term center. The long-term center should ensure that redundant copies of the same data sets are purged and that archived data are stored in conditions and on media that reflect the most appropriate technology available. It also should ensure that catalogs of current holdings are up to date and accessible to all interested parties. Because most of the existing oceanographic data can fit onto a collection of CD-ROMs, the long-term archive could best serve the needs of the community for data access by distributing copies of selected data sets, together with public domain software for displaying these data, to a wide variety of sites, such as libraries and universities. The long-term archive also must be responsive to individual requests for particular, little-used data sets. It is expected that this role will continue to be filled by NOAA as the primary archivist of oceanographic data.

The National Archives and Records Administration

As the responsible agency for the archiving of federal records, NARA should play a role as advisor, facilitator, and enforcer. As an advisor, NARA should provide guidance and incentives to the oceanographic community as a whole to assure that a complete observation record is adequately maintained for long-term use. Based on oceanographic community input, NARA should lead and facilitate the process of identifying appropriate information standards, establishing appropriate policies and procedures for implementing those standards, and identifying and promulgating a technology infrastructure that will support oceanographic data management at all levels. When appropriate, NARA should enforce conformance to minimum standards for information completeness and consistency. Because of NARA's limited scientific expertise, agreements with affiliated oceanographic archive centers may be the primary means for meeting these responsibilities.

5 SUMMARY CONCLUSIONS AND RECOMMENDATIONS

Retention Criteria and Appraisal Guidelines

Conclusions

Critical retention criteria and appraisal guidelines for long-term archiving of observational data sets are largely independent of the type of the data set. Each data component contributes unique information as long as it is accurate, measures a different physical quantity, is obtained from a different time and place, and cannot be accurately computed from other existing data. The entire collection of nonredundant oceanographic data will be needed by future generations to understand the planet they inherit. Data sets that are redundant, have limited usefulness, or have low reliability are candidates for deletion.

Historians and others who must assess today's policy decisions based on the information that was available to the decisionmaker should use sources such as published reports, journal articles, and books, rather than the original data or model output.

A data set without metadata, or with metadata that do not support effective access and assessment of data lineage and quality, is likely to have limited long-term usefulness.

No data set is collected that cannot be stored at the time of collection and for a few years thereafter. Subsequent storage will actually become less expensive and less burdensome as electronic storage technologies continue to improve. This improvement in technology is likely to continue for the foreseeable future.
For small-volume data sets, the major issue with regard to retention and effective use is the completeness of the sets' metadata, rather than archiving cost, longevity of media, or maintenance of data holdings. Large-volume data sets initially may stress storage technologies, but probably will be transformed into a relatively small-volume category in the future, as technology increases the capacity and lowers the cost of storage devices. While data storage costs may seem significant relative to the budget of an archive, the nation spent much more on the data collection effort, and in all probability, the data still contain information that can lead to increased understanding of the world's oceans and broader environment.

Issues relating to the appraisal of the redundancy, usefulness, and reliability of observational data sets will continue to require joint review by scientists, data and information managers, and archivists.
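The following is a minimal sketch, in Python, of how the appraisal criteria just summarized (redundancy, usefulness, reliability) plus metadata completeness might be recorded for a data set and turned into a provisional retention recommendation. The fields and the decision rule are illustrative assumptions; as the conclusion above notes, actual appraisal is a joint review by people, not an automated test.

```python
from dataclasses import dataclass

@dataclass
class AppraisalInput:
    redundant: bool      # can it be accurately computed from other existing data?
    useful: bool         # does it measure a distinct quantity, time, or place?
    reliable: bool       # is the accuracy adequate and documented?
    has_metadata: bool   # do metadata support access and lineage/quality assessment?

def recommend_retention(a: AppraisalInput) -> str:
    """Map the appraisal criteria to a provisional recommendation (illustrative only)."""
    if a.redundant or not a.useful or not a.reliable:
        return "candidate for deletion, subject to joint review"
    if not a.has_metadata:
        return "retain, but flag: limited long-term usefulness without metadata"
    return "retain permanently"

print(recommend_retention(AppraisalInput(False, True, True, True)))
```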

Recommendations

All observational data that are nonredundant, useful, and reliable for most primary uses should be permanently maintained. The criteria for assessing redundancy, usefulness, and reliability are discussed in the body of the report.

Budget and administrative policies should not be used to justify the abandonment of data instead of transferring them to newer storage media. In particular, older storage equipment usually becomes difficult and expensive to maintain, while the purchase of newer technology may be met with budgetary, administrative, and organizational resistance, including poorly trained personnel. The proper solution is not to abandon data or allow the deterioration of access, but rather to modify organizational policies to encourage the movement to lower-cost media as they become available.

Retention criteria should be applied through an appraisal process performed by all stakeholders, including interdisciplinary scientists, data and information management professionals, archivists, and representatives of secondary and tertiary user groups. The appraisal process should have the following characteristics:

It must be based on clearly defined criteria for assessing redundancy, usefulness, reliability, and retention priorities. Because retention criteria and priorities may be expected to evolve over time, the appraisal mechanism must allow for the review of criteria and priorities by the scientific community, as well as for the review of current holdings of observational data sets.

It should be applied on a periodic basis. It should also be initiated by special events, such as when the survival of one or more observational data sets is threatened, or when there is to be a transfer in the custodianship of data.

The appraisal mechanism also must have a component that evaluates the archival process as a whole.

Ocean Metadata Requirements

Conclusions

In determining the proper metadata for an observational data set, it is useful to think of the intended user as a researcher removed from the time of data creation by 30 years or more, from another scientific discipline, and desiring to use the data in a way unintended at the time of creation. Complete metadata will define data set content, structure, format or representation, and context.

Technology is not an inhibiting factor in the establishment of an effective metadata strategy. The lack of effective policies, procedures, and technical infrastructure based on community-wide information standards is the primary constraint.

The metadata required by primary users are not as comprehensive as those required by secondary and tertiary users. It is the primary users, however, who bear the burden of attaching full documentation. Addressing this cost-benefit imbalance must be an important part of any plan to improve the provision of metadata.

Recommendations

There are several basic classes of information that should be provided as metadata components of ocean data sets regardless of the source: type, volume, discipline, data class (text, numeric, image, video), and observation regime (e.g., deep ocean, coastal, and surface). Documentation for parameters, processes, systems, data generation activities, observations, quality assessments, localities, comments, contact persons, and formats must also be included.
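The sketch below gathers the basic classes of information listed above into a single nested record, as one illustration of how they might travel with a data set. The structure, key names, and example values are illustrative assumptions, not a proposed standard.

```python
# Illustrative metadata record for one ocean data set; values are placeholders.
ocean_data_set_metadata = {
    "type": "hydrographic profile",
    "volume": "1.2 GB",
    "discipline": "physical oceanography",
    "data_class": "numeric",                 # text, numeric, image, or video
    "observation_regime": "deep ocean",      # e.g., deep ocean, coastal, surface
    "documentation": {
        "parameters": ["temperature", "salinity", "pressure"],
        "processes": ["CTD cast"],
        "systems": ["shipboard CTD, model unspecified"],
        "data_generation": [],
        "observations": "see accompanying data files",
        "quality_assessments": ["primary assessment by originator"],
        "localities": ["Transect A"],
        "comments": [],
        "contacts": ["data originator"],
        "formats": ["netCDF"],
    },
}

print(sorted(ocean_data_set_metadata["documentation"].keys()))
```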
Long-term archives must maintain metadata adequate to support the sharing of data and information across user groups that operate under different paradigms, have different terminology for similar concepts, and collaboratively use data collected at different times and with different technologies. Long-term archives must maintain metadata adequate to ensure reliable and secure storage of the observational data sets. Equally important, a data archive must provide effective access to its holdings by diverse users and provide enough ancillary information to make the data sets useful for interdisciplinary and long-term purposes.

To achieve effective long-term retention of observational data sets, the following steps should be taken:

Metadata requirements must be better understood and specified as the basis for effecting improved usefulness of data sets beyond the research activity that creates the data. The metadata information model should be subjected to review by all levels of the oceanographic community.

Metadata requirements should be used as a basis for establishing community-wide, minimum standards for information consistency and completeness. In turn, these standards should drive the establishment of policies, procedures, and technical infrastructure that support the acceptance of these standards at all stages in the life cycle of observational data sets: creation, active use and modification, and long-term archiving.

All participants in the ocean science enterprise must commit to long-term data management as an important benefit to science, and work toward an underlying compatibility in data management practices at all levels while allowing for autonomous control of technology assimilation and basic data management approach.

To accomplish these steps, the following more specific recommendations are offered:

NOAA, with the active cooperation of NARA, should lead efforts to better define standards for data and metadata in the oceanographic community. Information modeling techniques should be used to define a technology-independent specification. The specification should capture detailed definitions and descriptions of classes of information that are critical to ocean data set documentation. The classes identified in this report may serve as a starting point for future efforts. The metadata specification must provide a framework that:

— provides meaningful criteria for selecting pertinent data;

— supports translation of logical concepts and terminology between oceanographic disciplines;

— supports the exchange of data stored in different physical formats;

— supports the assessment of the quality, accuracy, and applicability to a particular problem of the observational data sets; and

— supports the need to access foreign data sets.

To assure widespread adoption of these standards, funding agencies should provide the financial resources for researchers to initiate appropriate data management activities, and they should enforce the delivery of data sets that satisfy appropriate standards. NARA needs to work with appropriate federal agencies, such as NOAA, NASA, NSF, DOE, and EPA, to communicate that long-term archiving of the data depends on effective data management at the point of origin.

Institutions that host multiple researchers can play a key role by providing enhanced data management services as an infrastructure function to all affiliated program managers, researchers, and users. Research institutions also should be involved in developing and externally promoting improved information technology, as well as data and metadata standards, to improve inter-institutional data exchange and cooperation.

Major Elements in Life-cycle Management of Observational Data Sets

Conclusions

Successful data management centers and archives have the following characteristics:

They are “close” to the data originators. Thus, they learn what data are being collected, they know how the technology to measure specific ocean properties has evolved, they are trusted by data originators with their data, and they actively bring data into the archive.

They are “close” to their users, in physical proximity and in intellectual training.
Therefore, they understand their users' needs for data access and respond in a timely manner to rapidly changing priorities.

They take advantage of evolving computer technologies to minimize the cost of their activity relative to the amount of information held. This requires continuous assessment of market offerings and continuous training of personnel.

They “exercise” their data holdings regularly. Usually in partnership with a researcher, data should be compared, analyzed, summarized, and gridded. This ensures interest in the holdings and intimate knowledge of the current holdings' status.

They not only receive data, but can promptly deliver any holdings.

The state of communications and computing technology allows the uncoupling of the legal responsibility for custodianship and the responsibility for physical retention. The volume and diversity of data, in fact, mandate distributed physical retention.
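As a small illustration of uncoupling legal custodianship from physical retention, the record below names a single responsible custodian while listing the sites that physically hold copies of a data set. The field names and values are assumptions for illustration only.

```python
# Illustrative catalog record: one custodian of record, several physical holders.
holding_record = {
    "data_set_id": "example-0001",
    "legal_custodian": "long-term archive center",
    "physical_holders": [
        {"site": "short-term archive A", "medium": "disk", "copy": "working"},
        {"site": "long-term archive", "medium": "tape", "copy": "preservation"},
        {"site": "university library", "medium": "CD-ROM", "copy": "distribution"},
    ],
}

print(holding_record["legal_custodian"], len(holding_record["physical_holders"]))
```

A catalog of such records is what would let a user request a data set from the custodian of record without needing to know where the copies physically reside.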

While data exchange among researchers usually has taken an informal route outside of the purview of the data centers, large data sets (especially satellite data) are increasingly coming from data acquisition and analysis centers designated for particular data types. They may or may not be responsible for distribution of data outside the group of primary users.

Secondary usage of data by scientific peers sometimes leads to the creation of a new data set. This new data set may be the result of problem correction during scientific usage. This process of usage by secondary users can be viewed as a type of peer review of data sets. This stage is particularly important considering the value of accurate data to tertiary users. Even the long-term, tertiary use of observational data sets requires discipline-specific expertise at archive centers for effective utilization.

A significant barrier to the implementation of data archives is the lack of support by principal investigators, which is, in part, the result of past difficulties in retrieving data in a requested form and in a timely manner. Finally, many data centers and archives have technology acquisition policies and practices that contribute to the high cost and ineffective nature of current archive systems.

Recommendations

The organizational structure called for by the oceanographic data life cycle consists of a web of cooperating, but independent, entities. These entities must be guided by a clear set of technology-independent standards for observational data and metadata. Each entity may have its own technology acquisition strategy and data management approach, but all must conform to the standards for information content and level of service. Each entity will fall into one or more of the following classes:

Data originators include individuals, research groups, and organizations serving as data sources for themselves and for other primary users. They may maintain proprietary control over the data they acquire for some limited period of time, but should be required to submit the data to a short-term or long-term archive center after a period of no more than two years.

Short-term archive centers serve as focal points for the collection, assessment, and distribution of particular types of oceanographic data. It is through these centers that peer review will be performed on the data, thus adding value to the data over time and improving their quality prior to submission to long-term archives.

A long-term archive center will maintain the originator's copy of the data set and the latest compilation of associated versions submitted by short-term archive centers. It is expected that this role will continue to be filled by NOAA as the primary archivist of oceanographic data.

The National Archives and Records Administration, as the agency responsible for the archiving of federal records, should play a role as advisor, facilitator, and enforcer. As an advisor, NARA should provide guidance and incentives to the oceanographic community as a whole to assure that a complete observation record is adequately maintained for long-term use.
Based on oceanographic community input, NARA should lead and facilitate the process of identifying appropriate information standards, establishing appropriate policies and procedures for implementing those standards, and identifying and promulgating a technology infrastructure that will support oceanographic data management at all levels. When appropriate, NARA should enforce conformance to minimum standards for information completeness and consistency. Because of NARA's limited scientific expertise, agreements with affiliated oceanographic archive centers may be the best means for meeting these responsibilities.