Chapter 17: Long-term Preservation of High-Quality Information | Data for Science and Society: The Second National Conference on Scientific and Technical Data | U.S. National Committee for CODATA | National Research Council

U.S. National Committee for CODATA
National Research Council
Promoting Data Applications for Science and Society: Technological Challenges and Opportunities
 


17

Long-term Preservation of High-Quality Information

Walter Warnick




     I'll begin with an old joke. A man went to the doctor. The doctor said, "Sir, you have 6 months to live. Go marry an information specialist." The man asked, "If I marry an information specialist, will that extend my time here on Earth?" The doctor said, "Well no, but the 6 months will seem like forever."

     Well, this is an old joke--and it is an old joke because, nowadays, being an information specialist, at least at the Department of Energy (DOE), is exciting. If the unfortunate man with 6 months to live married a DOE information specialist, the time would go by as if it were a week. The people who work for me at DOE are pumped, and I'm here to tell you why and what that means for our topic today.

     First, let me give you a thumbnail sketch of the Department of Energy, for those of you who might not be familiar with the organization. We are among the leading research and development (R&D) agencies in the world. Most of the department's R&D is performed by a system of large national laboratories all around the country, including Lawrence Livermore National Laboratory in California, Oak Ridge National Laboratory in Tennessee, and the Argonne National Laboratory in Illinois, among others. There are 39 facilities in all, and about 100,000 scientists and engineers work at these facilities.

     The Office of Scientific and Technical Information (OSTI) within DOE has been in business for more than half a century. Our mission includes collecting, preserving, and disseminating scientific and technical (S&T) information created by DOE. We also provide access to national and global information for use by DOE and the research community. This mission has not changed significantly since the organization was formed in the 1940s. However, information age technologies have radically changed the manner in which the mission is addressed and our service to our patrons.

     DOE invests about $7 billion per year in R&D. The principal deliverable from R&D is S&T information. It is in the vital interest of all research agencies that their information be disseminated as broadly and as quickly as possible.

     DOE has been a proud sponsor of science-driven economic growth through the combined efforts of our national laboratories. There have been 68 Nobel laureates sponsored by DOE and thousands of other outstanding university- and industry-based researchers nationwide.

     The twentieth century was the century of physics. In this period, physics became the dominant science, producing nuclear power, air and space travel, computers, and many more marvels. However, that century has ended, and the life sciences now are offering immense opportunities. Too little appreciated, however, is the fact that much of the progress in the life sciences is dependent upon prior advances in the physical sciences.

     Until recently, OSTI served the department's information needs through traditional means. We collected and processed hard-copy paper information in a central location. We focused on bibliographic data, which we call "information about information." We packaged and disseminated information in paper and microfiche formats. We preserved and archived information in paper formats.

     The main method of disseminating DOE's research results was through bibliographic databases. Nuclear Science Abstracts (NSA) is a historic record of nuclear research, beginning with the Manhattan Project in the early 1940s and following throughout the life of the Atomic Energy Commission. When the energy mission was broadened in the mid-1970s to include nonnuclear energy, such as fossil fuel energy and solar energy, NSA was supplanted by a new database called the Energy Science and Technology Database. Both databases contain information about information, referring a patron to the paper or microfiche source to see the full text. The result is that OSTI maintains the world's most comprehensive collection of energy-related S&T information. We have 1.5 million full-text reports and 5.7 million bibliographic references.

     In the last 3 years, information age technologies have radically changed our information services. Our patrons increasingly want information right at the desktop. Accordingly, we have transitioned our operations from a paper-based environment to a decentralized electronic environment.

     Across the DOE laboratories, we lead the Scientific and Technical Information Program, which is a collaboration of information specialists at each of these laboratories. Practices are in place to link to collections of information at these laboratories seamlessly. In one sense, we serve as a pointer to those decentralized collections at our national laboratories and elsewhere. When a laboratory elects not to host its own S&T information, my office does it for them.

     Several vast virtual collections have been produced to meet the needs of DOE's R&D community (see Figure 17.1). We have also made some of these collections available to the public through the National Technical Information Service (NTIS) and the U.S. Government Printing Office (GPO). We now share information faster, more completely, more conveniently, and at lower cost.



Figure 17.1

   


     We have a trilogy of vast virtual collections, one for each of the three main ways by which scientists communicate their findings: gray literature, peer-reviewed journal literature, and preprints.

     Our first significant advance dealt with gray literature, that is, technical reports and conference papers. Our introduction of the DOE Information Bridge in 1998 provided access to full text. Each word of each report was searchable. As of March 2000, the Information Bridge had grown to more than 62,000 digital reports and more than 4 million searchable pages. It includes the entire output of the Department of Energy in terms of gray literature reports and conference proceedings from January 1995 to the present. Through our work with the Government Printing Office, the Information Bridge is available free to the public via GPO Access at http://www.osti.gov/bridge.

     With the use of information age technologies well established for gray literature, DOE then turned its attention to other ways in which scientists disseminate their findings. The most prominent way by which scientists communicate their findings is the peer-reviewed journal literature.

     Yesterday morning in his opening remarks, Bruce Alberts praised PubMed, which is a life sciences journal product of the National Library of Medicine. We fully agree with Dr. Alberts's assessment of PubMed, and we have given it the sincerest form of flattery--we imitated it. We recreated PubMed; only our focus is on the physical sciences and other disciplines of concern to DOE. We call our product PubSCIENCE.

     PubSCIENCE compiles citations and abstracts submitted by publishers into a searchable database and then uses hyperlinks to take patrons to the publishers' doorstep where full-text information can be obtained. In assessing the need for such a collection in the physical sciences, we worked closely with the American Physical Society. PubSCIENCE filled a void and was the next logical step beyond our bibliographic databases.

     In collaboration with the GPO, PubSCIENCE is also available to the public at http://www.osti.gov/pubscience. It allows the patron to search across abstracts and citations of multiple publishers at no cost. Once a user has found an interesting abstract, a hyperlink provides access to the publisher's server to obtain the full-text article. The article will come up immediately if the patron or his or her organization has a subscription to the journal. If the patron lacks such a subscription, access to the full text can be obtained by pay per view, by special arrangement with the publisher, by library access, or through commercial providers.

     OSTI's primary patrons are scientists in the DOE system of national labs. PubSCIENCE is particularly attractive to such large institutions, which are increasingly using site licenses to bring full-text journals to their scientific staffs. For example, Los Alamos National Laboratory has site licenses to well over 2,000 journals. At any institution that has a site license hosted at a publisher's server, the hyperlinks to full text in PubSCIENCE are automatically live.

     Global information sharing has become a reality via the Web, making public availability of information easier and cheaper to implement than restricting access. PubSCIENCE now covers 1,032 journals with 26 participating publishers, as well as 1.7 million journal citations. Now compare that to PubMed: PubMed only has about 600 journal titles, but it has a whole lot more citations and is a more robust database. In the future we plan to continue to expand the number of journal titles in areas of interest to the Department of Energy.

     DOE's most recent Web-based product is the PrePRINT Network, which was launched on January 31, 2000. The PrePRINT Network is a gateway to about 400 preprint servers dealing with physics, materials, chemistry, and other disciplines of concern to the Department of Energy. We have tried to capture every preprint server in the world that is within our scope. The patron is offered a variety of ways to search across the entire collection of 400 preprint servers. In the most novel search method, the PrePRINT Network pulses the search engines of multiple preprint servers. When the patron places a query, the PrePRINT Network accesses several selected databases, causes searches to be done by their search engines, and then compiles the results. Essentially, the network is acting as a parallel processor, uniquely created for searching across preprint servers that do not have standardized data formats and are geographically dispersed. In fact, we put no burden of any kind on any of the administrators of any of the preprint servers that we access. The parallel processor searching capability allows us to build inexpensive distributed digital libraries. In other words, this search technology works not just for preprints, but for any kind of data set you can imagine--at least text data sets. We think this search capability is the wave of the future.
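     The search method just described, in which one query is sent to several independent search engines and the results are compiled, can be illustrated with a minimal sketch. This is not the PrePRINT Network's actual implementation. The two "servers" below are hypothetical in-memory stand-ins with made-up records, and the names search_server_a, search_server_b, and federated_search are assumptions introduced only for illustration. Each stand-in keeps its records in a different shape, to mirror preprint servers that do not share a standardized data format.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote preprint servers (illustrative data only).
# Each keeps its records in a different shape and exposes its own search.
def search_server_a(query):
    records = [{"title": "Neutron scattering in thin films"},
               {"title": "Plasma confinement notes"}]
    return [r["title"] for r in records if query.lower() in r["title"].lower()]

def search_server_b(query):
    records = [("hep-001", "Thin films and superconductivity"),
               ("hep-002", "Lattice QCD results")]
    return [title for _id, title in records if query.lower() in title.lower()]

SERVERS = {"server_a": search_server_a, "server_b": search_server_b}

def federated_search(query, servers=SERVERS, timeout=5.0):
    """Send one query to every server concurrently and compile the results."""
    merged = []
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        # Submit all searches first so they run side by side.
        futures = {name: pool.submit(fn, query) for name, fn in servers.items()}
        for name, future in futures.items():
            try:
                hits = future.result(timeout=timeout)
            except Exception:
                hits = []  # a failing server contributes nothing to the results
            merged.extend((name, title) for title in hits)
    return merged

results = federated_search("thin films")
```

     The gateway in this sketch only calls each server's own search function and keeps no copy of the servers' records, which is consistent with placing no burden on the servers' administrators. A server that raises an error simply contributes no results.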

     With the addition of preprints to our suite of Web products and services, all three of the ways by which researchers make their results known are now accessible on the Web: gray literature through the DOE Information Bridge, journal literature through PubSCIENCE, and preprints through the PrePRINT Network. Each of these is a vast, virtual collection. In each case, the information is accessible by a Web site on one of OSTI's servers, but the full-text information resides at servers all over the world.

     OSTI aims to be first in gray literature, first in journal literature, first in preprints, and first in the hearts of our researchers. This vision has my office very excited. It is tantamount to conquering text in the physical sciences, a goal that has never before been within human reach.

     Rapid changes in electronic storage formats, applications, and operating systems all shorten the useful life of digital information. Thus, preservation presents us with technical issues. Further, the swift changes in the digital environment have left some organizations without institutionalized practices for managing and preserving electronic information.

     Overlaying the technical and institutional issues is a deeper issue, which is funding. For federal agencies, information often gets too little respect. I think this has generally been true, except in the life sciences, in recent years. The government spends billions of dollars to fund research, but it balks at spending a few dollars to ensure access to and preservation of the literature, which is the principal deliverable coming from those billions of dollars worth of research. This is incredibly shortsighted.

     Here is a private-sector analogy to think about. The phone company gives away white pages without a separate incremental charge to its customers. Why does the phone company do this? Is it because the phone company has overlooked the obvious opportunity to recover the cost of the white pages? Or is the phone company made up of nice, generous people who are charmingly trying to make this information available at the company's expense? Or has the phone company determined that funding the cost of the white pages with their mainline services is a sound business practice? Of course, the last of these is the answer. The white pages are nothing but information, just like scientific and technical information. The phone company has determined correctly that disseminating phone information is an integral part of the company's business, and charging for that information would discourage its dissemination and thereby hurt the company's core business.

     Similarly, the government should determine that the dissemination of information coming from its huge research and development investments is what makes R&D useful. Just as charging for the white pages would be self-defeating for the phone company, charging for S&T information, as the NTIS is required to do, is also self-defeating. The government needs to commit sufficient funding to disseminate its information broadly, just as the phone company gives white pages to all likely users.

     With the resources available, OSTI is taking steps to ensure that S&T information is preserved. For the Department of Energy, we maintain a repository of last resort. In other words, if information resides in a national laboratory, then we stand ready to take over that information should the labs be unable to maintain it. By the way, we are only the repository of last resort for electronic information. We do not have the means to take care of paper information.

     We digitize legacy collections when we can afford to do it. Most legacy information goes unaddressed, but we are slowly working backwards in time for our most important collections. We also migrate data collections to new electronic formats when the old formats become outdated, and we are partnering with the GPO to ensure permanent public access to DOE information.

     Information preservation in the digital age poses problems foreign to preservation in the paper age. The digital age rules out passive preservation, leaving proactive preservation as the only alternative. Proactive preservation means migrating information regularly, because the technologies it relies on become outmoded every few years and the physical media and equipment have short life spans. The need for regular migration of information implies that the permanence of information is no greater than the permanence of the organizations hosting it. We fear that digital information can never be made maintenance free. We are unaware of a technical solution that will resolve the problem of preservation. However, we think an institutional fix is quite evident.

     For federal government-sponsored information, which includes the great bulk of deliverables coming from all of the basic research in this country, government institutions with an information mission are the best option for ensuring preservation. Beyond mere preservation, such organizations also are best suited to promote permanent public access. Today, digital national libraries allow agencies to envision searchable, comprehensive collections in which information is preserved and kept permanently accessible by migrating it and adapting to changes in technology as needed.

     Three agencies already have national libraries: the National Institutes of Health, the Department of Education, and the Department of Agriculture. Each of these is making great strides with digital collections. By any measure, the National Library of Medicine is the leader. It has produced PubMed for several years, and this has become the single most used collection of information in medicine. Recently, the National Library of Medicine launched PubMed Central, which, unlike PubMed, hosts full text of journal articles and preprints on the library servers. Now DOE has tried to copy PubMed, but we have no plans to emulate PubMed Central--yet.

     The National Agricultural Library recently made its AGRICOLA database freely available on the Web. It differs from PubMed and PubSCIENCE in two ways. The obvious difference is the subject matter: AGRICOLA deals with agriculture. The second is that it does not offer hyperlinks to full text. Similarly, the National Library of Education has put a huge collection on the Web, its ERIC database. Like AGRICOLA, ERIC does not contain hyperlinks to the full text.

     The Environmental Protection Agency and the Department of Transportation also have Web sites offering access to extensive collections of on-line information, and they call themselves national libraries. Additionally, the National Science Foundation (NSF) has a program solicitation for the National Science, Mathematics, Engineering, and Technology Education Digital Library, which focuses on learning resources. It has $13 million of new money, making me green with envy, and NSF has a solicitation for proposals out right now.

     I contend that the nation needs a national library focused on energy, science, and technology--a place where researchers, educators, students, and citizens can come for answers in the physical sciences and other energy disciplines.

     All revolutions cause turbulence and anarchy. Afterwards, there is a new order. That's what history shows. So it is with the digital revolution. We are coming to the end of the period of turbulence and anarchy in the digital revolution. In my view, a new order made up of digital national libraries is upon us.

     The mere existence of a National Digital Library fosters not only the dissemination of information, but also its preservation. It would be the surest way to promote permanent public access to government information. Additionally, the term National Digital Library announces to the world that the agency has information resources that are permanently available to the public.



Copyright 2001 the National Academy of Sciences
