LC 21: A Digital Strategy for the Library of Congress

4
PRESERVING A DIGITAL HERITAGE

A library is not a last resting place for the books contained there but a place where information and ideas live and breathe in new minds. To continue to do so, the materials collected—whether books or Web pages—must themselves be alive and fresh, in forms and formats that preserve their character and make them accessible to new readers. The digital age brings, here as elsewhere, opportunities and challenges.

Chapter 3 describes mechanisms that the Library of Congress (LC) could use to collect digital materials aggressively and coordinate the development of distributed virtual collections. This chapter discusses the nature of LC’s preservation mission in the changing digital context. The Library can now provide access to some objects without assuming any responsibility for preserving them. It can distinguish “research collections,” for which LC serves as a portal for access, from more focused “curatorial collections,” for which LC would assume primary long-term preservation responsibility. The committee’s focus here is on the curatorial collections, but see Chapter 6 for its remarks on the Library’s obligation and opportunity to play a part in assuring the quality and preservation of collections it does not directly control.

PRESERVATION: TRADITIONAL SCOPE AND RESPONSIBILITIES

The acquisition of materials and their integration into a library’s collections traditionally have implied a responsibility to preserve those items
for use by future generations. The Library of Congress has carried out its preservation responsibilities using a variety of professionally accepted practices:

• Providing adequate storage conditions (e.g., proper environmental controls and appropriate binding and shelving);
• Reformatting materials from their original fragile formats and media to more stable media (e.g., microfilming newspapers and brittle books, transferring audio recordings to more stable media, and copying content on nitrate and acetate film to more stable polyester film bases); and
• For a small percentage of rare and unique materials with intrinsic value in their original formats, restoring originals through conservation treatments.

The Library also has a long history of providing leadership in the broader field of preservation. Over the last two centuries, it has conducted research and led efforts in areas such as binding and shelving books, proper environmental conditions for storage, use of microfilm for preservation, and mass deacidification of paper. The Library has also contributed to national preservation efforts, such as the development and adoption by many publishers of a standard for permanent paper and the coordination of preservation efforts under the Brittle Books Program, funded by the National Endowment for the Humanities, and the National Newspaper Project.¹

¹ For additional information on LC’s preservation efforts for analog materials, see <http://lcweb.loc.gov/preserv/>.

PRESERVATION CHALLENGES FOR DIGITAL COLLECTIONS

The Library faces challenges in digital preservation that are widely recognized and shared by many other libraries and archives. They include the following:

• Fragile storage media—Digital materials are especially vulnerable to loss and destruction because they are stored on fragile magnetic and optical media that deteriorate rapidly and that can fail suddenly from exposure to heat, humidity, airborne contaminants, faulty reading and writing devices, human error, and even sabotage.
• Technology obsolescence—Digital materials become unreadable and inaccessible if the playback devices necessary to retrieve information from the media become obsolete² or if the software that translates digital information from machine- to human-readable form is no longer available.
• Legal questions surrounding copying and access—Libraries, archives, and other cultural institutions have limited and uncertain rights to copy digital information for preservation or backup purposes, to reformat information so that it remains accessible by current technology, and to provide public access.³

All organizations with responsibilities for preserving digital information are seeking better technical solutions, model policies, best practices, and clearer guidelines regarding legal and intellectual property issues. The committee found few examples of LC providing leadership or contributing actively to solving critical problems of digital archiving and long-term preservation. Until now, LC seems to have assumed a wait-and-see attitude toward its role in preserving digital information created outside the walls of the Library itself; this was probably exacerbated by the absence of a director for the Preservation Directorate. The committee observed a circular logic operating at LC with regard to born-digital materials: “We don’t have much born-digital content because we don’t know what to do with it; we don’t know what to do with born-digital content because we don’t have very much of it.” Yet because of the Library’s national stature and visibility and its past record of leadership, its participation in digital preservation efforts would be of great value and indeed is sorely missed.

The Library’s collecting policies and mechanisms with regard to born-digital materials are closely tied to its preservation capabilities. As long as traditional collecting mechanisms guarantee a steady stream of print, other analog materials, and tangible digital objects such as CD-ROMs into the Library’s collections, there is an illusion that little significant content is being lost.
The absence of significant digital content in the Library’s collections removes a sense of urgency about digital preservation, and the lack of organizational capacity to preserve many types of born-digital information discourages the Library from taking on responsibilities that it is not prepared to fulfill. In order for LC to break out of this dilemma, it needs to be aggressive in building collections of born-digital materials, as argued in Chapter 3. The methods for doing this include traditional collecting and custodianship and developing relationships with publishers, other research libraries, other national libraries, bibliographic networks and utilities, government agencies, and archives.⁴

² It is worth noting that analog materials also suffer from this problem.
³ For additional discussion on copyright and digital preservation, see Chapter 3 of The Digital Dilemma, by the Computer Science and Telecommunications Board, National Research Council (Washington, D.C.: National Academy Press, 2000); “Digital Preservation Needs and Requirements in RLG Member Institutions,” by Margaret Hedstrom and Sheon Montgomery, Research Libraries Group, available online at <http://www.rlg.org/preserv/digpres.html>; and the film “Into the Future,” by the Council on Library and Information Resources, description available online at <http://www.clir.org/pubs/film/film.html#future>.

ORGANIZATIONAL ISSUES: DEFINING THE SCOPE OF THE LIBRARY’S PRESERVATION RESPONSIBILITIES

Traditionally, the acquisition of materials through purchase, exchange, or deposit and their cataloging into the Library’s permanent collection entailed a commitment to preserve those materials. In the digital environment, LC will assume a wider variety of roles and responsibilities with regard to preservation. One of the first steps that LC needs to take in adapting its collecting practices to accommodate born-digital information is to delineate clearly its responsibilities for preserving digital information. These preservation responsibilities may be loosely classified into three categories:

• LC as creator, active collector, and primary custodian;
• LC as key player in a fail-safe mechanism; and
• LC as a partner in preserving distributed digital collections.

As LC identifies the areas in which it will assume the lead responsibility for digital preservation, other organizations can adjust the scope of their digital collections accordingly. Just as the Library cannot ignore the problem of digital preservation, so also it cannot be expected to do it all. If LC does not set clear boundaries around its digital preservation responsibilities, then many people may assume unrealistically that the Library will be the repository of last resort for everything worth keeping.

The Library of Congress As Owner and Primary Custodian

The Library of Congress must act as the owner and primary custodian for the digital collections it creates.⁵ This is a logical extension of the Library’s current efforts to preserve the digital resources that it creates through the retrospective conversion of materials for the National Digital Library Program (NDLP), compilation of public-domain materials, and cataloging. The THOMAS system maintains a comprehensive full-text database of legislation introduced into the U.S. Congress since 1989 and bill summaries and status data from 1973 to 1989. The Law Library’s Global Legal Information Network (GLIN) maintains an online database composed primarily of searchable legal abstracts in English of foreign laws and regulations enacted in selected countries since 1976 and over 20,000 full texts of legal instruments since 1995. In addition, LC has preserved its bibliographic database of some 12 million machine-readable catalog records representing the books, serials, maps, sound recordings, manuscripts, and visual materials in its collections by migrating these data successfully from older legacy systems to the new Integrated Library System.

⁴ The relationships that LC needs to foster are discussed in Chapter 6.
⁵ The notion of “creation” in the context of digital collections is admittedly ambiguous. A model of sole creation may be too simplistic, and in fact it may be more common that LC creates digital material in concert with other partners or that is interlinked with content from other partners. In these cases, the Library needs to ensure that the custodial responsibilities of all parties are clearly delineated.

The Library can also logically be expected to serve as owner and primary custodian of materials for which it has a unique mandate and of digital resources that it has unique responsibilities for acquiring. The Library’s role in registering copyright and enforcing mandatory deposit law creates a unique opportunity for the Library to collect digital information that might otherwise vanish from the historical record. To fulfill its role and meet its responsibilities, LC urgently needs to develop the organizational and technical capacity to preserve digital deposits of long-term value, as discussed in Chapter 3. One particular concern with regard to preservation is how deposited items will be identified for integration into LC’s permanent collection and which parts of LC will look after long-term preservation requirements.
It is not clear when (or if) CORDS will begin retaining and preserving complete digital objects in a systematic way rather than maintaining only registration information and a digital signature of the object being registered. In some cases, only the digital signature will be kept in CORDS, to verify potential alterations to a digital document or copyright infringements. It is also unclear whether the Copyright Office will assume responsibility for the long-term preservation of the digital content deposited in CORDS or whether some or all of this task will pass to Library Services. The Library needs to determine whether CORDS is intended to serve solely as a registration and deposit mechanism or whether it should also include a repository for digital materials with long-term value. If CORDS is not the most appropriate place to preserve deposited digital works—and the committee doubts that it is—then what other provisions will LC make for digital deposits?
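
The role a retained digital signature can play in detecting alterations may be easier to see in a short sketch. The example below does not describe how CORDS works. It assumes, purely for illustration, that the "signature" is a SHA-256 digest recorded at registration time, and the function names and sample data are hypothetical.

```python
import hashlib
import hmac


def compute_digest(data: bytes) -> str:
    """Return the SHA-256 digest of a deposited object as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, registered_digest: str) -> bool:
    """Compare a fresh digest with the one recorded at registration.

    Assumption: the registry kept only the digest, not the object itself.
    """
    return hmac.compare_digest(compute_digest(data), registered_digest)


# Hypothetical deposit: the digest is recorded; the object stays with its owner.
original = b"Dissertation text as deposited."
registered = compute_digest(original)

print(is_unaltered(original, registered))                 # unchanged copy
print(is_unaltered(original + b" (edited)", registered))  # altered copy
```

A digest of this kind can show that an object has changed, but it cannot be used to recover the object. That is why retaining only the signature leaves the preservation question open.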

The Library of Congress As a Fail-safe Mechanism

The Library of Congress will continue to provide a fail-safe mechanism for preserving materials that no other library has the mandate, resources, or will to maintain; however, it will do so in a new way for digital information. In the paper world, each library that decided to provide its users with a given resource both obtained and conserved its own copy of the object. This replication in the paper world served as a powerful fail-safe mechanism, helping to ensure long-term accessibility through the uncoordinated but distributed maintenance of independent copies. (The only coordination between libraries took place when materials deteriorated to the point where reformatting was required. At that point, a library would check in national databases to see whether another library had already reformatted the item.)

In the digital environment, in which much content is distributed through the centralized services of a publisher, it is not clear where the preservation responsibility lies. If libraries can provide timely services to their users without the complexity and expense of locally storing and preserving the content, then who is to assume preservation responsibilities?⁶ If the creator, publisher, or distributor is willing to preserve digital resources as long as it is financially viable, when should libraries step in to ensure long-term preservation? The Library has begun experimenting with arrangements that may help clarify its role in these instances. It is now necessary to move from these early experiments to the development of a coherent, overarching strategy for digital preservation.

The committee found two noteworthy examples where LC is experimenting with its role as a fail-safe mechanism.
The first is in the context of the National Digital Library Program, which—as mentioned above—is investigating a repository system to preserve the content and associated indexing and retrieval capabilities of this rich collection. The NDLP begins to diverge from the strict ownership and custodial model because nearly 30 other institutions besides LC have contributed digital content. NDLP staff reported to the committee that the Library would take responsibility for preserving digital content from partner institutions should any of them become unable to maintain their portions of the collection. This is a good example of the Library agreeing to serve as a fail-safe mechanism.

⁶ Many publishers who provide direct online access to their materials today explicitly disclaim long-term preservation responsibility. Others say they will keep resources available, but there is a growing consensus that the dynamics of the commercial world do not favor long-term archiving. As content ages and use declines, it is far from clear that a commercial publisher will be willing to invest in keeping old content alive, migrating it as technology changes and retrofitting old content to take advantage of the ever-increasing functionality from advances in technology (as users will expect). Further, content will be threatened whenever it is in the hands of a single institution. Companies evolve, combine with others, change their orientation as markets evolve, and (a particular threat in this time of upheaval and dramatic technological change) go out of business.

The committee nevertheless has several concerns about this arrangement. There are no formal arrangements that define LC’s role in long-term maintenance and preservation of the portions of the NDLP that are currently in the custody of other participating institutions. Although the committee does not question the participating institutions’ commitment, in principle, to preserving the digital content they have contributed, a variety of unforeseen circumstances could prevent them from doing so. This is a particular concern because of the funding model for the NDLP and the emphasis to date on digitization and the development of access mechanisms. While the committee encourages LC to function as a fail-safe mechanism, it believes that LC does not understand that it is placing itself in a risky position when it expresses a willingness to do this without adequately calculating the scale of the commitment and developing the capacity to fulfill it. Also, as the NDLP grows and as more institutions contribute content, the Library will have a larger number of relationships to manage. An important next step for the NDLP is to develop standards and agreements for long-term stewardship that define when and under what circumstances LC will serve as the fail-safe mechanism or repository of last resort.⁷

The committee also notes that the requirements and technical standards for the repository have not yet been defined. Repositories are systems for storing digital objects in a robust and managed fashion.
They protect data from inappropriate access, facilitate the recording of appropriate metadata to allow the management of objects, and provide delivery facilities for both curatorial and user access. The Library has recently acquired, through a gift, the TEAMS system developed by Thompson Publishing. TEAMS supports both object and metadata storage and maintenance and has been implemented in a variety of corporate settings (for example, it is used by the Washington Post to manage its digital content). Its use by LC represents the first implementation of TEAMS in a traditional library application. The committee supports the development of a repository system as an important next step for the NDLP and agrees that this work needs to be stepped up, but it has reservations about the directions that LC is taking. Is it in the best long-term interest of LC or the participants in the NDLP to use a commercial product developed for a commercial application? Although accepting donated hardware and software helps to mobilize private support for LC and may reduce expenditures in the short run, contributions like these can also commit LC to proprietary systems and methods that in the long run will limit the Library’s ability to federate its collections with those of other repositories and interoperate easily with potential partners. Where long-term strategic purposes are involved, selection should be based on a full review of the technical architecture, costs, and long-term value.

⁷ As an implementation issue, the role of LC as a fail-safe mechanism or repository of last resort does not necessarily have to be fulfilled by having the digital content resident on LC servers only. Redundancy across locations (whether at LC servers at other sites or at the servers of cooperating organizations) is desirable.

One approach is for LC to coordinate its efforts to develop a repository system with those of other organizations that are using the reference model for the Open Archival Information System (OAIS) (Box 4.1). This high-level model does not specify any particular implementation of an archival information system, nor does it define standards for accession, description, data management, or distribution. The OAIS model is important for digital preservation standards and strategies because it defines the functions and requirements for a digital archive through an international standard that vendors and producers of digital information can reference. If the OAIS reference model is widely adopted (and there are indications that it will be), then it may provide the framework for a network of cooperating and federated repositories. The National Archives and Records Administration is adopting this model for some of its digital archiving requirements and is working with the San Diego Supercomputer Center on one specific implementation.
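
The repository functions described above (storing objects, recording metadata for their management, and delivering them to users) can be sketched in a few lines. Because the OAIS reference model does not prescribe an implementation, everything here is an assumption made for illustration: the class and method names are hypothetical and only loosely echo the OAIS notions of submission, archival, and dissemination packages, and the in-memory dictionary stands in for durable, replicated storage.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SubmissionPackage:
    """What a producer hands to the archive (hypothetical stand-in for a SIP)."""
    identifier: str
    content: bytes
    description: str


@dataclass(frozen=True)
class ArchivalPackage:
    """What the archive stores: content plus management metadata."""
    identifier: str
    content: bytes
    description: str
    sha256: str        # fixity value recorded at ingest
    file_format: str   # needed later to plan migration or emulation


@dataclass
class Repository:
    """Toy in-memory archive; a real one would use durable storage."""
    _store: dict = field(default_factory=dict)

    def ingest(self, sip: SubmissionPackage, file_format: str) -> ArchivalPackage:
        """Accept a submission and record metadata alongside the content."""
        aip = ArchivalPackage(
            identifier=sip.identifier,
            content=sip.content,
            description=sip.description,
            sha256=hashlib.sha256(sip.content).hexdigest(),
            file_format=file_format,
        )
        self._store[aip.identifier] = aip
        return aip

    def disseminate(self, identifier: str) -> bytes:
        """Deliver content to a user, refusing if the fixity check fails."""
        aip = self._store[identifier]
        if hashlib.sha256(aip.content).hexdigest() != aip.sha256:
            raise ValueError(f"fixity check failed for {identifier}")
        return aip.content


repo = Repository()
repo.ingest(SubmissionPackage("demo-001", b"<html>sample</html>", "Sample Web page"), "text/html")
print(repo.disseminate("demo-001"))
```

Keeping the format and fixity value with each stored object is the design point: those two pieces of metadata are what later allow an archive to detect damage and to decide when an object needs reformatting.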

BOX 4.1 Open Archival Information System

The Open Archival Information System (OAIS) is a high-level reference model developed by the Consultative Committee for Space Data Systems with representatives of the leading space science agencies in North America, Europe, and Japan.¹ The OAIS reference model provides a unifying set of concepts for an OAIS archive. Such an archive consists of an organization of people and systems that has accepted responsibility for preserving information and making it available to a designated community. The OAIS model provides terminology and concepts for describing and comparing the architectures and operations of archives, defines the responsibilities of an open archival information system, and offers detailed models for the functions, components, and processes necessary to support long-term preservation and access to digital information. Although the model was developed originally to assist organizations with the preservation of large databases of space science information, it has been used in several other contexts, including in Project NEDLIB and in archival program development at the National Archives and Records Administration. Building on the OAIS model, the Council on Library and Information Resources (CLIR) began an initiative in the year 2000 involving the preservation of digital scholarly journals.²

¹ (Box 4.1) Information about the OAIS reference model is available online at <http://ssdoo.gsfc.nasa.gov/nost/isoas/>.
² (Box 4.1) For additional information, see <http://www.clir.org/diglib/preserve/presjour.htm>.

In general, for LC to develop the capacity to serve as a fail-safe mechanism, it will need to acquire a much more extensive technical infrastructure and greater expertise in a wide variety of file types and formats. While they are important experiments, neither the NDLP repository nor CORDS yet offers solutions to LC’s responsibility to preserve born-digital content created outside the Library. In the case of NDLP, LC has been working with materials converted to digital form, and for these materials it has taken the lead in setting standards for formats, metadata, naming conventions, and other technical attributes. But for born-digital materials, especially those created outside the Library, LC is unlikely to have the leverage to define or limit the formats and structures that are used. In the experimental period of the dissertation project, ProQuest is presenting dissertations in the portable document format (PDF) directly to the Copyright Office.⁸ These are being accepted as the best edition, cataloged using the cataloging specifications offered by ProQuest, verified by examiners in the Copyright Office, and stored at ProQuest. That is, registration is handled by the Copyright Office but the deposit is virtual. This arrangement presents LC with no immediate need to preserve the materials.

⁸ See Chapter 3 for a discussion of the ProQuest arrangement.

Finding: Because of intellectual property law and the uncertainty of some publishers regarding the deposit of copies of digital works, institutions with long-term preservation responsibilities must seek and develop new means of ensuring continuing access to the valuable documentation of history, culture, and creativity. One possible approach is contractual agreements with rights holders who maintain digital information in off-site repositories, with provisions for deposit in a library or other institution should the publisher cease to maintain the information. Some publishers have agreed to provide perpetual access to their materials as one of the conditions of a license. The Library has initiated an experiment in reaching such an agreement with ProQuest. The committee believes that such arrangements need to be tested carefully and that other models need to be explored as well.

Recommendation: The Library should establish contractual arrangements (i.e., projects) in 2000 and 2001 with a pilot set of publishers and distributors of significant digital content, in order to conduct additional experimental programs for storing and maintaining digital information in off-site and on-site repositories.

Recommendation: For all fail-safe arrangements, the Library must regularly test the integrity of the materials and systems and its capacity to accept responsibility in a timely way. Such tests will demonstrate whether LC has the appropriate technical capability and whether the arrangements with publishers are realistic ones. The way in which librarians work must be totally reconceptualized for these fail-safe mechanisms to work. The Library should coordinate such efforts with institutions doing related work, including other research libraries, the National Libraries of Agriculture and Medicine, the National Archives and Records Administration, other national libraries working to preserve their nations’ digital heritage, and other organizations that have a legal mandate for long-term preservation or a commercial interest in it.

The Library of Congress As a Participant in Shared Responsibilities for Long-term Preservation

The committee’s analysis suggests that the Library needs to articulate carefully a policy identifying the subset of digital materials for which it will assume long-term curatorial responsibility, taking into account the following: The burden of preserving digital collections is daunting and must be shared with other archiving institutions. The archiving and preservation of digital resources normally accessed over the Internet will not take place as a by-product of normal access but must be explicitly pursued.
Simply assuming that preservation will be carried out somewhere across numerous replicated research collections will not be a solution for networked resources. Both considerations argue for libraries to define the scope of their archiving roles, in order that responsibilities be distributed across the archiving libraries of the world. The archiving and preservation responsibility is a long-term one that will serve researchers in generations to come.

One advantage of digital materials is the potential to distribute preservation responsibilities among a wide variety of partners so that each institution preserves only a designated portion of the global digital record. With careful planning, coordination, cooperative agreements, and clearly articulated boundaries around its curatorial collection, the Library could assume long-term preservation responsibility for a much smaller portion of the digital corpus than it did for the paper one. Some redundancy is necessary for backup and security purposes, but less redundancy is needed for digital collections than was required in the past because access no longer requires physical proximity to materials.

Distributed curatorial responsibility will be achieved only with leadership from LC and cooperation with many partners. A variety of roles can be envisioned for the Library in a collaborative effort among libraries, publishers, government agencies, and other stakeholders to define the parameters of distributed digital collections and delineate the roles and responsibilities of various parties for access and long-term maintenance of important digital works (see Box 4.2). One of LC’s roles could involve coordination with other national libraries. As mentioned above, several European national libraries and the national libraries of Canada and Australia have launched programs to collect and preserve the digital portions of their national bibliographies. If the mechanisms to acquire and preserve the digital national bibliographies of some countries succeed, LC could be relieved of responsibility for preserving most digital materials from those countries.
At that point, it could concentrate its curatorial efforts on works created or published by Americans or that reflect important aspects of U.S. history, policy development, and culture. Some other countries would need help. For the foreseeable future, many developing countries will not have the resources to preserve their digital heritage. Curatorial responsibility for these collections could be shared by LC and other libraries that have well-developed repository systems rather than assuming that LC will serve as the repository for all significant materials globally. There are many other opportunities to divide up long-term preservation responsibilities by subject area or domain. This is not unprecedented: LC already cedes responsibility for materials in medicine, agriculture, and education to the national libraries set up for those subject domains. In addition, the National Archives and Records Administration preserves records of the federal government that have long-term value for documenting U.S. government, policy development, and history. A clear delineation of LC’s long-term curatorial responsibilities would be critical as

OCR for page 105
LC 21: A Digital Strategy for the Library of Congress BOX 4.2 Research and Experiments in Digital Preservation The CEDARS project aims to provide guidance in best practices for digital preservation by both developing practical demonstrator projects and sponsoring strategic working groups. Funded by the British Joint Information Systems Committee (JISC), with work carried out at Leeds, Oxford, and Cambridge universities, its main objectives are to promote awareness; to identify, document, and disseminate strategic frameworks for the development of appropriate digital collection management policies; and to investigate, document, and promote methods appropriate for long-term preservation. (For more information, see <http://www.leeds.ac.uk/cedars/>.) For Project NEDLIB, several European national libraries have joined forces to develop strategies, programs, and infrastructure for digital deposit and archiving. Project NEDLIB—Networked European Deposit Library—started in January 1998 with funding from the European Commission’s Telematics for Libraries Programme. Project partners include deposit libraries, archives, developers of information technology, and three large publishers (Kluwer, Elsevier, and Springer-Verlag) that contribute to the project and will supply electronic publications for demonstration purposes. The project aims to construct the basic infrastructure upon which a networked European deposit library can be built. Key issues to be investigated are standards and interfaces for the generic architecture; electronic document technical data; and access controls and archival maintenance procedures. Information technology developers and publishers will assist in defining standards, methods, and techniques. The commercial and copyright interests of publishers will be handled through access controls implemented when the publications are stored and activated when they are accessed. 
(Information on NEDLIB projects, working papers, and reports is available online at <http://www.konbib.nl/coop/nedlib/>.)

The PANDORA project (Preserving and Accessing Networked Documentary Resources of Australia) is an initiative of the National Library of Australia that aims to develop policies and procedures for the selection, capture, and archiving of Australian electronic publications and the provision of long-term access to them. One of its priorities is the fostering of working partnerships with other national libraries and overseas agencies that are also undertaking research and development in this area. To date, it has developed a working proof-of-concept archive (<http://www.nla.gov.au/pandora/archive.html>) and a collection of policy documents outlining the conceptual framework for a permanent electronic archive. It is now developing a working model for a national collection of Australian electronic publications. (See <http://pandora.nla.gov.au/pandora/>.)

Project PRISM, at Cornell University, is a 4-year effort to investigate and develop the policies and mechanisms needed for information integrity in distributed digital libraries. A collaboration of librarians, computer scientists, and social scientists, it is funded by the Digital Library Initiative, phase 2 (see <http://www.dli2.nsf.gov/>). One of its five foci is digital preservation; it aims to investigate the long-term survivability of information in digital form. Another focus is to investigate policies and mechanisms for preserving digital content, especially when that content is not under direct curatorial control. (For more information, see <http://www.prism.cornell.edu/main.htm>.)

The CAMiLEON Project is a joint undertaking between the University of Michigan and the University of Leeds (United Kingdom) and is funded by the Joint Information Systems Committee (JISC, which also funds the CEDARS project) and the National Science Foundation. It aims to evaluate emulation as a digital preservation strategy for retaining the original functionality and look and feel of digital objects and to locate emulation within a larger suite of digital preservation strategies. Its deliverables include cost comparisons of different levels of emulation; a set of emulation tools that will be available for use and further testing in libraries; and preliminary guidelines for the use of different strategies (conversion, migration, and emulation) for managing and preserving digital collections. (See <http://www.si.umich.edu/CAMILEON/>.)

The Data Provenance Project at the University of Pennsylvania is exploring methods for keeping track of the source (the provenance) of digital information as it is extracted from databases, translated, transformed, and combined with other information. Funded by the Digital Library Initiative (DLI), it aims to identify the central issues of digital provenance and to contribute to the development of new data models, new query languages, and new storage techniques that will lead to the creation of a substrate for recording and tracking provenance. (See <http://db.cis.upenn.edu/Research/provenance.html>.)
The NARA Project: Persistent Archives and Electronic Records Management, a project of the San Diego Supercomputer Center and the National Archives and Records Administration, is working on an approach to maintaining digital data for hundreds of years by developing an environment that supports the migration of collections onto new software systems. It is developing both the technologies and the preservation and management policies needed to define an infrastructure for a collection-based persistent archive. The project’s current focus is the creation of a 1-million-message persistent e-mail collection. (See <http://www.sdsc.edu/NARA/>, <http://www.dlib.org/dlib/march00/moore/03moore-pt1.html>, and <http://www.dlib.org/dlib/april00/moore/04moore-pt2.html>.)

LOCKSS (Lots of Copies Keeps Stuff Safe), a project of Stanford University’s HighWire Press, has funding from Sun Microsystems and the National Science Foundation. LOCKSS is testing the feasibility of preserving digital documents by storing multiple copies on different computers. Each computer in the LOCKSS network looks for and corrects errors in its copy by comparing it with other copies in the LOCKSS system. The system is currently being tested at the libraries of Columbia, Harvard, and Stanford universities, the University of California at Berkeley, the University of Tennessee, and the Los Alamos National Laboratory. If these tests are successful, the Stanford researchers hope to expand the project to libraries overseas. (For more on LOCKSS, see <http://lockss.stanford.edu/>.)

Like LOCKSS, the Intermemory project, based at the NEC Research Institute, aims to preserve digital information by replicating it at multiple sites. It is performing basic research in the area of Internet-distributed algorithms and protocols. (See <http://www.intermemory.org>.)
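The replication idea described for LOCKSS in Box 4.2, in which each computer compares its copy with the others and corrects errors, can be illustrated with a toy sketch. This is not the LOCKSS protocol itself, whose actual mechanism is not described here; it assumes all copies are available in one place, and the function names and site names are hypothetical. A copy is repaired only when a strict majority of copies agree on the content.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return a SHA-256 hex digest that identifies one copy's content."""
    return hashlib.sha256(content).hexdigest()


def repair_copies(copies: dict[str, bytes]) -> tuple[dict[str, bytes], list[str]]:
    """Compare all copies and overwrite those that disagree with the majority.

    Returns the repaired copies and the names of the sites that were repaired.
    If no content is held by a strict majority, nothing is changed.
    (Hypothetical helper for illustration; not part of any real LOCKSS API.)
    """
    if not copies:
        return {}, []
    votes = Counter(digest(c) for c in copies.values())
    winning_digest, count = votes.most_common(1)[0]
    if count * 2 <= len(copies):
        # No strict majority: leave every copy alone for human review.
        return dict(copies), []
    good = next(c for c in copies.values() if digest(c) == winning_digest)
    repaired: dict[str, bytes] = {}
    fixed: list[str] = []
    for site, content in copies.items():
        if digest(content) != winning_digest:
            fixed.append(site)
            repaired[site] = good
        else:
            repaired[site] = content
    return repaired, fixed


# Three hypothetical sites; one copy has suffered a single-character error.
copies = {
    "site_a": b"Annual report, 1999",
    "site_b": b"Annual report, 1999",
    "site_c": b"Annual rep0rt, 1999",
}
repaired, fixed = repair_copies(copies)
print(fixed)  # ['site_c']
```

Refusing to act without a strict majority is a deliberate choice in this sketch: with only two disagreeing copies there is no way to tell which one is damaged.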

a means to avoid unachievable commitments to long-term preservation and enable the Library to preserve materials that it is best suited or uniquely able to collect and maintain.

Finding: Many national libraries, university research libraries, national archives, bibliographic utilities, and organizations with large holdings of digital information are actively pursuing solutions to the problems of digital preservation. Although the Library of Congress might have been expected to provide leadership in this area as it once did in others, LC has at best played only a minimal role in these initiatives. As a consequence, it has little awareness of potential solutions that are emerging from joint research and development projects and has not contributed much to this important national and international problem for the library community.

Recommendation: Ensuring its leadership in digital preservation will require the Library to hire or develop relevant expertise. The Library should join and, where possible, lead or facilitate national and international research and development efforts in digital preservation. There are opportunities for the Library to learn from and contribute to such efforts in preserving born-digital information and converting certain types of information to digital form as a preservation strategy.

Recommendation: To make it a safe haven for preservation purposes,9 the Library should take an active role, including working with the Congress if necessary, in efforts to rework intellectual property restraints on copying and migration.

9   In The Digital Dilemma, p. 210, it was similarly recommended that “Congress should enact legislation to permit copying of digital information for archival purposes, whether the copy is in the same format or migrated to a new format.” LC is currently investigating intellectual property rights in the digital domain under a mandate from the Digital Millennium Copyright Act of 1998, which asks the Librarian of Congress to determine whether technical protection measures are having an adverse effect on the ability to make noninfringing uses of copyrighted works.

WHAT DOES THE LIBRARY OF CONGRESS NEED TO DO TO FULFILL ITS LONG-TERM PRESERVATION RESPONSIBILITIES?

Even if LC carefully defines its roles and responsibilities along the continuum from serving as a portal to acting as the primary custodian for digital materials, there is an urgent need for it to enhance its technical capacity and expertise in digital preservation. Such preservation will involve a wide range of activities, including the following:

“Protecting the bits”: The making of backup copies, periodic re-copying to new media, and regular checking of object coherence and validity are required to make certain that rarely accessed materials remain technically sound. Access will be enhanced to the extent that archived materials remain online rather than, say, being stored on media such as tapes, which must be mounted manually.

Archiving appropriate copies: For many digital materials, the format most useful for current services (say, a PDF file or a GIF image) is not the most robust for long-term archiving (SGML, XML, or TIFF may serve better). A preservation program must encompass the selection and archiving of the appropriate formats for long-term use and then the derivation of current-use copies appropriate to the technological base of today’s users. It must also work with users to help them undertake these tasks.

Maintaining appropriate metadata: All preservation activities will depend on the completeness and quality of the metadata for the objects to be preserved. It will be critical for the Library to monitor developments in metadata standards and follow best practices for metadata as they develop.

Migrating formats: Even the most careful selection of archiving formats cannot ensure that objects will be useful in the decades to come. It will be necessary to migrate objects periodically from one archival format to another. Such processes must be carefully designed and executed to ensure minimal loss of content (it is impossible to ensure that all such migrations will be loss-free). Some works, such as those that include active software (e.g., Java applets), may raise particularly difficult issues.

Conducting research and development: Only a handful of institutions are likely to face digital preservation challenges on the scale or scope of those that LC faces in the coming years. This makes it unlikely that LC will be able to import models and solutions for all of its preservation needs. An active program of research and development in digital preservation is needed to solve immediate preservation problems (technical, legal, and economic) at LC and to provide guidance to other libraries and archives.

Educating the relevant communities: Especially in this transitional period, when digital materials are new and preservation practices are still in flux, institutions and particular communities will need to be educated about digital preservation: What is the state of the art? What factors must be taken into account in planning for preservation? LC is well situated to participate in such efforts and possibly to take the lead.

A robust preservation program will employ curators and preservation staff with knowledge of the formats of materials in their collections and of appropriate metadata standards and practices and an understanding of the issues involved in migrating objects from one format to another. It will require well-developed production services for creating the specified metadata, sound and robust repository services, and periodic quality checking and copying of objects in the collection. Technical staff will be required to build or implement many pieces of the necessary infrastructure and to create the custom migration facilities that a digital collection will need to endure over time. It is not imperative that the Library itself carry out all of this activity. The curatorial responsibility can be met by ensuring that appropriate activities are carried out by contractors or other libraries and archives.

In the future, LC will also need to make much more extensive use of digitization for preservation. Some professionals consider digital objects, whether born digital or turned digital, unacceptable as preservation masters because their longevity is uncertain.10 In some cases, however, digital conversion may offer the only viable means of salvaging and preserving certain materials, such as audio recordings in obsolete analog formats. The Library has used digitization to preserve severely damaged black-and-white negatives and some audio and video records on magnetic media.
Just as LC will need new capabilities to collect and maintain born-digital materials, so also will it need to develop new capabilities to preserve all of the digital information under its control.

The challenges of digital preservation make it easy to overlook the benefits that LC could enjoy by rapidly enhancing its capacity to collect and preserve digital information. Digital storage media are very compact, making it possible to store enormous quantities of information in a very small amount of space. Digital information can be managed and handled more automatically. More significantly, whatever LC collects and preserves in digital form has the potential to be made accessible to anyone, anywhere, on any day of the week or at any time of the day (as rights permit under the copyright law or terms of licenses). Unless LC develops the capacity to integrate digital materials into the mainstream of its collecting and access programs, it will forgo all of these benefits and will cede its position as one of the world’s leading libraries.

10   Why Digitize? by Abby Smith (Washington, D.C.: Council on Library and Information Resources, 1999).

Finding: The Library of Congress lacks an overarching strategy and long-range plan for digital preservation. (In recent years, it has also been without a permanent head for its Preservation Directorate.) Although the Library has preserved many of its own digital resources, including the full-text databases of the THOMAS system, its own bibliographic databases, and the content, descriptive information, and retrieval capabilities of the National Digital Library Program, these efforts are not coordinated with each other or with efforts to address the larger problem of capturing and preserving born-digital content, nor is there any strategy, plan, or infrastructure to capture, manage, and preserve born-digital information that originates outside the Library.

Recommendation: The Library should immediately form a high-level planning group to coordinate digital preservation efforts and develop the policies, technical capacity, and expertise to preserve digital information. The hiring of someone who is knowledgeable about digital preservation as a new head of the Preservation Directorate must be given high priority.

Recommendation: The Library should put a digital preservation plan in place and implement it as soon as possible, taking into account life-cycle costs and minimizing the need for manual intervention. The Open Archival Information System (OAIS) reference model provides a useful framework for identifying the requirements for a digital archiving system.
The initiative by the Council on Library and Information Resources that builds on the OAIS should also be consulted.
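
As one illustration of what minimizing manual intervention might look like for the “protecting the bits” activity described earlier in this section, the sketch below records a checksum for every file in a collection and later reports files that are missing, changed, or unexpected. It is a minimal example under stated assumptions, not a description of any system in use at LC or a requirement of the OAIS model: the function names are hypothetical, the collection is assumed to be an ordinary directory tree, and SHA-256 is simply one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum one file, reading it in chunks so large objects fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the collection as it is now against a previously stored manifest."""
    current = build_manifest(root)
    return {
        # Recorded earlier but no longer present.
        "missing": sorted(set(manifest) - set(current)),
        # Still present, but the bits no longer match the recorded checksum.
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # Present now but never recorded.
        "unexpected": sorted(set(current) - set(manifest)),
    }
```

In practice the manifest would be stored separately from the collection, for example with a replica at another site, so that a single failure cannot damage both the objects and the record used to check them.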