5
ORGANIZING INTELLECTUAL ACCESS TO DIGITAL INFORMATION: FROM CATALOGING TO METADATA

Libraries not only collect and preserve material; they also provide access to it. The intellectual basis of providing such access is the organization and cataloging of materials under their aegis. The Library of Congress, even more than most libraries, invests heavily in cataloging its collections. It probably has the largest cataloging operation in the world1 and is a source of both cataloging data and standards for much of the library community. In the world of paper-based publication, its role is second to none.

One enduring role of libraries during the transition from physical to digital information will be the intellectual task of cataloging—imposing order2 on diverse resources with the goal of making those resources easier to discover and manage. As the chair of this committee said, “The librarian [in the digital age] will have to be a more active participant in staving off ‘infochaos’ by playing a role in selecting information resources and describing them in the ‘information waterfall’ of the ‘virtual library.’”3

1  

The Cataloging Directorate at LC employs approximately 550 people. Cataloging functions also take place in other units at LC.

2  

“Cataloging in the Digital Order,” by David M. Levy, paper presented at Digital Libraries ’95: The Second Annual Conference on the Theory and Practice of Digital Libraries, available online at <http://csdl.tamu.edu/DL95/papers/levy/levy.html>.

3  

Avatars of the Word: From Papyrus to Cyberspace, by James J. O’Donnell (Cambridge, Mass.: Harvard University Press, 1998), p. 43.



Copyright © National Academy of Sciences. All rights reserved.




LC 21: A Digital Strategy for the Library of Congress

The first section of this chapter reviews the longtime leadership role of the Library of Congress in developing and maintaining cataloging standards for physical resources. The chapter then looks at the implications of the new digital milieu for these traditional cataloging mechanisms. It closes with some observations on how the Library seems to be addressing these changes, arguing that it is not treating them as the strategic issue the committee believes them to be.

A HISTORY OF LEADERSHIP IN CATALOGING STANDARDS

The Library of Congress has a well-earned reputation for its leading role in the development and administration of cataloging and associated standards. Before reviewing these standards efforts, it is worthwhile to answer briefly two questions: Why are standards important? What role have they played over the past century?4

Cataloging is arguably among the most expensive tasks in the library. Current estimates range from $50 to $110 for the creation of a single full cataloging record.5 What is responsible for this high cost? While some of the tasks of cataloging (for example, recording a title) are indeed mundane in the majority of (but not all) cases, others are intellectually challenging and time consuming:

Subject analysis: The usability of library catalogs for finding resources “about” a particular subject depends greatly on the nontrivial task of understanding the content of a resource and tagging it with a controlled subject heading.6

Authority control: While the subject of assigning authorship may seem superficially simple, it is confounded by the fact that people frequently use different forms of their name, and different people frequently have very similar or identical names. There is the problem of the multiple spellings of historical figures (is it “Shakespeare” or “Shakespere” or “Shakespear” or ?), the problem of aliases (“Mark Twain” and “Samuel Clemens”), and the seemingly random use of initials, shortened names, and the like (“Samuel Langhorne Clemens,” “Samuel L. Clemens,” “Sam Clemens,” etc.). Authority control, in the context of author names, is the task of, first, associating these name variations with a canonical name in the cataloging record to show that the variations are indeed the same person and, second, differentiating between ambiguous and overlapping names.

The development of standard cataloging practices from the end of the nineteenth century through the twentieth century made it possible to share cataloging records, leading to a significant cost savings for libraries. “Copy cataloging” exploits the fact that the overwhelming majority of resources in an average library are not unique. Rather than produce original cataloging records for the duplicated resources, libraries can use cataloging records from other libraries. In fact, in preautomation days, the Library of Congress was in the business of supplying physical catalog cards to many other libraries across the United States.7

The following subsections, while not exhaustive, give an overview of the Library’s involvement and leadership in cataloging and resource discovery standards.

Machine-Readable Cataloging

The development of the machine-readable cataloging (MARC) record8 by the Library of Congress, in concert with the library community, in the 1960s was a landmark event in the automation of library operations. As recalled above, preautomation catalog sharing involved the physical shipment of catalog cards to fellow libraries.

4

The original goals of cataloging as expressed by Charles Cutter at the end of the nineteenth century were to enable readers to do the following: (1) to find all works by a particular author, (2) to find any work by title, (3) to find all editions of a work, and (4) to find all works on a subject. These goals were originally conceived of as applying to works held by a particular library.

5

Based on testimony to the committee and the personal knowledge of committee members. However, a “full” cataloging record is not generated for many materials; less thorough cataloging records will naturally cost less.

6

Any user of a library catalog, either in card or online form, is familiar with this subject tagging. For example, the book Avatars of the Word, by James J. O’Donnell, is tagged with the three subject headings Communication and technology--History; Written communication--History; and Cyberspace, which are elements of the Library of Congress subject headings (see <http://lcweb.loc.gov/catdir/cpso/wlsabt.html>). The advantages of this tagging are two-fold. A user of the library catalog can search for resources on the basis of the subject classifications, and once a resource is found, related resources (by subject classification) can be located.
The introduction of computers into the library environment allowed sharing computer catalog records among libraries by exchanging magnetic tapes. MARC consists of both an encoding scheme for labeling cataloging elements (e.g., “author” or “title”) and an exchange format for packaging the encoded bibliographic data into a record for transfer purposes. The MARC record remains a critical technical foundation of existing integrated library systems (e.g., the Endeavor ILS recently installed at the Library of Congress) and permits the transfer of records (now via networks) among these systems. The Library’s Network Development and MARC Standards Office9 leads and coordinates international efforts to further develop MARC as a standard for the efficient and long-term interchange of bibliographic information. In recent years, the office has been analyzing the relationship of MARC to standard generalized markup language (SGML)10 and the use of MARC for digital media.

One important aside to this discussion of MARC is the potential that the widespread and rapid adoption of MARC presented for the migration of the National Union Catalog to electronic format.11 A number of the individuals interviewed by the committee said that the Library had missed an opportunity in the late 1960s and early 1970s to provide online access to shared catalog records. Such a catalog could have been a natural product for a national library and, perhaps, could have provided a revenue stream to underwrite the Library’s cataloging operation. In fact, the Library, for reasons not entirely obvious to the committee or the individuals, failed to take advantage of this key opportunity and ceded it to the Online Computer Library Center (OCLC) and the Research Libraries Group (RLG).12 The Library should learn from this experience to identify and secure for itself appropriate strategic leadership roles (see Chapter 6).

General Cataloging Standards

While MARC provides the markup and transfer syntax for bibliographic records, the Anglo-American Cataloging Rules (AACR2)13 provide the rules for the actual description of a bibliographic item. The Library of Congress plays a major role in coordinating activities in AACR2, and its Library of Congress Rule Interpretations14 defines common practice for the use of AACR2 in cooperative cataloging. The result, in combination with the MARC standard, is that both the meaning and encoding of cataloging records can be shared among a large number of libraries.

These activities are centered in the Library’s Cataloging Policy and Support Office (CPSO), which provides “leadership in the creation and implementation of cataloging policy within the Library of Congress and in the national and international library community.”15 Among the cataloging standards coordinated by CPSO are authority files (both names and subjects) to support MARC data elements, the Library of Congress classification rules, and Library of Congress subject headings, all described above.

Encoded Archival Description

In general, archival and manuscript items are handled differently from monographs and serials. Whereas the latter are cataloged at the item level, the former are described at a coarser level of granularity (for example, a manuscript box or folder). Tools for locating such items are referred to as “finding aids.” The encoded archival description (EAD)16 was the product of a project at the University of California at Berkeley in the early 1990s to develop a standard for machine-readable finding aids.17 The EAD standard utilizes SGML to mark up structured descriptions of units of archival information. The Library’s Network Development and MARC Standards Office serves as the maintenance agency for the EAD standard, an excellent example of the Library stepping in to play an important role in the new metadata environment.18

THE DIGITAL CONTEXT AND ITS CHALLENGES TO TRADITIONAL CATALOGING PRACTICES

A user of the Web who has sampled any of the numerous search engines (e.g., Google, AltaVista, Excite19) might argue that digital content, networks, and full-text indexing have made human-mediated organization through cataloging obsolete. Web search engines demonstrate the great utility of such searching and the benefits of over 30 years of research in information retrieval20 and, to a lesser degree, natural language processing.21 However, there are numerous inherent limitations to the technology underlying them:

Scalability: Most Web search engines accumulate indexes by scanning the Web and downloading full content from sites.22 As the volume of Web content grows, it has become increasingly difficult to keep these indexes current or complete. One study23 indicates that even the best search engines index only about 12 to 15 percent of Web content.24

Even more problematic are the limitations of the information retrieval (IR) technology used in most popular Web search engines. The nature of the Web as a corpus presents some difficult scalability challenges for IR and often leads to poor results. The sheer size of the corpus is a notable problem; a typical Web query will retrieve a very large set of potentially relevant documents. In addition, the Web corpus is usually presented as a single, unorganized collection of documents, which makes synonym clashes inevitable.25 Synonym clashes are well understood and can be addressed with a variety of techniques (e.g., thesauri, user feedback, local context analysis, phrase structure), but these techniques are generally not exploited by Web search engines.

Access limitations and databases: While a great deal of useful content is freely available on the Internet, there is a growing and equally valuable portion of Internet content that is proprietary and held in protected systems. Much “valuable” content (from the point of view of the rights holder) is held in databases on special servers that require a password or other means of authorization for access. These databases also provide enhanced functions beyond what can be done with simple, static Web pages. However, they do not in general support access via crawling, the method used to build most Internet search services. Thus, while Web indexers are able to access and index a large percentage of the total content on the Web, an ever-growing percentage of the most-sought-after material is not available from Internet search facilities.

Format: Existing Web search engines are limited to textual content. They index words in documents and process textual queries (for example, “digital imaging”), returning lists of documents ranked according to the appearance of the query words in their content. Extending this approach to images will require tools to analyze image content and respond to queries such as “find images with cars in them” or, at an even more advanced level, to queries that ask for images with features similar to those of another digitized image. The tools to retrieve images, video segments, voice, and music are being actively researched26 but are currently beyond the capabilities of Internet search engines.

Context: At a more abstract level, the usefulness of indexing based solely on the text content of a resource is compromised by lack of context. The best tools to help a person locate a resource are those that are tailored to the context in which the resource occurs and to the knowledge context of the searcher. For example, content-based searches of MEDLINE (the medical index at the National Library of Medicine) might be appropriate for a professional familiar with medical terminology and with the body of medical literature indexed by MEDLINE. However, a high-school student might not be able to select documents that are appropriate to his or her background and might not be familiar with medical terms, so he or she might find content-based searching to be difficult. The lack of context is a problem for both human and automatically generated representations, but different representations can often be combined to good effect.

Markup: HTML, the main markup language of the Web, provides only a very simplified set of tags for labeling the parts of a document. These are primarily oriented to supporting the appropriate display of the document and in general tell little about the meaning of the various sections of the document. Many search engines utilize smart markup to provide more powerful retrieval facilities, allowing users to limit which parts of a document are used to satisfy the search argument. The simplicity of HTML markup severely constrains Internet search engines’ use of such facilities, which are very useful for limiting and refining search results.

7

The LC cards were also an important source of information for the purpose of selecting new titles to be added to a library’s collection.

8

There are in fact many separate dialects of MARC used across the world. For simplicity’s sake, this report uses the term “MARC” to denote the USMARC standard. When other dialects are intended, their specific designation is used. For additional information on MARC, see Understanding MARC Bibliographic: Machine-Readable Cataloging, by B. Furrie (Washington, D.C.: Cataloging Distribution Service, Library of Congress, 1998).

9

Its Web site is at <http://lcweb.loc.gov/marc/>.

10

A markup language is one that allows tags to be intermixed with standard text to instruct a computer about how that text should appear when rendered (presented) or to indicate the structure of that text (its division into chapters and sections, for example). The markup language best known to users of the Web is hypertext markup language (HTML). HTML was preceded by SGML, which is used mainly in the publishing community but has proven too complex for general use. The recent development of extensible markup language (XML) in the Web community represents an attempt to find a standard that mediates between the low functionality but simplicity of HTML and the overwhelming complexity of SGML.

11

The National Union Catalog (NUC) is a record of publications held in more than 1,100 libraries in the United States and Canada, including the Library of Congress. Major portions of the NUC are published in two principal series: one covers post-1955 publications and the other pre-1956 imprints. Since 1983, the NUC has been issued on microfiche. For additional information, see <http://lcweb.loc.gov/rr/main/inforeas/catalogs.html#union>.

12

OCLC (<http://www.oclc.org>) and RLG (<http://www.rlg.org>) are nonprofit corporations, each having a union catalog product that it markets to libraries. Combined, OCLC’s WorldCat (<http://www.oclc.org/oclc/menu/colpro.htm>) and RLG’s Union Catalog (<http://www.rlg.org/cit-bib.html>) are the basis of cooperative cataloging among the world’s libraries. Many of the original cataloging records in these products came from the Library of Congress. One possible reason why LC did not take advantage of the commercial opportunity is that such an initiative would have taken it far afield of its primary mission of serving Congress.

13

The Concise AACR2, 1988 revision, by Michael Gorman (Chicago: American Library Association, 1989), p. 161.

14

Library of Congress Rule Interpretations, 2nd ed. (Washington, D.C.: Cataloging Distribution Service, Library of Congress, 1989).

15

Library of Congress Cataloging Policy and Support Office; see <http://lcweb.loc.gov/catdir/cpso/cpso.html#tools>.

16

“Encoded Archival Description,” by D. Pitti, paper presented at Mid-Atlantic Regional Archives Conference, 1997, Wilmington, Del.

17

This notion of machine-readability for finding aids can be compared to the role that MARC plays with AACR2. That is, finding aids have semantic content (in the same sense that AACR2 defines the “meaning” of bibliographic cataloging) and a syntax for encoding that information in computer files and exchanging it between computers (in the same sense that MARC provides a “markup” for AACR2 records).

18

Encoded Archival Description official Web site at <http://lcweb.loc.gov/ead/>.

19

See <http://www.google.com>, <http://www.altavista.com>, and <http://www.excite.com>, respectively.

20

Readings in Information Retrieval, by K. Sparck Jones and P. Willett (Los Angeles, Calif.: Morgan Kaufmann Publishers, 1997).

21

Natural Language Understanding, 2nd ed., by J. Allen (Redwood City, Calif.: Benjamin/Cummings, 1995), p. 654.

22

The model followed by existing Web search engines is known as Web crawling. This involves downloading individual pages, analyzing the content of those pages and indexing it in full-text search engines, and then following the hyperlinks on those pages to determine additional pages to download.

23

“Search Engines Fall Short,” by S. Lawrence and C.G. Giles, in Science, Vol. 285 (1999), No. 5426, p. 295.

24

This low proportion of indexed pages is caused by a number of factors. Among them are the simple scale of the number of Web pages and limitations on accessing all of them in a reasonable time, problems of occasional server unavailability, and policies of certain Web sites (voluntarily followed by Web crawlers) that prevent crawling of that site.

25

Any user of Web search engines quickly becomes aware of the problem of synonym clashes, in which a search term has multiple meanings and results are returned for all these possible meanings, producing false hits that overload the results relevant for the query. For example, a user might want information about the planet Mercury and enter a query with that term. In addition to sites about the planet, the search engines will intermix results about the element Mercury, the car brand Mercury, the Roman god Mercury, and various other meanings or uses of that term.
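The synonym-clash and markup limitations described above can be illustrated with a minimal sketch. It assumes a tiny in-memory collection of hypothetical records; the `Record` class, the sample titles, and the subject strings are invented for illustration and are not actual catalog data or Library of Congress subject headings. Full-text matching on "mercury" returns every sense of the word, while restricting the search to a controlled subject field returns only the intended one.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Record:
    """Hypothetical structured record: free text plus a controlled subject field."""
    title: str
    text: str
    subjects: Tuple[str, ...] = ()


# Illustrative sample data only; the subject strings are made up for this sketch.
RECORDS = [
    Record("Mercury: the innermost planet",
           "Mercury orbits the Sun every 88 days.",
           ("Mercury (Planet)",)),
    Record("Handling mercury safely",
           "Mercury is a liquid metal at room temperature.",
           ("Mercury (Element)",)),
    Record("Mercury in Roman religion",
           "Mercury was the messenger of the gods.",
           ("Mercury (Roman deity)",)),
]


def full_text_search(records: List[Record], term: str) -> List[Record]:
    """Match the term anywhere in title or text, as a full-text index would."""
    needle = term.lower()
    return [r for r in records
            if needle in r.title.lower() or needle in r.text.lower()]


def subject_search(records: List[Record], heading: str) -> List[Record]:
    """Match only against the controlled subject field."""
    return [r for r in records if heading in r.subjects]


if __name__ == "__main__":
    print(len(full_text_search(RECORDS, "mercury")), "full-text hits")
    for hit in subject_search(RECORDS, "Mercury (Planet)"):
        print("subject hit:", hit.title)
```

The design point is only that a labeled field lets the searcher narrow the query to one meaning; a real catalog would draw its headings from a maintained authority file rather than from free strings.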
The creation of structured descriptive records for resources (e.g., traditional cataloging records or, more generally, surrogates) can help to address some of these limitations.27 Scalability can improve if surrogates are used instead of the full content for indexing. Content providers may be more willing to distribute freely descriptive surrogates for indexing, in lieu of the full content. Surrogates can be created, and standards are being developed for describing all manner of digital objects. Finally, surrogate records may include descriptive information that is not part of the document itself (usually the result of human analysis). For example, surrogates to facilitate searching MEDLINE by high-school students might associate more common medical subject terms with the resources, thus making searching easier for this community.28

On the other hand, there is broad agreement among the committee members and in the general information community that the nature of resource description needs a thorough examination in the context of digital resources. Traditional cataloging is one kind of resource description, which, in turn, is one kind of “metadata”29 (information that describes the structure or content of a document but is not part of the document). The nature and use of metadata are evolving to accommodate the great variety of digital objects.30 The following sections examine some of the new challenges presented by the Internet and the Web, to help explain the expanding role of metadata and the requirements for expressing and delivering it.

Scale

Traditional library cataloging has scaled up to serve institutions of great size, prominent among them the Library of Congress. However, over the past year the number of resources on the Web has grown to the point that they exceed the number of books in even the largest of libraries and even the number of book pages in the average library.31 The growth rate of these networked resources substantially exceeds the growth rate of traditional physical resources. Sheer size presents considerable challenges to the economics of traditional library cataloging, in which metadata records are characterized by great precision, detail, and professional intervention. The high price of traditional library cataloging makes it impractical in the context of such growth, and less expensive alternatives are needed for many, if not all, of these resources.

There has always been a trade-off between the cost of creating metadata and its value in facilitating access to document collections. Large print collections generally cannot be accessed without some sort of metadata, so the value of the metadata is high. The appearance of digital objects in commercially interesting collections during the 1970s changed the economic model. Then, access could be provided with relatively little investment in metadata, although higher-quality metadata could still be justified for high-value materials. Now, the steady increase in the volume of electronic materials has increased pressure to reduce the cost of metadata, although manually produced metadata are still common.

Permanence

The lifespan of networked resources differs dramatically from that of physical resources. The well-known problem of “dangling URLs”32 bedevils any librarian who is trying to incorporate Web pages into a collection. The impermanence of networked resources is rooted in the economics of networked dissemination. The cost of distributing networked content is low compared with the cost of printing and distributing hardcopy content, so there is little benefit to the publisher of maintaining older versions of a document.33 With no incentive to retain older versions, the management of objects is haphazard and object permanence is problematic. Such an environment has a strong impact on the economics and incentives for producing metadata and also points out the critical need for preservation-oriented metadata and mechanisms to manage the preservation of digital objects.

26

Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, by W.B. Croft, ed. (Boston: Kluwer Academic Publishers, 2000).

27

Surrogates that augment the full text of documents are common. Abstract and indexing services routinely add terms to make documents more accessible to a given audience. It is not uncommon for a journal article to be indexed by two or three secondary indexing services, each of which characterizes the article in terms that will be meaningful within its subject domain.

28

Abstracting and indexing services index material for a specific community, with its terms and its knowledge base. It is not unusual to have a journal indexed by two or three or even more secondary services, with each putting the spin of its own subject expertise and the needs of its audience into the indexing and classification of the journal’s articles.

29

“Summary Review of the Working Group on Metadata,” by T. Baker and Clifford A. Lynch, in A Research Agenda for Digital Libraries: Summary Report of the Series of Joint NSF-EU Working Groups on Future Directions for Digital Libraries Research, P. Schauble and A.F. Smeaton, eds. (Paris: European Research Consortium for Informatics and Mathematics, 1998).

30

See, among others, <http://www.nla.gov.au/meta>, <http://www.ifla.org/II/metadata.htm>, <http://domino.wileynpt.com/NPT_Pilot/Metadata/mici.nsf>, <http://www.ukoln.ac.uk/meta-data/>, and <http://www.w3.org/Metadata/>.

31

Estimates of the size of the Web vary, but a reasonable estimate is about 2 billion publicly available Web pages as of mid-2000. See <http://censorware.org/web_size/> and the articles referenced there.
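One way to picture preservation-oriented metadata for the dangling-URL problem is a record that stores a content fingerprint next to each link, so that both failure modes (a resource that has disappeared and a resource that has silently changed) can be detected later. The sketch below is only an illustration under stated assumptions: the function names, the status strings, and the `fake_fetch` stand-in for a real network request are all hypothetical, and the example URLs are placeholders.

```python
import hashlib
from typing import Callable


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest to store alongside the URL in the metadata record."""
    return hashlib.sha256(content).hexdigest()


def check_link(url: str, recorded_digest: str,
               fetch: Callable[[str], bytes]) -> str:
    """Classify a link as 'ok', 'changed', or 'missing'.

    `fetch` is an injected function (an assumption of this sketch) that
    returns the resource bytes or raises LookupError when nothing is found.
    """
    try:
        content = fetch(url)
    except LookupError:
        return "missing"
    if fingerprint(content) != recorded_digest:
        return "changed"
    return "ok"


# A stand-in for the network: a dictionary from placeholder URLs to content.
FAKE_WEB = {
    "http://example.org/a": b"original text",
    "http://example.org/b": b"revised text",
}


def fake_fetch(url: str) -> bytes:
    # A missing key raises KeyError, which is a subclass of LookupError.
    return FAKE_WEB[url]


if __name__ == "__main__":
    digest = fingerprint(b"original text")
    for link in ("http://example.org/a", "http://example.org/b", "http://example.org/c"):
        print(link, check_link(link, digest, fake_fetch))
```

Injecting the fetch function keeps the sketch runnable without network access; a production checker would also need policies for redirects, timeouts, and pages whose content legitimately changes.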
Credibility The breakdown of traditional publishing roles has disrupted some of the traditional mechanisms for establishing the credibility of an informa- 32   This is the problem familiar to many a Web user whereby a link to a resource through a URL returns an error indicating that the resource has not been found or, even more insidious, that the returned resource has changed, changing the semantics of the link in ways not obvious to the user. 33   However, in general it is unclear whether the overall cost of networked electronic publishing is significantly lower than the costs of publishing in traditional media.

OCR for page 122
LC 21: A Digital Strategy for the Library of Congress tion source. In the traditional print model, the pedigree of a document derived in part from the credibility of the publisher (e.g., publisher X publishes good computer books) and the credibility of the information intermediary—library or bookstore—that made the material available. Because the information intermediaries are being circumvented, it has become increasingly difficult to assess the integrity of information resources. A further problem is that information can be viewed in ways that were never intended by an author or publisher. An excerpt from an otherwise credible source may be misleading when viewed out of context. Since metadata is itself an information resource, the credibility issue applies to the quality of metadata created by external sources (outside the traditional library cataloging community). The creation of bad metadata can be nonmalicious: for example, an author who lacks training or who doesn’t care may assign a bad subject classification to a descriptive metadata record. It can also be malicious: so-called “index spamming,” whereby content creators seed metadata fields with misleading or incorrect information to affect the ranking of their pages by search engines, is a real problem on the Web.34 An important challenge for networked information is developing the mechanisms and policies to verify the origin of any information, including metadata. Variety The Library of Congress, like all libraries, deals with a considerable variety of resources, including books, serials, maps, software, movies, images, and, now, digital resources. The Library’s efforts to create metadata for this spectrum of resources can be divided into two categories. First, much attention has been paid to enhancing the traditional cataloging mechanisms—AACR2 and MARC, for example—to accommodate these new genres. 
These efforts are motivated by the central role that the traditional cataloging formats play in the Integrated Library System and in the cooperative cataloging efforts described above. Notable among these efforts in the context of a discussion of digital resources is the creation of a new field to handle links to electronic resources35 and a set of draft interim guidelines for cataloging electronic resources36 published and disseminated by the Library’s Cataloging Policy and Support Office.

34   Index spamming is of considerable concern to search engine providers in that it interferes with the usefulness of their search engines. For a look at what one search engine provider, AltaVista, says about index spamming, see <http://www.altavista.com/av/content/spamindex.htm>.

35   The 856 field in MARC (see <http://lcweb.loc.gov/marc/bibliographic/ecbdhold.html#mrcb856>) was created in the early 1990s to contain information needed to locate and access an electronic resource. Typical information stored in the 856 field includes a URL and the relationship between the item referred to by the URL and the resource described by the MARC record itself.

Second, the Library has employed a number of other vehicles for resource description, namely metadata schemata tailored for individual resource characteristics. Some schemata have been coordinated with external communities; others have been developed internally. The use of other metadata vocabularies raises the issue of how these vocabularies interact to provide integrated information spaces for users of digital libraries. After all, one of the major strengths of the standardization on AACR2 and its expression in MARC records has been the uniform search interface provided by library catalogs to large and heterogeneous collections. The subject of multiple metadata vocabularies is the focus of further attention below in this chapter.

METADATA AS A CROSS-COMMUNITY ACTIVITY

As mentioned above, the field of metadata has exploded into a major area of investigation and development over the past several years. As information becomes more of a commodity item (and, as many would argue, is the largest product of the “new economic paradigm”37), its management is of interest to a broad spectrum of organizations. This stands in rather strong contrast to the situation in the pre-Internet era, when the standardized management of information was more or less restricted to libraries, with the Library of Congress playing a key leadership role. This broadening of the metadata environment will include many new players and applications and require the Library to think in new ways if it is to reassert its leadership in this area.

Descriptive cataloging, exemplified by the traditional library cataloging record, is but one of many classes of metadata. Real-world applications need to make use of a much broader range of metadata than descriptive cataloging.
Some other metadata types are listed below to provide a sense of this range. The list is not in any way comprehensive, nor are all of these types of data appropriately within the scope of the Library of Congress.38

36   See <http://lcweb.loc.gov/catdir/cpso/elec_res.html>.

37   Information Rules: A Strategic Guide to the Network Economy, by Carl Shapiro and Hal R. Varian (Boston: Harvard Business School Press, 1999), p. 352.

38   “Metadata: Foundation for Image Management and Use,” by Carl Lagoze and S. Payette, in Moving Theory into Practice: Digital Imaging for Libraries and Archives, A.R. Kenney and O.Y. Rieger, eds. (Mountain View, Calif.: Research Libraries Group, 2000), p. 250. Also see The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata, by Carl Lagoze, Clifford Lynch, and Ron Daniel, Jr., Technical Report TR96-1593 (Ithaca, N.Y.: Cornell University Computer Science Department, 1996).

Terms and conditions: metadata that describe the rules for use of an object. Terms and conditions might include an access list of who may view the object, a conditions-of-use statement that might be displayed before access to the object is allowed, a schedule (tariff) of prices and fees for use of the object, or a definition of the permitted uses of an object (viewing, printing, copying, etc.).

Administrative data: metadata that relate to the management of an object in a particular server or repository. Some examples of information stored in administrative data are the date of last modification, the date of creation, and the administrator’s identity.

Content rating: a description of attributes of an object within a multidimensional, scaled rating scheme assigned by some rating authority; an example might be the suitability of the content for various audiences, similar to the well-known movie rating system used by the Motion Picture Association of America. Note that content ratings have applications far beyond simple filtering on sex and violence levels. Content ratings are likely to play important roles in future collaborative filtering systems, for example.

Provenance: data defining the source or origin of some content object, for example, of some physical artifact from which the content was scanned. The data might also include a summary of all algorithmic transformations that have been applied to the object (filtering, reductions in image density, etc.) since its creation. Arguably, provenance information might also include evidence of authenticity and integrity through the use of digital signature schemes; or, authenticity and integrity information might be considered a separate class of metadata.

Linkage or relationship data: data indicating the often complex relationships between content objects and other objects.
Some examples are the relationship of a set of journal articles to the containing journal, the relationship of a translation to the work in its original language, the relationship of a subsequent edition to the original work, or the relationships among the components of a multimedia work (information on synchronization between images and a soundtrack, for example).

Structural data: data defining the logical components of complex or compound objects and how to access those components. A simple example is a table of contents for a textual document. More complex examples include the definition of the different source files, subroutines, and data definitions in a software suite; SGML or XML tagged books; or other complex works.

The need for additional metadata types and for traditional metadata for a larger volume of materials challenges traditional means of metadata creation: manual techniques cannot be scaled up to meet demand. But automated techniques are available that can help in the production of metadata (natural language processing, automatic classification, document clustering). These techniques have yet to be integrated with more traditional manual techniques; such integration will be a huge task for the library community.

The following sections summarize a number of the current efforts that address these metadata requirements, the goal being to show the breadth of communities involved in this endeavor. While the Library is involved in a number of these efforts, it by no means plays the same prominent role it plays in traditional cataloging.

Dublin Core Metadata Initiative

The Dublin Core Metadata Initiative (DCMI),39 begun in 1994, hoped to create a simplified metadata convention that would provide more effective resource discovery on the Web. Over the past 5 years, it has developed and refined a set of 15 elements, the Dublin Core Element Set (DCES), for resource description to facilitate discovery. The DCMI has broad international participation from librarians, digital library specialists, the museum community, and other information specialists. Its advocates claim that the DCES has distinct advantages over traditional cataloging methods in terms of simplicity, interoperability, and extensibility. DCMI, which is hosted by OCLC, represents a concerted effort by OCLC to extend the leadership role it has played with physical resources into the world of digital resources. The DCES plays an important role in one of OCLC’s latest projects, the Cooperative Online Resource Cataloging (CORC) project,40 which is examining the use of new Web-based tools and techniques for cataloging electronic resources.
One important aspect of CORC is that it examines the mechanics and economics of different levels of representation, whereby resources can be described simply using the DCES and, when appropriate and economically feasible, described more completely using traditional cataloging techniques (MARC).

The Library of Congress does participate in the DCMI: a member of its Cataloging Directorate has long played a role. Not only did the Library host the sixth Dublin Core meeting in November 1998, but its Network Development and MARC Standards Office has also developed “crosswalks” (translations) between the DCES and MARC.41

39   See <http://purl.org/DC>.

40   See <http://www.oclc.org/news/oclc/corc/about/about.htm>.

41   See <http://lcweb.loc.gov/marc/dccross.html>.

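
As a rough illustration of what a crosswalk between the DCES and MARC involves, the following Python sketch translates a few Dublin Core elements into MARC tag and subfield pairs. The mapping table is an illustrative subset assumed for this example, and the names DC_TO_MARC and crosswalk are hypothetical; the crosswalk published by the Library remains the authoritative source.

```python
# Illustrative subset of a Dublin Core -> MARC crosswalk.
# The (tag, subfield) targets are assumptions made for this sketch;
# consult the published crosswalk for the authoritative mapping.
DC_TO_MARC = {
    "title": ("245", "a"),
    "description": ("520", "a"),
    "publisher": ("260", "b"),
    "date": ("260", "c"),
    "subject": ("653", "a"),
    "rights": ("540", "a"),
}


def crosswalk(dc_record):
    """Translate a Dublin Core record (element -> list of values) into
    (MARC tag, subfield code, value) triples.

    Elements without a mapping are returned separately instead of being
    silently dropped, so a cataloger can deal with them by hand.
    """
    fields, unmapped = [], {}
    for element, values in dc_record.items():
        target = DC_TO_MARC.get(element.lower())
        if target is None:
            unmapped[element] = values
            continue
        tag, code = target
        fields.extend((tag, code, value) for value in values)
    return fields, unmapped


record = {
    "Title": ["LC 21"],
    "Subject": ["Metadata", "Cataloging"],
    "Coverage": ["United States"],  # not in the illustrative table
}
fields, unmapped = crosswalk(record)
print(fields)
print(unmapped)
```

Returning the unmapped elements alongside the translated fields reflects the point made above about levels of representation: a simple description can be carried across automatically, while anything the table does not cover is left for fuller cataloging.
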
Geospatial Metadata Standards

The geospatial community’s interest in metadata stems from the difficulty of managing and locating the burgeoning amount of data being produced by geographic information systems (GISs),42 remote sensing initiatives (e.g., satellites), and the defense and intelligence community. The U.S. Federal Geographic Data Committee43 has been working for several years to create a complete and complex metadata format for describing geospatial entities, the Content Standard for Digital Geospatial Metadata.44 The Open GIS Consortium, which comprises technology companies, universities, and government agencies, manages consensus processes with the goal of achieving interoperability among diverse geo-processing systems.45

Content Rating

The motivation for content-rating metadata grows out of the proliferation of adult material on the Internet and the desire for filtering mechanisms to keep certain individuals (e.g., children) from accessing certain content (e.g., pornography).46 Faced with this challenge, the World Wide Web Consortium (W3C)47 developed a description standard, Platform for Internet Content Selection (PICS).48 The PICS standard enables content providers to label (voluntarily) the content they create and distribute and “enables multiple, independent labeling services to associate additional labels with content created and distributed by others.”49

E-commerce and Rights Management

The burgeoning of content available over the Internet has stimulated interest in the business community in mechanisms for managing, controlling, and receiving remuneration for providing access to digital intellectual property. So-called “rights management” is among the most active topics in digital library research and Internet business development. One initiative, <indecs> (Interoperability of Data in E-commerce Systems),50 is an international effort by rights owners to formulate metadata standards to govern the exchange of digital intellectual content.

42   GISs include a broad class of software systems for storing, representing, and analyzing geospatial data. They are widely employed by governmental agencies such as planning departments, public utility companies, surveying and engineering concerns, and a spectrum of other parties.

43   See <http://www.fgdc.gov>.

44   See <http://www.ifla.org/documents/libraries/cataloging/metadata/meta6894.txt>.

45   See <http://www.opengis.org>.

46   “Filtering Information on the Internet,” by Paul Resnick, in Scientific American, 1997, pp. 106-108, available online at <http://www.sciam.com/0397issue/0397resnick.html>.

47   The W3C <http://www.w3.org> is an international organization that develops and maintains the standards by which the World Wide Web operates. These standards include the protocols for exchange of data over the Web, markup languages for structuring that data (e.g., HTML), and the metadata standards discussed here.

48   See <http://www.w3.org/PICS/>.

49   See <http://www.w3.org/PICS/principles.html>.

Resource Description Framework

The proliferation of standards for metadata has motivated W3C to examine a general infrastructure for associating multiple metadata records with Web resources and packaging those records for exchange (as MARC does for AACR2 descriptions). The result is the resource description framework (RDF),51 a major initiative by W3C to facilitate descriptions of Web resources. The intellectual underpinnings of RDF lie in a variety of knowledge representation efforts.52 The RDF is based on two key assumptions: (1) the diverse metadata needs of networked objects argue for a modular, not a monolithic, metadata solution and (2) the different metadata modules should be created and managed by individual communities of experts (e.g., let the librarians construct the bibliographic descriptions). RDF is an area of intense development within the Web community, and the associated tools and standards promise to enhance significantly the functionality of the Web over the next several years.

INTEROPERABILITY OF METADATA STANDARDS

Cataloging has evolved from primarily a library practice into an activity engaged in across the Internet economy. As described above, the Library itself, faced with a proliferation of resource types and management needs, has undertaken and participated in a variety of metadata initiatives outside traditional cataloging. This proliferation of metadata types and standards raises a pressing need for intensive work, both technical and organizational, on issues of metadata interoperability.

The problem of interoperability is not new.
Specialized metadata standards have been developed in many domains (e.g., law, medicine, chemistry) to help normalize the description of knowledge in each. Interoperability problems are exacerbated by the nature of the “Internet Commons,”53 where multiple communities intermix and interact in nontraditional ways. By eliminating physical barriers, the Internet encourages interactions among formerly separate communities. In such interactions, one community may have good reason to use the work of another. For example, an art museum may put a digitized map, which is the product of the geospatial community, on its Web page to provide directions for visitors. These spontaneous interactions underlie the need for common metadata standards to support finding, evaluating, and managing content across traditional boundaries.

50   See <http://www.indecs.org>.

51   See <http://www.w3.org/RDF>.

52   Knowledge Representation: Logical, Philosophical, and Computational Foundations, by J.F. Sowa (Pacific Grove, Calif.: Brooks/Cole, 2000), p. 594.

53   The term “Internet Commons” was coined by Stuart Weibel of OCLC, founder and leader of the Dublin Core Metadata Initiative.

Metadata interoperability has three dimensions:

Semantic interoperability: Every metadata scheme defines its own set of data elements or categories for data. For example, the DCES defines a set of 15 categories for resource discovery. Semantic interoperability is the extent to which different metadata schemes express the same semantics in their categorization. Successful interoperation requires clarity about how the categories of metadata across schemes relate to each other: When do elements have the same meaning? When are elements derivatives, subsets, or variations of each other? When are elements completely unrelated?

Structural interoperability: Each metadata record expresses a set of values for the categories described above. Humans are often quite capable of translating between unstructured values; many recognize that “Bill Gates” and “Gates, William H.” are the same person. On the other hand, U.S. and British citizens are likely to interpret the date 10-1-99 differently. Computers are even worse at interpreting unstructured values and so require strict definitions of structure. Authority files, described above, have been the primary mechanism in the library community for enforcing structural interoperability.

Syntactic interoperability: Creators of metadata want to store, exchange, and use metadata records from different sources.
Such exchange requires common mechanisms for expressing metadata semantics and structure. These common mechanisms include hypertext markup language (HTML), the resource description framework (RDF), and extensible markup language (XML). The MARC record format, described above, has been the primary mechanism in the library community for addressing syntactic interoperability.

The last two areas of metadata interoperability, structural and syntactic, have been and are being addressed in forums such as the W3C and in the various standards agencies, such as the International Organization for Standardization (ISO),54 the American National Standards Institute (ANSI),55 and the (U.S.) National Institute of Standards and Technology (NIST).56 The first area, semantic interoperability, is the subject of current research activity57 and the area in which the knowledge organization and classification expertise of the Library, and the library community in general, has the most to offer.

Briefly, the problem can be characterized as follows. Various metadata vocabularies are by nature frequently not semantically distinct but overlap and relate to each other in numerous ways. Achieving interoperability between these packages by means of one-to-one “crosswalks”58 (tables that show the relationship of elements across various schemes) is useful, but this approach does not scale to the many metadata vocabularies that will inevitably develop, because every pair of vocabularies needs its own crosswalk. A more scalable solution is to exploit the fact that many entities and relationships (for example, people, places, creations, organizations, events, certain relationships, and the like) are so frequently encountered that they do not fall clearly into the domain of any particular metadata vocabulary but apply across all of them. Mechanisms for semantic interoperability could exploit this fact by expressing the relationships between the vocabularies of individual metadata sets and the core concepts (e.g., places and events) using notions such as subtypes, supertypes, or siblings.59

Although a number of these concepts in semantic interoperability are still in the early development or even the research stage, they address a critical problem that the Library needs to recognize: metadata vocabularies will proliferate, and methods for interoperability among them must be developed. The committee believes strongly that the Library needs to take an active role in such research areas if it is going to successfully deal with digital materials and information on the Internet. Furthermore, the Library most assuredly has much to add to these research areas. The Library, by virtue of its years of cataloging experience, has a unique understanding of the nature of the entities and relationships that lie at the root of these interoperability mechanisms.

54   See <http://www.iso.ch/>.

55   See <http://www.ansi.org>.

56   See <http://www.nist.gov/>.

57   “A Common Model to Support Interoperable Metadata: Progress Report on Reconciling Metadata Requirements from the Dublin Core and INDECS/DOI Communities,” by David Bearman et al., in D-Lib Magazine, January 1999. Also see ABC: A Logical Model for Metadata Interoperability, by D. Brickley, J. Hunter, and Carl Lagoze, 1999, available online at <http://www.ilrt.bris.ac.uk/discovery/harmony/docs/abc/abc_draft.html>.

58   Dublin Core/MARC/GILS Crosswalk (Washington, D.C.: Network Development and MARC Standards Office, Library of Congress, 1997).

59   This is similar to techniques developed for natural language processing and computational linguistics research. For example, see WordNet—A Lexical Database for English, at <http://www.cogsci.princeton.edu/~wn/> or WordNet: An Electronic Lexical Database, Christiane Fellbaum, ed. (Cambridge, Mass.: MIT Press, 1998), p. 423. This is also the main focus of XML.ORG.

NEW CATALOGING MODELS

Establishing and expressing the relationships among information resources is one of the most difficult aspects of cataloging. These relationships are multidimensional, bidirectional, and many-to-many. Examples of such relationships are translations, versions, editions, transcriptions, and structures (e.g., the issue of hierarchy and the relationship of articles to serials). In many ways the relationships between information resources are of primary importance in the management, discovery, and accessibility of these resources. Yet, the basic model of the MARC record, whereby cataloging metadata is packaged into discrete records associated with individual information artifacts (that is, MARC records are resource-centric), makes it unwieldy to express the richness and complexity of these relationships.

The importance of relationships among entities has been recognized for a long time in the database and knowledge representation communities.60 Entity-relationship modeling and E-R diagrams61 are tools for modeling information in terms of entity types (e.g., books, serials, agents, events), the permissible relationships between these entity types (e.g., authoring, creating, publishing), and the constraints on these relationships (e.g., an article must be “in” a journal). Recent work in the international cataloging community recognizes the importance of relationships for formulating accurate descriptions of the resources commonly dealt with by libraries.
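
The entity-relationship idea described above can be made concrete with a small Python sketch. It is a minimal illustration under stated assumptions, not a cataloging model: the entity kinds, the relationship names in ALLOWED, and the Entity and Catalog classes are all hypothetical and are chosen only to mirror the examples in the text (articles, journals, agents, and the constraint that an article must be “in” a journal).

```python
from dataclasses import dataclass, field

# Hypothetical permissible relationships:
# relationship name -> (kind of subject entity, kind of object entity)
ALLOWED = {
    "authored_by": ("article", "agent"),
    "published_by": ("journal", "agent"),
    "in": ("article", "journal"),
}


@dataclass(frozen=True)
class Entity:
    kind: str
    name: str


@dataclass
class Catalog:
    # Each link is a (subject, relationship, object) triple.
    links: list = field(default_factory=list)

    def relate(self, subject, relationship, obj):
        """Record a relationship only if the entity types permit it."""
        if ALLOWED.get(relationship) != (subject.kind, obj.kind):
            raise ValueError(
                f"{subject.kind} -{relationship}-> {obj.kind} is not permitted"
            )
        self.links.append((subject, relationship, obj))

    def violations(self):
        """Names of known articles that are not yet 'in' any journal."""
        articles = {s for s, _, _ in self.links if s.kind == "article"}
        placed = {s for s, rel, _ in self.links if rel == "in"}
        return sorted(a.name for a in articles - placed)


catalog = Catalog()
article = Entity("article", "Article A")
journal = Entity("journal", "Journal J")
author = Entity("agent", "Author X")

catalog.relate(article, "authored_by", author)
before = catalog.violations()   # the article is not yet placed in a journal
catalog.relate(article, "in", journal)
after = catalog.violations()    # the constraint is now satisfied
print(before, after)
```

Here the relationships are stored as first-class triples rather than as fields inside a record for one artifact, which is the contrast with the resource-centric MARC model drawn above.
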
The Functional Requirements for Bibliographic Records (FRBR),62 defined by an International Federation of Library Associations63 task force and contributed to by the Library of Congress, provides a provocative starting point for discussions on how bibliographic description might more accurately reflect resource relationships. The FRBR presents a model that distinguishes among intellectual ideas (“works”), the forms in which they are realized (“manifestation,” for example, a novel), and their availability as physical objects (e.g., as an individual book or an individual performance of a play). These stages in the evolution of intellectual content are connected by distinguishing relationships that affect the description or metadata for the objects. The FRBR is just a start for rethinking the theory behind bibliographic description, but it provides an interesting foundation for the treatment of digital objects whose plasticity will require that considerably more attention be given to issues of relationships than has been required for physical artifacts. (This plasticity poses a serious challenge to AACR2, which is based on describing a specific manifestation of a work, because manifestations can be fleeting in the digital world.)

60   For example, see Knowledge Representation: Logical, Philosophical, and Computational Foundations, by John F. Sowa (Pacific Grove, Calif.: Brooks/Cole, 1999).

61   For example, see The Entity-Relationship Approach to Logical Database Design, by P.P.S. Chen (Wellesley, Mass.: QED Information Sciences, 1991), p. 83, and Entity-Relationship Approach: The Use of ER Concept in Knowledge Representation, by P.P.S. Chen (Washington, D.C.: IEEE Computer Society Press, 1985), p. 327.

62   See Functional Requirements for Bibliographic Records (Munich: K.G. Saur for International Federation of Library Associations and Institutions, 1998).

63   See <http://www.ifla.org>.

The importance of relationships in resource description of digital objects is also reflected in the metadata activities on the World Wide Web. The Web has demonstrated, even in its current primitive form, that the linkages (hyperlinks) between information objects are an integral part of their content. RDF incorporates a data model in which relationships, or “properties,” play a key role in the description of digital resources.

The committee understands that it will be a tremendous challenge to change the base model for metadata (e.g., from resource-centric to relationship-centric) in a world of widespread data exchanges (the MARC records that are the basis of cooperative cataloging) and reliance on turnkey software (commercial integrated library systems that are based on MARC). However, it is certain that library-type metadata practices will at some point need to be reexamined in the light of a changed world.
It is certainly valid to ask when the time will come that there is sufficient understanding of this changed world to undertake such a process. It is not productive to ignore the fact that changes are inevitable and will be dramatic.

SUMMARY

The creation and utilization of metadata are fundamental activities of the Library of Congress. The Library dedicates enormous resources to them, not just in the Cataloging Directorate but also across the curatorial departments, in the NDLP, and in the Copyright Office. High-quality, well-organized tools for finding a collection’s resources are fundamental to the functioning of a modern library.

The discussions above describe the traditional strength and leadership of the Library of Congress in the cataloging and metadata arena. However, the committee hopes that the discussions also demonstrate the volatility in this environment today, with the number of metadata initiatives and players growing rapidly. While no one today can say with any certainty the precise directions in which the field will evolve, any knowledgeable observer can with confidence say that it will evolve dramatically in the coming years.

The committee has been enormously uneasy about the Library’s metadata involvement but has tried to state crisply its concerns and recommendations. The concerns are related to the committee’s belief that the current turmoil in metadata has profound implications for libraries and to its understanding of how enormously difficult it is going to be for the library world to evolve in response to these changes. If there is any one institution that could be expected to be pondering these issues and involved in trying to shape the environment as it evolves, it would be the Library of Congress. Both its scale of operations and its traditional leading role in the area create this expectation. However, the committee was unable to detect any sign that the Library considered this to be a strategic set of issues or that it had mounted any substantial initiative to analyze and plan for the changes that are coming.

As mentioned above, the committee is aware of the interconnectedness of the library world in terms of both metadata standards and data itself. The existing environment does not lend itself easily to any single institution devising its own strategies for coping with change in metadata practice. There is an enormous need for educating the library community on the implications of current developments, for creating initiatives to coordinate strategies, and for becoming seriously involved in the key metadata initiatives under way today, to ensure that actions taken are informed by the needs of the library community.
This seems an obvious role for the Library, a role largely unfilled insofar as the committee can tell (although the committee applauds the Library’s effort to organize a conference in the fall of 2000 on cataloging policy in the digital age).64

It should be made clear that it is not the committee’s finding that the Library lacks any knowledge of or involvement with metadata developments. As mentioned above, the Library has played a role in a number of areas, and the committee found some very knowledgeable staff at the Library who have a sophisticated understanding of current developments. The concern centers, rather, on the level of institutional involvement with the issues. The committee believes that these developments are of overriding importance for the Library and that they require the Library to become much more active in analyzing and planning for change across the library community and in influencing the evolution of the various metadata initiatives now under way.

64   See “Library of Congress Hosts Conference on Cataloging Policy in the Digital Age November 15-17,” News release, February 22, 2000.

Finding: The Library of Congress is heavily involved in the creation and use of metadata and has long been a leader in the establishment of standards and practices. However, the metadata environment is evolving rapidly. This will have profound implications for libraries and other information providers generally and for the Library of Congress in particular. It is a responsibility of the Library, and indeed of the nation, to offer leadership here for the benefit of the national and worldwide communities of information providers and users.

Recommendation: The Library should treat the development of a richer but more complex metadata environment as a strategic issue, increasing dramatically its level of involvement and planning in this area, and it should be much more actively involved with the library and information community in advancing the evolution of metadata practices. This effort will require the dedication of resources, direct involvement by the Librarian in setting and adjusting expectations, and the strong commitment of a project leader assigned from the Executive Committee of the Library.

Recommendation: The Library should actively encourage and participate in efforts to develop tools for automatically creating metadata. These tools should be integrated in the cataloging work flow.
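
To give a sense of the simplest kind of tool the final recommendation has in mind, the Python sketch below proposes candidate subject keywords for a text by counting word frequencies. It is only an assumption-laden illustration: the stopword list, the function name suggest_keywords, and the sample abstract are all hypothetical, and a production tool would rely on the natural language processing and classification techniques mentioned earlier in this chapter rather than on raw counts.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real tool would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def suggest_keywords(text, limit=3):
    """Return the most frequent non-stopword terms as candidate subject
    keywords for a cataloger to accept, edit, or reject."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(limit)]


# Made-up sample text, used only to exercise the function.
abstract = (
    "Metadata standards for digital libraries: cataloging digital "
    "resources with metadata and linking metadata to library catalogs."
)
suggestions = suggest_keywords(abstract, limit=2)
print(suggestions)
```

The function returns suggestions rather than writing a record directly, so that automated output enters the cataloging work flow as a draft for human review.
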