
25

Emerging Models for Maintaining Scientific Data in the Public Domain

Harlan Onsrud

After sitting through the presentations yesterday, I went back to my room and wrote a new talk. And after hearing further presentations today I think we need a reality check, or perhaps more accurately a view from the trenches. As we all know, the price that private scholarly publishers set to maximize profits is not the same price that one would set to maximize the scientific use or distribution of the scholarly literature or databases. If a publisher can double the price for access to its journals or databases and lose fewer than half of its subscribers, its overall profits will increase. So this is, of course, exactly what publishers have been doing.

To get my biases on the table, I teach at one of the universities that has been marginalized by this process. In this new information age, at my university, our ability to access the scholarly literature is actually decreasing each year as our library is forced to subscribe to fewer and fewer electronic journals. In addition, unlike with paper journals, where we at least kept a copy on the shelf, what we had access to last year is now simply gone.

So how should we as scientists react to this situation? One of my first reactions was to tell our dean of libraries to simply cancel wholesale all the publications of those publishers that are the worst offenders, to take a political stance. She says that it is all right for the Massachusetts Institute of Technology (MIT) to do so, but if she does it she will look like a crackpot. She has professors literally begging her to continue to subscribe to journals whose annual subscription prices are now approaching the cost of cars. Of course, no matter how high the price goes, the MITs and Caltechs will be in that upper portion of the academic market best able to afford access to such journals. It is those of us in the lower third of research university wealth who are particularly hard-pressed.

Should we simply abandon scholarly work and research in our poorer states? That is, should we allow the marketplace to determine which universities will have access to the core research literature and which will not? Look at a state like Maine: we have one of the highest high school graduation rates in the nation, and those graduates rank very high in the National Assessment of Educational Progress in science. Should we admit that these highly qualified students should not have the opportunity to contribute to the advancement of science at the university level? With a statewide average per capita income well below the national average, few of our Maine high school students can choose to attend universities at out-of-state or Ivy League tuition rates.

So if you believe in the democracy of education, what can we as individual scientists and professors do? I argue to my peers that support of the public domain begins at home. I webcast my graduate course class sessions openly on the Web. I publish my syllabi, class notes, course materials, and most of my peer-reviewed journal articles openly on the Web. If we can do it at poor universities, why is it that our leading universities do not already have hundreds of their courses openly webcast and openly archived? It is not a big technical burden. Just do it.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement





The faculty senate at my university proposed, and the university system administration adopted, a new formal intellectual property policy with a strong presumption of ownership by professors in the copyrightable teaching, scholarly, and multimedia works they produce. Furthermore, all professors are highly encouraged by this policy to make their works available through open-access licenses or to place them in the public domain. Thus, at my university, it is now clear that faculty members do have the power to place works into the public commons.

Let us assume that the Creative Commons project proves to be a great success and hundreds of scientists start attaching open-access licenses to their articles and datasets before submitting them for peer review. I already attach such licenses to my submitted journal articles, and those articles are summarily rejected by most publishers without even being subjected to the peer review process.[1] I receive letters from corporate attorneys telling me why they cannot possibly accept the open commons license. Will increased submissions by other scientists help place pressure on the publishers? Possibly, but my guess is that most scientists will simply buckle.

Therefore, perhaps my university should pass a formal policy stating that any peer-reviewed publication that may not be placed in an openly accessible archive six months after publication shall be treated as a low-grade publication, comparable to a nonrefereed publication. After all, such publications are inherently of less value to the global scientific community because of their limited accessibility. Thus, on any application or nomination for promotion, tenure, or honors, such a publication could not be listed as a qualifying prestigious publication; it does not qualify as being in the peer-reviewed, open-archival literature. If we impose this requirement at my university alone, we will indeed probably look like a bunch of crackpots.
If, however, the Caltechs, MITs, Harvards, Columbias, Dukes, and other elite universities represented in this room start pursuing this approach, it very well may have an effect. However, those are the same elite universities that actually benefit from the current research and publishing paradigms. Administrators at these first-tier universities have very little incentive to make waves. Ten years ago the Office of Technology Assessment reported that 50 percent of all federal research funding went to 33 universities, and my impression is that those numbers have not changed much.[2] These are the same universities that receive the major share of corporate funding and are the primary beneficiaries of Bayh-Dole. It would take very enlightened administrators believing in a broader sense of scientific peerage and long-term preservation of science to actually risk altering an incentive system of which their own faculty are the primary beneficiaries.[3] Will it happen? Who knows? Hope springs eternal.

Perhaps in reporting the progress of past work when submitting research proposals to the National Science Foundation and the National Institutes of Health, scientists should be asked to list only those published articles that are available in openly accessible electronic archives. Right now, when scientists at universities in the medium to lower tiers of wealth are asked to peer review research proposals, we have no way of accessing much of the referenced work because our libraries do not have access to those works. This degrades the peer review process and the overall quality of science. In my department, we serve as editors for, and regularly publish in, journals that our library says it cannot afford to subscribe to. The National Science Foundation and the National Institutes of Health have the power to fix this situation. However, it is not even on their radar screens. Why not?
Because it is not a high priority for the top 50 research universities, which will never be cut out of the literature access loop by marketplace dynamics. The current situation, however, is perpetuating a caste system in the ability to do high-quality research in our nation. We can talk all we want about economic and legal theories. However, to arrive at a sustainable intellectual commons in scientific and technical information, we will need to provide incentive systems whereby it becomes very obvious to individual researchers that they will be far better off in terms of prestige and other rewards if they publish in forums where their works will be openly and legally accessible in long-term archives. Again, I advise the funding agencies to just do it.

Now, please do not misunderstand my comments. There are actually some great benefits in pursuing research in the hinterlands of science. I work with a small group of scientists who have managed on average a couple million dollars of research funding over the past few years. All of these professors regularly receive offers to move elsewhere. But, as Paul David said yesterday,[4] to do truly innovative work on the fringes of established research fields you are sometimes far better off to actually break away from those fields. That perfectly describes the researchers I work with. They are doing very high-quality research. Access to scientific data and information is just as critical to our faculty and students as it is to those at the top research universities, but we currently work under a very different access environment.

Go ahead and point at the publishers but, as Pogo said, "We have met the enemy and he is us." We hold much of the solution to our own access problems in our own hands. The solution rests in the hands of scientists, funding agencies, and university administrators. Our goal should be to provide all university students with the same access to scientific and technical data and literature that the leading research universities have. We would all be far better off.

I did not come here to talk about any of what I have just talked about, however. I came to talk about some success stories, some emerging approaches for the widespread sharing of data and information that are actually working and hold out great promise for scientists globally.

[1] Self-archiving of an electronic prerefereed version can help circumvent some legal issues. See, for example, the frequently asked questions at http://www.eprints.org. However, when applied generally, this approach currently results in cumbersome metadata and corrigenda maintenance issues.
[2] Fuller, Steve. 2002. "The Road Not Taken: Revisiting the Original New Deal," Chap. 16 in Mirowski and Sent, eds., Science Bought and Sold, University of Chicago Press, Chicago, IL, p. 447, referring to Office of Technology Assessment, 1991. Federally Funded Research: Decisions for a Decade, U.S. Government Printing Office, Washington, D.C., pp. 263-265.
[3] We obviously have some enlightened administrators. For example, Caltech has been a leader in research self-archiving (http://caltechcstr.library.caltech.edu), whereas MIT has been a leader in making teaching materials accessible (http://ocw.mit.edu/index.html). Yet motivations and constraints vary among universities. Thus, partial solutions by one university in addressing academic literature access problems may be impractical or not as useful for emulation by many others.
In particular I wanted to talk about CiteSeer (formerly ResearchIndex), which purportedly provides access to the largest collection of openly accessible full-text scientific literature on the Web. Its legal and technological approaches are very different from those of most other archiving efforts on the Web, and the social dynamics it has created in the scientific community also are very different. This system currently contains over five million citations and a half-million full-text articles. How can it be legal to have a half-million full-text articles openly accessible through this system when no one gave ResearchIndex permission to copy those articles?

CITESEER[5]

Most of you are familiar with at least some of the major specialty collections of full-text journal articles that are freely accessible on the Web. For instance, the NASA Astrophysics Data System has 300,000 full-text articles online; HighWire Press has about the same number, but focused on the biomedicine and life science fields. There are also the arXiv at Los Alamos, PubMed Central, and others. Most of these online archives deal with intellectual property issues on a journal-by-journal negotiation basis or have scientists submit original work directly to their archive.

Scientists and graduate students in my research field typically need to access articles and datasets across a broad range of disciplines, including various branches of engineering, computer science, the social sciences, and even the legal literature. Many of us would prefer the ability to cite across any and all scholarly domains and to link from any citation we find on the Web to the full-text copy of that article on the open Web. One approach that is being used to index and access the computer science literature is to search and crawl the entire Web. CiteSeer does this using an algorithmic approach to find citations that are germane to the computer science literature, and the system then allows you to link directly to any full-text article that is found. It works on a citation-to-citation basis. From my perspective this is far preferable for indexing and accessing literature across and among scholarly domains.

If you go to the CiteSeer Web site[6] today you will find about five million distinct citations within the computer science literature that have been drawn from about 400,000 full-text online articles. The system also has some useful automated tools for sifting the wheat from the chaff: in other words, for getting at the most cited and respected articles within a specific subdomain of interest.

The legal problem with this approach lies in obtaining permission to copy the half-million articles. You need to automatically copy the journal articles to test each article against your profile conditions, to extract and index the citations, and then to host copies of the full-text PDF or PostScript files. The developers have not bothered asking publishers for copyright permissions, and no publisher in the computer science community has yet complained. The system developers appear to have taken the position that (1) they gain at least some legal protection granted to Web crawlers by the Digital Millennium Copyright Act (DMCA); (2) if publishers or authors do not protect their Web sites from Web crawlers, that is their fault; and (3) if you object to the Web crawler copying any of your articles, the system managers will be happy to remove those articles, but please protect your site in the future or the crawler is likely to pick them up again.[7]

Many of the full-text articles that the crawlers have copied from Web sites were of course placed there by the professors and scientists who wrote them. Can one assume that these professors retained the copyrights in their published works? Or should one assume that the scientists transferred all or a portion of their copyrights to the publisher? If authors did transfer their rights to publishers, which is certainly very common in my research field, does that mean that CiteSeer is acting like a Napster for the computer science literature? After all, it is a facility that contributes to the illegal sharing of copyrighted articles among scientists. However, unlike Napster, the original authors or talent are not complaining or losing any money, because scientists typically are not in the business of publishing articles for compensation.

[4] See Chapter 5 of the Proceedings, "The Economic Logic of 'Open Science' and the Balance between Private Property Rights and the Public Domain in Scientific Data and Information: A Primer," by Paul David.
[5] The following description of CiteSeer and the legal foundations of the approach are based on a presentation by the author at the Duke University School of Law Conference on the Public Domain on November 10, 2001. For more information, see http://www.law.duke.edu/pd/realcast.htm.
[6] See http://citeseer.com for more information.
Furthermore, unlike Napster, many of the lead publishers in the computer science community are member organizations whose members would rise up in revolt if their professional organization objected to the system. This is an extremely valuable resource, and it is used by thousands and thousands of scientists every day. Although you and I might use Google, our faculty and graduate students run to check CiteSeer. In fact, they will always check CiteSeer before resorting to the commercial online databases that the university subscribes to.

Lee Giles at Penn State is one of the people who set this crawler running. The project started as a side assignment to one of his graduate students to find some articles on the Web that they figured must exist but that they could not find through normal channels. The code found what they wanted, and then they started to use it for broader and broader searches. So there was no initial grand scheme to create this capability. It has evolved over time as various people found it useful, improved the code, and let it run.

With 5 million citations and growing, you can now ask questions like: who is the most cited person across all of the modern computer science literature? You can come to a conference and know the general citation ranking of every person in the room, who has the most cited article addressing a topic like the public domain, which journal is the most influential on the topic, and who has the most cited article in the most respected journal. You can compare the citation records of those articles that are available online with those that are not and discover that you are 4.5 times more likely to be cited if your articles are openly available online. Professors are now actively sending in the URLs where their articles may be found so the crawler can pick up their missing full-text articles. If you are an academic, your goal in life is to be cited, to be a recognized authority, to know that your work matters.
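The ranking queries just described reduce to simple counting over an extracted citation graph. The sketch below illustrates the idea with a toy index; the paper identifiers, authors, and counts are invented for illustration and do not reflect CiteSeer's actual data model.

```python
from collections import Counter, defaultdict

# Toy records standing in for the citations a crawler-based index
# extracts from full-text articles. All names here are invented.
papers = {
    "smith-2001-indexing": {"author": "Smith",
                            "cites": ["jones-1999-crawling", "lee-2000-ranking"]},
    "jones-1999-crawling": {"author": "Jones", "cites": []},
    "lee-2000-ranking":    {"author": "Lee",
                            "cites": ["jones-1999-crawling"]},
}

# Count how often each paper is cited by the others.
cited = Counter()
for rec in papers.values():
    cited.update(rec["cites"])

# Aggregate citation counts per author.
by_author = defaultdict(int)
for pid, rec in papers.items():
    by_author[rec["author"]] += cited[pid]

# "Who is the most cited?" then becomes a one-line ranking query.
most_cited_paper = cited.most_common(1)[0]
most_cited_author = max(by_author, key=by_author.get)
print(most_cited_paper)   # ('jones-1999-crawling', 2)
print(most_cited_author)  # Jones
```

At the scale of five million citations the counting would live in a database rather than in-memory dictionaries, but the queries have the same shape.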
CiteSeer has created a dynamic of professors making certain that all their articles are available and readily accessible on the Web. So far, the private scientific publishers have not complained about this situation. My guess is they are not likely to, unless they want to be boycotted by the general computer science community. I also wonder whether the scholarly community's reaction would be the same if a crawler were currently indexing and copying all openly available law review articles or biology articles.

Let us assume that you are in one of these other scholarly domains and you want to solve the legal dilemma for your own discipline. How would you do it? In my case, I would set up a Web site and call it the Public Commons of Geographic Information Science. We actually have a mock site up, but eventually I would want this site hosted by the University Consortium for Geographic Information Science, the group of 70 universities and research institutions housing the leading GIScience research programs in the United States, to give it credibility.
The commons or open library has four components: (1) open access to GIScience literature, (2) open access to GIScience course materials, (3) the public commons of geographic data, and (4) open-source GIS software. Our basic rule in designing this online material is to keep it very simple for the scientist. In a single paragraph we tell them what is wrong with the current publishing paradigm. In the second, we present a solution. Third, we walk them through four steps that solve their journal copyright and access problems.

Step 1 focuses on submitting articles to GIScience journals. In submitting your work to a scientific journal for peer review, we recommend that you place the following notice on your work prior to submission: "This work, entitled [xxx], is distributed under the terms of the Public Library of Science Open Access License . . . ." That license essentially says that anyone can copy the article for free as long as attribution is given. Then we provide an optional statement: "Although this license is in effect immediately and irrevocably, the authors agree not to make the article available to any publicly accessible archive until it is first published, is withdrawn from publication, or is rejected for publication." Note that we give scientists one recommendation and no options. We do not give them a suite of open-access licenses to choose from. Most research scientists and engineers could not care less about analyzing the law and social policy. They just want your best shot at supplying a legal solution for the discipline. The Creative Commons project is taking a similar approach by offering only a very limited set of license provisions for users to choose from.

Step 2 concerns which GIScience journals will publish articles subject to prior rights to the public existing in the article. Ultimately, I think most journals in my field will have to come around. However, the solution is not to beg journals in your discipline to accept an open-access license. One potential solution is to list the primary journals in the field and then have scientists report back whether use of the license was accepted. Those journals that allow open-access licenses will have a substantial competitive advantage in attracting submissions, particularly when most scientists discover that they are four times as likely to be cited by other scientists if their articles are openly available.

Step 3 focuses on how researchers can ensure that others will find their published articles through a widely accessible citation indexing system. This step includes a description of CiteSeer. The problem in the GIScience discipline is that our scientists publish across a broad range of literature, not just the computer science literature. Therefore, we either need to set up a CiteSeer capability with algorithms and keywords developed for our specific domain or hope that someone comes along to scale up CiteSeer to cover all science literature on the Web.

Step 4 involves ensuring that a researcher's article, once published, is maintained in a long-term public electronic archive in addition to sitting on a server at the researcher's university or on the server of the publisher. Of course, there is no longer a default right to archive in a world of electronic licensing. Scientists can overcome the legal impediments to archiving by following the open-access licensing approach recommended in step 1. However, many long-term technical archiving issues still remain.

My primary interest is in developing a public commons of geographic data. The challenges in that instance are far greater than for open-access sharing of journal articles, particularly if the vision is one of hundreds of thousands of people creating spatial datasets, generating the metadata for those works, and then freely sharing the files. Libraries in our local communities do not exist as a result of the operation of the marketplace.
Public libraries exist in our communities because a majority of citizens agree that the tax money we spend on them results in substantial benefits for our communities. Similarly, do not expect digital libraries that provide substantial public-goods benefits to be developed or maintained by the marketplace.

FURTHER LEGAL DISCUSSION

The computer science literature implementation of CiteSeer links to and maintains copies of over a half-million full-text online journal articles that may be freely accessed. Yet CiteSeer is in a different legal position from that of most other online archives. No authors or publishers are asked for permission before the CiteSeer Web crawler copies and retains articles. How can the copying and provision of access to over a half-million articles, without the explicit permission of the copyright holders, be legal?

CiteSeer finds the articles that it indexes by crawling the Web. To index found articles, the software copies each PDF or PostScript article it finds, converts it automatically to ASCII, searches for keywords, and then extracts and processes appropriate indexing information. A link to the URL where the article was found is automatically provided in the database created by the software. Typically this link is to the Web site of the author. As a backup, in case the link to the author's Web site is down, and as a means to provide more efficient access to the author-hosted article, the system caches a copy of the article and provides it in various formats that may be directly linked by users. Articles are seldom copied by the crawler from, for instance, commercial publishers of scientific articles, because those sites are typically protected by passwords or other technological protections. Unless permission is granted, CiteSeer indexes only articles found publicly available on the Web without charge.[8] Furthermore, CiteSeer purports to adhere to all known standards for limiting robots and caching. As the system has evolved, most articles now being indexed are those submitted by authors.[9]

CiteSeer gains its legal basis for existence primarily through the DMCA of 1998. Title II of the DMCA added a new Section 512 to the Copyright Act to create four new limitations on liability for copyright infringement by online service providers. The limitations are based on conduct by a service provider in the areas of (1) transitory communications, (2) system caching, (3) storage of information on systems or networks at the direction of users, and (4) information location tools. The legal issues are varied and complex. For example, one issue involves whether authors have the authority to post their scientific articles on the Web. If CiteSeer picks up an article for which exclusive rights were given up by an author to a publisher, is CiteSeer liable, or is the author liable for the violation? Section (d) of the DMCA on Information Location Tools states explicitly that "[a] service provider shall not be liable . . . for infringement of copyright by reason of the provider . . . linking users to an online location containing infringing material . . . by using information location tools . . . ." This exclusion from liability is followed by a list of three conditions that the operation of CiteSeer appears to meet. Similarly, in those instances in which an author specifically submits a URL to the system so that material can be picked up by the automated system, Section (c) of the statute on Information Residing on Systems or Networks at Direction of Users states that "[a] service provider shall not be liable . . . for infringement of copyright by reason of the storage at the direction of a user of material . . . ." This exclusion from liability is followed by a list of three conditions that the operation of CiteSeer again appears to meet.

The most tenuous part of CiteSeer's legal position relates to the caching of the full text of articles. It is not clear that the exclusion from liability under Section (b) on System Caching applies to CiteSeer. Nor does it appear to apply to most other nonprofit or academic service providers, because the required passwords or other controls are often not applied in these exchange environments. However, a similar exclusion for caching by the typical nonprofit service provider can be argued under other sections of the Copyright Act. Furthermore, if such a challenge were ever raised under the DMCA, CiteSeer operations could convert immediately to a free subscription and registration operational environment to defeat such a challenge. Another problematic legal issue is defining the point at which allowable temporary storage of material (i.e., caching) crosses over to become longer-term storage or archiving. Even though making backups of materials on the Web sites of others is widespread and commonplace across the Web, "archiving" that exceeds "caching" arguably requires explicit permission from copyright holders.
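The crawl-convert-extract-index cycle described above can be illustrated with a minimal sketch. Everything here is hypothetical: the regex, user-agent name, sample text, and URLs are invented, and CiteSeer's real citation-parsing heuristics are far more elaborate. The robots.txt check uses Python's standard urllib.robotparser, reflecting the robots-exclusion standards the system purports to honor.

```python
import re
from urllib import robotparser

# Hypothetical reference-section parser: pulls bracketed entries like
# "[1] ..." from the ASCII text converted from a crawled PDF. A real
# system needs far more robust citation-matching heuristics.
CITATION_RE = re.compile(r"^\[(\d+)\]\s+(.+)$", re.MULTILINE)

def extract_citations(ascii_text):
    """Return (number, reference string) pairs found in converted text."""
    return CITATION_RE.findall(ascii_text)

def crawler_may_fetch(robots_txt, agent, url):
    """Check the robots exclusion rules before copying anything."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Invented sample standing in for the ASCII produced from a crawled
# PDF or PostScript article.
sample = """Some article body text.
References
[1] J. Doe. An Invented Paper. 1999.
[2] A. Roe. Another Invented Paper. 2001.
"""

robots = "User-agent: *\nDisallow: /private/\n"

if crawler_may_fetch(robots, "citebot", "http://example.org/papers/doe99.pdf"):
    for num, ref in extract_citations(sample):
        print(num, ref)
```

A full pipeline would also convert the PDF to text, cache the original file, and record the source URL for linking, as the chapter describes; the robots check and the citation extraction are the two steps most directly tied to the legal position discussed here.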
The World Wide Web, when initiated, was clearly illegal from a plain-language reading of the law by most attorneys, because the Web allowed and in fact required the copying of documents without explicit permission. However, the Web was found by society to be so useful that ways were found to reinterpret and clarify the law to allow the innovation to spread. In a similar manner, the consensus appears to be that even if some lack of clarity exists in the law today with regard to the operation of systems such as CiteSeer and Google, highly useful Web-wide indexing systems are likely to be looked upon with favor by interpreters of the law. As long as the provisos in Section 512 of the DMCA are met, the approach appears to be on relatively sound legal footing.

Fortunately, most scientific publishers have revised their copyright policies in recent years to allow authors to post their authored articles on their own Web sites. This largely negates the potential argument that CiteSeer operates much like the former Napster in facilitating the illegal sharing of articles among scientists. Furthermore, as required by the DMCA, systems such as CiteSeer must remove articles in a responsible manner when requested to do so by a copyright holder [Section 512(b)(c)(d)].

One of the most promising options for addressing the legal clarity issue in the long run is to encourage and facilitate the ability of scientists to grant rights to the public to use their works prior to submission of those works to publishers that do not already allow open-access archiving of scientific works. For example, the Creative Commons is working to facilitate the free availability of information, scientific data, art, and cultural materials by developing innovative, machine-readable licenses that individuals and institutions can attach to their work.[10]

[8] See CiteSeer's Terms of Service, paragraph 2, at http://citeseer.nj.nec.com/terms.html.
[9] Correspondence from feedback@researchindex.org dated January 11, 2003. See also paragraph 4 of the Terms of Service at http://citeseer.nj.nec.com/terms.html.
[10] For additional information, see http://www.creativecommons.org and Chapter 23 of these Proceedings, "New Legal Approaches in the Private Sector," by Jonathan Zittrain.