Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age

3 Ensuring Access to Research Data

The advance of knowledge is based on the open flow of information. Only when a researcher shares data and results with other researchers can the accuracy of the data, analyses, and conclusions be verified. Different researchers apply their own perspectives to the same body of information, which reduces the bias inherent in individual perspectives. Unrestricted access to the data used to derive conclusions also builds public confidence in the processes and outcomes of research.

Furthermore, scientific, engineering, and medical research is a cumulative process. New ideas build on earlier knowledge, so that the frontiers of human understanding continually move outward. Researchers use each other’s data and conclusions to extend their own ideas, making the total effort much greater than the sum of the individual efforts. Openness speeds and strengthens the advance of human knowledge. As an example, Box 3-1 describes how the sharing of genomic data has advanced life sciences research.

Finally, only by sharing research data and the results of research can new knowledge be transformed into socially beneficial goods and services. When research information is readily accessible, researchers and other innovators can use that information to create products and services that meet human needs and expand human capabilities.

The Organisation for Economic Co-operation and Development (OECD) describes a new effort to enhance public access to research data (see Box 3-2). According to this approach, “Openness means access on equal terms for the international research community at the lowest possible cost, preferably at no more than the marginal cost of dissemination. Open access to research data from public funding should be easy, timely, user-friendly and preferably Internet-based.”[1] As the National Research Council’s

[1] “OECD Principles for Access to Research Data from Public Funding.” Available at http://www.oecd.org/dataoecd/9/61/38500813.pdf.
BOX 3-1 Access to Genomic Data

In biology, the culture of research and the applications of digital technologies have traditionally been heterogeneous, independent, and dispersed. However, the growth of interdisciplinary research, the advent of projects that have generated large volumes of data, and the invention of data-intensive devices such as DNA microarrays and high-throughput sequencers have highlighted the increasing importance of digitization of the biomedical sciences.[a]

In the field of genomics, strong forces have pushed in the direction of unrestricted access to data, including directives from funding agencies, requirements from journals that researchers submit data to public repositories, community expectations, and the development of powerful data-sharing systems such as PubMed. In the case of the human genome, for example, the desire by funding agencies, researchers, and the general public for public access to research data led the genomics research community to develop an ethic of unrestricted access. This ethic was formally adopted as the “Bermuda statement” in February 1996:

    All human genomic information produced at large-scale sequencing centres should be freely available and in the public domain, in order to encourage research and development and to maximize its benefit to society.[b]

At the same time, other forces have had the effect of restricting access to genomics data, including:

    The need to protect patient or individual privacy;
    The principal investigator’s desire to maintain research advantage;
    The danger of misuse (e.g., of virus sequences);
    A profit motive (for data with potential commercial value);
    The tendency to “publish and forget” used data, especially supplementary data.
Committee on Issues in the Transborder Flow of Scientific Data stated in its report Bits of Power: Issues in Global Access to Scientific Data, “The value of data lies in their use.”[2] The norms and traditions of research reflect the value of openness. Researchers receive intellectual credit for their work and recognition from their peers, and perhaps from the broader community of researchers and the public, when they publish their results and share the data on which those results are based. Some

[2] National Research Council. 1997. Bits of Power: Issues in Global Access to Scientific Data. Washington, DC: National Academy Press.
BOX 3-1 (continued)

The generation of complete genome sequences for a growing number of organisms has intensified the digitization of biomedical research. These data have many applications in both basic and applied research, with the lines between the two often being difficult to discern. For example, computational processing and reference to information and knowledge bases about organisms and disease processes allow researchers to reach faster conclusions about the likely results of a therapy.[c] The combination of cellular data, genomic profiling, and biological simulation may reduce the failure rate of drug candidates and the cost of testing.

In the near future, it will even be possible, given sufficient computing and storage resources, to record the genotype of each person in a secure database. Variations in genes may indicate specific disease susceptibility or responses to known drug types. This information could enable physicians to prescribe a personal immunization and screening schedule or to recommend specific preventive measures for each patient.

Further integration of the biomedical sciences using digital technologies could allow independent investigators to remain the engine of innovative research by participating in “virtual team science.” Early examples of such “cyberinfrastructure,” including the Biomedical Informatics Research Network, myGrid, and the cancer Biomedical Informatics Grid, indicate that it is technically feasible, if not easy, to integrate the many threads of biomedicine. The challenge is to ensure that new “cybersilos” do not replace existing disciplinary and institutional silos.[d]

[a] “The race to computerize biology.” 2002. Economist, Dec. 12.
[b] David R. Bentley. 1996. “Genomic sequence information should be released immediately and freely in the public domain.” Science 274:533–534. This statement was written on behalf of the Sanger Institute at the Wellcome Trust Genome Campus and the Genome Sequencing Center at Washington University in St. Louis.
[c] Chris Sander. 2000. “Genomic medicine and the future of health care.” Science 287:1977–1978.
[d] Kenneth H. Buetow. 2005. “Cyberinfrastructure: Empowering a ‘third way’ in biomedical research.” Science 308:821–824.

journals require the submission and public dissemination of the data supporting an accepted manuscript. Funding agencies and research institutions also have policies that require the open sharing of the data on which research conclusions are based. Codes of conduct in a research community, whether explicit or tacit, can exert a powerful influence on researchers to make data accessible.

Advances in information technology (for instance, the advent of grid computing and cloud computing[3]) will continue to transform the environment for

[3] In grid computing, distributed computing resources link experimental apparatus, processing, analysis, and storage; cloud computing involves large-scale, data-intensive, Internet-hosted applications and related infrastructure.
BOX 3-2 OECD Principles and Guidelines for Access to Research Data from Public Funding

From 2004 to 2006 the 30-nation Organisation for Economic Co-operation and Development (OECD) developed a set of guidelines based on commonly agreed principles to facilitate cost-effective access to digital research data generated through public funding. Endorsed by the OECD Council on December 14, 2006, the “OECD Principles and Guidelines for Access to Research Data from Public Funding” serve as objectives for each member country to achieve given its own legal, cultural, economic, and social context.

The Principles and Guidelines cover 13 broad areas:

    Openness
    Flexibility
    Transparency
    Legal conformity
    Protection of intellectual property
    Formal responsibility
    Professionalism
    Interoperability
    Quality
    Security
    Efficiency
    Accountability
    Sustainability

The Principles and Guidelines call “for a flexible approach to data access” under a default principle of openness and recognize “that one size does not fit all.” They also state that “Whatever differences there may be between practices of, and policies on, data sharing, and whatever legitimate restrictions may be put on data access, practically all research could benefit from more systematic sharing.”

NOTE: For more information, see Organisation for Economic Co-operation and Development. 2007. OECD Principles and Guidelines for Access to Research Data from Public Funding. Available at http://www.oecd.org/dataoecd/9/61/38500813.pdf.

research and lower the technical barriers to sharing data. As this transformation occurs, researchers are organizing their work in new ways to take advantage of new possibilities. An innovative example is the conduct of research in what can be called an open-knowledge environment.[4] Building on the methodology pioneered by the open-source software movement, this approach begins with

[4] Economist. 2004. “An Open-Source Shot in the Arm?” June 10.
the identification of a problem that is to be examined in a public forum on the Internet. Researchers from different disciplines, organizations, and countries then can all contribute to solving the problem, with the open sharing of data and ideas that might bear on that problem. An open-knowledge environment allows people with many different backgrounds and viewpoints to interact in a relatively unstructured way while moving toward a common objective. The free flow of information speeds progress, while the global reach of the Internet greatly expands the number and breadth of researchers who can contribute to a project. Another approach to sharing is open-notebook science.[5] Similarly, blogs, wikis, and other forms of electronic interaction are tools that enable collaborative work on common problems in a generally open research environment.

In the context of this report, sharing research data enhances the data’s integrity by allowing other researchers to scrutinize and verify them (as described in Chapter 2). Sharing also increases the likelihood that data will be preserved for long-term uses, although the stewardship of data requires more than that the data be accessible (as described in Chapter 4). Thus, the three themes of this report (integrity, accessibility, and stewardship) are intertwined.

BARRIERS TO SHARING DATA

Despite the many benefits to be gained by the sharing of research data and results, even a cursory survey of research activity reveals many circumstances in which access to data is limited. Because researchers require time to verify data, analyze their data, and derive research conclusions, individual researchers generally are not expected to make all their data public immediately. Individual researchers need latitude to follow hunches, experiment with methods, explore conjectures, and make mistakes. New tools for automatically assessing the quality of data and sharing them with others can facilitate the rapid sharing of digital data, although verifying the reliability of these tools presents its own set of challenges.

Once a research result is published, the norms of science, and often the terms of the research grant or contract, call for the supporting data to be accessible. Researchers may nevertheless try to keep the data private, perhaps to derive additional results without competition from others, for the exclusive use of a student or postdoctoral fellow whose career would be advanced by generating further papers, or just to avoid the effort to put the data in usable form for others. In the worst cases, they may retain data to hide acts of research misconduct or to conceal defects in the dataset.

The norms of a research community may allow keeping data private for a certain period. These norms can be formalized through the terms of a grant

[5] Katherine Sanderson. 2008. “Data on display.” Nature 455:273.
giving the investigator a defined period of exclusive use of the data, with the exclusivity ending upon the publication of results, after a particular length of time, or when data are deposited in a data center or archive.

There is great variation among research fields in their data-sharing norms, to such an extent that different fields can be said to have different data cultures. (Box 3-3 describes aspects of the data culture in economics.) A recent report commissioned by the Research Information Network of the United Kingdom examined data-sharing practices and expectations across a number of fields (Table 3-1).[6] The report highlights the global importance and relevance of data accessibility in research, as well as the fact that differences between fields are often more important than national differences in determining data-sharing practices. The international aspects of data access and sharing are discussed in more detail below.

Observational astronomy offers a good example of the data-sharing norms that can characterize a field of research. Astronomical data often can be used for multiple purposes and are usually made public, but proprietary periods in which only the members of a research team have access to data are common. The European Southern Observatory (Europe’s large optical observatory) and the National Aeronautics and Space Administration have 12-month proprietary periods. The U.S. National Optical Astronomy Observatory has an 18-month proprietary period. These periods provide researchers with an opportunity to make discoveries as a reward for dedicating significant periods of their careers to creating new facilities and developing new techniques. They also provide an opportunity for critical evaluation of the data before they are released.

In the high-energy physics community, collaborations are so large and the experiments so complex, with hundreds of scientists involved with the operation of a single detector, that it could take years for an independent scientist to learn enough to reanalyze the data. The data of each collaboration are treated as proprietary. Other groups that want to undertake the same measurement must form their own large collaboration and repeat the experiment. As explained in Box 2-1, large collaborations in high-energy physics involve elaborate procedures for internal scrutiny and validation of data.

Cultural norms and expectations in research fields regarding data can change over time. For example, as data sharing has proven increasingly valuable to the advancement of research in many areas of the life sciences, researchers, sponsors, research institutions, and other stakeholders have built new infrastructure and established guidelines to facilitate data sharing. A 2003 National Research Council study (Box 3-4) recommended guidelines for the sharing of

[6] Alma Swan and Sheridan Brown. 2008. To Share or not to Share: Publication and Quality Assurance of Research Data Outputs. Report Commissioned by the Research Information Network. June. Available at: http://www.rin.ac.uk/data-publication.
BOX 3-3 Data Sharing Within Economics

Economists rely on an enormous variety of research data—for instance, administrative data from government records, datasets provided by companies to the federal government, or data provided directly to researchers by companies. Some economists rely on methods similar to those used by anthropologists, in which large quantities of data are collected and analyzed. Often the datasets are subject to confidentiality agreements because individuals could be identified from the data. Use of the data may even be restricted to “enclaves,” where a researcher has to work on a nonnetworked computer in a secure room from which materials cannot be removed.

Analysis of economic data may depend critically on highly complex computer programs. These programs, rather than the actual data, can be the most valuable part of an economist’s research, because many datasets are available publicly, whereas a computer program could embody months or years of individual effort. Thus, to assess the original analysis, other researchers often need access to the computer programs as well as to the original data.

As in other sciences, the social sciences have an expectation of reproducibility—that if the data are available and analyzed with the same assumptions, the same results will emerge. But without considerable assistance from the original researchers, actual replication of published results in economics can be time-consuming, tedious, and subject to many errors. Furthermore, journals are reluctant to publish studies that are confirmatory rather than groundbreaking. Social scientists, like other scientists, are more interested in doing their own studies and getting credit for something new than in repeating work that has already been done.

Even if replication is not common, the data should be available to enable replication, but in economics this often is not the case.[a] Several years ago two economists wrote to the authors of every paper in the March 2004 issue of the American Economic Review, a leading journal in the field, and requested the data to replicate the research. Although the journal has a statement saying “Authors are required to maintain their data and supply it to other researchers upon request,” 14 of the 15 sets of authors to whom the economists wrote said that they did not have the data or would not share them. The authors summarized their findings in an article and submitted it to the American Economic Review, which published their paper.

As a result of this and other cases, the American Economic Review adopted a new policy. For published articles, the authors must provide both the data and the programs sufficient for the articles’ findings to be replicated. These data and programs are then posted on the journal’s Web site. If the use of the data is restricted, the authors must provide instructions on how to obtain permission to use the data. If some of the data are proprietary, the editors try to work out ways for other researchers to use the data. In addition, the journal is encouraging studies to reanalyze data and replicate results.

The American Economic Review is supported by dues from 20,000 members and has the resources to institute such a policy, whereas journals with fewer resources could have difficulty adopting and enforcing the same or similar policies. Also, the data and programs are not requested at the time of submission of an article—only upon acceptance—so that the 92 percent of the papers submitted to the journal that are rejected do not fall under the new guidelines. Some economists have decided not to submit a paper to the American Economic Review because they do not want to release their data or software. Nevertheless, because authors want to publish their papers in the journal, it has considerable influence over their actions.

[a] Robert A. Moffitt, American Economic Review, Presentation to the committee, April 17, 2007.
TABLE 3-1 Summary of the Data-sharing Environment in Various Fields in the United Kingdom

Field | Culture of sharing data | Infrastructure-related barriers to publishing data | Effect of policy initiatives to encourage data publishing | Overall propensity to publish datasets (with appropriate metadata and contextual documentation)
Astronomy | Strong culture of sharing | Low level of barriers | Policy has medium positive effect | Strong propensity to publish datasets
Chemical crystallography | Medium culture of sharing | Low level of barriers | Policy has little positive effect | Strong propensity to publish datasets
Genomics | Strong culture of sharing | Low level of barriers | Policy has strong positive effect | Strong propensity to publish datasets
Systems biology | Medium culture of sharing | Moderate level of barriers | Policy has strong positive effect | Medium propensity to publish datasets
Classics (Humanities) | Strong culture of sharing | High level of barriers[a] | Policy has medium positive effect | Medium propensity to publish datasets
Social and Public Health Sciences | Weak culture of sharing | Low level of barriers | Policy has little positive effect | Low propensity to publish datasets[b]
RELU[c] | Medium culture of sharing | Low level of barriers | Policy has medium positive effect | Medium propensity to publish datasets
Climate Science | Weak culture of sharing | Low level of barriers[d] | Policy has medium positive effect | Low to medium propensity to publish datasets

[a] The Arts and Humanities Data Service was established in 1995 to provide a national service to collect, preserve, and promote electronic resources in the arts and humanities; its funding was eliminated in 2008.
[b] This descriptor covers researchers not directly connected with a national data collection.
[c] The Rural Economy and Land Use Program is a collaborative research program among several UK research councils.
[d] The Natural Environment Research Council provides data centers.

SOURCE: © Research Information Network. 2008. To Share or not to Share: Publication and Quality Assurance of Research Data Outputs. June. http://www.rin.ac.uk/data-publication.
BOX 3-4 Sharing Publication-Related Data and Materials

In 2003 the National Research Council Committee on Responsibilities of Authorship in the Biological Sciences released a report that focused directly on the issues discussed in this chapter. In that report, the committee established what it called “the uniform principle for sharing integral data and materials expeditiously” (UPSIDE). They described this principle as follows:

    Community standards for sharing publication-related data and materials should flow from the general principle that the publication of scientific information is intended to move science forward. More specifically, the act of publishing is a quid pro quo in which authors receive credit and acknowledgment in exchange for disclosure of their scientific findings. An author’s obligation is not only to release data and materials to enable others to verify or replicate published findings (as journals already implicitly or explicitly require) but also to provide them in a form on which other scientists can build with further research. All members of the scientific community—whether working in academia, government, or a commercial enterprise—have equal responsibility for upholding community standards as participants in the publication system, and all should be equally able to derive benefits from it.

The committee also identified five corollary principles associated with sharing publication-related data, software, and materials. For example, the committee stated that “authors should include in their publications the data, algorithms, or other information that is central or integral to the publication—that is, whatever is necessary to support the major claims of the paper and would enable one skilled in the art to verify or replicate the claims.”

The committee noted that its purview extended only to the biological sciences. It also stated, however, that “in the committee’s view, there should be a single scientific community that operates under a single set of principles regarding the pursuit of knowledge. This includes a common ethic with regard to the integrity of the scientific process and a long-held commitment to the validation of concepts of experimentation and later verification or refutation of published observations.”

SOURCE: National Research Council. 2003. Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences. Washington, DC: The National Academies Press.

data and other information supporting research results that emphasize openness and expanded access, including research performed by companies.[7]

Although the charge to our committee excluded privacy and other issues

[7] National Research Council. 2003. Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences. Washington, DC: The National Academies Press.
related to human subjects from our study, it is important to note that these issues can act as barriers to data access. Some data are not released because of confidentiality or privacy considerations, such as data related to biomedical research or the social sciences. For example, the 1996 Health Insurance Portability and Accountability Act established rules for disclosure of individually identifiable health information (known as protected health information, or PHI).[8] If PHI is used in research, the researcher must comply with regulations regarding its use and storage in the project. There are instances where PHI may be disclosed, but the need to support published research is not among them. For PHI to be made publicly available, a subject must agree to the disclosure of the information. For some medical research data, privacy and confidentiality obstacles can be overcome by removing identifiers prior to the private sharing of data or the public release of data. However, this remains an area of ongoing concern and investigation. Efforts are now under way to make medical research data available while ensuring that the data cannot be used to identify individuals.

Research data also can be kept private because they pertain to intelligence, military, or terrorist activities.[9] Examples include research related to nuclear, radiological, and biological threats; human and agricultural health systems; chemicals and explosives; and information technology infrastructure. National Security Decision Directive 189 (NSDD 189), which was issued by President Ronald Reagan in 1985, states that the policy of the U.S. government is not to restrict, to the maximum extent possible, the products of unclassified fundamental research.[10] The challenge to policy makers and researchers is where to draw the line between classified and unclassified information and how to balance restrictions on access to sensitive information with the potential costs of such restrictions.

Our committee was not asked to examine national security issues in depth. Other National Research Council committees, including the Committee on Scientific Communication and National Security (CSCANS), are directly focused on issues such as classified information, export controls, and nonimmigrant visa policies. A recent CSCANS report points out that many federal government policies and practices since the September 11 attacks have effectively reversed NSDD 189.[11] The report calls for a standing entity to review policies in order to

[8] Institute of Medicine. 2006. Effect of the HIPAA Privacy Rule on Health Research: Proceedings of a Workshop Presented to the National Cancer Policy Forum. Washington, DC: The National Academies Press.
[9] National Research Council. 2007. Science and Security in a Post 9/11 World: A Report Based on Regional Discussions Between the Science and Security Communities. Washington, DC: The National Academies Press.
[10] National Policy on the Transfer of Scientific, Technical and Engineering Information. September 21, 1985.
[11] Ibid.
ensure that the small risks of basic research being misused are balanced with the enormous benefits that accrue from the free exchange of information. Another National Research Council committee examined the national security implications of access to genomic databases and found that unrestricted access, combined with the development of education programs by professional societies, is the best approach to balancing the advancement of knowledge with protecting the public from misuse of genomic data for bioterrorism threats.[12] The federal government’s creation in 2008 of a new category, “Controlled Unclassified Information,” illustrates that restrictions on the sharing of research based on national security concerns will continue to pose challenges to the research enterprise.[13]

When research is carried out or sponsored by public agencies, the general presumption in the United States is that data generated as part of that research should be publicly available.[14] Different considerations apply for research funded by a private company, whether that research occurs within a company or in the academic sector. Though some companies have been experimenting with the benefits of freely sharing results from proprietary research,[15] many companies carefully guard this information as a trade secret and a potential source of commercial advantage. Similarly, an academic researcher may temporarily withhold data in order to file a patent or develop a commercial product, even when the research is publicly funded. These issues are discussed later in this chapter.

The cost of disseminating data can be a barrier to their use. Circular A-130 from the Office of Management and Budget (OMB) stipulates that government-generated data should be available to users at a cost sufficient to recover the expense of dissemination but not higher.[16] However, data from private sources, even when purchased by the federal government for research purposes, frequently have high distribution costs and restrictions on redistribution. These costs can be a significant problem for academic researchers who need access to large databases for modeling or data analysis.

Finally, research data may be kept private because the resources are lacking to make data collections available to the public. A project might generate data that could be valuable to researchers in the same or other fields, but the

[12] National Research Council. 2004. Seeking Security: Pathogens, Open Access, and Genome Databases. Washington, DC: The National Academies Press.
[13] George W. Bush. 2008. “Designation and Sharing of Controlled Unclassified Information (CUI).” Memorandum for the Heads of Executive Departments and Agencies. May 9.
[14] Paul F. Uhlir and Peter Schröder. 2007. “Open data for global science.” Data Science Journal 6:OD36–OD53.
[15] Bernard Munos. 2006. “Can open-source R&D reinvigorate drug research?” Nature Reviews Drug Discovery 5:723–729.
[16] Office of Management and Budget. No date. Management of Federal Information Resources. Circular A-130. Memorandum for Heads of Executive Departments and Agencies. Available at http://www.whitehouse.gov/omb/circulars/a130/a130trans4.html.
With the advent of global digital networks over the past two decades, both international cooperation in research and the formation of networked data resources on regional and global levels have become commonplace. Examples include the Global Biodiversity Information Facility, the International Federation of Digital Seismograph Networks, the International Nucleotide Sequence Database Collaboration, the International Virtual Observatory Alliance, and the Global Earth Observation System of Systems, to name but a few. Almost all fields of inquiry have some data centers or networks designed to provide access to data. In most cases, the U.S. research community has been the organizing force for the collaborative data-sharing networks.

Greater access to research data from public funding also is receiving more attention at the national policy levels of many countries, in part because such data resources are now seen as being major research infrastructure components. For example, the Research Councils of the United Kingdom adopted a more open policy for their data holdings in 2006. The Ministry of Science and Technology in China initiated the Scientific Data Sharing Project in 2002, in recognition of the fact that “[t]he insufficient use of China’s massive data holdings has been an urgent problem.”64 Many other countries are similarly reviewing or revising their national policies and myriad institutional ones to make better use of their data resources.

Finally, some international scientific, engineering, and medical organizations at both the intergovernmental and nongovernmental levels, such as the International Council of Scientific Unions, the Committee on Data for Science and Technology, and the OECD, are developing data-sharing policies and guidelines for adoption by members and the international research community.
For example, the OECD in 2007 published its Principles and Guidelines for Access to Research Data from Public Funding, which are summarized in Box 3-2. The InterAcademy Panel, an organization of national science academies, supports a program to expand access to digital scientific information to researchers in developing countries.65

GENERAL PRINCIPLE FOR ENHANCING ACCESS TO RESEARCH DATA

Because of the huge increase in the quantity of research data being generated, it is possible to say both that more data are being publicly disseminated than have ever been before and that more data are being withheld from public

64 Jinpei Cheng. 2006. The development of China’s scientific data sharing policy. In National Research Council. Strategies for Preservation of and Open Access to Scientific Data in China: Summary of a Workshop. Washington, DC: The National Academies Press. Available at: http://www.nap.edu/catalog.php?record_id=11710.
65 See the program’s Web site: http://www.interacademies.net/CMS/Programmes/4704.aspx.
access today than have ever been before. Many fields of research have moved toward more open data-sharing policies as the value of data has increased and as digital technologies have enabled information to be disseminated more broadly. At the same time, heightened interest in the commercial applications of research data has caused some forms of data to be more restricted.

As described earlier in this chapter, there are legitimate reasons why some research data are not made publicly available, ranging from privacy concerns to technical barriers. Yet the basic principle that should guide decisions involving research data supporting publicly reported research results is clear:

Data Access and Sharing Principle: Research data, methods, and other information integral to publicly reported results should be publicly accessible.

This principle applies throughout research, but in some cases the open dissemination of research data may not be possible or advisable when viewed from the perspective of enhancing research in science, engineering, or medicine. Access to research data prior to reporting results based on those data might undermine the incentives to pursue the research. There might also be technical barriers, such as the sheer size of datasets, that make sharing problematic, or legal restrictions on sharing as discussed in the section on “Legal and Policy Requirements for Access to Data.” Also, “accessible” does not necessarily imply that data should be disseminated for free, though free or marginally priced distribution is the ideal. Nor are researchers responsible for providing data users with instruction or training in the use of their data, though they do have a responsibility to provide metadata, analysis software, models (including code and input data) and other information necessary for practitioners to validate and build on the results.
Where researchers have proprietary interests in such tools, they have the option of protecting those interests through applying for patents and/or asserting copyright, as appropriate, in advance of publicly reporting results. This principle is a standard that is not currently being met in some areas of research. Yet it provides a yardstick against which to measure current initiatives and future plans. Researchers know that the information they generate should be available to others to advance the frontiers of knowledge. The objective therefore must be to implement policies and promote practices that allow this principle to be realized as fully as possible. This principle may seem to apply only to publicly funded research, but a strong case can be made that much data from privately funded research should be made publicly available as well. In many cases, making such data available can produce societal benefits while not threatening the commercial opportunities that led to the data’s generation. Note that this principle covers data underlying publicly reported results. When a researcher working at a corporate lab seeks to publish results, patent applications can be filed in advance
of publication, so that making data accessible at the time of publication will not compromise commercialization of the invention in question. If a company decides to protect an invention as a trade secret, it might be assumed that researchers will not publish papers about the invention and the question of providing access to data will not arise.

In the past few years we have also seen private companies announce plans to make significant data resources available on an open access basis. For example, Merck has spun off a nonprofit, open access platform known as Sage.66 Sage aims to help researchers build new databases for modeling disease more effectively. Where possible, public policies should encourage the release of such data, and privately funded researchers and their managers should explore possible means of making data available.

The Data Access and Sharing Principle is consistent with recommendations from National Academies committees that have previously addressed data access. A 2003 report, Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences, puts forward the “uniform principle for sharing integral data and materials expeditiously (UPSIDE).”67 The UPSIDE principle calls on researchers employed in the academic, government, and commercial sectors to provide data and materials needed to support published findings, and to “provide them in a form on which other scientists can build with further research.” The 1997 report Bits of Power: Issues in Global Access to Scientific Data states that “full and open access to scientific data should be adopted as the international norm for the exchange of scientific data derived from publicly funded research.”68

RESPONSIBILITIES OF RESEARCHERS

As with the integrity of research data, the primary responsibility for sharing data lies with the researchers who produced them.
(In addition, other parts of the research enterprise have responsibilities for sharing data, as described later in this chapter and in the next chapter.) Only researchers know their data well enough to ascertain what information must be publicly available to allow others to verify their results and build on their work. Only researchers are in a position to work with research institutions, research sponsors, and journals to make data available in a way that allows them to be understood and used effectively by others. Thus, our committee recommends that:

66 Bryn Nelson. 2009. “Something wiki this way comes.” Nature 458(13, March 4). doi:10.1038/458013a.
67 National Research Council. 2003. Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences. Washington, DC: The National Academies Press.
68 National Research Council. 1997. Bits of Power: Issues in Global Access to Scientific Data. Washington, DC: National Academy Press.
Recommendation 5: All researchers should make research data, methods, and other information integral to their publicly reported results publicly accessible in a timely manner to allow verification of published findings and to enable other researchers to build on published results, except in unusual cases in which there are compelling reasons for not releasing data. In these cases, researchers should explain in a publicly accessible manner why the data are being withheld from release.

Making data available does not necessarily mean providing them at no cost. The next chapter discusses the need for research projects to develop plans for the management and sharing of data from the initial stages of a research program. Chapter 4 also describes the evolving infrastructure for providing data access and stewardship, whose components include institutional and disciplinary repositories.

Fulfilling this recommendation also requires that researchers be familiar with any possible constraints on the release of data. Although this information is usually known to researchers and their managers from the outset of a research project, agreements may be informal, may be understood differently by different parties (such as principal investigators and graduate students), or may change during the course of a research project. Requiring that researchers clarify and agree to these arrangements places the responsibility on researchers to oversee the accessibility of research data and to decide whether to participate in research where data accessibility is limited. Researchers who are considering becoming involved in a project where data accessibility is restricted need to ask themselves whether the benefits of participating in that project outweigh the benefits of transparency in generating and disseminating data.

Research thrives under conditions where data are available to others.
If data are not available, there should be a clear and public reason why those data are being withheld from dissemination. Indeed, justifications for not making data available should be understood by the researcher, sponsor, and institution. The reasons why data are being withheld could be published with journal articles, posted on Web sites, stated in the publicly accessible award statements of research sponsors or research institutions, or made available by some other means. The important point is that the reasons should be publicly available so that others can review and comment on the grounds for withholding data.

As discussed in the following section, the committee believes that research fields, research sponsors, research institutions, and journals have considerable ability to set appropriate standards and expectations regarding data access and sharing, and to develop the necessary incentives. Some are taking leadership roles in setting standards and instituting incentives. The committee believes that continued efforts by these stakeholders can create an environment in which the Data Access and Sharing Principle is widely followed in the research enterprise, and in which a bureaucratic framework of regulations and enforcement will not need to be imposed.
RESPONSIBILITIES OF RESEARCH FIELDS

As emphasized earlier, there are major differences between research fields in the handling of data, including technological infrastructure, publication practices, and data-sharing expectations. In some fields, aspects of their data culture act as barriers to access and sharing of data. Because of the growing importance of research data and the rate at which practices are changing in research, it is important for various fields and disciplines to examine their standards and practices regarding data and to make these explicit. The development of plans for data management and sharing is greatly facilitated when a field of research has standards and institutions in place designed to promote the accessibility of data.

Recommendation 6: In research fields that currently lack standards for the sharing of research data, such standards should be developed through a process that involves researchers, research institutions, research sponsors, professional societies, journals, representatives of other research fields, and representatives of public interest organizations, as appropriate for each particular field.

The development of standards and institutions can occur in different ways depending partly on the field of research in which it occurs. The process can be led by journal editors, professional societies, ad hoc bodies of researchers established to solve particular problems, or permanent institutions charged with overseeing data management issues. National Academies committees and advisory committees to federal agencies can play constructive roles. In large, complex fields, multiple initiatives may be undertaken to address various aspects of standard setting. Input and participation from international stakeholders will often be needed.

The life sciences provide useful examples of the standards-setting process.
As described in Box 3-4, a National Academies committee developed broad standards for the sharing of research data in the life sciences. Similarly, as described in Box 3-5, a journal-led effort incorporating community input developed the Paris Guidelines for the management of protein data. Both examples demonstrate how standards can be put in place to deal with existing or new issues affecting the management of research data.

The Principles for the Release of Scientific Research Results, released in 2008 and discussed in the earlier section on “Federal and Journal Policies Affecting the Availability of Data,” establish data-sharing standards for research conducted by employees of federal civilian agencies.69 One section of the principles states:

69 John H. Marburger, III. 2008. Principles for the Release of Scientific Research Results. Memorandum. May 28. Available at www.arl.org/bm~doc/ostp-scientific-research-28may08.pdf.
BOX 3-5
The Paris Guidelines

In some fields, journals have played a major role in developing standards for data collection, sharing, and preservation. In 2004, for example, the journal Molecular and Cellular Proteomics (MCP) developed standards for the management of protein data.a These standards were revised 1 year later based on community input, resulting in the “Paris Guidelines.”b These guidelines were made available in a checklist format, in a tutorial, and in MCP-hosted workshops to educate researchers about the details of the requirements for publication and data submission.c

MCP’s standard requires all relevant quantitative data to be made available at a level of detail sufficient to reproduce the reported results. Methods can reference previously published standards but any deviations must be explained. In particular, authors must submit along with the manuscript the data that have the greatest potential for misinterpretation (for instance, mass spectrographic spectra for post-translationally modified proteins) for the journal to publish. Data considered less important but worthy of access are recommended for submission to the journal as supplementary material to be deposited in a nonjournal repository, which therefore may not be archival.d In addition, an institutionally based, government-funded data depository (“Tranche”) was recommended; it has a distributed storage system similar to BitTorrent, thereby lessening costly bandwidth problems caused by downloading large amounts of data over the Internet. In this way the Paris Guidelines ensure that the most important data are deposited for perpetual and accessible storage while second-tier data also are accessible without placing too large a burden on the journal as the sole repository for data.

a Steven Carr, Ruedi Aebersold, Michael Baldwin, Al Burlingame, Karl Clauser, and Alexey Nesvizhskii. 2004. “The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data.” Molecular and Cellular Proteomics 3:531–533.
b Ralph A. Bradshaw, Alma L. Burlingame, Steven Carr, and Ruedi Aebersold. 2006. “Reporting protein identification data: The next generation of guidelines.” Molecular and Cellular Proteomics 5:787–788.
c See http://www.mcponline.org/misc/Tutorial_MCP_final.pdf.
d For an example of supplementary data, see http://www.mcponline.org/cgi/content/abstract/6/7/1123.

Research data produced by scientists working within Federal agencies should, to the maximum extent possible and consistent with existing Federal law, regulations, and Presidential directives and orders, be made publicly available consistent with established practices in the relevant fields of research.

This principle is consistent with the Data Access and Sharing Principle stated above. This report advocates that the principle apply not just to federal scientists but to all research where results are publicly reported.
A wide range of issues must be considered in setting data standards, including dissemination, usage restrictions, periods of exclusive use, documentation requirements, financial provisions, ownership, licensing terms, infrastructure needs, technological compatibility, and sustainable preservation. These issues vary greatly from field to field, depending on particular traditions and requirements. Although it is not impossible to prescribe a standard set of practices to which all researchers should adhere (indeed, the general principles stated in this report apply to all researchers), every field collectively and every researcher individually must address issues of data accessibility.

RESPONSIBILITIES OF RESEARCH INSTITUTIONS, RESEARCH SPONSORS, PROFESSIONAL SOCIETIES, AND JOURNALS

For researchers to make their data accessible, they need to work in an environment that promotes data sharing and openness.

Recommendation 7: Research institutions, research sponsors, professional societies, and journals should promote the sharing of research data through such means as publication policies, public recognition of outstanding data-sharing efforts, and funding.

As noted earlier in this chapter, research institutions, research sponsors, professional societies, and journals are undertaking a range of initiatives to promote the sharing of research data. In taking the next steps, research institutions and research sponsors need to create incentives for researchers to share data, just as they have incentives to maintain the integrity of research data and to publish their findings. Researchers need both formal and informal ways of being acknowledged and rewarded for making research data accessible and usable.
For example, in some cases tenure and promotion decisions could take into account efforts to promote the accessibility of data, the creation of publication-based metrics, or service to a community or institution. Data professionals also have an important role to play in ensuring the accessibility of research data. In close cooperation with researchers in a field, data professionals can anticipate the needs of data users and establish data management systems that meet those needs. Their contributions to making data accessible, as well as ensuring the integrity of data, need to be recognized. One way for research sponsors and journals to promote data accessibility is to establish the terms of access and sharing expected of institutions and investigators. For example, NIH explicitly requires that all grant applications for more than $500,000 in direct costs in a single year must include a data management plan that embodies the principles of the NIH Data Sharing Policy. This policy says that “data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and
proprietary data.” The data management plan becomes part of the proposal, and “NIH expects that plan to be enacted.… In the case of noncompliance (depending on its severity and duration) NIH can take various actions to protect the Federal Government’s interests.”70 These actions are not specified but may affect the review of future proposals.

As discussed above, research institutions, research sponsors, and journals have considerable leverage in encouraging data access and sharing on the part of researchers. Several leading research institutions have announced open access publication recommendations, which encourage faculty to deposit their publications in their institutional repository. Such recommendations could be extended to data. Some federal research programs and journals have adopted open access data policies that require or encourage researchers to deposit underlying data in a disciplinary or institutional repository (see Tables 2-1 and 2-2). Depending on the program or discipline, adopting and effectively enforcing such open access data policies may be an appropriate way for research institutions, research sponsors, and journals to implement this recommendation.

The Council on Governmental Relations points out that “few institutions have formal policies and procedures for access to and retention of research data.”71 As described above, the terms of research contracts and grants and other regulations often specify that research institutions are responsible for retaining data and providing access. Given the current lack of formal policies and procedures, we make the following recommendation.

Recommendation 8: Research institutions should establish clear policies regarding the management of and access to research data and ensure that these policies are communicated to researchers.
Institutional policies should cover the mutual responsibilities of researchers and the institution in cases in which access to data is requested or demanded by outside organizations or individuals.

The knowledge needed to develop data access policies is not widespread or fully developed. Research institutions and sponsors may need to come together to identify best practices and policy models. Organizations such as the Association of American Universities, the Association of Public and Land-Grant Universities, the Association of Research Libraries, and the Council on Governmental Relations can contribute to this process.

Disputes between researchers and their institutions regarding control of data are not unusual. For example, faculty members may be denied tenure and seek to take their research data with them, while the institution may seek

70 National Institutes of Health Office of Extramural Research. 2003. NIH Data Sharing Policy and Implementation Guidance.
71 Council on Governmental Relations. 2006. Access to and Retention of Research Data: Rights and Responsibilities. March. Washington, DC: Council on Governmental Relations.
to keep them. Or researchers and institutions may have different perspectives on how to respond to outside requests for access to data, including requests made under the auspices of the DAA or in connection with litigation. As described earlier in this chapter, requests for information can go beyond research data to information about a researcher’s personal life. Procedures for handling requests for data that either intentionally or inadvertently hamper the progress of research need special attention. Although the data from publicly funded research should be accessible in general, exploiting the norms of science to slow or stop the progress of research harms society. For example, institutional policies might stipulate that an institution will come to the aid of researchers in disputes with third parties, but researchers also must comply with institutional policies.

Many journals play a critical role in ensuring access to the data that support the publications appearing in those journals (see Box 3-6 for an example). Access to those data may be lost as journals evolve under the pressures of dramatic changes being catalyzed by digital technologies. The following chapter covers the responsibilities of journals to make data accessible in the context of the long-term preservation of research data.
BOX 3-6
Promoting Reproducibility in Medical Research

As of April 1, 2007, the Annals of Internal Medicine instituted a new policy designed to help the research community evaluate and build on published results. Authors of original research articles in the Annals are required to include a statement indicating whether the study protocol, data, and statistical code are available to readers and under what terms the authors will share this information. Sharing is not mandatory, but authors are required to state whether they are willing to share the protocol, data, and statistical code. Authors are not asked whether they are willing to make this information available until after a manuscript is accepted for publication.

According to an article announcing the new policy, the goal of the new requirement is to promote “reproducible research” in which independent researchers can reproduce results using the same procedures and data as the original investigators. Reproducible research does not require unlimited access to data and methods, but it requires access to as much of the dataset and statistical procedures as is necessary to reproduce the published results. As the article states:

Major cultural shifts in research must occur before a world of completely reproducible research can exist. These shifts include increasing the technical capacity of many research teams, further developing acceptable data-sharing mechanisms, and supporting—both professionally and financially—the publishing of reproducible research.… We hope that shining a spotlight on the availability of the study protocol, data, and statistical code for every Annals research report will be seen as a small but important step toward biomedical research that the public can really trust.
At the same time, it will enhance what is perhaps the main function of a journal: to provide a transparent medium for a conversation about science.a

a For more information, see Christine Laine, Steven N. Goodman, Michael E. Griswold, and Harold C. Sox. 2007. “Reproducible research: Moving toward research the public can really trust.” Annals of Internal Medicine 146:450–453.