Summary
Advances in digital computing, communications, sensors, and storage
technologies are revolutionizing nearly every area of scientific, engineering, and
medical research. Today, researchers are employing sophisticated technologies
to generate, analyze, and share data to address questions that were
unapproachable just a few years ago. They are carrying out detailed simulations to guide
theoretical approaches and to validate new experimental approaches. They are
working in interdisciplinary and often international teams on complex
integrative problems that require inputs from a multitude of perspectives. They
are using data generated by others to augment their own data and sometimes
to address problems that the original researchers could not have envisioned.
Digital technologies have fostered a new world of research characterized by
immense datasets, unprecedented levels of openness among researchers, and
new connections among researchers, policy makers, and the public.
Even as these new capabilities are expanding the power and reach of
research, they are raising complex issues for researchers, research institutions,
research sponsors, professional societies, and journals. Digital technologies can
complicate the process of verifying the accuracy and validity of research data,
in part because of the enormous rate at which data can be generated and the
intricate processing those data undergo. The high rate of innovation in digital
technologies, a lack of standards, and issues such as privacy, national security,
and possible commercial interests can inhibit the sharing of data, which can
reduce the ability of researchers to verify results and build on previous research.
Huge increases in the quantity of data being generated, combined with the need
to move digital data between successive storage media and software
environments as technologies evolve, are creating severe challenges in preserving data
for long-term use. And these issues are not restricted to large-scale research
projects; they can be especially acute for the small-scale projects that continue
to constitute the bulk of the research enterprise.
This report examines the consequences of the changes affecting research
data with respect to three issues: integrity, accessibility, and stewardship.
Because of the enormous range in the detailed procedures and styles of research
from field to field, it is impossible to formulate specific recommendations for
every field. Instead, for each of the three issues examined in this report, the
authoring committee has developed a fundamental principle that applies in all
fields of research regardless of the pace or nature of technological change. The
report then explores the implications of these three central principles for the
various components of the research enterprise.1
Developing the policies, standards, and infrastructure needed to ensure
the integrity, accessibility, and stewardship of research data is a critically
important task. It will require sustained effort on the part of all stakeholders in the
research enterprise. The committee believes that the broad principles stated in
this report provide the appropriate framework for this undertaking.
ENSURING THE INTEGRITY OF RESEARCH DATA
The fields of science, engineering, and medicine span the totality of
physical, biological, and social phenomena. Research in all these fields is based on
certain fundamental procedures and convictions. However, each research field
has its own characteristic methods and scientific style. Consequently, research
is too broad an enterprise to permit many generalizations about its conduct.
One theme, however, threads through its many fields: the primacy of
scrupulously recorded data. Because the techniques that researchers employ to ensure
the integrity—the truth and accuracy—of their data are as varied as the fields
themselves, there are no universal procedures for achieving technical accuracy.
The term “integrity of data” also has a structural meaning, related to the data’s
preservation and presentation. This is the subject of Chapter 4. There are,
however, broadly accepted practices for generating and analyzing research. In most
fields, for instance, experimental observations must be shown to be reproducible
in order to be credible. Even this fundamental principle can have exceptions.
For instance, observations with an historical element, such as the explosion of a
supernova or the growth of an epidemic, cannot be reproduced. Other general
practices include checking and rechecking data to confirm their accuracy and
validity and submitting data and research results to peer review to ensure that
the interpretation is valid. In addition, some practices may be employed only
within specific fields, such as the use of double-blind clinical trials.
1 In this Summary, the principles appear in boldface type and the recommendations drawn from
the principles are presented in italic type.
Many of the traditional methods for ensuring the integrity of data, whether
universal or discipline specific, are being modified as digital technologies alter
capabilities and procedures. Because of the huge quantities of data generated
by digital technologies, an increasing fraction of the processing and
communication of data is done by computers, sometimes with relatively little human
oversight. If this processing is flawed or misunderstood, the conclusions can be
erroneous. Documenting work flows, instruments, procedures, and
measurements so that others can fully understand the context of data is a vital task, but
this can be difficult and time-consuming. Furthermore, digital technologies
can tempt those who are unaware of or dismissive of accepted practices in a
particular research field to manipulate data inappropriately.
Several recent incidents and trends provided an impetus for this study, such
as the challenge journals face in preventing inappropriate manipulation of
digital images in submitted papers and well-publicized, albeit rare, cases of research
misconduct involving fabricated or manipulated data. Assessing the broad set
of institutions, policies, and practices that have been put into place to prevent
and detect research misconduct, including the fabrication or inappropriate
manipulation of data, was beyond the scope of this study. Nevertheless, the
committee recognizes that the advance of digital technologies presents special
challenges to the individuals and institutions charged with ensuring responsible
conduct in research. Since these individuals and institutions will continue to
play a critical role in ensuring the integrity of research data, it is important that
they adapt their procedures in order to function effectively in the digital age.
The most effective method for ensuring the integrity of research data is to
ensure high standards for openness and transparency. To the extent that data
and other information integral to research results are provided to other experts,
errors in data collection, analysis, and interpretation (intentional or
unintentional) can be discovered and corrected. This requires that the methods and
tools used to generate and manipulate the data be available to peers who have
the background to understand that information.
The traditional way for submitting data and results to the scrutiny of
other researchers is through peer review, which allows the validity of data and
results to be judged for quality by a research community before
dissemination. Although traditional peer review practices remain essential for evaluating
the importance and validity of research, it has become clear that these have
limitations when it comes to ensuring that digital data have been appropriately
collected, analyzed, and interpreted. Fortunately, it has also become clear that
the advance of digital technologies is providing new opportunities to ensure
data integrity through greater openness and transparency. The emergence and
growth of accessible databases such as GenBank and the Sloan Digital Sky
Survey illustrate these opportunities in widely disparate disciplines.2 Yet in
many fields, a lack of technological infrastructure, cultural norms and
expectations, and other factors act as barriers to openness and transparency.
2 Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L.
Wheeler. 2006. “GenBank.” Nucleic Acids Research 34(Database):D16–D20. Available at
http://nar.oxfordjournals.org/cgi/content/abstract/34/suppl_1/D16. See also Robert C. Kennicutt, Jr., 2007.
“Sloan at five.” Nature 450:488–489.
The integrity of data in a time of revolutionary changes in research practice
is too important to be taken for granted. Consequently, this report affirms the
following general principle for ensuring the integrity of research data:
Data Integrity Principle: Ensuring the integrity of research data is essential for
advancing scientific, engineering, and medical knowledge and for maintaining
public trust in the research enterprise. Although other stakeholders in the
research enterprise have important roles to play, researchers themselves are
ultimately responsible for ensuring the integrity of research data.
This straightforward principle leads to several specific recommendations.
Recommendation 1: Researchers should design and manage their projects so as to
ensure the integrity of research data, adhering to the professional standards that
distinguish scientific, engineering, and medical research both as a whole and within
their particular fields of specialization.
Some professional standards apply throughout research, such as the
injunction never to falsify or fabricate data or plagiarize research results. These are
fundamental to research, and have been confirmed by leading organizations
and codified in regulations.3 Other standards are relevant only within specific
fields—such as requirements to conduct double-blind clinical trials. Researchers
must adhere to both sets of standards if they are to maintain the integrity of
research data, and they can adhere to professional standards only if they fully
understand the standards.
Recommendation 2: Research institutions should ensure that every researcher
receives appropriate training in the responsible conduct of research, including the
proper management of research data in general and within the researcher’s field
of specialization. Some research sponsors provide support for this training and for
the development of training programs.
3 National Academy of Sciences, National Academy of Engineering, and Institute of Medicine.
1992. Responsible Science: Ensuring the Integrity of the Research Process. Washington, DC: National
Academy Press.
Researchers, research institutions, research sponsors, professional societies,
and journals all are responsible for creating and sustaining an environment
that supports the efforts of researchers to ensure the integrity of research
data. In some cases, digital technologies are having such a dramatic effect on
research practices that some professional standards affecting the integrity of
research data either have not yet been established or are in flux. The recent
recognition of the inappropriate manipulation of digital images submitted in
journal articles illustrates the need for the research enterprise to continue to
set clear expectations for appropriate behavior and effectively communicate
those expectations.
Recommendation 3: The research enterprise and its stakeholders (research
institutions, research sponsors, professional societies, journals, and individual
researchers) should develop and disseminate professional standards for ensuring
the integrity of research data and for ensuring adherence to these standards. In
areas where standards differ between fields, it is important that differences be
clearly defined and explained. Specific guidelines for data management may require
reexamination and updating as technologies and research practices evolve.
Although all researchers should understand digital technologies well
enough to be confident in the integrity of the data they generate, they cannot
always be expected to be able to take full advantage of new capabilities. In
an increasing number of fields, professionals with expertise specifically in the
generation, analysis, storage, or dissemination of data are playing an essential
role in taking advantage of digital technologies and ensuring the integrity of
research data.
Recommendation 4: Research institutions, professional societies, and journals
should ensure that the contributions of data professionals to research are
appropriately recognized. In addition, research sponsors should acknowledge that financial
support for data professionals is an appropriate component of research support in
an increasing number of fields.
ENSURING ACCESS TO RESEARCH DATA
Advances in knowledge depend on the open flow of information. Only if
data and research results are shared can other researchers check the accuracy of
the data, verify analyses and conclusions, and build on previous work.
Furthermore, openness enables the results of research to be incorporated into socially
beneficial goods and services and into public policies, improving the quality of
life and the welfare of society.
Despite the many benefits arising from the open availability of research
data and results, many data are not publicly accessible, or their release is
delayed, for a variety of reasons. Data may be withheld because they are being
used to generate a commercial product or service, because of confidentiality
considerations, or because of national security concerns. Furthermore, in some
fields it is acceptable for researchers to have a limited period of exclusivity in
which the data are used only by the principal investigators and their immediate
associates. In areas of potential commercial applications, patenting
considerations, contractual restrictions, and technological constraints also can limit or
delay the accessibility of data.
Legitimate reasons may exist for keeping some data private or delaying
their release, but the default assumption should be that research data, methods
(including the techniques, procedures, and tools that have been used to collect,
generate, or analyze data, such as models, computer code, and input data), and
other information integral to a publicly reported result will be publicly
accessible when results are reported, at no more than the cost of fulfilling a user
request. This assumption underlies the following principle of accessibility:
Data Access and Sharing Principle: Research data, methods, and other
information integral to publicly reported results should be publicly accessible.
Although this principle applies throughout research, in some cases the
open dissemination of research data may not be possible or advisable.
Granting access to research data prior to reporting results based on those data can
undermine the incentives for generating the data. There might also be technical
barriers, such as the sheer size of datasets, that make sharing problematic, or
legal restrictions on sharing as discussed in Chapter 3. Nevertheless, the main
objective of the research enterprise must be to implement policies and promote
practices that allow this principle to be realized as fully as possible.
This principle has important implications for researchers.
Recommendation 5: All researchers should make research data, methods, and
other information integral to their publicly reported results publicly accessible in
a timely manner to allow verification of published findings and to enable other
researchers to build on published results, except in unusual cases in which there
are compelling reasons for not releasing data. In these cases, researchers should
explain in a publicly accessible manner why the data are being withheld from
release.
This principle may seem to apply only to publicly funded research, but a
strong case can be made that much data from privately funded research should
be made publicly available as well. Making such data available can produce
societal benefits while also preserving the commercial opportunities that
motivated the research.
As discussed earlier, differences in technological infrastructure, publication
practices, data-sharing expectations, and other cultural practices have long
existed between research fields. In some fields, aspects of this “data culture”
act as barriers to access and sharing of data. With the growing importance of
research results to certain areas of public policy, the rapid increase of
interdisciplinary research that involves integration of data from different disciplines, and
other trends, it is important for fields of research to examine their standards
and practices regarding data and to make these explicit.
Data accessibility standards generally depend on the norms of scholarly
communication within a field. In many fields these norms are now in a state
of flux. In some fields, researchers may be expected to disseminate data and
conclusions more rapidly than is possible through peer-reviewed publications.
Digital technologies are providing new ways to disseminate research results—
for example, by making it possible to post draft papers on archival sites or by
employing software packages, databases, blogs, or other communications on
personal or institutional Web sites.
Data sharing is greatly facilitated when a field of research has standards and
institutions in place that are designed to promote the accessibility of data.
Recommendation 6: In research fields that currently lack standards for sharing
research data, such standards should be developed through a process that involves
researchers, research institutions, research sponsors, professional societies,
journals, representatives of other research fields, and representatives of public interest
organizations, as appropriate for each particular field.
If researchers are to make data accessible, they need to work in an
environment that promotes data sharing and openness.
Recommendation 7: Research institutions, research sponsors, professional societies,
and journals should promote the sharing of research data through such means as
publication policies, public recognition of outstanding data-sharing efforts, and
funding.
Recommendation 8: Research institutions should establish clear policies
regarding the management of and access to research data and ensure that these policies
are communicated to researchers. Institutional policies should cover the mutual
responsibilities of researchers and the institution in cases in which access to data
is requested or demanded by outside organizations or individuals.
PROMOTING THE STEWARDSHIP OF RESEARCH DATA
Research data can be valuable for many years after they are generated. Data
that led to initial insights can sometimes be used to generate new findings in
the same or entirely different research fields. Existing data can be reanalyzed
or combined with new data to verify published results or arrive at new
conclusions. In some research areas, accessible databases have become essential parts
of the research infrastructure, comparable to laboratories, research facilities,
and computing devices and networks.
Maintaining high-quality and reliable databases can be costly, especially
over long time periods. Obviously not all data should be preserved, but
deciding what to save and what to discard becomes more difficult as increasing
quantities of data are generated. Because the future uses of data are difficult to
predict, returns on investments in stewardship can be uncertain. Furthermore,
in many fields of research, there is no consensus as to who should maintain large
databases or who should bear the costs. These problems can be especially
difficult for investigators involved in small projects, who can face great challenges
in deciding which data will be useful, in documenting those data thoroughly for
future uses, and in finding funds from limited budgets for data preservation.
The value of data for long-term use suggests the following general principle
for the stewardship of data:
Data Stewardship Principle: Research data should be retained to serve
future uses. Data that may have long-term value should be documented,
referenced, and indexed so that others can find and use them accurately and
appropriately.
Curating data requires documenting, referencing, and indexing the data so
that they can be used accurately and appropriately in the future. Data
stewardship must start at the beginning of the project, not partway through or at the
end of the project.
Recommendation 9: Researchers should establish data management plans at the
beginning of each research project that include appropriate provisions for the
stewardship of research data.
Because data without accompanying information about how they were
derived can be useless, arranging for preserved data to be annotated so that they
retain their long-term value is among the most important tasks for researchers
establishing a data management plan.
This recommendation is not meant to imply that individual researchers are
responsible for ensuring indefinite preservation of their own data, but that they
ensure that data that are judged to have potential long-term value are prepared
and transferred to the appropriate archives or repositories. Researchers should
work in partnership with their institutions, sponsors, and fields to formulate
and implement their plans.
Researchers need to participate in the development of policies and
standards for data annotation, preservation, and long-term access. Data need not
be annotated in such detail that nonspecialists can immediately use them, but
guidelines should exist for the degree of expertise required to use a data
collection. Researchers also need to develop procedures for error reporting, tracking,
and correction. These policies and standards will vary greatly from field to field
because they depend on the nature and potential uses of data. Nevertheless,
establishing such policies is the collective responsibility of the researchers in
each field.
Recommendation 10: As part of the development of standards for the
management of digital data, research fields should develop guidelines for assessing the
data being produced in that field and establish criteria for researchers about which
data should be retained.
Researchers need a supportive institutional environment to fulfill their
responsibilities toward the stewardship of data.
Recommendation 11: Research institutions and research sponsors should study the
needs for data stewardship by the researchers they employ and support. Working
with researchers and data professionals, they should develop, support, and
implement plans for meeting those needs.
The problem of paying for long-term stewardship of research data and
other digital scholarly work is difficult, and solutions need to be developed
over time. It is important that requirements for improved data management
practices not be imposed as unfunded mandates. In the digital age, data
management needs to be integrated into research program funding as an essential
component of the conduct of research. Where appropriate, grant applications
should include costs for data stewardship.
Many issues regarding the integrity, accessibility, and stewardship of
research data are common across the research enterprise. Bodies that oversee
multiple fields of research should disseminate lessons learned and help to foster
interdisciplinary cooperation. Within the U.S. federal government, a recent
report by the Interagency Working Group on Digital Data explores the needs
for preservation and dissemination of publicly funded research data.4 At the
nongovernmental level, the National Research Council recently established
a new Board on Research Data and Information that will address emerging
issues in the management, policy, and use of research data at the national and
international levels.
4 Interagency Working Group on Digital Data. 2009. Harnessing the Power of Digital Data for
Science and Society. Washington, DC: National Science and Technology Council, Executive Office
of the President.