Justifications for and
Obstacles to Data Sharing
Terry Elizabeth Hedrick
INTRODUCTION
Several types of data sharing have been described in the preceding paper by
Boruch, from relatively passive efforts to intensive efforts involving the
provision of large computerized data files and extensive accompanying
documentation. That the sharing of data in many instances has led to significant
benefits is easy to document. Yet a simple, unqualified endorsement of the
practice would be both unrealistic and irresponsible. Many parties have interests
at stake when data are shared, and the appropriate balancing of these interests
is not always clear. The complexity of the issues is attested to by
controversies over data release and reanalysis described in the popular press and the
Terry Elizabeth Hedrick, a social psychologist specializing in program evaluation, is a group
director with the U.S. General Accounting Office. The views expressed in this paper are the
views of the author and do not necessarily reflect the policies of the U.S. General Accounting
Office.
scientific literature (see, e.g., Nature, 1980; Feldstein, 1980; Hedrick et al.,
1978; Wolins, 1962).
This paper is based on the general premise that data sharing is a desirable
and worthwhile practice. Thus it is organized around discussions of
justifications for data sharing and obstacles that impede it.1 When possible,
actual examples are provided. The lack of empirical information on benefits and
problems associated with data sharing means that these discussions may be
somewhat biased toward the more controversial cases that have received public
attention; in addition, in some cases, only one side of a controversy may have
been fully documented. Therefore, this paper should be read as an attempt to
identify, rather than to quantify, demonstrated or anticipated benefits from
data sharing and obstacles to an across-the-board institutionalization of such a
practice.
In the paper, the interests of the following parties are identified and
discussed:2
Primary researchers - persons originally responsible for collecting or
analyzing the data or in some cases for funding its collection and analysis.
Research participants - persons or units from whom data have been
collected: people, firms, towns, states, schools, etc.
Data requesters - researchers or other persons requesting release of data.
Scientific community - all members of the research community.
Society - all persons.
As will be seen, the benefits and burdens of data sharing are not evenly
distributed across these parties, and their interests can vary according to the
characteristics of each particular case. Guidelines on data sharing must be re-
sponsive to the diverse interests and circumstances.
JUSTIFICATIONS FOR DATA SHARING
Justifications for data sharing are based on demonstrated or anticipated bene-
fits for specific parties. To a large degree, the beneficiaries of data sharing
are the scientific community, data requesters, and society; to a lesser degree
and under some circumstances, primary researchers and research participants
may also realize gains. A variety of benefits associated with data sharing are
discussed below.
Reinforcement of Open Scientific Inquiry
One of the most widely held tenets of science is that research should be con-
ducted and reported in a manner that yields sufficient information to enable
people other than the original researchers to assess its merits and to replicate
it. While the majority of researchers are likely to interpret this tenet as
referring to the provision of careful descriptions of study procedures (as
provided for in most journal articles), the provision of data for reanalysis can
serve similar functions. The establishment of a policy of data sharing by
professional organizations, journals, research institutions, and government could
serve to reinforce the openness of scientific inquiry, thereby benefiting the
scientific community and society.
Verification, Refutation, or
Refinement of Original Results
Probably the most significant benefits realized through the sharing of data
stem from reanalyses by other researchers. These benefits include the verifi-
cation and refinement of original findings and the refutation of them.
Secondary analysts may reanalyze data by following the original researcher's
methods, thus checking the accuracy of the reported results, or by using com-
peting analytic techniques or sets of assumptions, thus testing the robustness
of the original conclusions to alternative approaches. If independent reana-
lyses are done conscientiously and with visibility, the credibility of the origin-
al research may be enhanced.
When research results have entered the policy process, the sharing of data
to permit reanalysis is extremely important. Analyses that confirm the
original results can help combat political pressures to deny or bury them.
The Wortman et al. (1978) reanalysis of data from the Alum Rock
Education Voucher Demonstration Program is a good example of a reanalysis
that refined the original work. Initial reports on the voucher demonstration
posited a relative loss or no gain in reading achievement for the six voucher
schools (Barker, 1974; Klitgaard, 1974). Wortman et al. used a
quasi-experimental design with multiple pretests and individual-level data and
concluded that the deleterious effect reported earlier was confined to a few
nontraditional programs within the six schools.
The complex analyses and large data sets now used in much social science
research have increased the susceptibility of findings to statistical and
programming errors, errors unlikely to be detected without intensive review or
reanalysis of the data. As Martin Feldstein said (1980:96):
When economists deal with large data sets and complex econometric operations,
there will be mistakes. If anyone relies on one study, he runs the risk of being mis-
led by an error or statistical fluke. Indeed all models are untrue in the sense that
they are crude approximations to the real world.
A dramatic illustration of this susceptibility comes from a reanalysis of
Feldstein's own early work, exploring the effect of Social Security on
personal saving behavior, by two analysts of the Social Security Administration,
Dean Leimer and Selig Lesnoy. At a 1980 conference of the American
Economic Association, Leimer and Lesnoy showed that an elementary com-
puter programming error led Feldstein to greatly overestimate the negative
impact of Social Security on saving behavior. Although Feldstein later took
issue with the Leimer and Lesnoy claim that the introduction of Social
Security has not substantially reduced personal saving, he acknowledged the
programming error and stressed that such replication studies are at the core of
the scientific tradition (Feldstein, 1982).
The Ehrlich research on the deterrent effect of capital punishment is a clas-
sic case of research findings quickly entering the policy process without pro-
vision for timely reanalysis by other interested parties. In 1975 Ehrlich pu-
blished an article claiming that between 1935 and 1969, each execution in this
country prevented seven to eight murders (Ehrlich, 1975). At the time of
publication, the Supreme Court (in Fowler v. North Carolina) was reconsid-
ering its 1972 decision declaring capital punishment unconstitutional, and
U.S. Solicitor General Robert Bork used the study results in an amicus curiae
brief filed by the Justice Department to argue for the reinstitution of capital
punishment. The data on which the research was based were not immediately
available to other researchers, so it was impossible for other parties to
determine the quality of the work.3
Examples of errors in analyses or assumptions that led to distorted or
incorrect results are not difficult to locate. Steven Director's (1979) work in the
area of evaluations of employment and training programs, for instance, dem-
onstrated that past evaluations had used approaches that probably underesti-
mated the impacts of employment and training programs on postprogram
earnings of enrollees. Campbell and Erlebacher (1970), using a simulation
technique, concluded that many evaluations of compensatory education pro-
grams were likely to have suffered from similar problems. Magidson's
(1977) and Rindskopf's (1978) applications of competing analytic techniques
to Head Start and Title I data were based on similar concerns.
The evaluation field is not necessarily more prone to these problems than
other fields. Wolins's efforts in the early 1960s to acquire and reanalyze
several small data sets underlying articles in psychology journals were based on a
suspicion that the original analysts had used inappropriate analytic techniques
(Wolins, 1962). More recent work by Wolins (1978), a secondary analysis
of Bayer and Astin's (1975) data on faculty salaries, was concerned with
problems of nonadditivity and irrelevant variance in the predictors and
challenged the conclusion that the data supported a finding of a sex differential in
the academic reward system.
These kinds of concerns have motivated several observers to call for
simultaneous or serial analyses of evaluative data sets, arguing that data with
significant potential for influencing public policy should undergo analysis by
several different researchers (Cronbach et al., 1980; Boruch and Cordray, 1980;
Raizen and Rossi, 1981). Cronbach's 65th thesis of program evaluation
states: "In any primary statistical investigation, analyses by independent
teams should be made before the report is distributed" (Cronbach et al.,
1980). That such a prepublication reanalysis policy can be beneficial is
corroborated by the experience of Stephen Fienberg, who, as editor of the
Journal of the American Statistical Association (JASA), required authors of
manuscripts to simultaneously submit copies of data. In Fienberg's judg-
ment, many of the articles submitted for review were subsequently streng-
thened by alternative analyses conducted by journal referees.4 Bryant and
Wortman (1978) have proposed that similar procedures should be adopted to
govern submissions to psychology journals.
The benefits to the public from sharing policy-relevant data to permit verifi-
cation and refutation of the original conclusions are fairly obvious. Benefits
to the scientific community may also result to the extent that faulty studies are
not published and, therefore, do not lead other researchers astray, shaping the
directions of future research until sufficient numbers of conflicting studies ter-
minate that avenue of inquiry. Public confidence in the worth of research
might also be improved. Finally, as Fienberg's experience with JASA illus-
trates, this is one circumstance in which primary researchers may also profit
from the sharing of data.
Replications With Multiple Data Sets
Conclusions drawn from the analysis of a single data set are heavily dependent
on the quality of that data set and are subject to distortion from its
idiosyncrasies: its scope, format, method of collection, etc. The confidence
one places in research conclusions can be greatly increased by consistency of
results across data sets; conversely, inconsistencies in results across data sets
lead one to view research results with skepticism and to engage in a careful
exploration of possible reasons underlying those inconsistencies.
The advancement of knowledge, especially in the social sciences, has been
hindered by single studies that capture the fancy of a discipline and send re-
searchers off on extended efforts to replicate, refute, or refine the findings of
the original study. The pressure for academic researchers to publish is one
cause of this reactive approach. Researchers are sorely tempted to publish
each study, rather than to pursue a systematic line of inquiry through the
execution of multiple studies. In partial response to the proliferation of one-shot
studies (and an extremely high submission rate), the Journal of Personality
and Social Psychology in 1976 instituted an editorial policy that encouraged
more systematic research efforts and stronger support for conclusions
(Greenwald, 1976).
Researchers should be encouraged to complement their collection and anal-
ysis efforts with analyses of other existing data appropriate for addressing the
same questions. To the extent that a policy of sharing data makes researchers
aware of other data sets suitable for their needs and encourages the publication
of articles that demonstrate similar findings across multiple data sets, sharing
data can benefit the scientific community and increase public confidence in
research findings.
Exploration of New Questions
In many cases, especially with large surveys or evaluative data sets, the pri-
mary researcher's interest in a data set may encompass only a small part of the
data set's potential usefulness. Providing other analysts access to such data
sets would permit additional benefits to be obtained from the original
investments in data collection. In this respect, evaluative data sets collected by
private research firms under government contract are one of the most underused
sources of information. Contract research firms necessarily must direct their
analytic efforts to address specific questions posed by the sponsor agency;
time and resource constraints are likely to prevent analysts from branching out
and exploring additional questions when the sponsor considers these questions
subsidiary and outside the original scope of work. Consequently, these kinds
of data frequently pass into oblivion without other parties being aware of
them. Hedrick et al. (1978) have provided a discussion of four general
categories of obstacles to the acquisition of evaluative data: problems in locating a
particular data set and authority for its release; insufficient documentation;
inappropriate aggregation; and delays and refusals to data requests. Many of
these obstacles are applicable to the present discussions.
A positive example of an effort to increase returns from investments in data
collection can be found in the Employment and Training Administration's
(ETA) dissemination and support activities with respect to the production of
public-use tapes from the Continuous Longitudinal Manpower Survey. This
survey collected information from quarterly samples of enrollees in CETA
programs (employment and training services delivered under the
Comprehensive Employment and Training Act) and includes, or will
eventually include, three years of postprogram labor force and welfare
participation data, as well as Social Security earnings information over an extended
period. ETA's interest in the survey was initially largely confined to
descriptions of the characteristics of CETA enrollees and estimates of earnings gains
from CETA participation, but other researchers have been encouraged to
exploit the data set for other purposes. From such data-sharing efforts, benefits
accrue to data requesters, the scientific community, and society.
Creation of New Data Sets
Through Data File Linkages
Another benefit obtainable through the sharing of data is the opportunity to
create new data sets by linking two or more existing sources of information.
As will be discussed in the section on obstacles to data sharing, this procedure
can raise problems of violations of confidentiality, possibly even of privacy
(through outright or deductive disclosure of identities), but the potential exists
for researchers to address new questions or refine their inquiries into old ones
by expanding the kinds and amounts of information available.
The Continuous Longitudinal Manpower Survey is also a good example of
data linkage since it involves linking information from CETA agency files,
enrollee interviews, and Social Security records to create a single data file rich
in detail about CETA participants. On a smaller scale, researchers have com-
bined media reports of daily pollution levels in Los Angeles with Blue Cross
of California records of cause, frequency of admission, and length of hospital
stay to assess the effects of air pollution on urban morbidity (Sterling et al.,
1969). Keesling (1978) creatively merged his own data on school attendance
rates with information on reading test scores to examine the contribution of
school attendance to achievement-test performance. Again, data requesters,
the scientific community, and society are the major beneficiaries of this type
of data sharing.
Encouragement of Multiple Perspectives
Every scientific discipline has its own blinders with respect to methodologies,
analytic techniques, and the phrasing of research questions. Even the selec-
tion of outcome indicators often involves making value judgments about the
desirability of certain types of behavior or the characteristics possessed by
certain groups of people (Cochran, 1979; Johnston, 1976). The findings of
marital instability from the Negative Income Tax Experiments are a case in
point (Groeneveld et al., 1980): increases in divorce rates are viewed by some
as a positive indicator of women's emancipation; others view them as a nega-
tive indicator of the breakdown of the traditional family. Thus analysts may
interpret identical variables from different perspectives. Of course, they may
also select different variables to address the same questions.
The analytic techniques employed by researchers may also be a function of
disciplinary background. Unfortunately, decisions to employ input-output
models, analysis of covariance, multiple regression, causal modeling, or oth-
er techniques frequently derive less from the nature of the question at hand or
the appropriateness of the technique for the data than from a researcher's per-
sonal training or past experience. Researchers are most comfortable with
analytic techniques that are familiar to them and for that reason they can be
indiscriminate in their use of the techniques.
Data sharing, if it can be extended across disciplinary lines (a large if),
has the potential to benefit almost all parties. Sharing data may encourage
cross-disciplinary work, permitting questions to be viewed from diverse
viewpoints, and it may broaden the perspectives of researchers, including primary
researchers, by exposing them to new viewpoints, methodologies, and analytic
techniques. To the extent that a broader perspective is taken and the
development of knowledge is enhanced, society should profit from the sharing of
data.
Reductions in the Incidence
of Faked and Inaccurate Results
The existence of dishonesty in science is becoming more and more difficult to
ignore. In the past few years, disclosures of hoaxes such as the Piltdown man
and Cyril Burt's fabricated data on the inheritance of intelligence have sensit-
ized the scientific community and society to the issue of dishonesty in science.
More recently, controversies concerning Dr. John Long's cultures of human
Hodgkin's disease cells, which turned out to be monkey cells (Harris et al.,
1981), and the 30 drug researchers discovered by the Food and Drug
Administration to be faking data or otherwise being dishonest (Hilts, 1981)
highlight weaknesses in the current system of peer review and promise to keep
public attention on the issue. The costs of errors or dishonesty in science go
beyond public loss of confidence in the objectivity of science; in the Long
case, several researchers wasted considerable time working with cultures of
monkey instead of human cells. Not only did the researchers bear costs in
terms of advancement of their careers, but society may have borne costs
through time lost in research on cancer. When dishonest science leads other
researchers astray or influences policy decisions on public programs, it can
have negative effects on the welfare of society that range far beyond the scope
of the original research.
Again, motivations for dishonest behavior must be at least partially attrib-
uted to the intense competition to publish and to obtain grant money. A poli-
cy of open access to data, while far from a complete solution to the problem,
might serve as a deterrent to the faking of data and dishonest reporting of
research results. The extent of the deterrent potential of open access is
unknown, but if even a few researchers are discouraged from dishonesty by fear
of discovery and exposure, a data-sharing policy may be cost effective.
Unintentional mistakes are a wholly different problem, and respect is
warranted for those researchers who, when shown errors in their work,
acknowledge the problems. Although researchers are assumed to carefully check the
accuracy of quality control, the pressure to work and publish quickly, the
complexity of many analytic techniques, and simple human fallibility not
surprisingly sometimes result in errors. Once again, an acknowledged policy of
open access to data and the attendant risk that one's mistakes could be public-
ly exposed might increase the attention researchers give to their work and,
therefore, might improve its quality. Both the scientific community and so-
ciety would benefit from any reductions in errors resulting from an open-
access policy.
Development of Knowledge
About Analytic Techniques and Research Designs
Secondary analysis is a fruitful activity for the production of information on
analytic techniques and research designs. "To the conscientious analyst,
there is often no single, generally accepted way to deal with data stemming
from an evaluation" (Rindskopf, 1978:15). The further the study design
deviates from the classical experiment, the more confusing the choice of
analysis approaches becomes. In evaluation, and presumably in most areas of
science, acquiring data sets to explore new analytic approaches promises
future benefits to many parties through the development of information on the
strengths and weaknesses of analytic approaches. It can also provide justifi-
cations for better research designs to detect what are frequently small and elu-
sive treatment (program) effects. At Northwestern University, participants in
the Project on Secondary Analysis have been engaged in efforts to advance the
state of the art in analyzing effects of education programs since 1974, identi-
fying problems in drawing causal inferences, investigating methods of using
multiple analytic approaches to provide evidence of convergent validation,
and exploring the biases that result when assumptions of various analytic
methods are violated (Boruch and Wortman, 1978).
Provision of Resources for Training
The availability of data for secondary analysis offers benefits for training stu-
dents, especially in statistics and methodology. Boruch and Reis (1980) have
documented a wide variety of payoffs to students from engaging in secondary
analysis: reduction in the time necessary to get to the analysis stage of re-
search, lower research costs, gains in knowledge about the nature of
evidence, increased experience with analytic procedures, early exposure to
the untidy world of applied research in comparison with the world of the text-
book, and early entry into discussions of public policy. Fields such as
economics have traditionally relied heavily on data collected by others for
training graduate students. For postdoctoral programs, which may allow
only a one-year training period for exposure to a new area of inquiry, access to
data collected by others may be a necessity if a trainee is to be able to follow a
project to completion.
Improved training is a benefit in and of itself, but in many cases, students
engaging in secondary analysis have made significant contributions to theory,
methodology, and statistics. Magidson's (1977) work with competing ana-
lyses of Head Start data, Rindskopf's (1978) analyses of Head Start and Title I
data, and Rezmovic and Rezmovic's (1981) efforts to test theories underlying
the measurement of psychological traits are but a few examples of high-
quality and useful secondary analysis work by students or postdoctoral
trainees.
Student requests for the sharing of data are probably more likely to run into
obstacles or outright refusals than requests from more established researchers.
A policy of open access to data should reduce these obstacles and increase
both the quality of student training and students' resources for making signifi-
cant contributions to their professions and to public policy. A side benefit is
that early experiences with data sharing may sensitize these new professionals
to both the legitimacy of others' data requests and the need for thorough docu-
mentation for their own data.
Reduction of Respondent Burden
A concern of many researchers is to reduce the response burden on research
participants, whether participants are students in undergraduate survey
courses or people in government programs. This concern is especially salient
in government research, for which the clearance procedures of the Office of
Management and Budget call for budgets (in terms of hours) to be submitted
estimating the amount of respondent burden associated with ongoing or
planned data collection efforts. If appropriate data already exist that are suit-
able for answering a researcher's questions, there is little justification for im-
posing additional respondent burden. Uncontrolled data collection runs the
risks of depleting research subject pools and endangering the future coopera-
tion of participants. The sharing of data, by preventing redundant data col-
lection, can benefit both the scientific community and society.
OBSTACLES TO DATA SHARING
The variety of obstacles to data sharing range from clearly illegitimate refus-
als for data access by primary researchers who fear criticism to legitimate re-
fusals based on national security considerations. In between are many gray
areas in which the legitimacy of refusing access is not easily resolved or in
which resources and effort are needed before data sharing can become possi-
ble. Low-cost solutions are readily apparent for some of these impediments,
moderate- or high-cost solutions will resolve others, and a few appear intran-
sigent.
The discussion of justifications for data sharing demonstrated that most of
the actual or anticipated benefits are received by three parties—data re-
questers, the scientific community, and society. Primary researchers and re-
search participants are by definition members of these groups and, therefore,
they also receive benefits. However, as the following discussion of obstacles
indicates, primary researchers and research participants bear the brunt of the
costs and risks attendant to data sharing, much more so than other parties.
Clearly, the costs and benefits of data sharing are not distributed evenly
among the parties involved.
Concern About the Qualifications of Data Requesters
When a primary researcher has reservations about the qualifications of a data
requester, he or she may be reluctant to share data for several reasons. First,
the primary researcher may anticipate that the data requester will require ex-
tensive assistance in developing specifications for the exact variables desired
and the most appropriate format for transfer of the data. Second, if the data
requester does not have experience with comparable data sets, the primary re-
searcher may anticipate having to respond to repeated requests for guidance
concerning interpretation of variables, computer programming, and analytic
procedures. Third, and perhaps most importantly, the primary researcher
may fear that analyses performed by the requester will be of poor quality and
that significant amounts of time will be necessary to review, critique, and re-
but those analyses. Finally, the researcher may fear that the data set itself and
the original analyses will lose credibility if poor reanalyses are performed that
elicit criticism in the scientific community. (These concerns may be exacer-
bated if the researcher perceives the requester to have personal interests in
analyzing the data to demonstrate a particular outcome.) The criticism issue is
broached again later in this paper, but it should be noted that there is no
evidence that incompetent reanalyses come to overshadow competent primary
analyses. Also, the same problem of time-consuming debate between
investigators critiquing each other's analyses exists for research that does not
involve reanalyses of data; therefore, this problem is not peculiar to secondary
analysis.
National Security Considerations
Most researchers recognize that national security can sometimes be a com-
pelling reason for nonrelease of data or even nonpublication of results, al-
though some have argued that even under national security constraints, data
Fear and Costs of Criticism
Two different bases can underlie a fear of criticism from secondary analysis:
fear that purposive distortions of study results or faked data will be exposed,
and fear that the secondary analyst will find fault with the original researcher's
work.
From society's standpoint, exposure of faked data and correction of inac-
curate results are appropriate justifications for the sharing of data; from the
primary researcher's standpoint, sharing data is a risky enterprise. It is al-
ways uncomfortable to have someone checking your work. Fear of criticism
by dishonest researchers is easily understood, but there are also reasons for
honest and careful researchers to experience trepidation at the thought of sec-
ondary analysis. Regardless of the care given to the original analyses and in-
terpretation, the complexity of many analytic techniques and the rapid advent
of new methods of analysis make criticism by a secondary analyst a not un-
likely possibility. Such criticism may at times be warranted; in other cases, it
may be shoddy and unjustified. It can have two kinds of costs for the re-
searcher. First, it may threaten his or her reputation and esteem and, to an
unknown degree, interfere with obtaining money for future work. Second, as
mentioned earlier, responding to criticism may become a significant drain on
a primary researcher's time.
It is unclear what impact criticism and debate have on society. For in-
stance, the debates over econometric analyses of the deterrent effects of capi-
tal punishment have been vigorous since Ehrlich's original work. But we do
not know what effects that controversy had, if any, on legislators, their staffs,
or the public. For that matter, it is unclear whether distinctions are ultimately
made between poorly informed criticism and sound criticism. Does the best
work sift out by the end of a debate? What is clear is that there is concern that
sharing data will yield criticism entailing substantial costs. Objections have
been raised, for example, to the policy recommendations that all major
program evaluations be subjected to secondary analyses. The argument by
federal managers responsible for evaluation is that the opponents of the results of
an evaluation will exploit the opportunity to attack even when faced with a
good product. Few incentives are perceived to exist for confirming the re-
sults of the original researchers; instead, secondary analysis may be viewed as
an opportunity to make a reputation by refuting the work of others.
As more information is shared, research often becomes more vulnerable to
criticism. The University Group Diabetes Program, a research effort to con-
trast multiple therapies for mild diabetes in middle-aged and elderly people,
responsibly reported detailed information on baseline variables for each type
of therapy group. By doing so, it opened the door to critics (and opponents).
As Meier (1975) and others (Jablon, 1979; Sterling and Weinkam, 1979) have
noted, twentieth-century science may have more of an adversarial atmosphere
than one of dispassionate, open inquiry (Meier, 1975:521):
Where the study is conducted with even greater care, on the other hand, and many
baseline variables are reported, demands for endless subanalyses and recombina-
tions, not to mention recriminations, seem to be the inevitable result. Of course, I
regard more information to be better than less, and I certainly do not mean to sug-
gest that one should refrain from reporting the values of all variables of interest.
My complaint is with the carnivorous appetites among critics which seem to be gen-
erated by this kind of raw meat.
Poor Communication
There are two very different types of communication problems that are of con-
cern: problems of identifying and locating a data set appropriate for one's
needs and problems of resolving disputes when secondary analyses lead to
different results.
To begin with, an investigator interested in a particular topic must be aware
that a data set appropriate for his or her needs exists; the lack of mechanisms
for matching research needs to existing data is a major impediment to second-
ary analysis (Finifter, 1975). Guidelines from the Office of Federal
Statistical Policy and Standards that require federal agencies to submit stan-
dardized abstracts summarizing public-use, machine-readable data files for
input into the Department of Commerce's Directory of Federal Statistical
Data Files promise a partial solution to this problem. Focused efforts, such
as the archive of data on long-term care at Michigan State University (Katz et
al., 1979), can also contribute. A variety of data archives and mechanisms to
identify data sets are discussed by Clubb et al. (in this volume). Robbin
(1981), for instance, has proposed guidelines for bibliographic follies to en-
sure access to information on machine-readable data files. Even so, if data
sets were not originally intended for use by others and were not products of
federal agency collection efforts, the matching of researcher interests and data
sources is likely to be a hit-or-miss activity relying on informal communica-
tion networks.
If a researcher does identify a data set appropriate for his or her needs, the
next step is to locate the people with authority to grant release of the data.
While this may sound so straightforward as to not be worth mentioning, ex-
perience has shown otherwise (see Hedrick et al., 1978). Even though data
collected with public funds are assumed to be public property, federal agen-
cies do not always obtain and archive federally funded data sets. If a data set
has been left in the control of the original researchers, there is no guarantee
that it will still exist at the time of the access request. Or, if a data set does
exist, the people most closely involved with data collection, data processing,
and analysis may have changed interests or jobs, may have died, or may not
be willing to expend their time to organize, explain, and transfer the data.
The second problem of communication between users, that arising from
challenges to the accuracy of the original analysis, has yet to be addressed in
any systematic fashion. As mentioned previously, some researchers have
characterized twentieth-century science as an adversary process (Jablon,
1979; Sterling and Weinkam, 1979). For instance, Sterling and Weinkam
have described in detail their efforts to seek corrections and clarifications re-
garding misclassifications of persons in the data of the Dorn Study of
Mortality among U.S. veterans. While their account gives only one side of
the controversy, it does not breed optimism with respect to the resolution of
such disputes (Sterling and Weinkam, 1979:1):
Our experience indicates that reactions to discovery of and attempts to correct errors
in scientific studies are similar to those met by consumer attempts to deal with errors
in large commercial computerized procedures. Because scientific "management"
appears to opt for an adversary rather than cooperative mode of responding to dis-
covery of errors, much of the value may be lost which secondary analysis has for
verifying the validity of past work.
When a secondary analyst has sufficient information to replicate the origin-
al analysis and obtains different results, the solution to the communication
problem may lie in simply offering the primary analyst (and funding agency)
an opportunity to review and comment on the new analysis. If no comment is
forthcoming, the secondary analyst then proceeds with publication. When,
however, additional information from the primary researcher is required to
identify the source of the discrepancy and an adversarial attitude wins out over
a cooperative one, the benefits associated with the sharing of data may not
materialize.
Data Set Inadequacies
A number of obstacles to data sharing stem from inadequate preparation and
retention of data sets. If data are located and agreement is forthcoming for
data release, the lack of good documentation can reduce or completely pre-
vent exploitation of a data set by a secondary analyst. Here documentation is
meant in its widest sense: sampling frames and study design, copies of the orig-
inal data collection instruments, data collection procedures, validity infor-
mation, data transformations, aggregation procedures, procedures followed in
creating new variables, etc. Having information on the physical format of
data tapes is not enough; some of the most valuable information for a second-
ary user of data is information about the strengths or weaknesses of data set
items. Unfortunately, this type of information is often not treated formally
and exists only in the memory of the original researcher. To the extent that
formal documentation is poor or not available and the primary researcher is
uncooperative or unavailable, major obstacles exist to the productive use of
the data by other parties. It may be next to impossible to create documenta-
tion after the fact, or the time necessary to disentangle formats may prove pro-
hibitive.
If use of data by other parties has not been anticipated by the primary re-
searcher, data requesters may find that the data have not been properly ar-
chived and maintained. Information recorded on magnetic tapes or disks can
deteriorate rapidly unless stored under proper conditions. Placing a data tape
on a shelf in one's office provides far from optimal storage conditions; storing
card copies of data in a garage or basement is likewise undesirable. Backup
copies of data sets should be created and carefully maintained as a matter of
normal research procedures.
Recognition and Proprietary
Concerns of Primary Researchers
A researcher who invests time and resources in the collection and processing
of data deserves the first opportunity to analyze those data and make a contri-
bution to his or her field. Release of data before a primary researcher has had
a reasonable opportunity to capitalize on those efforts would be an enormous
disservice to researchers and would discourage future data collection. In
many cases, the concerns of primary researchers for recognition will not be an
obstacle to data release since data requesters will have learned of the data set's
existence from the published work of the original researcher. Even in this
circumstance, however, problems arise when the requester desires the data for
a purpose that overlaps with the data collector's future research plans.
Questions concern the scope of data available to other parties (the entire data
set or only the portion supporting published material); the specification of a
reasonable time period for the primary researcher to retain control of the data;
the extent of the primary researcher's obligation to release data for purposes
that may overlap with future research plans; and the definition of when data
enter the public domain (upon publication, upon presentation at a conven-
tion, upon use in court, upon communication in some form of professional
correspondence, etc.).
The proprietary issue is treated in detail in the Cecil and Griffin paper in
this volume, and the complex issues of private versus public ownership are
not discussed further here. It is worth noting, however, that the reward struc-
ture for scientists is undergoing a significant change. Researchers working in
such fields as genetic engineering and econometrics are finding that there are
commercial rewards for their work that may supersede the goal of the cooper-
ative pursuit of knowledge (Nelkin, 1981). Since constrained access to re-
search results can increase their commercial value, this shift in the reward
structure may become a major obstacle to data sharing.
A controversy between a public interest group, the Interfaith Center on
Corporate Responsibility (ICCR), and Abbott Laboratories and Mead-
Johnson (see Nature, 1980) illustrates several of the recognition and proprie-
tary issues. ICCR conducted a nationwide survey of infant feeding practices
and sought assistance from the Center for Disease Control (CDC) in providing
computer resources for analysis of the survey responses. CDC's records, as a
public agency, are open to scrutiny by others under the Freedom of
Information Act, and Abbott Laboratories and Mead-Johnson requested ac-
cess to the data. The nutritional quality of baby foods is a controversial topic
and one of extreme economic importance to the data requesters. ICCR
argued that it should be able to retain control over the data until its analysts
analyzed and published their findings; the data requesters argued for immedi-
ate access to public records, presumably to get a jump on the ICCR findings.
Similar dilemmas occur with respect to access to company test data on
product safety. Companies argue that test data are protected as trade secrets;
their release could endanger a company's competitiveness in the marketplace.
Other parties see independent analyses of test data as necessary for the deve-
lopment and promulgation of regulations to promote public safety.
There are no obvious resolutions to these kinds of disputes. One party or
the other stands to lose, and the degree of loss is likely to vary substantially
with the particulars of each case. Guidelines on sharing data must be sensi-
tive to protecting the investments of primary researchers, yet when data have
immediate relevance to public policy, the interests of the public in having the
best information possible available for use in the decision-making process
may be judged to outweigh costs to primary researchers. The case for access
is even stronger when the collection of the data was supported by public
funds.
Violations of Confidentiality
With the increasing number of surveys, the growth in administrative record-
keeping, and the potential to link data files, there is growing concern with pro-
tecting the identities of research participants. Sharing data can create prob-
lems of violations of confidentiality and even lead to threats to privacy if iden-
tified data are transferred, if cross-tabulations are run on variables of low fre-
quency (which can result in deductive disclosure), or if a data set is linked
with other data or information sources.
Consider, for example, the case of an education evaluation data set contain-
ing information on principal and teacher attitudes and student performance.
Even if personal identifiers such as Social Security number and name are de-
leted before data are publicly released, a second party with knowledge of the
location of the research and publicly available information on teaching assign-
ments may be able to identify individuals through cross-tabulations of vari-
ables such as race and sex of teacher and grade level of class. If there is only
one black female math teacher at the sixth-grade level, her responses regard-
ing her school, her principal, and her students could become public knowl-
edge. So, perhaps, could the responses of her principal and students.
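The risk of deductive disclosure described above can be checked before release by counting how many records fall in each cell of a cross-tabulation. The following is a minimal sketch in Python, not a procedure taken from the studies cited; the records, the variable names, and the threshold of two records per cell are all hypothetical and chosen only to mirror the teacher example.

```python
from collections import Counter

# Hypothetical released records: names and Social Security numbers are
# already removed, but descriptive variables remain.
records = [
    {"race": "black", "sex": "F", "subject": "math", "grade": 6},
    {"race": "white", "sex": "F", "subject": "math", "grade": 6},
    {"race": "white", "sex": "F", "subject": "math", "grade": 6},
    {"race": "white", "sex": "M", "subject": "math", "grade": 6},
    {"race": "white", "sex": "M", "subject": "math", "grade": 6},
]


def risky_cells(rows, keys, min_count=2):
    """Return cross-tabulation cells that contain fewer than min_count rows."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {cell: n for cell, n in counts.items() if n < min_count}


# Cells listed here describe so few people that a reader with local
# knowledge could attach a name to the responses.
flagged = risky_cells(records, ["race", "sex", "subject", "grade"])
print(flagged)
```

Any cell that is flagged would need to be merged with neighboring categories, suppressed, or otherwise masked before the file is shared.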
Another situation in which researchers face problems in reporting results or
sharing data is in research conducted in private firms. Access to collect data
in a private company may be contingent upon a promise never to identify the
company when reporting results and to deny access to the data by other par-
ties. This puts the researcher in an awkward position of not being able to per-
mit others to verify his or her analyses. A recent article on sex discrimination
in a private company illustrates the difficulty of camouflaging the identity of a
private firm (Hoffman and Reed, 1981). The company asked not to be identi-
fied, but the article described it as a Fortune 500 company and gave descrip-
tive information on the number of employees, number of branch offices, and
administrative organization to a degree that may have compromised the
pledge of confidentiality.
Threats to confidentiality, therefore, should be recognized as an obstacle to
data sharing. While these examples and the previous discussion of data with
special problems indicate that in some circumstances data may be difficult to
share, it is likely that this obstacle is cited much more frequently as an excuse
for denying access than is warranted. There are a variety of mechanisms to
solve the confidentiality problem, including deletion of identifiers, use of
cruder report categories, random subsample release, microaggregation, and
error inoculation (see Campbell et al., 1975).
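Three of the mechanisms just listed can be illustrated in a few lines. The sketch below is only one plausible reading of each term and is not drawn from Campbell et al.: cruder report categories are shown as age bands, microaggregation as replacing each value with the mean of a small group of similar values, and error inoculation as the addition of zero-mean random noise. The function names, the group size, the noise scale, and the salary figures are all assumptions made for illustration.

```python
import random


def coarsen_age(age, width=10):
    """Cruder report categories: map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def microaggregate(values, group_size=3):
    """Replace each value with the mean of its group of similar values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    # Fold a short final group into the previous one so no group is too small.
    if len(groups) > 1 and len(groups[-1]) < group_size:
        groups[-2].extend(groups.pop())
    masked = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked


def inoculate(values, scale, rng):
    """Error inoculation (as read here): add zero-mean random noise."""
    return [v + rng.gauss(0, scale) for v in values]


# Hypothetical salary figures, used only to show the calls.
salaries = [21000, 22000, 23000, 40000, 41000, 45000, 90000]
masked = microaggregate(salaries, group_size=3)
noisy = inoculate(salaries, scale=500, rng=random.Random(0))
print(coarsen_age(37), masked)
```

Microaggregation leaves the overall total unchanged while hiding individual values, which is why it can support many aggregate analyses; random noise trades some precision for protection and should be documented for secondary users.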
Administrative Inconvenience and Cost
Lastly, there are administrative and cost burdens associated with data sharing
that must be balanced against the benefits of access. These burdens largely
fall on primary researchers.
Documentation of data is a major burden; it is a labor-intensive activity,
and if done properly, involves much more than the provision of a simple list of
variables and their format on a computer tape. Robbin (1981) has listed five
major parts to documentation: (1) a general study description, (2) a history of
the project, (3) a summary of the data processing history, (4) a codebook, and
(5) appendices with error listings, glossaries, list of publications, and instruc-
tions for using the data file. For large, complex data files with multiple
waves of data, the documentation task can be expensive and time consuming.
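The five parts of documentation listed above can be treated as a simple checklist to be completed before a file is released. The sketch below is a hypothetical illustration in Python, not a format proposed by Robbin (1981); the class name, field names, and sample text are assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class Documentation:
    """One text field for each of the five parts of documentation."""
    study_description: str = ""
    project_history: str = ""
    processing_history: str = ""
    codebook: str = ""
    appendices: str = ""


def missing_parts(doc):
    """List the documentation parts that are still empty."""
    return [f.name for f in fields(doc) if not getattr(doc, f.name).strip()]


# Hypothetical draft with only two of the five parts written.
draft = Documentation(
    study_description="Evaluation of a school voucher demonstration.",
    codebook="Variable names, labels, and column positions.",
)
print(missing_parts(draft))
```

A check of this kind does not reduce the labor of writing documentation, but it makes gaps visible before a data set leaves the primary researcher's hands.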
After documentation is complete, researchers who are willing to share data
must either notify the research community of the data's availability or make
arrangements to transfer it to an archive. In addition, there are costs of stor-
age and maintenance, as well as costs of updating the file with new informa-
tion or correcting errors. When data requests are received, there are still
more costs associated with providing copies of the data in a form usable by the
requester and responding to inquiries for clarification. These issues are not
treated comprehensively here; estimating time or dollars associated with these
tasks is beyond the scope of this paper and would depend heavily on the nature
of the data set. Nevertheless, researchers who are impatient to proceed with
their own work may find the requirements of data sharing to be an unwelcome
interference.
SUMMARY
This paper has identified a variety of actual and anticipated benefits and obsta-
cles associated with a policy of data sharing. From these discussions, it has
become evident that the benefits and costs of data sharing are not evenly dis-
tributed across all parties. Data requesters, the scientific to bear most of the
costs. The extent to which these benefits and costs will materialize if a policy
of sharing data is institutionalized is an empirical question, in many respects
not yet answerable. Except for large-scale survey efforts, the scientific
community's experiences with data sharing have been spotty and unchroni-
cled; case studies constitute the major source of information and often docu-
ment only one side of a controversy.
NOTES
1. It is worth noting that there are also some arguments for not sharing data. First, some data
are not worth the costs involved in sharing them. In these cases, science might be better served
by committing such data to obscurity rather than by their occupying the time and resources of oth-
er researchers. Second, encouraging reliance on already existing data sets through data sharing
may have negative effects by reducing efforts to collect new data. This can have drawbacks to
the extent that science progresses by analyses of independent data and to the extent that research-
ers become insensitive to the difficulties of collecting good-quality data and simply take variables
on existing data sets at face value. Other justifications for not sharing data, such as time and re-
source burdens on primary researchers, are embedded in the discussions of obstacles to data shar-
ing.
2. Other distinctions, such as those between primary researchers and research funders and be-
tween researchers in private and in public institutions, can also be important. For reasons of par-
simony, these parties have been treated as one group in this chapter. Other papers in this vol-
ume, particularly that of Cecil and Griffin, frequently accord them separate treatment.
3. Subsequently, several other studies refuted or at least failed to confirm Ehrlich's results.
4. Letter to Clifford Hildreth from Stephen Fienberg regarding JASA policy on publication of
data, October 8, 1979.
5. A minority report was also filed by the Public Cryptography Study Group. It argued
against restraints on the publication of nongovernmental cryptography research on several
grounds: national security interests are broader than the interests of NSA; restraints will have ne-
gative effects on research in other fields; unconstitutionality; international complications; legal
complications; the ineffectiveness of such restraints; and a low perceived threat to NSA's crypto-
systems from publication of such material (see Davida, 1981).
REFERENCES
Barker, P.
1974 Preliminary analysis of metropolitan achievement test scores, voucher schools and Title
I schools. Pp. 9~104 in D. Weller, ed., A Public School Voucher Demonstration:
The First Year at Alum Rock, Technical Appendix. Technical Report R-1495/2-NIE.
Santa Monica, Calif.: Rand Corporation.
Bayer, A.E. and Astin, H.S.
1975 Sex differentials in the academic reward system. Science 188:79~802.
Boruch, R.F. and Cordray, D.S.
1980 An Appraisal of Educational Program Evaluations: Federal, State, and Local
Agencies. Washington, D.C.: U.S. Department of Education.
Boruch, R.F. and Reis, J.
1980 The student, evaluative data, and secondary analysis. In R.F. Boruch, ed., New
Directions for Program Evaluation 8:5~72. San Francisco: Jossey-Bass.
Boruch, R.F. and Wortman, P.M.
1978 An illustrative project on secondary analysis. In R.F. Boruch, ed., New Directions for
Program Evaluation. Vol. 4. San Francisco: Jossey-Bass.
Bryant, F.B. and Wortman, P.M.
1978 Secondary analysis: the case for data archives. American Psychologist 33:381-387.
Campbell, D.T., and Erlebacher, A.E.
1970 How regression artifacts in quasi-experimental evaluations can mistakenly make com-
pensatory education look harmful. In J. Hellmuth, ed., The Disadvantaged Child.
Vol. 3. New York: Brunner/Mazel.
Campbell, D.T., Boruch, R.F., Schwartz, R.D., and Steinberg, J.
1975 Confidentiality-preserving modes of access to files and to interfile exchange for useful
statistical analyses. In Protecting Individual Privacy in Evaluation Research.
Committee on Federal Agency Evaluation Research, Assembly of Behavioral and
Social Sciences, National Academy of Sciences, National Research Council.
Washington, D.C.: National Academy of Sciences.
Cochran, N.
1979 On the limiting of properties of social indicators. Evaluation and Program Planning
2: 1~.
Cronbach, L.J., Ambron, S.R., Dornbusch, S.M., Hess, R.D., Hornik, R.C., Phillips, D.C.,
Walker, D.F., and Weiner, S.S.
1980 Toward Reform of Program Evaluation. San Francisco: Jossey-Bass.
Davida, G.I.
1981 The case against restraints on non-governmental research in cryptography. (A minori-
ty report of the Public Cryptography Study Group prepared for the American Council
on Education.) Communications of the Association for Computing Machinery
24(7):445~50.
Director, S.M.
1979 Underadjustment bias in the evaluation of manpower training. Evaluation Quarterly
3(2):190-218.
The Economist
1981 Science and Technology: Export Law Affects Scientific Meetings. January 24:1326.
Edsall, J.T.
1975 Scientific freedom and responsibility: report of the AAAS Committee on Scientific
Freedom and Responsibility. Science 188:687-691.
Ehrlich, I.
1975 The deterrent effect of capital punishment: a question of life or death. American
Economic Review 65:397~17.
Feldstein, M.
1980 Business Week October 6:96.
1982 Social Security and private savings: a reply. Journal of Political Economy
90(3):630-642.
Finifter, B.
1975 Replication and extension of social research through secondary analysis. Social
Science Information 14(2):110153.
Greenwald, A.G.
1976 An editorial. Journal of Personality and Social Psychology 33:1-7.
Groeneveld, L.P., Tuma, N.B., and Harmon, M.T.
1980 Marital dissolution and remarriage: the expected effect of a negative income tax pro-
gram on marital stability. In P.K. Robins, R.G. Spiegelman, and S. Weiner, eds., A
Guaranteed Annual Income: Evidence from a Social Experiment. New York: Academic
Press.
Harris, N.L., Gang, D.L., Quay, S.C., Poppema, S., Zamecnik, P.C., Nelson-Rees, W.A., and
O'Brien, S.
1981 Contamination of Hodgkin's Disease cell cultures. Nature 289:22~230.
Hedrick, T.E., Boruch, R.F., and Ross, J.
1978 On ensuring the availability of evaluative data for secondary analysis. Policy Sciences
9:25~280.
Hilts, P.J.
1981 Research results falsified. The Washington Post February 17:A1, A4.
Hoffman, C. and Reed, J.S.
1981 Sex discrimination? The XYZ affair. The Public Interest 62(Winter):21-39.
Jablon, S.
1979 The Uses of Third-Party Data Sets: A View From the Fence. Paper presented at the an-
nual meetings of the American Statistical Association, Washington, D.C.
Johnston, D.F.
1976 The OMB report: social indicators. Pp. 100-105 in Proceedings of the American
Statistical Association. Part I. Washington, D.C.: American Statistical Association.
Katz, S., Hedrick, S.C., and Henderson, N.
1979 The measurement of long-term care needs and impact. Health and Medical Care
Services Review 2(1):1-21.
Keesling, J.W.
1978 On school attendance and reading achievement. In R.F. Boruch, ed., New Directions
for Program Evaluation. Vol. 4. San Francisco: Jossey-Bass.
Klitgaard, R.
1974 Preliminary analysis of achievement test scores in Alum Rock voucher and nonvoucher
schools. Pp. 10~119 in D. Weller, ed., A Public School Voucher Demonstration:
The First Year at Alum Rock, Technical Appendix. Technical Report R-1495/2-NIE.
Santa Monica, Calif.: Rand Corporation.
Magidson, J.
1977 Toward a causal model approach for adjusting for pre-existing differences in the
nonequivalent control group situation: a general alternative to ANCOVA. Evaluation
Quarterly 1:39020.
Meier, P.
1975 Statistics and medical experimentation. Biometrics 31:521.
Menges, R.J.
1973 Openness and honesty versus coercion and deception in psychological research.
American Psychologist 28:103~1034.
Nature
1980 Research Data: Private Property or Public Good? 284(March):292.
Nelkin, D.
1981 Intellectual Property: The Control of Scientific Information. Paper prepared for the
AAAS Committee on Scientific Freedom and Responsibility.
Public Cryptography Study Group
1981 Report of the Public Cryptography Study Group. Prepared for the American Council
on Education. Communications of the Association for Computing Machinery
24(7):435-444.
Rainwater, L. and Pittman, D.J.
1967 Ethical problems in studying a politically sensitive and deviant community. Social
Problems 14:61-72.
Raizen, S.A. and Rossi, P.H.
1981 Program Evaluation in Education: When? How? To What Ends? Committee on
Program Evaluation in Education, Assembly of Behavioral and Social Sciences,
National Research Council. Washington, D.C.: National Academy Press.
Rezmovic, E.L. and Rezmovic, V.A.
1981 A confirmatory factor analysis approach to construct validation. Educational and
Psychological Measurement 41:7~88.
Rindskopf, D.M.
1978 Secondary analysis: using multiple analytic approaches with Head Start and Title I data.
In R.F. Boruch, ed., New Directions for Program Evaluation 4:7~88. San Francisco:
Jossey-Bass.
Robbin, A.
1981 Technical guidelines for preparing and documenting data. In R.F. Boruch, P.M.
Wortman, and D.S. Cordray, eds., Reanalyzing Program Evaluations. San Francisco:
Jossey-Bass.
Ruebhausen, O.M., and Brim, O.G.
1966 Privacy and behavioral research. American Psychologist 21:423 411.
Shapiro, S.
1979 Release of Primary Data Sets From the Producer's Point of View. Paper presented at
the annual meetings of the American Statistical Association, August 1979.
Sterling, T.D., Pollack, S., and Weinkam, J.J.
1969 Measuring the effect of air pollution on urban morbidity. Archives of Environmental
Health 18:485-494.
Sterling, T.D., and Weinkam, J.J.
1979 What Happens When Major Errors are Discovered Long After an Important Report has
been Published? Paper presented at the annual meetings of the American Statistical
Association, August 1979.
Ware, W.H.
1974 Computer privacy and computer security. Bulletin of the American Society for
Information Science 1:3.
Wolins, L.
1962 Responsibility for raw data. American Psychologist 17:657-658.
1978 Secondary analysis: in published research in the behavioral sciences. In R.F. Boruch,
ed., New Directions for Program Evaluation. Vol. 4. San Francisco: Jossey-Bass.
Wortman, P.M., Reichardt, C.S., and St. Pierre, R.G.
1978 The first year of the education voucher demonstration: a secondary analysis of student
achievement test scores. Evaluation Quarterly 2(2):19~214.