Justifications for and Obstacles to Data Sharing

Terry Elizabeth Hedrick

Terry Elizabeth Hedrick, a social psychologist specializing in program evaluation, is a group director with the U.S. General Accounting Office. The views expressed in this paper are the views of the author and do not necessarily reflect the policies of the U.S. General Accounting Office.

INTRODUCTION

Several types of data sharing have been described in the preceding paper by Boruch, from relatively passive efforts to intensive efforts involving the provision of large computerized data files and extensive accompanying documentation. That the sharing of data in many instances has led to significant benefits is easy to document. Yet a simple, unqualified endorsement of the practice would be both unrealistic and irresponsible. Many parties have interests at stake when data are shared, and the appropriate balancing of these interests is not always clear. The complexity of the issues is attested to by controversies over data release and reanalysis described in the popular press and the scientific literature (see, e.g., Nature, 1980; Feldstein, 1980; Hedrick et al., 1978; Wolins, 1962).

This paper is based on the general premise that data sharing is a desirable and worthwhile practice. Thus it is organized around discussions of justifications for data sharing and obstacles that impede it.1 When possible, actual examples are provided. The lack of empirical information on benefits and problems associated with data sharing means that these discussions may be somewhat biased toward the more controversial cases that have received public attention; in addition, in some cases, only one side of a controversy may have been fully documented. Therefore, this paper should be read as an attempt to identify, rather than to quantify, demonstrated or anticipated benefits from data sharing and obstacles to an across-the-board institutionalization of such a practice. In the paper, the interests of the following parties are identified and discussed:2

Primary researchers: persons originally responsible for collecting or analyzing the data or in some cases for funding its collection and analysis.

Research participants: persons or units from whom data have been collected (people, firms, towns, states, schools, etc.).

Data requesters: researchers or other persons requesting release of data.

Scientific community: all members of the research community.

Society: all persons.

As will be seen, the benefits and burdens of data sharing are not evenly distributed across these parties, and their interests can vary according to the characteristics of each particular case. Guidelines on data sharing must be responsive to the diverse interests and circumstances.

JUSTIFICATIONS FOR DATA SHARING

Justifications for data sharing are based on demonstrated or anticipated benefits for specific parties.
To a large degree, the beneficiaries of data sharing are the scientific community, data requesters, and society; to a lesser degree and under some circumstances, primary researchers and research participants may also realize gains. A variety of benefits associated with data sharing are discussed below.

Reinforcement of Open Scientific Inquiry

One of the most widely held tenets of science is that research should be conducted and reported in a manner that yields sufficient information to enable people other than the original researchers to assess its merits and to replicate
it. While the majority of researchers are likely to interpret this tenet as referring to the provision of careful descriptions of study procedures, as provided for in most journal articles, the provision of data for reanalysis can serve similar functions. The establishment of a policy of data sharing by professional organizations, journals, research institutions, and government could serve to reinforce the openness of scientific inquiry, thereby benefiting the scientific community and society.

Verification, Refutation, or Refinement of Original Results

Probably the most significant benefits realized through the sharing of data stem from reanalyses by other researchers. These benefits include the verification and refinement of original findings and the refutation of them. Secondary analysts may reanalyze data by following the original researcher's methods, thus checking the accuracy of the reported results, or by using competing analytic techniques or sets of assumptions, thus testing the robustness of the original conclusions to alternative approaches. If independent reanalyses are done conscientiously and with visibility, the credibility of the original research may be enhanced.

When research results have entered the policy process, the sharing of data to permit reanalysis is extremely important. Analyses that confirm the original results can help combat political pressures to deny or bury them. The Wortman et al. (1978) reanalysis of data from the Alum Rock Education Voucher Demonstration Program is a good example of a reanalysis that refined the original work. Initial reports on the voucher demonstration posited a relative loss or no gain in reading achievement for the six voucher schools (Barker, 1974; Klitgaard, 1974). Wortman et al.
used a quasi-experimental design with multiple pretests and individual-level data and concluded that the deleterious effect reported earlier was confined to a few nontraditional programs within the six schools.

The complex analyses and large data sets now used in much social science research have increased the susceptibility of findings to statistical and programming errors, errors unlikely to be detected without intensive review or reanalysis of the data. As Martin Feldstein said (1980:96):

When economists deal with large data sets and complex econometric operations, there will be mistakes. If anyone relies on one study, he runs the risk of being misled by an error or statistical fluke. Indeed all models are untrue in the sense that they are crude approximations to the real world.

A dramatic illustration of this susceptibility comes from a reanalysis of Feldstein's own early work, exploring the effect of Social Security on personal saving behavior, by two analysts of the Social Security Administration, Dean Leimer and Selig Lesnoy. At a 1980 conference of the American Economic Association, Leimer and Lesnoy showed that an elementary computer programming error led Feldstein to greatly overestimate the negative impact of Social Security on saving behavior. Although Feldstein later took issue with the Leimer and Lesnoy claim that the introduction of Social Security has not substantially reduced personal saving, he acknowledged the programming error and stressed that such replication studies are at the core of the scientific tradition (Feldstein, 1982).

The Ehrlich research on the deterrent effect of capital punishment is a classic case of research findings quickly entering the policy process without provision for timely reanalysis by other interested parties. In 1975 Ehrlich published an article claiming that between 1935 and 1969, each execution in this country prevented seven to eight murders (Ehrlich, 1975). At the time of publication, the Supreme Court (in Fowler v. North Carolina) was reconsidering its 1972 decision declaring capital punishment unconstitutional, and U.S. Solicitor General Robert Bork used the study results in an amicus curiae brief filed by the Justice Department to argue for the reinstitution of capital punishment. The data on which the research was based were not immediately available to other researchers, so it was impossible for other parties to determine the quality of the work.3

Examples of errors in analyses or assumptions that led to distorted or incorrect results are not difficult to locate. Steven Director's (1979) work in the area of evaluations of employment and training programs, for instance, demonstrated that past evaluations had used approaches that probably underestimated the impacts of employment and training programs on postprogram earnings of enrollees.
Campbell and Erlebacher (1970), using a simulation technique, concluded that many evaluations of compensatory education programs were likely to have suffered from similar problems. Magidson's (1977) and Rindskopf's (1978) applications of competing analytic techniques to Head Start and Title I data were based on similar concerns.

The evaluation field is not necessarily more prone to these problems than other fields. Wolins's efforts in the early 1960s to acquire and reanalyze several small data sets underlying articles in psychology journals were based on a suspicion that the original analysts had used inappropriate analytic techniques (Wolins, 1962). More recent work by Wolins (1978), a secondary analysis of Bayer and Astin's (1975) data on faculty salaries, was concerned with problems of nonadditivity and irrelevant variance in the predictors and challenged the conclusion that the data supported a finding of a sex differential in the academic reward system.

These kinds of concerns have motivated several observers to call for simultaneous or serial analyses of evaluative data sets, arguing that data with significant potential for influencing public policy should undergo analysis by several different researchers (Cronbach et al., 1980; Boruch and Cordray, 1980; Raizen and Rossi, 1981). Cronbach's 65th thesis of program evaluation states: "In any primary statistical investigation, analyses by independent teams should be made before the report is distributed" (Cronbach et al., 1980). That such a prepublication reanalysis policy can be beneficial is corroborated by the experience of Stephen Fienberg, who, as editor of the Journal of the American Statistical Association (JASA), required authors of manuscripts to simultaneously submit copies of data. In Fienberg's judgment, many of the articles submitted for review were subsequently strengthened by alternative analyses conducted by journal referees.4 Bryant and Wortman (1978) have proposed that similar procedures should be adopted to govern submissions to psychology journals.

The benefits to the public from sharing policy-relevant data to permit verification and refutation of the original conclusions are fairly obvious. Benefits to the scientific community may also result to the extent that faulty studies are not published and, therefore, do not lead other researchers astray, shaping the directions of future research until sufficient numbers of conflicting studies terminate that avenue of inquiry. Public confidence in the worth of research might also be improved. Finally, as Fienberg's experience with JASA illustrates, this is one circumstance in which primary researchers may also profit from the sharing of data.

Replications With Multiple Data Sets

Conclusions drawn from the analysis of a single data set are heavily dependent on the quality of that data set and are subject to distortion from its idiosyncrasies (its scope, format, method of collection, etc.).
The confidence one places in research conclusions can be greatly increased by consistency of results across data sets; conversely, inconsistencies in results across data sets lead one to view research results with skepticism and to engage in a careful exploration of possible reasons underlying those inconsistencies.

The advancement of knowledge, especially in the social sciences, has been hindered by single studies that capture the fancy of a discipline and send researchers off on extended efforts to replicate, refute, or refine the findings of the original study. The pressure for academic researchers to publish is one cause of this reactive approach. Researchers are sorely tempted to publish each study, rather than to pursue a systematic line of inquiry through the execution of multiple studies. In partial response to the proliferation of one-shot studies (and an extremely high submission rate), the Journal of Personality and Social Psychology in 1976 instituted an editorial policy that encouraged more systematic research efforts and stronger support for conclusions (Greenwald, 1976).

Researchers should be encouraged to complement their collection and analysis efforts with analyses of other existing data appropriate for addressing the same questions. To the extent that a policy of sharing data makes researchers aware of other data sets suitable for their needs and encourages the publication of articles that demonstrate similar findings across multiple data sets, sharing data can benefit the scientific community and increase public confidence in research findings.

Exploration of New Questions

In many cases, especially with large surveys or evaluative data sets, the primary researcher's interest in a data set may encompass only a small part of the data set's potential usefulness. Providing other analysts access to such data sets would permit additional benefits to be obtained from the original investments in data collection. In this respect, evaluative data sets collected by private research firms under government contract are one of the most underused sources of information. Contract research firms necessarily must direct their analytic efforts to address specific questions posed by the sponsor agency; time and resource constraints are likely to prevent analysts from branching out and exploring additional questions when the sponsor considers these questions subsidiary and outside the original scope of work. Consequently, these kinds of data frequently pass into oblivion without other parties being aware of them. Hedrick et al. (1978) have provided a discussion of four general categories of obstacles to the acquisition of evaluative data: problems in locating a particular data set and authority for its release; insufficient documentation; inappropriate aggregation; and delays and refusals to data requests. Many of these obstacles are applicable to the present discussions.
A positive example of an effort to increase returns from investments in data collection can be found in the Employment and Training Administration's (ETA) dissemination and support activities with respect to the production of public-use tapes from the Continuous Longitudinal Manpower Survey. This survey collected information from quarterly samples of enrollees in CETA programs (employment and training services delivered under the Comprehensive Employment and Training Act) and includes, or will eventually include, three years of postprogram labor force and welfare participation data, as well as Social Security earnings information over an extended period. ETA's interest in the survey was initially largely confined to descriptions of the characteristics of CETA enrollees and estimates of earnings gains from CETA participation, but other researchers have been encouraged to exploit the data set for other purposes. From such data-sharing efforts, benefits accrue to data requesters, the scientific community, and society.
Creation of New Data Sets Through Data File Linkages

Another benefit obtainable through the sharing of data is the opportunity to create new data sets by linking two or more existing sources of information. As will be discussed in the section on obstacles to data sharing, this procedure can raise problems of violations of confidentiality, possibly even of privacy (through outright or deductive disclosure of identities), but the potential exists for researchers to address new questions or refine their inquiries into old ones by expanding the kinds and amounts of information available.

The Continuous Longitudinal Manpower Survey is also a good example of data linkage, since it involves linking information from CETA agency files, enrollee interviews, and Social Security records to create a single data file rich in detail about CETA participants. On a smaller scale, researchers have combined media reports of daily pollution levels in Los Angeles with Blue Cross of California records of cause, frequency of admission, and length of hospital stay to assess the effects of air pollution on urban morbidity (Sterling et al., 1969). Keesling (1978) creatively merged his own data on school attendance rates with information on reading test scores to examine the contribution of school attendance to achievement-test performance. Again, data requesters, the scientific community, and society are the major beneficiaries of this type of data sharing.

Encouragement of Multiple Perspectives

Every scientific discipline has its own blinders with respect to methodologies, analytic techniques, and the phrasing of research questions. Even the selection of outcome indicators often involves making value judgments about the desirability of certain types of behavior or the characteristics possessed by certain groups of people (Cochran, 1979; Johnston, 1976).
The findings of marital instability from the Negative Income Tax Experiments are a case in point (Groeneveld et al., 1980): increases in divorce rates are viewed by some as a positive indicator of women's emancipation; others view them as a negative indicator of the breakdown of the traditional family. Thus analysts may interpret identical variables from different perspectives. Of course, they may also select different variables to address the same questions.

The analytic techniques employed by researchers may also be a function of disciplinary background. Unfortunately, decisions to employ input-output models, analysis of covariance, multiple regression, causal modeling, or other techniques frequently derive less from the nature of the question at hand or the appropriateness of the technique for the data than from a researcher's personal training or past experience. Researchers are most comfortable with analytic techniques that are familiar to them, and for that reason they can be indiscriminate in their use of the techniques.

Data sharing, if it can be extended across disciplinary lines (a large if), has the potential to benefit almost all parties. Sharing data may encourage cross-disciplinary work, permitting questions to be viewed from diverse viewpoints, and it may broaden the perspectives of researchers, including primary researchers, by exposing them to new viewpoints, methodologies, and analytic techniques. To the extent that a broader perspective is taken and the development of knowledge is enhanced, society should profit from the sharing of data.

Reductions in the Incidence of Faked and Inaccurate Results

The existence of dishonesty in science is becoming more and more difficult to ignore. In the past few years, disclosures of hoaxes such as the Piltdown man and Cyril Burt's fabricated data on the inheritance of intelligence have sensitized the scientific community and society to the issue of dishonesty in science. More recently, controversies concerning Dr. John Long's cultures of human Hodgkin's disease cells, which turned out to be monkey cells (Harris et al., 1981), and the 30 drug researchers discovered by the Food and Drug Administration to be faking data or otherwise being dishonest (Hilts, 1981) highlight weaknesses in the current system of peer review and promise to keep public attention on the issue. The costs of errors or dishonesty in science go beyond public loss of confidence in the objectivity of science; in the Long case, several researchers wasted considerable time working with cultures of monkey instead of human cells. Not only did the researchers bear costs in terms of advancement of their careers, but society may have borne costs through time lost in research on cancer. When dishonest science leads other researchers astray or influences policy decisions on public programs, it can have negative effects on the welfare of society that range far beyond the scope of the original research.
Again, motivations for dishonest behavior must be at least partially attributed to the intense competition to publish and to obtain grant money. A policy of open access to data, while far from a complete solution to the problem, might serve as a deterrent to the faking of data and dishonest reporting of research results. The extent of the deterrent potential of open access is unknown, but if even a few researchers are discouraged from dishonesty by fear of discovery and exposure, a data-sharing policy may be cost effective.

Unintentional mistakes are a wholly different problem, and respect is warranted for those researchers who, when shown errors in their work, acknowledge the problems. Although researchers are assumed to check the accuracy of their work carefully, the pressure to work and publish quickly, the complexity of many analytic techniques, and simple human fallibility not surprisingly sometimes result in errors. Once again, an acknowledged policy of open access to data and the attendant risk that one's mistakes could be publicly exposed might increase the attention researchers give to their work and, therefore, might improve its quality. Both the scientific community and society would benefit from any reductions in errors resulting from an open-access policy.

Development of Knowledge About Analytic Techniques and Research Designs

Secondary analysis is a fruitful activity for the production of information on analytic techniques and research designs. "To the conscientious analyst, there is often no single, generally accepted way to deal with data stemming from an evaluation" (Rindskopf, 1978:15). The further the study design deviates from the classical experiment, the more confusing the choice of analysis approaches becomes. In evaluation, and presumably in most areas of science, acquiring data sets to explore new analytic approaches promises future benefits to many parties through the development of information on the strengths and weaknesses of analytic approaches. It can also provide justifications for better research designs to detect what are frequently small and elusive treatment (program) effects. At Northwestern University, participants in the Project on Secondary Analysis have been engaged in efforts to advance the state of the art in analyzing effects of education programs since 1974, identifying problems in drawing causal inferences, investigating methods of using multiple analytic approaches to provide evidence of convergent validation, and exploring the biases that result when assumptions of various analytic methods are violated (Boruch and Wortman, 1978).
Provision of Resources for Training

The availability of data for secondary analysis offers benefits for training students, especially in statistics and methodology. Boruch and Reis (1980) have documented a wide variety of payoffs to students from engaging in secondary analysis: reduction in the time necessary to get to the analysis stage of research, lower research costs, gains in knowledge about the nature of evidence, increased experience with analytic procedures, early exposure to the untidy world of applied research in comparison with the world of the textbook, and early entry into discussions of public policy. Fields such as economics have traditionally relied heavily on data collected by others for training graduate students. For postdoctoral programs, which may allow only a one-year training period for exposure to a new area of inquiry, access to data collected by others may be a necessity if a trainee is to be able to follow a project to completion.

Improved training is a benefit in and of itself, but in many cases, students engaging in secondary analysis have made significant contributions to theory, methodology, and statistics. Magidson's (1977) work with competing analyses of Head Start data, Rindskopf's (1978) analyses of Head Start and Title I data, and Rezmovic and Rezmovic's (1981) efforts to test theories underlying the measurement of psychological traits are but a few examples of high-quality and useful secondary analysis work by students or postdoctoral trainees.

Student requests for the sharing of data are probably more likely to run into obstacles or outright refusals than requests from more established researchers. A policy of open access to data should reduce these obstacles and increase both the quality of student training and students' resources for making significant contributions to their professions and to public policy. A side benefit is that early experiences with data sharing may sensitize these new professionals to both the legitimacy of others' data requests and the need for thorough documentation for their own data.

Reduction of Respondent Burden

A concern of many researchers is to reduce the response burden on research participants, whether participants are students in undergraduate survey courses or people in government programs. This concern is especially salient in government research, for which the clearance procedures of the Office of Management and Budget call for budgets (in terms of hours) to be submitted estimating the amount of respondent burden associated with ongoing or planned data collection efforts. If appropriate data already exist that are suitable for answering a researcher's questions, there is little justification for imposing additional respondent burden.
Uncontrolled data collection runs the risks of depleting research subject pools and endangering the future cooperation of participants. The sharing of data, by preventing redundant data collection, can benefit both the scientific community and society.

OBSTACLES TO DATA SHARING

Obstacles to data sharing range from clearly illegitimate refusals for data access by primary researchers who fear criticism to legitimate refusals based on national security considerations. In between are many gray areas in which the legitimacy of refusing access is not easily resolved or in which resources and effort are needed before data sharing can become possible. Low-cost solutions are readily apparent for some of these impediments, moderate- or high-cost solutions will resolve others, and a few appear intransigent.

The discussion of justifications for data sharing demonstrated that most of the actual or anticipated benefits are received by three parties: data requesters, the scientific community, and society. Primary researchers and research participants are by definition members of these groups and, therefore, they also receive benefits. However, as the following discussion of obstacles indicates, primary researchers and research participants bear the brunt of the costs and risks attendant to data sharing, much more so than other parties. Clearly, the costs and benefits of data sharing are not distributed evenly among the parties involved.

Concern About the Qualifications of Data Requesters

When a primary researcher has reservations about the qualifications of a data requester, he or she may be reluctant to share data for several reasons. First, the primary researcher may anticipate that the data requester will require extensive assistance in developing specifications for the exact variables desired and the most appropriate format for transfer of the data. Second, if the data requester does not have experience with comparable data sets, the primary researcher may anticipate having to respond to repeated requests for guidance concerning interpretation of variables, computer programming, and analytic procedures. Third, and perhaps most importantly, the primary researcher may fear that analyses performed by the requester will be of poor quality and that significant amounts of time will be necessary to review, critique, and rebut those analyses. Finally, the researcher may fear that the data set itself and the original analyses will lose credibility if poor reanalyses are performed that elicit criticism in the scientific community.
(These concerns may be exacerbated if the researcher perceives the requester to have personal interests in analyzing the data to demonstrate a particular outcome.) The criticism issue is broached again later in this paper, but it should be noted that there is no evidence that incompetent reanalyses come to overshadow competent primary analyses. Also, the same problem of time-consuming debate between investigators critiquing each others' analyses exists for research that does not involve reanalyses of data; therefore, this problem is not peculiar to secondary analysis.

National Security Considerations

Most researchers recognize that national security can sometimes be a compelling reason for nonrelease of data or even nonpublication of results, although some have argued that even under national security constraints, data
Fear and Costs of Criticism

Two different bases can underlie a fear of criticism from secondary analysis: fear that purposive distortions of study results or faked data will be exposed, and fear that the secondary analyst will find fault with the original researcher's work.

From society's standpoint, exposure of faked data and correction of inaccurate results are appropriate justifications for the sharing of data; from the primary researcher's standpoint, sharing data is a risky enterprise. It is always uncomfortable to have someone checking your work. Fear of criticism by dishonest researchers is easily understood, but there are also reasons for honest and careful researchers to experience trepidation at the thought of secondary analysis. Regardless of the care given to the original analyses and interpretation, the complexity of many analytic techniques and the rapid advent of new methods of analysis make criticism by a secondary analyst a not unlikely possibility. Such criticism may at times be warranted; in other cases, it may be shoddy and unjustified. It can have two kinds of costs for the researcher. First, it may threaten his or her reputation and esteem and, to an unknown degree, interfere with obtaining money for future work. Second, as mentioned earlier, responding to criticism may become a significant drain on a primary researcher's time.

It is unclear what impact criticism and debate have on society. For instance, the debates over econometric analyses of the deterrent effects of capital punishment have been vigorous since Ehrlich's original work. But we do not know what effects that controversy had, if any, on legislators, their staffs, or the public. For that matter, it is unclear whether distinctions are ultimately made between poorly informed criticism and sound criticism. Does the best work sift out by the end of a debate?
What is clear is that there is concern that sharing data will yield criticism entailing substantial costs. Objections have been raised, for example, to the policy recommendations that all major program evaluations be subjected to secondary analyses. The argument by federal managers responsible for evaluation is that the opponents of the results of an evaluation will exploit the opportunity to attack even when faced with a good product. Few incentives are perceived to exist for confirming the results of the original researchers; instead, secondary analysis may be viewed as an opportunity to make a reputation by refuting the work of others.

As more information is shared, research often becomes more vulnerable to criticism. The University Group Diabetes Program, a research effort to contrast multiple therapies for mild diabetes in middle-aged and elderly people, responsibly reported detailed information on baseline variables for each type of therapy group. By doing so, it opened the door to critics (and opponents). As Meier (1975) and others (Jablon, 1979; Sterling and Weinkam, 1979) have noted, twentieth-century science may have more of an adversarial atmosphere than one of dispassionate, open inquiry (Meier, 1975:521):

Where the study is conducted with even greater care, on the other hand, and many baseline variables are reported, demands for endless subanalyses and recombinations, not to mention recriminations, seem to be the inevitable result. Of course, I regard more information to be better than less, and I certainly do not mean to suggest that one should refrain from reporting the values of all variables of interest. My complaint is with the carnivorous appetites among critics which seem to be generated by this kind of raw meat.

Poor Communication

There are two very different types of communication problems that are of concern: problems of identifying and locating a data set appropriate for one's needs and problems of resolving disputes when secondary analyses lead to different results.

To begin with, an investigator interested in a particular topic must be aware that a data set appropriate for his or her needs exists; the lack of mechanisms for matching research needs to existing data is a major impediment to secondary analysis (Finifter, 1975). Guidelines from the Office of Federal Statistical Policy and Standards that require federal agencies to submit standardized abstracts summarizing public-use, machine-readable data files for input into the Department of Commerce's Directory of Federal Statistical Data Files promise a partial solution to this problem. Focused efforts, such as the archive of data on long-term care at Michigan State University (Katz et al., 1979), can also contribute. A variety of data archives and mechanisms to identify data sets are discussed by Clubb et al. (in this volume). Robbin (1981), for instance, has proposed guidelines for bibliographic policies to ensure access to information on machine-readable data files.
Even so, if data sets were not originally intended for use by others and were not products of federal agency collection efforts, the matching of researcher interests and data sources is likely to be a hit-or-miss activity relying on informal communication networks.

If a researcher does identify a data set appropriate for his or her needs, the next step is to locate the people with authority to grant release of the data. While this may sound so straightforward as to not be worth mentioning, experience has shown otherwise (see Hedrick et al., 1978). Even though data collected with public funds are assumed to be public property, federal agencies do not always obtain and archive federally funded data sets. If a data set has been left in the control of the original researchers, there is no guarantee that it will still exist at the time of the access request. Or, if a data set does exist, the people most closely involved with data collection, data processing, and analysis may have changed interests or jobs, may have died, or may not be willing to expend their time to organize, explain, and transfer the data.

The second problem of communication between users, that arising from challenges to the accuracy of the original analysis, has yet to be addressed in any systematic fashion. As mentioned previously, some researchers have characterized twentieth-century science as an adversary process (Jablon, 1979; Sterling and Weinkam, 1979). For instance, Sterling and Weinkam have described in detail their efforts to seek corrections and clarifications regarding misclassifications of persons in the data of the Dorn Study of Mortality among U.S. veterans. While their account gives only one side of the controversy, it does not breed optimism with respect to the resolution of such disputes (Sterling and Weinkam, 1979:1):

    Our experience indicates that reactions to discovery of and attempts to correct errors in scientific studies are similar to those met by consumer attempts to deal with errors in large commercial computerized procedures. Because scientific "management" appears to opt for an adversary rather than cooperative mode of responding to discovery of errors, much of the value may be lost which secondary analysis has for verifying the validity of past work.

When a secondary analyst has sufficient information to replicate the original analysis and obtains different results, the solution to the communication problem may lie in simply offering the primary analyst (and funding agency) an opportunity to review and comment on the new analysis. If no comment is forthcoming, the secondary analyst then proceeds with publication. When, however, additional information from the primary researcher is required to identify the source of the discrepancy and an adversarial attitude wins out over a cooperative one, the benefits associated with the sharing of data may not materialize.
Data Set Inadequacies

A number of obstacles to data sharing stem from inadequate preparation and retention of data sets. If data are located and agreement is forthcoming for data release, the lack of good documentation can reduce or completely prevent exploitation of a data set by a secondary analyst. Here documentation is meant in its widest sense: sampling frames and study design, copies of the original data collection instruments, data collection procedures, validity information, data transformations, aggregation procedures, procedures followed in creating new variables, etc. Having information on the physical format of data tapes is not enough; some of the most valuable information for a secondary user of data is information about the strengths or weaknesses of data set items. Unfortunately, this type of information is often not treated formally and exists only in the memory of the original researcher. To the extent that formal documentation is poor or not available and the primary researcher is uncooperative or unavailable, major obstacles exist to the productive use of the data by other parties. It may be next to impossible to create documentation after the fact, or the time necessary to disentangle formats may prove prohibitive.

If use of data by other parties has not been anticipated by the primary researcher, data requesters may find that the data have not been properly archived and maintained. Information recorded on magnetic tapes or disks can deteriorate rapidly unless stored under proper conditions. Placing a data tape on a shelf in one's office provides far from optimal storage conditions; storing card copies of data in a garage or basement is likewise undesirable. Backup copies of data sets should be created and carefully maintained as a matter of normal research procedure.

Recognition and Proprietary Concerns of Primary Researchers

A researcher who invests time and resources in the collection and processing of data deserves the first opportunity to analyze those data and make a contribution to his or her field. Release of data before a primary researcher has had a reasonable opportunity to capitalize on those efforts would be an enormous disservice to researchers and would discourage future data collection. In many cases, the concerns of primary researchers for recognition will not be an obstacle to data release, since data requesters will have learned of the data set's existence from the published work of the original researcher. Even in this circumstance, however, problems arise when the requester desires the data for a purpose that overlaps with the data collector's future research plans.
Questions concern the scope of data available to other parties (the entire data set or only the portion supporting published material); the specification of a reasonable time period for the primary researcher to retain control of the data; the extent of the primary researcher's obligation to release data for purposes that may overlap with future research plans; and the definition of when data enter the public domain: upon publication, upon presentation at a convention, upon use in court, upon communication in some form of professional correspondence, etc.

The proprietary issue is treated in detail in the Cecil and Griffin paper in this volume, and the complex issues of private versus public ownership are not discussed further here. It is worth noting, however, that the reward structure for scientists is undergoing a significant change. Researchers working in such fields as genetic engineering and econometrics are finding that there are commercial rewards for their work that may supersede the goal of the cooperative pursuit of knowledge (Nelkin, 1981). Since constrained access to research results can increase their commercial value, this shift in the reward structure may become a major obstacle to data sharing.

A controversy between a public interest group, the Interfaith Center on Corporate Responsibility (ICCR), and Abbott Laboratories and Mead-Johnson (see Nature, 1980) illustrates several of the recognition and proprietary issues. ICCR conducted a nationwide survey of infant feeding practices and sought assistance from the Center for Disease Control (CDC) in providing computer resources for analysis of the survey responses. The records of CDC, a public agency, are open to scrutiny by others under the Freedom of Information Act, and Abbott Laboratories and Mead-Johnson requested access to the data. The nutritional quality of baby foods is a controversial topic and one of extreme economic importance to the data requesters. ICCR argued that it should be able to retain control over the data until its analysts analyzed and published their findings; the data requesters argued for immediate access to public records, presumably to get a jump on the ICCR findings.

Similar dilemmas occur with respect to access to company test data on product safety. Companies argue that test data are protected as trade secrets; their release could endanger a company's competitiveness in the marketplace. Other parties see independent analyses of test data as necessary for the development and promulgation of regulations to promote public safety.

There are no obvious resolutions to these kinds of disputes. One party or the other stands to lose, and the degree of loss is likely to vary substantially with the particulars of each case. Guidelines on sharing data must be sensitive to protecting the investments of primary researchers, yet when data have immediate relevance to public policy, the interests of the public in having the best information possible available for use in the decision-making process may be judged to outweigh costs to primary researchers.
The case for access is even stronger when the collection of the data was supported by public funds.

Violations of Confidentiality

With the increasing number of surveys, the growth in administrative record-keeping, and the potential to link data files, there is growing concern with protecting the identities of research participants. Sharing data can create problems of violations of confidentiality and even lead to threats to privacy if identified data are transferred, if cross-tabulations are run on variables of low frequency (which can result in deductive disclosure), or if a data set is linked with other data or information sources.

Consider, for example, the case of an education evaluation data set containing information on principal and teacher attitudes and student performance. Even if personal identifiers such as Social Security number and name are deleted before data are publicly released, a second party with knowledge of the location of the research and publicly available information on teaching assignments may be able to identify individuals through cross-tabulations of variables such as race and sex of teacher and grade level of class. If there is only one black female math teacher at the sixth-grade level, her responses regarding her school, her principal, and her students could become public knowledge. So, perhaps, could the responses of her principal and students.

Another situation in which researchers face problems in reporting results or sharing data is in research conducted in private firms. Access to collect data in a private company may be contingent upon a promise never to identify the company when reporting results and to deny access to the data by other parties. This puts the researcher in the awkward position of not being able to permit others to verify his or her analyses. A recent article on sex discrimination in a private company illustrates the difficulty of camouflaging the identity of a private firm (Hoffman and Reed, 1981). The company asked not to be identified, but the article described it as a Fortune 500 company and gave descriptive information on the number of employees, number of branch offices, and administrative organization to a degree that may have compromised the pledge of confidentiality.

Threats to confidentiality, therefore, should be recognized as an obstacle to data sharing. While these examples and the previous discussion of data with special problems indicate that in some circumstances data may be difficult to share, it is likely that this obstacle is cited much more frequently as an excuse for denying access than is warranted. There are a variety of mechanisms to solve the confidentiality problem, including deletion of identifiers, use of cruder report categories, random subsample release, microaggregation, and error inoculation (see Campbell et al., 1975).
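The low-frequency cross-tabulation problem described above can be made concrete with a short sketch. Before releasing a file, a data producer can count the members of every cross-tabulation cell and flag cells small enough to identify an individual, the usual first step before suppressing those cells or recoding to cruder categories. This is an illustrative sketch only, not a procedure from the chapter; the field names, threshold, and toy roster are hypothetical.

```python
# Illustrative sketch of a deductive-disclosure check: flag any
# cross-tabulation cell whose frequency falls below a threshold.
# Such cells are candidates for suppression or coarser recoding
# before the data set is released. All names here are hypothetical.
from collections import Counter

def low_frequency_cells(records, fields, threshold=5):
    """Return the cross-tabulation cells over `fields` that contain
    fewer than `threshold` records, mapped to their counts."""
    counts = Counter(tuple(rec[f] for f in fields) for rec in records)
    return {cell: n for cell, n in counts.items() if n < threshold}

# Toy roster echoing the chapter's example: a single black female
# sixth-grade teacher forms a unique, identifiable cell.
roster = [
    {"race": "white", "sex": "F", "grade": 6},
    {"race": "white", "sex": "F", "grade": 6},
    {"race": "white", "sex": "M", "grade": 6},
    {"race": "black", "sex": "F", "grade": 6},
]
risky = low_frequency_cells(roster, ["race", "sex", "grade"], threshold=2)
# risky now holds the singleton cells ("black", "F", 6) and ("white", "M", 6)
```

A threshold-based scan of this kind underlies several of the mechanisms listed above: cells that fail the check can be dropped (suppression), merged into broader categories (cruder reporting), or averaged with neighbors (microaggregation).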
Administrative Inconvenience and Cost

Lastly, there are administrative and cost burdens associated with data sharing that must be balanced against the benefits of access. These burdens fall largely on primary researchers.

Documentation of data is a major burden; it is a labor-intensive activity and, if done properly, involves much more than the provision of a simple list of variables and their format on a computer tape. Robbin (1981) has listed five major parts to documentation: (1) a general study description, (2) a history of the project, (3) a summary of the data processing history, (4) a codebook, and (5) appendices with error listings, glossaries, a list of publications, and instructions for using the data file. For large, complex data files with multiple waves of data, the documentation task can be expensive and time consuming.

After documentation is complete, researchers who are willing to share data must either notify the research community of the data's availability or make arrangements to transfer the data to an archive. In addition, there are costs of storage and maintenance, as well as costs of updating the file with new information or correcting errors. When data requests are received, there are still more costs associated with providing copies of the data in a form usable by the requester and responding to inquiries for clarification. These issues are not treated comprehensively here; estimating the time or dollars associated with these tasks is beyond the scope of this paper and would depend heavily on the nature of the data set. Nevertheless, researchers who are impatient to proceed with their own work may find the requirements of data sharing to be an unwelcome interference.

SUMMARY

This paper has identified a variety of actual and anticipated benefits and obstacles associated with a policy of data sharing. From these discussions, it has become evident that the benefits and costs of data sharing are not evenly distributed across all parties. Data requesters, the scientific community, and society stand to reap most of the benefits, while primary researchers are likely to bear most of the costs. The extent to which these benefits and costs will materialize if a policy of sharing data is institutionalized is an empirical question, in many respects not yet answerable. Except for large-scale survey efforts, the scientific community's experiences with data sharing have been spotty and unchronicled; case studies constitute the major source of information and often document only one side of a controversy.

NOTES

1. It is worth noting that there are also some arguments for not sharing data. First, some data are not worth the costs involved in sharing them. In these cases, science might be better served by committing such data to obscurity rather than by their occupying the time and resources of other researchers. Second, encouraging reliance on already existing data sets through data sharing may have negative effects by reducing efforts to collect new data. This can have drawbacks to the extent that science progresses by analyses of independent data and to the extent that researchers become insensitive to the difficulties of collecting good-quality data and simply take variables on existing data sets at face value. Other justifications for not sharing data, such as time and resource burdens on primary researchers, are embedded in the discussions of obstacles to data sharing.

2. Other distinctions, such as those between primary researchers and research funders and between researchers in private and in public institutions, can also be important. For reasons of parsimony, these parties have been treated as one group in this chapter. Other papers in this volume, particularly that of Cecil and Griffin, frequently accord them separate treatment.

3. Subsequently, several other studies refuted or at least failed to confirm Ehrlich's results.

4. Letter to Clifford Hildreth from Stephen Fienberg regarding JASA policy on publication of data, October 8, 1979.

5. A minority report was also filed by the Public Cryptography Study Group. It argued against restraints on the publication of nongovernmental cryptography research on several
grounds: national security interests are broader than the interests of NSA; restraints will have negative effects on research in other fields; unconstitutionality; international complications; legal complications; the ineffectiveness of such restraints; and a low perceived threat to NSA's cryptosystems from publication of such material (see Davida, 1981).

REFERENCES

Barker, P.
1974 Preliminary analysis of metropolitan achievement test scores, voucher schools and Title I schools. Pp. 9-104 in D. Weiler, ed., A Public School Voucher Demonstration: The First Year at Alum Rock, Technical Appendix. Technical Report R-1495/2-NIE. Santa Monica, Calif.: Rand Corporation.
Bayer, A.E., and Astin, H.S.
1975 Sex differentials in the academic reward system. Science 188:796-802.
Boruch, R.F., and Cordray, D.S.
1980 An Appraisal of Educational Program Evaluations: Federal, State, and Local Agencies. Washington, D.C.: U.S. Department of Education.
Boruch, R.F., and Reis, J.
1980 The student, evaluative data, and secondary analysis. In R.F. Boruch, ed., New Directions for Program Evaluation 8:5-72. San Francisco: Jossey-Bass.
Boruch, R.F., and Wortman, P.M.
1978 An illustrative project on secondary analysis. In R.F. Boruch, ed., New Directions for Program Evaluation. Vol. 4. San Francisco: Jossey-Bass.
Bryant, F.B., and Wortman, P.M.
1978 Secondary analysis: the case for data archives. American Psychologist 33:381-387.
Campbell, D.T., and Erlebacher, A.E.
1970 How regression artifacts in quasi-experimental evaluations can mistakenly make compensatory education look harmful. In J. Hellmuth, ed., The Disadvantaged Child. Vol. 3. New York: Brunner/Mazel.
Campbell, D.T., Boruch, R.F., Schwartz, R.D., and Steinberg, J.
1975 Confidentiality-preserving modes of access to files and to interfile exchange for useful statistical analyses. In Protecting Individual Privacy in Evaluation Research. Committee on Federal Agency Evaluation Research, Assembly of Behavioral and Social Sciences, National Research Council. Washington, D.C.: National Academy of Sciences.
Cochran, N.
1979 On the limiting properties of social indicators. Evaluation and Program Planning 2:1.
Cronbach, L.J., Ambron, S.R., Dornbusch, S.M., Hess, R.D., Hornik, R.C., Phillips, D.C., Walker, D.F., and Weiner, S.S.
1980 Toward Reform of Program Evaluation. San Francisco: Jossey-Bass.
Davida, G.I.
1981 The case against restraints on non-governmental research in cryptography. (A minority report of the Public Cryptography Study Group prepared for the American Council on Education.) Communications of the Association for Computing Machinery 24(7):445-450.
Director, S.M.
1979 Underadjustment bias in the evaluation of manpower training. Evaluation Quarterly 3(2):190-218.
The Economist
1981 Science and Technology: Export Law Affects Scientific Meetings. January 24:1326.
Edsall, J.T.
1975 Scientific freedom and responsibility: report of the AAAS Committee on Scientific Freedom and Responsibility. Science 188:687-691.
Ehrlich, I.
1975 The deterrent effect of capital punishment: a question of life or death. American Economic Review 65:397-417.
Feldstein, M.
1980 Business Week October 6:96.
1982 Social Security and private savings: a reply. Journal of Political Economy 90(3):630-642.
Finifter, B.
1975 Replication and extension of social research through secondary analysis. Social Science Information 14(2):110-153.
Greenwald, A.G.
1976 An editorial. Journal of Personality and Social Psychology 33:1-7.
Groeneveld, L.P., Tuma, N.B., and Hannan, M.T.
1980 Marital dissolution and remarriage: the expected effect of a negative income tax program on marital stability. In P.K. Robins, R.G. Spiegelman, and S. Weiner, eds., A Guaranteed Annual Income: Evidence from a Social Experiment. New York: Academic Press.
Harris, N.L., Gang, D.L., Quay, S.C., Poppema, S., Zamecnik, P.C., Nelson-Rees, W.A., and O'Brien, S.
1981 Contamination of Hodgkin's disease cell cultures. Nature 289:228-230.
Hedrick, T.E., Boruch, R.F., and Ross, J.
1978 On ensuring the availability of evaluative data for secondary analysis. Policy Sciences 9:25-280.
Hilts, P.J.
1981 Research results falsified. The Washington Post February 17:A1, A4.
Hoffman, C., and Reed, J.S.
1981 Sex discrimination? The XYZ affair. The Public Interest 62(Winter):21-39.
Jablon, S.
1979 The Uses of Third-Party Data Sets: A View From the Fence. Paper presented at the annual meetings of the American Statistical Association, Washington, D.C.
Johnston, D.F.
1976 The OMB report: social indicators. Pp. 100-105 in Proceedings of the American Statistical Association. Part I. Washington, D.C.: American Statistical Association.
Katz, S., Hedrick, S.C., and Henderson, N.
1979 The measurement of long-term care needs and impact. Health and Medical Care Services Review 2(1):1-21.
Keesling, J.W.
1978 On school attendance and reading achievement. In R.F. Boruch, ed., New Directions for Program Evaluation. Vol. 4. San Francisco: Jossey-Bass.
Klitgaard, R.
1974 Preliminary analysis of achievement test scores in Alum Rock voucher and nonvoucher schools. Pp. 10-119 in D. Weiler, ed., A Public School Voucher Demonstration: The First Year at Alum Rock, Technical Appendix. Technical Report R-1495/2-NIE. Santa Monica, Calif.: Rand Corporation.
Magidson, J.
1977 Toward a causal model approach for adjusting for pre-existing differences in the nonequivalent control group situation: a general alternative to ANCOVA. Evaluation Quarterly 1:399-420.
Meier, P.
1975 Statistics and medical experimentation. Biometrics 31:521.
Menges, R.J.
1973 Openness and honesty versus coercion and deception in psychological research. American Psychologist 28:1030-1034.
Nature
1980 Research Data: Private Property or Public Good? 284(March):292.
Nelkin, D.
1981 Intellectual Property: The Control of Scientific Information. Paper prepared for the AAAS Committee on Scientific Freedom and Responsibility.
Public Cryptography Study Group
1981 Report of the Public Cryptography Study Group. Prepared for the American Council on Education. Communications of the Association for Computing Machinery 24(7):435-444.
Rainwater, L., and Pittman, D.J.
1967 Ethical problems in studying a politically sensitive and deviant community. Social Problems 14:61-72.
Raizen, S.A., and Rossi, P.H.
1981 Program Evaluation in Education: When? How? To What Ends? Committee on Program Evaluation in Education, Assembly of Behavioral and Social Sciences, National Research Council. Washington, D.C.: National Academy Press.
Rezmovic, E.L., and Rezmovic, V.A.
1981 A confirmatory factor analysis approach to construct validation. Educational and Psychological Measurement 41:7-88.
Rindskopf, D.M.
1978 Secondary analysis: using multiple analytic approaches with Head Start and Title I data. In R.F. Boruch, ed., New Directions for Program Evaluation 4:7-88. San Francisco: Jossey-Bass.
Robbin, A.
1981 Technical guidelines for preparing and documenting data. In R.F. Boruch, P.M. Wortman, and D.S. Cordray, eds., Reanalyzing Program Evaluations. San Francisco: Jossey-Bass.
Ruebhausen, O.M., and Brim, O.G.
1966 Privacy and behavioral research. American Psychologist 21:423-411.
Shapiro, S.
1979 Release of Primary Data Sets From the Producer's Point of View. Paper presented at the annual meetings of the American Statistical Association, August 1979.
Sterling, T.D., Pollack, S., and Weinkam, J.J.
1969 Measuring the effect of air pollution on urban morbidity. Archives of Environmental Health 18:485-494.
Sterling, T.D., and Weinkam, J.J.
1979 What Happens When Major Errors Are Discovered Long After an Important Report Has Been Published? Paper presented at the annual meetings of the American Statistical Association, August 1979.
Ware, W.H.
1974 Computer privacy and computer security. Bulletin of the American Society for Information Science 1:3.
Wolins, L.
1962 Responsibility for raw data. American Psychologist 17:657-658.
1978 Secondary analysis in published research in the behavioral sciences. In R.F. Boruch, ed., New Directions for Program Evaluation. Vol. 4. San Francisco: Jossey-Bass.
Wortman, P.M., Reichardt, C.S., and St. Pierre, R.G.
1978 The first year of the education voucher demonstration: a secondary analysis of student achievement test scores. Evaluation Quarterly 2(2):193-214.