

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
4 The Tradeoff: Confidentiality Versus Access

The previous three chapters describe the challenge of preserving confidentiality while facilitating research in an era of increasingly detailed and available data about research participants and their geographic locations. This chapter presents the committee’s conclusions about what can—and cannot—be done to achieve two goals: ensure that both explicit and implied pledges of confidentiality are kept when social data are made spatially explicit and provide access to important research data for analysts working on significant basic and policy research questions. Following our conclusions, we offer recommendations for data stewards, researchers, and research funders.

CONCLUSIONS

Tradeoffs of Benefits and Risks

Recognition of the Benefits and Risks

Making social data spatially explicit creates benefits and risks that must be considered in ethical guidelines and research policy. Spatially precise and accurate data about individuals, groups, or organizations, added to data records through processes of geocoding, make it possible for researchers to examine questions they could not otherwise explore and gain better understanding of human actors in their physical and environmental contexts, and they create benefits for society in terms of the knowledge that can flow from that research.

PUTTING PEOPLE ON THE MAP

CONCLUSION 1: Recent advances in the availability of social and spatial data and the development of geographic information systems (GIS) and related techniques to manage and analyze those data give researchers important new ways to study important social, environmental, economic, and health policy issues and are worth further development. Sharing of linked social-spatial data among researchers is imperative to get the most from the time, effort, and money that goes into obtaining the data.

However, to the extent that data are spatially precise and accurate, the risk increases that the people or organizations that are the subject of the data can be identified. Promises of confidentiality that are normally provided for research participants and that can be kept when data are not linked could be jeopardized as a result of the data linkage, increasing the risk of disclosure and possibly also of harm, particularly when linked data are made available to secondary data users who may, for example, combine the linked data with other spatially explicit information about respondents that enables new kinds of analysis and, potentially, new kinds of harm. These risks affect not only research participants, but also the scientific enterprise that depends on participants’ confidence in promises of confidentiality.

Researchers’ Obligations

Researchers who collect or undertake secondary analysis of linked social-spatial data and organizations that support research or provide access to such data have an ethical obligation to maximize the benefits of the research and minimize the risk of breaches of confidentiality to research participants. This obligation exists even if legal obligations are not clearly defined.
Those who collect, analyze, or provide access to such data need to articulate strong data protection plans, stipulate conditions of access, and safeguard against possible breaches of confidentiality through all phases of the research—from data collection through dissemination. Protecting against any breach of confidentiality is a priority for researchers, in light of the need to honor confidentiality agreements between research participants and researchers, and to support public confidence in the integrity of the research.

The Tradeoff of Confidentiality and Access

Restricting data access affords the highest protection to the confidentiality of linked social-spatial data that include exact locations. However, the costs to science are high. If confidentiality has been promised, common public-use forms of data distribution create unacceptable risks to confidentiality. Consequently, only more restrictive forms of data management and dissemination are appropriate, including extensive data reduction, strong licenses, and data center (enclave) access. When the precise data are available only in data enclaves, many researchers simply do not use the datasets, so research that could be done is not undertaken. Improved methods for providing remote access to enclave data require research and development efforts.

CONCLUSION 2: The increasing use of linked social-spatial data has created significant uncertainties about the ability to protect the confidentiality promised to research participants. Knowledge is as yet inadequate concerning the conditions under which and the extent to which the availability of spatially explicit data about participants increases the risk of confidentiality breaches.

The risks created by the availability and publication of such information increase the better-known risks associated with other publication-related breaches of confidentiality, such as the publication of the names or locations of primary sampling units or of specific tabular cell sizes. For example, cartographic materials are often used in publications to illustrate points or findings that do not lend themselves as easily to tabular or text explication: what is not yet understood are the conditions under which they also increase the ability to identify a research participant.

Technical Strategies for Reducing Risk

Cell Suppression, Data Swapping, and Aggregation

Cell suppression and data swapping techniques can protect confidentiality, but they seriously degrade the value of data for analyses in which spatial information is essential. Aggregation can provide adequate protection and preserves analysis at a level of aggregation, but it renders data useless when exact locations are required. Hence, aggregation has merit for data that have low levels of risk and are slated for public-use dissemination, but not for data that will be used for analyses that require exact spatial information.
When analyses require exact locations, essentially all observations are the equivalent of small cells in a statistical table: cell suppression would therefore be tantamount to destroying the spatial component of the data. Suppressing nonspatial attributes leaves so much missing information that the data are difficult to analyze. Swapping exact locations may not prevent identifications and can create serious distortions in analysis when a location or a topological relationship is a critical variable. Swapping nonspatial attributes to limit attribute disclosure risk may need to be done at so high a rate that the associations in the data are badly attenuated. Suppression or swapping can be used to preserve confidentiality when analyses require inexact levels of geography, but aggregation is a superior approach in these cases because it preserves analyses at those levels. Aggregation makes it impossible to perform many types of analyses, and when it is used it can lead to ecological inference problems.

Data Alteration

Data alteration methods, such as geographic masking or adding noise to sensitive nonspatial attributes, may improve confidentiality protection but at the expense of data quality. Altering data to mask precise spatial locations impedes the ability of researchers to calculate accurate spatial relationships, such as distances, directions, and inclusion of locations within an enumeration unit (e.g., a census tract). There is a tradeoff between the magnitude of any masking displacement and the corresponding utility of an observation for a particular use. Decisions about this tradeoff affect the risk of a breach of confidentiality. A mask may also be applied to nonspatial attributes associated with known locations: this might be done when knowledge about the magnitude of an attribute, along with knowledge about a generating process (such as a deterministic model of toxic emissions), could enable the recovery of a location that could then be linked to other information.

Synthetic Data

Synthetic data approaches may have the potential to provide access to data with exact spatial identifiers while preserving confidentiality. There is insufficient evidence at present to determine how well this approach preserves the social-spatial relationships of interest to researchers. In addition, with current technologies, it is very difficult for data stewards to create analytically valid synthetic datasets. The goal of synthetic data approaches is to protect confidentiality while preserving certain relationships in the data. This approach depends on data simulation models that capture the relationships among the spatial and nonspatial variables. The effectiveness of such models has not been fully demonstrated across a wide range of analyses and datasets.
For example, it is not known how well these models can preserve distance and topological relationships. It is also not known whether and how the various synthetic data approaches can be applied when linking datasets.

Secure Access

Techniques for providing secure access to linked data, such as sharing sums but not individual values or conducting data analyses on request and returning the results but not the data, may have the potential to provide results from spatial analyses without revealing data values. These approaches are not yet extensively used by stewards of spatial data, and their feasibility for social and spatial data is unproven. They are computationally intensive and require expertise that is not available to many data stewards. The value of some of these methods is limited by restrictions on the total number of queries that can be performed before queries could be combined to identify elements in the original data.
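The concern about combining queries can be made concrete with a small sketch. The Python example below is illustrative only and is not taken from the report. It assumes a hypothetical `SumServer` class that returns sums over a confidential attribute, refuses groups smaller than `min_size`, and stops answering after `max_queries` queries. Every name, threshold, and data value is invented. The sketch shows how two permitted sums that differ by one record reveal that record's value, and how a query cap blocks this particular combination.

```python
from typing import Dict, Iterable, List


class QueryRefused(Exception):
    """Raised when the hypothetical server declines to answer a query."""


class SumServer:
    """Hypothetical secure-access server: returns sums, never individual values."""

    def __init__(self, values: Dict[str, float], min_size: int = 3, max_queries: int = 1):
        self._values = dict(values)    # confidential attribute, keyed by record id
        self._min_size = min_size      # smallest group the server will summarize
        self._remaining = max_queries  # queries left before the server stops answering

    def query_sum(self, record_ids: Iterable[str]) -> float:
        ids = frozenset(record_ids)
        if len(ids) < self._min_size:
            raise QueryRefused("group too small")
        if self._remaining <= 0:
            raise QueryRefused("query budget exhausted")
        self._remaining -= 1
        return sum(self._values[i] for i in ids)


def differencing_attack(server: SumServer, group: List[str], target: str) -> float:
    """Recover one record's value from two sums that differ only by that record."""
    with_target = server.query_sum(group)
    without_target = server.query_sum([i for i in group if i != target])
    return with_target - without_target


# Invented data: a sensitive attribute for five households.
data = {"h1": 10.0, "h2": 20.0, "h3": 30.0, "h4": 40.0, "h5": 50.0}

# With a budget of two queries, the group-size threshold alone does not prevent disclosure.
leaky = SumServer(data, min_size=3, max_queries=2)
recovered = differencing_attack(leaky, ["h1", "h2", "h3", "h4"], "h4")

# With a budget of one query, the second sum is refused and the attack fails.
strict = SumServer(data, min_size=3, max_queries=1)
try:
    differencing_attack(strict, ["h1", "h2", "h3", "h4"], "h4")
    attack_blocked = False
except QueryRefused:
    attack_blocked = True
```

A fixed query cap is a blunt control. It stops this combination of queries, but it also limits legitimate analysis, which is the limitation the paragraph above describes.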

CONCLUSION 3: Recent research on technical approaches for reducing the risk of identification and breach of confidentiality has demonstrated promise for future success. At this time, however, no known technical strategy or combination of technical strategies for managing linked social-spatial data adequately resolves conflicts among the objectives of data linkage, open access, data quality, and confidentiality protection across datasets and data uses.

In our judgment, it will remain difficult to reconcile these conflicting objectives by technical strategies alone, though efforts to identify effective methods and procedures should continue. It is likely that different methods and procedures will be optimal for different applications and that the best approaches will evolve with the data and with techniques for protecting confidentiality and for identifying respondents.

Institutional Approaches

CONCLUSION 4: Because technical strategies will not be sufficient in the foreseeable future for resolving the conflicting demands for data access, data quality, and confidentiality, institutional approaches will be required to balance those demands.

Institutional approaches involve establishing tiers of risk and access and producing data-sharing solutions that match levels of access to the risks and benefits of the planned research. Institutional approaches must address issues of shared responsibility for the production, control, and use of data among primary data producers, secondary producers who link additional information, data users of all kinds, research sponsors, IRBs, government agencies, and data stewards. It is essential that the power to decide about data access and use be allocated appropriately among these responsible actors and that those with the greatest power to decide are highly informed about the issues and about the benefits and risks of the data access policies they may be asked to approve.
It is also essential that users of the data bear the burden of confidentiality protection for the data they use.

RECOMMENDATIONS

We generally endorse the recommendations of two reports, Protecting Participants and Facilitating Social and Behavioral Sciences Research (National Research Council, 2003) and Expanding Access to Research Data: Reconciling Risks and Opportunities (National Research Council, 2005a) regarding general issues of confidentiality and data access. It is important to note that the recommendations in those reports address only data collected and held by federal agencies, and they do not deal with the special issues that arise when social and spatial data are linked. This report extends those recommendations to include the large body of data that are collected by individual researchers and academic and research organizations and held at universities and other public research entities. It also addresses the need for research sponsors, research organizations such as universities, and researchers to pay special attention to data that record exact locations.

In particular, we support several key recommendations of these reports:

• Access to data should be provided “through a variety of modes, including various modes of restricted access to confidential data and unrestricted access to public-use data altered in a variety of ways to maintain confidentiality” (National Research Council, 2005a:68).
• Organizations that sponsor data collection should “conduct or sponsor research on techniques for providing useful, innovative public-use data that minimize the risk of disclosure” (National Research Council, 2005a:72) and continue efforts to “develop and implement state-of-the-art disclosure protection practices and methods” (National Research Council, 2003:4).
• Organizations that sponsor data collection “should conduct or sponsor research on cost-effective means of providing secure access to confidential data by means of a remote access mechanism, consistent with their confidentiality assurance protocols” (National Research Council, 2005a:78).
• Data stewardship organizations that use licensing agreements should “expand the files for which a license may be obtained [and] work with data users to develop flexible, consistent standards for licensing agreements and implementation procedures for access to confidential data” (National Research Council, 2005a:79).
• Professional associations should develop strong codes of ethical conduct and should provide training in ethical issues for “all those involved in the design, collection, distribution, and use of data collected under pledges of confidentiality” (National Research Council, 2005a:84).

Some of these recommendations will not be straightforward to implement for datasets that link social and spatially explicit data. We therefore elaborate on those recommendations for the special issues and tradeoffs raised by linking social and spatial data.

Technical and Institutional Research

RECOMMENDATION 1: Federal agencies and other organizations that sponsor the collection and analysis of linked social-spatial data—or that support data that could provide added benefits with such linkage—should sponsor research into techniques and procedures for disseminating such data while protecting confidentiality and maintaining the usefulness of the data for social and spatial analysis. This research should include studies to adapt existing techniques from other fields, to understand how the publication of linked social-spatial data might increase disclosure risk, and to explore institutional mechanisms for disseminating linked data while protecting confidentiality and maintaining the usefulness of the data.

This research should include three elements. First, it should include studies that focus on both adapting existing techniques and developing new approaches in social science, computer science, geographical science, and statistical science that have the potential to deal effectively with the problems of linked social-spatial data. The research should include assessments of the disclosure risk, data quality, and implementation feasibility associated with the techniques, as well as seeking to identify ways for data stewards to make these assessments for their data. This line of research should include work on techniques that enable data analysts to understand what analyses can be reliably done with shared data. It should also include research on analytical methods that correct or at least account for the effects of data alteration. Finally, the research should be done through collaborations among data stewards, data users, and researchers in the appropriate sciences. Among the most promising techniques are spatial aggregation, geographic masking, fully and partially synthetic data, and remote access model servers and other emerging methods of secure access and secure record linkage.
Second, the research should include work to understand how the publication of spatially explicit material using linked social-spatial data might increase disclosure risk and thus to increase sensitivity to this issue. The research would include assessments of disclosure risk associated with cartographic displays. It should involve researchers from the social, spatial, and statistical sciences and would aim to better understand how the public presentation of cartographic and other spatially explicit information could affect the risk of confidentiality breaches. The education should involve researchers, data stewards, reviewers and journal editors.

Third, the research should work on institutional mechanisms for disseminating linked social-spatial data while protecting confidentiality and maintaining the usefulness of the data for social and spatial analysis. This research should include studies of modifications to traditional data enclave institutions, such as expanded and virtual enclaves, and of modified licensing arrangements for secondary data use. Direct data stewards, whether in government agencies, academic institutions, or private organizations, should participate in such research, which should seek to identify and examine the effects of various institutional mechanisms and associated enforcement systems on data access, data use, data quality, and disclosure risk.

Education and Training

RECOMMENDATION 2: Faculty, researchers, and organizations involved in the continuing professional development of researchers should engage in the education of researchers in the ethical use of spatial data. Professional associations should participate by establishing and inculcating strong norms for the ethical use and sharing of linked social-spatial data.

Education is an essential tool for ensuring that linked social-spatial data are organized and used in ways that balance the benefits of the data for developing knowledge, the value of wide access to the data, and the need to protect the confidentiality of research participants. Education and training, both for students and as part of continuing education, require materials that extrapolate from general ethical principles for data collection, maintenance, dissemination, and access. These materials should include the ethical issues raised by linked social-spatial data and, to the extent they are identified and accepted, best practices in the handling of these forms of data. Organizations and programs involved in training members of institutional review boards (IRBs) should incorporate attention to the benefits, uses, and potential risks of linked social-spatial data.

Training in Ethical Issues

RECOMMENDATION 3: Training in ethical considerations needs to accompany all methodological training in the acquisition and use of data that include geographically explicit information on research participants.
Education about how to collect, analyze, and maintain linked social-spatial data, how to disseminate results without compromising the identities of individuals involved in the research, and how to share such data consonant with confidentiality protections is essential for ensuring that scientific gains from the capacity to obtain such information can be maximized. Graduate-level courses and professional workshops addressed to ethical considerations in the conduct of research need to include attention to social and spatial data; to enhance awareness of the ethical issues related to consent, confidentiality, and benefits as well as risks of harm; and to identify the best practices available to maximize the benefits from such research while minimizing any added risks associated with explicit spatial data. Similarly, institutes, courses, and programs focusing on spatial methods and their use need to incorporate substantive consideration of ethical issues, in particular those related to confidentiality. Education needs to extend to primary and secondary researchers, staffs of organizations engaged in data dissemination, and institutional review boards (IRBs) that consider research protocols that include linked social-spatial data.

Outreach by Professional Societies and Other Organizations

RECOMMENDATION 4: Research societies and other research organizations that use linked social-spatial data and that have established traditions of protection of the confidentiality of human research participants should engage in outreach to other research societies and organizations less conversant in research with issues of human participant protection to increase their attention to these issues in the context of the use of personal, identifiable data.

Expertise on outreach is not uniformly distributed across research disciplines and fields. Given the likely increased interest in using explicit spatial data linked to other social data, funding agencies, scientific societies, and related research organizations should take steps to ensure that expertise in the conduct of research with human participants is broadly accessible and shared. An outreach priority should be to develop targeted materials, workshops, and short-course training institutes for researchers in fields or subfields that have had little or no tradition of safeguarding personal, identifiable information.
Research Design

RECOMMENDATION 5: Primary researchers who intend to collect and use spatially explicit data should design their studies in ways that not only take into account the obligation to share data and the disclosure risks posed, but also provide confidentiality protection for human participants in the primary research as well as in secondary research use of the data. Although the reconciliation of these objectives is difficult, primary researchers should nevertheless assume a significant part of this burden.

Researchers need to consider the tradeoffs between data utility and confidentiality at the very start of their research programs, when they are making commitments to sponsors, designing procedures to obtain informed consent, and presenting their plans to their IRBs. They should be mindful of both potential benefits and potential harm and plan accordingly. Everyone involved needs to understand that achieving a balance between benefits and harms may turn out to be difficult, and at the very least it will require innovative thinking, compromise, and partnership with others. It is imperative to recognize that it may take a generation to find norms for sharing the new kind of data and an equally long effort to ensure the safety of human research subjects. If, for example, IRBs need to be continuously involved in monitoring projects, they (and the researchers) should accept that role. If researchers must turn their data over to more experienced stewards for safe-keeping, that, too, will need to be acknowledged and accepted. Finally, secondary researchers need to understand that access to confidential data may involve difficulties, and plan their work accordingly.

Institutional Review Boards

RECOMMENDATION 6: Institutional Review Boards and their organizational sponsors should develop the expertise needed to make well-informed decisions that balance the objectives of data access, confidentiality, and quality in research projects that will collect or analyze linked social-spatial data.

Given the rapidity with which advances are being made in collecting and linking social and spatial data, maintaining appropriate expertise will be an ongoing task. IRBs need to learn what they do not know and develop plans to consult with experts when appropriate. Traditionally, IRBs have concerned themselves more with the collection of data than its dissemination, but the heightened risks to confidentiality that arise from linking social data to spatial data require increased attention to data dissemination.
Government agencies that sponsor research that requires the application of the common rule, the Human Subjects Research Subcommittee of the Executive Branch Committee on Research, and the Association for the Accreditation of Human Research Protection Programs (AAHRPP) should work together to convene an expert working group to address the issue of social and spatial data and make recommendations for best practices.

Data Enclaves

RECOMMENDATION 7: Data enclaves deserve further development as a way to provide wider access to high-quality data while preserving confidentiality. This development should focus on the establishment of expanded place-based enclaves, “virtual enclaves,” and meaningful penalties for misuse of enclaved data.

Three elements are critical to this development. First, data producers, data stewards, and academic and other research organizations should consider expanding place-based (as opposed to virtual) data enclaves to hold more extensive collections of social and spatial data. Currently, many such data enclaves are maintained by a data producer (such as the U.S. Bureau of the Census) and contain only the data produced by that organization or agency. The panel’s recommendation proposes alternative models in which organizations that store the research they produce also house social and spatial datasets produced elsewhere or in which institutions that manage multiple enclaves combine them into a single entity. This recommendation may require that some agencies (e.g., the Census Bureau) obtain regulatory or legislative approval in order to broaden their ability to manage restricted data. This approach could make such data more accessible and cost-effective for secondary researchers while also increasing the capacity and sustainability of data enclaves. The main challenge is to work out adequate confidentiality protection arrangements between data producers and the stewards of expanded enclaves.

Second, “virtual enclaves,” in which data are housed in a remote location but accessed in a secure setting by researchers at their own institution under agreed rules, deserve further development. Virtual archives at academic institutions should be managed by their libraries, which have expertise in maintaining the security of valuable information resources, such as rare books and institutional archives.
The Census Bureau has demonstrated the effectiveness of such remote archives with the technology used for its Research Data Centers, and Statistics Canada has created a system that is relatively more accessible (relative to the number of Canadian researchers) through its Research Data Centre program (see http://www.statcan.ca/english/rdc/index.htm). The extension of these approaches will reduce the cost of access to research data if researchers and their home institutions invest in construction and staffing and if principles of operation can be agreed on. One key issue in the management of virtual or remote enclaves is the location of the “watchful eye” that ensures that the behavior of restricted data users follows established rules. In some cases, the observer will be a remote computer or operator, while in others it will be a person working at the location where the data user is working, for example, in a college or university library.

Third, access to restricted data through virtual or place-based enclaves should be restricted to those who agree to abide by the confidentiality protections governing such data, and meaningful penalties should be enforced for willful misuse of the linked social-spatial data. High-quality science depends on sound ethical practices. Ethical standards in all fields of science require honoring agreements made as a condition of undertaking professional work—whether those agreements are between primary researchers and research participants or between researchers and a research repository in the case of secondary use. Appropriate penalties might include publication of reports of willful misuse, disbarment from future research using restricted-access data, reduced access to federal research funding, and mechanisms that would provide incentives to institutions that employ researchers who willfully or carelessly misuse enclaved data so that they enforce agreements to which they are party.

Licensing

RECOMMENDATION 8: Data stewards should develop licensing agreements to provide increased access to linked social-spatial datasets that include confidential information.

Licensing agreements place the burden of confidentiality protection on the data user. Several aspects of licensing deserve further development. First, nontransferable, time-limited licenses require the data user only to ensure that his or her own use does not make respondents identifiable to others or cause them harm and to return or destroy all copies of the data as promised. However, to be effective, such agreements require strong incentives for users to protect the confidentiality of the research participants.

Second, strong licensing, which requires data users to take special precautions to protect the shared data, can make sensitive data more widely available than has been the case to date. Data stewards who are responsible for managing data enclaves or other restricted data centers, as well as research sponsors who support research that can only be disseminated under tight restrictions, should make these kinds of data as accessible as possible. Strong licensing agreements provide an appropriate mechanism for providing increased access in many situations.

Third, research planning should include mechanisms to facilitate data use under license.
Sponsors of primary research should ensure that plans are developed at the outset, with sufficient resources provided (e.g., time to do research, funds to pay for access) to prepare datasets that facilitate analysis by secondary data users. Data sponsors and data stewards should ensure that the plans for data access are carried through.

Fourth, explicit enforcement language should be included in contracts and license agreements with secondary users setting forth penalties for breaches of confidentiality and other willful misuse of the linked geospatial and social data. Funding agencies and research societies with codes of ethics should scrutinize confidentiality breaches that occur and take actions appropriate to their roles and responsibilities.