2 Legal, Ethical, and Statistical Issues in Protecting Confidentiality
PAST AND CURRENT PRACTICE
There is a long tradition in government agencies and research institu-
tions of maintaining the confidentiality of human research participants
(e.g., de Wolf, 2003; National Research Council, 1993, 2000, 2005a).
Most U.S. research organizations, whether in universities, commercial firms,
or government agencies, have internal safeguards to help guide data collec-
tors and data users in ethical and legal research practices. Some also have
guidelines for the organizations responsible for preserving and disseminat-
ing data, data tables, or other compilations.
Government data stewardship agencies use a suite of tools to construct
public-use datasets (micro and aggregates) and are guided by federal stan-
dards (Doyle et al., 2002; Confidentiality and Data Access Committee, 2000,
2002). For example, current practices that guide the U.S. Census Bureau
require that geographic regions contain at least 100,000 persons for
micro data about them to be released (National Center for Health Statistics
and Centers for Disease Control and Prevention, 2003). Most federal agen-
cies that produce data for public use maintain disclosure review boards that
are charged with the task of ensuring that the data made available to the
public have minimal risk of identification and disclosure. Federal guidelines
for data collected under the Health Insurance Portability and Accountability
Act of 1996 (HIPAA) are less stringent: they prohibit release of data for
regions with fewer than 20,000 persons. Table 2-1 shows the approaches to
maintaining confidentiality taken by various federal agencies that regularly
collect social data, including cell size restrictions and various procedural methods.
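A population floor of this kind can be checked mechanically before release. The following is a minimal sketch, not any agency's actual procedure: the region names and population counts are invented for illustration, and only the two thresholds (100,000 and 20,000) come from the text above.

```python
# Hypothetical illustration: region names and populations are invented.
CENSUS_MICRODATA_MIN = 100_000   # floor cited above for Census Bureau microdata
HIPAA_MIN = 20_000               # floor cited above for data under HIPAA

def releasable_regions(populations, minimum):
    """Return the names of regions whose population meets the minimum."""
    return sorted(name for name, pop in populations.items() if pop >= minimum)

populations = {"Region A": 250_000, "Region B": 45_000, "Region C": 12_000}

census_ok = releasable_regions(populations, CENSUS_MICRODATA_MIN)
hipaa_ok = releasable_regions(populations, HIPAA_MIN)
print(census_ok)
print(hipaa_ok)
```

In this toy input, only Region A clears the higher floor, while Regions A and B clear the lower one.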
Fewer guidelines exist for nongovernmental data stewardship organiza-
tions. Many large organizations have their own internal standards and
procedures for ensuring that confidentiality is not breached. Those proce-
dures are designed to ensure that staff members are well trained to avoid
disclosure risk and that data in their possession are subject to appropriate
handling at every stage in the research, preservation, and dissemination
cycle. The Inter-university Consortium for Political and Social Research
(ICPSR) requires staff to certify annually that they will preserve confidenti-
ality. It also has a continual process of reviewing and enhancing the training
that its staff receives. Moreover, ICPSR requires that all data it acquires be
subject to a careful examination that measures and, if necessary, reduces
disclosure risk. ICPSR also stipulates that data that cannot be distributed
publicly over the Internet be made available using a restricted approach (see
Chapter 3). Other nongovernmental data stewardship organizations, such
as the Roper Center (University of Connecticut), the Odum Institute (Uni-
versity of North Carolina), the Center for International Earth Science Infor-
mation Network (CIESIN, at Columbia University), and the Murray Re-
search Archive (Harvard University), have their own training and disclosure
analysis procedures, which over time have been very effective; there have
been no publicly acknowledged breaches of confidentiality involving the
data handled by these organizations, and in private discussions with archive
managers, we have learned of none that led to any known harm to research
participants or legal action against data stewards.
Universities and other organizations that handle social data have guide-
lines and procedures for collecting and using data that are intended to
protect confidentiality. Institutional review boards (IRBs) are central in
specifying these rules. They can be effective partners with data stewardship
organizations in creating approaches that reduce the likelihood of confiden-
tiality breaches. The main activities of IRBs in the consideration of research
occur before the research is conducted, to ensure that it follows ethical and
legal standards. Although IRBs are mandated to do periodic continuing
review of ongoing research, they generally get involved in any major way
only reactively, when transgressions occur and are reported. Few IRBs are
actively involved in questions about data sharing over the life of a research
project, and fewer still have expertise in the new areas of linked social-
spatial data discussed in this report.
Although not all research is explicitly subject to the regulations that
require IRB review, most academic institutions now require IRB review for
all human subjects research undertaken by their students, faculty, and staff.
In the few cases for which IRB review is not required for research that links
location to other human characteristics and survey responses, researchers
undertaking such studies are still subject to standard codes of research
ethics. In addition, many institutions require that their researchers, regardless
of their funding sources, undergo general human subjects protection
training when such issues are pertinent to their work or their supervisory
roles. IRBs are also taking a more public role, for example, by making
resources available for investigators and study subjects.1 Educating IRBs and
having IRBs do more to educate investigators may be important to increased
awareness of confidentiality issues, but education alone does not
address two challenges presented by the analysis of linked spatial and social
data.
1For example, see the website for Columbia University’s IRB: http://www.columbia.edu/cu/irb/
[accessed April 2006].
PUTTING PEOPLE ON THE MAP
TABLE 2-1 Agency-Specific Features of Data Use Agreements and Licenses
Mechanisms for Data Approval*
Columns, left to right: IRB Approval Required; Institutional Concurrence;
Security Pledges All Users; Report Disclosures
Agency
National Center for Education Statistics X X X
National Science Foundation X X X
Department of Justice X X X
Health Care Financing Administration
Social Security Administration X X X X
Health Care Financing Administration-National Cancer Institute
Bureau of Labor Statistics-National Longitudinal Survey of Youth X X
Bureau of Labor Statistics-Census of Fatal Occupational Injuries X X
National Institute of Child Health and Human Development X X X
National Heart, Lung, and Blood Institute X
National Institute of Mental Health X X
National Institute on Drug Abuse X X
National Institute on Alcohol Abuse and Alcoholism X
Columns, left to right (continued): Security Plan; Security Inspections;
Cell Size Restrictions; Prior-approval Reports; Notification of Reports
X X X X X
X X X X X
X X
X X
X X X
X X X X
X X X X
X X X X
X
X
X
X
*The agreement mechanisms for data use range from those believed to be most stringent
(IRB approval) on the left to the least stringent (notification of reports) on the right. In
practice, policies for human subjects protection often comprise several mechanisms or
facets of them. IRB approval and “institutional concurrence” are similar, though the
latter often encompasses financial and legal requirements of grants not generally covered
by IRBs.
NOTE: Security plans may be quite broad, including safeguards on the computing
environment as well as the physical security of computers on which confidential data are
stored. Security inspections are randomly timed inspections to assess compliance with the
security plan.
SOURCE: Seastrom (2002:290).
One of these challenges is that major sources of fine-grained spatial
data, such as commercial firms and such government agencies as the National
Aeronautics and Space Administration (NASA) and National Oceanic
and Atmospheric Administration (NOAA), do not have the same history
and tradition of the protection of human research subjects that are
common in social science, social policy, and health agencies, particularly in
relation to spatial data. As a result, they may be less sensitive than the
National Institutes of Health (NIH) or the National Science Foundation
(NSF) to the risks to research participants associated with spatial data and
identification. Neither NASA nor NOAA has large-scale grant or research
programs in the social sciences, where confidentiality issues usually arise.
However, NASA and NOAA policies do comply with the U.S. Privacy Act
of 1974, including in some research activities that involve human specimens or
individuals (e.g., biomedical research in space flight) or establishments (such
as research on the productivity of fisheries).2 NASA and NOAA also provide
clear guidance to their investigators on the protection of human subjects,
including seeking IRB approval, obtaining consent from study subjects,
and educating principal investigators. For example, NASA’s policy
directive on the protection of human research subjects offers useful guid-
ance for producers and users of linked spatial-social data, although it is
clearly targeted at biomedical research associated with space flight.3
The difference in traditions between NASA and NOAA and other re-
search agencies may be due in part to the fact that spatial data in and of
themselves are not usually considered private. Although aerial photography
can reveal potentially identifiable features of individuals and lead to harm,
legal privacy protections do not apply to observations from navigable air-
space (see Appendix A). Thus, agencies have not generally established hu-
man subjects protection policies for remotely sensed data. Privacy and
confidentiality issues arise with these data mainly when they are linked to
social data, a kind of linkage that has been regularly done only recently.
These linkages, combined with dramatic increases in the resolution of im-
ages from earth-observing satellites and airborne cameras in the past de-
cade, now make privacy and confidentiality serious issues for remote data
providers. Thus, it is not surprising that NASA and NOAA are absent from
the list of agencies in Table 2-1 that have been engaged in specifying data
use agreements and licenses—another context in which confidentiality is-
sues may arise. Agencies that already have such procedures established for
social databases may be better prepared to adopt such procedures for spa-
tial data than agencies that do not have established procedures for human
subjects protection.
2For details, see http://www.pifsc.noaa.gov/wpacfin/confident.php [accessed January 2007].
3See http://nodis3.gsfc.nasa.gov/npg_img/N_PD_7100_008D_/N_PD_7100_008D__main.pdf
[accessed January 2007].
The other challenge is that, absent the availability of other information
or expertise, IRBs have, for the most part, treated spatially linked or
spatially explicit data no differently from other self-identifying data. There are
no current standards or guidelines for methods to perturb or aggregate
spatially explicit data other than those that exist for other types of self-
identifying data. Current practice primarily includes techniques such as
data aggregation, adding random noise to alter precise locations, and re-
stricting data access. Without specialized approaches and specialized knowl-
edge provided to IRBs, they can either be overly cautious and prevent
valuable data from being made available for secondary use or insufficiently
cautious and allow identifiable data to be released. Neither option ad-
dresses the fundamental issues.
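Two of the techniques named above, adding random noise to precise locations and aggregating them, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a recommended masking standard: the function names, the 1 km displacement radius, the 0.1-degree grid cell, the flat-earth conversion factor, and the sample coordinates are all hypothetical choices made for the example.

```python
import math
import random

KM_PER_DEG_LAT = 111.32  # approximate conversion; adequate for a sketch

def jitter_point(lat, lon, max_km, rng):
    """Displace a point by a random distance (at most max_km) in a random direction."""
    distance_km = max_km * math.sqrt(rng.random())  # sqrt spreads points evenly over the disk
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(lat))
    new_lat = lat + (distance_km * math.cos(bearing)) / KM_PER_DEG_LAT
    new_lon = lon + (distance_km * math.sin(bearing)) / km_per_deg_lon
    return new_lat, new_lon

def snap_to_grid(lat, lon, cell_deg):
    """Aggregate a point to the center of the grid cell that contains it."""
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return (row + 0.5) * cell_deg, (col + 0.5) * cell_deg

rng = random.Random(0)
original = (40.8075, -73.9626)  # hypothetical coordinates
masked = jitter_point(*original, max_km=1.0, rng=rng)
cell = snap_to_grid(*original, cell_deg=0.1)
```

Both functions trade spatial precision for protection: a larger `max_km` or `cell_deg` lowers the chance of pinpointing a residence but also degrades the analytic value of the location.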
The need for effective training in confidentiality-related research and
ethics issues goes beyond the IRBs and investigators, and extends to data
collectors, stewards, and users. Many professional organizations in the
social sciences have ethics statements and programs (see Chapter 1 and
Appendix B), and these statements generally require that students be trained
explicitly in ethical research methods. Training programs funded by the
NIH also require ethics components, but it is not at all certain that the
coverage provided or required by these programs goes beyond general ethi-
cal issues to deeper consideration of ethics in social science research, let
alone in the context of social-spatial linkages.4 Professional data collection
and stewardship organizations, as noted above, typically have mandatory
standards and training. Nonetheless, there is no evidence that any of these
organizations are systematically considering the issue of spatial data linked
to survey or other social data in their training and certification
processes. We offer some recommendations for improving this situation in
Chapter 4.
LEGAL ISSUES
Researchers in college or university settings or supported by federal
agencies are subject to the rules of those institutions, in particular, their
Federalwide Assurances (FWAs) for the Protection of Human Subjects and
the institutional review boards (IRBs) designated under their Assurances.
Also, researchers may find guidance in the federal statutes and codes that
govern research confidentiality for various agencies.5 Rules may also be
defined legally through employer-employee or sponsor-recipient contracts.
Obligations to follow IRB rules, policies, and procedures may be incorporated
in the terms of such contracts in addition to any explicit language that
may refer to the protection of human subjects.
4For example, the Program Announcement for National Research Service Award
Institutional Research Grants (T32) specifies: Although the NIH does not establish
specific curricula or formal requirements, all programs are encouraged to consider
instruction in the following areas: conflict of interest, responsible authorship,
policies for handling misconduct, data management, data sharing, and policies
regarding the use of human and animal subjects. Within the context of training in
scientific integrity, it is also beneficial to discuss the relationship and the specific
responsibilities of the institution and the graduate students or postdoctorates
appointed to the program (see
http://grants1.nih.gov/grants/guide/pa-files/PA-02-109.html [accessed April 2006]).
Researchers who are not working in a college or university or who are
not supported with federal funds may be bound, from a practical legal
perspective, only by the privacy and confidentiality laws that are generally
applicable in society. Such researchers in the United States usually include
those working for private companies or consortia. In an international con-
text, research may be done using human subjects data gathered in nations
where different legal obligations apply to protecting privacy and confiden-
tiality and where the social, legal, and institutional contexts are quite differ-
ent. As a general rule, U.S. researchers are obligated to adhere to the laws of
countries in which the data are collected, as well as those of the United
States.
The notion of confidentiality is not highly developed in U.S. law.6
Privacy, in contrast with confidentiality, is partly protected both by tort law
concepts and by specific legislative protections. Appendix A provides a
detailed review of U.S. privacy law as it applies to issues of privacy, confi-
dentiality, and harm in relation to human research subjects. The appendix
summarizes when information is sufficiently identifiable so that privacy
rules apply, when the collection of personal information does and does not
fall under privacy regulations, and what legal rules govern the disclosure of
personal information. As Appendix A shows, the legal status of confidenti-
ality is less well defined than that of privacy.
U.S. law provides little guidance for researchers and the holders of
datasets except for the rules imposed by universities and research sponsors
regarding methods by which researchers may gain access to enhanced and
detailed social data linked to location data in ways that both meet their
research needs and protect the rights of human subjects. Neither does cur-
rent U.S. privacy law significantly proscribe or limit methods that might be
used for data access or data mining. The most detailed provisions are in the
Confidential Information Protection and Statistical Efficiency Act of 2002
(CIPSEA).7 This situation makes it possible for researchers and organizations
that are unconstrained by the rules and requirements of universities
and federal agencies to legally access vast depositories of commercial data
on the everyday happenings, transactions, and movements of individuals
and to use increasingly sophisticated data mining technology to conduct
detailed analyses on millions of individuals and households without their
knowledge or explicit consent.
5An illustrative compendium of federal confidentiality statutes and codes can be found at
http://www.hhs.gov/ohrp/nhrpac/documents/nhrpac15.pdf [accessed April 2006].
6For some references to federal laws on confidentiality, see
http://www.hhs.gov/ohrp/nhrpac/documents/nhrpac15.pdf [accessed January 2007].
7E-Government Act of 2002, Pub. L. 107-347, Dec. 17, 2002, 116 Stat. 2899, 44 U.S.C. §
3501 note § 502(4).
These privacy issues are not directly relevant to the conduct of social
science research under conventional guarantees of confidentiality. How-
ever, they may become linked in the future, either because researchers may
begin to use these data sources or because privacy concerns raised by uses
of large commercial databases may lead to pressures to constrain research
uses of linked social and spatial data. Solutions to the tradeoffs among data
quality, access, and confidentiality must be considered in the context of the
legal vagueness surrounding the confidentiality concept and the effects it
may have on individuals’ willingness to provide information to researchers
under promises of confidentiality.
ETHICAL ISSUES
The topics of study, the populations being examined, and the method
or methods involved in an inquiry interact to shape ethical considerations
in the conduct of all research involving human participants (Levine and
Skedsvold, 2006). Linked social-spatial research raises many of the typical
issues of sound science and sound ethics, for which the basic ethical prin-
ciples have been well articulated in the codes of ethics of scientific societ-
ies,8 in research and writing on research ethics, in the evolution of the Code
of Federal Regulations for the Protection of Human Subjects (45 CFR 46)
and the literature that surrounds it, and in a succession of important reports
and recommendations typically led by the research community (see Appen-
dix B). Much useful ethical guidance can also be extrapolated from past
National Research Council reports (e.g., 1985, 1993, 2004b, 2005a).
In addition, as noted above, linked social and spatial data raise particu-
larly challenging ethical issues because the very spatial precision of these
data is their virtue, and, thus, aggregating or masking spatial identifiers to
protect confidentiality can greatly reduce their scientific value and utility.
Therefore, if precise spatial coordinates are to be used as research data,
primary researchers and data stewards need to address how ethically to
store, use, analyze, and share those data. Appendix B provides a detailed
discussion of ethical issues at each stage of the research process, from
primary data collection to secondary use.
8For example, see those of the American Statistical Association, at http://www.amstat.org/
profession/index.cfm?fuseaction=ethicalstatistics [accessed January 2007].
The process of linking micro-level social and spatial data is usually
considered to fall in the domain of human subjects research because it
involves interaction or intervention with individuals or the use of identifi-
able private information.9 Typically, such research is held to ethical guide-
lines and review processes associated with IRBs at colleges, universities, and
other research institutions. This is the case whether or not the research is
funded by one of the federal agencies that are signatories to the federal
regulations on human subjects research.10 Thus, generic legal and ethical
principles for data collection and access apply. Also, secondary analysts of
data, including those engaged in data linking, have the ethical obligation to
honor agreements made to research participants as part of the initial data
collection. However, the practices of IRBs for reviewing proposed second-
ary data analyses vary across institutions, which may require review of
proposals for secondary data analysis or defer authority to third-party data
providers that have protocols for approving use.11 Data stewardship—the
practices of providing or restricting the access of secondary analysts to
original or transformed data—entails similar ethical obligations.
Planning for ethically responsible research is a matter of professional
obligation for researchers and other professionals, covered in part by IRBs
under the framework of a national regulatory regime. This regime provides
for a distributed human subjects protection system that allows each IRB to
tailor its work with considerable discretion to meet the needs of researchers
and the research communities in which the work is taking place. The link-
ing of social and spatial data raises new and difficult issues for researchers
and IRBs to consider: because the uses of linked data are to some extent
unpredictable, decisions about data access are rarely guided by an explicit
set of instructions.
The National Commission for the Protection of Human Subjects of
Biomedical and Behavioral Research (1979) concisely conveyed the essen-
tial ethical principles for research:
Beneficence: maximizing good outcomes for society, science, and individual
research participants while avoiding or minimizing unnecessary
risk or harm;
respect for persons: protecting the autonomy of research participants
through voluntary, informed consent and by assuring privacy and
confidentiality; and
justice: ensuring reasonable, carefully considered procedures and a
fair distribution of costs and benefits.
9These are the elements of human subject research as defined in the Code of Federal
Regulations at 45 CFR 46.102(f).
10Academic and research institutions typically have in place federally approved
Federalwide Assurances that extend the Federal Regulations for the Protection of Human
Subjects to all human subjects research undertaken at the institution, not just to research
funded by the 17 agencies that have adopted the Federal Regulations.
11IRBs even vary in whether research using public-use data files is reviewed, although
increasingly the use of such data, if not linked to or supplemented by other data, is viewed
as exempt once vetted for public use. See
http://www.hhs.gov/ohrp/nhrpac/documents/dataltr.pdf for general guidelines and
http://info.gradsch.wisc.edu/research/compliance/humansubjects/7.existingdata.htm
for a specific example. [Web pages accessed January 2007].
These three principles together provide a framework for both facilitat-
ing social and spatial research and doing so in an ethically responsible and
sensitive way.
For primary researchers, secondary analysts, and data stewards, the
major ethical issues concern the sensitivity of the topics of research; main-
taining confidentiality and obtaining informed consent; considerations of
benefits to society and to research participants; and risk and risk reduction,
particularly the obligation to reduce disclosure risk. Linking spatial data to
social data does not alter ethical obligations, but it may pose additional
challenges.
Data collectors, stewards, and analysts have a high burden with regard
to linked social and spatially precise data to ensure that the probability of
disclosure approaches zero and that the data are very securely protected.
They also need to avoid inadvertent disclosure through the ways findings
are presented, discussed, or displayed. To meet this burden, they need to
consider all available technical methods and data management strategies.
We examine these methods and strategies in Chapter 3 in relation to their
ability to meet the serious challenges of data protection for linked social-
spatial data.
STATISTICAL ISSUES
All policies about access to linked social-spatial data implicitly involve
tradeoffs between the costs and benefits of providing some form of access
to the data, or modified versions of the data, by secondary data users. The
risk of disclosures of sensitive information constitutes the primary cost, and
the knowledge generated from the data represents the primary benefit. At
one extreme, data can be released as is, with identifiers such as precise
geocodes intact. This policy offers maximum benefit at a maximum cost
(i.e., minimum confidentiality protection). At the other extreme, data can
be completely restricted for secondary use, a policy that provides minimal
benefit and minimal cost (i.e., maximum confidentiality protection). Most
current methods of data release, such as altering or restricting access to the
original data, have costs and benefits between these two extremes.
Well-informed data access policies reflect wise decisions about the
tradeoffs, such as whether the usefulness of the data is high enough to justify the
disclosure risks associated with a particular policy. However, most data stewards
do not directly measure the inputs to these cost-benefit analyses. This is not
negligence on the part of data stewards; indeed, the broader research commu-
nity has not yet developed the tools needed to make such assessments. Yet,
data stewards could quantify some aspects of the cost-benefit tradeoff,
namely, disclosure risks and data quality. Evaluating these measures can
enable data stewards to choose policies with better risk-quality profiles (e.g.,
between two policies with the same disclosure risk, to select the one with
higher data quality). There have been a few efforts to formalize the task of
assessing data quality and disclosure risk together for the purpose of evaluat-
ing the tradeoffs (Duncan et al., 2001; Gomatam et al., 2005). This section
briefly reviews statistical approaches to gauging the risk-quality tradeoffs
both generally and for spatial data. (For further discussion about the cost-
benefit approach to data dissemination, see Abowd and Lane, 2003).
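The comparison of risk-quality profiles described above can be illustrated with a small sketch. It is an illustration only, not a method taken from the cited papers: it assumes a data steward has already obtained a numeric disclosure-risk estimate and a numeric data-quality estimate for each candidate policy, and the policy names and scores are invented. The code simply discards any policy that another policy matches or beats on both measures.

```python
from typing import NamedTuple

class Policy(NamedTuple):
    name: str
    risk: float      # estimated disclosure risk (lower is better); hypothetical scale
    quality: float   # estimated data quality (higher is better); hypothetical scale

def undominated(policies):
    """Keep policies that no other policy matches on both measures while beating on one."""
    keep = []
    for p in policies:
        dominated = any(
            q.risk <= p.risk and q.quality >= p.quality
            and (q.risk < p.risk or q.quality > p.quality)
            for q in policies
        )
        if not dominated:
            keep.append(p)
    return keep

# Invented scores for illustration only.
candidates = [
    Policy("release as is", risk=0.90, quality=1.00),
    Policy("coarsen geography", risk=0.20, quality=0.60),
    Policy("add noise to locations", risk=0.20, quality=0.75),
    Policy("no secondary access", risk=0.00, quality=0.00),
]
frontier = [p.name for p in undominated(candidates)]
```

Here "coarsen geography" drops out because "add noise to locations" has the same risk and higher quality, which is the kind of choice the paragraph describes. Choosing among the remaining policies still requires a judgment about how much risk is acceptable.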
Most data stewards seeking to protect data confidentiality are concerned
with two types of disclosures. One is identity disclosure, which occurs when
a user of the data correctly identifies individual records using the released
data. The other is attribute disclosure, which occurs when a data user learns
the values of sensitive variables for individual records in the dataset.
Attribute disclosures typically require identity disclosures (Duncan and
Lambert, 1986a). Other types of disclosures include perceived identity
disclosure, which occurs when a data user incorrectly identifies individual records
in the database, and inferential disclosure, which occurs when a data user can
accurately predict sensitive attributes in the dataset using released data
that may have been altered (for example, by adding statistical noise) to
prevent disclosure. (For introductions to disclosure risks, see Federal
Committee on Statistical Methodology, 1994; Duncan and Lambert, 1986a,
1986b; Lambert, 1993; Willenborg and de Waal, 1996, 2001.)
Efforts to quantify identity disclosure risk generally fall in two broad
categories: (1) estimating the number of records in the released data that are
unique records in the population and can therefore be at high risk of
identification, and (2) estimating the probabilities that users of the released
data can determine the identities of the records in the released data by using
the information in those data. Although these approaches are appropriate
for many varieties of data, in cases where there are exact spatial identifiers,
virtually every individual is unique, so the disclosure risk is very great.
Quantifying Disclosure Risks
Methods of estimating the risk of identification disclosure involve esti-
mating population uniqueness and probabilities of identification. Estimates
of attribute disclosures involve measuring the difference between estimates
of sensitive attributes made by secondary data users and the actual values.
This section describes methods that are generally applicable for geographic
identification at levels coarser than exact latitude and longitude
(such as census blocks or tracts, minor civil divisions, or counties).
In many cases, exact latitude and longitude uniquely identify respondents,
although there are exceptions (e.g., when spatial identifiers locate a
residence in a large, high-rise apartment building).
Population uniqueness is relevant for identity disclosures because
unique records are at higher risk of identification than non-unique records.
For any unperturbed, released record that is unique in the population, a
secondary user who knows that target record’s correct identifying variables
can identify it with probability 1.0. For any unperturbed, released target
record that is not unique in the population, secondary users who know its
correct identifying variables can identify that record only with probability
1/K, where K is the number of records in the population whose characteristics
match the target record. For purposes of disclosure risk assessment, population
uniqueness is not a fixed quality; it depends on what released information is
known by the secondary data user. For example, most individuals are
uniquely identified in populations by the combination of their age, sex, and
street address. When a data user knows these identifying variables and they
are released on a file, most records are population unique records. How-
ever, when the secondary user knows only age, sex, and state of residence,
most records will not be unique records. Hence, all methods based on
population uniqueness depend on assumptions about what information is
available to secondary data users. The number of population unique records
in a sample typically is not known and must be estimated by the data
disseminator. Methods for making such estimates have been reported by
several researchers (see, e.g., Bethlehem et al., 1990; Greenberg and Zayatz,
1992; Skinner, 1992; Skinner et al., 1994; Chen and Keller-McNulty, 1998;
Fienberg and Makov, 1998; Samuels, 1998; Pannekoek, 1999; Dale and
Elliot, 2001). These methods involve sophisticated statistical modeling.
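The 1/K relationship, and its dependence on which identifying variables the secondary user knows, can be shown with a toy calculation. The sketch below assumes the full population is available, which is rarely true in practice; it does not implement the statistical estimation methods cited above, and the records and variable names are invented.

```python
from collections import Counter

def match_probabilities(population, key_vars):
    """For each record, return 1/K, where K is the number of population
    records that share the record's values on key_vars."""
    keys = [tuple(rec[v] for v in key_vars) for rec in population]
    counts = Counter(keys)
    return [1.0 / counts[k] for k in keys]

# Invented toy population.
population = [
    {"age": 34, "sex": "F", "state": "NY", "street": "12 Elm St"},
    {"age": 34, "sex": "F", "state": "NY", "street": "98 Oak Ave"},
    {"age": 34, "sex": "F", "state": "NY", "street": "5 Pine Rd"},
    {"age": 61, "sex": "M", "state": "NY", "street": "12 Elm St"},
]

coarse = match_probabilities(population, ["age", "sex", "state"])
fine = match_probabilities(population, ["age", "sex", "street"])
```

With age, sex, and state, the first three records share one key, so each can be identified only with probability 1/3. With age, sex, and street address, every record is unique and the probability is 1.0 for all four.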
Probabilities of identification are readily interpreted as measures of
identity disclosure risk: the larger the probability, the greater the risk. Data
disseminators determine their own thresholds for probabilities considered
unsafe. There are two main approaches to estimating these probabilities.
The first is to match records in the file being considered for release with
records from external databases that a secondary user plausibly would use
to attempt an identification (Paass, 1988; Blien et al., 1992; Federal Com-
mittee on Statistical Methodology, 1994; Yancey et al., 2002). The match-
ing is done using record linkage software, which (1) searches for the records
in the external data file that look as similar as possible to the records in the
file being considered for release; (2) computes the probabilities that these
matching records correspond to records in the file being considered for
release, based on the degrees of similarity between the matches and their
targets; and (3) declares the matches with probabilities exceeding a speci-
fied threshold as identifications.
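The three steps of the record linkage approach can be sketched as follows. This is a deliberately simplified illustration, not a description of any actual record linkage software: the exact-agreement similarity score, the normalization used as a stand-in for a match probability, the 0.6 threshold, and all data values are assumptions made for the example. Real software uses more elaborate probability models.

```python
def similarity(a, b, fields):
    """Share of comparison fields on which two records agree exactly."""
    return sum(a[f] == b[f] for f in fields) / len(fields)


def link_records(release, external, fields, threshold=0.6):
    """Toy version of the three steps: (1) score candidate pairs,
    (2) convert scores to match probabilities, (3) apply a threshold."""
    declared = []
    for i, target in enumerate(release):
        # Step 1: compare the target with every external record.
        scores = [similarity(target, cand, fields) for cand in external]
        total = sum(scores)
        if total == 0:
            continue  # no external record resembles this target
        # Step 2 (assumption): probability = score normalized over candidates.
        probs = [s / total for s in scores]
        best = max(range(len(external)), key=lambda j: probs[j])
        # Step 3: declare an identification only above the threshold.
        if probs[best] > threshold:
            declared.append((i, best, probs[best]))
    return declared


# Hypothetical file proposed for release (no direct identifiers).
release = [
    {"age": 34, "sex": "F", "county": "A"},
    {"age": 50, "sex": "M", "county": "B"},
]
# Hypothetical external file that a secondary user might hold.
external = [
    {"name": "P1", "age": 34, "sex": "F", "county": "A"},
    {"name": "P2", "age": 50, "sex": "M", "county": "C"},
    {"name": "P3", "age": 50, "sex": "M", "county": "C"},
]

matches = link_records(release, external, ["age", "sex", "county"])
```

The first released record has a single close candidate and is declared an identification. The second has two equally plausible candidates, so its best match probability stays below the threshold.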
The second approach is to match records in a file being considered for
release with the records from the original, unperturbed data file (Spruill,
1982; Duncan and Lambert, 1986a, 1986b; Lambert, 1993; Fienberg et al.,
1997; Skinner and Elliot, 2002; Reiter, 2005a). This approach can be easier
and less expensive to implement than obtaining external data files and
record linkage software. It allows a data disseminator to evaluate the iden-
tification risks when a secondary user knows the identities of some or all of
the sampled records but does not know the location of those records in the
file being considered for release. This approach can be modified to work
under the assumption that the secondary user does not know the identities
of the sampled records.
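A minimal sketch of this second approach follows, assuming numeric attributes, additive random noise as the perturbation, and nearest-neighbor matching as the secondary user's strategy. None of these choices comes from the cited papers; the function names and simulated data are hypothetical.

```python
import random


def reidentification_rate(original, released):
    """Fraction of released records whose nearest original record
    (squared Euclidean distance) is the record they were derived from.
    Assumes released[i] is the perturbed version of original[i]."""
    hits = 0
    for i, rel in enumerate(released):
        dists = [sum((r - o) ** 2 for r, o in zip(rel, orig))
                 for orig in original]
        nearest = min(range(len(original)), key=lambda j: dists[j])
        hits += (nearest == i)
    return hits / len(released)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical original file: 200 records with two numeric attributes.
original = [(rng.uniform(20, 80), rng.uniform(10, 200)) for _ in range(200)]


def perturb(data, sd):
    """Add independent Gaussian noise with standard deviation `sd`."""
    return [tuple(v + rng.gauss(0, sd) for v in rec) for rec in data]


rate_unperturbed = reidentification_rate(original, list(original))
rate_heavy_noise = reidentification_rate(original, perturb(original, 25.0))
```

Comparing the two rates gives a data disseminator a rough sense of how much a given perturbation lowers the chance that a record can be traced back to its source.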
Many data disseminators focus on identity disclosures and pay less
attention to attribute disclosures. In part, this is because attribute disclo-
sures are usually preceded by identity disclosures. For example, when origi-
nal values of attributes are released, a secondary data user who correctly
identifies a record learns the attribute values. Many data disseminators
therefore fold the quantification of attribute disclosure risks into the
measurement of identity disclosure risks. When attribute values are altered
before release, attribute disclosure risks become inferential disclosure risks.
There are no standard approaches to quantifying inferential disclosure risks.
Lambert (1993) provides a useful framework that involves specifying a
secondary user’s estimator(s) of the unknown attribute values—such as an
average of plausible matches’ released attribute values—and a loss function
for incorrect guesses, such as the Euclidean or statistical distance between
the estimate and the true value of the attribute. A data disseminator can
then evaluate whether the overall value of the loss function—the distance
between the secondary user’s proposed estimates and the actual values—is
large enough to be deemed safe. (For examples of the assessment of at-
tribute and inferential disclosure risks, see Gomatam et al., 2005; Reiter,
2005d.)
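The framework just described can be sketched in a few lines, using the average of the plausible matches' released values as the estimator and Euclidean distance as the loss. The attribute values, the threshold `SAFE_LOSS`, and the function name are hypothetical; this illustrates the idea rather than reproducing Lambert's formulation.

```python
import math


def inferential_loss(true_values, plausible_matches):
    """Euclidean distance between the true attribute vector and the
    secondary user's estimate, taken here to be the mean of the released
    attribute vectors of the plausible matches."""
    n = len(plausible_matches)
    estimate = [sum(col) / n for col in zip(*plausible_matches)]
    return math.dist(true_values, estimate)


# Hypothetical target: (income, debt) in thousands of dollars.
true_income_and_debt = (52.0, 10.0)
# Released (possibly altered) values of the records that plausibly match.
plausible = [(50.0, 12.0), (58.0, 8.0), (48.0, 13.0)]

loss = inferential_loss(true_income_and_debt, plausible)

SAFE_LOSS = 5.0  # hypothetical threshold set by the data disseminator
is_safe = loss >= SAFE_LOSS
```

A small loss means the secondary user's guess is close to the truth, so the release would be judged unsafe for this target under the assumed threshold.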
The loss-function approach extends to quantifying overall potential
harm in a data release (Lambert, 1993). Specifically, data disseminators can
specify cost functions for all types of disclosures, including perceived iden-
tification and inferential disclosures, and combine them with the appropri-
ate probabilities of each to determine the expected cost of releasing the
data. When coupled with measurements of data quality, this approach
provides a decision-theoretic framework for selecting disclosure limitation
policies. Lambert’s total harm model is primarily theoretical and has not
been implemented in practice.
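The expected-cost calculation can be illustrated as follows. The probabilities, the cost values, and the assumption that costs simply add across disclosure types are all hypothetical and are used only to show the arithmetic.

```python
def expected_disclosure_cost(disclosure_types):
    """Sum of probability times cost over the listed disclosure types.
    Assumes costs are additive across types."""
    return sum(d["probability"] * d["cost"] for d in disclosure_types)


# Hypothetical probabilities and costs under two candidate release policies.
policy_a = [
    {"type": "identity", "probability": 0.02, "cost": 100.0},
    {"type": "inferential", "probability": 0.10, "cost": 20.0},
    {"type": "perceived identification", "probability": 0.05, "cost": 10.0},
]
policy_b = [
    {"type": "identity", "probability": 0.01, "cost": 100.0},
    {"type": "inferential", "probability": 0.05, "cost": 20.0},
    {"type": "perceived identification", "probability": 0.05, "cost": 10.0},
]

cost_a = expected_disclosure_cost(policy_a)
cost_b = expected_disclosure_cost(policy_b)
lower_cost_policy = "B" if cost_b < cost_a else "A"
```

In a full decision-theoretic comparison, these expected costs would be weighed against a measure of the data quality each policy preserves.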
Quantifying Data Quality
Compared with the effort that has gone into developing measures of
disclosure risks, there has been less work on developing measures of data
quality. Existing quality measures are of two types: (1) comparisons of
broad differences between the original and released data, and (2) compari-
sons of differences in specific models between the original and released
data. The former measures suffer from not being tied to how users analyze
the data; the latter measures suffer from capturing only certain dimensions
of data quality.
Broad difference measures essentially quantify differences between the
distributions of the data values on the original and released files. As the
differences between the distributions grow, the overall quality of the re-
leased data drops. Computing differences in distributions is a nontrivial
statistical problem, particularly when there are many variables and records
with unknown distributional shapes. Most approaches are therefore ad
hoc. For example, some researchers suggest computing a weighted average
of the differences in the means, variances, and correlations in the original
and released data, where the weights indicate the relative importance of
keeping those quantities similar in the released and original files (Domingo-
Ferrer and Torra, 2001; Yancey et al., 2002). Such ad hoc methods are only
tangentially tied to the statistical analyses being done by data users. For
example, a user interested in analyzing incomes may not care that means
are preserved when the tails of the distribution are distorted, because the
researcher’s question concerns only the extremely rich. In environmental
research, the main concern may be with the few people with the greatest
exposure to an environmental hazard. These measures also have limited
interpretability and little theoretical basis.
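One possible form of such an ad hoc broad difference measure is sketched below for two variables. The equal default weights, the use of absolute differences, and the data are assumptions made for illustration; the cited authors' measures differ in detail.

```python
from statistics import mean, pvariance


def correlation(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def broad_difference(orig, rel, w_mean=1.0, w_var=1.0, w_corr=1.0):
    """Weighted average of absolute differences in means, variances,
    and the correlation between two variables named "x" and "y"."""
    names = ("x", "y")
    d_mean = mean([abs(mean(orig[v]) - mean(rel[v])) for v in names])
    d_var = mean([abs(pvariance(orig[v]) - pvariance(rel[v])) for v in names])
    d_corr = abs(correlation(orig["x"], orig["y"])
                 - correlation(rel["x"], rel["y"]))
    total_weight = w_mean + w_var + w_corr
    return (w_mean * d_mean + w_var * d_var + w_corr * d_corr) / total_weight


# Hypothetical original file and a release in which y has been reordered,
# which preserves means and variances but reverses the correlation.
original = {"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [2.0, 4.0, 6.0, 8.0, 10.0]}
released = {"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [10.0, 8.0, 6.0, 4.0, 2.0]}

score_identical = broad_difference(original, original)
score_released = broad_difference(original, released)
```

Because the score collapses several summaries into one number, two releases with the same score can be very different in usefulness for a particular analysis, which is the limitation noted above.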
Comparison of specific models is often done informally. For example,
data disseminators look at the similarity of point estimates and standard
errors of regression coefficients after fitting the same regression on the
original data and on the data proposed for release. If the results are consid-
ered close—for example, the confidence intervals for the coefficients ob-
tained from the models largely overlap—the released data have high quality
for that particular analysis. Such measures are closely tied to how the data
are used, but they only reflect certain dimensions of the overall quality of
the released data. It is prudent to examine models that represent the wide
range of expected uses of the released data, even though unexpected uses
may arise to which the conclusions from such models do not apply.
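An informal comparison of this kind might look like the sketch below, which fits the same simple regression to the original and proposed-release data and measures how much the confidence intervals for the slope overlap. The overlap definition, the normal-approximation interval, and the data are assumptions chosen for illustration.

```python
def slope_with_ci(x, y, z=1.96):
    """Least-squares slope of y on x with an approximate 95% interval
    (normal approximation; a t quantile would be more exact)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return slope, (slope - z * se, slope + z * se)


def interval_overlap(ci_a, ci_b):
    """Length of the intersection divided by the length of the shorter
    interval; 0 when the intervals are disjoint."""
    lo, hi = max(ci_a[0], ci_b[0]), min(ci_a[1], ci_b[1])
    shorter = min(ci_a[1] - ci_a[0], ci_b[1] - ci_b[0])
    return max(0.0, hi - lo) / shorter


# Hypothetical original data: y is roughly 2x plus small errors.
x = [float(v) for v in range(10)]
errors = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, -0.5]
y_original = [2 * a + e for a, e in zip(x, errors)]
# Hypothetical release: y perturbed by small alternating amounts.
y_released = [b + (0.1 if i % 2 == 0 else -0.1)
              for i, b in enumerate(y_original)]

slope_orig, ci_orig = slope_with_ci(x, y_original)
slope_rel, ci_rel = slope_with_ci(x, y_released)
overlap = interval_overlap(ci_orig, ci_rel)
```

An overlap near 1 suggests that the release supports this particular regression well. It says nothing about other analyses, so the comparison would need to be repeated for each model of interest.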
A significant issue for assessing data quality with linked spatial-social
data is the need at times to simultaneously preserve several characteristics
or spatial relationships. Consider, for example, a collection of observations
represented as points that define nodes in a transportation network, where a
node is defined as a street intersection. Although it is possible to create a
synthetic or transformed network that has the same mean (global) link
length as the original one, it is difficult to maintain, in addition, actual
variation in the local topology of links (the number of links that connect at
a node), as well as the geographical variability in link lengths that might be
present in the original data. Consequently, some types of analyses done
with transformed or synthetic data may yield results similar to those that
would result with the original data, while others may create substantial
risks of inferential error. The results may include both Type I errors, in
which a true null hypothesis is incorrectly rejected, and Type II errors, in
which a false null hypothesis is not rejected. Data users may be tempted to
treat transformed data as equal in quality to the original data unless they
are informed otherwise.
Effects of Spatial Identifiers
The presence of precise spatial identifiers can have large effects on the
risk-quality tradeoffs. Releasing these identifiers can raise the risks of
identification to extremely high levels. To reduce these risks, data stewards
may perturb the spatial identifiers if they plan to release some version of
the original data for open access. Doing so, however, can seriously degrade
the quality of the data for analyses that use the spatial information,
particularly analyses that depend on patterns of spatial covariance, such as
distances or topological relationships between research participants and
locations important to the analysis (Armstrong et al., 1999). For example, some
analyses may be impossible to do with coarsened identifiers, and others may
produce misleading results due to altered relationships between the attributes
and spatial variables. Furthermore, if spatial identifiers are used as matching
variables for linking datasets, altering them can lead to matching errors,
which, when numerous, may seriously degrade analyses.
Perturbing the spatial information may not reduce disclosure risks suffi-
ciently to maintain confidentiality, especially when the released data include
other information that is known by a secondary data user. For example, there
may be only one person of a certain sex, age, race, and marital status in a
particular county, and this information may be readily available for the
county, so that coarsening geographies to the county level would provide no
greater protection for that person than releasing the exact address.
Identity disclosure risks are complicated to measure when the data are
set up to be meaningfully linked to other datasets for research purposes.
Altering spatial identifiers will reduce disclosure risks in the set of data
originally collected, but the risks may increase when this dataset is linked to
datasets with other attributes. For example, unaltered attributes in File A
may be insufficient to identify individuals if the spatial identifiers are al-
tered, but when File A is linked to File B, the combined attributes and
altered spatial identifiers may uniquely identify many individuals. The com-
plication arises because the steward of the collected data may not know
which attributes are in the files to be linked to those data, so that it is
difficult to evaluate the degree of disclosure risk.
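The File A and File B example can be made concrete with a small sketch. The files, the field names, and the linking key `id` are hypothetical, and the count of unique attribute combinations is only a crude stand-in for a full disclosure risk assessment.

```python
from collections import Counter


def count_unique(records, keys):
    """Number of records whose combination of values on `keys`
    appears exactly once in the file."""
    signatures = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(signatures)
    return sum(1 for s in signatures if counts[s] == 1)


# Hypothetical File A: spatial identifier coarsened to tract level.
file_a = [
    {"id": 1, "tract": "T1", "age_group": "30-39"},
    {"id": 2, "tract": "T1", "age_group": "30-39"},
    {"id": 3, "tract": "T2", "age_group": "40-49"},
    {"id": 4, "tract": "T2", "age_group": "40-49"},
]
# Hypothetical File B: additional attributes keyed by the same study id.
file_b = {
    1: {"occupation": "nurse"},
    2: {"occupation": "teacher"},
    3: {"occupation": "farmer"},
    4: {"occupation": "farmer"},
}

# Link the two files on the shared id.
linked = [{**rec, **file_b[rec["id"]]} for rec in file_a]

unique_in_a = count_unique(file_a, ["tract", "age_group"])
unique_in_linked = count_unique(linked, ["tract", "age_group", "occupation"])
```

No record is unique in File A alone, yet two become unique once the occupation attribute from File B is attached. A steward who sees only File A cannot anticipate this.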
Even when safeguards have been established for data sharing, publica-
tion of research papers using linked social-spatial data may pose other
problems such as those associated with the visualization of data. VanWey
et al. (2005) present a means for evaluating the risks associated with dis-
playing data through maps that may be presented orally or in writing to
communicate research results. The method involves identifying data with a
spatial area of a radius sufficient to include, on average, enough research
participants to reduce the identity disclosure risk to a target value. Methods
for limiting disclosure risk from maps are only beginning to be developed.
No guidelines currently exist for visualizing linked social-spatial data, in
published papers or even presentations; but future standards for training
and publication contexts should be based on systematic assessment of such
risks.
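The idea of choosing a radius that covers enough participants on average can be sketched as follows. This is not the procedure of VanWey et al. (2005). It is a simplified illustration that assumes planar coordinates, treats the target risk as 1 divided by the required average number of participants, and searches a hypothetical list of candidate radii.

```python
import math


def smallest_safe_radius(points, target_risk, candidate_radii):
    """Smallest candidate radius for which circles centered on the
    participants contain, on average, at least 1/target_risk
    participants (each participant counts itself). Returns None if no
    candidate radius is large enough."""
    needed = 1 / target_risk
    for radius in sorted(candidate_radii):
        counts = [sum(math.dist(p, q) <= radius for q in points)
                  for p in points]
        if sum(counts) / len(counts) >= needed:
            return radius
    return None


# Hypothetical participant locations on a 5-by-5 grid, 1 km apart.
points = [(float(i), float(j)) for i in range(5) for j in range(5)]

# Target identity disclosure risk of 0.2, i.e., 5 participants on average.
radius = smallest_safe_radius(points, 0.2, [0.5, 1.0, 1.5, 2.0])
```

Real participant locations are rarely this regular, so an average-based radius can still leave participants in sparsely populated areas with far fewer neighbors than the target implies.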
In principle, policies for access to data that include spatial identifiers
can be improved by evaluating the tradeoff between disclosure risks and
data quality. In practice, though, such an evaluation will be challenging for
many data stewards and for IRBs that are considering proposals to use
linked data. Existing approaches to quantifying risk and quality are techni-
cally demanding and may be beyond the capabilities of some data stewards.
Low-cost, readily available methods for estimating risks and quality do not
yet exist, whether or not the data include spatial identifiers. And existing
techniques do not account for the additional risks associated with linked
datasets. This challenge would be significantly lessened, and data dissemi-
nation practice improved, if data stewards had access to reliable, valid, off-
the-shelf software and protocols for assessing the tradeoffs between disclo-
sure risk and data quality and for undertaking broad cost-benefit analyses.
The next chapter addresses the issue of evaluating and addressing the
tradeoffs involving disclosure risk and data quality.