3
Meeting the Challenges
Although the challenges described in Chapter 2 are substantial, a number
of possible approaches exist for preserving respondent confidentiality when
links to geospatial information could engender breaches. They fall in two
main categories: institutional approaches, which involve restricting access to
sensitive data; and technical and statistical approaches, which involve trans-
forming the data in various ways to enhance the protection of confidentiality.
This chapter describes these two broad categories of approaches.
INSTITUTIONAL APPROACHES
Institutions that have responsibility for preserving the confidentiality of
respondents employ a number of strategies. These strategies are very impor-
tant for protecting respondent confidentiality in survey data under all cir-
cumstances, and especially when there is a high risk of identification due to
the existence of precise geospatial attributes. At their heart, many of these
strategies protect confidentiality by restricting access to the data, either by
limiting access to those data users who explicitly guarantee not to reveal
respondent identities or attributes or by requiring that data users work in a
restricted environment so they cannot remove information that might re-
veal identities or attributes. Restricting data access is a strategy that can be
used with original data or with data that have been deidentified, buffered,
or synthesized.
In addition to restricting access, institutional approaches require that
researchers—students and faculty or staff at universities or other institu-
tions—be educated in appropriate and ethical use of data. Many data stew-
ards provide guidelines about how data should be used, what the risks of
disclosure are, and why researchers should be attentive to disclosure risk
and its limitation.1
User education at a more fundamental level, in the general training of
active and future researchers, should be based on sound theoretical
principles and empirical research. Such research, however, is scarce, and
there are only a few examples of good materials or curricula for teaching
proper data use that minimizes the risk of confidentiality breaches. For
instance, the disclosure limitation program project, “Human Subject Pro-
tection and Disclosure Risk Analysis,” at the Inter-university Consortium
for Political and Social Research (ICPSR) has resources available for teaching
about its findings and the best practices it has developed (see
http://www.icpsr.umich.edu/HSP [April 2006]). ICPSR is also working on a set of
education and certification materials on handling restricted data for its own
staff, which will probably evolve into formal training materials. The Caro-
lina Population Center also has a set of practices for training students who
work on its National Longitudinal Study of Adolescent Health (Add Health:
see http://www.cpc.unc.edu/projects/addhealth [April 2006]) and other
projects, and for teaching its demography trainees about ethics (see
http://www.cpc.unc.edu/training/meth.html [April 2006]). However, few other
training programs have equivalent practices.
Fundamental to most institutional approaches is the idea that the
greater the risk of disclosure or harm, the more restricted access should be.
For every tier of disclosure risk, there is an equivalent tier of access restric-
tion. The tiers of risk are partly a function of the ability of the data distribu-
tor to make use of identity masking techniques to limit the risk of disclo-
sure. On a low tier of risk are data with few identifiable variables, such as
the public-use microdata sets from the U.S. Census Bureau and many small
sample surveys. Because there is little or no geographic detail in these data,
when they are anonymized there is very little risk of disclosure, although if
a secondary user knows that an individual is a respondent in a survey (e.g.,
because it is a family member), identification is much easier. Dissemination
of these data over the Web has not been problematic from the standpoint of
confidentiality breaches. On the highest tier of risk are absolutely identifi-
able data, such as surveys of business establishments and data that include
the exact locations of respondents’ homes or workplaces.2 The use of these
data must be tightly restricted to preserve confidentiality. Methods and
1For example, see the Inter-university Consortium for Political and Social Research (ICPSR),
2005; also http://www.icpsr.umich.edu/access/deposit/index.html [accessed April 2006].
2Business establishments are generally considered to be among the most easily identifiable
because data about them are frequently unique: in any given market, there are usually only a
small number of business establishments engaged in any given area of activity, and each has
unique characteristics such as relative size or specialization.
44 PUTTING PEOPLE ON THE MAP
procedures for restricted data access are well described in a National Re-
search Council report (2005a:28-34).
The number of tiers of access can vary from one study to another and
from one data archive to another. A simple model might have four levels of
access: full public access, limited licensing, strong licensing, and data en-
claves.3
Full Public Access
Full access is provided through Web-based public-use files that are
available to the general public or to a limited public (for example, those who
subscribe to a data service, such as ICPSR). Access is available to all users
who accept a data use agreement through a Web-based form that requires
them to avoid disclosure. This tier of access is typically reserved for data
files with little risk of disclosure and harm, such as those that include very
small subsamples of a larger sample, that represent a tiny fraction of the
population in a geographic area, that contain little or no geographic infor-
mation, or that do not include any sensitive information. We are unaware
of any cases in which this form of public access is allowed for files that
combine social data with highly specific locational data, such as addresses
or exact latitude and longitude.
Public use, full-access datasets may include some locational data, such as
neighborhood or census tract, if it is believed that such units are too broad to
allow identification of particular individuals. However, when datasets are
linked, it is often possible to identify individuals with high probability even
when the linked data provide only neighborhood-level information. Because
of this probability, the U.S. Census Bureau uses data swapping and other
techniques in its full-access public-use data files (see
http://factfinder.census.gov/jsp/saff/SAFFInfo.jsp?_pageId=su5_confidentiality).
Full public access is extremely popular with data users, for whom it
provides a very high level of flexibility and opportunity. Their main com-
plaint is that the datasets made available by this mechanism often include
fewer cases and variables than they would like, so that certain types of
analysis are impossible. Although data stewards appear generally satisfied
3 For other models, see the practices of the Carolina Population Center at the University of
North Carolina at Chapel Hill for use of data from the National Longitudinal Study of Adolescent Health
(http://www.cpc.unc.edu/projects/addhealth/data [accessed April 2006]) and the Nang Rong
study of social and environmental change, among others. As part of ICPSR’s Data Sharing for
Demographic Research project (http://www.icpsr.umich.edu/dsdr [accessed April 2006]), re-
searchers there have published a detailed review of contract terms used in restricted use
agreements, with recommendations about how to construct such agreements. Those docu-
ments are available at http://www.icpsr.umich.edu/dsdr/rduc [accessed April 2006].
with this form of data distribution, they have in recent years begun to
express concern about whether data can be shared this way without risk of
disclosure, and so have increasingly restricted the number of data collec-
tions available in this format. For example, the National Center for Health
Statistics (NCHS) linked the National Health Interview Survey to the Na-
tional Death Index and made the first two releases of the linked file available
publicly. The third release, which follows both respondents from the earlier
survey years and adds new survey years, is no longer available publicly; it is
available for restricted use in the NCHS Research Data Center.
Limited Licensing
Limited licensing provides a second tier of access for data that present
some risk of disclosure or harm, but for which the risk is limited because
there is little geographic precision—the geographic information has been
systematically masked (Armstrong et al., 1999) or sensitive variables have
been deleted or heavily masked. Limited licensing allows data to be distrib-
uted to responsible scientific data users (generally those affiliated with
known academic institutions) under terms of a license that requires the data
user and his or her employer to certify that the data will be used responsi-
bly. Data stewards release data in this fashion when they believe that there
is little risk of identification and that responsible professionals are able to
understand the risk and prevent it in their research activities. For example,
the Demographic and Health Surveys (DHS) (see http://www.measuredhs.com/
[April 2006]) distributes its large representative survey data collection
under a limited licensing model. It makes geocoded data available under a
more restricted type of limited licensing arrangement.
The DHS collects the geographic coordinates of its survey clusters, or
enumeration areas, but the boundaries or areas of those regions are not
made available. These geocodes can be attached to individual or household
records in the survey data, for which identifying information has been
removed. When particularly sensitive information has been collected in the
survey (e.g., HIV testing), the current policy is to introduce error into the
data, destroy the original data, and release only the data that have been
transformed.
Data users consider it a burden to obtain limited licensing agreements,
but both data stewards and users generally perceive them as successful
because they combine an obligation for education and certification with
relatively flexible access for datasets that present little risk of disclosure or
harm. Nevertheless, the limitations on the utility of data that may be altered
(see Armstrong et al., 1999) for release in this manner are still largely
unknown, in part because careful tests with the original data cannot be
conducted after the data have been transformed.
Strong Licensing
Strong licensing is a third tier of data access used for data that present
a substantial risk of disclosure and for which the data steward decides that
this risk cannot be controlled within the framework of responsible research
practice. Datasets are typically placed at this tier if they present a
substantial risk of disclosure but are not fully identified or if they
include attribute
data that are highly sensitive if disclosed, such as responses about sexual
practices, drug use, or criminal activity. Most often, these data are shared
through a license that requires special handling: for example, they may be
kept on a single computer not connected to a network, with specific techni-
cal requirements. Virtually all strong licenses require that the data user
obtain institutional review board (IRB) approval at his or her home institu-
tion. Many of these strong licenses also include physical monitoring, such
as unannounced visits from the data steward’s staff to make sure that
conditions are followed. These licenses may also require very strong
institutional assurances from the researcher’s employer, or may provide for
sanctions if their terms are not followed. For example, the license to use
data from the Health and Retirement Study of the National Institutes of
Health (NIH) includes
language that says the data user may be prevented from obtaining NIH
grants in the future if he or she does not adhere to the restrictions. Some
data stewards also require the payment of a fee, usually in the range of
$500 to $1,000, to support the expenses associated with document prepa-
ration and review and the cost of site visits.
Although some researchers and universities are wary of these agree-
ments, in recent years they have been seen as successful by most data users.
Data distributors, however, continue to be fearful that their rules about
data access are not being followed sufficiently closely or that sensitive data
are under inadequate control.
Data Enclaves
For data that present the greatest risk of disclosure or harm, or those
that are collected under tight legal restrictions—such as geospatial data that
are absolutely identifiable—access is usually limited to use within a re-
search enclave. For example, this will be the case when the fully geocoded
Nang Rong data are made available at the data enclave at ICPSR. The most
visible example of this practice in the United States today is the network of
nine Research Data Centers (RDCs) created by the Bureau of the Census—
Washington, DC; Durham, NC; New York City and Ithaca, NY; Boston,
MA; Ann Arbor, MI; Chicago; and Los Angeles and Berkeley, CA.4 The
4See http://webserver01.ces.census.gov/index.php/ces/1.00/researchlocations [accessed April
2006].
Bureau makes its most restricted data, including the full count of the Cen-
sus of Population and the Census of Business Enterprises, available only in
these centers.
The principle behind data enclaves is that researchers are not able to
leave the premises with any information that might identify an individual.
In practice, a trained professional reviews all the materials that each re-
searcher prints. For data analyses, researchers are typically allowed to re-
move only coefficients from regression-type analyses and tabulations that
have a large cell size (because small cell sizes may lead to identification).
Although many data stewards limit users to working within a single, super-
vised room described as a data center or enclave, alternatives also exist. For
example, in addition to its data enclaves NCHS also maintains a system
that allows data users to submit data analytic programs from a remote
location, have them run against the data in the enclave, and then receive the
results by e-mail. This procedure is sometimes performed with an auto-
mated disclosure review and sometimes with a manual, expert review.
There are considerable barriers of inconvenience and cost to use of the
data centers, which means that they are not used as much as they might or
should be. Most centers hold data from only a single data provider (for
example, the census, NCHS data, or Add Health), and the division of
work leads to inefficiencies that might be overcome if a single center held
data from more than one data provider. For the use of its data, the Census
Bureau centers require a lengthy approval process that can take a full year
from the time a researcher is ready to begin work, as well as a “benefit
statement” on the part of the researcher that demonstrates the work under-
taken in the RDC will not only contribute to science, but will also deliver a
benefit to the Census Bureau—something required by the Bureau’s statu-
tory authority. Although other data centers and enclaves do not require
such lengthy approval processes, many require a substantial financial pay-
ment from the researcher (often calculated as a per day or per month cost of
research center use), in addition to travel and lodging costs. Personal sched-
uling to enable a researcher to travel to a remote site competes with teach-
ing, institutional service, and personal obligations and can be a serious
barrier to use of data in enclaves. The challenge of scheduling becomes even
more severe in the context of the large, interdisciplinary teams often in-
volved in the analysis of spatial social science data and the need to use
specialized technology and software. In addition to the cost passed on to
users, the data stewards who maintain data enclaves bear considerable cost
and space requirements.
In sum, data enclaves are effective but inefficient and inequitable. So-
cial science research is faced with the prospect of full and equal access to
data when risk is low, but highly differential and unequal access when risks
are high. Considerable improvements in data access regimes will be re-
quired so that price will not be the mediating factor that determines who
has access to linked social science and geospatial data.
TECHNICAL APPROACHES
Data stewards and statistical researchers have developed a variety of
techniques for limiting disclosure risks (for a summary of many of them, see
National Research Council, 2005a). This section briefly reviews some of
these methods and discusses their strengths and weaknesses in the context
of spatial data. Generally, we classify the solutions as data limitation (re-
leasing only some of the data), data alteration (releasing perturbed versions
of the data), and data simulation (releasing data that were not collected
from respondents but that are intended to perform as the original data
when analyzed). The approaches described here are designed to preserve as
much spatial information as possible because that information is necessary
for important research questions. In this way, they represent advances over
older approaches to creating public-use social science data, in which the
near-universal strategy was to delete all precise spatial information from
the data, usually through aggregation to large areas.
Data Limitation
Data limitation involves manipulations that restrict the number of vari-
ables, the number of values for responses, or the number of cases that are
made available to researchers. The purpose of data limitation is to reduce
the number of unique values in a dataset (reducing the risk of identification)
or to reduce the certainty of identification of a specific respondent by a
secondary user. A very simple approach sometimes taken with public-use
data is to release only a small fraction of the data originally collected,
effectively deleting half or more of all cases. This approach makes it diffi-
cult, even impossible, for a secondary user who knows that an individual is
in the sample to be sure that she or he has identified the right person: the
target individual may have been among those deleted from the public
dataset.
For tabular data, as well as some microdata, one data limitation ap-
proach is cell suppression. The data steward essentially blanks out cells
with small counts in tabular data or blanks out the values of identifiers or
sensitive attributes in microdata. The definition of “small counts” is se-
lected by the data steward. Frequently, cells in tables are not released unless
they have at least three members. When marginal totals are preserved, as is
often planned in tabular data, other values besides those at risk may need to
be suppressed; otherwise, the data analyst can subtract the sum of the
available values from the total to obtain the value of the suppressed data.
Complementary cells are selected to optimize (at least approximately) vari-
ous mathematical criteria. (For discussions of cell suppression, see Cox,
1980, 1995; Willenborg and de Waal, 1996, 2001.)
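The need for complementary suppression can be illustrated with a minimal sketch. The row of counts, the threshold of three, and the function names below are hypothetical and chosen only for illustration; they do not represent any data steward's actual procedure.

```python
def suppress_small_cells(row, threshold=3):
    """Blank out (None) any cell whose count is below the threshold."""
    return [c if c >= threshold else None for c in row]


def recover_single_suppressed(row, total):
    """If exactly one cell is suppressed and the marginal total is published,
    the hidden value is the total minus the sum of the visible cells."""
    hidden = [i for i, c in enumerate(row) if c is None]
    if len(hidden) != 1:
        return None  # zero or several suppressed cells: no exact recovery this way
    visible_sum = sum(c for c in row if c is not None)
    return total - visible_sum


counts = [12, 2, 9, 7]            # hypothetical counts for one table row
total = sum(counts)               # published marginal total
released = suppress_small_cells(counts)
recovered = recover_single_suppressed(released, total)

# Suppressing a second (complementary) cell blocks the subtraction.
released_with_complement = list(released)
released_with_complement[3] = None
blocked = recover_single_suppressed(released_with_complement, total)
```

In this sketch the single suppressed count is recovered exactly by subtraction. Once a complementary cell is also suppressed, only the sum of the two hidden cells can be derived, which still bounds each value from above.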
Cell suppression has drawbacks. It creates missing data, which compli-
cates analyses because the suppressed cells are chosen for their values and
are not randomly distributed throughout the dataset. When there are many
records at risk, as is likely to be the case for spatial data with identifiers,
data disseminators may need to suppress so many values to achieve satisfac-
tory levels of protection that the released data have limited analytical util-
ity. Cell suppression is not necessarily helpful for preserving confidentiality
in survey data that include precise geospatial locations. It is possible, even if
some or many cells are suppressed, for confidentiality to be breached if
locational data remain. Cell suppression also does not guarantee protection
in tabular data: it may be possible to determine accurate bounds for values
of the suppressed cells using statistical techniques (Cox, 2004; Fienberg and
Slavkovic, 2004, 2005). An alternative to cell suppression in tabular data is
controlled tabular adjustment, which adds noise to cell counts in ways that
preserve certain analyses (Cox et al., 2004).
Data can also be limited by aggregation. For tabular data, aggregation
corresponds to collapsing levels of categorical variables to increase the cell
size for each level. For microdata, aggregation corresponds to coarsening
variables; for example, releasing ages in 5-year intervals or locations at the
state level in the United States. Aggregation reduces disclosure risks by
turning unique records into replicated records. It preserves analyses at the
level of aggregation but creates ecological inference problems for lower
levels of aggregation.
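As a rough illustration of how coarsening turns unique records into replicated ones, the sketch below groups exact ages into 5-year intervals and counts the records whose combination of values appears only once. The records and function names are hypothetical.

```python
from collections import Counter


def count_unique_records(records):
    """Number of records whose combination of values appears exactly once."""
    tally = Counter(records)
    return sum(1 for n in tally.values() if n == 1)


def coarsen_age(age, width=5):
    """Map an exact age to the lower bound of its interval, e.g. 52 -> 50."""
    return (age // width) * width


# Hypothetical microdata: (age, state) pairs.
records = [(50, "NC"), (52, "NC"), (53, "NC"), (61, "MI"), (64, "MI"), (38, "MI")]
coarse = [(coarsen_age(age), state) for age, state in records]

unique_before = count_unique_records(records)   # every record is unique
unique_after = count_unique_records(coarse)     # only (35, "MI") remains unique
```

The remaining unique record shows that a fixed interval width does not guarantee protection for every case; sparse categories may need further collapsing.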
For spatial data, stewards can aggregate spatial identifiers or attribute
values or both, but the aggregation of spatial identifiers is especially impor-
tant. Aggregating spatial attributes puts more than one respondent into a
single spatial location, which may be a point (latitude-longitude), a line
(e.g., along a highway), or an area of various shapes (e.g., a census tract or
other geographic division or a geometrically defined area, such as a circle).
This aggregation has the effect of eliminating unique cases within the dataset
or eliminating the possibility that a location in the data refers to only a
single individual in some other data source, such as a map or list of ad-
dresses. In essence, this approach coarsens the geographic data.
Some disclosure limitation policies prohibit the release of information at
any level of aggregation smaller than a county. Use of a fixed level of geogra-
phy, however, introduces variability in the degree of masking provided. Many
rural counties in the United States contain very small total populations, on
the order of 1 thousand, while urban counties may contain more than
1 million people. The same problem arises with geographic areas defined by
spatial coverage: 1 urban square kilometer holds many more people than
1 rural square kilometer. The more social identifiers, such as gender, race, or
age, are provided for an area, the greater the risk of disclosure.
The use of aggregation to guard against accidental release of confiden-
tial information introduces side effects into analyses. When point data are
aggregated to areas that are sufficiently large to maintain confidentiality,
the ability of researchers to analyze data for spatial patterns is attenuated.
Clusters of disease that may be visually evident or statistically significant at
the individual level, for example, will often become undetectable at the
county level of aggregation. Other effects arise as a consequence of the well-
known relationship between variance and aggregation: variance tends to
decrease as the size of aggregated units increases (see Robinson, 1950; Clark
and Avery, 1976). The suppression of variance with increasing levels of
aggregation introduces uncertainty (sometimes called the ecological infer-
ence problem) into the process of making inferences based on statistical
analyses and is a component of the more general modifiable areal unit
problem in spatial data analysis (see Openshaw and Taylor, 1979).
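The relationship between variance and aggregation can be seen in a small constructed example. The values below are hypothetical and deliberately arranged so that neighboring observations differ; averaging consecutive blocks stands in for aggregating individuals into progressively larger areas.

```python
from statistics import mean, pvariance


def aggregate_means(values, group_size):
    """Average consecutive blocks of `group_size` values (a stand-in for areas)."""
    return [mean(values[i:i + group_size]) for i in range(0, len(values), group_size)]


# Hypothetical individual-level values, ordered along a transect.
individual = [1, 9, 2, 8, 3, 7, 4, 6, 2, 10, 1, 9]

var_individual = pvariance(individual)
var_pairs = pvariance(aggregate_means(individual, 2))   # small areas
var_quads = pvariance(aggregate_means(individual, 4))   # larger areas
```

Here the variance of the area means shrinks as the areas grow. How quickly it shrinks in real data depends on how strongly nearby observations resemble one another, which this sketch does not attempt to model.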
For tabular data, another data limitation alternative is to release a
selection of subtables or collapsed tables of marginal totals for some prop-
erties to ensure that the cells for the full joint table are large (Fienberg and
Slavkovic, 2004, 2005). This approach preserves the possibility of analysis
when counts from the released subtables are sufficient for the analysis. For
spatial data, this approach could be used with aggregated spatial identifiers,
perhaps enabling smaller amounts of aggregation. This approach is
computationally expensive, especially for high-dimensional tables, and re-
quires additional research before a more complete assessment can be made
of its effectiveness.
Data Alteration
Spatial attributes are useful in linked social-spatial data because they
precisely record where an aspect of a respondent’s life takes place. Some-
times these spatial data are collected at the moment that the original social
survey data are collected. In the Nang Rong (see Box 1-1) and other similar
studies, researchers use a portable global positioning system (GPS) device to
record the latitude and longitude of the location of the interview or of
multiple locations (farm fields, daily itineraries) during the interview pro-
cess. It is also possible for researchers to follow the daily itineraries of study
participants by use of GPS devices or RFID (radio frequency identification)
tags.
In the United States and other developed countries, however, locations
are frequently collected not as latitude and longitude from a GPS device,
but by asking an individual to supply a street address. Street addresses
require some transformation (e.g., to latitude and longitude) to be made
specific and comparable. This transformation, called geocoding, consists of
the processes through which physical locations are added to records. There
are several types of geocoding that vary in their level of specificity; each
approach uses different materials to support the assignment of coordinates
to records (see Box 3-1).
Areal geocoding can reduce the likelihood of identification, but most
other forms of geocoding have the potential to maintain or increase the risk
of disclosure because they provide the data user with one or more precise
locations (identifiers) for a survey respondent. The improvements in accu-
racy associated with new data sources and new technologies, such as parcel
geocoding, only heighten the risk. As a consequence, a new set of tech-
niques has been devised to distort locations, and hence to inhibit disclosure.
Two of the general methods available are swapping and masking.
Swapping It is sometimes possible to limit disclosure risk by swapping
data. For example, a data steward can swap the attributes of a person in
one area for those of a person in another area, especially if some of those
attributes are the same (such as two 50-year-old white males with different
responses on other questions), in order to reduce a secondary user’s confi-
dence in correctly identifying an individual. Swapping can be done on
spatial identifiers or nonspatial attributes, and it can be done within or
across defined geographic locations. Swapping small fractions of data gen-
erally attenuates associations between the swapped and unswapped vari-
ables, and swapping large fractions of data can completely destroy those
associations. Swapping data will make spatial analyses meaningless unless
the spatial relationships have been carried into the swapped data. It is
generally difficult for analysts of swapped data to know how much the
swapping affects the quality of analyses.
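The attenuation of associations under swapping can be sketched with simulated data. Everything below is an illustrative assumption: the two variables are artificial and perfectly associated before swapping, the seed is arbitrary, and the swapping rule (shuffling values among a randomly chosen subset of records) is only one of many possible schemes.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def swap_fraction(values, fraction, rng):
    """Return a copy in which a random `fraction` of records have had
    their values shuffled among themselves."""
    swapped = list(values)
    chosen = rng.sample(range(len(values)), int(fraction * len(values)))
    pool = [values[i] for i in chosen]
    rng.shuffle(pool)
    for i, v in zip(chosen, pool):
        swapped[i] = v
    return swapped


rng = random.Random(42)        # fixed seed so the sketch is repeatable
unswapped = list(range(200))   # hypothetical variable left untouched
attribute = list(unswapped)    # hypothetical variable to be swapped

corr_original = pearson(unswapped, attribute)
corr_small = pearson(unswapped, swap_fraction(attribute, 0.10, rng))
corr_large = pearson(unswapped, swap_fraction(attribute, 1.00, rng))
```

Swapping a tenth of the records lowers the correlation somewhat, while swapping all of them leaves it near zero. An analyst who sees only the swapped file has no direct way to tell which of these situations applies.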
When data stewards swap cases from different locations but leave
(genuine) exact spatial identifiers on the file, the identity of participants
may be disclosed, even if attributes cannot be meaningfully linked to the
participant. For example, if the data reveal that a respondent lived at a
particular address, even if that person’s data are swapped with someone
else’s data, a secondary user would still know that a person living at that
address was included in the study. Swapping spatial identifiers thus is
better suited for limiting disclosures of respondents’ attributes than their
identities. Swapping may not reduce—and probably increases—the risk of
mistaken attribute disclosures from incorrect identifications.
Swapping may be more successful at protecting participants’ identities
when locations are aggregated. However, swapping may not provide much
additional protection beyond the aggregation of locations, and it may de-
crease data quality relative to analyzing the unswapped aggregated data.
BOX 3-1
Geocoding Methods
Areal Geocoding Areal geocoding assigns observations to geographic areas. If
all a researcher needs is to assign a respondent to a political jurisdiction, census
unit, or administrative or other areas in order to match attributes of those larger
areas to the individual and perform hierarchical analyses, areal geocoding resolu-
tion is a valuable tool. Areal geocoding can be implemented when the database
has either addresses or latitude and longitude data, either through the use of a list
of addresses that are contained in an area or through the use of an algorithm that
determines whether a point is contained within a particular polygon in space. In the
latter case, a digital file of polygon geometry is needed to support the areal geoc-
oding process.
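The point-in-polygon step can be sketched as follows. This is a minimal ray-casting test under simplifying assumptions (planar coordinates, simple polygons, no special handling of points that fall exactly on a boundary); the tract names and vertices are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a ray from (x, y)
    toward +x crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def areal_geocode(x, y, areas):
    """Return the name of the first area whose polygon contains the point."""
    for name, polygon in areas.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Hypothetical tracts given as lists of (x, y) vertices.
areas = {
    "tract_A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "tract_B": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
```

For example, `areal_geocode(1, 1, areas)` assigns the point to `tract_A`, and a point outside both polygons is left unassigned.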
Interpolated Geocoding Interpolated geocoding estimates the precise location
of an address along a street segment, typically defined between street intersec-
tions, on a proportional basis. This approach relies on the use of a geographic
base file (GBF) that contains street centerline descriptions and address ranges for
each side of each street segment in the coverage area. An example is the U.S.
Census Bureau’s TIGER (topologically integrated geographic encoding and refer-
encing) files. For any specific address, an algorithm assigns coordinates to records
by finding the street segment (typically, one side of a block along a street) that
contains the address and interpolating. Thus, the address 1225 Maple Street would
be placed one-quarter of the way along the block that contains the odd-numbered
addresses 1201-1299 and assigned the latitude and longitude appropriate to that
precise point.
Interpolated geocoding can produce digital artifacts, such as addresses placed
in the middle of a curving street, or errors, such as can occur if, for example, 1225
is the last house on the 1201 block of Maple Street. Some of these problems can
Masking Masking involves perturbations or transformations of some
data. Observations may, in some cases, be represented as points but have
their locations altered in such a way as to limit accurate recovery of
person-level information. One of the easiest masking approaches to
implement adds a stochastic component to each observation, which can be
visualized as moving the point by a fixed or random amount so that the
information about a respondent is associated not with that person’s true
location but with another location (see Chakraborty and Armstrong, 2001).
That is, one can replace an accurately located point with another point
drawn from a uniform distribution over a circle of radius r centered on
that location. The radius parameter may be constant or allowed to vary as
a function of density or some other factor important to a particular
application. If density is used, r will be large in low-density (rural)
areas and will be adjusted downward in areas with higher densities.
OCR for page 53
53
MEETING THE CHALLENGES
be minimized with software (e.g., by setting houses back from streets). The extent
to which such data transformations change the results of data analyses from what
they would have been with untransformed data has not been carefully studied.
This approach may reduce disclosure risks.
Parcel Geocoding Parcel geocoding makes use of new cadastral information
systems that have been implemented in many communities. When this approach
is used, coordinates are often transferred from registered digital orthophotographs
(images that have been processed to remove distortion that arises as a conse-
quence of sensor geometry and variability in local elevation, for example). These
coordinates typically represent such features as street curbs and centerlines, side-
walks, and most importantly for geocoding, the locations of parcel and building
footprint polygons and either parcel centroids or building footprint centroids. Thus,
a one-to-one correspondence between each address and an accurate coordinate
(representing the building or parcel centroid) can be established during geocoding.
This approach typically yields more accurate positional information than interpolat-
ed geocoding methods.
GNSS-based Geocoding The low cost and widespread availability of devices
used to measure location based on signals provided by Global Navigation Satellite
Systems (GNSS), such as the Global Positioning System deployed by the U.S.
Department of Defense, GLONASS (Russia), and Galileo (European Union), has
encouraged some practitioners to record coordinate locations for residence loca-
tions through field observations. As in the parcel approach, a one-to-one corre-
spondence can be established between each residence and an associated coordi-
nate. Though this process is somewhat labor intensive, the results are typically
accurate since trained field workers can make policy-driven judgments about how
to record particular kinds of information.
Though masking can be performed easily, it has a negative side effect:
the displaced points can be assigned to locations that contain real
observations, thus creating the possibility of false identification and harm
to individuals who may not even be respondents in the research. Moreover,
research on spatial data transformations that involve moving the location of
data points (Armstrong et al., 1999; Rushton et al., 2006) shows that these
transformations may have a significant deleterious effect on the analysis of
data. Not only is there still a risk of false identification, but sometimes the
points are placed in locations where they cannot be (residences in lakes
that do not permit houseboats, for example). In addition, no single
transformation process provides data that are valuable for every possible form
of analysis. These limitations have major consequences both for successful
analysis and for reduction of the disclosure risk.
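
The random displacement form of masking discussed above can be illustrated with a short Python sketch. It is a simplified example, not a method taken from the cited studies: the function names are hypothetical, coordinates are assumed to be planar and measured in meters, and the density rule (a radius sized so that the circle is expected to contain a chosen number of people) is only one assumed way of making the radius shrink as density rises.

```python
import math
import random


def radius_for_density(density_per_km2, target_population=50.0):
    """Radius in meters of a circle expected to hold target_population people.

    The rule and the default of 50 are illustrative assumptions.
    """
    if density_per_km2 <= 0:
        raise ValueError("density must be positive")
    area_km2 = target_population / density_per_km2
    return 1000.0 * math.sqrt(area_km2 / math.pi)


def mask_point(x, y, radius, rng):
    """Return a point drawn uniformly from the disk of given radius around (x, y)."""
    # The square root keeps the draw uniform over area rather than
    # concentrating displaced points near the true location.
    distance = radius * math.sqrt(rng.random())
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return x + distance * math.cos(angle), y + distance * math.sin(angle)
```

A call such as `mask_point(x, y, radius_for_density(12.0), random.Random())` displaces a rural point farther, on average, than the same call with an urban density. The sketch does nothing to prevent a displaced point from landing on another residence or in an impossible location, which are the limitations described above.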
Adding noise generally inflates uncertainties in data analyses. For some
attributes being estimated, the effect is to increase the width of confidence
intervals. Adding noise can also attenuate associations: in a simple linear
regression model, for example, the estimated regression coefficients get
closer to zero when the predictors have extra noise. There are techniques
for accounting for the extra noise, called measurement error models (e.g.,
Fuller, 1993), but they are not easy to use except in such standard analyses
as regressions. Some research by computer scientists and cryptographers
under the rubric of “privacy-preserving data mining” (e.g., Agrawal and
Srikant, 2000; Chawla et al., 2005) also follows the strategy of adding
specially constructed random noise to the data, either to individual values
or to the results of the computations desired by the analyst. Privacy-
preserving data mining approaches have been developed for regression
analysis, for clustering algorithms, for discrimination, and for association
rules. Like other approaches that add noise, these approaches generally
sacrifice data quality for protection against disclosure. The nature of that
tradeoff has not been thoroughly evaluated for social and spatial data.
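
The attenuation of regression coefficients under added noise can be seen in a small simulation. The Python sketch below is illustrative only: the function names, sample size, and variances are assumptions chosen for the example. It fits a simple linear regression twice, once on the true predictor and once on the predictor with independent noise added. Under the classical measurement error model, the slope is expected to shrink by the factor var(x) / (var(x) + var(noise)), which is one-half for the default settings here.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate_attenuation(n=20000, true_slope=2.0, noise_sd=1.0, seed=1):
    """Return (slope using the true predictor, slope using the noisy predictor)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]           # true predictor, variance 1
    ys = [true_slope * x + rng.gauss(0.0, 0.5) for x in xs]
    noisy_xs = [x + rng.gauss(0.0, noise_sd) for x in xs]  # predictor after masking noise
    return ols_slope(xs, ys), ols_slope(noisy_xs, ys)
```

With the defaults, the first slope should be close to the true value of 2 and the second close to 1, since the added noise has the same variance as the predictor itself.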
Secure Access
An emerging set of techniques aims to provide users with the results of
computations on data without allowing them to see individual data values.
Some of these are based on variants of secure summation (Benaloh, 1987),
which allows different data stewards to compute the exact values of sums
without sharing their values. One variant, used at the National Center for
Education Statistics, provides public data on a diskette or CD-ROM that
is encoded to allow users to construct special tabulations while preventing
them from seeing the individual-level data or from calculating totals when
there are fewer than 30 respondents in a cell. Secure summation variants
entail no sacrifice in data quality for analyses based on sums. They provide
excellent confidentiality protection, as long as the database stewards follow
specified protocols. This approach is computationally intensive and chal-
lenging to set up (for a review of these methods, see Karr et al., 2005).
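
The idea behind secure summation can be conveyed with a toy example based on additive secret sharing. The Python sketch below is a single-process simulation under simplifying assumptions, not the protocol of the cited work or of any production system: the function names and the modulus are hypothetical choices, values are assumed to be non-negative integers whose total is below the modulus, and all parties are assumed to follow the protocol honestly.

```python
import secrets

# Illustrative modulus; the true total must be smaller than this.
MODULUS = 2**61 - 1


def make_shares(value, n_parties):
    """Split value into n_parties random shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last_share = (value - sum(shares)) % MODULUS
    return shares + [last_share]


def secure_sum(private_values):
    """Simulate stewards computing an exact total without revealing their values."""
    n = len(private_values)
    # Each steward splits its value and sends one share to every steward.
    all_shares = [make_shares(value, n) for value in private_values]
    # Each steward publishes only the sum of the shares it received.
    partial_sums = [
        sum(all_shares[owner][receiver] for owner in range(n)) % MODULUS
        for receiver in range(n)
    ]
    return sum(partial_sums) % MODULUS
```

Each individual share is a uniformly random number and reveals nothing on its own, yet the published partial sums add up to the exact total, which is why approaches of this kind entail no loss of data quality for sum-based analyses.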
Another approach involves remote access model servers, to which users
submit requests for analyses and, in return, receive only the results of
statistical analyses, such as estimated model parameters and standard er-
rors. Confidentiality can be protected because the remote server never al-
lows users to see the actual data (see Boulos et al., 2006). Remote access
servers do not protect perfectly, however, as the user may be able to learn
identities or sensitive attributes through judicious queries of the system (for
examples, see Gomatam et al., 2005). Computer scientists also have devel-
oped methods for secure record linkage, which enable two or more data
stewards to determine which records in their databases have the same
values of unique identifiers without revealing the values of identifiers for
the other records in their databases (Churches and Christen, 2004; O’Keefe
et al., 2004).
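
A much simplified picture of secure record linkage is exact matching on keyed hashes of identifiers. The Python sketch below is an assumption-laden illustration rather than the method of the cited papers: the function names are hypothetical, the two stewards are assumed to share a secret key in advance, and only exact matches after simple normalization are found. Each steward would compute digests of its own identifiers and exchange only the digests.

```python
import hashlib
import hmac


def blind(identifier, shared_key):
    """Return a keyed digest of an identifier after simple normalization."""
    normalized = identifier.strip().upper()
    return hmac.new(shared_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def matching_digests(ids_a, ids_b, shared_key):
    """Return the digests that appear in both stewards' blinded identifier sets."""
    blinded_a = {blind(identifier, shared_key) for identifier in ids_a}
    blinded_b = {blind(identifier, shared_key) for identifier in ids_b}
    # Only digests are compared; identifiers that do not match stay hidden
    # from anyone who lacks the shared key.
    return blinded_a & blinded_b
```

Each steward can map the matching digests back to its own records, while the digests of unmatched records disclose nothing useful to a party without the key. Identifiers drawn from a small or guessable set remain vulnerable to anyone who holds the key, so a scheme of this kind still depends on trust between the stewards.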
Secure access approaches have not generally been used by stewards of
social science data, and the risks and benefits for spatial-social data dis-
semination and sharing are largely unevaluated. However, the concept un-
derpinning these techniques—to allow users to perform computations with
the data without actually seeing the data—may point to solutions for shar-
ing social and spatial data.
Data Simulation
Data providers may also release synthetic (i.e., simulated) data that
have characteristics similar to those of the genuine data as a way to preserve both
confidentiality and the possibility of meaningful data analysis, an approach
first proposed by Rubin (1993) in the statistical literature. The basic idea is
to fit probability models to the original data, then simulate and release new
data that fit the same models. Because the data are simulated, the released
records do not correspond to individuals from the original file and cannot
be directly linked to records in other datasets. These features greatly reduce
identity and attribute disclosure risks. However, synthetic data are subject
to inferential disclosure risk when the models used to generate data are too
accurate. For example, when data are simulated from a regression model
with a very small mean square error, analysts can use the model to estimate
outcomes precisely and can infer the identities of respondents with high
accuracy.
When the probability models closely approximate the true joint prob-
ability distributions of the actual data, the synthetic data should have simi-
lar characteristics, on average. The “on average” caveat is important: pa-
rameter estimates from any one synthetic dataset are unlikely to equal
exactly those from the actual data. The synthetic parameter estimates are
subject to variation from sampling the collected data and from simulating
new values. It is not possible to estimate all sources of variation from only
one synthetic dataset, because an analyst cannot measure the amount of
variability from the synthesis. Rubin’s (1993) suggestion is to simulate and
release multiple, independent synthetic datasets from the same original
data. An analyst can then estimate parameters and their variances in each of
the synthetic datasets and combine the results with simple formulas (see
description by Raghunathan et al., 2003).
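
The combining step can be sketched as follows. The Python function below reflects the rules as they are commonly stated for fully synthetic data (point estimate equal to the average of the per-dataset estimates; variance equal to (1 + 1/m) times the between-dataset variance minus the average within-dataset variance) and should be checked against the cited description before use. The function name and argument layout are hypothetical.

```python
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Combine results from m fully synthetic datasets.

    estimates: the point estimate of one parameter from each dataset
    variances: the estimated variance of that estimate from each dataset
    Returns (combined point estimate, combined variance estimate).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching results from at least two datasets")
    q_bar = mean(estimates)    # combined point estimate
    b = variance(estimates)    # between-dataset variance (m - 1 denominator)
    u_bar = mean(variances)    # average within-dataset variance
    # This difference can come out negative, especially when m is small;
    # it is returned unchanged here so the caller can decide how to handle it.
    total_variance = (1.0 + 1.0 / m) * b - u_bar
    return q_bar, total_variance
```

The between-dataset term is what a single synthetic dataset cannot supply, which is why multiple releases are needed to estimate all sources of variation.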
Synthetic datasets can have many positive data utility features (see
Rubin, 1993; Raghunathan et al., 2003; Reiter, 2002, 2004, 2005b). When
the data generation models are accurate, valid inferences can be obtained
from multiple synthetic datasets by combining standard likelihood-based or
survey-weighted estimates. An analyst need not learn new statistical meth-
ods or software programs to unwind the effects of the disclosure limitation
method. Synthetic datasets can be generated as simple random samples, so
that analysts can ignore the original complex sampling design for infer-
ences. The data generation models can adjust for nonsampling errors and
can borrow strength from other data sources, thereby making high-quality
inferences possible. Finally, because all units are simulated, geographic
identifiers can be included in synthetic datasets.
Synthetic data reflect only those relationships included in the models
used to generate them. When the models fail to reflect certain relationships,
analysts’ inferences also do not reflect those relationships. For example, if
the data generation model for an attribute does not take into account
relationships between location and that attribute, the synthetic data will
contain zero association between the spatial data and that attribute. Simi-
larly, incorrect distributional assumptions built into the models are passed
on to the users’ analyses. For example, if the data generation model for an
attribute is a normal distribution when the actual distribution is skewed,
the synthetic data will fail to reflect the shape of the actual distribution.
Such omitted relationships and incorrect assumptions are a potentially serious
limitation of releasing fully synthetic data. Practically, this means that some
analyses cannot be performed accurately and that data disseminators need
to release information that helps analysts decide whether or not the syn-
thetic data are reliable for their analyses.
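
The effect of an incorrect distributional assumption can be shown with a small simulation. In the Python sketch below, which uses assumed names and made-up data, a skewed attribute is generated from an exponential distribution, a normal model is fitted to it, and synthetic values are drawn from the fitted model. The synthetic values match the mean and standard deviation of the original but not its shape.

```python
import random
from statistics import mean, stdev


def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = mean(values)
    s = stdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def synthesize_with_normal_model(original, seed=0):
    """Fit a normal model to the data and simulate one value per original record."""
    rng = random.Random(seed)
    mu = mean(original)
    sigma = stdev(original)
    return [rng.gauss(mu, sigma) for _ in original]


# Made-up skewed attribute standing in for an actual survey variable.
data_rng = random.Random(42)
original = [data_rng.expovariate(1.0) for _ in range(20000)]
synthetic = synthesize_with_normal_model(original)
```

Comparing `skewness(original)` with `skewness(synthetic)` shows a strongly skewed original and a nearly symmetric synthetic attribute, so an analysis that depends on the tail of the distribution would be misleading if run on the synthetic values.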
To reduce dependency on data generation models, Little (1993) sug-
gests a variant of the fully synthetic data approach called partially synthetic
data. Imagine a dataset with three kinds of information: information that,
when combined, is a potential indirect identifier of the respondent (age, sex,
race, occupation, and spatial location); information that is potentially highly
sensitive (responses about antisocial or criminal behavior, for example);
and a residual body of information that is less sensitive and less likely to
lead to identification (responses about personal values or nonsensitive be-
haviors). Partially synthetic data might synthesize the first two categories of
data, while retaining the actual data of the third category. For example, the
U.S. Federal Reserve Board protects data in the U.S. Survey of Consumer
Finances by replacing monetary values at high disclosure risk with multiple
imputations, releasing a mixture of these imputed values and the unreplaced,
actual values (Kennickell, 1997). The U.S. Bureau of the Census protects
data in longitudinal linked data sets by replacing all values of some sensitive
variables with multiple imputations and leaving other variables at their
actual values (Abowd and Woodcock, 2001). Partially synthetic approaches
promise to maintain the primary benefits of fully synthetic data—protect-
ing confidentiality while allowing users to make inferences without learning
complicated statistical methods or software—with decreased sensitivity to
the specification of the data generation models (Reiter, 2003).
The protection afforded by partially synthetic data depends on the
nature of the synthesis. Replacing key identifiers with imputations obscures
the original values of those identifiers, which reduces the chance of identifi-
cations. Replacing values of sensitive variables obscures the exact values of
those variables, which can prevent attribute disclosures. Partially synthetic
datasets present greater disclosure risks than fully synthetic ones: the origi-
nally sampled units remain in the released files, albeit with some values
changed, leaving values that analysts can use for record linkages.
Currently, for either fully or partially synthetic data, there are no semi-
automatic data synthesizers. Data generation models are tailored to indi-
vidual variables, using sequential regression modeling strategies
(Raghunathan et al., 2001) and modifications of bootstrapping, among
others. Substantial modeling expertise is required to develop valid synthe-
sizers, as well as to evaluate the disclosure risks and data utility of the
resulting datasets. Modeling poses an operational challenge to generating
synthetic datasets. A few evaluations of the disclosure risk and data utility
issues have been done with social surveys, but none with linked spatial-
social data.
For spatially identifiable data, a fully synthetic approach simulates all
spatial identifiers and all attributes. Such an approach can be achieved
either by first generating new values of spatial identifiers (for example,
by sampling addresses randomly from the population list) and then simulating
attribute values tied to those new values of identifiers, or by first generating
new attribute values and then simulating new spatial identifiers tied to
those new attribute values. In generating new identifiers, however, care
should be taken to avoid implausible or impossible results (e.g., private
property on public lands, residences in uninhabitable areas). Either way,
the synthesis requires models relating the geographic identifiers to the at-
tributes. Contextual variables can provide information for modeling. The
implications of these methods for data utility, and particularly for the
validity of inferences drawn from linked social-spatial data synthesized by
different methods, have not yet been studied empirically.
Fully synthetic records cannot be directly linked to records in other
datasets, which reduces data utility when linkage is desired. One possibility
is to base linkages on statistical analyses that attempt
to match synthetic records in one dataset with appropriate nonsynthesized
records in another dataset. Research has not been conducted to determine
how well such matching preserves data utility.
Partially synthetic approaches can be used to simulate spatial identifiers
or attributes. Simulating only the identifiers reduces disclosure risks with-
out distorting relationships among the attribute variables. Its effect on the
relationships between spatial and nonspatial variables depends on the qual-
ity of the synthesis model. At present, not much is known about the utility
of this approach.
Linking datasets on synthetic identifiers or on attributes creates match-
ing errors, and relationships between spatial identifiers and the linked vari-
ables may be attenuated. Analyses involving the synthetic identifiers reflect
the assumptions in the model used to generate new identifier values on the
basis of attribute values. This approach introduces error into matches ob-
tained by linking the partially synthetic records to records in other datasets.
Alternatively, simulating selected attributes reduces attribute risks without
disturbing the identifiers: this enables linking, but it does not prevent iden-
tity disclosures. Relationships between the synthetic attributes and the
linked attributes are attenuated—although to an as yet unknown degree—
when the synthesizing models are not conditional on the linked attributes.
This limitation also holds true when linking to fully synthetic data.
The release of partially synthetic data can be combined with other
disclosure limitation methods. For example, the Census Bureau has an
application, On the Map (http://lehdmap.dsd.census.gov/), that combines
synthetic data and the addition of noise. Details of the procedure, which
coarsens some workplace characteristics and generates synthetic travel ori-
gins conditional on travel destinations and workplace characteristics, have
not yet been published.