4
Technical Issues
This chapter discusses several issues related to the proposed census
coverage measurement (CCM) program for 2010: the sample design for
the CCM postenumeration survey (PES), use of logistic regression models,
missing data in new coverage error models, matching cases with minimal
information, and demographic analysis. On several of these topics the
panel offers recommendations for the Census Bureau.
SAMPLE DESIGN FOR CENSUS COVERAGE MEASUREMENT
The Census Bureau is planning a CCM PES sample of 300,000 housing
units, with primary sampling units composed of block clusters (for
details, see Fenstermaker, 2005). An important question concerning the
census coverage measurement program in 2010 is to what extent the new
goal of process improvement can and should be incorporated into the
design of the CCM PES.
For purposes of CCM design, the United States will be divided into
3.7 million block clusters, and the CCM will select about 10,000 of these,
each averaging roughly 30 housing units (for a total of 300,000 housing
units). The Census Bureau will use an initial stratification of the
3.7 million block clusters into four types: (a) small, with between 0 and
2 housing units (as determined by the Census Bureau’s Master Address
File, or MAF, in 2009), (b) medium, with between 3 and 79 housing units,
(c) large, with more than 80 housing units, and (d) block clusters of
groups of American Indians on reservations. The current proposed CCM
design will allocate a minimum of 1,800 housing units from about 60
medium and large block clusters per state (3,000 block clusters of the
10,000), with the remainder allocated proportionate to state population
size. Also, Hawaii is allocated a minimum of 4,500 housing units in the
CCM sample (roughly 150 block clusters), and 10,000 housing units
(roughly 330 block clusters) are selected from American Indian
reservations, allocated in proportion to the number of American Indians
living on reservations in each state.
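The state allocation rule just described (a per-state minimum, with the remainder spread in proportion to population) can be sketched as follows. This is a minimal illustration, not the Census Bureau's procedure: the function name `allocate_clusters`, the toy populations, and the largest-remainder rounding are all assumptions made here, and the separate reservation allocation is left out.

```python
def allocate_clusters(populations, total_clusters, floor=60, overrides=None):
    """Give each state a minimum, then split the rest by population share."""
    overrides = overrides or {}
    base = {state: overrides.get(state, floor) for state in populations}
    remainder = total_clusters - sum(base.values())
    if remainder < 0:
        raise ValueError("minimum allocations exceed the total")
    total_pop = sum(populations.values())
    shares = {s: remainder * p / total_pop for s, p in populations.items()}
    alloc = {s: base[s] + int(shares[s]) for s in populations}
    # Hand out clusters lost to truncation, largest fractional part first
    # (this rounding rule is an assumption, not taken from the text).
    leftover = total_clusters - sum(alloc.values())
    order = sorted(populations, key=lambda s: shares[s] - int(shares[s]),
                   reverse=True)
    for state in order[:leftover]:
        alloc[state] += 1
    return alloc


# Toy inputs (not census figures): state "C" gets a higher minimum of 150,
# in the way the text describes for Hawaii.
toy_pop = {"A": 6_000_000, "B": 3_000_000, "C": 1_000_000}
toy_alloc = allocate_clusters(toy_pop, total_clusters=400, overrides={"C": 150})
```

With these toy inputs the minimums use 270 of the 400 clusters, and the other 130 are divided 78, 39, and 13 by population share.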
Once the 10,000 block clusters for the CCM are identified, they will be
independently listed to determine how many housing units are actually
present (since the MAF does not provide a perfect count and also because
the MAF will be slightly dated). In particular, for small block clusters,
this independent listing will find many of them to have more than two
housing units. If the number of housing units for small block clusters is
found to be more than 10, current plans are to include those block
clusters in the CCM sample with certainty. Otherwise, the remaining small
block clusters will be subsampled. (Plans are to subsample small block
clusters with between none and two housing units at the rate of 0.1, those
with between three and five housing units at the rate of 0.25, and those
with between six and nine housing units at the rate of 0.45.)
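The subsampling plan for small block clusters can be written as a short lookup. This is a sketch under stated assumptions: the function names are hypothetical, and because the text does not say how a cluster with exactly 10 listed units is treated, the code treats it as a certainty selection.

```python
import random


def small_cluster_rate(listed_units):
    """Retention probability for a block cluster initially classed as small."""
    if listed_units <= 2:
        return 0.10
    if listed_units <= 5:
        return 0.25
    if listed_units <= 9:
        return 0.45
    # The text specifies certainty for "more than 10" units; applying it to
    # exactly 10 units as well is an assumption made for this sketch.
    return 1.0


def subsample(clusters, seed=0):
    """Keep each (cluster_id, listed_units) pair with its stratum rate.

    Returns (cluster_id, listed_units, weight), where the weight is the
    inverse of the retention probability.
    """
    rng = random.Random(seed)
    kept = []
    for cluster_id, units in clusters:
        rate = small_cluster_rate(units)
        if rng.random() < rate:
            kept.append((cluster_id, units, 1.0 / rate))
    return kept
```

A retained cluster from the lowest stratum carries a subsampling weight of 10, which is why unexpected problems in such clusters can inflate variances.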
Finally, regarding substate allocations of block clusters, while plans
are currently not final, the Census Bureau is likely to include some modest
degree of oversampling of block clusters in areas that have a large fraction
of people who rent their residences and possibly in areas that have a large
fraction of minority residents.
The general argument in support of the state allocations for the 2010
CCM PES is that they mimic those for 2000, since the Census Bureau was
generally satisfied with the 2000 design of the Accuracy and Coverage
Evaluation (A.C.E.) Program in terms of the variance of estimates of net
undercoverage for poststrata. (The Census Bureau has no specific variance
requirements for the 2010 CCM estimates, because production of adjusted
counts is not anticipated.) With respect to substate allocations, the Census
Bureau is concerned with increased variances and so intends to refrain
from more than a modest amount of oversampling.
The Census Bureau examined some alternative specifications for the
design of the CCM PES to see if they might have advantages, using
simulation studies of both the quality of the resulting net coverage error
estimates and the quality of estimates of the number of omissions and
erroneous enumerations at the national level and for 64 poststrata (for
details, see Fenstermaker, 2005, 2006). Initially, four designs were
examined: (1) the design described above, i.e., allocations proportional to
total state population, with a minimum of 60 block clusters per state,
with Hawaii allocated at least 150 block clusters; (2) as (1) except with
Hawaii allocated at least 60 block clusters; (3) allocations to the four
census regions to minimize the variance of estimates of erroneous
enumerations, with allocations within regions proportional to state size;
and (4) half of the sample allocated proportional to the number of housing
units within update/leave areas¹ and half allocated proportional to
each state’s number of housing units. Through use of simulations, for each
design and resulting set of PES samples across simulation replications,
national estimates were computed of the rate of erroneous enumerations
(and the rates of erroneous enumerations from mail returns, from
nonresponse followup, and from coverage followup), the nonmatch rate, the
omission rate, and the net error rate. National estimates of the population
were also computed, along with their standard errors. The same analysis
was done at the poststratum level. One hundred replications were used
for the simulation study. The results supported retention of the design
that closely approximated the 2000 A.C.E. design (described above). A
subsequent analysis added six more proposed sample designs.
The panel supports the overall sample size of 300,000 housing units,
which was also endorsed, as part of Recommendation 6.1, by the Panel
to Review the 2000 Census (National Research Council, 2004b). Such a
design would produce net coverage estimates of similar precision to those
of the 2000 A.C.E. The adequacy of the CCM sample size is somewhat
supported by the adequacy of the A.C.E. sample size, though the
objectives of these surveys have changed and therefore arguments used to
support the A.C.E. sample size may no longer be fully relevant. However,
such a position is necessary given the current lack of experience
estimating the components of census coverage error.
Aside from sample size, the selection of a sample design for the CCM
in 2010 will involve addressing related but somewhat competing goals,
given that there are two overall objectives of the coverage measurement
program for 2010. First, there is the primary objective—the measurement
and analytic study of components of census coverage errors. Second, there
is still the need to be able to measure net coverage error for at least three
reasons: (1) to estimate the number of census omissions, (2) to serve the
many users who remain interested in assessments of net error at least for
states and major demographic groups, and (3) to facilitate comparison
with the quality of the 2000 census.
¹ These are areas in which the enumerator updates the address list and, at
the same time, drops off a questionnaire to be filled out and returned by mail.
To address the first general goal, one would like to target problematic
domains, determined using 2000 census data or data from the American
Community Survey (ACS), for which there is predicted to be a high
frequency of various types of census coverage error. (In an optimal design
for any individual component, the sampling rate would be proportional
to the stratum standard deviation, which is likely to be higher in strata
where the particular coverage error is greater.) However, one has to
be careful, because the design must also be able to detect any
unanticipated problems that might appear in areas that were relatively
easy to count in 2000. Each census seems to raise relatively novel sources
of census coverage error, and at the same time, each census seems to
have areas that were hard to count a decade earlier and subsequently are
relatively easy to count.
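The parenthetical remark about an optimal design, in which the sampling rate is proportional to the stratum standard deviation, corresponds to what is commonly called Neyman allocation. The sketch below is illustrative only; the function name and the two toy strata are assumptions, and real stratum standard deviations would have to be estimated from 2000 census or ACS data.

```python
def neyman_allocation(stratum_sizes, stratum_sds, total_sample):
    """Allocate a sample so that n_h is proportional to N_h * S_h.

    The sampling rate n_h / N_h is then proportional to S_h, the stratum
    standard deviation of the coverage-error indicator being estimated.
    """
    weights = [n * s for n, s in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [total_sample * w / total for w in weights]


# Two toy strata of equal size; the second is three times as variable.
sizes = [1000, 1000]
sds = [1.0, 3.0]
alloc = neyman_allocation(sizes, sds, total_sample=400)
rates = [a / n for a, n in zip(alloc, sizes)]
```

In this toy case the more variable stratum receives three times the sampling rate of the other, which is the kind of targeting the caution about unanticipated problems is meant to temper.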
Yet the goal of producing high-quality estimates of net coverage error
for all states and for all major demographic groups calls for a design that
is somewhat less targeted. As with estimation of components of error,
the most efficient design for estimation of net coverage error would
oversample areas with high rates of omissions or erroneous enumerations.
Oversampling reduces the sampling weights associated with the individual
blocks expected to exhibit the most variance in the two components
of net error. However, it is especially critical for net error estimation
to avoid extreme undersampling in any areas because large sampling
weights will quickly inflate variances if associated with blocks having
more problems than anticipated.
Another way to look at the situation is that there is a modest tension
between the need for cross-U.S. reports on net coverage error, and the
need for specific analytic studies on possibly problematic processes. So if
one had a list of potentially worrisome places where census processes are
likely to enumerate certain kinds of housing units with a high frequency
of coverage error, those places should be oversampled in the 2010 CCM
design. But this should be done while maintaining the ability to produce
reliable estimates of net coverage error at some level of geographic and
demographic detail.
Given this modest tension, the panel believes that the Census Bureau
has selected a design that may not sufficiently accommodate the
primary goal of measuring and analyzing components of census coverage
error. The state allocations of the Census Bureau’s proposed CCM sample
design are too oriented toward producing state-level estimates of net
undercoverage of comparable quality with the 2000 estimates. Instead,
the new purpose of CCM in 2010 should be accommodated by modifying
the state allocations of block clusters to include more block clusters
from states that are predicted to be harder to count, by including a greater
degree of oversampling of substate areas that are likely to be difficult to
count (though the latter is clearly dependent on the Census Bureau’s
as-yet-unspecified plans), or both.
The analysis carried out by the Census Bureau of 10 sample designs
for state allocations is thorough. However, with respect to substate
allocations of block clusters, the Census Bureau might consider, in addition
to oversampling medium and large block clusters with a high percentage of
renters, oversampling block clusters with large percentages of individuals
or housing units with other features that might be associated with census
coverage error, such as: (1) small multiunit structures, (2) a high
percentage of foreign-born residents in 2000, (3) a high percentage of proxy
interviews in 2000, (4) a high percentage of whole household imputation
in 2000, (5) a high percentage of vacation homes, or (6) recent additions
to the housing stock. In addition, as in 2000, the Census Bureau could
oversample blocks in which there is a higher chance of geocoding errors
or areas in which there was a high percentage of additions through the
Local Update of Census Address (LUCA) Program or block canvass adds
or deletes.² It is likely that efforts devoted to modifying substate
allocations will be more important than the state allocations, but both
deserve attention.
In addition to the above general suggestions, the panel has a specific
suggestion for the 2010 CCM sample design that provides a reasonable
compromise between designs that are focused on estimation of net
coverage error and designs that focus on components of census coverage
error. We urge the Census Bureau to evaluate a CCM sample design that
retains the identical structure of the current census design for a
substantial fraction of the sampled units, possibly 60 to 75 percent (by
making the obvious change to the sampling rates), and allocates the
remaining sample to anticipated problematic regions or block clusters.
Such a change would potentially provide a much greater number of census
coverage errors to support models examining which factors relate to
coverage error. At the same time, allocating the bulk of the sample to a
general purpose design would limit the risk of inflated variances for net
error estimates associated with finding large errors in unexpected
locations. Research on what percentage to use and how this compares with
the Census Bureau’s proposed design can be carried out using simulation
studies of the type the Census Bureau has already carried out, though it
also would be very useful to incorporate some accounting for any
differences that are expected to be seen between 2000 and 2010 (possibly
based on the ACS).
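One way to picture the suggested compromise is to scale the current allocation down to a core fraction and spread the freed sample over areas in proportion to some predicted difficulty. The sketch below is only an illustration: `mixed_allocation`, the difficulty scores, and the proportional rule for the targeted part are assumptions, since the text leaves the targeting method open.

```python
def mixed_allocation(current_alloc, difficulty, core_fraction=0.7):
    """Blend a scaled-down current design with a targeted supplement.

    current_alloc: area -> sample size under the current design.
    difficulty:    area -> nonnegative hard-to-count score (hypothetical).
    core_fraction: share of the total kept under the current structure;
                   the text suggests possibly 0.60 to 0.75.
    """
    total = sum(current_alloc.values())
    targeted_total = (1.0 - core_fraction) * total
    score_sum = sum(difficulty.values())
    return {
        area: core_fraction * current_alloc[area]
        + targeted_total * difficulty[area] / score_sum
        for area in current_alloc
    }


toy_current = {"easy": 500, "hard": 500}
toy_scores = {"easy": 1.0, "hard": 3.0}
toy_mixed = mixed_allocation(toy_current, toy_scores, core_fraction=0.7)
```

Here the total sample is unchanged, but the harder area gains about 75 units at the expense of the easier one; simulation studies would be needed to judge the variance cost of such a shift.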
² In considering characteristics that can serve as the basis for oversampling, it is important
to stress that even if some problematic circumstances are identified, it will generally be the
case that very little individual household-level information could be used as the basis for
oversampling, since such information would have to be available on the MAF. However,
areawide frequencies of the same characteristics can often provide reasonable surrogates.
For example, areas with many renters can be targeted, but one cannot target renters
individually for oversampling.
In conducting additional simulations, we propose that the Census
Bureau also reconsider the metrics it uses to compare and assess 2010
CCM sample designs. In its simulations, the Census Bureau examined
estimates of the coefficient of variation of estimates of net error, rate of
erroneous enumerations, rate of omissions, and the rate of P-sample
nonmatches. The Bureau also looked at coefficients of variation for net error
estimates for groups of poststrata from 2000. The panel suggests several
additional types of metrics. Letting
$\mathrm{DSE}_i$ = the direct dual-systems estimate for an aggregate i (e.g., state by
major demographic group), $E_i$ = the E-sample total for an aggregate i,
$P_i$ = the P-sample total for an aggregate i, $I_i$ = the number of imputations
for an aggregate i, $\mathrm{EE}_i$ = the number of erroneous enumerations for an
aggregate i, and $M_i$ = the number of matches for an aggregate i, the panel
believes that the following metrics would provide more direct indication
of the benefits of alternative CCM designs:
\[
  \frac{\mathrm{DSE}_i + I_i}{E_i + I_i}, \qquad
  \frac{\mathrm{EE}_i}{E_i + I_i}, \qquad \text{and} \qquad
  \frac{I_i}{E_i + I_i}.
\]
The first metric is intended to be evaluated at the block cluster level
(based on synthetic estimation), while the remaining two metrics are
computed at the state level. The first metric,
\[ \frac{\mathrm{DSE}_i + I_i}{E_i + I_i}, \]
is a local undercount estimate since $\mathrm{DSE}_i + I_i$ is similar to the
dual-systems estimate and $E_i + I_i$ is an estimate of the census count. The
second metric,
\[ \frac{\mathrm{EE}_i}{E_i + I_i}, \]
is a measure of the percent of erroneous enumerations. The third metric,
\[ \frac{I_i}{E_i + I_i}, \]
is a measure of the percent of whole-person imputations. The last two
metrics therefore assess the degree to which an area is encountering
problems in enumeration. The goal then is to select a CCM sample design
that produces estimates of these quantities at the indicated level that have
substantially lower variances than those from the currently proposed
design.
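The three metrics are simple ratios, so they are easy to compute for each aggregate inside a simulation replication. The function below is a sketch with hypothetical names and toy inputs; in a design comparison one would compute these across replications and compare their variances between designs.

```python
def design_metrics(dse, e, i, ee):
    """Return the three panel metrics for one aggregate.

    dse: direct dual-systems estimate, e: E-sample total,
    i: number of imputations, ee: number of erroneous enumerations.
    """
    census_estimate = e + i  # the text's estimate of the census count
    return {
        "local_undercount": (dse + i) / census_estimate,
        "erroneous_enumeration_rate": ee / census_estimate,
        "imputation_rate": i / census_estimate,
    }


# Toy values only; these are not census figures.
toy_metrics = design_metrics(dse=980.0, e=950.0, i=50.0, ee=40.0)
```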
Simulation studies of the design alternatives mentioned above, using
these metrics, may identify designs that are nearly as effective as the
Census Bureau’s current design at estimating net coverage error at the level
of states and major demographic groups while increasing the number
of sampled census coverage errors. The panel believes there still may be
sufficient time to carry out this analysis.
The panel’s suggestion that the Census Bureau consider additional
oversampling of difficult-to-count housing units in the 2010 CCM design
is incomplete without considering what data should be used in
support of this effort. Certainly, the Census Bureau could continue to use
census and A.C.E. data from 2000, as in the above simulations, possibly
making some effort to better identify erroneous enumerations and
omissions given the weakness of A.C.E. data for that purpose. However, that
might be too time-intensive an activity as 2010 nears. Census, ACS, and
StARS (see Chapter 3) data could also be used as the basis for an artificial
population study, in which the components of census coverage error were
“imposed” on the census enumerations. That is, statistical models using
current best guesses for causal factors and their impact on coverage error
could be developed, relating person, household, and contextual
characteristics to probabilities of duplication, omission, and being counted in
the wrong location. Then, a number of simulated censuses and PESs could
be conducted, with people and housing units missed, duplicated, and
counted in the wrong place with various probabilities. Erroneous
enumerations, as defined here (which excludes duplications and
enumerations in the wrong location), would be more difficult to incorporate in
such a study since one does not have a base population to apply a model
to. However, this component is likely the least important to address, and
there may not be an effective causal model predicting which newborns are
erroneously included in the census, which recent deaths are erroneously
included, which visitors are included, and which fictitious individuals are
included. If the suggested study is carried out, then analyses in 2010 to
identify which factors are and are not associated with various components
of coverage error can be used to refine the models used for incorporating
components of coverage error to better plan the coverage measurement
data collection in 2020.
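The artificial population study described above amounts to drawing coverage errors person by person from model-based probabilities. The sketch below shows one simulated census under strong simplifying assumptions: the probabilities are supplied directly rather than produced by a fitted model, the three error types are drawn independently, person identifiers are unique, and the PES side is omitted. All names are hypothetical.

```python
import random


def simulate_census(people, seed=0):
    """Impose omission, wrong-location, and duplication errors.

    people: list of (person_id, p_omit, p_dup, p_wrong_place).
    Returns census records as (person_id, place) pairs.
    """
    rng = random.Random(seed)
    records = []
    for pid, p_omit, p_dup, p_wrong in people:
        if rng.random() < p_omit:
            continue  # person is missed entirely
        place = "wrong" if rng.random() < p_wrong else "correct"
        records.append((pid, place))
        if rng.random() < p_dup:
            records.append((pid, place))  # second, duplicate record
    return records


def tally(people, records):
    """Count each imposed error type in one simulated census."""
    counted = {pid for pid, _ in records}
    wrong = {pid for pid, place in records if place == "wrong"}
    return {
        "omissions": len(people) - len(counted),
        "duplicates": len(records) - len(counted),
        "wrong_location": len(wrong),
    }
```

Repeating the draw with different seeds and sample designs gives the replications over which alternative CCM designs could be compared.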
Finally, a very serious complication in carrying out this research plan
is that many of the most important predictive factors in statistical models
of components of census coverage error will have to be indicator variables
for the various census processes used in association with the enumeration
of each housing unit or individual. (This requirement results from having
a feedback loop that identifies census processes in need of modification.)
However, the census processes are generally not represented on the
standard census files or in A.C.E. in 2000. This lack strongly argues for the
collection of a master trace sample (a sample of households for which the
entire census procedural history is retained) in 2010 and for designing
the CCM sample and the master trace sample so that there is
substantial overlap between them. For current work, and in the case that
a master trace sample database is not constructed in 2010, the Census
Bureau should determine how it can use census management information
files to populate an analysis database that represents the components of
census coverage error and as many as possible of the predictors discussed
in Chapter 5.
None of the approaches suggested here as to how to examine the
optimal extent of oversampling problematic households is ideal, which
is not surprising. The Census Bureau does not have good historic
information on how coverage errors are related to census processes, which
makes targeting the sample much harder and which makes simulating
the situation harder as well. However, the Census Bureau has acquired
a lot of information about the circumstances that cause some of the
coverage errors and where those more problematic areas are located; those
areas need to be oversampled to some extent. The coverage problems do
change from census to census and some of the problematic areas are due
to idiosyncrasies that appear for only a single census. Yet it is sensible
to assume that much of the causal nature of coverage error is persistent
across at least a few censuses, and that is what needs to be captured in the
CCM survey. So focusing on areas with high proxy interview rates or high
imputation rates in the previous census, on areas with a large percentage
of vacation homes, or on areas with many small, multiunit housing units
(though this has some difficult definitional and implementation aspects)
is likely to be beneficial in the design of the 2010 CCM survey since such
households have been and are likely to remain hard to count. Of course,
over time, new problems will crop up, and old ones will be addressed,
and so the process of census improvement will be a dynamic one.
Given that the design of the 2010 CCM PES needs to target block
groups that have a higher frequency of housing units that are
vulnerable to census coverage error, the Census Bureau should give serious
consideration to alternative designs that, without sacrificing much
efficiency in estimating net coverage error, could provide a larger number
of (anticipated to be) hard-to-enumerate households and individuals.
Such a design would improve the estimation of parameters of the
statistical models linking coverage error to census procedures. In particular,
the Census Bureau should consider implementing a design that mixes a
high proportion of cases selected using the current design with a smaller
proportion of cases in hard-to-enumerate areas. This design could be
assessed through simulation studies like those the Bureau has already
used, supplemented by additional metrics suggested here.
Recommendation 6: The Census Bureau should compare its sample
design for the 2010 census coverage measurement postenumeration
survey with alternative designs that give greater sampling
probability to housing units that are anticipated to be hard to
enumerate. If an alternative design proves preferable for the joint goals
of estimating component coverage error and net coverage error,
such a design should be used in place of the current sample design.
LOGISTIC REGRESSION MODELS
In the last few years the Census Bureau has devoted a considerable
amount of its resources for coverage measurement research to improving
the estimation of net coverage error in 2010, with a primary focus on
developing two logistic regression models to replace poststratification to
address correlation bias. Any small-area estimates of net coverage error
will likely be based on these same logistic regression models, replacing
the use of (so-called) synthetic estimation. Both poststrata and synthetic
estimation were used in the coverage measurement programs in 1990 and
in 2000, so the current plan is a substantial change to the estimation of net
coverage error at the level of both large and small domains. Despite the
new focus on estimating components of error, there remain good reasons
for devoting considerable attention to the estimation of net error.
First, given the focus in 2000 on net error estimation, the data
available from A.C.E. are not directly useful as substitutes for the data on
components of coverage error that will be collected in 2010. An important
example of this is the different definitions of correct enumerations in 2000
and 2010, which suggests that the frequency of erroneous enumerations
and omissions will likely be substantially less and of a somewhat different
nature in 2010 than they were in 2000. As a result, any attempts to model
the 2000 A.C.E. data without accounting for various differences between
2000 and 2010 will probably provide limited guidance for how to estimate
components of coverage error in 2010. (However, we believe that some
efforts in this direction are warranted.)
Second, as argued in Mulry and Kostanich (2006), estimating net
coverage error facilitates estimation of the number of census omissions.
Therefore, some focus on estimation of net coverage error is justified.³
Third, as noted in Chapter 2, strong interest remains for many census
data users in the assessment of net coverage error, in particular for
demographic groups, but also for states and cities.
³ Although the Census Bureau will use estimates of net coverage error to develop estimates
of the number of census omissions for domains that will support various tabulations, we hope
that the Census Bureau will develop analytical models based on the P-sample individuals
that are determined to be census omissions. The main disadvantage of doing this is that this
analysis may miss the types of census omissions that are not captured in either the census or
the P-sample, which are collectively estimated using the Census Bureau’s approach.
The Census Bureau plans to use logistic regression for fitting the
probability of match status for the P-sample and the probability of
correct enumeration status for the E-sample. Logistic regression is more
flexible than poststratification in terms of handling continuous predictor
variables and selective use of interactions among predictor variables.
This flexibility potentially allows inclusion of more predictor variables
without increasing the variance of estimated probabilities. Furthermore,
logistic regression is a model that, in this context, is applied at the level of
the individual; therefore, information collected at that level can be easily
used in conjunction with information that is collected at a more
aggregate level. Finally, not only is logistic regression likely to be better than
poststratification in estimating net coverage error for these reasons, but it
is also much better suited than poststratification to the analytic purpose
of understanding which factors are and are not related to net coverage
error.
Poststratification is mentioned in the earliest literature advocating
the use of dual-systems estimation (DSE) to measure populations (Sekar
and Deming, 1949), and it has been used in the census since the 1980
postenumeration program to reduce correlation bias. Poststratification
simply means that one partitions the CCM sample data into groups that
are more homogeneous and then separately estimates the adjusted
population counts
\[ (C - \mathrm{II}) \cdot \frac{\mathrm{CE}}{E} \cdot \frac{P}{M}, \]
within those poststrata.⁴ A perfect poststratification would partition the
P-sample population and the E-sample population so that the
underlying enumeration propensities for individuals in a poststratum were
identical. However, this is unattainable and therefore the practical goal
is to partition the sample cases so that individuals are more alike within
a poststratum than individuals are from different poststrata. If this is
accomplished, correlation bias should be reduced (see Kadane et al., 1999,
for details).
⁴ See Chapter 3 for definitions. Note that CE is defined consistent with the definition of a
correct enumeration in A.C.E., that is, an enumeration that is located in the search area.
Poststratification also supports the use of synthetic estimation, which
carries down adjustments to census counts to low geographic levels.
Synthetic estimation makes use of coverage correction factors,
\[ \frac{C - \mathrm{II}}{C} \cdot \frac{\mathrm{CE}}{E} \cdot \frac{P}{M}, \]
which are applied to any subpopulation in a poststratum by multiplying
the appropriate factor by the relevant subpopulation’s census count
to produce the adjusted count for that subpopulation. To produce
geographic estimates, which often requires adding subpopulations that
belong to different poststrata, one simply sums the associated adjusted
counts.
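The dual-systems estimate, the coverage correction factor, and the synthetic estimate for a geographic area fit together as in the sketch below. The function names and the toy numbers are assumptions made for illustration; the symbols follow the formulas in the text (C, II, CE, E, P, M), whose definitions the text defers to Chapter 3.

```python
def dual_systems_estimate(c, ii, ce, e, p, m):
    """Poststratum DSE: (C - II) * (CE / E) * (P / M)."""
    return (c - ii) * (ce / e) * (p / m)


def coverage_correction_factor(c, ii, ce, e, p, m):
    """DSE divided by the poststratum census count C."""
    return dual_systems_estimate(c, ii, ce, e, p, m) / c


def synthetic_estimate(census_counts, ccf_by_poststratum):
    """Adjusted count for an area that spans several poststrata.

    census_counts: poststratum -> census count of the area's residents
    in that poststratum. Each count is scaled by its poststratum's
    factor, and the adjusted counts are summed.
    """
    return sum(count * ccf_by_poststratum[ps]
               for ps, count in census_counts.items())


# Toy poststrata (illustrative values only).
ccf = {
    "ps1": coverage_correction_factor(c=1000, ii=0, ce=90, e=100, p=100, m=80),
    "ps2": coverage_correction_factor(c=2000, ii=0, ce=100, e=100, p=100, m=100),
}
area_total = synthetic_estimate({"ps1": 40, "ps2": 60}, ccf)
```

The sketch applies one factor to every subpopulation of a poststratum, so any heterogeneity within a poststratum is not reflected in the area estimate.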
Estimates of the variance of synthetic estimates for small domains
are necessarily a combination of estimates of the variance of the coverage
correction factors for the poststrata involved (depending on the domains)
and a residual variation due to any unmodeled heterogeneity of the
relevant subpopulations of interest within the required poststrata. The first
component can be estimated by standard methods. However, estimation
of the second variance component is more difficult.
As mentioned above, although poststratification has the advantages
of reducing correlation bias and supporting synthetic estimation, a major
disadvantage is that, as applied by the Census Bureau, it allows only a
relatively small number of factors to be included in the poststratification
scheme (and in the resulting synthetic estimation). This limitation exists
because the Census Bureau typically includes the full cross-classification
of the factors used to define the poststrata, and, as a result, the individual
poststrata quickly become very sparsely populated, despite the large
sample size of the PES. Use of more poststratification factors, and
therefore more poststrata, buys greater homogeneity in each poststratum
at the price of higher sampling variances for the coverage correction
factors. Furthermore, the fact that the various poststrata generally share
some characteristics with other poststrata (for instance, there are many
poststrata for Hispanic women) is generally ignored in the associated
estimation. As a result, there is a failure to pool information when it may
be beneficial to do so.
The alternative that is being planned for use by the Census Bureau in
the 2010 CCM is logistic regression of both the binary match/nonmatch
variable and the binary correct enumeration/not correct enumeration
variable. Poststratification is a special case of logistic regression in this
context in which the predictors of the logistic regression are indicator
variables for membership in the categories defining the poststrata, and all
interactions are included in the model. In theory, for the same reasons that
logistic regression may be preferred to poststratification at the aggregate
level at which that analysis is carried out, small-area estimates that are
based on the probabilities of match and correct enumeration status
estimated using logistic regression could improve on those provided
through synthetic estimation by effectively averaging over more of the data.
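The statement that poststratification is a special case of logistic regression can be checked with a short derivation. The notation is introduced here for illustration and is not from the text: $n_k$ is the number of P-sample cases in poststratum $k$, $m_k$ the number of those that match, and $p_k$ the match probability; sampling weights are ignored for simplicity.

```latex
% Saturated model: one free parameter per poststratum k = 1, ..., K,
% which is what indicators plus all interactions amount to.
\[
  \operatorname{logit}(p_k) = \beta_k ,
  \qquad
  \ell(\beta) = \sum_{k=1}^{K}
    \bigl[\, m_k \log p_k + (n_k - m_k) \log (1 - p_k) \,\bigr]
\]
% Each p_k appears in only one term, so each can be maximized separately:
\[
  \frac{\partial \ell}{\partial p_k}
    = \frac{m_k}{p_k} - \frac{n_k - m_k}{1 - p_k} = 0
  \quad\Longrightarrow\quad
  \hat{p}_k = \frac{m_k}{n_k}
\]
```

The fitted probability in each cell is therefore the observed match rate of that poststratum (provided $0 < m_k < n_k$), which is the quantity poststratification uses. Dropping interaction terms from the model is what lets logistic regression pool information across cells.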
In the following, a number of issues relevant to the use of logistic
regression are raised and discussed, and a variety of suggestions are
OCR for page 81
10 COVERAGE MEASUREMENT IN THE 2010 CENSUS
as the observed values, without any conditioning. This assumption is
extremely unlikely to obtain in either missing data problem.) If the
missing-at-random assumption is not considered reasonable, one could impute
M by conditioning on various aspects of R (referred to as pattern-mixture
models; for details, see Little, 1993).
It would be valuable for the Census Bureau to assess its current
imputation methods in its coverage measurement models for consistency with
the above principles. As noted above, the logistic regression approach
for modeling match status seems too focused on the P-file data, ignoring
potentially useful information both in auxiliary data used in the matching
algorithm and in the E-file. It may be that after this reconsideration,
modest adjustments to the current procedures will provide a model for
match status with smaller mean-squared error under a variety of realistic
models for both the generation of data and missing values.
The Census Bureau’s current imputation methodology, hot-deck
imputation, works well in situations with limited covariate information.
However, the difficulty with this approach is that dealing with more than
a few covariates at a time compromises its ability to condition on all
relevant variables. In contrast, parametric multiple imputation methods
make better use of covariate information, and these methods can be used
to estimate the contribution to variance as a result of the missing data. An
example of this is IVEWARE (see, e.g., Raghunathan et al., 2002). Another
question involves the role of imputation for missing census characteristics
values. After estimating the logistic regression of M on X, imputations for
missing census characteristics are needed to provide the predictors for
input to the logistic regression models to estimate a match probability for
these cases, through:
$$\Pr\left(M \mid X^{E}_{obs}\right) \approx \Pr\left(M \mid X^{E}_{obs}, \hat{X}^{E}_{mis}\right)$$
as input into the small-domains estimation procedure to be used in 2010.
Use of hot-deck imputation here is reasonable, but an alternative is to
estimate this probability directly given the observed E-sample characteristics,
$$\Pr\left(M \mid X^{E}_{obs}\right).$$
This approach avoids the additional uncertainty from the imputation, and
it should be straightforward to employ with the move to use of logistic
regression.
Finally we note that the coverage measurement data collected in 2010,
in particular the various follow-up data collections that are typically
carried out, could be used to validate the imputation models used, though
the sparseness of these samples may make this of only limited utility.
In summary, missing data methodology needs to be viewed in the
context of the complete-data problem, file matching. As noted above:
• Imputation is only useful if it adds information to the logistic
regression; otherwise cases can be dropped.
• Imputations should be multivariate in order to preserve associations
between missing variables.
• Imputations should condition on predictive covariates.
For example, imputations should condition on M if M is observed, and
imputations should condition on potential covariate information from
matches or potential matches from the E-file. Some form of weighting
might be developed to reflect the strength of the potential matches. The
Census Bureau could also consider parametric multiple imputation as an
alternative to the hot deck because it makes better use of the covariate
information and because it propagates imputation uncertainty. Finally,
the Census Bureau could also consider nonignorable models, such as
pattern-mixture models, if the missing-at-random assumption is likely
to be violated.
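To make concrete what it means for multiple imputation to propagate imputation uncertainty, the sketch below applies the standard combining rules for multiply imputed data: the point estimate is the average across completed data sets, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The function name and the numbers are hypothetical, chosen only to illustrate the arithmetic; this is not a description of any Census Bureau procedure.

```python
from statistics import mean, variance

def combine_multiple_imputations(estimates, within_variances):
    """Combine results from m completed data sets.

    estimates: point estimates, one per imputed data set
    within_variances: the sampling variance of each of those estimates
    Returns (combined estimate, total variance, between-imputation variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)              # combined point estimate
    w_bar = mean(within_variances)       # average within-imputation variance
    b = variance(estimates)              # between-imputation variance (divisor m - 1)
    total = w_bar + (1.0 + 1.0 / m) * b  # total variance, including imputation uncertainty
    return q_bar, total, b

# Hypothetical match-rate estimates from five imputed data sets.
est = [0.90, 0.92, 0.91, 0.89, 0.93]
var_within = [0.0004] * 5
q_bar, total_var, between_var = combine_multiple_imputations(est, var_within)
```

A single hot-deck imputation treated as observed data would report only the within-imputation component, which is why it tends to understate standard errors.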
This is a set of research problems that the Census Bureau needs to
allocate substantial staff resources to address. We believe that the benefits
are likely to be considerable and the understanding from the P-sample
matching problem discussed in detail should be transferable to some
of the other missing data problems listed on p. 103. The Census Bureau
should identify missing data methods that are consistent with the philosophy
that is articulated above and implement those methods in support of
statistical models of Census Coverage Measurement data in 2010.
Recommendation 7: The Census Bureau should develop missing
data techniques, in collaboration with external experts if needed,
that preserve associations between imputed and observed variables,
condition on variables that are predictive of the missing values,
and incorporate imputation uncertainty into estimates of standard
errors. These ideas should be utilized in modeling the census coverage
measurement data collected in the 2010 census.
MATCHING CASES WITH MINIMAL INFORMATION
For an E-sample enumeration to have sufficient information for matching
and follow-up, as defined in the 2000 census, it needed to include a
person’s complete name and two other nonimputed characteristics. To be
data defined in the census itself, an enumeration simply had to have two
nonimputed characteristics. In the A.C.E. E-sample in 2000, 1.7 percent
(4.8 million sample survey weighted) of the data-defined enumerations
had insufficient information for matching and follow-up. These cases
were coded as “kE” cases in A.C.E. processing.
A.C.E. estimation treated kEs as having insufficient information for
matching, and they were removed from the census enumerations prior to
dual-systems computations. If kEs are similar in all important respects to
census enumerations with sufficient information for matching, removal
from dual-systems computations slightly increases the variance of the
resulting estimates, but it does not greatly affect the estimates themselves.
Removal of kEs helped to avoid counting people twice because matches
for these cases are difficult to ascertain. Also, it was difficult to follow up
these E-sample cases to determine their match status if they were initially
not matched to the P-sample because of the lack of information about
whom to interview.
However, some unknown and possibly large fraction of these cases
were correct enumerations. Therefore, removing these cases from the matching
inflated the estimate of erroneous enumerations, and it also inflated the
estimate of the number of census omissions by about the same amount, since
roughly the same number that are correct enumerations would have matched
to P-sample enumerations. (There is no way of validating this assumption
since the kEs generally cannot be followed up.) Given that the emphasis in
2000 was on the estimation of net census error, this inflation of the estimates
of the rates of erroneous enumeration and omission was of only minor
concern. However, with the new focus in 2010 on estimates of components
of census coverage error, there is a greater need to find alternative methods
for treating kE enumerations. One possibility that the Census Bureau has
explored is whether many of these cases can be matched to the P-sample
data using information from other household members.
To examine this possibility, the Census Bureau carried out an analysis
using 2000 census data on 13,360 unweighted data-defined census
records with insufficient information for matching to determine whether
their match status could be reliably determined. (For details, see Auer,
2004; Shoemaker, 2005.) This clerical operation used name, date of birth,
household composition, address, and other characteristics to match the
cases to the P-sample. For the 2000 A.C.E. data, 44 percent of the kE cases
examined were determined to match to a person who lived at the same
address on Census Day and was not otherwise counted, with either “high
confidence” or “medium confidence” (which were reasonable and objectively
defined categories of credibility). For the 2000 census, this would
have reclassified more than 2 million census enumerations from erroneous
to correct enumerations, as well as a similar number from P-sample
omissions to matches, thereby greatly reducing the estimated number
of census component coverage errors.¹¹ (We note that it is important in
carrying out this research to remain evenhanded in evaluating whether
a case does or does not match; this is not simply an effort to identify more
cases that are matches.)
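As a rough check on the magnitudes quoted above, the sketch below applies the 44 percent share to the 4.8 million weighted kE enumerations. Applying the share found in the unweighted clerical sample to the weighted total is an assumption made here purely for illustration, and the variable names are hypothetical.

```python
# Figures taken from the text above.
KE_WEIGHTED_TOTAL = 4_800_000   # weighted data-defined enumerations coded kE
MATCH_SHARE = 0.44              # share matched with high or medium confidence

# Assumption: the unweighted sample share carries over to the weighted total.
reclassified = KE_WEIGHTED_TOTAL * MATCH_SHARE
still_unresolved = KE_WEIGHTED_TOTAL - reclassified

print(f"Reclassified as correct enumerations: about {reclassified:,.0f}")
print(f"Remaining unresolved kE cases: about {still_unresolved:,.0f}")
```

Under that assumption the result is a little over 2.1 million, which is consistent with the "more than 2 million" reclassified enumerations cited for the 2000 census.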
The treatment of the kEs remaining after this revisiting of the definition
of insufficient information for matching can be viewed as another
component of “error” in the same way that a person incorrectly geocoded
is an error—that is, as a problem for processing but not a part of what one
would call an omission or an erroneous enumeration. Therefore, the use of
the term “erroneous enumeration” for these cases is inappropriate. Cases
with insufficient information should be treated as having unknown or
uncertain enumeration or match status and the term “erroneous” should
be reserved for incorrect enumerations. The terminology used needs to
distinguish between types of error and the uncertainty associated with
these types of error for particular cases.
The panel is impressed with the findings of this research, which
should substantially improve the assessment of components of census
coverage error in 2010. In considering further development of the idea, it
would be useful to try to find out more about any characteristics associated
with kEs in order to find out how to reduce their occurrence in the
first place. StARS might be useful for this purpose. Furthermore, the clerical
operation used to determine the status of kEs was resource intensive,
and it would be useful to try to automate some of the matching to reduce
the size of this clerical operation in 2010.
We anticipate that, as a result of this research, the Census Bureau will
adopt a different standard of what is considered to be insufficient information
for matching more generally.
DEMOGRAPHIC ANALYSIS
Demographic analysis may be facing a very dynamic period in the
next few years for several reasons. First, nearly all record systems are
becoming increasingly complete, with higher quality data. Second,
the American Community Survey is now providing a great deal of useful,
subnational information that could be used to improve and extend demographic
analysis estimates. Third, StARS, a merged, unduplicated list of
U.S. residents and addresses, is also a likely source of information on
the number of housing units and residents at small levels of geographic
aggregation that could also be used to improve demographic analysis
estimates.
11 For the remaining unresolved cases, the Census Bureau currently plans to treat them in
a separate category as “enumerations unable to evaluate.”
At the same time, some things are becoming more complicated, notably,
the expansion of the number of race and ethnicity categories on the
decennial census and the growing and increasingly mobile population of
undocumented immigrants.
In this context, the panel was asked to examine how demographic
analysis might function more effectively as an independent assessment of
the quality of the coverage of the decennial census. In addition, the panel
was asked to consider the use of sex ratios from demographic analysis,
especially for Hispanic residents, to reduce the effect of correlation bias
in dual-systems estimation.
As described above, the basic demographic analysis equation is
$$\hat{P}^{NEW}_{ij} = \hat{P}^{OLD}_{ij} + B_{ij} - D_{ij} + I_{ij} - E_{ij},$$
where $\hat{P}^{NEW}_{ij}$ represents the current estimate of the population
for demographic group $i$ and geographic area $j$, $\hat{P}^{OLD}_{ij}$ is the
analogous estimate for a previous census, $B_{ij}$ represents the number of
births between the current and a previous census, $D_{ij}$ represents the
number of deaths between the current and a previous census, $I_{ij}$
represents the number of immigrants between the current and a previous
census, and $E_{ij}$ represents the number of emigrants between the current
and a previous census, all for demographic group $i$ and geographic area $j$.
Once $\hat{P}^{NEW}_{ij}$ is computed, the net census undercount,
$\hat{U}_{ij}$, for demographic group $i$ and area $j$ is defined as
$\hat{U}_{ij} = \hat{P}^{NEW}_{ij} - C_{ij}$, where $C_{ij}$ is the census
count, again for demographic group $i$ and area $j$.
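The bookkeeping in the demographic analysis equation can be sketched in a few lines. The function names and all of the input figures below are hypothetical and are used only to show how the components enter; they are not actual demographic analysis inputs.

```python
def da_population(p_old, births, deaths, immigrants, emigrants):
    """Updated population estimate for one demographic group and area:
    previous estimate plus births, minus deaths, plus immigrants,
    minus emigrants."""
    return p_old + births - deaths + immigrants - emigrants

def net_undercount(p_new, census_count):
    """Net census undercount: demographic analysis estimate minus census count."""
    return p_new - census_count

# Hypothetical inputs for a single group and area.
p_new = da_population(
    p_old=1_000_000,
    births=150_000,
    deaths=80_000,
    immigrants=40_000,
    emigrants=10_000,
)
undercount = net_undercount(p_new, census_count=1_085_000)
print(p_new, undercount)
```

Because the undercount is a difference of two large numbers, even a small relative error in any one component, for example the immigration figure, translates directly into error in the estimated undercount.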
Error is introduced into estimates from demographic analysis due to
omissions in the birth and death records and due to large inaccuracies in
the data on immigration and emigration. The error in net undercoverage
estimates from demographic analysis then stems from error in the various
components, error in the census counts, and any lack of alignment of the
demographic categories.
Given these concerns, the most reliable outputs from demographic
analysis are any national counts by age and sex, and functions of such
counts, in particular sex ratios by age; birth and death estimates; and
historical patterns of various kinds. More problematic outputs are race
(depending on the degree of alignment to the new race/ethnicity categories)
and subnational estimates for demographic groups.
The most problematic outputs are estimates of international migration
components, estimates of the Hispanic population, and subnational totals for
states and smaller geographic areas.
The Census Bureau plans for demographic analysis in 2010 are to
produce “estimates” and “benchmarks,” with estimates represented to
users as being more reliable than benchmarks. The Census Bureau will
produce estimates of national level totals by year of age and by sex, and
estimates of 2000–2010 change for the above groups. The Census Bureau
will also produce benchmarks of national net undercount error by age,
sex, and race/ethnicity. In addition, the demographic analysis program
will produce sex ratios by age and race/ethnic origin, possibly for use in
reducing the effects of correlation bias on estimates of net undercoverage
from the census coverage measurement program.
Even without any major advances from 2000, demographic analysis
will still likely play an important role in evaluation of the 2010 census.
As pointed out above, demographic analysis provided an early indication
that the initial estimates of the total U.S. population from A.C.E. may have
been too high, and it will continue to provide an estimated count that
serves as a useful estimate for many demographic groups and a useful
lower bound for others.
The Census Bureau is currently pursuing important research directions,
though it is unclear whether they will contribute to the 2010 demographic
analysis program. Those research plans include: (1) improved
estimation of international migration, (2) estimation of the uncertainty of
demographic analysis estimates, and (3) progress towards the production
of subnational estimates. The last of these includes research on methods and
data sources, with some pieces already considered of possibly acceptable
quality, such as estimates of the number of people younger than 10 years
of age at the state level. We believe that these are extremely important
projects to pursue and deserve full support from the Census Bureau.
In addition, the panel has the following questions concerning the
2010 demographic analysis estimates that may help orient these research
avenues:
• Given that there is race/ethnicity incomparability between the
decennial census and demographic analysis, which categories are
going to be used in 2010?
• Given overlapping data for some cohorts (e.g., Medicare information
for those over 65) in comparison with standard demographic
analysis, which sources will be used and how will that be determined?
Will there be efforts to combine information?
• Estimates of Hispanic origin were produced by the censuses of
1980, 1990, and 2000, as were adjusted counts. Have these sequences
been examined to determine their likely quality over time?
• In considering subnational estimates, relatively high-quality estimates
are available of the number of native-born children under
10 years old at the state level, and the number over 65 from
Medicare, again at the state level. Given additional information
on interstate migration from tax returns, school enrollment, and
possibly the American Community Survey, could high-quality
estimates be provided for the remaining demographic groups at
the state level?
• If the Census Bureau again uses sex ratios from demographic analysis
to reduce the correlation bias in adjusted population counts,
should these be applied for all minority men or selectively, as in
2000?
• The American Community Survey is providing information that
might be extremely useful for improving demographic analysis estimates.
The possibilities include: (1) better estimates of the number
of foreign-born residents, (2) better estimation of net international
migration, and (3) information on sex ratios for more detailed ethnic
and racial groups. How should each of these information sources be
best used to improve demographic analysis, and what evaluations
should be used to support decisions of implementation?
• Measurement of the size of the undocumented population is a continuing
problem for demographic analysis. The current method,
described in Passel (2005), is, roughly speaking, to subtract the
estimated size of the legal immigrant population from the estimated
size of the total foreign-born population. Are there any new
methods that might be more effective in estimating the size of this
population?
• StARS is already, or will soon be, of high enough quality to provide
useful input into demographic analysis estimates. There are reasons
to believe that administrative records could play an important role
in improving various aspects of demographic analysis, especially
the counting of immigration and emigration, and research in this
area would be very desirable.
Demographic Analysis in Combination with Dual-Systems Estimation
As part of A.C.E. revision II, the Census Bureau decided to modify
the final A.C.E. estimates based on sex ratios from demographic analysis
and the assumption that the A.C.E. counts for women and children were
correct. Specifically, at the level of aggregate poststrata (aggregated over
nondemographic and geographic characteristics), the A.C.E. counts for
black men 18 and over and for all other males 30 and over were adjusted
upward so that the ratio of women to men for A.C.E. (essentially) agreed
with that estimated using demographic analysis.
The argument in support of this joint use of demographic analysis and
dual-systems estimation is as follows. Demographers generally believe
that the most accurate outputs of demographic analysis are national-level
sex ratios by age for blacks and nonblacks. Even if absolute counts
are subject to some bias, sex ratios are expected to be quite accurate.
Historically, at least for adult blacks, the corresponding male-to-female
ratios based on adjusted counts using dual-systems estimation have been
lower than those from demographic analysis, suggesting that correlation
bias (or other sources of bias) results in relative underestimation of adult
males by dual-systems estimation. Because the most obvious source of
correlation bias (heterogeneity of enumeration probabilities) would not
have resulted in a negative bias for dual-systems estimates, the most
conservative step, in terms of additional counts, is to leave estimates for
the female population unchanged and to increase the male population
enough so that the resulting sex ratios for the adjusted counts agree with
those from demographic analysis.
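The adjustment just described can be sketched at the level of a single aggregate poststratum: hold the dual-systems estimate for females fixed and set the male total so that the male-to-female ratio equals the ratio from demographic analysis. The function name and the figures are hypothetical, and the sketch deliberately stops short of allocating the added males down to individual poststrata.

```python
def adjust_males_to_sex_ratio(dse_males, dse_females, da_male_to_female_ratio):
    """Return (adjusted male total, number of males added).

    The female dual-systems estimate is left unchanged; the male total is
    set so that males / females equals the demographic analysis sex ratio.
    """
    adjusted_males = dse_females * da_male_to_female_ratio
    return adjusted_males, adjusted_males - dse_males

# Hypothetical aggregate poststratum: the dual-systems ratio is 0.88,
# while demographic analysis gives 0.90.
adjusted, added = adjust_males_to_sex_ratio(
    dse_males=880_000,
    dse_females=1_000_000,
    da_male_to_female_ratio=0.90,
)
print(round(adjusted), round(added))
```

The calculation inherits any error in the female estimate, which is one reason the assumption of no correlation bias for adult females matters.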
It is not sufficient to simply add these additional enumerations at the
level of the aggregate poststrata; they must then be allocated down to the
poststrata within each of the aggregate poststrata. Bell (1993) and Bell et
al. (1996) identified five different methods for doing so, but there is little
evidence available as to which of the methods works best. The Census
Bureau selected one of these five approaches on the basis of its best judgment,
but the arbitrariness of the selection, along with the fact that the
counts were sensitive to the method used, is troubling. Also, given the
limitations of demographic analysis, this technique could not be applied
to such particular subgroups as nonblack men aged 30 and over (especially
Hispanics), despite some historical evidence that a similar correction
might have improved estimates for those subgroups. Finally, adjusted
counts for both adult males and females have rested on the assumption
that there is no correlation bias for adult females.
Admittedly, the approach used resulted in higher “face validity” for
the adjusted census counts at the aggregate level as a result of the consistency
with the sex ratios from demographic analysis. However, given the
issues described above, especially the lack of a formal assessment of the
effect of this process on the quality of the resulting counts, the decision
was controversial.
Given this situation, it seems reasonable to carry out a more comprehensive
evaluation of what was done in 2000 and possible alternatives
before adopting a similar modification in 2010. (The Census Bureau
currently plans to use a similar technique in 2010 as a correction for
correlation bias.) Artificial population studies, in which models are developed
to designate which individuals in an artificial population are and
are not missed by the census, the PES, and the record systems used
by demographic analysis, could be useful in such evaluations. We suggest
that the Census Bureau include the approach described by Elliott and
Little (2000) in their analysis of the method used in 2000. Their approach
provides useful smoothing to the technique described in Bell (1993) and
Bell et al. (1996). In addition to the beneficial smoothing, Elliott and
Little’s work provides estimates of precision that incorporate the uncertainty
in the demographic analysis sex ratios.
In addition, the information from the American Community Survey
and from StARS on various demographic statistics, such as sex ratios,
could be considered for use in providing not only modifications to the
counts for males, but also modifications to the counts for females, avoiding
the necessity of relying on the assumption that no correlation bias
exists for that demographic group.
Estimation of Uncertainty of Demographic Analysis
The Census Bureau (see Robinson et al., 1993) conducted initial
research on developing uncertainty intervals for population forecasts, but
to date these have not been fully developed. Development of such uncertainty
intervals would have two benefits: users would be supplied with
uncertainty intervals with a formal probabilistic interpretation, and estimates
from demographic analysis could be combined with estimates from
independent sources by weighting by the precision of each estimate.
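Weighting by precision can be illustrated with the usual inverse-variance combination. The sketch assumes the estimates are independent and unbiased, which is an idealization, and the figures (in thousands) are hypothetical.

```python
def precision_weighted(estimates, variances):
    """Combine independent estimates, weighting each by 1 / variance.

    Returns (combined estimate, variance of the combined estimate).
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    combined = sum(w * x for w, x in zip(weights, estimates)) / total_weight
    return combined, 1.0 / total_weight

# Hypothetical estimates of the same population total, in thousands:
# one from demographic analysis, one from a coverage measurement survey.
combined, combined_var = precision_weighted(
    estimates=[100.0, 104.0],
    variances=[4.0, 12.0],
)
print(combined, combined_var)
```

The combined variance is smaller than either input variance, which is the practical payoff of having a formal uncertainty measure for demographic analysis.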
In the past 15 years, a number of researchers have suggested interesting
methods to consider for development of uncertainty intervals. Poole
and Raftery (2000) suggest the use of Bayesian melding for this purpose.
Briefly, the idea is that one has expert knowledge about inputs to a deterministic
model and their variability (i.e., a prior distribution) and expert
knowledge about the outputs of interest (the forecasts), which through
exact or approximate inversion presents a second prior distribution for
the inputs. These two prior distributions then have to be reconciled. There
is also the most recent data for the inputs that have been collected, and
one can develop likelihoods for the previous inputs and outputs given the
data. Bayes’ rule is then used, implemented by the sampling-importance-resampling
algorithm of Rubin (1988), to update the prior distribution to
produce a posterior distribution of the forecasts, which would include a
posterior variance.
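The sampling-importance-resampling step can be sketched on a toy problem. This is not the full Bayesian melding procedure: it uses a single prior on one input and omits the reconciliation of the two priors. The deterministic model, the uniform prior on a growth rate, the normal likelihood, and every number below are assumptions made only for illustration.

```python
import math
import random

def sir_posterior(prior_draws, log_likelihood, n_resample, rng):
    """Sampling-importance-resampling: weight each prior draw by its
    likelihood, then resample draws in proportion to those weights."""
    log_w = [log_likelihood(theta) for theta in prior_draws]
    top = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - top) for lw in log_w]
    return rng.choices(prior_draws, weights=weights, k=n_resample)

# Toy setup (all values hypothetical).
P_OLD = 1000.0      # previous population estimate
OBSERVED = 1070.0   # an independent measurement of the current population
OBS_SD = 5.0        # assumed standard deviation of that measurement

def model(growth):
    """Deterministic model mapping the input (growth rate) to the forecast."""
    return P_OLD * (1.0 + growth)

def log_lik(growth):
    """Normal log-likelihood of the observation, up to a constant."""
    z = (OBSERVED - model(growth)) / OBS_SD
    return -0.5 * z * z

rng = random.Random(2010)
prior = [rng.uniform(0.0, 0.10) for _ in range(20000)]
posterior = sir_posterior(prior, log_lik, 5000, rng)

forecasts = [model(g) for g in posterior]
post_mean = sum(forecasts) / len(forecasts)
post_var = sum((f - post_mean) ** 2 for f in forecasts) / (len(forecasts) - 1)
```

The resampled forecasts approximate the posterior distribution, so `post_mean` and `post_var` play the role of the posterior mean and posterior variance mentioned above.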
Other approaches have also been suggested by, among others, Alho
and Spencer (1997) and Lee and Tuljapurkar (1994). Given all of this
promising research, and the benefits from the development of uncertainty
intervals, it would be valuable for the Census Bureau to revisit this issue
and evaluate some of these approaches for their applicability to demographic
analysis of the U.S. census. It is true that the U.S. census tends to
have idiosyncratic challenges each decade, such as the number of undocumented
immigrants that are enumerated in a given census, the number
of duplicate enumerations from multiple modes of enumeration, or the
degree of census undercoverage, and these challenges may be difficult to
model. Therefore, in particular, the specific stochastic models suggested
by Alho and Spencer (1997) and Lee and Tuljapurkar (1994) might need
some modification. However, even recognizing this, if started now, the
panel is confident that a research effort devoted to this issue would very
likely produce useful uncertainty intervals for the 2010 census.
In summary, demographic analysis played an important role in helping
to evaluate the estimates produced by A.C.E. in 2000, and it can play
an even larger role in 2010 and 2020, especially if some improvements are
implemented. Those improvements include improving the measurement
of undocumented and documented immigration, development of subnational
geographic estimates, development of estimates of uncertainty,
and further refining methods for combining demographic analysis and
coverage measurement survey information.
Recommendation 8: The Census Bureau should give priority to
research on improving demographic analysis in four areas:
(1) improving the measurement of undocumented and documented
immigrants, (2) development of subnational geographic estimates,
(3) assessment of the uncertainty of estimates from demographic
analysis, and (4) refining methods for combining estimates from
demographic analysis and postenumeration survey data.