Logistic Regression for Modeling Match and Correct Enumeration Rates

It is reasonable to suspect that match rates and correct enumeration rates, in addition to being a function of the variables used to define the accuracy and coverage evaluation (A.C.E.) poststrata in 2000, may also vary across the local census offices used to manage the workload in the census. The local office identifiers are on the A.C.E. research database, but they were not included in the six logistic regression models described above or the study by Schindler (2006).

Local census office indicator variables might be predictive of match and correct enumeration rates because factors that are particular to small areas could affect ease of enumeration. For example, local economic conditions and the expertise and capabilities of local census office administrators could vary. Because of the large number of local census offices (more than 500) and the limited amount of data for each, these effects are more naturally represented as random effects. By including these random effects in the logistic regression models, the Census Bureau could estimate the effects of individual offices on match and correct enumeration rates and obtain valid estimates of the contribution of variability across offices to uncertainty about coverage rates in each area.

Malec and Maples (2005) explored this approach by adding local area random effects into a synthetic estimation model and then measured the variance component of these random effects for local census offices. The ultimate objective of this approach is a small-area estimation methodology that would provide a compromise between synthetic estimation and a design-based estimator for each local office area.

Appendix B
1 COVERAGE MEASUREMENT IN THE 2010 CENSUS
Because of the complex design of A.C.E.’s postenumeration survey (weighted cases within samples of block clusters), many of the empirical correct enumeration rates and match rates used in Malec and Maples's model are more variable than the nominal sample sizes would indicate. To account for the extra variability, Malec and Maples (2005) used a pseudolikelihood approach with effective sample sizes estimated by the bootstrap.
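The idea of an effective sample size can be illustrated with a minimal Python sketch. It is not the procedure Malec and Maples used; it assumes a simple cluster-level bootstrap, and the function names, toy data, weights, and cluster sizes are all hypothetical. The design effect is taken as the ratio of the bootstrap variance of a weighted rate to the variance that rate would have under simple random sampling, and the effective sample size is the nominal size divided by that ratio.

```python
import random


def weighted_rate(clusters):
    """Weighted proportion of successes; each cluster is a list of (weight, outcome)."""
    num = sum(w * y for cluster in clusters for w, y in cluster)
    den = sum(w for cluster in clusters for w, _ in cluster)
    return num / den


def effective_sample_size(clusters, n_boot=500, seed=0):
    """Return (effective n, design effect) from a cluster-level bootstrap."""
    rng = random.Random(seed)
    n = sum(len(cluster) for cluster in clusters)
    p_hat = weighted_rate(clusters)

    # Resample whole block clusters with replacement and recompute the rate.
    reps = []
    for _ in range(n_boot):
        resample = [rng.choice(clusters) for _ in clusters]
        reps.append(weighted_rate(resample))

    mean_rep = sum(reps) / n_boot
    var_boot = sum((r - mean_rep) ** 2 for r in reps) / (n_boot - 1)

    # Variance the same rate would have under simple random sampling of n cases.
    var_srs = p_hat * (1.0 - p_hat) / n
    deff = var_boot / var_srs
    return n / deff, deff


# Toy data (hypothetical): 40 clusters of 25 cases whose rates differ by cluster.
data_rng = random.Random(1)
clusters = []
for _ in range(40):
    cluster_rate = data_rng.betavariate(8, 2)
    clusters.append([
        (data_rng.uniform(1.0, 3.0), 1 if data_rng.random() < cluster_rate else 0)
        for _ in range(25)
    ])

n_nominal = sum(len(c) for c in clusters)
n_eff, deff = effective_sample_size(clusters)
print(f"nominal n = {n_nominal}, design effect = {deff:.2f}, effective n = {n_eff:.0f}")
```

When outcomes are correlated within clusters or weights are unequal, the design effect typically exceeds 1, so the effective sample size falls below the nominal one. A production version would also need to respect stratification and other features of the actual sample design.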
In this approach, both logistic regression models (for match rate and correct enumeration rate) have the following generic form:

$$\log \frac{p_{i,k}}{1 - p_{i,k}} = \beta_i + \mu_k + \alpha_{i,k},$$

where $\beta_i$ is the fixed effect for membership in the ith poststratum, $\mu_k$ is a random effect for the kth local census office, and $\alpha_{i,k}$ is model error. Furthermore,

$$\mu_k \sim N(0, \Sigma) \quad \text{and} \quad \alpha_{i,k} \sim N\left(0, \gamma^2_{ce(i)}\right),$$

where ce(i) is an index representing the collapsing of the poststrata into 11 or 8 cells, depending on whether the model is applied to the E-sample or the P-sample. Malec and Maples (2005) were able to estimate the large number of parameters in these models using Bayesian simulation.
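To make the structure of this model concrete, the following Python sketch simulates rates from it. It is only an illustration under stated assumptions: the office-level variance is treated as a scalar, the parameters are standard deviations rather than variances, and every numerical value, as well as the names `simulate_rates`, `cell_of`, `gamma_sd`, and `office_sd`, is hypothetical. It does not estimate the model, which Malec and Maples did by Bayesian simulation.

```python
import math
import random


def inv_logit(x):
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_rates(beta, cell_of, gamma_sd, office_sd, n_offices, seed=0):
    """Draw p[i, k] from logit(p[i, k]) = beta[i] + mu[k] + alpha[i, k].

    beta      -- fixed effect for each poststratum i
    cell_of   -- collapsed cell index ce(i) for each poststratum i
    gamma_sd  -- standard deviation of the model error in each collapsed cell
    office_sd -- standard deviation of the local office random effect
    """
    rng = random.Random(seed)
    # One random effect per local census office, shared by all poststrata.
    mu = [rng.gauss(0.0, office_sd) for _ in range(n_offices)]
    rates = {}
    for i, beta_i in enumerate(beta):
        for k in range(n_offices):
            # Model error whose spread depends on the collapsed cell ce(i).
            alpha_ik = rng.gauss(0.0, gamma_sd[cell_of[i]])
            rates[(i, k)] = inv_logit(beta_i + mu[k] + alpha_ik)
    return mu, rates


# Hypothetical values: 4 poststrata collapsed into 2 cells, 5 offices.
beta = [2.5, 2.0, 1.2, 3.0]
cell_of = [0, 0, 1, 1]
gamma_sd = [0.10, 0.25]
mu, rates = simulate_rates(beta, cell_of, gamma_sd, office_sd=0.3, n_offices=5)

for k in range(5):
    print(f"office {k}: mu = {mu[k]:+.2f}, rate in poststratum 0 = {rates[(0, k)]:.3f}")
```

Because each office effect is shared across poststrata, an office with a positive effect has higher rates in every poststratum, which is the source of the between-office variability the model is meant to capture.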
This research suggests that inclusion of small-area effects could substantially improve coverage estimates. Several questions remain: how best to treat the complex sample design, how many random effects can be included and at what level of aggregation, the best way to estimate the model parameters, and how the model fit should be assessed. The panel is impressed with this high-caliber research that addresses an important issue in coverage modeling; further work in this area would be very valuable.
Mulry et al. (2005) examined the following anomalous results in A.C.E. More than 5 percent of incorporated places¹ in 2000 had an estimated net overcount of greater than 5 percent, and 0.5 percent had a net overcount of greater than 10 percent. This result runs counter to findings from the 1980 and 1990 coverage measurement programs of the potential net overcoverage due to true erroneous enumerations and duplications. In contrast with 2000, only 0.1 percent of places had an estimated net undercount of greater than 5 percent, and nationally, the degree of overcoverage and undercoverage were of essentially the same magnitude. There is a concern that the lack of balance of designated erroneous enumerations and designated omissions may be due to the use of proxy status and the type of census return as poststratification variables for the E-sample but not for P-sample computations.
¹ See http://www.census.gov/dmd/www/ACEREVII_PLACES.txt for a list.

To examine this further, Mulry et al. (2005) demonstrated that, when proxy status was used in the E-sample poststratification, there were 91 places with a net overcount of more than 10 percent; however, if it is assumed that there was no error for proxy enumerations, there were only 16 places with net overcounts of more than 10 percent. Furthermore, if one assumes that there were no errors for proxy enumerations and no errors for late nonmail returns, there were only four places with a net overcount of more than 5 percent. Given this and given that 27 percent of proxy enumerations had insufficient information for matching and follow-up, it is clear that proxy enumerations could contribute to substantial balancing error. The Census Bureau concluded that proxy enumerations contributed to these anomalous findings but that they were not the only cause.
Related research carried out by Spencer (2005) examined the quality of synthetic estimates for block clusters based on A.C.E. Revision II estimates, either using 938 E-sample poststrata and 648 P-sample poststrata or using the same 648 poststrata for the E- and P-samples. His findings, in which the standard of comparison was either (a) the direct dual-systems estimate or (b) the census count plus people found in the P-sample who were omitted in the census for each block cluster, suggested that coarser but consistent poststrata may have provided more accurate estimates of net coverage error than finer poststratifications based on different E- and P-sample stratifications. However, for large blocks with proxy rates greater than 10 percent, the finer and inconsistent poststrata performed better.
The specific model form for logistic regression is

$$\log \frac{p}{1 - p} = X\beta.$$

As described in the literature on generalized linear models, this represents a specific relationship between the mean of a random variable and a linear combination of predictors, called the link function,

$$\log \frac{y}{1 - y}.$$
Research on the best link function is continuing at the Census Bureau, with possibilities that include logit, probit, log-log, and robit. An incorrect link function would result in poor extrapolations to situations that do not occur in the P- or E-sample data, unnecessary interaction terms in the model, and other typical results of lack of fit. The panel suggests that the Hosmer-Lemeshow goodness-of-fit test may help the Census Bureau choose the appropriate link function: the test will indicate whether an alternative link function would provide a better fit to the data.
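As a rough illustration of how the Hosmer-Lemeshow statistic responds to lack of fit, the Python sketch below computes it for two sets of predicted probabilities. It uses the usual construction: cases are sorted by predicted probability, split into groups (ten by default), and observed and expected counts are compared within each group. The simulated data are hypothetical, the distorted predictions merely stand in for what a poorly chosen link might produce, and the sketch ignores survey weights and clustering, which a real application to A.C.E. data would have to address.

```python
import random


def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow statistic for 0/1 outcomes y and predicted probabilities p."""
    pairs = sorted(zip(p, y))  # order cases by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        observed = sum(y_j for _, y_j in chunk)   # observed successes in the group
        expected = sum(p_j for p_j, _ in chunk)   # expected successes in the group
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    return stat


# Hypothetical data: outcomes generated from known probabilities.
rng = random.Random(0)
p_true = [rng.uniform(0.05, 0.95) for _ in range(2000)]
y = [1 if rng.random() < p_j else 0 for p_j in p_true]

# Deliberately miscalibrated predictions, standing in for a badly fitting model.
p_distorted = [p_j ** 2 for p_j in p_true]

stat_good = hosmer_lemeshow(y, p_true)
stat_bad = hosmer_lemeshow(y, p_distorted)
print(f"well calibrated: {stat_good:.1f}, miscalibrated: {stat_bad:.1f}")
```

For a fitted model, the statistic is conventionally referred to a chi-square distribution with the number of groups minus 2 degrees of freedom. Computing it for fits under several candidate links gives one way to compare them.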

Several complications would remain to be addressed.
Software for Alternate Link Functions. If it is discovered that an alternate link function is preferred, it might require a modest amount of software development to implement. However, this should be relatively straightforward in either SAS or R, which are two standard statistical software systems that the Census Bureau uses.
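The sketch below suggests why swapping link functions is a small software change. It is written in Python purely for illustration (not in SAS or R), and the names `INVERSE_LINKS` and `log_likelihood`, the data, and the coefficient values are all hypothetical. The log-log inverse link is taken here as exp(-exp(-eta)); conventions differ across software. The robit link is omitted because it needs a Student t distribution function, which the Python standard library does not provide.

```python
import math
from statistics import NormalDist

# Inverse link functions: map the linear predictor eta to a probability.
INVERSE_LINKS = {
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "probit": NormalDist().cdf,
    "loglog": lambda eta: math.exp(-math.exp(-eta)),
}


def log_likelihood(link, rows, y, beta):
    """Bernoulli log-likelihood of 0/1 outcomes y under the named link."""
    inverse = INVERSE_LINKS[link]
    total = 0.0
    for x_row, y_j in zip(rows, y):
        eta = sum(b * x for b, x in zip(beta, x_row))  # linear predictor
        p = inverse(eta)
        total += math.log(p) if y_j else math.log(1.0 - p)
    return total


# Hypothetical design rows (intercept, indicator), outcomes, and coefficients.
rows = [(1.0, 0.0), (1.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [1, 1, 0, 1]
beta = (0.5, 1.0)

fits = {name: log_likelihood(name, rows, y, beta) for name in INVERSE_LINKS}
for name, value in fits.items():
    print(f"{name}: log-likelihood = {value:.3f}")
```

Only the inverse link changes between the three cases; the likelihood code is shared, so any optimizer built on it would work unchanged for each link.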
Loss Function or Objective Functions for Assessing Fit of Models. Another complication is that the current loss function underlying the fitting of the coefficients of these logistic regression models is implicit in the separate likelihood equations for the two models and is therefore somewhat disconnected from the ultimate goal, which is to predict the population size or, what amounts to the same thing, net coverage error. It may be that the ultimate goal can be better represented by weighting the likelihood equations to take this modified objective function into account. The Census Bureau has done some work in this direction and we support this research and its implementation if it is found to provide preferred estimates.
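One generic way to write such a weighted objective, offered only as an illustration and not as the form the Census Bureau has studied, is a weighted Bernoulli log-likelihood. Here $y_j$ is the 0/1 outcome (match status or correct enumeration status) for case $j$, $x_j$ is its vector of predictors, and $w_j$ is a hypothetical weight reflecting how much case $j$ matters for the population estimate:

$$\ell_w(\beta) = \sum_j w_j \left[ y_j \log p_j(\beta) + \left(1 - y_j\right) \log\left(1 - p_j(\beta)\right) \right], \qquad \log \frac{p_j(\beta)}{1 - p_j(\beta)} = x_j^{\top} \beta.$$

Setting every $w_j$ equal to 1 recovers the ordinary logistic regression log-likelihood; how the weights should be chosen to reflect the goal of estimating net coverage error is the open question.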
Measurement Error. Census data are subject to measurement error, and these errors will have deleterious effects on the application of logistic regression models. If the measurement error is unrelated to the outcome (match status or correct enumeration status), the effect on the data is the attenuation of relationships. In other words, the predictors will not be as effective as they would be without the measurement error. But if the measurement error is related to the outcomes, the effect could be much more complicated, including the introduction of severe biases.
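The attenuation described above can be seen in a small exact calculation, sketched here in Python with hypothetical numbers. It assumes a single binary predictor whose recorded value is flipped with a fixed probability that does not depend on the outcome. In a logistic regression with one binary predictor, the slope equals the log odds ratio, so comparing the log odds ratio for the true and the recorded predictor shows the effect of the error.

```python
import math


def log_odds_ratio(p1, p0):
    """Log odds ratio comparing outcome rates p1 (predictor = 1) and p0 (predictor = 0)."""
    return math.log(p1 / (1.0 - p1)) - math.log(p0 / (1.0 - p0))


def observed_rates(p1, p0, prevalence, flip):
    """Outcome rates by recorded predictor value when the recorded value is
    flipped with probability `flip`, independently of the outcome."""
    # Joint probabilities of (recorded value, true value).
    rec1_true1 = prevalence * (1.0 - flip)
    rec1_true0 = (1.0 - prevalence) * flip
    rec0_true1 = prevalence * flip
    rec0_true0 = (1.0 - prevalence) * (1.0 - flip)
    q1 = (rec1_true1 * p1 + rec1_true0 * p0) / (rec1_true1 + rec1_true0)
    q0 = (rec0_true1 * p1 + rec0_true0 * p0) / (rec0_true1 + rec0_true0)
    return q1, q0


# Hypothetical values: outcome rates of 0.9 and 0.7, half the cases in each
# group, and 20 percent of predictor values recorded incorrectly.
true_lor = log_odds_ratio(0.9, 0.7)
q1, q0 = observed_rates(0.9, 0.7, prevalence=0.5, flip=0.2)
obs_lor = log_odds_ratio(q1, q0)

print(f"true log odds ratio:     {true_lor:.3f}")
print(f"observed log odds ratio: {obs_lor:.3f}")
```

With these values the outcome rates by recorded predictor value become 0.86 and 0.74, so the observed log odds ratio is pulled toward zero. If the flip probability depended on the outcome, the observed relationship could be biased in either direction.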