FINDING THE BOUNDARIES: WHEN DO DIRECT SURVEY ESTIMATES MEET SMALL-AREA NEEDS?

This session focused on the topic of producing estimates in situations in which only a small amount of information is available or there are other limitations, such as physical, temporal, or conceptual boundaries that make direct estimation difficult. In the first presentation, Robert Fay (Westat) discussed the boundaries between direct estimation and model-based small-area estimation. He noted that model-assisted estimation (Särndal, Swensson, and Wretman, 1992) can be viewed as an intermediate point between the traditional design-based approaches and model-based estimation. For the purposes of his talk, however, he included model-assisted estimation as part of direct estimation.

Theories of design-based sampling (Neyman, 1934; Hansen, Hurwitz, and Madow, 1953), although robust and useful in many applications, are based on the asymptotic properties of large samples. However, in practice, researchers are often dealing with moderate-size samples, and even in the case of large samples, the subdomains of interest are often represented by small samples. What constitutes a sufficiently large sample size depends on the intended use of the data, and this is an important question because it determines whether direct estimation is adequate for a specific task.

Fay recalled the 1976 Survey of Income and Education (SIE), which was conducted one time only, with the goal of producing state-level estimates of children ages 5-17 in poverty, with a 10 percent coefficient of variation. Because its sample was approximately three times as large as the Current Population

Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.

Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 61

6
Small-Area Estimation
FINDING THE BOUNDARIES:
WHEN DO DIRECT SURVEY ESTIMATES
MEET SMALL-AREA NEEDS?
This session focused on the topic of producing estimates in situations in
which only a small amount of information is available or there are other limita -
tions, such as physical, temporal, or conceptual boundaries that make direct
estimation difficult. In the first presentation, Robert Fay (Westat) discussed the
boundaries between direct estimation and model-based small-area estimation.
He noted that model-assisted estimation (Särndal, Swensson, and Wretman,
1992) can be viewed as an intermediate point between the traditional design-
based approaches and model-based estimation. For the purposes of his talk,
however, he included model-assisted estimation as part of direct estimation.
Theories of design-based sampling (Neyman, 1934; Hansen, Hurwitz, and
Madow, 1953), although robust and useful in many applications, are based on
the asymptotic properties of large samples. However, in practice, researchers
are often dealing with moderate-size samples, and even in the case of large
samples, the subdomains of interest are often represented by small samples.
What constitutes a sufficiently large sample size depends on the intended use
of the data, and this is an important question because it determines whether
direct estimation is adequate for a specific task.
Fay recalled the 1976 Survey of Income and Education (SIE), which was
conducted one time only, with the goal of producing state-level estimates of
children ages 5-17 in poverty, with a 10 percent coefficient of variation. Because
its sample was approximately three times as large as the Current Population
61

OCR for page 61

62 THE FUTURE OF FEDERAL HOUSEHOLD SURVEYS
Survey (CPS), the SIE generally achieved this target reliability, which was
considerably better than what the CPS offered, particularly in small states.
Even when reliability targets were negotiated in advance, however, the impact
of sampling error on the face validity of the estimates was more pronounced
than survey designers anticipated. Although the SIE met the target reliability
requirements, the state of Alabama protested the large decrease in its poverty
rate since the preceding census. The use of small-area models to correct data
problems of this type was not yet an established practice at the time.
In some situations, model-based small-area estimation represents a neces -
sary alternative to direct estimation. Some of the earliest examples of small-area
estimation include postcensal population estimates from the Census Bureau and
economic series produced by researchers at the Bureau of Economic Analysis,
even though these precedents use slightly different paradigms.
Several of the basic model-based approaches to small-area estimation
emerged decades ago, including:
• ynthetic estimation (Gonzalez and Waksberg, 1973; Levy and French,
s
1977; Gonzalez and Hoza, 1978),
• rea-level models (Fay and Herriot, 1979),
a
• tructure preserving estimation (Purcell and Kish, 1980), and
s
• nit-level models (Battese, Harter, and Fuller, 1988).
u
These basic approaches were followed by a number of refinements, such
as mean square estimation and hierarchical Bayes approaches, he noted. Even
as these model-based approaches expanded, researchers pointed out that
model-assisted estimators could represent a viable alternative in some situa -
tions (Särndal and Hidiroglou, 1989; Rao, 2003).
Fay mentioned some reviews of early applications: Small Area Statistics: An
International Symposium (Platek et al., 1987) and Indirect Estimators in U.S.
Federal Programs (Schaible, 1996), which was based on a 1993 Federal Commit-
tee on Statistical Methodology report and includes examples of practice from
several agencies. A basic resource on theory for scholars starting out in this area
is the classic Small Area Estimation (Rao, 2003). Another useful review of theory
is Small Area Estimation: An Appraisal (Ghosh and Rao, 1994).
It is clear that even though the theory of model-based small-area estimation
has been available for decades and a number of researchers have expanded the
theory, the number of applications is not yet large. Possible reasons are that
model-based estimates are more difficult to produce, replicate, and combine
with other estimates than direct estimates. Model-based estimates are also more
difficult to document and explain to data users. For example, even when esti -
mates of error are produced for an annual series of small-area estimates, users
are typically unable to answer other questions from the published information,

OCR for page 61

63
SMALL-AREA ESTIMATION
such as what reliability is achieved by averaging small-area estimates for a given
area over multiple years.
Some have argued that, although experimentation with model-based esti -
mates should be encouraged, more caution should be exercised when deciding
whether to publish them because they are often not equivalent in quality to the
direct estimates typically published by government agencies. One approach
is to clearly distinguish them from other statistical products. Fay referred to
the example of the United Kingdom and the existence of experimental versus
official statistics. In the United States, model-based estimates have, in some
cases, been published and endorsed. The arguments for doing so are especially
strong when the data produced are more informative than any other alterna -
tives available to users, particularly when the estimates are able to meet a legally
mandated need.
One example is the Current Population Survey, including its Annual Social
and Economic Supplement (ASEC). The original CPS sample design was devel-
oped with national labor force objectives in mind. The first-stage design com -
prised strata of self-representing primary sampling units (PSUs), mostly large
counties, and non-self-representing strata, each composed of PSUs of one or
more smaller counties, from which one, or sometimes two, PSUs were selected
at random. Because the design was originally guided by its national objectives,
non-self-representing strata typically crossed state lines.
In 1973 publication began of average annual unemployment for some of
the states, accomplished by a quasi-modeling that involved collapsing strata
and reweighting PSUs to compensate for effects of the national stratification.
The sample was soon expanded to produce estimates of annual unemployment
meeting the criterion of a 10-percent coefficient of variation in each of the
states (for an underlying unemployment rate of 6 percent). To avoid continuing
the quasi-modeling approach, in 1984 the design was changed to state-based
stratification as part of the survey redesign, which eliminated the need for these
adjustments.
The ASEC supplement in the CPS has been the source of a number of
small-area applications, including the Small Area Income and Poverty Estimates
(SAIPE) Program, which provides income and poverty estimates for states,
counties, and school districts. SAIPE is an example of a program that fills
mandated needs. Considering the small CPS sample size for the target areas, the
SAIPE program was quite ambitious and largely successful (National Research
Council, 2000).
After the American Community Survey (ACS) was launched, the SAIPE
program moved from the CPS to the ACS because of the larger ACS sample size.
The ACS includes approximately 2 million interviewed households per year. The
ACS pools data over 1-, 3-, and 5-year periods to produce estimates. Although
the 5-year estimates produce data for the smallest publishable geographic areas,
the SAIPE program currently models the 1-year ACS estimates, most of which

OCR for page 61

64 THE FUTURE OF FEDERAL HOUSEHOLD SURVEYS
are not publicly released, to increase the timeliness of the data, and relies on
small-area models in place of averaging over time to reduce the relative impact
of sampling error. It will be interesting to assess the trade-offs related to the dif-
ferent releases after two or three sets of 5-year ACS estimates become available.
It will also be of interest to observe whether the sampling variability of the ACS
will encourage a new series of small-area applications to replace the ACS direct
estimates for some uses.
Fay mentioned another case study that is worth following closely: Canada’s
National Household Survey (NHS), which replaces the mandatory long ques -
tionnaire that one in five households used to receive as part of the Canadian
population census. Although details are still emerging, the current plans for a
voluntary survey partially integrated into the census are likely to result in lower
response rates and higher variances compared with previous censuses. The case
of the NHS could become an unplanned experiment in what happens when
data become less reliable than users have grown to expect.
Fay also briefly mentioned his work as part of a Westat team commissioned
by the National Center for Health Statistics to evaluate options for averaging
several years of data from the National Health and Nutrition Examination
Survey (NHANES) to produce quasi-design-based estimates for the state of
California. The products will be both a weighted file and an approach to esti -
mate the variance of the estimates.
Looking ahead, Fay predicted that the area on the boundary between tra-
ditional design-based survey estimates and small-area estimates will probably
grow in importance because there is an increasing demand for subnational esti -
mates, surveys costs are rising, and modeling tools represent a possible route for
incorporating existing administrative records into the estimates. Review of the
case studies presented and similar ones can help guide the evolution of policy
on the use of small-area estimation at federal statistical agencies.
USING SURVEY, CENSUS, AND ADMINISTRATIVE
RECORDS DATA IN SMALL-AREA ESTIMATION
William Bell (Census Bureau) discussed strategies of combining data from
several sources—sample survey, census, and administrative records—to pro -
duce small-area estimates. To illustrate these procedures, he used examples
from the Census Bureau’s Small Area Income and Poverty Estimates Program,
which combines data from different sources to provide income and poverty
estimates that are more current than census information for states, counties,
and school districts. Specifically, SAIPE relies on
• irect poverty estimates from the ACS (and previously the CPS),
d
• rior census long-form sample poverty estimates,
p
• ax data from the Internal Revenue Service (IRS),
t

OCR for page 61

65
SMALL-AREA ESTIMATION
• nformation from Supplemental Nutrition Assistance Program (SNAP)
i
records, and
• emographic population estimates.
d
All data sources, including the ones used for SAIPE, are subject to various
types of error, and these must be taken into consideration when making deci -
sions about how the data can best be combined. Bell mentioned some of the
main types of error affecting data sources:
• ampling error (the difference between the estimate from the sample
s
and what would be obtained from a complete enumeration done the
same way),
• onsampling error (the difference between what would be obtained
n
from a complete enumeration and the population characteristic of inter-
est), and
• arget error (the difference between what a data source is estimating—its
t
target—and the desired target).
Table 6-1 shows error types most likely to affect survey, census, and admin-
istrative records data, although all three data sources could include all three
types of errors. “Census” data may or may not have sampling error, depending
on whether they refer to the complete enumeration or to data from a prior cen -
sus sample (also known as the census long form). The distinction is important
for modeling purposes.
Both the ACS and the CPS provide data suitable to produce estimates of
poverty, albeit in slightly different ways, Bell said. Their weaknesses are that
they are subject to large sampling error for small areas, particularly the CPS.
The census estimates have negligible (state level) or low (most counties) sam -
pling error, but the estimates become gradually more outdated after the census
income year, which is essentially a form of target error. For administrative
records, sampling error is usually not a concern, but the data are subject to
nonsampling error, and they are not collected with the specific goal of measur-
ing poverty, which leads to a form of target error (for SAIPE’s purposes). In
particular, the IRS tax data leave out many low-income people who do not need
TABLE 6-1 Typical Sources of Error for Different Data Sources
Error
Data Source Sampling Nonsampling Target
Sample survey X X
Census Maybe X X
Administrative records X X
SOURCE: Workshop presentation by William Bell.

OCR for page 61

66 THE FUTURE OF FEDERAL HOUSEHOLD SURVEYS
to file tax returns, while in the case of SNAP, the qualifications for the program
are different from the determination of poverty status and not everyone who
is eligible participates.
Taking into consideration the errors described, there are different options
for combining these data sources, Bell said. Suppose yi is a survey estimate
of desired population quantity Yi for area i, and zi is a related quantity from
another, independent data source. The question is how to combine yi and zi to
produce an improved estimator of Yi.
One option for combining the data sources is to take fi yi + (1 – fi)zi with
fi Var(yi)–1 and 1 – fi Var(zi)–1. This assumes that the estimates from the
two surveys, yi and zi, are both unbiased estimates of the target, Yi, which rarely
happens in practice.
An alternative is to take a weighted average of the estimates with weights
instead proportional to the mean squared errors (MSEs): fi yi + (1 – fi)zi with
fi MSE(yi)–1 and 1 – fi MSE(zi)–1. This assumes that the mean squared errors
are known, or equivalently the biases are known, which is also rare in practice.
One could take one of these estimates, yi, and use it to define the
target—in other words, assume that it is unbiased. One can then use ordi -
nary least squares regression to adjust zi to provide an unbiased predictor of
ˆ
ˆ ˆ
Yi : Yi syn = α OLS + βOLS z i , where syn indicates a synthetic estimator.
For a more formal modeling approach (Fay and Herriot, 1979), the follow-
ing structure is assumed:
yi = Yi + e i
= (xi' β + ui ) + e i
where:
yi = direct survey estimate of population target Yi for area i
ei = sampling errors that are assumed to be independently distributed with
a normal N(0, vi) distribution, with vi assumed known
xi = vector of regression variables for area i
b = vector of regression parameters
ui = area i random effects (model errors), which are assumed to be indepen-
dent and identically distributed according to N (0, σ u ) and independent of ei.
2
To illustrate this with an example from SAIPE, Bell discussed the state
poverty rate model for children ages 5-17. The direct survey estimates, yi, were
originally from the CPS (three-year averages) but have since been replaced
with single-year estimates from the ACS. The regression variables for each state
include a constant or intercept term and
• “pseudo poverty rate” for children, calculated based on the adjusted
a
gross income and the number of child exemptions on the tax return;

OCR for page 61

67
SMALL-AREA ESTIMATION
• he tax “nonfiler” rate, which is the percentage of the population not
t
represented on tax returns;
• he SNAP participation rate, which is the number of participants in the
t
program divided by the population estimate; and
• ensus data in one of two forms, either the estimated state poverty rate
c
for school-age children ages 5-17, or residuals from regressing previous
census poverty estimates for ages 5-17 on other elements of xi for the
census year.
One generally has reasonable estimates of the sampling variances, ni. If one
also had estimates of σ u , their sum would provide an estimate of the variances
2
of the yi. Since the various sampling errors and random effects are indepen-
( )
dent, the estimated covariance matrix for the yi is ∑ = diag σ u + ν i , with the
2
off-diagonal terms equal to zero given the assumed independence. Using this
covariance matrix, we could estimate b using weighted least squares as follows:
ˆ
β = ( X ' ∑ X )–1 X ' ∑ –1 y,
where y = (yi,. . . ,ym)’ and X is m x r with rows x'i .
Turning things around, given the ni and some initial estimates of b, one
could estimate σ u using the method of moments, maximum likelihood estima-
2
tion, REML, or through a Bayesian approach. (One might iterate from an initial
estimate of b by setting the σ u equal to some initial value.)
2
It would then be possible to combine the direct survey estimates and
the regression estimates using the best linear unbiased prediction (BLUP) as
follows:
ˆ
ˆ
Y = h y + (1 – h )x' β
i i i i i
( )
where hi = σ u / σ u + ν i .
2 2
Bell said that a way to think about how the data are being used for small-
area modeling and prediction assumes that there is a regression function that
describes the variation of the mean in the target from state to state as a function
() ˆ
of xi: E(Yi) = E(yi) = x'i β . It then follows that E x'i β = x'i β = E(Yi ) so the fitted
regression can be thought of as a predictor of the target Yi. For example, if there
ˆˆ ˆ
is only one regression variable, zi, plus an intercept, then x'i β = β0 + β1 z i . The
model fitting makes a linear adjustment to the data source zi, which otherwise
has target error. After the adjustment, the fitted linear function of zi can be used
to better predict Yi.
The BLUP is the weighted average of two predictors of the target, the

OCR for page 61

68 THE FUTURE OF FEDERAL HOUSEHOLD SURVEYS
direct estimate and the regression fit, where the weights are inversely propor-
tional to the variances of the errors in the two predictors, that is,
hi = σ u / (σ u + ν i ) ∝ 1 / ν i ; 1 – hi = ν i / (σ u + ν i ) ∝ 1 / σ u .
2 2 2 2
To illustrate the improvements in accuracy resulting from this modeling,
Bell compared the approach of regressing the CPS poverty rate for children
ages 5-17 on the pseudo poverty rate from tax records with the Fay-Herriot
model, with one regression variable (the pseudo poverty rate), and with the
SAIPE production model that brings in other regression variables.
Using data from 2004, let yi = CPS 5-17 poverty rates and zi = pseudo
poverty rate for children. Regressing yi on zi using ordinary least squares gives
the synthetic predictor
ˆ
ˆ ˆ
Yi syn = α OLS + βOLS z i
= –.18 + .82z i
The analogous Fay-Herriot model is yi =Yi + ei. In contrast to the OLS
model, here weighted least squares are used, weighting inversely proportional
to Var(yi), to estimate the regression coefficients. Then the regression estimates
are combined with the direct estimates, weighting the regression estimates
inversely proportional to σ u and the direct estimates inversely proportional
2
to ni.
Table 6-2 shows the mean squared errors of the two predictors for four
states. The synthetic predictor is worse than the direct estimate, except in the
case of Mississippi. The mean squared errors for the Fay-Herriot model with
one regressor are lower than the variances of the direct estimates. The improve-
ment is larger in the states with smaller samples and higher sampling variances,
as is typical in this context. The last column in the table shows the weights that
are applied to the direct estimate. For example, in California, approximately
80 percent of the weight is for the direct estimate—in other words, the model
prediction is going to be very close to the direct estimate in this state.
TABLE 6-2 Prediction Mean Squared Errors (MSE) for 2004 Poverty
Rates for Children Ages 5-17 Based on the Current Population Survey
Target and the Fay-Herriot Model with One Regressor (FH1)
MSE(Yi|y,
ˆ
MSE (Yisyn )
State FH1)
ni vi hi
California 5,834 1.1 7.7 .9 .80
North Carolina 1,274 4.6 4.7 2.3 .50
Indiana 904 8.1 9.0 3.4 .36
Mississippi 755 12.0 6.3 3.9 .26
SOURCE: Workshop presentation by William Bell.

OCR for page 61

69
SMALL-AREA ESTIMATION
TABLE 6-3 Prediction Mean Squared Errors (MSE) from the Fay-
Herriot Model with One Regressor Compared to Those of the Full
SAIPE Production Model
MSE(Yi|y, MSE(Yi|y,
State FH1) full model)
vi hi
California 1.1 .9 .8 .61
North Carolina 4.6 2.3 2.0 .28
Indiana 8.1 3.4 2.0 .18
Mississippi 12.0 3.9 2.9 .13
SOURCE: Workshop presentation by William Bell.
Table 6-3 compares the MSEs of the one-regressor Fay-Herriot model to
the MSEs for the full SAIPE production model. The mean squared errors are
lower with the full model, and, again, the difference is bigger in the case of
smaller states, where the predictions are less able to rely on the direct estimates.
Bell also discussed an extension of the Fay-Herriot model to a bivariate
version, which can be used for modeling two statistics simultaneously. The
targets in the two equations are different in this case, and there are procedures
for model fitting and prediction that can potentially improve the estimates for
both quantities. The bivariate model is written
( )
y1i = Y1i + e1i = x1i β1 + u1i + e1i
'
= (x )
β 2 + u2i + e 2i
y2i = Y2i + e 2i '
2i
This approach is useful when there are estimates of ostensibly the same
characteristic from two independent surveys, such as the state poverty rates for
the 5-17 age group from the CPS and the ACS. It can also be used for estimates
of two related characteristics from one survey, such as the poverty rates for the
5-17 and 18-64 age groups from the CPS, or for estimates of the same charac-
teristic but for two time points, such as poverty rates for the 5-17 age group
from this year and last year’s CPS.
In cases in which there are two estimates of ostensibly the same character-
istic from two surveys, researchers have to decide which of the two estimates
defines the target (as being the expectation of one of the estimates). One way to
think about this is to consider which of the two surveys is suspected of having
lower nonsampling error. If both estimates are thought to have similar levels of
nonsampling error, then one may let the direct estimate that has lower sampling
variance define the target, and to try to improve that estimate by modeling.
If one of the two estimates has some sort of “official” status, then this
estimate could define the target. In any case, the bivariate model will utilize

OCR for page 61

70 THE FUTURE OF FEDERAL HOUSEHOLD SURVEYS
the regression variables and the estimates from the other survey to predict the
specified target. This adjusts for bias due to differential nonsampling and target
error between the two survey estimates, but it does not address any bias in the
survey estimate that is used to define the target.
The approach to generating the SAIPE income and poverty estimates is
fairly unusual in the federal statistical system. Yet the estimates are widely used
for administering federal programs and allocating federal funds. In Bell’s view,
there were several factors that contributed to the acceptance of the model-
based estimates among data users. First, the modeling relies on high-quality
data sources that can generate good-quality estimates. Second, the time was
right for this initiative when the Improving America’s Schools Act was passed in
1994, requiring the allocation of Title 1 education funds according to updated
poverty estimates for school districts for the 5-17 age group, unless the model-
based estimates were deemed “inappropriate or unreliable.” In addition, a
panel of the Committee on National Statistics that reviewed SAIPE methods
and initial results also recommended that the model-based estimates be used
(National Research Council, 2000).
ROLE OF STATISTICAL MODELS IN FEDERAL SURVEYS:
SMALL-AREA ESTIMATION AND OTHER PROBLEMS
Trivellore Raghunathan (University of Michigan) discussed research areas
in which model-based estimation represents an ideal tool that allows research -
ers to use data for purposes beyond what they were intended for. He noted that
there has been a recent increase in the complexity of research conducted using
data from federal surveys. The data available from a single survey often do not
meet these complex research needs, and the answer is often a model-based
approach that can synthesize and integrate data from several surveys. Some of
the arguments for combining data sources include
• extending the coverage,
• extending the measurement,
• correcting for nonresponse bias,
• correcting for measurement error, and
• improving precision.
Raghunathan presented four examples from his own work. The first one
involved combining estimates from the National Health Interview Survey
(NHIS) and the National Nursing Homes Survey (NNHS), with the goal of
improving coverage. The variables of interest were chronic disease condi-
tions. Data were available from both surveys for 1985, 1995, and 1997. The
initial strategy was a design-based approach, treating the two surveys as strata.
Current work involves Bayesian hierarchical modeling to obtain subdomain

OCR for page 61

71
SMALL-AREA ESTIMATION
estimates for analysis of health disparity issues based on education and race
(Schenker et al., 2002; Schenker and Raghunathan, 2007).
Another project involved matching respondents from the NHIS and the
NHANES on common covariates involving a propensity score technique.
The NHIS collects data about disease conditions in a self-reported format,
which raises concerns about underreporting due to undiagnosed disease con-
ditions, especially among those least likely to have access to medical care. The
NHANES has a self-report component, but it also collects health measure-
ments. This allowed the researchers to model the relationship between the
self-reported data and clinical measures in the NHANES and then to impute
“no” responses to questions about disease conditions in the NHIS using the
model from the NHANES. After applying this correction to the NHIS, many of
the relationships “became more reasonable.” Current work focuses on extend -
ing the approach to several years of data and on obtaining small-area esti -
mates of undiagnosed diabetes and hypertension (Schenker, Raghunathan, and
Bondarenko, 2010).
The third project combined data from two surveys with the goal of pro-
viding small-area estimates of cancer risk factors and screening rates to the
National Cancer Institute (NCI). In the past, NCI has relied on the Behavioral
Risk Factor Surveillance System (BRFSS) to construct these estimates. How -
ever, the BRFSS is a telephone survey that faces increasing challenges associated
with uneven landline coverage and low response rates. Raghunathan and his
colleagues combined the BRFSS data with data from the NHIS, which covers
both telephone and nontelephone households and has higher response rates.
The technique selected for this study was a hierarchical model, treating
NHIS data as unbiased estimates and BRFSS data as potentially biased esti-
mates. These assumptions were made because of the face-to-face mode and
higher response rates in the case of the NHIS. The telephone household esti -
mates from the NHIS and the telephone household estimates from the BRFSS
were used to correct for nonresponse bias associated with the nontelephone
households and then produce a model-based estimate for all counties.
Although in the past the concept of nontelephone households was under-
stood to mean households without a telephone, it is becoming increasingly
important to distinguish between households that do not have a telephone at all
and households that do not have a landline but do have a cell phone, because
the demographic characteristics of these two types of households are different.
The model thus becomes a four-variate model.
Raghunathan mentioned that although the NHIS and the BRFSS are both
surveys of the Centers for Disease Control and Prevention, accomplishing
the data sharing still involved substantial work. A predecessor to this project,
which involved linking data from the National Crime Victimization Survey and
the Uniform Crime Reporting Survey, also experienced challenges related to
confidentiality and privacy restrictions. Raghunathan emphasized that these are

OCR for page 61

72 THE FUTURE OF FEDERAL HOUSEHOLD SURVEYS
issues with which the federal statistical system will have to grapple in research
of this type.
Raghunathan’s current project involves developing an alternative to the
current National Health Expenditure Accounts. The goal is to study the rela -
tionship between health care expenditures and population health, with a focus
on specific elements, such as disease prevalence and trends; treatment, inter-
vention, and prevention effectiveness; and mortality, quality of life, and other
health outcomes. The relationships are examined using Bayesian network mod-
eling, and microsimulations are performed to evaluate hypothetical alternative
scenarios.
Given that no existing data set contains all of the desired measures,
Raghunathan and his colleagues are working on combining data from a variety
of sources. For example, the team identified 120 disease categories with major
impact on expenditures. Related data for subsets of diseases are available from
• elf-report sources: NHIS, NHANES, Medical Expenditure Panel Sur-
S
vey (MEPS), Medicare Current Beneficiary Survey (MCBS),
• Clinical measures: NHANES, and
• Claims: MEPS, MCBS.
Although information on past and current disease conditions is available
from self-report data, the claims data represent current conditions, so to com -
bine the information, both types of data are converted to a measure of whether
the person ever had the disease. For example, Figure 6-1 shows the information
available from the MCBS and the NHANES. Respondents are matched on the
covariates and then the missing self-report in the MCBS is imputed, so that the
overall self-report rates in the two surveys agree.
1
Health
Expenditure
MCBS Covariates SR = 1 If Claim = 1
Condition
? SR = ? If Claim = 0
Health
Covariates Self-report
NHANES Condition
FIGURE 6-1 Data layout for the Medicare Current Beneficiary Survey (MCBS) and the
National Health and Nutrition Examination Survey (NHANES).
SOURCE: Workshop presentation by Trivellore Raghunathan.

OCR for page 61

73
SMALL-AREA ESTIMATION
Raghunathan concluded by saying that although there are a lot of chal -
lenges related to the portability of information from one survey to another,
including differences in the data collection instruments, protocols, and timing,
often combining several data sets is the best option. When data from multiple
sources are synthesized, statistical modeling and an imputation framework are
particularly useful tools to create the necessary infrastructure. However, data
sharing has to be made easier for these types of approaches to be successful. In
an ideal world, all data collection would be part of a large matrix, with a unified
sampling frame and complementary content that could be filled in by different
agencies or researchers working on different pieces.
DISCUSSION
Roderick Little started the discussion by saying that the term “design-
based” theory of sampling conflates the design aspects of the work with the
analysis aspects, and that it is perhaps more appropriate to think of it as
design-based theory of inference. Little described himself as a modeler and
an advocate of what he calls “calibrated Bayes,” in which the design and the
repeated sampling aspects come into the assessment of the model rather than
into the actual inference (Little, 2006). This approach makes it possible to avoid
“inferential schizophrenia” between being design-based or model-based. He
prefers to think of everything as models, and, in that sense, the design-based
model can be described as a saturated fixed-effects model, in which one does
not make strong structural assumptions, so there are no random effects. One
can also consider unsaturated hierarchical models, so to the extent that there
are any distinctions, they are in terms of the different types of models.
Little argued that hierarchical models are the right way to think about this
conceptually. The advantage of hierarchical models is that it is not necessary
to use either the direct estimates or the model-based estimates, because they
provide a compromise between the direct estimate from the saturated model
and the model-based estimate from the unsaturated model. Fay’s discussion
of SAIPE illustrates how it is possible to get estimates for different areas,
some borrowed mostly from direct estimates and some from the model-based
estimates.
In some cases, Bayes modeling may be a little better because it does
a better job with the uncertainty and the variance components. While the
calibrated approach is a weighted combination of the two, the weights can
be poorly estimated, and in simulation studies the calibrated approach can
provide worse inference than the model-based approach when the hierarchi -
cal model is reasonable. Little finished by stating that the challenge is to come
up with some kind of index of what he called structural assumption depen -
dence. For example, when the weights allocated to the model get too high, it
might be possible to use that as a criteria for whether to publish an estimate.

OCR for page 61

74 THE FUTURE OF FEDERAL HOUSEHOLD SURVEYS
Other aspects of this include how well the model is predicting and the extent
to which the model is highly parametric. He said that research is needed to
develop the right indices.
Fay responded that he will have to think about some of Little’s points, but
that he wanted to defend the need for a boundary because it is a practical tool
for statistical agencies. The number of people who can implement the design-
based theory of inference is much larger than those with the skills described by
Little, so that represents a very practical boundary. Identifying the boundary
will help those who have to decide whether they want to pursue a small-area
application that requires considerable effort and buildup of staff. In response,
Little responded that, since this is a forward-looking workshop, the emphasis
is not on how things are now, but on thinking about how things might be in
the future.
Graham Kalton asked Raghunathan whether using Medicare administra-
tive records was considered when producing estimates about the population
ages 65 and older. Raghunathan responded that he is a “scavenger for infor-
mation,” using as much data as he can find, and he did explore the Medicare
claims information, which is now part of the administrative data used for the
fourth project he discussed. He agreed that the quality of the auxiliary data is
very important in order to borrow strength for small-area estimation. In his
third project, he and his team worked hard on obtaining county-level data from
a wide variety of sources, not only census data, but also marketing data and data
about how active the public health department is.
He added that they also went through a lot of soul searching in terms of
whether the estimates are publishable. They had a meeting at the Centers for
Disease Control and Prevention with state health department representatives
and presented the estimates. Most said that the numbers looked reasonable.
The few who did not, also came around after they went back and thought about
it. The fact is that researchers have to rely on the best information available to
solve a particular problem, and the modeling framework provides an opportu -
nity to move forward with research on these topics.
Raghunathan commented that in some areas of statistics modeling is widely
used, but the techniques are less common in the survey field. He argued that the
distinctions made by survey researchers between model-based, model-assisted,
and design-based approaches are not particularly helpful. In his research, they
relied on the latest computational and statistical developments in the field as a
whole, and that allowed them to solve the problems at hand. Quoting George
Box, he said that all models are wrong, but some are useful. Viewing models
as a succinct summary of the information available and using that to make
projections about the population helps scientific areas, such as health policy,
move forward.
Regarding Fay’s presentation, Kalton commented that state stratification
makes a lot of difference if state-level small-area estimates are of interest, as

OCR for page 61

75
SMALL-AREA ESTIMATION
they were in the California case discussed. A related issue is the number of
sampled PSUs in each small area; if there is not a sizable number of PSUs in
an area, direct variance estimates will be too imprecise, leading to the need to
model the variances.
Fay responded that the problem of degrees of freedom raised was a com-
mon one. The NHANES has certainly lived with few degrees of freedom
before. In the case of the eight years of data in California, about half of the
PSUs were self-representing, which means a lot of degrees of freedom for that
half of the population. The study did poorly in the remaining part of the state.
He agreed that a distinction can be made between design-based estimation and
design-based inference, adding that the variances may have to proceed out of
some form of generalization. This was true for the CPS case as well, because
for the averages it was only a guess what the true variances were.
Kalton quoted Gordon Brackstone, former assistant chief statistician at
Statistics Canada, who many years ago said that the problem with small-area
estimates is that the local people know better, and they will always challenge
model-based estimates that appear wrong. Kalton said that it turns out that the
local people do not necessarily know better, and that surprisingly they tend to
react to the estimates presented by constructing a rationalization for why the
estimates are correct. At least early on, there were not a lot of challenges to the
SAIPE estimates.
Bell said that he believes that when Kalton spoke of large errors, he was
referring to the school district estimates and also some counties. The issue was
the paucity of the data they had at the school district level. In the case of the
estimates that the panel chaired by Kalton was reviewing (National Research
Council, 2000), the updates were coming from the county level, and there were
no school district level updates, so the quality of those estimates was not as
good as the data that were available for higher levels of geography. The general
principle is that the smaller the area, the worse the data are going to be, and
that is an issue. In recent years, SAIPE has also brought in IRS data, but the
program is not always able to assign the tax returns to school districts.
Regarding challenges, Bell commented that they are continuing to get
challenges, although he does not deal with them directly himself. Often they
come from very small school districts, where it is easier for the locals to have a
good sense of what is going on. Occasionally the challenges make reference to
other data, such as free and reduced price lunch data, a situation that indicates
that there is some confusion, given that these are not the same as the poverty
estimates. There were also a lot of challenges based on the 2000 census data,
using the census numbers to estimate the school district to county shares of
poverty and making reference to what the previous estimates were. Generally,
data users compare the current estimates to something else, and they tend to
react when they see a large discrepancy, even though it is clearly possible that
the other estimate was incorrect. Sometimes they have a legitimate case and it

OCR for page 61

76 THE FUTURE OF FEDERAL HOUSEHOLD SURVEYS
is clear that the estimates are far enough out of line, but Bell and his team are
not correcting for statistical errors.
Little referred back to Fay and Raghunathan’s points about the skills
needed to conduct these types of analysis, arguing that it does not help to think
about survey sampling as a field separate from the general statistical community
in which models are being taught. Zaslavsky added that if the general feeling
is that there are not enough people who can do this type of analysis, then it is
important to think about the implications for new directions in training.
Fay said that this debate has been going on for many years, and the concern
about model-based estimation has always been that data users cannot under-
stand the complex techniques and are suspicious of what is going on “behind
the curtain.” But if data users really understood what is involved with design-
based estimation, for example, postsurvey adjustment and variance estimation,
they would be concerned about that as well.
He thinks it would be useful for researchers to continue to pursue this
research and talk to the data users in contexts similar to that described by
Raghunathan. To the extent that researchers are able to communicate their
methods and demonstrate a commitment to accuracy, it is likely that data users
will embrace these techniques, in the same way they accepted the classical esti -
mators that they do not fully understand.