Coverage Measurement in the 2010 Census

5 Analytic Use of Coverage Measurement Data

After the 1990 and 2000 censuses, coverage measurement tabulations focused on estimates of net coverage error for poststrata and aggregates of poststrata. While such tabulations remain valuable, the expanded goals for census coverage measurement (CCM) in 2010 call for a wider array of tabulations and analyses. First, there should be tabulations that break out the four types of census errors (defined in Chapter 2). Second, in order to learn about how census operations affect the occurrence of census errors, each type of error should be linked with a much wider array of variables that include detailed measures of census processes. Finally, the Census Bureau needs to perform analyses that take advantage of the wealth of information from CCM.

This chapter first presents a framework for defining, estimating, and modeling the components of census coverage error. It then broadly specifies the variables that might play a role in statistical models. Lastly, it considers data products in the light of the goal of census improvement.

FRAMEWORK FOR UNDERSTANDING COVERAGE ERRORS

We first consider a framework for defining and estimating components of census coverage error. An excellent start in this direction was provided in Mulry and Kostanich (2006), which details the assumptions underlying dual-systems estimation (DSE) in the presence of nonresponse and other census data deficiencies. In carrying this out, Mulry and Kostanich provide rigorous definitions for some of the components of census coverage error, which are the “dependent variables” for a research program modeling these components. (See Appendix A for some of the details of this research.) The panel supports this research and would like to see it developed more fully, both operationally and in broadening the structure to incorporate duplicates.

Recommendation 9: The Census Bureau should further develop and refine its framework for defining the four basic types of census coverage error and measuring their frequency of occurrence. The Census Bureau should also develop plans for operationalizing the measurement of these components using data from the census and the census coverage measurement program.

In addition to producing tabulations of net coverage error at some level roughly comparable to the poststrata used in 2000, the Census Bureau has started developing plans for producing tabulations of components of census coverage error, including the percentages of erroneous enumerations, duplicates, missed enumerations, and people counted in the wrong area (e.g., in the wrong state) by major demographic characteristics, by geography (possibly states), and by some census operations (such as mailout–mailback in comparison with alternatives). The panel believes that such tabulations, while representing a very positive shift in the role of products of coverage measurement, still fail to fully utilize the richness of information contained in the coverage measurement data that will be collected in 2010.

The hoped-for feedback loop linking component census coverage errors to their causes will require the creative use of exploratory data analysis to answer the following general question: Which census processes are associated with a substantially increased rate of erroneous enumerations, duplications, omissions, or enumerations in the wrong location?
This, in turn, will necessitate answering several other general questions:
What types of housing units are missed more often than others?
What types of housing units are duplicated more often than others?
What types of people are missed more frequently than others?
What types of people are erroneously enumerated more frequently than others?
What types of people are duplicated more often than others?
What types of people are counted in the wrong location more often than others?
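
Each of the questions above amounts to comparing component error rates across groups of housing units or people. The following is a minimal sketch in Python of such a tabulation, not a description of any Census Bureau procedure. The field names (`unit_type`, `outcome`), the outcome labels, and the records themselves are hypothetical and made up for illustration. The sketch is unweighted and covers only outcomes assigned to E-sample enumerations; an actual tabulation would apply sampling weights and would handle omissions separately, because those are identified from the P-sample.

```python
from collections import Counter, defaultdict

# Outcome labels are assumptions for this sketch, not official CCM codes.
E_SAMPLE_OUTCOMES = ("correct", "erroneous", "duplicate", "wrong_location")


def component_rates(records, group_key):
    """Return {group: {outcome: share of that group's enumerations}}."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[group_key]][rec["outcome"]] += 1
    rates = {}
    for group, tally in counts.items():
        total = sum(tally.values())
        # Counter returns 0 for outcomes never observed in a group.
        rates[group] = {o: tally[o] / total for o in E_SAMPLE_OUTCOMES}
    return rates


# Made-up records, purely illustrative (not census data).
records = [
    {"unit_type": "single_family", "outcome": "correct"},
    {"unit_type": "single_family", "outcome": "correct"},
    {"unit_type": "single_family", "outcome": "correct"},
    {"unit_type": "single_family", "outcome": "duplicate"},
    {"unit_type": "small_multi_unit", "outcome": "correct"},
    {"unit_type": "small_multi_unit", "outcome": "erroneous"},
]

rates = component_rates(records, "unit_type")
for group, by_outcome in sorted(rates.items()):
    print(group, by_outcome)
```

Passing a different `group_key` (for example, a hypothetical age-group or enumeration-method field) would give the same comparison along another dimension.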
Through data analyses that use these outcomes as dependent variables, one should be able to identify combinations of predictor values that characterize situations producing a higher rate of one (or more) of the four types of coverage error. In other words, given the current plans for 2010, it seems very possible that the Census Bureau could provide assessments of when a census enumeration is likely to be erroneous, when a census enumeration is likely to be a duplicate, when a census enumeration is likely to be counted in the wrong location, and when a P-sample enumeration is likely to be a census omission.

It is also reasonable to expect that beginning to develop answers to the above questions and provide the above assessments will be an important step toward identifying alternative census processes and designs that might reduce the frequency of the associated coverage errors in subsequent censuses. For example, missing people in gated communities might suggest altering the way nonresponse follow-up is carried out for such residences. Duplicating people serving in the military would suggest a change to the residency examples provided on the census questionnaire.

Toward this end, we outline a general approach to the statistical modeling of component census coverage errors that we believe the Census Bureau should consider. Note that the statistical models outlined will be fit using data from the P-sample and from the associated E-sample enumerations, though efforts should be made to augment these models with data from administrative records, the American Community Survey, and, if possible, other sources.

STATISTICAL MODELING

The following provides some specifics as to the variables that should be used in statistical models in support of census process improvement.

Dependent Variables

There are four primary types of dependent variables that should be used.
The first dependent variable is an indicator variable as to whether a P-sample enumeration is a census omission. It may also be useful to consider separate indicator variables that indicate whether the omission is (a) a within-household omission, with others in the housing unit having been enumerated in the census; (b) a whole-household omission in a listed housing unit; or (c) an entirely missed housing unit. The second dependent variable is an indicator variable as to whether an E-sample enumeration is a census duplicate. It may also be useful to consider separate indicator variables that denote whether the situation
is (a) a whole-household duplicate or (b) a duplication of an individual in a household in which others were counted only once. Also, consider the possible ways in which people might be duplicated. People who are often duplicated are those in nursing homes, people in prisons, or people in military barracks being enumerated with their relatives rather than at their institutional residences. There are also college students being enumerated both where their parents reside and at school; people with winter homes being counted as residents at those locations and also at their primary residences; and movers being counted both at their previous residence and at their current residence. These various causes of duplication result in different displacements between the correct residence and the duplicate residence. Therefore, to help differentiate between these causes of duplication, it may be useful to have dependent variables that distinguish between duplications that have varying degrees of displacement between the two residences.

The third dependent variable is an indicator variable as to whether an E-sample enumeration is erroneous. It may also be useful to consider indicator variables that denote whether the erroneous enumeration corresponds to (a) a fictitious person, (b) someone who was born after Census Day, (c) someone who died before Census Day, or (d) a visitor.

The fourth dependent variable is an indicator variable as to whether an E-sample enumeration is in the wrong location. As mentioned above, there are different possible levels of geographic displacement (e.g., state or county), and for each definition a separate indicator variable could be formed.
Further, it may also be useful to consider separate indicator variables that could distinguish between the following situations: (1) a household counted at the right address/housing unit but in the wrong place due to a geocoding error, (2) a household counted in the wrong housing unit, due to a move, and (3) a household counted at a Census Day residence that is not the usual residence.

Before continuing, we note that there will always be mistakes in attributing a census coverage error to one of the four component types, and even in determining whether a coverage error occurred at all. For example, due to matching errors, some cases judged to be census duplicates in the coverage measurement program through use of the national match for duplicates will not be duplicates, and so a few correct enumerations will be judged as being duplicates. For the same reason, some actual duplicates will be erroneously classified as correct enumerations. Other situations of ambiguity and error will certainly exist in assessing erroneous enumerations, enumerations in the wrong place, and omissions. However, the current methods for judging which cases belong to which categories are sufficiently reliable to strongly support the current aims of the Census Bureau to improve census methodology.
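
The four dependent variables described above can be illustrated by coding a classified case into 0/1 indicators. This is a hedged sketch only: the `sample` and `status` field names and the status labels are hypothetical, not CCM file layouts. Following the text, the omission indicator is defined only for P-sample cases and the other three only for E-sample cases; `None` marks an indicator that is not defined for a case.

```python
def code_indicators(case):
    """Map one classified case to the four 0/1 dependent variables.

    `case` is a dict with hypothetical keys:
      "sample": "P" or "E"
      "status": e.g. "correct", "omission", "duplicate",
                "erroneous", "wrong_location"
    """
    sample, status = case["sample"], case["status"]
    is_p = sample == "P"
    is_e = sample == "E"
    return {
        # Omissions are assessed for P-sample enumerations only.
        "omission": int(status == "omission") if is_p else None,
        # The remaining three are assessed for E-sample enumerations only.
        "duplicate": int(status == "duplicate") if is_e else None,
        "erroneous": int(status == "erroneous") if is_e else None,
        "wrong_location": int(status == "wrong_location") if is_e else None,
    }


p_case = code_indicators({"sample": "P", "status": "omission"})
e_case = code_indicators({"sample": "E", "status": "duplicate"})
print(p_case)
print(e_case)
```

The finer breakdowns mentioned in the text (for example, within-household versus whole-household omissions, or duplicates by degree of displacement) could be added as further keys in the same way.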
Independent Variables

Primary

To understand which subsets of individuals and housing units are more frequently subject to coverage errors of the four indicated types, and to understand what census processes contributed to those errors, it is necessary to focus on predictors that distinguish between individuals and housing units that are likely to have different interactions with census enumeration processes, as well as predictors that indicate the census processes that were used to attempt to enumerate those individuals and housing units.

We caution that the set of census component processes that one might want to represent is much richer than the discussion here would suggest. To get some sense of the complexity, there will be dozens of different questionnaires used in the 2010 census (to account for different forms of delivery, foreign languages, and other factors). Given the size of the postenumeration surveys (PES) used to date, the sample sizes of the coverage measurement program are unlikely to support analysis of the rarer component processes used in conjunction with the more detailed subsets of the population. Therefore, some compromise between full representation and parsimony in modeling must be struck.

We recognize that some of the covariates suggested for use in statistical modeling of the frequency of various components of coverage error may not be routinely available given the planned data collection in the P-sample blocks. For instance, we discuss below the possibility of determining whether someone in a household telephoned for questionnaire assistance and whether that was associated with a higher or lower rate of component coverage error. Such information has not been collected in previous coverage measurement programs and would therefore not be available to modelers.
However, as mentioned above, if a master trace sample is collected in a way that overlaps substantially with the P-sample blocks in 2010, such covariates could be made available for analysis. Our outline assumes that certain covariates will be available in 2010. If this information will not be available, we hope that this discussion motivates a revision of plans to make more predictors available to support the models. With these caveats, we discuss the primary predictors that might be valuable to include in statistical models for component coverage errors.

Individuals and Housing Units

There are four types of individual and housing-unit variables that need to be considered. The first is the type of enumeration area and other features that identify the local geography and the types of housing units in the area. Housing units in the United States are primarily enumerated using one of three procedures: mailout–mailback, update–leave, and list–enumerate (see National Research Council, 2000). Broadly speaking, housing units are selected for enumeration by one of these three procedures as a function of the quality of the mailing list and whether the housing units have unique identifiers. In addition, other features that might be related to coverage error are the frequency of small multi-unit residences in the area, the rates of new construction, and the rates of demolition of residences, as well as whether the immediate area is mainly residential or a mixture of residential and business establishments.

The second type of variable is the kind of housing unit. Indicator variables that might be useful to consider include whether the housing unit in question has been newly constructed or is part of a small multi-unit building. Other indicator variables that are likely to be predictive are whether the housing unit is part of a group quarters and, if so, what kind of group quarters.

The third type of variable is people’s demographic characteristics, which are related to differences in rates of components of coverage error (e.g., being between 18 and 22 is related to increases in the chances of duplication). There also remains considerable interest in assessing the quality of the census as a function of these characteristics. Therefore, it is important to include them in any statistical models of the rates of components of coverage error.

The fourth type of variable involves the relationships of residents. It is generally believed that households containing people who are unrelated to each other, or who are not part of the same nuclear family, are more prone to census omissions. Therefore, indicator variables for this would likely be useful.

Census Processes

At least five types of census process variables could be useful in these models. The first type uses results of the Master Address File (MAF) building process.
Four basic types of coverage error can result from mistakes in the MAF building: if a nonresidential unit is erroneously included, if a housing unit is included twice, if a housing unit is omitted, and if an address is geocoded in the wrong location (which involves the interaction of the MAF and TIGER).1 None of these mistakes forces a census coverage error, since these mistakes can be, and often are, corrected during the field enumeration. However, it seems clear that indicator variables of these errors are very likely to be associated with a greater frequency of coverage error. Therefore, four variables that should be considered for inclusion in statistical models to predict the relevant components of census coverage error are a variable that indicates when an address included in the MAF is not a residence, a variable that indicates when an address is included more than once on the MAF, a variable that indicates when an address was omitted from the MAF, and a variable that indicates when a geocoding error has been made. It is anticipated that each of these indicator variables will be strongly predictive of the corresponding component coverage error. (It may sometimes be difficult to identify the specific address-building process that resulted in a MAF error: for example, knowing that an address was omitted from the MAF does not necessarily identify the specific address-building process that was in error.)

1 These variables, and others listed, may serve dual purposes. For some analyses they may be explanatory (or stratifying) variables. But it would also make sense to use them as outcomes in other analyses, modeled as functions of housing characteristics and MAF-related operations.

The second type of process variable to be considered uses results of the mailout–mailback enumeration. Three indicator variables related to the initial mail data collection might be associated with coverage errors: (1) an indicator that a foreign language questionnaire was requested or some other contact was made with telephone questionnaire assistance, (2) an indicator of some degree of item nonresponse on the returned questionnaire, and (3) an indicator that some of the responses to the census questionnaire needed to be keyed rather than scanned from image. Each of these indicator variables may be related to difficulties that respondents had in responding to the census questionnaire.

The third type of process variable to be considered uses information from nonresponse follow-up. Variables associated with nonresponse follow-up that might be related to coverage error are the number of attempts needed to collect the information from initial mail nonrespondents and whether the ultimate enumeration was through a proxy respondent.

The fourth type of process variable uses information from the coverage follow-up interview.
Indicator variables involving the coverage follow-up interview that might be related to coverage error are the six possible reasons for a coverage follow-up interview (more than six residents in the household, a count discrepancy in the census questionnaire, a possible duplicate enumeration given the national search for duplicates, count discrepancy with administrative records counts, response to the undercoverage probe, and response to the overcoverage probe). Another variable in this category is the number of phone calls needed to complete the coverage follow-up interview.

The fifth type of process variable involves other modes of enumeration. Other methods of being enumerated include “Be Counted” forms and telephone questionnaire assistance. These and other modes are relatively rare, and it is doubtful that the PES would support statistical models that made use of variables that indicated their use. However, expansion of these modes of enumeration might take place, and in that case, indicator variables of their use might be useful in future
models of the frequency of component coverage error. Similarly, any other process that occurs relatively frequently and is not included in this outline of census component processes could be considered for inclusion.

Predictors Related to Enumerator Error

There has been little research as to whether census coverage error is related to the effectiveness or doggedness of field enumerators. However, if possible, it would seem useful to examine. One variable that might discriminate between effective and ineffective enumerator effort is the turnover rate for the associated field office. Other variables that might be useful can be imagined, including variables that indicate how enumerators fared in any training exercises and variables that represent the results of the quality control program that checks on each enumerator’s initial workload.

Contextual Factors

There are likely to be additional contextual variables that are associated with the frequency of components of census coverage error and that may relate to how the census operates in a broad sense. Examples of such covariates are the percentage of people in an area that own their own residences, the local mail return rate, the local crime rate (assuming improvements are made to the completeness of the Uniform Crime Reporting system), the health of the local economy (using the local employment rate from the American Community Survey, for example), or the percentage of estimated undocumented local residents in the current or the previous census (again from the American Community Survey, though estimated indirectly).
Implementation

To facilitate use of these various dependent and predictor variables, it is important for the census coverage measurement database to be structured to enable the representation of data at multiple levels (including individual, household, local area, and the area covered by the Census Bureau’s local enumeration office), since various variables will exist at these different levels. In addition, it will also be important to be able to link the E-sample data to the P-sample data for the same individual and other individuals in the same household.

It would be premature at this point to suggest the precise form of the statistical models that could be used for this application, but since most of the dependent variables are dichotomous, logistic regression, discriminant analysis, and classification trees are obvious models to consider. Moreover, it is very likely that given the complexity of the underlying phenomena being modeled, focusing on fixed-effects models with predictors restricted to the poststratification variables from 2000 will be unsatisfactory. Instead, what is needed is a representation of the complexity of the situation, involving characteristics of households, housing units, census processes, enumerators’ performance, and interactions among these variables. In addition, the Census Bureau should also examine the possibility of using separate regression models for separate geographic domains.

The panel stresses that although the implicit orientation of this discussion has been on developing individual-level models, modeling at the individual level is not the only way of making progress in developing feedback loops for census improvement. For example, one could take the areas that have a high frequency of net undercoverage, or any of the four types of coverage error, and see if those areas have any similarities. In particular, one could carry out a cluster analysis of the areas with high rates of duplication to find out that, say, 30 percent of these areas are college towns, 35 percent are rural areas that have a lot of people with alternative addresses, with the remainder difficult to categorize. Such an analysis would provide obvious clues as to how to modify the census to reduce the duplication rate in the future. Or one could try to develop models or carry out exploratory data analysis at the household level.

There is a strong need for multiple approaches to the analysis of such data. It remains to be seen which kinds of exploratory data analyses or statistical modeling techniques are most helpful to the Census Bureau in identifying census deficiencies. It would be important to consider a wide variety of models and analyses before focusing on any small number of techniques. And it would be comforting if different approaches to the modeling produced similar findings.
And, consistent with the need to maintain disclosure avoidance, it would be valuable to share the data with outside experts in various areas of application so that the latest techniques are examined for their applicability.

To summarize, in developing models for the frequency of the components of coverage error, and for modeling match rate and correct enumeration rate, the Census Bureau needs to consider a broad range of possible approaches before focusing on a specific model. Alternatives to the approach currently being considered include logistic regression with other link functions, discriminant analysis, and data mining, such as classification trees, support vector machines, and neural nets. Candidate regression models would include various subsets of predictors, as well as transformations and interaction terms. Separate models by geographic strata of various types should also be considered, for instance separate models for urban and rural areas. We stress that the inclusion of interaction terms in such models is
particularly important, since there are likely very significant interactions between geography and demography. We are unaware of any studies by the Census Bureau, in evaluating coverage error for previous censuses, where the possibility of complex regional effects has been investigated. For example, it is very easy to believe that the growing immigrant Spanish-speaking population will have a different response than others to changes in the presentation of the questions on race/ethnicity and residence. Also, changes to the procedures used to procure foreign language questionnaires and changes in the procedures used to hire field enumerators who are bilingual may have an important impact on components of census coverage error.

The variables used for these various problems need not be identical, and therefore, separate model building research would need to be carried out. However, to avoid some form of balancing error, there may be a benefit to using the same variables for the logistic regression model of the match rate and erroneous enumeration rate in estimating net coverage error: the Census Bureau could investigate the costs and benefits of this idea prior to deciding whether to apply this constraint.

Recommendation 10: In developing the logistic regression models or other types of discriminant-analysis models of match status, correct enumeration status, and components of census coverage error, the Census Bureau should consider:

Use of several approaches before focusing on a specific model; besides logistic regression, alternatives should include use of other link functions, discriminant analysis, and various data mining approaches, such as classification trees, support vector machines, and neural nets.
Thorough examination of the subset of predictors that is best suited to each individual statistical model; the predictors for these various statistical models need not be identical; however, there may be a benefit to constraining the (logistic regression) models of match rate and correct enumeration rate to have identical variables in the estimation of net coverage error, and research should be carried out to assess whether this benefit outweighs the benefit of selecting variables that are optimal for each of these two logistic regression models.

To effectively blend information from auxiliary sources at various levels of geographic and demographic aggregation, random effects modeling and Bayesian methods also should be examined.
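
To make the role of interaction terms in such models concrete, the following is a minimal sketch in Python of a logistic regression with two indicator predictors and their interaction, fit by plain gradient ascent. It is an illustration under stated assumptions, not the Census Bureau's model: the two predictors (a proxy-respondent indicator and a small multi-unit building indicator) are chosen from those discussed in this chapter, the cell counts are synthetic and invented for the example, and the fitting routine is a simple stand-in for standard statistical software.

```python
import math

# Synthetic, purely illustrative cell counts (NOT census data):
# (proxy_respondent, small_multi_unit) -> (E-sample cases, erroneous cases)
CELLS = {
    (0, 0): (100, 5),
    (1, 0): (100, 10),
    (0, 1): (100, 10),
    (1, 1): (100, 40),
}


def design_row(proxy, multi_unit):
    """Intercept, two main effects, and their interaction term."""
    return [1.0, float(proxy), float(multi_unit), float(proxy * multi_unit)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(cells, steps=20000, rate=1.0):
    """Maximize the binomial log-likelihood by gradient ascent."""
    total = sum(n for n, _ in cells.values())
    beta = [0.0] * 4
    for _ in range(steps):
        grad = [0.0] * 4
        for (proxy, multi), (n, errors) in cells.items():
            x = design_row(proxy, multi)
            p = sigmoid(sum(b * xi for b, xi in zip(beta, x)))
            # Cell contribution: (observed - expected errors) * x, averaged.
            for j in range(4):
                grad[j] += (errors - n * p) * x[j] / total
        beta = [b + rate * g for b, g in zip(beta, grad)]
    return beta


def predict(beta, proxy, multi_unit):
    """Fitted probability of an erroneous enumeration for one cell."""
    x = design_row(proxy, multi_unit)
    return sigmoid(sum(b * xi for b, xi in zip(beta, x)))


beta = fit_logistic(CELLS)
print("coefficients (intercept, proxy, multi_unit, interaction):", beta)
print("fitted rate, proxy in small multi-unit:", predict(beta, 1, 1))
```

In this synthetic example the error rate when both conditions hold is much higher than the two main effects alone would imply, so the fitted interaction coefficient is positive; a model without the interaction term could not reproduce all four cell rates. Alternative link functions, random effects, or the other methods listed in Recommendation 10 would replace `sigmoid` and `fit_logistic` with the corresponding estimators.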
Missing Data

In addition to developing a better understanding of the correlates of component census coverage error, a related issue worth mentioning is the development of a better understanding of the correlates of missing data (which are obviously related to the quality of the census), which may also be obtained through the use of statistical models. There are several different types of missing data in DSE. First, there are incomplete E-sample interviews, where the degree of “missingness” can range from characteristics information (e.g., missing demographic information) to situations in which the number of residents in a housing unit is unknown. Census interviews can also have missing information about whether a housing unit is occupied or whether it is even a housing unit. (There are also incomplete P-sample interviews, but given the more intensive field work, these are relatively infrequent.) Second, missing characteristics information can affect whether E- and P-sample enumerations are able to be identified as matching or nonmatching, whether a P-sample enumeration is properly designated as a resident of the housing unit or not, and whether an E-sample enumeration is correct or incorrect. Third, missing characteristics information can affect either the use of poststrata or other means for smoothing or modeling estimates of the match rate or the correct enumeration rate.

Because missing data reduce the quality of census counts in various ways, it will be important for the Census Bureau to study which operational factors relate to their occurrence as a first step in trying to reduce their frequency in the future.
Since missing data can be further categorized as occurring at the level of an entire household or an individual, useful dependent variables in such a study would be: (1) an indicator variable for missing household count, (2) various indicator variables either for specific missing household characteristics or for the degree of missing household characteristics, and (3) various indicator variables either for specific missing individual characteristics or for the degree of missing individual characteristics. Given these dependent variables, one could develop statistical models that would help identify which factors helped to discriminate between households and people with and without missing data. These models have the potential to make use of the entire census database, rather than just the data from the coverage measurement program.

In addition, given that proxy enumerations can be viewed as informed imputations, one might also develop statistical models that predict whether an enumeration is likely to be by proxy, again to assess which factors may be related and potentially causal. Since proxy enumerations are actual responses and not missing data, they can directly cause component coverage errors and therefore one could, as mentioned above, use such indicator variables as predictors in the study of causes of the rates of duplication, omission, erroneous enumeration, and enumeration in the wrong location.

Finally, PES data, through the matching of the P-sample and the E-sample, could be used to assess the frequency of failure of characteristic values to agree, such as race/ethnicity, age, or sex. Such errors can affect some count tabulations. Administrative records could also be used to assist in such studies. (Of course, some disagreements are due to imputations, but others are due to misresponse.)

CENSUS DATA PRODUCTS FOR PROCESS IMPROVEMENT

As the panel emphasizes throughout this report, the new objective for coverage measurement in 2010, to improve census processes through collecting and analyzing relevant information on census coverage error and its covariates, is challenging. There are many possible routes to enumerating a person in the census (or attempting to do so), and the propensity for coverage error is a complex function of both the census processes used to enumerate individuals and households and their characteristics. Even so, we are optimistic that much can be learned about which processes lead to greater frequencies of coverage error for individuals with certain characteristics and living situations. We believe that statistical analysis of data from the census coverage measurement program as designed for the 2010 census, possibly along with data from administrative records and the ACS, can provide considerable evidence about the causes of coverage error or at least suggest valuable hypotheses for further investigation. How this information is summarized for data users, and how it is analyzed by people both inside and outside the Census Bureau, has not yet been decided.
This is an extremely important dimension of the coverage measurement program that needs to receive additional attention on the part of the Census Bureau.

As described above, the coverage measurement program in 2010 will provide information not only on net coverage error, but also on the four components of census coverage error.2 With regard to net coverage error, somewhat analogous to 2000, the Census Bureau has proposed that it will produce estimates of the net coverage error nationally, for major demographic groups, and for states. Although there are likely to be no poststrata in 2010 due to the use of logistic regression modeling, the Census Bureau should consider releasing estimates of net coverage error for the 2010 census for comparable aggregates to support the comparison of net coverage error from 2000 to 2010. For housing units, the Census Bureau will produce estimates of net coverage error nationally and by occupied status and other characteristics.

2 There will be inconsistencies between the tabulations for net coverage error and for components of census coverage error that need to be communicated to census data users. In particular, although the estimate of net census coverage error uses DSE to account for those missed by both the census and the CCM, those omissions have no information locating them in a housing unit, and so they are excluded from tabulations of components of coverage error and associated analyses.

With regard to components of census coverage error, the Census Bureau proposes to produce the following research tables for persons: erroneous enumerations (by which is meant duplications and what is defined here as erroneous enumerations), and omissions nationally and by major demographic group, and erroneous enumerations by major census process. For housing units, the Census Bureau proposes to produce research tables on the rate of erroneous enumerations by major census process. In addition, the Census Bureau also proposes to produce research tables providing some assessment of the extent to which people are counted in the wrong geographic locations and the distances from the correct locations.

The panel commends the Census Bureau for its forward thinking about the role of coverage measurement in 2010 and the production of these various research tabulations. In particular, providing rates of erroneous enumerations by census process is likely to provide very useful information on the sources of census coverage errors. Also, given that measurements of the components of census coverage error have never been tabulated before, the public release of these tabulations for the 2010 census is especially to be commended, since the newness of these statistics might have argued for a more conservative approach to their public release.
Although census data users may not be asking about rates of components of census coverage error, the release of these tabulations will give data users a better sense of the totality of error in the census counts.

However, the panel believes that the Census Bureau needs to go further than the release of these tabulations, at least internally, in order to support the ultimate goal of a feedback loop that uses information from coverage measurement for identification of deficient census processes (and possibly even hints at preferred alternative processes). We believe that the major benefit from the current coverage measurement program could very well be the production of a database that provides linked information for purposes of analysis on: (1) persons and their households and housing units, (2) the census processes that were used to (try to) enumerate each individual or housing unit, (3) if an enumeration error was made, what type of error was made, and (4) possibly, contextual information from the relevant geographic area. Such a database could then support the use of regression-type models that would identify the covariates that were (and were not) predictive for rates of the four components of census coverage error.

To be more specific, an analytic database, based on 2010 CCM data plus auxiliary data, would contain several kinds of information. First, it would have descriptive information about each individual from either the P-sample interview or the E-sample interview, along with information on the housing unit (e.g., type of dwelling unit and owner/renter). Second, it would have the results of the census coverage measurement program assessment of whether the individual was properly counted, omitted, erroneously included, duplicated, or included at the wrong location. Also included would be the coverage status of other members of the housing unit (e.g., to establish when a person was an omission in an otherwise enumerated household and when the entire housing unit was missed). Third, to the extent possible, this database would include the history of census processes that were used to attempt to enumerate each individual and household. This would include how the address was added to the MAF, whether the questionnaire was returned, whether the enumeration was obtained in nonresponse follow-up, etc. The panel has been told that for some processes this history is not retained. For instance, if someone was a duplicate and removed from the census during the follow-up process, the fact that a duplicate enumeration existed for a short period of time will likely not be retained. Especially given the critical importance of coverage follow-up and the need to evaluate its performance, the Census Bureau should consider retaining this history in a more complete form for just the CCM block clusters, i.e., essentially a master trace sample (for details, see National Research Council, 2004a).
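As an illustration only, the three kinds of information just described could be carried on a single person-level record. The sketch below is a minimal one under stated assumptions: the class and field names (`PersonRecord`, `maf_source`, `enumerated_in_nrfu`, and so on) and the helper `omission_context` are hypothetical and do not describe any actual Census Bureau file layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class CoverageStatus(Enum):
    """CCM assessment outcomes named in the text."""
    CORRECT = "properly counted"
    OMITTED = "omitted"
    ERRONEOUS = "erroneously included"
    DUPLICATE = "duplicated"
    WRONG_LOCATION = "included at the wrong location"


@dataclass
class PersonRecord:
    # Descriptive information (hypothetical field names).
    person_id: str
    housing_unit_id: str
    dwelling_type: str
    tenure: str                      # "owner" or "renter"
    # Result of the coverage measurement assessment.
    status: CoverageStatus
    # Census process history, to the extent it was retained.
    maf_source: Optional[str]        # how the address entered the MAF
    questionnaire_returned: bool
    enumerated_in_nrfu: bool


def omission_context(records: List[PersonRecord], housing_unit_id: str) -> str:
    """Distinguish a person missed in an otherwise enumerated household
    from a housing unit in which everyone was missed."""
    unit = [r for r in records if r.housing_unit_id == housing_unit_id]
    omitted = [r for r in unit if r.status is CoverageStatus.OMITTED]
    if not unit or not omitted:
        return "no omission"
    if len(omitted) == len(unit):
        return "whole unit missed"
    return "omission within enumerated household"
```

Keeping the coverage status of every member of the housing unit on the same file is what allows the within-household and whole-unit cases to be separated without a second linkage step.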
StARS might also provide very helpful information for verifying whether a census coverage error was made, and so efforts should be made to fold in the StARS database. The American Community Survey might provide contextual information at the small-area level, such as the poverty or unemployment rate. In addition, the quality of the enumerator work might be weakly assessed through information on enumerator turnover rates and other measures of the quality of the census field staff, based on information from the various management information systems for census enumeration districts. One might even argue for linking in the results of the quality assurance findings on the initial workload for each enumerator (which could be used to see whether the current decision rule for which enumerators to retain and which to let go is set at the right level).

The degree to which this database can and will be used to address important questions about census design is directly a function of the extent to which the needed analyses are permitted (even better, facilitated) by the structure of the database. This is a question of database management, which this panel did not address. However, it is clear that the development of such a database will not be easy, and it is therefore important that it be done by those with considerable expertise. The panel believes it is very likely that the Census Bureau would benefit from external assistance in developing such a database if it decides to pursue its development.

To indicate just one type of complication in creating such a database, consider the different geographies that might be encountered. There is the geography of the contextual information, which is likely to be census tracts. Then there is the geography of the CCM survey, which is the P-sample block clusters (though estimates will be available for any domain). There is also the geography of census enumeration districts. This variety of “geographies” complicates linkages of data.

The specification of the structure of such a database should start with the development of a list of the primary types of questions that the database will likely be asked to address. To this end, the coverage measurement staff should formulate various hypotheses that it would like a database to help resolve. Examples include: How much duplication is the result of children living in joint custody arrangements? How many erroneous enumerations are caused by the misinterpretation of residence rules? Could StARS be used to reduce the number of MAF omissions? The development of such a list will greatly assist in determining the structure of the database, including what information should be linked in and how it should be linked.

In addition to using this database to confirm (or reject) a set of primary hypotheses of interest, such a database could also be used in a more exploratory manner to search for interesting and unanticipated patterns that lead to census coverage error.
Every census seems to have surprises involving the types of people or households that experience novel causes of coverage error. A caution, though, is that it is the nature of such exploratory analyses to “discover” situations that are illusory or patterns that are idiosyncratic aspects of the 2010 census.

Recommendation 11: The primary output of the Census Bureau’s coverage measurement program in 2010 should be an analytic database that is used to support the development of statistical models to inform census process improvement. The production of summary tabulations should be of lesser priority.

Having argued for the development and use of this database, it is the panel’s responsibility to suggest how the analyses generated by this database could be used to support a feedback loop for census improvement. Our view is that for each hypothesis of interest, one could develop a statistical model to examine its validity. The development of such a model would likely be nontrivial, potentially involving conditioning on variables (such as sex of the individual, or urban/rural), and requiring diagnostic tools for identifying which predictors are useful and which are not, when transformations are and are not useful, etc. As mentioned above, many of these hypotheses will be modeled as forms of discriminant analyses, i.e., methods for using the predictors to separate the cases that were subject to a type of coverage error from those that were not. When successful, such an analysis would suggest that certain types of housing units, or households, or individuals, were more often missed (or duplicated or otherwise incorrectly “counted”) when associated with a particular census enumeration process.

Although this information would be very useful, it would be incomplete, not providing the Census Bureau with what it needs most: an alternative process that would remedy the situation. For example, suppose an analysis demonstrated that many “Be Counted” enumerations were duplicates, by noticing that a model of the frequency of census duplication had a significant coefficient for the predictor that was an indicator variable for enumeration using the “Be Counted” program. (Such a finding would hardly be a surprise given the nature of the program.) The difficulty is that it might not be clear what could be done to the “Be Counted” program that would reduce the frequency of duplication. Is it the questionnaire itself that is being misunderstood? Is the problem where the “Be Counted” forms are distributed? Or is the problem how the forms are validated once they are collected?
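The “Be Counted” example can be sketched in its simplest form: a logistic model of duplication with an intercept and a single indicator variable. For that special case the maximum likelihood estimates have a closed form, so no iterative fitting is needed; the coefficient on the indicator is the log odds ratio from the 2x2 table of duplication by enumeration route. The function name `logit_indicator_fit` and the counts in the usage lines are hypothetical, invented purely for illustration, and are not census results. A real analysis would include many more predictors and would have to account for the CCM sample design.

```python
import math


def logit_indicator_fit(dup_with, n_with, dup_without, n_without):
    """Closed-form fit of logit P(duplicate) = b0 + b1 * indicator.

    dup_with / n_with: duplicates and total cases where the indicator is 1
    dup_without / n_without: the same where the indicator is 0
    Returns (b0, b1, standard error of b1).
    """
    a, b = dup_with, n_with - dup_with            # indicator = 1
    c, d = dup_without, n_without - dup_without   # indicator = 0
    if min(a, b, c, d) <= 0:
        raise ValueError("every cell of the 2x2 table must be positive")
    b0 = math.log(c / d)                  # log odds when indicator = 0
    b1 = math.log(a / b) - b0             # log odds ratio
    se_b1 = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Wald standard error
    return b0, b1, se_b1


# Hypothetical counts, for illustration only.
b0, b1, se_b1 = logit_indicator_fit(dup_with=40, n_with=200,
                                    dup_without=150, n_without=5000)
z = b1 / se_b1   # compare with 1.96 for a two-sided 5 percent test
print(f"coefficient = {b1:.3f}, odds ratio = {math.exp(b1):.2f}, z = {z:.1f}")
```

A large coefficient of this kind indicates only that duplication is more frequent among enumerations obtained through the process; as the text notes, it does not say which feature of the process is responsible.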
So it is important to realize that the full feedback loop will still require insight and interaction with field staff to use the information provided to identify what specifically should be modified to provide needed improvement in subsequent censuses.

Finally, we caution that such analyses cannot be used to tease out extremely detailed aspects of census taking. The size of the CCM, and the general infrequency of the four components of census coverage error (in 2000, there were 37 million estimated omissions and overcounts of 283 million enumerations and, as discussed often in this report, that is an overestimate of the gross error rate; see National Research Council, 2004b), crossed by census processes used, among other variables, will necessarily limit the value of the proposed database to address very detailed questions. Consequently, the panel cautions that Herculean efforts to perfectly represent the census processes used through complicated indicator variables are probably not going to be profitable. Rather, what one hopes to be able to identify are general patterns of census processes that proved to be less than fully effective in enumerating relatively substantial subpopulations. Such analytic findings may also be strengthened if they are supported by anecdotal reports of field staff that suggest causal explanations for the relationships discovered.

A remaining question is what the Census Bureau should communicate to census data users about the analyses that are produced from the proposed database. First, it is our hope that some extracts of this database could be made available, possibly through the Census Bureau data centers (taking care to avoid disclosure), to experts in various areas of statistical methodology to see if different patterns of interest can be identified. Second, while most census data users will be uninterested in either developing their own models or even the models that the Census Bureau develops, there is going to be interest in why the Census Bureau is making various changes in the census coverage measurement program; therefore, the Census Bureau may find it useful to issue a research report series on major findings from the analyses (both internal and external) of a CCM database. These reports can describe the statistical models developed and report on improvements to the census design or census processes that might be indicated.

Recommendation 12: The Census Bureau should develop regression models that elucidate the various types of census coverage error, using specified dependent and predictor variables. To the extent that the database supporting these models can be made available to external researchers, it is extremely important that the Census Bureau pursue all viable avenues to involve outside researchers in the development of such models.

Given that the goal of the coverage measurement program in 2010 is census improvement, complete documentation of the functioning of the various census processes in 2010, at least for a sample of households, is crucially important to provide a baseline for comparison of potential alternative processes.
Such a sample of households is commonly called a master trace sample. The value of such a master trace sample would be significantly increased if it substantially overlapped with the CCM sample, making it possible to link indications of components of census coverage error with the census processes involved.

Recommendation 13: For a sample of households, the Census Bureau should retain data that provide a comprehensive picture of the census processes used to enumerate each household and the individuals residing in it, to facilitate subsequent evaluation. To allow linking assessment of census coverage error with a history of the census processes, this sample should substantially overlap with the CCM sample.
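The overlap called for in Recommendation 13 can be checked, and then used, with very little machinery once both samples share a household identifier. The sketch below assumes such a shared identifier exists; the function names `overlap_share` and `link_errors_to_history`, the identifiers, and the process labels are all hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, Iterable, List, Tuple


def overlap_share(trace_ids: Iterable[str], ccm_ids: Iterable[str]) -> float:
    """Fraction of master trace sample households that are also in the CCM sample."""
    trace, ccm = set(trace_ids), set(ccm_ids)
    if not trace:
        raise ValueError("trace sample is empty")
    return len(trace & ccm) / len(trace)


def link_errors_to_history(
    ccm_errors: Dict[str, str],
    trace_history: Dict[str, List[str]],
) -> Dict[str, Tuple[str, List[str]]]:
    """Pair each CCM-assessed error with the retained process history.

    Only households present in both samples can be linked, which is why
    the overlap matters.
    """
    shared = ccm_errors.keys() & trace_history.keys()
    return {hh: (ccm_errors[hh], trace_history[hh]) for hh in sorted(shared)}


# Hypothetical identifiers and process labels, for illustration only.
trace_history = {
    "h1": ["mailout", "mail return"],
    "h2": ["mailout", "nonresponse follow-up"],
    "h3": ["mailout", "mail return"],
    "h4": ["mailout", "nonresponse follow-up", "coverage follow-up"],
}
ccm_errors = {"h2": "duplicate", "h4": "omission", "h9": "erroneous enumeration"}

share = overlap_share(trace_history, ccm_errors)
linked = link_errors_to_history(ccm_errors, trace_history)
```

In this made-up example the household `h9` has an assessed error but no retained history, so it drops out of the linked file; a larger overlap leaves fewer such cases.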