Reengineering the Survey of Income and Program Participation

3 Expanded Use of Administrative Records

In reengineering the Survey of Income and Program Participation (SIPP), the Census Bureau has from the outset envisioned a role for administrative records. Although the bureau backed away from the notion of using administrative records to replace a large portion of the SIPP questionnaire content (see Chapter 2), it has continued to stress the contribution that administrative records could make to improving the quality of SIPP data (see Johnson, 2008).

This chapter addresses the role that administrative records can play in a reengineered SIPP. The chapter first outlines a framework for evaluating the benefits and costs of different uses of administrative records for SIPP. Using the framework as a guide, the chapter reviews the uses of administrative records in SIPP’s history to date, along with other uses of administrative records at the Census Bureau that are relevant to SIPP. It then addresses the feasibility of acquiring and linking different federal and state administrative records and the benefits and costs of the following seven ways of using such records in a reengineered SIPP:

(1) evaluating the accuracy of survey responses in the aggregate by comparison with aggregate estimates from administrative records;
(2) evaluating the accuracy of survey responses at the individual respondent level by comparison with exactly matched administrative records;
(3) improving the accuracy of imputation routines used to supply values for missing survey responses and of survey weighting factors used to improve coverage of the population;
(4) providing values directly for missing survey responses;
(5) adjusting survey responses for underreporting or overreporting;
(6) using administrative records values instead of asking survey questions; and
(7) appending administrative records values to survey records.

The first three uses we term “indirect,” in that administrative data are never actually recorded on SIPP data files; the last four uses are “direct,” in that administrative data become part of the SIPP data files to a greater or lesser extent. Following the discussion of uses, the chapter considers methods of confidentiality protection and data access that would be appropriate for a reengineered SIPP. Our conclusions and recommendations are presented at the end of the chapter.

A FRAMEWORK FOR ASSESSING USES OF ADMINISTRATIVE RECORDS

SIPP’s primary goal, which is to provide detailed information on the short-term dynamics of economic well-being for families and households, including employment, earnings, other income, and program eligibility and participation, requires a survey as the main source of data. There are no administrative records from federal or state agencies that, singly or in combination, could eliminate the need for survey data collection, even if it were feasible to obtain all relevant records and the custodial agencies did not object to their use for this purpose. Consider the following examples of shortcomings in administrative records:

Records for programs to assist low-income people, such as the Supplemental Security Income (SSI) Program or the Food Stamp Program (since 2008 termed the Supplemental Nutrition Assistance Program or SNAP), contain information only for beneficiaries and not also for people who are eligible for the program but do not apply for or are erroneously denied benefits.
Being able to estimate the size of the eligible population, including participants and non-participants, is important to address the extent to which an eligible population’s needs are being met, what kinds of people are more or less likely to participate in a program, and other policy-relevant questions.

Program records do not always accurately distinguish new recipients of benefits from people who received benefits previously, had a spell of nonparticipation, and are once more receiving benefits. One of SIPP’s important contributions to welfare program policy analysis has been to make possible the identification of patterns of program participation over time, including single and multiple spells.

Federal income tax records on earnings and other income exclude some important income sources that recipients do not have to report, such as Temporary Assistance for Needy Families (TANF) and pretax exclusions from gross wage and salary income. Pretax employer-sponsored health insurance contributions, which are a growing share of wage and salary income, do not have to be reported on Internal Revenue Service (IRS) 1040 individual income tax returns, nor are they always reported on W-2 wage and tax statements.

Federal income tax records do not define some income sources in the manner that is most useful for assistance program policy analysis. Thus, self-employment income is reported to tax authorities as gross income minus expenses, including depreciation of buildings and equipment, which can result in a net loss, even when the business provided sufficient income to the owner(s) for living expenses. In contrast, the SIPP questionnaire asks for the “draw” that self-employed people take out of their business for their personal living expenses.

The recipient or filing unit that is identified in administrative records often differs from the family or household unit that is of interest for policy analysis. For example, minor children may be claimed as dependents on the income tax return of the noncustodial parent, and unmarried cohabitors will be two distinct income tax filing units but only one survey household and (assuming they share cooking facilities) one food stamp household. (It is not always possible to accurately identify tax and transfer program filing units in survey data, either.)
Despite these and other problems, administrative records can clearly be helpful to SIPP in a number of ways, as we demonstrate in later sections and as they have been in the past (see “SIPP’s History with Administrative Records” below). Indeed, the Census Bureau hopes that significantly greater use of administrative records can be achieved in a reengineered SIPP to improve the quality of reporting of income and program participation. The benefits and costs of using administrative records for a reengineered SIPP must be carefully assessed, and each of the seven possible uses identified above implies a different mixture of benefits and costs. We provide below a cost-benefit framework for considering alternative uses of administrative records for SIPP, including not only records from federal
agencies, but also records that state agencies use to administer such programs as the Children’s Health Insurance Program (CHIP), food stamps, general assistance,¹ Medicaid, school lunch and breakfast programs, the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), SSI (in states that supplement federal benefits), TANF, unemployment insurance (UI), and workers’ compensation (WC) (referred to as “state records” in this chapter).

Benefits

There are potentially two types of benefits for a reengineered SIPP from using administrative records, such as Social Security payments to beneficiaries or food stamp allotments to families: (1) providing higher-quality data in comparison to survey reports (the one benefit specifically identified by the Census Bureau) and (2) providing additional data that would be more difficult or expensive to obtain in interviews. For improving data quality, administrative records may also have the advantage that the ongoing costs of using them for this purpose are modest (at least once an initial investment has been made in acquiring and processing them) compared with efforts to improve the quality of survey reporting (see “Costs” below).

Improved Data Quality

There is substantial evidence, summarized in Chapter 2, that survey reports of program participation and sources of income are often incomplete and inaccurate, despite considerable efforts to improve the quality of reporting by redesigning questions, adding probes, and the like. SIPP, with its detailed, probing questionnaire, historically has a record of obtaining more complete reporting of program participation than other surveys, but its reporting of program participation still falls short of administrative benchmarks. Moreover, the amounts reported by acknowledged participants often differ from administrative benchmarks in the aggregate and on an individual basis. There are both underreporting and overreporting errors, typically with a net underreporting on balance. Consequently, administrative records have the potential to provide significantly more accurate data on many sources of income and types of programs.

In assessing the benefits of improved data quality from using a particular administrative records source, such as Social Security or food stamp records, it is important not to take at face value that the administrative record is always of better quality than the corresponding survey response.

¹ General assistance, or general relief, is a name for state programs to provide cash benefits to adults without dependent children.
In this regard, it is important to distinguish among the data items recorded on administrative records. On one hand, for example, in the case of a record for a food stamp recipient, it is highly likely that the amount provided to a beneficiary is accurately recorded (even though, in some cases, the payment may have been made to someone who was not in fact eligible for the program or an erroneous amount may have been provided to an eligible recipient). On the other hand, the ancillary information on the record, such as the person’s employment, income, and family composition, may have contained errors when it was collected or may have become out of date. Moreover, for some programs, records for people who no longer receive benefits may be comingled with records for current beneficiaries, and, for most if not all programs, the program unit of one, two, or more people is typically not the same as the survey unit of a family or household.

The information in administrative records, even when accurate, may differ sufficiently in definition from the information sought by the survey designer as to make the administrative information unusable for the survey’s purpose. As noted earlier, self-employment income from federal income tax records is an example: although the gross and net income amounts from tax records may be of interest for some analyses, they do not satisfy SIPP’s purposes of understanding the economic resources available to individuals and families.

Additional Data

Some administrative records may contain valuable information that would be difficult to obtain in a survey context.
For example, the Social Security Administration (SSA) has records not only of benefits paid to retirees, people with disabilities, and others, but also histories of earnings received each year for everyone who is or has been in covered employment, which SSA receives annually from W-2 and Self-Employment Income forms filed with the IRS.² Such earnings histories, which may extend back for decades of an individual’s work life, would be difficult to collect in a survey unless it began following individuals from an early age, but they could be valuable for some types of research, such as research on the determinants of the decision to retire.

² Prior to 1978, SSA files contain quarterly indicators of covered employment in addition to annual earnings.

Costs

The use of administrative records for a reengineered SIPP cannot be cost free. Staff time and other resources must be expended for acquisition and processing of records. Moreover, the use of some kinds of records could potentially incur two other types of costs: (1) increased delays in releasing data products due to delays in obtaining records from the cognizant agencies and (2) increased risks of disclosure of individuals in SIPP, which in turn could necessitate more restricted conditions for use of the data.

Additional Resources

The strictly monetary costs of using administrative records for a reengineered SIPP would include staff and other resources for acquisition of records, data quality review and associated cleaning of records, and processing of records for the particular application, such as evaluation or imputation. In some cases, the costs of acquisition could be substantial, at least initially. For example, time-consuming negotiations could be required to draw up acceptable memoranda of understanding and other legal documents to obtain an agency’s records, although once agreed-upon procedures were in place, the marginal costs of acquiring records in subsequent years could be minimal. There could also be significant costs when an agency’s records are not well maintained, requiring Census Bureau staff to engage in substantial back-and-forth with agency staff to clean up the data. Processing costs would vary with the type of application. For example, aggregate comparisons of survey responses with administrative records are likely to be considerably less costly than the use of administrative records in imputation models.

In its original concept for a new Dynamics of Well-being System (DEWS), the Census Bureau had hoped that administrative records could be used directly to supply so much of the needed content as to make possible a significant reduction in the costs of the system compared with the current SIPP.
The cost savings would come from reduced frequency of interviews and reduced content of each interview, with the remaining needed content obtained by matching administrative records for individuals to the corresponding survey records. However, users were concerned that such a major role for administrative records would not only be unfeasible, given the difficulties of acquiring all of the needed records from state and federal agencies, but also would curtail the bureau’s ability to release public-use microdata files because of increased disclosure risk. These concerns led the bureau to scale back its plans in this regard. The Census Bureau now plans to achieve cost savings by conducting annual interviews with event history calendars to obtain intrayear information and by requiring agencies to pay for supplements with variables not included in the core questionnaire (see Chapter 4). Reducing the frequency of interviews assuredly reduces the costs of a survey, but whether reducing the content of a particular interview by substituting administrative records reduces costs is not clear. The main cost of an interview is making contact with the respondent; moreover, acquiring
and processing administrative records adds costs. Hence, we think that the use of administrative records to replace survey content should be judged primarily on criteria other than cost savings, such as the effects on data quality, timeliness, and accessibility.

Increased Delays

Administrative records systems are managed first and foremost to facilitate the operation of assistance programs. The Census Bureau’s need for timely information from records systems for statistical purposes is of secondary importance, at best, for program agencies. Consequently, while it may be possible for the Census Bureau to obtain and process some records with little delay, the acquisition of other records may lag the survey data collection by significant periods of time (see “Statistical Administrative Records System” below). One response to this situation could be to further delay the data products from SIPP in order to be able to use the administrative information to improve imputations or substitute for questionnaire content. This outcome would be distressing to users. Other responses could be to project the administrative information from a prior year forward to the survey data year, to issue preliminary and revised data products, or to confine the use of administrative information to evaluation of the survey content, which would not be as time sensitive.

Increased Disclosure Risks

On one hand, because the data collected in SIPP are of great interest to policy analysts, researchers, and other users, it is essential to make the data in some form available to these varied constituencies. On the other hand, the Census Bureau is ethically and legally obligated to protect the confidentiality of SIPP participants’ identities and attributes. Thus, unfettered access to all collected SIPP data is not likely to be achievable.
Rather, as recommended in previous National Research Council panels on data access (2005, 2007), an appropriate strategy for the Census Bureau is to provide access to data of differential detail, and hence differential disclosure risk, depending on the goals for data use and the trustworthiness of the likely data users (see Box 3-1 for a summary of the risk and utility trade-off in data dissemination).³

³ We do not discuss confidentiality threats that might originate from inside the Census Bureau. The bureau has sufficient expertise on internal confidentiality protection that it does not need our panel to comment. Evidence of its dedication to confidentiality protection is the practice adopted for its Statistical Administrative Records System of substituting personal identification keys for Social Security numbers on matched files.
BOX 3-1
The Risk and Utility Trade-Off in Data Dissemination

The Census Bureau and other disseminators of data collected under a pledge of confidentiality for statistical purposes strive to release data that are (1) safe from attacks by ill-intentioned data users seeking to learn respondents’ identities or sensitive attributes, (2) informative for a wide range of statistical analyses, and (3) easy for users to analyze with standard statistical methods (Reiter, 2004). These goals are often in conflict. For example, releasing fine details about individuals enables accurate analyses, but it also provides ill-intentioned users with more and higher quality resources for linking records in released data sets to records in other databases. Releasing highly aggregated summaries of data protects confidentiality, but it severely limits the analyses that can be done with the data. Data disseminators usually choose policies that lie in between these two extremes, sacrificing absolute protection (possible only when not releasing any data) and perfect data usefulness (possible only when releasing all data as collected) for a compromise.

Most data disseminators are concerned with two types of disclosures. One type is identity disclosure, which occurs when ill-intentioned users correctly identify individual records using the released data. Efforts to quantify identity disclosure risk in microdata (records for individual respondents) generally fall into two broad categories: (1) estimating the number of records in the released data that are unique records in the population and (2) estimating the probabilities that users of the released data could determine the identities of the records in the released data by using the information in that data. The other type is attribute disclosure, which occurs when ill-intentioned users learn the values of sensitive variables for individual records in the data set. Quantification of attribute disclosure risk is often folded into the quantification of identity disclosure risk, since ill-intentioned users typically need to identify individuals before learning their attributes. Other types of disclosures include perceived identification disclosure, which occurs when intruders incorrectly identify individual records in the database, and inferential disclosure, which occurs when intruders accurately predict sensitive attributes in the data set using the released data. For a discussion of metrics for quantifying identification and attribute disclosure risks, see Duncan and Lambert (1989), Federal Committee on Statistical Methodology (1994), Lambert (1993), National Research Council (2005, 2007), and Reiter (2005).

Agencies must also consider the usefulness of the released data, often called data utility. Existing utility measures are of two types: (1) comparisons of broad differences between the original and released data and (2) comparisons of specific estimates computed with the original and released data. Broad difference measures are based on statistical distances between the original and released data, for example, differences in distributions of variables. Comparison of specific models is often done informally. For example, data disseminators look at the similarity of point estimates and standard errors of regression coefficients after fitting the same regression on the original data and on the data proposed for release.

Ideally, the agency releasing data optimizes the trade-off between disclosure risk and data utility when selecting a dissemination strategy. To do so, the agency can make a scatter plot of the quantified measures of disclosure risk and data utility for candidate releases. This has been termed the “R-U confidentiality map” in the statistical literature (Duncan, Keller-McNulty, and Stokes, 2001). Making this map can enable data disseminators to eliminate policies with risk-utility profiles that are dominated by other policies (e.g., between two policies with the same disclosure risk, select the one with higher data utility).

SIPP’s great value for policy analysis and research on short-term dynamics of economic well-being requires that users have access to microdata and not only aggregate summaries. Administrative records could potentially add valuable information to SIPP microdata, but the more information that is added, the greater the risk that individuals in the SIPP sample could be identified in public-use microdata files. Disclosure risk is also increased because people in the agency supplying the administrative data have knowledge that could be used to identify individuals in SIPP files. Countering such increased risk could require the use of disclosure protection techniques that would diminish the value of the public-use microdata products and compel users who require the confidential data for their research to seek access to one of the Census Bureau’s Research Data Centers (RDCs). Yet for policy analysis that is in any way time sensitive, the alternative of accessing microdata in an RDC is daunting because it adds delays in making a successful application to the delays that are already incurred in release of the files from the Census Bureau.

A related risk of directly using administrative data in SIPP could be a decline in the willingness of people to participate in the survey once they were made aware of the planned uses of their administrative records. However, when 2004 SIPP panel respondents were informed halfway through the panel that administrative records might be used to reduce the need to ask them so many questions, less than one-half of 1 percent requested that record matches not be made for them (David Johnson, chief, Housing and Household Economic Statistics Division, U.S. Census Bureau, personal communication to panel, February 3, 2009; see also “Direct Uses” below).
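The use of the R-U confidentiality map described in Box 3-1 to discard dominated release policies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Release` class, the candidate names, and the risk and utility scores are hypothetical and do not come from the Census Bureau or the cited literature, and how risk and utility are actually quantified is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """A candidate release policy with hypothetical, already-quantified scores."""
    name: str
    risk: float     # disclosure risk measure (lower is better)
    utility: float  # data utility measure (higher is better)


def dominates(a: Release, b: Release) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.risk <= b.risk and a.utility >= b.utility
    strictly_better = a.risk < b.risk or a.utility > b.utility
    return no_worse and strictly_better


def undominated(candidates: list[Release]) -> list[Release]:
    """Keep only the policies that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical points on an R-U map (illustrative numbers only).
candidates = [
    Release("full detail", risk=0.90, utility=0.95),
    Release("top-coded", risk=0.40, utility=0.85),
    Release("heavy swapping", risk=0.40, utility=0.60),
    Release("aggregates only", risk=0.05, utility=0.20),
]

for release in undominated(candidates):
    print(release.name, release.risk, release.utility)
```

In this made-up example, "heavy swapping" is dropped because "top-coded" has the same risk and higher utility; choosing among the remaining policies still requires a judgment about how much risk is acceptable.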
Trading Off Benefits and Costs

Different types of uses of administrative records in a reengineered SIPP will present different pictures of the likely benefits and costs. For a given use, the benefits and costs may also differ by the type of record or even by the agency responsible for the record. For example, program agencies in some states may be more willing than agencies in other states to share records with the Census Bureau for use with SIPP. In determining when a particular use of a specific type of record warrants the investment, it is important always to bear in mind the goals of SIPP and that it cannot be all things to all users. For example, while SSA records of past earnings histories would be useful for research on lifetime patterns of employment and related issues, they might not contribute greatly to SIPP’s primary focus on the short-term dynamics of economic well-being. Moreover, the addition of earnings histories to SIPP would substantially increase the risks of disclosure and consequently the need to restrict the use of data products containing them (see “SIPP Gold Standard Project” below). Some of the trade-offs involved in working with different types of administrative records for different purposes become evident in reviewing the history of uses of administrative records in SIPP and other Census Bureau programs.
SIPP’S HISTORY WITH ADMINISTRATIVE RECORDS

In order to achieve SIPP’s goals of improving information on the economic well-being of the population and short-term changes in income and program participation, the survey’s designers at the outset envisioned at least three major roles for administrative records (see National Research Council, 1993:31-33): (1) to increase sampling efficiency by providing supplementary frames of participants in specific assistance programs or persons with other specified characteristics; (2) to provide additional data (e.g., by matching with Social Security earnings records to obtain longitudinal earnings histories to add to the SIPP files); and (3) to compare and validate specific items common to both SIPP and administrative records by means of record-check studies.

ISDP Use of Records

The Income Survey Development Program (ISDP) used administrative records extensively to evaluate the quality of survey responses and
to improve question wording and interviewer training procedures (see Kasprzyk, 1983; Logan, Kasprzyk, and Cavanaugh, 1988). The primary method used was the forward record check, in which people included in independent samples from administrative sources (including IRS and federal and state program records) were administered the ISDP interviews. This method eliminates the need to match survey and administrative records, but it permits only identifying false-negative responses (people with an administrative record of program participation who say they did not participate in the particular program) and not also false-positive ones, which a full record-check study would support.

Aggregate comparisons of income and program participation reported in the 1979 ISDP panel with administrative records sources were also conducted. These comparisons necessitated, in many cases, extensive adjustments of one or both sources (SIPP or the applicable administrative records source) for comparability of the population and income concept and reporting period covered.

The ISDP also drew supplementary samples from administrative records to augment the 1978 and 1979 ISDP panel main samples. However, the data were never analyzed, because data files that included the main and supplementary samples with appropriate weights could not be produced before the ISDP was shut down in 1981 (see Kasprzyk, 1983).

SIPP’s Use of Records, 1983-1993

During SIPP’s first decade, the Census Bureau was hard-pressed to operate the survey in full production mode and to accommodate budget reductions that necessitated cutbacks in sample size or number of interview waves or both for most panels (see Chapter 2). Bureau staff had limited time and resources to exploit the potential value of administrative records.
Consequently, no supplementary sampling frames were developed from administrative records for SIPP during this period although some work went forward on evaluation and related uses of administrative records. The Census Bureau carried out a handful of matches of SIPP panels with administrative records, which were facilitated by a successful program to obtain Social Security numbers from SIPP respondents and match them to SSA files for validation purposes. These matches included (1) a match of the 1984 SIPP panel with SSA records conducted for SSA under an agreement that limited its use to SSA analysts for a 2-year period; (2) a match of a small number of variables in IRS tax records with the 1984 panel conducted as part of an effort (which did not come to fruition) to develop weighting factors from IRS tax records for reducing the variance of income estimates from SIPP (Huggins and Fay, 1988); and (3) a match of IRS tax records with the 1990 panel conducted as part of an effort to develop a simulation model for estimating after-tax income in SIPP (which also did not come
OCR for page 86
histories, such as longitudinal earnings records, the increase in disclosure risk is likely to be substantial, even when an intruder does not have access to the custodial agency’s records. Alternative approaches are possible, however. One approach is to transform the appended data into categorical instead of continuous variables. In the case of earnings histories, for example, categorical variables could represent different patterns of earnings histories (number of lifetime jobs, number of periods out of the labor force, etc.) rather than the detailed histories. Another approach (which could be used in combination with categorization of selected variables) is to use partial synthesis of a much smaller set of selected values. Such partial synthesization could provide reliable information with satisfactory confidentiality protection, as we discuss below. In any event, the need for appending additional variables to SIPP should be carefully vetted with data users because of the implications for confidentiality protection and data access.

CONFIDENTIALITY PROTECTION AND DATA ACCESS

As summarized in Box 3-1, the Census Bureau, like other data disseminators that collect individual information under a pledge of confidentiality, strives to release data files that are not only safe from illicit efforts to obtain respondents’ identities or sensitive attributes, but also useful for analysis. In general, strategies for optimizing the risk-utility trade-off fit into two broad categories. Restricted access strategies allow only select analysts to use the data, for example, via licensing or by requiring analysts to work in secure data enclaves. Restricted data strategies allow analysts to use altered versions of the data, for example, by deleting variables from the file, aggregating categories, or perturbing data values (see National Research Council, 2005).
The Census Bureau has extensive experience in applying both of these methods. For example, standard public-use files of SIPP data (not linked with administrative records) can currently be downloaded from the SIPP website, and a version of SIPP data for specific panels linked with earnings histories and Social Security benefits can be used in the RDCs (the gold standard project). Both restricted data and restricted access strategies are likely to be useful for a reengineered SIPP, as described below.

Restricted Data for SIPP

The Census Bureau releases public-use microdata samples for many of its products, including SIPP, usually with some values altered to protect confidentiality. Typical alterations include

recoding variables, such as releasing ages or geographical variables in aggregated categories;
reporting exact values only above or below certain thresholds, for example, reporting all incomes above $100,000 as “$100,000 or more”;
swapping data values for selected records, for example, switching the quasi-identifiers for at-risk records with those for other records to discourage users from matching, since matches may be based on incorrect data; and
adding noise to numerical data values to reduce the possibilities of exact matching on key variables or to distort the values of sensitive variables.

These methods can be applied with varying intensities. Generally, increasing the amount of alteration decreases the risk of disclosure, but it also decreases the accuracy of inferences obtained from the released data, since these methods distort relationships among the variables. For example, aggregation makes analyses at finer levels impossible and can create ecological inference problems, and intensive data swapping severely attenuates correlations between the swapped and unswapped variables. It is difficult, and for some analyses impossible, for data users to determine how much their particular estimation has been compromised by the data alteration, in part because disseminators rarely release detailed information about the disclosure limitation strategy. Even when such information is available, adjusting for the data alteration to obtain valid inferences may be beyond some users’ statistical knowledge. For example, to properly analyze data that include additive random noise, users should apply measurement error models (Fuller, 1993) or the likelihood-based approach of Little (1993), which are difficult to use for nonstandard estimands.13 Nonetheless, when the amount of alteration is very small, the negative impacts of traditional disclosure limitation methods on data utility could be minor compared with the overall error in the data caused by nonresponse and measurement errors.
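The trade-off between the intensity of alteration and the accuracy of inferences can be illustrated with a small simulation. The following Python sketch is not Census Bureau code: the data are simulated rather than drawn from SIPP, the function names are hypothetical, and the swapping routine is a deliberately simplified stand-in that permutes income values among a randomly chosen fraction of records. Under those assumptions it shows top-coding at $100,000 and how the correlation between education and income weakens as more values are swapped or as noise is added.

```python
import random


def top_code(values, threshold):
    """Report exact values only up to the threshold; larger values become the threshold."""
    return [min(v, threshold) for v in values]


def add_noise(values, sd, rng):
    """Add independent zero-mean Gaussian noise with standard deviation sd."""
    return [v + rng.gauss(0.0, sd) for v in values]


def swap_fraction(values, fraction, rng):
    """Permute the values of a randomly chosen fraction of records among themselves."""
    chosen = rng.sample(range(len(values)), round(fraction * len(values)))
    donors = chosen[:]
    rng.shuffle(donors)
    out = list(values)
    for receiver, donor in zip(chosen, donors):
        out[receiver] = values[donor]
    return out


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Simulated (not SIPP) data: income rises with years of education.
rng = random.Random(0)
education = [rng.randint(8, 20) for _ in range(5000)]
income = [4000.0 * e + rng.gauss(0.0, 15000.0) for e in education]

top_coded = top_code(income, 100_000)
r_original = pearson(education, income)
r_noisy = pearson(education, add_noise(income, 30_000.0, rng))
r_swapped = {f: pearson(education, swap_fraction(income, f, rng)) for f in (0.1, 0.5, 0.9)}

print(f"original r = {r_original:.2f}, with added noise r = {r_noisy:.2f}")
for f, r in r_swapped.items():
    print(f"swapping {f:.0%} of incomes: r = {r:.2f}")
```

In this sketch the correlation falls steadily as the swapped fraction rises, which is the attenuation described above; an analyst who sees only the altered file has no direct way to tell how much of the original relationship has been lost.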
The current SIPP public-use files (without linked administrative records values) are protected mainly by top-coding monetary variables and age and by suppressing geographic detail in areas with fewer than 250,000 people. In addition, some individuals in metropolitan areas are recoded to be in nonmetropolitan areas with too few people in the sample. This can invalidate estimates of characteristics in nonmetropolitan areas.

13 Estimands are types of estimates, such as means, ranges, percentiles, and regression coefficients.

Protecting Files with Linked SIPP and Administrative Records Data

If values available in administrative data are included in SIPP public-use files, top-coding and geographic aggregation may not offer sufficient protection. The Census Bureau probably would need to alter the administrative variables to prevent exact linking, especially if multiple variables for the same person are culled from an administrative database to create a SIPP record. Additional aggregation, such as rounding monetary values, may offer sufficient protection without impairing data utility. Alteration with high intensity, however, such as intense swapping or noise addition, will attenuate relationships and distort distributions so that the released data are no longer useful.

If heavy substitution of administrative values is planned, one option is to create multiply imputed, partially synthetic data. These data comprise the units originally surveyed with only some collected values replaced with multiple imputations. For example, the Census Bureau could simulate sensitive variables or quasi-identifiers for individuals in the sample with rare combinations of quasi-identifiers, and it might synthesize those values that are available and potentially linkable in external databases.

Partial Synthesis

To illustrate how partially synthetic data might work in practice, we modify the setting described by Reiter (2004). Suppose a statistical agency has collected data on a random sample of 10,000 people. The data comprise each person’s race, gender, income, and years of education. Suppose the agency wants to replace race and gender for all people in the sample (or possibly just for a subset, such as all people whose income is below $5,000) to disguise their identities.
The agency could generate values of race and gender for these people by randomly simulating values from the joint distribution of race and gender, conditional on their education and income values. These distributions would be estimated using the collected data and possibly other relevant information. The result would be a partially synthetic data set. The agency would repeat this process, say, 10 times, and these 10 data sets would be released to the public. The analyst would estimate parameters and their variances in each of the synthetic data sets and combine the results using the methods of Reiter (2003). Several statisticians in the Statistical Research Division of the Census Bureau and in academia are working to develop partially synthetic, public-use data for Census Bureau products. These products include the Longitudinal Business Database, the Longitudinal Employer-Household Dynamics data sets, the ACS group quarters, veterans, and full sample data, and the SIPP linked with Social Security benefit information.
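The release-and-combine workflow in the illustration above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Census Bureau's procedure: the survey data are simulated, only gender (not race) is synthesized, the conditional distribution is estimated by simple cell frequencies within hypothetical education and income bands rather than by a fitted model with parameter draws, and the replacement threshold is set at $30,000 (instead of the $5,000 used in the illustration) so that enough simulated records are replaced. The combining step follows the rules for partially synthetic data in Reiter (2003): the point estimate is the average of the m estimates, and its variance is the average within-data-set variance plus the between-data-set variance divided by m.

```python
import random
from collections import Counter, defaultdict


def simulate_survey(n, rng):
    """Hypothetical collected data: one (gender, education_years, income) tuple per person."""
    people = []
    for _ in range(n):
        gender = rng.choice(("F", "M"))
        education = rng.randint(8, 20)
        shift = 5000.0 if gender == "M" else 0.0
        income = max(0.0, 3000.0 * education + shift + rng.gauss(0.0, 12000.0))
        people.append((gender, education, income))
    return people


def cell(education, income):
    """Assumed conditioning cell: 4-year education band by $10,000 income band."""
    return (education // 4, int(income // 10_000))


def synthesize(people, replace_if, rng):
    """One partially synthetic copy: gender is redrawn only where replace_if(income) is true."""
    counts = defaultdict(Counter)
    for gender, education, income in people:
        counts[cell(education, income)][gender] += 1
    synthetic = []
    for gender, education, income in people:
        if replace_if(income):
            c = counts[cell(education, income)]
            gender = rng.choices(list(c), weights=list(c.values()))[0]
        synthetic.append((gender, education, income))
    return synthetic


def mean_and_variance(values):
    """Sample mean and the estimated variance of that mean."""
    n = len(values)
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, s2 / n


def combine(estimates, variances):
    """Point estimate and variance from m partially synthetic data sets (Reiter, 2003)."""
    m = len(estimates)
    q_bar = sum(estimates) / m                              # average point estimate
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-data-set variance
    u_bar = sum(variances) / m                              # average within-data-set variance
    return q_bar, u_bar + b / m


rng = random.Random(1)
collected = simulate_survey(10_000, rng)
m = 10  # number of synthetic data sets released

results = []
for _ in range(m):
    synthetic = synthesize(collected, lambda income: income < 30_000, rng)
    results.append(mean_and_variance([inc for g, _, inc in synthetic if g == "F"]))

estimates = [q for q, _ in results]
variances = [u for _, u in results]
q_bar, total_variance = combine(estimates, variances)
original_mean, _ = mean_and_variance([inc for g, _, inc in collected if g == "F"])

print(f"mean income of women, collected data: {original_mean:,.0f}")
print(f"mean income of women, combined synthetic: {q_bar:,.0f} (variance {total_variance:,.0f})")
```

The analyst's side of this sketch is only the last few lines: run the same standard estimator on each released data set and apply `combine`. An agency would normally use richer imputation models than the cell frequencies assumed here.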
Partially synthetic data sets can have positive features for data utility. When the synthetic data are simulated from distributions that reflect the distributions of the collected data, valid inferences for frequencies can be obtained for wide classes of estimands (e.g., means, ranges, percentile distributions). This is true even for high fractions of replacement, whereas swapping high percentages of values or adding noise with large variance produces worthless data. The inferences are determined by combining standard likelihood-based or survey-weighted estimates; the analyst need not learn new statistical methods or software to adjust for the effects of the disclosure limitation. The released data can include simulated values in the tails of distributions so that no top-coding is needed. Finally, because many quasi-identifiers can be simulated, finer details of geography can be released, facilitating small-area estimation.

There is a cost to these benefits: the validity of synthetic data inferences depends on the validity of the models used to generate the synthetic data. The extent of this dependence is driven by the nature of the synthesis and the question asked. For example, when all of race and gender are synthesized, analyses involving those variables would reflect only the relationships included in the data generation models. When the models fail to reflect certain relationships accurately, analysts’ inferences also would not reflect those relationships. Similarly, incorrect distributional assumptions built into the models would be passed on to the users’ analyses. However, when replacing only a select fraction of race and gender and leaving many original values on the file, inferences may be relatively insensitive to the assumptions of the synthetic data models.
In practice, this model dependence means that agencies should release metadata that help analysts decide whether or not the synthetic data are reliable for their analyses. For example, agencies might include the code used to generate the synthetic values as attachments to public releases of data. Or they might include generic statements that describe the imputation models, such as “main effects and interactions for income, education, and gender are included in the imputation models for race.” Analysts who desire finer detail than afforded by the imputations may have to apply for restricted access to the collected data.

Even with such metadata, secondary data analysts would be greatly helped if the Census Bureau provided some way for them to learn in real time about the quality of inferences based on the synthetic data (or any masked version of SIPP). Ideally, the quality measures provided would be specific to particular inferential quantities rather than broad measures. For example, reporting comparisons of means, variances, and correlations in the observed and synthetic data does little to help analysts estimating complex models. One approach is for the Census Bureau to develop a verification server (Reiter, Oganian, and Karr, 2009). This server, located at the Census Bureau, would store the original and synthetic (or otherwise masked) data sets. Analysts, who have only the synthetic data, would submit queries to the server for measures of data quality for certain estimands. The server would run the analysis on both the original and synthetic data and report back to the analyst a measure of data quality that compares the inferences obtained from both sources. The server could also serve as a feedback mechanism for the agency, capturing what quantities analysts care most about. Agencies might be able to use this information to improve the quality of future data releases. There may be additional disclosure risks of releasing the utility measures; research would be needed to gauge these risks and, more broadly, to develop and fully test the functionality and usability of a verification server.

Synthesizing SIPP Data

The synthesis of the SIPP gold standard file, which contains linked SIPP, SSA, and IRS data, is very intense: only a handful of some 600 variables remain unsynthesized. Practically all variables are synthesized to ensure a small chance of linking the synthesized records to the existing SIPP public-use records. With the reengineered SIPP, such heavy synthesis may not be necessary. If the released data do not include such detailed administrative information as longitudinal earnings histories, the Census Bureau can synthesize only the values of quasi-identifiers for at-risk records and the linkable values available in administrative sources. It may not even be necessary to synthesize entire variables to achieve adequate protection. For example, synthetic values could replace top-coded monetary and age values and aggregated geographies. The benefits of synthesis over top-coding are illustrated by An and Little (2007); more research is needed on methods for simulating geographies.
Providing information in the tails and finer geographies would improve on the current SIPP public-use product without necessarily increasing disclosure risks. Methods of gauging the risks inherent in partially synthetic data with only some values synthesized are described in Reiter and Mitra (2009).

If the released data do contain detailed administrative data, similar to the gold standard file, the Census Bureau has several options. It can proceed as with the current SIPP, releasing a file without linked data and a highly synthesized version of the linked data. Or it can try to reach new memoranda of understanding with SSA and IRS that make it possible to do less synthesizing. For example, it may be possible to synthesize earnings and benefits histories, leaving the other variables on SIPP as is. Regardless of the path chosen, the Census Bureau should recognize that most SIPP users are not likely to support the release of a file with linked administrative records if the time required to create the file and evaluate its risks and utility delays its release in comparison to a standard SIPP public-use file.

Restricted Access for SIPP

In addition to public-use microdata files, the Census Bureau makes more detailed data from SIPP and other surveys available via a restricted access mode, which permits use of the data in any of the nine RDCs operated by the bureau (see http://www.ces.census.gov/index.php/ces/cmshome). The files available in the RDCs are stripped of obvious identifiers, such as name and address, but do not contain recodes or other modifications that blur the underlying data in the public-use versions.14

The RDC restricted access mode, however, has limitations. Analysts who do not live near a secure data enclave, or who do not have the resources to relocate temporarily to be near one, are shut out from RDCs. Gaining restricted access generally requires months of proposal preparation and background checks; analysts cannot simply walk into any secure data enclave and immediately start working with the data. As recommended by a previous National Research Council report (2005), the Census Bureau should continue to pursue ways to speed up the project approval process in the RDCs.

Another restricted access approach is to establish a remote access system for SIPP data. When queried by analysts, these systems provide output from statistical models without revealing the data that generated the output. Such servers are in the testing stage at the Census Bureau. If they are found useful, they would provide an excellent resource for certain analyses on the genuine data without having to go to an RDC. However, remote access systems are not immune from disclosure risks. Clever queries can reveal individual data values.
For example, asking for a regression model that includes an indicator variable that equals 1 for a unique value of some predictor and 0 for all other values enables the analyst to predict the outcome variable perfectly for the record with that unique value (Gomatam et al., 2005). These types of intrusions could be especially problematic if a public-use data set is provided and the remote access system is open to all users. For example, an ill-intentioned user could look at a continuous, unaltered variable to determine unique values, then submit regression queries with indicator variables to learn about those records’ other variables. The Census Bureau can limit the risks of such problems by restricting access to the server. For example, users of the server could be required to go through a licensing procedure. In addition, the server could keep track of and audit requests, so that any ill-intentioned intruder who sneaks through the licensing might be identified and punished.

14 To date, SIPP files that have been linked to administrative records are not available in the RDCs outside the Census Bureau.

CONCLUSIONS AND RECOMMENDATIONS

The Role of Administrative Records in a Reengineered SIPP

Conclusion 3-1: In reengineering the Survey of Income and Program Participation (SIPP) to provide policy-relevant information on the short-run dynamics of economic well-being for families and households, the Census Bureau must continue to use survey interviews as the primary data collection vehicle. Administrative records from federal and state agencies cannot replace SIPP, primarily because they do not provide information on people who are eligible for, but do not participate in, government assistance programs and, more generally, because they do not provide all of the detail that is needed for SIPP to serve its primary goal. Many records are also difficult to acquire and use because of legal restrictions on data sharing, and some of the information they contain may be erroneous. Nonetheless, information from administrative records that is relevant to SIPP and likely to improve the quality of SIPP reports of program participation and income receipt in particular can and should be used in a reengineered SIPP.

Conclusion 3-2: The Census Bureau has made excellent progress with the Statistical Administrative Records System and related systems, such as the person validation system, in building the infrastructure to support widespread use of administrative records in its household survey programs. The bureau’s administrative records program, both now and in the future as it adds new sets of records and analysis capabilities, will be an important resource for applications of administrative records in a reengineered Survey of Income and Program Participation.
Acquisition of Records

Conclusion 3-3: Many relevant federal administrative records are readily available to the Census Bureau for use in a reengineered Survey of Income and Program Participation (SIPP). However, most state administrative data are not available for use in a reengineered SIPP at this time and could be difficult to obtain.

Recommendation 3-1: The Census Bureau should seek to acquire additional federal records that are relevant to the Survey of Income and Program Participation, which could include records from the U.S. Department of Veterans Affairs and the Office of Child Support Enforcement.

Recommendation 3-2: The Census Bureau, in close consultation with users, should develop a strategy for acquiring selected state administrative records, recognizing that it will be costly and probably unfeasible to acquire all relevant records from all or even most states. The bureau’s acquisition strategy should be guided by such criteria as the importance of the income source for lower income households, particularly in times of economic distress, and the relative ease of acquiring the records. Unemployment insurance benefit records should be a high priority for the Census Bureau to acquire on both of these counts, and the bureau should investigate whether it is possible to acquire these records from the National Directory of New Hires, which would eliminate the need to negotiate with individual states.

Indirect Uses of Records

Conclusion 3-4: Indirect uses of administrative records are those uses, such as evaluation of data quality and improvement of imputation models for missing data, in which the administrative data are never recorded on survey records. They are advantageous for a reengineered Survey of Income and Program Participation (SIPP) in that they should have little or no adverse effects on timeliness or the needed level of confidentiality protection of SIPP data products.

Recommendation 3-3: The Census Bureau, in close cooperation with knowledgeable staff from program agencies, should conduct regular, frequent assessments of Survey of Income and Program Participation (SIPP) data quality by comparison with aggregate counts of recipients and income and benefit amounts from appropriate administrative records. When feasible, the bureau should also evaluate reporting errors for income sources, both underreporting and overreporting, by exact-match studies that link SIPP records with the corresponding administrative records. The Census Bureau should use the results of aggregate and individual-level comparisons to identify priority areas for improving SIPP data quality.

Recommendation 3-4: The Census Bureau should move to replace hot-deck imputation routines for missing data in the Survey of Income and Program Participation with modern model-based imputations, implemented multiple times to permit estimating the variability due to imputation. Imputation models for program participation and benefits should make use of program eligibility criteria and characteristics of beneficiaries from administrative records so that the imputed values reflect as closely as possible what is known about the beneficiary population. Before implementation, new imputation models should be evaluated to establish their superiority to the imputation routines they are to replace.
Recommendation 3-5: The Census Bureau should request the Statistical and Science Policy Office in the U.S. Office of Management and Budget to establish an interagency working group on uses of administrative records in the Survey of Income and Program Participation (SIPP).15 The group would include technical staff from relevant agencies who have deep knowledge of assistance programs and income sources along with Census Bureau SIPP staff. The group would facilitate regular comparisons of SIPP data with administrative records counts of income recipients and amounts (see Recommendation 3-3) and advise the Census Bureau on priorities for acquiring additional federal and selected state administrative records, how best to tailor imputation models for different sources of income and program benefits, and other matters related to the most effective ways to use administrative records in SIPP. The Census Bureau should regularly report on its progress in implementing priority actions identified by the group.

Direct Uses of Records

Conclusion 3-5: Direct uses of administrative records in a reengineered Survey of Income and Program Participation (SIPP), which include substituting administrative values for missing survey responses, adjusting survey responses for net underreporting, using administrative values instead of asking survey questions, and appending additional administrative data, potentially offer significant improvements in the quality of SIPP data on income and program participation. They also raise significant concerns about increased risks of disclosure and delays in the release of SIPP data products.

Recommendation 3-6: In the near term, the Census Bureau should give priority to indirect uses of administrative records in a reengineered Survey of Income and Program Participation (SIPP). At the same time and working closely with data users and agencies with custody of relevant administrative records, the bureau should identify feasible direct uses of administrative records in SIPP to be implemented in the medium and longer terms. Social Security and Supplemental Security Income benefit records, which are available to the Census Bureau on a timely basis, are prime candidates for research and development on ways to use the administrative values directly, either to adjust survey responses for categories of beneficiaries or to replace survey questions (which would reduce respondent burden), in ways that protect confidentiality.

15 See Recommendation 4-5 regarding an advisory group of outside researchers and policy analysts.
Recommendation 3-7: When considering the addition to the Survey of Income and Program Participation (SIPP) of administrative records values for variables that have never been ascertained in the survey itself, the Census Bureau should ensure that the benefits from the added variables are worth the costs, such as additional steps to protect confidentiality. The bureau should consult closely with users to be sure that the added variables are central to SIPP’s purpose to provide information on the short-run dynamics of economic well-being and that their inclusion does not compromise the ability to release public-use microdata files that accurately represent the survey data.

Confidentiality Protection and Data Access

Conclusion 3-6: Multiple strategies for confidentiality protection and data access are necessary for a survey as rich in data as the Survey of Income and Program Participation. Public-use microdata files, which are available on a timely basis and in which confidentiality protection techniques do not unduly distort the relationships in the data, are the preferred mode of data release. Some uses may require access to confidential data that at present can be provided only at one of the Census Bureau’s Research Data Centers.

Recommendation 3-8: The Census Bureau should develop confidentiality protection techniques and restricted access modes for the Survey of Income and Program Participation (SIPP) that are as user-friendly as possible, consistent with the bureau’s duty to minimize disclosure risk. In this regard, the bureau should develop partial synthesis techniques for SIPP public-use microdata files that, based on evaluation results, are found to preserve the research utility of the information. For SIPP data that cannot be publicly released, the Census Bureau should give high priority to developing a secure remote access system that does not require visiting a Research Data Center to use the information. The bureau should also deposit SIPP files of linked survey and administrative records data (with identifiers removed) at all Research Data Centers in order to expand the opportunities for research that contributes to scientific knowledge and informed public policy.