SUMMARY OF A WORKSHOP ON INFORMATION TECHNOLOGY RESEARCH for Federal Statistics
1 Introduction and Context
OVERVIEW OF FEDERAL STATISTICS
Federal statistics play a key role in a wide range of policy, business, and individual decisions that are made based on statistics produced about population characteristics, the economy, health, education, crime, and other factors. The decennial census population counts—along with related estimates that are produced during the intervening years—will drive the allocation of roughly $180 billion in federal funding annually to state and local governments.1 These counts also drive the apportionment of legislative districts at the local, state, and federal levels. Another statistic, the Consumer Price Index, is used to adjust wages, retirement benefits, and other spending, both public and private. Federal statistical data also provide insight into the status, well-being, and activities of the U.S. population, including its health, the incidence of crime, unemployment and other dimensions of the labor force, and the nature of long-distance travel. The surveys conducted to derive this information (see the next section for examples) are extensive undertakings that involve the collection of detailed information, often from large numbers of respondents.
The federal statistical system involves about 70 government agencies. Most executive branch departments are, in one way or another, involved in gathering and disseminating statistical information. The two largest statistical agencies are the Bureau of the Census (in the Department of Commerce) and the Bureau of Labor Statistics (in the Department of Labor). About a dozen agencies have statistics as their principal line of work, while others collect statistics in conjunction with other activities, such as administering a program benefit (e.g., the Health Care Financing Administration or the Social Security Administration) or promulgating regulations in a particular area (e.g., the Environmental Protection Agency). The budgets for all of these activities—excluding the estimated $6.8 billion cost of the decennial census2—total more than $3 billion per year.3 These federal statistical agencies are characterized not only by their mission of collecting statistical information but also by their independence and commitment to a set of principles and practices aimed at ensuring the quality and credibility of the statistical information they provide (Box 1.1). Thus, the agencies aim to live up to citizens' expectations for trustworthiness, so that citizens will continue to participate in statistical surveys, and to the expectations of decision makers, who rely on the integrity of the statistical products they use in policy formulation.
1 U.S. Census Bureau estimate from U.S. Census Bureau, Department of Commerce. 1999. United States Census 2000: Frequently Asked Questions. U.S. Census Bureau, Washington, D.C. Available online at <http://www.census.gov/dmd/www/faqquest.htm>.
ACTIVITIES OF THE FEDERAL STATISTICS AGENCIES
Many activities take place in connection with the development of federal statistics—the planning and design of surveys (see Box 1.2 for examples of such surveys); data collection, processing, and analysis; and the dissemination of results in a variety of forms to a range of users. What follows is not intended as a comprehensive discussion of the tasks involved in creating statistical products; rather, it is provided as an outline of the types of tasks that must be performed in the course of a federal statistical survey.
Because the report as a whole focuses on information technology (IT) research opportunities, this section emphasizes the IT-related aspects of these activities and provides pointers to pertinent discussions of research opportunities in Chapter 2. 2 Estimate by Census Bureau director of total costs in D'Vera Cohn. 2000. “Early Signs of Census Avoidance,” Washington Post, April 2, p. A8. 3 For more details on federal statistical programs, see Executive Office of the President, Office of Management and Budget (OMB). 1998. Statistical Programs of the United States Government. OMB, Washington, D.C.
BOX 1.1 Principles and Practices for a Federal Statistical Agency
In response to requests for advice on what constitutes an effective federal statistical agency, the National Research Council's Committee on National Statistics issued a white paper that identified the following as principles and best practices for federal statistical agencies:
Principles
  Relevance to policy issues
  Credibility among data users
  Trust among data providers and data subjects
Practices
  A clearly defined and well-accepted mission
  A strong measure of independence
  Fair treatment of data providers
  Cooperation with data users
  Openness about the data provided
  Commitment to quality and professional standards
  Wide dissemination of data
  An active research program
  Professional advancement of staff
  Caution in conducting nonstatistical activities
  Coordination with other statistical agencies
SOURCE: Adapted from Margaret E. Martin and Miron L. Straf, eds. 1992. Principles and Practices for a Federal Statistical Agency. Committee on National Statistics, National Research Council, National Academy Press, Washington, D.C.
Data Collection
Data collection starts with the process of selection.4 Ensuring that survey samples are representative of the populations they measure is a significant undertaking. This task entails first defining the population of interest (e.g., the U.S. civilian noninstitutionalized population, in the case of the National Health and Nutrition Examination Survey). Second, a listing, or sample frame, is constructed. Third, a sample of appropriate size is selected from the sampling frame. There are many challenges associated with the construction of a truly representative sample: a sample frame of all households may require the identification of all housing units that have been constructed since the last decennial census was conducted. Also, when a survey is to be representative of a subpopulation (e.g., when the sample must include a certain number of children between the ages of 12 and 17), field workers may need to interview households or individuals to select appropriate participants.
4 This discussion focuses on the process of conducting surveys of individuals. Many surveys gather information from businesses or other organizations. In some instances, similar interview methods are used; in others, especially with larger organizations, the data are collected through automated processes that employ standardized reporting formats.
BOX 1.2 Examples of Federal Statistical Surveys
To give workshop participants a sense of the range of activities and purposes of federal statistical surveys, representatives of several large surveys sponsored by federal statistical agencies were invited to present case studies at the workshop. Reference is made to several of these examples in the body of this report.
National Health and Nutrition Examination Survey
The National Health and Nutrition Examination Survey (NHANES) is one of several major data collection studies sponsored by the National Center for Health Statistics (NCHS). Under the legislative authority of the Public Health Service, NCHS collects statistics on the nature of illness and disability in the population; on environmental, nutritional, and other health hazards; and on health resources and utilization of health care. NHANES has been conducted since the early 1960s; its ninth survey is NHANES 1999.1 It is now implemented as a continuous, annual survey in which a sample of approximately 5,000 individuals representative of the U.S. population is examined each year. Participants in the survey undergo a detailed home interview, as well as a physical examination and health and dietary interviews in mobile examination centers set up for the survey.
Home examinations, which include a subset of the exam components conducted at the exam center, are offered to persons unable or unwilling to come to the center for the full examination. The main objectives of NHANES are to estimate the prevalence of diseases and risk factors and to monitor trends in them; to explore emerging public health issues, such as cardiovascular disease; to correlate findings of health measures in the survey, such as body measurements and blood characteristics; and to establish a national probability sample of DNA materials using NHANES-collected blood samples. There are a variety of consumers for the NHANES data, including government agencies, state and local communities, private researchers, and companies, including health care providers. Findings from NHANES are used as the basis for such things as the familiar growth charts for children and material on obesity in the United States. For example, the body mass index used in understanding obesity is derived from NHANES data and was developed by the National Institutes of Health in collaboration with NCHS. Other findings, such as the effects of lead in gasoline and in paint and the effects of removing it, are also based on NHANES data.2
1 Earlier incarnations of the NHANES survey were called, first, the Health Examination Survey and then, the Health and Nutrition Examination Survey (HANES). Unlike previous surveys, NHANES 1999 is intended to be a continuous survey with ongoing data collection.
2 This description is adapted in part from documents on the National Health and Nutrition Examination Survey Web site. (Department of Health and Human Services, Centers for Disease Control, National Center for Health Statistics (NCHS). 1999. National Health and Nutrition Examination Survey. Available online at <http://www.cdc.gov/nchswww/about/major/nhanes/nhanes.htm>.)
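The sample-selection steps described under Data Collection (define the population of interest, construct a frame, select a sample, and screen for a required subpopulation) can be illustrated with a minimal Python sketch. It assumes simple random sampling, which is a simplification of the multistage designs that large surveys typically use. The record fields (`institutional`, `has_teen`), the toy data, and the function names are hypothetical and do not come from NHANES or any agency system.

```python
import random

# Hypothetical housing-unit records; a real frame is far larger and richer.
HOUSING_UNITS = [
    {"id": i, "institutional": i % 10 == 0, "has_teen": i % 4 == 0}
    for i in range(1, 201)
]


def build_frame(units):
    """Steps 1-2: list every unit that belongs to the population of interest
    (here, assumed to be all noninstitutional units)."""
    return [u for u in units if not u["institutional"]]


def draw_sample(frame, size, seed=0):
    """Step 3: select a simple random sample of the requested size."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(frame, size)


def screen_for_subpopulation(sample, quota):
    """Screening: keep up to `quota` sampled units that contain a member of
    the required subpopulation (e.g., a child aged 12 to 17)."""
    eligible = [u for u in sample if u["has_teen"]]
    return eligible[:quota]


frame = build_frame(HOUSING_UNITS)
sample = draw_sample(frame, size=40)
teen_households = screen_for_subpopulation(sample, quota=5)
```

In practice the screening step is performed by field workers through interviews rather than by reading a flag on a record; the flag stands in for that interview here.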
American Travel Survey
The American Travel Survey (ATS), sponsored by the Department of Transportation, tracks passenger travel throughout the United States. The first primary objective is to obtain information about long-distance travel 3 by persons living in the United States. The second primary objective is to inform policy makers about the principal characteristics of travel and travelers, such as the frequency and economic implications of long-distance travel, which are useful for a variety of planning purposes. ATS is designed to provide reliable estimates at national and state levels for all persons and households in the United States—frequency, primary destinations, mode of travel (car, plane, bus, train, etc.), and purpose. Among the other data collected by the ATS is the flow of travel between states and between metropolitan areas. The survey samples approximately 80,000 households in the United States and conducts interviews with about 65,000 of them, making it the second largest (after the decennial census) household survey conducted by federal statistical agencies. Each household is interviewed four times in a calendar year to yield a record of the entire year's worth of long-distance travel; in each interview, a household is asked to recall travel that occurred in the preceding 3 months. Information is collected by computer-assisted telephone interviewing (CATI) systems as well as via computer-assisted personal interviewing (CAPI).
Current Population Survey
The primary goal of the Current Population Survey (CPS), sponsored by the Bureau of Labor Statistics (BLS), is to measure the labor force. Collecting demographic and labor force information on the U.S. population age 16 and older, the CPS is the source of the unemployment numbers reported by BLS on the first Friday of every month.
Initiated more than 50 years ago, it is the longest-running continuous monthly survey in the United States using a statistical sample. Conducted by the Census Bureau for BLS, the CPS is the largest of the Census Bureau's ongoing monthly surveys. It surveys about 50,000 households; the sample is divided into eight representative subsamples. Each subsample group is interviewed for a total of 8 months—in the sample for 4 consecutive months, out of the sample during the following 8 months, and then back in the sample for another 4 consecutive months. To provide better estimates of change and reduce discontinuities without overly burdening households with a long period of participation, the survey is conducted on a rotating basis so that 75 percent of the sample is common from month to month and 50 percent from year to year for the same month.4
3 Long-distance is defined in the ATS as a trip of 100 miles or more. The Nationwide Personal Transportation Survey (NPTS) collects data on daily, local passenger travel, covering all types and modes of trips. For further information, see the Bureau of Transportation's Web page on the NPTS, available online at <http://www.nptsats2000.bts.gov/>.
4 For more details on the sampling procedure, see, for example, U.S. Census Bureau. 1997. CPS Basic Monthly Survey: Sampling. U.S. Census Bureau, Washington, D.C. Available online at <http://www.bls.census.gov/cps/bsampdes.htm>.
Since the survey is designed to be representative of the U.S. population, a considerable quantity of useful information about the demographics of the U.S. population other than labor force data can be obtained from it, including occupations and the industries in which workers are employed. An important attribute of the CPS is that, owing to the short time required to gather the basic labor force information, the survey can easily be supplemented with additional questions. For example, every March, a supplement collects detailed income and work experience data, and every other February information is collected on displaced workers. Other supplements are conducted for a variety of agencies, including the Department of Veterans Affairs and the Department of Education.
National Crime Victimization Survey
The National Crime Victimization Survey (NCVS), sponsored by the Bureau of Justice Statistics, is a household-based survey that collects data on the amount and types of crime in the United States. Each year, the survey obtains data from a nationally representative sample of approximately 43,000 households (roughly 80,000 persons). It measures the incidence of violence against individuals, including rape, robbery, aggravated assault and simple assault, and theft directed at individuals and households, including burglary, motor vehicle theft, and household larceny. Other types of crimes, such as murder, kidnapping, drug abuse, prostitution, fraud, commercial burglary, and arson, are outside the scope of the survey. The NCVS, initiated in 1972, is one of two Department of Justice measures of crime in the United States, and it is intended to complement what is known about crime from the Federal Bureau of Investigation's annual compilation of information reported to law enforcement agencies (the Uniform Crime Reports). The NCVS serves two broad goals.
First, it provides a time series tracing changes in both the incidence of crime and the various factors associated with criminal victimization. Second, it provides data that can be used to study particular research questions related to criminal victimization, including the relationship of victims to offenders and the costs of crime. Based on the survey, the Bureau of Justice Statistics publishes annual estimates of the national crime rate.5
5 Description adapted in part from U.S. Department of Justice, Bureau of Justice Statistics (BJS). 1999. Crime and Victims Statistics. BJS, Washington, D.C. Available online at <http://www.ojp.usdoj.gov/bjs/cvict.htm#ncvs>.
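The CPS rotation pattern described in Box 1.2 (4 months in the sample, 8 months out, then 4 months back in) is what produces the stated overlaps of 75 percent from month to month and 50 percent from year to year. The Python sketch below checks that arithmetic. It assumes that one new rotation group enters the sample each month; the names are illustrative and are not taken from any Census Bureau or BLS system.

```python
# Months, counted from a rotation group's first interview, in which that
# group is in the sample: 4 months in, 8 months out, 4 months in.
IN_SAMPLE_OFFSETS = (0, 1, 2, 3, 12, 13, 14, 15)


def groups_in_sample(month):
    """Return the entry months of the rotation groups interviewed in `month`,
    assuming one new group enters every month."""
    return {month - offset for offset in IN_SAMPLE_OFFSETS}


def overlap_fraction(month, lag):
    """Fraction of the groups in `month` that are also in `month + lag`."""
    now = groups_in_sample(month)
    later = groups_in_sample(month + lag)
    return len(now & later) / len(now)


month_to_month = overlap_fraction(100, lag=1)   # 6 of 8 groups in common
year_to_year = overlap_fraction(100, lag=12)    # 4 of 8 groups in common
print(month_to_month, year_to_year)             # prints: 0.75 0.5
```

Eight groups are in the sample in any month, which matches the eight subsamples mentioned in the box.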
Once a set of individuals or households has been identified for a survey, their participation must be tracked and managed, including assignment of individuals or households to interviewers, scheduling of telephone interviews, and follow-up with nonrespondents. A variety of techniques, generally computer-based, are used to assist field workers in conducting interviews (Box 1.3). Finally, data from interviews are collected from individual field interviewers and field offices for processing and analysis. Data collected from paper-and-pencil interviews, of course, require data entry (keying) prior to further processing.5
Processing and Analysis
Before they are included in the survey data set, data from respondents are subject to editing. Responses are checked for missing items and for internal consistency; cases that fail these checks can be referred back to the interviewer or field office for correction. The timely transmission of data to a location where such quality control measures can be performed allows rapid feedback to the field and increases the likelihood that corrected data can be obtained. In addition, some responses require coding before further processing. For example, in the Current Population Survey, verbal descriptions of industry and occupation are translated into a standardized set of codes. A variety of statistical adjustments, including a statistical procedure known as weighting, may be applied to the data to correct for errors in the sampling process or to impute nonresponses. A wide variety of data-processing activities take place before statistical information products can be made available to the public. These activities depend on database systems; relevant trends in database technologies and research are discussed in the Chapter 2 section “Database Systems.” In addition, the processing and release of statistical data must be managed carefully.
Key statistics, such as unemployment rates, influence business decisions and the financial markets, so it is critical that the correct information be released at the designated time and not earlier or later. Tight controls over the processes associated with data release are required. These stringent requirements also necessitate such measures as protection against attack of the database servers used to generate the statistical reports and the Web servers used to disseminate the final results. Process integrity and information system security research questions are discussed in the Chapter 2 section “Trustworthiness of Information Systems.”
5 For more on survey methodology and postsurvey editing, see, for example, Lars Lyberg et al. 1997. Survey Measurement & Process Quality. John Wiley & Sons, New York; and Brenda G. Cox et al. 1995. Business Survey Methods, John Wiley & Sons, New York. For more information on computer-assisted survey information collection (CASIC), see Mick P. Couper et al. 1998. Computer Assisted Survey Information Collection. John Wiley & Sons, New York.
BOX 1.3 Survey Interview Methods
Computer-Assisted Personal Interviewing (CAPI). In CAPI, computer software guides the interviewer through a set of questions. Subsequent questions may depend on answers to previous questions (e.g., a respondent will be asked further questions about children in the household only if he/she indicates the presence of children). Questions asked may also depend on the answers given in prior interviews (e.g., a person who reports being retired will not be repeatedly asked about employment at the outset of each interview except to verify that he or she has not resumed employment). Such questions, and the resulting data captured, may also be hierarchical in nature. In a household survey, the responses from each member of the household would be contained within a household file. The combination of all of these possibilities can result in a very large number of possible paths through a survey instrument. CAPI software also may contain features to support case management.
Computer-Assisted Telephone Interviewing (CATI). CATI is similar in concept to CAPI but supports an interviewer working by telephone rather than interviewing in person.
CATI software may also contain features to support telephone-specific case management tasks, such as call scheduling.1 Computer-Assisted Self-Interviewing (CASI). The person being interviewed interacts directly with a computer device. This technique is used when the direct involvement of a person conducting the interview might affect answers to sensitive questions. For instance, audio CASI, where the respondent responds to spoken questions, is used to gather mental health data in the NHANES.2 The technique can also be useful for gathering information on sexual activities and illicit drug use. Paper-and-Pencil Interviewing (PAPI). Paper questionnaires, which predate computer-aided techniques, continue to be used in some surveys. Such questionnaires are obviously more limited in their ability to adapt or select questions based on earlier responses than the methods above, and they entail additional work (keying in responses prior to analysis). It may still be an appropriate method in certain cases, particularly where surveys are less complex, and it continues to be relied on as surveys shift to computer-aided methods. PAPI questionnaires have a smaller number of paths than computer-aided questionnaires; design and testing are largely a matter of formulating the questions themselves. 1 The terms “CATI” and “CAPI” have specific, slightly different meanings when used by the Census Bureau. Field interviewers using a telephone from their home and a laptop are usually referred to as using CAPI, and only those using centralized telephone facilities are said to use CATI. 2 The CASI technique is a subset of what is frequently referred to as computerized self-administered questionnaires, a broader category that includes data collection using Touch-Tone phones, mail-out-and-return diskettes, or Web forms completed by the interviewee.
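The question routing that Box 1.3 describes for CAPI, where a question is asked only if earlier answers or answers from a prior interview make it applicable, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a description of any agency's instrument: the question identifiers, wording, and the `run_interview` function are hypothetical, and a scripted dictionary of responses stands in for a live respondent.

```python
# Each entry: (question id, question text, rule deciding whether to ask it).
# A rule receives the answers given so far in this interview (`cur`) and the
# answers recorded in a prior interview (`prior`).
QUESTIONS = [
    ("has_children", "Are there children in the household?",
     lambda cur, prior: True),
    ("num_children", "How many children live in the household?",
     lambda cur, prior: cur.get("has_children") == "yes"),
    ("still_retired", "Are you still retired?",
     lambda cur, prior: prior.get("employment") == "retired"),
    ("employment", "What is your current employment status?",
     lambda cur, prior: prior.get("employment") != "retired"
     or cur.get("still_retired") == "no"),
]


def run_interview(responses, prior):
    """Walk the instrument in order, asking only the applicable questions.

    `responses` maps question ids to scripted answers; `prior` holds answers
    from an earlier interview with the same respondent.
    """
    current, asked = {}, []
    for qid, _text, applies in QUESTIONS:
        if applies(current, prior):
            asked.append(qid)
            current[qid] = responses[qid]
    return asked, current


# A respondent who reported being retired last time and has no children.
asked, answers = run_interview(
    {"has_children": "no", "still_retired": "yes"},
    prior={"employment": "retired"},
)
print(asked)  # prints: ['has_children', 'still_retired']
```

Even with four questions, different answers lead to different paths, which hints at why the number of paths through a full instrument grows so quickly.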
Creation and Dissemination of Statistical Products
Data are commonly released in different forms: as key statistics (e.g., the unemployment rate), as more extensive tables that summarize the survey data, and as detailed data sets that users can analyze themselves. Historically, most publicly disseminated data were made available in the form of printed tables, whereas today they are increasingly available in a variety of forms, frequently on the Internet. Tables from a number of surveys are made available on Web sites, and tools are sometimes provided for making queries and displaying results in tabular or graphical form. In other cases, data are less accessible to the nonexpert user. For instance, some data sets are made available as databases or flat-text files (either downloadable or on CD-ROM) that require additional software and/or user-written code to make use of the data. A theme throughout the workshop was how to leverage IT to provide appropriate and useful access to a wide range of customers.
A key consideration in disseminating statistical data, especially to the general public, is finding ways of improving its usability—creating a system that allows people, whether high school students, journalists, or market analysts, to access the wealth of statistical information that the government creates in a way that is useful to them. The first difficulty is simply finding appropriate data—determining which survey contains data of interest and which agencies have collected this information. An eventual goal is for users not to need to know which of the statistical agencies produced what data in order to find them; this and other data integration questions are discussed in the Chapter 2 section “Metadata.” Better tools would permit people to run their own analyses and tabulations online, including analyses that draw on data from multiple surveys, possibly from different agencies.
Once an appropriate data set has been located, a host of other issues arise. There are challenges for both technological and statistical literacy in using and interpreting a data set. Several usability considerations are discussed in the Chapter 2 section “Human-Computer Interaction.” Users
also need ways of accessing and understanding what underlies the statistics, including the definitions used (a metadata issue, discussed in the Chapter 2 section “Metadata”). More sophisticated users will want to be able to create their own tabulations. For example, household income information might be available in pretabulated form by zip code, but a user might want to examine it by school district. Because they contain information collected from individuals or organizations under a promise of confidentiality, the raw data collected from surveys are not publicly released as is or in their entirety; what is released is generally limited in type or granularity. Because this information is made available to all, careful attention must be paid to processing the data sets to reduce the chance that they can be used to infer information about individuals. This requirement is discussed in some detail in the Chapter 2 section “Limiting Disclosure.” Concerns include the loss of privacy as a result of the release of confidential information as well as concerns about the potential for using confidential information to take administrative or legal action.6 However, microdata sets, which contain detailed records on individuals, may be made available for research use under tightly controlled conditions. The answers to many research questions depend on access to statistical data at a level finer than that available in publicly released data sets. How can such data be made available without compromising the confidentiality of the respondents who supplied the data? There are several approaches to address this challenge. In one approach, before they are released to researchers, data sets can be created in ways that deidentify records yet still permit analyses to be carried out.
Another approach is to bring researchers in as temporary statistical agency staff, allowing them to access the data under the same tight restrictions that apply to other federal statistical agency employees. The section “Limiting Disclosure” in Chapter 2 takes up this issue in more detail.
6 The issue of balancing the needs for confidentiality of individual respondents with the benefits of accessibility to statistical data has been explored at great length by researchers and the federal statistical agencies. For a comprehensive examination of these issues see National Research Council and Social Science Research Council. 1993. Private Lives and Public Policies, George T. Duncan, Thomas B. Jabine, and Virginia A. deWolf, eds. National Academy Press, Washington, D.C.
ORGANIZATION OF THE FEDERAL STATISTICAL SYSTEM
The decentralized nature of the federal statistical system, with its more than 70 constituent agencies, has implications for both the efficiency of statistical activities and the ease with which users can locate and use federal statistical data. Most of the work of these agencies goes on without any specific management attention by the Office of Management and Budget (OMB), which is the central coordinating office for the federal statistical system. OMB's coordinating authority spans a number of areas and provides a number of vehicles for coordination. The highest level of coordination is provided by the Interagency Council on Statistical Policy. Beyond that, a number of committees, task forces, and working groups address common concerns and develop standards to help integrate programs across the system. The coordination activities of OMB focus on ensuring that priority activities are reflected in the budgets of the respective agencies; approving all requests to collect information from 10 or more respondents (individuals, households, states, local governments, business);7 and setting standards to ensure that agencies use a common set of definitions, especially in key areas such as industry and occupational classifications, the definition of U.S. metropolitan areas, and the collection of data on race and ethnicity.
In addition to these high-level coordination activities, strong collaborative ties—among agencies within the government as well as with outside organizations—underlie the collection of many official statistics. Several agencies, including the Census Bureau, the Bureau of Labor Statistics, and the National Agricultural Statistics Service, have large field forces to collect data. Sometimes, other agencies leverage their field-based resources by contracting to use these resources; state and local governments also perform statistical services under contracts with the federal government. Agencies also contract with private organizations such as Research Triangle Institute (RTI), Westat, National Opinion Research Center (NORC), and Abt Associates, to collect data or carry out surveys.
(When surveys are contracted out, the federal agencies retain ultimate responsibility for the release of data from the surveys they conduct, and their contractors operate under safeguards to protect the confidentiality of the data collected.) Provisions protecting confidentiality are also decentralized; federal statistical agencies must meet the requirements specified in their own particular legislative provisions. While some argue that this decentralized approach leads to inefficiencies, past efforts to centralize the system have run up against concerns that establishing a single, centralized statistical office could magnify the threat to privacy and confidentiality.

7 This approval process, mandated by the Paperwork Reduction Act of 1995 (44 U.S.C. 3504), applies to government-wide information-collection activities, not just statistical surveys.

Viewing the existence of multiple sets of rules governing confidentiality as a barrier to effective collaboration and data sharing for statistical purposes, the Clinton Administration has been seeking legislation that, while maintaining the existing distributed system, would establish uniform confidentiality protections and permit limited data sharing among certain designated “statistical data center” agencies.8 As a first step toward achieving this goal, OMB issued the Federal Statistical Confidentiality Order in 1997. The order is aimed at clarifying and harmonizing policy on protecting the confidentiality of persons supplying statistical information, assuring them that the information will be held in confidence and will not be used against them in any government action.9

8 Executive Office of the President, Office of Management and Budget (OMB). 1998. Statistical Programs of the United States Government. OMB, Washington, D.C., p. 40.
9 Office of Management and Budget, Office of Information and Regulatory Affairs. 1997. “Order Providing for the Confidentiality of Statistical Information,” Federal Register 62(124, June 27):33043. Available online at <http://www.access.gpo.gov/index.html>.

In an effort to gain the benefits of coordinated activities while maintaining the existing decentralized structures, former OMB Director Franklin D. Raines posed a challenge to the Interagency Council on Statistical Policy (ICSP) in 1996, calling on it to implement what he termed a “virtual statistical agency.” In response to this call, the ICSP identified three broad areas in which to focus collaborative endeavors:

Programs. A variety of programs and products have interagency implications; an example is the gross domestic product, a figure that the Bureau of Economic Analysis issues but that is based on data from agencies in different executive departments. Areas for collaboration on statistical programs include establishing standards for the measurement of income and poverty and addressing the impacts of welfare and health care reforms on statistical programs.

Methodology. The statistical agencies have had a rich history of collaboration on methodology; the Federal Committee on Statistical Methodology has regularly issued consensus documents on methodological issues.10 The ICSP identified the following as priorities for collaboration: measurement issues, questionnaire design, survey technology, and analytical issues.

10 More information on the Federal Committee on Statistical Methodology and on access to documents covering a range of methodological issues is available online from <http://fcsm.fedstats.gov/>.

Technology. The ICSP emphasized the need for collaboration in the area of technology. One objective stood out from the others because it was of interest to all of the agencies: to make the statistical system more
consistent and understandable for nonexpert users, so that citizens would not have to understand how the statistical system is organized in order to find the data they are looking for. The FedStats Web site,11 sponsored by the Federal Interagency Council on Statistical Policy, is an initiative that is intended to respond to this challenge by providing a single point of access for federal statistics. It allows users to access data sets not only by agency and program but also by subject.

11 Available online from <http://www.fedstats.gov>.

A greater emphasis on focusing federal statistics activities and fostering increased collaboration among the statistical agencies is evident in the development of the President's FY98 budget. The budgeting process for the executive branch agencies is generally carried out in a hierarchical fashion: the National Center for Education Statistics, for example, submits its budget to the Department of Education, and the Department of Education submits a version of that to the Office of Management and Budget. Alternatively, it can be developed through a cross-cut, where OMB looks at programs not only within the context of their respective departments but also across the government to see how specific activities fit together regardless of their home locations. For the first time in two decades, the OMB director called for a statistical agency cross-cut as an integral part of the budget formulation process for FY98.12 In addition to the OMB cross-cut, the OMB director called for highlighting statistical activities in the Administration's budget documents and, thus, in the presentation of the budgets to the Congress.

12 Note, however, that it was customary to have a statistical-agency cross-cut in each budget year prior to 1980.

Underlying the presentations and discussions at the workshop was a desire to tap IT innovations in order to realize a vision for the federal statistical agencies. A prominent theme in the discussions was how to address the decentralized nature of the U.S. national statistical system through virtual mechanisms. The look-up facilities provided by the FedStats Web site are a first step toward addressing this challenge. Other related challenges cited by workshop participants include finding ways for users to conduct queries across data sets from multiple surveys, including queries across data developed by more than one agency, a hard problem given that each survey has its own set of objectives and definitions associated with the information it provides. The notion of a virtual statistical agency also applies to the day-to-day work of the agencies. Although some legislative and policy barriers, discussed above in relation
to OMB's legislative proposal for data sharing, limit the extent to which federal agencies can share statistical data, there is interest in having more collaboration between statistical agencies on their surveys.

INFORMATION TECHNOLOGY INNOVATION IN FEDERAL STATISTICS

Federal statistical agencies have long recognized the pivotal role of IT in all phases of their activity. In fact, the Census Bureau was a significant driver of innovation in information technology for many years:

Punch-card-based tabulation devices, invented by Herman Hollerith at the Census Bureau, were used to tabulate the results of the 1890 decennial census;
The first Univac (Remington-Rand) computer, Univac I, was delivered in 1951 to the Census Bureau to help tabulate the results of the 1950 decennial census;13
The Film Optical Scanning Device for Input to Computers (FOSDIC) enabled 1960 census questionnaires to be transferred to microfilm and scanned into computers for processing;
The Census Bureau led in the development of computer-aided interviewing tools; and
It developed the Topologically Integrated Geographic Encoding and Referencing (TIGER) digital database of geographic features, which covers the entire United States.

Reflecting a long history of IT use, the statistical agencies have a substantial base of legacy computer systems for carrying out surveys. The workshop case study on the IT infrastructure supporting the National Crime Victimization Survey illustrates the multiple cycles of modernization that have been undertaken by statistical agencies (Box 1.4).

Today, while they are no longer a primary driver of IT innovation, the statistical agencies continue to leverage IT in fulfilling their missions.
Challenges include finding more effective and efficient means of collecting information, enhancing the data analysis process, increasing the availability of data while protecting confidentiality, and creating more usable, more accessible statistical products. The workshop explored, and this report describes, some of the mission activities where partnerships between the IT research community and the statistics community might be fostered.

13 See, e.g., J.A.N. Lee. 1996. "looking.back: March in Computing History," IEEE Computer 29 (3). Available online from <http://computer.org/50/looking/r30006.htm>.

BOX 1.4 Modernization of the Information Technology Used for the National Crime Victimization Survey

Steven Phillips of the Census Bureau described some key elements in the development of the system used to conduct the National Crime Victimization Survey (NCVS) for the Bureau of Justice Statistics. He noted that the general trend over the years has been toward more direct communication with the sponsor agency, more direct communication with the subject matter analysts, quicker turnaround, and opportunities to modify the analysis system more rapidly. In the early days, the focus was on minimizing the use of central processing unit (CPU) cycles and storage space, both of which were costly and thus in short supply. Because the costs of both have continued to drop dramatically, the effort has shifted from optimizing the speed at which applications run to improving the end product.

At the data collection end, paper-and-pencil interviewing was originally used. In 1986, Mini-CATI, a system that ran on Digital Equipment Corporation minicomputers, was developed, and the benefits of online computer-assisted interviewing began to be explored. In 1989, the NCVS switched to a package called Micro-CATI, a quicker, more efficient, PC-based CATI system, and in 1999 it moved to a more capable CATI system that provides more powerful authoring tools and better capabilities for exporting the survey data and tabulations online to the sponsor. As of 1999, roughly 30 percent of the NCVS sample was using CATI interviewing.

Until 1985 a large Univac mainframe was used to process the survey data. It employed variable-length files; each household was structured into one record that could expand or contract. All the data in the tables were created by custom code, and the tables themselves were generated by a variety of custom packages.
In 1986, processing shifted to a Fortran environment. In 1989, SAS (a software product of the SAS Institute, Inc.) began to be used for the NCVS. At that time a new and more flexible nested and hierarchical data file format was adopted.

Another big advantage of moving to this software system has been the ease with which tables can be created. Originally, all of the statistical tables were processed on a custom-written table generator. It produced a large number of tables, and the Bureau of Justice Statistics literally cut and pasted (with scissors and mucilage) to create the final tables for publications. A migration from mainframe-based Fortran software to a full SAS/Unix processing environment was undertaken in the 1990s; today, all processing is performed on a Unix workstation, and a set of SAS procedures is used to create the appropriate tables. All that remains to produce the final product is to process these tables, currently done using Lotus 1-2-3, into a format with appropriate fonts and other features for publication.
IT innovation has been taking place throughout government, motivated by a belief that effective deployment of new technology could vastly enhance citizens' access to government information and significantly streamline current government operations. The leveraging of information technology has been a particular focus of efforts to reinvent government. For example, Vice President Gore launched the National Performance Review, later renamed the National Partnership for Reinventing Government, with the intent of making government work better and cost less. The rapid growth of the Internet and the ease of use of the World Wide Web have offered an opportunity for extending electronic access to government resources, an opportunity that has been identified and exploited by the federal statistical agencies and others. Individual agency efforts have been complemented by cross-agency initiatives such as FedStats and Access America for Seniors.14 While government agency Web pages have helped considerably in making information available, much more remains to be done to make it easy for citizens to locate and retrieve relevant, appropriate information.

Chapter 2 of this report looks at a number of research topics that emerged from the discussions at the workshop, topics that not only address the requirements of federal statistics but also are interesting research opportunities in their own right. The discussions resulted in another outcome as well: an increased recognition of the potential of interactions between government and the IT research community. Chapter 3 discusses some issues related to the nature and conduct of such interactions. The development of a comprehensive set of specific requirements or of a full, prioritized research agenda is, of course, beyond the scope of a single workshop, and this report does not presume to develop either.
Nor does it aim to identify immediate solutions or ways of funding and deploying them. Rather, it examines opportunities for engaging the information technology research and federal statistics communities in research activities of mutual interest.

14 Access America for Seniors, a government-operated Web portal that delivers electronic information and services for senior citizens, is available online at <http://www.seniors.gov>.