Suggested Citation:"2 The Current Dissemination Program." National Research Council. 2012. Communicating Science and Engineering Data in the Information Age. Washington, DC: The National Academies Press. doi: 10.17226/13282.

2

The Current Dissemination Program

The current dissemination program of the National Center for Science and Engineering Statistics (NCSES) is wide-ranging and multifaceted. In order to fulfill its mandate to serve as collector and distributor of information about the science and engineering enterprise for the National Science Foundation (NSF), this relatively small, resource-constrained statistical agency¹ disseminates its publishable data in several formats (hard-copy, mixed, and electronic-only publications); maintains an extensive website; makes its data available for retrieval from the consolidated FedStats database and through the Data.gov portal; provides access to confidential microdata in a protected environment for research purposes; and supports three online communal tools that are used to retrieve data from the NCSES database: the Integrated Science and Engineering Resources Data System (WebCASPAR), the Scientists and Engineers Statistical Data System (SESTAT), and the lesser-known Industrial Research and Development Information System (IRIS) (see Table 2-1).

These diverse outputs and self-maintained tools serve a broad community of information users whose data needs range from one-time casual use to recurring, highly sophisticated use and whose statistical knowledge ranges from rudimentary to expert. The user community also has quite different access preferences, as attested by the users who discussed their uses of the data with the panel

____________

¹The 2011 budget for NCSES was $41.5 million, down from $45.7 million in fiscal year 2009 and $41.9 million in fiscal year 2010. The agency has only 45 full-time permanent staff members, of whom 21 are statisticians.

TABLE 2-1 Summary of Selected Characteristics of NSF Science and Engineering Surveys

Survey	Current Contractor	Database Retrieval Tool/ Publication
Survey of Earned Doctorates	National Opinion Research Center (NORC)	WebCASPAR; InfoBriefs; Science and Engineering Degrees; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Doctorate Recipients from U.S. Universities: Summary Report; Academic Institutional Profiles

Survey of Graduate Students and Postdoctorates in Science and Engineering	RTI International	WebCASPAR; InfoBriefs; Graduate Students and Postdoctorates in Science and Engineering; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Academic Institutional Profiles

Variables Available	Availability of Microdata	Series Initiated/ Archiving
Academic institution of doctorate; baccalaureate-origin institution (United States and foreign); birth year; citizenship status at graduation; country of birth and citizenship; disability status; educational attainment of parents; educational history in college; field of degrees (N = 292); graduate and undergraduate educational debt; marital status, number/age of dependents; postgraduation plans (work, postdoctorate, other study/training); primary and secondary work activities; source and type of financial support for postdoctoral study/research; type and location of employer; race/ethnicity; sex; sources of financial support during graduate school; type of academic institution (e.g., historically black institutions, Carnegie codes, control) awarding the doctorate	Access to restricted microdata can be arranged through a licensing agreement. A secure data access facility/data enclave providing restricted microdata access is under development with NORC.	1957 (conducted annually, limited data available 1920-1956)

The number and characteristics of graduate students; postdoctoral appointees; and doctorate-holding nonfaculty researchers in science, engineering, and health (SEH) fields	Data for the years 1972–2008 are available in a public-use file format.	1975 (conducted annually)

Survey	Current Contractor	Database Retrieval Tool/ Publication
Survey of Doctorate Recipients	NORC	SESTAT; InfoBriefs; Characteristics of Doctoral Scientists and Engineers in the United States; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Science and Engineering State Profiles

National Survey of Recent College Graduates	Mathematica Policy Research, Inc. and Census Bureau	SESTAT; InfoBriefs; Characteristics of Recent Science and Engineering Graduates; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering

Variables Available	Availability of Microdata	Series Initiated/ Archiving
Citizenship status; country of birth; country of citizenship; date of birth; disability status; educational history (for each degree held: field, level, institution, when received); employment status (unemployed, employed part time, or employed full time); geographic place of employment; marital status; number of children; occupation (current or past job); primary work activity (e.g., teaching, basic research, etc.); postdoctorate status (current and/or three most recent postdoctoral appointments); race/ethnicity; salary; satisfaction and importance of various aspects of job; school enrollment status; sector of employment (e.g., academia, industry, government, etc.); sex; work-related training	Access to restricted data for researchers interested in analyzing microdata can be arranged through a licensing agreement. The data are available online through the enclave arrangement discussed above.	1973 (conducted biennially)

For individuals who recently received bachelor’s or master’s degrees in an SEH field from a U.S. institution: age; citizenship status; country of birth; country of citizenship; disability status; educational history (for each degree held: field, level, when received); employment status (unemployed, employed part time, or employed full time); educational attainment of parents; financial support and debt amount for undergraduate and graduate degree; geographic place of employment; marital status; number of children; occupation (current or previous job); place of birth; work activity (e.g., teaching, basic research, etc.); race/ethnicity; salary; overall satisfaction with principal job; school enrollment status; sector of employment (e.g., academia, industry, government, etc.); sex; work-related training	Access to restricted data for researchers interested in analyzing microdata can be arranged through a licensing agreement.	1976 (conducted biennially)

Survey	Current Contractor	Database Retrieval Tool/ Publication
National Survey of College Graduates	Census Bureau	SESTAT; InfoBriefs; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering

Business Research and Development and Innovation Survey (BRDIS)	Census Bureau	IRIS; InfoBrief; Business and Industrial R&D; Science and Engineering Indicators; National Patterns of Research and Development Resources; Science and Engineering State Profiles

Survey of Federal Funds for Research and Development	Synectics for Management Decisions, Inc.	WebCASPAR; InfoBrief; Federal Funds for Research and Development; Science and Engineering State Profiles; Science and Engineering Indicators; National Patterns of Research and Development Resources

Variables Available	Availability of Microdata	Series Initiated/ Archiving
For individuals holding a bachelor’s or higher degree in any field: academic employment (position, rank, and tenure); age; citizenship status; country of birth; country of citizenship; disability status; educational history (for each degree held: field, level, when received); employment status (unemployed, employed full time, or employed part time); geographic place of employment; immigrant module (year of entry, type of entry visa, reason(s) for coming to the United States, etc.); labor force status; marital status; number of children; occupation (current or past job); primary work activity (e.g., teaching, basic research, etc.); publication and patent activities; race/ethnicity; salary; satisfaction and importance of various aspects of job; school enrollment status; sector of employment (academia, industry, government); sex; work-related training	Public-use data files are available upon request.	1962 (conducted biennially)

Financial measures of research and development (R&D) activity; company R&D activity funded by others; R&D employment; R&D management and strategy; and intellectual property, technology transfer, and innovation	Census Research Data Centers	1953 (conducted annually); a new series began in 2008 when the survey was changed

Federal obligations by the following key variables: character of work; basic research; applied research; development; federal agency; federally funded research and development centers (FFRDCs); field of science and engineering; geographic location (within the United States and foreign country); performer (type of organization doing the work); R&D plant. Federal outlays by: character of work, basic research, applied research, development, R&D plant	Data tables	1952 (conducted annually)

Survey	Current Contractor	Database Retrieval Tool/ Publication
Survey of Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions	Synectics for Management Decisions, Inc.	WebCASPAR; InfoBrief; Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions; Science and Engineering State Profiles; Science and Engineering Indicators; National Patterns of Research and Development Resources

Survey of R&D Expenditures at Federally Funded R&D Centers (FFRDCs)	ICF Macro	WebCASPAR; InfoBrief; R&D Expenditures at Federally Funded R&D Centers; Academic Research and Development Expenditures; Science and Engineering Indicators; National Patterns of Research and Development Resources

Survey of Research and Development Expenditures at Universities and Colleges	ICF Macro	WebCASPAR; InfoBrief; Academic Research and Development Expenditures; Science and Engineering Indicators; National Patterns of Research and Development Resources; Science and Engineering State Profiles; Academic Institutional Profiles

Survey of State Research and Development Expenditures	Census Bureau	InfoBrief; State Government R&D Expenditures; Science and Engineering Indicators

Variables Available	Availability of Microdata	Series Initiated/ Archiving
Data by federal agency, academic institutions and location: R&D; fellowships, traineeships, and training grants; R&D plant; facilities and equipment for instruction in science and engineering; general support for science and engineering; type of academic institution (i.e., historically black colleges and universities [HBCUs], tribal institutions, high-Hispanic-enrollment institutions, minority institutions); type of institutional control (public versus private)	Data tables only	1965 (conducted annually)

FFRDC R&D expenditures by source of funds (federal, state and local, industry, institutional, or other); and character of work (basic research, applied research, or development)	Data tables only	1965 (conducted annually)

Institution R&D expenditures by source of funds (federal, state and local, industry, institutional, or other); character of work (basic research versus applied research and development); pass throughs to subrecipients; receipts as a subrecipient; S&E field; non-S&E field; R&D equipment expenditures by S&E field; federal agency; type of degree granted, HBCU, public or private control; geographic location (within the United States)	Data tables (selected items)	1972 (conducted annually, limited data available for various years for 1954-1970)

State agency or department; state R&D expenditures; internal performers; external performers; basic research; source of funds (federal, state, other); R&D facilities	Data tables	1964 (conducted occasionally)

Survey	Current Contractor	Database Retrieval Tool/ Publication
Survey of Science and Engineering Research Facilities	RTI International	WebCASPAR; Scientific and Engineering Research Facilities; Science and Engineering Indicators

Survey of Public Attitudes Toward and Understanding of Science and Technology	NORC, via a science and technology module on the General Social Survey	Science and Engineering Indicators

(see Chapter 5). With limited resources, NCSES attempts to be all things to all users, and because it is spread so thinly, the panel has serious concerns about whether these outputs and tools are optimized for all the tasks to which they are addressed, as well as about whether NCSES is using the most up-to-date technologies and processes to best advantage for the user community.

In this chapter, we assess the status of the NCSES dissemination program. First, we describe the remaining hard-copy publications. We then review the NCSES user interface tools, including WebCASPAR, SESTAT, and IRIS, through which individuals are able to directly access and retrieve tailored outputs from the database. Then we discuss the structure of the databases and their current presentation on the web for downloading and use by third parties. We assess the current status of the program in light of the emerging practices for electronic dissemination, primarily the development of the Semantic Web as a way to facilitate access to information on the Internet. We provide examples of semantic web systems in federal agencies

Variables Available	Availability of Microdata	Series Initiated/ Archiving
Status of research facilities at academic institutions and nonprofit biomedical research organizations and hospitals by: amount and type of science and engineering research space; current expenditures for projects to construct and repair/renovate research facilities; condition of research facilities; planned construction and repair/renovation of research facilities; source of funds (federal, state and local, institutional) for construction and repair/renovation of research facilities; research animal facilities; bandwidth speeds and high performance network connections; fiber; high performance computing; wireless connections	Microdata from this survey for the years 1988-2001 are not available.	1986 (conducted biennially)

Demographic, behavioral, and attitudinal by how information about S&T is obtained; interest in science-related issues; visits to informal science institutions; S&T knowledge; attitudes toward science-related issues	Data tables	ICPSR, 1979-2001; CD, 1979-2004; (conducted biennially)

and the possibilities for development of a semantic web structure for science and engineering (S&E) information on the Internet. Finally, we consider the important issue of timeliness—a subject of great concern for users of NCSES data—and the possibility of moving the release and distribution of S&E data to a real-time basis.

TRADITIONAL FORMAT PUBLICATIONS

NCSES continues a few publications using a print-based approach and still has a customer base for them, although that customer base seems to be declining over time. Moreover, although most retrieval of NCSES information is by electronic means, a large part of the offerings are simply electronic depictions of previous hard-copy publications. It is fair to say that NCSES continues to manage its publications program in much the same way as it traditionally has, although the finished products, for the most part, are now sent to the website for posting rather than to a printing facility for production and distribution. The shift to provision of data in electronic format over the years can be characterized as a thin digitization of previously manual products. The format for the database that is made available on the website and that is queried by the NCSES tools is largely a replication of the old tables that found their way into the printed publications.

The major publication is Science and Engineering Indicators, a massive (in terms of bulk and effort) biennial product of the National Science Board, to which the NCSES staff makes a substantial commitment. This publication and the S&E indicators program that underlies it are the subject of a companion National Research Council study that was ongoing as our report was being prepared and are therefore not reviewed here. Nonetheless, in interviews with users (see Chapter 4), the volume was well regarded, and companion online publications, such as the Digest, have also proven popular.²

Several annual publications continue to appear in hard copy. Among these publications are some that pertain to specialized audiences: Women, Minorities, and Persons with Disabilities in Science and Engineering; Doctorate Recipients from United States Universities: Summary Report; and Academic Institutional Profiles. These series have proven to be popular, but their small circulations indicate their limited reach in hard-copy format.

Another series that still has some traction is the InfoBriefs series, which is published both in hard copy and on the website. In this series, NCSES highlights key findings of its major statistical programs in summary form, largely to improve the timeliness of data release. Typically, the InfoBriefs are followed by publication of a comprehensive set of detailed tables in electronic format (xls, pdf). According to user comments received by the panel, this series is useful and should be retained. The series appears to achieve its purpose of bringing highlights to the attention of the user community. A rudimentary search of the web shows that InfoBriefs are often referenced, summarized, or retransmitted in specialty newsletters and blogs.

At the same time, the NCSES approach to dissemination of standing data tables is largely a static electronic analog to its long-standing series of print publications. The approach the agency takes to the release of data tables is relatively unsophisticated when compared with approaches to table access used by other data organizations, such as the Census Bureau’s American FactFinder (discussed below).

____________

²Despite the reported popularity of the print version of some of the publications, even those publications that continue in print have been severely curtailed. The print run for the Science and Engineering Indicators volume has been cut from 19,000 to 5,000 in recent years, and NCSES reports plans to cut the number of hard copies further in the future.

EMERGING OPTIONS FOR PRINT DISSEMINATION

Although NCSES has taken a number of steps to deemphasize or eliminate the release of its data in print form, to the extent that only a handful of publications are still available in print format, it has not done much to change the way it approaches printing and hard-copy distribution. To meet the needs of the remaining users who require hard-copy publications, there are alternative means of printing and dissemination that may be more efficient for NCSES.

According to Jeff Turner, director of sales and marketing of the U.S. Government Printing Office (GPO), the growing availability and ease of print-on-demand (POD) and electronic books may be an answer to meeting the residual need for print products. In his presentation to the panel, he discussed the flexibility of arrangements for POD services.

Turner stated that such agencies as NCSES can directly purchase POD services from vendors by using a simplified purchase agreement through GPO that gives the agency complete control and convenience when looking for ways to quickly procure quality printing and related services. GPO provides training and technical assistance to agencies so they can use vendors certified by GPO.

GPO has also made arrangements for the purchase of printing services from a local Federal Express (FedEx) Office establishment through the GPOExpress contract. Moreover, agencies can choose to provide their publications to the public in POD format through the GPO sales program, wherein GPO manages the contracts and reprints books in response to customer demand, thus saving both the agency and GPO warehousing space and expense.

The GPO eBook Program is another innovation. GPO uses the Google Books Partner Program to display titles that have been accepted into the GPO Sales Program, thereby increasing public awareness of federal titles. The eBook program constitutes a step toward focusing additional public attention on federal agency publications and products, but it is less pertinent to the dissemination issues faced by NCSES than the POD program.

TOOLS FOR ACCESSING THE DATABASE

The principal database access tools made available by NCSES to its data users have been in place for some time, and, like many older systems, they are in need of updating. They are best characterized as bespoke tools for individual access, having been developed from scratch to solve specific access problems associated with specific databases and dated user community requests. The resources and effort to maintain the database access tools are high relative to their utility. The capabilities are somewhat limited and technologically dated, in contrast to some of the tools emerging elsewhere among the federal statistical agencies. A brief description of the three main NCSES tools based on information provided on its website follows.

Scientists and Engineers Statistical Data System

SESTAT is an integrated data collection effort capturing information about employment, educational, and demographic characteristics of scientists and engineers in the United States. The data are collected from three national surveys of this population: the National Survey of College Graduates (NSCG), the National Survey of Recent College Graduates (NSRCG), and the Survey of Doctorate Recipients (SDR). Data are available for download or through the SESTAT Data Tool, which allows users to generate custom data tables.

Integrated Science and Engineering Resources Data System (WebCASPAR)

WebCASPAR is a web-based database system containing information about academic S&E resources. The database includes information from four of NCSES’s research and development (R&D) expenditure surveys and two of its academic surveys, plus data from the Integrated Postsecondary Education Data System (IPEDS) of the National Center for Education Statistics. According to the description, the system provides the user with opportunities to select variables of interest and to specify whether and how information should be aggregated.³ Information is presented in HTML format and output can be in hard-copy form or in Lotus, Excel, or SAS formats for additional manipulation by the researcher.

Survey of Earned Doctorates Tabulation Engine

As this report was being prepared, NCSES released, on a pilot basis, a new data tool to provide access to selected variables from the Survey of Earned Doctorates (SED). The SED Tabulation Engine complements the WebCASPAR tool by performing tabulations on data from 2006 onward. This tool was a consequence of decisions to change the way confidentiality protections are applied to SED data. Beginning with the 2007 SED, data on the race/ethnicity, gender, and citizenship status of doctorate recipients were no longer reported in WebCASPAR. These changes were made with the goal of strengthening the confidentiality protections applied to SED data.

____________

³See http://www.nsf.gov/statistics/database.cfm [November 2011].

The WebCASPAR system was incapable of employing the new confidentiality protection procedures, so the range of SED variables available in WebCASPAR was reduced. The SED Tabulation Engine was developed so that NCSES could continue to provide data users with access to gender, race/ethnicity, and citizenship data from 2007 onward. This new tool displays estimates that were developed in a way intended to prevent disclosure of personally identifiable information in tables using gender, race/ethnicity, or citizenship variables. It provides users with the ability to generate statistics using all of the SED variables previously available in WebCASPAR except baccalaureate institution and the highest degree awarded by those institutions. NCSES will explore the possibility of adding the baccalaureate institution variable to the tabulation engine in a future release.⁴

The tabulation engine includes a disclosure control mechanism that is intended to protect the identity of respondents when using the gender, citizenship, and race/ethnicity variables. It displays estimates that are intended not to disclose personally identifiable information and enables users to generate statistics using all of the SED variables previously available in WebCASPAR, except some institutional information. The SED Tabulation Engine was developed by NSF through a contract to the National Opinion Research Center at the University of Chicago.⁵

Industrial Research and Development Information System

IRIS links an online interface to a historical database with more than 2,500 statistical tables containing all industrial R&D data published by NSF from 1953 through 1998 when, concurrent with implementation of the new industrial classification system, the series was discontinued. IRIS has recently been updated as an IRIS II version that contains statistics for 1953-2007. The tables that reside in IRIS and IRIS II were drawn from the results of NSF’s annual Survey of Industrial Research and Development, the primary source for national-level data on U.S. industrial R&D. This survey was replaced with the Business Research and Development and Innovation Survey, for which there is currently no comparable dedicated access tool. NCSES is contemplating creation of a repository similar to IRIS and IRIS II for the new survey results.⁶

The IRIS tables are in Excel spreadsheet format and are accessible by defining either variables, such as total R&D expenditures, or dimensions, such as size of company, for specific research topics. The data can also be obtained by querying the report in which the tables were first published.

____________

⁴See https://webcaspar.nsf.gov/Help/dataMapHelpDisplay.jsp?subHeader=DataSourceBySubject&type=DS&abbr=DRF&noHeader=1&JS=No [August 2011].

⁵See https://ncses.norc.org/NSFTabEngine/#WELCOME [May 2011].

⁶Communication with Raymond Wolfe, NCSES, July 22, 2011.

NCSES’s three major dissemination tools (SESTAT, WebCASPAR, and IRIS) have been in place without major modification for some time. Some of the data users at the panel’s workshop commented that the tools are in need of retooling. Although the tools retrieve individual and cross-tabulated data elements with some efficiency and produce tabulations as specified by users, they have no ability to enhance data analysis by use of either standardized or user-specified visualizations. Nor can they reach across the data sets to permit integrated retrieval and analysis. Furthermore, they do not offer systematic or complete access to microdata, and they fail to offer any standard means for machine access to the data and metadata, creating substantial barriers to third-party web tools and services.

In considering how best to modernize its tools and access to its information and data, NCSES could first study what other government agencies have done or are doing to improve their dissemination tools; it could also consider what the private sector has to offer. The first approach would let NCSES draw on knowledge and best practices already available or in development and incorporate those lessons into its current and future tactical and strategic planning. In addition, in light of the limited resources currently available to it, NCSES should seek to identify other government agencies or private-sector partners that would provide opportunities to join, leverage, or use available toolsets and approaches (see Recommendation 3-5).

Alternative Federal Statistical Agency Tools

Although their databases are constructed in a different manner and the uses are often quite dissimilar, two quite sophisticated retrieval tools now in use and under further development by the Census Bureau should be considered in assessing the adequacy and available functionality of the NCSES tools. The panel invited the Census Bureau officials in charge of maintaining and upgrading these major tools, American FactFinder and DataWeb, to discuss them at the panel workshop.

American FactFinder

The American FactFinder is the Census Bureau’s primary web-based data dissemination vehicle. This tool enables the retrieval of data from the decennial census, the economic census, the American Community Survey, annual economic surveys, and the Population Estimates Program (all very large databases) in tabular, map, or chart-form data products, as well as online access to archived data (through download).

Jeffrey Sisson, the American FactFinder program manager, reported that the system is being redesigned with several ambitious goals: to increase the effectiveness of user data access; to guide users to their data without forcing them to become experts; to improve turnaround time; to increase the efficiency and flexibility of dissemination operations; to address growing usage and data volume needs; and to provide a platform that evolves over time, avoiding technology obsolescence. The overall goal of the redesign is to make information easier to find, update the look and feel of the site, increase its functionality, implement topic- and geography-based search and navigation, standardize functionality and look across all data products and surveys, implement new and improved table manipulations, and implement charting functionality.

Sisson said that the plan for the redesign was based on stakeholder and user feedback, usability studies, and a usability audit. Based on the usability studies, the Census Bureau selected the following areas for improvement: usability and customer satisfaction, visual elements, conventional layout, consistent structure, and layering of information.

Information received by the panel after the introduction of the redesigned American FactFinder (FactFinder 2) suggests that, even with extensive usability studies, the introduction of a new tool can be a challenging activity. As our report was being prepared, the Census Bureau was continuing to work with users to refine the FactFinder 2 tool to better meet user needs.

Despite the difficulties encountered in the implementation phase, it seems appropriate for NCSES to consider the American FactFinder model, which is based on formal usability studies, in determining how it might provide better user access to the large number of standing tables published subsequent to its InfoBriefs. The difficulties encountered in implementing the upgrades to American FactFinder are also pertinent when introducing a new tool to the user community.

DataWeb

In his introduction to the discussion of the Census Bureau’s DataWeb network, Cavan Capps, chief of DataWeb applications, described the major tasks facing statistical agencies: how to present the right data with the right context to meet users’ needs through effective data integration, how to ensure that the most recent and most correct data are displayed, and how to facilitate the efficient reuse of data for different purposes. In his presentation, he stated that the Census Bureau met these challenges through the DataWeb network, which consists of three parts. The DataWeb network and its component servers create a web of machine-accessible data, whereas HotReports and DataFerrett provide tools for users to present and manipulate that data.

The DataWeb project was started in 1995 to develop an open-source framework that networks distributed statistical databases together into a seamless unified virtual data warehouse. It was originally funded by the U.S. Census Bureau, with participation at various times by the Bureau of Labor Statistics, the Centers for Disease Control and Prevention, Harvard University, and a number of nonprofit institutions.

DataWeb is not just an archive or publisher of data; rather, it is a technology infrastructure that reads, normalizes, manipulates, and presents remote data sources from several different agencies in a way that facilitates reuse of the data for policy purposes (American Association for the Advancement of Science, 2003; Bosley and Capps, 2000; Capps, Green, and Wallace, 1999). The DataWeb framework is accessed by hundreds of thousands of users to support statistically complex asymmetrical tabulations and visualizations for hundreds of millions of records in seconds, stored in different formats transparently and instantly. The data can be maintained by the sponsoring government agency using its own internal format and processes; thus, the available data are “official” and are updated in real time. This infrastructure is being explored as a way of reviewing data throughout the life cycle of the data creation process, making possible the capture and provision of statistically appropriate metadata that define the appropriate statistical usage and integration.

The software provides a service-oriented architecture that pulls data from different database structures and vendors and normalizes them into a standard stream of data. The normalized stream is intelligent and supports standard transformations, can geographically map itself correctly using the correct vintage of political geography, understands standard code sets so that data can be combined in statistically appropriate ways, understands how to weight survey data appropriately, and understands variance and other statistical behaviors.
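The normalization idea can be illustrated with a minimal sketch. It is not the DataWeb implementation: the two source layouts, the `Record` type, the `STATE_CODES` mapping, and the function names are all hypothetical, chosen only to show how per-source adapters can feed one common stream that carries a shared code set and survey weights.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Record:
    """One observation in the common (normalized) stream."""
    source: str
    state: str      # two-letter postal code, the shared code set
    value: float
    weight: float   # survey weight; 1.0 for an unweighted count


# Hypothetical code-set mapping used to harmonize geography labels.
STATE_CODES = {"Virginia": "VA", "VA": "VA", "Maryland": "MD", "MD": "MD"}


def from_agency_a(rows: Iterable[dict]) -> Iterator[Record]:
    """Adapter for a made-up source that stores full state names and weights."""
    for row in rows:
        yield Record("agency_a", STATE_CODES[row["state_name"]],
                     float(row["amount"]), float(row["wt"]))


def from_agency_b(rows: Iterable[tuple]) -> Iterator[Record]:
    """Adapter for a made-up source that stores unweighted (code, value) pairs."""
    for code, value in rows:
        yield Record("agency_b", STATE_CODES[code], float(value), 1.0)


def weighted_mean_by_state(stream: Iterable[Record]) -> Dict[str, float]:
    """Combine records from any source using their weights."""
    totals: Dict[str, List[float]] = {}
    for rec in stream:
        acc = totals.setdefault(rec.state, [0.0, 0.0])
        acc[0] += rec.value * rec.weight
        acc[1] += rec.weight
    return {state: num / den for state, (num, den) in totals.items()}
```

Because every adapter emits the same `Record` shape, downstream tabulation code never needs to know which agency or storage format a value came from.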

Capps described DataWeb as having the capacity for handling different kinds of data in the same environment or framework. It is empowered by statistical intelligence: documentation, statistical usage rules, and data integration rules. Its features include storing the data once, but using it many times. DataFerrett and HotReports both use the DataWeb framework.

DataFerrett is a data web browser that is targeted at sophisticated data users and can present multiple data sets in an integrated way. It speeds analytical tasks by allowing data manipulation and incorporating advanced tabulation and descriptive statistics, and its mapping and business graphics use statistical rules. It has the capability of adding regressions and other advanced statistics.

HotReports are presented much like the NCSES InfoBriefs. They are targeted to local decision makers with limited time and statistical background. Designed to bring together relevant variables for local areas, they are topically oriented and updated when needed. They have been developed to be quick to build using a drag-and-drop layout. The main difference is that while InfoBriefs consist of static tables that are generated manually and “pasted” into documents, HotReports are generated from the data itself, its metadata, and publishing rules describing each table. This means, first, that it is always possible to trace the provenance (including data-editing “footnotes”) of any reported summary result to the existing data and, second, that the table is “dynamic,” offering live updates (if desired), drill-down, and integration with other data sources.
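A minimal sketch of a table generated from data, metadata, and a publishing rule follows. The data set, the rule format, and the function names are hypothetical and the values are illustrative; the sketch only shows how a generated table can carry its provenance with it and can be rebuilt whenever the underlying data change.

```python
from typing import Dict

# Hypothetical data set: rows plus the metadata a publisher would maintain.
DATASET = {
    "id": "example-rd-expenditures",
    "vintage": "2009",
    "rows": [
        {"state": "VA", "sector": "industry", "spend": 5.0},
        {"state": "VA", "sector": "academic", "spend": 2.0},
        {"state": "MD", "sector": "industry", "spend": 3.0},
    ],
    "footnotes": {"spend": "Illustrative values only."},
}

# Hypothetical publishing rule: it describes the table instead of hard-coding it.
RULE = {"title": "Spending by state", "group_by": "state", "measure": "spend"}


def build_table(dataset: dict, rule: dict) -> dict:
    """Aggregate the measure by the grouping variable and attach provenance."""
    totals: Dict[str, float] = {}
    for row in dataset["rows"]:
        key = row[rule["group_by"]]
        totals[key] = totals.get(key, 0.0) + row[rule["measure"]]
    return {
        "title": rule["title"],
        "cells": sorted(totals.items()),
        "provenance": {
            "dataset": dataset["id"],
            "vintage": dataset["vintage"],
            "footnote": dataset["footnotes"].get(rule["measure"], ""),
        },
    }


def render(table: dict) -> str:
    """Render the table as plain text, ending with its source line."""
    lines = [table["title"]]
    lines += [f"{key}\t{value:g}" for key, value in table["cells"]]
    prov = table["provenance"]
    lines.append(f"Source: {prov['dataset']} ({prov['vintage']}). {prov['footnote']}")
    return "\n".join(lines)
```

Changing `RULE["group_by"]` to `"sector"` yields a different table from the same rows, which is the sense in which such a table is dynamic rather than pasted.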

The DataWeb system demonstrates the feasibility of integrating data from multiple federal agencies for rich reporting and analysis. It also demonstrates how metadata can be used to make data products, such as reports, both more reproducible and more dynamic.

It seems appropriate, then, for NCSES to look at DataWeb as a resource as it considers a new approach to data retrieval. It can consider redesigning its retrieval tools through incorporating aspects of DataWeb design and functionality as well as making its data available through DataWeb.

PRIVATE-SECTOR TOOLS AND INFRASTRUCTURE FOR DISSEMINATION

Information presented to the panel at its workshop emphasized that this is an exciting, fast-changing time for electronic data dissemination in the public sector. Indeed, many of the tools and applications that were discussed in the workshop in late 2010 have been substantially revised since then. Nonetheless, the panel thinks that the following summary discussion of the trends in data visualization, data publication, and data sharing is foundational, in that it points to developments that need to be taken into account by NCSES as the agency considers updating its data dissemination program.

The major inputs to the following discussion of what was then the state of practice were (a) a presentation by panel member Micah Altman, who summarized the state of current practice in terms of publicly available systems for online numeric presentation and for web-based data visualization, data publication, and data sharing; (b) a presentation describing a tool called Google Public Data Explorer by Jürgen Schwärzler, statistician, and Benjamin Yolken, project leader for this program; (c) a presentation by panel member Christiaan Laevaert on the practical aspects of using the Google Public Data Explorer tool and the significant improvements in the overall visibility of the data offerings of the Statistical Office of the European Union (EUROSTAT); and (d) a presentation by Steve McDougall, product manager, and Stephan Jou, technical architect for IBM, who described the lessons that have been learned concerning the Many Eyes website, wherein users can experiment with, download, and create visualizations of data sets.

The current set of tools for online data access includes special-purpose tools for visualization, tools for one-way data publication, and tools for public data sharing and exchange. These can be further classified as open source and closed source. Some leading examples were discussed in each category.

State of the Practice in Online Data Visualization

The panel heard presentations on three toolkits that are examples of the advanced visualization that can be made possible when data are available in machine-understandable formats using open standards and metadata.

Protovis and its associated tool, Data Driven Documents (D3), are toolkits for dynamic visualization of complex data. These open-source tools handle small-sized databases. They support a partial grammar of graphics in high-level abstractions (D3 adds capacity for animation, interaction, and dynamic visualizations) (Bostock and Heer, 2009).

Similarly, Processing and Prefuse Flare are open-source toolkits built to support advanced web-based visualizations. Processing is both a framework and an open-source language that was originally based on Java. The Processing tool uses a function-based visualization model, whereas Flare is built on Flash and uses an object-based model (Heer, Card, and Landay, 2005; Reas and Fry, 2007).

State of the Practice in Online Data Publication

Google also has a number of offerings, including Google Docs (formerly Google Sheets), which is an Excel-type tool that has application programming interfaces (APIs) for integration and handles small data. Fusion Tables focuses on data sharing, linking, and merging. The Google Public Data Explorer is used for data publication, and the (now defunct) Google Palimpsest had aimed to provide scientific data sharing and preservation. Despite being under the Google umbrella, each of these tools is essentially a standalone system, using its own user interfaces, with its own business model and terms of service.

The most pertinent tool for public data use is the Google Public Data Explorer, which, as described to the panel by the Google development team, searches across data elements and has some visualization capability. It was launched as a Google product in March 2010. It is designed to make large, public-interest data sets easy to explore, visualize, and communicate. In the standard Google visualizations, charts and maps animate over time, and changes become easier to perceive and understand. It is designed for users who are not data experts. In a short time, users can navigate between different views, make their own comparisons, and share findings.

The Google Public Data Explorer includes a number of data sets, all of which are provided by third-party data providers, such as international organizations, national statistical offices, nongovernmental organizations, and research institutions. These providers are responsible for creating and maintaining all of the content that appears in the product.

The potential of the Google Public Data Explorer tool was discussed by panel member Christiaan Laevaert. Eurostat has been rethinking its approach to visualization tools, adapting procedures that are able, with minimal effort, to supply data in formats required by emerging tools or standards on the Internet. Free access to and reuse of data are a cornerstone of Eurostat’s dissemination policy, and it is precisely the reuse of its data—in all kinds of commercial and noncommercial projects—that gives Eurostat much higher visibility than it could achieve solely through its own dissemination products. As an example, working with Google resulted not only in data being featured on the Google Public Data Explorer but also in the integration of data into Google search with Onebox. The Google search integration makes data sets searchable in 34 languages and ensures the highest ranking in search results. Currently, four Eurostat data sets have been integrated, which has significantly improved the overall visibility of its data.

The Organisation for Economic Co-operation and Development (OECD) also recently upgraded its statistics retrieval and display capabilities with the introduction of the Statistics from A-Z—Beta Version tool. Users can identify series with the use of keywords and obtain an instant retrieval of Excel files or real-time data in a variety of formats with capacity of production of tailored charts.⁷

Although the Eurostat applications on Google Public Data Explorer and the OECD-developed Statistics from A-Z represent new and interesting efforts in the international arena, other tools have been developed by private-sector businesses that have extensive track records of developing platforms and services for data publication. These include the Nesstar Publisher, Ivation Beyond 20/20, Socrata, and the Tableau system.

Tableau is a particularly interesting example of the state of the practice in data extraction. Like the Google Public Data Explorer product, it can be used to publish data for web-scale online use. In contrast, Tableau handles data with tens of millions of rows (which is smaller than high-end SQL databases but far exceeds the capability of Google Public Data Explorer), supports a wide variety of linked visualizations, and provides an easy-to-use graphical user interface for nonexpert users to publish data. Google Public Data Explorer provides an XML API, but no configuration tools.

____________

⁷See http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html [October 2011].

Moreover, Tableau supports not only visualizations but also direct downloads of data extracts and of derivative “print” works, such as reports and HTML tables. Nevertheless, Google’s ability to leverage its search engine dominance and redirect key search terms to Google Public Data Explorer data visualizations can provide publishers using this tool with unparalleled visibility among users.

State of the Practice in Data Sharing

Data sharing platforms go beyond data publication to allow the wider user community to comment on and correct data provided through the system, add value through integrated visualizations or tags, and even provide additional data for comparison and integration. At the time our report was being prepared, there was one open-source data sharing platform, the Dataverse Network. Several competing closed commercial platforms have been developed over the last several years, including the now-defunct Google Palimpsest, Graphwise, Swivel, Dabble, and Verifiable data sharing services, as well as the operational Data360, Factual, Many Eyes, and BuzzData services. The existing services that are listed are all of note for different reasons. More new services, such as FigShare and Numbrary, have emerged recently or are on the horizon but have yet to achieve significant uptake.

The Dataverse Network (DVN) software is the only open-source system currently available specifically designed for data sharing (King, 2007). It is designed to provide access to research data and to facilitate data sharing through standard/open tools, such as DDI, Dublin Core, and USMARC metadata; Z39.50, LOCKSS, and OAI-PMH search and harvesting; and Creative Commons licensing. It replaces the Virtual Data Center software, which was developed under the NSF DLI-2 program (Altman et al., 2001). It facilitates the public preservation and distribution of persistent, citeable, authorized, and verifiable research data, with powerful but easy-to-use technology. The project increases scholarly recognition and distributed control for authors, journals, and others who make data available; improves data access and analysis; and still enables professional archives to provide integrated preservation and other services. It is a leading example of standards-based open systems.

The Dataverse Network also serves as a federated catalog, allowing users to find and access data across dozens of remote sources, including the Inter-university Consortium for Political and Social Research, DataWeb, and the National Archives and Records Administration. Already accessible through the DVN is the largest collection of social science data in the world, through a partnership with the Data Preservation Alliance for the Social Sciences (Data-PASS) (Altman et al., 2009; Gutman et al., 2009). This includes integrated access to hundreds of large government data sets.

Of these systems, the Dataverse Network is unique in being designed to explicitly support long-term access and permanent preservation. To this end, the system supports best practices, such as format migration, human-understandable formats and metadata, persistent identifier assignment, and semantic fixity checking. In addition, many threats to long-term access can be fully addressed only by collaborative stewardship of content, and the system supports distributed, policy-based replication of its content across multiple collaborating institutions, to ensure the long-term stewardship of the data against budgetary and other institutional threats (see Altman and Crabtree, 2011).
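The idea behind semantic fixity checking, as opposed to a byte-level checksum, can be sketched as follows. This is an illustration under stated assumptions and not the algorithm used by the Dataverse Network: the canonical form (numbers rendered to seven significant digits, text trimmed) and all function names are hypothetical. The point is that the fingerprint is computed from the values rather than from the file, so it should survive a format migration.

```python
import csv
import hashlib
import io
import json
from typing import Iterable, List, Sequence


def canonical_cell(value) -> str:
    """Render a cell in a format-independent way (numbers to 7 significant digits)."""
    try:
        return format(float(value), ".6e")
    except (TypeError, ValueError):
        return str(value).strip()


def semantic_fingerprint(rows: Iterable[Sequence]) -> str:
    """Hash the canonical form of every cell, row by row."""
    digest = hashlib.sha256()
    for row in rows:
        line = "\x1f".join(canonical_cell(cell) for cell in row) + "\n"
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


def rows_from_csv(text: str) -> List[List[str]]:
    """Read rows from CSV text (all cells arrive as strings)."""
    return [row for row in csv.reader(io.StringIO(text))]


def rows_from_json(text: str) -> List[list]:
    """Read rows from a JSON array of arrays (cells keep their JSON types)."""
    return json.loads(text)
```

A CSV copy and a JSON copy of the same table produce the same fingerprint, whereas an ordinary file checksum would differ after any such migration.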

Making data available in machine-understandable formats using open standards and metadata also enables the media or other data redistributors to easily pick up the data and integrate it into their own specific visualization tools for further dissemination. This enhances the visibility of the data and allows a statistical agency to reach a much broader audience with tools specifically targeted for such audiences. As an example, The Guardian, a British newspaper, has published a visualization tool based on data from Eurostat that explains to European citizens “Who we are, how we live and what it costs.”⁸

Data360, created in 2004, is the oldest closed-source data sharing service still operational. Its stated aim was to make data available for better public policy. It now contains thousands of data sets and offers static and dynamic visualizations, direct access to data, and generated reports (Macdonald, 2009, p. 4).

Factual is a data manipulation tool developed in the commercial sector. It is closed source, runs as a proprietary service, and handles only moderate-sized databases. It extensively supports collaborative data manipulation in such functions as data linking, aggregation, and filtering, and it has extensive mashup support, with Google RESTful and Java JSON APIs for extraction and interrogation of data sets. It also integrates with Google charts and maps. It is a leading example of collaborative data editing. Factual contains a relatively small collection but has the aim of eventually loading all the Data.gov files.⁹ If this aim is achieved, several of the NCSES data files that reside in Data.gov will be available in this tool.

____________

⁸See http://www.guardian.co.uk/world/interactive/2011/mar/14/new-europe-statistics-interactive [November 2011].

⁹See http://www.factual.com/topic/government [November 2011].

Many Eyes is a website that permits users to enter their own data sets and produce tailored visualizations from a stock of sample visualizations on demand (Viegas, 2007). Many Eyes is largely uncurated, and as a result it hosts over 200,000 data sets, the vast majority of which are tiny, undocumented, and with unknown provenance. In part, this is because the goal of the site is not to create a data collection or archive but to make visualization a catalyst for discussion and collective insight about data. Many Eyes is particularly notable for its prototype work involving accessibility for people with disabilities. (In contrast, none of the other visualization tools described provides accessible components or analogs.) By employing a processing design that carefully separates data manipulation and data analysis from presentation (see, for example, Wilkinson et al., 2005) and deferring visualization to the final stage of the chain of computation, the Many Eyes prototype was able to offer powerful data manipulation and analysis functions that were potentially accessible to a visually impaired audience. Although this is not yet in production, it shows that data analytics for the visually impaired can go far beyond those typically offered.
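The design principle of separating analysis from presentation can be sketched in a few lines. This is not the Many Eyes prototype: the `TrendResult` type, the function names, and the sample figures are hypothetical, and the bar renderer assumes positive values. The analysis step produces a plain result object, and only the last step decides whether that result becomes a sentence suitable for a screen reader or a visual rendering.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TrendResult:
    """Presentation-neutral output of the analysis stage."""
    label: str
    points: List[Tuple[int, float]]  # (year, value), sorted by year
    change: float                    # last value minus first value


def analyze(label: str, series: Dict[int, float]) -> TrendResult:
    """Data manipulation and analysis only; no formatting decisions here."""
    points = sorted(series.items())
    return TrendResult(label, points, points[-1][1] - points[0][1])


def render_text(result: TrendResult) -> str:
    """Non-visual presentation: one sentence a screen reader can speak."""
    (first_year, first), (last_year, last) = result.points[0], result.points[-1]
    if result.change > 0:
        direction = "rose"
    elif result.change < 0:
        direction = "fell"
    else:
        direction = "was unchanged"
    return (f"{result.label} {direction} from {first:g} in {first_year} "
            f"to {last:g} in {last_year}.")


def render_bars(result: TrendResult, width: int = 20) -> str:
    """Visual presentation of the same result as simple text bars."""
    peak = max(value for _, value in result.points) or 1.0
    return "\n".join(f"{year} " + "#" * round(width * value / peak)
                     for year, value in result.points)
```

Because both renderers consume the same `TrendResult`, any analysis added upstream becomes available to visual and non-visual audiences at the same time.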

BuzzData is a relatively new entry to the data sharing offerings in which a community of interest for a data set is formed and each data set has tabs for tracking versions, visualizations, related articles, attachments, and comments. The idea is that users using the data will add value to the data set, thereby creating a social network around it (Howard, 2011).

Trends in Data Access Tools and Infrastructure

Data dissemination is a rapidly developing area, in which players, technologies, and vocations are changing rapidly. The above review of emerging public and private-sector tools reveals a number of general trends and patterns, which are summarized below:

In the private sector, no dominant business model, company, or commercial product has emerged. To the contrary, many commercial services in this area have failed, and business models for data sharing remain unclear.
The availability, usability, and features of third-party systems have raised user expectations for access to data. Increasingly, users are expecting access to data in real time and at a fine level of detail. They want access to data that are machine-understandable and that can be imported or mashed up using third-party services. Data.gov is a prime example of this trend applied to the public sector.
Mega-scale online analysis, social integration, metadata exchange of catalog information, collaboration features, and ad hoc support for data manipulation are “solved problems” and well within the state of the practice. However, many services fail to adhere to good practices.
Extremely powerful (peta-scale) online analysis, interactive statistical disclosure limitation, semantic harmonization, dynamic linking of data across different data sources with different data collection designs, and data analysis and browsing support for the visually impaired remain research problems.
None of the commercial services is designed with preservation or long-term access in mind.
Both private-sector and public production services currently available fall short of providing rich access to visually impaired users.

Overall, these patterns strongly suggest that NCSES should not adopt a single service or technology for data visualization and sharing, nor should it develop another bespoke system, but instead should make data available in open formats and protocols, and with sufficient documentation and metadata, to enable the easy inclusion of these data in third-party catalogs and services. It would benefit from exploring mashups (a mashup occurs when a web page or application uses and combines data, presentation, or functionality from two or more sources to create new services) with ongoing public-sector dissemination tool sets, such as DataWeb, in order to quickly transform its electronic dissemination platforms and refine its participation in government-wide portals (see Recommendation 3-4).
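A minimal sketch of this approach follows: a data set is published as plain CSV with a JSON metadata descriptor, and a third party combines two such packages on a shared key. The package layout, the file names, and the function names are hypothetical and do not describe any NCSES, DataWeb, or Data.gov interface.

```python
import csv
import io
import json
from typing import Dict, List, Tuple


def publish(rows: List[dict], fields: List[str], metadata: dict) -> Dict[str, str]:
    """Return an open-format package: CSV text plus a JSON metadata descriptor."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    descriptor = dict(metadata, fields=fields)
    return {
        "data.csv": buffer.getvalue(),
        "metadata.json": json.dumps(descriptor, sort_keys=True),
    }


def load(package: Dict[str, str]) -> Tuple[dict, List[dict]]:
    """Read a package back using only standard, widely available parsers."""
    meta = json.loads(package["metadata.json"])
    rows = list(csv.DictReader(io.StringIO(package["data.csv"])))
    return meta, rows


def mashup(package_a: Dict[str, str], package_b: Dict[str, str], key: str) -> List[dict]:
    """Join two published packages on a shared key, as a third party might."""
    _, rows_a = load(package_a)
    _, rows_b = load(package_b)
    index = {row[key]: row for row in rows_b}
    return [{**row, **index[row[key]]} for row in rows_a if row[key] in index]
```

Nothing in `mashup` depends on who published either package, which is the practical benefit of open formats paired with adequate metadata.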

DISSEMINATION BY MEANS OF GOVERNMENT-WIDE PORTALS

In addition to data dissemination through its own website and possible utilization of such tools as DataWeb, NCSES has options for disseminating data through two major government-wide initiatives. It has a presence through both portals, but they both fall short of serving as comprehensive platforms for featuring and disseminating S&E information in electronic form.

FedStats

An early, once-ambitious government-wide data access service, FedStats has been available online since 1997. FedStats is a portal that was designed to be a one-stop gateway through which users can retrieve a full range of official statistical information produced by the federal government without having to know in advance which federal agency produces which particular statistic. It has searching and linking capabilities to data from agencies that provide data and trend information on such topics as economic and population trends, crime, education, health care, S&E workforce and expenditures, farm production, and more. Data can be retrieved by searching by subject matter, program area, or agency.

NCSES has been a part of FedStats from the beginning. Currently, the tool drives a user who is searching by subject matter (topic) or press releases to the NCSES website, where the search continues using the existing NCSES search and retrieval tools. Searching by agency is a bit problematic: the site had not been updated to incorporate the new name of NCSES as of September 2011.

Data.gov

A promising new portal for disseminating federal government information in the form of raw data and applications (apps) has more recently been developed. Data.gov is a major component of a spate of recent open-government initiatives that have been designed to serve as a catalyst for increasing transparency. NCSES has been a member of this federal open-government initiative from its beginning in May 2009. The SESTAT tool is one of the apps that can be accessed through Data.gov, although the WebCASPAR, IRIS, and SED Tabulation Engine tools were not being made available through this portal at the time this report was being prepared.

Workshop presenter Alan Vander Mallie, program manager in the General Services Administration, stated that Data.gov aims to promote accountability and provide information for citizens on what their government is doing with tools to enable collaboration across all levels of government. It is a one-stop website for free access to data produced or held by the federal government, designed to make it easy to find, download, and use, including databases, data feeds, graphics, and other data visualizations.

Vander Mallie reported that, at its inception in 2009, Data.gov consisted of 47 raw data sets and 27 tools to assist in accessing the data in some of the complex data stores. At the time of the workshop, the program supported 2,895 raw data sets and 638 tools, which are accessed through raw data and tool catalogues. (The number of raw data sets and geographic data sets claimed on the Data.gov website home page had grown to nearly 390,000 by fall 2011.) This increase is primarily the result of linking and rebranding the Geospatial One Stop (Geodata.gov) service as part of the Data.gov site. The catalog of raw data sets (see http://explore.data.gov/catalog/raw/ [November 2011]) available has increased to roughly 3,602, based on a catalog search. Raw data are defined as machine-readable data at the lowest level of aggregation in structured data sets with multiple purposes. The raw data sets are designed to be mashed up, that is, linked and otherwise put in specific contexts using web programming techniques and technologies. Following the workshop, Socrata, which provides an open government software solution, introduced a new Data.gov website designed to help government agencies publish and distribute data in new ways, including interactive charts, maps, and lists. At the time this report was being prepared, this software was available only to participating government agencies and was not accessible to the panel.

In the future, Vander Mallie said, Data.gov is slated to continue to expand its coverage of data sets and tools and to continue to support communities of interest by building community pages that collect related data sets and other information to help users find data on a single topic in one location. One continuing objective is to make data available through the application programming interface, permitting the public and developers to directly source their data from Data.gov.

Expansion into the Semantic Web, an emerging standardized way of expressing the relationships between web pages so the meaning of hyperlinked information can be understood, is also part of the future plan for Data.gov. The objective is to enable the public and developers to create a new generation of “linked data” mashups. Working toward this goal, Data.gov has an indexed set of resource framework documents that are available and is working with the W3C to promote international standards for persistent government data (and metadata) on the web. Plans are also in place for expanding mobile applications, improved “meta-tagging” (short descriptions of an HTML web page that describe the content and facilitate implementation of standards to describe the data), and enhancing data visualization across agencies. In short, the idea is to give agencies a powerful new tool for disseminating their data and a one-stop locale for the public to access them. Efforts also exist to create government-wide or agency-specific data catalogs and dictionaries, which would be published along with the available data sets.

Suzanne Acar, senior information architect for the U.S. Department of the Interior and cochair of the Federal Data Architecture Subcommittee of the Chief Information Officer Council (see http://www.cio.gov [November 2011]), put the current and future Data.gov into context. She discussed the evolution of Enterprise Data/Information Management (EIM)—a framework of functions that can be tailored to fit the strategic information goals of any organization. For agencies like NSF to benefit from the capabilities of Web 2.0 and Web 3.0, it is important to ensure consistent quality of information and official designations of authoritative data sources.

While this report was being prepared, the future of Data.gov remained somewhat uncertain because of the threat of budget cuts (Lipowicz, 2011). Nonetheless, the development of Data.gov was heading in an additional direction, one that could be promising for improved dissemination of S&E data. The Office of Management and Budget is setting up a number of community-based, topic-specific Data.gov sites. The initial sites cover information on energy, law, and health.¹⁰ In conjunction with the Office of Science and Technology Policy, NCSES might consider setting up such a topic-specific site for the science and technology community, particularly as it is now a clearinghouse for data dissemination. Overall, the sense of the panel was that Data.gov was a useful channel for disseminating NCSES data, but that NCSES should not rely on it as the only solution for disseminating data in open formats and through open APIs.

____________

¹⁰See http://www.data.gov/energy; http://www.whitehouse.gov/blog/2011/06/30/invitation-our-latest-open-innovation-ecosystem-energydatagov [August 2011].

EXPANDING ACCESS TO THE NCSES DATABASE

In addition to making its database available to the public through the SESTAT, WebCASPAR, and IRIS tools as well as through FedStats and Data.gov, NCSES makes the microdata available under carefully controlled circumstances for download and use by outside organizations and developers. NCSES, like all federal agencies, is bound by the Privacy Act of 1974 to protect the confidentiality of the records it maintains about individuals. It is also bound by other statutory requirements for the protection of confidential statistical information: Title V of the 2002 E-Government Act, known as the Confidential Information Protection and Statistical Efficiency Act (CIPSEA), and the NSF’s own statutory provisions. These statutes require NCSES to establish protocols and procedures to protect the information the agency collects. In addition, CIPSEA requires that data collected under a pledge of confidentiality be used solely for statistical purposes and thus not be disclosed in identifiable form.

This confidentiality protection is afforded to the data in several ways. Some are fairly straightforward, such as deleting identifying information (such as name and address) from the records. In other cases, however, such straightforward methods may not be adequate. This is true for most of NCSES’s microdata files that contain information about individuals. In those cases, NCSES attempts to develop a public-use file that provides researchers with as much microdata as feasible, given the need to protect respondent confidentiality. It achieves this goal by suppressing selected fields and/or recoding variables. These suppressions, however, may render the resulting data of little use to analysts and researchers.
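
The two techniques named above, suppressing fields and recoding variables, can be illustrated with a small sketch. This is not NCSES’s actual procedure: the field names, the 10-year age bands, and the salary cap are all hypothetical choices made for the example.

```python
def make_public_record(record, suppress=("name", "address"), salary_cap=150_000):
    """Return a copy of `record` with identifying fields dropped and
    sensitive values coarsened (hypothetical rules, for illustration only)."""
    # Suppression: drop direct identifiers entirely.
    public = {key: value for key, value in record.items() if key not in suppress}

    # Recoding: replace an exact age with a 10-year band.
    if "age" in public:
        low = (public["age"] // 10) * 10
        public["age"] = f"{low}-{low + 9}"

    # Recoding (top-coding): cap unusually high salaries at a ceiling.
    if "salary" in public:
        public["salary"] = min(public["salary"], salary_cap)

    return public


# Hypothetical respondent record.
respondent = {
    "name": "Respondent A",
    "address": "1 Example Street",
    "age": 37,
    "salary": 210_000,
    "field": "chemistry",
}
public_record = make_public_record(respondent)
```

The sketch also shows the trade-off the text describes: each band or cap removes detail that an analyst might have needed.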

For cases in which NCSES believes that protecting respondent confidentiality would require such extensive recoding that the resulting file would have little, if any, research utility, the agency has developed a variety of methods to help researchers use the data. In some cases, researchers are able to state their needs for tabulations or other statistics with sufficient specificity that the necessary summary information can be provided without access to microdata. In other cases, NSF and the researcher can execute a license agreement that permits the researcher to use the data files at the NSF offices in Arlington, Virginia, or, under rigorously restricted conditions, at the researcher’s academic institution.

Microdata files for three surveys may be obtained under a license agreement with NSF: the Survey of Earned Doctorates, the Survey of Doctorate Recipients, and the National Survey of Recent College Graduates. The SESTAT Integrated Data File can also be obtained in this manner.

For two of these surveys—the Survey of Earned Doctorates and the Survey of Doctorate Recipients—plans are under way to provide authorized researchers with remote access to microdata using the most secure methods to protect confidentiality. This online environment is called the NORC Data Enclave. The enclave seeks to implement technological security, statistical protections, legal requirements, and researcher training in one package. The NORC Data Enclave intends to aid in preserving data for the long term by documenting the data using Data Documentation Initiative–compliant metadata standards. When implemented, the enclave intends to set up a research “collaboratory”—an arrangement that would develop a knowledge infrastructure around each data set, enabling geographically dispersed researchers to share information through wikis and blogs. This is an expanding and innovative program for the agency, one intended to both protect confidential data and enhance the usability of the data for research and analytical purposes.

Otherwise confidential data from the 2008 Business Research and Development and Innovation Survey (BRDIS), sponsored by NCSES and conducted by the U.S. Census Bureau, have been made available to qualified researchers on approved projects through the Census Bureau’s Research Data Centers (RDCs). This survey is a successor to the Survey of Industrial Research and Development. The data available in the RDC network cover business domestic and global R&D expenditures and workforce, collected from a nationally representative sample of about 40,000 companies in manufacturing and nonmanufacturing industries. There are plans to create an onsite RDC at NCSES so program staff can have access to the confidential data under controlled circumstances.

Although respondent privacy must be protected, the current NCSES approach is neither transparent nor apparently systematic. As the recent introduction of the SED Tabulation Engine illustrates, data from the same survey series may be split across different, nonintegrated systems. The private NCSES collection is not made available under a consistent set of terms of use (which vary by database) or through a consistent mechanism (i.e., some data sets are not available at all, some are available through the NORC enclave, and some only through the Census Bureau), nor are the methods of disclosure risk analysis publicly documented.

Statistical and technical methods for protecting confidentiality are rapidly changing. Maximizing research utility requires regular review of methods, consistent license agreements, and provision of data in many forms, including public-use data and restricted data enclaves (National Research Council, 2005).

In addition, the need to provide confidentiality in the present does not eliminate the responsibility to provide for long-term access. The risk of reidentification changes as time elapses. As discussed in Chapter 3, all NCSES data, even confidential data, should be stewarded for long-term access and permanent preservation.

REAL-TIME DISSEMINATION AS A GOAL

One of the most common user criticisms that the panel heard about the dissemination program was the length of time between the survey reference periods and when NCSES released data from those surveys. In an era when users are increasingly being treated to real-time or near-real-time economic and social information, the lengthy delays in publication of NCSES survey results are not very well understood. The lack of timeliness is discussed here as a dissemination issue, though, in reality, timeliness problems have to do more with data gathering, statistical methodology, and processing practices, some of which have been addressed in previous National Research Council reports (National Research Council, 2004, pp. 105, 114, 131, 147, 159-160; National Research Council, 2010, p. 21).

NCSES leadership reported to the panel that, over the years, the agency has taken initiatives to shorten publication time by reducing reliance on printed reports and making more use of relatively quick-turnaround formats, such as InfoBriefs. These have successfully put the major data series in the hands of users more quickly than in the past. However, users still have to wait too long after the reference period to get access to the detailed publication tabulations from a major NCSES survey that are necessary for sophisticated analysis; for example, detailed data from the new Survey of Industrial Research and Development for the years 2006 and 2007 were released in June 2011, a year after less detailed summaries of data from the BRDIS for 2008 were released in May 2010.

The delays in other reports, as indicated by new releases announced on the NCSES website, are similarly problematic:

Science and Engineering Research Facilities: Fiscal Year 2007 (released September 23, 2011)
Characteristics of Scientists and Engineers in the United States: 2006 (released September 14, 2011)
U.S. Exports of Advanced Technology Products Declined Less Than Other U.S. Exports in 2009 (released September 1, 2011)
Science and Engineering Doctorate Awards: 2007-2008 (released August 22, 2011)
Industrial Research and Development Information System (IRIS) 1953-2007 data (released July 26, 2011)
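
To make these delays concrete, the lag for two of the calendar-year releases listed above can be computed directly. The sketch assumes, for illustration only, that each reference period ends on December 31 of the stated year; the release dates are those given in the list.

```python
from datetime import date


def full_months_between(start, end):
    """Count whole months elapsed from `start` to `end`."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the final month is not yet complete
    return months


# (assumed end of reference period, release date from the list above)
releases = {
    "Characteristics of Scientists and Engineers: 2006": (date(2006, 12, 31), date(2011, 9, 14)),
    "Exports of Advanced Technology Products: 2009": (date(2009, 12, 31), date(2011, 9, 1)),
}

lags = {title: full_months_between(ref, rel) for title, (ref, rel) in releases.items()}
```

Under that assumption, the lags come to 56 and 20 full months, respectively.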

As mentioned earlier in this chapter, the shift to provision of data in electronic format has been simply a digitization of previously manual products. The format for the website database is a replication of the old tables that found their way into the printed publications, so the laborious and time-consuming processes that were required for production of the manual products are still necessary. Another source of the timeliness problem is that NCSES has largely shifted to electronic dissemination but without systematic machine-understandable metadata and change control. This means that a great deal of NCSES time still must be spent painstakingly checking and formatting the data for print and electronic publication to ensure the accuracy and reliability of the published products. For example, each page of the hard copy must be checked by someone looking at the source data. This effort comes at the expense of ensuring data integrity at the source, and it takes an inordinate amount of scarce staff time.
