2

The Current Dissemination Program

The current dissemination program of the National Center for Science and Engineering Statistics (NCSES) is wide-ranging and multifaceted. In order to fulfill its mandate to serve as collector and distributor of information about the science and engineering enterprise for the National Science Foundation (NSF), this relatively small, resource-constrained statistical agency1 disseminates its publishable data in several formats (hard-copy, mixed, and electronic-only publications); maintains an extensive website; makes its data available for retrieval from the consolidated FedStats database and through the Data.gov portal; provides access to confidential microdata in a protected environment for research purposes; and supports provision of three online communal tools that are used to retrieve data from the NCSES database: the Integrated Science and Engineering Resources Data System (WebCASPAR), the Scientists and Engineers Statistical Data System (SESTAT), and the less known Industrial Research and Development Information System (IRIS) (see Table 2-1).

These diverse outputs and self-maintained tools serve a broad community of information users with widely different data needs, ranging from one-time casual to recurring, highly sophisticated and widely divergent levels of statistical knowledge that extend from rudimentary to very knowledgeable. The user community also has quite different access preferences, as attested by the users who discussed their uses of the data with the panel

____________

1 The 2011 budget for NCSES was $41.5 million, down from $45.7 million in fiscal year 2009 and $41.9 million in fiscal year 2010. The agency has only 45 full-time permanent staff members, of whom 21 are statisticians.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

TABLE 2-1 Summary of Selected Characteristics of NSF Science and Engineering Surveys

Survey: Survey of Earned Doctorates
Current Contractor: National Opinion Research Center (NORC)
Database Retrieval Tool/Publication: WebCASPAR; InfoBriefs; Science and Engineering Degrees; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Doctorate Recipients from U.S. Universities: Summary Report; Academic Institutional Profiles
Variables Available: Academic institution of doctorate; baccalaureate-origin institution (United States and foreign); birth year; citizenship status at graduation; country of birth and citizenship; disability status; educational attainment of parents; educational history in college; field of degrees (N = 292); graduate and undergraduate educational debt; marital status, number/age of dependents; postgraduation plans (work, postdoctorate, other study/training); primary and secondary work activities; source and type of financial support for postdoctoral study/research; type and location of employer; race/ethnicity; sex; sources of financial support during graduate school; type of academic institution (e.g., historically black institutions, Carnegie codes, control) awarding the doctorate
Availability of Microdata: Access to restricted microdata can be arranged through a licensing agreement. A secure data access facility/data enclave providing restricted microdata access is under development with NORC.
Series Initiated/Archiving: 1957 (conducted annually, limited data available 1920-1956)

Survey: Survey of Graduate Students and Postdoctorates in Science and Engineering
Current Contractor: RTI International
Database Retrieval Tool/Publication: WebCASPAR; InfoBriefs; Graduate Students and Postdoctorates in Science and Engineering; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Academic Institutional Profiles
Variables Available: The number and characteristics of graduate students; postdoctoral appointees; and doctorate-holding nonfaculty researchers in science, engineering, and health (SEH) fields
Availability of Microdata: Data for the years 1972–2008 are available in a public-use file format.
Series Initiated/Archiving: 1975 (conducted annually)

TABLE 2-1 Continued

Survey: Survey of Doctorate Recipients
Current Contractor: NORC
Database Retrieval Tool/Publication: SESTAT; InfoBriefs; Characteristics of Doctoral Scientists and Engineers in the United States; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Science and Engineering State Profiles
Variables Available: Citizenship status; country of birth; country of citizenship; date of birth; disability status; educational history (for each degree held: field, level, institution, when received); employment status (unemployed, employed part time, or employed full time); geographic place of employment; marital status; number of children; occupation (current or past job); primary work activity (e.g., teaching, basic research, etc.); postdoctorate status (current and/or three most recent postdoctoral appointments); race/ethnicity; salary; satisfaction and importance of various aspects of job; school enrollment status; sector of employment (e.g., academia, industry, government, etc.); sex; work-related training
Availability of Microdata: Access to restricted data for researchers interested in analyzing microdata can be arranged through a licensing agreement. The data are available online through the enclave arrangement discussed above.
Series Initiated/Archiving: 1973 (conducted biennially)

Survey: National Survey of Recent College Graduates
Current Contractor: Mathematica Policy Research, Inc. and Census Bureau
Database Retrieval Tool/Publication: SESTAT; InfoBriefs; Characteristics of Recent Science and Engineering Graduates; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering
Variables Available: For individuals who recently received bachelor’s or master’s degrees in an SEH field from a U.S. institution: age; citizenship status; country of birth; country of citizenship; disability status; educational history (for each degree held: field, level, when received); employment status (unemployed, employed part time, or employed full time); educational attainment of parents; financial support and debt amount for undergraduate and graduate degree; geographic place of employment; marital status; number of children; occupation (current or previous job); place of birth; work activity (e.g., teaching, basic research, etc.); race/ethnicity; salary; overall satisfaction with principal job; school enrollment status; sector of employment (e.g., academia, industry, government, etc.); sex; work-related training
Availability of Microdata: Access to restricted data for researchers interested in analyzing microdata can be arranged through a licensing agreement.
Series Initiated/Archiving: 1976 (conducted biennially)

TABLE 2-1 Continued

Survey: National Survey of College Graduates
Current Contractor: Census Bureau
Database Retrieval Tool/Publication: SESTAT; InfoBriefs; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering
Variables Available: For individuals holding a bachelor’s or higher degree in any field: academic employment (position, rank, and tenure); age; citizenship status; country of birth; country of citizenship; disability status; educational history (for each degree held: field, level, when received); employment status (unemployed, employed full time, or employed part time); geographic place of employment; immigrant module (year of entry, type of entry visa, reason(s) for coming to the United States, etc.); labor force status; marital status; number of children; occupation (current or past job); primary work activity (e.g., teaching, basic research, etc.); publication and patent activities; race/ethnicity; salary; satisfaction and importance of various aspects of job; school enrollment status; sector of employment (academia, industry, government); sex; work-related training
Availability of Microdata: Public-use data files are available upon request.
Series Initiated/Archiving: 1962 (conducted biennially)

Survey: Business Research and Development and Innovation Survey (BRDIS)
Current Contractor: Census Bureau
Database Retrieval Tool/Publication: IRIS; InfoBrief; Business and Industrial R&D; Science and Engineering Indicators; National Patterns of Research and Development Resources; Science and Engineering State Profiles
Variables Available: Financial measures of research and development (R&D) activity; company R&D activity funded by others; R&D employment; R&D management and strategy; and intellectual property, technology transfer, and innovation
Availability of Microdata: Census Research Data Centers
Series Initiated/Archiving: 1953 (conducted annually); a new series began in 2008 when the survey was changed

Survey: Survey of Federal Funds for Research and Development
Current Contractor: Synectics for Management Decisions, Inc.
Database Retrieval Tool/Publication: WebCASPAR; InfoBrief; Federal Funds for Research and Development; Science and Engineering State Profiles; Science and Engineering Indicators; National Patterns of Research and Development Resources
Variables Available: Federal obligations by the following key variables: character of work; basic research; applied research; development; federal agency; federally funded research and development centers (FFRDCs); field of science and engineering; geographic location (within the United States and foreign country); performer (type of organization doing the work); R&D plant. Federal outlays by: character of work, basic research, applied research, development, R&D plant
Availability of Microdata: Data tables
Series Initiated/Archiving: 1952 (conducted annually)

TABLE 2-1 Continued

Survey: Survey of Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions
Current Contractor: Synectics for Management Decisions, Inc.
Database Retrieval Tool/Publication: WebCASPAR; InfoBrief; Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions; Science and Engineering State Profiles; Science and Engineering Indicators; National Patterns of Research and Development Resources
Variables Available: Data by federal agency, academic institutions and location: R&D; fellowships, traineeships, and training grants; R&D plant; facilities and equipment for instruction in science and engineering; general support for science and engineering; type of academic institution (i.e., historically black colleges and universities [HBCUs], tribal institutions, high-Hispanic-enrollment institutions, minority institutions); type of institutional control (public versus private)
Availability of Microdata: Data tables only
Series Initiated/Archiving: 1965 (conducted annually)

Survey: Survey of R&D Expenditures at Federally Funded R&D Centers (FFRDCs)
Current Contractor: ICF Macro
Database Retrieval Tool/Publication: WebCASPAR; InfoBrief; R&D Expenditures at Federally Funded R&D Centers; Academic Research and Development Expenditures; Science and Engineering Indicators; National Patterns of Research and Development Resources
Variables Available: FFRDC R&D expenditures by source of funds (federal, state and local, industry, institutional, or other); and character of work (basic research, applied research, or development)
Availability of Microdata: Data tables only
Series Initiated/Archiving: 1965 (conducted annually)

Survey: Survey of Research and Development Expenditures at Universities and Colleges
Current Contractor: ICF Macro
Database Retrieval Tool/Publication: WebCASPAR; InfoBrief; Academic Research and Development Expenditures; Science and Engineering Indicators; National Patterns of Research and Development Resources; Science and Engineering State Profiles; Academic Institutional Profiles
Variables Available: Institution R&D expenditures by source of funds (federal, state and local, industry, institutional, or other); character of work (basic research versus applied research and development); pass throughs to subrecipients; receipts as a subrecipient; S&E field; non-S&E field; R&D equipment expenditures by S&E field; federal agency; type of degree granted, HBCU, public or private control; geographic location (within the United States)
Availability of Microdata: Data tables (selected items)
Series Initiated/Archiving: 1972 (conducted annually, limited data available for various years for 1954-1970)

Survey: Survey of State Research and Development Expenditures
Current Contractor: Census Bureau
Database Retrieval Tool/Publication: InfoBrief; State Government R&D Expenditures; Science and Engineering Indicators
Variables Available: State agency or department; state R&D expenditures; internal performers; external performers; basic research; source of funds (federal, state, other); R&D facilities
Availability of Microdata: Data tables
Series Initiated/Archiving: 1964 (conducted occasionally)

TABLE 2-1 Continued

Survey: Survey of Science and Engineering Research Facilities
Current Contractor: RTI International
Database Retrieval Tool/Publication: WebCASPAR; Scientific and Engineering Research Facilities; Science and Engineering Indicators
Variables Available: Status of research facilities at academic institutions and nonprofit biomedical research organizations and hospitals by: amount and type of science and engineering research space; current expenditures for projects to construct and repair/renovate research facilities; condition of research facilities; planned construction and repair/renovation of research facilities; source of funds (federal, state and local, institutional) for construction and repair/renovation of research facilities; research animal facilities; bandwidth speeds and high performance network connections; fiber; high performance computing; wireless connections
Availability of Microdata: Microdata from this survey for the years 1988-2001 are not available.
Series Initiated/Archiving: 1986 (conducted biennially)

Survey: Survey of Public Attitudes Toward and Understanding of Science and Technology
Current Contractor: NORC, via a science and technology module on the General Social Survey
Database Retrieval Tool/Publication: Science and Engineering Indicators
Variables Available: Demographic, behavioral, and attitudinal by how information about S&T is obtained; interest in science-related issues; visits to informal science institutions; S&T knowledge; attitudes toward science-related issues
Availability of Microdata: Data tables
Series Initiated/Archiving: ICPSR, 1979-2001; CD, 1979-2004; (conducted biennially)

(see Chapter 5). With limited resources, NCSES attempts to be all things to all users, and because it is spread so thinly, the panel has serious concerns about whether these outputs and tools are optimized for all the tasks to which they are addressed, as well as about whether NCSES is using the most up-to-date technologies and processes to best advantage for the user community.

In this chapter, we assess the status of the NCSES dissemination program. First, we describe the remaining hard-copy publications. We then review the NCSES user interface tools, including WebCASPAR, SESTAT, and IRIS, through which individuals are able to directly access and retrieve tailored outputs from the database. Then we discuss the structure of the databases and their current presentation on the web for downloading and use by third parties. We assess the current status of the program in light of the emerging practices for electronic dissemination, primarily the development of the Semantic Web as a way to facilitate access to information on the Internet. We provide examples of semantic web systems in federal agencies and the possibilities for development of a semantic web structure for science and engineering (S&E) information on the Internet. Finally, we consider the important issue of timeliness, a subject of great concern for users of NCSES data, and the possibility of moving the release and distribution of S&E data to a real-time basis.

TRADITIONAL FORMAT PUBLICATIONS

NCSES continues a few publications using a print-based approach and still has a customer base for them, although that customer base seems to be declining over time. Moreover, although most retrieval of NCSES information is by electronic means, a large part of the offerings are simply electronic depictions of previous hard-copy publications. It is fair to say that NCSES continues to manage its publications program in much the same way as it traditionally has, although the finished products, for the most part, are now sent to the website for posting rather than to a print-

Moreover, Tableau supports not only visualizations but also direct downloads of data extracts and of derivative “print” works, such as reports and HTML tables. Nevertheless, Google’s ability to leverage its search engine dominance and redirect key search terms to Google Public Data Explorer data visualizations can provide publishers using this tool with unparalleled visibility among users.

State of the Practice in Data Sharing

Data sharing platforms go beyond data publication to allow the wider user community to comment on and correct data provided through the system, add value through integrated visualizations or tags, and even provide additional data for comparison and integration. At the time our report was being prepared, there was one open-source data sharing platform, the Dataverse Network. Several competing closed commercial platforms have been developed over the last several years, including the now-defunct Google Palimpsest, Graphwise, Swivel, Dabble, and Verifiable data sharing services, as well as the operational Data360, Factual, Many Eyes, and BuzzData services. Each of the operational services listed is of note for a different reason. Newer services, such as FigShare and Numbrary, have emerged recently or are on the horizon but have yet to achieve significant uptake.

The Dataverse Network (DVN) software is the only open-source system currently available that is specifically designed for data sharing (King, 2007). It is designed to provide access to research data and to facilitate data sharing through standard/open tools, such as DDI, Dublin Core, and USMARC metadata; Z39.50, LOCKSS, and OAI-PMH search and harvesting; and Creative Commons licensing. It replaces the Virtual Data Center software, which was developed under the NSF DLI-2 program (Altman et al., 2001). It facilitates the public preservation and distribution of persistent, citeable, authorized, and verifiable research data, with powerful but easy-to-use technology. The project increases scholarly recognition and distributed control for authors, journals, and others who make data available; improves data access and analysis; and still enables professional archives to provide integrated preservation and other services. It is a leading example of standards-based open systems.

The Dataverse Network also serves as a federated catalog, allowing users to find and access data across dozens of remote sources, including the Inter-university Consortium for Political and Social Research, DataWeb, and the National Archives and Records Administration. Already accessible through the DVN is the largest collection of social science data in the world, through a partnership with the Data Preservation Alliance for the Social Sciences (Data-PASS) (Altman et al., 2009; Gutman et al., 2009). This includes integrated access to hundreds of large government data sets.
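
The harvesting and metadata standards named above can be made concrete with a small sketch. The Python below is illustrative only and is not drawn from the Dataverse Network software: it builds an OAI-PMH ListRecords request and reads Dublin Core titles from a response. The endpoint URL and the sample record are hypothetical placeholders; only the verb and metadataPrefix parameters and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def extract_titles(response_xml):
    """Return the Dublin Core titles found in an OAI-PMH response document."""
    root = ET.fromstring(response_xml)
    return [element.text for element in root.iterfind(".//dc:title", NS)]


# Hypothetical response, trimmed to a single placeholder record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example survey data set (placeholder)</dc:title>
          <dc:identifier>hdl:0000/EXAMPLE</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# The repository address below is a made-up example, not a real service.
request_url = list_records_url("https://repository.example.org/oai")
titles = extract_titles(SAMPLE_RESPONSE)
```

Because the request format and the metadata vocabulary are both standardized, the same few lines would, in principle, work against any repository that exposes an OAI-PMH endpoint, which is the kind of third-party reuse that standards-based open systems are meant to allow.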

Of these systems, the Dataverse Network is unique in being designed to explicitly support long-term access and permanent preservation. To this end, the system supports best practices, such as format migration, human-understandable formats and metadata, persistent identifier assignment, and semantic fixity checking. In addition, many threats to long-term access can be fully addressed only by collaborative stewardship of content, and the system supports distributed, policy-based replication of its content across multiple collaborating institutions, to ensure the long-term stewardship of the data against budgetary and other institutional threats (see Altman and Crabtree, 2011).

Making data available in machine-understandable formats using open standards and metadata also enables the media or other data redistributors to easily pick up the data and integrate them into their own specific visualization tools for further dissemination. This enhances the visibility of the data and allows a statistical agency to reach a much broader audience with tools specifically targeted for such audiences. As an example, The Guardian, a British newspaper, has published a visualization tool based on data from Eurostat that explains to European citizens “Who we are, how we live and what it costs.”8

Data360, created in 2004, is the oldest closed-source data sharing service still operational. Its stated aim was to make data available for better public policy. It now contains thousands of data sets and offers static and dynamic visualizations, direct access to data, and generated reports (Macdonald, 2009, p. 4).

Factual is a data manipulation service developed in the commercial sector. It is closed source, runs as a proprietary service, and handles only moderate-sized databases. It extensively supports collaborative data manipulation in such functions as data linking, aggregation, and filtering, and it has extensive mashup support, with Google RESTful and Java JSON APIs for extraction and interrogation of data sets. It also integrates with Google charts and maps. It is a leading example of collaborative data editing. Factual contains a relatively small collection but has the aim of eventually loading all the Data.gov files.9 If this aim is achieved, several of the NCSES data files that reside in Data.gov will be available in this tool.

Many Eyes is a website that permits users to enter their own data sets and produce tailored visualizations from a stock of sample visualizations on demand (Viegas, 2007). Many Eyes is largely uncurated, and as a result it hosts over 200,000 data sets, the vast majority of which are tiny, undocumented, and of unknown provenance. In part, this is because the goal of the site is not to create a data collection or archive but to make visualization a catalyst for discussion and collective insight about data. Many Eyes is particularly notable for its prototype work involving accessibility for people with disabilities. (In contrast, none of the other visualization tools described provides accessible components or analogs.) By employing a processing design that carefully separates data manipulation and data analysis from presentation (see, for example, Wilkinson et al., 2005) and deferring visualization to the final stage of the chain of computation, the Many Eyes prototype was able to offer powerful data manipulation and analysis functions that were potentially accessible to a visually impaired audience. Although this is not yet in production, it shows that data analytics for the visually impaired can go far beyond those typically offered.

BuzzData is a relatively new entry to the data sharing offerings in which a community of interest for a data set is formed and each data set has tabs for tracking versions, visualizations, related articles, attachments, and comments. The idea is that users of the data will add value to the data set, thereby creating a social network around it (Howard, 2011).

____________

8 See http://www.guardian.co.uk/world/interactive/2011/mar/14/new-europe-statistics-interactive [November 2011].
9 See http://www.factual.com/topic/government [November 2011].

Trends in Data Access Tools and Infrastructure

Data dissemination is a rapidly developing area, in which players, technologies, and vocations are changing rapidly. The above review of emerging public and private-sector tools reveals a number of general trends and patterns, which are summarized below:

• In the private sector, no dominant business model, company, or commercial product has emerged. To the contrary, many commercial services in this area have failed, and business models for data sharing remain unclear.
• The availability, usability, and features of third-party systems have raised user expectations for access to data.
Increasingly, users are expecting access to data in real time and at a fine level of detail. They want access to data that are machine-understandable and that can be imported or mashed up using third-party services. Data.gov is a prime example of this trend applied to the public sector.
• Mega-scale online analysis, social integration, metadata exchange of catalog information, collaboration features, and ad hoc support for data manipulation are “solved problems” and well within the state of the practice. However, many services fail to adhere to good practices.
• Extremely powerful (peta-scale) online analysis, interactive statistical disclosure limitation, semantic harmonization, dynamic linking of data across different data sources with different data collection designs, and data analysis and browsing support for the visually impaired remain research problems.
• None of the commercial services is designed for preservation or long-term access.
• Both private-sector and public production services currently available fall short of providing rich access to visually impaired users.

Overall, these patterns strongly suggest that NCSES should not adopt a single service or technology for data visualization and sharing, nor should it develop another bespoke system, but instead should make data available in open formats and protocols, and with sufficient documentation and metadata, to enable the easy inclusion of these data in third-party catalogs and services. It would benefit from exploring mashups (a mashup occurs when a web page or application uses and combines data, presentation, or functionality from two or more sources to create new services) with ongoing public-sector dissemination tool sets, such as DataWeb, in order to quickly transform its electronic dissemination platforms and refine its participation in government-wide portals (see Recommendation 3-4).

DISSEMINATION BY MEANS OF GOVERNMENT-WIDE PORTALS

In addition to data dissemination through its own website and possible utilization of such tools as DataWeb, NCSES has options for disseminating data through two major government-wide initiatives. It has a presence through both portals, but they both fall short of serving as comprehensive platforms for featuring and disseminating S&E information in electronic form.

FedStats

An early, once-ambitious government-wide data access service, FedStats has been available online since 1997. FedStats is a portal that was designed to be a one-stop gateway through which users can retrieve a full range of official statistical information produced by the federal government without having to know in advance which federal agency produces which particular statistic.
It has searching and linking capabilities to data from agencies that provide data and trend information on such topics as economic and population trends, crime, education, health care, S&E workforce and expenditures, farm production, and more. Data can be retrieved by searching by subject matter, program area, or agency.

NCSES has been a part of FedStats from the beginning. Currently, the tool directs a user who is searching by subject matter (topic) or press releases to the NCSES website, where the search continues using the existing NCSES search and retrieval tools. Searching by agency is a bit problematic: the site had not been updated to incorporate the new name of NCSES as of September 2011.

Data.gov

A promising new portal for disseminating federal government information in the form of raw data and applications (apps) has more recently been developed. Data.gov is a major component of a spate of recent open-government initiatives that have been designed to serve as a catalyst for increasing transparency. NCSES has been a member of this federal open-government initiative from its beginning in May 2009. The SESTAT tool is one of the apps that can be accessed through Data.gov, although the WebCASPAR, IRIS, and SED Tabulation Engine tools were not being made available through this portal at the time this report was being prepared.

Workshop presenter Alan Vander Mallie, program manager in the General Services Administration, stated that Data.gov aims to promote accountability and provide information for citizens on what their government is doing, with tools to enable collaboration across all levels of government. It is a one-stop website for free access to data produced or held by the federal government, designed to make the data easy to find, download, and use; its holdings include databases, data feeds, graphics, and other data visualizations.

Vander Mallie reported that, at its inception in 2009, Data.gov consisted of 47 raw data sets and 27 tools to assist in accessing the data in some of the complex data stores. At the time of the workshop, the program supported 2,895 raw data sets and 638 tools, which are accessed through raw data and tool catalogues. (The number of raw data sets and geographic data sets claimed on the Data.gov website home page had grown to nearly 390,000 by fall 2011.) This increase is primarily the result of linking and rebranding the Geospatial One Stop (Geodata.gov) service as part of the Data.gov site.
The catalog of raw data sets (see http://explore.data.gov/ catalog/raw/ [November 2011]) available has increased to roughly 3,602, based on a catalog search. Raw data are defined as machine-readable data at the lowest level of aggregation in structured data sets with multiple purposes. The raw data sets are designed to be mashed up— that is, linked and otherwise put in specific contexts using web programming techniques and technologies. Following the workshop, Socrata, which provides an open government software solution, has introduced a new Data.gov website designed to help government agencies publish and distribute data in new ways, including interactive charts, maps, and lists. At the time this report was being prepared, this software was available only to participating gov- ernment agencies and was not accessible to the panel. In the future, Vander Mallie said, Data.gov is slated to continue to

expand its coverage of data sets and tools and to continue to support communities of interest by building community pages that collect related data sets and other information to help users find data on a single topic in one location. One continuing objective is to make data available through an application programming interface (API), permitting the public and developers to directly source their data from Data.gov.

Expansion into the Semantic Web, an emerging standardized way of expressing the relationships between web pages so the meaning of hyperlinked information can be understood, is also part of the future plan for Data.gov. The objective is to enable the public and developers to create a new generation of “linked data” mashups. Working toward this goal, Data.gov has an indexed set of resource framework documents that are available and is working with the W3C to promote international standards for persistent government data (and metadata) on the web. Plans are also in place for expanding mobile applications, improving “meta-tagging” (short descriptions of an HTML web page that describe the content and facilitate implementation of standards to describe the data), and enhancing data visualization across agencies. In short, the idea is to give agencies a powerful new tool for disseminating their data and a one-stop locale for the public to access them. Efforts also exist to create government-wide or agency-specific data catalogs and dictionaries, which would be published along with the available data sets.

Suzanne Acar, senior information architect for the U.S. Department of the Interior and cochair of the Federal Data Architecture Subcommittee of the Chief Information Officer Council (see http://www.cio.gov [November 2011]), put the current and future Data.gov into context.
She discussed the evolution of Enterprise Data/Information Management (EIM), a framework of functions that can be tailored to fit the strategic information goals of any organization. For agencies like NSF to benefit from the capabilities of Web 2.0 and Web 3.0, it is important to ensure consistent quality of information and official designations of authoritative data sources.

While this report was being prepared, the future of Data.gov remained somewhat uncertain because of the threat of budget cuts (Lipowicz, 2011). Nonetheless, the development of Data.gov was heading in an additional direction, one that could be promising for improved dissemination of S&E data. The Office of Management and Budget is setting up a number of community-based, topic-specific Data.gov sites. The initial sites cover information on energy, law, and health.10

____________

10 See http://www.data.gov/energy; http://www.whitehouse.gov/blog/2011/06/30/invitation-our-latest-open-innovation-ecosystem-energydatagov [August 2011].

In conjunction with the Office of Science and Technology Policy, NCSES might consider setting up such a topic-specific site for the science and technology community, particularly

as it is now a clearinghouse for data dissemination. Overall, the sense of the panel was that Data.gov was a useful channel for disseminating NCSES data, but that NCSES should not rely on it as the only solution for disseminating data in open formats and through open APIs.

EXPANDING ACCESS TO THE NCSES DATABASE

In addition to making its database available to the public through the SESTAT, WebCASPAR, and IRIS tools as well as through FedStats and Data.gov, NCSES makes the microdata available under carefully controlled circumstances for download and use by outside organizations and developers. NCSES, like all federal agencies, is bound by the Privacy Act of 1974 to protect the confidentiality of the records it maintains about individuals. It is also bound by other statutory requirements for the protection of confidential statistical information: Title V of the 2002 E-Government Act, known as the Confidential Information Protection and Statistical Efficiency Act (CIPSEA), and NSF’s own statutory provisions. These statutes require NCSES to establish protocols and procedures to protect the information the agency collects. In addition, CIPSEA requires that data collected under a pledge of confidentiality be used solely for statistical purposes and thus not be disclosed in identifiable form.

This confidentiality protection is afforded to the data in several ways. Some methods are fairly straightforward, such as deleting identifying information (for example, name and address) from the records. In other cases, however, such straightforward methods may not be adequate. This is true for most of NCSES’s microdata files that contain information about individuals. In those cases, NCSES attempts to develop a public-use file that provides researchers with as much microdata as feasible, given the need to protect respondent confidentiality. It achieves this goal by suppressing selected fields and/or recoding variables.
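The suppression and recoding steps described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the procedure NCSES actually uses: the field names, the age bands, and the salary top-code threshold are all hypothetical and are chosen purely for the example.

```python
from typing import Any

# Hypothetical field names; none come from an actual NCSES file layout.
DIRECT_IDENTIFIERS = {"name", "address"}
SALARY_TOP_CODE = 150_000  # illustrative threshold, not an NCSES rule


def recode_age(age: int) -> str:
    """Collapse an exact age into a coarse band (recoding)."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50 or older"


def make_public_use(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of one record with direct identifiers suppressed
    and selected variables recoded."""
    # Suppression: drop fields that identify the respondent directly.
    public = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Recoding: replace exact values with coarser ones.
    public["age"] = recode_age(record["age"])
    public["salary"] = min(record["salary"], SALARY_TOP_CODE)  # top-coding
    return public


# A made-up restricted record, used only to exercise the functions.
restricted = {"name": "A. Respondent", "address": "1 Example St.",
              "age": 34, "salary": 210_000, "field": "chemistry"}
public = make_public_use(restricted)
print(public)
```

Even in this toy case the trade-off discussed in the text is visible: the coarser the bands and the lower the top-code, the smaller the disclosure risk and the less an analyst can learn from the file.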
These suppressions, however, may render the resulting data of little use to analysts and researchers.

When NCSES believes that protection of respondent confidentiality would require such extensive recoding that the resulting file would have little, if any, research utility, the agency uses a variety of methods to assist individuals in using the data. In some cases, researchers are able to state their needs for tabulations or other statistics with sufficient specificity that the necessary summary information can be provided without the need for access to microdata. In other cases, NSF and the researcher can execute a license agreement that permits the researcher to use the data files at the NSF offices in Arlington, Virginia, or, under rigorously restricted conditions, at the researcher’s academic institution.

Microdata files for three surveys may be obtained under a license agreement with NSF: the Survey of Earned Doctorates, the Survey of Doctorate

Recipients, and the National Survey of Recent College Graduates. The SESTAT Integrated Data File can also be obtained in this manner.

For two of these surveys, the Survey of Earned Doctorates and the Survey of Doctorate Recipients, plans are under way to provide authorized researchers with remote access to microdata using the most secure methods to protect confidentiality. This online environment is called the NORC Data Enclave. The enclave seeks to implement technological security, statistical protections, legal requirements, and researcher training in one package. The NORC Data Enclave intends to aid in preserving data for the long term by documenting the data using Data Documentation Initiative–compliant metadata standards. When implemented, the enclave intends to set up a research “collaboratory,” an arrangement that would develop a knowledge infrastructure around each data set, enabling geographically dispersed researchers to share information through wikis and blogs. This is an expanding and innovative program for the agency, one intended both to protect confidential data and to enhance the usability of the data for research and analytical purposes.

Otherwise confidential data from the 2008 Business Research and Development and Innovation Survey (BRDIS), sponsored by NCSES and conducted by the U.S. Census Bureau, have been made available to qualified researchers on approved projects through the Census Bureau’s Research Data Centers (RDCs). This survey is a successor to the Survey of Industrial Research and Development. Data available in the RDC network cover business domestic and global R&D expenditures and workforce, collected from a nationally representative sample of about 40,000 companies in manufacturing and nonmanufacturing industries. There are plans to create an onsite RDC at NCSES so program staff can have access to the confidential data under controlled circumstances.
Although respondent privacy must be protected, the current NCSES approach is not transparent, nor does it appear systematic. As the recent introduction of the SED Tabulation Engine illustrates, data from the same survey series may be split across different, nonintegrated systems. The private NCSES collection is not made available under a consistent set of terms of use (which vary by database), nor through a consistent mechanism (i.e., some data sets are not available at all, some are available through the NORC enclave, and some only through the Census Bureau), nor are the methods of disclosure risk analysis that are used publicly documented.

Statistical and technical methods for protecting confidentiality are rapidly changing. Maximizing research utility requires a regular review of methods, consistent license agreements, and providing data in many forms, including public-use data and restricted data enclaves (National Research Council, 2005). In addition, the need to provide confidentiality in the present does

not eliminate the responsibility to provide for long-term access. The risk of reidentification changes as time elapses. As discussed in Chapter 3, all NCSES data, even confidential data, should be stewarded for long-term access and permanent preservation.

REAL-TIME DISSEMINATION AS A GOAL

One of the most common user criticisms that the panel heard about the dissemination program was the length of time between the survey reference periods and when NCSES released data from those surveys. In an era when users are increasingly being treated to real-time or near-real-time economic and social information, the lengthy delays in publication of NCSES survey results are not very well understood. The lack of timeliness is discussed here as a dissemination issue, though, in reality, timeliness problems have more to do with data gathering, statistical methodology, and processing practices, some of which have been addressed in previous National Research Council reports (National Research Council, 2004, pp. 105, 114, 131, 147, 159-160; National Research Council, 2010, p. 21).

NCSES leadership reported to the panel that the agency has undertaken initiatives over the years to shorten publication time by reducing reliance on printed reports and making more use of relatively quick-turnaround formats, such as InfoBriefs. These have successfully put the major data series in the hands of users more quickly than in the past. However, users still have to wait too long after the reference period to get access to the detailed publication tabulations that are necessary for sophisticated analysis from a major NCSES survey; for example, detailed data from the new Survey of Industrial Research and Development for the years 2006 and 2007 were released in June 2011, a year after less detailed summaries of data from the BRDIS for 2008 were released in May 2010.
The delays in other reports, as indicated by new releases announced on the NCSES website, are similarly problematic:

• Science and Engineering Research Facilities: Fiscal Year 2007 (released September 23, 2011)
• Characteristics of Scientists and Engineers in the United States: 2006 (released September 14, 2011)
• U.S. Exports of Advanced Technology Products Declined Less Than Other U.S. Exports in 2009 (released September 1, 2011)
• Science and Engineering Doctorate Awards: 2007-2008 (released August 22, 2011)
• Industrial Research and Development Information System (IRIS) 1953-2007 data (released July 26, 2011)

As mentioned earlier in this chapter, the shift to provision of data in electronic format has been simply a digitization of previously manual products. The format for the website database is a replication of the old tables that found their way into the printed publications, so the laborious and time-consuming processes that were required for production of the manual products are still necessary.

Another source of the timeliness problem is that NCSES has largely shifted to electronic dissemination but without systematic machine-understandable metadata and change control. This means that a great deal of NCSES time still must be spent painstakingly checking and formatting the data for print and electronic publication in order to ensure the accuracy and reliability of the published products. For example, each page of the hard copy must be checked by someone looking at the source data. This effort comes at the expense of ensuring data integrity at the source, and it takes an inordinate amount of scarce staff time.
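To illustrate the kind of checking that could, in principle, be delegated to software when published tables are tied to their source records, the following Python sketch recomputes a simple frequency table from source data and reports any published cell that disagrees. It is a hedged illustration only: the record layout, the `field` variable, and the function names are hypothetical and do not describe any existing NCSES system.

```python
from collections import Counter


def tabulate(records, key):
    """Count source records by the value of one variable."""
    return Counter(record[key] for record in records)


def find_discrepancies(records, published, key):
    """Compare a published table with counts recomputed from the source.

    Returns {category: (recomputed, published)} for every mismatch.
    """
    recomputed = tabulate(records, key)
    categories = sorted(set(recomputed) | set(published))
    return {
        c: (recomputed.get(c, 0), published.get(c, 0))
        for c in categories
        if recomputed.get(c, 0) != published.get(c, 0)
    }


# Hypothetical source records and a published table with one wrong cell.
source = [{"field": "biology"}, {"field": "biology"}, {"field": "physics"}]
published_table = {"biology": 2, "physics": 3}

mismatches = find_discrepancies(source, published_table, "field")
print(mismatches)
```

A check of this sort flags only the cells that need human attention, which is the reverse of inspecting every page by hand; it presupposes, however, that each published table carries enough metadata to identify the source records and the variable it was built from.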
