2
The Current Dissemination Program
The current dissemination program of the National Center for Science
and Engineering Statistics (NCSES) is wide-ranging and multifaceted.
In order to fulfill its mandate to serve as collector and distributor of
information about the science and engineering enterprise for the National
Science Foundation (NSF), this relatively small, resource-constrained
statistical agency1 disseminates its publishable data in several formats
(hard-copy, mixed, and electronic-only publications); maintains an
extensive website; makes its data available for retrieval from the
consolidated FedStats database and through the Data.gov portal; provides
access to confidential microdata in a protected environment for research
purposes; and supports provision of three online communal tools that are
used to retrieve data from the NCSES database: the Integrated Science and
Engineering Resources Data System (WebCASPAR), the Scientists and Engineers
Statistical Data System (SESTAT), and the less known Industrial Research
and Development Information System (IRIS) (see Table 2-1).
These diverse outputs and self-maintained tools serve a broad community
of information users with widely different data needs, ranging from
one-time casual use to recurring, highly sophisticated use, and with
widely divergent levels of statistical knowledge, from rudimentary to very
knowledgeable. The user community also has quite different access preferences,
as attested by the users who discussed their uses of the data with the panel
1 The 2011 budget for NCSES was $41.5 million, down from $45.7 million in fiscal year
2009 and $41.9 million in fiscal year 2010. The agency has only 45 full-time permanent staff
members, of whom 21 are statisticians.
20 COMMUNICATING SCIENCE AND ENGINEERING DATA
TABLE 2-1 Summary of Selected Characteristics of NSF Science and Engineering Surveys

Survey: Survey of Earned Doctorates
Current Contractor: National Opinion Research Center (NORC)
Database Retrieval Tool/Publication: WebCASPAR; InfoBriefs; Science and Engineering Degrees; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Doctorate Recipients from U.S. Universities: Summary Report; Academic Institutional Profiles
Variables Available: Academic institution of doctorate; baccalaureate-origin institution (United States and foreign); birth year; citizenship status at graduation; country of birth and citizenship; disability status; educational attainment of parents; educational history in college; field of degrees (N = 292); graduate and undergraduate educational debt; marital status, number/age of dependents; postgraduation plans (work, postdoctorate, other study/training); primary and secondary work activities; source and type of financial support for postdoctoral study/research; type and location of employer; race/ethnicity; sex; sources of financial support during graduate school; type of academic institution (e.g., historically black institutions, Carnegie codes, control) awarding the doctorate
Availability of Microdata: Access to restricted microdata can be arranged through a licensing agreement. A secure data access facility/data enclave providing restricted microdata access is under development with NORC.
Series Initiated/Archiving: 1957 (conducted annually, limited data available 1920-1956)

Survey: Survey of Graduate Students and Postdoctorates in Science and Engineering
Current Contractor: RTI International
Database Retrieval Tool/Publication: WebCASPAR; InfoBriefs; Graduate Students and Postdoctorates in Science and Engineering; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Academic Institutional Profiles
Variables Available: The number and characteristics of graduate students; postdoctoral appointees; and doctorate-holding nonfaculty researchers in science, engineering, and health (SEH) fields
Availability of Microdata: Data for the years 1972–2008 are available in a public-use file format.
Series Initiated/Archiving: 1975 (conducted annually)
TABLE 2-1 Continued

Survey: Survey of Doctorate Recipients
Current Contractor: NORC
Database Retrieval Tool/Publication: SESTAT; InfoBriefs; Characteristics of Doctoral Scientists and Engineers in the United States; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering; Science and Engineering State Profiles
Variables Available: Citizenship status; country of birth; country of citizenship; date of birth; disability status; educational history (for each degree held: field, level, institution, when received); employment status (unemployed, employed part time, or employed full time); geographic place of employment; marital status; number of children; occupation (current or past job); primary work activity (e.g., teaching, basic research, etc.); postdoctorate status (current and/or three most recent postdoctoral appointments); race/ethnicity; salary; satisfaction and importance of various aspects of job; school enrollment status; sector of employment (e.g., academia, industry, government, etc.); sex; work-related training
Availability of Microdata: Access to restricted data for researchers interested in analyzing microdata can be arranged through a licensing agreement. The data are available online through the enclave arrangement discussed above.
Series Initiated/Archiving: 1973 (conducted biennially)

Survey: National Survey of Recent College Graduates
Current Contractor: Mathematica Policy Research, Inc. and Census Bureau
Database Retrieval Tool/Publication: SESTAT; InfoBriefs; Characteristics of Recent Science and Engineering Graduates; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering
Variables Available: For individuals who recently received bachelor’s or master’s degrees in an SEH field from a U.S. institution: age; citizenship status; country of birth; country of citizenship; disability status; educational history (for each degree held: field, level, when received); employment status (unemployed, employed part time, or employed full time); educational attainment of parents; financial support and debt amount for undergraduate and graduate degree; geographic place of employment; marital status; number of children; occupation (current or previous job); place of birth; work activity (e.g., teaching, basic research, etc.); race/ethnicity; salary; overall satisfaction with principal job; school enrollment status; sector of employment (e.g., academia, industry, government, etc.); sex; work-related training
Availability of Microdata: Access to restricted data for researchers interested in analyzing microdata can be arranged through a licensing agreement.
Series Initiated/Archiving: 1976 (conducted biennially)
TABLE 2-1 Continued

Survey: National Survey of College Graduates
Current Contractor: Census Bureau
Database Retrieval Tool/Publication: SESTAT; InfoBriefs; Science and Engineering Indicators; Women, Minorities, and Persons with Disabilities in Science and Engineering
Variables Available: For individuals holding a bachelor’s or higher degree in any field: academic employment (position, rank, and tenure); age; citizenship status; country of birth; country of citizenship; disability status; educational history (for each degree held: field, level, when received); employment status (unemployed, employed full time, or employed part time); geographic place of employment; immigrant module (year of entry, type of entry visa, reason(s) for coming to the United States, etc.); labor force status; marital status; number of children; occupation (current or past job); primary work activity (e.g., teaching, basic research, etc.); publication and patent activities; race/ethnicity; salary; satisfaction and importance of various aspects of job; school enrollment status; sector of employment (academia, industry, government); sex; work-related training
Availability of Microdata: Public-use data files are available upon request.
Series Initiated/Archiving: 1962 (conducted biennially)

Survey: Business Research and Development and Innovation Survey (BRDIS)
Current Contractor: Census Bureau
Database Retrieval Tool/Publication: IRIS; InfoBrief; Business and Industrial R&D; Science and Engineering Indicators; National Patterns of Research and Development Resources; Science and Engineering State Profiles
Variables Available: Financial measures of research and development (R&D) activity; company R&D activity funded by others; R&D employment; R&D management and strategy; and intellectual property, technology transfer, and innovation
Availability of Microdata: Census Research Data Centers
Series Initiated/Archiving: 1953 (conducted annually); a new series began in 2008 when the survey was changed

Survey: Survey of Federal Funds for Research and Development
Current Contractor: Synectics for Management Decisions, Inc.
Database Retrieval Tool/Publication: WebCASPAR; InfoBrief; Federal Funds for Research and Development; Science and Engineering State Profiles; Science and Engineering Indicators; National Patterns of Research and Development Resources
Variables Available: Federal obligations by the following key variables: character of work; basic research; applied research; development; federal agency; federally funded research and development centers (FFRDCs); field of science and engineering; geographic location (within the United States and foreign country); performer (type of organization doing the work); R&D plant. Federal outlays by: character of work, basic research, applied research, development, R&D plant
Availability of Microdata: Data tables
Series Initiated/Archiving: 1952 (conducted annually)
TABLE 2-1 Continued

Survey: Survey of Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions
Current Contractor: Synectics for Management Decisions, Inc.
Database Retrieval Tool/Publication: WebCASPAR; InfoBrief; Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions; Science and Engineering State Profiles; Science and Engineering Indicators; National Patterns of Research and Development Resources
Variables Available: Data by federal agency, academic institutions and location: R&D; fellowships, traineeships, and training grants; R&D plant; facilities and equipment for instruction in science and engineering; general support for science and engineering; type of academic institution (i.e., historically black colleges and universities [HBCUs], tribal institutions, high-Hispanic-enrollment institutions, minority institutions); type of institutional control (public versus private)
Availability of Microdata: Data tables only
Series Initiated/Archiving: 1965 (conducted annually)

Survey: Survey of R&D Expenditures at Federally Funded R&D Centers (FFRDCs)
Current Contractor: ICF Macro
Database Retrieval Tool/Publication: WebCASPAR; InfoBrief; R&D Expenditures at Federally Funded R&D Centers; Academic Research and Development Expenditures; Science and Engineering Indicators; National Patterns of Research and Development Resources
Variables Available: FFRDC R&D expenditures by source of funds (federal, state and local, industry, institutional, or other); and character of work (basic research, applied research, or development)
Availability of Microdata: Data tables only
Series Initiated/Archiving: 1965 (conducted annually)

Survey: Survey of Research and Development Expenditures at Universities and Colleges
Current Contractor: ICF Macro
Database Retrieval Tool/Publication: WebCASPAR; InfoBrief; Academic Research and Development Expenditures; Science and Engineering Indicators; National Patterns of Research and Development Resources; Science and Engineering State Profiles; Academic Institutional Profiles
Variables Available: Institution R&D expenditures by source of funds (federal, state and local, industry, institutional, or other); character of work (basic research versus applied research and development); pass throughs to subrecipients; receipts as a subrecipient; S&E field; non-S&E field; R&D equipment expenditures by S&E field; federal agency; type of degree granted, HBCU, public or private control; geographic location (within the United States)
Availability of Microdata: Data tables (selected items)
Series Initiated/Archiving: 1972 (conducted annually, limited data available for various years for 1954-1970)

Survey: Survey of State Research and Development Expenditures
Current Contractor: Census Bureau
Database Retrieval Tool/Publication: InfoBrief; State Government R&D Expenditures; Science and Engineering Indicators
Variables Available: State agency or department; state R&D expenditures; internal performers; external performers; basic research; source of funds (federal, state, other); R&D facilities
Availability of Microdata: Data tables
Series Initiated/Archiving: 1964 (conducted occasionally)
TABLE 2-1 Continued

Survey: Survey of Science and Engineering Research Facilities
Current Contractor: RTI International
Database Retrieval Tool/Publication: WebCASPAR; Scientific and Engineering Research Facilities; Science and Engineering Indicators
Variables Available: Status of research facilities at academic institutions and nonprofit biomedical research organizations and hospitals by: amount and type of science and engineering research space; current expenditures for projects to construct and repair/renovate research facilities; condition of research facilities; planned construction and repair/renovation of research facilities; source of funds (federal, state and local, institutional) for construction and repair/renovation of research facilities; research animal facilities; bandwidth speeds and high performance network connections; fiber; high performance computing; wireless connections
Availability of Microdata: Microdata from this survey for the years 1988-2001 are not available.
Series Initiated/Archiving: 1986 (conducted biennially)

Survey: Survey of Public Attitudes Toward and Understanding of Science and Technology
Current Contractor: NORC, via a science and technology module on the General Social Survey
Database Retrieval Tool/Publication: Science and Engineering Indicators
Variables Available: Demographic, behavioral, and attitudinal by how information about S&T is obtained; interest in science-related issues; visits to informal science institutions; S&T knowledge; attitudes toward science-related issues
Availability of Microdata: Data tables
Series Initiated/Archiving: ICPSR, 1979-2001; CD, 1979-2004; (conducted biennially)

(see Chapter 5). With limited resources, NCSES attempts to be all things to
all users, and because it is spread so thinly, the panel has serious concerns
about whether these outputs and tools are optimized for all the tasks to
which they are addressed, as well as about whether NCSES is using the
most up-to-date technologies and processes to best advantage for the user
community.
In this chapter, we assess the status of the NCSES dissemination program.
First, we describe the remaining hard-copy publications. We then
review the NCSES user interface tools, including WebCASPAR, SESTAT,
and IRIS, through which individuals are able to directly access and retrieve
tailored outputs from the database. Then we discuss the structure of the
databases and their current presentation on the web for downloading and
use by third parties. We assess the current status of the program in light of
the emerging practices for electronic dissemination, primarily the development
of the Semantic Web as a way to facilitate access to information on the
Internet. We provide examples of semantic web systems in federal agencies
and the possibilities for development of a semantic web structure for science
and engineering (S&E) information on the Internet. Finally, we consider
the important issue of timeliness—a subject of great concern for users of
NCSES data—and the possibility of moving the release and distribution of
S&E data to a real-time basis.
TRADITIONAL FORMAT PUBLICATIONS
NCSES continues a few publications using a print-based approach
and still has a customer base for them, although that customer base seems
to be declining over time. Moreover, although most retrieval of NCSES
information is by electronic means, a large part of the offerings are simply
electronic depictions of previous hard-copy publications. It is fair to say
that NCSES continues to manage its publications program in much the
same way as it traditionally has, although the finished products, for the
most part, are now sent to the website for posting rather than to a print-
Moreover, Tableau supports not only visualizations but also direct down-
loads of data extracts and of derivative “print” works, such as reports and
HTML tables. Nevertheless, Google’s ability to leverage its search engine
dominance and redirect key search terms to Google Public Data Explorer
data visualizations can provide publishers using this tool with unparalleled
visibility among users.
State of the Practice in Data Sharing
Data sharing platforms go beyond data publication to allow the wider
user community to comment on and correct data provided through the system,
add value through integrated visualizations or tags, and even provide
additional data for comparison and integration. At the time our report was
being prepared, there was one open-source data sharing platform, the Data-
verse Network. Several competing closed commercial platforms have been
developed over the last several years, including the now-defunct Google
Palimpsest, Graphwise, Swivel, Dabble, and Verifiable data sharing services,
as well as the operational Data360, Factual, Many Eyes, and BuzzData
services. The existing services that are listed are all of note for different
reasons. More new services, such as FigShare and Numbrary, have emerged
recently or are on the horizon but have yet to achieve significant uptake.
The Dataverse Network (DVN) software is the only open-source sys-
tem currently available specifically designed for data sharing (King, 2007).
It is designed to provide access to research data and to facilitate data shar-
ing through standard/open tools, such as DDI, Dublin Core, and USMARC
metadata; Z39.50, LOCKSS, and OAI-PMH search and harvesting; and
Creative Commons licensing. It replaces the Virtual Data Center software,
which was developed under the NSF DLI-2 program (Altman et al., 2001).
It facilitates the public preservation and distribution of persistent, citeable,
authorized, and verifiable research data, with powerful but easy-to-use tech-
nology. The project increases scholarly recognition and distributed control
for authors, journals, and others who make data available; improves data
access and analysis; and still enables professional archives to provide inte-
grated preservation and other services. It is a leading example of standards-
based open systems.
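
The harvesting interface mentioned above can be illustrated with a minimal sketch. An OAI-PMH request is an ordinary HTTP GET whose query string carries a verb (such as ListRecords) and a metadataPrefix (oai_dc for Dublin Core). The base URL, the function names, and the abbreviated sample response below are placeholders invented for illustration; they do not describe any actual Dataverse Network installation, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint (an assumption, not a real service).
BASE_URL = "https://repository.example.org/oai"

# XML namespace defined for Dublin Core elements.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(response_xml):
    """Return the Dublin Core titles found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//dc:title", NS)]


# Hand-written, abbreviated response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example survey data set</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url(BASE_URL))
print(extract_titles(SAMPLE_RESPONSE))
```

Because the request and the response both follow an open protocol, a third-party catalog could harvest such records without any agency-specific code.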
The Dataverse Network also serves as a federated catalog, allowing
users to find and access data across dozens of remote sources, including the
Inter-university Consortium for Political and Social Research, DataWeb,
and the National Archives and Records Administration. Already accessible
through the DVN is the largest collection of social science data in the
world, through a partnership with the Data Preservation Alliance for the
Social Sciences (Data-PASS) (Altman et al., 2009; Gutman et al., 2009).
This includes integrated access to hundreds of large government data sets.
Of these systems, the Dataverse Network is unique in being designed
to explicitly support long-term access and permanent preservation. To this
end, the system supports best practices, such as format migration, human-
understandable formats and metadata, persistent identifier assignment, and
semantic fixity checking. In addition, many threats to long-term access can
be fully addressed only by collaborative stewardship of content, and the
system supports distributed, policy-based replication of its content across
multiple collaborating institutions, to ensure the long-term stewardship of
the data against budgetary and other institutional threats (see Altman and
Crabtree, 2011).
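
The idea behind semantic fixity checking can be sketched in simplified form: instead of hashing the bytes of a file, which change whenever the file format changes, one hashes a canonical representation of the values themselves, so that the fingerprint survives format migration. The function name and the choice of seven significant digits below are assumptions made for illustration; this is not the algorithm used by the Dataverse Network.

```python
import hashlib


def semantic_fingerprint(values, significant_digits=7):
    """Hash a numeric column by value rather than by file bytes.

    Each value is converted to a float and written in a canonical
    scientific notation, so formatting differences (for example "1.50"
    versus 1.5) do not change the result.
    """
    canonical = [
        format(float(v), ".{}e".format(significant_digits - 1)) for v in values
    ]
    payload = "\n".join(canonical).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The same column as read from two differently formatted files.
from_csv = ["1.50", "2", "3.25"]
from_binary = [1.5, 2.0, 3.25]

# Same values, different formatting: the fingerprints agree.
print(semantic_fingerprint(from_csv) == semantic_fingerprint(from_binary))

# A changed value: the fingerprints differ.
print(semantic_fingerprint(from_csv) == semantic_fingerprint([1.5, 2.0, 3.26]))
```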
Making data available in machine-understandable formats using open
standards and metadata also enables the media or other data redistributors
to easily pick up the data and integrate it into their own specific visualiza-
tion tools for further dissemination. This enhances the visibility of the data
and allows a statistical agency to reach a much broader audience with tools
specifically targeted for such audiences. As an example, The Guardian, a
British newspaper, has published a visualization tool based on data from
Eurostat that explains to European citizens “Who we are, how we live and
what it costs.”8
Data360, created in 2004, is the oldest closed-source data sharing
service still operational. Its stated aim was to make data available for bet-
ter public policy. It now contains thousands of data sets and offers static
and dynamic visualizations, direct access to data, and generated reports
(Macdonald, 2009, p. 4).
Factual is a data manipulation service developed in the commercial sector. It
is closed source, runs as a proprietary service, and handles only moderate-
sized databases. It extensively supports collaborative data manipulation in
such functions as data linking, aggregation, and filtering, and it has exten-
sive mashup support, with Google RESTful and Java JSON APIs for extrac-
tion and interrogation of data sets. It also integrates with Google charts and
maps. It is a leading example of collaborative data editing. Factual contains
a relatively small collection but has the aim of eventually loading all the
Data.gov files.9 If this aim is achieved, several of the NCSES data files that
reside in Data.gov will be available in this tool.
Many Eyes is a website that permits users to enter their own data sets
and produce tailored visualizations from a stock of sample visualizations on
demand (Viegas, 2007). Many Eyes is largely uncurated, and as a result it
hosts over 200,000 data sets, the vast majority of which are tiny, undocumented,
and of unknown provenance. In part, this is because the goal of
8 See http://www.guardian.co.uk/world/interactive/2011/mar/14/new-europe-statistics-interactive [November 2011].
9 See http://www.factual.com/topic/government [November 2011].
the site is not to create a data collection or archive but to make visualiza-
tion a catalyst for discussion and collective insight about data. Many Eyes
is particularly notable for its prototype work involving accessibility for
people with disabilities. (In contrast, none of the other visualization tools
described provides accessible components or analogs.) By employing a pro-
cessing design that carefully separates data manipulation and data analysis
from presentation (see, for example, Wilkinson et al., 2005) and deferring
visualization to the final stage of the chain of computation, the Many
Eyes prototype was able to offer powerful data manipulation and analysis
functions that were potentially accessible to a visually impaired audience.
Although this is not yet in production, it shows that data analytics for the
visually impaired can go far beyond those typically offered.
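
The processing design described above, in which analysis is kept separate from presentation and rendering is deferred to the last step, can be sketched as follows. The function names and the figures are invented for illustration and are not drawn from the Many Eyes prototype. The point of the sketch is that one presentation-neutral summary can feed both a visual renderer and a textual one suitable for a screen reader.

```python
def summarize(values_by_label):
    """Analysis stage: compute a presentation-neutral summary."""
    total = sum(values_by_label.values())
    largest = max(values_by_label, key=values_by_label.get)
    return {"values": dict(values_by_label), "total": total, "largest": largest}


def render_bars(summary, width=20):
    """Visual presentation: a plain-text bar chart."""
    peak = summary["values"][summary["largest"]]
    lines = []
    for label, value in summary["values"].items():
        bar = "#" * round(width * value / peak)
        lines.append("{:<10}{}".format(label, bar))
    return "\n".join(lines)


def render_description(summary):
    """Non-visual presentation: a sentence suitable for a screen reader."""
    parts = ["{} is {}".format(k, v) for k, v in summary["values"].items()]
    return "{}. The total is {}; the largest category is {}.".format(
        "; ".join(parts), summary["total"], summary["largest"]
    )


# Invented placeholder figures, used only to exercise the two renderers.
summary = summarize({"Group A": 10, "Group B": 30})
print(render_bars(summary))
print(render_description(summary))
```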
BuzzData is a relatively new entry to the data sharing offerings in
which a community of interest for a data set is formed and each data set
has tabs for tracking versions, visualizations, related articles, attachments,
and comments. The idea is that people who use the data will add value to the
data set, thereby creating a social network around it (Howard, 2011).
Trends in Data Access Tools and Infrastructure
Data dissemination is a fast-developing area in which players, technologies,
and vocations are changing rapidly. The above review of emerging public- and
private-sector tools reveals a number of general trends and patterns, which
are summarized below:
• In the private sector, no dominant business model, company, or
commercial product has emerged. To the contrary, many commer-
cial services in this area have failed, and business models for data
sharing remain unclear.
• The availability, usability, and features of third-party systems have
raised user expectations for access to data. Increasingly, users are
expecting access to data in real time and at a fine level of detail.
They want access to data that are machine-understandable and that
can be imported or mashed up using third-party services. Data.gov
is a prime example of this trend applied to the public sector.
• Mega-scale online analysis, social integration, metadata exchange
of catalog information, collaboration features, and ad hoc support
for data manipulation are “solved problems” and well within the
state of the practice. However, many services fail to adhere to good
practices.
• Extremely powerful (peta-scale) online analysis, interactive statisti-
cal disclosure limitation, semantic harmonization, dynamic linking
of data across different data sources with different data collection
designs, and data analysis and browsing support for the visually
impaired remain research problems.
• None of the commercial services is designed with preservation or
long-term access in mind.
• Both private-sector and public production services currently avail-
able fall short of providing rich access to visually impaired users.
Overall, these patterns strongly suggest that NCSES should not adopt a
single service or technology for data visualization and sharing, nor should it
develop another bespoke system, but instead should make data available in
open formats and protocols, and with sufficient documentation and meta-
data, to enable the easy inclusion of these data in third-party catalogs and
services. It would benefit from exploring mashups (a mashup occurs when
a web page or application uses and combines data, presentation, or func-
tionality from two or more sources to create new services) with ongoing
public-sector dissemination tool sets, such as DataWeb, in order to quickly
transform its electronic dissemination platforms and refine its participation
in government-wide portals (see Recommendation 3-4).
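
A mashup of the kind defined above can be sketched in a few lines: two independently published tables that share a key are joined to produce a derived measure that neither source provides on its own. The region names, column names, and all figures below are invented placeholders, not NCSES or Census Bureau statistics, and the function names are hypothetical.

```python
import csv
import io

# Two independently published tables, both keyed by region.
# All figures are invented placeholders, not real statistics.
RD_EXPENDITURES_CSV = """region,rd_expenditures_musd
Region A,1000
Region B,450
"""

POPULATION_CSV = """region,population_millions
Region A,2.0
Region B,1.5
"""


def read_table(text, key, value):
    """Read a two-column CSV string into a {key: float(value)} mapping."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key]: float(row[value]) for row in reader}


def mash_up(expenditures, population):
    """Join the two sources on region and derive dollars per capita."""
    combined = {}
    # Only regions present in both sources can be combined.
    for region in expenditures.keys() & population.keys():
        # Millions of dollars divided by millions of people.
        combined[region] = expenditures[region] / population[region]
    return combined


per_capita = mash_up(
    read_table(RD_EXPENDITURES_CSV, "region", "rd_expenditures_musd"),
    read_table(POPULATION_CSV, "region", "population_millions"),
)
for region in sorted(per_capita):
    print(region, per_capita[region])
```

The join works only because both tables are machine-readable and use the same key values, which is why open formats and adequate metadata matter for third-party reuse.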
DISSEMINATION BY MEANS OF GOVERNMENT-WIDE PORTALS
In addition to data dissemination through its own website and possible
utilization of such tools as DataWeb, NCSES has options for disseminat-
ing data through two major government-wide initiatives. It has a presence
through both portals, but they both fall short of serving as comprehensive
platforms for featuring and disseminating S&E information in electronic
form.
FedStats
An early, once-ambitious government-wide data access service, FedStats
has been available online since 1997. FedStats is a portal that was designed
to be a one-stop gateway through which users can retrieve a full range of
official statistical information produced by the federal government without
having to know in advance which federal agency produces which particular
statistic. It has searching and linking capabilities to data from agencies that
provide data and trend information on such topics as economic and popu-
lation trends, crime, education, health care, S&E workforce and expendi-
tures, farm production, and more. Data can be retrieved by searching by
subject matter, program area, or agency.
NCSES has been a part of FedStats from the beginning. Currently,
the tool directs a user who is searching by subject matter (topic) or press
releases to the NCSES website, where the search continues using
the existing NCSES search and retrieval tools. Searching by agency is a bit
problematic—the site had not been updated to incorporate the new name
of NCSES as of September 2011.
Data.gov
A promising new portal for disseminating federal government infor-
mation in the form of raw data and applications (apps) has more recently
been developed. Data.gov is a major component of a spate of recent open-
government initiatives that have been designed to serve as a catalyst for
increasing transparency. NCSES has been a member of this federal open-
government initiative from its beginning in May 2009. The SESTAT tool
is one of the apps that can be accessed through Data.gov, although the
WebCASPAR, IRIS, and SED Tabulation Engine tools were not being made
available through this portal at the time this report was being prepared.
Workshop presenter Alan Vander Mallie, program manager in the General
Services Administration, stated that Data.gov aims to promote accountability
and provide information for citizens on what their government is doing, with
tools to enable collaboration across all levels of government. It is a
one-stop website for free access to data produced or held by the federal
government, designed to make those data easy to find, download, and use,
including databases, data feeds, graphics, and other data visualizations.
Vander Mallie reported that, at its inception in 2009, Data.gov con-
sisted of 47 raw data sets and 27 tools to assist in accessing the data in
some of the complex data stores. At the time of the workshop, the program
supported 2,895 raw data sets and 638 tools, which are accessed through
raw data and tool catalogues. (The number of raw data sets and geographic
data sets claimed on the Data.gov website home page had grown to nearly
390,000 by fall 2011.) This increase is primarily the result of linking and
rebranding the Geospatial One Stop (Geodata.gov) service as part of the
Data.gov site. The catalog of raw data sets available (see
http://explore.data.gov/catalog/raw/ [November 2011]) has increased to roughly
3,602, based on a catalog search. Raw data are defined as machine-readable data
at the lowest level of aggregation in structured data sets with multiple
purposes. The raw data sets are designed to be mashed up, that is, linked
and otherwise put in specific contexts using web programming techniques
and technologies. Following the workshop, Socrata, which provides an
open government software solution, has introduced a new Data.gov website
designed to help government agencies publish and distribute data in new
ways, including interactive charts, maps, and lists. At the time this report
was being prepared, this software was available only to participating gov-
ernment agencies and was not accessible to the panel.
In the future, Vander Mallie said, Data.gov is slated to continue to
expand its coverage of data sets and tools and to continue to support com-
munities of interest by building community pages that collect related data
sets and other information to help users find data on a single topic in one
location. One continuing objective is to make data available through an
application programming interface (API), permitting the public and developers
to directly source their data from Data.gov.
Expansion into the Semantic Web, an emerging standardized way of
expressing the relationships between web pages so the meaning of hyper-
linked information can be understood, is also part of the future plan for
Data.gov. The objective is to enable the public and developers to create a
new generation of “linked data” mashups. Working toward this goal, Data.gov
has an indexed set of resource framework documents that are available and is
working with the W3C to promote international standards for persistent
government data (and metadata) on the web. Plans are also in place for
expanding mobile applications, improving “meta-tagging” (short descriptions
of an HTML web page that describe the content and facilitate implementation
of standards to describe the data), and enhancing data visualization across
agencies. In short, the idea is to give agencies a powerful
new tool for disseminating their data and a one-stop locale for the public to
access them. Efforts also exist to create government-wide or agency-specific
data catalogs and dictionaries, which would be published along with the
available data sets.
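The “linked data” idea can be illustrated with a minimal sketch. It is not a description of how Data.gov implements linked data. It assumes two hypothetical data sets expressed as subject-predicate-object triples that share an identifier for each state, and it shows how a mashup joins them on that shared identifier. Every URI, predicate name, and value below is invented for illustration.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object).
# All URIs, predicates, and numbers are invented for this sketch.
RD_SPENDING = [
    ("http://example.org/place/VA", "ex:rdExpenditureMillions", 100),
    ("http://example.org/place/MD", "ex:rdExpenditureMillions", 250),
]
DOCTORATES = [
    ("http://example.org/place/VA", "ex:doctoratesAwarded", 40),
    ("http://example.org/place/MD", "ex:doctoratesAwarded", 50),
]


def mash_up(*triple_sets):
    """Merge triples from several sources, keyed by their shared subject URI."""
    merged = defaultdict(dict)
    for triples in triple_sets:
        for subject, predicate, obj in triples:
            merged[subject][predicate] = obj
    return dict(merged)


# Each subject now carries facts drawn from both sources.
linked = mash_up(RD_SPENDING, DOCTORATES)
```

The join works only because both sources use the same identifier for the same thing, which is why persistent, standardized identifiers for government data matter.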
Suzanne Acar, senior information architect for the U.S. Department of
the Interior and cochair of the Federal Data Architecture Subcommittee of
the Chief Information Officer Council (see http://www.cio.gov [November
2011]), put the current and future Data.gov into context. She discussed the
evolution of Enterprise Data/Information Management (EIM)—a frame-
work of functions that can be tailored to fit the strategic information goals
of any organization. For agencies like NSF to benefit from the capabilities
of Web 2.0 and Web 3.0, it is important to ensure consistent quality of
information and official designations of authoritative data sources.
While this report was being prepared, the future of Data.gov remained
somewhat uncertain because of the threat of budget cuts (Lipowicz, 2011).
Nonetheless, the development of Data.gov was heading in an additional
direction—a direction that could be promising for improved dissemination
of S&E data. The Office of Management and Budget is setting up a number
of community-based, topic-specific Data.gov sites. The initial sites cover
information on energy, law, and health.10 In conjunction with the Office
of Science and Technology Policy, NCSES might consider setting up such
a topic-specific site for the science and technology community, particularly
as it is now a clearinghouse for data dissemination. Overall, the sense of
the panel was that Data.gov was a useful channel for disseminating NCSES
data, but that NCSES should not rely on it as the only solution for
disseminating data in open formats and through open APIs.
10 See http://www.data.gov/energy; http://www.whitehouse.gov/blog/2011/06/30/invitation-our-latest-open-innovation-ecosystem-energydatagov [August 2011].
EXPANDING ACCESS TO THE NCSES DATABASE
In addition to making its database available to the public through use
of the SESTAT, WebCASPAR, and IRIS tools as well as through FedStats
and Data.gov, NCSES makes the microdata available under carefully con-
trolled circumstances for download and use by outside organizations and
developers. NCSES, like all federal agencies, is bound by the Privacy Act of
1974 to protect the confidentiality of the records it maintains about
individuals. It is also bound by other statutory requirements for the
protection of confidential statistical information: Title V of the
E-Government Act of 2002, known as the
Confidential Information Protection and Statistical Efficiency Act (CIPSEA),
and NSF’s own statutory provisions. These statutes require NCSES to
establish protocols and procedures to protect the information the agency
collects. In addition, CIPSEA requires that data collected under a pledge
of confidentiality be used solely for statistical purposes and thus not be
disclosed in identifiable form.
This confidentiality protection is afforded to the data in several ways.
Some are fairly straightforward, such as deleting identifying information
(such as name and address) from the records. In other cases, however,
such straightforward methods may not be adequate. This is true for most
of NCSES’s microdata files that contain information about individuals.
In those cases, NCSES attempts to develop a public-use file that provides
researchers with as much microdata as feasible, given the need to protect
respondent confidentiality. It achieves this goal by suppressing selected
fields and/or recoding variables. These suppressions, however, may render
the resulting data of little use to analysts and researchers.
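The three techniques named above (deleting identifiers, recoding variables, and suppressing fields) can be sketched in a few lines. This is an illustrative sketch only, not NCSES's actual disclosure-limitation procedure. The record layout, the ten-year age bands, and the minimum cell size of 3 are all assumptions made for the example.

```python
from collections import Counter

DIRECT_IDENTIFIERS = {"name", "address"}  # the examples given in the text
MIN_CELL_SIZE = 3  # hypothetical threshold, not an NCSES rule


def recode_age(age):
    """Recode an exact age into a ten-year band, e.g. 37 becomes '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def make_public_use(records):
    """Return de-identified copies of the input records (a list of dicts)."""
    out = []
    for rec in records:
        # Step 1: delete direct identifiers.
        clean = {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}
        # Step 2: recode a detailed variable into coarser categories.
        clean["age"] = recode_age(clean["age"])
        out.append(clean)
    # Step 3: suppress field-of-study values shared by too few respondents.
    counts = Counter(r["field"] for r in out)
    for r in out:
        if counts[r["field"]] < MIN_CELL_SIZE:
            r["field"] = None  # value suppressed
    return out


# Invented sample records.
sample = [
    {"name": "A", "address": "x", "age": 37, "field": "physics"},
    {"name": "B", "address": "y", "age": 42, "field": "physics"},
    {"name": "C", "address": "z", "age": 29, "field": "physics"},
    {"name": "D", "address": "w", "age": 55, "field": "astronomy"},
]
public = make_public_use(sample)
```

In this sample the lone astronomy respondent loses the field-of-study value entirely, which shows in miniature how suppression can reduce the usefulness of a file for analysts.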
For cases in which NCSES believes that protecting respondent confidentiality
would require such extensive recoding that the resulting file would have
little, if any, research utility, the agency has developed a variety of methods
to help researchers use the data. In some cases,
researchers are able to state their needs for tabulations or other statistics
with sufficient specificity that necessary summary information can be pro-
vided without the need for access to microdata. In other cases, NSF and the
researcher can execute a license agreement that permits the researcher to use
the data files at the NSF offices in Arlington, Virginia, or, under rigorously
restricted conditions, at the researcher’s academic institution.
Microdata files for three surveys may be obtained under a license agree-
ment with NSF: the Survey of Earned Doctorates, the Survey of Doctorate
Recipients, and the National Survey of Recent College Graduates. The
SESTAT Integrated Data File can also be obtained in this manner.
For two of these surveys—the Survey of Earned Doctorates and the
Survey of Doctorate Recipients—plans are under way to provide authorized
researchers with remote access to microdata using the most secure methods
to protect confidentiality. This online environment is called the NORC Data
Enclave. The enclave seeks to implement technological security, statisti-
cal protections, legal requirements, and researcher training in one pack-
age. The NORC Data Enclave intends to aid in preserving data for the
long term by documenting the data using Data Documentation Initiative–
compliant metadata standards. When implemented, the enclave intends to
set up a research “collaboratory”—an arrangement that would develop a
knowledge infrastructure around each data set, enabling geographically
dispersed researchers to share information through wikis and blogs. This is
an expanding and innovative program for the agency, one intended to both
protect confidential data and enhance the usability of the data for research
and analytical purposes.
Otherwise confidential data from the 2008 Business Research and
Development and Innovation Survey (BRDIS), sponsored by NCSES and
conducted by the U.S. Census Bureau, have been made available to qualified
researchers on approved projects through the Census Bureau’s Research
Data Centers (RDCs). This survey is a successor to the Survey of Industrial
Research and Development. The data available in the RDC network cover
businesses’ domestic and global R&D expenditures and workforce,
collected from a nationally representative sample of about 40,000
companies in manufacturing and nonmanufacturing industries. There are plans
to create an onsite RDC at NCSES so program staff can have access to the
confidential data under controlled circumstances.
Although respondent privacy must be protected, the current NCSES
approach is not transparent, nor does it appear systematic. As the recent
introduction of the SED Tabulation Engine illustrates, data from the same
survey series may be split across different, nonintegrated systems. The
private NCSES collection is not made available under a consistent set of terms
of use (which vary by database) or through a consistent mechanism (i.e., some
data sets are not available at all, some are available through the NORC
enclave, and some only through the Census Bureau), nor are the methods
of disclosure risk analysis publicly documented.
Statistical and technical methods for protecting confidentiality are
rapidly changing. Maximizing research utility requires regular review of
methods, consistent license agreements, and provision of data in many forms,
including public-use data and restricted data enclaves (National Research
Council, 2005).
In addition, the need to provide confidentiality in the present does
not eliminate the responsibility to provide for long-term access. The risk
of reidentification changes as time elapses. As discussed in Chapter 3, all
NCSES data, even confidential data, should be stewarded for long-term
access and permanent preservation.
REAL-TIME DISSEMINATION AS A GOAL
One of the most common user criticisms that the panel heard about the
dissemination program was the length of time between the survey reference
periods and when NCSES released data from those surveys. In an era when
users are increasingly being treated to real-time or near-real-time economic
and social information, the lengthy delays in publication of NCSES survey
results are not very well understood. The lack of timeliness is discussed here
as a dissemination issue, though, in reality, timeliness problems have to do
more with data gathering, statistical methodology, and processing practices,
some of which have been addressed in previous National Research Council
reports (National Research Council, 2004, pp. 105, 114, 131, 147, 159-
160; National Research Council, 2010, p. 21).
NCSES leadership reported to the panel that, over the years, the agency has
taken initiatives to shorten publication time
by reducing reliance on printed reports and making more use of relatively
quick-turnaround formats, such as InfoBriefs. These have successfully put
the major data series in the hands of users more quickly than in the past.
However, users still have to wait too long after the reference period to get
access to the detailed publication tabulations that are necessary for sophisti-
cated analysis from a major NCSES survey; for example, detailed data from
the new Survey of Industrial Research and Development for the years 2006
and 2007 were released in June 2011, a year after less detailed summaries
of data from the BRDIS for 2008 were released in May 2010.
The delays in other reports, as indicated by new releases announced on
the NCSES website, are similarly problematic:
• Science and Engineering Research Facilities: Fiscal Year 2007
(released September 23, 2011)
• Characteristics of Scientists and Engineers in the United States:
2006 (released September 14, 2011)
• U.S. Exports of Advanced Technology Products Declined Less Than
Other U.S. Exports in 2009 (released September 1, 2011)
• Science and Engineering Doctorate Awards: 2007-2008 (released
August 22, 2011)
• Industrial Research and Development Information System (IRIS)
1953-2007 data (released July 26, 2011)
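The size of these delays can be made concrete with simple arithmetic. The sketch below takes the reference years and release dates from the list above and computes the lag in months. It assumes, for simplicity, that each reference period ends in December of the last year named in the title. That is an assumption made here, not a statement about the surveys' actual reference periods, and for fiscal-year data it understates the lag.

```python
from datetime import date

# (short label, last reference year, release date), taken from the list above.
RELEASES = [
    ("Research Facilities FY 2007", 2007, date(2011, 9, 23)),
    ("Characteristics of Scientists and Engineers 2006", 2006, date(2011, 9, 14)),
    ("Advanced Technology Exports 2009", 2009, date(2011, 9, 1)),
    ("Doctorate Awards 2007-2008", 2008, date(2011, 8, 22)),
    ("IRIS 1953-2007", 2007, date(2011, 7, 26)),
]


def lag_in_months(reference_year, released):
    """Months from December of the reference year to the release month."""
    return (released.year - reference_year) * 12 + (released.month - 12)


lags = [lag_in_months(year, released) for _, year, released in RELEASES]

for (label, _, _), months in zip(RELEASES, lags):
    print(f"{label}: about {months} months")
```

Under that assumption the lags range from 21 months to 57 months.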
As mentioned earlier in this chapter, the shift to provision of data
in electronic format has been simply a digitization of previously manual
products. The format for the website database is a replication of the old
tables that found their way into the printed publications, so the labori-
ous and time-consuming processes that were required for production of
the manual products are still necessary. Another source of the timeliness
problem is that NCSES has largely shifted to electronic
dissemination but without systematic machine-understandable metadata
and change control. This means that a great deal of NCSES time still must
be spent painstakingly checking and formatting the data for print
and electronic publication to ensure the accuracy and reliability of
the published products. For example, each page of the hard copy must be
checked by someone looking at the source data. This effort comes at the
expense of ensuring data integrity at the source, and it takes an inordinate
amount of scarce staff time.
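To illustrate what machine-understandable metadata and change control could replace, the sketch below compares a published table against its source automatically rather than by eye. It is a hypothetical illustration, not a description of any NCSES system. The table layout, the `field` key, the table identifier, and the use of a SHA-256 fingerprint are all assumptions made for the example.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a stable SHA-256 hash of a table, so any change is detectable."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def find_mismatches(source_rows, published_rows, key="field"):
    """Return the keys of published rows that differ from the source rows."""
    source_by_key = {row[key]: row for row in source_rows}
    return [
        row[key] for row in published_rows if source_by_key.get(row[key]) != row
    ]


# Invented source data and a published copy containing a transposed figure.
source = [
    {"field": "physics", "doctorates": 120},
    {"field": "chemistry", "doctorates": 95},
]
published = [
    {"field": "physics", "doctorates": 120},
    {"field": "chemistry", "doctorates": 59},
]

# Hypothetical metadata record stored alongside the source table.
metadata = {"table_id": "hypothetical-table-1", "source_sha256": fingerprint(source)}

mismatches = find_mismatches(source, published)
unchanged = fingerprint(published) == metadata["source_sha256"]
```

A check of this kind flags the transposed chemistry figure without anyone reading the page against the source, which is the sort of staff time the paragraph above describes as being consumed by manual review.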