3
Strategy for Modernizing Data Storage,
Retrieval, and Dissemination
In this chapter, we propose a strategy for modernizing the infrastructure
and processes that support the dissemination function of the National
Center for Science and Engineering Statistics (NCSES). Several rather
significant actions need to be taken in order to capitalize on the new
technologies and processes that would facilitate this modernization. We
make six recommendations for action, ranging from revising the format
in which science and engineering (S&E) data are received from the survey
contractors to paying more attention to archiving the data for long-term access
and preservation.
CAPACITY OF NCSES TO TAKE ADVANTAGE
OF NEW TECHNOLOGIES
Emerging technologies for data capture, storage, retrieval, and exchange
will dramatically change the context in which NCSES will provide data to
users in the future. These technologies will further increase efficiency, per-
mitting users to access the data interactively and to dynamically integrate
it with other information. For NCSES, the key to being able to take advan-
tage of these technologies is to begin with a sharp focus on modernizing
procedures for collection and ingestion of raw data and information about
the data (metadata) into the data system. This is no simple task because of
the likelihood that modernization will call for accommodating infrastruc-
ture changes. Whether the existing systems will have the capacity to ingest
the metadata and individual record data in formats that support the new
technologies is not certain.
52 COMMUNICATING SCIENCE AND ENGINEERING DATA
In order to take full advantage of many of the emerging data sharing
and visualization tools described in Chapter 2, it is important that the
incoming data be collected and ingested into the NCSES data processing
system in as disaggregated a form as possible. The data should be accom-
panied by sufficient information about the data items (metadata) to sup-
port future analyses and comparability with previous analyses, and there
should be an appropriate versioning/change management system to ensure
that the ability to trace the origin and history of the data (provenance) is
incorporated. This is challenging to NCSES because, for the most part, the
agency data are collected, updated, and accessed by contractors to NCSES.
Since the collection, tabulation, and front-end activities are controlled by
contractors, NCSES must specify the requirements for data inputs that are
compatible with retrieval in open data formats and suitable for retrieval in
formats that support common tools that software developers use to process
data.
The data also need to be in formats that enable taking advantage of
the web development capabilities embedded in Data.gov and other emerg-
ing dissemination means. The data must be capable of mashup with other
data sources. These capabilities require that access to the data be available
through an open application programming interface (API) that exposes the
disaggregated data, along with its metadata, in machine-understandable
form. The result is to enrich results and enhance the value of the data to
users.
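
To make the idea of machine-understandable delivery concrete, the following Python sketch bundles disaggregated records with variable-level metadata in a single JSON response. It is illustrative only: the record fields, the metadata layout, and the function name build_response are assumptions made for this example and do not describe an existing NCSES or Data.gov interface.

```python
import json

# Hypothetical microdata records; field names and values are illustrative only.
records = [
    {"respondent_id": 1, "field_of_degree": "engineering", "degree_year": 2009},
    {"respondent_id": 2, "field_of_degree": "biology", "degree_year": 2010},
]

# Hypothetical variable-level metadata describing each field in the records.
metadata = {
    "dataset": "example_survey",
    "version": "1.0",
    "variables": {
        "respondent_id": {"type": "integer", "label": "Anonymized respondent identifier"},
        "field_of_degree": {"type": "string", "label": "Field of highest degree"},
        "degree_year": {"type": "integer", "label": "Year degree was awarded"},
    },
}


def build_response(records, metadata):
    """Bundle records with their metadata so a client needs no separate codebook."""
    # Refuse to publish any variable that the metadata does not document.
    undocumented = {key for r in records for key in r} - set(metadata["variables"])
    if undocumented:
        raise ValueError(f"variables lacking metadata: {sorted(undocumented)}")
    return json.dumps({"metadata": metadata, "data": records}, indent=2)


payload = build_response(records, metadata)
```

Because the metadata travel with the data, a third-party tool could label and type-check each variable without consulting a separate document. The check for undocumented variables is one simple way to enforce that pairing.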
It is critically important that the data be accompanied by the machine-
actionable documentation (metadata) needed to establish the data’s history
of origin and ownership (provenance) and include a record of any modifica-
tions made during data editing and clean-up. The documentation also needs
to include the measurement properties of the data with sufficient detail and
accuracy to enable publication-ready tables to be automatically generated
in a statistically consistent manner.
Furthermore, it is critically important that a formal automated capabil-
ity for tracking and controlling changes to a project’s files—in particular to
source code, documentation, and web pages (version control)—and formal
change management procedures be applied to data collected by contractors.
This establishes a reliable data provenance and ensures that all previous
publications can be automatically verified and replicated.
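
One way to picture automated verification of provenance is a manifest that records a cryptographic fingerprint for every delivered version of a data file. The Python sketch below is a minimal illustration under stated assumptions: the manifest structure and the names fingerprint, record_version, and verify are hypothetical, and a production system would more likely rely on an established version control tool.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of a file's contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


def record_version(manifest: list, data: bytes, note: str) -> dict:
    """Append a manifest entry that links this version to its predecessor."""
    entry = {
        "version": len(manifest) + 1,
        "sha256": fingerprint(data),
        "parent": manifest[-1]["sha256"] if manifest else None,
        "note": note,
    }
    manifest.append(entry)
    return entry


def verify(manifest: list, version: int, data: bytes) -> bool:
    """Check that the supplied bytes match the digest recorded for a version."""
    return manifest[version - 1]["sha256"] == fingerprint(data)


# Hypothetical deliveries: a raw file and an edited file.
manifest = []
raw = b"id,field\n1,engineering\n"
edited = b"id,field\n1,Engineering\n"
record_version(manifest, raw, "as received from contractor")
record_version(manifest, edited, "standardized capitalization")
```

A publication that cites a manifest version can then be checked against the exact bytes it was built from, and the parent links preserve the order of modifications made during editing.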
In the panel’s judgment, NCSES is not very well positioned to meet the
above preconditions for taking advantage of emerging technologies. The
survey data that are entered into the center’s database are received from
the survey contractors in tabular format, mainly through machine-readable
tabulations, rather than in a more easily accessible microdata format.
This situation is not unique to the S&E data that are received from
contractors by NCSES. Suzanne Acar (representing the U.S. Federal Bureau
of Investigation and the Federal Data Architecture Subcommittee of the
Chief Information Officers Council) stated that difficulty in fully utilizing
emerging technologies is a government-wide issue, one that will be taken up
by a group of the World Wide Web Consortium (W3C) and other standards
organizations.1 W3C has plans to develop contract templates to enable
governmental organizations to properly specify the format for receipt of
the data from their contractors.
According to Ron Bianchi (representing the Economic Research Service
of the U.S. Department of Agriculture), barriers to taking advantage of
emerging technologies are a widespread issue across the federal statistical
system and have been identified as a major concern for the newly formed
Statistical Community of Practice and Engagement (SCOPE). This coordi-
nating activity involves most of the large federal statistical agencies. The
initial plans for the SCOPE initiative have included developing a template
for contract deliverables specifications for data formats and accompanying
metadata.
Recommendation 3-1. The National Center for Science and Engineer-
ing Statistics should incorporate provisions in contracts with data
providers for the receipt of versioned microdata, at the level of detail
originally collected, in open machine-actionable formats.
Implementing this recommendation will be no simple task for NCSES.
Currently, NCSES manages 13 major surveys that involve contracts with
five private-sector organizations and the U.S. Census Bureau (see Table
2-1). Furthermore, adding this requirement may initially incur additional
costs to support a shift from the current practice of formatting the data
after they are received to requiring contractors to input the data in a new
format. Some consideration will have to be made for reformatting the exist-
ing historical databases to be compatible with the new open formats and
structures, when possible, so data can be manipulated across current and
prior survey results.
To enable the receipt of metadata from contractors in a universally
accessible format, NCSES should consider adopting an electronic data inter-
change (EDI) metadata transfer standard. The selection and adoption of a
metadata transfer standard would be more effective if NCSES accomplished
it through participation in a government-wide initiative, such as the W3C
contract template development or the SCOPE effort, which is more focused
on the federal statistical agencies.
1 W3C is an international community of member organizations that develops web standards;
see http://www.w3c.org [November 2011].
Improving Data Delivery, Presentation, and Quality
In their presentations to the panel, the NCSES staff produced a large
hard-copy stack of tabulations, noting that the stack represented just one
of the center’s periodic reports. The staff also noted that, even though the
center has largely shifted to electronic dissemination, the dictates of data
accuracy and reliability require that a great deal of NCSES time be spent in
checking data and formatting them for print and electronic publication.2
For example, each page of the hard copy must be checked by someone look-
ing at the source data. This effort comes at the expense of ensuring data
integrity at the source. We think this emphasis is misplaced.
Although it will never be possible to fully avoid edit and quality checks,
because errors are prone to creep into data at any stage in processing, there
is much to be gained by focusing primarily on the quality of the incoming
raw data from the source. This approach is best ensured by adopting a
comprehensive database management framework for the process, rather
than the current primary focus on review of the tabular presentation. A
framework that ensures integrity at the source of the data, buttressed by
the availability of metadata, is the necessary foundation of real improve-
ment in data dissemination. Adoption of such an approach should have
further benefits. By changing to a dissemination framework from a review
framework, NCSES could free up some existing resources or be able to
reduce contractor involvement, which would allow for the realignment of
resources and funding to focus on making further process improvements.
Recommendation 3-2. The National Center for Science and Engineer-
ing Statistics should transition to a dissemination framework that
emphasizes database management rather than data presentation and
strive to use auditable machine-actionable means, such as version con-
trol, to ensure integrity of the data and make the provenance of the
data used in publications verifiable and transparent.
All of the tables published by NCSES are selections, aggregations,
and projections of the underlying micro-level observations. Recommen-
dation 3-2 envisions that, whenever possible, published tables should be
defined explicitly in these terms and produced by an automated process
that includes metadata.
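
The notion of a table defined as a selection, projection, and aggregation of microdata can be sketched in a few lines of Python. The records, the column names, and the tabulate function are hypothetical and serve only to illustrate the principle; they are not drawn from any NCSES system.

```python
from collections import Counter

# Hypothetical micro-level observations; values are illustrative only.
microdata = [
    {"field": "engineering", "sex": "F", "year": 2010},
    {"field": "engineering", "sex": "M", "year": 2010},
    {"field": "biology", "sex": "F", "year": 2010},
    {"field": "biology", "sex": "M", "year": 2009},
]


def tabulate(records, where, by):
    """Build a count table from microdata.

    where: predicate choosing which records to include (selection)
    by:    tuple of column names that define the table cells (projection)
    The counts per cell are the aggregation step.
    """
    selected = (r for r in records if where(r))
    keys = (tuple(r[column] for column in by) for r in selected)
    return Counter(keys)


# Table definition: degrees in 2010, counted by field.
table = tabulate(microdata, where=lambda r: r["year"] == 2010, by=("field",))
```

Because the table is fully determined by its definition and the underlying records, it could be regenerated automatically whenever the microdata are corrected, rather than being checked by hand against the source.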
The panel acknowledges that in some cases, such as the NCSES’s Science
and Engineering Indicators, this approach may not be immediately
feasible, since an extensive data appendix is necessary to support the analysis
in the report. However, in general (following the practice that NCSES
currently employs for the most detailed statistical tables), a web release of
the raw data will reduce the burden on the NCSES staff of manually
checking publications, will form the basis of a transition from tables
to information, and will provide users with more timely information. This
structured approach to release of data will also provide transparency in
the process, increase replicability, and assuage any user concerns about the
delay between data collection and their availability.
2 This information is based on the National Science Foundation presentation to the panel,
October 27, 2010 (slide numbers 14-16).
It is important that the data provided by contractors to NCSES include
machine-readable metadata that capture the statistical properties of the
data and of the collection and research design. The appropriate form and
content of these metadata are being considered in the SCOPE initiative. It is
likely that such metadata are produced in the data collection process, since
computer-assisted telephone interviewing (CATI) and other related survey
tools use much of this information in their operations. However, metadata
are currently not included in the required deliverables to the National Sci-
ence Foundation (NSF) from contractors.
The shift to increased user capacity to produce customized output from
the raw data is potentially a major enhancement that could
offer great direct benefit, but such a change will also require
consideration of second-order effects. Care will need to be taken to
protect data confidentiality when providing users with cross-source
microdata; consequently, rules about publishable cell size, for example,
will have to be carefully considered.3 The greater transparency inherent in
making more of the raw data available also increases the risk that users
could juxtapose data in ways that lead to invalid interpretations, although
this danger can certainly be reduced by the accessibility of robust metadata
that explain the meaning (and limitations) of the data.
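
A rule about publishable cell size can be expressed as a simple check applied before any custom table is released. The Python sketch below is an assumption-laden illustration: the threshold of 5 and the name suppress_small_cells are hypothetical, and the sketch performs primary suppression only. It does not address complementary suppression, which a real disclosure-limitation procedure would also need to consider.

```python
MIN_CELL_SIZE = 5  # hypothetical threshold, chosen only for illustration


def suppress_small_cells(table, min_size=MIN_CELL_SIZE):
    """Replace any count below min_size with None so that it is not published."""
    return {
        cell: (count if count >= min_size else None)
        for cell, count in table.items()
    }


# Hypothetical counts for a user-requested table.
counts = {"engineering": 12, "biology": 3, "physics": 5}
publishable = suppress_small_cells(counts)
```

In this example the biology cell would be withheld, while the engineering and physics cells, which meet the threshold, would be released unchanged.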
Recommendation 3-3. The National Center for Science and Engineer-
ing Statistics (NCSES) should require that data received from contrac-
tors be accompanied by machine-actionable metadata so as to allow
for automated production of NCSES publications, comparability with
previous analysis, and efficient access for third-party visualization,
integration, and analysis tools.
3 Several reports of the Committee on National Statistics address the need to maintain the
confidentiality of data provided to government agencies in confidence: Privacy and Confiden-
tiality as Factors in Survey Response (National Research Council, 1979); Private Lives and
Public Policies: Confidentiality and Accessibility of Government Statistics (National Research
Council, 1993); Protecting Student Records and Facilitating Education Research (National
Research Council, 2009); and Protecting and Accessing Data from the Survey of Earned
Doctorates (National Research Council, 2010).
Another positive benefit of providing transparency and tools for explor-
atory access to data is that users will be in a position to identify errors in
the data. NCSES should be prepared to solicit and accept error reports and
make corrections as necessary. In contemporary terms, this would be an
application of “crowd sourcing”—a focused attempt to tap into the collec-
tive intelligence of the users of the data. Clearly, when the general public
has access and tools to combine data across data sources, there may be
additional questions about data accuracy and usefulness, and NCSES will
need to do its best to educate users and respond to their discoveries.
In its presentations, NCSES staff stressed that they are a comparatively
small organization with limited resources. One way that these limited
resources could be stretched is for NCSES to consider digital distribution
channels, including enhanced use of PDF files and, after investigation of
cost and benefits, perhaps facilitating print-on-demand (POD) publica-
tion. NCSES may wish to consider turning to POD technology of the U.S.
Government Printing Office (GPO) as a potential means of controlling the
costs associated with printing and distributing the few remaining hard-copy
reports that it produces (see Chapter 2 for details).
VISUALIZATION OF S&E DATA
Just as a picture may be worth a thousand words, so can the best data
visualizations replace a ream of tabular output and written analysis. Appli-
cations of data visualization—or as Edward Tufte (2004) characterizes it,
the visual display of quantitative information—are growing profusely. (See
Ware, 2004, for a contemporary treatment of this area.) Data visualizations
are increasingly being used by federal data-producing agencies and others
to analytically depict large data sets, such as those produced by NCSES.
Two of the larger statistical agencies—the Census Bureau and the Bureau
of Economic Analysis—and other federal agencies maintain visualization
sites that are suggestive of approaches that NCSES might profitably take.4
Indeed, assisted by NCSES, the National Science Board has provided
visualized data in the form of charts, graphs, and maps in its printed
and online digest published in support of the 2010 Science and Engineering
Indicators volume (National Science Board, 2010b). These static displays
of information have been chosen by NSF staff for their ability to clarify
relationships and trends in visually pleasing and interesting ways. They
are appropriately considered first-generation visualizations, since they are
not associated with an electronic database and thus are not susceptible to
manipulation by data users who want to interactively illustrate aspects of
the data for their own analysis.
4 See http://blogs.census.gov/censusblog/2011/07/visualizing-foreign-trade-data.html;
http://lehd.did.census.gov/led/datatools/visualization.html;
http://www.bea.gov/newsreleases/glance.htm; http://www.bea.gov/itable/index.cfm;
http://www.uspto.gov/dashboards/patents/main.dashxml [August 2011].
The field of data visualization is quite dynamic, with new approaches
and technologies being offered in the form of online sites and applications
by both private and public sectors, as well as nuanced approaches to building
a community of analysts around visualized subject matter. Because of
the shortage of staff resources and the fast-changing data visualization landscape,
the panel suggests that NCSES choose among several deliberate approaches
to make progress toward improving visualization
of the NCSES data. NCSES could (a) confederate with other federal
statistical agencies that are already moving forward with visualization
programs under an umbrella such as SCOPE; (b) work with private-sector
vendors, such as the Google Public Data Explorer, to expand the potential
for visualization of the NCSES data sets (taking much the same approach
as Eurostat); or (c) continue to develop a select set of straightforward
visualizations, such as those offered in the 2010 Digest but continuously
update those visualizations and post them to the Internet when new data
become available.
As discussed in Chapter 2, a complementary approach would be to
provide the data in machine-understandable formats using open standards
and with appropriate metadata so that users can develop their own visu-
alizations using the increasingly sophisticated private vendor visualization
tools that are on the market. NCSES could take advantage of the rapidly
emerging services that make data easier to find, aggregate, interpret, inte-
grate, and link.
Recommendation 3-4. The National Center for Science and Engineer-
ing Statistics should proceed to make its data available through open
interfaces and in open formats compatible with efficient access for
third-party visualization, integration, and analysis tools.
RETRIEVAL AND DISSEMINATION TOOLS
Adopting a new approach to data management and distribution will
open up many exciting opportunities for low-cost solutions to data retrieval
and dissemination. These opportunities would expand utilization of emerging
government and private-sector resources to go beyond the capabilities offered
by the current Scientists and Engineers Statistical Data System (SESTAT), the
Integrated Science and Engineering Resources Data System (WebCASPAR),
the Industrial Research and Development Information System (IRIS), and the
Survey of Earned Doctorates (SED) Tabulation Engine tools.
As discussed in Chapter 2, once the conditions are established for
dissemination of data, the public services, such as Data.gov, and private
services, such as the Google Public Data Explorer, can bear much of the
burden of dissemination. A caveat is in order here, however. Although using
private-sector tools for dissemination is a promising solution for NCSES,
dissemination tool development is extremely dynamic in the private sector,
as panel member Micah Altman observed at the panel workshop. Many
of the start-up dissemination and data sharing services have gone out of
business. In view of this uncertainty, his advice is that users should mitigate
the risk of using any of these systems by opting for open-source software
whenever possible, retaining preservation copies of files in other institu-
tions, limiting use to dissemination only (not data management), and lever-
aging metadata and APIs to create one data source that is then disseminated
through multiple sources.
Another caution was voiced at the panel workshop by Myron P.
Gutmann, director of NSF’s Directorate of Social, Behavioral, and Eco-
nomic Sciences, with regard to such private-sector services as the Google
Public Data Explorer. He warned that it could be dangerous to overrely on
these private-sector dissemination tools, since the conditions of service or
even the continued provision of service are corporate decisions that could
significantly change or even end the dissemination mode. He also expressed
a concern that distribution in a nongovernment-owned system could open
the possibility of unauthorized changes in the data set unless there were
strict controls in place within the dissemination tool and a policy that the
data be anchored back to the originating federal agency source.
Altman identified research challenges and gaps between the state of the
art and the state of the practice. Research challenges in this area include
peta-scale online analysis, interactive statistical disclosure limitation, busi-
ness models for long-term preservation, and data analysis tools for the
visually impaired. Closable gaps include managing nontabular complex
data and metadata-driven harmonization and linkage across data resources.
Recommendation 3-5. The National Center for Science and Engineer-
ing Statistics should develop a plan for redesign of its retrieval tools
utilizing the emerging, sustainable capabilities of other government and
private-sector resources.
PRESERVING ACCESS TO S&E DATA
When considering data release and management, it is important to
have a long-term data management plan. Yet according to staff, the current
NCSES approach to archival issues is ad hoc. In view of the importance of
these data for historical reference, long-term access and permanent archival
preservation are needed, and these could be ensured through proper policies
and practices.
At a minimum, all of the collected data and the electronic and hard-
copy publications that are produced should be scheduled for retention by
the National Archives and Records Administration (NARA). In this regard,
the NSF Sustainable Digital Data Preservation and Access Network Part-
ners (DataNet) initiative is a ready in-house source of information on best
practices and tools for implementing an active archival program.
NARA ELECTRONIC RECORDS PROGRAM
NARA is responsible for the custody and retrieval of federal government
records for which it has received a transfer of legal custody
from the originating agency. A growing part of the NARA collections
is in the form of electronic records. Because of the panel’s interest in
ensuring the long-term retention and retrieval of NCSES data, we invited
Margaret O. Adams, manager of the Archival Services Program, and Theo-
dore J. Hull, senior archivist of accessions, to discuss the NARA reference
services for electronic records.
The process for identifying records for archiving is a collaborative one.
NCSES is required by law to manage records created or received in the
course of business, and it does so by completing a form (Standard Form
115) that outlines the holdings and requests records disposition authority.
Through a records scheduling and appraisal process, the archivist of the
United States determines which federal records have temporary value and
may be destroyed and which federal records have permanent value and must
be preserved and transferred to the National Archives of the United States.
The archivist’s determination constitutes mandatory authority for the final
disposition of all federal records (36 CFR 1220.12). Only a very small percentage
of records identified for permanent retention are actually accessioned
by NARA, but the kind of electronic records that are produced by NCSES
have a very high chance of being appraised for permanent retention—that is,
social and economic microdata collected for input into periodic and onetime
studies and statistical reports, including information filed to comply with
government regulations, as well as summary statistical data from national
or special censuses and surveys.
According to Hull, a good part of the accessioning work is done by
NARA. When records, documentation, and accession documents (SF-258)
are received, NARA conducts a preliminary assessment, which can involve
converting files to ASCII, contacting the agency for replacements or additional
documentation, verifying file formats, and selecting only permanent files for
retention. Only then are records archived using NARA’s Archival Preserva-
tion System (APS).
After they are accessioned, the records may be researched and retrieved using
descriptions of the electronic records series in NARA’s online Archival
Research Catalog (ARC).5 (This source will be replaced by NARA’s Online
Public Access [OPA] system in coming months.) ARC includes descriptions
for approximately 68 percent of NARA’s holdings nationwide and about
99 percent of accessioned electronic records.
The NARA records system is a very large system. As of January 2011,
there were 717 series and 6.6 billion logical data records contributed by
over 150 source agencies described in the ARC. The ARC search supports
filtering by type of records (data files), and copies of fully releasable data
files are provided on removable media for cost recovery. The Online Public
Access system currently under development aims to support direct down-
load of electronic records files.
In her presentation, Adams referred to the Committee on National Sta-
tistics publication, Principles and Practices for a Federal Statistical Agency
(National Research Council, 2009, p. 27), which states that “a good dis-
semination program also uses a variety of channels to inform the broadest
possible audience of potential users about available data products and how
to obtain them. . . . Agencies should also arrange for archiving of data with
the National Archives and Records Administration and other data archives,
as appropriate, so that data are available for historical research in future
years.”
As mentioned above, the archiving process begins with the identifica-
tion of holdings and the request for records disposition authority by the
agency. This is sometimes a challenging task, particularly with the growth
of electronic versus hard-copy holdings. In the case of NCSES, the process
of identifying and completing a records disposition authority request was
last completed in 1995. Several types of records were then identified for
permanent retention, including final published surveys and studies; elec-
tronic micro-level survey data, final edited versions of all electronic survey
microdata, databases, spreadsheets, detailed tables, charts, statistical data,
and other micro-level respondent information created prior to compiling,
condensing, or summarizing the survey responses into the final summa-
rized or published product; electronic text and detailed statistical tables,
data analyses, and related records; electronic copies of survey reports,
including the text of the final report and all other electronic records related
to the report, such as detailed tables, charts, statistical data analyses,
and spreadsheets; and technical information regarding data format and
structure and other related computer program and system documentation,
including codebooks, file layouts, data fields, data dictionaries, and other
records that are necessary to understand the microdata. For most of these
items, NCSES is instructed to retain them at the agency level for 10 years
and then forward them to NARA.
5 See http://www.archives.gov/research/arc [November 2011].
Much has happened in terms of data collection, processing, and dis-
semination in the years since 1995. It is appropriate that NCSES review and
refile, if necessary, a request for records disposition authority.
Recommendation 3-6. The National Center for Science and Engineer-
ing Statistics (NCSES) should work with the National Archives and
Records Administration (NARA) to ensure long-term access and preser-
vation of all of its publications and all data necessary to replicate these
publications. As a necessary step, NCSES should review and update
the request for disposition authority that is filed with NARA to ensure
prompt and complete disposition of records and should regularly
review the status of compliance with the records retention directive.