Appendix D
Scientific Computing, Information
Technology, and Informatics
INFORMATICS AND INFORMATION TECHNOLOGY
IT and informatics are in rapid transition. The global capacity of computing
and telecommunication has been growing exponentially (Hilbert and Lopez 2011).
The end of Moore's law [1] of exponential growth in computer hardware power
(Robert 2000) will require, for example, mastery of parallel programming to
sustain the growth of computing performance and to meet the need for analyzing
massive amounts of data until a postsilicon era is realized. The next 10 years
will see a massive rebuilding of IT infrastructure everywhere.
The economics of IT are also changing profoundly, largely under the favorable
pressure of consumer applications. Enormous increases in data bandwidth
(especially wireless) have made possible a wide array of mobile endpoints for
applications, and this trend will continue. Traditional relational databases
cannot scale to handle the rapid growth in the unstructured, semiconsistent,
real-time data on which commercial decisions often need to be based; that
inability has led to the emergence of such tools as MapReduce, Hadoop, and
other next-generation data environments (NoSQL 2012), which are discussed
above. Virtualization is steadily eliminating the concept of a dedicated server
in a fixed location, and cloud computing is transforming the economics of IT.
Social networking, already a major consumer phenomenon, has now entered the
scientific workplace and can be used for heightened collaboration, as discussed
above.
All of the emerging changes will require a more responsive and flexible
approach to the opportunities afforded by global informatics and lead to a
systems perspective of data instead of a focus on one locale, one experiment,
or one medium at a time. Those are the directions that IT and informatics are
taking. The challenge will lie in understanding how to harness information for
EPA's science needs for the future and understanding the role of advanced
computer science and informatics in EPA.
[1] Moore's law is a rule of thumb in the history of computing hardware whereby
the number of transistors that can be placed inexpensively on an integrated
circuit doubles about every 2 years (Moore 1965).
Science For Environmental Protection: The Road Ahead
High-Performance Computing
EPA's National Computer Center in Research Triangle Park, North Caro-
lina, houses many of the agency's computing resources, including the super-
computing resources used by the Environmental Modeling and Visualization
Laboratory and resources for such major applications as computational toxicol-
ogy, exposure research, and risk assessment. Those resources are traditional
high-performance computing machines, the products of a shrinking and strug-
gling industry segment. The future of high-performance computing machines
will look entirely different, and it is important that EPA adjust to the change to
remain at the leading edge of the field.
Parallel Programming
Central processing units (CPUs) can no longer be made to run faster, so
progress requires putting multiple CPUs, or "cores", on each chip to operate
concurrently. That, in turn, requires a decomposition of applications into inde-
pendent components that can run in parallel. An important opportunity afforded
by the effort to create highly parallel programs is that they can also be exported
to external networks of underused processing for the few jobs that require mas-
sive resources. The existing tools for that style of programming are poor, and the
skill is seldom taught. Fortunately, EPA has had experience in this regard in its
supercomputing projects, but it will need to expand its overall skills inventory
greatly to continue to take advantage of parallel and emerging techniques in
computing as Moore's law comes to an end.
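The decomposition described above can be sketched in Python as follows. This is
a minimal illustration under stated assumptions, not EPA code: the function
names, the chunking scheme, and the sum-of-squares workload are all
hypothetical and were chosen only because each chunk can be processed without
reference to any other.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks contiguous, independent pieces."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_of_squares(chunk):
    """Process one chunk; it needs no data from any other chunk."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(values, n_workers=4):
    """Hand the independent chunks to workers, then combine the results."""
    chunks = split_into_chunks(values, n_workers)
    # ThreadPoolExecutor keeps the sketch portable. For CPU-bound work in
    # CPython, concurrent.futures.ProcessPoolExecutor offers the same
    # interface and uses separate processes (and therefore separate cores).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    readings = list(range(1_000))  # hypothetical input data
    print(parallel_sum_of_squares(readings))
```

The design point is that the combining step (here, a final sum) is the only
place where results from different workers meet; applications that cannot be
arranged in this way are harder to parallelize.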
Cloud Computing
Cloud computing will redefine the economics of computation for the next
20 years. A cloud-computing server typically provides services to its clients in
three ways: complete applications (software as a service, or SaaS); a platform
for clients to build on (PaaS); or a raw infrastructure of processors, storage, and
networks (IaaS). Clouds generally are classified as public (provided commer-
cially), private (to one or more organizations), or hybrid (public with a secure
connection to private). Services can be scaled up or down in capacity and per-
formance instantly; the client is charged for the amount of time, storage, CPUs,
and bandwidth, moment by moment. Even organizations with extreme needs for
computation, storage, and bandwidth and high volatility of demand over the
short term have been able to transition from their own data centers to the cloud
with excellent results (Cockcroft 2011). EPA has recognized the opportunity
presented by cloud computing and has begun to embark on a process of transi-
tion for many services to a private EPA cloud (Lee and Eason 2010).
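The economic argument for metered services under volatile demand can be
illustrated with a small calculation. The sketch below is hypothetical: the
hourly demand profile and the price are invented, and it assumes, purely for
comparison, that a CPU-hour costs the same whether it is rented on demand or
provisioned permanently for peak load. Real pricing differs between the two
cases.

```python
def metered_cost(hourly_cpu_demand, rate_per_cpu_hour):
    """Cost when the client pays only for the CPUs in use during each hour."""
    return sum(hourly_cpu_demand) * rate_per_cpu_hour


def peak_provisioned_cost(hourly_cpu_demand, rate_per_cpu_hour):
    """Cost when capacity sized for the peak hour is paid for in every hour."""
    if not hourly_cpu_demand:
        return 0.0
    hours = len(hourly_cpu_demand)
    return max(hourly_cpu_demand) * hours * rate_per_cpu_hour


if __name__ == "__main__":
    # Hypothetical 8-hour window with a 2-hour burst of heavy computation.
    demand = [2, 2, 2, 64, 64, 2, 2, 2]
    rate = 0.5  # hypothetical price per CPU-hour
    print("metered:", metered_cost(demand, rate))
    print("peak-provisioned:", peak_provisioned_cost(demand, rate))
```

The more sharply demand peaks relative to its average, the larger the gap
between the two figures becomes under these assumptions.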
Throughout EPA, applications and databases are the responsibility of the
regions and the technical offices, but the
Office of Technology Operations and Planning (in the Office of Environmental
Information) provides the infrastructure, platform, and support from datacenters
in Research Triangle Park, North Carolina; Arlington, Virginia; Chicago, Illi-
nois; and Denver, Colorado. Thus, it is natural for EPA scientific computing to
move to PaaS and IaaS cloud operation, and it has begun to do so. Done care-
fully, this will also permit some applications to be moved to the public cloud as
economics requires. Given the trajectory of costs and budgets, that is inevitable,
and it is important that EPA continue on this path, ensuring that new science
applications are designed for private cloud implementation and for later portabil-
ity to the public cloud.
Wireless Networks
Dramatic improvement in the performance of data transmission in both wide-area
and local wireless networks is driving enormous growth in mobile devices and
applications. With many government agencies upgrading infrastructure under
pressure to use the underused radiofrequency spectrum that they control more
effectively, that growth will continue for the foreseeable future. Combined
with new-generation real-time sensors, the wireless network has a profound
effect on collection of and access to environmental information, but it also
changes expectations about the user experience. Furthermore, designing for
mobile devices involves constraints and freedoms different from those of
building Web applications for a desktop environment. Those techniques will be
important as EPA works to engage and gain support from the public. It will
also be important for EPA to master the skills of spectrum-sharing and
efficient use of bandwidth.
DATA MANAGEMENT
With centralized data centers, strong data-quality standards, and highly
organized exchanges, EPA is executing well in IT and has adapted to changing
technology while continuing to support its original charter to protect the
environment and human health. However, a persistent challenge in such fields
as computational toxicology is the integration of available data from many
sources. In particular, many investigators who generate large datasets may not
have the knowledge and experience in informatics to integrate and interpret the
data successfully. In the future, adopting a systems-thinking approach will result
in a mixture of data from a variety of sources, including the atmosphere, soil,
water, and foods; data will be related to genetics and health outcomes; and they
will range from highly unstructured to highly structured data. These factors will
require even more multidisciplinary collaboration among agency scientists.
Warehousing and Mining
As increasingly large amounts of data continue to be generated through
designated systems--such as environmental monitoring, biomarker and other
exposure surveillance data, disease surveillance, and designed epidemiologic
and experimental studies--or streamed from community crowdsourcing, EPA is
faced with both an opportunity and a challenge of channeling and integrating
data into a massive "data warehouse". Data warehousing is a well-developed
concept and a common practice in business (Miller et al. 2009). In EPA, the
adaptation of and transition to data warehousing will continue to evolve with
good protocols, such as EPA's Envirofacts Warehouse (Pang 2009; Egeghy et
al. 2012) and the Aggregated Computational Toxicology Resource (Egeghy et
al. 2012; Judson et al. 2012). In the future, data in EPA's warehouse will come
from diverse sources, from multiple media, and across geographic, physical, and
institutional boundaries. Recent efforts to integrate the US Geological Survey's
National Water Information System with EPA's Storage and Retrieval System
are an example (Beran and Piasecki 2009). To harvest relevant information from
massive datasets to support EPA's science and regulatory activities, integration
of heterogeneous databases and mining of these massive datasets present some
new opportunities. A recent application involving the European Union's Water
Resource Management Information System is a case in point (Dzemydienė et al.
2008).
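Integration of heterogeneous databases typically begins by mapping each
source's records onto one shared schema with common units. The following
sketch illustrates that step only. The field names, the two source layouts, and
the unit conventions are hypothetical; they are not the actual schemas of the
National Water Information System, the Storage and Retrieval System, or any
other system named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Shared warehouse record; concentrations are stored in mg/L."""
    site: str
    analyte: str
    date: str  # ISO 8601 date string
    value_mg_per_l: float
    source: str


def from_source_a(row):
    """Hypothetical source A already reports values in mg/L."""
    return Observation(
        site=row["site_no"],
        analyte=row["parameter"].strip().lower(),
        date=row["sample_date"],
        value_mg_per_l=float(row["result_mg_l"]),
        source="A",
    )


def from_source_b(row):
    """Hypothetical source B reports micrograms per liter; convert to mg/L."""
    return Observation(
        site=row["station"],
        analyte=row["characteristic"].strip().lower(),
        date=row["date"],
        value_mg_per_l=float(row["value_ug_l"]) / 1000.0,
        source="B",
    )


def build_warehouse(rows_a, rows_b):
    """Normalize both sources and return one list sorted for querying."""
    records = [from_source_a(r) for r in rows_a]
    records += [from_source_b(r) for r in rows_b]
    return sorted(records, key=lambda o: (o.site, o.analyte, o.date))


if __name__ == "__main__":
    a = [{"site_no": "S1", "parameter": "Nitrate",
          "sample_date": "2012-03-01", "result_mg_l": "1.5"}]
    b = [{"station": "S1", "characteristic": "nitrate ",
          "date": "2012-02-15", "value_ug_l": "2500"}]
    for record in build_warehouse(a, b):
        print(record)
```

Keeping the source label on every record preserves provenance, which matters
when the contributing systems have different quality standards.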
Data-mining has become a standard for analyzing massive, multisource,
heterogeneous data on consumer behavior used in business (Ngai et al. 2009).
EPA should and can adopt this data analytic paradigm to support its knowledge-
discovery process. The paradigm is increasingly important at a time when the
discovery of new evidence or a new data model can be bolstered by dynamic
mining of large amounts of data, including environmental indicators of air and
water, satellite imagery of climate change from representative population data-
bases, health indicators from disease surveillance systems and medical data-
bases, social behavioral patterns, individual lifestyle data, and -omics data and
disease pathways. That will require EPA to invest its resources to continue the
development of new analytic and computational methods to deal with static
datasets (for example, modeling of complex biologic systems and air and water
models) and to adapt and develop new data-mining techniques to process, visu-
alize, link, and model the massive amounts of data that are streaming from mul-
tiple sources. EPA is making progress in that direction in its Aggregated Com-
putational Toxicology Resource System (Judson et al. 2012). Successful cases
have also been reported for ecologic modeling (Stockwell 2006), air-pollution
management (Li and Shue 2004), and toxicity screening (Helma et al. 2000;
Martin et al. 2009), to name a few.
Large Datasets
Informatics, data warehousing, and data-mining afford EPA powerful
tools for maximal use of the wealth of information that will continue to be gathered
by it, other agencies, and the public on an unprecedented scale. Data analysis
and modeling in many cases will be accomplished through informatics tech-
niques, as is already the case in the analysis of -omics data (Ng et al. 2006;
Baumgartner et al. 2011; Roy et al. 2011). As EPA moves forward with analyz-
ing and modeling large sets of data, it should keep three points in mind:
• Information generation and information gathering are accelerating
exponentially, and EPA will not be able to generate all the data needed to
address complex environmental and health problems. It would benefit the agency
to continue to develop its capacity to access, harvest, manage, and integrate
data from diverse sources and different media and across geographic and
disciplinary boundaries rapidly and systematically.
• Links between environmental change, exposure, human behavior, and human
health are complex, and seamless integration and dynamic mining of diverse
datasets will boost the chance of discovering such links. For example, to
derive personal exposure estimates for particulate matter smaller than 2.5 µm
in diameter (PM2.5), it is necessary to integrate environmental data, human
behavioral data, and insight about how PM2.5 penetrates various indoor
microenvironments. The exposure estimates are then linked to disease-mechanism
data and health data. Such an approach is not difficult to appreciate in
principle, but its practice hinges on how successfully an informatics approach
can be adapted to mine the massive data from diverse systems. EPA has been a
leader in air-quality research and associated health effects of exposure to air
pollutants, as showcased through its contributions to the Six Cities Study
(Dockery et al. 1993) and the National Morbidity, Mortality, and Air Pollution
Study (Samet et al. 2000; Dominici et al. 2006), and it is in a strong position
to retain its cutting-edge position by adapting informatics approaches to the
analysis and modeling of diverse and massive datasets.
• As environmental challenges continue to emerge and evolve, EPA's approach to
problem-solving will need to be dynamic and adaptive. Having a cutting-edge
capacity of data warehousing, data-mining, bioinformatics, environmental
informatics, and health informatics will boost EPA's ability to integrate
massive external data in a timely fashion, to adopt new techniques, to borrow
scientific and technical expertise from outside the agency, and to be more
responsive and anticipatory.
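The PM2.5 example in the second point can be made concrete with a simple
time-weighted calculation that combines the three kinds of input it names:
an ambient concentration (environmental data), the hours spent in each
microenvironment (behavioral data), and an infiltration factor for each
microenvironment (penetration). The sketch below is one common simplification
offered as an assumption, not a method taken from this report; it ignores
indoor sources of PM2.5, uses a single ambient value for the whole day, and all
numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Microenvironment:
    name: str
    hours: float
    # Assumed fraction of ambient PM2.5 that penetrates and stays airborne.
    infiltration_factor: float


def personal_exposure(ambient_ug_m3, day):
    """Time-weighted average exposure (µg/m3) to PM2.5 of ambient origin."""
    total_hours = sum(m.hours for m in day)
    if total_hours <= 0:
        raise ValueError("activity pattern must cover a positive duration")
    weighted = sum(m.hours * m.infiltration_factor * ambient_ug_m3 for m in day)
    return weighted / total_hours


if __name__ == "__main__":
    # Hypothetical activity pattern and infiltration factors.
    day = [
        Microenvironment("home", 16, 0.5),
        Microenvironment("outdoors", 2, 1.0),
        Microenvironment("office", 6, 0.75),
    ]
    print(personal_exposure(20.0, day))
```

In practice each input would come from a different data system, which is why
the estimate depends on the integration capacity discussed in the first point.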
As EPA continues to strengthen its informatics infrastructure, it will be
important to pay attention to new analytic and statistical methods to address
emerging modeling issues and to bridge methodologic gaps. Several outstanding
issues warrant high priority. One challenge is to analyze large amounts of data
from diverse sources without having a shared standard for the data collection
(Hall et al. 2005). For example, screening and identifying complex chemical
mixtures in the natural environment are difficult because there are so many possible
mixtures and the mixtures change temporally and spatially (Casey et al. 2004).
A second example involves conducting gene-screening analysis to differentiate
among tens of thousands of genes or single-nucleotide polymorphisms along a
hypothesized disease pathway with only a small number of subjects. Overzeal-
ous findings of a positive association are a consequence of this high-dimension
problem (Rajaraman and Ullman 2011). Mining that type of data could pose
serious challenges in validity and utility when the data are from across geo-
graphic and disciplinary boundaries and have heterogeneous quality standards.
A special danger with huge datasets is the problem of multiple comparisons,
which can lead to massive numbers of false-positive results. Also with such
data, bias sometimes dominates randomness: increasing the amount of data
generally reduces variances, sometimes close to zero, but it does not reduce
bias. In fact, it may even increase bias by diverting attention from the basic
quality of the data. Another challenge involves the modeling of complex biologic
systems (such as pathway models, physiologically based pharmacokinetic and
pharmacodynamic models, and hospital admission data). Information from a
small number of static datasets is insufficient to support a large number of
unknown model parameters. Two approaches are widely used: fixing some
parameters at values that have only weak support from external systems (Wang et
al. 1997) and tightening the range of variation of the values of the parameters
by imposing probabilistic distributions in a Bayesian approach, such as Markov
chain Monte Carlo (Bois 2000). Those methods may give the user an unwarranted
sense of truth when there are substantial uncertainties in the true model.
As informatics and data-mining become standard, techniques for data analysis
will be increasingly hybrid, combining mathematical, computational, graphical,
and statistical tools and qualitative methods to conduct data exploration,
machine learning, modeling, and decision-making. Developing its in-house
capability will help EPA to adopt and apply the new techniques.
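The multiple-comparisons danger noted above can be demonstrated by simulation.
In the sketch below every test is a true null, so any "finding" is a false
positive. The simulated p-values, the seed, and the function names are
assumptions made for illustration. Bonferroni and Benjamini-Hochberg are two
standard corrections shown for contrast; neither is prescribed by this report.

```python
import random


def simulate_null_p_values(n_tests, seed=0):
    """Draw p-values for tests in which no true association exists.

    Under the null hypothesis a p-value is uniform on [0, 1]; this is a
    simulation assumption, not data from any study.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_tests)]


def count_unadjusted(p_values, alpha=0.05):
    """Number of tests declared significant with no correction."""
    return sum(p < alpha for p in p_values)


def count_bonferroni(p_values, alpha=0.05):
    """Bonferroni: compare every p-value with alpha divided by the test count."""
    if not p_values:
        return 0
    threshold = alpha / len(p_values)
    return sum(p < threshold for p in p_values)


def count_benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: reject the k smallest p-values, where k is the
    largest rank whose sorted p-value is at most alpha * rank / m."""
    m = len(p_values)
    n_rejected = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= alpha * rank / m:
            n_rejected = rank
    return n_rejected


if __name__ == "__main__":
    p_values = simulate_null_p_values(20_000)  # e.g., one test per gene
    print("unadjusted:", count_unadjusted(p_values))
    print("Bonferroni:", count_bonferroni(p_values))
    print("Benjamini-Hochberg:", count_benjamini_hochberg(p_values))
```

With no correction, roughly 5% of the null tests are expected to fall below a
0.05 threshold, which for tens of thousands of genes means a large number of
spurious associations. Neither correction addresses bias, which does not
shrink as the dataset grows.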
Data Sharing and Distribution
EPA devotes substantial resources to the public sharing of data resources.
It also provides support and encouragement to software and application (app)
developers for the creation of both institutional and consumer applications for
accessing, presenting, and analyzing available environmental data. One example
is the Toxics Release Inventory. Others being developed are the EPA Saves
Your Skin mobile telephone app, which provides ZIP code-based ultraviolet
index information to help the public take action to protect their skin, and an
air-quality index mobile app, which feeds air-quality information based on ZIP
code. The agency has made strides in analytic and simulation activities, as
shown in the leadership role that it has played in computational toxicology (see
the section "Example of Using Emerging Science to Address Regulatory Issues
and Support Decision-Making: ToxCast Program" in Chapter 3).
As information trends move from long-term data to data that are gathered in
nearly real time from dispersed geographic sites, there will not be time for a
traditional cycle in which the desired information is extracted from the
original compilation, reformatted to a specific standard, and finally loaded
into an analytic application. It will instead be necessary to "send the
algorithm to the data" and receive and collect the results centrally. In other
words, the complex formulas developed to analyze the data may be applied at the
site and time of data collection, rather than the data being sent to a central
data-processing site for analysis. That approach, described by Google in 2004,
is named MapReduce and uses a functional programming model (Dean and Ghemawat
2004). Hadoop, a widely used implementation of MapReduce, is available in
open-source form and from several major vendors. Not only can Hadoop
programming parallelize the problem of accessing widely distributed data; it is
also especially useful for processing unstructured data or combining them with
traditional structured data.
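The programming model can be sketched in a few lines of Python. This is a
single-process illustration of the map, shuffle, and reduce steps only; it is
not the Hadoop API, and the function names, record layout, and readings are
hypothetical. Each inner list stands for data held at a different site, and the
mapper is applied to each partition separately, which is the sense in which
the algorithm is sent to the data.

```python
from collections import defaultdict
from functools import reduce


def map_partition(records, mapper):
    """Run the mapper where the data live; emit (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]


def shuffle(pair_lists):
    """Group the values emitted by all partitions under their keys."""
    groups = defaultdict(list)
    for pairs in pair_lists:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups, reducer):
    """Fold each key's values into a single result with a binary reducer."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


def map_reduce(partitions, mapper, reducer):
    """Map each partition independently, then collect and reduce centrally."""
    mapped = [map_partition(p, mapper) for p in partitions]
    return reduce_groups(shuffle(mapped), reducer)


def reading_to_pair(record):
    """Hypothetical mapper: key each reading by its pollutant."""
    yield record["pollutant"], record["value"]


if __name__ == "__main__":
    site_a = [{"pollutant": "ozone", "value": 41.0},
              {"pollutant": "pm25", "value": 12.5}]
    site_b = [{"pollutant": "ozone", "value": 55.0},
              {"pollutant": "pm25", "value": 9.0}]
    # Peak reading per pollutant across both sites.
    print(map_reduce([site_a, site_b], reading_to_pair, max))
```

Because the mapper and reducer are pure functions with no shared state, a real
framework is free to run them on many machines at once and to rerun them after
a failure.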
REFERENCES
Baumgartner, C., M. Osl, M. Netzer, and D. Baumgartner. 2011. Bioinformatic-driven
search for metabolic biomarkers in disease. J. Clin. Bioinform. 1:2,
doi:10.1186/2043-9113-1-2.
Beran, B., and M. Piasecki. 2009. Engineering new paths to water data. Comput. Geosci.
35(4):753-760.
Bois, F.Y. 2000. Statistical analysis of Fisher et al. PBPK model of trichloroethylene
kinetics. Environ. Health Perspect. 108(suppl. 2):275-282.
Casey, M., C. Gennings, W.H. Carter, V.C. Moser, and J.E. Simmons. 2004. Detecting
interaction(s) and assessing the impact of component subsets in a chemical mixture
using fixed-ratio mixture ray designs. J. Agr. Biol. Environ. Stat. 9(3):339-361.
Cockcroft, A. 2011. Net Cloud Architecture. Velocity Conference, June 14, 2011
[online]. Available:
http://www.slideshare.net/adrianco/netflix-velocity-conference-2011 [accessed
Apr. 10, 2012].
Dean, J., and S. Ghemawat. 2004. MapReduce: Simplified data processing on large
clusters. Pp. 137-149 in Proceedings of the 6th Symposium on Operating Systems
Design and Implementation (OSDI '04), December 5, 2004, San Francisco, CA
[online]. Available:
http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf [accessed
Mar. 30, 2012].
Dockery, D.W., C.A. Pope, III, X. Xu, J.D. Spengler, J.H. Ware, M.E. Fay, B.G. Ferris,
and F.A. Speizer. 1993. An association between air pollution and mortality in six
US cities. N. Engl. J. Med. 329(24):1753-1759.
Dominici, F., R.D. Peng, M.L. Bell, L. Pham, A. McDermott, S.L. Zeger, and J.M.
Samet. 2006. Fine particulate air pollution and hospital admission for
cardiovascular and respiratory diseases. JAMA 295(10):1127-1134.
Dzemydienė, D., S. Maskeliūnas, and K. Jacobsen. 2008. Sustainable management of
water resources based on web services and distributed data warehouses. Technol.
Econ. Dev. Econ. 14(1):38-50.
Egeghy, P.P., R. Judson, S. Gangwal, S. Mosher, D. Smith, J. Vail, and E.A. Cohen
Hubal. 2012. The exposure data landscape for manufactured chemicals. Sci. Total
Environ. 414(1):159-166.
Hall, P., J.S. Marron, and A. Neeman. 2005. Geometric representation of high dimension,
low sample size data. J. R. Statist. Soc. B 67(3):427-444.
Helma, C., E. Gottmann, and S. Kramer. 2000. Knowledge discovery and data mining in
toxicology. Stat. Method. Med. Res. 9(4):329-358.
Hilbert, M., and P. López. 2011. The world's technological capacity to store,
communicate, and compute information. Science 332(6025):60-65.
Judson, R.S., M.T. Martin, P. Egeghy, S. Gangwal, D.M. Reif, P. Kothiya, M. Wolf, T.
Cathey, T. Transue, D. Smith, J. Vail, A. Frame, S. Mosher, E.A. Cohen-Hubal, and
A.M. Richard. 2012. Aggregating data for computational toxicology applications:
The US Environmental Protection Agency (EPA) Aggregated Computational
Toxicology Resource (ACToR) System. Int. J. Mol. Sci. 13(2):1805-1831.
Lee, M., and W. Eason. 2010. The Silver Lining of Cloud Computing. Presentation at
Environmental Information Symposium 2010-Enabling Environmental Protection
through Transparency and Open Government, May 13, 2010, Philadelphia, PA
[online]. Available: http://www.epa.gov/oei/symposium/2010/lee.pdf [accessed
Apr. 2, 2012].
Li, S.-T., and L.-Y. Shue. 2004. Data mining to aid policy making in air pollution
management. Expert Sys. Appl. 27(3):331-340.
Martin, M.T., R.S. Judson, D.M. Reif, R.J. Kavlock, and D.J. Dix. 2009. Profiling
chemicals based on chronic toxicity results from the US EPA ToxRef Database.
Environ. Health Perspect. 117(3):392-399.
Miller, F.P., A.F. Vandome, and J. McBrewster, Jr., eds. 2009. Data Warehouse: Extract,
Transform, Load, Metadata, Data Integration, Data Mining, Data Warehouse
Appliance, Database Management System, Decision Support System. Orlando, FL:
Alpha Press.
Moore, G. 1965. Cramming more components onto integrated circuits. Electronics 38(8)
[online]. Available:
http://www.cs.utexas.edu/~fussell/courses/cs352h/papers/moore.pdf [accessed
Apr. 6, 2012].
Ng, A., B. Bursteinas, Q. Gao, E. Mollison, and M. Zvelebil. 2006. Resources for
integrative systems biology: From data through databases to networks and
dynamic system models. Brief. Bioinform. 7(4):318-330.
Ngai, E.W.T., L. Xiu, and D.C.K. Chau. 2009. Application of data mining techniques in
customer relationship management: A literature review and classification. Expert
Syst. Appl. 36(2):2592-2602.
NoSQL. 2012. NoSQL Website [online]. Available: http://nosql-database.org/ [accessed
Apr. 30, 2012].
Pang, L. 2009. Best practices in data warehousing. Pp. 146-152 in Encyclopedia of Data
Warehousing and Mining, 2nd Ed., J. Wang, ed. Hershey, PA: Information Science
Reference.
Rajaraman, A., and J.D. Ullman. 2011. Mining of Massive Datasets. New York:
Cambridge University Press.
Robert, L.G. 2000. Beyond Moore's law: Internet growth trends. Computer 33(1):117-
119.
Roy, P., C. Truntzer, D. Maucort-Boulch, T. Jouve, and N. Molinari. 2011. Protein mass
spectra data analysis for clinical biomarker discovery: A global review. Brief.
Bioinform. 12(2):176-186.
Samet, J.M., F. Dominici, F.C. Curriero, I. Coursac, and S.L. Zeger. 2000. Fine
particulate air pollution and mortality in 20 US cities, 1987-1994. N. Engl. J.
Med. 343(24):1742-1749.
Stockwell, D.R.B. 2006. Improving ecological niche models by data mining large
environmental datasets for surrogate models. Ecol. Model. 192(1-2):188-196.
Wang, X., M.J. Santostefano, M.V. Evans, V.M. Richardson, J.J. Diliberto, and L.S.
Birnbaum. 1997. Determination of parameters responsible for pharmacokinetic
behavior of TCDD in female Sprague-Dawley rats. Toxicol. Appl. Pharmacol.
147(1):151-168.