Copyright © 2009. National Academy of Sciences. All rights reserved.
SUMMARY
Research in proteomics is the next logical step after genomics in understanding life
processes at the molecular level. In the largest sense proteomics encompasses knowledge of the
structure, function and expression of all proteins in the biochemical or biological contexts of all
organisms. Since that is an impossible goal to achieve, at least in our lifetimes, it is appropriate
to set more realistic, achievable goals for the field. Up to now, primarily for reasons of
feasibility, scientists have tended to concentrate on accumulating information about the nature of
proteins and their absolute and relative levels of expression in cells (the primary tools for this
have been 2D gel electrophoresis and mass spectrometry). Although these data have been useful
and will continue to be so, the information inherent in the broader definition of proteomics must
also be obtained if the true promise of the growing field is to be realized. Acquiring this
knowledge is the challenge for researchers in proteomics and the means to support these
endeavors need to be provided. An attempt has been made to present the major issues
confronting the field of proteomics and two clear messages come through in this report. The first
is that the mandate of proteomics is and should be much broader than is frequently recognized.
The second is that proteomics is much more complicated than sequencing genomes. This will
require new technologies but it is highly likely that many of these will be developed. Looking
back 10 to 20 years from now, the question is: Will we have done the job wisely or wastefully?
Introduction
Due to the rising interest in proteomics research worldwide, a symposium entitled "Defining
the Mandate of Proteomics in the Post-Genomics Era" was held at the National Academy of
Sciences on February 25, 2002, in Washington, D.C. Most of the attendees were invited because
of their strong interest in proteomics, proteins, or drug discovery. They came from industry,
both large and small, academia, and government. Most were from the United States, but an
effort was made to invite people from outside the United States. Four of the 10 speakers came
from outside of the United States. Six young scientists from around the world received travel
fellowships to attend the meeting. The attendees heard about recent advances in the field that
will greatly accelerate the process of accumulating and interpreting much of this additional
needed data and information.
The planning committee selected speakers (see Box 1) and designed the symposium in the
hope that one of the outcomes of the meeting would be helping to set the field on as wise a path
as possible for the future. After the presentations attendees were involved in individual breakout
sessions on a variety of topics, including
· protein separation and identification
· protein structure and function
· metabolic pathways and post-translational modifications
· implementation: necessary policy and infrastructure conditions for collaboration
· platforms: emerging technologies
· computational methods and bioinformatics
· clinical proteomics
The thoughts and ideas of the speakers and those expressed in the breakout sessions were
captured by recorders to assist in the preparation of this report. While other organizations and
meetings have addressed many of the issues facing proteomics, we hope that participants and
readers of this report will look back on this meeting as the field progresses and find that it was of
some help in defining the current efforts and applications, as well as providing direction to the
advancing state of the art.
[BOX 1: Symposium speakers]
Proteomics
Now that the DNA sequences of the human genome and genomes of dozens of other
organisms are essentially known, the biomedical and biological communities are placing
increased emphasis on proteomics, the study of the proteins that are the gene products.
Proteomics, a word derived from "protein" and "genomics," needs further definition, as do
proteomics initiatives, especially since many in the scientific community are asking for a human
proteome project.
Historically one can point back to meetings and articles over 20 years ago, when scientists
began to think about mapping the entire set of human proteins (see, for example, B.F.C. Clark,
"Towards a Total Human Protein Map"1). Indeed, Congress was considering a project called the
"Human Protein Index" long before the Human Genome Project had been conceived. The
Human Protein Index project was developed in the late 1970s by Norman G. Anderson and N.
Leigh Anderson at the Department of Energy's Argonne National Laboratory2. Its objective was
to enumerate the human proteins (what would now be called the human proteome) by separation
on 2D gels and thus define their genes from the protein end, the only approach possible in those
days before large-scale DNA sequencing was possible. But this effort was perhaps ahead of its
time given the lack of suitable technologies and shifting political sands. Instead, the rise of
genomics took center stage. An Australian postdoctoral student, Marc Wilkins, is often credited
with coining the term "proteomics" in 1994 at a time when only one proteomics company
existed (Large Scale Biology Corporation).
1 Clark, B.F.C. (1981) Towards a Total Human Protein Map. Nature 292 (5823): 491-492
2 Anderson, N.G. and Anderson, N.L. (1979) Behring Inst. Mitt. 63: 169-210
Today many proteomics initiatives are underway in industry and otherwise, such as the
Human Proteomics Initiative (HPI), an effort begun in 2000 by the Swiss Institute of
Bioinformatics and the European Bioinformatics Institute. The goal of the HPI is to annotate
each known protein, providing information that includes the description of protein function,
domain structure, subcellular location, post-translational modifications, splice variants, and
similarities to other mammalian proteins4. Another major proteomics effort is led by the Human
Proteome Organization (HUPO), a group which has created a worldwide organization that
engages in scientific and educational activities to encourage the spread of proteomics
technologies and to disseminate knowledge pertaining to the human proteome and that of model
organisms.
On which goals should these national and international efforts focus? Should they be
limited to human proteomics or, like the Human Genome Project, include key model organisms?
Perhaps the proteomes of the human pathogens should be included as well (e.g., the malaria
parasite and other infectious microorganisms), and if so, in what order of priority? Should
development of more efficient instrumentation (e.g., mass spectrometers, X-ray diffractometers,
nuclear magnetic resonance spectrometers) and improved computational methodologies (e.g.,
high-speed computers and software useful in bioinformatics) be emphasized? What should be
the role of major federal funding agencies (e.g., the National Institutes of Health, the National
Science Foundation, the U.S. Environmental Protection Agency, and the U.S. Department of
Agriculture)? What should be the role of academic laboratories? Should projects be supported
mostly by individual research grants or program project (group effort) grants? What should be
the role of the private sector, particularly those companies large and small that have a major
stake in exploiting the results of the various genome projects and proteomics initiatives? How
can all of these stakeholders cooperate most effectively while still maintaining proprietary
information where appropriate? Should the overall goal be to understand the structure and
function of all known proteins or should only those known to be involved in diseases be
emphasized? After all, one must first understand function if one is to fully understand
dysfunction. Is enough emphasis being given to the functional aspects of proteomics? Are
studies on post-translational modifications of proteins and subsequent functional aspects
included in "proteomics"? Hence the interest in organizing the one-day symposium reported
herein.
Discussion of General Topics Covered at Symposium
Marvin Cassman, former director of the National Institute of General Medical Sciences and now at
the University of California, San Francisco, and the Institute for Quantitative Biomedical Research,
began with a definition of the term "proteomics." He was one of many speakers expressing an
opinion on this subject, and it was clear that proteomics means many (or at least different) things to
different people. Some definitions include high-throughput and some do not. Obviously,
proteomics is not merely protein chemistry. Symposium chair and Dean of the University of
Michigan College of Pharmacy, George Kenyon, commented, "Proteomics is not just a mass
spectrum of a spot on a gel." Perhaps the most useful definition of proteomics for our purposes is
the broadest: Proteomics represents the effort to establish the identities, quantities, structures, and
biochemical and cellular functions of all proteins in an organism, organ, or organelle, and how these
properties vary in space, time, or physiological state.
Somewhat limited operational definitions of proteomics were offered by some of the speakers.
For instance, "In one sense it makes no difference at all; why should you call something proteomics
or call it something else?" Dr. Cassman continued, "What we call things often conditions how we
organize our thinking and our efforts." He explained that genome-driven target selection coupled to
high-throughput technologies is what he believes structural genomics means. "It means you are
using the genomes as the primary source for target selection." However, structural proteomics uses
these features "plus the additive feature of full coverage of protein space, that is, completeness,"
stated Dr. Cassman. The goal of completeness is not meant to suggest, however, that smaller-scale
experiments, including high-throughput analysis of specific tissues or subsets of proteins,
would not be considered part of proteomics.
Of course there are many "-omics" along with proteomics, including genomics, metabolomics,
transcriptomics, interactomics and so on, which are collectively involved in the mandate of defining
proteomics. However, we will restrain ourselves from commenting on other "-omics". Functional
genomics and functional proteomics (which can encompass other "-omics" as mentioned) are closely
juxtaposed on a continuum along the path of discovering the detailed secrets of life and life
processes.
The general topics covered at the symposium included:
· Perspectives (including genomics perspective, relationship of proteome to genome)
· Source of proteins (including organism, sample storage)
· Protein separation (including purification if subcellular)
· Protein identification (largely mass spectrometry)
· Protein function (including localization, protein:protein interactions, structure determination,
structure-function, post-translational modifications)
· Applications (including drug discovery, diagnostics)
· Informatics (including homology modeling, databases, analysis software, standardization)
· Other topics (including international collaboration, ethical considerations, collaboratories6)
6 Collaboratories are distributed research centers in which scientists in two or more locations are able to work together
with the assistance of various forms of communication and collaborative technologies.
Dr. Cassman defined proteomics as a set of related options: "the analysis of complete
complements of proteins present in defined cell or tissue environments (i.e., context-dependent) and
their variation in space and time" (with credit given to Stan Fields for his contributions to this
definition). One example of a proteomic effort is the Protein Structure Initiative of the NIGMS,
which has as a goal the generation of a complete complement of protein structures in nature through
the combination of direct structure determination and homology modeling. Although it requires
high-throughput technology and genomic data to use for target determination, the goal of
"completeness" is what distinguishes the effort as proteomics, according to Dr. Cassman.
The second part of his definition is exemplified by the use of microarrays to identify
characteristic markers for cancer progression in specific tissue samples. These studies involve image
and pattern recognition tools, which yield large-scale visualization of specific cell-dependent,
context-dependent proteomic outputs.
The third part of the definition involves examining proteomic outputs in time and space. This
requires not only the application of bioinformatics tools but also computational biology, that is, the
use of modeling and simulation. Complex systems analysis could be considered an important
element in the larger picture of defining a proteome, and such analysis will require theoretical
modeling of systems. Several examples of NIGMS initiatives that focus on mathematical modeling
of complex biological systems were provided.
While we may be far off in terms of defining a complete human proteome, approaching
proteomics on an organellar basis provides goals that are perhaps achievable in our lifetimes.
Remember that the first DNA genomes sequenced were those of bacteriophages, in the 1970s,
followed in 1981 by the DNA sequencing of a human mitochondrial genome.
Consider also that the mitochondrion, which is estimated to be composed of about 2,000
proteins, presents a considerably more manageable problem and a microcosm of whole cell
proteomics. With this in mind Nobel laureate Sir John Walker, head of the Dunn Medical Research
Council Unit in Cambridge, UK, discussed his proteomic studies of mitochondria directed to
resolving specific biological issues. Dr. Walker's work includes the definition of the protein
complement assembled in the respiratory enzyme known as complex I, the identification of the
biochemical functions of a family of transport proteins found only in mitochondria, and the
discovery of phosphorylation-dephosphorylation pathways in mitochondria. These studies rely not
only on mass spectrometric and bioinformatics tools but also on biochemistry and genetics. In
Dr. Walker's view, such an integrated approach is proving more rewarding, in terms of both
understanding the biology of mitochondria and the technical development of new methods, than
attempts to analyze the global complement of proteins in the organelle. It is also possible to focus
on subcompartments of mitochondria, such as the inner mitochondrial membrane of so much interest
to bioenergeticists.
In this report we have tried to avoid being constrained by a narrow definition of proteomics
(e.g., merely quantitating protein levels) and have used the broad definition given earlier to allow a
wide-ranging discussion of goals, techniques, opportunities, and challenges.
Lessons Learned from the Human Genome Project
Francis Collins, director of the National Human Genome Research Institute, spoke about
lessons learned from the Human Genome Project that might be applicable to the discussion of a
public large-scale proteomics initiative (see Box 2). He began his presentation by taking issue
with the term "post-genomics era." He queried whether this means that from the beginning of
the universe until 2001 we were in the "pre-genome era," and then suddenly, "bang," we moved
into the post-genome era (leading one to wonder what happened to the genome era). He
suggested that it was presumptuous to say that the Human Genome Project is already behind us.
He pointed out that proteomics is a subset of genomics, and genomics is more than sequencing
genomes, which will be ongoing for decades to come. His comments are especially relevant
given that the human genome was still only about 69 percent complete at the time of the meeting.
BOX 2
Lessons Learned from the Human Genome Project
Comments from Francis Collins
· [...] organisms [...] generated [...] genome, [...] did not start until six years into the project and
was initiated first in pilot projects.
· Public availability of data and resources is absolutely critical if the benefits [...] scientific
community are going to be [...]. The rapid release of pre-publication data was a [...] the success
of the Human Genome Project.
· Interdisciplinary research needs to be [...], including the participation of experts in [...].
· International participation and coordination is an essential component to bring the best minds
to the problem, to avoid duplication, and for cost sharing.
· Centralized databases that allow for [...] and visualization of the data are [...] of those who
want to [...].
· [...] successful public-private partnerships include a compelling scientific opportunity,
pre-competitive data sets, [...].
Dr. Collins concurred with other participants in delivering the sobering message that a
large-scale proteomics effort is orders of magnitude more complicated and difficult than the
sequencing of the human genome. (As if 100 trillion cells making up an organism and billions of
base pairs in genomes are not enough complexity already!) The concept of a complete dataset of
all human proteins is therefore very difficult to imagine. There are many challenges, as stated
below:
· Wide dynamic range of expression
· Protein modifications
· Physical handling of proteins is more difficult than working with nucleic acids
· Need for multiple technologies, many of which are not optimized or even invented
· Unlike DNA data, protein data are more analog than digital, making data integration
and analysis very challenging
· Intellectual property rights and claims
Dr. Collins said that the most important area for investment in proteomics right now is
technology development so that we can move these methods in the direction of being able to
tackle a mammalian proteome without facing enormous costs and problems with quality of the
data.
A number of resources for genomics research continue to be generated that may help inform
a proteomics effort, including multiple coverage of certain genomes and more specifically:
· Multiple genomic sequences from mouse (6x coverage), rat (3x coverage), puffer
fish, zebrafish, a sea squirt, and close relatives of C. elegans (10x coverage) and D.
melanogaster will be forthcoming. Comparative genomics will be helpful in
understanding gene models and gene function.
· Full-length human cDNA sequencing efforts are ongoing in Germany and Japan.
· Full-length cDNAs for human and mouse are being generated through the National
Institutes of Health (NIH) Mammalian Gene Collection. Multiple NIH institutes plan to
support a central database of protein sequence and function through a new initiative.
Dr. Collins referred to one publication: "Global Analysis of Protein Activities Using
Proteome Chips."9 He finished his presentation with a particular recommendation, not from a
scientist but from a famous athlete (hockey star Wayne Gretzky). When asked how it occurred
that he was so good at playing hockey, and why it was that he always seemed to score the key
goals, Gretzky said, "It is very simple. You have got to skate where the puck is going to be." In
the field of proteomics Dr. Collins said he was not sure where exactly the puck was going to be,
but there were a lot of "Wayne Gretzkys" at the meeting, and Dr. Collins was glad to get a
chance to listen to them.
9 Zhu, H., et al. 2001. Science 293 (5537): 2101-2105
Sources of Proteins
By definition any proteomics effort aims at 'completeness' of information. This part of the
symposium addressed primarily the comprehensiveness or completeness of any assembled
library of proteins and the quality of the materials. It was noted that protein expression in a
given cell varies from none to abundant. Historically, for practical reasons, the abundant
proteins have been investigated most extensively; however, some of the rarely expressed proteins
and proteins that appear only in disease states may be among the more interesting. Joshua
LaBaer, Harvard Medical School, noted that the function of all proteins can be studied regardless
of in vivo levels once a copy of the gene and adequate expression vectors are available. Ideally
it would be desirable to have an available repository or library containing one clone for every
splice variant in the proteome. The size of that library will not be known for some time, but an
intermediate realizable objective would be a repository consisting of one clone for every gene.
These clones should be "expression ready"; that is, they should contain only the cDNA from the
initiation site to the stop codons. It seems likely that we should have "some idea of all the
different cDNAs" in the genome in the near future, stated Dr. LaBaer. The expressed proteins
could be studied functionally and often identified by mass spectrometry. In general it is fairly
easy to produce large quantities of proteins in insect cells or bacteria, but in certain cases it may
be necessary to express them in their native cells in order to address such problems as
localization or post-translational modifications. Dr. LaBaer compared the complexities of
studying mammalian systems with those in yeast. There are approximately 6,000 genes in yeast
compared to a much larger number in humans. Moreover, the genome in yeast is relatively
simple; for example, there are only about 220 intron-containing genes in yeast, whereas a much
larger fraction of mammalian genes contain introns and alternative splicing substantially
increases the number of expressed proteins.
To this end Dr. LaBaer described the FLEX Gene repository, which is currently being
assembled by a consortium of about 20 different public and private research laboratories.
"FLEX" stands for Full Length Expression ready. This repository will enable scientists to move
several genes simultaneously from the master vector to any expression vector, which will allow
researchers to screen for function by high-throughput experimentation. It is the intention of this
consortium to make this collection of all human genes broadly available without restrictions on
their use. The four self-defined objectives of the consortium are (1) identification of the genes,
(2) assembly of clones, (3) sequence validation, and (4) distribution to the scientific community.
One example of the success of this effort resulted in the identification of two new genes that are
likely involved in the migration of breast cancer cells through a membrane. The collaboration of
public and private research groups raises certain legal issues, which include consideration of
antitrust law.
Recombination-based cloning was presented as a high-throughput technology to enable the
ready transfer of cDNAs from the supplied vector to one's own preferred expression vector. Dr.
LaBaer described a protein purification scheme that was developed by a graduate student in his
laboratory, Pascal Braun. "In the case of human proteins," Dr. LaBaer explained, "where it is
not easy to produce these proteins in human cells, [the availability of large numbers of purified
proteins] will require the use of heterologous [expression] systems such as bacteria." "To
develop these methods," continued Dr. LaBaer, "Braun transferred a collection of 30 cancer
genes into four different expression vectors, each one adding a different epitope tag. [Braun]
then developed a two-hour automated protocol for purifying 96 proteins in parallel [and] has
now purified over 330 different proteins using this approach." Braun and Yanhui Hu of the lab
created a database that correlates the success of purification with various features of the proteins
such as pI, GO annotation, subcellular localization, and domain structure. Dr. LaBaer said they
found that the presence of certain domains such as SH2 domains or SH3 domains can predict
success in purification.
Dr. LaBaer concluded with a description of a database derived from a computer program
that searches the primary literature for abstracts that mention both a gene and a disease. The
assumption is that a significant number of such occurrences may identify groups of genes
associated with a given disease. This effort was presented as a task in progress, and interested
scientists were invited to experiment with the database.10
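The co-occurrence idea described above can be illustrated with a minimal sketch. This is not the program Dr. LaBaer described; it is an assumed, simplified version that counts how many abstracts mention both a gene name and a disease name by case-insensitive substring matching. The function name `cooccurrence_counts` and the toy abstracts are hypothetical and exist only for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(abstracts, genes, diseases):
    """Count abstracts that mention both a gene and a disease.

    Matching is a naive case-insensitive substring test (an assumption of
    this sketch); a real system would need synonym handling and tokenization.
    """
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        found_genes = [g for g in genes if g.lower() in lowered]
        found_diseases = [d for d in diseases if d.lower() in lowered]
        # Each (gene, disease) pair is counted at most once per abstract.
        for pair in product(found_genes, found_diseases):
            counts[pair] += 1
    return counts


# Toy abstracts, invented for illustration only.
abstracts = [
    "BRCA1 mutations are associated with breast cancer risk.",
    "Loss of BRCA1 function was observed in breast cancer cell lines.",
    "TP53 is frequently mutated in colon cancer.",
    "BRCA1 interacts with several DNA repair proteins.",
]
counts = cooccurrence_counts(
    abstracts, ["BRCA1", "TP53"], ["breast cancer", "colon cancer"]
)
for (gene, disease), n in counts.most_common():
    print(gene, disease, n)
```

Pairs with high counts would be candidates for a gene-disease association; as the text notes, the assumption is only that frequent co-mention may indicate a relationship, not that it proves one.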
Brian T. Chait from Rockefeller University described a proteomics approach to
understanding cellular function. His group is interested in mechanisms by which materials enter
and exit the nucleus, the isolation of multiprotein complexes and the determination of their
cellular localization. The basic concept is to introduce a particular affinity tag to one of the
proteins at its natural location in the chromosome, which is done by replacing the endogenous
gene by a gene that will code for a protein with a tag on it or, as he termed it, "a piece of
molecular Velcro." So long as the multiprotein complex is stable, the tag allows isolation of the
associated interacting proteins. An application to the nuclear pore complex, a group of proteins
involved in nuclear trafficking, was described extensively. The complex as isolated has a
molecular mass of 50 million daltons. Interestingly, in the initial purification experiments it
contained about 180 interacting proteins, but upon further fractionation only around 50 were
found to comprise the complex. The individual proteins are identified by mass spectrometry,
which has the power to provide additional information about phosphorylation sites.
Preliminary experiments describing the use of this approach to follow proteins at different
points in the cell cycle and in the regulation of chromatin were mentioned briefly. The genomic
tagging and mapping approach can be used to gain analogous information about a number of
other systems. Most importantly this approach can show where the protein is localized within
the cell, how much is present, when the protein is present and for how long, with what it is
interacting, and even something about the topology of the protein complexes.
Protein Separation
After more than a decade of effort in gene sequencing, the number of
human genes is still a matter of disagreement, speculation, and debate. From the point of view of
proteomics, just the detection or enumeration of the numbers of expressed proteins defies
prediction based on our current understanding of human cell-type protein composition and its
modulation by myriad undefined post-translational modifications. Their actual identification or
annotation of function remains a challenge. This entire situation is not significantly better for
yeast. It is thus not surprising that a key problem in proteomics at a practical level is the
simplification of protein mixtures to a state in which their characterization by physicochemical
methods is experimentally tractable. There are no documented, reliable, or reproducible
strategies for separation of classes of proteins or even individual proteins from very complex
mixtures typically obtained in biological samples such as cell lysates. Clearly, not only does one
wish to know which specific proteins are in a given sample but, ideally, one would wish to know
whether specific proteins are part of a particular biologically significant compartment, complex,
or subcomplex.
Denis Hochstrasser from the University of Geneva, a founder of GeneProt Inc., GeneBio
SA, and the Swiss Institute of Bioinformatics, and one of the pioneers in the identification of proteins
in 2D gels, took the lead in dealing with the topic of protein separation. He stated at the outset
that he wanted to play the role of "devil's advocate": to describe some of the excitement in
proteomics but also to describe some of the difficulties. He outlined the scale of potential
proteins one can look for in the millimolar (10⁻³), micromolar (10⁻⁶), nanomolar (10⁻⁹), picomolar
(10⁻¹²), femtomolar (10⁻¹⁵), attomolar (10⁻¹⁸), zeptomolar (10⁻²¹) and yoctomolar (10⁻²⁴) (which is
less than one molecule per liter) ranges. When one considers human blood, for example,
Hochstrasser noted, "typically you only see albumin, immunoglobulin, and transferrin," whereas
cardiac markers such as troponin are present at nanomolar concentrations, and insulin-like
growth factor or insulin are in the picomolar range. Parathyroid hormone is in the low picomolar
range and Tumor Necrosis Factor is found in the femtomolar range (see Figure 1).
[Figure 1: plot of concentration (mM to yM) against number of proteins (10 to 100,000) for plasma
proteins; labeled examples include albumin, immunoglobulin, transferrin, leptin, alkaline
phosphatase, troponin, parathormone, and TNF.]
FIGURE 1. Potential plasma proteins observable at various concentration ranges: millimolar (10⁻³),
micromolar (10⁻⁶), nanomolar (10⁻⁹), picomolar (10⁻¹²), femtomolar (10⁻¹⁵), attomolar (10⁻¹⁸), zeptomolar
(10⁻²¹) and yoctomolar (10⁻²⁴). SOURCE: Courtesy of Denis Hochstrasser, GeneProt Inc.
Hochstrasser speculated that there is "a linear logarithmic relationship between the
concentration in blood and the number of proteins." He suggested that if there are about 300,000
proteins in the human body or five to six times the number of genes "you probably could find
any protein you have in the body, maybe one in the total blood volume, which would be just
below Avogadro's number (1 protein/L of plasma), because we have 6 or 7 liters of blood which
makes about 4 liters of plasma, and if you have one in 4 liters, it is about at the yoctomolar
(10⁻²⁴ M) level."
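The arithmetic behind this estimate can be checked with a short sketch. It assumes only the definition of molarity and the standard value of Avogadro's number; the function names are illustrative and do not come from the symposium. One molecule in 4 liters works out to roughly 0.4 yoctomolar, and a 1 yoctomolar solution contains about 0.6 molecules per liter, which agrees with the statement that yoctomolar is less than one molecule per liter.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def molecules_per_liter(molar):
    """Number of molecules in one liter at the given molar concentration."""
    return molar * AVOGADRO


def molar_from_count(n_molecules, volume_liters):
    """Molar concentration of n_molecules dissolved in volume_liters."""
    return n_molecules / (AVOGADRO * volume_liters)


# One protein molecule in about 4 liters of plasma, as in the quote above.
single_molecule = molar_from_count(1, 4.0)
print(f"1 molecule in 4 L = {single_molecule:.2e} M")

# Molecules per liter at each concentration range named in the text.
for name, molar in [("millimolar", 1e-3), ("nanomolar", 1e-9),
                    ("femtomolar", 1e-15), ("yoctomolar", 1e-24)]:
    print(f"{name}: {molecules_per_liter(molar):.3g} molecules per liter")
```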
For experimental studies a considerable amount of starting material, such as blood, is needed
to obtain levels of the various proteins high enough to be detected by today's
methods. Since a 2D gel has a dynamic range of only 10⁴, Hochstrasser stated, "if anyone used
[a] 2D gel from crude plasma, you never go below the micromolar range." Hochstrasser noted,
for example, that starting with 1 mL of sample leads to roughly a nanomolar limit of detection.
He further explained that starting with a much larger volume (e.g., 5-10 liters of plasma) is
necessary to achieve detectability in the lower picomolar range. Clearly, prefractionation of
proteins, individually or as a subgroup, is essential to reach the dynamic range of detectability
required for both cell and tissue lysates and plasma.
. ~ , ~
In subsequent discussion it became clear that even the best large-format 2D gels are
inadequate for studies of the global range of expression, perhaps still inadequate by a factor of
10; therefore at least a 10-fold fractionation prior to large-format 2D gel separation would be
required. Unfortunately, many membrane proteins do not enter 2D gels effectively. This
presents a formidable challenge for the field.
In his presentation, Julio Celis from the Institute of Cancer Biology and the Danish Center
for Human Genome Research in Aarhus, Denmark, also spoke about methods and challenges in
the area of protein separation. He stated that "for the study of tissue biopsies the use of high-
resolution 2D electrophoresis is the method of choice [for separations] as non-gel high-
throughput technologies based on chromatography-mass spectrometry are not yet ready for the
study of tissue samples." He stated that 2D gel technology in combination with mass
spectrometry can be used to establish comprehensive databases of protein information that can
be useful in the clinical setting. He also made the important point that data in a given cell type
can be valuable to the study of other cell types since 80-90 percent of the proteins are believed to
be shared by all cell types. While many structural and metabolic gene products may be the same
between all cells, as one reviewer pointed out, cell-specific proteins will be important for
understanding function and disease.
An afternoon breakout session, devoted to the topic of "protein separation and
identification," was led by Julio Celis; Alain Van Dorsselaer, Louis Pasteur University, CNRS;
and A. L. Burlingame of the University of California, San Francisco. Most of the 16 discussants
were experts in mass spectrometry. The discussants concluded that the issue of sample
preparation and purification has been sadly neglected at most meetings dealing with proteomics.
There was the impression among some of the discussants that protein biochemists were
developing and using methods to purify proteins that were not being adequately defined
compositionally by mass spectrometrists interested in proteins. They envisioned setting up "core
centers of excellence" in proteomics where innovation, mobility of people and ideas, and training
can all occur. These core centers might also lead to spin-offs for the development of new
instrumentation. Resources required to support a broad proteomic effort could be in the form of
sample collections, standardization of data across platforms, and ligands that allow assaying of
individual proteins, to name just a few. These centers would complement the work of scientists
comparisons of datasets of profiles of protein expression, usually determined by mass
spectrometry.
Sequence comparison can be powerful especially if families of related sequences are
identified. However, it is becoming apparent that not only can function diverge markedly when
two sequences differ by 50 percent or more, in some instances sequences that are more than 90
percent identical code for proteins that operate on completely different substrates and have no
cross-reactivity. Assignment of biochemical function from sequence data alone should always
be regarded as tentative without confirmatory experimental evidence. Most functional
annotation errors in genomics databases probably arise this way.
Structural Proteomics
Among the possible experimental ways of approaching the problem of function
determination on a large scale, the one that has received the most emphasis thus far is the use of
structural information. Predicated on the assumption that the three-dimensional structure of a
protein will often provide information about its biochemical and cellular functions, the structural
approach is being applied on a genome-wide scale in a number of independent initiatives.
Although in many instances at least the chemical function of an enzyme can be guessed from its
overall fold, even that deduction is often problematic, and assignment of higher levels of
function is practically impossible without additional information. This problem is exacerbated
when membrane-associated proteins are considered. Between 25 and 40 percent of the proteins in
the cell are estimated to be membrane associated (depending on the organism). The database of
membrane protein structures is very small and the methods for determining those structures are
very difficult and uncertain.
Cheryl Arrowsmith, a structural biologist from the Ontario Center for Structural Proteomics
at the University of Toronto, discussed her group's research on structural proteomics. She
distinguished structural proteomics from structural genomics on the grounds that her group works
on proteins, not genes. The focus of her proteomics research is to use X-ray crystallography and
NMR spectroscopy to determine the three-dimensional structures of proteins on a genome-wide
scale. She is particularly interested in examining the extent to which protein structure can reveal
protein function. The model system used is Methanobacterium thermoautotrophicum, whose
sequence was completed at the time the project was initiated in 1998. Since that time, her
laboratory has evaluated thousands of proteins by subcloning into bacterial expression systems
and performing either NMR studies or X-ray diffraction on soluble and relatively clean purified
protein. They have also evaluated hundreds of proteins from a number of different bacterial,
viral, and yeast genomes. However, the number of proteins that give structural samples was low.
"There is a huge attrition rate in going from cloned genes to those that can be readily expressed
in bacteria, are soluble in bacteria, can be purified, give good crystals or promising NMR
spectra, and these would be very good in terms of getting a structure." The attrition rate overall
is about 85-95 percent of genes that are tried; in other words, approximately 5-15 percent of
bacterial or archaebacterial genes can be processed straight through to three-dimensional
structures using a single protocol (e.g., single expression conditions, single purification
procedure), according to Dr. Arrowsmith.[13] The numbers are worse for eukaryotic systems.
[13] Christendat, D., et al. 2000. Nat. Struct. Biol., 7(10): 903-909.
"Clearly one needs to try multiple procedures for protein expression, purification, and
crystallization in order to improve the success rate for structures," said Dr. Arrowsmith.
She has confirmed these difficulties in a number of other species and systems, and she
reported that many of the other National Institutes of Health centers participating in the project
are seeing these sorts of statistics as well. Only in a few cases have they had the opportunity and
actually gone on to do functional studies of these proteins. Even with proteins of known
function, such as spermidine synthase, the determination of structure can be useful in proposing
an atomic model and thus a better understanding of the mechanism of enzymatic function. Dr.
Arrowsmith's group was among the first to solve the structure of this protein. There are
thousands of clones and proteins that have been prepared in the Ontario Center for Structural
Proteomics and in many of the other centers; and these clones are available for further functional
analysis. "I think this is a huge resource that is being generated, and it should be exploited
through projects that emphasize [biochemical] functional analysis of proteins," said Dr.
Arrowsmith.
Cellular Function
Protein location can be determined by such genome-wide techniques as green fluorescent
protein (GFP) tagging, and protein:protein interactions can be determined by affinity
chromatography, immunoprecipitation, and yeast two-hybrid experiments. Databases resulting
from these methods are beginning to emerge, but they are of uncertain accuracy. Recent
comparisons of independently obtained databases for yeast proteins suggest that location
determination is fairly robust but protein:protein interactions are at best determined with less
than 50 percent overall accuracy. Clearly more reliable methods are needed, and efforts to create
protein chips for profiling of interactions with proteins and small molecules appear promising.
One useful addition to the available arsenal of function-finding tools would be a database of
three-dimensional motifs of biochemical function. Such a database would contain those
structural elements that participate in ligand binding and catalysis for proteins of known
function. This database could be searched in a manner similar to sequence database searches
whenever a new protein structure is determined. Another useful tool would be, for each protein
family, a database of mutations with functional characterization. Essentially this database would
provide a link between a mutation at a particular site, a genetic lesion, a metabolic lesion and
even a phenotype such as a disease.
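As an illustration of how two independently obtained interaction datasets might be compared, the sketch below computes their overlap. The protein identifiers and pairs are invented for illustration, and the Jaccard index is only one common overlap measure; it is not necessarily the measure used in the yeast comparisons mentioned above.

```python
def normalize(pairs):
    """Treat each interaction as unordered, so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in pairs}


def overlap_stats(pairs_a, pairs_b):
    """Summarize how many interactions two datasets have in common."""
    a, b = normalize(pairs_a), normalize(pairs_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "fraction_of_a": len(shared) / len(a) if a else 0.0,
        "fraction_of_b": len(shared) / len(b) if b else 0.0,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical results from two independent screens.
screen_1 = [("P1", "P2"), ("P2", "P3"), ("P4", "P5"), ("P6", "P7")]
screen_2 = [("P2", "P1"), ("P4", "P5"), ("P8", "P9")]

stats = overlap_stats(screen_1, screen_2)
print(stats)
```

Low overlap between screens does not by itself say which dataset is wrong; it only signals that at least one of them contains many false positives or false negatives.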
Once again it was stressed that proteomics should be considered as a much broader field
than would be apparent from early efforts, which have focused on cataloging levels of protein
expression. Ideally it should encompass efforts to obtain complete functional descriptions for the
gene products in a cell or organism. Because of the complexity of functional description, clearly
more than one technique is required and no one existing technique should be emphasized in
preference to any others. This goal may be beyond the reach of existing technologies, even for
small numbers of proteins, but it is the direction in which the field must go.
Applications
The application of proteomic technologies to clinical research and public health in general is
an immediate goal of proteomics. A distantly related goal is the eventual application of
proteomics to environmental, agricultural, and veterinary research, research areas that are far less
developed than clinical applications. Thus, essentially all the applications discussed in the
formal lectures and breakout sessions centered on clinical applications.
Clinical proteomics aims to discover proteins with medical relevance, said Alan Sachs, a
director of R&D at Merck. Such discoveries can be defined broadly as those that identify a
potential target for pharmaceutical development, a marker(s) for disease diagnosis or staging,
and risk assessment, both for medical and environmental studies. Alan Sachs and Denis
Hochstrasser co-chaired the "Clinical Aspects" breakout session and covered a wide range of
issues: consent, samples, platforms, phases of diagnostic development, data analysis, and
definition. (Note that there is a difference between developing biological insight and identifying
clinically important diagnostic and prognostic protein-based assays, as one reviewer of this
report has suggested: "By studying protein interactions, or splice forms, or abundance, one
might be able to effectively distinguish between healthy and diseased tissues. One of the great
promises of genomics, and one that has captured the imagination of the public, is the idea that we
might move toward personalized medicine through broad genomic or proteomic surveys, what is
often called 'pharmacogenomics')."
Samples
Julio Celis illustrated the potential of proteomics for the study of diseases during his talk on
his research on bladder cancer. He stated, "one must take into consideration the set of samples
you are going to use." "[...] Biopsies, [...] and other types of samples can be [...]
heterogeneous in terms of cell type, stage of pathology, etc., and this presents a challenge for
proteomic analysis that must be faced." Experimental research also must consider the use of
various types of cell lines, primary tissues, body fluids, and various animal models. Each of
these may impose considerations on the types of techniques used for proteomic analysis.
As became apparent from several discussion participants, it is currently quite difficult to
identify the best procedures for obtaining and storing samples for proteomic analysis, because
the techniques used to analyze the samples are constantly changing, making it difficult to arrive
at a consensus protocol for sample preparation that would be best for a particular analysis
method. Thus, Dr. Sachs and others agreed that various strategies for handling samples or
standard operating procedures, and long-term storage will need to be co-developed along with
evolving protein detection methods. Dr. Hochstrasser raised the point that "we don't know how
to store the samples if we don't know how we plan to use them later." This is important,
especially considering that most proteins stored in the freezer at -20°C are useless for specific
types of clinical research after a few months, according to Hochstrasser. The question of storage
remains a problem because the technology for measurement in the clinics has not evolved, said
Hochstrasser, "yet we need to start worrying about sample storage now."
Related to this is the nature of the samples. "Defining 'normal' is a major problem," stated
Dr. Celis. As many researchers know, the pathology of samples can be open to interpretation,
and robust parameters must be delineated and adhered to when defining normal versus various
stages of pathology.
Consideration of the various proteomic methods under development suggests that the size of
samples required will be dictated largely by the constantly changing technology. As with all
research the nature of the study will dictate the size of the sample available. Dr. Celis noted,
"tissue biopsies will impose the most severe restraints, both in terms of size as well as the
available clinical data to support the experimental work." Tissue epidemiological studies may
provide blood or some other easily obtainable tissue that is not the target tissue of interest,
whereas cancer epidemiology likely will provide tumors of different grades of differentiation.
Each of these types of studies imposes complexities and limitations on sample size, number, and
method of analysis.
The proteome itself has a large, dynamic range, depending on the cells being analyzed, and
the location of cells within a tissue could influence its size and nature. Dr. Celis estimated that
the dynamic range (i.e., the concentrations of proteins) spanned 12-13 orders of magnitude.
Given the limits of sensitivity of detection and the availability of a suitable amount of
starting material, Hochstrasser stated, "I strongly believe that a combination of bioinformatics
(dry lab) and chemistry (wet lab) is crucial to finding new diagnostic markers and therapeutic
agents." Several participants expressed their belief that no single technology would be sufficient
for proteomic analysis and that multiple approaches will be required, at least in the near future.
Ethical Considerations
In addition to the issues surrounding samples being obtained and stored properly, certain
consent requirements and sample limitations permit clinical samples to be used only once after
patient consent has been obtained. In this case consent means both a clear description to the
patient regarding how the samples will be used and a disclosure of who will have access to the
samples. "Some samples will be anonymous, others will be 'de-identified', and yet others will
have restrictions placed on their use," noted Dr. Sachs. For example, samples may have a
limitation placed on the type of disease studied or the facility or institution at which the analysis
may be performed. Thus, it is important that sample-tracking procedures are in place to ensure
that only samples with appropriate consent from subjects are distributed to a specific site for a
particular type of investigation.
Development of Diagnostics
Participants of the "Clinical Aspects" breakout session on diagnostics discussed the fact that
although the experimental platform used in clinical settings to detect protein markers will change
rapidly in the coming years, the underlying principles regarding the stages of going from the
discovery of protein markers to their use as diagnostic tools in a community setting will remain
reasonably constant. Consequently the criteria used to judge the quality of a marker or markers
as diagnostics in a clinical setting are different from those used to evaluate the quality of a
marker in the basic science setting. Discussion centered on the fact that the basic researchers
developing protein markers, as well as reviewers evaluating such work, must consider the
technical aspects of the application and development of such markers so that statistically
underpowered or misinterpreted studies using such markers are not initiated or reported.
Another reviewer pointed out an important variable to consider in clinical applications, which is
the impact of population or sample variability due to the heterologous nature of individuals.
This point corresponds again to the idea of pharmacogenomics or personalized medicine.
Although data analysis (informatics) is addressed elsewhere in this report, several speakers
noted that special consideration should be given to adequate data analysis when reporting
something as significant as the association of protein markers with a disease. Participant Thea
Kalebic from the National Cancer Institute (NCI) recommended publication criteria for reporting
the use of marker and clinical samples. Criteria should be specified for the use and analysis of a
particular method to avoid incorrect application of a technique or inadequate or wrong
interpretation of the results, stated Dr. Sachs. Participant Izet Kapetanovic, NCI, further
suggested that a paper be written for the lay audience to describe how algorithmic clustering
methodologies are being used to do disease association studies. "I think a lot of physicians, as
well as clinical researchers, are not bioinformatics or statistics people, and they would benefit
from such a review," stated Dr. Sachs.
Clinical researchers will also need to consider the types of proteins that might be most
relevant, noted Dr. Celis. "Because every modification has a functional meaning, [one must also
consider] a protein-protein or protein-macromolecule interaction [as well as] cellular
distribution, movement, or migration," added Dr. Celis. Regarding techniques, Dr. Celis
believes the only available technique that provides a global picture of the cell proteome is high-
resolution 2D gel electrophoresis, despite its obvious limitations in terms of the numbers of
proteins resolved and the sensitivity of detection. The non-gel approaches based on
chromatography and mass spectrometry allow for high throughput, Dr. Celis noted, but he stated
they are not yet ready for the study of complex tissue samples.
Scott Patterson, vice president of proteomics at Celera, prefers the high-throughput
approaches to clinical applications of proteomics research. "In our search for markers of disease
or drug efficacy, and targets for small molecules, therapeutic antibodies, and cellular
immunotherapeutics (vaccines), we employ a broad-based discovery approach," stated Dr.
Patterson. His team uses chromatography and mass spectrometry as the basic tools in searching
for protein diagnostic markers and therapeutic targets in specific diseases.
"Most of you will know Celera for sequencing genomes," commented Dr. Patterson. But as
the company decided to embark upon drug discovery based upon its valuable genomics business,
the first platform to be built was a proteomics platform. The proteomics component of that
strategy is to discover diagnostic markers of disease and targets for therapeutic intervention, said
Dr. Patterson. They are specifically focused on proteins that are differentially expressed in
disease tissue compared with normal tissue. Contrary to Dr. Celis's approach of performing a
high-resolution protein separation at the beginning of the analysis (as is the case for 2D gel
electrophoresis), a very high-resolution peptide analysis is performed at the end of the process
using chromatography and mass spectrometry. "In its simplest description," said Dr. Patterson,
"protein-level analysis is accomplished through targeted capture of classes of proteins (or the
depletion of abundant proteins) prior to proteolytic digestion, yielding peptides that are
quantitated and identified by MS/MS using one of a variety of platforms [e.g., a MALDI-TOF-
TOF-MS or the Voyager 4700 Proteomics Analyzer™]." The MS/MS spectra are identified
using search algorithms for spectrum-to-sequence matching (using characterized protein
sequence databases or a translation of the Celera human genome sequence). Automated
identification can be achieved through spectral matching or spectrum-to-spectrum matching.
This overall approach of peptide-level analysis can be employed with isotope dilution strategies
(such as ICAT™ for quantitation of the relative abundance of peptides and proteins from pairs
of samples) and without if the fractionation of a series of samples is sufficiently reproducible.
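The idea of matching measured masses against a sequence database can be illustrated with a deliberately simplified sketch. It is not the search algorithm described by Dr. Patterson: real MS/MS search engines score fragment-ion spectra, whereas this example only digests sequences in silico with a basic trypsin rule (cleave after K or R unless followed by P, no missed cleavages) and compares intact peptide masses within a ppm tolerance. The protein identifiers, sequences, and tolerance are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def match_mass(observed_mass, database, tolerance_ppm=10.0):
    """Return (protein id, peptide, error in ppm) for every peptide within tolerance."""
    hits = []
    for protein_id, sequence in database.items():
        for peptide in tryptic_peptides(sequence):
            theoretical = peptide_mass(peptide)
            error_ppm = (observed_mass - theoretical) / theoretical * 1e6
            if abs(error_ppm) <= tolerance_ppm:
                hits.append((protein_id, peptide, round(error_ppm, 2)))
    return hits


# Hypothetical two-protein database and a simulated measurement 2 ppm high.
database = {"prot_A": "MKTAYIAKQR", "prot_B": "GASPVKLLER"}
observed = peptide_mass("TAYIAK") * (1 + 2e-6)
hits = match_mass(observed, database)
print(hits)
```

A single intact mass is rarely unique in a genome-scale database, which is one reason fragment-level (MS/MS) information is used in practice.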
Identification of early markers of disease is important for development of a reliable assay
for tissue samples so as to help diagnose disease, provide insight into prognosis and identify risk
for disease. These markers are especially important in identifying tumor stages such as with Dr.
Celis's work. A therapeutic that derives from this information is also desirable. Proteomic data,
in combination with microarray (gene expression) data, pathology, immunohistochemistry, etc.,
have the potential to identify novel markers for early detection, diagnostics, prognostics, and
response to treatment, concluded Dr. Celis. Drug discovery and improvement in public health
and environmental research will require a combination of all these and other technologies.
Salvatore Sechi, National Institute of Diabetes, Digestive and Kidney Diseases, emphasized that
although it is clear that the emphasis in the clinical community is on marker discovery, the
technology needed for clinical assays and high-throughput proteomics has not evolved yet. It is
important to recognize, however, that developing clinically relevant diagnostic and prognostic
tests is something separate from developing biologically relevant insight into disease. While
these two goals are not mutually exclusive, they are not necessarily overlapping, notes one
reviewer. It may be a relatively simple matter to identify patterns of gene expression that can be
correlated with a clinical outcome but that does not provide an immediate insight into the
underlying mechanism of the associated disease state. This does not mean that such a prognostic
protein expression fingerprint is not useful. Any tool that can help improve and influence
treatment has a great potential to affect patients' lives. This, it seems, defines a mandate for
proteomics in the twenty-first century.
Computational Methods and Bioinformatics
Computation has become an essential component of biological research. The great quantity
and diversity of the data being generated by different technologies is daunting and impossible to
organize or oversee without computational assistance. In functional genomics, a great deal of
effort has been devoted to developing community-based standards for reporting gene expression
data to allow others to replicate experiments. The same will need to be done for proteomics to
validate across the different technologies. Perhaps never before has a bioinformatics problem of
this magnitude been approached. No one person can integrate and organize all the relevant
information for even a single protein being studied without access to computational tools.
Sequence, structure, expression profiles, functional assays, protein-protein interaction from yeast
two-hybrid experiments or protein chip experiments, and other data all provide information on
different aspects of proteins whose functions and roles we are only beginning to understand.
Without effective and integrated databases to store and retrieve these data, and advanced
computational methods such as pattern recognition and other machine learning approaches to
analyze and interpret them, the full implications of these data will not be realized. A few years
ago the typical biologist may have had little reason to turn to a computer for insights or
information. Today the story is very different.
To paraphrase an old adage, "No protein is an island," and researchers who are unable (or
unwilling) to use all available data do so at their own peril. Computation can provide powerful
tools to enable the detection of subtle relationships between data and suggest hypotheses for
experimental validation. In addition to the traditional hypothesis-driven research to which we
are all accustomed, computational methods provide a new paradigm: 'computationally assisted
hypothesis generation'. Far from supplanting the biologist's intuition, understanding, and
experimentation, computation can provide an added dimension enabling additional insight and
understanding. Our ability to take advantage of the technological advances in genomics and
proteomics will hinge greatly on our ability to integrate computation into the research and
discovery process.
For instance, experiments performed on one protein often have relevance for other proteins,
not only within the context of the organism from which that protein was derived (i.e.,
paralogues) but also within the context of other organisms containing orthologous proteins.
Researchers investigating an individual protein, say, one involved in disease resistance in
potatoes, would miss a wealth of information and experimental data if they were not aware of
work being done on related plants, such as Arabidopsis, tomatoes, or rice. In fact, many disease-
resistance proteins in plants have orthologues in insects and animals, and experiments on one
will shed light on the function of another. Entire pathways in one organism have analogues in
others. Genetic experiments in one organism will have implications for related organisms.
Three-dimensional structures solved for one protein can be used to predict the structure of other
proteins whose sequences are similar. Residues shown to be catalytic in one protein are likely to
play a similar role for related proteins. Taken alone, proteomics data being generated
(microarray, protein chip, structural, yeast two-hybrid, mass spectrometry) can provide important
insights. Taken in concert and integrated, these data provide a context for understanding the
complex interactions and roles of these biological molecules.
To take full advantage of the information contained in these data, computational
development of two basic types is necessary: (1) database infrastructure enabling efficient and
biologically intuitive storage and retrieval, and interface design to enable different databases to
communicate with each other, and to allow investigators with disparate backgrounds to access
the information in these databases; and (2) intelligent systems, agents, and software tools to
discern relationships between data, and to generate hypotheses that can be tested experimentally.
Such bioinformatics development may also be used to help answer fundamental questions in
biology that have never been posed. In addition, training the next generation of scientists to
make the kinds of contributions that will be critical to discovery in this new century of
proteomics must not be ignored. We discuss each of these issues separately.
The breakout group "Computational Methods and Bioinformatics," led by Kimmen
Sjolander, University of California, Berkeley, and Dagmar Ringe, Brandeis University, discussed
database issues.
Database infrastructure and interface design
For historical reasons most biological databases have been produced primarily by the
biological community, while most computational tools have been produced by the mathematical
and computational communities. This has resulted in databases that are often not easily amenable
to automated data-mining methods, unintelligible to some computers, and computational tools
that are often non-intuitive to biologists. Biological databases have inherent complications
stemming from the nature of the information they contain and the dependence of computational
methods on these data. Most biological data are not digital, making machine-readability of the
data (for automated data-mining) impossible. In addition, the lack of standardized nomenclature
and ontology, the use of protein aliases (leading to ambiguity), the lack of interoperability
across databases, and the presence of errors in database annotations have hindered and
complicated the use of computational methods. While the computational biology community has
begun dialogue in this area, a great deal of work needs to be done before access to information
becomes routine for the computational non-expert.
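The ambiguity caused by protein aliases can be made concrete with a small sketch. It assumes a hypothetical table that maps each canonical identifier to its known aliases; all identifiers and names below are invented, and a real resource would need far richer metadata (organism, database version, and so on).

```python
from collections import defaultdict


def build_alias_index(records):
    """Map each lower-cased name to the set of canonical ids that use it."""
    index = defaultdict(set)
    for canonical_id, aliases in records:
        for name in [canonical_id, *aliases]:
            index[name.strip().lower()].add(canonical_id)
    return index


def resolve(name, index):
    """Return the canonical id, or None if the name is unknown or ambiguous."""
    matches = index.get(name.strip().lower(), set())
    if len(matches) == 1:
        return next(iter(matches))
    return None


def ambiguous_aliases(index):
    """List every name that points to more than one canonical id."""
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}


# Hypothetical records: the alias "a1" is shared by two different proteins.
records = [("ID_A", ["Alpha", "a1"]), ("ID_B", ["Beta", "A1"])]
index = build_alias_index(records)

print(resolve("alpha", index))
print(resolve("a1", index))
print(ambiguous_aliases(index))
```

Refusing to resolve an ambiguous alias, rather than picking one match, keeps the error visible to the user instead of silently propagating it into annotations.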
Development of new methods
Computational methods are based on models, whether mathematical or biological. As
biologists achieve new insights based on new data being generated by experiment, existing
models will be reevaluated and changed accordingly. New methods will need to be developed
and existing methods refined. In all cases development of benchmark test sets is critical for the
assessment of method accuracy and reliability. As more information becomes available in
databases more robust tools and more intuitive methods for finding relationships within these
data will be needed.
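A benchmark test set is useful only if method output is scored against it in a consistent way. The sketch below shows one simple scoring scheme, assuming a hypothetical benchmark in which each item carries a known yes/no label (for example, whether a protein has a given function); the identifiers and labels are invented.

```python
def score_against_benchmark(predictions, truth):
    """Compare boolean predictions with benchmark labels, keyed by item id.

    Items missing from the predictions are counted as negative calls.
    """
    tp = fp = fn = tn = 0
    for item, actual in truth.items():
        predicted = predictions.get(item, False)
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / len(truth) if truth else 0.0,
    }


# Hypothetical benchmark labels and the output of a method under test.
truth = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": True}
predictions = {"p1": True, "p2": False, "p3": True, "p4": False, "p5": True}

scores = score_against_benchmark(predictions, truth)
print(scores)
```

Reporting precision and recall separately, rather than accuracy alone, matters when positives are rare, as is typical for functional annotations.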
Training
Biology is being changed by computation. And computation, in turn, is being changed by
biology. Researchers working at the interface of computation and biology are increasingly in
demand. New degree-granting programs and departments are springing up around the world to
train the next generation of scientists in this interdisciplinary field. To be effective in this new
age of computationally assisted biological discovery biologists must receive training in statistics,
mathematics, and computation, and become expert in the use and interpretation of the results of
computational methods. It was suggested that life scientists learn at least some simple scripting
language such as PERL, and a database language such as SQL. For computer scientists working
in this area, training in life sciences is necessary. Both groups must learn to speak a common
language.
The information explosion has presented an opportunity and a challenge to the biological
and computational communities. The wealth of data being generated needs to be integrated in
order to define a system from multiple viewpoints and to understand a system from different sets
of empirical data. Such integration is possible only with computational tools that can find
relationships within the data and use these relationships to create testable models. Such tools
must also be user friendly.
Proteomics: A Coordinated International Effort
"It was a report by a National Academy of Sciences panel [in 1988] chaired by Bruce
Alberts [President of The National Academies] that basically laid out the blueprint that became
the Human Genome Project, and a wise blueprint, indeed," said Francis Collins. Dr. Collins
stressed the success and importance of having a large international consortium of laboratories
involved in the Human Genome Project. "Forming large teams and international teams [was
critical] and this is the group that carried out the large-scale sequencing effort or at least the
leaders of many of the labs that were involved in that six-country enterprise," he stated. Dr.
Collins hopes the same will be true for proteomics research. "I think it was very helpful that
all of the groups had the capacity for large-scale effort, had an open door to come and join in and
that this was an international enterprise, also something that I had hoped would happen for
proteomics in the public sector because after all science is an international enterprise. That is one
of the joys of the whole thing."
Protein Structure Initiative
In September 2000 the NIGMS initiated the support of seven centers to begin work on
developing an approach to structural genomics in order to reap the benefits of the multiple
genome projects being undertaken worldwide. Two more centers were subsequently added in
September 2001, forming what is now known as the Protein Structure Initiative. The idea for
the consortium resulted from a planning meeting in April 2000 jointly sponsored by the NIGMS
and the Wellcome Trust, and there was wide representation from several countries. It was
essentially a policy meeting to come up with ideas about how to consider structural genomics in
a worldwide setting, said Marvin Cassman. He defined structural genomics as "the discovery,
analysis, and dissemination of three-dimensional structures of proteins, RNA, and other
biological macromolecules." However, the focus is primarily proteins.
Currently the NIGMS is helping to fund nine pilot projects to determine the best strategies
for a large-scale production process. Each project is required to include all the components for
the effort based on genome-driven target selection. These components include protein
production, crystallization, structure determination, theoretical analyses for homology modeling,
target selection testing approaches for full coverage of protein families, development of high-
throughput methods, and best management practices. The consortium consists of industrial and
international collaborators with 66 investigators and 24 institutions, according to Dr. Cassman.
They all involve the development of technology.
Proteomics is generating novel requirements for scientific collaboration. Several drivers
and barriers to collaboration were discussed by members of the breakout called "Implementation:
Necessary Policy and Infrastructure Conditions for Collaboration," led by James Myers, of the
Pacific Northwest National Laboratory, and Richard Morris, National Institutes of Health,
Division of Allergy, Immunology and Transplantation. "As has been the case with genomics
research, the promise of medical benefits is a major driver of proteomics research," stated Dr.
Myers. Participants of the breakout session discussed how collaboration may create needed
economies of scale in research, for example, by making it possible for groups of scientists to
strive for completeness in critical description and annotation of proteins, an effort that would
surpass the capacity of any individual scientist. It was suggested that success in this field would
be based on the extent to which it selected and strived to characterize specific organs, pathways,
and systems with completeness.
Research Collaboration
Proteomics research also poses unique challenges to collaboration. Due to its close tie with
application, the field imposes barriers in intellectual property and authorship and other issues of
attribution. As with genomics, tensions between collaboration and competition (as well as
between government and industry) are also heightened in proteomics research. A global
distribution of resources, expertise, and potential targets is necessary for collaboration and
success in the field. In terms of international focus the same problems arise here that arose in
genomics. The proteomics techniques and data collection and expertise occur in one locale, but
in developing countries there are important diseases and other health problems that are affecting
millions of lives and should be studied. However, one has to address some of the differences in
policies such as informed consent between countries if one is actually going to get some work to
happen there.
Policy, organizational, and technology solutions were discussed as interdependent variables,
and as essential future enablers of proteomics research. For example, the need to construct and
use diverse data sets was examined. Proteomics poses unprecedented demands for integrating
biological samples, electronic data, outputs from instrumentation, and expertise from multiple
disciplines. Interactions across disciplinary boundaries that were crucial for the genome project
may be even more so for proteomics due to the variety of expertise needed. The group
examined how this might be coordinated and made accessible through shared user facilities.
Such collaboration may also help overcome shortages of trained experts who can contribute to
proteomics research through particular fields including mathematics, statistics, physics,
chemistry, computer science, and engineering. The potential drawbacks of such shared facilities
were identified, including the requirement to travel away from one's home institution and the
overhead costs of dealing with multiple facilities. Such costs could be mitigated by the creation
of virtual facilities providing aggregated capabilities over the Internet. Dr. Myers suggested that
Internet-based collaboratories, or laboratories without walls, could permit researchers to easily
share data, instruments, and expertise.
In addition, the group discussed the need to provide intellectual credit to developers of
shared and reference resources (e.g., samples, instruments, software). The recognition by (and
pressure for) scientific publication in academia leaves little room for scientists to work on
reference resources and database construction. Since the best makers of such tools and databases
are those who actually use them, scientists should be supported not only for the development of
the tool (which often is hard to get through conventional funding means), but also for
applications of the tool toward a biological problem. A scientist should be able to get funding
for making a tool that will help with an important problem; clever grant writing should reflect
this contribution toward their own work and toward the benefit of the scientific community. This
problem could also be solved by adjustments to institutional tenure policies, whereby credit is
given to those who take time away from the bench to develop critical databases, websites, and
software for general use by the scientific community. The participants recommended that tenure
committees and granting agencies should be able to recognize those kinds of contributions.
Information technology could also play a role by assisting managers to develop broad
metrics of authorship or enabling them to track the pedigree or provenance of intellectual
contributions at a finer grain than publication credit.
There are very few people trained in the multiple areas necessary for proteomics research,
from computation to experimentation across disciplines of biology and chemistry. The need to
train new researchers and to encourage practicing researchers to broaden their expertise could be
met through support for additional fellowships and sabbaticals, which could be made more
effective and less disruptive through remote collaboration capabilities. Difficulties in sharing
and comparing data from different techniques and disciplines could be reduced through the
promotion of standardization efforts and open, extensible data format standards.
Overall it was agreed that the pursuit of proteomic research needs to progress on an
international scale with broad support from governments and industries alike. In turn, this
creates the need for international professional organizations, as well as software applications
and standards to enable international collaboration. "There is the sense that just like the promise
of the genomics revolution, there are many, many things that can be done that are of practical
importance," stated Dr. Myers. "Collaboration to speed up that process is really a big driver
here and driving it more than just the basic research kind of ideas."
Conclusion
The most useful definition of proteomics is likely to be the broadest: proteomics represents
the effort to establish the identities, quantities, structures, and biochemical and cellular functions
of all proteins in an organism, organ, or organelle, and how these properties vary in space, time,
and physiological state. Proteomics is thus a huge, long-term task, much more involved than
sequencing the genome.
At the time the Human Genome Project was begun, the basic methodology for sequencing
DNA, Sanger's dideoxy chain termination, had already been in place for five years and the task,
while challenging, was essentially one of efficient scale-up. One of the main lessons from this
symposium is that proteomics has not yet reached that stage. There is much work to be done in
the technology sector. Perhaps the most important area for investment right now is in platform
technology development for high-throughput systems. Other areas where emphasis might be
placed in the short term include protein markers and clinical assays of disease, as well as the use
of less complex model systems. Quality controls and annotations are needed at all levels. There
are also several remaining barriers to translating proteomics results into clinical applications, but
progress is being made as described in this report. There is room for both big and small science,
stated George Kenyon. No one group, company, or government entity is going to solve these
problems; there is a great need for interdisciplinary collaboration, locally, nationally, and
globally.
Representative terms from entire chapter:
genome project