Copyright © 2009. National Academy of Sciences. All rights reserved.
Molecular Structure and Function
Biological Macromolecules Are Machines
All biological functions depend on events that occur at the molecular level.
These events are directed, modulated, or detected by complex biological ma-
chines, which are themselves large molecules or clusters of molecules. Included
are proteins, nucleic acids, carbohydrates, lipids, and complexes of them. Many
areas of biological science focus on the signals detected by these machines or the
output from these machines. The field of structural biology is concerned with the
properties and behavior of the machines themselves. The ultimate goals of this
field are to be able to predict the structure, function, and behavior of the machines
from their chemical formulas, through the use of basic principles of chemistry and
physics and knowledge derived from studies of other machines. Although we are
still a long way from these goals, enormous progress has been made during the
past two decades. Because of recent advances, primarily in recombinant DNA
technology, computer science, and biological instrumentation, we should begin to
realize the goals of structural biology during the next two decades.
Much of biological research still begins as descriptive science. A curious
phenomenon in some living organism sparks our interest, perhaps because it is
reminiscent of some previously known phenomenon, perhaps because it is inex-
plicable in any terms currently available to us. The richness and diversity of
biological phenomena have led to the danger of a biology overwhelmed with
descriptions of phenomena and devoid of any unifying principles. Unlike the rest
of biology, structural biology is in the unique position of having its unifying
principles largely known. They derive from basic molecular physics and chemis-
try. Rigorous physical theory and powerful experimental techniques already
OPPORTUNITIES IN BIOLOGY
provide a deep understanding of the properties of small molecules. The same
principles, largely intact, must suffice to explain and predict the properties of the
larger molecules. For example, proteins are composed of linear chains of amino
acids, only 20 different types of which regularly occur in proteins. The properties
of proteins must be determined by the amino acids they contain and the order in
which they are linked. While these properties may become complex and far
removed from any property inherent in single amino acids, the existence of a
limited set of fundamental building blocks restricts the ultimate functional proper-
ties of proteins.
Nucleic acids are potentially simpler than proteins since they are composed
of only four fundamental types of building blocks, called bases, linked to each
other through a chain of sugars and phosphates. The sequence of these bases in
the DNA of an organism constitutes its genetic information. This sequence
determines all of the proteins an organism can produce, all of the chemical
reactions it can carry out, and, ultimately, all of the behavior the organism can
reveal in response to its environment.
Carbohydrates and lipids are intermediate in complexity between nucleic
acids and proteins. We currently know less about them, but this deficit is rapidly
being eliminated.
The central focus in structural biology at present is the three-dimensional
arrangement of the atoms that constitute a large biological molecule. Two
decades ago this information was available for only several proteins and one
nucleic acid, and each three-dimensional structure determined was a landmark in
biology. Today such structures are determined routinely, and we have begun to
see structures of not just individual large molecules, but whole arrays of such
molecules. The first three-dimensional structures were each consistent with our
expectations based on fundamental physics and chemistry. Most of the structures
determined subsequently, however, seemed completely unrelated to one another, and a large body
of descriptive structural data began to emerge as more and more structures were
revealed by x-ray crystallography. From newer data, patterns of three-dimen-
sional structures have begun to emerge; it is now clear that most if not all
structures will eventually fit into rational categories.
The Main Theme of Structural Biology Is the Relation of
Molecular Structure to Function
Since biologists are ultimately interested in function, structural biology is
often a means toward an end. The role played by structural biology differs
somewhat depending on our prior knowledge of the function of particular mole-
cules under investigation. Where considerable knowledge about function already
exists, the determination of three-dimensional structure has almost inevitably led
to major additional insights into function. For example, the three-dimensional
structure of hemoglobin, the protein that carries oxygen in our blood stream, has
helped us understand how we adapt to changes in altitude, how fish control their
depth, and how a large number of human mutant hemoglobins relate to particular
disease symptoms.
Often knowledge about structure can provide dramatic advances in our
understanding about function even when prior knowledge is sketchy. For ex-
ample, early biological experiments had shown that DNA contained genetic
information, but these experiments offered no real clues to how a molecule could
store information or how that information could be passed from cell to cell or
from generation to generation. The structure of DNA, with bases paired between
two different chains, led immediately to the correct conclusions about the mecha-
nism of information storage and transfer. The information resided in the sequence
of the bases; the apparent redundancy of two strands with equivalent (comple-
mentary) information meant that each could serve to pass the information onto a
daughter strand. Furthermore, the redundancy offered a natural defense against
loss of information. Even if one strand is damaged (as by chemicals or radiation),
in the vast majority of cases the information on the other strand can be used to
recover the missing information. Indeed, cells have evolved truly elegant mecha-
nisms to determine which strand contains the original undamaged information;
such models could provide useful paradigms for the current human preoccupation
with electronic information handling.
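
The redundancy argument above can be illustrated with a minimal Python sketch. The function names and the use of "N" to mark an unreadable base are illustrative assumptions, not part of the original text; the pairing rules (A with T, G with C) are the standard ones.

```python
# Standard base-pairing rules for DNA.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(strand: str) -> str:
    """Return the partner strand, written in its own reading direction."""
    return "".join(PAIR[base] for base in reversed(strand))

def repair(damaged: str, partner: str) -> str:
    """Fill unreadable positions (marked 'N', an assumed convention)
    in one strand by consulting the intact partner strand."""
    template = complement_strand(partner)
    return "".join(t if d == "N" else d for d, t in zip(damaged, template))

original = "ATGCCGTTA"
partner = complement_strand(original)
print(partner)                       # TAACGGCAT
print(repair("ATGNNGTTA", partner))  # ATGCCGTTA
```

Each strand fully determines the other, so either one can serve as the template for a daughter strand or for repair. Real cellular repair must also decide which strand is the damaged one, which this sketch simply assumes is known.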
The ultimate challenge for structural biology occurs when we have a struc-
ture but no clues at all about its function. Because of dramatic advances in our
ability to determine structures, this challenge is likely to occur with increasing
frequency. There have been a few remarkable cases in which limited structural
information, such as a knowledge of the sequence of amino acid residues in a
protein, without any three-dimensional structural information, has led to signifi-
cant insights into function. In general, however, our current ability to predict
function from structure in the absence of prior biological clues is limited, and one
of our major needs is to improve our predictive abilities.
Biological Structure Is Organized Hierarchically
The structures of large biological molecules such as proteins and nucleic
acids are complex. It is not practical or useful to describe these structures in
words. In fact, highly specialized computer-driven graphics systems have been
especially created to display molecular structures visually. An example of the
output from one of these display systems is shown in Plates 1 and 2. Such devices
are an invaluable aid to today's structural biologist, and future advances should
make such devices cheaper, easier to use, and thus more readily available to all bi-
ologists.
Because of the complexity of biological structures, it is frequently convenient
to deal only with certain aspects of these structures. It is common practice to
describe structure at a series of hierarchical levels, called primary, secondary,
tertiary, and quaternary structure. This hierarchy reflects some of the types of
information provided by particular experimental techniques used to determine the
structures of biological molecules.
The primary structure is the covalent chemical structure, that is, a specification
of the identity of all the atoms and the bonds that connect them. The major
molecules with which we work (proteins, nucleic acids, and carbohydrates)
usually consist of linear arrays of units, each of which has a similar overall
structure; they differ only in certain details. The types of units are limited in
number: 4 common ones in typical nucleic acids, roughly a dozen in typical
carbohydrates, and 20 in proteins. Thus, the primary structure can be specified
almost completely by naming the linear order, or sequence, of each type of unit of
the chain. The primary structure is given by the sequence plus a description of
any additional covalent modifications or crosslinks.
The sequence of proteins, nucleic acids, and carbohydrates is determined
principally by chemical methods. This is understandable since it is, in fact, the
chemical structure. These methods have advanced tremendously in the past
decade, and the implications of these advances constitute the second section of
this chapter.
The secondary structure refers to regular patterns of folding of adjacent
residues. Most secondary structures are helices. Some of the most frequent and
best-known helices are the alpha helices found in many proteins and double
helices found in virtually all nucleic acids. Carbohydrates also form helices.
Helices are convenient structural motifs: They are easy to recognize by inspec-
tion of a known three-dimensional structure, they are relatively easy to detect
experimentally by physical techniques, and their appearance within many struc-
tures is relatively easy to predict just from a knowledge of the primary structure.
The tertiary structure is the complete three-dimensional structure of a single
biological unit. Until recently the only available method for determining this
structure was x-ray diffraction studies of a single crystal sample. Now electron
and neutron diffraction have become available as tools for solid samples, and
nuclear magnetic resonance spectroscopy has been developed to the point where
it can be used to determine the tertiary structure of small proteins and nucleic
acids in liquid solution, that is, close to the state in which they are usually found
inside living cells. The tertiary structure usually provides the starting point for
studies that attempt to correlate structure and function.
Quaternary structure describes the assembly of individual molecular units
into more complex arrays. The simplest example of quaternary structure is a
protein that consists of multiple subunits. The units may be identical or different.
The arrangement of the subunits frequently has important functional implications.
Some quaternary structures have been determined by experimental methods that
reveal not only the arrangement of the subunits but also their individual tertiary
structures. However, many quaternary structures are too complex to be addressed
by existing techniques. Here a variety of methods ranging from electron micros-
copy to neutron scattering to chemical crosslinking can still provide information
about the overall shape of the assembly and detailed arrangement of the
components.
In the sections that follow, we will first explore the levels of biological
structures; our concerns will be improved methods for revealing these structures
and the application of the resulting information to solving biological problems.
We will then consider the current and future prospects for predicting the higher
order structure of biological macromolecules from more readily available infor-
mation on lower order structure. Finally we will consider the power of our
newfound abilities to alter macromolecular structure more or less at will.
PRIMARY STRUCTURE
Nucleic Acid and Protein Sequence Data Are Accumulating Rapidly
The amount of available information on the primary structure of biological
polymers is increasing at an astounding rate. Two decades ago we knew the
nucleotide sequence of only a single small nucleic acid, the yeast alanine transfer
RNA. We knew the amino acid sequence of fewer than 100 different types of
proteins.
Today more than 18 million base pairs of DNA have been sequenced, and the
data are accumulating at a rate of several million bases a year. The first
completed sequences were research landmarks. Now sequences are appearing so
rapidly that many research journals refuse to publish such information unless it
has some particular novel or utilitarian aspects. Indeed, sequence data are cur-
rently accumulating faster than we can analyze them, and even faster than we can
enter them into the data bases by existing methods.
The longest block of continuous DNA sequence known is the entire primary
structure of Epstein-Barr virus. This 172,282-base-pair genome is responsible for
a number of human diseases including infectious mononucleosis, Burkitt's lym-
phoma, and nasopharyngeal carcinoma. Knowledge of the DNA sequence poten-
tially unlocks for us all of the secrets of the virus. The challenge now is to use this
sequence information to learn how to prevent or control the diseases caused by the
virus. Other landmarks of recent DNA sequencing include the complete
sequence of the maize (corn) chloroplast DNA (about 130,000 base pairs) and the
complete sequence of the gene for human factor VIII, one of the proteins involved
in blood clotting, which is defective in certain hemophilias. We know the
complete sequence of many other important proteins, RNAs, and viruses. Perhaps
what is most important is that we have the technical ability to determine the
sequence of virtually any piece of DNA, RNA, or protein.
Sequence Comparisons Lead to Structural, Functional, and
Evolutionary Insights
Much valuable comparative sequence information awaits us as the data
accumulate and as analytic methods become more reliable and informative.
Already, one can do much using the data bases to help interpret any DNA
sequence plucked more or less at random from a genome. The patterns of
sequence in the regions that code for the amino acid chains of proteins differ
enough from the noncoding regions that the former can usually be identified. For
example, we know about types of sequences that are required for efficient synthe-
sis of proteins in many different types of organisms. We know about some
general types of control elements for certain genes important in developmental
pattern formation or in an organism's response to environmental stress.
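
One very crude way to flag possible protein-coding stretches is to look for open reading frames: runs of codons that begin with ATG and continue to a stop codon. The Python sketch below is an illustrative stand-in, not the method the text has in mind; it scans only the three frames of the given strand, and the `min_codons` cutoff is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end) index pairs of ATG...stop stretches
    found in the three reading frames of the given strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None
    return orfs

dna = "CCATGAAACCCGGGTAACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 17 ATGAAACCCGGGTAA
```

Practical methods rely on richer statistical differences between coding and noncoding sequence; a long open reading frame is only a first hint.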
When the protein sequence predicted from a gene is compared with all
known protein sequences, there is about one chance in three that it will be similar
enough to one or more of them to be recognized in a match. This provides an
immediate clue to the function of the previously unknown protein. Perhaps the
most spectacular example of such a match was the discovery that the product of
the sis oncogene, a protein of unknown function that is associated with some
cancers, was extremely similar to a blood protein that promotes normal growth,
the platelet-derived growth factor. As the data base grows in size, and as our
general knowledge about the function of its constituents does likewise, the proba-
bilities of informative matches should rise steadily. One can anticipate the growth
of a new speciality, molecular archeology, that resembles the field of archeology
itself. Protein and gene sequences are old. They have been rearranged and altered
much as the residual artifacts from a town or fortification have partially
deteriorated and become dispersed. The components that remain, however, when
properly viewed, provide clues to the function of the whole.
An archeologist might conclude that a room full of amphoras was likely to
have been a storage room and not sleeping quarters. In the same way, we can
already look at some protein sequences and gain clues about functions, even about
functions that we have never observed in detail in the laboratory. Proteins that
have transmembrane domains will sit in membranes, proteins with nucleic acid
binding sequences will bind DNA or RNA; a protein with both would probably
bring a nucleic acid into the vicinity of a membrane and keep it there. The
analysis can be carried further because the details of the protein sequence can
provide even greater clues, just as the details of the decoration on a piece of
pottery or the shape of an arrowhead can identify the geographic origin of the
people who produced it.
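
As a hedged illustration of how a sequence alone can hint at a transmembrane domain, the Python sketch below averages residue hydrophobicity over a sliding window. It uses the published Kyte-Doolittle hydropathy scale; the window of 19 residues and the cutoff of 1.6 are commonly quoted settings for that scale and are assumptions here, not values taken from the text.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein: str, window: int = 19):
    """Mean hydropathy of every window of consecutive residues."""
    scores = [KD[aa] for aa in protein]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_membrane_segments(protein: str, window: int = 19,
                                threshold: float = 1.6):
    """Start positions (0-based) of windows hydrophobic enough to be
    candidate membrane-spanning segments under the assumed cutoff."""
    profile = hydropathy_profile(protein, window)
    return [i for i, mean in enumerate(profile) if mean > threshold]

print(candidate_membrane_segments("L" * 19))    # [0]
print(candidate_membrane_segments("DEKR" * 5))  # []
```

A window that passes the cutoff is only a candidate; confirming that a segment actually spans a membrane requires experimental evidence.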
Many proteins with related functions have probably evolved from common
ancestors. Thus receptors (proteins designed to sit at the cell surface and detect
the environment) may represent one or more fundamental families of structures
and sequences. For example, the sequence of the beta-adrenergic receptor, which
binds the hormone adrenalin, and the sequence for rhodopsin, which detects light,
are sufficiently similar that we can tell both were once related through a common
progenitor. In the same way, proteases often resemble other proteases and
structural proteins resemble other structural proteins.
Many proteins are modified chemically after they are synthesized. Proteases
may remove one or both ends of the initial chain as well as make cleavages in the
middle. Carbohydrates may be added to form glycoproteins. Some of these
modifications occur as the protein travels from its initial site of synthesis to its
final location in the cell. Others, such as phosphates, are added and removed
repeatedly as part of the functioning or regulation of the protein. The enzymes
that perform these modifications frequently do so by recognizing particular signal
sequences. Because we now know some of these signals, a search of protein
sequences can frequently reveal potential modification sites and in turn provide
additional clues to the function of the protein.
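
A search for one such signal can be sketched in a few lines of Python. The pattern used is the widely cited consensus for N-linked glycosylation (asparagine, then any residue except proline, then serine or threonine); it is supplied here as an illustration and is not stated in the text. The function name is hypothetical.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping candidate sites all be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")

def glycosylation_sites(protein: str):
    """Return 1-based positions of asparagines that sit in the pattern."""
    return [m.start() + 1 for m in SEQUON.finditer(protein)]

print(glycosylation_sites("MKNGSAANPTLLNNTS"))  # [3, 13, 14]
```

A match marks only a potential site. Whether the residue is actually modified depends on the protein's location and folding, which the sequence search does not address.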
Three-dimensional structure is better conserved in evolution than sequence
is. Apparently there are severe constraints on folding a protein to make a compact
three-dimensional array that is stable in the aqueous medium of a cell and
resistant to proteases. Once we have mastered techniques for estimating possible
folded structures from amino acid sequences, we will enhance our ability to
explore molecular archeology. Mastery of these techniques itself will probably
require an examination of many more three-dimensional structures by x-ray
diffraction. What we still cannot do with much success is predict the function of
an arbitrary protein without some molecular archeological clues.
When inspected by eye, a three-dimensional protein structure is complex and
confusing; about the best most trained observers can do is identify potential
binding sites as clefts or pockets and find potential sites of flexibility, such as
connectors, between domains. Clues to functional regions can emerge from
amino acids that are found in places other than their usual locations. For example,
in typical soluble proteins, hydrophilic residues (which have an affinity for
water), such as charged residues, reside on the surface, whereas hydrophobic
(which avoid contact with water) residues, such as those with hydrocarbon side
chains, are found buried in the interior. A buried charged group, particularly if it
is not paired with an opposite charge, can be a clue to a functional site. Similarly
an exposed hydrophobic group may reveal a binding site for a hydrophobic small
molecule. A whole set of such groups may indicate a surface of the protein that
interacts with another protein or a membrane.
One can go only so far with visual inspection. Methods for systematic
analysis of three-dimensional protein structures are needed that can extract, from
the structures, as many clues as possible about protein function. Such procedures
are still in their infancy; the next decade should see rapid growth in such tech-
niques now that a sufficient library of known structures and functions exists on
which to develop, test, and refine these methods.
The DNA Sequences of Entire Genomes of Some Simple Organisms
Will Soon Be Known
The explosion in sequence data has just begun. DNA sequencing is far easier
than protein sequencing, and the tools already available for cloning and efficient
sequencing of 500-base-pair blocks of DNA will ensure that the current stream of
new sequence data will become a torrent.
The ultimate target would be to determine the sequence of all the DNA in an
organism, that is, to sequence an entire genome. Genomes range in size from
750,000 base pairs (a mycoplasma) to more than 3 billion base pairs.
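
The scale of the task follows from simple arithmetic on the figures above. The Python sketch below counts how many 500-base-pair blocks would be needed to cover a genome; the `coverage` parameter (how many times each base is read on average) is an illustrative assumption not discussed in the text.

```python
import math

def reads_needed(genome_bp: int, read_bp: int = 500, coverage: float = 1.0) -> int:
    """Number of fixed-length sequencing blocks needed to read a genome
    the given number of times over (assumes perfectly tiled blocks)."""
    return math.ceil(genome_bp * coverage / read_bp)

print(reads_needed(750_000))        # 1500 blocks for the smallest genome cited
print(reads_needed(3_000_000_000))  # 6000000 blocks for a 3-billion-base-pair genome
```

Perfect tiling is an idealization; any practical strategy would need overlapping and repeated reads, so these counts are lower bounds.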
Such large-scale sequencing programs are feasible by today's technology,
but they are expensive in both manpower and actual dollar cost. Automated DNA
sequencing techniques have begun to be developed, which should markedly
diminish manpower requirements and decrease costs. It now seems likely that in
the next few decades we will determine the complete DNA sequence of the
bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the human
genome, the fruit fly Drosophila, the mouse genome, the nematode Caenorhabditis
elegans, and possibly even a number of plant and other bacterial and yeast
genomes. The resulting information will stimulate future generations of biolo-
gists as they explore the functions of the tens of thousands of genes that will be
revealed for the first time by such sequencing programs.
The major issue facing us today is how to stage the process of large-scale
sequence determination. One set of concerns related to this issue deals with the
optimal scientific strategy and the selection of targets for sequencing. Another set
of concerns deals with attempts to organize and accelerate this work by mecha-
nisms other than the types of investigator-initiated individual research projects
typical in current biological science.
Most investigators favor making a physical map of a genome before com-
mencing really large-scale sequencing. This physical map will consist of an
ordered set of large DNA fragments that covers the entire genome. From each
large DNA fragment, smaller pieces can be isolated (or cloned) and used as source
material to perform the actual DNA sequencing. Some workers favor construct-
ing the ordered set of fragments by isolating individual ones at random and then
determining which fragments are neighbors in the genome. Others favor dividing
the genomes into successively smaller pieces (first chromosomes, then chromosome
fragments, then very large DNA pieces) until an ordered set of DNA
fragments is created. At present there are good arguments for attempting both
approaches simultaneously.
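
The first strategy, isolating fragments at random and then working out which are neighbors, can be sketched as follows. This Python fragment assumes each fragment has been characterized by a set of landmark labels (the clone and marker names are hypothetical) and that fragments sharing a landmark overlap in the genome. It also assumes the fragments form a single unbranched chain.

```python
def order_fragments(fingerprints):
    """Greedy ordering of overlapping fragments.

    fingerprints maps a fragment name to the set of landmarks it carries.
    Fragments that share at least one landmark are treated as neighbors.
    """
    names = list(fingerprints)
    neighbors = {
        n: [m for m in names if m != n and fingerprints[n] & fingerprints[m]]
        for n in names
    }
    # An end fragment overlaps exactly one other fragment.
    start = next((n for n in names if len(neighbors[n]) == 1), names[0])
    order, used = [start], {start}
    while True:
        candidates = [m for m in neighbors[order[-1]] if m not in used]
        if not candidates:
            break
        # Step to the unused neighbor sharing the most landmarks.
        nxt = max(candidates,
                  key=lambda m: len(fingerprints[order[-1]] & fingerprints[m]))
        order.append(nxt)
        used.add(nxt)
    return order

clones = {
    "c3": {"m5", "m6", "m7"},
    "c1": {"m1", "m2", "m3"},
    "c2": {"m3", "m4", "m5"},
    "c4": {"m7", "m8"},
}
print(order_fragments(clones))  # ['c1', 'c2', 'c3', 'c4']
```

Real mapping data are noisy and contain repeats and gaps, so a greedy walk like this is only a conceptual picture of how neighbor relationships yield an ordered set.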
The first physical maps of genomes or segments of genomes are almost
completed. In principle, one could use these and simply start large-scale sequenc-
ing now, with existing approaches. However, the likelihood of major improve-
ments in automated technology over the next 5 to 10 years leads many people to
favor concentrating current efforts on speeding the development of that technology
and delaying most massive sequencing until the technology is available. As
automated DNA sequencing machines become common, the rate-limiting step in
obtaining data will shift to the production of the DNA needed for sequencing.
Thus, we need to enhance our ability to prepare large numbers of discrete DNA
fragments (preferably kept in linear order from some larger starting fragment).
Robotics seems an attractive and useful new technology. At present, most
sequencing methods are limited to a maximum of about 500 base pairs per DNA
fragment. Every significant increase in the size of the fragment that can be
sequenced will improve the overall efficiency of the process. Multiplex methods,
in which numerous different fragments are handled in parallel or in series, offer
another way to accelerate and expedite the entire process.
A third set of scientific concerns deals with the choice of species, individual
organisms, and genes to sequence. At one extreme are those who believe that
current efforts should be restricted to sequences tied to existing biological prob-
lems. For example, in the pursuit of human disease genes, it might be far more
important to determine the DNA sequence of the same gene in many individuals
than to extend a given sequence into neighboring regions to see what is there.
Similarly, comparisons of different species frequently provide biological insights
that would not have been possible if studies had been restricted to a single
organism.
The advantage of this traditional problem-oriented approach is obvious: The
sequences obtained are more or less guaranteed to be interesting and useful.
However, the disadvantage is also obvious: As interesting regions are sequenced,
it will become more difficult to motivate people to risk explorations of regions of
genomes for which little or no information is available. While explorations of
these regions have the potential to make major advances through finding com-
pletely unexpected genes and functions, the work is also risky since some regions
may yield no rewards at all. The realities of tight funding and frequently
competitive review for renewed funding militate against such work; if it is to be
encouraged, new support mechanisms may need to be created, with longer term
commitments and rewards for more risk-taking.
The final set of concerns deals with whether genomic sequencing should be
organized in ways similar to those in which "big science" has been dealt with in
other disciplines. The actual process of sequence determination is boring. It
seems to require more dedication and large-scale organization than most typical
biology projects. The intellectual rewards of obtaining sequences of entire
genomes are likely to be missed by most of those involved in the massive effort to
accumulate the data. Much of the data may not result in publications in the
primary scientific literature, and some publications that do result may have very
large numbers of authors. Thus special efforts may be needed to maintain
investigators' morale.
Structural and Computational Methods Need to Advance to Keep Pace with the
Explosion in Sequence Data
As the acquisition of sequence data continues to accelerate over the next few
years, the problem of managing these data will become increasingly severe.
Considerable thought and resources will be needed to optimize the collection of
data in consistent formats, the entry of data into computerized data bases acces-
sible to all investigators, and the refinement of computer algorithms for all sorts of
sequence and structure analysis. The anticipated size of the data base (100
million base pairs in the next few years, 10 to 100 billion base pairs eventually)
is not staggering even by today's standards. However, the way the data are being
accumulated, by efforts in hundreds of different laboratories, each with its own
computer systems and idiosyncrasies, poses a serious problem. What would help
is a relatively uniform system of data annotation and transmittal. If this can be
done by translation programs that accept a wide variety of inputs and return them
to the data base and to the investigator in the standard format, it would probably
win broad acceptance by the community because each laboratory could then
maintain its own style.
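
The idea of translation programs can be made concrete with a small Python sketch. Both input styles, the field names `id` and `seq`, and the standard record layout are hypothetical examples chosen for illustration; they do not describe any actual data base format.

```python
def from_header_style(text: str) -> dict:
    """Laboratory style 1 (assumed): a '>' header line, then sequence lines."""
    header, *lines = text.strip().splitlines()
    return {"name": header.lstrip(">").strip(),
            "sequence": "".join(lines).replace(" ", "").upper()}

def from_key_value_style(text: str) -> dict:
    """Laboratory style 2 (assumed): 'id=...' and 'seq=...' lines."""
    fields = dict(line.split("=", 1) for line in text.strip().splitlines())
    return {"name": fields["id"].strip(),
            "sequence": fields["seq"].replace(" ", "").upper()}

# One translator per laboratory style; all return the same standard record.
TRANSLATORS = {"header": from_header_style, "keyvalue": from_key_value_style}

def to_standard(text: str, style: str) -> dict:
    return TRANSLATORS[style](text)

print(to_standard(">geneA\nacgt\nttga\n", "header"))
print(to_standard("id=geneA\nseq=acgt ttga", "keyvalue"))
```

Both calls yield the same record, so each laboratory can keep its own conventions while the shared data base sees a single format.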
A second complexity of the existing data bases is that there are three inde-
pendent repositories for nucleic acid sequence data: GenBank, operated by the
Los Alamos National Laboratory; the EMBL data base, operated by the European
Molecular Biology Laboratory in Heidelberg; and Protein Identification Re-
source, operated by the National Biomedical Research Foundation in Washing-
ton, D.C. The multiplicity of data bases poses severe problems for current and
potential users. Ideally, the three should be combined into one.
Nomenclature is a major problem for all three data bases. The names of
molecules, species, and genes are constantly changing, and the data bases also
change the cryptic names that they use to identify entries. The various data bases
are also not cross-indexed. A major research problem is to determine what data
are common to two or all three data bases. Moreover, any such cross-comparison
is outdated as soon as one of the data bases is updated. International responsibility
for entering data into major data bases, together with greatly expanded use of
electronic communication for this purpose, is badly needed. Someone will have to
construct and maintain a cross-index of related biological data bases. Possibly a
direct-access system will be set up. In any case, the availability of a periodically
updated cross-index will allow other installations to provide an integrated re-
trieval system to their users more easily.
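
One way such a cross-index might be built, sketched here in Python under stated assumptions, is to key every entry by a digest of its normalized sequence rather than by its name, so that renamed entries still line up. The data base names and entry names below are hypothetical, and only exact sequence matches are linked.

```python
import hashlib
from collections import defaultdict

def sequence_key(sequence: str) -> str:
    """Name-independent key: digest of the sequence with whitespace
    removed and case folded."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()

def build_cross_index(databases: dict) -> dict:
    """Map each sequence key to the (data base, entry) pairs that hold it,
    keeping only sequences present in more than one data base."""
    index = defaultdict(list)
    for db_name, entries in databases.items():
        for entry_name, sequence in entries.items():
            index[sequence_key(sequence)].append((db_name, entry_name))
    return {key: hits for key, hits in index.items()
            if len({db for db, _ in hits}) > 1}

databases = {
    "dbA": {"HUMF8": "ACGT ACGT", "X1": "TTTT"},
    "dbB": {"f8_human": "acgtacgt"},
}
print(list(build_cross_index(databases).values()))
# [[('dbA', 'HUMF8'), ('dbB', 'f8_human')]]
```

Because the key depends only on the sequence, the index can be rebuilt mechanically whenever any one data base is updated. Entries that differ by even a single base would need a similarity search instead.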
The third major complexity of the sequence data base is the sophistication of
many of the interrogations that will be made. Today each new sequence is almost
automatically run through a comparison program to see what matches can be
found with preexisting sequences. At the most trivial level, this procedure may
reveal that the sequence has already been reported by someone else, and perhaps
the same gene will have been reported under another name. At a more profound
level, the comparison may reveal functional and structural insights. These com-
parisons can consume large amounts of computer time if they are carried out with
algorithms that try to detect even very slight degrees of sequence similarity.
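
To show why sensitive comparisons are costly, the Python sketch below scores the best local match between two sequences by dynamic programming, in the style of the Smith-Waterman method. The scoring values and function names are illustrative assumptions; the text does not say which algorithms the comparison programs use.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b.
    Work grows with len(a) * len(b)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def best_match(query: str, database: dict) -> str:
    """Name of the data base entry with the highest local score."""
    return max(database, key=lambda name: local_alignment_score(query, database[name]))

print(local_alignment_score("GGACGTCC", "TTACGTAA"))               # 8
print(best_match("ACGT", {"x": "TTTTTTTT", "y": "GGACGTGG"}))      # y
```

Every query must be scored against every entry, and each score takes time proportional to the product of the two lengths, which is why exhaustive searches of a growing data base consume so much computer time.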
In the near future, we should begin to see many more attempts to use the data
bases to refine methods for structure prediction, studies that will consume enormous
amounts of computer time. It seems prudent to plan ahead and support research
and development of better computer algorithms and better computer hardware to
optimize the biologist's use of the DNA data base. Among the needs are data-
base management systems designed to keep track of inquiries and results, so that
insights gained by different inquiries can be synergistic and so that unwarranted
duplicate inquiries can be short-circuited. Other needs are for improved analyti-
cal tools for predicting structure and for comparing structure and sequence. These
tools may take the form of new chips, parallel processors, or more powerful
algorithms.
Carbohydrate Research Is Gaining Momentum
In the past decade, structural studies on carbohydrates have begun to ap-
proach the capabilities of more developed areas of protein and nucleic acid
structure. Techniques have been developed to deduce the complete structure of
complex oligosaccharides, including oligosaccharides found in scarce glycopro-
teins, such as cell-surface molecules.
Glycoproteins are proteins containing covalently attached sugars, usually
short carbohydrate polymers attached to the side chains of the amino acids
asparagine, serine, or threonine. Glycoproteins are found throughout nature, from
simple single-celled organisms to humans, and they play critical roles in these
organisms. Glycoproteins are usually, but not exclusively, found on the surfaces
of cells and in cellular secretions. For example, almost all of the human blood
proteins and all of the well-characterized eukaryotic cell-surface macromolecules
are glycoproteins. In addition, glycoproteins are key components in the outer
coatings of a number of pathological agents, including viruses and parasites.
Many of the molecules used by the immune system to combat these pathogens are
also glycoproteins. Recently, important roles have been identified for some
glycoproteins that remain in the cell's interior, such as the proteins that form the
pores in the nuclear membrane.
The new techniques in carbohydrate research include nuclear magnetic reso-
nance (NMR), fast-atom-bombardment mass spectrometry, and metabolic label-
ing with radioactive sugars combined with stepwise degradation with a battery of
purified glycosidases. The consequence has been the elucidation of hundreds of
OCR for page 66
resolution x-ray analysis, will depend on the type of assembly in question. With
membrane complexes, the crystals must be grown from precise mixtures of
detergents, amphiphiles (polar molecules that have an affinity for both aqueous
and nonaqueous areas), protein, and lipid; the process of crystallization has an
additional dimension compared with that of soluble proteins. A major difficulty
at present, therefore, is in obtaining sufficient commitments of financing and time
to support such crystallization efforts. The first such crystallizations were carried
out only after many years of trials in Europe, where the support of science can
maintain a constant effort in a high-risk, long-term endeavor. Because risks have
been demonstrably reduced, considerable weight must be given to early successes
in growing crystals of sufficient quality for high-resolution analyses.
The crystallization of viruses is often made difficult by the small supplies
available for systematic experimentation. Thus, large cell-culture laboratories
with suitable biohazard containment are required. New methods of crystallization
may also be needed, particularly for lipid-enveloped viruses such as rubella or
measles. The forming of homogeneous complexes of viruses with antibodies,
drugs, and receptors will call for considerable effort. An alternative approach,
which has proven successful in the past, consists of crystallization of components
of the assembled structure, determining the three-dimensional structures of each
of the components, and then using electron microscopy studies to provide the
architectural details of how the components are arranged in the assembly. Com-
plementary analyses of this nature will also, in many instances, provide the most
appropriate pathway toward understanding the details and action of the large
intracellular organelles, such as the nuclear pore and the various types of
cytoskeletal filaments. An exciting result to be expected along these lines over the
next 10 years is the merging of the available low-resolution picture of myosin-
actin filament interaction with that of the high-resolution structures currently
being determined for the actin monomer and the myosin head.
Complex Biological Structures Can Assemble Themselves
Researchers have begun to unravel how molecular assemblies are formed. In
living cells, the production of the components destined to be assembled is often
coordinated tightly both spatially and temporally. This coordination is revealed
by studies with mutants, in which individual components are defective or are not
synthesized in the proper amounts. Sometimes assembly occurs just by spontane-
ous association of individual proteins and nucleic acids, but steps in assembly are
frequently accompanied by the covalent modification of key proteins and nucleic
acids. Such modification can make the assembly irreversible, in essence locking
the pieces into place. Other assembly mechanisms have been found to make use
of scaffolding molecules. These molecules are present at intermediate stages in
the assembly to help align critical components, but they disappear before the final
structure is formed, just as a scaffold is taken down as a building is finished.
Studies that attempt to assemble biological structures in vitro have been
particularly fruitful. These allow the timing of particular steps to be controlled at
will, and particular components can be added sequentially or simultaneously,
which in turn allows detailed study of assembly pathways and direct tests of the
function of specific components of the assembly by single-component-omission
experiments. Such studies are not always possible in vivo. For example, if a
protein has functions critical to the cell, it will be difficult to see its effect on the
structure or function of an assembly by simply preventing its being synthesized.
In vitro assembly is also useful for many structural studies. For example,
neutron-scattering measurements usually require the creation of an assembly in
which some components contain the normal isotope of hydrogen whereas in
others the hydrogen is substituted with deuterium. Such manipulations can be
carried out only by starting with isolated components in vitro. Complex biologi-
cal structures successfully assembled in vitro include ribosomes, microtubules,
nucleosomes, and even many viruses.
DIRECTED MODIFICATION OF PROTEINS
We Can Now Design and Construct New Molecular Machines
Until recently, the experimental strategies available to structural biology
were largely limited to examining naturally occurring biological structures. Test-
ing specific hypotheses by altering structures was limited to observing naturally
occurring biological variants when they could be identified, as in the numerous
mutant hemoglobins. This approach is limited in having no systematic way to
search for a particular desired variant. Furthermore, one was restricted to those
variants that had no lethal consequences for the organism and variants that had a
significant chance of arising by natural biological mutation or evolution.
The development of recombinant DNA technology has dramatically altered
our study of the structure and function of proteins. The major breakthrough lies in
our new ability to modify or synthesize de novo genes (DNA) that, when intro-
duced into cells, direct the synthesis of modified or new protein molecules. What
was only a fantasy a few years ago is today a routine procedure: We can produce
protein molecules of any desired sequence. We can produce altered proteins in
bacteria, yeast, or plant or animal tissue-culture cells, which makes it possible to
isolate large enough quantities for structural and functional studies. In addition,
we can produce the altered proteins in vivo in transgenic animals to gauge the
effect of the altered protein on complex biological processes.
The Future Will See a Heightened Interdisciplinary Cooperation Between
Structural and Molecular Biology
The techniques of traditional structural analysis and of recombinant DNA,
when combined, increase the value of both. Such integrated approaches will allow
more rapid and informative studies of the structures of proteins and how these
structures determine function. The future will see ever-closer working relations
among scientists expert in these different disciplines.
The potential to alter proteins at will is remarkable since it transforms
structural biology from a science limited to strictly descriptive observations to an
experimental science in which specific hypotheses can be tested with appropriate
controls in specifically modified molecules. Our ability to do this is still in its
infancy; much experience will be needed before the strategies in routine use
approach optimal design. However, it is already clear that the ability to alter the
sequence of proteins and nucleic acids systematically has revolutionary applications
for structural biology. These new technologies are important in two ways:
the first is widely appreciated, and the second is perhaps less often
noted.
First, by altering protein structure, or by creating new proteins, we are able to
produce improved or even new proteins of value to human welfare, such as new
pharmaceuticals. We are already using recombinant DNA technologies to pro-
duce human growth hormone, the blood-clotting factor VIII, the anticoagulant
tissue plasminogen activator, interferons, and several lymphokines (which regu-
late development of various cells in the body). Variants of these proteins are
being made and tested for improved properties such as increased heat stability
or increased lifetime in the blood. We are limited in these efforts by the fact that
in most cases we do not yet sufficiently understand the structures of these proteins
or the relations of structure to function to know what changes to make. As we
learn more about the structures of these molecules from x-ray crystallography and
other techniques, the successful production of useful variants will increase.
Second, determining protein structure will be enhanced by our ability to
modify proteins. These modifications will result in proteins that crystallize bet-
ter, that are designed for easy insertion of heavy metals needed in x-ray crystal-
lography, and that have specific perturbations introduced to test hypotheses. In
parallel, as the body of defined structures grows, we will be in a better position to
design rationally modified proteins. We will know more about possible protein-
folding motifs, domain attachments, and the effects of certain kinds of single-
residue modifications. Today, in the absence of a known three-dimensional
protein structure, the behavior of site-directed mutants can be unpredictable. The
cellular location of the new product, its stability, and its properties can frequently
differ from our naive predictions. This situation should change markedly as we
gain more experience with site-directed protein modification and its structural
consequences.
Methods for Designing New Proteins. The first approach to protein design is
site-directed mutagenesis. Here one usually alters a single amino acid by chang-
ing one or two nucleotides in the gene at the point coding for that amino acid. The
result is a site mutant, which may resemble natural mutants, except that the
experimenter can choose the site and the replacement.
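The idea of a site mutant can be sketched in a few lines. The example below builds the standard genetic code, then replaces the codon at a chosen residue with the codon for the desired amino acid that requires the fewest nucleotide changes. The function names `translate` and `site_mutant`, and the nine-base example gene, are hypothetical and serve only as an illustration; laboratory practice involves many considerations (codon usage, primer design) that are ignored here.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for all three codon positions.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(gene):
    """Translate a DNA coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[gene[i:i + 3]] for i in range(0, len(gene) - 2, 3))


def site_mutant(gene, residue_number, new_amino_acid):
    """Return (mutated gene, number of nucleotides changed).

    residue_number is 1-based. Among the codons for new_amino_acid, the one
    differing least from the existing codon is chosen (ties broken
    alphabetically so the result is deterministic).
    """
    start = 3 * (residue_number - 1)
    old_codon = gene[start:start + 3]

    def changes(codon):
        return sum(x != y for x, y in zip(codon, old_codon))

    candidates = [c for c, aa in CODON_TABLE.items() if aa == new_amino_acid]
    new_codon = min(candidates, key=lambda c: (changes(c), c))
    return gene[:start] + new_codon + gene[start + 3:], changes(new_codon)
```

For the made-up gene `ATGGAAGTT` (Met-Glu-Val), `site_mutant("ATGGAAGTT", 2, "V")` changes a single nucleotide, turning the glutamate codon GAA into the valine codon GTA.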
The second approach is to make larger alterations. Usually this involves
interchanging segments of two or more different proteins. Domains are segments
of sequence often associated with particular functions of a protein. Therefore, by
appropriate switches in domains, one can rationally create proteins likely to have
desired hybrid functions.
The creation of such chimeric proteins in the laboratory mirrors the events
that seem to occur in protein evolution. The genes for proteins in most organisms
occur in blocks of coding regions (exons) and blocks of noncoding regions
(introns). The introns are removed from the message by RNA splicing before a
final transcript is used to direct the synthesis of protein (see Chapter 4). The
exons frequently appear to correspond to functional or structural motifs in the
protein, and often they correspond to actual three-dimensional structural domains.
Many new protein functions may have evolved by exon shuffling, that is, by
rearrangements among pre-existing exons of proven functional capability. Such exon
shuffling provides a rapid way to create proteins with new, hybrid functions. It
also provides a rationale for the presence of interrupted genes. An organism that
has such a pattern of genetic information is likely to be more able to cut and paste
its genes in a meaningful way and thus should have a selective advantage.
Domains are also regions that appear to fold independently into three-dimen-
sional structures. Switching pre-existing domains maximizes the likelihood that
the new chimeric protein will still be able to achieve a stable, well-ordered, three-
dimensional structure.
Genetically Engineered Proteins Reveal Much About How Proteins Function
The use of site-directed protein modification offers great promise for answering
some of the fundamental questions in contemporary biology. For example,
cell-surface receptors must migrate throughout the cell from one organelle to
another, moving from the endoplasmic reticulum (site of synthesis) to the Golgi
complex (site of carbohydrate modification) to the plasma membrane (site of
clustering in specialized regions of the cell surface called coated pits). Once
inside a coated pit, these proteins are taken inside the cell in a coated vesicle and
then recycled back to the cell surface in a recycling vesicle. All of these
movements seem to be dictated by signals contained within the structure of the
protein itself. What are these targeting signals? Are they simply short, continu-
ous stretches of amino acids or are they determined by the three-dimensional
structure of the protein? Are protein modifications, such as phosphorylation or
fatty acylation of the protein, required for any of these targeting signals?
The use of chimeric proteins has made it possible to define the functions of
linear sequences responsible for protein translocation into the endoplasmic reticu-
lum, mitochondria, and nucleus. However, signals that are defined by noncon-
tinuous amino acid sequences are more difficult (if not impossible) to define func-
tionally with chimeric proteins. Incorrect protein folding becomes a major
obstacle when the function of an internal sequence or domain is examined by this
approach.
Once a targeting signal for a given movement is identified, the scientist is in a
superb position to study biochemically how the proteins of the cell interact with
the cell-surface receptor to effect the desired targeting event. All the potential
questions can now be answered in model systems, in which cloned genes for cell-
surface receptors are transfected into cultured cells and then studied functionally
and biochemically.
The targeting problem is related in a major way to another crucial problem in
biology: protein folding. Many of the rules for protein folding can be derived
from site-directed mutagenesis studies of proteins such as cell-surface receptors.
For example, the folding of a cell-surface receptor in the lumen of the
endoplasmic reticulum depends in large part on the arrangement of cysteine
residues and other amino acids in the primary structure of the polypeptide. By use
of site-directed mutagenesis, one can begin to vary the position and number of
cysteine residues to determine their effects on the interaction of the protein with
the cellular machinery of the folding process.
Cell-surface receptors are key molecules that mediate a variety of physiologi-
cally important processes, ranging from the regulation of blood glucose and
cholesterol levels to the control of body iron and vitamin B12 stores. Fundamental
research on these molecules should shed light not only on basic science but on
medicine as well.
For example, it has recently been possible to create functional, chimeric cell-
surface receptors. In a chimera, the extracellular domain of epidermal growth
factor receptor can stimulate the tyrosine kinase activity of an attached insulin
receptor intracellular domain. This result shows that the insulin and epidermal
growth factor receptors use a common mechanism for signal transduction across
the plasma membrane (see Chapter 7). Future applications of this type of
approach include studying the function of new receptors for unknown ligands by
activating the new receptor's cytoplasmic domain with a heterologous ligand-
binding domain derived from an already characterized receptor.
Some oncogenes represent naturally occurring receptor mutants. These will
help us to understand normal mechanisms of receptor function and should lead to
an understanding of how normal signaling processes are subverted to result in
tumorigenesis.
Functional Protein Molecules Can Also Be Synthesized Chemically
For peptide chains of fewer than about 100 amino acids, chemical (as well as
biological) synthesis is now possible and will frequently be the method of choice
for shorter chains. By synthesizing chains in blocks, much longer chains will
become practical synthetic goals. Chemical synthesis permits the insertion of
isotopes, either stable or radioactive, at specific single sites in the chain. The
peptide bond itself can be replaced in selected locations to render the product
totally resistant to proteolytic degradation at that position. In general, such
products are extremely difficult or impossible to prepare biologically. Such a
chemical approach is particularly valuable in testing hypotheses related to small
structural and functional domains and the possible refolding of these isolated
units. Single- and multiple-site mutagenesis experiments are equally easy chemi-
cally since any amino acid can be selected for any position with the automatic
equipment for synthesis that is now available. The simultaneous use of recombi-
nant DNA techniques and sophisticated polypeptide chemical synthesis will cre-
ate new approaches to both the understanding of protein structure and the devel-
opment of specific reagents and functions.
FOLDING
Initial reaction to the appearance of the first high-resolution crystal structure
of a protein was one of shock at the complexity and the total absence of obvious
symmetry. This reaction was conditioned, in part, by the earlier appearance of the
model for DNA, in which the base-paired double helix, once revealed, seemed
elegant and simple. During the past two decades, the successful determination of
many protein structures has led to the realization that there are, indeed, underlying
substructural motifs in these molecules. These motifs are themselves complex
and asymmetrical, but they are repeated in many structures.
The properties of the chemically bonded atoms in the peptide chain severely
constrain the possible conformations the chain can assume, and only a small
number of secondary structures are possible regardless of the sequence of amino
acids. The structural motifs are made up of various combinations of these
secondary units. These supersecondary units, in turn, are packed together into
structural domains. A domain may be the whole molecule, but, in larger proteins,
the native molecules are frequently composed of several domains.
We Do Not Yet Understand How Proteins Assume Their Intricate
Three-Dimensional Forms
The biosynthetic machinery that synthesizes peptide chains is the same for all
proteins. As far as is known, the peptide emerges as an essentially straight chain
having no intrinsic biological activity. During, or shortly after, the completion of
this synthesis, this chain folds up spontaneously to give the final unique, compact,
biologically active structure, which is characteristic of the native protein. In the
early 1960s this process was shown to occur in vitro. By now a large number of
proteins have been shown to undergo this reaction without any apparent help
other than the correct solvent environment. The mystery of how a biologically
active protein is formed from an inert disordered chain is known as the folding
problem. Research in this area, a major interface between the fields of biology
and chemistry, has rapidly expanded during the past five years.
The basic question has always been, From chemical theory, can the three-
dimensional structure of a protein be derived solely from its known amino acid
sequence? Since folding appears to be a purely spontaneous process, prediction
of the structure is a severe test of the level of our understanding of the chemistry
of polypeptide chains. The attack on this problem is both experimental and
theoretical. Its solution is not only essential as a basic underpinning for all of
molecular biology, but would also be of great practical importance in the indus-
trial application of genetic engineering.
Experimental Studies Search for Folding Intermediates
On the experimental side, much work has centered on the search for folding
intermediates. Do the secondary structural elements in native proteins exist, as
such, in small purified peptides apart from the rest of the structure? In the past it
was thought that such structures would be so unstable that they would not be
found, but recent evidence, based largely on optical spectroscopy, suggests a
positive answer to the question, at least for certain sequences. Is there a definable
folding pathway along which such structural intermediates can be found (Figure 3-4)?
This question is controversial. Thermodynamic measurements can frequently
be fitted to a two-state model reflecting only the native and the unfolded
forms. Kinetic data, on the other hand, are often difficult or impossible to
interpret without the assumption of one or more intermediate states. The methods
are invariably indirect and the interpretation non-unique. Since crystals cannot be
obtained for the unfolded state, x-ray diffraction data are not even potentially
available. Detailed NMR studies on long peptides are still on the horizon, but
may in the future provide direct structural information on these intermediate
states.
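The two-state model mentioned above can be written down compactly. In the sketch below, the equilibrium constant between native and unfolded forms follows from the free energy of unfolding through the standard relation K = exp(-ΔG/RT), and the unfolded fraction is K/(1 + K). The function name `fraction_unfolded` and the numerical values used in any call are illustrative assumptions; they are not measurements from the text.

```python
import math

R = 8.314  # gas constant in J/(mol K)


def fraction_unfolded(delta_g_unfold, temperature):
    """Fraction of molecules unfolded in a two-state (native <-> unfolded) model.

    delta_g_unfold: free energy of unfolding in J/mol (positive favors native).
    temperature: absolute temperature in kelvin.
    """
    k_unfold = math.exp(-delta_g_unfold / (R * temperature))  # K = [U]/[N]
    return k_unfold / (1.0 + k_unfold)
```

A model of this form has no term for a partly folded species, which is why a good thermodynamic fit by itself says little about whether kinetic intermediates exist.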
FIGURE 3-4 Schematic diagram of the folding process for an all-helical protein. The reaction starts
with an extended chain containing no permanent intrachain interactions. This proceeds to a hypothetical
intermediate with fluctuating helical segments that occasionally associate. The end point is the
folded, compact, biologically active structure.
A Diverse Range of Theoretical Studies Is in Progress
Theoretical approaches to folding have been proceeding along three lines.
The first two are fundamental procedures for which, in principle, we do not need
to know the final structure. The third is a collection of ad hoc procedures with
which we try to produce useful generalizations by examining the known struc-
tures.
Energy Minimization. Minimization of the conformational energy of pep-
tides is perhaps the oldest of the three procedures. The goal is to predict the native
folded structure of the protein by assuming that it is actually the most stable
structure. This requires computing a potential energy function with many terms
representing possible conformational changes: bond stretching, angle bending,
torsional rotations, van der Waals interactions, and various electrostatic terms.
The minimization of this potential energy with respect to the locations of each of
the atoms in the protein should lead to the observed native structure. The
difficulties are formidable. The largest hurdle one must overcome is the problem
of multiple local energy minima. The potential energy function is like the surface
of the earth. Energy minimization corresponds to finding the lowest point on the
surface. Wherever one starts on the surface, one can find the lowest point nearby.
But how does one know if this is the lowest point on the whole planet? How can
one tell if a locally stable protein structure is actually the most stable possible
structure? Although intense efforts are under way, it is not yet possible to
consistently derive the correct native structure by starting with an unfolded chain
and attempting to minimize the potential energy.
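The problem of multiple local minima shows up even in one dimension. The sketch below applies simple gradient descent to a made-up energy function with two valleys; it is a toy stand-in for a real conformational energy, and the function, step size, and starting points are all assumptions chosen for illustration.

```python
def energy(x):
    """Toy one-dimensional 'energy surface' with two valleys (an assumption)."""
    return x**4 - 3 * x**2 + x


def gradient(x):
    """Derivative of the toy energy with respect to x."""
    return 4 * x**3 - 6 * x + 1


def minimize(x, step=0.01, iterations=5000):
    """Walk downhill from x by repeatedly stepping against the gradient."""
    for _ in range(iterations):
        x -= step * gradient(x)
    return x
```

Starting at `x = 2.0` the descent settles in the shallower valley near `x = 1.13`, whereas starting at `x = -2.0` it reaches the deeper valley near `x = -1.30`. Each run reports a stable point, and nothing in the procedure itself reveals which of the two is the global minimum.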
Molecular Dynamics. In the second theoretical approach, one actually simu-
lates the motions of the atoms of a protein. The potential energy function (in
principle, the same one used for energy minimization) can be used to obtain the
forces on each atom. The movement of the atoms in response to these forces is
then calculated. The applications of this powerful procedure are under intensive
development. As with energy minimization, the fidelity of this approach depends
on the quality of the potential energy function and on the proper modeling of the
solvent. Molecular dynamics simulations have not yet successfully folded a
protein in the absence of additional structural information. The latter can, in
principle, be provided experimentally by NMR or optical spectroscopic procedures.
When such data can be supplied, some recent tests have shown notable
success in carrying out the folding simulations.
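The cycle described above (obtain forces from the potential, then move the atoms) can be sketched for a single particle. The velocity Verlet scheme used below is one common integrator and is named here as an assumption, since the text does not specify one; the harmonic spring force in the usage note is likewise a stand-in for a real potential energy function.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate one particle in one dimension; returns a list of (x, v) pairs.

    force: function giving the force at position x (the negative derivative
    of the potential energy).
    """
    a = force(x) / mass
    trajectory = [(x, v)]
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt   # advance the position
        a_new = force(x) / mass              # recompute the force there
        v = v + 0.5 * (a + a_new) * dt       # advance the velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory
```

With a spring force `lambda x: -x`, unit mass, starting position 1.0, zero velocity, and a time step of 0.01, the particle oscillates and its total energy stays close to the initial value of 0.5. A protein simulation repeats the same update for every atom in three dimensions, with forces that couple all of the atoms and the solvent.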
The Protein Data Bank Is a Rich Resource for Predicting Structure
In the ad hoc approaches, the protein data bank is searched for patterns and
statistical correlations. For example, probabilities based on the occurrence of
each amino acid in various types of secondary structure differ and can, in turn, be
used predictively to estimate probable regions of alpha helix, beta strand, and beta
turn structures in any sequence. In parallel efforts, combinatorial algorithms
aimed at packing assigned secondary structures into supersecondary and larger
tertiary units have been developed. Most recently, combinations of such secondary
and tertiary prediction schemes that show great promise in providing probable
domain structures have been worked out. Whether the resulting models are
close enough to converge to the native structure through molecular dynamics or
energy minimization procedures is not yet known. Although all ad hoc approaches
are implicitly based on the underlying chemistry through the use of
known structures, only a few explicitly refer to these properties in the algorithm
itself.
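The statistical idea behind such predictions can be sketched as follows. For each amino acid and each type of secondary structure, the propensity is the frequency of that amino acid within the structure type divided by its frequency overall; values above 1 mean the residue is found there more often than chance would suggest. The function name `propensities` and the two short "proteins" in `toy_examples` are invented for illustration and are not data from the protein data bank.

```python
from collections import Counter


def propensities(examples):
    """Propensity of each residue for each secondary-structure state.

    examples: list of (sequence, states) string pairs of equal length,
    where each state letter labels the residue at the same position.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    state_counts = Counter()
    for sequence, states in examples:
        for residue, state in zip(sequence, states):
            pair_counts[(residue, state)] += 1
            residue_counts[residue] += 1
            state_counts[state] += 1
    total = sum(residue_counts.values())
    return {
        (residue, state): (pair_counts[(residue, state)] / state_counts[state])
        / (residue_counts[residue] / total)
        for residue in residue_counts
        for state in state_counts
    }


# Made-up training data: H = helix, E = strand, C = other. Not real proteins.
toy_examples = [("AALLKKGG", "HHHHHHCC"), ("VVIIGGAA", "EEEECCHH")]
```

In this toy set, alanine appears only in helical positions, so `propensities(toy_examples)[("A", "H")]` is 2.0 and its strand propensity is 0. With a large collection of real structures, the same ratios give the per-residue probabilities that prediction schemes scan along a new sequence.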
New Experimental Tools Will Aid Studies of Protein Folding
Genetic Approaches. The power of modern genetics is being brought to bear
on the problems of folding. Some mutants seem to be clearly deficient in the
folding process, and yet the final folded protein does not seem abnormal in any
way. The discovery of other systems of this sort and their detailed analysis may
provide a great deal of information on folding pathways that would not be found
by other procedures, or even suspected.
Polypeptide Synthesis. The chemical and recombinant DNA approaches
have been discussed in an earlier section. These complementary approaches to
providing peptides of known sequence will play major roles in the future study of
protein folding. At this time, the behavior of peptides at membrane interfaces has
been studied in detail by chemical synthesis; general specifications of such
interactions are starting to appear, and marked improvement in our understanding
of electrostatic interactions in alpha helices seems imminent.
Through recombinant DNA approaches, many different molecules appropri-
ate for structure-function investigations are already being created, largely through
single-site mutagenesis. Estimates of hydrogen bonding energies are being de-
rived from comparisons between carefully planned and constructed mutants of
proteins of known structure, and factors affecting protein stability are being
outlined.
The Folding Problem Now Seems Ripe for Major Advances
The immediate future for the folding problem looks remarkably (and unex-
pectedly) bright. The development of both fundamental and ad hoc theoretical
approaches is advancing rapidly. The correlation and interactions between theory
and experiment will be much closer than has generally been true in the past.
Combined approaches, with various levels of theory or theory and experiment,
seem likely to be the most fruitful. The ability to easily synthesize specific
polymers, themselves specifically designed to test theoretical predictions or to
provide missing values for parameters, seems particularly promising.
Instrumentation. The solution of the structures of new proteins, and of
mutant versions of older proteins, will continue to be of major importance. Thus
the development and implementation of new and improved x-ray and neutron
diffraction procedures is as important to the folding problem as it is to other areas
in structural biology. Improvements in both solid-state and high-resolution NMR
will be central to the specification of the unfolded state and the search for
definable folding intermediates. Proteins that are isotopically labeled at specific
sites will be essential in this process, and they will also permit the study by NMR
of substantially larger proteins than can currently be tackled.
NEW TECHNIQUES AND INSTRUMENTATION
Improvements in Analytical Techniques and Instrumentation Are Necessary
Better methods for automated x-ray diffraction are critical to our increased
understanding of molecular structure and function. In addition, more general and
effective methods are needed for direct analysis of x-ray data without the need for
preparing many heavy metal derivatives. Many of the most interesting biological
molecules have not been crystallized. Systematic studies are needed, aimed at
producing crystals and other ordered arrays suitable for high-resolution structural
determinations. Funding mechanisms must be adjusted to allow the long-term
support of this speculative but extremely critical area.
Methods are also needed to extend two-dimensional NMR to larger structures
and to automate its analysis. The development of instruments operating at higher
magnetic fields will certainly play an important role in this work.
Advances in Computation Will Revolutionize the Study of Molecular
Structure and Function
Improved methods are needed for collecting and transmitting DNA sequences,
including a single, international data base. Improved methods are also needed for
extracting more biological information directly from sequence data. We must
ensure that continuing advances in computer science are made available, rapidly
and broadly, to the field of structural biology.
More accurate protein folding calculations must be developed, including
better methods for refining x-ray structures and improved semiempirical methods
based on the ever-increasing data base of structures. In addition, uniform,
inexpensive devices to display three-dimensional structures are needed so that,
ultimately, every biologist can view any known structure directly and accurately.
Representative terms from entire chapter:
amino acid