

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




OCR for page 39
Molecular Structure and Function

Biological Macromolecules Are Machines

All biological functions depend on events that occur at the molecular level. These events are directed, modulated, or detected by complex biological machines, which are themselves large molecules or clusters of molecules. Included are proteins, nucleic acids, carbohydrates, lipids, and complexes of them. Many areas of biological science focus on the signals detected by these machines or the output from these machines. The field of structural biology is concerned with the properties and behavior of the machines themselves. The ultimate goals of this field are to be able to predict the structure, function, and behavior of the machines from their chemical formulas, through the use of basic principles of chemistry and physics and knowledge derived from studies of other machines. Although we are still a long way from these goals, enormous progress has been made during the past two decades. Because of recent advances, primarily in recombinant DNA technology, computer science, and biological instrumentation, we should begin to realize the goals of structural biology during the next two decades.

Much of biological research still begins as descriptive science. A curious phenomenon in some living organism sparks our interest, perhaps because it is reminiscent of some previously known phenomenon, perhaps because it is inexplicable in any terms currently available to us. The richness and diversity of biological phenomena have led to the danger of a biology overwhelmed with descriptions of phenomena and devoid of any unifying principles. Unlike the rest of biology, structural biology is in the unique position of having its unifying principles largely known. They derive from basic molecular physics and chemistry. Rigorous physical theory and powerful experimental techniques already

provide a deep understanding of the properties of small molecules. The same principles, largely intact, must suffice to explain and predict the properties of the larger molecules. For example, proteins are composed of linear chains of amino acids, only 20 different types of which regularly occur in proteins. The properties of proteins must be determined by the amino acids they contain and the order in which they are linked. While these properties may become complex and far removed from any property inherent in single amino acids, the existence of a limited set of fundamental building blocks restricts the ultimate functional properties of proteins.

Nucleic acids are potentially simpler than proteins since they are composed of only four fundamental types of building blocks, called bases, linked to each other through a chain of sugars and phosphates. The sequence of these bases in the DNA of an organism constitutes its genetic information. This sequence determines all of the proteins an organism can produce, all of the chemical reactions it can carry out, and, ultimately, all of the behavior the organism can reveal in response to its environment. Carbohydrates and lipids are intermediate in complexity between nucleic acids and proteins. We currently know less about them, but this deficit is rapidly being eliminated.

The central focus in structural biology at present is the three-dimensional arrangement of the atoms that constitute a large biological molecule. Two decades ago this information was available for only several proteins and one nucleic acid, and each three-dimensional structure determined was a landmark in biology. Today such structures are determined routinely, and we have begun to see structures of not just individual large molecules, but whole arrays of such molecules. The first three-dimensional structures were each consistent with our expectations based on fundamental physics and chemistry.
Most of the structures determined subsequently, however, were completely unrelated, and a large body of descriptive structural data began to emerge as more and more structures were revealed by x-ray crystallography. From newer data, patterns of three-dimensional structures have begun to emerge; it is now clear that most if not all structures will eventually fit into rational categories.

The Main Theme of Structural Biology Is the Relation of Molecular Structure to Function

Since biologists are ultimately interested in function, structural biology is often a means toward an end. The role played by structural biology differs somewhat depending on our prior knowledge of the function of particular molecules under investigation. Where considerable knowledge about function already exists, the determination of three-dimensional structure has almost inevitably led to major additional insights into function. For example, the three-dimensional

structure of hemoglobin, the protein that carries oxygen in our blood stream, has helped us understand how we adapt to changes in altitude, how fish control their depth, and how a large number of human mutant hemoglobins relate to particular disease symptoms.

Often knowledge about structure can provide dramatic advances in our understanding about function even when prior knowledge is sketchy. For example, early biological experiments had shown that DNA contained genetic information, but these experiments offered no real clues to how a molecule could store information or how that information could be passed from cell to cell or from generation to generation. The structure of DNA, with bases paired between two different chains, led immediately to the correct conclusions about the mechanism of information storage and transfer. The information resided in the sequence of the bases; the apparent redundancy of two strands with equivalent (complementary) information meant that each could serve to pass the information onto a daughter strand. Furthermore, the redundancy offered a natural defense against loss of information. Even if one strand is damaged (as by chemicals or radiation), in the vast majority of cases the information on the other strand can be used to recover the missing information. Indeed, cells have evolved truly elegant mechanisms to determine which strand contains the original undamaged information; such models could provide useful paradigms for the current human preoccupation with electronic information handling.

The ultimate challenge for structural biology occurs when we have a structure but no clues at all about its function. Because of dramatic advances in our ability to determine structures, this challenge is likely to occur with increasing frequency.
There have been a few remarkable cases in which limited structural information, such as a knowledge of the sequence of amino acid residues in a protein, without any three-dimensional structural information, has led to significant insights into function. In general, however, our current ability to predict function from structure in the absence of prior biological clues is limited, and one of our major needs is to improve our predictive abilities.

Biological Structure Is Organized Hierarchically

The structures of large biological molecules such as proteins and nucleic acids are complex. It is not practical or useful to describe these structures in words. In fact, highly specialized computer-driven graphics systems have been especially created to display molecular structures visually. An example of the output from one of these display systems is shown in Plates 1 and 2. Such devices are an invaluable aid to today's structural biologist, and future advances should make such devices cheaper, easier to use, and thus more readily available to all biologists.

Because of the complexity of biological structures, it is frequently convenient

to deal only with certain aspects of these structures. It is common practice to describe structure at a series of hierarchical levels, called primary, secondary, tertiary, and quaternary structure. This hierarchy reflects some of the types of information provided by particular experimental techniques used to determine the structures of biological molecules.

The primary structure is the covalent chemical structure, that is, a specification of the identity of all the atoms and the bonds that connect them. The major molecules with which we work (proteins, nucleic acids, and carbohydrates) usually consist of linear arrays of units, each of which has a similar overall structure; they differ only in certain details. The types of units are limited in number: 4 common ones in typical nucleic acids, roughly a dozen in typical carbohydrates, and 20 in proteins. Thus, the primary structure can be specified almost completely by naming the linear order, or sequence, of each type of unit of the chain. The primary structure is given by the sequence plus a description of any additional covalent modifications or crosslinks. The sequence of proteins, nucleic acids, and carbohydrates is determined principally by chemical methods. This is understandable since it is, in fact, the chemical structure. These methods have advanced tremendously in the past decade, and the implications of these advances constitute the second section of this chapter.

The secondary structure refers to regular patterns of folding of adjacent residues. Most secondary structures are helices. Some of the most frequent and best-known helices are the alpha helices found in many proteins and double helices found in virtually all nucleic acids. Carbohydrates also form helices.
Helices are convenient structural motifs: They are easy to recognize by inspection of a known three-dimensional structure, they are relatively easy to detect experimentally by physical techniques, and their appearance within many structures is relatively easy to predict just from a knowledge of the primary structure.

The tertiary structure is the complete three-dimensional structure of a single biological unit. Until recently the only available method for determining this structure was x-ray diffraction studies of a single crystal sample. Now electron and neutron diffraction have become available as tools for solid samples, and nuclear magnetic resonance spectroscopy has been developed to the point where it can be used to determine the tertiary structure of small proteins and nucleic acids in liquid solution, that is, close to the state in which they are usually found inside living cells. The tertiary structure usually provides the starting point for studies that attempt to correlate structure and function.

Quaternary structure describes the assembly of individual molecular units into more complex arrays. The simplest example of quaternary structure is a protein that consists of multiple subunits. The units may be identical or different. The arrangement of the subunits frequently has important functional implications. Some quaternary structures have been determined by experimental methods that

reveal not only the arrangement of the subunits but also their individual tertiary structures. However, many quaternary structures are too complex to be addressed by existing techniques. Here a variety of methods ranging from electron microscopy to neutron scattering to chemical crosslinking can still provide information about the overall shape of the assembly and detailed arrangement of the components.

In the sections that follow, we will first explore the levels of biological structures; our concerns will be improved methods for revealing these structures and the application of the resulting information to solving biological problems. We will then consider the current and future prospects for predicting the higher order structure of biological macromolecules from more readily available information on lower order structure. Finally we will consider the power of our newfound abilities to alter macromolecular structure more or less at will.

PRIMARY STRUCTURE

Nucleic Acid and Protein Sequence Data Are Accumulating Rapidly

The amount of available information on the primary structure of biological polymers is increasing at an astounding rate. Two decades ago we knew the nucleotide sequence of only a single small nucleic acid, the yeast alanine transfer RNA. We knew the amino acid sequence of fewer than 100 different types of proteins. Today more than 18 million base pairs of DNA have been sequenced, and the data are accumulating at more than several million bases a year. The first completed sequences were research landmarks. Now sequences are appearing so rapidly that many research journals refuse to publish such information unless it has some particular novel or utilitarian aspects. Indeed, sequence data are currently accumulating faster than we can analyze them, and even faster than we can enter them into the data bases by existing methods.
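Primary structure, as defined earlier (the sequence plus a description of any covalent modifications or crosslinks), maps naturally onto a small record. The following is a minimal sketch of one way such a record might look. The class name `PrimaryStructure`, its field names, and the free-text form of the modification notes are assumptions made for illustration; they do not describe the layout of any existing data base.

```python
from dataclasses import dataclass, field

# One-letter unit codes: 4 common bases for DNA, 20 amino acids for proteins.
ALPHABETS = {
    "dna": set("ACGT"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


@dataclass
class PrimaryStructure:
    """Hypothetical record: a sequence plus notes on covalent modifications."""
    kind: str        # "dna" or "protein"
    sequence: str    # linear order of units, one letter per unit
    # Each entry is (1-based position, description), e.g. (2, "phosphate").
    modifications: list = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in ALPHABETS:
            raise ValueError(f"unknown kind: {self.kind!r}")
        unexpected = sorted(set(self.sequence) - ALPHABETS[self.kind])
        if unexpected:
            raise ValueError(f"units not valid for {self.kind}: {unexpected}")
        for position, _note in self.modifications:
            if not 1 <= position <= len(self.sequence):
                raise ValueError(f"position {position} is outside the chain")


# Illustrative use with a made-up three-residue chain.
example = PrimaryStructure("protein", "MKT", [(2, "phosphate")])
```

Checking the units against a fixed alphabet reflects the point that only a limited set of building blocks occurs in each class of molecule; rarer units would need a wider alphabet than this sketch allows.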
The longest block of continuous DNA sequence known is the entire primary structure of Epstein-Barr virus. This 172,282-base-pair genome is responsible for a number of human diseases including infectious mononucleosis, Burkitt's lymphoma, and nasopharyngeal carcinoma. Knowledge of the DNA sequence potentially unlocks for us all of the secrets of the virus. The challenge now is to use this sequence information to learn how to prevent or control the diseases caused by the virus.

Other landmarks of recent DNA sequencing include the complete DNA sequence of the maize (corn) chloroplast DNA (about 130,000 base pairs) and the complete sequence of the gene for human factor VIII, one of the proteins involved in blood clotting, which is defective in certain hemophilias. We know the complete sequence of many other important proteins, RNAs, and viruses. Perhaps what is most important is that we have the technical ability to determine the sequence of virtually any piece of DNA, RNA, or protein.

Sequence Comparisons Lead to Structural, Functional, and Evolutionary Insights

Much valuable comparative sequence information awaits us as the data accumulate and as analytic methods become more reliable and informative. Already, one can do much using the data bases to help interpret any DNA sequence plucked more or less at random from a genome. The patterns of sequence in the regions that code for the amino acid chains of proteins differ enough from the noncoding regions that the former can usually be identified. For example, we know about types of sequences that are required for efficient synthesis of proteins in many different types of organisms. We know about some general types of control elements for certain genes important in developmental pattern formation or in an organism's response to environmental stress.

When the protein sequence predicted from a gene is compared with all known protein sequences, there is about one chance in three that it will be similar enough to one or more of them to be recognized in a match. This provides an immediate clue to the function of the previously unknown protein. Perhaps the most spectacular example of such a match was the discovery that the product of the sis oncogene, a protein of unknown function that is associated with some cancers, was extremely similar to a blood protein that promotes normal growth, the platelet-derived growth factor. As the data base grows in size, and as our general knowledge about the function of its constituents does likewise, the probabilities of informative matches should rise steadily.

One can anticipate the growth of a new speciality, molecular archeology, that resembles the field of archeology itself. Protein and gene sequences are old.
They have been rearranged and altered much as the residual artifacts from a town or fortification have partially deteriorated and become dispersed. The components that remain, however, when properly viewed, provide clues to the function of the whole. An archeologist might conclude that a room full of amphoras was likely to have been a storage room and not sleeping quarters. In the same way, we can already look at some protein sequences and gain clues about functions, even about functions that we have never observed in detail in the laboratory. Proteins that have transmembrane domains will sit in membranes, proteins with nucleic acid binding sequences will bind DNA or RNA; a protein with both would probably bring a nucleic acid into the vicinity of a membrane and keep it there. The analysis can be carried further because the details of the protein sequence can provide even greater clues, just as the details of the decoration on a piece of pottery or the shape of an arrowhead can identify the geographic origin of the people who produced it.

Many proteins with related functions have probably evolved from common ancestors. Thus receptors (proteins designed to sit at the cell surface and detect the environment) may represent one or more fundamental families of structures and sequences. For example, the sequence of the beta-adrenergic receptor, which binds the hormone adrenalin, and the sequence for rhodopsin, which detects light, are sufficiently similar that we can tell both were once related through a common progenitor. In the same way, proteases often resemble other proteases and structural proteins resemble other structural proteins.

Many proteins are modified chemically after they are synthesized. Proteases may remove one or both ends of the initial chain as well as make cleavages in the middle. Carbohydrates may be added to form glycoproteins. Some of these modifications occur as the protein travels from its initial site of synthesis to its final location in the cell. Others, such as phosphates, are added and removed repeatedly as part of the functioning or regulation of the protein. The enzymes that perform these modifications frequently do so by recognizing particular signal sequences. Because we now know some of these signals, a search of protein sequences can frequently reveal potential modification sites and in turn provide additional clues to the function of the protein.

Three-dimensional structure is better conserved in evolution than sequence is. Apparently there are severe constraints on folding a protein to make a compact three-dimensional array that is stable in the aqueous medium of a cell and resistant to proteases. Once we have mastered techniques for estimating possible folded structures from amino acid sequences, we will enhance our ability to explore molecular archeology. Mastery of these techniques itself will probably require an examination of many more three-dimensional structures by x-ray diffraction.
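The idea of matching a newly predicted protein against all known sequences, discussed above, can be illustrated with a deliberately crude similarity measure. The sketch below scores two sequences by the fraction of three-residue "words" they share and ranks the entries of a small data base by that score. The measure, the threshold, the function names, and the two toy entries are all assumptions made for illustration; this is not the algorithm of any actual comparison program, and the sequences are invented rather than real proteins.

```python
def kmers(sequence, k=3):
    """Return the set of all length-k words that occur in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def similarity(a, b, k=3):
    """Shared words divided by all distinct words (a Jaccard index, 0 to 1)."""
    words_a, words_b = kmers(a, k), kmers(b, k)
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def best_matches(query, database, threshold=0.3, k=3):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = [(name, similarity(query, seq, k)) for name, seq in database.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented entries standing in for a data base of known proteins.
toy_database = {
    "toy_growth_factor": "MKTLLVAGSW",
    "toy_protease": "GDSGGPLVCK",
}
hits = best_matches("MKTLLVAGTW", toy_database)
```

A word-counting score of this kind ignores insertions, deletions, and chemically conservative substitutions, so it would miss the distant relationships that make comparisons such as receptor families informative; it is meant only to show the shape of the search.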
What we still cannot do with much success is predict the function of an arbitrary protein without some molecular archeological clues. When inspected by eye, a three-dimensional protein structure is complex and confusing; about the best most trained observers can do is identify potential binding sites as clefts or pockets and find potential sites of flexibility, such as connectors, between domains. Clues to functional regions can emerge from amino acids that are found in places other than their usual locations. For example, in typical soluble proteins, hydrophilic residues (which have an affinity for water), such as charged residues, reside on the surface, whereas hydrophobic residues (which avoid contact with water), such as those with hydrocarbon side chains, are found buried in the interior. A buried charged group, particularly if it is not paired with an opposite charge, can be a clue to a functional site. Similarly an exposed hydrophobic group may reveal a binding site for a hydrophobic small molecule. A whole set of such groups may indicate a surface of the protein that interacts with another protein or a membrane.

One can go only so far with visual inspection. Methods for systematic analysis of three-dimensional protein structures are needed that can extract, from

the structures, as many clues as possible about protein function. Such procedures are still in their infancy; the next decade should see rapid growth in such techniques now that a sufficient library of known structures and functions exists on which to develop, test, and refine these methods.

The DNA Sequences of Entire Genomes of Some Simple Organisms Will Soon Be Known

The explosion in sequence data has just begun. DNA sequencing is far easier than protein sequencing, and the tools already available for cloning and efficient sequencing of 500-base-pair blocks of DNA will ensure that the current stream of new sequence data will become a torrent. The ultimate target would be to determine the sequence of all the DNA in an organism, that is, to sequence an entire genome. Genomes range in size from 750,000 base pairs (a mycoplasma) to more than 3 billion base pairs. Such large-scale sequencing programs are feasible by today's technology, but they are expensive in both manpower and actual dollar cost. Automated DNA sequencing techniques have begun to be developed, which should markedly diminish manpower requirements and decrease costs. It now seems likely that in the next few decades we will determine the complete DNA sequence of the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the human genome, the fruit fly Drosophila, the mouse genome, the nematode Caenorhabditis elegans, and possibly even a number of plant and other bacterial and yeast genomes. The resulting information will stimulate future generations of biologists as they explore the functions of the tens of thousands of genes that will be revealed for the first time by such sequencing programs.

The major issue facing us today is how to stage the process of large-scale sequence determination. One set of concerns related to this issue deals with the optimal scientific strategy and the selection of targets for sequencing.
Another set of concerns deals with attempts to organize and accelerate this work by mechanisms other than the types of investigator-initiated individual research projects typical in current biological science.

Most investigators favor making a physical map of a genome before commencing really large-scale sequencing. This physical map will consist of an ordered set of large DNA fragments that covers the entire genome. From each large DNA fragment, smaller pieces can be isolated (or cloned) and used as source material to perform the actual DNA sequencing. Some workers favor constructing the ordered set of fragments by isolating individual ones at random and then determining which fragments are neighbors in the genome. Others favor dividing the genomes into successively smaller pieces (first chromosomes, then chromosome fragments, then very large DNA pieces) until an ordered set of DNA fragments is created. At present there are good arguments for attempting both approaches simultaneously.
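The first strategy described above, isolating fragments at random and then working out which ones are neighbors, can be sketched in a few lines. The sketch assumes, purely for illustration, that two fragments are neighbors when the end of one repeats the beginning of the next, and that every such overlap is unambiguous. The function names, the minimum overlap of three letters, and the toy fragments are hypothetical; actual mapping relies on experimental evidence rather than on knowing the letters in advance.

```python
def overlap(left, right, min_len=3):
    """Length of the longest end of `left` that matches the start of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def order_fragments(fragments, min_len=3):
    """Chain fragments left to right by following the best overlap each time."""
    remaining = list(fragments)
    # The leftmost fragment is one that no other fragment runs into.
    start = next(
        (f for f in remaining
         if not any(overlap(g, f, min_len) for g in remaining if g is not f)),
        remaining[0],
    )
    ordered = [start]
    remaining.remove(start)
    while remaining:
        best = max(remaining, key=lambda f: overlap(ordered[-1], f, min_len))
        if overlap(ordered[-1], best, min_len) == 0:
            break  # a gap in the map: no neighbor was found for this end
        ordered.append(best)
        remaining.remove(best)
    return ordered


# Three made-up fragments supplied in scrambled order.
scrambled = ["CCTTAGGCA", "ATGGCGTA", "CGTACCTT"]
ordered = order_fragments(scrambled)
```

Following the single best overlap at each step is a greedy choice. It works on this toy input because every overlap is unique; repeated sequence in a real genome would make some neighbors ambiguous, which is one reason the top-down alternative of successively subdividing the genome remains attractive.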

The first physical maps of genomes or segments of genomes are almost completed. In principle, one could use these and simply start large-scale sequencing now, with existing approaches. However, the likelihood of major improvements in automated technology over the next 5 to 10 years leads many people to favor concentrating current efforts on speeding the development of that technology and delaying most massive sequencing until the technology is available.

As automated DNA sequencing machines become common, the rate-limiting step in obtaining data will shift to the production of the DNA needed for sequencing. Thus, we need to enhance our ability to prepare large numbers of discrete DNA fragments (preferably kept in linear order from some larger starting fragment). Robotics seems an attractive and useful new technology. At present, most sequencing methods are limited to a maximum of about 500 base pairs per DNA fragment. Every significant increase in the size of the fragment that can be sequenced will improve the overall efficiency of the process. Multiplex methods, in which numerous different fragments are handled in parallel or in series, offer another way to accelerate and expedite the entire process.

A third set of scientific concerns deals with the choice of species, individual organisms, and genes to sequence. At one extreme are those who believe that current efforts should be restricted to sequences tied to existing biological problems. For example, in the pursuit of human disease genes, it might be far more important to determine the DNA sequence of the same gene in many individuals than to extend a given sequence into neighboring regions to see what is there. Similarly, comparisons of different species frequently provide biological insights that would not have been possible if studies had been restricted to a single organism.
The advantage of this traditional problem-oriented approach is obvious: The sequences obtained are more or less guaranteed to be interesting and useful. However, the disadvantage is also obvious: As interesting regions are sequenced, it will become more difficult to motivate people to risk explorations of regions of genomes for which little or no information is available. While explorations of these regions have the potential to make major advances through finding completely unexpected genes and functions, the work is also risky since some regions may yield no rewards at all. The realities of tight funding and frequently competitive review for renewed funding militate against such work; if it is to be encouraged, new support mechanisms may need to be created, with longer term commitments and rewards for more risk-taking.

The final set of concerns deals with whether genomic sequencing should be organized in ways similar to those in which "big science" has been dealt with in other disciplines. The actual process of sequence determination is boring. It seems to require more dedication and large-scale organization than most typical biology projects. The intellectual rewards of obtaining sequences of entire genomes are likely to be missed by most of those involved in the massive effort to accumulate the data. Much of the data may not result in publications in the

primary scientific literature, and some publications that do result may have very large numbers of authors. Thus special efforts may be needed to maintain investigators' morale.

Structural and Computational Methods Need to Advance to Keep Pace with the Explosion in Sequence Data

As the acquisition of sequence data continues to accelerate over the next few years, the problem of managing these data will become increasingly severe. Considerable thought and resources will be needed to optimize the collection of data in consistent formats, the entry of data into computerized data bases accessible to all investigators, and the refinement of computer algorithms for all sorts of sequence and structure analysis. The anticipated size of the data base (100 million base pairs in the next few years, 10 to 100 billion base pairs eventually) is not staggering even by today's standards. However, the way the data are being accumulated, by efforts in hundreds of different laboratories, each with its own computer systems and idiosyncrasies, poses a serious problem. What would help is a relatively uniform system of data annotation and transmittal. If this can be done by translation programs that accept a wide variety of inputs and return them to the data base and to the investigator in the standard format, it would probably win broad acceptance by the community because each laboratory could then maintain its own style.

A second complexity of the existing data bases is that there are three independent repositories for nucleic acid sequence data: GenBank, operated by the Los Alamos National Laboratory; the EMBL data base, operated by the European Molecular Biology Laboratory in Heidelberg; and Protein Identification Resource, operated by the National Biomedical Research Foundation in Washington, D.C. The multiplicity of data bases poses severe problems for current and potential users. Ideally, the three should be combined into one.
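The translation programs suggested above, which would accept each laboratory's own style and return a standard format, can be sketched as a set of small readers feeding one writer. Everything specific in the sketch is an assumption made for illustration: the two input styles (a single line with fields separated by `|`, and a block of `KEY=value` lines), the field names, and the two-line output layout are hypothetical and are not the formats of GenBank, the EMBL data base, or any other repository.

```python
import re


def clean_sequence(raw):
    """Uppercase the letters and drop spaces, digits, and punctuation."""
    return re.sub(r"[^A-Za-z]", "", raw).upper()


def from_pipe_line(line):
    """Reader for a hypothetical 'name | organism | sequence' line."""
    name, organism, sequence = (part.strip() for part in line.split("|", 2))
    return {"name": name, "organism": organism,
            "sequence": clean_sequence(sequence)}


def from_key_value(text):
    """Reader for a hypothetical block of KEY=value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        fields[key.strip().lower()] = value.strip()
    return {"name": fields["name"], "organism": fields["organism"],
            "sequence": clean_sequence(fields["seq"])}


def to_standard(record):
    """Writer for the single agreed layout: a header line, then the sequence."""
    return f">{record['name']} [{record['organism']}]\n{record['sequence']}"


# The same made-up entry as two laboratories might submit it.
lab_one = from_pipe_line("geneA | E. coli | atg gct 10 taa")
lab_two = from_key_value("NAME=geneA\nORGANISM=E. coli\nSEQ=atg gct taa")
```

Because both readers produce the same intermediate record, supporting another laboratory's style means adding one reader, and the writer and the data base itself are left untouched.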
Nomenclature is a major problem for all three data bases. The names of molecules, species, and genes are constantly changing, and the data bases also change the cryptic names that they use to identify entries. The various data bases are also not cross-indexed. A major research problem is to determine what data are common to two or all three data bases. Moreover, any such cross-comparison is outdated as soon as one of the data bases is updated. International responsibility for entering data into major data bases and a great expansion in the use of electronic communication for this purpose are badly needed. Someone will have to construct and maintain a cross-index of related biological data bases. Possibly a direct-access system will be set up. In any case, the availability of a periodically updated cross-index will allow other installations to provide an integrated retrieval system to their users more easily.

The third major complexity of the sequence data base is the sophistication of many of the interrogations that will be made. Today each new sequence is almost

automatically run through a comparison program to see what matches can be found with preexisting sequences. At the most trivial level, this procedure may reveal that the sequence has already been reported by someone else, and perhaps the same gene will have been reported under another name. At a more profound level, the comparison may reveal functional and structural insights. These comparisons can consume large amounts of computer time if they are carried out with algorithms that try to detect even very slight degrees of sequence similarity. In the near future, we should begin to see many more attempts to use the data bases to refine methods for structure prediction, studies that will consume enormous amounts of computer time. It seems prudent to plan ahead and support research and development of better computer algorithms and better computer hardware to optimize the biologist's use of the DNA data base. Among the needs are data-base management systems designed to keep track of inquiries and results, so that insights gained by different inquiries can be synergistic and so that unwarranted duplicate inquiries can be short-circuited. Other needs are for improved analytical tools for predicting structure and for comparing structure and sequence. These tools may take the form of new chips, parallel processors, or more powerful algorithms.

Carbohydrate Research Is Gaining Momentum

In the past decade, structural studies on carbohydrates have begun to approach the capabilities of more developed areas of protein and nucleic acid structure. Techniques have been developed to deduce the complete structure of complex oligosaccharides, including oligosaccharides found in scarce glycoproteins, such as cell-surface molecules. Glycoproteins are proteins containing covalently attached sugars, usually short carbohydrate polymers attached to the side chains of the amino acids asparagine, serine, or threonine.
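Attachment to asparagine offers a concrete case of the signal-sequence searches mentioned earlier in the chapter. The sketch below scans a protein for the widely used consensus for asparagine-linked sugars: an asparagine (N), followed by any residue except proline (P), followed by serine (S) or threonine (T). That consensus is standard background knowledge rather than something stated in this chapter, and the function name and the test sequence are hypothetical. A match marks only a candidate site; whether sugar is actually attached depends on factors a sequence search cannot see, and the sketch does not attempt to find sites where sugars attach directly to serine or threonine.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The lookahead (?=...) lets overlapping candidates all be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def candidate_n_linked_sites(protein):
    """Return 1-based positions of asparagines that begin an N-X-S/T motif."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


# Made-up sequence: N at 2, 9, and 10 fit the motif; N at 6 precedes proline.
sites = candidate_n_linked_sites("MNGSANPTNNTS")
```

Positions are reported counting from 1, following the usual convention for numbering residues along a chain.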
Glycoproteins are found throughout nature, from simple single-celled organisms to humans, and they play critical roles in these organisms. Glycoproteins are usually, but not exclusively, found on the surfaces of cells and in cellular secretions. For example, almost all of the human blood proteins and all of the well-characterized eukaryotic cell-surface macromolecules are glycoproteins. In addition, glycoproteins are key components in the outer coatings of a number of pathological agents, including viruses and parasites. Many of the molecules used by the immune system to combat these pathogens are also glycoproteins. Recently, important roles have been identified for some glycoproteins that remain in the cell's interior, such as the proteins that form the pores in the nuclear membrane.

The new techniques in carbohydrate research include nuclear magnetic resonance (NMR), fast-atom-bombardment mass spectrometry, and metabolic labeling with radioactive sugars combined with stepwise degradation with a battery of purified glycosidases. The consequence has been the elucidation of hundreds of

resolution x-ray analysis, will depend on the type of assembly in question. With membrane complexes, the crystals must be grown from precise mixtures of detergents, amphiphiles (polar molecules that have an affinity for both aqueous and nonaqueous areas), protein, and lipid; the process of crystallization has an additional dimension compared with that of soluble proteins. A major difficulty at present, therefore, is in obtaining sufficient commitments of financing and time to support such crystallization efforts. The first such crystallizations were carried out only after many years of trials in Europe, where the support of science can maintain a constant effort in a high-risk, long-term endeavor. Because risks have been demonstrably reduced, considerable weight must be given to early successes in growing crystals of sufficient quality for high-resolution analyses.

The crystallization of viruses is often made difficult by the small supplies available for systematic experimentation. Thus, large cell-culture laboratories with suitable biohazard containment are required. New methods of crystallization may also be needed, particularly for lipid-enveloped viruses such as rubella or measles. The forming of homogeneous complexes of viruses with antibodies, drugs, and receptors will call for considerable effort.

An alternative approach, which has proven successful in the past, consists of crystallization of components of the assembled structure, determining the three-dimensional structures of each of the components, and then using electron microscopy studies to provide the architectural details of how the components are arranged in the assembly. Complementary analyses of this nature will also, in many instances, provide the most appropriate pathway toward understanding the details and action of the large intracellular organelles, such as the nuclear pore and the various types of cytoskeletal filaments.
An exciting result to be expected along these lines over the next 10 years is the merging of the available low-resolution picture of myosin-actin filament interaction with that of the high-resolution structures currently being determined for the actin monomer and the myosin head.

Complex Biological Structures Can Assemble Themselves

Researchers have begun to unravel how molecular assemblies are formed. In living cells, the production of the components destined to be assembled is often coordinated tightly both spatially and temporally. This coordination is revealed by studies with mutants, in which individual components are defective or are not synthesized in the proper amounts. Sometimes assembly occurs just by spontaneous association of individual proteins and nucleic acids, but steps in assembly are frequently accompanied by the covalent modification of key proteins and nucleic acids. Such modification can make the assembly irreversible, in essence locking the pieces into place. Other assembly mechanisms have been found to make use of scaffolding molecules. These molecules are present at intermediate stages in

the assembly to help align critical components, but they disappear before the final structure is formed, just as a scaffold is taken down as a building is finished.

Studies that attempt to assemble biological structures in vitro have been particularly fruitful. These allow the timing of particular steps to be controlled at will, and particular components can be added sequentially or simultaneously, which in turn allows detailed study of assembly pathways and direct tests of the function of specific components of the assembly by single-component-omission experiments. Such studies are not always possible in vivo. For example, if a protein has functions critical to the cell, it will be difficult to see its effect on the structure or function of an assembly by simply preventing its being synthesized.

In vitro assembly is also useful for many structural studies. For example, neutron-scattering measurements usually require the creation of an assembly in which some components contain the normal isotope of hydrogen whereas in others the hydrogen is substituted with deuterium. Such manipulations can be carried out only by starting with isolated components in vitro. Complex biological structures successfully assembled in vitro include ribosomes, microtubules, nucleosomes, and even many viruses.

DIRECTED MODIFICATION OF PROTEINS

We Can Now Design and Construct New Molecular Machines

Until recently, the experimental strategies available to structural biology were largely limited to examining naturally occurring biological structures. Testing specific hypotheses by altering structures was limited to observing naturally occurring biological variants when they could be identified, as in the numerous mutant hemoglobins. This approach is limited in having no systematic way to search for a particular desired variant.
Furthermore, one was restricted to those variants that had no lethal consequences for the organism and variants that had a significant chance of arising by natural biological mutation or evolution.

The development of recombinant DNA technology has dramatically altered our study of the structure and function of proteins. The major breakthrough lies in our new ability to modify or synthesize de novo genes (DNA) that, when introduced into cells, direct the synthesis of modified or new protein molecules. What was only a fantasy a few years ago is today a routine procedure: We can produce protein molecules of any desired sequence. We can produce altered proteins in bacteria, yeast, or plant or animal tissue-culture cells, which makes it possible to isolate large enough quantities for structural and functional studies. In addition, we can produce the altered proteins in vivo in transgenic animals to gauge the effect of the altered protein on complex biological processes.

The Future Will See a Heightened Interdisciplinary Cooperation Between Structural and Molecular Biology

The techniques of traditional structural analysis and of recombinant DNA, when combined, increase the value of both. Such integrated approaches will allow more rapid and informative studies of the structures of proteins and how these structures determine function. The future will see ever-closer working relations among scientists expert in these different disciplines.

The potential to alter proteins at will is remarkable since it transforms structural biology from a science limited to strictly descriptive observations to an experimental science in which specific hypotheses can be tested with appropriate controls in specifically modified molecules. Our ability to do this is still in its infancy; much experience will be needed before the strategies in routine use approach optimal design. However, it is already clear that the ability to alter the sequence of proteins and nucleic acids systematically has revolutionary applications for structural biology.

The importance of these new technologies is twofold; the first aspect is widely appreciated, the second perhaps less often noted. First, by altering protein structure, or by creating new proteins, we are able to produce improved or even new proteins of value to human welfare, such as new pharmaceuticals. We are already using recombinant DNA technologies to produce human growth hormone, the blood-clotting factor VIII, the clot-dissolving agent tissue plasminogen activator, interferons, and several lymphokines (which regulate development of various cells in the body). Variants of these proteins are being made and tested for improved properties such as increased heat stability or increased lifetime in the blood.
We are limited in these efforts by the fact that in most cases we do not yet sufficiently understand the structures of these proteins or the relations of structure to function to know what changes to make. As we learn more about the structures of these molecules from x-ray crystallography and other techniques, the successful production of useful variants will increase.

Second, determining protein structure will be enhanced by our ability to modify proteins. These modifications will result in proteins that crystallize better, that are designed for easy insertion of heavy metals needed in x-ray crystallography, and that have specific perturbations introduced to test hypotheses. In parallel, as the body of defined structures grows, we will be in a better position to design rationally modified proteins. We will know more about possible protein-folding motifs, domain attachments, and the effects of certain kinds of single-residue modifications. Today, in the absence of a known three-dimensional protein structure, the behavior of site-directed mutants can be unpredictable. The cellular location of the new product, its stability, and its properties can frequently differ from our naive predictions. This situation should change markedly as we

gain more experience with site-directed protein modification and its structural consequences.

Methods for Designing New Proteins. The first approach to protein design is site-directed mutagenesis. Here one usually alters a single amino acid by changing one or two nucleotides in the gene at the point coding for that amino acid. The result is a site mutant, which may resemble natural mutants, except that the experimenter can choose the site and the replacement.

The second approach is to make larger alterations. Usually this involves interchanging segments of two or more different proteins. Domains are segments of sequence often associated with particular functions of a protein. Therefore, by appropriate switches in domains, one can rationally create proteins likely to have desired hybrid functions. The creation of such chimeric proteins in the laboratory mirrors the events that seem to occur in protein evolution. The genes for proteins in most organisms occur in blocks of coding regions (exons) and blocks of noncoding regions (introns). The introns are removed from the message by RNA splicing before a final transcript is used to direct the synthesis of protein (see Chapter 4). The exons frequently appear to correspond to functional or structural motifs in the protein, and often they correspond to actual three-dimensional structural domains.

Many new protein functions may have evolved by exon shuffling, that is, rearrangements among pre-existing exons of proven functional capability. Such exon shuffling provides a rapid way to create proteins with new, hybrid functions. It also provides a rationale for the presence of interrupted genes. An organism that has such a pattern of genetic information is likely to be more able to cut and paste its genes in a meaningful way and thus should have a selective advantage. Domains are also regions that appear to fold independently into three-dimensional structures.
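The site-directed mutagenesis step described above, in which one or two nucleotides are changed at the codon for a chosen amino acid, can be pictured at the sequence level with a short sketch. The gene fragment, the function names, and the deliberately partial codon table below are illustrative assumptions, not part of any laboratory protocol or software library; only the codon assignments themselves come from the standard genetic code.

```python
# Minimal sketch of site-directed mutagenesis at the sequence level.
# CODON_TABLE is a small, partial subset of the standard genetic code,
# just large enough for this hypothetical example.
CODON_TABLE = {
    "ATG": "M",  # methionine
    "GCT": "A",  # alanine
    "GAA": "E",  # glutamate
    "GAT": "D",  # aspartate
    "AAA": "K",  # lysine
    "TGT": "C",  # cysteine
    "TCT": "S",  # serine
}


def translate(dna):
    """Translate a coding sequence (length divisible by 3) into one-letter amino acids."""
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def mutate_codon(dna, residue_index, new_codon):
    """Return a copy of dna with the codon for residue_index (0-based) replaced."""
    if len(new_codon) != 3:
        raise ValueError("a codon has exactly three nucleotides")
    start = 3 * residue_index
    if start + 3 > len(dna):
        raise IndexError("residue index is outside the coding sequence")
    return dna[:start] + new_codon + dna[start + 3:]


def nucleotide_changes(old_codon, new_codon):
    """Count how many of the three positions differ between two codons."""
    return sum(a != b for a, b in zip(old_codon, new_codon))


# Hypothetical five-residue gene fragment: Met-Ala-Glu-Lys-Cys.
gene = "ATGGCTGAAAAATGT"

# Replace glutamate (GAA) at position 2 with aspartate (GAT): one nucleotide changes.
glu_to_asp = mutate_codon(gene, 2, "GAT")

# Replace cysteine (TGT) at position 4 with serine (TCT): again one nucleotide changes.
cys_to_ser = mutate_codon(gene, 4, "TCT")

print(translate(gene), translate(glu_to_asp), translate(cys_to_ser))
```

The sketch only manipulates strings; it says nothing about whether the resulting protein folds or functions, which is exactly the question such mutants are built to probe.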
Switching pre-existing domains maximizes the likelihood that the new chimeric protein will still be able to achieve a stable, well-ordered, three-dimensional structure.

Genetically Engineered Proteins Reveal Much About How Proteins Function

The use of site-directed protein modification offers great promise for answering some of the fundamental questions in contemporary biology. For example, cell-surface receptors must migrate throughout the cell from one organelle to another, moving from the endoplasmic reticulum (site of synthesis) to the Golgi complex (site of carbohydrate modification) to the plasma membrane (site of clustering in specialized regions of the cell surface called coated pits). Once inside a coated pit, these proteins are taken inside the cell in a coated vesicle and then recycled back to the cell surface in a recycling vesicle. All of these

movements seem to be dictated by signals contained within the structure of the protein itself. What are these targeting signals? Are they simply short, continuous stretches of amino acids or are they determined by the three-dimensional structure of the protein? Are protein modifications, such as phosphorylation or fatty acylation of the protein, required for any of these targeting signals?

The use of chimeric proteins has made it possible to define the functions of linear sequences responsible for protein translocation into the endoplasmic reticulum, mitochondria, and nucleus. However, signals that are defined by noncontinuous amino acid sequences are more difficult (if not impossible) to define functionally with chimeric proteins. Incorrect protein folding becomes a major obstacle when the function of an internal sequence or domain is examined by this approach.

Once a targeting signal for a given movement is identified, the scientist is in a superb position to study biochemically how the proteins of the cell interact with the cell-surface receptor to effect the desired targeting event. All the potential questions can now be answered in model systems, in which cloned genes for cell-surface receptors are transfected into cultured cells and then studied functionally and biochemically.

The targeting problem is related in a major way to another crucial problem in biology: protein folding. Many of the rules for protein folding can be derived from site-directed mutagenesis studies of proteins such as cell-surface receptors. For example, the folding of a cell-surface receptor in the lumen of the endoplasmic reticulum depends in large part on the arrangement of cysteine residues and other amino acids in the primary structure of the polypeptide.
By use of site-directed mutagenesis, one can begin to vary the position and number of cysteine residues to determine their effects on the interaction of the protein with the cellular machinery of the folding process.

Cell-surface receptors are key molecules that mediate a variety of physiologically important processes, ranging from the regulation of blood glucose and cholesterol levels to the control of body iron and vitamin B12 stores. Fundamental research on these molecules should shed light not only on basic science but on medicine as well.

For example, it has recently been possible to create functional, chimeric cell-surface receptors. In a chimera, the extracellular domain of epidermal growth factor receptor can stimulate the tyrosine kinase activity of an attached insulin receptor intracellular domain. This result shows that the insulin and epidermal growth factor receptors use a common mechanism for signal transduction across the plasma membrane (see Chapter 7). Future applications of this type of approach include studying the function of new receptors for unknown ligands by activating the new receptor's cytoplasmic domain with a heterologous ligand-binding domain derived from an already characterized receptor.

Some oncogenes represent naturally occurring receptor mutants. These will help us to understand normal mechanisms of receptor function and should lead to an understanding of how normal signaling processes are subverted to result in tumorigenesis.

Functional Protein Molecules Can Also Be Synthesized Chemically

For peptide chains of fewer than about 100 amino acids, chemical (as well as biological) synthesis is now possible and will frequently be the method of choice for shorter chains. By synthesizing chains in blocks, much longer chains will become practical synthetic goals. Chemical synthesis permits the insertion of isotopes, either stable or radioactive, at specific single sites in the chain. The peptide bond itself can be replaced in selected locations to render the product totally resistant to proteolytic degradation at that position. In general, such products are extremely difficult or impossible to prepare biologically. Such a chemical approach is particularly valuable in testing hypotheses related to small structural and functional domains and the possible refolding of these isolated units.

Single- and multiple-site mutagenesis experiments are equally easy chemically since any amino acid can be selected for any position with the automatic equipment for synthesis that is now available. The simultaneous use of recombinant DNA techniques and sophisticated polypeptide chemical synthesis will create new approaches to both the understanding of protein structure and the development of specific reagents and functions.

FOLDING

Initial reaction to the appearance of the first high-resolution crystal structure of a protein was one of shock at the complexity and the total absence of obvious symmetry. This reaction was conditioned, in part, by the earlier appearance of the model for DNA, in which the base-paired double helix, once revealed, seemed elegant and simple.
During the past two decades, the successful determination of many protein structures has led to the realization that there are, indeed, underlying substructural motifs in these molecules. These motifs are themselves complex and asymmetrical, but they are repeated in many structures. The properties of the chemically bonded atoms in the peptide chain severely constrain the possible conformations the chain can assume, and only a small number of secondary structures are possible regardless of the sequence of amino acids. The structural motifs are made up of various combinations of these secondary units. These supersecondary units, in turn, are packed together into structural domains. A domain may be the whole molecule, but, in larger proteins, the native molecules are frequently composed of several domains.

We Do Not Yet Understand How Proteins Assume Their Intricate Three-Dimensional Forms

The biosynthetic machinery that synthesizes peptide chains is the same for all proteins. As far as is known, the peptide emerges as an essentially straight chain having no intrinsic biological activity. During, or shortly after, the completion of this synthesis, this chain folds up spontaneously to give the final unique, compact, biologically active structure, which is characteristic of the native protein. In the early 1960s this process was shown to occur in vitro. By now a large number of proteins have been shown to undergo this reaction without any apparent help other than the correct solvent environment. The mystery of how a biologically active protein is formed from an inert disordered chain is known as the folding problem. Research in this area, a major interface between the fields of biology and chemistry, has rapidly expanded during the past five years.

The basic question has always been, From chemical theory, can the three-dimensional structure of a protein be derived solely from its known amino acid sequence? Since folding appears to be a purely spontaneous process, prediction of the structure is a severe test of the level of our understanding of the chemistry of polypeptide chains. The attack on this problem is both experimental and theoretical. Its solution is not only essential as a basic underpinning for all of molecular biology, but would also be of great practical importance in the industrial application of genetic engineering.

Experimental Studies Search for Folding Intermediates

On the experimental side, much work has centered on the search for folding intermediates. Do the secondary structural elements in native proteins exist, as such, in small purified peptides apart from the rest of the structure?
In the past it was thought that such structures would be so unstable that they would not be found, but recent evidence, based largely on optical spectroscopy, suggests a positive answer to the question, at least for certain sequences. Is there a definable folding pathway along which such structural intermediates can be found (Figure 3-4)? This question is controversial. Thermodynamic measurements can frequently be fitted to a two-state model reflecting only the native and the unfolded forms. Kinetic data, on the other hand, are often difficult or impossible to interpret without the assumption of one or more intermediate states. The methods are invariably indirect and the interpretation non-unique. Since crystals cannot be obtained for the unfolded state, x-ray diffraction data are not even potentially available. Detailed NMR studies on long peptides are still on the horizon, but may in the future provide direct structural information on these intermediate states.
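The two-state model mentioned above can be made concrete with a short sketch. It assumes the standard equilibrium relation K = exp(-ΔG/RT) between the native and unfolded forms; the function name and the free-energy values used in the example are illustrative assumptions, not measurements taken from this chapter.

```python
import math

# Gas constant in kJ/(mol K).
R_KJ_PER_MOL_K = 8.314462618e-3


def fraction_folded(delta_g_unfold_kj_per_mol, temperature_k):
    """Fraction of molecules in the native state under a two-state model.

    delta_g_unfold_kj_per_mol is G(unfolded) - G(native); a positive value
    means the native state is favored. Only two states are populated, so
    no folding intermediates appear anywhere in this description.
    """
    if temperature_k <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    # Equilibrium constant for N <-> U: K = [U]/[N] = exp(-dG / RT).
    k_unfold = math.exp(-delta_g_unfold_kj_per_mol / (R_KJ_PER_MOL_K * temperature_k))
    return 1.0 / (1.0 + k_unfold)


# Hypothetical stabilities, chosen only to show the shape of the relation.
for delta_g in (0.0, 5.0, 20.0):
    print(delta_g, round(fraction_folded(delta_g, 298.15), 4))
```

Because the model has only two populated states, a good fit to equilibrium data does not by itself rule out intermediates; it shows only that none accumulates to a measurable extent at equilibrium, which is why the kinetic evidence discussed above carries so much weight.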

FIGURE 3-4 Schematic diagram of the folding process for an all-helical protein. The reaction starts with an extended chain containing no permanent intrachain interactions. This proceeds to a hypothetical intermediate with fluctuating helical segments that occasionally associate. The end point is the folded, compact, biologically active structure.

A Diverse Range of Theoretical Studies Is in Progress

Theoretical approaches to folding have been proceeding along three lines. The first two are fundamental procedures for which, in principle, we do not need to know the final structure. The third is a collection of ad hoc procedures with which we try to produce useful generalizations by examining the known structures.

Energy Minimization. Minimization of the conformational energy of peptides is perhaps the oldest of the three procedures. The goal is to predict the native folded structure of the protein by assuming that it is actually the most stable structure. This requires computing a potential energy function with many terms representing possible conformational changes: bond stretching, angle bending, torsional rotations, van der Waals interactions, and various electrostatic terms. The minimization of this potential energy with respect to the locations of each of the atoms in the protein should lead to the observed native structure.

The difficulties are formidable. The largest hurdle one must overcome is the problem of multiple local energy minima. The potential energy function is like the surface of the earth. Energy minimization corresponds to finding the lowest point on the surface. Wherever one starts on the surface, one can find the lowest point nearby. But how does one know if this is the lowest point on the whole planet? How can one tell if a locally stable protein structure is actually the most stable possible

structure? Although intense efforts are under way, it is not yet possible to consistently derive the correct native structure by starting with an unfolded chain and attempting to minimize the potential energy.

Molecular Dynamics. In the second theoretical approach, one actually simulates the motions of the atoms of a protein. The potential energy function (in principle, the same one used for energy minimization) can be used to obtain the forces on each atom. The movement of the atoms in response to these forces is then calculated. The applications of this powerful procedure are under intensive development. As with energy minimization, the fidelity of this approach depends on the quality of the potential energy function and on the proper modeling of the solvent. Molecular dynamics simulations have not yet successfully folded a protein in the absence of additional structural information. The latter can, in principle, be provided experimentally by NMR or optical spectroscopic procedures. When such data can be supplied, some recent tests have shown notable success in carrying out the folding simulations.

The Protein Data Bank Is a Rich Resource for Predicting Structure

In the ad hoc approaches, the protein data bank is searched for patterns and statistical correlations. For example, probabilities based on the occurrence of each amino acid in various types of secondary structure differ and can, in turn, be used predictively to estimate probable regions of alpha helix, beta strand, and beta turn structures in any sequence. In parallel efforts, combinatorial algorithms aimed at packing assigned secondary structures into supersecondary and larger tertiary units have been developed. Most recently, combinations of such secondary and tertiary prediction schemes that show great promise in providing probable domain structures have been worked out.
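The statistical idea behind the ad hoc approaches, that each amino acid occurs with a different probability in each type of secondary structure, can be sketched as follows. The propensity of a residue for a given state is taken here as its frequency within that state divided by its overall frequency. The two annotated sequences are invented toy data used only to exercise the arithmetic, and the function name is an assumption; real tables would be derived from the full set of known structures.

```python
from collections import Counter


def propensities(sequences, annotations, target_state):
    """Propensity of each amino acid for one secondary-structure state.

    propensity(aa) = (share of target-state residues that are aa)
                     / (share of all residues that are aa)
    Values above 1 mean the residue is over-represented in that state.
    """
    total = Counter()
    in_state = Counter()
    for seq, ann in zip(sequences, annotations):
        if len(seq) != len(ann):
            raise ValueError("sequence and annotation must have equal length")
        for aa, state in zip(seq, ann):
            total[aa] += 1
            if state == target_state:
                in_state[aa] += 1
    n_total = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        return {}
    return {
        aa: (in_state[aa] / n_state) / (total[aa] / n_total)
        for aa in total
    }


# Invented toy data: H marks helix, C marks everything else.
toy_sequences = ["AAEELLGG", "GGPPAAEE"]
toy_annotations = ["HHHHHHCC", "CCCCHHHH"]

helix_table = propensities(toy_sequences, toy_annotations, "H")
for aa in sorted(helix_table):
    print(aa, round(helix_table[aa], 2))
```

In a predictive scheme of this general kind, such values are averaged over a sliding window along a new sequence, and stretches with a high average are proposed as probable helix, strand, or turn regions. The details of any particular published method are not reproduced here.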
Whether the resulting models are close enough to converge to the native structure through molecular dynamics or energy minimization procedures is not yet known. Although all ad hoc approaches are implicitly based on the underlying chemistry through the use of known structures, only a few explicitly refer to these properties in the algorithm itself.

New Experimental Tools Will Aid Studies of Protein Folding

Genetic Approaches. The power of modern genetics is being brought to bear on the problems of folding. Some mutants seem to be clearly deficient in the folding process, and yet the final folded protein does not seem abnormal in any way. The discovery of other systems of this sort and their detailed analysis may provide a great deal of information on folding pathways that would not be found by other procedures, or even suspected.
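The question raised above, whether a starting model is close enough to converge to the native structure, comes down to the multiple-minima problem described under energy minimization. A one-dimensional sketch makes the point. The energy function below is an arbitrary polynomial with two wells, chosen purely for illustration; it is not a protein potential, and the step size and iteration count are likewise assumptions.

```python
def energy(x):
    """Toy one-dimensional 'energy surface' with two wells (not a real potential)."""
    return x**4 - 3.0 * x**2 + x


def gradient(x):
    """Analytical derivative of the toy energy function."""
    return 4.0 * x**3 - 6.0 * x + 1.0


def minimize(x_start, step=0.01, n_steps=5000):
    """Plain steepest descent: repeatedly move a small step downhill."""
    x = x_start
    for _ in range(n_steps):
        x -= step * gradient(x)
    return x


# The same procedure, started on opposite sides of the barrier.
x_from_right = minimize(2.0)   # settles in the shallower well, near x = 1.13
x_from_left = minimize(-2.0)   # settles in the deeper well, near x = -1.30

print(round(x_from_right, 3), round(energy(x_from_right), 3))
print(round(x_from_left, 3), round(energy(x_from_left), 3))
```

Both runs end at a point where the gradient vanishes, and nothing in the local information tells the first run that a lower well exists elsewhere. A protein has thousands of coordinates rather than one, so the number of such wells is correspondingly far larger.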

Polypeptide Synthesis. The chemical and recombinant DNA approaches have been discussed in an earlier section. These complementary approaches to providing peptides of known sequence will play major roles in the future study of protein folding. At this time, the behavior of peptides at membrane interfaces has been studied in detail by chemical synthesis; general specifications of such interactions are starting to appear, and marked improvement in our understanding of electrostatic interactions in alpha helices seems imminent.

Through recombinant DNA approaches, many different molecules appropriate for structure-function investigations are already being created, largely through single-site mutagenesis. Estimates of hydrogen bonding energies are being derived from comparisons between carefully planned and constructed mutants of proteins of known structure, and factors affecting protein stability are being outlined.

The Folding Problem Now Seems Ripe for Major Advances

The immediate future for the folding problem looks remarkably (and unexpectedly) bright. The development of both fundamental and ad hoc theoretical approaches is advancing rapidly. The correlation and interactions between theory and experiment will be much closer than has generally been true in the past. Combined approaches, with various levels of theory or theory and experiment, seem likely to be the most fruitful. The ability to easily synthesize specific polymers, themselves specifically designed to test theoretical predictions or to provide missing values for parameters, seems particularly promising.

Instrumentation. The solution of the structures of new proteins, and of mutant versions of older proteins, will continue to be of major importance. Thus the development and implementation of new and improved x-ray and neutron diffraction procedures is as important to the folding problem as it is to other areas in structural biology.
Improvements in both solid-state and high-resolution NMR will be central to the specification of the unfolded state and the search for definable folding intermediates. Proteins that are isotopically labeled at specific sites will be essential in this process, and they will also permit the study by NMR of substantially larger proteins than can currently be tackled.

NEW TECHNIQUES AND INSTRUMENTATION

Improvements in Analytical Techniques and Instrumentation Are Necessary

Better methods for automated x-ray diffraction are critical to our increased understanding of molecular structure and function. In addition, more general and

effective methods are needed for direct analysis of x-ray data without the need for preparing many heavy metal derivatives. Many of the most interesting biological molecules have not been crystallized. Systematic studies are needed, aimed at producing crystals and other ordered arrays suitable for high-resolution structural determinations. Funding mechanisms must be adjusted to allow the long-term support of this speculative but extremely critical area.

Methods are also needed to extend two-dimensional NMR to larger structures and to automate its analysis. The development of instruments operating at higher magnetic fields will certainly play an important role in this work.

Advances in Computation Will Revolutionize the Study of Molecular Structure and Function

Improved methods are needed for collecting and transmitting DNA sequences, including a single, international data base. Improved methods are also needed for extracting more biological information directly from sequence data. We must ensure that continuing advances in computer science are made available, rapidly and broadly, to the field of structural biology.

More accurate protein folding calculations must be developed, including better methods for refining x-ray structures and improved semiempirical methods based on the ever-increasing data base of structures. In addition, uniform, inexpensive devices to display three-dimensional structures are needed so that, ultimately, every biologist can view any known structure directly and accurately.