Copyright © 2009. National Academy of Sciences. All rights reserved.
Sequencing
The nucleotide sequence of a genome is its physical map at the
highest level of resolution. It provides all the information that goes
into making up an individual's genetic complement, and no two
individuals (except identical twins) share the same genome sequence.
Rather, every human contains a duplicate copy of every chromosome,
with about a 1.0 percent difference between the sequence of each of
his or her two homologous chromosomes (that is, the average person
in a population is heterozygous for about 1.0 percent of the nucleotide
pairs, or approximately 30 million pairs). These differences have arisen
by mutations accumulated over the course of evolutionary time, and
most of them do not affect the normal functions of the individual.
Any sequence derived from the human genome will be a prototype,
a blueprint that will lay out the basic organization and sequence of
the genes on the chromosomes. This prototype may be derived by
forming a composite of regional sequences from many individuals; it
need not represent the complete sequence of any one person. The
nature of individual variation will become apparent when regions of
interest are compared among individuals.
In 1971, the first nucleotide sequence was obtained directly from
DNA with the determination of the 12-nucleotide-long cohesive ends
of bacteriophage λ (Wu and Taylor, 1971). Since that time, with
the advent of rapid techniques, about 15 million nucleotides of DNA
sequence have accumulated in the GenBank database (see Chapter
6), of which over 2 million are from human DNA (Howard Bilofsky
of Bolt Beranek and Newman, Inc., personal communication, 1987).
This figure represents approximately 0.07 percent of the human
genome. Thus, although human genomic sequencing has already
begun, unless a special effort is initiated, the entire sequence will not
be available for many decades, if ever.
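The percentages quoted above can be checked with a few lines of arithmetic. The sketch below assumes a genome of about 3 billion nucleotide pairs, the size used later in this chapter; the variable and function names are illustrative only.

```python
# Figures quoted in the text; names are illustrative, not from any database API.
GENOME_SIZE_NT = 3_000_000_000       # about 3 billion nucleotide pairs (assumed size)
HUMAN_NT_IN_GENBANK = 2_000_000      # "over 2 million" human nucleotides as of 1987
HETEROZYGOUS_PERCENT = 1.0           # about 1.0 percent between homologous chromosomes


def percent_of_genome(nucleotides, genome_size=GENOME_SIZE_NT):
    """Express a nucleotide count as a percentage of the whole genome."""
    return 100.0 * nucleotides / genome_size


sequenced_percent = percent_of_genome(HUMAN_NT_IN_GENBANK)
heterozygous_pairs = round(GENOME_SIZE_NT * HETEROZYGOUS_PERCENT / 100)

print(f"human sequence in GenBank: {sequenced_percent:.2f} percent of the genome")
print(f"heterozygous nucleotide pairs: about {heterozygous_pairs:,}")
```

Under these assumptions the two results round to 0.07 percent and 30 million pairs, matching the figures in the text.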
WHY SEQUENCE THE ENTIRE HUMAN GENOME?
There is general agreement in the biological sciences community
that a physical map of the human genome, as represented in a set of
cloned overlapping DNA fragments, is a worthwhile goal. There is
much less consensus on the advisability of embarking on the deter-
mination of major amounts of its nucleotide sequence. Three kinds
of reservations are often voiced:
· Since the amount of useful protein-coding information in the
genome is estimated to be 5 percent or less, a great deal of effort
would be expended in determining the order of nucleotides of no
apparent significance. If massive amounts of sequencing are to be
done, why not sequence only large libraries of cDNAs instead?
· Even if only cDNAs were sequenced, we would lack the ability
to utilize the vast amount of sequence information generated. The
problem will be even worse with a complete genome sequence.
Therefore, the limited amount of usable knowledge gained is not
worth the anticipated cost.
· Even if the project is worthwhile, the intensive effort required
will divert funds from other research aimed at understanding the
structure and function of genes in all organisms, and, therefore, there
will be a net loss rather than a net gain of important biological
information.
To address the first point, one must consider whether it would be
less difficult to identify, before sequencing, the 5 percent of the
genome that actually encodes proteins than to sequence the entire
genome. Given the present state of sequencing technology, this is
certainly the case; for this reason, we would anticipate that most
human genome sequencing in the immediate future will be carried out
on cDNA clones, which represent the expressed DNA sequence.
However, it seems fair to assume that by the time sequencing begins
on a massive scale, the technology will have matured so far that
inserting a preliminary step that discriminates among genes, intergenic
regions, and introns (which will presumably involve sorting out all
the repeated isolates of the same DNA clones) will be less efficient
than sequencing large regions from ordered genomic DNA clone
libraries without reference to their contents. This, of course, assumes
major technological advances in sequencing, as will be described
subsequently.
Another reason to sequence more than just cDNAs is that sequencing
the entire genome is certain to reveal unsuspected sequences having
important functions. For example, one of the great challenges of a
genome sequencing project is to identify potentially important func-
tional domains involved in gene regulation and chromosome organi-
zation. The identities of such sequences will be elicited by multiple
analytical approaches and will require sequence comparisons between
the analogous intergenic regions in multiple species (including human
versus mouse) and the recognition of unusual patterns of sequence
within a single organism. As one example, the comparative sequencing
of 3,500 nucleotides of a regulatory region from the engrailed gene of
two different Drosophila species has revealed the presence of more
than 50 short blocks of evolutionarily conserved sequences, most of
which are suspected to represent the binding sites for different gene
regulatory proteins (J. Kassis and P. O'Farrell, University of California,
San Francisco, personal communication, 1987). Determining the
function of each of the sequences will require experimental testing
based on the sequence analysis, which can pinpoint even short
sequences that deserve serious investigation by virtue of their con-
servation during evolution.
Sequence comparisons among different species will pick up genes
readily as evolutionarily conserved sequences in the genomes. How-
ever, such comparisons are rarely necessary for picking out coding
sequences since existing analytical tools are adequate to identify them
within a single DNA sequence. The standard procedure is to use
computer programs to identify open reading frames, which are regions
of nucleotide sequence lacking the stop codons that terminate a protein
sequence. Practical experience shows that this information, when
combined with codon usage patterns and other characteristics, allows
one to identify virtually all genes in a nucleotide sequence, even
though short exons will occasionally be missed (Staden and McLachlan,
1982). Accurate specification of the coding sequences can
then be obtained by a standard experimental analysis of the corre-
sponding clones in a collection of cDNAs. Some of the genes
discovered will have immediate significance to the biomedical com-
munity because they are associated with a disease. Many others will
be analyzed and found to contain homologies to existing gene families,
an immediate clue to their possible function. As more gene sequences
are determined, such relations among genes will be found with
increasing frequency (as has recently happened, for example, among
genes that encode cell surface proteins that bind specific protein
molecules that are involved in cell signaling), and entirely new gene
families will be identified as well (Doolittle et al., 1986).
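The open-reading-frame search described above can be sketched in a few lines. This is a minimal illustration, not the procedure of Staden and McLachlan: it scans only the given strand, treats any stop-free run of whole codons as an open reading frame, and ignores codon usage. The function name, the `min_codons` threshold, and the example sequence are all hypothetical.

```python
# Stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(sequence, min_codons=3):
    """Return (frame, start, end) for each stop-free run of whole codons.

    Only the given strand is scanned, in its three forward frames.
    Positions are 0-based and `end` is exclusive.
    """
    sequence = sequence.upper()
    found = []
    for frame in range(3):
        run_start = frame
        # Step through the complete codons of this frame.
        for pos in range(frame, len(sequence) - 2, 3):
            if sequence[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    found.append((frame, run_start, pos))
                run_start = pos + 3
        # Close the run that reaches the end of the sequence.
        end = frame + ((len(sequence) - frame) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            found.append((frame, run_start, end))
    return found


example = "ATGGCCAAATAAGGG"  # made-up sequence: ATG GCC AAA TAA GGG in frame 0
for frame, start, end in open_reading_frames(example):
    print(frame, start, end, example[start:end])
```

A practical program would use a much larger `min_codons` value, scan the complementary strand as well, and weigh each candidate by codon usage, as the text notes.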
Critics will rightly point out that a complete human genome sequence
will make such a huge number of genes (perhaps as many as 100,000)
directly accessible that the function of the vast majority of them will
remain unknown for many decades after the genome has been
completely sequenced. Why then should one devote extra resources
to speeding up the completion of the sequencing effort? The committee
feels that much is to be gained from having a complete catalog of
human gene sequences that does not require knowing the function of
most of the individual genes. For example, scientists interested in the
signaling actions of cyclic nucleotides will immediately be able to
recognize a large group of genes that are likely to produce proteins
that bind cyclic nucleotides. Specific antibodies can be prepared to
each of these proteins and used to test for the role of each of them
in any signaling pathway of interest. Whole families of proteins that
are likely to mediate the signaling effect of calcium ions can be
identified in a similar way. Likewise, a large group of candidate human
genes will be immediately available as potential analogs of any newly
discovered yeast, nematode, or Drosophila protein, for example.
Other novel uses of the genome sequence data, unforeseen at present,
will be developed by individual scientists, just as many of the most
important current uses of recombinant-DNA technology were not
foreseen by its early developers. In short, we anticipate that the
genome sequence will serve as a basic "dictionary" that catalyzes
striking advances in our understanding of cells and organisms.
In response to the third criticism, the committee specifically rec-
ommends that the sequence of the human genome be determined in
parallel with analogous sequencing of the genomes of the other
organisms needed to interpret the human data. Thus, the basic research
on these model organisms should be closely integrated with data on
humans. In addition, the project must have independent funding so
that it does not divert funds from ongoing basic studies, particularly
those trying to understand the function of genes in all organisms,
because it is ultimately such research that will make the information
on the human genome interpretable.
Finally, a concerted sequencing effort will benefit a wide range of
biological investigations. By pushing the development of sequencing
technology and establishing sequencing centers, inexpensive sequenc-
ing will become available to anyone who has a legitimate need for it.
In this way the envisioned project will free individual laboratories
from the routine and currently labor-intensive effort of sequencing
their few genes of interest, a necessary prelude to studies of gene
function and regulated expression. It is extremely inefficient for each
laboratory to set up the facilities needed to sequence the 100,000 to
1 million nucleotides that it finds of interest. Rather, the recommended
project grows out of the recognition that elucidating nucleotide
sequences (as distinct from sequence analysis) is ideally an exercise
of production, not of research.
Accumulating large amounts of DNA sequence data will have an
impact on the biological community in other ways as well. The
information contained within the genome sequence will allow full
investigation into the nature and extent of polymorphism, or diversity
(see Chapter 4), in the genes in the human population. Once genes
with widespread diversity (such as the major histocompatibility anti-
gens and T-cell receptor genes) are identified, comparative sequencing
of a single gene or gene family in many individuals will naturally
follow. Finally, the availability of structural information on a variety
of genes will stimulate efforts to correlate protein coding domains, or
exons, with protein folding domains. It has been proposed that the
segments of proteins encoded by individual exons arose during
evolution as small protein units capable of independent folding and
that they have assembled into multifunctional proteins as independent
domains (Gilbert, 1978, 1985). By studying these correlations, one
may learn much about the rules that govern the secondary and tertiary
structure of proteins. Such spin-offs will be of great value to the
biological community and are meant to augment its activities, not to
detract from them.
CURRENT TECHNOLOGY IN DNA SEQUENCING:
CHEMICAL AND ENZYMATIC METHODS
Any project to sequence a large genome with many repeated
sequences would not start with short, randomly selected genome
fragments, even though this is the easiest way to obtain a large amount
of sequence information quickly. Most of the sequences obtained in
this way would be short (perhaps 200 to 600 nucleotides), and millions
of gaps would be left between them. Because most genes in humans
extend for many thousands of nucleotides (Table 2-1), little information
of biological value can be obtained from a collection of such short
sequences. For this reason, sequencing would normally begin with a
large cloned segment of DNA that would be sequenced completely.
Such a DNA segment must first be subcloned into smaller, more
manageable fragments. This can be done by one of three methods:
· Generate a detailed restriction map, and determine from the map
the identity of each subclone and its relation to the whole.
· Beginning at one end of the large segment, generate a series of
successively smaller DNA fragments by a limited removal of nucleo-
tides from the end with exonucleases (enzymes that hydrolyze the
phosphodiester bonds that join nucleotides together starting at a chain
end); clone the remaining DNA to produce a series of clones of known
origin.
· Generate a totally random series of overlapping subclones, whose
relationship to one another is revealed only after their sequencing.
Large sequencing projects often mix all three strategies. One
sometimes begins by randomly sequencing fragments and follows with
directed sequencing of specific subclones as the gaps are located. All
sequencing strategies require some redundancy in the form of over-
lapping sequences in order to merge the results of several determi-
nations from different subclones and to provide a check on the
accuracy of the sequence, which requires the sequencing of both
DNA strands as a cross-check on systematic errors. The subcloning
method will determine to a large degree the amount of redundancy in
the data. Although time-consuming during the subcloning process,
the first and second subcloning methods ultimately require that any
single segment be sequenced only about three times. The third method,
because one is sequencing subclones at random, generally requires
that each segment be sequenced approximately 10 times; however,
methods are available to specifically select missing clones, after a
three-fold coverage, which reduce the amount of redundant sequencing
(Sanger et al., 1982).
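The effect of redundancy on workload can be sketched as follows. The segment length of 40,000 nucleotides is a hypothetical example, the read length of 400 lies within the 200 to 600 range mentioned above, and the estimate of the fraction left uncovered uses a Poisson approximation for randomly placed subclones, which is an assumption of this sketch rather than a statement from the text.

```python
import math


def reads_required(segment_length, read_length, redundancy):
    """Number of reads needed to sequence every position `redundancy` times on average."""
    return math.ceil(segment_length * redundancy / read_length)


def expected_uncovered_fraction(redundancy):
    """Fraction of positions expected to be missed by randomly placed reads.

    Poisson approximation (assumption): the chance that a position is hit
    zero times at an average coverage c is exp(-c).
    """
    return math.exp(-redundancy)


SEGMENT_NT = 40_000   # hypothetical cloned segment
READ_NT = 400         # hypothetical read length per gel

for label, redundancy in (("directed, about 3-fold", 3), ("random, about 10-fold", 10)):
    reads = reads_required(SEGMENT_NT, READ_NT, redundancy)
    missed = expected_uncovered_fraction(redundancy)
    print(f"{label}: {reads} reads; random placement would miss {missed:.4%}")
```

Under this approximation, random subclones at three-fold coverage would still leave about 5 percent of the segment unsequenced, which is consistent with the practice of selecting the missing clones directly at that point rather than continuing at random.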
The ability to sequence large stretches of DNA became a reality in
the middle to late 1970s with the independent development of two
techniques. One of these, developed by Sanger and his colleagues at
the Medical Research Council in Cambridge, England, is a method
called enzymatic sequencing (Sanger et al., 1977). The unknown
sequence is subcloned into a single-stranded DNA virus, and DNA
synthesis is initiated from a primer sequence adjacent to the unknown
sequence. This method utilizes the principle that when appropriately
designed chain-terminating analogs of the four DNA nucleotides (A,
G, C, and T) are incorporated into DNA by DNA polymerase,
synthesis of the growing DNA chain is terminated. For example, if
the synthesis of DNA molecules begins at a fixed point on a template
in the presence of a low concentration of the A analog, the analog
will infrequently be incorporated instead of the normal A nucleotide
at any one position. However, when incorporation occurs, the syn-
thesis of the chain stops. Thus a nested set of DNA fragments that
terminate at every A nucleotide in the unknown sequence is generated.
By correlating the length of the terminated chains with the identity
of the base analog that was present in the reaction, one can determine
the order of the nested DNA fragments and, hence, the corresponding
nucleotide sequences (Figure 5-1). At present, this method dominates
DNA sequencing applications primarily because once the subclones
are generated the procedure involves only a few simple steps.
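The logic of reading a sequence from nested fragments can be illustrated with a short simulation. It is deliberately idealized: every occurrence of a base is assumed to yield exactly one terminated chain, and chain lengths are counted from the first nucleotide added after the primer. The function names are hypothetical.

```python
def termination_lengths(new_strand, base):
    """Chain lengths produced when the analog of `base` stops synthesis.

    Idealized: every occurrence of `base` yields one fragment, and lengths
    are counted from the first nucleotide added after the primer.
    """
    return [i + 1 for i, nt in enumerate(new_strand) if nt == base]


def read_ladder(lanes):
    """Read the sequence from four lanes by ordering bands from short to long."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


new_strand = "GCAGATACGC"  # the 10-nucleotide example strand from Figure 5-1
lanes = {base: termination_lengths(new_strand, base) for base in "AGTC"}
print(lanes)
print(read_ladder(lanes))
```

Each lane holds only the chain lengths for its own analog, yet ordering all bands by length across the four lanes recovers the full sequence.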
The second technique, which is referred to as chemical sequencing,
was developed by Maxam and Gilbert at Harvard University (Maxam
and Gilbert, 1977). It uses chemicals that break the DNA chain at
specific nucleotides. The DNA molecule is labeled at one end with a
radioactive tag. It is then cleaved with each chemical separately in
such a way as to generate breaks infrequently at any given nucleotide.
As in the enzymatic sequencing technique, the DNA fragments are
separated according to size, and the sizes are correlated with the
nucleotide that is cleaved (Figure 5-2). This method is generally more
time-consuming than the enzymatic sequencing method, but it often
produces fewer ambiguities in the interpretation of the data.
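Reading a chemical sequencing gel differs in one respect: with the four reactions listed in Figure 5-2 (A, G, C and T together, and C alone), a T is inferred from a band that appears in the C and T lane but not in the C lane. The sketch below is an idealized illustration with hypothetical function names; it records each band by the position of the cleaved nucleotide and ignores the chemistry itself.

```python
def cleavage_positions(strand, targets):
    """1-based positions at which a reaction specific for `targets` cuts the strand."""
    return {i + 1 for i, nt in enumerate(strand) if nt in targets}


def read_chemical_lanes(lane_a, lane_g, lane_ct, lane_c):
    """Assign a base to every band position, inferring T by subtraction."""
    calls = {}
    for pos in lane_a:
        calls[pos] = "A"
    for pos in lane_g:
        calls[pos] = "G"
    for pos in lane_ct:
        # A band in the C and T lane is a C only if the C lane shows it too.
        calls[pos] = "C" if pos in lane_c else "T"
    return "".join(calls[pos] for pos in sorted(calls))


strand = "GCAGATACGC"  # the end-labeled example strand from Figure 5-2
lane_a = cleavage_positions(strand, "A")
lane_g = cleavage_positions(strand, "G")
lane_ct = cleavage_positions(strand, "CT")
lane_c = cleavage_positions(strand, "C")
print(read_chemical_lanes(lane_a, lane_g, lane_ct, lane_c))
```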
Both methods generate mixtures of specific DNA fragments that
are separated by polyacrylamide gel electrophoresis, a technique that
can resolve fragments that differ in size by a single nucleotide. When
radioactively labeled DNA fragments are used, they are detected by
exposing the gel to an x-ray film. That film, which has imprinted upon
it a ladder of bands distributed over four parallel lanes representing
the four nucleotides of DNA, must be interpreted or read by an
experienced person and the data must be entered into a computer.
Machines have been developed to expedite this process through the
use of a stylus attached to a computer that points to each band on
the x-ray film. The computer then registers the position and translates
it into one of the four nucleotides of DNA. Attempts are now under
way to develop x-ray film scanners capable of reading such films
directly. In addition, automatic methods that use fluorescent labels
have been introduced. It is critical that other strategies for reducing
the human labor and error involved in this process be developed if
the human genome is to be sequenced in a timely manner.
THE DIFFICULTY OF DETERMINING THE SEQUENCE OF THE
HUMAN GENOME WITH CURRENT TECHNOLOGY
What constrains efforts to embark immediately on a large-scale
human genome sequencing project? The cost and inefficiency of
current DNA sequencing technologies are too great to make it feasible
to contemplate determining the 3 billion nucleotides of the DNA
sequence in the human genome within a reasonable time. The largest
contiguous segment of human DNA determined to date is the 150,000
nucleotides encoding the human growth-hormone gene. This is 0.005
percent of the total genome.

[Figure 5-1: diagram not reproduced. Steps shown: denature the DNA fragment to
separate the strands; anneal a short end-labeled oligonucleotide to one strand;
carry out DNA synthesis primed by the oligonucleotide in the presence of a small
amount of the indicated chain-terminating dideoxyribonucleoside triphosphate
(A, G, T, or C), so that all of the labeled strands in each tube end with the
corresponding nucleotide; parallel gel electrophoresis and autoradiography then
separate the labeled fragments of differing length. Sequence of end-labeled
strand: (5')GCAGATACGC(3').]
FIGURE 5-1 DNA sequencing by the enzymatic method. The key to this method is the
use of a dideoxyribonucleoside triphosphate that blocks the addition of the next nucleotide
after its incorporation into the growing chain. The primed in vitro synthesis of DNA
molecules in the presence of a minor proportion of a single type of such a chain-terminating
nucleotide generates a family of DNA fragments, each of which ends in the particular
chain-terminating nucleotide (see also Figure 5-3). Here a radioactive DNA primer is used to
initiate the synthesis of such DNA fragments, and four different synthesis reactions, each
with a different chain-terminating nucleotide, are analyzed by electrophoresis in four
parallel lanes of a gel. The DNA sequence is then determined from the electrophoresis
pattern.

[Figure 5-2: diagram not reproduced. Steps shown: label the ends of the DNA
fragment; cut with a restriction enzyme and separate the pieces; isolate the
end-labeled strand and discard the other; expose each sample to a different
chemical reaction that breaks the DNA after the indicated nucleotide (A, G,
C & T, or C), letting the reaction proceed long enough to produce an average of
one break per strand, so that the random breaks generate fragments representing
all positions of the indicated nucleotide; parallel gel electrophoresis and
autoradiography. Sequence of end-labeled strand: (5')GCAGATACGC(3').]
FIGURE 5-2 DNA sequencing by the chemical method. A DNA fragment that is radioactive
only at its 5' end is the material to be sequenced. A different chemical reaction in each of
four samples breaks the DNA fragment only (or mainly) at A, G, both C and T, and C
residues, respectively. The labeled DNA subfragments created by these reactions all have
the label at one end and the cleavage point at the other. Electrophoresis of each sample
through a polyacrylamide gel then allows each DNA subfragment to be separated according
to its size. After autoradiography of the gel, the four sets of labeled subfragments (one set
per gel lane) together yield a radioactive band for each nucleotide in the original DNA
fragment. Adapted, with permission, from Darnell et al. (1986).

Some other numbers are informative in this context. Currently, a
skilled laboratory worker in a well-equipped facility can produce from
about 50,000 nucleotides of finished DNA sequence per year (B.
Barrell, Medical Research Council, Cambridge, personal communi-
cation, 1987) to about 100,000 nucleotides of finished sequence per
year (E. Chen, Genentech, personal communication, 1987). The cost
of this sequence ranges from $1 to $2 per nucleotide, an estimate
based on the assumption that one worker costs a laboratory approx-
imately $100,000 per year, including salary, supplies, and overhead.
Even at the upper estimate of 100,000 nucleotides sequenced per
person per year (which has not yet been achieved in a sustained
effort), determining the human genome sequence would require 30,000
person-years of work at a cost of $3 billion. Since the sequencing of
the genomes of other species is essential for an understanding of the
human genome, the actual amount of sequencing would approach 6
billion nucleotides, at a current cost of $6 billion. This high cost of
sequencing reflects the fact that the endeavor is still highly labor
intensive; the estimate does not allow for unforeseeable technical
problems or technical improvements.
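The estimate can be reproduced directly from the quoted figures. The sketch below uses only numbers stated in the text; the function and variable names are illustrative.

```python
GENOME_SIZE_NT = 3_000_000_000    # nucleotides in the human genome
COST_PER_WORKER_YEAR = 100_000    # dollars: salary, supplies, and overhead


def sequencing_estimate(nucleotides, nt_per_worker_year,
                        cost_per_worker_year=COST_PER_WORKER_YEAR):
    """Return (person-years, total cost in dollars, cost per nucleotide)."""
    person_years = nucleotides / nt_per_worker_year
    total_cost = person_years * cost_per_worker_year
    cost_per_nt = cost_per_worker_year / nt_per_worker_year
    return person_years, total_cost, cost_per_nt


# Upper productivity estimate: 100,000 finished nucleotides per worker per year.
human_only = sequencing_estimate(GENOME_SIZE_NT, 100_000)
# Human genome plus a comparable amount of sequence from other species.
with_other_species = sequencing_estimate(2 * GENOME_SIZE_NT, 100_000)
# Lower productivity estimate: 50,000 nucleotides per worker per year.
slower_rate = sequencing_estimate(GENOME_SIZE_NT, 50_000)

for label, (years, cost, per_nt) in (
    ("human genome, 100,000 nt/year", human_only),
    ("with other species, 100,000 nt/year", with_other_species),
    ("human genome, 50,000 nt/year", slower_rate),
):
    print(f"{label}: {years:,.0f} person-years, ${cost:,.0f}, ${per_nt:.2f} per nucleotide")
```

The $1 to $2 range per nucleotide follows from dividing the same annual cost by the two productivity figures.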
Most of the time spent in a sequencing project is occupied with
obtaining the original DNA clones containing the gene of interest and
subcloning and handling the DNA prior to performing the actual
sequencing reactions, steps that have not yet been streamlined or
automated. In addition, the entire process from subcloning to inter-
preting gels requires careful supervision of personnel; a ratio of no
more than three technicians for each doctoral-level scientist is generally
accepted as optimal.
The rate of DNA sequence determination is also limited by the fact
that all techniques currently use polyacrylamide gels that resolve no
more than 250 to 500 nucleotides at a time. At this level of resolution,
hundreds of millions of individual sequence determinations would be
required to complete the human genome, given the estimate that each
sequence will need to be determined three times over. By increasing
the length of the average contiguous sequence that can be determined
on a single gel, considerable time and effort would be saved.
THE ACCURACY OF DNA SEQUENCING
Unless the human genome sequence is determined accurately, it
will be of little use. Errors in DNA sequence determination occur at
several levels. The most common is caused by insufficient resolution
of adjacent DNA fragments in gel electrophoresis because of compres-
sion in their migration (neighboring bands merge into one). These
effects are especially prevalent in regions containing large numbers
of G and C nucleotides. Aberrations in the sequencing reactions can
also occur in stretches of unusual sequence. These problems are
compounded by human error, such as when researchers attempt to
guess the sequence in ambiguous regions and when sequence gels are
read past the point of accurate resolution. Another common source
of human error occurs in transcribing the data into the computer. One
potential source of error that will become more common as large-
scale sequencing is attempted resides in the presence of short, highly
repetitive sequences in human DNA, which can be confused when
they occur in multiple clones. Furthermore, the cloning process itself
may introduce a few errors.
The accuracy of DNA sequencing has not yet been firmly estab-
lished. A careful and experienced laboratory probably achieves an
accuracy of about one error in every 5,000 nucleotides (0.02 percent
error rate) in the finished DNA sequence, but this degree of precision
requires careful attention to virtually every nucleotide in the sequence
(E. Chen, Genentech, personal communication, 1987). Such attention
inevitably slows the sequencing rate. It will be difficult to hold the
error rate to this level in a large-scale nucleotide sequencing project.
Although a few investigators have achieved a 0.02 percent error
rate, most careful workers can only achieve an error rate of 0.1
percent. It is important to consider the impact of this error rate in the
sequence of the human genome. Although it might seem large, the
committee believes it is tolerable. The estimated level of DNA
sequence heterozygosity among individuals is about 1.0 percent. The
errors in the DNA sequence will be randomly placed, and hence most
will occur outside coding sequences. Those errors in coding regions
of genes that are either insertions or deletions of nucleotides (as most
sequencing errors are) will have profound effects in that they will
cause the reading frame to shift. This could lead to a failure to identify
an exon as coding for part of a protein. If the average coding region
(exon) is approximately 200 nucleotides long, then at an error rate of
0.1 percent one can anticipate that an error will occur on average in
one of every five exons. The detection of some of these errors in exons may be
facilitated by computer programs that predict coding regions on the
basis of the use of particular sets of three nucleotides (codons) that
code for each amino acid in humans. However, the errors will
eventually be identified with certainty only by those interested in that
region of the genome. This analysis puts into perspective the need to
aim for approximately 0.1 percent as the maximum acceptable error
rate in the initial sequence produced.
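The relation between error rate and exon length can be made explicit. The sketch below uses exact fractions to avoid rounding; it treats the expected number of errors per exon as a stand-in for the chance that an exon contains an error, which is a reasonable approximation only while that number is small. The names are illustrative.

```python
from fractions import Fraction

EXON_LENGTH_NT = 200  # approximate average exon length given in the text


def exons_per_error(error_rate, exon_length=EXON_LENGTH_NT):
    """Average number of exons sequenced for each expected error."""
    expected_errors_per_exon = error_rate * exon_length
    return 1 / expected_errors_per_exon


typical_rate = Fraction(1, 1000)   # 0.1 percent, the proposed maximum
best_rate = Fraction(1, 5000)      # 0.02 percent, achieved by a few investigators

print(f"0.1 percent: one error in every {exons_per_error(typical_rate)} exons")
print(f"0.02 percent: one error in every {exons_per_error(best_rate)} exons")
print(f"errors in 3 billion nucleotides at 0.1 percent: {int(typical_rate * 3_000_000_000):,}")
```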
EMERGING AND FUTURE TECHNOLOGY
The obvious mismatch between the efficiency of current DNA
sequencing technology and the genetic complexity of genomes in even
the simplest cells has given rise to several research projects aimed at
developing more efficient sequencing methods. We seem to be on the
threshold of a new generation of sequencing methods that should
make large-scale sequencing projects more practical. Given the emer-
gent state of these technologies, however, it is not surprising that
expert opinion is widely divided on several key questions.
· Which of several next-generation strategies will prove most
effective?
· Will the best next-generation strategy represent a quantum jump
in sequencing capability or an incremental improvement that largely
decreases the tedium of sequencing and shifts costs from skilled labor
to instruments?
· In looking ahead to the need for a series of cumulative 5- to 10-
fold increases in sequencing capability, is the future likely to lie in
scaling up automated techniques that are already at the prototype
stage, or does it lie in revolutionary new methods?
These questions will remain unanswered until future large-scale
projects have been completed. Particularly crucial will be a determi-
nation of the steps in a sequencing project that become rate-limiting
as the goals of sequencing are increased. No foreseeable technology
will be able to automate DNA sequencing comprehensively. DNA
sequencing involves a complex series of experimental steps with very
different prospects for automation. For this reason, a given incremental
increase in efficiency at any one step will rarely result in a comparable
increase in overall efficiency.
Several current research projects aimed at automating various steps
of sequencing are at different stages of development, and they illustrate
the range of approaches being tested. They not only call for different
technical strategies, but to various degrees they also reflect different
perceptions of the steps in DNA sequencing most in need of greater
efficiency. Several groups [California Institute of Technology, DuPont,
and the European Molecular Biology Laboratory (EMBL)] are adapt-
ing the basic enzymatic sequencing methodology to a more automated
operation at the level of reading the sequencing gels. Others are
developing automated film readers, which are less expensive and not
limited by the slow rate of electrophoresis (Elder et al., 1985).
Radioactive labeling of the fragments has been replaced by the use
of fluorescent tags, which can be detected in the gel by characteristic
light emissions evoked by laser illumination. The Cal Tech and DuPont
methods allow more efficient use of the polyacrylamide gel, since the
four reaction mixtures representing the four DNA nucleotide precur-
sors can be labeled with different tags and then mixed together before
being fractionated on a single gel lane (Figure 5-3) (Smith et al., 1986).
Alternatively, the EMBL method uses a single fluorescent tag for all
four nucleotides and runs four gel lanes, which are monitored simul-
taneously with radioactive tags. In both cases, fragments are detected
as they migrate past the point of laser illumination at the bottom of
the gel, which eliminates the need to expose, develop, and interpret
x-ray film. In each case, multiple sequences can be simultaneously
analyzed on a single gel. The immediate goal of these projects is to
develop a commercial instrument capable of sequencing 15,000 nu-
cleotides per day, starting from the appropriate reaction mixtures.
A second approach is being developed in Japan, with assistance
from the government and an industrial consortium that includes the
Hitachi, Fuji, and Seiko corporations. This attempt to improve DNA
sequencing emphasizes robotics and automated processing of samples.
The automated steps begin with subcloned DNA fragments and carry
them through the sequencing reactions. One such prototype performs
more than 30 steps in the complex set of reactions that are required
in the chemical sequencing strategy. Each step is controlled by a
microcomputer. The maximum daily output of a single instrument is
a sequence of 5,000 nucleotides. Current work emphasizes the orga-
nization of the overall sequencing experiment into a production line.
The goal of this approach is to establish an automated facility able to
sequence 1 million nucleotides per day at a cost of approximately
$0.20 per nucleotide (Wade, 1987). This cost does not include the
substantial cost of preparing the DNA fragments to be sequenced.
The production-line approach would feature both automated and
manual steps, with those operations most amenable to mechanization
automated.
A third approach, called multiplex sequencing, depends less on
automation and more on increasing the amount of sequence data that
can be obtained from one set of chemical sequencing reactions
fractionated on a single sequencing gel. Each sample analyzed contains
a mixture of 40 or more DNA samples, each of which has been marked
with a unique short nucleotide sequence (an oligonucleotide sequence).
After the normal chemical sequencing reactions have been completed,
the unlabeled samples are fractionated on a standard sequencing
gel, and the separated DNA fragments are transferred to a membrane
on which the spatial patterns of the fragments formed during electro-
phoresis are preserved. The sequencing ladder for each individual
sample can then be successively visualized by a series of DNA-DNA
hybridization assays, each using a different radioactive oligonucleotide
as a DNA probe that is specific for the reference end of one particular
subclone in the mixture. In principle, if a dozen sets of 40 mixed
samples each are subjected to this analysis on a single gel and each
can generate 250 nucleotides of DNA sequence, then a sequence of
120,000 nucleotides can be derived from one set of chemical sequencing
reactions by using sequential hybridization with the membrane pro-
duced by this method (G. Church, Harvard University, personal
communication, 1987).
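The arithmetic behind the 120,000-nucleotide figure can be written out as a minimal sketch. The function and parameter names are illustrative assumptions, not part of the method as described; the three input values are the ones quoted in the paragraph above.

```python
def multiplex_yield(sets_per_gel, samples_per_set, read_length):
    """Nucleotides recoverable from one gel by sequential hybridization.

    Illustrative helper only: it multiplies the number of mixed sets run on
    the gel by the number of tagged samples in each set and by the number of
    nucleotides read from each sample's ladder.
    """
    return sets_per_gel * samples_per_set * read_length


# Values quoted in the text: a dozen sets, 40 samples per set,
# 250 nucleotides of sequence per sample.
total = multiplex_yield(sets_per_gel=12, samples_per_set=40, read_length=250)
print(total)  # 120000
```

The sketch counts raw nucleotides read and makes no allowance for overlap between subclones or for ladders that fail to give readable sequence.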
All these methods utilize the chemistry or enzymology of current
sequencing procedures. An intriguing question is whether fundamen-
tally more powerful technologies are likely to arise in the foreseeable
future. Little research is being directed toward this problem. The
most obvious possibilities for future sequencing techniques would be
the use of sensitive physical methods such as mass or fluorescence
spectroscopy, magnetic resonance detection, and electron microscopy.
These might be used in combination with each other or with more
conventional biochemical fractionation methods. The disparity be-
tween the capabilities of the current technology and the magnitude of
the work required to sequence the human genome suggests that
fundamentally different technologies deserve serious exploration.
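The size of this disparity can be illustrated with a rough calculation. The sketch below assumes a genome of about 3 billion nucleotide pairs (implied by the earlier statement that 1.0 percent corresponds to approximately 30 million pairs), a single pass with no redundant coverage, and operation 365 days per year. The daily rates and the $0.20 figure are the ones quoted above; the function names and dictionary labels are illustrative.

```python
# Assumption: ~3 billion nucleotide pairs (1.0 percent is about 30 million).
GENOME_SIZE_NT = 3_000_000_000

# Daily outputs quoted in the text for each approach.
DAILY_RATES = {
    "automated chemical-sequencing prototype": 5_000,
    "fluorescence-based instrument (goal)": 15_000,
    "production-line facility (goal)": 1_000_000,
}


def years_to_sequence(genome_nt, nt_per_day, days_per_year=365):
    """Years needed for one pass over the genome at a constant daily rate."""
    return genome_nt / (nt_per_day * days_per_year)


def single_pass_cost(genome_nt, dollars_per_nt):
    """Sequencing cost for one pass, excluding preparation of DNA fragments."""
    return genome_nt * dollars_per_nt


for name, rate in DAILY_RATES.items():
    years = years_to_sequence(GENOME_SIZE_NT, rate)
    print(f"{name}: {years:,.1f} years")

print(f"cost at $0.20 per nucleotide: ${single_pass_cost(GENOME_SIZE_NT, 0.20):,.0f}")
```

Under these assumptions a single instrument at 5,000 or 15,000 nucleotides per day would need on the order of 1,600 or 550 years, respectively, while a facility reaching 1 million nucleotides per day would need roughly 8 years. Any redundancy required for accuracy would lengthen these estimates.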
OPTIONS AND RECOMMENDATIONS
The committee considered three options regarding the initiation of
human genome DNA sequencing. The first is to begin a large-scale
initiative immediately in one or a few large centers devoted to DNA
sequencing with current technology. This option might be expected
to include the establishment of an independent institute whose goal
would be the mapping and sequencing of the genome as quickly as
possible. The second option is to make a strong commitment to
develop better DNA cloning, sequencing, and data analysis technol-
ogies by supporting smaller scale pilot projects (e.g., sequencing 1
million nucleotides in 1 year), while allowing investigators to gain
practical experience with larger scale sequencing. These improvements
in current technology should be designed to reduce the cost and
increase the efficiency of sequencing techniques. The third option is
to make no special effort to sequence the human genome, but to
depend on the normal process of science to generate the sequence,
knowing that the complete sequence would not be available within
the next 20 years, if ever.
As explained in Chapter 3, knowledge of the sequence of the human
genome and the genomes of the necessary reference organisms will
provide a crucial medical and basic research tool that will be used by
the biological and biomedical research community long into the future.
The committee concluded that without a special effort to achieve this
goal, the desired DNA sequences are not likely to be obtained in the
time optimal for future medical and scientific advances, if ever. On
the basis of this argument, the committee rejected option 3. In deciding
between options 1 and 2, the committee concluded that the high cost
and slow rate of sequencing with current technology precluded the
initiation of a large-scale sequencing effort at the present time.
Therefore, the committee made the following recommendations.
The Project Should Begin with Two Kinds of Studies
Initially, improvements in existing technology and the development
of new technology directed toward the long-range goal of a complete
human genome sequence should be vigorously encouraged. This effort
would include applications of automation and robotics at all steps in
cloning and sequencing. It is particularly important to automate the
steps of DNA cloning. In this context it is useful to think in terms of
trying to achieve 5- to 10-fold incremental improvements in the cost,
efficiency, or human labor required for these tasks. Several such
incremental improvements are needed to make the sequencing of
many important genomes practicable. A reasonable baseline sequence
from which to measure initial progress is 150,000 nucleotides, the size
of the largest human sequencing project to date.
These technological projects will assist in identifying the rate-
limiting step in large-scale sequencing, which at present is believed
to be the subcloning step, the one step that has not been automated.
However, further improvements in all stages of the procedure from
subcloning to the interpretation of sequence data will be required.
The awarding of competitive grants to individuals and to larger
groups organized into cooperative, multidisciplinary centers is viewed
by the committee as the most effective way to achieve these goals.
A second type of pilot study that should be initiated immediately
would define as its goal sequencing approximately 1 million nucleotides
of continuous sequence (approximately 5 to 10 times what has been
achieved to date). Such projects will be important in providing an
opportunity for the direct implementation and testing of improvements
in existing technology as they arise and the provision of a practical
impetus for the development of new technology. They will also reveal
where problems in analysis are likely to arise. For example, will
repetitive sequences complicate the assembly of a unique, contiguous
sequence? Are some sequences unclonable? Will new genes be
identified correctly?
As in the past, human gene sequencing by individual research
groups interested in specific genes should be strongly supported by
standard research grants. This directed sequencing will provide valuable
information about genes that have been identified as important
in biology and medicine and should also lead to advances in sequencing
technology. However, as the physical map develops and as the cost
and efficiency of DNA sequencing improve, ever-larger sequencing
efforts will evolve, taken on by groups interested primarily in the
sequence of the genome as a goal in and of itself.
To Derive the Full Benefit of the Human Genome Sequence Will
Require Many New Tools, Including a Comprehensive Database of
DNA Sequences from Other Organisms
Comparative sequence analysis has proven a powerful technique
for distinguishing those elements of a gene sequence that are highly
constrained functionally from those that are not. As explained pre-
viously, such analysis can provide insights into conserved regulatory
and structural sequences. The availability of extensive sequence data
from other organisms will also maximize the likelihood that the
counterparts of important human genes will be identified in other
organisms where their functions will generally be easier to study. The
corollary will also hold: Genes that have been identified as important
to other organisms will be found rapidly in the human DNA sequence.
Therefore, a project of this type must not be restricted to determining
the human genome sequence, but should include genome sequence
determination of selected other species as well.
DNA Sequence Determinations Require Quality Control
A mechanism of quality control must be developed to monitor the
groups that are contributing extensive DNA sequence information.
One might consider an external group that functions as the Bureau of
Standards does to provide independent quality control. Quality control
is critically important to the initiative, and it poses unique technical
challenges. The optimum methods of checking DNA sequences are
likely to differ from the optimum methods of collecting data; indeed,
the sequence-checking method should ideally be experimentally in-
dependent of the sequencing method. For example, the presence of
the many restriction enzyme cleavage sites predicted from the DNA
sequence could be tested by cleavage of the DNA followed by gel
electrophoresis.
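A minimal sketch of such a check follows, assuming a linear DNA molecule, a complete digest, and a palindromic recognition site so that only one strand needs to be searched. The function names, the toy sequence, and the 5 percent sizing tolerance are illustrative assumptions. The EcoRI site (GAATTC, cut after the first G) is used as the example enzyme.

```python
def find_sites(sequence, recognition_site):
    """Return 0-based start positions of every occurrence of the site."""
    sequence = sequence.upper()
    site = recognition_site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def predicted_fragments(sequence, recognition_site, cut_offset):
    """Fragment sizes predicted for a complete digest of a linear molecule.

    cut_offset is the number of bases from the start of the site to the cut.
    """
    cuts = [p + cut_offset for p in find_sites(sequence, recognition_site)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


def digest_agrees(predicted, observed, tolerance=0.05):
    """True if the sorted fragment sizes match within a fractional tolerance."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted), sorted(observed))
    return all(abs(p - o) <= tolerance * p for p, o in pairs)


# Toy sequence (hypothetical) containing two EcoRI sites.
toy = "AAA" + "GAATTC" + "GGGG" + "GAATTC" + "TT"
sizes = predicted_fragments(toy, "GAATTC", cut_offset=1)
print(sizes)  # [4, 10, 7]
print(digest_agrees(sizes, [7, 4, 10]))  # True
print(digest_agrees(sizes, [14, 7]))     # False: one predicted site is missing
```

A mismatch in the number or sizes of fragments would flag a region of the reported sequence for rechecking. Because the gel measurement does not depend on the sequencing chemistry, the check is experimentally independent in the sense described above.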
To succeed, this project will require careful oversight and coordination
among the groups involved in mapping, sequencing, and collecting
and analyzing data, as well as a system for distributing samples.
REFERENCES
Alberts, B., D. Bray, J. Lewis, M. Raff, K. Roberts, and J. D. Watson. 1989. Molecular
Biology of the Cell, 2nd ed. Garland, New York. In press.
Botstein, D., R. L. White, M. Skolnick, and R. W. Davis. 1980. Construction of a genetic
linkage map in man using restriction fragment length polymorphisms. Am. J. Hum.
Genet. 32:314-331.
Darnell, J., H. Lodish, and D. Baltimore. 1986. Molecular Cell Biology. Scientific American
Books, New York. 1160 pp.
Doolittle, R. F., D. F. Feng, M. S. Johnson, and M. A. McClure. 1986. Relationships of
human protein sequences to those of other organisms. Cold Spring Harbor Symp.
Quant. Biol. 51:447-455.
Elder, J. K., D. K. Green, and E. M. Southern. 1986. Automatic reading of DNA sequencing
gel autoradiographs using a large format digital scanner. Nucleic Acids Res. 14:417-424.
Gilbert, W. 1978. Why genes in pieces? Nature 271:501.
Gilbert, W. 1985. Genes-in-pieces revisited. Science 228:823-824.
Gusella, J. F., R. E. Tanzi, M. A. Anderson, W. Hobbs, K. Gibbons, R. Raschtchian, T.
C. Gilliam, M. R. Wallace, N. S. Wexler, and P. M. Conneally. 1984. DNA markers for
nervous system diseases. Science 225:1320-1326.
Maxam, A. M., and W. Gilbert. 1977. A new method for sequencing DNA. Proc. Natl.
Acad. Sci. U.S.A. 74:560-564.
Sanger, F., S. Nicklen, and A. R. Coulson. 1977. DNA sequencing with chain-terminating
inhibitors. Proc. Natl. Acad. Sci. U.S.A. 74:5463-5467.
Sanger, F., A. R. Coulson, G. F. Hong, D. F. Hill, and G. B. Petersen. 1982. Nucleotide
sequence of bacteriophage λ DNA. J. Mol. Biol. 162:729-773.
Smith, L. M., J. Z. Sanders, R. J. Kaiser, P. Hughes, C. Dodd, C. R. Connell, C. Heiner,
S. B. H. Kent, and L. E. Hood. 1986. Fluorescence detection in automated DNA
sequence analysis. Nature 321:674-679.
Staden, R., and A. D. McLachlan. 1982. Codon preference and its use in identifying protein
coding regions in long DNA sequences. Nucleic Acids Res. 10:141-156.
Wada, A. 1987. Automated high-speed DNA sequencing. Nature 325:771-772.
Wu, R., and E. Taylor. 1971. Nucleotide sequence analysis of DNA. II. Complete nucleotide
sequence of the cohesive ends of bacteriophage λ DNA. J. Mol. Biol. 57:491-511.