6
The Collection, Analysis, and
Distribution of Information and
Materials
The mapping and sequencing effort will generate more data than
any other single project in the history of biology. For example, just
to record the 3 billion nucleotides that make up the haploid human
genome will require nearly 1 million pages of printed text. Variation
between the two chromosomes of each individual (heterozygosity)
and among the many human beings (polymorphism) further increase
the body of information to be stored, collated, and analyzed. In the
conception and planning of any human genome project, close attention
must be paid to how the data and experimental material are collected
and distributed.
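The page estimate above follows from simple arithmetic. The sketch below assumes roughly 3,000 printed letters per page, a figure chosen here for illustration; the text does not state a page density.

```python
import math

HAPLOID_GENOME_NUCLEOTIDES = 3_000_000_000  # figure given in the text
CHARS_PER_PAGE = 3_000  # assumption: one letter per nucleotide, about 3,000 letters per page


def pages_needed(nucleotides: int, chars_per_page: int = CHARS_PER_PAGE) -> int:
    """Return the number of printed pages needed to record a sequence."""
    return math.ceil(nucleotides / chars_per_page)


print(pages_needed(HAPLOID_GENOME_NUCLEOTIDES))
```

Under that assumed density the result is 1,000,000 pages, consistent with the "nearly 1 million pages" quoted above.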
The full set of information to be gained from mapping and sequencing
the human genome is of potentially greater usefulness than its
component parts. For example, although in principle one can use
computer searches to pick out coding sequences that are parts of
genes, finding the true beginning and end of a gene and all of its
coding and noncoding components may require reference to other
data, such as the similar nucleotide sequence from a related, but
evolutionarily separated, species such as the mouse. To extract the
maximum information from the human sequence, it will also be
necessary to search for amino acid homologies with the entire set of
all known proteins, regardless of their origin. In addition, extensive
searches for regions of similarity of short nucleotide sequences
between human genes and their mouse counterparts will be necessary
to detect regulatory DNA sequences and other conserved sequences
for which functions can then be sought. Finally, the correlation of
sequence data with the large amounts of information derived from
MAPPING AND SEQUENCING THE HUMAN GENOME
human genetic linkage and disease studies is needed to derive the
molecular basis for human phenotypes. As more DNA sequence
information is obtained, our sophistication in interpretation should
increase to the point at which a computer search will reveal a
fascinating wealth of correlative data concerning almost any new
DNA sequence obtained.
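The search for short regions of similarity between a human sequence and its mouse counterpart can be illustrated with a toy exact-match scan. This is only a sketch under simplifying assumptions: the function name, the word length k, and the two example sequences are invented for illustration, and real similarity searches must tolerate mismatches and gaps.

```python
from collections import defaultdict


def shared_words(human: str, mouse: str, k: int = 8) -> list[tuple[int, int, str]]:
    """Return (human_pos, mouse_pos, word) for every exact k-letter word
    present in both sequences. Positions are 0-based."""
    # Index every k-letter word of the mouse sequence by its start positions.
    mouse_index: dict[str, list[int]] = defaultdict(list)
    for j in range(len(mouse) - k + 1):
        mouse_index[mouse[j:j + k]].append(j)

    # Scan the human sequence and report each word also found in the mouse index.
    hits = []
    for i in range(len(human) - k + 1):
        word = human[i:i + k]
        for j in mouse_index.get(word, []):
            hits.append((i, j, word))
    return hits


# Invented toy sequences that share the word "TATAAAAG".
human_seq = "GGCCTATAAAAGGCTC"
mouse_seq = "ATTTATAAAAGCCA"
for hit in shared_words(human_seq, mouse_seq, k=8):
    print(hit)
```

Words conserved in both species are candidates for the regulatory or otherwise conserved sequences mentioned above; they would still need experimental follow-up.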
The human genome project will differ from traditional biological
research in its greater requirement for sharing materials among
laboratories. For example, many laboratories will contribute DNA
clones to an ordered DNA clone collection. These clones must be
centrally indexed. Free access to the collected clones will be necessary
for other laboratories engaged in mapping efforts and will help to
prevent a needless duplication of effort. Such clones will also provide
a source of DNA to be sequenced as well as many DNA probes for
researchers seeking human disease genes. Two different types of
centralized facilities will be needed: centers to collect and distribute
materials such as DNA clones and human cell lines and centers to
collect and distribute mapping and sequencing data.
The magnitude of the required data storage and distribution effort
can be understood by comparing the existing content of facilities that
collect and store mapping and sequence data with the anticipated
capacity required if the human and other complex genomes are
sequenced. For example, the DNA data bank maintained by GenBank
and the European Molecular Biology Laboratory (EMBL) contains
15 million nucleotides of sequence data from the entire biological
world (viruses, procaryotes, plants, and animals) and includes about
2 million nucleotides of human sequence. The human genome,
containing 3 billion nucleotide pairs, is 200 times as large as the sum of
these DNA sequences collected from all organisms. Moreover, only
a few hundred restriction fragment length polymorphisms (RFLPs)
have been mapped on the human genome, whereas the target of the
genome project is several thousand mapped RFLPs, an increase of
more than an order of magnitude. The efficient cataloging, manage-
ment, and distribution of mapping and sequencing data at levels from
one to two orders of magnitude greater than today's must be achieved
in pace with data acquisition and are essential for the success of the
project. Fortunately, several prototypic operations are already in
place. These include GenBank/EMBL, Mendelian Inheritance in Man,
Human Gene Mapping Library, and Centre d'Etude du Polymorphisme
Humain, each of which is briefly reviewed below to provide a
background for discussion of the much larger efforts that will be
needed in the future. There are also repositories of cell lines and
cultures, such as the American Type Culture Collection and the Cell
INFORMATION AND MATERIALS
Bank in Camden, New Jersey, that have had extensive experience in
handling and distributing biological materials. Although these opera-
tions are not reviewed here, they should be considered in the
development of any materials-handling center.
PRESENT INFORMATION-HANDLING ORGANIZATIONS
GenBank/EMBL
The GenBank/EMBL data bank stores and distributes DNA sequence
information. GenBank in the United States and the EMBL
data bank in the Federal Republic of Germany share the task of
recording, annotating, and distributing all the DNA sequence data
published, regardless of the species of origin. Each bank is responsible
for monitoring approximately half of the relevant published literature,
and once they have entered and annotated the files, they exchange
information so that each has a complete holding. The current holdings
are about 15 million nucleotides, and the rate of acquisition is currently
7 million nucleotides per year and increasing rapidly (H. Bilofsky,
personal communication, 1987). Both banks are finding it more and
more difficult to keep up to date. The backlogs in the entry of published
sequence data, a source of frustration within the user community,
have two main causes.
First, nonelectronic data entry (entry from printed DNA sequences)
still accounts for more than half of all data entered as a result of
policy and organization, not technology. Authors have yet to be
educated about the need to send data either electronically or in
magnetic form to the data bases, in part because coordination between
scientific journals and the data bases has, until very recently, been
nonexistent. The reentry of data from a printed copy of a sequence
into a data base is a slow, error-prone process, but in the absence of
pressure from journals to authors to provide all sequence data in
magnetic form, it has been absolutely necessary.
Second, GenBank/EMBL have not had sufficient support to keep
abreast of the gene sequence data being generated by present biomed-
ical research. However, the lessons that have been learned from their
experience should be invaluable in setting up and managing a new
facility dedicated to the collection of DNA sequence data, which will
be an essential component of a human genome project.
Mendelian Inheritance in Man
The Mendelian Inheritance in Man (MIM) project stores and
classifies information about human disease phenotypes. MIM is an
encyclopedia of gene loci based on human phenotypes, most of them
disease phenotypes. It has been maintained at The Johns Hopkins
University by Victor McKusick since the early 1960s and has been
computerized since 1964. Seven hard-copy editions, all generated
from the computer, have appeared between 1966 and 1986, and the
number of entries has grown during this time from 1,500 to more than
4,100.
An attempt is made to assign only one entry per genetic locus; i.e.,
various phenotypes produced by alleles at one and the same locus
(e.g., the beta-globin locus) are allowed only one entry. Inevitably,
however, more than one entry has been assigned to many allelic
disorders because of the incomplete status of our knowledge; in other
cases a disorder assigned one entry subsequently proved to be
produced by any one of two or more loci. Entries have also been
created in the catalog for loci for which no Mendelian variation has
yet been identified. Most of these are structural genes that have been
characterized and mapped by a combination of somatic cell and
molecular genetic methods.
Collaborative research in the management of this knowledge base
at the Lister Hill National Center for Biomedical Communications of
the National Library of Medicine has produced OMIM, an on-line
version of MIM that is being tested in the clinic and laboratory. OMIM
is designed to permit free movement between the text of MIM, gene
map information, and a molecular defects list.
Human Gene Mapping Library
The Human Gene Mapping Library (HGML) at Yale University
positions genes and DNA landmarks on chromosomes (publication of
Howard Hughes Medical Institute, 1986). The HGML consists of a
number of separate but interrelated data bases. One of them, the
"Map" data base, keeps track of the chromosomal positions of all
mapped genes (currently more than 1,200). This is a dynamic data
base: New genes are being entered at an accelerating rate, refinements
of previous assignments are continually made, and the relations
between gene map positions frequently change. The management of
this data base requires constant attention to data input, editorial
checking on the validity of the data, and data distribution. The data
are maintained with an advanced data-base management system that
is operated by a high-speed, large-volume computer. User-friendly
menus have been prepared to facilitate access to the data by the
inexperienced.
Other data bases within HGML include "Lit," which contains a
list of all pertinent literature citations; "RFLP," which contains data
on RFLPs; "Probe," which contains data on DNA probes used for
mapping; and "Source," which contains information regarding the
laboratories from which certain DNA probes or cell lines may be
obtained.
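One way to picture how the separate HGML data bases interrelate is as records that refer to one another by key. The following is a minimal sketch only: every class name, field name, and sample value is hypothetical and is not drawn from the actual HGML design, and the "Lit" and "RFLP" data bases are represented only by keys or omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    """'Source': a laboratory from which a probe or cell line may be obtained."""
    source_id: str
    laboratory: str


@dataclass
class ProbeEntry:
    """'Probe': a DNA probe used for mapping."""
    probe_id: str
    source_id: str  # key into the Source records


@dataclass
class MapEntry:
    """'Map': the chromosomal position of one mapped gene."""
    gene_symbol: str
    chromosome: str
    region: str
    citation_ids: list[str] = field(default_factory=list)  # keys into 'Lit'
    probe_ids: list[str] = field(default_factory=list)     # keys into 'Probe'


def laboratories_for_gene(entry: MapEntry,
                          probes: dict[str, ProbeEntry],
                          sources: dict[str, SourceEntry]) -> list[str]:
    """Follow Map -> Probe -> Source links to list the supplying laboratories."""
    labs = {sources[probes[pid].source_id].laboratory for pid in entry.probe_ids}
    return sorted(labs)


# Placeholder records, invented for illustration only.
sources = {"S1": SourceEntry("S1", "Laboratory A"),
           "S2": SourceEntry("S2", "Laboratory B")}
probes = {"P1": ProbeEntry("P1", "S1"), "P2": ProbeEntry("P2", "S2")}
gene = MapEntry("GENE1", chromosome="1", region="p31",
                citation_ids=["L1"], probe_ids=["P2", "P1"])
print(laboratories_for_gene(gene, probes, sources))
```

Keeping each kind of record separate and linking by key lets a map position be revised without touching the probe or source records, which suits a data base whose assignments are continually refined.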
The HGML data base, together with the scientific community it
serves, also strives to maintain a uniform and orderly nomenclature
for all mapped genes. It will be important to extend this nomenclature
(or another that is agreeable to the scientific community) to other
species, such as the laboratory mouse, so that direct comparisons
between homologous genes in different species can be made readily.
The HGML also assigns accession numbers to all DNA probes that
might be useful as genetic markers. Upon request, researchers active
in this field are given unique DNA probe identification numbers (D
numbers), so that these probes can be described unambiguously.
More than 2,000 probes have been numbered, and rapid growth to
more than 100,000 is expected in the years ahead. An extension of
this type of system could serve as a logical means of keeping track
of the overlapping DNA clones produced by a human genome project.
Centre d'Etude du Polymorphisme Humain
The Centre d'Etude du Polymorphisme Humain (CEPH) coordinates
an international RFLP mapping effort using data from standard families
(Marx, 1985; Dausset, 1986). CEPH, created by Jean Dausset in Paris,
differs from MIM and HGML in being a collaborative research effort
that both generates and stores human mapping data. As discussed in
Chapter 4, CEPH maintains lymphoblastic cell lines and sends samples
of DNA from these cells to collaborators in Europe and North America.
In return, the recipients agree to test their RFLP probes on all the
so-called "informative" families for each probe (the families in which
two alleles of the particular RFLP are present). Collaborating members
of CEPH are required to submit to Paris all of their RFLP probe
mapping data in a prescribed, uniform format. CEPH then maintains
a common data base to which members of the project have rapid
access, which thereby allows members to place their own RFLP
probes on a common linkage map. Through this collaborative project,
the work of several laboratories on different continents is coordinated
toward a common goal, which can be achieved much more rapidly
than it could be in any one laboratory alone.
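The notion of an "informative" family can be made concrete with a small sketch. It applies the definition given above literally (a family is informative for a probe if at least two alleles of that RFLP are present among its members); the function name, the data layout, and the sample typings are assumptions made for illustration, and criteria used in practice may be stricter.

```python
def informative_families(typings: dict[str, list[tuple[str, str]]]) -> list[str]:
    """Return the ids of families in which two or more alleles are present.

    `typings` maps a family id to the genotypes of its members for one
    probe, each genotype being a pair of allele labels.
    """
    informative = []
    for family_id, genotypes in typings.items():
        # Collect every distinct allele seen in this family.
        alleles = {allele for genotype in genotypes for allele in genotype}
        if len(alleles) >= 2:
            informative.append(family_id)
    return sorted(informative)


# Invented typings for one hypothetical probe.
typings = {
    "family_01": [("A1", "A2"), ("A1", "A1"), ("A1", "A2")],
    "family_02": [("A1", "A1"), ("A1", "A1"), ("A1", "A1")],
}
print(informative_families(typings))
```

Only the first family would be tested further with this probe, since the second shows a single allele and cannot reveal how the marker is inherited.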
MAPPING DATA BASES REQUIRED FOR A
HUMAN GENOME PROJECT
The Collaborative Facilities Needed To Generate an
RFLP Map Must Be Expanded
One of the early goals foreseen for the human genome project is
an RFLP map in which the average separation of markers is 1 cM.
CEPH provides an example of how international collaboration,
involving both the exchange of materials (DNA samples) and data (each
group's probe-mapping results), can be organized for the production
of an RFLP map held in common. However, to achieve a 1-cM RFLP
map in a timely fashion (5 to 10 years), either CEPH must be expanded
substantially or a new organization must be created and modeled
along its lines, with the following objectives:
· A significant increase in the number and diversity of origin of
families.
· Identification of several thousand new RFLP probes and their
use to screen the set of DNAs obtained from these families (requiring
either more laboratories or the enlargement of existing ones).
· A major increase in DNA production facilities because of the
increased number of families and RFLP probes to be used with these
DNAs.
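The scale of a 1-cM map can be estimated with a short calculation. The sketch below assumes a total genetic length of about 3,300 cM for the human genome; that value is an assumption introduced here for illustration and is not a figure stated in this chapter.

```python
import math

ASSUMED_GENETIC_LENGTH_CM = 3_300  # assumption: approximate total human genetic map length


def markers_needed(genetic_length_cm: float, spacing_cm: float = 1.0) -> int:
    """Return the number of evenly spaced markers needed for a given average spacing."""
    return math.ceil(genetic_length_cm / spacing_cm)


print(markers_needed(ASSUMED_GENETIC_LENGTH_CM))
```

Because real markers are not evenly spaced, more probes than this minimum would have to be characterized, which fits the objective of identifying several thousand new RFLP probes.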
At present, CEPH maintains stable lymphoblastoid cell lines derived
from each of its 600 participants. It grows batches of the cells, extracts
the DNA, and distributes it. More than one center may have to be
established to grow the cells and to produce and distribute DNA.
At present, the laboratories collaborating with CEPH do not have
to make available to the project their RFLP probe DNAs; they need
only provide the data obtained with them. This helps to make the
CEPH collaboration successful by avoiding constraints that might
otherwise restrain participation. In the future, however, rules
concerning the general availability of RFLP probes will have to be decided
within the context of a human genome project. If RFLP mapping is
done under contract by commercial enterprises, some of which already
have considerable experience in the field, the contract should stipulate
that there be open access to all the probes that are developed.
All Human Map Data Should Be Accessible from a
Single Data Base
In the major mapping data base associated with the human genome
project, it will be necessary to keep track of the map positions,
literature references, and material distribution sources for all identified
landmarks in the human genome, including the DNA clones in the
ordered clone collection. This can best be accomplished by having a
single centralized data base that is easily accessible to the entire
scientific community. A large data facility will be needed to manage
this information. Initially, this facility would be responsible for
integrating all RFLP mapping and DNA clone data, which would
include all the information now in the MIM and HGML data bases.
Once a human genome project begins generating large amounts of
data, the annual management costs of mapping data bases are likely
to rise from the total of $800,000 currently spent by MIM and HGML
to perhaps $5 million. Whether the mapping data bases that are unified
by a single management organization should also be housed under
one roof is an open question. During the first stages of the project,
and as long as MIM and HGML are electronically linked, it may be
more practical to leave them in different institutions.
A Material Collection and Distribution Facility for Ordered Sets of
Cloned DNA Fragments Will Be an Important First Stage in Any
Sequencing Project
The representation of the physical map in a DNA clone collection
is immediately useful in that DNA segments of unknown origin can
be located on them either by hybridization or by fingerprinting
methods. Ultimately, the best physical map is the complete set of all
such materials along with the information data bases described above.
A separate dedicated facility will be required if these materials are to
be made readily accessible to the entire user community.
Maintaining a facility that collects, organizes, and distributes all
the available DNA clones generated by mapping efforts will be a
major task. Further study will be needed to determine exactly how
this facility should operate. At one extreme, one could imagine that
such a facility would merely store DNA clones received from all
participating laboratories (as DNA, as bacterial viruses, or as yeast
cells carrying artificial chromosomes), index them according to some
reasonable plan, and then redistribute them for a standard fee in
response to specific requests from scientists. Because of the very
large number of clones expected in the collection (at least several
hundred thousand versus the 42,000 items now at the American Type
Culture Collection), this aspect of the task will require major orga-
nizational efforts like those of a large mail-order company. In addition,
stocks will also have to be replenished at intervals to keep the
collection adequately supplied with materials. Because of possible
clone instabilities, both these regrown stocks and each new stock
received will require checking (by restriction enzyme fingerprinting
or some other high-resolution method) as a standard quality-control
procedure.
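The quality-control step described above amounts to comparing the fingerprint of a regrown or newly received stock with the reference fingerprint on file. The sketch below is one simple way to express that comparison; the function name, the representation of a fingerprint as a list of fragment sizes in base pairs, and the 5 percent sizing tolerance are all assumptions made for illustration.

```python
def fingerprints_match(reference: list[int], observed: list[int],
                       tolerance: float = 0.05) -> bool:
    """Return True if two restriction fingerprints agree fragment by fragment.

    Fragments are paired after sorting by size, which is a simplification;
    each observed size must lie within `tolerance` (as a fraction) of its
    reference size.
    """
    if len(reference) != len(observed):
        return False  # a gained or lost fragment suggests a rearranged clone
    for ref, obs in zip(sorted(reference), sorted(observed)):
        if abs(ref - obs) > tolerance * ref:
            return False
    return True


# Invented fragment sizes (base pairs) for a hypothetical clone.
reference_stock = [1200, 3400, 560]
regrown_stock = [570, 1180, 3450]
print(fingerprints_match(reference_stock, regrown_stock))
```

A stock that fails the comparison would be set aside for closer inspection rather than distributed.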
An additional possible routine role for the central facility includes
converting large human DNA fragments cloned as artificial chromosomes
into more readily accessible bacterial virus or cosmid DNA
clone collections. The facility could also take all the DNA clones that
have been mapped elsewhere by a variety of different procedures and
fingerprint them by a single method to provide a standard indexing
procedure. One can also envision a central facility that would actually
help with the mapping effort; this type of facility could establish a
single standard protocol for characterizing each DNA clone (for
example, a standard restriction enzyme fingerprinting method) and
collect and analyze the data provided by each of the participating
laboratories to search for new overlaps. At present, mapping methods
are in a state of flux, and many competing approaches are being tried
in different laboratories. Any mapping role for a central facility should
therefore be delayed until a reasonable consensus can be reached on
the best way to proceed.
The cost of constructing and operating a storage and distribution
facility will be high. Estimates of as much as $250 million spent for
30 years of operation have been made once the full range of clones
has been generated (Stevenson, 1987).
A DNA SEQUENCE DATA BANK DEDICATED TO A
HUMAN GENOME PROJECT
A Concerted Initiative Aimed at Determining the Sequence of
the Human Genome Will Generate Large Amounts of
DNA Sequence Data
Not only will there eventually be many billion nucleotides of human
DNA sequences, but also there will be large tracts of sequence from
the mouse genome that can be used for comparisons between the two
species. In early stages of the sequencing portion of the project, it is
likely that the genomes of experimental model organisms such as E.
coli, yeast, the nematode, and Drosophila will be completely sequenced.
If the project is to succeed, all data on large amounts of
contiguous DNA sequence should be collected and distributed by a
dedicated DNA sequence bank.
Fortunately, the amount of data associated with a human genome
project is well within present disk storage and computer hardware
capacity. Many government agencies, as well as the business world,
are storing and handling significantly larger volumes of data. The
difficulties will be encountered in the entry and classification of the
data and even more so in their analysis and distribution to the
international scientific community. An important goal of the entire
endeavor should be to make available the information in a form that
will benefit a very large portion of the biomedical and basic research
community as quickly as possible.
All Data Must Be Entered Electronically or Magnetically
From the outset, all sequence data must be entered into the DNA
sequence bank by electronic or magnetic means. Moreover, the human
genome project can circumvent many of the problems experienced
by GenBank/EMBL by establishing a standard features format implemented
at the point of data collection with the intention of expediting
data entry. For example, all submitted data blocks could be packaged
with references by the sender to data source, DNA clone number,
chromosome region, and other factors. Since most data will probably
be sent from fewer than two dozen research laboratories, the chances
of entering spurious data from inexperienced investigators will be
low. Nonetheless, there must be standards that set a minimum length
of contiguous sequence suitable for submission and ensure quality
control with regard to the frequency of errors in the accepted sequence.
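A standard submission format of the kind proposed above could be checked automatically at the point of data collection. The following sketch is illustrative only: the record fields mirror the items named in the text (data source, DNA clone number, chromosome region), but the class name, the function name, and both thresholds are hypothetical placeholders, since the text does not specify a minimum length or an acceptable error frequency.

```python
from dataclasses import dataclass

MIN_CONTIGUOUS_LENGTH = 1_000  # assumption: placeholder minimum length in nucleotides
MAX_ERROR_RATE = 0.001         # assumption: placeholder maximum errors per nucleotide


@dataclass
class Submission:
    data_source: str
    clone_number: str
    chromosome_region: str
    sequence: str
    estimated_error_rate: float


def validation_problems(sub: Submission) -> list[str]:
    """Return a list of reasons the submission fails the standard (empty if none)."""
    problems = []
    # Every block must carry its reference fields.
    for name in ("data_source", "clone_number", "chromosome_region"):
        if not getattr(sub, name).strip():
            problems.append(f"missing {name}")
    # Only nucleotide letters (and N for an unknown base) are accepted.
    if set(sub.sequence.upper()) - set("ACGTN"):
        problems.append("sequence contains characters other than A, C, G, T, N")
    if len(sub.sequence) < MIN_CONTIGUOUS_LENGTH:
        problems.append("sequence shorter than minimum contiguous length")
    if sub.estimated_error_rate > MAX_ERROR_RATE:
        problems.append("estimated error rate above threshold")
    return problems


# Invented example: a short block with no clone number.
example = Submission("Center 1", "", "7q31", "ACGT" * 10, 0.0005)
print(validation_problems(example))
```

Returning every problem at once, rather than stopping at the first, lets a submitting laboratory correct a block in a single pass.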
An Initial Analysis Should Be Performed by a Central Facility
Not unexpectedly, many different points of view exist about how,
where, and when the large amounts of data in the genome sequence
ought to be processed. New computers are constantly appearing, and
new strategies for using them are always evolving. The most important
analyses will no doubt be done by people interested in specific types
of proteins, regulatory sequences, evolutionary processes, and so
forth. However, some analysis should also be performed at the central
facility to help in classifying the data for future research. Exactly how
the data are to be analyzed might be tied to the number of centers or
laboratories collecting the data, the kinds of staffing provided at a
central facility, and the scope of the immediate data dispersal, i.e.,
whether it is national or international.
An Example of an Initial Sequence Analysis
The strategy of data analysis will have to evolve as data accumulate,
but the primary question will always be whether a particular sequence
is an important island of information or just part of a surrounding
ocean of chaos. Accordingly, the incoming data might be screened
for repeated sequences. Even the most interesting parts of the human
genome, the 50,000 to 100,000 genes, are going to be redundant,
inasmuch as there will be many large families of closely related genes.
The central nervous system, for example, which may account for 40
percent of all genes, is almost certainly going to include many such
families. Some type of screening can help catalog the incoming data
from the start and determine where and how the data should be stored.
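A first screen for repeated sequences can be as simple as counting short words in the incoming data. The sketch below is a toy illustration rather than a recommended method: the function name, the word length, and the example sequence are invented, and overlapping occurrences are counted.

```python
from collections import Counter


def repeated_words(sequence: str, k: int = 7, min_count: int = 2) -> dict[str, int]:
    """Return every k-letter word that occurs at least `min_count` times."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {word: n for word, n in counts.items() if n >= min_count}


# Invented sequence in which "GATTACA" occurs twice.
incoming = "GATTACACCCGATTACA"
print(repeated_words(incoming, k=7))
```

Sequences rich in repeated words could then be routed to a separate catalog from sequences that appear unique.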
To encourage the timely submission of data, all data submitted by
the sequencing centers should be rapidly returned to them in a
processed form for inspection and verification; each center should
also be kept aware of progress at the others.
Establishing an Efficient Computer Network
Many possible computer arrangements would suffice for the jobs
described. It seems reasonable, however, to begin the operation on a
modest level, with the intention of scaling up over several years. For
example, the operation could easily be initiated with local computers
that are connected with the National Supercomputer Network. In this
model, the data collected at a sequencing center would be fed into a
local computer, checked and entered into a features table, and then
transmitted over the Supercomputer Network (which is especially
good for high-speed transmission of large amounts of data) to the
central DNA sequence bank. There, an analytical facility, which
would probably use parallel computers at some future date, could
handle the early stages of data analysis. The screened data would
then be transmitted back to the various collection centers for verifi-
cation. Once verified, the data would become available to the scientific
community, moving through the Supercomputer Network to various
local distribution points.
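The flow of data in this model can be summarized as a sequence of stages with a hold at the verification step. The sketch below only restates the steps described above; the stage names and the function are hypothetical and imply nothing about how such a network would actually be implemented.

```python
from enum import Enum


class Stage(Enum):
    COLLECTED = "collected at a sequencing center"
    CHECKED = "checked and entered into a features table"
    TRANSMITTED = "transmitted to the central DNA sequence bank"
    SCREENED = "screened by the central analytical facility"
    RETURNED = "returned to the collection center for verification"
    RELEASED = "released to the scientific community"


ORDER = list(Stage)  # Enum members iterate in definition order


def advance(stage: Stage, verified: bool = False) -> Stage:
    """Move a data block to its next stage; release requires verification."""
    if stage is Stage.RELEASED:
        return stage  # final stage
    if stage is Stage.RETURNED and not verified:
        return stage  # hold until the collection center verifies the data
    return ORDER[ORDER.index(stage) + 1]


stage = Stage.COLLECTED
while stage is not Stage.RETURNED:
    stage = advance(stage)
print(advance(stage).name, advance(stage, verified=True).name)
```

Making verification an explicit condition reflects the requirement that data become available to the community only after the originating center has checked the processed form.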
The Need for Research on Data Analysis
We are only at the beginning of learning how to use computers to
interpret DNA sequence information. New ways of searching DNA
sequences will need to be designed as we learn more about such
subjects as the binding sites for gene regulatory proteins, the rules
that regulate RNA splicing, RNA secondary structure, and the effects
of specific amino acid replacements on protein folding. In the future,
we can expect to learn a great deal more about genes from their
sequences than is possible today. A human genome project should
therefore encourage the activities of those who combine skills in
computer programming and biology; these individuals will be needed
to generate the DNA sequence search routines of greatest utility to
the biological community.
The Estimated Cost
Although it is difficult to predict the cost exactly, approximately $5
million per year might be set aside for the sequence information
facility. The largest part of the costs related to information handling
will undoubtedly be devoted to professional staffing. Beyond that,
funds will also have to be made available to develop software and to
provide education and training to ensure further innovations in
computer use in biology.
It is essential to keep the conventional data bases, including
GenBank/EMBL, fully operational for the next several years, in
particular to ensure comprehensive collection of sequence data from
nonhuman sources. However, the time will come when sequence data
from all sources will have to be melded into a single large, efficient
facility.
CONCLUSIONS
More than any other part of the human genome initiative, the
handling of information and material will require organization and
standardization. A single unified policy must prevail if the information
is to be accurately acquired, stored, analyzed, and distributed. There
must be a central facility for tracking and distributing the experimental
materials, and there must be a dedicated computer center for storing,
checking, screening, and searching the sequence and mapping data.
The establishment of these facilities will be critical and will require
careful advance planning. The committee recommends a competition
in which all interested groups submit detailed applications or pilot
program trials.
REFERENCES
Dausset, J. 1986. Le centre d'etude du polymorphisme humain. Presse Med. 15:1801-1802.
Marx, J. L. 1985. Putting the human genome on the map. Science 229:150-151.
Regional Localizations of Genes and Genetic Markers to Chromosomes and Subregions of
Chromosomes. 1986, Number 1, HGM8. Howard Hughes Medical Institute Human
Gene Mapping Library, New Haven, Conn.
Stevenson, R. E. Cited by L. Roberts, 1987. Human genome: Question of cost. Science
237:1411-1412.