Shotgun and clone contig method
1. Genome Sequencing: Shotgun and Clone Contig Method
Dr. Naveen Gaurav
Associate Professor and Head
Department of Biotechnology
Shri Guru Ram Rai University
Dehradun
2. Genome Sequencing: Shotgun Method
In shotgun sequencing the genome is broken randomly into short fragments (1 to 2 kbp long)
suitable for sequencing. The fragments are ligated into a suitable vector and then partially
sequenced. Around 400–500 bp of sequence can be generated from each fragment in a
single sequencing run. In some cases, both ends of a fragment are sequenced. Computerized
searching for overlaps between individual sequences then assembles the complete
sequence. Overlapping sequences are assembled to generate contigs (Fig. 1). The term
contig refers to a known DNA sequence that is contiguous and lacks gaps.
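
The overlap search described above can be illustrated with a toy greedy assembler. This is only a minimal sketch: the function names, the tiny reads, and the `min_overlap` threshold are assumptions made for illustration, and real assemblers must also cope with sequencing errors, both DNA strands, and repeats.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def drop_contained(seqs):
    """Remove exact duplicates and sequences contained inside another one."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != o and s in o for o in unique)]


def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair until no pair overlaps enough.

    Returns the remaining contigs. More than one contig means gaps remain.
    """
    contigs = drop_contained(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        # Join the two sequences, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        rest = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])


# Hypothetical reads taken from the made-up sequence ATGCGTACGTTAGC.
reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))  # a single contig: ['ATGCGTACGTTAGC']
```

If one of the reads is replaced by an unrelated sequence, the function returns two contigs, which corresponds to a gap in the assembly.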
Since fragments are cloned at random, duplicates will quite often be sequenced. To get full
coverage the total amount of sequence obtained must therefore be several times that of the
genome to allow for duplications. For example, 99.8% coverage requires a total amount of
sequence that is 6- to 8-fold the genome size. In principle, all that is required to assemble a
genome, however large, from small sequences is a sufficiently powerful computer. No
genetic map or prior information is needed about the organism whose genome is to be
sequenced. The original limitation to shotgun sequencing was the massive data handling that
is required. The development of faster computers overcame this problem. The first bacterial
genome to be sequenced was that of Haemophilus influenzae. The sequence was deduced from just
under 25,000 sequences averaging 480 bp each. This gave a total of almost 12 million bp of
sequence, about six times the genome size. Computerized assembly using overlaps resulted in 140
regions of contiguous sequence, that is, 140 contigs. Haemophilus influenzae thus had the
honor of being the first free-living organism to be completely sequenced.
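
The coverage figures above can be checked with a short calculation. The sketch below assumes a simple Poisson model of random sampling, in which the expected fraction of the genome covered at c-fold redundancy is 1 - e^(-c). That model is an assumption brought in for illustration and is not stated in the slides; the read count and read length are the rounded Haemophilus values quoted in the text.

```python
import math


def fold_coverage(num_reads, mean_read_length, genome_size):
    """Total sequenced bases divided by genome size."""
    return num_reads * mean_read_length / genome_size


def expected_fraction_covered(coverage):
    """Expected fraction of bases sequenced at least once.

    Assumes reads land uniformly at random (Poisson model); cloning bias
    and unstable regions make real coverage lower.
    """
    return 1.0 - math.exp(-coverage)


# Rounded figures from the text: about 25,000 reads of about 480 bp each.
total_bases = 25_000 * 480           # 12,000,000 bp
implied_genome_size = total_bases / 6  # size implied by "six times", an estimate only

print(total_bases)
print(fold_coverage(25_000, 480, implied_genome_size))  # 6.0
for c in (6, 8):
    print(c, round(expected_fraction_covered(c), 4))    # 0.9975 and 0.9997
```

Under this model, 6-fold redundancy leaves roughly 0.25% of bases unsequenced, which is consistent with the 99.8% figure quoted above.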
3. Figure 1. Shotgun Sequencing
The first step in shotgun sequencing an entire genome is to digest the genome into a large
number of small fragments suitable for sequencing. All the small fragments are then cloned and
sequenced. Computers analyze the sequence data for overlapping regions and assemble the
sequences into several large contigs. Since some regions of the genome are unstable when
cloned, some gaps may remain even after this procedure is repeated several times.
4. The gaps between the contigs may be closed by more targeted procedures. The
easiest method is to re-screen the original set of clones with pairs of probes corresponding
to sequences on the two sides of each gap. Clones that hybridize to both members of such
a pair of probes presumably carry DNA that bridges the gap between two contigs. Such
clones are then sequenced in full to close the gaps between contigs. However, many of the
gaps between contigs are due to regions of DNA that are unstable when cloned, especially
in a multicopy vector. Therefore, a second library in a different vector, typically a single-copy
vector such as a lambda phage, is often used during the later stages of shotgun sequencing.
Pairs of end-of-contig probes are used to screen the new library for clones that hybridize
to both probes and carry DNA that bridges the gap between the two contigs (Fig. 2A). A
third approach, which avoids cloning altogether, is to run PCR reactions on whole genomic
DNA using random pairs of PCR primers corresponding to contig ends. A PCR product will
result only if the two contig ends are within a few kb of each other (Fig. 2B).
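
The screening logic for bridging clones can be sketched as a simple lookup. The clone names, the hybridization results, and the function name below are hypothetical; only the probe labels 3b and 4a follow the naming used in Fig. 2.

```python
from itertools import combinations


def find_bridging_clones(hybridization, probe_to_contig):
    """Group clones by the pair of contigs whose end probes they both bind.

    hybridization: clone name -> set of probe names that hybridized.
    probe_to_contig: probe name -> contig number the probe was made from.
    """
    bridges = {}
    for clone, probes in hybridization.items():
        contigs = sorted({probe_to_contig[p] for p in probes})
        # A clone binding probes from two different contigs may span the gap.
        for pair in combinations(contigs, 2):
            bridges.setdefault(pair, []).append(clone)
    return bridges


# Hypothetical end-of-contig probes ("a" = beginning, "b" = end of a contig).
probe_to_contig = {"2b": 2, "3a": 3, "3b": 3, "4a": 4}

# Hypothetical screening results for three clones from the second library.
hybridization = {
    "clone_A": {"3b", "4a"},  # binds both sides of the gap between contigs 3 and 4
    "clone_B": {"3a"},        # binds one contig only, so it bridges nothing
    "clone_C": {"2b", "3a"},  # candidate for the gap between contigs 2 and 3
}

print(find_bridging_clones(hybridization, probe_to_contig))
```

Clones reported for a contig pair are candidates only. As the text notes, they are then sequenced in full to confirm that they close the gap.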
5. Figure 2. Closing Gaps between Contigs
To close gaps between contigs, probes or primers are made that correspond to the ends of the
contigs (pink). (A) A new library of clones (green) is screened with end-of-contig probes. Clones
that hybridize to probes from the two sides of a gap are isolated. In this example, probes for the
end of contig #3 (3b) and the beginning of contig #4 (4a) both hybridize to the fragment shown.
Therefore, the sequence of this clone should close the gap between contigs #3 and #4. (B) The
alternative approach uses PCR primers that correspond to the ends of contigs to amplify genomic
DNA. If the two primer sites lie within a few kilobases of each other, a PCR product is made and
can be sequenced.
6. Hierarchical shotgun sequencing
Although shotgun sequencing can in theory be applied to a genome of any size, its direct
application to the sequencing of large genomes (for instance, the human genome) was
limited until the late 1990s, when technological advances made practical the handling of the
vast quantities of complex data involved in the process. Historically, full-genome shotgun
sequencing was believed to be limited both by the sheer size of large genomes and by the
complexity added by the high percentage of repetitive DNA (greater than 50% for the
human genome) present in large genomes. It was not widely accepted that a full-genome
shotgun sequence of a large genome would provide reliable data. For these reasons, other
strategies that lowered the computational load of sequence assembly had to be utilized
before shotgun sequencing was performed. In hierarchical sequencing, also known as top-
down sequencing, a low-resolution physical map of the genome is made prior to actual
sequencing. From this map, a minimal number of fragments that cover the entire
chromosome are selected for sequencing. In this way, the minimum amount of high-
throughput sequencing and assembly is required.
The amplified genome is first sheared into larger pieces (50–200 kb) and cloned into a
bacterial host using bacterial artificial chromosomes (BACs) or P1-derived artificial chromosomes (PACs). Because multiple
genome copies have been sheared at random, the fragments contained in these clones have
different ends, and with enough coverage (see section above) finding a scaffold of BAC
contigs that covers the entire genome is theoretically possible. This scaffold is called a tiling
path. Once a tiling path has been found, the BACs that form this path are sheared at random
into smaller fragments and can be sequenced using the shotgun method on a smaller scale.
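
Choosing the fewest clones that span a region can be sketched as a greedy interval cover. The example assumes that each clone's approximate start and end positions on the physical map are already known. The clone names and coordinates are invented for illustration, and the slides do not prescribe any particular selection algorithm.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick a smallest set of clones covering [region_start, region_end].

    clones: clone name -> (start, end) position on the physical map.
    Greedy rule: among clones that begin at or before the covered point,
    take the one that reaches farthest to the right.
    """
    path = []
    covered_to = region_start
    while covered_to < region_end:
        candidates = [
            (end, name)
            for name, (start, end) in clones.items()
            if start <= covered_to < end
        ]
        if not candidates:
            raise ValueError(f"no clone covers map position {covered_to}")
        best_end, best_name = max(candidates)
        path.append(best_name)
        covered_to = best_end
    return path


# Hypothetical BAC positions in kb along a 450 kb region.
bacs = {
    "BAC1": (0, 150),
    "BAC2": (50, 180),
    "BAC3": (120, 300),
    "BAC4": (170, 320),
    "BAC5": (290, 450),
    "BAC6": (310, 400),
}

print(minimal_tiling_path(bacs, 0, 450))  # ['BAC1', 'BAC3', 'BAC5']
```

Only the three selected BACs would then be shotgun-sequenced. If part of the region is covered by no clone, the function raises an error instead of returning an incomplete path.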
7. Although the full sequences of the BAC contigs are not known, their order and orientations relative to one
another are known. There are several methods for deducing this order and selecting the BACs
that make up a tiling path. The general strategy involves identifying the positions of the clones
relative to one another and then selecting the fewest clones required to form a contiguous
scaffold that covers the entire area of interest. The order of the clones is deduced by
determining the way in which they overlap. Overlapping clones can be identified in several ways.
A small radioactively or chemically labeled probe containing a sequence-tagged site (STS) can be
hybridized onto a microarray upon which the clones are printed. In this way, all the clones that
contain a particular sequence in the genome are identified. The end of one of these clones can
then be sequenced to yield a new probe and the process repeated in a method called
chromosome walking. Alternatively, the BAC library can be restriction-digested. Two clones that
have several fragment sizes in common are inferred to overlap because they contain multiple
similarly spaced restriction sites in common. This method of genomic mapping is called
restriction fingerprinting because it identifies a set of restriction sites contained in each clone.
Once the overlaps between the clones have been found and their order relative to the genome is
known, a scaffold of a minimal subset of these contigs that covers the entire genome is shotgun-
sequenced. Because it involves first creating a low-resolution map of the genome, hierarchical
shotgun sequencing relies less heavily on computer algorithms than whole-genome shotgun
sequencing. The extensive BAC library creation and tiling path selection, however, make
hierarchical shotgun sequencing slow and labor-intensive. Now that the technology is available
and the reliability of the data demonstrated, the speed and cost efficiency of whole-genome
shotgun sequencing have made it the primary method for genome sequencing.
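
The restriction fingerprinting comparison described above can be sketched as counting fragment sizes shared by two clones. The fragment sizes, the 2% size tolerance, and the threshold of three shared fragments are all assumptions chosen for illustration; real fingerprinting software uses statistical scoring rather than a fixed cutoff.

```python
from itertools import combinations


def shared_fragments(sizes_a, sizes_b, tolerance=0.02):
    """Count fragments of clone A that match a distinct fragment of clone B.

    Two sizes match if they differ by at most `tolerance` (relative),
    which stands in for the measurement error of gel sizing.
    """
    unmatched = sorted(sizes_b)
    shared = 0
    for size in sorted(sizes_a):
        for k, other in enumerate(unmatched):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del unmatched[k]  # each fragment of B may be used only once
                break
    return shared


def likely_overlaps(fingerprints, min_shared=3):
    """Return (clone, clone, shared count) for pairs sharing enough fragments."""
    pairs = []
    for a, b in combinations(sorted(fingerprints), 2):
        count = shared_fragments(fingerprints[a], fingerprints[b])
        if count >= min_shared:
            pairs.append((a, b, count))
    return pairs


# Hypothetical restriction fragment sizes in bp for three BAC clones.
fingerprints = {
    "BAC1": [1200, 3400, 5100, 800, 2300],
    "BAC2": [3400, 5100, 2300, 950, 4100],
    "BAC3": [700, 1500, 6000, 4100],
}

print(likely_overlaps(fingerprints))  # [('BAC1', 'BAC2', 3)]
```

Here BAC1 and BAC2 share three fragment sizes and are inferred to overlap, while the single fragment shared by BAC2 and BAC3 is treated as insufficient evidence.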