English Pages 275 Year 2007
Genome Sequencing Technology and Algorithms
For a listing of related Artech House titles, turn to the back of this book.
Genome Sequencing Technology and Algorithms Sun Kim Haixu Tang Elaine R. Mardis Editors
artechhouse.com
Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the U.S. Library of Congress. British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library.
Cover design by Igor Valdman
ISBN 13: 978-1-59693-094-0
© 2008 ARTECH HOUSE, INC. 685 Canton Street Norwood, MA 02062 All rights reserved. Printed and bound in the United States of America. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher. All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Artech House cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.
List of Contributors
Chapter 1
Elaine R. Mardis, Washington University
Chapter 2
Baback Gharizadeh, Roxana Jalili, and Mostafa Ronaghi, Stanford University
Chapter 3
David Okou and Michael E. Zwick, Emory University School of Medicine
Chapter 4
Jay Shendure, University of Washington Gregory J. Porreca and George M. Church, Harvard Medical School
Chapter 5
Lewis J. Frey and Joyce A. Mitchell, University of Utah Victor Maojo, Universidad Politecnica de Madrid
Chapter 6
Sun Kim and Haixu Tang, Indiana University
Chapter 7
Sun Kim and Haixu Tang, Indiana University
Chapter 8
Jiacheng Chen and Steven Skiena, Stony Brook University
Chapter 9
Haixu Tang and Sun Kim, Indiana University
Chapter 10
Paola Bonizzoni and Gianluca Della Vedova, University di Milano-Bicocca Riccardo Dondi, University of Bergamo Jing Li, Case Western Reserve University
Chapter 11
Benjamin J. Raphael, Brown University Stas Volik, Colin C. Collins, University of California at San Francisco
Chapter 12
Curt Balch and Kenneth P. Nephew, Indiana University Tim H.-M. Huang, The Ohio State University
Chapter 13
Aleksandar Milosavljevic and Cristian Coarfa, Baylor College of Medicine
Contents
Part I The New DNA Sequencing Technology
1 An Overview of New DNA Sequencing Technology
1.1 An Overview
1.1.1 Background
1.1.2 Rationale for Technology Development Toward Massively Parallel Scale DNA Sequencing
1.1.3 Goals of Massively Parallel Sequencing Approaches
1.2 Massively Parallel Sequencing by Synthesis Pyrosequencing
1.2.1 Principle of the Method
1.2.2 Pyrosequencing in a Microtiter Plate Format
1.2.3 The 454 GS-20 Sequencer
1.2.4 Novel Applications Enabled by Massively Parallel Pyrosequencing
1.3 Massively Parallel Sequencing by Other Approaches
1.3.1 Sequencing by Synthesis with Reversible Terminators
1.3.2 Ligation-Based Sequencing
1.3.3 Sequencing by Hybridization
1.4 Survey of Future Massively Parallel Sequencing Methods
1.4.1 Sequencing Within a Zero-Mode Waveguide
1.4.2 Nanopore Sequencing Approaches
References
2 Array-Based Pyrosequencing Technology
2.1 Introduction
2.2 Pyrosequencing Chemistry
2.3 Array-Based Pyrosequencing
2.4 454 Sequencing Chemistry
2.5 Applications of 454 Sequencing Technology
2.5.1 Whole-Genome Sequencing
2.5.2 Ultrabroad Sequencing
2.5.3 Ultradeep Amplicon Sequencing
2.6 Advantages and Challenges
2.7 Future of Pyrosequencing
References
3 The Role of Resequencing Arrays in Revolutionizing DNA Sequencing
3.1 Introduction
3.2 DNA Sequencing by Hybridization with Resequencing Arrays
3.3 Resequencing Array Experimental Protocols
3.4 Analyzing Resequencing Array Data with ABACUS
3.5 Review of RA Applications
3.5.1 Human Resequencing
3.5.2 Mitochondrial DNA Resequencing
3.5.3 Microbial Pathogen Resequencing
3.6 Further Challenges
References
4 Polony Sequencing
4.1 Introduction
4.2 Overview
4.3 Construction of Sequencing Libraries
4.4 Template Amplification with Emulsion PCR
4.5 Sequencing
4.6 Future Directions
References
5 Genome Sequencing: A Complex Path to Personalized Medicine
5.1 Introduction
5.2 Personalized Medicine
5.3 Heterogeneous Data Sources
5.4 Information Modeling
5.5 Ontologies and Terminologies
5.6 Applications
5.7 Conclusion
References
Part II Genome Sequencing and Fragment Assembly
6 Overview of Genome Assembly Techniques
6.1 Genome Sequencing by Shotgun-Sequencing Strategy
6.1.1 A Procedure for Whole-Genome Shotgun (WGS) Sequencing
6.2 Trimming Vector and Low-Quality Sequences
6.2.1 The Trimming Vector and Low-Quality Sequences Problem
6.3 Fragment Assembly
6.3.1 The Fragment Assembly Problem
6.4 Assembly Validation
6.4.1 The Assembly Validation Problem
6.5 Scaffold Generation
6.5.1 The Scaffold Generation Problem
6.5.2 Bambus
6.5.3 GigAssembler
6.6 Finishing
6.7 Three Strategies for Whole-Genome Sequencing
6.8 Discussion
6.8.1 A Thought on an Exploratory Genome Sequencing Framework
Acknowledgments
References
7 Fragment Assembly Algorithms
7.1 TIGR Assembler
7.1.1 Merging Fragments with Assemblies
7.1.2 Building a Consensus Sequence
7.1.3 Handling Repetitive Sequences
7.2 Phrap
7.3 CAP3
7.3.1 Automatic Clipping of 5’ and 3’ Poor Quality Regions
7.3.2 Computation and Evaluation of Overlaps
7.3.3 Use of Mate-Pair Constraints in Construction of Contigs
7.4 Celera Assembler
7.4.1 Kececioglu and Myers Approach
7.4.2 The Design Principle of the Celera Whole-Genome Assembler
7.4.3 Overlapper
7.4.4 Unitigger
7.4.5 Scaffolder
7.5 Arachne
7.5.1 Contig Assembly
7.5.2 Detecting Repeat Contigs and Repeat Supercontigs
7.6 EULER
7.6.1 Idury-Waterman Algorithm
7.6.2 An Overview of EULER
7.6.3 Error Correction and Data Corruption
7.6.4 Eulerian Superpath
7.6.5 Use of Mate-Pair Information
7.7 Other Approaches to Fragment Assembly
7.7.1 A Genetic Algorithm Approach
7.7.2 A Structured Pattern-Matching Approach
7.8 Incompleteness of the Survey
Acknowledgments
References
8 Assembly for Double-Ended Short-Read Sequencing Technologies
8.1 Introduction
8.2 Short-Read Sequencing Technologies
8.3 Assembly for Short-Read Sequencing
8.3.1 Algorithmic Methods
8.3.2 Simulation Results
8.4 Developing a Short-Read-Pair Assembler
8.4.1 Analysis
References
Part III Beyond Conventional Genome Sequencing
9 Genome Characterization in the Post–Human Genome Project Era
9.1 Genome Resequencing and Comparative Assembly
9.2 Genotyping Versus Haplotyping
9.3 Large-Scale Genome Variations
9.4 Epigenomics: Genetic Variations Beyond Genome Sequences
9.5 Conclusion
References
10 The Haplotyping Problem: An Overview of Computational Models and Solutions
10.1 Introduction
10.2 Preliminary Definitions
10.3 Inferring Haplotypes in a Population
10.3.1 The Inference Problem: A General Rule
10.3.2 The Pure Parsimony Haplotyping Problem
10.3.3 The Inference Problem by the Coalescent Model
10.3.4 Xor-Genotyping
10.3.5 Incomplete Data
10.4 Inferring Haplotypes in Pedigrees
10.5 Inferring Haplotypes from Fragments
10.6 A Glimpse over Statistical Methods
10.7 Discussion
Acknowledgments
References
11 Analysis of Genomic Alterations in Cancer
11.1 Introduction
11.1.1 Measurement of Copy Number Changes by Array Hybridization
11.1.2 Measurement of Genome Rearrangements by End Sequence Profiling
11.2 Analysis of ESP Data
11.3 Combination of Techniques
11.4 Future Directions
References
12 High-Throughput Assessments of Epigenomics in Human Disease
12.1 Introduction
12.2 Epigenetic Phenomena That Regulate Gene Expression
12.2.1 Methylation of Deoxycytosine
12.2.2 Histone Modifications and Nucleosome Remodeling
12.2.3 Small Inhibitory RNA Molecules
12.3 Epigenetics and Disease
12.3.1 Epigenetics and Developmental and Neurological Diseases
12.3.2 Epigenetics and Cancer
12.4 High-Throughput Analyses of Epigenetic Phenomena
12.4.1 Gel-Based Approaches
12.4.2 Microarrays
12.4.3 Cloning/Sequencing
12.4.4 Mass Spectrometry
12.5 Conclusions
Acknowledgments
References
13 Comparative Sequencing, Assembly, and Anchoring
13.1 Comparing an Assembled Genome with Another Assembled Genome
13.2 Mutual Comparison of Genome Fragments
13.3 Comparing an Assembled Genome with Genome Fragments
13.3.1 Applications Using Read Anchoring
13.3.2 Applications Employing Anchoring of Paired Ends
13.3.3 Applications Utilizing Mapping of Clone Reads
13.4 Anchoring by Seed-and-Extend Versus Positional Hashing Methods
13.5 The UD-CSD Benchmark for Anchoring
13.6 Conclusions
References
About the Authors
Index
Part I The New DNA Sequencing Technology
1 An Overview of New DNA Sequencing Technology
Elaine R. Mardis
1.1 An Overview
1.1.1 Background
The dideoxynucleotide termination DNA sequencing technology invented by Fred Sanger and colleagues, published in 1977, formed the basis for DNA sequencing from its inception through 2004 [1]. Originally based on radioactive labeling, the method was automated by the use of fluorescent labeling coupled with excitation and detection on dedicated instruments, with fragment separation by slab gel [2] and ultimately by capillary gel electrophoresis. A variety of molecular biology, chemistry, and enzymology-based improvements have brought Sanger’s approach to its current state of the art. By virtue of economies of scale, high-throughput automation, and reaction optimization, large sequencing centers have decreased the cost of a fluorescent Sanger sequencing reaction to around $0.30. However, it is likely that only incremental cost decreases will continue to be achieved for Sanger sequencing in its current manifestation. This fact, coupled with the ever-increasing need for DNA sequencing in a variety of biomedical (and other) studies, has resulted in a rapid phase of development of so-called next-generation or massively parallel sequencing technologies that will revolutionize DNA sequencing as we now know it. Along with this revolution will come a significant and potentially unanticipated impact
on sequencing-supportive infrastructures, namely, the computational hardware and software required to process and interpret these data.
1.1.2 Rationale for Technology Development Toward Massively Parallel Scale DNA Sequencing
It is perhaps interesting to evaluate the scientific underpinnings that have led to the recent revolution in DNA sequencing technology. With the completion of the reference human genome [3, 4], human geneticists and others began to question the nature and extent of genome-wide interindividual genomic variation. This concept of “strain-to-reference” comparison was not a novel one—certainly microbiologists had been studying the genomic differences between reference and pathogenic (clinical) strains of viruses and bacteria for many years, largely enabled by the ever-increasing availability of genome sequences for these organisms. Transitioning this concept to larger and more complex genomes simply is a matter of increasing the scale of comparison, since the human genome is approximately 1,000-fold larger than that of the average bacterium. It is also appropriate to note that much more focused strain-to-reference comparisons of human sequences have been pursued for many years in many studies, using PCR-based resequencing approaches. Here, PCR with genome-unique primers is utilized to selectively amplify the same region from many individual genomes and each resulting product is sequenced. A comparison of all sequences to the human reference can subsequently highlight common and rare mutations that may predispose to the disease state, predict outcome, or aid in the identification of specific treatments [5–13]. A first stab at understanding human diversity at the single-nucleotide level was embodied initially in 1999–2000 by the SNP Consortium efforts [14] and then scaled up considerably in 2002–2004 by the human “HapMap” project [15]. 
The latter project was significant not only in its accomplishments (over 1 million verified common human SNPs across four major human populations) but also in that it formed the basis for the development of high-throughput technologies for single nucleotide, genome-wide genotyping that could interrogate the many common human SNPs identified. For example, these approaches now enable the typing of more than 500,000 SNPs across the human genome for around $400 per sample. At the RNA level, DNA sequencing (long used for sequencing the ends of cloned mRNAs to produce expressed sequence tags or ESTs) was also being implemented to quantitate genome-wide gene expression levels for a given tissue or experimental condition by sequencing small RNA “tags” using an approach termed SAGE (serial analysis of gene expression) [16]. Mapping SAGE tag sequences back to a reference genome or transcriptome enabled identification of expressed genes and corresponding quantitative expression levels to be
ascertained (see [17–22] for examples). One drawback of this powerful technique, especially compared to microarray technology, was the significant expense incurred to sequence the number of SAGE tags required to yield a meaningful result, a number that scaled with genome complexity. With genome-wide genotyping technology in hand and providing ever-increasing numbers of SNPs per genotype, interest began to increase in going beyond common single nucleotide variations (by definition, those SNPs found at 5% or greater frequency in the population) to characterize the range of variation in multiple base insertions and deletions, as well as inversions, rearrangements, and translocations. One approach to this scale of characterization was reported by Eichler and colleagues [23], and was a variation on an earlier approach that had utilized BAC (bacterial artificial chromosome) clone end sequencing to identify genome rearrangements [24]. In the Eichler approach, genomic DNA from a single CEPH individual (who also had been genotyped in the HapMap project) was used to generate a fosmid library. Next, the sequences obtained from the ends of about 1 million fosmid clones in this library were mapped back to the human reference sequence and the end mapping locations were interpreted as described next. Since fosmids package their recombinant genomes within a relatively tight size range of 35,000–50,000 bases, the expected separation between end sequences of a given insert lies in that range by definition (much more so than BACs, by contrast). Using this approach, one can identify fosmid ends that do not both map to the same chromosomal pseudomolecule in the human reference, indicating that a translocation or other rearrangement has occurred.
Another possibility is a fosmid end placement distance on the reference that is smaller than the expected range (the sampled genome has an inserted sequence relative to the reference genome) or larger than expected (the sampled genome has a deletion relative to the reference genome). Typically, one would then select these clones and sequence them in their entirety to better characterize the nature of the rearrangement, deletion, or insertion, as was described in the Eichler manuscript. Again, although this is a very powerful method for elucidating genomic variation beyond SNP variation, significant sequencing costs are involved (around $1 million per genome). Beyond this sequencing-intensive approach, many groups [25–28] are now reporting analyses of copy number variation in individual human genomes, an offshoot of genome-wide SNP profiling on high-density microarray substrates that enable comparative genome hybridization and signal intensity analysis to ascertain regions of the genome that are exhibiting greater than two copies, in the case of amplification, and one or zero copies in the case of loss of heterozygosity or deletion.
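The interpretation of fosmid end placements can be illustrated with a minimal Python sketch. It is not the published analysis pipeline: the record layout, function name, and category labels are hypothetical, and only the 35,000–50,000 base insert range is taken from the text. The sketch assumes that distances are measured on the reference, so a span shorter than expected suggests extra sequence in the sampled genome, and a span longer than expected suggests sequence missing from it. Inversions, which would require end orientation, are not handled.

```python
from dataclasses import dataclass

# Fosmid insert size range quoted in the text (bases).
MIN_INSERT = 35_000
MAX_INSERT = 50_000


@dataclass
class EndPair:
    """Hypothetical record: where the two ends of one fosmid map on the reference."""
    clone_id: str
    chrom_left: str
    pos_left: int
    chrom_right: str
    pos_right: int


def classify(pair: EndPair,
             min_insert: int = MIN_INSERT,
             max_insert: int = MAX_INSERT) -> str:
    """Classify one end pair by its placement on the reference."""
    if pair.chrom_left != pair.chrom_right:
        # Ends on different chromosomes: translocation or other rearrangement.
        return "interchromosomal"
    span = abs(pair.pos_right - pair.pos_left)
    if span < min_insert:
        # Ends closer than expected on the reference: sample has extra sequence.
        return "insertion_in_sample"
    if span > max_insert:
        # Ends farther apart than expected: sample lacks reference sequence.
        return "deletion_in_sample"
    return "concordant"


if __name__ == "__main__":
    pairs = [
        EndPair("f1", "chr1", 100_000, "chr1", 140_000),
        EndPair("f2", "chr1", 100_000, "chr1", 120_000),
        EndPair("f3", "chr1", 100_000, "chr1", 190_000),
        EndPair("f4", "chr1", 100_000, "chr2", 140_000),
    ]
    for p in pairs:
        print(p.clone_id, classify(p))
```

Clones flagged as anything other than concordant would be the candidates for complete sequencing, as the text describes.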
1.1.3 Goals of Massively Parallel Sequencing Approaches
Ultimately, one major aim of massively parallel DNA sequencing instrumentation is to enable the large-scale sequencing of many human genomes in an accelerated timeframe (about 1 week per genome) at a cost approximating a high-end diagnostic assay (about $1,000 per genome). At the first pass, this will largely happen in research laboratories such as at genome centers or large pharmaceutical companies, and will involve disease-focused patient collections. Obviously, generating the data is only one part of the equation, but it must be done first. The secondary, more challenging goal is the analysis of this data on an individual genome basis and then, across all genomes in a given disease cohort, in correlation with other genomic and clinical data types. In addition to known genes, the interpretation of sequence data in nongenic sequence will become increasingly informative as the functions of these regions of the human genome are elucidated. It is likely that with the technology at hand, however, basic research will not wait. What this means for human health is yet to be seen, but likely we will begin to distill out of these basic research activities the genomic hallmarks of disease, which can then be further coalesced into sequence-based diagnostic assays. These diagnostics will almost certainly not involve whole genome resequencing but rather will focus on the genomic hallmarks or biomarkers, perhaps using modified-scale instrumentation derived from today’s massively parallel instruments or perhaps using single-molecule detection methods that are presently under development.
1.2 Massively Parallel Sequencing by Synthesis Pyrosequencing
1.2.1 Principle of the Method
Conceptually and in practice, pyrosequencing is a completely different approach from the standard Sanger dideoxynucleotide approach; it was initially reported by Nyren and colleagues [29] and later modified [30–34]. Figure 1 exemplifies the method in its current form. Basically, upon nucleotide incorporation by the polymerase, the released pyrophosphate is converted to ATP by action of the enzyme sulfurylase, providing firefly luciferase with the necessary energy source to convert luciferin to oxy-luciferin plus light. Since each template is queried with a single-nucleotide species in each step of pyrosequencing, detection of emitted light from this reaction can be directly correlated to the number of nucleotides incorporated, up to a point of nonlinear response, which is typically greater than six nucleotides of the same identity (depending upon the sensitivity of the detector).
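The relationship between nucleotide flows and light signal can be sketched in a few lines of Python. This is an idealized illustration, not instrument software: it assumes a noise-free, perfectly linear signal (ignoring the nonlinear response for long homopolymers), it models the strand being synthesized rather than the template, and the cyclic flow order `TACG` and both function names are assumptions chosen for the example.

```python
from typing import List

# Assumed cyclic order in which single nucleotide species are dispensed.
FLOW_ORDER = "TACG"


def flowgram(read: str, flow_order: str = FLOW_ORDER,
             max_flows: int = 400) -> List[int]:
    """Return the number of bases incorporated in each successive flow.

    `read` is the strand being synthesized. In this idealized model the
    light signal of a flow equals the length of the homopolymer run that
    the dispensed nucleotide extends (0 if it does not match).
    """
    signals: List[int] = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Incorporate as long as the next base matches the dispensed species.
        while pos < len(read) and read[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
        flow += 1
    return signals


def decode(signals: List[int], flow_order: str = FLOW_ORDER) -> str:
    """Rebuild the read from ideal integer flow signals."""
    return "".join(flow_order[i % len(flow_order)] * n
                   for i, n in enumerate(signals))


if __name__ == "__main__":
    s = flowgram("TTAGGGC")
    print(s)
    print(decode(s))
```

In practice the measured signals are real-valued and must be rounded to integer run lengths, which is where long homopolymers become difficult to call.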
1.2.2 Pyrosequencing in a Microtiter Plate Format
The pyrosequencing concept was initially limited by the buildup of unincorporated nucleotides and residual ATP between base additions. The inclusion of the enzyme apyrase in the pyrosequencing reaction cocktail overcame this limitation by degrading these residual molecules [33], and the pyrosequencing method was then translated into a commercially available microtiter plate-format instrument. Still, the microtiter plate-based assay has a limited read-length of around 10–16 nucleotides, although it has found widespread use for low-to-medium throughput genotyping.
1.2.3 The 454 GS-20 Sequencer
In 2004, pyrosequencing transitioned onto a massively parallel DNA sequencing instrument that achieved commercial release by 454 Life Sciences, Inc. [35]. The instrument embodies several novel approaches to DNA sequencing that tremendously streamline and parallelize the process from library construction through sequence detection, as described below. In this system, a genomic library is made by fragmenting the genomic DNA to about 500 bp, repairing the ends and ligating two 454-specific linkers to each genomic fragment. These fragments are coupled to Sepharose beads carrying covalently linked oligonucleotides complementary to the fragment library’s ligated linkers. This bead/DNA mixture is emulsified in an oil suspension containing aqueous PCR reactants, enabling the amplification of millions of unique fragment-bead combinations in a large-batch PCR format. The Sepharose beads that contain amplified DNA are prepared for the sequencing reaction by denaturation of the unattached strand and annealing of a sequencing primer, and then are pipetted into a PicoTiterPlate (PTP) device. The PTP is composed of hundreds of thousands of fused fiber optic strands, the ends of which are hollowed out to a diameter sufficient to contain a single Sepharose bead. Smaller magnetic beads, to which pyrosequencing (sulfurylase and luciferase) enzymes are covalently attached, are added into the PTP. The PTP fits into a flow-cell device that positions it against a high-sensitivity CCD camera in the 454 GS-20 sequencing instrument. Pyrosequencing follows, whereby sequential flows of each dNTP, separated by an imaging step and a wash step, take place. At each well address in the PTP, the incorporation of one or more nucleotides into the synthesized strand on each bead is captured by the CCD camera, which records positional information about each well address throughout the sequencing process. 
A post-run bioinformatic pipeline processes the raw pyrosequencing data into approximately 200,000 sequencing reads of about 100 bp each. Recent improvements to the 454 system have enabled increased read-lengths, averaging around 400,000 reads of 250-bp read-length per 7-hour run.
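The per-run yields implied by these figures follow from simple arithmetic, sketched below in Python. The read counts, read lengths, and 7-hour run time are the approximate values quoted above; the function names are illustrative only.

```python
def run_yield_bp(reads: int, read_length_bp: int) -> int:
    """Approximate bases per run: number of reads times average read length."""
    return reads * read_length_bp


def mb_per_hour(yield_bp: int, run_hours: float) -> float:
    """Average throughput in megabases per hour of instrument time."""
    return yield_bp / 1e6 / run_hours


# Figures quoted in the text (approximate).
initial = run_yield_bp(200_000, 100)    # original GS-20 pipeline output
improved = run_yield_bp(400_000, 250)   # after read-length improvements

if __name__ == "__main__":
    print(f"initial:  {initial / 1e6:.0f} Mb per run")
    print(f"improved: {improved / 1e6:.0f} Mb per run")
    print(f"improved throughput: {mb_per_hour(improved, 7):.1f} Mb/hour")
```

Under these figures the improved configuration yields about five times as many bases per run as the initial one.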
1.2.4 Novel Applications Enabled by Massively Parallel Pyrosequencing
The advent of massively parallel sequencing, ushered in by the availability of 454 pyrosequencing, has enabled genome scientists to pursue applications for DNA sequencing at levels that heretofore were often not possible, due to cost and timeline. These include SAGE profiling [36], cDNA sequencing [V. Magrini, personal communication, 2006], metagenomics [37], nucleosome positioning [38], and others. The increasing efficiency of the instrument, coupled with the availability of paired end read sequencing, predicts that the instrument will continue to inspire scientists to more and varied ingenuity in 454-based approaches.
1.3 Massively Parallel Sequencing by Other Approaches
1.3.1 Sequencing by Synthesis with Reversible Terminators
In addition to the 454 pyrosequencing chemistry, several companies and academic groups are working to develop massively parallel sequencing instrumentation that uses reversible dye terminator chemistry for DNA sequencing. To date only one has achieved commercial availability: the platform offered by Solexa, Ltd. (now a wholly owned subsidiary of Illumina, Inc.). Although this platform has an initial library construction procedure similar to that outlined for 454, once linker-ligated genomic fragments are produced, they are amplified in situ following hybridization to a complementary oligo that is covalently linked to a glass slide (“flow cell”) surface. The amplified fragments, or clusters, are denatured, annealed with a sequencing primer, and placed onto the sequencing instrument for sequencing by synthesis (SBS) using 3’-blocked fluorescent-labeled deoxynucleotides. Each synthesis step includes single nucleotide incorporation, washing to remove unincorporated nucleotides, imaging of the entire flow cell, deblocking of the 3’-OH ends of each synthesized strand, and washing. Using this approach, at present a 4-day run on the Solexa instrument results in around 50 to 60 million sequences of 40 to 50 base pairs each, for a total of around 1 Gb of sequence (following data quality filtering).
1.3.2 Ligation-Based Sequencing
An important variant to polymerase-based methods for massively parallel sequencing is an approach that depends upon the high specificity of DNA ligase to mediate the sequencing of genomic fragments. This approach builds on previous genotyping methods such as the ligation chain reaction (LCR) [39–41] and oligonucleotide ligation assay (OLA) [42] that rely on the specificity of DNA ligase to join the DNA backbone. Ligation-based sequencing is covered in great detail later in this book, and will form the core technology for a next
generation sequencing instrument scheduled to be introduced in 2007 by Applied Biosystems [H. Fiske, personal communication, 2007].
1.3.3 Sequencing by Hybridization
The same technology that has enabled genome-wide surveys of gene expression by the hybridization of single-stranded cDNA copies of messenger RNA species [43] more recently has been utilized for a variant of genome resequencing typically referred to as “sequencing by hybridization.” Although the approach was originally described early in the brief history of genome sequencing [44–47], methods to generate arrays of oligonucleotides of sufficient depth to address even single human chromosomes were not technologically feasible at that time, and so the method was not applicable to most genomes of interest. More recently, sequencing by hybridization has taken the form of whole genome genotyping of SNPs, using a variety of commercially available oligonucleotide-based platforms, such as those from Illumina, Affymetrix, and Nimblegen [48–50]. Since the technology to create ever-increasing densities of oligonucleotides on solid supports is rapidly improving and costs are falling, sequencing by hybridization offers an attractive first pass at characterizing genomes in advance of sequencing. Methods to extract additional information from genome-wide SNP typing microarrays are now enhancing the data value from these experiments by offering information about large-scale amplifications and deletions, as well as loss of heterozygosity (LOH). Additional historical and practical aspects of sequencing by hybridization are covered later in this book.
1.4 Survey of Future Massively Parallel Sequencing Methods
Several DNA sequencing approaches of interest that will likely not be available in the near term bear mentioning in this introduction because of the incredible potential they may offer for genome sequencing at throughputs one to two orders of magnitude higher than the methods discussed in this book, as well as at high accuracy and high sequence contiguity. Concomitantly, they are the highest risk approaches presently being pursued in genome sequencing technology development.
1.4.1 Sequencing Within a Zero-Mode Waveguide
The basis for DNA sequencing in a zero-mode waveguide was first spelled out in a seminal paper by Levene et al. [51]. This paper described a sequencing reaction and detection environment, the zero-mode waveguide, that more closely correlates with the observation volumes necessary for many single-molecule detection technologies while enabling substrate concentrations to be held
optimal for the enzymological assay being performed. As described, zero-mode waveguides would be formed by electron beam lithography, followed by reactive ion etching in a metal film deposited on a microscope coverslip (in other words, the thin 1 × 1-inch glass typically placed over samples on a microscope slide). Since each coverslip could potentially contain millions of waveguides, the resulting assay would have the potential of massive parallelism. For direct observations of single molecules, enzymes (such as DNA polymerase) could be adsorbed onto the bottom of the waveguides and provided with necessary substrates and reactants, while being monitored from below by a microscope objective that both provides the necessary illumination and collects the emitted light. Since the waveguide provides a limited illumination volume, it is much more likely that the signal of a nucleotide in the active site of the enzyme would be detected as opposed to freely diffusing labeled nucleotides. Although the latter will provide a background fluorescence due to weakly illuminated molecules, it will be essentially constant and therefore readily subtracted. Using fluorescence-coupled spectroscopy, the authors were able to observe the incorporation of coumarin-dCTP into an M13 single-stranded DNA template by immobilized mutant T7 polymerase [51]. This work has set the stage for an attempt to commercialize the zero-mode waveguide technology for DNA sequencing (and perhaps other applications) by Pacific Biosciences (Sunnyvale, California). The potential advantages of this approach for DNA sequencing include very long read-lengths of unlabeled single-template molecules with very high accuracy, in a highly multiplexed fashion. Although technically challenging, this is one approach that could potentially revolutionize DNA sequencing and its routine application in biology as well as in diagnostic and prognostic medicine.
1.4.2 Nanopore Sequencing Approaches
The Coulter counter [52] was no doubt the inspiration for the use of nanopores to sequence DNA, since it works to separate electrolyte solution-suspended particles by drawing them through a channel between two reservoirs. As a particle enters the channel, it increases the electrical impedance of the channel and therefore produces a current drop when a voltage is applied across the channel. Current research on nanopore-based sequencing utilizes one of two types of nanopores: protein-based or synthetic. Much of the initial nanopore DNA sequencing work was performed using the bacterial protein α-hemolysin [53], a transmembrane protein that inserts into the lipid bilayer. However, these studies also have largely demonstrated that α-hemolysin pores are limited by size, variation, and stability. Hence, synthetic nanopores are currently being widely investigated for DNA-based analysis, using a wide variety of detection approaches [54–56]. The promise of nanopores includes single-molecule sequencing with
very long read-lengths, potentially without the requirement for labeling the DNA. Thousands to millions of nanopores could be contained in a single device, allowing a very high sequencing capacity in a very low cost device. The drawbacks of nanopores for sequencing so far include low sensitivity to detect signals that can be used to distinguish the identity of individual nucleotides, and difficulty forcing DNA through the pores in a uniformly single-stranded fashion, due to its tendency to form hairpins and other structures. If these limitations can be overcome, however, the possible application of nanopores to very inexpensive, very long read sequencing would undoubtedly usher in a revolution in DNA sequencing.
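The Coulter-counter principle that opens this section, in which a particle in the channel raises impedance and lowers the measured current, can be sketched as a simple threshold detector in Python. This is only an illustration of the detection idea, not a nanopore base-calling method: the function name, the fixed baseline, and the 20% drop threshold are all assumptions chosen for the example.

```python
from typing import List, Tuple


def find_blockades(current: List[float], baseline: float,
                   drop_fraction: float = 0.2) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of current drops below threshold.

    A sample counts as blocked when it falls below
    baseline * (1 - drop_fraction); `end` is exclusive.
    """
    threshold = baseline * (1.0 - drop_fraction)
    events: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(current):
        if value < threshold:
            if start is None:
                start = i          # a blockade begins
        elif start is not None:
            events.append((start, i))  # the blockade has ended
            start = None
    if start is not None:
        # Trace ended while the channel was still blocked.
        events.append((start, len(current)))
    return events


if __name__ == "__main__":
    trace = [100, 100, 70, 65, 100, 100, 60, 100]
    print(find_blockades(trace, baseline=100.0))
```

Counting and sizing particles requires only the depth and duration of such events. Distinguishing individual nucleotides would require resolving much smaller differences in signal, which is the sensitivity limitation noted above.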
References
[1] Sanger, F., S. Nicklen, and A. R. Coulson, “DNA Sequencing with Chain-Terminating Inhibitors,” Proc. Natl. Acad. Sci. USA, Vol. 74, No. 12, 1977, pp. 5463–5467.
[2] Smith, L. M., et al., “Fluorescence Detection in Automated DNA Sequence Analysis,” Nature, Vol. 321, No. 6071, 1986, pp. 674–679.
[3] Lander, E. S., et al., “Initial Sequencing and Analysis of the Human Genome,” Nature, Vol. 409, No. 6822, 2001, pp. 860–921.
[4] International Human Genome Sequencing Consortium, “Finishing the Euchromatic Sequence of the Human Genome,” Nature, Vol. 431, No. 7011, 2004, pp. 931–945.
[5] Akey, J. M., et al., “Population History and Natural Selection Shape Patterns of Genetic Variation in 132 Genes,” PLoS Biol., Vol. 2, No. 10, 2004, p. e286.
[6] Livingston, R. J., et al., “Pattern of Sequence Variation Across 213 Environmental Response Genes,” Genome Res., Vol. 14, No. 10A, 2004, pp. 1821–1831.
[7] Wilson, R. K., et al., “Mutational Profiling in the Human Genome,” Cold Spring Harb. Symp. Quant. Biol., Vol. 68, 2003, pp. 23–29.
[8] Fullerton, S. M., et al., “Apolipoprotein E Variation at the Sequence Haplotype Level: Implications for the Origin and Maintenance of a Major Human Polymorphism,” Am. J. Hum. Genet., Vol. 67, No. 4, 2000, pp. 881–900.
[9] Nickerson, D. A., et al., “Sequence Diversity and Large-Scale Typing of SNPs in the Human Apolipoprotein E Gene,” Genome Res., Vol. 10, No. 10, 2000, pp. 1532–1545.
[10] Rieder, M. J., and D. A. Nickerson, “Hypertension and Single Nucleotide Polymorphisms,” Curr. Hypertens. Rep., Vol. 2, No. 1, 2000, pp. 44–49.
[11] Zhu, X., et al., “Localization of a Small Genomic Region Associated with Elevated ACE,” Am. J. Hum. Genet., Vol. 67, No. 5, 2000, pp. 1144–1153.
[12] Levine, R. L., et al., “Activating Mutation in the Tyrosine Kinase JAK2 in Polycythemia Vera, Essential Thrombocythemia, and Myeloid Metaplasia with Myelofibrosis,” Cancer Cell, Vol. 7, No. 4, 2005, pp. 387–397.
[13] Thomas, R. K., et al., “Detection of Oncogenic Mutations in the EGFR Gene in Lung Adenocarcinoma with Differential Sensitivity to EGFR Tyrosine Kinase Inhibitors,” Cold Spring Harb. Symp. Quant. Biol., Vol. 70, 2005, pp. 73–81. [14] Sachidanandam, R., et al., “A Map of Human Genome Sequence Variation Containing 1.42 Million Single Nucleotide Polymorphisms,” Nature, Vol. 409, No. 6822, 2001, pp. 928–933. [15] The International Human Genome Sequencing Consortium, “A Haplotype Map of the Human Genome,” Nature, Vol. 437, No. 7063, 2005, pp. 1299–1320. [16] Velculescu, V. E., B. Vogelstein, and K. W. Kinzler, “Analysing Uncharted Transcriptomes with SAGE,” Trends Genet., Vol. 16, No. 10, 2000, pp. 423–425. [17] Funaguma, S., et al., “SAGE Analysis of Early Oogenesis in the Silkworm, Bombyx Mori,” Insect Biochem. Mol. Biol., Vol. 37, No. 2, 2007, pp. 147–154. [18] McIntosh, S., et al., “SAGE of the Developing Wheat Caryopsis,” Plant Biotechnol. J., Vol. 5, No. 1, 2007, pp. 69–83. [19] Gibbings, J. G., et al., “Global Transcript Analysis of Rice Leaf and Seed Using SAGE Technology,” Plant Biotechnol. J., Vol. 1, No. 4, 2003, pp. 271–285. [20] Gowda, M., et al., “Deep and Comparative Analysis of the Mycelium and Appressorium Transcriptomes of Magnaporthe Grisea Using MPSS, RL-SAGE, and Oligoarray Methods,” BMC Genomics, Vol. 7, 2006, p. 310. [21] Berthier, D., et al., “Bovine Transcriptome Analysis by SAGE Technology During an Experimental Trypanosoma Congolense Infection,” Ann. NY Acad. Sci., Vol. 1081, 2006, pp. 286–299. [22] Rosinski-Chupin, I., et al., “SAGE Analysis of Mosquito Salivary Gland Transcriptomes During Plasmodium Invasion,” Cell Microbiol., Vol. 9, No. 3, 2007, pp. 708–724. [23] Tuzun, E., et al., “Fine-Scale Structural Variation of the Human Genome,” Nat. Genet., Vol. 37, No. 7, 2005, pp. 727–732. [24] Volik, S., et al., “End-Sequence Profiling: Sequence-Based Analysis of Aberrant Genomes,” Proc. Natl. Acad. Sci. USA, Vol. 100, No. 13, 2003, pp. 
7696–7701. [25] Pfeifer, D., et al., “Genome-Wide Analysis of DNA Copy Number Changes and LOH in CLL Using High-Density SNP Arrays,” Blood, Vol. 109, No. 3, 2007, pp. 1202–1210. [26] Komura, D., et al., “Genome-Wide Detection of Human Copy Number Variations Using High-Density DNA Oligonucleotide Arrays,” Genome Res., Vol. 16, No. 12, 2006, pp. 1575–1584. [27] Redon, R., et al., “Global Variation in Copy Number in the Human Genome,” Nature, Vol. 444, No. 7118, 2006, pp. 444–454. [28] Kotliarov, Y., et al., “High-Resolution Global Genomic Survey of 178 Gliomas Reveals Novel Regions of Copy Number Alteration and Allelic Imbalances,” Cancer Res., Vol. 66, No. 19, 2006, pp. 9428–9436.
[29] Nyren, P., B. Pettersson, and M. Uhlen, “Solid Phase DNA Minisequencing by an Enzymatic Luminometric Inorganic Pyrophosphate Detection Assay,” Anal. Biochem., Vol. 208, No. 1, 1993, pp. 171–175. [30] Ahmadian, A., et al., “Single-Nucleotide Polymorphism Analysis by Pyrosequencing,” Anal. Biochem., Vol. 280, No. 1, 2000, pp. 103–110. [31] Ahmadian, A., et al., “Analysis of the p53 Tumor Suppressor Gene by Pyrosequencing,” Biotechniques, Vol. 28, No. 1, 2000, pp. 140–4, 146–7. [32] Ronaghi, M., et al., “PCR-Introduced Loop Structure as Primer in DNA Sequencing,” Biotechniques, Vol. 25, No. 5, 1998, pp. 876–878, 880–882, 884. [33] Ronaghi, M., M. Uhlen, and P. Nyren, “A Sequencing Method Based on Real-Time Pyrophosphate,” Science, Vol. 281, No. 5375, 1998, pp. 363, 365. [34] Ronaghi, M., et al., “Real-Time DNA Sequencing Using Detection of Pyrophosphate Release,” Anal. Biochem., Vol. 242, No. 1, 1996, pp. 84–89. [35] Margulies, M., et al., “Genome Sequencing in Microfabricated High-Density Picolitre Reactors,” Nature, Vol. 437, No. 7057, 2005, pp. 376–380. [36] Bainbridge, M. N., et al., “Analysis of the Prostate Cancer Cell Line LNCaP Transcriptome Using a Sequencing-by-Synthesis Approach,” BMC Genomics, Vol. 7, 2006, p. 246. [37] Turnbaugh, P. J., et al., “An Obesity-Associated Gut Microbiome with Increased Capacity for Energy Harvest,” Nature, Vol. 444, No. 7122, 2006, pp. 1027–1031. [38] Johnson, S. M., et al., “Flexibility and Constraint in the Nucleosome Core Landscape of Caenorhabditis Elegans Chromatin,” Genome Res., Vol. 16, No. 12, 2006, pp. 1505–1516. [39] Wiedmann, M., et al., “Ligase Chain Reaction (LCR)—Overview and Applications,” PCR Methods Appl., Vol. 3, No. 4, 1994, pp. S51–S64. [40] Wu, D. Y., and R. B. Wallace, “The Ligation Amplification Reaction (LAR)—Amplification of Specific DNA Sequences Using Sequential Rounds of Template-Dependent Ligation,” Genomics, Vol. 4, No. 4, 1989, pp. 560–569. 
[41] Barany, F., “The Ligase Chain Reaction in a PCR World,” PCR Methods Appl., Vol. 1, No. 1, 1991, pp. 5–16. [42] Nickerson, D. A., et al., “Automated DNA Diagnostics Using an ELISA-Based Oligonucleotide Ligation Assay,” Proc. Natl. Acad. Sci. USA, Vol. 87, No. 22, 1990, pp. 8923–8927. [43] Brown, P. O., and D. Botstein, “Exploring the New World of the Genome with DNA Microarrays,” Nat. Genet., Vol. 21, No. 1, Suppl., 1999, pp. 33–37. [44] Drmanac, R., et al., “DNA Sequence Determination by Hybridization: A Strategy for Efficient Large-Scale Sequencing,” Science, Vol. 260, No. 5114, 1993, pp. 1649–1652. [45] Lipshutz, R. J., “Likelihood DNA Sequencing by Hybridization,” J. Biomol. Struct. Dyn., Vol. 11, No. 3, 1993, pp. 637–653.
[46] Gunderson, K. L., et al., “Mutation Detection by Ligation to Complete N-Mer DNA Arrays,” Genome Res., Vol. 8, No. 11, 1998, pp. 1142–1153. [47] Broude, N. E., et al., “Enhanced DNA Sequencing by Hybridization,” Proc. Natl. Acad. Sci. USA, Vol. 91, No. 8, 1994, pp. 3072–3076. [48] Lipshutz, R. J., et al., “Using Oligonucleotide Probe Arrays to Access Genetic Diversity,” Biotechniques, Vol. 19, No. 3, 1995, pp. 442–447. [49] Wang, D. G., et al., “Large-Scale Identification, Mapping, and Genotyping of Single-Nucleotide Polymorphisms in the Human Genome,” Science, Vol. 280, No. 5366, 1998, pp. 1077–1082. [50] Albert, T. J., et al., “Light-Directed 5’→3’ Synthesis of Complex Oligonucleotide Microarrays,” Nucleic Acids Res., Vol. 31, No. 7, 2003, p. e35. [51] Levene, M. J., et al., “Zero-Mode Waveguides for Single-Molecule Analysis at High Concentrations,” Science, Vol. 299, No. 5607, 2003, pp. 682–686. [52] DeBlois, R. W., “Counting and Sizing of Submicron Particles by Resistive Pulse Technique,” Rev. Sci. Instrum., Vol. 41, 1970, p. 909. [53] Song, L., et al., “Structure of Staphylococcal Alpha-Hemolysin: A Heptameric Transmembrane Pore,” Science, Vol. 274, No. 5294, 1996, pp. 1859–1866. [54] Heng, J. B., et al., “The Electromechanics of DNA in a Synthetic Nanopore,” Biophys. J., Vol. 90, No. 3, 2006, pp. 1098–1106. [55] Meller, A., et al., “Rapid Nanopore Discrimination Between Single Polynucleotide Molecules,” Proc. Natl. Acad. Sci. USA, Vol. 97, No. 3, 2000, pp. 1079–1084. [56] Fologea, D., et al., “Detecting Single Stranded DNA with a Solid State Nanopore,” Nano. Lett., Vol. 5, No. 10, 2005, pp. 1905–1909.
2 Array-Based Pyrosequencing Technology Baback Gharizadeh, Roxana Jalili, and Mostafa Ronaghi
With the completion of the draft human genome sequence, we are entering a new era in the biological sciences, with DNA sequencing as one of the main catalysts. In the DNA sequencing field, pyrosequencing has emerged as a technology for de novo, high-throughput whole-genome sequencing. The method is based on the principle of sequencing-by-synthesis: pyrophosphate released during nucleotide incorporation is detected through a series of enzymatic reactions that generate luminescent sequence peak signals. Pyrosequencing is used for applications ranging from single-nucleotide analysis to whole-genome sequencing, and the method is commercially available for low-throughput sequencing from Biotage and for high-throughput sequencing from 454 Life Sciences (currently owned by Roche). In this chapter we describe the methodologies and applications, and discuss current developments that would decrease the cost of sequencing by another two orders of magnitude within the next 5 years.
2.1 Introduction
DNA sequencing is a key tool for determining nucleic acid sequences and is applied across the biosciences, from single-nucleotide polymorphism (SNP) genotyping to whole-genome sequencing. Because of its widespread applications, many disciplines have benefited from DNA sequencing, from molecular biology, medicine, diagnostics, genetics, biotechnology, pharmacology, and forensics to archeology and anthropology. Moreover, DNA sequencing is promoting new discoveries that are revolutionizing the conceptual foundations of many fields.
High-throughput, affordable DNA-sequencing techniques would undoubtedly continue the revolution initiated by the Human Genome Project. Recent advances in DNA-sequencing technologies have accelerated the detailed analysis of genomes from many organisms, and complete or draft genome sequences have been reported for several well-studied multicellular organisms. The Human Genome Project was made achievable by a reduction in DNA sequencing cost of three orders of magnitude; a further reduction of two to three orders of magnitude would launch a new era of DNA sequencing applications, from short DNA reads to whole-genome sequencing.
The chain termination sequencing method, also known as Sanger sequencing, developed by Frederick Sanger and colleagues [1], has been the most widely used sequencing method since its advent in 1977 and is still in extensive use. Remarkable advances in chemistry, automation, and data acquisition have made Sanger sequencing a simple and elegant technique, central to almost all past and current genome-sequencing projects of any significant scale. Despite these advantages, the method has limitations, which other techniques could complement.
Among the current state-of-the-art DNA-sequencing techniques, pyrosequencing [2, 3] has emerged and is being used for a wide variety of applications. When it was introduced in 1997, the method was restricted to SNP genotyping [4] and short reads [5, 6], but it is now used for broader applications. The original pyrosequencing method is based on conventional PCR and a four-enzyme system that sequences 96 samples at a time (http://www.biotage.com). The method was later developed into a high-throughput microfluidic format by 454 Life Sciences (http://www.454.com), which was recently acquired by Roche. Several groups are working on further development of this technology for different applications, which will be discussed in this chapter.
2.2 Pyrosequencing Chemistry
Pyrosequencing technology is based on the sequencing-by-synthesis principle and employs a cascade of four enzymes to accurately determine nucleic acid sequences during DNA synthesis. In pyrosequencing, the sequencing primer is hybridized to a single-stranded, biotin-labeled DNA template (post-PCR, alkali treated) and mixed with the enzymes DNA polymerase, ATP sulfurylase, luciferase, and apyrase, and the substrates adenosine 5’ phosphosulfate (APS), luciferin, and deoxynucleotide triphosphates (dNTPs). The four dNTPs are dispensed into the reaction mixture separately, in repeated cycles. After each nucleotide
dispensation, DNA polymerase catalyzes incorporation of the dNTP into the growing strand if it is complementary to the next base of the template. Each nucleotide incorporation event is followed by release of inorganic pyrophosphate (PPi) in a quantity equimolar to the amount of incorporated nucleotide. ATP sulfurylase quantitatively converts PPi to ATP in the presence of APS. The generated ATP drives the luciferase-mediated conversion of luciferin to oxyluciferin, producing visible light in amounts proportional to the amount of ATP. The light produced in the luciferase-catalyzed reaction is then detected by a photon-detection device such as a charge-coupled device (CCD) camera, CMOS image sensor, or photomultiplier tube. The generated light is observed as a peak signal in the pyrogram or flowgram (corresponding to the electropherogram in dideoxy sequencing). The height of each signal peak is proportional to the number of nucleotides incorporated (e.g., incorporation of three consecutive dCTP nucleotides generates a peak three times as high as a single incorporation). Apyrase is a nucleotide-degrading enzyme that continuously degrades ATP and nonincorporated dNTPs in the reaction mixture. A time interval is left between nucleotide dispensations to allow complete degradation; for this reason, dNTPs are added one at a time. During this synthesis process, the DNA strand is extended by complementary nucleotides, and the DNA sequence is displayed as signal peaks in a pyrogram on a computer monitor. Base calling is performed by integrated software, which also provides features for SNP and sequence analysis.
Pyrosequencing was initially limited to sequencing short stretches of DNA because of the inhibition of apyrase. Natural dATP is a substrate for luciferase and produces false sequence signals, so dATP was replaced by dATP-α-S [2]. Higher concentrations of this nucleotide, however, inhibited the catalytic activity of apyrase, causing nonsynchronized extension. dATP-α-S consists of two isomers, Sp and Rp. The Rp isomer is not a substrate for DNA polymerase and is therefore not incorporated; its presence in the sequencing reaction simply inhibits apyrase activity. By using the pure Sp isomer of dATP-α-S, substantially longer reads were achieved [7]. This improvement had a major impact on pyrosequencing read length: it allowed sequencing of more than one hundred bases and opened avenues for numerous applications.
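The dispensation-and-detection cycle described in this section can be illustrated with a short Python sketch. It is an idealized model rather than the instrument's base-calling software: it assumes noise-free peaks whose height equals the homopolymer length, and the function names and the default dispensation order `TACG` are illustrative assumptions, not taken from the chapter.

```python
def simulate_flowgram(synthesized, flow_order="TACG"):
    """Return one (nucleotide, peak_height) pair per dispensation.

    `synthesized` is the strand built by the polymerase (5'->3'), that is,
    the complement of the template. Idealized assumption: the peak height
    equals the number of identical nucleotides incorporated in that flow.
    """
    unknown = set(synthesized) - set(flow_order)
    if unknown:
        # Without this check a base that is never dispensed would stall the loop.
        raise ValueError(f"bases never dispensed: {sorted(unknown)}")
    peaks = []
    pos = 0
    flow = 0
    while pos < len(synthesized):
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        # Incorporate every consecutive base that matches the dispensed dNTP.
        while pos < len(synthesized) and synthesized[pos] == nucleotide:
            run += 1
            pos += 1
        peaks.append((nucleotide, run))  # run == 0 means no light was emitted
        flow += 1
    return peaks


def call_bases(peaks):
    """Rebuild the synthesized sequence from idealized peak heights."""
    return "".join(nucleotide * height for nucleotide, height in peaks)


if __name__ == "__main__":
    peaks = simulate_flowgram("TTACCCG")
    print(peaks)
    print(call_bases(peaks))
```

In this model a homopolymer such as CCC yields a single peak of height 3, which is why accurate peak quantification matters for longer homopolymer stretches.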
2.3 Array-Based Pyrosequencing
Pyrosequencing has been developed into a massively parallel microfluidic sequencing platform [8] by 454 Life Sciences Corporation (Branford, Connecticut). The first-generation 454 sequencing platform (GS20) sequences 100 bases on average and generates between 20 and 50 megabases of raw DNA sequence (depending on the protocol) in less than 5 hours. The
company has recently released the GS FLX platform, which produces longer reads of 200–300 bases and generates over 100 megabases per run. On the same platform, read-length distributions centered around 500 nucleotides have also been achieved. This read length will open the door to direct shotgun sequencing of complex genomes, circumventing laborious library construction for de novo genome sequencing.
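The run sizes quoted in this section imply an approximate number of reads per run, which the following sketch works out. The 250-base read length used for the GS FLX is an assumed midpoint of the quoted 200–300 range, and 100 megabases is treated as a lower bound; the function name is illustrative.

```python
def reads_per_run(bases_per_run, mean_read_length):
    """Approximate number of reads needed to yield `bases_per_run` bases."""
    return round(bases_per_run / mean_read_length)


# GS20: 20-50 Mb of raw sequence at about 100 bases per read.
gs20_low = reads_per_run(20_000_000, 100)
gs20_high = reads_per_run(50_000_000, 100)

# GS FLX: over 100 Mb per run; 250 bases is an assumed midpoint of 200-300.
gs_flx = reads_per_run(100_000_000, 250)

print(gs20_low, gs20_high, gs_flx)
```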
2.4 454 Sequencing Chemistry
The 454 sequencing method uses a three-enzyme pyrosequencing chemistry in a microfluidic format (apyrase is not used except in the washing buffer). The detection enzymes are immobilized on beads, and the substrates luciferin, APS, and nucleotides are flowed into the system. For sequencing of large fragments and whole genomes, 454 sequencing consists of four steps.
1. Library preparation. Whole-genomic DNA or large fragments (usually over 2 kb) are nebulized by high-pressure nitrogen gas into small fragments, preferably with the majority in the 300–800 bp range. The fragmented DNA is then blunt-ended and polished by DNA end repair for adapter ligation. There are two 44-mer double-stranded adapters, blunt-ended on one side, each consisting of a 20-base PCR primer, a 20-base sequencing primer, and a 4-base key sequence (for initial sequence quality monitoring); one of the adapters is biotin-labeled. The adapters provide sequences for both amplification and sequencing of the immobilized DNA fragments. After ligation and nick repair, the non-biotin-labeled fragments are separated, made single-stranded by alkali treatment, and used as the DNA library.
2. Emulsion PCR. The ssDNA library is hybridized to complementary strands immobilized on sepharose capture beads and added to the emulsion oil and PCR reaction mixture, which are then mixed on a high-speed shaker to form a water-in-oil emulsion. The theoretical ratio of beads to ssDNA molecules is 1:1 for clonal amplification. After the PCR reaction, the microreactors are broken and the beads are captured by filtration. The beads carrying biotin-labeled amplicons are then enriched using streptavidin magnetic beads, and the amplicons are made single-stranded.
3. Sequencing. The ssDNA-positive beads (23 µm in diameter) are incubated with DNA polymerase and single-strand binding protein (SSB) and then distributed into the wells of an optical faceplate called the PicoTiterPlate, which contains 1.6 million wells (each well is 44 µm in diameter and holds one bead). After the DNA beads and enzyme beads (ATP sulfurylase and luciferase) are added, packing beads are layered onto the wells and the plate is centrifuged for bead deposition. The packing beads keep the DNA beads in place during sequencing and minimize DNA bead loss during liquid flow. The PicoTiterPlate is then placed into the instrument for sequencing.
4. Data analysis. The generated sequence signals are recorded as images. Signal processing, base calling, assembly, mapping, and variant detection are performed by integrated software.
For amplicon sequencing, library preparation and adapter ligation are not needed. Instead, 19-mer sequences are added to the PCR primers and function as amplification and sequencing primers. Amplicons can be sequenced unidirectionally or bidirectionally.
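The 4-base key sequence mentioned in the library preparation step lends itself to a simple quality gate on raw reads. The sketch below is only an illustration of that idea: the key value `TCAG` is an assumed example (the chapter does not give the sequence), and the function names are hypothetical rather than part of the 454 software.

```python
# Adapter layout as described in the text: 20 + 20 + 4 = 44 bases.
ADAPTER_LAYOUT = {"pcr_primer": 20, "sequencing_primer": 20, "key": 4}
assert sum(ADAPTER_LAYOUT.values()) == 44

KEY = "TCAG"  # assumed example key; not specified in the chapter


def passes_key_check(read, key=KEY):
    """True if the read begins with the expected key sequence."""
    return read.upper().startswith(key)


def trim_key(read, key=KEY):
    """Remove the leading key from a read that passed the key check."""
    if not passes_key_check(read, key):
        raise ValueError("read does not start with the expected key")
    return read[len(key):]


def filter_reads(reads, key=KEY):
    """Keep only reads that start with the key, with the key trimmed off."""
    return [trim_key(read, key) for read in reads if passes_key_check(read, key)]


if __name__ == "__main__":
    print(filter_reads(["TCAGACGT", "GGGGACGT", "tcagTTTT"]))
```

A production pipeline would work on flow signals rather than called bases, so this check is best read as a conceptual sketch.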
2.5 Applications of 454 Sequencing Technology
The 454 sequencing technology is the only commercially available technology that can be used for high-throughput de novo sequencing with read lengths as long as 200–300 bases, and a 500-base sequencing platform is in the pipeline. Its applications can generally be grouped into whole-genome sequencing, broad sequencing, and deep sequencing.
2.5.1 Whole-Genome Sequencing
454 sequencing has been widely used for whole-genome sequencing, generally of small genomes such as bacterial genomes sequenced de novo [9–11] and BAC/PAC/cosmid/fosmid clones [12, 13], mainly because they are easy to assemble. Because of the large number of repeat regions in mammalian genomes, the 100-base GS20 has not been as suitable as Sanger sequencing for complex genomes. The longer-read FLX platform (and the coming generation producing 500-base reads), along with paired-end technology [14] and its upcoming improvements, is making the technique suitable for sequencing complex genomes. 454 sequencing is now being used in evolutionary anthropology by the Max Planck Institute to sequence the Neanderthal genome [15, 16], which is among the closest to the human genome. Sequencing the ancient Neanderthal genome is especially challenging because the DNA is highly fragmented, and it has only just been made possible by pyrosequencing. This sequencing project is currently taking place at the 454 Sequencing Center. Moreover, the whole genome of a Nobel Prize–winning scientist is also being sequenced at the same center.
2.5.2 Ultrabroad Sequencing
454 sequencing is currently being used for many different applications, such as sequencing large pools of cDNA [17–19], small RNA and micro-RNA [20–23], ditag/SAGE libraries [14, 24], and other amplicon pools [25], as well as paleogenomics and metagenomics [26–29]. Moreover, DNA tags or barcodes [30] can substantially reduce the cost of sequencing by allowing en masse sequencing of pooled amplicons from different samples. Bioinformatic tools can then sort the amplicon sequences by their identification tags or barcodes.
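Sorting pooled reads by barcode can be sketched as follows. This is a minimal illustration that assumes each barcode sits at the very start of the read, that matching is exact, and that no barcode is a prefix of another; the barcode sequences and sample names are made up for the example.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by sample using an exact barcode match at the read start.

    `barcodes` maps barcode sequence -> sample name. Reads without a known
    barcode are collected under "unassigned". The barcode is trimmed from
    each assigned read.
    """
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            bins["unassigned"].append(read)
    return dict(bins)


if __name__ == "__main__":
    barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}  # hypothetical tags
    reads = ["ACGTGGGAAA", "TGCACCCTTT", "ACGTTTTT", "NNNNAAAA"]
    print(demultiplex(reads, barcodes))
```

Real barcode sets are usually designed so that a single sequencing error cannot turn one tag into another; tolerant matching of that kind is left out here for brevity.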
2.5.3 Ultradeep Amplicon Sequencing
Ultradeep sequencing allows detection of low-frequency mutations [30–32]. This is generally not possible with conventional Sanger sequencing, which can only detect mutations present at frequencies down to about 20%. Screening for low-frequency mutations is important for many studies, such as those of drug or antibiotic resistance, and for finding novel mutations in heterogeneous samples.
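Why read depth matters for low-frequency mutations can be made concrete with a simple binomial model. The sketch below assumes reads sample the amplicon pool independently and ignores sequencing error, both of which are simplifications; the function name and the example numbers are illustrative and not taken from the chapter.

```python
from math import comb


def prob_at_least(k, depth, freq):
    """P(X >= k) for X ~ Binomial(depth, freq).

    X models the number of reads carrying a variant that is present at
    fraction `freq` in the sample when `depth` reads cover the position.
    """
    below = sum(
        comb(depth, i) * freq**i * (1 - freq) ** (depth - i) for i in range(k)
    )
    return 1.0 - below


if __name__ == "__main__":
    # Chance of seeing a 1% variant in at least 5 reads at two depths.
    for depth in (100, 1000):
        print(depth, prob_at_least(5, depth, 0.01))
```

Under these assumptions, raising the depth from 100 to 1,000 reads turns a 1% variant from something rarely seen in five or more reads into something seen almost every time.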
2.6 Advantages and Challenges
The 454 sequencing technology is the only commercially available high-throughput platform for de novo sequencing at lower cost than Sanger sequencing. It relies on clonal amplification, which allows sequencing of unclonable regions. Moreover, GC content is generally not an issue in pyrosequencing [33]. The technology can detect low-frequency mutations that cannot be detected by Sanger sequencing [32], which makes it a valuable tool for clinical cancer and drug-resistance research. The 454 sequencing has been limited to an average read length of 100 bases; however, with new developments in chemistry and software algorithms, the read length is being extended to 500 bases. This will facilitate sequencing of highly repetitive regions and complex genomes, with the ability to assemble the “left-out” contigs.
Homopolymers are the main challenge affecting the accuracy of the 454 sequencing technology. The problem is caused mainly by dATP-α-S when it is incorporated at homopolymeric T regions (more than 4 to 5 bases long). Homopolymer regions (mainly homopolymeric T) can reduce synchronized extension and cause nonuniform sequence peak heights, affecting the read length and possibly causing sequence errors [7]. This problem could be resolved by engineering more efficient DNA polymerases or a luciferase that does not recognize natural dATP as a substrate.
2.7 Future of Pyrosequencing
While pyrosequencing chemistry will be improved to extend the read length, progress in detection, microfluidics, and base calling will be critical for further advancing the technology toward different applications at low cost. In addition, upstream processes should be integrated to reduce cost, shorten sample preparation time, and minimize human intervention for accurate pyrosequencing. Currently the instrument costs about $500,000, and sequencing a mammalian-size genome would cost more than $2 million. The costs of instrumentation and reagents should decrease tenfold and a hundredfold, respectively.
Our group has been working on methodologies that could dramatically reduce instrumentation cost. More specifically, we have developed a new detection scheme based on a CMOS image sensor [34], which uses standard semiconductor manufacturing techniques and offers higher sensitivity than a CCD camera while operating at room temperature on a 2-V battery. Furthermore, analog-to-digital conversion and digital signal processing units have been integrated with the image sensor to maximize detection efficiency. We have integrated this CMOS system with the fluidics to successfully pyrosequence mixtures of PCR products. Because CMOS can be customized and offers high sensitivity, there is room for further miniaturization. A 9-megapixel CMOS sensor has now been designed that would enable sequencing in 9 million reaction wells simultaneously. Sequencing 500 nucleotides per well with an overall system efficiency of 60% would potentially produce approximately 1× coverage of a mammalian genome in a single run. Because this integrated system is very compact, a superscalar version of the device containing 20 chips can be envisioned for even higher-throughput settings.
For a lower-throughput version of pyrosequencing, Advanced Liquid Logic, Inc. (http://www.liquid-logic.com) has implemented digital microfluidics based on electrowetting to perform pyrosequencing on a flat chip. Sample preparation, including PCR and DNA immobilization, can potentially be performed on the sequencing chip. The chip is fully programmable, enabling various sequencing applications on the same platform. This device could serve diagnostic, point-of-care, and biodefense applications.
In conclusion, we believe that pyrosequencing technology is scalable and can be further miniaturized. A cost of $10,000 per de novo mammalian genome could be achievable within the next 5 to 7 years.
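The single-run estimate for the 9-megapixel CMOS design can be checked with a few lines of arithmetic. The sketch assumes a mammalian genome of roughly 3 gigabases, a figure not stated in the chapter, and uses illustrative function names.

```python
def bases_per_run(wells, read_length, efficiency):
    """Usable bases from one run: wells x read length x overall efficiency."""
    return wells * read_length * efficiency


GENOME_SIZE = 3_000_000_000  # assumed mammalian genome size (~3 Gb)

single_chip = bases_per_run(9_000_000, 500, 0.60)
coverage_single = single_chip / GENOME_SIZE
coverage_20_chips = 20 * coverage_single

# Fold reduction needed to go from $2 million to $10,000 per genome.
cost_fold_reduction = 2_000_000 / 10_000

print(single_chip, coverage_single, coverage_20_chips, cost_fold_reduction)
```

With these inputs a single chip yields about 2.7 gigabases, or roughly 0.9× coverage of a 3-gigabase genome, consistent with the approximate 1× figure quoted in the text.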
References [1] Sanger, F., S. Nicklen, and A. R. Coulson, “DNA Sequencing with Chain-Terminating Inhibitors,” Proc. Natl. Acad. Sci. USA, Vol. 74, 1977, pp. 5463–5467.
[2] Ronaghi, M., et al., “Real-Time DNA Sequencing Using Detection of Pyrophosphate Release,” Analytical Biochemistry, Vol. 242, 1996, pp. 84–89. [3] Ronaghi, M., M. Uhlen, and P. Nyren, “A Sequencing Method Based on Real-Time Pyrophosphate,” Science, Vol. 281, 1998, pp. 363, 365. [4] Ahmadian, A., et al., “Single-Nucleotide Polymorphism Analysis by Pyrosequencing,” Analytical Biochemistry, Vol. 280, 2000, pp. 103–110. [5] Gharizadeh, B., et al., “Typing of Human Papillomavirus by Pyrosequencing,” Lab Invest., Vol. 81, 2001, pp. 673–679. [6] Nordstrom, T., et al., “Method Enabling Fast Partial Sequencing of cDNA Clones,” Analytical Biochemistry, Vol. 292, 2001, pp. 266–271. [7] Gharizadeh, B., et al., “Long-Read Pyrosequencing Using Pure 2’-Deoxyadenosine-5’-O’(1-Thiotriphosphate) Sp-Isomer,” Analytical Biochemistry, Vol. 301, 2002, pp. 82–90. [8] Margulies, M., et al., “Genome Sequencing in Microfabricated High-Density Picolitre Reactors,” Nature, Vol. 437, 2005, pp. 376–380. [9] Goldberg, S. M., et al., “A Sanger/Pyrosequencing Hybrid Approach for the Generation of High-Quality Draft Assemblies of Marine Microbial Genomes,” Proc. Natl. Acad. Sci. USA, Vol. 103, 2006, pp. 11240–11245. [10] Hofreuter, D., et al., “Unique Features of a Highly Pathogenic Campylobacter Jejuni Strain,” Infect Immun., Vol. 74, 2006, pp. 4694–4707. [11] Oh, J. D., et al., “The Complete Genome Sequence of a Chronic Atrophic Gastritis Helicobacter Pylori Strain: Evolution During Disease Progression,” Proc. Natl. Acad. Sci. USA, Vol. 103, 2006, pp. 9999–10004. [12] Moore, M. J., et al., “Rapid and Accurate Pyrosequencing of Angiosperm Plastid Genomes,” BMC Plant Biol., Vol. 6, 2006, p. 17. [13] Wicker, T., et al., “454 Sequencing Put to the Test Using the Complex Genome of Barley,” BMC Genomics, Vol. 7, 2006, p. 275. 
[14] Ng, P., et al., “Multiplex Sequencing of Paired-End Ditags (MS-PET): A Strategy for the Ultra-High-Throughput Analysis of Transcriptomes and Genomes,” Nucleic Acids Res., Vol. 34, 2006, p. e84. [15] Green, R. E., et al., “Analysis of One Million Base Pairs of Neanderthal DNA,” Nature, Vol. 444, 2006, pp. 330–336. [16] Noonan, J. P., et al., “Sequencing and Analysis of Neanderthal Genomic DNA,” Science, Vol. 314, 2006, pp. 1113–1118. [17] Cheung, F., et al., “Sequencing Medicago Truncatula Expressed Sequenced Tags Using 454 Life Sciences Technology,” BMC Genomics, Vol. 7, 2006, p. 272. [18] Emrich, S. J., et al., “Gene Discovery and Annotation Using LCM-454 Transcriptome Sequencing,” Genome Res., Vol. 17, 2007, pp. 69–73. [19] Weber, A. P., et al., “Sampling the Arabidopsis Transcriptome with Massively-Parallel Pyrosequencing,” Plant Physiol., 2007.
[20] Axtell, M. J., et al., “A Two-Hit Trigger for siRNA Biogenesis in Plants,” Cell, Vol. 127, 2006, pp. 565–577. [21] Girard, A., et al., “A Germline-Specific Class of Small RNAs Binds Mammalian Piwi Proteins,” Nature, Vol. 442, 2006, pp. 199–202. [22] Lu, C., et al., “MicroRNAs and Other Small RNAs Enriched in the Arabidopsis RNA-Dependent RNA Polymerase-2 Mutant,” Genome Res., Vol. 16, 2006, pp. 1276–1288. [23] Pak, J., and A. Fire, “Distinct Populations of Primary and Secondary Effectors During RNAi in C. Elegans,” Science, Vol. 315, 2007, pp. 241–244. [24] Nielsen, K. L., A. L., Hogh, and J. Emmersen, “DeepSAGE—Digital Transcriptomics with High Sensitivity, Simple Experimental Protocol and Multiplexing of Samples,” Nucleic Acids Res., Vol. 34, 2006, p. e133. [25] Albert, I., et al., “Translational and Rotational Settings of H2A.Z Nucleosomes Across the Saccharomyces Cerevisiae Genome,” Nature, Vol. 446, 2007, pp. 572–576. [26] Edwards, R. A., et al., “Using Pyrosequencing to Shed Light on Deep Mine Microbial Ecology,” BMC Genomics, Vol. 7, 2006, p. 57. [27] Krause, L., et al., “Finding Novel Genes in Bacterial Communities Isolated from the Environment,” Bioinformatics, Vol. 22, 2006, pp. e281–e289. [28] Sogin, M. L., et al., “Microbial Diversity in the Deep Sea and the Underexplored ‘Rare Biosphere,’” Proc. Natl. Acad. Sci. USA, Vol. 103, 2006, pp. 12115–12120. [29] Turnbaugh, P. J., et al., “An Obesity-Associated Gut Microbiome with Increased Capacity for Energy Harvest,” Nature, Vol. 444, 2006, pp. 1027–1031. [30] Binladen, J., et al., “The Use of Coded PCR Primers Enables High-Throughput Sequencing of Multiple Homolog Amplification Products by 454 Parallel Sequencing,” PLoS ONE, Vol. 2, 2007, p. e197. [31] Thomas, R. K., et al., “Sensitive Mutation Detection in Heterogeneous Cancer Specimens by Massively Parallel Picoliter Reactor Sequencing,” Nat. Med., Vol. 12, 2006, pp. 852–855. 
[32] Wang, C., et al., “Characterization of Mutation Spectra with Ultra-Deep Pyrosequencing: Application to HIV-1 Drug Resistance,” Genome Res., Vol. 17, No. 8, August 2007, pp. 1195–1201. [33] Ronaghi, M., et al., “Analyses of Secondary Structures in DNA by Pyrosequencing,” Analytical Biochemistry, Vol. 267, 1999, pp. 65–71. [34] Eltoukhy, H., K. Salama, and A. El Gamal, “A 0.18µm CMOS Luminescence Detection Lab-on-Chip,” IEEE Journal of Solid State Circuits, Vol. 41, 2004, pp. 651–662.
3 The Role of Resequencing Arrays in Revolutionizing DNA Sequencing David Okou and Michael E. Zwick
3.1 Introduction Genomics technologies that enable rapid, accurate, and cost-effective DNA sequencing promise to revolutionize genetic research. At the same time, these technologies will usher in a new era of patient care, with an increasing focus on predictive health and individualized genomic medicine. The Human Genome Project’s (HGP) resounding success [1] and its numerous innovations [2], combined with the pressure of direct competition from a commercial company [3], have fostered an environment in which ever-greater quantities of DNA are being sequenced at an ever-decreasing cost [4]. Yet all scientific revolutions tend to change the research landscape in unanticipated ways, and the HGP is no exception. While remarkably efficient, the costs, infrastructure, and personnel required to maintain a traditional, large-scale industrial DNA-sequencing center that uses gel electrophoresis and Sanger-sequencing chemistry may not be scalable to the degree necessary to meet future research and medical application needs [4]. Consequently, the focus of genomics is shifting to a number of exciting new technologies that offer drastic cost reductions, while at the same time dramatically increasing data production [4]. A characteristic these technologies share, and one not often factored into cost-reduction estimates, is that their space, personnel, and infrastructure requirements are a fraction of those
necessary for the traditional industrial model. Therefore, the prospect of a genome-sequencing center in every lab is increasingly within reach. Resequencing arrays (RAs) are one example of a next generation DNA sequencing technology that can enable individual laboratories to generate vast quantities of genome sequence rapidly and in a cost-effective fashion. This chapter first provides an overview of how RAs determine a DNA sequence. A discussion of basic experimental protocols and methods of RA data analysis follows. Finally, the chapter concludes with a review of several diverse studies that demonstrate how RAs can be employed for DNA sequencing.
3.2 DNA Sequencing by Hybridization with Resequencing Arrays
The term “resequencing” refers to the act of sequencing a gene, genomic region, or genome in multiple different individuals in order to identify genetic variation. Resequencing arrays accomplish this task by hybridizing target DNA from a single individual to a set of custom-designed oligonucleotide probes located on a solid surface. When target DNA is hybridized to an RA, interactions with complementary probe oligonucleotides act either to query the identity of a given base (haploid) or to determine a genotype (diploid), depending on the target DNA source. With a high-quality genome reference sequence in hand, RAs can be custom-designed to resequence the unique genomic sequences of any organism. The great power of this approach arises from the fact that high-density oligonucleotide (oligo) RAs possess a very large number of oligos and can be processed by a single individual within a single laboratory [5, 6].
RAs query a given base by using overlapping oligonucleotide probes, tiled at a 1-base-pair (bp) resolution. The oligonucleotide probes, referred to as features, are typically 25 base pairs long. Both the forward and reverse strands are interrogated, so sequencing a single base requires a total of eight features. A set of four features contains oligonucleotides identical to the forward reference strand, except at position 13 (the base to be queried), where there is either A, C, G, or T. The remaining four features are similarly designed for the complementary strand (Figure 3.1). When a labeled DNA sample, called a target, is hybridized to these eight features on the array, the two features complementary to the reference sequence (forward and reverse complement) will yield the highest signal. If, however, the target DNA contains a variant base at position 13, the two features complementary to that variant base will yield the highest signal (Figure 3.2).
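The eight-feature design for a single queried base can be sketched in Python. The code follows the description above literally (probes written as reference-strand sequences that vary only at the central position 13 of 25) and is an illustration under that reading, not a manufacturer's design tool; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def features_for_base(reference, index, probe_length=25):
    """Return the eight features that query reference[index] (0-based).

    Each feature is a (strand, substituted_base, probe) tuple. Four probes
    follow the forward reference strand and four follow its reverse
    complement; within each set only the central base differs.
    """
    half = probe_length // 2  # 12 for a 25-mer, so the center is position 13
    if index < half or index + half >= len(reference):
        raise ValueError("queried base too close to the end of the reference")
    window = reference[index - half:index + half + 1]
    features = []
    for strand, template in (("fwd", window), ("rev", reverse_complement(window))):
        for base in "ACGT":
            probe = template[:half] + base + template[half + 1:]
            features.append((strand, base, probe))
    return features


if __name__ == "__main__":
    reference = "ACGT" * 8
    for strand, base, probe in features_for_base(reference, 15):
        print(strand, base, probe)
```

Tiling this design across every position of a reference gives one such set of eight features per base.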
Figure 3.1 A description of the concept and design of resequencing arrays. (Figure labels: reference genomic DNA; probes tiled at 1-bp resolution; probe features synthesized on RA; reference sequence; base queried.)
Figure 3.2 Overview of the basic resequencing array protocol. (Figure labels: genome, haploid or diploid; target DNA production and fragmentation; RA hybridization, wash, scan; forward and reverse probes for A, C, G, and T; data analysis (ABACUS); haploid base call: A; diploid genotype: A/G.)
Given eight features for each base, interrogation of an L-length duplex strand would require 8L oligonucleotide probes.
Two companies currently manufacture high-density oligonucleotide RAs: Affymetrix, Inc. and NimbleGen Systems, Inc. Affymetrix’s GeneChip technology uses a series of masks and photolithography to manufacture microarrays containing oligos with a length of 25 bp [7]. The current commercially available RAs with an 8-micron feature size contain about 2.4 million distinct oligos. These RAs can resequence approximately 300 kb per chip (2.4 million 8-micron features/eight features per queried base). Because Affymetrix’s technology synthesizes oligos on entire wafers that are subsequently broken into individual RAs, a typical order of 45 chips can resequence up to 13.5 megabases (Mb). The small feature size and concomitant high density of Affymetrix RAs is an appealing strength of their platform. NimbleGen Systems, Inc., on the other hand, uses maskless photolithography to manufacture microarrays that contain about 385,000 16-micron features [8]. These RAs can resequence approximately 48 kb per chip. There are two advantages to the NimbleGen Systems approach: they can manufacture single arrays, and they can put oligos of different lengths on a single chip. Oligos of different lengths on a single RA enable more precise matching of melting temperatures between target and probe DNA. Regardless of the RA manufacturer, decreasing probe feature size and increasing microarray density is a very fertile area for research and development, and in all likelihood, further progress in these areas will continue to enhance RA capabilities.
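The capacity figures quoted above follow directly from the eight-features-per-base rule. The following Python sketch reproduces the arithmetic; the function names are illustrative only, and the feature counts are the approximate values given in the text.

```python
# Capacity arithmetic for resequencing arrays (8 features per queried base).
# Feature counts are the approximate figures quoted in the text.

FEATURES_PER_BASE = 8


def features_needed(n_bases):
    """Features required to interrogate n_bases of duplex sequence (8L)."""
    return FEATURES_PER_BASE * n_bases


def bases_per_chip(n_features):
    """Bases that a chip with n_features can resequence."""
    return n_features // FEATURES_PER_BASE


affymetrix_bases = bases_per_chip(2_400_000)   # 300,000 bases (300 kb)
nimblegen_bases = bases_per_chip(385_000)      # 48,125 bases (about 48 kb)
wafer_order_bases = 45 * affymetrix_bases      # 13,500,000 bases (13.5 Mb)
```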
3.3 Resequencing Array Experimental Protocols
RAs generate the highest quality data when target DNA sequences that correspond to the complementary probe oligos are unique [5, 6]. Common bioinformatics procedures that identify repetitive sequences underlie RA design, regardless of the targeted organism. Unique DNA sequences can be readily identified and obtained from publicly available DNA sequence repositories using tools like the UCSC Genome Bioinformatics site [9]. Genome repeat masking (e.g., with RepeatMasker, available at http://www.repeatmasker.org/) is a good alternative. RepeatMasker identifies and masks common repetitive sequences and simple sequence repeats [10]. The unique sequences that remain are then candidates for RA probe selection.
In order to use RA space most efficiently, a typical design requires that unique sequences be longer than 50 bp to be tiled on the microarray, while repetitive sequences should be longer than 50 bp to be excluded. However, these guidelines can be relaxed or tightened depending on the specific RA application. This approach is employed because the oligos synthesized for the first and last 12 bases of each unique sequence fragment are not full-length (25-bp) oligos; hence they do not provide useful base-calling information. These oligos do, however, take up space on the RA, thereby reducing the total number of bases that can be resequenced. Once the desired unique sequences have been identified, they are placed in a flat text multi-FASTA file (each unique fragment is in a separate FASTA text block) that is provided to the chip manufacturer. The RAs are then manufactured as discussed previously.
The greatest technical challenge posed by resequencing with RAs lies in isolating target DNA. RAs can resequence either haploid or diploid target DNA. In both cases, the methods of target DNA isolation are identical.
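The fragment-selection guidelines given earlier in this section (tile unique stretches longer than 50 bp, exclude repeats longer than 50 bp, and expect no base calls from the first and last 12 bases of each fragment) can be sketched in Python as follows. This is a minimal illustration rather than an actual design pipeline. It assumes a soft-masked input in which repeats appear in lowercase, and the function names and FASTA header format are hypothetical.

```python
# Sketch of unique-fragment selection for RA design.
# Assumption: input is soft-masked (repeats lowercase, unique sequence uppercase).
import re

MIN_UNIQUE = 50   # fragments must be longer than this to be tiled
MIN_REPEAT = 50   # repeats must be longer than this to be excluded
FLANK = 12        # bases at each fragment end that yield no base calls


def unique_fragments(soft_masked, min_unique=MIN_UNIQUE, min_repeat=MIN_REPEAT):
    """Return (start, sequence) pairs for fragments worth tiling."""
    # Only long repeat runs split the sequence; short ones are tiled through.
    long_repeat = re.compile(r"[a-z]{%d,}" % (min_repeat + 1))
    pieces = []
    start = 0
    for match in long_repeat.finditer(soft_masked):
        pieces.append((start, soft_masked[start:match.start()]))
        start = match.end()
    pieces.append((start, soft_masked[start:]))
    return [(s, seq.upper()) for s, seq in pieces if len(seq) > min_unique]


def to_multi_fasta(name, fragments):
    """Write one FASTA block per fragment (header format is illustrative)."""
    blocks = [
        ">%s:%d-%d\n%s" % (name, start + 1, start + len(seq), seq)
        for start, seq in fragments
    ]
    return "\n".join(blocks) + "\n"


def callable_bases(fragments):
    """Bases that can be called: each fragment loses 12 bases at each end."""
    return sum(len(seq) - 2 * FLANK for _, seq in fragments)
```

Under these assumptions, a 100-bp fragment occupies features for 100 positions but contributes only 76 callable bases, which is why very short fragments use chip space inefficiently.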
Target DNA from relatively simple microbial genomes can be obtained either by using long PCR (LPCR) of selected genomic regions [6] or, even more simply, via whole-genome amplification (WGA) of the entire genome. Larger, more complex eukaryotic genomes present a more significant challenge. LPCR is currently the preferred method of target DNA production and has been used in a project
that generated 160 Mb of human genomic sequence. LPCR target DNA products are quantified and pooled by individual for a given chip design. A modern robotic infrastructure can further automate this process, but scaling this process while simultaneously amplifying the majority of the target genome sequences (>90%) is a significant obstacle. How to improve methods of target DNA production from complex eukaryotic genomes is an active area of research. Clone-based methodologies are a promising approach that may be more scalable [11, 12], although these methods still require substantial effort and quality control in the course of isolating target DNA products.
Isolated target DNA requires processing prior to RA hybridization. The first step is to fragment the target DNA to products that are 20 to 200 bases long. This is typically performed enzymatically using DNase I and should be followed by gel confirmation of digestion. This step in the experimental protocol is critical, since target DNA overdigestion and underdigestion both lead to poor RA performance. After digestion, the target DNA fragments are labeled and then hybridized to the RA overnight. The chip is then washed and scanned, which generates a high-definition image of the RA. At this point, the chip image is ready to be analyzed.
3.4 Analyzing Resequencing Array Data with ABACUS
Determining the DNA sequence from a RA requires analysis of the chip image for each queried base. Initial studies that used high-density oligonucleotide microarrays for single nucleotide polymorphism (SNP) discovery and genotyping relied on visual inspection of the chip images [13–20]. While these initial studies held out great promise, they also suffered from major shortcomings. The reliance on human inspection of each chip was inherently not scalable, was subject to variation among individual viewers, and tended to bias the resulting base (haploid target) and genotype (diploid target) calls. For example, human inspectors tended to identify common SNPs more often than rare SNPs. Equally problematic, the initial large-scale microarray experiments reported SNP false-positive detection rates between 12% and 45% [19, 21]. While these rates certainly seemed high, each chip queries a large number of sites, and these initial experiments suggested that microarray-based variation detection and genotyping in fact achieved an accuracy between 99.93% and 99.99% [19–21].
The paradox of high false-positive SNP detection rates paired with seemingly high RA accuracy exposes a significant problem that faces any resequencing technology [5]. A simple thought experiment can clarify this situation. If we propose a hypothetical technology that can determine the sequence of bases with an accuracy of 99.9%, then we would expect to observe 10 errors
for every 10,000 bases resequenced. Naturally occurring rates of genetic variation in populations of virtually any organism one can examine are relatively low; thus, most resequenced bases are monomorphic. For example, in humans, the average level of variation is about eight differences per 10,000 bases [22]. Hence, if we resequence 10,000 bases with our 99.9% accurate resequencing technology, we would expect to identify eight true variants and make ten errors. This corresponds to an SNP false-positive detection rate of about 56% (10 of the 18 apparent variants), a result that demonstrates the need for all resequencing technologies to have very high accuracy in order to detect the generally rare but quite real genetic variants found in genomes.
The conundrum of high RA error rates combined with the reliance on human visual inspection hinted that RAs might prove inadequate as an effective resequencing technology. To address these challenges, researchers developed ABACUS (Adaptive Background genotype calling scheme) [5]. ABACUS is an objective statistical framework designed to distinguish base (haploid target)/genotype (diploid target) calls that can be made with extraordinary accuracy from those that are less reliable. Because RAs are extraordinarily accurate on the vast majority of queried bases, significantly increasing RA data accuracy becomes possible. Furthermore, the analysis is automated and does not require any human visual inspection of the chip images. ABACUS assigns a quality score to every base/genotype call that corresponds to the level of support at each resequenced base. This approach is similar to the one undertaken for traditional DNA sequencing [23, 24].
To assess data quality, both replicate experiments and accuracy estimates are performed. RA data has been shown to be highly reproducible in replicate experiments consisting of independent amplification of identical samples followed by hybridization to distinct microarrays of the same design [5].
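The thought experiment earlier in this section reduces to a short calculation, sketched here in Python. The function names are illustrative, and the sketch adopts the same simplifying assumption as the thought experiment: every sequencing error is counted as a false variant call, and every true variant is detected.

```python
# False-positive SNP rate as a function of per-base accuracy and variation level.
# Assumes, as in the thought experiment, that each error yields one false SNP
# call and that all true variants are found.


def false_positive_rate(n_bases, accuracy, variants_per_base):
    """Fraction of apparent variants that are actually errors."""
    errors = n_bases * (1.0 - accuracy)
    true_variants = n_bases * variants_per_base
    return errors / (errors + true_variants)


def accuracy_needed(variants_per_base, max_false_positive_rate):
    """Per-base accuracy required to keep the false-positive rate at a target."""
    allowed_error_rate = (
        variants_per_base * max_false_positive_rate / (1.0 - max_false_positive_rate)
    )
    return 1.0 - allowed_error_rate


# The example from the text: 99.9% accuracy, eight variants per 10,000 bases.
example_rate = false_positive_rate(10_000, 0.999, 8 / 10_000)  # 10 / 18
```

With eight variants per 10,000 bases, holding the false-positive rate to 5% under these assumptions would require a per-base accuracy of roughly 99.996%, which illustrates why a seemingly high accuracy of 99.9% is inadequate.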
In an autosomal (diploid target) replicate experiment, 813,295 out of 813,295 genotypes were called identically (including 351 heterozygotes); at X-linked loci in males (haploid target), 841,236 out of 841,236 sites were called identically. If repeatability could be equated to accuracy, then this level of repeatability in haploid and diploid genotype calls would correspond to a phred score of at least 54 (assuming a binomial error probability of P; in relating P-values to phred scores, note that phred = −10 log10(P)) [23, 24]. Although repeatability can certainly suggest accuracy, it is also possible that systematic but repeatable errors might remain undetected. Therefore, using independent resequencing/genotyping technologies, accuracy was also assessed via two separate means. For haploid target data, 6x shotgun resequencing of a single individual sample yielded 17,423 base calls that were identical to the ABACUS calls. To assess the accuracy at segregating sites in diploids (nonsegregating sites identical to the reference are exceedingly likely to be correct, since overall polymorphism rates are so low), 1,938 genotypes were
obtained at 108 segregating sites. The accuracy of diploid genotypes at segregating autosomal sites was confirmed for 1,515 of 1,515 homozygous calls, and for 420 out of 423 (99.29%) heterozygotes [5]. These results indicated that genotyping accuracy at segregating sites was greater than 99.8% (and this of course does not take into account the nonsegregating sites also likely to be called correctly). A subset of these SNPs was investigated to assess the accuracy of SNP detection: 108 out of 108 SNPs were experimentally confirmed, and an additional 371 SNPs were confirmed electronically. This indicated that RAs can be used for both detection and genotyping of variation simultaneously, and that the base-calling accuracy reaches or exceeds that of most other widely available stand-alone technologies.
Furthermore, because ABACUS provides a quality score for individual bases (haploid targets)/genotypes (diploid targets), investigators can focus attention on those sites providing accurate information. Haploid quality scores, which are those relevant to resequencing X-chromosome loci in human males and microbial genomes, for example, are higher on average than those for diploid data (Figure 3.3). RAs can also detect insertion/deletion (indel) variation in haploid targets. Currently, ABACUS is not designed to detect indel variation in diploid targets.
Figure 3.3 Quality score (log10) distribution for haploid and diploid resequenced bases [5]. (Horizontal axis: quality score (log10), 5 to 305; vertical axis: percent of total bases resequenced, 0% to 7%; series: diploid and haploid.)
This initial application of ABACUS demonstrated greater than 99.9999% accuracy on more than 80% of the genotype calls in a single experiment using Affymetrix RAs [5]. This work established that RAs could be an effective resequencing platform. Improving the overall call rate above 80% for single experiments then became a priority.
After scanning a RA, the first step in data analysis is determining the position of features on the chip. This process is called grid alignment. Grid alignment is followed by the extraction of the raw data from each RA image that is subsequently used for base calling by ABACUS. When ABACUS was developed, grid alignment required manual intervention by a laboratory technician, who would determine the positions of the four corners of a grid by eye. The single grid placed by the technician encompassed the entire RA image, effectively assuming a “globally perfect” grid alignment. Once the grid was placed over the RA image, the intensity data could then be extracted for analysis by ABACUS. This process, in addition to being laborious, led to varying degrees of misalignment on portions of (and in some cases, entire) RAs. Data extracted from RAs with misaligned grids raised error rates. Increasing the quality score threshold could eliminate errors, but at the cost of reduced rates of base/genotype calling.
The problems associated with manual grid alignment were eventually solved by software that performs automated grid alignment on RA image files [D. J. Cutler and M. Zwick, personal communication, 2006]. The software makes no “globally perfect” assumption, but instead assumes a locally perfect alignment. Micromisalignments, when undetected, can correspond to very weak heterozygotes when resequencing diploid targets. These errors are minimized with a locally perfect method of grid alignment. Figure 3.4 provides an example of the way local grid alignment can improve data quality. At the same time, automated grid alignment allowed for a reduction of the quality score threshold, increasing call rates to more than 90% of sites.
Currently, automated grid alignment followed by ABACUS base calling requires less than 1 minute per microarray on a standard desktop computer. All of these tools have been implemented in a free, open-source software package called RATools, which is publicly available at http://www.dpgp.org/RA/ra.htm. RATools works equally well with RAs manufactured by Affymetrix or NimbleGen Systems, Inc.
Figure 3.4 Local grid alignment (right) extracts RA data more accurately than global grid alignment (left).
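Two of the accuracy figures in this section can be reproduced with a few lines of Python. The phred relation, phred = −10 log10(P), is the one given in the text. The text does not state how the “at least 54” figure was derived from the replicate data; one reading that happens to be consistent with it is a 95% upper confidence bound on the error probability when zero discordant calls are observed. That reading is an assumption of this sketch, as are the function names.

```python
# Phred-scaled repeatability and genotype accuracy from the reported counts.
# Assumption: "at least 54" is read as a 95% upper bound on the error
# probability given zero discordant calls; the source does not say this.
import math


def phred(p_error):
    """Phred score for an error probability: -10 * log10(P)."""
    return -10.0 * math.log10(p_error)


def error_upper_bound(n_trials, confidence=0.95):
    """Largest error probability still consistent with 0 errors in n_trials."""
    # Solves (1 - p) ** n_trials = 1 - confidence for p.
    return -math.expm1(math.log(1.0 - confidence) / n_trials)


diploid_phred = phred(error_upper_bound(813_295))  # a little above 54
haploid_phred = phred(error_upper_bound(841_236))  # a little above 54

# Genotype accuracy at segregating sites: confirmed calls / genotypes checked.
confirmed = 1_515 + 420
checked = 1_515 + 423
segregating_accuracy = confirmed / checked  # greater than 0.998
```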
3.5 Review of RA Applications
Resequencing arrays have been used successfully in a variety of systems and projects. As this new technology advances, an increasingly diverse collection of RA projects has ensued. Here we review a few select applications of RA technology.
3.5.1 Human Resequencing
High-throughput RAs were applied to an experiment encompassing 32 autosomal and eight X-linked genomic regions, each consisting of approximately 50 kb of unique sequence spanning a 100-kb region, in 40 humans [M. Zwick, personal communication, 2007]. The regions resequenced included unique coding and noncoding sequences. Long PCR was used to generate target DNA for RA hybridization. Using thresholds identical to those described previously, approximately 1.6 Mb of DNA sequence was determined for each of the 40 individuals. In total, 6,040 SNPs were identified, corresponding to human genetic variation levels determined by other sequencing technologies. The SNPs that were discovered consisted of both common and rare variants. A pronounced excess of rare variants, relative to the number predicted by the neutral theory, was observed in the genomic regions resequenced. These data demonstrated that RAs can be a viable resequencing platform that is capable of identifying both common and rare SNPs in large-scale resequencing projects focused on complex eukaryotic genomes.
3.5.2 Mitochondrial DNA Resequencing
Mitochondrial DNA mutations are commonly found in cancers, but until recently, strategies for using the mitochondrial genome as a cancer-screening tool were limited by the lack of a high-throughput platform for mutation detection. Maitra et al. [25] used a RA to rapidly sequence the human mitochondrial genome in order to detect mutations linked to cancers. The first generation of the MitoChip (v1.0, released in 2004) could sequence more than 29 kb of double-stranded DNA in a single assay. Both strands of the entire human mitochondrial coding sequence (15,451 bp) were arrayed on the MitoChip; both strands of an additional 12,935 bp (84% of coding DNA) were also arrayed in duplicate. Using ABACUS and a total threshold quality score of 30 [5], the Affymetrix GeneChip DNA Analysis software successfully assigned base calls at
96.0% of nucleotide positions [25]. Replicate experiments demonstrated more than 99.99% reproducibility of base calls both within and between chips using array-based sequencing (Table 3.1). There were no significant differences in the percentage of genotype calls among mitochondrial DNA extracted from lymphocytes, cell lines, primary tumors, or body fluids. The second-generation MitoChip (v2.0, released in 2006) has a smaller feature size and a correspondingly larger capacity (50 kb of sequence, compared to 32 kb), and it includes redundant tiling features, enabling researchers to detect potential insertion-deletion mutations and nearly three dozen variations in the noncoding D-loop region. These regions were not directly detectable using the original MitoChip [26].
Using RAs to resequence mitochondrial genomes accomplished three key objectives. First, the number of polymerase chain reactions (PCR) required to sequence the entire mitochondrial DNA was reduced to three (from 20 to 40 individual reactions). Second, the amount of starting template needed was reduced to just 10 nanograms (100 nanograms cited [25]). And finally, sequencing the entire mitochondrial genome on a single chip eliminates the need to visually inspect traditional chromatograms. The ability of array-based mitochondrial genome resequencing to detect mitochondrial mutations in samples of body fluids obtained from cancer patients attests to its promise as a tool for the early detection of cancer and other disorders in human clinical samples.
Table 3.1 Summary of Replicate Experiments

Total samples analyzed in replicate                       13
Total samples analyzed for replicate experiments          26

Within-chip reproducibility
  Mitochondrial base pairs tiled in duplicate per chip    12,935 bp
  Total within-chip duplicate base pairs analyzed         26*12,935 = 336,310 bp
  Total within-chip duplicate base pairs called           311,814 bp (92.71%)
  Discordant calls within chips                           8 bp (0.0025%)

Between-chip reproducibility
  Total mitochondrial base pairs tiled per chip           28,386 bp
  Total base pairs analyzed in one set of 13 chips        13*28,386 = 369,018 bp
  Total base pairs assigned in first set of chips         350,010 bp (94.84%)
  Total base pairs assigned in second set of chips        355,086 bp (96.22%)
  Total base pairs assigned in both sets                  345,094 bp (93.51%)
  Discordant base calls between chips                     10 bp (0.0028%)
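The percentages in Table 3.1 can be recomputed from the raw counts, as in the Python sketch below. Two points are inferences from the numbers rather than statements in the source: the published percentages match the counts when truncated (not rounded) to the displayed precision, and the discordance rates use the called base pairs, not the analyzed base pairs, as the denominator. The helper name is illustrative.

```python
# Recomputing the Table 3.1 percentages from the raw counts.
# Inferred from the numbers, not stated in the source: percentages are
# truncated to the displayed precision, and discordance is per called base.
import math


def truncated_percent(part, whole, decimals):
    """Percentage of part in whole, truncated to the given decimal places."""
    scale = 10 ** decimals
    return math.floor(100.0 * part / whole * scale) / scale


within_analyzed = 26 * 12_935    # duplicate base pairs across 26 chips
between_analyzed = 13 * 28_386   # base pairs across one set of 13 chips

within_called_pct = truncated_percent(311_814, within_analyzed, 2)
within_discordant_pct = truncated_percent(8, 311_814, 4)

first_set_pct = truncated_percent(350_010, between_analyzed, 2)
second_set_pct = truncated_percent(355_086, between_analyzed, 2)
both_sets_pct = truncated_percent(345_094, between_analyzed, 2)
between_discordant_pct = truncated_percent(10, 345_094, 4)

# Reproducibility is the complement of the discordance rate.
within_reproducibility = 100.0 - 100.0 * 8 / 311_814
between_reproducibility = 100.0 - 100.0 * 10 / 345_094
```

Both reproducibility values exceed 99.99%, in agreement with the figure quoted in the text.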
3.5.3 Microbial Pathogen Resequencing
RAs can be a rapid, cost-effective, and efficient platform for resequencing microbial and viral genomes or genomic regions. Three recent studies have made use of RAs to resequence genomic regions from multiple strain isolates (Bacillus anthracis), viral genomes (SARS virus), and portions of microbial and viral genomes from human clinical samples, as discussed later.
3.5.3.1 Anthrax Resequencing (Bacterial Pathogens)
After the pathogen Bacillus anthracis was employed in the 2001 bioterror attacks in the United States, RAs were used to characterize genetic variation among multiple B. anthracis strains [6]. B. anthracis has two characteristics that make it an ideal system for RA-based resequencing. First, the B. anthracis genome consists of a main circular chromosome and two plasmids (pXO1, pXO2) that are approximately 5.2 Mb in size. The relatively small size of microbial genomes allows for rapid target DNA isolation. While target DNA was generated using LPCR [6], subsequent work demonstrated that WGA on higher-density RAs works equally well [M. Zwick, personal communication, 2007]. Second, the ABACUS base-calling model for anthrax is haploid (the same as for the X chromosome in human males), and haploid base calls on RAs have higher quality scores.
A geographically diverse panel of 56 B. anthracis strains was resequenced using Affymetrix RAs, and base calls were determined with the RATools software package [6]. Each RA was capable of resequencing 29,212 bp, or about 0.5%, of the B. anthracis genome. Long PCR sample preparation and chip processing were conducted for 118 RAs. Analysis of these 118 RAs with the ABACUS software package showed that 115 were successful (97.5%). Experimental failure was declared when fewer than 60% of the total possible bases achieved quality scores exceeding the ABACUS user-defined threshold. For that study, the total threshold was set at 31 with a strand minimum of −2 [5], as determined by analysis of a replication experiment. The successful RAs (115) called 92.6% of the possible bases (3,109,539 bp out of a total possible 3,359,380 bp). Bases with quality scores that failed to meet the minimum threshold stemmed from two primary causes. Amplicon failure, which typically arises from LPCR failure, accounted for 1.1% of the uncalled bases.
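The chip-level quality control described above can be sketched in Python. The summary counts come from the text. The per-base calling rule is only a guess at how the two thresholds combine: it assumes that the total quality score is the sum of the forward-strand and reverse-strand scores, which the text does not spell out, and all function names are hypothetical rather than part of RATools.

```python
# Sketch of chip-level QC for the B. anthracis study.
# Assumption (not stated in the source): the total quality score is the sum
# of the two per-strand scores. Function names are hypothetical.

BASES_PER_CHIP = 29_212
MIN_CALL_FRACTION = 0.60
TOTAL_THRESHOLD = 31
STRAND_MINIMUM = -2


def base_is_called(forward_score, reverse_score):
    """Apply the total threshold and the per-strand minimum to one base."""
    total = forward_score + reverse_score
    weakest_strand = min(forward_score, reverse_score)
    return total >= TOTAL_THRESHOLD and weakest_strand >= STRAND_MINIMUM


def chip_is_successful(called_bases, possible_bases=BASES_PER_CHIP):
    """A chip fails when fewer than 60% of its possible bases are called."""
    return called_bases / possible_bases >= MIN_CALL_FRACTION


# Summary figures reported in the text.
chips_processed = 118
chips_successful = 115
possible_bases = chips_successful * BASES_PER_CHIP   # 3,359,380 bp
called_bases = 3_109_539

chip_success_pct = 100.0 * chips_successful / chips_processed
call_rate_pct = 100.0 * called_bases / possible_bases
```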
The remaining uncalled bases were distributed nonrandomly across the RAs and were composed of oligonucleotide probes with a significantly higher purine composition (P