Tag-based Next Generation Sequencing

Wiley, 2012. - 673 p. - Tag-based approaches were originally designed to increase the throughput of capillary sequencing


English Pages [673]


Edited by Matthias Harbers and Günter Kahl
Tag-based Next Generation Sequencing

Related Titles

Liu, Z.
Next Generation Sequencing and Whole Genome Selection in Aquaculture
2011
ISBN: 978-0-8138-0637-2

Meksem, K., Kahl, G. (eds.)
The Handbook of Plant Mutation Screening
Mining of Natural and Induced Alleles
2010
ISBN: 978-3-527-32604-4

Janitz, M. (ed.)
Next Generation Genome Sequencing
Towards Personalized Medicine
2008
ISBN: 978-3-527-32090-5


The Editors

Dr. Matthias Harbers
4-2-6 Nishihara
Kashiwa-Shi
Chiba 277-0885
Japan

Prof. Dr. Günter Kahl
Mohrmühlgasse 3
63500 Seligenstadt
Germany

Cover The scheme in the foreground symbolizes the events at a silenced and an activated promoter (adapted with kind permission from Macmillan Publishers Ltd./Shelley L. Berger: The complex language of chromatin regulation during transcription, Nature 447, 2007, and with kind permission from Shelley L. Berger, University of Pennsylvania, Philadelphia). The background (modified with kind permission from Steven Henikoff, Fred Hutchinson Cancer Research Center and University of Washington, Seattle) symbolizes the C-methylation patterns of some Arabidopsis thaliana chromosomes.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty can be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Card No.: applied for

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

© 2012 Wiley-VCH Verlag & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany

Wiley-Blackwell is an imprint of John Wiley & Sons, formed by the merger of Wiley’s global Scientific, Technical, and Medical business with Blackwell Publishing.

All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Cover Design Formgeber, Eppelheim
Composition Thomson Digital, Noida
Printing and Binding Fabulous Printers Pte Ltd, Singapore

Printed in Singapore
Printed on acid-free paper

Print ISBN: 978-3-527-32819-2
ePDF ISBN: 978-3-527-64477-3
oBook ISBN: 978-3-527-64458-2
ePub ISBN: 978-3-527-64457-5
Mobi ISBN: 3-527-64456-3

We dedicate this book to the memory of the late Eberhard Harbers, who aroused our interest in nucleic acids with one of the first books ever published on this topic.


Contents

Preface
List of Contributors

Part One Tag-Based Nucleic Acid Analysis

1 DeepSuperSAGE: High-Throughput Transcriptome Sequencing with Now- and Next-Generation Sequencing Technologies
Hideo Matsumura, Carlos Molina, Detlev H. Krüger, Ryohei Terauchi, and Günter Kahl
1.1 Introduction
1.2 Overview of the Protocols
1.2.1 Principle of the SuperSAGE Method
1.2.2 Power of the SuperSAGE Tag
1.2.3 Development of DeepSuperSAGE
1.2.4 Ditag-Based DeepSuperSAGE (for 454 Pyrosequencing)
1.2.5 Single-Tag-Based DeepSuperSAGE (HT-SuperSAGE)
1.3 Methods and Protocols
1.3.1 Linker or Adapter Preparation
1.3.2 RNA Samples
1.3.3 cDNA Synthesis and NlaIII Digestion
1.3.4 Tag Extraction from cDNA
1.3.5 Tag Extraction from cDNA
1.3.6 Purification of Linker–Tag Fragments
1.3.7 Ditag or Adapter–Tag Formation and Amplification
1.3.8 Preparation of Templates for Sequencing
1.4 Applications
1.4.1 Applications of DeepSuperSAGE in Combination with 454 Pyrosequencing
1.4.2 Practical Analysis of HT-SuperSAGE
1.5 Perspectives
References

2 DeepCAGE: Genome-Wide Mapping of Transcription Start Sites
Matthias Harbers, Mitchell S. Dushay, and Piero Carninci
2.1 Introduction
2.2 What is CAGE?
2.3 Why CAGE?
2.4 Methods and Protocols
2.4.1 Key Reagents and Consumables
2.4.2 Precautions
2.4.3 RNA Samples Used for DeepCAGE Library Preparation
2.4.4 DeepCAGE Library Preparation
2.5 Applications
2.6 Perspectives
References

3 Definition of Promotome–Transcriptome Architecture Using CAGEscan
Nicolas Bertin, Charles Plessy, Piero Carninci, and Matthias Harbers
3.1 Introduction
3.2 What is CAGEscan?
3.3 Why CAGEscan?
3.4 Methods and Protocols
3.4.1 Key Reagents and Consumables
3.4.2 Precautions
3.4.3 RNA Samples Used for CAGEscan Library Preparation
3.4.4 Considerations on Pooling CAGEscan Libraries
3.4.5 CAGEscan Library Preparation
3.5 Applications and Perspectives
References

4 RACE: New Applications of an Old Method to Connect Exons
Charles Plessy
4.1 Introduction
4.2 Deep-RACE
4.2.1 Choice of the Sequencer
4.2.2 Validation of Promoter Studies
4.2.3 Other Applications of Deep-RACE
4.2.4 Limitations of Deep-RACE
4.3 Methods Outline
4.3.1 Primer Design
4.3.2 Molecular Biology of Deep-RACE Library Preparation
4.3.3 Sequencing of Deep-RACE Libraries
4.3.4 Analysis
4.4 Perspectives
References

5 RNA-PET: Full-Length Transcript Analysis Using 5′- and 3′-Paired-End Tag Next-Generation Sequencing
Xiaoan Ruan and Yijun Ruan
5.1 Introduction
5.2 Methods and Protocols
5.2.1 Key Reagents and Consumables
5.2.2 Protocol
5.3 Applications
5.3.1 PET Sequencing with SOLiD
5.3.2 Mapping of the PETs
5.3.3 PET Clustering, Annotation, and Genome Browser Visualization
5.4 Perspectives
References

6 Stranded RNA-Seq: Strand-Specific Shotgun Sequencing of RNA
Alistair R.R. Forrest
6.1 Introduction
6.1.1 Before Starting
6.2 Methods and Protocols
6.2.1 Preface
6.2.2 Materials and Consumables
6.2.3 Protocol
6.3 Bioinformatic Considerations
6.4 Applications
6.5 Perspectives
References

7 Differential RNA Sequencing (dRNA-Seq): Deep-Sequencing-Based Analysis of Primary Transcriptomes
Anne Borries, Jörg Vogel, and Cynthia M. Sharma
7.1 Introduction
7.2 What is dRNA-Seq?
7.3 Why dRNA-Seq?
7.4 Methods and Protocols
7.4.1 Materials and Consumables
7.4.2 Precautions
7.4.3 RNA Samples Used for dRNA-Seq Library Preparation
7.4.4 dRNA-Seq Library Preparation
7.5 Applications
7.6 Perspectives
References

8 Identification and Expression Profiling of Small RNA Populations Using High-Throughput Sequencing
Javier Armisen, W. Robert Shaw, and Eric A. Miska
8.1 Introduction
8.1.1 miRNAs
8.1.2 piRNAs
8.1.3 siRNAs
8.1.4 Other Small RNAs
8.2 HTS/NGS
8.3 Methods and Protocols
8.3.1 Key Reagents and Solutions
8.3.2 Total RNA Isolation
8.3.3 Small RNA Isolation
8.3.4 Ligation of Adapters
8.4 Troubleshooting
8.5 Applications
8.6 Perspectives
References

9 Genome-Wide Mapping of Protein–DNA Interactions by ChIP-Seq
Joshua W.K. Ho, Artyom A. Alekseyenko, Mitzi I. Kuroda, and Peter J. Park
9.1 Introduction
9.2 Methods and Protocols
9.2.1 Antibody Validation
9.2.2 ChIP
9.2.3 Sequencing Library Preparation
9.2.4 Data Analysis
9.3 Applications
9.3.1 Deciphering the Transcriptional Regulatory Program
9.3.2 Unraveling Epigenetic Regulation
9.3.3 Comparative Interindividual or Interspecies Analysis
9.3.4 Study of Human Diseases and Clinical Applications
9.3.5 Advantages and Challenges of ChIP-Seq
9.4 Perspectives
References


10 Analysis of Protein–RNA Interactions with Single-Nucleotide Resolution Using iCLIP and Next-Generation Sequencing
Julian König, Nicholas J. McGlincy, and Jernej Ule
10.1 Introduction
10.2 Procedure Overview
10.3 Antibody and Library Preparation Quality Controls
10.4 Oligonucleotide Design
10.5 Recent Modifications of the iCLIP Protocol
10.6 Troubleshooting
10.7 Methods and Protocols
References

11 Massively Parallel Tag Sequencing Unveils the Complexity of Marine Protistan Communities in Oxygen-Depleted Habitats
Virginia Edgcomb and Thorsten Stoeck
11.1 Introduction
11.2 Cariaco Basin
11.3 Framvaren Fjord
11.4 Comparison of Cariaco Basin to Framvaren Fjord
11.5 Perspectives on Interpretation of Microbial Eukaryote 454 Data
References

12 Chromatin Interaction Analysis Using Paired-End Tag Sequencing (ChIA-PET)
Xiaoan Ruan and Yijun Ruan
12.1 Introduction
12.1.1 Development of the ChIA-PET Method
12.1.2 Applications of the ChIA-PET Method
12.1.3 Experimental Design of ChIA-PET Analysis
12.1.3.1 ChIP Sample Preparation
12.1.3.2 ChIA-PET Library Construction
12.1.3.3 ChIA-PET Library Sequencing and Mapping
12.1.3.4 Control Libraries
12.2 Methods and Protocols
12.2.1 Key Reagents and Consumables
12.2.2 Protocol
12.3 Timeline
12.4 Anticipated Results
12.4.1 Verification of Sonicated Chromatin DNA Size Range
12.4.2 ChIP Quality Control: Yield and Enrichment
12.4.3 ChIA-PET Library Quality Control
12.4.4 ChIA-PET Sequencing and Mapping Analysis
12.5 Perspectives
References

13 Tag-Seq: Next-Generation Tag Sequencing for Gene Expression Profiling
Sorana Morrissy, Yongjun Zhao, Allen Delaney, Jennifer Asano, Noreen Dhalla, Irene Li, Helen McDonald, Pawan Pandoh, Anna-Liisa Prabhu, Angela Tam, Martin Hirst, and Marco Marra
13.1 Introduction
13.2 Protocol Details
13.3 Protocol Overview and Timeline
13.4 Critical Parameters and Troubleshooting
13.5 Methods and Protocols
13.5.1 Basic Protocol 1: First- and Second-Strand cDNA Synthesis for Tag-Seq Library Construction
13.5.2 Basic Protocol 2: Tag Generation
13.5.3 Basic Protocol 3: PCR and Fragment Isolation
13.5.4 Basic Protocol 4: Preparing the Library for Illumina Sequencing
13.5.5 Alternate Protocol: Amplified Tag-Seq library construction (Tag-SeqLite)
13.5.6 Basic Protocol 5: Data Analysis
13.6 Applications
13.7 Perspectives
References

14 Isolation of Active Regulatory Elements from Eukaryotic Chromatin Using FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements)
Paul G. Giresi and Jason D. Lieb
14.1 Introduction
14.2 Methods and Protocols
14.2.1 FAIRE Procedure
14.2.2 Optimization of the FAIRE Procedure
14.2.3 Equipment and Reagents
14.2.4 Detection of FAIRE DNA
14.2.5 High-Throughput Sequencing
14.3 Applications
14.4 Perspectives
References

15 Identification of Nucleotide Variation in Genomes Using Next-Generation Sequencing
Hendrik-Jan Megens and Martien A.M. Groenen
15.1 Introduction
15.1.1 SNP Discovery and Nucleotide Variation Assessment
15.1.2 Sequence and Library Preparation Strategies
15.2 Methods
15.2.1 Preprocessing of Reads
15.2.1.1 FASTQ Format
15.2.1.2 FASTQ Format – Illumina Version
15.2.1.3 Illumina FASTQ to Sanger FASTQ
15.2.1.4 ABI SOLiD- and Roche 454-Specific Formats
15.2.1.5 Illumina SCARF or QSEQ to FASTQ
15.2.1.6 Quality Evaluation
15.2.1.7 Handling Adapter Sequences – Linkers and Barcodes
15.2.1.8 Quality Trimming
15.2.2 Mapping Reads to a Reference Genome
15.2.2.1 Making Alignments Using MOSAIK
15.2.2.2 Making Alignments Using BWA
15.2.3 Variant Calling
15.2.3.1 SAM Format
15.2.3.2 Variant Calling with SAMtools
15.2.3.3 Variant Calling with GATK
15.3 Notes
References

16 DGS (Ditag Genome Scanning) – A Restriction-Based Paired-End Sequencing Approach for Genome Structural Analysis
Jun Chen, Yeong C. Kim, and San Ming Wang
16.1 Introduction
16.2 Methods and Protocols
16.2.1 Cloning-Based DGS Protocol
16.2.2 Non-Cloning-Based DGS Protocol
16.2.3 Computational Mapping Analysis of Experimental Ditags
16.3 Applications
16.3.1 Analyzing Normal Genome Structure
16.3.2 Identifying Somatic Rearrangements in Cancer Genomes
16.3.3 A Useful Tool to Study Family Germline Genetic Disorders
16.4 Perspectives
References

17 Next-Generation Sequencing of Bacterial Artificial Chromosome Clones for Next-Generation Physical Mapping
Robert Bogden, Keith Stormo, Jason Dobry, Amy Mraz, Quanzhou Tao, Michiel van Eijk, Jan van Oeveren, Marcel Prins, Jon Wittendorp, and Mark van Haaren
17.1 History of the Bacterial Artificial Chromosome Vector Systems
17.2 History of Physical Mapping
17.3 What is WGP?
17.4 Flow of a WGP Project
17.5 BAC Pooling Strategies
17.6 Methods and Protocols
17.6.1 BAC Library and Pooling Strategy
17.6.2 Sample Preparation for Illumina Sequencing
17.6.3 Illumina Sequencing
17.6.4 Deconvolution to Assign the BAC Address to Each Read
17.6.5 Contig Building
17.7 Applications
17.7.1 Results from Real WGP Projects Performed by the Authors
17.7.2 Reorganizing Project Funding and Sequencing Budgets
17.7.3 Unleash the Power of BAC Clones
17.8 Perspectives
References

18 HELP-Tagging: Tag-Based Genome-Wide Cytosine Methylation Profiling
Masako Suzuki and John M. Greally
18.1 Introduction
18.2 Genome-Wide DNA Methylation Analysis
18.3 What is HELP-Tagging?
18.3.1 When is HELP-Tagging the Preferred Cytosine Methylation Assay?
18.4 Methods and Protocols
18.4.1 Reagents, Materials, and Equipment
18.4.2 Buffers and Adapters for HELP-Tagging Library Preparation
18.4.3 Precautions
18.4.4 DNA Samples for HELP-Tagging Library Preparation
18.4.5 HELP-Tagging Library Preparation
18.4.6 Illumina Sequencing
18.5 Applications
18.6 Perspectives
References

19 Second-Generation Sequencing Library Preparation: In Vitro Tagmentation via Transposome Insertion
Fraz Syed
19.1 Introduction
19.2 Methods and Protocols
19.2.1 Materials
19.2.2 Methods
19.3 Perspectives
References

Part Two Next-Generation Tag-Based Sequencing

20 Moving Towards Third-Generation Sequencing Technologies
Karolina Janitz and Michal Janitz
20.1 Introduction
20.2 Differences Between NGS and Sanger Sequencing
20.3 Preparation of Templates for Sequencing
20.3.1 Next Generation: Single-Molecule Templates
20.4 Real-Time Sequencing
20.5 Nanopore Sequencing
20.6 Ion Torrent Electronic Sequencing
20.7 Genome Enrichment
20.8 Advantages of NGS
20.9 Problem of Short Reads
20.10 Perspectives
References

21 Beyond Tags to Full-Length Transcripts
Mohammed Mohiuddin, Stephen Hutchison, and Thomas Jarvie
21.1 Introduction
21.2 Generation of Full-Length Transcriptomes
21.2.1 mRNA Fragmentation and Preparation for Sequencing
21.2.1.1 Step 1: mRNA Preparation
21.2.1.2 Step 2: Fragmentation
21.2.1.3 Step 3: First-Strand Synthesis
21.2.1.4 Step 4: Second-Strand Synthesis and Library Preparation
21.2.2 Sequencing of Full-Length Transcriptomes
21.3 Methods
21.3.1 Assembly
21.3.2 Mapping
21.4 Applications
21.4.1 Model Organisms
21.4.2 Fusion Transcript Detection
21.4.3 Digital Gene Expression
21.4.4 Allele-Specific Expression
21.5 Perspectives
References

22 Helicos Single-Molecule Sequencing for Accurate Tag-Based RNA Quantitation
John F. Thompson, Tal Raz, and Patrice M. Milos
22.1 Introduction
22.2 Methods and Protocols
22.2.1 Reagents, Required Primers, and Thermocycler Programs
22.2.2 Standard Precautions
22.2.3 DGE Sample Preparation
22.3 Applications
22.4 Perspectives
References

23 Total RNA-seq: Complete Analysis of the Transcriptome Using Illumina Sequencing-by-Synthesis Sequencing
Shujun Luo, Geoffrey P. Smith, Irina Khrebtukova, and Gary P. Schroth
23.1 Introduction
23.2 Total RNA-Seq
23.3 Methods and Protocols
23.3.1 Key Materials
23.3.2 Method
23.4 Total RNA-Seq Data Collection and Interpretation
23.5 Applications
References

Part Three Bioinformatics for Tag-Based Technologies

24 Computational Infrastructure and Basic Data Analysis for Next-Generation Sequencing
David Sexton
24.1 Introduction
24.2 Background
24.3 Getting Started with the Next-Generation Manufacturers
24.4 Infrastructure and Data Analysis
24.4.1 Computational Considerations
24.4.2 Data Dynamics
24.4.3 Software and Postanalysis
24.4.4 Staffing Requirements
24.5 Applications
24.6 Perspectives

25 CLC Bio Integrated Platform for Handling and Analysis of Tag Sequencing Data
Roald Forsberg, Søren Mønsted, and Anne-Mette Hein
25.1 Introduction
25.2 Main Components and Features
25.2.1 Data Flow and Data Back-End
25.2.2 CLC Genomics Workbench
25.2.3 CLC Genomics Server
25.2.4 APIs
25.2.5 Acceleration of Analysis
25.3 Applications
25.3.1 First-Level Analysis
25.3.1.1 Import
25.3.1.2 Demultiplexing
25.3.1.3 Trim and Quality Control
25.3.2 Second and Third Levels – Application-Specific Steps
25.3.2.1 RNA-Seq
25.3.2.2 SmallRNA-Seq
25.3.2.3 Tag-Seq
25.3.2.4 ChIP-Seq
25.3.3 Fourth Level – Expression Analysis
25.4 Perspectives
References

26 Multidimensional Context of Sequence Tags: Biological Data Integration
Korbinian Grote and Thomas Werner
26.1 Introduction
26.2 Methods and Strategies
26.2.1 Annotation Links Sequence Tags (Reads) to Biology
26.2.2 Application of the Methods and Strategies
26.2.3 Only Positive Results are Conclusive
26.2.4 Automatic Workflow: ChIP-seq of Peroxisome Proliferator-Activated Receptor-γ
26.3 Perspectives
References

27 Experimental Design and Quality Control of Next-Generation Sequencing Experiments
Peter A.C. 't Hoen, Matthew S. Hestand, Judith M. Boer, Yuching Lai, Maarten van Iterson, Michiel van Galen, Henk P. Buermans, and Johan T. den Dunnen
27.1 Introduction
27.2 Choice of Platform
27.2.1 Read Length and Number of Reads
27.2.2 Single-End versus Paired-End Sequencing
27.2.3 Platform-Specific Advantages and Disadvantages
27.3 Sequencing Depth
27.3.1 Expression Profiling
27.3.2 ChIP-Seq: Relation Enrichment Factor and Sequencing Depth
27.3.3 Barcoding
27.4 Replicates, Randomization, and Statistical Testing
27.4.1 Technical and Biological Replicates
27.4.2 Technical Variability
27.4.3 Biological Replicates Increase Accuracy
27.4.4 Sample Size
27.4.5 Importance of Randomizing Samples
27.5 Experimental Controls
27.5.1 Spike-Ins
27.5.2 Negative Controls in ChIP-Seq Experiments
27.6 General Quality Assessment
27.6.1 Nucleotide Frequency Characteristics
27.6.2 Percentage Duplicate Reads
27.7 Platform-Specific Quality Scores
27.7.1 Sanger, Roche, Illumina, and SOLiD Quality Scores
27.7.2 Conversion and Visualization of Quality Scores
27.8 Quality Checks After Alignment
27.8.1 Percentage of Reads Aligned and Percentage in Repeat Regions
27.8.2 DeepSAGE: Percentage 21–22Mers
27.8.3 RNA-Seq: Percentage Tags in Annotated Transcripts
27.8.4 miRNA Profiling: Percentage in Annotated miRNAs
27.8.5 ChIP-Seq: Enrichment
27.8.6 Correlation Measures
27.9 What Can Go Wrong
27.9.1 Sample Swaps
27.9.2 Contamination
27.10 Perspectives
References

28 UTGB Toolkit for Personalized Genome Browsers
Taro L. Saito, Jun Yoshimura, Budrul Ahsan, Atsushi Sasaki, Reginaldo Kurosh, and Shinichi Morishita
28.1 Introduction
28.2 Overview of the UTGB Toolkit
28.2.1 Availability of the UTGB Toolkit
28.3 Methods
28.3.1 Installation of the UTGB Toolkit
28.3.1.1 Prerequisites
28.3.1.2 Easy Installer
28.3.1.3 Mac OS X and Linux
28.3.1.4 Windows
28.3.2 Running the UTGB Toolkit
28.3.3 Viewing Help Messages
28.3.4 Creating a new UTGB Project
28.3.5 Building a Genome Browser
28.3.6 Launching a Portable Web Server
28.3.7 Configuring Track Views
28.3.8 Adding a New Track
28.3.8.1 FastaTrack
28.3.8.2 ReadTrack
28.3.8.3 WigTrack
28.3.8.4 Adding Keyword Search
28.3.9 Switching Views
28.3.10 Publishing Your Genome Browser
28.3.11 Manual Installation of UTGB Toolkit (Optional)
28.3.11.1 Windows
28.3.11.2 Mac OS X and Linux
28.3.12 Developing Your Own Tracks
28.4 Applications
28.4.1 Portable Web Server for Quickly Browsing Local Resources
28.4.2 Portable Database Engine
28.4.3 Web Application Development Framework
28.4.3.1 Server-Side Programming Support
28.4.3.2 Web Action
28.4.3.3 Database Connection
28.4.3.4 Object–Database Mapping
28.5 Perspectives
References

29 Beyond the Pipelines: Cloud Computing Facilitates Management, Distribution, Security, and Analysis of High-Speed Sequencer Data
Boris Umylny and Richard S.J. Weisburd
29.1 Introduction
29.2 Data Management
29.2.1 Data Quantity
29.2.2 HSSs
29.2.3 Data Analysis
29.2.4 Data Size
29.3 Distribution
29.3.1 Collaboration
29.3.2 Distribution of Data, Annotations, and Analysis Tools
29.4 Analysis
29.4.1 Integrating Data Repositories and Analytics
29.4.2 Integrating HSS Discovery Pipelines with Annotation Data
29.4.3 Integrating HSS and Traditional Analysis Algorithms
29.4.4 Cloud-Based Infrastructure
29.5 Security
29.6 Healthcare Data and Privacy Issues
29.7 Sample Evaluation of a Vendor Solution
29.8 Perspectives
References

30 Computational Methods for the Identification of MicroRNAs from Small RNA Sequencing Data
Eugene Berezikov
30.1 Introduction
30.2 Implementing the miR-Intess Pipeline
30.2.1 Preprocessing of Small RNA Sequencing Data
30.2.2 Mapping of Small RNA Reads to the Genome
30.2.3 Annotation
30.2.4 Identification of Hairpin Structures
30.2.5 miRNA Signatures
30.2.6 miRNA Expression Profiles
30.3 Applications
References

Glossary
Link Collection for Next-Generation Sequencing
Index

Preface

Unprecedented progress in sequencing technologies along with the development of software to interpret the resulting massive DNA sequence data have brought so-called next-generation sequencing technologies into the focus of today’s Life Sciences and medical research. Beyond doubt, next-generation sequencing will have a dramatic impact on our understanding of disease and healthcare in the next years to come, and will provide us with entirely new insights into life on Earth. We know that it is impossible to provide an up-to-date overview on such a rapidly developing field and its future directions within the scope of just one single book. Others have already provided comprehensive overviews on next-generation sequencing technologies and their use in genome sequencing, such as, for example, Michal Janitz with a book entitled Next-Generation Genome Sequencing: Towards Personalized Medicine, also published by Wiley-VCH (2008). Sequencing of entire genomes and resequencing of specific genomic regions such as exons are leading the field at this point, and the results have already started to make rapid changes in biological and medical research. However, in parallel, many research tools for what is now known as “analytical sequencing” have been designed, and most of them will make next-generation sequencing applications routine for studying biological and medical aspects. At the starting point of “analytical sequencing” the dominant idea was that short sequencing reads – so-called “tags” – could be used for transcript identification. Tag-based approaches were originally designed to increase the throughput of capillary sequencing, where concatemers of such short tag sequences were first used in expression profiling. The new next-generation sequencing platforms largely expanded the use of tag-based approaches, since tag lengths perfectly matched, and still match, the short read lengths of highly parallel sequencing reactions, and therefore avoid concatemerization. 
Moreover, many of the new applications no longer use restriction endonucleases to limit tag length, which is now determined by the read length into the ends of DNA fragments (also denoted as “sequence census methods”). Today, tag- and sequence census-based approaches cover many applications in genome and transcriptome analysis starting from proteins, DNA, or RNA. Although further progress in next-generation sequencing will yield longer read lengths, tag- and sequence census-based approaches will maintain their important role in Life Sciences, because longer reads are not always required to obtain meaningful data for “analytical sequencing.” Whereas de novo genome sequencing and resequencing will benefit from ever-more powerful sequencing methods, analytical sequencing will shift away from “sequencing power” to better software packages for data analysis and visualization of the resulting immense datasets. It will be essential for common users to make the data more easily accessible and to provide the tools that allow small laboratories without any bioinformatics infrastructure to also work with this kind of data. Moreover, we see a clear need to establish more reference data and better genome annotations, fundamental to data interpretation. In particular, for analytical or diagnostic applications, the success of next-generation sequencing will depend on reliable and reproducible interpretation of the datasets. It is necessary to move away from the descriptive studies at the start of any new technology development towards experiments using replicates and statistical analysis along with trusted references. Today, next-generation sequence data still require powerful bioinformatics that has to be converted into easy-to-use data analysis tools along with a decrease in the cost for running next-generation sequencing experiments. Use of shorter sequencing reads and their reduced information content is one way to reduce experimental cost.

The present book presents an overview of recently developed tag/sequence census-based approaches and current next-generation sequencing technologies, along with an introduction to data analysis. These three topics are reflected in the organization of the book into three major parts. We intentionally excluded chapters on the upcoming third (next-next)-generation sequencer from Pacific Biosciences and Life Technologies’ new single-molecule sequencing technology. Although the first instruments of either vendor may already be on the market when this book is published, both methods produce much longer sequence reads (over 1000 bp) not really needed for the methods covered by the present book.

We express our gratitude for the dedicated support and the efforts of all authors working together with us to make this book possible.

September 2011
Kashiwa (Japan)
Frankfurt am Main (Germany)

Matthias Harbers
Günter Kahl


List of Contributors

Budrul Ahsan
University of Tokyo Graduate School of Frontier Sciences Department of Computational Biology Kashiwa Research Complex 370 5-1-5 Kashiwanoha Kashiwa City, Chiba 277-8562 Japan

Artyom A. Alekseyenko
Brigham and Women’s Hospital and Harvard Medical School Division of Genetics Department of Medicine 77 Avenue Louis Pasteur Boston, MA 02115 USA

Javier Armisen
Wellcome Trust/Cancer Research UK Gurdon Institute University of Cambridge The Henry Wellcome Building of Cancer and Developmental Biology Tennis Court Road Cambridge CB2 1QN UK

Jennifer Asano
University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada

Eugene Berezikov
Hubrecht Institute Small RNA Biology Research Group Uppsalalaan 8 3584 CT Utrecht The Netherlands

Nicolas Bertin
RIKEN Yokohama Institute Omics Science Center 1-7-22 Suehiro-cho Tsurumi-ku, Yokohama Kanagawa 230-0045 Japan

Judith M. Boer
Leiden University Medical Center Center for Human and Clinical Genetics Postal Zone S4-P P.O. Box 9600 2300 RC Leiden The Netherlands
and
Erasmus Medical Center Laboratory of Pediatric Oncology Erasmus MC-Sophia Children’s Hospital room Ee1575, Dr. Molewaterplein 50 3015 GE Rotterdam The Netherlands

Robert Bogden
Amplicon Express Inc. 2345 NE Hopkins Court Pullman, WA 99163 USA

Anne Borries
University of Würzburg Institute for Molecular Infection Biology Research Center for Infectious Diseases (ZINF) Josef-Schneider-Straße 2/Bau D15 97080 Würzburg Germany

Henk P. Buermans
Leiden University Medical Center Center for Human and Clinical Genetics Postal Zone S4-P P.O. Box 9600 2300 RC Leiden The Netherlands

Piero Carninci
RIKEN Yokohama Institute Omics Science Center 1-7-22 Suehiro-cho Tsurumi-ku, Yokohama Kanagawa 230-0045 Japan

Jun Chen
Xiamen University School of Life Sciences Department of Ocean Biology Xiamen, Fujian 361012 China

Allen Delaney
University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada

Johan T. den Dunnen
Leiden University Medical Center Center for Human and Clinical Genetics Postal Zone S4-P P.O. Box 9600 2300 RC Leiden The Netherlands

XXII

j

List of Contributors

Noreen Dhalla University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada

John M. Greally Albert Einstein College of Medicine Center for Epigenomics Department of Genetics 1301 Morris Park Avenue Bronx, NY 10461 USA

Martin Hirst University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada

Jason Dobry Amplicon Express Inc. 2345 NE Hopkins Court Pullman, WA 99163 USA

Martien A.M. Groenen Wageningen University Animal Breeding and Genomics Center Marijkeweg 40 6709 PG Wageningen The Netherlands

Joshua W.K. Ho Brigham and Women’s Hospital and Harvard Medical School Division of Genetics Department of Medicine 77 Avenue Louis Pasteur Boston, MA 02115 USA

Mitchell S. Dushay Illinois Institute of Technology Division of Biology Life Sciences Building 3101 South Dearborn Street Chicago, IL 60616 USA Virginia Edgcomb Woods Hole Oceanographic Institution Department of Geology and Geophysics 266 Woods Hole Road Woods Hole, MA 02543 USA Alistair R.R. Forrest RIKEN Yokohama Institute Omics Science Center 1-7-22 Suehiro-cho Tsurumi-ku, Yokohama Kanagawa 230-0045 Japan Roald Forsberg CLC bio Finlandsgade 10–12 Katrinebjerg 8200 Aarhus N Denmark Paul G. Giresi University of North Carolina at Chapel Hill Department of Biology and Carolina Center for Genome Sciences 408 Fordham Hall Chapel Hill, NC 27599-3280 USA

Korbinian Grote Genomatix Software GmbH Bayerstrasse 85a 80335 Munich Germany Matthias Harbers DNAFORM Inc. Leading Venture Plaza 2 75-1 Ono-cho Tsurumi-ku, Yokohama Kanagawa 230-0046 Japan Anne-Mette Hein CLC bio Finlandsgade 10–12 Katrinebjerg 8200 Aarhus N Denmark Matthew S. Hestand Leiden University Medical Center Center for Human and Clinical Genetics Postal Zone S4-P P.O. Box 9600 2300 RC Leiden The Netherlands

Stephen Hutchison 454 Life Sciences 15 Commercial Street Branford, CT 06405 USA Karolina Janitz Hawkesbury Institute for the Environment University of Western Sydney Hawkesbury Campus, Locked Bag 1797 Penrith, NSW 2751 Australia Michal Janitz University of New South Wales School of Biotechnology and Biomolecular Sciences Biological Sciences Building Kensington, NSW 2052 Australia Thomas Jarvie 454 Life Sciences 15 Commercial Street Branford, CT 06405 USA

and University of Kentucky Department of Veterinary Science Gluck Equine Research Center 1400 Nicholasville Road Lexington, KY 40546-0099 USA

Günter Kahl University of Frankfurt am Main Biocenter Max-von-Lauestraße 9 60439 Frankfurt am Main Germany


and Frankfurt Biotechnology Innovation Center (FIZ) GenXPro Ltd Altenhöferallee 3 60438 Frankfurt am Main Germany Irina Khrebtukova Illumina Inc. Gene Expression Applications 25861 Industrial Boulevard Hayward, CA 94545 USA Yeong C. Kim University of Nebraska Medical Center Department of Genetics, Cell Biology & Anatomy 42nd and Emile Omaha, NE 68198 USA Julian König MRC Laboratory of Molecular Biology Division of Structural Studies Hills Road Cambridge CB2 0QH UK Detlev H. Krüger Charité – Universitätsmedizin Berlin Institut für Virologie Schumannstraße 20/21 10117 Berlin Germany Mitzi I. Kuroda Brigham and Women’s Hospital and Harvard Medical School Division of Genetics Department of Medicine 77 Avenue Louis Pasteur Boston, MA 02115 USA Reginaldo Kurosh University of Tokyo Graduate School of Frontier Sciences Department of Computational Biology Kashiwa Research Complex 370 5-1-5 Kashiwanoha Kashiwa City, Chiba 277-8562 Japan

Yuching Lai Leiden University Medical Center Center for Human and Clinical Genetics Postal Zone S4-P P.O. Box 9600 2300 RC Leiden The Netherlands Irene Li University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada Jason D. Lieb University of North Carolina at Chapel Hill Department of Biology and Carolina Center for Genome Sciences 408 Fordham Hall Chapel Hill, NC 27599-3280 USA Shujun Luo Illumina Inc. Gene Expression Applications 25861 Industrial Boulevard Hayward, CA 94545 USA Marco Marra University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada Hideo Matsumura Shinshu University Gene Research Center Tokita 3-15-1 Ueda, Nagano 386-8567 Japan Helen McDonald University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada


Nicholas J. McGlincy MRC Laboratory of Molecular Biology Division of Neurobiology Hills Road Cambridge CB2 0QH UK Hendrik-Jan Megens Wageningen University Animal Breeding and Genomics Center Marijkeweg 40 6709 PG Wageningen The Netherlands Patrice M. Milos Helicos BioSciences Corporation One Kendall Square, Building 200 Cambridge, MA 02139 USA Eric A. Miska Wellcome Trust/Cancer Research UK Gurdon Institute University of Cambridge The Henry Wellcome Building of Cancer and Developmental Biology Tennis Court Road Cambridge CB2 1QN UK Mohammed Mohiuddin 454 Life Sciences 15 Commercial Street Branford, CT 06405 USA Carlos Molina INRA-URLEG Unité de Recherche en Légumineuses 17 Rue Sully 21000 Dijon France Søren Mønsted CLC bio Finlandsgade 10–12 Katrinebjerg 8200 Aarhus N Denmark Shinichi Morishita University of Tokyo Graduate School of Frontier Sciences Department of Computational Biology Kashiwa Research Complex 370 5-1-5 Kashiwanoha Kashiwa City, Chiba 277-8562 Japan


Sorana Morrissy University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada

Xiaoan Ruan Genome Institute of Singapore Genome Technology and Biology 60 Biopolis Street #02-01 Genome Singapore 138672 Singapore

Amy Mraz Amplicon Express Inc. 2345 NE Hopkins Court Pullman, WA 99163 USA

Yijun Ruan Genome Institute of Singapore Genome Technology and Biology 60 Biopolis Street #02-01 Genome Singapore 138672 Singapore

Pawan Pandoh University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada Peter J. Park Harvard Medical School Center for Biomedical Informatics 10 Shattuck Street Boston, MA 02115 USA Charles Plessy RIKEN Yokohama Institute Omics Science Center 1-7-22 Suehiro-cho Tsurumi-ku, Yokohama Kanagawa 230-0045 Japan Anna-Liisa Prabhu University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada Marcel Prins KeyGene NV 6700 AE Wageningen The Netherlands Tal Raz Helicos BioSciences Corporation One Kendall Square, Building 700 Cambridge, MA 02139 USA

Taro L. Saito University of Tokyo Graduate School of Frontier Sciences Department of Computational Biology Kashiwa Research Complex 370 5-1-5 Kashiwanoha Kashiwa City, Chiba 277-8562 Japan Atsushi Sasaki University of Tokyo Graduate School of Frontier Sciences Department of Computational Biology Kashiwa Research Complex 370 5-1-5 Kashiwanoha Kashiwa City, Chiba 277-8562 Japan Gary P. Schroth Illumina Inc. Gene Expression Applications 25861 Industrial Boulevard Hayward, CA 94545 USA David Sexton Baylor Medical College Human Genome Sequencing Center 2005 South Mason Rd #906 Katy, TX 77450 USA Cynthia M. Sharma University of Würzburg Institute for Molecular Infection Biology Research Center for Infectious Diseases (ZINF) Josef-Schneider-Straße 2/Bau D15 97080 Würzburg Germany

W. Robert Shaw Imperial College London Department of Life Sciences London SW7 2AZ UK Geoffrey P. Smith Illumina Cambridge Ltd. Sequencing Research Little Chesterford Essex CB10 1XL UK Thorsten Stoeck University of Kaiserslautern Faculty of Biology Ecology Department Erwin-Schrödinger Straße 14 67663 Kaiserslautern Germany Keith Stormo Amplicon Express Inc. 2345 NE Hopkins Court Pullman, WA 99163 USA Masako Suzuki Albert Einstein College of Medicine Center for Epigenomics Department of Genetics 1301 Morris Park Avenue Bronx, NY 10461 USA Fraz Syed Epicentre Biotechnologies 726 Post Road Madison, WI 53713 USA Angela Tam University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada Quanzhou Tao Amplicon Express Inc. 2345 NE Hopkins Court Pullman, WA 99163 USA


Ryohei Terauchi Iwate Biotechnology Research Center Research Group of Genetics and Genomics Narita 22-174-4 Kitakami, Iwate 024-0003 Japan Peter A.C. ’t Hoen Leiden University Medical Center Center for Human and Clinical Genetics Postal Zone S4-P P.O. Box 9600 2300 RC Leiden The Netherlands and Leiden University Medical Center Leiden Genome Technology Center Postal Zone S4-P P.O. Box 9600 2300 RC Leiden The Netherlands John F. Thompson Helicos BioSciences Corporation One Kendall Square, Building 700 Cambridge, MA 02139 USA Jernej Ule MRC Laboratory of Molecular Biology Division of Structural Studies Hills Road Cambridge CB2 0QH UK Boris Umylny Japan Bioinformatics KK Yoyogiekimae Building 401 1-36-6 Yoyogi, Shibuya-ku Tokyo 151-0053 Japan Michiel van Eijk KeyGene NV 6700 AE Wageningen The Netherlands Michiel van Galen Leiden University Medical Center Center for Human and Clinical Genetics Postal Zone S4-P P.O. Box 9600 2300 RC Leiden The Netherlands

Mark van Haaren KeyGene NV 6700 AE Wageningen The Netherlands Maarten van Iterson Leiden University Medical Center Center for Human and Clinical Genetics Postal Zone S4-P P.O. Box 9600 2300 RC Leiden The Netherlands Jan van Oeveren KeyGene NV 6700 AE Wageningen The Netherlands Jörg Vogel University of Würzburg Institute for Molecular Infection Biology Research Center for Infectious Diseases (ZINF) Josef-Schneider-Straße 2/Bau D15 97080 Würzburg Germany San Ming Wang University of Nebraska Medical Center Department of Genetics, Cell Biology & Anatomy 42nd and Emile Omaha, NE 68198 USA Richard S.J. Weisburd ELSS Inc. 2504-3 Saiki Tsukuba Ibaraki 305-0028 Japan Thomas Werner Genomatix Software GmbH Bayerstrasse 85a 80335 Munich Germany Jon Wittendorp KeyGene NV 6700 AE Wageningen The Netherlands


Jun Yoshimura University of Tokyo Graduate School of Frontier Sciences Department of Computational Biology Kashiwa Research Complex 370 5-1-5 Kashiwanoha Kashiwa City, Chiba 277-8562 Japan Yongjun Zhao University of British Columbia BC Cancer Agency Genome Sciences Centre 570 West 7th Avenue Vancouver, BC V5Z 4S6 Canada


Part One Tag-Based Nucleic Acid Analysis

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.


1 DeepSuperSAGE: High-Throughput Transcriptome Sequencing with Now- and Next-Generation Sequencing Technologies

Hideo Matsumura, Carlos Molina, Detlev H. Krüger, Ryohei Terauchi, and Günter Kahl

Abstract

SuperSAGE is a variant of the serial analysis of gene expression (SAGE) expression profiling technology in which 26-bp tags are extracted from cDNA using the type III restriction endonuclease EcoP15I. The longer tag size in SuperSAGE allows a secure tag-to-gene annotation by homology searches against genome, transcript, or expressed sequence tag sequences. For organisms without genomic information, the 26-bp tags can be used as polymerase chain reaction primers to recover the full-length cDNA by 5′- and 3′-rapid amplification of cDNA ends. Here, we present the combination of SuperSAGE and high-throughput sequencing technologies (now- or next-generation sequencing (NGS)), a merger we call deepSuperSAGE. The direct sequencing of millions of tag fragments greatly shortens the time and reduces the costs of the analysis. Furthermore, the incorporation of an indexing system expands the potential of deepSuperSAGE to analyze multiple samples in a single NGS run. The most recent version of deepSuperSAGE (high-throughput SuperSAGE) equals or even outcompetes microarrays in throughput. These improvements allow the application of deepSuperSAGE to transcriptome analysis in any eukaryotic system.

1.1 Introduction

Technologies for gene expression analysis have improved dramatically over the past years. Northern blot analysis and reverse transcription-polymerase chain reaction (RT-PCR) still are, to some extent, standard tools for expression analysis of individual genes. However, these techniques, for all their virtues, cannot be expanded to measure gene expression genome-wide and will therefore continue to be used to analyze expression on a gene-by-gene basis, although multiplexed quantitative PCR (qPCR, then called high-throughput real-time RT-PCR) can expand the analysis to 384 genes or more. This variant of qPCR – if controlled properly – allows an ultra-sensitive measurement of transcription by using gene-specific primers and probes in a PCR-based assay [1]. Although guidelines for the proper design of qPCR experiments have been established [2–4], a further increase in the number of addressed genes still meets with difficulties. The recent explosion of information from genome and transcriptome sequencing projects now encourages analysis of the expression of a large number of genes, preferably all, at a given time. Traditionally, microarrays of various architectures already


represented tools for this kind of high-throughput gene expression profiling [5]. Microarrays are microscale solid supports (e.g., nylon membranes, nitrocellulose, glass, quartz, silicon, or other synthetic material) onto which DNA fragments, cDNAs, oligonucleotides, genes, open reading frames, peptides, or proteins (e.g., antibodies) are spotted in an ordered pattern (“array”) at extremely high density. Microarray-based expression profiling (“transcript profiling”) for the simultaneous detection of the expression of thousands or tens of thousands of genes (the so-called “expressome”), whose complementary DNA sequences are immobilized on the array, requires the hybridization of fluorophore-labeled cDNAs from target tissue(s). After hybridization and high-stringency washing, the hybridization patterns can be visualized by fluorescence detection.

Microarrays played a pioneering role in transcriptomics, but this role is fading [6]. The reasons for this decline are manifold. For all their virtues, microarrays of whatever format suffer from a series of shortcomings. The poor correlation between different microarray platforms stands out (as do relatively large differences in data obtained in different labs using the same platform), but – equally important – their closed-architecture format allows the detection only of genes that are spotted on the array. Therefore, microarrays cannot detect novel genes. They require large amounts of input RNA for robust answers, which are at most semiquantitative and at their best with the more abundant mRNAs. Microarrays are also prone to cross-hybridization of a single probe to different target RNAs, and the experimenter has no reliable predictor for on-chip hybridization efficiency. Ambiguity exists in data analysis and interpretation, and in some cases defective oligonucleotides prior to printing have been reported. The widely different fluorescence intensity signals generated by different probes targeting the same gene confuse the experimenter. All these inadequacies and inconsistencies of the microarray platform, and more so the irreproducibility of microarray-based results [7,8], which persisted in spite of many improvements, catalyzed the development of substitute technologies.

For example, expressed sequence tag (EST) analysis – the large-scale sequencing of partial cDNA fragments – generating sequence information of thousands of expressed genes was and is extensively used. The number of sequence reads from a particular gene represents the expression level of the gene in the sample. However, ESTs are also only a sample of the whole transcriptome, contain a high sequence error rate (up to 3%), are relatively short (average 400 bp), contain artifacts such as vector and bacterial sequence contaminations, only represent 5′ and/or 3′ ends of transcripts, and suffer from a bias in the dbEST database (http://www.ncbi.nlm.nih.gov/dbEST; splice variants involving exons located in the center of long transcripts are under-represented). In addition, due to high sequencing costs, EST analysis has not always been a suitable method in terms of throughput.

In a seminal publication, Velculescu et al. [9] reported a method to count transcripts in a high-throughput manner, which they named serial analysis of gene expression (SAGE). In the original SAGE, a short fragment of 13–15 bp is isolated as a tag from a defined position of each cDNA. Those tags are then concatenated and cloned into a plasmid vector for sequencing. The key to the SAGE technique is the use of the type IIS restriction endonuclease BsmFI as the tagging enzyme that extracts tag fragments from transcripts. BsmFI cuts 13–15 bases away from its recognition site, allowing the isolation of 13- to 15-bp tag sequences from cDNAs. Each transcript is uniquely represented by a tag fragment and the tag frequency in the sample (tag count) reflects the abundance of the corresponding transcript. The obtained 13- to 15-bp tag sequence can be used as a query in a BLAST search against EST databases of the species from which the tag sequence is derived (tag annotation). By listing the count and annotation of thousands of tags, one can obtain a comprehensive and quantitative profile of gene expression. In contrast to analog datasets generated by hybridization-based methods like microarrays, SAGE data are digital and easy to handle bioinformatically. SAGE is an open-architecture method whereby researchers can theoretically address all the expressed transcripts


simply by increasing the number of tags to be analyzed. These advantages make SAGE superior to microarrays, which are a closed-architecture method. However, the original SAGE method had a major problem of accuracy in tag-to-gene annotation, owing to the short size of the tag. To overcome this inadequacy, improved versions of SAGE were established that obtain longer tag sequences from cDNAs. For example, LongSAGE [10] employed MmeI to isolate 21-bp tags and, more recently, SuperSAGE [11] has generated much longer tags (26 bp). Recent rapid advancements of DNA sequencing technologies dramatically improved the SuperSAGE protocol by increasing throughput and reducing analytical cost. The merger of SuperSAGE with one of the “now- or next-generation sequencing” (NGS) platforms [11] is known as deepSuperSAGE or high-throughput SuperSAGE (HT-SuperSAGE) [12,23]. The potential of this technology for genome-wide and quantitative gene expression profiling has now been amply demonstrated and will be addressed in the present chapter.
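The counting principle behind SAGE and its successors, in which the frequency of a tag reflects the abundance of its transcript, can be illustrated with a minimal Python sketch. It tallies identical tags and reports each tag's raw count together with its frequency per million tags. The function name `tag_profile`, the toy tags, and the per-million normalization are assumptions made for illustration; they are not taken from the protocols of this chapter.

```python
from collections import Counter


def tag_profile(tags):
    """Return {tag: (raw_count, tags_per_million)} for a list of tag sequences.

    Illustrative only: the per-million scaling is an assumed convention for
    comparing libraries of different sizes, not a step stated in the text.
    """
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: (n, n * 1_000_000 / total) for tag, n in counts.items()}


if __name__ == "__main__":
    # Toy library of four 26-bp tags (hypothetical sequences).
    abundant = "CATG" + "A" * 22
    rare = "CATG" + "C" * 22
    library = [abundant, abundant, abundant, rare]
    for tag, (count, tpm) in tag_profile(library).items():
        print(tag, count, tpm)
```

Because the result is a simple table of integers, libraries can be compared tag by tag, which is what makes such data "digital" in contrast to hybridization signals.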

1.2 Overview of the Protocols

1.2.1 Principle of the SuperSAGE Method

SuperSAGE is an improved version of the SAGE technology, whereby 26-bp tags are extracted from cDNA using the type III enzyme EcoP15I [13,14]. EcoP15I cleaves 25/27 bp away from its recognition site, the longest distance between recognition and cleavage sites among all known restriction enzymes [15]. Basically, the experimental procedure of SuperSAGE is similar to that of the original SAGE, except for the tagging enzyme, oligo(dT) primers, and linkers. All the details from cDNA synthesis to tag extraction are described in the following protocol. For an efficient DNA digestion with EcoP15I, two copies of its recognition sequence 5′-CAGCAG-3′ should be located in head-to-head orientation within the target DNA molecule [13]. Therefore, one 5′-CAGCAG-3′ site is inserted into the adapter-oligo(dT) primer sequence and the other site is incorporated in the linkers, which are ligated to the digested cDNA. These linker-ligated cDNA fragments are cleaved by EcoP15I at a position 25/27 bp away from either of the recognition sites in the linkers and adapter-oligo(dT) primers. Two “linker–tag” fragments are ligated in head-to-head orientation to generate “linker–ditag–linker” fragments and the resulting fragments are amplified by PCR. After removal of linkers, ditags are concatenated and the concatemers are cloned into a plasmid vector for sequencing. From sequencing reads of plasmid inserts (concatemers), tag sequences are extracted. Although most of the tags are expected to be 27 bp in size, a considerable number of 26-bp tags was actually obtained, as also described in Section 1.2.5. Therefore, we defined a 26-bp sequence as the tag from concatemer sequences.

1.2.2 Power of the SuperSAGE Tag

With the increased tag length (26 or 27 bp), the efficiency of tag-to-gene annotation is considerably improved. In model organisms, 26-bp tags allow almost perfect gene annotation by a BLAST search against the genome or cDNA sequence databases [12]. BLAST analysis with tags of different sizes (15, 20, or 26 bp) convincingly demonstrated that 15- and 20-bp tags usually match DNA sequences of multiple species, whereas the 26-bp SuperSAGE tags match DNA sequences of a single species in most cases [12]. Therefore, the sequence information of 26-bp tags can uniquely identify the gene and species from which the tag was derived. This high specificity of a SuperSAGE tag allows simultaneous monitoring of gene expression


of two or more species that are in a tight interaction (e.g., a pathogen and its host cells) [16,17]. An additional advantage of the 26-bp SuperSAGE tags is that the full or partial sequence of the corresponding genes can easily be recovered by PCR. This allows the analysis of transcriptomes even in nonmodel organisms. For recovery of corresponding genes from tag sequences, rapid amplification of cDNA ends (RACE) is the most conventional method [18]. By a combination of 3′- and 5′-RACE, sequences of several full-length cDNAs were obtained easily starting from 26-bp tag sequences in Nicotiana benthamiana [19]. Alternatively, Coemans et al. [20] succeeded in amplifying genes corresponding to the tags by thermal asymmetric interlaced (TAIL)-PCR using genomic Musa acuminata (banana) DNA as template. This method recovered corresponding genes including their promoter regions from the tag without preparing a high-quality cDNA template. In summary, the high specificity of tag-to-gene annotation and its applicability to nonmodel organisms are the two major advantages of SuperSAGE.

1.2.3 Development of DeepSuperSAGE

Recent advances in DNA sequencing technologies – the NGS platforms – are dramatically changing the whole research strategy in biological studies. These technologies aim at reading huge amounts of DNA sequences in a short time at low cost. Currently available NGS technologies are based on massively parallel sequencing, which produces sequences of millions of DNA fragments at a time. The output of the NGS DNA sequencers is a huge number of short sequences, so-called “reads.” This feature of NGS is extremely suitable for sequencing the 26-bp SuperSAGE tags. Thus, we combined SuperSAGE with NGS technology to establish deepSuperSAGE, which greatly reinforces the traditional SuperSAGE technology. The first NGS instrument released was the Genome Sequencer (“GS” series) from 454 Life Sciences in 2005 [21]. This sequencer employs pyrosequencing and the average read length spans from 100 (GS20) to 400 bp (GS FLX Titanium). We developed a protocol for direct sequencing of SuperSAGE ditags with linkers for the GS20 DNA sequencer (Figure 1.1) [22]. Afterwards, more powerful massively parallel sequencers continuously emerged. Since the read length of these machines is short (less than 35–50 bp), fragments containing single tags (not ditags) were applied to these sequencers (Figure 1.1) [23]. This deepSuperSAGE technology allows a high-throughput analysis of any transcriptome. The advantages of this method include:

i) Huge numbers of 26-bp tags (more than 1 million) are obtained in a single sequencing run.
ii) DNA fragments containing tags or ditags are directly sequenced without plasmid cloning.
iii) Tags from several independent samples can be pooled and analyzed together in a single run by employing index (barcode) sequences in the linker or adapter.

Since increasing the number of analyzed tags apparently improves the accuracy of profiling data, deepSuperSAGE analysis promises high-quality data. Additionally, analytical costs are reduced, owing to the lower sequencing costs per base in NGS. In the original SuperSAGE protocol, concatenation of ditags and plasmid cloning were necessary for sequencing [12]. Using this approach, it was not easy to optimize cloning efficiency and obtain clones with large inserts. Even after a high-quality library was constructed, several hundreds of clones or inserts had to be prepared. DeepSuperSAGE now avoids all these steps and, as a consequence, greatly reduces effort, time, and costs. Previously, SuperSAGE was regarded as a gene expression profiling method for a limited number of samples, because the time, cost, and effort required proportionally


Fig. 1.1 Scheme of deepSuperSAGE. After EcoP15I digestion of linker (adapter)-ligated cDNA fragments immobilized on paramagnetic beads, ditags were formed for 454 pyrosequencing analysis (left) or another adapter was immediately ligated to the EcoP15I digestion end (single-tag) for Illumina GA analysis (right). Sizes of sequenced fragments were 96–98 bp in the ditag protocol and 36 bp in the single-tag protocol.

increased with sample numbers. By employing the multiplexing protocol in deepSuperSAGE, the SuperSAGE technology is now applicable to many samples without increasing the time, cost, and effort. In combination with NGS, digital gene expression (DGE) and RNA-seq are commonly used for high-throughput transcript profiling [24,25]. The DGE protocol for the Illumina Genome Analyzer (GA) platform was based on LongSAGE. However, deepSuperSAGE turned out to be superior, due to the longer size of the obtained tags. RNA-seq, on the other hand, is suitable to understand the structure of transcripts rather than quantifying amounts of transcripts. Consequently, we suggest that deepSuperSAGE is still the best method of tag-based quantitative transcriptome analysis employing NGS. The method and some of its applications will be described separately for (i) ditag- and (ii) single-tag-based deepSuperSAGE.

1.2.4 Ditag-Based DeepSuperSAGE (for 454 Pyrosequencing)

The first version of the released 454 pyrosequencer (GS20) produced reliable sequence reads of 100 bp from each fragment. Coincidentally, the size of a SuperSAGE “linker–ditag–linker” fragment generated after PCR amplification is 96–98 bp, which perfectly fits the length of a single GS20 sequence read. Therefore, amplified fragments directly served as sequencing templates without concatenation and plasmid cloning. A single sequencing run produces sequences of 200 000–1 000 000 ditags on average, corresponding to a total of 400 000–2 000 000 tags. Since in this procedure linker regions are also sequenced together with ditags, we considered using different linker fragments with unique sequences to


generate individual SuperSAGE libraries and separating sequencing data from independent samples based on the linker sequences. Introduction of this improvement allowed a multiplexed SuperSAGE analysis of different samples. Thereby, the scale of multiplexing and tag count for each sample can be flexibly changed and adapted depending on research objectives. Experimental steps from RNA to ditag amplification and purification were almost identical to the original SuperSAGE protocol as described later. Generally, starting from several hundred micrograms of total RNA (1–3 μg poly(A)+ RNA), 1 μg of amplified ditags is obtained from 40 PCR reactions. Successful sequencing will provide more than 200 000 sequence reads. For tag extraction from sequence data, several processes are required, including elimination of incomplete (short) sequence reads, sorting libraries by linker sequences (if multiplexed), and exclusion of duplicated ditag sequences. For this purpose, we developed our own programs, such as SuperSAGE_tag_extract_pipe [22] or GXP-Tag sorter (GenXPro) [26].

1.2.5 Single-Tag-Based DeepSuperSAGE (HT-SuperSAGE)

After the release of the 454 pyrosequencer, other NGS technologies became available, for example the Illumina GA, based on “sequencing-by-synthesis” (SBS), and the Applied Biosystems SOLiD system, based on “sequencing-by-ligation.” These DNA sequencers enabled 100 000 000 reads in a single run. It was expected that a complete transcript profile would be obtained when these sequencers are used for deepSuperSAGE. In the early versions of these sequencers (GA or SOLiD), the size of a sequence read was typically 35 or 36 bp, shorter than the ditag length (52 bp). Therefore, for the adaptation of deepSuperSAGE to GA or SOLiD sequencers, a single-tag sequencing protocol was designed. It basically follows the original SuperSAGE or deepSuperSAGE workflow for 454 pyrosequencing up to the step of 26-bp tag extraction. However, after this step, no “ditags” are formed. Instead, the single tag is flanked by two adapter fragments, one ligated to each end. At this step, we skip the purification from a polyacrylamide gel electrophoresis (PAGE) gel and the fill-in reaction of EcoP15I-digested fragments, which are necessary in the original and 454 pyrosequencing SuperSAGE protocols. This measure reduces the time for experiments and avoids loss of DNA fragments. Single tags flanked by the adapters are amplified by PCR. Finally, PCR products of the expected size (accurately containing adapters and tag) are purified and applied to direct sequencing. In this protocol, two additional improvements were included:

i) The number of PCR amplification cycles of adapter–tag fragments was reduced.
ii) For the analysis of multiple samples in a single sequencing run, a systematic indexing (barcoding) was employed.

By incorporating these improvements, we developed the protocol of HT-SuperSAGE [23]. We were concerned that removal of duplicated ditags could not be integrated in the single-tag protocol and therefore expected distortion of transcript profiles due to PCR amplification biases [9]. To avoid this potential problem, PCR cycles were reduced to 5–10 cycles. By comparing tag profiles obtained with different numbers of PCR cycles (3, 5, and 10 cycles), we could confirm that an increase in PCR cycle numbers up to 10 did not cause any significant distortion in the expression profiles [23]. Since the required amount of template DNA for sequencing on Illumina GA platforms is about 10 ng, sufficient template DNA can be prepared by 10 PCR cycles. In our sequence data, tags of various sizes were observed. If sorted by length, we found that 27-bp tags made up 66% and 26-bp tags made up 25% of all tags. Tags of other sizes were under-represented [23]. Therefore, tags can be recovered from more than 90% of all sequence reads (66% + 25% = 91%) by extracting 26-bp tag sequences. The strategy for sample multiplexing was already employed in deepSuperSAGE using 454 pyrosequencing. The Illumina GA has larger sequencing capacities and


Fig. 1.2 Position of index for multiplexing. Index sequences were located in the linker or adapter sequences. For ditag sequence analysis (left), a 5- or 6-bp index sequence was incorporated within 10 bp upstream of the EcoP15I site in the linkers. For single-tag (HT-SuperSAGE) analysis (right), 4-bp index sequences were located adjacent to the sequencing primer.

therefore allows pooling much larger numbers of samples. For this purpose, a systematic indexing protocol is needed. Because read length is limited, the index should be located close to the tag sequence. Yamaguchi et al. [27] combined SuperSAGE with the Illumina GA and employed a 2-base index upstream of the EcoP15I site in the adapter. In our established protocol, we have designed a 4-base index just downstream of the sequencing primer site (Figure 1.2) [23]. Adapter fragments with different index sequences are separately ligated to 26-bp tag fragments derived from different samples. Adapter–tag fragments from different libraries are pooled and sequenced together. The sequence reads are separated in silico according to their index sequences. By positioning the index in the first 4 bases of the sequence read, the frequency of sequencing errors is minimized.
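The in silico separation of pooled reads can be sketched in Python as follows. The sketch assumes that each read starts with the 4-base index and that the 26-bp tag follows immediately. The function name `demultiplex`, the index-to-sample table, and the example reads are hypothetical, and the authors' own software may locate the tag differently.

```python
from collections import Counter

INDEX_LEN = 4   # the index occupies the first 4 bases of each read (see text)
TAG_LEN = 26    # tags are defined as 26-bp sequences (see text)


def demultiplex(reads, index_to_sample):
    """Assign reads to samples by index and count the tag that follows.

    Assumed read layout (illustrative): [4-base index][26-bp tag][rest].
    Reads that are too short or carry an unknown index are skipped.
    """
    counts = {sample: Counter() for sample in index_to_sample.values()}
    for read in reads:
        if len(read) < INDEX_LEN + TAG_LEN:
            continue  # incomplete read
        sample = index_to_sample.get(read[:INDEX_LEN])
        if sample is None:
            continue  # index not assigned to any library
        tag = read[INDEX_LEN:INDEX_LEN + TAG_LEN]
        counts[sample][tag] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical indices and 36-base reads.
    index_to_sample = {"ACGT": "sample_1", "TGCA": "sample_2"}
    reads = [
        "ACGT" + "A" * 32,
        "ACGT" + "A" * 32,
        "TGCA" + "G" * 32,
        "NNNN" + "A" * 32,   # unknown index, skipped
        "ACGT" + "A" * 10,   # too short, skipped
    ]
    for sample, tags in demultiplex(reads, index_to_sample).items():
        print(sample, dict(tags))
```

Exact matching of the index is the simplest possible rule; tolerating mismatches would require index sets designed for that purpose, which the text does not describe.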

1.3 Methods and Protocols

1.3.1 Linker or Adapter Preparation

454 Pyrosequencing

Linker DNAs for SuperSAGE are prepared by annealing the two complementary oligonucleotides, as shown in Table 1.1 (Linker-1A, -1B, -2A, -2B). Linker DNAs have cohesive ends, which are compatible with the end generated by NlaIII digestion (5′-CATG-3′). An EcoP15I recognition site (5′-CAGCAG-3′) is present adjacent to the 5′-CATG-3′ site. The 3′ ends of the Linker-XBs should be amino-modified to prevent ligation to the cDNA or another linker molecule at this site. Several different pairs of linker DNAs (Linker-1, -2, -3, -4, etc.) can be synthesized for the preparation of multiple SuperSAGE libraries. In these linkers, sequence variation of 5–6 bp is incorporated within the 10-bp region upstream of the EcoP15I recognition site as an index

Table 1.1 Oligonucleotide sequences for linkers or adapters in deepSuperSAGE.

Name        Sequence
Linker-1A   5'-TTTGGATTTGCTGGTGCAGTACAACTAGGCTTAATACAGCAGCATG
Linker-1B   5'-CTGCTGTATTAAGCCTAGTTGTACTGCACCAGCAAATCCAAA-amino
Linker-2A   5'-TTTCTGCTCGAATTCAAGCTTCTAACGATGTACGCAGCAGCATG
Linker-2B   5'-CTGCTGCGTACATCGTTAGAAGCTTGAATTCGAGCAGAAA-amino
Adapter-1A  5'-ACAGGTTCAGAGTTCTACAGTCCGACGATCXXXX a)
Adapter-1B  5'-NNYYYYGATCGTCGGACTGTAGAACTCTGAACCTGT-amino a)
Adapter-2A  5'-CAAGCAGAAGACGGCATACGATCTAACGATGTACGCAGCAGCATG
Adapter-2B  5'-CTGCTGCGTACATCGTTAGATCGTATGCCGTCTTCTGCTTG

a) XXXX and YYYY indicate arbitrary index sequences, which should be complementary to each other.


1 DeepSuperSAGE: High-Throughput Transcriptome Sequencing

(Figure 1.2). In this protocol, we show the sequences of only Linker-1 and Linker-2 (Table 1.1).

1.1 Dissolve the synthetic linker oligonucleotides (Linker-1A, -1B, -2A, -2B) in LoTE buffer (3 mM Tris–HCl, pH 7.5; 0.2 mM EDTA), so that their concentration is 1 µg/µl.
1.2 Mix 1 µl Linker-1B (or Linker-2B), 1 µl 10× polynucleotide kinase buffer, 1 µl 10 mM ATP, 7 µl H2O, and 1 µl T4 polynucleotide kinase, and incubate at 37 °C for 30 min to phosphorylate the 5' ends.
1.3 Add 1 µl Linker-1A or -2A to the 5'-phosphorylated Linker-1B or -2B solution from the previous step, respectively.
1.4 After mixing, denature by incubating at 95 °C for 2 min and cool down to 20 °C for annealing.
1.5 The annealed double-stranded DNAs (200 ng/µl) are designated as Linker-1 and Linker-2, respectively.

HT-SuperSAGE

The procedure for HT-SuperSAGE adapter preparation basically follows the linker preparation for 454 sequencing libraries described above. Sequences of adapter oligonucleotides were changed for Illumina GA sequencing (Table 1.1). Adapter-1 has a 4-bp index sequence ("XXXX" in Table 1.1) and a 2-base cohesive end of "NN" for ligating to EcoP15I-digested tag ends. The annealed double-stranded DNAs (200 ng/µl) are designated as Adapter-1 and Adapter-2, respectively. Adapter-2 carries a cohesive end for the NlaIII site (CATG) and an EcoP15I recognition site (5'-CAGCAG-3') adjacent to the NlaIII site.

1.3.2 RNA Samples

About 20–30 µg of total RNA is sufficient starting material for a successful deepSuperSAGE experiment with 454 sequencing. For HT-SuperSAGE, 1–10 µg total RNA is sufficient.

1.3.3 cDNA Synthesis and NlaIII Digestion

The protocols for cDNA synthesis and NlaIII digestion do not depend on the sequencing technology. Any cDNA synthesis protocol is applicable to SuperSAGE, but a biotinylated adapter-oligo(dT) primer (5'-biotin-CTGATGTAGAGGTACCGGATGCCAGCAGTTTTTTTTTTTTTTTTTTT-3') should be used for reverse transcription. We employ the SuperScript II double-strand cDNA synthesis kit (Invitrogen), following the experimental procedures given in its instruction manual.

3.1 After second-strand cDNA synthesis, double-stranded cDNA is purified by passing it through a column (QIAquick PCR purification kit; Qiagen) instead of phenol/chloroform extraction and ethanol precipitation.
3.2 Purified cDNA (50 µl eluted DNA from a column) is completely digested with NlaIII, by adding 20 µl NlaIII digestion buffer (NEBuffer 4), 2 µl bovine serum albumin (BSA), 123 µl LoTE, and 5 µl NlaIII (10 U/µl; NEB).
3.3 Incubate at 37 °C for 1.5 h.

1.3.4 Tag Extraction from cDNA

454 Pyrosequencing

4.1 Digested cDNA solution (without purification) is divided into two tubes, tube A and tube B (100 µl each).
4.2 Tubes A and B contain the cDNA to be ligated with Linker-1 and Linker-2, respectively.
4.3 An equal volume of 2× B&W buffer (10 mM Tris–HCl, pH 7.5; 1 mM EDTA; 2 M NaCl) is added to each of the tubes A and B.
4.4 Contents of tubes A and B are separately added to the washed streptavidin-coated paramagnetic beads (Dynabeads M-270).
4.5 Biotinylated cDNA fragments are bound to the streptavidin-coated magnetic beads by incubation at room temperature for 30 min.
4.6 After washing the beads 3 times with 1× B&W buffer and once with LoTE buffer, Linker-1 and Linker-2, respectively, are ligated to the ends of the cDNAs bound to the magnetic beads in the two tubes.
4.7 For ligation, 200 ng linker DNA is usually added to a tube.
4.8 To ligate linkers to digested cDNAs bound to the magnetic beads, add 21 µl LoTE, 6 µl 5× T4 DNA ligase buffer, and either 1 µl Linker-1 or -2 solution (100–200 ng), respectively.
4.9 The bead suspension is incubated at 50 °C for 2 min for the dissociation of linker dimers and kept at room temperature for 15 min.
4.10 T4 DNA ligase (10 U) is added and the tubes are incubated at 16 °C for 2 h.
4.11 After ligating the linkers, the bead suspensions from the two tubes are mixed.
4.12 The beads are washed 4 times with 1× B&W buffer, followed by 3 washes with LoTE buffer.
4.13 The resulting linker–cDNA fragments on the beads are digested with EcoP15I to release "linker–tag" fragments.
4.14 For EcoP15I digestion, 10 µl 10× EcoP15I digestion buffer (100 mM Tris–HCl, pH 8.0; 100 mM KCl; 100 mM MgCl2; 1 mM EDTA; 1 mM dithiothreitol; 50 µg/ml BSA), 2 µl 100 mM ATP, 83 µl sterile water, and 5 µl EcoP15I (2 U/µl; NEB) are added to the washed paramagnetic beads.
4.15 Tubes are incubated at 37 °C for 2 h.

1.3.5 Tag Extraction from cDNA

HT-SuperSAGE

5.1 Prepare a 100-µl suspension of streptavidin-coated magnetic beads (Dynabeads M-270) in a siliconized 1.5-ml microtube. Beads are washed once with 1× B&W buffer.
5.2 To the washed magnetic beads, 200 µl 2× B&W solution and 200 µl digested cDNA solution are added and suspended well.
5.3 After the digested cDNAs have bound to the beads during a 30-min incubation, the tube is placed on the magnetic stand, and the supernatant is discarded.
5.4 Magnetic beads are washed 3 times with 200 µl 1× B&W and once with 200 µl LoTE.
5.5 For Adapter-2 ligation to the digested cDNAs, 21 µl LoTE, 6 µl 5× T4 DNA ligase buffer, and 1 µl Adapter-2 solution are added to the magnetic beads.
5.6 After mixing with pipettes, the bead suspension is incubated at 50 °C for 2 min for the dissociation of adapter dimers.
5.7 Tubes are kept at room temperature for 15 min.
5.8 After the tubes have cooled down, 2 µl T4 DNA ligase (10 U) is added and incubated at 16 °C for 2 h with occasional mixing.
5.9 After the ligation reaction, beads are washed 4 times with 1× B&W and 3 times with LoTE.
5.10 The beads are suspended in 75 µl LoTE.
5.11 For EcoP15I digestion, 10 µl 10× NEBuffer 3, 10 µl 10× ATP solution (1 mM), 1 µl 100× BSA (100 mg/ml), and 4 µl EcoP15I are added to the suspended magnetic beads.
5.12 Incubate the tube at 37 °C for 2 h.

1.3.6 Purification of Linker–Tag Fragments

6.1 In both sequencing methods, the DNA released from the beads after EcoP15I digestion is extracted by phenol/chloroform.
6.2 Precipitate DNA by adding 100 µl 10 M ammonium acetate, 3 µl glycogen, and 950 µl cold ethanol.
6.3 The tube is kept at −80 °C for 1 h.
6.4 The DNA is precipitated by centrifugation at 15 000 × g for 40 min at 4 °C and the resulting pellet is washed once with 70% ethanol.
6.5 After drying, the pellet is dissolved in 10 µl LoTE buffer.

454 Pyrosequencing

6.6 Dissolved DNA solution is loaded onto an 8% PAGE gel, which is prepared by mixing 3.5 ml 40% acrylamide/bis solution, 13.5 ml dH2O, 350 µl 50× TAE (Tris–acetate–EDTA) buffer, 175 µl 10% ammonium persulfate, and 15 µl TEMED.
6.7 The polyacrylamide gel is run at 75 V for 10 min and then at 150 V for around 30 min.
6.8 The gel is stained with SYBR Green (Molecular Probes) and the DNA visualized on a UV trans-illuminator.
6.9 The "linker–tag" fragments of expected size (around 70 bp) are cut out and put into a 0.5-ml tube.
6.10 Holes are made at the top and the bottom of the tube with a needle and it is placed in a 2-ml tube.
6.11 The tube is centrifuged at the maximum speed for 2–3 min (table centrifuge).
6.12 Polyacrylamide gel pieces are collected at the bottom of the 2-ml tube and 300 µl LoTE is added to the gel pieces for resuspension.
6.13 After incubation at 37 °C for 2 h, the gel suspension is transferred to a Spin-X column (Corning) and centrifuged at maximum speed for 2 min.
6.14 Collected solution at the bottom of the tube is extracted by phenol/chloroform and precipitated as described above.
6.15 After washing once with 70% ethanol, the dried linker–tag DNA is dissolved in 8 µl LoTE buffer.

HT-SuperSAGE

Further purification of EcoP15I-digested fragments is not necessary.

1.3.7 Ditag or Adapter–Tag Formation and Amplification

454 Pyrosequencing

7.1 Purified "linker–tag" fragments (a mixture of Linker-1–tag and Linker-2–tag fragments) are blunt-ended by a fill-in reaction using the Blunting High Kit (Toyobo).
7.2 To the linker–tag solution (8 µl), 1 µl 10× blunting buffer and 1 µl KOD DNA polymerase (Toyobo) are added.
7.3 The tube is incubated at 72 °C for 2 min and immediately transferred onto ice.
7.4 For ditag formation, 30 µl LoTE and 40 µl Ligation High (Toyobo) are added to the 10 µl blunt-ended reaction.
7.5 After incubation of the ligation reaction mixture at 16 °C for 4 h to overnight, a small aliquot of the ligation product is removed and diluted (1/5 and 1/10) with LoTE buffer.
7.6 These dilutions are used as templates for the PCR amplification of the "linker–ditag–linker" fragments.
7.7 For Linker-1 and Linker-2, we use PCR primers with the sequence 5'-CAACTAGGCTTAATACAGCAGCA-3' and 5'-CTAACGATGTACGCAGCAGCA-3', respectively.
7.8 If other linkers with different indexes are employed, PCR primers should be changed according to the linker sequences used.
7.9 Hot-start PCR is not always necessary for amplifying "linker–ditag–linker" fragments.
7.10 We amplify "linker–ditag–linker" in a reaction mixture containing 5 µl 10× PCR buffer, 5 µl 2 mM dNTP, 0.2 µl of each primer (350 ng/µl), 38.34 µl dH2O, 1 µl diluted template solution, and 0.26 µl Taq DNA polymerase (5 U/µl).
7.11 We amplify "linker–ditag–linker" with the following reaction cycle: 94 °C for 2 min, then 25 cycles each at 94 °C for 40 s and 60 °C for 40 s.
7.12 With the pilot PCR experiment, we determine which of the 1/5 and 1/10 template dilutions gives the better amplification of the "linker–ditag–linker."
7.13 PCR products (96–98 bp) are observed in a SYBR Green-stained acrylamide gel.
7.14 A bulk PCR is carried out under the same conditions for 40–48 tubes, each containing 50 µl, using the diluted template (either the 1/5 or the 1/10 dilution) that gave the better amplification in the pilot PCR (see above).
7.15 All PCR products are collected in a tube and purified with the QIAquick PCR purification kit (Qiagen).
7.16 For purification, six to eight columns are used and eluted DNAs from all the columns are collected in a single tube.
7.17 This DNA solution (180–240 µl) is loaded onto an 8% polyacrylamide gel.
7.18 After running the gel and staining with SYBR Green as described above, the separated DNA fragments of the expected size (96–98 bp) are cut out from the gel.
7.19 DNA is eluted from the polyacrylamide gel and purified by ethanol precipitation after phenol/chloroform extraction, as described above. Around 1 µg of purified "linker–ditag–linker" fragments can be obtained from 40–48 PCR reaction tubes.

HT-SuperSAGE

7.20 Prepare Adapter-1 with defined index sequences assigned to individual samples (Adapter-1a, -1b, -1c, etc.).
7.21 For ligation of Adapter-1, 3 µl 5× T4 DNA ligase buffer and 0.5 µl Adapter-1 solution are added to the solution of the Adapter-2-ligated tags.
7.22 Incubate the tube at 50 °C for 2 min and keep it at room temperature for 15 min.
7.23 After the tubes have cooled down, 1.5 µl T4 DNA ligase (7.5 U) is added and incubated at 16 °C for 2 h.
7.24 For PCR amplification of adapter-ligated tag fragments, a PCR reaction mixture containing 3 µl 5× Phusion HF buffer, 0.3 µl 2.5 mM dNTP, 0.1 µl 50 mM MgCl2, 0.15 µl Adapter-1 primer, 0.15 µl Adapter-2 primer, 10.1 µl dH2O, 1 µl ligation solution, and 0.2 µl Phusion Hot Start DNA polymerase is prepared in a tube.
7.25 The PCR reaction proceeds under the following conditions: 98 °C for 2 min, then 5–10 cycles each at 98 °C for 30 s and 60 °C for 30 s.
7.26 Prepare an 8% PAGE gel by mixing 3.5 ml 40% acrylamide/bis solution, 13.5 ml dH2O, 350 µl 50× TAE buffer, 175 µl 10% ammonium persulfate, and 15 µl TEMED.
7.27 Running buffer (1× TAE) is prepared and added to the upper and lower electrophoresis chambers.
7.28 Then 3 µl 6× loading dye is added to 15 µl of the PCR solution and loaded into the well.
7.29 An aliquot of 2 µl of a 20-bp marker ladder is also loaded as molecular size marker. Run the gel at 75 V for 10 min and then at 150 V for around 30 min.
7.30 After staining with SYBR Green, the gel is visualized on a UV illuminator. The size of the expected amplified fragment (tags sandwiched between two adapters) is 123–125 bp.
7.31 Repeat PCR reactions under the same conditions in 8–14 tubes.


7.32 After the PCR reaction, solutions from all the tubes are collected in a 1.5-ml tube and purified by MinElute Reaction Cleanup kit or by ethanol precipitation.
7.33 Prepare an 8% polyacrylamide gel as described in Step 7.26. Add 3 µl 6× loading buffer to the purified PCR product and load it in the well.
7.34 After running the gel as described above, the gel is stained with SYBR Green and bands are visualized under UV light.
7.35 Only the 123- to 125-bp band (Adapter-1 and Adapter-2 ligated 26- to 27-bp tag) is cut out from the gel and transferred to a 0.5-ml microtube.
7.36 DNA is eluted from the gel and purified as described above.
7.37 Finally, the resulting pellet after ethanol precipitation is dissolved in 10–15 µl LoTE.

1.3.8 Preparation of Templates for Sequencing

454 Pyrosequencing

Purified DNA is ready for 454 pyrosequencing after adapter ligation according to the manufacturer's protocol.

HT-SuperSAGE

The purified PCR product from each sample is quantified by an Agilent Bioanalyzer system.

8.1 A DNA chip from the Agilent DNA 1000 kit is prepared and filled with the Gel-Dye Mix supplied with the kit.
8.2 Load 1 µl purified PCR product in the well of the chip and run the chip in the Agilent 2100 Bioanalyzer.
8.3 The DNA concentration of the 123- to 125-bp fragment is measured using 2100 Expert software (Agilent Technologies).
8.4 Based on this quantification, an equal amount of DNA (PCR product) from each sample is mixed and the mixture sequenced on an Illumina GA.
8.5 For the sequencing reaction, the GEX sequencing primer (5'-CGACAGGTTCAGAGTTCTACAGTCCGACGATC) should be employed.

1.4 Applications

DeepSuperSAGE recommends itself for whole-genome transcriptome studies of any eukaryotic organism. It has already been employed as a transcriptome analysis tool in various studies, particularly of nonmodel organisms without sequenced genomes (banana, chickpea, pea, lentil, Boechera, etc.). The high quality of data produced, the relatively simple procedure in combination with one of the NGS platforms, and the lower costs for a transcriptome analysis as compared to, for example, a complete microarray experiment will promote its applications in the future.

1.4.1 Applications of DeepSuperSAGE in Combination with 454 Pyrosequencing

DeepSuperSAGE reveals many facets of the transcriptome reacting upon abiotic or biotic stresses or deciphers the changing involvement of transcription and transcripts during development of any organism (Table 1.2). Particularly in higher plants, deepSuperSAGE has shown its resolving power as a transcriptome analysis tool. However, genome sequences of most plants are either incomplete or untouched, regardless of their economic (mostly agricultural) importance. As described above, genes can be recovered from deepSuperSAGE tag sequences by RACE without searching databases. However, the most recent advances of NGS technologies now allow us to construct a substantial EST database by just

sequencing cDNA fragments from the experimenter's own materials. In chickpea (Cicer arietinum L.) or Boechera species, for example, deepSuperSAGE tag sequences were BLASTed against public or newly sequenced cDNA databases for the identification of the corresponding genes [26,28,29]. Without preparing one's own cDNA databases, EST sequences from related species are also applicable as reference sequences for BLAST searches of the tags. To give only one example, tags from chickpea were BLASTed against Medicago truncatula ESTs [26]. Similarly, for annotation of Nicotiana attenuata and Solanum torvum tags, DNA sequences of Nicotiana species, Solanum species, or eggplant Unigenes were employed as databases for retrieval [27,30]. It is still an open question whether and to what extent sequences from genetically distant species are acceptable for tag-to-gene annotation via sequence similarity. Practically, however, the few examples described above demonstrate that corresponding cDNAs (genes) could be successfully identified this way. DeepSuperSAGE additionally identifies unique classes of transcripts, which cannot be detected by microarrays, for example. In differentially expressed tags of drought-exposed chickpea roots, 170 tags matched EST sequences in the antisense polarity [26]. Therefore, the detection of antisense transcripts is a rewarding advantage of deepSuperSAGE. Although a further (functional and/or structural) analysis is still required for each tag (or transcript), deepSuperSAGE nevertheless discovers novel transcripts. Sharbel et al. [29] identified allelic variation of transcripts from the same locus by analyzing deepSuperSAGE tags from apomictic and sexual ovules of Boechera species. The window of a SuperSAGE tag spans only 26 bases and therefore the number of identified transcript variants might be limited. However, the tag likely localizes to the 3'-untranslated region of cDNAs, which increases the chances of identifying sequence variations.

Table 1.2 Various published applications of deepSuperSAGE.

Author                  Species                                                              Sequencing technology
Molina et al. [26]      Cicer arietinum                                                      454
Sharbel et al. [28]     Boechera spp.                                                        454
Sharbel et al. [29]     Boechera spp.                                                        454
Gilardoni et al. [30]   Nicotiana attenuata                                                  454
Yamaguchi et al. [27]   Solanum torvum                                                       Illumina
Pinto et al. [31]       Tetraodon nigroviridis                                               454
Matsumura et al. [23]   Oryza sativa, Danio rerio, Arabidopsis thaliana, Magnaporthe oryzae  Illumina

Combining information on alleles and their expression patterns has helped to better understand complex events in living organisms like apomixis [28,29]. One of the best examples of the power of deepSuperSAGE as a transcriptome profiling technology is the identification of rapidly up- and downregulated genes, the quantification of their transcripts, the discovery of many sense and antisense transcripts, the multitude of alternatively spliced transcript isoforms, and their contribution to the various salt stress-induced metabolic pathways, to name a few benefits of the technique. In the corresponding experiments, two deepSuperSAGE libraries were developed from roots and nodules of the salt-tolerant chickpea variety INRAT-93. A moderate salt stress of 25 mM NaCl was chosen and the deepSuperSAGE transcript profiles established after only 2 h of salt stress. Sequencing of the tags was done on the 454 platform. Among the various results and insights into the first wave of salt stress-compensatory measures of chickpea roots, a compilation of the 40 top upregulated transcripts and their annotations is shown in Table 1.3. In parallel, the 40 top upregulated transcripts from nodules of the same plants are shown in Table 1.4. These 40 transcripts were chosen among thousands of upregulated transcripts in both organs that were significantly, but less strongly, activated after onset of the salt stress (86 919 transcripts representing 17 918 unique 26-bp deepSuperSAGE tags, so-called UniTags,

Table 1.3 Top 40 annotatable and upregulated UniTags of roots from the salt-tolerant chickpea variety INRAT-93 under salt stress.

Tag ID      Associated gene annotation                     Rln   Associated process
STCa-18884  early nodulin 40                               5.69  nodulation
STCa-7896   superoxide dismutase                           3.70  ROS scavenging
STCa-318    trypsin protein inhibitor 3                    3.59  endopeptidase inhibitor
STCa-19021  extensin                                       3.40  cell wall organization
STCa-17087  dormancy-associated protein                    3.38  no associated process
STCa-7166   NADP-dependent isocitrate dehydrogenase I      3.25  metabolism
STCa-1381   acetyl-CoA synthetase                          3.19  metabolism
STCa-2982   cysteine synthase                              3.15  protein metabolism
STCa-15648  mitochondrial 24S mt-RNL ribosomal gene        3.10  no associated process
STCa-20215  putative extracellular dermal glycoprotein     3.08  proteolysis
STCa-20066  14-3-3-like protein A                          3.03  protein domain-specific binding
STCa-15159  disease resistance protein DRRG49-C            2.98  response to stress
STCa-17434  AAD20160.1 protein                             2.92  no associated process
STCa-22427  fiber protein Fb19                             2.88  response to stress
STCa-4531   isoflavone 3'-hydroxylase                      2.88  no associated process
STCa-14437  60S acidic ribosomal protein P1                2.83  protein biosynthesis
STCa-1385   1-aminocyclopropane-1-carboxylate oxidase      2.83  metabolism
STCa-12309  ankyrin-like protein                           2.83  no associated process
STCa-23197  hypothetical protein                           2.78  response to stress
STCa-8459   UDP-glucose pyrophosphorylase                  2.78  metabolism
STCa-12035  cytochrome P450 monooxygenase                  2.73  electron transport/metal ion binding
STCa-11051  retinoblastoma-related protein                 2.68  no associated process
STCa-7975   T5A14.10 protein                               2.68  no associated process
STCa-14984  40S ribosomal protein S4                       2.68  protein biosynthesis
STCa-21666  low-temperature salt-responsive protein LTI6B  2.68  integral to membrane
STCa-1958   gibberellin-stimulated protein                 2.68  hormone response
STCa-17272  10-kDa photosystem II polypeptide              2.68  oxygen evolving complex
STCa-24178  phosphoglycerate mutase                        2.62  metabolism/metal ion binding
STCa-13313  chalcone isomerase                             2.62  flavonoid biosynthesis
STCa-23978  inorganic pyrophosphatase-like protein         2.62  phosphate metabolism
STCa-10123  synaptobrevin-like protein                     2.62  transport/integral to membrane
STCa-11172  caffeic acid 3-O-methyltransferase             2.56  lignin biosynthesis
STCa-181    myoinositol-1-phosphate synthase               2.56  inositol 3P biosynthesis/Ca2+ release
STCa-15340  alfin-1                                        2.56  regulation of transcription
STCa-24453  tonoplast intrinsic protein                    2.56  transport
STCa-4528   cytochrome P450 monooxygenase                  2.56  electron transport/metal ion binding
STCa-5543   ε-subunit of mitochondrial F1-ATPase           2.56  ATP-coupled proton transport
STCa-11309  60S ribosomal protein L18a                     2.49  protein biosynthesis
STCa-16808  histone H2B                                    2.49  response to DNA damage stimulus
STCa-22470  glutathione S-transferase                      2.49  ROS scavenging

Two deepSuperSAGE libraries derived from salt-stressed and nontreated chickpea roots, respectively, of the salt-tolerant variety INRAT-93 were developed. All 26-bp tags per library were grouped in classes sharing the same sequence (UniTags) and their counts were normalized to counts per million. After normalization, counts were compared between libraries and expression ratios were calculated for each UniTag (Rln). Here, the 40 UniTags showing the largest expression ratios after salt stress induction (2 h 25 mM NaCl) are listed.
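The normalization and ranking described in the table note can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis pipeline: the text does not define Rln, so the sketch assumes it is the natural logarithm of the ratio of normalized counts, and the pseudocount (added so that UniTags absent from one library do not cause a division by zero) is an addition of ours. Function names and the toy counts are hypothetical.

```python
import math


def counts_per_million(tag_counts):
    """Scale raw UniTag counts so that each library sums to one million."""
    total = sum(tag_counts.values())
    return {tag: count * 1_000_000 / total for tag, count in tag_counts.items()}


def log_expression_ratios(treated, control, pseudocount=1.0):
    """Natural log of (treated CPM / control CPM) for every UniTag.

    ASSUMPTIONS: Rln is taken to be a natural-log ratio, and the pseudocount
    is not part of the described procedure.
    """
    treated_cpm = counts_per_million(treated)
    control_cpm = counts_per_million(control)
    ratios = {}
    for tag in set(treated_cpm) | set(control_cpm):
        t = treated_cpm.get(tag, 0.0) + pseudocount
        c = control_cpm.get(tag, 0.0) + pseudocount
        ratios[tag] = math.log(t / c)
    return ratios


def top_upregulated(ratios, n=40):
    """Return the n UniTags with the largest log ratios, highest first."""
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:n]


# Toy counts for two hypothetical UniTags in a control and a treated library.
control = {"TAG_A": 10, "TAG_B": 90}
treated = {"TAG_A": 30, "TAG_B": 70}
ratios = log_expression_ratios(treated, control)
```

With real libraries, the ranked list would additionally need a significance filter, which this sketch omits.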

from roots, and 57 281 transcripts representing 13 115 UniTags from nodules of the same plants). The thousands of downregulated genes, the antisense transcripts, and their corresponding sense counterparts as well as the Gene Ontology (GO) terms for all of these various messages and their response to salt stress are completely ignored here. However, from a more detailed GO analysis we can infer that (i) transcripts associated with the generation and scavenging of reactive oxygen species (ROS), and (ii) transcripts involved in Na+ homeostasis were over-represented in GO categories, to give only two examples. Both pathways undergo strong global transcriptome changes in chickpea roots and nodules already 2 h after onset of moderate salt stress. Additionally, a set of more than 15 candidate transcripts react as potential components of the salt-overly-sensitive (SOS) pathway in chickpea (Figure 1.3). Some of the major insights into the first steps of salt stress response in chickpea are that (i) normal nodules already have elevated levels of transcripts encoding ROS

Table 1.4 Top 40 annotatable and up-regulated UniTags of nodules from the salt-tolerant chickpea variety INRAT-93 under salt stress.

Tag ID      Associated gene annotation                     Rln   Associated process
STCa-18884  early nodulin 40                               4.11  nodulation
STCa-15648  24S mitochondrial ribosomal mt-RNL gene        3.17  translation
STCa-11090  40S ribosomal protein SA                       2.73  protein biosynthesis
STCa-17434  AAD20160.1 protein                             2.61  no associated term
STCa-1958   gibberellin-stimulated protein                 2.61  no associated term
STCa-3760   cysteine proteinase inhibitor                  2.48  inhibition of proteolysis
STCa-89     drought-induced protein                        2.48  response to stress
STCa-16482  40S ribosomal protein S9-2                     2.48  protein biosynthesis
STCa-10316  NtEIG-E80 protein                              2.33  no associated term
STCa-3321   leghemoglobin                                  2.33  oxygen transport
STCa-1263   benzoyltransferase-like protein                2.33  no associated term
STCa-13055  nonspecific lipid-transfer protein precursor   2.33  transport (lipids)
STCa-22149  acyl carrier protein                           2.33  lipid biosynthesis
STCa-10862  F6N18.8 protein                                2.33  no associated term
STCa-21007  two-component response regulator PRR37         2.33  regulation of transcription
STCa-4833   T13M11_21 protein                              2.14  regulation of transcription
STCa-8434   fiber protein Fb2                              2.14  no associated term
STCa-23572  F7K24_140 protein                              2.14  signal transduction
STCa-7572   protein phosphatase 2A                         2.14  signal transduction
STCa-1895   GDP-mannose 3,5-epimerase                      2.14  ascorbic acid biosynthesis
STCa-16007  aquaporin PIP-type 7a                          1.92  transport (transmembrane)
STCa-2175   glutathione S-transferase                      1.92  ROS scavenging
STCa-12406  coatomer subunit β'-2                          1.92  protein transport
STCa-12523  T23K23_9 protein                               1.92  no associated term
STCa-269    phytochrome B                                  1.92  signal transduction
STCa-1589   β-galactosidase                                1.92  metabolism (carbohydrates)
STCa-19649  vacuolar ATPase subunit A                      1.92  ion transport
STCa-22041  root nodule extensin                           1.92  cell wall organization
STCa-199    nodulin-like protein                           1.92  transport (transmembrane)
STCa-542    prolyl 4-hydroxylase                           1.92  ROS scavenging
STCa-13688  O-methyltransferase                            1.92  lignin biosynthesis
STCa-15530  NADH ubiquinone oxidoreductase                 1.92  electron transport
STCa-16514  NADH dehydrogenase                             1.92  electron transport
STCa-22816  F17F16.27 protein                              1.92  no associated term
STCa-4167   syringolide-induced protein                    1.92  metabolism (carbohydrates)
STCa-2241   putative extensin                              1.92  cell wall organization
STCa-319    trypsin protein inhibitor 3                    1.92  inhibition of proteolysis
STCa-9781   eukaryotic translation initiation factor 3     1.92  protein biosynthesis
STCa-1461   HMG1 protein                                   1.92  regulation of transcription
STCa-13993  F8K7.2 protein                                 1.92  no associated term

Two SuperSAGE libraries derived from salt-stressed and nontreated chickpea nodules, respectively, of the salt-tolerant variety INRAT-93 were developed. All 26-bp tags per library were grouped in classes sharing the same sequence (UniTags) and their counts were normalized to counts per million. After normalization, counts were compared between libraries and expression ratios were calculated for each UniTag (Rln). Here, the 40 UniTags showing the largest expression ratios after salt stress induction (2 h 25 mM NaCl) are listed.

scavengers prior to any salt treatment (i.e., are in a state of increased stress by ROS), and (ii) both nodules and roots rapidly (already 2 h after addition of 25 mM NaCl) respond to salt stress by transcription of genes encoding ROS scavengers. This rapid activation of genes in response to salt stress was unknown in nodulating legumes. We conclude that deepSuperSAGE expression profiling enriched our previously very limited knowledge of first reactions of a chickpea plant upon salt stress. We would like to point out that most of the data of the salt stress SuperSAGE experiments have not been evaluated yet. However, the two examples (although only superficially) presented here already show the potential of this next-generation transcriptome sequencing technology. DeepSuperSAGE is the technique of choice for the identification of differentially expressed genes in any eukaryotic organism. Gilardoni et al. [30] systematically employed deepSuperSAGE from gene discovery to functional analysis of identified


Fig. 1.3 Over-representation of more than 390 GO biological processes after salt stress induction, as calculated for chickpea roots and nodules using the software package ErmineJ. (Left panel) Heatmap of over-representation of GO biological processes in salt-stressed roots (SR) depicted in parallel with their over-representation levels in stressed nodules (SN) and nonstressed nodules in relation to roots of the same plants (NC). Numbers of represented genes per GO category for each case are shown by the curves to the right of the heatmap. (Right panel) Magnification of the heatmap region containing 52 highly significant (P < 1e-10) over-represented GO terms in salt-stressed chickpea roots (SR). In parallel, the dynamics of the same processes in stressed nodules (SN) and nontreated nodules (NC) are shown.

genes in N. attenuata. Tools or resources for functional genomics are well developed for model species, like human, Caenorhabditis elegans, Drosophila melanogaster, or Arabidopsis thaliana. We exploited such tools by combining deepSuperSAGE and virus-induced gene silencing (VIGS), the latter being a highly efficient tool for knocking down target genes and measuring the resulting phenotype. Although VIGS is not applicable to every plant species, it nevertheless aided in linking a phenotype to a gene identified by deepSuperSAGE. It can be expected that the recent progress in RNA interference technology will support its application to a wide spectrum of species. Combining deepSuperSAGE and gene silencing technologies will, in our view, enrich our knowledge of the relationship between sequence and function.

1.4.2 Practical Analysis of HT-SuperSAGE

For the development of the described HT-SuperSAGE protocol, we designed 27 independently indexed adapters [23]. Additionally, cDNAs from two tissue samples


were digested with three different 4-bp cutter restriction endonucleases (anchoring enzymes) and tags were prepared. Amplified adapter–tag fragments from all 31 samples (27 indexed samples and additional four samples employing different anchoring enzymes) in total were pooled and sequenced in three lanes of a flow cell in an Illumina GAIIx sequencer (16 057 777 sequence reads of 35 bases) [23]. For tag extraction from sequence reads of pooled samples, our own programs were written in Perl script. Tag profiling data from all the applied samples was successfully separated and retrieved. As expected, contamination of tags from different samples was only less than 0.2% of the analyzed independent tags, even among index sequences with single-base differences. Three benefits can be expected by pooling many samples in HT-SuperSAGE: (i) expansion of deepSuperSAGE applications, (ii) reduction of analytical cost per sample, and (iii) savings of starting material (RNA) from each sample. The analyses of biological replicates and expression kinetics were easy in HT-SuperSAGE and, additionally, a sufficient amount of tags can be prepared from 1 mg total RNA. Currently, with all the advances made, the performance and potential of HT-SuperSAGE is positively superior to microarray techniques, since it is based on an unprecedented ultra-high (deep) sequencing of tags, the digital printout of quantitative tag counts, and a high-throughput capacity. DeepSuperSAGE can also employ different anchoring enzymes, of which NlaIII is the standard enzyme in all the many versions of SAGE. However, as described by Sharbel et al. [29], cDNA is frequently not efficiently digested by NlaIII, but instead by DpnII, at least in certain species. Theoretically, any 4-bp cutter restriction endonuclease can be part of the deepSuperSAGE protocol and the change in the sequence of adapter ends is often welcomed. Actually, the frequency of sites for 4-bp cutter enzymes in cDNA is generally not consistent. 
Experimental results in A. thaliana show that NlaIII or DpnII digestion could recover tags from 92 to 93% of expressed genes, while BfaI produced tags from only about 80% of the cDNAs. Similar biases of restriction sites in the predicted genes were also reported by in silico scans of D. melanogaster and C. elegans genomes [32]. Since the restriction endonuclease BfaI recognizes the sequence 5′-CTAG-3′, which includes a stop codon (TAG), this site may be under-represented in cDNA sequences. However, the results demonstrate that NlaIII and DpnII are appropriate endonucleases for deepSuperSAGE, and that most (above 99%) of the expressed genes could be monitored with these two enzymes.
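The kind of in silico scan mentioned above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the cDNA sequences are toy strings, the function names are hypothetical, and the scan merely asks whether a recognition site occurs anywhere in a sequence (it does not model the 3′-most site or tag extraction). The recognition sequences used are CATG for NlaIII, GATC for DpnII, and CTAG for BfaI.

```python
# Recognition sequences of the three 4-bp cutter anchoring enzymes discussed above.
ANCHORING_SITES = {"NlaIII": "CATG", "DpnII": "GATC", "BfaI": "CTAG"}


def enzymes_cutting(cdna):
    """Return the set of enzymes whose recognition site occurs in the cDNA."""
    seq = cdna.upper()
    return {name for name, site in ANCHORING_SITES.items() if site in seq}


def site_coverage(cdnas):
    """Fraction of cDNAs with at least one site, per enzyme and for NlaIII or DpnII.

    `cdnas` maps a (hypothetical) transcript name to its sequence.
    """
    total = len(cdnas)
    counts = {name: 0 for name in ANCHORING_SITES}
    either = 0
    for seq in cdnas.values():
        hits = enzymes_cutting(seq)
        for name in hits:
            counts[name] += 1
        if hits & {"NlaIII", "DpnII"}:
            either += 1
    coverage = {name: count / total for name, count in counts.items()}
    coverage["NlaIII or DpnII"] = either / total
    return coverage


if __name__ == "__main__":
    # Toy sequences for illustration only, not real transcripts.
    toy_cdnas = {
        "t1": "AAACATGAAA",
        "t2": "TTGATCTT",
        "t3": "GGGGCCCC",
        "t4": "ACTAGCATG",
    }
    for label, fraction in site_coverage(toy_cdnas).items():
        print(f"{label}: {fraction:.2f}")
```

Applied to a real cDNA collection, such a scan would indicate which transcripts cannot yield a tag with a given anchoring enzyme, and how much is gained by combining two enzymes.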

1.5 Perspectives

NGS technologies are major innovations that have revolutionized genomics and transcriptomics. The NGS platforms are continuously being improved and expanded, and new sequencing technologies are already being released or will be released in the near future, such as single-molecule sequencing from Pacific Biosciences [33]. This and other next-next-generation sequencing technologies will read long fragments (above 1000 bp) in one pass without amplification of the template DNA. However, the number of sequencing reads per run will be reduced compared to current massively parallel sequencing. Single-molecule sequencing will assist whole-genome analysis, even de novo sequencing of genomes, owing to efficient sequence assembly and fewer errors introduced by PCR amplification. In transcriptomics, the new sequencing methods will be an effective tool to sequence cDNA directly and may allow us to read millions of full-length cDNA sequences accurately at a time. We expect that deepSuperSAGE in combination with massively parallel sequencing will remain advantageous even after the emergence of the next-next generation of sequencers. One of its merits is quantitative expression analysis, for which the number of sequence reads (tag counts) determines its accuracy and potential as a gene discovery tool. Moreover, multiplexing will assist in measuring gene expression in many different samples simultaneously. Also, sequencing costs are still an issue, and the costs of an RNA-seq experiment still exceed the costs of a deepSuperSAGE experiment by a factor of 10. Therefore, the current deepSuperSAGE is still superior to single-molecule sequencing of cDNA or tags/tag concatemers. In addition, the immense accumulation of whole-genome and long cDNA sequences in the databases will greatly support the application of deepSuperSAGE in many aspects of eukaryotic biology.

Acknowledgments

H.M. is supported by the Program for the Promotion of Basic and Applied Researches for Innovations in Bio-oriented Industry (BRAIN). This work is also supported by JSPS grant 22380009. G.K. acknowledges research support by DFG (grant DFG 332/22-1) and GTZ (grant 08.7860.3-001.00). All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

References

1 Bustin, S.A., Benes, V., Garson, J.A., Hellemans, J., Huggett, J., Kubista, M., Mueller, R., Nolan, T., Pfaffl, M.W., Shipley, G.L., Vandesompele, J., and Wittwer, C.T. (2009) The MIQE Guidelines: minimum information for publication of quantitative real-time PCR experiments. Clin. Chem., 55, 611–622.

2 Bustin, S.A. (2010) Why the need for qPCR publication guidelines? The case for MIQE. Methods, 50, 217–226.

3 Bustin, S.A., Beaulieu, F.A., Huggett, J., Jaggi, R., Kibenge, F.S.B., Olsvik, P.A., Penning, L.C., and Toegel, S. (2010) MIQE precis: practical implementation of minimum standard guidelines for fluorescence-based quantitative real-time PCR experiments. BMC Mol. Biol., 11, 74.

4 Derveaux, S., Vandesompele, J., and Hellemans, J. (2010) How to do successful gene expression analysis using real-time PCR. Methods, 50, 227–230.

5 Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270, 467–470.

6 Shendure, J. (2008) The beginning of the end for microarrays? Nat. Methods, 5, 585–587.

7 Irizarry, R.A., Warren, D., Spencer, F., Kim, I.F., Biswal, S., Frank, B.C., Gabrielson, E., Garcia, J.G.N., Geoghegan, J., Germino, G., Griffin, C., Hilmer, S.C., Hoffman, E., Jedlicka, A.E., Kawasaki, E., Martínez-Murillo, F., Morsberger, L., Lee, H., Petersen, D., Quackenbush, J., Scott, A., Wilson, M., Yang, Y., Qing Ye, S., and Yu, W. (2005) Multiple-laboratory comparison of microarray platforms. Nat. Methods, 2, 345–350.

8 Larkin, J.E., Frank, B.C., Gavras, H., Sultana, R., and Quackenbush, J. (2005) Independence and reproducibility across microarray platforms. Nat. Methods, 2, 337–344.

9 Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. (1995) Serial analysis of gene expression. Science, 270, 484–487.

10 Saha, S., Sparks, A.B., Rago, C., Akmaev, V., Wang, C.J., Vogelstein, B., Kinzler, K.W., and Velculescu, V.E. (2002) Using the transcriptome to annotate the genome. Nat. Biotechnol., 20, 508–512.

11 Metzker, M.L. (2010) Sequencing technologies – the next generation. Nat. Rev. Genet., 11, 31–46.

12 Matsumura, H., Reich, S., Ito, A., Saitoh, H., Kamoun, S., Winter, P., Kahl, G., Reuter, M., Krüger, D.H., and Terauchi, R. (2003) Gene expression analysis of host–pathogen interactions by SuperSAGE. Proc. Natl. Acad. Sci. USA, 100, 15718–15723.

13 Meisel, A., Bickle, T.A., Krüger, D.H., and Schroeder, C. (1992) Type III restriction enzymes need two inversely oriented recognition sites for DNA cleavage. Nature, 355, 467–469.

14 Möncke-Buchner, E., Rothenberg, M., Reich, S., Wagenführ, K., Matsumura, H., Terauchi, R., Krüger, D.H., and Reuter, M. (2009) Functional characterization and modulation of the DNA cleavage efficiency of Type III restriction endonuclease EcoP15I in its interaction with two sites in the DNA target. J. Mol. Biol., 387, 1309–1319.

15 Wagenführ, K., Pieper, S., Mackeldanz, P., Linscheid, M., Krüger, D.H., and Reuter, M. (2007) Structural domains in the Type III restriction endonuclease EcoP15I: characterization by limited proteolysis, mass spectrometry and insertional mutagenesis. J. Mol. Biol., 366, 93–102.

16 Matsumura, H., Ito, A., Saitoh, H., Winter, P., Kahl, G., Reuter, M., Krüger, D.H., and Terauchi, R. (2004) SuperSAGE. Cell. Microbiol., 7, 11–18.

17 Raftery, M.J., Möncke-Buchner, E., Matsumura, H., Giese, T., Winkelmann, A., Reuter, M., Terauchi, H., Schönrich, G., and Krüger, D.H. (2009) Unravelling the interaction of human cytomegalovirus with dendritic cells by using SuperSAGE. J. Gen. Virol., 90, 2221–2233.

18 Nasir, K.B.H., Takahashi, Y., Ito, A., Saitoh, H., Matsumura, H., Kanzaki, H., Shimizu, T., Ito, M., Sharma, P.C., Ohme-Takagi, M., Kamoun, S., and Terauchi, R. (2005) High-throughput in planta expression screening identifies a class II ethylene-responsive element binding factor-like protein that regulates plant cell death and non-host resistance. Plant J., 43, 491–505.

19 Matsumura, H., Bin Nasir, K.H., Yoshida, K., Ito, A., Kahl, G., Krüger, D.H., and Terauchi, R. (2006) SuperSAGE array: the direct use of 26-base-pair transcript tags in oligonucleotide arrays. Nat. Methods, 3, 469–474.

20 Coemans, B., Matsumura, H., Terauchi, R., Remy, S., Swennen, R., and Sagi, L. (2005) SuperSAGE combined with PCR walking allows global gene expression profiling of banana (Musa acuminata), a non-model organism. Theor. Appl. Genet., 111, 1118–1126.

21 Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., Dewell, S.B., Du, L., Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen, S., Ho, C.H., Irzyk, G.P., Jandom, S.C., Alenquer, M.L., Jarvie, T.P., Jirage, K.B., Kim, J.B., Knight, R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J., Lohman, K.L., Lu, H., Makhijani, V.B., McDade, K.E., McKenna, M.P., Myers, E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan, M.T., Roth, G.T., Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R., Tomasz, A., Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P., Yu, P., Begley, R.F., and Rothberg, J.M. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437, 376–380.

22 Terauchi, R., Matsumura, H., Krüger, D.H., and Kahl, G. (2008) SuperSAGE: the most advanced transcriptome technology for functional genomics, in Handbook of Plant Functional Genomics (eds G. Kahl and K. Meksem), Wiley-VCH, Weinheim, pp. 37–54.

23 Matsumura, H., Yoshida, K., Luo, S., Kimura, E., Fujibe, T., Albertyn, Z., Barrero, R.A., Krüger, D.H., Kahl, G., Schroth, G.P., and Terauchi, R. (2010) High-throughput SuperSAGE for digital gene expression analysis of multiple samples using next generation sequencing. PLoS ONE, 5, e1201.

24 Nielsen, K.L., Høgh, A.L., and Emmersen, J. (2006) DeepSAGE – digital transcriptomics with high sensitivity, simple experimental protocol and multiplexing of samples. Nucleic Acids Res., 34, e133.

25 Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res., 18, 1509–1517.

26 Molina, C., Rotter, B., Horres, R., Udupa, S.M., Besser, B., Bellarmino, L., Baum, M., Matsumura, H., Terauchi, R., Kahl, G., and Winter, P. (2008) SuperSAGE: the drought stress-responsive transcriptome of chickpea roots. BMC Genomics, 9, 553.

27 Yamaguchi, H., Fukuoka, H., Arao, T., Ohyama, A., Nunome, T., Miyatake, K., and Negoro, S. (2010) Gene expression analysis in cadmium-stressed roots of a low cadmium-accumulating solanaceous plant, Solanum torvum. J. Exp. Bot., 61, 423–437.

28 Sharbel, T.F., Voigt, M.L., Corral, J.M., Thiel, T., Varshney, A., Kumlehn, J., Vogel, H., and Rotter, B. (2009) Molecular signatures of apomictic and sexual ovules in the Boechera holboellii complex. Plant J., 58, 870–882.

29 Sharbel, T.F., Voigt, M.L., Corral, J.M., Galla, G., Kumlehn, J., Klukas, C., Schreiber, F., Vogel, H., and Rotter, B. (2010) Apomictic and sexual ovules of Boechera display heterochronic global gene expression patterns. Plant Cell, 22, 655–671.

30 Gilardoni, P.A., Schuck, S., Jüngling, R., Rotter, B., Baldwin, I.T., and Bonaventure, G. (2010) SuperSAGE analysis of the Nicotiana attenuata transcriptome after fatty acid-amino acid elicitation (FAC): identification of early mediators of insect responses. BMC Plant Biol., 10, 66.

31 Pinto, P.I., Matsumura, H., Thorne, M.A., Power, D.M., Terauchi, R., Reinhardt, R., and Canario, A.V. (2010) Gill transcriptome response to changes in environmental calcium in the green spotted puffer fish. BMC Genomics, 11, 476.

32 Pleasance, E.D., Marra, M.A., and Jones, S.J. (2003) Assessment of SAGE in transcript identification. Genome Res., 6, 1203–1215.

33 Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P., Rank, D., Baybayan, P., Bettman, B., Bibillo, A., Bjornson, K., Chaudhuri, B., Christians, F., Cicero, R., Clark, S., Dalal, R., Dewinter, A., Dixon, J., Foquet, M., Gaertner, A., Hardenbol, P., Heiner, C., Hester, K., Holden, D., Kearns, G., Kong, X., Kuse, R., Lacroix, Y., Lin, S., Lundquist, P., Ma, C., Marks, P., Maxham, M., Murphy, D., Park, I., Pham, T., Phillips, M., Roy, J., Sebra, R., Shen, G., Sorenson, J., Tomaney, A., Travers, K., Trulson, M., Vieceli, J., Wegener, J., Wu, D., Yang, A., Zaccarin, D., Zhao, P., Zhong, F., Korlach, J., and Turner, S. (2009) Real-time DNA sequencing from single polymerase molecules. Science, 323, 133–138.


2 DeepCAGE: Genome-Wide Mapping of Transcription Start Sites

Matthias Harbers, Mitchell S. Dushay, and Piero Carninci

Abstract

Gene expression is tightly controlled by regulatory elements within promoter regions. Genome-wide identification of promoters and measurement of their activities are essential for understanding gene expression and its regulation in a biological context. Cap analysis gene expression (CAGE) is a method for the isolation of short sequencing tags from the 5′ end of mRNA transcripts that are sequenced at high throughput by next-generation sequencing methods. Mapping the short sequencing tags back to a reference genome allows for reliable identification of transcription start sites (TSSs) on a genome-wide scale. Hence, CAGE can be used for promoter and transcript identification, where the number of CAGE tags found per TSS is a quantitative measure of transcription from each site. CAGE makes use of the 5′-end-specific cap structure in eukaryotic mRNA. During cap selection by cap trapping, the cap structure is selectively biotinylated and the biotinylated RNA/cDNA hybrids are enriched on streptavidin-coated beads. Due to the high selectivity of the cap trapper step, CAGE libraries can be prepared directly from total RNA and do not require any mRNA purification. Moreover, cDNA synthesis is driven by random primers, so that even nonpolyadenylated mRNAs are monitored, which are commonly not detected by other methods relying on oligo(dT) priming or mRNA purification. Here, we provide the latest version of our DeepCAGE protocol for preparing CAGE tags for direct sequencing on an Illumina sequencing platform. Compared to the original CAGE protocol, we now use EcoP15I to obtain longer tags of 27 bp, we have omitted the concatenation step needed for capillary sequencing, we have introduced barcoding for multiplex sequencing, and we have simplified purification steps during library preparation to allow for high-throughput library production. These changes further reduced the RNA requirements: the new protocol now starts from only 5 µg of total RNA. In recent years, CAGE has been the underlying method for many promoter and gene network projects, including work for the NIH ENCODE and the RIKEN FANTOM projects. We believe that our new protocol will make CAGE very attractive to a large number of researchers.

2.1 Introduction

Digital expression profiling started with the classical serial analysis of gene expression (SAGE) method that for the first time allowed the preparation of short DNA fragments – so-called tags – for large-scale sequencing [1]. The important leap forward made by the SAGE method [2] was the underlying idea that even short DNA sequences are sufficient for transcript identification as compared to the longer expressed sequence tag (EST) reads used before in transcript discovery [3]. Since all DNA fragments were prepared in such a way that only one DNA fragment per transcript was obtained, direct counting of identical sequencing tags offered unsupervised and quantitative expression profiling. Although SAGE was conceptually superior to other expression profiling methods, its greatest limitation was originally the low throughput of capillary sequencing. Later, SAGE benefited from the use of new restriction endonucleases that enabled the isolation of longer SAGE tags of some 20 bp (LongSAGE [4]) or even 26–27 bp (SuperSAGE [5,6]). With those longer sequences, SAGE tags could be directly mapped to genomic sequences, further expanding the use of SAGE libraries in gene discovery. The introduction of new high-speed sequencing methods pushed SAGE to new dimensions, greatly increasing the number of tags per sample. Modified SAGE protocols like DeepSAGE [7] have offered better measures of transcriptional activities and transcriptome complexity, and have suggested the existence of many previously unannotated genes. While SAGE guided the way to analytical sequencing using short sequencing reads, the position of SAGE tags within mRNA transcripts restricted its application mainly to transcript identification. However, the location of the tag within the parental mRNA transcript is important for data interpretation and for the information content that can be obtained from tag sequences. Therefore, new protocols were developed to locate the tags at the exact 5′ and/or 3′ ends of mRNA transcripts to identify the borders of transcribed regions in genomes [4,8]. Cap analysis gene expression (CAGE) emerged as a method for isolating tags from the exact 5′ end of mRNAs [9,10]. Sequence information from the 5′ end of mRNAs can be used to identify transcription start sites (TSSs) in genomes. In combination with new high-speed sequencing methods, DeepCAGE is a unique approach for genome-wide TSS identification and for determining the exact promoter activity at each TSS [11].

Genome-wide CAGE studies have for the first time provided thorough pictures of the complexity of gene regulation, the characteristics of TSSs, and the use of alternative promoters in gene expression [12,13]. By now, CAGE projects have made fundamental contributions to our understanding of the Drosophila (M. Dushay, unpublished), mouse [12], rat [14], and human genomes [13]. Moreover, CAGE has been the basis for large-scale gene network studies and deep analysis of transcriptomes from many organisms, including fungi, insects, and mammals.

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.

2.2 What is CAGE?

In eukaryotes, the 5′ ends of mRNAs are marked by the so-called cap structure [15,16]. This cap structure consists of a guanine nucleotide connected to the mRNA via a 5′–5′ triphosphate bond. The guanine is further modified to form the final 7-methylguanosine cap (m7G) structure. A trimethylguanosine cap structure has also been observed, as, for example, in some RNA polymerase III (Pol III)-derived small nucleolar RNAs. Capping of mRNA happens early during transcription, where the capping enzyme complex (CEC) is bound to RNA polymerase II (Pol II) even before transcription starts. Therefore, it is assumed that all Pol II transcripts in eukaryotes are capped. However, additional capping activities have been found in RNA viruses such as flavivirus that use RNA capping for their propagation [17–20]. It is possible that additional mechanisms exist to add a cap structure to RNA molecules that are independent of the transcriptional event [15]. RNA capping is an essential step during mRNA maturation and is required for protein synthesis. Moreover, capping protects mRNAs against 5′-exonucleases, and is involved in nuclear export and splicing. Accordingly, decapping of translated mRNAs reduces their half-life and is an important regulatory mechanism in the cell. The unique position of the cap structure at the 5′ end of mRNA has long been used in cDNA cloning [21]. Along with oligo(dT) priming of polyadenylated mRNAs, the cap structure allows for the selection of full-length cDNAs. Different methods have been developed over time for targeting capped mRNAs, of which the so-called cap trapper [22,23], oligo-capping [24], and cap switching [25] have been the most important.


Fig. 2.1 Flow of DeepCAGE experiment: key steps of a CAGE experiment. Step numbers refer to numbering used in the CAGE protocol.

CAGE makes use of the cap trapper method to select 5′ ends of mRNA. During cap trapping, the diol group within the cap structure is chemically biotinylated to capture mRNA–cDNA hybrids on streptavidin-coated beads. In RNA–DNA hybrids, the RNA strand is protected against RNase I digestion by the complementary DNA strand. Therefore, the combination of capturing biotinylated mRNA–DNA hybrids with RNase I digestion can be used to destroy all RNAs that are not protected at their 5′ end by a complementary cDNA strand. Moreover, RNase I digestion effectively removes rRNAs, which can make up more than 90% of all cellular RNAs. This is an important advantage of the cap trapper method over other approaches, and it has been used on a large scale to prepare cDNA and CAGE libraries directly from total RNA. Moreover, we found that the chemical modification of the cap structure has a much smaller bias than oligo-capping, which uses a number of enzymatic reactions to replace the cap structure with an oligonucleotide [26], or the cap switching method used in nanoCAGE [27], which depends on the template-free activity of some reverse transcriptases. Centered on the cap trapper method, the preparation of a DeepCAGE library comprises the following steps, also illustrated in Figure 2.1:

• We recommend preparing DeepCAGE libraries directly from total RNA. As outlined above, rRNA will be effectively removed during the cap trapper step. Moreover, working with total RNA removes the need for mRNA purification on oligo(dT) columns. Since a large fraction of the mRNA pool within cells is not polyadenylated, mRNA purification by standard protocols can be biased and selects against nonpolyadenylated Pol II transcripts. The protocol has been optimized to work from only 5 µg of total RNA, which is a large improvement over earlier CAGE protocols.

• Purified total RNA is forwarded to first-strand cDNA synthesis using random priming. We prefer random priming over oligo(dT) priming as it enables recovery of all capped transcripts including nonpolyadenylated mRNAs. Moreover, random priming helps to ensure that the cDNA can be extended to the 5′ end of even very long mRNAs. The random primers introduce an EcoP15I site at the 3′ end of the cDNAs, as needed for a more efficient digestion by EcoP15I. The protocol uses trehalose in the reverse transcription reaction since trehalose was shown to improve the heat stability and fidelity of reverse transcriptases [28,29].

• cDNA transcripts that comprise the 5′ end of mRNAs are selected by the cap trapper method. As pointed out before, the cap trapper step is also essential to remove rRNA along with other truncated RNAs and cDNAs that did not extend to the very 5′ end of an mRNA.

• After cap trapper selection, a double-stranded linker is ligated to the 3′ end of the single-stranded cDNAs (complementary to the 5′ end of mRNAs). This linker introduces a recognition site for EcoP15I adjacent to the 3′ end of the cDNA. EcoP15I cuts 25/27 bp away from its binding site and therefore allows short tags to be isolated from the 5′ end of mRNAs. Moreover, the linker contains sequences needed for direct sequencing on an Illumina platform. Optionally, the 5′ linker can further be used to introduce a barcode for multiplex sequencing of DeepCAGE libraries. Commonly, we use short 3-bp barcodes in front of the EcoP15I recognition site so that we do not reduce the length of the CAGE tags. With a 3-bp barcode, plus 6 bp for the EcoP15I site, plus 27 bp for the actual CAGE tag, a 36-bp read is sufficient for sequencing the CAGE libraries.

• The second cDNA strand is prepared by means of a primer extension reaction using a high-fidelity DNA polymerase.

• After second-strand cDNA synthesis, the double-stranded DNA is digested by EcoP15I. Efficient cleavage by EcoP15I requires two inversely oriented recognition sites, preferably in a head-to-head orientation. Therefore, the protocol introduces a second EcoP15I site during first-strand synthesis to improve the efficiency of EcoP15I, although it is known that the distance between the two EcoP15I sites may affect the digestion as well.

• After EcoP15I digestion, a second linker is ligated to the open 3′ end of the cDNA. Since EcoP15I digestion creates a 2-bp overhang on the lower strand, the linker uses a 2-bp random overhang to guide ligation. Moreover, the overhang ensures that the entire 27-bp information can be retrieved, in contrast to other protocols using blunt-end ligation steps. The linker at the 3′ end contains sequences needed for direct sequencing on an Illumina platform. Optionally, the linker at the 3′ end can be used to introduce a barcode for multiplex sequencing or indexing samples.

• CAGE tags having linkers at both ends can be amplified by mild polymerase chain reaction (PCR) amplification prior to sequencing, using linker sequences also needed for Illumina sequencing.
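The read layout described above (3-bp barcode, 6-bp EcoP15I site, 27-bp tag, 36 bp in total) can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the pipeline used by the authors: it assumes that the read starts with the barcode, that the EcoP15I recognition sequence appears as CAGCAG on the sequenced strand, and that only exact matches are accepted. The function names and the barcode-to-sample table are hypothetical.

```python
from collections import Counter

ECOP15I_SITE = "CAGCAG"  # EcoP15I recognition sequence (6 bp)
BARCODE_LEN = 3          # short barcode in front of the EcoP15I site
TAG_LEN = 27             # length of the CAGE tag


def split_cage_read(read):
    """Split a read into (barcode, tag); return None if the layout does not match.

    Assumed layout: barcode (3 bp) + EcoP15I site (6 bp) + tag (27 bp) = 36 bp.
    """
    read = read.upper()
    site_start = BARCODE_LEN
    site_end = site_start + len(ECOP15I_SITE)
    if len(read) < site_end + TAG_LEN:
        return None  # read too short to hold a full tag
    if read[site_start:site_end] != ECOP15I_SITE:
        return None  # linker site missing or mutated
    return read[:BARCODE_LEN], read[site_end:site_end + TAG_LEN]


def count_tags_per_sample(reads, barcode_to_sample):
    """Count identical tags per sample; reads with unknown barcodes are skipped."""
    counts = {sample: Counter() for sample in barcode_to_sample.values()}
    for read in reads:
        parsed = split_cage_read(read)
        if parsed is None:
            continue
        barcode, tag = parsed
        sample = barcode_to_sample.get(barcode)
        if sample is not None:
            counts[sample][tag] += 1
    return counts


if __name__ == "__main__":
    tag = "ACGT" * 6 + "ACG"          # toy 27-bp tag
    reads = ["ACG" + ECOP15I_SITE + tag, "GAT" + ECOP15I_SITE + tag]
    samples = {"ACG": "sample_1", "GAT": "sample_2"}  # hypothetical barcodes
    print(count_tags_per_sample(reads, samples))
```

The resulting tag counts per sample are the raw material for mapping to the genome; a real implementation would also have to deal with sequencing errors in the barcode and the linker.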

2.3 Why CAGE?

CAGE offers a comprehensive view of transcriptomes with a focus on 5′ ends. Other methods that have been employed for genome-wide transcriptome analysis include microarrays, SAGE, and tiling arrays. Microarrays have been valuable for assessment of gene expression [30], and have been applied to study development (e.g., [31]), circadian rhythms (e.g., [32]), immunity (e.g., [33,34]), and many other biological and medical problems. However, microarrays are limited by their dependence on established genome annotations needed for probe design. Whether custom-made or commercial, microarrays can only assess RNA transcripts that hybridize to probes included on the chips. Thus, microarray analysis may not provide complete and unbiased transcriptome descriptions. The ease with which microarrays can be used, particularly with the wide availability of commercial processing services, has contributed to their wide popularity and routine use. By comparison, SAGE has been much less commonly used, despite its advantage of not being limited by any selected probe set. SAGE sequences are usually at internal sites towards the transcripts’ 3′ ends and are determined by the cleavage site of a restriction enzyme closest to that end. For this reason, SAGE does not provide information on TSSs or alternative 5′ ends of transcripts, let alone underlying promoter sequences and regulatory information. Tiling arrays are an alternative method of transcript analysis that yields information over the entire transcript length including the 5′ ends [35]. However, tiling arrays do not allow accurate identification of actual TSSs due to their limited resolution. Also, linking the 5′ ends of transcripts with landmarks considerable distances downstream of transcript 5′ ends can be problematic – a problem shared by the DeepCAGE method described in this chapter. We expect that tiling arrays will be largely replaced by RNA-seq methods [36–38] discussed in Chapter 6, as RNA-seq has a much higher resolution than tiling arrays. Although RNA-seq provides more overall transcript information, it requires far more sequencing (an estimated 10- to 50-fold more) to identify TSSs by RNA-seq than by CAGE. Moreover, RNA-seq protocols are often biased towards lower coverage of sequences at 5′ and 3′ ends, which limits quantitative identification of TSSs and profiling of promoter activities. Studies in mouse and man have revealed a wealth of TSSs discovered by CAGE that are still not understood, with great potential for regulation of gene expression [1,12], and that could not have been identified by other methods. Drosophila melanogaster is an optimal model system with a rich history in genetic studies to explore new promoter activities and to bring them to functional analysis. Its advantages include a very well-annotated genome, a wealth of functional genomic tools, and a world-wide community to verify CAGE results by linking TSSs to gene transcripts and using the aforementioned tools to explore transcript functions. Such experiments could include studying the relations between different TSSs and alternative transcripts of individual genes, and the roles of antisense transcripts and noncoding RNAs found in CAGE data. In addition, comparisons of CAGE-identified D. melanogaster promoters with sequences from 11 other Drosophila species will enable assessment of conservation through evolution, with implications for function.
A method similar to CAGE has been applied to analyze transcription in Drosophila embryos [39], whereas we have generated CAGE libraries from whole larvae (manuscript in preparation). These experiments have shown the practicality of CAGE for whole-animal studies, even though Drosophila has a denser genome than mammals [40,41]. Several considerations were involved in handling our CAGE data. One was how to cluster CAGE tags meaningfully. Mammalian studies revealed narrow and broad CAGE tag cluster peaks associated with genes of particular expression patterns (ubiquitous or limited) and different underlying promoter sequences (TATA boxes versus CpG islands). Another was the complexity of linking CAGE tags to genome landmarks on the basis of location alone, when the Drosophila genome contains overlapping genes and genes within introns. This was potentially complicated still further by our choice of studying whole animals with all their tissues (i.e., nervous system, gut, muscle, hemolymph, and imaginal disks). Our initial analyses were necessarily approximate and based on a conservative estimate that only narrow peaks within 50 bp of a genome landmark could be confidently linked. In the cases we examined, this appeared to distinguish genes within introns from the larger surrounding genes and from genes located on opposing strands. As a representative example, Figure 2.2 shows the CAGE clusters in the coding Drosophila gene CG13321 on chromosome 2R (location: 49E4–49E4). For this gene, as for many others, we could confirm the known TSS, but additional CAGE clusters were found at the beginning of two exons, within exons, and at the 3′ end. It was intriguing to see CAGE clusters in intron sequences right in front of exons, which could be new TSSs for transcripts starting at the adjacent exons. It is a common pattern for CAGE to find signals within exons, which could reflect processed and recapped transcripts.
Fig. 2.2 Location of CAGE clusters in gene CG13321. Drosophila gene CG13321 on chromosome 2R (location: 49E4–49E4) is a coding gene of unknown function. The locations of CAGE clusters are shown in the track on the top. In the tracks below, the gene annotation from FlyBase and the RefSeq genes, as well as EST reads, are indicated. For this gene, CAGE clusters were found at the known TSS, the beginning of two exons, within exons, and at the 3′ end. The figure does not show expression levels for each CAGE cluster, but only the regions to which CAGE tags had been mapped.

Similarly, CAGE clusters at transcripts’ 3′ ends have often been seen in human and mouse CAGE datasets and could be related to a new class of gene-termini-associated RNAs [42]. Our main point here is that CAGE on whole animals is practicable and will produce valuable data that may not be available from other methods like RNA-seq, which provide far more complex datasets. After our initial evaluation and publication of those CAGE data, the large Drosophila research community can be counted on to assess CAGE-identified TSSs at a plethora of individual genetic loci. This will further establish the value of CAGE and most assuredly advance understanding of transcriptional control in Drosophila and eukaryotes in general. This is just one example of how CAGE can be used in research on a classical model system. With the ongoing reduction in sequencing cost, we envision that CAGE will be used not only in studies correlated to genome annotations, but will also play a major part in expression profiling and transcriptome analysis for a better understanding of biological systems such as Drosophila and many others.
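The clustering and linking considerations discussed in this section can be made concrete with a small Python sketch. It is a simplified illustration, not the analysis actually performed on the Drosophila data: tag 5′-end positions on a single strand of a single chromosome are grouped by a maximum gap, and only narrow clusters whose peak lies within 50 bp of a landmark are linked. The 50-bp limit comes from the text; the gap and width thresholds, the function names, and the landmark table are assumptions chosen for illustration.

```python
from collections import Counter


def cluster_positions(positions, max_gap=20):
    """Group tag 5'-end positions into clusters of neighbours at most max_gap bp apart."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def link_narrow_clusters(clusters, landmarks, max_width=10, max_distance=50):
    """Return (peak, width, landmark or None) for every cluster.

    A cluster is linked only if it is narrow (width <= max_width, assumed value)
    and its most frequent position lies within max_distance bp of the nearest
    landmark (50 bp, as in the text). `landmarks` maps a name to a position.
    """
    links = []
    for cluster in clusters:
        width = cluster[-1] - cluster[0] + 1
        peak = Counter(cluster).most_common(1)[0][0]
        linked = None
        if width <= max_width and landmarks:
            name, pos = min(landmarks.items(), key=lambda item: abs(item[1] - peak))
            if abs(pos - peak) <= max_distance:
                linked = name
        links.append((peak, width, linked))
    return links


if __name__ == "__main__":
    # Toy positions and landmarks for illustration only.
    tag_positions = [100, 101, 101, 102, 400, 401, 401]
    landmarks = {"geneA_TSS": 110, "geneB_exon2": 600}
    clusters = cluster_positions(tag_positions)
    for peak, width, name in link_narrow_clusters(clusters, landmarks):
        print(peak, width, name)
```

In this toy case the first cluster is linked to the annotated TSS, whereas the second remains unassigned because no landmark lies within 50 bp. Such unassigned clusters correspond to the candidate novel TSSs discussed above.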

2.4 Methods and Protocols

2.4.1 Key Reagents and Consumables

Key Reagents

• PrimeScript reverse transcriptase (TaKaRa; cat. no. 2680A, 10 000 U)
• Agencourt RNAClean XP kit (Beckman Coulter; cat. no. A63987, 40 ml)
• Agencourt AMPure XP kit (Beckman Coulter; cat. no. A63881, 60 ml)
• Biotin (long arm) hydrazide (Vector; cat. no. SP-1100, 50 mg)
• RNase ONE ribonuclease (Promega; cat. no. M4261, 1000 U)
• MPG Streptavidin-coated beads (TaKaRa; cat. no. 6124A, 2 ml)
• T4 DNA ligase (NEB; cat. no. M0202S, 20 000 U)
• SYBR Premix Ex Taq (Perfect Real Time) (TaKaRa; cat. no. RR041A, 200 reactions)
• TaKaRa LA Taq (TaKaRa; cat. no. RR002A, 125 U)
• EcoP15I (NEB; cat. no. R0646S, 500 U)
• Sinefungin (Calbiochem-Novabiochem; cat. no. 567051, 2 mg)
• Phusion High-Fidelity DNA polymerase (Finnzymes; cat. no. F-530S, 100 U)
• Exonuclease I (Escherichia coli) (NEB; cat. no. M0293S, 3000 U)
• MinElute PCR purification kit (Qiagen; cat. no. 28004, 50 columns)

Selected Buffers and Solutions Needed for DeepCAGE Library Preparation

• Saturated trehalose (2.26 M)

  D-Trehalose                             7.27 g
  H2O                                     to 10 ml final volume

  Autoclave at 121 °C for 30 min.

• 4.9 M Sorbitol

  Sorbitol                                17.8 g
  H2O                                     to 20 ml final volume

  Autoclave at 121 °C for 30 min.

• 3.3 M Sorbitol/0.66 M trehalose stock solution

  Saturated trehalose                     10 ml
  4.9 M Sorbitol                          20 ml

  Prepare solution in 50-ml screw-cap tube by mixing both solutions. Then add Chelex 100 ion exchanger (about 1 cm high from the bottom of the tube). Do not use a metal spoon for adding the Chelex 100. Mix well and incubate solution for 3 h at room temperature. Spin down briefly the Chelex 100 and transfer supernatant to a different tube. Protect final solution from light.

• 5× Buffer for 3′ linker ligation

  Component                               Volume          Final concentration
  1 M Tris–HCl (pH 7.0)                   50.0 µl         250 mM
  100 mM ATP                              10.0 µl         5 mM
  Bovine serum albumin (BSA; 10 mg/ml)    0.5 µl          25 µg/ml
  H2O                                     up to 200 µl    final volume

• Wash Buffer 1

  Component                               Volume          Final concentration
  5 M NaCl                                45 ml           4.5 M
  0.5 M EDTA (pH 8.0)                     5 ml            50 mM

• Wash Buffer 2

  Component                               Volume          Final concentration
  5 M NaCl                                3 ml            0.3 M
  0.5 M EDTA (pH 8.0)                     100 µl          1 mM
  H2O                                     46.9 ml

• Wash Buffer 3

  Component                               Volume          Final concentration
  1 M Tris–HCl (pH 8.5)                   1.0 ml          20.0 mM
  0.5 M EDTA (pH 8.0)                     100 µl          1.0 mM
  1 M NaOAc (pH 6.1)                      25.0 ml         0.5 M
  10% Sodium dodecylsulfate               2.0 ml          0.4%
  H2O                                     21.9 ml

• Wash Buffer 4

  Component                               Volume          Final concentration
  1 M Tris–HCl (pH 8.5)                   500 µl          10.0 mM
  0.5 M EDTA (pH 8.0)                     100 µl          1.0 mM
  1 M NaOAc (pH 6.1)                      25.0 ml         0.5 M
  H2O                                     24.4 ml
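The wash buffer recipes follow the usual dilution relation C1 × V1 = C2 × V2. As a minimal sketch, the Python snippet below recomputes the Wash Buffer 3 volumes from the listed stock and final concentrations. It assumes a 50 ml final volume (the value the listed volumes add up to when the EDTA volume is read as 100 µl); the function name `stock_volume_ml` and the dictionary layout are illustrative only and not part of the protocol.

```python
def stock_volume_ml(stock_conc, final_conc, final_volume_ml):
    """Volume of stock (ml) so that C1 * V1 = C2 * V2.

    stock_conc and final_conc must use the same unit (M, or % for SDS).
    """
    return final_conc * final_volume_ml / stock_conc


FINAL_VOLUME_ML = 50.0  # assumption: total volume implied by the recipe

# Wash Buffer 3: stock concentration -> final concentration
wash_buffer_3 = {
    "1 M Tris-HCl (pH 8.5)": stock_volume_ml(1.0, 0.020, FINAL_VOLUME_ML),
    "0.5 M EDTA (pH 8.0)": stock_volume_ml(0.5, 0.001, FINAL_VOLUME_ML),
    "1 M NaOAc (pH 6.1)": stock_volume_ml(1.0, 0.5, FINAL_VOLUME_ML),
    "10% SDS": stock_volume_ml(10.0, 0.4, FINAL_VOLUME_ML),
}
# Water makes up the remainder of the final volume
wash_buffer_3["H2O"] = FINAL_VOLUME_ML - sum(wash_buffer_3.values())

for component, volume in wash_buffer_3.items():
    print(f"{component}: {volume:.1f} ml")
```

The same helper can be used to cross-check the other wash buffers before preparing them, which is a cheap way to catch a unit mix-up between ml and µl.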


Other buffers and solutions needed:

• Ultra-pure water
• 10 mM dNTPs
• 1 M NaOAc (pH 4.5)
• 40% Glycerol
• 1 M Tris–HCl (pH 8.0)
• 20 mg/ml tRNA
• 70% Ethanol
• 1 M Na citrate (pH 6.0)
• 0.5 M EDTA (pH 8.0)
• 50 mM NaOH
• 1 M Tris–HCl (pH 7.0)
• 50% PEG6000
• 25 mM MgCl2
• 2.5 mM dNTPs
• 0.4 M MgCl2
• 3 M NaOAc (pH 5.0)

Refer to standard laboratory handbooks [43,44] regarding the preparation of standard buffers and solutions not described in detail here. Enzyme reaction buffers are provided by the enzyme maker if not otherwise stated in the text.

Oligonucleotides

Refer to Table 2.1 for the oligonucleotides required for DeepCAGE library preparation.

2.4.2 Precautions

Make sure that you have all necessary reagents and equipment before starting library preparation. DeepCAGE uses RNA samples and therefore all necessary precautions have to be taken for working under RNase-free conditions. Refer to a laboratory manual for more advice on working with RNA [43,44]. Moreover, very small amounts of nucleic acids are handled during DeepCAGE library preparation, and therefore all tips and tubes should be siliconized if not otherwise noted in the protocol. Make sure that you do not lose your sample.

DeepCAGE libraries can be amplified by PCR, and the PCR products have to be purified after the amplification step. Strictly separate the area where the PCR reactions are set up from the area where the PCR products are purified. Refer to a laboratory manual for more advice on how to avoid PCR contamination. Under all circumstances, prevent PCR products from previous DeepCAGE library preparations from contaminating other samples. If you are preparing many DeepCAGE libraries from the same organism, such cross-contamination may not be recognized during the later data analysis. Consider using barcodes and/or indexing for quality control of DeepCAGE libraries within the same study, so that samples and contaminations can be tracked.

If you are uncertain about your reagents and laboratory set-up, make a test library first to check how the protocol works in your hands. DeepCAGE libraries can be analyzed on an Agilent Bioanalyzer, or PCR products can be cloned for test sequencing. Reading a few clones by capillary sequencing can help you to confirm the success of your first DeepCAGE library preparations before starting more expensive deep sequencing on an Illumina sequencer.

2.4.3 RNA Samples Used for DeepCAGE Library Preparation

Many different RNA preparation methods have been successfully used for DeepCAGE library preparation, including standard commercial kits for the preparation


Table 2.1 Reverse transcript primers.

Name                  Sequence                                                   Length

RT-N15-EcoP           5′-AAGGTCTATCAGCAGNNNNNNNNNNNNNNNC-3′                       31 nt

5′ Linker
5′ SOL_#1   upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGAGACAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGTCTCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#2   upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGCTTCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGAAGCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#3   upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGGCCCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGGGCCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#4   upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGGATCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGATCCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#5   upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGACACAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGTGTCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#6   upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGACTCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGAGTCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#8   upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGACGCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGCGTCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#9   upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGATCCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGGATCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#10  upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGATGCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGCATCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#11  upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGAGCCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGGCTCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#12  upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGAGTCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGACTCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#13  upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGTAGCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGCTACTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#14  upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGTGGCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGCCACTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#15  upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGGTACAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGTACCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt
5′ SOL_#16  upper     5′-bio-CCACCGACAGGTTCAGAGTTCTACAGGACCAGCAGNNNNNN-3′         41 nt
            lower     5′-phos-CTGCTGGTCCTGTAGAACTCTGAACCTGTCGGTGG-NH2-3′          35 nt

Second-strand synthesis primer
Second SOL            5′-bio-CCACCGACAGGTTCAGAGTTCTACAG-3′                        26 nt

3′ Linker
3′ SOL      upper     5′-phos-NNTCGTATGCCGTCTTCTGCTTG-3′                          23 nt
            lower     5′-CAAGCAGAAGACGGCATACGA-3′                                 21 nt

PCR primer
SOLX41F_34            5′-AATGATACGGCGACCACCGACAGGTTCAGAGTTC-3′                    34 nt
SOLX_R                5′-CAAGCAGAAGACGGCATACGA-3′                                 21 nt

qRT-PCR primer for [5′ L-RTp]
SOL-f1                5′-CGACAGGTTCAGAGTTCTACAG-3′                                22 nt
SOL-f2                5′-CGACAGGTTCAGAGTTCTAC-3′                                  20 nt
SOL-r1                5′-CCTTCGGTTAAGGTCTATCAG-3′                                 21 nt
SOL-r2                5′-CCTTCGGTTAAGGTCTATCAGC-3′                                22 nt

qRT-PCR primer for [ACTB for human]
EcoP F3_forward       5′-TATAGCCAGCAGGACCGCCG-3′                                  20 nt
ACTB-R3               5′-CATGCCGGAGCCGTTGTC-3′                                    18 nt

qRT-PCR primer for [ACTB for mouse]
EF1f                  5′-TACCCCCAGCAGGACTGTC-3′                                   19 nt
M_R2                  5′-GTCATCCATGGCGAACTGG-3′                                   19 nt

Bold = sequence of 3-bp barcode; italic = EcoP15I recognition site.

of total RNA. (Note: Extra effort may be required to remove polysaccharides from some RNA samples.) All RNA samples should be analyzed before starting any DeepCAGE library preparation. Given the total cost of library preparation, high-speed sequencing, and data analysis, there is little point in starting the experiment with low-quality RNA. A standard RNA quality control should include:

• Measuring A260/280 (>2.0) and A230/260 (…

    … > SRR006565.sai
    bwa samse hg19_male.fa SRR006565.sai SRR006565.trimmed.fastq > SRR006565.sam

The alignment file produced by BWA can be sorted and indexed in the binary sequence alignment/map (BAM) format [41]. Such alignment files can be uploaded to most genome browsers for visualization.

    samtools view -uS SRR006565.sam | samtools sort - SRR006565

The BAM files contain the result of the sequencing as aligned reads. However, the interest in this RACE analysis of TSSs is the 5′ end of the transcript (i.e., the first base of the aligned RACE reads). The following operations, using BEDTools and the standard Unix awk command, transform the alignment of RACE reads into a count of how many reads support the use of a nucleotide position for transcription initiation, and then isolate the regions where most reads accumulate.

    bamToBed -i SRR006565.bam | awk 'BEGIN{OFS="\t"}{if($6=="+"){print $1,$2,$2+1,$4,$5,$6};if($6=="-"){print $1,$3-1,$3,$4,$5,$6}}' > SRR006565.bed

Genomic windows are then created with the mergeBed command, grouping reads separated by less than 100 bases in a single genomic interval. These intervals are then filtered with awk to discard regions containing fewer than 500 reads, since the number of reads expected for the loci of interest is far greater.

    mergeBed -n -d 100 -i SRR006565.bed | awk 'BEGIN{OFS="\t"}{if($4>=500){print}}' > SRR006565.mask

This yields 29 regions. The experiment targeted 17 genes. One of them had two alternative promoters. Others belong to conserved protein families, and cross-hybridization of the gene-specific PCR primers generated some secondary signal from orthologous or paralogous loci. In the last step of the bioinformatics analysis, the precise profile is generated by counting the number of reads at 5′ ends per genomic nucleotide with mergeBed and discarding the values for the nucleotides outside the regions of interest with windowBed.


4 RACE: New Applications of an Old Method to Connect Exons

Table 4.2 Example result for the SND1 gene on chromosome 7.

Start        End          Count    Strand
127292201    127292202    1        +
127292205    127292206    4        +
127292218    127292219    2        +
127292245    127292246    2146     +
127292246    127292247    4844     +
127292247    127292248    93039    +
127292248    127292249    7110     +
127292249    127292250    4938     +
127292250    127292251    1730     +

    mergeBed -s -n -d -1 -i SRR006565.bed | windowBed -a stdin -b SRR006565.mask -w 1 -u > SRR006565.csv

The resulting file is very compact and can be loaded in the form of a spreadsheet. As an example, the contents of such a spreadsheet for the gene SND1 on chromosome 7 are listed in Table 4.2. Promoter profiles are typically displayed as XY scatter plots with the genomic coordinates on the horizontal axis and the expression score on the vertical axis (Figure 4.2).
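For readers who want to follow the logic of the command-line steps without BEDTools at hand, the Python sketch below mirrors them on a few invented reads: it reduces each aligned read to its 5′-end nucleotide, counts reads per position, merges nearby positions into windows, and keeps windows above a read threshold. The function names, the toy coordinates, and the low threshold are illustrative assumptions; the merging rule only approximates the distance handling of mergeBed and is not a replacement for it.

```python
from collections import Counter


def five_prime_end(chrom, start, end, strand):
    """1-bp BED interval (0-based, half-open) of the read's 5' end."""
    if strand == "+":
        return (chrom, start, start + 1, strand)
    # On the minus strand the 5' end is the last base of the interval
    return (chrom, end - 1, end, strand)


def count_five_prime_ends(reads):
    """Number of reads supporting each 5'-end position."""
    return Counter(five_prime_end(*read) for read in reads)


def merge_windows(counts, max_gap=100):
    """Merge positions no more than max_gap apart, ignoring strand.

    Returns a list of (chrom, start, end, n_reads) tuples.
    """
    windows = []
    for (chrom, start, end, _strand), n in sorted(counts.items()):
        if windows and windows[-1][0] == chrom and start - windows[-1][2] <= max_gap:
            prev = windows[-1]
            windows[-1] = (chrom, prev[1], max(prev[2], end), prev[3] + n)
        else:
            windows.append((chrom, start, end, n))
    return windows


# Invented example reads: (chrom, start, end, strand)
reads = [
    ("chr7", 100, 136, "+"),
    ("chr7", 100, 136, "+"),
    ("chr7", 104, 140, "+"),
    ("chr7", 500, 536, "-"),
]
counts = count_five_prime_ends(reads)
windows = merge_windows(counts, max_gap=100)
# The chapter uses a threshold of 500 reads; 2 is used here for the toy data
mask = [w for w in windows if w[3] >= 2]
print(mask)
```

On real data the BEDTools commands remain the practical choice; the sketch is only meant to make the coordinate arithmetic for plus- and minus-strand reads explicit.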

Fig. 4.2 Deep-RACE profile of the sharp peak promoter of SND1. Promoter profile of SND1 plotted with the data of Table 4.2. This is a typical example of single-peak promoter – one of the four classes of promoter shapes determined by the FANTOM3 project using CAGE [42].

[Plot: Count on the vertical axis (0 to 100 000) against genomic coordinate on the horizontal axis (127,292,180 to 127,292,260).]
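The sharpness of the SND1 profile can be quantified directly from the counts in Table 4.2. The short Python sketch below locates the dominant start position and reports the share of reads it accounts for; the variable names are illustrative, and the summary statistic is simply one possible way to describe a single-peak promoter, not a measure defined in this chapter.

```python
# (start, count) pairs taken from Table 4.2 (SND1, chromosome 7, plus strand)
profile = [
    (127292201, 1),
    (127292205, 4),
    (127292218, 2),
    (127292245, 2146),
    (127292246, 4844),
    (127292247, 93039),
    (127292248, 7110),
    (127292249, 4938),
    (127292250, 1730),
]

total_reads = sum(count for _, count in profile)
peak_start, peak_count = max(profile, key=lambda row: row[1])
peak_fraction = peak_count / total_reads

print(f"Total reads in region: {total_reads}")
print(f"Dominant position: {peak_start} ({peak_count} reads)")
print(f"Fraction of reads at dominant position: {peak_fraction:.3f}")
```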

4.4 Perspectives

New sequencing methods that in the future will allow us to obtain high-quality full-length cDNA sequences may reduce the need for analytical Deep-RACE analysis. However, this progress will not only depend on the length of sequencing reads that will be delivered by future sequencing methods at a high quality, but also on the protocols used for sample preparation. Where we are facing limitations in our sample preparation methods, alternative methods will continue to be needed for confirmation of data originally obtained by high-throughput methods. Here, Deep-RACE and modifications of the protocols described in this chapter along with primer walking strategies will keep their value as important tools in transcript analysis. Nevertheless, another trend in sequencing technology emerges, towards more small-scale instruments, like the Roche 454 GS junior or the Illumina MiSeq, and such instruments may become widespread in molecular biology laboratories. Sequencing of RACE products may then become routine.

Acknowledgments

All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter. C.P. acknowledges the specific research support from a Research Grant for RIKEN Omics Science Center from MEXT and would like to thank all the colleagues at the Omics Science Center for precious feedback during the development of the methodology.

References

1 Frohman, M.A. (1994) PCR Methods Appl., 4, S40–S58.
2 Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M.C., Maeda, N., Oyama, R., Ravasi, T., Lenhard, B., Wells, C. et al. (2005) Science, 309, 1559–1563.
3 Gerhard, D.S., Wagner, L., Feingold, E.A., Shenmen, C.M., Grouse, L.H., Schuler, G., Klein, S.L., Old, S., Rasooly, R., Good, P. et al. (2004) Genome Res., 14, 2121–2127.
4 Frohman, M.A., Dush, M.K., and Martin, G.R. (1988) Proc. Natl. Acad. Sci. USA, 85, 8998–9002.
5 Scotto-Lavino, E., Du, G., and Frohman, M.A. (2006) Nat. Protoc., 1, 2742–2745.
6 Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H., Helt, G. et al. (2005) Science, 308, 1149–1154.
7 Scotto-Lavino, E., Du, G., and Frohman, M.A. (2006) Nat. Protoc., 1, 2555–2562.
8 Carninci, P., Kvam, C., Kitamura, A., Ohsumi, T., Okazaki, Y., Itoh, M., Kamiya, M., Shibata, K., Sasaki, N., Izawa, M. et al. (1996) Genomics, 37, 327–336.
9 Carninci, P., Westover, A., Nishiyama, Y., Ohsumi, T., Itoh, M., Nagaoka, S., Sasaki, N., Okazaki, Y., Muramatsu, M., Schneider, C. et al. (1997) DNA Res., 4, 61–66.
10 Shinshi, H., Miwa, M., Kato, K., Noguchi, M., Matsushima, T., and Sugimura, T. (1976) Biochemistry, 15, 2185–2190.
11 Mandl, C.W., Heinz, F.X., Puchhammer-Stöckl, E., and Kunz, C. (1991) Biotechniques, 10, 484–486.
12 Fromont-Racine, M., Bertrand, E., Pictet, R., and Grange, T. (1993) Nucleic Acids Res., 21, 1683–1684.
13 Maruyama, K. and Sugano, S. (1994) Gene, 138, 171–174.
14 Scotto-Lavino, E., Du, G., and Frohman, M.A. (2006) Nat. Protoc., 1, 3056–3061.
15 Zhu, Y.Y., Machleder, E.M., Chenchik, A., Li, R., and Siebert, P.D. (2001) Biotechniques, 30, 892–897.
16 Hirzmann, J., Luo, D., Hahnen, J., and Hobom, G. (1993) Nucleic Acids Res., 21, 3597–3598.
17 Ohtake, H., Ohtoko, K., Ishimaru, Y., and Kato, S. (2004) DNA Res., 11, 305–309.
18 Shiraki, T., Kondo, S., Katayama, S., Waki, K., Kasukawa, T., Kawaji, H., Kodzius, R., Watahiki, A., Nakamura, M., Arakawa, T. et al. (2003) Proc. Natl. Acad. Sci. USA, 100, 15776–15781.
19 Plessy, C., Bertin, N., Takahashi, H., Simone, R., Salimullah, M., Lassmann, T., Vitezic, M., Severin, J., Olivarius, S., Lazarevic, D. et al. (2010) Nat. Methods, 7, 528–534.
20 Faulkner, G.J., Kimura, Y., Daub, C.O., Wani, S., Plessy, C., Irvine, K.M., Schroder, K., Cloonan, N., Steptoe, A.L., Lassmann, T. et al. (2009) Nat. Genet., 41, 563–571.
21 Kodzius, R., Kojima, M., Nishiyori, H., Nakamura, M., Fukuda, S., Tagami, M., Sasaki, D., Imamura, K., Kai, C., Harbers, M. et al. (2006) Nat. Methods, 3, 211–222.
22 Wei, C., Ng, P., Chiu, K.P., Wong, C.H., Ang, C.C., Lipovich, L., Liu, E.T., and Ruan, Y. (2004) Proc. Natl. Acad. Sci. USA, 101, 11701–11706.
23 Hashimoto, S., Suzuki, Y., Kasai, Y., Morohoshi, K., Yamada, T., Sese, J., Morishita, S., Sugano, S., and Matsushima, K. (2004) Nat. Biotechnol., 22, 1146–1149.
24 Maeda, N., Nishiyori, H., Nakamura, M., Kawazu, C., Murata, M., Sano, H., Hayashida, K., Fukuda, S., Tagami, M., Hasegawa, A. et al. (2008) Biotechniques, 45, 95–97.
25 Giardine, B., Riemer, C., Hardison, R.C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Taylor, J. et al. (2005) Genome Res., 15, 1451–1455.
26 Quinlan, A.R. and Hall, I.M. (2010) Bioinformatics, 26, 841–842.
27 Olivarius, S., Plessy, C., and Carninci, P. (2009) Biotechniques, 46, 130–132.
28 Freeman, J.D., Warren, R.L., Webb, J.R., Nelson, B.H., and Holt, R.A. (2009) Genome Res., 19, 1817–1824.
29 Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y., Chen, Z. et al. (2005) Nature, 437, 376–380.
30 Valouev, A., Ichikawa, J., Tonthat, T., Stuart, J., Ranade, S., Peckham, H., Zeng, K., Malek, J.A., Costa, G., McKernan, K. et al. (2008) Genome Res., 18, 1051–1063.
31 Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J., Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell, H.R. et al. (2008) Nature, 456, 53–59.
32 Rozen, S. and Skaletsky, H. (2000) Methods Mol. Biol., 132, 365–386.
33 Rice, P., Longden, I., and Bleasby, A. (2000) Trends Genet., 16, 276–277.
34 Pruitt, K.D., Tatusova, T., and Maglott, D.R. (2007) Nucleic Acids Res., 35, D61–D65.
35 Kapranov, P., Cheng, J., Dike, S., Nix, D.A., Duttagupta, R., Willingham, A.T., Stadler, P.F., Hertel, J., Hackermüller, J., Hofacker, I.L. et al. (2007) Science, 316, 1484–1488.
36 Pfeffer, S., Lagos-Quintana, M., and Tuschl, T. (2005) Curr. Protoc. Mol. Biol., 26, 26.4.
37 Salimullah, M., Kato, S., Murata, M., Kawazu, C., Plessy, C., and Carninci, P. (2009) Biotechniques, 47, 305–307.
38 Möller, S., Krabbenhöft, H.N., Tille, A., Paleino, D., Williams, A., Wolstencroft, K., Goble, C., Holland, R., Belhachemi, D., and Plessy, C. (2010) BMC Bioinformatics, 11 (Suppl. 12), S5.
39 Cock, P.J.A., Fields, C.J., Goto, N., Heuer, M.L., and Rice, P.M. (2010) Nucleic Acids Res., 38, 1767–1771.
40 Li, H. and Durbin, R. (2010) Bioinformatics, 26, 589–595.
41 Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R. (2009) Bioinformatics, 25, 2078–2079.
42 Carninci, P., Sandelin, A., Lenhard, B., Katayama, S., Shimokawa, K., Ponjavic, J., Semple, C.A., Taylor, M.S., Engström, P.G., Frith, M.C. et al. (2006) Nat. Genet., 38, 626–635.

5 RNA-PET: Full-Length Transcript Analysis Using 5′- and 3′-Paired-End Tag Next-Generation Sequencing

Xiaoan Ruan and Yijun Ruan

Abstract

RNA-PET is a paired-end tag (PET) sequencing method for full-length mRNA analysis using next-generation sequencer platforms such as the Illumina Genome Analyzer and Applied Biosystems SOLiD. Unlike the RNA-seq method that sequences short randomly sheared shotgun RNA fragments, RNA-PET captures and sequences the 5′- and 3′-end tags of full-length cDNA fragments of all expressed genes in a biological sample. When mapped to a reference genome, RNA-PET sequences can demarcate the boundaries of transcription units genome-wide, in addition to its ability to quantify the transcription level of all expressed genes. Furthermore, the unique feature of RNA-PET is to identify fusion transcripts. Therefore, RNA-PET has been regarded as the best PET method for genome annotation. In this chapter, we describe the details of the RNA-PET protocol and discuss the critical issues.

5.1 Introduction

Genomics holds much promise for huge improvements in human healthcare and next-generation sequencing technologies are becoming a driving force that penetrates the entire field of genomic science. As the current sequencing technologies are limited by short sequencing reads, an important part of the sequencing strategy is to use paired-end tag (PET) sequencing approaches to analyze nucleic acid templates such as RNA for transcriptomes and DNA for genomes [1,2]. To fully understand the regulation of gene transcription in the whole-genome context, it is important to define where precisely gene transcription starts and terminates. To obtain such information, we developed an efficient strategy to demarcate the boundaries of transcription units for the whole genome. The core concept of this strategy is to obtain the linked 5′ and 3′ short tag sequences for each transcript, map these terminal “signatures” to the genome, and thereby infer the complete transcription units by the genome sequence encompassed between these 5′ and 3′ signatures. As an intermediate step, we first developed the 5′- and 3′-LongSAGE (long serial analysis of gene expression) protocols to capture the 5′ and 3′ tag sequences from transcripts separately [3]. With this ability, we then combined these two separate protocols into one for extracting the paired-end 5′ and 3′ tags for sequencing and mapping analysis [4]. In the early version of PET sequencing for full-length transcripts, short tag fragments of 20 bp (5′) and 20 bp (3′) were extracted through a plasmid-based cloning process and the paired tag fragments were concatenated into longer DNA fragments for Sanger capillary sequencing (AB3730xl). Later, we adapted the Roche 454 pyrosequencer (GS FLX) for such an analysis [5]. However, the plasmid-based cloning

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.


Fig. 5.1 Diagram of RNA-PET analysis for full-length transcripts by SOLiD. Outline of the four critical functional steps for RNA-PET library construction. (a) GsuI poly(dT) oligo is used for capturing the 3′ end of full-length RNA. (b) Double-stranded cDNA is synthesized and isolated. (c) The double-stranded cDNA is modified with specific DNA linkers and PETs are excised by EcoP15I. (d) SOLiD paired-end sequencing and mapping analysis.

method for PET extraction was a long and laborious process, and the 20-bp tag information limited mapping specificity. We have now further improved the PET analysis protocol by developing a completely in vitro, cloning-free protocol and adapting the enzyme EcoP15I for extracting longer PET fragments of 27 bp at each end. This new version of PET analysis for full-length mRNA (termed RNA-PET) consists of six major steps (Figure 5.1).

1. Capture and synthesize full-length cDNAs from mRNA transcripts using the cap trapper approach and a specifically designed GsuI-dT16 oligo for reverse transcription.
2. Ligate the captured full-length cDNAs to specific DNA linkers.
3. Circularize the linker-ligated full-length cDNAs, and excise 27 bp of terminal tags from each of the 3′ and 5′ ends.
4. Isolate and purify the PET constructs and ligate to sequencing adapters.
5. Paired-end sequencing to analyze the captured ends by either Illumina Genome Analyzer or Applied Biosystems SOLiD 4 and map both sequence tags to the reference genome.
6. Annotate the tags to existing transcript datasets and visualize the annotated sequence data on a genome browser.

In RNA-PET analysis, total RNA is used as starting material, and poly(T) oligos are used to enrich and purify mRNAs. Approximately 1–5 µg poly(A) mRNA is employed for the RNA-PET library construction. The cap trapper approach [6–8] is combined with GsuI-poly(T) oligos to capture full-length cDNAs. After full-length cDNAs are obtained, they are methylated to block the EcoP15I recognition site at the fifth residue A. The cDNA is then ligated to specifically designed linker sequences and circularized


in a larger ligation volume (around 0.1 ng DNA/µl). Uncircularized molecules are removed by exonuclease digestion using Plasmid-Safe (Epicenter) treatment, and the remaining circular cDNAs are digested with EcoP15I to release the PETs, 27 bp from each end. The resulting PET constructs are ligated to specific paired-end sequencing adapters compatible with either the SOLiD or the Illumina GAII sequencing platform. After ligation to the sequencing adapters, the PET templates are amplified by polymerase chain reaction (PCR) and sequenced from both ends. Approximately 20–30 million PETs are generated in each sequencing run. After filtering out redundant and low-quality tags, unique PET sequence reads are mapped onto a reference genome. Approximately 90% of PETs map to known transcripts or splicing variants and are named “concordant PETs”. However, a small portion of PETs (named “discordant PETs”) are misaligned: they map either in the wrong orientation on the same chromosome, to different strands, or to different chromosomes. For concordant PETs, the digital expression level can be obtained from mapped sequence counts. Even though a majority of the discordant PETs are derived from nonspecific ligations, a collection of these PETs serves as a valuable pool for identification of novel transcripts and possible splice variants. It should be noted that RNA-PET only characterizes the 5′ and 3′ ends of transcripts, whereas RNA-seq is robust for tagging internal exons, but poor on transcript terminal regions. Therefore, the combination of RNA-PET and RNA-seq should be viewed as the ultimate solution for comprehensive transcriptome characterization.
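The distinction between concordant and discordant PETs can be made concrete with a small Python sketch. It applies only the three criteria named above (same chromosome, same strand, 5′ tag upstream of the 3′ tag in the direction of transcription). The `Tag` structure, the function name `classify_pet`, and the example coordinates are hypothetical; a production pipeline would normally add further filters, such as a maximum genomic span, that are not described here.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    chrom: str
    pos: int      # mapped genomic coordinate of the tag
    strand: str   # "+" or "-"


def classify_pet(five_prime: Tag, three_prime: Tag) -> str:
    """Label a 5'/3' tag pair as concordant or as one discordant class."""
    if five_prime.chrom != three_prime.chrom:
        return "discordant: different chromosomes"
    if five_prime.strand != three_prime.strand:
        return "discordant: different strands"
    # Distance from the 5' tag to the 3' tag along the direction of transcription
    if five_prime.strand == "+":
        downstream = three_prime.pos - five_prime.pos
    else:
        downstream = five_prime.pos - three_prime.pos
    if downstream < 0:
        return "discordant: wrong orientation"
    return "concordant"


# Invented example pairs
pairs = [
    (Tag("chr1", 1000, "+"), Tag("chr1", 5000, "+")),
    (Tag("chr1", 5000, "-"), Tag("chr1", 1000, "-")),
    (Tag("chr1", 5000, "+"), Tag("chr1", 1000, "+")),
    (Tag("chr1", 1000, "+"), Tag("chr1", 5000, "-")),
    (Tag("chr1", 1000, "+"), Tag("chr2", 5000, "+")),
]
for five, three in pairs:
    print(five, three, classify_pet(five, three))
```

Counting the pairs labeled concordant per transcription unit would then give the digital expression level mentioned in the text.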

5.2 Methods and Protocols

5.2.1 Key Reagents and Consumables

Capture Full-Length Poly(A) mRNA and Synthesize cDNAs

• 1–5 µg Poly(A) mRNA isolated from total RNA
• GsuI-dT16 oligo: 5′-GAGCTAGTTCTGGAGTTTTTTTTTTTTTTTTVN-3′, at 1 µg/µl and stored at −20 °C
• DNA LoBind tubes (Eppendorf) (Note: For all steps involving single-stranded DNA or RNA, it is preferable to use “LoBind” microfuge tubes to avoid loss of nucleic acid)
• Isopropanol-precipitated DNA/RNA: isopropanol and 3 M NaOAc, pH 5.5
• 75% Ethanol (EtOH) (Note: Commercially available nuclease-free water (instead of diethylpyrocarbonate (DEPC)-treated water) was used for all RNA-containing enzymatic reactions to avoid possible inhibition of enzymatic reactions by residual DEPC or ethanol)
• RNasin-Plus RNase inhibitor (Promega)
• 2× GC-I buffer (TaKaRa)
• dNTP mix (with 5-Me-dCTP instead of dCTP): 10 mM dATP, 10 mM dTTP, 10 mM dGTP, 5 mM 5-Me-dCTP in 10 mM Tris–HCl, pH 8.0. Stored at −20 °C
• 4.9 M D-Sorbitol (Sigma)
• SuperScript II and III reverse transcriptase (Invitrogen)
• Prepare saturated trehalose (RNase-free) (Sigma). Heat water in a 1.7-ml tube to 42 °C in a heat block. Slowly add trehalose powder to the tube and dissolve it by vortexing. Maintain temperature at around 42 °C and continue to add trehalose until saturation is reached. Upon cooling down the solution to room temperature, trehalose will form crystals and a saturated solution is obtained. (Note: Since the solubility of trehalose increases with temperature, it is important to maintain the temperature of the trehalose solution at or around 42 °C (at which temperature trehalose is used in the protocol).) Aliquot the solution and store at −20 °C.

• Proteinase K, 20 mg/ml (Ambion)
• Phenol/chloroform/isoamyl alcohol (IAA) solution, 25:24:1, pH 6.6 (Ambion)
• Sodium periodate (NaIO4) (Sigma)
• 1.1 M NaOAc pH 4.5: diluted from 3 M NaOAc and adjusted to pH 4.5
• 1 M NaOAc at pH 6.1: diluted from 3 M NaOAc and adjusted to pH 6.1
• 10% Sodium dodecylsulfate (SDS) solution (Gibco)
• 10 mM Biotin hydrazide (long arm) (Vector; cat. no. SP-1100, 50 mg)
• 5 M Sodium chloride (NaCl) (Ambion)
• 10× RNase ONE buffer (Promega)
• RNase ONE ribonuclease, 10 U/µl (Promega)
• Yeast tRNA, 10 mg/ml and 50 mg/ml (Ambion)
• Dyna MPC-S magnet (magnetic particle collector; now replaced by DynaMag-2) (Invitrogen)
• Dynabeads M-280 Streptavidin (Invitrogen)
• 1× Binding Buffer (BB): 2 M NaCl, 50 mM EDTA at pH 8.0
• 1× BB + yeast tRNA: 2 M NaCl, 50 mM EDTA pH 8.0, yeast tRNA 0.25 mg/ml
• 1× Blocking buffer: 0.4% SDS, 50 mg/ml yeast tRNA
• 1× Washing buffer: 10 mM Tris–HCl, pH 7.5, 0.2 mM EDTA, 10 mM NaCl, 20% glycerol, 40 mg/ml yeast tRNA
• Eppendorf thermomixer
• Intelli-Mixer (Elmi)
• Alkaline hydrolysis buffer: 50 mM sodium hydroxide (NaOH), 5 mM EDTA, pH 8.0; prepare fresh each time
• 2-ml MaXtract High Density tube (Qiagen)
• Buffer EB (Qiagen)
• 1× Tris–NaCl–EDTA (TNE) buffer: 10 mM Tris–HCl, pH 8.0, 50 mM NaCl, 0.1 mM EDTA; buffer sterilized by syringe filter or autoclave
• DNA linkers for cap trapper of 5′ ends of the full-length cDNAs and double-stranded cDNA synthesis; linker E-E2-GsuI-N5 and linker E-E2-GsuI-N6 (see Table 5.1 for sequence details and Step 9.1 for oligo annealing)

Table 5.1 Oligonucleotide sequences used in RNA-PET analysis.

Oligonucleotide                     Sequence                                                 Length (nt)

GsuI poly(T) linker for 3′ end of the transcripts and reverse transcription
GsuI-dT16 oligo                     5′-GAGCTAGTTCTGGAGTTTTTTTTTTTTTTTTVN-3′                   33

Linkers for 5′ end cap trapper of full-length cDNA and second-strand cDNA synthesis
Linker E-E2-GsuI-N5     top         5′-CTACCTGGAGAACATGAGGCAGCCAGGNNNNN-3′                    32
                        bottom      5′-phos-CTGGCTGCCTCATGTTCTCCAGGTAG-3′                     26
Linker E-E2-GsuI-N6     top         5′-CTACCTGGAGAACATGAGGCAGCCAGNNNNNN-3′                    32
                        bottom      5′-phos-CTGGCTGCCTCATGTTCTCCAGGTAG-3′                     26

SOLiD linkers for modification of cDNA
SOLiD linker E5v4       top         5′-CCGCCTTGGCCGTACAGCAG-3′ (internal biotin at the sixth dT)     20
                        bottom      5′-phos-GCTGTACGGCCAAG-3′                                 14
SOLiD linker E3v3       top         5′-GCGGATGTACGGTACAGCAGTT-3′ (internal biotin at the sixth dT)   22
                        bottom      5′-phos-CTGCTGTACCGTACAT-3′                               16

SOLiD sequencing adapters and PCR primers
SOLiD_RNA P1_m_top                  5′-TTCCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT-3′         43
SOLiD_RNA P1_m_bottom               5′-NNATCACCGACTGCCCATAGAGAGGAAAGCGGAGGCGTAGTGG-3′         43
SOLiD_RNA P2_m_top                  5′-NNAGAGAATGAGGAACCCGGGGCAG-3′                           25
SOLiD_RNA P2_m_bottom               5′-TTCTGCCCCGGGTTCCTCATTCTCT-3′                           25
SOLiD library PCR primer-1 (ABI)    5′-CCACTACGCCTCCGCTTTCCTCTCTATG-3′                        28
SOLiD library PCR primer-2 (ABI)    5′-CTGCCCCGGGTTCCTCATTCT-3′                               21

• Oligos are synthesized and purified by high-performance liquid chromatography by Integrated DNA Technologies; oligos are annealed to double-stranded DNA and stored at −20 °C at 0.4 µg/µl for use
• 4–20% TBE gels (Invitrogen)
• TaKaRa solution I and II (TaKaRa)
• TaKaRa Ex Taq (TaKaRa)
• dNTP mix, 2.5 mM each (TaKaRa)
• GlycoBlue, 15 mg/ml (Ambion)
• GsuI, 5 U/µl (Fermentas)
• cDNA size fractionation columns (Invitrogen)
• 6× Loading dye (Fermentas)
• TEN Buffer: 10 mM Tris–HCl, pH 8.0, 0.1 mM EDTA, pH 8.0, 25 mM NaCl
• Molecular Probes SYBR Green, 10 000× in dimethylsulfoxide (Invitrogen)
• Gel staining buffer: 1× TBE, Molecular Probes SYBR Green 1
• Gel handler support (Sigma)
• Molecular Probes Quant-iT PicoGreen double-stranded DNA reagent (Invitrogen)
• Calf thymus DNA (Sigma)
• EcoP15I (NEB)
• S-adenosylmethionine (SAM), 32 mM (NEB)

Modify Full-Length cDNAs and Capture 5′- and 3′-End Tags

• 5× T4 DNA ligase buffer with poly(ethylene glycol) (PEG) (Invitrogen)
• T4 DNA ligase, 30 U/µl (Fermentas)
• SOLiD linker E5v4 and SOLiD linker E3v3 at 200 ng/µl (see Table 5.1 for sequence detail)
• GlycoBlue, 15 mg/ml (Ambion)
• 10× T4 DNA ligase (NEB)
• T4 DNA polynucleotide kinase, 10 U/µl (NEB)
• 15 ml MaXtract High Density tube (Qiagen)
• 10× Escherichia coli DNA ligation buffer (Qiagen)
• 10 mM dNTP mix (Eppendorf)
• E. coli ligase (NEB)
• E. coli DNA polymerase I (NEB)
• Plasmid-safe reaction: 25 mM ATP, 10× reaction buffer, plasmid-safe DNase, 10 U/µl (Epicenter)
• 10× NEBuffer 3 (NEB)
• 100× bovine serum albumin (BSA) (NEB)
• 10 mM Sinefungin (Calbiochem)
• ATP (NEB)
• EcoP15I 10 U/µl (NEB)
• SOLiD P1 and SOLiD P2 adapters, each at 200 ng/µl (see Table 5.1)
• 10× NEBuffer 2 (Qiagen)
• 10 mM dNTP mix (Eppendorf)
• E. coli DNA polymerase I (NEB)
• 1× B&W buffer: 5 mM Tris–HCl, pH 7.5, 0.5 mM EDTA, 1 M NaCl
• 2× Phusion High-Fidelity PCR Master Mix with HF Buffer (Finnzymes)
• Molecular Probes SYBR Green 1, 10 000× in DMSO (Invitrogen)
• Gel staining buffer: 1× TBE, Molecular Probes SYBR Green 1
• Gel handler support (Sigma)
• Solexa-454 PCR primer-1 and Solexa-454 PCR primer-2, each at 25 µM
• SOLiD library PCR primer-1 and SOLiD library PCR primer-2, each at 25 µM (see Table 5.1 for sequence details)
• 4–20% TBE gels (Invitrogen)
• 25-bp DNA ladder at 1 µg/µl (Invitrogen)
• 1× TBE buffer (1st BASE) diluted from 10× stock

j

77

78

j

5 RNA-PET: Full-Length Transcript Analysis Using 5′- and 3′-Paired-End Tag Next-Generation Sequencing

• SOLiD library PCR primer-1 and -2
• 6% TBE gel with five wells (Invitrogen)
• 1× TBE buffer (1st BASE) diluted from 10× stock
• QIAquick PCR purification kit (Qiagen)
• Spin-X centrifuge tube filters, cellulose acetate membrane, 0.22-µm pore size (Costar)
• 21G needle (Becton Dickinson)
• 0.6-ml Micro tube (Axygen)

5.2.2 Protocol

1. Mix poly(A) mRNA with GsuI-dT16 oligo and concentrate the mix
1.1 Mix the following reagents in order in a 0.2-ml PCR tube on ice (Notes: (i) For all steps involving RNA manipulations, ensure RNase-free conditions are maintained, including all reaction buffers. (ii) This step is necessary when the combined volume of poly(A) mRNA and GsuI-dT oligo exceeds 9 µl. (iii) Do not use glycogen at any stage where it is not specifically mentioned, as glycogen will interfere with the cap trapper selection process.):

  Poly(A) mRNA (1–5 µg)        xx µl
  GsuI-dT oligo (1 µg/µl)      0.7 × poly(A) mRNA amount (µg)
  3 M NaOAc, pH 5.5            1/10 volume
  Isopropanol                  equal volume
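The proportions in the table above can be turned into a small calculator. The following is a minimal sketch, not part of the original protocol: the function name and return keys are hypothetical, and it assumes that the "1/10 volume" and "equal volume" entries refer to the combined mRNA plus oligo volume and that the oligo entry means 0.7 µg of oligo per µg of mRNA.

```python
def precipitation_volumes(mrna_ug, mrna_ul, oligo_stock_ug_per_ul=1.0):
    """Return reagent volumes (in µl) for the Step 1.1 precipitation.

    Assumptions (not stated explicitly in the protocol):
    - 0.7 µg GsuI-dT oligo is used per µg of poly(A) mRNA;
    - NaOAc and isopropanol volumes are relative to the mRNA + oligo volume.
    """
    oligo_ul = 0.7 * mrna_ug / oligo_stock_ug_per_ul
    sample_ul = mrna_ul + oligo_ul
    return {
        "oligo_ul": oligo_ul,
        "sample_ul": sample_ul,
        "naoac_ul": sample_ul / 10,        # 1/10 volume of 3 M NaOAc
        "isopropanol_ul": sample_ul,       # equal volume of isopropanol
        # Note (ii): concentration is needed only above 9 µl combined volume
        "needs_concentration": sample_ul > 9,
    }


if __name__ == "__main__":
    for name, value in precipitation_volumes(mrna_ug=2.0, mrna_ul=8.0).items():
        print(name, value)
```

For 2 µg of mRNA supplied in 8 µl, the combined volume exceeds 9 µl, so the concentration step would apply under these assumptions.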

1.2 Keep the tube cold at −80 °C for 30 min.
1.3 Centrifuge at 13 200 rpm for 30 min at 4 °C.
1.4 Wash 2 times with 500 µl freshly chilled 75% EtOH.
1.5 Air-dry the pellet, resuspend it in 19 µl nuclease-free water, add 1 µl RNasin-Plus RNase inhibitor to the mixture, and transfer the solution into a 0.2-ml PCR tube.
2. Reverse transcription
2.1 The reaction mixture is heated at 65 °C for 10 min and cooled to 37 °C for 1 min, then held at 42 °C on a thermal cycler while waiting for the other components to be prepared.
2.2 Set up the RT Mix on ice in a 0.2-ml thin-walled PCR tube:

  2× GC-I buffer                                   75 µl
  RNasin-Plus RNase inhibitor                      1 µl
  10 mM dNTP (with 5-Me-dCTP in place of dCTP)     4 µl
  4.9 M Sorbitol                                   26 µl
  SuperScript II reverse transcriptase             8 µl
  SuperScript III reverse transcriptase            4 µl

2.3 Put 10 µl saturated trehalose solution into another 0.2-ml PCR tube and leave it warming at 42 °C in a thermal cycler.
2.4 When the oligo(dT)/mRNA annealing step is complete, place the RT Mix into the thermal cycler that was preset at 42 °C for at least 2 min.

5.2 Methods and Protocols

2.5 Mix the warm trehalose together with the RT Mix (volume = 128 µl), quickly transfer the entire reaction mix into the tube containing the annealed oligo/mRNA (volume now = 148 µl), and immediately start the incubation:

  42 °C    50 min
  50 °C    25 min
  55 °C    25 min
  4 °C     hold

2.6 Add 2 µl of Proteinase K to digest all enzymes by incubation at 45 °C for 15 min.
2.7 Transfer the solution into a LoBind tube and purify the RNA by adding an equal volume of phenol/chloroform/IAA (25 : 24 : 1, pH 6.6) directly into the tube.
2.8 Mix the organic and aqueous phases thoroughly for about 1 min.
2.9 Centrifuge at 13 200 rpm for 3 min to separate the phases.
2.10 Remove the upper, nucleic acid-containing phase by carefully pipetting into a new tube.
2.11 Repeat the extraction with 150 µl water and combine both aqueous phases.
2.12 Use isopropanol to precipitate the RNA/DNA heteroduplex:

  Aqueous phase with RNA/DNA heteroduplex    xx µl
  3 M NaOAc, pH 5.2                          1/10 volume
  Isopropanol                                equal volume

2.13 Keep cold at −80 °C for 30 min, centrifuge at 14 000 rpm for 30 min at 4 °C, and wash 2 times with chilled 75% EtOH.
2.14 Air-dry the pellet and resuspend it in 44.5 µl of nuclease-free water.
3. Oxidation of diol structures
3.1 Prepare the following stock solutions fresh each time, using 1.7-ml tubes:
  10 mM Biotin hydrazide (long arm)
  200 mM NaIO4
3.2 In a 1.7-ml siliconized tube, add the following and mix well:

  RNA/DNA heteroduplex     44.5 µl
  1.1 M NaOAc, pH 4.5      3.0 µl
  Fresh 200 mM NaIO4       2.5 µl

3.3 Incubate the reaction mix on ice for 45 min in the dark.
3.4 Precipitate the DNA/RNA heteroduplex with isopropanol as detailed in Steps 2.12–2.14 and resuspend the pellet in 50 µl nuclease-free water.
4. Biotinylation of 5′ end of mRNA
4.1 To the 50 µl oxidized (−) cDNA/RNA heteroduplex, add the following:

  1 M NaOAc, pH 6.1                     5 µl
  10% SDS                               5 µl
  Fresh-made 10 mM biotin hydrazide     150 µl

4.2 Incubate at room temperature overnight in the dark.


4.3 Precipitate the 210 µl biotinylated (−) cDNA/RNA heteroduplex with isopropanol as detailed in Steps 2.12–2.14. Resuspend the pellet in 170 µl water.
4.4 Preset the Eppendorf shaking incubator to cool down to 4 °C.
5. RNase ONE selection
5.1 Set up the following in a 1.7-ml LoBind tube (Note: Use 2.5 U of RNase ONE (Promega) per microgram of starting poly(A) mRNA.):

  Biotinylated (−) DNA/RNA sample       170 µl
  10× RNase ONE buffer                  20 µl
  RNase ONE ribonuclease (10 U/µl)      xx µl
  Nuclease-free water                   add up to 200 µl
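The "xx µl" entry for RNase ONE follows from the rule of 2.5 U per microgram of starting poly(A) mRNA and the 10 U/µl stock. As a minimal sketch (the function name and defaults are illustrative, taken from the table above rather than from any published script), the enzyme and water volumes can be computed as follows:

```python
def rnase_one_mix(mrna_ug, sample_ul=170.0, buffer_ul=20.0, total_ul=200.0,
                  units_per_ug=2.5, stock_u_per_ul=10.0):
    """Return (enzyme_ul, water_ul) for the Step 5.1 reaction."""
    enzyme_ul = mrna_ug * units_per_ug / stock_u_per_ul
    water_ul = total_ul - sample_ul - buffer_ul - enzyme_ul
    if water_ul < 0:
        raise ValueError("enzyme volume does not fit into the reaction volume")
    return enzyme_ul, water_ul


if __name__ == "__main__":
    for ug in (1.0, 2.0, 4.0):
        enzyme, water = rnase_one_mix(ug)
        print(f"{ug} µg mRNA: {enzyme} µl RNase ONE, {water} µl water")
```

With the 1–5 µg input range of Step 1.1, the enzyme volume stays between 0.25 and 1.25 µl, so the 200 µl reaction always has room for it.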

5.2 Incubate at 37 °C for 30 min. (Note: During RNase ONE digestion, proceed to Step 6 to prepare the Dynabeads.)
5.3 Quench the reaction by adding:

  10 mg/ml Yeast tRNA    4 µl
  5 M NaCl               50 µl

5.4 Leave the RNase ONE-treated sample on ice.
6. Preparation of Dynabeads M-280 Streptavidin
6.1 Use the magnetic stand and LoBind tubes for all steps involving M-280 beads, and use 200 µl of the Dynabeads M-280 Streptavidin suspension per RNA sample. (Note: Before using the Dynabeads, resuspend the beads by strong shaking.)
6.2 Wash Dynabeads to remove preservatives. (Note: Precautions for Dynabeads washing procedures:
• After transferring the required amount of beads (e.g., 200 µl) to a fresh LoBind tube, the tube should be placed on an MPC stand (magnet) for at least 1–2 min.
• Remove wash supernatant only when the tube is positioned on an MPC. Never remove supernatant while the tube is taken off the MPC.
• Add washing buffer along the inside wall of the tube.
• Resuspend and mix the beads only when the tube is taken off the MPC.)
6.3 Wash Dynabeads 3 times with 200 µl of 1× BB at room temperature.
6.4 Add 200 µl 1× BB + yeast tRNA to the beads and incubate at 4 °C for 30 min using an Eppendorf shaker at 800 rpm (precooled to 4 °C, see Step 4.4).
6.5 Wash beads 3 times with 1× BB at room temperature.
7. Binding of full-length biotinylated (−) DNA/RNA heteroduplex
7.1 Remove the supernatant from the beads and add the chilled RNase ONE-treated sample to the beads.
7.2 Rotate for 30 min at room temperature on an Intelli-Mixer for binding to occur (Program F8: U = 50, u = 60, 30 rpm). (Notes: (i) As Dynabeads are heavy and easily settle to the bottom of the tube, the immobilization of (−) cDNA/RNA heteroduplexes onto the beads should be done on an Intelli-Mixer with constant rotation. (ii) The incubation time for immobilization can be increased if the samples are diluted.)
7.3 Wash the heteroduplex-bound beads at room temperature as follows:

  2×    200 µl 1× BB
  1×    200 µl 1× Block
  1×    200 µl 1× Wash
  1×    200 µl of 50 µg/ml yeast tRNA


8. Hydrolytic degradation of bound RNA to release full-length (−) cDNA strands
8.1 Prepare alkaline hydrolysis buffer. (Note: Always use freshly prepared alkaline hydrolysis buffer.)
8.2 Prepare a tube containing 150 µl of 1 M Tris–HCl, pH 7.5 (for neutralization).
8.3 Remove the supernatant (50 µg/ml yeast tRNA) from the beads and add 50 µl alkaline hydrolysis buffer.
8.4 Shake the mixture at 65 °C for 10 min using the Eppendorf shaker at 1400 rpm.
8.5 Collect the supernatant containing full-length (−) cDNA into the tube containing 150 µl of 1 M Tris–HCl, pH 7.5 for neutralization.
8.6 Repeat the hydrolysis and collection steps twice, collecting all fractions into the same tube to a final volume of 300 µl.
8.7 Immediately before use, pellet the gel in the MaXtract High Density tube by centrifugation at 13 200 rpm for 30 s. (Note: RNase-free conditions are not necessary from this step onwards.)
8.8 Transfer the DNA into the MaXtract tube and directly add an equal volume of phenol/chloroform/IAA (25 : 24 : 1, pH 7.9) to the MaXtract tube.
8.9 Mix the organic and aqueous phases thoroughly for about 1 min.
8.10 Centrifuge at 13 200 rpm for 3 min to separate the phases.
8.11 Remove the upper, nucleic acid-containing phase by carefully pipetting into a new tube.
8.12 Precipitate the DNA with isopropanol as detailed in Steps 2.12–2.14. Resuspend the pellet in 10 µl Buffer EB.
9. Synthesis of double-stranded cDNA
9.1 Anneal oligos to prepare double-stranded DNA linkers. (Note: DNA linkers should be annealed beforehand and stored in aliquots at −20 °C. Always keep the annealed linkers on ice and avoid warming them when thawing. This precaution helps to prevent denaturation and subsequent complications. Centrifugation of the annealed adapters is also recommended in the cold (4 °C).)
9.1.1 Thaw single-strand oligos at room temperature for 15 min.
9.1.2 Spin at maximum speed at 4 °C for 1 min to collect any dislodged oligos.
9.1.3 Add 1× TNE buffer to the thawed oligos to bring them to 100 µM.
9.1.4 Vortex for approximately 1 min to resuspend the oligos and briefly spin down to collect them at the bottom of the tubes.
9.1.5 Quality control: perform a NanoDrop photometric measurement and use the optical density constant to calculate the concentration. Check that the measured concentration is within the expected range.
9.1.6 In a 0.2-ml PCR tube, mix together the following oligos (see Table 5.1 for sequences):

  Oligonucleotide A (top strand) (100 µM)       40 µl
  Oligonucleotide B (bottom strand) (100 µM)    40 µl

9.1.7 Heat at 95 °C for 10 min, then turn off the program with the lid closed and let the tube cool down slowly to room temperature to allow the oligos to anneal to each other. The cool-down may take 90 min. For long-term storage, keep the annealed oligos (also called double-stranded DNA linkers or adapters) at −80 °C.
9.1.8 Keep all annealed double-stranded DNA linkers or adapters on ice before use.
9.1.9 Measure the DNA concentration by NanoDrop spectrophotometry and dilute to a concentration of 200 ng/µl with 1× TNE.
9.1.10 Run 200 ng each of the single-strand oligos together with 200 ng of annealed double-stranded DNA linkers on a 4–20% polyacrylamide gel to ensure that the annealing is satisfactory.


9.2 5′ End cap trapper ligation and second-strand cDNA synthesis.
9.2.1 Set the following reagents on ice in a 1.7-ml LoBind tube:

  Full-length single-strand (−) cDNA       10 µl
  0.4 µg/µl Linker E-E2-GsuI-N5            4 µl
  0.4 µg/µl Linker E-E2-GsuI-N6            1 µl
  TaKaRa Solution II (A3101-1)             10 µl
    (Note: Ensure that the cDNA and linkers are well mixed before adding Solution II, as the latter contains PEG, which could lead to precipitation of GlycoBlue)
  TaKaRa Solution I (ligase, A201-1)       20 µl

9.2.2 Mix by flicking and spin briefly at 4 °C.
9.2.3 Incubate at 16 °C overnight to allow the degenerate oligo to anneal and ligate.
9.3 Primer extension for second-strand cDNA synthesis.
9.3.1 Set the following on ice in a 0.2-ml thin-walled PCR tube:

  Overnight ligation mix           45 µl
  Nuclease-free water              35 µl
  10× Ex Taq buffer with Mg2+      8 µl
  2.5 mM dNTP                      8 µl
  Ex Taq polymerase                4 µl

9.3.2 Incubate the PCR reaction in the warmed-up thermal cycler at 65 °C for 5 min, then at 68 °C for 30 min and 72 °C for 10 min, and hold at 4 °C.
9.3.3 Add 2 µl of Proteinase K, mix by pipetting up and down, and incubate at 45 °C for 15 min to digest the polymerase.
9.3.4 Immediately before use, pellet the gel in a MaXtract High Density tube by centrifugation at 14 000 rpm for 30 s.
9.3.5 Transfer the DNA into the MaXtract tube and adjust the reaction volume to 200 µl.
9.3.6 Purify the reaction mix with phenol/chloroform/IAA, precipitate the DNA with isopropanol as detailed in Steps 2.12–2.14, and resuspend the pellet in 66.8 µl water.
9.4 GsuI digestion to remove the poly(A) tail and produce 3′-terminal ends.
9.4.1 Freshly dilute 32 mM SAM to 0.5 mM.
9.4.2 Set the following on ice in a 1.7-ml LoBind tube:

  Full-length double-stranded cDNA            66.8 µl
  10× Buffer TANGO with BSA (Fermentas)       8.6 µl
  0.5 mM SAM (1 µl SAM + 63 µl dH2O)          8.6 µl
  GsuI (5 U/µl, Fermentas; cat. no. ER0462)   2.0 µl
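The SAM dilution in Step 9.4.1 is a plain C1V1 = C2V2 calculation. The short sketch below (helper name chosen here for illustration) confirms that 1 µl of the 32 mM stock in 63 µl of water gives the 0.5 mM working solution, and estimates the SAM concentration in the 86 µl GsuI reaction assembled from the table above.

```python
def diluted_concentration(stock_conc, stock_vol, diluent_vol):
    """Concentration after mixing stock_vol of stock with diluent_vol of diluent."""
    return stock_conc * stock_vol / (stock_vol + diluent_vol)


if __name__ == "__main__":
    # 1 µl of 32 mM SAM + 63 µl dH2O
    working_mM = diluted_concentration(32.0, 1.0, 63.0)
    print("working SAM (mM):", working_mM)

    # 8.6 µl of working SAM in a reaction of 66.8 + 8.6 + 8.6 + 2.0 µl
    reaction_ul = 66.8 + 8.6 + 8.6 + 2.0
    final_mM = diluted_concentration(working_mM, 8.6, reaction_ul - 8.6)
    print("SAM in GsuI reaction (mM):", round(final_mM, 3))
```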

9.4.3 Mix by flicking and briefly spin at 4 °C.
9.4.4 Incubate at 30 °C overnight.
9.4.5 Inactivate GsuI at 65 °C for 20 min and transfer the sample onto ice.


9.5 Isolate full-length cDNAs by size fractionation and separation from added linkers.
9.5.1 Prepare the size fractionation column according to the manufacturer's instructions while the sample is held at 65 °C for 20 min. Equilibrate the column to room temperature before use. (Note: It is important to use only columns that do not show any visible bubbles trapped within the matrix, as in our experience such columns are likely to be faulty. The chance of bubble formation can be reduced by allowing the columns to equilibrate to room temperature.) Remove the top cap first, then the bottom cap, and allow the liquid to drain off completely. Add 0.8 ml TEN buffer and allow it to drain off. Repeat the wash 3 times. (Note: Each draining of the cDNA column should preferably not take longer than 25 min.)
9.5.2 Label 20 1.7-ml tubes for the fractions.
9.5.3 Add 2 µl of 6× loading dye to the cDNA sample and keep it on ice.
9.5.4 Transfer the overnight digested mixture onto the prepared column and collect the entire flow-through into the first collection tube.
9.5.5 Add 100 µl TEN buffer and collect the entire flow-through into the second collection tube.
9.5.6 Add another 100 µl TEN buffer and start to collect cDNA as a single drop per tube, from the third tube through to the 20th tube.
9.5.7 During the collection process, allow each drop to drain off completely before adding the next 100 µl of buffer. It may be necessary to add 100 µl TEN buffer 7 times in total.
9.5.8 After finishing the size fractionation, tubes 9 through 18 are run on a 4–20% TBE gel at 200 V for 15 min to assess the cDNA collection. Usually tubes 3–13 are pooled and purified as full-length cDNA for the next step. (Note: Avoid collecting any fractions that show small-molecular-weight linker DNA bands, as linkers carried into subsequent reactions will quench the enzymatic reactions.) Tubes 14–20 are discarded as they usually contain linker DNA.
9.5.9 The quantity of the purified full-length cDNA is measured by the Quant-iT PicoGreen method following the manufacturer's instructions.
10. Methylation of full-length cDNA using EcoP15I
10.1 Set the following reaction mix on ice in a 1.7-ml LoBind tube:

  Full-length double-strand cDNA         top up to 100 µl with dH2O
  10× Buffer 3                           10 µl
  100× BSA                               1 µl
  32 mM SAM (1 µl SAM + 63 µl dH2O)      25 µl
  EcoP15I                                10 U/µg cDNA
    (Note: Use about 10 U of EcoP15I (NEB) per µg of starting double-stranded cDNA)

10.2 Incubate the reaction mix at 37 °C overnight.
10.3 Immediately before use, pellet the gel in a MaXtract High Density tube by centrifugation at 14 000 rpm for 30 s.
10.4 Transfer the DNA into the MaXtract tube and adjust the reaction volume to 200 µl with dH2O.
10.5 Purify the reaction mix with phenol/chloroform/IAA. Precipitate the full-length cDNA with isopropanol as detailed in Steps 2.12–2.14. Then resuspend the pellet in 50 µl nuclease-free water.


11. Ligation of sequencing linkers to full-length cDNAs
11.1 Estimate the amount of linkers to be used with the formula:

  µl of linkers to be used = [(ng cDNA × 200 × 20 bp)/2500 bp]/[200 ng/µl linkers]

(Note: It is assumed that the average length of cDNA is 2500 bp.)
11.2 Set the following on ice in a 1.7-ml tube and incubate at 16 °C overnight:

  Sequencing linker-1 (200 ng/µl) (see Note in Step 9.1)    x µl (see formula above)
  Sequencing linker-2 (200 ng/µl)                           x µl (see formula above)
  Full-length cDNA                                          50 µl
  5× T4 DNA ligase buffer + PEG                             40 µl
  Nuclease-free water                                       top up to 200 µl
  T4 DNA ligase                                             1 µl
    (Note: Add T4 DNA ligase to the reaction last and keep the reaction cold at all times)

11.3 Prepare a MaXtract High Density tube as in previous procedures.
11.4 Transfer the DNA into the MaXtract tube.
11.5 Purify the reaction mix with phenol/chloroform/IAA. Precipitate the DNA with isopropanol as detailed in Steps 2.12–2.14. Then resuspend the pellet in 44 µl water.
12. Addition of phosphate group to the 5′-ends of linker-ligated full-length cDNAs
12.1 Set the following on ice in a 1.7-ml tube:

  Linker-ligated full-length cDNAs    44 µl
  10× T4 DNA ligase buffer            5 µl
  T4 DNA polynucleotide kinase        1 µl (final conc. 0.2 U/µl)
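The linker formula of Step 11.1 can be sketched as a short function. The parameter names are illustrative; reading the factor 200 as a molar excess of linker over cDNA and 20 bp as the linker length is an interpretation of the formula, not something the protocol states.

```python
def linker_volume_ul(cdna_ng, molar_excess=200, linker_bp=20,
                     cdna_bp=2500, linker_stock_ng_per_ul=200):
    """Volume of each sequencing linker (µl) for Step 11.2.

    Implements [(ng cDNA x 200 x 20 bp) / 2500 bp] / [200 ng/µl linkers].
    cdna_bp is the assumed average cDNA length from the note in Step 11.1.
    """
    linker_ng = cdna_ng * molar_excess * linker_bp / cdna_bp
    return linker_ng / linker_stock_ng_per_ul


if __name__ == "__main__":
    for ng in (100, 250, 500):
        print(f"{ng} ng cDNA -> {linker_volume_ul(ng)} µl of each linker")
```

With the default values the formula reduces to 0.008 µl of linker stock per ng of cDNA, so 500 ng of cDNA calls for 4 µl of each linker.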

12.2 Incubate at 37 °C for 30 min.
13. Circularization of linker-ligated cDNA by ligation in a 5-ml volume
13.1 Prepare the following enzyme reaction mix (5 ml) on ice in a 15-ml tube:

  Nuclease-free water                         4425 µl
  10× T4 DNA ligase buffer                    495 µl
  T4 DNA ligase                               30 µl (final conc. 0.18 U/µl)
  Linker-ligated cDNA mix (from Step 11.2)    50 µl
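The final enzyme concentrations quoted in Steps 12.1 and 13.1 can be checked against the stock concentrations in the materials list (T4 polynucleotide kinase at 10 U/µl, T4 DNA ligase at 30 U/µl). This is a minimal arithmetic sketch with a helper name chosen for illustration:

```python
def final_concentration(stock_conc, added_ul, total_ul):
    """Final concentration after adding added_ul of stock to a total_ul reaction."""
    return stock_conc * added_ul / total_ul


if __name__ == "__main__":
    # Step 12.1: 1 µl kinase (10 U/µl) in 44 + 5 + 1 = 50 µl
    print("kinase U/µl:", final_concentration(10.0, 1.0, 44 + 5 + 1))

    # Step 13.1: 30 µl ligase (30 U/µl) in 4425 + 495 + 30 + 50 = 5000 µl
    total_ul = 4425 + 495 + 30 + 50
    print("circularization volume (µl):", total_ul)
    print("ligase U/µl:", final_concentration(30.0, 30.0, total_ul))
```

Both values agree with the protocol (0.2 U/µl and 0.18 U/µl), and the components of Step 13.1 add up to the stated 5 ml.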

Transfer the enzyme reaction (50 µl) from Step 11.2 into the above ligase mix.
13.2 Incubate the ligation at 16 °C overnight. (Note: Circularization can be performed for as long as 24 h at 16 °C.)
13.3 Immediately before use, pellet the gel in a 15-ml MaXtract High Density tube by centrifugation at 3000 rpm for 1 min.
13.4 Transfer the cDNA reaction mix into the 15-ml MaXtract tube.
13.5 Purify the reaction mix with phenol/chloroform/IAA. Precipitate the DNA with isopropanol as detailed in Steps 2.12–2.14. Then resuspend the pellet in 78 µl of Buffer EB.
14. DNA nick repair
14.1 Set the following reagents on ice in a 1.7-ml tube:

  DNA in Buffer EB                     78 µl
  10× E. coli DNA ligation buffer      10 µl
  10 mM dNTP                           2 µl
  E. coli DNA ligase                   2 µl
  E. coli DNA polymerase I             8 µl

14.2 Incubate at 16 °C for 2 h.
14.3 Immediately before use, prepare a 2-ml MaXtract High Density tube by centrifugation at 14 000 rpm for 1 min.
14.4 Transfer the DNA nick repair reaction mix into the MaXtract tube.
14.5 Purify the reaction mix with phenol/chloroform/IAA. Precipitate the DNA with isopropanol as detailed in Steps 2.12–2.14. Then resuspend the pellet in 84 µl of Buffer EB.
15. Plasmid-safe treatment to cleave remaining linear DNA molecules
15.1 Set the following on ice in a 1.7-ml LoBind tube:

  Circularized DNA                    84 µl
  25 mM ATP (Epicenter)               4 µl
  10× Reaction buffer (Epicenter)     10 µl
  Plasmid-safe DNase (10 U/µl)        2 µl

15.2 Incubate at 37 °C for 40 min. (Note: The maximum incubation time for plasmid-safe treatment is 2 h.)
15.3 Prepare the 2-ml MaXtract High Density tube as above.
15.4 Transfer the DNA into the MaXtract tube and adjust the reaction volume to 200 µl.
15.5 Purify the reaction mix with phenol/chloroform/IAA. Precipitate the DNA with isopropanol as detailed in Steps 2.12–2.14. Then resuspend the pellet in 50 µl Buffer EB.
16. EcoP15I digestion to release 5′ and 3′ tags from circularized full-length cDNA
16.1 Set the following on ice in a 1.7-ml LoBind tube:

  Circularized full-length cDNA    50 µl
  10× Buffer 3                     10 µl
  100× BSA                         1 µl
  10 mM Sinefungin                 1 µl
  10× ATP (NEB)                    2 µl
    (Note: The ATP supplied with EcoP15I (NEB; cat. no. B6101S) is at 10×; thus, 20 µl is needed to obtain a 2× final concentration, while the ATP (NEB; cat. no. P0756) is usually at 100× and hence 2 µl is used instead)
  EcoP15I (NEB)                    10 U/µg DNA
  Nuclease-free water              top up to 100 µl

16.2 Incubate at 37 °C for 2 h.
17. Binding of EcoP15I-digested DNA tags to Dynabeads M-280 Streptavidin
17.1 Swirl the bottle of Dynabeads M-280 Streptavidin suspension thoroughly.
17.2 Transfer 50 µl of Dynabeads M-280 Streptavidin suspension to a 1.7-ml tube.


17.3 Using the MPC, wash the beads with 150 µl of 2× B&W buffer by pipetting up and down.
17.4 Resuspend the beads in 100 µl of 2× B&W buffer.
17.5 Add 100 µl EcoP15I-digested DNA to the resuspended Dynabeads and mix well.
17.6 Incubate at room temperature with rotation on the Intelli-Mixer (Program F8, 30 rpm) for 30 min. During the incubation, biotinylated linkers associated with the captured DNA tags are bound and remain on the beads.
17.7 Using the MPC, wash the beads twice with 150 µl of 1× B&W buffer by pipetting up and down, which removes DNA fragments with no linker attached.
18. Ligation of (Solexa or SOLiD) sequencing adapters to DNA template for high-throughput sequencing
18.1 For Solexa adapter ligation, set the following on ice in a 1.7-ml tube:

  Nuclease-free water                    36 µl
  Solexa 454 adapter E/A (200 ng/µl)     4 µl
  Solexa 454 adapter E/B (200 ng/µl)     4 µl
  10× T4 DNA ligase buffer               5 µl

18.2 For SOLiD adapter ligation, set the following on ice in a 1.7-ml tube:

  Nuclease-free water                36 µl
  SOLiD P1 adapter (200 ng/µl)       4 µl
  SOLiD P2 adapter (200 ng/µl)       4 µl
  10× T4 DNA ligase buffer           5 µl

18.3 Resuspend the tag-bound beads in the ligation mix chosen for the specific sequencing platform.
18.4 Add 1 µl T4 DNA ligase to the bead suspension to a final concentration of 0.6 U/µl to ligate adapters to the captured PETs.
18.5 Incubate at room temperature overnight with rotation on an Intelli-Mixer (Program F8, 30 rpm, U = 50, u = 60).
18.6 Wash the beads twice with 150 µl of 1× B&W Buffer.
19. Nick translation repair
19.1 Set the following reagents on ice in a 1.7-ml tube:

  Nuclease-free water            38.5 µl
  10× Buffer 2                   5.0 µl
  10 mM dNTP                     2.5 µl (final conc. 500 µM)
  E. coli DNA polymerase I       4.0 µl

19.2 Resuspend the Dynabeads in the above reaction mix.
19.3 Incubate at room temperature with rotation for 2 h on an Intelli-Mixer (Program F8, 30 rpm).
19.4 Wash the beads twice with 150 µl of 1× B&W Buffer using the MPC.
19.5 Resuspend the Dynabeads in 50 µl Buffer EB.
20. PCR amplification to assess captured 5′ and 3′ cDNA tags
20.1 Set up the following reaction mix on ice in a 0.2-ml thin-walled PCR tube:

  Nuclease-free water                   21 µl
  Dynabeads suspension                  2 µl
  Solexa or SOLiD PCR primer-1          1 µl
  Solexa or SOLiD PCR primer-2          1 µl
  2× Phusion Master Mix with enzyme     25 µl

PCR cycling conditions:

  98 °C    30 s
  98 °C    10 s     }
  65 °C    30 s     } 20 cycles
  72 °C    30 s     }
  72 °C    5 min
  4 °C     hold

20.2 After completion of the PCR reaction, run 25 µl of the PCR products on a 10-well 4–20% TBE gel at 200 V for 45 min and stain the gel for 10 min in SYBR-TBE buffer before taking a picture.
20.3 Load 500 ng of a 25-bp DNA ladder side by side for size determination.
20.4 As shown in Figure 5.2, an expected 154-bp DNA band is observed from SOLiD constructs, consistent with the structures of the captured cDNA tags associated with platform-specific linkers and adapters.
21. PCR scale-up for preparation of sequencing templates
21.1 Scale up the PCR reactions using all or at least half of the available Dynabeads as amplification template, and collect the expected PCR fragments separated on a 6% TBE gel.
21.2 Depending on the number of PCR reactions, the PCR products may need to be concentrated before loading onto the gel.
21.3 A 25-bp DNA ladder is critical as a size reference when harvesting the desired band, and should be loaded side by side with the PCR samples in the gel.
22. Purify PCR fragment from 6% TBE gel using gel-crush method
22.1 The PCR fragment of interest is carefully excised and collected into several 0.6-ml microtubes that have been pierced at the bottom with a 21G needle. Two or more gel slices can be put into each 0.6-ml microtube as the size of the gel slices permits. The pierced tube is placed inside a 1.5-ml screw-cap microtube and centrifuged at 14 000 rpm for 5 min. The gel slices are thus shredded and collected at the bottom of each 1.5-ml tube.
22.2 Add 400 µl of TE buffer to each 1.5-ml screw-cap tube; stir the gel pieces with the pipette tip to ensure that they are immersed in the TE buffer.

Fig. 5.2 Quality control assessments of the sequence template. To assess the quality of the RNA-PET libraries, the constructed 5′ and 3′ paired-end cDNA libraries are analyzed by both polyacrylamide gel electrophoresis and Agilent Bioanalyzer. The expected construct should be 154 bp long and comprises the 5′- and 3′-end tags, a linker sequence, and two SOLiD sequencing adapters, one ligated to each end. For DNA gel analysis, the DNA band of the expected size is observed for the RNA-PET library templates. (a) Lane 1: 25-bp DNA ladder; lanes 2, 3, and 4: the 154-bp PCR-amplified RNA-PET bands for SOLiD sequencing. After excision from the gel and purification, the template is further checked on an Agilent Bioanalyzer and shown in (b) as a 154-bp peak.

22.3 Transfer the 1.5-ml screw-cap tubes to a −80 °C freezer for 1–2 h, then incubate the tubes at 37 °C overnight. The DNA from the shredded gel will elute into the TE buffer during the incubation.
22.4 After overnight incubation, transfer the gel pieces together with the buffer to the filter cup of a Spin-X column, and centrifuge at 14 000 rpm and 4 °C for 10 min.
22.5 After centrifugation, add 200 µl TE to each filter cup and stir with a pipette tip to loosen the gel pieces. Centrifuge again at 14 000 rpm for 10 min to recover the remaining DNA.
22.6 Transfer the eluate to a new tube and precipitate with isopropanol:

  DNA eluate              xx µl
  3 M NaOAc, pH 5.2       1/10 volume
  GlycoBlue               2 µl
  Isopropanol             equal volume

22.7 Incubate at −80 °C for 30 min, spin at 14 000 rpm and 4 °C for 30 min, and wash 2 times with 70% EtOH.
22.8 Remove the supernatant, spin for 15 s, and remove the remaining supernatant.
22.9 Air-dry the DNA pellet and resuspend the DNA in 20 µl TE buffer.
22.10 Take 1 µl of DNA template to perform a quality control check with an Agilent 2100 Bioanalyzer using the DNA-1000 kit according to the manufacturer's instructions. The Agilent profile should show a clean and strong DNA fragment peak (154 bp) for the correct SOLiD sequencing template and no background from the sample.

5.3 Applications

5.3.1 PET Sequencing with SOLiD

1. Paired-end sequencing is performed in the SOLiD 4 paired-end sequencing format following SOLiD guidelines and instructions. The paired-end sequencing generates two 35-bp PETs.
2. Each PET sequence consists of a 27-bp tag plus an 8-bp linker sequence, which is the part of the linker adjacent to the tag. The PET structure can be written as 27-bp tag + 8 bp–linker–8 bp + 27-bp tag.
3. In our experience, sequencing one spot (equivalent to 1/8 of a slide) usually yields over 35 million pass-filter PETs after the noise reads are filtered out.
4. Since the sequencing adapters can be ligated to either end of a given transcript PET, a specific signature sequence (AACTGCTG) characteristic of the 3′-AA serves as the identifier for the 3′-end tag.
5. To start the PET analysis, the signature sequence is first searched for in both paired-end sequence reads. If the AACTGCTG (5′ → 3′) sequence is identified at one end of a PET, this tag is considered the 3′-end tag, and the other end is defined as the 5′-end tag. A small portion of the PETs do not have any signature sequence or, in rare cases, the signature sequence appears on both paired ends. In these cases, the PETs are discarded from further analysis.

5.3.2 Mapping of the PETs
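Before mapping, the tag-orientation rule described in Section 5.3.1 (Steps 4 and 5) can be sketched in a few lines of Python. This is an illustration only, not the authors' pipeline: it assumes the reads are available as base-space strings (SOLiD output is natively in color space), it accepts the signature anywhere in the read, and the function name is hypothetical.

```python
SIGNATURE = "AACTGCTG"  # 3'-end signature sequence given in Section 5.3.1


def orient_pet(read_a, read_b, signature=SIGNATURE):
    """Return (five_prime_read, three_prime_read), or None if the PET is discarded.

    The read carrying the signature is taken as the 3'-end tag. PETs with the
    signature on neither read or on both reads are discarded.
    """
    a_has = signature in read_a
    b_has = signature in read_b
    if a_has == b_has:          # neither or both carry the signature
        return None
    return (read_b, read_a) if a_has else (read_a, read_b)


if __name__ == "__main__":
    three = "GATTACAGATTACAGATTACAGATTAC" + SIGNATURE   # 27-bp tag + 8-bp linker part
    five = "TTGACCATGGCATTAGGCCATAGCTAGGACTGACC"
    print(orient_pet(three, five))
    print(orient_pet(five, five))
```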

1. After PET orientation (5′ → 3′) is identified, regular mapping of 5′ and 3′ tags to the reference genome is performed through the SOLiD (BioScope) analysis pipeline, followed by further analysis specifically developed for the RNA-PET data.
2. A seed tag of 24 bp is used and a maximum of 2 mismatches is allowed within the seed sequence for each tag mapping.
3. Those PETs whose 5′ and 3′ ends both map uniquely to the reference genome are classified as uniquely mapped PETs.
4. Approximately 90% of PETs map on the same chromosome, on the same strand, and in the correct orientation with respect to known transcripts or known transcript variants; these are defined as concordant PETs.
5. A small portion (around 10%) of PETs that are mapped incorrectly to the reference genome are referred to as discordant PETs. These PETs mapped either in the wrong orientation on the same strand (e.g., 3′-end tag mapped before the 5′-end tag), or on two different strands or two different chromosomes (e.g., one end mapped on chromosome 3, the other end mapped on chromosome 8).
6. A majority of discordant PETs are derived from ligation noise. However, this class of PETs serves as a valuable pool to identify novel, fusion, or transcriptional variants that might be caused by a variety of genome rearrangements (e.g., deletions, inversions, insertions, tandem repeats, and translocations) or from transcriptional variations due to trans-splicing or mutational events.
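The concordant/discordant distinction above can be expressed as a small classifier. The sketch below is an assumption-laden illustration rather than the BioScope or RNA-PET pipeline logic: the `TagHit` record and function name are hypothetical, each tag is reduced to a single genomic coordinate, and the comparison against known transcripts is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagHit:
    """Unique mapping of one tag (hypothetical record layout)."""
    chrom: str
    strand: str   # "+" or "-"
    pos: int      # genomic coordinate of the tag


def classify_pet(five: TagHit, three: TagHit) -> str:
    """Classify a uniquely mapped PET as 'concordant' or 'discordant'."""
    if five.chrom != three.chrom:
        return "discordant"      # ends on two different chromosomes
    if five.strand != three.strand:
        return "discordant"      # ends on two different strands
    # On the + strand the 5' tag must lie upstream (smaller coordinate) of the
    # 3' tag; on the - strand the order of coordinates is reversed.
    if five.strand == "+":
        in_order = five.pos <= three.pos
    else:
        in_order = five.pos >= three.pos
    return "concordant" if in_order else "discordant"


if __name__ == "__main__":
    print(classify_pet(TagHit("chr3", "+", 1_000), TagHit("chr3", "+", 9_500)))
    print(classify_pet(TagHit("chr3", "+", 1_000), TagHit("chr8", "+", 9_500)))
```

A production filter would also need a span limit and a check against annotated transcripts, neither of which is specified numerically in the text.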

5.3.3 PET Clustering, Annotation, and Genome Browser Visualization

1. After concordant PETs are identified, they are clustered with nearby PETs within a 100-bp window to extend each PET at its 5′ and 3′ end, respectively. Specifically, the mapping location of the 5′ and 3′ tag of a given PET is extended by a 100-bp window in both directions.
2. If the 5′ and 3′ tags of a second PET map within the 5′ and 3′ search windows of the first PET, the two PETs are clustered, the search windows are readjusted, and the clustering continues with new PETs. This process is dynamic and iterative, and is repeated until no new PETs can be found within the allowed window.
3. At the end of the clustering process, most related concordant PETs are clustered with each other, and PETs falling outside the search window are classified as singletons, filtered out from the dataset, and not further analyzed. Furthermore, the 5′ ends of any clustered PETs should be within 100 bp of each other. The same criterion applies to the 3′-end tags.
4. RNA-PET sequences mapped onto the reference genome are uploaded to a genome browser for data visualization and further analysis. Analyzed RNA-PET data can be visualized in a genome browser as shown in Figure 5.3. The example is extracted from our genome browser and the tags originate from a cancer RNA sample. Only the concordant PETs are shown in Figure 5.3. Gene expression levels of transcripts are represented by PET counts. Splice variants can also be seen from the mapped concordant PETs.

Fig. 5.3 RNA-PET visualization in a genome browser. Analyzed RNA-PET data can be visualized in a genome browser. In the example extracted from our genome browser, the tags originate from a cancer sample. Only the concordant PETs are shown here. Gene expression levels are represented by the PET counts. Splice variants can also be seen from the mapped concordant PETs.
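The iterative clustering described in Steps 1 to 3 of this section can be sketched as follows. This is a simplified illustration and not the authors' implementation: PETs are represented as hypothetical (5′ position, 3′ position) pairs from one chromosome and strand, clusters grow by single-linkage extension of their 5′ and 3′ ranges, and the additional pairwise 100-bp criterion of Step 3 is not enforced separately.

```python
def _near(lo1, hi1, lo2, hi2, window):
    """True if two closed intervals lie within `window` bp of each other."""
    return lo1 <= hi2 + window and lo2 <= hi1 + window


def cluster_pets(pets, window=100):
    """Cluster concordant PETs given as (five_pos, three_pos) tuples.

    Returns (clusters, singletons). Each cluster is a list of PETs whose 5' tags
    and 3' tags both fall within `window` bp of the cluster's current ranges.
    """
    # Each entry: [five_min, five_max, three_min, three_max, members]
    clusters = [[f, f, t, t, [(f, t)]] for f, t in pets]
    merged = True
    while merged:                      # repeat until no further merge is possible
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if (_near(a[0], a[1], b[0], b[1], window)
                        and _near(a[2], a[3], b[2], b[3], window)):
                    # Readjust the search windows to cover both clusters
                    clusters[i] = [min(a[0], b[0]), max(a[1], b[1]),
                                   min(a[2], b[2]), max(a[3], b[3]),
                                   a[4] + b[4]]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    grouped = [c[4] for c in clusters if len(c[4]) > 1]
    singletons = [c[4][0] for c in clusters if len(c[4]) == 1]
    return grouped, singletons


if __name__ == "__main__":
    pets = [(1000, 5000), (1050, 5080), (1120, 5150), (9000, 20000)]
    grouped, singletons = cluster_pets(pets)
    print("clusters:", grouped)
    print("singletons:", singletons)
```

The pairwise rescan is quadratic per merge and is only meant to make the logic visible; a real dataset of tens of millions of PETs would need sorted input or an interval index.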

5.4 Perspectives

The most distinctive feature of RNA-PET is its ability to capture the 5′ and 3′ signatures of the same transcript simultaneously, which allows the 5′ and 3′ ends to be mapped at the same time and the boundaries of transcription units to be demarcated genome-wide. This feature is extremely valuable for studying the alternative usage of 5′ transcription start sites and 3′ transcription termination sites, identifying new transcription units, and annotating new genomes. However, because it depends on capturing full-length transcripts, the RNA-PET approach is conceptually less quantitative for long transcripts (above 5 kb) than RNA-seq, which sequences randomly sheared cDNA fragments. RNA-PET and RNA-seq are therefore complementary approaches for comprehensive transcriptome analysis: together they demarcate the boundaries and define the contents of every transcription unit, both structurally and quantitatively.

Acknowledgments

The authors would like to thank Yufen Goh, Andrea Ho, Kelly Quek, Atif Shahab, Wai Loon Ong, Wan Ting Poh, Lavanya Veeravalli, Herve Thoreau, Chin Thing Ong, Adeline Chew, Poh Tong Shing Thompson, Lim Kian Chew, Dawn Sum, See Ting Leong, and all other members of the Genome Technology and Biology group for technical and sequencing support. This work was funded by the Genome Institute of Singapore, A*STAR, Singapore, and also in part by AB Life Technology Inc. Y.R. is also supported by NIH ENCODE grants (R01 HG004456-01, R01 HG003521-01, and part of 1U54HG004557-01). All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

References

1 Peters, B.A. and Victor, E.V. (2005) Transcriptome PETs: a genome’s best friends. Nat. Methods, 2, 93–94.
2 Fullwood, M.J., Wei, C.L., Liu, E.T., and Ruan, Y. (2009) Next-generation DNA sequencing of paired end ditags for transcriptome and genome analysis (review). Genome Res., 19, 521–532.
3 Wei, C.L., Ng, P., Chiu, K.P., Wong, C.H., Ang, C.C., Lipovich, L., Liu, E.T., and Ruan, Y. (2004) 5′ Long serial analysis of gene expression (LongSAGE) and 3′ LongSAGE for transcriptome characterization and genome annotation. Proc. Natl. Acad. Sci. USA, 101, 11701–11706.
4 Ng, P., Wei, C.L., Sung, W.K., Chiu, K.P., Lipovich, L., Ang, C.C., Gupta, S., Shahab, A., Ridwan, A., Wong, C.H. et al. (2005) Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat. Methods, 2, 105–111.
5 Ng, P., Tan, J.J., Ooi, H.S., Lee, Y.L., Chiu, K.P., Fullwood, M.J., Srinivasan, K.G., Perbost, C., Du, L., Sung, W.K. et al. (2006) Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultrahigh-throughput analysis of transcriptomes and genomes. Nucleic Acids Res., 34, e84.
6 Carninci, P., Kvam, C., Kitamura, A., Ohsumi, T., Okazaki, Y., Itoh, M., Kamiya, M., Shibata, K., Sasaki, N., Izawa, M. et al. (1996) High-efficiency full-length cDNA cloning by biotinylated CAP trapper. Genomics, 37, 327–336.
7 Carninci, P., Westover, A., Nishiyama, Y., Ohsumi, T., Itoh, M., Nagaoka, S., Sasaki, N., Okazaki, Y., Muramatsu, M., Schneider, C. et al. (1997) High efficiency selection of full-length cDNA by improved biotinylated cap trapper. DNA Res., 4, 61–66.
8 Carninci, P. and Hayashizaki, Y. (1999) High-efficiency full-length cDNA cloning. Methods Enzymol., 303, 19–44.


6 Stranded RNA-Seq: Strand-Specific Shotgun Sequencing of RNA

Alistair R.R. Forrest

Abstract

Next-generation sequencers have revolutionized the way we do genomics. RNA-seq – a set of shotgun sequencing protocols developed for sequencing the transcriptome – is incredibly useful for gene finding, measuring expression (at the level of genes, transcripts, and alleles), alternative splicing studies, and noncoding RNA discovery. Here, we describe a simple strand-specific RNA-seq protocol for identifying and quantifying RNA species within a sample that is compatible with both SOLiD and Illumina Genome Analyzer second-generation DNA sequencers. Although several approaches exist for generating shotgun libraries of the transcriptome, the most widely used version to date involves cDNA fragmentation and linker ligation, which loses strand information. Maintaining strand information is critical to capture overlapping antisense transcription and to discern the strand of novel transcribed regions (particularly important for unspliced noncoding RNAs). The strand-specific protocol presented here uses RNA fragmentation to generate short RNA fragments that are then converted to cDNA using an anchored random primer and a template switch primer. We discuss the application of the technology, and see great scope for its use in both model organisms and novel species. To date, RNA-seq has mostly been applied to human and mouse systems with applications in gene expression, transcript discovery, expressed single nucleotide polymorphisms, detection of gene fusions in cancer, and allelic usage studies. For other species with less genomic or transcriptomic information, RNA-seq provides a great tool for rapid annotation of a genome, providing massive-scale expressed sequence tag coverage in a few experiments. Even without a genome sequence, RNA-seq can in principle be used for building transcript contigs akin to UniGene assembled transcripts from the 1990s. 
With each of these applications in mind we discuss aspects of experimental design, such as read length, depth of sequencing, replicates, and analysis strategies needed to achieve each goal.

6.1 Introduction

Expressed sequence tags (ESTs) are short cDNA sequences that were introduced in 1991 as a method to annotate the transcribed regions of the human genome and thereby identify genes [1]. Due to the high cost of Sanger sequencing, ESTs and later full-length cDNA projects typically used subtraction [2–4] and normalization [5] techniques to reduce the fraction of redundant highly expressed transcripts and thereby increase complexity (and the number of transcript species identified per dollar spent). Without subtraction and normalization, ESTs can be used for quantitation of gene expression [6,7]. However, this is an expensive approach, which was rapidly replaced by gene expression microarrays. The focus of EST projects moved

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.


largely to gene discovery rather than quantitation. During this period ESTs were used as physical clones for cDNA microarrays [8] and later for designing synthetic oligonucleotide arrays. With large collections of ESTs being made for multiple tissues, EST-based evidence became an important tool in gene discovery. Efforts such as UniGene [9] and the TIGR consensus sequences [10] used clusters of overlapping EST sequences to build consensus transcript models. Importantly, ESTs also provided the largest body of evidence for widespread alternative splicing of mammalian genes [11]. Although these analyses could estimate the fraction of genes with alternative isoforms, they could not determine at what level these alternative isoforms were expressed or whether they were biologically relevant. This then led to the development of exon arrays [12] and splice junction arrays [13], which are designed to measure expression levels of each exon or known splice junction. These arrays were able to confirm tissue-specific isoform expression and relative expression levels of various exon combinations. A major drawback, however, is that they cannot detect novel isoforms or novel genes as their designs are limited to what is already known.

With the arrival of next-generation sequencing (NGS), several groups simultaneously revisited EST-like libraries for gene expression and transcript discovery in a set of related protocols now referred to as RNA-seq [14–19]. RNA-seq is massively parallel sequencing of short random ESTs. Library protocols are designed to give even coverage along the full length of an RNA using either random priming or fragmentation (either at the RNA or cDNA stages). Figure 6.1 shows the schematic for the protocol described here. When mapped to the genome the RNA-seq reads show expression from exons and across splice junctions. This can be used to measure expression at the level of a gene or specific exon combinations corresponding to different isoforms. 
Importantly, RNA-seq is unbiased by gene models; therefore, it can also be used to discover novel transcribed regions, splice junctions, and posttranscriptionally modified transcripts (e.g., RNA editing).

The protocol described in this chapter was the first published strand-specific RNA-seq version and used the Life Technologies (Applied Biosystems) SOLiD system. We present the original protocol and also provide information on how to adapt it for the Illumina Genome Analyzer. Although the original publication used 25- to 32-base single-end reads, the read lengths for both platforms have increased (50 bases for SOLiD and 150 bases for Illumina), and with Illumina it is now possible to do 150-base paired-end reads. This improves both coverage and in silico assembly of transcript isoforms.

Fig. 6.1 RNA-seq schematic: SQRL (short quantitative RNA library) protocol. (a) Start with mRNA or rRNA-depleted RNA. (b) Randomly fragment the RNA using chemical or heat fragmentation. (c) Use an anchored random hexamer to prime cDNA synthesis off the random fragment and use a template-switching primer to incorporate the second primer. (d) PCR amplify the library (20 rounds or fewer) and size-select on a gel. (e) Quantify the library and send for NGS. Note: SOLiD version is shown here.


Table 6.1 Depth and length recommended for RNA-seq applications.

Application | Number of reads (million) | Length | Notes
Expression profiling | 5–30 | 1 × 36 | number of genes measured is a function of depth; 36 bases is long enough for expression profiling provided there is either a genome or full-length transcriptome to map against (e.g., human, soybean [20,21])
Transcript discovery | 150 | 2 × 75 | as above, this is a function of depth (e.g., mouse [22])
Allelic usage | 20–200 | 2 × 45 | requires heterozygous SNP information from whole-genome sequencing or SNP arrays, deeper is better (e.g., [23])
Genome annotation | 30–180 | 1 × >30 / 2 × >50 | function of genome size and sequencing depth (e.g., Trypanosoma, Candida [24,25])
Transcriptome discovery and expression (without a genome) | 35 | 2 × 75 | longer reads help with transcript assembly (e.g., Lateolabrax japonicus [26])

6.1.1 Before Starting

Before starting it is worth considering what you are hoping to achieve using RNA-seq data. Length of read, depth of sequencing, RNA fraction targeted, and supporting genomic and transcriptomic data will all affect the design of the study. Although RNA-seq produces superior gene expression measurements compared to microarrays, it is still a relatively expensive alternative. Most users of RNA-seq are therefore interested in transcript discovery (both novel genes and splice variants) and measurement of their expression. For an unannotated genome this is a particularly useful approach as commercial microarrays are often not available. This type of study can be achieved using 30 million reads. For users interested in assembling transcript structures, deeper sequencing with longer or paired-end reads is typically required. If the organism has not been sequenced previously, longer and deeper sequencing is likely to assist in transcript assembly (akin to UniGene and TIGR tentative consensus sequences). For allelic usage experiments, the ability to call differential allele usage depends both on the ability to call a heterozygous single nucleotide polymorphism (SNP) from the RNA-seq data and on having a sufficient number of independent reads across the region to give confidence in the expression of each allele. This approach, however, will only give information in cases where both alleles are detected to some level in the RNA-seq data. To identify cases where only one allele is expressed, matching genome sequence or SNP genotyping is required (in the case of the genome sequence, only high-confidence SNPs should be used). Based on published analyses, estimated depths and lengths required for each application are presented in Table 6.1.
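The dependence of allelic usage calls on read depth, described above, can be made concrete with a small sketch. It is an illustration only, not the method used in the cited studies: it applies an exact two-sided binomial test against an expected 1:1 allele ratio at a heterozygous SNP, and the function names, the minimum depth of 20 reads, and the 0.05 threshold are all assumptions chosen for the example.

```python
from math import comb


def binom_two_sided_p(ref_reads, alt_reads):
    """Exact two-sided binomial p-value for a 1:1 expected allele ratio."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    observed = comb(n, ref_reads)
    # With p = 0.5 every outcome k has probability comb(n, k) / 2**n, so
    # "as or more extreme" means "binomial coefficient no larger than observed".
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n


def call_imbalance(ref_reads, alt_reads, min_reads=20, alpha=0.05):
    """Return True/False for allelic imbalance, or None if depth is too low.

    min_reads and alpha are illustrative thresholds, not published values.
    """
    if ref_reads + alt_reads < min_reads:
        return None  # too few independent reads to trust either allele
    return binom_two_sided_p(ref_reads, alt_reads) < alpha


print(call_imbalance(3, 2))    # too shallow to call
print(call_imbalance(18, 17))  # balanced
print(call_imbalance(30, 5))   # skewed towards the reference allele
```

As the text notes, a site where only one allele is observed cannot be distinguished from a homozygous site using RNA-seq reads alone, so a test like this is only meaningful for SNPs already known to be heterozygous.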

6.2 Methods and Protocols

6.2.1 Preface

The following protocol is reproduced from the supplementary methods section of an original article in Nature Methods with permission from the publisher [14]. At the time of preparing this chapter a comprehensive side-by-side comparison of stranded RNA-seq protocols was released [27]. In that review, the authors conclude that an alternative protocol developed by Parkhomchuk et al. [28], which uses dUTP for second-strand synthesis and UNG to selectively destroy the second strand after fragmentation and linker ligation, provided more even coverage and better strand specificity than we achieved with the protocol described below. I have not tested this protocol myself, but it looks promising and I direct the reader to the review for further information.


The protocol here describes generation of a stranded RNA-seq library. It is aimed both at users running SOLiD or Illumina platforms and at smaller labs that would submit the libraries to a core facility with access to these machines. It is assumed that the user has some basic skills in molecular biology, such as agarose gel electrophoresis and imaging, and use of a centrifuge and pipettes. Gloves should be used at all times to prevent contamination of the RNA sample with RNases and to protect the user from exposure to toxic chemicals (e.g., guanidinium isothiocyanate and phenol used in RNA extraction). General safety gear such as a lab coat and safety glasses should be worn at all times. RNA samples should be stored at −80 °C when not in use and, unless otherwise mentioned, should be kept on wet ice or a precooled freezer block when preparing libraries. Avoid multiple freeze–thaw cycles; if necessary, consider aliquoting the RNA into “RNase-free” single-use tubes. Polymerase chain reaction (PCR) preparation and library purification areas should be kept clean and separate, and filter tips should be used at all times to avoid cross-contamination. RNaseZap (Ambion; cat. no. AM9780-AM9784) and RNase AWAY (Invitrogen; cat. no. 10328011) products can be used to maintain an RNase-free working area.

6.2.2 Materials and Consumables

Key Reagents and Kits

- Oligotex mRNA mini kit (Qiagen; cat. no. 70042)
- RiboMinus ribosomal depletion kit (Invitrogen; cat. no. K1550-02)
- RNA 6000 Pico kit (Agilent; cat. no. 5067-1513)
- Ambion RNA fragmentation buffer (Ambion; cat. no. 8740)
- YM-30 columns (Millipore; cat. no. 4307)
- SuperScript II (Invitrogen; cat. no. 18064-014)
- QIAquick gel extraction kit (Qiagen; cat. no. 28704)
- TA cloning vector (TOPO-TA PCR-TOPO)
- Advantage polymerase (Clontech)
- Agarose (various suppliers)
- TAE (various suppliers)
- Low-molecular-weight DNA ladder (various suppliers)

RNase-Free Reagents

Multiple suppliers (purchase rather than making in-house):
- RNase-free 300-µl and 1.5-ml tubes
- RNase-free filter barrier tips
- RNase-free water
- Disposable gloves (latex or nitrile)

Key Equipment Needed

- Microcentrifuge
- Pipettes (10 µl, 20 µl, 200 µl, 1 ml)
- Agilent 2100 Bioanalyzer (or equivalent)
- NanoDrop spectrophotometer (or equivalent)
- Thermocycler
- SpeedVac centrifugal desiccator system (or equivalent)
- Gel electrophoresis set-up
- UV trans-illuminator (preferably with low setting) and gel documentation/camera set-up
- Access to SOLiD or Illumina GA3 NGS service provider

6.2.3 Protocol

RNA Preparation

RNA for RNA-seq can be prepared by any number of commercial kits, guanidinium reagents, and in-house protocols. For most mammalian tissue culture and tissue applications we use kits from Qiagen (RNeasy; cat. no. 74104; miRNeasy; cat. no. 217004) and Ambion (RNAqueous; cat. no. AM1912; RiboPure; cat. no. AM1924), or TRIzol reagent (Invitrogen; cat. no. 15596-026) or equivalent. For some RNA sources specialized kits and protocols may be required to lyse cells or remove contaminants. Examples include isolation from blood, bone, soil, plant tissues high in cellulose and lignin, and animal tissues high in lipid, glycogen, or heparin. Use RNase-free reagents and plasticware (typically the kits provide everything except tips), and check the quality of your RNA by both Bioanalyzer and NanoDrop spectrophotometer. Good quality RNA will typically have an RNA integrity number (RIN) above 9, and OD260/280 and OD260/230 ratios of approximately 2.

The above kits are used for purifying total RNA. The first step in any RNA-seq experiment is determining what fraction of RNA you are interested in profiling. Typically, rRNA, which makes up 90–99% of total RNA, is depleted or poly(A)+ mRNA is selected. A significant fraction of RNAs are thought to be poly(A)−; therefore, a number of alternative strategies could be used. Two rounds of ribosomal depletion using RiboMinus beads (Invitrogen; cat. no. K1550-02) can reduce the rRNA levels to less than 5%, after which the RiboMinus RNA could be profiled directly or further fractionated into poly(A)− and poly(A)+ fractions using oligo(dT) selection. Other RNA subfractions that could be generated are size-fractionated RNAs (short/long) or subcellularly fractionated RNA populations (nuclear/cytoplasmic/polysomal).

1.1 Poly(A)+ RNA

Several kits exist for mRNA purification directly from cells (e.g., PolyApurist, Ambion; cat. no. AM1922) and from total RNA (Oligotex mRNA mini kit, Qiagen; cat. no. 70042). Below is an abridged version of the Oligotex mRNA protocol from Qiagen. Alternative protocols are likely to yield similar results.

1.2 Oligotex mRNA spin-column protocol for isolation of poly(A)+ mRNA from total RNA

Abridged text adapted directly from Qiagen’s Oligotex mini protocol (http://www1.Qiagen.com/literature/handbooks/literature.aspx?id=1000156).

Important notes before starting:
- Ensure buffers are appropriately prepared.
- Heat Oligotex suspension to 37 °C. Vortex prior to use, then place at room temperature.
- Heat a water bath or heating block to 70 °C and heat Buffer OEB.
- Review the introductory material on pp. 12–19 of Qiagen’s protocol before starting.
- If working with RNA for the first time, read Appendix A (p. 76) of Qiagen’s protocol.
- Buffer OBB may form a precipitate upon storage. If necessary, redissolve at 37 °C, then place at room temperature.
- Unless otherwise indicated, all steps, including centrifugation, should be performed at room temperature.
- All centrifugation steps should be performed in a microcentrifuge at maximum speed.

1. Determine the amount of starting RNA. Do not use more than 1 mg. The initial volume of the RNA solution is not important so long as the volume can be brought up to the indicated amount with RNase-free water. Make up the total RNA to 250 µl with H2O and add 250 µl of Buffer OBB.
2. Add 15 µl of Oligotex suspension. Mix the contents thoroughly by pipetting or flicking the tube. Incubate for 3 min at 70 °C in a heating block. This step disrupts secondary structure of the RNA.


3. Remove sample from the water bath/heating block and place at 20–30 °C for 10 min. This step allows hybridization between the oligo(dT)30 of the Oligotex particle and the poly(A) tail of the mRNA.
4. Pellet the Oligotex:mRNA complex by centrifugation for 2 min at maximum speed (14 000–18 000 × g) and carefully remove the supernatant by pipetting. Loss of the Oligotex resin can be avoided if approximately 50 µl of the supernatant is left in the microcentrifuge tube. The remaining solution will not affect the procedure. (Note: Save the supernatant until certain that satisfactory binding and elution of poly(A)+ mRNA has occurred.)
5. Resuspend the Oligotex:mRNA pellet in 400 µl Buffer OW2 by vortexing or pipetting, and pipette onto a small spin column placed in a 1.5-ml microcentrifuge tube. Centrifuge for 1 min at maximum speed.
6. Transfer the spin column to a new RNase-free 1.5-ml microcentrifuge tube and apply 400 µl Buffer OW2 to the column. Centrifuge for 1 min at maximum speed and discard the flow-through.
7. Transfer spin column to a new RNase-free 1.5-ml microcentrifuge tube. Pipette 20 µl hot (70 °C) Buffer OEB onto the column, pipette up and down 3 or 4 times to resuspend the resin, and centrifuge for 1 min at maximum speed. (Note: The volume of Buffer OEB used depends on the expected or desired concentration of poly(A)+ mRNA. Ensure that Buffer OEB does not cool significantly during handling. Remember that small volumes cool down quickly. With multiple samples, it may be necessary to place the entire microcentrifuge tube (with spin column, Oligotex, and sample) into a 70 °C heating block to maintain the temperature while preparing the next samples.)
8. To ensure maximal yield, pipette another 20 µl hot (70 °C) Buffer OEB onto the column. Pipette up and down 3 or 4 times to resuspend the resin and centrifuge for 1 min at maximum speed. To keep the elution volume low, the first eluate may be used for a second elution. Reheat the eluate to 70 °C and elute in the same microcentrifuge tube. However, for maximal yield, the additional volume of Buffer OEB is recommended.
9. Check selection on Bioanalyzer Pico chip (load 0.5–1 ng per well). Quantify by NanoDrop rather than relying on Bioanalyzer estimates. Figure 6.2 shows the expected profile from Pico chip bioanalysis after each stage of RNA preparation. Expected yield of mRNA from 100 µg of total RNA is on the order of 300 ng.

1.3 rRNA-depleted RNA

rRNA can be selectively depleted from a total RNA sample using bead-affinity protein nucleic acid probes complementary to rRNA. In the original publication we used one round of poly(A) purification with Oligotex beads and one round of ribosomal depletion using RiboMinus. If starting with total RNA, I recommend using two rounds of RiboMinus. An alternative approach (not detailed here) is selective degradation of uncapped, 5′-phosphorylated molecules (such as 18S and 28S rRNAs) using 5′-phosphate-dependent exonuclease.

Fig. 6.2 mRNA purification and rRNA depletion. (a) RNA after two rounds of Oligotex poly(A) purification. (b) RNA after one round of Oligotex and one round of RiboMinus depletion.


Abridged RiboMinus protocol adapted directly from the manufacturer’s manual (http://www.invitrogen.com/content/sfs/manuals/RiboMinus_human_mouse_man.pdf).

Selective hybridization:
1. To a sterile, RNase-free 1.5-ml microcentrifuge tube, add the following: RNA (2–10 µg)

95 °C for 5 min.


Table 6.2 Primers.

SOLiD-specific primers
Forward FDV primers
sFDVhex | CTTTCCTCTCTATGGGCAGTCGGTGATNNNNNN
LAmpFDV | CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT
Reverse RDV primers
AmpRDV | AACTGCCCCGGGTTCCTCATTCTCT
RDV-GGG | AACTGCCCCGGGTTCCTCATTCTCTrGrGrG

Illumina-specific primers (new – suitable for paired-end sequencing)
Forward
Illu_F_hex | CATTGAGCTGAACCGAGTCCAGCAGNNNNNN
Illu_F_Amp | CAAGCAGAAGACGGCATACGACGATCTCGACATTGAGCTGAACCGAGTCCAGCAG
Illu_R_seq (aka SBS11 [27]) | CGATCTCGACATTGAGCTGAACCGAGTCCAGCAG
Reverse
Illu_R_GGG | TTTCCCTACACGACGCTCTTCCGATCTrGrGrG
Illu_R_Amp | AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
Illu_R_seq | ACACTCTTTCCCTACACGACGCTCTTCCGATCTGGG

All primers should be purified by high-performance liquid chromatography.

2.3 PCR amplification of cDNA library (see Table 6.2 for primers)

1. Set up the following PCR reaction:

H2O | 13.5 µl
2 mM dNTP each | 2.5 µl
10× Buffer | 2.5 µl
SOLiD: AmpRDV (10 µM); Illumina: Illu_F_Amp (10 µM) | 2.5 µl
SOLiD: LAmpFDV (2 µM); Illumina: Illu_R_Amp (2 µM) | 2.5 µl (note lower concentration used to minimize primer dimer)
First-strand cDNA | 1.0 µl
Advantage DNA polymerase | 0.5 µl
TOTAL | 25.0 µl

2. Thermocycle as follows:
- 1 × {94 °C → 5 min} (activate the polymerase)
- 20 × {94 °C → 15 s, 68 °C → 15 s}

2.4 Library purification and size selection via gel electrophoresis

1. Load the entire library onto a 1× TAE 3% agarose gel and run at 100 V for 55 min with a low-molecular-weight ladder.
2. Cut out the band with a scalpel. Use a new scalpel every time (Figure 6.4).
3. Purify DNA via QIAquick agarose extraction.
4. Optional: Reload the purified material onto another 3% agarose gel and repeat purification (we have found this necessary to remove smaller amplicons).
5. Quantify final amplified pool concentration by NanoDrop.

Figure 6.4 shows a good library with a library size of 100–300 bp for short reads of 35–50 bases (current with SOLiD); however, on the Illumina GAIIx and HiSeq 2000 platforms it is now possible to do paired 150-bp reads. For this a good library should have a smear ranging from 300 to 700 bp with the majority of the product in the 400- to 600-bp range.
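As a bookkeeping aid for the PCR set-up in step 2.3 above, the following sketch sums the component volumes, derives final concentrations, and scales the recipe to a master mix. It assumes the volumes are in microliters and the primer stocks in micromolar; the component labels, the function names, and the 10% pipetting excess are illustrative choices, not part of the published protocol.

```python
# Per-reaction volumes in microliters, as listed in step 2.3.
REACTION_UL = {
    "H2O": 13.5,
    "dNTPs (2 mM each)": 2.5,
    "10x buffer": 2.5,
    "amplification primer 1 (10 uM)": 2.5,
    "amplification primer 2 (2 uM)": 2.5,
    "first-strand cDNA": 1.0,
    "Advantage DNA polymerase": 0.5,
}


def total_volume(mix):
    """Sum of all component volumes (ul)."""
    return sum(mix.values())


def final_concentration(stock, volume_ul, total_ul):
    """Dilution formula C1 * V1 = C2 * V2, solved for C2 (same unit as stock)."""
    return stock * volume_ul / total_ul


def master_mix(mix, n_reactions, excess=0.10, per_sample=("first-strand cDNA",)):
    """Scale shared components for n reactions plus a pipetting excess.

    Components named in per_sample are added to each tube individually.
    """
    factor = n_reactions * (1 + excess)
    return {name: round(vol * factor, 2)
            for name, vol in mix.items() if name not in per_sample}


total = total_volume(REACTION_UL)
print("total volume (ul):", total)
print("dNTP, mM each:", final_concentration(2.0, 2.5, total))
print("primer 1, uM:", final_concentration(10.0, 2.5, total))
print("primer 2, uM:", final_concentration(2.0, 2.5, total))
print(master_mix(REACTION_UL, n_reactions=8))
```

Under these assumptions the buffer ends up at 1× in the final reaction, which is a quick sanity check that the listed volumes are internally consistent.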


Fig. 6.4 Library purification. Optimal library size for 50-base single-end sequencing. For longer reads and paired ends on Illumina sequencers, less-fragmented RNA should be used and a higher-molecular-weight region cut from the gel.

A failed or poor synthesis is generally characterized by a strong band just below 100 bp and, at most, a very faint smear.

Library Quality Control by Sequencing

In the original protocol we used Sanger sequencing of 96 independent clones to check library complexity, rRNA rate, insert length, and mapping position on RefSeq transcripts. It is included here for reference; however, with the cost of NGS rapidly decreasing it may be more cost- and time-effective to go directly to NGS. In the case of SOLiD, sectoring of a slide means multiple libraries could be checked simultaneously prior to committing to full-scale sequencing.

Note: We quality control the libraries prior to SOLiD sequencing by both capillary sequencing of cloned amplimers and accurate quantification. To sequence, we take 1 and 0.1 µl of eluted library and clone into a TA cloning vector (TOPO-TA PCR-TOPO). Pick 96 colonies and prepare plasmid DNA via 96-well miniprep. Samples are sequenced using standard AB sequencing via a core facility. Sequences are reviewed on the following criteria:
- Insert size.
- Length of mappable sequence (BLAT UCSC and BLAST NCBI).
- Number of nontemplated G residues at the 5′ end. As template switching is used we expect to see three or more nontemplated G residues.

Critical statistics:
1. Median amplicon size. For SOLiD our preferred median insert size is 70 bp, to ensure most 50-bp reads are insert rather than linker. To capitalize on Illumina’s paired-end 150-bp reads the preferred insert size is 300–600 bases.
2. Range of amplicon size. In some cases we have had a good median size but too broad a size range. Ideally you want to ensure that all sequences are larger than the minimum read length.
3. Percent rRNA. rRNAs are a common contaminant if depletion was poor. For a library made from poly(A)+, RiboMinus-depleted total RNA you hope to see less than 5% of sequences mapping to rRNA.
4. Position relative to full-length sequences. Once tags are mapped to their respective genes, we check to see if the tags are spread across the entire length of the transcript, or biased towards the 3′ or 5′ end. In principle, if you take all RefSeq transcripts and break their lengths into deciles, approximately 10% of RNA-seq mappings should map to each decile. If you observe greater than 15% or less than 5% in any of the deciles, this may indicate poor fragmentation or degraded starting material.

NGS

Typical yield at this stage is 50–100 ng of library for a 150-base median amplicon length. At this point the library is ready to send for NGS. For SOLiD, typically 500 pg of library is used in the emulsion PCR. For Illumina, 10 µl of library at 1–6 pM is loaded per channel (approximately 20 pg).
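The relationship between molar loading concentration and mass quoted above can be checked with a short calculation. This is a rough sketch under stated assumptions: the loading volume is read as 10 microliters, an average mass of about 650 g/mol per base pair of double-stranded DNA is used (a common approximation, not a value from the chapter), and the 500-bp mean fragment length is chosen only for illustration. The function name is hypothetical.

```python
AVG_BP_MASS_G_PER_MOL = 650.0  # approximate mass of one dsDNA base pair (assumption)


def library_mass_pg(conc_pM, volume_ul, mean_length_bp):
    """Mass in picograms of a dsDNA library at a given molar concentration."""
    moles = (conc_pM * 1e-12) * (volume_ul * 1e-6)  # mol/l times litres
    grams = moles * mean_length_bp * AVG_BP_MASS_G_PER_MOL
    return grams * 1e12  # grams to picograms


# 10 ul at the upper end of the quoted range, for a 500-bp library
print(round(library_mass_pg(6, 10, 500), 1), "pg")
```

With these inputs the result is on the order of the "approximately 20 pg" given in the text; shorter fragments or lower loading concentrations give proportionally less.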

Note: Given the typical yield of library and the actual amount loaded onto the flow cell (1/1000–1/2000) or beads (1/100–1/200) it is likely that considerably fewer PCR cycles could be used. Twenty cycles were used in the original protocol, but fewer cycles will reduce PCR-introduced biases in coverage.
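Returning to the positional check listed among the critical statistics above, the decile test is simple to automate once reads have been mapped to transcripts. The sketch below assumes each mapping is available as a (position, transcript length) pair with the position counted from the 5′ end; the function names are hypothetical, and the 5% and 15% cut-offs are the ones quoted in the text.

```python
def decile_fractions(hits):
    """Fraction of mappings in each transcript-length decile (5' to 3').

    hits: iterable of (position, transcript_length), position 0-based.
    """
    counts = [0] * 10
    for position, length in hits:
        decile = min(int(10 * position / length), 9)  # clamp the last base
        counts[decile] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no mappings supplied")
    return [c / total for c in counts]


def flag_biased_deciles(fractions, low=0.05, high=0.15):
    """Return 1-based decile numbers falling outside the expected band."""
    return [i + 1 for i, f in enumerate(fractions) if f < low or f > high]


# Hypothetical mappings on a 1-kb transcript, for illustration only.
even = [(50 + 100 * i, 1000) for i in range(10)]
three_prime_biased = [(50, 1000), (150, 1000)] + [(950, 1000)] * 8

print(flag_biased_deciles(decile_fractions(even)))
print(flag_biased_deciles(decile_fractions(three_prime_biased)))
```

An empty list means coverage looks even; a run of flagged deciles at one end of the transcript points to the 5′ or 3′ bias that the text associates with poor fragmentation or degraded starting material.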

6.3 Bioinformatic Considerations

As mentioned at the beginning of this chapter, the analyses possible with RNA-seq data depend on aspects such as read length, single- or paired-end reads, depth of sequencing, number of replicates, and the level of supporting information, such as a genome sequence and transcript annotations available for the species of interest. Generating a set of successful RNA-seq libraries is only a fraction of the work. This chapter is mainly directed at generating the dataset; however, without bioinformatics support such a dataset can sit for months with little being achieved. Below, I detail some of the broad areas of analysis; however, this is far from complete and is meant to introduce the reader to additional reading. Ideally a bioinformatician familiar with Linux/UNIX, R, and handling of large datasets should be recruited into any RNA-seq project. RNA-seq bioinformatics can broadly be broken into three classes of analysis: mapping, assembly and differential expression. The majority of work to date has been carried out on species for which a reference genome sequence and some level of cDNA annotation are available. Unspliced reads can easily be mapped directly to the genome; for spliced reads it largely depends on the length of sequence on either side of the splice junction. Short reads (25–35 bases) were used in the original versions of RNA-seq, due to limitations of the sequencing platforms. With such a short length they are not suitable for efficient mapping of splice junctions. In addition, the median length of internal exons for mouse and human is approximately 124 bases [33], meaning that only a small fraction of reads would actually fall across a splice junction, and even for the best case scenario, reads with 12–17 bases on either side of the junction, gapped alignment was impossible. 
In these cases reads that failed to map to the genome were mapped to either a set of known splice junctions or an expanded set that included all possible splice donor–acceptor pairs known for each gene. An improvement came with the combination of Bowtie [34] (an aligner) and TopHat [35] (a program that discovers splice junctions without using known transcripts). Bowtie first aligns the unspliced reads to the genome, these are used to identify exonic signal and potential exons which could be joined, and the spliced reads that failed to map


directly to the genome are then aligned against potential splice junctions between these exons using TopHat. TopHat provides efficient gapped alignment across splice junctions; however, ultimately, a user of RNA-seq would like to know the full set of transcripts within a sample, their exonic structure, and the expression level of each isoform from any given locus.

The assembly of alternative splice forms using RNA-seq data has recently been made possible using several free software solutions (Cufflinks [36], Scripture [22], Trans-ABySS [37]). Without using known transcript information these programs can efficiently recover known gene and transcript structures, and provide expression measurements on each. The above methods require a reference genome to which sequences are aligned. The ABySS program originally introduced for genome assemblies can also be used for transcript assemblies and is useful when the genome sequence is not available [38]. Transcript assemblies can then be annotated using a combination of open reading frame and domain predictions, and alignments against known genes in other species [39–41]. Even without gene annotation, these assemblies can be used as surrogate “gene units” for gene expression analyses, with those showing differential expression then chosen for further characterization.

Finally, differential gene expression analysis can be carried out on RNA-seq data. The first step is to assign the RNA-seq signal onto some representation of a gene or transcript (generally in the order of 30 000–200 000 objects). This could be, for instance, (i) all RNA-seq signal mapping within the boundaries of a known gene, (ii) only exonic signal, (iii) transcript-level expression predicted by Cufflinks, Scripture, or Trans-ABySS, (iv) transcript assemblies from ABySS, or (v) specific exons or splice junctions identified from annotations and TopHat. This then generates a table of expression values for a set of transcript units that can be fed into statistical packages such as DESeq [42], edgeR [43], and Myrna [44], designed for detecting differential expression from sequence tag counts. Note that it is particularly important to consider experimental design here; triplicates should be used. The author would particularly recommend the use of biological triplicates, either from three individuals or three independent cultures. Technical reproducibility between libraries is typically excellent for RNA-seq, so the main point of replicates is to see whether the observations generalize to another individual or culture plate (i.e., are the differences found specific to that RNA preparation or will you find them if you look at additional samples?).

It is worth mentioning the issue of transcript length and mappability in RNA-seq. In principle, longer transcripts will generate more reads than shorter ones. A 10-kb transcript can be split into 20 500-base fragments, whereas a 1-kb transcript can only be split into two. If you are interested in comparing absolute counts of molecules from different loci with different lengths then you need to introduce length normalization. Moreover, this needs to take into account the amount of unique sequence generated by each locus. If the transcript is composed of duplicated sequence then RNA-seq signal could be generated from either locus. To address this, RPKM (reads per kilobase of exon model per million mapped reads) was introduced by Mortazavi et al. [16] – this considers the length of mappable sequence in generating expression counts for each gene model. Even so, for some loci, RNA-seq and RPKM normalization will not help. As an example, the SMN1 and SMN2 loci are completely duplicated with the exception of a SNP in the 3′-untranslated region (UTR), so RNA-seq signal multimaps. 
To measure expression of these two loci, isoform-specific quantitative reverse transcription-PCR is a better option.
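
The length effect and the RPKM normalization described above can be illustrated with a short calculation. The following Python sketch is only an illustration under stated assumptions: the function names and the read counts are hypothetical, and the mappable length is simply passed in rather than derived from a genome, as a real pipeline would have to do.

```python
def fragments_per_transcript(transcript_length, fragment_length=500):
    """Number of non-overlapping fragments a transcript can be split into."""
    return transcript_length // fragment_length


def rpkm(read_count, mappable_length_bases, total_mapped_reads):
    """Reads per kilobase of (mappable) exon model per million mapped reads."""
    if mappable_length_bases <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = mappable_length_bases / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Hypothetical example: two transcripts present at the same molar abundance.
# The 10-kb transcript yields ten times more reads than the 1-kb transcript.
library_size = 20_000_000          # assumed number of mapped reads
long_reads, long_length = 2_000, 10_000
short_reads, short_length = 200, 1_000

print(fragments_per_transcript(long_length))    # fragments from the 10-kb transcript
print(fragments_per_transcript(short_length))   # fragments from the 1-kb transcript
print(rpkm(long_reads, long_length, library_size))
print(rpkm(short_reads, short_length, library_size))
```

With these assumed numbers the raw counts differ tenfold, while both RPKM values are identical, which is the behavior length normalization is meant to produce.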

6.4 Applications

At the time of writing this chapter, more than 200 articles with the keyword “RNA-seq” in their abstract had been published since 2008. RNA-seq has been applied to more than 40 species, including animals, plants, fungi, protists, archaea, and bacteria (Table 6.3). The majority of studies in the first year focused on proof of concept and on addressing transcriptional complexity in human, mouse, and yeast [14,16,17,19]. RNA-seq has proven to robustly identify alternative splicing, 5′-UTRs, and 3′-UTRs, and has provided an increased estimate of the number of loci with alternative transcript isoforms. The upper estimate on isoforms will only increase as more cell types are profiled over the coming years in large-scale projects such as ENCODE [45].

Table 6.3 Species to which RNA-seq has been applied.

Mammals: Homo sapiens [60], Mus musculus [14,16], Pan troglodytes, Macaca mulatta [61], Cricetulus griseus [62]
Vertebrates: Xenopus tropicalis [63], Salvelinus namaycush [47], Lateolabrax japonicus [26], Danio rerio [64], Python molurus [65]
Insects: Drosophila melanogaster [66], Anopheles albimanus, Anopheles arabiensis, Anopheles dirus, Anopheles farauti, Anopheles freeborni, Anopheles gambiae, Anopheles quadriannulatus, Anopheles quadrimaculatus, Anopheles stephensi, Aedes (Stegomyia) aegypti [67]
Plants: Vitis vinifera [46], Oryza sativa [68], Arabidopsis thaliana [69], Solanum lycopersicum [70], Glycine max [20]
Human pathogens: Chlamydia trachomatis [71], Listeria monocytogenes [72], Plasmodium falciparum [73], Helicobacter pylori [74], Trypanosoma brucei [75], Candida albicans [24], Burkholderia cenocepacia [76], Salmonella typhi [77], Bacillus anthracis [78]
Other: Caenorhabditis sp. 3 PS1010 [79], Saccharomyces cerevisiae [17,19], Aspergillus oryzae [80], Laccaria bicolor [81], Blautia hydrogenotrophica, Marvinbryantia formatexigens [82], Sulfolobus solfataricus P2 [83]

For organisms with sparse or no transcriptome annotation, RNA-seq is the method of choice [46]. Even without a reference genome, transcript assembly, annotation, and differential expression analysis are possible [47]. Libraries are quick to prepare, quantitative, and can be used to rapidly annotate gene structures on a recently completed genome. This has been particularly useful in human pathogens such as the agents of malaria and anthrax (see Table 6.3), where it has helped with genomic and transcriptomic annotation and identified transcripts implicated in pathogenesis. In addition, host–pathogen dynamics can be studied by using RNA-seq on infected cells and mapping to both the host and pathogen genomes; examples include Epstein–Barr virus [48]- and hepatitis C [49]-infected human cells, and Mimivirus-infected Acanthamoeba [50].

Even in well-characterized transcriptomes such as those of human and mouse, RNA-seq is proving useful for discovering new genes and new transcripts, in particular noncoding RNAs [22]. Additional applications of the technology are the detection of gene fusions expressed in cancer [51,52] and the identification of imprinted loci by examining allelic usage from reads mapping across heterozygous SNPs [23,53–55].

Finally, populations of microorganisms (e.g., gut flora, seawater, soil, etc.) can be studied by RNA-seq. Metatranscriptomic analysis can be used both to monitor transcript prevalence, and hence population dynamics, and to find transcripts encoding novel enzymatic activities [56–59]. It is worth noting that for this application longer read lengths are preferred and the 454 Life Sciences platform (Roche) is generally used (although paired-end 150-base reads from Illumina would be useful here).
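
The idea of examining allelic usage at heterozygous SNPs, mentioned above for imprinted loci, can be made concrete with a small calculation. The sketch below is an assumption-laden illustration, not the method of the cited studies: it applies an exact two-sided binomial test against equal expression of both alleles, the function name and read counts are hypothetical, and it ignores mapping bias toward the reference allele, which real analyses have to address.

```python
from math import comb


def allelic_imbalance_p(ref_reads, alt_reads, expected_ref_fraction=0.5):
    """Exact two-sided binomial p-value for the observed allele counts."""
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover this SNP")
    p = expected_ref_fraction

    def prob(k):
        # Probability of seeing k reference-allele reads out of n.
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    observed = prob(ref_reads)
    # Sum the probabilities of all outcomes at least as unlikely as the
    # observed one (small tolerance guards against rounding).
    total = sum(prob(k) for k in range(n + 1) if prob(k) <= observed * (1 + 1e-9))
    return min(total, 1.0)


# Hypothetical SNPs: balanced expression versus strictly monoallelic expression.
print(allelic_imbalance_p(10, 10))   # balanced: no evidence of imbalance
print(allelic_imbalance_p(20, 0))    # all reads from one allele
```

A small p-value only indicates allelic imbalance at that SNP; assigning it to imprinting additionally requires knowing the parental origin of each allele.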

6.5 Perspectives

Over the past few years since the first RNA-seq publications, progress has been made in read length, sequencing depth, coverage, and bioinformatic support. For both Illumina and SOLiD, read lengths and density have increased; however, Illumina currently leads in read length and paired-end sequencing (150 bases from each end). Cost per base is decreasing, and new third-generation platforms such as Ion Torrent [84], Pacific Biosciences [85], Oxford Nanopore [86], and Helicos BioSciences [87] are appearing on the market. With these new platforms, the main areas of development to watch are single-molecule sequencing, direct RNA sequencing, full-length transcript sequencing, and single-cell analyses.


The original protocols on the second-generation sequencers involved PCR amplification of cDNA products prior to loading onto a flow cell or beads. This potentially introduces biases in libraries, where some molecules are selectively amplified over others. One approach recently reported is to sequence the cDNA without any PCR preamplification. Clonal amplification still occurs on SOLiD beads or in Illumina clusters, but this avoids the biases that arise when a pool of cDNAs is competitively preamplified together. To some extent this means working blind, as unamplified libraries cannot be visualized on a gel or Bioanalyzer profile. True single-molecule sequencers can completely avoid this as each molecule is read separately. Molecules can be sequenced either by monitoring extension of a DNA strand by a DNA polymerase (as occurs with the Helicos HeliScope and Pacific Biosciences systems) or by monitoring sequential release of nucleotides from a single-stranded oligonucleotide as it is degraded by a nuclease (as occurs with Oxford Nanopore). All three of these systems have the potential for direct sequencing of first-strand cDNAs. However, Helicos is the most mature of the third-generation systems to date, with several machines sold worldwide and multiple publications [87–91]. Proof-of-principle publications have appeared for Pacific Biosciences [85,92,93] and Oxford Nanopore [86]; however, neither platform has as yet made it to market. Finally, direct RNA sequencing without the need for cDNA synthesis has been demonstrated on the Helicos system using a proprietary RNA-dependent DNA polymerase [90,91]. In principle, exonuclease-based Nanopore systems are also capable of direct RNA sequencing and should be able to detect modified bases directly (e.g., methyl-cytosine) [86,94].

The analyses to date have generally involved total RNA prepared from a population of cells or tissue. Reducing sample requirements can make experiments on rare cell populations possible (such efforts have been made for microarrays over recent years) [91]. Reducing the requirements further will make it possible to examine the expression profile (and the set of transcripts) present in individual cells. To date, three publications on single-cell RNA-seq have appeared: two are on oocytes and blastocysts (large cells with correspondingly large amounts of RNA), and one is on a more challenging target, individual stimulated neurons [95–97]. These groundbreaking experiments have examined relatively small sets of individual cells, but in the near future experiments that profile expression in 1000 or more individual cells are likely. This will be very useful both for understanding population dynamics and for building gene-regulatory models. With such profiles it will be possible to discriminate analog and digital expression regulation (i.e., does the entire population express 50 copies of a transcript or does 1% of the population express 5000 copies and the rest none?).

Finally, the last area of development is the generation of full-length RNA-seq. Despite the achievements of bioinformatics approaches in recovering transcript structures and relative abundance, there is still a fundamental limit on the deconvolution of transcript structures from short-read RNA-seq data due to the length of reads and library molecules. If two alternative splicing events (A, A′ and B, B′) occur at different ends of a locus, and specifically at a distance greater than the length of the library molecules, the observed signal could be generated from any potential mixture of AB, A′B, AB′, and A′B′. Without reads that simultaneously sample both the A region and the B region, it is not possible to deconvolute the relative abundance of each component. Using barcoding and smart pooling of fragmented full-length cDNAs, it is possible to generate full-length cDNA assemblies using short reads [98]. However, this is still expensive and labor-intensive compared to RNA-seq. In the near future it will be possible to do direct full-length sequencing using third-generation sequencers. Both Oxford Nanopore and Pacific Biosciences offer the promise of high-throughput very long sequences (at least several kilobases). Coupled with the possibility of direct RNA sequencing using exonuclease or RNA-dependent polymerase, full-length cDNA sequencing may become a thing of the past, replaced by direct full-length RNA sequencing. The next few years will be a very exciting time for transcriptome discovery and systems biology.
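
The deconvolution limit described above can be shown with a toy example. The Python sketch below is a hypothetical illustration (the isoform fractions and function name are invented): it builds two different isoform mixtures and shows that short reads, which sample the A region and the B region independently, would observe the same usage at both sites for each mixture.

```python
def site_usage(mixture):
    """Return the fraction of molecules using form A and form B.

    `mixture` maps (a_form, b_form) tuples to isoform fractions summing to 1.
    Short reads only reveal these per-site fractions, not their pairing.
    """
    usage_a = sum(frac for (a_form, _), frac in mixture.items() if a_form == "A")
    usage_b = sum(frac for (_, b_form), frac in mixture.items() if b_form == "B")
    return round(usage_a, 9), round(usage_b, 9)


# Mixture 1: the two splicing events are independent of each other.
independent = {("A", "B"): 0.25, ("A'", "B"): 0.25,
               ("A", "B'"): 0.25, ("A'", "B'"): 0.25}

# Mixture 2: the events are fully coupled (only AB and A'B' exist).
coupled = {("A", "B"): 0.5, ("A'", "B"): 0.0,
           ("A", "B'"): 0.0, ("A'", "B'"): 0.5}

print(site_usage(independent))
print(site_usage(coupled))
```

Both mixtures give 50% usage of A and 50% usage of B, so only reads or assemblies spanning both regions could tell them apart.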


Acknowledgments

I would like to acknowledge Brooke Gardiner, Gabriel Kolle, Sean Grimmond, and Melissa Brown for their help in developing the original protocol. I am employed on a Research Grant for RIKEN Omics Science Center from MEXT to Yoshihide Hayashizaki and a grant of the Innovative Cell Biology by Innovative Technology (Cell Innovation Program) from MEXT to Yoshihide Hayashizaki. All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

References

1 Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M. et al. (1991) Science, 252, 1651–1656.
2 Kavathas, P., Sukhatme, V.P., Herzenberg, L.A., and Parnes, J.R. (1984) Proc. Natl. Acad. Sci. USA, 81, 7688–7692.
3 Kurnit, D.M. (1979) Proc. Natl. Acad. Sci. USA, 76, 2372–2375.
4 Vitek, M.P., Kreissman, S.G., and Gross, R.H. (1981) Nucleic Acids Res., 9, 1191–1202.
5 Soares, M.B., Bonaldo, M.F., Jelene, P., Su, L. et al. (1994) Proc. Natl. Acad. Sci. USA, 91, 9228–9232.
6 Hawkins, V., Doll, D., Bumgarner, R., Smith, T. et al. (1999) Nucleic Acids Res., 27, 204–208.
7 Hishiki, T., Kawamoto, S., Morishita, S., and Okubo, K. (2000) Nucleic Acids Res., 28, 136–138.
8 Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995) Science, 270, 467–470.
9 Miller, G.S. and Fuchs, R. (1997) Comput. Appl. Biosci., 13, 81–87.
10 Quackenbush, J., Liang, F., Holt, I., Pertea, G., and Upton, J. (2000) Nucleic Acids Res., 28, 141–145.
11 Modrek, B., Resch, A., Grasso, C., and Lee, C. (2001) Nucleic Acids Res., 29, 2850–2859.
12 Shoemaker, D.D., Schadt, E.E., Armour, C.D., He, Y.D. et al. (2001) Nature, 409, 922–927.
13 Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z. et al. (2003) Science, 302, 2141–2144.
14 Cloonan, N., Forrest, A.R., Kolle, G., Gardiner, B.B. et al. (2008) Nat. Methods, 5, 613–619.
15 Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. (2008) Genome Res., 18, 1509–1517.
16 Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008) Nat. Methods, 5, 621–628.
17 Nagalakshmi, U., Wang, Z., Waern, K., Shou, C. et al. (2008) Science, 320, 1344–1349.
18 Wang, Z., Gerstein, M., and Snyder, M. (2009) Nat. Rev. Genet., 10, 57–63.
19 Wilhelm, B.T., Marguerat, S., Watt, S., Schubert, F. et al. (2008) Nature, 453, 1239–1243.
20 Severin, A.J., Woody, J.L., Bolon, Y.T., Joseph, B. et al. (2010) BMC Plant Biol., 10, 160.
21 Bradford, J.R., Hey, Y., Yates, T., Li, Y. et al. (2010) BMC Genomics, 11, 282.
22 Guttman, M., Garber, M., Levin, J.Z., Donaghey, J. et al. (2010) Nat. Biotechnol., 28, 503–510.
23 Heap, G.A., Yang, J.H., Downes, K., Healy, B.C. et al. (2010) Hum. Mol. Genet., 19, 122–134.
24 Bruno, V.M., Wang, Z., Marjani, S.L., Euskirchen, G.M. et al. (2010) Genome Res., 20, 1451–1458.
25 Kolev, N.G., Franklin, J.B., Carmi, S., Shi, H. et al. (2010) PLoS Pathog., 6, e1001090.
26 Xiang, L.X., He, D., Dong, W.R., Zhang, Y.W., and Shao, J.Z. (2010) BMC Genomics, 11, 472.
27 Levin, J.Z., Yassour, M., Adiconis, X., Nusbaum, C. et al. (2010) Nat. Methods, 7, 709–715.
28 Parkhomchuk, D., Borodina, T., Amstislavskiy, V., Banaru, M. et al. (2009) Nucleic Acids Res., 37, e123.
29 Huber, H.E., McCoy, J.M., Seehra, J.S., and Richardson, C.C. (1989) J. Biol. Chem., 264, 4669–4678.
30 Luo, G.X. and Taylor, J. (1990) J. Virol., 64, 4321–4328.
31 Matz, M., Shagin, D., Bogdanova, E., Britanova, O. et al. (1999) Nucleic Acids Res., 27, 1558–1560.
32 Zhu, Y.Y., Machleder, E.M., Chenchik, A., Li, R., and Siebert, P.D. (2001) Biotechniques, 30, 892–897.
33 Forrest, A.R. and Carninci, P. (2009) RNA Biol., 6, 107–112.
34 Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. (2009) Genome Biol., 10, R25.
35 Trapnell, C., Pachter, L., and Salzberg, S.L. (2009) Bioinformatics, 25, 1105–1111.
36 Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A. et al. (2010) Nat. Biotechnol., 28, 511–515.
37 Robertson, G., Schein, J., Chiu, R., Corbett, R. et al. (2010) Nat. Methods, 7, 909–912.
38 Birol, I., Jackman, S.D., Nielsen, C.B., Qian, J.Q. et al. (2009) Bioinformatics, 25, 2872–2877.
39 Babik, W., Stuglik, M., Qi, W., Kuenzli, M. et al. (2010) BMC Genomics, 11, 390.
40 Deakin, J.E., Waters, P.D., Marshall Graves, J.A., Papenfuss, A.T. et al. (2010) Marsupial Genetics and Genomics, Springer, Dordrecht, pp. 121–132.
41 Dodd, D., Moon, Y.H., Swaminathan, K., Mackie, R.I., and Cann, I.K. (2010) J. Biol. Chem., 285, 30261–30273.
42 Anders, S. and Huber, W. (2010) Genome Biol., 11, R106.
43 Robinson, M.D., McCarthy, D.J., and Smyth, G.K. (2010) Bioinformatics, 26, 139–140.
44 Langmead, B., Hansen, K.D., and Leek, J.T. (2010) Genome Biol., 11, R83.
45 Raney, B.J., Cline, M.S., Rosenbloom, K.R., Dreszer, T.R. et al. (2010) Nucleic Acids Res., 9, D871–D875.
46 Denoeud, F., Aury, J.M., Da Silva, C., Noel, B. et al. (2008) Genome Biol., 9, R175.
47 Goetz, F., Rosauer, D., Sitar, S., Goetz, G. et al. (2010) Mol. Ecol., 19 (Suppl. 1), 176–196.
48 Lin, Z., Xu, G., Deng, N., Taylor, C. et al. (2010) J. Virol., 84, 13053–13058.
49 Woodhouse, S.D., Narayan, R., Latham, S., Lee, S. et al. (2010) Hepatology, 52, 443–453.
50 Legendre, M., Audic, S., Poirot, O., Hingamp, P. et al. (2010) Genome Res., 20, 664–674.
51 Maher, C.A., Palanisamy, N., Brenner, J.C., Cao, X. et al. (2009) Proc. Natl. Acad. Sci. USA, 106, 12353–12358.
52 Palanisamy, N., Ateeq, B., Kalyana-Sundaram, S., Pflueger, D. et al. (2010) Nat. Med., 16, 793–798.
53 Babak, T., Deveale, B., Armour, C., Raymond, C. et al. (2008) Curr. Biol., 18, 1735–1741.
54 Wang, X., Sun, Q., McGrath, S.D., Mardis, E.R. et al. (2008) PLoS ONE, 3, e3839.
55 Zhang, K., Li, J.B., Gao, Y., Egli, D. et al. (2009) Nat. Methods, 6, 613–618.
56 Gilbert, J.A., Field, D., Huang, Y., Edwards, R. et al. (2008) PLoS ONE, 3, e3042.
57 He, S., Wurtzel, O., Singh, K., Froula, J.L. et al. (2010) Nat. Methods, 7, 807–812.
58 Poretsky, R.S., Gifford, S., Rinta-Kanto, J., Vila-Costa, M., and Moran, M.A. (2009) J. Vis. Exp., 18, 1086.
59 Shi, Y., Tyson, G.W., and DeLong, E.F. (2009) Nature, 459, 266–269.
60 Sultan, M., Schulz, M.H., Richard, H., Magen, A. et al. (2008) Science, 321, 956–960.
61 Blekhman, R., Marioni, J.C., Zumbo, P., Stephens, M., and Gilad, Y. (2010) Genome Res., 20, 180–189.
62 Birzele, F., Schaub, J., Rust, W., Clemens, C. et al. (2010) Nucleic Acids Res., 38, 3999–4010.
63 Akkers, R.C., van Heeringen, S.J., Jacobi, U.G., Janssen-Megens, E.M. et al. (2009) Dev. Cell, 17, 425–434.
64 Ordas, A., Hegedus, Z., Henkel, C.V., Stockhammer, O.W. et al. (2010) Fish Shellfish Immunol., Epub ahead of print. doi:10.1016/j.fsi.2010.08.022.
65 Wall, C.E., Cozza, S., Riquelme, C.A., McCombie, W.R. et al. (2011) Physiol. Genomics, 43, 69–76.
66 Zhang, Y., Malone, J.H., Powell, S.K., Periwal, V. et al. (2010) PLoS Biol., 8, e1000320.
67 Hittinger, C.T., Johnston, M., Tossberg, J.T., and Rokas, A. (2010) Proc. Natl. Acad. Sci. USA, 107, 1476–1481.
68 Zhang, G., Guo, G., Hu, X., Zhang, Y. et al. (2010) Genome Res., 20, 646–654.
69 Filichkin, S.A., Priest, H.D., Givan, S.A., Shen, R. et al. (2010) Genome Res., 20, 45–58.
70 Bombarely, A., Menda, N., Tecle, I.Y., Buels, R.M. et al. (2011) Nucleic Acids Res., 39, D1149–D1155.
71 Albrecht, M., Sharma, C.M., Reinhardt, R., Vogel, J., and Rudel, T. (2010) Nucleic Acids Res., 38, 868–877.
72 Oliver, H.F., Orsi, R.H., Ponnala, L., Keich, U. et al. (2009) BMC Genomics, 10, 641.
73 Otto, T.D., Wilinski, D., Assefa, S., Keane, T.M. et al. (2010) Mol. Microbiol., 76, 12–24.
74 Sharma, C.M., Hoffmann, S., Darfeuille, F., Reignier, J. et al. (2010) Nature, 464, 250–255.
75 Siegel, T.N., Hekstra, D.R., Wang, X., Dewell, S., and Cross, G.A. (2010) Nucleic Acids Res., 38, 4946–4957.
76 Yoder-Himes, D.R., Chain, P.S., Zhu, Y., Wurtzel, O. et al. (2009) Proc. Natl. Acad. Sci. USA, 106, 3976–3981.
77 Perkins, T.T., Kingsley, R.A., Fookes, M.C., Gardner, P.P. et al. (2009) PLoS Genet., 5, e1000569.
78 Martin, J., Zhu, W., Passalacqua, K.D., Bergman, N., and Borodovsky, M. (2010) BMC Bioinformatics, 11 (Suppl. 3), S10.
79 Mortazavi, A., Schwarz, E.M., Williams, B., Schaeffer, L. et al. (2010) Genome Res., 20, 1740–1747.
80 Wang, B., Guo, G., Wang, C., Lin, Y. et al. (2010) Nucleic Acids Res., 38, 5075–5087.
81 Larsen, P.E., Trivedi, G., Sreedasyam, A., Lu, V. et al. (2010) PLoS ONE, 5, e9780.
82 Rey, F.E., Faith, J.J., Bain, J., Muehlbauer, M.J. et al. (2010) J. Biol. Chem., 285, 22082–22090.
83 Wurtzel, O., Sapra, R., Chen, F., Zhu, Y. et al. (2010) Genome Res., 20, 133–141.
84 Pennisi, E. (2010) Science, 327, 1190.
85 Korlach, J., Bjornson, K.P., Chaudhuri, B.P., Cicero, R.L. et al. (2010) Methods Enzymol., 472, 431–455.
86 Clarke, J., Wu, H.C., Jayasinghe, L., Patel, A. et al. (2009) Nat. Nanotechnol., 4, 265–270.
87 Lipson, D., Raz, T., Kieu, A., Jones, D.R. et al. (2009) Nat. Biotechnol., 27, 652–658.
88 Kapranov, P., Ozsolak, F., Kim, S.W., Foissac, S. et al. (2010) Nature, 466, 642–646.
89 Ozsolak, F., Goren, A., Gymrek, M., Guttman, M. et al. (2010) Genome Res., 20, 519–525.
90 Ozsolak, F., Platt, A.R., Jones, D.R., Reifenberger, J.G. et al. (2009) Nature, 461, 814–818.
91 Ozsolak, F., Ting, D.T., Wittner, B.S., Brannigan, B.W. et al. (2010) Nat. Methods, 7, 619–621.
92 Flusberg, B.A., Webster, D.R., Lee, J.H., Travers, K.J. et al. (2010) Nat. Methods, 7, 461–465.
93 Travers, K.J., Chin, C.S., Rank, D.R., Eid, J.S., and Turner, S.W. (2010) Nucleic Acids Res., 38, e159.
94 Wallace, E.V., Stoddart, D., Heron, A.J., Mikhailova, E. et al. (2010) Chem. Commun., 46, 8195–8197.
95 Eberwine, J. and Bartfai, T. (2011) Pharmacol. Ther., 129, 241–259.
96 Lao, K.Q., Tang, F., Barbacioru, C., Wang, Y. et al. (2009) J. Biomol. Tech., 20, 266–271.
97 Tang, F., Barbacioru, C., Nordman, E., Li, B. et al. (2010) Nat. Protoc., 5, 516–535.
98 Kuroshu, R.M., Watanabe, J., Sugano, S., Morishita, S. et al. (2010) PLoS ONE, 5, e10517.

7 Differential RNA Sequencing (dRNA-Seq): Deep-Sequencing-Based Analysis of Primary Transcriptomes

Anne Borries, Jörg Vogel, and Cynthia M. Sharma

Abstract

The application of deep-sequencing technologies to transcriptome analysis has recently been attracting a great deal of attention. The determination of the exact boundaries and relative levels of the whole set of RNA molecules transcribed from a genome improves our understanding of functional elements, facilitates better genome annotations, and allows for monitoring of gene expression changes under diverse growth conditions. A variety of so-called RNA sequencing (RNA-seq) methods have been applied to investigate eukaryotic and prokaryotic transcriptomes. Here, we describe a differential RNA-seq (dRNA-seq) approach for the selective analysis of primary transcripts in the cell. The method is based on a differential exonuclease treatment of total RNA samples, which leads to depletion of processed RNAs, whereas primary transcripts get enriched in relative terms. Comparison of cDNA libraries generated from untreated RNA versus an exonuclease-treated sample enables a global mapping of transcriptional start sites as well as processing sites. Furthermore, the single-nucleotide resolution of this approach facilitates the determination of new promoter regions, and thus can be used to define operon and suboperon structures at the genome-wide level. In addition, dRNA-seq can be used for a global identification of small regulatory RNAs and antisense transcripts. Overall, dRNA-seq has been used to reveal an unexpected complexity of transcriptomes in a growing number of bacterial and archaeal species.

7.1 Introduction

During the last decade an increased understanding of the relationship between genotypes and phenotypes combined with the observation that organisms can respond to environmental changes by modifying their transcription has generated renewed interest in gene regulation. Diverse methods for analyzing gene expression have been developed that quantify or compare transcript levels. Initially, hybridization-based approaches had been the method of choice to monitor gene expression, including tiling [1–3], all-exon [4], and exon-junction [5] microarray platforms. Tiling arrays are a well-established tool for detailed transcript analysis at a global level and consist of overlapping oligonucleotide probes covering the genome at high density. Labeled transcripts hybridize to their complementary probes and the resulting fluorescence is quantified. Tiling arrays do not only allow for quantifying gene expression, but have also been successfully used to identify noncoding RNAs, antisense RNAs, and promoter regions [2,3,6–9]. Generally, however, the abovementioned hybridization-based methods have several disadvantages, including

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.


limited probe density, requirement of an available genome sequence for probe design, unspecific or cross-hybridization, and limited dynamic range of detection. Digital transcript-counting approaches such as serial analysis of gene expression (SAGE) [10] and massively parallel signature sequencing [11] overcome many of these hybridization-based limitations, but are costly and cannot be used to investigate exon–intron junctions. For recent improvements of these methods by combination with high-throughput sequencing (HTPS) methods, see Chapters 1–3 of this book.

The development of next-generation sequencing (NGS) approaches based on sequencing-by-synthesis chemistry now allows for the direct sequencing of millions of cDNAs in parallel. At present, three platforms dominate the NGS market: the 454 FLX system (Roche), the Solexa Genome Analyzer (Illumina), and the SOLiD system (ABI); all of them have been used to generate, at single-nucleotide resolution, bacterial [7,12] and eukaryotic [13] transcriptomes. Here, we will focus on RNA-seq studies of bacterial transcriptomes, and how these revealed unexpected transcriptional complexity, novel noncoding RNAs, and operon structures in diverse bacterial and archaeal genomes (reviewed in [12,14]). For more detailed information on general RNA-seq approaches, see also Chapters 6, 8, 22, and 23 in this book.

To analyze transcriptomes using RNA-seq, total RNA first has to be isolated and then converted into cDNA. One major challenge of RNA-seq studies is the abundance of rRNAs and tRNAs, which constitute more than 95% of the cellular RNA pool. In eukaryotic RNA-seq studies, poly(A) tails have often been used to enrich the mRNA fraction and to prime first-strand cDNA synthesis with oligo(dT) [13]. As bacterial transcripts lack long poly(A) tails and, thus, cannot be enriched by poly(dT) capture, several other methods have been applied to deplete rRNAs prior to cDNA synthesis [7,14]. One very common method is the capture of rRNAs using antisense oligos that are coupled to magnetic beads (MICROBExpress kit, Ambion); however, this approach requires species-specific probes that are only available for a handful of bacteria, at best. In addition, other enrichment procedures, such as size selection using gel electrophoresis or coimmunoprecipitation of RNAs bound to a specific protein, have been used to deplete abundant rRNAs [15,16]. These methods have disadvantages: size selection risks removing potentially interesting long transcripts, and coimmunoprecipitation requires prior knowledge of an interacting protein.

The construction of the cDNA library to be sequenced is another important step of RNA sequencing: in some studies sequencing adapters were ligated to double-stranded cDNA that had been generated by random hexamer priming, thereby losing strand-specific information. However, several strand-specific protocols have been developed, including, for example, direct sequencing of first-strand cDNA [17], template switching polymerase chain reaction (PCR) [18], bisulfite-induced C → U conversions prior to cDNA synthesis [19], incorporation of deoxyuridine into second-strand cDNA rendering it sensitive to uracil-N-glycosylase degradation [20], and 5′ end linker ligation combined with poly(A)-tailing using Escherichia coli poly(A) polymerase [6,16,21]. Furthermore, it has been reported that E. coli poly(A) polymerase preferentially polyadenylates mRNAs, but not rRNAs, and thus can also be used to deplete rRNAs [22]. After library construction, sequencing of cDNAs using HTPS generates huge numbers of short sequence reads (35–200 bp) representing transcript fragments that are then computationally aligned and mapped to reference genomes. The data are then used to identify transcribed genomic regions and to quantify RNA levels based on cDNA coverage at a given locus.

One major goal of next-generation transcriptome studies is the determination of exact transcript boundaries, such as the 5′-located transcription start sites (TSS). Classical approaches for the definition of 5′ ends include primer extension [23] and 5′ rapid amplification of cDNA ends (RACE), but these have commonly been applied only to single genes [24]. These methods are time-consuming, differ in sensitivity, and do not generally permit a global analysis. Note, however, that a modified RACE protocol combined with HTPS has recently been used to determine 5′ ends of E. coli transcripts in a genome-wide manner [25].

Unlike tiling array approaches, which are limited by the density of the oligonucleotide probes, RNA-seq approaches can be used to determine 5′ ends at single-nucleotide resolution. A drawback of most RNA-seq approaches, however, is that they do not distinguish between TSS and processing sites within transcripts. For example, whole-transcript sequencing combined with strand-sensitive 5′ end determination based on 5′ RNA linker ligation was recently used to define the TSS and operons of the archaeon Sulfolobus solfataricus [26]. In this chapter, we describe a novel RNA-seq approach, differential RNA sequencing (dRNA-seq), based on a differential exonuclease treatment, which we originally developed to determine the primary transcriptome of the major human pathogen Helicobacter pylori [21]. dRNA-seq permits us to distinguish between primary start sites and processing sites, and this information can then be used to define global maps of TSS, operon structure, and noncoding RNA output.

7.2 What is dRNA-Seq?

In bacteria, the cellular RNA pool comprises primary transcripts marked by a 5′-triphosphate (5′-PPP) and processed RNAs with either a 5′-monophosphate (5′-P) or a 5′-hydroxyl (5′-OH) group. The main goal of dRNA-seq is to distinguish between primary and processed 5′ ends. For this purpose, dRNA-seq uses 5′-P-dependent terminator exonuclease (TEX), which specifically degrades processed RNAs carrying a 5′-P, whereas primary transcripts with a 5′-PPP are protected (Figure 7.1a). This causes a relative enrichment of primary transcripts and depletes processed RNAs. Specifically, dRNA-seq is based on differential sequencing and comparison of two cDNA libraries: one library (TEX−) is generated from untreated total RNA, and the other (TEX+) from RNA that was enriched for primary transcripts by treatment with TEX. Furthermore, the exonuclease treatment can be used to remove abundant processed RNAs, including 16S and 23S rRNAs, prior to deep sequencing (Figure 7.1b and c). In contrast, primary transcripts such as most small regulatory RNAs (sRNAs) are not degraded and thus become enriched in the TEX+ library in relative terms. A characteristic enrichment pattern, reflected by a redistribution of a gene’s cDNAs towards a sawtooth-like profile with an elevated sharp 5′ flank, then makes it possible to distinguish TSS from processing sites. Figure 7.1c exemplifies the typical patterns of primary and processed 5′ ends for the cagA mRNA and tRNA-Phe of H. pylori, respectively: cDNA reads from the TEX-treated (+) library (red curve) cluster towards the primary 5′ end of cagA mRNA and exactly match the TSS determined independently by primer extension, whereas cDNA reads from the untreated (−) library (black curve) are equally distributed along the whole mRNA locus. A similar pattern is observed at the TSS of the tRNA-Phe gene. Note that maturation of tRNAs involves cleavage by RNase P that leaves a 5′-P. 
In contrast to the TSS, cDNA distributions at this processing site are the same in the two libraries or even show enrichment in the untreated library. Based on these differential patterns, TSS and processing sites can be annotated in a genome-wide manner. A typical dRNA-seq experiment comprises the following steps:

• Construction of dRNA-seq libraries starts with the isolation of high-quality total RNA. The removal of rRNAs is not required as most of the rRNA will be eliminated during TEX treatment.

• The RNA is treated with DNase I to ensure that any residual genomic DNA is removed prior to cDNA synthesis. Following the DNase treatment, we recommend checking RNA integrity on an agarose gel or Agilent Bioanalyzer. In addition, a control PCR is recommended to test whether the removal of contaminating genomic DNA was successful. The workflow for preparing dRNA-seq libraries is illustrated in Figure 7.2.

• One half of the DNA-free RNA sample is treated with TEX, which degrades the processed RNAs and, thus, enriches for primary transcripts.

Fig. 7.1 Enrichment of primary transcripts using TEX. (a) Schematic of 5′-P-dependent TEX activity. TEX specifically degrades RNAs with a 5′-P, while primary transcripts (red) with a 5′-PPP or RNA with other termini are protected. (b) (Left) Total RNA of H. pylori grown to an OD600 of 0.6 was separated on an agarose gel and stained with ethidium bromide; “−/+” refers to prior treatment with TEX. Treatment of total RNA with TEX eliminates most of the processed RNAs carrying a 5′-P, as is obvious for 23S and 16S ribosomal RNA. (Right) Northern blot analysis of several H. pylori rRNAs and sRNAs identified in a recent dRNA-seq study [21] confirms that processed fragments of 23S rRNA are completely eliminated upon TEX treatment, whereas HPnc6670 sRNA (~130 nt), accumulating as the primary transcript, is not degraded. Although 5S rRNA (~120 nt) is a processed RNA, it is not degraded, likely due to a stable secondary structure that sequesters its 5′-P end. (c) Examples of primary start sites and processing sites. dRNA-seq-specific cDNA enrichment patterns (here cDNA libraries generated from RNA −/+ TEX treatment from H. pylori grown under acid stress) can be observed at the primary 5′ ends of cagA mRNA (left) or tRNA-Phe precursor (right). Exonuclease treatment (red curve; (+) library) redistributes the cagA cDNAs towards the nuclease-protected 5′ end, yielding a sawtooth-like profile with an elevated sharp 5′ flank that matches the previously reported TSS based on primer extension [33]. In contrast, the mature (RNase P-cleaved) 5′ end of tRNA-Phe is predominant in the (−) library (black curve). (Adapted from [21].)


Strand-specific sequencing is achieved by ligation of a 5' RNA linker and by poly(A)-tailing prior to cDNA synthesis. To this end, both the TEX− and TEX+ samples are treated with tobacco acid pyrophosphatase (TAP), which cleaves the 5'-PPP group of the primary transcripts, leaving the 5'-P that is required for 5' linker ligation. (After TEX and TAP treatment, visual quality control of the RNA samples on a polyacrylamide gel followed by direct staining of the RNA is recommended.) Next, cDNA libraries are constructed from both RNA samples by the following steps: (i) 5' linker ligation, (ii) poly(A)-tailing at the 3' end with E. coli poly(A) polymerase, (iii) first-strand cDNA synthesis using an oligo(dT)-adapter primer and an RNaseH− reverse transcriptase, and (iv) PCR amplification with primers containing barcodes for the designated sequencing platform. After sequencing of the cDNAs on the designated HTPS platform, cDNA reads are mapped to a reference genome and transcript levels are quantified.
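The library design described above means that each cDNA read is expected to begin with the 5' linker and, for short transcripts, to end in a poly(A) stretch. As a rough illustration of what this implies before mapping, the sketch below strips both from a read. The linker is the DNA form of the 18-nt RNA adapter given in Section 7.4.4; the function name `trim_read`, the exact-match strategy, and the `min_tail` cutoff are assumptions made for this example and are not taken from the published analysis.

```python
# DNA form (U -> T) of the 18-nt 5' RNA adapter listed in Section 7.4.4.
FIVE_PRIME_LINKER = "GACCTTGGCTGTCACTCA"


def trim_read(read, linker=FIVE_PRIME_LINKER, min_tail=6):
    """Return the insert of a cDNA read, or None if the linker is absent.

    Hypothetical helper: exact linker matching and the min_tail cutoff
    are simplifying assumptions, not the published procedure.
    """
    read = read.upper()
    if not read.startswith(linker):
        # Without the 5' linker the read cannot define a transcript 5' end.
        return None
    insert = read[len(linker):]
    # Remove a terminal A run only if it is long enough to be a poly(A) tail.
    stripped = insert.rstrip("A")
    if len(insert) - len(stripped) >= min_tail:
        insert = stripped
    return insert


example = FIVE_PRIME_LINKER + "ATGGCTAACGGT" + "A" * 12
print(trim_read(example))
```

The first base of the returned insert corresponds to the 5' end of the original RNA, which is the position that matters for TSS annotation. Real reads would additionally need tolerance for sequencing errors in the linker.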

7.3 Why dRNA-Seq?

To understand the full complexity of bacterial and eukaryotic transcriptomes, nonannotated transcripts such as sRNAs, mRNAs encoding small proteins, or antisense transcripts must be identified. Compared to the commonly used tiling arrays and conventional RNA-seq approaches, dRNA-seq has remarkable advantages:

• A major advantage of dRNA-seq is its ability to differentiate between the primary transcriptome and processed RNAs.

• Tiling arrays can determine 5' ends within a window of 5–30 nucleotides (depending on the probe density). In contrast, dRNA-seq maps TSS with single-nucleotide resolution. This is helpful for the identification of consensus motifs at promoters or of transcription factor binding sites in coregulated genes.

• Unlike custom arrays, the same dRNA-seq protocol can be applied to different strains of the same species.

• Abundant rRNA and tRNAs are removed during dRNA-seq library generation, obviating other depletion methods such as bead-based removal of rRNA or size fractionation by gel electrophoresis. (Note: Depletion methods based on size selection run the risk of removing interesting long noncoding transcripts.)

• Direct RNomics has previously been used for sRNA identification in bacteria (e.g., E. coli or the thermophilic bacterium Aquifex aeolicus) and in the two archaeons S. solfataricus and Archaeoglobus fulgidus, but was at that time limited by the expensive Sanger sequencing method [27–29]. In contrast, dRNA-seq provides higher sequence coverage at much lower cost and also detects low-abundance RNAs without a size-selection step.

• A global TSS map facilitates the annotation of the 5'-untranslated regions (UTRs) of all mRNAs in the cell, including the identification of cis-encoded regulatory elements such as riboswitches or leader peptides. In addition, many regulatory RNAs bind to the 5'-UTR of trans-encoded mRNAs in order to regulate gene expression. Definition of TSS generates the 5'-UTR data necessary for the prediction of sRNA binding sites.

• Analysis of the sequence coverage for each gene helps to estimate its expression level across a population of cells.

Fig. 7.2 Typical workflow for a dRNA-seq experiment. The cellular RNA pool consists of primary transcripts with a 5'-PPP and processed RNAs with a 5'-P or 5'-OH. RNAs with a 5'-OH group are not accessible for 5' linker ligation during cDNA library construction and, thus, will not be represented in the cDNA library. For the construction of dRNA-seq libraries, each RNA sample is split into two parts. One half remains untreated (TEX−), whereas the other half is treated with TEX (TEX+). The TEX enzyme digests processed transcripts carrying a 5'-P group, while primary transcripts with a 5'-PPP group are not digested. After TEX treatment, both halves are treated with TAP to transform 5'-PPP groups into 5'-P groups for 5' RNA linker ligation. Next, poly(A) tails are added by E. coli poly(A) polymerase and 5' end RNA linkers (black bars) are ligated to transcripts of both samples. For library preparation, the modified RNA samples are reverse transcribed into first-strand cDNA using an oligo(dT)-adapter primer. After RNA digestion, the second strand of the cDNA is generated and cDNAs are amplified by PCR. (Barcodes that are specific for each library can be introduced via the sense primer in this step.) Finally, both cDNAs are sequenced on a high-throughput sequencing platform and the resulting reads are mapped to the reference genome for transcript quantification.
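The discrimination between primary and processed 5' ends can be sketched in a few lines of Python. The example assumes that, for one strand, the number of reads starting at each genome position has already been counted in the TEX− and TEX+ libraries and scaled to comparable library sizes. The function name and the thresholds `min_plus`, `min_ratio`, and `pseudocount` are hypothetical values chosen for illustration; they are not the criteria used in the studies cited here.

```python
def call_tss_candidates(starts_minus, starts_plus, min_plus=5.0,
                        min_ratio=2.0, pseudocount=1.0):
    """Flag positions whose read-start signal is enriched in the TEX+ library.

    starts_minus / starts_plus: read-start counts per genome position for the
    untreated (TEX-) and TEX-treated (TEX+) library of the same strand.
    All thresholds are illustrative assumptions.
    """
    if len(starts_minus) != len(starts_plus):
        raise ValueError("both libraries must cover the same positions")
    candidates = []
    for pos, (minus, plus) in enumerate(zip(starts_minus, starts_plus)):
        # The pseudocount avoids division by zero at positions without reads.
        ratio = (plus + pseudocount) / (minus + pseudocount)
        if plus >= min_plus and ratio >= min_ratio:
            candidates.append((pos, ratio))
    return candidates


tex_minus = [0, 2, 3, 40, 1, 0]
tex_plus = [0, 1, 30, 4, 0, 0]
print(call_tss_candidates(tex_minus, tex_plus))  # [(2, 7.75)]
```

In this toy input, position 2 behaves like a primary 5' end (enriched after TEX treatment), whereas position 3 is strong only in the untreated library, as would be expected for a processing site such as a mature tRNA 5' end.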

Recently published RNA-seq and tiling array studies [7,12,14] revealed that the transcriptional output of bacterial genomes is much more complex than previously anticipated. One of these studies is the analysis of the primary transcriptome of the major human pathogen H. pylori using dRNA-seq. H. pylori is a Gram-negative, spiral-shaped bacterium that thrives in the acidic environment of the human stomach and can cause gastritis, peptic ulcer disease, or gastric cancer [30]. Sequencing of its 1.67-Mb genome had revealed only a small number of transcriptional regulators, but nothing was known about its overall transcriptional organization and noncoding RNA output. The original dRNA-seq analysis was based on 454 sequencing of around 3 000 000 cDNAs from five different growth conditions (mid-log growth, acid stress, and growth in the absence or presence of host cells). This analysis discovered more than 1900 unique TSS in the small H. pylori genome and indicated a very compact transcriptional organization. Figure 7.3 shows the dRNA-seq results (mid-log and acid stress libraries) for an example locus from the Helicobacter study, the cag pathogenicity island. dRNA-seq confirmed the previously reported acid induction of genes within this major virulence locus [31,32]. The typical enrichment pattern at TSS (see Figure 7.1c) confirmed the previously described TSS of cagA/B (black arrows [33]) and revealed several new TSS in this pathogenicity island. One of them is an acid-induced internal TSS in cag23, which leads to uncoupling of a cag22–cag18 suboperon from the cag25–cag18 main operon and explains the previously reported upregulation of the cag22–cag18 genes under acid stress [31]. In addition to mRNA start sites, hundreds of TSS were found within operons and opposite to annotated genes, revealing that the complexity of gene expression from this small genome is increased by the uncoupling of suboperons and by massive antisense transcription. Moreover, several novel small proteins of less than 50 amino acids and an unexpected number of more than 60 sRNAs were discovered based on dRNA-seq in H. pylori, a bacterium previously thought to lack riboregulation. Global TSS mapping using dRNA-seq analysis therefore turned out to be a powerful tool for mapping and annotating the primary transcriptome, and it should help to improve the genome annotation of many other organisms.

[Figure 7.3: read-coverage tracks (relative score) for the ML −/+ and AS −/+ libraries on the leading and lagging strands across the cag pathogenicity island, genome positions 564,000 to 584,000, with the cag25–cag18 operon and the cag22–cag18 suboperon marked.]

Fig. 7.3 dRNA-seq analyses of the cag pathogenicity island from H. pylori. dRNA-seq was used to annotate TSS and define operon structures in H. pylori, which is shown for part of the cag pathogenicity island [21]. Sequenced cDNAs of mid-log growth (ML −/+) and acid stress (AS −/+) libraries were mapped onto cag annotations (gray arrows) in forward (leading strand) and reverse direction (lagging strand). Black and gray arrows indicate published (cagA/B [33]) and newly identified TSS, respectively. Annotation of cagB (white arrow) according to H. pylori strain G27. Dotted lines indicate transcription of the primary cag25–cag18 operon and its associated cag22–cag18 suboperon, which is uncoupled and induced under acid stress by an internal TSS in cag23. Relative scales are defined by "percent mapped reads per genome position" and are consistent for all four libraries of the same strand. (Adapted from [21].)
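The relative score used in Figure 7.3 ("percent mapped reads per genome position") can be read as the read coverage at a position expressed as a percentage of all mapped reads in that library. The sketch below follows this reading, which is an interpretation of the caption rather than a documented formula; the function name and the example numbers are made up for illustration.

```python
def relative_score(coverage, total_mapped_reads):
    """Scale per-position read coverage to percent of mapped reads.

    coverage: number of reads covering each genome position (one strand).
    total_mapped_reads: number of mapped reads in the same library.
    This normalization is an assumed reading of the figure caption.
    """
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return [100.0 * count / total_mapped_reads for count in coverage]


# Two hypothetical libraries of different depth with the same relative profile.
shallow = relative_score([0, 50, 200], 100_000)
deep = relative_score([0, 500, 2000], 1_000_000)
print(shallow)
print(deep)
```

Scaling by the number of mapped reads puts libraries of different sequencing depth on one axis, which is what allows the four tracks of a strand to share a common scale.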

7.4 Methods and Protocols

7.4.1 Materials and Consumables

Key Reagents

• DNase I (1 U/µl, Fermentas; cat. no. EN0521, 1000 U)
• Superase-In RNase inhibitor (20 U/µl, Ambion; cat. no. AM2694, 2500 U)
• Terminator 5'-phosphate-dependent exonuclease (1 U/µl, Epicentre; cat. no. TER51020, 40 U)
• TAP (10 U/µl, Epicentre; cat. no. T19100, 100 U)
• Roti-Aqua-P/C/I (phenol/chloroform/isoamylalcohol) (Carl Roth; cat. no. X985.1)
• GlycoBlue (15 mg/ml; Ambion; cat. no. AM9515)
• Stains-all (Sigma-Aldrich; cat. no. E9379-1G)
• Gel Loading Buffer II (Ambion; cat. no. AM8546G)
• RNA RiboRuler high-range RNA ladder (Fermentas; cat. no. SM1821)
• Phase Lock Gel tubes (VWR; cat. no. 713-2536, 2 ml)
• Eppendorf tubes (Eppendorf Vertrieb Deutschland; cat. no. 30 120 086)

Selected Buffers and Solutions Needed for dRNA-Seq Library Preparation

• 30:1 EtOH:3 M NaOAc, pH 6.5
    EtOH 100%                    29 ml
    3 M NaOAc, pH 6.5             1 ml
  Store at room temperature.

• Stains-all stock solution
    Stains-all                   0.1 g
    Formamide                    100 ml
  Store in dark at 4 °C.

• Stains-all working solution (final volume)
    Stains-all stock solution    30 ml
    Formamide                    90 ml
    H2O                          80 ml
  Store in dark at 4 °C.

Other buffers and solutions needed:

• Ultra-pure water (dH2O)
• 3 M NaOAc (pH 6.5)
• 75% EtOH
• 0.5 M EDTA (pH 8.0)
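For orientation, the final concentrations implied by these recipes follow from simple dilution arithmetic. The short calculation below is a sketch that assumes the component volumes are additive (which is only approximately true for ethanol and water); the helper name is hypothetical.

```python
def final_concentration(stock_conc, stock_volume, total_volume):
    """Concentration after diluting stock_volume of a stock to total_volume."""
    return stock_conc * stock_volume / total_volume


# 30:1 mix: 29 parts ethanol plus 1 part 3 M NaOAc.
mix_total = 29.0 + 1.0
naoac_in_mix = final_concentration(3.0, 1.0, mix_total)      # mol/l
ethanol_fraction = 29.0 / mix_total

# Stains-all stock: 0.1 g in 100 ml formamide, i.e. 1 mg/ml.
stains_stock = 0.1 * 1000.0 / 100.0                           # mg/ml
# Working solution: 30 ml stock + 90 ml formamide + 80 ml water.
working_total = 30.0 + 90.0 + 80.0
stains_working = final_concentration(stains_stock, 30.0, working_total)
formamide_fraction = (30.0 + 90.0) / working_total            # stock is in formamide

print(f"NaOAc in 30:1 mix: {naoac_in_mix:.2f} M")
print(f"Ethanol in 30:1 mix: {ethanol_fraction:.1%}")
print(f"Stains-all in working solution: {stains_working:.2f} mg/ml")
print(f"Formamide in working solution: {formamide_fraction:.0%}")
```

Under these assumptions the 30:1 mix contains about 0.1 M NaOAc, and the Stains-all working solution contains 0.15 mg/ml dye in 60% formamide.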

7.4.2 Precautions

All necessary solutions and reagents should be prepared prior to the experiment. Make sure that you work under RNase-free conditions and always keep the RNA samples on ice.

7.4.3 RNA Samples Used for dRNA-Seq Library Preparation

To construct dRNA-seq libraries, at least 15 µg of high-quality total RNA is required. However, we advise preparing larger amounts of RNA from the same sample to have enough material for downstream confirmation experiments, such as 5'-RACE, Northern blot analysis, or quantitative reverse transcription-PCR. It is important to use an RNA preparation method that gives high-quality RNA to avoid extensive sequencing of rRNA degradation fragments. For bacterial total RNA preparation, for example, we recommend RNA isolation methods based on hot-phenol extraction, which give a high yield of RNA with high integrity [34,35]. Note that for Gram-positive bacteria a more extensive lysis method (e.g., using lysozyme or glass beads) may be required prior to RNA extraction. For methods of RNA quality control, refer also to Chapter 2 (DeepCAGE) in this book.

7.4.4 dRNA-Seq Library Preparation
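Before starting, it can be useful to confirm that the pipetting volumes of the enzymatic reactions in steps 1 to 3 below add up to the stated total reaction volumes. The following sketch does this for the DNase I, TEX, and TAP reactions; all volumes are taken as microliters, and the dictionary layout and function name are assumptions made for this example.

```python
# Component volumes (in microliters) as listed in steps 1, 2, and 3.
REACTIONS = {
    "DNase I digestion": {
        "total": 100.0,
        "components": {"RNA in dH2O": 79.0, "10x DNase I buffer": 10.0,
                       "Superase-In": 1.0, "DNase I": 10.0},
    },
    "TEX treatment": {
        "total": 50.0,
        "components": {"RNA in water": 37.5, "Superase-In": 0.5,
                       "10x TEX buffer": 5.0, "TEX or dH2O": 7.0},
    },
    "TAP treatment": {
        "total": 20.0,
        "components": {"TEX-treated RNA": 10.0, "10x TAP buffer": 2.0,
                       "TAP": 0.5, "Superase-In": 0.5, "dH2O": 7.0},
    },
}


def volume_differences(reactions):
    """Return, per reaction, the summed component volume minus the stated total."""
    return {name: sum(spec["components"].values()) - spec["total"]
            for name, spec in reactions.items()}


for name, diff in volume_differences(REACTIONS).items():
    status = "ok" if abs(diff) < 1e-9 else f"off by {diff:+.1f} ul"
    print(f"{name}: {status}")
```

The same table can be extended with scaled volumes when more than two samples are processed in parallel, so that master mixes are checked in the same way.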

1. DNase I digestion (100 µl total reaction volume)
1.1 Dissolve RNA sample in 79 µl dH2O (total amount 40–50 µg).
1.2 Denature RNA in dH2O for 5 min at 65 °C.
1.3 Cool on ice for 5 min.
1.4 Add on ice:
    10× DNase I buffer including MgCl2        10 µl
    Superase-In RNase inhibitor (20 U/µl)      1 µl
    DNase I (1 U/µl)                          10 µl
1.5 Incubate for 30–45 min at 37 °C.
1.6 Add 100 µl Roti-Aqua-P/C/I to a 2-ml Phase Lock Gel tube.
1.7 Add DNase I-digested samples.
1.8 Mix for 15 s by shaking the tubes (do not vortex!).
1.9 Centrifuge 12 min, 15 °C, 13 000 rpm.


1.10 Transfer upper phase to a fresh 1.5-ml Eppendorf tube.
1.11 Add 2.5 volumes (300 µl) 30:1 mix (EtOH:3 M NaOAc, pH 6.5).
1.12 Precipitate at least 1 h or overnight at −20 °C.
1.13 Centrifuge 30 min, 4 °C, 13 000 rpm.
1.14 Discard supernatant.
1.15 Wash pellet with 350 µl 75% EtOH.
1.16 Centrifuge 10 min, 4 °C, 13 000 rpm.
1.17 Discard supernatant and air-dry pellet.
1.18 Add 40 µl dH2O (do not pipette up and down!).
1.19 Dissolve pellet by 5 min incubation at 65 °C and 800–900 rpm on a thermo-shaker, vortex 2–3 times in between.
1.20 Check RNA concentration on NanoDrop (final RNA concentration should be 1 µg/µl).
1.21 Check removal of genomic DNA by a control PCR (use any primers that yield a product of 500–1000 bp) with 100 ng RNA before and after DNase I digestion as input and 100 ng genomic DNA as positive control, 40 cycles PCR amplification.
1.22 Check integrity of the DNase-free RNA by visual inspection of the 23S and 16S bands on an agarose gel or on a Bioanalyzer.

2. Terminator exonuclease treatment for dRNA-seq (50 µl total reaction volume)
2.1 Prepare two 1.5-ml reaction tubes, each with 7 µg of DNase I-treated RNA in 37.5 µl RNase-free water (TEX− and TEX+).
2.2 Denature RNAs for 2 min at 90 °C.
2.3 Cool on ice for 5 min.
2.4 Add to each tube:
    Superase-In RNase inhibitor (20 U/µl)     0.5 µl
    10× TEX buffer                            5.0 µl
2.5 Add 7 µl dH2O to the TEX− sample.
2.6 Add 7 µl TEX (1 U/µl) to the TEX+ sample.
2.7 Incubate 60 min at 30 °C.
2.8 Place tubes on ice and stop reaction by addition of 0.5 µl 0.5 M EDTA, pH 8.0.
2.9 Add 50 µl dH2O to each tube.
2.10 Add 100 µl Roti-Aqua-P/C/I in 2-ml Phase Lock Gel tube to the reaction samples.
2.11 Mix for 15 s by vigorously inverting the tubes (do not vortex!).
2.12 Centrifuge 12 min, 15 °C, 13 000 rpm.
2.13 Transfer upper phase to fresh 1.5-ml Eppendorf tube.
2.14 Add 2 µl GlycoBlue.
2.15 Add 300 µl 30:1 mix (EtOH:3 M NaOAc, pH 6.5).
2.16 Precipitate overnight at −20 °C.
2.17 Centrifuge 30 min, 4 °C, 13 000 rpm.
2.18 Discard supernatant.
2.19 Wash pellet with 90 µl 75% EtOH.
2.20 Centrifuge 10 min, 4 °C, 13 000 rpm.
2.21 Discard supernatant and air-dry pellet.
2.22 Add 11 µl dH2O (do not pipette up and down!).
2.23 Dissolve pellet by incubation for 5 min at 65 °C and 800–900 rpm on thermo-shaker, vortex 2–3 times in between.
2.24 Measure RNA concentration of 1 µl on NanoDrop (the TEX-treated sample should contain less RNA due to removal of processed RNAs).

3. TAP treatment (20 µl total reaction volume)
3.1 Denature remaining 10 µl of −/+ TEX-treated RNAs for 1 min at 90 °C.


3.2 Cool on ice for 5 min.
3.3 Prepare TAP mix as follows:
    10× TAP buffer                            2.0 µl
    TAP (10 U/µl)                             0.5 µl
    Superase-In RNase inhibitor (20 U/µl)     0.5 µl
    dH2O                                      7.0 µl
    Final volume                              10 µl
3.4 Add 10 µl TAP mix to each sample, and mix well by pipetting up and down.
3.5 Incubate for 60 min at 37 °C.
3.6 Add 80 µl dH2O.
3.7 Add 100 µl Roti-Aqua-P/C/I in 2-ml Phase Lock Gel tube.
3.8 Mix for 15 s by vigorously inverting the tubes (do not vortex!).
3.9 Centrifuge 12 min, 15 °C, 13 000 rpm.
3.10 Transfer upper phase to fresh 1.5-ml Eppendorf tube.
3.11 Add 0.8 µl GlycoBlue.
3.12 Add 300 µl 30:1 mix (EtOH:3 M NaOAc, pH 6.5).
3.13 Precipitate overnight at −20 °C.
3.14 Centrifuge 30 min, 4 °C, 13 000 rpm.
3.15 Discard supernatant.
3.16 Wash pellets with 90 µl 75% EtOH.
3.17 Centrifuge 10 min, 4 °C, 13 000 rpm.
3.18 Discard supernatant and air-dry pellets.
3.19 Add 20 µl dH2O (do not pipette up and down!).
3.20 Dissolve pellet by incubation for 5 min at 65 °C and 800–900 rpm on thermo-shaker, vortex 2–3 times in between.

4. Quality control of TEX and TAP treatment
Note: Check TEX treatment by visual inspection on a polyacrylamide gel.
4.1 Prepare a 4% polyacrylamide/8.3 M urea gel (10 cm × 10 cm).
4.2 Add 5 µl loading buffer II to 5 µl of the TEX− and TEX+ samples, denature samples 1–2 min at 95 °C, and load on gel.
4.3 Denature the RNA ladder as well and load 7.5 µl RNA RiboRuler high-range RNA ladder.
4.4 Run gel at 150 V for approximately 1–1.5 h.
4.5 Place gel in 100 ml Stains-all working solution in the dark on shaker.
4.6 Stain gel 20 min (Stains-all working solution can be reused 20–30 times).
4.7 Destain gel with dH2O in light.

5. cDNA library preparation
After a successful TEX and TAP treatment and quality control, cDNA libraries are constructed from the two RNA samples for the designated sequencing platform. The libraries must be generated in a strand-specific manner. For example, in our dRNA-seq analysis of the H. pylori primary transcriptome, cDNA cloning and 454 pyrosequencing were performed as described for the identification of eukaryotic microRNA [36], but omitting size fractionation of RNA prior to cDNA synthesis. In brief, equal amounts of −/+ TEX-treated RNA were poly(A)-tailed using poly(A) polymerase, followed by ligation of an RNA adapter (5'-GACCUUGGCUGUCACUCA-3', 18 nucleotides) to the 5'-phosphate of the RNAs. First-strand cDNA synthesis was then performed using an oligo(dT)-adapter primer and Moloney murine leukemia virus RNaseH− reverse transcriptase. Incubation temperatures were 42 °C for 20 min, ramp to 55 °C, followed by 55 °C for 5 min. The resulting cDNAs were then PCR-amplified to 20–30 ng/µl using a high-fidelity DNA polymerase. The primers used for PCR amplification were designed for amplicon sequencing according to the instructions of 454 Life Sciences. PCR products were purified using the NucleoSpin Extract II kit (Macherey & Nagel) and cDNA sizes were examined by capillary electrophoresis on a MultiNA microchip electrophoresis system (Shimadzu). (Note: If the cDNA library contains many cDNAs in the size range of the 5' end plus 3' end adapters, a size fractionation of the cDNA library on a gel is required to avoid extensive sequencing of self-ligation products of the two adapters.) Finally, the cDNA libraries were sequenced following the 454 standard protocols.

7.5 Applications

Following the initial study in H. pylori [21], dRNA-seq has since been used for global TSS mapping and transcript discovery in several other species, including Methanosarcina mazei Gö1 [37], Chlamydia trachomatis [38], and Bacillus subtilis [39]. A TEX-based approach was also used by others for global TSS mapping and sRNA identification in the plant symbiotic α-proteobacterium Sinorhizobium meliloti [40]; however, this study included a size-selection step for RNAs in a size range of 50–350 nucleotides, and thus TSS mapping was largely restricted to sRNA candidates and stable fragments of mRNA leaders.

In the dRNA-seq study of M. mazei Gö1 [37], an archaeal organism with a high proportion of noncoding regions (around 25% of its approximately 4-Mb genome), 454 sequencing of around 40 000–185 000 cDNAs per dRNA-seq library from two different growth conditions detected 876 TSS, 40 new small open reading frames (ORFs) of 30 amino acids or less, and 208 sRNA candidates, mostly in intergenic regions, but also in cis-antisense orientation to ORFs. Comparative analysis of the organism grown under different nitrogen availabilities revealed 135 nitrogen-regulated sRNAs, indicating a direct transcriptional response to nitrogen, and allowed for the definition of a unique sequence motif for nitrogen-responsive promoters. Archaea are known to have mainly leaderless mRNAs; in contrast, TSS mapping in M. mazei defined more than 500 mRNAs with surprisingly long 5'-UTRs that could be targeted by post-transcriptional regulators or themselves harbor cis-encoded regulatory RNA elements.

Albrecht et al. analyzed the primary transcriptome of two developmental stages of C. trachomatis, an obligate intracellular pathogen. Using dRNA-seq, they mapped 363 TSS and identified more than 40 sRNA candidates [38]. Semiquantitative gene expression analysis revealed differences between the two developmental stages for several genes and also for one of the most abundantly transcribed sRNAs.

In addition, dRNA-seq helped to identify trans-encoded regulatory RNAs in B. subtilis, in contrast to the previous assumption that Firmicutes mainly rely on cis-encoded riboswitch elements in the 5'-UTRs of mRNAs [39].

All of the recently published dRNA-seq studies used the 454 platform, whose current titanium chemistry can achieve long read lengths (an average of 400 bp) and around 1 million reads per run. In contrast, other platforms like the SOLiD system (ABI by Life Technologies) and Solexa (Illumina) can nowadays generate more than 10 million reads with a length of up to 50 or 120 bp per lane, respectively. Very long reads are not absolutely necessary for the detection of TSS and small RNAs, and the increased read length of around 100 bp, compared to the initial 36-bp Solexa reads, gives much higher specificity during mapping to reference genomes. It is therefore more effective to analyze primary transcriptomes with systems that yield a higher number of reads, because the increased sequencing depth raises the coverage of each nucleotide. Furthermore, we have introduced specific barcodes for each sample during library generation, which allows for multiplex sequencing. Sequencing of multiple libraries in one lane reduces sequencing costs and time.

Since dRNA-seq was developed for the annotation of TSS, it is biased to the 5' end due to the enrichment of reads that start at the TSS. This 5' bias is also due to the read-length limitation of the current sequencing methods. As read lengths continue to increase, it will become possible to sequence full-length primary transcripts. At the moment, we recommend combining the method with conventional RNA-seq techniques including RNA fragmentation to cover full-length transcripts by fragment assembly, and to define 3' ends and operons. Moreover, the current dRNA-seq method is limited to the detection of certain classes of processed transcripts. Degraded RNAs carrying a 5'-OH group are not digested by TEX; however, they do not permit 5' linker ligation either and will therefore not be represented in the cDNA library. To capture these RNAs as well, an additional 5' phosphorylation step using T4 polynucleotide kinase is required. Furthermore, some of the abundant processed RNAs such as 5S rRNA or tRNA are sometimes not fully degraded by TEX. The 5'-P of these RNAs is likely to be sequestered in a stable secondary structure and therefore not accessible to the exonuclease. However, with increasing sequencing depth, the removal of abundant rRNA and tRNAs is becoming less of an issue.
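The fate of the different 5' end classes in the workflow can be summarized as a small decision function. The sketch below encodes the rules stated in this chapter (TEX degrades 5'-P RNA, TAP converts 5'-PPP to 5'-P, and only 5'-P ends accept the linker). Placing the optional T4 polynucleotide kinase step after TEX digestion is an assumption of this example, and the model ignores structured RNAs such as 5S rRNA that partially escape TEX; the function name is hypothetical.

```python
def enters_library(five_prime_end, tex_treated, pnk_treated=False):
    """Return True if an RNA with this 5' end can be represented in the library.

    five_prime_end: "PPP" (primary), "P" (processed), or "OH".
    tex_treated: True for the TEX+ library, False for the TEX- library.
    pnk_treated: optional 5' phosphorylation, assumed to follow TEX digestion.
    """
    if five_prime_end not in ("PPP", "P", "OH"):
        raise ValueError(f"unknown 5' end: {five_prime_end!r}")
    end = five_prime_end
    # TEX degrades RNAs that carry a 5'-monophosphate.
    if tex_treated and end == "P":
        return False
    # TAP converts the 5'-triphosphate of primary transcripts into 5'-P.
    if end == "PPP":
        end = "P"
    # T4 polynucleotide kinase would phosphorylate 5'-OH ends.
    if pnk_treated and end == "OH":
        end = "P"
    # Only 5'-P ends are substrates for 5' linker ligation.
    return end == "P"


for end in ("PPP", "P", "OH"):
    for tex in (False, True):
        label = "TEX+" if tex else "TEX-"
        print(f"5'-{end:<3} {label}: {enters_library(end, tex)}")
```

In this simplified model, primary transcripts appear in both libraries, processed 5'-P RNAs only in the untreated library, and 5'-OH RNAs in neither unless a kinase step is added.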

7.6 Perspectives

dRNA-seq has been developed for global TSS mapping. So far, it has been applied only to transcriptome analysis of bacterial model organisms at a limited set of specific growth conditions. However, as certain promoters might be active only at very low levels under standard growth conditions, we strongly recommend that dRNA-seq libraries be prepared from a variety of growth conditions. The staggering increase in sequence numbers per run combined with lower sequencing costs should permit this. The identification of TSS of coregulated genes that are active under diverse conditions will permit the finding of consensus promoter motifs or transcription factor binding sites. Thus far, dRNA-seq has only been used in a semiquantitative manner, since studies were limited in sequencing depth and biological replicates due to the high costs of 454 sequencing. Again, the steady decrease of sequencing costs per library will help to overcome these limitations and render dRNA-seq useful for quantitative analysis as well. dRNA-seq studies have so far been applied only to prokaryotes. However, the CAP structure at the 5' ends of eukaryotic mRNA also protects from terminator exonuclease digestion, and we thus expect that dRNA-seq will work for eukaryotes as well. Moreover, sequencing-based approaches could enable the parallel analysis of host and bacterial RNAs because cDNA reads can be mapped to the respective genome. By contrast, parallel transcriptome analysis of two different organisms was previously limited by problems of cross-hybridization when using microarrays. In addition, the amount of input material and the ratio of host/pathogen RNA are critical factors for microarray analysis. If the ratio is too high, pathogen transcripts cannot be detected, and thus a separation or enrichment of the bacterial transcripts is required prior to microarray hybridization. In contrast, there should soon be enough sequencing depth for RNA-seq-based analyses that do not require such physical separation steps. Another advantage of RNA-seq compared to microarray platforms is that it requires fewer cells as starting material. However, the amount of RNA isolated from tissue samples or clinical samples can still be limiting for the current library preparation protocol. Emerging sequencing technologies such as direct RNA sequencing using Helicos sequencing [41] as well as single-cell sequencing [42] are therefore likely to offer a solution. New amplification-free library protocols [43] will help to overcome limitations in the amount of RNA isolated from infected tissues and biases introduced during library preparation. This will give insights into infection processes at unprecedented resolution. Sequencing data are still influenced by biases generated by random primer annealing during cDNA synthesis and by PCR amplification steps during library preparation. Direct RNA sequencing is a promising way to minimize such biases in sequencing data.


Furthermore, as many regulatory processes are stochastic events, and as it is unlikely that bacteria isolated from a given niche are synchronized or represent a homogeneous population, single-cell analysis will make it possible to distinguish differential transcriptional responses within the bacterial community. It will also allow us to answer the question of whether cis-antisense RNAs occur in the same cell as the corresponding sense transcripts. Furthermore, this approach will be the starting point for analyzing transcriptomes from nonculturable species. Single-cell sequencing will therefore be a milestone for transcriptome research. While dRNA-seq has provided an advantage over previous hybridization techniques in mapping TSS, future sequencing technologies may help to uncover even more complex transcriptome architecture in prokaryotes and eukaryotes as well as in mixed populations.

Acknowledgments

We thank Maureen Kiley Thomason for critical comments on this chapter. All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

References

1 Kapranov, P., Cheng, J., Dike, S., Nix, D.A. et al. (2007) Science, 316, 1484–1488.
2 Selinger, D.W., Cheung, K.J., Mei, R., Johansson, E.M. et al. (2000) Nat. Biotechnol., 18, 1262–1268.
3 Toledo-Arana, A., Dussurget, O., Nikitas, G., Sesto, N. et al. (2009) Nature, 459, 950–956.
4 Gardina, P.J., Clark, T.A., Shimada, B., Staples, M.K. et al. (2006) BMC Genomics, 7, 325.
5 Clark, T.A., Sugnet, C.W., and Ares, M. Jr. (2002) Science, 296, 907–910.
6 McGrath, P.T., Lee, H., Zhang, L., Iniesta, A.A. et al. (2007) Nat. Biotechnol., 25, 584–592.
7 Sorek, R. and Cossart, P. (2010) Nat. Rev. Genet., 11, 9–16.
8 Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigo, R. et al. (2007) Nature, 447, 799–816.
9 Georg, J., Voss, B., Scholz, I., Mitschke, J. et al. (2009) Mol. Syst. Biol., 5, 305.
10 Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. (1995) Science, 270, 484–487.
11 Jongeneel, C.V., Iseli, C., Stevenson, B.J., Riggins, G.J. et al. (2003) Proc. Natl. Acad. Sci. USA, 100, 4702–4705.
12 van Vliet, A.H. (2010) FEMS Microbiol. Lett., 302, 1–7.
13 Wang, Z., Gerstein, M., and Snyder, M. (2009) Nat. Rev. Genet., 10, 57–63.
14 Croucher, N.J. and Thomson, N.R. (2010) Curr. Opin. Microbiol., 13, 619–624.
15 Liu, J.M., Livny, J., Lawrence, M.S., Kimball, M.D. et al. (2009) Nucleic Acids Res., 37, e46.
16 Sittka, A., Lucchini, S., Papenfort, K., Sharma, C.M. et al. (2008) PLoS Genet., 4, e1000163.
17 Croucher, N.J., Fookes, M.C., Perkins, T.T., Turner, D.J. et al. (2009) Nucleic Acids Res., 37, e148.
18 Cloonan, N., Forrest, A.R., Kolle, G., Gardiner, B.B. et al. (2008) Nat. Methods, 5, 613–619.
19 He, Y., Vogelstein, B., Velculescu, V.E., Papadopoulos, N., and Kinzler, K.W. (2008) Science, 322, 1855–1857.
20 Parkhomchuk, D., Borodina, T., Amstislavskiy, V., Banaru, M. et al. (2009) Nucleic Acids Res., 37, e123.
21 Sharma, C.M., Hoffmann, S., Darfeuille, F., Reignier, J. et al. (2010) Nature, 464, 250–255.
22 Frias-Lopez, J., Shi, Y., Tyson, G.W., Coleman, M.L. et al. (2008) Proc. Natl. Acad. Sci. USA, 105, 3805–3810.
23 Thompson, J.A., Radonovich, M.F., and Salzman, N.P. (1979) J. Virol., 31, 437–446.
24 Argaman, L., Hershberg, R., Vogel, J., Bejerano, G. et al. (2001) Curr. Biol., 11, 941–950.
25 Mendoza-Vargas, A., Olvera, L., Olvera, M., Grande, R. et al. (2009) PLoS ONE, 4, e7526.
26 Wurtzel, O., Sapra, R., Chen, F., Zhu, Y. et al. (2010) Genome Res., 20, 133–141.
27 Vogel, J., Bartels, V., Tang, T.H., Churakov, G. et al. (2003) Nucleic Acids Res., 31, 6435–6443.
28 Tang, T.H., Rozhdestvensky, T.S., d'Orval, B.C., Bortolin, M.L. et al. (2002) Nucleic Acids Res., 30, 921–930.
29 Willkomm, D.K., Minnerup, J., Huttenhofer, A., and Hartmann, R.K. (2005) Nucleic Acids Res., 33, 1949–1960.
30 Cover, T.L. and Blaser, M.J. (2009) Gastroenterology, 136, 1863–1873.
31 Wen, Y., Marcus, E.A., Matrubutham, U., Gleeson, M.A. et al. (2003) Infect Immun., 71, 5921–5939.
32 Merrell, D.S., Goodrich, M.L., Otto, G., Tompkins, L.S., and Falkow, S. (2003) Infect Immun., 71, 3529–3539.
33 Spohn, G., Beier, D., Rappuoli, R., and Scarlato, V. (1997) Mol. Microbiol., 26, 361–372.
34 Blomberg, P., Wagner, E.G., and Nordstrom, K. (1990) EMBO J., 9, 2331–2340.
35 Mattatall, N.R. and Sanderson, K.E. (1996) J. Bacteriol., 178, 2272–2278.
36 Berezikov, E., Thuemmler, F., van Laake, L.W., Kondova, I. et al. (2006) Nat. Genet., 38, 1375–1377.
37 Jager, D., Sharma, C.M., Thomsen, J., Ehlers, C. et al. (2009) Proc. Natl. Acad. Sci. USA, 106, 21878–21882.
38 Albrecht, M., Sharma, C.M., Reinhardt, R., Vogel, J., and Rudel, T. (2010) Nucleic Acids Res., 38, 868–877.
39 Irnov, I., Sharma, C.M., Vogel, J., and Winkler, W.C. (2010) Nucleic Acids Res., 38, 6637–6651.
40 Schluter, J.P., Reinkensmeier, J., Daschkey, S., Evguenieva-Hackenberg, E. et al. (2010) BMC Genomics, 11, 245.
41 Ozsolak, F., Platt, A.R., Jones, D.R., Reifenberger, J.G. et al. (2009) Nature, 461, 814–818.
42 Tang, F., Barbacioru, C., Wang, Y., Nordman, E. et al. (2009) Nat. Methods, 6, 377–382.
43 Mamanova, L., Andrews, R.M., James, K.D., Sheridan, E.M. et al. (2010) Nat. Methods, 7, 130–132.

8 Identification and Expression Profiling of Small RNA Populations Using High-Throughput Sequencing

Javier Armisen, W. Robert Shaw, and Eric A. Miska

Abstract

Small RNAs are short noncoding regulatory RNAs (18–31 nucleotides long) with important roles in gene expression. Since the first endogenous small regulatory RNAs were identified in the early 1990s, many functional small RNAs have been identified in diverse organisms using conventional techniques such as genetics, molecular cloning, and bioinformatic prediction. In recent years, however, with the implementation of new high-throughput sequencing technologies, also called next-generation sequencing (NGS) technologies, the number of novel small RNAs identified has increased drastically, from a few hundred to hundreds of thousands, opening up a new dimension in our attempts to understand gene regulation and permanently altering the landscape of functional RNA molecules. In this chapter, we provide a brief summary of three major classes of small RNAs and the NGS technologies used to investigate these classes. We present a generalized method to prepare small RNA libraries suitable for all NGS technologies, and we explain the advantages and disadvantages of using NGS to identify and monitor small RNA populations.

8.1 Introduction

Small regulatory RNAs [1–7] can be classified into three major classes based on their limited size ranges and their abilities to interact specifically with members of a particular family of RNA-binding proteins called Argonaute proteins (Table 8.1). These three classes are microRNAs (miRNAs), Piwi-interacting RNAs (piRNAs), and small interfering RNAs (siRNAs).

8.1.1 miRNAs

Since the identification of the first miRNAs, lin-4 and let-7 [8,9], thousands of new miRNAs have been discovered. While their functions remain unknown in the majority of cases, many miRNAs share a common mechanism of action whereby the miRNA recognizes and binds to sites within the 3′-untranslated region (UTR) of target mRNAs, triggering translation repression, mRNA degradation, and a consequent decrease in gene expression. In animals and plants, miRNAs are mainly transcribed as long hairpin precursor structures (called pri-miRNAs) by RNA polymerase II. miRNA genes can be transcribed individually or, if they are clustered in the genome, as a single long transcript followed by individual processing of each stem–loop. Pri-miRNAs are subsequently processed in the nucleus by an RNase III enzyme, Drosha in animals and Dicer-like 1 enzyme (DCL1) in plants [10,11]. Drosha processing requires the cofactor DGCR8/Pasha (DGCR8 in humans, Pasha in Drosophila melanogaster, and PASH-1 in Caenorhabditis elegans), which contains a double-stranded RNA-binding domain and facilitates the binding of Drosha to its target. The complex formed by Drosha/DGCR8 is known as the microprocessor complex, and it also includes RNA helicases and several nuclear ribonucleoproteins [12]. The microprocessor complex recognizes the base of the pri-miRNA stem–loop structure, binds to it, and cleaves 11 bp away [13]. The final products are stem–loop structures of 70–80 nucleotides (pre-miRNAs) with a 2-nucleotide overhang at the 3′ end and a 5′-monophosphate. In a few cases, pri-miRNAs are derived directly from introns of protein-coding genes (called mirtrons); their processing can occur cotranscriptionally by Drosha, before the host RNA is spliced, or after splicing, bypassing the requirement for Drosha altogether [14]. In animals, pre-miRNAs containing the 2-nucleotide 3′ overhang are recognized and exported from the nucleus to the cytoplasm by the transport protein Exportin-5 in a Ran-GTP-dependent manner [15]. In the cytoplasm, pre-miRNAs are further processed by another RNase III enzyme, Dicer, which cleaves near their terminal loops, resulting in 22-nucleotide double-stranded RNA products containing 5′-monophosphates and 3′-hydroxyl groups. In plants, the cleavage of the pre-miRNA into the 22-nucleotide double-stranded RNA product is carried out in the nucleus by DCL1 before the products are exported to the cytoplasm by the plant ortholog of Exportin-5, HASTY [16]. Once in the cytoplasm, plant miRNAs are modified at their 3′ ends by the methyltransferase HEN1, which converts a 2′-hydroxyl group to a 2′-methoxy group. The incorporation of the methyl group at the 3′ end of plant miRNAs increases their stability and prevents their degradation [17].

Table 8.1 Classification of small RNAs.

| Class | Size (nt) | Precursor structure | Processing | Modification | Expression | Mechanism of action |
|---|---|---|---|---|---|---|
| miRNA | 21–23 | imperfect hairpin | Dicer-dependent | 5′-monophosphate | ubiquitous | translation repression; mRNA degradation |
| siRNA | 20–25 | long double-stranded RNA | Dicer-dependent | 5′-mono- or triphosphate; 2′-O-methylated at 3′ end | ubiquitous | transcriptional gene silencing; mRNA degradation |
| piRNA | 26–31 | unknown | Dicer-independent | 5′-monophosphate; 2′-O-methylated at 3′ end | germline | transcriptional gene silencing; transposon silencing; heterochromatin regulation |

Functional small RNAs can be classified into three main categories based on their size, function, and associated Argonaute family member. Certain small RNAs, such as secondary siRNAs, carry a 5′-triphosphate and cannot be ligated directly during HTS library preparation.

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.
From the double-stranded RNA product only one strand – the “guide” strand – is incorporated into the Argonaute-containing effector complex called the RNA-induced silencing complex (RISC), while the other, the “passenger” strand, is degraded. The RISC complex then mediates the translation repression or mRNA destabilization of the target mRNAs. In animals, Dicer function requires an additional cofactor containing a double-stranded RNA-binding domain, known as TRBP or PACT in mammals and LOQS in D. melanogaster [18,19]. Interestingly, there are no clear criteria for determining which strand is loaded into the RISC complex. As a general rule, the strand with the lower thermodynamic stability at its 5′ end will be the one loaded into the effector complex. However, in some cases, either strand can be loaded into the RISC complex, suggesting that additional factors might be required during strand selection. Moreover, the large number of Argonaute proteins present in each organism (five in D. melanogaster, 10 in A. thaliana, and 27 in C. elegans) complicates the criteria by which one strand is selected for subsequent steps and the other is degraded. On the other hand, there appears to be a clear relationship between the number of mismatches Dicer products contain, the type of Argonaute protein to which they bind, and their subsequent effects. For example, in D. melanogaster and C. elegans, Dicer products containing mismatches are more likely to bind to Ago1/ALG-1, while those with no mismatches are loaded into Ago2/RDE-1 [20,21]. Ago1-loaded miRNAs pair with their targets through only a limited region of sequence at the 5′ end of the miRNA called the “seed region,” and cause translation repression and/or degradation of their mRNA targets; Ago2-loaded miRNAs pair perfectly with their target mRNAs, causing target cleavage and degradation. However, this is not the case for human Argonaute proteins, where there is no clear link between mismatches and the specificity for different Argonaute proteins.

8.1.2 piRNAs

piRNAs are a class of short RNAs whose size (24–31 nucleotides) is distinct from that of miRNAs. They were first identified in D. melanogaster [22] as a group of small RNAs associated with a specific subfamily of germline-expressed Argonaute family proteins – the Piwi proteins (Piwi, Aubergine/Aub, and Ago3 in flies; MILI, MIWI, and MIWI2 in mice; and HILI, HIWI1, HIWI2, and HIWI3 in humans). piRNAs have 5′-monophosphates and, like plant miRNAs, they are 2′-O-methylated at their 3′ end; however, unlike miRNAs, piRNAs are Dicer-independent, and therefore they lack any 2-nucleotide 3′ overhang [23]. In C. elegans, piRNAs are shorter than in other organisms, and are named 21U because they are 21 nucleotides long and start with a uracil base [24]. piRNA biogenesis remains rather unclear, with indications that they could derive from a single long transcript, making them DNA strand-specific [25]. An initial piRNA population produced at early stages is required for the later production of more piRNAs by a mechanism known as “ping-pong” (Figure 8.1). In D. melanogaster, in which the “ping-pong” model was originally proposed, Piwi/Aub binds to a first set of antisense piRNAs derived from a transposon transcript (primary piRNAs), and targets complementary active transposons, generating secondary piRNAs in the sense orientation. Then, Ago3 binds to the sense piRNAs and cleaves the target antisense transcript, producing the 5′ end of the antisense strand that is recognized by Piwi/Aub, repeating the cycle [25]. Support for this model comes from findings that the first 10 nucleotides of antisense Aub-associated piRNAs are usually complementary to the sense Ago3-associated piRNAs. In addition, Piwi/Aub-associated piRNAs show a strong preference for uracil at their 5′ ends, whereas Ago3-associated piRNAs have a preference for adenosine at nucleotide 10. Interestingly, a “ping-pong” mechanism in mice seems to be restricted to early stages of development, where a specific set of piRNAs and Piwi proteins are coexpressed, reflecting the population complexity of piRNAs. This complexity is also reflected in their lack of conservation: unlike miRNAs, piRNAs are not conserved among species, being highly diverse and mapping to only a few specific clusters in the genome. piRNA function remains largely unknown. piRNAs are mainly expressed in the germline, and are required for genome stability and maintenance of germline stem cells, likely by silencing transposons and other repetitive elements [26].

Fig. 8.1 Schematic representation of the “ping-pong” model. Piwi/Aub containing antisense piRNAs target sense transcripts derived from transposons, resulting in the production of sense piRNAs. Sense piRNAs are recognized by Ago3 and target antisense transcripts, producing antisense piRNAs, repeating the cycle.

8.1.3 siRNAs

siRNAs were first discovered in plants as a mechanism of defense during viral infection. They are a species of small RNAs involved in post-transcriptional gene silencing or RNA interference [27]. Since then many siRNAs have been identified and classified with different names according to different criteria: their origins (exo-siRNAs are exogenously introduced by, for example, injection of double-stranded RNA or double-stranded RNA produced during viral infection and replication, whereas endo-siRNAs are endogenously produced from double-stranded RNA species found within the cell); the event that triggered their production (natural antisense transcript-derived siRNAs (natsiRNAs) are produced as a response to stress rather than viral infection [28]); or their mechanism of action (trans-acting siRNAs (tasiRNAs) regulate protein-coding genes located in a different genomic location). siRNAs have a similar size to miRNAs (20–24 nucleotides) and are involved in a wide range of biological processes. siRNAs are produced by Dicer-dependent cleavage of double-stranded RNA and therefore have a 5′-monophosphate and a 3′-hydroxyl group. Interestingly, some organisms such as D. melanogaster contain a siRNA-specific Dicer (DCR-2) [29]. Plants and worms also contain a specific class of endo-siRNAs (secondary siRNAs) that differ slightly in their biochemistry and require the action of RNA-dependent RNA polymerases (RdRPs) for their production; in some cases a miRNA is also required. After an initial cleavage step mediated by a miRNA or siRNA, RdRPs use the mRNA fragments to make further double-stranded RNAs that are then processed by Dicer. The effect is to amplify the silencing response to the original siRNA or miRNA trigger.
However, as an unprimed product of an RdRP, a secondary siRNA carries a 5′-triphosphate on its first nucleotide, instead of a 5′-monophosphate, and this alters the method of cloning for high-throughput sequencing (HTS; also called next-generation sequencing (NGS)) (see below). Like piRNAs, little is known about the origin and biogenesis of siRNAs. They have been identified in somatic tissues as well as in the germline; they can derive from several genomic locations, including protein-coding genes and sense and antisense transposons; and they can derive from long RNA transcripts that fold into a double-stranded RNA structure, or from overlapping products of two transcripts, one sense and the other antisense, that together form double-stranded RNA [30]. Functions of siRNAs also vary, from causing target RNA cleavage to stimulating heterochromatic DNA formation by recruiting repressive complexes to direct DNA methylation and repressive histone modifications [30].

8.1.4 Other Small RNAs

In addition to the three main classes of small RNAs, there are other classes of highly abundant noncoding small RNAs involved in many biological processes. This group of small RNAs includes transfer RNA (tRNA), ribosomal RNA (rRNA), small nuclear RNA (snRNA), and small nucleolar RNA (snoRNA). We have excluded these short RNAs from our discussion because HTS has contributed little information about them; in many cases these classes have been overlooked or discarded in the analysis of small RNAs.
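Two of the sequence-level criteria discussed in Section 8.1 lend themselves to a short illustration: the size ranges of Table 8.1 and the ping-pong signature of Section 8.1.2 (complementarity over the first 10 nucleotides, with uracil at position 1 of the Aub-associated piRNA and adenosine at position 10 of its Ago3-associated partner). The following Python sketch is only a minimal illustration under stated assumptions: the function names and the two example sequences are hypothetical, and real classification also relies on genomic mapping and Argonaute association, which are not modeled here.

```python
from collections import Counter

# Size ranges in nucleotides, copied from Table 8.1. The ranges overlap,
# so read length alone only narrows down the candidate classes.
SIZE_RANGES = {"miRNA": (21, 23), "siRNA": (20, 25), "piRNA": (26, 31)}


def candidate_classes(read: str) -> list[str]:
    """Return every class whose Table 8.1 size range contains the read length."""
    n = len(read)
    return [name for name, (low, high) in SIZE_RANGES.items() if low <= n <= high]


def length_profile(reads: list[str]) -> Counter:
    """Count reads per length, a common first look at a small RNA library."""
    return Counter(len(read) for read in reads)


def rna_reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))[::-1]


def is_ping_pong_pair(aub_pirna: str, ago3_pirna: str) -> bool:
    """True if the first 10 nt of the two piRNAs are reverse-complementary."""
    return aub_pirna[:10] == rna_reverse_complement(ago3_pirna[:10])


# Hypothetical sequences, invented only to exercise the functions above.
aub_read = "UAGCAUGGCA" + "GUCAGUCAGUCAGUCA"                        # starts with U
ago3_read = rna_reverse_complement(aub_read[:10]) + "CCGGAAUUCCGGAAUU"

print(candidate_classes(aub_read))               # classes compatible with 26 nt
print(is_ping_pong_pair(aub_read, ago3_read))    # 10-nt 5' overlap
print(aub_read[0], ago3_read[9])                 # 1U and 10A positions
```

Because the first 10 nucleotides of the two partners pair with each other, a uracil at position 1 of one read necessarily faces an adenosine at position 10 of the other, which is why both biases appear together.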

8.2 HTS/NGS

More than 30 years ago, two publications changed DNA sequencing technology completely. Maxam and Gilbert, using chemical modification of DNA bases followed by cleavage and gel resolution [31], and Sanger and Coulson, using the chain termination reaction [32], achieved a significant increase in the number of base pairs that could be sequenced (from 25 to 80 bp) and set the foundations for large-scale DNA sequencing. Since then, Sanger technology has been improved, aiming to increase both the number and length of reads, while simultaneously decreasing the error rate. However, limitations in the depth of coverage of sequencing (the number of times that a genome is represented in the total number of sequences) became a barrier when large genomes were sequenced or when complex populations were studied. In 2005, the coverage issue was partly resolved with the landmark publication of the sequencing-by-synthesis (SBS) technology developed by 454 Life Sciences, which allowed up to 500 000 reads of 100-bp DNA fragments, opening a new window for sequencing of complex genomes [33]. The SBS technology used pyrosequencing as a readout, which detected the pyrophosphate groups released during nucleotide incorporation, and allowed the use of less starting material [33]. In recent years, read length has increased significantly, up to 400 bp using 454 sequencing and 750 bp using Sanger sequencing, as has the depth of coverage (in some cases reaching up to 30 times the human genome). The first NGS technology used to identify small RNAs was the 454 Life Sciences sequencing platform [34], allowing the sequencing of hundreds of thousands of templates simultaneously. First in plants [34] and then in animals [35], 454 sequencing proved to be a useful tool to identify and monitor miRNA and piRNA populations. A few years later, in 2007, the first study to characterize miRNAs using the Illumina (then called Solexa) sequencing technology was published [36].
The Illumina (Solexa) platform had better coverage than the 454 platform, producing millions of short reads (of up to 36 bp). The decreased read length compared to 454 was offset by the large increase in the depth of coverage, and proved a great advantage in the discovery and profiling of complex short RNA populations. Since then, more publications have used Solexa for small RNA identification and profiling than any other platform. This holds even for similar platforms such as SOLiD, which is capable of producing several gigabases of short reads in one run, yet only one publication [37] has used the SOLiD platform to profile short RNAs. Cost per run, depth of sequencing, and ease of data analysis are the three main factors involved in the selection of any HTS technology, and therefore it is tempting to speculate that, at least for small RNA discovery, one platform has combined these three elements well, as reflected by the number of publications using it. All the methods claim high fidelity (low error rate), which is particularly important in the case of short RNAs, since the change of a single nucleotide could make a significant difference in a 22-nucleotide long sequence. However, since each technology manufacturer presents its data in a different way, it is almost impossible to make any reliable comparison between them. There are four steps to take into consideration while using any HTS approach: library preparation, template preparation, sequencing technology, and data analysis. Template preparation and sequencing technology are specific to each platform (for a review, see Metzker [38]), and the user has little, if any, control over them, leaving library preparation and data analysis to be optimized by the user.
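The two quantitative notions used above, depth of coverage and the weight of a single-base error in a 22-nucleotide read, can be made concrete with a short Python sketch. It is only an illustration: the function names are ours, the 5-Mb target size is a hypothetical example, and the error model assumes independent, uniform per-base errors, which real platforms do not strictly follow.

```python
def depth_of_coverage(n_reads: int, read_length: int, target_size: int) -> float:
    """Average number of times each base of the target is represented.

    Coverage = total sequenced bases / size of the target (genome or
    transcript set), following the definition given in the text.
    """
    return n_reads * read_length / target_size


def per_read_error_probability(per_base_error: float, length: int = 22) -> float:
    """Probability that a read of the given length carries at least one error.

    Assumes (simplification) that errors are independent and equally
    likely at every position.
    """
    return 1.0 - (1.0 - per_base_error) ** length


# 500 000 reads of 100 bp, as quoted for the first 454 runs, on a
# hypothetical 5-Mb genome:
print(depth_of_coverage(500_000, 100, 5_000_000))

# With a hypothetical 1% per-base error rate, the fraction of
# 22-nucleotide reads that contain at least one wrong base:
print(round(per_read_error_probability(0.01), 3))
```

The second function shows why fidelity matters so much for short RNAs: even a modest per-base error rate affects a sizeable fraction of 22-nucleotide reads, and a single miscalled base can turn one miRNA sequence into another.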


In this chapter, we will focus mainly on library preparation for small RNAs, with some advice on data analysis. Independent of which platform is used, library preparation for small RNAs always follows the same steps: RNA purification, small RNA isolation, adapter ligation, cDNA conversion, polymerase chain reaction (PCR) amplification, and library purification. Here, we describe a generic protocol for library preparation that can be used for any currently available platform that sequences small RNAs. Alternatively, it is possible to buy commercial kits for small RNA library preparation suitable for the different platforms.

8.3 Methods and Protocols

8.3.1 Key Reagents and Solutions

Reagents

• TRIzol (Invitrogen; cat. no. 15596-026)
• Chloroform (Sigma; cat. no. 472476)
• Isopropanol (Sigma; cat. no. I9516)
• Glycogen (Roche; cat. no. 10901393001)
• 70% Ethanol (Sigma; cat. no. 02877)
• SYBR Green (Invitrogen; cat. no. S7585)
• T4 RNA Ligase 2 (NEB; cat. no. M0242L)
• T4 RNA Ligase 1 (NEB; cat. no. M0204L)
• RNAguard (GE Healthcare; cat. no. 27-0815-01)
• SuperScript II reverse transcription kit (Invitrogen; cat. no. 18064-014)
• GeneRuler 100-bp ladder (Fermentas; cat. no. SM0241)
• 20-bp ladder Low Ladder (Sigma; cat. no. P1598-40UG)
• Phusion High-Fidelity DNA polymerase (NEB; cat. no. F-530S)
• dNTP solution set (4 × 25 mM) (Invitrogen; cat. no. 10297-018)
• Nuclease-free water (Ambion; cat. no. AM9938)
• RNase-free 2-ml microfuge tube (Ambion; cat. no. AM12425)
• 2× Gel loading buffer II (Ambion; cat. no. AM8546G)
• Phenol/chloroform/isoamyl alcohol (Ambion; cat. no. AM9730)
• TEMED (Sigma; cat. no. T9281)
• Ammonium persulfate (Sigma; cat. no. A3678)
• SequaGel/UreaGel system (National Diagnostics; cat. no. EC-833)
• AccuGel 19 : 1, 40% (National Diagnostics)

Solutions

• Elution solution (0.3 M NaOAc, pH 5.2, 0.1% sodium dodecylsulfate)
• 10% Ammonium persulfate
• 100 mM Dithiothreitol (DTT)
• 10× Tris/borate/EDTA (TBE)

Adapter Sequences

| Oligonucleotide | Sequence |
|---|---|
| 3′ RNA adapter | /5′rApp/ATCTCGTATGCCGTCTTCTGCTTG/3ddC/ |
| 5′ RNA adapter | 5′-GUUCAGAGUUCUACAGUCCGACGAUC |
| RT-primer | 5′-CAAGCAGAAGACGGCATACGA |
| PCR Primer 1 | 5′-CAAGCAGAAGACGGCATACGA |
| PCR Primer 2 | 5′-AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA |
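Before ordering oligonucleotides it can be useful to check that the sequences listed above are mutually consistent and to estimate the size of the adapter–dimer and of the final PCR product. The following Python sketch does this for the sequences as printed in this section; the helper names are ours, and the calculation assumes that the 3′ dideoxy-C block and the 5′ adenylation are not copied into the amplicon.

```python
def reverse_complement(dna: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Sequences as listed under "Adapter Sequences" (modifications omitted).
ADAPTER_3P = "ATCTCGTATGCCGTCTTCTGCTTG"
ADAPTER_5P_RNA = "GUUCAGAGUUCUACAGUCCGACGAUC"
RT_PRIMER = "CAAGCAGAAGACGGCATACGA"          # identical to PCR Primer 1
PCR_PRIMER_2 = "AATGATACGGCGACCACCGACAGGTTCAGAGTTCTACAGTCCGA"

adapter_5p_dna = ADAPTER_5P_RNA.replace("U", "T")


def suffix_prefix_overlap(primer: str, template: str) -> int:
    """Length of the longest primer suffix that is also a template prefix."""
    for k in range(min(len(primer), len(template)), 0, -1):
        if primer.endswith(template[:k]):
            return k
    return 0


# The RT-primer should anneal to the 3' adapter, i.e. match the start of
# the adapter's reverse complement.
rt_primer_anneals = reverse_complement(ADAPTER_3P).startswith(RT_PRIMER)

# PCR Primer 2 should end with the beginning of the 5' adapter; the rest
# of the primer is a 5' extension added during PCR.
overlap = suffix_prefix_overlap(PCR_PRIMER_2, adapter_5p_dna)


def amplicon_length(insert_nt: int) -> int:
    """Expected PCR product size for a small RNA insert of insert_nt nucleotides."""
    extension = len(PCR_PRIMER_2) - overlap
    return extension + len(adapter_5p_dna) + insert_nt + len(ADAPTER_3P)


print(rt_primer_anneals, overlap)
print(amplicon_length(0))    # adapter-dimer (no insert)
print(amplicon_length(22))   # product carrying a 22-nt small RNA
```

With other platform-specific adapters only the four constants need to be changed; the difference between `amplicon_length(0)` and `amplicon_length(n)` is always the insert length, which is what separates the library band from the adapter–dimer band on the gel.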

8.3.2 Total RNA Isolation

This is probably the most crucial step, since the initial RNA quality will determine the final outcome. Before starting, make sure that all materials and reagents are RNase-free. Poor-quality RNA will affect the ligation efficiency, resulting in a loss of accuracy and sensitivity. Total RNA can be extracted using commercially available kits, including column-based RNA isolation kits that enrich for small RNAs. They produce a fast and clean short RNA preparation separated from long RNAs. However, most of the kits do not completely remove tRNA or rRNA, and they have not been assayed for small RNA populations with special modifications. Here, we describe a non-column-based isolation protocol that can be applied to any organism.

• Add 10× the sample volume of TRIzol reagent to a freshly harvested pellet. (Do not use more than 1 ml per 1.5-ml tube.)
• For cultured cells, collect cells by centrifugation for 5 min at 800 × g at 4 °C and wash with ice-cold phosphate-buffered saline (PBS), pH 8.0. Centrifuge for 5 min at 800 × g and remove the supernatant before the addition of TRIzol (use 1 ml per 10-cm plate, which is usually between 5 × 10⁶ and 10 × 10⁶ cells). Incubate in TRIzol for 10 min at room temperature (22 °C).
• For tissue samples, use approximately 100 mg. Slice it into small pieces and homogenize the sample using a tight-fitting dounce homogenizer. Homogenize with 10 strokes (on ice) before the addition of TRIzol (the expected yield is about 100 µg of total RNA). Allow the homogenate to sit at room temperature (22 °C) for 10 min. (Note: If samples are not to be processed immediately, flash-freeze and store at −80 °C.)
• Add chloroform at 1/5 of the initial volume of TRIzol used. Incubate at room temperature (22 °C) for 3–5 min.
• Centrifuge at 12 000 × g for 15 min at 4 °C. After centrifugation, there will be two clearly visible phases and an intermediate one (not visible) between the upper aqueous phase and the bottom red phenol phase. Short RNAs will be in the upper aqueous phase. (Note: Avoid collecting the intermediate phase since it contains DNA.)
• Transfer the upper aqueous phase to a new tube (this should be about 60% of the initial volume of TRIzol used). Discard the TRIzol (phenol) phase appropriately.
• Add isopropanol at 1/2 of the initial volume of TRIzol used and 10 µg of glycogen to help precipitate the small RNAs. Mix well and leave overnight at −20 °C. Shorter incubation times are possible, at the risk of losing low-abundance small RNA populations.
• Centrifuge at 12 000 × g for 15 min at 4 °C.
• Wash the pellet with 70% ethanol (use less than 1 volume of the initial volume of TRIzol used). Carefully invert the tube and spin at 10 000 × g for 5 min at 4 °C. Remove the supernatant. Quick spin and remove any residual ethanol. Air-dry the pellet for 5 min at room temperature (22 °C). Resuspend the pellet in a small volume of RNase-free water.
• Estimate the RNA concentration using OD260. High-quality RNA gives an OD260/OD280 ratio of 1.8–2.0. Alternatively, run 1 µg of your total RNA in a 15% denaturing polyacrylamide gel and stain with SYBR Green to visualize your RNA quality. The presence of a continuous smear along the gel indicates RNA degradation and poor RNA quality. Good-quality RNA should show defined bands at the upper part of the gel representing intact rRNA and tRNA. It is possible to confirm the quality of your RNA by Northern blot against the rRNA or tRNA.
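The volume ratios and the quality check in the steps above can be summarized in a small helper. This Python sketch is only a convenience under stated assumptions: the function names are hypothetical, and the conversion of 40 µg/ml per OD260 unit for single-stranded RNA at a 1-cm path length is a commonly used approximation that is not given in this protocol.

```python
def trizol_volumes(trizol_ml: float) -> dict[str, float]:
    """Reagent volumes scaled to the initial TRIzol volume (ratios from the protocol)."""
    return {
        "chloroform_ml": trizol_ml / 5,          # 1/5 of the TRIzol volume
        "expected_aqueous_ml": trizol_ml * 0.6,  # about 60% recovered as aqueous phase
        "isopropanol_ml": trizol_ml / 2,         # 1/2 of the TRIzol volume
    }


def rna_conc_ug_per_ml(od260: float, dilution_factor: float = 1.0) -> float:
    """RNA concentration from OD260.

    Assumption: 1 OD260 unit corresponds to about 40 ug/ml of RNA
    (1-cm path length); this factor is not stated in the protocol.
    """
    return od260 * 40.0 * dilution_factor


def purity_ok(od260: float, od280: float) -> bool:
    """True if OD260/OD280 lies in the 1.8-2.0 range quoted for high-quality RNA."""
    return 1.8 <= od260 / od280 <= 2.0


print(trizol_volumes(1.0))
print(rna_conc_ug_per_ml(0.5, dilution_factor=10))
print(purity_ok(1.9, 1.0))
```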

8.3.3 Small RNA Isolation

Small RNAs are size-selected on a denaturing polyacrylamide gel, cut out from the gel, and eluted before their ligation to the adapters (Figure 8.2). Only a small percentage of the total RNA represents the short RNA fraction of 18–31 nucleotides (tRNA, rRNA, and other long RNAs are far more abundant). Therefore, we recommend loading 20–30 µg of total RNA per well as starting material for gel purification. However, libraries can be prepared with as little as 1 µg of starting material. In addition, RNA markers are required to identify small RNA species of the correct size. Single-stranded RNA molecules of 18, 26, and 32 nucleotides can be used as markers. These markers run alongside your samples in the gel and, when stained, delimit the gel region to be cut out.

Fig. 8.2 Isolation of small RNAs from total RNA. After total RNA extraction, small RNAs are size-selected on a 15% denaturing polyacrylamide gel and gel eluted. During this step either specific small RNA populations can be individually selected or all small RNA populations can be isolated as a whole group.

• Prepare a 15% 8 M urea gel mix. Pour the 15% urea mix into a 1-mm thick gel of 13.3 cm × 8.7 cm (15 ml of mix is required for each gel). Allow to polymerize.
• Prerun the gel for 30 min at 200 V. Wash the wells using 0.5× TBE.
• Mix equal volumes of total RNA and 2× gel loading buffer (50 mM EDTA pH 8.0, 0.05% (w/v) bromophenol blue in formamide). We recommend not exceeding a total of 25 µl/well. Heat the sample at 80 °C for 5 min, then cool quickly on ice. Centrifuge to collect the volume at the bottom of the tube and load the sample into one well. Leave at least two wells empty between samples to avoid cross-contamination.
• Load 10 µl of the 18/23/32-nucleotide synthetic RNA oligonucleotide mix (at 10 µM) as a marker in one side well of the gel.
• Run the gel at 150 V until the bromophenol blue dye reaches the bottom of the gel (about 2 h).
• Cut off the lane of the gel that contains the 18/23/32-nucleotide marker. Stain this gel fragment with SYBR Green for 5 min. Visualize the marker with a UV trans-illuminator and mark the bands. Finally, put the marker gel back together with the rest of the gel.
• From each sample lane cut out the gel region corresponding to 18–32 nucleotides with a clean razor blade and transfer it to a 1.5-ml RNase-free microcentrifuge tube.
• Add 400 µl of sterile elution solution (0.3 M NaOAc, pH 5.2, 0.1% sodium dodecylsulfate) to the microcentrifuge tube and elute the RNA by rotating overnight at 4 °C.
• Transfer the eluate to a new 1.5-ml microcentrifuge tube. Be careful not to transfer any of the gel pieces. Precipitate the RNA with an equal volume (500 µl) of 100% isopropanol and add 10 µg of glycogen to the sample. Incubate overnight at −20 °C. Centrifuge at 10 000 × g for 25 min at 4 °C.
• Carefully remove the supernatant and wash the pellet with 750 µl of room-temperature (22 °C) 80% EtOH. Allow the RNA pellet to air-dry for 5 min and dissolve the RNA in 5 µl of RNase-free water.
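The loading recommendations above (20–30 µg of total RNA per well, an equal volume of 2× loading buffer, and no more than 25 µl per well) constrain how concentrated the RNA must be. The following Python sketch works through that arithmetic; the function name and the example concentrations are hypothetical.

```python
import math

MAX_WELL_UL = 25.0  # recommended upper limit per well (RNA plus loading buffer)


def loading_plan(rna_ug: float, conc_ug_per_ul: float) -> dict[str, float]:
    """Volumes needed to load rna_ug of total RNA at the given concentration.

    The RNA is mixed with an equal volume of 2x gel loading buffer; if
    the mix exceeds the per-well limit, it has to be split over several
    wells (or the RNA concentrated first).
    """
    rna_ul = rna_ug / conc_ug_per_ul
    total_ul = 2 * rna_ul
    return {
        "rna_ul": rna_ul,
        "loading_buffer_ul": rna_ul,
        "total_ul": total_ul,
        "wells_needed": math.ceil(total_ul / MAX_WELL_UL),
    }


# Minimum concentration that lets 20 ug fit into a single well:
min_conc_single_well = 20.0 / (MAX_WELL_UL / 2)

print(loading_plan(20.0, 2.0))   # fits in one well
print(loading_plan(30.0, 1.0))   # too dilute: needs several wells
print(min_conc_single_well)      # ug/ul
```

In practice this means that RNA resuspended at less than about 1.6 µg/µl cannot deliver 20 µg in one well, which is one reason to resuspend the pellet in a small volume at the end of Section 8.3.2.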

8.3.4 Ligation of Adapters

Most platforms produce their own adapters and therefore you should check the sequence of the adapters with each company. Here, we describe a modified protocol to ligate sequence-specific adapters, originally described in Hafner et al. [39] (Figure 8.3). For the ligation reaction a 5′-mono/biphosphate group and a 3′-hydroxyl group are required – features that not all short RNA species possess. For example, secondary siRNAs in C. elegans and plants have a 5′-triphosphate as products of RdRPs, and therefore this must be converted to a 5′-monophosphate or 5′-biphosphate prior to the ligation reaction. This can be achieved by using a phosphohydrolase to remove the γ- and β-phosphates from 5′-triphosphorylated short RNAs and leave the α-phosphate required for the T4 RNA Ligase. Short RNA fragments arising from RNA degradation usually contain a 5′-hydroxyl group and a 3′-phosphate, and they will not be incorporated during library preparation unless end-repair is performed. Finally, it is important to consider that the presence of a 5′-phosphate and a 3′-hydroxyl group in small RNAs can cause their circularization and/or concatenation within the small RNA population during ligation. To prevent undesired ligation products, a chemically modified 3′ adapter together with a modified RNA ligase is used during library preparation. In the absence of ATP, the modified RNA ligase accepts as substrate only preadenylated 3′ adapters (Integrated DNA Technologies), which correspond to an intermediate product of the ligation reaction.

Fig. 8.3 Library preparation for the Illumina platform. Isolated small RNAs are ligated to the 3′ preadenylated adapter and the 5′ adapter. The ligated products are converted to cDNA and PCR amplified. Finally, libraries are gel purified in a 3.5% agarose gel. Samples A and B are libraries generated directly from total RNA extraction, resulting in several libraries of non-small-RNA sizes. C and D are libraries generated after small RNA isolation, resulting in an enrichment of small RNAs and little primer-dimer or other non-small-RNA sizes.

1. 3′ Adapter ligation

1.1 Set up the preadenylated 3′ adapter ligation reaction in a 1.5-ml RNase-free siliconized microcentrifuge tube:

| Component | Volume |
|---|---|
| Purified small RNAs | 6.5 µl |
| 3′ Preadenylated adapter (10 µM) | 0.5 µl |

1.2 Incubate at 80 °C for 2 min. Transfer immediately to ice and incubate for at least 1 min. Prepare the rest of the reaction mixture containing:

| Component | Volume |
|---|---|
| 10× T4 RNL2 truncated reaction buffer (NEB) | 1.0 µl |
| T4 RNA ligase 2 (10 U/µl) (NEB) | 1.0 µl |
| RNase inhibitor (40 U/µl) | 1.0 µl |
| Final volume | 10.0 µl |

1.3 Add the reagents to the RNA and mix gently. Incubate at 22 °C for 1–2 h or overnight at 16 °C. (Note: Some less abundant short RNAs might require longer incubation times than more abundant short RNAs to ensure their ligation.)

2. 5′ Adapter ligation

2.1 Denature the RNA by incubation for 2 min at 80 °C. Place the tube immediately on ice for 1 min.

2.2 Set up the 5′ adapter ligation reaction by adding to the 3′ adapter ligation:

| Component | Volume |
|---|---|
| 10 mM ATP | 1.0 µl |
| 5′ RNA adapter (5 µM) | 1.0 µl |
| 10× RNA ligation buffer (NEB) | 0.5 µl |
| T4 RNA ligase (10 U/µl) (NEB) | 1.0 µl |
| RNase inhibitor (40 U/µl) | 1.0 µl |
| Final volume | 5.0 µl |

2.3 Incubate at 20 °C for 2 h or overnight at 16 °C.

3. Reverse transcription of small RNAs ligated with adapters

3.1 Set up a reverse transcription reaction in a 1.5-ml RNase-free microfuge tube:

| Component | Volume |
|---|---|
| Purified ligated small RNA | 4.5 µl |
| RT-primer (100 µM) | 0.5 µl |

3.2 Heat to 65 °C for 10 min, centrifuge briefly, and cool by placing the tube immediately on ice for 1 min.

3.3 Add the following in this order:

| Component | Volume |
|---|---|
| 5× First-strand buffer (Invitrogen) | 2.0 µl |
| 12.5 mM dNTP mix | 0.5 µl |
| 100 mM DTT | 1.0 µl |
| RNase inhibitor (40 U/µl) | 0.5 µl |
| Final volume | 9.0 µl |

3.4 Heat to 48 °C for 3 min and then add 1.0 µl of SuperScript II (200 U/µl) (Invitrogen).

3.5 Incubate at 44 °C for at least 1 h.

4. PCR amplification

4.1 Set up pilot 20-µl PCR reactions from the reverse transcription samples to verify that the cloning steps worked. If everything looks good, scale up for the final PCR. The number of PCR cycles can also be optimized here. Start with 12 cycles and then go up or down depending on the strength of the product.

| Component | Volume |
|---|---|
| Reverse transcription reaction mix | 1.0 µl |
| 5× Phusion High-Fidelity buffer (Finnzymes) | 4.0 µl |
| PCR Primer 1 (25 µM) | 0.4 µl |
| PCR Primer 2 (25 µM) | 0.4 µl |
| 10 mM dNTPs mix | 0.4 µl |
| Phusion DNA polymerase (Finnzymes) | 0.2 µl |
| dH2O | 13.6 µl |
| Final volume | 20.0 µl |

4.2 PCR conditions:

| Temperature | Time | Cycles |
|---|---|---|
| 98 °C | 30 s | 1 |
| 98 °C | 10 s | 12 |
| 58 °C | 30 s | 12 |
| 72 °C | 20 s | 12 |
| 72 °C | 5 min | 1 |

4.3 Prepare a 6% TBE native gel (in a 1-mm thick 13.3 cm × 8.7 cm gel) to visualize the PCR products.

| Component | Volume |
|---|---|
| 10× TBE (final 0.5× TBE) | 1.5 ml |
| 40% Acrylamide (19 : 1 acrylamide : bis-acrylamide) | 4.5 ml |
| dH2O | 24 ml |
| Final volume | 30 ml |

4.4 Add 300 ml 10% ammonium persulfate and 10 ml TEMED, and pour immediately. Wait until completely polymerized. 4.5 Add 5 ml of 5 DNA loading buffer to 20 ml of your PCR sample and load into 6% nondenaturing gel. Also, load your 100-bp ladder along with the 20-bp Low Ladder. It is not necessary to prerun this gel. Run the gel until the xylene cyanol is two-thirds of the way down the gel. If loading dye only contains

j

133

134

j

8 Identification and Expression Profiling of Small RNA Populations Using High-Throughput Sequencing

bromophenol blue; do not run the dye to the end of the gel, otherwise the 20-bp ladder will run off the gel. It is very important to run the 100-bp ladder in order to trace the PCR product.
4.6 Stain the gel with SYBR Green for 5 min.
4.7 Visualize the gel using a UV trans-illuminator. The small RNA product runs between 20 and 30 nucleotides above the adapter–dimer band. For example, if the adapter–dimer band runs at 70 nucleotides (depending on the platform-specific adapter lengths), then expect a PCR product between 92 and 100 nucleotides. Importantly, you should see two prominent bands on the gel: the correct PCR product and the adapter–dimer band. The PCR product should be stronger and more prominent than the adapter–dimer band. If no bands can be seen on the gel, increase the number of PCR cycles; if the bands are too strong, reduce the number of cycles to avoid bias.
5. Scaled-up PCR
5.1 After determining that your small RNA cloning gave the expected product size, set up a 200-µl PCR reaction using your reverse transcription samples. The number of PCR cycles should have been empirically determined in the previous step.
5.2 Check 20 µl of PCR product on a 6% acrylamide gel as described previously.
6. Gel purification
Concentrate the PCR sample in a small volume (10–15 µl) and run it in two wells of a 3.5% low-melting-point agarose gel containing 0.5 µg/ml of ethidium bromide in 1× TAE running buffer, along with a 20-bp DNA ladder, for approximately 2 h at 80 V until the marker bands are sufficiently resolved. Visualize the DNA in the gel using a UV trans-illuminator and excise the band of approximately 92–100 bp in size (Figure 8.3). Transfer the gel slice to a 1.5-ml reaction tube and weigh it. At this point you can use any commercially available gel purification kit to extract the PCR products, or elute your samples using 0.3 M NaCl, followed by a phenol/chloroform/isoamyl alcohol (25 : 24 : 1) extraction and ethanol precipitation.
Libraries are resuspended in 10–20 µl of 10 mM Tris–HCl, pH 8.5 and are ready for sequencing. In most cases 10 µl of a 10-nM library is sufficient to generate over 10 million reads.
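The size check in step 4.7 and the library amount quoted above are simple arithmetic, sketched here in Python. The function names are hypothetical, and the 22–30 nucleotide insert range is an assumption chosen so that a 70-nucleotide adapter–dimer gives the 92–100 nucleotide product quoted in the protocol.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def expected_product_range(adapter_dimer_nt, insert_min_nt=22, insert_max_nt=30):
    """Expected PCR product size range: adapter-dimer length plus insert length.

    The insert range is an assumption; adjust it to the small RNA class of interest.
    """
    return adapter_dimer_nt + insert_min_nt, adapter_dimer_nt + insert_max_nt


def library_molecules(volume_ul, concentration_nm):
    """Number of library molecules in `volume_ul` microliters at `concentration_nm` nM."""
    moles = (volume_ul * 1e-6) * (concentration_nm * 1e-9)  # liters x mol/liter
    return moles * AVOGADRO


low, high = expected_product_range(70)
print(f"Excise the band between {low} and {high} nucleotides")
print(f"Molecules in 10 ul of a 10-nM library: {library_molecules(10, 10):.3e}")
```

The molecule count exceeds the stated read yield by several orders of magnitude, which is consistent with the statement that this amount is sufficient for a sequencing run.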

8.4 Troubleshooting

If no library is detected on the gel or by any other DNA/RNA quantification instrument (e.g., a Bioanalyzer), the main causes are:
• Pellets were not air-dried properly and residual EtOH was left on the samples, inhibiting the ligation reaction.
• RNase contamination or problems with total RNA extraction (check total RNA extraction by gel).
• Low starting material. Increase the amount of sample used for library preparation.

8.5 Applications

The introduction of NGS technologies has revolutionized the discovery and expression profiling of functional noncoding small RNAs. HTS has been used efficiently in many organisms and has allowed the identification and monitoring of even very low-abundance small RNAs that may not have been detected by more traditional sequencing methods [24,40,41].

[Figure 8.4, panel (a): let-7* reads and number of clones (n): 5′ CUAUGCAAUUUUCACCUUACC 3′ (23); 5′ CUAUGCAAUUUUCACCUUACCU 3′ (11); 5′ CUAUGCAAUUUUCACCUUACCUU 3′ (2); let-7 genomic sequence: 5′ ACCGGUGAACUAUGCAAUUUUCACCUUACCGG 3′. Panel (b): model of pre-let-7 regulation by LIN-28 and PUP-2: blockade, uridylation, degradation.]

The number of miRNAs detected in every organism tested to date has increased with the application of HTS, and some of these new sequences have been further verified by Northern blotting, adding to the complexity of miRNA-mediated regulation of gene expression. HTS has also revealed many alternative sequence isoforms of miRNAs (termed isomiRs) that are often, but not always, of lower abundance than the reference sequence in miRBase (www.mirbase.org). Very abundant isomiRs indicate that there can be errors in miRBase reference sequences due to the insufficient sequencing depth of previous small RNA cloning and sequencing technologies. Alternatively, isomiRs may change in abundance under different environmental conditions or developmental stages. Less-abundant isomiRs provide evidence of modification to the miRNA reference sequence, such as RNA editing of the double-stranded RNA pre-miRNA hairpin, variability in the cleavage sites of both the Drosha and Dicer ribonucleases, and the addition of 3′-terminal nucleotides to pre-miRNA or mature miRNA sequences. RNA editing and alternative cleavage products have the potential to change the seed sequence of miRNAs and are thus potentially important mechanisms in altering the regulation of target genes. The addition of 3′-terminal nucleotides has been associated with miRNA stability and turnover; the poly(U) polymerase PUP-2 has been shown to promote the degradation of pre-let-7 by the addition of uridine residues at the 3′ end (Figure 8.4) [42,43]. A homolog of PUP-2 in mammalian cells, Zcchc11, similarly fine-tunes miR-26a expression by adding 3′-terminal uridine residues to miR-26a to abrogate repression of interleukin-6 [44]. Another impact of HTS on low-abundance small RNAs has been in piRNA discovery. Given their diverse nonconserved nature and their extremely low abundance, HTS has provided a comprehensive view of the complexity of piRNA populations that could not have been appreciated by limited cloning of a subset.
Mapping the thousands of piRNAs to genomes has allowed hypotheses on piRNA biogenesis from genomic clusters to develop and has pointed to functions in regulating transposons in the germline. In addition, HTS has provided a more detailed profile of previously known RNAs and has allowed the detection and comparison of significant as well as subtle changes in small RNA populations during different developmental stages or during environmental changes, highlighting the importance of small RNAs as key regulators of gene expression. Such HTS profiling can point to functions for miRNAs. Kato et al. identified several miRNAs, including miR-54, that are expressed more highly in male C. elegans worms than in hermaphrodite worms [45]. Since male worms differ from hermaphrodite worms in their developmental program and behavior, the potential for some miRNAs to contribute to fundamental differences between males and hermaphrodites is one example of how detailed profiling can indicate new functions for miRNAs.
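The 3′-terminal additions discussed above can be separated from genome-templated sequence with a very simple comparison. The following Python sketch uses the let-7* reads and genomic sequence shown in Figure 8.4; the function name and the approach (treating everything after the first mismatch to the genome as a nontemplated tail) are illustrative assumptions, not the method used in the cited studies. Internal mismatches such as editing events would need a proper aligner.

```python
def nontemplated_tail(read, genome, start):
    """Return the 3' part of `read` that no longer matches `genome` from `start`.

    Simplification: the first mismatch is taken as the start of the tail.
    """
    i = 0
    while (i < len(read) and start + i < len(genome)
           and read[i] == genome[start + i]):
        i += 1
    return read[i:]


# Sequences taken from Figure 8.4.
genomic = "ACCGGUGAACUAUGCAAUUUUCACCUUACCGG"
reads = [
    "CUAUGCAAUUUUCACCUUACC",
    "CUAUGCAAUUUUCACCUUACCU",
    "CUAUGCAAUUUUCACCUUACCUU",
]

start = genomic.find(reads[0])  # position of the unmodified read in the genomic sequence
tails = [nontemplated_tail(r, genomic, start) for r in reads]
for read, tail in zip(reads, tails):
    print(read, "tail:", tail or "(none)")
```

Here the two longer reads end in uridines that are absent from the genomic sequence, which is the signature of 3′ uridylation described in the text.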

Fig. 8.4 Identification of uridine residues at the 3′ end of pre-let-7 and their biological function. (a) Frequency of unmodified and modified let-7 molecules identified by HTS. (b) A revised model of posttranscriptional regulation of let-7 in C. elegans.


HTS has played an important role in the identification of functional small RNAs in many different cell types and during their differentiation, including stem cells and carcinoma cells [46]. Human embryonic stem cells can be differentiated in vitro into embryoid bodies – accumulations of cells belonging to all three germ layers. HTS of RNA prepared before or after differentiation into embryoid bodies identified 334 known and 104 novel miRNAs, of which 171 known and 23 novel miRNAs changed their expression significantly on differentiation [46]. The predicted targets of the miRNAs enriched in either the human embryonic stem cells or the embryoid bodies shared common functions in cell differentiation, cell cycle control, programmed cell death, and transcriptional regulation. HTS has been used powerfully in combination with other molecular techniques such as immunoprecipitation to characterize and classify small RNAs based on their binding partners, helping to understand their biology and function. Immunoprecipitation of Piwi Argonaute proteins followed by HTS of associated RNAs has helped to identify piRNAs [24,47]. Immunoprecipitation of ALG-1/ALG-2 Argonaute proteins followed by HTS of associated RNAs has identified which miRNAs in C. elegans act to repress gene expression in vivo at different developmental stages. The application of HTS is linked to computational analysis. Massive HTS datasets become a problem if data processing and mapping to a reference genome are not done correctly. Today, many computational methods are available to rapidly and accurately quantify and map small RNA data (reviewed in [48]). Interestingly, bioinformatics analyses have also been used to predict some of the small RNA populations [49,50], which were later confirmed by HTS.
This crosstalk between bioinformatic prediction of short RNAs and HTS underlines the need to integrate both into one method, in which progress in one depends on progress in the other. How can we make sense of millions of reads if we are not able to manage the data? Genome mapping, clustering, and isolation of individual sequences require complex data processing that must be fast and reliable, and whose results must be presented in an interpretable and intuitive way.
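A first step in most small RNA analyses is to collapse identical reads into counts and to normalize the counts so that libraries of different depth can be compared. The Python sketch below illustrates this under simple assumptions: the reads are hypothetical toy sequences, adapters are taken to be already trimmed, and reads per million (RPM) is used as one common normalization, not necessarily the one applied in the studies cited above.

```python
from collections import Counter


def collapse_reads(reads):
    """Count how often each distinct sequence occurs in the library."""
    return Counter(reads)


def reads_per_million(counts):
    """Scale raw counts to reads per million total reads in the library."""
    total = sum(counts.values())
    return {seq: n * 1_000_000 / total for seq, n in counts.items()}


# Hypothetical toy library (adapter-trimmed reads).
library = ["UGAGGUAG"] * 6 + ["UCACCGGG"] * 3 + ["AAUGACAC"] * 1
counts = collapse_reads(library)
rpm = reads_per_million(counts)

for seq, n in counts.most_common():
    print(seq, n, round(rpm[seq]))
```

Collapsing before mapping also reduces the number of sequences that have to be aligned to the reference genome, since each distinct sequence is mapped only once.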

8.6 Perspectives

The increasing number of publications using HTS to identify small RNAs in different organisms indicates that we are far from a complete catalog of active small RNAs. Small RNAs can be classified into two major groups: conserved and nonconserved. Conserved short RNAs, such as rRNAs, tRNAs, and some miRNAs, can be easily identified across species and cell types by sequence identity without the use of HTS. Nonconserved small RNAs are more difficult to identify, since they are species- and cell-type-specific, and they require a more laborious identification process. In addition to conservation, short RNAs can be grouped based on their specific modification. For example, secondary siRNAs carry a 5′-triphosphate, which makes them impossible to clone by conventional library preparation methods, so they require end-repair. Note that any end-repair step can introduce more noise into the data by also repairing degradation products, so it requires careful mapping and data interpretation. Interestingly, even canonical ends do not guarantee identification by HTS; the lsy-6 miRNA was identified through genetic screens and is involved in the control of left–right asymmetry during neuronal differentiation in C. elegans [41], but only one of the multiple HTS libraries prepared in worms has detected it [45]. When using Illumina adapters for library preparation, the resulting ligation product of lsy-6 miRNA forms a strong predicted stem–loop structure (Figure 8.5). The predicted stem has a high GC content that could prevent the template from denaturing during the reverse transcription reaction or PCR amplification. Another possibility is that the very restricted expression and low abundance of lsy-6 miRNA falls below the detection limit with the minimum amount of material


required for HTS. However, other small RNAs also have restricted and low expression, and yet are still identified by HTS screening. Problems with restricted expression, conformation, or abundance raise the question of what other small RNAs fail to be identified by HTS and whether these issues can be resolved by increasing sensitivity, by using different sets of adapters, or by finding new ways in which libraries and templates are generated. Two critical factors during library preparation are the number of cycles used in the PCR amplification step and the efficiency of reverse transcription. Insufficient cycles can result in a poor library that fails in the sequencing reaction, while too many could introduce bias towards some sequences as well as increase the likelihood of amplification errors. Most HTS platforms (such as Illumina or 454) also have an additional amplification step during template generation, and while emerging companies like Helicos remove the need for PCR amplification in their platform, this technology is not yet applicable to small RNA discovery. In the near future, we will see new affordable approaches in the way that small RNAs are selected, with no adapters and no PCR required, and new scales of sequencing where small RNA populations can be identified using input from a single cell.
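Why additional PCR cycles increase bias can be seen from a simple exponential model of amplification, sketched below in Python. The model (copies multiply by 1 + efficiency in every cycle) is a textbook idealization, and the per-cycle efficiencies of 0.95 and 0.85 are hypothetical values chosen only for illustration; real templates also differ in ligation and reverse transcription efficiency.

```python
def fold_amplification(cycles, efficiency):
    """Idealized fold amplification after `cycles` PCR cycles.

    `efficiency` is the fraction of templates copied per cycle (1.0 = perfect doubling).
    """
    return (1.0 + efficiency) ** cycles


def relative_bias(cycles, eff_a, eff_b):
    """Ratio by which template A is over-represented relative to template B."""
    return fold_amplification(cycles, eff_a) / fold_amplification(cycles, eff_b)


# Two templates that start at equal abundance but amplify with slightly
# different (hypothetical) efficiencies.
for cycles in (0, 12, 18, 25):
    print(cycles, round(relative_bias(cycles, 0.95, 0.85), 2))
```

Under this model a small difference in per-cycle efficiency compounds with every cycle, which is why the protocol recommends determining the lowest cycle number that still gives a visible product.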

Acknowledgments

All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

Fig. 8.5 Predicted secondary structure of lsy-6 miRNA after adapter ligation. Sequences of lsy-6 miRNA and adapters were submitted to “mFold” for secondary structure prediction. The resulting folded RNA contains a strong predicted secondary structure with a long hairpin that could block the reverse transcription step or the PCR amplification step during library preparation.


References

1 Lagos-Quintana, M., Rauhut, R., Lendeckel, W., and Tuschl, T. (2001) Science, 294, 853–858.
2 Lee, R.C. and Ambros, V. (2001) Science, 294, 862–864.
3 Reinhart, B.J., Weinstein, E.G., Rhoades, M.W., Bartel, B., and Bartel, D.P. (2002) Genes Dev., 16, 1616–1626.
4 Johnston, R.J. and Hobert, O. (2003) Nature, 426, 845–849.
5 Pfeffer, S., Sewer, A., Lagos-Quintana, M., Sheridan, R., Sander, C., Grasser, F.A., van Dyk, L.F., Ho, C.K., Shuman, S., Chien, M. et al. (2005) Nat. Methods, 2, 269–276.
6 Arazi, T., Talmor-Neiman, M., Stav, R., Riese, M., Huijser, P., and Baulcombe, D.C. (2005) Plant J., 43, 837–848.
7 Watanabe, T., Takeda, A., Mise, K., Okuno, T., Suzuki, T., Minami, N., and Imai, H. (2005) FEBS Lett., 579, 318–324.
8 Wightman, B., Ha, I., and Ruvkun, G. (1993) Cell, 75, 855–862.
9 Lee, R.C., Feinbaum, R.L., and Ambros, V. (1993) Cell, 75, 843–854.
10 Kurihara, Y. and Watanabe, Y. (2004) Proc. Natl. Acad. Sci. USA, 101, 12753–12758.
11 Lee, Y., Kim, M., Han, J., Yeom, K.H., Lee, S., Baek, S.H., and Kim, V.N. (2004) EMBO J., 23, 4051–4060.
12 Gregory, R.I., Yan, K.P., Amuthan, G., Chendrimada, T., Doratotaj, B., Cooch, N., and Shiekhattar, R. (2004) Nature, 432, 235–240.
13 Han, J., Lee, Y., Yeom, K.H., Nam, J.W., Heo, I., Rhee, J.K., Sohn, S.Y., Cho, Y., Zhang, B.T., and Kim, V.N. (2006) Cell, 125, 887–901.
14 Kim, Y.K. and Kim, V.N. (2007) EMBO J., 26, 775–783.
15 Lund, E., Guttinger, S., Calado, A., Dahlberg, J.E., and Kutay, U. (2004) Science, 303, 95–98.
16 Park, M.Y., Wu, G., Gonzalez-Sulser, A., Vaucheret, H., and Poethig, R.S. (2005) Proc. Natl. Acad. Sci. USA, 102, 3691–3696.
17 Yu, B., Yang, Z., Li, J., Minakhina, S., Yang, M., Padgett, R.W., Steward, R., and Chen, X. (2005) Science, 307, 932–935.
18 Chendrimada, T.P., Gregory, R.I., Kumaraswamy, E., Norman, J., Cooch, N., Nishikura, K., and Shiekhattar, R. (2005) Nature, 436, 740–744.
19 Lee, Y., Hur, I., Park, S.Y., Kim, Y.K., Suh, M.R., and Kim, V.N. (2006) EMBO J., 25, 522–532.
20 Forstemann, K., Horwich, M.D., Wee, L., Tomari, Y., and Zamore, P.D. (2007) Cell, 130, 287–297.
21 Steiner, F.A., Hoogstrate, S.W., Okihara, K.L., Thijssen, K.L., Ketting, R.F., Plasterk, R.H., and Sijen, T. (2007) Nat. Struct. Mol. Biol., 14, 927–933.
22 Aravin, A.A., Naumova, N.M., Tulin, A.V., Vagin, V.V., Rozovsky, Y.M., and Gvozdev, V.A. (2001) Curr. Biol., 11, 1017–1027.
23 Vagin, V.V., Sigova, A., Li, C., Seitz, H., Gvozdev, V., and Zamore, P.D. (2006) Science, 313, 320–324.
24 Das, P.P., Bagijn, M.P., Goldstein, L.D., Woolford, J.R., Lehrbach, N.J., Sapetschnig, A., Buhecha, H.R., Gilchrist, M.J., Howe, K.L., Stark, R. et al. (2008) Mol. Cell, 31, 79–90.
25 Brennecke, J., Aravin, A.A., Stark, A., Dus, M., Kellis, M., Sachidanandam, R., and Hannon, G.J. (2007) Cell, 128, 1089–1103.
26 Aravin, A.A., Sachidanandam, R., Bourc’his, D., Schaefer, C., Pezic, D., Toth, K.F., Bestor, T., and Hannon, G.J. (2008) Mol. Cell, 31, 785–799.
27 Hamilton, A., Voinnet, O., Chappell, L., and Baulcombe, D. (2002) EMBO J., 21, 4671–4679.
28 Borsani, O., Zhu, J., Verslues, P.E., Sunkar, R., and Zhu, J.K. (2005) Cell, 123, 1279–1291.
29 Lee, Y.S., Nakahara, K., Pham, J.W., Kim, K., He, Z., Sontheimer, E.J., and Carthew, R.W. (2004) Cell, 117, 69–81.
30 Chung, W.J., Okamura, K., Martin, R., and Lai, E.C. (2008) Curr. Biol., 18, 795–802.
31 Maxam, A.M. and Gilbert, W. (1977) Proc. Natl. Acad. Sci. USA, 74, 560–564.
32 Sanger, F. and Coulson, A.R. (1975) J. Mol. Biol., 94, 441–448.
33 Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z. et al. (2005) Nature, 437, 376–380.
34 Henderson, I.R., Zhang, X., Lu, C., Johnson, L., Meyers, B.C., Green, P.J., and Jacobsen, S.E. (2006) Nat. Genet., 38, 721–725.
35 Girard, A., Sachidanandam, R., Hannon, G.J., and Carmell, M.A. (2006) Nature, 442, 199–202.
36 Tyler, D.M., Okamura, K., Chung, W.J., Hagen, J.W., Berezikov, E., Hannon, G.J., and Lai, E.C. (2008) Genes Dev., 22, 26–36.
37 Goff, L.A., Davila, J., Swerdel, M.R., Moore, J.C., Cohen, R.I., Wu, H., Sun, Y.E., and Hart, R.P. (2009) PLoS ONE, 4, e7192.
38 Metzker, M.L. (2010) Nat. Rev., 11, 31–46.
39 Hafner, M., Landgraf, P., Ludwig, J., Rice, A., Ojo, T., Lin, C., Holoch, D., Lim, C., and Tuschl, T. (2008) Methods, 44, 3–12.
40 Armisen, J., Gilchrist, M.J., Wilczynska, A., Standart, N., and Miska, E.A. (2009) Genome Res., 19, 1766–1775.
41 Ruby, J.G., Jan, C., Player, C., Axtell, M.J., Lee, W., Nusbaum, C., Ge, H., and Bartel, D.P. (2006) Cell, 127, 1193–1207.
42 Lehrbach, N.J., Armisen, J., Lightfoot, H.L., Murfitt, K.J., Bugaut, A., Balasubramanian, S., and Miska, E.A. (2009) Nat. Struct. Mol. Biol., 16, 1016–1020.
43 Heo, I., Joo, C., Kim, Y.K., Ha, M., Yoon, M.J., Cho, J., Yeom, K.H., Han, J., and Kim, V.N. (2009) Cell, 138, 696–708.
44 Jones, M.R., Quinton, L.J., Blahna, M.T., Neilson, J.R., Fu, S., Ivanov, A.R., Wolf, D.A., and Mizgerd, J.P. (2009) Nat. Cell Biol., 11, 1157–1163.
45 Kato, M., de Lencastre, A., Pincus, Z., and Slack, F.J. (2009) Genome Biol., 10, R54.
46 Morin, R.D., O’Connor, M.D., Griffith, M., Kuchenbauer, F., Delaney, A., Prabhu, A.L., Zhao, Y., McDonald, H., Zeng, T., Hirst, M. et al. (2008) Genome Res., 18, 610–621.
47 Batista, P.J., Ruby, J.G., Claycomb, J.M., Chiang, R., Fahlgren, N., Kasschau, K.D., Chaves, D.A., Gu, W., Vasale, J.J., Duan, S. et al. (2008) Mol. Cell, 31, 67–78.
48 Hawkins, R.D., Hon, G.C., and Ren, B. (2010) Nat. Rev., 11, 476–486.
49 Lai, E.C., Tomancak, P., Williams, R.W., and Rubin, G.M. (2003) Genome Biol., 4, R42.
50 Grad, Y., Aach, J., Hayes, G.D., Reinhart, B.J., Church, G.M., Ruvkun, G., and Kim, J. (2003) Mol. Cell, 11, 1253–1263.

9 Genome-Wide Mapping of Protein–DNA Interactions by ChIP-Seq

Joshua W.K. Ho, Artyom A. Alekseyenko, Mitzi I. Kuroda, and Peter J. Park

Abstract

Chromatin immunoprecipitation (ChIP) followed by massively parallel sequencing (ChIP-seq) has enabled generation of genome-wide maps of various types of in vivo protein–DNA interaction at an unprecedented resolution. These maps have led to many important insights into the regulatory mechanisms of gene expression through transcription factor binding and epigenetic modifications. In this chapter, we present a common ChIP-seq experimental and analysis pipeline, and give examples of how ChIP-seq can be applied to tackle various biological questions.

9.1 Introduction

Protein–DNA interaction is a hallmark of gene regulation at the genomic and epigenomic levels. At the genomic level, gene expression is controlled by the coordinated binding of the transcriptional machinery, transcription factors, enhancers, silencers, and insulators [1]. At the epigenomic level, gene expression in eukaryotes is affected by the location and modification status of the DNA packaging protein complex – the histone core particle in the nucleosome. The occupancy of nucleosomes along a genome reflects the accessibility of the local chromatin region for regulatory protein binding. Incorporation of different histone variants or covalent modification of histone proteins (usually methylation and acetylation of one or more N-terminal lysine and arginine amino acids) near the promoter or the gene body is correlated with gene activation or repression [2,3]. It is increasingly recognized that understanding the dynamics of various protein–DNA interactions can lead to novel insight into many complex biological processes. As will be described in this chapter, ChIP-seq provides a powerful approach to map these protein–DNA interaction events at a genome-wide scale (Figure 9.1). Chromatin immunoprecipitation (ChIP) is one of the most widely used biochemical methods to interrogate in vivo protein–DNA interaction. In a ChIP experiment, DNA binding proteins are typically first cross-linked to the DNA by formaldehyde. The chromatin–protein complexes are then physically sheared by sonication or enzymatically digested by micrococcal nuclease into smaller fragments of 100–600 bp. The target DNA fragments (which are cross-linked to the protein of interest) are immunoprecipitated (precipitation of a protein out of a solution) by an antibody specific to that target protein. Subsequently, the cross-links are reversed by incubating the solution at high temperature to release the deproteinized DNA fragments. 
This resulting DNA sample should be enriched for fragments originally bound by the target proteins in vivo. In mapping histone modifications, it is also common to perform ChIP without cross-linking, which is sometimes referred to as

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.


Fig. 9.1 Principle of protein–DNA interaction mapping using ChIP-seq. ChIP-seq is used to map transcription factor (TF) binding (green profile), nucleosomes with a covalently modified histone (purple profile), and bulk nucleosomes (orange profile).

[Figure 9.1 labels: histone with a modified tail, nucleosome, TF, DNA; ChIP-seq profiles for transcription factor, histone modification, and bulk nucleosome.]

native ChIP. Prior to the widespread availability of high-throughput profiling technologies, the ChIP-enriched DNA sample was analyzed by polymerase chain reaction (PCR) (ChIP-PCR) to verify enrichment compared to non-ChIP-enriched control DNA samples. However, this low-throughput approach requires prior knowledge of the target DNA binding sites, which limits its application as a systematic tool for genome-wide identification of candidate DNA–protein interaction sites. In the early 2000s, a new approach emerged that harnesses the high-throughput nature of microarray technology for analyzing ChIP-enriched DNA. Using this ChIP-chip approach, a number of early studies demonstrated the power of this technology to perform genome-wide mapping of binding sites of key transcription factors [4,5]. Soon, ChIP-chip gained an important role in a variety of studies to identify functional genomic elements. Currently, a number of commercial companies manufacture microarrays to be used in ChIP-chip experiments for a number of species. Nonetheless, ChIP-chip has its own limitations, including the requirement that the sequences of potential binding regions must be known. This limits its application to organisms for which microarrays are available and excludes nonmodel organisms, such as most microorganisms and plants. Furthermore, the probe coverage of a tiling microarray is often limited to nonrepetitive regions for which unique probes can be designed. Another hindrance to the application of ChIP-chip is the high cost of genome-wide analysis of relatively large genomes (e.g., mammalian genomes). Often 10–20 tiling microarrays are required to cover one mammalian genome at high resolution. A number of alternative sequencing-based methods were suggested to analyze ChIP-enriched DNA.
They include serial analysis of chromatin occupancy (SACO) [6], serial analysis of binding elements (SABE) [7], sequence tag analysis of genomic enrichment (STAGE) [8], and ChIP coupled with paired-end ditag sequencing (ChIP-PET) [9]. All of these methods initially relied on Sanger sequencing technology. A common feature of these methods is that they all require additional experimental manipulations of the ChIP-enriched DNA, such as cloning or sequence concatenation prior to sequencing. Therefore, it was generally expensive, laborious, and time-consuming to conduct large-scale genome-wide analysis on mammalian genomes using these techniques. The advent of massively parallel sequencing technology, also known as next-generation sequencing (NGS), gave rise to a more straightforward approach to analyze ChIP-enriched DNA: simply sequence one end (or both ends) of the ChIP DNA fragments, then map all the resulting short sequence reads to a reference genome to construct a protein–DNA interaction map. This was the birth of ChIP-seq. A number of initial applications of ChIP-seq were reported in mid-2007. They included the genome-wide mapping of histone modifications in human T cells [10] and transcription factor binding in human cell lines [11,12]. These studies demonstrated the power of ChIP-seq in genome-wide profiling of protein–DNA interactions in a cost-effective manner. In the rest of this chapter, we present an experimental and analysis pipeline, and review some recent applications of ChIP-seq in the context of addressing different biological questions.
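The idea of turning mapped single-end reads into an interaction map can be illustrated with a minimal Python sketch. It assumes that reads have already been aligned and are given as hypothetical (5′ position, strand) pairs on one chromosome, and it extends each read to an assumed average fragment length in its sequencing direction before counting coverage. Real pipelines use dedicated aligners and peak callers; this only shows the principle.

```python
def coverage_profile(reads, genome_length, fragment_length=200):
    """Per-base fragment coverage from mapped single-end reads.

    `reads` holds (five_prime_position, strand) pairs with 0-based positions.
    Each read is extended to `fragment_length` (an assumed average fragment
    size) in its own direction: rightwards for '+', leftwards for '-'.
    """
    coverage = [0] * genome_length
    for position, strand in reads:
        if strand == "+":
            lo, hi = position, position + fragment_length
        else:
            lo, hi = position - fragment_length + 1, position + 1
        for i in range(max(lo, 0), min(hi, genome_length)):
            coverage[i] += 1
    return coverage


# Toy example: three reads around one hypothetical binding site.
toy_reads = [(2, "+"), (4, "+"), (10, "-")]
profile = coverage_profile(toy_reads, genome_length=20, fragment_length=5)
print(profile)
print("highest coverage at position", profile.index(max(profile)))
```

Because reads from the two strands flank the bound site, extending them towards each other makes the coverage peak fall between the forward and reverse read clusters.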

9.2 Methods and Protocols

Fig. 9.2 Summary of the key steps in a ChIP-seq experiment.

[Figure 9.2 flowchart:
• Antibody validation
• Chromatin immunoprecipitation (ChIP): formaldehyde cross-linking of cultured cells; chromatin preparation and sonication
• ChIP and input DNA preparation: ChIP DNA (immunoprecipitation, DNA isolation); input DNA (DNA isolation)
• Library preparation (Illumina): ChIP and input DNA repair; addition of ‘A’ base to 3′ ends; Illumina adaptor ligation; PCR and gel size-selection of DNA]

There are three main steps in a ChIP-seq experiment: antibody validation, ChIP, and preparation of the sequencing library (Figure 9.2). Experimental protocols for these steps, along with some notes on data analysis, are presented in this section.

9.2.1 Antibody Validation

Testing antibody specificity can be challenging, but it is essential for any successful ChIP experiment. Antibody affinity for target proteins can vary widely, as does the abundance of their targets and the relative amounts of relevant or irrelevant cross-reactive proteins. In addition, perfect negative control experiments are rarely feasible (e.g., immunoprecipitation from a mutant cell line). Antibodies specific to posttranslationally modified proteins such as histones are particularly difficult to generate and validate. Finally, antibodies that work well for one procedure such as Western blotting or immunofluorescence may not perform well in ChIP assays and vice versa. In general, we favor the following analysis:
1. Immunoblotting of the target cell nuclear extract, comparing control cells with cells treated by RNA interference (RNAi) against the gene of interest. There should be a clear depletion of the appropriately sized protein band in the knockdown lane, with minimal background bands remaining.
2. If antibodies do not generate a robust signal on Western blots, immunofluorescence of target cells ± RNAi knockdown may also be used to assess antibody specificity.
3. To further increase confidence in the specificity of a ChIP result, it is most informative to perform ChIP with a second antibody directed to a different part of the protein of interest, followed by assessment of the consistency of the results. Another way to accomplish the same goal is to express a functional epitope-tagged version of the protein of interest, and use the tag to perform ChIP and assess consistency.

9.2.2 ChIP

In the last decade, a variety of protocols have been generated to perform ChIP in cells from different organisms. Most of the protocols utilize formaldehyde fixation of tissue culture cells because these represent more or less homogeneous populations and are relatively easy to work with. Here, we describe a ChIP protocol that was successfully used to map a variety of modified histones and other chromatin-associated proteins as a part of the Drosophila modENCODE project (www.modencode.org). The procedure has been successfully adapted for a number of human cultured cells. For optimal results, the amount of formaldehyde and fixation time as well as the amount of each antibody should be defined experimentally. (As an alternative, a dual cross-linking protocol has been proposed to improve the quality of ChIP [13].) All volumes below are given for 1 × 10^9 cells, enough for 20 immunoprecipitations. To obtain optimal results, we recommend the use of a large


Table 9.1 Composition of several buffers required for ChIP.

Buffer: Composition
ChIP Wash A: 10 mM HEPES, pH 7.6, 10 mM EDTA, pH 8.0, 0.5 mM EGTA, pH 8.0, 0.25% Triton X-100
ChIP Wash B: 10 mM HEPES, pH 7.6, 100 mM NaCl, 1 mM EDTA, pH 8.0, 0.5 mM EGTA, pH 8.0, 0.01% Triton X-100
RIPA buffer: 140 mM NaCl, 10 mM Tris–HCl, pH 8.0, 1 mM EDTA, pH 8.0, 1% Triton X-100, 0.1% SDS, 0.1% DOC
RIPA blocking buffer: for 1 ml of the buffer add 10 µl of 10 mg/ml IgG-free BSA (Sigma) and 10 µl of 100 mg/ml yeast tRNA (Invitrogen) into 980 µl of RIPA buffer
LiCl ChIP buffer: 250 mM LiCl, 10 mM Tris–HCl, pH 8.0, 1 mM EDTA, pH 8.0, 0.5% NP-40, 0.5% DOC
TEN 140 buffer: 10 mM Tris–HCl, pH 8.0, 1 mM EDTA, pH 8.0, 140 mM NaCl
1× PBS, pH 7.4: 137 mM NaCl, 2.7 mM KCl, 4.3 mM Na2HPO4, 1.47 mM KH2PO4
TE: 10 mM Tris–HCl, pH 8.0, 1 mM EDTA, pH 8.0

SDS, sodium dodecylsulfate; DOC, sodium deoxycholate; BSA, bovine serum albumin.
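The compositions in Table 9.1 can be converted into pipetting volumes with the dilution relation C1 × V1 = C2 × V2. The Python sketch below does this for RIPA buffer. The function name is hypothetical, and the stock concentrations are assumptions: 5 M NaCl and 10% Triton X-100, SDS, and DOC stocks appear in the protocol, whereas the 1 M Tris–HCl and 0.5 M EDTA stocks are common laboratory values not given in the source.

```python
def stock_volumes(total_ml, recipe):
    """Volume of each stock (ml) needed for `total_ml` of buffer.

    `recipe` maps a component name to (final, stock) concentrations, both in
    the same unit for that component (mM or %). Water makes up the remainder.
    """
    volumes = {
        name: round(total_ml * final / stock, 3)
        for name, (final, stock) in recipe.items()
    }
    volumes["water"] = round(total_ml - sum(volumes.values()), 3)
    return volumes


# RIPA buffer from Table 9.1; stock concentrations are assumptions (see text).
ripa = {
    "NaCl": (140, 5000),            # mM final, 5 M stock
    "Tris-HCl pH 8.0": (10, 1000),  # mM final, assumed 1 M stock
    "EDTA pH 8.0": (1, 500),        # mM final, assumed 0.5 M stock
    "Triton X-100": (1, 10),        # % final, 10% stock
    "SDS": (0.1, 10),               # % final, 10% stock
    "DOC": (0.1, 10),               # % final, 10% stock
}

volumes = stock_volumes(100, ripa)
for name, ml in volumes.items():
    print(f"{name}: {ml} ml")
```

The same function can be reused for the other buffers in the table by swapping in their components and the stock solutions available in the laboratory.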

amount of highly concentrated chromatin per antibody. Any modification of the initial cell amount requires subsequent volume adjustment. The composition of the buffers required in this protocol can be found in Table 9.1.
1. Formaldehyde cross-linking of cultured cells
For Drosophila (S2) cells:
1.1 Grow four 50-ml cell cultures in large T-flasks (225 cm2) to a density of 5 × 10^6 cells/ml (1 × 10^9 cells).
1.2 Add 2.5 ml 37% formaldehyde (Sigma) directly to the culture, to a final concentration of 1.8%. Incubate for 10 min at 25 °C on a rocking platform. Stop the reaction by adding 6 ml 1.25 M glycine (pH 7.0) to each culture. Incubate for 5 min on ice.
1.3 Transfer cell suspension into 50-ml Falcon tubes.
For human (NTERA) cells [14]:
1.4 Grow cells in 30 ml medium in large T-flasks (175 cm2) to confluence (2 × 10^6 cells/ml).
1.5 Add 37% formaldehyde (Sigma) to a final concentration of 1% and incubate 10 min at 37 °C on a rocking platform. Stop the reaction by adding 1.25 M glycine (pH 7.0) to each culture to a final concentration of 125 mM.
1.6 Dislodge the cells from the flask with a scraper and transfer cell suspension into 50-ml Falcon tubes.
2. Washing steps
2.1 Pellet fixed cells by spinning for 2 min at 2000 × g, 4 °C. Use a swinging bucket rotor.
2.2 Resuspend cells in 40 ml of 1× PBS.
2.3 Pellet the cells in a 50-ml Falcon tube by spinning for 2 min at 2000 × g, 4 °C.
2.4 Resuspend the cells in 40 ml ChIP Wash A solution. Wash fixed cells 10 min at 4 °C using a rotating wheel. Pellet as above.
2.5 Thoroughly resuspend fixed cells in 40 ml ChIP Wash B solution. Wash cells for 10 min at 4 °C using a rotating wheel.
2.6 Pellet the cells as above; remove the supernatant and quick-freeze the pellet in liquid nitrogen. Store the frozen pellet at −80 °C.
3. Chromatin preparation and sonication
3.1 Resuspend cross-linked cell pellet in ChIP Wash B solution. Use 15 ml of solution for 1 × 10^9 cells (all volumes below are given for 1 × 10^9 cells, adjust accordingly).
3.2 Homogenize cells using a Dounce homogenizer and 10 strokes with a tight-fitting pestle. Pellet cells by spinning 3 min at 2000 × g, 4 °C.
3.3 Add 5 ml ice-cold TE to the cell pellet. Resuspend the pellet by pipetting to a homogeneous suspension. Bring the volume of suspension to 13.5 ml with ice-cold TE. Add 1.5 ml 10% SDS to the suspension (final 1% SDS in TE).


3.4 Mix by inverting the tube 5 times. DO NOT PIPETTE! Immediately pellet the cells by spinning 3 min at 2000 × g, 4 °C. (Note: For different cell types the time or speed of centrifugation could be increased up to 5 min and 3000 × g.)
3.5 Carefully remove the supernatant (loose pellet!) and add ice-cold TE to bring the volume to 15 ml. Mix by inverting the tube no more than 5 times. Immediately pellet cells by spinning 3 min at 2000 × g, 4 °C.
3.6 Again, carefully remove the supernatant and add ice-cold TE to bring the volume to 15 ml. Mix by inverting the tube no more than 5 times. Immediately pellet cells by spinning 3 min at 2000 × g, 4 °C.
3.7 Carefully remove the supernatant and add ice-cold TE-PMSF (10 mM Tris–HCl, pH 8.0, 1 mM EDTA, pH 8.0, 1 mM phenylmethylsulfonyl fluoride (PMSF)) to bring the volume of suspension to 10 ml. Add 100 µl 10% SDS to obtain a concentration of 1 × 10^8 cells/ml in 0.1% SDS in TE-PMSF.
3.8 Using a Bioruptor (Diagenode), sonicate five 2-ml aliquots in 15-ml polystyrene conical tubes (Falcon). Fill the unoccupied space with a sixth tube containing 2 ml of water.
3.9 Perform 10-min sonication sessions. Each session should consist of 0.5-min sonication pulses alternated with 0.5-min pauses. Use the “high” power setting to generate DNA fragments of 300–500 bp. The amount of sonication should be optimized for each new cell type, but typically, we perform 2.5 sessions for Drosophila cells and 4 sessions for human cells. In between sessions, replenish ice and allow cooling.
3.10 Combine all lysates in one tube and add sequentially: 1.0 ml 10% Triton X-100 (1% final), 100 µl 10% DOC (0.1% final), 280 µl 5 N NaCl (140 mM final). Mix 2 min between each addition at 4 °C.
3.11 Mix lysate an additional 10 min on a rotating wheel at 4 °C.
3.12 Transfer lysate to Eppendorf tubes. Spin 5 min at 4 °C, maximum speed.
3.13 Combine supernatants, mix, and remove 100 µl for input DNA and estimation of the extent of sonication.
3.14 Aliquot sonicated chromatin in 20 0.5-ml aliquots. Freeze aliquots in liquid N2 and keep at −80 °C.
4. Preparation of Protein A Sepharose 4 Fast Flow (PAS) (GE Healthcare) beads
4.1 Use 40 µl PAS beads (50% suspension in RIPA (–PMSF)) for each ChIP reaction. (Note: To reduce nonspecific background incubate 100 µl of PAS beads with 1 ml RIPA blocking buffer containing 10 µl of 10 mg/ml IgG-free BSA (Sigma) and 10 µl of 100 mg/ml yeast tRNA (Invitrogen).)
4.2 Incubate beads 4–6 h at 4 °C with rocking.
4.3 Pellet the beads for 3 min at 1000 × g at 4 °C.
5. Input DNA isolation
5.1 To proceed with isolation of input DNA add 260 µl RIPA to the 100-µl chromatin aliquot. Add 4 µl 10 mg/ml DNase-free RNase A (Roche), incubate 30 min at 37 °C.
5.2 Add 20 µl 10% SDS and 20 µl 20 mg/ml Proteinase K (PCR grade; Roche), incubate overnight at 37 °C, then at 65 °C for 6 h.
5.3 Extract samples with 400 µl of phenol/chloroform by vortexing for 30 s, centrifuge 10 min at 16 000 × g, 25 °C.
5.4 Extract samples with 400 µl of chloroform by vortexing 30 s, then spin 10 min at 16 000 × g, 25 °C. Collect aqueous phase and add 1/10 volume 3 M NaAc, pH 5.0 and 10 µl 5 mg/ml glycogen (Applied Biosystems). Precipitate DNA with 2.5 volumes 100% EtOH at −20 °C overnight.
5.5 Spin 30 min, 16 000 × g at 4 °C. Wash the pellet in 500 µl 70% EtOH.
5.6 Spin 30–45 min, 16 000 × g at 4 °C. Air-dry pellet 10 min at 25 °C and dissolve in 100 µl pure H2O.
5.7 Add 2 µl 10 mg/ml RNase, incubate 30 min at 37 °C.
5.8 Perform phenol/chloroform extraction and DNA precipitation as described above.


5.9 Dissolve input DNA in 50 µl pure H2O.
5.10 Check 1–2 µg of input DNA on 1.2% agarose gel to determine the extent of sonication. You expect a distribution of DNA fragment sizes from 100 to 5000 bp with a peak around 300–500 bp.
6. Immunoprecipitation procedure
6.1 We use 500 µl chromatin per antibody for ChIP. (Note: For abundant antigens and/or high-quality antibodies, this amount may be more than sufficient, but for less abundant proteins we find that use of this high amount of material is important.)
6.2 To preclear the chromatin, we add 500 µl of cross-linked chromatin to 40 µl of PAS beads. Incubate 1 h at 4 °C.
6.3 Spin suspensions for 3 min at 1000 × g, 4 °C. Transfer supernatants to new tubes. Add 5 µl of 100 mM PMSF solution in isopropanol to each 500 µl aliquot of precleared chromatin.
6.4 Add appropriate amount of antibody to each reaction (1–10 µg, depending on the antibody). Incubate overnight at 4 °C on rotating wheel.
6.5 Add 40 µl of PAS (50% suspension in RIPA (–PMSF)), incubate 2–3 h at 4 °C on rotating wheel.
6.6 Wash the beads 5 times for 10 min each with 1 ml RIPA buffer, then once with 1 ml LiCl ChIP buffer and finally twice with 1 ml TE. To pellet the beads between washes spin samples 3 min at 1000 × g, 4 °C. Perform all washes at 4 °C.
6.7 Resuspend PAS beads in 200 µl TE, add 0.5 µl of 10 mg/ml DNase-free RNase A (Roche), incubate 30 min at 37 °C.
6.8 Add 15 µl of 10% SDS and 15 µl of 20 mg/ml Proteinase K (PCR grade; Roche). Incubate overnight at 37 °C.
6.9 Transfer samples to 65 °C, incubate 6 h.
6.10 Add 9 µl 5 M NaCl. Extract samples with 270 µl phenol/chloroform by vortexing for 30 s, centrifuge 10 min at 16 000 × g, 25 °C, take aqueous phase, back-extract organic phase with 250 µl of TEN 140 buffer. Combine aqueous phases.
6.11 Extract samples with 500 µl chloroform by vortexing for 30 s, centrifuge 10 min at 16 000 × g, 25 °C. Transfer the upper aqueous phase into a new tube. Add 1/10 volume of 3 M NaAc, pH 5.0 and 10 µl of 5 mg/ml glycogen (Applied Biosystems). Precipitate DNA with 2.5 volumes of 100% EtOH at −20 °C overnight.
6.12 Spin for 30 min, 16 000 × g at 4 °C. Wash the pellet in 500 µl of 70% EtOH.
6.13 Spin for 30–45 min, 16 000 × g at 4 °C. Air-dry the pellet 10 min at 25 °C. Dissolve the pellet in 20 µl pure H2O.

9.2.3 Sequencing Library Preparation

Currently, the most popular platform for ChIP-seq is the Illumina sequencer (Genome Analyzers and HiSeq 2000). Here, we describe a protocol for this platform. Protocols for SOLiD and Helicos platforms are available elsewhere [15,16]. Illumina sells a ChIP-seq kit containing all the reagents as well as a detailed protocol for ChIP library preparation. However, this kit makes library preparation relatively expensive. Several significant optimization steps as well as the use of improved reagents for the Illumina-supplied protocol have recently been reported [17]. The ChIP-seq paired-end protocol described below comes from Michael Snyder's group [18], with minor modifications of our own. As noted in [17], libraries prepared using paired-end adapters are compatible with both single- and paired-end flow cells on the Illumina platform. This protocol uses Illumina adapters and PCR primers together with commercially available non-Illumina reagents, which allows users to significantly reduce the cost of library preparation.


1. ChIP and input DNA repair
1.1 Repair the DNA by using the End-It DNA End-Repair kit (Epicentre).
1.2 Prepare the following reaction mix:

    ChIP-enriched DNA or 10–50 ng of input DNA    1–34 µl
    10× End-Repair Buffer                         5 µl
    2.5 mM dNTP Mix                               5 µl
    10 mM ATP                                     5 µl

1.3 Bring the total reaction volume to 49 µl with sterile water.
1.4 Add End-Repair Enzyme Mix (1 µl). (Note: The total volume should now be 50 µl.)
1.5 Incubate at 25 °C for 45 min.
1.6 Follow the instructions in the QIAquick PCR Purification Kit (Qiagen) to purify on one QIAquick column, eluting in 34 µl of EB.
2. Addition of "A" base to 3′ ends
2.1 Prepare the following reaction mix (total volume should be 50 µl):

    DNA samples                                   34 µl
    Klenow buffer (NEB)                           5 µl
    1 mM dATP                                     10 µl
    Klenow fragment (NEB) (3′→5′ exo−; 5 U/µl)    1 µl

2.2 Incubate at 37 °C for 30 min.
2.3 Follow the instructions in the MinElute PCR Purification Kit (Qiagen) to purify on one MinElute column, eluting in 10 µl of EB.
3. Illumina adapter ligation
3.1 Use the LigaFast Rapid DNA Ligation System (Promega) to ligate adapters for Solexa paired-end sequencing to the DNA fragments. Also dilute the adapter oligo mix 1:10 with water to adjust for the smaller quantity of DNA.
3.2 Prepare the following reaction mix (total volume should be 30 µl):

    DNA sample                                        10 µl
    DNA ligase buffer                                 15 µl
    RNase-, DNase-free water                          2 µl
    Paired-end adapter oligo mix (Illumina) (1:10)    1 µl
    DNA ligase                                        2 µl

3.3 Incubate at 25 °C for 15 min.
3.4 Follow the instructions in the MinElute PCR Purification Kit (Qiagen) to purify on one MinElute column, eluting in 10 µl of EB.
3.5 Load DNA samples on a 2% agarose TAE gel (SeaPlaque GTG Agarose, Lonza). (Note: This step is optional. This step may result in lower complexity of your sample. In many cases the samples do not require gel size selection and can go directly to PCR amplification. In some cases, such as with a low amount of DNA, you may need to include this step to avoid excessive primer dimers in the next PCR step.)
3.6 Run gel at 120 V for 60 min.
3.7 View the gel on a Dark Reader trans-illuminator (Clare Chemical Research) to avoid exposure to UV light.
3.8 Cut a gel slice to isolate DNA in the 200- to 450-bp range (product may not be visible at this stage).


3.9 Use a Qiagen Gel Extraction Kit (Qiagen) to purify the DNA from the agarose slices. (Note: Instead of 50 °C, as recommended by Qiagen, melt gel slice at 25 °C for 10 min with occasional vortexing.) Elute DNA in 30 µl.
4. PCR and gel size selection of the DNA
4.1 Prepare the following PCR mix (total volume should be 50 µl):

    DNA samples                                                  23 µl
    Paired-end primer 2.0                                        1 µl
    Paired-end primer 1.0                                        1 µl
    Phusion High-Fidelity PCR master mix with HF buffer (NEB)    25 µl

4.2 Amplify using the following PCR protocol:

    98 °C    30 s
    98 °C    10 s   }
    65 °C    30 s   } 17 cycles
    72 °C    30 s   }
    72 °C    5 min
    4 °C     hold

4.3 Follow the instructions in the MinElute PCR Purification Kit (Qiagen) to purify on one MinElute column, eluting in 15 µl of EB. (Note: Alternatively, SPRI beads (Beckman Coulter Genomics) can be used for PCR clean-up.)
4.4 Load DNA samples on the 2% agarose TAE gel (SeaPlaque GTG Agarose, Lonza). Perform gel size purification as described above. (Note: At this step, gel size selection is vital because the DNA library must be free from potential primer or adapter dimers.)
4.5 Measure the DNA concentration (ng/µl) at A260/A280 by NanoDrop spectrophotometer. (Note: For more accurate DNA quality and quantity analysis use the Agilent Technologies 2100 Bioanalyzer.) DNA is now ready for sequencing.
Note: To completely get rid of agarose traces, an additional round of MinElute column purification can be performed (as above).
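
As a quick arithmetic check on the reaction tables in this section, the following Python sketch sums the component volumes of each mix and computes final concentrations by simple dilution. It is not part of the published protocol: the dictionary layout and the function names (`mix_volume`, `final_concentration`) are illustrative choices, and the end-repair entry assumes that DNA plus sterile water together occupy 34 µl, as implied by steps 1.2 to 1.4.

```python
# Component volumes in microlitres, transcribed from the reaction tables above.
# Names and structure are illustrative only; they are not part of the protocol.
REACTION_MIXES = {
    "end_repair": {
        "expected_total": 50.0,
        "components": {
            "DNA + sterile water": 34.0,  # assumption: DNA and water together fill 34 µl
            "10x End-Repair Buffer": 5.0,
            "2.5 mM dNTP mix": 5.0,
            "10 mM ATP": 5.0,
            "End-Repair Enzyme Mix": 1.0,
        },
    },
    "a_tailing": {
        "expected_total": 50.0,
        "components": {
            "DNA sample": 34.0,
            "Klenow buffer": 5.0,
            "1 mM dATP": 10.0,
            "Klenow fragment": 1.0,
        },
    },
    "adapter_ligation": {
        "expected_total": 30.0,
        "components": {
            "DNA sample": 10.0,
            "DNA ligase buffer": 15.0,
            "water": 2.0,
            "adapter oligo mix (1:10)": 1.0,
            "DNA ligase": 2.0,
        },
    },
    "pcr": {
        "expected_total": 50.0,
        "components": {
            "DNA sample": 23.0,
            "paired-end primer 2.0": 1.0,
            "paired-end primer 1.0": 1.0,
            "Phusion master mix": 25.0,
        },
    },
}


def mix_volume(name):
    """Sum of all component volumes (µl) for the named reaction mix."""
    return sum(REACTION_MIXES[name]["components"].values())


def final_concentration(stock_conc, stock_volume, total_volume):
    """Concentration after dilution, in the same unit as stock_conc (C1*V1 = C2*V2)."""
    return stock_conc * stock_volume / total_volume


if __name__ == "__main__":
    for name, mix in REACTION_MIXES.items():
        total = mix_volume(name)
        status = "ok" if total == mix["expected_total"] else "MISMATCH"
        print(f"{name}: {total} µl ({status})")
    # Final dNTP concentration in the end-repair reaction, in mM.
    print(final_concentration(2.5, 5.0, 50.0))
```

By the same dilution arithmetic, the end-repair reaction contains 0.25 mM dNTPs and 1 mM ATP, and the A-tailing reaction contains 0.2 mM dATP.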

9.2.4 Data Analysis

Analysis of ChIP-seq data is not a trivial task. An overview of a data analysis pipeline is shown in Figure 9.3(a). Currently, between 20 and 50 million sequence reads are generated in a single lane of Illumina's Genome Analyzer before filtering. A number of specialized bioinformatics tools have been developed to efficiently align short reads to a reference genome [19]. Commonly, about 50–80% of the reads can be uniquely mapped (allowing up to a few mismatches) to the reference genome. A low proportion of mappable reads (e.g., below 20%) indicates potential problems with the experiment. The aligned sequence reads from the ChIP DNA and input DNA libraries can be directly visualized using a genome browser (e.g., the Integrated Genome Browser [20]), which can aid visual identification of unusual spatial distributions of reads, such as an unexpectedly large number of copies of the same sequence due to amplification error or low ChIP enrichment. In a single-end ChIP-seq experiment, sequence reads from the positive and negative strands are usually shifted or extended towards their 3′ ends by a certain number of bases (usually half the estimated average fragment length) such that the location with the highest read density corresponds to the center of the ChIP-protected region (Figure 9.3b). One way to estimate the optimal "shift distance" is by analyzing the strand cross-correlation profile [21]. Signal normalization and smoothing are also commonly applied. A number of "peak-calling" algorithms have been developed to identify sharp enrichment regions ("peaks") in a ChIP-seq profile [22], although it has been noted that different algorithms can generate significantly different peaks [23]. Therefore, it may be instructive to check the stability of the identified peaks using a number of different algorithms under a range of parameter values. Visual comparison of the identified peaks along with the density profiles of the ChIP DNA and input DNA samples can also aid in detection of erroneous calls in addition to giving a sense of the data quality (Figure 9.3c).

Depending on the aim of the study, many types of downstream analysis can be carried out. For transcription factor or regulatory protein ChIP, it is common to perform de novo motif discovery using the set of binding sequences under those peaks identified by ChIP-seq. For analysis of histone modification or global regulatory protein ChIP (e.g., RNA polymerase II and CREB-binding protein), it is also useful to analyze ChIP-seq profiles with respect to known key genomic features, such as transcription start and end sites (TSSs and TESs, respectively), exon–intron boundaries, and evolutionarily conserved sites (Figure 9.3d).

[Figure 9.3. Panel (a) box labels: alignment; preprocessing (quality assessment, filtering, read shifting, background correction, normalization); construct tag density profile; discover enrichment region; visualization; average profile at genomic features; integration with other data; gene set enrichment analysis; de novo motif discovery; epigenetic signature discovery. Panel (c) tracks: CTCF, RNA PolII, H3K9Me3, H3K4Me3, peaks of H3K4Me3. Panel (d): scaled density of H3K4Me1 and RNA PolII against relative distance to TSS (−2000 to 2000).]
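
The read-shifting idea described in this section can be illustrated with a minimal Python sketch. It is a simplified stand-in, not the procedure of [21]: the score is a raw count of coinciding read starts rather than a normalized correlation, reads are represented only by the coordinate of their 5′ end, and all function names and the toy coordinates are hypothetical.

```python
from collections import Counter


def strand_cross_correlation(plus_starts, minus_starts, max_shift):
    """For each shift, count plus/minus 5'-end pairs that coincide
    once plus-strand positions are moved downstream by that shift."""
    plus_counts = Counter(plus_starts)
    minus_counts = Counter(minus_starts)
    scores = {}
    for shift in range(max_shift + 1):
        scores[shift] = sum(
            n * minus_counts.get(pos + shift, 0)
            for pos, n in plus_counts.items()
        )
    return scores


def estimate_fragment_length(plus_starts, minus_starts, max_shift):
    """Shift with the highest score, taken as the fragment length estimate."""
    scores = strand_cross_correlation(plus_starts, minus_starts, max_shift)
    return max(scores, key=scores.get)


def shifted_density(plus_starts, minus_starts, fragment_length):
    """Move every read half a fragment length towards its 3' end and
    return the combined per-position read count."""
    half = fragment_length // 2
    density = Counter()
    for pos in plus_starts:
        density[pos + half] += 1   # plus-strand reads move downstream
    for pos in minus_starts:
        density[pos - half] += 1   # minus-strand reads move upstream
    return density


if __name__ == "__main__":
    # Toy data: a binding site at position 1000 with 200-bp fragments.
    plus = [898, 900, 900, 902]
    minus = [1098, 1100, 1100, 1102]
    length = estimate_fragment_length(plus, minus, max_shift=400)
    density = shifted_density(plus, minus, length)
    print(length, max(density, key=density.get))
```

In this toy case the two strand-specific clusters sit 200 bp apart, so shifting each read by 100 bp stacks both clusters on the assumed binding site. Real data would additionally need per-chromosome handling, duplicate filtering, and smoothing, none of which are attempted here.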

9.3 Applications

In the past few years, ChIP-seq technology has matured substantially and has been applied to tackle many interesting biological problems. The main application of ChIP-seq (and other ChIP-based approaches) is to map the dynamic DNA binding sites of key regulatory proteins (transcription factors, transcriptional coactivators, RNA polymerase, etc.) or histone modification status under various biological conditions or at various developmental stages in different cell lines or tissues. These diverse genomic and epigenetic data have opened the door to systematic elucidation of combinatorial transcriptional control through various nuclear factors and epigenetic mechanisms (Figure 9.4). As a result, ChIP-seq has found applications across a wide range of biological and biomedical disciplines, including developmental and stem cell biology, cancer biology, human pathophysiology, and evolutionary genomics. In this section, we survey the breadth of biological problems that have benefited from genome-wide protein–DNA interaction mapping using ChIP-seq.

Fig. 9.3 Bioinformatics analysis of ChIP-seq data. An analysis pipeline is shown in (a). For single-end ChIP-seq, there should be an enrichment of sequence reads on the positive and negative strands flanking the location of a true protein interaction site. The sequence reads from the positive and negative strands can be shifted or extended to construct a combined read density profile (b). A typical profile of CCCTC-binding factor (CTCF), RNA polymerase II (RNA PolII), histone 3 with trimethylation at lysine 9 (H3K9Me3), and histone 3 with trimethylation at lysine 4 (H3K4Me3) is shown in (c). (d) Two typical average ChIP-seq signal profiles of monomethylation of histone 3 at lysine 4 (H3K4Me1) and RNA polymerase II around the TSSs. (Data for (b) and (c) were taken from a D. melanogaster development time course dataset from the modENCODE project; www.modencode.org.)

Fig. 9.4 Applications of ChIP-seq. ChIP-seq has enabled high-resolution genome-wide mapping of transcription factor (TF) binding, cofactor binding, polymerase binding, and histone modification status. These data, in conjunction with other high-throughput "omic" data, can be used to construct the transcriptional regulatory networks of the system of interest. This systems biology approach to understanding combinatorial transcriptional regulation has found many application areas, including developmental and stem cell biology, cancer biology, human pathophysiology, and evolutionary genomics.

9.3.1 Deciphering the Transcriptional Regulatory Program

A ChIP-seq study often involves the discovery of active or differential DNA binding or cobinding regions (i.e., peaks in a ChIP-seq profile) of various protein factors in one or multiple conditions. The resulting binding sites are often compared against consensus motif occurrences, various key genomic features (e.g., promoters, enhancers, and conserved regions), and gene expression profiles (through microarray or RNA-seq analysis). For example, one of the earliest ChIP-seq studies mapped the binding sites of 13 sequence-specific transcription factors and two transcriptional regulators in mouse embryonic stem (ES) cells [24]. The result was a transcriptional regulatory network of ES cells that showed how different regulatory pathways were integrated to control ES cell gene expression. In another early study, ChIP-seq was used to map the cobinding of peroxisome proliferator-activated receptor-γ (PPARγ) and retinoid X receptor (RXR), along with RNA polymerase II occupancy at several time points during murine adipogenesis [25]. Through a Gene Ontology analysis, the temporal protein binding map revealed that the PPARγ–RXR binding events are associated with most of the genes in the glucose and lipid metabolism pathways. In a more recent study, ChIP-seq was used to map the binding sites of an enhancer-associated factor, p300, in three different tissues in mouse embryo [26]. These ChIP-seq profiles and follow-up validation experiments indicated that enhancer-binding proteins interact with the genome in a highly tissue-specific manner, allowing the genome-wide study of dynamic enhancer activity during development.

9.3.2 Unraveling Epigenetic Regulation

Another important application of ChIP-seq is the discovery of nucleosome organization and epigenetic regulation [3,27]. Dynamic histone modification is associated with activation and repression of gene expression [10,28]. For example, methylation of histone 3 lysine 4 (H3K4) is strongly associated with gene activation, while methylation of histone 3 lysine 27 (H3K27) is associated with regions with repressed gene expression (see Figure 9.3c for example profiles of H3K4Me3 and H3K27Me3). Each histone modification mark usually has an enrichment profile around the TSS of a gene with a characteristic pattern (see Figure 9.3d for an example of H3K4Me1) [10]. Such results have raised intense interest in understanding how dynamic histone modification might regulate gene expression in various key biological processes, such as maintenance of stem cell pluripotency and differentiation [29], embryonic development [30,31], and disease processes [32,33]. In addition, histone modification profiles can lead to the discovery of key genomic regulatory sites. For example, it was demonstrated that histone modification patterns can be integrated with sequence motif information via an integrated probabilistic model to accurately predict transcription factor binding sites [34]. Although most current studies could only map steady-state nucleosome/histone distribution, a recent metabolic labeling technique has shown promise in studying the dynamic turnover of histone molecules [35], which raises the prospect of large-scale mapping of dynamic epigenetic regulation.

9.3.3 Comparative Interindividual or Interspecies Analysis

Recently, the International HapMap project [36] and many genome-wide association studies (e.g., [37]) have enabled the community to identify interindividual differences at the level of genome sequences. However, it was largely unclear how these sequence-level polymorphisms translate to variations in protein–DNA interaction and chromatin structure. In this context, ChIP-seq has opened up a cost-effective approach for genome-wide studies of interindividual and interspecies protein–DNA interaction. In a study of the interindividual variation of DNase I hypersensitive sites and CCCTC-binding factor binding sites in two unrelated and geographically distant families, it was found that about 10% of the chromatin sites are individual-specific and that many of the specific binding sites are heritable [38]. Consistent with another study, many of these binding site variations were found to be associated with sequence variation [39]. In another ChIP-seq study of transcription factor binding in five vertebrates, it was shown that interspecies transcription factor binding variation is prominent despite the highly conserved DNA binding preferences of the target transcription factor across these species [40], which implies that lineage-specific loss of transcription factor–DNA interactions could be evolving neutrally. These initial studies demonstrated the promise of using ChIP-seq to study variation and evolution of gene regulation in a genome-wide fashion. ChIP-seq is a particularly attractive technology for this type of application since it can produce high-resolution profiles for any species with a sequenced reference genome.

9.3.4 Study of Human Diseases and Clinical Applications

Dynamic histone modifications are crucial for regulating normal cellular functions. It is therefore not surprising to find that changes in the cellular epigenomic status are an important hallmark of some diseases, such as cancer [32,33]. A recent study showed that the global levels of some histone modifications in human tissues can be used as robust prognosis markers of several cancers [41]. Such findings fueled the prospect of "epigenetic therapy" whereby major epigenetic modifying enzymes become candidate drug targets [42]. Further large-scale studies involving histone modification profiles of many patients will be required to confirm the clinical significance. Although ChIP-chip is currently the main profiling platform for these initial studies, the cost-effectiveness and increasing availability of NGS will likely lead to a more widespread adoption of ChIP-seq in this type of disease study.

9.3.5 Advantages and Challenges of ChIP-Seq

For most of the aforementioned applications, ChIP-seq has many important advantages over other technologies, such as ChIP-chip. By directly counting the number of DNA fragments that bind to a genomic region, ChIP-seq can quantify protein–DNA interaction with higher spatial resolution and dynamic range, which in turn gives higher identification sensitivity and specificity. In addition, many microarray-specific issues, such as cross-hybridization, GC-bias in hybridization, and the inability to design probes for repetitive regions or for organisms that have not been sequenced, are largely alleviated using this sequencing-based technique. Capitalizing on the dropping cost of the current sequencing technology, ChIP-seq offers a cost-effective method to analyze large genomes.

Nonetheless, there are a number of challenges in applying ChIP-seq [43]. The foremost problem with ChIP-seq, as with other ChIP-based methods in general, is the availability of a high-quality ChIP-grade antibody. Many commercially available antibodies, or even different batches of the same antibody, are of variable quality. Validation of antibody quality is therefore crucial. We also found that it is important to use a reasonably large amount of chromatin per antibody. Some protocols that use a smaller number of cells have been proposed recently [44], but further development is necessary for samples that cannot be obtained in large quantities. Determining the sufficient sequencing depth is another important challenge. A computational strategy of subsampling reads can be used to estimate whether a saturation point has been achieved at a given level of sequencing depth at a given significance threshold [21], but the effect of insufficient depth of sequencing has not been adequately explored. A number of novel bioinformatics challenges related to ChIP-seq analysis have also emerged. Many ChIP-seq analysis packages currently focus on peak calling, while methods for analyzing broad enrichment regions are limited. While it is clear that the sequence reads produced from input DNA are far from uniformly distributed along the genome [45,46], the best method for background correction is not completely clear. Since the input DNA covers most of the genome, it needs to be sequenced to high depth for accurate background estimation, but that is generally not the case in practice. In addition, problems related to data quality assessment, signal normalization and smoothing, and estimation of statistical significance still require further investigation using a wide range of ChIP-seq datasets.
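
The subsampling strategy mentioned above can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the method of [21]: "enriched" regions are fixed-width bins holding at least `min_reads` reads, the threshold is not rescaled with depth or tied to a significance level, and the function names and default parameters are hypothetical.

```python
import random
from collections import Counter


def enriched_bins(read_positions, bin_size, min_reads):
    """Indices of fixed-width bins holding at least min_reads reads."""
    counts = Counter(pos // bin_size for pos in read_positions)
    return {index for index, n in counts.items() if n >= min_reads}


def saturation_curve(read_positions, fractions, bin_size=200, min_reads=5, seed=0):
    """For each sampling fraction, report the share of full-depth
    enriched bins that is still detected in the subsample."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    reads = list(read_positions)
    full = enriched_bins(reads, bin_size, min_reads)
    curve = []
    for fraction in fractions:
        sample = rng.sample(reads, int(len(reads) * fraction))
        found = enriched_bins(sample, bin_size, min_reads)
        recovered = len(found & full) / len(full) if full else 0.0
        curve.append((fraction, recovered))
    return curve


if __name__ == "__main__":
    # Toy data: two well-covered sites and one sparsely covered position.
    reads = [100] * 10 + [1000] * 10 + [5000] * 2
    for fraction, recovered in saturation_curve(reads, [0.25, 0.5, 0.75, 1.0]):
        print(f"{fraction:.2f} of reads: {recovered:.2f} of enriched bins recovered")
```

If the recovered share is already close to 1 well below full depth, additional sequencing would be expected to add few new regions under these assumptions; a curve that is still rising at full depth suggests the library has not been sequenced to saturation.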

9.4 Perspectives

ChIP-seq is a flexible and maturing technology. New advances in ChIP-seq technology will likely arise from improvement in both the experimental technology and bioinformatics analysis. Compared to the Illumina sequencing-based ChIP-seq analysis described in this chapter, the forthcoming "single-molecule real-time" sequencing technology will likely bring new opportunities and challenges. The ability to perform sequencing without prior DNA amplification can simplify the experimental workflow and eliminate a source of experimental bias. Also, issues related to insufficient sequencing depth for larger mammalian genomes will eventually become less relevant. However, data generated by these new platforms exhibit new characteristics and therefore likely require new analysis approaches. With production of more reads per sequencing run, the issue of sample multiplexing will likely become more relevant. By barcoding various immunoprecipitation reactions, it is possible to perform ChIP-seq on multiple factors simultaneously using the 454 sequencing platform [47] as well as the Illumina Genome Analyzers [48]. From the analytical perspective, the increasing volume of ChIP-seq data will give rise to new opportunities for novel statistical methods in the context of two-group, multigroup, factorial, and time-series experimental designs. This type of interprofile comparison is not trivial since each profile can be confounded by variable sequencing depth, antibody quality, amplification efficiency, and fragment size distribution. Through close collaboration between experimentalists, technology developers, and bioinformaticians, we believe clever use of ChIP-seq in a well-designed study will greatly advance our understanding of gene regulatory mechanisms in many important biological processes.
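
To make the multiplexing idea concrete, the following Python sketch sorts reads into per-sample bins by an in-read barcode. It is a generic illustration rather than the scheme of [47] or [48]: it assumes that every barcode has the same length, sits at the very start of the read, and must match exactly, and the barcode sequences and sample names are invented for the example.

```python
def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by an exact-match barcode prefix.

    Returns (bins, unassigned): bins maps each sample name to its reads
    with the barcode trimmed off; unassigned keeps reads whose prefix
    matches no known barcode.
    """
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    barcode_length = lengths.pop()

    bins = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        sample = barcode_to_sample.get(read[:barcode_length])
        if sample is None:
            unassigned.append(read)
        else:
            bins[sample].append(read[barcode_length:])
    return bins, unassigned


if __name__ == "__main__":
    # Hypothetical 4-nt barcodes for two immunoprecipitation reactions.
    barcodes = {"ACGT": "CTCF_ChIP", "TGCA": "PolII_ChIP"}
    reads = ["ACGTTTTTGGGG", "TGCAGGGGCCCC", "NNNNAAAATTTT"]
    bins, unassigned = demultiplex(reads, barcodes)
    print(bins)
    print(unassigned)
```

A practical implementation would usually tolerate sequencing errors in the barcode, which requires barcodes that differ from each other at several positions; that refinement is left out here.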

Acknowledgments

We thank Tatyana Kahn and Yuri Schwartz for sharing the human cultured cell ChIP protocol. This work is supported by grants GM45744 (M.I.K.), U01HG004258 (P.J.P.), and a SysCODE Interdisciplinary Postdoctoral Training Fellowship (J.W.K.H.) from the National Institutes of Health. All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

References

1 Farnham, P.J. (2009) Nat. Rev. Genet., 10, 605–616.
2 Ernst, J. and Kellis, M. (2010) Nat. Biotechnol., 28, 817–825.
3 Tolstorukov, M.Y., Kharchenko, P.V., and Park, P.J. (2010) Epigenomics, 2, 187–197.
4 Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E. et al. (2000) Science, 290, 2306–2309.
5 Iyer, V.R., Horak, C.E., Scafe, C.S., Botstein, D., Snyder, M., and Brown, P.O. (2001) Nature, 409, 533–538.
6 Impey, S., McCorkle, S.R., Cha-Molstad, H., Dwyer, J.M., Yochum, G.S., Boss, J.M., McWeeney, S., Dunn, J.J., Mandel, G., and Goodman, R.H. (2004) Cell, 119, 1041–1054.
7 Chen, J. and Sadowski, I. (2005) Proc. Natl. Acad. Sci. USA, 102, 4813–4818.
8 Bhinge, A.A., Kim, J., Euskirchen, G.M., Snyder, M., and Iyer, V.R. (2007) Genome Res., 17, 910–916.
9 Wei, C., Wu, Q., Vega, V., Chiu, K., Ng, P., Zhang, T., Shahab, A., Yong, H., Fu, Y., and Weng, Z. (2006) Cell, 124, 207–219.
10 Barski, A., Cuddapah, S., Cui, K., Roh, T., Schones, D., Wang, Z., Wei, G., Chepelev, I., and Zhao, K. (2007) Cell, 129, 823–837.
11 Johnson, D.S., Mortazavi, A., Myers, R.M., and Wold, B. (2007) Science, 316, 1497–1502.
12 Robertson, G., Hirst, M., Bainbridge, M., Bilenky, M., Zhao, Y., Zeng, T., Euskirchen, G., Bernier, B., Varhol, R., Delaney, A. et al. (2007) Nat. Methods, 4, 651–657.
13 Zeng, P.-Y., Vakoc, C.R., Chen, Z.-C., Blobel, G.A., and Berger, S.L. (2006) Biotechniques, 41, 694–698.
14 Chambers, I. and Smith, A. (2004) Oncogene, 23, 7150–7160.
15 Kim, T.-K., Hemberg, M., Gray, J.M., Costa, A.M., Bear, D.M., Wu, J., Harmin, D.A., Laptewicz, M., Barbara-Haley, K., Kuersten, S. et al. (2010) Nature, 465, 182–187.
16 Goren, A., Ozsolak, F., Shoresh, N., Ku, M., Adli, M., Hart, C., Gymrek, M., Zuk, O., Regev, A., Milos, P.M. et al. (2010) Nat. Methods, 7, 47–49.
17 Quail, M.A., Kozarewa, I., Smith, F., Scally, A., Stephens, P.J., Durbin, R., Swerdlow, H., and Turner, D.J. (2008) Nat. Methods, 5, 1005–1010.
18 Zhong, M., Niu, W., Lu, Z.J., Sarov, M., Murray, J.I., Janette, J., Raha, D., Sheaffer, K.L., Lam, H.Y.K., Preston, E. et al. (2010) PLoS Genet., 6, e1000848–e1000848.
19 Trapnell, C. and Salzberg, S.L. (2009) Nat. Biotechnol., 27, 455–457.
20 Nicol, J.W., Helt, G.A., Blanchard, S.G., Raja, A., and Loraine, A.E. (2009) Bioinformatics, 25, 2730–2731.
21 Kharchenko, P.V., Tolstorukov, M.Y., and Park, P.J. (2008) Nat. Biotechnol., 26, 1351–1359.
22 Pepke, S., Wold, B., and Mortazavi, A. (2009) Nat. Methods, 6, S22–S32.
23 Laajala, T., Raghav, S., Tuomela, S., Lahesmaa, R., Aittokallio, T., and Elo, L. (2009) BMC Genomics, 10, 618.
24 Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V.B., Wong, E., Orlov, Y.L., Zhang, W., Jiang, J. et al. (2008) Cell, 133, 1106–1117.
25 Nielsen, R., Pedersen, T.A., Hagenbeek, D., Moulos, P., Siersbæk, R., Megens, E., Denissov, S., Børgesen, M., Francoijs, K.-J., Mandrup, S. et al. (2008) Genes Dev., 22, 2953–2967.
26 Visel, A., Blow, M.J., Li, Z., Zhang, T., Akiyama, J.A., Holt, A., Plajzer-Frick, I., Shoukry, M., Wright, C., Chen, F. et al. (2009) Nature, 457, 854–858.
27 Park, P.J. (2008) Epigenetics, 3, 318–321.
28 Karlic, R., Chung, H.-R., Lasserre, J., Vlahovicek, K., and Vingron, M. (2010) Proc. Natl. Acad. Sci. USA, 107, 2926–2931.
29 Mikkelsen, T.S., Ku, M., Jaffe, D.B., Issac, B., Lieberman, E., Giannoukos, G., Alvarez, P., Brockman, W., Kim, T.-K., Koche, R.P. et al. (2007) Nature, 448, 553–560.
30 Akkers, R.C., van Heeringen, S.J., Jacobi, U.G., Janssen-Megens, E.M., Françoijs, K.-J., Stunnenberg, H.G., and Veenstra, G.J.C. (2009) Dev. Cell, 17, 425–434.
31 Hammoud, S.S., Nix, D.A., Zhang, H., Purwar, J., Carrell, D.T., and Cairns, B.R. (2009) Nature, 460, 473–478.
32 Neff, T. and Armstrong, S.A. (2009) Leukemia, 23, 1243–1251.
33 Chi, P., Allis, C.D., and Wang, G.G. (2010) Nat. Rev. Cancer, 10, 457–469.
34 Won, K.-J., Ren, B., and Wang, W. (2010) Genome Biol., 11, R7–R7.
35 Deal, R.B., Henikoff, J.G., and Henikoff, S. (2010) Science, 328, 1161–1164.
36 The International HapMap Consortium (2007) Nature, 449, 851–861.
37 The Wellcome Trust Case Control Consortium (2007) Nature, 447, 661–678.
38 McDaniell, R., Lee, B.-K., Song, L., Liu, Z., Boyle, A.P., Erdos, M.R., Scott, L.J., Morken, M.A., Kucera, K.S., Battenhouse, A. et al. (2010) Science, 328, 235–239.
39 Kasowski, M., Grubert, F., Heffelfinger, C., Hariharan, M., Asabere, A., Waszak, S.M., Habegger, L., Rozowsky, J., Shi, M., Urban, A.E. et al. (2010) Science, 328, 232–235.
40 Schmidt, D., Wilson, M.D., Ballester, B., Schwalie, P.C., Brown, G.D., Marshall, A., Kutter, C., Watt, S., Martinez-Jimenez, C.P., Mackay, S. et al. (2010) Science, 328, 1036–1040.
41 Seligson, D.B., Horvath, S., McBrian, M.A., Mah, V., Yu, H., Tze, S., Wang, Q., Chia, D., Goodglick, L., and Kurdistani, S.K. (2009) Am. J. Pathol., 174, 1619–1628.
42 Bhalla, K.N. (2005) J. Clin. Oncol., 23, 3971–3993.
43 Park, P.J. (2009) Nat. Rev. Genet., 10, 669–680.
44 Adli, M., Zhu, J., and Bernstein, B.E. (2010) Nat. Methods, 7, 615–618.
45 Teytelman, L., Özaydın, B., Zill, O., Lefrançois, P., Snyder, M., Rine, J., and Eisen, M.B. (2009) PLoS ONE, 4, e6700–e6700.
46 Vega, V.B., Cheung, E., Palanisamy, N., and Sung, W.-K. (2009) PLoS ONE, 4, e5241–e5241.
47 Meyer, M., Stenzel, U., and Hofreiter, M. (2008) Nat. Protocols, 3, 267–278.
48 Lefrancois, P., Euskirchen, G., Auerbach, R., Rozowsky, J., Gibson, T., Yellman, C., Gerstein, M., and Snyder, M. (2009) BMC Genomics, 10, 37.

10 Analysis of Protein–RNA Interactions with Single-Nucleotide Resolution Using iCLIP and Next-Generation Sequencing

Julian König, Nicholas J. McGlincy, and Jernej Ule

Abstract

Post-transcriptional regulation of gene expression is controlled by the unique composition and spatial arrangement of RNA-binding proteins (RBPs) on individual transcripts. Therefore, understanding post-transcriptional regulation requires precise and comprehensive binding site maps for RBPs. UV cross-linking and immunoprecipitation (CLIP) is a state-of-the-art technique for generating such maps on a genome-wide scale. However, data complexity is often limited and the resolution of the resulting maps is confined to approximately 30 nucleotides. This, in turn, complicates the identification of individual binding sites. We recently described individual-nucleotide resolution CLIP (iCLIP) – an approach that both increases data complexity and allows binding site detection at single-nucleotide resolution. Here, we present the latest version of our iCLIP protocol, discussing critical aspects and recent modifications.

10.1 Introduction

Throughout their lifetime, transcripts are associated with a plethora of RNA-binding proteins (RBPs). The combinatorial binding and spatial arrangement of these RBPs give rise to a diverse range of ribonucleoprotein (RNP) particles that determine the cellular fate and function of each RNA [1,2]. Recent advances towards more precise positional information on the binding sites of RBPs within RNAs have improved our understanding of the molecular mechanisms of post-transcriptional regulation [3]. Originally, protein–RNA interactions were studied using biochemical methods such as systematic evolution of ligands by exponential enrichment (SELEX), electrophoretic mobility shift, and RNA protection assays, or genetic methods such as the yeast three-hybrid system [4–6]. These approaches, however, did not address RNA binding in its native cellular context. In a first step towards preserving the cellular context, RNA immunoprecipitation (RIP) was combined with differential display or microarray analysis (RIP-Chip) [7–9]. These methods were of low resolution and prone to identifying indirect interactions. Furthermore, they were limited to studying stable RNPs since protein–RNA complexes can reassociate after cell lysis [10]. In order to increase the resolution and specificity, a strategy referred to as CLIP (UV cross-linking and immunoprecipitation) was developed [11,12]. CLIP combines UV cross-linking of RBPs to their cognate RNA molecules with rigorous purification schemes. In combination with high-throughput sequencing (HTS), CLIP has proven to be a powerful tool for studying protein–RNA interactions on a genome-wide scale (referred to as HITS-CLIP or CLIP-seq) [13,14]. Prominent examples range from regulation of alternative splicing in mammals [14–16] to protein–microRNA

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.


interactions and subcellular RNA localization in organisms as diverse as Caenorhabditis elegans and the fungus Ustilago maydis [17–19]. Recent modifications of the protocol include the use of photoreactive ribonucleoside analogs (PAR-CLIP) [20,21] and affinity purification under denaturing conditions (CRAC) [22]. Despite the high specificity of the obtained data, CLIP experiments commonly generate cDNA libraries of limited complexity. This is partly due to the restricted amount of copurified RNA, resulting from inefficiency of cross-linking and the two RNA ligation reactions that are required for library amplification. In addition, many cDNAs prematurely truncate at the cross-linked nucleotide [23] and are thus lost during the standard CLIP library preparation protocol. We recently developed iCLIP (individual-nucleotide resolution CLIP) to overcome this limitation [24]. In order to capture the truncated cDNAs, we replaced one of the inefficient intermolecular RNA ligation steps with a more efficient intramolecular cDNA circularization. Importantly, sequencing the truncated cDNAs provides direct insights into the position of the cross-link site, allowing us to map protein–RNA interactions with single-nucleotide resolution. We have successfully applied iCLIP to study the impact of binding position on the regulation of alternative splicing by heterogeneous nuclear ribonucleoprotein (hnRNP) C and T-cell intracellular antigen 1 (TIA1)/TIA1-like 1 (TIAL1) [24,25]. In this chapter, we describe our most recent version of the iCLIP protocol and discuss new additions and key features of this technology.

10.2 Procedure Overview

In order to preserve in vivo protein–RNA interactions, living cells or tissue are irradiated with UV-C light. This covalently cross-links proteins to RNA molecules at positions of close proximity (Figure 10.1). Cross-linked sites will therefore represent the positions of direct protein–RNA interactions. Following cross-linking, the cells are lysed and the RNA is fragmented using low concentrations of RNase I. Next, the protein–RNA complex is immunoprecipitated with an antibody specific to the protein of interest. The antibody is immobilized on magnetic beads to facilitate washing and buffer changes. After stringent washing, a DNA adapter is ligated to the 3′ end of the RNA, while the 5′ end is radioactively labeled. Upon denaturing gel electrophoresis, the complex is transferred to a nitrocellulose membrane. This removes free RNAs that are not covalently attached to the protein. The radioactive label allows visualization of the purified protein–RNA complex by autoradiography (Figure 10.2). This is used to guide extraction of the membrane region containing the protein–RNA complex. The covalent bond formed by UV cross-linking is irreversible, so the protein is removed from the RNA by Proteinase K digestion. This leaves a short, covalently attached peptide at the cross-link site that interferes with reverse transcription. Consequently, most cDNAs will truncate at this position, thereby inheriting the information about the cross-linked nucleotide. In order to recover these truncated cDNAs, we developed a cloning procedure based on cDNA circularization. To this end, reverse transcription is performed with oligonucleotide primers that contain two inversely oriented adapter regions separated by a BamHI restriction site. In addition, the primers carry at their 5′ end a 4-nucleotide barcode to mark individual experiments, as well as a 5-nucleotide random barcode to control for PCR artifacts by individualizing single cDNA molecules.
The resulting cDNAs are size-purified using denaturing gel electrophoresis (Figure 10.3) and circularized by single-stranded DNA ligase. The circularized cDNAs are relinearized by BamHI digestion. This is accomplished by annealing a complementary oligonucleotide to the single-stranded restriction site between the two adapter regions. The relinearized cDNAs are PCR-amplified and analyzed using polyacrylamide gel electrophoresis (PAGE; Figure 10.4). Finally, samples are subjected to HTS on the Illumina Genome Analyzer II.

[Figure 10.1 panel labels: 1 UV cross-linking in vivo; 2 Cell lysis; 3 Partial RNA digestion; 4 Immunoprecipitation; 5 Dephosphorylation; 6 RNA adapter ligation; 7 Radioactive labeling of RNA; 8 SDS–PAGE and membrane transfer to remove free RNA; 9 Extraction of RNA from the membrane (Proteinase K leaves a polypeptide at the cross-link nucleotide); 10 Reverse transcription (RT primer: two cleavable adapter regions and barcode); 11 Size selection using gel electrophoresis (urea–PAGE); 12 Circularization; 13 Annealing of oligonucleotide to the cleavage site; 14 Linearization (BamHI); 15 PCR amplification; 16 High-throughput sequencing.]

Fig. 10.1 Schematic representation of the iCLIP protocol. Protein–RNA complexes are covalently cross-linked in vivo using UV irradiation (step 1). The protein of interest is purified together with the bound RNA (steps 2–5). To allow for sequence-specific priming of reverse transcription, an RNA adapter is ligated to the 3′ end of the RNA, whereas the 5′ end is radioactively labeled (steps 6 and 7). Cross-linked protein–RNA complexes are purified from free RNA using SDS–PAGE and membrane transfer (step 8). The RNA is recovered from the membrane by digesting the protein with Proteinase K, leaving a polypeptide at the cross-link nucleotide (step 9). Reverse transcription truncates at the remaining polypeptide, and introduces two cleavable adapter regions and barcode sequences (step 10). Size selection removes free RT primer before circularization. The following relinearization generates suitable templates for PCR amplification (steps 11–15). Finally, HTS generates reads in which the barcode sequences are immediately followed by the last nucleotide of the cDNA (step 16). Since this nucleotide locates one position upstream of the cross-linked nucleotide, the binding site can be deduced with high resolution.

10.3 Antibody and Library Preparation Quality Controls

During the course of an iCLIP experiment, the quality of immunoprecipitation and library preparation can be monitored at two steps: (i) the autoradiograph, which allows study of the size distribution of the purified protein–RNA complexes, and (ii) gel electrophoresis of the final PCR products, which allows monitoring of the size distribution of amplified cDNA fragments and the identification of any nonspecific products. It is of vital importance to use these two steps to control for the correct size of the protein–RNA complex, the antibody specificity, and the quality of cDNA amplification. In the first step, the size distribution of the protein–RNA complexes after partial and complete RNase digestion (low RNase, used for library preparation, and high RNase; Figure 10.2) is compared by autoradiography. In the low-RNase sample, the radioactive signal should be broad, extending from the size of the protein into higher-molecular-weight areas. This represents the protein cross-linked to radioactively labeled RNA fragments of various sizes. In the high-RNase sample, where the cross-linked RNA is reduced to its minimum size, there should be only one sharp radioactive band migrating slightly above the expected size of the protein. The absence of a clear change in signal migration upon RNase treatment could indicate that conditions for partial digestion are inappropriate, that cell extracts contain high endogenous RNase activity, or that the RBP itself has been labeled by the polynucleotide kinase reaction. In the latter case, the radioactive signal would persist in the non-UV-irradiated control sample. This control also allows detection of contaminating non-cross-linked RNA.

Fig. 10.2 Autoradiograph of cross-linked hnRNP C–RNA complexes using denaturing gel electrophoresis and membrane transfer. hnRNP C–RNA complexes were immunopurified from cell extracts using an antibody against hnRNP C (α hnRNP C, samples 3 and 4). RNA was partially digested using low (+) or high (++) concentration of RNase. Complexes shifting upwards from the size of the protein (40 kDa) can be observed (sample 4). The shift is less pronounced when high concentrations of RNase were used (sample 3). The radioactive signal disappears when no antibody was used in the immunoprecipitation (samples 1 and 2).
[Figure 10.2 labels: size marker at 250, 130, 100, 70, 55, 35, and 27 kDa; protein–RNA complex; hnRNP C protein; area cut from the membrane.]

Fig. 10.3 Schematic 6% TBE/urea gel (Invitrogen) to guide the excision of iCLIP cDNA products. The gel is run for 40 min at 180 V leading to a reproducible migration pattern of cDNAs and dyes (light and dark blue) in the gel. Use a razor blade to cut (red line) the high (H), medium (M), and low (L) cDNA fractions. Start by cutting in the middle of the light blue dye and immediately above the mark on the plastic gel cassette. Divide the medium and low fractions and trim the high fraction about 1 cm above the light blue dye. Use vertical cuts guided by the pockets and the dye to separate the different lanes (in this example 1–4). The marker lane (m) can be stained and imaged to control sizes after the cutting. Fragment sizes are indicated on the right.
[Figure 10.3 labels: lanes 1–4 and marker lane (m); marker sizes 766, 250, 200, 150, 100, 75, and 50 nt; light blue dye; dark blue dye; mark on the plastic gel cassette; RT primer (41 nt).]

The specificity of the immunoprecipitation is typically assessed with a no-antibody control, which should give no signal in the autoradiograph (Figure 10.2). In addition, the high-RNase samples should be carefully examined to detect signals from nonspecific RBPs that coimmunoprecipitate with the protein of interest. Ideally, these contaminating signals should be avoided or minimized by optimizing the immunoprecipitation conditions or changing the antibody. An alternative source of additional signals in the autoradiograph can be remnants of free DNA adapter, which is applied in huge excess. The free adapter migrates at approximately 55 kDa. Increasing the duration and number of washing steps after the ligation can reduce this background. To completely avoid radioactive labeling of the DNA adapter, it is possible to remove the aliquot for radioactive labeling prior to the ligation step. It is, however, important to note that carryover of DNA adapter into the reverse transcription reaction can lead to undesired primer-dimer products; labeling of the complex after ligation is therefore recommended to monitor such contamination. Finally, specificity of the antibody itself should be controlled with samples from knockout or knockdown cells (or tissue). In the latter samples, a decrease in radioactive signal should correlate with the knockdown efficiency.

In order to control the different steps of library preparation, it is important to maintain one or more negative controls throughout the complete experiment. The optimal input material for these controls is a no-antibody sample or an immunoprecipitation from knockout cells. Control samples should show no product after PCR amplification (Figure 10.4) and should return very few unique sequences from HTS. Knockdown cells are not recommended as a sequencing control, since these still contain the protein, albeit in smaller quantities.
After PCR amplification, the length distribution of the products should reflect the size range of cDNAs that were cut from the polyacrylamide gel after reverse transcription (Figures 10.3 and 10.4). Note that the PCR primers introduce an additional 76 nucleotides, therefore amplification of cDNAs of 75–120 nucleotides should produce products of 151–196 nucleotides on the PCR gel. In some cases, especially if the RNA input was low, a primer-dimer product of 128 nucleotides can appear. This should be removed by further gel purification. A broader size distribution or additional bands are indicative of secondary products formed during the PCR reaction (most often due to overamplification), degradation of cDNAs prior to circularization, or, in the worst case, amplification of contaminating DNA. If any of these products are seen, sequencing of the library is not recommended.
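The size arithmetic above can be sketched in a few lines of Python. This is only an illustration: the constants (76 nucleotides added by the PCR primers, a 128-nucleotide primer-dimer, and the fraction boundaries from the Figure 10.4 legend) come from the text, whereas the function names and the tolerance parameter are hypothetical and not part of the protocol.

```python
# Values taken from the text; everything else here is an illustrative assumption.
PCR_PRIMER_ADDITION_NT = 76   # nucleotides added to each cDNA by the PCR primers
PRIMER_DIMER_NT = 128         # size of the primer-dimer product

# cDNA size fractions cut from the gel (Figure 10.4 legend), in nucleotides.
CDNA_FRACTIONS_NT = {"high": (120, 200), "medium": (85, 120), "low": (70, 85)}


def expected_pcr_range(cdna_range):
    """Return the expected (min, max) PCR product size for a cDNA size range."""
    shortest, longest = cdna_range
    return shortest + PCR_PRIMER_ADDITION_NT, longest + PCR_PRIMER_ADDITION_NT


def classify_band(size_nt, fraction, tolerance_nt=5):
    """Hypothetical helper: label a band estimated from the PCR gel.

    tolerance_nt is an assumed allowance for imprecise size estimates.
    """
    if abs(size_nt - PRIMER_DIMER_NT) <= tolerance_nt:
        return "primer-dimer"
    shortest, longest = expected_pcr_range(CDNA_FRACTIONS_NT[fraction])
    if shortest <= size_nt <= longest:
        return "expected"
    return "unexpected"


if __name__ == "__main__":
    for name, cdna_range in CDNA_FRACTIONS_NT.items():
        print(name, expected_pcr_range(cdna_range))
```

A band labeled "unexpected" by such a check would correspond to the secondary products or contaminating DNA described above and would argue against sequencing the library.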

10.4 Oligonucleotide Design

The iCLIP protocol requires a specific set of RNA and DNA oligonucleotides that guide the enzymatic reactions and minimize the generation of undesired byproducts (Figure 10.5). In the latest version of our protocol, the DNA oligonucleotide L3-App is ligated to the 3′ end of the purified RNA to allow for sequence-specific priming of the reverse transcription. The sequence of L3-App is complementary to the 3′ end of the PCR primer P3Solexa, which is used to amplify the library for HTS. The 3′ end of L3-App is blocked with a dideoxycytidine to prevent oligomerization during the ligation reaction. Importantly, the 5′ end of L3-App is preadenylated (App) to improve the ligation efficiency [26,27]. The preadenylation constitutes a ligation intermediate upon ATP activation and therefore no further ATP has to be added to the reaction. This prevents the formation of L3-App-independent inter- or intramolecular ligation products. Furthermore, since the 5′ ends of the RNAs no longer need to be dephosphorylated, we replaced the calf intestine phosphatase with the polynucleotide kinase (PNK), which contains efficient 3′ end-specific phosphatase activity when used at low pH.


Reverse transcription is primed with one of the oligonucleotides RT1Clip to RT16Clip (abbreviated as RT#Clip), which are complementary to the last 9 nucleotides (excluding the dideoxycytidine) at the 3′ end of L3-App. Since annealing of primer P3Solexa during the subsequent PCR reaction requires the full sequence introduced by L3-App, this incomplete overlap ensures that the RT#Clip products cannot be amplified unless L3-App served as a template during reverse transcription. This is vital since excess of circularized RT#Clip oligonucleotides would otherwise facilitate generation of primer-dimers during the PCR amplification. The RT#Clips also contain an adapter region complementary to the 3′ end of the second PCR primer P5Solexa. The regions of RT#Clip corresponding to P3Solexa and P5Solexa are separated by a BamHI restriction site that allows linearization of the circularized cDNAs later on in the procedure. At the 5′ end, RT#Clips contain two distinct barcode regions. The 4-nucleotide experiment-specific barcode is unique to each RT#Clip oligonucleotide and marks individual experiments or replicates. The 5-nucleotide random barcode is unique to each RT#Clip molecule and allows products of different reverse transcription events to be distinguished from mere PCR duplicates. In our latest versions of RT#Clip, we split the random barcode and moved three random nucleotides to the beginning of the sequence reads. This is different from our previous primer design, where the sequence read started with the experiment-specific barcode. The reason for this rearrangement was that cluster definition by the Genome Analyzer software can be impaired if all reads share the same starting nucleotides. The remaining two random nucleotides were placed at the 5′ end of the RT#Clips to reduce a sequence-dependent bias during circularization. To allow for ligation during the circularization reaction, all RT#Clip oligonucleotides are 5′-phosphorylated.
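The role of the two barcodes can be illustrated with a short Python sketch. It rests on stated assumptions rather than on the authors' analysis pipeline: each read is assumed to start with three random nucleotides, followed by the 4-nucleotide experiment barcode, the remaining two random nucleotides, and then the cDNA insert (the read-order mirror of the NNXXXXNNN layout of RT#Clip). For simplicity, duplicates are collapsed on the insert sequence itself, whereas a real analysis would more likely use the mapped genomic position. All function names are hypothetical.

```python
from collections import defaultdict

BARCODE_LENGTH = 9  # 3 random + 4 experiment-specific + 2 random nucleotides


def split_barcodes(read):
    """Split a read into (experiment barcode, random barcode, insert).

    Assumed read layout: NNN XXXX NN followed by the cDNA insert.
    """
    if len(read) < BARCODE_LENGTH:
        raise ValueError("read is shorter than the barcode region")
    random_barcode = read[0:3] + read[7:9]   # the split 5-nt random barcode
    experiment_barcode = read[3:7]           # the 4-nt experiment barcode
    insert = read[BARCODE_LENGTH:]
    return experiment_barcode, random_barcode, insert


def count_unique_cdnas(reads):
    """Count distinct cDNA molecules per experiment barcode.

    Reads sharing both insert and random barcode are treated as PCR
    duplicates of a single reverse transcription event.
    """
    molecules = defaultdict(set)
    for read in reads:
        experiment_barcode, random_barcode, insert = split_barcodes(read)
        molecules[experiment_barcode].add((insert, random_barcode))
    return {barcode: len(found) for barcode, found in molecules.items()}


if __name__ == "__main__":
    example_reads = [
        "ACGGGTTCAAAATTTT",  # first cDNA molecule
        "ACGGGTTCAAAATTTT",  # identical read: counted as a PCR duplicate
        "TTTGGTTGGAAATTTT",  # same insert, different random barcode
    ]
    print(count_unique_cdnas(example_reads))
```

In this toy input the first two reads collapse into one molecule, while the third is kept because its random barcode differs even though the insert is identical.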
Prior to PCR amplification, the circularized cDNA molecules require linearization by cutting between the two adapter regions. To guide this enzymatic reaction, a DNA oligonucleotide (cut-oligo) covering the BamHI site and its flanking regions is annealed to the cDNAs. To prevent cut-oligo from serving as a primer during the subsequent PCR amplification, it contains four noncomplementary adenosines at its 3′ end. PCR amplification is performed with the oligonucleotides P3Solexa (61 nucleotides) and P5Solexa (58 nucleotides). Their 3′ ends are complementary to the two

Fig. 10.4 Analysis of PCR-amplified iCLIP cDNA libraries using gel electrophoresis. RNA recovered from the membrane (Figure 10.2) was reverse transcribed and size-purified using denaturing gel electrophoresis (Figure 10.3). Three size fractions of cDNA (high (H): 120–200 nucleotides, medium (M): 85–120 nucleotides, and low (L): 70–85 nucleotides) were recovered, circularized, relinearized, and PCR-amplified. PCR products of different size distribution can be observed as a result of the different sizes of the input fractions. Since the PCR primers introduce 76 nucleotides to the cDNA, sizes should range from 196 to 276 nucleotides for the high, 161 to 196 nucleotides for the medium, and 146 to 161 nucleotides for the low size fraction. PCR products are absent when no antibody was used for the immunoprecipitation (lanes 1–3).
[Figure 10.4 labels: lanes 1–3, no antibody (H, M, L); lanes 4–6, α hnRNP C (H, M, L); marker sizes 766, 300, 200, 150, 100, 75, and 50 bp; amplified cDNA; primer dimer; primers.]

[Figure 10.5 sequences and labels:
RT#Clip: 5′P-NNXXXXNNNAGATCGGAAGAGCGTCGTGGATCCTGAACCGC-3′ (split 5-nt random barcode, N; 4-nt experiment-specific barcode, X; BamHI cleavage site)
L3-App: 5′App-AGATCGGAAGAGCGGTTCAG-ddC-3′ (ligated to the 3′ end of the RNA)
cut-oligo: 5′-GTTCAGGATCCACGACGCTCTTCaaaa-3′
P5Solexa (3′ region shown): …CACGACGCTCTTCCGATCT-3′
P3Solexa (3′ region shown): …GCTGAACCGCTCTTCCGATCT-3′
Annotation: Only RT products can serve as PCR templates after circularization, since this sequence is missing in the RT primer.]

Fig. 10.5 RNA and DNA oligonucleotide design. The DNA adapter (L3-App) is preadenylated at the 5′ end (5′App) to allow ligation to the cross-linked RNA. The 3′ end is protected with dideoxycytidine to prevent concatenation. The reverse transcription primer (RT#Clip) is complementary to the 3′ half of L3-App to allow sequence-specific priming. The 5′ end is phosphorylated to enable circularization and contains the barcode sequences. NNN…NN indicates the split 5-nucleotide random barcode, while XXXX is the experiment-specific 4-nucleotide barcode, which is unique for each RT#Clip oligonucleotide. Cut-oligo is complementary to the BamHI cleavage site in RT#Clip and contains four adenosines at its 3′ end (aaaa) to prevent the oligonucleotide from acting as primer during the subsequent PCR amplification. The oligonucleotides P3Solexa and P5Solexa are used for PCR amplification. Complementary sections are delimited by gray arrows. It is important to note that RT#Clip can only serve as a template for PCR amplification when acquiring sequence from L3-App during reverse transcription. This minimizes primer-dimer formation during PCR.
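The complementarity relationships described in this legend can be checked with a small Python sketch. The sequences are transcribed from Figure 10.5; for P3Solexa and P5Solexa only the 3′ regions shown in the figure are used, and the 3′ dideoxycytidine of L3-App and the four trailing adenosines of cut-oligo are left out of the strings. The variable and function names are illustrative assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Sequences as shown in Figure 10.5 (5' to 3').
L3_APP = "AGATCGGAAGAGCGGTTCAG"                           # without the 3' ddC
RT_CLIP = "NNXXXXNNNAGATCGGAAGAGCGTCGTGGATCCTGAACCGC"     # N random, X experiment
CUT_OLIGO = "GTTCAGGATCCACGACGCTCTTC"                     # without the 3' aaaa
P5_3PRIME = "CACGACGCTCTTCCGATCT"                         # 3' region of P5Solexa
P3_3PRIME = "GCTGAACCGCTCTTCCGATCT"                       # 3' region of P3Solexa

checks = {
    # RT#Clip is 41 nt long, as labeled in Figure 10.3.
    "rt_primer_length_41": len(RT_CLIP) == 41,
    # RT#Clip primes on the last 9 nt of L3-App (ddC excluded).
    "rt_primes_on_l3": RT_CLIP.endswith(reverse_complement(L3_APP[-9:])),
    # The P5Solexa 3' region pairs with the adapter region after the barcodes.
    "p5_matches_rt_adapter": reverse_complement(P5_3PRIME) == RT_CLIP[9:28],
    # The P3Solexa 3' region pairs with full-length L3-App plus its ddC.
    "p3_matches_full_l3": reverse_complement(P3_3PRIME) == L3_APP + "C",
    # RT#Clip alone carries only part of the P3Solexa region.
    "rt_lacks_full_p3_region": P3_3PRIME not in RT_CLIP,
    # Cut-oligo anneals across the BamHI site (GGATCC) in RT#Clip.
    "cut_oligo_anneals": reverse_complement(CUT_OLIGO) in RT_CLIP,
    "bamhi_site_present": "GGATCC" in RT_CLIP and "GGATCC" in CUT_OLIGO,
}

if __name__ == "__main__":
    for name, passed in checks.items():
        print(name, passed)
```

The check `rt_lacks_full_p3_region` reflects the design point made in the legend: the primer on its own does not contain the complete P3Solexa target, so only cDNAs that copied L3-App sequence can be amplified.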


adapter regions introduced by L3-App and the RT#Clip oligonucleotides. P3Solexa and P5Solexa contain additional 41 and 39 nucleotides, respectively, required for HTS with the Illumina Genome Analyzer II.

10.5 Recent Modifications of the iCLIP Protocol

In order to generate RBP binding maps of superior coverage and specificity, we are continuously working to improve the iCLIP protocol. Recent key modifications are the new barcode design for the reverse transcription primers and the usage of a preadenylated adapter that improves efficiency of the intermolecular ligation step (as discussed above). To further enhance ligation efficiency we now add poly(ethylene glycol) 400 (PEG400) to the ligation reaction. We found that PEG of higher molecular weights had a negative effect on protein recovery from the immunoprecipitation. Size purification of the cDNA products is necessary to remove excess reverse transcription primer that can give rise to primer-dimer formation during the PCR reaction. We excise three bands of different molecular weights from the cDNA gel. We found that the lower-molecular-weight fraction is prone to isolation of contaminant primer-dimers. However, this fraction can also contain interesting short RNA species, such as microRNAs. Splitting the cDNA into three size fractions therefore avoids contamination of higher-molecular-weight fractions with primer-dimer products and at the same time allows recovery of smaller cDNAs (Figure 10.3). The three fractions are treated separately during PCR and gel analysis, but can be mixed before submitting the library for HTS. The choice of fractions to be sequenced can be made after analysis of the final PCR products and evaluation of primer-dimer occurrence. During PCR amplification of the cDNA library it is critical that no secondary products are generated. Therefore, we have tested different PCR enzyme mixes for their efficiency and propensity to generate undesired products. In our hands, AccuPrime SuperMix I (Invitrogen) and Immomix (Bioline) performed best in terms of specificity. We prefer AccuPrime SuperMix, since the reaction buffer is compatible with TBE gel electrophoresis. This eliminates the need for purification prior to the gel run.
In terms of efficiency, we found Phusion mix (Finnzymes) to be the best enzyme mix. The reaction is, however, very sensitive to overamplification and secondary product generation, so we do not recommend it for iCLIP experiments.

10.6 Troubleshooting

Since the iCLIP protocol contains a diverse range of enzymatic reactions and purification steps, it is not always easy to identify a problem when an experiment fails. Therefore, this section contains a few general suggestions, while more specific comments are given throughout the protocol. Each step has to be performed with high accuracy to obtain proper results. Precautions should also be taken to avoid contamination with PCR products from previous experiments. The best way to minimize this problem is to spatially separate pre- and post-PCR steps. Ideally, the analysis of the PCR products and all subsequent steps should be performed in a separate room. Moreover, buffers and other reagents should be aliquoted so that each member of the laboratory has their own set. In this way, sources of contamination can be identified more easily. Finally, let us get started with your iCLIP experiment.

10.7 Methods and Protocols

1. UV cross-linking

1.1 Tissues
Note: Tissue amounts of 50 mg produced good data using the anti-Nova or anti-hnRNP C antibodies. For neuronal tissue use Hank's buffered salt solution instead of phosphate-buffered saline (PBS).
1.1A Harvest 500 mg of tissue (enough for 10 immunoprecipitations). Add 5 ml ice-cold PBS.
1.1B Sequentially pass the tissue several times through the following:
a. 10-ml pipette.
b. 10-ml pipette with a cut p1000 tip (cut off a bit from the tip with a blade).
c. 10-ml pipette with an uncut p1000 tip.
d. 10-ml pipette with a p10 tip.
Note: UV light can penetrate a few cell layers, so triturating to a single cell suspension is unnecessary. We use this procedure to partially triturate brain tissue. Other tissue may require different dissociation protocols.
1.1C Transfer to a 10-cm tissue culture plate and place on ice. Irradiate suspension 4 times with 100 mJ/cm² in a Stratalinker 2400 at 254 nm. Mix between each irradiation.
Note: The length of cross-linking should be optimized for each protein, as each RNA-binding domain cross-links with different efficiency depending on its content of aromatic amino acids and the nucleotide composition of the binding site. Try 100, 200, and 400 mJ/cm², then use the shortest condition that gives greater than 70% of the maximum signal.
1.1D Add 0.5 ml suspension to each of 10 microtubes, spin at top speed for 10 s at 4 °C to pellet cells, and remove the supernatant.
1.1E Snap freeze pellets on dry ice and store at −80 °C until use.

1.2 Tissue culture cells
1.2A Add 6 ml ice-cold PBS to cells growing in a 10-cm plate (enough for 10 immunoprecipitations). Remove lid and place on ice.
1.2B Irradiate once with 150 mJ/cm² in a Stratalinker 2400 at 254 nm.
Note: Cells grown in a monolayer are equally exposed to the UV light and hence only require a single irradiation to cross-link equally.
1.2C Harvest cells by scraping, using cell lifters.
1.2D Add 2 ml suspension to each microtube, spin at top speed for 10 s at 4 °C to pellet cells, and then remove supernatant.
1.2E Snap freeze pellets on dry ice and store at −80 °C until use.

2. Immunoprecipitation

2.1 Solutions
Store all buffers in the fridge and perform the procedure on ice.

Lysis buffer
50 mM Tris–HCl, pH 7.4
100 mM NaCl
1% NP-40

0.1% Sodium dodecylsulfate (SDS)
0.5% Sodium deoxycholate
On the day of experiment, add 1/100 volume of protease inhibitor cocktail (Calbiochem) to the amount of buffer required for lysis (but not washing).
Note: If you are working with a tissue with high RNase A activity, adding 1/1000 volume of ANTI-RNase (Ambion; cat. no. AM2692) will control the RNase conditions, without affecting the activity of RNase I.

High-salt wash
50 mM Tris–HCl, pH 7.4
1 M NaCl
1 mM EDTA
1% NP-40
0.1% SDS
0.5% Sodium deoxycholate

PNK buffer
20 mM Tris–HCl, pH 7.4
10 mM MgCl2
0.2% Tween-20

5× PNK, pH 6.5 buffer
350 mM Tris–HCl, pH 6.5
50 mM MgCl2
25 mM Dithiothreitol
Freeze aliquots of the buffer.

4× Ligation buffer
200 mM Tris–HCl, pH 7.8
40 mM MgCl2
40 mM Dithiothreitol

PK buffer
100 mM Tris–HCl, pH 7.4
50 mM NaCl
10 mM EDTA

PK buffer + 7 M urea
100 mM Tris–HCl, pH 7.4
50 mM NaCl
10 mM EDTA
7 M Urea
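When preparing these solutions from concentrated stocks, the volume of each stock follows from V_stock = C_final × V_total / C_stock. The Python sketch below applies this to the lysis buffer. The final concentrations are those listed above; the stock concentrations (1 M Tris–HCl, 5 M NaCl, and 10% detergent stocks) and all names are assumptions made for illustration, since the protocol does not specify stocks.

```python
# Final concentrations of the lysis buffer, as listed in the protocol.
LYSIS_BUFFER = {
    "Tris-HCl pH 7.4 (mM)": 50,
    "NaCl (mM)": 100,
    "NP-40 (%)": 1,
    "SDS (%)": 0.1,
    "sodium deoxycholate (%)": 0.5,
}

# ASSUMED stock concentrations (not given in the protocol), same units as above.
ASSUMED_STOCKS = {
    "Tris-HCl pH 7.4 (mM)": 1000,
    "NaCl (mM)": 5000,
    "NP-40 (%)": 10,
    "SDS (%)": 10,
    "sodium deoxycholate (%)": 10,
}


def stock_volumes_ml(recipe, stocks, total_ml):
    """Return the volume (ml) of each stock and of water for total_ml of buffer."""
    volumes = {
        name: final * total_ml / stocks[name] for name, final in recipe.items()
    }
    water = total_ml - sum(volumes.values())
    if water < 0:
        raise ValueError("stocks are too dilute for the requested buffer")
    volumes["water"] = water
    return volumes


if __name__ == "__main__":
    for component, volume in stock_volumes_ml(LYSIS_BUFFER, ASSUMED_STOCKS, 50).items():
        print(f"{component}: {volume:.2f} ml")
```

The same function can be reused for the other buffers by swapping in their recipes and whatever stock concentrations are actually available in the laboratory.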

2.2 Bead preparation
Note: The amounts given below are meant for the cloning experiment. Less can be used for preliminary experiments or the high-RNase control.
2.2A Add 100 µl of Protein A Dynabeads (for rabbit antibodies; Dynal; cat. no. 100.02) per experiment to a microtube.
Note: Use Protein G Dynabeads for a mouse or goat antibody. These can sometimes work better for rabbit antibodies, too.
2.2B Wash beads 2 times with lysis buffer.
2.2C Resuspend beads in 100 µl lysis buffer with 2–10 µg antibody per experiment.
Note: The amount of antibody required depends on its quality and purity. This should be optimized in preliminary experiments.
2.2D Rotate tubes at room temperature for 30–60 min (until lysate is ready).
2.2E Wash 3 times with lysis buffer and leave in the last wash until ready to proceed to Step 2.4A.

2.3 Partial RNA digestion and centrifugation
2.3A Resuspend cell pellet (from Step 1) in 1 ml lysis buffer (with protease inhibitors).
Note: We are aiming for a concentration of around 10 mg/ml. Mouse brain pellets have around 50 mg and cell culture pellets around 20 mg. Weighing pellets before freezing can help you to be more precise about the required volume of lysis buffer.
2.3B Sonicate sample on ice (optional step). The probe should be approximately 0.5 cm from the bottom of the tube and not touching the tube sides in order to avoid foaming. Sonicate 2 times with 10-s bursts at 5 dB. Clean the probe by sonicating water before and after sample treatment.
Note: Sonication helps when using cell culture as undigested viscous DNA can sometimes cause problems with the immunoprecipitation. It can also alleviate problems caused by mild lysis buffers or hard-to-lyse tissues.
Note: Optionally, the lysate can be precleared with Protein A Sepharose (this does not hurt, but usually makes little difference; it may reduce background when using Protein A Dynabeads with a dirty antibody). Prepare a 30% Protein A Sepharose slurry in water. Add 100 µl Protein A Sepharose slurry to 1.5 ml lysate and rotate for 10 min in the cold room before spinning.
2.3C Make 1/1000 RNase I (Ambion; cat. no. AM2295) dilution in lysis buffer and add 10 µl to the lysate together with 2 µl Turbo DNase (Ambion; cat. no. AM2238).
2.3D Incubate for 3 min at 37 °C shaking at 1100 rpm. After incubation transfer to ice for >3 min.
Note: It is important to digest for exactly 3 min. Use 1.5-ml tubes in a 1.5-ml Eppendorf Thermomixer to make the warming to 37 °C efficient and reproducible.
Note: The optimal dilution factor for the low-RNase condition depends on the batch of RNase, so in the first experiment several dilutions should be tested. Dilutions between 1:500 and 1:2000 have worked well for us in the past. Unlike other DNases, Turbo DNase is active in conditions of up to 200 mM NaCl.

2.3E (Optional step, recommended for initial optimizations) Treat one sample with high RNase: prepare a 1/50 RNase I dilution in lysis buffer and add 10 µl to the lysate together with 2 µl Turbo DNase. Incubate for 3 min at 37 °C shaking at 1100 rpm and then transfer to ice for >3 min. This control can go straight from Step 2.4D to Step 2.7B (the RNA will be too short for ligation of the RNA linker). To minimize the use of reagents, it is possible to use only 1/5 of the cell lysate and all other reagents for this experiment.
Note: The high-RNase control can be omitted after successful optimization. Other recommended controls include a control where the RBP is absent from the original material (such as a knockout animal or knockdown cells), a control where no cross-linking is done, and a control where no antibody is used during immunoprecipitation.
Note: Unlike other RNases, RNase I has no base preference and therefore cleaves after all four nucleotides. Under high-RNase conditions, the size of the radioactive band viewed by SDS–PAGE has to change in comparison to low-RNase conditions, confirming that the band corresponds to a protein–RNA complex. Furthermore, this experiment helps to determine the size of the immunoprecipitated RBP, as the protein will be bound to short RNAs and thus will migrate as a less diffuse band around 5 kDa above the expected molecular weight.
2.3F (Optional step) Add cold lysis buffer to the lysate to bring it to a total of 2 ml. A more diluted lysate can decrease the background in the immunoprecipitation.
Note: (Optional) To test a new antibody, collect 15 µl at this step for Western blot comparison of the lysate before and after immunoprecipitation (to visualize depletion of the protein from the lysate).
2.3G Spin at 4 °C at top speed for 20 min (15 000 rpm or 21 800 × g with our centrifuge) and carefully collect the supernatant.

2.4 Immunoprecipitation
2.4A Remove wash buffer from the beads, then add the cell extract to the beads.
2.4B Rotate the beads/lysate mix for 1 h or overnight at 4 °C.
Note: (Optional) Save 15 µl supernatant for Western blot analysis in order to assess the amount of the antigen before and after immunoprecipitation.
2.4C Discard the supernatant and wash 2 times with high-salt wash (rotate the second wash for at least 1 min in the cold room).
2.4D Wash 2 times with PNK buffer and then resuspend in 1 ml PNK buffer (samples can be left at this stage at 4 °C, even overnight, until you are ready to proceed to Step 2.5).

2.5 3′ End RNA dephosphorylation
2.5A Discard supernatant. Resuspend the beads in 20 µl of the following mixture:

5× PNK buffer, pH 6.5                       4.0 µl
PNK (NEB; with 3′ phosphatase activity)     0.5 µl
RNasin                                      0.5 µl
Water                                      15.0 µl
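When several samples are processed in parallel, the per-reaction volumes above are usually combined into a master mix. The following is a minimal sketch of that arithmetic, assuming the volumes are in microliters; the 10% pipetting overage and the function name `master_mix` are illustrative assumptions, not part of the protocol.

```python
from decimal import Decimal

# Per-reaction volumes (µl) of the dephosphorylation mix in Step 2.5A.
PNK_MIX_UL = {
    "5x PNK buffer, pH 6.5": Decimal("4.0"),
    "PNK": Decimal("0.5"),
    "RNasin": Decimal("0.5"),
    "Water": Decimal("15.0"),
}


def master_mix(per_reaction, n_samples, overage=Decimal("0.1")):
    """Scale per-reaction volumes to n_samples plus a fractional overage.

    The 10% default overage is an assumption for illustration only.
    """
    factor = Decimal(n_samples) * (Decimal(1) + overage)
    return {name: vol * factor for name, vol in per_reaction.items()}


if __name__ == "__main__":
    mix = master_mix(PNK_MIX_UL, n_samples=4)
    for name, vol in mix.items():
        print(f"{name:<24}{vol:>7.1f} µl")
    print(f"{'Total':<24}{sum(mix.values()):>7.1f} µl")
```

Each sample then receives 20 µl of the combined mix. The same helper can be applied to the other reagent mixes in this protocol by swapping in their per-reaction volumes.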

2.5B Incubate for 20 min at 37 °C.
2.5C Wash once with PNK buffer.
2.5D Wash once with high-salt wash (rotate wash for at least 1 min in the cold room).
2.5E Wash 2 times with PNK buffer.

2.6 L3 Linker ligation
2.6A Carefully remove the supernatant and resuspend the beads in 20 µl of the following mix:

Water                                       9.0 µl
4× Ligation buffer                          4.0 µl
RNA ligase (NEB)                            1.0 µl
RNasin (NEB)                                0.5 µl
Preadenylated linker L3-App (20 µM)         1.5 µl
PEG400 (81 170, Sigma)                      4.0 µl

2.6B Incubate overnight (16 h) at 16 °C.
2.6C Add 500 µl PNK buffer.
2.6D Wash 2 times with 1 ml high-salt buffer, rotating in the wash for 5 min in the cold room.
2.6E Wash 2 times with 1 ml PNK buffer and leave in 1 ml of the second wash.

2.7 5′ End labeling
2.7A Collect 200 µl (20%) of beads from Step 2.6E and remove the supernatant.
2.7B Add 4 µl of hot PNK mix:

PNK (NEB)                                   0.2 µl
32P-γ-ATP                                   0.4 µl
10× PNK buffer (NEB)                        0.4 µl
Water                                       3.0 µl

2.7C Incubate for 5 min at 37 °C.
2.7D Remove the supernatant and add 20 µl of 1× NuPAGE loading buffer (prepared by mixing 4× stock with water; Invitrogen) to the beads. Remove the supernatant from the remaining cold beads from Step 2.6E. Then add the radioactively labeled beads to the cold beads. Incubate at 70 °C for 5 min.
2.7E Place on the magnet to precipitate the beads and load the eluate on the gel.

3. SDS–PAGE and nitrocellulose transfer
3A Load the samples on a 4–12% NuPAGE Bis-Tris gel (Invitrogen) according to the manufacturer’s instructions. Use 0.5 l 1× MOPS running buffer (Invitrogen). Also load 5 µl of a prestained protein size marker (e.g., PageRuler Plus, Fermentas; cat. no. SM1811).
Note: The Novex NuPAGE gels are critical. A pour-your-own SDS–PAGE gel (Laemmli) changes its pH during the run, which can reach 9.5, leading to alkaline hydrolysis of the RNA. The Novex NuPAGE buffer system is close to pH 7. We use MOPS NuPAGE running buffer.


3B Run the gel for 50 min at 180 V.
3C Remove the dye front and discard it as solid waste (contains free radioactive ATP).
3D Transfer the protein–RNA complexes from the gel to a Protan BA85 nitrocellulose membrane (Whatman) using the Novex wet transfer apparatus according to the manufacturer’s instructions (Invitrogen; transfer for 1 h at 30 V; do not forget to add 10% methanol to the transfer buffer).
Note: The pure nitrocellulose membrane is a little fragile, but it works better for the RNA–protein extraction step.
3E After the transfer, rinse the membrane in PBS buffer, then wrap it in saran wrap and expose it to a Fuji film at −80 °C (place a fluorescent sticker next to the membrane to later align the film and the membrane). Perform exposures for 30 min, 1 h, and overnight.
Note: Most free RNA will have run out of the gel or passed through the membrane, so the membrane will be 10–100 times less radioactive than the samples loaded on the gel.

4. RNA isolation
4A Use the high-RNase condition to examine the specificity of the protein–RNA complex.
Note: When performing iCLIP for the first time, use the following criteria to check that a specific RNA–protein cross-link and pulldown has been performed:
  1. Is there a radioactive band around 5 kDa above the molecular weight of the protein in the high-RNase experiment?
  2. Does the band disappear in the control experiments? These might include: no UV cross-link, pulldown with no antibody (beads only or preimmune serum), samples from a knockout organism or knockdown cells, or an appropriate control for overexpressed tagged proteins.
  3. Does the band move up and become more diffuse in the low-RNase condition? Since the RNA digestion is random, the RNA sizes vary more in the low-RNase condition and thus the RNA–protein complexes are more heterogeneous in size.
On this basis, if you are convinced of the veracity of your results, proceed to RNA isolation and amplification. Note the following guidelines:
  1. The average molecular weight of 70 nucleotides of RNA is 20 kDa. As the tags contain a linker of 21 nucleotides (L3-App), the ideal position of RNA–protein complexes that will generate iCLIP tags of sufficient length is 20–60 kDa above the expected molecular weight of the protein.
  2. The width of the excised band depends on potential other RNA–protein complexes present in the vicinity, as seen in the high-RNase experiment. If none are apparent, cut a wide band of 20–60 kDa above the molecular weight of the protein. If, however, other contaminant bands are present above the size of the protein, cut only up to the size of those bands. If the contaminating bands run below your RNA–protein complex, you might consider cutting an additional band between the contaminating band and your protein–RNA complex. The RNA sequences cloned from this band can later be compared with those purified with your protein–RNA complex to control the specificity of your experiment.
4B Isolate the protein–RNA complexes from the low-RNase experiment using your autoradiograph as a mask by cutting the respective region out of the nitrocellulose membrane. The region can be taken either in a single piece or further divided into two portions designated H (high, upwards from the band) and L (low, downwards from the band). Place the membrane fragments into 1.5-ml tubes. If a piece of membrane is too large to fit down to the bottom of the tube, cut it into several pieces before placing it into the tube.
4C Add 10 µl Proteinase K (Roche; cat. no. 03 115 828 001) in 200 µl PK buffer to the nitrocellulose pieces (all should be submerged). Incubate shaking at 1100 rpm for 20 min at 37 °C.
4D Add 200 µl of PK buffer + 7 M urea, and incubate for a further 20 min at 37 °C and 1100 rpm.
4E Collect the solution and add it together with 400 µl RNA phenol/CHCl3 (Ambion; cat. no. 9722) to a 2-ml Phase Lock Gel Heavy tube (VWR; cat. no. 713-2536).
Note: Over 90% of the radioactive signal should be removed after Proteinase K treatment. This can be monitored by a Geiger counter measurement of the membrane pieces before adding Proteinase K and after removing it.
4F Incubate for 5 min at 30 °C shaking at 1100 rpm (DO NOT VORTEX). Separate the phases by spinning for 5 min at 13 000 rpm at room temperature.
4G Transfer the aqueous layer into a new tube (be careful not to touch the gel matrix with the pipette). Precipitate by addition of 0.5 µl GlycoBlue (Ambion; cat. no. 9510) and 40 µl 3 M sodium acetate, pH 5.5. Then mix and add 1 ml 100% ethanol, mix again, and place overnight at −20 °C.
Note: GlycoBlue is necessary to efficiently precipitate the small quantity of RNA.
4H Spin for 15 min at 15 000 rpm at 4 °C. Remove the supernatant and wash the pellet with 0.5 ml 80% ethanol. Resuspend the pellet in 6.25 µl water.
Note: Remove the wash first with a p1000 and then with a p20 or p10. Try not to disturb the pellet, but if you do, spin it down again. Leave on the bench for 3 min, but no longer, with the cap open to dry. When resuspending, make sure to pipette along the back area of the tube.

5. Reverse transcription
5A Add the following reagents to the resuspended pellet from Step 4H:

Primer RT#Clip (0.5 pmol/µl)                0.5 µl
dNTP mix (10 mM)                            0.5 µl
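As a numerical aid for the band-excision guideline in Step 4A (70 nucleotides of RNA correspond to about 20 kDa, and the tags carry the 21-nucleotide L3-App linker), the conversion between the size shift on the membrane and the approximate RNA length can be sketched as follows. This assumes that mass scales linearly with length and that the shift above the protein is due only to RNA plus linker; both are simplifications, and the function name is hypothetical.

```python
# Constants taken from the Step 4A guidelines.
NT_PER_20_KDA = 70   # 70 nt of RNA weigh about 20 kDa
LINKER_NT = 21       # length of the L3-App linker


def rna_length_for_shift(shift_kda):
    """Return (total_nt, rna_nt) for a band running shift_kda above the protein.

    total_nt includes the ligated linker; rna_nt is the cross-linked RNA alone.
    Linear scaling of mass with length is assumed.
    """
    total_nt = shift_kda * NT_PER_20_KDA / 20
    return total_nt, total_nt - LINKER_NT


if __name__ == "__main__":
    for shift in (20, 60):
        total, rna = rna_length_for_shift(shift)
        print(f"+{shift} kDa: about {total:.0f} nt in total, {rna:.0f} nt without linker")
```

Under these assumptions, the recommended 20–60 kDa window corresponds to roughly 70–210 nucleotides including the linker.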

Note: Do not forget a negative control. This can be a reaction where no RNA was added to the mix, but is preferably a control sample that was isolated from a piece of nitrocellulose that did not contain the protein–RNA complex (e.g., the no-antibody control). Use distinct primers (RT1Clip–RT16Clip) for the control and the different replicates or experiments. The different primers contain individual 4-nucleotide barcode sequences that allow multiplexing of samples and control for cross-contamination between samples.
5B RT thermal program:

70 °C     5 min
25 °C     hold until the RT mix (see below) is added; mix by pipetting
25 °C     5 min
42 °C     20 min
50 °C     40 min
80 °C     5 min
4 °C      hold

RT mix:

5× RT buffer (Invitrogen)                                      2.0 µl
0.1 M Dithiothreitol                                           0.5 µl
SuperScript III reverse transcriptase (200 U/µl; Invitrogen)   0.25 µl
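Because each RT primer carries its own 4-nucleotide barcode, reads from pooled samples can later be assigned back to their sample of origin. The following is a minimal sketch of such an assignment. The read layout used here (three random nucleotides, the reverse complement of the primer barcode, two random nucleotides, then the insert) is an assumption inferred from the orientation of the RT primers listed in Section 9 and should be checked against real data; only three of the sixteen barcodes are shown, and all function names are hypothetical.

```python
# Barcodes as written in the RT primers (Section 9); subset for illustration.
PRIMER_BARCODES = {"RT1Clip": "AACC", "RT2Clip": "ACAA", "RT3Clip": "ATTG"}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Assumed read layout: NNN + revcomp(barcode) + NN + insert.
READ_BARCODES = {revcomp(bc): name for name, bc in PRIMER_BARCODES.items()}


def split_read(read):
    """Return (sample, random_nts, insert), or None if no known barcode matches."""
    if len(read) < 9:
        return None
    sample = READ_BARCODES.get(read[3:7])
    if sample is None:
        return None
    random_nts = read[:3] + read[7:9]   # the five N positions of the primer
    return sample, random_nts, read[9:]


if __name__ == "__main__":
    print(split_read("ACGGGTTTATTGCATGCA"))
    print(split_read("ACGAAAATATTGCATGCA"))
```

Reads whose barcode matches none of the primers used are returned as `None`; counting them gives a simple check for the cross-contamination mentioned in the note to Step 5A.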

5C Mix the samples that shall be multiplexed at this point.
5D Add TE buffer to 100 µl, then add 0.5 µl GlycoBlue and mix. Add 15 µl sodium acetate, pH 5.5, and mix, then add 300 µl 100% ethanol. Mix again and precipitate overnight at −20 °C.

6. Gel purification
6A Spin down for 15 min at 15 000 rpm at 4 °C. Remove the supernatant and wash the pellet with 0.5 ml 80% ethanol. Spin down again, remove the supernatant, and resuspend the pellet in 6 µl water.
6B Add 6 µl 2× TBE/urea loading buffer (Invitrogen) to the cDNA. It is recommended, at least in initial experiments, to add loading buffer also to 6 µl size marker (NEB low-molecular-weight marker; cat. no. N3233S; diluted 1/30). Heat samples to 80 °C for 3 min directly before loading. Leave one lane between each sample to facilitate cutting.
6C Prepare 0.8 l 1× TBE running buffer and fill the upper chamber with 0.2 l and the lower chamber with 0.6 l. Use a p1000 to flush precipitated urea out of the wells before loading 12 µl of each sample. Load the marker into the last lane.
6D Run a 6% TBE/urea gel for 40 min at 180 V until the lower (dark blue) dye is close to the bottom.
6E (Optional – only if the size marker was added) Cut off the last lane containing the size marker and stain it by incubation for 10 min shaking in 10 ml TBE buffer with 2 µl SYBR Green II stock. Wash once with TBE and visualize by UV transillumination. Record the sizes of the marker bands.
6F Together with the full L3-App sequence, the primer sequence accounts for 52 nucleotides of the cDNA. The upper (lighter blue) dye runs at 110–150 nucleotides and the first rim of the plastic gel cassette is at 75 nucleotides; these marks can be used to guide excision together with the size marker. Cut three bands at 70–85, 85–120, and 120–200 nucleotides. Use Figure 10.3 to guide the cutting of the bands. Place each gel piece into a 1.5-ml microtube.
6G Add 400 µl TE and crush the gel piece into small pieces with a 1-ml syringe plunger. Incubate shaking at 1100 rpm for 1 h at 37 °C, then place on dry ice for 2 min and place back at 1100 rpm for 1 h at 37 °C.
6H Transfer the liquid portion of the supernatant into a Costar SpinX column (Corning; cat. no. 8161) into which you have placed two 1-cm glass prefilters (Whatman; cat. no. 1 823 010).
6I Spin at 13 000 rpm for 1 min into a 1.5-ml tube. Add 0.5 µl GlycoBlue and 40 µl sodium acetate, pH 5.5. Mix, then add 1 ml 100% ethanol. Mix again and precipitate overnight at −20 °C.

7. Ligation of the primer to the 5′ end of the cDNA
7A Spin down and wash as described above, resuspend the pellet in 8 µl ligation mix, and incubate for 1 h at 60 °C:

Water                                       6.5 µl
10× CircLigase Buffer II (Epicenter)        0.8 µl
50 mM MnCl2                                 0.4 µl
CircLigase II (Epicenter)                   0.3 µl

7B Add 30 µl oligo annealing mix:

Water                                       26 µl
FastDigest Buffer (Fermentas)                3 µl
10 µM Cut_oligo                              1 µl

7C Anneal the oligos with the following program:

95 °C            2 min
95 °C to 25 °C   successive cycles of 20 s, starting from 95 °C and decreasing the temperature by 1 °C each cycle down to 25 °C
25 °C            hold
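The step-down annealing program in Step 7C can be written out explicitly, which is convenient when entering it into a thermocycler that lacks a built-in ramp function. This is a sketch only; it assumes that the 20-s steps include both end points (95 °C and 25 °C), which the protocol does not state explicitly, and the function name is hypothetical.

```python
def annealing_program(start_c=95, end_c=25, step_s=20, initial_hold_s=120):
    """Return the Step 7C program as a list of (temperature in °C, seconds).

    Assumes one 20-s step at every whole degree from start_c down to end_c,
    both end points included, after the initial 2-min hold.
    """
    steps = [(start_c, initial_hold_s)]
    steps += [(temp, step_s) for temp in range(start_c, end_c - 1, -1)]
    return steps


if __name__ == "__main__":
    program = annealing_program()
    total_s = sum(seconds for _, seconds in program)
    minutes, seconds = divmod(total_s, 60)
    print(f"{len(program)} steps, {minutes} min {seconds} s before the final hold")
```

Under this assumption the program comprises 71 step-down stages and takes a little under 26 min before the final hold at 25 °C.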

7D Add 2 µl BamHI (Fast Fermentas) and incubate for 30 min at 37 °C.
7E Add 50 µl TE and 0.5 µl GlycoBlue and mix. Add 10 µl sodium acetate, pH 5.5, and mix. Then add 250 µl 100% ethanol. Mix again and precipitate overnight at −20 °C.

8. PCR amplification
8A Spin down and wash the cDNA as described above, then resuspend it in 11 µl water.

8.1. Optimize PCR amplification
Note: Step 8.1 is optional. If you previously prepared libraries with the same protein and you had good radioactive RNA signal, you can estimate the number of required cycles and move directly to Step 8.2.
8.1A Prepare the following PCR mix:

cDNA (from Step 8A)                          0.5 µl
Primer mix P5Solexa/P3Solexa, 10 µM each     0.25 µl
AccuPrime SuperMix 1 enzyme (Invitrogen)     5.0 µl
Water                                        4.25 µl

8.1B Run the following PCR:

94 °C     2 min
94 °C     15 s    }
65 °C     30 s    }  25–35 cycles
68 °C     30 s    }
68 °C     3 min
25 °C     hold

Note: All work done post-PCR must be carried out on a specially designated bench. This cDNA must never be taken to an area where work with iCLIP RNA is done.
8.1C Mix 8 µl PCR product with 2 µl of 5× TBE loading buffer, load on a 6% TBE gel, and stain with SYBR Green I.

8.2. Preparative PCR
8.2A From your results in Step 8.1, estimate the minimum number of PCR cycles needed to amplify the whole library such that it will give a band on a gel. Bear in mind that you will now be amplifying 10 times more concentrated cDNA, so 3 fewer cycles are needed than in the preliminary PCR.
8.2B Prepare the following PCR mix:

cDNA (from Step 8A)                          10 µl
Water                                         9 µl
Primer mix P5Solexa/P3Solexa, 10 µM each      1 µl
AccuPrime SuperMix 1 enzyme (Invitrogen)     20 µl
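The rule of thumb in Step 8.2A (10 times more concentrated template, about 3 cycles fewer) follows from the roughly twofold amplification per PCR cycle, since log2(10) is approximately 3.3. A small sketch of this arithmetic is given below; it assumes ideal doubling in every cycle, which real reactions only approximate, and the pilot cycle number in the usage example is hypothetical.

```python
import math


def cycles_saved(fold_more_template, efficiency=1.0):
    """Cycles needed to make up a given fold difference in template.

    efficiency=1.0 means perfect doubling per cycle (an idealization).
    """
    return math.log(fold_more_template) / math.log(1 + efficiency)


if __name__ == "__main__":
    saved = cycles_saved(10)
    pilot_cycles = 28  # hypothetical result from the pilot PCR in Step 8.1
    print(f"log2(10) = {saved:.2f} cycles")
    print(f"preparative PCR: about {pilot_cycles - math.floor(saved)} cycles")
```

Rounding the saving down, as the protocol does, errs on the side of slightly more amplification rather than a library that is too faint to see on the gel.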

8.2C Run the same PCR program as in Step 8.1B.
8.2D Mix 8 µl PCR product with 2 µl 5× TBE loading buffer, load on a 6% TBE gel, and stain with SYBR Green I. The size of the cDNA insert to be mapped to the genome will be the size of the product minus the combined length of the P3/P5Solexa primers and the barcode (128 nucleotides). Ideally, the PCR products should be >145 nucleotides. If sharp DNA bands are visible with a size 145 nucleotides.
8.2E Submit 10 µl of the PCR library for sequencing and store the rest.

9. Linker and primer sequences
Preadenylated 3′ linker DNA (we order the DNA adapter from IDT and then make aliquots of 20 µM):

L3-App      rAppAGATCGGAAGAGCGGTTCAG/ddC/

DNA primers (we order desalted oligonucleotides from Sigma and do not gel purify them):

RT1Clip     X33NNAACCNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT2Clip     X33NNACAANNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT3Clip     X33NNATTGNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT4Clip     X33NNAGGTNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT5Clip     X33NNCGCCNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT6Clip     X33NNCCGGNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT7Clip     X33NNCTAANNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT8Clip     X33NNCATTNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT9Clip     X33NNGCCANNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT10Clip    X33NNGACCNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT11Clip    X33NNGGTTNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT12Clip    X33NNGTGGNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT13Clip    X33NNTCCGNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT14Clip    X33NNTGCCNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT15Clip    X33NNTATTNNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
RT16Clip    X33NNTTAANNNAGATCGGAAGAGCGTCGTGgatcCTGAACCGC
Cut_oligo   GTTCAGGATCCACGACGCTCTTCaaaa
P5Solexa    AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
P3Solexa    CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT

X33 = 5′ phosphate.

Acknowledgments

The authors thank all members of the Ule laboratory, and Dr. Kathi Zarnack and Dr. Ignacio Schor for advice and discussion as well as experimental support. This work was supported by the European Research Council grant 206726-CLIP to J. U. and a Long-term Human Frontiers Science Program fellowship to J.K. All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

References

1 Moore, M.J. (2005) Science, 309, 1514–1518.
2 Keene, J.D. (2007) Nat. Rev. Genet., 8, 533–543.
3 Wang, Z. and Burge, C.B. (2008) RNA, 14, 802–813.
4 Denman, R.B. (2006) Bioessays, 28, 1132–1143.
5 Tuerk, C. and Gold, L. (1990) Science, 249, 505–510.
6 SenGupta, D.J., Zhang, B., Kraemer, B., Pochart, P., Fields, S., and Wickens, M. (1996) Proc. Natl. Acad. Sci. USA, 93, 8496–8501.
7 Trifillis, P., Day, N., and Kiledjian, M. (1999) RNA, 5, 1071–1082.
8 Brooks, S.A. and Rigby, W.F. (2000) Nucleic Acids Res., 28, E49.
9 Tenenbaum, S.A., Carson, C.C., Lager, P.J., and Keene, J.D. (2000) Proc. Natl. Acad. Sci. USA, 97, 14085–14090.
10 Mili, S. and Steitz, J.A. (2004) RNA, 10, 1692–1694.
11 Ule, J., Jensen, K.B., Ruggiu, M., Mele, A., Ule, A., and Darnell, R.B. (2003) Science, 302, 1212–1215.
12 Ule, J., Jensen, K., Mele, A., and Darnell, R.B. (2005) Methods, 37, 376–386.
13 Licatalosi, D.D., Mele, A., Fak, J.J., Ule, J., Kayikci, M., Chi, S.W., Clark, T.A., Schweitzer, A.C., Blume, J.E., Wang, X., Darnell, J.C., and Darnell, R.B. (2008) Nature, 456, 464–469.
14 Yeo, G.W., Coufal, N.G., Liang, T.Y., Peng, G.E., Fu, X.D., and Gage, F.H. (2009) Nat. Struct. Mol. Biol., 16, 130–137.
15 Ule, J. and Darnell, R.B. (2006) Curr. Opin. Neurobiol., 16, 102–110.
16 Xue, Y., Zhou, Y., Wu, T., Zhu, T., Ji, X., Kwon, Y.S., Zhang, C., Yeo, G., Black, D.L., Sun, H., Fu, X.D., and Zhang, Y. (2009) Mol. Cell, 36, 996–1006.
17 Chi, S.W., Zang, J.B., Mele, A., and Darnell, R.B. (2009) Nature, 460, 479–486.
18 Zisoulis, D.G., Lovci, M.T., Wilbert, M.L., Hutt, K.R., Liang, T.Y., Pasquinelli, A.E., and Yeo, G.W. (2010) Nat. Struct. Mol. Biol., 17, 173–179.
19 König, J., Baumann, S., Koepke, J., Pohlmann, T., Zarnack, K., and Feldbrügge, M. (2009) EMBO J., 28, 1855–1866.
20 Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., Rothballer, A., Ascano, M. Jr., Jungkamp, A.C., Munschauer, M., Ulrich, A., Wardle, G.S., Dewell, S., Zavolan, M., and Tuschl, T. (2010) Cell, 141, 129–141.
21 Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., Rothballer, A., Ascano, M., Jungkamp, A., Munschauer, M., Ulrich, A., Wardle, G.S., Dewell, S., Zavolan, M., and Tuschl, T. (2010) J. Vis. Exp., 41, http://www.jove.com/details.php?id=2034 doi: 10.3791/2034.
22 Granneman, S., Kudla, G., Petfalski, E., and Tollervey, D. (2009) Proc. Natl. Acad. Sci. USA, 106, 9613–9618.
23 Urlaub, H., Hartmuth, K., and Lührmann, R. (2002) Methods, 26, 170–181.
24 König, J., Zarnack, K., Rot, G., Curk, T., Kayikci, M., Zupan, B., Turner, D.J., Luscombe, N.M., and Ule, J. (2010) Nat. Struct. Mol. Biol., 17, 909–915.
25 Wang, Z., Kayikci, M., Briese, M., Zarnack, K., Luscombe, N.M., Rot, G., Zupan, B., Curk, T., and Ule, J. (2010) PLoS Biol., 8, e1000530.
26 Lau, N.C., Lim, L.P., Weinstein, E.G., and Bartel, D.P. (2001) Science, 294, 858–862.
27 Vigneault, F., Sismour, A.M., and Church, G.M. (2008) Nat. Methods, 5, 777–779.


11 Massively Parallel Tag Sequencing Unveils the Complexity of Marine Protistan Communities in Oxygen-Depleted Habitats

Virginia Edgcomb and Thorsten Stoeck

Abstract

Polymerase chain reaction amplification of ribosomal DNA genes from environmental samples followed by sequencing of clone libraries revolutionized our understanding of the diversity and ecology of microbial eukaryotes by revealing that protist communities are far more genetically diverse than observations based on morphology indicated. Even the most extensive clone libraries have only recovered small fractions of much larger assemblages in environmental samples and those studies have revealed novel sequences at almost all levels of taxonomic hierarchy. Massively parallel 454 tag sequencing offers an approach for exploration of this extensive biodiversity that is much less labor-intensive, less methodologically biased, and less expensive than clone libraries, and provides unprecedented sequencing depth. Pyrosequencing studies of anoxic and micro-oxic marine environments have revealed that protistan communities are much more diverse than previously thought, and contain many sequences with unknown taxonomic affiliation. Sequences affiliating with known taxonomic groups are most commonly dinoflagellates, ciliates, and other members of the alveolates, members of which are known to be important in marine environments as primary producers, heterotrophic grazers, symbionts, and parasites.

11.1 Introduction

Microbial eukaryotes are pivotal members of aquatic microbial communities. Through grazing on prokaryotic prey, they regenerate nutrients and modify or remineralize organic matter [1–3], and are presumed to be important in organic matter cycling of all phases (particulate and dissolved). In addition, they are known to affect the population dynamics, activity, and physiological state of their prey [4–6]. Phagotrophic protists and viruses are the main sources of mortality for marine microbes [7,8]. Bacterial grazing is carried out principally by small flagellated protists and ciliates [1–4]. Protist grazing rates on free-living microbes have on occasion been measured in situ (e.g., [9,10]), but little is known about the identity of the dominant grazers. Through these direct and indirect effects, microbial eukaryotes influence the metabolic potential of bacterial and archaeal communities, and in turn, the aquatic carbon and other nutrient cycles. In recognition of their importance in aquatic ecosystems, they are now considered in numerical models of carbon cycling and in paradigms of surface and deep-ocean microbial ecology [7]. Over the past decade the widespread application of culture-independent molecular approaches has revolutionized our understanding of the structure and complexity of eukaryotic microbial communities, including in many “extreme” environments such

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.


as anoxic and deep sea habitats (e.g., [11,12]). The “gold standard” for many years has been the cloning and Sanger sequencing of polymerase chain reaction (PCR)-amplified marker genes, usually small subunit ribosomal DNA (SSU rDNA) from genomic DNA extracted from environmental samples [13]. The magnitude of the undersampled “protistan gap” in the eukaryotic tree of life has been highlighted by recent molecular studies of microbial diversity. These studies, based primarily on SSU rDNA, show that our current understanding of the ecological complexity of protist communities, and of protistan global species, at all levels of taxonomic hierarchy, is severely limited (e.g., [14,15]). Such surveys have shown that genetic diversity, both within known protist taxa and representing new taxa, is much greater than previously suspected based on results from traditional culture-based approaches, which are highly selective and only capable of detecting a fraction of the organisms in environmental samples using described media. Even recent studies of microbial eukaryotes in the relatively better-studied oxygenated water column reveal distinct populations in the euphotic versus bathypelagic zones, as well as representatives of novel alveolate and stramenopile lineages about which little is known [16]. Few (if any) SSU rRNA gene surveys of any community approach complete coverage [12,17,18]. In marine environments, even the largest eukaryotic SSU rDNA libraries have not approached sampling saturation, and therefore fail to completely characterize the diversity of microbial eukaryotes in those samples (e.g., [12,19–21]). Sanger-based sequencing of clone libraries, while powerful for many applications, is labor-intensive, relatively expensive, and methodologically biased [20].

The failure of Sanger-based clone libraries to capture a full picture of eukaryotic microbial diversity is due in part to the complexity of the eukaryotic microbial communities, but is also due to methodological issues such as bias in the plasmid ligation step, during which preferential ligation occurs between the linearized plasmid DNA and shorter PCR amplicon homologs (e.g., [22]). This is not only detrimental to exploration of the true richness and complexity of protistan communities, but also hampers comparative analyses of protistan communities in different samples in an ecological and biogeographical context [23–25]. Another factor that makes it difficult to capture a relatively complete picture of marine eukaryotic microbial diversity is that only a few eukaryotic taxa at any given time in a sample are likely to be abundant and actively growing, with the rest of the taxa either dormant (e.g., cysts) or less active. This has been well established for bacterial biodiversity, where studies have shown that these dormant or slowly growing and less abundant taxa are very difficult to detect by clone library approaches, especially when sequencing depth does not even approach saturation [26,27]. Massively parallel tag sequencing (454 sequencing) offers a means of more accurately and economically determining the depth of diversity of microbial communities. This method starts with relatively short (typically 200–800 bp) DNA fragments generated from genomic DNA, PCR products, bacterial artificial chromosomes, or cDNA, to which short adapters are added at both the 3′ and 5′ ends (incorporated into primers for PCR products).

These adapters are used for purification, amplification, and sequencing steps, during which a single-stranded DNA library is immobilized onto specifically designed capture beads such that one DNA fragment is attached to one bead, and this library is emulsified with amplification reagents containing a mixture of water and oil that results in micro-fabricated high-density picoliter reactors [28]. Elimination of the ligation and transformation steps of the Sanger-based clone library approach avoids some of the major biases inherent in that approach, and 454 sequencing produces hundreds of thousands of sequences in a single run. This allows researchers to sequence much more deeply at a fraction of the cost and effort of Sanger-based methods, providing a much more comprehensive picture of the microbial community for downstream analyses. Sogin et al. [27] used first-generation 454 sequencing to demonstrate that bacterial diversity was one to two orders of magnitude greater than reported previously in samples from deep water masses of the Northern Atlantic and diffuse flow hydrothermal vents. Most of the diversity reported in that study was represented by sequence types (tags) present at very low frequency and those authors referred to this long tail of amplicon diversity observed in rank abundance curves as the “rare biosphere” [27]. Here, we focus on 454 sequencing studies of microbial eukaryotes in micro-oxic and anoxic marine environments. These first studies are revealing a similar long tail of rare taxa (see below), suggesting that marine microbial eukaryotic diversity has also been significantly underestimated.
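The “long tail” of a rank abundance curve can be made concrete with a small sketch that counts how often each tag occurs, sorts the counts in descending order, and reports the share of sequence types seen only once. The tag labels and counts below are invented toy data, not results from any of the studies cited, and the function names are hypothetical.

```python
from collections import Counter


def rank_abundance(tags):
    """Abundance of each distinct tag, sorted from most to least frequent."""
    return sorted(Counter(tags).values(), reverse=True)


def singleton_fraction(tags):
    """Fraction of distinct tags that were observed exactly once."""
    counts = Counter(tags)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


if __name__ == "__main__":
    # Toy data set: three common tags and five tags seen only once.
    toy_tags = ["A"] * 50 + ["B"] * 20 + ["C"] * 5 + ["D", "E", "F", "G", "H"]
    print(rank_abundance(toy_tags))
    print(f"{singleton_fraction(toy_tags):.1%} of sequence types are singletons")
```

In this toy example a few tags account for most reads while most distinct sequence types are rare, which is the pattern the term “rare biosphere” describes.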

11.2 Cariaco Basin

The Cariaco Basin is the world’s largest truly marine anoxic system, and is an excellent system in which to compare microbial populations in the anoxic and oxic water column because the Basin is a mosaic of dramatically different biogeochemical niches, and has existed as such for millions of years [29], although it probably experienced periods of oxidation [30]. This basin progresses from fully oxic to sulfidic across a temporally varying boundary located at 250–350 m water depth. Protist diversity in this environment was first studied by Stoeck et al. [31], who found a substantial number of novel protist lineages in the anoxic portion of the water column, including new clades of the highest taxonomic level, as well as deep branches within established protistan groups. Edgcomb et al. [12] expanded on this previous work by sampling the Basin extensively at three stations, in two contrasting seasons. Sampling occurred at four depths and across strong biogeochemical gradients spanning fully oxygenated layers (40 m above the oxic/anoxic interface) to the deep, highly sulfidic habitats at 900 m depth. The oxic/anoxic interface was defined as the depth at which oxygen became undetectable. This depth usually corresponded to a particle density maximum and peaks in prokaryote and protist cell numbers. A critical issue in sampling is the volume of samples. In the literature, sample volumes from as little as 1 l to as much as 20 l of water are used to construct amplicon libraries. This is certainly a source of bias (even though this potential source of bias has not been evaluated and quantified to date).
However, there is no rule as to what volume is best, because the nature of the sample or sampling site plays a decisive role in the decision of how much water is drawn onto one filter: in oligotrophic waters or in deep-sea samples protistan cell numbers can be as low as 10³/l, thus larger volumes of sample water are necessary to enrich the low-abundant genes (from rare taxa) for subsequent sequence analyses. At other sampling sites, such as eutrophic waters or during algal blooms, cell numbers can be more than twice as high in a liter of water and less sample water is required. A study that assesses the effect of the sample size (volume) on amplicon distribution and coverage would be highly interesting and valuable for the design of future environmental pyrosequencing studies.

Sampling of oxygen-depleted habitats is not trivial. Most strict anaerobes are highly sensitive to oxygen and react with immediate cell death upon exposure to even trace amounts of oxygen. Therefore, measures need to be taken in order to avoid contact of sample material with atmospheric conditions. This can be accomplished as follows: water columns are sampled using Niskin bottles attached to a rosette with a conductivity, temperature, and density profiler recording physicochemical parameters of the sampling site. Immediately after retrieval on board, water is withdrawn from the Niskin bottles under N2 atmosphere into sterile, evacuated 2-l hospital intravenous bags (non-DEHP (di(2-ethylhexyl) phthalate) vinyl; e.g., Secure Medical). Samples should be processed as soon as possible in order to minimize handling time and thus potential changes in protistan community structures in the bags. When immediate processing is not possible, bags can be stored immersed in seawater maintained at in situ temperature and light conditions, but no longer than 24 h. Cells from bags can be collected on 47-mm Durapore membranes (Millipore, 0.65 µm pore size) by using an in-line filtration system connected directly to the intravenous bag under gentle vacuum (around 25 cm Hg) provided by a peristaltic pump. This sampling and processing protocol prevents exposure to the atmosphere and minimizes alterations in the water’s redox potential.

Immediately after filtration, membranes can be placed in 2.0-ml cryovials containing 1 ml of extraction buffer (100 mM Tris–HCl, pH 8, 100 mM sodium phosphate buffer, pH 8, 1.5 M NaCl, 100 mM EDTA, pH 8) with 1% CTAB (hexadecyltrimethylammonium bromide). Then 5 µl of Proteinase K should be added before freezing of the sample at −20 °C or in liquid nitrogen until further processing of the samples in the lab as described in detail elsewhere [32]. In brief, DNA is extracted from filters using either a phenol/isoamyl/chloroform extraction with a final isopropanol precipitation of the nucleic acids or a commercially available kit like Qiagen’s AllPrep DNA/RNA kit. When higher loads of enzymatic downstream reaction inhibitors are expected in the sample (like humic acids, interfering with the Taq polymerase in the subsequent PCR reaction) the phenol/chloroform extraction proved to be better suited than commercially available kits. The target genes are then amplified from extracted environmental genomic DNA using polymerase chain reaction. The 454 Life Science’s A or B sequencing adapters were fused to the 5′ end of the specific PCR primers. For each individual environmental DNA extract we usually run three to five independent 30-µl PCR reactions with reaction mix consisting of 5 U of PfuTurbo polymerase (Stratagene), 1× Pfu reaction buffer, 200 µM dNTPs (Pierce Nucleic Acid Technologies), a 0.2 µM concentration of each primer in a volume of 100 µl, and 3–10 ng genomic DNA as template. The PCR protocol depends on the choice of gene-specific primers.
As an example, for the hypervariable V9 region of the SSU rRNA fragment, it employs an initial denaturation at 94 °C for 3 min; 30 cycles of 94 °C for 30 s, 57 °C for 45 s, and 72 °C for 1 min; and a final 2 min extension at 72 °C. PCR products from the same DNA sample are pooled and cleaned by using the MinElute PCR purification kit (Qiagen). The quality of the products can be assessed on a Bioanalyzer 2100 (Agilent) using a DNA1000 LabChip. We only use sharp, distinct amplification products with a total yield of more than 200 ng for 454 sequencing. The fragments in the amplicon libraries are then bound to beads under conditions that favor one fragment per bead. The emulsion PCR is performed by emulsifying the beads in a PCR mixture in oil, with PCR amplification occurring in each droplet, generating more than 10 million copies of a unique DNA template. After breaking the emulsion, the DNA strands are denatured and beads carrying single-stranded DNA clones are deposited into wells on a PicoTiter Plate (454 Life Sciences) for pyrosequencing. Alternatively, RNA can be extracted from environmental samples instead of DNA, because DNA is very stable in the environment and does not necessarily reflect the true (indigenous) composition of local protistan diversity. However, this is more crucial in sedimentary and accumulative systems like benthic environments where ancient (“nonliving”) DNA can accumulate over longer periods of time, rather than samples from a dynamic water column. In case RNA is extracted (for example with Qiagen’s AllPrep DNA/RNA extraction kit) filters should be stored in RNAlater (e.g., available from Qiagen or Ambion) before extraction with a kit according to the manufacturer’s protocol. Then, single-stranded RNA should be transcribed into cDNA (e.g., using Qiagen’s Omniscript two-step reverse transcription kit), which then can be further processed like the genomic environmental DNA described above.
Figure 11.1 depicts the basic steps involved in sample preparation. Edgcomb et al. [12] compared the diversity of protistan gene sequences based on relatively large clone libraries (16 000 sequences total, 6489 high-quality protistan) generated using multiple PCR primer combinations to the picture of protistan diversity obtained using Roche GS FLX 454 sequencing of amplicons from the V9 region using the V9 primer pair 1391F (5′-GTACACACCGCCCGTC-3′) and EukB (reverse) (5′-TGATCCTTCTGCAGGTTCACCTAC-3′) (Edgcomb, unpublished data). Pyrosequencing was performed by MWG Biotech. This effort produced 251 648 sequences; after quality control 82 484 (around 205 bp) sequences remained for further analysis. “Poor quality” sequences were removed if (i) they were shorter than

11.2 Cariaco Basin


Fig. 11.1 Flowgram of methodology for 454 pyrosequencing analysis of 18S rRNA.

100 nucleotides, (ii) they contained any N (ambiguous) nucleotides, or (iii) they had an incorrect primer sequence. Tag sequences were compared to a reference database generated from all SSU rRNA gene sequences in public databases longer than 500 nucleotides. The BLAST output was parsed to extract taxonomic assignments at thresholds ranging from 70 to 100% sequence similarity. For each query, only the most similar target sequence for which a good taxonomic assignment was provided was retained. The picture of protist diversity obtained from a single V9 library, generated by pooling V9 PCR products amplified individually from all DNA samples that were previously analyzed by clone library/Sanger sequencing, shows the Cariaco water column to be dominated by Alveolata (predominantly the ciliate subphylum Intramacronucleata, and four dinoflagellate orders, Gymnodiniales, Prorocentrales, Syndiniales, and Gonyaulacales) and Rhizaria (Table 11.1). The two sequencing technologies showed a congruent picture of the dominant protist types recovered and their relative proportions; however, as expected, the 454 approach revealed a much more extensive picture of the “rare biosphere” [12].
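The three removal criteria described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the study: the function name `passes_quality`, the requirement that the read begins with an exact copy of the forward primer, and the toy reads are all assumptions made for the example. The text does not state whether the 100-nucleotide minimum includes the primer; here it does.

```python
FORWARD_PRIMER = "GTACACACCGCCCGTC"  # 1391F, as given in the text


def passes_quality(read, primer=FORWARD_PRIMER, min_length=100):
    """Return True if a read survives the three filters (hypothetical helper)."""
    read = read.upper()
    if len(read) < min_length:       # (i) too short
        return False
    if "N" in read:                  # (ii) ambiguous base present
        return False
    if not read.startswith(primer):  # (iii) primer missing or incorrect (exact match assumed)
        return False
    return True


# Toy reads, invented for illustration only.
reads = [
    FORWARD_PRIMER + "ACGT" * 30,  # 136 nt, clean: kept
    FORWARD_PRIMER + "ACGT" * 10,  # 56 nt: too short
    FORWARD_PRIMER + "ACNT" * 30,  # contains N
    "TTTT" + "ACGT" * 30,          # wrong primer
]
kept = [r for r in reads if passes_quality(r)]
print(len(kept))  # 1
```

A production filter would more likely tolerate a small number of primer mismatches; exact matching is used here only to keep the sketch short.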

Table 11.1 Phylum and kingdom-level assignments of the 18S rRNA gene sequence collection obtained by 454 pyrosequencing of Cariaco Basin samples.

Taxonomic group          Percentage of 454 library
Alveolata                43
Rhizaria                 29
Stramenopiles            9
Environmental samples    8
Euglenozoa               5
Fungi                    1
Viridiplantae            1
Cryptophyta              0.6
Telonemida               0.1
Rhodophyta               0.1
Apusozoa                 0.1
Haptophycea              0.5
Cryomonadida             0.4
Amoebozoa                0.07
Phaeomonas               0.06
Parabasalidae            0.05
Jakobida                 0.05
Glaucocystophyceae       0.02
Ichthyosporea            0.01
Heterolobosea            0.01
Proleptomonas            0.005

Sequences were assigned to taxonomic groups based on 70% or greater sequence similarity threshold for BLASTN analysis of sequences against a reference database containing all SSU rRNA gene sequences in public databases longer than 500 nucleotides (Edgcomb et al., unpublished data).


Unfortunately, since the 454 data in this study were generated from a single, pooled sample, information about the differences in populations at different depths is lost; however, parametric statistical analyses of the clone library data indicated that the protistan communities in anoxic habitats are dissimilar to those in the oxic water column and may share as little as 5% of operational taxonomic units (OTUs; defined as clusters of sequences sharing 95% or more sequence similarity) [32].

11.3 Framvaren Fjord

The Framvaren Fjord, located in southwest Norway, shares with the Cariaco Basin a stable and defined oxic/anoxic interface and a permanently anoxic water column with steep physicochemical gradients below the redoxcline, yet the fjord differs in several physicochemical parameters (for a comparison, see [15]). For example, while the Cariaco Basin has a redoxcline below the photic zone and relatively low sulfide concentrations below the redoxcline, the oxic/anoxic boundary layer of the Framvaren Fjord is located around 18 m depth, with sulfide concentrations below the redoxcline that are 25 times greater than those found in the Black Sea [33] and steep biogeochemical gradients down to the bottom waters (180 m). Stoeck et al. [34] conducted a pyrosequencing study of 10 L of anoxic fjord water that compared the picture of protist diversity in this sample obtained using amplification of the V9 versus V4 hypervariable regions. Since the variations in V9 and V4 hypervariable regions do not exactly track each other, by targeting both regions, the coverage of protistan diversity was almost certainly more thorough than it would have been by using either region alone. This study obtained 602 070 tag sequences (around 250 bp). The PCR amplification of the V9 region used 1391F and EukB, and for V4 used TAReuk454FWD1 and TAReukREV3 [34]. Sequences were considered “low quality” and were removed if they (i) were shorter than 100 nucleotides, (ii) had an inaccurate calibration key sequence, (iii) had an inaccurate or incomplete primer sequence, or (iv) contained any nucleotides designated as N (unidentified). All unique tags (nonoverlapping and dereplicated) were clustered at 97–99% identity to assess diversity. Taxonomy was assigned by BLASTN for unique tags with a best BLAST hit of at least 80% sequence similarity, which allowed assignment to approximately class level [33,34].
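As a rough illustration of the dereplication and best-hit assignment steps described above, the following Python sketch collapses identical reads and then keeps only the best hit per tag, assigning a taxon when that hit reaches the 80% threshold. The function names and the `(query_id, taxon, percent_identity)` tuple layout are hypothetical; they stand in for whatever parsed BLASTN output a real workflow would use, and the hit values are invented.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into {sequence: abundance}."""
    return Counter(reads)


def assign_taxonomy(hits, min_identity=80.0):
    """Keep the best hit per query; assign a taxon only at or above the threshold.

    `hits` is an iterable of (query_id, taxon, percent_identity) tuples,
    an assumed pre-parsed form of BLASTN results.
    """
    best = {}
    for query, taxon, identity in hits:
        if query not in best or identity > best[query][1]:
            best[query] = (taxon, identity)
    return {
        query: (taxon if identity >= min_identity else "unassigned")
        for query, (taxon, identity) in best.items()
    }


unique_tags = dereplicate(["ACGT", "ACGT", "ACGA"])

# Invented hits for two tags.
hits = [
    ("tag1", "Dinophyceae", 92.5),
    ("tag1", "Ciliophora", 85.0),
    ("tag2", "Fungi", 74.0),
]
assignments = assign_taxonomy(hits)
print(assignments)  # {'tag1': 'Dinophyceae', 'tag2': 'unassigned'}
```

Raising `min_identity` trades the number of assigned tags against the taxonomic depth that an assignment can support.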
Both markers identified a wide range of taxonomic groups in the anoxic waters of Framvaren Fjord; however, the sequences for which an assignment could be made to a known taxonomic group in both libraries were dominated by dinoflagellates and their close relatives. The majority of tags identified in that study (71 and 40% for V4 and V9, respectively) matched unclassified environmental sequences in GenBank, indicating that the vast proportion of eukaryotic microbes in this anoxic sample is currently undescribed. Of interest is the observation that the V4 and V9 markers detected very different diversity profiles within the dinoflagellates (Figure 11.2). The implication of this observation for researchers intending to apply 454 pyrosequencing to studies of eukaryotic microbial diversity is that, in order to capture a more complete picture of diversity, one cannot rely on so-called “universal” PCR primers. This should be no surprise, as it is well known that PCR primers are generally biased in their recovery of different taxonomic groups. Accumulation curves of unique tags plotted against the total number of tags recovered demonstrated that even after around 250 000 sequence reads, accumulation of unique tags does not reach saturation, and both V4 and V9 libraries were dominated by rare tags (singletons). Of the high-quality unique V4 and V9 tags, 75 and 68%, respectively, were detected only once [34]. These authors note that these singletons, while possibly representing rare genotypes in the environment, may alternatively be simply due to some combination of nucleotide misincorporation and read errors during PCR and sequencing, PCR chimera formation, or intragenomic polymorphism among multiple rRNA copies within a single nucleus. This is important to consider when interpreting eukaryotic pyrosequencing data and we suggest here that after initial


Fig. 11.2 Comparison of V4 versus V9 recovery of dinoflagellates. Differential recovery of taxonomic groups within the dinoflagellates observed when using (a) V4 and (b) V9 18S rRNA gene markers for 454 pyrosequencing of Framvaren Fjord samples [34].

culling of tag sequences using a quality control pipeline, it is still prudent to interpret the remaining singletons with caution, as they almost certainly do not all represent rare genotypes in the environmental sample.
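The two summary statistics discussed in this section, the fraction of unique tags seen only once and the accumulation of unique tags with sequencing effort, can be computed as in the following minimal sketch. The function names and the toy read list are invented for illustration; a real accumulation curve would normally be averaged over many random orderings of the reads, which this sketch does not do.

```python
from collections import Counter


def singleton_fraction(reads):
    """Fraction of unique tags that occur exactly once."""
    counts = Counter(reads)
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


def accumulation_curve(reads):
    """Return (reads sampled, unique tags seen) pairs in the given read order."""
    seen = set()
    curve = []
    for i, read in enumerate(reads, start=1):
        seen.add(read)
        curve.append((i, len(seen)))
    return curve


# Toy data: letters stand in for distinct tag sequences.
reads = ["A", "A", "B", "C", "A", "D"]
print(singleton_fraction(reads))      # 0.75
print(accumulation_curve(reads)[-1])  # (6, 4)
```

A curve that is still climbing at its last point, as in the study described above, indicates that further sequencing would continue to recover new unique tags.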

11.4 Comparison of Cariaco Basin to Framvaren Fjord

A separate 454 pyrosequencing study was conducted that compared protistan community complexity in two contrasting anoxic marine ecosystems – the Cariaco Basin and Framvaren Fjord. Since eukaryotic clone library data existed for both of these sites, it was possible to evaluate the efficiency of the pyrosequencing strategy to compare these two ecosystems [15,33]. The comparative study of Framvaren Fjord and Cariaco Basin by Stoeck et al. [15] examined temporal effects, local patchiness, and environmental factors associated with distinct local characteristics found at different depths. In each location, samples were collected at the anoxic/oxic interface, and from below the interface in two different seasons and at locations separated by kilometers. These pyrosequencing efforts utilized the V9 hypervariable region of the 18S rRNA gene to examine diversity in the eight samples (four from Framvaren and four from Cariaco) using the Roche GS FLX system and two different forward primers for the V9 amplification (1380F and 1389F) coupled with 1510R [15]. After quality control procedures, 222 593 reads (around 250 bp) were available for further analysis. This level of sequencing effort was still not adequate for retrieving all SSU rRNA gene sequences present in a single sample [15]. For example, as many as 5600 unique tag sequences were present in a 7-L water sample from the Cariaco Basin, yet this sampling did not reach saturation. Not all of this sequence diversity, however, is likely to represent true species richness because of several caveats that must be considered


when interpreting tag sequence data: (i) not all SSU rRNA gene copies within a species are identical [35,36], (ii) specific taxon groups may exhibit relatively extreme amounts of variability in the V9 (or another) hypervariable region, and (iii) the accuracy of the 454 pyrosequencing strategy (GS-20 technology) is 99.75–99.5% for small subunit rRNA genes [37]. The Stoeck et al. [15] study applied a systematic trimming procedure to attempt to minimize the effect of PCR and sequencing errors, but nonetheless there remained a significant number of singleton sequences present only once in a particular sample. As an example, when one of the Cariaco samples was clustered based on one nucleotide difference (corresponding to a sequence difference of around 0.8%), the number of OTUs in that sample dropped from 5600 to around 2600 [15]. It is difficult to determine what percentage of those singletons represent sequencing/PCR errors versus real diversity. A cautious interpretation of 454 data is therefore recommended when predicting taxon richness in a sample. Of note in the Stoeck et al. [15] study comparing Framvaren and Cariaco anoxic water column protist communities is that the number of detected OTUs, even when tags were clustered at 10-nucleotide differences, exceeded previous parametric and nonparametric richness estimates from the same sites based on clone library data. This may be due to several factors. Previous clone libraries may have been so severely undersampled that any diversity estimates would have been problematic due to poor confidence intervals. Furthermore, relative amplicon copy number after PCR may not reflect the relative abundance of a taxon in a sample. This may make abundance-based richness estimates of eukaryotic diversity inappropriate since the copy number of SSU rRNA genes varies significantly among protists (e.g., [38,39]).
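To make the idea of clustering tags at a fixed number of nucleotide differences concrete, here is a small Python sketch that groups tags by single linkage whenever two tags differ by at most `max_diff` edits. This is an assumption-laden illustration, not the procedure of Stoeck et al. [15], whose clustering details are not given here: the use of Levenshtein distance, single linkage, the function names, and the toy tags are all choices made for the example. The all-against-all comparison is quadratic and only practical for small tag sets.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def single_linkage(tags, max_diff=1):
    """Group tags connected by chains of pairs differing by <= max_diff edits."""
    parent = list(range(len(tags)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(tags)):
        for j in range(i + 1, len(tags)):
            if edit_distance(tags[i], tags[j]) <= max_diff:
                parent[find(i)] = find(j)

    clusters = {}
    for i, tag in enumerate(tags):
        clusters.setdefault(find(i), []).append(tag)
    return list(clusters.values())


# Invented tags: the first three form a chain of one-nucleotide differences.
tags = ["ACGTACGT", "ACGTACGA", "ACGTACCA", "TTTTTTTT"]
print(len(single_linkage(tags, max_diff=0)))  # 4
print(len(single_linkage(tags, max_diff=1)))  # 2
```

The toy output mirrors the effect described above: allowing a single difference merges near-identical tags and sharply reduces the OTU count, although it cannot tell sequencing errors from genuine close relatives.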
While protist diversity in oxygen-depleted and anoxic marine samples far exceeds previous estimates, we still must make more headway in understanding how hypervariable tag sequence data translate into taxonomic entities. As in any DNA-based study (including clone libraries), some portion of genotypes could result from the amplification of detrital DNA. Since the proportion of rare OTUs (represented by fewer than 10 tags) ranged from 71 to 83% in the eight samples analyzed in this study, it is at least safe to assume that the high number of rare taxa is not an artifact based simply on high intraspecific heterogeneity in the V9 region. The rare genotypes may represent a genomic pool that helps the protist community react to local biotic or abiotic changes. Pyrosequencing and clone library/Sanger sequencing studies in Framvaren Fjord and Cariaco Basin detected the same most abundant taxonomic groups (e.g., alveolates and stramenopiles), and several authors have already given tentative explanations for the dominance of these groups in anoxic marine systems (e.g., [20,40,41]). However, 454 approaches indisputably provide greater insight into phylum richness. It should be noted that the tags that could be assigned to taxonomic groups not previously detected via clone libraries (members of the Apusozoa, Chrysomerophytes, Eustigmatophytes, Hyphochytriomycetes, Ichthyosporea, Phaeothamniophytes, Rhodophytes, Oikomonads, and Centroheliozoa) all account for less than 1% of the unique protistan tags, which explains why they were missed with the clone library approach. The eight sites sampled in the comparative study of Framvaren Fjord and Cariaco Basin by Stoeck et al. [15] also differed markedly in community composition, even though all of them are anoxic marine ecosystems (Figure 11.3).
This is not surprising, given that the two locations are subjected to different degrees of seasonal upwelling and surface inputs, and that there are multiple physicochemical differences between them. Observed protist community composition differences can be mostly ascribed to the fact that the anoxic/oxic interface of the super-sulfidic Framvaren Fjord lies within the photic zone and therefore differs significantly from the less sulfidic deep-sea sites in Cariaco Basin. Relative proximity to the photic zone and hydrogen sulfide concentration are likely the two largest drivers of community composition differences observed between the two locations. Hydrogen sulfide is widely known to be toxic to eukaryotes, and hence is a strong selective force. Hydrogen sulfide detoxification requires specific adaptations that are not necessarily present in all facultative or strictly anaerobic protists [42,43]. In spite of the difficulties in making direct comparisons between the protist communities in the two locations given the limited number of samples in the study and physicochemical differences between them, the Stoeck et al. study does demonstrate the power of 454 pyrosequencing to more deeply survey protistan community complexity, and it does demonstrate that both ecosystems are highly variable regarding the dynamics of protistan communities on a spatial and temporal scale.

Fig. 11.3 Comparison of protist communities in Framvaren Fjord and Cariaco Basin samples. Relative taxonomic distribution of unique protistan and fungal V9 tags generated from four anoxic water samples of the Caribbean Cariaco deep-sea basin (CAR1–4) and from four anoxic water samples of the Norwegian Framvaren Fjord (FV1–4). (a) Phylum-based assignment. Phyla that were represented by a proportion of 1% or more of all unique tags in at least one of the eight libraries used for 454 sequencing are shown. The category “others” denotes tags that could not be assigned to a taxonomic entity based on an 80% BLASTN similarity threshold and tags that fell into other phyla or taxon groups but were represented by less than 1% of the unique tags in all of the eight PCR amplicon libraries used for 454 sequencing. This category contains all sequences not previously detected in clone library/Sanger sequencing from the same environment. (Adapted from Stoeck et al. [15].)

11.5 Perspectives on Interpretation of Microbial Eukaryote 454 Data

High-throughput pyrosequencing technology promised to address methodological shortcomings by recovering uncommon, perhaps even exceedingly rare species [27,44], but the short read lengths of 454 sequences made it necessary to rely on existing long rRNA gene sequences in public databases to establish taxonomic identities. As pyrosequencing reads get longer, this problem will lessen. Also, concerns remain about the role that sequencing errors may play in producing an artifactual picture of the sample’s richness [45]. As Stoeck et al. [34] showed, homopolymers within microbial eukaryote rRNA gene sequences may present a significant problem by contributing to the overinflation of rare tags. The longer hypervariable V4 region may encompass 6.8 times more homopolymers relative to the shorter V9 region [34]. When clustering tag sequences at 100% sequence similarity, it is possible that homopolymer error rates contribute to overinflated total diversity. Stoeck et al. [34] suggest that clustering tag sequences at 97% sequence similarity, which is common for prokaryotes [26], may be too conservative for estimating protist diversity, as the SSU may not evolve fast enough in many protists to resolve species (or OTU equivalents), and that the faster-evolving ITS regions may be more appropriate markers for speciation. Sequence variation among copies of rRNA genes within single nuclei may also contribute to overinflation of rare taxon numbers. The largest number of unique tags from Framvaren Fjord came from dinoflagellates and chlorophytes [34]. Both groups contain many large-celled taxa that have been shown to contain many rRNA cistron copies [46]. Since different taxonomic groups are known to contain very different levels of intragenomic SSU polymorphism, it is very difficult to estimate the level of contribution this type of variation is making to an environmental sample. 
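Since homopolymer runs are singled out above as a source of inflated rare tags, a quick way to compare candidate marker regions is simply to count such runs. The sketch below does this with a regular expression. The function name, the minimum run length of three bases, and the example sequence are assumptions for illustration; the text does not say how long a run must be to count as a homopolymer.

```python
import re


def count_homopolymers(seq, min_run=3):
    """Count runs of one repeated base (A, C, G, or T) of length >= min_run."""
    # One base followed by at least (min_run - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return sum(1 for _ in pattern.finditer(seq.upper()))


# Invented sequence containing the runs AAA and TTTT.
example = "AAACGTTTTGC"
print(count_homopolymers(example))             # 2
print(count_homopolymers(example, min_run=4))  # 1
```

Applied to aligned V4 and V9 reference sequences, a count like this would give a rough, marker-specific indication of how exposed each region is to homopolymer-related read errors.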
Interpretation of protistan tag sequence data is further complicated by unknown contributions from other potential sources of error, some of which have been


discussed above (i.e., PCR/sequencing error and PCR chimeras). Different platforms for next-generation sequencing (Roche 454, Illumina Genome Analyzer, and ABI SOLiD) are all subject to systematic biases and base-calling errors that need to be considered when utilizing these platforms for population-targeted sequencing studies [47]. Altogether, in spite of quality control efforts to remove poor quality sequences, it is important to consider clustering protist tag sequences at levels below 100% for generation of diversity estimates and taxonomic profiles. For the V9 and V4 regions of the eukaryotic SSU rRNA gene, clustering at 98% would allow for an average of four and five errors, respectively. There are many approaches to clustering tag sequence data, and different approaches can influence the resulting diversity estimates (for discussion of this, see [48]). Huse et al. [49] discuss how a common method of generating OTUs by multiple sequence alignment and complete-linkage clustering can significantly increase the number of predicted OTUs and related richness estimates. They propose a two-step process involving an initial 2% single-linkage preclustering followed by an average-linkage clustering based on pairwise alignments, which they demonstrate reduces the OTU richness in environmental samples by as much as 60%. The computational processing of large pyrosequencing datasets is currently the bottleneck of this approach in microbial ecology. Typical steps of analyzing such sequence data include, for example, quality controls, computing alignments, BLASTing the sequences, identifying OTUs, assessing sampling saturation, and calculating basic statistical information on the sequence data. Tools like PANGEA [50] and QIIME [51] consist of a chain of different tools, which have to be combined manually in order to obtain the desired results.
For that purpose they use well-known software packages like MegaBLAST or CD-HIT and new Perl scripts to bridge the gaps between the inputs and outputs of these packages. They do not offer graphical user interfaces, which would be very convenient, specifically for biologists who want to apply such techniques as a tool to address their biological questions without requiring deeper knowledge and skills in computational sciences. Only recently, a software tool, JAguc (http://wwwagak.informatik.uni-kl.de/research/JAguc/), has been published that efficiently and conveniently combines all these tasks in a single package [34,52]. In addition to providing a number of functionalities not matched by any other available program for the processing of environmental 454 tag datasets, JAguc has further clear advantages over other available tools. For example, its algorithm allows for very fast alignment and data clustering into operational taxonomic units, which is usually the bottleneck in 454 tag data processing. On a desktop PC (Intel Core i7-920 with 2.67 GHz, four cores, eight logical processors with 6 GB of RAM) allowing the Java virtual machine 8 GB of memory, we observed a runtime of only 1 h (using a maximum of eight parallel threads) for computing the alignments for all possible pairs in a 340 000 tag dataset (400 bp), clustering at 95%, and writing the result to a database [52]. Furthermore, JAguc can be run on local installations, in contrast to web-based tools like VAMPS (vamps.mbl.edu). For web-based tools, the computation time strongly depends on the number of jobs submitted to the server, and thus can hardly be predicted.
Additionally, the user can provide custom-made databases against which the environmental data can be checked for a taxonomic assignment using JAguc, while in VAMPS only available databases (update schedules are not posted) are at the user’s disposal (e.g., eukaryote databases are not yet available, and currently only SSU rDNA-based databases exist for bacteria and archaea in the current release of VAMPS). Furthermore, in JAguc the user has full control over each individual step and can adjust parameters (e.g., in alignment construction) for each individual dataset. JAguc’s strategy of sequence data quality control proved to be very efficient and its advantages over other quality check tools like AmpliconNoise (PyroNoise [53]) are summarized elsewhere [52]. To complicate matters further, the design of PCR primers used for both Sanger and pyrosequencing approaches may be significantly biased in their recovery of protists, possibly creating a distorted view of the extant richness and diversity. These concerns are not unique to studies of eukaryotic microbial diversity. For example, primer pair and amplicon length have also been observed to significantly influence the estimates


of bacterial species richness and evenness in a study of the termite hindgut [54]. Under these circumstances comparing two species lists is difficult and compounded by uncertainties about nonparametric statistical methods commonly used to estimate the size of the species pool. For example, while Stoeck et al. [34] recovered essentially the same broad picture of taxonomic composition using both V4 and V9 primers, there were distinct differences in the recovery of particular taxonomic groups within the dinoflagellates. This suggests that a better picture of eukaryotic microbial diversity in anoxic marine environments (and most likely other environments as well) can be obtained using a combination of primer sets (for further discussion of this, see [55]). As public databases also expand in terms of representation of sequences containing both hypervariable regions, any biases in taxonomic assignment that may exist due to lower representation of sequences containing V9 versus V4 regions will be minimized. Amplicons from massively parallel tag sequencing do not allow for relative or absolute quantification of the taxa hiding behind the environmental sequences. This is because protists differ tremendously in their gene and genome copy numbers. There is substantial variation in some operons within a single genome, and also a correlation between cell size and genome copy number [56]. Thus, a small flagellate with a low number of genome copies and SSU rRNA operons needs to occur in cell numbers ranging in the thousands in order to account for the number of amplicons that can be obtained from an individual larger ciliate. Amplicon number is therefore a function of individual genome/gene copy number as well as PCR primer preferences during amplification, and fragment size (shorter fragments are preferentially amplified and there is substantial variation in the SSU rRNA size in eukaryotes), rather than true taxon abundance in the sample.
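The flagellate versus ciliate argument above can be made concrete with a little arithmetic. The sketch below computes the share of rRNA gene templates contributed by each taxon as cells multiplied by gene copies per cell. All numbers are invented purely for illustration (they are not measurements from the text), the function name is hypothetical, and the model deliberately ignores primer bias and fragment-length effects, which the text identifies as further distortions.

```python
def expected_template_share(community):
    """Share of rRNA gene templates per taxon.

    `community` maps taxon -> (cell_count, rRNA_gene_copies_per_cell).
    Primer bias and fragment-size effects are ignored in this toy model.
    """
    templates = {taxon: cells * copies
                 for taxon, (cells, copies) in community.items()}
    total = sum(templates.values())
    return {taxon: n / total for taxon, n in templates.items()}


# Hypothetical values: 5000 small flagellates versus a single large ciliate.
community = {
    "small flagellate": (5000, 10),
    "large ciliate": (1, 50000),
}
shares = expected_template_share(community)
print(shares)  # {'small flagellate': 0.5, 'large ciliate': 0.5}
```

Under these assumed copy numbers a single ciliate cell yields as many templates as 5000 flagellate cells, which is why tag counts are better read as comparative measures than as cell abundances.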
As a consequence, abundance-based statistical analyses of amplicon libraries such as richness estimates, rarefaction or diversity indices, are heavily biased and should be interpreted more as comparative, and relative measures rather than absolute and true measures. One means to assess the true abundance of specific target taxa is fluorescence in situ hybridization employing specific fluorescently labeled oligonucleotide probes that are hybridized to organisms fixed with formaldehyde and filtered on to membranes, and visualized under an epifluorescence microscope. Massively parallel sequencing has increased our understanding of marine anoxic protistan communities. This technology shows that protistan community complexity in these environments is much greater than previously thought and includes eukaryotic microbes from all major protistan groups. Furthermore, based on these initial studies, these habitats appear to be dominated by novel, taxonomically unassignable sequences, and dinoflagellates, ciliates, and other members of the Alveolates. Ciliates were observed to dominate the eukaryotes in the anoxic water column of Cariaco Basin based on scanning electron microscope observations of filtered whole water samples [57]. Dinoflagellates are known to be important as primary producers, symbionts, parasites, and heterotrophic grazers in marine systems (e.g., [58–63]). The presence of a large number of sequences of unknown taxonomic affiliation is not surprising when one considers that anoxic and micro-oxic marine environments are perhaps the least studied because of the presumption that eukaryotes require oxygen and are limited by sulfide [55]. The presence of a large number of amplicons detected in very low numbers is intriguing and the importance of this remains to be determined. 
A pyrosequencing study of protistan communities in an Austrian lake that examined seasonal abundance patterns found a predominance of few abundant taxa and a highly dynamic turnover of rare species [64]. The eukaryotes that comprise the “rare biosphere” in anoxic and micro-oxic marine environments may be ecologically significant if they are simply dormant types that act as ecosystem buffers by responding to subtle environmental changes. Future pyrosequencing studies may provide the necessary sequencing depth and ease of processing of multiple samples required for integrated studies of habitat biogeochemistry and spatial and temporal variation in protistan communities capable of shedding light on this protistan “rare biosphere.” Pyrosequencing of natural protistan


communities is still in its infancy and only very few studies have been conducted on this subject. Specifically, methodological studies are still very scarce. This makes it difficult to solidly interpret the data obtained by environmental protistan pyrosequencing and to exploit the full power of this strategy. As a result, concerted efforts should not only be made to pyrosequence more habitats for protistan communities, but also to conduct more methodological studies in order to identify the true power and shortcomings of the pyrosequencing strategy for natural protistan communities.

Acknowledgments

The research presented here was supported by National Science Foundation grant MCB-034807 to V.E. (collaborative project with S. Epstein at Northeastern University, Boston, MA), the International Census of Marine Microbes and the W.M. Keck Foundation award to V.E. and T.S., and a grant from the Deutsche Forschungsgemeinschaft STO414/3-1 awarded to T.S. All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.


12 Chromatin Interaction Analysis Using Paired-End Tag Sequencing (ChIA-PET)

Xiaoan Ruan and Yijun Ruan

Abstract

Chromatin interaction analysis using paired-end tag sequencing (ChIA-PET) is a genome-wide, high-throughput, and unbiased approach for de novo detection of higher-order chromatin interactions. In a ChIA-PET analysis, cross-linked chromatin fibers are fragmented by sonication and enriched by chromatin immunoprecipitation (ChIP) with a specific antibody of interest. In the enriched chromatin complexes, tethered DNA fragments are joined together by specifically designed DNA linkers through proximity ligation and PETs are extracted for high-throughput sequencing analysis. Mapping PET sequences to a reference genome can reveal genome-wide binding sites and chromatin interactions associated with the protein factor(s) under study. The entire ChIA-PET protocol comprises three major parts: ChIP sample preparation, ChIA-PET library construction, and PET sequencing and mapping. The detailed procedures described here mainly focus on the first two parts, which can be completed in approximately 4 weeks.

12.1 Introduction

Genomes are thought to be functionally organized into three-dimensional structures in vivo [1]. Genome-wide studies of transcription factor binding sites using chromatin immunoprecipitation (ChIP) enrichment followed by microarray detection (ChIP-chip) [2,3], paired-end tag (PET) sequencing (ChIP-PET) [4], or single-end sequencing (ChIP-seq) [5] have shown that many transcription factor binding sites are located distal to genes, suggesting extensive remote control of transcription regulation. Various methods have been developed to investigate long-range chromatin interactions, such as chromosome conformation capture (3C) [6,7] and its variants, including ChIP-3C [8,9], 4C [10–14], and 5C [15], as well as RNA-Trap [16] and fluorescence in situ hybridization [17]; these have provided many insights into the higher-level organization of chromatin structures. However, these methods are limited to one-point-oriented or partial genome detection of interactions and are incapable of de novo detection of genome-wide interactions. A global strategy for investigating higher-order chromatin structures is needed to understand the mechanisms for the remote control of transcription regulation in three-dimensional nuclear space. We therefore developed a genome-wide, high-throughput, and unbiased short tag sequencing approach called chromatin interaction analysis (ChIA)-PET. It combines the original concept of “nuclear proximity ligation” [18], as applied in the 3C approach [7] to capture interacting DNA segments bound by protein factors, with the PET strategy [19–21] and next-generation sequencing technologies [5,22,23] for de novo detection of chromatin interactions. We have demonstrated the potential of this method using the system of ERα-mediated transcription regulation in human cancer cells [24].

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.

12.1.1 Development of the ChIA-PET Method

The basic principle of detecting chromatin interactions is the use of “proximity ligation” to capture DNA elements that are located far away from each other in the linear genome, but are in close spatial proximity as a result of juxtaposition by protein factors [7,18]. One of the major challenges in developing an unbiased whole-genome approach for de novo detection of chromatin interactions is finding an unbiased method for manipulating all proximity-ligated DNA fragments. Higher-order chromatin interactions are highly complex and take place in a tiny, compact nuclear space crowded with DNA and associated proteins. Consequently, any region of the genome could potentially interact with multiple other segments of the genome, specifically or nonspecifically. Moreover, such interactions may act transiently and proximately [25]. Further challenges arise when a population of nonsynchronized cells is studied, in which specific interactions may occur only in a small subset of the cell population [25,26]. Hence, analyses of chromatin interactions are expected to be very noisy, and reducing this complexity so that specific interactions can be detected is an important issue. The 3C method and its variants use sequence-specific approaches to reduce the complexity by detecting interactions only at targeted genome locations, but they exclude possible interactions in all other regions. To overcome these issues, we devised a strategy to introduce specific oligonucleotide sequences into the junction of all proximity ligation products. We coupled this strategy with ChIP to enrich specific chromatin interactions, as well as with ultra-high-throughput sequencing technology for deep coverage, and thus formulated the ChIA-PET analysis procedure. An overview of ChIA-PET analysis is illustrated in Figure 12.1.
Briefly, cross-linked chromatin materials are fragmented by sonication and the sonicated DNA–protein complexes are enriched by ChIP against specific protein factors; tethered DNA fragments in each of the enriched chromatin complexes are connected with DNA linkers via proximity ligation and the PETs are extracted for sequencing analysis. The resulting ChIA-PET sequences are mapped to reference genomes to reveal relationships between remote chromosomal regions brought together in close spatial proximity by protein factors. Compared with other methods designed for studying chromatin interactions, we believe that ChIA-PET has the following advantages: ChIA-PET uses sonication for chromatin fragmentation so as to “shake off” nonspecific interactions and hence has less background noise [27]; ChIA-PET uses ChIP to enrich specific chromatin interactions mediated by given protein factors so as to provide protein-specific chromatin interaction information; and ChIA-PET is independent of site-specific polymerase chain reaction (PCR) for detection, and is therefore unbiased and genome-wide for de novo detection of all chromatin interactions (Figure 12.1).

Fig. 12.1 Overview of ChIA-PET and comparison with other methods. ChIA-PET uses sonication (lightning symbol) to fragment chromatin fibers and then ChIP to enrich specific interactions for analysis. To capture the interacting DNA fragments in spatial proximity, two ligation steps are employed: (1) linker ligation and (2) proximity ligation. Subsequently, PET constructs are extracted by MmeI digestion, sequenced on the Illumina GAII, and mapped to the reference genome to identify long-range interacting partners. The scope of this approach for identifying interactions is all-to-all, that is, unbiased and genome-wide. The other approaches use restriction digestion (scissors symbol) to fragment chromatin, followed by direct ligation. Specific PCR primers are used to detect interactions at specific loci. Therefore, their detection scopes are limited to one-to-one, one-to-many, or many-to-many, but are not genome-wide.

12.1.2 Applications of the ChIA-PET Method

The ChIA-PET method may be used to interrogate all chromatin interactions as well as binding sites mediated by protein factors of interest. Depending on the protein factors used for ChIP enrichment, ChIA-PET analysis can be applied to the detection of all chromatin interactions involved in a particular nuclear process. For instance, the use of general transcription factors or RNA polymerase II components for ChIP enrichment and ChIA-PET analysis would identify all chromatin interactions involved in transcription regulation, and the use of protein factors involved in DNA replication or chromatin structure would identify all chromatin interactions due to DNA replication and chromatin structural modification. More specifically, the use of specific transcription factors for ChIA-PET analysis would further reduce ChIA-PET library complexity and add specificity to chromatin interactions, and enable examination of specific chromatin interactions mediated by particular transcription factors.

12.1.3 Experimental Design of ChIA-PET Analysis

A ChIA-PET experiment involves three main parts: ChIP sample preparation, ChIA-PET library construction, and PET sequence mapping to the reference genome followed by chromatin interaction analysis (Figure 12.2).

Fig. 12.2 Flowchart of the ChIA-PET protocol procedures. Steps shown, in order:
Grow and harvest cells
Formaldehyde cross-linking of DNA/protein
Cell lysis and nuclear lysis
Chromatin DNA fragmentation by sonication
Pre-cleared chromatin extract; antibody-coated beads
Chromatin immunoprecipitation
ChIP QC: yield and enrichment
End polishing of ChIP DNA
Ligate linker-A to one half (½) and linker-B to the other half (½)
Combine
Phosphorylation of linker 5′ end
Elute ChIP complex from beads
Proximity ligation under diluted conditions
Reverse cross-link, release DNA
DNA purification
MmeI digestion to release PET constructs
PET immobilization on Dynabeads
Ligate sequencing adaptor
QC PCR for construct size
PCR scale-up and gel purification of construct
PET DNA QC (Agilent Bioanalyzer)
ChIA-PET library template
Illumina GA IIx or SOLiD v4 paired-end sequencing
PET sequence mapping and analysis

12.1.3.1 ChIP Sample Preparation

High-quality ChIP samples are prerequisites for successful ChIA-PET experiments. The timing and concentration of formaldehyde treatment as well as the force of sonication applied to the chromatin should be considered. ChIP preparation for ChIA-PET analysis (similar to ChIP-seq and ChIP-chip) is highly dependent on the cell samples and the protein factors under investigation. Too much formaldehyde treatment increases chromatin interaction noise and makes the chromatin difficult to manipulate, while too little leads to inadequate capture of chromatin interactions [6,26]. Importantly, we use sonication to fragment formaldehyde cross-linked chromatin samples. This is the first step in our ChIA-PET protocol that is notably different from the 3C-like methods, which use restriction enzyme digestion for chromatin fragmentation. The use of sonication for chromatin fragmentation has been well tested and widely applied in various ChIP-based methods, including ChIP-chip and ChIP-seq. Based on ChIP-seq experimental data that show clean, sharp peaks for many binding sites [5,22], we suggest that the vigorous shearing force of sonication could break up weak interactions of nonspecific chromatin fragments attached to specific chromatin interaction complexes [27]. It is important to note that inappropriate sonication conditions could generate poor chromatin samples for a ChIP experiment. Therefore, the sonication conditions should be optimized, and the effectiveness of the sonication of each sample submitted for ChIA-PET analysis must be checked by reverse cross-linking a small aliquot and running it on a DNA gel to ensure that the DNA is in a desirable size range. Typically, the appropriate size of chromatin fragments is around 500 bp with a range from 200 to 2000 bp. After fragmentation, the chromatin samples are processed for ChIP, which enriches specific chromatin complexes bound by given protein factors of interest. In our experience, higher ChIP enrichment with specific ChIP antibodies results in better ChIA-PET data. Therefore, we strongly recommend that ChIP be performed with well-optimized ChIP protocols and validated by ChIP-qPCR (quantitative PCR) before proceeding to ChIA-PET library construction. A well-designed primer set is important for analysis of the ChIP enrichment via qPCR. We recommend using primer design software to optimize parameters such as melting temperature, oligonucleotide length, and GC content. It is critical to perform a BLAST analysis on the designed primers to ensure that the sequence is unique to the desired genomic regions. The control primers should also be designed such that they amplify a region of DNA that is away from the expected binding site. It is advisable to design several primer sets and to test each set with input DNA to ensure that the threshold cycle value does not exceed 30–32.
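As a rough illustration of the ChIP-qPCR validation described above, the following sketch computes the fold enrichment of a positive locus over a negative-control locus from threshold cycle (Ct) values, and checks the input-DNA Ct against an upper limit. The ΔΔCt-style formula, the assumption of one doubling per cycle, the function names, and the default limit of 30 cycles are illustrative assumptions and are not taken from the protocol.

```python
def fold_enrichment(ct_ip_pos, ct_input_pos, ct_ip_neg, ct_input_neg):
    """Fold enrichment of the positive locus over the negative-control locus.

    Assumes 100% PCR efficiency (one doubling per cycle). This is a common
    convention, not a value stated in the chapter.
    """
    delta_pos = ct_ip_pos - ct_input_pos   # ChIP signal normalised to input
    delta_neg = ct_ip_neg - ct_input_neg   # same for the control region
    return 2.0 ** (delta_neg - delta_pos)


def input_ct_acceptable(ct_input, limit=30.0):
    """True if the input-DNA Ct does not exceed the chosen limit.

    The text gives 30-32 as the upper bound; the default of 30 is the
    stricter end of that range and is an arbitrary choice here.
    """
    return ct_input <= limit


if __name__ == "__main__":
    # Hypothetical Ct values, for illustration only.
    print(fold_enrichment(24.0, 26.0, 30.0, 26.0))
    print(input_ct_acceptable(28.5))
```

A primer set whose input Ct fails the check would be redesigned before the ChIP sample is carried forward.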


Fig. 12.3 Schematic view of linker nucleotide barcoding for ChIA-PET analysis. Linker-A and linker-B are added to different aliquots of the same ChIP material. After linker ligation, the two aliquots are mixed for diluted proximity ligation and subsequent PET sequencing analysis. PET sequences whose linker composition includes both linker-A and linker-B (A/B) are considered to be derived from undesired chimeric ligation products between two different chromatin complexes. These linker-A/B-associated PETs are 100% nonspecific and can be used as a noise indicator for library data assessment. As chimeric ligations between A and A, or B and B, cannot technically be detected, the extracted chimeric PET subset (hybrid linker-A/B-associated PETs) serves as an estimator for the total number of chimeric PETs in a particular library dataset (twice the number of linker-A/B-associated PETs identified).
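The estimator described in the caption of Figure 12.3 (total chimeric PETs taken as twice the number of hybrid linker-A/B PETs) can be sketched as follows. The sketch assumes that each PET has already been labelled with the linker found at each end ("A" or "B"); the function names and the cap at 1.0 are illustrative additions, not part of the published pipeline.

```python
from collections import Counter


def linker_composition(linker_1, linker_2):
    """Return 'A/A', 'B/B' or 'A/B' for the two linker labels of one PET."""
    return "/".join(sorted((linker_1, linker_2)))


def estimate_chimeric_fraction(linker_pairs):
    """Estimate the fraction of chimeric PETs in a library.

    Hybrid A/B PETs are the only chimeras that can be observed directly;
    following the figure caption, the total is taken as twice that count.
    """
    counts = Counter(linker_composition(a, b) for a, b in linker_pairs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    estimated_chimeric = 2 * counts["A/B"]
    return min(1.0, estimated_chimeric / total)  # cap is an added safeguard


if __name__ == "__main__":
    # Hypothetical library: 6 A/A, 3 B/B and 1 hybrid PET.
    pairs = [("A", "A")] * 6 + [("B", "B")] * 3 + [("B", "A")]
    print(estimate_chimeric_fraction(pairs))
```

The factor of two follows from mixing equal amounts of the two aliquots, in which case A/A and B/B chimeras together are expected to be about as frequent as A/B chimeras.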

12.1.3.2 ChIA-PET Library Construction

After ChIP enrichment, the ends of tethered DNA fragments in chromatin complexes are blunted to allow for subsequent linker ligation. Linker oligonucleotides are designed with a 5′ overhang consisting of 4 nucleotides (GGCC) and a recognition site for the type II restriction enzyme MmeI (TCCAAC) at the blunt end to be ligated to DNA fragments (Figure 12.3). The flanking restriction enzyme sites in the linker sequence allow for extraction of the PET with the structure (20-bp tag)–[linker seq]–(20-bp tag). To allow easy purification of PET constructs with streptavidin-coated magnetic beads, each linker is modified with biotin at the internal C6 of the ninth base (T) from the 5′ end, so as to avoid steric hindrance in enzymatic reactions. The purified ChIA-PET constructs can then be analyzed by high-throughput PET sequencing. In ChIA-PET analysis, the most critical step is proximity ligation. Ideally, all proximity ligation products are derived from DNA fragments bound within a given ChIP complex. However, the level of nonspecific chimeric DNA ligations between different ChIP complexes can be high and thus may confound data analysis. To address this, we introduced linker “barcode” sequences into the ChIA-PET experiment to specifically identify such chimeric ligation PETs. Two linker sequences (linker-A and linker-B) are designed with the same sequence, except for a 4-bp difference that serves as a “barcode,” located next to the MmeI restriction site. The two linkers are added separately to two equal amounts of the same ChIP material. After linker ligation and removal of free linkers, the two aliquots are mixed together to proceed to the following circularization step. In the circularization mixture, there are two types of chromatin complexes, which are attached to either linker-A or linker-B. After circularization by linker ligation, there are three possible circularized products with specific linker composition: linker-A/linker-A (CTTA/CTTA), linker-B/linker-B (ACAT/ACAT), and linker-A/linker-B (CTTA/ACAT). Among these three types of linker compositions, the hybrid linker-A/B-associated PET is the chimeric ligation product that occurs between two different chromatin complexes. The percentage of this hybrid linker-associated PET can be used as an indicator to estimate nonspecific ligation. To reduce the level of nonspecific ligation, the linker-ligated ChIP complexes are eluted off the beads and diluted in a large volume for proximity ligation. While in theory this approach should work well, and we have been able to use an “off-beads” method to consistently prepare ChIA-PET libraries, the Brownian motion of molecules in solution could still cause significant nonspecific chimeric ligations. After PET extraction by MmeI digestion, the biotinylated PET constructs are purified by streptavidin-bound magnetic beads. The PETs are then ligated with an appropriate sequence adapter, followed by PCR amplification to prepare sequence templates.

12.1.3.3 ChIA-PET Library Sequencing and Mapping

Depending on the sequence adapters chosen, the ChIA-PET templates can be analyzed by most DNA sequencing platforms available on the market. However, the most efficient way to sequence these short tags is to use high-throughput, next-generation sequencers [19].
In our ChIA-PET analysis, we have successfully used the Roche 454 GS pyrosequencer [23], Illumina Genome Analyzer (Solexa) [5,22], and ABI SOLiD [28] systems. In principle, the analysis can be expanded to other high-throughput sequencing platforms, such as the Helicos single-molecule sequencer [29] or Pacific Biosciences systems. The protocol illustrated here uses adapters compatible with Roche 454 GS, Illumina Genome Analyzer (Solexa), and ABI SOLiD 4. The Roche 454 GS sequencer can produce longer reads (200–400 bp), which is sufficient to cover the entire template sequence (223 bp) including genomic tags, adapters, and full linker sequences. For sequencing one library, a full 454 run normally produces approximately 500 000 PET reads, which is a small dataset and not enough for whole-genome analysis, especially for mammalian genomes. Thanks to the rapid development of high-throughput sequencing technology, the Illumina GAIIx and ABI SOLiD 4 have become our major platforms for cost-effective ChIA-PET analysis. From one lane (Genome Analyzer) or one-eighth of a SOLiD 4 slide, approximately 20–30 million PET reads are routinely generated. In both Solexa and SOLiD data analysis, the paired-end sequencing format is used to generate 2 × 36-bp (Solexa) or 2 × 35-bp (SOLiD) reads from each end, each of which consists of a 20-bp genomic tag and a 16-bp linker sequence with the specific built-in “barcode” to distinguish linker-A from linker-B. After raw sequence data are generated from high-throughput platforms, they are processed and aligned to the relevant reference genome by the sequence analysis pipelines [30]. Identical sequence reads, most likely derived from PCR amplification, are collapsed into unique reads, which are then used to generate a uniquely mapped PET set. PET sequences that are not mapped, partially mapped, or mapped to multiple locations are excluded from further analysis.
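The read layout and duplicate collapsing described above can be sketched as follows. This is a minimal illustration, not the pipeline of reference [30]: it assumes that the 20-nt genomic tag precedes the linker portion in each read, and, because the exact barcode offset within the linker is not given here, it simply searches the linker portion for the barcodes CTTA (linker-A) and ACAT (linker-B) listed in Section 12.1.3.2. The function names and the example linker sequences are hypothetical.

```python
TAG_LEN = 20      # genomic tag length stated in the text
LINKER_LEN = 16   # linker portion of each read stated in the text
BARCODES = {"CTTA": "A", "ACAT": "B"}  # linker-A / linker-B barcodes


def split_read(read):
    """Split one read into (genomic tag, linker label or None).

    Assumption: the tag comes first and the linker portion follows it.
    A read whose linker portion matches neither or both barcodes is
    returned with label None.
    """
    if len(read) <= TAG_LEN:
        return None
    tag = read[:TAG_LEN]
    linker = read[TAG_LEN:TAG_LEN + LINKER_LEN]
    labels = {label for barcode, label in BARCODES.items() if barcode in linker}
    label = labels.pop() if len(labels) == 1 else None
    return tag, label


def collapse_duplicates(read_pairs):
    """Keep one copy of each identical (read 1, read 2) pair, preserving order."""
    seen = set()
    unique = []
    for pair in read_pairs:
        if pair not in seen:
            seen.add(pair)
            unique.append(pair)
    return unique


if __name__ == "__main__":
    # Hypothetical 36-nt read: 20-nt tag followed by a made-up linker portion.
    read = "ACGT" * 5 + "GTTGGACTTAGGCCAA"
    print(split_read(read))
```

Only PETs that survive this step and map uniquely with both tags would be passed on to classification.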
The uniquely mapped PETs are then clustered and classified for further analyses of the protein binding and interacting sites. To differentiate binding from interacting PETs, each uniquely mapped PET can be further classified according to where its two tags are derived from. If the two tags of a PET are mapped on the same chromosome, on the same strand, and at a distance of less than 3 kb, they are considered to be derived from a self-ligated ChIP DNA fragment and the PET is defined as a “self-ligation” PET. If a PET mapping result does not fit these criteria, it most likely resulted from an “interligation” event between two different chromatin fragments. These may be two DNA fragments bound together by protein factor(s) within the same chromatin complex, or two DNA fragments coming from two different chromatin complexes. Collectively, they are all defined as “interligation” PETs, but the latter case is clearly derived from a nonspecific or chimeric ligation event that causes noise. The extent of this background noise in a particular library can be estimated quantitatively from the percentage of hybrid linker-A/B-associated PETs. The two tags from “interligation” PETs could come from different chromosomes, different strands, and different orientations, or from the same chromosome but at a long distance [30]. The “interligation” PETs are further classified into “intra” and “inter” subgroups. Those whose tags map to the same chromosome at a distance greater than 3 kb are defined as “intrachromosomal interligation” PETs. In contrast, PETs whose two tags map to two different chromosomes are called “interchromosomal interligation” PETs. The collection of all “self-ligation” and “interligation” PETs at a given locus reflects the ChIP enrichment level of the DNA-bound protein factor(s) captured by an antibody. As with ChIP-seq and many other types of genome-wide studies, it is not surprising that the vast majority of the PET data come from undesired nonspecific activities.
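The classification rules above can be written down as a short sketch. The 3-kb cutoff and the three class names come from the text; the `Tag` representation and the function name are hypothetical. Two cases are not spelled out in the text and are handled here by assumption: tags on the same chromosome at exactly 3 kb, or closer than 3 kb but on different strands, are both assigned to the intrachromosomal interligation class.

```python
from typing import NamedTuple

SELF_LIGATION_MAX_BP = 3000  # 3-kb cutoff stated in the text


class Tag(NamedTuple):
    chrom: str
    pos: int      # mapped coordinate of the tag (hypothetical representation)
    strand: str   # '+' or '-'


def classify_pet(tag1, tag2, cutoff=SELF_LIGATION_MAX_BP):
    """Classify one uniquely mapped PET from the positions of its two tags."""
    if tag1.chrom != tag2.chrom:
        return "interchromosomal interligation"
    distance = abs(tag1.pos - tag2.pos)
    if tag1.strand == tag2.strand and distance < cutoff:
        return "self-ligation"
    # Assumption: every other same-chromosome PET is treated as an
    # intrachromosomal interligation, including the edge cases noted above.
    return "intrachromosomal interligation"


if __name__ == "__main__":
    print(classify_pet(Tag("chr1", 1000, "+"), Tag("chr1", 1500, "+")))
    print(classify_pet(Tag("chr1", 1000, "+"), Tag("chr1", 50000, "-")))
    print(classify_pet(Tag("chr1", 1000, "+"), Tag("chr2", 1000, "+")))
```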
To distinguish this nonspecific noise (randomly captured PETs) from the real, enriched PETs, the hybrid linker-A/B-associated PETs (noise PETs) are classified and assembled into a “noise dataset.” Assessment of the “noise dataset” shows that all hybrid linker-A/B-associated PETs are randomly distributed throughout the whole genome. Moreover, most of the hybrid PETs appear only as singletons, and few or no clusters are observed. The “noise dataset” assessment strongly supports the notion that those PETs forming above-normal count clusters (either from “self-ligation” or “interligation”) are mostly derived from captured enrichment of protein–DNA binding or chromatin–chromatin interaction activities [30]. A preliminary one-lane or 1/8-slide run of the ChIA-PET construct is performed on Solexa or SOLiD as quality control. If the construct shows a relatively significant number of binding sites and chromosome interactions, additional runs are performed to saturate the library data for further analysis.

12.1.3.4 Control Libraries

Proximity ligation is the core concept and a critical step in ChIA-PET analysis. The two tethered DNA fragments bound together by protein factors are captured by affinity antibody binding, followed by proximity ligation of the two tethered DNA fragments. The subsequent formation of the “interligation” PET captures the original spatial relationship between the two remote elements that have been brought into close proximity by chromatin interactions. As a negative control, a modified version of the ChIP-PET library [4] is designed to illustrate the importance of this proximity ligation for the formation of “interligation” PETs in ChIA-PET analysis. As outlined in Figure 12.4, the negative control ChIP-PET protocol is almost identical to ChIA-PET, except that after chromatin DNA fragments are ligated to the ChIA-PET linkers, they are released from the chromatin complex by reverse cross-linking and depletion of proteins.
The pure, isolated DNAs are then circularized by ligation in a large volume that strongly favors self-ligation. As a result, the negative control ChIP-PET library data exhibit only “self-ligation” PETs and few or no “interligation” PETs, in contrast to the ChIA-PET library. Additional control libraries, such as no-treatment or mock ChIA-PET libraries using a nonspecific antibody or IgG, can also be useful to assess general noise from captured sequence mapping.
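The distinction drawn in Section 12.1.3.3 between singleton PETs and PETs that form clusters can be illustrated with a deliberately simple sketch. It bins both tags of each PET into fixed-size genomic windows and counts how many PETs share the same pair of windows. The fixed 1-kb window, the minimum count of 2, and the function names are hypothetical choices for illustration; the actual clustering used in the pipeline of reference [30] is not reproduced here.

```python
from collections import Counter


def count_pet_clusters(pets, window=1000):
    """Count PETs per pair of genomic windows.

    pets: iterable of ((chrom1, pos1), (chrom2, pos2)) tuples.
    The two anchors are sorted so that tag order does not matter.
    """
    counts = Counter()
    for (chrom1, pos1), (chrom2, pos2) in pets:
        anchor1 = (chrom1, pos1 // window)
        anchor2 = (chrom2, pos2 // window)
        counts[tuple(sorted((anchor1, anchor2)))] += 1
    return counts


def split_singletons(counts, min_count=2):
    """Separate window pairs supported by several PETs from singletons."""
    clusters = {key: n for key, n in counts.items() if n >= min_count}
    singletons = {key: n for key, n in counts.items() if n < min_count}
    return clusters, singletons


if __name__ == "__main__":
    # Hypothetical PETs: two support the same interaction, one is isolated.
    pets = [
        (("chr1", 1200), ("chr1", 50300)),
        (("chr1", 50900), ("chr1", 1800)),
        (("chr2", 10), ("chr5", 999999)),
    ]
    clusters, singletons = split_singletons(count_pet_clusters(pets))
    print(len(clusters), len(singletons))
```

Applied to a control library or to the hybrid linker-A/B “noise dataset”, a summary of this kind would be expected to show mostly singletons.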


Fig. 12.4 ChIP-PET, a subset of ChIA-PET data, is a “self-ligation” control. (a) Schematic view of the ChIA-PET method, which can generate both “self-ligation” and “interligation” PETs. (b) The ChIP-PET method, considered as a control for ChIA-PET, produces only “self-ligation” PETs for protein binding information. The difference between ChIA-PET and ChIP-PET is that in ChIP-PET, chromatin complexes are reverse cross-linked and the bound DNA fragments are released before the proximity ligation; therefore, only “self-ligation” PETs are produced. (c) Two replicate ChIA-PET datasets showing “self-ligation” PETs and “interligation” PETs (purple for intrachromosomal and blue for interchromosomal). The ChIP-PET library data, however, show only “self-ligation” PETs. The IgG ChIA-PET mock library shows only singletons.

12.2 Methods and Protocols

12.2.1 Key Reagents and Consumables

ChIP Preparations

• Dulbecco’s modified Eagle medium, high glucose (Gibco; cat. no. 10313-021)
• Fetal bovine serum, certified, heat-inactivated (Gibco; cat. no. 10082-147)
• Penicillin/streptomycin, liquid (Gibco; cat. no. 15140-163)
• 37% Formaldehyde (Merck-Calbiochem; cat. no. 344198) (Caution: Very toxic if inhaled, ingested, or absorbed through skin)
• Glycine (Bio-Rad; cat. no. 161-0718)
• 1× Phosphate-buffered saline (Invitrogen; cat. no. 10010031)
• Complete, EDTA-free (Roche; cat. no. 11873580001)
• HEPES, pH 7.5 (Invitrogen; cat. no. 15630080)
• 5 M Sodium chloride (Ambion; cat. no. AM9760G)
• 0.5 M EDTA (Ambion; cat. no. AM9260G)
• Triton X-100 (Promega; cat. no. H5142)
• Sodium deoxycholate (Sigma-Aldrich; cat. no. D5670)
• 10% Sodium dodecylsulfate (SDS) (Promega; cat. no. V6553)
• Nuclease-free water (Ambion; cat. no. AM993)
• Proteinase K solution (20 mg/ml), 1 ml (Fermentas; cat. no. E00491)
• Molecular Biology Agarose (Bio-Rad; cat. no. 161-3102)
• 10× Tris–acetate–EDTA (TAE) buffer, pH 8.0 (Ambion; cat. no. AM9864)
• TE buffer (Promega; cat. no. V6232)
• 6× Loading dye (Fermentas; cat. no. R0611)
• GeneRuler Plus 100-bp DNA ladder, ready-to-use (Fermentas; cat. no. SM0323)
• Dynabeads Protein G (Invitrogen; cat. no. 100-04D)
• Anti-trimethyl-histone H3 (Lys4), clone MC315, H3K4me3 antibody (Millipore; cat. no. 04-745)
• 1 M Tris buffer, pH 8.0 (Invitrogen; cat. no. 15568-025)
• Lithium chloride (Sigma-Aldrich; cat. no. L0505)
• Nonidet P-40 (Roche; cat. no. 11754599001)
• 1 M Tris buffer, pH 7.4 (Invitrogen; cat. no. 15567-027)
• 3 M Sodium acetate, pH 5.5 (100 ml) (Ambion; cat. no. AM9740)
• GlycoBlue (15 mg/ml) (Ambion; cat. no. AM9516)
• Phenol/chloroform/isoamyl alcohol (IAA) 25:24:1, pH 7.9, 100 ml (Ambion; cat. no. AM9730) (Caution: Very toxic if inhaled, ingested, or absorbed through skin)
• Isopropanol (Sigma; cat. no. I-9516-500 ml)
• Quant-iT PicoGreen double-stranded DNA reagent (10 × 100 µl) (Invitrogen; cat. no. P11495)
• LightCycler 480 SYBR Green I Master (Roche; cat. no. 04707516001)
• Positive primer forward 5′-TTCAGAGCTGCATTCCTTCC-3′
• Positive primer reverse 5′-CGGAATACTGACGAGGAGAAA-3′
• Negative primer forward 5′-AGTCTGAGCTTTGTGGACAGC-3′
• Negative primer reverse 5′-CCCTCCCAGTATACAGTCTTGC-3′
• Tissue culture dish, 150 × 20 mm (Nunc; cat. no. 168381)
• Cell scrapers (Corning; cat. no. 3010)
• BD Falcon polypropylene conical tubes (50 ml) (Becton Dickinson; cat. no. 352070)
• 50-ml PPCO centrifuge tube (Nalgene; cat. no. 3119-0050)
• BD Falcon polystyrene round-bottom tubes (14-ml) (Becton Dickinson; cat. no. 352057)
• Glass beads 0.5-mm diameter (Biospecs; cat. no. 11079105)
• DNA LoBind tubes, 1.5-ml PCR clean (Eppendorf; cat. no. 0030 108.051)
• MaXtract High Density tubes, 200 × 2 ml (Qiagen; cat. no. 129056)
• IWAKI flat-bottom polystyrene 96-well microtiter plate
• LightCycler 480 multiwell plate 384 (Roche; cat. no. 04729749001)
• LightCycler 480 sealing foil (Roche; cat. no. 04729757001)

ChIA-PET Library Construction . . . . . . . . .

. .

. . . . . . . . . .

Nuclease-free water (Ambion; cat. no. AM9937) T4 DNA polymerase, 10 reaction buffer (Promega; cat. no. M831A) 10 mM dNTP mix (Eppendorf; cat. no. 0032 003.109) T4 DNA polymerase (Promega; cat. no. M421F) 1 M Tris buffer, pH 7.4 (Invitrogen; cat. no. 15567-027) 1 M Tris buffer, pH 8.0 (Invitrogen; cat. no. 15568-025) 0.5 M EDTA pH 8.0 (100 ml) (Ambion; cat. no. AM9260G) 5 M Sodium chloride (Ambion; cat. no. AM9760G) 5 T4 DNA ligase buffer with poly(ethylene glycol) (PEG) (Invitrogen; cat. no. 46300-018) T4 DNA ligase (30 U/ml) (Fermentas; cat. no. EL0013) 10 T4 DNA ligase buffer (New England Biolabs; cat. no. B0202S) (Critical: Dithiothreitol may be oxidized over time; use new ligase buffer if reagent is too old) T4 DNA polynucleotide kinase (10 U/ml) (New England Biolabs; cat. no. M0201L) 10% SDS (500 ml) (Promega; cat. no. V6553) TE buffer (pH 8.0) (Ambion; cat. no. AM9849) Buffer EB (250 ml) (Qiagen; cat. no. 19086) Triton X-100 (Promega; cat. no. H5142) Proteinase K solution (20 mg/ml), 1 ml (Fermentas; cat. no. E00491) Phenol/chloroform/IAA, 25: 24: 1, pH 7.9, 100 ml (Ambion; cat. no. AM9730) 3 M Sodium acetate pH 5.5 (100 ml) (Ambion; cat. no. AM9740) Isopropanol (Sigma-Aldrich; cat. no. I9516) GlycoBlue (15 mg/ml) (Ambion; cat. no. AM9516)

12 Chromatin Interaction Analysis Using Paired-End Tag Sequencing (ChIA-PET)
• Ethanol (Sigma-Aldrich; cat. no. 459836)
• S-adenosylmethionine (SAM) (New England Biolabs; cat. no. B9003S)
• 10× NEBuffer 4 (New England Biolabs; cat. no. B7004S)
• MmeI (2 U/µl) (New England Biolabs; cat. no. R0637L)
• Dynabeads M-280 Streptavidin (10 mg/ml, 10 ml) (Invitrogen; cat. no. 11206D)
• 10× T4 DNA ligase buffer (Fermentas; cat. no. B69)
• 10× NEBuffer 2 (New England Biolabs; cat. no. B7002S)
• Escherichia coli DNA polymerase I (10 U/µl) (New England Biolabs; cat. no. M0209L)
• Phusion High-Fidelity PCR master mix with HF buffer (Finnzymes; cat. no. F-531)
• 6× Loading dye (Fermentas; cat. no. R0611)
• 25-bp DNA ladder (Invitrogen; cat. no. 10597-011)
• Novex 4–20% TBE gel, 1.0 mm, 10 wells (Invitrogen; cat. no. EC6225BOX)
• 10× TBE buffer (Ambion; cat. no. AM9863)
• SYBR Green I (Molecular Probes) (Invitrogen; cat. no. S-7585)
• Novex 6% TBE gel, 1.0 mm, 5 wells (Invitrogen; cat. no. EC6264BOX)
• Agilent DNA 1000 reagents (Agilent Technologies; cat. no. 5607-1505)
• Agilent DNA 1000 kit (Agilent Technologies; cat. no. 5607-1504)
• BD Falcon polypropylene conical tubes (50-ml) (Becton Dickinson; cat. no. 352070)
• 2.0-ml Screw-cap tubes, presterilized (Axygen)
• DNA LoBind tubes, 1.5-ml PCR clean (Eppendorf; cat. no. 0030 108.051)
• 0.6-ml tubes (500) (Axygen)
• 0.2-ml PCR tubes (Axygen)
• Spin-X centrifuge tube filters, 0.22-µm pore CA membrane, sterile (Costar; cat. no. 8160)
• 50-ml FEP centrifuge tubes (Nalgene; cat. no. 3114-0050)
• Gel handler, 10 sheets/pack (Sigma; cat. no. Z376957-1PAK)
• MaXtract High Density, 25 × 50 ml (Qiagen; cat. no. 129073)
• MaXtract High Density, 100 × 15 ml (Qiagen; cat. no. 129065)
• Polaroid 667 ISO 3000 black & white instant pack film (Polaroid)
• 21G needle (100) (Becton Dickinson)
• Stainless steel sterile surgical blades (Myco Medical Supplies)

Equipment

• Orbital shaker (Labnet; model Orbit 1000)
• Refrigerated microcentrifuge (Eppendorf; model 5415 R)
• Microcentrifuge (Eppendorf; model 5415 D)
• Bench-top refrigerated centrifuge (Eppendorf; model 5810 R)
• High-speed centrifuge (Sorvall; model RC 5C plus)
• Tube rotator (Palico Biotech; model Intelli-Mixer RM-2L)
• Sonicator (Branson; model Digital Sonifier)
• Gel casting set (Scie-Plas)
• Gel imager (Syngene; model G:Box iChemi system)
• Magnetic particle collector (MPC) (Invitrogen; model DynaMag-2)
• Heatblock (Eppendorf; model thermomixer comfort)
• Micro Vac (Tomy; model MV-100)
• Microplate Reader (TECAN; model GENios Automated Microplate Visible with Magellan software)
• qPCR machine (Roche; model LightCycler 480; cat. no. 04643631001)
• PAGE gel electrophoresis system (Invitrogen; model Novex Mini-Cell)
• Incubator (Memmert; cat. no. INB500)
• DarkReader trans-illuminator (Clare Chemical Research; cat. no. DR-45M)
• Genome Analyzer IIx (Illumina)
• ABI SOLiD 4 (Applied Biosystems)

12.2 Methods and Protocols

Reagent Set-up

• 2.5 M Glycine: Dissolve 93.8 g glycine in 500 ml nuclease-free water. (Critical: Store the buffer at room temperature.)
• 0.1% SDS FA lysis buffer: 50 mM HEPES, pH 7.4, 150 mM NaCl, 1 mM EDTA, 1% Triton X-100, 0.1% sodium deoxycholate, and 0.1% SDS. (Critical: Store the buffer at 4 °C. The addition of protease inhibitor (complete) should be done before using the buffer.)
• 1% SDS FA lysis buffer: 50 mM HEPES, pH 7.4, 150 mM NaCl, 1 mM EDTA, 1% Triton X-100, 0.1% sodium deoxycholate, and 1% SDS. (Critical: Store the buffer at room temperature to avoid SDS precipitation. The addition of protease inhibitor (complete) should be done before using the buffer.)
• PBS/0.1% Triton X-100: 0.1% Triton X-100 and 1× PBS. (Critical: Store the buffer at 4 °C.)
• 0.1% SDS FA lysis buffer/0.35 M NaCl: 50 mM HEPES, pH 7.4, 350 mM NaCl, 1 mM EDTA, 1% Triton X-100, 0.1% sodium deoxycholate, and 0.1% SDS. (Critical: Store the buffer at room temperature.)
• ChIP wash buffer: 10 mM Tris, pH 8.0, 250 mM lithium chloride, 1 mM EDTA, 0.5% Nonidet P-40, and 0.5% sodium deoxycholate. (Critical: Store the buffer at 4 °C.)
• ChIP elution buffer: 50 mM Tris, pH 7.4, 10 mM EDTA, and 1% SDS. (Critical: Store the buffer at room temperature.)
• ChIA-PET wash buffer: 10 mM Tris, pH 7.4, 1 mM EDTA, and 500 mM NaCl. (Critical: Store the buffer at room temperature.)
• ChIA-PET elution buffer: TE buffer and 1% SDS. (Critical: Store the buffer at room temperature.)
• 2× B&W buffer: 10 mM Tris, pH 7.5, 1 mM EDTA, and 2 M NaCl. (Critical: Store the buffer at room temperature.)
• 1× B&W buffer: 5 mM Tris, pH 7.5, 0.5 mM EDTA, and 1 M NaCl. (Critical: Store the buffer at room temperature.)
• 1× TNE buffer: 10 mM Tris, pH 8.0, 0.1 mM EDTA, and 50 mM NaCl. (Critical: Store the buffer at room temperature.)
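The recipes above reduce to two pieces of arithmetic: the mass of a solid needed for a given molarity, and the volume of a concentrated stock needed for a given final concentration. The sketch below illustrates both. The molar mass of glycine (75.07 g/mol) is a standard reference value rather than a figure from this chapter, and the 100-ml batch size for the ChIA-PET wash buffer is a hypothetical choice; the stock concentrations are those of the stock solutions listed in the materials (1 M Tris, 0.5 M EDTA, 5 M NaCl).

```python
# Minimal buffer-preparation helpers (illustrative sketch, not part of the protocol).

GLYCINE_G_PER_MOL = 75.07  # standard reference value, not stated in the chapter


def grams_needed(molarity: float, volume_ml: float, molar_mass: float) -> float:
    """Mass of solid (g) to dissolve for `volume_ml` of a solution at `molarity` (mol/l)."""
    return molarity * (volume_ml / 1000.0) * molar_mass


def stock_volume_ml(stock_mm: float, final_mm: float, batch_ml: float) -> float:
    """Volume of stock (ml) for a batch, from C1 * V1 = C2 * V2 (concentrations in mM)."""
    if final_mm > stock_mm:
        raise ValueError("final concentration cannot exceed the stock concentration")
    return final_mm * batch_ml / stock_mm


# 2.5 M glycine, 500 ml, as in the recipe above.
glycine_g = grams_needed(2.5, 500.0, GLYCINE_G_PER_MOL)

# ChIA-PET wash buffer (10 mM Tris, 1 mM EDTA, 500 mM NaCl) for a hypothetical 100-ml batch.
BATCH_ML = 100.0  # assumption: batch size is not specified in the chapter
wash_buffer = {
    "1 M Tris, pH 7.4": stock_volume_ml(1000.0, 10.0, BATCH_ML),
    "0.5 M EDTA, pH 8.0": stock_volume_ml(500.0, 1.0, BATCH_ML),
    "5 M NaCl": stock_volume_ml(5000.0, 500.0, BATCH_ML),
}
wash_buffer["Nuclease-free water"] = BATCH_ML - sum(wash_buffer.values())

if __name__ == "__main__":
    print(f"Glycine: {glycine_g:.1f} g")
    for component, volume in wash_buffer.items():
        print(f"{component}: {volume:.1f} ml")
```

The glycine result reproduces the 93.8 g given in the recipe, which is a useful sanity check before applying the same helper to other batch sizes.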

Oligonucleotide Sequences

Oligonucleotide sequences used in ChIA-PET analysis are listed in Table 12.1.

12.2.2 Protocol

The following procedures have been successfully applied in our laboratory in ChIA-PET analyses of a number of human and mouse cellular systems (e.g., MCF7, HCT116, K562, HeLa, NB4, and mouse ES cells) using antibodies raised against ERα, RNA polymerase II, CTCF, H3K4me3, and p53. The procedures described below are mainly based on ChIA-PET analysis of H3K4me3 chromatin interactions in HCT116 cells.

1. ChIP sample preparation procedures
Generally, ChIP can be prepared from cultured cells (at least 10⁷ cells per ChIP) grown under desired conditions or, in principle, from any tissue samples of interest. The cells or tissues are first subjected to cross-linking, followed by sonication for DNA fragmentation. The tethered chromatin is then enriched with a specific antibody for protein–DNA complexes bound together by the protein factor of interest. To succeed in ChIA-PET analysis, a good ChIP should have strong enrichment and enough chromatin DNA (50–100 ng) on the beads. If a single ChIP cannot provide sufficient chromatin DNA for library construction, plan and perform multiple ChIP reactions in parallel to collect enough starting material. In this procedure, colon cancer HCT116 cells obtained from ATCC (CCL-247) were used. The cells are grown to passage 3 before harvesting for the next step. Usually, the cells are harvested at


Table 12.1 Oligonucleotide sequences used in ChIA-PET analysis.

Oligos and DNA adapters              Sequence                                                  Length (nt)

Positive primer
  forward                            5′-ttcagagctgcattccttcc-3′                                20
  reverse                            5′-cggaatactgacgaggagaaa-3′                               21
Negative primer
  forward                            5′-agtctgagctttgtggacagc-3′                               21
  reverse                            5′-ccctcccagtatacagtcttgc-3′                              22
Linker-A, biotinylated
  top                                5′-GGCCGCGATATCTTATCCAAC-3′                               21
  bottom                             5′-GTTGGATAAGATATCGC-3′                                   17
Linker-B, biotinylated
  top                                5′-GGCCGCGATATACATTCCAAC-3′                               21
  bottom                             5′-GTTGGAATGTATATCGC-3′                                   17
Linker-A, nonbiotinylated
  top                                5′-GGCCGCGATATCTTATCCAAC-3′                               21
Linker-B, nonbiotinylated
  top                                5′-GGCCGCGATATACATTCCAAC-3′                               21
Adapter-A
  top                                5′-CCATCTCATCCCTGCGTGTCCCATCTGTTCCCTCCCTGTCTCAGNN-3′      46
  bottom                             5′-CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG-3′        44
Adapter-B
  top                                5′-CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG-3′        44
  bottom                             5′-CCTATCCCCTGTGTGCCTTGCCTATCCCCTGTTGCGTGTCTCAGNN-3′      46
Solexa 1-454 primer                  5′-aatgatacggcgaccaccgagatctacacCctatcccctgtgtgccttg-3′   49
Solexa 2-454 primer                  5′-CaagcagaagacggcatacgagatCGGTccatctcatccctgcgtgtc-3′    48
SOLiD P1_m_F                         5′-CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGATNN-3′         43
SOLiD P1_m_R                         5′-ATCACCGACTGCCCATAGAGAGGAAAGCGGAGGCGTAGTGGTT-3′         43
SOLiD P2_m_F                         5′-AGAGAATGAGGAACCCGGGGCAGTT-3′                           25
SOLiD P2_m_R                         5′-CTGCCCCGGGTTCCTCATTCTCTNN-3′                           25
SOLiD PCR primer 1                   5′-CCACTACGCCTCCGCTTTCCTCTCTATG-3′                        28
SOLiD PCR primer 2                   5′-CTGCCCCGGGTTCCTCATTCT-3′                               21
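
A quick consistency check on the linker entries of Table 12.1 can catch transcription errors before oligonucleotides are ordered. The sketch below is an illustration rather than part of the protocol: it confirms that each bottom strand is complementary to the 3′ portion of its top strand and reports the unpaired 5′ stretch of the top strand. The function names are hypothetical.

```python
# Sanity check for the linker duplexes in Table 12.1 (illustrative sketch).

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def five_prime_overhang(top: str, bottom: str):
    """Unpaired 5' stretch of `top` when `bottom` anneals to its 3' end.

    Returns None if `bottom` is not complementary to the 3' portion of `top`.
    """
    if not reverse_complement(top).startswith(bottom):
        return None
    return top[: len(top) - len(bottom)]


# Sequences copied from Table 12.1 (top strand, bottom strand).
LINKERS = {
    "Linker-A": ("GGCCGCGATATCTTATCCAAC", "GTTGGATAAGATATCGC"),
    "Linker-B": ("GGCCGCGATATACATTCCAAC", "GTTGGAATGTATATCGC"),
}

if __name__ == "__main__":
    for name, (top, bottom) in LINKERS.items():
        print(name, len(top), len(bottom), five_prime_overhang(top, bottom))
```

For both linkers the check reports lengths of 21 and 17 nt, in agreement with the table, and a four-nucleotide unpaired 5′ stretch on the top strand.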

70% confluence for ChIP preparation. A total of 6.5 × 10⁶ cells are plated on a 140 × 20 mm culture dish and allowed to grow for 1.5 days prior to cross-linking and harvesting.

2. Formaldehyde cross-linking of cells for 1 h
2.1 Grow HCT116 in Dulbecco’s modified Eagle medium supplemented with 1× penicillin/streptomycin and 10% fetal bovine serum to 2 × 10⁷ cells in a 140 mm × 20 mm tissue culture dish for one ChIP experiment.
2.2 Add 37% formaldehyde to the medium to a final concentration of 1% and incubate the mixture on an orbital shaker at room temperature for 10 min. (Critical step: The extent of cross-linking is critical and depends on the cell line and protein of interest. The conditions for cross-linking should be optimized with respect to the concentration of formaldehyde, incubation time, and temperature.)
2.3 Quench formaldehyde by adding 2.5 M glycine to the medium to a final concentration of 0.2 M and incubate the mixture on an orbital shaker at room temperature for 5 min.
2.4 Discard the medium and wash the cells twice with 8 ml cold PBS.
2.5 Add 5 ml cold PBS (containing solubilized protease inhibitor) to the cells, scrape, and collect the cells in a 50-ml tube. (Note: A maximum of three ChIPs can be processed at a time within a tube.)
2.6 Rinse the plate with another 5 ml cold PBS (containing solubilized protease inhibitor) and collect the remaining cells into the same 50-ml tube.
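
Steps 2.2 and 2.3 specify final concentrations rather than volumes, so the volume of each stock depends on how much medium is on the dish. The sketch below shows one way to work this out, taking into account that the added stock itself increases the total volume. The 20-ml medium volume is a hypothetical value chosen for illustration; the chapter does not state it.

```python
# Volume of stock to spike into an existing volume to reach a target concentration
# (illustrative sketch; the medium volume below is an assumption).


def spike_volume(stock_conc: float, final_conc: float, start_volume_ml: float) -> float:
    """Volume (ml) of stock to add so the mixture reaches `final_conc`.

    Solves stock_conc * v = final_conc * (start_volume_ml + v) for v.
    Both concentrations must be in the same unit (e.g. % or M).
    """
    if not 0 < final_conc < stock_conc:
        raise ValueError("final concentration must be positive and below the stock")
    return final_conc * start_volume_ml / (stock_conc - final_conc)


MEDIUM_ML = 20.0  # hypothetical volume of medium on the dish

# Step 2.2: 37% formaldehyde to a final concentration of 1%.
formaldehyde_ml = spike_volume(37.0, 1.0, MEDIUM_ML)

# Step 2.3: 2.5 M glycine to a final concentration of 0.2 M,
# added to the medium that now also contains the formaldehyde.
glycine_ml = spike_volume(2.5, 0.2, MEDIUM_ML + formaldehyde_ml)

if __name__ == "__main__":
    print(f"37% formaldehyde: {formaldehyde_ml:.2f} ml")
    print(f"2.5 M glycine:    {glycine_ml:.2f} ml")
```

Because the result scales linearly with the starting volume, the same helper can be reused for other dish sizes once the actual medium volume is known.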

2.7 Centrifuge the tube at 1800 × g and 4 °C for 10 min, and discard the supernatant. (Pause point: The cell pellet can be stored at −80 °C for at least 1 year. Thaw the pellet gently on ice before proceeding.)

3. Cell lysis for 1.5 h
3.1 Resuspend two parallel cell pellets each with 10 ml of cold 0.1% SDS FA cell lysis buffer and incubate on the Intelli-Mixer (Program F1, 12 rpm) for 15 min at 4 °C. Centrifuge the tube at 800 × g and 4 °C for 10 min, and discard the supernatant. Repeat the cell lysis step with another 10 ml of cold 0.1% SDS FA cell lysis buffer. (Note: A maximum of three ChIPs can be processed at a time within a tube.) (Pause point: The pellets can be stored at −80 °C for 3 months. Thaw the pellet gently on ice before proceeding.)

4. Nuclear lysis for 3.5 h
4.1 Resuspend the cell pellet (from two ChIPs) with 10 ml of 1% SDS FA cell lysis buffer and transfer the suspension into a 50-ml PPCO centrifuge tube. Incubate on the Intelli-Mixer (Program F1, 12 rpm) for 15 min at 4 °C. Centrifuge the tube at 47 000 × g and 4 °C for 30 min and discard the supernatant. (Note: A maximum of three ChIPs can be processed at a time within one tube.)
4.2 Wash the cell pellet (from two ChIPs) twice with 20 ml of cold 0.1% SDS FA cell lysis buffer (per ChIP) and incubate on the Intelli-Mixer (Program F1, 12 rpm) for 15 min at 4 °C. Centrifuge the tube at 47 000 × g and 4 °C for 30 min, and discard the supernatant. (Pause point: The pellet can be stored at −80 °C for 3 months. Thaw the pellet gently on ice before proceeding.)

5. Sonication of lysate for 2.5 h
5.1 Transfer nuclei pellets from two ChIP samples into a clean 14-ml round-bottom tube. Add 2 ml of cold 0.1% SDS FA cell lysis buffer.
5.2 Using another clean 14-ml round-bottom tube, measure 1 ml of 0.5-mm glass beads and pour the beads into the pellet mixture. Remove any bubbles formed. (Note: Bubbles cause foaming during sonication, which reduces the efficiency of chromatin shearing.)
5.3 Shear the chromatin to the desired DNA size range by sonicating for eight cycles of 30 s at 35% amplitude, each followed by incubation on ice for 30 s. Keep the tube on ice at all times. (Critical step: Conditions such as the volume of the sample, depth of the sonicator probe, sonication strength, number of cycles, type of cell, and extent of cross-linking can influence the efficiency of sonication. Optimize the sonication conditions by testing different times and settings to obtain the desired DNA size range. Keep the sample in a beaker with ice throughout the sonication, including the 30-s incubations, to prevent overheating.)
5.4 To check the efficacy of sonication, aliquot 5 µl of the sonicated sample into a 1.5-ml LoBind tube, and centrifuge at 16 000 × g and 4 °C for 5 min. Transfer the supernatant to a new 1.5-ml LoBind tube. Add 5 µl of TE buffer and 1 µl Proteinase K to the sample, and mix well. Incubate the mixture at 42 °C for 30 min. Run the sample on a 1–2% prestained agarose gel to check that the DNA size range of the chromatin is between 200 and 700 bp.
5.5 Pipette the lysate, avoiding as many glass beads as possible, to a new 1.5-ml LoBind tube, and centrifuge at 16 000 × g and 4 °C for 30 min.
5.6 Transfer the solubilized chromatin to a new 1.5-ml LoBind tube. (Pause point: Store the chromatin at −80 °C. Thaw the chromatin gently on ice before proceeding.)

6. Immunoprecipitation of chromatin for 2 days
6.1 Preclearing of lysate. For every ChIP performed, aliquot 50 µl of Protein G magnetic beads into a 1.5-ml LoBind tube. Wash the beads thrice using 1 ml of cold PBS/0.1% Triton X-100 and incubate on the Intelli-Mixer (Program F1, 12 rpm) at room temperature for 5 min. Centrifuge the tube at 100 × g and 4 °C for 1 min. Place the tube on the MPC for 1 min and discard


supernatant carefully without disturbing the beads. Add 1 ChIP volume of lysate (1 ml) to the washed beads and incubate on the Intelli-Mixer (Program F1, 12 rpm) overnight at 4 °C. (Critical step: Aliquot 20 µl of the lysate and keep at −80 °C as the input DNA, which is later used as a control for ChIP enrichment analysis. Use the same type of protein magnetic beads (A or G) as used for coating the antibody below.)
6.2 Coating of magnetic beads with antibody. Aliquot 50 µl of Protein G magnetic beads for every ChIP into a 1.5-ml LoBind tube. Wash the beads thrice with 1 ml of cold PBS/0.1% Triton X-100 and incubate on the Intelli-Mixer (Program F1, 12 rpm) at room temperature for 5 min. Centrifuge the tube at 100 × g and 4 °C for 1 min. Place the tube on the MPC for 1 min and discard supernatant without disturbing the beads. Resuspend the beads with 450 µl of cold PBS/0.1% Triton X-100 and add 5 µg of H3K4me3 antibody. Incubate the tube on the Intelli-Mixer (Program F1, 12 rpm) at 4 °C overnight. (Critical step: Depending on the source of antibody, appropriate protein (A or G) magnetic beads should be used to ensure sufficient binding affinity to the antibody used.)
6.3 Centrifuge the tube containing the antibody-coated beads at 100 × g and 4 °C for 1 min. Place the tube on the MPC for 1 min and discard supernatant carefully without disturbing the beads. Wash off unbound antibody twice with 1 ml of cold PBS/0.1% Triton X-100. Discard buffer from the last wash prior to adding the precleared chromatin.
6.4 Centrifuge the tube containing the precleared chromatin at 100 × g and 4 °C for 1 min. Place the tube on the MPC for 1 min and transfer the supernatant into the tube containing the antibody-coated beads. Mix well by gentle pipetting.
6.5 Incubate the tube on the Intelli-Mixer (Program F1, 12 rpm) overnight at 4 °C.

7. Washes and elution of immunoprecipitated DNA–protein complexes for 1.5 h
7.1 Centrifuge the tube containing the magnetic beads coated with DNA–protein complexes at 100 × g and 4 °C for 1 min. Place the tube on the MPC for 1 min and discard supernatant.
7.2 Wash the beads 3 times with 0.1% SDS FA lysis buffer, once with 0.1% SDS FA lysis buffer/0.35 M NaCl, and once with ChIP wash buffer, in that order. For each wash, incubate the beads with 1 ml of buffer for 5 min on the Intelli-Mixer (Program F1, 12 rpm) at room temperature, centrifuge at 100 × g and 4 °C for 1 min, place on the MPC for 1 min, and discard the supernatant carefully without disturbing the beads.
7.3 Add 1 ml of TE buffer and incubate on the Intelli-Mixer (Program F1, 12 rpm) for 5 min at room temperature.
7.4 Aliquot 20% (200 µl) of the beads into a new 1.5-ml LoBind tube. Centrifuge the tube at 100 × g and 4 °C for 1 min. Place the tube on the MPC for 1 min and discard the TE buffer carefully without disturbing the beads. Elute the DNA–protein complexes off the beads by adding 200 µl of ChIP elution buffer. Mix well and incubate on the Intelli-Mixer (Program F1, 12 rpm) at 37 °C for 30 min. (Pause point: Store the remaining 80% of the beads at 4 °C for up to 2 weeks for ChIA-PET library construction upon ChIP enrichment verification. When stored for long periods of time, a significant fraction of ChIP DNA may come off the beads or chromatin proteins may be degraded, making the sample less useful for library construction.)
7.5 Centrifuge the tube at 100 × g and 4 °C for 1 min, and place it on the MPC for 1 min. Transfer the supernatant (eluted DNA) into a new 1.5-ml LoBind tube. (Note: The elution step can be repeated by adding another 200 µl of ChIP elution buffer to increase the DNA yield.)

8. Reverse cross-link and purification of immunoprecipitated DNA for 2 h
8.1 Digest the protein in the input DNA and eluted DNA–protein complexes by adding Proteinase K (20 mg/ml) to a final concentration of 0.2 mg/ml. Mix well and incubate at 45 °C for 2 h.


8.2 Centrifuge two 2-ml MaXtract High Density tubes at 16 000 × g and room temperature for 1 min to pellet the MaXtract gel.
8.3 Transfer the reverse cross-linked sample into the MaXtract High Density tube. Add an equal volume of phenol/chloroform/IAA (pH 7.9) to the tube and mix well by shaking the tube vigorously until a white solution is observed.
8.4 Centrifuge the tube at 16 000 × g and room temperature for 5 min. Carefully transfer the upper aqueous phase into a 1.5-ml LoBind tube without touching the MaXtract gel.
8.5 Precipitate DNA by adding 1/10 volume of 3 M sodium acetate (pH 5.5), 1 volume of isopropanol and 1 µl of GlycoBlue (50 mg/ml) and mix well. Incubate at −80 °C for 30 min.
8.6 Centrifuge the tube at 16 000 × g and 4 °C for 30 min, and discard the supernatant carefully without dislodging the pellet.
8.7 Wash the DNA pellet twice with 1 ml of 75% ethanol. During each wash, slowly pour away the ethanol, taking care that the pellet is intact. To reconsolidate the DNA pellet, centrifuge the tubes at 16 000 × g and 4 °C for 3 min.
8.8 Allow the pellet to dry using a Micro Vac for 2 min and resuspend the pellet with 20 µl of TE buffer.
8.9 Quantitate the DNA concentration with PicoGreen according to the product manual. Using the eluted DNA (20%) concentration, calculate the amount of ChIP DNA (80%) that is bound on the magnetic beads, which is later used for library construction. (Note: NanoDrop spectrophotometry cannot be used to determine DNA concentration as GlycoBlue will interfere with the readings.)

9. ChIP enrichment verification for 2 h
9.1 Perform quantitative real-time PCR on the eluted ChIP DNA and input DNA (as background) using positive and negative primers, specifically designed for H3K4me3 as enrichment bench markers, to verify the quality of the ChIP through enrichment calculation. Enrichment calculation can be done using the ΔΔCt method. (Critical step: qPCR should be performed on the same batches of chromatin quantified with PicoGreen. The ChIP enrichment must be more than 100-fold after normalizing the background level from input DNA. Poor ChIP enrichment usually results in low-quality ChIA-PET libraries that yield few binding sites and almost no interactions upon sequencing. As ChIP enrichment can sometimes be variable, ChIP-qPCR should be carried out on the actual batches used.)

10. ChIA-PET library construction
10.1 The starting material for ChIA-PET library construction is the enriched chromatin DNA and protein complexes bound on either Sepharose beads or magnetic beads (A or G). The ChIP beads are usually topped up with at least 1× volume of TE buffer and stored at 4 °C, ready for library construction. The ChIP materials should contain at least 50 µl of 100% Sepharose beads, or 50 µl of magnetic bead suspension (50% slurry), depending on the type of beads used for ChIP preparation. Moreover, a minimum of 50–100 ng chromatin DNA on the beads is critically required for library construction. Some points to be considered:
• ChIP material is sensitive to both proteases and nucleases. Care should be taken to avoid introducing contaminants. Precautionary procedures such as swabbing down lab benches with 70% ethanol are advised.
• Samples should always be kept on ice unless otherwise specified.
• The use of 1.5-ml DNA LoBind tubes (Eppendorf) is recommended to minimize the loss of sample as the starting amount of DNA is very small.
• All centrifugation of beads should be done at 100 × g (800 rpm) with a bench-top centrifuge, at 4 °C.
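
The two calculations in Steps 8.9 and 9.1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis script: the ΔΔCt function assumes perfect amplification efficiency (a doubling per cycle) unless another value is supplied, and all Ct values and the PicoGreen concentration in the example are hypothetical.

```python
# ChIP yield and enrichment helpers (illustrative sketch; example values are hypothetical).


def dna_on_beads_ng(conc_ng_per_ul: float, resuspension_ul: float = 20.0,
                    eluted_fraction: float = 0.2) -> float:
    """Estimate DNA remaining on the beads from the eluted aliquot (Step 8.9).

    The eluted aliquot is `eluted_fraction` of the ChIP (20% in Step 7.4) and is
    resuspended in `resuspension_ul` of TE buffer (20 µl in Step 8.8).
    """
    eluted_ng = conc_ng_per_ul * resuspension_ul
    return eluted_ng * (1.0 - eluted_fraction) / eluted_fraction


def fold_enrichment(ct_chip_pos: float, ct_input_pos: float,
                    ct_chip_neg: float, ct_input_neg: float,
                    efficiency: float = 2.0) -> float:
    """Fold enrichment of the positive over the negative site by the ΔΔCt method.

    ΔCt(site) = Ct(ChIP) - Ct(input); ΔΔCt = ΔCt(positive) - ΔCt(negative).
    `efficiency` = 2.0 assumes the template doubles every cycle.
    """
    delta_pos = ct_chip_pos - ct_input_pos
    delta_neg = ct_chip_neg - ct_input_neg
    return efficiency ** -(delta_pos - delta_neg)


if __name__ == "__main__":
    # Hypothetical PicoGreen reading of 1.0 ng/µl for the eluted 20% aliquot.
    print(f"DNA on beads: {dna_on_beads_ng(1.0):.0f} ng")
    # Hypothetical Ct values for the positive and negative primer pairs.
    fold = fold_enrichment(22.0, 25.0, 30.0, 25.5)
    print(f"Fold enrichment: {fold:.0f}", "(passes)" if fold > 100 else "(fails)")
```

Because the same input sample is used for both primer pairs, any dilution of the input relative to the ChIP cancels out in the ΔΔCt difference.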


A general note for all enzymatic reactions: because excessive glycerol from the enzyme stock may interfere with the reaction, ensure the volume of enzyme is [...]

16. Reverse cross-linking of chromatin for >16 h
16.1 For the 10-ml ligation, add 100 µl Proteinase K (Fermentas) and mix well by inverting. Incubate at 37 °C overnight (16 h) without rotation. (Note: Final enzyme concentration = 0.2 mg/ml; effective concentration = 0.05–1 mg/ml.) (Critical step: Chromatin and other DNA-associated proteins must be completely digested; otherwise DNA will be lost together with protein during the phenol/chloroform purification step.)

17. DNA purification for 4 h
17.1 Centrifuge a 15-ml MaXtract tube at 1800 × g and room temperature for 5 min to pellet the gel.
17.2 Transfer the 10 ml ligation mix into the MaXtract tube, add 10 ml phenol/chloroform/IAA (pH 7.9), and mix by inverting continuously for 2 min.
17.3 Centrifuge at 1800 × g and room temperature for 5 min to separate the phases. Transfer the upper aqueous phase to a 50-ml PPCO centrifuge tube and proceed to isopropanol precipitation.
17.4 Precipitate DNA by adding 1/10 volume of 3 M sodium acetate (pH 5.5), 1 volume of isopropanol and 2 µl of GlycoBlue (50 mg/ml), and mix well. Incubate at −80 °C for at least 1 h.
17.5 Allow the frozen solution to thaw before centrifugation at 47 000 × g and 4 °C for 45 min, using the fixed SS-34 rotor on the high-speed centrifuge.
17.6 Decant supernatant and transfer the blue pellet into a 1.5-ml LoBind tube. Wash the pellet twice with 1 ml of 75% ethanol. During each wash, slowly


pour away the ethanol, taking care that the pellet is intact. To reconsolidate the DNA pellet, centrifuge the tubes at 16 000 × g and 4 °C for 3 min.
17.7 Allow the pellet to dry using a Micro Vac for 2 min and resuspend the pellet with 35 µl of buffer EB.

18. MmeI digestion to release PET constructs for 2.5 h
18.1 Dilute the SAM stock 64× by mixing 1 µl of concentrated SAM stock with 63 µl of water. Prepare the reaction mix on ice:

DNA                                      34 µl
10× NEBuffer 4 (NEB)                      5 µl
SAM (64× diluted as above)                5 µl
Nonbiotinylated linker (200 ng/µl)        5 µl
MmeI (NEB)                                1 µl
Total                                    50 µl

Mix by gentle pipetting and incubate at 37 °C for 2 h without rotation. (Critical step: Nonbiotinylated linker is used to quench MmeI activity, as MmeI in excess can be self-inhibitory.)

19. PET immobilization on Dynabeads for 1 h
19.1 Prepare the Dynabeads M-280 Streptavidin by transferring 50 µl of resuspended Dynabeads to a 1.5-ml LoBind tube.
19.2 Centrifuge the tube at 100 × g and 4 °C for 1 min, and place the tube on the MPC for 1 min. Discard the buffer carefully without disturbing the beads. Wash the beads twice with 150 µl of 2× B&W Buffer: centrifuge at 100 × g and 4 °C for 1 min, place the tube on the MPC for 1 min, and discard the buffer. Resuspend the beads in 50 µl of 2× B&W Buffer. (Critical step: Do not let the beads dry out and always centrifuge the beads very gently at 100 × g.)
19.3 Add 50 µl digestion mix (from Step 18.1) to the resuspended beads and mix well. Incubate at room temperature with rotation on the Intelli-Mixer (Program F8, 30 rpm) for 45 min.
19.4 Centrifuge the tube at 100 × g and 4 °C for 1 min, and place the tube on the MPC for 1 min. Discard the supernatant carefully without disturbing the beads. Wash the beads thrice with 150 µl of 1× B&W Buffer: centrifuge at 100 × g and 4 °C for 1 min, place the tube on the MPC for 1 min, and discard the buffer. Leave the beads in 1× B&W Buffer after the last wash.

20. Ligation of Solexa or ABI SOLiD Adapters to the immobilized PET for 1 day
20.1 Prepare ligation mix on ice (Critical step: Adapters should be thawed gently on ice.):

Nuclease-free water                        36 µl
Solexa or SOLiD adapter A (200 ng/µl)       4 µl
Solexa or SOLiD adapter B (200 ng/µl)       4 µl
10× T4 DNA ligase buffer (Fermentas)        5 µl
Total                                      49 µl

20.2 Remove the 1× B&W buffer using the MPC, add the ligation mix to the beads, and resuspend. Add 1 µl T4 DNA ligase (Fermentas) and mix by pipetting. Incubate at 16 °C overnight (16 h) with rotation on the Intelli-Mixer (Program F8, 30 rpm).
20.3 Centrifuge the tube at 100 × g and 4 °C for 1 min and place the tube on the MPC for 1 min. Discard the supernatant carefully without disturbing the beads. Wash the beads thrice with 150 µl of 1× B&W buffer by centrifuging


at 100 × g and 4 °C for 1 min, place the tube on the MPC for 1 min, and discard the buffer carefully without disturbing the beads. Leave the beads in 1× B&W Buffer on ice after the last wash.
20.4 Prepare enzyme mix on ice (50 µl/sample):

Nuclease-free water                 38.5 µl
10× NEBuffer 2 (NEB)                 5.0 µl
10 mM dNTP mix (Eppendorf)           2.5 µl
E. coli DNA polymerase I (NEB)       4.0 µl
Total                               50.0 µl

20.5 Remove the 1× B&W Buffer using the MPC, add the enzyme mix to the beads and resuspend. Mix by pipetting, and incubate at room temperature for 2 h with rotation on the Intelli-Mixer (Program F8, 30 rpm).

21. Quality control PCR amplification of ChIA-PETs for 3 h
21.1 Centrifuge the tube at 100 × g and 4 °C for 1 min and place the tube on the MPC for 1 min. Discard the enzyme mix carefully without disturbing the beads. Wash the beads twice with 150 µl of 1× B&W Buffer using the MPC. Resuspend the beads in 50 µl of EB buffer and transfer to a fresh 1.5-ml LoBind tube. (Pause point: Beads in EB buffer may be stored at −20 °C.)
21.2 In a 0.2-ml PCR tube, for each PCR reaction add and mix:

Beads suspension as template                           2 µl
Solexa 1-454 or SOLiD PCR primer 1 (25 µM)             1 µl
Solexa 2-454 or SOLiD PCR primer 2 (25 µM)             1 µl
Phusion High-Fidelity PCR master mix (Finnzymes)      25 µl
Nuclease-free water                                   21 µl
Total                                                 50 µl

PCR cycle conditions:

98 °C    30 s
98 °C    10 s   }
65 °C    30 s   } 18 cycles
72 °C    30 s   }
72 °C    5 min
4 °C     hold

21.3 Mix the quality control PCR products with 10 µl of 6× loading dye and run on a 4–20% TBE gel, 25 µl per well, with 250 ng of 25-bp DNA ladder. Handle the gel using the gel handler. Stain the gel with SYBR Green I (1 µl SYBR Green per 10 ml TBE buffer) and visualize on the DarkReader. Select the best cycle conditions and estimate the number of reactions required for PCR scale-up. (Critical step: PCR optimization may be required, so it is advisable to test different cycle conditions, volumes of beads, and so on, during the first quality control run. Use as few PCR cycles as possible, as overamplification can increase the chances of PCR errors and lower the overall complexity of


the library. Do not reuse Dynabeads for further PCR reactions. This has been found to give poor results.)

22. Scale-up PCR amplification of ChIA-PET for 2 days
22.1 Scale up the number of PCR reactions according to the quality control data obtained in Step 21. Suggested number of PCR reactions: 12–16.
22.2 Combine the PCR product into one tube and place the tube on an MPC for 1 min. Carefully transfer the PCR product, without touching the beads, into a 2-ml MaXtract tube. Add an equal volume of phenol/chloroform/IAA (pH 7.9) and mix the tube well. Centrifuge the tube at 16 000 × g and room temperature for 5 min. Carefully transfer the top aqueous phase into a new tube. Add 1 volume of isopropanol, 1/10 volume of sodium acetate and 1 µl of GlycoBlue, mix well and incubate at −80 °C for 30 min. Centrifuge at 16 000 × g and 4 °C for 30 min. Discard supernatant and wash the DNA pellet twice with 75% ethanol. Allow the pellet to dry and resuspend the DNA with 200 µl TE buffer.
22.3 Add and mix 40 µl of 6× loading dye to the DNA and run 50 µl sample per well in a 6% TBE PAGE gel at 200 V for 35 min together with 1 µg of 25-bp DNA ladder. (Note: Do not overload the PAGE gel, as the resulting smears may interfere with visualization of DNA bands. Some DNA ladders have been found to be unsuited for use on PAGE gels, so do not substitute DNA ladders unless you have already tested them and found that they run well on a PAGE gel.)
22.4 Handle the gel using the gel handler. Stain the gel with SYBR Green I and visualize on the DarkReader. Excise DNA of the desired size. (Critical step: To prevent the inclusion of nonspecific DNA, excise only the central portion of the band, excluding the highest and lowest portions. Do not visualize the gel by UV as this could damage the DNA.)
22.5 Collect gel slices into 0.6-ml tubes with the bottom pierced with a 21G needle.
22.6 Place the pierced 0.6-ml tubes inside a 1.5-ml screw-cap tube and centrifuge at 16 000 × g and 4 °C for 5 min.
22.7 Add 200 µl TE buffer to each 1.5-ml screw-cap tube containing shredded gel and stir the gel pieces with the pipette tip, making sure that the gel pieces are immersed in the buffer.
22.8 Freeze the gel suspension at −80 °C for 1 h and incubate the tube at 37 °C overnight.
22.9 Transfer the gel pieces together with the buffer in each screw-cap tube to the filter cup of a Spin-X tube. Centrifuge at 16 000 × g and 4 °C for 10 min. Transfer the eluate into a new 1.5-ml LoBind tube.
22.10 Rinse the screw-cap tube with 200 µl TE buffer, transfer the rinsing buffer to the filter cup of the same Spin-X unit upon completion of Step 22.9, and centrifuge again at 16 000 × g and 4 °C for 10 min.
22.11 Pool the flow-through and purify the DNA by mixing with an equal volume of phenol/chloroform/IAA (pH 7.9) in a 2-ml MaXtract tube and centrifuging the tube at 16 000 × g and room temperature for 5 min. Transfer the upper aqueous phase to a new LoBind tube and add 1/10 volume of 3 M sodium acetate (pH 5.5), 1 volume of isopropanol, and 1 µl of GlycoBlue (50 mg/ml).
22.12 Incubate the tube for 30 min at −80 °C, and centrifuge at 16 000 × g and 4 °C for 30 min. Discard the supernatant and wash the DNA pellet twice with 1 ml of 75% ethanol. Allow the pellet to dry and resuspend the DNA with 15 µl of TE buffer.
22.13 Perform a quality control check with a double-stranded DNA 1000 chip on an Agilent Bioanalyzer, using 1 µl of sample to determine the quality and quantity of the ChIA-PET template DNA. (Note: There should be a well-defined, intense electropherogram peak corresponding to the fragment


size of interest. The lower and upper markers should be clearly indicated and the baseline should be flat, with no stray peaks. The Agilent Bioanalyzer produces up to 10% error and usually reports a higher-than-expected product size. At this point, the ChIA-PET template DNA is ready for high-throughput sequencing analysis. Solexa GAIIx paired-end sequencing is the platform we use to generate ChIA-PET data. However, the ChIA-PET template can also be sequenced by other high-throughput sequencing platforms such as ABI (SOLiD).)
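
For the scale-up in Step 22.1, the per-reaction volumes from Step 21.2 can be multiplied out into a pipetting plan. The sketch below is one possible way to do this; the optional overage factor and the check against the 50 µl of bead suspension from Step 21.1 are illustrative additions, not requirements stated in the protocol.

```python
# Scale the quality control PCR recipe (Step 21.2) to several reactions
# (illustrative sketch; the overage parameter is an assumption).

PCR_MIX_UL = {
    "Beads suspension (template)": 2.0,
    "PCR primer 1 (25 uM)": 1.0,
    "PCR primer 2 (25 uM)": 1.0,
    "Phusion High-Fidelity PCR master mix": 25.0,
    "Nuclease-free water": 21.0,
}

BEAD_SUSPENSION_UL = 50.0  # beads are resuspended in 50 µl of EB buffer in Step 21.1


def scale_mix(mix: dict, reactions: int, overage: float = 0.0) -> dict:
    """Return component volumes (µl) for `reactions` reactions plus fractional overage."""
    if reactions < 1:
        raise ValueError("at least one reaction is required")
    factor = reactions * (1.0 + overage)
    return {name: round(volume * factor, 2) for name, volume in mix.items()}


def beads_sufficient(reactions: int, already_used_ul: float = 0.0) -> bool:
    """Check that the remaining bead suspension covers the planned reactions."""
    needed = reactions * PCR_MIX_UL["Beads suspension (template)"]
    return needed <= BEAD_SUSPENSION_UL - already_used_ul


if __name__ == "__main__":
    plan = scale_mix(PCR_MIX_UL, reactions=16)
    for name, volume in plan.items():
        print(f"{name}: {volume} µl")
    print("Total:", sum(plan.values()), "µl")
    print("Enough beads after one QC reaction:", beads_sufficient(16, already_used_ul=2.0))
```

Since the beads are not reused after PCR, checking bead consumption before setting up the scale-up helps avoid running short of template.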

12.3 Timeline

• ChIP sample preparation procedure (total time span: 4 days)
    Formaldehyde cross-linking of cells: 1.0 h
    Cell lysis: 1.5 h
    Nuclear lysis: 3.5 h
    Sonication of lysate: 2.5 h
    Immunoprecipitation of chromatin: 2 days
    Washes and elution of immunoprecipitated DNA–protein complexes: 1.5 h
    Reverse cross-link and purification of immunoprecipitated DNA: 2.0 h
    ChIP enrichment verification: 2.0 h

• ChIA-PET library construction (total time span: 7 days)
    End-polishing of ChIP DNA: 2.0 h
    Biotinylated linker ligation to ChIP DNA: 16.0 h
    Phosphorylation of linker 5′ ends: 1.5 h
    Elution of chromatin complex: 2.5 h
    Proximity ligation of linker-added DNA fragments: 2.5 h
    Reverse cross-linking of chromatin: >16.0 h
    MmeI digestion to release PET constructs: 2.5 h
    PET immobilization onto Dynabeads: 1.0 h
    Sequencing adapter ligation: 1 day
    Quality control PCR amplification of ChIA-PETs: 3.0 h
    Scale-up PCR amplification of ChIA-PET: 2 days

• Paired-end sequencing and processing (total time span: 7–8 days). The time span includes ChIA-PET template preparation, paired-end sequencing (2 × 36 bp), and processing and mapping of sequence reads.

• ChIA-PET pipeline analysis and data upload for browser view (total time span: 1 day). With the GIS ChIA-PET analysis pipeline [30], the ChIA-PET sequence data are further classified into mappable PETs, unique PETs, and uniquely mapped PETs.

Total time: 3 weeks
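As a quick check on the overall figure, the four stage durations can be added up. The sketch below assumes that the stages are run strictly one after another with no idle days in between; the dictionary and function names are chosen here for illustration only.

```python
# Stage durations in days as (minimum, maximum), taken from the timeline.
stages = {
    "ChIP sample preparation": (4, 4),
    "ChIA-PET library construction": (7, 7),
    "Paired-end sequencing and processing": (7, 8),
    "Pipeline analysis and data upload": (1, 1),
}


def total_days(stage_table):
    """Sum the minimum and maximum durations over all stages."""
    low = sum(lo for lo, _ in stage_table.values())
    high = sum(hi for _, hi in stage_table.values())
    return low, high


low, high = total_days(stages)
print(f"Total: {low}-{high} days (about {round(high / 7)} weeks)")
```

Under this assumption the stages add up to 19–20 days, which is consistent with the stated total of 3 weeks.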

12.4 Anticipated Results

12.4.1 Verification of Sonicated Chromatin DNA Size Range

The fragmented chromatin DNA should range from around 200 to 2000 bp. Figure 12.5 shows a typical DNA size range after chromatin sonication. We expect the majority of sonicated DNA fragments to fall within the 200- to 600-bp range.

12.4.2 ChIP Quality Control: Yield and Enrichment

For the ChIP sample quality control, we first expect at least 50–100 ng DNA on beads in order to proceed to ChIA-PET library construction. For the ChIP enrichment, we expect strong qPCR enrichment at benchmark genes after normalization against input DNA and negative control sites.

12.4.3 ChIA-PET Library Quality Control

At Step 21, the quality control PCR gel of the constructed ChIA-PET library should show a specific band at the expected size of 223 bp (Figure 12.6a). At Step 22, the scale-up PCR gel of the ChIA-PET library should show abundant and clean ChIA-PET DNA bands (Figure 12.6b), which are then excised from the gel, purified, and quantified on an Agilent Bioanalyzer (Figure 12.6c).

Fig. 12.5 Gel image of the size range of sonicated chromatin. A small aliquot of chromatin DNA fragments is de-cross-linked and run on a 1% agarose gel. A smear should be observed between 200 and 600 bp.

12.4.4 ChIA-PET Sequencing and Mapping Analysis

A successful library should contain a high number of self-ligation and interligation PETs with an exponentially decreasing distribution of genomic spans, such as in the example shown in Figure 12.7(a). A successful ChIA-PET library should exhibit a good mapping pattern, strong binding peaks and interactions, and few or no interactions from chimeric PETs within the same region of a chromosome (Figure 12.7d).

Fig. 12.6 ChIA-PET library DNA template quality control results. (a) Example of a quality control PCR gel for a ChIA-PET template. The correct size of the PET construct (Adapter A–Tag–Linker–Tag–Adapter B) is 223 bp. (b) Scale-up PCR gel for a preparative ChIA-PET template. The bands for ChIA-PET constructs at the 223-bp position are excised from the gel and purified as the final sequencing template. (c) The purified ChIA-PET template is assessed again on an Agilent Bioanalyzer, and a unique band at 223 bp should be seen.


Fig. 12.7 ChIA-PET sequence mapping results. (a) Example of a good ChIA-PET (H3K4me3 in HCT116) mapping distribution. The PETs are mapped to the human reference genome and the genomic span of PET sequences is shown. The majority of the PET sequences are referred to as self-ligation PETs, as their two tags map within a span of a few hundred up to 3000 bp. (b) Example of a bad ChIA-PET (H3K4me3 in HCT116) mapping distribution showing a lack of self-ligation PETs within the dataset. (c1) In the ChIA-PET genome browser, a specific region from RNAPII ChIA-PET in HCT116 cells shows multiple interligation PET clusters between two binding sites. (c2) In the same region, H3K4me3 ChIA-PET shows a similar interligation PET pattern. (c3) In the same region, no binding and no interactions are detected from the linker-A/B-associated chimeric-ligation PETs, which are a subset of the H3K4me3 ChIA-PET data. (d) An example of good binding, but weak interaction, detected in a region of the H3K4me3 ChIA-PET dataset. (e) Example of regions with no ChIP enrichment and no interactions detected. The tracks in the ChIA-PET human genome browser are displayed as: (1) density of self-ligation PETs, (2) interligation PETs, and (3) self-ligation PETs.


An unsuccessful library will show weak or no interaction or binding sites on the browser (Figure 12.7e and f). Adjustments should be made to optimize ChIP conditions and construct a new library.
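The distinction between self-ligation and interligation PETs drawn in Figure 12.7 can be illustrated with a short classification sketch. This is not the logic of the ChIA-PET Tool [30]: it is a simplified illustration that assumes a fixed 3000-bp span cut-off (the upper bound quoted in the Figure 12.7 caption), ignores strand orientation, and uses hypothetical names (`Tag`, `classify_pet`) and made-up coordinates.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    chrom: str  # chromosome the tag maps to
    pos: int    # mapped coordinate of the tag (bp)


# Assumed cut-off, taken from the span range quoted in the Figure 12.7 caption.
SELF_LIGATION_MAX_SPAN = 3000


def classify_pet(head: Tag, tail: Tag, max_span: int = SELF_LIGATION_MAX_SPAN) -> str:
    """Classify a mapped PET by the genomic span between its two tags."""
    if head.chrom != tail.chrom:
        return "interligation (interchromosomal)"
    span = abs(head.pos - tail.pos)
    if span <= max_span:
        return "self-ligation"
    return "interligation (intrachromosomal)"


# Made-up example PETs.
pets = [
    (Tag("chr1", 10_000), Tag("chr1", 10_450)),
    (Tag("chr1", 10_000), Tag("chr1", 250_000)),
    (Tag("chr1", 10_000), Tag("chr7", 10_200)),
]
for head, tail in pets:
    print(head, tail, classify_pet(head, tail))
```

Counting the classes over a whole dataset gives a rough view of the span distribution that separates a good library (Figure 12.7a) from a poor one (Figure 12.7b).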

12.5 Perspectives

Overall, the ChIA-PET analysis protocol is technically challenging and requires good molecular manipulation skills in ChIP preparation and library construction. DNA sequencing of the ChIA-PET library requires reads from both ends of each template, and mapping of the ChIA-PET sequence data requires paired matching of the two tags from each PET sequence read. In addition, the ChIA-PET mapping data reveal new dimensions of genome-wide information that have not been charted before. ChIA-PET analysis therefore poses many new technical challenges, but also represents exciting opportunities to identify and characterize new features of chromatin interaction and chromosomal topological architecture that form the framework for gene transcription regulation and other nuclear functions, such as chromosomal rearrangement (translocation), DNA replication, and DNA repair. A ChIA-PET experiment provides two genome-wide datasets: the binding sites of the target protein factor under study and the long-range interactions among those binding sites. In this respect, the ChIA-PET method is superior to the current ChIP-seq method, which provides only protein binding site information. Hence, if you have a question that you think a ChIP-seq experiment can answer, try ChIA-PET instead, because it provides the binding sites as well as the interactions among them.

Acknowledgments

The authors acknowledge the following: Huay Mei Poh, Su Qin Peh, Chin Thing Ong, Adeline Chew, Poh Tong Shing Thompson, Lim Kian Chew, Lee Yen Ling, Sia Yee Yen, Eunice Tai, Dawn Sum, See Ting Leong, Low Hwee Meng, and Herve Thoreau from the Genome Technology and Biology Group at the Genome Institute of Singapore for technical support; and Atif Shahab, Chan Chee Seng, Fabianus H. Mulawadi, Guoliang Li, and Ken Kin Sung Wing from the IT groups for computing support. Y.R. is supported by A*STAR of Singapore and NIH ENCODE grants (R01 HG004456-01, R01HG003521-01, and part of 1U54HG004557-01). All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

References

1 Woodcock, C.L. (2006) Chromatin architecture. Curr. Opin. Struct. Biol., 16, 213–220.
2 Buck, M.J. and Lieb, J.D. (2004) ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83, 349–360.
3 Wu, J. et al. (2006) ChIP-chip comes of age for genome-wide functional analysis. Cancer Res., 66, 6899–6902.
4 Wei, C.L. et al. (2006) A global map of p53 transcription-factor binding sites in the human genome. Cell, 124, 207–219.
5 Johnson, D.S. et al. (2007) Genome-wide mapping of in vivo protein–DNA interactions. Science, 316, 1497–1502.
6 Dekker, J. (2006) The three “C”s of chromosome conformation capture: controls, controls, controls. Nat. Methods, 3, 17–21.
7 Dekker, J. et al. (2002) Capturing chromosome conformation. Science, 295, 1306–1311.
8 Cai, S. et al. (2006) SATB1 packages densely looped, transcriptionally active chromatin for coordinated expression of cytokine genes. Nat. Genet., 38, 1278–1288.
9 Horike, S. et al. (2005) Loss of silent-chromatin looping and impaired imprinting of DLX5 in Rett syndrome. Nat. Genet., 37, 31–40.
10 Gondor, A. et al. (2008) High-resolution circular chromosome conformation capture assay. Nat. Protoc., 3, 303–313.
11 Ohlsson, R. and Gondor, A. (2007) The 4C technique: the “Rosetta stone” for genome biology in three-dimensional? Curr. Opin. Cell Biol., 19, 321–325.
12 Simonis, M. et al. (2006) Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat. Genet., 38, 1348–1354.
13 Zhao, Z. et al. (2006) Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat. Genet., 38, 1341–1347.
14 Wurtele, H. and Chartrand, P. (2006) Genome-wide scanning of HoxB1-associated loci in mouse ES cells using an open-ended chromosome conformation capture methodology. Chromosome Res., 14, 477–495.
15 Dostie, J. et al. (2006) Chromosome conformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res., 16, 1299–1309.
16 Carter, D. et al. (2002) Long-range chromatin regulatory interactions in vivo. Nat. Genet., 32, 623–626.
17 Osborne, C.S. et al. (2004) Active genes dynamically colocalize to shared sites of ongoing transcription. Nat. Genet., 36, 1065–1071.
18 Cullen, K.E. et al. (1993) Interaction between transcription regulatory regions of prolactin chromatin. Science, 261, 203–206.
19 Fullwood, M.J. et al. (2009) Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res., 19, 521–532.
20 Ng, P. et al. (2007) Paired-end ditagging for transcriptome and genome analysis. Curr. Protoc. Mol. Biol., 21 21.12.
21 Ng, P. et al. (2005) Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat. Methods, 2, 105–111.
22 Barski, A. et al. (2007) High-resolution profiling of histone methylations in the human genome. Cell, 129, 823–837.
23 Margulies, M. et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437, 376–380.
24 Fullwood, M.J. et al. (2009) An oestrogen-receptor-alpha-bound human chromatin interactome. Nature, 462, 58–64.
25 Misteli, T. (2007) Beyond the sequence: cellular organization of genome function. Cell, 128, 787–800.
26 Simonis, M. et al. (2007) An evaluation of 3C-based methods to capture DNA interactions. Nat. Methods, 4, 895–901.
27 Fullwood, M.J. and Ruan, Y. (2009) ChIP-based methods for the identification of long-range chromatin interactions. J. Cell Biochem., 107, 30–39.
28 Shendure, J. et al. (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science, 309, 1728–1732.
29 Harris, T.D. et al. (2008) Single-molecule DNA sequencing of a viral genome. Science, 320, 106–109.
30 Li, G., Fullwood, M.J., Xu, H., Mulawadi, F.H., Velkov, S., Vega, V.B., Ariyaratne, P.N., Mohamed, Y.B., Ooi, H.-S., Tennakoon, C., Wei, C.-L., Ruan, Y., and Sung, W.-K. (2010) ChIA-PET Tool: a comprehensive software for chromatin interaction analysis with paired-end-tag sequencing. Genome Biol., 11, R22.

13 Tag-Seq: Next-Generation Tag Sequencing for Gene Expression Profiling

Sorana Morrissy, Yongjun Zhao, Allen Delaney, Jennifer Asano, Noreen Dhalla, Irene Li, Helen McDonald, Pawan Pandoh, Anna-Liisa Prabhu, Angela Tam, Martin Hirst, and Marco Marra

Abstract

This chapter describes a protocol for performing digital gene expression profiling on the Illumina Genome Analyzer sequencing platform. Tag sequencing (Tag-seq) is an implementation of the LongSAGE protocol on the Illumina sequencing platform that increases utility while reducing both the cost and time required to generate gene expression profiles. The ultra-high-throughput sequencing capability of the Illumina GA platform allows the cost-effective generation of libraries containing an average of 20 million tags – a 200-fold improvement over classical LongSAGE. Tag-seq has less sequence composition bias, leading to a better representation of AT-rich tag sequences, and allows accurate profiling of a subset of the transcriptome characterized by AT-rich genes expressed at levels below the threshold of detection of LongSAGE.

13.1 Introduction

Comprehensive genome-wide datasets have been extensively mined in order to better annotate transcribed regions, identify regulatory regions, and profile differentially expressed genes between normal and disease states. To date, analyses of expressed sequence tag (EST) libraries and cDNA libraries, mRNA sequences, transcript tagging technologies, and microarrays have provided evidence that a large proportion (60–90%) of the mouse and human genomes is transcribed into RNA [1,2]. Comparing large collections of EST and tag sequence libraries has shown that tag sequencing performs better at identifying low-frequency expression given the same sampling depth [3]. Tag sequencing methods also have certain advantages over microarrays, since they are not reliant on prior knowledge of either genomic or transcribed regions, and they provide digital counts of transcript expression over a large dynamic range.

The tag sequencing technique of serial analysis of gene expression (SAGE) [4] has been extensively used for gene expression profiling and de novo transcript discovery. A major aim of developing this technique was to increase the efficiency of sequence-driven transcript profiling by increasing the number of transcripts detected per sequence read. The SAGE technique thus results in the generation of sequence reads containing 30–45 short sequence tags (depending on read length) from the 3′ ends of polyadenylated (poly(A)+) transcripts, which can be mapped to transcript and genome resources to provide digital counts of transcript expression of known and novel genes, respectively [4]. Assessing the expression of rare transcripts can be achieved by increasing the number of sequences analyzed; however, the cost of generating large numbers of reads using Sanger sequencing is prohibitive. This limitation can be overcome with the application of next-generation sequencing platforms, enabling cost-effective ultra-high-throughput sampling [5]. One such application (Tag-seq) was developed to utilize the Illumina platform to sequence tag libraries generated through a modified LongSAGE protocol [6].

Similar to conventional SAGE [4,7], Tag-seq library construction involves the production of short tags from the 3′ ends of mRNA molecules. The basic protocols in this chapter describe procedures for capturing mRNA molecules via their poly(A)+ tails, generating the tags, sequencing them on the Illumina platform, and analyzing the resulting data. An overview of the Tag-seq protocol is shown in Figure 13.1.

Fig. 13.1 Tag-seq library generation. Polyadenylated mRNAs (blue rectangle) are captured using oligo(dT) beads and double-stranded cDNA (orange rectangle) is subsequently synthesized. The cDNA is digested with the NlaIII anchoring restriction enzyme (vertical dark red arrows), leaving a 4-bp overhang (GTAC). Only cDNA fragments anchored to oligo(dT) beads are retained. Adapter A (teal rectangle) is ligated to the overhang, and contains a recognition site for the Type IIS tagging enzyme MmeI (TCCGAC, which overlaps by one base with the NlaIII recognition site, CATG). Following MmeI digestion (dark red arrow), Adapter B is ligated (brown rectangle) to the resulting 2-bp overhang. PCR primers (blue arrows) annealing to adapters A and B are used to enrich tags (orange rectangles). Cluster generation and sequencing (purple arrow) are performed on the Illumina Cluster Station and Genome Analyzer. The resulting image files are processed to extract the read sequences, and 21-bp SAGE tags are further extracted from the reads. Tags consist of the 4-bp NlaIII recognition site and 17 bp of unique sequence, and constitute a total of 21 bases that can be mapped back to the original mRNA.
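The tag structure described for Figure 13.1 (the 3′-most NlaIII site, CATG, followed by 17 bases of transcript sequence) can be reproduced in silico, for example to predict which tag an annotated transcript should yield. The following is a minimal sketch under stated assumptions: the function name and the example sequence are invented for illustration, the transcript is given 5′ to 3′ without its poly(A) tail being treated specially, and transcripts with fewer than 17 bases downstream of the last site are simply reported as yielding no tag.

```python
ANCHOR_SITE = "CATG"  # NlaIII recognition site
UNIQUE_LEN = 17       # bases of transcript sequence following the site


def extract_tag(transcript):
    """Return the 21-base tag at the 3'-most NlaIII site, or None.

    None is returned when the transcript has no CATG site, or when fewer
    than UNIQUE_LEN bases follow the 3'-most site (a simplification made
    for this sketch).
    """
    seq = transcript.upper()
    pos = seq.rfind(ANCHOR_SITE)  # rfind gives the 3'-most occurrence
    if pos == -1:
        return None
    tag = seq[pos:pos + len(ANCHOR_SITE) + UNIQUE_LEN]
    return tag if len(tag) == len(ANCHOR_SITE) + UNIQUE_LEN else None


# Invented example transcript with two NlaIII sites.
transcript = (
    "GGATC" + "CATG" + "AAACCCGGGTTTACGTACGTAG"
    + "CATG" + "TTGACCTGAAGCTTACG" + "GATTC" + "A" * 8
)
print(extract_tag(transcript))
```

Only the downstream site contributes a tag, which mirrors the retention of the 3′-most fragment on the oligo(dT) beads.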

13.2 Protocol Details

Similar to conventional SAGE, Tag-seq library construction involves the capture of mRNA molecules via their poly(A)+ tails using magnetic oligo(dT) beads from DNase I-treated total RNA (Figure 13.1). Double-stranded cDNA is enzymatically generated from the captured mRNA directly on the magnetic beads. The cDNAs are digested with the restriction enzyme NlaIII to remove sequence 5′ of the 3′-most NlaIII recognition sequence in the mRNA molecule. Note that a small minority of transcripts without NlaIII recognition sites cannot be profiled [7]. After removal of the digested products, adapters containing sequences compatible with Illumina cluster generation as well as a recognition site for the type IIS enzyme MmeI are ligated onto the 4-nucleotide overhangs left after NlaIII digestion. The 5′ adapter and 21-bp sequence tag are released from the oligo(dT) beads by digestion with MmeI. Following purification of tags and dephosphorylation to prevent self-ligation, adapters containing a 2-nucleotide degenerate 3′ overhang and sequences compatible with direct sequencing on the Illumina Genome Analyzer are ligated to the random overhang left after MmeI digestion. The resulting product is polymerase chain reaction (PCR)-amplified using primers containing sequences that will hybridize to the surface of an Illumina flow cell. After size separation by polyacrylamide gel electrophoresis (PAGE), an 85-bp DNA band is excised and purified using a filter column followed by ethanol precipitation. Cluster generation and sequencing are performed on the Illumina Cluster Station and Genome Analyzer, respectively, using a sequencing primer that is specific for the 5′ adapter. The first base sequenced is the first base of the sequence tag following the NlaIII recognition sequence.

Sequence reads are output to files along with the number of times that each sequence read was observed (refer to Basic Protocol 5). Sequences are trimmed at the first ambiguous base call and truncated at 17 bp to create “raw” tags. Raw tags shorter than 17 bp are discarded. Remaining tags are filtered to remove adapter sequences and create an “SA” (sans adapters) version of the raw library. When genome or transcriptome annotations are available for the species corresponding to the sequencing experiment, data filtering can be done to retain only those tags mapping to the genome (“MG”), mapping to the transcriptome (“MT”; in the case of human or mouse, including Unigene [8], the Mammalian Gene Collection (MGC [9]), and miRDB (miRDB.org)), mapping to either the genome or the transcriptome (“MA”), or mapping to RefSeq (“MR” [9]). Given an available transcriptome, an additional type of filtering termed “SSOOHE” (sans singletons and one-offs of highly expressed tags) can be implemented to remove (i) tags observed at a count of one (singletons), and (ii) low-frequency tags that differ by one base from a highly expressed tag (one-offs) and which do not themselves map to a known transcript. By definition, one-off tags have a maximum frequency of 5% of the high-frequency tags, which in turn have counts above 500. For purposes of novel transcript discovery, the use of SA-filtered or MG-filtered libraries is recommended. For analyses of the expression of annotated transcripts, any filtered dataset can be used.
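The raw-tag and SSOOHE rules described above can be sketched as follows. This is an illustrative reading of the text, not the authors' pipeline: the function names are hypothetical, any character other than A, C, G, or T is assumed to mark an ambiguous base call, and the 5% limit is interpreted as applying relative to the count of the specific highly expressed tag that the candidate one-off resembles.

```python
TAG_LEN = 17  # raw tag length after truncation


def raw_tag(read, tag_len=TAG_LEN):
    """Trim at the first ambiguous base, truncate to tag_len, drop short tags."""
    read = read.upper()
    for i, base in enumerate(read):
        if base not in "ACGT":  # assumption: non-ACGT means an ambiguous call
            read = read[:i]
            break
    return read[:tag_len] if len(read) >= tag_len else None


def differs_by_one(a, b):
    """True if two equal-length tags differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def ssoohe_filter(counts, mapped_tags, high_count=500, one_off_fraction=0.05):
    """Remove singletons and unmapped one-offs of highly expressed tags."""
    high = {tag: n for tag, n in counts.items() if n > high_count}
    kept = {}
    for tag, n in counts.items():
        if n == 1:
            continue  # singleton
        is_one_off = tag not in mapped_tags and any(
            n <= one_off_fraction * high_n and differs_by_one(tag, high_tag)
            for high_tag, high_n in high.items()
        )
        if not is_one_off:
            kept[tag] = n
    return kept


# Invented tag counts; only the poly-A tag is treated as transcript-mapped.
counts = {
    "A" * 17: 1000,       # highly expressed
    "A" * 16 + "C": 20,   # one-off: 20 <= 5% of 1000, unmapped
    "A" * 16 + "G": 80,   # too frequent to be a one-off
    "C" * 17: 1,          # singleton
    "G" * 17: 5,          # low count, but not close to a high tag
}
print(ssoohe_filter(counts, mapped_tags={"A" * 17}))
```

The pairwise comparison against every highly expressed tag is quadratic in the worst case; for libraries of millions of tags an index of one-mismatch neighbors would be the more practical design.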

13.3 Protocol Overview and Timeline

The Tag-seq lab protocols described in this chapter are organized into five basic protocols. Basic Protocol 1 describes first- and second-strand cDNA synthesis; Basic Protocol 2 provides methods for tag generation; Basic Protocol 3 provides PCR and fragment isolation procedures; and Basic Protocol 4 describes preparing the library for Illumina sequencing. The Alternate Protocol provides a method for amplified Tag-seq library construction (Tag-seqLite), used when the RNA material is limiting (e.g., below 500 ng) and cDNA amplification is consequently necessary (this protocol requires an additional working day). Basic Protocol 5 describes analysis of the data generated by Illumina sequencing. (Note: Illumina-specific adapters, PCR, and sequencing primer oligonucleotide sequences are protected by an Illumina-owned copyright.)

Whenever possible, Tag-seq library construction should be segregated into three areas: a pre-PCR area including RNA and cDNA synthesis, a library construction area, and a post-PCR area. Multiple samples can be processed in parallel; however, attention must be paid to prevent sample swaps and potential cross-contamination. The basic laboratory protocols can be completed in 4 days, as follows:

Day   Stage                                                 Total working time
1     Basic Protocol 1 and Steps 1–16 of Basic Protocol 2   8 h
2     Steps 17–49 of Basic Protocol 2                       5 h
3     Basic Protocol 3                                      7.5 h
4     Basic Protocol 4                                      1.5 h

These protocols contain five points at which work may be stopped until the next day:

• Overnight ligation
• Ethanol precipitation (−20 °C)
• PCR product before gel purification (store at 4 °C)
• Gel slurries (store at 4 °C)
• Purified PCR product (store at −20 °C)

13.4 Critical Parameters and Troubleshooting

The following precautions and techniques are critical to the success of the Tag-seq protocols. To prevent RNA degradation, add RNase inhibitor to the RNA sample and to all reactions involving RNA. To help avoid cross-contamination and reagent degradation, reagents should be stored in aliquots. Do not reuse aliquots. Pipette tips should be stuffed and never reused. The enzyme NlaIII is extremely temperature-sensitive. Retrieve an aliquot of NlaIII on ice after all other reagents have been added to the reaction. To ensure adequate mixing of beads during incubation, use a Thermomixer. Vortex beads before aliquotting, but do not vortex beads during library construction. Mix by flicking followed by a gentle microcentrifugation to bring the suspension to the bottom of the tube. To avoid losing beads, aspirate from top to bottom slowly, ensuring that bubbles are removed before aspirating the wash solution. For the final wash, use a 200-µl pipette tip to aspirate the remainder of the solution before proceeding to the next step. For better visualization, tilt the magnetic bar. Note that after MmeI digestion, the tags are present in the supernatant, not on the beads. Make sure to proceed using the supernatant. To dissolve sodium dodecylsulfate (SDS) precipitates, warm the lysis buffer and Wash Buffer C to 37 °C. If any precipitate is observed during the wash steps, rewarm. Wash Buffer C must be clear in order to inactivate and wash away the Escherichia coli DNA polymerase; otherwise, downstream reactions may be affected. To ensure adequate resolution of the final amplified library, run the PAGE gel for 2.5 h at 250 V with cold-water circulation. Cut out a tight 85-bp band from the best PCR cycle and take care to avoid the linker band. Always include a no-template control in the PCR to monitor for PCR contamination. PCR cycles for Tag-seqLite can be 11 and 13 cycles, due to the additional amplification step in the SMART cDNA protocol. For all PCRs in Tag-seq and Tag-seqLite, use only 2.5 µl of template per reaction; save the remainder of the template in case of PCR failure. Purified 85-bp PCR products are run on the Agilent DNA1000 chip (Figure 13.2). The Agilent instrument usually calls the product size around 11 bp larger, so expect to see a 96-bp peak. See Table 13.1 for a summary of some potential problems in library construction and their possible causes and solutions.

Fig. 13.2 Purified 85-bp PCR product on the Agilent DNA1000 chip is seen as a 96-bp peak.

Table 13.1 Troubleshooting for Tag-seq library construction.

Problem: No 85-bp PCR product and no linker band, but visible primer dimers
    Possible cause: inactive ligase. Solution: replace ligase.
    Possible cause: inactive PCR reagents. Solution: replace PCR reagents and set up PCR again, along with a positive control.

Problem: No 85-bp PCR product, but visible linker band and primer dimers
    Possible cause: inactive reverse transcriptase. Solution: replace reverse transcriptase.
    Possible cause: inactive MmeI. Solution: replace MmeI.

Problem: Visible 85-bp PCR product in no-template control PCR product
    Possible cause: PCR reagent contamination. Solution: replace PCR reagents.

Problem: Smeary PCR product
    Possible cause: no cold water circulation; gel is too warm. Solution: check cold water circulation.

13.5 Methods and Protocols

13.5.1 Basic Protocol 1: First- and Second-Strand cDNA Synthesis for Tag-Seq Library Construction

This protocol provides details for mRNA capture on oligo(dT) beads, and first- and second-strand cDNA synthesis using an oligo(dT) primer in a pre-PCR area. The RNA sample is typically isolated using TRIzol or column-based techniques (e.g., Qiagen AllPrep Mini Kit or Ambion RiboPure Kit), followed by DNase I treatment to obtain DNA-free RNA starting material. RNA integrity is then assessed using an Agilent Bioanalyzer RNA 6000 Nano Chip, according to the manufacturer’s directions. The resulting RNA integrity number (RIN) is used to help establish an RNA quality standard. This protocol is designed for 500–2000 ng of total RNA of RIN ≥ 7.

Note: When working with RNA, precautions must be taken to avoid RNA degradation resulting from RNase contamination of reagents and materials. The use of nuclease-free (e.g., diethylpyrocarbonate (DEPC)-treated) water and of Neptune Barrier Tips (CLP; www.clpdirect.com) for pipetting is recommended, as well as RNaseZap (Ambion) for RNase control.

Materials

• RNaseZap (Ambion)
• DEPC-treated H2O
• Total RNA sample: typically isolated using TRIzol (Invitrogen), AllPrep Mini Kit (Qiagen), or RiboPure Kit (Ambion), and DNase I-treated
• Oligo(dT) magnetic beads (Invitrogen)
• Lysis/Binding Buffer (from I-SAGE Long Kit; Invitrogen; cat. no. T5000-03)
• Wash Buffer B (from I-SAGE Long Kit; Invitrogen; cat. no. T5000-03)
• 1× First-Strand Buffer (Invitrogen)
• 5× First-Strand Buffer (Invitrogen)
• 4 U/µl RNaseOUT (Invitrogen)
• 0.1 M Dithiothreitol (DTT) (Invitrogen)
• 5 M Betaine (Sigma), prepared using nuclease-free H2O
• 10 mM dNTP mix (10 mM each dNTP; Invitrogen)
• 200 U/µl SuperScript II reverse transcriptase (Invitrogen)
• 5× Second-Strand Buffer (Invitrogen)
• 10 U/µl E. coli DNA ligase (Invitrogen)
• 10 U/µl E. coli DNA polymerase (Invitrogen)
• 2 U/µl E. coli RNase H (Invitrogen)
• Wash Buffer C (from I-SAGE Long Kit; Invitrogen; cat. no. T5000-03)
• 0.5 M EDTA (Invitrogen)
• Wash Buffer D (from I-SAGE Long Kit; Invitrogen; cat. no. T5000-03)
• 10× Buffer 4 (NEB)
• Textured nitrile gloves (Fisher)
• Bench Coat (bench protection paper; Fisher)
• RNase-free 1.5-ml nonstick (siliconized) microcentrifuge tubes (Ambion)
• Magnetic stand (Invitrogen; cat. no. R670-01)
• 50-ml Conical polypropylene tubes (BD Falcon)
• Clay Adams Nutator Shaker (VWR Scientific)
• Thermomixers, 1.5 ml (Eppendorf)

Reagents and Solutions

Use nuclease-free (e.g., DEPC-treated) water in all recipes and protocol steps.

• B&W buffer, 2× and 1×
    2× B&W buffer:
        10 mM Tris–Cl, pH 7.5
        1.0 mM EDTA
        2.0 M NaCl
        Store up to 12 months at room temperature.
    1× B&W buffer:
        Dilute 2× B&W buffer 1 : 1 with water.
        Store up to 12 months at room temperature.

• Bromphenol blue/xylene cyanol loading dye, 10×
    To a sterile 50-ml conical polypropylene tube (e.g., BD Falcon) add the following:
        12.5 g Ficoll
        0.21 g Bromphenol blue
        0.21 g Xylene cyanol
    Adjust the volume to 50 ml with nuclease-free dH2O. Dissolve by incubating at 37 °C for a few hours, with occasional vortexing. Once completely dissolved, bring the volume back to 50 ml with nuclease-free dH2O. Mix well and store at 4 °C.

Protocol

Retrieve reagents and prepare equipment

1.1 Put on a clean pair of nitrile gloves and lab coat. Wipe down workbench, small equipment, and ice bucket with RNaseZap and DEPC-treated water. Lay down a new bench pad (Bench Coat bench protection paper).
1.2 Change gloves. Retrieve fresh ice and all required reagents. Retrieve RNA sample.
1.3 Thaw all reagents listed in Steps 1.4–1.15, vortex, and microcentrifuge briefly to bring the solutions to the bottoms of the tubes. Check that the Lysis/Binding Buffer is clear before use and warm the buffer to 37 °C to dissolve any precipitate.

Purify mRNA using oligo(dT) beads

1.4 Dilute 500 ng to 2 µg of total RNA (DNase I-treated) with DEPC-treated water to reach a final volume of 50 µl.
1.5 Heat the RNA sample at 65 °C for 5 min to disrupt the secondary structures, then place on ice.
1.6 Thoroughly resuspend the oligo(dT) beads by slow vortexing and transfer 100 µl of the suspension to an RNase-free 1.5-ml siliconized (nonstick) tube.
1.7 Place the tube on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
1.8 Immediately add 100 µl of Lysis/Binding Buffer to the beads on the magnetic stand. Remove the tube and finger flick the tube to mix, then microcentrifuge briefly and place the tube back on the magnetic stand. Remove the supernatant and repeat once more.
1.9 Resuspend the beads in 50 µl of Lysis/Binding Buffer.
1.10 Immediately add the entire 50-µl total RNA sample (from Step 1.5) to the beads that have been equilibrated with Lysis/Binding Buffer.
1.11 Mix the sample well by flicking the tube and microcentrifuge briefly to bring the solution to the bottom of the tube. Firmly close and place the 1.5-ml tube in a 50-ml conical tube stuffed with a Kimwipe.
1.12 Place the 50-ml conical tube on a rocking platform (Nutator) for 5 min at room temperature.
1.13 Place the tube on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
1.14 Immediately wash the beads on the magnetic stand twice, each time with 200 µl of Wash Buffer B.
1.15 Wash the beads four times on the magnetic stand, each time with 100 µl of 1× First-Strand Buffer. On the fourth wash, do not remove the supernatant.

Synthesize first-strand cDNA

1.16 Put on a clean pair of gloves and lab coat. Retrieve all reagents listed in Step 1.18. Thaw all reagents except RNaseOUT, vortex, and microcentrifuge briefly to bring the reagents to the bottoms of the tubes.
1.17 Set a Thermomixer to 42 °C. Set mixing for 30 s, 1400 rpm, at 10-min intervals.
1.18 Prepare the first-strand reaction in a new siliconized (nonstick) RNase-free 1.5-ml tube on ice (total volume, 38 µl).

    5× First-Strand Buffer    8.0 µl
    4 U/µl RNaseOUT           0.5 µl
    DEPC-treated H2O          20 µl
    0.1 M DTT                 4.5 µl
    10 mM dNTP mix            2.0 µl
    5 M Betaine               3.0 µl

    (Note: Volumes listed are for one library.)
1.19 Carefully remove the fourth wash (Step 1.15) and immediately add the 38 µl of first-strand cDNA synthesis mix to the beads.
1.20 Mix gently by flicking the tube without causing the beads to splash on the inner walls or lid, and microcentrifuge briefly to bring the solution to the bottom of the tube.
1.21 Place the tube at 42 °C for 2 min to equilibrate the reagents.
1.22 Add 2 µl of SuperScript II (200 U/µl) reverse transcriptase.
1.23 Mix gently by flicking the tube without causing the beads to splash on the inner walls or lid. Microcentrifuge briefly to bring the solution to the bottom of the tube.
1.24 Incubate on the Thermomixer for 1 h with settings as in Step 1.17.

Synthesize second-strand cDNA

1.25 Dispose of all waste and partially used reagent aliquots from previous steps.
1.26 Put on a clean pair of gloves and lab coat. Retrieve all required reagents listed in Step 1.29. Thaw all reagents except the three enzymes, vortex, and microcentrifuge briefly to bring solutions to the bottoms of the tubes.

j

217

218

j

13 Tag-Seq: Next-Generation Tag Sequencing for Gene Expression Profiling

1.27 Set one Thermomixer to 72 °C and another one to 16 °C. Set mixing for 30 s, 1400 rpm, at 10-min intervals.
1.28 When the first-strand synthesis is complete (Step 1.24), heat the sample at 72 °C for 7 min in the Thermomixer set up in Step 1.27, then place on ice while setting up the second-strand synthesis mix.
1.29 Begin preparing the second-strand reaction in a new siliconized (nonstick) RNase-free 1.5-ml tube on ice (total volume, 115 µl):

DEPC-treated H2O                  73.0 µl
5× Second-Strand Buffer           31.0 µl
10 mM dNTP mix                    3.75 µl
10 U/µl E. coli DNA ligase        1.0 µl
10 U/µl E. coli DNA polymerase    5.0 µl
2 U/µl RNase H                    1.25 µl

(Note: Add the E. coli DNA ligase, E. coli DNA polymerase, and E. coli RNase H after the other reagents. Volumes listed are for one library.)
1.30 Add the 115 µl of second-strand reaction mix directly into the tube containing the 40-µl first-strand reaction.
1.31 Mix gently by flicking the tube without causing the beads to splash on the inner walls or lid. Microcentrifuge briefly to bring the solution to the bottom of the tube.
1.32 Incubate at 16 °C for 2.5 h in the Thermomixer set up in Step 1.27.

Prepare materials for DNA purification
1.33 Dispose of all waste and partially used reagent aliquots from previous steps. Put on a clean pair of gloves and lab coat.
1.34 Set a Thermomixer to 75 °C, set mixing for 30 s, 1400 rpm, at 2-min intervals.
1.35 Retrieve all reagents required in Steps 1.36–1.49. Warm all reagents to room temperature, vortex, and microcentrifuge briefly to bring the solutions to the bottoms of the tubes.
1.36 Preheat 800 µl of Wash Buffer C to 75 °C in a sterile microcentrifuge tube. Mix Wash Buffer C before adding to sample to dissolve any remaining precipitate.

Purify cDNA
1.37 After the second-strand incubation (Step 1.32), place the reaction tube on ice for 2 min, then add 22.5 µl of 0.5 M EDTA to stop the reaction.
1.38 Place the tube on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
1.39 Immediately add 375 µl of warm Wash Buffer C (from Step 1.36) to the beads. Mix well.
1.40 Incubate the beads at 75 °C for 15 min in the Thermomixer that was set up in Step 1.34.
1.41 Place the tube on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
1.42 Immediately wash again on the magnetic stand with 375 µl of warm Wash Buffer C. Perform wash steps quickly to prevent precipitation of SDS.
1.43 Wash 4 times on the magnetic stand with 375 µl of room temperature Wash Buffer D. (Note: Make sure Wash Buffer D is at room temperature to avoid clumping of beads. If clumping of the beads occurs, perform additional washes.)
1.44 Place the tube on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
1.45 Immediately add 100 µl of 1× Buffer 4 to the beads and gently resuspend.
1.46 Transfer the contents to a new 1.5-ml siliconized (nonstick) RNase-free tube. If the beads stick to the sides of the tube, gently scrape them off using a pipette tip.

13.5 Methods and Protocols

1.47 Wash the old tube once with 100 µl of 1× Buffer 4 and transfer the contents to the new tube.
1.48 Place the tube on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
1.49 Immediately add 88 µl of nuclease-free water to the beads and gently resuspend. Proceed to Basic Protocol 2.
Note: The preceding steps of Basic Protocol 1, plus those of Basic Protocol 2 up to Step 2.16, comprise the Working Day 1 procedures.

13.5.2 Basic Protocol 2: Tag Generation

This section describes the generation of 3′ sequence tags from the cDNA generated in Basic Protocol 1. Sequence tags are generated by two rounds of enzymatic digestion followed by adapter ligation.

Materials
• 100× Bovine serum albumin (BSA; NEB)
• 10× Buffer 4 (NEB)
• 10 U/µl NlaIII restriction endonuclease (NEB)
• cDNA-containing magnetic beads (Basic Protocol 1, Step 1.49)
• Wash Buffer C (from I-SAGE Long Kit; Invitrogen; cat. no. T5000-03)
• Wash Buffer D (from I-SAGE Long Kit; Invitrogen; cat. no. T5000-03)
• 10× Ligase buffer (from I-SAGE Long Kit; Invitrogen; cat. no. T5000-03)
• DEPC-treated H2O
• 10 µM GEX Adapter 1 (from Illumina Tag Profiling Sample Prep Kit; cat. no. FC102-1005)
• 5 U/µl T4 DNA ligase (Invitrogen)
• DNA Away (Molecular BioProducts)
• 32 mM (800×) S-adenosylmethionine (SAM)
• 10× (and 1×) Buffer 4 (NEB)
• 2 U/µl MmeI restriction endonuclease (NEB)
• 1 U/µl Shrimp alkaline phosphatase (Invitrogen)
• 2-ml Phase Lock Gel tube (heavy; Fisher)
• Phenol/chloroform/isoamyl alcohol (IAA) (Fisher)
• 3 M Sodium acetate, pH 5.5 (Ambion)
• 20 mg/ml Mussel glycogen
• 100% and ice-cold 70% Ethanol
• 1.5 µM GEX Adapter 2 (from Illumina Tag Profiling Sample Prep Kit; cat. no. FC102-1005)
• Thermomixers, 1.5 ml (Eppendorf)
• Textured nitrile gloves (Fisher)
• RNase-free 1.5-ml nonstick (siliconized) microcentrifuge tubes (Ambion)
• 16 °C Water bath (Fisher Isotemp 3016)
• Magnetic stand (Invitrogen; cat. no. R670-01)

Protocol

Carry out NlaIII digestion
2.1 Dispose of all waste and partially used reagent aliquots from Basic Protocol 1.
2.2 Retrieve all required reagents listed in Step 2.4. Thaw all reagents except NlaIII, vortex, and microcentrifuge briefly to bring the solutions to the bottoms of the tubes.
2.3 Set a Thermomixer to 37 °C. Set mixing for 30 s, 1400 rpm, at 10-min intervals.


2.4 Keep the tube on ice and add the following reagents directly to the tube containing the 88 µl of cDNA magnetic bead suspension from Step 1.49 of Basic Protocol 1, in the order listed (100 µl total volume):

100× BSA            1.0 µl
10× Buffer 4        10.0 µl
10 U/µl NlaIII      1.0 µl

(Note: NlaIII is extremely temperature-sensitive. Bring out an aliquot of the enzyme on ice after all other reagents have been added.)
2.5 Mix gently by flicking the tube, without causing the beads to splash on the inner walls or lid, and then microcentrifuge briefly to bring the solution to the bottom of the tube.
2.6 Incubate at 37 °C for 1 h in the Thermomixer set up in Step 2.3.
2.7 Put on a clean pair of gloves and lab coat. Retrieve all required reagents. Thaw all reagents listed in Steps 2.9 and 2.10, vortex, then microcentrifuge briefly to bring the solutions to the bottoms of the tubes. Warm Wash Buffer C to 37 °C to dissolve any SDS precipitate. Mix Wash Buffer C to ensure all precipitates are dissolved. Turn on the 16 °C water bath.
2.8 After the 1-h incubation, place the tube with the beads (from Step 2.6) on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
2.9 Wash twice on the magnetic stand with 375 µl of warm Wash Buffer C.
2.10 Warm up Wash Buffer D to room temperature to prevent clumping of beads. Wash 4 times on the magnetic stand, each time with 375 µl Wash Buffer D. If clumping of the beads occurs, perform additional washes. Leave the beads in the final Wash Buffer D.

Ligate GEX Adapter 1
2.11 Place the tube on the magnetic stand for 2 min, then carefully remove and discard the supernatant.
2.12 Wash once on the magnetic stand with 100 µl of 1× ligase buffer (prepared from the 10× ligase buffer provided in the Invitrogen I-SAGE Long Kit). Add 100 µl of 1× ligase buffer to the beads, mix by flicking, microcentrifuge to bring the solution to the bottom of the tube, then transfer the beads to a new 1.5-ml nonstick microcentrifuge tube.
2.13 Place the tube on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
2.14 Immediately add the following to the beads (50 µl total volume):

DEPC-treated H2O        36 µl
10 µM GEX Adapter 1     3 µl
5× Ligase buffer        10 µl
5 U/µl T4 DNA ligase    1 µl

(Note: GEX Adapter 1 is ligated to the 5′ end of NlaIII-digested bead-bound cDNA fragments. The adapter contains an MmeI recognition site facilitating tag generation by MmeI digestion in Step 2.17.)
2.15 Mix gently by flicking the tube without causing the beads to splash on the inner walls or lid. Microcentrifuge briefly to bring the solution to the bottom of the tube.
2.16 Seal the lid with Parafilm and incubate the ligation overnight in the 16 °C water bath.
Note: This is the end of Working Day 1.
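The Working Day 1 reagent tables each state a total volume and note that the volumes are for one library, which makes them easy to check and to scale when several libraries are prepared in parallel. The following Python sketch sums the per-library volumes from Steps 1.18 and 1.29 and scales them. The 10% pipetting overage and the function names are assumptions made for illustration; the protocol itself does not specify an overage.

```python
# Per-library volumes in microlitres, transcribed from Steps 1.18 and 1.29.
FIRST_STRAND = {
    "5x First-Strand Buffer": 8.0,
    "RNaseOUT": 0.5,
    "DEPC-treated H2O": 20.0,
    "0.1 M DTT": 4.5,
    "10 mM dNTP mix": 2.0,
    "5 M Betaine": 3.0,
}
SECOND_STRAND = {
    "DEPC-treated H2O": 73.0,
    "5x Second-Strand Buffer": 31.0,
    "10 mM dNTP mix": 3.75,
    "E. coli DNA ligase": 1.0,
    "E. coli DNA polymerase": 5.0,
    "RNase H": 1.25,
}


def total_volume(mix):
    """Sum of all component volumes (ul) for one library."""
    return round(sum(mix.values()), 2)


def scale_mix(mix, n_libraries, overage=0.10):
    """Scale a per-library mix to n libraries.

    `overage` is a hypothetical allowance for pipetting loss,
    not a value taken from the protocol.
    """
    factor = n_libraries * (1.0 + overage)
    return {name: round(vol * factor, 2) for name, vol in mix.items()}


if __name__ == "__main__":
    print("First-strand total (ul):", total_volume(FIRST_STRAND))
    print("Second-strand total (ul):", total_volume(SECOND_STRAND))
    for name, vol in scale_mix(FIRST_STRAND, n_libraries=4).items():
        print(f"{name:28s}{vol:8.2f} ul")
```

The two totals reproduce the 38 µl and 115 µl stated in Steps 1.18 and 1.29. When scaling the second-strand mix, the note in Step 1.29 still applies: the three enzymes are added after the other reagents.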


Prepare reagents and equipment for MmeI digestion
2.17 Put on a clean pair of gloves and lab coat. Wipe down the workbench, small equipment, and ice bucket with DNA Away. Lay down new bench coat.
2.18 Change gloves. Retrieve fresh ice and all required reagents in Steps 2.20–2.27.
2.19 Set a Thermomixer to 37 °C, set mixing for 30 s, 1400 rpm, at 10-min intervals.

Prepare 10× SAM
2.20 Freshly prepare 10× SAM by diluting the 800× (32 mM) stock (10 µl of 10× SAM is required per library) as described in the following steps. (Note: Discard any excess solution after use.)
2.21 First, dilute the 800× stock to 200×:

32 mM (800×) SAM      2 µl
DEPC-treated H2O      6 µl

2.22 Second, aliquot 1.5 µl of the 200× SAM and add 28.5 µl DEPC-treated water to make 10× SAM.

Prepare 1× Buffer 4/1× SAM
2.23 Prepare a fresh 1× Buffer 4/1× SAM solution using the remaining 200× SAM (total volume, 1000 µl; 500 µl of this solution is required per library):

200× SAM (from Step 2.21)      5 µl
1× Buffer 4                    995 µl
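The SAM dilution series in Steps 2.21 to 2.23 can be checked with a few lines of arithmetic. The Python sketch below follows the fold-concentration through each step, taking the 32 mM stock as 800× as the protocol labels it; the helper name `dilute` is hypothetical.

```python
def dilute(fold, stock_ul, diluent_ul):
    """Fold-concentration after mixing `stock_ul` of stock with `diluent_ul` of diluent."""
    return fold * stock_ul / (stock_ul + diluent_ul)


STOCK_FOLD = 800.0   # the protocol labels the 32 mM stock as 800x
STOCK_MM = 32.0

# Step 2.21: 2 ul stock + 6 ul water
sam_200x = dilute(STOCK_FOLD, 2.0, 6.0)
# Step 2.22: 1.5 ul of the 200x + 28.5 ul water
sam_10x = dilute(sam_200x, 1.5, 28.5)
# Step 2.23: 5 ul of the 200x + 995 ul of 1x Buffer 4
sam_wash = dilute(sam_200x, 5.0, 995.0)
# Step 2.27: 10 ul of the 10x in a 100-ul digestion mix
sam_digest = dilute(sam_10x, 10.0, 90.0)

# Volume of 200x SAM left after Steps 2.22 and 2.23 have drawn from the 8 ul made.
leftover_200x_ul = (2.0 + 6.0) - 1.5 - 5.0

# Molar concentration at 1x implied by the 800x label on the 32 mM stock.
sam_1x_uM = STOCK_MM / STOCK_FOLD * 1000.0

if __name__ == "__main__":
    print(sam_200x, sam_10x, sam_wash, sam_digest)
    print("200x SAM left over (ul):", leftover_200x_ul)
    print("SAM at 1x (uM):", sam_1x_uM)
```

The 8 µl of 200× SAM made in Step 2.21 covers both the 1.5 µl used in Step 2.22 and the 5 µl used in Step 2.23, with 1.5 µl to spare, so the series as written is sufficient for one round of preparation.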

Carry out MmeI digestion
2.24 Place the sample tube containing the beads (from Step 2.16) on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
2.25 Wash the beads four times on the magnetic stand with 250 µl of Wash Buffer D.
2.26 Wash the beads twice on the magnetic stand with 250 µl of 1× Buffer 4/1× SAM (from Step 2.23). Remove the supernatant.
2.27 Depending on the number of libraries to be constructed, prepare the following digestion mix in a new microcentrifuge tube on ice (total volume, 100 µl):

DEPC-treated H2O              76 µl
10× Buffer 4                  10 µl
10× SAM (from Step 2.22)      10 µl
2 U/µl MmeI                   4 µl

(Note: Volumes listed are for one library.)
2.28 Add 100 µl of this digestion mix to the beads. Mix gently by flicking the tube without causing the beads to splash on the inner walls or lid, then microcentrifuge briefly to bring the solution to the bottom of the tube.
2.29 Incubate the tubes at 37 °C for 1.5 h on a Thermomixer set to mix for 30 s, at 1400 rpm, at 10-min intervals.

Purify and dephosphorylate MmeI tags
2.30 After 1.5 h, place the sample tube on a magnetic stand for 2 min. (Important: Do not discard the supernatant. The tags are present in the supernatant after the MmeI digestion.)
2.31 Carefully transfer the supernatant to a new 1.5-ml tube. Retain the tube containing the beads.
2.32 Wash the tube containing the beads on the magnetic stand with 50 µl of 1× Buffer 4. Transfer the supernatant to the new tube to yield a total volume of 150 µl. Discard the beads.


2.33 Add 2 µl of 1 U/µl shrimp alkaline phosphatase to the 150-µl sample from Step 2.32 and incubate at 37 °C for 1 h.
2.34 Microcentrifuge a 2-ml Phase Lock Gel tube 1 min at maximum speed (14 000 rpm), room temperature.
2.35 Add 50 µl of DEPC-treated water to the sample from Step 2.33 and transfer the 200 µl to the prespun 2-ml Phase Lock Gel tube from Step 2.34.
2.36 Add an equal volume (200 µl) of phenol/chloroform/IAA to the supernatant. Mix well by hand.
2.37 Microcentrifuge 5 min at maximum speed (14 000 rpm), room temperature. Note: Nonsiliconized tubes can be used from this point on.
2.38 Transfer the aqueous (top) phase to a new 1.5-ml tube and add the following precipitation reagents to the tube (total volume 872 µl including the aqueous phase itself):

3 M Sodium acetate, pH 5.5      20 µl
20 mg/ml Mussel glycogen        2 µl
100% Ethanol                    650 µl

2.39 Vortex vigorously, then microcentrifuge briefly to bring the solution to the bottom of the tube.
2.40 Chill the tube at −20 °C for 5 min (do not use a −20 °C location where reagents are stored).
2.41 Microcentrifuge 30 min at maximum speed, 4 °C. Turn on the 16 °C water bath.
2.42 Carefully decant the supernatant into a fresh microcentrifuge tube. Keep track of the pellet so that it does not slide out of the tube.
2.43 Wash the pellet 3 times, each time with 1 ml cold 70% ethanol, microcentrifuging 5 min at maximum speed, 4 °C, between washes. Carefully decant the supernatant into a new microcentrifuge tube. Keep track of the pellet so it does not slide out.
2.44 After removing the final wash, dab the tube rims on a Kimwipe to remove ethanol. Microcentrifuge briefly to collect the residual ethanol at the bottom of the tube, and carefully remove using a pipettor and 10-µl tip.
2.45 Mark the outside bottom of the tube to better locate the pellet when resuspending in Step 2.47.
2.46 Lay the open tube on its side and air-dry the pellet for 5 to 15 min, or until the pellet is translucent. Do not over-dry the pellet.
2.47 Resuspend the pellet in 6 µl of DEPC-treated water. Let the tube sit closed and upright for 10 min at room temperature, and then aid dissolving by pipetting up and down using a pipettor and 10-µl tip.

Ligate GEX Adapter 2
2.48 Set up the following adapter ligation on ice directly into the tube from Step 2.47 containing the 6 µl of DNA solution (10 µl total volume):

1.5 µM GEX Adapter 2      1 µl
5× Ligase buffer          2 µl
5 U/µl T4 DNA ligase      1 µl

2.49 Seal the lid of the tube with Parafilm and incubate the ligation overnight in a 16 °C water bath, then proceed with Basic Protocol 3.
Note: This is the end of Working Day 2.
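The two digestions in Basic Protocol 2 can be mimicked in silico to predict which tag a given transcript should yield. Because the cDNA stays bound to the oligo(dT) beads through the NlaIII digestion and washes, only the fragment downstream of the 3′-most NlaIII site (CATG) is retained, and MmeI then cuts a fixed distance into that fragment. The Python sketch below is an illustration under stated assumptions: the tag is taken as CATG plus 17 downstream bases (a length that is not stated in this section), and the function name and example sequence are hypothetical.

```python
from typing import Optional

NLAIII_SITE = "CATG"
TAG_EXTENSION = 17  # assumed number of bases kept downstream of CATG


def extract_tag(transcript: str, extension: int = TAG_EXTENSION) -> Optional[str]:
    """Return the predicted tag for a transcript written 5'->3', or None.

    The tag starts at the 3'-most CATG, since only that fragment would remain
    attached to the oligo(dT) beads after NlaIII digestion.
    """
    seq = transcript.upper()
    pos = seq.rfind(NLAIII_SITE)
    if pos == -1:
        return None  # no NlaIII site: no tag is expected
    tag = seq[pos:pos + len(NLAIII_SITE) + extension]
    if len(tag) < len(NLAIII_SITE) + extension:
        return None  # site too close to the 3' end for a full-length tag
    return tag


if __name__ == "__main__":
    # Made-up sequence with two CATG sites and a poly(A) tail.
    example = ("GGG" + "CATG" + "AAACCCGGGTTTACGTACG"
               + "CATG" + "TTTTCCCCGGGGAAAATTTTCC" + "A" * 20)
    print(extract_tag(example))
```

Only the downstream site contributes a tag in this model, which is why transcripts lacking a CATG site would not be represented in the library.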

13.5.3 Basic Protocol 3: PCR and Fragment Isolation

This section describes the amplification and purification of 3′ cDNA tags generated in Basic Protocol 2, utilizing PCR primers that provide sequences to enable cluster generation and sequencing on the Illumina Genome Analyzer.

Materials
• RNaseZap (Ambion)
• Ultra-pure water (Invitrogen)
• 5× HF buffer (Finnzymes)
• Dimethylsulfoxide (DMSO) (Invitrogen)
• 10 mM dNTP mix (10 mM each dNTP; Invitrogen)
• GEX PCR Primers 1 and 2 (from Illumina Tag Profiling Sample Prep Kit; cat. no. FC-102-1005)
• 2 U/µl Phusion Hot Start DNA polymerase (Finnzymes)
• 40% Acrylamide (19:1 acrylamide/bis; Bio-Rad)
• 50× TAE buffer
• 10% (w/v) Ammonium persulfate (Bio-Rad; prepare immediately before use)
• TEMED (Bio-Rad)
• Micro-90 cleaning solution (Cole-Parmer)
• 10× Bromphenol blue/xylene cyanol loading dye (see recipe above)
• 25-bp DNA ladder (20 ng/µl; Invitrogen) in loading dye (5:1)
• SYBR Green (Cambrex Bio Science)
• Elution buffer: 5 parts low-TE buffer (10 mM Tris–Cl, pH 7.4 and 1 mM EDTA) plus 1 part 7.5 M ammonium acetate
• 3 M Sodium acetate, pH 5.5 (Ambion)
• 20 mg/ml Mussel glycogen (Roche Scientific)
• 70 and 100% Ethanol (anhydrous ethyl alcohol; Commercial Alcohol, www.comalc.com)
• EB buffer (from Qiagen PCR Purification Kit)
• Textured nitrile gloves (Fisher)
• RNase-free 1.5-, 0.5-, and 2-ml microcentrifuge tubes (Ambion)
• 0.2-ml Thin-walled, RNase-free PCR tubes (Ambion)
• Peltier Thermal Cycler (MJ Research)
• Glass plates for PAGE gels (Owl Scientific)
• Bags for gel pouring (Fisher Scientific)
• Casting tray (Owl Scientific)
• Colored tape (assorted; VWR Scientific)
• Combs (15-well, 1.5-mm; Owl Scientific; cat. no. P1-15D)
• Spacers (1.5-mm; Owl Scientific; cat. no. P1-SD)
• Gel pouches (Owl Scientific; cat. no. GP2-25)
• 50-ml Conical polypropylene tubes (e.g., BD Falcon)
• Penguin Owl Electrophoresis System (Owl Scientific)
• Power supply (LVC2kW, 48VDCV; Tyco Electronics)
• 18G needle
• Typhoon gel scanner (GE Healthcare)
• Dark Reader (UV trans-illuminator) (InterScience; www.interscience.com)
• Bench Coat (bench protection paper; Fisher)
• Heating block
• Spin-X columns (0.22-µm; Costar)

Protocol

Retrieve reagents and prepare equipment
3.1 Put on a clean pair of gloves and lab coat. Wipe down the workbench, small equipment, and ice bucket with RNaseZap.


3.2 Change gloves. Retrieve fresh ice and all required reagents listed in Step 3.3. Thaw all reagents except the enzyme, vortex, then microcentrifuge briefly to bring the solutions to the bottoms of the tubes.

Set up PCR
3.3 In a PCR set-up hood, for each reaction to be performed, prepare the following PCR brew mix in a 1.5-ml tube, on ice (total volume, 22.5 µl per reaction). Make enough brew for three reactions: two sample reactions run for different numbers of PCR cycles (specified in Step 3.9) and a brew-only negative control (run for 15 cycles):

Ultra-pure H2O                                       15.37 µl
5× Phusion HF buffer                                 5.0 µl
10 mM dNTP mix                                       0.63 µl
DMSO                                                 0.75 µl
25 µM GEX PCR Primer 1                               0.25 µl
25 µM GEX PCR Primer 2                               0.25 µl
2 U/µl Phusion Hot Start DNA polymerase (NEB)        0.25 µl

3.4 Label individual 0.2-ml thin-walled PCR tubes. Using a new pipette tip, aliquot 22.5 µl of PCR brew into each tube.
3.5 Take PCR tubes, along with the DNA sample that has been incubating at 16 °C overnight (overnight ligation mixture, from Step 2.49 in Basic Protocol 2), to the library construction area PCR hood, on ice.
3.6 Put on a clean pair of gloves and disposable lab coat. Retrieve all reagents. Turn on the PCR machine.
3.7 Transfer 2.5 µl of the overnight ligation mixture into the 0.2-ml PCR tube reserved for the sample and 2.5 µl of water into the brew-only control tube, for a total volume of 25 µl/tube. Vortex, then microcentrifuge briefly to bring the solutions to the bottoms of the tubes.
3.8 Store the remainder of the sample in the −20 °C freezer.

Perform PCR
3.9 Load PCR tubes in the PCR machine (thermal cycler), ensuring that the lid is properly in place. For each sample, run one replicate at 13 cycles and the other at 15 cycles. For the negative PCR control, run 15 cycles. Start the two different PCR programs.

Thermal cycling program (13 or 15 cycles):

30 s            98 °C (initial denaturation)
10 s            98 °C (denaturation)
30 s            60 °C (annealing)
15 s            72 °C (extension)
10 min          72 °C (final extension)
indefinite      4 °C (hold)
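As a cross-check on the PCR brew in Step 3.3, the final concentration of each component in the 25-µl reaction follows from C1·V1 = C2·V2. The Python sketch below assumes the primer stocks are 25 µM and the polymerase stock is 2 U/µl (the unit glyphs are damaged in some copies of this text); the variable and function names are hypothetical.

```python
REACTION_UL = 25.0  # 22.5 ul brew + 2.5 ul ligation mixture or water (Step 3.7)

# Per-reaction brew volumes in microlitres, from Step 3.3.
BREW = {
    "Ultra-pure H2O": 15.37,
    "5x Phusion HF buffer": 5.0,
    "10 mM dNTP mix": 0.63,
    "DMSO": 0.75,
    "GEX PCR Primer 1": 0.25,
    "GEX PCR Primer 2": 0.25,
    "Phusion Hot Start DNA polymerase": 0.25,
}


def final_concentration(stock, volume_ul, total_ul=REACTION_UL):
    """C2 = C1 * V1 / V2, in whatever unit `stock` is given in."""
    return stock * volume_ul / total_ul


brew_total_ul = round(sum(BREW.values()), 2)
primer_uM = final_concentration(25.0, BREW["GEX PCR Primer 1"])   # assumes 25 uM stock
dntp_mM = final_concentration(10.0, BREW["10 mM dNTP mix"])       # mM of each dNTP
buffer_fold = final_concentration(5.0, BREW["5x Phusion HF buffer"])
dmso_percent = final_concentration(100.0, BREW["DMSO"])           # % (v/v)
polymerase_units = 2.0 * BREW["Phusion Hot Start DNA polymerase"] # assumes 2 U/ul stock

# Brew needed for the three reactions described in Step 3.3 (no overage added).
brew_for_three = {name: round(vol * 3, 2) for name, vol in BREW.items()}

if __name__ == "__main__":
    print("Brew per reaction (ul):", brew_total_ul)
    print("Primer (uM):", primer_uM, "| dNTP (mM each):", dntp_mM)
    print("Buffer (x):", buffer_fold, "| DMSO (%):", dmso_percent)
    print("Polymerase (U/reaction):", polymerase_units)
```

Under these assumptions each primer ends up at 0.25 µM, each dNTP at about 0.25 mM, the HF buffer at 1×, DMSO at 3% (v/v), and the polymerase at 0.5 U per reaction.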

Analyze PCR product by PAGE
3.10 During the PCR reaction, prepare one 12% polyacrylamide gel in TAE buffer for each library:

Autoclaved dH2O                                  23.5 ml
40% Acrylamide (19:1 acrylamide:bis)             10 ml
50× TAE buffer                                   700 µl
10% (w/v) Ammonium persulfate                    350 µl
TEMED (add immediately before pouring gel)       30 µl


Retrieve all equipment for gel pouring (for each gel: casting tray, notched glass plate, regular plate, comb, gel pouch, two spacers). Scrub plates with Micro-90 and wash with water. Then, wipe down the plates, comb, and spacers with 70% ethanol using Kimwipes. Make sure apparatus is completely dry and lint-free, or bubbles will form in the gel. Once the plates are dry, assemble with the spacers placed between the plates in the plastic gel pouch, which is then screwed into the casting tray (the gel pouch prevents leaking of the gel). Add all the reagents in the order listed above to a 50-ml conical polypropylene tube. Screw down the lid and invert several times quickly to mix. Tilt the casting tray at a slight angle and pour the mix in between the two glass plates. Lower the casting tray while pouring the gel to prevent bubbles from occurring (if bubbles form, they can be removed by tapping). Fill the tray to the top level of the notched plate and place the comb into the gel. Let it polymerize for at least 30 min.
3.11 Set up the gel in the PAGE apparatus, with cold water circulation. Mark buffer level.
3.12 Add 1/10 volume of 10× bromphenol blue/xylene cyanol loading dye to the sample tubes containing the PCR products and to the negative control tube, making the total volume 28 µl.
3.13 Label the tube containing gel slurry clearly.
3.14 Load 10 µl of the 25-bp DNA ladder (20 ng/µl) in the left terminal well of the gel.
3.15 Load all of the brew-only control into one lane of the gel. Load the entire PCR cycle product into one well for each sample. Leave at least eight empty well spaces between the sample well and the DNA ladders.
3.16 Run the gel at 250 V for 2.5 h with cold water circulation.
3.17 Check the gel every hour for buffer level in the upper chamber and ensure that the gel is running appropriately.
3.18 Prepare a tube for shearing the gel slices by making a hole through the bottom of a 0.5-ml microcentrifuge tube with an 18G needle and placing it on top of a 2-ml tube.
3.19 Prepare fresh SYBR Green DNA stain: 10 µl stock in 100 ml of 1× TAE. Minimize exposure to light.
3.20 Using a clean tray, stain the gel for 1 min in the SYBR Green stain.
3.21 Scan the gel and save image.

Excise amplified sequence tags
3.22 Lay the gel down on Dark Reader. Using a brand new razor blade, carefully cut out the 85-bp band from the best PCR cycle image on the gel. Avoid the ladder and all other bands. Transfer the gel slice into the perforated 0.5-ml tube prepared in Step 3.18.
3.23 With the lids tailing (left of tube position), microcentrifuge 3 min at 12 000 rpm, 4 °C. (Note: The gel slices should shear through the holes and collect at the bottom of the 2-ml tubes.)
3.24 To the gel slice that was sheared into the 2-ml tubes, add 200 µl of elution buffer. Ensure that all sheared gel pieces are covered with elution buffer. Add more buffer if needed.
3.25 Mix well by vortexing. Microcentrifuge briefly to bring the solution to the bottom of the tube.
3.26 Elute gel slurries by incubating at 65 °C for 1 h in heating block.
3.27 Put on a clean pair of gloves and disposable lab coat.
3.28 Retrieve all reagents listed in Step 3.35 and thaw in an ice bucket. Vortex and microcentrifuge briefly to bring the solutions to the bottom of the tube.
3.29 Lay down new bench pad (Bench Coat bench protection paper).

Precipitate amplified sequence tags
3.30 Retrieve the gel slurry from the 65 °C heating block (see Step 3.26). Vortex and microcentrifuge briefly to bring the solution to the bottom of the tube.


3.31 Transfer the contents of each tube into a Spin-X filter spin column tube. Tap the slurry into the column. Microcentrifuge 5 min at 12 000 rpm, room temperature.
3.32 Check each spin column tube and ensure that the entire buffer has spun through the filter. Respin the tubes if there is still liquid trapped in the gel material.
3.33 Remove and discard the filter column containing the gel material.
3.34 Transfer the expected 200 µl of eluate to a sterile 1.5-ml microcentrifuge tube and add the following (total 872 µl including the eluate):

3 M Sodium acetate              20 µl
20 mg/ml Mussel glycogen        2 µl
100% Ethanol                    650 µl

3.35 Vortex, then microcentrifuge briefly to bring the solution to the bottom of the tube. Chill the tube at −20 °C for at least 5 min.
3.36 Microcentrifuge at least 30 min at 14 000 rpm, 4 °C.
3.37 Carefully decant the supernatant into a fresh microcentrifuge tube. Keep an eye on the pellet so that it does not slide out.
3.38 Wash the pellet two times with 1 ml cold 70% ethanol. Microcentrifuge 3 min at 14 000 rpm, 4 °C between washes. Carefully decant the supernatant into a new microcentrifuge tube. Keep an eye on the pellet so that it does not slide out.
3.39 After removing the final wash, dab the tube rim on a Kimwipe to remove excess ethanol. Microcentrifuge the tube briefly to bring any liquid to the bottom of the tube and carefully remove any residual ethanol using a pipettor with a 10-µl tip.
3.40 Mark the outside bottom of the tube to better locate the pellet when resuspending in EB buffer at Step 3.42.
3.41 Lay the open tube on its side and air-dry the pellet. Do not over-dry the pellet.
3.42 Resuspend the pellet in 10 µl EB buffer (from Qiagen PCR Purification Kit). Let the tube sit closed and upright for 10 min at room temperature, and then aid dissolving by pipetting up and down using a pipettor with a 10-µl tip.
3.43 Store the PCR product (library) in the −20 °C freezer. Proceed with Basic Protocol 4 to prepare the library for Illumina Sequencing.
Note: This is the end of Working Day 3.

13.5.4 Basic Protocol 4: Preparing the Library for Illumina Sequencing

This section provides guidelines for confirming the size of the PCR-amplified 3′ sequence tags and accurately determining their concentration using the Agilent DNA 1000 Kit and Agilent Bioanalyzer.

Materials
• Sample: PCR product (library; see Basic Protocol 3, Step 3.43)
• Agilent DNA 1000 Kit (Agilent)
• EB buffer (from Qiagen PCR Purification Kit) supplemented with 0.1% (v/v) Tween-20
• Agilent 2100 Bioanalyzer (Agilent)
• Chip Priming Station (Agilent)
• IKA Vortex Mixer (Agilent)
• Illumina Genome Analyzer (Illumina)
• Cluster Station (Illumina)
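This protocol calls for diluting the library to 8 nM (Step 4.2), so a mass concentration has to be converted to molarity first if that is what the instrument report is read in. The Python sketch below does the conversion and works out the diluent volume. It assumes an average mass of 660 g/mol per base pair (a common approximation for double-stranded DNA) and uses the 85-bp product size from Step 3.22; the function names are hypothetical.

```python
AVG_BP_MASS = 660.0   # g/mol per base pair, approximate value for dsDNA
TARGET_NM = 8.0       # target concentration from Step 4.2
MIN_SAMPLE_UL = 2.0   # minimum sample volume to dilute, from Step 4.2


def ng_per_ul_to_nM(conc_ng_per_ul, length_bp):
    """Convert a dsDNA mass concentration (ng/ul) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def dilution_plan(stock_nM, target_nM=TARGET_NM, sample_ul=MIN_SAMPLE_UL):
    """Return (sample_ul, diluent_ul) needed to reach target_nM.

    If the sample is already at or below the target, no diluent is added,
    matching the note in Step 4.2.
    """
    if stock_nM <= target_nM:
        return sample_ul, 0.0
    total_ul = sample_ul * stock_nM / target_nM
    return sample_ul, total_ul - sample_ul


if __name__ == "__main__":
    stock = ng_per_ul_to_nM(1.0, 85)   # hypothetical reading of 1 ng/ul
    sample_ul, diluent_ul = dilution_plan(stock)
    print(f"{stock:.1f} nM; mix {sample_ul} ul sample with {diluent_ul:.2f} ul diluent")
```

The diluent in this case would be EB buffer supplemented with 0.1% Tween-20, as listed in the materials.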

Protocol

4.1 Retrieve sample from the −20 °C freezer (see Step 3.43 of Basic Protocol 3). Run a 1-µl aliquot on an Agilent DNA 1000 chip following the Agilent 2100 protocol and keep the remainder on ice.
4.2 Dilute a minimum of 2 µl of the sample to 8 nM in EB supplemented with 0.1% Tween-20. (Note: If the sample is less than 8 nM, no dilution is needed.)
4.3 Double-check the diluted DNA by running another Agilent DNA 1000 chip, and determine the final concentration of the diluted sample. (Note: No double-check is necessary if a sample is not diluted in Step 4.2.)
4.4 Generate clusters and sequence on the Illumina Genome Analyzer following the manufacturer’s recommended protocols (Illumina).

13.5.5 Alternate Protocol: Amplified Tag-Seq library construction (Tag-SeqLite)

If the RNA material is limiting (e.g., below 500 ng), cDNA amplification is necessary. Tag-seqLite biochemistry is based upon the SMART (Switching Mechanism At the 5′ end of RNA Transcripts) cDNA synthesis strategy (Clontech) for the generation of full-length cDNA libraries. In SMART cDNA synthesis, only polyadenylated RNA molecules that have been full-length reverse transcribed are extended with a poly(C) tail by a terminal transferase activity inherent in the reverse transcriptase. A synthetic oligonucleotide with a 3′ poly(G) stretch is hybridized to the first-strand cDNA and serves as a primer for synthesis of the second cDNA strand. Thus, each full-length first-strand cDNA molecule will have incorporated a synthetic 5′ priming site and a 3′ site (a pool of oligo(dT) primers with degenerate 3′ ends, 5′-T(30)VN-3′), allowing the cDNA to be amplified using a subsequent PCR step. Following PCR amplification, the cDNA is processed according to the standard tag sequencing library protocol. RNA is typically isolated using TRIzol or column-based techniques (using, e.g., the Qiagen AllPrep Mini Kit or Ambion RiboPure Kit), followed by DNase I treatment to obtain DNA-free RNA starting material. RNA integrity is then assessed using an Agilent Bioanalyzer RNA 6000 Nano Chip, according to the manufacturer’s directions; the resulting RIN is used to help establish an RNA quality standard. This protocol is designed for 5–500 ng of total RNA of RIN 7 or better.
Note: When working with RNA, precautions must be taken to avoid RNA degradation resulting from RNase contamination of reagents and materials. The use of Neptune Barrier Tips (CLP; www.clpdirect.com) is recommended for pipetting, as well as RNaseZap (Ambion) for RNase control.

Materials (in addition to those specified in Basic Protocols 1, 2, 3, and 4)
• Total RNA sample: typically isolated using TriZOL (Invitrogen), AllPrep Mini Kit (Qiagen), or RiboPure Kit (Ambion) and DNase I-treated
• LITE1/LITE TS primer mix (20 µM each; Integrated DNA Technologies, www.idtdna.com) including:
  - Biotin-AAG CAG TGG TAA CAA CGC AGA GTA CTT TTT TTT TTT TTT TTT TTT TTT TTT TTT TTT-TVN
  - Lite TS primer, 10 µM: AAG CAG TGG TAA CAA CGC AGA GTA CGC GGG
• Nuclease-free H2O (Ambion)
• 20 mM DTT
• SMARTScribe reverse transcriptase (Clontech)
• TE buffer, pH 8.0 (Invitrogen)
• Advantage 2 PCR Kit (Clontech) including:
  - 10× Advantage 2 Buffer
  - 50× Advantage 2 Polymerase Mix
• Buffer PB (Qiagen)
• Buffer PE (Qiagen)
• Buffer EB (Qiagen)
• M280 Streptavidin beads (Invitrogen)
• 2× and 1× B&W buffer (see recipe)
• Thin-walled, frosted-lid, RNase-free PCR tubes (Ambion)
• Microcentrifuge adapters for the PCR tubes
• Peltier Thermal Cycler (MJ Research)
• QIAquick spin columns and collection tubes (Qiagen)
• NanoDrop spectrophotometer
• Agilent DNA 7500 Kit (Agilent) including:
  - DNA 7500 Gel Matrix
  - DNA 7500 Marker
  - DNA 7500 Ladder
  - Agilent DNA 7500 Chips
  - Agilent Chip Priming Station
  - IKA Works Vortexer
  - Agilent Electrode Cleaner
• Additional reagents and equipment for quantitating DNA using NanoDrop spectrophotometer
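The anchored oligo(dT) primer is described in this protocol as a pool with degenerate 3′ ends, 5′-T(30)VN-3′. Under the standard IUPAC codes (V = A, C, or G; N = any base) that pool contains 12 distinct sequences. The Python sketch below enumerates them; it builds the primer from the shared 5′ priming sequence of the Lite TS primer plus thirty T residues as given in the T(30)VN description, and the function and variable names are hypothetical.

```python
from itertools import product

# IUPAC codes needed for this primer design.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "V": "ACG", "N": "ACGT"}

# 5' priming sequence shared by both primers (first 25 nt of the Lite TS primer).
PRIMING_SITE = "AAGCAGTGGTAACAACGCAGAGTAC"
LITE_TS = PRIMING_SITE + "GCGGG"
# Anchored oligo(dT) primer following the 5'-T(30)VN-3' description.
LITE1 = PRIMING_SITE + "T" * 30 + "VN"


def expand_degenerate(primer):
    """List every concrete sequence represented by a degenerate primer."""
    choices = [IUPAC[base] for base in primer]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    variants = expand_degenerate(LITE1)
    print(len(variants), "primer variants")
    for seq in variants:
        print(seq[-6:])  # last four T residues plus the two anchor bases
```

The V position excludes T, so each variant is anchored at the junction between the poly(A) tail and the transcript body rather than priming at arbitrary positions within the tail.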

Protocol

Retrieve reagents and prepare equipment
A.1 Put on a clean pair of nitrile gloves and lab coat. Wipe down the workbench, small equipment, and ice bucket with RNaseZap. Lay down a new bench pad (Bench Coat bench protection paper).
A.2 Change gloves. Retrieve fresh ice and all required reagents listed in Steps A.4 and A.7. Thaw all reagents, vortex, and microcentrifuge briefly to bring the solutions to the bottoms of the tubes.
A.3 Set one Thermomixer to 42 °C and set another Thermomixer to 70 °C. (Note: The Thermomixers may be replaced with heating blocks for this procedure, since mixing is not necessary.)

Synthesize first-strand cDNA
A.4 Combine the DNase I-treated total RNA and LITE1/LITE TS primer mix (20 µM each) in a new siliconized (nonstick) RNase-free 1.5-ml tube on ice (total volume, 4 µl):

LITE1/LITE TS primer mix (20 µM each)                                 1 µl
RNA (total RNA, DNase-treated, 10–100 ng) in DEPC-treated H2O         3 µl

A.5 Mix gently by flicking the tube or slow vortexing and then microcentrifuge briefly to bring the solution to the bottom of the tube.
A.6 Place the tube in the 70 °C Thermomixer for 2 min to heat-denature the RNA, then remove the tube and microcentrifuge briefly to bring the solution to the bottom of the tube.
A.7 Place the tube in a clean rack at room temperature. Add the remaining reagents for first-strand cDNA synthesis to the tube:

5× First-Strand Buffer      2.0 µl
5 M Betaine                 1.0 µl
10 mM dNTP mix              1.0 µl
20 mM DTT                   1.0 µl
RNaseOUT                    0.5 µl


A.8 Mix gently by pipette ting up and down, then microcentrifuge briefly to bring the solution to the bottom of the tube. Add 1 ml of 100 U/ml SMARTScribe Reverse Transcriptase and mix again. (Note: Total reaction volume is 10.5 ml if there is no loss via evaporation or pipette ting.) A.9 Incubate at 42  C for 90 min in the other Thermomixer that was set up in Step A.3. A.10 Set a timer for 30 min. A.11 After 30 min, microcentrifuge the tube briefly to bring down condensed droplets on the sides of the tube. A.12 Mix the tube gently, and microcentrifuge again briefly to collect the solution at the bottom of the tube. Incubate at 42  C for the remaining 60 min. A.13 Put on a clean pair of gloves and lab coat. Retrieve all reagents listed in Step A.20. A.14 During incubation, set another Thermomixer (or heating block) to 72  C. A.15 After the 90 min incubation at 42  C (Steps A.9–A.12), add 40 ml TE buffer, pH 8.0, to the sample. Mix gently and microcentrifuge briefly to bring the solution to the bottom of the tube. (Note: Total volume is now 50 ml.) A.16 Incubate at 72  C for 7 min in the Thermomixer prepared at Step A.14, to stop the reaction. A.17 Mix tube gently, then microcentrifuge briefly to bring the solution to the bottom of the tube. A.18 Transfer two-thirds of the first-strand cDNA products (30 ml) to a new siliconized (nonstick) 1.5-ml tube, and keep on ice. Save the remainder of the first-strand cDNA in the 20  C freezer. Perform Tag-seqLite PCR A.19 Retrieve all reagents. Thaw all reagents, vortex, and microcentrifuge briefly to bring the solutions to the bottoms of the tubes. A.20 In the RNA-area flow hood, make a 5 premix for five 50-ml PCR reactions in a new 1.5-ml tube on ice (you will need three reactions per library and two reactions for negative control; total volume per reaction should be 40 ml): Nuclease-free dH2O

32 ml

10 Advantage 2 Buffer

5 ml

10 mM dNTP

1 ml

10 mM LITE TS primer

1 ml

50 Advantage 2 Polymerase Mix

1 ml

A.21 Aliquot the premix into 0.2-ml PCR tubes (40 µl/tube).
A.22 Add 10 µl of the first-strand cDNA to each of the three sample tubes and add 10 µl of DEPC-treated water to the negative controls. Be sure to label the tubes accordingly. Microcentrifuge the PCR tubes using microcentrifuge tube adapters.
A.23 Perform PCR using the following thermal cycling program:

    60 s          95 °C (initial denaturation)
    20 cycles of:
      30 s        95 °C (denaturation)
      30 s        65 °C (annealing/extension)
    6 min         68 °C (final extension)
    indefinite    4 °C (hold)
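The premix arithmetic in Step A.20 can be checked with a few lines of Python. This is only a minimal sketch: the per-reaction volumes are taken from the table in Step A.20, while the function name `scale_premix` and the dictionary layout are illustrative choices, not part of the protocol.

```python
# Per-reaction premix volumes in microlitres, as listed in Step A.20.
PREMIX_PER_REACTION_UL = {
    "Nuclease-free dH2O": 32.0,
    "Advantage 2 Buffer": 5.0,
    "dNTP": 1.0,
    "LITE TS primer": 1.0,
    "Advantage 2 Polymerase Mix": 1.0,
}


def scale_premix(per_reaction_ul, n_reactions):
    """Return the volume (µl) of each component for n_reactions reactions."""
    if n_reactions < 1:
        raise ValueError("n_reactions must be at least 1")
    return {name: vol * n_reactions for name, vol in per_reaction_ul.items()}


# Five reactions: three per library plus two negative controls.
premix = scale_premix(PREMIX_PER_REACTION_UL, 5)
total_ul = sum(premix.values())

for name, vol in premix.items():
    print(f"{name}: {vol:.0f} µl")
print(f"Total premix: {total_ul:.0f} µl ({total_ul / 5:.0f} µl per tube)")
```

Dividing the total by the number of tubes recovers the 40 µl aliquot used in Step A.21, which leaves room for the 10 µl of template or water added in Step A.22.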

13 Tag-Seq: Next-Generation Tag Sequencing for Gene Expression Profiling

Purify PCR product
A.24 Put on a clean pair of gloves and lab coat.
A.25 Remove the five PCR tubes from the PCR machine and pool the PCR products into two tubes: one for the sample and one for the negative control.
A.26 Add 5 volumes of Qiagen Buffer PB to 1 volume each of the pooled PCR products and negative controls. Mix the contents in each tube by pipetting up and down.
A.27 Place a QIAquick spin column in a 2-ml collection tube provided by Qiagen.
A.28 To bind DNA, apply the sample to the QIAquick column and microcentrifuge 60 s at 10 000 × g, room temperature. (Note: The sample must be added in two aliquots of 0.5 ml in order to not exceed the maximum sample volume of the spin column (0.75 ml). Each application requires a separate spin.)
A.29 Discard the flow-through following each spin and place the QIAquick column back into the same collection tube. (Note: All subsequent centrifugations are carried out using the same settings.)
A.30 To wash, add 0.75 ml Qiagen Buffer PE (containing ethanol) to the QIAquick column and microcentrifuge again as described in Step A.28.
A.31 Discard flow-through and place the QIAquick column back in the same tube. Microcentrifuge the column for an additional 2 min (at 10 000 × g, room temperature). Use a pipettor and tip to aspirate off any ethanol residue that may be resting on the ridge of the column.
A.32 Place the QIAquick column in a clean and labeled 1.5-ml microcentrifuge tube.
A.33 Add 30 µl of warmed Buffer EB and wait for 1 min prior to centrifugation. Centrifuge the column for 1 min as described in Step A.28. (Note: It is important to dispense the elution buffer directly onto the membrane, and not the sides of the column, to ensure complete elution of bound DNA.)
A.34 Run 1.5 µl of the PCR product on the NanoDrop to assess concentration.

Assess cDNA using Agilent analysis
A.35 Put on a clean pair of gloves and lab coat. Wipe down workbench and small equipment.
A.36 Allow reagents in the DNA 7500 Assay kit to equilibrate to room temperature while protected from light (20 min).
A.37 Prepare an Agilent DNA 7500 chip according to the instructions in the Agilent DNA Assay manual.
A.38 Add 1 µl of the amplified cDNA (from Step A.33) at room temperature (no heating is required). Run the chip using the “DNA-7500-Assay” setting under the “dsDNA assay” option.
Note: Before proceeding with the NlaIII digestion, review the Agilent DNA smear profile. A typical cDNA smear ranging in size from 200 bp to 10 kb is expected; 500 ng of cDNA is required to proceed with the next step on the following day.

Generate tags by NlaIII digestion
A.39 Put on a clean pair of gloves and lab coat. Wipe down workbench, small equipment, and ice bucket. Lay down a new bench pad (Bench Coat bench protection paper).
A.40 Change gloves. Retrieve fresh ice and the required reagents listed in Step A.42. Thaw all reagents except NlaIII, vortex, and microcentrifuge briefly to bring the solutions to the bottoms of the tubes. Take an aliquot of NlaIII when you are ready to add it to your sample.
A.41 Based on the NanoDrop quantification, determine the amount of cDNA you want to use in the next reaction. (Note: 500–1000 ng can be used.)

13.5 Methods and Protocols

A.42 Set up the reaction as follows:

    Amplified DNA           500 ng cDNA (from Step A.33; amount determined in Step A.41)
    100× BSA                1 µl
    10× Buffer 4            10 µl
    10 U/µl NlaIII          1 µl
    Adjust final volume to 100 µl with nuclease-free water

A.43 Microcentrifuge the tube briefly to bring the reaction down to the bottom, then mix by slow vortexing and microcentrifuge again to collect the solution at the bottom of the tube.
A.44 Seal the tube with Parafilm and incubate at 37 °C for 1.5 h in a Thermomixer (or heat block).
A.45 Put on a clean pair of gloves and lab coat. Lay down a new bench pad (Bench Coat bench protection paper).
A.46 Change gloves. Retrieve fresh ice and all required reagents listed in Step A.51.

Precipitate cDNA
A.47 Microcentrifuge a 2-ml Phase Lock Gel tube 1 min at maximum speed, room temperature.
A.48 Add 200 µl of phenol/chloroform/IAA to the Phase Lock Gel tube.
A.49 Add 100 µl of nuclease-free water to the cDNA digest (from Step A.44) to bring volume up to 200 µl, then transfer the 200 µl of cDNA digest into the prespun 2-ml Phase Lock Gel tube (from Step A.47).
A.50 Mix well by hand. Microcentrifuge 5 min at maximum speed, room temperature.
A.51 Remove the 200 µl of aqueous (top) phase, transfer to a sterile 1.5-ml tube, and add the following (for a total volume of 873 µl including the aqueous phase itself):

    3 M Sodium acetate, pH 5.5      20 µl
    20 mg/ml Mussel glycogen        3 µl
    100% Ethanol                    650 µl

A.52 Vortex to mix well, then chill for 20 min in a −20 °C freezer.
A.53 Microcentrifuge 30 min at maximum speed, 4 °C.
A.54 Carefully decant the supernatant into a fresh microcentrifuge tube. Keep an eye on the pellet so that it does not slide out of the tube.

Purify DNA
A.55 Wash the pellet twice, each time with 1 ml cold 70% ethanol, microcentrifuging 5 min at maximum speed, 4 °C, between washes.
A.56 Carefully decant the supernatant into a new microcentrifuge tube. Keep an eye on the pellet so it does not slide out.
A.57 After removing the final wash, dab the tube rim on a Kimwipe to remove ethanol. Microcentrifuge briefly to collect the residual ethanol at the bottom of the tube, and carefully remove using a pipettor and 10-µl tip.
A.58 Mark the outside bottom of the tube to better locate the pellet when resuspending in Step A.60.
A.59 Lay the open tube on its side and air-dry the pellet for 5 to 15 min, or until the pellet is translucent. Do not over-dry the pellet.
A.60 Resuspend the pellet in 200 µl nuclease-free water. Mix by flicking or gentle vortexing and store on ice.
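The volume bookkeeping in Steps A.41, A.42, and A.51 is simple arithmetic, and a short Python sketch can make it explicit. The 25 ng/µl NanoDrop reading below is a hypothetical value chosen for illustration, and the function name `digest_volumes` is likewise an assumption; only the component volumes come from the protocol.

```python
# Fixed-volume components of the NlaIII digest (µl), as listed in Step A.42.
FIXED_DIGEST_UL = {"100x BSA": 1.0, "10x Buffer 4": 10.0, "NlaIII": 1.0}

# Precipitation mix from Step A.51 (µl); the aqueous phase counts toward the total.
PRECIPITATION_UL = {
    "aqueous phase": 200.0,
    "3 M sodium acetate": 20.0,
    "mussel glycogen": 3.0,
    "100% ethanol": 650.0,
}


def digest_volumes(cdna_ng_per_ul, target_ng=500.0, final_ul=100.0):
    """Return the volume (µl) of every digest component, water included."""
    if cdna_ng_per_ul <= 0:
        raise ValueError("cDNA concentration must be positive")
    cdna_ul = target_ng / cdna_ng_per_ul
    water_ul = final_ul - cdna_ul - sum(FIXED_DIGEST_UL.values())
    if water_ul < 0:
        raise ValueError("cDNA is too dilute to fit in the reaction volume")
    return {"cDNA": cdna_ul, **FIXED_DIGEST_UL, "Nuclease-free water": water_ul}


mix = digest_volumes(25.0)  # hypothetical NanoDrop reading of 25 ng/µl
for name, vol in mix.items():
    print(f"{name}: {vol:.1f} µl")
print(f"Precipitation total: {sum(PRECIPITATION_UL.values()):.0f} µl")
```

The `ValueError` for a negative water volume flags samples that are too dilute to deliver the requested mass within a 100 µl reaction.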

Bind cDNA to magnetic beads
A.61 Thoroughly resuspend M280 Streptavidin beads in the 2-ml stock supply tube by gentle vortexing.
A.62 Transfer 100 µl (1 mg) of resuspended beads to a new 1.5-ml nonstick microcentrifuge tube. Place the tube on a magnetic stand for 2 min.
A.63 Carefully remove the supernatant while the tube remains on the magnet and discard the supernatant.
A.64 Wash the beads twice on the magnetic stand with 100 µl of 2× B&W buffer.
A.65 Resuspend beads in 200 µl of 2× B&W buffer.
A.66 Add the 200 µl of amplified (biotinylated) cDNA (from Step A.60) to the beads, and mix well. Firmly close and place the 1.5-ml tube in a 50-ml conical tube stuffed with a Kimwipe.
A.67 Place the 50-ml tube on a rocking platform (Nutator) for 1 h at room temperature.

Ligate GEX Adapter 1 to NlaIII-digested cDNA
A.68 Retrieve all required reagents listed in Step A.74. Thaw all reagents except T4 DNA ligase, vortex, and microcentrifuge briefly to bring the solutions to the bottoms of the tubes.
A.69 Turn on the 16 °C water bath.
A.70 Place the tube (from Step A.67) on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
A.71 Wash the beads twice on the magnetic stand with 200 µl of 1× B&W buffer.
A.72 Wash the beads twice on the magnetic stand with 150 µl of 1× ligase buffer. Leave the second wash of 1× ligase buffer in the tube.
A.73 Place the tube on a magnetic stand for 2 min, then carefully remove and discard the supernatant.
A.74 Immediately add the following to the beads (50 µl total volume):

    Nuclease-free H2O           35 µl
    10 µM GEX Adapter 1         3 µl
    5× Ligase buffer            10 µl
    5 U/µl T4 DNA ligase        2 µl

A.75 Mix gently by flicking the tube without causing the beads to splash on the inner walls or lid, then microcentrifuge briefly to bring the solution to the bottom of the tube. Incubate at 16 °C overnight in a water bath.
A.76 Proceed on the next day with Basic Protocol 2, Steps 2.17–2.49. Complete Basic Protocols 3 and 4.

13.5.6 Basic Protocol 5: Data Analysis

A suite of freely available scripts allows the user to filter raw Tag-seq libraries, perform comparisons between libraries, and analyze gene expression. Since strand information is retained during Tag-seq library construction, antisense expression can also be profiled. The following protocol describes the set-up and typical use of these scripts (see Figure 13.3a).

Materials

• Hardware and software. Computer with a Linux or Macintosh operating system. An Internet browser or ftp client is required to download the tar archives, and a command-line prompt with a c-shell interface is needed to run the scripts.

Fig. 13.3 Overview of the computational analysis flow (a) and output of three analysis scripts. (b) SdCompare: the number of tag sequences (y-axis) with expression counts above 20 in each of two compared libraries are binned by the log-ratio of their expression (x-axis). This provides a measure of the similarity between two libraries (ce0068 and ce0069). (c) CorrelatePlot: a scatterplot of the (log) expression levels of tags sequenced in two libraries (ce0068 and ce0069; x-axis and y-axis, respectively) is shown along with the Pearson correlation coefficient of the two libraries, the linear regression equation (top right), and the linear regression line. (d) SdSageTree: a hierarchical tree representation of the distance matrix calculated for five libraries. The distance matrix is constructed from the standard deviations of the log ratios of the tag expression values in the five libraries. See Table 13.2 for an overview of the scripts and Table 13.3 for a summary of the data files.

• Files. User-generated Tag-seq files should be in a tab-delimited format, and contain sequence reads and observed counts in two columns. The script and analysis files are available as tar archives that can be downloaded from ftp://ftp.bcgsc.ca/supplementary/CPHG_2009_TagSeq.
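The two-column input format described above can be read with a few lines of Python. This is a minimal sketch rather than one of the provided scripts: the function names, the upper-casing of sequences, and the three example rows are all assumptions made for illustration. The summary it prints (number of distinct tag sequences and total abundance) mirrors the kind of statistic the text attributes to the sumtags script.

```python
import csv
import io


def read_tag_counts(handle):
    """Read 'sequence<TAB>count' rows into a dict, summing repeated sequences."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        tag = row[0].strip().upper()
        counts[tag] = counts.get(tag, 0) + int(row[1])
    return counts


def summarize(counts):
    """Return (number of distinct tag sequences, total tag abundance)."""
    return len(counts), sum(counts.values())


# Hypothetical three-row library; a real file would be opened with
# open(path, newline="") and passed in the same way.
example = io.StringIO(
    "ACGTACGTACGTACGTA\t12\n"
    "TTTTGGGGCCCCAAAAT\t5\n"
    "acgtacgtacgtacgta\t1\n"
)
library = read_tag_counts(example)
print(summarize(library))  # prints (2, 18)
```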

Protocol

Set up the system
5.1 To set up the environment required to run the scripts, modify the .cshrc file to include the lines below. The user decides on the location of the working directory and specifies that path in the first line of text added to the .cshrc file (replace /Users/user/tabdir in the example below). The working directory is henceforth referred to as $tabdir.

    if ( ! $?tabdir ) setenv tabdir /Users/user/tabdir
    if ( -f $tabdir/scripts/Cshrc ) source $tabdir/scripts/Cshrc

5.2 Log into a c-shell for the changes to take effect. Unless the default shell is a c-shell, run the csh command to switch to one (run exit to return to the default shell). The prompt should change to the current directory followed by a % character (e.g., tabdir%).

Set up analysis files
5.3 Download the following files from ftp://ftp.bcgsc.ca/supplementary/CPHG_2009_TagSeq:

• tabdir.tar.gz
• data.tar.gz
• sagedir.tar.gz
• TreeViewX.app

The file TreeViewX.app contains an application required to build phylogenetic trees, and should be placed in the Applications folder. The archive tabdir.tar.gz contains scripts as well as other data and files required by the scripts; data.tar.gz contains sample Tag-seq data; and sagedir.tar.gz contains transcriptome and genome reference files for human, mouse, and Caenorhabditis elegans. The content and purpose of the provided scripts and files are outlined in Tables 13.2 and 13.3.
5.4 Move the three archives into the $tabdir directory and expand them, using the following prompt commands:

    mv *tar.gz $tabdir
    cd $tabdir
    gunzip ./*gz
    tar xvf tabdir.tar
    tar xvf data.tar
    tar xvf sagedir.tar

The sagedir.tar archive expands to the folder sagedir, which contains genome and transcriptome reference files. These files contain all possible tag sequences that map to the genome or transcriptome, and are referred to as virtual tag databases. The tabdir.tar archive contains five directories (darwin, QCreports, scripts, lib, and seqdata); see Tables 13.2 and 13.3 for a description of the files in these folders. The data.tar archive contains the folders taglibs and sagelibs (see Table 13.3). Briefly, the taglibs folder contains the sample Tag-seq raw libraries, and is the location to which the user should add additional Tag-seq libraries. The sagelibs folder contains filtered Tag-seq library files of 17-bp tag sequences and read counts. Subfolders contain the MA, MT, MR, SSOOHE, and SA filtered libraries.

Data files provided include virtual tag libraries generated for human, mouse, and C. elegans genomes and transcriptomes. Raw Tag-seq libraries are provided and can be processed using the makeLibraryWrapper script into taglib files. The main use of taglib files is in the SSOOHE filtering step. The makeLibraryWrapper script also generates sagelib files that contain filtered tags. Filtering options include SSOOHE (the removal of singleton tags and one-offs of highly expressed tags) and SA (removal of tags containing adapter sequences).
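To make the idea behind SSOOHE filtering concrete, the following Python sketch removes singleton tags and tags that differ by a single base from a highly expressed tag. It is not the logic of makeLibraryWrapper, whose implementation is not described here. Three points are assumptions made purely for illustration: a "one-off" is read as a single-base substitution, "highly expressed" is set by the hypothetical threshold `high_expression=100`, and one-offs are dropped regardless of their own count unless they are themselves highly expressed.

```python
def one_off_neighbors(tag, alphabet="ACGT"):
    """Yield every sequence that differs from tag by exactly one substitution."""
    for i, base in enumerate(tag):
        for alt in alphabet:
            if alt != base:
                yield tag[:i] + alt + tag[i + 1:]


def ssoohe_filter(counts, high_expression=100):
    """Drop singletons and single-substitution neighbours of abundant tags.

    counts maps tag sequence -> observed count. The threshold is a
    hypothetical value, not one taken from the protocol.
    """
    high = {tag for tag, n in counts.items() if n >= high_expression}
    one_offs = set()
    for tag in high:
        for neighbor in one_off_neighbors(tag):
            if neighbor in counts and neighbor not in high:
                one_offs.add(neighbor)
    return {
        tag: n
        for tag, n in counts.items()
        if n > 1 and tag not in one_offs
    }


# Hypothetical four-base tags keep the example readable; real tags are 17 bp.
raw = {"AAAA": 500, "AAAT": 3, "CCCC": 1, "GGGG": 7}
print(ssoohe_filter(raw))
```

In this toy input, AAAT is discarded as a likely sequencing error of the abundant tag AAAA, and CCCC is discarded as a singleton.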
If genome and transcriptome files are available for the species profiled using Tag-seq, additional filtering options include MG (retaining tags that map to the genome), MT (retaining tags mapping to the transcriptome), MA (tags mapping to the genome or transcriptome), and MR (tags mapping to RefSeq). Analysis scripts can be used as described below on the sagelib files to compare libraries (CompareSage and SdCompare), to calculate correlations between libraries (Correlate, CorrelatePlot), to merge two or more library files (mergeTagFiles), or to sample a random subset of tags from a library (randomSageSample). Tags can be annotated as mapping to the sense or antisense strand of transcripts, or in proximity to known genes (TagBasedAnnotate, TagBasedTranscriptAnnotate). The distance between multiple libraries can be calculated using the sageTree and SdSageTree scripts, and hierarchical trees can be displayed using the provided TreeView X software or similar applications. Simple statistics can be generated using the sumtags script, which reports the number of tag sequences in a library and their abundance. The majority of scripts were adapted from the LongSAGE pipeline in our laboratory and therefore still contain the term “SAGE” in their names.

Table 13.2 Overview of scripts provided for tag sequencing library analysis. Each entry lists the script location and name, its purpose, and a command example (run from the $tabdir directory).

$tabdir/scripts/CompareSage
    Purpose: compare two libraries, computing a P-value for each tag/gene in the library, using the normal approximation to the binomial
    Command example: CompareSage ./data/sagelibs/hs1113.sagelib ./data/sagelibs/mm0315.sagelib

$tabdir/scripts/Correlate
    Purpose: join a series of libraries together and compute a correlation matrix
    Command example: Correlate ./data/sagelibs/hs1113.sagelib ./data/sagelibs/mm0315.sagelib

$tabdir/scripts/CorrelatePlot
    Purpose: produce a scatterplot of two libraries, including correlation
    Command example: CorrelatePlot ./data/sagelibs/hs1113.sagelib ./data/sagelibs/mm0315.sagelib

$tabdir/scripts/SdCompare
    Purpose: compare two libraries, using the log ratio standard deviation statistic
    Command example: SdCompare ./data/sagelibs/hs1113.sagelib ./data/sagelibs/mm0315.sagelib

$tabdir/scripts/sageTree
    Purpose: make a tree using the Fitch algorithm from the correlation matrix for the set of libraries in the argument list
    Command example: sageTree ./data/sagelibs/hs1113.sagelib ./data/sagelibs/ce0045.sagelib ./data/sagelibs/mm0315.sagelib ./data/sagelibs/mm0323.sagelib ./data/sagelibs/mm0321.sagelib ./data/sagelibs/mm0324.sagelib

$tabdir/scripts/SdSageTree
    Purpose: make a tree using the Fitch algorithm from a distance matrix constructed from the standard deviations of the log ratios of the libraries
    Command example: SdSageTree ./data/sagelibs/hs1113.sagelib ./data/sagelibs/mm0315.sagelib ./data/sagelibs/mm0323.sagelib ./data/sagelibs/mm0321.sagelib ./data/sagelibs/mm0324.sagelib

$tabdir/scripts/TagBasedAnnotate
    Purpose: create an annotation for a library, with fields: Transcript accession, Tag, Count, Chromosome, Nucleotide position, Tag Sense, Location Type, PreDistance, PostDistance, Transcript Sense, and Information; Tag Sense is relative to the annotated genome; Location Type may be intergenic, intron, exon, or complex; complex means that the location is in a region of overlapping exons or introns or both; PreDistance is the distance (in bp) to the nearest previous exon boundary; PostDistance is the distance (in bp) to the nearest following exon boundary; used by other scripts
    Command example: TagBasedAnnotate ./data/sagelibs/hs1113.sagelib

$tabdir/scripts/TagBasedTranscriptAnnotate
    Purpose: create transcript-based annotation for a library, with fields: Transcript accession, Tag, Count, Tag Position in transcript, Nucleotide position, Tag Sense relative to the transcript, and Information
    Command example: TagBasedTranscriptAnnotate ./data/sagelibs/hs1113.sagelib

$tabdir/scripts/mergeTagFiles
    Purpose: merge a set of libraries or tag files, yielding one merged tag count file
    Command example: mergeTagFiles ./data/sagelibs/hs1113.sagelib ./data/sagelibs/ce0045.sagelib ./data/sagelibs/mm0315.sagelib ./data/sagelibs/mm0323.sagelib ./data/sagelibs/mm0321.sagelib ./data/sagelibs/mm0324.sagelib

$tabdir/scripts/randomSageSample
    Purpose: a random tag sample of specified size is selected from a library
    Command example: randomSageSample ./data/sagelibs/mm0321.sagelib 1000

$tabdir/scripts/sumtags
    Purpose: sum of all tags and tag sequences
    Command example: sumtags ./data/sagelibs/mm0321.sagelib

$tabdir/scripts/SampleDataWrapper
    Purpose: calls makeLibraryWrapper with library name and flow cell file parameters specified for the sample data
    Command example: SampleDataWrapper

$tabdir/scripts/makeLibraryWrapper
    Purpose: process raw data into taglib, sagelib, and filtered sagelib files; as input, it requires library name, flow cell, lane, and (if the species is not human) the taxon; output is to $tabdir/QCreports/SxLibraries.tbl and to the $tabdir/data/sagelibs directory

$tabdir/scripts/Cshrc
    Purpose: set local shell parameters
    Command example: none given

$tabdir/scripts/AddGenomicAcnData
    Purpose: annotate a tag with the nearest RefSeq in the genome; classify the tag as 3′, exon, intron, or 5′ and give distances from the 3′ and 5′ ends; used by other scripts
    Command example: none given

$tabdir/scripts/getSpecies, $tabdir/scripts/getTaxon, $tabdir/scripts/histogram, $tabdir/scripts/leftjoin, $tabdir/scripts/mergeTagExon, $tabdir/scripts/outerjoin, $tabdir/scripts/scatterGraph, $tabdir/scripts/tabify, $tabdir/scripts/tagLocationSummary
    Purpose: used by other scripts
    Command example: none given

Process raw data files and generate library statistics
5.5 Use the script makeLibraryWrapper to process raw Tag-seq library files into filtered 17-bp tag library files. The script expects that each file represents a complete library; therefore, in the rare cases when libraries have multiple lanes, these should be manually merged by the user before analysis. SampleDataWrapper is an example of how makeLibraryWrapper can be called on multiple library files. The textual output of this script includes the number of tag sequences and the tag abundance for every analyzed library:

    Making fcvsagelib from fctaglib : 305eraaxx_4 05eraaxx_4
    SxSageFilters:
    SxSageFilters: Making subsets for library hs1113
    hs1113 SxSage Raw 13202772 3775625
    hs1113 SxSage SA 13199694 3775618
    hs1113 SxSage SSOOHE 9063537 401995
    hs1113 SxSage MR 1067160 12272
    hs1113 SxSage MT 1814808 23385
    hs1113 SxSage MG 8259852 371699
    hs1113 SxSage MA 8281508 372016
    Adapter content : 0.000
    SSOOHE Proportion (of SA) : 0.687
    MR Proportion (of SSOOHE) : 0.118
    MA Proportion (of SSOOHE) : 0.914

Perform library comparisons
5.6 Detect differentially expressed tags between two libraries using the CompareSage script, which provides P-values associated with each tag expression change. To generate statistics for a comparison of the log-ratio of the tag expression values in two libraries, use SdCompare, which also provides a plot of the number of tags (y-axis) binned by log-ratio (x-axis), as shown in Figure 13.3b.
5.7 Assess the Pearson correlation between two or more libraries by using the Correlate script, which provides statistics on the expression values of tags in each input library (min, max, mean expression, etc.) as well as a pairwise correlation matrix. To view the correlation values as a plot, use CorrelatePlot (Figure 13.3c).
5.8 Assess the similarity of multiple libraries with a phylogeny generated by the script SdSageTree, using a distance metric constructed from the standard deviations of the log ratios between pairs of libraries, or by the script sageTree, using Pearson correlations between pairs of libraries as the distance metric. Visualize the output file containing the relationships between analyzed libraries using the TreeView X application provided on the ftp site (SdSageTree output; Figure 13.3d).
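The statistics named in Steps 5.6 to 5.8 can be illustrated with a short, self-contained Python sketch. It does not reproduce the provided scripts, whose internals are not described here. The assumptions are: log ratios use base 2 on raw counts without library-size normalization, the standard deviation is the population form, the cut-off of 20 counts follows the Figure 13.3b caption, and the P-value is a two-sided two-proportion z-test as one common form of the normal approximation to the binomial. All function names and the example libraries are hypothetical.

```python
import math
import statistics


def shared_log_ratios(lib_a, lib_b, min_count=20):
    """log2(count_a / count_b) for tags with more than min_count in both libraries."""
    return [
        math.log2(lib_a[tag] / lib_b[tag])
        for tag in lib_a.keys() & lib_b.keys()
        if lib_a[tag] > min_count and lib_b[tag] > min_count
    ]


def log_ratio_sd(lib_a, lib_b, min_count=20):
    """Population standard deviation of the shared-tag log2 ratios."""
    return statistics.pstdev(shared_log_ratios(lib_a, lib_b, min_count))


def pearson_log(lib_a, lib_b):
    """Pearson correlation of log2 counts over tags present in both libraries."""
    tags = sorted(lib_a.keys() & lib_b.keys())
    xs = [math.log2(lib_a[t]) for t in tags]
    ys = [math.log2(lib_b[t]) for t in tags]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def two_proportion_p(count_a, total_a, count_b, total_b):
    """Two-sided P-value for a tag's proportion differing between two libraries."""
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (count_a / total_a - count_b / total_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical libraries in which every shared tag is exactly twofold higher in lib_b.
lib_a = {"T1": 32, "T2": 64, "T3": 128, "T4": 5}
lib_b = {"T1": 64, "T2": 128, "T3": 256, "T5": 9}
print(log_ratio_sd(lib_a, lib_b))
print(pearson_log(lib_a, lib_b))
print(two_proportion_p(100, 1000, 50, 1000))
```

A small log-ratio standard deviation or a correlation close to 1 indicates similar libraries, which is why either quantity can serve as the basis of a distance matrix for tree building.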

Table 13.3 Location, content, and intended usage of the data files. Each entry lists the file location and name, its contents, and (where given) its purpose.

$tabdir/Qcreports/SxLibraries.tbl
    Contents: library, flow cell, lane, upper_protocol, lower_protocol, taxon, filename, date_downloaded, Library_type
    Purpose: used by scripts

$tabdir/lib/Taxonomy/Species.lst
    Contents: species_abbrev, longer_species_abbrev, taxon, Capitalized_species_abbrev, Species, Long_Species, Common_name
    Purpose: used by scripts

$tabdir/seqlib/Unigene_Species.lst
    Contents: Capitalized_species_abbrev, taxon, Species
    Purpose: used by scripts

$tabdir/data/fctaglibs
    Contents: contains data from the Illumina machine; full-length tags from a single lane (a flow cell has eight lanes)
    Purpose: contains files used to create taglib and sagelib files

$tabdir/data/taglibs
    Contents: contains the tags from fctaglibs, with short tags (less than 12) removed and tags truncated at the first ambiguous base

$tabdir/data/fcVirtualSage
    Contents: contains files of virtual tags and counts created by truncating the sequences in corresponding fctaglib flow cell files at 17-bp (minimum)

$tabdir/data/VirtualSage
    Contents: contains files of virtual tags and counts created by truncating the sequences in corresponding taglib files at 17-bp (minimum)

$tabdir/data/sagelibs/ .sagelib
    Contents: SSOOHE filtered tag and count
    Purpose: files used by scripts in tag-based analysis; these files can be replaced with an alternately filtered file (e.g., from the SA folder) in order to provide a different input for the analysis scripts

$tabdir/data/sagelibs/MA
    Contents: directory of tag libraries that map to either the genome or transcriptome of the appropriate species
    Purpose: filtered tag-based analysis

$tabdir/data/sagelibs/MG
    Contents: directory of tag libraries that map to the genome of the appropriate species
    Purpose: filtered tag-based analysis

$tabdir/data/sagelibs/Raw
    Contents: directory of unfiltered tag libraries
    Purpose: unfiltered tag-based analysis

$tabdir/data/sagelibs/MT
    Contents: directory of tag libraries that map to transcriptome resources in the appropriate species
    Purpose: filtered tag-based analysis

$tabdir/data/sagelibs/MR
    Contents: directory of tag libraries that map to RefSeq in the appropriate species
    Purpose: filtered tag-based analysis

$tabdir/data/sagelibs/SA
    Contents: directory of tag libraries from which adapter sequences were removed
    Purpose: filtered tag-based analysis

$tabdir/data/sagelibs/SSOOHE
    Contents: directory of SSOOHE filtered tag libraries
    Purpose: filtered tag-based analysis

$tabdir/seqdata
    Contents: files containing RefSeq annotations
    Purpose: used by scripts

$tabdir/darwin
    Contents: contains the directory i386 and a directory corresponding to the user's host type (e.g., x86_64; to determine this parameter, run echo $HOSTTYPE on the command line)
    Purpose: used by scripts

$tabdir/sagedir/NlaIII_17/ Ref.tagdb
    Contents: tag sequences mapping on the sense strand of either C. elegans (cel), human (hu), or mouse (mm) RefSeq annotations

$tabdir/sagedir/NlaIII_17/ Ref.astagdb
    Contents: tag sequences mapping on the antisense strand of either C. elegans (cel), human (hu), or mouse (mm) RefSeq annotations

$tabdir/sagedir/NlaIII_17/ Chrom.tagdb
    Contents: tag sequences mapping to both strands of either C. elegans (cel), human (hu), or mouse (mm) genome

$tabdir/sagedir/NlaIII_17/ Chrom_u.tagdb
    Contents: tag sequences mapping uniquely to both strands of either C. elegans (cel), human (hu), or mouse (mm) genome

$tabdir/sagedir/NlaIII_17/ Chrom_u5.tagdb
    Contents: tag sequences mapping at most 5 times to both strands of either C. elegans (cel), human (hu), or mouse (mm) genome

$tabdir/sagedir/NlaIII_17/ Transcripts.tagdb
    Contents: tag sequences mapping to the sense strand of either C. elegans (cel), human (hu), or mouse (mm) transcripts (Unigene, MGC, microRNAs)

$tabdir/sagedir/NlaIII_17/ Transcripts_as.tagdb
    Contents: tag sequences mapping to the antisense strand of either C. elegans (cel), human (hu), or mouse (mm) transcripts (Unigene, MGC, microRNAs)

Annotate tags
5.9 Annotate library tags with genomic or transcriptome information using the TagBasedAnnotate and the TagBasedTranscriptAnnotate scripts, respectively. Mapping tags to the genome produces a tab-delimited file containing the tag sequence, its abundance, the chromosome and nucleotide position of the tag location, any accessions overlapping the same site, the strand (TagSense), and the location type: intergenic, intronic, exonic, or complex (a region of overlapping introns or exons, or both). “Pre-distance” and “post-distance” are the distances, in base pairs, from the tag location to known genes encoded upstream and downstream, respectively. The genomic mappings provide a way to interpret novel tags that do not map to known transcripts. In these cases, it is informative to determine whether the novel tag lies in proximity to another gene in either the 5′ or the 3′ direction.
5.10 Mapping tags to the transcriptome produces an Excel file containing the tag sequence, its abundance, the transcript to which the tag maps, the NlaIII position relative to the 3′ end of the transcript, transcript annotation, and the strand relative to the transcript. If the strand is positive, the tag originates from the sense strand of the transcript, and represents an mRNA variant from that locus. If the strand is negative, the tag represents an mRNA expressed from the opposite strand of the transcript. Tags that map both to the sense strand of one gene and to the antisense strand of another gene represent a sense transcript that overlaps another known transcript on the opposite strand. Tags that map only to the antisense strand of a known transcript are evidence for novel antisense expression.

Other tools
5.11 Use the script randomSageSample to take a random subsample of a library, and quickly calculate the abundance and number of tag sequences using the sumtags script.
5.12 Use the script mergeTagFiles to merge a set of two or more tag libraries, yielding a tab-delimited file of tag sequences and abundance in each of the input libraries.
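The operations in Steps 5.11 and 5.12 can be sketched in Python as follows. These functions are illustrations, not the provided scripts. They assume that a merged file reports a count of 0 for tags absent from a library, and that subsampling draws individual tags without replacement in proportion to their abundance. Neither detail is stated in the protocol, and the `counts` argument of `random.sample` requires Python 3.9 or later.

```python
import random
from collections import Counter


def merge_tag_files(libraries):
    """Merge {library name: {tag: count}} into a header and one row per tag.

    Tags missing from a library are given a count of 0 (an assumption).
    """
    names = list(libraries)
    all_tags = sorted(set().union(*libraries.values()))
    rows = [(tag, *[libraries[n].get(tag, 0) for n in names]) for tag in all_tags]
    return ["tag", *names], rows


def random_tag_sample(counts, size, seed=None):
    """Draw `size` individual tags without replacement, weighted by abundance."""
    rng = random.Random(seed)
    tags = list(counts)
    weights = [counts[t] for t in tags]
    drawn = rng.sample(tags, k=size, counts=weights)
    return dict(Counter(drawn))


# Hypothetical libraries with short tags for readability.
libs = {"libA": {"AAA": 2, "CCC": 1}, "libB": {"CCC": 4, "GGG": 3}}
header, rows = merge_tag_files(libs)
print("\t".join(header))
for row in rows:
    print("\t".join(str(field) for field in row))

subsample = random_tag_sample({"AAA": 50, "CCC": 50}, 10, seed=1)
print(sum(subsample.values()))  # prints 10
```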

13.6 Applications

The digital and quantitative nature of SAGE data, along with its efficient sampling of short sequence tags from known and novel mRNA transcripts, and its theoretically unlimited dynamic range, have made SAGE an attractive technology for profiling eukaryotic transcriptomes [4,10,11]. Numerous improvements to the original technology [7,12–16] have been described. These include the production of longer tags, which have improved the specificity of tag-to-gene mapping [7,17], and modifications designed to facilitate library construction from nanogram quantities of total RNA [13].

Classical SAGE library construction involves the concatenation of individual SAGE tags followed by cloning and direct Sanger sequencing of individual clones. Concatenation allows for the efficient use of the read lengths afforded by the Sanger sequencing platform, with tens of SAGE tags being sequenced in a single read. SAGE libraries constructed in this manner are typically sequenced to depths of tens of thousands of tags [18]. The application of next-generation sequencing to SAGE [6] and other tag-based approaches [19] has allowed for the cost-effective sequencing of millions of tags. This increased sequencing depth relative to typical LongSAGE libraries has increased the dynamic range of detectable tags, allowing for a more comprehensive profiling of the polyadenylated fraction of the transcriptome, and the detection of rare transcripts at levels that allow for statistically meaningful comparisons between biological states. While it is technically possible to generate LongSAGE libraries with 10 million tags by simply increasing the amount of Sanger sequencing conducted, the cost (US$400 000) of creating such a library is highly prohibitive. In contrast, Tag-seq libraries of this size can be routinely generated for approximately US$2500 each, facilitating large-scale and cost-effective creation of gene expression data sets. The added depth provided by the Tag-seq approach results in an additional 48.3% of expressed genes detectable at depths greater than those of a typical (100 000 tags) LongSAGE library [6].

Compared to RNA-seq [20,21], another protocol implemented on the Illumina platform, Tag-seq has comparable dynamic range and de novo transcript discovery and quantification capabilities [6]. In contrast, RNA-seq libraries generate reads spanning whole transcripts, and are thus considerably more informative of transcriptional start and end sites, and alternative splicing events. Two recently developed computational tools have attempted to deduce read strand of origin from RNA-seq data [22,23]. These tools rely on splice site sequences and read-pair information to assign a subset of reads to the positive or negative strands, thus generating a semiquantitative measure of SAS transcription. However, significant development of these algorithms will be required in order to close the gap between their inferred estimates of SAS transcription and the precise digital counts generated by Tag-seq. As a result, RNA-seq and Tag-seq libraries provide complementary information regarding expressed transcripts.

13.7 Perspectives

The increase in sampling depth provided by Tag-seq results in significantly improved detection and quantification of low-abundance transcripts relative to LongSAGE and to microarrays. Specifically, the large profiling depths achieved by Tag-seq lead to increased statistical confidence in measuring differential gene expression across conditions of interest, such as: pathogen-challenged versus unchallenged immune systems in fishes [24,25] and plants [26]; nematodes at various stages of the lifecycle [27]; tissues harboring specific gene isoforms or mutations, in mammals [28] and plants [29]; human, chimp, and macaque neural tissues [30]; and cancerous versus normal tissues [6,31,32]. Respectively, these studies have identified genes involved in pathogen resistance and longevity, transcriptional targets of specific gene variants, evolutionarily conserved as well as species-specific genes related to neuronal signaling and energy metabolism, and differentially expressed genes relevant to cancer biology. Tag-seq is well suited to profiling transcription in organisms that lack either a well-annotated transcriptome or genome. Since the method can be used to profile any polyadenylated mRNA, it lends itself to the detection of previously uncharacterized noncoding genes [24,25,30] and antisense transcripts [6,30,31]. In organisms where there is no annotated transcriptome, however, it is desirable to know not only the expression level of novel transcripts, but also their sequence. One strategy addressing this challenge has been to create Tag-seq libraries in conjunction with RNA-seq libraries from the same mRNA samples [24,25]. The initial step of assembling the RNA-seq reads into contigs generates consensus sequences that essentially correspond to a novel transcriptome. Tag-seq tags can then be mapped against this transcriptome, providing digital expression counts of those sequences and enabling subsequent analyses of differential gene expression [24,25].
Tag-seq libraries have also been analyzed in conjunction with chromatin immunoprecipitation (ChIP)-seq data [33]. The focus of that study was on (i) discriminating functionally active transcription factor binding events in mouse pancreatic islets and liver tissues, and (ii) on determining how that activity could be altered by specific epigenetic marks (including mono- and trimethylation of histone 3 lysine 4, i.e., H3K4me1 and H3K4me3). ChIP-seq was used to measure transcription factor occupancy, as well as genome-wide localization of the histone marks. These data were analyzed in conjunction with Tag-seq libraries, which provided information on transcriptionally active loci. Together, the results identified classes of loci that were active, poised for activation, or showed pioneer-like transcription factor activity, as well as the patterns of transcription factor occupancy and histone modifications distinguishing each class of genes [33].
Whether used alone or in concert with other methods, Tag-seq is a highly versatile component of the current genomic toolset, and provides a cost-effective and efficient means to generate comprehensive digital expression counts. One important aspect of this method is that Tag-seq libraries created in different labs and at different timepoints can be directly compared [6,27,32]. Thus, databases of Tag-seq libraries representing a variety of tissues and cell line samples have already been created, for instance from normal and cancer tissues in the Cancer Genome Anatomy Project (cgap.nci.nih.gov), and represent a valuable gene-expression resource to the research community.
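
The digital counting step described in this section (mapping tags to a reference transcriptome and tallying them per transcript) can be sketched in a few lines of Python. This is a minimal illustration only: the tag-to-transcript lookup table, the function names, and the tags-per-million scaling used to put libraries of different depth on a common scale are assumptions made here, not the pipeline used in the cited studies.

```python
from collections import Counter


def digital_counts(tags, tag_to_transcript):
    """Tally sequenced tags per transcript.

    tags: iterable of tag sequences read from a library.
    tag_to_transcript: hypothetical lookup from tag sequence to transcript name
                       (e.g., built from assembled contigs).
    Tags with no entry in the lookup are ignored.
    """
    counts = Counter()
    for tag in tags:
        transcript = tag_to_transcript.get(tag)
        if transcript is not None:
            counts[transcript] += 1
    return counts


def tags_per_million(counts):
    """Scale raw counts to tags per million mapped tags (assumed normalization)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n * 1_000_000 / total for name, n in counts.items()}


if __name__ == "__main__":
    lookup = {"AAAA": "transcript_1", "CCCC": "transcript_2"}  # toy example
    library = ["AAAA", "AAAA", "CCCC", "GGGG"]
    print(tags_per_million(digital_counts(library, lookup)))
```

Because each tag is counted as a whole unit, the counts from two libraries can be compared transcript by transcript once they are scaled to the same depth.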

Acknowledgments

This study received funding support from the British Columbia Cancer Foundation, Genome British Columbia, and Genome Canada. M.A.M. is a Scholar of the Michael Smith Foundation for Health Research. All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

References

1 Cheng, J., Kapranov, P., Drenkow, J., Dike, S. et al. (2005) Science, 308, 1149–1154.
2 Riken Genome Exploration Research Group, Genome Science Group and the FANTOM Consortium (2005) Science, 309, 1564.
3 Chen, J., Sun, M., Lee, S., Zhou, G. et al. (2002) Proc. Natl. Acad. Sci. USA, 99, 12257–12262.
4 Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. (1995) Science, 270, 484–487.
5 Bentley, D.R. (2006) Curr. Opin. Genet. Dev., 16, 545–552.
6 Morrissy, A.S., Morin, R.D., Delaney, A., Zeng, T. et al. (2009) Genome Res., 19, 1825–1835.
7 Saha, S., Sparks, A.B., Rago, C., Akmaev, V. et al. (2002) Nat. Biotechnol., 20, 508–512.
8 Wheeler, D.L., Church, D.M., Federhen, S., Lash, A.E. et al. (2003) Nucleic Acids Res., 31, 28–33.
9 Gerhard, D.S., Wagner, L., Feingold, E.A., Shenmen, C.M. et al. (2004) Genome Res., 14, 2121–2127.
10 Siddiqui, A.S., Khattra, J., Delaney, A.D., Zhao, Y. et al. (2005) Proc. Natl. Acad. Sci. USA, 102, 18485–18490.
11 Khattra, J., Delaney, A.D., Zhao, Y., Siddiqui, A. et al. (2007) Genome Res., 17, 108–116.
12 Heidenblut, A.M., Luttges, J., Buchholz, M., Heinitz, C. et al. (2004) Nucleic Acids Res., 32, e131.
13 Peters, D.G., Kassam, A.B., Yonas, H., O’Hare, E.H. et al. (1999) Nucleic Acids Res., 27, e39.
14 Kodzius, R., Kojima, M., Nishiyori, H., Nakamura, M. et al. (2006) Nat. Methods, 3, 211–222.
15 Gowda, M., Jantasuriyarat, C., Dean, R.A., and Wang, G.L. (2004) Plant Physiol., 134, 890–897.
16 Wei, C.L., Ng, P., Chiu, K.P., Wong, C.H. et al. (2004) Proc. Natl. Acad. Sci. USA, 101, 11701–11706.
17 Matsumura, H., Reich, S., Ito, A., Saitoh, H. et al. (2003) Proc. Natl. Acad. Sci. USA, 100, 15718–15723.
18 Boon, K., Osorio, E.C., Greenhut, S.F., Schaefer, C.F. et al. (2002) Proc. Natl. Acad. Sci. USA, 99, 11287–11292.
19 Valen, E., Pascarella, G., Chalk, A., Maeda, N. et al. (2009) Genome Res., 19, 255–265.
20 Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. (2008) Genome Res., 18, 1509–1517.
21 Rosenkranz, R., Borodina, T., Lehrach, H., and Himmelbauer, H. (2008) Genomics, 92, 187–194.
22 Guttman, M., Garber, M., Levin, J.Z., Donaghey, J. et al. (2010) Nat. Biotechnol., 28, 503–510.
23 Trapnell, C., Williams, B.A., Pertea, G., Mortazavi, A. et al. (2010) Nat. Biotechnol., 28, 511–515.
24 Mu, Y., Ding, F., Cui, P., Ao, J. et al. (2010) BMC Genomics, 11, 506.
25 Xiang, L.X., He, D., Dong, W.R., Zhang, Y.W., and Shao, J.Z. (2010) BMC Genomics, 11, 472.
26 Wu, J., Zhang, Y., Zhang, H., Huang, H. et al. (2010) BMC Plant Biol., 10, 234.
27 Ruzanov, P., and Riddle, D.L. (2010) Nucleic Acids Res., 38, 3252–3262.
28 Hoen, P.A., Ariyurek, Y., Thygesen, H.H., Vreugdenhil, E. et al. (2008) Nucleic Acids Res., 36, e141.
29 Eveland, A.L., Satoh-Nagasawa, N., Goldshmidt, A., Meyer, S. et al. (2010) Plant Physiol., 154, 1024–1039.
30 Babbitt, C.C., Fedrigo, O., Pfefferle, A.D., Boyle, A.P. et al. (2010) Genome Biol. Evol., 2, 67–79.
31 Wu, Z.J., Meyer, C.A., Choudhury, S., Shipitsin, M. et al. (2010) Genome Res., 20, 1730–1739.
32 Kavak, E., Unlu, M., Nister, M., and Koman, A. (2010) Nucleic Acids Res., 38, 7008–7021.
33 Hoffman, B.G., Robertson, G., Zavaglia, B., Beach, M. et al. (2010) Genome Res., 20, 1037–1051.

14 Isolation of Active Regulatory Elements from Eukaryotic Chromatin Using FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements)

Paul G. Giresi and Jason D. Lieb

Abstract

The binding of sequence-specific regulatory factors and the recruitment of chromatin remodeling activities cause nucleosomes to be evicted from chromatin in eukaryotic cells. Traditionally, these active sites have been identified experimentally through their sensitivity to nucleases. Here, we describe the details of a simple procedure for the genome-wide isolation of nucleosome-depleted DNA from human chromatin, termed FAIRE (formaldehyde-assisted isolation of regulatory elements). We also provide protocols for the preparation of FAIRE-enriched DNA for various types of detection, including use of the polymerase chain reaction, DNA microarrays, and next-generation sequencing. FAIRE works on all eukaryotic chromatin tested to date. To perform FAIRE, chromatin is cross-linked with formaldehyde, sheared by sonication, and phenol/chloroform-extracted. Most genomic DNA is cross-linked to nucleosomes and is sequestered to the interphase, whereas DNA recovered in the aqueous phase corresponds to nucleosome-depleted regions of the genome. The isolated regions are largely coincident with the location of DNase I-hypersensitive sites, transcriptional start sites, enhancers, insulators, and active promoters. Given its speed and simplicity, FAIRE has utility in establishing chromatin profiles of diverse cell types in health and disease, isolating DNA regulatory elements en masse for further characterization, and as a screening assay for the effects of small molecules on chromatin organization.

14.1 Introduction

In eukaryotes, packaging of DNA into chromatin reduces the accessibility of genetic information to the set of proteins involved in regulating DNA-templated processes such as transcription. Successful orchestration of DNA-dependent processes is achieved in part by regulating the stability of nucleosomes at these sites [1–3]. Here, “stability” refers to the probability of an intact nucleosome at a given nucleotide position versus a nucleosome in an absent or disrupted state at that position. Several mechanisms exist to modulate nucleosome stability, including competition with sequence-specific factors [4–7], ATP-dependent nucleosome remodeling complexes [8–10], and post-translational modifications of the histone tails [11–14]. Nucleosome stability at any given locus is governed by a combination of factors acting in concert, which results in a context-specific set of DNA elements bound by regulatory factors for each cell type. Traditionally, active regulatory elements have been identified by their increased sensitivity to nuclease digestion, such as DNase I [15–20]. Typically, this involves subjecting isolated nuclei to a mild nuclease treatment, followed by detection

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and Günter Kahl. © 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.


Fig. 14.1 FAIRE Procedure. (a) The FAIRE procedure described in the text is shown on the left, while preparation of the reference or input sample is shown on the right. The DNA recovered from the aqueous phase of each extraction can then be used to identify sites of open chromatin using qPCR, tiling microarrays, or high-throughput sequencing applications. (b) For qPCR, a series of primers, depicted as convergent arrows, are designed to span a genomic region of interest. Sites of open chromatin are highlighted in blue, with qPCR results depicted above. Amplicons that span or are near the boundaries of open chromatin often result in lower relative enrichment due to shearing of DNA fragments, as shown by asterisks. (c) Microarrays. Typically we use high-resolution microarrays that tile either regions of interest or the entire genome of an organism with 50- to 70-bp oligonucleotides. (d) High-throughput sequencing technologies can be used to map the DNA fragments back to the reference genome.

[Fig. 14.1 panels. (a) FAIRE: cells cross-linked with formaldehyde, sheared by sonication, phenol/chloroform extraction; reference chromatin: not cross-linked, sheared by sonication, phenol/chloroform extraction. (b) qPCR. (c) DNA microarrays. (d) High-throughput sequencing.]

using Southern blots to identify nuclease hypersensitive sites. Several groups have recently adapted the procedure for genome-wide detection with DNA microarrays or next-generation sequencing [21–24]. However, the requirement for a clean nuclei preparation from a single-cell suspension and the need for laborious enzyme titrations mean that it is difficult to perform DNase hypersensitivity assays on solid tissues, on a limited number of cells, or in parallel on many different samples. Here, we describe an alternative strategy for genome-wide isolation of active regulatory elements termed FAIRE (formaldehyde-assisted isolation of regulatory elements). It is a simple, high-throughput procedure to isolate and map genomic regions depleted of nucleosomes. The procedure involves cross-linking proteins to DNA using formaldehyde, shearing the chromatin, and performing a phenol/chloroform extraction. The genomic regions preferentially segregated into the aqueous phase are then mapped back to the genome by hybridization to tiling microarrays or are read directly using next-generation DNA sequencing (Figure 14.1). Quantitative polymerase chain reaction (qPCR) can be used to assay individual loci, which is useful when screening many cell or tissue types. The relatively straightforward nature and tractability of FAIRE give it broad utility for the genome-wide detection of active regulatory elements across all eukaryotic species, in clinical samples, and for high-throughput screens. FAIRE was first demonstrated in Saccharomyces cerevisiae [25]. In yeast, the genomic regions immediately upstream of genes were preferentially segregated into the aqueous phase, in a manner that was strongly negatively correlated with nucleosome occupancy [26,27].
Subsequent studies demonstrated that FAIRE efficiently isolated nucleosome-depleted regions of the Homo sapiens genome, which included both transcription start sites and distal regulatory elements such as enhancers and silencers [28] (Figure 14.2). Results from both yeast and human found that enrichment of the upstream regions of genes was positively correlated with transcription of the downstream gene. However, in human cells the vast majority of sites identified were far from any annotated gene. For the majority of these distal sites, it is not yet possible to ascribe a function, identify what factors

[Fig. 14.2 browser view, chr19:59335000–59385000. Tracks from top to bottom: FAIRE-seq, lymphoblastoid; FAIRE-chip, lymphoblastoid; DNase-seq, CD4; H3K4me1 ChIP-seq, CD4; H3K4me2 ChIP-seq, CD4; H3K4me3 ChIP-seq, CD4. Genes shown: CNOT3, LENG1, TMC4, MBOAT7, TSEN34.]

Fig. 14.2 FAIRE data. DNA isolated by FAIRE in human lymphoblastoid cells was mapped to the genome using both the Illumina GAII (black) and NimbleGen tiling microarrays (red). A 60-kb region of chromosome 19 is displayed in the UCSC genome browser. For sequencing data, the number of extended reads overlapping each base is plotted (see text). The FAIRE microarray data (red) are plotted as z-scores (see text). Also shown are DNase I hypersensitivity (blue) [37], and H3K4 mono-, di-, and trimethylation from human CD4+ cells [38]. Each of these datasets is represented as the density estimates from fseq. Black arrows represent the UCSC Known Genes [39], with arrowheads indicating the direction of transcription. The FAIRE data colocalize with transcriptional start sites and DNase I hypersensitive sites, and are adjacent to histone modifications indicative of active 5′ ends of genes.

might be bound, or determine the genes being regulated by each regulatory element. The enrichment of regulatory regions in the aqueous phase is thought to result from the very high cross-linking efficiency of histone proteins to DNA versus the lower efficiency of cross-linking sequence-specific proteins to DNA. This difference in cross-linking efficiency is likely due in part to formaldehyde’s very short cross-linking distance. Formaldehyde is a small molecule (HCHO), and cross-links are only formed between proteins and DNA in direct contact. There are approximately 10–15 histone–DNA interactions within a nucleosome that serve as potential cross-linking sites [29]. However, for most DNA-binding proteins there are far fewer potential cross-linking sites. The average binding sites are 5–15 bp [30], with only a few of the bases close enough to the protein contacts to be cross-linked [31]. In addition, formaldehyde requires an ε-amino group, such as occurs on lysine, to form a cross-link [32,33]. Approximately 10% of the amino acids of histones are lysine, a much higher proportion than in a typical protein. Due to both of these factors, nucleosomes are much more readily cross-linkable to DNA and are likely to dominate the cross-linking profile.
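
The FAIRE-seq track in Fig. 14.2 plots, for each base, the number of extended reads overlapping it. A minimal Python sketch of that per-base count is given here for orientation. The 0-based coordinates, the function name `extended_read_coverage`, and the default extension length of 200 bp are illustrative assumptions and are not values taken from this chapter.

```python
def extended_read_coverage(reads, region_start, region_end, extend_to=200):
    """Count extended reads overlapping each base of [region_start, region_end).

    reads: iterable of (position, strand); position is the 0-based 5' end of
           the read and strand is '+' or '-'.
    extend_to: assumed fragment length (hypothetical default) to which each
               read is extended in its 3' direction.
    Returns a list with one count per base of the region.
    """
    length = region_end - region_start
    diff = [0] * (length + 1)  # difference array: +1 at start, -1 past the end
    for pos, strand in reads:
        if strand == '+':
            start, end = pos, pos + extend_to
        else:
            start, end = pos - extend_to + 1, pos + 1
        # Clip the extended read to the region being plotted.
        start = max(start, region_start)
        end = min(end, region_end)
        if start < end:
            diff[start - region_start] += 1
            diff[end - region_start] -= 1
    coverage = []
    running = 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage


if __name__ == "__main__":
    toy_reads = [(0, '+'), (2, '+'), (9, '-')]
    print(extended_read_coverage(toy_reads, 0, 10, extend_to=4))
```

The difference-array approach keeps the cost proportional to the number of reads plus the region length, which matters when tens of millions of reads are processed.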

14.2 Methods and Protocols

14.2.1 FAIRE Procedure

The following provides a general framework for performing FAIRE, which specifically emphasizes performing FAIRE on cells grown in culture. Modifications to the FAIRE protocol for tissue samples are noted throughout this section.

14.2.2 Optimization of the FAIRE Procedure

The two critical steps in FAIRE that should be optimized when first starting out or working with a new organism are cross-linking time and sonication. The optimal incubation time for formaldehyde cross-linking can vary widely between organisms (5 min for human versus 30 min for yeast). If chromatin immunoprecipitation (ChIP) has been performed for the organism or sample type, the cross-linking time used for ChIP is best treated as the maximum incubation time; shorter incubation times are typically optimal. For sonication, the exact parameters will vary based on the actual machine used. However, factors that will affect sonication efficiency include the density of cells in solution, the total volume of solution to be sonicated, and the power setting. Sonication conditions should be optimized to deliver the proper size range of DNA fragments (100–1000 bp) with the fewest cycles, while avoiding power settings so high that they excessively heat the sample or cause foaming. Optimized conditions based on the recommended equipment are provided in this section.

14.2.3 Equipment and Reagents

Reagents
• 37% Formaldehyde
• 2.5 M Glycine
• 1× PBS (phosphate-buffered saline)
• Phenol/chloroform/isoamyl alcohol (IAA) (25 : 24 : 1 saturated with 10 mM Tris, pH 8.0, 1 mM EDTA; Sigma; cat. no. P3803)
• Chloroform/IAA (24 : 1; Fluka BioChemika; cat. no. 25 666)
• Ethanol 95 and 70%
• 3 M Sodium acetate (pH 5.2)
• 20 mg/ml Glycogen
• 10 mM Tris–HCl (pH 7.4)
• 10 mg/ml of Proteinase K
• 10 mg/ml of RNase A

Equipment for all FAIRE Applications
• Mini-BeadBeater-8 (Note: required for applications in yeast and highly recommended for tissue samples, but can be substituted with the “Alternative lysis procedure” for cell culture (see below).)
• Tubes with metal beads (cat. no. MoBio 13117-50)
• Sonicator
  - Branson Sonifier 450 (or equivalent)
  - Bioruptor UCD-200 (Diagenode)

Equipment for FAIRE in Tissues
• Tissue pulverizer (cat. no. Biospec 59012N)
• 15-ml Conical tissue grinder (VWR; cat. no. 47732-446)

Lysis Buffers
• FAIRE lysis buffer
  2% Triton X-100
  1% SDS
  100 mM NaCl
  10 mM Tris-Cl, pH 8.0
  1 mM EDTA

• Alternative lysis procedure buffers

Buffer 1
  1 M HEPES-KOH, pH 7.5        5.0 ml
  5 M NaCl                     2.8 ml
  0.5 M EDTA (pH 8.0)          0.2 ml
  100% Glycerol                10.0 ml
  100% NP-40                   5.0 ml
  100% Triton X-100            0.25 ml
  dH2O                         76.7 ml

Buffer 2
  5 M NaCl                     4.0 ml
  0.5 M EDTA (pH 8.0)          0.2 ml
  0.5 M EGTA (pH 8.0)          0.1 ml
  1 M Tris (pH 8.0)            1.0 ml
  dH2O                         94.7 ml

Buffer 3
  0.5 M EDTA (pH 8.0)          0.2 ml
  0.5 M EGTA (pH 8.0)          0.1 ml
  1 M Tris (pH 8.0)            1.0 ml
  5 M NaCl                     2.0 ml
  10% Sodium deoxycholate      1.0 ml
  N-Lauroyl sarcosine          500 mg
  50× Protease Inhibitor       1.0 ml
  dH2O                         94.7 ml
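
As a cross-check on the recipes above, the final concentration of each component follows from stock concentration × volume added ÷ total volume. The Python sketch below applies this to Buffer 2 as listed; the function name and the data layout are illustrative assumptions.

```python
def final_concentrations(recipe):
    """Compute total volume and final concentrations for a buffer recipe.

    recipe: list of (name, stock_concentration, volume_ml); the diluent
            (water) is given a stock concentration of None.
    Returns (total_volume_ml, {name: final_concentration}) with final
    concentrations in the same unit as the stock concentrations.
    """
    total = sum(volume for _, _, volume in recipe)
    finals = {
        name: conc * volume / total
        for name, conc, volume in recipe
        if conc is not None
    }
    return total, finals


# Buffer 2 as listed in the table (stock concentrations in mol/l).
buffer_2 = [
    ("NaCl", 5.0, 4.0),
    ("EDTA", 0.5, 0.2),
    ("EGTA", 0.5, 0.1),
    ("Tris", 1.0, 1.0),
    ("dH2O", None, 94.7),
]

total_ml, finals = final_concentrations(buffer_2)

if __name__ == "__main__":
    print(f"total volume: {total_ml:.1f} ml")
    for name, molar in finals.items():
        print(f"{name}: {molar * 1000:.1f} mM")
```

For Buffer 2 this works out to a total of about 100 ml containing approximately 200 mM NaCl, 1 mM EDTA, 0.5 mM EGTA, and 10 mM Tris.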

Protocol

1. Cross-linking samples
Note: It is highly recommended that tissue samples be broken up prior to cross-linking. Increasing the surface area results in much better cross-linking efficiency. If the recommended tissue pulverizer is available, a few strikes work well on most tissue types; the user is cautioned not to overload the pulverizer. If the tissue pulverizer is not available, mincing with a scalpel or razor blade can be used, although this is not ideal for all tissue types.
1.1 Add 37% formaldehyde directly to media to a final concentration of 1% and incubate at 25 °C with shaking at 80 rpm. If using human tissues, resuspend in 1× PBS prior to the addition of formaldehyde. Incubation times:
   Human cell culture        5 min
   Human tissue samples      7 min
   Yeast                     30 min
1.2 Add 2.5 M glycine to a final concentration of 125 mM and incubate for 5 min at 25 °C with shaking. (Note: For adherent cells, remove media containing formaldehyde and glycine, add ice-cold 1× PBS directly to the plate, scrape the plate, and collect in a tube for centrifugation.)
1.3 Spin at 900 × g for 5 min at 4 °C.
1.4 Wash samples twice with ice-cold 1× PBS, spinning down (900 × g for 5 min at 4 °C) samples after each wash and removing supernatant.
Note: At this point samples can be snap frozen and stored at –80 °C.
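
The volumes of stock needed in steps 1.1 and 1.2 depend on the culture volume, because the stock is added directly to the media and therefore increases the total volume. A minimal Python sketch of that calculation follows; the function name and the 10-ml culture volume are hypothetical and chosen only for illustration.

```python
def volume_to_add(stock_conc, final_conc, start_volume):
    """Volume of stock to add to start_volume so the mixture is at final_conc.

    Solves stock_conc * v = final_conc * (start_volume + v) for v.
    Both concentrations must share one unit; the result has the unit of
    start_volume.
    """
    if final_conc >= stock_conc:
        raise ValueError("final concentration must be below the stock concentration")
    return final_conc * start_volume / (stock_conc - final_conc)


media_ml = 10.0  # hypothetical culture volume

# Step 1.1: 37% formaldehyde stock to a final concentration of 1%.
formaldehyde_ml = volume_to_add(37.0, 1.0, media_ml)

# Step 1.2: 2.5 M glycine stock to a final concentration of 125 mM (0.125 M),
# added to the media that already contains the formaldehyde.
glycine_ml = volume_to_add(2.5, 0.125, media_ml + formaldehyde_ml)

if __name__ == "__main__":
    print(f"37% formaldehyde: {formaldehyde_ml * 1000:.0f} µl")
    print(f"2.5 M glycine:    {glycine_ml * 1000:.0f} µl")
```

Accounting for the added volume matters little for the formaldehyde step but is easy to include, and it keeps the glycine quench at the stated concentration.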

2. Cell lysis (if frozen, thaw cells on ice)
Note: For tissue samples, prior to the addition of lysis buffer it is highly recommended to grind frozen samples into a powder. If using the recommended 15-ml tissue grinding tubes, place the sample at the bottom of the tube in the grooved conical portion, screw the pestle top on the tube, and place in a liquid nitrogen bath to freeze the sample. Once frozen, remove from the bath and grind the sample into a powder. The ground sample can be recovered by adding lysis buffer directly to the tube. Proceed with cell lysis as described below. If the 15-ml tissue grinding tubes are not available, either a traditional mortar and pestle or a glass dounce can be used. In our experience these can either be difficult to work with, especially for small samples, or have variable efficiency across tissue types.
2.1 Resuspend cells in 1 ml of lysis buffer per 10^7 cells (or 0.4 g).
2.2 Add 1 ml of cells resuspended in lysis buffer to 2-ml tubes containing metal beads.
2.3 Lyse cells in the Mini-BeadBeater-8 for five 2-min sessions at 4 °C, placing cells on ice for 2 min between sessions.
2.4 Recover lysate by pipetting the solution into a 15-ml tube; if the sample was lysed in multiple 2-ml tubes, these can be combined into a single 15-ml tube up to a maximum of 2 ml. (Note: Alternatively, the lysate can be recovered by puncturing the bottom of the 2-ml tube with a 25G syringe and draining into a 15-ml tube on ice. A hole can be made in the 15-ml caps to fit 2-ml tubes and lysate can be recovered by slow centrifugation at 4 °C.)
2b. Alternative cell lysis
Note: This lysis procedure is similar to what is used for human ChIP and can be used for experiments where both applications will be performed. We have found that this procedure is only effective for cell culture samples and does not work for tissues or yeast samples.
2.b.1 Add 10 ml of Buffer 1 per 10^8 cells and rock at 4 °C for 10 min.
2.b.2 Spin cells at 1300 × g for 5 min at 4 °C and remove supernatant.
2.b.3 Resuspend pellet in 10 ml of Buffer 2 per 10^8 cells and rock at room temperature for 10 min.
2.b.4 Spin cells at 1300 × g for 5 min at 4 °C and remove supernatant.
2.b.5 Resuspend pellet in 3.5 ml of Buffer 3 per 10^8 cells.
3. Sonication
Note: Samples prepared using the alternative lysis procedure, from tissues, or containing visible material in solution should be sonicated using a microtip, which is more efficient than the Bioruptor at getting everything into solution.
3.1 Place sonicator tip into the 15-ml tube until the tip is submerged approximately halfway in 2 ml lysis solution; using 25% amplitude, sonicate samples for 10 cycles of 30 s (1 s ON, 0.5 s OFF), placing on ice for 2 min between cycles.

(Note: For the Bioruptor, we have found the 1.5-ml tube adapter works best; transfer 300 µl aliquots to 1.5-ml tubes and sonicate in the Bioruptor for 15 min on HIGH using 30-s pulses and 30 s of rest, keeping the water bath at a constant 4 °C.)
3.2 Remove a 200-µl aliquot to check fragment size on a 1% agarose gel; the remaining sample should be kept at 4 °C. Check DNA fragment size as follows:
3.2.1 Spin at 15 000 × g for 5 min at 4 °C to clear cellular debris. Transfer the supernatant to a new tube.
3.2.2 Add 1 µl of 10 mg/ml Proteinase K and incubate at 55 °C for 1 h and then incubate at 65 °C for 4 h to overnight.
3.2.3 Add an equal volume of phenol/chloroform, mix, and spin at 12 000 × g for 5 min.
3.2.4 Transfer aqueous phase to a new tube, add an equal volume of chloroform/IAA, mix, and spin at 12 000 × g for 5 min.
3.2.5 Transfer aqueous phase to a new tube, add 1 µl of 10 mg/ml RNase A, and incubate at 37 °C for 1 h.
3.2.6 Add 1/10 volume of 3 M sodium acetate (pH 5.2) and 1 µl of 20 mg/ml glycogen, mix by inverting, add 2 volumes of 95% ethanol, and incubate at –20 °C for 1 h.
3.2.7 Pellet DNA at 15 000 × g for 10 min at 4 °C, wash with 500 µl 70% ethanol, and spin at 15 000 × g for 5 min at room temperature (25 °C).
3.2.8 Dry pellet, resuspend in 10 µl 10 mM Tris–HCl (pH 7.4), and run on a 1% agarose gel.
3.2.9 If the majority of DNA fragments are between 100 and 1000 bp, proceed with phenol/chloroform extraction; otherwise perform additional cycles of sonication to achieve the desired size range.
3.2.10 If DNA is of the correct size and there is at least 100 ng remaining, set aside as an input DNA sample.
4. Phenol/chloroform extraction
4.1 Spin the remaining extract at 15 000 × g for 5 min at 4 °C to clear cellular debris and transfer the supernatant to a new tube. (Note: If there is not sufficient DNA remaining after checking fragment size, remove a 200-µl aliquot and perform the procedure for checking DNA fragment size (above), omitting the 1% agarose gel step.)
4.2 Add an equal volume of phenol/chloroform, vortex, and spin at 12 000 × g for 5 min; transfer aqueous phase to a new tube. (Note: If the aqueous phase is small, add 500 µl of TE to the interphase, vortex, spin down, and recover the aqueous phase.)
4.3 Add an equal volume of phenol/chloroform to the aqueous phase in a fresh tube, vortex, spin down, and transfer aqueous phase to a fresh tube.
4.4 Add an equal volume of chloroform/IAA, vortex, spin at 12 000 × g for 5 min, and transfer aqueous phase to a new tube.
5. DNA precipitation
5.1 Add 1/10 volume of 3 M sodium acetate (pH 5.2) and 1 µl of 20 mg/ml glycogen, mix by inverting, and add 2 volumes of 95% ethanol. Incubate at –20 °C for 1 h to overnight.
5.2 Pellet precipitated DNA by spinning at 15 000 × g for 30 min at 4 °C and remove supernatant.
5.3 Wash pellet with 500 µl 70% ethanol, spin at 15 000 × g for 5 min at room temperature (25 °C), remove supernatant, and dry pellet in a SpeedVac.
5.4 Resuspend pellet in 50 µl 10 mM Tris–HCl (pH 7.4).
5.5 Add 1 µl of 10 mg/ml RNase A and incubate at 37 °C for 1 h.


5.6 Clean up the sample using a spin column (must be able to recover 75–200 bp DNA) or perform an additional phenol/chloroform extraction and ethanol precipitation. (Note: We have found that this step is necessary to achieve accurate spectrophotometric measurements of samples.)

14.2.4 Detection of FAIRE DNA

qPCR

qPCR is used both as a method for detecting open chromatin sites and as a means to validate sites identified using either DNA microarray or high-throughput sequencing data. There are several considerations when designing qPCR experiments, including selection of an appropriate set of reference regions, exact primer localization, and methods for quantitation of the results.
It is important to select an appropriate set of reference regions since these will be used to calculate relative enrichment for all other sites tested. This can be difficult due to the limited knowledge of “gold standard” sites of closed chromatin available for most species. Even for cells in which sites of closed chromatin have been mapped, these may be limited to a specific growth condition. Therefore we often use a tiling approach (Figure 14.1b) for detection of open chromatin sites using qPCR. Here, primer pairs are designed such that the products are either overlapping or closely spaced across the genomic regions being interrogated. The reference regions are those primer sets flanking the regions isolated by FAIRE. This strategy is also useful for validating results from microarray and sequencing data, which requires a set of positive and negative sites to determine both sensitivity and specificity.
Primer design is also critical for obtaining accurate results from qPCR, since primer pairs spanning or near the edges of open chromatin sites may be able to detect only a subset of the DNA fragments isolated in the aqueous phase (Figure 14.1b). Optimally, primer pairs should be designed to amplify 60- to 100-bp products within the central portion of the identified regions.
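
The tiling layout described above can be drafted programmatically before any primers are designed. The Python sketch below only proposes amplicon windows of a fixed size at a fixed step across a region; it does not design primers. The function name, the 80-bp product size (chosen from within the 60- to 100-bp range given above), and the 100-bp step are illustrative assumptions.

```python
def tile_amplicons(region_start, region_end, product_size=80, step=100):
    """Propose amplicon windows tiled across [region_start, region_end).

    Each window is a half-open (start, end) interval of product_size bp.
    A step smaller than product_size gives overlapping products; a larger
    step gives closely spaced, non-overlapping products.
    """
    if product_size <= 0 or step <= 0:
        raise ValueError("product_size and step must be positive")
    windows = []
    start = region_start
    while start + product_size <= region_end:
        windows.append((start, start + product_size))
        start += step
    return windows


if __name__ == "__main__":
    # Hypothetical 400-bp region tiled with 80-bp products every 100 bp.
    for window in tile_amplicons(1000, 1400):
        print(window)
```

Windows at the two ends of such a tiling, outside the FAIRE-enriched region, would play the role of the flanking reference amplicons.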
We typically calculate the relative enrichment for each amplicon using the comparative CT method [34]. Here, a ratio is calculated using the signal from the FAIRE sample relative to the signal from DNA prepared from an uncross-linked sample. All ratios are then normalized to the amplicon with the lowest ratio, which is typically from the reference regions. Relative quantitation is used in part because FAIRE enriches for mitochondrial DNA; since the mitochondrial content can vary considerably between cells, it is difficult to get an accurate measurement of the proportion of genomic DNA enriched in each of the FAIRE samples.

Detection by DNA Microarray

High-quality FAIRE data have been obtained from several microarray platforms, including Agilent, NimbleGen (Roche), and polymerase chain reaction (PCR)-based arrays. Any microarray platform will suffice, but there are several factors to consider, such as the type of probe, the genomic regions covered, and the resolution [35]. One of the most important for FAIRE is selecting a microarray design with sufficient resolution (Figure 14.1c). For oligonucleotide (50–75 bp) tiling microarrays, probe-to-probe spacing should not exceed 100 bp if possible. Wider spacing reduces the number of probes per FAIRE site to just one or two.

Sample Amplification by Ligation-Mediated PCR (LM-PCR)

1. Preparation of unidirectional linker
• Long linker sequence: GCGGTGACCCGGGAGATCTGAATTC
• Short linker sequence: GAATTCAGATC
Linker should be purified by high-performance liquid chromatography.
1.1 Mix 250 µl 1 M Tris–HCl (pH 7.9), 375 µl Long oligo (40 µM stock), and 375 µl Short oligo (40 µM stock).
1.2 Place 50-µl aliquots in Eppendorf tubes at 95 °C for 5 min, then 70 °C for 5 min, then cool to room temperature.
1.3 Transfer to 4 °C and allow to stand overnight (store at –20 °C).

2. Blunting DNA fragments and ligation of unidirectional linkers
2.1 Add 100 ng of FAIRE DNA and 100 ng of input DNA to two separate tubes, and bring the final volume up to 100 µl.
2.2 Add 12.1 µl of the following master mix to each tube:
   10× NEB 2                        11.0 µl
   BSA (10 mg/ml)                   0.5 µl
   dNTP mix (25 mM each)            0.4 µl
   T4 DNA polymerase (3 U/µl)       0.2 µl
2.3 Mix by pipetting and incubate at 12 °C for 20 min.
2.4 Purify DNA using a spin column, resuspend in 25 µl elution buffer and place on ice.
2.5 Add 25.2 µl cold ligase mix to each, mix by pipetting, and incubate overnight at 16 °C.
   ddH2O                            13.0 µl
   10× Ligase buffer                5.0 µl
   Unidirectional linkers (15 µM)   6.7 µl
   T4 DNA ligase                    0.5 µl

2.6 Purify DNA using a spin column, resuspend in 25 µl elution buffer and place on ice.
3. Amplification of DNA fragments
Note: To avoid potential jackpot effects introduced during PCR, two amplification reactions are carried out in parallel for each sample: two amplifications for FAIRE DNA and two amplifications for the input DNA sample.
3.1 Add 73 µl of the following master mix to each tube:
   10× PCR buffer                   10.0 µl
   ddH2O                            57.5 µl
   25 mM dNTP mix                   1.0 µl
   Oligo Long (40 µM stock)         2.5 µl
   Taq (5 U/µl)                     2.0 µl
   Some 10× PCR buffers require you to add MgCl2 to the buffer; adjust accordingly by reducing the ddH2O volume.
3.2 Amplify samples using the following parameters:
   55 °C     2 min
   72 °C     5 min
   95 °C     2 min
   95 °C     1 min   |
   55 °C     1 min   |  20 cycles
   72 °C     2 min   |
   72 °C     5 min
   4 °C      hold
3.3 Check 2 µl of the reaction mix on a 1% agarose gel (average size of 400 bp).
3.4 Add 100 µl of isopropanol, vortex, and incubate at room temperature for 10 min.
3.5 Spin at 12 000 × g for 10 min and discard supernatant.
3.6 Rinse pellet with 500 µl of ice-cold 70% ethanol and spin at 12 000 × g for 2 min.
3.7 Discard supernatant and dry in a SpeedVac.
3.8 Resuspend in 50 µl 10 mM Tris–HCl (pH 7.4).
For sample labeling and hybridization procedures follow the manufacturer’s recommended protocols.
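
As a quick check on the linker oligonucleotides listed in step 1 of the LM-PCR procedure above, the short oligo should anneal to one end of the long oligo. The Python sketch below tests whether the reverse complement of the short linker matches the 3′ end of the long linker and reports the single-stranded part of the long oligo that remains unpaired; the helper and variable names are illustrative.

```python
# Complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Linker sequences as given in step 1 (written 5' to 3').
LONG_LINKER = "GCGGTGACCCGGGAGATCTGAATTC"
SHORT_LINKER = "GAATTCAGATC"

rc_short = reverse_complement(SHORT_LINKER)

# True if the short oligo pairs with the 3' end of the long oligo.
anneals_at_3prime_end = LONG_LINKER.endswith(rc_short)

# Portion of the long oligo left single-stranded after annealing.
unpaired_5prime = LONG_LINKER[: len(LONG_LINKER) - len(SHORT_LINKER)]

if __name__ == "__main__":
    print("short linker anneals to 3' end of long linker:", anneals_at_3prime_end)
    print("unpaired 5' portion of long linker:", unpaired_5prime)
```

The same check is useful when ordering replacement oligos, since a single transcription error in either sequence would prevent the duplex from forming.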

14.2.5 High-Throughput Sequencing

Each of the high-throughput sequencing platforms utilizes a different sample preparation procedure. The procedure described here is used for the preparation of FAIRE DNA for the Illumina GAII (Figure 14.1d). Using the Illumina GAII you will typically need at least 25–30 million reads for human samples to have sufficient signal, currently this mean running 1–2 lanes of a flow cell. Although we describe a protocol for the Illumina GAII, any of the high-throughput technologies will work as long as there are a sufficient number of reads. 1. Blunting DNA fragments (Epicenter End-It DNA End-Repair Kit; cat. no. ER0720 $75) 1.1 Add 100 ng of FAIRE DNA to a tube and bring final volume to 34 ml with ddH2O. 1.2 Add 16 ml of the following master mix to each tube: 10 End-It Repair Buffer

5 ml

2.5 mM dNTP mix

5 ml

10 mM ATP

5 ml

End Repair Enzyme mix

1 ml

1.3 Incubate at room temperature for 45 min.
1.4 Clean-up using Qiagen PCR Purification Column. (Note: For the clean-up use 250 µl PBI buffer and spin at 10 000 rpm in a table-top centrifuge.)
1.5 Elute with 35 µl elution buffer (EB); final spin should be full speed.

2. Add A overhang (NEB Klenow exo- 50 U/µl; cat. no. M0212M)
2.1 Bring final volume to 43 µl with ddH2O and add 7 µl of the following master mix:

    10× NEB 2          5 µl
    10 mM dATP         1 µl
    Klenow 50 U/µl     1 µl

2.2 Incubate for 30 min at 37 °C.
2.3 Clean-up using Qiagen MinElute PCR Purification Column. (Note: For the clean-up use 250 µl PBI buffer and spin at 10 000 rpm in a table-top centrifuge.)
2.4 Elute with 11 µl elution buffer (EB); final spin should be full speed.


3. Ligation of Illumina adapters (Epicentre Fast-Link Kit; cat. no. LK11025)
Note: Illumina adapters should be diluted for samples of less than 500 ng. Adapters should be diluted 1:10 with ddH2O. For 500 ng use only 1 µl of undiluted adapters.
3.1 Bring final volume up to 21.5 µl with ddH2O and add 8.5 µl of the following master mix:

    10× Fast-Link Buffer             3.0 µl
    10 mM ATP                        1.5 µl
    Illumina adapters (dilute)       2.0 µl
    Fast-Link DNA Ligase (2 U/µl)    2.0 µl

3.2 Incubate at room temperature for 2 h.
3.3 Add 10 µl of the following master mix:

    ddH2O                            7.5 µl
    10× Fast-Link Buffer             1.0 µl
    10 mM ATP                        0.5 µl
    Fast-Link DNA Ligase (2 U/µl)    1.0 µl

3.4 Incubate 3 h to overnight at 16 °C.
3.5 Clean-up using Qiagen PCR Purification Column. (Note: For the clean-up use 250 µl PBI buffer and spin at 10 000 rpm in a table-top centrifuge.)
3.6 Elute with 37 µl elution buffer (EB); final spin should be full speed.

4. Amplification of samples (Stratagene PfuUltra II Fusion HS DNA Polymerase; cat. no. 600670)
Note: To avoid potential jackpot effects introduced during PCR, two amplification reactions are carried out in parallel for each FAIRE DNA sample.
4.1 Use 50 ng of FAIRE DNA, bring final volume up to 77 µl with ddH2O, and add 23 µl of the following master mix:

    Illumina primers                  2.0 µl
    10× PfuUltra II reaction buffer  10.0 µl
    2.5 mM dNTP                      10.0 µl
    Phusion polymerase                1.0 µl

4.2 Amplify samples using the following parameters:

    98 °C   30 s
    98 °C   20 s    }
    65 °C   30 s    } 12 cycles
    72 °C   30 s    }
    72 °C   5 min
    4 °C    hold

4.3 Clean-up using Qiagen MinElute Column. (Note: For the clean-up use 500 µl PBI buffer and spin at 10 000 rpm in a table-top centrifuge. If the indicator changes color, add the recommended amount of 3 M sodium acetate.)
4.4 Elute with 11 µl elution buffer (EB); final spin should be full speed.

5. Size selection of library
Sample loading buffer (3 µl loading buffer per 10 µl sample):

    50 mM Tris pH 8.0
    40 mM EDTA
    40% (w/v) sucrose

5.1 Add 1 µl of sample loading buffer to amplified FAIRE DNA eluted from column.
5.2 Run sample on 2% agarose gel at 120 V for 1 h.
5.3 Excise the brightest portion of the smear; the excised region can extend 100 bp from the brightest portion. (Note: If using a UV tray, minimize exposure of the gel to UV light by marking the excision regions with a razor and completing the excision with the UV light off; optionally, a trans-illumination tray can be used. Also, be careful not to excise the primer-dimers: there should be a gap between the sample smear and a lower-molecular-weight primer-dimer smear, which starts at roughly 100 bp.)
5.4 Purify using Qiagen Gel Extraction Column: weigh the gel slice, and use 6× QG Buffer and 2× isopropanol. (Note: Samples should be incubated at room temperature and NOT at the suggested 55 °C, which can induce a positive GC-bias.)
5.5 Elute with 51 µl elution buffer (EB).

14.3 Applications

Several aspects of FAIRE make it a powerful genome-wide approach for detecting functional in vivo regulatory elements in eukaryotes. It requires little treatment of cells prior to the addition of formaldehyde and involves only a few reagents: formaldehyde, phenol, chloroform, and ethanol. The successful application of FAIRE to a limited number of cells expands its utility beyond what other DNA accessibility assays can accomplish. This provides an opportunity to perform genome-wide assays of chromatin structure on tissue samples from patients or to grow cells in small-well plates to screen small molecules for chromatin effects. Additionally, since FAIRE recovers the complete DNA fragments at regulatory elements, it is possible to use this material directly in functional assays, such as with reporter vectors.

One of the major limitations of FAIRE is its low signal-to-noise ratio compared to other techniques for measuring accessible regions of the genome, such as DNase I assays. DNase I assays employ a strategy to enrich for the DNA fragments cleaved by the enzyme, either by size selection or by passing over a column [21,24], thus eliminating the recovery of random breaks and resulting in minimal background. In contrast, the background signal seen with FAIRE is due to incomplete cross-linking of proteins to DNA, so that incompletely cross-linked regions are recovered in the assay as well.

14.4 Perspectives

Genome-wide maps of active regulatory elements will allow a better understanding of how the availability of sequence-based regulatory elements is coordinated with the regulation of factors that utilize them in a given cellular environment. The emerging set of consortium-based datasets, such as those derived from the ENCODE project [36], will provide a foundation for understanding the relationships among these factors and be critical to constructing realistic models of gene regulation in eukaryotic cells. The next major challenge will be to functionally annotate the catalog of regulatory elements discovered across a diverse set of cell types, organisms, and disease states.

Acknowledgments

We thank members of the Lieb lab for discussions. Support for this work has been provided by grants from the National Human Genome Research Institute. All proprietary names and registered tradenames for all materials, equipment, software, and so on, are acknowledged throughout this chapter.

References

1 Abbott, D.W., Ivanova, V.S., Wang, X., Bonner, W.M., and Ausio, J. (2001) J. Biol. Chem., 276, 41945–41949.
2 Almer, A., Rudolph, H., Hinnen, A., and Horz, W. (1986) EMBO J., 5, 2689–2696.
3 Boeger, H., Griesenbeck, J., Strattan, J.S., and Kornberg, R.D. (2003) Mol. Cell, 11, 1587–1598.
4 Morse, R.H. (2000) Trends Genet., 16, 51–53.
5 Morse, R.H. (2003) Biochem. Cell Biol., 81, 101–112.
6 Polach, K.J. and Widom, J. (1995) J. Mol. Biol., 254, 130–149.
7 Yu, L. and Morse, R.H. (1999) Mol. Cell Biol., 19, 5279–5288.
8 Sudarsanam, P. and Winston, F. (2000) Trends Genet., 16, 345–351.
9 Tsukiyama, T. and Wu, C. (1995) Cell, 83, 1011–1020.
10 Varga-Weisz, P. (2001) Oncogene, 20, 3076–3085.
11 Dion, M.F., Altschuler, S.J., Wu, L.F., and Rando, O.J. (2005) Proc. Natl. Acad. Sci. USA, 102, 5501–5506.
12 Koch, C.M., Andrews, R.M., Flicek, P., Dillon, S.C., Karaoz, U., Clelland, G.K., Wilcox, S., Beare, D.M., Fowler, J.C., Couttet, P., James, K.D., Lefebvre, G.C., Bruce, A.W., Dovey, O.M., Ellis, P.D., Dhami, P., Langford, C.F., Weng, Z., Birney, E., Carter, N.P., Vetrie, D., and Dunham, I. (2007) Genome Res., 17, 691–707.
13 Reinke, H. and Horz, W. (2003) Mol. Cell, 11, 1599–1607.
14 van Leeuwen, F. and van Steensel, B. (2005) Genome Biol., 6, 113.
15 Keene, M.A., Corces, V., Lowenhaupt, K., and Elgin, S.C. (1981) Proc. Natl. Acad. Sci. USA, 78, 143–146.
16 Li, Q., Zhang, M., Han, H., Rohde, A., and Stamatoyannopoulos, G. (2002) Nucleic Acids Res., 30, 2484–2491.
17 Sollner-Webb, B., Camerini-Otero, R.D., and Felsenfeld, G. (1976) Cell, 9, 179–193.
18 Weintraub, H. and Groudine, M. (1976) Science, 193, 848–856.
19 Wu, C. (1980) Nature, 286, 854–860.
20 Wu, C., Wong, Y.C., and Elgin, S.C. (1979) Cell, 16, 807–814.
21 Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M.J., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. (2006) Nat. Methods, 3, 503–509.
22 Dorschner, M.O., Hawrylycz, M., Humbert, R., Wallace, J.C., Shafer, A., Kawamoto, J., Mack, J., Hall, R., Goldy, J., Sabo, P.J., Kohli, A., Li, Q., McArthur, M., and Stamatoyannopoulos, J.A. (2004) Nat. Methods, 1, 219–225.
23 Giresi, P.G. and Lieb, J.D. (2006) Nat. Methods, 3, 501–502.
24 Sabo, P.J., Kuehn, M.S., Thurman, R., Johnson, B.E., Johnson, E.M., Cao, H., Yu, M., Rosenzweig, E., Goldy, J., Haydock, A., Weaver, M., Shafer, A., Lee, K., Neri, F., Humbert, R., Singer, M.A., Richmond, T.A., Dorschner, M.O., McArthur, M., Hawrylycz, M., Green, R.D., Navas, P.A., Noble, W.S., and Stamatoyannopoulos, J.A. (2006) Nat. Methods, 3, 511–518.
25 Nagy, P.L., Cleary, M.L., Brown, P.O., and Lieb, J.D. (2003) Proc. Natl. Acad. Sci. USA, 100, 6364–6369.
26 Bernstein, B.E., Liu, C.L., Humphrey, E.L., Perlstein, E.O., and Schreiber, S.L. (2004) Genome Biol., 5, R62.
27 Hogan, G.J., Lee, C.K., and Lieb, J.D. (2006) PLoS Genet., 2, e158.
28 Giresi, P.G., Kim, J., McDaniell, R.M., Iyer, V.R., and Lieb, J.D. (2007) Genome Res., 17, 877–885.
29 Luger, K., Mader, A.W., Richmond, R.K., Sargent, D.F., and Richmond, T.J. (1997) Nature, 389, 251–260.
30 Bulyk, M.L. (2004) Genome Biol., 5, 331.
31 Garvie, C.W. and Wolberger, C. (2001) Mol. Cell, 8, 937–946.
32 Brutlag, D., Schlehuber, C., and Bonner, J. (1969) Biochemistry, 8, 3214–3218.
33 Solomon, M.J. and Varshavsky, A. (1985) Proc. Natl. Acad. Sci. USA, 82, 6470–6474.
34 Livak, K.J. and Schmittgen, T.D. (2001) Methods, 25, 402–408.
35 Buck, M.J. and Lieb, J.D. (2004) Genomics, 83, 349–360.
36 Birney, E. et al. (2007) Nature, 447, 799–816.
37 Boyle, A.P., Davis, S., Shulha, H.P., Meltzer, P., Margulies, E.H., Weng, Z., Furey, T.S., and Crawford, G.E. (2008) Cell, 132, 311–322.
38 Wang, Z., Zang, C., Rosenfeld, J.A., Schones, D.E., Barski, A., Cuddapah, S., Cui, K., Roh, T.Y., Peng, W., Zhang, M.Q., and Zhao, K. (2008) Nat. Genet., 40, 897–903.
39 Hsu, F., Kent, W.J., Clawson, H., Kuhn, R.M., Diekhans, M., and Haussler, D. (2006) Bioinformatics, 22, 1036–1046.

15 Identification of Nucleotide Variation in Genomes Using Next-Generation Sequencing

Hendrik-Jan Megens and Martien A.M. Groenen

Abstract

Discovery of genome-wide variation has taken a huge leap forward with the introduction of next-generation sequencing (NGS) technology. Variant discovery requires sampling of a number of haplotypes. These can be either the two haplotypes of a diploid organism or multiple haplotypes in a population. Variant discovery can be done by sequencing pooled DNA, and NGS makes it cost-effective to sample many haplotypes. In this chapter, we discuss various sequencing strategies for variation discovery, focusing mainly on single nucleotide polymorphisms and, to a lesser extent, on short insertions/deletions (Indels). We discuss different options, such as specific library construction and the amount of sequencing required to meet a certain objective. While the benefits of NGS to their own research may be obvious to many researchers, the main obstacle to applying NGS is often practical: how to manipulate and analyze the amount of data from a typical NGS run. The methods therefore focus on practical considerations in dealing with large NGS datasets, such as sequence processing and filtering, mapping to a reference genome, and variant calling. In addition, we focus on data standards and tools to manipulate and analyze data from such standardized datasets. By providing examples and links to easy-to-implement scripts and software, we hope to lower the threshold for biologists to further explore the wealth of information that can be obtained from these new molecular resources.

15.1 Introduction

Understanding at the molecular level how genetic variation influences phenotypic differences requires the identification of this variation in the genome of the species being studied. Genetic variation within the genome comes in many flavors ranging from simple single nucleotide polymorphisms (SNPs) to large insertions, deletions, and duplications of large DNA segments tens to hundreds of thousands of base pairs in size (referred to as structural variants (SVs) and copy number variants (CNVs)). This chapter focuses on the large-scale identification of SNPs and short insertion/ deletions (Indels) of one or a few base pairs in size. Discovery of this kind of variation relies on the sequencing and comparison of different haplotypes. For a long time, the limited throughput and high costs of traditional Sanger sequencing have been limiting factors for large genome-wide SNP discovery in many species. However, this has changed dramatically through the recent development of next-generation sequencing (NGS) technologies. The detection of polymorphisms in genomes has

Tag-based Next Generation Sequencing, First Edition. Edited by Matthias Harbers and G€ unter Kahl. Ó 2012 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2012 by Wiley-VCH Verlag GmbH & Co. KGaA.


been among the first applications of NGS technologies (e.g., [1–9]). To detect polymorphisms, a sufficient number of haplotypes – dependent on the goal for detection – needs to be sampled. Over the past 3–4 years the main NGS technologies have been Roche 454, Illumina GA (Solexa), and ABI SOLiD. Several new contenders have entered or are about to enter the market in the period 2010–2011 (e.g., Helicos [10], Pacific Biosciences [11]). Roche 454 has the advantage of longer read lengths – currently around 800 bp – over Illumina GA (typical runs yield around 100 bp, although 150 bp is possible) or SOLiD (around 50 bp). However, Illumina GA and ABI SOLiD both have the advantage of a much higher yield in the number of sequences (usually referred to as “reads”) and therefore also a higher yield in total base pairs. A higher yield per monetary unit is essential for SNP discovery, since it allows for sampling of a higher number of haplotypes for the same amount of money. However, shorter read lengths come with some tradeoffs. If a reference genome is available, it can be used to reliably map reads produced by any of the current technologies [12]. However, in the absence of a reference assembly, or if the reference assembly is not adequately suited for the question at hand, other decisions may have to be made. For instance, a popular strategy for discovering SNPs for species without a complete reference assembly has been to combine 454 and Illumina GA sequencing (e.g., [3]). The longer reads and assembled contigs generated by 454 can then be used as a reference against which the shorter Illumina GA reads can be mapped for discovery of variation. Alternative strategies for discovering variation in the absence of reference genomes have been applied, such as reducing the complexity of genomes first (e.g., [8,13]). This can be done by constructing so-called reduced representation libraries (RRLs) [4]. 
Since such libraries only represent a fraction of the genome, total coverage of the library can be higher for the same amount of effort, which may allow successful assembly of shorter fragments. Applications of assembly-free methods [14], de novo assembly strategies (e.g., [15]), or outgroup reference strategies (e.g., [16]) are beyond the scope of this chapter, and therefore the availability of a suitable reference sequence is assumed here. As already mentioned, genomic variation comes in many flavors and a second consideration in the choice of NGS sequencing strategy is the type of variation to be discovered. Small Indels can be detected relative to a reference genome if the read length of the chosen technology allows for reliable mapping of the sequence on both sides of the Indel [17]. In principle, even long Indels can be detected in this way. Alternative methods of larger Indel or CNV discovery rely on the estimated distance between the two ends of the sequenced fragments (the “mates”) relative to the expected size distribution of the sequenced library [18,19]. This strategy requires that both sides of the DNA fragments are sequenced. This is referred to as “paired-end” sequencing, resulting in two “mates” for each fragment. For example, if the expected size of fragments is between 300 and 500 bp, but two mates are mapped 200 bp apart on the reference genome, this implies an insertion of 100–300 bp relative to the reference. If in the reference the mates were mapped 2000 bp apart, this would imply a deletion of around 1500–1700 bp relative to the reference. The methods available for de novo discovery of larger Indels by NGS are highly specialized and can benefit, for instance, from construction of larger insert libraries. Since for all current sequencing technologies there is an optimum fragment length, libraries of larger insert size can only be obtained by making so called “mate-pair” libraries. 
In this case the larger fragment (up to 10–40 kb) is circularized, and only the part containing the two fused ends is selected by excision of the intervening DNA and subsequently sequenced. Methods for CNV discovery heavily rely on the accuracy of a reference assembly. As there is a wide variety in the accuracy of currently available genomes, the discovery of larger SVs remains a difficult endeavor. The de novo discovery of larger Indels is beyond the scope of this chapter, although we will touch upon discovery of small Indels in Section 15.3.
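
The paired-end reasoning above (mates mapping closer together or farther apart than the library's expected fragment size) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `infer_indel_from_mates` and its arguments are hypothetical, the expected fragment size is treated as a hard range, and real tools model the full insert-size distribution and require support from many read pairs.

```python
def infer_indel_from_mates(mapped_distance, min_insert, max_insert):
    """Classify a mate pair by comparing its mapped distance on the reference
    with the expected fragment size range of the library (all values in bp).

    Returns (kind, min_size, max_size), or None if the pair is consistent
    with the library. Hypothetical helper, for illustration only.
    """
    if mapped_distance < min_insert:
        # Mates are closer on the reference than in the sequenced fragment:
        # the sample carries extra sequence (insertion relative to reference).
        return ("insertion",
                min_insert - mapped_distance,
                max_insert - mapped_distance)
    if mapped_distance > max_insert:
        # Mates are farther apart on the reference than the fragment is long:
        # the sample lacks sequence (deletion relative to reference).
        return ("deletion",
                mapped_distance - max_insert,
                mapped_distance - min_insert)
    return None


# The two cases from the text, for a library of 300-500 bp fragments.
print(infer_indel_from_mates(200, 300, 500))
print(infer_indel_from_mates(2000, 300, 500))
```

For the two examples given in the text, the sketch returns an insertion of 100–300 bp and a deletion of 1500–1700 bp, respectively.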

15.1.1 SNP Discovery and Nucleotide Variation Assessment

The variations currently most easily and least ambiguously detectable on a genome-wide scale are the SNPs, also often referred to as single-base substitutions or point mutations. These variations are observed either as transitions (between purines, A and G, or between pyrimidines, C and T) or transversions (between purines and pyrimidines). Since biochemically transitions are more likely than transversions, A ↔ G and C ↔ T SNPs are usually the most abundant, despite the number of possible transversions being double that of transitions [20]. In general, a transition/transversion ratio above 2 is observed and this ratio is often used as an indicator for the quality of the SNP identification procedure. If the observed transition/transversion ratio drops below 2, this indicates the presence of an increasing number of false positives. SNPs can be very abundant in genomes. For humans, for instance, over 20 million SNPs have so far been discovered [21]. Owing to the abundance of SNPs in most species, and because cost-effective SNP assays can now be made that interrogate many tens of thousands to millions of SNP positions in a genome [22], SNPs have become the main marker for whole-genome genotype–phenotype association analysis in many species [23,24]. Many genome-wide SNP discovery efforts are conducted with the aim of designing species- or population-specific SNP assays (“SNP chips”) for linkage analysis, association studies, or for plant and animal breeding (marker-assisted breeding and genomic selection). Although the price for each of these SNP assays may be small, such assays are often done on many thousands of individuals and can therefore still end up being very costly. By only selecting SNPs for the assay that are useful in a certain population (i.e., two alleles are present, with a relatively high minor allele frequency), the return on the investment can be maximized. For SNP chips, SNPs are usually seen as biallelic markers. 
However, in practice, all four possible alleles may be present in a population. In addition, SNPs that have been discovered in one population may not be present in another. It may therefore be difficult to compare the results of SNP assays from population to population due to the artifacts introduced from the discovery, or ascertainment, process. This phenomenon is called ascertainment bias [25,26]. Variation detection using NGS can be applied not only to discover SNPs, but also to allow comparison of SNP composition from population to population in a more or less unbiased (“unascertained”) way. One way of comparing diversity between populations is by measuring nucleotide diversity. This can be done, for instance, by inferring the number of SNPs present between two haplotypes (e.g., in a single diploid organism) or by correcting the total number of SNPs discovered by the number of haplotypes that have been sequenced. Such population genetic analyses (often referred to as “population genomics”; e.g., [27]) are currently revolutionizing our understanding of variation in populations. The vast majority of SNPs in any population will be rare (minor allele frequency below 5%) if the population did not undergo a recent population bottleneck. Such rare alleles may be highly relevant for conferring a genetic basis for disease, for animal and plant breeding, and for conservation genetics [28].

15.1.2 Sequence and Library Preparation Strategies

Depending on the purpose of the ascertainment of variation – SNP discovery for a dedicated SNP chip, population genomics, or association analysis – different designs of the sequencing effort are possible. To a large extent, the choice will depend on the available budget. Complete genomic sequence data from all individuals in a population would reveal all genetic variation. Various studies aim at sequencing the complete genomes of many hundreds to thousands of individuals to generate comprehensive catalogs of variation that is present at a frequency of 1–10% [21,29]. Individual sequencing has several benefits. Very rare variants can be determined with high accuracy as they will be corroborated


by several reads in that individual alone [30]. In addition, having the individuals' genotypes allows for inference of the haplotypes. However, for inferring a genotype with a certain accuracy, the genome needs to be covered by a sufficient number of reads [9]. For instance, for an approximately 95% likelihood that both alleles have been sequenced at least once, one needs to have at least five reads (i.e., a genome-wide 5 times coverage). However, because each base has been called with a certain margin of error, one actually needs to see the minor allele at least twice; otherwise, a small base calling error multiplied by a vast genome results in a very high false-discovery rate. For instance, even if only 1 : 10 000 bases resulted in a falsely discovered SNP, that would result in hundreds of thousands of wrongly called SNPs in the entire genome. Compared to the real SNPs, around 1 : 10 would then be wrong. For the human 1000 Genomes Project the gold standard has been set at 30 times coverage to also take into account the uneven distribution of coverage across the genome (which should be a Poisson-like distribution [9]), although most human individuals have been sequenced to around 4 times coverage [21]. If one is only interested in discovering SNPs or making general assessments of nucleotide variation, the pooling of DNA from multiple individuals is a cost-effective way of discovering the high-frequency alleles (above 20%) in a population with 95% confidence for the price of sequencing a single individual to around 30 times coverage (Figure 15.1, assuming two reads to support the minor allele [31,32]). For other studies, where it is sufficient to obtain SNP information for only part of the genome, sequencing can be restricted to either a randomly selected fraction or specifically selected targeted regions of the genome. 
A random RRL, for instance, can be made by cutting the DNA with a restriction enzyme and then selecting a size fraction by excising a band from a polyacrylamide gel [1,3,4]. The coverage of the genome in the RRL can be varied by the choice of the restriction enzyme and the size range of the fragments recovered after size selection of the digested DNA by gel electrophoresis. Targeted RRLs can be made either by pooling a number of different polymerase chain reaction fractions or by enriching for certain regions on the genome by means of capturing arrays (e.g., exon capturing arrays) [33–35]. Reducing the complexity of a genome allows a higher coverage, or rather a higher read depth, for the same amount of effort and money, and therefore can be applied in studies aiming at finding relatively rare variation in a cost-effective way. Current and up-coming sequencing platforms allow a degree of sequence generation per smallest unit of sequencing (usually a “lane” or “flow-cell”) that is overkill for many applications. For instance, the latest-generation Illumina GA (HiSeq 2000) can deliver in excess of 80 million reads per lane, which equals around 16 Gbp per lane for a paired-end run if run to 100 bp/mate. For sequencing complete bacterial genomes (up to 10 Mbp), small eukaryotic genomes, or reduced complexity libraries of larger genomes (e.g., exome sequencing) that level of sequence generation is not cost-effective for SNP discovery. In such cases, adding tags or barcodes to libraries

Fig. 15.1 Chance P of observing a SNP with minor allele proportion p in n haplotypes: $P = 1 - (1 - p)^n$.


that, for instance, each represent a different genome, allows for pooling of a number of different libraries (e.g., [36–38]). In practice, several of the strategies mentioned above can be combined (e.g., by creating an individually tagged and pooled RRL for cost-effective SNP discovery that would even allow detection of exceedingly rare variants). The past years have witnessed a reversal of the trend toward highly distributed sequencing capacity, in which many labs could afford their own automated Sanger-based sequencing machine; NGS machines are now more often clustered and operated by specialized research centers and service laboratories. This, combined with the high degree of standardization of kits and reagents by the suppliers of the sequence technology, results in a shift in focus for many researchers away from the actual generation of the sequence data. That focus has moved, on the one hand, toward the design of the experiment and the choice of the sequence libraries, as highlighted here, and, on the other hand, toward data handling, bioinformatics, and analysis of the data, such as SNP discovery and nucleotide variation estimation. This last aspect is currently among the biggest challenges for implementing NGS in research and applied sciences. Researchers are expressing fears that while the US$1000 genome may now be within reach in terms of generating the sequence data itself, the actual analysis of the data may cost many times that amount [39]. The analysis of NGS data will, therefore, be the focus of the methods described in Section 15.2. Given that the Illumina GA/HiSeq platform is currently the most frequently used for variant discovery in large and complex genomes by resequencing (e.g., [21]), many of our examples refer to this technology. 
However, we have tried to generalize procedures and data formats as much as possible, such as by choosing the FASTQ format [40] for sequence data and the SAM/BAM [41] alignment file format, which should be applicable to all sequencing platforms. In addition, all our examples are implementing open-source tools rather than tools supplied by sequencing companies and are often not limited to a single sequencing technology.
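
The coverage arguments in this section (the formula of Figure 15.1, the requirement of two supporting reads, and the five-read example for a heterozygous site) can be checked numerically. The following Python sketch is only illustrative: the function names are hypothetical, and it assumes that reads sample haplotypes independently and ignores sequencing error and uneven coverage.

```python
from math import factorial


def n_choose_k(n, k):
    """Binomial coefficient, written out so the sketch runs on older Pythons."""
    return factorial(n) // (factorial(k) * factorial(n - k))


def prob_minor_allele_seen(p, n, min_reads=1):
    """Chance that at least `min_reads` of n sampled haplotypes (or reads)
    carry an allele with proportion p. With min_reads=1 this reduces to
    P = 1 - (1 - p)^n, the formula of Figure 15.1."""
    prob_too_few = sum(
        n_choose_k(n, k) * p ** k * (1.0 - p) ** (n - k)
        for k in range(min_reads)
    )
    return 1.0 - prob_too_few


def prob_both_alleles_seen(n_reads):
    """Chance that n reads over a heterozygous site in a diploid individual
    include both alleles, assuming each read samples either allele with
    probability 0.5."""
    return 1.0 - 2 * 0.5 ** n_reads


# Five reads over a heterozygous site.
print(prob_both_alleles_seen(5))
# Allele at 20% in a pool, 30 reads, one or two supporting reads required.
print(prob_minor_allele_seen(0.2, 30))
print(prob_minor_allele_seen(0.2, 30, min_reads=2))
```

Under these assumptions, five reads give a probability of 0.9375 of seeing both alleles, and a 20% allele in a pool sequenced to 30 times coverage is supported by at least two reads with a probability above 0.95.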

15.2 Methods

In this section we provide an overview of the procedures involved in preprocessing NGS data, mapping reads to a reference assembly, and variant calling. Some emphasis will be put on file formats, since proper understanding of formats and implementation of standards can greatly facilitate efficient processing of data and downstream analysis. Many of the tools described in this section can be run on consumer computer hardware. For a dataset that is equivalent to 10–15 times coverage of a human-sized genome, at least 100 GB of free hard-drive space is recommended. A computer with 4–8 GB of RAM should be able to run most of the programs described in this section, apart from the MOSAIK alignment tool, which needs around 20 GB of RAM for mapping a human-sized genome. As for computation speed, virtually all current read-mapping tools can make efficient use of parallel computing infrastructure or are multithreaded, which means that having more processors will speed up mapping. For instance, running the short-read aligner BWA (Burrows–Wheeler Alignment tool) on a consumer quad-core computer may take more than 10 times longer than on a 48-core computer, which could map 300 million reads in a matter of hours. NGS processing, mapping, and variant calling tools are most often designed to compile and function on UNIX or Linux systems. In addition, modern file systems for UNIX/Linux, such as ext3, ext4, and ReiserFS, offer highly reliable and speedy data handling and copying. Last, but certainly not least, UNIX and Linux systems offer a highly versatile data handling toolbox by default in the form of a POSIX-compliant shell such as the Bourne Again Shell (Bash, the default on most Linux systems), Korn Shell (ksh), or C Shell (csh). For these reasons we will assume the availability of a UNIX/Linux system in the remainder of the methods. Furthermore, we assume that Perl (Perl 5.8 or higher) and Python (Python 2.5 or 2.6) interpreters are installed.


15.2.1 Preprocessing of Reads

Depending on sequencing technology and sequencing center, the data will be obtained in a specific format. Usually this is a flat-text format (ASCII) with some kind of data compression to reduce the size of the files for storage and data transfer. Popular data compression methods are the “zipped” format for Windows (with a .zip extension), and “gzipped” (.gz extension) or “bzipped” (.bz2 extension) on POSIX-compliant systems (UNIX, Mac OS X, and Linux). Should your data be zipped in some way, you need to uncompress the file to make it human-readable. Assuming your file is called file.fastq.gz, you could achieve this by typing the following command from the shell:

    gunzip file.fastq.gz

This will result in a new file, file.fastq. The original file will be removed. Since the unzipped file is usually around 3 times larger than the gzipped one, you may not want to unzip all the files you have, simply because this will take up too much hard-drive space. You can alternatively unzip on a stream (and just look at the first lines) by doing:

    gunzip -c file.fastq.gz | more

Piping the output to the command “more” will allow you to scroll through the data and get an idea of what is in it. Current data files will invariably be too long to scroll through in full, nor would this be meaningful. However, some data exploration may be valuable, such as to determine what format the data is in or to spot gross formatting issues.

15.2.1.1 FASTQ Format

Sequence laboratories often will deliver the data in a generic FASTQ format [40]. This format is derived from the FASTA sequence file format. The differences are (i) the first line, or identifier line, does not start with “>”, but with “@”, and (ii) there are two additional lines to include the quality scores for the base calling. The third line needs to start with “+” and can either be empty or be the same as the sequence identifier in the first line. 
The fourth line provides a string of ASCII characters indicating the accuracy of base calling for each of the positions in the DNA sequence in the second line. The standard FASTQ, or Sanger FASTQ, format uses the ASCII characters corresponding to numeric values 33 (the first printable ASCII character, which is “!”) to 73 (“I”). The quality coding scheme is often referred to as being Phred-like, since it was initially implemented in the program Phred [42]. The values refer to accuracy of base calling, where the chance p of calling the wrong base translates to the quality value Q as follows:

    Q = -10 * log10(p)

Therefore, a p of 0.05, or a 5% chance of calling the wrong base, equates to a Q of 13 and the ASCII character “.” (ASCII value 13 + 33 = 46). An example record:

    @seq_id
    TTAGCCTGGGAACTTCCATATGCTATGGGGATAGCCCTAAAAAGACAAAAATTTT
    +seq_id
    DBDAD?AD?:EEB-B=DDAC@D::?5;@;'CCC5CDD5D-A:;A?B??D>ACEEE

15.2.1.2 FASTQ Format – Illumina Version

Although the FASTQ sequences derived from Illumina processing pipelines result in a file format that is generally similar to the Sanger FASTQ format, there are two noteworthy differences [40]. The most important one is that the quality measures have a different scale. Instead of running from “!” (ASCII value 33, quality 0) to “I” (ASCII value 73, quality 40), they run from “B” (ASCII value 66, quality value 2) to “h” (ASCII value 104, quality value 40). To get the corresponding quality value, one needs to subtract 64 (as opposed to 33 in Sanger FASTQ). In addition, the minimum value is 2,

15.2 Methods

which corresponds to an error probability of greater than 0.75. Since incorporation of a random base would result in an error probability of 0.75, a quality value of less than 2 is therefore not meaningful. SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.............................. .................................I IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII !"#$%&'()*+,-./0123456789:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh | | | | | 33 59 64 73 104
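The relation between error probability, quality value, and ASCII character can be checked with a few lines of Python. This is only a minimal sketch: the function and constant names are illustrative and not part of any tool mentioned in this chapter.

```python
import math

# Offsets as described in the text (the names are illustrative assumptions).
SANGER_OFFSET = 33    # "!" encodes quality 0
ILLUMINA_OFFSET = 64  # "B" (ASCII 66) encodes quality 2


def error_prob_to_quality(p):
    """Phred-like quality Q = -10 * log10(p) for an error probability p."""
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Inverse relation: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def encode_quality(q, offset=SANGER_OFFSET):
    """ASCII character for a quality value, rounded to the nearest integer."""
    return chr(int(round(q)) + offset)


def decode_qualities(quality_string, offset=SANGER_OFFSET):
    """List of integer quality values for a FASTQ quality string."""
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    q = error_prob_to_quality(0.05)
    print(round(q), encode_quality(q))            # quality and Sanger character for p = 0.05
    print(decode_qualities("Bh", ILLUMINA_OFFSET))  # lowest and highest Illumina values
```

Decoding the same quality string with the wrong offset shifts every value by 31, which is why it is worth confirming the encoding before any quality filtering.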

An added consequence of the Illumina FASTQ format is that the “@” is no longer part of the quality score palette, allowing for less ambiguous parsing of the FASTQ files. A very useful article on various FASTQ flavors can be found on Wikipedia (http://en.wikipedia.org/wiki/Fastq). In addition, the Illumina FASTQ sequence IDs are systematic names [43]. For instance, in the name “@HWI-EAS385_0064:3:24:1204:5245#0/1”, “@HWI-EAS385_0064” identifies the machine, “3” means lane 3 in the flow cell, “24” is the tile number in the flow cell, and “1204” and “5245” are the x and y coordinates within the tile, respectively. “#0” means that the sample was not multiplexed; had the sample been multiplexed, “#ATCACG” would have identified the unique sequence tag. Lastly, “/1” indicates that the read is the first member of the pair. Its mate is identified by the same sequence identifier, but with “/2” instead of “/1”.
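As an illustration of how such systematic names can be taken apart, the following Python sketch splits an identifier into the fields described above. The regular expression and the field names are assumptions made for this example; identifiers produced by other pipeline versions may be laid out differently.

```python
import re

# Pattern for names of the form machine:lane:tile:x:y#tag/mate.
# The group names are illustrative, not an official nomenclature.
READ_ID = re.compile(
    r"^@?(?P<machine>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<tag>[^/]+)/(?P<mate>[12])$"
)


def parse_illumina_id(identifier):
    """Return a dict with the fields of an Illumina FASTQ identifier."""
    match = READ_ID.match(identifier.strip())
    if match is None:
        raise ValueError("unrecognized identifier: %r" % identifier)
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "mate"):
        fields[key] = int(fields[key])
    # "#0" marks a sample that was not multiplexed.
    fields["multiplexed"] = fields["tag"] != "0"
    return fields


if __name__ == "__main__":
    print(parse_illumina_id("@HWI-EAS385_0064:3:24:1204:5245#0/1"))
```

Raising an error on names that do not match, rather than guessing, makes it obvious when a file comes from a pipeline with a different naming scheme.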

15.2.1.3 Illumina FASTQ to Sanger FASTQ

Note that certain postprocessing pipelines may assume Sanger FASTQ quality values rather than Illumina FASTQ values (e.g., SAMtools [41]), in which case the Illumina FASTQ qualities need to be recalculated to obtain Sanger quality values. There are a number of tools available to translate Illumina FASTQ to Sanger FASTQ (e.g., MAQ [44]). Alternatively, you may want to write your own code to recalculate the ASCII symbols from Illumina FASTQ to Sanger FASTQ. The following Python script (written for Python 3) provides an example.

    import gzip
    import sys


    def next_sequence_gzip(filename):
        """Yield one FASTQ record (four lines) at a time from a gzipped file."""
        with gzip.open(filename, "rt") as handle:
            line = handle.readline()
            while line:
                if line[0] == "@":
                    line1 = line
                    line2 = handle.readline()
                    line3 = handle.readline()
                    line4 = handle.readline()
                    yield (line1, line2, line3, line4)
                line = handle.readline()


    def convert_illumina_to_sanger(filename):
        for seq in next_sequence_gzip(filename):
            # Strip the trailing newline from the quality line.
            qs = seq[3].rstrip("\n")
            # Offset 64 (Illumina) to offset 33 (Sanger): shift each character by 31.
            quals = "".join(chr(ord(q) - 31) for q in qs)
            sys.stdout.write(seq[0] + seq[1] + seq[2] + quals + "\n")


    convert_illumina_to_sanger(sys.argv[1])

As input you give the name of a gzipped FASTQ file. The output is streamed to standard output, from where you may gzip it and capture it in a file, for instance:

    python3 illumina-to-sanger.py myIllumina.fastq.gz | gzip -c > mySanger.fastq.gz

15.2.1.4 ABI SOLiD- and Roche 454-Specific Formats

Each of the currently available sequencing technologies, ABI SOLiD, Roche 454, and Illumina GA, has its own specific sequence output formats, and even multiple formats exist per technology depending on the stage of processing of the data. Both the Roche 454 and SOLiD processing pipelines may return the sequence data and the quality data in separate files. SOLiD has a particularly interesting file format, as the sequences are displayed in so-called “color space”. The first base is always a “T” from the adapter (example taken from the Galaxy website: main.g2.bx.psu.edu):

Reads:

    >1831_573_1004_F3
    T00030133312212111300011021310132222
    >1831_573_1567_F3
    T03330322230322112131010221102122113

Quality scores:

    >1831_573_1004_F3
    4 29 34 34 32 32 24 24 20 17 10 34 29 20 34 13 30 34 22 24 11 28 19 17 34 17 24 17 25 34 7 24 14 12 22
    >1831_573_1567_F3
    8 26 31 31 16 22 30 31 28 29 22 30 30 31 32 23 30 28 28 31 19 32 30 32 19 8 32 10 13 6 32 10 6 16 11

Converted to FASTQ:

    @1831_573_1004
    AAATACTTTCGGCGCCCTAAACCAGCTCACTGGGG
    +
    %>CCAA9952+C>5C.?C79,=42C292:C(9/-7
    @1831_573_1567
    ATTTATGGGTATGGCCGCTCACAGGCCAGCGGCCT
    +
    );@@17?@=>7??@A8?==@4A?A4)A+.'A+'1,
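The relation between the SOLiD files and the FASTQ records in this example can be reproduced with a short Python sketch. The color-to-letter mapping used here (0 to A, 1 to C, 2 to G, 3 to T) is inferred from the example itself, not taken from the SOLiD or Galaxy documentation, and the function names are illustrative. Note that the resulting letters are merely a re-encoding of the colors; they are not the actual nucleotide sequence.

```python
# Mapping inferred from the example in the text (an assumption of this sketch).
COLOR_TO_LETTER = {"0": "A", "1": "C", "2": "G", "3": "T"}


def encode_colors_as_letters(color_read):
    """Drop the leading adapter base and write each color as a letter.

    Unknown color calls (for instance ".") are written as "N".
    """
    colors = color_read[1:]
    return "".join(COLOR_TO_LETTER.get(c, "N") for c in colors)


def numeric_qualities_to_sanger(quality_line, offset=33):
    """Turn space-separated integer qualities into a Sanger quality string."""
    return "".join(chr(int(value) + offset) for value in quality_line.split())


if __name__ == "__main__":
    print(encode_colors_as_letters("T00030133312212111300011021310132222"))
    print(numeric_qualities_to_sanger("4 29 34 34 32 32 24 24 20 17"))
```

For real data the converters in Galaxy remain the safer choice, since they also deal with details such as negative quality values and missing color calls.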

Tools for converting SOLiD and Roche 454 output formats to the generic Sanger FASTQ file format can be found in the Galaxy toolbox (main.g2.bx.psu.edu). Galaxy is a very useful toolbox for various NGS data manipulation, processing, and analysis tasks [45,46]. Although you can operate Galaxy from a cloud-based platform, local installs of Galaxy are required to deal with issues regarding copying and storing large data volumes.

15.2.1.5 Illumina SCARF or QSEQ to FASTQ

Depending on the laboratory doing the sequencing, the data may be returned in a generic format, such as FASTQ, or in a technology-specific output format. Data from an Illumina GA sequence run may, for instance, be returned in SCARF format. That format can be defined as containing three (or more) columns, where the first column contains machine ID, lane number, tile number, x coordinate in the tile, y coordinate in the tile, and further identifiers (tag barcode and mate identifier) separated by “:” (see also Section 15.2.1.2). The second column contains the sequence and the third the string of quality scores for the base calling. Various tools exist for reformatting SCARF to FASTQ, such as the “Short Read Toolbox” (http://brianknaus.com/software/srtoolbox/shortread.html). Alternatively, a simple shell script will do the same and allow you to operate on gzipped files:

    gunzip -c example.scarf.gz | sed 's/:/\t/g' |
      awk '{print "@"$1":"$2":"$3":"$4":"$5"\n"$6"\n+\n"$7}' |
      gzip -c > example.fastq.gz

Another Illumina file format that you may encounter is the QSEQ format. It is similar to SCARF, but there are differences. For instance, Ns are displayed as dots (“.”). The following one-line shell script will convert from gzipped QSEQ to gzipped FASTQ (the test on column 11 keeps only reads that passed the quality filter):

    gunzip -c example.qseq.gz |
      awk -F '\t' '{gsub(/\./,"N",$9); if ($11>0) print "@"$1"_"$2":"$3":"$4":"$5":"$6"#"$7"/"$8"\n"$9"\n+"$1"_"$2":"$3":"$4":"$5":"$6"#"$7"/"$8"\n"$10}' |
      gzip -c > example.fastq.gz
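For readers who prefer to stay in Python, the SCARF conversion can be sketched as follows. The function name and the example record are made up for illustration, and the sketch assumes the seven colon-separated fields described above (five identifier fields, sequence, qualities).

```python
def scarf_to_fastq(scarf_line):
    """Convert one SCARF line into a four-line FASTQ record (as one string).

    Assumes: machine:lane:tile:x:y:sequence:qualities. Splitting at most six
    times keeps any ":" that might occur inside the quality string intact.
    """
    fields = scarf_line.rstrip("\n").split(":", 6)
    if len(fields) != 7:
        raise ValueError("expected 7 colon-separated fields: %r" % scarf_line)
    identifier = ":".join(fields[:5])
    sequence, qualities = fields[5], fields[6]
    return "@%s\n%s\n+\n%s\n" % (identifier, sequence, qualities)


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    example = "HWI-EAS385_0064:3:24:1204:5245#0/1:ACGTN:hhhhB"
    print(scarf_to_fastq(example), end="")
```

Combined with gzip.open for reading and writing, the same function can be applied line by line to a compressed SCARF file without uncompressing it to disk.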

15.2.1.6 Quality Evaluation

Quality control and evaluation is important in any effort involving data acquisition. SNP discovery using NGS analysis is no exception, and the fact that the people in charge of sequencing are usually not the ones analyzing the data makes it even more pressing that data analysts understand the quality of the data. One strategy for quality evaluation is to make aggregate scores of quality parameters such as the Phred-like base-calling accuracies [42]. Whereas in early versions of the NGS base-calling software, particularly for Illumina GA, quality scores seemed poor estimators of true accuracy [47], recent versions of the Illumina base-calling software appear to provide much more informative quality scores. Aggregate scores (including the distribution of values) per lane, per tile in a lane, and per position in the read can reveal problematic lanes or tiles, and can give an overall assessment of how the quality of the sequences deteriorates with increasing sequence length. Several very useful and easy-to-use tools are available, such as FastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/), SolexaQA [43] (solexaqa.sourceforge.net), and FASTX (http://hannonlab.cshl.edu/fastx_toolkit/). The latter toolkit can be installed locally, but can also be found in Galaxy [45,46] (main.g2.bx.psu.edu, see also Section 15.2.1.3).

15.2.1.7 Handling Adapter Sequences – Linkers and Barcodes

Sequences may, depending on what kind of library has been used, contain linkers at the end of a sequence (e.g., when a mate-pair library was constructed). As linker sequences are known, they are easily removed from the ends of the sequence. The FASTX suite of short-read processing tools (http://hannonlab.cshl.edu/fastx_toolkit/) has a utility (fastx_clipper) for removing such linker sequences. However, if fewer than four bases of the linker sequence are present, the linker sequence may not be recognized.
To ensure removal of all linker sequence, it is advisable to add a round of length trimming that removes the last three bases of all remaining full-length sequences, as the presence of unaligned terminal bases may cause serious mapping problems with certain mapping utilities. For a number of applications, sequence tagging, or barcoding, is used [36]. The barcode is usually a six-base sequence. In fact, there are sequencing centers that use barcoding for all applications, even whole-genome sequencing, as it allows them to be flexible in merging projects. If a barcode is applied, it can normally be determined from the sequence name. For instance, if a sequence is called “@HWI-EAS385_0064:3:24:1204:5245#ATCACG/1”, then “ATCACG” is the barcode used for the library the sequence was generated from. If not already sorted in some way, the sequences resulting from a single project of barcoded sequences can be organized by the fastx_barcode_splitter.pl utility in the FASTX-Toolkit. Barcodes can be removed by simply trimming the first six bases from the sequence. The FASTX-Toolkit also has a utility, fastx_trimmer, for trimming any desired number of bases from either the beginning or the end of a sequence. All tools in the FASTX-Toolkit can be found as an integrated next-gen handling, mapping, and analysis solution in Galaxy (main.g2.bx.psu.edu, see also Section 15.2.1.3) [45,46].

15.2.1.8 Quality Trimming

The general consensus appears to be that it is better to have shorter sequences of high quality, and to remove the bases, particularly at the 3′ end of reads, that have deteriorated below a certain threshold, rather than to retain the entire sequence for mapping and variant calling. Most of the recent read mappers can handle variable-length reads, and reads can therefore be individually trimmed to contain only high-quality bases. This is called “quality trimming.” Reads that do not meet the threshold for minimum base quality over a minimum sequence length should be removed. If a read needs to be removed, then its mate may also need to be removed, even though it may meet the quality threshold, as sequence files usually need to be symmetrical (i.e., the order of the sequences needs to be the same in the file containing the forward and the file containing the reverse mates). Depending on downstream analysis, but also on sequence quality, different trimming strategies can be applied. The “DynamicTrim” program in the “SolexaQA” suite of scripts for short-read processing (solexaqa.sourceforge.net) [43] can find the largest contiguous stretch of sequence containing bases that all meet the minimum threshold (e.g., all Q ≥ 13) and will also trim bases at the 5′ end of the sequence if necessary.
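The idea of keeping the largest contiguous stretch of bases that all meet a minimum quality can be sketched in Python as follows. This is not the DynamicTrim implementation; the function names, the default threshold of 13, and the Sanger offset of 33 are assumptions made for this illustration.

```python
def longest_good_stretch(qualities, threshold=13):
    """Return (start, end) of the longest run with every quality >= threshold.

    The end index is exclusive; (0, 0) means that no base passed.
    On ties the first such run is kept.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(qualities):
        if q >= threshold:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


def trim_read(sequence, quality_string, threshold=13, offset=33):
    """Trim a read and its quality string to the longest high-quality stretch."""
    qualities = [ord(ch) - offset for ch in quality_string]
    start, end = longest_good_stretch(qualities, threshold)
    return sequence[start:end], quality_string[start:end]


if __name__ == "__main__":
    # Qualities 5 20 20 3 30 30 30 2 in Sanger encoding (made-up read).
    print(trim_read("ACGTACGT", "&55$???#"))
```

Because the stretch may start anywhere in the read, low-quality bases at the 5′ end are removed as well as those at the 3′ end.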
Trimming to the longest high-quality stretch is very useful as it may save reads that have a number of bad bases at the start, which would otherwise be trimmed to a very short length if the sequence were simply cut at the first bad base encountered. The program does not remove sequences (although reads of only a single base may be left). Another useful trimming program is the fastq_quality_trimmer utility in the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html). It removes bases at the end of the read if a quality lower than the threshold is found and can remove sequences if the remaining sequence is shorter than a set value. The FASTX-Toolkit can also be accessed through Galaxy (main.g2.bx.psu.edu, see also Section 15.2.1.3) [45,46].

15.2.2 Mapping Reads to a Reference Genome

Ideally, the sequence data obtained would be assembled to a genome sequence prior to variant calling, as this may ensure that any regions that are not in the reference assembly are nevertheless represented. However, currently, unless one has a very high coverage of sequence data (depending on the organism, much greater than 30 times), this is not very meaningful. For complex genomes, even with high coverage, a guided assembly using existing reference genome builds, or additional mapping information, is usually necessary. RRLs, which result in incomplete representation of the genome [4], are unsuited for de novo assembly of a whole genome, but may be used to attempt assembly of those parts of the genome that are represented in the libraries if a genome reference assembly is absent [8]. Creation of reference assemblies for read mapping is beyond the scope of this chapter, but we direct the reader to a number of recent reviews on this subject [48–51].

Once the data is prepared (see Section 15.2.1), the next step is to map the reads to a reference sequence. Over the past 3 years a wealth of short-read mappers has been created. A number of these are made available by either Roche, Illumina, or ABI in conjunction with their sequence analysis pipelines, or can be purchased commercially. However, for this chapter we will focus on a number of open-source solutions (for an exhaustive list, see http://seqanswers.com/forums/showthread.php?t=43; http://seqanswers.com/wiki/Software/list). These programs can be obtained without cost, and a number of them have a very wide user base and often very good community support (e.g., seqanswers.com). In addition, several of the freely available read mappers are used in a variety of high-profile re-sequencing projects such as the human 1000 Genomes Project [21].

Most short-read alignment programs use one of two read mapping algorithms. The first is based on building a hash for either the sequence reads (MAQ [44], ELAND (www.illumina.com)) or the reference genome (Novoalign (www.novocraft.com), MOSAIK [52]). The second group of alignment programs, such as BWA [53], SOAP2 [54], and Bowtie [55], is based on the Burrows–Wheeler transformation of the sequence data. Programs based on the Burrows–Wheeler transformation are, in general, faster, but at the cost of a lower sensitivity. Recently a new alignment program, Stampy [56], was developed that combines the hashing and Burrows–Wheeler approaches for fast alignment at high sensitivity.

For this chapter we will focus on two popular read mappers: MOSAIK (http://bioinformatics.bc.edu/marthlab/Mosaik) and BWA [53] (bio-bwa.sourceforge.net). Although these mappers use different algorithms for finding alignments of reads against the reference, and differ in whether they use (BWA) or do not use (MOSAIK) base quality scores for assigning mapping qualities, there are also a number of similarities in how they can be implemented. For instance, both mappers will make an alignment file based on single-end data and will use the paired-end information in a follow-up analysis to refine the alignment files.
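To give an impression of what the Burrows–Wheeler transformation does, the following Python sketch computes it in the most naive way, by sorting all rotations of the text. This is for illustration only: the function name and the “$” terminator are choices made for this example, and real aligners such as BWA build the transform of the reference through far more memory-efficient index structures.

```python
def burrows_wheeler_transform(text, terminator="$"):
    """Naive Burrows-Wheeler transform via sorted rotations.

    The terminator must not occur in the text and must sort before all
    other characters (true for "$" versus letters in ASCII).
    """
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(burrows_wheeler_transform("banana"))
    print(burrows_wheeler_transform("ACGTACGTTTGACCA"))
```

The transformed string tends to group identical characters together, and, together with a small amount of auxiliary data, it allows exact matches of a read to be located without scanning the whole reference.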
15.2.2.1 Making Alignments Using MOSAIK

MOSAIK (http://bioinformatics.bc.edu/marthlab/Mosaik) uses the Smith–Waterman algorithm and can, unlike several other short-read aligners, make gapped alignments. It can use data generated by Roche 454, Illumina GA, and SOLiD. It was written by Michael Strömberg at Boston College, and is available for Windows, Linux, and Mac OS X. The alignment process itself, done by MosaikAligner, can be very fast (depending on the alignment options and read length) and can be run very efficiently on multicore computers as it is multithreaded. One of the disadvantages is that MOSAIK requires quite a bit of memory, at least 19 GB (with a so-called “jump database”, see Section 15.2.2.1.1) for mapping reads to a human-sized genome, as it makes a hash table of the entire genome.

15.2.2.1.1 Preparing Input Files Using MosaikBuild

Before starting to align reads to a reference, both the reference genome and the read archives need to be converted to a MOSAIK-specific compressed format. Assuming the reference genome is a gzipped FASTA file named reference.fa.gz (-fr), and the “mates” of a paired-end Illumina GA run are in the compressed FASTQ files reads_mate1.fq.gz (-q) and reads_mate2.fq.gz (-q2), respectively:

    MosaikBuild -fr reference.fa.gz -oa reference.dat
    MosaikBuild -q reads_mate1.fq.gz -q2 reads_mate2.fq.gz -out reads.dat -st illumina

This will create a binary reference (-oa) and a binary read archive (-out) containing both mates. In order to limit memory requirements (from around 57 to around 19 GB for a human-sized genome) and to speed up the alignment process (by eliminating the hashing of the reference genome each time you run the aligner), it is advisable to create a so-called jump database that holds a hash table of the reference genome:

    MosaikJump -ia reference.dat -out reference_15 -hs 15

MosaikJump requires a binary reference (-ia) and will create a number of files with the given prefix (-out).
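The principle behind a hash table of the reference genome can be illustrated with a small Python sketch. It bears no relation to the actual MOSAIK file formats or data structures: the function names are invented for this example, and a dictionary of k-mer positions only stands in for the idea of looking up short seeds of a read (hash size k, compare the -hs option above) in the reference.

```python
from collections import defaultdict


def build_kmer_index(reference, k=15):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_hits(read, index, k=15):
    """Candidate start positions of the read on the reference.

    Each k-mer of the read that occurs in the index votes for the reference
    position at which the read would have to start.
    """
    hits = []
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            hits.append(ref_pos - offset)
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"  # toy sequence, for illustration only
    index = build_kmer_index(reference, k=4)
    print(index["ACGT"])
    print(seed_hits("CGTTTG", index, k=4))
```

In repetitive regions a single k-mer can map to a very large number of positions, which is why aligners typically cap the number of positions kept per seed.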

15.2.2.1.2 Aligning Reads to the Reference Using MosaikAligner

The actual alignment is done by the program MosaikAligner:

    MosaikAligner -in reads.dat -out reads_aligned.dat -ia reference.dat -hs 15 -mmp 0.05 -mhp 100 -act 20 -j reference_15 -p 10

A binary input read file (-in) and an output alignment file (-out), which stores all of the alignments, are specified. Additionally, we specify a binary reference sequence file (-ia). A hash size of 15 is specified (-hs) and a maximum mismatch percentage of 0.05 (given as a fraction, i.e., 5%) is allowed (-mmp). It is also possible to allow a fixed maximum number of mismatched bases, but this is not useful if, after quality trimming, the read libraries contain different read lengths. A jump database (-j) will be used instead of the normal hash map. All hash positions are initially stored by the database, but only 100 random hash positions will be kept for each seed (-mhp). This number is very important, as by default the aligner tries to find all matches. If all matches were to be written to the alignment file, this could result in a huge alignment file due to the large number of reads aligning to highly repetitive regions. However, setting this value too low will result in a low probability of resolving the correct position by a uniquely aligned mate. In each seed cluster, a minimum length of 20 bp is required (-act). A total of 10 processors (-p) are used to increase alignment speed.

15.2.2.1.3 Refining Alignment Files Using Paired-End Data with MosaikSort

For a large number of reads, in particular those from repetitive sequences in the genome, the alignment results in multiple positions on the reference genome. Part of these ambiguously aligned reads can be resolved if their mate has a unique mapping position. The MosaikSort utility will determine the distribution of insert sizes based on unambiguously mapped read pairs. Subsequently, read pairs will be resolved if they fall within the window of insert sizes.
In addition to resolving the ambiguous read pairs, alignment files also need to be sorted by their reference genome position for downstream processing. MosaikSort does both, for instance:

    MosaikSort -in reads_aligned.dat -out reads_aligned_sorted.dat

15.2.2.1.4 Creating BAM/SAM Files with MosaikText

MOSAIK provides a number of alignment output file formats. The SAM/BAM format [41], currently the most widely used format for short-read alignments, can be generated from a MOSAIK alignment archive using the utility MosaikText. Since alignment archives can be extremely large, the BAM (binary sequence alignment/map) format, which is a binary form of the SAM format, is preferred:

    MosaikText -in reads_aligned_sorted.dat -bam yeast_aligned.bam

15.2.2.1.5 Other Useful Functions in MOSAIK

The strategy laid out in Sections 15.2.2.1.1 to 15.2.2.1.4 is the fastest way of getting from data to a SAM/BAM alignment archive. However, there are a number of other useful options in MOSAIK. While merging of different alignments can be done via the SAM/BAM format using SAMtools (see Section 15.2.3), it can also be done using the MosaikMerge command. There is also the option of making ACE or GIG alignment format files using the MosaikAssembler command. For assessing read depth across the genome, the MosaikCoverage command can be used. Removing identical reads can be done with the MosaikDupSnoop utility.

15.2.2.2 Making Alignments Using BWA

The BWA tool ([53], bio-bwa.sourceforge.net) is Heng Li’s progression from his earlier short-read aligner MAQ (Mapping and Assemblies with Qualities [44]).


MAQ was among the first short-read mappers, but lacked the option for making gapped alignments. BWA utilizes the so-called “Burrows–Wheeler Transform” (BWT), resulting in a small memory footprint. Several other short-read aligners apply the BWT, such as SOAP2 [54] and Bowtie [55]. Like MAQ, BWA generates mapping qualities and checks suboptimal hits in the BWT for their mapping qualities. BWA is multithreaded and can therefore make efficient use of multicore computers.

15.2.2.2.1 Creating an Indexed Reference Genome for BWA with the “index” Option

The first step is to make an index file of the reference genome. The option -a bwtsw specifies the algorithm implemented in BWA-SW. The default option (“is”) can only use genome sizes up to 2 Gbp:

    bwa index -a bwtsw reference.fa

Execution of this command will generate a number of index files, all having the same prefix as the reference FASTA file.

15.2.2.2.2 Aligning Reads to the Reference Using the “aln” Option

Even with paired-end read libraries, each of the mate libraries first needs to be aligned separately to the reference genome. The aln command has several options to specify how to deal with mismatches and gaps, and to influence overall performance. All options have default values; only the reference genome and read library name are required. The output is written to standard output and the alignments can be captured by redirecting it to a file:

    bwa aln -t 4 reference.fa reads_mate1.fq.gz > reads_mate1.sai
    bwa aln -t 4 reference.fa reads_mate2.fq.gz > reads_mate2.sai

15.2.2.2.3 Refining Alignment Files Using Paired-End Data with the “sampe” Option

This option generates alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly. By default, a distribution of insert sizes is made based on confidently placed read pairs, which is used to resolve mate pairs as in MOSAIK:

    bwa sampe reference.fa reads_mate1.sai reads_mate2.sai reads_mate1.fq.gz reads_mate2.fq.gz | gzip > mapped_reads.sam.gz

15.2.2.2.4 Further Considerations Using BWA

Unlike MOSAIK, BWA has no further utilities to process alignment files, such as sorting and removing duplicate reads. However, these utilities are provided by SAMtools [41], made by the same author. Together, BWA and SAMtools form a complete suite for mapping short reads, manipulating SAM/BAM files, and variant calling, which is the subject of Section 15.2.3. BWA has also been implemented in Galaxy [45,46] (main.g2.bx.psu.edu, see also Section 15.2.1.3).

15.2.3 Variant Calling

Variant calling is usually done based on alignments to a reference assembly. The input for variant calling, either SNPs or Indels, is therefore usually an alignment. For SNP detection, methods have been developed to estimate the quality of the SNP or genotype call, taking into account data quality, alignment quality, and systematic errors particular to the sequencing technology used. With regard to the latter, sequencing errors are often not random. For the Illumina GA, for instance, A ↔ C and G ↔ T substitutions are highly over-represented near the end of the reads, and such errors are usually underestimated with regard to their quality scores [47]. Since sequencing companies are constantly improving their base-calling algorithms, such generalizations may become obsolete in future versions of base-calling pipelines.


However, being aware of possible systematic errors remains important, and current standards recommend recalibration of quality scores [21]. Among the frequently used SNP-calling algorithms are the ones implemented in SOAPsnp [30] and SAMtools [41]. For SV calling a number of algorithms and programs are available. For instance, for short Indels, the program Dindel [17] has been used extensively in the human 1000 Genomes Project, and that algorithm is currently also used in the GATK software. For larger SV calling (CNV, rearrangement) based on paired-end information, popular programs include VariationHunter [19] and BreakDancer [18].

15.2.3.1 SAM Format

The availability of a standardized alignment format greatly facilitates the downstream analyses that need to be done on mapped read data, such as variant calling. Various sequence alignment and assembly formats have been designed in the past decades, such as ACE and GIG, but these formats lack a number of characteristics such as inclusion of read-pair data. The SAM (Sequence Alignment/Map) format [41] was designed specifically to accommodate all current sequence technologies, to scale efficiently to high sequence volumes (more than 10^11 bp, or more than 30 times sequence depth of human-sized genomes), and to accommodate paired-end data. In addition, it is easily implemented in various alignment programs, it can be easily converted from other formats, it allows efficient indexing and sorting, and it can be processed as a stream rather than requiring the entire alignment file to be held in memory. This latter argument is highly relevant, as uncompressed SAM files may be many dozens of gigabytes in size. The SAM format consists of two parts: a header section and an alignment section. The lines in the header section start with “@”, and contain the header (HD), sequence dictionary (SQ), and read group (RG) record types. The alignment section contains one line per read alignment to the reference.
Each line contains the query (or read) name, a bitwise flag, the reference name, the position of the query on the reference (1-based left-most position), the mapping quality, the (extended) CIGAR string, the mate reference name (“=” when the same as the reference name of the query), mate position, insert size (0 when the mate is not mapped to the same reference), query DNA sequence, and query base qualities. Subsequent columns can be added for optional tags.

    @HD VN:1.0
    @SQ SN:chr20 LN:62435964
    @RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
    @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
    read_28833 99 chr20 28833 20 10M1D25M = 289934 195 \
    AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG