Clinical Genomics (ISBN 9780124047488, 0124047483)

Clinical Genomics provides an overview of the various next-generation sequencing (NGS) technologies that are currently used …


English, 488 pages, 2015


Table of contents:
Front Cover
Clinical Genomics
Copyright Page
Dedication
Contents
List of Contributors
Foreword
Preface
Acknowledgments
I. Methods
1 Overview of Technical Aspects and Chemistries of Next-Generation Sequencing
Clinical Molecular Testing: Finer and Finer Resolution
Chemistry of Sanger Sequencing, Electrophoresis, Detection
Applications in Clinical Genomics
Read Length and Input Requirements
Cyclic Array Sequencing
Illumina Sequencing
Library Prep and Sequencing Chemistry
Phasing
SOLiD Sequencing
Ion Torrent Sequencing
Roche 454 Genome Sequencers
References
2 Clinical Genome Sequencing
Next-Generation Sequencing
Sequencing in the Clinical Laboratory
Applications and Test Information
Challenges of Defining a Test Offering That Is Specific to Each Case
Preanalytical and Quality Laboratory Processes
Analytical
Bioinformatics
Validation
Interpretation and Reporting
References
3 Targeted Hybrid Capture Methods
Specimen Requirements and DNA Preparation
General Overview of Library Preparation
Obstacles of Target Capture
Library Complexity
Solid-Phase Hybrid Capture
Solution-Based Hybrid Capture
Molecular Inversion Probes
Amplification-Based Enrichment Versus Capture-Based Enrichment
Exome Capture
Selected Gene Panels
Variant Detection
Workflow and TAT
References
4 Amplification-Based Methods
Introduction
Sequencing Workflow
Nucleic Acids Preparation
Primer Design for Multiplex PCR
Library Preparation and Amplification
Comparison of Amplification- and Capture-Based Methods
Clinical Applications
References
5 Emerging DNA Sequencing Technologies
Introduction
Single-Molecule Real-Time (SMRT) DNA Sequencing
Heliscope Genetic Analysis System
Nanopore Sequencing
Transmission Electron Microscopy
References
6 RNA-Sequencing and Methylome Analysis
Next-Generation Methods of RNA-Seq
Initial Processing of Raw Reads: Quality Assessment
Read Alignment
RNA-Seq Variant Calling and Filtering
Fusion Detection
Utility of RNA-Seq for Genomic Structural Variant Detection
Methylome Sequencing
References
List of Acronyms and Abbreviations
II. Bioinformatics
7 Base Calling, Read Mapping, and Coverage Analysis
Base Calling
Read Mapping
Platform-Specific Base Calling Methods
Density
PhiX-Based Quality Metrics
Usable Sequence
Base Calling
Key Processes
Reference Genome
Novoalign
Sequence Alignment Factors
Performance and Diagnostic Metrics
Library Fragment Length
Percent of Unique Reads
Summary
References
8 Single Nucleotide Variant Detection Using Next Generation Sequencing
Introduction
Metal Ions
Radiation
Altered RNA Splicing
Target Size
Target Enrichment Approach
Anticipated Sample Purity
Bioinformatic Approaches for SNV Calling
Parameters Used for SNV Detection
Implications for Clinical NGS
Prediction Tools for Possible Splice Effects
Summary
References
9 Insertions and Deletions (Indels)
Indel Definition and Relationship to Other Classes of Mutations
Testing for Indels in Constitutional and Somatic Disease
Slipped Strand Mispairing (Polymerase Slippage)
Frequency of Indels in Human Genomes
Decreased Transcription
Frameshift
Predicting Functional Effects of Novel Indels
Sequencing Platform Chemistry
Sequence Read Type and Alignment
Specimen Issues That Impact Indel Detection by NGS
Local Realignment
Left Alignment
Probabilistic Modeling Using Mapped Reads
Split-Read Analysis
Indel Annotation
Reference Standards
References
10 Translocation Detection Using Next-Generation Sequencing
Mechanisms of Translocation Formation
Translocations in Leukemias
Sarcomas
Hereditary Cancer Syndromes
Translocation Detection by Targeted DNA Sequencing
Detection of Translocations and Inversions
RNA-Seq-Based Analysis
Translocation Detection in Clinical Practice
Laboratory Issues
Online Resources
References
11 Copy Number Variant Detection Using Next-Generation Sequencing
CNV Definition and Relationship to Other Classes of Structural Variation
Frequency in the Human Genome
CNVs and Disease: Functional Consequences
Targeted Sequencing of Candidate Genes
Whole Genome Sequencing
Introduction
Discordant Mate Pair Methods
Depth of Coverage
SNP Allele Frequency
Split Reads and Local De Novo Assembly
Targeted Gene Screening
Cell-Free DNA
Reference Standards
1000 Genomes Project Structural Variant Map Data Set
References
Glossary
List of Acronyms and Abbreviations
III. Interpretation
12 Reference Databases for Disease Associations
Introduction
Sequence Discovery
Understanding a Reference Assembly
Overview
HapMap
1000 Genomes Project
NHLBI-ESP
Defining Diseases and Phenotypes
Orphanet
dbSNP
dbGaP
COSMIC
By One or More Genomic Locations
By Attributes of a Particular Variant
Variants in the ACMG Incidental Findings Gene List
Expert Panels and Professional Guidelines
Standard Setting for Clinical Grade Databases
GA4GH and the Beacon Project
References
List of Acronyms and Abbreviations
13 Reporting of Clinical Genomics Test Results
Summary Statement of Test Interpretation
Gene Name and Transcript Number
Online Mutation Databases
Computational Prediction Programs
Interpretation of the Test Result
Incidental or Secondary Findings
Types of Mutations Detected by the Assay
Clinical Sensitivity and Specificity
Providing Raw Data to Clinicians and Patients
References
List of Acronyms and Abbreviations
14 Reporting Software
Clinical Genomic Test Order Entry
Analytics: From Reads to Variant Calls
Pipeline Orchestration and Management
Analytics: Variant Annotation and Classification
Final Report Transmission to the EMR
Support Personnel
References
List of Acronyms and Abbreviations
15 Constitutional Diseases: Amplification-Based Next-Generation Sequencing
Disease-Targeted Sequencing
Target Enrichment
Multigene Panel Validation
Run Validation Samples
Bioinformatics and Data Interpretation
Advantages and Disadvantages of Amplification-Based NGS
References
List of Acronyms and Abbreviations
16 Targeted Hybrid Capture for Inherited Disease Panels
Inherited Cardiomyopathies
Evolution of Medical Sequencing in Molecular Diagnostics
Target Selection Using Hybridization-Based Capture
Ensuring Adequate Coverage Across the Entire ROI
Sequencing Regions of Increased or Decreased GC Content
Workflow
Automation
Targeted Hybrid Capture: Analytical Sensitivity Across the Variant Spectrum
Gene Panel Testing Strategy
Anticipating Interpretive Challenges: Impact of Panel Size on Variant Interpretation
Whole Genome Sequencing
Other Target Selection Methods
Inherited Cardiomyopathies
Conclusion and Outlook
References
17 Constitutional Disorders: Whole Exome and Whole Genome Sequencing
Introduction
Historical Perspective
The Microarray
Advantages of Genomic Sequencing
What Regions Are Targeted/Covered?
Resource-Based Considerations
Phenotypically Similar Unrelated Probands
The Continued Importance of Clinical Analyses in the Era of Genomic Sequencing
Recessive Diseases
Issues and Concerns with the Use of Population Variation Databases to Filter Genomic Data Sets
The Accuracy and Reproducibility of Databases
Recognizing and Managing Artifacts
Functional Interpretation of Variants
Combinatorial Approaches
Determining the Optimal Scope of Genetic/Genomic Investigations
Integrating the Management of Additional Genomic Information
Managing the Data Load in Clinical Scenarios
Consequences of Genomic Sequencing
Conclusion and Future Directions
References
Glossary
18 Somatic Diseases (Cancer): Amplification-Based Next-Generation Sequencing
Pyrosequencing-Based NGS: Roche 454 Genome Sequencer
Reversible Dye-Terminator-Based NGS: Illumina HiSeq and MiSeq Systems
Ion Semiconductor-Based NGS: Life Technology PGM and Proton Systems
Sequencing by Ligation-Based NGS: Life Technology ABI SOLiD Sequencer
Amplification-Based NGS Technologies
Targeted DNA Analysis Using Multiplex Amplification
Targeted DNA Analysis Using Targeted Capture Followed by Multiplex Amplification
Targeted RNA Analysis by Multiplex Amplification
Targeted RNA Analysis Using Targeted Capture Followed by Multiplex Amplification
Advantages and Disadvantages of Amplification-Based NGS
Clinical Application of Amplification-Based NGS in Cancer
DNA/RNA Extraction and Quality Control
Cancer-Specific Targeted Panels
Ion AmpliSeq™ Comprehensive Cancer Panel
AmpliSeq Custom Cancer Panels
Illumina TruSeq Amplicon Cancer Panel
Data Analysis
Interpretation and Reporting
Challenges and Perspectives
References
19 Targeted Hybrid-Capture for Somatic Mutation Detection in the Clinic
Clinical Utility of Somatic Mutation Detection in Cancer
Solid-Phase Versus In-Solution Phase Capture
Comparison of In-Solution Hybridization Capture-Based and Amplification-Based Targeted Enrichment Methods for Molecular Oncology
Amenable to Multiplexing
Detection of Structural Rearrangements (Translocations, Inversions, and Indels)
Copy Number Variation (CNV) Detection
High Depth of Coverage
Pathologic Assessment
QC Metrics
Validation
References
20 Somatic Diseases (Cancer): Whole Exome and Whole Genome Sequencing
Spectrum of Somatic Mutations in Cancer
Exon Level Mutations
Chromosome Level Mutations
Paired Tumor–Normal Testing
Determination of Somatic Status Without Paired Normal Tissue
Variants of Unknown Significance
Statistical Models of Mutation Effect
Clonal Architecture Analysis
Decreased Depth of Coverage, Sensitivity, and Specificity
Validation of a Single Assay
Improved Copy Number Variant Detection
Summary
References
IV. Regulation, Reimbursement, and Legal Issues
21 Assay Validation
NGS Workflow
Assay Validation
Accuracy
Precision
Analytical Sensitivity and Analytical Specificity
Quality Control
Conclusion
References
List of Acronyms and Abbreviations
22 Regulatory Considerations Related to Clinical Next Generation Sequencing
Regulatory Standards
FDA Oversight of Clinical NGS
In Common with Traditional Tests
Analytic Variables
Analytic Sensitivity and Analytic Specificity
Versioning and Revalidation
Postanalytic Variables
Proficiency Testing
Cell Lines
Conclusion
References
23 Genomic Reference Materials for Clinical Applications
Challenges in Developing a Whole Genome Reference Material
Reference Material Selection and Design
Bioinformatics, Data Integration, and Data Representation
Performance Metrics and Figures of Merit
Reference Data
Gene Expression RMs
References
24 Ethical Challenges to Next-Generation Sequencing
Introduction
Beneficence/Nonmaleficence
Conclusion
Diagnostics Versus Screening
Research and Clinical Care
“What to Disclose” Is Becoming “What Not to Disclose”
Research Results Versus Clinical Results
Incidental Findings
Analytic Validity
Clinical Utility
Personal Utility
ELSI (Ethical, Legal and Social Implications)
Recommendations
Recommendations
Introduction
Data Protection Methods
Reidentification
Required/Permitted Sharing
Introduction
Balance the Amount of Information with Patient Initiative
The Right to Know and the Right Not to Know
Counseling
References
Glossary
List of Acronyms and Abbreviations
25 Legal Issues
Introduction
History of Gene Patents
Important Legal Cases
Genetic Information Nondiscrimination Act
References
26 Billing and Reimbursement
Reimbursement Rate
Diagnosis and Procedure Codes
Test Design Factors That Impact Reimbursement
Entities Focused on Healthcare Expenditures
Health Outcomes
Summary
Glossary
List of Acronyms and Abbreviations
Index

CLINICAL GENOMICS


CLINICAL GENOMICS

Edited by

SHASHIKANT KULKARNI, M.S. (MEDICINE), PH.D., FACMG
Washington University School of Medicine, St. Louis, MO, USA

JOHN PFEIFER, M.D., PH.D.
Washington University School of Medicine, St. Louis, MO, USA

AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Academic Press is an imprint of Elsevier

Academic Press is an imprint of Elsevier
32 Jamestown Road, London NW1 7BY, UK
525 B Street, Suite 1800, San Diego, CA 92101-4495, USA
225 Wyman Street, Waltham, MA 02451, USA
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK

© 2015 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-12-404748-8

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

For information on all Academic Press publications visit our website at http://store.elsevier.com/

Typeset by MPS Limited, Chennai, India
www.adi-mps.com

Printed and bound in the United States of America

Dedication

This book is dedicated

With honor to our patients and their loved ones; my family (Shamika and Sonya-BGOW); my teachers and my parents —SK

To Jennifer, who has made it all worthwhile —JDP


Contents

List of Contributors  xi
Foreword  xiii
Preface  xv
Acknowledgments  xvii

I. METHODS

1. Overview of Technical Aspects and Chemistries of Next-Generation Sequencing
IAN S. HAGEMANN
Clinical Molecular Testing: Finer and Finer Resolution  3
Sanger Sequencing  4
Cyclic Array Sequencing  7
Illumina Sequencing  8
SOLiD Sequencing  11
Ion Torrent Sequencing  14
Roche 454 Genome Sequencers  16
Third-Generation Sequencing Platforms  18
References  18

2. Clinical Genome Sequencing
TINA M. HAMBUCH, JOHN MAYFIELD, SHANKAR AJAY, MICHELLE HOGUE, CARRI-LYN MEAD AND ERICA RAMOS
Introduction  21
Applications and Test Information  24
Laboratory Process, Data Generation, and Quality Control  26
Conclusion  34
References  34

3. Targeted Hybrid Capture Methods
ELIZABETH C. CHASTAIN
Introduction  38
Basic Principles of Hybrid Capture-Based NGS  38
Hybrid Capture-Based Target Enrichment Strategies  43
Clinical Applications of Target Capture Enrichment  48
Variant Detection  51
Practical and Operational Considerations  52
Conclusions  53
References  53

4. Amplification-Based Methods
MARINA N. NIKIFOROVA, WILLIAM A. LAFRAMBOISE AND YURI E. NIKIFOROV
Introduction  57
Principles of Amplification-Based Targeted NGS  58
Nucleic Acids Preparation  59
Primer Design for Multiplex PCR  60
Library Preparation and Amplification  61
Other Amplification-Based Target Enrichment Approaches  62
Comparison of Amplification- and Capture-Based Methods  62
Clinical Applications  64
Conclusion  66
References  66

5. Emerging DNA Sequencing Technologies
SHASHIKANT KULKARNI AND JOHN PFEIFER
Introduction  70
Third-Generation Sequencing Approaches  71
Fourth-Generation Sequencing  73
Selected Novel Technologies  74
Summary  75
References  75

6. RNA-Sequencing and Methylome Analysis
SHAMIKA KETKAR AND SHASHIKANT KULKARNI
Introduction  78
Approaches to Analysis of RNA  78
Workflow  79
Utility of RNA-Seq to Characterize Alternative Splicing Events  84
Utility of RNA-Seq for Genomic Structural Variant Detection  84
RNA-Seq: Challenges, Pitfalls, and Opportunities in Clinical Applications  85
Methylome Sequencing  85
Conclusions  86
References  86
List of Acronyms and Abbreviations  88

II. BIOINFORMATICS

7. Base Calling, Read Mapping, and Coverage Analysis
PAUL CLIFTEN
Introduction  92
Platform-Specific Base Calling Methods  94
Read Mapping  100
Coverage Analysis: Metrics for Assessing Genotype Quality  103
Summary  106
References  107

8. Single Nucleotide Variant Detection Using Next Generation Sequencing
DAVID H. SPENCER, BIN ZHANG AND JOHN PFEIFER
Introduction  110
Sources of SNVs  111
Consequences of SNVs  113
Technical Issues  114
Bioinformatic Approaches for SNV Calling  117
Interpretation of SNVs  122
Reporting  123
Summary  123
References  124

9. Insertions and Deletions (Indels)
JENNIFER K. SEHN
Overview of Insertion/Deletion Events (Indels)  130
Sources, Frequency, and Consequences of Indels  133
Technical Issues That Impact Indel Detection by NGS  139
Specimen Issues That Impact Indel Detection by NGS  141
Bioinformatics Approaches to NGS Indel Detection  142
Summary  148
References  148

10. Translocation Detection Using Next-Generation Sequencing
HALEY ABEL, JOHN PFEIFER AND ERIC DUNCAVAGE
Introduction to Translocations  152
Translocations in Human Disease  153
Translocation Detection  156
Informatic Approaches to Translocation Detection  158
Translocation Detection in Clinical Practice  160
Summary and Conclusion  163
References  163

11. Copy Number Variant Detection Using Next-Generation Sequencing
ALEX NORD, STEPHEN J. SALIPANTE AND COLIN PRITCHARD
Overview of Copy Number Variation and Detection via Clinical Next-Generation Sequencing  166
Sources, Frequency, and Functional Consequences of Copy Number Variation in Humans  167
CNV Detection in Clinical NGS Applications  170
Conceptual Approaches to NGS CNV Detection  173
Detection in the Clinic: Linking Application, Technical Approach, and Detection Methods  180
Reference Standards  182
Orthogonal CNV Validation  184
Summary and Conclusion  184
References  184
Glossary  186
List of Acronyms and Abbreviations  187

III. INTERPRETATION

12. Reference Databases for Disease Associations
WENDY S. RUBINSTEIN, DEANNA M. CHURCH AND DONNA R. MAGLOTT
Introduction  192
Identification and Validation of Human Variation  193
Identification of Common Variation  195
Interpretation of Common Variation  199
Defining Diseases and Phenotypes  199
Representation of Variation Data in Public Databases  201
Data Access and Interpretation  205
Determination of Variant Pathogenicity  210
Global Data Sharing  212
Conclusion  213
References  213
List of Acronyms and Abbreviations  215

13. Reporting of Clinical Genomics Test Results
KRISTINA A. ROBERTS, RONG MAO, BRENDAN D. O’FALLON AND ELAINE LYON
Introduction  219
Components of the Written NGS Report  219
Beyond the Written Report: Other NGS Reporting Issues to Consider  227
Conclusion  228
References  228
List of Acronyms and Abbreviations  229

14. Reporting Software
RAKESH NAGARAJAN
Introduction  232
Clinical Genomic Test Order Entry  232
Laboratory Information Management Systems (LIMS) Tracking  233
Analytics: From Reads to Variant Calls  233
Analytics: Variant Annotation and Classification  235
Variant Interpretation  236
Final Report Transmission to the EMR  236
Leveraging Standards in Clinical Genomics Software Systems  237
Regulatory Compliance  237
Support Personnel  237
Conclusion  238
References  238
List of Acronyms and Abbreviations  239

15. Constitutional Diseases: Amplification-Based Next-Generation Sequencing
VANESSA L. HORNER AND MADHURI R. HEGDE
Introduction  241
Multigene Panel Validation  243
Clinical Workflow  245
Conclusion  247
References  248
List of Acronyms and Abbreviations  249

16. Targeted Hybrid Capture for Inherited Disease Panels
SAMI S. AMR AND BIRGIT FUNKE
Introduction  252
Target Selection Using Hybridization-Based Capture  255
Design and Implementation of Targeted Hybridization-Based Capture Panels  257
Targeted Hybrid Capture: Selecting a Panel for Constitutional Diseases  262
Applications in Clinical Practice: Lessons Learned  266
References  268

17. Constitutional Disorders: Whole Exome and Whole Genome Sequencing
BENJAMIN D. SOLOMON
Introduction  273
Genomic Sequencing  276
Analyzing Individual and Multiple Data Sets for Causal Mutation Discovery  279
Conclusion and Future Directions  291
Acknowledgment  292
References  292
Glossary  296

18. Somatic Diseases (Cancer): Amplification-Based Next-Generation Sequencing
FENGQI CHANG, GEOFFREY L. LIU, CINDY J. LIU AND MARILYN M. LI
Introduction  298
NGS Technologies  298
Amplification-Based NGS Technologies  302
Advantages and Disadvantages of Amplification-Based NGS  307
Clinical Application of Amplification-Based NGS in Cancer  308
Data Analysis  314
Interpretation and Reporting  315
Challenges and Perspectives  316
References  317

19. Targeted Hybrid Capture for Somatic Mutation Detection in the Clinic
CATHERINE E. COTTRELL, ANDREW J. BREDEMEYER AND HUSSAM AL-KATEB
Introduction  322
Clinical Utility of Somatic Mutation Detection in Cancer  322
Description of Hybridization-Based Methodology  323
Utility of Targeted Hybrid Capture  326
NGS in a Clinical Laboratory Setting  333
Conclusion  338
References  338

20. Somatic Diseases (Cancer): Whole Exome and Whole Genome Sequencing
JENNIFER K. SEHN
Introduction to Exome and Genome Sequencing in Cancer  344
Interpretative Considerations in Exome and Genome Cancer Sequencing  344
Analytic Considerations for Exome and Genome Sequencing in Cancer  352
Summary  356
References  357

IV. REGULATION, REIMBURSEMENT, AND LEGAL ISSUES

21. Assay Validation
AMY S. GARGIS, LISA KALMAN AND IRA M. LUBIN
Introduction  364
NGS Workflow  364
The Regulatory and Professional Framework for Assuring Quality  367
Assay Validation  367
Accuracy  369
Precision  370
Analytical Sensitivity and Analytical Specificity  371
Reportable and Reference Ranges  372
Quality Control  372
Reference Materials  373
Conclusion  373
Acknowledgment  374
References  374
List of Acronyms and Abbreviations  376

22. Regulatory Considerations Related to Clinical Next Generation Sequencing
SHASHIKANT KULKARNI AND JOHN PFEIFER
Introduction  378
Regulatory Standards  378
FDA Oversight of Clinical NGS  379
Total Quality Management: QC  381
Total Quality Management: QA  386
Conclusion  388
References  389

23. Genomic Reference Materials for Clinical Applications
JUSTIN ZOOK AND MARC SALIT
Introduction  393
Genome in a Bottle Consortium  394
Reference Data  399
Other Reference Materials for Genome-Scale Measurements  400
Conclusion  401
References  401

24. Ethical Challenges to Next-Generation Sequencing
STEPHANIE SOLOMON
Introduction  404
Challenging Existing Frameworks  408
Notifying of Results  411
Privacy and Confidentiality  421
Informed Consent  425
Conclusion  430
References  430
Glossary  433
List of Acronyms and Abbreviations  434

25. Legal Issues
ROGER D. KLEIN
Introduction  435
Patent Overview  436
History of Gene Patents  436
Arguments for and Against Gene Patents  438
Important Legal Cases  438
Implication of Recent Court Decisions for Genetic Testing  443
Genetic Information Nondiscrimination Act  443
References  445

26. Billing and Reimbursement
KRIS RICKHOFF, ANDREW DRURY AND JOHN PFEIFER
Introduction  448
Insurance Payers  448
Reimbursement Processes  448
Test Design Factors That Impact Reimbursement  452
Patient Protection and Affordable Care Act  454
Cost Structure  456
Summary  456
References  457
Glossary  457
List of Acronyms and Abbreviations  458

Index  459

List of Contributors

Haley Abel  Division of Statistical Genetics, Washington University School of Medicine, St. Louis, MO, USA
Shankar Ajay  Illumina Clinical Services Laboratory, San Diego, CA, USA
Hussam Al-Kateb  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Sami S. Amr  Department of Pathology, Brigham and Women’s Hospital/Harvard Medical School, Boston, MA, USA; Laboratory for Molecular Medicine, Partners Healthcare Personalized Medicine, Cambridge, MA, USA
Andrew J. Bredemeyer  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Fengqi Chang  Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
Elizabeth C. Chastain  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Deanna M. Church  Personalis, Inc., Menlo Park, CA, USA
Paul Cliften  Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
Catherine E. Cottrell  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Andrew Drury  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Eric Duncavage  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Birgit Funke  Department of Pathology, Massachusetts General Hospital/Harvard Medical School, Boston, MA, USA; Laboratory for Molecular Medicine, Partners Healthcare Personalized Medicine, Cambridge, MA, USA
Amy S. Gargis  Division of Preparedness and Emerging Infections, Laboratory Preparedness and Response Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA
Ian S. Hagemann  Departments of Pathology and Immunology and of Obstetrics and Gynecology, Washington University School of Medicine, St. Louis, MO, USA
Tina M. Hambuch  Illumina Clinical Services Laboratory, San Diego, CA, USA
Madhuri R. Hegde  Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA
Michelle Hogue  Illumina Clinical Services Laboratory, San Diego, CA, USA
Vanessa L. Horner  Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA
Lisa Kalman  Division of Laboratory Programs, Services, and Standards, Centers for Disease Control and Prevention, Atlanta, GA, USA
Shamika Ketkar  Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
Roger D. Klein  Department of Molecular Pathology, Robert J. Tomsich Pathology and Laboratory Medicine Institute, Cleveland Clinic, Cleveland, OH, USA
Shashikant Kulkarni  Department of Pathology and Immunology, Department of Pediatrics, and Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
William A. LaFramboise  Genomics Division of the Cancer Biomarker Facility, Shadyside Hospital, University of Pittsburgh Medical Center, Pittsburgh, PA, USA
Marilyn M. Li  Dan Duncan Cancer Center, Baylor College of Medicine, Houston, TX, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
Cindy J. Liu  Research and Computing Services, Harvard Business School, Cambridge, MA, USA
Geoffrey L. Liu  Department of Human Genetics, University of Chicago, Chicago, IL, USA
Ira M. Lubin  Division of Laboratory Programs, Services, and Standards, Centers for Disease Control and Prevention, Atlanta, GA, USA
Elaine Lyon  ARUP Laboratories, Salt Lake City, UT, USA; Department of Pathology, University of Utah School of Medicine, Salt Lake City, UT, USA
Donna R. Maglott  National Center for Biotechnology Information/National Library of Medicine/National Institutes of Health, Bethesda, MD, USA
Rong Mao  ARUP Laboratories, Salt Lake City, UT, USA; Department of Pathology, University of Utah School of Medicine, Salt Lake City, UT, USA
John Mayfield  Illumina Clinical Services Laboratory, San Diego, CA, USA
Carri-Lyn Mead  Illumina Clinical Services Laboratory, San Diego, CA, USA
Rakesh Nagarajan  Department of Pathology and Immunology, and Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
Marina N. Nikiforova  Division of Molecular and Genomic Pathology, Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, PA, USA
Yuri E. Nikiforov  Division of Molecular and Genomic Pathology, Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, PA, USA
Alex Nord  Center for Neuroscience, Departments of Neurobiology, Physiology and Behavior and Psychiatry, University of California at Davis, CA, USA
Brendan D. O’Fallon  ARUP Laboratories, Salt Lake City, UT, USA
John Pfeifer  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Colin Pritchard  Department of Laboratory Medicine, University of Washington, Seattle, WA, USA
Erica Ramos  Illumina Clinical Services Laboratory, San Diego, CA, USA
Kris Rickhoff  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Kristina A. Roberts  ARUP Laboratories, Salt Lake City, UT, USA; Department of Pathology, University of Utah School of Medicine, Salt Lake City, UT, USA
Wendy S. Rubinstein  National Center for Biotechnology Information/National Library of Medicine/National Institutes of Health, Bethesda, MD, USA
Stephen J. Salipante  Department of Laboratory Medicine, University of Washington, Seattle, WA, USA
Marc Salit  Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
Jennifer K. Sehn  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Benjamin D. Solomon  Medical Genetics Branch, National Human Genome Research Institute/National Institutes of Health, Bethesda, MD, USA
Stephanie Solomon  Albert Gnaegi Center for Health Care Ethics, Saint Louis University, Salus Center, St. Louis, MO, USA
David H. Spencer  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Bin Zhang  Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA
Justin Zook  Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA

Foreword

Genomics is a young scientific discipline, surprisingly so considering the rapid prominence that it has gained across nearly the entire landscape of biological and biomedical research. In fact, the word “genomics” was first described in the scientific literature in 1987 [1]. That year was personally significant for me—I graduated medical and graduate school in 1987 and started my residency in laboratory medicine (clinical pathology). Later that same year, I also made the decision to shift my area of research from cell biology/biochemistry (which had encompassed my undergraduate and graduate research efforts) to this nascent, heavily discussed area involving the comprehensive study of genomes—a.k.a., genomics.

At that time, the raging debate in the biomedical research circles focused on the proposal for a big “genome project.” Some argued that it was simply a bad idea; a subset of those also claimed that it would be a “career dead end” for young trainees who foolishly joined in the effort. Fortunately, I ignored both of those views and went “all in”—dedicating the research component of my pathology training to genomics and becoming a front line participant of the Human Genome Project (both at its launch and throughout its 13 years). While admittedly vague, at that time I could envision productive connections between enhanced knowledge of the human “genomic blueprint” and the diagnostic work of pathologists. My earliest training in clinical pathology (combined with a deep appreciation for other areas of pathology) had quickly revealed the need for more robust tools to refine diagnoses and empower clinicians to acquire insights about the “uniqueness” of each patient. In short, too much of medical care was generic, lacking the fine-tuning needed for truly personalized care. I thought that genomics could possibly help to provide that fine-tuning. Let me emphasize—these were very nascent insights; I had no well-formulated ideas about timetable or implementation, and assumed that any sizable infusion of genomics into diagnostic medicine was many decades away. In hindsight, I should have been more brash!

Fast forward to 2014—slightly more than a quarter century later and a blink of an eye in the history of scientific inquiry. Since that time, we have sequenced the human genome—the first time as part of the Human Genome Project and then tens of thousands of times since then, as the cost of sequencing DNA plummeted roughly a million-fold with the development and implementation of powerful new technologies. These new capabilities catalyzed a flurry of key advances in understanding the genomic bases of specific disorders (most notably, rare diseases and cancer) and drug response, yielding convincing and prototypic examples in which genomic information has clinical utility. In aggregate, these early triumphs have helped to clarify the vague notion of “genomic medicine” and to illuminate a path forward for more widespread utilization of genomics in patient care. To make this vision a reality, genome sequencing will need to become a tool fully adapted for use in clinical settings—and that will require bringing together genomic technologies and clinical implementation. However, bringing the genomic and clinical worlds together is not easy. To help make it happen, scholarly guides are needed. Kulkarni and Pfeifer have created such a resource in producing Clinical Genomics.
For this book, an assembled group of key experts and opinion leaders wrote 26 chapters that collectively cover a wide swath of territory relevant to the use of genome sequencing for medical diagnostics. Their audience: clinical laboratory professionals and physicians (both established and in-training) seeking an overview of and guidance for adoption of genome sequencing as a clinical tool. The general areas covered include genome sequencing methods, data analysis, interpreting and reporting genomic information, genomic-based diagnostics for specific disorders, and ancillary (but critically important) topics related to regulation, reimbursement, and legal issues. It is particularly impressive that chapters covering ethical and legal challenges associated with clinical genomics were included.

In addition to providing key technical details for diagnostic-based genome sequencing, Clinical Genomics effectively converts the knowledge of human genomics gained in basic science research settings into factual, practice-based information to facilitate the use of genome sequencing in clinical settings.

In summary, Kulkarni, Pfeifer, and their recruited authors aimed to compile a first-rate book that would benefit interested physicians, pathologists, and other healthcare professionals wishing to learn about the opportunities and challenges of using genome sequencing for the diagnosis, prognosis, and management of inherited and somatic disorders—something key for realizing a future of genomic medicine. As evident from the pages that follow, that goal has been clearly reached. Readers of Clinical Genomics will gain important insights about the future of medicine, a future in which genomic information increasingly becomes a mainstream component of diagnostics and clinical care. I applaud Kulkarni and Pfeifer for helping to make genomic medicine a reality!

Eric Green, MD, PhD
Director, National Human Genome Research Institute (NHGRI)

Reference

[1] McKusick VA, Ruddle FH. Genomics 1987;1:1–2.

Eric Green received his graduate and postgraduate training at Washington University in St. Louis, earning an MD-PhD in 1987 and then pursuing clinical training (in clinical pathology) and postdoctoral research training (in genomics) until 1992. Following 2 years as an assistant professor at Washington University, Dr. Green joined the National Human Genome Research Institute of the National Institutes of Health, where he has been for just over 20 years. During that time, he has assumed multiple leadership positions, being appointed the Institute’s Director in 2009.

Preface

In 2011, next generation sequencing (NGS)-based clinical diagnostic testing was implemented for precision medicine at Washington University School of Medicine, St. Louis by Genomics and Pathology Services (GPS). Even before GPS began accepting patient specimens for testing of a set of genes that provided information to direct the clinical care of oncology patients, it was evident that there was a lack of real-world educational resources for healthcare providers who recognized the need to incorporate NGS-based tests into their clinical practice. There was the need for a textbook specifically devoted to the practice-based issues that are unique to NGS on technical, bioinformatic, interpretive, ethical, and regulatory levels. And as the portfolio of tests offered by GPS has grown, as the volume of testing has increased, and as a wider variety of physicians and trainees have incorporated NGS-based tests into their clinical practice, the need for a textbook focused on clinical genomics has become more and more obvious. This book was produced in order to meet that need.

The chapters are authored by nationally recognized experts (with practical experience in that they are associated with clinical laboratories that are actively performing clinical NGS-based testing for either constitutional/hereditary diseases or somatic/acquired diseases such as cancer). The topics covered include the technical details of the platforms and chemistries used to perform NGS; the conceptual underpinnings of assay design for testing gene panels, exomes, or whole genomes; the bioinformatics pipelines required to identify and annotate sequence variants; the clinical settings in which NGS can contribute to patient care; the ethical and regulatory issues surrounding NGS testing; and the reimbursement issues that govern payment for testing. The focus throughout is on the advantages and disadvantages; capabilities and limitations; and clinical settings for which genomic methods can be used clinically to enhance patient care, and the key elements that must be considered in the design, validation, and implementation of NGS-based tests for this purpose.

The book is not a laboratory manual or a compilation of laboratory protocols, for two reasons. First, generic protocols are of little use since many clinical NGS labs develop customized tests based on gene panels, the exome, or the whole genome as required to meet the clinical needs of their patient population. Second, given the rapid pace of change in NGS platforms, bioinformatics pipelines, and relevant genetic loci, any set of laboratory protocols would be hopelessly out of date before it ever appeared in print!

The book should be useful to a broad medical audience, including medical directors, pathologists, and geneticists who are responsible for designing and implementing NGS-based tests; oncologists, pediatricians, geneticists, and other clinicians who order genetic tests as part of the care of their patients whether for diagnosis or prediction of therapeutic response; laboratory personnel who perform the hands-on component of the testing; and trainees (whether medical students, residents, or fellows). The book should also prove useful to basic science and translational researchers who are interested in the clinical application of NGS in order to guide the research and development activities within their laboratories. And, since the book covers the bioethical, legal, and regulatory issues related to NGS, it can also serve as a textbook for undergraduate and graduate level courses focused on the broader topic of clinical genomics.


Acknowledgments

First and foremost, we thank our colleagues who contributed their time to write the chapters that comprise the work; the book would not have been possible without them. They are all extremely busy people, and their willingness to participate in this project is a sign of their commitment to share their expertise regarding the opportunities provided by next generation sequencing techniques to enhance patient care. We are fortunate to count them as not only colleagues but also friends, and have greatly benefited from their expert advice. In addition, we want to thank the leadership team at Genomics and Pathology Services (GPS) at Washington University School of Medicine in St. Louis. We especially want to acknowledge Dr. Karen Seibert, Director of GPS; Dr. Cathy Cottrell, Medical Director; Dr. Andrew Bredemeyer, Chief Operations Officer; and Dr. John Heusel, Chief Medical Officer, and thank them for their availability and willingness to assist us with this project. They have been tireless in their efforts to ensure that NGS-based methods can become part of the clinical testing performed to improve patient care. And we want to acknowledge our colleagues on the GPS team, including Drs. Eric Duncavage, Haley Abel, David Spencer, Hussam Al-Kateb, Ian Hagemann, Tu Nguyen, Robi Mitra, Rakesh Nagarajan, Richard Head, and Paul Clifton, all of whom generously supported the effort. We want to express our sincere gratitude to our hardworking colleagues and staff at the Genome Technology Access Center (GTAC), Center for Biomedical Informatics (CBMI), GPS, and Cytogenomics and Molecular Pathology for providing valuable direct and indirect support of the book as well. We also want to acknowledge Dr. Herbert "Skip" Virgin, Chairman of the Department of Pathology and Immunology, and Dr. Jeffrey Milbrandt, Chairman of the Department of Genetics, both at Washington University School of Medicine, for their support of the development and operation of GPS. They have provided a model of commitment to patient care, academic productivity, and focus on resident and fellow education that has made it possible for a laboratory focused on clinical genomics to flourish. We want to thank our assistants, specifically Elease Barnes and Amy Dodson, for their hard work typing and editing the various revisions of the chapters of this book, a task which they performed with endless patience and good humor. We have also been extremely fortunate to interact with a great group of people at Academic Press and Elsevier. Graham Nisbet, Senior Acquisitions Editor, was receptive to our idea for a book focused on clinical genomics using next generation sequencing technologies, and helped launch the project. Catherine Van Der Laan, Associate Acquisitions Editor, provided constant guidance and support; there is no doubt that her encouragement and strict attention to deadlines were absolutely essential for bringing the book to completion. Most importantly, we also owe a debt of gratitude to our families for their love, sustained encouragement, support, and patience during the long hours we spent writing and editing the book.

Shashikant Kulkarni and John Pfeifer, St. Louis, 2014



SECTION I: METHODS

CHAPTER 1

Overview of Technical Aspects and Chemistries of Next-Generation Sequencing

Ian S. Hagemann, Departments of Pathology and Immunology and of Obstetrics and Gynecology, Washington University School of Medicine, St. Louis, MO, USA

OUTLINE

Clinical Molecular Testing: Finer and Finer Resolution
Sanger Sequencing
    Chemistry of Sanger Sequencing, Electrophoresis, Detection
    Applications in Clinical Genomics
    Technical Constraints
        Read Length and Input Requirements
        Pooled Input DNA Puts a Limit on Sensitivity
Cyclic Array Sequencing
Illumina Sequencing
    Library Prep and Sequencing Chemistry
    Choice of Platforms
    Phasing
SOLiD Sequencing
Ion Torrent Sequencing
    AmpliSeq Library Preparation
Roche 454 Genome Sequencers
Third-Generation Sequencing Platforms
References

CLINICAL MOLECULAR TESTING: FINER AND FINER RESOLUTION

Progress in applying genetic knowledge to clinical medicine has always been tightly linked to the nature of the genetic information that was available for individual patients. Classical cytogenetics provides pan-genomic information at the level of whole chromosomes and subchromosomal structures on the scale of megabases. The availability of clinical cytogenetics made it possible to establish genotype–phenotype correlations for major developmental disabilities, including +21 in Down syndrome, the "fragile" X site in Fragile X syndrome, monosomy X in Turner syndrome, and the frequent occurrence of trisomies, particularly +13, +17, and +14, in spontaneous abortions. Over time, new experimental techniques have allowed knowledge to be accumulated at finer and finer levels of resolution, such that genotype–phenotype correlations are now routinely established at the single-nucleotide level. Thus it is now well known that the germline F5 p.R506Q mutation is responsible for the factor V Leiden phenotype [1] and that loss of imprinting at the SNRPN locus is responsible for Prader–Willi syndrome [1], to cite examples of two different types of molecular lesions. Clinical advances have been closely paralleled by progress in research testing, since the underlying technologies tend to be similar.


Historically, much clinical molecular testing has taken an indirect approach to determining gene sequences. Although the sequence was fundamentally the analyte of interest, indirect approaches such as restriction fragment length polymorphism (RFLP) analysis, allele-specific polymerase chain reaction (PCR), multiplex ligation-dependent probe amplification (MLPA), and invader chemistry assays have proven easier to implement in the clinical laboratory—easier and more cost-effective to standardize, to perform, and to interpret [2]. Technological advances in the past two decades have begun to change this paradigm by vastly facilitating the acquisition of gene sequence data. Famously, the human genome project required an investment of 10 years and about 10 billion dollars to determine the genomic sequence of a single reference individual. While the technology used for that project was innovative at the time, the effort and cost were clearly monumental and the project could never have been translated directly into a clinical testing modality. Fundamental technical advances, broadly described as next-generation sequencing (NGS), have lowered the cost and difficulty of genomic sequencing by orders of magnitude, so that it is now practical to consider implementing these methods for clinical testing. The first section of this book is a survey of the technologies used for NGS today. The present chapter focuses on the lowest-level building blocks of NGS: the chemical and technological basis of the methods used to convert nucleic acids into sequence. Subsequent chapters deal with methods for selecting the molecules to be sequenced (whole genome, exome, or gene panels) as well as different approaches for enriching the reagent pool for these molecules (capture and amplification) (Chapters 2–4). The section closes with a chapter on emerging “third-generation” methods, which promise to eventually allow single-molecule sequencing (Chapter 5), as well as a chapter on RNA-based methods which allow NGS technology to be used for expression profiling (Chapter 6).

SANGER SEQUENCING

Chemistry of Sanger Sequencing, Electrophoresis, Detection

In Sanger sequencing [3], DNA polymerase is used to synthesize numerous copies of the sequence of interest in a single primer extension step, using single-stranded DNA as a template. Chain-terminating 2′,3′-dideoxynucleotide triphosphates (ddNTPs) are spiked into the reaction. At each nucleotide incorporation event, there is a chance that a ddNTP will be added in place of a dNTP, in which case, in the absence of a 3′ hydroxyl group, the growing DNA chain will be terminated. The endpoint of the reaction is therefore a collection of DNA molecules of varying lengths, each terminated by a dideoxynucleotide [4].

The original Sanger sequencing method consists of two steps. In the "labeling and termination" step, primer extension is performed in four parallel reactions, each reaction containing a different ddNTP in addition to [α-35S]dATP and dNTPs. A "chase" step is then performed with abundant unlabeled dNTPs. Any molecules that have not incorporated a ddNTP will be extended so that they do not interfere with detection. The products are then separated by polyacrylamide gel electrophoresis in four parallel lanes representing ddA, ddT, ddC, and ddG terminators. The DNA sequence is read off of an autoradiograph of the resulting gel by calling peaks in each of the four lanes (Figure 1.1A).

Historically, Sanger sequencing employed the Klenow fragment of Escherichia coli DNA polymerase I. The Klenow fragment has 5′→3′ polymerase and 3′→5′ exonuclease activity, but lacks 5′→3′ exonuclease activity [5], thus preventing degradation of desired DNA polymerase products. Klenow fragment is only moderately processive and discriminates against incorporation of ddNTPs, a tendency which can be reduced by including Mn2+ in the reaction [6]. Sequenase, which was also commonly used, is a modified T7 DNA polymerase with enhanced processivity over Klenow fragment, a high elongation rate, decreased exonuclease activity, and minimal discrimination between dNTPs and ddNTPs [6,7].

Several variants of Sanger sequencing have been developed. In one of these, thermal cycle sequencing, 20–30 denaturation–annealing–extension cycles are carried out, so that small numbers of template molecules can be repeatedly utilized; since only a single sequencing primer is present, the result is linear amplification of the signal, rather than exponential amplification as would be the case in a PCR [4,8]. The high-temperature steps present in thermal cycle sequencing protocols have the advantage of melting double-stranded templates and disrupting secondary structures that may form in the template. A high-temperature polymerase, such as Taq, is required. Taq polymerase discriminates against ddNTPs, requiring adjustment of the relative concentration of dNTPs and ddNTPs in these reactions. Native Taq polymerase also possesses undesirable 5′→3′ exonuclease activity, but this has been engineered out of commercially available recombinant Taq [4].
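Because each extension step incorporates a terminator with a fixed probability set by the ddNTP:dNTP ratio, terminated fragment lengths follow a geometric distribution. The short Python sketch below is illustrative only (the function name and the 1% spike-in fraction are assumptions for demonstration, not protocol values); it shows how a small ddNTP fraction spreads terminations across hundreds of positions, producing the ladder that electrophoresis resolves:

```python
import random

def sanger_fragment_lengths(template_len, ddntp_fraction, n_molecules, seed=0):
    """Toy model of chain termination: at each incorporation step a ddNTP
    is added with probability `ddntp_fraction`, ending that molecule.
    Returns the lengths at which molecules terminated."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_molecules):
        for pos in range(1, template_len + 1):
            if rng.random() < ddntp_fraction:
                lengths.append(pos)  # chain terminated by a dideoxy base
                break
        # molecules that never incorporate a ddNTP are extended full-length
        # in the "chase" step and do not interfere with detection
    return lengths

# a 1% spike-in spreads terminations over hundreds of positions
lens = sanger_fragment_lengths(template_len=800, ddntp_fraction=0.01, n_molecules=10000)
print(min(lens), max(lens), round(sum(lens) / len(lens)))
```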


FIGURE 1.1 Sanger sequencing. (A) Mockup of the results of gel electrophoresis for Sanger sequencing of the DNA molecule 5′-TATGATCAC-3′. The sequence can be read from right to left on the gel. (B) Electropherogram for Sanger sequencing of the same molecule. The results are read from left to right.


Other variant approaches consist of different detection methods:

• When radioisotope detection was in use, the original [α-32P]dATP protocol was modified to allow use of [α-33P]dATP and [α-35S]dATP, lower-energy emitters producing sharper bands on the autoradiogram [9].
• Chemiluminescent detection was also reported using biotinylated primers, streptavidin, and biotinylated alkaline phosphatase [10].
• 5′-end labeling of the primer ensures that only authentic primer elongation products will be detected, thus reducing the effect of nicks in template molecules serving as priming sites [4].

Modern Sanger sequencing (automated fluorescent sequencing, dye-terminator sequencing) uses fluorescently labeled ddNTPs that allow the amplification step to be performed in a single reaction. The product of the reaction is a mixture of single-stranded DNA fragments of various lengths, each tagged at one end with a fluorophore indicating the identity of the 3′ nucleotide. The reaction is separated by capillary electrophoresis. Continuous recording of four-color fluorescence intensity at the end of the capillary results in an electropherogram (Figure 1.1B) that can be interpreted by base-calling software, such as Mutation Surveyor (SoftGenetics LLC, State College, PA).

Clinical Sanger sequencing today uses the fluorescent dye-terminator method and is accomplished with commercially available kits. The BigDye family of products (Applied Biosystems (ABI)/Life Technologies) is commonly used. BigDye v3.1 is recommended by the vendor for most applications, including when long read lengths are desired. The older BigDye v1.1 chemistry remains useful for specialty applications, specifically for cases in which bases close to the sequencing primer are of greatest interest [11]. The ABI PRISM dGTP BigDye Terminator v3.0 kit may be useful for difficult templates, such as those with high GC content [12]. These kits are optimized for readout on ABI capillary electrophoresis platforms, such as the ABI 31xx and 37xx series Genetic Analyzers. These instruments vary in features, particularly in the number of capillaries, which ranges from 1 (310 Genetic Analyzer) to 96 (3730xl DNA Analyzer), and in the available modes (Table 1.1).

Applications in Clinical Genomics

Sanger sequencing is a "first-generation" DNA sequencing method. Despite the advantages of next-generation sequencing techniques, where throughput is orders of magnitude higher, Sanger sequencing retains an essential place in clinical genomics for at least two specific purposes. First, Sanger sequencing serves as an orthogonal method for confirming sequence variants identified by NGS. When validating clinical NGS tests, reference materials sequenced by Sanger approaches provide ground truth against which the NGS assay can be benchmarked. These materials may include well-characterized publicly available reagents, such as cell lines studied in the HapMap project, or archival clinical samples previously tested by Sanger methods.


TABLE 1.1 Estimated Sequencing Throughput of ABI Genetic Analyzers, as Reported by the Vendor [13]

Instrument | Capillaries | Sample Capacity       | Mode            | Length of Read | Runs/Day | Output/Day
310        | 1           | 96 tubes              | Standard        | 600            | 9        | 5 kb
           |             |                       | Rapid           | 425            | 38       | 15 kb
3130xl     | 16          | 96- or 384-well plate | Long read       | 950            | 8        | 121 kb
           |             |                       | Ultra rapid     | 500            | 41       | 328 kb
3730xl     | 96          | 96- or 384-well plate | Extra long read | 900            | 8        | 691 kb
           |             |                       | Standard        | 700            | 24       | 1.6 Mb
           |             |                       | Rapid           | 550            | 40       | 2.1 Mb
           |             |                       | Resequencing    | 400            | 72       | 2.8 Mb

Length of read is reported for 98.5% base-calling accuracy with fewer than 2% N's.
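The per-day outputs in Table 1.1 follow directly from the number of capillaries, the read length, and the number of runs per day. A quick sanity check, using a hypothetical helper rather than any vendor software:

```python
def daily_output_bp(capillaries, read_length_bp, runs_per_day):
    """One Sanger read per capillary per run, so daily throughput is
    simply the product of the three factors."""
    return capillaries * read_length_bp * runs_per_day

# 3730xl, standard mode: 96 capillaries x 700 bp x 24 runs/day
print(daily_output_bp(96, 700, 24))  # 1612800 bp, i.e. ~1.6 Mb/day
```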

As an orthogonal method, Sanger sequencing provides a means to confirm variants identified by NGS. It would be impractical to Sanger-confirm every variant, given the large number of primers, reactions, and interpretations that would be required. However, there may be instances where the veracity of a specific variant is in doubt; e.g., called variants that are biologically implausible or otherwise suspected of being spurious. Sanger sequencing is the easiest method to resolve these uncertainties and is therefore an invaluable protocol in any clinical genomics laboratory. Second, Sanger sequencing provides a means to “patch” the coverage of regions that are poorly covered by NGS. In targeted NGS testing, there may be regions that are resistant to sequencing, due to poor capture, amplification, or other idiosyncrasies. These regions are often rich in GC content. One approach to restoring coverage of these areas is to increase the quantity of input DNA, but the quantity available may be limited. It may be possible to redesign the amplification step or capture reagents, or otherwise troubleshoot the NGS technology. However, a very practical approach, when the area to be backfilled is small, is to use Sanger sequencing to span the regions poorly covered by NGS. When Sanger sequencing is used for backfilling NGS data, the NGS and Sanger data must be integrated together for purposes of analysis and reporting, which represents a challenge since these data are obtained by different methods and do not have a one-to-one correspondence to one another. Analyses that are natural for NGS data may be difficult to map onto data obtained by Sanger. For example, measures of sequence quality that are meaningful for NGS are not applicable to Sanger; the concept of depth of coverage can only be indirectly applied to Sanger data; allele frequencies are indirectly and imprecisely ascertained in Sanger sequence from peak heights rather than read counts; and Sanger data do not have paired ends. While NGS may potentially be validated to allow meaningful variant calling from a single nonreference read, the sensitivity of Sanger sequencing has a floor of approximately 20%: variants with a lower allele frequency may be indistinguishable from noise or sequencing errors (discussed below). Thus the performance of an NGS assay may be altered in areas of Sanger patching, and these deviations in performance must be documented and/or disclaimed.

Technical Constraints

Read Length and Input Requirements

Read lengths achieved with Sanger sequencing are on the order of 700–1000 bp per reaction [12]. Thus, a small number of Sanger reactions may be sufficient to cover one or two failed exons in a targeted NGS panel. Required input for Sanger sequencing varies by protocol and by type of template, but as a rule of thumb for double-stranded linear DNA, 10 ng of template per 100 bp of template length gives satisfactory results. Paradoxically, excessive template quantity results in short usable sequence length (i.e., short sequence reads).

Input DNA must consist of a relatively pure population of sequences. Each molecule to which the sequencing primer hybridizes will contribute to the electropherogram: the final electropherogram will be a superposition of all of the input molecules. Sequence diversity at a small number of positions (e.g., a heterozygous single-nucleotide variant (SNV) or deletion of a few nucleotides) will be resolvable by human readers or by


analysis software. More complex diversity within the input DNA will be very difficult to resolve and/or may be indistinguishable from sequencing errors.

Pooled Input DNA Puts a Limit on Sensitivity

Unlike NGS technologies, in which each sequence read originates from sequencing of a single molecule of DNA, the results of Sanger sequencing represent the pooled characteristics of all of the template molecules. This presents no difficulty if the template is a homogeneous population. However, clinical samples may be heterogeneous in at least two ways. Genomic DNA represents a pool of the patient's two haplotypes, so positions at which the patient is heterozygous will result in an ambiguous call if some form of heterozygote analysis is not specifically enabled. Mutation Surveyor (SoftGenetics LLC, State College, PA) is one Sanger analysis package with the ability to deconvolute heterozygous SNVs and insertion–deletion variants (indels) [14]. In cancer samples, DNA extracted from bulk tumor tissue is intrinsically a mixture of nontumor stroma and of any subclones present within the tumor, so alleles may be present at less than 50% variant allele frequency (VAF). Mitochondrial heteroplasmy is another scenario where variants of low VAF may be clinically relevant. Variant bases with low allele frequency appear in electropherograms as low peaks which may be indistinguishable from baseline noise. The sensitivity of Sanger sequencing must therefore be validated in each laboratory, but is usually cited as being in the neighborhood of 20% [14]. Variant alleles below this frequency may truly be present within the specimen and faithfully identified by NGS, but cannot reliably be confirmed by Sanger sequencing.

A related issue is that Sanger sequencing is not phase-resolved. The two copies of each gene carried by a euploid cell population are averaged together in Sanger sequencing, and variants on one chromosome cannot be differentiated from variants on the other. This limitation is problematic if more than one pathogenic variant is detected in a given gene: variants in cis would imply retention of one functional copy, while variants in trans would mean that both copies are mutated. The lack of phase resolution is also problematic if Sanger data are to be used to determine the patient's diplotype for complex alleles, as is the case for HLA typing or drug-metabolism genes (e.g., CYP2D6, CYP2C19). At these highly polymorphic loci, multiple positions often need to be assayed with preservation of phase data in order to assign haplotypes unambiguously based on sequence. It is, however, usually possible to use databases of known haplotypes, combined with data describing the probability of each haplotype in the patient's ethnic group, to ascertain the most probable diplotype. External data cannot be leveraged for phase resolution in the case of somatic variants, which by definition are unique to the patient and are not segregating in the population as discrete haplotypes.
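The pooling behavior described above can be made concrete with a small sketch. Assuming a simple model in which a Sanger trace is the superposition of the two alleles and a minor-allele peak must rise above a roughly 20% noise floor to be called (the function names and threshold here are illustrative assumptions, not validated laboratory parameters), heterozygous positions read out as IUPAC ambiguity codes and low-VAF somatic variants go undetected:

```python
# IUPAC codes for the two-base mixtures seen at heterozygous positions
IUPAC = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GC"): "S",
         frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M"}

def superpose(allele1, allele2):
    """A Sanger trace is the sum of all template molecules: identical
    positions read normally, mixed positions give ambiguity codes."""
    return "".join(a if a == b else IUPAC[frozenset((a, b))]
                   for a, b in zip(allele1, allele2))

def sanger_detectable(vaf, noise_floor=0.20):
    """Crude sensitivity model: a minor-allele peak is callable only if
    its relative height clears the baseline noise (~20% VAF)."""
    return vaf >= noise_floor

print(superpose("TATGATCAC", "TATGCTCAC"))               # TATGMTCAC (het A/C)
print(sanger_detectable(0.50), sanger_detectable(0.10))  # True False
```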

CYCLIC ARRAY SEQUENCING

Many of the currently available next-generation (NGS) approaches have been described as cyclic array sequencing platforms, because they involve dispersal of target sequences across the surface of a two-dimensional array, followed by sequencing of those targets [9]. The resulting short sequence reads can be reassembled de novo or, much more commonly in clinical applications, aligned to a reference genome. NGS has been shown to have adequate sensitivity and specificity for clinical testing. Tested against the gold standard of Sanger sequencing, an NGS cardiomyopathy panel consisting of 48 genes showed perfect analytical sensitivity and specificity [15]. A sensorineural hearing loss panel, OtoSeq, was similarly shown to have 100% analytical sensitivity and 99.997% analytical specificity [16]. NGS tests designed to detect somatic variants in cancer have also been validated as having clinical sensitivity and specificity exceeding 99% and adequate for clinical use [17,18].

NGS workflows involve (1) obtaining the nucleic acid of interest; (2) preparing a sequencing library, which may involve enrichment of target sequences; and (3) carrying out the sequencing on the chosen platform. Many platforms have been developed and have been reviewed elsewhere [9,19], but this discussion will be limited to a focused review of those platforms that, thanks to suitable cost and technical parameters, have found a place in clinical genomic testing. Because of constant evolution in the marketplace, technical comparisons between platforms have a limited life span: new instruments, reagent kits, and protocols appear constantly, with major implications for assay performance [20].


ILLUMINA SEQUENCING

Sequencers and reagents developed by Solexa and further commercialized by Illumina, Inc. (San Diego, CA) have come to be one of the most frequently used platforms in clinical genomics, thanks to their versatility and favorable cost/speed/throughput trade-offs. The distinguishing feature of Illumina sequencing is that prepared libraries are hybridized to the two-dimensional surface of a flow cell, then subjected to "bridge amplification" that results in the creation of a localized cluster (a PCR colony, or polony) of about 2000 identical library fragments within a diameter of ~1 μm; across a single lane of a flow cell there can be over 37 million individual amplified clusters [21] (Figure 1.2). These fragments are sequenced in place by successive incorporation of fluorescently labeled, reversibly terminating nucleotides (known as sequencing by synthesis). After each incorporation step, the surface of the flow cell is imaged by a charge-coupled device (CCD) to query each position for the identity of the most recently incorporated nucleotide. Successive cycles of deprotection, incorporation, and imaging result in a series of large image files that are subsequently analyzed to determine the sequence at each polony.
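At its core, each imaging cycle yields one intensity per fluorescence channel for every cluster, and the base call is the brightest channel. The toy caller below is a deliberate simplification and not Illumina's algorithm (real pipelines also correct for color cross-talk and phasing/pre-phasing and assign quality scores); all names and values are hypothetical:

```python
def call_bases(intensity_cycles):
    """Toy sequencing-by-synthesis caller: each cycle yields one intensity
    per channel (A, C, G, T) for a cluster, and the call is the brightest
    channel."""
    channels = "ACGT"
    return "".join(channels[max(range(4), key=lambda i: cycle[i])]
                   for cycle in intensity_cycles)

# three imaging cycles of (A, C, G, T) intensities for one cluster
print(call_bases([(10, 30, 900, 20), (15, 870, 25, 40), (5, 12, 18, 940)]))  # GCT
```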

FIGURE 1.2 Illumina sequencing. (A) Initial steps in library hybridization and loading on Illumina flow cell. (B) First and subsequent cycles of sequencing by synthesis. Reprinted with permission from Annual Review of Genomics and Human Genetics by ANNUAL REVIEWS, copyright 2008.

TABLE 1.2 General Steps for Preparation of Illumina Sequencing Libraries [22]

1. DNA fragmentation
2. End repair
3. 3′ adenylation (A-tailing)
4. Adapter ligation
5. Purification and enrichment
6. Library validation and quantification

Library Prep and Sequencing Chemistry

The workflow for Illumina sequencing begins with library preparation, in which the input DNA is processed to make it a suitable substrate for sequencing. The specific steps for library preparation will depend upon the application; protocols are readily available, usually paired with kits containing necessary reagents. An idealized workflow for general-purpose paired-end sequencing of genomic DNA based on a standard Illumina protocol (Table 1.2) [22] is presented here.

In clinical genomics applications, the input material will consist of patient samples such as peripheral blood, bone marrow aspirate, fresh tissue, or formalin-fixed paraffin-embedded (FFPE) tissue. Before library preparation can begin, DNA must be extracted by some method standardized for the laboratory and known to yield DNA of suitable quality for sequencing [23]. The DNA must be assayed for quality according to metrics and cutoffs established in the lab. DNA with A260/A280 = 1.8–2.0 is generally considered suitable for library preparation. Gel electrophoresis may be performed to determine whether the input DNA is high molecular weight or has become degraded. However, given that library preparation begins with fragmentation, high molecular weight


is not mandatory. A smear seen on an agarose gel may indicate the presence of contaminants including detergents or proteins [22].

The input DNA is fragmented by sonication, nebulization, or endonuclease digestion to yield fragments of <800 bp. The fragment size can be controlled by adjusting parameters of the sonicator or other instrument. Fragment size is relevant in that it determines the eventual distance between mated reads in paired-end sequencing.

Fragmented DNA may have 5′ or 3′ overhangs of various lengths. To ensure uniform downstream reactivity, an end repair step is performed by simultaneous treatment with T4 DNA polymerase, Klenow fragment, and T4 polynucleotide kinase (PNK). The first two of these serve to remove 3′ overhangs (3′→5′ exonuclease activity) and fill in 5′ overhangs (5′→3′ polymerase activity). PNK adds a 5′ phosphate group.

The A-tailing (3′ adenylation) step starts from blunt-ended DNA and adds a 3′ A nucleotide using a recombinant "Exo−" Klenow fragment of E. coli DNA polymerase, which lacks 5′→3′ and 3′→5′ exonuclease activities. The A tail cannot serve as a template, which prevents the addition of multiple A nucleotides.

Illumina adapters allow hybridization to the flow cell and may also encompass an index sequence to allow multiplexing multiple samples on a single flow cell. Adapter ligation uses DNA ligase and a molar excess of indexed adapter oligonucleotides to place an adapter at both ends of each DNA fragment. The index is a 6-nucleotide sequence that serves as a barcode for each library. Libraries with different indexes can be multiplexed at the time of sequencing (i.e., run in a single lane) and informatically separated at a later time. At this stage, libraries are size selected by gel electrophoresis to eliminate adapter dimers and other undesired sequence. It is at this stage that the final library insert size is selected, i.e., that the distance between paired ends is fixed (which is important for paired-end sequencing).

Each step of library preparation to this point has required a purification procedure, which in the aggregate results in marked reduction in the quantity of DNA present in the library. This is counteracted by a limited PCR amplification step. Although the library fragments had earlier acquired adapters carrying single-stranded DNA at each extremity, the PCR step renders the library fully double-stranded. The PCR step has several additional functions: it adds sequences necessary for hybridization to the flow cell; also, it ensures that the library carries adapters at both ends (otherwise, the primers would fail to anneal). Library preparation is completed by a final round of agarose gel electrophoresis to purify the final product. Illumina recommends storing prepared library DNA at 10 nM concentration in 10 mM Tris–HCl buffer, pH 8.5, with 0.1% Tween-20 to prevent sample adsorption to the tube [22].
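The informatic separation of multiplexed libraries reduces to matching each read's index against the expected barcodes. A minimal sketch, assuming single 6-nt indexes and tolerance of one mismatch; the sample names and index sequences here are hypothetical examples, not Illumina's assignments:

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def demultiplex(index_read, sample_indexes, max_mismatches=1):
    """Assign a read to a library by its index, tolerating a mismatch to
    absorb sequencing errors; ambiguous or unmatched reads stay unassigned."""
    hits = [name for name, idx in sample_indexes.items()
            if hamming(index_read, idx) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None

indexes = {"sample_A": "ATCACG", "sample_B": "CGATGT", "sample_C": "TTAGGC"}
print(demultiplex("ATCACG", indexes))  # sample_A (exact match)
print(demultiplex("ATCACT", indexes))  # sample_A (one sequencing error)
print(demultiplex("GGGGGG", indexes))  # None (unassigned)
```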

Choice of Platforms

Illumina sequencers in common use are the HiSeq and MiSeq systems. The HiSeq is a larger-scale, higher-throughput instrument, while the MiSeq is conceptualized as a benchtop or personal sequencer. The technical specifications of these platforms are likely to change rapidly over time. With that caveat, these platforms have the following features (Table 1.3). In its current iteration, the HiSeq (model 2500) accepts one or two flow cells, each with eight lanes. In high output run mode, a single flow cell can generate up to 300 Gb of data at 2 × 100 bp paired-end read length. These runs require 11 days to automatically proceed from cluster generation to final sequence. In rapid run mode, one flow cell can generate 60 Gb of 2 × 100 bp reads or 90 Gb of 2 × 150 bp reads, requiring 27 or 40 h, respectively.

The MiSeq instrument accepts a single-lane flow cell. With current reagent kits, MiSeq is benchmarked to generate 5.1 Gb of 2 × 150 bp reads in about 24 h [27]. The typical use case for MiSeq is a scenario where the total amount of sequence desired is insufficient to justify use of an entire HiSeq flow cell, either because of a low number of samples to be batched together, small size of a target region to be sequenced, or low desired coverage (more common in constitutional, as opposed to oncology, testing). The FDA granted the MiSeqDx instrument marketing authorization for high-throughput genomic sequencing in December 2013, which has many implications for clinical NGS [28].

Phasing

In NGS, each sequence read originates from a single DNA fragment, so variants identified in close proximity to one another can be reliably assigned to discrete haplotypes. For example, tumors have occasionally been found to have point mutations in both KRAS codons 12 and 13. NGS data make it possible to determine whether nearby variants occurred on the same reads, indicating variants in cis, or on different reads, indicating variants in trans.


TABLE 1.3 Estimated Sequencing Yield for Illumina Instruments, as Reported by the Vendor [24–26]

Instrument          | Run Mode/Kit   | Read Length | Run Time | Number of Reads (Paired Ends) | Output
HiSeq 2000          | High output    | 2 × 100 bp  | 11 d     | 6 billion                     | 600 Gb
HiSeq 2500          | Rapid run      | 2 × 150 bp  | 27 h     | 1.2 billion                   | 120 Gb
HiSeq 2500          | High output    | 2 × 100 bp  | 11 d     | 6 billion                     | 600 Gb
MiSeq               | Reagent kit v2 | 1 × 36 bp   | 4 h      | 24–30 million                 | 540–610 Mb
MiSeq               | Reagent kit v2 | 2 × 25 bp   | 5.5 h    | 24–30 million                 | 750–850 Mb
MiSeq               | Reagent kit v2 | 2 × 150 bp  | 24 h     | 24–30 million                 | 4.5–5.1 Gb
MiSeq               | Reagent kit v2 | 2 × 250 bp  | 39 h     | 24–30 million                 | 7.5–8.5 Gb
MiSeq               | Reagent kit v3 | 2 × 75 bp   | 24 h     | 44–50 million                 | 3.3–3.8 Gb
MiSeq               | Reagent kit v3 | 2 × 300 bp  | 65 h     | 44–50 million                 | 13.2–15 Gb
Genome Analyzer IIx | TruSeq SBS v5  | 1 × 35 bp   | 2 d      | 640 million                   | 10–12 Gb
Genome Analyzer IIx | TruSeq SBS v5  | 2 × 50 bp   | 5 d      | 640 million                   | 25–30 Gb
Genome Analyzer IIx | TruSeq SBS v5  | 2 × 75 bp   | 7 d      | 640 million                   | 37.5–45 Gb
Genome Analyzer IIx | TruSeq SBS v5  | 2 × 100 bp  | 9.5 d    | 640 million                   | 54–60 Gb
Genome Analyzer IIx | TruSeq SBS v5  | 2 × 150 bp  | 14 d     | 640 million                   | 80–95 Gb

Sanger sequencing, in contrast, cannot determine phase, as the electropherogram represents a pooled sum of all molecules in the reaction. When the variants to be phase-resolved do not lie on the same read or paired ends of the same read, there is no formal (guaranteed) means to resolve the phase, but it may be possible to do so by “walking” from one variant to the other. A tool for this purpose is included in the Genome Analysis Toolkit (GATK) [29].
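When two variants are close enough to be spanned by single reads or read pairs, phasing reduces to counting read support for each configuration. A minimal sketch of that tally follows; it is not the GATK algorithm, and the positions, alleles, and data layout are hypothetical:

```python
from collections import Counter

def phase_two_variants(reads, pos1, alt1, pos2, alt2):
    """Tally reads spanning two nearby variant positions: reads carrying
    both alternate alleles support cis; reads carrying exactly one
    support trans.  Each read is a dict of {position: observed base}."""
    tally = Counter()
    for bases in reads:
        if pos1 not in bases or pos2 not in bases:
            continue  # read does not span both sites
        has1, has2 = bases[pos1] == alt1, bases[pos2] == alt2
        if has1 and has2:
            tally["cis"] += 1
        elif has1 or has2:
            tally["trans"] += 1
        else:
            tally["reference"] += 1
    return tally

reads = [{100: "T", 103: "C"}, {100: "T", 103: "A"}, {100: "G", 103: "A"}]
print(phase_two_variants(reads, 100, "T", 103, "A"))
# Counter({'trans': 2, 'cis': 1})
```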

SOLiD SEQUENCING

The Sequencing by Oligo Ligation Detection (SOLiD) platform was developed by Agencourt Bioscience (at the time, a subsidiary of Beckman Coulter, now a unit of Applied Biosystems, Inc., itself a brand of Life Technologies). The technique is distinctive in that instead of performing sequencing-by-synthesis one nucleotide at a time, it obtains sequence by determining the ability of oligonucleotides to anneal to a template molecule and become ligated to a primer.

To generate the substrate for sequencing, DNA fragments are fused to a "P1" adapter, bound to magnetic beads, and amplified by emulsion PCR so that each bead is coated with a single clonal population. These beads are then fixed to the surface of a glass slide. More recent iterations of the platform use the "Wildfire" approach, omitting beads and allowing libraries to be prepared directly on a two-dimensional substrate by a template-walking process resembling colony PCR.

Five rounds of primer elongation are carried out, each of which is followed by a primer reset. In each round, a universal sequencing primer is hybridized to the template strands, positioning a 5′ end at the beginning of the region to be sequenced. This position is shifted by one nucleotide in each subsequent round of primer elongation. Multiple cycles of ligation, detection, and cleavage are then performed. In each ligation step, a pool of 5′-labeled oligonucleotide 8-mers is introduced. The color of the fluorophore attached to each molecule corresponds to the sequence of the first two nucleotides (Figure 1.3). The remaining six nucleotides are degenerate (i.e., all possible combinations are present in the pool). An annealing and ligation reaction is carried out, so that the fluorophore attached to each bead indicates the dinucleotide sequence just downstream from the sequencing primer. The substrate is washed, and a digital image is acquired to document the color attached to each bead.

The fluorophores correspond to the 5′ dinucleotide of each 8-mer. Although it might appear that 16 colors would be needed to encode 16 possible dinucleotides, only four colors are used. Each nucleotide will be interrogated twice,


FIGURE 1.3 Principle of SOLiD sequencing. (A) Molecules to be sequenced are deposited on a flow cell slide, primers are annealed, and extension is carried out by ligation of 8-mers carrying a label indicating their first two nucleotides. Several rounds of primer extension are performed. A "primer reset" is then carried out, applying a new primer of different length. (B) Data are collected in "color space" and are then processed to give the sequence in "base space." The sequencing is performed by querying each nucleotide twice, once as the first member of a dinucleotide and once as the second member. These queries occur in consecutive rounds of primer extension. To correctly interpret the meaning of the color encoding a dinucleotide, it is essential to know the identity of the preceding nucleotide. If nucleotide n − 1 is an A and the next dinucleotide is red, the next nucleotide must be T; but if nucleotide n − 1 is a C and the next dinucleotide is red, the next nucleotide is G. Reprinted with permission from Annual Review of Genomics and Human Genetics by ANNUAL REVIEWS, copyright 2008.

once in each of two successive primer elongation rounds. Because the reading frame shifts by one nucleotide in each round of primer elongation, the result of these two interrogations is a sequence of two colors: one indicating the nucleotide's identity as the first base of a dinucleotide, and one indicating its identity as the second base of a dinucleotide. There are 16 possible dinucleotides and 16 ordered pairs of colors. Thus the identity of each nucleotide is completely specified. After each interrogation, a cleavage agent is introduced to cleave the 8-mers between the fifth and sixth nucleotides. A new ligation, detection, and cleavage cycle is then performed to query a new dinucleotide located five nucleotides downstream from the first. Up to nine of these cycles are performed. A primer reset is then performed and a new round of primer elongation begins, in a new reading frame allowing a different set of dinucleotides to be queried. After five successive rounds of primer elongation, enough data have been gathered to reconstruct reads of 35–70 nucleotides, which are then used for downstream analysis.

SOLiD sequencing has several unique advantages. Annealing of the 8-mer oligonucleotides carries greater specificity than base pairing of a single nucleotide and helps to reduce errors. The primer extension methodology is


TABLE 1.4 Estimated Sequencing Yield for SOLiD Sequencing on the 5500 Series Genetic Analyzer Platform, October 2013, as Reported by the Vendor [30,31]

System                        | Read Length | Reads/Run   | Run Time   | Output
SOLiD 4 System                | 2 × 35 bp   | 1.4 billion | 8–9 days   | 40–56 Gb
SOLiD 4 System                | 2 × 50 bp   | 1.4 billion | 12–16 days | 64–80 Gb
5500 W System (1 FlowChip)    | 1 × 50 bp   | –           | 24 h       | 80 Gb
5500 W System (1 FlowChip)    | 1 × 75 bp   | –           | 24 h       | 120 Gb
5500 W System (1 FlowChip)    | 2 × 50 bp   | –           | 24 h       | 160 Gb
5500xl W System (2 FlowChips) | 1 × 50 bp   | –           | 24 h       | 160 Gb
5500xl W System (2 FlowChips) | 1 × 75 bp   | –           | 24 h       | 240 Gb
5500xl W System (2 FlowChips) | 2 × 50 bp   | –           | 24 h       | 320 Gb

also highly tolerant of repetitive sequence. The "color space" encoding of nucleotides allows a certain degree of protection against errors. Since each nucleotide is probed twice (once as the first member of a dinucleotide and once as the second member), true point mutations always manifest themselves as two color changes relative to the reference, in two successive rounds of primer elongation. A single color change will usually reflect a sequencing error. Disadvantages of the SOLiD approach include its conceptual complexity. The short read lengths make sequence assembly difficult, particularly for de novo assembly.

Paired-end SOLiD sequencing is accomplished by preparing the library with a second ("P2") adapter distal to the magnetic bead. Oligonucleotide barcodes can also be incorporated, allowing multiplexing. SOLiD sequencing can be carried out on the Applied Biosystems series 5500 Genetic Analyzers, which use one or two FlowChips. Each FlowChip has six lanes. In 1 × 75 bp fragment sequencing mode, each chip can generate up to 120 Gb of sequence data; in 2 × 50 bp mode, each chip generates up to 160 Gb [30]. The other available platform is the older Applied Biosystems SOLiD 4 System, which has lower throughput (Table 1.4).
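The two-base encoding itself is compact enough to state in a few lines. One common way to formalize the mapping, consistent with the color groups shown in Figure 1.3, is to give each base a 2-bit code and take the XOR of the two codes of each dinucleotide, so that 16 dinucleotides collapse onto 4 colors; decoding then only requires the known first base supplied by the adapter. A sketch, with the bit assignment treated as an illustrative convention rather than the vendor's exact implementation:

```python
BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_BASE = {v: k for k, v in BASE_BITS.items()}

def encode_color_space(seq):
    """Each color is the XOR of the 2-bit codes of a dinucleotide, so the
    16 dinucleotides collapse onto 4 colors."""
    return [BASE_BITS[a] ^ BASE_BITS[b] for a, b in zip(seq, seq[1:])]

def decode_color_space(first_base, colors):
    """Decoding requires the known first base (supplied by the adapter);
    each color then determines the next base."""
    seq = [first_base]
    for color in colors:
        seq.append(BITS_BASE[BASE_BITS[seq[-1]] ^ color])
    return "".join(seq)

colors = encode_color_space("ATGGA")
print(colors)                           # [3, 1, 0, 2]
print(decode_color_space("A", colors))  # ATGGA
# A single miscalled color corrupts all downstream bases in base space,
# whereas a true SNV changes two adjacent colors, the consistency check
# exploited during color-space alignment.
```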

ION TORRENT SEQUENCING

All of the approaches described in this chapter involve translating DNA sequence into a detectable physical event. Whereas Sanger sequencing, Illumina sequencing, and SOLiD are based on detecting fluorescence, the Ion Torrent approach (Life Technologies) is to convert sequence into very small changes in pH that occur as a result of H+ release when nucleotides are incorporated into an elongating sequence.

Library preparation for Ion Torrent sequencing involves fragmentation, end repair, adapter ligation, and size selection. The library molecules are bound to the surface of beads and amplified by emulsion PCR so that each bead is coated with a homogeneous population of molecules. Each bead is then dispersed into a discrete well on the surface of a semiconductor sensor array chip constructed by complementary metal-oxide semiconductor (CMOS) technology [32]. This is the same technology that underlies most microchips, including microprocessors, so the significant advances that have been made in CMOS semiconductor technology are accrued to the Ion Torrent as a matter of course.

The substrate in Ion Torrent sequencing has the property that each well, containing an embedded ion-sensitive field effect transistor, functions as an extremely sensitive pH meter (Figure 1.4). Sequencing is accomplished by flooding the plate with one deoxynucleotide species (dA, dC, dG, dT) at a time. In each well, incorporation of a nucleotide causes release of pyrophosphate and a H+ ion, resulting in a pH change. The change in H+ concentration supplies a voltage change at the gate of the transistor, allowing current to flow across the transistor and resulting in a signal [32]. Homopolymers cause incorporation of a greater number of nucleotides, with a correspondingly larger pH change and larger signal. The sequencing reaction is highly analogous to pyrosequencing, with the difference that pH is detected instead of light.

The throughput of the method depends on the number of wells per chip, which in turn is related to semiconductor fabrication constraints. The current Proton series of chips carries 154 million wells, each 1.25 μm in diameter and spaced 1.68 μm apart [32]. Additional improvements are expected to reflect Moore's law (doubling of the number of wells every 18 months), given that the substrate is a semiconductor microchip.



FIGURE 1.4 Schematic diagram of Ion Torrent sensor. The diagram shows a single well, within which is lodged a bead containing DNA template, along with the underlying sensor. When nucleotides are incorporated, H+ ions are released and the resulting pH change is detected as a voltage change. Reprinted by permission from Macmillan Publishers Ltd: Nature 475 (7356) 348, copyright 2011.

Ion Torrent chemistry is unique in using natural deoxynucleotides rather than a derivative thereof, which can reduce sequencing biases related to incorporation of unnatural nucleotides. Compared with other platforms, reads are relatively long, and reaction times are very short (3 h for a 300 base run). Several paired-end modes are available but require off-instrument repriming [32]. The software associated with Ion Torrent sequencing, including an aligner, variant caller, and other plug-ins, is distributed as an open-source product, contributing to its relatively wide adoption.

An often cited disadvantage of Ion Torrent methodology is its relatively high error rate compared with Illumina-based platforms, particularly in homopolymeric regions [33]. Some fraction of reads is rejected due to insufficient quality metrics including signal-to-noise issues, bead clumping, nonclonal beads, or failure to incorporate a bead at a given sensor. Once these 20–25% of failed reads are excluded, an accuracy of 99.5% (fewer than 0.5% base error rate) for 250 bp reads has been reported [32]. However, it has recently been demonstrated that a significant number of errors are introduced by false priming events during the multiplex amplification step of library preparation, errors which have been unrecognized in prior studies of accuracy [34]. The most problematic source of error is related to homopolymers, whose length must be deduced from the peak height of a multiple incorporation event. Even a quoted 99.3% base accuracy within homopolymers means that a 5-mer will be identified correctly only 96.5% of the time [32], which is significant given that runs of this length or longer are not uncommon in the genome. Difficulties sequencing across homopolymer runs cause uncertainty in the sequence of the run itself, but also cause "dephasing" of subsequent nucleotides due to incomplete extension, although the order in which nucleotides are flowed across the chip can be changed to mitigate dephasing issues. The simplest flow order, ACGT, returns to each nucleotide only on every fourth cycle, and wells with incomplete extension must wait three cycles to catch up. So-called minimally dephasing flow orders repeat each nucleotide at a shorter interval (e.g., ACA), allowing lagging wells to catch up and reducing sequencing errors in homopolymeric regions.

Ion Torrent sequencing is currently available on two instruments: Ion PGM (Personal Genome Machine) and Ion Proton. The Ion PGM is a benchtop instrument that uses Ion 314, 316, or 318 chips yielding 200 or 400 base reads. At the high end of the performance range for Ion PGM, current benchmarks are 4–5.5 million reads of 400 bp (1.2–2 Gb of sequence) with a run time of 7.3 h [35] (Table 1.5).
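An idealized flowgram is easy to model: each flow reports how many bases were incorporated, so homopolymers appear as proportionally larger signals. The sketch below ignores noise and incomplete extension, and its parameters (flow order, flow count) are illustrative assumptions:

```python
def flowgram(seq, flow_order="ACGT", max_flows=400):
    """Idealized semiconductor/pyrosequencing signals: nucleotides are
    flowed one species at a time and each flow's signal is the number of
    bases incorporated, so homopolymers give proportionally larger
    signals.  Noise and incomplete extension are ignored."""
    signals, pos = [], 0
    for i in range(max_flows):
        nuc = flow_order[i % len(flow_order)]
        count = 0
        while pos < len(seq) and seq[pos] == nuc:
            count += 1
            pos += 1
        signals.append((nuc, count))
        if pos == len(seq):
            break
    return signals

print(flowgram("TTTGGC"))  # the T flow reads the 3-mer homopolymer at once
# Per-base homopolymer accuracy compounds: 0.993 ** 5 ~= 0.965, i.e. the
# quoted ~96.5% accuracy for a 5-mer.
```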


TABLE 1.5 Estimated Sequencing Yield for Ion PGM and Ion Proton Instruments, as Reported by the Vendor [36,37]

Platform   | Chip        | Run Time (200-base reads) | Run Time (400-base reads) | Output (200-base reads) | Output (400-base reads) | Expected Reads
Ion PGM    | Ion 314™ v2 | 2.3 h                     | 3.7 h                     | 30–50 Mb                | 60–100 Mb               | 400–500 thousand
Ion PGM    | Ion 316™ v2 | 3.0 h                     | 4.9 h                     | 300–600 Mb              | 600 Mb–1 Gb             | 2–3 million
Ion PGM    | Ion 318™ v2 | 4.4 h                     | 7.3 h                     | 600 Mb–1 Gb             | 1.2–2 Gb                | 4–5.5 million
Ion Proton | Ion PI™     | 2–4 h                     | –                         | 10 Gb                   | –                       | 60–80 million


FIGURE 1.5 Pyrosequencing principle as implemented in the Roche 454 system. A population of identical template molecules is immobilized on the surface of a bead. Incorporation of a nucleotide releases pyrophosphate, which, via adenosine phosphosulfate (APS) and ATP sulfurylase, is enzymatically converted into light emission. Reprinted from Ref. [37], copyright 2008, with permission from Elsevier.

AmpliSeq Library Preparation

Library preparation for targeted Ion Torrent sequencing requires either hybridization-based capture or PCR-based amplification. Life Technologies has developed a platform, AmpliSeq, that allows single-tube PCR amplification of up to 4000 amplicons, as an input for sequencing. The quoted input DNA requirement is 10 ng. Several off-the-shelf panels are available, including a Cancer Mutation Hotspot Panel (46 genes), a Comprehensive Cancer Panel (406 genes), and an Inherited Disease Panel (325 genes). The company provides a tool for designing custom panels.

ROCHE 454 GENOME SEQUENCERS

The Roche 454 platforms are based on pyrosequencing. In non-high-throughput (conventional, non-array-based) configuration, pyrosequencing is a highly sensitive method for sequencing by synthesis. A template DNA strand is immobilized and exposed in turn to individual deoxynucleotides in the presence of DNA polymerase, luciferase, and ATP sulfurylase (Figure 1.5). Luciferin is present as a luciferase substrate, and adenosine 5′-phosphosulfate as an ATP sulfurylase substrate. Each incorporated nucleotide causes release of a pyrophosphate (PPi) moiety, and this PPi is combined with adenosine 5′-phosphosulfate to form ATP in a reaction catalyzed by ATP sulfurylase. In the presence of ATP, luciferase converts luciferin to oxyluciferin and light is emitted, resulting in a signal. To prevent consumption of dATP by the luciferase, with resulting emission of aberrant signals, dATPαS is used as the source of


deoxyadenosine nucleotides. Apyrase is added to degrade unincorporated nucleotide triphosphates to nucleotide monophosphates plus inorganic phosphate, Pi, and another cycle begins.

Pyrosequencing was adapted to high-throughput sequencing by the 454 Corporation with the development of a parallel, highly multiplexed platform. Library DNA fragments are captured on the surface of small beads, at a limiting dilution that ensures that each bead receives a single molecule (Figure 1.6).

FIGURE 1.6 Workflow in Roche 454 sequencing. (A) Library preparation, (B) emulsion PCR, and (C) chip loading and sequencing. Reprinted with permission from Annual Review of Genomics and Human Genetics by ANNUAL REVIEWS, copyright 2008.


TABLE 1.6 Estimated Sequencing Yield for Roche 454 Genome Sequencers, as Reported by the Vendor [39,40]

Instrument            | Read Length   | Reads/Run                     | Sequence/Run | Sequencing Run Time | Consensus Accuracy
GS FLX Titanium XL+   | Up to 1000 bp | 10^6 shotgun                  | 700 Mb       | 23 h                | 99.997%
GS FLX Titanium XLR70 | Up to 600 bp  | 10^6 shotgun, 7 × 10^5 amplicon | 450 Mb     | 10 h                | 99.995%
GS Junior             | 400 bp        | 10^5 shotgun, 7 × 10^4 amplicon | 35 Mb      | 10 h                | NR

Accuracy is reported as concordance between reference and consensus sequence at 15× coverage. NR, not reported.

These DNA molecules are amplified in an oil–water emulsion (emulsion PCR) on the surface of the beads, then immobilized on the surface of a PicoTiterPlate within a 29 μm well. Fiber optics allows light emission in each well to be discretely detected. The 454 GS20 Genome Sequencer, released in 2005, was the first NGS platform to become commercially available. Current iterations of the platform include the GS FLX+ system and a smaller benchtop instrument, the GS Junior. These instruments incorporate the fluidics and optics needed to perform the sequencing reaction and capture the resulting data. The GS FLX+ in combination with GS FLX Titanium chemistry is cited as giving read lengths up to 1000 bp with throughput of 700 Mb per 23 h run, with consensus accuracy of 99.997% at 15× coverage. The GS Junior, with GS Junior Titanium chemistry, gives quoted read lengths of 400 bp, throughput of 35 Mb per 10 h run, and consensus accuracy of 99%. Advantages of the 454 platform include long read length and short turnaround time. Cost per base sequenced, however, is high, on the order of $10 per million bases [38]. As with Ion Torrent sequencing, in pyrosequencing, homopolymers are detected as incorporation events of larger magnitude than expected, as multiple nucleotides are incorporated. The relation between run length and signal intensity (or area under the curve) is not strictly stoichiometric, resulting in difficulties sequencing across homopolymers greater than 5 bases long (Table 1.6).
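The "limiting dilution" in emulsion PCR is governed by Poisson statistics: diluting the template keeps mixed (nonclonal) beads rare at the cost of leaving most beads empty. A small illustration, where the 0.3 templates-per-bead figure is an assumed example rather than a vendor protocol value:

```python
import math

def bead_occupancy(mean_templates_per_bead):
    """Poisson model of limiting-dilution emulsion PCR: fractions of beads
    with zero templates (wasted), exactly one (clonal, usable), or
    several (mixed, filtered out)."""
    lam = mean_templates_per_bead
    empty = math.exp(-lam)
    clonal = lam * math.exp(-lam)
    return {"empty": empty, "clonal": clonal, "mixed": 1 - empty - clonal}

# diluting to ~0.3 templates per bead keeps mixed beads rare (~4%) at the
# cost of leaving ~74% of beads empty
print(bead_occupancy(0.3))
```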

THIRD-GENERATION SEQUENCING PLATFORMS

The NGS approaches discussed above have been described as second-generation approaches, in anticipation of the coming availability of third-generation sequencing. Second-generation methods require library preparation and an enrichment or amplification step. These steps are time-consuming, introduce biases related to preferential capture or amplification of certain regions, and also can result in PCR errors which are propagated into the eventual sequence data. Third-generation methods circumvent these problems by sequencing individual molecules, without the need for an enrichment or amplification step. The major disadvantage of single-molecule methods is that they require the ability to detect fantastically small signals, without compromising accuracy. These methods are further discussed in Chapter 5.

References [1] [2] [3] [4] [5] [6] [7] [8]

[9] [10] [11]

Leonard DGB. Diagnostic molecular pathology. 1st ed. Philadelphia, PA: W.B. Saunders; 2003. Pfeifer JD. Molecular genetic testing in surgical pathology. Philadelphia, PA: Lippincott, Williams & Wilkins; 2006. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 1977;74(12):54637. Slatko BE, Albright LM, Tabor S, Ju J. DNA sequencing by the dideoxy method. Curr Protoc Mol Biol 2001; Chapter 7: Unit7 4A. Klenow H, Henningsen I. Selective elimination of the exonuclease activity of the deoxyribonucleic acid polymerase from Escherichia coli B by limited proteolysis. Proc Natl Acad Sci USA 1970;65(1):16875. Tabor S, Richardson CC. Effect of manganese ions on the incorporation of dideoxynucleotides by bacteriophage T7 DNA polymerase and Escherichia coli DNA polymerase I. Proc Natl Acad Sci USA 1989;86(11):407680. Tabor S, Richardson CC. Selective inactivation of the exonuclease activity of bacteriophage T7 DNA polymerase by in vitro mutagenesis. J Biol Chem 1989;264(11):644758. Sears LE, Moran LS, Kissinger C, Creasey T, Perry-O’Keefe H, Roskey M, et al. CircumVent thermal cycle sequencing and alternative manual and automated DNA sequencing protocols using the highly thermostable VentR (exo-) DNA polymerase. BioTechniques 1992;13 (4):62633. Shendure JA, Porreca GJ, Church GM, Gardner AF, Hendrickson CL, Kieleczawa J, et al. Overview of DNA sequencing strategies. Curr Protoc Mol Biol 2011;96:7.1.7.1.23. Martin C, Bresnick L, Juo RR, Voyta JC, Bronstein I. Improved chemiluminescent DNA sequencing. BioTechniques 1991;11(1):1103. Applied Biosystems. Product Bulletin: BigDye Terminator v3.1 and v1.1 Cycle Sequencing Kits. 2002. Available from: ,http://www3. appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_040026.pdf..

I. METHODS

REFERENCES

19

[12] Slatko BE, Kieleczawa J, Ju J, Gardner AF, Hendrickson CL, Ausubel FM. “First generation” automated DNA sequencing technology. Curr Protoc Mol Biol 2011; Chapter 7: Unit 7 2. [13] Applied Biosystems. Applied Biosystems Genetic Analyzers at a glance. 2006. Available from: ,http://www3.appliedbiosystems.com/ cms/groups/mcb_marketing/documents/generaldocuments/cms_040402.pdf.. [14] Dong C, Yu B. Mutation surveyor: an in silico tool for sequencing analysis. Methods Mol Biol 2011;760:22337. [15] Sikkema-Raddatz B, Johansson LF, de Boer EN, Almomani R, Boven LG, van den Berg MP, et al. Targeted next-generation sequencing can replace Sanger sequencing in clinical diagnostics. Hum Mutat 2013;34(7):103542. [16] Sivakumaran TA, Husami A, Kissell D, Zhang W, Keddache M, Black AP, et al. Performance evaluation of the next-generation sequencing approach for molecular diagnosis of hereditary hearing loss. Otolaryngol Head Neck Surg 2013;148(6):100716. [17] International HapMap 3 Consortium, Integrating common and rare genetic variation in diverse human populations. Nature 2010;467 (7311):528. [18] 1000 Genomes Project Consortium, A map of human genome variation from population-scale sequencing. Nature 2010;467(7319): 106173. [19] Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem 2013;6:287303. [20] Junemann S, Sedlazeck FJ, Prior K, Albersmeier A, John U, Kalinowski J, et al. Updating benchtop sequencing performance comparison. Nat Biotechnol 2013;31(4):2946. [21] Illumina, Inc. Technology Spotlight: Illumina Sequencing. HiSeq System 2010 [6/19/14]. Available from:,http://res.illumina.com/documents/ products/techspotlights/techspotlight_sequencing.pdf.. [22] Illumina, Inc. Paired-End Sample Preparation Guide, Rev. E. 2011 [6/19/14]. Available from: ,http://res.illumina.com/documents/products/ techspotlights/techspotlight_sequencing.pdf.. [23] Liu X, Harada S. DNA isolation from mammalian samples. Curr Protoc Mol Biol 2013; 102:2.14.12.14.13. [PubMed PMID: 23547013]. [24] Illumina, Inc. HiSeq Systems Comparison 2013 [10/25/13]. Available from: ,http://www.illumina.com/systems/hiseq_comparison.ilmn.. [25] Illumina, Inc. MiSeq Performance Specifications 2013 [10/25/13]. Available from: ,http://www.illumina.com/systems/miseq/performance_ specifications.ilmn.. [26] Illumina, Inc. Genome Analyzer IIx Performance and Specifications 2013 [10/25/13]. Available from: ,http://www.illumina.com/systems/ genome_analyzer_iix/performance_specifications.ilmn.. [27] Illumina, Inc. Specification Sheet: Illumina Sequencing. MiSeq System 2013 [6/19/14]. Available from: ,http://res.illumina.com/documents/ products/datasheets/datasheet_miseq.pdf.. [28] Collins FS, Hamburg MA. First FDA authorization for next-generation sequencer. N Engl J Med 2013;369(25):236971. [29] McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20(9):1297303. [30] Applied Biosystems. 5500W Series Genetic Analyzers v2.0 Specification Sheet. 2012 [6/19/14]. Available from:,http://tools.lifetechnologies. com/content/sfs/brochures/5500-w-series-spec-sheet.pdf.. [31] Applied Biosystems by Life Technologies. Specification Sheet: Applied Biosystems SOLiD 4 System 2010 [6/19/14]. Available from: ,http://www3.appliedbiosystems.com/cms/groups/global_marketing_group/documents/generaldocuments/cms_078637.pdf.. [32] Merriman B, Ion Torrent R, Team D, Rothberg JM. 
Progress in Ion Torrent semiconductor chip based sequencing. Electrophoresis 2012;33(23):3397417. [33] Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 2012;30(5):4349. [34] Eshleman JR, et al. False positives in multiplex PCR-based next-generation sequencing have unique signatures. J Mol Diagn in press. [35] Life Technologies. Ion PGM System Specifications 2013 [10/25/13]. Available from: ,http://www.lifetechnologies.com/us/en/home/lifescience/sequencing/next-generation-sequencing/ion-torrent-next-generation-sequencing-workflow/ion-torrent-next-generation-sequencingrun-sequence/ion-pgm-system-for-next-generation-sequencing/ion-pgm-system-specifications.html.. [36] Life Technologies. Ion Proton System Specifications 2013 [10/25/13]. Available from: ,http://www.lifetechnologies.com/us/en/home/lifescience/sequencing/next-generation-sequencing/ion-torrent-next-generation-sequencing-workflow/ion-torrent-next-generation-sequencingrun-sequence/ion-proton-system-for-next-generation-sequencing/ion-proton-system-specifications.html.. [37] Droege M, Hill B. The genome sequencer FLXt system. J Biotechnol 2008;136:310. [38] Liu L, Li Y, Li S, Hu N, He Y, Pong R, et al. Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012;2012:251364. [39] Roche Diagnostics. GS FLX1 System: Performance 2013 [10/25/13]. Available from: ,http://454.com/products/gs-flx-system/.. [40] Roche Diagnostics. GS Junior: Performance 2013 [10/25/13]. Available from: ,http://454.com/products/gs-junior-system/index.asp..


2 Clinical Genome Sequencing

Tina M. Hambuch, John Mayfield, Shankar Ajay, Michelle Hogue, Carri-Lyn Mead and Erica Ramos
Illumina Clinical Services Laboratory, San Diego, CA, USA

OUTLINE

Introduction
    Next-Generation Sequencing
    Sequencing in the Clinical Laboratory
Applications and Test Information
    Challenges of Defining a Test Offering That Is Specific to Each Case
Laboratory Process, Data Generation, and Quality Control
    Preanalytical and Quality Laboratory Processes
    Analytical
    Bioinformatics
    Validation
    Proficiency Testing
    Interpretation and Reporting
Conclusion
References

KEY CONCEPTS

• Clinical genome sequencing (GS) requires additional elements that go well beyond the technology.
• GS can be applied appropriately in many clinical situations, but it is important to define how and when this test should be used, and to establish physician/patient communication and support, quality laboratory processes, analytical validity, ongoing proficiency testing, specific bioinformatics analyses and filters, and interpretation and reporting processes specific to each clinical application.
• There may be multiple appropriate approaches, depending on the clinical question(s) being assessed.
• Implementing GS in a clinical laboratory is labor intensive and requires robust laboratory and bioinformatics processes, but it can be accomplished successfully to improve patient outcomes.
• This field is evolving extremely rapidly and will undoubtedly create opportunities for new professions in bioinformatics and genetic analysis.

INTRODUCTION

Next-Generation Sequencing

DNA sequencing is one of the most important tools aiding biological researchers in the study of genetics and disease. Until the mid-2000s, almost all DNA sequencing was performed using technology developed by Fred Sanger [1]. The methodology involves achieving size separation of DNA molecules by incorporating modified nucleotides (dideoxyribonucleotides) in a reaction catalyzed by a DNA polymerase enzyme. Terminated chains are then size-fractionated by capillary electrophoresis (the earliest implementations used slab gels read by autoradiography). Although very accurate, Sanger sequencing provides limited scalability; a highly parallelized setup can only produce roughly 100 kilobases of sequence from a 96-well plate. The protocol is also expensive and labor intensive, since it involves cloning and isolation of DNA and maintenance of libraries.

Next-generation sequencing (NGS) involves a radically different approach that enables sequencing of millions of DNA molecules at a massively parallel scale at a fraction of the cost. Today, a single human genome can be sequenced for <$10,000, with the price rapidly approaching the $1000 mark, compared with the Sanger-sequencing-based Human Genome Project, which carried a price tag of nearly $3 billion [2–4]. Instruments that use NGS technologies are capable of producing hundreds of gigabases of DNA sequence from a single experiment [5]. The increased throughput and economies of scale are achieved by simpler protocols that require a shorter duration to complete. Most protocols start by generating a library of DNA fragments (genomic DNA (gDNA) or reverse-transcribed RNA) from a sample. Each fragment becomes a template that is then amplified to a sufficient degree that a strong enough signal is emitted to enable detection of the molecule. Sequencing of the templates occurs on a massively parallel scale in a reaction catalyzed either by DNA polymerase or by DNA ligase, while detection of the sequenced bases occurs by measuring light or fluorescence emission, or changes in pH, as bases are incorporated [6,7]. Since there is uncertainty involved in detection of an emitted signal, a quality score is assigned to every sequenced base that represents the probability of an error in the base call. Some newer technologies sequence by reading DNA bases as they pass through a synthetic or biological nanopore set in a membrane (e.g., a lipid bilayer) across a voltage difference [8]; each nucleotide disrupts the normal flow of current by a distinct amount, enabling detection. Most NGS instruments cycle back and forth rapidly between synthesis and detection of sequenced bases to produce a read (between 100 and 1000 bp) representing a portion of a DNA fragment. In some cases, the two ends of a DNA fragment may be sequenced to produce a "paired-end" read.

One of the key advantages of NGS is that it can be used for a variety of applications, from qualitative and quantitative nucleotide detection through multisample analysis supported by indexing of DNA libraries. Biological researchers can study different systems because of the ability to quickly and efficiently sequence small targeted regions such as a set of exons (exome sequencing) [9,10], expressed genes (RNA-seq) [11,12], regulatory regions such as transcription factor binding sites (ChIP-seq) [13,14], or simply the entire genetic code of a human (whole genome sequencing (WGS)) [5,15,16]. These applications can help uncover both small- and large-scale alterations to the DNA sequence, such as single nucleotide variants (SNVs), insertions and deletions (indels), copy number variations (CNVs), and gross structural variants (SVs), some of which may be responsible for the genetic basis of disease.
In addition, NGS technologies have made it possible to study microbial organisms by sequencing the entire complement of their RNA or DNA, which is termed metagenomics.
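Because every base call carries the Phred-scaled error probability described above, it can be useful to see the encoding spelled out. The following is a minimal sketch (not from this chapter) assuming the common Phred+33 convention of Illumina-style FASTQ files; the quality string is hypothetical.

```python
# Minimal sketch: decoding Phred quality scores from a FASTQ quality string.
# Assumes the common Phred+33 ASCII encoding (an assumption of this example).

def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q into the probability the base call is wrong."""
    return 10 ** (-q / 10)

quality_string = "IIIIHHGF#"  # hypothetical per-base qualities for one short read
for i, ch in enumerate(quality_string):
    q = ord(ch) - 33  # Phred+33: subtract the ASCII offset to recover Q
    print(f"base {i}: Q={q:2d}, P(error)={phred_to_error_prob(q):.5f}")
```

A Q30 base, for example, has a 1-in-1000 chance of being wrong, which is why per-base quality thresholds figure so prominently in the preprocessing and variant-calling steps discussed later in this chapter.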

Sequencing in the Clinical Laboratory

In the 1990s, several technical advances led to the transformation of genetic sequencing from a research tool into a clinical tool. Until that time, sequencing was laborious, slow, and technically challenging, making it incompatible with the high-volume, rapid-turnaround requirements of a clinical laboratory. The introduction of capillary electrophoresis and dye terminator sequencing enabled genetic testing to be included in clinical evaluations. Genetic testing involving sequencing is now available for over 2000 diseases and is offered by hundreds of clinical laboratories (GeneTests.org). This progress has recently received another technological boost that continues to enlarge the potential and scope of clinical sequencing. The technology is typically referred to as NGS or massively parallel sequencing (MPS), and reflects a number of platforms that share a shotgun sequencing approach that, as the name suggests, allows for parallel generation of large amounts of data. This chapter will address the challenges of, and solutions to, implementing NGS through the preanalytical, analytical, and postanalytical components of a clinical laboratory service, focusing on the aspects of NGS that are particularly important for clinical laboratory consideration.

While several excellent reviews are available that detail the technical aspects of NGS [17,18], several features are worth specific mention here. NGS involves shearing gDNA into fragments of a particular size, isolating those fragments or subsets of those fragments, and sequencing each fragment simultaneously. The digital sequences are then scaffolded together to create a large set of sequences that represent a genome, an exome, or a large number of regions of interest. Like other technologies, the applications and potential uses of NGS are broad. For example, NGS can detect SNVs. However, because each molecule is sequenced individually, an individual read is not capable of indicating the diploid or genotypic state of the sample. Multiple reads must be generated in order to ensure that both chromosomes (in a diploid cell) are represented in the sequence that is produced. While this is in some ways a challenge, it also means that NGS has a quantitative component that enables it to detect nondiploid-level variation. Examples of how this has been used include detection of heteroplasmy, detection of somatic variants in a tumor–normal sample pair, and detection of circulating fetal DNA in a maternal blood sample. Additionally, this quantitative component can be used to detect the relative populations of microorganisms in a sample, such as gut flora. Depending on how the sample library is generated, it is also possible to detect CNVs via sequencing. As the sequences generated can span the entire genome, it is also possible, when scaffolding the sequences, to detect regions where there have been translocations, inversions, duplications, and other types of structural genomic events. While all of these possibilities have been demonstrated using NGS, the technical complexity of mapping these variants results in highly variable detection rates that have not yet reached accuracies appropriate for most clinical applications.

Obviously, a single assay that could potentially address so many different types of questions in a clinical setting suggests many possible applications. There are already some impressive examples of where this technology might impact the practice of medicine. Several publications have demonstrated the potential for patients suffering from rare, undiagnosed conditions, ranging from patients in whom NGS has led to cures, to patients who have simply benefited from finding answers [19–21]. These conditions have typically been rare, single-gene diseases whose presentation is vague or nonspecific, and the ability to evaluate multiple genes that could be associated with the symptoms has allowed for more rapid, confident diagnosis and sometimes the discovery of new clinical presentations associated with disease. In some cases, it has revealed that patients actually had two diseases. Recent advances in the chemistry have allowed very rapid identification of disease in patients who stand to benefit most from quick results, such as those in the Neonatal Intensive Care Unit (NICU) [22]. In addition, NGS has been used to evaluate changes between cancer and normal tissues from the same individual to significantly impact chemotherapeutic choices [23,24]. The technology has also been used effectively to evaluate aneuploidy in fetuses by examining the fetal DNA circulating in maternal blood, commonly referred to as noninvasive prenatal testing (NIPT) [25–27]. While some of these examples skirt the borders between clinical research and clinical testing, most demonstrate the potential utility of this technology for a number of difficult and important clinical applications.

The transformation from clinical research and exciting anecdotes to standard use as a clinical test is well underway. In 2009, Illumina launched a CLIA-certified, College of American Pathologists (CAP)-accredited sequencing service offering physicians access to whole GS.
Since that time, many other clinical laboratories have launched clinical exome sequencing with clinical interpretive support, and still others offer large panels of genes using NGS technology. However, the transition to clinical testing is not trivial, and the complexity of the testing, analysis, and reporting represents a significant change in scale from what most clinical laboratories or medical professionals are used to supporting. This includes creating a process in which doctors can confidently understand what they are ordering, what information they will get back, and what other tests might still be needed to supplement the information provided. Patients should receive appropriate informed consent and play a role in determining what information will be evaluated and how it will be used. Clinical laboratories need to establish the analytical validity of the tests, develop bioinformatics pipelines appropriate to the types of questions the tests are intended to evaluate, and develop reporting strategies that are flexible in both context and scale. To address this need, several national groups have published guidelines around various aspects of the implementation of clinical applications of NGS. These guidelines range from specifics of analytical validity [28] and interpretation and reporting of incidental findings [29] to general considerations for implementing NGS in the clinical laboratory [30,31]. While this technology clearly has huge potential for impact in medicine, it is critical that the development and implementation be done carefully and that each of the multiple aspects of implementation be developed with consideration of the requirements and challenges of the other aspects. With that principle in mind, this chapter has been divided into major sections on (i) applications and information support, (ii) implementation of a quality laboratory process, including bioinformatics challenges and requirements, and (iii) clinical interpretation and reporting. Each section will address the tools, opportunities, challenges, and possible approaches that have been developed or identified to address these critical components of implementing a clinical laboratory test and then integrating that test into medical use in a way that is appropriate, beneficial to patients, and rigorous with regard to standards. Many challenges have not yet been solved, and while many groups are addressing these questions, both the technology and the guidelines will evolve rapidly over the next few years. Most groups use the term "genome sequencing" to reflect generalities that are applicable to both genome and exome offerings. Although the terms "whole genome" and "whole exome" are often used, it must be remembered that there are regions of the genome that are not resolved or mapped using these technologies.

APPLICATIONS AND TEST INFORMATION

GS can be used in a wide variety of situations, including the support of diagnostic, predictive, carrier screening, and therapeutic evaluations. While traditional genetic tests can address these situations, GS offers a more comprehensive look at the multiple genes that play an important role in health and disease. Ideally, test definitions should be very specific and give a medical professional enough information to judge whether that test will aid in addressing the specific clinical question he/she may have. In the case of whole GS, the test can potentially address multiple questions simultaneously, and therefore defining the test, and helping a health practitioner decide how best to use it in a specific clinical situation, becomes more challenging.

For diagnostic purposes, GS can evaluate multiple predefined disease-causing genes for a specific phenotype with one test (essentially becoming a personalized panel specific to an individual's clinical presentation). This has significant potential to reduce the time and cost of seeking a diagnosis for difficult cases. Due to the nature of rare genetic disease and the difficulty of pinpointing a diagnosis, it is not uncommon in the usual course of genetic evaluation for an individual to undergo multiple tests in a tiered or reflex approach over months or, more commonly, years. By covering many or all of the suspect genes in one test, much of the time and cost of that so-called diagnostic odyssey may be eliminated. Additionally, diagnostic testing for rare disease becomes more sensitive when families are investigated together, for example, an affected child and both parents. Familial testing allows for identification of novel variants in an affected child versus those present in the presumably phenotypically normal parents. It also allows for identification of homozygous or compound heterozygous mutations in recessive genes in the child that are present only in a heterozygous state in the carrier parents. This significantly narrows the list of candidate genes that must be evaluated for an affected child, and is generally considered most fruitful for identifying de novo or autosomal recessive variants. Finally, diagnostic testing via GS offers the ability to identify novel genes not previously considered in a patient's differential diagnosis. As rapidly as knowledge of the genome is advancing, it is still unclear in many cases what effect specific genes have on health and disease. As the use of GS widens, the discovery of expanded or novel phenotype–gene associations will occur. This falls more within the spectrum of clinical research; however, identification of these suspect genes provides potential benefit for patients who do not receive a clear diagnosis by opening opportunities for research evaluations and participation in research programs.

Predispositional testing for suspected familial cancer syndromes is widely available via single-gene testing or gene panels. Unfortunately, an answer is not forthcoming for a majority of individuals tested via this approach. Testing is most commonly ordered for the handful of prevalent and well-known cancer predisposition syndromes, while more than 54 cancer-predisposing genetic syndromes have been identified to date [32]. GS has the ability to cover most or all known or suspected cancer-predisposing genes with one test. Such testing has significant potential to identify individuals and families who are affected by rare or less common cancer predisposition syndromes that would often be missed with traditional testing. However, with this broader approach there is also potential to detect variants of uncertain clinical significance, as well as potential for incidental findings, and that of course must also be planned for clinically. Additionally, the broad scope of GS for cancer-predisposing genes will likely help the medical genetics community learn more about the effects of these genes and the breadth of the phenotypes that they cause. As increasing numbers of variants are discovered in patients who do not meet typical diagnostic criteria for specific cancer predisposition syndromes, a broader spectrum of disease for those genes will be discovered.

Traditional carrier screening is widely available for many disorders and often focuses on a subset of common mutations, many times specific to one ethnicity. Such screening is useful in finding a majority of mutations in high-risk populations; however, it potentially misses a significant number of causative mutations that are rare or private. For example, over 1700 mutations have been described to date in CFTR, the gene known to cause cystic fibrosis, yet traditional screening in the United States includes only the 23 most common mutations recommended by the American College of Obstetricians and Gynecologists (ACOG) [33]. Because WGS covers the entire gene sequence, rather than a subset of common mutations, the technology will discover rare and private mutations, identifying individuals at risk of passing on recessive disease who might otherwise go undetected, not only in CFTR but in other recessive diseases as well.

Traditional carrier screening generally focuses on the most common recessive disorders within given populations. Although hundreds of rare recessive diseases have been identified, to date ACOG has published committee opinions regarding routine carrier screening for only four disorders: cystic fibrosis in all pregnant women regardless of ethnicity, and Tay-Sachs disease, Canavan disease, and familial dysautonomia in women of Ashkenazi Jewish descent [34]. A majority of parents of individuals with rare recessive disorders were unaware of their carrier status until after their affected child was born. Because these disorders are so rare, it is unlikely that carriers have a family history indicating they are at risk. Genome sequencing has the potential to identify those at risk of passing these rare disorders on to their offspring. Identification of at-risk families allows them to make informed reproductive decisions and makes it possible to pursue testing (if desired) such as preimplantation genetic diagnosis to avoid passing the disease to offspring.

GS may also greatly impact pharmacotherapeutics in the future. It is well documented that genotype directly influences an individual's response to some medications [35], since genotype often influences an individual's rate of absorption, distribution, metabolism, or excretion of a drug. Any of these differences may impact medication effectiveness and/or influence toxic responses. Some genetic variants predispose individuals to toxic reactions to certain medications, while other genetic variants cause medications to be more effective. It is currently standard of care to evaluate patients undergoing clinical trials of specific medications for known pharmacogenetic variants prior to treatment. As more information becomes available about the role of pharmacogenetics, results of WGS will begin to play an important role in the pharmacotherapeutic field, influencing which medications an individual is or is not prescribed. This has significant potential to impact the health and well-being of the general public by reducing toxic effects and maximizing the benefits of prescribed medications.

Challenges of Defining a Test Offering That Is Specific to Each Case

The indications for ordering GS and the ways in which results will be utilized are as unique as the individual patients themselves. While this is part of the appeal of GS, it creates challenges that may be unique to the ordering of GS. Consider the example of a standard complete blood count (CBC). A CBC will be interpreted in the same way regardless of the indication for testing. In most institutions, a CBC (and other routine tests) can be ordered with the click of a button and the sample sent directly to the lab. Samples are analyzed and the results of the CBC are returned to the ordering physician in the same manner regardless of patient, indication, or other factors. The same cannot be said for GS. Because GS results can be, and should be, tailored to different indications (e.g., diagnostic versus carrier screening), it is important that the lab receives accurate information regarding the indication for testing, which requires additional time and effort on the part of the physician. It is essential that the GS lab understands the information being sought in order to provide an applicable and suitable interpretation of the genome. Additionally, in diagnostic cases, it is essential that the lab receives phenotype information, including accurate details regarding medical history and family history. These pieces of crucial information help direct the search through the thousands of variants found in any given genome.

Information must also be readily available to the medical practitioner to help him/her understand that, technically, whole GS will not actually cover every known position in the human genome, nor can it detect every type of genetic variation (as discussed more fully below). In addition, the implications for clinical sensitivity may be highly variable depending on the clinical context and the specific set of genes of interest. While this information is fairly easy to convey when dealing with a specific test, such as the common mutations in CFTR, it becomes extremely challenging at the genomic scale. Some solutions have included binning disease and mutation categories in searchable PDF or online formats so that individuals can query specific subsets, or setting filters at the laboratory level to limit testing and reporting to conditions and genes for which a certain minimum of clinical sensitivity has been established. Regardless, effective communication of very large amounts of information between a clinical laboratory and a clinician remains a significant challenge for genomic sequencing.

GS also presents challenges unique to the broad scope of the analysis and the large number of novel variants identified with this relatively new technology. Not the least of these challenges is the interpretation of results and the application of those results to medical management. Most physicians outside of the genetics specialties traditionally receive little training in the field of genetics. A new challenge is therefore for practitioners to feel comfortable utilizing GS results in their daily practice. It is of utmost importance that the medical genetics community form collaborative relationships and provide support to the medical community in general. Likewise, it is equally important for the medical community to reach out to the genetics community for support. There are multiple subsets of genetics specialists who are capable of assisting in this effort, including clinical geneticists, molecular geneticists, and genetic counselors. These specialists are trained and knowledgeable in the implications of genetic variation and the ways in which variation impacts health and disease. They are also skilled in interpreting genetic test results and should be qualified to interpret GS results.

Ideally, patients should be given the opportunity to undergo the informed consent process with a genetics specialist prior to the ordering of GS. In reality, many patients and practitioners do not have access to genetics specialists due to geographic location or insurance limitations. It is therefore of utmost importance that the medical community be provided with key information regarding the utility and implications of this technology, as well as how to use the information gained from results. In many cases it will be the physician who must ensure the patient is fully aware of the benefits, risks, and limitations of GS and the various possible results, such as positive, negative, and variants of uncertain significance. Equally as important as informed consent, the physician and patient must be able to utilize the results of the test. Physicians must feel comfortable in their ability to apply the results to the patient's plan of care. It is for this reason that collaboration with genetics specialists is important.

In order to help facilitate the informed consent process and the application of the information yielded by GS results, the genetics community has a responsibility to establish programs addressing these issues. One example is short-term rotations between institutions: some commercial labs (e.g., Illumina) as well as many academic labs performing clinical GS in a CLIA environment host rotating clinical students from outside institutions. Students gain hands-on experience in the processing and interpretation aspects of GS and a better understanding of the application and implications of such testing. Another platform for education and information sharing is presentations and breakout sessions at cross-disciplinary local and national meetings. This type of platform has the potential to reach a large number of physicians from various regions in a single setting and allows for the dissemination of information pertaining to GS, although it is no substitute for more detailed, hands-on training. Presentations should address the technical aspects of testing, as well as the benefits, risks, and limitations. Information regarding the accessibility of genetics professionals and other resources for further information and/or collaboration is an important part of any such platform.

LABORATORY PROCESS, DATA GENERATION, AND QUALITY CONTROL

Preanalytical and Quality Laboratory Processes

The critical difference distinguishing a clinical genomic sequencing service from sequencing performed in a research setting is the establishment, implementation, and documentation of standard operating procedures (SOPs) that ensure the integrity and quality of samples, data, patient privacy, interpretation, and the final reports returned to the physician. It is also essential to have an available staff of licensed and trained professionals, including genetic counselors, clinical molecular geneticists, medical geneticists, and bioinformaticians; in addition to working with each other, many members of this team will need to be able to work directly with ordering physicians and to produce a final report that physicians can use to make patient health care decisions. The application of genomic sequencing services to the medical field requires physicians and patients to be adequately informed and trained to handle the information presented in the final report, while also limiting data that may not be useful or that distract from the patient's health care decisions.

While a major focal point of running a clinical genomic sequencing facility centers on ensuring the accuracy of data and report generation, the majority of errors in a clinical laboratory setting are made during the preanalytical phase of performing clinical tests [25]. The bulk of laboratory errors occur in the preanalytical phase (~60%), with the postanalytical phase being the second most problematic (~25%) [36]. The majority of preanalytical errors can be attributed to sample handling and patient identification. Many of these errors can be avoided by developing a chain-of-custody program that is scalable to the number of samples entering the lab. Examples include sample collection kits that contain matched, barcoded sets of tubes, forms, and instructions for proper patient handling; the implementation of a Laboratory Information Management System (LIMS); and regular, ongoing training and assessment of staff performance. Many other suggestions exist for how to develop high-quality processes and procedures that minimize the probability of such errors.

Given the complexity and personalized nature of each genomic sequencing test, appropriate tools should be developed to optimize communication between clinical genomics laboratory personnel and the physician, to ensure the test is being appropriately ordered and analyzed for the patient. Ideally, genetic counselors should be available to communicate with ordering physicians regarding why the test is being ordered, what the deliverable will be, and how the results can be used appropriately. It is also appropriate to offer training and support tools (e.g., podcasts or downloadable instructions) to the ordering physician in order to prepare them for navigating the multiple steps of this process. While onerous for both the laboratory and the physician, such steps taken ahead of test ordering are likely to optimize the results and significantly reduce the probability of error, confusion, and misunderstanding.

Test results from genomic sequencing can include much more data than the physician requires for patient diagnosis, which can be distracting and potentially stressful to the patient. In order to circumvent issues with excess data, the clinical genomic sequencing laboratory should be forthright in verifying the types of information the patient/physician would prefer to receive or not receive. The physician or genetic counselor should consider the specific situation of the patient when assessing the appropriateness of potentially delivering sensitive incidental findings, such as susceptibility to late-onset conditions, cancer, or concerns about nonpaternity. The American College of Medical Genetics and Genomics (ACMG) Working Group on Incidental Findings in Clinical Exome and Genome Sequencing recently released a list of genes and categories of variants that should be conveyed in a clinical genomics report, even if found as secondary or incidental results [29]. The ACMG Working Group feels that although these variants may not be immediately necessary for the diagnostic decision at hand, they could be critical for the patient's well-being and should not be withheld. Physician training should therefore include how to assist the patient in deciding what should be reported, and how to subsequently complete the required informed consent, signed by the patient, which outlines the terms of the genomic test request. Since genomic sequencing produces a very complex data set with a plethora of information, tools should be developed to enable easy navigation of results as questions arise, and support should be offered to the physician before the test results are delivered, in order both to maximize value for the patient and to minimize the time and effort required of the physician.

Patient sample tracking can be one of the most difficult components of clinical genomic sequencing, because the process requires many steps that occur over a period of several days, among several people, and the process must protect the patient's privacy in accordance with Health Insurance Portability and Accountability Act (HIPAA) regulations. However, this challenge is not specific to GS and has been recognized as an ongoing challenge for all clinical laboratories.
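The chain-of-custody measures described above (matched, barcoded collection kits tracked in a LIMS) lend themselves to simple automated checks. The sketch below is hypothetical, with invented field names rather than any particular LIMS schema; it only illustrates the kind of accessioning rule a laboratory might encode.

```python
# Hypothetical sketch of an accessioning check: every item in a collection kit
# must share one kit barcode and one patient identifier before processing.

from dataclasses import dataclass

@dataclass
class KitItem:
    kit_barcode: str
    item_type: str      # e.g., "EDTA tube" or "requisition form" (illustrative)
    patient_id: str

def verify_kit(items: list) -> list:
    """Return a list of discrepancies; an empty list means the kit passes."""
    problems = []
    if len({item.kit_barcode for item in items}) != 1:
        problems.append("mismatched kit barcodes")
    if len({item.patient_id for item in items}) != 1:
        problems.append("mismatched patient identifiers")
    return problems

kit = [KitItem("KIT0001", "EDTA tube", "PT-123"),
       KitItem("KIT0001", "requisition form", "PT-123")]
print(verify_kit(kit) or "kit passes accessioning check")
```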

Analytical

The processing of one genome can be divided into three categories: wet-lab processing, bioinformatics analysis, and interpretation and report generation. As the authors of this chapter use an Illumina platform, the details described below are consistent with that platform; though other platforms vary in the specifics of certain steps, the general principles nonetheless remain the same. These principles include DNA extraction, DNA shearing and size selection, ligation of oligonucleotide adapters to create a size-selected library, and physical isolation of the library fragments during amplification and sequencing.

As discussed more fully in Chapter 1, for library preparation, intact gDNA is first sheared randomly. Prior to adapter ligation, the sheared gDNA is blunt-ended and an adenosine overhang is created through a "dATP" tailing reaction. The adapters comprise the sequencing primer and an additional oligonucleotide that will hybridize to the flowcell. After adapter ligation, the sample is size-selected by gel electrophoresis, gel extracted, and purified.

After the library has been constructed, it is denatured and loaded onto a glass flowcell, where it hybridizes to a lawn of oligonucleotides complementary to the adapters; an automated liquid-handling instrument adds the appropriate reagents at the appropriate times and temperatures. The single-stranded, bound library fragments are then extended, and the free ends hybridize to the neighboring lawn of complementary oligonucleotides. This "bridge" is then grown into a cluster through a series of PCR amplifications. In this way, a cluster of approximately 2000 clonally amplified molecules is formed; across a single lane of a flowcell, there can be over 37 million individual amplified clusters. The flowcell is then transferred to the sequencing instrument. Depending on the target of the sequencing assay, it is also possible to perform a paired-end read, in which the opposite end of the fragment is also sequenced.

As mentioned previously, NGS works by individually sequencing fragmented molecules. Each of these molecules represents a haploid segment of DNA. In order to ensure that both chromosomes of a diploid region of DNA are represented, it is therefore necessary to have independent sampling events. This is a classic statistical problem, in which the number of independent sampling events required to have a given probability of detecting both chromosomes can be represented by the formula:

$$P(x;\, p,\, N) \;=\; \sum_{K=x}^{N-X} \frac{N!}{(X!)\,(N-X)!}\; p^{X} q^{(N-X)}$$
In this case, the number of independent sampling events needed to detect a variant depends on the number of total sampling events and the number of times each allele is detected, where N is the number of sampling events, X is the number of alternative observations, and p and q are the frequencies of allele 1 and allele 2, each of which should (in constitutional testing) be present in equal proportions. Using this principle, it is possible to estimate the minimum number of sampling events required to have confidence that the call represents both chromosomes, and therefore that, if a variant were present, it would be detected. However, this formula represents the ideal, in which all calls are perfect; in reality calls are not always perfect, so additional quality monitoring of the call, and of how well the call maps to a position in the genome, must also be considered.

During assay validation, the number of independent sampling events, typically referred to as the depth of coverage, should be measured on known samples to understand what thresholds of depth yield what confidence of detection. In the Illumina Clinical Services Laboratory (ICSL), evaluation of multiple known samples run at various depths, together with a subsampling (bootstrapping) analysis, showed that the results generally track well with the hypothetical expectations (Table 2.1).

TABLE 2.1 Example of an Analysis to Assess SNV Calling Accuracy Using Three Coriell Samples

Depth of coverage    Sensitivity    Specificity
30×                  >99.9%         >99.99%
20×                  99.5%          >99.99%
10×                  98.0%          99.99%
5×                   N/A            99.99%

Relationship between depth of coverage, sensitivity, and specificity. In the context of sequencing, sensitivity describes the probability that a variant would be detected if present in the heterozygous state; specificity describes the probability that the correct nucleotide is called, and that a variant is not called when it is not present.

Based on these results, an average coverage of 30-fold results in >99.9% sensitivity and >99.99% specificity of variant detection. However, when the average call is made at 30-fold depth, somewhat less than half of the total calls are made at less than 30-fold depth, and these thus have a lower sensitivity and specificity for variant detection. Requiring a minimal depth of coverage at any position to make a call, specifically 10-fold depth, yields 97% sensitivity and 99.9% specificity. Mapping the distribution of coverage for every call made (Figure 2.1) makes it possible to evaluate the corresponding confidence for any given individual call, and essentially to back-calculate the average depth of coverage required to meet specific sensitivity and specificity test metrics. The same approach can be applied for nondiploid situations, for example, detection of heteroplasmy, a somatic variant, or sequence variants in different strains of microbes.

The genomic sequencing process must be tightly monitored, with quality assessments at each step, to ensure that the sample is progressing with the highest possible quality. The steps of the sequencing process for which monitoring is useful include DNA extraction, library preparation, cluster generation, and the sequencing run itself, and each must be assayed for quality. In each case, the quality thresholds established during validation and the average performance metrics of other samples should be compared to the sample being processed; if a sample shows metrics outside of the normal range, it should be investigated.

Robotics and automation are valuable additions that can be made to a protocol to minimize the possibility of human error. DNA extraction, library preparation, and cluster generation can all be performed by a robot. Newer-generation sequencing machines combine the cluster generation and sequencing steps on one instrument, further limiting the possibility of human error in the transfer of the flowcell between steps. Future advances that further combine the laboratory steps with automation will increasingly reduce potential errors. In fact, it is easy to imagine that, in the very near future, sequencing will be performed with full automation that does not require human touch after the initial sample loading.


[Figure 2.1: histogram of the distribution of depth of coverage across the genome (x-axis: depth of coverage from 10 to >100; y-axis: percentage of calls, 0–35%), with an inset table showing that 30× depth corresponds to >99.9% sensitivity and >99.99% specificity, and 10× depth to 98.0% sensitivity and 99.99% specificity.]

FIGURE 2.1 For any given average depth of coverage for the genome, the coverage at individual loci will be distributed around that average. This graph displays the distribution of coverage when the average depth of coverage for the genome is about 40-fold. Since most bioinformatics pipelines cannot reliably detect variants at positions with fewer than 10 independent sampling events (less than 10-fold depth of coverage), the graph does not include that region of the distribution.
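The sampling argument behind Table 2.1 and Figure 2.1 is easy to reproduce. The sketch below is an illustration of the binomial reasoning in the text (idealized, with perfect calls and p = q = 0.5), not the laboratory's production analysis; requiring at least two reads of each allele is a threshold chosen for this example.

```python
# Illustrative binomial calculation: probability that a heterozygous site
# shows at least `min_reads` reads of EACH allele at a given depth, assuming
# perfect base calls and equal sampling of the two alleles (p = q = 0.5).

from math import comb

def p_detect_het(depth: int, min_reads: int = 2, p: float = 0.5) -> float:
    """P(both alleles observed >= min_reads times among `depth` reads)."""
    return sum(
        comb(depth, k) * p**k * (1 - p) ** (depth - k)
        for k in range(min_reads, depth - min_reads + 1)
    )

for depth in (5, 10, 20, 30):
    print(f"{depth:>2}-fold coverage: P(detect het) = {p_detect_het(depth):.4f}")
```

Consistent with the table, detection probability climbs from roughly 63% at 5-fold to about 98% at 10-fold, and is essentially complete by 30-fold under these idealized assumptions.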

Quality control of clinical genomic sequencing services is an ongoing responsibility that should continually evolve based on monitoring and evaluation, and should also be externally monitored via proficiency testing, in which laboratories using similar or orthogonal techniques compare their results. The use of appropriate metrics and controls for each step of the process will ensure robust sequencing runs and the ability to identify where and how errors may have arisen. External controls, such as lambda DNA fragments, can be spiked into samples and followed through the process. Additionally, controls internal to a sample can also be used effectively. In addition to controls, specific run and performance metrics should be established during the validation phase that can be used to monitor individual run performance and ensure that the equipment and chemistries are performing as expected. Comparison with an orthogonal assay, such as microarray analysis, is one such approach, but other possibilities exist. By comparing the concordance of calls with those from a genome-level microarray, not only can a measure of sequence quality be obtained, but sample swaps or contamination can also be detected.
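As a simple illustration of the concordance check just described, the sketch below compares genotype calls at sites shared between an orthogonal assay (e.g., a microarray) and sequencing. The data structures are hypothetical; in practice the comparison would span hundreds of thousands of array sites, and a markedly low rate would trigger investigation of a sample swap or contamination.

```python
# Hypothetical sketch: genotype concordance between array and sequencing calls
# at shared sites, keyed by (chromosome, position).

def concordance(array_calls: dict, seq_calls: dict) -> float:
    shared = set(array_calls) & set(seq_calls)
    if not shared:
        raise ValueError("no overlapping sites to compare")
    matches = sum(array_calls[s] == seq_calls[s] for s in shared)
    return matches / len(shared)

array_gt = {("1", 1000): "A/G", ("2", 2000): "C/C", ("3", 3000): "T/T"}
seq_gt   = {("1", 1000): "A/G", ("2", 2000): "C/C", ("3", 3000): "C/T"}
print(f"concordance: {concordance(array_gt, seq_gt):.2%}")
```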

Bioinformatics

The path that leads from sequenced reads to the identification of genomic alterations is a highly complex process that involves sophisticated bioinformatics analysis. Generally, there are three steps in the analysis: (i) preprocessing of reads, (ii) alignment, and (iii) variant calling. As discussed at length in Chapters 7–11, while the first step may be accomplished in a short time, the remaining two can take many hours depending on the amount of data that needs to be analyzed. Preprocessing involves filtering out raw sequence data that do not meet certain quality criteria, thus minimizing any effect they may have on downstream analysis. The process of alignment involves placement of reads on a reference human genome sequence, which may be obtained from public sources such as the National Center for Biotechnology Information (NCBI), the University of California Santa Cruz (UCSC) Genome Browser, or Ensembl. There are many tools that employ different algorithms to align reads; each offers trade-offs between speed and accuracy [37,38]. A common attribute of all alignment algorithms is that they will fail to align some reads (5–10%) sequenced from a sample, because these reads may not be represented in the reference genome. Since alignment algorithms offer a best estimate for placing a read on the reference genome, there is also a mapping quality that represents the confidence associated with each read placement.
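Mapping qualities can be inspected programmatically once alignments are stored in the BAM format discussed below. This sketch uses the third-party pysam library and a hypothetical, sorted and indexed BAM file; the MAPQ cutoff of 20 (about a 1% chance of misplacement, on the Phred scale) is a common but illustrative choice.

```python
# Illustrative sketch: counting confidently mapped reads in a region of a
# sorted, indexed BAM file using pysam (a third-party library). The file name,
# region, and threshold are assumptions of this example.

import pysam

MIN_MAPQ = 20  # Phred-scaled: ~1% probability the read placement is wrong

with pysam.AlignmentFile("sample.bam", "rb") as bam:
    kept = dropped = 0
    for read in bam.fetch("chr1", 100_000, 101_000):
        if read.is_unmapped or read.mapping_quality < MIN_MAPQ:
            dropped += 1  # too uncertain to trust for variant calling
        else:
            kept += 1
print(f"kept {kept} reads; flagged {dropped} low-confidence alignments")
```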


One of the community-accepted standards for representing alignments is the BAM file format [39], which captures the above-mentioned data and allows efficient compression and random access to reads (when sorted) that align to a particular segment of the genome. Once the alignment procedure is complete, the BAM file serves as input to the next step in the bioinformatics pipeline: variant calling, where differences between the reference sequence and the aligned reads are examined. There is a plethora of variant calling tools, each specializing in detecting small (SNVs and indels) or large (SVs and CNVs) genomic alterations. Many factors need to be taken into consideration to ensure accurate genotype calling at every position in the exome or genome; depth of sequencing, alignment quality scores, and quality of sequenced bases are but a few that affect the outcome of variant detection. With respect to calling genotypes, there are two main paradigms. The first relies on base counting and allelic fraction to distinguish between heterozygous and homozygous genotype calls; the second uses probabilistic methods (Bayes' theorem) to calculate a posterior probability given the observed read data and a genomic prior probability [40]. The latter method accounts for noise in the data and helps provide a measure of the statistical uncertainty associated with each genotype call in the form of a score, which is usually a representation of the confidence in the call. Most variant calling algorithms output variants in VCF format, the standard for representing sequence variation.

Once a variant has been called, its biological and clinical implications must be assessed. A critical component of the bioinformatics pipeline is an annotation engine that gathers information about the observed variant. This usually involves incorporation of data such as overlap with a gene or transcript, codon and/or amino acid change, splice-site change, and variant frequency from publicly available databases such as 1000 Genomes [41] and dbSNP [42]. These are required when evaluating the clinical significance or pathogenicity of a variant, since they support accurate interpretation by clinical geneticists and genetic counselors. It is imperative to understand the pitfalls of using data from public databases, since many of the entries may not have been reviewed or carefully curated. The same caveat also holds for any commercially available database that describes variant, gene, and disease associations, since not all studies or experiments are designed and executed with comparable numbers and types of controls and cases.

Sequencing as a clinical test ultimately needs to provide a result back to the ordering physician based on all of the bioinformatics analyses. When dealing with a handful of variants, manual curation by reviewing all available information may be feasible, but this process does not scale when sequencing generates thousands or millions of variants. It is essential to have a framework that includes a repository to facilitate management of previously interpreted variants and that also provides a mechanism to categorize and review variants. This framework should link to the reporting framework, where physician and patient information, phenotypic information, and clinical evaluations are incorporated into a customized patient report.
The framework should also be flexible enough to allow reclassification or recategorization of variants (e.g., from benign to pathogenic), tracking which patients may be affected by this change and regenerating reports reflecting the update in variant classification.
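To make the first genotype-calling paradigm above concrete (base counting and allelic fraction), the following sketch classifies a single site from its reference and alternate read counts. The depth and allelic-fraction thresholds are illustrative assumptions; production callers use the probabilistic models described in the text.

```python
# Illustrative allelic-fraction genotype caller for one biallelic site.
# Thresholds are examples only, not validated clinical settings.

def call_genotype(ref_count: int, alt_count: int,
                  min_depth: int = 10, het_band=(0.2, 0.8)) -> str:
    depth = ref_count + alt_count
    if depth < min_depth:
        return "no-call (insufficient depth)"
    alt_fraction = alt_count / depth
    low, high = het_band
    if alt_fraction < low:
        return "homozygous reference"
    if alt_fraction > high:
        return "homozygous alternate"
    return "heterozygous"

print(call_genotype(ref_count=14, alt_count=16))  # ~0.53 -> heterozygous
print(call_genotype(ref_count=29, alt_count=1))   # ~0.03 -> homozygous reference
```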

Validation

Validation of NGS must be performed in every clinical genomics laboratory and updated in the event of any processing change, regardless of whether the chemistry or platform changes, as discussed in more detail in Chapter 21. Additionally, validations must be specific to the questions being asked; for example, if the testing is intended to detect substitutions, that capability must be validated, but a validation performed on substitutions will not assess the test's ability to accurately identify CNV or indel events. Additionally, regions that are variable across the genome, such as high- and low-GC regions, should be evaluated to understand the consistency of base calling quality. One method of validation is testing the preferred sequencing platform with a "truth set" of DNA. Several DNA samples are available that have been sequenced at specific loci of clinical importance (http://ccr.coriell.org). Additionally, groups such as the National Human Genome Research Institute and the Centers for Disease Control and Prevention's Genetic Testing Reference Materials Coordination Program (GeT-RM) are creating repositories of samples that have been characterized at the genomic level. Some labs have taken a multitiered approach to validation that includes in-house sequencing of a subset of these reference genomes, as well as sequencing samples with specific, well-characterized, and clinically valid mutations known to be causative for disease (as recommended under ACMG guidelines for newborn screening) [43]. By producing pools of data around these well-characterized regions, resampling (bootstrapping) analysis can be used to compare the selected platform's sequence reads to the known sequence of the "truth set" at different combinations of depth and quality, in order to establish the relationship between quality metrics and the sensitivity and specificity of the platform (Table 2.1).


Another method to establish the analytical validity of a bioinformatics analysis involves detecting Mendelian inheritance errors, or conflicts, when evaluating variant calls in samples that form a trio (parents and child) from a pedigree. The analysis must account for background conflicts that can be attributed to de novo mutations (generally assumed to be fewer than 100 per generation). The number of conflicts observed in excess of this expected "true" rate depends on the choice of aligner, the variant caller, and the settings used to align reads and make genotype calls. The validation procedure should seek to optimize sensitivity of variant detection while minimizing the number of conflicts.

It is also critical to ensure that the annotation pipeline is robust and accurate. Annotation is an important and critical step that aids geneticists and counselors in evaluating the severity of a variant based on information such as its consequence, frequency in various populations, and association with disease (if such information is present in a database). The validation procedure should choose a set of random SNVs for which the aforementioned information is available and subject them to an accuracy test in which known information is compared with the output of the annotation pipeline and concordance established. Robustness may be evaluated by a repeatability test: running the annotation pipeline multiple times on a data set that represents an actual genomic sequence and ensuring that there are no failures and no deviations in results. Even if the process of annotation uses something like a web-based search engine, multiple tests should be run to determine the best combination of search terms to establish comprehensive information gathering. Additionally, for features such as frequency data, the population/subpopulation trends and the number of individuals included should be considered in addition to the frequency itself, inasmuch as there are several instances of a variant reported in a single population at 50% frequency that, upon inspection, turns out to have been detected in only a single individual.

Validations must also be performed for the reporting procedures that have been established to return test results to the ordering physician. Accurate transfer of patient, sample, and physician information such as names, date of birth, sample type, and so on must be verified. Similarly, it is necessary to ensure that results from the test (measurements and interpretations) are transferred to the report accurately, whether by an electronic or a manual/physical method.
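The trio consistency check described above can be sketched in a few lines. The genotype representation here (unphased allele pairs) and the example sites are hypothetical; an actual validation would run such a check genome-wide and compare the observed conflict rate against the expected de novo background.

```python
# Illustrative Mendelian-conflict check for one site in a parent-child trio.
# Genotypes are unphased allele pairs, e.g., ("A", "G").

from itertools import product

def mendelian_conflict(mother, father, child) -> bool:
    """True if the child's genotype cannot arise from the parents' alleles."""
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) not in possible

sites = [
    (("A", "A"), ("A", "G"), ("A", "G")),  # consistent with inheritance
    (("A", "A"), ("A", "A"), ("A", "G")),  # conflict (or a de novo mutation)
]
for mother, father, child in sites:
    print(mendelian_conflict(mother, father, child))
```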

Proficiency Testing

The CAP molecular pathology on-site inspections occur every 2 years, but ongoing proficiency testing with both intra- and interlaboratory analysis improves testing procedures and helps to prevent errors [25]. Several well-characterized cell lines harboring known mutations are available from commercial sources, and CAP has also recently made available resources for proficiency testing of both the wet-lab and bioinformatics portions of NGS. In addition, clinical laboratories currently offering genomic-level sequencing have developed alternative proficiency testing programs (so-called specimen exchange programs) to enable interlaboratory comparison of exome and GS test metrics.

Interpretation and Reporting One of the most challenging areas for a clinical laboratory is that of providing a clinical report to a physician. The goal is to provide the results, including information about the results, that would be useful to a practitioner in as comprehensive but succinct a way as possible. Obviously, there is some challenge in providing the full set of information and also communicating what is most important as clearly and briefly as possible. For many traditional laboratory tests, where the results tend to be reported in terms of presence/absence of a specific marker, this is a less challenging issue. However, the advent of sequencing as a clinical test requires a modified approach, since any variant within a predefined region can be detected, although its clinical relevance may be completely unknown. The challenge of evaluating variants that do not necessarily have clear clinical implications but may well play an important role in the diagnostic evaluation or pathology of the patient has been well described [44,45]. As the scale increases, the principles remain the same, but the processes required to manage the information and communicate it effectively become much more sophisticated [46]. In this section, we will discuss some of the approaches, tools, and challenges of genomic level interpretation and reporting. There are several elements to consider when evaluating a genetic variant, which may or may not be disease associated (Figure 2.2). Biological characteristics include the type of variant, where it occurs in the gene, the frequency of the variant, and possibly in silico evaluations of the variant. Clinical characteristics include whether the variant has been reported to be associated with a condition or phenotype and can take the form of case study

I. METHODS

32

2. CLINICAL GENOME SEQUENCING

Variants

• • •



Frequency Type of variant Literature • Case studies • Case controls • Functional studies In silico evalutions

Gene

• •

Disease

Relationship of gene to disease Complexity of gene(s) association to disease

• • • •

Mode of inheritance • Incidence/prevalence Penetrance • Age of onset • •

Clinical context

Why is the patient being tested What tests have already been run Pedigree Clinical symptoms

FIGURE 2.2 Steps involved in interpretation and reporting of variants. After a sample has been sequenced and the individual nucleotide calls made, the biological implications of any sequence variants must be evaluated. These steps do not necessarily need to happen in the order illustrated, but all must happen to fully evaluate the implications of a variant in the appropriate clinical context. The types of information to be considered in each step are listed below the steps.

The gene in which the variant was detected must also be considered; the degree of understanding of the relationship of the gene to the disease, as well as the complexity of the gene(s) associated with the disease, are also relevant. This includes familiarity with phenomena such as whether certain types of mutations (e.g., activating) or regions of genes (specific exons) are known to be more or less likely to be associated with a particular disease. With regard to the disease, specifics of the mode of inheritance, prevalence of disease, and age of onset are important considerations. For example, if a disease has a prevalence of 1/100,000 and is autosomal recessive, then, using Hardy–Weinberg principles, a variant with a frequency higher than 1% is unlikely to be causative of that disease (a worked version of this calculation is sketched below). Likewise, if mutations known to cause disease are exclusively gain-of-function, then a stop mutation or silent mutation is less likely to be considered pathogenic. Finally, when reporting results, the clinical questions and context of the patient must be considered: is this a diagnostic evaluation or a carrier screen? What other tests have already been performed? Is there any additional phenotypic information that might be relevant for the results and how they should be considered? This is a complex set of considerations, and it requires clinical and technical genetic knowledge.

The development of databases and tools that enable searches and collation of variant-level information has been hugely helpful to the field and enables much information to be gathered in an automated and systematic way. These tools tend to be excellent resources for bringing together the information that a molecular pathologist or geneticist needs for interpretation. However, the information within different databases is highly variable in multiple ways, including accuracy, thoroughness, and how frequently it is updated. In addition, significant bias may be present in the ways the information was gathered and is presented. Several databases report whether a variant is pathogenic, but they use quite different sets of standards or weight the various pieces of information quite differently. These databases are undoubtedly the best sources of information available; however, they should be considered only a first step in the evaluation process. Each laboratory must consider what pieces of evidence are needed for an appropriate evaluation in that setting, given the specific questions the laboratory is trying to address. Laboratories should evaluate how the databases they use have been constructed and are maintained in order to understand what the information really means for them. The implication is that these databases enable interpretation but require further evaluation and consideration of context for appropriate application of the information.

Literature has historically been a critical source of information regarding the clinical associations of a genetic variant with a disease. The ability to publish case reports is critical for identifying genes and variants that are suspicious for possibly causing a disease. However, these are often just first steps. Subsequent studies in which cases have been evaluated against controls, either in pedigree or population fashion, and additional functional evaluations are critical for providing evidence regarding the confidence that a variant is likely causative for a disease.
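As a worked version of the Hardy–Weinberg argument above, the following sketch computes the maximum credible pathogenic allele frequency for a fully penetrant autosomal recessive disease; the simplifying assumptions (full penetrance, a single causative allele treated as an upper bound) are ours, not the chapter's.

```python
import math

def max_pathogenic_allele_freq(prevalence):
    """For a fully penetrant autosomal recessive disease with prevalence q^2,
    no single causative allele can be more common than q = sqrt(prevalence).
    With multiple causative alleles, each must be rarer still, so this is an
    upper bound."""
    return math.sqrt(prevalence)

q = max_pathogenic_allele_freq(1 / 100_000)
print(f"q = {q:.4f}")  # ~0.0032, i.e., about 0.32%
# A variant observed at >1% population frequency exceeds this bound roughly
# threefold and is therefore unlikely to be causative on its own.
```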
Very often an initial report will appear in which a variant is found in a gene that is known to be associated with disease, and the variant may seem a compelling explanation for the disease, but further studies show that the variant is also found in unaffected individuals or has no effect on the protein's function. It is therefore important to review all the literature associated with a variant, with specific attention to the primary data; some papers are more robust than others and should be weighted accordingly. Clinical laboratories are well positioned to bring significant expertise to this process since they typically have a well-trained, clinically oriented staff of MDs, PhDs, and genetic counselors who review the evidence presented in these papers and bring all of the considerations (Figure 2.2) into the final reporting language. This process is challenging and time-consuming even at a single-gene level, and there are several efforts underway to enable better sharing and more standardized processes so that large-scale interpretation becomes feasible for clinical laboratories to support.

A genome is composed of over 3 billion base pairs, and any given individual has about 3.3 million variants from the reference genome. However, the reference genome is a representation of actual individuals who may carry minor or disease-causing alleles themselves; it does not represent a "wild-type" or nondiseased individual. The ICSL has detected over 60,000 positions in the reference genome where the frequency of the reference nucleotide is less than 1%. Given these numbers, collection of the biological information for these variants must be done in an automated fashion. Established facts can be used to filter out variants that, from a biological perspective, can be ruled out as causing a particular disease. Collection of the literature associated with variants should also be automated, which requires a standardized set of terms, such as that maintained by the Human Genome Variation Society (HGVS) (www.hgvs.org), that can be assessed by bioinformatics search tools.

The extent to which natural language processing and other software approaches can automate the evaluation of variants is still highly debated; while it is clear that these tools are invaluable for collating information, the extent to which they can be reliably used to evaluate the clinical implications of that information is unclear. One primary challenge is that individuals must be able to read through a paper, evaluate the strength of the evidence regardless of the authors' conclusions, and document this review, an overall process that currently requires professionals to spend significant amounts of time sifting through vast numbers of papers. Since individual clinical laboratories would be challenged to hire a qualified staff large enough to support such efforts, databases where such information is available and can be shared will be extremely valuable. At the same time, each laboratory that builds an internal database incurs a huge expense in the effort; the mechanisms by which laboratories can create community access that benefits other laboratories, and ultimately patients, while still paying for the effort required are therefore interesting and active areas of development. Once the clinical implications of a particular individual's variants have been decided, the information must be put into the clinical context for which the test was ordered.
Incidental findings can potentially be quite numerous, and a single answer might not be found; in other cases, there could be multiple variants in multiple genes that plausibly lead to a patient's symptoms and have equally inconclusive or conclusive evidence supporting them. In fact, it has been shown that in at least a few cases, patients with perplexing clinical presentations are in fact suffering from more than one genetic disease [20]. Reports must be flexible enough to enable the benefit of a personalized survey of the genome but standardized enough to enable clear communication of results. A searchable electronic report might be the best solution; this could provide links to disease descriptions and additional evidence that practitioners could access as needed. The goal is not only to provide a succinct answer to the major question of the moment, but also to ensure that both the physician and patient are aware of attributes of the test that may affect results, such as:

1. The standards that the laboratory uses in order to make variant calls, and the associated confidence or analytical validity of the calls
2. A description of the criteria used to filter or include genes and variants in the assessment
3. The criteria used to classify variants
4. An indication of the weaknesses of the test, and of any recommendations regarding additional testing that could address these weaknesses.

An ongoing challenge will be the large number of variants that are of uncertain clinical significance. The International Standards for Cytogenomic Arrays Consortium has demonstrated approaches to dealing with the large number of novel and uncertain variants that are detected when genomic evaluations are routinely performed (https://www.iscaconsortium.org). While genetic information is incomplete and the biological implications are not always clear, in less than a decade the cytogenetics community has made huge strides in understanding the nature and degree of variation at the cytogenetic level; similar advances can be anticipated in NGS to better understand and codify human genomic variation.


CONCLUSION

The field of human GS is quite young, and many tools and processes must be developed in order to fully realize and apply the potential benefits that this type of testing might provide to patients. The scale of information produced, and how to report that information, are significant hurdles. Additionally, the biological impact of the variants that can now be detected is often unclear. The application of GS to clinical cases requires development of appropriate guidelines and policies, as well as bioinformatics and informatics tools, that enable not only more accurate variant detection but also better communication among physicians, genetic counselors, and clinical laboratories. This is a time of great opportunity and challenge for the clinical genetics community; through community efforts to develop and establish these policies and tools and to foster informed health care providers, patient care can be improved by bringing genomics to the clinic.

References

[1] Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. PNAS 1977;74(12):5463–7.
[2] Collins FS, Morgan M, Patrinos A. The Human Genome Project: lessons from large-scale biology. Science 2003;300(5617):286–90.
[3] Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature 2001;409(6822):860–921.
[4] Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science 2001;291(5507):1304–51.
[5] Mardis ER. A decade's perspective on DNA sequencing technology. Nature 2011;470(7333):198–203.
[6] Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456(7218):53–9.
[7] Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005;437(7057):376–80.
[8] Clarke J, Wu H-C, Jayasinghe L, Patel A, Reid S, Bayley H. Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 2009;4(4):265–70.
[9] Rabbani B, Mahdieh N, Hosomichi K, Nakaoka H, Inoue I. Next-generation sequencing: impact of exome sequencing in characterizing Mendelian disorders. J Hum Genet 2012;57(10):621–32.
[10] Teer JK, Mullikin JC. Exome sequencing: the sweet spot before whole genomes. Hum Mol Genet 2010;19(R2):R145–51.
[11] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008;5(7):621–8.
[12] Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10(1):57–63.
[13] Furey TS. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nat Rev Genet 2012;13(12):840–52.
[14] Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 2009;10(10):669–80.
[15] Gonzaga-Jauregui C, Lupski JR, Gibbs RA. Human genome sequencing in health and disease. Annu Rev Med 2012;63(1):35–61.
[16] Pasche B, Absher D. Whole-genome sequencing. JAMA 2011;305(15):1596–7.
[17] Voelkerding KV, Dames SA, Durtschi JD. Next-generation sequencing: from basic research to diagnostics. Clin Chem 2009;55(4):641–58.
[18] Metzker ML. Sequencing technologies: the next generation. Nat Rev Genet 2010;11(1):31–46.
[19] Dinwiddie DL, Kingsmore SF, Caracciolo S, Rossi G, Moratto D, Mazza C, et al. Combined DOCK8 and CLEC7A mutations causing immunodeficiency in 3 brothers with diarrhea, eczema, and infections. J Allergy Clin Immunol 2013;131(2):594–7.e1–3.
[20] Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 2010;42(1):30–5.
[21] Worthey EA, Mayer AN, Syverson GD, Helbling D, Bonacci BB, Decker B, et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med 2011;13(3):255–62.
[22] Saunders CJ, Miller NA, Soden SE, Dinwiddie DL, Noll A, Alnadi NA, et al. Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units. Sci Transl Med 2012;4(154):154ra135.
[23] Jones SJ, Laskin J, Li YY, Griffith OL, An J, Bilenky M, et al. Evolution of an adenocarcinoma in response to selection by targeted kinase inhibitors. Genome Biol 2010;11(8):R82.
[24] Mwenifumbo JC, Marra MA. Cancer genome-sequencing study design. Nat Rev Genet 2013;14(5):321–32.
[25] Chen B, Gagnon MC, Shahangian S, Anderson NL, Howerton DA, Boone DJ. Good laboratory practices for molecular genetic testing for heritable diseases and conditions. Atlanta, GA: Department of Health and Human Services, Public Health Service, Centers for Disease Control and Prevention; 2009.
[26] Papasavva T, van Ijcken WF, Kockx CE, van den Hout MC, Kountouris P, Kythreotis L, et al. Next generation sequencing of SNPs for non-invasive prenatal diagnosis: challenges and feasibility as illustrated by an application to beta-thalassaemia. Eur J Hum Genet 2013.
[27] Srinivasan A, Bianchi DW, Huang H, Sehnert AJ, Rava RP. Noninvasive detection of fetal subchromosome abnormalities via deep sequencing of maternal plasma. Am J Hum Genet 2013;92(2):167–76.
[28] Gargis AS, Kalman L, Berry MW, Bick DP, Dimmock DP, Hambuch T, et al. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat Biotechnol 2012;30(11):1033–6.
[29] Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med 2013;15(7):565–74.


[30] Association for Molecular Pathology Whole Genome Analysis Working Group. The Association for Molecular Pathology's approach to supporting a global agenda to embrace personalized genomic medicine. J Mol Diagn 2011;13(3):249–51.
[31] Schrijver I, Aziz N, Farkas DH, Furtado M, Gonzalez AF, Greiner TC, et al. Opportunities and challenges associated with clinical diagnostic genome sequencing: a report of the Association for Molecular Pathology. J Mol Diagn 2012;14(6):525–40.
[32] Lindor NM, McMaster ML, Lindor CJ, Greene MH; National Cancer Institute, Division of Cancer Prevention, Community Oncology and Prevention Trials Research Group. Concise handbook of familial cancer susceptibility syndromes—second edition. J Natl Cancer Inst Monogr 2008;38:1–93.
[33] American College of Obstetricians and Gynecologists Committee on Genetics. ACOG Committee Opinion No. 486: update on carrier screening for cystic fibrosis. Obstet Gynecol 2011;117(4):1028–31.
[34] ACOG Committee on Genetics. ACOG Committee Opinion No. 442: preconception and prenatal carrier screening for genetic diseases in individuals of Eastern European Jewish descent. Obstet Gynecol 2009;114(4):950–3.
[35] European Medicines Agency. Guideline on the use of pharmacogenetic methodologies in the pharmacokinetic evaluation of medicinal products. Published guideline; 2011.
[36] Carraro P, Plebani M. Errors in a stat laboratory: types and frequencies 10 years later. Clin Chem 2007;53(7):1338–42.
[37] Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nat Methods 2009;6(Suppl. 11):S6–12.
[38] Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010;11(5):473–83.
[39] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009;25(16):2078–9.
[40] Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 2011;12(6):443–51.
[41] 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467(7319):1061–73.
[42] Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29(1):308–11.
[43] Watson MS, Lloyd-Puryear MA, Mann MY, Rinaldo P, Howell RR. Main report. Genet Med 2006;8(Suppl. 5):12S–252S.
[44] Maddalena A, Bale S, Das S, Grody W, Richards S; ACMG Laboratory Quality Assurance Committee. Technical standards and guidelines: molecular genetic testing for ultra-rare disorders. Genet Med 2005;7(8):571–83.
[45] ACMG Laboratory Practice Committee Working Group. ACMG recommendations for standards for interpretation of sequence variations. Genet Med 2000;2(5):302–3.
[46] Ashley EA, Butte AJ, Wheeler MT, Chen R, Klein TE, Dewey FE, et al. Clinical assessment incorporating a personal genome. Lancet 2010;375(9725):1525–35.


CHAPTER 3

Targeted Hybrid Capture Methods

Elizabeth C. Chastain
Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA

O U T L I N E

Introduction
Basic Principles of Hybrid Capture-Based NGS
    Specimen Requirements and DNA Preparation
    Determining the Target ROI
    Designing Capture Baits
    General Overview of Library Preparation
    Coverage and Uniformity
    Specificity and Sensitivity
    Obstacles of Target Capture
    Library Complexity
Hybrid Capture-Based Target Enrichment Strategies
    Solid-Phase Hybrid Capture
    Solution-Based Hybrid Capture
    Molecular Inversion Probes
    Comparison of Targeted Hybrid Capture Enrichment Strategies
    Amplification-Based Enrichment Versus Capture-Based Enrichment
Clinical Applications of Target Capture Enrichment
    Exome Capture
    Selected Gene Panels
    Disease-Associated Exome Testing
Variant Detection
Practical and Operational Considerations
    Workflow and TAT
Conclusions
References

KEY CONCEPTS

• Successful hybrid capture-based next generation sequencing (NGS) is influenced by a range of parameters including probe design efficiency, on-target coverage, uniformity of coverage, analytical sensitivity and specificity, library complexity, amount of DNA input required, scalability, reproducibility, ease of use, and overall cost.

• Several obstacles, such as base composition (i.e., very high or low guanine-cytosine (GC) content) and high sequence homology (e.g., repeat elements and pseudogenes), usually result in inefficient capture and inadequate coverage of the target region of interest (ROI).

• Three main hybrid capture-based target enrichment strategies (solid-phase, in-solution, and molecular inversion probes) are available, each with inherent strengths and weaknesses.

• Whole exome sequencing (WES) currently provides the most clinical utility in the setting of constitutional disease when used to test for germ line variants to facilitate accurate diagnosis in individuals with disorders that present with atypical manifestations or a variable phenotype, are difficult to confirm using clinical or laboratory criteria alone, or otherwise require extensive or costly evaluation.

• Somatic (cancer) testing is generally performed by target enrichment of selected or limited gene panels rather than WES and focuses on genes that are considered clinically actionable (i.e., have a well-established literature providing evidence for their diagnostic, predictive, and/or prognostic value).

• Disease-associated exome testing (capture and sequencing of the entire exome with subsequent reporting of only the genes relevant to the particular disease in question) may provide advantages over WES or limited gene panel testing.

• For maximum clinical utility, the bioinformatic pipeline should be designed to detect all four classes of genomic variants at allele frequencies that are physiologically relevant.

• Although whole genome sequencing (WGS) provides the most comprehensive and least biased method of NGS, bioinformatics constraints and cost currently make targeted NGS, whether amplification-based or hybrid capture-based, the more practical choice for clinical use.

INTRODUCTION

Next generation sequencing (NGS) advances over the last decade have resulted in a significant decrease in sequencing costs with a concurrent increase in ease of use and sample throughput. Although whole genome sequencing (WGS) is now feasible, its clinical use is limited by its expense and by the tedious analysis required to interpret the tremendous volume of data generated. The utility of whole genome analysis is further complicated by a lack of understanding of the functional clinical role of variants identified in most genes, as well as in nearly all noncoding regions. Currently, most clinical sequencing efforts therefore focus on regions of the genome with known clinical relevance, most often through the use of targeted enrichment methods. Targeted sequencing approaches have been shown to be useful for identifying the cause of genetic diseases, evaluating disease risk, and informing treatment decisions. Two approaches are widely used to enrich for specific regions of the genome: amplification-based and hybrid capture-based. As discussed in more detail in Chapter 1, amplification-based NGS relies on polymerase chain reaction (PCR) strategies to selectively amplify regions of interest. Hybridization-based enrichment methods instead utilize DNA or RNA probes and rely on complementary base pairing of nucleotides to capture target regions. Both approaches to target enrichment can be used for a diverse range of applications on several different sequencing platforms. This chapter will focus on important aspects of capture-based enrichment methods; amplification-based enrichment methods are detailed in Chapter 4.

BASIC PRINCIPLES OF HYBRID CAPTURE-BASED NGS

Design of an appropriate target enrichment method begins with delineation of the target region of interest (ROI) and identification of obstacles that may hinder adequate capture. Successful hybrid capture-based NGS is influenced by a range of parameters including probe design efficiency, on-target coverage, uniformity of coverage, analytical sensitivity and specificity, amount of DNA input required, scalability, reproducibility, ease of use, and overall cost [1–3]. This section will first discuss these important technical aspects; practical operational points will then be addressed.

Specimen Requirements and DNA Preparation

Input DNA used in clinical NGS testing can be derived from a variety of patient sample sources including peripheral blood, bone marrow aspirate specimens, buccal swabs, surgical resections, needle biopsies, and fine needle aspirations (FNAs). In addition, fresh or frozen tissue, formalin-fixed paraffin-embedded (FFPE) tissue, and methanol- or ethanol-fixed tissue can be used [4,5]. Neoplastic specimens received for NGS testing must undergo additional evaluation by a pathologist to confirm the presence of adequate tumor and to select areas for testing. A standardized extraction protocol is then followed in order to minimize inter-sample variation in DNA quality, and total DNA yield is commonly measured by either spectrophotometry or fluorometry [6].


Metrics including the A260/A280 and A260/A230 ratios can be used to estimate nucleic acid purity, and agarose gel electrophoresis can be performed to ensure the presence of high molecular weight genomic DNA (a smear may indicate sample degradation or contaminants). However, as discussed in the library preparation section below, the presence of high molecular weight DNA is not strictly mandatory; this is one of the reasons that hybrid capture methods work well with FFPE samples, which contain nucleic acids damaged by the fragmentation and crosslinking caused by formalin fixation during routine processing [4]. While sample processing is generally standardized within a laboratory, there is significant inter-laboratory variation in processing and fixation methods; many variables such as type of fixative, fixation time, and storage conditions significantly impact nucleic acid quality and in turn may affect subsequent capture and sequencing. One key variable is the method of decalcification: acid decalcification destroys nucleic acids, and acid-decalcified specimens are therefore unacceptable for sequence analysis; decalcification with a chelating agent (EDTA) is preferred [7,8].
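The purity and yield checks described above are straightforward to encode as an automated acceptance gate. The sketch below is illustrative only: the threshold values are commonly cited rules of thumb, and each laboratory must validate its own acceptance criteria.

```python
def passes_dna_qc(a260_a280, a260_a230, yield_ng,
                  min_280=1.7, max_280=2.0, min_230=1.8, min_yield_ng=250.0):
    """Flag specimens whose spectrophotometric ratios or total yield fall
    outside acceptance ranges (threshold values are illustrative, not
    validated laboratory criteria)."""
    failures = []
    if not (min_280 <= a260_a280 <= max_280):
        failures.append("A260/A280 outside range (protein contamination?)")
    if a260_a230 < min_230:
        failures.append("A260/A230 low (organic/chaotropic carryover?)")
    if yield_ng < min_yield_ng:
        failures.append("insufficient total DNA for library preparation")
    return failures  # empty list means the sample passes

print(passes_dna_qc(1.85, 2.1, 500.0))  # -> []
print(passes_dna_qc(1.45, 1.2, 80.0))   # -> three failure messages
```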

Determining the Target ROI

Determination of the target ROI is based largely on the specific clinical scenario and application of the assay. For constitutional disorders, assessment is generally focused on a well-established clinical diagnosis or a precise phenotype, which may span several disorders sharing similar phenotypic features. In this case, targeted capture panels are designed to encompass all established gene sequences containing variants known to be responsible for the disease or phenotype being evaluated. Alternatively, panels designed for cancer focus on genes with established variants that have diagnostic utility, predict the prognosis of a particular type of malignancy, or can be used to guide therapy. In both constitutional and somatic mutation testing, capture may include coding exons, flanking introns containing conserved splice sites, 5′ and 3′ untranslated regions (UTRs), and/or other regions reported to harbor pathogenic variants. A more thorough description of assay design for constitutional and somatic diseases is detailed in Chapters 16 and 19, respectively.

Designing Capture Baits

Several descriptions of the design of various RNA and DNA capture probes (also known as "baits") with broad variability in length (30–800 bp) have been published [9–11]. The fragment size of the genomic library largely dictates the optimal length of baits and also determines the efficiency with which the baits bind their targets. Shorter fragments are invariably captured with higher specificity than longer ones [9,12]; this is at least partly due to the facts that the average length of a human exon is 120 bp [13] and that exons have some of the highest sequence complexity in the genome. Therefore, capture probes that effectively hybridize fragments about 200 bp in length are desired because they increase capture efficiency and on-target coverage [1]. Another critical aspect of bait design is the degree of tiling, or overlap, of the capture probes; by allowing a single base to be covered multiple times, an overlapping design provides superior enrichment efficiency when compared with an adjacent or spaced design [14]. Design of capture probes must also be optimized to capture targeted regions with equal efficiency, referred to as balancing. Customized baits can be synthesized by a number of vendors via proprietary protocols or, alternatively, can be purchased.
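The tiling concept is easy to illustrate. The sketch below lays overlapping baits across a hypothetical target; the 120 bp bait length, 60 bp step (2× tiling), and 60 bp flank are illustrative parameters rather than any vendor's actual design rules.

```python
def tile_baits(target_start, target_end, bait_len=120, step=60, flank=60):
    """Lay overlapping baits across a target interval (0-based, half-open).
    step < bait_len produces tiling so each base is covered by >1 bait;
    flank extends the design into adjacent sequence to shore up edge coverage."""
    start = target_start - flank
    end = target_end + flank
    baits = []
    pos = start
    while pos + bait_len <= end:
        baits.append((pos, pos + bait_len))
        pos += step
    # Ensure the final bases are covered by a last, right-anchored bait.
    if not baits or baits[-1][1] < end:
        baits.append((end - bait_len, end))
    return baits

# Hypothetical 200 bp exon at positions 1000-1200
for bait in tile_baits(1000, 1200):
    print(bait)
```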

General Overview of Library Preparation

While library preparation methods are somewhat dependent on the enrichment and sequencing platforms being used, all hybrid capture methods have broadly similar workflows. The process begins with DNA fragmentation via sonication, nebulization, or endonuclease digestion to yield fragments of the desired length. This is followed by overhang end repair to form blunt ends by the addition of T4 DNA polymerase and Escherichia coli DNA polymerase I Klenow fragment. The 3′ to 5′ exonuclease activity of the enzymes removes 3′ overhangs, while the polymerase activity fills in 5′ overhangs and adds an "A" to the 3′ tail to prepare for adapter ligation [1]. Platform-specific oligonucleotide adapters are next ligated to the fragments; the adapters are usually indexed to make sample multiplexing possible. The purification required after each of these steps reduces the quantity of DNA present; therefore, an optional prehybridization PCR amplification step is often performed at this point to improve robustness, especially for suboptimal clinical samples. Next, the library is hybridized to the capture probes for 24–72 h, and the captured fragments are then eluted. A limited number of posthybridization PCR cycles allows for amplification of the captured sequences and addition of indexes prior to sequencing.


FIGURE 3.1 Distribution of coverage compared to target design. Panels (A) and (B) show the depth of coverage achieved over exon 9 of KIT using two different hybrid capture bait designs, as viewed in IGV. In each panel, a schematic of chromosome 4 orients the viewer to the genetic locus being illustrated. Numbers indicating the nucleotides included in the viewing range are also shown (approximately chr4:55,591,600–55,592,600). The depth of coverage is shown as a gray curve, ranging from 0 on the edges of the illustrated regions to a maximum of over 750× for both target designs. Coverage greater than 400× is indicated by a shaded orange box. The read alignments are colored by strand orientation, and the individual read alignments are shown with the reads packed closely together. The RefSeq track illustrates the position of KIT exon 9 (solid blue bar), flanked by introns (blue line with arrows indicating the orientation of the KIT gene on chromosome 4). In this example, the RefSeq track shows the ROI, which is KIT exon 9. The solid black bar labeled "Targets," under the RefSeq track, shows the targeted region included in the capture reagent for each panel. The targeted region is longer in panel (A), extending further into the introns flanking the ROI. Correspondingly, the region that is covered at a depth of at least 400× also extends beyond the ROI. In contrast, the targeted region in panel (B) is more narrowly centered on the ROI. Adequate coverage greater than 400× is also achieved for the entire ROI using this smaller target region, due to capture of specimen DNA fragments extending beyond the target region. These fragments are responsible for the "shoulders" of coverage that taper off on either side of the targeted region. To illustrate this point, panel (C) shows the same alignments as panel (B), except with the aligned sequence reads shown with their partner (paired-end) reads. Each sequenced DNA fragment is shown as a pink-red bar at one end and a blue bar at the other, connected by a gray line of DNA that was not sequenced. The central gray portion of each fragment is not sequenced because the DNA fragment is longer than the read length for the sequencing platform (e.g., a 300 bp fragment will be sequenced 100 bp from each end using the paired-end approach, leaving 100 bp of DNA in the middle that is not sequenced). It is also not included in the coverage assessment, as its sequence is not known. The shoulders of coverage (green-shaded boxes) are generated by fragments that partially overlap the target region and partially extend into the untargeted flanking intron. The extent of this "near-target" coverage depends on the length of the DNA fragments being sequenced; if the sequencing library is comprised of fragments that are narrowly distributed around the target size, the coverage will also be tightly centered on the targeted regions. In contrast, if many longer fragments are included in the sequencing library, the shoulders of coverage extend further into the untargeted intronic sequences. Hence, the distribution of coverage over and around an ROI depends both on the size of the target region and the length of nucleic acid fragments in the sequencing library. Figure provided by J.K. Sehn.


Coverage and Uniformity

Coverage is defined as the number of aligned reads that contain a given nucleotide position; uniformity refers to the variability in sequencing coverage across the target region. Sufficient depth and uniformity of coverage are critical for bioinformatic data analysis to achieve reliable identification of sequence variants. Factors that influence coverage include the nature of the ROI (i.e., its sequence complexity) and the method used for targeted enrichment. The depth of coverage required to make accurate variant calls is also dependent upon the type of variant being evaluated (e.g., single nucleotide variant (SNV), small insertion or deletion) and whether it is germ line or somatically acquired. In general, a lower depth of coverage is acceptable for constitutional testing, where germ line alterations are more easily identified since they are in either a heterozygous or homozygous state; 20–30× coverage with balanced reads (forward and reverse reads equally represented) is usually sufficient for this purpose [15,16]. However, much higher read depths are necessary to confidently identify somatic variants in tumor specimens due to cellular heterogeneity within the tumor (i.e., benign inflammatory cells, stromal cells, and parenchymal cells are present in addition to malignant cells) and the likely presence of subclonal tumor populations, both of which may lead to low allele fractions. Similarly, high coverage is needed to properly evaluate mosaicism in some constitutional disorders. For the analysis of tumor specimens, an overall coverage on the order of 400–1000× usually provides sufficient coverage for most coding exons, with the exception of hard-to-capture regions [17]. For NGS of mitochondrial DNA, an average coverage of greater than 20,000× is required to reliably detect heteroplasmic variants present at 1.5% [18]. A reasonably high level of coverage also helps ensure that the sequence analysis is free from allelic bias (preferential capture of one heterozygous allele over the other) [3].
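In practice, depth and uniformity are computed from aligned reads. The following sketch shows the core bookkeeping for a single target interval; real pipelines derive these numbers from BAM files (e.g., via samtools depth), and the toy alignments here are hypothetical.

```python
from collections import Counter

def depth_per_base(reads, roi_start, roi_end):
    """Count aligned reads overlapping each position of a target interval.
    reads: iterable of (start, end) alignments, 0-based half-open."""
    depth = Counter()
    for r_start, r_end in reads:
        for pos in range(max(r_start, roi_start), min(r_end, roi_end)):
            depth[pos] += 1
    return [depth[p] for p in range(roi_start, roi_end)]

def fraction_at_depth(depths, threshold):
    """Fraction of target bases covered at or above a given depth."""
    return sum(d >= threshold for d in depths) / len(depths)

# Hypothetical alignments over a 10 bp ROI
reads = [(0, 6), (2, 10), (2, 8), (5, 10)]
depths = depth_per_base(reads, 0, 10)
print(depths)                         # -> [1, 1, 3, 3, 3, 4, 3, 3, 2, 2]
print(fraction_at_depth(depths, 2))  # fraction of bases with >=2x coverage
```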

Specificity and Sensitivity

From the perspective of the ROI, the specificity of an NGS assay refers to the percentage of sequences that map to the intended target region, and the sensitivity refers to the percentage of target bases that are represented by one or more sequence reads. From this perspective, sensitivity of variant detection is dependent on the depth of coverage; in cancer samples, a precipitous drop in detection of low-frequency alleles is noted below 400-fold coverage at the majority of genomic positions [19]. Hybrid capture generates reads that extend beyond the target region into immediately adjacent sequence, resulting in "near-target" or "off-target" reads (also known as target region shoulders) (Figure 3.1). With this in mind, the precise definition of on/off-target delineation must be determined, as inclusion or exclusion of near- or off-target reads will influence the number of sequence reads determined to be on-target [2]. Coverage at the ends of targets may drop as fewer fragments are captured in these areas; therefore, baits may be designed to capture flanking intronic regions to ensure adequate coverage (Figure 3.1). An assay that has high specificity and uniformity generally requires fewer reads to generate adequate coverage for downstream analysis and thus will have an overall lower cost [1].
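A sketch of the on-/near-/off-target bookkeeping discussed above follows; the 250 bp near-target padding is an arbitrary choice used to illustrate how the chosen on/off-target definition shifts the reported specificity.

```python
def classify_reads(reads, targets, near_pad=250):
    """Bin alignments as on-, near-, or off-target for a single chromosome.
    reads/targets: (start, end) intervals, 0-based half-open."""
    def overlaps(read, start, end):
        return read[0] < end and read[1] > start

    counts = {"on": 0, "near": 0, "off": 0}
    for read in reads:
        if any(overlaps(read, s, e) for s, e in targets):
            counts["on"] += 1
        elif any(overlaps(read, s - near_pad, e + near_pad) for s, e in targets):
            counts["near"] += 1
        else:
            counts["off"] += 1
    return counts

targets = [(10_000, 10_200)]
reads = [(10_050, 10_150), (10_150, 10_400), (9_800, 9_900), (50_000, 50_100)]
print(classify_reads(reads, targets))  # -> {'on': 2, 'near': 1, 'off': 1}
# Reported specificity shifts depending on whether near-target reads count.
```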

Obstacles of Target Capture


Several notable obstacles, such as base composition (i.e., very high or low guanine-cytosine (GC) content) and high sequence homology (e.g., repeat elements and pseudogenes), result in inefficient capture and inadequate coverage of the target ROI.


These obstacles have particular relevance in clinical testing since many disease-associated targets are GC-rich, such as the promoter regions and first exons of many genes [20]. Targets comprised of approximately 50–60% GC content yield the highest coverage, while those with low (less than 30%) or high (greater than 70%) GC content have significantly decreased coverage [9,20,21]. This coverage bias is largely attributed to compression artifacts of the GC stretches occurring specifically during the PCR amplification step of library construction, and in lesser part to suboptimal hybridization of baits to GC-rich regions [20]. Capture efficiency of GC-rich regions has been shown to improve by using probes of longer length (about 120 bp) and adjusting melting temperatures [10]. Another approach (Ultramer, Integrated DNA Technologies, Coralville, IA) [22] employs individually synthesized 5′-biotinylated oligonucleotides that can be "spiked" into existing bait sets to target regions of poor coverage; because these Ultramer probes are individually synthesized, some properties (such as their annealing temperature) can be customized [17]. A major potential source of GC bias is the enrichment PCR performed at the posthybridization stage of library preparation. Protocols designed to improve the uniformity of this step incorporate prolonging the denaturation phase of each PCR cycle, modifying the primer-extension temperature, and using alternative Taq polymerases [20]. It is also possible to omit the enrichment PCR step entirely to avoid GC bias; however, this approach requires a larger amount of input DNA and is unlikely to be feasible in routine clinical testing where many samples have only limited material [21]. Alternatively, water-in-oil emulsion PCR can be used, which has been shown to generate bias-free library amplification in which the PCR product population is comparable to the original nonamplified preparation [10].

Regions of high sequence homology, such as repetitive elements, transposons, short tandem repeats, short interspersed elements (SINEs), long interspersed elements (LINEs), and pseudogenes, also pose significant difficulties for capture efficiency and sequence assembly [23]. The magnitude of the problem is suggested by the following: it is estimated that the human genome contains approximately 20,000 pseudogenes, which are complete or partial copies of genes that do not code for functional polypeptides [24]; greater than 8000 of these pseudogenes are considered "processed" and are formed through retrotransposition of mature RNAs into genomic DNA [25]; and a number of these processed pseudogenes have high enough sequence homology to interfere with testing of medically relevant genes via hybrid capture-based NGS approaches. When the region harboring repetitive sequence is longer than the read, no distinguishing information is present, which may result in incorrect read alignment (generating false positive or false negative variant calls), or the read may simply not be mapped (reducing sensitivity). Repetitive regions are also frequently structural variation hotspots, which further complicates assembly of reads derived from them. The limitations associated with regions of high sequence homology can be circumvented at the bait design stage [1,9]; exclusion of repetitive regions from probe design involves the use of programs such as RepeatMasker (Institute for Systems Biology) [26], which screens DNA sequence for interspersed repeats and low-complexity DNA sequences to produce a modified sequence query with annotated repeats excluded.
Physical blocking can also be used to minimize the impact of repeat elements; this is often accomplished through prehybridization addition of Cot-1 DNA, which is comprised of short fragments (50–300 bp) of human placental DNA enriched for repetitive sequences [1]. Longer reads and paired-end sequencing can also be employed to resolve some of the complications associated with highly repetitive sequences [23]. If the modifications above fail to improve coverage, Sanger sequencing is used to "backfill" the regions with high GC content that show low coverage. This backfilling is expensive, cumbersome, and consumes valuable sample DNA, but is nonetheless often required in hybrid capture-based NGS assays.
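Both obstacles can be screened for at the design stage. The sketch below flags candidate targets using the GC bounds quoted above (roughly 30–70%) and a repeat-content check; treating lowercase bases as repeat-masked follows the RepeatMasker soft-masking convention, and the thresholds and sequences are illustrative.

```python
def capture_risk_flags(seq, gc_low=0.30, gc_high=0.70, max_repeat_frac=0.5):
    """Flag a target sequence for predicted capture problems.
    Lowercase bases are assumed to mark repeats (RepeatMasker soft-masking)."""
    flags = []
    upper = seq.upper()
    gc = (upper.count("G") + upper.count("C")) / len(seq)
    if gc < gc_low or gc > gc_high:
        flags.append(f"extreme GC content ({gc:.0%})")
    repeat_frac = sum(base.islower() for base in seq) / len(seq)
    if repeat_frac > max_repeat_frac:
        flags.append(f"mostly repeat-masked ({repeat_frac:.0%})")
    return flags

print(capture_risk_flags("GCGCGGCCGCGGGCCGGGCC"))  # high GC -> flagged
print(capture_risk_flags("ATGCatgcatgcatgcATGC"))  # 60% soft-masked -> flagged
print(capture_risk_flags("ATGGCCATTGCAATGGCCAT"))  # balanced -> []
```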

Library Complexity

An important quality metric to consider after sequencing and read mapping is library complexity, which refers to the number of individual DNA molecules (and thus genomes or cells) represented in the sequencing library. It is important to recognize that complexity is difficult to accurately calculate (or even estimate) from morphologic assessment of patient specimens, since all the steps of library preparation involve inefficiencies that interact in complicated ways. Highly cellular specimens composed of viable cells typically produce an adequately complex DNA library. In contrast, paucicellular specimens have the potential to generate low-complexity DNA libraries, which are more likely to produce biased sequence results. One common way to measure library complexity is by quantitating the number of unique, on-target reads. Reads with different 5′ and 3′ termini are usually unique and thus arise from DNA of more than one genome (and thus more than one cell) undergoing random DNA fragmentation. Fragments with identical 5′ and 3′ termini (duplicates) almost universally represent PCR amplification bias and thus are commonly removed from downstream analysis.
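A minimal sketch of duplicate identification by shared fragment termini, as described above; clinical pipelines typically rely on established tools for this step (e.g., Picard MarkDuplicates), and the fragment records here are hypothetical.

```python
from collections import defaultdict

def mark_duplicates(fragments):
    """Group fragments by (chrom, start, end); keep one representative per
    group and flag the rest as presumed PCR duplicates."""
    groups = defaultdict(list)
    for frag_id, chrom, start, end in fragments:
        groups[(chrom, start, end)].append(frag_id)
    unique, duplicates = [], []
    for ids in groups.values():
        unique.append(ids[0])          # first seen is kept
        duplicates.extend(ids[1:])     # identical termini -> amplification copies
    return unique, duplicates

fragments = [
    ("r1", "chr4", 55_592_000, 55_592_300),
    ("r2", "chr4", 55_592_000, 55_592_300),  # same termini as r1 -> duplicate
    ("r3", "chr4", 55_592_010, 55_592_310),  # shifted ends -> distinct molecule
]
unique, dups = mark_duplicates(fragments)
print(len(unique), "unique;", dups, "flagged as duplicates")
# The count of unique on-target fragments estimates library complexity.
```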


HYBRID CAPTURE-BASED TARGET ENRICHMENT STRATEGIES

Solid-Phase Hybrid Capture

Hybridization capture methods for NGS arose from established microarray technologies. These platforms utilize high-density clusters of unique oligonucleotides bound to a solid substrate (Figure 3.2C).

[Figure 3.2 panel labels: (A) uniplex PCR (1 reaction = 1 amplicon), multiplex PCR (1 reaction = 10 amplicons), and Rainstorm microdroplet PCR (1 reaction = 4000 amplicons); (B) molecular inversion probes (10,000 exons; gap-fill and ligate across exons 1–3); (C) hybrid capture (>100,000 exons; adapter-modified shotgun library, solution hybridization, array capture, bead capture).]
FIGURE 3.2 Approaches to target enrichment. (A) In the uniplex PCR-based approach, single amplicons are generated in each reaction. In multiplexed PCR, several primer pairs are used in a single reaction, generating multiple amplicons. Microdroplet PCR allows for the use of thousands of primer pairs simultaneously in a single reaction. (B) In the MIP-based approach, probes consisting of a universal spacer region flanked by target-specific sequences are designed for each amplicon. These probes anneal at either side of the target region with gap closure by a DNA polymerase and ligase. (C) In the hybrid capture-based approach, adaptor-modified genomic DNA libraries are hybridized to target-specific probes either on a microarray surface or in solution. Background DNA is washed away, and the target DNA is eluted and sequenced. Reprinted by permission from Macmillan Publishers Ltd: Nature Methods [1], copyright 2010.


Originally developed for gene expression comparisons, microarrays were conveniently adapted for use as a capture platform [27–29]. The protocols combining microarrays and massively parallel sequencing are straightforward: genomic DNA is fragmented to about 250–600 bp via sonication or nebulization to generate a library; fragments are then repaired to obtain blunt ends and ligated with adapters/linkers containing universal primer sequences; and adapter-ligated DNA fragments are subsequently denatured and hybridized to a microarray of high-density probes. Nonhybridized fragments are washed from the array, and the captured fragments are then recovered by heat-based elution.

Initial studies describing on-array capture evaluated both long contiguous segments (entire gene loci) and numerous short discontiguous segments (individual gene exons) surrounding the BRCA1 and FMR1 genes, with total captured sequence ranging in size from 50 kb to about 5 Mb [27,29]. Whole genome amplification was followed by DNA fragmentation to about 300–500 bp for subsequent adapter ligation, amplification, and hybridization to the capture array. Both groups used customized NimbleGen arrays with roughly 385,000 probes covering the target ROI. Depending on the size of the target region, average enrichment generally ranged from 400- to 1000-fold and average coverage was about 8- to 20-fold. These studies demonstrated effective enrichment for a variety of sequences within the targeted region, with roughly 90% of reads mapping back uniquely to the genome (about 70% to targeted regions) and about 95% of targeted sequences containing at least one read. Replication experiments indicated that the reproducibility of array base calls was 99.98% and, by comparison with HapMap samples, that the accuracy was 99.81%. Specific modifications to protocols showed that coverage was more uniform without the precapture amplification step, that yields of downstream DNA sequencing were superior to those of noncaptured DNA sources, and that capture resulted in purification of unique sequences free from repeats and other impurities. Additional modifications to assay protocols, as well as further advances in array development yielding higher probe density, have further improved enrichment and sequence coverage.

Advantages of solid-phase hybrid capture approaches over amplification-based methods include the ability to efficiently enrich for large contiguous regions (of Mb size) as well as smaller dispersed targets (of kb size). Drawbacks associated with array-based capture include increased costs associated with the specialized platform (hardware and software). The DNA input required for solid-phase testing is considerably larger than for other targeted methods, which limits use on some clinical samples such as needle core biopsies or FNAs. Additionally, only a limited number of arrays can conceivably be processed per day, significantly limiting assay scalability and throughput [1]. Due to these limitations, solid-phase hybrid capture has been largely supplanted by solution-based methods in clinical laboratory testing.

Roche NimbleGen (Basel, Switzerland) and Agilent Technologies (Santa Clara, CA) are the two companies largely responsible for providing microarrays for the initial solid-phase hybridization studies. Two options were initially available from NimbleGen: arrays with approximately 385,000 and 2.1 million probes, which target up to 5 and 34 Mb of sequence, respectively. In comparison, Agilent arrays contain about 224,000 probes and cover a smaller target size.
As of the latter half of 2012, Roche has discontinued production of microarrays, although an agreement between Roche and Agilent has ensured that customers can still obtain quality microarray products for on-array hybridization through Agilent Technologies [30]. Design of a microarray for a clinical test occurs by choosing probes from a library optimized for given genes or by uploading custom sequences using the eArray online tool; alternatively, Agilent can assist custom probe design through its Custom Microarray Design Services [31]. Operationally, probe length is optimized from 25 to 60 bp, and oligonucleotide probes complementary to the target ROI are printed using inkjet technology, allowing for sequence flexibility. Probe density ranges from 1.9 K to 244 K probes per array, allowing for array multiplexing on a single slide.

Solution-Based Hybrid Capture

In an attempt to overcome the disadvantages associated with solid-phase hybrid capture, in-solution capture protocols were developed. Like solid-phase capture, solution-based methods use oligonucleotide probes to capture targeted ROIs from a DNA library. Where microarray enrichment uses an excess of DNA library template molecules over probes, in-solution capture employs an excess of probes over DNA library molecules, which drives the hybridization reaction to completion faster and with smaller quantities of the DNA library. Capture by this method typically involves biotinylated cDNA or cRNA baits [9]; during enrichment, genomic DNA is sheared and ligated with common flanking adapters, and the shotgun library is subsequently mixed in solution with the biotinylated baits, which hybridize to the target DNA.


After hybridization, streptavidin-coated magnetic beads are added, and a magnet is used to pull down the beads, which now carry DNA fragments from the targeted ROIs, while nontargeted DNA is washed away (Figure 3.2C). Captured fragments are then eluted from the beads and recovered for sequencing.

The first published description of in-solution hybrid capture, in 2009, involved synthesis of a complex pool of ultra-long oligonucleotides on an Agilent microarray, with subsequent cleavage from the array [9]. Each oligonucleotide consisted of a target-specific 170-mer sequence flanked by 15 bases of a universal primer sequence on each end to allow for PCR amplification. In vitro transcription in the presence of biotin-uridine triphosphate (UTP) was used to generate a single-stranded RNA hybridization bait for "fishing" target ROIs out of a "pond" of randomly sheared, adapter-ligated, and PCR-amplified total human DNA. A pilot set of 1900 randomly selected genes was chosen to avoid sampling bias with regard to length, repeat content, or base composition; 22,000 tiled baits were designed to target all 15,565 protein-coding exons of the selected genes without overlap or gaps in coverage. Together, these baits constituted 3.7 Mb and the target exons comprised 2.5 Mb; in routine use, 58% and 42% of the 85 Mb of captured, uniquely aligning human sequence mapped to baits and exons, respectively. Average coverage of target bases was 86-fold within the baits and 94-fold within coding exons, and specificity was estimated to be 82% when evaluating sequence within 500 bp of a bait and 48% with a strict "on-target" definition of targeted coding sequence only. After normalization, capture uniformity was determined to be as follows: 80% of bases within baits received at least one-half the mean coverage, 86% received at least one-fifth, and 5% were not covered, with excellent reproducibility. The high stringency and sensitivity were in part due to the use of single-stranded RNA baits as the capture agent. While areas containing extreme GC content and repetitive elements were problematic, and coverage gaps left room for improvement, overall the method was shown to be flexible, scalable, and efficient.

Solution-phase capture offers a number of advantages over solid-phase capture, including the ability to be performed in 96-well plates (which makes it more readily scalable) and no requirement for specialized equipment [1]. In settings where single-stranded RNA baits are used, which are synthesized in only one orientation, their high excess concentration drives efficient hybridization, which not only makes testing possible from lower input DNA quantities but also increases capture efficiency of smaller fragments [28]. DNA or RNA probes can be prepared in bulk quantities, simplifying quality control and enabling improved scalability at decreased cost [3,10]. The use of long capture probes reduces allelic bias below that seen with other methods. Finally, the approach can be applied to numerous short discontiguous targets as well as long contiguous regions, and either approach results in high specificity.

Several vendors make it easy for laboratories to purchase customized solution-based capture probes for laboratory developed tests (LDTs); several vendors also market predefined panels. Custom orders are made possible through free online software tools or through design service experts.
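To make the uniformity summary quoted for the pilot study concrete, the sketch below computes the fraction of targeted bases at or above one-half and one-fifth of the mean depth, plus the uncovered fraction; the per-base depths are hypothetical.

```python
def uniformity_metrics(depths):
    """Summarize capture uniformity in the style of the pilot-study report:
    fraction of bases at >=0.5x and >=0.2x of the mean depth, and uncovered."""
    mean = sum(depths) / len(depths)
    n = len(depths)
    return {
        "mean_depth": mean,
        "frac_ge_half_mean": sum(d >= 0.5 * mean for d in depths) / n,
        "frac_ge_fifth_mean": sum(d >= 0.2 * mean for d in depths) / n,
        "frac_uncovered": sum(d == 0 for d in depths) / n,
    }

# Hypothetical per-base depths across a small set of targeted bases
depths = [90, 85, 100, 40, 15, 0, 120, 95, 60, 0]
print(uniformity_metrics(depths))
```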
A few of the most widely used capture probe tests (and their related pertinent features) are listed in Table 3.1; in general, these panels are designed to work on a wide variety of sequencing platforms. Various exome and panel-specific kits will be described in greater detail below.

TABLE 3.1 Examples of Commonly Used Custom Solution-Based Target Hybrid Capture Products

Product | Free online tool | Service | Maximum # of probes | Amount of sequence
Roche NimbleGen SeqCap EZ Choice and EZ Choice XL | NimbleDesign (www.nimblegen.com/products/nimbledesign/index.html) | NimbleGen Certified Service Providers | 2.1 million | Up to 7 Mb for EZ Choice and up to 200 Mb for XL
Agilent SureSelect Custom DNA or RNA | SureDesign (www.earray.chem.agilent.com/suredesign/index.htm) | Agilent Design Service Team | 220,000 | Up to 24 Mb for DNA and up to 6 Mb for RNA
Illumina Nextera Rapid Capture Custom Enrichment Kit | DesignStudio (http://www.illumina.com/informatics/experimental-design/designstudio.ilmn) | Illumina Certified Service Providers | 67,000 | Up to 15 Mb

Molecular Inversion Probes

Molecular inversion probes (MIPs), otherwise known as padlock probes or selective circularization probes, combine either array-based or solution-phase target capture enrichment with amplification [32,33]. MIPs are single-stranded oligonucleotides (oligos) consisting of a common linker containing universal PCR primer binding sites flanked by target-specific probes (Figure 3.2B). The target-specific probes simultaneously hybridize to the same DNA fragment, forming a circular structure with the intended target captured between the probes; addition of polymerase and ligase results in gap filling and completion of the circular form by incorporation of the intervening target sequence. Library formation is performed by rolling circle amplification (RCA), or by cleaving the circular form, adding adapters, and performing standard multiplex PCR. By either approach, the amplicons can then be directly sequenced. Several slight variations of MIP technology exist, including Selectors (use of restriction enzymes to determine genomic fragmentation), Gene-Collector (probes specific to desired amplicons, reducing false positives), and CIPer (capture of longer genomic regions by extending the gap-fill reaction) [34–36].

MIPs have been used for a wide variety of clinical applications. Early studies focused primarily on genotyping of single nucleotide polymorphisms (SNPs) in association with linkage disequilibrium mapping [33,37]. Utility rapidly expanded to include human papillomavirus (HPV) genotyping and SNP detection for prediction of Neisseria gonorrhoeae antibiotic resistance [36]. MIP technology has also been used for comprehensive genotyping of loci involved in drug metabolism, excretion, and transport [38]. Furthermore, utilization of MIPs has extended to mutation discovery in various forms of cancer, including somatic SNVs, focal insertions and deletions (indels), copy number alterations, and loss of heterozygosity (LOH) [35,39,40]. More recent studies indicate a use for MIP-related applications in assessment of DNA methylation and RNA modification [41,42].

Advantages of MIPs include the lack of shotgun library preparation and relatively low DNA input (less than 100 ng in some settings) compared with other target capture methods [1]. A high overall specificity is achieved due to the use of targeted probe design. Since MIP reactions can be performed by both array-based and in-solution methods, they are scalable for high patient specimen throughput. The use of RCA achieves a reduction in polymerase errors when compared with standard PCR, and background noise is further reduced by the fact that these assays do not require whole genome amplification [40]. One of the main disadvantages of MIPs is poor capture uniformity. Early studies utilizing this method for capture of large proportions of human exomes attempted the use of 55,000 oligos to enrich 10,000 exons, but ultimately less than 20% of the targets were captured, and variant detection was impaired due to skewed sampling of heterozygous alleles [43]. Subsequent studies with protocol modifications yielded remarkable improvement in capture efficiency and allelic sampling, improving capture to over 90% [39,44]; while these modifications have demonstrated marked improvements, capture uniformity remains inferior to other capture methods.
Additionally, generation of oligonucleotides can be expensive when sizeable numbers are required to cover extensive target sets, although it is possible to reduce this expense by utilizing oligos synthesized on microarrays [43]. Some studies suggest the use of MIPs may be most relevant for testing involving a relatively small number of targets but a high sample number [1].

ParAllele Bioscience was one of the first companies to develop and streamline MIP assay technology. In 2005, Affymetrix (Santa Clara, CA) acquired ParAllele Bioscience and continues to provide MIP technology for pharmacogenetics and cancer specimen analysis. The OncoScan FFPE Express 2.0 Service [45], launched in the spring of 2011, interrogates greater than 335,000 loci over the entire genome for somatic changes, copy number changes, and LOH. Testing specifically covers over 200 tumor suppressor genes and oncogenes, with a median spacing of 1 probe per 0.5 kb for the top 10 "actionable" tumor suppressor genes, a median spacing of 1 probe per 2 kb for the top 190 actionable oncogenes, and a median backbone spacing of 1 probe per 9 kb. Since the assay requires 200 ng or less of input DNA, it is ideal for limited specimens or archived specimens with significant DNA degradation. Affymetrix also provides the DMET (Drug Metabolizing Enzymes and Transport) Plus Assay [46], which evaluates 1936 SNP, copy number, and indel markers across 231 genes. It provides 100% coverage of "core ADME genes" (32 genes) and 95% coverage of "core markers" (185 variants); additional common and functional variants in other genes associated with hepatic detoxification of xenobiotics and environmental toxins are also targeted.

Several companies, such as Integrated DNA Technologies (Coralville, IA) [47] and Eurogentec (Fremont, CA) [48], have extensive experience in creating custom MIP oligonucleotides. Finally, Agilent Technologies (Santa Clara, CA) recently released the HaloPlex assay [49], which fragments DNA with eight restriction enzymes and utilizes circular selector technology with tiled probe coverage for target enrichment; HaloPlex is available for custom test panels, for predesigned panels (e.g., cancer, cardiomyopathy, channelopathy), or as whole exome kits.


TABLE 3.2 Comparison of Performance Characteristics for Target Enrichment Methods

| Enrichment method | Pros | Cons | Sensitivity | Specificity | DNA input |
|---|---|---|---|---|---|
| On-array | Quicker and less laborious compared to amplification | Requires specialized equipment, limited daily throughput | High | Intermediate to high | High |
| Solution | No specialized equipment, improved scalability, highly efficient | Increased sample preparation time | Very high | High | Low to high |
| MIP | No shotgun library preparation, wide range of applications | Poor capture uniformity, costly for large target sets | High | Very high | Low to intermediate |
| Amplification | High specificity, resolution, and uniformity | Limited multiplexing, amplification bias, polymerase errors | Very high | Very high | Low |

MIP: molecular inversion probe.

Comparison of Targeted Hybrid Capture Enrichment Strategies

Many of the advantages and disadvantages of individual methods are summarized in Table 3.2. It is worth noting that several studies have been performed comparing the various target capture methods based on custom bait design [28,50,51]. In these studies, vendors were asked to design baits to cover specified portions of the genome, which were then evaluated on several performance parameters including on-target coverage, depth of coverage, uniformity, and analytical sensitivity. One study favored NimbleGen SeqCap array and SeqCap EZ solution enrichment over Agilent SureSelect solution enrichment based on higher proportions of reads mapping to the ROI and better coverage [50]. This result may be largely attributable to NimbleGen's superior design efficiency (80% of the ROI targeted by NimbleGen versus approximately 47% targeted by Agilent), since the Agilent SureSelect method performed well for more limited probe sets. Another study assessing a different ROI found the overall performance of the Agilent SureSelect solution enrichment method to be superior across the majority of assessed performance measures (e.g., on-target proportion, coverage) compared with NimbleGen SeqCap array and RainDance PCR [51]. Still other studies demonstrate no appreciable difference between Agilent and NimbleGen in-solution probes [1,2].

Not surprisingly, comparison of in-solution probe designs illustrates that a higher density of probes for a given ROI, such as the NimbleGen SeqCap solution design, provides higher coverage of on-target regions. Alternatively, longer baits, such as Agilent SureSelect RNA probes, increase hybridization efficiency and tend to capture problematic regions better. For all hybrid capture enrichment methods in all of these studies, GC-rich regions demonstrated lower coverage, and all probe designs tended to avoid regions of repetitive sequence for capture. Of note, postcapture sequencing was performed on Illumina and SOLiD platforms, emphasizing that the various enrichment strategies can be optimized for use with a variety of sequencing technologies. In summary, the most accurate assessment is that no one method is best, as all have inherent strengths and weaknesses [28], and individual laboratories must choose a method based on what is most suitable for the intended test goals and budget.

Amplification-Based Enrichment Versus Capture-Based Enrichment

While capture-based methods enrich for sequences of interest followed by limited amplification, amplification-based methods rely on exponential amplification of the ROI utilizing sequence-specific primers (Figure 3.2A). Amplification-based approaches are covered in detail in Chapter 4, but a few general comments are warranted here. Briefly, when compared with target capture enrichment, amplification methods tend to have superior specificity, resolution, and uniformity [1]. The simpler workflow, with reduced hands-on time and more rapid turnaround time (TAT), has also led to popularity within the clinical context [17]. On the other hand, amplification-based approaches have a limited target ROI because there remains a practical limit to the number of PCR reactions that can be multiplexed; additionally, the potential for amplification bias, polymerase sequencing errors, contamination, and primer binding artifacts all result in challenging quality assurance considerations. Amplicon-based methods are relatively cost-effective. Newer highly multiplexed microfluidic and microdroplet methods have substantial upfront hardware costs but can minimize other disadvantageous aspects of amplification-based methods by limiting PCR errors and increasing accuracy with sample pooling. Similar to targeted capture enrichment, several amplification enrichment systems (e.g., Fluidigm, RainDance, Illumina TruSeq, Ion AmpliSeq) have been optimized for compatibility with a variety of benchtop sequencing instruments (e.g., Ion Torrent, Illumina MiSeq), which makes amplification-based technology accessible to any size laboratory for clinical use. Final consideration of an enrichment method depends on a number of practical points, including how well matched each method is to the entire size of the intended ROI, the projected number of samples to be evaluated in a given time frame, and the possibility of multiplexing for optimal efficiency of sequencing throughput.

CLINICAL APPLICATIONS OF TARGET CAPTURE ENRICHMENT

Exome Capture

The entire protein-encoding DNA sequence, or exome, comprises about 1–2% of the human genome [3]. This includes about 180,000 exons across more than 20,000 genes and consists of roughly 30 Mb of sequence, an amount of sequence that is difficult to evaluate for individual clinically significant variants [52]. However, more than 85% of known disease-causing mutations occur within exons; therefore, the exome represents a highly enriched subset of the genome harboring the majority of significant variants [53]. Indeed, in specific circumstances whole exome sequencing (WES) has proven to be a useful strategy for facilitating diagnosis and guiding therapy [54–56].

One challenging aspect of exome sequencing is actually defining an exome. On the one hand, an exome can be limited to exons as defined by databases such as the consensus coding sequence (CCDS) project and UCSC Genome Browser [15,57]; however, other databases such as RefSeq and Ensembl include 5′ and 3′ UTRs as well as noncoding RNA in addition to exons [14]. Other differences in what constitutes the exome are due to uncertainty for some genes as to which sequences are truly protein coding, and which positions most appropriately represent start and end points for an encoded protein. Exome disparity is highlighted by the various commercial exome capture kits, which differ in the targeted ROI even when exome selection is based on use of the same reference databases. Additionally, these kits differ in bait length, bait density, and the type of molecule that is used for capture (DNA versus RNA), and so even an identical ROI produces differences in what is represented in the sequence files. Definition of the "exome" is further complicated by the fact that some potentially medically important genes are not included in the reference databases, which may lead to disparate positive or negative results depending on the exome reagent used in the test [16]. Emory Genetics Laboratory offers the Medical EmExome, which currently claims the highest coverage of any exome assay; this testing covers roughly 92% of the exome, with enhanced coverage of roughly 4600 medically relevant and known disease-associated genes [58]. Table 3.3 provides an overview of some commonly used exome capture kits as of 2014.

Given the acknowledged differences in "exomes," several groups have completed comprehensive comparisons of earlier versions of some of the more widely used whole exome solution target capture kits manufactured by Agilent, Illumina, and NimbleGen [14–16]. As noted above, exome size varied in these kits depending on genes targeted and the amount of flanking sequence, UTR, and noncoding RNA included, but all of these studies noted the importance of probe design with respect to capture efficiency, in that a higher density of probes targeting smaller genomic regions results in higher efficiency. The Agilent SureSelect Human All Exon Kit targets more genes and thus has better coverage for protein-coding regions; however, when including flanking regions, NimbleGen SeqCap EZ showed better capture efficiency. It was also noted that all platforms show similar bias against extremes of GC content and a minor yet consistent bias toward reference allele capture. Similarly, all demonstrate off-target enrichment products that map to repeat elements and regions of segmental duplication.
Finally, all methods were shown to be capable of producing high genotype sensitivity and accurate SNP calling at appropriate coverage (about 30×), with low false positive rates (less than 0.67%). Compared with high coverage (20×) WGS, which requires 200 Gb of raw sequence to cover greater than 95% of CCDS-annotated exons, WES at 20× coverage with roughly 20 Gb of raw sequence adequately covers 85% and 90% of CCDS exons by Agilent and NimbleGen, respectively. This indicates WES is able to provide high coverage of target regions represented in the CCDS annotations 10–20 times as efficiently as WGS, with loss of only about 5–10% of exons [16]. Overall, all of these exome platforms demonstrate a high level of target efficiency, adequate coverage and uniformity, and adequate detection of disease-associated variants and SNVs. The choice of capture reagent therefore often comes down to laboratory-specific criteria.
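The efficiency argument above is, at bottom, simple arithmetic relating raw yield, the usable fraction of bases, and target size. The sketch below makes this explicit; the usable fractions are illustrative values backed out from the figures cited above (they fold in off-target reads, duplicates, and the overshoot needed for uniformity), not measured constants, and the function name is our own.

```python
def mean_depth(raw_gb, usable_fraction, target_mb):
    """Approximate mean depth over a target region.

    raw_gb          total raw sequence generated (gigabases)
    usable_fraction fraction of raw bases that end up on-target, unique,
                    and well aligned (varies widely by assay and design)
    target_mb       size of the targeted region (megabases)
    """
    return raw_gb * 1e9 * usable_fraction / (target_mb * 1e6)

# Illustrative values consistent with the comparison above [16]:
print(mean_depth(200, 0.32, 3200))  # WGS: ~20x across a 3.2 Gb genome
print(mean_depth(20, 0.05, 50))     # WES: ~20x across a 50 Mb exome design
```

Run as written, both calls print roughly 20, matching the depths quoted above while using an order of magnitude less raw sequence for the exome.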


TABLE 3.3 Examples of Off-the-Shelf Exome Kits

| Kit | Enrichment method | Capture target size | # of protein-coding genes (exons) | DNA input | Sequencing instrument optimization |
|---|---|---|---|---|---|
| Agilent SureSelect Exon V5 | Solution | 50 Mb | 21,522 (357,999) | 3 µg | Illumina or SOLiD |
| Roche NimbleGen SeqCap EZ Human Exome v3 | Solution | 64 Mb | >20,000 | 1 µg | Illumina |
| Illumina TruSight One | Solution | 12 Mb | 4813 from HGMD, OMIM, and genetest.org | 50 ng | Illumina |
| Illumina Nextera Rapid Capture Exome | Solution | 37 Mb | 98.3% RefSeq, 98.6% CCDS, 97.8% ENSEMBL, 98.1% GENCODE (214,405) | 50 ng | Illumina |
| Illumina Nextera Rapid Capture Expanded Exome | Solution | 62 Mb | 95.3% RefSeq, 96.0% CCDS, 90.6% ENSEMBL, 91.6% GENCODE, >88% RefSeq 5′ and 3′ UTRs, >77% predicted microRNA (201,121) | 50 ng | Illumina |
| Agilent HaloPlex | MIP/PCR | 37 Mb | 21,522 (357,999) | 200 ng | Illumina and Ion Torrent |
| Ion AmpliSeq Exome | PCR | 33 Mb | >97% CCDS | 50–100 ng | Ion OneTouch and Proton |
It has been shown that about 20,000–25,000 variants are identified by WES of an individual, depending on patient ethnicity [52]. As expected, the number of variants will be higher for those ethnic groups underrepresented in the reference genome. A well-designed bioinformatic pipeline will assist in sifting through these data to determine which variants are likely to be clinically meaningful. Greater than 95% of identified variants will represent known polymorphisms, but even after their elimination, tens to hundreds of nonsynonymous variants still remain, each of which requires individual assessment for proper classification [53]. A large proportion of these variants will be novel, with insufficient evidence to support a substantial clinical interpretation, and hence will be classified as variants of unknown significance (VUS).

A well-known illustration of clinical WES aiding in diagnosis and treatment is provided by the case of a young boy with intractable life-threatening inflammatory bowel disease (IBD) [54]. After extensive clinical evaluation and targeted genetic analysis were unsuccessful in determining the cause of the IBD, WES identified an X-linked inhibitor of apoptosis (XIAP) gene variant. XIAP mutations are associated with hemophagocytic lymphohistiocytosis, and therefore the patient underwent allogeneic hematopoietic progenitor cell transplantation with resolution of his symptoms and remained in clinical remission at the time of the report. Another example is the use of WES for diagnostic purposes in persons with severe intellectual disability [59]. In developed countries, most severe forms of intellectual disability are thought to have a genetic cause, most likely due to de novo point mutations, and so in this study patients with previous extensive clinical and genetic evaluation underwent WES as an end point to current diagnostic strategies. Causal mutations in known intellectual-disability genes were identified in 16% of patients, information which proved to be of prognostic utility in clinical practice for clinicians, patients, and their families, and which also supported specific approaches to clinical management including dietary changes and antiepileptic therapy recommendations (see Chapter 17 for a more complete discussion of WES for constitutional disease).

On the other hand, while clinical WES for cancer specimens is possible, the approach comes with challenges. First, ideally, paired tumor–normal testing would be performed in order to distinguish between germ line and somatic variants, since comparison of samples allows for downstream removal of shared variants (which are generally considered germ line in most circumstances). However, the use of nontumor samples in this capacity is currently not reimbursed by insurance companies. Second, an important point to keep in mind in tumor–normal testing is that some germ line variants are associated with familial cancer syndromes, such as TP53 variants in Li–Fraumeni syndrome and APC variants in familial adenomatous polyposis, but a thorough personal and family medical history can usually identify whether a putative variant is germ line or represents a somatically acquired mutation. Third, the number of genes that harbor clinically actionable mutations (those with established predictive or prognostic value) from an exome study performed on a cancer specimen is currently quite small (on the order of a few dozen), highlighting the excess of data provided by WES from a purely clinical perspective. Fourth, as alluded to above, many mutations identified are VUSs, which are also of little direct clinical value. Given the latter two utility issues, reimbursement for WES of cancer specimens is quite low. Therefore, it is not surprising that most laboratories currently use either hotspot or selected gene panels for sequencing of clinical cancer specimens (see Chapter 20 for a more complete discussion of WES for oncology specimens).

The application of WGS and WES comes with the potential for identification of incidental (or secondary) findings, which by definition are results unrelated to the initial indication for sequencing but that may nonetheless be of medical value. An active debate continues on this topic. The American College of Medical Genetics and Genomics (ACMG) recently published recommendations for reporting of incidental findings in clinical exome and genome sequencing [60], which include reporting incidental findings on a minimum list of 24 diseases/conditions with likely medical benefit to the patient or their families. Because these recommendations do not take into account several important considerations such as data ownership, ethical and legal ramifications, and reimbursement, they have not been widely adopted. However, these recommendations do serve to highlight the fact that there is currently no regulation of incidental findings, and so each laboratory must develop its own policy regarding reporting of this class of variants [61].

In summary, at present, WES provides the most clinical utility in the setting of constitutional disease when used to test for germ line variants to facilitate accurate diagnosis in individuals with disorders that present with atypical manifestations or a variable phenotype, are difficult to confirm using clinical or laboratory criteria alone, or otherwise require extensive or costly evaluation [62,63]. Examples include disorders of nonsyndromic intellectual disability, nonsyndromic hearing loss, and mitochondrial dysfunction. In these (and other) specific clinical situations, WES may be more accurate, faster, and less expensive than conventional diagnostic procedures. In these settings, WES has the potential to improve patient management by revealing treatment options not previously considered, or by ruling out therapies that would have carried little benefit. However, WES is still quite costly and requires tedious analysis and interpretation of a vast amount of data, which prohibits wider clinical use, including use in clinical cancer testing (although this may change in the future as the cost of sequencing continues to drop, bioinformatic pipelines continue to improve, and more diseases are clearly genetically defined).
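The "sifting" described above is typically a sequence of simple filters. Below is a minimal sketch, assuming each call has already been annotated with a population allele frequency and a predicted consequence; the field names, thresholds, and example records are hypothetical, and a clinical pipeline would apply many additional criteria (quality filters, inheritance models, curated evidence).

```python
def triage(variants, max_pop_af=0.01):
    """Remove common polymorphisms and silent changes, then bin the rest."""
    candidates = [
        v for v in variants
        if v["pop_af"] < max_pop_af and v["consequence"] != "synonymous"
    ]
    # Calls without established disease evidence are VUS pending manual
    # review; calls with established evidence go forward for reporting.
    return {
        "reportable": [v for v in candidates if v["known_pathogenic"]],
        "vus": [v for v in candidates if not v["known_pathogenic"]],
    }

calls = [
    {"gene": "XIAP", "pop_af": 0.0, "consequence": "missense", "known_pathogenic": True},
    {"gene": "TTN", "pop_af": 0.2, "consequence": "missense", "known_pathogenic": False},
]
print(triage(calls))  # common TTN polymorphism filtered out
```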

Selected Gene Panels

Clinical testing of constitutional diseases with an established repertoire of associated genes is currently performed in many labs by targeted enrichment of gene panels rather than the whole exome. The clinical (phenotypic) and genetic heterogeneity of many inherited diseases has helped propel the use of gene panels for diagnostic purposes. If a patient presents with a well-defined clinical phenotype for which identification of a pathogenic variant is highly likely, a smaller disease-targeted panel (or even single gene testing) can be performed. However, if the disease classification is less certain, a more encompassing gene panel can provide for definitive diagnosis. Commercial target capture kits are available for numerous inherited disorders including cardiomyopathies, channelopathies, autism and/or developmental delay, connective tissue disease, epilepsy, neuropathy, mitochondrial diseases, hearing loss, and X chromosome-associated maladies (as discussed in more detail in Chapter 16).

Panels for acquired somatic disease (usually cancer) also focus on genes that are considered clinically actionable in that there is well-established literature providing evidence for their diagnostic, predictive, and/or prognostic value. The panels may be quite narrow based on the specific cancer being evaluated (e.g., colon adenocarcinoma, lung adenocarcinoma, gastrointestinal stromal tumor) [64,65] or much broader based on recurrently mutated genes across multiple cancer types [66]. Some laboratories offer "hotspot" assays that target a small number of loci in a limited gene set to detect mutations concentrated at specific positions or confined to a few exons (e.g., KRAS G12 and EGFR exon 19 and 20 mutations in colon cancer). A specific example of this assay type is the so-called ColoSeq Lynch and Polyposis Syndrome Panel [64,67], which evaluates individual exons commonly mutated within 19 genes known to be associated with Lynch syndrome and various forms of polyposis.


Alternatively, panels may be quite large and involve sequencing several hundred genes that are directly or indirectly involved in oncogenesis or response to therapy; one such assay for solid tumors captures the entire coding sequence of hundreds of genes and introns from additional genes that are frequently rearranged in cancer [66,68]. Most NGS assays for cancer target an ROI that includes at most a few dozen genes, such as the Comprehensive Cancer Gene Set assay that includes 42 genes, including commonly rearranged genes, across both solid tumor and hematologic malignancies [65,69].

Limited and more comprehensive assays each have distinct advantages and disadvantages. For clinical testing, limiting the number of targeted genes produces a focused interpretation that avoids an excessive number of distracting VUS, decreases incidental findings, and can provide a more rapid TAT. Assays that have a smaller ROI also often have better coverage and are easier to multiplex for improved efficiency. Another important advantage of NGS assays that target only loci with well-established relevance is higher rates of reimbursement (as discussed in Chapter 26), a difference that is critical in the clinical setting where testing is funded by insurance payers rather than research grants or philanthropy. In contrast, large panels are more likely to include genes relevant to clinical trials or drug development, and so have much more utility in investigational settings. In the end, panel design is determined by examining factors such as clinical need, expected sample volume, practicality of running multiple small disease-directed panels versus a single more general cancer-based panel, and sources of revenue.

Disease-Associated Exome Testing

Rather than developing an assay with custom reagents that targets only a limited ROI, some laboratories have implemented so-called disease-associated exome testing. This approach involves capture and sequencing of the entire exome with subsequent reporting of only the genes relevant to the particular disease in question [70]. A nice illustration of this approach is provided by NGS testing for inherited cardiomyopathies and channelopathies [71]. The clinical presentation of these diseases is highly variable with considerable genotypic and phenotypic overlap, and more than 50 disease-associated genes have been identified [72]. Sequence analysis of the exome, with bioinformatic selection and analysis of only the relevant genes, makes it possible for a lab to develop both a comprehensive panel and focused subpanels (e.g., a cardiologist could order a larger "cardiac sudden death" panel or a more limited "hypertrophic cardiomyopathy" subpanel or "channelopathy" subpanel) [70]. By this approach, if the initial gene set is negative, evaluation of other gene sets can easily be performed (at significant cost savings) since additional bioinformatic analysis can be performed without the need to generate additional sequence. Utilization of clinically designed medical exomes which provide high coverage, as discussed above, is ideal for this type of application.

There are numerous advantages to disease-associated exome testing. The approach mitigates the costs associated with the need to maintain an up-to-date target region, as well as the extensive validation associated with constant assay content changes. By validating the exome once upfront, labs are able to eliminate the time, effort, and expense of having to revalidate an assay every time clinical panel content is revised; as new gene–disease associations are discovered, and known associations are reclassified, only a minor bioinformatics change is required to introduce additional genes into the assay panel. Compared with WES, disease-associated exome testing results in identification of fewer VUSs, simplifies interpretation, and supports a faster TAT. In addition, reinterpretation as medical knowledge grows and panels expand is made considerably easier, although the questions of how often this will need to be performed and who will pay need to be addressed.

One potential disadvantage of disease-associated exome testing is the lack of a laboratory standard for potential secondary findings. As discussed above, while recommendations have been suggested for mandatory reporting of secondary findings for WES, there is no current requirement for genes that must be included in the bioinformatics analysis regardless of the focus of a disease panel. Many genes which are known to have some disease association will not, by definition, be bioinformatically analyzed in a test performed for one particular disease, despite the fact that the sequence reads are present in the raw exome sequence file. Thus, laboratories will need to design individual policies to address secondary findings.
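The in silico masking at the heart of disease-associated exome testing can be expressed in a few lines. In the sketch below the gene sets are abbreviated placeholders rather than validated panel contents, and the variant records are invented; the point is only that reflexing from a subpanel to a broader panel is a data operation, not a new assay.

```python
# Panels are just gene sets applied in silico over one exome-wide
# variant call set. Gene lists are abbreviated placeholders.
PANELS = {
    "cardiac_sudden_death": {"MYH7", "MYBPC3", "SCN5A", "KCNQ1", "KCNH2"},
    "hypertrophic_cardiomyopathy": {"MYH7", "MYBPC3"},   # subpanel
    "channelopathy": {"SCN5A", "KCNQ1", "KCNH2"},        # subpanel
}

def restrict_to_panel(exome_variants, panel):
    """Report only variants in genes belonging to the ordered panel."""
    genes = PANELS[panel]
    return [v for v in exome_variants if v["gene"] in genes]

# Reflexing to a broader gene set needs no new sequencing, only a
# different mask over the same exome-wide calls:
variants = [{"gene": "SCN5A", "hgvs": "c.100C>T"},
            {"gene": "BRCA1", "hgvs": "c.68_69del"}]
print(restrict_to_panel(variants, "hypertrophic_cardiomyopathy"))  # []
print(restrict_to_panel(variants, "channelopathy"))  # the SCN5A call
```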

VARIANT DETECTION

As with all NGS testing, after the sequencing reads are generated, bioinformatics tools are used to align the reads against a reference genome and identify differences between the patient's sequence and the reference. For maximum clinical utility, the bioinformatic pipeline should be designed to detect all four classes of genomic variants at allele frequencies that are physiologically relevant. It is worth emphasizing that the four main classes of variants (i.e., SNVs, insertions and deletions (indels), copy number variants (CNVs), and structural variants (SVs)) each require different computational approaches for sensitive and specific identification from targeted capture NGS. Various informatics pipelines are known to yield different calls for the various classes of variants, and even for specific variants, highlighting the importance of validation and optimization of specific pipelines for particular tests in routine clinical use [73].

SNVs are the most straightforward to detect since they differ from the reference genome at a single nucleotide position [19,74]. Smaller indels can be readily detected if contained within the length of an individual sequence read, especially with the use of paired-end reads. DNA fragment size will determine the expected bp range between mated reads in paired-end sequencing, and a significant alteration in the expected distance between mapped pairs can be used to suggest the presence of a longer indel. However, detection of longer indels (greater than about 100 bp) requires more sophisticated analytic tools for reliable identification [19,75].

CNVs are more challenging to detect in a selected gene panel with short-read target capture methods but are more readily identified with exome or genome sequencing. By definition, CNVs are not changes in sequence but changes in the number of reads, and are usually identified as variations in the depth of coverage over limited regions in the overall ROI. Thus, copy number analysis generally requires the use of a bioinformatic tool capable of normalization to account for random coverage variation, and several tools rely on SNV genotypes across the target region to substantiate CNV calls [76]. Smaller gene panels that contain a limited target ROI can hinder normalization analysis compared with tests based on WES or WGS [77].

SVs include translocations, inversions, and other large-scale chromosomal rearrangements. These alterations also pose a challenge for detection by NGS. Most breakpoints for inter- and intrachromosomal rearrangements are not typically included in the targeted ROI of most NGS panels or exome kits because they occur within highly repetitive noncoding (intronic or intergenic) sequence, and thus these regions are both difficult to capture and difficult to map properly to the reference genome. However, targeted gene panels for cancer can be successfully designed to include capture of introns commonly associated with translocation breakpoints, and several SV detection tools have been developed that rely on reads that either span the breakpoint (discordant read pairs mapping to two separate chromosomes) or that contain the breakpoint itself (split reads where the reads map partially to both sides of the rearrangement) [77,78]. More information regarding bioinformatic pipelines and variant detection is located in Chapters 7–11.
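As a toy illustration of the depth-normalization idea behind CNV calling, the sketch below scales per-target depths by each sample's median and then compares against the median of a small panel of normals. The thresholds, data, and function names are invented, and production tools additionally model GC bias, batch effects, and other systematic variation [76].

```python
import math
import statistics

def normalized(depths):
    """Scale per-target depths by the sample median so that samples with
    different total sequencing yields become comparable."""
    m = statistics.median(depths.values())
    return {t: d / m for t, d in depths.items()}

def log2_ratios(sample, normals):
    """Per-target log2 ratio of sample depth to the median normal depth."""
    norm_sample = normalized(sample)
    norm_normals = [normalized(n) for n in normals]
    return {
        t: math.log2(norm_sample[t] / statistics.median(n[t] for n in norm_normals))
        for t in sample
    }

# A heterozygous single-copy deletion is expected near log2(1/2) = -1;
# a single-copy gain near log2(3/2), roughly +0.58.
sample = {"ex1": 250, "ex2": 130, "ex3": 260}
normals = [{"ex1": 300, "ex2": 310, "ex3": 290},
           {"ex1": 200, "ex2": 205, "ex3": 195}]
print(log2_ratios(sample, normals))  # ex2 stands out near -1
```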

PRACTICAL AND OPERATIONAL CONSIDERATIONS

Workflow and TAT

As mentioned previously, most of the presequencing protocols and workflows are similar for the various hybrid capture strategies, yet several additional key points should be considered. Ease of use and actual hands-on technician time are relevant factors that may influence the selection of one particular method over another, as are scalability and the potential for automation. Specimen volumes impact the relevance of all these factors. Solution-based and MIP methods are easily scalable to high volume testing such as 96-well plates, while array-based methods are often limited by the number of samples that can be processed by a technician in a single shift (generally less than 24) [1]. Automation usually requires specialized equipment and is thus associated with an initial capital investment, but automation not only reduces hands-on technician time but also often results in standardization and thus less sample variability.

For hybrid capture enrichment, turnaround time (TAT), the interval from sample receipt in the lab to reporting of results, is largely dependent on a few key steps in the sequencing pipeline. Since there is a baseline cost to operating the sequencing instrument, to be more cost-effective laboratories often batch samples to reduce individual test cost, but there is a balance between cost-effective batching and providing a reasonably acceptable TAT for results. DNA library preparation for targeted hybrid capture methods generally takes 3–5 days, compared with 1 day for amplicon-based enrichment library preparation, with a large proportion of this time allocated to probe hybridization (24–72 h). The time to complete the actual sequence runs themselves is highly variable, even within a single instrument. For example, using Reagent Kit v3, the Illumina MiSeq instrument can generate 3.3–3.8 Gb of sequence (2 × 75 bp reads) in 24 h or 13.2–15 Gb of sequence (2 × 300 bp reads) in 65 h [79]; the Illumina HiSeq 2500 [80] is capable of sequencing an exome in 27 h when in "rapid-run mode." The bioinformatics and interpretive component of testing has emerged as a source of protracted TATs given the ease with which massive sequence can be generated on the current generation of platforms. Thus, overall TAT is often directly proportional to the size of the ROI being sequenced and the number of variants detected. Currently, clinical exome sequencing generally takes approximately 3–6 months for completion, while targeted panels with tens to hundreds of genes can be returned within 2–6 weeks [81]. Table 3.4 provides a quick overview of these practical workflow considerations.

TABLE 3.4 Practical and Operational Considerations of Target Enrichment Methods

| Enrichment method | Ease of use | Scalability | Multiplexity^a | Cost |
|---|---|---|---|---|
| On-array | Intermediate | Moderate | 10⁴–10⁵ | Intermediate |
| Solution | High | Excellent | 10⁵–10⁶ | <10 samples, intermediate; >10 samples, low |
| MIP | High | Excellent | 10⁴–10⁵ | <10 samples, high; >100 samples, low |
| Amplification | Low | Moderate | 10²–10³ | High |

^a Multiplexity refers to the number of discontiguous sequences that can be simultaneously targeted.
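The batching trade-off described above can be made concrete with a back-of-envelope calculation. In the sketch below, the run yield, usable fraction, and depth requirement are illustrative assumptions rather than instrument specifications, and the function name is our own.

```python
def samples_per_run(run_yield_gb, target_mb, required_depth, usable_fraction):
    """How many samples can share one run while each still reaching the
    required mean on-target depth?"""
    per_sample_bases = required_depth * target_mb * 1e6 / usable_fraction
    return int(run_yield_gb * 1e9 // per_sample_bases)

# e.g., a 1 Mb hybrid capture panel at 500x mean depth, assuming 40% of
# raw bases are usable, on a run yielding 15 Gb:
print(samples_per_run(15, 1, 500, 0.4))  # 12
```

Larger batches lower the per-sample cost but delay reporting until the batch fills, which is exactly the TAT tension described above.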

CONCLUSIONS

Although WGS provides the most comprehensive and least biased method of NGS, bioinformatics constraints and cost currently make targeted NGS, whether amplification-based or hybrid capture-based, the more practical choice for clinical use. Amplification-based methods have excellent specificity and uniformity; hybrid capture methods make it possible to more easily capture larger target regions in a single assay. While WES has proven useful for diagnosis and management in specific clinical situations, selected gene panels have more utility at lower cost. The availability of benchtop sequencers has greatly simplified NGS testing (and markedly lowered its cost) and has made NGS approaches more readily available for integration into routine patient care in a wider range of clinical laboratories. Nonetheless, regardless of the enrichment method, assay development requires meticulous design, recognition of the strengths and limitations of individual methods, rigorous assay validation, and optimization of the bioinformatics pipeline used to identify and interpret variants.

References

[1] Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods 2010;7(2):111–8.
[2] Mertes F, Elsharawy A, Sauer S, van Helvoort JM, van der Zaag PJ, Franke A, et al. Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief Funct Genomics 2011;10(6):374–86.
[3] Turner EH, Ng SB, Nickerson DA, Shendure J. Methods for genomic partitioning. Annu Rev Genomics Hum Genet 2009;10:263–84.
[4] Spencer DH, Sehn JK, Abel HJ, Watson MA, Pfeifer JD, Duncavage EJ. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J Mol Diagn 2013;15(5):623–33.
[5] Karnes HE, Duncavage EJ, Bernadt CT. Targeted next-generation sequencing using fine-needle aspirates from adenocarcinomas of the lung. Cancer Cytopathol 2014;122(2):104–13.
[6] Gallagher SR, Desjardins PR. Quantitation of DNA and RNA with absorption and fluorescence spectroscopy. Curr Protoc Protein Sci 2008;[Appendix 3: Appendix 4K].
[7] Wickham CL, Sarsfield P, Joyner MV, Jones DB, Ellard S, Wilkins B. Formic acid decalcification of bone marrow trephines degrades DNA: alternative use of EDTA allows the amplification and sequencing of relatively long PCR products. Mol Pathol 2000;53(6):336.
[8] Reineke T, Jenni B, Abdou MT, Frigerio S, Zubler P, Moch H, et al. Ultrasonic decalcification offers new perspectives for rapid FISH, DNA, and RT-PCR analysis in bone marrow trephines. Am J Surg Pathol 2006;30(7):892–6.
[9] Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 2009;27(2):182–9.
[10] Querfurth R, Fischer A, Schweiger MR, Lehrach H, Mertes F. Creation and application of immortalized bait libraries for targeted enrichment and next-generation sequencing. Biotechniques 2012;52(6):375–80.
[11] Maricic T, Whitten M, Pääbo S. Multiplexed DNA sequence capture of mitochondrial genomes using PCR products. PLoS One 2010;5(11):e14004.
[12] Hodges E, Rooks M, Xuan Z, Bhattacharjee A, Benjamin Gordon D, Brizuela L, et al. Hybrid selection of discrete genomic intervals on custom-designed microarrays for massively parallel sequencing. Nat Protoc 2009;4(6):960–74.
[13] Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, et al. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci USA 2007;104(49):19428–33.
[14] Clark MJ, Chen R, Lam HY, Karczewski KJ, Euskirchen G, Butte AJ, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 2011;29(10):908–14.
[15] Asan, Xu Y, Jiang H, Tyler-Smith C, Xue Y, Jiang T, et al. Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol 2011;12(9):R95.
[16] Parla JS, Iossifov I, Grabill I, Spector MS, Kramer M, McCombie WR. A comparative analysis of exome capture. Genome Biol 2011;12(9):R97.
[17] Hagemann IS, Cottrell CE, Lockwood CM. Design of targeted, capture-based, next generation sequencing tests for precision cancer therapy. Cancer Genet 2013;206(12):420–31.
[18] Cui H, Li F, Chen D, Wang G, Truong CK, Enns GM, et al. Comprehensive next-generation sequence analyses of the entire mitochondrial genome reveal new insights into the molecular diagnosis of mitochondrial DNA disorders. Genet Med 2013;15(5):388–94.
[19] Spencer DH, Tyagi M, Vallania F, Bredemeyer AJ, Pfeifer JD, Mitra RD, et al. Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J Mol Diagn 2014;16(1):75–88.
[20] Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 2011;12(2):R18.
[21] Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods 2009;6(4):291–5.
[22] Integrated DNA Technologies [May 2014]. Available from: <http://www.idtdna.com/pages/products/dna-rna/ultramer-oligos>.
[23] Sipos B, Massingham T, Stütz AM, Goldman N. An improved protocol for sequencing of repetitive genomic regions and structural variations using mutagenesis and next generation sequencing. PLoS One 2012;7(8):e43359.
[24] Torrents D, Suyama M, Zdobnov E, Bork P. A genome-wide survey of human pseudogenes. Genome Res 2003;13(12):2559–67.
[25] Zhang Z, Harrison PM, Liu Y, Gerstein M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res 2003;13(12):2541–58.
[26] Institute for Systems Biology [May 2014]. Available from: <http://www.repeatmasker.org>.
[27] Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods 2007;4(11):903–5.
[28] Teer JK, Bonnycastle LL, Chines PS, Hansen NF, Aoyama N, Swift AJ, et al. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res 2010;20(10):1420–31.
[29] Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME. Microarray-based genomic selection for high-throughput resequencing. Nat Methods 2007;4(11):907–9.
[30] Agilent microarrays [May 2014]. Available from: <http://www.genomics.agilent.com/en/CGH-CGH-SNP-Microarrays/?pgid=AG-PG9&navAction=pop>.
[31] Agilent eArray microarray design tool [May 2014]. Available from: <https://earray.chem.agilent.com/earray/>.
[32] Nilsson M, Malmgren H, Samiotaki M, Kwiatkowski M, Chowdhary BP, Landegren U. Padlock probes: circularizing oligonucleotides for localized DNA detection. Science 1994;265(5181):2085–8.
[33] Hardenbol P, Banér J, Jain M, Nilsson M, Namsaraev EA, Karlin-Neumann GA, et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol 2003;21(6):673–8.
[34] Dahl F, Gullberg M, Stenberg J, Landegren U, Nilsson M. Multiplex amplification enabled by selective circularization of large sets of genomic DNA fragments. Nucleic Acids Res 2005;33(8):e71.
[35] Fredriksson S, Banér J, Dahl F, Chu A, Ji H, Welch K, et al. Multiplex amplification of all coding sequences within 10 cancer genes by Gene-Collector. Nucleic Acids Res 2007;35(7):e47.
[36] Akhras MS, Unemo M, Thiyagarajan S, Nyrén P, Davis RW, Fire AZ, et al. Connector inversion probe technology: a powerful one-primer multiplex DNA amplification system for numerous scientific applications. PLoS One 2007;2(9):e915.
[37] Hardenbol P, Yu F, Belmont J, Mackenzie J, Bruckner C, Brundage T, et al. Highly multiplexed molecular inversion probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res 2005;15(2):269–75.
[38] Daly TM, Dumaual CM, Miao X, Farmen MW, Njau RK, Fu DJ, et al. Multiplex assay for comprehensive genotyping of genes involved in drug metabolism, excretion, and transport. Clin Chem 2007;53(7):1222–30.
[39] Dahl F, Stenberg J, Fredriksson S, Welch K, Zhang M, Nilsson M, et al. Multigene amplification and massively parallel sequencing for cancer mutation discovery. Proc Natl Acad Sci USA 2007;104(22):9387–92.
[40] Wang Y, Cottman M, Schiffman JD. Molecular inversion probes: a novel microarray technology and its application in cancer research. Cancer Genet 2012;205(7–8):341–55.
[41] Zhang K, Li JB, Gao Y, Egli D, Xie B, Deng J, et al. Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human. Nat Methods 2009;6(8):613–8.
[42] Palanisamy R, Connolly AR, Trau M. Epiallele quantification using molecular inversion probes. Anal Chem 2011;83(7):2631–7.
[43] Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, et al. Multiplex amplification of large sets of human exons. Nat Methods 2007;4(11):931–6.
[44] Turner EH, Lee C, Ng SB, Nickerson DA, Shendure J. Massively parallel exon capture and library-free resequencing across 16 genomes. Nat Methods 2009;6(5):315–6.
[45] Affymetrix OncoScan FFPE Express 2.0 Service [May 2014]. Available from: <http://www.affymetrix.com/estore/esearch/search.jsp?pd=prod150008&N=4294967292>.
[46] Affymetrix DMET Plus Assay [May 2014]. Available from: <http://www.affymetrix.com/estore/esearch/search.jsp?pd=131412&N=4294967292>.
[47] Integrated DNA Technologies Custom DNA oligonucleotides [May 2014]. Available from: <http://www.idtdna.com/pages/products/dna-rna/custom-dna-oligos>.
[48] Eurogentec Custom Oligonucleotides [May 2014]. Available from: <http://www.eurogentec.com/oligonucleotides.html>.
[49] Agilent Technologies HaloPlex Assay [May 2014]. Available from: <http://www.genomics.agilent.com/en/HaloPlex-Next-Generation-PCR-/HaloPlex-Custom-Kits/?cid=cat100006&tabId=AG-PR-1067&Nty=1&Ntx=mode+matchall&Ntk=BasicSearch&N=4294967292+4294967234+4294967294+4294967244&type=baseSearch&No=0&Ntt=haloplex>.
[50] Bodi K, Perera AG, Adams PS, Bintzler D, Dewar K, Grove DS, et al. Comparison of commercially available target enrichment methods for next-generation sequencing. J Biomol Tech 2013;24(2):73–86.
[51] Hedges DJ, Guettouche T, Yang S, Bademci G, Diaz A, Andersen A, et al. Comparison of three targeted enrichment strategies on the SOLiD sequencing platform. PLoS One 2011;6(4):e18595.
[52] Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009;461(7261):272–6.
[53] Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet 2011;12(11):745–55.
[54] Worthey EA, Mayer AN, Syverson GD, Helbling D, Bonacci BB, Decker B, et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med 2011;13(3):255–62.
[55] Bainbridge MN, Wiszniewski W, Murdock DR, Friedman J, Gonzaga-Jauregui C, Newsham I, et al. Whole-genome sequencing for optimized patient management. Sci Transl Med 2011;3(87):87re3.
[56] Need AC, Shashi V, Hitomi Y, Schoch K, Shianna KV, McDonald MT, et al. Clinical application of exome sequencing in undiagnosed genetic conditions. J Med Genet 2012;49(6):353–61.
[57] Sulonen AM, Ellonen P, Almusa H, Lepistö M, Eldfors S, Hannula S, et al. Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol 2011;12(9):R94.
[58] Emory Medical EmExome [May 2014]. Available from: <https://genetics.emory.edu/egl/featuredtests/index.php/2139>.
[59] de Ligt J, Willemsen MH, van Bon BW, Kleefstra T, Yntema HG, Kroes T, et al. Diagnostic exome sequencing in persons with severe intellectual disability. N Engl J Med 2012;367(20):1921–9.
[60] Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med 2013;15(7):565–74.
[61] College of American Pathologists. Molecular Pathology Checklist [May 2014]. Available from: <http://www.cap.org/apps/docs/laboratory_accreditation/checklists/new/molecular_pathology_checklist.pdf>.
[62] Bick D, Dimmock D. Whole exome and whole genome sequencing. Curr Opin Pediatr 2011;23(6):594–600.
[63] Biesecker LG. Opportunities and challenges for the integration of massively parallel genomic sequencing into clinical practice: lessons from the ClinSeq project. Genet Med 2012;14(4):393–8.
[64] Pritchard CC, Smith C, Salipante SJ, Lee MK, Thornton AM, Nord AS, et al. ColoSeq provides comprehensive Lynch and polyposis syndrome mutational analysis using massively parallel sequencing. J Mol Diagn 2012;14(4):357–66.
[65] Cottrell CE, Al-Kateb H, Bredemeyer AJ, Duncavage EJ, Spencer DH, Abel HJ, et al. Validation of a next-generation sequencing assay for clinical molecular oncology. J Mol Diagn 2014;16(1):89–105.
[66] Frampton GM, Fichtenholtz A, Otto GA, Wang K, Downing SR, He J, et al. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat Biotechnol 2013;31(11):1023–31.
[67] University of Washington. ColoSeq - Lynch and Polyposis Syndrome Panel [May 2014]. Available from: <http://tests.labmed.washington.edu/COLOSEQ>.
[68] Foundation Medicine. Foundation One Test [May 2014]. Available from: <http://foundationone.com/learn.php-2>.
[69] Genomic and Pathology Services at Washington University in St. Louis. Comprehensive Cancer Gene Set [May 2014]. Available from: <http://gps.wustl.edu/services/ngs_cancer.php>.
[70] Rehm HL. Disease-targeted sequencing: a cornerstone in the clinic. Nat Rev Genet 2013;14(4):295–300.
[71] Teekakirikul P, Kelly MA, Rehm HL, Lakdawala NK, Funke BH. Inherited cardiomyopathies: molecular genetics and clinical genetic testing in the postgenomic era. J Mol Diagn 2013;15(2):158–70.
[72] Ackerman MJ, Priori SG, Willems S, Berul C, Brugada R, Calkins H, et al. HRS/EHRA expert consensus statement on the state of genetic testing for the channelopathies and cardiomyopathies: this document was developed as a partnership between the Heart Rhythm Society (HRS) and the European Heart Rhythm Association (EHRA). Europace 2011;13(8):1077–109.
[73] O'Rawe J, Jiang T, Sun G, Wu Y, Wang W, Hu J, et al. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med 2013;5(3):28.
[74] McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20(9):1297–303.
[75] Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25(21):2865–71.
[76] Xi R, Lee S, Park PJ. A survey of copy-number variation detection tools based on high-throughput sequencing data. Curr Protoc Hum Genet 2012;[Chapter 7: Unit 7.19].
[77] Abel HJ, Duncavage EJ. Detection of structural DNA variation from next generation sequencing data: a review of informatic approaches. Cancer Genet 2013;206(12):432–40.
[78] Abel HJ, Al-Kateb H, Cottrell CE, Bredemeyer AJ, Pritchard CC, Grossmann AH, et al. Detection of gene rearrangements in targeted clinical next-generation sequencing. J Mol Diagn 2014;16(4):405–17.
[79] Illumina MiSeq Desktop Sequencer Specifications [May 2014]. Available from: <http://www.illumina.com/systems/miseq/performance_specifications.ilmn>.
[80] Illumina HiSeq 2500 Sequencer Specifications [May 2014]. Available from: <http://www.illumina.com/systems/hiseq_2500_1500/performance_specifications.ilmn>.
[81] Johansen Taber KA, Dickinson BD, Wilson M. The promise and challenges of next-generation genome sequencing for clinical care. JAMA Intern Med 2014;174(2):275–80.

C H A P T E R

4

Amplification-Based Methods

Marina N. Nikiforova1, William A. LaFramboise2 and Yuri E. Nikiforov1

1Division of Molecular and Genomic Pathology, Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, PA, USA; 2Genomics Division of the Cancer Biomarker Facility, Shadyside Hospital, University of Pittsburgh Medical Center, Pittsburgh, PA, USA

O U T L I N E

Introduction 57
Principles of Amplification-Based Targeted NGS 58
    Sequencing Workflow 58
    Samples Requirements 59
Nucleic Acids Preparation 59
Primer Design for Multiplex PCR 60
Library Preparation and Amplification 61
Other Amplification-Based Target Enrichment Approaches 62
Comparison of Amplification- and Capture-Based Methods 62
Clinical Applications 64
Conclusion 66
References 66

INTRODUCTION

Despite tremendous recent advances in technology, whole genome sequencing is still associated with substantial cost and workload. Sequencing of subsets of genes or genomic regions is an alternative approach, usually performed either by hybrid capture or by amplification-based target enrichment methods [1]. Amplification-based next-generation sequencing (NGS) relies on polymerase chain reaction (PCR) amplification for enrichment of genomic regions of interest. It is a high-resolution, cost-effective method to interrogate patient samples for disease-specific genotype variants. It offers the distinct advantages of rapid laboratory turnaround time along with the generation of a limited data set that can be rapidly processed and analyzed without burdensome storage requirements. This is in contrast to the currently available PCR-based SNaPshot assay (Life Technologies, Foster City, CA) and mass spectrometry assays (Sequenom Inc., San Diego, CA), which interrogate small numbers of common mutations for which there is a priori knowledge, and also in contrast to whole genome and exome sequencing assays, which generate vast amounts of sequence but require complex, costly, and time-consuming processing and analysis.

Precisely designed and scalable targeted amplification-based sequencing allows rapid broad screening of disease-related genomic domains for established mutations, as well as for additional novel complex and heterogeneous changes, to provide identification of pertinent disease variants. It can be used on a variety of specimens, including formalin-fixed paraffin-embedded (FFPE) and frozen tumor tissues, fine needle aspirates or biopsy specimens, blood and bone marrow, amniotic or placental samples acquired for prenatal screening, and others.


Commercial targeted amplification-based sequencing assays for the detection of cancer gene mutations have recently been marketed to clinical laboratories for use in diagnostic screening panels of tumor suppressor genes, oncogenes, and “actionable” drug targets. These assays promise precise gene target enrichment and high sensitivity when coupled with recently developed, benchtop sequencing instruments capable of massively parallel sequencing (MPS) including the Ion Torrent Personal Genome Machine (PGM, Life Technologies Inc., Grand Island, NY), the MiSeq instrument (Illumina, Inc., San Diego, CA), and the 454 GS Junior sequencer (Roche Diagnostics Inc., Indianapolis, IN). Addition of barcodes to the DNA libraries (sample indexing) allows for concomitant sequencing of multiple samples across numerous targets on sequencing platforms, markedly increasing sample throughput while simultaneously reducing the overall cost per sample [2]. Furthermore, DNA sample requirement for targeted amplification-based sequencing is decreasing below that of other assays and is now approaching the nanogram to picogram range. PCR amplicon libraries typically represent the final enriched and amplified product for MPS, constituting multiplied replicates of targets captured in the selection process which then undergo sequencing, readout, and diagnostic evaluation.

PRINCIPLES OF AMPLIFICATION-BASED TARGETED NGS

Sequencing Workflow

Amplification-based NGS uses multiplex PCR amplification of DNA or cDNA for enrichment of the specific target regions of a genome that are selected for sequencing. It requires implementation of several steps to assure high quality sequencing, including sample adequacy, acceptable quantity and quality of isolated nucleic acids, selection of appropriate genetic targets and careful primer design, optimization of PCR conditions to yield specific products, efficient library preparation, and accurate sequencing. The workflow of amplification-based targeted NGS is illustrated in Figure 4.1. Isolated DNA is first amplified by multiplex PCR to generate amplicons of genomic regions of interest. If RNA is the starting template, it must first be reverse-transcribed to cDNA and then amplified. Next, the library is constructed by ligating PCR amplicons into platform-specific oligonucleotide adapters and adding sample-specific barcodes. The library is further enriched by clonal amplification using different approaches (e.g., emulsion PCR (emPCR) or cluster generation on a solid surface). Finally, the amplified library is sequenced by MPS.

FIGURE 4.1 Targeted amplification-based sequencing workflow: sample (FFPE, frozen tissue, blood, etc.) → nucleic acids (DNA or RNA/cDNA) → multiplex PCR of genomic regions of interest → library construction (adapters, barcodes) → clonal library amplification (emulsion PCR, cluster generation, etc.) → massively parallel sequencing.


Samples Requirements Targeted amplification-based NGS can be performed on a variety of samples, such as fresh or snap-frozen tissue, FFPE tissue, fine needle aspiration (FNA) samples, blood, bone marrow, buccal swabs, and cell pellets [3]. Fresh and snap-frozen specimens are considered sources of the highest quality DNA and RNA with respect to molecular size and sequence integrity. However, amplification-based sequencing also allows successful use FFPE specimens. FFPE tissue specimens represent a vast worldwide repository of annotated clinical samples prepared by a standardized method to allow for subsequent histochemical and immunochemical phenotype evaluation of tissue structural features. These tissues represent a uniquely valuable resource for target enrichment and sequencing studies if properly stored and maintained. Since the method of MPS includes the amplification and sequencing of short DNA fragments, this paradigm fits well with the fact that nucleic acids from FFPE samples typically undergo more fragmentation during processing compared with fresh or frozen preparations. However, there is a significant variation in tissue collection and processing protocols throughout laboratories including fixation solutions, fixation time, processing temperatures, and long-term storage conditions—all variables that significantly impact nucleic acid quality and may affect target capture and amplicon sequencing. For example, prolonged (.2448 h) fixation in 10% neutral buffered formalin adversely affects the quality of nucleic acids, decalcifying leads to extensive DNA fragmentation, and exposure to fixatives containing heavy metals (e.g., Zenker’s, B5, acid zinc formalin) may lead to inhibition of DNA polymerases and other enzymes essential for successful sequencing analysis. In addition, RNA is a less stable molecule than DNA and is easily degraded by a variety of ribonuclease enzymes that are replete within the cell and environment; therefore, freshly collected samples or frozen tissue are considered to be better specimens than FFPE for amplificationbased RNA sequencing. Another advantage of FFPE tissue specimens is in the ability to select, quantitate, and enrich the population of tumor cells for molecular testing. A representative H&E slide of the tissue can be reviewed by a pathologist to identify a tumor target and determine the purity of the tumor, including the proportion of neoplastic cells in a background of benign stromal and inflammatory cells in the area selected for testing. Then, manual or laser capture microdissection can be performed under the guidance of the H&E slide to enrich the tumor cell population. Several studies have indicated that erroneous results during MPS can be attributed to the specimen type analyzed. For example, systematic errors in sequencing of DNA from FFPE specimens have been reported. Specifically, cytosine deamination to uracil in FFPE-derived DNA results in C:G . T:A artifacts that may be incorrectly assigned as mutations [4]. Furthermore, A . G, G . T, and G . C transitions have also been identified in DNA derived from FFPE samples at low levels [5]. Other sequencing studies have indicated that data obtained from fresh and frozen tissue samples may also contain errors that are associated with specific preparatory steps commonly used in target enrichment and sequencing assays. 
The process of mechanical shearing used to induce incremental DNA fragmentation for NGS can itself be an independent source of base substitution errors that impact the fidelity of mutation detection. High-powered acoustic shearing induces C>A/G>T transversions, identified in sequencing data as low-allelic-frequency CCG>CAG mutations caused by DNA oxidation [6]; this oxidation-driven error mechanism produces 8-oxoguanine DNA lesions and is exacerbated when the same shearing method is applied to progressively lower DNA concentrations. The intricacies of sample preparation can therefore produce artifacts that masquerade as mutations at very low allelic fractions, emphasizing the need to establish sequencing cutoffs and to validate low-level mutations by other sensitive techniques.
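As a hedged illustration of the kind of post hoc screen such cutoffs imply, the sketch below flags low-allele-fraction C>A calls in a CCG context as possible oxidative (8-oxoG) artifacts. It is a minimal sketch, not a validated filter: the function name and the 5% cutoff are invented for illustration, and real assays must establish their own thresholds empirically.

```python
# Illustrative only: flag low-VAF C>A calls in a CCG context as possible
# 8-oxoG shearing artifacts. The 5% cutoff is an arbitrary example value,
# not a validated threshold.

def flag_possible_oxog(context: str, ref: str, alt: str,
                       alt_reads: int, depth: int,
                       vaf_cutoff: float = 0.05) -> bool:
    """Return True if a call matches the low-VAF CCG>CAG oxidation signature."""
    vaf = alt_reads / depth
    in_signature = context.upper() == "CCG" and ref == "C" and alt == "A"
    return in_signature and vaf < vaf_cutoff

# 12 variant reads at 600x depth (VAF 2%) in a CCG context -> suspicious
print(flag_possible_oxog("CCG", "C", "A", alt_reads=12, depth=600))  # True
```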

NUCLEIC ACIDS PREPARATION

Numerous methods, both manual and automated, are available for isolating DNA and RNA for targeted amplification-based sequence analysis. In general, the performance of sequencing depends on the purity, integrity, and concentration of the isolated nucleic acids, which can be assessed by several methods, including spectrophotometry (NanoDrop) and fluorometry (Qubit). Fluorometry utilizes fluorescent dyes specific to dsDNA, RNA, or ssDNA and is more sensitive than spectrophotometry. Most laboratories offer separate isolation protocols for DNA and RNA; however, isolation of total nucleic acids can be considered, particularly for small tissue samples, to minimize nucleic acid loss during separate isolation procedures. If RNA is the starting nucleic acid, cDNA is typically synthesized through reverse transcription for subsequent targeted sequencing analysis.
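As a simple worked example of these QC readouts, the sketch below converts spectrophotometer absorbance readings into a concentration estimate and purity ratios. It is a minimal sketch, not part of any instrument's software; the conversion factor of 50 ng/µL per A260 unit for dsDNA and the ~1.8 A260/A280 target for pure DNA are standard rules of thumb, and the example readings are invented.

```python
# Illustrative sketch: converting spectrophotometer absorbance readings
# into concentration and purity estimates for a dsDNA preparation.
# 1 A260 unit ~= 50 ng/uL for dsDNA (standard conversion factor).

def dsdna_concentration(a260: float, dilution_factor: float = 1.0) -> float:
    """Estimate dsDNA concentration in ng/uL from A260 absorbance."""
    return a260 * 50.0 * dilution_factor

def purity_ratios(a260: float, a280: float, a230: float) -> dict:
    """A260/A280 near 1.8 suggests pure DNA; a low A260/A230 suggests
    carryover of salts or organics that can inhibit downstream PCR."""
    return {"260/280": a260 / a280, "260/230": a260 / a230}

reading = {"a260": 0.25, "a280": 0.14, "a230": 0.12}   # example values
print(dsdna_concentration(reading["a260"]))            # -> 12.5 ng/uL
print(purity_ratios(**reading))
```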


PRIMER DESIGN FOR MULTIPLEX PCR

Multiplex PCR is a commonly used approach for amplification-based target enrichment. Targeted amplification-based sequencing has several strong advantages over whole genome and exome sequencing, or targeted sequencing by a hybrid capture approach: it requires a small amount of DNA (10–200 ng) as the starting template, can be performed on specimens with suboptimal DNA quality, is time- and cost-effective, and provides high depth of sequencing and straightforward data analysis. PCR assays are a mainstay of molecular pathology and represent the most convenient and cost-effective method for target selection and amplification using specimens with limited DNA and low-abundance targets. However, critical performance issues arise with pooling (multiplexing) of progressively larger numbers of PCR primers and reactions. Specifically, (i) amplification artifacts are introduced by polymerase editing mistakes during annealed oligomer extension, and (ii) thermal damage to genomic targets occurs during high-temperature cycling, modifying the native nucleic acid sequence [7]. In addition, reaction biases emerge from primer-dimer formation, substrate competition, and sequence-dependent differences in PCR efficiency [8]. The maximum achievable pooling using conventional PCR is estimated to be 10 targets [9]; however, next-generation sequencing approaches require a significantly larger number of primers in the multiplex reaction to achieve sequencing of large genomic regions. Therefore, one of the factors crucial for successful amplification-based target enrichment is primer design for multiplex PCR.

PCR amplification includes repetitive cycles of DNA denaturation, primer annealing, and sequence extension. The oligonucleotide primers are designed to be complementary to a known genomic sequence of interest. When designing amplification primers for multiplex PCR, several factors must be considered, including primer length (18–25 nucleotides), the melting temperatures (Tm) of the primers, which should be identical or within 1–2°C of one another, appropriate GC content (50–55%), and lack of primer cross-complementarity. In addition, regions with repetitive sequences, known germ line single nucleotide polymorphisms (SNPs), and regions with high homology should be avoided because they may reduce the efficiency of PCR amplification and create amplification bias. The most common type of amplification bias arises from unequal amplification of alleles due to sequence variation in the primer binding site [10]. Designed primers should therefore be checked against SNP databases (dbSNP at www.ncbi.nlm.nih.gov/SNP) or the 1000 Genomes Project (www.1000genomes.org) to ensure that primer binding sites do not contain highly variable SNPs. If binding site sequence variation is impossible to avoid, primers should be modified to include the possible nucleotide variations in the primer design. Primers also need to be checked against sequence databases (http://blast.ncbi.nlm.nih.gov/) to evaluate their specificity for the region of interest; this avoids amplification of pseudogenes and other regions with high sequence homology that may result in erroneous sequence alignment and generation of false positive calls [11,12].
A number of software programs are available to assist with primer design (e.g., Primer3: http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi and Primer-BLAST: http://www.ncbi.nlm.nih.gov/tools/primer-blast).

Highly multiplexed PCR permits amplification of thousands of short genomic sequences in a single tube and does not require a large amount of DNA. Depending on the platform, as little as 5–10 ng of DNA is sufficient for producing a high-complexity library. This approach has therefore been used successfully on samples for which only a limited amount of DNA is available (i.e., small tumor biopsies or FNA samples). However, it is necessary to understand that a very small tissue sample, and a correspondingly low amount of DNA (picograms), may misrepresent the cell composition of the specimen and affect library complexity by producing biased amplification of one cell population versus another (e.g., nonneoplastic vs. neoplastic cells). In addition, low DNA input can produce bias toward propagation of errors incorporated during early cycles of the PCR, mostly because no excess of DNA is available to compete with the erroneous sequence.

Replication errors can be reduced through the use of polymerases with 3′→5′ exonucleolytic proofreading and mismatch repair capabilities, but at the cost of slower extension rates and lower thermostability. For example, Pfu polymerase (from Pyrococcus furiosus) exhibits <2% of the errors of Taq polymerase (from Thermus aquaticus) but has a much lower elongation rate (~20 nt/s vs. 80 nt/s, respectively, at 72°C), increasing the exposure time for thermal damage [7]. Thermal modifications associated with PCR are characteristically reflected in depurination (A or G), deamination (C>U), and oxidation of G to 8-oxoG. Users should be aware of the potential for overrepresentation of these PCR-specific artifacts, which can be miscalled as genetic variants. At a minimum, failure to control for these errors during amplicon sequencing results in overestimation of sample diversity while reducing sensitivity for detection of true genetic variants [13].
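The rule-of-thumb constraints above lend themselves to simple automated checks. The sketch below is a minimal, illustrative screen of a candidate primer pool; the Tm estimate uses the rough Wallace rule, 2(A+T) + 4(G+C), rather than the nearest-neighbor thermodynamic models used by production tools such as Primer3, and the example sequences are invented.

```python
# Minimal sketch of rule-of-thumb primer QC for a multiplex pool.
# Wallace-rule Tm is a coarse approximation used here for illustration only.

def gc_content(seq: str) -> float:
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq: str) -> float:
    """Tm estimate: 2*(A+T) + 4*(G+C), valid only as a rough guide."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def check_primer(seq: str) -> list:
    """Apply the length and GC-content rules of thumb described above."""
    problems = []
    if not 18 <= len(seq) <= 25:
        problems.append("length outside 18-25 nt")
    if not 0.50 <= gc_content(seq) <= 0.55:
        problems.append("GC content outside 50-55%")
    return problems

def check_pool_tm(primers: list, max_spread: float = 2.0) -> bool:
    """Tm values in a multiplex pool should agree within ~1-2 C."""
    tms = [wallace_tm(p) for p in primers]
    return max(tms) - min(tms) <= max_spread

pool = ["AGGTCACTGCATGGACCTTA", "CCAGTGGATCGTACAGGCTA"]   # invented examples
for p in pool:
    print(p, check_primer(p) or "OK")
print("Tm spread acceptable:", check_pool_tm(pool))
```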


Another advantage of multiplex PCR is the amplification of relatively short genomic regions (80–150 base pairs), which allows successful sequencing of DNA and RNA of suboptimal quality, such as from FFPE tissue samples. However, sequencing of large consecutive genomic regions by multiplex PCR can create cross-reaction between primer pairs due to primer overlap and may therefore require separation of closely located primers into several multiplex pools (see the sketch below), as well as consideration of whether a capture-based method is better suited to the analysis. As with other amplification-based methods, targeted amplification-based MPS requires strict measures to avoid sample contamination with amplification products. Laboratories should physically separate the preamplification area for specimen processing and nucleic acid extraction from postamplification areas, develop a unidirectional workflow, and ensure decontamination of work surfaces.
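The pooling step can be framed as a simple interval-partitioning problem: each amplicon is assigned to the first pool in which it does not overlap a previously assigned amplicon, so overlapping neighbors in a tiled design land in different pools. This greedy sketch is illustrative only; the coordinates are invented, and real designs must also account for primer cross-complementarity, not just coordinate overlap.

```python
# Greedy pooling sketch: place each amplicon in the first pool where it
# does not overlap any amplicon already assigned to that pool.

def assign_pools(amplicons):
    """amplicons: list of (start, end) tuples; returns a list of pools."""
    pools = []
    for amp in sorted(amplicons):
        for pool in pools:
            # No overlap with anything already in this pool?
            if all(amp[0] >= a[1] or amp[1] <= a[0] for a in pool):
                pool.append(amp)
                break
        else:
            pools.append([amp])   # start a new pool
    return pools

tiled = [(100, 220), (180, 300), (260, 380), (340, 460)]   # invented tiling
for i, pool in enumerate(assign_pools(tiled), 1):
    print(f"pool {i}: {pool}")
# -> pool 1: [(100, 220), (260, 380)]; pool 2: [(180, 300), (340, 460)]
```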

LIBRARY PREPARATION AND AMPLIFICATION

As discussed in more detail in Chapter 1, preparation of a sequencing "library" for targeted amplification-based NGS involves ligation of PCR amplicons, generated either from DNA or cDNA, to platform-specific oligonucleotide adapters. For example, library preparation for the Ion Torrent™ sequencing platform involves repair of 3′ and 5′ ends, ligation to an adapter or barcoded adapter for sample multiplexing, and purification of adapter-DNA constructs (Figure 4.2). Library enrichment is then performed by the emPCR method [14], which requires separation of the adapter-modified fragment library into single strands and hybridization to beads that carry sequences complementary to the adapter sequences. Hybridization is performed under conditions that favor one library fragment per bead. The beads (with their bound DNA template) are then compartmentalized into water-oil emulsion microdroplets and subjected to PCR amplification; as a result, millions of copies of the clonally amplified target sequence are generated on the bead surface. Similarly, library preparation for Illumina sequencing platforms requires incorporation of adapters and unique sample-specific indices into the library fragments, which are later distributed onto a flow cell on a glass slide. Clonal amplification of individual fragments is then performed by isothermal bridge amplification, which generates millions of spatially separated template clusters providing free ends to which a universal sequencing primer can be hybridized to initiate the MPS reaction.

FIGURE 4.2 Target amplification by multiplex PCR and library construction using the Ion AmpliSeq™ (Life Technologies) approach: genomic targets are amplified using the Ion AmpliSeq™ primer pool, primer sequences are partially digested, and adapters or barcoded adapters (A, X, P1) are ligated to yield a nonbarcoded or barcoded library.


The most commonly used MPS platforms for amplification-based targeted sequencing are the Ion Torrent PGM (Life Technologies Inc., Grand Island, NY) and the MiSeq instrument (Illumina, Inc., San Diego, CA). They utilize different chemistries: semiconductor sequencing and sequencing-by-synthesis, respectively. Semiconductor sequencing detects the protons released as nucleotides are incorporated during synthesis of clonally amplified DNA, with a signal proportional to the number of incorporated bases [15]. The sequencing-by-synthesis approach utilizes fluorescently labeled reversible-terminator nucleotides on clonally amplified DNA templates immobilized on the surface of a glass flow cell [16].

PCR errors generated during amplicon selection may accumulate further during library enrichment and subsequent MPS. Therefore, the efficacy of amplicon sequencing for tumor variant detection is directly tied to the fidelity of PCR in amplicon library selection, replication, and production of the sequencing template library for final readout. To date, no direct comparisons of different multiplex PCR amplicon sequencing assays have been published, and the threshold for amplicon sequencing to detect mutations remains undefined. A study mixing cancer cell line DNA with tumor DNA reported a limit of mutation detection for amplicon sequencing of approximately 1–2% mutation abundance, due to the background noise of the assay system comprising errors of library preparation, sequencing reactions, and data processing [17]. This threshold is consistent with a recent report using human mitochondrial DNA amplicon libraries, in which known low-level single-base mutations could be detected in 2% of DNA templates at 500× base depth with a false discovery rate of <1% [18a]. However, recent studies have demonstrated that a significant number of errors are introduced by false priming events during the multiplex amplification step of library preparation, especially at low DNA inputs, errors that went unrecognized in these prior reports of accuracy [18b].
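The interplay between depth, background error, and detectable allele fraction can be illustrated with a toy binomial model. The sketch below asks how surprising a given number of variant-supporting reads would be if only background noise were present; the 0.5% background error rate is an assumed example value, and real assays require empirical validation of their detection limits.

```python
# Toy binomial model of a detection limit: is the observed number of
# variant-supporting reads explainable by background error alone?
from scipy.stats import binom

def detection_p_value(alt_reads: int, depth: int, error_rate: float) -> float:
    """P(>= alt_reads variant-supporting reads arising from noise alone)."""
    return binom.sf(alt_reads - 1, depth, error_rate)

depth = 500
background = 0.005            # assumed 0.5% per-base background error
for vaf in (0.01, 0.02, 0.05):
    expected = round(vaf * depth)
    p = detection_p_value(expected, depth, background)
    print(f"VAF {vaf:.0%}: ~{expected} reads, p = {p:.2e}")
# At this depth and noise level, a 1% variant is hard to distinguish from
# background, while 2% and above stand out -- consistent with the ~1-2%
# thresholds reported above.
```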

OTHER AMPLIFICATION-BASED TARGET ENRICHMENT APPROACHES

Other approaches for amplification-based target enrichment have been developed that do not use multiplex PCR. They employ large numbers of parallel but individual PCR reactions, thereby eliminating the artifacts associated with multiplexing PCR primers, substrates, and targets. The Fluidigm Access Array system (Fluidigm Corporation, South San Francisco, CA) was developed to reduce problems associated with multiplex PCR through an automated microfluidic platform that performs optimized PCR reactions in parallel using only nanoliter volumes. Barcodes can be added to process multiple specimens simultaneously, along with the adapter sequences required by next-generation platforms for downstream sequencing [19]. Of note, this platform also permits limited multiplexing of carefully designed primer pairs to increase the output of the microfluidic chips. An alternate approach is the RainDance ThunderStorm system (RainDance Technologies, Lexington, MA), which generates independent picoliter-volume droplets of fragmented DNA that are merged with aqueous primer pair droplets in an oil emulsion, forming massively parallel reactors that perform individual PCR reactions on as many as 4000 target sequences simultaneously [20]. Both platforms eliminate the PCR errors introduced by pooling reactions while generating a high proportion of target amplicons with a background signal equivalent to uniplex PCR [10]. These platforms require a substantial initial investment but provide an effective approach for limiting the systematic errors introduced by PCR during the enrichment phase of amplicon sequencing.

COMPARISON OF AMPLIFICATION- AND CAPTURE-BASED METHODS

When patient specimens and DNA are available in ample amounts (3–10 μg), hybridization capture may be used independently for target enrichment of potentially large genomic domains (1–50 Mb), simultaneously depleting extraneous sequences and reducing sample complexity for direct evaluation by MPS [21]. Hybridization methods utilize probes complementary to the genomic regions of interest, either affixed in a high-density, solid-phase array format or free in solution as biotinylated complementary DNA (cDNA) or complementary RNA (cRNA) probes that bind targets and are subsequently captured on streptavidin-coated magnetic beads [22]. These assays require fragmentation of DNA to allow probe accessibility and can work with FFPE samples, provided the prospective targets have not eroded to lengths smaller than the capture oligomers. The most effective probe designs incorporate a predetermined overlap or tiling (e.g., 2- to 4-fold redundancy per target) to capture regions containing disease-related sequence variations that alter the binding affinities predicted from normal sequence. The selected regions are translated directly into a DNA template library for sequencing once sufficient


target material is acquired. A drawback of hybridization methods, besides the high DNA requirement, is the capture of substantial "off-target" sequence, compromising overall specificity. Extraneous sequence can be identified and excluded during downstream bioinformatics analysis, but only after additional labor, reagent, and instrumentation costs are incurred. In comparison with amplification-based methods for target enrichment, which provide uniform coverage of the amplified regions, capture by hybridization may fail to capture regions of either high or low GC content [23,24]. This limitation may lead to omission of some alleles or sequence regions, with an associated decrease in the sensitivity of sequencing.

It is important to recognize that clinical specimens are generally provided to molecular pathology laboratories in large numbers for immediate processing with limited DNA availability (1–100 ng), and usually have a heterogeneous cellular composition with some genomic variants present in low abundance. Therefore, if a hybridization protocol is preferred, it typically requires subsequent amplification to achieve sufficient target levels for high sensitivity, as well as additional rounds of enrichment to achieve the specificity needed to detect rare genomic alterations. This is most efficiently accomplished with postcapture PCR protocols incorporating primer-driven amplification to generate a high-fidelity amplicon library at an adequate concentration for sequencing.

Solid-phase and solution hybridization technologies were among the first target enrichment assays commercially available for use with NGS systems, and several studies directly comparing their capabilities have been carried out [22,25]. Hybridization methods perform best with large amounts of DNA (>1 μg), rarely available from clinical samples, and secondary PCR amplification is typically required after capture to achieve adequate target sensitivity for variant analysis. A systematic comparison of microarray genomic selection (MGS: Roche NimbleGen Microarray-based Genomic Selection), solution hybrid selection (SHS: Agilent SureSelect Target Enrichment System using biotinylated RNA baits), and molecular inversion probes (MIPs) was performed in a study designed to capture an identical target region (2.6 Mb and 528 genes) using HapMap samples [23]; similarly, the efficacy of arrays (MGS: DNA input 4 μg), solution-based hybridization (SureSelect: DNA input 3 μg), and circularizing probes (MIPs: DNA input 1 μg) was evaluated after capture and limited PCR amplification followed by high-resolution genomic sequencing (Illumina GA-2 platform) [23,25]. These studies showed that the sequence results derived by each capture method compared favorably for homo- and heterozygous genotype concordance (>99.5%) with whole genome sequencing at 30× depth, with MGS displaying the highest genotype sensitivity among the three methods. Both NimbleGen arrays and SureSelect solution hybridization demonstrated comparable depth and uniformity of coverage, exceeding that of the MIPs. Each assay comparably detected substitutions and small insertion/deletion variants of up to four bases, indicating that these types of variants are amenable to hybridization-based genomic enrichment. These findings were reinforced in a recent study by the DNA Sequencing Research Group, with the rejoinder that a small number of SNPs were detected exclusively by SureSelect SHS and NimbleGen MGS [22].
Selective circularization probes capture genomic target regions by hybridizing to DNA sequences flanking the target domain [21,26]. MIPs contain two flanking probes of 18–25 bases joined by a sequence (35–50 bases) containing universal PCR primer binding sites. Both probes simultaneously bind to noncontiguous DNA fragments, forming a circular structure with the cognate target domain captured between the ends of the probes; gap-filling and ligation reactions close the circle while creating a replicate of the target domain. A library of targets for MPS is subsequently generated by isothermal rolling circle amplification (RCA) or standard PCR, and this process of capture and linear RCA reduces the polymerase errors and thermal modifications that can occur during standard PCR reactions. Several types of ligase-assisted DNA circularization padlock probes and reactions can be utilized to interrogate a DNA fragment library (e.g., "selector" probes employ a cocktail of restriction enzymes to control genomic fragmentation) [27]. Circularizing probes provide an ingenious solution-based capture technology with high specificity, but their use to date has been limited by restricted probe availability, relatively high cost, and overall poor capture uniformity [2].

Agilent Technologies has recently released the HaloPlex assay, which employs circular selector technology with tiled probe coverage for target enrichment using small amounts of starting DNA (200 ng), including from FFPE specimens [28]. In this approach, genomic DNA is digested by eight separate restriction enzymes, creating fragment end sequences that flank the target regions of interest and are complementary to the ends of the selector probes [27]. The 100–1000 base restriction fragments are captured by hybridization with the biotin-labeled selector probes (about 100 bases long), which also contain adapters, barcodes, and universal primer binding sites in their connector domain for PCR amplification and NGS. In contrast to MIP probe circularization, the genomic DNA fragments form circular structures upon hybridization with the ends of the selector probes [28]. These target DNA fragments are then amplified by standard multiplex PCR driven by universal primer binding sequences


in the connector domain of the selector probes to generate the enriched templates required for MPS. Carrying out sample indexing at the time of capture with this method allows paired samples to be treated identically throughout the entire selection, amplification, and sequencing process. Consequently, templates generated during MPS from tumor and normal specimens can be directly compared for the presence of disease-specific genomic gains and losses using the HaloPlex approach.

CLINICAL APPLICATIONS

Targeted NGS is commonly used to test for inherited mutations in genetic diseases and for acquired mutations in oncology applications, including cancer diagnosis, prognosis, and prediction of response to chemotherapy. Sequencing of single genes or portions of genes that carry specific mutations is cost- and time-efficient, and the introduction of benchtop MPS instruments has made targeted sequencing accessible to clinical laboratories in major medical centers. Multigene panel sequencing has been used successfully to analyze genes known to be related to particular genetic disorders, including mitochondrial disorders [29], cardiomyopathies [30], and hearing loss [31]. In oncology (Figure 4.3), targeted amplification-based sequencing is applied for the detection of individual mutations in cancer-related genes that may assist in cancer diagnosis, have prognostic value, or be used to predict response to targeted therapy [32–35].

FIGURE 4.3 Detection of sequence variants by targeted amplification-based NGS. A custom gene panel (ThyroSeq) for thyroid cancer detection on the Ion Torrent PGM (Life Technologies) demonstrated the presence of a rare type of BRAF mutation in a thyroid FNA sample derived from a nodule with papillary thyroid carcinoma: a deletion of BRAF codons 600–603 (BRAF p.V600_R603del) and a point mutation in codon 599 (BRAF p.T599I). Of note, both mutations are present on the same allele.


Predesigned cancer gene panels are commercially available in dedicated kits for amplicon sequencing, including the Ion AmpliSeq™ Ready-to-Use Cancer Panels (Life Technologies), the TruSeq® Amplicon Cancer Panel (Illumina), and the NimbleGen SeqCap EZ Comprehensive Cancer Panel (Roche). In addition, custom gene panels can be designed for use on different sequencing platforms. The AmpliSeq and TruSeq panels rely on multiplex PCR for both target selection and amplification, while the NimbleGen SeqCap system combines solution hybridization capture with secondary PCR amplification to generate amplicon libraries. Alternatively, targeted enrichment assays independent of these sequencing platforms have been developed that incorporate primer-based selection and PCR amplification using microfluidic arrays (Fluidigm Corporation) or microdroplet technology (RainDance Technologies). A successful pipeline has been published pairing the Fluidigm Access Array™ system with the GS FLX sequencer, and an automated protocol has been promoted for the GS Junior benchtop system in a commercial agreement between the parent companies [36].

The AmpliSeq™ Cancer Panels (Life Technologies) employ multiplex methods to provide rapid, targeted amplification of 46 or 400 oncogenes and tumor suppressor genes in solution using limited amounts of DNA in pooled PCR reactions (AmpliSeq™ Cancer Hot Spot Panel: 10 ng; AmpliSeq™ Comprehensive Cancer Panel: 40 ng) for use with the Ion Torrent PGM. The amplicon libraries average 80–150 base pairs and are optimized for sequencing using the template preparation protocol and semiconductor chip system of the PGM. The technology incorporates fusion adapters including target-specific primers and barcodes and, despite the extensive pooling associated with the AmpliSeq methodology, this approach has proven capable of detecting individual mutations in challenging samples, with validation by Sanger sequencing [32,33].

The TruSeq amplicon sequencing system (Illumina, Inc.) combines up to 1500 primer pairs for 48 cancer-related genes in a reaction with a low starting amount of DNA (about 200 ng) for use with the MiSeq benchtop sequencer. The assay design compensates for loss of fidelity during primer extension by utilizing opposing primers that extend toward each other over a 175–425 base region of interest in a process repeated over 25 cycles; the elongated strands are then joined by a ligation reaction. Thus, the fidelity of the amplicon remains high at both ends because the probability of replicating extension errors is reduced. By amplifying both coding and noncoding strands in this manner, error checking can be performed for difficult sequence domains through evaluation of both templates after sequencing. Illumina offers a paired-end sequencing protocol for this assay, minimizing MPS errors and capturing accurate, overlapping information for regions of interest that can help detect complex structural variants (SVs) as well as genomic gains and losses. The introduction of the TruSight amplicon panel allows interrogation of 26 cancer-related genes in FFPE samples with minimal initial DNA (40 ng) using smaller amplicons (165–195 bp) and shorter sequencing reads (2 × 150 bp) to compensate for fragmentation.

Benchtop MPS technology readily supports detection of SNPs, single base mutations (also known as single nucleotide variants or SNVs), and small insertions and deletions (indels) during amplicon sequencing (Figure 4.3).
However, accurate detection of large cancer-related SVs such as balanced and unbalanced chromosomal rearrangements, gene copy number changes (CNVs), and large homozygous and heterozygous deletions is challenging and requires advances in target capture, library preparation, read length, and data analysis techniques. For example, paired-end or mate-pair sequencing protocols produce both forward and reverse template strands for precise template alignment and long-range positional information; at present, the MiSeq instrument can perform 2 × 250 bp paired-end reads.

It is also important to consider that comparison of amplicon sequencing results to the current human genome reference sequence (which has an accuracy of 1 error in 10,000 bases) and to cancer mutation databases is not foolproof [37]. The highest accuracy for identification of somatic genomic alterations in an individual patient's tumor sample is still obtained by direct comparison to DNA from the matched patient "normal" sample, preferably blood. It is through this process that copy number changes can be directly assessed and germ line versus sporadic changes can be distinguished. However, the addition of a matched "normal" reference doubles the cost, required reagents, data production, and overall work effort.

The use of benchtop sequencing instruments for low-cost mutation scanning is not without technical limitations and can independently contribute to errors in targeted sequencing. Some NGS platforms produce more systematic errors, resulting in significantly lower accuracy than others; as demonstrated by direct systematic comparison [38], these errors are attributable to the shorter read lengths of MPS systems compared with Sanger sequencing (100–400 bases versus 1000 bases, respectively) and the lower fidelity of NGS reactions compared with the dideoxynucleotide chain termination reaction of Sanger sequencing. In addition, the probability of random sequence errors increases across all MPS platforms toward the end of template reads; the different methods of template preparation and cycle sequencing unique to each benchtop sequencer are a source of these errors [38].
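To make the matched tumor-normal comparison described above concrete, the sketch below removes tumor variant calls that are also present in the patient's normal sample, reclassifying them as germ line. It is deliberately schematic: variants are simplified (chromosome, position, ref, alt) tuples rather than parsed VCF records, and the example coordinates are illustrative only.

```python
# Schematic tumor/normal subtraction: calls seen in the matched normal
# sample are treated as germ line and removed from the somatic candidates.

def somatic_candidates(tumor_calls, normal_calls):
    germline = set(normal_calls)
    return [v for v in tumor_calls if v not in germline]

tumor = [("7", 140453136, "A", "T"),    # e.g., a BRAF hotspot position
         ("17", 7577120, "C", "T")]
normal = [("17", 7577120, "C", "T")]    # also present in blood -> germ line
print(somatic_candidates(tumor, normal))  # only the tumor-restricted call
```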


The Roche and Ion Torrent systems create DNA template libraries comprising monoclonally decorated beads (Roche: 20 μm) or Ion Spheres (Ion Torrent: 1 μm) using emPCR, followed by sequestration for sequencing in individual wells. The templates are then read by pyrosequencing or sequencing-by-synthesis reactions, respectively. Errors include a GC bias when interrogating AT-rich regions, along with inaccuracies in detecting the sequences of homopolymer regions [39]. In addition, the sequencing paradigm of these instruments incorporates separate serial flows of single bases for incorporation into the synthesized strand, resulting in rare substitution errors. The Illumina platform uses bridge PCR to amplify templates in discrete monoclonal clusters within a flow cell and reads individual base incorporations by reversible-terminator sequencing. This approach can yield low cluster diversity due to sequence bias introduced in early cycles [40]. Since all four DNA bases are presented in a single flow for sequence incorporation, multiple base incorporation errors can result in out-of-phase sequence readout, increasing the noise of the system.

The systematic errors produced by these benchtop instruments can be identified and largely corrected when sequencing is performed at saturating levels of target templates and high base depth (>1000×). However, calibration standards should be employed to correct for the idiosyncrasies intrinsic to each platform when it is used for targeted amplicon sequencing. In the absence of corrective methods, Sanger sequencing remains an important benchmark of accuracy and should be used to validate novel discoveries and pertinent clinical findings as needed. The implementation of NGS technology in a clinical laboratory is complex and requires significant expertise in the clinical, technical, and bioinformatics aspects of sequencing. Laboratories should strictly apply all quality measures for NGS test development, validation, and quality assurance under the guidance of the Clinical Laboratory Improvement Amendments (CLIA) and College of American Pathologists (CAP) regulations.

CONCLUSION

Amplification-based NGS is a high-resolution, cost-effective method to interrogate genomic regions of interest for disease-specific genotype variants. Multiplex PCR is one of the most commonly used approaches for amplification-based target enrichment; it can be performed on a variety of specimens, including FFPE tissue, requires small amounts of starting template, is time- and cost-effective, and provides high depth of sequencing and straightforward data analysis without burdensome storage requirements. However, special attention must be paid to quality assurance of amplification-based target enrichment, including the potential for bias, amplicon size limitations, contamination, and sequencing errors. The introduction of benchtop instruments has made targeted sequencing accessible to clinical laboratories for the detection of mutations in oncology and genetic diseases.

References
[1] Summerer D. Enabling technologies of genomic-scale sequence enrichment for targeted high-throughput sequencing. Genomics 2009;94:363–8.
[2] Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods 2010;7:111–8.
[3] Hadd AG, Houghton J, Choudhary A, Sah S, Chen L, Marko AC, et al. Targeted, high-depth, next-generation sequencing of cancer genes in formalin-fixed, paraffin-embedded and fine-needle aspiration tumor specimens. J Mol Diagn 2013;15:234–47.
[4] Do H, Dobrovic A. Dramatic reduction of sequence artefacts from DNA isolated from formalin-fixed cancer biopsies by treatment with uracil-DNA glycosylase. Oncotarget 2012;3:546–58.
[5] Lamy A, Blanchard F, Le Pessot F, Sesboue R, Di Fiore F, Bossut J, et al. Metastatic colorectal cancer KRAS genotyping in routine practice: results and pitfalls. Mod Pathol 2011;24:1090–100.
[6] Costello M, Pugh TJ, Fennell TJ, Stewart C, Lichtenstein L, Meldrim JC, et al. Discovery and characterization of artifactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation. Nucleic Acids Res 2013;41:e67.
[7] Pienaar E, Theron M, Nelson M, Viljoen HJ. A quantitative model of error accumulation during PCR amplification. Comput Biol Chem 2006;30:102–11.
[8] Syvanen AC. Toward genome-wide SNP genotyping. Nat Genet 2005;37(Suppl.):S5–10.
[9] Broude NE, Zhang L, Woodward K, Englert D, Cantor CR. Multiplex allele-specific target amplification based on PCR suppression. Proc Natl Acad Sci USA 2001;98:206–11.
[10] Tewhey R, Warner JB, Nakano M, Libby B, Medkova M, David PH, et al. Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat Biotechnol 2009;27:1025–31.
[11] Menon RS, Chang YF, St Clair J, Ham RG. RT-PCR artifacts from processed pseudogenes. PCR Methods Appl 1991;1:70–1.


[12] Smith RD, Ogden CW, Penny MA. Exclusive amplification of cDNA template (EXACT) RT-PCR to avoid amplifying contaminating genomic pseudogenes. Biotechniques 2001;31:776–8, 780, 782.
[13] Gilje B, Heikkila R, Oltedal S, Tjensvoll K, Nordgard O. High-fidelity DNA polymerase enhances the sensitivity of a peptide nucleic acid clamp PCR assay for K-ras mutations. J Mol Diagn 2008;10:325–31.
[14] Schutze T, Rubelt F, Repkow J, Greiner N, Erdmann VA, Lehrach H, et al. A streamlined protocol for emulsion polymerase chain reaction and subsequent purification. Anal Biochem 2011;410:155–7.
[15] Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 2011;475:348–52.
[16] Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456:53–9.
[17] Milbury CA, Correll M, Quackenbush J, Rubio R, Makrigiorgos GM. COLD-PCR enrichment of rare cancer mutations prior to targeted amplicon resequencing. Clin Chem 2012;58:580–9.
[18] [a] Li M, Stoneking M. A new approach for detecting low-level mutations in next-generation sequence data. Genome Biol 2012;13:R34. [b] McCall CM, Mosier S, Thiess M, Debeljak M, Pallavajjala A, Beierl K, et al. False positives in multiplex PCR-based next generation sequencing have unique signatures. J Mol Diagn Jul 10. [Epub ahead of print].
[19] Halbritter J, Diaz K, Chaki M, Porath JD, Tarrier B, Fu C, et al. High-throughput mutation analysis in patients with a nephronophthisis-associated ciliopathy applying multiplexed barcoded array-based PCR amplification and next-generation sequencing. J Med Genet 2012;49:756–67.
[20] Tewhey R, Nakano M, Wang X, Pabon-Pena C, Novak B, Giuffre A, et al. Enrichment of sequencing targets from the human genome by solution hybridization. Genome Biol 2009;10:R116.
[21] Mertes F, Elsharawy A, Sauer S, van Helvoort JM, van der Zaag PJ, Franke A, et al. Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief Funct Genomics 2011;10:374–86.
[22] Bodi K, Perera A, Adams PS, Bintzler D, Dewar K, Grove DS, et al. Comparison of commercially available target enrichment methods for next generation sequencing. J Biomol Tech 2013;24(2):1–14.
[23] Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, et al. Multiplex amplification of large sets of human exons. Nat Methods 2007;4:931–6.
[24] Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet 2007;39:1522–7.
[25] Teer JK, Bonnycastle LL, Chines PS, Hansen NF, Aoyama N, Swift AJ, et al. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res 2010;20:1420–31.
[26] Wang Y, Cottman M, Schiffman JD. Molecular inversion probes: a novel microarray technology and its application in cancer research. Cancer Genet 2012;205:341–55.
[27] Johansson H, Isaksson M, Sorqvist EF, Roos F, Stenberg J, Sjoblom T, et al. Targeted resequencing of candidate genes using selector probes. Nucleic Acids Res 2011;39:e8.
[28] Dahl F, Gullberg M, Stenberg J, Landegren U, Nilsson M. Multiplex amplification enabled by selective circularization of large sets of genomic DNA fragments. Nucleic Acids Res 2005;33:e71.
[29] Calvo SE, Compton AG, Hershman SG, Lim SC, Lieber DS, Tucker EJ, et al. Molecular diagnosis of infantile mitochondrial disease with targeted next-generation sequencing. Sci Transl Med 2012;4:118ra110.
[30] Meder B, Haas J, Keller A, Heid C, Just S, Borries A, et al. Targeted next-generation sequencing for the molecular genetic diagnostics of cardiomyopathies. Circ Cardiovasc Genet 2011;4:110–22.
[31] Shearer AE, DeLuca AP, Hildebrand MS, Taylor KR, Gurrola II J, Scherer S, et al. Comprehensive genetic testing for hereditary hearing loss using massively parallel sequencing. Proc Natl Acad Sci USA 2010;107:21104–9.
[32] Beadling C, Neff TL, Heinrich MC, Rhodes K, Thornton M, Leamon J, et al. Combining highly multiplexed PCR with semiconductor-based sequencing for rapid cancer genotyping. J Mol Diagn 2013;15:171–6.
[33] Yousem SA, Dacic S, Nikiforov YE, Nikiforova M. Pulmonary Langerhans cell histiocytosis: profiling of multifocal tumors using next-generation sequencing identifies concordant occurrence of BRAF V600E mutations. Chest 2013;143(6):1679–84.
[34] Chan M, Ji SM, Yeo ZX, Gan L, Yap E, Yap YS, et al. Development of a next-generation sequencing method for BRCA mutation screening: a comparison between a high-throughput and a benchtop platform. J Mol Diagn 2012;14:602–12.
[35] Pritchard CC, Smith C, Salipante SJ, Lee MK, Thornton AM, Nord AS, et al. ColoSeq provides comprehensive Lynch and polyposis syndrome mutational analysis using massively parallel sequencing. J Mol Diagn 2012;14:357–66.
[36] Moonsamy PV, Williams T, Bonella P, Holcomb CL, Hoglund BN, Hillman G, et al. High throughput HLA genotyping using 454 sequencing and the Fluidigm Access Array system for simplified amplicon library preparation. Tissue Antigens 2013;81:141–9.
[37] Ajay SS, Parker SC, Abaan HO, Fajardo KV, Margulies EH. Accurate and comprehensive sequencing of personal genomes. Genome Res 2011;21:1498–505.
[38] Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 2012;30:434–9.
[39] Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 2012;13:341.
[40] Krueger F, Andrews SR, Osborne CS. Large scale loss of data in low-diversity Illumina sequencing libraries can be recovered by deferred cluster calling. PLoS One 2011;6:e16607.


CHAPTER 5

Emerging DNA Sequencing Technologies

Shashikant Kulkarni¹ and John Pfeifer²

¹Department of Pathology and Immunology, Department of Pediatrics, and Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
²Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA

OUTLINE

Introduction 70
Third-Generation Sequencing Approaches 71
  Single-Molecule Real-Time (SMRT) DNA Sequencing 71
  Heliscope Genetic Analysis System 72
Fourth-Generation Sequencing 73
  Nanopore Sequencing 73
Selected Novel Technologies 74
  In Situ DNA Sequencing 74
  Transmission Electron Microscopy 74
  Electronic Sequencing 75
Summary 75
References 75

KEY CONCEPTS
• When discussing the development of DNA sequencing technologies, Sanger sequencing is usually referred to as first-generation DNA sequencing.
• The current methods for massively parallel DNA sequence analysis, which are the methods currently referred to as next-generation sequencing (NGS), are second-generation approaches. They all include amplification steps and utilize the sequencing-by-synthesis paradigm to determine the order of the DNA bases in a nucleotide strand.
• Third-generation platforms can perform sequencing from a single DNA molecule without the need for prior template amplification, although the sequencing step itself still involves sequencing-by-synthesis. They provide a number of advantages over second-generation DNA sequencing methods, including avoidance of the artifactual DNA mutations and strand biases introduced by even limited cycles of PCR; higher throughput and faster turnaround times; longer read lengths by some platforms; higher consensus accuracy; and analysis of smaller quantities of nucleic acids.
• Fourth-generation approaches utilize nanopore technologies. They permit sequence analysis of single DNA molecules, do not involve prior amplification steps, and the sequencing step is performed without DNA synthesis.
• A range of other novel technologies are still in the development stage and are thus years away from widespread clinical use.


INTRODUCTION

It is customary when discussing the development of DNA sequencing technologies to refer to Sanger sequencing, when performed using nucleotides labeled with different fluorochromes, as first-generation DNA sequencing. The current methods for massively parallel DNA sequence analysis, which are the methods currently referred to as next-generation sequencing (NGS), are thus second-generation approaches; they include the HiSeq and MiSeq platforms manufactured by Illumina, the 454 GS Junior platform manufactured by Roche, and the SOLiD and Ion semiconductor platforms manufactured by Life Technologies. While the details of the technologies employed by these platforms differ (see Chapter 1), all include amplification steps (usually involving only a limited number of PCR cycles) and utilize the sequencing-by-synthesis paradigm to determine the order of the DNA bases.

Although there is no universal nomenclature, for the purposes of this chapter third-generation platforms (Table 5.1) will be defined as those that can perform sequencing from a single DNA molecule without the need for prior template amplification, although the sequencing step itself still involves sequencing-by-synthesis; current third-generation platforms include the PacBio RS manufactured by Pacific Biosciences and the Heliscope Sequencer produced by Helicos BioSciences [1–3]. In this chapter, fourth-generation approaches will be defined as those that involve sequence analysis of single DNA molecules and do not involve prior amplification steps; the sequencing is performed without DNA synthesis, and so is free of nucleotide labeling and detection steps. Nanopore-based technologies are the best examples of fourth-generation platforms [1,4,5]. A range of other novel technologies are still in the development stage and thus years away from widespread clinical use, but their novel design features make them worth comment [6].

TABLE 5.1 Emerging Genome-Sequencing Technologies^a

THIRD GENERATION

PacBio RS (Pacific BioSciences). Method: single-molecule real-time sequencing; imaging of dye-labeled nucleotides as they are incorporated by a single DNA polymerase molecule. Advantages: no template amplification; less starting material required; long read lengths (800–1000 bp).

Heliscope sequencer (Helicos BioSciences). Method: single-molecule real-time sequencing; imaging of dye-labeled nucleotides as they are incorporated. Advantages: no template amplification; less starting material required; direct RNA-seq. Disadvantages: short read lengths (35 bp).

FOURTH GENERATION

Nanopore sequencing (Oxford Nanopore Technologies). Method: single-molecule sequencing incorporating nanopore technology. Advantages: no template amplification; less starting material required; not sequencing-by-synthesis; potential for very long read lengths (up to 50 kb). Disadvantages: technical hurdles remain; not commercially available, although the MinION device was made available for testing to selected research groups in early 2014.

Nanopore sequencing (NABsys). Method: nanopore sequencing using converted targets and optical readout. Advantages: facilitates massive parallelism. Disadvantages: technical hurdles remain; not commercially available.

NOVEL TECHNIQUES

In situ sequencing. Method: sequencing of mRNA in situ by ligation chemistries. Advantages: sequencing in tissue sections on a glass slide. Disadvantages: short read lengths (<30 bp); only 100–400 reads per cell.

Transmission electron microscopy (TEM). Method: DNA bases are labeled by tags with different contrast by TEM. Advantages: potential for very long read lengths (up to 100 kb). Disadvantages: technical hurdles remain; not commercially available.

Electronic sequencing. Method: magnetic and electronic fields are used to immobilize nucleic acid-coated beads, reagents, and products of sequencing. Advantages: chamber free. Disadvantages: technical hurdles remain; not commercially available.

^a This table is not exhaustive but is instead intended to provide notes summarizing salient features of the various emerging approaches and platforms; since these technologies are evolving rapidly, the reader is encouraged to consult the Internet for up-to-date information.


This chapter will introduce the current third-generation technologies in widespread use and discuss their advantages and disadvantages. The current so-called fourth-generation technologies will also be discussed, with a focus on nanopore-based technologies, which have yet to realize their early promise of routine high-throughput sequencing of long DNA templates.

THIRD-GENERATION SEQUENCING APPROACHES

The unifying feature of all so-called third-generation sequencing methods is that they make it possible to sequence individual DNA molecules without the need for a template amplification step. The lack of an amplification step provides a number of advantages over second-generation DNA sequencing methods (which for the remainder of this chapter will be referred to simply as current NGS approaches), including avoidance of the artifactual DNA mutations and strand biases introduced by even limited cycles of PCR; higher throughput and faster turnaround times; longer read lengths (on some platforms) that enhance de novo contig and genome assembly, which in turn makes direct detection of haplotypes and phasing possible; higher consensus accuracy, which theoretically enhances rare variant detection; and analysis of smaller quantities of nucleic acids, which has clear advantages in a clinical setting [1–3].

Single-Molecule Real-Time (SMRT) DNA Sequencing

The so-called SMRT technology relies on three key innovations: the SMRT cell, which makes it possible to observe incorporation of individual nucleotides in real time; the use of phospholinked nucleotides, which enables long read lengths; and a novel detection platform that enables single-molecule detection. Briefly, DNA sequencing using SMRT cells is performed on a sequencing chip containing thousands of zero-mode waveguides (ZMWs). ZMWs are produced by semiconductor manufacturing techniques and are essentially holes in an aluminum cladding on a clear silica substrate, roughly 70 nm in diameter and about 100 nm deep [7]. Sequencing of a single DNA template molecule is performed by a single DNA polymerase molecule attached to the bottom of each ZMW (Figure 5.1); due to the physics of light as it travels through such a small aperture, the volume of buffer illuminated in an individual ZMW is only about 20 zeptoliters (20 × 10⁻²¹ liters), and consequently it is possible to measure, with a charge-coupled device (CCD) array, the activity of the single DNA polymerase molecule as it incorporates a single nucleotide [7–10]. The SMRT approach therefore still involves sequencing-by-synthesis utilizing fluorescently labeled deoxynucleotides (dNTPs).
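For a sense of scale, treating the ZMW as a simple cylinder with the quoted dimensions gives a total well volume of a few hundred zeptoliters, of which only the bottommost evanescent zone (~20 zL) is actually illuminated. A quick back-of-envelope check:

```python
import math

# ZMW approximated as a cylinder: ~70 nm diameter, ~100 nm deep
radius_m = 35e-9
depth_m = 100e-9

well_volume_m3 = math.pi * radius_m**2 * depth_m
well_volume_zl = well_volume_m3 * 1e3 / 1e-21   # m^3 -> liters -> zeptoliters
print(f"full well volume: ~{well_volume_zl:.0f} zL")   # ~385 zL
# Only about 20 zL of this, at the very bottom of the well, is illuminated --
# a detection volume small enough to watch a single polymerase at work.
```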

FIGURE 5.1 SMRT sequencing. A single DNA polymerase molecule (in white) is attached to the bottom of a waveguide. The waveguide has a geometry that produces a zone of illumination limited to the very bottom of the well, creating a detection volume of only about 20 zeptoliters (20 × 10⁻²¹ liters). As the DNA polymerase processively incorporates nucleotides, individual incorporation events can be discriminated using nucleotides labeled with different colored fluorochromes. Modified from https://ncifrederick.cancer.gov/atp/cms/wp-content/uploads/2011/10/pacbio_technology_backgrounder.pdf.


SMRT technology has several advantages. First, the current PacBio RS instrument produces reads that average about 3000 bp, although individual reads can reach lengths of 20,000 bp or longer. Such long read lengths make de novo assembly of contigs and genomes much more straightforward. Second, SMRT sequencing makes it possible to identify epigenetic modifications directly. Since the sequencing reaction also tracks the rate of DNA polymerization [11], small changes in polymerization kinetics caused by epigenetic modifications such as methylation shift the timing of the observed fluorescent signals (Figure 5.2). Third, although individual reads contain a significantly higher percentage of errors than reads from Illumina and other platforms, with sufficient depth of coverage the long read lengths make it possible to produce statistically averaged consensus sequences with a high degree of accuracy [12]. An enhanced version of the SMRT sequencing platform, termed the PacBio RS II, has recently been released; it has an increased number of ZMW cells, which has doubled the throughput per sequencing experiment [13].

Heliscope Genetic Analysis System

The Heliscope instrument is another platform based on single-molecule sequencing technology [14–16]. A DNA library is constructed by random fragmentation of the genomic sample, with 3′ polyadenylation of the DNA fragments by the enzyme adenosine terminal transferase. The fragments are then denatured and hybridized to a flow cell bearing surface-tethered poly-T oligomers. Sequencing-by-synthesis cycles proceed using fluorescently labeled nucleotides, and incorporation of nucleotides is imaged with a CCD camera. Because this true single-molecule sequencing (tSMS) approach does not require an amplification step prior to the sequencing reaction, the artifactual DNA mutations and strand biases introduced by even limited cycles of PCR are avoided. In addition, the fact that tSMS can be used for direct RNA-seq applications represents a significant advantage over other massively parallel sequencing methods [17]. Nonetheless, the tSMS technique has significant drawbacks: the typical read length is only about 55 bp, and the sequencing reaction takes about 8 days to complete.

FIGURE 5.2 Direct identification of epigenetic modifications by SMRT sequencing. (A) A raw SMRT sequencing read; the DNA polymerase pauses (arrows) when it encounters a modified 5-hydroxymethylcytosine (5hmC) base in the template strand (which is the reverse strand in (B)). (B) Pauses (arrows) appear as discontinuities as the polymerase temporarily stops progressing along the DNA template; the absence of a pause at the same CG position in the corresponding complementary strand implies that each genomic location was hemihydroxymethylated (F, forward strand reads; R, reverse strand reads). Reprinted by permission from Macmillan Publishers Ltd: Nature Methods, 9:75–79, copyright 2012.


FOURTH-GENERATION SEQUENCING

Fourth-generation sequencing methods are nanopore-based techniques that rely on entirely different principles of chemistry and physics to produce DNA sequence reads.

Nanopore Sequencing

In contrast to all other current DNA sequencing methods, nanopore-based sequencing does not involve sequencing-by-synthesis. Instead, it relies on variations in the electrical current that result from the translocation of individual DNA molecules through artificial nanopores that perforate a membrane. While there are a number of technical variations on the approach, all rely on a small voltage (on the order of 100 mV) applied across a membrane separating two chambers filled with aqueous electrolytes, a configuration that produces a current through the pore that can be measured by standard electrophysiologic techniques. The pores can either be fabricated from exotic carbon materials such as graphene [18] or created by biologic pore-forming proteins such as α-hemolysin inserted into a lipid bilayer [19]. In either case, the nanopore locally unravels the coiled nucleic acid strand, allowing the nucleotides to translocate through the pore sequentially in single file. Since the translocating molecule partially blocks the current of ions through the nanopore, changes in current can, in theory, be used to deduce the nucleotide sequence of the strand [20,21]. Interest in nanopore sequencing is enhanced by the fact that it provides for sequencing of individual DNA molecules with extremely long read lengths (in theory approaching 50,000 bp).

At present, there are two general approaches (Figure 5.3). In the first, the DNA library preparation does not involve any amplification steps (eliminating the potential biases and artifacts introduced by rounds of amplification) and the sequence is determined by current changes as the DNA strand passes through a nanopore (and so is also free of the errors and artifacts intrinsic to the in vitro DNA polymerization that is part of sequencing-by-synthesis).

FIGURE 5.3 Nanopore sequencing. (A) Sequencing using ionic current blockage. A typical trace of the ionic current through an α-hemolysin pore differentiates an open pore (top) from one blocked by a DNA strand (bottom) but cannot distinguish between the 10–12 nucleotides that simultaneously block the narrow transmembrane channel domain (red bracket); because the four dNMPs each produce a different ionic current blockage, methods that ensure that only one nucleotide at a time is present in the pore (e.g., via attachment of an exonuclease activity to the biopore protein) make it possible to infer the base sequence of the polynucleotide. (B) Sequencing using a synthetic DNA and optical readout. Each nucleotide in the target DNA is first converted into a longer DNA strand composed of pairs of two different code units (colored orange and blue in this illustration); after hybridizing the converted DNA with molecular beacons that are complementary to the code units, the beacons are stripped off as the DNA strand passes through the nanopore; the sequence of the original DNA is read by detecting the discrete short-lived photon bursts as each oligo is stripped. Reprinted by permission from Macmillan Publishers Ltd: Nature Biotechnology, 26:1146–1153, copyright 2008.


To ensure that only one nucleotide at a time is present in the pore, a molecule with exonuclease activity is physically attached to the biopore protein [22]; via this configuration, individual unlabeled deoxynucleotide monophosphates (dNMPs) are released from the end of the DNA or RNA chain and then traverse the nanopore one at a time. Because the four dNMPs each produce a different, easily distinguishable ionic current blockage, it is possible to infer the base sequence of the polynucleotide. However, accurate DNA sequencing relies on assuring that 100% of the exonuclease-released dNMPs traverse the pore, and that they traverse the pore in the order in which they are cleaved from the DNA strand. Though not yet commercially available, the MinION device based on this principle was made available for testing to selected research groups in early 2014 by its manufacturer, Oxford Nanopore.

In the second general approach, the template DNA is converted into a mixture of fragments that correspond to segments of the input DNA. For readout, the converted DNA is hybridized with a mixture of two different so-called molecular beacons which, when free in solution or hybridized to DNA, produce only low-background fluorescence. However, the beacons fluoresce briefly as they are stripped off the converted DNA strand as it passes through a nanopore, and so measurement of the fluorescent signal makes it possible to infer the DNA sequence [23,24].

The nanopore approach has been validated in principle by a number of studies showing that modulations of ionic current can be measured as RNA or DNA strands translocate through the pore [25,26]. However, several technical hurdles continue to plague development of the methodology. For example, even though an "infinitely short" channel would prevent the presence of more than one nucleotide within the channel at the same time, it would still not achieve perfect single-nucleotide resolution due to physical constraints on the electric field in the region of the pore [27]. Also, since nucleotide strands are translocated through the nanopores of existing designs at rates far too fast for the associated electronics to resolve the small ionic currents, methods for slowing the translocation of nucleic acid strands through pores are required.

A number of technical modifications are being explored to address the inherent constraints imposed on nanopore sequencing by the laws of physics. Some of the technologies under development exploit quantum mechanical properties to detect and differentiate nucleotides based on the tunneling currents created as they pass through the nanopores [28–30]. Other designs employ chemically modified metal electrodes that form base-specific hydrogen bonds to enhance the tunneling currents as the strand of DNA passes through the nanopore [31].
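To make the current-blockage readout concrete, the sketch below decodes a toy current trace by nearest-level matching. All numbers (the open-pore level, the per-base blockage levels, and the tolerance) are hypothetical placeholders chosen for illustration; real signals are noisy, time-varying, and, as noted above, complicated by multiple nucleotides occupying the pore at once.

```python
# Illustrative sketch only: decode a stream of ionic-current readings into
# bases by nearest-level matching. The blockage levels below are hypothetical
# placeholders, not measured values for any real nanopore.

# Hypothetical mean residual current (pA) while each dNMP occupies the pore.
BLOCKAGE_LEVELS = {"A": 52.0, "C": 45.0, "G": 58.0, "T": 40.0}

def call_bases(current_trace, open_pore_level=100.0, tolerance=3.0):
    """Assign a base to each blockade event by nearest-level matching."""
    bases = []
    for level in current_trace:
        if abs(level - open_pore_level) <= tolerance:
            continue  # open pore between translocation events; no base present
        base = min(BLOCKAGE_LEVELS, key=lambda b: abs(BLOCKAGE_LEVELS[b] - level))
        bases.append(base)
    return "".join(bases)

# Example: a trace alternating between open-pore and blockade levels.
trace = [100.1, 51.8, 99.7, 40.3, 100.2, 44.6, 57.9, 99.9]
print(call_bases(trace))  # -> "ATCG"
```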

SELECTED NOVEL TECHNOLOGIES

There are a number of novel technologies for producing DNA sequence that are in various stages of development. While a detailed description of all of them is beyond the scope of this chapter, several techniques are noteworthy in that they illustrate the use of in situ approaches or involve novel applications of the principles of chemistry and physics to produce DNA sequence reads.

In Situ DNA Sequencing

In situ methods for sequencing mRNA within tissue sections on a glass slide rely on hybridization of probes with specific design features to the targeted mRNA molecules within the tissue section [6]. Amplification steps produce enough cDNA for the actual in situ sequencing by ligation chemistry [32,33]. Currently, these methods are limited by the fact that they generate reads that are less than 30 bp long and only about 100–400 reads per cell.

Transmission Electron Microscopy

Since the nucleotides in a DNA strand cannot be directly detected by transmission electron microscopy (TEM), electron microscopy-based sequencing requires modification of the individual bases with labels that have a high atomic number. In one approach that is analogous to the use of fluorophores to label DNA for Sanger sequencing, amplification steps are used to tag the different bases of the DNA strand with labels that have different contrast by TEM. Although TEM is potentially capable of producing reads in excess of 100,000 bp long [34], the sequencing step itself is not entirely straightforward since uniform base-to-base spacing is required for reliable sequence reads, a technical feat that has not yet been accomplished. TEM sequencing is still a largely theoretical approach, and while it is the focus of several companies, no platforms are commercially available yet.

Electronic Sequencing

In this chamber-free method, magnetic and electronic fields are used to immobilize beads carrying the nucleic acids to be sequenced, the reagents that will be incorporated as part of the sequencing reaction, and the products of the sequencing reaction. Although the various steps of electronic sequencing have been demonstrated using model systems [35], the technology is in very early stages of development and is several years away from routine use.

SUMMARY

The DNA sequencing platforms that are currently in widespread use to perform massively parallel sequencing, which as a group are currently referred to as NGS platforms, have enabled the genomic revolution in science and medicine. However, NGS platforms do not represent the final stage of development of DNA sequencing technologies. A number of so-called third-generation approaches, which are already available commercially, make it possible to sequence individual DNA molecules without the need for library amplification steps. These approaches offer a number of advantages over current NGS methods, including avoidance of the artifactual DNA mutations and strand biases introduced by even limited cycles of PCR; higher throughput and faster turnaround times; longer read lengths (on some platforms) that enhance de novo contig and genome assembly; higher consensus accuracy; and analysis of smaller quantities of nucleic acids, which has clear advantages in clinical settings. However, the third-generation approaches are themselves transitional to fourth-generation techniques (and techniques that are still more experimental) that, while largely still in developmental phases, rely on entirely different principles of chemistry and physics to produce DNA sequence. While fourth-generation technologies are years away from clinical use, they provide a glimpse into the ever more sophisticated utilization of synthetic materials and advanced electronics that will continue to make DNA sequence analysis even faster and less costly.

References
[1] Ku CS, Roukos DH. From next-generation sequencing to nanopore sequencing technology: paving the way to personalized genomic medicine. Expert Rev Med Devices 2013;10:1–6.
[2] Su Z, Ning B, Fang H, Hong H, Perkins R, Tong W, Shi L. Next-generation sequencing and its applications in molecular diagnostics. Expert Rev Mol Diagn 2011;11:333–43.
[3] Pareek CS, Smoczynski R, Tretyn A. Sequencing technologies and genome sequencing. J Appl Genet 2011;52:413–35.
[4] <https://www.nanoporetech.com/technology/publications>.
[5] <www.nabsys.com>.
[6] Mignardi M, Nilsson M. Fourth-generation sequencing in the cell and the clinic. Genome Med 2014;6:31.
[7] Korlach J, Marks PJ, Cicero RL, et al. Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures. Proc Natl Acad Sci USA 2008;105:1176–81.
[8] Levene M, Korlach J, Turner S, et al. Zero-mode waveguides for single-molecule analysis at high concentrations. Science 2003;299:682–6.
[9] Foquet M, Samiee K, Kong X, et al. Improved fabrication of zero-mode waveguides for single-molecule detection. J Appl Phys 2008;103:034301-1–034301-9.
[10] <https://ncifrederick.cancer.gov/atp/cms/wp-content/uploads/2011/10/pacbio_technology_backgrounder.pdf>.
[11] Song CX, Clark TA, Lu XY, Kislyuk A, et al. Sensitive and specific single-molecule sequencing of 5-hydroxymethylcytosine. Nat Methods 2011;9:75–7.
[12] Roberts RJ, Carneiro MO, Schatz MC. The advantages of SMRT sequencing. Genome Biol 2013;14:405.
[13] New Products: PacBio's RS II; Cufflinks | In Sequence | Sequencing | GenomeWeb <http://www.genomeweb.com/sequencing/new-products-pacbios-rs-ii-cufflinks>.
[14] Braslavsky I, Hebert B, Kartalov E, et al. Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci USA 2003;100:3960–4.
[15] Ozsolak F, Kapranov P, Foissac S, et al. Comprehensive polyadenylation site maps in yeast and human reveal pervasive alternative polyadenylation. Cell 2010;143:1018–29.
[16] Harris TD, Buzby PR, Babcock H, et al. Single-molecule DNA sequencing of a viral genome. Science 2008;320:106–9.
[17] Rusk N. The true RNA-seq. Nat Methods 2009;6:790–1.
[18] Garaj S, Hubbard W, Reina A, Kong J, Branton D, Golovchenko JA. Graphene as a subnanometre trans-electrode membrane. Nature 2010;467:190–3.


[19] Braha O, Walker B, Cheley S, et al. Designed protein pores as components for biosensors. Chem Biol 1997;4:497–505.
[20] Branton D, Deamer DW, Marziali A, et al. The potential and challenges of nanopore sequencing. Nat Biotechnol 2008;26:1146–53.
[21] Kasianowicz JJ, Brandin E, Branton D, et al. Characterization of individual polynucleotide molecules using a membrane channel. Proc Natl Acad Sci USA 1996;93:13770–3.
[22] Astier Y, Braha O, Bayley H. Toward single molecule DNA sequencing: direct identification of ribonucleoside and deoxyribonucleoside 5′-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J Am Chem Soc 2006;128:1705–10.
[23] Sauer-Budge A, Nyamwanda J, Lubensky D, et al. Unzipping kinetics of double-stranded DNA in a nanopore. Phys Rev Lett 2003;90:238101.
[24] Kim M, Wanunu M, Bell D, et al. Rapid fabrication of uniformly sized nanopores and nanopore arrays for parallel DNA analysis. Adv Mater 2006;18:3149–53.
[25] Akeson M, Branton D, Kasianowicz J, et al. Microsecond time-scale discrimination among polycytidylic acid, polyadenylic acid, and polyuridylic acid as homopolymers or as segments within single RNA molecules. Biophys J 1999;77:3227–33.
[26] Meller A, Nivon L, Brandin E, et al. Rapid nanopore discrimination between single oligonucleotide molecules. Proc Natl Acad Sci USA 2000;97:1079–84.
[27] Muthukumar M, Kong CY. Simulation of polymer translocation through protein channels. Proc Natl Acad Sci USA 2006;103:5273–8.
[28] Zwolak M, Di Ventra M. Electronic signature of DNA nucleotides via transverse transport. Nano Lett 2005;5:421–4.
[29] Zikic R, Krstic P, Zhang X, et al. Characterization of the tunneling conductance across DNA bases. Phys Rev E Stat Nonlin Soft Matter Phys 2006;74:011919.
[30] Ohshiro T, Matsubara K, Tsutsui M, et al. Single-molecule electrical random resequencing of DNA and RNA. Sci Rep 2012;2:501.
[31] He J, Lin L, Zhang P, et al. Identification of DNA base-pairing via tunnel-current decay. Nano Lett 2007;7:3854–8.
[32] Lee JH, Daugharthy ER, Scheiman J, et al. Highly multiplexed subcellular RNA sequencing in situ. Science 2014;343:1360–3.
[33] Ke R, Mignardi M, Pacureanu A, et al. In situ sequencing for RNA analysis in preserved tissue and cells. Nat Methods 2013;10:857–60.
[34] Perkel J. Making contact with sequencing's fourth generation. Biotechniques 2011;50:93–5 [Special News Feature].
[35] <http://www.faqs.org/patents/assignee/genapsys-inc/>.


C H A P T E R

6
RNA-Sequencing and Methylome Analysis

Shamika Ketkar¹ and Shashikant Kulkarni²

¹Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO, USA
²Department of Pathology and Immunology, Department of Pediatrics, and Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA

O U T L I N E

Introduction
Approaches to Analysis of RNA
  Microarray Analysis of Differential Gene Expression
  Next-Generation Methods of RNA-Seq
Workflow
  Typical RNA-Seq Protocol
  Bioinformatic Analyses of Sequence Generated from RNA-Seq Experiment
  Initial Processing of Raw Reads: Quality Assessment
  Read Mapping Strategies
  De novo Read Assembly
  Read Alignment
  RNA-Seq Variant Calling and Filtering
  Expression Estimation: Summarization, Normalization, and Differential Expression
  Differential Expression
  Fusion Detection
  Depth of Coverage Issues
Utility of RNA-Seq to Characterize Alternative Splicing Events
Utility of RNA-Seq for Genomic Structural Variant Detection
RNA-Seq: Challenges, Pitfalls, and Opportunities in Clinical Applications
Methylome Sequencing
Conclusions
References
List of Acronyms and Abbreviations

KEY CONCEPTS
• Various technologies have been used to deduce and quantify the transcriptome. RNA-sequencing by microarray-based technologies is now being replaced by next-generation sequencing (NGS) methods.
• Using deep sequencing by NGS, RNA-sequencing can help elucidate various classes of genomic aberrations, such as translocations, inversions, deletions, and functional effects of single nucleotide variation.
• Significant challenges remain for RNA-sequencing to become useful for clinical applications.
• Methylome sequencing by NGS approaches holds promise for deciphering aberrations of global methylation patterns that have clinical implications for diagnosis or therapy.


INTRODUCTION

Historically, measurement of the global expression of genes has largely relied on microarray-based techniques. These methods have been widely used to study gene expression and to interpret the regulatory mechanisms that control cellular processes. However, hybridization-based technology is largely restricted to known genes and has a limited range of quantification. Next-generation sequencing (NGS)-based RNA-sequencing methods (also known as RNA-seq) can not only measure gene expression levels at higher resolution than microarrays but can also reveal unknown transcripts and splicing isoforms, and provide quantitative measurement of alternatively spliced isoforms. The total RNA complement of a cell or a population of cells, also known as the transcriptome, can be analyzed by this technology, and thus RNA-seq greatly extends the possibilities of transcriptome studies to the analysis of gene isoforms, translocation events, nucleotide variations, and posttranscriptional base modifications. Due to these advantages and the declining cost of sequencing, NGS-based RNA-seq is becoming an increasingly attractive tool to investigate the full range of transcripts and to reveal the complex landscape and dynamics of the transcriptome.

Several flavors of RNA-seq methodology can be used to study a subset of RNA of interest from the total transcriptome. For example, oligo-dT primers that select only the RNAs with poly-A tails can be utilized to analyze messenger RNA (mRNA), also known as mRNA-seq. Similarly, micro-RNA can be exclusively isolated by a selection step after RNA extraction and sequenced (miRNA-seq).

This chapter will discuss the various approaches of RNA-seq, as well as the associated NGS-based workflows and bioinformatics pipelines. Key concepts and challenges that remain for clinical implementation will be highlighted. Finally, a brief overview of methylome sequencing will be provided. Detailed protocols of these sequencing methods are beyond the scope of the clinical focus of this chapter but are available in the literature [1–4].

APPROACHES TO ANALYSIS OF RNA

Microarray Analysis of Differential Gene Expression

Gene expression analysis is widely used to unravel regulatory mechanisms that control cellular processes in plants, animals, and microbes. Initial transcriptomic studies relied heavily on microarrays, which have been very useful in profiling the global expression patterns of genes. Tiling microarrays designed to encompass all expressed exons were an especially attractive approach to comprehensively analyze the transcriptome. However, this hybridization-based approach has several limitations. First, the method suffers from inherent cross-hybridization noise leading to fluctuations in probe intensity; this noise is attributed to regions of homology in the genome. Second, microarray-based expression studies are largely restricted to known genes. Third, these array-based methods have a limited range of quantification; several studies comparing NGS methods for RNA-seq with data generated by microarrays have unequivocally demonstrated NGS methods to be more reliable, more sensitive, and more reproducible, with significantly less technical noise [5,6].

Next-Generation Methods of RNA-Seq

The basic principle of RNA-seq is to use NGS technology to sequence cDNA molecules reverse transcribed from mRNAs or total RNA. Millions of sequenced reads are then mapped to the reference genome or transcriptome, and these mapped reads are used to provide a precise measure of the relative abundance of individual transcripts, splice variants, isoforms, novel transcripts, and chimeric transcripts (if present). Detection of chimeric transcripts denotes fusion events generated by chromosomal aberrations such as translocations, inversions, insertions, or deletions. RNA-seq methods offer a wide dynamic range of quantification that is lacking in conventional microarray approaches. Taking all of these advantages into account (Table 6.1), NGS-based RNA-seq allows the entire transcriptome to be surveyed, with unprecedented throughput and quantitation, and with single base resolution that permits precise annotation, but without restriction to previously mapped transcripts or transcription patterns.


TABLE 6.1 Comparison of Microarray and NGS-Based Transcriptomics Methods

Method | Microarray | Next-generation sequencing
Technique | Hybridization | Massively parallel sequencing
RNA input needed | High | Low
Background noise | High | Low
Cost | Low | High
Bioinformatics resource intensive | Minimal | Extensive
Comparison of expression levels between genes | Not possible | Yes
De novo assembly | Not possible | Yes
Ability to identify all genomic aberrations | Limited | Yes (fusion genes, allele-specific aberrations, RNA editing)
Novel transcribed regions and alternative isoforms | No | Yes
Ability to detect dynamic range of expression | Minimal (hundred fold) | Very high (thousand fold)

WORKFLOW

Typical RNA-Seq Protocol

The quality of input RNA has an enormous impact on downstream analysis of RNA-seq data. This issue is especially relevant for clinical analysis of cancer samples, where formalin fixation is the method of choice for preserving the tissue at the time of biopsy or excision. RNA extraction and subsequent RNA-seq applications are, in fact, technically challenging from formalin-fixed paraffin-embedded (FFPE) tissues. However, it has recently been shown that cDNA hybrid capture and subsequent sequencing by NGS technology can be successfully used for transcriptome analysis of FFPE material [7].

Diverse approaches for RNA-seq can be utilized, but most methods rely on cDNA synthesis from total RNA, or fractionation of total RNA to those transcripts with a poly-A tail. Fragments are then sequenced, with or without amplification, by massively parallel sequencing to obtain sequence reads from one end (single-end sequencing) or both ends (paired-end sequencing). After sequencing, the resulting reads are either aligned to a reference genome or reference transcriptome, or assembled de novo to produce a genome-scale transcription map (Figure 6.1).

Bioinformatic Analyses of Sequence Generated from RNA-Seq Experiment

Information gained from an RNA-seq experiment can be broadly divided into two categories, qualitative and quantitative. Qualitative data include identification of poly-A sites, transcriptional start sites (TSS), expressed transcripts, and exon/intron boundaries. Quantitative data include measurements of differences in expression, alternative splicing, alternative TSS, and alternative polyadenylation between two or more patients or treatments.

The typical RNA-seq pipeline follows a few fundamental analysis steps. First, NGS read sequences are aligned to the reference genome or transcriptome. The mapped reads are then counted, gene expression levels are calculated, and differential gene expression is determined. Software tools have been created over the last several years for each of these steps, specifically to process paired-end information, align fragments that span multiple exons, and align fragments that originate from separate regions of the genome for fusion detection. A survey of widely used tools is presented in Table 6.2; specific uses of some of the tools are described in the following sections [8–18].
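As a concrete illustration of these steps, the sketch below chains two of the tools surveyed in Table 6.2 (TopHat for spliced alignment, Cuffdiff for counting and differential expression) from Python. The file names, sample labels, and index prefix are hypothetical, and the flags shown are minimal defaults; this is an outline of the pipeline's shape under those assumptions, not a validated clinical workflow.

```python
# A minimal sketch of the core pipeline steps (align, count, compare),
# wiring together tools named in this chapter via their command lines.
# Inputs (tumor_1.fq, genes.gtf, the genome index prefix "hg19") are
# hypothetical; consult each tool's documentation for production settings.
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Spliced alignment of paired-end reads to the reference genome (TopHat).
run(["tophat", "-o", "tophat_tumor", "hg19", "tumor_1.fq", "tumor_2.fq"])
run(["tophat", "-o", "tophat_normal", "hg19", "normal_1.fq", "normal_2.fq"])

# 2-3. Count fragments per transcript and test for differential expression
# between the two samples (Cuffdiff, part of the Cufflinks suite).
run(["cuffdiff", "-o", "diff_out", "genes.gtf",
     "tophat_tumor/accepted_hits.bam", "tophat_normal/accepted_hits.bam"])
```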

Initial Processing of Raw Reads: Quality Assessment

As discussed in more detail in Chapter 1, the raw data generated by NGS platforms are converted to base sequences by base-calling algorithms that are platform specific. Each base call is assigned a quality score calculated to indicate its reliability. FastQC [19] is an example of a widely used tool to assess this metric, and it can be used to identify low-quality bases that may be due to GC context, regions of homology, adapter contamination, and nonrandom hybridization of random primers. FastX [20], PRINSEQ [21], and Tagcleaner [22] are some of the tools available to filter and trim the low-quality reads that are identified.

An average RNA-seq experiment will yield 300–3000 million (3 × 10⁸ to 3 × 10⁹) reads during a single sequencing run; the read length ranges from 35 to 120 nt in a typical setting. Multiple samples can be sequenced in the same sequencing reaction provided the RNA-seq libraries are appropriately indexed; the indexed reads are demultiplexed and each read is assigned to the corresponding sample.
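As a minimal sketch of the per-cycle quality summary that FastQC automates, the code below reads a FASTQ file (a hypothetical file name) and reports the mean Phred quality at each cycle, assuming the standard Sanger ASCII offset of 33.

```python
# Compute the mean Phred quality at each cycle of a FASTQ file
# (Sanger encoding, ASCII offset 33). "reads.fastq" is hypothetical.
from itertools import islice

def mean_quality_by_cycle(fastq_path, offset=33):
    totals, counts = [], []
    with open(fastq_path) as fh:
        while True:
            record = list(islice(fh, 4))  # FASTQ records span 4 lines
            if len(record) < 4:
                break
            quals = record[3].rstrip("\n")
            for i, ch in enumerate(quals):
                if i == len(totals):
                    totals.append(0)
                    counts.append(0)
                totals[i] += ord(ch) - offset  # decode Phred score
                counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

for cycle, q in enumerate(mean_quality_by_cycle("reads.fastq"), start=1):
    print(f"cycle {cycle}: mean Q = {q:.1f}")
```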



FIGURE 6.1 Schematic of a typical RNA-seq workflow. Briefly, (A) a library of cDNA molecules is constructed from either mRNA or total RNA by reverse transcription. (B) Sequencing adaptors are then added to these cDNA fragments, which are subjected to massively parallel sequencing. (C) The sequences (reads) are then mapped to a reference genome and processed through a bioinformatic pipeline to analyze transcript abundance, fusion, or splicing events. The red boxes in (C) denote exonic reads and the blue boxes represent junction reads.


TABLE 6.2 Summary of Common Software Tools Available for RNA-seq Analysis

Function | Package | URL
Quality assessment | FastQC | http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Quality assessment | FastX | http://hannonlab.cshl.edu/fastx_toolkit/
Quality assessment | PRINSEQ | http://prinseq.sourceforge.net/manual.html
Quality assessment | Tagcleaner | http://tagcleaner.sourceforge.net/
Quality assessment | SAMstat | http://samstat.sourceforge.net/
Mapping without splicing junction detection | SOAP | http://soap.genomics.org.cn/soap1/
Mapping without splicing junction detection | SeqMap | http://www-personal.umich.edu/~jianghui/seqmap/
Mapping without splicing junction detection | Maq | http://maq.sourceforge.net/
Mapping without splicing junction detection | Bowtie | http://bowtie-bio.sourceforge.net/index.shtml
Mapping without splicing junction detection | SHRimP | http://compbio.cs.toronto.edu/shrimp/
Mapping without splicing junction detection | BWA | http://bio-bwa.sourceforge.net/
Mapping with splicing junction detection | Tophat | http://tophat.cbcb.umd.edu/
Mapping with splicing junction detection | SpliceMap | http://www.stanford.edu/group/wonglab/SpliceMap/
Mapping with splicing junction detection | MapSplice | http://www.netlab.uky.edu/p/bioinfo/MapSplice
Mapping with splicing junction detection | HMMSplicer | http://derisilab.ucsf.edu/index.php?software=105
Mapping with splicing junction detection | SOAPsplice | http://soap.genomics.org.cn/soapsplice.html
Transcriptome assembly | Trans-ABySS | http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
Transcriptome assembly | Trinity | http://trinityrnaseq.sourceforge.net/
Transcriptome assembly | Oases | http://www.ebi.ac.uk/~zerbino/oases/
Variant detection | Picard | http://picard.sourceforge.net/
Variant detection | Varscan | http://varscan.sourceforge.net/
Variant detection | Pindel | http://gmt.genome.wustl.edu/packages/pindel/
Variant detection | Breakdancer | http://breakdancer.sourceforge.net/
Differential gene expression analysis | edgeR | http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
Differential gene expression analysis | DESeq | http://www-huber.embl.de/users/anders/DESeq/
Differential gene expression analysis | Cuffdiff | http://cufflinks.cbcb.umd.edu/
Differential gene expression analysis | baySeq | http://www.bioconductor.org/packages/2.8/bioc/html/baySeq.html
Differential gene expression analysis | HTSeq | http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
Gene fusion detection | Chimerascan | http://chimerascan.googlecode.com
Gene fusion detection | FusionSeq | http://archive.gersteinlab.org/proj/rnaseq/fusionseq/
Gene fusion detection | FusionHunter | http://bioen-compbio.bioen.illinois.edu/FusionHunter/
Gene fusion detection | FusionMap | http://www.omicsoft.com/fusionmap
Gene fusion detection | FusionFinder | http://bioinformatics.childhealthresearch.org.au/software/fusionfinder/


Read Mapping Strategies

Typically, RNA-seq studies are used for estimating the level of expression of particular genomic regions such as genes, isoforms, exons, splice junctions, or novel transcribed regions. The first step in this quantitation requires identification of these regions in the sequencing library. Numerous software packages are available to map the short reads to their corresponding genomic locations based on the reference genome or transcriptome, but the complexities of genome sequences directly influence mapping accuracy. There are three principal approaches for mapping, which in order of decreasing complexity are de novo assembly of reads, read alignment to the genome followed by assembly, and read alignment to the transcriptome.

De Novo Read Assembly

De novo read assembly is the most challenging and computationally most intensive of the three mapping strategies. It is particularly useful when a reference genome is not available, or when the annotation of the region under study is of poor quality. For a reliable de novo assembly, long paired-end reads and a high level of coverage are required. By exploiting the overlaps between reads, de novo read assembly finds a set of the longest possible contiguously expressed regions (contigs). In recent years, three main algorithmic strategies have been employed for de novo assembly: prefix tree based, overlap-layout-consensus based, and de Bruijn graph based. Of these, the most prevalent has been the de Bruijn graph representation, which has been adopted by a number of transcriptome assembly programs such as Trinity [23], Trans-ABySS [24], and Oases [25]. A toy version of the de Bruijn strategy is sketched below.

Read Alignment

Genome assembly allows for the discovery of unannotated alternative splicing isoforms and novel transcripts, but is challenging since the tool must have the ability to align reads across splice junctions (Figure 6.2). Several alignment programs that are capable of generating spliced alignments include TopHat [26,27], SpliceMap [28], and SOAPsplice [29]. Specialized programs for the contiguous alignment of short reads to a reference have also been developed, such as Bowtie [9], BWA [10], and SOAP [12,13].
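The following toy sketch illustrates the de Bruijn idea mentioned above: reads are decomposed into k-mers, each k-mer adds an edge between its prefix and suffix (k-1)-mers, and an unbranched path through the graph is read out as a contig. The reads and k value are fabricated for illustration; production assemblers such as Trinity, Trans-ABySS, and Oases add error correction, coverage-based filtering, and isoform resolution.

```python
# Toy de Bruijn graph assembly: build the graph from k-mers, then walk
# an unbranched path to recover a contig.
from collections import defaultdict

def build_graph(reads, k=4):
    graph = defaultdict(set)  # (k-1)-mer -> set of successor (k-1)-mers
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    """Greedily extend an unbranched path from `start` into a contig."""
    contig, node = start, start
    while len(graph[node]) == 1:
        node = next(iter(graph[node]))
        contig += node[-1]
    return contig

reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
g = build_graph(reads)
# Start from a node with no incoming edges (a path source).
targets = {t for succs in g.values() for t in succs}
source = next(n for n in g if n not in targets)
print(walk(g, source))  # reconstructs "ATGGCGTGCAA"
```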


FIGURE 6.2 A typical RNA-seq bioinformatic workflow.

ALIGNMENT TO REFERENCE GENOME

As mentioned above, there are several workflows that are specifically designed for mapping RNA-seq reads. One widely used workflow involves TopHat. TopHat first maps reads that align contiguously (i.e., throughout their entire length) to a genomic region. Most reads will be mapped in this step if the sequencing read length is short. Unmapped reads are split into shorter segments that are independently aligned to the genome to identify possible splice junctions. In this way, TopHat can find novel splicing isoforms that have not been annotated, although with long sequencing read lengths the proportion of junctional reads increases dramatically and this approach becomes inefficient. Limiting the search space used for mapping the split segments based on annotation of gene structures can increase efficiency, but this comes at the cost of missed identification of novel junctions. It is worth noting that alignment to the genome results in a set of one or more genomic coordinates for each mapped read, which may or may not span exon junctions.

The main advantage of contiguous alignment to the transcriptome (Figure 6.3) is its relative simplicity. One caveat, however, is the assumption that the annotated gene models are accurate and complete, since this method of alignment largely excludes the discovery of novel expressed regions. For contiguous alignment to a reference, the most well-known aligner is Bowtie. Bowtie indexes and compresses a genome sequence using a technique called the Burrows–Wheeler transform (BWT). The output of Bowtie is in the SAM format, and (among other information) provides, on a read by read basis, one or more alignment records describing one or more locations to which the read aligns.
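Because SAM is a plain tab-delimited text format, its essential fields can be inspected with a few lines of code. The sketch below (with a hypothetical file name) counts mapped reads per reference sequence in output such as Bowtie's, using the documented SAM columns and the 0x4 "unmapped" FLAG bit.

```python
# Each SAM alignment record is a tab-delimited line whose fields include
# the read name, a bitwise FLAG, the reference name, the 1-based position,
# and the mapping quality. "alignments.sam" is a hypothetical file name.
from collections import Counter

def mapped_reads_per_reference(sam_path):
    counts = Counter()
    with open(sam_path) as fh:
        for line in fh:
            if line.startswith("@"):   # header lines
                continue
            fields = line.rstrip("\n").split("\t")
            flag = int(fields[1])
            if flag & 0x4:             # bit 0x4 set = read unmapped
                continue
            counts[fields[2]] += 1     # field 3 = reference sequence name
    return counts

for ref, n in mapped_reads_per_reference("alignments.sam").most_common():
    print(ref, n)
```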

RNA-Seq Variant Calling and Filtering

Interest in employing RNA-seq data for identifying genomic variants has started to gain momentum, but because of intrinsic transcriptomic complexities such as splicing, it is computationally intensive and therefore remains a challenge. However, especially in settings where a disease sample has paired whole genome and whole exome data along with RNA-seq data, de novo variant calling in RNA-seq data provides an efficient method to validate the findings from the other NGS data sets. Variants can be called using the mapped reads, which are subjected to local realignment, base-quality score recalibration, and candidate-variant calling with available tools such as Picard [30] and SNPiR [31].
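As a deliberately simplified illustration of the candidate-calling step, the sketch below applies fixed depth and allele-fraction thresholds to pileup-style base counts at a single position. The thresholds and counts are arbitrary examples; real callers additionally model base qualities, strand bias, and (for RNA-seq) splice-junction artifacts.

```python
# Toy candidate-variant call: report a site if the non-reference fraction
# and depth clear fixed thresholds. Thresholds here are arbitrary examples.
def call_candidate(ref_base, base_counts, min_depth=20, min_alt_frac=0.2):
    depth = sum(base_counts.values())
    if depth < min_depth:
        return None
    alt, alt_n = max(
        ((b, n) for b, n in base_counts.items() if b != ref_base),
        key=lambda x: x[1], default=(None, 0))
    if alt and alt_n / depth >= min_alt_frac:
        return {"ref": ref_base, "alt": alt, "depth": depth,
                "vaf": round(alt_n / depth, 3)}
    return None

print(call_candidate("A", {"A": 30, "G": 12, "C": 1, "T": 0}))
# -> {'ref': 'A', 'alt': 'G', 'depth': 43, 'vaf': 0.279}
```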



FIGURE 6.3 An example of paired-end read alignment. The paired-end read is aligned to the genome and to the transcriptome. Alignment to the genome requires reads to be mapped across introns (dashed line) compared with the contiguous alignment of reads to the transcriptome.


Expression Estimation: Summarization, Normalization, and Differential Expression

In order to estimate expression levels for a genomic feature (e.g., genes, transcripts, or exons), a record of the number of reads associated with each feature must be tabulated. A simple and crude approach is to record the number of genome-mapped reads overlapping the exons in a gene based upon existing gene models [32–34]. HTSeq [35], Cuffdiff [36], and BEDTools [37] are some of the popularly used quantitation tools for this approach. While the number of genome-mapped reads is a popular metric, and summarizes the gene expression levels of the total gene output, it does not capture changes of expression that may be due to complexities such as alternative splice forms.

For downstream applications of the summarized genome and/or transcriptome-mapped reads, the count data should be normalized to ensure that expression measurements are directly comparable. For comparing gene expression within a sample, normalization is done based on gene length and gene GC content; for comparing gene expression between samples, normalization is done based on the number of reads obtained (library size) as well as RNA composition. In this context, it is important to take into consideration that read counts arising from a transcript are proportional to: (1) the depth of sampling and (2) the length of the transcript. In RNA-seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it; the total number of fragments is related to total library depth, and the number of fragments is also biased toward longer genes. Reads Per Kilobase per Million mapped reads (RPKM) and Fragments Per Kilobase of exon per Million reads sequenced (FPKM) are extensively used as normalized measurements that account for library depth and gene size; FPKM is a double normalization since it normalizes by transcript length and also by the total number of fragments sequenced. The trimmed mean of M-values (TMM) normalization method is used for estimating relative RNA production levels from RNA-seq data, and the method can be used to estimate scale factors between samples. A simpler fraction, Transcripts Per Million (TPM), is used as a technology-independent measure and is sometimes preferred to RPKM/FPKM.
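The normalizations just defined are short enough to state directly in code. The sketch below implements RPKM and TPM from a dictionary of raw counts and transcript lengths; the gene names, counts, and lengths are illustrative only.

```python
# RPKM divides each count by library size (in millions) and transcript
# length (in kilobases); TPM first normalizes counts to per-kilobase
# rates, then scales so the rates sum to one million.
def rpkm(counts, lengths_bp):
    total_reads = sum(counts.values())
    return {g: counts[g] / (total_reads / 1e6) / (lengths_bp[g] / 1e3)
            for g in counts}

def tpm(counts, lengths_bp):
    rates = {g: counts[g] / (lengths_bp[g] / 1e3) for g in counts}
    scale = sum(rates.values()) / 1e6
    return {g: r / scale for g, r in rates.items()}

counts = {"TP53": 900, "GAPDH": 45000, "BRCA1": 300}
lengths = {"TP53": 2500, "GAPDH": 1300, "BRCA1": 7100}
print(rpkm(counts, lengths))
print(tpm(counts, lengths))  # TPM values always sum to 1e6
```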

Differential Expression

The summarized genome and/or transcriptome mapped count data can be further tested for significant differences in transcript abundance between samples. Generally, for two-group comparisons, a test based on the negative binomial distribution is used (as implemented, for example, by DESeq). For multifactor experiments, a generalized linear model (GLM) likelihood ratio test is used (as implemented, for example, by edgeR). Other popular tools for performing statistical testing of the significance of expression measurements include Cufflinks Cuffdiff [36], baySeq [38], DESeq [39], and HTSeq [35]. In recent years, these tools have been successfully applied to the identification of differentially expressed transcripts in different disease groups. For example, an RNA-seq approach was used to identify differentially expressed transcripts between oral cancer and normal tissue samples [40]. In this study, allelic imbalance, which is the ratio of transcripts produced by single alleles, was used to evaluate a subgroup of genes involved in cell differentiation, adhesion, cell motility, and muscle contraction to identify a unique transcriptomic and genomic signature in oral cancer patients.
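As a toy illustration of between-sample, library-size normalization, the sketch below converts two samples to counts per million and reports pseudocount-stabilized log2 fold changes. The gene counts are fabricated, and this shows the normalization arithmetic only; statistically rigorous calls require replicate-aware negative binomial models such as those in DESeq or edgeR.

```python
# Counts-per-million normalization plus a pseudocount-stabilized log2
# fold change between two fabricated samples.
import math

def cpm(counts):
    total = sum(counts.values())
    return {g: n * 1e6 / total for g, n in counts.items()}

def log2_fold_changes(sample_a, sample_b, pseudocount=1.0):
    a, b = cpm(sample_a), cpm(sample_b)
    return {g: math.log2((b[g] + pseudocount) / (a[g] + pseudocount))
            for g in a}

normal = {"MYC": 120, "EGFR": 300, "ACTB": 9000}
tumor = {"MYC": 640, "EGFR": 310, "ACTB": 8800}
for gene, lfc in log2_fold_changes(normal, tumor).items():
    print(f"{gene}\tlog2FC = {lfc:+.2f}")
```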

Fusion Detection

Fusion genes can be detected using either paired- or single-end reads. For paired-end reads, a discordant read pair is one that is not aligned to the reference genome with the expected distance or orientation between the paired sequences (discussed in more detail in Chapter 11). A fusion gene is suggested when a set of discordant read pairs is mapped to two different genes (for paired-end read approaches), or when junction-spanning reads are identified (for both paired-end and single-end read approaches). Note that junction-spanning reads provide information that is complementary to that from discordant read pairs for paired-end read approaches. In any event, detection of fusion genes from fusion junction-spanning reads using raw data or unmapped reads, and detection of fusion genes from inter-transcript paired-end reads, can be performed using the Chimerascan and FusionFinder algorithms. To date, RNA-seq has been used to comprehensively characterize gene fusions in prostate, brain, and breast cancer using both single-end [15,41,42] and paired-end [43] approaches.
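A minimal sketch of the discordant-pair screen just described is shown below: using the standard SAM columns, a paired read whose mate maps to a different reference sequence (RNEXT differing from both "=" and its own RNAME) is tallied as inter-chromosomal evidence. The input file name is hypothetical, and real fusion callers such as Chimerascan add gene annotation, junction-spanning read support, and extensive filtering.

```python
# Tally discordant read pairs (mate on a different reference sequence)
# from a SAM file. "tumor.sam" is a hypothetical file name.
from collections import Counter

def discordant_pair_partners(sam_path):
    partners = Counter()
    with open(sam_path) as fh:
        for line in fh:
            if line.startswith("@"):
                continue
            f = line.rstrip("\n").split("\t")
            flag, rname, rnext = int(f[1]), f[2], f[6]
            if flag & 0x1 and not flag & 0xC:  # paired, both ends mapped
                if rnext not in ("=", rname, "*"):
                    partners[tuple(sorted((rname, rnext)))] += 1
    return partners

for (ref_a, ref_b), n in discordant_pair_partners("tumor.sam").most_common(5):
    print(f"{ref_a} <-> {ref_b}: {n} discordant pairs")
```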

Depth of Coverage Issues

Many factors influence the minimum read depth that is required to adequately address a biological question by NGS techniques in a clinical setting. It is important to realize that experiments designed to measure quantitative changes have requirements that differ from those designed to produce qualitative data. Design of these experiments requires careful consideration of issues that relate to biases in genome structure, transcriptome complexity, and read mappability; to the relative abundance of reads that inform the biological question; and to the trade-off between cost and sequencing depth. Existing bioinformatics tools depend strongly on sequencing depth for differential expression calls. However, in a recent review of the current guidelines and precedents on the issue of coverage for four major study designs, including de novo genome sequencing, genome resequencing, RNA-seq, and ChIP-seq, it was found that the sequencing depth of RNA-seq data sets varied over several orders of magnitude. Recent reports have suggested that in a mammalian genome, about 700 million reads would be required to obtain accurate quantification of >95% of expressed transcripts [44]. However, there has not yet been a systematic analysis of how sequencing coverage affects the accuracy of differential expression calls between samples [45].

The ENCODE consortium has provided data to assess the number of reads required to accurately quantify genes across the dynamic range of FPKM values in human cells [46,47]. From H1 human embryonic stem cells, 214 million paired-end reads of 100 bp each were obtained as part of a saturation analysis to systematically model sequencing depth, and it was shown that for 80% of genes with an FPKM > 10, the abundance estimate based on about 36 million mapped reads was within ±10% of the abundance estimate based on the full data set [26,48]. However, for genes with a lower level of expression (i.e., FPKM < 10), a minimum of about 80 million mapped reads was required for accurate quantitation of abundance.

UTILITY OF RNA-SEQ TO CHARACTERIZE ALTERNATIVE SPLICING EVENTS

Alternative splicing is a very important mechanism in development [49,50], and aberrant splicing can be shown to be a consequence of up to 60% of disease-associated mutations. Additionally, it is known that more than 95% of human genes are alternatively spliced [51]. In view of these important facts, efforts are under way to completely characterize alternative splicing events. Currently, however, a catalog of well-annotated splicing events and their effects on human development and disease does not exist.

UTILITY OF RNA-SEQ FOR GENOMIC STRUCTURAL VARIANT DETECTION

In addition to single nucleotide variation, chimeric gene fusions created as products of chromosomal translocations, insertions, deletions, or chromosomal inversions contribute another level of genomic variation. Additionally, fusions known as "read-through transcripts" or "transcription-induced chimeras" can be formed in neoplastic as well as in normal cells through a common process [15]. In brief, transcription of a gene usually stops at a specific termination point specified by the local nucleotide sequence. However, this mechanism for controlling the activity of RNA polymerase is sometimes bypassed, and transcription continues to the next gene on the same DNA strand, resulting in a chimeric transcript in which the intervening noncoding region between the two genes is removed from the final processed RNA, creating a fused mRNA.

Structural variants are pathognomonic for many cancer types, e.g., the BCR–ABL fusion in chronic myeloid leukemia (CML), and characterization of recurrent pathogenic chimeric fusion events has led to the identification of novel targeted therapeutic approaches in cancer [52–55]. A number of recent studies have described the use of RNA-seq for detection of established as well as novel chimeric transcripts that can be used as cancer biomarkers [56–59]. However, in order to detect fusion genes, the translocation breakpoints must produce a chimeric transcript, since most current methods of RNA-seq will not identify genomic aberrations resulting from promoter substitution. Additionally, genes with a low level of expression are difficult to detect by RNA-seq, and so fusions involving this set of genes are concomitantly hard to identify. Appropriate assay design is required to avoid the high false-negative and false-positive rates that are introduced by complexities related to short reads, but RNA-seq methods have nonetheless been successfully employed in large-scale genome-wide studies to discover multiple nonrecurrent private translocations in many cancer types. However, the clinical utility of identifying these nonrecurrent gene fusion events remains unclear since their impact on therapy and outcome is unknown.

RNA-SEQ: CHALLENGES, PITFALLS, AND OPPORTUNITIES IN CLINICAL APPLICATIONS

Despite the declining costs of NGS and continuous improvements in technology, RNA-seq has not yet been adopted in clinical laboratories because several technical and bioinformatics challenges remain for translating RNA-seq methods into routine clinical diagnostic tests. Technical challenges include sample issues pertaining to both quantity and quality of RNA, especially in cancer samples where there may be a paucity of tumor cells, and where the samples are invariably FFPE tissue. The level of bioinformatics sophistication required for RNA-seq data analysis in a clinical setting often poses a significant impediment. Clinically validated software tools that are highly specific and sensitive for analyzing RNA-seq data do not exist at the time of writing this chapter. Additionally, databases that are required to unequivocally establish the functional and clinical significance of many non-protein-coding changes do not exist. Finally, because of the relatively high evolutionary conservation of regulatory sequences, differentiating between many benign variants and disease-causing mutations can be an extremely challenging task. Many groups are currently working to address these sample and technical issues, and together with refinement of bioinformatic tools, clinical implementation of RNA-seq applications should be possible in the near future.

METHYLOME SEQUENCING

Perturbation of cellular function can occur by many mechanisms that change the sequence of DNA, including point mutations, deletions, insertions, copy number changes, and translocations. In contrast, epigenetic mechanisms do not affect the DNA sequence itself but can still cause heritable or acquired changes in gene activity or function. One of the major epigenetic mechanisms is methylation, which is usually (but not always) associated with the silencing of gene expression. The predominant mechanism for silencing involves recruitment of proteins that preferentially recognize methylated DNA; in turn, these proteins associate with histone deacetylase and chromatin remodeling complexes to cause the stabilization of condensed chromatin. This section will describe how NGS technology can be used to analyze either genome-wide or targeted methylation changes.

Methods for DNA methylation analysis are mostly based on bisulfite conversion, affinity purification of methylated DNA, and methylation-sensitive restriction enzyme digestion. Bisulfite conversion, which deaminates unmethylated cytosines while leaving 5-methyl cytosine unchanged, is the most informative way to analyze DNA methylation patterns (but strict attention to the details of the technique is required to ensure complete conversion). Although Sanger sequencing can be used to analyze PCR products after bisulfite conversion [60], and pyrosequencing [61] or mass spectrometry [62] can be used to quantify the extent of methylation at each cytosine, none of these methods has the ability to distinguish cell-specific methylation patterns.

High-resolution, genome-wide measurement of DNA methylation is possible using whole genome bisulfite sequencing (WGBS, MethylC-seq) [63] or BS-seq [64,65]; reduced-representation bisulfite sequencing (RRBS) [66]; enrichment-based methods such as MeDIP-seq [67,68], MBD-seq [69], and MRE-seq [68]; and single-CpG-resolution DNA methylome analysis such as methylCRF [70]. Collectively, these methods provide wide-ranging genomic CpG coverage, resolution, and quantitative accuracy at low cost. WGBS, for example, allows an unbiased genome-wide analysis of CpG sites with single nucleotide resolution; although comprehensive, it is nonetheless an expensive technique [71].



FIGURE 6.4 A typical bioinformatic workflow for methylome sequencing.

Analysis of the data generated from sequence-based 5-methyl cytosine assays largely consists of sequence alignment and segmentation (Figure 6.4). The primary data are compared with a reference genome assembly to generate a file containing the genomic coordinates of the alignments and their orientation with respect to the reference. Specialized alignment tools have been developed to map raw bisulfite-treated reads; these aligners account for the cytosine to thymine conversion [9,72–75] that results from bisulfite treatment, and then utilize segmentation methods to transform the raw sequence alignments into regions of signal and background. While bisulfite- and restriction-based methods provide single-nucleotide resolution, enrichment-based strategies provide for focused analysis but are limited to the average length of the enriched genomic fragments sequenced (usually in the range of 200 bp).

With the plummeting costs of NGS, and the ability to perform genome-wide DNA methylation analysis, it may be possible in the near future to clinically test differential methylation patterns in germ line (inherited) diseases as well as acquired (somatic) disorders such as cancer. Although large-scale efforts are in progress to decipher epigenetic changes across the entire genome (http://www.epigenome.org and http://www.roadmapepigenomics.org), at the time of this writing a detailed catalog of epigenomic alterations and their clinical significance is not available.
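The counting step that follows bisulfite alignment can be illustrated in a few lines: at a given cytosine, aligned reads showing C were protected (methylated) and reads showing T were converted (unmethylated), so the methylation level is C/(C+T). The positions and pileups below are fabricated for illustration.

```python
# Per-CpG methylation level from pileup-style aligned bases at a cytosine.
def methylation_level(bases_at_cpg):
    c = bases_at_cpg.count("C")   # unconverted -> methylated
    t = bases_at_cpg.count("T")   # converted   -> unmethylated
    return c / (c + t) if (c + t) else None

pileups = {
    "chr1:1,000,233": list("CCCCTCCCTC"),
    "chr1:1,000,251": list("TTTTCTTTTT"),
}
for pos, bases in pileups.items():
    print(pos, f"methylation = {methylation_level(bases):.0%}")
```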

CONCLUSIONS

RNA-seq and methylome sequencing are techniques that provide the opportunity to analyze the transcriptome and the methylome, their complexities, and their relevance to human pathobiology. The last few years have seen an explosion in the application of these technologies in genomic studies of both somatic and inherited disorders. However, many significant technical and bioinformatic hurdles remain before these tools are in routine clinical use. For example, a complete catalog of transcripts in normal and disease states that can be referred to for interpreting the pathogenic significance of RNA-seq data sets is still in early development, as are catalogs of methylation changes in normal and disease states. Nonetheless, these technologies are poised to play a significant role in the personalization of genomic medicine in the near future.

References
[1] Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet 2011;12(10):671–82.
[2] McGettigan PA. Transcriptomics in the RNA-seq era. Curr Opin Chem Biol 2013;17(1):4–11.
[3] Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10(1):57–63.
[4] Wilkerson MD, Cabanski CR, Sun W, Hoadley KA, Walter V, Mose LE, et al. Integrated RNA and DNA sequencing improves mutation detection in low purity tumors. Nucleic Acids Res 2014;42(13):e107.
[5] Cloonan N, Forrest AR, Kolle G, Gardiner BB, Faulkner GJ, Brown MK, et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods 2008;5(7):613–9.
[6] Liu S, Lin L, Jiang P, Wang D, Xing Y. A comparison of RNA-seq and high-density exon array for detecting differential gene expression between closely related species. Nucleic Acids Res 2011;39(2):578–88.
[7] Cabanski CR, Magrini V, Griffith M, Griffith OL, McGrath S, Zhang J, et al. cDNA hybrid capture improves transcriptome analysis on low-input and archived samples. J Mol Diagn 2014;16(4):440–51.
[8] Jiang H, Wong WH. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 2008;24(20):2395–6.
[9] Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
[10] Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 2009;25(14):1754–60.
[11] Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18(11):1851–8.
[12] Li RQ, Li YR, Kristiansen K, Wang J. SOAP: short oligonucleotide alignment program. Bioinformatics 2008;24(5):713–4.
[13] Li RQ, Yu C, Li YR, Lam TW, Yiu SM, Kristiansen K, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009;25(15):1966–7.
[14] Lin H, Zhang ZF, Zhang MQ, Ma B, Li M. ZOOM! Zillions of oligos mapped. Bioinformatics 2008;24(21):2431–7.
[15] Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature 2009;458(7234):97–101.
[16] Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M. SHRiMP: accurate mapping of short color-space reads. PLoS Comput Biol 2009;5(5):e1000386.
[17] Smith AD, Xuan ZY, Zhang MQ. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 2008;9:128.
[18] Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-seq. Bioinformatics 2011;27(17):2325–9.
[19] Andrews S. FastQC: a quality control tool for high throughput sequence data, <http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc>.
[20] Pearson WR, Wood T, Zhang Z, Miller W. Comparison of DNA sequences with protein sequences. Genomics 1997;46(1):24–36.
[21] Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
[22] Schmieder R, Lim YW, Rohwer F, Edwards R. TagCleaner: identification and removal of tag sequences from genomic and metagenomic datasets. BMC Bioinformatics 2010;11:341.
[23] Borodina T, Adjaye J, Sultan M. A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol 2011;500:79–98.
[24] Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, et al. De novo assembly and analysis of RNA-seq data. Nat Methods 2010;7(11):909–12.
[25] Schulz MH, Zerbino DR, Vingron M, Birney E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 2012;28(8):1086–92.
[26] Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 2009;25(9):1105–11.
[27] Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 2013;14(4):R36.
[28] Au KF, Jiang H, Lin L, Xing Y, Wong WH. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res 2010;38(14):4570–8.
[29] Huang S, Zhang J, Li R, Zhang W, He Z, Lam TW, et al. SOAPsplice: genome-wide ab initio detection of splice junctions from RNA-seq data. Front Genet 2011;2:46.
[30] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25(16):2078–9.
[31] Piskol R, Ramaswami G, Li JB. Reliable identification of genomic variants from RNA-seq data. Am J Hum Genet 2013;93(4):641–51.
[32] Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008;18(9):1509–17.
[33] Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-seq experiments. BMC Bioinformatics 2010;11:94.
[34] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods 2008;5(7):621–8.
[35] Anders S. HTSeq: analysing high-throughput sequencing data with Python, <http://www-huber.embl.de/users/anders/HTSeq/>; 2010.
[36] Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 2012;7(3):562–78.
[37] Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010;26(6):841–2.
[38] Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 2010;11:422.
[39] Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol 2010;11(10):R106.
[40] Tuch BB, Laborde RR, Xu X, Gu J, Chung CB, Monighetti CK, et al. Tumor transcriptome sequencing reveals allelic expression imbalances associated with copy number alterations. PLoS One 2010;5(2):e9317.
[41] Levin JZ, Berger MF, Adiconis X, Rogov P, Melnikov A, Fennell T, et al. Targeted next-generation sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts. Genome Biol 2009;10(10):R115.
[42] Zhao Q, Caballero OL, Levy S, Stevenson BJ, Iseli C, de Souza SJ, et al. Transcriptome-guided characterization of genomic rearrangements in a breast cancer cell line. Proc Natl Acad Sci USA 2009;106(6):1886–91.
[43] Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, Luo S, et al. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci USA 2009;106(30):12353–8.
[44] Blencowe BJ, Ahmad S, Lee LJ. Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes. Genes Dev 2009;23(12):1379–86.
[45] Oshlack A, Robinson MD, Young MD. From RNA-seq reads to differential expression results. Genome Biol 2010;11(12):220.
[46] ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 2004;306(5696):636–40.
[47] Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, et al. Landscape of transcription in human cells. Nature 2012;489(7414):101–8.
[48] ENCODE Project Consortium. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 2011;9(4):e1001046.
[49] Nilsen TW, Graveley BR. Expansion of the eukaryotic proteome by alternative splicing. Nature 2010;463(7280):457–63.
[50] Wang GS, Cooper TA. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat Rev Genet 2007;8(10):749–61.
[51] Carninci P. Is sequencing enlightenment ending the dark age of the transcriptome? Nat Methods 2009;6(10):711–3.
[52] Demetri GD, von Mehren M, Blanke CD, Van den Abbeele AD, Eisenberg B, Roberts PJ, et al. Efficacy and safety of imatinib mesylate in advanced gastrointestinal stromal tumors. N Engl J Med 2002;347(7):472–80.
[53] Druker BJ, Guilhot F, O'Brien SG, Gathmann I, Kantarjian H, Gattermann N, et al. Five-year follow-up of patients receiving imatinib for chronic myeloid leukemia. N Engl J Med 2006;355(23):2408–17.
[54] Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, et al. Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 2004;350(21):2129–39.
[55] Slamon DJ, Leyland-Jones B, Shak S, Fuchs H, Paton V, Bajamonde A, et al. Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. N Engl J Med 2001;344(11):783–92.
[56] Bass AJ, Lawrence MS, Brace LE, Ramos AH, Drier Y, Cibulskis K, et al. Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nat Genet 2011;43(10):964–8.
[57] Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, et al. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 2008;40(6):722–9.
[58] Imielinski M, Berger AH, Hammerman PS, Hernandez B, Pugh TJ, Hodis E, et al. Mapping the hallmarks of lung adenocarcinoma with massively parallel sequencing. Cell 2012;150(6):1107–20.
[59] Mohajeri A, Tayebwa J, Collin A, Nilsson J, Magnusson L, von Steyern FV, et al. Comprehensive genetic analysis identifies a pathognomonic NAB2/STAT6 fusion gene, nonrandom secondary genomic imbalances, and a characteristic gene expression profile in solitary fibrous tumor. Genes Chromosomes Cancer 2013;52(10):873–86.
[60] Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.
[61] Tost J, Gut IG. Analysis of gene-specific DNA methylation patterns by pyrosequencing technology. Methods Mol Biol 2007;373:89–102.
[62] Ehrich M, Field JK, Liloglou T, Xinarianos G, Oeth P, Nelson MR, et al. Cytosine methylation profiles as a molecular marker in non-small cell lung cancer. Cancer Res 2006;66(22):10911–8.
[63] Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
[64] Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 2008;452(7184):215–9.
[65] Laurent L, Wong E, Li G, Huynh T, Tsirigos A, Ong CT, et al. Dynamic changes in the human methylome during differentiation. Genome Res 2010;20(3):320–31.
[66] Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, et al. Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 2008;454(7205):766–70.
[67] Weber M, Davies JJ, Wittig D, Oakeley EJ, Haase M, Lam WL, et al. Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat Genet 2005;37(8):853–62.
[68] Maunakea AK, Nagarajan RP, Bilenky M, Ballinger TJ, D'Souza C, Fouse SD, et al. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature 2010;466(7303):253–7.
[69] Serre D, Lee BH, Ting AH. MBD-isolated genome sequencing provides a high-throughput and comprehensive survey of DNA methylation in the human genome. Nucleic Acids Res 2010;38(2):391–9.
[70] Stevens M, Cheng JB, Li D, Xie M, Hong C, Maire CL, et al. Estimating absolute methylation levels at single-CpG resolution from methylation enrichment and restriction enzyme sequencing methods. Genome Res 2013;23(9):1541–53.
[71] Laird PW. Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet 2010;11(3):191–203.
[72] Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
[73] Coarfa C, Yu F, Miller CA, Chen Z, Harris RA, Milosavljevic A. Pash 3.0: a versatile software package for read mapping and integrative analysis of genomic and epigenomic variation using massively parallel DNA sequencing. BMC Bioinformatics 2010;11:572.
[74] Chen PY, Cokus SJ, Pellegrini M. BS Seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11:203.
[75] Xi Y, Bock C, Muller F, Sun D, Meissner A, Li W. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.

List of Acronyms and Abbreviations

RNA-Seq: next-generation RNA-sequencing
RPKM/FPKM: Reads or paired-end Fragments Per Kilobase of exon model per Million mapped reads
TMM: trimmed mean of M-values


SECTION II

BIOINFORMATICS


CHAPTER 7

Base Calling, Read Mapping, and Coverage Analysis

Paul Cliften
Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA

OUTLINE

Introduction
  Library Preparation and Amplification
  Base Calling
  Read Mapping
Platform-Specific Base Calling Methods
  Illumina: Platform and Run Metrics
    Density
    Intensity by Cycle
    QScore Distribution
    QScore Heatmap
    IVC Plot
    %Phasing/Prephasing
    PhiX-Based Quality Metrics
  Torrent: Platform and Run Metrics
    Loading or ISP Density
    Live ISPs
    Library and Test ISPs
    Key Signal
    Clonal
    Usable Sequence
    Test Fragment Metrics
  Illumina: Base Calling
    Template Generation
    Base Calling
    Quality Scoring
  Torrent: Base Calling
    Key Processes
    Postprocessing
  Intrinsic and Platform-Specific Sources of Error
Read Mapping
  Reference Genome
  NGS Alignment Tools
    MAQ
    Bowtie
    BWA
    Novoalign
    MOSAIK
    Isaac
    TMAP
  Sequence Read and Alignment Formats
  Sequence Alignment Factors
  Alignment Processing
Coverage Analysis: Metrics for Assessing Genotype Quality
  Performance and Diagnostic Metrics
    Total Read Number
    Percent of Mapped Reads
    Percent of Read Pairs Aligned
    Percent of Reads in the Exome or Target Region
    Library Fragment Length
    Depth of Coverage
    Percent of Unique Reads
    Target Coverage Graph
Summary
References


KEY CONCEPTS

• Quality sequence data are fundamental to correctly identifying human disease variants.
• Each commercial sequencing platform produces data with specific qualities and intrinsic sources of error that influence accurate identification of sequence variants.
• Sequence read alignment is a critical step in accurately determining the genotype of an individual based on short sequence reads.
• There are many known sources of error in sequence alignment that limit the accuracy of sequence determination with NGS data.
• Quality metrics can be used to demonstrate the completeness and overall quality of genotyping over a target region or genome.

INTRODUCTION

The ability to accurately genotype an individual is essential for personalized medicine. Although patient genotyping has been performed for many years, it has been limited to relatively short loci within the genome. Next-generation sequencing (NGS) has enabled genotyping at an unprecedented scale. Nearly complete genotype sequences are now routinely determined at the genomic [1] and exomic [2] levels, and this information is available for thousands of individuals based on NGS technologies. Genotyping a patient at a genomic scale is currently a complex process that involves four steps: (1) library preparation and template amplification, (2) sequencing or base calling, (3) alignment or mapping of sequence reads, and (4) variant calling. While this chapter focuses on steps 2 and 3, it is important to consider these two steps in the context of the entire process, since errors at any step affect the accuracy of the resulting genotype.

Library Preparation and Amplification

As discussed in Chapter 1, each NGS platform is based on a complex series of processes that impart specific properties to the sequence data, including intrinsic sources of error. To illustrate this principle, it is instructive to compare several properties of NGS data to the Sanger-based sequencing data that were used for the Human Genome Project (HGP). The HGP used a clone-based approach to map the human genome. In brief, the HGP created a reference sequence for each of the 22 autosomal chromosomes and the two sex chromosomes. Libraries of bacterial artificial chromosomes (BACs) were created and mapped to identify a “golden path” of overlapping BAC clones that spanned each of the chromosomes. The BAC inserts (typically 100–200 kb in length) were further sheared to create plasmid libraries that were also sequenced and assembled at a local level. Several significant properties result from this strategy. First, only one haplotype (from one of several individual donors) was represented in the “golden path” and in the resulting human reference sequence. Second, the sequence reads and the resulting human reference assembly contain errors introduced by mutations during the cloning and propagation of the BAC and plasmid clones, as well as any sequencing-related errors. In contrast to the Sanger sequencing used in the HGP, NGS does not require cloning of genomic sequences into vectors. Most NGS methods, however, rely on one or more rounds of polymerase chain reaction (PCR) amplification during library (template) preparation, and most require additional amplification of the template molecules on a bead or a flow cell to provide a sufficient number of “clones” to enable detection of the incorporated bases during the sequencing steps. PCR errors, especially in preparation of the library fragments or in the early rounds of PCR amplification, produce errors in the individual sequence reads. While these errors are not sequencing errors per se, they do affect the accuracy of the resulting genotype.

Base Calling

Accurate determination of the nucleotides within individual sequence reads is the foundation for correctly determining the sequence or genotype of an individual. Sequencing errors can arise from physical sources, such as dust or oil on the flow cell, or from properties of the template DNA sequence, such as homopolymeric or GC-rich stretches of DNA. The physical errors tend to be random and are easily mitigated by using a number of sequence reads to identify the correct base in the sequence. The template-dependent errors are often platform specific and are based on the specific properties of the replication enzymes and the chemistries employed in the sequencing reactions. Knowing the tendencies and weaknesses of the specific NGS platforms enables minimization of these sequencing errors in the genotype calls.

Although there are a few platform-specific sources of sequencing error, NGS platforms used for clinical genotyping are remarkably accurate at the level of base calls. In fact, most NGS genotyping errors originate from the library preparation and template amplification steps or from difficulties in the downstream read alignment steps. Although NGS genotyping relies on the human reference sequence produced by the HGP, it is important to note that NGS genotyping is in some respects more demanding in its scope and its requirement for accuracy than the HGP, since the goal of the HGP was to determine the sequence of only one haplotype (from one of several individual donors). For the HGP, the contiguity and completeness of the reference genome sequence were more important than the base accuracy of the data. NGS genotyping, on the other hand, requires determination of both haplotypes from an individual. The complexity of assessing both haplotypes demands the higher read depths associated with NGS far more than it demands greater accuracy of the sequence read base calls. This fact is often overlooked and perhaps engenders the misconception that NGS base calling is less accurate than it really is.

For the HGP, researchers aimed for a relatively shallow (10–15×) read depth. The sequence depth contributed to the accuracy of the reference sequence, since the consensus of all of the reads could be used to override the random sequence errors in individual reads. The read depth was also required for the contiguity of the assembled reference genome, since extra sequence depth helps to minimize gaps in coverage caused by random sampling of the genome sequence. Also, since sequences were read from the ends of the cloned plasmids (i.e., paired-end sequencing), reads could be used to link contigs within the assembly even if there was no sequence overlap between them. NGS genotyping, on the other hand, requires a much greater depth to ensure that both haplotypes are represented in the sequence data: at a depth of 10 reads, only one of the two haplotypes would be sampled at a given position roughly once every 1,000 (2^10) bases. This may seem tolerable until it is recognized that only one haplotype would be sampled at approximately 50,000 positions in a human exome, and at 3,000,000 positions in a human genome, which would lead to numerous errors in the genotype sequence. Additional sequence depth is required to distinguish under-sampled alleles from background sequencing errors. For example, with a read depth of 10 it would be difficult, if not impossible, to distinguish a heterozygous variant from a random sequencing error if only one of the 10 reads contained a variant base at a given genomic position. Because of this, a depth of 30× is often targeted for whole genome sequencing using NGS technologies.
Higher fold coverage is typically used for exome sequencing, because the exon enrichment processes introduce additional sampling biases into the sequence coverage (some regions of the genome are easier to enrich) and into the allele balance (some alleles are preferentially enriched over others). Even greater sequence depth is required to ensure identification of underrepresented alleles, such as somatic cancer mutations that are found in only a fraction of the cells from which the DNA was isolated.
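The binomial arithmetic behind these depth recommendations is easy to verify directly. The short sketch below (an illustration only, not part of any sequencing pipeline) computes the probability of observing at most k variant-supporting reads at a true heterozygous site, assuming unbiased 50/50 sampling of the two alleles.

```python
from math import comb

def p_at_most(k: int, depth: int, p: float = 0.5) -> float:
    """Binomial probability of seeing <= k variant reads at a given depth
    when the true variant allele fraction is p (0.5 for a germline
    heterozygote)."""
    return sum(comb(depth, i) * p**i * (1 - p)**(depth - i) for i in range(k + 1))

# At 10x depth, one haplotype is completely missed (0 variant reads)
# about once per 1,024 positions, as described above.
print(p_at_most(0, 10))          # ~0.00098

# At 30x depth, fewer than 3 variant-supporting reads -- too few for a
# confident heterozygous call with most callers -- is vanishingly rare.
print(p_at_most(2, 30))          # ~4.3e-07
```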

Read Mapping

In theory, an optimal way to genotype an individual would be to isolate individual chromosomes and sequence each one from end to end. This approach is beyond current technology, but it would be nonrandom (leading to uniform sequence coverage) and it would produce a genotype in which variants are phased along the length of the chromosome. Given the accuracy of Sanger or NGS base calls, a single chromosome would only need to be sequenced to a minimal depth to produce an accurate haplotype. Unfortunately, both Sanger and NGS technologies produce sequence reads that are relatively short. The short read fragments, combined with the complexity of the human genome, render de novo assembly of the sequence data impractical on an exome- or genome-wide scale. Although assembling the short NGS reads into a consensus genome sequence or a collection of exons within an exome has been considered unfeasible, a recent report challenging this assumption looks promising, especially for identifying structural changes [3].

NGS reads are instead “mapped” onto the human reference genome, and the aligned reads are then used to make variant calls (Figure 7.1). Sequence alignment is computationally the most difficult and expensive step of variant analysis; in fact, the enormous amounts of data generated by NGS required the development of faster sequence alignment tools [4]. Today, dozens of alignment tools have been developed to map NGS reads to a reference genome. Because a complete review of all the tools is impossible here, this chapter will instead focus on a few of the most widely used NGS alignment tools and algorithms, as well as a few tools that have been optimized to deal with the unique data properties generated by specific sequencing platforms. As mentioned previously, read mapping is a major source of error in determining the sequence of an individual, and in the context of clinical genotyping, accurate sequence alignment is greatly favored over alignment speed. Variant calls are affected by a number of problems such as mismapped reads, ambiguous read placement, and local misalignment, especially around indels. Structural variations such as inversions or large deletions hinder accurate read mapping. Additionally, the human reference genome is not complete: it contains gaps and sequence errors, and it represents only one of many different human haplotypes. Synthetic human reference sequences that are based on the major alleles of ethnic populations may improve the accuracy of read mapping and genotyping individuals [5].

FIGURE 7.1 A graphical view of sequence reads mapped onto the human genome near the PRNP gene. Sequence reads are represented as thin gray lines with arrows indicating the direction of the reads. Base differences between the sequence reads and the reference genome are indicated. Several sequencing errors are apparent in individual reads. A heterozygous variant (A-G) is also present, and the relative abundance of the “A” (green) and “G” (brown) alleles is indicated as a histogram above the sequence reads. The sequence of the reference genome is shown below the sequence reads with the encoded amino acid sequence of the gene.

PLATFORM-SPECIFIC BASE CALLING METHODS

NGS became available in 2005 with the introduction of the 454 sequencing platform [6]. 454 sequencing technology utilized bead-based emulsion PCR to amplify copies of the DNA template, thus eliminating the expensive and time-consuming step of cloning into plasmids or other vectors to produce enough copies of the template molecule to detect during the sequencing reactions. Additionally, the technology employed a sequencing-by-synthesis approach, which detected bases as they were incorporated into the template DNA, rather than the traditional Sanger approach in which the sequencing reaction was carried out and the sequencing products were then ordered by length using electrophoresis. Combining the bead-based cloning with sequencing-by-synthesis allowed hundreds of thousands of templates to be “read” in a single reaction. During the next several years a number of NGS platforms were commercially developed. These included the SOLiD system [7] developed by Applied Biosystems and the Solexa sequencing instrument that was later rebranded as the Genome Analyzer (GA) by Illumina [8]. In 2007, Roche acquired 454 and started producing the Titanium 454 sequencer. Several other NGS platforms were developed, but these three systems made up the bulk of the early NGS market.

The first of the group to market, the 454 sequencing platform, was quickly adopted by the sequencing community. It produced fairly long reads, similar to standard Sanger sequencing technology, and was capable of producing hundreds of megabases per run. The SOLiD and Illumina instruments, on the other hand, were capable of analyzing millions of sequencing templates, but initially they produced short read fragments of 25–35 bp. The technologies rapidly matured, and read lengths for these instruments increased steadily, reaching about 100–125 bp by the year 2010.

In 2010, Illumina launched its second version of NGS instrument, the HiSeq. With the GA instrument, Illumina had developed a track record of steady improvements in read lengths and data production, and the HiSeq signaled a large leap in capacity with ample opportunity for growth. That same year, the Torrent sequencing concept was introduced. Torrent sequencing technology is based on using semiconductor chips to measure the release of hydrogen ions as nucleotides are incorporated into DNA by a polymerase [9]. The most remarkable features of the proposed sequencing platform were a price tag of less than $50,000 and run times in the hour range. Thus, in 2010, the launch of the HiSeq instrument signaled that NGS had come of age with the promise that it would radically transform the fields of biology and medicine, while the introduction of the Ion Torrent system signified that NGS technologies would soon flow from dedicated sequencing centers to individual labs and clinics.

As with any new technology, the Ion Torrent system took time to fully develop. Ion Torrent’s Personal Genome Machine (PGM) launched in the spring of 2011 with modest sequencing capacity. Improvements were made using denser semiconductor chips, but as a desktop sequencer it still has limited capacity (up to only about 2–3 Gb per run), and so the Ion Proton was developed to provide larger-scale sequencing capacity. Both instruments have short run times, so samples can be processed within several hours. As buzz grew around the Ion Torrent as a fast and inexpensive benchtop sequencer, Illumina focused less on increasing data capacity and more on faster data production, producing the MiSeq, a smaller version of the HiSeq, to compete in the benchtop sequencing market. Four years after its introduction, the HiSeq is still a major player in the NGS field, aided by upgrades such as the HiSeq 2500, which is capable of performing “rapid” run modes as well as the more traditional “high-throughput” run modes. In spring 2014, Illumina introduced two new sequencing platforms, the NextSeq 500 and the HiSeq X Ten. The NextSeq instrument has less sequencing capacity than the HiSeq but is built for more rapid run times. This is achieved in part by new reaction chemistries that utilize two rather than four dyes, thus reducing the time it takes to scan the flow cell to detect nucleotide incorporation. The HiSeq X Ten is a high-capacity sequencing instrument dedicated to human genome sequencing and claims to be the first instrument to produce human genome sequences at a cost of under $1,000.
At this point, essentially two companies produce the platforms employed clinically for NGS-based genotyping. Illumina has several instruments that are suitable for clinical sequencing: the HiSeq, the MiSeq (including the MiSeqDx, an instrument that has been approved by the FDA for diagnostic sequencing), the NextSeq 500, and the HiSeq X Ten. Ion Torrent, owned by Life Technologies, has the Ion PGM and the Ion Proton.

Illumina: Platform and Run Metrics

Illumina provides a common interface, the Sequencing Analysis Viewer (SAV), for viewing important quality metrics for runs on each of its sequencing platforms (http://support.illumina.com/downloads/sequencing_analysis_viewer_user_guide_15020619.html). The SAV software is present on each instrument and can also be installed on personal workstations so that quality metrics can be viewed from remote locations. The run data can be viewed in real time while the run is in progress or after the run is finished. The metrics can be viewed in graphical or tabular form, and by lane or flow cell. Several metrics are most critical to the sequencing run and the downstream analysis.

Density

The density of the DNA clusters on the flow cell is of critical importance to the data production from a lane and a flow cell. At the prescribed cluster density, maximal amounts of data are produced, but if the cluster density is too high, signal from surrounding clusters increases the base call error rate. If the signal-to-noise ratio is too low, individual DNA clusters fail QC checks, and although they are still tracked, they are no longer considered part of the data output. Cluster density is normally reported in thousands of clusters per square millimeter and ranges between 400 and 1,400 K per square millimeter. Each Illumina platform has an optimal cluster density based on the reagents and chemistries employed by the platform. Periodically, Illumina releases newer versions of sequencing chemistries (e.g., the HiSeq is now on version 4). Newer sequencing chemistries generally allow for higher cluster density, and therefore more sequence reads and data per run, but cluster density and data production may be reduced in sequencing chemistries focused on reducing run times.

Intensity by Cycle

The intensity by cycle plot shows the signal intensity for clusters that incorporate a specific base. The intensity is different for each of the four nucleotides depending on the specific fluorophore attached to the base. Signal intensity diminishes, or fades, over time because the DNA molecules within the clusters are released or damaged by the multiple washes and exposure to laser emissions. Illumina platforms make intensity adjustments at specific cycles during a sequencing run.

QScore Distribution

The quality score distribution is a histogram of individual base call quality scores in the run or lane. Illumina requires that a high percentage of bases have a quality score above Q30 as a quality cutoff. If a lane or run falls below the Q30 threshold, it should be resequenced to provide better quality data.

QScore Heatmap

The quality score heatmap provides a view of the distribution of quality scores by cycle. Additionally, each lane can be viewed separately, including the top and bottom surfaces of the flow cell, to see the range of quality scores over each cycle.

IVC Plot

An intensity versus cycle (IVC) plot displays the percentage of clusters that incorporate each of the four bases for every cycle of the sequencing reaction. The base composition of the DNA templates is extremely important in the first four sequencing cycles, during which the DNA clusters are being located on the surfaces of the flow cell. Heavy biases in base composition prevent proper identification of the DNA clusters and can greatly reduce the number of reads that pass filter thresholds and thus contribute to the output of the lane or flow cell. Although this can be a serious issue for specific sequencing applications or for unconventional libraries, libraries made from fragments of human genomic DNA are sufficiently diverse and typically produce high numbers of sequence reads.

%Phasing/Prephasing

Illumina tracks the percentage of molecules within the clusters that have fallen behind (phasing) or moved ahead (prephasing) of the current position within the sequence read. These out-of-phase molecules reduce accuracy because they produce signal for the inappropriate base. Illumina therefore tracks the level of phasing and prephasing and subtracts out the appropriate level of signal before making the base call.

PhiX-Based Quality Metrics

In addition to the intrinsic quality metrics described above, Illumina also tracks a number of metrics that are based on PhiX-174 sequencing libraries spiked (typically at 1%) into each lane of a flow cell. Additionally, before rapid runs were introduced, Illumina required that one of the eight lanes on the flow cell be dedicated to PhiX as a control lane to keep the sequencing run under warranty. PhiX-174 has a small genome of approximately 5.4 kb with a GC content similar to that of the human genome (44% vs. 46% GC, respectively). Read sequences generated from PhiX templates are aligned to a reference PhiX genome by the onboard software. Based on these alignments, the software calculates and tracks (i) the percent of reads aligned, (ii) the rate of sequencing error, and (iii) the percent of perfect reads.


Torrent: Platform and Run Metrics

Sequencing and data management on an Ion Torrent instrument are handled through the Torrent Suite software (for more details see http://ioncommunity.lifetechnologies.com/docs/DOC-8943). The Torrent Browser allows the user to plan and monitor sequencing runs and to view the sequence data after they are generated. Progress can be monitored with a list or a table view. The list view is more visual and provides a rapid assessment of in-progress runs, with thumbnail quality indicators for key metrics. Additionally, the user can set the thresholds for each of the run metrics; metrics that fall below threshold are displayed in red to quickly flag potential problems with the run. Critical metrics for Torrent sequencing are briefly explained below.

Loading or ISP Density

This metric is the percentage of wells loaded with an Ion Sphere Particle (ISP). Briefly, each sequencing chip has a set number of wells, or miniature reaction chambers. ISPs are proprietary micron-sized acrylamide beads on which library fragments have been clonally amplified to provide enough signal to detect base incorporation during the sequencing reaction. When the ISPs are applied to the chip and the chip is briefly centrifuged, the wells are randomly filled with ISPs. Only one ISP can fit in a well, but by chance, some of the wells remain empty. Bead loading is detected by measuring the diffusion rate of a flow with a different pH; since ISPs restrict the flow of solution into the well and thus slightly delay the pH change, the location of wells containing ISPs can be determined electrochemically.

Live ISPs

This is the percentage of ISP-containing wells with sufficient signal strength during the sequencing reaction. Additionally, to be considered “live,” wells must contain a “key” sequence that designates the read as a library or test fragment. Library fragments are marked with a specific four-nucleotide sequence tag; test fragments are marked with a separate sequence tag.

Library and Test ISPs

These are the percentages of “live” ISPs that have the library key signal or the test key signal.

Key Signal

This is the strength of the signal associated with the key sequence.

Clonal

This is the percentage of ISPs that are clonal. An ISP is clonal if all of its attached DNA fragments are replicated from a single template; it is considered polyclonal if it carries two or more templates. Clonality is determined by the fraction of flows that register base incorporation and by the intensity of the signal. For a clonal ISP, one base incorporation signal should result from every cycle of the four bases. When two or more clones are present on an ISP, signal is detected more frequently and the signal is less intense than normal, because only a fraction of the DNA templates will be incorporating the base. Two scores are tracked to help identify polyclonal ISPs. The percent positive flows (PPF) is simply the percentage of flows that result in a nonzero signal; for this metric, any signal above 0.25 is considered a positive flow, and the score is computed for flows 12–72. Over these flows, the software also tracks a sum of squares (SSQ) score that captures the degree to which the flow signals deviate from integer values. For example, an ISP carrying two template molecules would register a score near 0.5 when a base is incorporated into only one of the two template sequences, rather than a normal score near 1.
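For illustration, one plausible formulation of the PPF and SSQ scores just described is sketched below; the exact computation inside Torrent Suite is proprietary, so the windowing and scoring details here are assumptions based on the description above.

```python
# One plausible formulation (assumed, not Torrent Suite's exact code) of
# the two polyclonality scores described above.

def clonality_scores(flow_signals):
    """Return (PPF, SSQ) for one ISP from its per-flow signal values.

    PPF: fraction of flows 12-72 (1-based) with signal above 0.25.
    SSQ: how far the positive signals fall from integer values; clonal
    ISPs give near-integer signals, polyclonal ISPs give fractional ones.
    """
    window = flow_signals[11:72]                  # 1-based flows 12-72
    positive = [s for s in window if s > 0.25]    # threshold from the text
    ppf = len(positive) / len(window)
    ssq = sum((s - round(s)) ** 2 for s in positive)
    return ppf, ssq
```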
Usable Sequence

In addition to filtering polyclonal ISPs, Torrent also filters low-quality and primer-dimer reads; the “usable sequence” is the percentage of library ISPs that pass the polyclonal, low-quality, and primer-dimer filters. The low-quality filter eliminates reads with uncertain base calls; at least some of these reads are generated by ISPs that carry below-optimal amounts of template sequence. The software also scans the reads for the 3′ adapter sequence. Library fragments with a length of less than 8 bp are considered primer dimers and are filtered out.


Test Fragment Metrics

Test fragments are short, known sequences that are spiked into the experimental sample before emulsion PCR. Test fragments are used during analysis to predict phasing and polymerase loss during the run. Typically, Ion Torrent uses 2–4 different test fragments (labeled TF_A, TF_B, TF_C, and TF_D).

Illumina: Base Calling

In theory, base calling on an Illumina instrument is a fairly simple process: after a single nucleotide is incorporated into the elongating DNA chain, the flow cell is scanned for each of the four bases, and the base with the highest intensity is determined to be the incorporated nucleotide. In practice, base calling requires a considerable amount of computational processing. For instance, a single flow cell often contains over a billion DNA clusters that are tightly and randomly packed into a small area, so there is a considerable amount of crosstalk between neighboring DNA clusters due to physical proximity. There is also crosstalk between the nucleotide signals, since the fluorophores attached to each base produce light emissions that overlap with the emission spectra of the other fluorophores. Additionally, as mentioned before, it is difficult to keep all of the molecules in a cluster at the same position of the template; if a nucleotide is not incorporated into a DNA strand within a cluster, the strand lags behind the overall cluster position and provides an inappropriate signal at each subsequent base position. Thus, base calling is a complicated process, and it is not surprising that in the early days of Illumina sequencing a number of third-party tools were developed to “recall” bases from the Illumina GA image data.

Since the introduction of the HiSeq instrument, the sequencing community has spent much less effort recalling bases from Illumina sequencing runs. There are several reasons for this trend. First, Illumina has become very efficient at processing the run data to produce accurate base calls; in addition to accuracy, the process now demands a great amount of speed because of the tremendous amounts of data that are processed in a sequencing run. Additionally, many of the reagents and instrumentation employed by the sequencing instruments are proprietary. Second, with the HiSeq instrument, image files are no longer stored after the sequencing run. When the HiSeq was first introduced, approximately 30–35 Tb of image data were produced per flow cell; this quantity of data is simply too cumbersome and expensive to store for any length of time, so base calls are now processed on the instrument and the image data are kept only as long as needed. Base recalling is possible for a HiSeq run, but it requires analysis of the intensity files that Illumina produces from the processed image data. When the HiSeq was released, the intensity data files occupied about 2 Tb of disk storage, so long-term retention of even the intensity files is cumbersome.

Template Generation

Template generation is the process of identifying the location of each cluster on a flow cell. Currently, Illumina uses image data from the first four cycles to identify the clusters. After each cycle of nucleotide incorporation, two lasers are used to excite the fluorophores: one that excites the A and C fluorophores and one that excites the G and T fluorophores. The signals from each of the four fluorophores are recorded using filters (also known as channels). The images from each of the four channels are used to merge spots into one template. The process relies on the fact that there is crosstalk between the A and C channels and uses this information to help merge images into cluster locations. Several quality checks are also performed at this stage of the analysis. First, a purity filter is applied.
This filter, also referred to as the chastity filter, measures the ratio of the intensity of the brightest base to the sum of the two brightest intensities, using the equation IA/(IA + IB), where IA is the intensity of the brightest base and IB is the intensity of the second-brightest base. Chastity scores range from 0.5 to 1. Any cluster with two chastity scores below 0.6 in the first 25 bases is rejected. Next, the software determines whether the data are of “high” or “low” diversity by determining the probability that two clusters will have the same base calls. In general, low-diversity data hinder template generation. On high-diversity data, a cluster is rejected if it is within 1 pixel of another registered cluster or within 3.5 pixels of a cluster with the same base calls. On low-diversity data, a cluster is rejected if it is within 1.75 pixels of another accepted cluster. While applying the filters, the clusters are ordered according to chastity, so that clusters with the purest signals are processed first.
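A minimal sketch of the chastity computation described above (an illustration under the stated 0.6/25-base rule, not Illumina's actual real-time analysis code):

```python
# Chastity = I_A / (I_A + I_B), where I_A and I_B are the highest and
# second-highest channel intensities for one cluster in one cycle.

def chastity(intensities):
    a, b = sorted(intensities, reverse=True)[:2]
    return a / (a + b)

def passes_filter(per_cycle_intensities):
    """Pass-filter rule described above: at most one cycle with
    chastity < 0.6 within the first 25 cycles."""
    failures = sum(1 for cyc in per_cycle_intensities[:25] if chastity(cyc) < 0.6)
    return failures < 2

# Example: a pure cluster (one dominant channel) vs. a mixed cluster.
print(chastity([900, 50, 30, 20]))   # ~0.95, clean signal
print(chastity([500, 450, 30, 20]))  # ~0.53, likely overlapping clusters
```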

Base Calling

Base calling for each cycle of a sequencing run requires processing each of the image files through registration and image extraction, followed by intensity correction. Registration is the process used to align each image to the template of cluster positions on the flow cell. Image extraction is the process of assigning an intensity value to each DNA cluster from an image. This information is stored in cluster intensity files (cif files). To call bases, intensities must be corrected for channel crosstalk and phasing/prephasing. After those corrections, base calling is a simple determination of the brightest intensity. Once preliminary base calls have been made, Illumina applies a second color filter matrix known as the adaptive color matrix, which can correct for relative intensity shifts of the four color channels over the course of the run or between different portions of the flow cell. Final base calls are made on the fully corrected intensities. If image data are available for a cluster, a base call will always be produced; “N” bases are inserted only when no intensity information was generated. This occurs rather infrequently, most often near the edge of a flow cell where signal can be lost off the edge of the image. Additionally, bubbles within the flow cell prevent proper image generation, but this problem was seen more frequently with previous versions of the HiSeq chemistries.

Quality Scoring

Quality scoring is the process of assigning a score to each base call that indicates the quality, or confidence, of the call. This process originated with Sanger-based DNA sequencing; the score is also known as a Phred score, named after a popular software tool used during the HGP [10]. Phred scores for NGS reads are generally in the 1–40 range and are logarithmically related to base calling error probabilities. Thus, a Phred score of 10 (Q10) refers to a base with a 1 in 10 probability of being incorrect, and a score of 20 equates to an error rate of 1 in 100. Illumina does not use Phred scores of 0, 1, or 2 to indicate base quality. In fact, Illumina uses a run of bases with a quality score of 2 as a “read segment quality control indicator” that signals that the base calls have become so unreliable that they should be discarded.
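The Phred relationship described above, Q = -10 log10(P_error), can be expressed in a few lines; this is standard arithmetic rather than any vendor-specific code:

```python
import math

def phred_to_error_prob(q: int) -> float:
    """Q10 -> 1 in 10, Q20 -> 1 in 100, Q30 -> 1 in 1,000."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> int:
    return round(-10 * math.log10(p))

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))   # 0.1, 0.01, 0.001, 0.0001
```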

Torrent: Base Calling

In many respects, base calling on an Ion Torrent instrument is more straightforward than on Illumina instruments. For instance, each semiconductor chip has a set number of wells, each closely associated with an underlying sensor. With discrete wells there is no need to find DNA clusters randomly generated on a flow cell; instead, each sensor is monitored for the presence of an ISP, and empty wells are masked from subsequent analysis. Additionally, there is no need to store cumbersome image data; instead, the signal from each flow is recorded and must be decoded in a series of technically challenging steps.

II. BIOINFORMATICS

100

7. BASE CALLING, READ MAPPING, AND COVERAGE ANALYSIS

normalization. Adaptive normalization is utilized reiteratively by Solver as the normalized signal and the predicted signal converge at the optimal solution. Once the most likely sequence is found, the remaining differences between the expected and observed signals influence the base quality scores assigned to the reads. Postprocessing Base quality scores are assigned to each base in an Ion Torrent read. These are calibrated on the standard Phred-based scale. The quality scores are assigned by calculating various metrics for the base and read; those metrics are then used as indexes for preestablished lookup tables based on prior system training. As mentioned previously, the software filters out polyclonal and low quality sequence reads and adapter dimers; additionally, Ion Torrent postprocesses reads to remove low quality segments at the 30 end of reads as well as adapter sequences that would hinder downstream alignment processes. Since Ion Torrent read fragments are inherently of varying length, the removal of the 30 sequences does not alter the appearance of the data.
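To make flow-space decoding concrete, the deliberately simplified caller below rounds each normalized flow signal to a homopolymer length. It omits the CF/IE/droop modeling and candidate search that Solver performs, and the flow order shown is an assumption, so it illustrates the concept rather than the Torrent algorithm. It also shows why an eight-base versus nine-base homopolymer is hard to distinguish: the two differ by only one unit of signal.

```python
# Toy flow-space base caller: rounds each normalized flow signal to the
# nearest integer homopolymer length. Real Torrent base calling models
# CF/IE/droop and searches candidate sequences; this sketch does not.

FLOW_ORDER = "TACG"  # assumed cyclic nucleotide flow order

def call_bases(flow_signals):
    seq = []
    for i, signal in enumerate(flow_signals):
        hp_len = round(signal)              # e.g., 1.9 -> homopolymer of 2
        seq.append(FLOW_ORDER[i % 4] * hp_len)
    return "".join(seq)

# Signals near integers decode cleanly...
print(call_bases([1.05, 0.02, 2.1, 0.9]))   # "TCCG"
# ...but a measured 8.4 decodes as an 8-mer, while the true template may
# hold 9 bases -- the origin of homopolymer indel errors on this platform.
print(call_bases([8.4, 0.1, 1.0, 0.0]))     # "TTTTTTTTC"
```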

Intrinsic and Platform-Specific Sources of Error

Each sequencing platform has its own biases and sources of error. NGS has eliminated cloning biases, but it relies on PCR steps that introduce other biases. Regions of high AT or GC content in a genome may be underrepresented in the sequence data depending on the polymerases used for amplification of NGS libraries. It is well known that pyrosequencing and Ion Torrent-based sequencing are prone to indels in runs of homopolymeric DNA, mainly because it is difficult to distinguish the signal of, for example, eight versus nine nucleotide incorporations. Illumina-based sequencing is also prone to sequence-specific errors with inverted repeats and GGC-rich sequences, which contribute to high rates of substitutions that are likely caused by phasing events [11].

READ MAPPING

Reference Genome

Current methods for accurately determining a genotype rely on mapping reads to the human reference genome. Although the HGP was officially declared complete in 2003, improvements continue to be made to the reference genome. These improvements are assembled into the reference genome in increments called “builds.” GRCh38 is the most current human genome build. It is the product of an international team, the Genome Reference Consortium, composed of the Sanger Institute, the Genome Institute, EMBL-EBI, and the National Center for Biotechnology Information (NCBI). Although this build has been available since December 2013, it is still not in heavy use. After the release of a genome build there is generally a lag before it is fully implemented, since annotation of the genes and features of the genome takes time, especially if careful manual annotations are made. Additionally, it takes time for the sequencing community to port the new assemblies into their pipelines, and for biotech companies to reannotate their reagents based on the new coordinates.

Several groups provide access to the human genome builds and to various annotations based on these builds, including NCBI, Ensembl, and the University of California Santa Cruz (UCSC). Since the early days of the HGP, UCSC has housed much of the human genome data. Beginning with their first genome build in May 2000, they have released periodic, sequential updates to the human genome referred to as builds hg1–hg19; with their latest release, hg38, UCSC altered their numbering to be consistent with the GRCh38 designation. The UCSC repository has long been a popular source of genomic data and genome annotations for the research community, and the hg19 build may be the most popular build in use. Recently, Ensembl has gained an ardent following as a data source because they include data patches (sequence updates to the major builds that do not disrupt the chromosomal coordinates) and they release scheduled updates to their genome annotations. In any event, the source of the human reference genome has little if any effect on read alignment, since the different products are built on the same genome assembly; however, there are naming differences between the UCSC and Genome Reference Consortium (GRC) builds for some of the unplaced contigs, so it is necessary to match the genome annotations to the correct genome build.


NGS Alignment Tools

Each run of an NGS instrument produces millions of individual sequence reads that have little meaning on their own. For instance, it is humanly impossible to inspect a random sequence read from a human genome and know which chromosome, let alone which region of the chromosome, the sequence was derived from. To give the data biological context and meaning, the sequences must first be assembled and mapped to a reference genome. The concept of sequence alignment, or sequence comparison, has been around since the first biological sequences were obtained. Fundamentally, a sequence alignment makes it possible to annotate an unknown sequence with known sequences, and makes it possible to detail any differences between the newly generated sequence and other known sequences. The well-known BLAST alignment algorithm [12] was designed specifically for the former. BLAST compares one or more unknown sequences to a large database of annotated sequences to rapidly provide some biological meaning or context to the query sequences; although BLAST can be used to detail differences between two biological sequences, it was designed foremost as a search tool.

Analysis of NGS reads required significant improvements in alignment speed to deal with the raw amount of data. Additionally, a large portion of NGS is focused on resequencing of known organisms. In cases where the target sequence is known, search complexity is reduced because only exact or near-exact matches to the reference genome are of interest, rather than all related homologous sequences; thus, the requirements for sensitivity to remote homology are relaxed. Dozens of algorithms have been developed to map NGS data to a genome, but many of the fundamental approaches are similar. Initially, NGS alignment tools focused on global, or full-length, alignments between the reads and the reference genome, on the assumption that there should be few differences between a read and the reference genome along the entire read length. Many of the newer alignment tools also consider a local alignment if a suitable global alignment cannot be found. This process often rescues information from reads spanning structural rearrangements, where global alignment of the read is not possible.

MAQ

MAQ was an accurate early NGS alignment tool that was used heavily by the research community. Unlike many early tools, MAQ used the base quality scores to improve the accuracy of the alignment. Additionally, MAQ was capable of variant detection. Although MAQ is now outdated for both sequence alignment and variant detection, it was the gold standard for accurate NGS alignment. The developer of MAQ also produced SAMtools [13], a software package for processing sequence alignments and producing variant calls, and the Burrows–Wheeler Aligner (BWA) sequence alignment tool [14] (http://maq.sourceforge.net/).

Bowtie

Bowtie is an NGS aligner based on the Burrows–Wheeler transform. When it was released, it was remarkably fast. The original version of Bowtie did not support gapped or paired-end alignments, so while it could be used for rapidly aligning NGS reads for other applications, it had limited use in sequence alignment for variant detection [15] (http://bowtie-bio.sourceforge.net/index.shtml).

BWA

BWA is one of the most popular NGS aligners in the research community. As its name implies, BWA is based on the Burrows–Wheeler transform.
BWA produces very accurate alignments that are suitable for downstream variant analysis, but it is much faster than MAQ [16] (http://bio-bwa.sourceforge.net/).

Novoalign

Novoalign is a commercial software package distributed by Novocraft Technologies. Like BWA, it is a highly accurate sequence alignment tool and is especially appropriate for alignments that will be used for variant detection. In terms of accuracy, Novoalign is on par with BWA, but the software also provides a number of useful features such as adapter and primer trimming, as well as capabilities for aligning bisulfite-treated DNA to map sites of methylation in the genome. As a commercial package, Novoalign provides customer support and quick responses to potential bugs, which are not common for most academic software. There are free versions of Novoalign, but multithreading and other additional features require the licensed version (http://www.novocraft.com).


MOSAIK

MOSAIK is a relatively new open-source alignment tool that uses a hash clustering method followed by a banded Smith–Waterman algorithm. MOSAIK can align reads from all of the major sequencing platforms and was designed to provide consistent read mapping across the different platforms [17] (https://code.google.com/p/mosaik-aligner/).

Isaac

Isaac is a sequence alignment and variant detection tool developed by Illumina as an alternative to the BWA-GATK pipeline used for variant detection. Isaac employs a large amount of memory (≥48 GB) to produce ultrafast alignments and is reportedly 4–5 times faster than the BWA-GATK pipeline [18] (https://github.com/sequencing/isaac_aligner).

TMAP

TMAP is the Torrent Mapping Alignment Program, designed specifically to work with Ion Torrent data. In particular, it helps to reduce problems with insertions and deletions in homopolymeric sequence runs (https://github.com/iontorrent/TMAP).

Sequence Read and Alignment Formats

The FASTQ format is the standard output of an NGS run. It is based on the FASTA format but adds Phred base quality scores encoded as ASCII characters, so that scores with multiple digits are encoded by a single character. Therefore, for every base there is a corresponding character that encodes the base quality score. Illumina instruments initially produced data in an alternative file format known as the SCARF format but eventually converted to the standard FASTQ format. The standard format for a sequence alignment is a binary alignment map (BAM) file. This format can be rapidly processed by a computer but is not human readable. The human-readable form of the alignment is a sequence alignment map (SAM) file. Although information can be gleaned from a SAM file, the most informative way to view alignment data is with a genome viewer (such as the Integrative Genomics Viewer, IGV) that displays the reads mapped onto the human genome (Figure 7.1). Such views of the aligned sequence data are often crucial for interpreting detected variants and for quickly assessing their validity.
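As a brief illustration of the FASTQ encoding just described, the sketch below decodes a quality string using the Sanger/Illumina 1.8+ convention, in which each character encodes chr(Q + 33); the record itself is invented.

```python
# Decode a FASTQ record's quality string: each ASCII character encodes
# one Phred score as chr(Q + 33) in the Sanger/Illumina 1.8+ convention.

record = [
    "@read1",     # sequence identifier (invented example)
    "GATTACA",    # called bases
    "+",
    "II=5+#!",    # quality string, one character per base
]

header, bases, _, qual = record
scores = [ord(c) - 33 for c in qual]   # [40, 40, 28, 20, 10, 2, 0]
print(list(zip(bases, scores))[:3])    # [('G', 40), ('A', 40), ('T', 28)]
```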

Sequence Alignment Factors

Accurate variant detection requires that reads be accurately aligned to the reference genome, and a number of factors go into this process. First, any exogenous (nonhuman) DNA sequence, such as sequencing adapters, must be removed from the sequence reads. If the adapter sequences are long enough, they can prevent alignment of the read to the reference genome; alternatively, if only a few nucleotides of adapter sequence are present at the end of the read, they will likely contribute mismatched bases that introduce “noise” into the variant detection. Second, many alignment tools recognize exogenous or unreliable data present in a read and remove it. This process is referred to as “hard clipping,” since the trimmed sequence information is not viewed in the aligned data. Adapter sequences and bases marked with the Illumina “read segment quality control indicator” (discussed above) are typically hard clipped. Many aligners also perform “soft clipping,” in which ends of the sequence with some amount of mismatch are retained but do not contribute to the alignment. Soft clipping is thought to improve the detection of indels and to reduce false-positive single nucleotide polymorphism (SNP) calls near indels. For example, if a sequence read spans a deletion near its 3′ end, there is often not enough information to split the alignment and place the last several nucleotides in their correct location on the other side of the deletion; a soft clip would remove these trailing nucleotides. Third, with NGS, read length is a consideration. With short reads, it is possible for reads to map to multiple locations within the genome. This ambiguity causes problems in variant calling, since misplaced reads from a pseudogene or a close paralog can produce spurious sequence variants when aligned at the wrong location. With 100 bp paired-end sequence reads, the vast majority of reads can be uniquely placed within the human genome, but there are still regions of the genome where read placement is ambiguous.

A small percentage of false-positive calls in a genome or exome can be removed by using only uniquely mapping reads, but this reduces sensitivity by eliminating the ambiguous regions from consideration.

Alignment Processing

After NGS reads are aligned to the genome, a common practice is to “left align” indel sequences with respect to the reference genome (i.e., since the position of many indels is ambiguous, they are arbitrarily moved to the most 5′ possible position). This convention allows comparisons between indel calls from different tools, since a single deletion can be designated several different ways (see Chapter 9). The “left align” convention is also consistent with what would be seen in a Sanger-based sequencing trace from an individual read containing the indel. After NGS reads are aligned to the genome, the reads are sorted based on their chromosomal location. Prior to variant calling, duplicate reads are removed from the alignments to prevent PCR or optical duplicates from “jackpotting,” or skewing the allele balance at a specific location of the genome, which would lead to false-positive variant calls. The Picard tools software package is frequently employed to remove duplicate reads.
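A minimal sketch of the left-alignment convention for a deletion inside a repeat follows (invented sequence and coordinates; production tools handle the general insertion and deletion cases):

```python
# Left-align a deletion within a reference sequence: while the base just
# left of the deleted block equals the last deleted base, the deletion can
# slide one position left and still yield the identical alternate sequence.

def left_align_deletion(ref: str, start: int, length: int) -> int:
    """Return the leftmost equivalent start (0-based) for deleting
    ref[start:start+length]."""
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start

ref = "TACACACG"
# Deleting "AC" at position 3 (TAC--ACG) is equivalent to deleting it at
# position 1; the convention reports the most 5' representation.
print(left_align_deletion(ref, 3, 2))   # 1
```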

COVERAGE ANALYSIS: METRICS FOR ASSESSING GENOTYPE QUALITY

Genotype quality based on NGS data varies greatly from position to position within a targeted region, so variant calls are normally associated with a quality score that makes it possible to judge the reliability of the variant call at a specific position of the genome. The read depth, base quality scores, read mapping scores, and the number of variant reads all play significant roles in the variant quality score. While the quality scores indicate the reliability of the variant calls, various quality metrics can be used to assess the overall quality of the genotype over the targeted region and to ensure the quality of the data from individual to individual. The read depth in particular is important for ensuring that there are enough high-quality data to make an accurate variant call at any given position. Unfortunately, some regions of the genome are harder to sequence than others, and these regions tend to be underrepresented in the sequence data across a cohort. Additionally, randomly sequencing a large target using NGS technology frequently leaves regions of low sequence coverage. Higher levels of sequencing help eliminate low-coverage regions, but the returns diminish as sequencing depth increases (Figure 7.2). Thus, the random sequencing approach used in NGS genotyping does not produce a complete genotype across a human exome or genome.
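Under an idealized Poisson model of random sequencing (a simplification; real coverage is less uniform because some regions are harder to sequence, which is partly why Figure 7.2 shows only about 50% of the genome covered at 1× rather than the theoretical 63%), the expected fraction of positions covered at a minimum depth can be computed directly:

```python
import math

# Idealized Poisson coverage model: with mean depth c,
# P(depth = k) = exp(-c) * c**k / k!, so the covered fraction is one
# minus the probability mass below the required depth.

def fraction_covered(mean_depth: float, min_depth: int = 1) -> float:
    below = sum(math.exp(-mean_depth) * mean_depth**k / math.factorial(k)
                for k in range(min_depth))
    return 1 - below

print(fraction_covered(1))        # ~0.63 covered at >=1x with 1x mean depth
print(fraction_covered(30, 10))   # ~1.0; nearly all positions at >=10x with 30x mean
```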

Performance and Diagnostic Metrics Beginning with the quality of the tissue sample that is obtained from the individual, a large number of factors contribute to the quality and accuracy of an NGS genotyping assay. Although the factors that contribute to the 1 0.9 Sequence coverage

0.8 0.7 1 2 3 4 5 6 >=7

0.6 0.5 0.4 0.3 0.2 0.1 0 1

2

3 4 5 6 7 8 9 Fold of human genome as sequencing input

10

II. BIOINFORMATICS

FIGURE 7.2 Representative coverage of the human genome as a function of sequencing depth (fold coverage). At 13 coverage approximately 50% of the genome has a read depth of one or greater. At 103 coverage only 90% of the genome has been covered by sequence reads; y axis, proportion of the genome covered at indicated read depths; x axis, fold sequencing depth. Adapted from Ref. [19].

104

7. BASE CALLING, READ MAPPING, AND COVERAGE ANALYSIS

base quality during the sequencing step were discussed above, many of the processing steps during the DNA isolation and library preparation step can have a large impact on the genotype quality as well. Additionally, most of the defects or problems that arise in sample or library preparation are not detectable until the actual sequence reads are assessed by alignment to the reference genome. This section focuses on downstream assessment of genotype quality that relies on mapping the reads to the genome.

Total Read Number

NGS genotyping of a genome or a targeted region requires a specific number of reads to ensure an adequate level of depth and coverage. The total number of reads is the first indication of whether enough data have been generated, or if more sequence reads are needed. The large capacity of NGS platforms often enables several samples to be sequenced on the same flow cell or even in the same lane of the flow cell; typically, each sample is loaded in equal or prescribed amounts based on how many reads are needed for the specific target region, but the resulting read numbers vary due to slight differences between samples, even among libraries prepared at the same time, and to slight variations in loading concentrations or volumes. Large reductions in read numbers in one sample compared with the others in a lane or flow cell could signal a problem with the quality of that library.

Percent of Mapped Reads

The percentage of reads that map back to the reference genome is also an indication of library and sequence quality. With NGS, 100% map back is never achieved, but with human genomic DNA the percentage of reads mapping to the genome is consistently over 95%, depending on the alignment software, the stringency of the parameters, and the source of the DNA sample. Samples obtained from buccal swabs rather than sterile tissues typically have slightly lower map back rates, presumably because of small amounts of contaminating bacterial DNA; other factors such as the quality of the DNA and high rates of sequencing error can also decrease the percentage of mapped reads. Of particular note, DNA isolated from formalin-fixed, paraffin-embedded (FFPE) tissue is fragmented and may contain cross-links leading to higher rates of nonreproducible sequencing error [20]; while this usually is not a problem, the DNA alterations accumulate with increasing length of fixation and can lead to suboptimal sequencing error rates, especially in older samples and in amplification-based methods of sequencing.

Percent of Read Pairs Aligned

As discussed in the section on sequence alignment above, NGS often utilizes paired-end reads. Statistically, the percentage of aligned read pairs will always be less than the percent of mapped reads, but a large decrease in the aligned read pairs could indicate problems with the library sample, such as untrimmed adapters or the presence of other exogenous DNA sequence that does not match the reference genome. Structural variation within individual samples could also reduce the alignment of read pairs, but unless the targeted region is fairly small or the sample has an unusual level of structural variation, large differences from sample to sample are not expected.

Percent of Reads in the Exome or Target Region

Once reads have been mapped back to the genome, the fraction of the reads that map to the targeted area can be determined, which gives an indication of the enrichment achieved by the targeting procedure. Amplicon-based targeting methods have a high percentage of on-target reads but typically only for a limited region of the genome. Exome capture sequencing may yield over 80% on-target reads, but this varies based on the capture method, the targeted region of the genome, and the freshness of the capture reagents. Although a lower percentage of on-target reads does not necessarily reflect the quality of the genotype data, more sequence reads are required to reach the optimal coverage depth, and many facilities therefore track these data to help ensure that all steps of the library creation process are optimized from run to run and from sample to sample.

Library Fragment Length

Determining the library insert size is a good practice, since shearing of the DNA is a critical step in the library preparation and the size of the DNA fragments has substantial effects on downstream processes such as ligation to the adapter sequences and amplification of the library fragments on the flow cell. It is well known that there are physical barriers to amplifying large fragments on Illumina flow cells, and there is some preference for smaller inserts. Downstream analyses that rely on paired-end read information can be influenced by the size of the library inserts; often these tools rely on the identification of paired reads that span the structural rearrangements,


and if the library inserts are too short, the forward and reverse reads overlap and are physically unable to span sites of rearrangement. Generally, it is preferable that forward and reverse reads do not overlap, especially at lower levels of sequence coverage, since PCR-induced errors could show up in both the forward and the reverse sequence, skewing the variant allele balance.

Depth of Coverage

Depth of sequencing coverage is a straightforward metric to assess the quality of the genotype data (Figure 7.3): as read depth increases, the accuracy of variant calls increases (a relationship that holds true for all types of sequence variants). Although it may be possible to design relatively small, targeted gene panels where every base can be covered by a prescribed level of sequence coverage, this is beyond current technical capabilities for sequencing across the human genome. Even at the exome level, 100% coverage is cost prohibitive. Thus, in genotyping across a large portion of the human genome there is a trade-off between sequence coverage and the cost of the genotyping assay. Minimal levels of coverage are somewhat arbitrary but are usually based on the specific use of the sequence data. For example, in a search for a novel causative mutation likely responsible for a disease or syndrome in whole exome data, the search could easily commence if 80% of the exonic sequence were above the prescribed depth; in fact, the chance of identifying the causative mutation would only increase slightly if 95% of the exonic sequence were above the prescribed depth because, among other factors, the causative mutation might reside outside of the targeted exome sequence, and because the success rate for such searches is generally well below 50% [22–26]. On the other hand, the sequence data should be as close as possible to 100% coverage at the prescribed depth if the goal of the assay is to report every variant within a 50-gene cancer panel. Thus, coverage metrics are generally set more stringently for genomic hotspots in clinical analyses performed for cancer or disease causing genes, or even for important positions within those genes that are known to contribute to disease phenotypes, than for research studies.

FIGURE 7.3 Accurate SNP identification as a function of sequence depth for homozygous and heterozygous SNP calls; y axis, proportion of SNP calls; x axis, sequence depth. Adapted from Ref. [21].

Percent of Unique Reads

In the section above on sequence alignment, the concept of removing duplicate sequence reads as a method for reducing false-positive variant calls was introduced. The percentage of duplicate reads can also be used to roughly assess the quality of the library prep. In general, a good library contains a vast number of unique DNA molecules that uniformly cover the genome or target area; however, in practice, many variables go into making a high-quality library, and each library varies in the number of unique molecules and in coverage or representation of the target region. Most often, high percentages of duplicate reads indicate insufficient amounts of starting DNA or overamplification of the original DNA sample. Alternatively, uneven amplification of the target genome during library preparation could lead to high percentages of duplicate reads. The percentage of duplicate reads typically varies over a fairly narrow range from sample to sample based on the specific steps of library preparation and the method of target enrichment, and so outlying samples can be identified by unusually high levels of PCR duplication. However, it is important to recognize that the algorithms that identify duplicates are built for speed rather than accuracy. Rather than interrogating the sequence of the reads, the algorithms assume that aligned reads with the same start and end location for both the forward and reverse read pairs are PCR or optical duplicates. This assumption is most often correct for low to moderate coverage depth of a large target region such as an exome or the whole genome. However, in small target regions where high sequence depth is needed, the assumption no longer holds, since the likelihood that reads from different input DNA molecules will by chance have the same start and end locations increases. Thus, when high levels of coverage are reached, the percentage of duplicate reads is no longer an indication of library quality.

Target Coverage Graph

A simple analysis of the sequence coverage of a target area is a quick and intuitive way of assessing the initial quality of a sequencing assay, since it is influenced by most of the quality metrics, namely read number, mapped reads, on-target reads, and reads removed as duplicates. Other library issues such as representation biases are often apparent in a coverage graph and may show up as areas of low coverage or regions of extreme depth.
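Most of the metrics described above can be tallied in a single pass over an alignment file. The following is a minimal sketch using the pysam library (the BAM path and the dictionary of target intervals are placeholder inputs; production laboratories typically rely on established metrics collectors such as those in Picard rather than custom scripts):

```python
import pysam

def alignment_qc(bam_path, targets):
    """Tally basic QC metrics from a BAM file.

    targets -- dict mapping chromosome name to a list of (start, end)
               half-open intervals defining the capture target.
    """
    total = mapped = paired = duplicate = on_target = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue  # count each sequenced fragment end only once
            total += 1
            if read.is_unmapped:
                continue
            mapped += 1
            duplicate += read.is_duplicate   # set upstream, e.g., by Picard
            paired += read.is_proper_pair
            # A read is on target if it overlaps any target interval.
            for start, end in targets.get(read.reference_name, []):
                if read.reference_start < end and read.reference_end > start:
                    on_target += 1
                    break
    return {
        "total_reads": total,
        "pct_mapped": 100.0 * mapped / total,
        "pct_proper_pairs": 100.0 * paired / total,
        "pct_duplicates": 100.0 * duplicate / mapped,
        "pct_on_target": 100.0 * on_target / mapped,
    }
```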

SUMMARY

Although patient genotyping has been performed for many years, it has been limited to relatively short loci. NGS has enabled genotyping at an unprecedented scale, but the clinical utility of the approach places a premium on accuracy at all four steps that lead to variant detection: library preparation and template amplification, base calling, alignment/mapping of sequence reads, and variant detection. However, it is important to keep in mind that, in addition to inadequacies in these bioinformatic steps, sequencing errors in NGS data can arise from physical sources, properties of the template DNA sequence such as homopolymeric or GC-rich stretches of DNA, and technical factors that are platform specific. Quality scoring is the process of assigning a score to each base call that indicates the quality or confidence of the call. It is dependent on a number of factors intrinsic to the quality of a sequencing run and to the sequence of the DNA strand. Illumina has a common interface for viewing important quality run metrics on each of its sequencing platforms, called the Sequencing Analysis Viewer (SAV). The SAV software is present on each instrument and can be installed on personal workstations so quality metrics can be viewed from remote locations; the metrics most critical to the sequencing run and the downstream analysis are density, intensity by cycle, Q-score distribution, Q-score heatmap, IVC plot, % phasing/prephasing, and PhiX-based quality metrics. Sequencing and data management on Torrent instruments are processed through the Torrent Suite software. The Torrent Browser allows the user to plan and monitor sequencing runs and to view the sequence data after they are generated; critical metrics for Torrent sequencing are loading or ISP density, live ISPs, library and test ISPs, key signal, clonal, usable sequence, and test fragment metrics. Current NGS technologies produce sequence reads that are relatively short. NGS reads therefore must be aligned (or mapped) onto the human reference genome so the aligned reads can be used to make variant calls. Sequence alignment is computationally the most difficult and expensive step of variant analysis, and is a major source of error. Variant calls are affected by a number of problems such as mismapped reads, ambiguous reads, and local misalignment, especially around indels. Structural variations such as inversions or large deletions also encumber accurate read mapping. Additionally, the reference human genome is not complete; it contains gaps and sequence errors, and it represents only one of many different human haplotypes. A number of factors go into the mapping process. First, any exogenous (nonhuman) DNA sequence such as sequencing adapters must be removed from the sequence reads. After NGS reads are aligned to the genome, the reads are sorted based on their chromosomal location, and prior to variant calling, duplicate reads are removed. For reads assembled into contigs, the assembly is compared by alignment to known sequences to provide biological context and meaning to the data. Dozens of algorithms have been developed to map NGS data to a genome, including MAQ, Bowtie, BWA, Novoalign, MOSAIK, Isaac, and TMAP. However, it is important to note that although the HGP was officially declared completed in 2003, improvements continue to be made to the human


reference genome itself. There are several groups that provide access to the human genome builds, and to various annotations of the human genome that are based on these builds. Genotype quality based on NGS data varies greatly from position to position within a targeted region of the genome, so variant calls themselves are normally associated with a quality score that makes it possible to judge the reliability of the variant call within a specific position. While the quality scores indicate the reliability of the variant calls individually, various quality metrics can be used to assess the overall quality of the genotype over the targeted region and to ensure the quality of the data from individual to individual. These metrics include total read number, percent of mapped reads, percent of read pairs aligned, percent of reads in the exome or target region, library fragment length, depth of coverage, percent of unique reads, and a target coverage graph.

References

[1] Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467:1061–73.
[2] Tennessen JA, Bigham AW, O’Connor TD, Fu W, Kenny EE, Gravel S, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 2012;337:64–9.
[3] Rimmer A, Phan H, Mathieson I, Iqbal Z, Twigg SRF, Wilkie AOM, et al. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nat Genet 2014;46:912–8.
[4] Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nat Methods 2009;6:S6–S12; published online 15 October 2009; corrected after print 6 May 2010. Available from: http://dx.doi.org/10.1038/NMETH.1376.
[5] Dewey FE, Chen R, Cordero SP, Ormond KE, Caleshu C, Karczewski KJ, et al. Phased whole-genome genetic risk in a family quartet using a major allele reference sequence. PLoS Genet 2011;7:e1002280.
[6] Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005;437:376–80.
[7] McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res 2009;19:1527–41.
[8] Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456:53–9.
[9] Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 2011;475:348–52.
[10] Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 1998;8:175–85.
[11] Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res 2011;39:e90.
[12] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990;215:403–10.
[13] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9.
[14] Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18:1851–8.
[15] Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10:R25.
[16] Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 2009;25:1754–60.
[17] Lee W-P, Stromberg MP, Ward A, Stewart C, Garrison EP, Marth GT. MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping. PLoS One 2014;9:e90581.
[18] Raczy C, Petrovski R, Saunders CT, Chorny I, Kruglyak S, Margulies EH, et al. Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms. Bioinformatics 2013;29:2041–3.
[19] Wang W, Wei Z, Lam T-W, Wang J. Next generation sequencing has lower sequence coverage and poorer SNP-detection capability in the regulatory regions. Sci Rep 2011;1:55.
[20] Do H, Dobrovic A. Dramatic reduction of sequence artefacts from DNA isolated from formalin-fixed cancer biopsies by treatment with uracil-DNA glycosylase. Oncotarget 2012;3:546–58.
[21] Meynert AM, Ansari M, FitzPatrick DR, Taylor MS. Variant detection sensitivity and biases in whole genome and exome sequencing. BMC Bioinformatics 2014;15:247.
[22] Németh AH, Kwasniewska AC, Lise S, Parolin Schnekenberg R, Becker EBE, Bera KD, et al. Next generation sequencing for molecular diagnosis of neurological disorders using ataxias as a model. Brain 2013;136:3106–18.
[23] Need AC, Shashi V, Hitomi Y, Schoch K, Shianna KV, McDonald MT, et al. Clinical application of exome sequencing in undiagnosed genetic conditions. J Med Genet 2012;49:353–61.
[24] Shashi V, McConkie-Rosell A, Rosell B, Schoch K, Vellore K, McDonald M, et al. The utility of the traditional medical genetics diagnostic evaluation in the context of next-generation sequencing for undiagnosed genetic disorders. Genet Med 2014;16:176–82.
[25] Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, et al. Clinical whole-exome sequencing for the diagnosis of Mendelian disorders. N Engl J Med 2013;369:1502–11.
[26] Beaulieu CL, Majewski J, Schwartzentruber J, Samuels ME, Fernandez BA, Bernier FP, et al. FORGE Canada Consortium: outcomes of a 2-year national rare-disease gene-discovery project. Am J Hum Genet 2014;94:809–17.


CHAPTER 8

Single Nucleotide Variant Detection Using Next Generation Sequencing

David H. Spencer, Bin Zhang and John Pfeifer
Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA

OUTLINE

Introduction
Sources of SNVs
  Endogenous Sources of Damage Leading to SNVs
    Reactive Oxygen Species
    Spontaneous Chemical Reactions
    Metal Ions
    Errors in DNA Replication
  Exogenous Sources of Damage Leading to SNVs
    Chemical Mutagens
    Radiation
Consequences of SNVs
  SNVs in Coding Regions
    Synonymous SNVs
    Missense SNVs
    Nonsense SNVs
  Consequences of SNVs on RNA Processing
    Altered RNA Splicing
  SNVs in Regulatory Regions
Technical Issues
  Platform
  Target Size
  Target Enrichment Approach
  Library Complexity
  Depth of Sequencing
  Anticipated Sample Purity
  Sample Type
Bioinformatic Approaches for SNV Calling
  Parameters Used for SNV Detection
  High Sensitivity Tools
  Tumor/Normal Analyses
  Implications for Clinical NGS
  Orthogonal Validation
Interpretation of SNVs
  Online Resources and Databases
  Prediction Tools for Missense Variants
  Prediction Tools for Possible Splice Effects
  Kindred Testing
  Paired Tumor-Normal Testing
Reporting
Summary
References

KEY CONCEPTS

• Single nucleotide variants (SNVs) occur when a single nucleotide (e.g., A, T, C, or G) is altered in the DNA sequence. SNVs are by far the most common type of sequence change, and there are a number of endogenous and exogenous sources of damage that lead to the single base pair substitution mutations that create SNVs.

• Selection pressure reduces the overall frequency of single base pair substitutions in coding DNA and in associated regulatory sequences, with the result that the overall SNV rate in coding DNA is much less than that of noncoding DNA.

• The biologic impact of SNVs in coding regions depends on their type (synonymous versus missense), and in noncoding regions depends on their impact on RNA processing or gene regulation.

• The utility of a clinical next generation sequencing (NGS) assay designed to detect SNVs depends on assay design features including an amplification-based versus hybrid capture-based targeted approach, DNA library complexity, depth of sequencing, tumor cellularity (in sequencing of cancer specimens), specimen fixation, and sequencing platform.

• Many popular NGS analysis programs for SNV detection are designed for constitutional genome analysis, where variants occur in either 50% (heterozygous) or 100% (homozygous) of the reads. These prior probabilities are often built into the algorithms, and consequently SNVs with variant allele frequencies (VAFs) falling too far outside the expected range for homozygous and heterozygous variants are often ignored as false positives. Sensitive and specific bioinformatic approaches for acquired SNVs require either significant revision of the software packages designed for constitutional testing or new algorithms altogether.

• Some bioinformatic tools are optimized for very sensitive detection of SNVs in NGS data. These tools require high coverage depth for acceptable performance and rely on spike-in control samples in order to calibrate run-dependent error models, features that must be accounted for in assay design.

• There are a number of online tools that can be used to predict the impact of an SNV and to evaluate whether an SNV has a documented disease association. However, given the lack of standardized annotation formats, and the variability in the level of review performed to establish the associations between a specific genotype and a specific phenotype, putative associations should be carefully reviewed in the context of the published medical literature.

• Guidelines for reporting SNVs detected in constitutional NGS testing have been developed; consensus guidelines for reporting somatic or acquired SNVs are under development.

INTRODUCTION

Single nucleotide variants (SNVs) occur when one nucleotide is substituted for another at a single position in the DNA sequence. Numerically, SNVs are the most common type of sequence change observed in comparisons of one genome to another, and the high density of polymorphic SNVs segregating in the human population (about 1 SNV per 800 bases between a single diploid individual and the reference genome) makes them ideal markers for genetic mapping [1]. These inherited SNVs are generally classified as single nucleotide polymorphisms (SNPs) if they are present at a moderately high frequency in the population (>1%), although many inherited SNVs exist at lower population allele frequencies yet are nonetheless benign polymorphisms with no known disease association. Two of the most useful databases for determining the population allele frequency of these inherited SNVs (both common and rare) are the GenBank database of SNPs (dbSNP), which contains over 38 million human SNPs mapped to unique locations [2–5], and the SNP database assembled by the International HapMap consortium [3,4]. Both databases are useful in clinical next generation sequencing (NGS) testing because they can be used to distinguish benign polymorphisms from candidate disease-causing mutations. While SNPs are polymorphisms that have no direct (or clearly established indirect) association with a specific disease, SNVs that are correlated with disease are often referred to as single base pair mutations or point mutations. These SNVs fall into two categories that have important implications for diagnostic testing. The first category includes SNVs that are associated with constitutional or inherited diseases; these point mutations are present within all cells of a patient (or, in the setting of mosaicism, in the cells of the affected anatomic region). The second general category is somatic SNVs, which are point mutations acquired in somatic tissues and are


associated with a tissue-based phenotype; the most common phenotype associated with somatic mutations is cancer. While both constitutional and somatic mutations may be SNVs, a key difference between the two categories is the proportion of cells affected: mutations in inherited diseases are present in all cells, although they may be heterozygous (50% allele frequency) or homozygous (100% allele frequency), whereas somatic mutations may reside in only a subset of the cells due to tissue heterogeneity or contamination of a tumor with normal, noncancer tissue. These differences can have substantial effects on the design and implementation of NGS assays that aim to detect SNVs. The platforms and bioinformatic pipelines of NGS are ideally suited to the detection of SNVs. In fact, the earliest clinical applications of NGS were designed to detect SNVs in inherited and acquired diseases, and for this reason the bioinformatic pipelines required for sensitive and specific detection of single base pair substitutions are among the most advanced in clinical NGS. The expanding catalog of clinically relevant point mutations (e.g., inherited SNVs in BRCA1/2 that inform prognosis in breast and ovarian cancer, or acquired SNVs in KRAS and EGFR that predict response to targeted therapies such as erlotinib in non-small cell lung cancer) has been an especially important driver of the development of NGS assays [6–8]. Amplification-based and hybrid capture-based approaches can provide comprehensive mutational profiling across a large number of genes simultaneously, with greatly increased efficiency and decreased cost compared with conventional sequencing methods [9]. Indeed, to date, this approach has been successfully implemented in several clinical laboratories for detecting SNVs [10–12].

SOURCES OF SNVs

The single base pair substitutions responsible for SNVs are among the most common DNA mutations. Because most mechanisms produce single base pair substitutions distributed randomly throughout DNA, coding and noncoding regions are about equally susceptible [13]. However, selection pressure reduces the overall frequency of surviving mutations in coding DNA and in associated regulatory sequences, with the result that the overall substitution rate in coding DNA is much less than that of noncoding DNA. Structurally, SNVs fall into two classes. Transitions are substitutions of a purine by a purine, or a pyrimidine by a pyrimidine; transversions are substitutions of a purine by a pyrimidine, or a pyrimidine by a purine. In the human genome, the transition rate is higher. Both transversions and transitions are the end result of a number of different processes, including errors in DNA replication, spontaneous chemical reactions, oxidant damage due to reactive oxygen species (ROS), chemical mutagens, and ionizing radiation.

Endogenous Sources of Damage Leading to SNVs

Reactive Oxygen Species

ROS include •OH, NO, and peroxides, and are produced as a result of normal cellular metabolism, as well as by ionizing radiation, infection, and reperfusion of ischemic tissue. The DNA damage produced by ROS includes sugar and base modifications [14–20] that lead to SNVs (Figure 8.1). Among the oxidative lesions of DNA, 8-oxodeoxyguanosine is a major cause of mutagenesis during replication because it can pair with adenine almost as efficiently as with cytosine, causing G:C to T:A transversion mutations; similarly, misincorporation of 8-oxodeoxyguanosine formed in the nucleotide pool of the cell can cause A:T to C:G transversions.

Spontaneous Chemical Reactions

Most spontaneous chemical changes that alter the structure of nucleotides are deamination and depurination reactions, but spontaneous hydrolysis, alkylation, and adduction reactions also occur [14,20].

Metal Ions

The metals iron, copper, nickel, chromium, magnesium, and cadmium are well-established human carcinogens [15,21,22]. Oxidative damage via metal-catalyzed ROS generation is by far the most important mechanism of DNA damage, although metal-catalyzed reactions also produce metabolites from a variety of organic compounds that form DNA adducts. Metals such as arsenic, cadmium, lead, and nickel also directly inhibit DNA repair, which augments the mutagenic potential of the DNA damage they cause.


FIGURE 8.1 Examples of DNA damage from endogenous sources leading to SNVs. The oxidative lesion 8-oxodeoxyguanosine is a major cause of mutagenesis during replication because it can pair with adenine almost as efficiently as with cytosine, causing G:C to T:A transversions; similarly, misincorporation of 8-oxodeoxyguanosine formed in the nucleotide pool of the cell can cause A:T to C:G transversions. The alkylation reaction that produces O6-methylguanine (methylation at the O6 position of guanine, closed arrow, results in deprotonation of the adjacent N1 nitrogen, open arrow) results in a base with ambiguous base pairing properties; if it pairs with thymine, it causes a G:C to A:T mutation during the next round of DNA replication.

Errors in DNA Replication

The final accuracy of DNA replication depends on the fidelity of the enzymes that polymerize the new DNA strands and on the efficiency of subsequent error-correction mechanisms [23]. In human cells, replicative DNA polymerases have an error rate of approximately 10⁻⁶ to 10⁻⁷ errors per base pair, but mismatch repair lowers the net rate to approximately 10⁻¹⁰ errors per residue. However, even this level of DNA replication fidelity results in a high number of mutations over the lifetime of every person.
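A back-of-the-envelope calculation illustrates the point (round numbers assumed for illustration only):

```python
# Rough estimate of replication errors that escape mismatch repair.
diploid_genome_bp = 6e9   # ~6 billion base pairs copied per cell division
net_error_rate = 1e-10    # errors per base pair after mismatch repair

per_division = diploid_genome_bp * net_error_rate
print(f"~{per_division:.1f} new mutations per cell division")  # ~0.6

# Across the enormous number of cell divisions in a human lifetime
# (often quoted on the order of 10^16), even this tiny per-division
# rate accumulates into a vast number of mutations body-wide.
lifetime_divisions = 1e16  # order-of-magnitude assumption
print(f"~{per_division * lifetime_divisions:.0e} mutations across all cells")
```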

Exogenous Sources of Damage Leading to SNVs

Chemical Mutagens

Although a virtually infinite number of chemicals can damage DNA, a few families of environmental and therapeutic compounds illustrate the general mechanisms involved. With the exception of the 5-methylation of cytosine involved in gene regulation, addition of methyl or ethyl groups to DNA bases (known as alkylation) can be a direct cause of SNVs. For example, the alkylation reaction that produces O6-methylguanine (Figure 8.1) leads to a single base pair mutation during the next round of DNA replication. In contrast, polycyclic aromatic hydrocarbons and related compounds do not cause significant direct DNA damage, but instead are converted by cytochrome P450 into reactive intermediate metabolites that damage DNA through the formation of adducts [15,24–26]. DNA adducts often lead to SNVs via translesion DNA synthesis, the process by which a group of low-fidelity polymerases replicate damaged DNA at sites where conventional replicative polymerases are blocked [27–29]. Cigarette smoke is a particularly rich source of compounds that mediate DNA damage through the formation of DNA adducts. Some antineoplastic agents, such as cisplatin, act primarily by causing intrastrand and interstrand crosslinks. However, cytotoxic alkylating agents used for the treatment of neoplasms, such as cyclophosphamide, busulfan, nitrogen mustard, nitrosourea, and thiotepa, damage DNA through the formation of covalent linkages that produce alkylated nucleotides and SNVs as well as DNA–DNA crosslinks [30,31]. The significance of the genetic damage that accompanies treatment with cytotoxic alkylating agents is indicated by the increased occurrence of secondary malignancies following their therapeutic use.

Radiation

Radiation-induced DNA damage can be classified into two general categories: damage caused by ultraviolet radiation (UV light) and damage caused by ionizing radiation. UVB radiation from sunlight (wavelength of 280–315 nm) induces covalent bonds between adjacent thymine residues in the same strand of DNA, producing bulky intrastrand TT cyclobutane pyrimidine dimers; UVB radiation also generates (6-4) pyrimidine–pyrimidone photoproducts from TC or CC dimers in the same strand of DNA. If unrepaired, all these lesions not only


interfere with transcription but also promote SNVs through error-prone translesion DNA synthesis. Ionizing radiation causes a broad spectrum of DNA damage including individual base lesions, crosslinks, and single- and double-strand breaks [17,32,33].

CONSEQUENCES OF SNVs

SNVs in Coding Regions

Single base pair substitutions in coding DNA are classified as synonymous (silent) or nonsynonymous (of which the two types are missense and nonsense mutations).

Synonymous SNVs

This is the most common type of coding SNV and results in a different codon that still specifies the same amino acid. Although synonymous SNVs usually occur at the third base position of a codon due to third base wobble, base substitutions at the first position of a codon can also give rise to synonymous substitutions. Although traditionally thought to confer no advantage or disadvantage to the cell in which they arise, there are situations in which synonymous substitutions can have a profound effect on the encoded polypeptide, for example by activating a cryptic splice site within an exon, or by causing an altered pattern of exon skipping or inclusion through activation of an exonic splicing enhancer or silencer [34–36].

Missense SNVs

Missense SNVs are nonsynonymous substitutions in which the altered codon specifies a different amino acid. A conservative substitution results in the replacement of one amino acid by another that is chemically similar, and the effect on protein function is usually minimal. A nonconservative substitution results in replacement of one amino acid by another that is chemically dissimilar (typically in charge or hydrophobicity), which often has a deleterious effect on protein function.

Nonsense SNVs

Nonsense SNVs are nonsynonymous substitutions in which a codon specifying an amino acid is replaced by a stop codon. Although single base pair substitutions that produce nonsense mutations are a direct mechanism for substituting a normal codon with a stop codon, a variety of splice site mutations can also introduce a premature termination codon if the altered pattern of exon joining results in a change in the codon reading frame. The level of function of a truncated protein is usually difficult to predict, since it depends on both the extent of the truncation and the stability of the truncated polypeptide, but it is usually reduced. Nonsense SNVs can also produce truncated proteins with deleterious gain-of-function or dominant negative activity.
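The classification of a coding SNV follows mechanically from the genetic code, as in this minimal sketch (standard nuclear codon table; the codon and substitution are assumed to be given in coding-strand orientation):

```python
from itertools import product

# Standard genetic code, built from the compact 64-character table
# ordered by the bases T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_coding_snv(codon: str, position: int, alt_base: str) -> str:
    """Classify a substitution at the given position (0-2) of a codon."""
    ref_aa = CODON_TABLE[codon]
    alt_codon = codon[:position] + alt_base + codon[position + 1:]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"

assert classify_coding_snv("GAA", 2, "G") == "synonymous"  # Glu -> Glu
assert classify_coding_snv("GAA", 1, "C") == "missense"    # Glu -> Ala
assert classify_coding_snv("GAA", 0, "T") == "nonsense"    # Glu -> stop
```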

Consequences of SNVs on RNA Processing

Altered RNA Splicing

SNVs that cause a defect in mRNA splicing appear to represent about 15% of all point mutations that cause human disease [13,34,37] and fall into four main categories. First, SNVs within 5′ or 3′ consensus splice sites result in the production of mature mRNAs that either lack part of the coding sequence or contain extra sequences of intronic origin. However, SNVs in a 5′ or 3′ splice site do not always result in a lack of protein function, as is the case with an SNV in a 5′ splice site of the RAS gene that results in production of a protein with an enhanced level of activity [38]. Second, as noted above, SNVs within an intron or exon can activate cryptic splice sites or create novel splice sites [34,35,37,39]. Third, also as noted above, SNVs in exonic splicing enhancers or exonic splicing silencers can change the pattern of RNA splicing; for example, disease-associated mutations in exonic splicing enhancers of the BRCA1 gene have been described, many of which are silent at the amino acid sequence level [35,36,40]. In fact, almost 15% of point mutations known to cause inherited diseases by defective splicing are apparently due to mutations in exonic splicing enhancers or exonic splicing silencers. Fourth, SNVs at the branch site sequence required for lariat formation can interfere with normal RNA splicing. Such mutations have been reported in association with congenital anomalies and inherited diseases [34,41–47]. Pathogenic SNVs have also been described that affect RNA processing at stages other than RNA splicing, including at the capping site, where they result in either incomplete or unstable transcripts; in the 5′ untranslated region (UTR),


where they exert their effect at the transcriptional or translational level through the creation of new AUG initiation codons; and in the 3′ UTR, at cleavage/polyadenylation signal sequences [48]. In-frame nonsense SNVs at a minimum distance upstream of the last exon–exon junction, also referred to as premature termination codons, can trigger nonsense-mediated mRNA decay (NMD). Because of NMD, these nonsense SNVs may cause disease by loss-of-function or haploinsufficiency mechanisms instead of by producing dominant negative proteins [49].

SNVs in Regulatory Regions

Most SNVs responsible for genetic disease lie within gene coding regions, but some lie in the 5′ flanking sequences that contain constitutive promoter elements, enhancers, repressors, and other gene regulatory elements. For example, mutations in the TATA box of the core promoter and the CACCC motif in the proximal promoter region of the β-globin gene are associated with mild β-thalassemia, while some mutations in the binding sites of cis-acting regulatory proteins of the γ-globin genes result in increased gene expression [50–52]. SNVs in regulatory regions are also thought to contribute significantly to multifactorial, nonmendelian disorders as low-penetrance and modifying factors.

TECHNICAL ISSUES

A number of technical issues upstream of the actual variant detection process can determine the approach used to identify SNVs from NGS data, or can have substantial effects on performance. Therefore, accurate detection of SNVs using an NGS assay begins with the specific design of the test. Variables ranging from the sequencing platform to the anticipated amount of input DNA and the size of the region targeted for sequencing all have an impact on which SNVs can be detected and which bioinformatic tools are necessary to do so. Addressing these considerations, and the others listed below, is the initial step in designing an NGS assay that is tuned for optimal identification of SNVs.

Platform

There are three major platforms currently used in clinical NGS assays: those manufactured by Illumina (including the HiSeq and MiSeq instruments), Life Technologies (including the Ion Torrent PGM systems), and Roche (including the 454 GS Junior and GS FLX systems). Since the platforms vary with respect to their chemistry (see Chapter 1), it is not surprising that they each have different intrinsic error rates (Figure 8.2), and consequently different expected performance for SNV detection. In one comprehensive study evaluating sequencing platform differences [53], the MiSeq instrument had the lowest substitution error rate (about 0.1 substitutions per 100 bases). The Ion Torrent PGM showed a substitution error rate over 10-fold greater, which steadily changes across the read length; however, increased accuracy could be achieved by “clipping” read ends determined to be of low quality [54]. The substitution error rate of the 454 GS Junior was intermediate between the MiSeq and Ion Torrent PGM. Single base pair insertions and deletions are technically not SNVs but rather indels (discussed in detail in Chapter 9). However, indel errors are problematic because they can introduce artifacts at the alignment step of sequence analysis that appear as SNV calls, and so their platform-specific error rate is worth brief mention. Single base pair indel sequencing errors tend to be relatively rare in MiSeq data (<0.001 indels/100 bases) but have a markedly increased frequency in Ion Torrent PGM reads (1.5 indels/100 bases) and 454 GS Junior reads (0.38 indels/100 bases) [53] due to errors at homopolymeric runs. Although the chemistry, detection parameters, and base-calling software for all these platforms are continuously evolving, it is clear that some platforms are more ideally suited for specific clinical applications than others.

FIGURE 8.2 Evaluation of read length and quality from benchtop sequencers. Upper panel: Box plots showing the predicted per base quality score for combined sequencing runs for each benchtop instrument at each read position; gray shaded bands indicate the 10% and 90% quantiles, orange shaded bands indicate the lower and upper quartiles, and the blue dot is the median. Lower panel: Comparison of the predicted and measured accuracy for each benchtop sequencer. Reprinted by permission from Macmillan Publishers Ltd: Nature Biotechnology 2012;30(5):434–9.

Target Size

The size of the genomic region targeted for sequencing has a large impact on the ability (and the approach used) to detect SNVs. Small focused panels targeting specific regions with a high prior probability of harboring


variants of interest can often get by with simple methods, since the potential for false positives is low. In contrast, large panels, exome sequencing, and whole genome sequencing require more rigorous and thorough methods in order to achieve an appropriate balance between sensitivity and positive predictive value.

Target Enrichment Approach

The two major approaches for target enrichment used in NGS assays are polymerase chain reaction (PCR) amplification and hybridization capture with oligonucleotide baits, and each introduces different types of error and bias that may affect the performance of SNV detection. Amplification-based approaches carry with them all the error biases intrinsic to any PCR-based assay, which fall into several categories. First, there are SNVs that result from single base pair substitution errors caused by the thermostable polymerases themselves [14]. An error that by chance occurs in early rounds of amplification (a type of error often referred to as jackpotting) can be particularly problematic, since the sequence change is subsequently incorporated into such a high percentage of amplicons that it is difficult to differentiate from a true sequence alteration, leading to a false-positive result. Such errors are often mitigated by using limited-cycle amplification protocols and multiple, independent amplification reactions. Second, low-input amplifications can result in allelic dropout due to stochastic amplification bias, which may lead to false-negative results due to preferential amplification of a wild type allele at the expense of a mutant allele of interest. Third, the primers for amplification may bind to off-target, paralogous sequences in the genome and result in promiscuous amplification. It has recently been demonstrated that in some settings this type of primer-mediated error can


be present in up to 10% of samples in an amplification-based automated analysis [55]. Fourth, amplification bias within a multiplex PCR can lead to differences in the efficiency with which various templates are amplified and can result from differences in the target sequence as short as a single base [56–58]. This type of bias can impact the allele fraction observed in the data and lead to false-positive or false-negative interpretations of their significance. Hybrid capture-based approaches carry with them all the biases associated with DNA–DNA hybridization, including artifacts related to base composition (i.e., percentage of GC content) and probe length, both of which primarily impact the capture efficiency of the target sequences and thus the depth of sequence that is obtained (which impacts the bioinformatic analysis as discussed below). While most commercial and custom capture probe designs are generated through proprietary vendor software that optimizes probe design and assesses off-target hybridization, in routine practice several iterations of probe design may be necessary, and assay validation must involve analysis to detect false-positive SNVs due to off-target sequences. It is also important to remember that hybrid capture-based approaches involve amplification steps during library preparation (discussed in more detail in Chapter 1), and so even this class of NGS assays is not immune to many of the SNV errors common in amplification-based assays.

Library Complexity

The number of independent DNA template molecules (sometimes referred to as genome equivalents) sequenced in an NGS assay has a profound impact on the sensitivity and specificity of SNV detection. While it is possible to perform NGS analysis of even picogram quantities of DNA [59–61], this technical feat is accomplished by simply increasing the number of amplification cycles during library preparation. However, the information content in 1000 sequence reads derived from one genome is quite different from the information content present in 1000 sequence reads from 1000 different genomes. Thus, library complexity and sequence depth are independent parameters in NGS assay design. Note that it is difficult to directly measure complexity in a DNA library produced by an amplification-based method, since all of the sequence reads in one amplicon have identical size and position; in contrast, the complexity of a DNA library produced by a hybrid capture method is easily assessed, since the sequence reads have different sizes and variable positions reflecting the population of DNA fragments captured during the hybridization step.
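The number of genome equivalents available to a library prep can be estimated directly from the input mass, assuming roughly 3.3 pg of DNA per haploid human genome; a short illustrative calculation:

```python
# Approximate genome equivalents in a library prep input.
PG_PER_HAPLOID_GENOME = 3.3  # ~3.3 pg of DNA per haploid human genome

def genome_equivalents(input_ng: float) -> float:
    return input_ng * 1000.0 / PG_PER_HAPLOID_GENOME

# 10 ng of input DNA contains only ~3000 haploid genome copies, so a
# variant allele present in 5% of those copies is carried by only ~150
# molecules -- no amount of additional PCR or deeper sequencing adds
# information beyond what those starting molecules carry.
print(round(genome_equivalents(10)))  # ~3030
```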

Depth of Sequencing

The greater the number of reads sampled, the more straightforward the bioinformatic interpretation of sequence changes (see below). While it is a very simple assay design issue to ensure sufficient depth of coverage for optimal detection of SNVs, two factors often complicate this feature of assay design. First, as discussed above, simply increasing the depth of coverage through additional amplification cycles can have a negative effect on SNV detection via the impact on library complexity. Second, the size of the target region, ranging from a limited gene panel, to an exome, to a whole genome, directly influences the achievable depth of coverage due to inherent cost considerations.
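The interplay between target size, read output, and mean depth follows from the familiar coverage relationship (mean depth = read count × read length / target size). A short sketch with assumed, purely illustrative numbers:

```python
def mean_coverage(n_reads: float, read_length: int, target_bp: float) -> float:
    """Expected mean depth under the Lander-Waterman model: c = N * L / G."""
    return n_reads * read_length / target_bp

reads = 30e6   # assumed usable reads allocated to one sample
read_len = 100 # assumed bases per read

# The same sequencing output spread over ever-larger targets: a gene
# panel is saturated, while a whole genome barely reaches 1x.
for name, size in [("50-gene panel", 150e3),
                   ("exome", 30e6),
                   ("genome", 3.2e9)]:
    print(f"{name:>14}: ~{mean_coverage(reads, read_len, size):,.0f}x mean depth")
```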

Anticipated Sample Purity

Two different features of tumor samples impact the metrics of SNV detection in cancer specimens. The first is tissue heterogeneity, in that no tumor specimen is composed of 100% neoplastic cells. Instead, cancer samples contain a varying proportion of nonneoplastic cells, including benign parenchymal cells, inflammatory cells, stromal cells, and endothelial cells. Although pathologist review of the tissue sample is required to select the regions of tumor with high cellularity and viability, these estimates are unreliable [62,63]. Laser capture microdissection can be used to achieve greater purity of tumor cells, but the approach is so time consuming that it is poorly suited for routine clinical testing. The second feature is tumor cell heterogeneity, a term that refers to the fact that malignant neoplasms usually demonstrate clonal heterogeneity [64–66]. Consequently, even with a relatively pure tumor sample, the detected SNVs may not be an accurate reflection of the range and frequency of the various SNVs in the tumor.
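For a heterozygous somatic variant in a copy-neutral region, the expected VAF scales directly with tumor purity (and with the fraction of tumor cells carrying the mutation), a relationship simple enough to sketch (an idealized model ignoring copy number alterations):

```python
def expected_vaf(tumor_purity: float, clonal_fraction: float = 1.0) -> float:
    """Expected VAF of a heterozygous somatic SNV in a diploid region.

    Half the tumor-derived alleles carry the variant; subclonality
    (clonal_fraction < 1) reduces the observed fraction further.
    """
    return 0.5 * tumor_purity * clonal_fraction

# A specimen that is only 40% tumor yields an expected VAF of 20% for a
# clonal heterozygous mutation -- and just 8% if the mutation is present
# in only 40% of the tumor cells.
print(expected_vaf(0.4))       # 0.2
print(expected_vaf(0.4, 0.4))  # 0.08
```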



Sample Type

It is well established that formaldehyde reacts with DNA and proteins to form covalent crosslinks, engenders oxidation and deamination reactions, and leads to the formation of cyclic base derivatives [14,67–70]. These chemical modifications have the potential to bias sequence analysis of formalin-fixed, paraffin-embedded (FFPE) tissue samples via inhibition of enzymatic manipulation of DNA or through direct single base pair changes. While chemical changes that result from formalin fixation can create artifacts in low-coverage NGS data sets [71,72], with higher depths of coverage the false-positive SNV rate from FFPE samples is quite small compared with paired fresh samples from the same tumor (0.041% in FFPE versus 0.035% in frozen samples) (Figure 8.3) [73], a frequency that is several orders of magnitude below the cutoff for reporting SNVs in routine clinical practice (as discussed below). Similarly, several studies have demonstrated that, for both amplification-based and hybrid capture-based approaches, alcohol fixation does not induce sequence artifacts at a clinically significant rate [74,75]. The lack of a significant rate of false-positive SNVs has been shown for both ethanol fixed specimens (of the type used in Papanicolaou stains) and methanol fixed specimens (of the type used in Romanowsky stains such as Diff-Quik), a reassuring result since cytology specimens are an increasingly common sample type used for clinical NGS testing.

FIGURE 8.3 Observed spectrum of high-quality base changes in FFPE and frozen NGS data. Panels show the fraction of high-quality nonvariant base changes for each of the 12 possible substitutions; the distributions of the mean frequencies for each possible base change for each sample type show that only the C→T (P = 1.2 × 10⁻⁹) and G→A (P = 3.9 × 10⁻¹⁰) transitions are significantly different between FFPE and frozen samples. The box plots display the median and interquartile range of the per-sample mean frequency for each base change by sample type, with whiskers extending to the last data point within 1.5 times the IQR, and outliers indicated by circles. Reprinted from Spencer et al. [73], with permission from Elsevier.

BIOINFORMATIC APPROACHES FOR SNV CALLING

SNV identification is a fundamental component of many comprehensive NGS analysis software tools, and other programs have been designed specifically for identifying SNVs from NGS data. The algorithms and intended uses of these tools vary, but in general they fall into three broad categories. The first category includes software designed to detect constitutional SNVs that are either homozygous or heterozygous; these tools typically rely on Bayesian algorithms that assume variants are present in either 50% or 100% of the genomes in the sample [54,76]. These prior probabilities are often built into the algorithms, and SNVs with VAFs falling too far outside the expected range for homozygous and heterozygous variants are thus considered poor quality and ignored as false positives rather than inherited variants. While parameter modifications may allow these programs to detect somatic SNVs (which may exist at other VAFs) with reasonable accuracy, the second category includes software packages that use other statistical models [77] or heuristic filtering [78] approaches to achieve better performance when used to detect somatic variants. A third category includes approaches for somatic SNV identification in the setting of paired tumor/normal comparisons [79,80], which is the gold standard approach for genome-wide (or exome-wide) somatic mutation detection in cancer samples because a matched normal sample provides a means to exclude inherited variants, both common and rare. Table 8.1 lists selected SNV identification software (although these tools may be suitable for other bioinformatic functions as well), along with the underlying approach and intended purpose.


TABLE 8.1 Open Source NGS Tools for SNV Detection

Program                                      SNV type         Algorithmic approach                            Reference
UnifiedGenotyper (Genome Analysis Toolkit)   Constitutional   Bayesian                                        [76]
SAMtools                                     Constitutional   Bayesian                                        [81]
FreeBayes                                    Constitutional   Bayesian                                        [82]
VarScan2                                     Somatic          Heuristic filtering, tumor/normal comparison    [78]
SPLINTER                                     Somatic          Large deviation theory                          [77]
MuTect                                       Somatic          Tumor/normal comparison                         [79]
SomaticSniper                                Somatic          Tumor/normal comparison                         [80]
Strelka                                      Somatic          Tumor/normal comparison                         [83]

It is important to emphasize that software optimized to detect SNVs in routine clinical NGS assays is not necessarily optimized for the detection of other classes of variants [73,84–87], and that pipelines that have been optimized for the detection of constitutional SNVs are not necessarily optimized for the detection of somatic SNVs (and vice versa) [54,76]. For this reason, it is imperative to empirically test and validate any SNV calling approach used in a clinical assay to establish its performance on the samples and variants that will be the intended substrate of the test.
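A toy calculation shows why the diploid priors discussed above suppress low-VAF calls. The sketch below uses a deliberately simplified binomial likelihood model — not the actual implementation of GATK, SAMtools, or any caller in Table 8.1 — to score the three diploid genotypes against observed read counts:

```python
from math import comb, log10

def genotype_log10_likelihoods(depth: int, alt_reads: int, error: float = 0.01):
    """log10 P(data | genotype) under a naive binomial model.

    Expected alt-allele fractions: hom-ref ~ error rate, het ~ 0.5,
    hom-alt ~ 1 - error rate. Greatly simplified relative to real callers.
    """
    expected = {"hom_ref": error, "het": 0.5, "hom_alt": 1.0 - error}
    return {gt: log10(comb(depth, alt_reads))
                + alt_reads * log10(p)
                + (depth - alt_reads) * log10(1.0 - p)
            for gt, p in expected.items()}

# 100x depth with 10 variant reads (10% VAF, e.g., a somatic mutation in
# an impure tumor sample): the diploid model strongly prefers hom-ref
# (~ -7 vs ~ -17 log10 units for het), so a constitutional caller will
# tend to discard a real variant as sequencing noise.
print(genotype_log10_likelihoods(depth=100, alt_reads=10))
```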

Parameters Used for SNV Detection

Regardless of whether a bioinformatic tool was developed to detect SNVs in the setting of constitutional disease testing or cancer testing, the NGS platform, and whether an amplification-based or hybrid capture-based library preparation is employed, SNV detection generally relies on several data features and quality metrics to distinguish true signal (i.e., real variants) from noise. These features are assessed over the column of base calls from multiply aligned reads at a single genomic position and include the following metrics, among others:

• The base quality of the variant position within a read, which reflects the probability that the base call is incorrect (discussed in more detail in Chapter 7). These scores are generated by the sequencing instrument following the sequencing run and take the form of log-transformed probabilities (i.e., PHRED scores, analogous to the scores used by PHRED, the base-calling program used for Sanger sequences) [88], which are convenient for representing the base call in the text and binary files that store the sequence data and quality information.

• The mapping quality, or confidence that a specific read is correctly mapped to the genome. Like base qualities, these are log-transformed probabilities that the read is incorrectly mapped. This metric is of particular importance in short-read-length NGS data, where the mapping of a read is uncertain; the base corresponding to the position of a candidate SNV could represent a true sequence variant, or a mismatch due to improper mapping of a read derived from a paralogous region of the genome.

• The strand bias, which is usually expressed as a P-value (commonly resulting from a Fisher exact test) comparing the strand distribution of the variant reads to that of the total reads aligned to a given position.

• Properties of the sequencing reads aligned to a candidate SNV position. These include summary measures of the position of the variant base within the reads, and the base qualities of all discrepant bases within each read. These variables may indicate that a read is improperly mapped, since a large number of high-quality discrepancies and common locations of variant bases within the read are characteristic of improper mapping, rather than true sequence variants.


100

100 97

100 94

97

VarScan2 SPLINTER

99

97

92

89

60 40

49

21 20

Sensitivity (%)

80

100

GATK SAMtools

0

20%

10%

0 0

0

0

50%

5%

Mix proportion FIGURE 8.4 Performance of GATK, SAMtools, VarScan2, and SPLINTER for detecting low-frequency variants in mixed samples at positions with a depth of coverage $1003. Sensitivity for detecting all heterozygous minor “gold standard variants” in samples with mix proportions of 50%, 20%, 10%, and 5% (mean observed gold standard VAFs, 25.5%, 11.2%, 6.8%, and 4.2%, respectively). The indicated sensitivities (true positive/true positive 1 false negative) are point estimates based on detection of all minor gold standard variants at positions with $1003 coverage in each set of mixed samples. Error bars show the 95% binomial Cl for each point estimate. Reprinted from Spencer et al. [85], with permission from Elsevier.

• The local sequence environment, both in the read and at the genomic locus in question. Tools may incorporate a probabilistic assessment that a variant base is properly aligned to the genome (known as base alignment quality [89]), the length of nearby homopolymeric sequences if present, and so on.
• The depth of coverage and number of reads supporting the candidate variant call. Tools may filter out SNVs from regions of low coverage, or SNVs supported by fewer reads than some threshold, since all NGS platforms have an intrinsic error rate; in regions of low coverage it is difficult to determine whether a candidate SNV represents a random error or a true variant.

While there are currently no guidelines or recommendations on where to set thresholds for the above parameters, or on which ones are more valuable than others, each software package has different default values that may be tuned to specific applications. Two examples of general-purpose callers with variable performance at detecting low-frequency variants are SAMtools and GATK. Although these callers have similar performance at detecting constitutional SNVs, studies using simulated and real data have shown that they have significant weaknesses when directly applied to NGS data from cancer samples. One study specifically assessed the ability of commonly used NGS analysis programs to detect low-frequency variants in high-coverage (>1,000×) targeted NGS data using mixtures of HapMap samples [85]. Sequencing showed no evidence of bias against nonreference alleles in targeted hybridization-capture NGS data, but analysis of variant calls at known “gold standard” variant positions revealed substantial variability in sensitivity for low-frequency variants across the programs when run with their default parameters (Figure 8.4). The variant identification function of the popular and widely used SAMtools program is particularly insensitive to low-frequency variants and would be poorly suited to a bioinformatic pipeline for de novo detection of somatic mutations in cancer specimens. In contrast, the GATK variant caller had very good sensitivity (97%) for variants with VAFs of about 10% but was unable to detect variants present at lower frequencies. Two other callers that were designed for low-frequency variant detection [77,78] showed the best performance. This study also showed that detection of SNVs is dependent on coverage (Figure 8.5), with most of the missed calls occurring at positions with relatively low coverage and the most robust detection occurring at a minimum coverage of about 400× (however, since coverage in targeted NGS data can be quite variable, even higher mean target coverage is required to ensure that 400× is achieved at all critical positions in an NGS assay). Although higher coverage improved sensitivity, it was associated with more false-positive variant calls, especially for VarScan2, although the specificity could be improved by filtering sequencing reads to eliminate those with low-quality and questionable alignments. SPLINTER showed the opposite trend, with a large number of false-positive calls in the low-coverage data and few at higher coverage levels. Nonetheless, the positive predictive value of all callers for coding region variants was very high. Of note, when the same programs were used to detect variants in a set of targeted sequence data from lung adenocarcinoma samples rather than artificially created mixtures of HapMap samples, similar performance and results were obtained [85].

FIGURE 8.5 Sensitivity of (A) GATK, (B) SAMtools, (C) VarScan2, and (D) SPLINTER for low-frequency variants as a function of observed coverage at variant positions. Sequencing reads from mixed samples were randomly sampled to obtain data sets with estimated mean coverage depths of 1500, 1250, 1000, 750, 500, 400, 200, and 100 across the entire target region for each of the mixed samples. The observed coverage depths were determined for all minor gold standard variants, and variant detection was performed using each of the four programs. Panels show the overall sensitivity for all variants from each mixed sample by observed coverage in bins of 100 for GATK, SAMtools, VarScan2, and SPLINTER. Error bars show the 95% binomial CI for each point estimate. Reprinted from Spencer et al. [85], with permission from Elsevier.
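To make the interplay of these filters concrete, the following minimal sketch applies the metrics described above to a single pileup position. It is a hypothetical illustration, not the logic of any published caller: the SiteCounts structure and every threshold are assumptions chosen for readability, and SciPy's fisher_exact supplies the strand-bias test.

```python
from dataclasses import dataclass
from scipy.stats import fisher_exact  # Fisher exact test for strand bias

@dataclass
class SiteCounts:
    ref_fwd: int           # reference-supporting reads on the forward strand
    ref_rev: int           # reference-supporting reads on the reverse strand
    alt_fwd: int           # variant-supporting reads on the forward strand
    alt_rev: int           # variant-supporting reads on the reverse strand
    mean_alt_baseq: float  # mean PHRED base quality of the variant base calls

def phred_to_prob(q: float) -> float:
    """PHRED scores are log-transformed error probabilities: Q = -10*log10(P)."""
    return 10 ** (-q / 10)

def passes_filters(site: SiteCounts, min_depth: int = 400, min_vaf: float = 0.02,
                   max_baseq_error: float = 0.01, min_strand_p: float = 0.001) -> bool:
    depth = site.ref_fwd + site.ref_rev + site.alt_fwd + site.alt_rev
    alt = site.alt_fwd + site.alt_rev
    if depth < min_depth or alt / depth < min_vaf:
        return False  # insufficient coverage or too few supporting reads
    if phred_to_prob(site.mean_alt_baseq) > max_baseq_error:
        return False  # variant base calls are low confidence
    # Strand bias: compare the strand distribution of variant vs. reference reads
    _, p = fisher_exact([[site.ref_fwd, site.ref_rev],
                         [site.alt_fwd, site.alt_rev]])
    return p >= min_strand_p  # reject calls with strongly strand-biased support

# A 500x site with 25 variant reads (5% VAF) split evenly across strands passes:
print(passes_filters(SiteCounts(240, 235, 13, 12, mean_alt_baseq=30)))  # True
```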


High Sensitivity Tools
Although the study discussed above tested four variant identification programs, other bioinformatic tools have been developed for detection of SNVs that perform differently (and undoubtedly more will follow). However, any software that uses a Bayesian algorithm designed for constitutional variants (and there are many that fall into this category) is expected to have limited sensitivity, since such methods anticipate only homozygous and heterozygous variants [90–92]. In contrast, some bioinformatic tools are optimized for very sensitive detection of SNVs in NGS data. SPLINTER is one such specialized tool: its sensitivity at low VAFs (i.e., in the range of 4%) is higher than that of GATK and SAMtools, and similar to that of VarScan2, while maintaining higher specificity. However, the algorithm used by SPLINTER requires high coverage depth for acceptable performance [77] and relies on spike-in control samples to calibrate run-dependent error models. These features of the tool must be accounted for in assay design, although control sequences that are already part of the standard NGS workflow (e.g., the phiX sequencing control on Illumina platforms) can double as SPLINTER calibration controls without additional cost or inputs. Nonetheless, among laboratories that perform NGS of tumor samples, the lower limit of SNV VAFs with clear clinical utility is currently considered to be in the range of 5%. Somatic mutations certainly exist at lower frequencies, especially in the setting of minimal residual disease (MRD) testing, in which accurate detection of variants at frequencies substantially less than 1% is required [93]. While it is possible that some SNV identification tools may have sensitivity at VAFs of 1% (or even lower), it is likely that the current intrinsic error rates of NGS platforms make them poorly suited for reliable discovery of variants much below about 2% without compromising specificity. In practice, many laboratories therefore include manual review of base counts at specific “hotspot” loci with a high a priori likelihood of somatic mutation (e.g., KRAS codon 12, BRAF codon 600) as a means for sensitive detection of critical variants without the large number of false positives that would result if a high-sensitivity approach were broadly applied across the targeted genomic regions. It is likely that recent advances in NGS methods that employ “molecular tags” to increase sequencing accuracy will ultimately replace these ad hoc methods and permit reliable detection of low-frequency variants for cancer testing and MRD assays [94–96].
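The coverage dependence of this detection floor follows directly from binomial statistics. The sketch below, with an assumed per-base error rate, depth values, and significance cutoff chosen purely for illustration, asks how many variant-supporting reads must be observed before sequencing error alone becomes an implausible explanation:

```python
from scipy.stats import binom

def min_supporting_reads(depth: int, error_rate: float, alpha: float = 1e-6) -> int:
    """Smallest k such that P(X >= k) < alpha when X ~ Binomial(depth, error_rate),
    i.e., the support needed to reject "error alone" at significance alpha."""
    k = 0
    while binom.sf(k - 1, depth, error_rate) >= alpha:
        k += 1
    return k

# With an assumed 0.5% per-base error rate, the smallest confidently
# detectable VAF shrinks as depth grows, but stays near the 1-3% range:
for depth in (400, 1000, 5000):
    k = min_supporting_reads(depth, error_rate=0.005)
    print(f"depth {depth}: need {k} reads -> VAF floor ~{k / depth:.1%}")
```

Under these assumptions the floor sits near 3% at 400× coverage; pushing it well below 1% requires either much deeper coverage or error-suppression strategies such as the molecular tagging mentioned above.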

Tumor/Normal Analyses
Additional issues arise in testing for somatic SNVs that relies on comparisons of paired samples from the same individual, such as those involving data from tumor and normal tissue. Depending on the target size and other technical issues (described above), a variety of analytic approaches can be employed, ranging from simply subtracting the variants identified in the normal sample from those identified in the tumor, to sophisticated programs that analyze both samples jointly and produce variant sets enriched for tumor-specific events. While the subtraction approach may provide adequate performance in some situations, normal tissue can sometimes be “contaminated” with tumor cells, yielding a low background of tumor-associated mutations in the normal data; true somatic mutations could therefore be missed in the tumor (having been detected in the normal sample and so “subtracted” from the tumor). Therefore, tumor/normal comparisons should be performed on a pure normal sample, verified by tissue histology or some other method, and the analysis method should be capable of accommodating some level of contamination.
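A minimal sketch of the two comparison strategies, with hypothetical variant records and thresholds, illustrates why naive subtraction is fragile in the face of tumor-in-normal contamination:

```python
def somatic_by_subtraction(tumor_calls: set, normal_calls: set) -> set:
    """Naive approach: report anything called in the tumor but not in the normal.
    A real somatic variant that leaks into the normal at low level is lost."""
    return tumor_calls - normal_calls

def somatic_contamination_aware(tumor_vafs: dict, normal_vafs: dict,
                                min_tumor_vaf: float = 0.05,
                                max_normal_vaf: float = 0.02) -> set:
    """Tolerant approach: keep variants well supported in the tumor and only
    marginally present in the normal (thresholds here are illustrative)."""
    return {v for v, vaf in tumor_vafs.items()
            if vaf >= min_tumor_vaf and normal_vafs.get(v, 0.0) <= max_normal_vaf}

tumor = {"KRAS p.G12D": 0.32, "TP53 p.R175H": 0.28}
normal = {"TP53 p.R175H": 0.01}  # low-level contamination of the normal sample
print(somatic_by_subtraction(set(tumor), set(normal)))  # {'KRAS p.G12D'} -- TP53 lost
print(somatic_contamination_aware(tumor, normal))       # both variants retained
```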

Implications for Clinical NGS
The comparative performance of the various bioinformatic tools for SNV detection provides some guidance to clinical laboratories for the design and implementation of NGS-based assays for somatic mutations. While the general features of the tools would likely be discovered early in the validation process of a clinical test, the range of mix proportions and the coverage analysis provided in published reports [77,84,85,94] give laboratories a clear indication of the types of issues encountered when common analysis tools are applied to clinical NGS tests. While the detailed information discussed above applies specifically to a deep-coverage, hybrid capture-based targeted NGS approach, similar bioinformatic issues arise with amplification-based targeted NGS approaches, as well as with exome and whole genome tests [97]. A mixing study design using well-characterized HapMap samples is an example of the type of experiment that should be implemented for initial test validation, as well as for routine quality assurance (QA)/quality control (QC) in the clinical laboratory, for any NGS test for SNVs, regardless of the platform, approach, or loci to be evaluated.
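For a mixing study of this kind, the expected VAFs follow directly from the mix proportions. The snippet below sketches the arithmetic under the simplifying assumption that each minor gold standard variant is heterozygous and private to the minor sample; as the study discussed above found, observed VAFs can deviate somewhat from these ideals (e.g., 4.2% observed for the 5% mix):

```python
def expected_vaf(mix_proportion: float, genotype_vaf: float = 0.5) -> float:
    """Expected VAF of a heterozygous variant private to the minor sample."""
    return mix_proportion * genotype_vaf

for p in (0.50, 0.20, 0.10, 0.05):
    print(f"{p:.0%} minor sample -> expected heterozygous VAF {expected_vaf(p):.1%}")
```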


Orthogonal Validation
Conventional approaches that have been used to detect SNVs can be used for orthogonal validation of NGS test results, including restriction fragment length polymorphism (RFLP) analysis, allele-specific PCR, Sanger sequencing, and SNP arrays. While each approach has its own advantages and disadvantages, several issues are common to all orthogonal validation approaches. First, although the lower limit of sensitivity of optimized conventional approaches is similar to that of routine NGS tests for SNVs, enhanced NGS bioinformatic pipelines enable detection of variants present at a frequency of less than 1% (as discussed above), a level of sensitivity significantly better than can be achieved by conventional techniques. Second, some of the discrepancies between SNVs detected by NGS assays and an orthogonal validation method may actually represent tumor heterogeneity and/or variations in tumor content rather than technical errors. Third, orthogonal validation used as confirmatory testing of positive results but not of negative results raises the issue of discrepant analysis (also known as discordant analysis or review bias), which can produce poor estimates of test performance [98–101]. This last issue is especially problematic since some current guidelines recommend the use of confirmatory testing for positive results [102–104] without associated testing of negative (i.e., wild-type) results.
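The bias introduced by confirming only positive results can be made concrete with a toy tally; all counts below are invented for illustration:

```python
# Hypothetical truth for a validation cohort: confirmatory testing of
# positives measures PPV, but false negatives are never audited.
tp, fp, fn, tn = 95, 5, 10, 890

ppv = tp / (tp + fp)          # estimable when only positives are confirmed
sensitivity = tp / (tp + fn)  # invisible unless negatives are also tested

print(f"PPV = {ppv:.1%}")                       # 95.0% -- looks excellent
print(f"True sensitivity = {sensitivity:.1%}")  # 90.5% -- hidden by review bias
```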

INTERPRETATION OF SNVs
Once a bioinformatics pipeline is optimized for sensitive and specific detection of SNVs, the next step in the analysis is the interpretation of the identified variants.

Online Resources and Databases
There are a number of online tools that can be used to evaluate whether an SNV has a documented disease association. The most helpful databases include the Human Gene Mutation Database [105], dbSNP [2], PubMed [106], and Online Mendelian Inheritance in Man (OMIM) [107]. However, there are several caveats to the use of online resources. First, standardized annotation formats have not yet been developed to unambiguously indicate variant location and type; consequently, care is required to ensure that the SNV from the clinical NGS test is identical to the SNV listed in the database. Fortunately, this significant limitation is well recognized, and several working groups have been established by governmental entities as well as professional organizations to develop standards for so-called “clinical grade” variant annotations [108,109]. Second, the level of review required to establish an association between a specific genotype and a specific phenotype (i.e., the rigor of the curation) varies widely between online databases and resources. Consequently, a putative association present within a database should be carefully reviewed, including critical review of the referenced literature. Again, several governmental and professional organizations have recognized the significance of this limitation and are actively working to address this shortcoming [110–112].
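The annotation-matching caveat is easy to underestimate: the same nucleotide change can be spelled several ways, so a naive string comparison against a database entry can miss a true match. The toy normalization below, a simplified version of the allele-trimming used by VCF-style tools (coordinates and alleles are invented), shows the idea:

```python
def normalize(pos: int, ref: str, alt: str):
    """Trim bases shared by ref and alt so equivalent spellings compare equal."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]  # drop a shared trailing base
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]    # drop a shared leading base
        pos += 1                       # and shift the coordinate right
    return pos, ref, alt

# Two spellings of the same substitution normalize to the same record:
assert normalize(100, "CT", "GT") == normalize(100, "C", "G") == (100, "C", "G")
```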

Prediction Tools for Missense Variants
Online resources can be used to predict the functional consequence of novel SNVs that fall within the coding region of a gene. Sequence conservation scores that make use of phylogenetic analyses provide a measure of evolutionary constraint on a specific coding residue [113] and can also be used to gauge functional impact. In addition, widely used tools that aggregate these data with other measures are available, including the SIFT algorithm [114], which is primarily based on sequence homology, and the PolyPhen algorithm [115], which uses information based on crystal structure, protein folding, and amino acid conservation. However, these prediction tools have appreciable error rates and thus should be used only as informative screening tools; definitive assignment of a functional role to a sequence change that is a variant of unknown significance requires well-designed in vitro and in vivo functional studies.

Prediction Tools for Possible Splice Effects
As discussed above, SNVs involving nucleotides that have a critical role in mRNA splicing are a well-described category of mutations that can have significant deleterious effects. SNVs that occur at the highly conserved GU dinucleotide of the splice donor site (at the 5′ end of the intron) or the AG dinucleotide of the splice acceptor site (at the 3′ end of the intron) are deleterious and have been described in many genes, and an SNV in one of these invariant splice site sequences is currently considered a deleterious mutation by ACMG guidelines [116]. There are several tools available to predict the impact of an SNV on mRNA splicing. NetGene2 [117] is a neural network-based prediction tool; the exonic splicing enhancer (ESE) finder algorithm is particularly useful for evaluating how an SNV may impact exonic splicing enhancers [118].
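As a small illustration of the invariant-site rule, the sketch below flags SNVs that land in the first two or last two bases of an intron; the plus-strand orientation and the coordinate convention (intron_start and intron_end as the first and last intronic bases) are assumptions of this toy model:

```python
def hits_invariant_splice_site(snv_pos: int, intron_start: int, intron_end: int) -> bool:
    """True if an SNV falls in the invariant GT donor (first two intronic bases)
    or AG acceptor (last two intronic bases) of a plus-strand intron."""
    donor = {intron_start, intron_start + 1}  # GT at the 5' end of the intron
    acceptor = {intron_end - 1, intron_end}   # AG at the 3' end of the intron
    return snv_pos in donor or snv_pos in acceptor

# An SNV at the first intronic base disrupts the donor site:
print(hits_invariant_splice_site(1001, intron_start=1001, intron_end=1500))  # True
```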

Kindred Testing
In the setting of constitutional genetic testing, the significance of a novel SNV in a proband can often be clarified by parental testing, or by testing of other family members at key positions within the kindred.

Paired Tumor-Normal Testing
For NGS analysis of cancer specimens, paired tumor-normal testing can sometimes provide insight into the significance of a variant of unknown significance (VUS) identified in tumor tissue. However, the laboratory decision to perform paired tumor-normal testing, whether ad hoc based on the NGS findings in a particular case or as the routine approach for all cases, relies on the analysis of a number of factors, including cost, the size of the target region, and the anticipated clinical use of the sequence results. In general, for NGS assays focused on a limited panel of genes and designed to identify mutations that are the targets of specific drug therapies, paired tumor-normal testing provides little additional information that impacts the patient’s care. However, for larger gene panels, clinical trials, and translational or basic science research, paired tumor-normal testing can be an integral part of NGS analysis because it makes possible a more comprehensive evaluation of novel SNVs.

REPORTING
As discussed in detail in Chapter 13, a number of professional organizations have recommended specific guidelines for reporting constitutional variants. Revised recommendations issued by the American College of Medical Genetics and Genomics (ACMG) in 2007 defined six interpretive categories largely based on a variant’s association with disease causality [116]. However, for ease of interpretation, most clinical laboratories classify SNVs identified in constitutional testing into only five categories, similar to those described for copy number variation: (1) pathogenic, (2) likely pathogenic, (3) uncertain significance, (4) likely benign, and (5) benign [119]. This five-level classification scheme has recently been proposed as the standard for interpretation of variants in inherited disorders through a joint recommendation of the ACMG and the Association for Molecular Pathology and is expected to be formally adopted following member comment and organizational approval in the near future. In contrast, no consensus guidelines have been published for reporting somatic or acquired variants, although several workgroups are currently addressing this issue. Most laboratories currently classify sequence variants with regard to the disease in question and in relation to the actionability of the given variant for prognosis, diagnosis, or therapeutic decision making [120]. Since predicted responsiveness to targeted therapy is often the primary reason for NGS testing to identify somatic SNVs, variants expected to confer sensitivity or resistance to a given therapy should be documented; driver and passenger variants should also be reported even in the absence of a direct role in the choice of therapy. When a variant known to be associated with a familial cancer predisposition syndrome is identified (and paired tumor-normal analysis has not been performed), follow-up formal genetic counseling and germ line analysis should be suggested in the report, since it can be difficult to ascertain whether a given variant is germ line or somatic based on the VAF of an SNV (or any other mutation) from NGS analysis of a tumor sample alone.
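In report-generation code, the five-tier constitutional scheme maps naturally onto an enumeration; the sketch below is hypothetical scaffolding (the tier names follow the recommendations cited above, everything else, including the example variant, is chosen for illustration):

```python
from enum import Enum

class VariantClass(Enum):
    PATHOGENIC = 1
    LIKELY_PATHOGENIC = 2
    UNCERTAIN_SIGNIFICANCE = 3
    LIKELY_BENIGN = 4
    BENIGN = 5

def report_line(variant: str, cls: VariantClass) -> str:
    """Render one line of a constitutional variant report."""
    return f"{variant}: {cls.name.replace('_', ' ').title()}"

print(report_line("NM_000546.5:c.524G>A (p.R175H)", VariantClass.PATHOGENIC))
```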

SUMMARY
The platforms and bioinformatic pipelines of NGS are ideally suited to the detection of SNVs, and the expanding catalog of clinically relevant point mutations in inherited diseases and acquired point mutations in cancers is the driving force behind the adoption of NGS assays for SNVs in many clinical laboratories. While the utility of a clinical NGS assay for SNVs is heavily dependent on assay design features, including an amplification-based versus hybrid capture-based targeted approach, DNA library complexity, depth of sequencing, and tumor cellularity (in sequencing of cancer specimens), technical features of the sequencing platforms themselves (such as their intrinsic error rates) also impact the usefulness of the tests. However, even with optimized test design, the sensitivity and specificity of the sequencing depend on the bioinformatic approaches used to identify SNVs in the sequence reads. Many popular NGS analysis programs for SNV detection are designed for constitutional genome analysis, where variants occur in either 50% (heterozygous) or 100% (homozygous) of the reads, and these prior probabilities are often built into the algorithms to optimize sensitivity in inherited disease testing. However, SNVs in cancer specimens occur with a much broader range of VAFs, often with a frequency far outside the expected range for homozygous and heterozygous variants; sensitive and specific detection of acquired SNVs therefore requires either significant revision of the software packages designed for constitutional testing or new algorithms altogether. In addition, bioinformatic tools optimized for very sensitive detection of SNVs in NGS data require high coverage depth for acceptable performance and rely on spike-in control samples to calibrate run-dependent error models, features that must be accounted for in assay design. There are a number of online tools that can be used to predict the impact of an SNV and evaluate whether it has a documented disease association, but there are caveats to the use of online resources, including a lack of standardized annotation formats for SNVs and variability in the level of review performed to establish the associations between a specific genotype and a specific phenotype. Consequently, putative associations should always be carefully reviewed in the context of the published medical literature and correlated with the clinicopathologic features of the specific case under review. Although no single reporting format is in routine use, clinical labs should nonetheless follow the specific guidelines that have been recommended by professional organizations for constitutional variants and should consult the literature to stay abreast of emerging consensus guidelines for somatic or acquired variants.

References
[1] The 1000 Genomes Consortium. A map of human genome variation from population-scale sequencing. Nature 2010;467:1061–73.
[2] dbSNP: <http://www.ncbi.nlm.nih.gov/projects/SNP>.
[3] International HapMap Consortium. Integrating ethics and science in the International HapMap Project. Nat Rev Genet 2004;5:467–75.
[4] International HapMap Consortium. The International HapMap Project. Nature 2003;426:789–96.
[5] Altshuler D, Gibbs R, Peltonen L, et al. Integrating common and rare genetic variation in diverse human populations. Nature 2010;467:52–8.
[6] Miki Y, Swensen J, Shattuck-Eidens D, et al. A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science 1994;266:66–71.
[7] D’Arcangelo M, Cappuzzo F. K-ras mutations in non-small cell lung cancer: prognostic and predictive value. ISRN Mol Biol 2012. Available from: http://dx.doi.org/10.5402/2012/837306.
[8] Govindan R, Ding L, Griffith M, et al. Genomic landscape of non-small cell lung cancer in smokers and never-smokers. Cell 2012;150:1121–34.
[9] Duncavage EJ, Abel HJ, Szankasi P, et al. Targeted next generation sequencing of clinically significant gene mutations and translocations in leukemia. Mod Pathol 2012;25:795–804.
[10] Pritchard CC, Smith C, Salipante SJ, et al. ColoSeq provides comprehensive Lynch and polyposis syndrome mutational analysis using massively parallel sequencing. J Mol Diagn 2012;14:357–66.
[11] Singh RR, Patel KP, Routbort MJ, et al. Clinical validation of a next-generation sequencing screen for mutational hotspots in 46 cancer-related genes. J Mol Diagn 2013;15:607–22.
[12] Cottrell CE, Al-Kateb H, Bredemeyer AJ, et al. Validation of a next-generation sequencing assay for clinical molecular oncology. J Mol Diagn 2014;16:89–105.
[13] Mendell JT, Dietz HC. When the message goes awry: disease-producing mutations that influence mRNA content and performance. Cell 2001;107:411–4.
[14] Pfeifer JD. Molecular genetic testing in surgical pathology. Chapters 2 and 5. Philadelphia, PA: Lippincott Williams & Wilkins; 2006;62957, 86110.
[15] Kawanishi S, Hiraku Y, Murata M, et al. The role of metals in site-specific DNA damage with reference to carcinogenesis. Free Radic Biol Med 2002;32:822–32.
[16] Dizdaroglu M, Jaruga P, Birincioglu M, et al. Free radical-induced damage to DNA: mechanisms and measurement. Free Radic Biol Med 2002;32:1102–15.
[17] Beesk F, Dizdaroglu M, Schulte-Frohlinde D, et al. Radiation-induced DNA strand breaks in deoxygenated aqueous solutions. The formation of altered sugars as end groups. Int J Radiat Biol Relat Stud Phys Chem Med 1979;36:565–76.
[18] Box HC, Dawidzik JB, Budzinski EE. Free radical-induced double lesions in DNA. Free Radic Biol Med 2001;31:856–68.
[19] Dizdaroglu M, Gajewski E, Reddy P, et al. Structure of a hydroxyl radical induced DNA-protein cross-link involving thymine and tyrosine in nucleohistone. Biochemistry 1989;28:3625–8.
[20] Margolis S, Coxon B, Gajewski E, et al. Structure of a hydroxyl radical induced cross-link of thymine and tyrosine. Biochemistry 1988;27:6353–9.


[21] Kasprzak KS. Oxidative DNA and protein damage in metal-induced toxicity and carcinogenesis. Free Radic Biol Med 2002;10:95867. [22] Anastassopoulou J, Theophanides T. MagnesiumDNA interactions and the possible relation of magnesium to carcinogenesis. Irradiation and free radicals. Crit Rev Oncol Hematol 2002;42:7991. [23] Kunkel TA, Alexander PS. The base substitution fidelity of eukaryotic DNA polymerases. J Biol Chem 1986;261:1606. [24] Guengerich FP. Common and uncommon cytochrome P450 reactions related to metabolism and chemical toxicity. Chem Res Toxicol 2001;14:61150. [25] Pfeifer GP, Denissenko MF, Olivier M, et al. Tobacco smoke carcinogens, DNA damage and p53 mutations in smoking-associated cancers. Oncogene 2002;21:743551. [26] Poirer MC. Chemical-induced DNA damage and human cancer risk. Nat Rev Cancer 2004;4:6307. [27] Goodman MF. Error-prone repair DNA polymerases in prokaryotes and eukaryotes. Annu Rev Biochem 2002;71:1750. [28] Hubscher U, Maga G, Spadar S. Eukaryotic DNA polymerases. Annu Rev Biochem 2002;71:13363. [29] Pages V, Fuchs RP. How DNA lesions are turned into mutations within cells? Oncogene 2002;21:895766. [30] Sanderson BJ, Shield AJ. Mutagenic damage to mammalian cells by therapeutic alkylating agents. Mutat Res 1996;355:4157. [31] Kartalou M, Essigmann JM. Recognition of cisplatin adducts by cellular proteins. Mutat Res 2001;478:121. [32] Henner WD, Rodriguez LO, Hecht SM. γ Ray induced deoxyribonucleic acid strand breaks. 30 Glycolate termini. J Biol Chem 1983;258:7113. [33] Dizdarogle M, von Sonntag C, Schulte-Frohlinde D. Letter: strand breaks and sugar release by gamma-irradiation of DNA in aqueous solution. J Am Chem Soc 1975;97:22778. [34] Li X, Park WJ, Pyeritz RE, Jabs EW. Effect on splicing of a silent FGFR2 mutation in Crouzon syndrome. Nat Genet 1995;9:2323. [35] Richard I, Beckmann JS. How neutral are synonymous codon mutations? Nat Genet 1995;10:259. [36] Maquat LE. The power of point mutations. Nat Genet 2001;27:56. [37] Krawczak M, Reiss J, Cooper DN. The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Hum Genet 1992;90:4154. [38] Guil S, Darzynkiewicz E, Bach-Elias M. Study of the 2719 mutant of the c-H-ras oncogene in a bi-intronic alternative splicing system. Oncogene 2002;21:564953. [39] Mitchell GA, Labuda D, Fontaine G, et al. Splice-mediated insertion of an Alu sequence inactivates ornithine δ-aminotransferase: a role for Alu elements in human mutation. Proc Natl Acad Sci USA 1991;88:8159. [40] Cartegni L, Chew SL, Krainer AR. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat Rev Genet 2002;3:28598. [41] Nishimura H, Yerkes E, Hohenfellner K, et al. Role of the angiotensin type 2 receptor gene in congenital anomalies of the kidney and urinary tract, CAKUT, of mice and men. Mol Cell 1999;3:110. [42] Brand K, Dugi KA, Brunzell JD, et al. A novel A-G mutation in intron I of the hepatic lipase gene leads to alternative splicing resulting in enzyme deficiency. J Lipid Res 1996;37:121323. [43] Hsu BY, Iacobazzi V, Wang Z, et al. Aberrant mRNA splicing associated with coding region mutations in children with carnitine acylcarnitine translocase deficiency. Mol Genet Metab 2001;74:24855. [44] Kuivenhoven JA, Weibusch H, Pritchard PH, et al. An intronic mutation in a lariat branchpoint sequence is a direct cause of an inherited human disorder (fish-eye disease). J Clin Invest 1996;98:35864. [45] Zhu X, Chung I, O’Gorman MR, et al. 
Coexpression of normal and mutated CD40 ligand with deletion of a putative RNA lariat branchpoint sequence in X-linked hyper-IgM syndrome. Clin Immunol 2001;99:3349. [46] Janssen RJ, Wevers RA, Haussler M, et al. A branch site mutation leading to aberrant splicing of the human tyrosine hydroxylase gene in a child with a severe extrapyramidal movement disorder. Ann Hum Genet 2000;64:37582. [47] Fujimaru M, Tanaka A, Choeh K, et al. Two mutations remote from an exon/intron junction in the beta-hexosaminidase beta-subunit gene affect 30 -splice site selection and cause Sandhoff disease. Hum Genet 1998;103:4629. [48] Antonarakis SE, Krawczak M, Cooper DN. The nature and mechanisms of human gene mutation. In: Vogelstein B, Kuzler KW, editors. The genetic basis of human cancer. New York, NY: McGraw-Hill; 2002. p. 741. [49] Khajavi M, Inoue K, Lupski JR. Nonsense-mediated mRNA decay modulates clinical outcome of genetic disease. Eur J Hum Genet 2006;14:107481. [50] Gonzalez-Redondo JM, Stoming TA, Kutlar A, et al. A C-T substitution at nt-101 in a conserved DNA sequence of the promotor region of the beta-globin gene is associated with “silent” beta-thalassemia. Blood 1989;73:170511. [51] Treistman R, Orkin SH, Maniatis T. Specific transcription and RNA splicing defects in five cloned β-thalassaemia genes. Nature 1983;302:5916. [52] Martin DI, Tsai SF, Orkin SH. Increased γ-globin expression in a nondeletion HPFH mediated by an erythroid-specific DNA-binding factor. Nature 1989;338:4358. [53] Loman NJ, Misra RV, Dallman TJ, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 2012;30:4349. [54] Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:20789. [55] McCall CM, Mosier S, Thiess M, et al. False positives in multiplex polymerase chain reaction-based next-generation sequencing have unique signatures. J Mol Diagn 2014;16:5419. [56] Walsh PS, Erlich HA, Higuchi R. Preferential PCR amplification of alleles: mechanisms and solutions. PCR Methods Appl 1992;1:24150. [57] Barnard R, Futo V, Pecheniuk N, et al. PCR bias toward the wild-type k-ras and p53 sequences: implications for PCR detection of mutations and cancer diagnosis. Biotechniques 1998;25:68491. [58] Ogino S, Wilson RB. Quantification of PCR bias caused by a single nucleotide polymorphism in SMN gene dosage analysis. J Mol Diagn 2002;4:18590.


[59] Nawy T. Single-cell sequencing. Nat Methods 2014;11(1):18. [60] Heitzer E, Auer M, Gasch C, et al. Complex tumor genomes inferred from single circulating tumor cells by array-CGH and nextgeneration sequencing. Cancer Res 2013;73:296575. [61] Macaulay IC, Voet T. Single cell genomics: advances and future perspectives. PLoS Genet 2014;10:e1004126. [62] Smits AJ, Kummer JA, de Bruin PC, et al. The estimation of tumor cell percentage for molecular testing by pathologists is not accurate. Mod Pathol 2014;27:16874. [63] Viray H, Li K, Long T, et al. A prospective, multi-institutional diagnostic trial to determine pathologist accuracy in estimation of percentage of malignant cells. Arch Pathol Lab Med 2014;137:15459. [64] Renovanz M, Kim EL. Intratumoral heterogeneity, its contribution to therapy resistance and methodological caveats to assessment. Front Oncol 2014;4:142. [65] Gerlinger M, Rowan A, Horswell S, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med 2012;366:88392. [66] Yachida S, Jones S, Bozic I, et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature 2010;467:11147. [67] Auerbach C, Moutschen-Dahmen M, Moutschen J. Genetic and cytogenetical effects of formaldehyde and related compounds. Mutat Res 1977;39:31761. [68] Bresters D, Schipper M, Reesink H, et al. The duration of fixation influences the yield of HCV cDNAPCR products from formalin-fixed, paraffin-embedded liver tissue. J Virol Methods 1994;48:26772. [69] Feldman MY. Reactions of nucleic acids and nucleoproteins with formaldehyde. Prog Nucleic Acid Res Mol Biol 1973;13:149. [70] Karlsen F, Kalantari M, Chitemerere M, et al. Modifications of human and viral deoxyribonucleic acid by formaldehyde fixation. Lab Invest 1994;71:60411. [71] Loudig O, Brandwein-Gensler M, Kim R, et al. Illumina whole-genome complementary DNA-mediated annealing, selection, extension and ligation platform: assessing its performance in formalin-fixed, paraffinembedded samples and identifying invasion pattern-related genes in oral squamous cell carcinoma. Hum Pathol 2011;42:191122. [72] Kerick M, Isau M, Timmermann B. Targeted high throughput sequencing in clinical cancer settings: formaldehyde fixed-paraffin embedded (FFPE) tumor tissues, input amount and tumor heterogeneity. BMC Med Genomics 2011;4:68. [73] Spencer DH, Sehn JK, Abel HJ, et al. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J Mol Diagn 2013;15(5):62333. [74] Karnes H, Duncavage ED, Bernadt CT. Targeted next-generation sequencing using fine-needle aspirates from adenocarcinomas of the lung. Cancer Cytopathol 2014;122:10413. [75] Kanagal-Shamanna R, Portier BP, Singh RR, et al. Next-generation sequencing-based multi-gene mutation profiling of solid tumors using fine needle aspiration samples: promises and challenges for routine clinical diagnostics. Mod Pathol 2013;27:31427. [76] DePristo MA, Banks E, Poplin R, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:4918. [77] Vallania FL, Druley TE, Ramos E, et al. High-throughput discovery of rare insertions and deletions in large cohorts. Genome Res 2010;20:17118. [78] Koboldt D, Zhang Q, Larson D, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22:56876. [79] Cibulskis K, Lawrence M, Carter S, et al. 
Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 2013;31:2139. [80] Larson D, Harris C, Chen K, et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 2012;28:3117. [81] Li H, Durbin R. Fast and accurate short read alignment with BurrowsWheeler transform. Bioinformatics 2009;25:175460. [82] Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907v2 [q-bio.GN]; 2012. [83] Saunders CT, Wong W, Swamy S, et al. Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 2012;28:18117. [84] Pritchard CC, Salipante SJ, Koehler K, et al. Validation and implementation of targeted capture and sequencing for the detection of actionable mutation, copy number variation, and gene rearrangement in clinical cancer specimens. J Mol Diagn 2014;16:5667. [85] Spencer DH, Tyagi M, Vallania F, et al. Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J Mol Diagn 2014;16(1):7588. [86] Mardis ER. The $1,000 genome, the $100,000 analysis? Genome Med 2010;2:84. [87] Spencer DH, Abel HJ, Lockwood CM, et al. Detection of FLT3 internal tandem duplication in targeted, short-read-length, next-generation sequencing data. J Mol Diagn 2013;15:8193. [88] Ewing B, Green P. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 1998;8:18694. [89] Li H. Improving SNP discovery by base alignment quality. Bioinformatics 2011;27(15):11578. [90] Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. ArXiv e-prints; July 2012:1207.3907. [91] Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18:18518. [92] Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K, et al. SNP detection for massively parallel whole-genome resequencing. Genome Res 2009;19:112432. [93] Lassoued B, Nivaggioni A, Gabert J. Minimal residual disease testing in hematologic malignancies and solid cancer. Expert Rev Mol Diagn 2014;14:699712. [94] Hiatt JB, Pritchard CC, Salipante SJ, O’Roak BJ, Shendure J. Single molecule molecular inversion probes for targeted, high-accuracy detection of low-frequency variation. Genome Res 2013;23:84354.


[95] Schmitt MW, Kennedy SR, Salk JJ, et al. Detection of ultra-rare mutations by next-generation sequencing. Proc Nat Acad Sci USA 2012;109:1450813. [96] Kinde I, Wu J, Papadopoulos N, et al. Detection and quantification of rare mutations with massively parallel sequencing. Proc Natl Acad Sci USA 2011;108:95305. [97] Xu H, DiCarlo J, Satya R, et al. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics 2014;15:244. [98] Lipman HB, Astles JR. Quantifying the bias associated with use of discrepant analysis. Clin Chem 1998;44:10815. [99] Hadgu A. The discrepancy in discrepant analysis. Lancet 1996;348:5923. [100] Hadgu A. Discrepant analysis is an inappropriate and unscientific method. J Clin Microbiol 2000;38:43012. [101] Miller WC. Bias in discrepant analysis: when two wrongs don’t make a right. J Clin Epidemiol 1998;51:21931. [102] CLSI. Molecular methods for clinical genetics and oncology testing; approved guideline. 3rd ed. Wayne: Clinical Laboratory Standards Institute; 2012. [CLSI document MM01-A3]. [103] American College of Medical Genetics. ACMG standards and guidelines for clinical genetic laboratories, ,http://www.acmg.net/AM/ Template.cfm?Section5Laboratory_Standards_and_Guidelines&Template5/CM/HTML.; 2008. [104] NCCLS. Nucleic acid sequencing methods in diagnostic laboratory medicine; approved guideline. NCCLS document MM9-A [ISBN 1-56238-558-5]. NCCLS, 940 West Valley Road, Suite 1400, Wayne, Pennsylvania 19087-1898 USA; 2004. [105] The Human Genome Mutation Database (HGMD, ,http://www.hgmd.cf.ac.uk.). [106] ,http://www.ncbi.nlm.nih.gov/PubMed/.. [107] ,http://omim.org/.. [108] Lubin IM, Aziz N, Babb L, et al. The clinical next-generation sequencing variant file: advances, opportunities, challenges for the clinical laboratory [submitted]. [109] Ramos EM, Din-Lovinescu C, Berg JS, et al. Characterizing genetic variants for clinical action. Am J Med Genet C Semin Med Genet 2014;166C:93104. [110] Eggington JM, Bowles KR, Moyes K, et al. A comprehensive laboratory-based program for classification of variants of uncertain significance in hereditary cancer genes. Clin Genet 2014;86(3):22937. [111] ,http://www.ncbi.nlm.nih.gov/clinvar/.. [112] ,http://www.iccg.org/about-the-iccg/clingen/.. [113] Pollard K, Hubisz MJ, Rosenbloom K, et al. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 2012;20:11021. [114] ,http://sift.jcvi.org.. [115] ,http://genetics.bwh.harvard.edu/pph2.. [116] Richards CS, Bale S, Bellissimo DB, et al. ACMG recommendations for standards for interpretation and reporting of sequence variations: revisions 2007. Genet Med 2008;10:294300. [117] ,http://www.cbs.dtu.dk/services/NetGene2.. [118] ,http://rulai.cshl.edu/tools/ESE2.. [119] Kearney H, Thorland E, Brown K, et al. American College of Medical Genetics standards and guidelines for interpretation and reporting of postnatal constitutional copy number variants. Genet Med 2011;13:6805. [120] Hagemann I, Cottrell C, Lockwood C. Design of targeted, capture-based, next generation sequencing tests for precision cancer therapy. Cancer Genet 2013;206:42031.


CHAPTER 9

Insertions and Deletions (Indels)

Jennifer K. Sehn
Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA

OUTLINE
Overview of Insertion/Deletion Events (Indels)
  Introduction
  Indel Definition and Relationship to Other Classes of Mutations
  Testing for Indels in Constitutional and Somatic Disease
Sources, Frequency, and Consequences of Indels
  Mechanisms of Indel Generation
    Slipped Strand Mispairing (Polymerase Slippage)
    Secondary Structure Formation
    Imperfect Double-Strand DNA Break Repair
    Defective Mismatch Repair
    Unequal Meiotic Recombination
  Frequency of Indels in Human Genomes
  Functional Consequences
    Decreased Transcription
    Abnormal Protein Aggregation
    Microsatellite Instability/Rapid Repeat Expansion
    Altered Splicing
    Frameshift
    In-Frame, Decreased Protein Activity
    In-Frame, Increased Protein Activity
    Synonymous, Missense, and Nonsense
    Predicting Functional Effects of Novel Indels
Technical Issues That Impact Indel Detection by NGS
  Sequencing Platform Chemistry
  Sequence Read Type and Alignment
  Library Preparation Technique
  Depth of Coverage
  Assay Design
Specimen Issues That Impact Indel Detection by NGS
  Specimen Cellularity and Heterogeneity
  Library Complexity
Bioinformatics Approaches to NGS Indel Detection
  General Bioinformatics Approaches to Indel Detection and Annotation
  Local Realignment
  Left Alignment
  Probabilistic Modeling Using Mapped Reads
  Split-Read Analysis
  Sensitivity and Specificity Issues
  Indel Length
  Indel Annotation
  Definition of Indel “Truth”
  Reference Standards
Summary
References


KEY CONCEPTS
• Indels are characterized by insertion, deletion, or insertion and deletion of nucleotides in genomic DNA and are by definition anywhere from 1 bp to 1 kb in length.
• Indels commonly occur in repetitive DNA sequences, which can make identification and annotation difficult.
• Indels occur as normal polymorphisms in the human genome but are also important for constitutional and oncologic disease testing.
• Tools that are optimized for detection of SNVs or other classes of mutation are not optimized to detect indels. Specific tools for indel detection are required.
• The ability to detect indels is relatively new, and gold standards for detection and annotation are not yet established.
• Probabilistic modeling based on mapped sequence reads can be used to identify indels that are up to about 15% of the length of a sequence read, but not longer.
• Split-read analysis can be used to identify indels of any size, but it may have an increased false-positive rate.
• Lack of concordance between different indel detection methods complicates assessment of sensitivity and specificity of clinical NGS assays.
• The use of multiple possible annotations for the same indel precludes correlation with existing databases and literature for clinical interpretation. Retrospective and prospective efforts are required to standardize indel annotation in existing literature and going forward.

OVERVIEW OF INSERTION/DELETION EVENTS (INDELS)

Introduction
Some types of DNA alterations, including insertions/deletions (indels), are not necessarily the direct result of DNA damage, per se. Instead, indels can originate from DNA polymerase errors or incorrect DNA repair following a genetic insult. As a result, indels may be complex (e.g., include both inserted and deleted bases) and often involve areas with repetitive sequences, factors that can make alignment, identification, and annotation difficult. In addition, the various sequencing platforms and library preparation techniques in use for clinical next-generation sequencing (NGS) have varying susceptibilities to artifactual introduction of indels. Accurate and reproducible identification of indels is critical in clinical testing, as indels are commonly implicated in constitutional (hereditary) and somatic (acquired, including cancer) diseases and may be important for diagnosis (e.g., trinucleotide repeat expansion diseases), patient management and counseling (e.g., Lynch syndrome), therapy selection (e.g., EGFR mutations), or prediction of patient prognosis (e.g., FLT3 or NPM1 insertions).

Indel Definition and Relationship to Other Classes of Mutations
“Indel” is a general term that may refer to insertion, deletion, or insertion and deletion of nucleotides in genomic DNA. By definition, indels are less than 1 kb in length. It is preferable not to refer to an indel as short, small, long, or large without specifying a size range, as these terms confer no specific meaning; one person may refer to a 30 bp indel as long, whereas 30 bp is rather short to someone who usually deals with 300 bp indels. Some indels may be as short as a single inserted or deleted nucleotide. Rarely, two single nucleotide variants (SNVs) are identified adjacent to each other; if the variants are in cis (on the same DNA strand), the correct mutation annotation is as an indel, not as two SNVs, to convey the relationship between the variants (Figure 9.1). Occasionally, NGS variant detection tools may identify an SNV immediately adjacent to an indel; by convention, this is interpreted as a single indel event and annotated accordingly (Figure 9.2). Similarly, insertion or deletion of tens to hundreds of bases often occurs during structural rearrangements (including translocations and inversions) but is not viewed as a separate mutation; rather, these variants are lumped into one complex structural variant. Insertion or deletion of sequence larger than 1 kb is categorized as a copy number variant (CNV) and is more appropriately referred to as amplification, duplication, or deletion.

FIGURE 9.1 Adjacent SNVs in cis or trans. Any variant calls that are immediately adjacent to each other should prompt the user to review the aligned sequence reads to determine whether the variants occur in cis (on the same DNA fragment) or in trans (on different DNA fragments). If adjacent variants occur in cis, they are more appropriately annotated as a single deletion-insertion event, to convey the relationship between the variants. If they occur in trans, they should be evaluated as separate events. Rarely, two adjacent SNVs will be called within the same codon. The variant annotation and functional consequence must be carefully reviewed in such a case. The screen shot shows two adjacent SNV calls affecting one codon in TP53. The standard output annotation for this sequence change would be two separate SNVs: chr17:g.7577550C>G, resulting in a glycine to alanine substitution, and chr17:g.7577551C>G, resulting in a glycine to arginine substitution. These annotations would be correct if the variants were in trans. However, review of the aligned reads shows that the variants occur in cis. As such, the variant should be annotated as an indel with deletion of the two reference C’s at this codon and insertion of two G’s in their place (chr17:g.7577550_7577551delinsGG). The correct amino acid change for this mutant sequence is a glycine to proline substitution.
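The in-cis merging rule from Figure 9.1 can be expressed compactly; in this sketch each read is simplified to a dictionary of position-to-observed-base, which is an assumption for illustration rather than a real alignment format:

```python
def adjacent_snvs_in_cis(reads, pos1, alt1, pos2, alt2) -> bool:
    """True if every read covering both positions carries the two variant
    bases together, i.e., the variants are in cis and should be merged."""
    covering = [r for r in reads if pos1 in r and pos2 in r]
    both = [r for r in covering if r[pos1] == alt1 and r[pos2] == alt2]
    either = [r for r in covering if r[pos1] == alt1 or r[pos2] == alt2]
    return bool(either) and len(both) == len(either)

# The TP53 example from Figure 9.1: two adjacent C>G calls seen together
reads = [{7577550: "G", 7577551: "G"},   # variant-bearing read
         {7577550: "C", 7577551: "C"}]   # reference read
if adjacent_snvs_in_cis(reads, 7577550, "G", 7577551, "G"):
    print("annotate as a single event: chr17:g.7577550_7577551delinsGG")
```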

Testing for Indels in Constitutional and Somatic Disease
Up to now, clinical indel detection has primarily been accomplished by polymerase chain reaction (PCR) amplification of a defined target (e.g., EGFR exon 19 for lung adenocarcinoma or FLT3 exon 14 for acute myeloid leukemia (AML)), followed by comparison of the size of the resulting amplicon to the expected wild-type fragment size by conventional gel or capillary electrophoresis. Testing may be performed on peripheral blood leukocytes or buccal squamous cells in constitutional disease (as for microsatellite instability testing in Lynch syndrome) or may require use of formalin-fixed, paraffin-embedded tissue blocks (as for oncologic testing). Particularly in oncologic testing, increased utilization of minimally invasive diagnostic procedures by clinicians (including preferential use of core needle biopsies or fine needle aspiration biopsies) yields smaller and smaller amounts of lesional tissue that must be judiciously rationed between histologic diagnosis and molecular testing.

A prime example of the clinical demand for extracting comprehensive information from a small biopsy specimen is the evaluation of lung tumors. First, the surgical pathologist must determine whether a lung mass is a primary lung cancer or one of the many common cancers that metastasize to the lung, which often requires evaluation of at least a few immunohistochemical stains. If the mass is indeed a primary lung adenocarcinoma, evaluation for EGFR mutation may be requested to determine whether the patient is a candidate for EGFR inhibitor therapy; additional information about gene amplification or rearrangements involving ALK, ROS, RET, MET, or ERBB2 may also be required, as well as information about SNVs and/or indels in other genes implicated in lung adenocarcinoma (e.g., KRAS, BRAF, ERBB2, MET, PIK3CA). There is sometimes simply not enough tumor tissue available from several millimeters of biopsy material to perform all the desired testing if the tests are performed separately. However, NGS provides the ability to simultaneously evaluate all classes of mutation (SNV, indel, CNV, and structural variants) occurring in multiple gene targets, which is particularly valuable when dealing with limited tissue specimens.

FIGURE 9.2 Complex EGFR exon 19 indel event. In lung cancer, the most common mutations leading to EGFR activation are indels occurring in exon 19. Many of the described mutations are complex, consisting of deletion of some nucleotides and insertion of others. Complex indels are often incorrectly annotated by variant calling software as insertion or deletion with an adjacent SNV. When reviewing NGS data, adjacent variant calls should prompt review of the sequence reads, usually in IGV, to determine whether the variants occur in cis (on the same DNA fragments) or in trans (on different DNA fragments). If all of the events occur in cis, the variant should be annotated as a single mutation event (i.e., a complex indel). If some or all of the variants occur in trans, they should be considered separate events (an indel and an SNV). The screen shot from IGV shows a complex indel that is incorrectly displayed as an SNV (chr7:g.55242467A>T, shown as a column of red T’s in the aligned sequence reads) and a deletion (chr7:g.55242468_55242485del, black bar in sequence reads). There are several interesting features to observe in this panel. First, a punched-out decrease in coverage (gray histogram track) is seen occurring in the same location as the predicted deletion. Second, no significant change in coverage is seen at the site of the inserted T (which is true for insertions in general, as there is no reference base against which to calculate or display the coverage). Third, soft-clipped reads are displayed (small black dash followed by multicolored nucleotide sequence), showing that base “mismatches” at the ends of sequence reads actually represent misalignment of the soft-clipped bases because an indel call was not made. If one imagines that the soft-clipped bases are shifted by the length of the indel, they do in fact align correctly to the reference. Fourth, the SNV and deletion calls occur on both forward and reverse strand reads (i.e., they are not strand biased). Fifth (and most important for this discussion), the predicted SNV and deletion events always occur together (i.e., they occur in cis). When scrolling through the rest of the aligned reads (not shown), there are no reads in which the SNV or deletion is present individually. Hence, the SNV and deletion are more appropriately annotated as a single complex indel: chr7:g.55242467_55242485delinsT. The difference in annotation is relevant to clinical interpretation, because there is really only one protein sequence change present (NP_005219:p.E746_S752delinsV), instead of two (NP_005219:p.E746V and NP_005219:p.E746_S752delinsD). The databases and existing literature should accordingly be searched for the significance of the complex indel event, instead of the two separate events.

SOURCES, FREQUENCY, AND CONSEQUENCES OF INDELS

Mechanisms of Indel Generation
In normal cells, human DNA polymerases have a low inherent error rate during replication (around 10⁻⁵ errors per bp per cell division), with endogenous proofreading exonuclease activity resulting in even higher replicative fidelity (10⁻⁶ or 10⁻⁷ errors per bp per cell division). Postreplicative DNA mismatch repair (MMR) mechanisms function to decrease the overall mutation rate even further, to around 10⁻⁹ per bp per cell division, but even this low rate indicates that mutations nonetheless do occur as a result of polymerase errors [1,2]. Additionally, cellular processes for repair of DNA mutations (resulting from either polymerase errors or exogenous factors like ultraviolet radiation or chemical mutagens) also have the capacity to introduce mutations.

Slipped Strand Mispairing (Polymerase Slippage)
The most commonly proposed mechanism for indel generation is slipped strand mispairing, also known as replication or polymerase slippage [3,4]. During replication, the complex of DNA polymerase and newly synthesized DNA strand sometimes temporarily dissociates from the template DNA. In areas with repetitive sequences, the polymerase may reassociate with the template strand in a position one or two repeats ahead of or behind where it left off (Figure 9.3). Polymerase slippage classically affects DNA regions with direct (also referred to as tandem) repeats of 1–4 bases (e.g., [T]n or [AGGC]n, where n is the number of repeats) and results in insertion of only one or two additional repeats [4]. If the length of the repeating element occurring in a coding sequence is not a multiple of 3, insertion or deletion of the repeat unit can result in a shift of the mRNA reading frame (a frameshift mutation) [1,5,6]. Slipped strand mispairing can also occur at noncontinuous repeats, resulting in longer insertions or deletions of intervening sequence flanked by the direct repeats [6,7]. Slipped strand mispairing is the mechanism thought to underlie benign polymorphic variation in short tandem repeats (STRs) that can be observed between individuals. STRs are often referred to as microsatellites, and testing for STRs has applications for identity testing in forensic settings, laboratory quality assurance programs (i.e., specimen provenance testing), and bone marrow transplant donor engraftment studies [8].
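Because slippage (and slippage-like sequencing error) concentrates in repeat tracts, a simple repeat-context check is a useful companion to indel calls. The function below is a toy example that measures the homopolymer run containing a given position; the same idea extends to multi-base STR units:

```python
def homopolymer_length(seq: str, pos: int) -> int:
    """Length of the homopolymer run that contains seq[pos]."""
    base, left, right = seq[pos], pos, pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1

# An indel called inside this 6-base T run deserves extra scrutiny:
assert homopolymer_length("ACGTTTTTTACG", 5) == 6
```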

FIGURE 9.3 Slipped strand mispairing. A commonly proposed mechanism for indel generation is slipped strand mispairing, also known as replication slippage or polymerase slippage, that occurs in repetitive sequences. Panel A shows normal replication at a trinucleotide repeat tract, where the top line (blue) represents the template DNA sequence, the purple line represents the path followed by the polymerase (pol), and the bottom line (green) shows the complementary DNA that is synthesized by the polymerase. Four copies of the repeat are present in both the template and newly synthesized DNA strands. When the polymerase slips forward (B), it skips one of the template GTA repeats, with only three repeats copied into the newly synthesized DNA strand (i.e., deletion of one repeat). Correspondingly, when the polymerase slips backward (C), it copies an extra repeat into the newly synthesized DNA strand. It is also proposed that the polymerase can “slip” to a repeat that is not immediately adjacent to the template it was replicating, resulting in deletion or insertion of intervening nonrepetitive DNA (D).


Secondary Structure Formation
Inverted (as opposed to direct) repeats such as palindromes and quasipalindromes also contribute to indel formation [7]. During replication, inverted repeats in a single DNA strand can pair to form hairpin or cruciform structures. Nonhomologous sequences between inverted repeats in these structures can be identified as base mismatches and are susceptible to “repair” via cellular DNA repair mechanisms, with subsequent development of insertions or deletions at the “repaired” sites.

Imperfect Double-Strand DNA Break Repair
Human cells have several mechanisms for correcting or bypassing point mutations that are accumulated during interphase and encountered during DNA replication, including translesional synthesis by specialized polymerases, template strand switching, or convergence by an adjacent replication fork (Figure 9.4). If not promptly corrected, the presence of mismatched or abnormal bases at the replication fork can cause the replication machinery to stall, with resulting dissociation of the replication fork and possibly a double-strand DNA break (DSB) at that site [2–4]. DNA repair pathways are activated by DSB, including homologous recombination, nonhomologous end joining (NHEJ), or microhomology-mediated end joining (MMEJ, also known as alternative end joining, where “microhomology” is limited to only 6–8 bp of homologous sequence), depending on the phase of the cell cycle (Figure 9.5). Any of these mechanisms can result in insertion or deletion of a variable number of bases, resulting in indels or larger alterations like CNVs or structural variants [9]. In addition, recent studies have suggested that G-quadruplex structures present in normal DNA may obstruct DNA synthesis and cause DSB. It has been proposed that a specific repair mechanism mediated by DNA polymerase theta (POLQ; theta-mediated end joining, TMEJ) is responsible for repair of DNA breaks related to G-quadruplex structures, which results in indels ranging from around 50 to 300 bp in length [10].

Defective Mismatch Repair
The predominant mechanism used by cells to repair errors acquired during DNA replication or recombination is the MMR pathway, which functions to correct single base mismatches and insertion/deletion loops in what should be complementary double-stranded DNA [2]. As expected, decreased expression of MMR proteins due to mutation or promoter hypermethylation (as in Lynch syndrome and sporadic colon or endometrial cancers) does not lead to aneuploidy or gross structural abnormalities, but instead permits accumulation of unrepaired SNVs and indels acquired as a result of polymerase errors and imperfect DNA repair mechanisms, as discussed above [11]. Since roughly two-thirds of all indels acquired due to polymerase slippage in repetitive sequences are expected to be frameshift mutations (insertion/deletion of any length that is not a multiple of 3), the majority of exonic indels accumulated due to faulty MMR are frameshifts. Interestingly, defective MMR is typically due to mutation in Lynch syndrome but, in contrast, is due to epigenetic gene silencing via promoter hypermethylation without mutation in sporadic cancers [12–14].

Unequal Meiotic Recombination
As mentioned above, recombination as a pathway for DNA repair is one potential source of acquired indels. Similarly, meiotic recombination involving misaligned homologous partners can result in germ line indel formation.
When the misaligned partners are from homologous chromosomes (nonsister chromatids, one paternal and one maternal), this process is referred to as unequal crossover. If the misaligned partners are sister chromatids (both maternal or both paternal), the process is referred to as unequal sister chromatid exchange [4]. Unequal recombination during meiosis is one proposed mechanism for expansion of long repeat tracts containing hundreds to thousands of repeats in trinucleotide repeat diseases (discussed further below).

Frequency of Indels in Human Genomes

Compared with SNVs, CNVs, and structural variants, indels are more difficult to detect by conventional methods like Sanger sequencing, FISH, and/or karyotyping. This is especially true for larger indels (hundreds of bases), since such variants may not be well amplified in PCR-based Sanger sequencing assays (due to loss of primer binding sites, or template sequence expansion beyond the parameters of a given assay). At the same time, indels are too small to be visible at the resolution afforded by conventional FISH or karyotyping.


FIGURE 9.4 Mechanisms for bypassing point mutations during replication. Replicating cells have several mechanisms for correcting point mutations in template DNA, including translesional synthesis by specialized polymerases, template strand switching, or convergence by an adjacent replication fork. In this figure, the leading and lagging strands are shown with arrows indicating the direction of replication. The replication machinery will stall when it encounters a mutated site (red) on the leading strand. The lagging strand may continue replication, but the leading strand on which the replication machinery is blocked is fragile. If the obstructing mutation cannot be corrected or bypassed, the replication machinery will dissociate, causing collapse of the replication fork with subsequent breakage of the DNA. One strategy to prevent prolonged stalling of the replication machinery with subsequent DNA breakage is to carry out translesion DNA synthesis (TLS) by successive steps (part A). The replication machinery switches to a specialized DNA polymerase for the insertion of a base (green). This step is potentially mutagenic because the wrong base will sometimes be incorporated. A switch to a second specialized DNA polymerase may take place to extend the nonstandard terminus opposite the damage, and finally there is a switch back to a replicative DNA polymerase (Pol ε or Pol δ). DNA polymerase switching is facilitated by posttranslational modifications of DNA polymerases and their accessory factors, as reviewed elsewhere [2]. A second strategy is DNA replication fork regression (part B). Here, the blocked leading strand switches templates and begins to copy the already-replicated lagging strand. The newly replicated bases are shown in green. The regressed fork resembles a four-way junction that can be processed by homologous recombination (HR) enzymes and resolved. This pathway avoids errors, as it makes use of genetic information from the undamaged strand. A third strategy is illustrated in part C. If the replication fork remains stalled for long enough, an adjacent replication fork will converge with it. This allows one strand to replicate fully, while the other strand will contain a gap that remains through to late S phase or G2 phase of the cell cycle. The gap is then filled by DNA synthesis. During gap filling, two different specialized DNA polymerases may also be needed to accomplish synthesis across a lesion, for insertion and extension, and this is potentially mutagenic. Gaps could also conceivably arise by reinitiation of DNA synthesis on the other side of a DNA adduct. Reprinted with permission from Macmillan Publishers Ltd: Nature Reviews Cancer [2], copyright 2011.


FIGURE 9.5 Nonhomologous recombination. On the appearance of a DNA double-strand break, two pathways can be activated. Classical nonhomologous end joining (C-NHEJ) involves the binding of Ku70–Ku80 to the DNA break, followed by the recruitment of DNA-dependent protein kinase catalytic subunit (DNA-PKcs) and several other factors that mediate blunt-end ligation of the break by DNA ligase 4 (LIG4). This process has no sequence requirements and may cause small-scale mutation, such as the addition or the deletion of a small number of nucleotides at the break junction. Alternative end joining (A-EJ) involves exonucleolytic processing of the double-strand break to reveal stretches of potentially complementary sequence (microhomology; indicated in red) on either side of the break. This resection process may be mediated by the exonuclease CtBP-interacting protein (CtIP). Following base pairing at regions of microhomology, the ends are joined by a ligase enzyme (LIG). Reprinted with permission from Macmillan Publishers Ltd: Nature Reviews Cancer [9], copyright 2013.

Consequently, the frequency of indels in the general population is not well delineated. Recent efforts to assess the frequency of indels in large NGS data sets (including those generated by the 1000 Genomes Project and The Cancer Genome Atlas) have been limited by a high false discovery rate (and unknown false-negative rate) using existing methods for indel detection. Though there are numerous examples of constitutional and somatic diseases with associated clinically relevant indels, the frequency of indels in human disease is likewise not well defined and is an active area of investigation.

Functional Consequences

Indels can have widely variable effects on gene expression, at any level from transcription through translation and protein function. The functional consequence of any particular indel, if it has not been described and characterized previously in a clinical setting, is difficult to determine. However, many indels occurring in human disease have been well characterized.

Decreased Transcription

One mechanism by which indels can exert clinically apparent effects is altered transcription of the mutated gene. Fragile X, the second most common genetic cause of mental retardation, is caused by expansion of a trinucleotide repeat (CGG) in the 5′ untranslated region (UTR) of the gene FMR1 [15].


Individuals with greater than 200 repeats are affected by the disease, whereas normal individuals harbor fewer than 55 repeats; alleles with 55–200 repeats are classified as premutations (discussed further below). In affected patients, transcription of FMR1 is markedly decreased, and two epigenetic mechanisms probably contribute to the process by which the abnormally expanded allele is silenced. In one model, expansion of the CGG repeat induces localized hypermethylation of the adjacent promoter by an unclear mechanism, resulting in decreased transcription; in fact, promoter methylation studies are an accepted test for definitive diagnosis of Fragile X [16,17]. In the other model, the abnormal mRNA initially transcribed from the expanded FMR1 gene during early development directly binds to the complementary DNA, forming a DNA–RNA hybrid that induces epigenetic silencing [18]. Hypothetically, any type of indel occurring in transcriptional regulatory elements (promoters, enhancers, etc.) could affect transcription of a gene, though the specific effect of a given noncoding indel is difficult to predict.

Abnormal Protein Aggregation

In contrast to trinucleotide repeat diseases affecting noncoding regions (as discussed above), expansion of trinucleotide repeats within coding sequences causes disease via production of an elongated, abnormal protein. Most trinucleotide repeat diseases arise from expansion of CAG repeats. For example, the normal HTT gene, encoding the protein huntingtin, can have 9–35 CAG repeats. Patients with Huntington disease have 36 or more repeats, with increased numbers of repeats (>60) leading to earlier onset of disease. The expanded number of trinucleotide repeats does not affect the amount of huntingtin that is transcribed and translated. Instead, the long polyglutamine tract encoded by the CAG repeat causes the mutant protein to aggregate in the cell, resulting in cell death [19].

Microsatellite Instability/Rapid Repeat Expansion

Trinucleotide repeat diseases, including Fragile X and Huntington disease, exhibit a phenomenon known as anticipation. Once the number of repeats reaches a certain threshold, the sequence becomes unstable and rapidly expands during gametogenesis, with longer and longer repeats transmitted to subsequent generations [20,21]. Offspring with more repeats are more severely affected than their parents, as with relatively normal FMR1 premutation carriers compared with their affected children [17]. Subsequent generations may also have an earlier age of disease onset, as with Huntington disease [19].

Altered Splicing

As with SNVs, indels can affect splicing by a variety of mechanisms. First, indels occurring within splice acceptor, donor, or branch sites can result in aberrant splicing. Second, insertion or deletion of nucleotides into the gene sequence may create new splice sites. Third, indels affecting splice site modifiers can alter mRNA splicing. Exon skipping as a result of altered splicing can additionally cause a frameshift in subsequent exons, since not all exons are frame-neutral (a multiple of 3 bp in length). A classic example of an indel involving a splice site modifier, with resulting aberrant mRNA splicing and defective protein synthesis, involves the CFTR gene in cystic fibrosis. Thymidine homopolymer repeat polymorphisms in a CFTR intron 9 (formerly named intron 8) splice modifier site have been observed, containing 5, 7, or 9 thymidine residues. The 7T and 9T variants result in normal splicing, but the 5T variant results in skipping of exon 10 (formerly named exon 9) in most of the spliced mRNAs, with markedly decreased function in the resulting protein.
In addition, the presence or absence of exon 10 in the protein product modulates the severity of disease in individuals who harbor another variant (R117H) [22]; as such, evaluation and reporting of the intron 9 poly-T tract length in patients with the R117H variant is recommended by the American College of Medical Genetics and Genomics (ACMG) [23]. It is important to consider that not all splice alterations result in decreased protein function. For example, somatic splice site mutations, including SNVs and indels, have been identified in MET in patients with lung adenocarcinoma that result in skipping of exon 14. Exon 14 encodes a portion of the transmembrane domain of the protein that is required for ubiquitin-mediated degradation of MET, and so skipping of exon 14 results in increased MET activity via decreased recycling of the protein [24,25]. Increased MET activity is highly clinically relevant because, first, MET activation is one mechanism described for primary or secondary resistance to EGFR inhibitors in patients with lung adenocarcinoma, and second, MET itself is a potential therapeutic target [26,27].

Frameshift

Approximately two-thirds of indels occurring in protein-coding DNA are expected to result in a frameshift mutation. In addition to having a dramatic effect on the resulting amino acid sequence, frameshift mutations usually also convert a downstream codon to a nonsense codon and thus cause premature protein truncation.
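The frame arithmetic is simple to automate. The following minimal Python sketch (illustrative only; the function name and example alleles are hypothetical, and a real pipeline must also handle indels that span intron–exon borders) classifies an indel by the net length change of its alleles:

def classify_coding_indel(ref_allele, alt_allele):
    """Return 'in-frame' if the net length change is a multiple of 3,
    otherwise 'frameshift'."""
    net_change = abs(len(alt_allele) - len(ref_allele))
    if net_change == 0:
        return "substitution"  # equal-length replacement, no length change
    return "in-frame" if net_change % 3 == 0 else "frameshift"

print(classify_coding_indel("A", "ACTGA"))          # frameshift (net +4)
print(classify_coding_indel("A", "AGCATACGTGATG"))  # in-frame (net +12)

Note that the 12 bp insertion in the second example maintains the reading frame, echoing the in-frame activating insertions discussed later in this chapter.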


Although genes harboring frameshift mutations are transcribed, the mRNA is often not translated into a protein, as the abnormal mRNA is subjected to nonsense-mediated decay [28]. Rarely, translation of frameshifted mRNAs is permitted by the cell, most commonly when the frameshift occurs near the end of an mRNA and is less susceptible to nonsense-mediated decay. The resulting abnormal protein may retain its function, or it may act in a dominant negative fashion to disrupt the activity of normal cellular proteins (e.g., via abnormal protein–protein interactions). For example, one recurring mutation that has been observed in AML is a frameshift indel in the 5′ portion of the gene CEBPA, which encodes a transcription factor that is important for myeloid differentiation. The mRNA transcribed from the mutant allele is in fact translated, yielding a nonfunctional truncated protein. In addition to lacking its normal regulatory activity, the truncated protein blocks binding of wild-type protein to its DNA target, further inhibiting cell differentiation [29]. In general, frameshift mutations are expected to result in loss of functional protein, regardless of the mechanism by which the function is lost. Indeed, novel frameshift indels identified in clinical sequencing assays for constitutional or oncologic disease are interpreted as "likely pathogenic" according to guidelines put forth by the ACMG [30].

In-Frame, Decreased Protein Activity

Indels occurring in coding sequences can also result in insertion or deletion of amino acids without disruption of the reading frame (i.e., in-frame indels) if they are a multiple of 3 bp in length. In-frame indels can result in decreased or increased protein activity, depending on how the amino acid change affects protein structure, functional domains, and localization. For example, in another category of CEBPA mutation observed in AML, in-frame insertion of additional bases occurs in the 3′ portion of the gene and so affects the C-terminus of the protein. Interestingly, although these in-frame mutations result in the insertion or deletion of only one or a few amino acids in the resulting protein, they occur in a critical region of the protein and disrupt proper folding of the DNA-binding domain. The protein product is thus unable to bind its target DNA, again leading to inhibition of cell differentiation [31]. Interestingly, CEBPA mutations are often biallelic, with one allele harboring an N-terminal frameshift indel and the other harboring a C-terminal in-frame indel, a pattern that is associated with a relatively good prognosis in patients with cytogenetically normal AML [32].

In-Frame, Increased Protein Activity

Some of the most important mutations for clinical testing are in-frame indels, particularly those in tyrosine kinases like EGFR and KIT that result in increased kinase activity via alteration of the kinase domain itself or of other domains that regulate kinase activity. For example, in-frame indels in EGFR exon 19, involving the ATP-binding pocket of the kinase domain, are the most common EGFR mutations in lung adenocarcinoma and result in activation of EGFR kinase activity. As long as a co-occurring resistance mutation is not present, tumors harboring exon 19 indels are virtually always susceptible to inhibition by reversible EGFR tyrosine kinase inhibitors (TKIs) like erlotinib and gefitinib [33]. In contrast, in-frame indels in KIT most commonly occur not in the kinase domain, but rather in the juxtamembrane domain.
These indels induce conformational changes that cause receptor dimerization and kinase activation even in the absence of KIT ligand. Tumors harboring mutations of the KIT juxtamembrane domain are typically sensitive to targeted treatment with TKIs like imatinib [34,35].

Synonymous, Missense, and Nonsense

Insertion and deletion of an equal number of adjacent nucleotides (e.g., replacement of two cytosines with two guanines) in cis, while correctly annotated as an indel, can have the same types of effects that are observed for SNVs, namely synonymous, missense, or nonsense variants (see Figure 9.1). The variant nucleotides may all occur in one codon, resulting in alteration of one amino acid, or they may be spread across two or more codons (e.g., the third nucleotide of codon 1 and the first nucleotide of codon 2), resulting in alteration of multiple amino acids. When multiple adjacent variants are identified by NGS variant detection pipelines, the user must review the calls to determine whether the variants occur in cis or trans, in order to correctly annotate and interpret the variants.

Predicting Functional Effects of Novel Indels

When it comes to interpreting clinical NGS test results, it is often helpful to compare identified indels with those that have been reported and characterized previously. A variety of reference databases exist for this purpose, and the strengths and weaknesses of publicly available databases in general are discussed in Chapter 12.


The same limitations that affect use of reference databases for interpretation of other classes of sequence alteration also apply to indels, the most important of which is the variable strength of evidence required to establish a disease association/clinical meaning of a variant that is present in the databases. This critical issue has been recognized by several governmental and professional organizations, and efforts to address this limitation are ongoing (http://www.iccg.org/about-the-iccg/clingen/) [36–38]. In silico prediction tools are sometimes helpful in determining whether a novel indel identified by NGS testing will alter the function of an encoded protein. It is important to note that these tools are generally applied to coding variants and not variants occurring in noncoding or regulatory regions. Two main tools for predicting functional effects of indels are publicly available, namely PROVEAN and SIFT [39,40]. Most of the other tools available for prediction of the functional effects of SNVs cannot accept indels as input at this time, or are based on SIFT analysis. A key limitation of using PROVEAN and SIFT to predict the functional effect of an indel is that only two types of effect are possible as output: neutral (no expected functional effect) or deleterious. The user must always remember that "deleterious" does not necessarily mean decrease or loss of protein expression/function; indeed, PROVEAN and SIFT predict that an activating ERBB2 insertion (chr17:g.37880981_37880982insGCATACGTGATG) is "deleterious." The savvy user should not be misled and instead must use his or her knowledge of molecular biology and human disease to decide whether the variant is likely to have a positive or negative effect.

TECHNICAL ISSUES THAT IMPACT INDEL DETECTION BY NGS

As discussed in the previous section, clinically relevant indels occur in a variety of constitutional and oncologic diseases. Although NGS techniques are suitable for indel identification, several technical factors must be considered when developing a clinical assay that includes indel detection.

Sequencing Platform Chemistry

The choice of sequencing platform can have a profound effect on indel detection in an NGS assay. Available platforms are extensively discussed in Chapter 1; only the critical points pertinent to indel detection in clinical testing are reviewed again here. There are two general approaches employed by the NGS instruments that are most often used in clinical applications (see Chapter 1, Figures 1.2 and 1.4). Illumina sequencing platforms (HiSeq or MiSeq systems; Illumina, Inc., San Diego, CA) use a reversible dye terminator approach that is somewhat comparable to conventional Sanger sequencing chemistry. In this approach, the four nucleotides used for sequencing (i.e., A, C, T, G) are each labeled with a different color fluorescent tag and pooled together. The 3′ OH group on each nucleotide is blocked by an additional chemical modification, such that only one nucleotide can be added at a time. DNA fragments are sequenced by initial priming followed by one-base extension, adding the pooled, labeled bases (i.e., all four nucleotides at once) with the required PCR reaction components. After a single nucleotide has been incorporated, further extension is blocked by the modified 3′ OH. Unbound nucleotides are removed by washing, and the identity of the newly incorporated nucleotide is determined by fluorescence detection. The fluorescent tag is then removed, as is the blockage at the 3′ OH site, in preparation for the next round of extension. The nucleotide pool is then added again, and the process is repeated until the target sequence read length is achieved (e.g., 101 bp for HiSeq, 150 bp for MiSeq). When paired-end reads are desired, the process repeats from the opposite end [41].

In contrast, the Ion Torrent (Life Technologies, Carlsbad, CA) platform is based on ultrasensitive pH detection and does not utilize a terminating approach. Individual nucleotides are not labeled, and as such must be made available sequentially, one at a time (A, wash, T, wash, etc.). Incorporation of a nucleotide is detected by the pH sensor when a hydrogen ion is released during extension of the sequence fragment; if the nucleotide that is available is not the next base in the template sequence, it is not incorporated and no pH change is detected. However, since extension can occur from the end of any incorporated nucleotide (i.e., there is no termination step), all of the nucleotides in a homopolymer will be incorporated in the same cycle; thus, for example, if there are five thymidines in a row in the template sequence, five adenines will be incorporated in the same cycle [41]. Unfortunately, output from the pH sensor is not linear at these homopolymeric sites, meaning that a run of five thymidines may falsely be read as four thymidines or seven thymidines. Knowing that indels often occur in areas with repetitive sequences (including homopolymers), pH detection chemistry is an obvious potential source of false indel calls and an important limitation to using nonterminating approaches for clinical applications.
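The homopolymer problem can be made concrete with a toy simulation. The Python sketch below is a deliberately simplified model with a made-up noise parameter, not the platform's actual signal processing; it shows how a single flow signal whose magnitude grows with run length can be miscalled by one base, producing a false 1 bp indel:

import random

def flow_signal(run_length, noise_sd=0.35):
    """Simulated signal for one flow over a homopolymer: roughly
    proportional to run length, with noise that lets adjacent run
    lengths overlap (an invented noise model, for illustration only)."""
    return run_length + random.gauss(0, noise_sd)

random.seed(1)
true_length = 5  # e.g., five thymidines in the template
calls = [round(flow_signal(true_length)) for _ in range(10)]
print(calls)  # an occasional 4 or 6 here is a false 1 bp indel call

Because a terminating chemistry interrogates each base in a separate cycle, it does not face this run-length estimation problem, which is consistent with the platform error rates cited below.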


Several studies have evaluated the rate of insertion or deletion errors inherent to the various available NGS platforms. Although the chemistry, detection mechanism, and base-calling software for each platform are continuously evolving, these studies provide useful insights into which platforms are most appropriate for specific clinical applications. In terms of indel detection, admittedly only one of many factors that must be considered in selecting an NGS platform, the reversible dye terminator approach has a lower indel error rate (<0.001 indels per 100 bases sequenced) compared with pH detection (1.5 indels per 100 bases), for the reasons discussed above [42–45].

Sequence Read Type and Alignment

NGS platforms have the ability to generate reads of a specified length, depending on the platform and run cycle employed. Additionally, the user may decide whether to sequence fragments in one direction (single-end) or from both ends of the fragment (paired-end) (see Chapters 1 and 7). In general, clinical NGS assays are based on paired-end sequencing of reads at least 100 bp in length to facilitate alignment of generated sequence reads to the reference human genome and subsequent variant calling [46,47]. Longer reads are easier to map to the reference genome, as they are less likely to line up with multiple positions in the genome. Furthermore, a paired-end approach not only allows for sequencing of longer template fragments compared with single-end sequencing, but also allows the alignment positions of the paired reads to be considered in relation to each other, permitting identification of variants (including indels or structural variants) that are larger in scale than the individual reads (discussed below and in Chapter 10). Sequences that match more than one position in the genome (ambiguous alignment) are difficult to analyze and interpret because it is not possible to determine whether a variant identified in that fragment came from position A or position B in the genome. This is particularly problematic in clinical testing for genes for which there are inactive pseudogenes elsewhere in the genome (e.g., PIK3CA). If sequence reads are not sufficiently long to capture small variations in the reference sequence that can be used to distinguish the gene from the pseudogene, noncoding variants that are actually occurring in the pseudogene may incorrectly be interpreted as occurring in the transcriptionally active gene.

The ability to detect indels (or any other variant type, for that matter) in NGS data depends first and foremost on the ability to correctly align generated sequence reads to the reference genome. Indels create a unique challenge in read alignment, since insertion or deletion of one or more bases will impact alignment of the rest of that sequence read if the alignment tool is not able to ascertain that a base has been skipped and the rest of the nucleotides are shifted, but unchanged. There are two general approaches to aligning reads to a reference genome: ungapped or gapped (also called "split") (Figure 9.6). Ungapped alignment does not allow for insertion or deletion of bases in the sequence read compared with the reference genome. As a result, any base aligned after an indel will not match the reference. There are a few potential outcomes to this problem: (1) an insufficient number of bases will map to the reference genome, and the read will be discarded (unmapped), or (2) enough bases will match the reference for the read to map, but the shifted bases adjacent to the indel will not match the reference. These unmatched bases may either be hard-clipped (removed entirely from the read), soft-clipped (masked from further analysis, unless specifically evaluated by the user), or falsely identified as variants (e.g., a string of SNVs) during subsequent variant detection.


FIGURE 9.6 Gapped versus ungapped alignment. Two general approaches are used in aligning sequence reads to a reference genome: gapped and ungapped alignment. A 25 bp sequence read, containing a 6 bp deletion in comparison to the reference sequence, is shown as an example here. Gapped aligners allow the sequence read to partially align to one part of the reference genome and partially align to another, noncontinuous part of the reference genome. Hence, gapped aligners permit each sequence read to be broken into at least two subreads with a gap separating them (shown as a dashed line). With gapped alignment, the nucleotides that are present are optimally matched to the reference sequence, with no mismatches in the example presented here. Gapped alignment is best for variant detection in general and for indel detection in particular. In contrast, ungapped alignment does not allow the read to be split. As a result, all bases that occur after the 6 bp deletion do not match the reference sequence (red) and will either be clipped from the read or incorrectly interpreted as variants. Ungapped alignment is not useful for indel detection by NGS.


Ungapped alignment is not a useful approach for indel detection by NGS. Most alignment tools in use for clinical NGS applications use a gapped approach. Aptly named, gapped aligners (e.g., Novoalign (Novocraft, Kuala Lumpur, Malaysia) and the Burrows–Wheeler Aligner) allow gaps to occur during comparison of the sequence read to the reference genome [48]. For example, if the first 20 bases in a sequence read match the reference genome in one position, and the rest of the bases match the reference but appear shifted three bases from the previously established alignment, a gapped aligner will allow the beginning of the read to map at the first position, followed by a 3 bp gap, followed by the rest of the read. Gapped aligners are well suited to indel detection because they accommodate alignment of reads containing indels. Notably, many variant detection tools also incorporate mechanisms for local realignment of mapped sequence reads around identified indels to further improve variant identification and decrease false-positive variant calls (discussed below).
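In practice, a gapped aligner records the placement of indels in each read's CIGAR string, where "I" and "D" operations mark inserted and deleted bases. The following minimal Python sketch (illustrative only; a production pipeline should use a BAM library such as pysam rather than string parsing) recovers indels and their reference coordinates from a CIGAR string:

import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indels_from_cigar(pos, cigar):
    """Yield (reference_position, op, length) for each insertion or deletion.
    `pos` is the 1-based leftmost reference position of the aligned read."""
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":                     # insertion: consumes query bases only
            yield (ref_pos, "I", length)  # position of the next reference base
        elif op in ("D", "N"):            # deletion/skip: consumes reference only
            if op == "D":
                yield (ref_pos, "D", length)
            ref_pos += length
        elif op in ("M", "=", "X"):       # aligned bases: consume both
            ref_pos += length
        # S, H, and P consume no reference positions

print(list(indels_from_cigar(1000, "16M6D9M")))  # [(1016, 'D', 6)]

The example CIGAR, 16M6D9M, corresponds to a 25 bp read spanning a 6 bp deletion like the one illustrated in Figure 9.6: 16 aligned bases, a 6 bp gap in the read relative to the reference, and 9 more aligned bases.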

Library Preparation Technique

As discussed above, polymerase errors contribute to indel formation in vivo. Similarly, polymerase errors during the PCR cycles that are part of NGS protocols (see Chapter 1) can likewise introduce indels in the fragments to be sequenced. The two main library preparation techniques employed in NGS assays, hybrid capture and amplification, both utilize amplification steps (as discussed in detail in Chapters 3 and 4), and the pertinent point about library preparation for indel detection applications is that more cycles of PCR lead to more potential for false indel calls due to polymerase errors. Additionally, PCR amplification efficiency is variable across different templates in a multiplexed reaction, due to variations in template size and sequence content, and as a result, the observed variant allele frequency (VAF) may be skewed from the true frequency of that variant in the original sample [49–54]. This is particularly important to remember when using amplification-based library preparation techniques, which involve significantly more rounds of amplification than hybrid capture-based techniques.

Depth of Coverage

The depth of coverage that can be achieved in an assay is dependent upon the size of the target region to be sequenced, the number of samples that are multiplexed in a lane, and factors intrinsic to the target region like mappability (lower for areas with repeating sequences, or for areas with homology to multiple regions in the genome) and GC content [55,56]. The coverage over a target region can be highly variable from position to position, and a depth of coverage as high as 1000× is often required in order to achieve at least 400× coverage at most targeted positions [46]. The relationship between sensitivity, specificity, and depth of coverage has not been clearly delineated for indels due to high variability in the rate of indel detection between different bioinformatics tools (discussed below), but in general, sensitivity for detection of low-frequency variants (VAF of 5–10%) is limited at depths of coverage below roughly 400×. Likewise, specificity decreases with decreasing depth of coverage [57,58]. It is important to remember that the depth of coverage required for sensitive and specific variant detection from heterogeneous solid tumor samples is much higher (roughly 1000× as discussed above) than for constitutional testing (about 100×), where variants are expected to be present at VAFs of 0%, 50%, or 100% in the cells being evaluated.
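The interplay of depth, VAF, and sensitivity can be illustrated with a simple binomial model. The Python sketch below is a back-of-the-envelope calculation that ignores sequencing error, mapping bias, and strand effects, and its thresholds are illustrative rather than validated; it estimates the probability of sampling enough variant-supporting reads for a call to be made:

from math import comb

def detection_probability(depth, vaf, min_reads):
    """Probability of sampling at least `min_reads` variant reads when a
    variant at the given VAF is sequenced to the given depth (binomial)."""
    p_fewer = sum(comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
                  for k in range(min_reads))
    return 1.0 - p_fewer

# A 5% VAF variant requiring at least 10 supporting reads for a call:
for depth in (100, 400, 1000):
    print(depth, round(detection_probability(depth, 0.05, 10), 3))
# At 100x detection is unreliable; near 400x and above it approaches 1.

Under this simplified model, the probability of detection rises from under 5% at 100× to essentially 100% at 400× and beyond, consistent with the coverage guidance cited above.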

Assay Design

Like SNVs, indels can be identified from any size of panel (single gene through exome or genome), as long as sufficient depth of high-quality coverage of a representative sample is achieved. This is in contrast to CNVs, which are more readily identified from larger panels, as discussed in Chapter 11.

SPECIMEN ISSUES THAT IMPACT INDEL DETECTION BY NGS

Indel detection by NGS is subject to the same specimen quality limitations as SNV detection (discussed in Chapter 8). In general, testing for constitutional diseases is performed on fresh tissue (peripheral blood leukocytes or buccal epithelial cells) and has few specimen-related limitations. Most specimen quality considerations apply to cancer sequencing, which is briefly reviewed here.


Specimen Cellularity and Heterogeneity

In order to identify an indel that is present in a tumor, the indel obviously must be present in the cells that are submitted for sequencing. The first step in achieving this requirement is ensuring that the submitted sample is, indeed, tumor. As a result, review of the submitted specimen by a pathologist is required to confirm that tumor tissue is present. Review can be performed on sections cut from frozen tissue, or sections from fixed tissue. Formalin or alcohol fixation is appropriate for tissues being submitted for NGS analysis [59,60]. Specimens that have been decalcified with acid are virtually never acceptable for sequence analysis; decalcification with chelating agents (e.g., EDTA) is preferred if necessary [61,62]. During review for confirmation of diagnosis, the pathologist can also evaluate the sample to ensure that viable tumor cells are present in adequate numbers for the NGS results to represent the tumor. Tumors are heterogeneous, consisting not only of tumor cells but also associated inflammatory cells, stromal cells, blood vessels, and normal parenchymal cells, and the relative proportion of these various cell types is highly variable between different tumor samples and even between different areas of the same tumor. To ensure optimum sensitivity, areas where the estimated tumor cellularity is more than two times the VAF threshold for variant reporting in the particular clinical assay being used should be selected for analysis (i.e., if a cutoff of 10% VAF is used for clinical reporting, areas with more than 20% tumor cellularity should be used, so that heterozygous variants present in all the tumor cells submitted for analysis are likely to be detected). It is also critical to remember that many variants will only be present in a subset of the tumor cells, reflecting the rich clonal architecture of many tumors; thus the areas of highest tumor cellularity will lead to the highest test sensitivity [63–66].
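The cellularity rule of thumb above follows from simple arithmetic, sketched below in Python (assuming a diploid region, a clonal heterozygous variant, and no copy-number change; real tumors violate all three assumptions to varying degrees, and the function names are our own):

def expected_vaf(tumor_cellularity):
    """Expected VAF of a clonal heterozygous somatic variant in a
    diploid region: half the tumor cell fraction."""
    return tumor_cellularity / 2

def minimum_cellularity(vaf_reporting_threshold):
    """Lowest tumor cellularity at which such a variant is still
    expected to exceed the assay's VAF reporting threshold."""
    return 2 * vaf_reporting_threshold

print(expected_vaf(0.20))         # 0.1 -- borderline at a 10% VAF cutoff
print(minimum_cellularity(0.10))  # 0.2 -- i.e., >20% tumor cellularity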

Library Complexity

Library complexity describes the number of individual DNA molecules (and thus, the number of individual cells) that were sampled in a DNA library, and can be represented as the percent of targeted positions with unique coverage meeting the threshold targeted by the clinical assay (e.g., percent of positions with unique coverage >400× for a cancer assay). "Unique coverage" refers to the number of individual unique (not PCR duplicate) reads that overlap a particular genomic position. Library complexity depends largely on the amount of DNA extracted from the original tissue sample, with DNA yields of greater than 200 ng generally resulting in a library with good complexity. Paucicellular specimens are at risk for generating DNA libraries with low complexity, meaning that few unique DNA molecules are present at the start of the sequencing assay. DNA libraries with low complexity are more susceptible to sampling bias and allelic dropout, since only relatively few cells are being evaluated.
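As a minimal illustration of how complexity can be monitored, the Python sketch below estimates the unique (non-duplicate) fraction of read pairs from their alignment signatures. This is a simplification of what dedicated duplicate-marking tools do, and the signature scheme and example data are illustrative only:

from collections import Counter

def unique_fraction(read_pairs):
    """`read_pairs` is an iterable of (chrom, start, mate_start) signatures;
    pairs sharing a signature are treated as PCR duplicates. Returns the
    fraction of read pairs that are unique."""
    signatures = Counter(read_pairs)
    return len(signatures) / sum(signatures.values())

pairs = [("chr4", 100, 350), ("chr4", 100, 350), ("chr4", 102, 348),
         ("chr4", 100, 350), ("chr4", 130, 377)]
print(unique_fraction(pairs))  # 0.6 -- three unique molecules among five pairs

A low unique fraction warns that apparent depth of coverage overstates the number of independent molecules (and cells) actually sampled.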

BIOINFORMATICS APPROACHES TO NGS INDEL DETECTION

When performing clinical NGS testing for multiple classes of mutations, it is critical to remember that analysis tools are generally optimized for one class of mutation. Although each may have some capacity for identification of other classes, parallel integration of multiple informatics tools is typically required when building a clinical NGS analysis pipeline. In addition, there is a complicated relationship between indel annotation (how the variant itself is written) and linkage with the appropriate clinical interpretation (what the existing literature says about the variant and what it means clinically).

General Bioinformatics Approaches to Indel Detection and Annotation

Local Realignment

As discussed above, alignment of indel-containing sequence reads is technically challenging and is best achieved with a gapped or split alignment algorithm. Read alignment can be further improved by use of tools like the IndelRealigner (a component of the Genome Analysis Toolkit, or GATK) that reevaluate data that have already been mapped (i.e., a BAM file) and tweak the local alignment of bases within each mapped read so as to minimize the number of base mismatches [67]. As a result, artifactual base mismatches that result from initial misalignment due to a nearby indel are corrected; if these artifacts are not corrected, they may mistakenly be interpreted as SNVs (Figure 9.7).


FIGURE 9.7 Local realignment. Variant calling pipelines typically include a step to realign mapped sequence reads around possible indels, in order to minimize base mismatches in the surrounding sequence. The IGV screen shot shown here of the CHIC2 3′ UTR in a HapMap sample illustrates the false SNVs that are called when a sequence read is not mapped correctly (arrows). In this compact alignment view, deletion calls are shown as black horizontal bars and C>A SNVs are shown as bright green bars. There are several features to note. First, a dip in coverage observed at the site of a C deletion lends support to a deletion call at that site. However, some reads indicate that the deletion involves not just the C, but also encompasses the adjacent A. It is possible that both deletions are present in the sample, though the coverage track really only shows evidence of the C deletion. Second, some of the reads show a C>A SNV instead of a C deletion. Again, it is possible that both variants are present. Third, some of the reads in which the variant is called as a C>A SNV instead of a C deletion have false variant calls downstream (insertions or additional SNVs, depending on the read overlap with adjacent nucleotides) as a result of an incorrect call at the site of the deletion. Fortunately, this variant occurs in a presumably healthy individual and affects a UTR and as such is not likely to be particularly relevant in clinical testing, but this case illustrates that indel identification often is not straightforward and the presence of indels can alter variant calls in nearby sequences.

Left Alignment

An additional realignment tool that is available in the GATK package, LeftAlignIndels, is designed to map each indel at the leftmost possible position [67]. As mentioned previously, indels commonly occur in areas containing tandem repeats; hence, indel annotation in repetitive sequences is ambiguous. That is, deletion of three thymidines in a string of five thymidines results in two thymidines, regardless of whether the first, middle, or last three thymidines were deleted. Although this point may seem trivial since the resulting sequence is the same whether the indel is annotated as a deletion at the beginning, middle, or end of the repeat tract, issues arise when attempting to link the various possible annotations to existing literature for clinical interpretation. The most common activating ERBB2 indel in lung cancer provides an excellent illustration of this conundrum (Figure 9.8). Using the LeftAlignIndels tool with subsequent indel calling, the indel is annotated as p.E770_A771insAYVM. Amazingly, this annotation is not present in public databases like ClinVar, COSMIC, HGMD, or dbSNP (as of March 2014) [36,68–70]. However, alternate but equivalent annotations are readily identified by searching the clinical literature, including p.M774_A775insAYVM and p.A775_G776insYVMA. Clinical studies have demonstrated that this particular indel results in HER2 kinase activation, and patients with lung adenocarcinoma harboring this mutation can have a durable clinical response to targeted inhibition with ERBB2 family inhibitors (as is the topic of ongoing clinical trials) [71–73]. This example makes it clear that left-alignment of indels, though potentially very helpful in standardizing indel annotation, is not an easy solution to the problem of connecting identified indels with existing literature during NGS result interpretation.
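The left-alignment operation itself is straightforward to sketch. The toy Python function below handles deletions only, uses a hypothetical function name, and is not a substitute for LeftAlignIndels or a VCF normalizer; it shifts a deletion to its leftmost equivalent position within a repeat tract:

def left_align_deletion(ref, start, length):
    """`ref` is the reference sequence, `start` the 0-based index of the
    first deleted base, `length` the deletion size. Returns the leftmost
    start position that yields an identical post-deletion sequence."""
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start

# Deleting one GTA unit from a GTA repeat tract: every placement gives the
# same post-deletion sequence, and all of them normalize to position 0.
ref = "GTAGTAGTAGTACCT"
print(left_align_deletion(ref, 6, 3))  # 0

Applied to a repeat tract like the one in Figure 9.3, a 3 bp deletion called anywhere in the tract normalizes to the same leftmost coordinate, which is what allows supporting reads with different raw placements to be counted together.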


FIGURE 9.8 Redundant annotations for ERBB2 insertion mutation. The most common activating ERBB2 indel in lung cancer results from duplication of 12 nucleotides in exon 20, resulting in insertion of 4 amino acids in the protein sequence. As with other indels, there are multiple ways to annotate the same final nucleotide and amino acid sequence. The reference genomic nucleotide sequence and resulting amino acid sequence at the beginning of exon 20, with amino acid numbering according to the NP_004439 isoform of the ERBB2/HER2/neu protein, are shown (A). Three possible annotations for the activating ERBB2 indel are shown (B, C, and D; inserted nucleotides and amino acids shown in red, reference shown in blue). In parts B and C, the inserted nucleotide sequence is the same (GCATACGTGATG), but the site of insertion is different. In part B, the insertion is made at the beginning of the reference AYVM sequence, with the genomic annotation chr17:g.37880981_37880982ins12 and protein annotation NP_004439:p.E770_A771insAYVM. In part C, the insertion is made after the reference AYVM sequence, shifting the indel annotations over to chr17:g.37880993_37880994ins12 and NP_004439:p.M774_A775insAYVM. However, the resulting nucleotide and amino acid sequences are exactly the same in parts B and C. Treacherously, the same insertion can also be annotated with what seems to be a completely different inserted sequence (D). Note that the amino acid before and after the reference YVM is an alanine. However, the reference A771 is encoded by the genomic sequence GCA, whereas the reference A775 is encoded by the genomic sequence GCT. Insertion of the sequence ATACGTGATGGC, splitting the GC and T that normally encode A775 (i.e., genomic annotation chr17:g.37880995_37880996insATACGTGATGGC), keeps an A at amino acid number 775 (though now encoded by GCA instead of the reference GCT). The inserted nucleotides result in insertion of amino acids YVMA between reference amino acid positions 775 and 776 (i.e., protein annotation NP_004439:p.A775_G776insYVMA). The inserted A (just prior to G776) is derived from the last two inserted nucleotides (GC) and the T that was split off from what was originally A775. Although at first glance the various annotations listed in parts B, C, and D seem different, they all result in the same final nucleotide and amino acid sequence. Comically, none of these possible annotations is technically correct based on the Human Genome Variation Society recommendations for mutation nomenclature (http://www.hgvs.org/mutnomen/disc.html#dupins; accessed August 17, 2014), in which the variant is most appropriately annotated as a duplication (dup): chr17:g.37880982_37880993dup and NP_004439:p.A771_M774dup. The dup annotation easily captures annotations B and C, but it does not really encompass annotation D. The annotation complexities discussed here do not even address differences in protein isoforms (e.g., A771 in ERBB2 isoform NP_004439 is equivalent to A741 in isoform NP_001005862). Ultimately, the molecular pathologist or clinical genomicist is responsible for realizing that all of these possible annotations actually refer to the same final mutant sequence. He or she must be aware of the annotation format output by his or her bioinformatics pipeline, as well as the various annotations that may already exist in databases and clinical literature. Sophisticated bioinformatics solutions that can identify multiple redundant annotations and comprehensively review databases for each and all of the possible annotations will facilitate reproducible clinical interpretation of indels.

Since the purpose of performing NGS in a clinical setting is to provide results that can be used to direct patient care, failure to correctly interpret an indel because of redundant annotations that lead to different correlations with existing data is decidedly unhelpful and can deprive patients of potentially effective treatments. Coordinated efforts will be required to standardize indel annotation both prospectively and retrospectively, in order to facilitate clinical interpretation of NGS indel calls.

Probabilistic Modeling Using Mapped Reads

Some indel detection tools (including the GATK UnifiedGenotyper, Dindel, and SAMtools) use probabilistic modeling of mapped reads to identify variants [67,74,75]. By these approaches, in order for an indel-containing read to be aligned to the reference genome, a sufficient number of high-quality bases must match the reference on both ends of the read (Figure 9.9). These well-aligned bases serve as an "anchor" that places the indel-containing read in the correct position in the reference genome; if the indel is longer than about 15% of the read length, the flanking sequences are no longer "sticky" enough to appropriately align the read, and as a result the read will not be mapped.


FIGURE 9.9 Algorithms for indel detection. FLT3 ITD detection can serve as an example for indel detection in general. The FLT3 ITD is an insertion that occurs between exons 13 and 14. These insertions (shown in gray) range in size from 15 to approximately 300 bp (A). Probabilistic methods for finding insertions, including the GATK, SAMtools, and Dindel, apply statistical models to make insertion calls based on data obtained during the initial read mapping and alignment process (B). Because of the difficulty associated with aligning short reads, only small insertion events (generally <15% of the total read length) can be identified by this approach (aligned reads are shown in green; unaligned reads in purple). Such reads generally have sufficient homology in the regions flanking the insertion to permit accurate alignment. Large insertions (>16 bp), including the FLT3 ITD, are too long to be detected by probabilistic methods that rely on mapped reads. In contrast, paired-end split-read analysis approaches, including Pindel and de novo alignment, can reliably detect larger insertions, including the FLT3 ITD. In this approach, mate-pairs are identified in which one end is mapped, but the other is not. The unmapped mates are then assembled to form contigs with partial homology to the reference sequence, using a pattern-growth algorithm (Pindel) or de novo assembly with a custom script (unpublished data) executing Phrap assembly software. This method allows for the detection of much larger insertions. Reprinted with permission from Elsevier [76], copyright (2013).

Thus, methods for indel detection that rely on mapped reads are limited to detection of indels around 15 bp in length (from 100 bp sequence reads), which is suboptimal in routine clinical use, since many disease-associated indels are longer than 15 bp. The "probabilistic modeling" component of these tools means that the indel detection tool takes into account factors like sequencing error rates, base call and alignment quality scores (covered in Chapter 7), and user-defined "penalties" that either permit or discourage interpretation of variant sequences as indels. Some tools (like VarScan2) also use heuristic models to filter out false-positive calls that are due to homopolymers [77]. The likelihood of a particular nonreference sequence being due to a specific indel is determined based on statistical analysis, and indels are called if the likelihood is sufficiently high to meet criteria for output. As a result, indel detection tools using probabilistic modeling have a high specificity (although the limited size range of indels that can be detected by these methods limits their sensitivity in clinical testing).

Split-Read Analysis

Split-read analysis approaches to indel detection facilitate identification of longer indels (>15 bp). "Split read" refers to paired-end sequence reads where one read in the pair maps to the reference genome, but the other does not map well (Figure 9.9). Reasons for poor alignment with the reference sequence include settings in which the read spans an insertion, deletion, or structural variant like a translocation or inversion.


If an insufficient number of high-quality bases that match the reference genome are present on either side of an indel-containing read to "anchor" alignment of the read to the reference, the read will remain unmapped. Alternatively, if one end of the indel-containing read maps well at a particular location of the reference genome but the other end of the read does not map well at that position (despite having high-quality base calls), the end that does not map well may be "soft-clipped." Some algorithms for indel and structural variant detection specifically utilize split and soft-clipped reads (sometimes called one-end anchored reads) to identify possible breakpoints in NGS sequence data; the reads can be analyzed using a pattern-growth algorithm (like Pindel), whereby unmapped reads are broken into smaller pieces and realigned separately to identify possible indels, and/or by de novo assembly, whereby unmapped reads are reassembled into a contig based on their overlaps with each other [76,78]. Importantly, evaluation of split and soft-clipped reads allows for the identification of the full size spectrum of indels, and the approach is not subject to the same read length constraints as probabilistic methods.

In currently available split-read algorithms, support for a particular indel call is provided by the number of reads that are consistent with the indel. If analysis of many separate reads results in the same indel call, it is more likely to be a true positive than if only one read supports the indel call. Obviously, distinguishing low-frequency true indel calls (especially in heterogeneous cancer samples) from false-positive calls without the rigorous likelihood analysis performed in probabilistic approaches can be challenging. Indeed, a high false-positive rate when using a low threshold for the number of supporting reads is one of the main limitations of split-read approaches, and ongoing refinement of split-read analysis tools, with incorporation of more rigorous criteria for establishing true versus false-positive indel calls, will be helpful in the future.
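The first step of such split-read approaches can be sketched with pysam, scanning a BAM for recurrently soft-clipped read ends that flag candidate breakpoints for downstream assembly. The file name, contig name, coordinates, and thresholds below are placeholders; this is a sketch of the concept, not of Pindel's actual algorithm:

import pysam
from collections import Counter

BAM_CSOFT_CLIP = 4  # CIGAR operation code for soft-clipping in BAM records

def softclip_hotspots(bam_path, chrom, start, end, min_clip=10, min_reads=5):
    """Count read ends with long soft-clips; positions where clipping
    recurs are candidate breakpoints for assembly-based follow-up."""
    clip_sites = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            if read.is_unmapped or not read.cigartuples:
                continue
            first_op, first_len = read.cigartuples[0]
            last_op, last_len = read.cigartuples[-1]
            if first_op == BAM_CSOFT_CLIP and first_len >= min_clip:
                clip_sites[read.reference_start] += 1  # clip on the left end
            if last_op == BAM_CSOFT_CLIP and last_len >= min_clip:
                clip_sites[read.reference_end] += 1    # clip on the right end
    return {pos: n for pos, n in clip_sites.items() if n >= min_reads}

# Hypothetical usage (file, contig name, and coordinates are placeholders):
# print(softclip_hotspots("sample.bam", "chr13", 28607000, 28609000))

In a full implementation, the clipped sequences at each hotspot would then be assembled and realigned to characterize the underlying insertion or deletion.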

Sensitivity and Specificity Issues

Assuming that adequate specimen quality, library complexity, and depth of coverage are accounted for, there are several factors inherent to indel variants themselves that impact the sensitivity and specificity of indel detection tools from the perspective of an NGS clinical assay.

Indel Length

As discussed above, indel detection tools that use probabilistic analysis of mapped reads have a limited capacity to identify indels that are longer than approximately 15% of the length of a sequence read, and as a result, probabilistic tools including the GATK UnifiedGenotyper are not able to identify longer insertions. While many clinically relevant insertions, like EGFR exon 19 activating indels, are short enough to be detectable by analysis of mapped reads, other insertions are missed entirely by such methods. For example, FLT3 internal tandem duplications (ITDs) that are used clinically to predict prognosis and guide treatment in patients with cytogenetically normal AML can be anywhere from 15 to over 300 bp in length [79–82]. A study evaluating the performance of various indel detection tools showed that GATK, Dindel, SAMtools, and other probabilistic tools completely failed to identify FLT3 ITDs that by conventional PCR and capillary electrophoresis ranged from 18 to 185 bp in length [76]. In contrast, Pindel and de novo assembly techniques were able to identify nearly all of the ITDs, including in cases with multiple insertions and in cases with low-frequency ITDs (estimated VAF 2%) that were initially missed by Sanger-based methods but subsequently confirmed by PCR. However, Pindel also identified occasional small (<5 bp) indels in the analyzed FLT3 region that were determined to be false positives, all of which had few supporting reads (<5 supporting reads, with an average depth of coverage over 2000×) and could be easily eliminated from consideration by requiring a minimum number of supporting reads [76]. While the optimum number or fraction of supporting reads necessary to determine whether an indel is a true positive or a false positive has not been rigorously established, in many clinical settings, Pindel calls with fewer than 20 supporting reads (with a depth of coverage around 1000×) cannot be reproduced by Sanger sequencing of the same DNA, and these variants tend to be false positives by visual inspection. As indel detection tools mature, sensitivity and specificity will likely improve.

Indel Annotation

A major issue with clinical indel detection is annotation (i.e., how the size, composition, and genomic location of the indel is written). As discussed above, indels often occur in repetitive sequences, and thus multiple possible annotations can describe the same resulting sequence (see Figure 9.8). In some cases, the variant caller cannot specify which of the possible annotations is the "real" annotation, and so will split the supporting reads across all the possible annotations (e.g., half in the left-aligned annotation, half in the right-aligned annotation).


When filtering tentative indel calls by the number or fraction of supporting reads, an indel that is split across multiple possible annotations may fall below the filter threshold for each individual annotation, even though the number of supporting reads would exceed the filter threshold were the reads all annotated the same, leading to a false-negative result. Left-alignment prior to indel calling can decrease this problem by combining the multiple annotation bins into a single left-aligned annotation. Importantly, left-alignment does not necessarily facilitate comparison of the identified indel to existing databases and literature, and in fact may preclude correlation of a potentially relevant indel with existing data during interpretation, as discussed above. Another issue with indel annotation by existing NGS bioinformatic approaches is that the exact number of bases inserted or deleted does not always correlate with the size predicted by PCR and capillary electrophoresis, or by other NGS indel detection tools [76]. The difference between insertion of one or two trinucleotide repeats may not be clinically significant, but insertion of eight bases is certainly different from insertion of nine bases in a coding sequence, because eight bases would cause a frameshift while nine bases would maintain the reading frame. How often this type of scenario causes a problem in clinical indel detection is completely unknown.

Definition of Indel "Truth"

A persistent issue in evaluating the sensitivity and specificity of indel detection by NGS is the lack of an appropriate "gold standard" for validation. It is clear that NGS methods facilitate identification of low-frequency variants that are below the limit of detection by conventional Sanger sequencing, since the limit of detection of NGS is around 1% while the limit of detection for Sanger is about 5–10% [57,76]. Hence, use of Sanger sequencing as the gold standard for indel identification will result in a falsely low specificity for NGS detection tools, particularly in evaluation of low-frequency indels. At the same time, it is clear that some NGS indel detection methods are prone to a high false-positive rate. A few characteristics observable by visual inspection of aligned sequence reads (e.g., in the Integrative Genomics Viewer, IGV) can be helpful in determining whether an indel called in an NGS pipeline is a true or false positive [83]. Features supporting a "true" indel are illustrated in Figure 9.2. Deletions are much more easily evaluated by visual review than insertions. Because insertions are by definition not part of the reference sequence, it is difficult to display them when the reference sequence is shown sequentially and without gaps; insertions are thus displayed as a small vertical tick mark regardless of length, and no change in coverage is apparent at sites of insertion. In contrast, deletions are shown as a black spacer bar where the nucleotides should be, and a well-demarcated dip in coverage can be seen in the associated coverage track. In any event, indel calls should line up neatly in one position rather than being scattered in a repetitive sequence region with variable numbers of bases deleted and/or inserted. When soft-clipped bases are displayed, the clipped reads (shown in color, as they deviate from the reference sequence) should line up with the predicted indel and have sequence consistent with the predicted variant. (Notably, "show soft-clipped bases" is a display preference option that by default is not active; it must be selected by the IGV user.)
The presence of the indel in both forward and reverse strands, as opposed to the strand bias commonly seen in PCR artifacts, also lends credence to an indel call. When multiple indel calling tools are used to analyze the same data, consensus between the tools is reassuring. However, when the tools disagree, it can be difficult to determine which (if either) tool is correct. Again, review of the variant in IGV can be helpful. However, manual review of indel calls can be laborious and time-intensive. With the increasing application of NGS methods for indel detection in clinical testing, rigorous and standardized approaches to indel detection that do not depend on extensive manual review will be needed.
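To make the supporting-read filtering described above concrete, the following is a minimal sketch, in Python, of the kind of filter a clinical pipeline might apply to tentative indel calls. The thresholds (20 supporting reads, 1% VAF) echo the figures discussed above but are illustrative assumptions rather than validated cutoffs, and the IndelCall record and the FLT3-region coordinates are hypothetical.

```python
# Minimal sketch of a supporting-read filter for candidate indel calls.
# Thresholds are hypothetical and must be established during assay validation.

from dataclasses import dataclass

@dataclass
class IndelCall:
    chrom: str
    pos: int
    ref: str
    alt: str
    supporting_reads: int  # reads containing the indel
    depth: int             # total coverage at the locus

def passes_filter(call: IndelCall,
                  min_supporting_reads: int = 20,
                  min_vaf: float = 0.01) -> bool:
    """Require both an absolute supporting-read count and a minimum
    variant allele fraction, so that calls are not passed on a handful
    of artifactual reads at very high depth."""
    if call.depth == 0:
        return False
    vaf = call.supporting_reads / call.depth
    return call.supporting_reads >= min_supporting_reads and vaf >= min_vaf

# A Pindel-style call with 4 supporting reads at ~2000x coverage is
# rejected, consistent with the behavior described in the text.
# (Coordinates are illustrative only.)
call = IndelCall("chr13", 28608251, "T", "TGGCATC", supporting_reads=4, depth=2000)
print(passes_filter(call))  # False
```

In practice, such cutoffs must be established empirically during assay validation, with orthogonal confirmation of calls near the threshold.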

Reference Standards

The spectrum of indels that exists in normal and diseased tissues is just beginning to be described, as previous molecular testing methods did not allow for wide-scale identification of indels across many genes or many individuals. Consequently, even though NGS methods have provided the capacity to identify indels in multiple genes, the lack of consensus indel calls across the various technical approaches and variant annotations makes establishing a reference standard difficult. As a result, reference standards for indels do not yet exist. Nonetheless, it is clear that a useful reference standard for clinical applications would include indels encompassing the full range of sizes observed in human disease (1 bp to hundreds of bp) and include some indels present at low frequency, near the limit of detection of the various NGS methods (e.g., about 10% VAF or less). However, even the development of an indel reference set is itself complicated by the lack of a gold standard for indel detection, ambiguities in annotation, and the lack of consensus between various indel calling algorithms. Efforts to develop indel standards in spite of these seemingly intractable issues are currently under way and will markedly increase the ease and clinical utility of clinical NGS testing [84].
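Because the annotation ambiguities noted above stem from equivalent placements of the same indel within a repeat, the normalization step itself is mechanical enough to sketch. The following is a minimal illustration, not a production implementation, of left-aligning an indel against a reference sequence; it assumes a lone indel with no adjacent variation and uses a toy sequence.

```python
def left_align(ref: str, pos: int, allele: str):
    """Shift an indel leftward while the reference base immediately
    preceding the event matches the last base of the indel allele;
    within a repeat run this converges on the left-most of the
    equivalent annotations. `allele` holds the inserted or deleted
    bases, and `pos` is the 0-based position where they begin."""
    bases = list(allele)
    while pos > 0 and ref[pos - 1] == bases[-1]:
        bases.insert(0, ref[pos - 1])  # rotate the allele leftward
        bases.pop()
        pos -= 1
    return pos, "".join(bases)

# Deleting "CAG" at position 8 of this toy reference removes the third
# CAG of a triplet repeat; deleting the first CAG produces the same
# resulting sequence, and left-alignment reports that annotation.
ref = "TTCAGCAGCAGGA"
print(left_align(ref, 8, "CAG"))  # (2, 'CAG')
```

The same shift applies to insertions, since the allele is compared against the reference bases preceding the insertion point.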


SUMMARY

Although some indels occur as normal polymorphisms in the human genome, indels are also implicated as the driving mechanism underlying a wide variety of constitutional and oncologic diseases, and so their detection by clinical NGS techniques is critically important. However, multiple factors inherent to indels as a mutation class complicate their detection, including indel size, sequence context, and variant annotation. As discussed in detail above, features of the assay design, the sequencing platform, and the bioinformatics tools employed all influence the sensitivity and specificity of clinical NGS tests designed to detect indels, but of these, the bioinformatic approaches are perhaps the most important. Because bioinformatic tools optimized for the detection of SNVs or other mutation classes perform poorly on indels, software packages specifically designed for indel detection are required for clinical NGS. While each of the different algorithms has strengths and limitations, a primary limitation common to all is the lack of gold standards for detection and annotation. In the absence of such reference standards, it is difficult to interpret the significance of the lack of concordance between different indel detection methods, and even more difficult to assess the sensitivity and specificity of clinical NGS assays. Further, the use of multiple possible annotations for the same indel all but precludes definitive correlation with existing databases and literature for clinical interpretation. Current efforts to standardize indel annotation and develop well-curated reference standards are thus of paramount importance.

References

[1] Kunkel TA, Bebenek K. DNA replication fidelity. Annu Rev Biochem 2000;69:497–529.
[2] Lange SS, Takata K, Wood RD. DNA polymerases and cancer. Nat Rev Cancer 2011;11(2):96–110.
[3] Pfeifer JD. DNA damage, mutations, and repair. Molecular genetic testing in surgical pathology. Philadelphia, PA: Lippincott, Williams & Wilkins; 2006. p. 29–57.
[4] Strachan T, Read AP. Human genetic variability and its consequences. Human molecular genetics. 4th ed. New York, NY: Garland Science; 2011. p. 405–440.
[5] Huang QY, Xu FH, Shen H, Deng HY, Liu YJ, Liu YZ, et al. Mutation patterns at dinucleotide microsatellite loci in humans. Am J Hum Genet 2002;70(3):625–34.
[6] Kunkel TA. The mutational specificity of DNA polymerase-beta during in vitro DNA synthesis. Production of frameshift, base substitution, and deletion mutations. J Biol Chem 1985;260(9):5787–96.
[7] Ripley LS. Model for the participation of quasi-palindromic DNA sequences in frameshift mutation. Proc Natl Acad Sci USA 1982;79(13):4128–32.
[8] Pfeifer JD, Zehnbauer B, Payton J. The changing spectrum of DNA-based specimen provenance testing in surgical pathology. Am J Clin Pathol 2011;135(1):132–8.
[9] Bunting SF, Nussenzweig A. End-joining, translocations and cancer. Nat Rev Cancer 2013;13(7):443–54.
[10] Koole W, van Schendel R, Karambelas AE, van Heteren JT, Okihara KL, Tijsterman M. A polymerase theta-dependent repair pathway suppresses extensive genomic instability at endogenous G4 DNA sites. Nat Commun 2014;5:3216.
[11] Fujiwara T, Stolker JM, Watanabe T, Rashid A, Longo P, Eshleman JR, et al. Accumulated clonal genetic alterations in familial and sporadic colorectal carcinomas with widespread instability in microsatellite sequences. Am J Pathol 1998;153(4):1063–78.
[12] Buchanan DD, Tan YY, Walsh MD, Clendenning M, Metcalf AM, Ferguson K, et al. Tumor mismatch repair immunohistochemistry and DNA MLH1 methylation testing of patients with endometrial cancer diagnosed at age younger than 60 years optimizes triage for population-level germline mismatch repair gene mutation testing. J Clin Oncol 2014;32(2):90–100.
[13] Karamurzin Y, Rutgers JK. DNA mismatch repair deficiency in endometrial carcinoma. Int J Gynecol Pathol 2009;28(3):239–55.
[14] Poynter JN, Siegmund KD, Weisenberger DJ, Long TI, Thibodeau SN, Lindor N, et al. Molecular characterization of MSI-H colorectal cancer by MLH1 promoter methylation, immunohistochemistry, and mismatch repair germline mutation screening. Cancer Epidemiol Biomarkers Prev 2008;17(11):3208–15.
[15] Oberlé I, Rousseau F, Heitz D, Kretz C, Devys D, Hanauer A, et al. Instability of a 550-base pair DNA segment and abnormal methylation in fragile X syndrome. Science 1991;252(5009):1097–102.
[16] Alisch RS, Wang T, Chopra P, Visootsak J, Conneely KN, Warren ST. Genome-wide analysis validates aberrant methylation in fragile X syndrome is specific to the FMR1 locus. BMC Med Genet 2013;14:18.
[17] Rousseau F, Rouillard P, Morel ML, Khandjian EW, Morgan K. Prevalence of carriers of premutation-size alleles of the FMR1 gene—and implications for the population genetics of the fragile X syndrome. Am J Hum Genet 1995;57(5):1006–18.
[18] Colak D, Zaninovic N, Cohen MS, Rosenwaks Z, Yang WY, Gerhardt J, et al. Promoter-bound trinucleotide repeat mRNA drives epigenetic silencing in fragile X syndrome. Science 2014;343(6174):1002–5.
[19] Walker FO. Huntington's disease. Lancet 2007;369(9557):218–28.
[20] Mirkin SM. Expandable DNA repeats and human disease. Nature 2007;447(7147):932–40.
[21] Pearson CE, Nichol Edamura K, Cleary JD. Repeat instability: mechanisms of dynamic mutations. Nat Rev Genet 2005;6(10):729–42.
[22] Kiesewetter S, Macek M, Davis C, Curristin SM, Chu CS, Graham C, et al. A mutation in CFTR produces different phenotypes depending on chromosomal background. Nat Genet 1993;5(3):274–8.


[23] Strom CM, Janeszco R, Quan F, Wang SB, Buller A, McGinniss M, et al. Technical validation of a TM Biosciences Luminex-based multiplex assay for detecting the American College of Medical Genetics recommended cystic fibrosis mutation panel. J Mol Diagn 2006;8(3):371–5.
[24] Onozato R, Kosaka T, Kuwano H, Sekido Y, Yatabe Y, Mitsudomi T. Activation of MET by gene amplification or by splice mutations deleting the juxtamembrane domain in primary resected lung cancers. J Thorac Oncol 2009;4(1):5–11.
[25] Cancer Genome Atlas Research Network. Comprehensive molecular profiling of lung adenocarcinoma. Nature 2014;511(7511):543–50.
[26] Bean J, Brennan C, Shih JY, Riely G, Viale A, Wang L, et al. MET amplification occurs with or without T790M mutations in EGFR mutant lung tumors with acquired resistance to gefitinib or erlotinib. Proc Natl Acad Sci USA 2007;104(52):20932–7.
[27] Schwab R, Petak I, Kollar M, Pinter F, Varkondi E, Kohanka A, et al. Major partial response to crizotinib, a dual MET/ALK inhibitor, in a squamous cell lung (SCC) carcinoma patient with de novo c-MET amplification in the absence of ALK rearrangement. Lung Cancer 2014;83(1):109–11.
[28] Isken O, Maquat LE. The multiple lives of NMD factors: balancing roles in gene and genome regulation. Nat Rev Genet 2008;9(9):699–712.
[29] Pabst T, Mueller BU, Zhang P, Radomska HS, Narravula S, Schnittger S, et al. Dominant-negative mutations of CEBPA, encoding CCAAT/enhancer binding protein-alpha (C/EBPalpha), in acute myeloid leukemia. Nat Genet 2001;27(3):263–70.
[30] Richards CS, Bale S, Bellissimo DB, Das S, Grody WW, Hegde MR, et al. ACMG recommendations for standards for interpretation and reporting of sequence variations: revisions 2007. Genet Med 2008;10(4):294–300.
[31] Gombart AF, Hofmann WK, Kawano S, Takeuchi S, Krug U, Kwok SH, et al. Mutations in the gene encoding the transcription factor CCAAT/enhancer binding protein alpha in myelodysplastic syndromes and acute myeloid leukemias. Blood 2002;99(4):1332–40.
[32] Nerlov C. C/EBPalpha mutations in acute myeloid leukaemias. Nat Rev Cancer 2004;4(5):394–400.
[33] Sharma SV, Bell DW, Settleman J, Haber DA. Epidermal growth factor receptor mutations in lung cancer. Nat Rev Cancer 2007;7(3):169–81.
[34] Heinrich MC, Rubin BP, Longley BJ, Fletcher JA. Biology and genetic aspects of gastrointestinal stromal tumors: KIT activation and cytogenetic alterations. Hum Pathol 2002;33(5):484–95.
[35] Nakahara M, Isozaki K, Hirota S, Miyagawa J, Hase-Sawada N, Taniguchi M, et al. A novel gain-of-function mutation of c-kit gene in gastrointestinal stromal tumors. Gastroenterology 1998;115(5):1090–5.
[36] Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014;42:D980–5 [database issue].
[37] Eggington JM, Bowles KR, Moyes K, Manley S, Esterling L, Sizemore S, et al. A comprehensive laboratory-based program for classification of variants of uncertain significance in hereditary cancer genes. Clin Genet 2014;86(3):229–37.
[38] Kenna KP, McLaughlin RL, Hardiman O, Bradley DG. Using reference databases of genetic variation to evaluate the potential pathogenicity of candidate disease variants. Hum Mutat 2013;34(6):836–41.
[39] Choi Y, Sims GE, Murphy S, Miller JR, Chan AP. Predicting the functional effect of amino acid substitutions and indels. PLoS One 2012;7(10):e46688.
[40] Sim NL, Kumar P, Hu J, Henikoff S, Schneider G, Ng PC. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res 2012;40:W452–7 [web server issue].
[41] Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem (Palo Alto Calif) 2013;6:287–303.
[42] Liu L, Li Y, Li S, Hu N, He Y, Pong R, et al. Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012;2012:251364.
[43] Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 2012;30(5):434–9.
[44] Jünemann S, Sedlazeck FJ, Prior K, Albersmeier A, John U, Kalinowski J, et al. Updating benchtop sequencing performance comparison. Nat Biotechnol 2013;31(4):294–6.
[45] Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 2012;13:341.
[46] Cottrell CE, Al-Kateb H, Bredemeyer AJ, Duncavage EJ, Spencer DH, Abel HJ, et al. Validation of a next-generation sequencing assay for clinical molecular oncology. J Mol Diagn 2014;16(1):89–105.
[47] Pritchard CC, Salipante SJ, Koehler K, Smith C, Scroggins S, Wood B, et al. Validation and implementation of targeted capture and sequencing for the detection of actionable mutation, copy number variation, and gene rearrangement in clinical cancer specimens. J Mol Diagn 2014;16(1):56–67.
[48] Li H, Durbin R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 2010;26(5):589–95.
[49] Walsh PS, Erlich HA, Higuchi R. Preferential PCR amplification of alleles: mechanisms and solutions. PCR Methods Appl 1992;1(4):241–50.
[50] Ogino S, Wilson RB. Quantification of PCR bias caused by a single nucleotide polymorphism in SMN gene dosage analysis. J Mol Diagn 2002;4(4):185–90.
[51] Barnard R, Futo V, Pecheniuk N, Slattery M, Walsh T. PCR bias toward the wild-type k-ras and p53 sequences: implications for PCR detection of mutations and cancer diagnosis. Biotechniques 1998;25(4):684–91.
[52] Liu Q, Thorland EC, Sommer SS. Inhibition of PCR amplification by a point mutation downstream of a primer. Biotechniques 1997;22(2):292–4, 296, 298, passim.
[53] Mutter GL, Boynton KA. PCR bias in amplification of androgen receptor alleles, a trinucleotide repeat marker used in clonality studies. Nucleic Acids Res 1995;23(8):1411–8.
[54] Polz MF, Cavanaugh CM. Bias in template-to-product ratios in multitemplate PCR. Appl Environ Microbiol 1998;64(10):3724–30.
[55] Clark MJ, Chen R, Lam HY, Karczewski KJ, Euskirchen G, Butte AJ, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 2011;29(10):908–14.
[56] Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 2014;15(2):121–32.


[57] Spencer DH, Tyagi M, Vallania F, Bredemeyer AJ, Pfeifer JD, Mitra RD, et al. Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J Mol Diagn 2014;16(1):75–88.
[58] Lohr JG, Stojanov P, Carter SL, Cruz-Gordillo P, Lawrence MS, Auclair D, et al. Widespread genetic heterogeneity in multiple myeloma: implications for targeted therapy. Cancer Cell 2014;25(1):91–101.
[59] Spencer DH, Sehn JK, Abel HJ, Watson MA, Pfeifer JD, Duncavage EJ. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J Mol Diagn 2013;15(5):623–33.
[60] Karnes HE, Duncavage EJ, Bernadt CT. Targeted next-generation sequencing using fine-needle aspirates from adenocarcinomas of the lung. Cancer Cytopathol 2014;122(2):104–13.
[61] Wickham CL, Sarsfield P, Joyner MV, Jones DB, Ellard S, Wilkins B. Formic acid decalcification of bone marrow trephines degrades DNA: alternative use of EDTA allows the amplification and sequencing of relatively long PCR products. Mol Pathol 2000;53(6):336.
[62] Reineke T, Jenni B, Abdou MT, Frigerio S, Zubler P, Moch H, et al. Ultrasonic decalcification offers new perspectives for rapid FISH, DNA, and RT-PCR analysis in bone marrow trephines. Am J Surg Pathol 2006;30(7):892–6.
[63] Gerlinger M, Horswell S, Larkin J, Rowan AJ, Salm MP, Varela I, et al. Genomic architecture and evolution of clear cell renal cell carcinomas defined by multiregion sequencing. Nat Genet 2014;46(3):225–33.
[64] Yachida S, Jones S, Bozic I, Antal T, Leary R, Fu B, et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature 2010;467(7319):1114–7.
[65] Ding L, Ley TJ, Larson DE, Miller CA, Koboldt DC, Welch JS, et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 2012;481(7382):506–10.
[66] Johnson BE, Mazor T, Hong C, Barnes M, Aihara K, McLean CY, et al. Mutational analysis reveals the origin and therapy-driven evolution of recurrent glioma. Science 2014;343(6167):189–93.
[67] DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43(5):491–8.
[68] Forbes SA, Bhamra G, Bamford S, Dawson E, Kok C, Clements J, et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr Protoc Hum Genet 2008 [Chapter 10: Unit 10.1].
[69] Cooper DN, Ball EV, Krawczak M. The human gene mutation database. Nucleic Acids Res 1998;26(1):285–7.
[70] Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29(1):308–11.
[71] Gandhi L, Bahleda R, Tolaney SM, Kwak EL, Cleary JM, Pandya SS, et al. Phase I study of neratinib in combination with temsirolimus in patients with human epidermal growth factor receptor 2-dependent and other solid tumors. J Clin Oncol 2014;32(2):68–75.
[72] Falchook GS, Janku F, Tsao AS, Bastida CC, Stewart DJ, Kurzrock R. Non-small-cell lung cancer with HER2 exon 20 mutation: regression with dual HER2 inhibition and anti-VEGF combination treatment. J Thorac Oncol 2013;8(2):e19–20.
[73] Wang SE, Narasanna A, Perez-Torres M, Xiang B, Wu FY, Yang S, et al. HER2 kinase domain mutation results in constitutive phosphorylation and activation of HER2 and EGFR and resistance to EGFR tyrosine kinase inhibitors. Cancer Cell 2006;10(1):25–38.
[74] Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: accurate indel calls from short-read data. Genome Res 2011;21(6):961–73.
[75] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009;25(16):2078–9.
[76] Spencer DH, Abel HJ, Lockwood CM, Payton JE, Szankasi P, Kelley TW, et al. Detection of FLT3 internal tandem duplication in targeted, short-read-length, next-generation sequencing data. J Mol Diagn 2013;15(1):81–93.
[77] Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22(3):568–76.
[78] Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25(21):2865–71.
[79] Nakao M, Yokota S, Iwai T, Kaneko H, Horiike S, Kashima K, et al. Internal tandem duplication of the flt3 gene found in acute myeloid leukemia. Leukemia 1996;10(12):1911–8.
[80] Abu-Duhier FM, Goodeve AC, Wilson GA, Care RS, Peake IR, Reilly JT. Genomic structure of human FLT3: implications for mutational analysis. Br J Haematol 2001;113(4):1076–7.
[81] Kottaridis PD, Gale RE, Frew ME, Harrison G, Langabeer SE, Belton AA, et al. The presence of a FLT3 internal tandem duplication in patients with acute myeloid leukemia (AML) adds important prognostic information to cytogenetic risk group and response to the first cycle of chemotherapy: analysis of 854 patients from the United Kingdom Medical Research Council AML 10 and 12 trials. Blood 2001;98(6):1752–9.
[82] Estey EH. Acute myeloid leukemia: 2013 update on risk-stratification and management. Am J Hematol 2013;88(4):318–27.
[83] Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2013;14(2):178–92.
[84] Lubin IM, Aziz N, Babb L, et al. The clinical next-generation sequencing variant file: advances, opportunities, and challenges for the clinical laboratory [submitted].


C H A P T E R

10 Translocation Detection Using Next-Generation Sequencing

Haley Abel1, John Pfeifer2 and Eric Duncavage2

1Division of Statistical Genetics, Washington University School of Medicine, St. Louis, MO, USA; 2Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA

O U T L I N E

Introduction to Translocations
  Discovery of Translocations in Human Disease
  Mechanisms of Translocation Formation

Translocations in Human Disease
  Translocations in Hematologic Malignancies
    Translocations in Leukemias
    Translocations in Lymphomas
  Translocations in Solid Tumors
    Sarcomas
    Carcinomas
  Translocations in Inherited Disorders
    Developmental Delay
    Recurrent Miscarriages
    Hereditary Cancer Syndromes

Translocation Detection
  Conventional Methods
  Translocation Detection by Whole Genome DNA Sequencing
  Translocation Detection by Targeted DNA Sequencing
  Translocation Detection by RNA-Seq

Informatic Approaches to Translocation Detection
  Discordant Paired-End and Split Read-Based Analysis
  Detection of Translocations and Inversions
  RNA-Seq-Based Analysis

Translocation Detection in Clinical Practice
  Laboratory Issues
  Online Resources

Summary and Conclusion

References

KEY CONCEPTS

• The detection of translocations has important prognostic and diagnostic significance in human diseases, including cancer.

• In most clinical laboratories, translocations are routinely detected by interphase or metaphase FISH, routine cytogenetics, or RT-PCR.

• While other forms of structural variation including copy number variation can be detected by array-based methods, translocation detection by arrays has proven difficult.


• Translocations generally occur in intronic sequences and can be identified in an unbiased manner by whole genome NGS at the DNA level or by RNA sequencing of expressed transcripts.

• Translocations can be detected by targeted hybrid-capture-based DNA sequencing panels with high sensitivity and specificity, but this requires sequencing of introns.

• Capture-based targeted sequencing can identify all translocation partners of a captured gene by taking advantage of off-target coverage.

• Translocations can be detected from targeted RNA sequencing panels, but only if the translocation event produces a chimeric fusion product.

• Most NGS translocation calling programs require paired-end reads and are typically sensitive, but many are subject to a high false positive rate.

INTRODUCTION TO TRANSLOCATIONS

Chromosomal translocations or rearrangements occur when two chromosomes exchange genetic material. Such events may place a proto-oncogene under the regulation of a transcriptional enhancer, resulting in aberrant gene expression (as in the case of immunoglobulin heavy chain rearrangements in lymphoma), or may juxtapose the coding regions of two genes, resulting in a fusion transcript with pro-oncogenic properties. The detection of recurrent chromosomal rearrangements by cytogenetics was one of the earliest clinical "molecular" oncology assays, and it continues to play a major role in cancer diagnosis and prognosis [1,2].

Discovery of Translocations in Human Disease

The discovery of translocations in human cancer began in 1960, 4 years after the identification and enumeration of human chromosomes, when David Hungerford and Peter Nowell discovered a novel "minute" chromosome, dubbed the "Philadelphia chromosome," present only in the bone marrow of patients with chronic myelogenous leukemia (CML) [3]. It was not until 1972 and the advent of better cytogenetic banding techniques that the first chromosomal translocation was described in human cancer, a t(8;21)(q22;q22) in a leukemic patient [4]; shortly thereafter, it was demonstrated that the Philadelphia chromosome described 12 years earlier actually represented a t(9;22)(q34;q11) translocation event [1]. In the last 40 years, hundreds of additional chromosomal translocations have been described in human cancers, including carcinomas, sarcomas, and hematologic malignancies (described below). Many of these cancer-related translocations have been shown to be of diagnostic significance, including t(9;22) in CML; others, including EML4-ALK in non-small cell lung cancers, are of therapeutic significance; and still others, including ERG-TMPRSS2 rearrangements in prostate cancer, likely play a role in carcinogenesis but are of unclear clinical significance at this time.

Mechanisms of Translocation Formation

A description of the mechanisms of translocation formation starts with an acknowledgment that double-strand breaks (DSBs) are purposely formed and resolved as part of normal cellular processes. For example, the V(D)J recombination that occurs during the rearrangement of immunoglobulin genes and T-cell receptor genes involves the recombination activating genes RAG1 and RAG2, which recognize so-called recombination signal sequences (RSSs); after a complex series of events, the coding ends of the induced chromosomal DSBs are finally joined by the nonhomologous end joining (NHEJ) pathway [5]. In addition, activation-induced cytidine deaminase (AID) is the major enzyme required for somatic hypermutation and class switch recombination in the generation of antibody diversity. However, the proteins involved in the rearrangement of immunoglobulin and T-cell receptor genes are also responsible, directly or indirectly, for the formation of some translocations. Some genes that harbor cryptic RSS elements are misrecognized by the RAGs with resulting formation of translocations, and the DSBs required for class switching are also candidates for illegitimate joining by AID (including translocations involving TMPRSS2 and the ETS family of genes in prostate cancer) [6–8]. Regardless of the cause, the formation of chromosomal translocations is a complex process [9,10]. The initiating event is the occurrence of a DSB, whether it arises from a replication error or from an exogenous source such


as chemotherapeutic agents or ionizing radiation. When a DSB occurs, the cell quickly mounts a DNA damage response that involves a complex but coordinated accumulation of DNA repair proteins at the site of the damage, which triggers additional signaling pathways to halt the cell cycle, initiate repair of the lesion, and restore the integrity of the genome [5,6,9,10]. There are several different types of DSB repair mechanisms. In general, if a DSB occurs during the S phase of the cell cycle, repair proceeds via homologous recombination, in which homologous sequences of sister chromatids are used as templates to repair the break. In contrast, NHEJ is active throughout the cell cycle, and although it efficiently ligates the broken strand, its lack of reliance on a homologous template means that the repair is often inaccurate [5]. Classical nonhomologous end joining (C-NHEJ) preferentially joins DSBs intra-chromosomally, while an alternative end-joining pathway (A-EJ), which is not well characterized and may actually represent several pathways, prefers ends with short microhomologies and physiologically supports primarily immunoglobulin class switch recombination in lymphocytes [11]. Of note, DSB translocations in some model systems appear to have a propensity for transcribed chromosomal regions, through a mechanism that is not yet clear [12]. Even though all of these pathways are efficient, some DSBs are not quickly resolved, and it is thought that translocations arise from this group of breaks, since the persistence of a DSB increases the likelihood that it will come into physical contact with an illegitimate partner (i.e., not the opposite side of the break from the same chromosome). The accumulation of repair proteins at the broken chromosomal ends of a DSB forms cytologically detectable DNA repair foci. One of the remarkable discoveries of the last several years is that, based on time-lapse observations, the repair foci have a relatively stable intranuclear position over time. This very limited mobility has implications for the physical interactions and pairing of DSBs required for the illegitimate joining responsible for translocation formation. Since chromosomes within the nucleus are nonrandomly arranged in three-dimensional space, there apparently is a propensity for particular pairs of chromosomes to undergo recombination, which may provide part of the explanation for the observations that cells form a rather limited repertoire of disease-causing translocations and that there seems to be an inherent reciprocity to translocation formation [13]. For example, in Burkitt's lymphoma, the distance between the MYC gene and its three translocation partners IGH, IGK, and IGL correlates with the translocation frequency of the partners [17], and similar correlations between tissue-specific translocations and tissue-specific chromosome locations have been observed in other tumor types [14–16].

TRANSLOCATIONS IN HUMAN DISEASE

Translocations in Hematologic Malignancies

Translocations in human malignancies were first described in leukemia, and their detection continues to have considerable impact on the diagnosis and management of patients with hematologic malignancies. Translocations in hematologic malignancies can be divided into three major categories: those of diagnostic significance, those of prognostic significance, and those of therapeutic significance. Examples of rearrangements of diagnostic significance include BCR-ABL1 rearrangements in CML, IgH rearrangements in B-cell lymphomas, and FGFR1, FGFR2, and 8p11 rearrangements in myeloid malignancies with eosinophilia. Examples of prognostically significant rearrangements include IgH-MYC/IgH-BCL2 rearrangements in diffuse large B-cell lymphoma (so-called double-hit lymphomas), MLL rearrangements in acute myeloid and acute lymphocytic leukemias (AML and ALL), core binding factor rearrangements in AML including t(8;21)(q22;q22) and inv(16)/t(16;16)(p13.1;q22), and TEL-AML1 t(12;21)(p13;q22) rearrangements in ALL. Examples of therapeutically significant translocations include BCR-ABL1 rearrangements in CML (treated with tyrosine kinase inhibitors), PML-RARA rearrangements in AML (treated with all-trans retinoic acid or arsenic), and IgH-CCND1 rearrangements in mantle cell lymphoma (MCL) (often treated by bone marrow transplant).

Translocations in Leukemias

The identification of recurrent translocations in leukemia has had an enormous impact on the classification and treatment of leukemias, and testing for rearrangements by cytogenetics or fluorescence in situ hybridization (FISH) is now part of the recommended diagnostic workup for AML, ALL, CML, and myelodysplastic syndromes (MDS). In the case of AML, ALL, and MDS, the identification of rearrangements allows for risk-adapted therapy, whereby patients with "favorable" cytogenetic rearrangements, including t(8;21)(q22;q22), inv(16)/t(16;16)(p13.1;q22), and t(15;17)(q22;q12) in AML and t(12;21)(p13;q22) in ALL, are generally treated with


induction chemotherapy followed by consolidation chemotherapy, while leukemias with unfavorable cytogenetics, including 11q23 (MLL) rearrangements, t(9;22)(q34;q11.2), t(6;9)(p23;q34), and inv(3)/t(3;3)(q21;q26.2), may be more aggressively treated with an allogeneic bone marrow transplant during the first complete remission. Outside of AML with t(15;17) and AML/CML with t(9;22), no targeted therapy exists for leukemias with recurrent rearrangements.

Translocations in Lymphomas

The identification of rearrangements in lymphomas is important in both the diagnostic and prognostic settings. In B-cell lymphomas, most rearrangements involve the immunoglobulin heavy chain region (IgH) on chromosome 14, while in T-cell lymphomas recurrent rearrangements are uncommon. For example, in the differential diagnosis between low-grade B-cell lymphoma and benign lymphadenopathy, the identification of an IgH rearrangement by FISH may confirm the presence of a lymphoma. In the setting of diffuse large B-cell lymphoma, the presence of both IgH-MYC and IgH-BCL2 rearrangements is suggestive of a "double-hit" lymphoma, which is associated with a more aggressive clinical course. In the differential diagnosis between chronic lymphocytic leukemia (CLL) and MCL, both CD5-positive B-cell lymphomas consisting of small B cells, the identification of a CCND1-IgH t(11;14) rearrangement is diagnostic of MCL, which is treated more aggressively than CLL.

Translocations in Solid Tumors

A number of translocations have been shown to be characteristic of solid tumors, including carcinomas as well as sarcomas. While a detailed description of all of them is beyond the scope of this chapter, a number of important overarching concepts have become clear as more and more translocations have been characterized. First, unlike translocations characteristic of many lymphomas, which involve the immunoglobulin and T-cell receptor genes and typically result in the loss of relatively large chromosomal regions (in the range of 10–200 kb or more, a reflection of the involvement of the recombination activating gene (RAG) pathway in their genesis), the translocations that occur in solid tumors are characterized by only small insertions or deletions at the breakpoints (in the range of a couple of bases to several hundred bases) [17]. This fact has implications for the detection of translocations in solid tumors by chromosomal microarray-based approaches, in that the absence of any significant loss or gain of genomic material makes them essentially copy number neutral. Second, although specific translocations are characteristic of specific tumors, a number of different translocations can be characteristic of the same neoplasm; conversely, the same translocation can be found in more than one tumor type. Despite the lack of absolute specificity, in many solid tumors the identification of a translocation can have significant utility for the prediction of response to a specific therapy.

Sarcomas

A feature of a large number of soft tissue tumors is a strong association with a characteristic set of translocations. The classic example is Ewing sarcoma/peripheral neuroectodermal tumor (EWS/PNET). Although the t(11;22) which forms an EWSR1-FLI1 fusion gene is present in up to 90% of cases, the remaining cases harbor translocations in which EWSR1 (or a homolog) is paired with another member of the ETS transcription factor family (including EWSR1-ERG, EWSR1-E1AF, EWSR1-ETV, EWSR1-FEV, and FUS-ERG), inversions (EWSR1-ZSG), or rearrangements involving unrelated loci [18]. The complexity of the genomic lesions associated with EWS/PNET is further increased by the observation that the translocation breakpoint is not restricted to a specific region of the involved genes, but can occur in many different introns of either of the involved genes. Given the range of genes and breakpoints that occur in the tumor, clinical testing for the presence of a rearrangement that supports the diagnosis of EWS/PNET by conventional methods is a laborious process if it is to be comprehensive. But the situation is even more complicated; because the differential diagnosis of EWS/PNET often includes a number of other so-called malignant small round blue cell tumors, each of which harbors its own set of characteristic translocations, clinical testing for rearrangements to support the diagnosis of sarcomas often involves a number of independent methods (e.g., classical cytogenetics, metaphase and interphase FISH, RT-PCR) to achieve the necessary sensitivity and specificity, testing which is often constrained by the limited quantity of tissue available for diagnosis [18]. Because next-generation sequencing (NGS)-based methods coupled with appropriate bioinformatic pipelines (as discussed below) can be used to detect gross structural rearrangements, NGS is ideally suited to provide DNA sequence information of clinical utility in sarcoma diagnosis.


Carcinomas

It has been known for several years that certain translocations are characteristic of specific carcinomas. However, whereas the role of translocation detection in sarcomas is largely in support of diagnosis, for carcinomas testing for translocations is of importance to guide the choice of therapy. The best examples of this scenario are in the treatment of patients with non-small cell lung carcinoma (NSCLC). Patients whose tumor harbors an ALK translocation have been shown to respond to targeted therapy with the tyrosine kinase inhibitor crizotinib, and similarly, patients whose tumor harbors a translocation involving RET or ROS1 have been shown to respond to targeted therapy directed against the associated activated cell signaling pathway [19,20]. NGS methods have not only been used to discover the role of these translocations in NSCLC, but are also ideally suited for clinical use to detect all of these rearrangements in a single assay because of their sensitivity, specificity, and suitability for very small biopsy or cytology patient samples.

Translocations in Inherited Disorders

Traditional G-band karyotyping techniques have limited resolution and consequently are unable to detect subtle rearrangements. Similarly, microarray-based methods are unable to detect rearrangements that are copy number neutral. The development of NGS approaches to detect translocations therefore offers advantages in the characterization of genomic rearrangements in a number of inherited disease syndromes.

Developmental Delay

Although the underlying cause of mental retardation is unknown in up to 80% of patients, it has recently become clear that small aneusomies, especially those involving gene-rich subtelomeric regions, are responsible for mental retardation or unexplained developmental delay in a significant percentage of patients. Conventional cytogenetics can demonstrate an abnormality in approximately 15% of these patients, and adding a combination of various molecular techniques can identify an abnormality in an additional 5% of patients [21,22]. However, a subgroup of patients harbors a balanced chromosomal rearrangement that would not be detected by any of these conventional methods, and the advantage of genome-wide, highly sensitive NGS for detecting rearrangements in this clinical setting is obvious [23]. In fact, the utility of an NGS-based approach for mapping individual gene disruptions in patients with developmental delay has recently been demonstrated [24].

Recurrent Miscarriages

About 10–15% of clinically recognized pregnancies end in spontaneous abortion (miscarriage). The clinical diagnosis of recurrent miscarriage requires the loss of three or more pregnancies, and even though the range of underlying etiologies includes endocrine, immune, infective, thrombophilic, and anatomic causes, the role of genetic causes is becoming increasingly clear [25]. Balanced translocations (which are present in approximately 1 in 500 normal individuals) are responsible for recurrent miscarriages in 3–5% of couples with recurrent miscarriage; in this group, a reciprocal translocation is present in about two-thirds of cases and a Robertsonian translocation in about one-third. Subtelomeric translocations have recently been demonstrated to be another important cause of recurrent miscarriage, especially in couples with a family history of live-born offspring with mental retardation and a variety of other abnormalities [26].

Hereditary Cancer Syndromes

A number of human diseases have been associated with germline mutations of the proteins involved in DSB repair by homologous recombination, and not unexpectedly, these diseases show radiosensitivity, genome instability, cancer susceptibility, and immunodeficiency as prominent clinical features. Included in this set of maladies are ataxia telangiectasia (caused by mutations in the ATM gene), ataxia telangiectasia-like disorder (caused by mutations in MRE11), and Nijmegen breakage syndrome (due to mutations in NBS1). Of note, ATM is one of the early initiators of DNA damage cell cycle checkpoint control, and both MRE11 and NBS1 are involved in DSB repair by the homologous recombination pathway. Fanconi anemia is another disease characterized by cellular sensitivity to DNA crosslinking agents, as well as diverse congenital abnormalities, premature bone marrow failure, and a variety of malignancies; at least 11 separate genes that are either directly or indirectly involved in DSB repair through the homologous recombination pathway are known to be responsible.


TRANSLOCATION DETECTION

Conventional Methods

Translocations have historically been identified by routine cytogenetics, in which metaphase DNA is collected from either stimulated or unstimulated cell cultures, stained, and evaluated by light microscopy. The primary advantage of cytogenetics is that it is an inexpensive and unbiased method that can detect translocations without prior knowledge of the event or chromosomal breakpoints. Cytogenetic evaluation is particularly useful for hematologic malignancies, in which viable cells are readily obtained from fresh bone marrow or blood samples. Such cases can be evaluated for the presence of recurrent rearrangements (PML-RARA, RUNX1-RUNX1T1, BCR-ABL1, etc.) in addition to other clonal abnormalities such as full or partial chromosomal gains and losses. While highly useful in the clinical laboratory, conventional cytogenetics has significant limitations. For example, culture and preparation of cells can take up to 2 weeks, which is too long for some clinical applications (e.g., PML-RARA detection in AML). Some rearrangements may be cryptic by cytogenetics (e.g., PDGFRA in myeloid neoplasms with eosinophilia). Further, the sensitivity of cytogenetics is limited to events present in >10% of cells, as typically only 20 metaphase cells are scored. Finally, perhaps the most significant limitation is that cytogenetics requires viable cells that can be coaxed into dividing, making it difficult or impossible to perform cytogenetic testing on solid tumor samples. FISH detection of recurrent translocations offers considerable advantages over conventional cytogenetics in terms of increased resolution and the ability to work on both stimulated metaphase cultures and formalin-fixed interphase cells. An inherent limitation of FISH is that it is a biased approach and can only find rearrangements in the genes being evaluated. FISH probes for rearrangement detection are generally prepared from fluorescently labeled bacterial artificial chromosome (BAC) clones (about 200–800 kb long) and can be divided into two types: fusion probes and break-apart probes. When fusion FISH probes are used, one probe spans each partner gene. In the normal setting, the probes are located on different chromosomes and the viewer sees separate color signals (red and green, for example) corresponding to spatially distinct chromosomes. However, in the setting of a rearrangement, the probes are adjacent to one another, representing the merging of spatially distinct chromosomes, and the viewer sees only one color (yellow in this example). Break-apart FISH probes are essentially the opposite: probes are located in adjacent DNA regions spanning common breakpoints (MLL, for example). In their normal state, break-apart probes show only a single merged color signal indicating that a particular locus is intact; when a rearrangement occurs, the FISH probes no longer co-occupy the same space and separate probe colors are seen. A major advantage of break-apart probes is that they can identify a rearrangement regardless of the partner gene. For example, using MLL break-apart probes it is possible to identify any MLL rearrangement; however, in break-apart-positive cases, it is impossible to know the partner gene (e.g., t(4;11) and t(9;11) produce the same pattern). Fusion FISH probes have the advantage that they can identify partner genes, but only one at a time; to differentiate between a t(4;11) and a t(9;11), two sets of fusion FISH probes would have to be used.

Translocation Detection by Whole Genome DNA Sequencing

Whole genome sequencing is an unbiased approach to the identification of rearrangements, similar to conventional cytogenetics. In theory, all rearrangements can be detected by whole genome sequencing, as the sequence data cover both introns and exons; the exact methods for rearrangement detection are discussed in the following sections. In clinical practice, it is not practical to use whole genome sequencing data for the identification of rearrangements due to the high cost of sequencing an entire genome; however, the same informatic approaches can be used to identify rearrangements in whole genome data and targeted sequencing (exome or targeted panel) data. A major limitation of translocation detection by whole genome sequencing is that whole genome sequencing coverage is typically low (25–30×), and rearrangements present in only a subset of cells (i.e., due to dilution by nontumor cells) may not be adequately sampled [27]. However, an advantage of whole genome sequencing methods is that libraries with multiple insert sizes are sequenced (often ranging from 250 to 800 bp), which may make it easier to detect rearrangements occurring in repeat regions that would otherwise be difficult to align with standard short-insert libraries.
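A back-of-the-envelope calculation illustrates the sampling problem just described. Assuming 30× physical coverage and a rearrangement carried on one allele in 20% of cells (all figures illustrative, not drawn from the text's references), the expected number of fragments spanning the breakpoint is small enough that a minimal evidence threshold will sometimes be missed by chance:

```python
# Illustrative sampling calculation for low-coverage whole genome data.
from math import exp

physical_coverage = 30.0        # fragments spanning any given base
fraction_junction = 0.20 * 0.5  # tumor fraction x one of two alleles

expected = physical_coverage * fraction_junction  # = 3.0
# Poisson probability of observing fewer than 2 supporting fragments,
# i.e., failing a minimal >=2-fragment evidence threshold:
p_miss = exp(-expected) * (1 + expected)
print(f"expected junction-spanning fragments: {expected:.1f}")
print(f"P(<2 supporting fragments): {p_miss:.3f}")  # ~0.199
```

Under these assumptions, roughly one in five such rearrangements would fall below even a two-fragment evidence requirement, which is why targeted enrichment is attractive for clinical detection of subclonal events.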

Translocation Detection by Targeted DNA Sequencing

Translocations can be detected from capture-enriched DNA sequencing using informatic approaches similar to those used for whole genome sequencing. For successful translocation detection, however, the intronic regions where rearrangements


FIGURE 10.1 Overview of translocation identification by NGS. (A) Translocations occurring at the DNA level were identified by designing capture probes 2× tiled across both the exons (dark green) and introns (light green) of gene partners commonly involved in translocations. In this example, ABL1 is captured, but its partner BCR is not. (B) Genomic DNA was then fragmented into ~300 bp pieces, library prepped, and captured. Genomic DNA containing sequences complementary to the ABL1 (green)-specific biotin-labeled capture probes (blue), in this example, was enriched. While most of the captured DNA represented contiguous areas of ABL1, regions with partial homology representing the actual DNA breakpoint (red and green) were also captured. (C) After aligning the sequence data, Breakdancer was used to identify paired reads in which one end of the paired-end read mapped to the targeted area (ABL1) and the other end did not (green/purple and green/red reads). The green-brown reads represent false positive translocation calls due to sequence repeats.

are known to occur must be directly targeted. Therefore, the use of exome or exon-only capture reagents will in general not identify translocation events, and custom capture panels that include intronic probes are required. A major advantage of capture-based enrichment over PCR-based identification of translocations is that, by capturing only one translocation partner gene, any fusion partner can be identified [28]. In contrast, PCR-based methods require a priori knowledge of the translocation partner and exact breakpoints so that appropriate primers can be used. For detection of rearrangements in genes known to be highly promiscuous, there is a significant advantage in the use of NGS-based detection methods. As an example, MLL has over 100 reported partner genes, making it nearly impossible to detect rearrangements in MLL by RT-PCR, since primers would have to be designed for each possible fusion partner and breakpoint. In contrast, targeted capture enrichment and NGS allow for the detection of any MLL rearrangement, regardless of the partner gene, as long as the correct introns of MLL are captured (Figure 10.1) [29]. The ability to detect chromosomal rearrangements by targeted NGS exploits the fact that targeted capture-based enrichment is not 100% efficient and that a high percentage of reads represent so-called off-target shoulder region coverage; these reads, though they originate from DNA regions that were not directly targeted, have complementarity over a portion of their length to the capture probes, and thus they are enriched during library preparation. Again using the example of MLL, capture probes containing MLL intronic sequence at the DNA rearrangement breakpoint will by chance capture some fragments whose inserts span both MLL and the adjacent partner gene. Translocations can also be detected by PCR-based enrichment, but only when the exact breakpoint and partner genes are known. For example, in BCR-ABL rearrangements, the breakpoints (at the RNA level) are well known, and RT-PCR primer pairs have been published that will amplify the majority of rearrangements. These PCR products can easily be sequenced to determine the exact location of the breakpoint; however, this approach cannot identify novel breakpoints or novel fusion partners, as such events would not be amplified during the initial PCR step.
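Informatically, the "capture one partner, identify any partner" strategy reduces to tallying where the mates of reads anchored in the captured gene map. The sketch below uses the pysam library to illustrate the idea; the KMT2A (MLL) coordinates and the mapping-quality cutoff are illustrative assumptions, and a real pipeline (e.g., Breakdancer, as in Figure 10.1) applies considerably more filtering.

```python
# Sketch: tally mate chromosomes of read pairs anchored in a captured
# region; a cluster of mates on another chromosome suggests the fusion
# partner. Uses pysam; all thresholds and coordinates are illustrative.
import pysam
from collections import Counter

def partner_chromosomes(bam_path, chrom, start, end, min_mapq=20):
    partners = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            if (read.is_paired
                    and not read.is_unmapped
                    and not read.mate_is_unmapped
                    and not read.is_duplicate
                    and read.mapping_quality >= min_mapq
                    and read.next_reference_name != chrom):
                partners[read.next_reference_name] += 1
    return partners

# Hypothetical usage on a targeted-panel BAM; the region approximates
# the KMT2A (MLL) locus and is for illustration only.
# print(partner_chromosomes("sample.bam", "11", 118307000, 118397000))
```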


Translocation Detection by RNA-Seq

As an alternative to DNA sequencing, sequencing of reverse-transcribed RNA, or RNA-Seq, can be used to detect fusion genes [30]. One important advantage of RNA-Seq compared to sequencing of DNA is that only fusion genes that undergo transcription are sequenced. This is of particular utility in unstable cancer genomes, for example, where a large number of translocations may occur but only the fusions that result in aberrant gene products are likely to be of biological importance. Detection of translocations by sequencing of cDNA thus provides a useful filter against these likely spurious events. As with DNA, either the whole transcriptome can be sequenced or specific regions can be targeted. Although most applications of RNA-Seq have been whole-transcriptome shotgun sequencing of cDNA, the utility of sequencing from targeted RNA capture data has recently been demonstrated [31]. This approach has great potential for efficient detection of fusion genes from RNA; however, its performance on degraded RNA molecules extracted from archival formalin-fixed, paraffin-embedded (FFPE) samples has yet to be rigorously evaluated.

INFORMATIC APPROACHES TO TRANSLOCATION DETECTION

Discordant Paired-End and Split Read-Based Analysis

Methods for the detection of structural variation from NGS data depend on the orientation, spacing, and depth of mapped reads. Most algorithms for the detection of gene fusions rely on discordant paired-end reads and split reads, with a few also taking read depth into consideration. Discordant paired reads are read pairs that do not map as expected: the paired ends may map to different chromosomes, to the same chromosome but in the incorrect orientation, or in the proper orientation but too far apart or too close together. Split reads are single reads that map to the genome discontinuously: one section of the read maps to one genomic region and the remainder to another. Due to the short read lengths currently available from NGS data, split reads are most reliably identified from paired-end data, in which the position of the split read can be determined with higher confidence when its mate can be uniquely mapped to the genome, serving as an "anchor." Finally, the depth of sequencing coverage local to a particular point in the genome provides evidence of structural variation. While changes in read depth over large regions often indicate copy number changes, more subtle variation in sequence coverage is often seen near the breakpoints of translocations and inversions. The performance of any method for the detection of gene fusions is highly dependent on the specifics of the sequencing data that are available. For instance, split read methods to detect translocations generally require adequate coverage so that the translocation breakpoints are spanned by several split reads, and will not perform well on low-coverage whole genome sequencing data. On the other hand, targeted capture methods for the detection of gene fusions will obviously fail if the fusion breakpoints are not captured. Furthermore, fusion breakpoints commonly occur in intronic regions with high GC content, with the consequence that coverage depth from targeted capture data is often low and uneven in the vicinity of fusion breakpoints [29] (Figure 10.2).
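As a concrete illustration of these read-pair categories, the following sketch classifies a single mapped read pair using pysam's alignment attributes. The insert-size bounds are assumptions that would in practice be estimated from the library's observed insert-size distribution.

```python
# Sketch: classify read pairs into the concordant/discordant categories
# described above. Insert-size bounds are illustrative assumptions.
import pysam
from collections import Counter

def classify_pair(read, min_insert=100, max_insert=600):
    """Classify one read of a pair; assumes both mates are mapped and
    the read is a primary, non-duplicate alignment."""
    if read.reference_id != read.next_reference_id:
        return "interchromosomal"
    if read.is_reverse == read.mate_is_reverse:
        return "aberrant orientation (inversion-type)"
    tlen = abs(read.template_length)
    if tlen > max_insert:
        return "too far apart (deletion-type)"
    if tlen < min_insert:
        return "too close together (insertion-type)"
    return "concordant"

def count_pair_classes(bam_path):
    """Tally pair classes over a BAM, counting each pair once."""
    counts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch():
            if (read.is_paired and read.is_read1
                    and not read.is_unmapped and not read.mate_is_unmapped
                    and not read.is_secondary and not read.is_supplementary
                    and not read.is_duplicate):
                counts[classify_pair(read)] += 1
    return counts

# counts = count_pair_classes("sample.bam")  # hypothetical BAM
```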

Detection of Translocations and Inversions

Many algorithms for the detection of genomic rearrangements rely on the presence of discordant paired reads [32–35]. In the case of interchromosomal translocations, the two members of the pair map to distinct chromosomes (Figure 10.3); in the case of inversions or intrachromosomal translocations, the two ends map to the same chromosome but in the incorrect orientation or an unexpected distance apart. In general, these algorithms have high sensitivity to detect rearrangements in regions of the genome with high mappability. However, they can localize breakpoints with only low resolution and often suffer from low specificity, particularly in repetitive regions or in regions that share homology with other areas of the genome. An additional complication is that, due to the mechanisms by which translocations are generated, they tend to occur in regions with repetitive elements, such as tandem duplications and transposons [34]. In these regions, true positives do occur but are difficult to discern from the many false positives. Because informatic approaches based on discordant paired reads alone are subject to a high false positive rate, some algorithms for structural variant (SV) detection make use of split reads, in which a single read spans a breakpoint between two distant genomic regions. Depending on the choice of mapping software, "soft-clipped" reads may serve to indicate the presence of split reads.


FIGURE 10.2 Coverage profiles within the targeted breakpoint hotspots for ALK (top) and MLL (bottom). The interquartile range of coverage depth at each position (blue-gray), %GC content (black), and alignability (CRG 50; gray) over the targeted capture region (exons as dark blue boxes) are shown. Breakpoints located in the set of positive controls are indicated with vertical red lines.

Soft clips are produced by some alignment software (including Novoalign and the Burrows–Wheeler Aligner (BWA); discussed in more detail in Chapter 7) when one member of a read pair can be uniquely mapped to the genome but its mate cannot; if the mate can be partially aligned, in the correct orientation and with an insert size within the expected range, the unmapped remainder of the sequence is considered to be "soft-clipped" [36]. Soft-clipped reads often indicate reads with split mappings and so can be used to provide single-base accuracy for the localization of rearrangements [37,38]. An added advantage of this precise localization of gene rearrangement breakpoints is that it facilitates orthogonal validation by PCR. In order to reduce the rate of false positives, most paired-end methods for the detection of structural variation rely on heuristic cutoffs, such as a minimum number of supporting read pairs. More recently, however, several informatic approaches that integrate multiple types of information from mapped reads have been developed. One such approach is discovery of putative breakpoints using discordant read pairs followed by confirmation using split reads [29,39]. Similarly, the signal from discordant read pairs can be combined with subtle coverage depth signals in a probabilistic model to achieve greatly improved specificity in SV detection [40]. Another approach incorporates information from discordant read pairs and split reads, as well as other user-provided prior information, by combining the probability distributions for the positions of the two sides of each SV-defining breakpoint [41]. Some software tools have been developed to detect structural variation from sequenced libraries with two or more distinct insert lengths; often, short reads with an insert size of 250–300 bp are combined with "mate-pair" reads with insert sizes ranging from 2 to 5 kb in order to facilitate SV detection in repetitive and difficult-to-map genomic regions [39].
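A minimal sketch of the heuristic support cutoff and split-read confirmation described above follows: discordant pairs are binned by chromosome pair and approximate position, clusters below a support threshold are discarded, and soft-clipped reads can then be sought at surviving candidate junctions. The bin size, support threshold, and soft-clip length cutoff are all illustrative assumptions.

```python
# Sketch: cluster discordant pairs into candidate rearrangements, then
# confirm with soft-clipped (split) reads. Thresholds are illustrative.
from collections import defaultdict

def cluster_discordant_pairs(pairs, bin_size=1000, min_support=5):
    """pairs: iterable of (chromA, posA, chromB, posB) tuples taken
    from discordant read pairs. Pairs are binned by chromosome pair and
    approximate position; only clusters meeting the support threshold
    are reported as candidate rearrangements."""
    clusters = defaultdict(int)
    for chrom_a, pos_a, chrom_b, pos_b in pairs:
        key = (chrom_a, pos_a // bin_size, chrom_b, pos_b // bin_size)
        clusters[key] += 1
    return {key: n for key, n in clusters.items() if n >= min_support}

def has_long_soft_clip(read, min_clip=10):
    """Split-read confirmation for a pysam read: CIGAR operation 4 is
    a soft clip; a long clip at a candidate junction, with clipped
    sequence matching the partner locus, supports the call."""
    return any(op == 4 and length >= min_clip
               for op, length in (read.cigartuples or []))
```

Requiring both clustered discordant pairs and confirmatory soft clips at the same position is one simple way to trade a little sensitivity for a substantial gain in specificity, in the spirit of the combined approaches cited above.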

RNA-Seq-Based Analysis

A number of informatic approaches for the detection of translocations from RNA-Seq data have also been developed recently. Detection of rearrangements proceeds in a manner similar to that for DNA sequencing; however, the computational problem is somewhat more complex due to posttranscriptional splicing, which is one disadvantage of translocation detection from RNA as compared to DNA.


FIGURE 10.3 Identification of translocations from discordant paired-end reads. (A) In this example a t(4;11) translocation is identified by discordant paired-end reads. Read pairs are first identified in which one end maps to the targeted region (in this case the MLL gene on 11q23) and the other end maps to a different chromosome. (B) Discordant paired-end read methods are subject to high false positive rates due to sequence mapping errors and repeat regions in the genome. Most translocation identification software employs filtering criteria to reduce the number of false positive calls.

Mapping of the RNA-Seq reads poses the first challenge. Two basic approaches are available: reads can be aligned to an appropriately chosen transcriptome reference using an aligner such as BWA, Novoalign, or Bowtie; or a splice-aware aligner able to map cDNA reads across untranscribed introns can be used [42]. For paired-end RNA-Seq, discordant reads are often used to identify regions containing putative rearrangements [43,44]. For either single- or paired-end RNA-Seq, split reads can be used to identify fusions. In general, these breakpoint-spanning split reads are identified by dividing unmapped single-end RNA-Seq reads into segments of length 25–30 bp and looking for sets of read fragments that map to different chromosomes, in the incorrect orientation, or farther apart on the same chromosome than expected for a spliced intron [42,45,46].
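The segment-and-remap strategy just described can be sketched in a few lines. The helper below is illustrative only; the Hit record and the thresholds are hypothetical rather than part of any specific fusion caller. It divides a read into 25 bp segments and applies the three discordance tests named in the text to a pair of segment alignments.

```python
from collections import namedtuple

# Hypothetical alignment record for one remapped segment
Hit = namedtuple("Hit", "chrom pos strand")

def split_read_segments(seq, k=25):
    """Divide an unmapped RNA-Seq read into non-overlapping k-mers
    (k = 25 here, within the 25-30 bp range described in the text)
    so each segment can be realigned independently."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def is_fusion_candidate(hit_a, hit_b, max_intron=500_000):
    """Two segments from one read support a fusion if they map to
    different chromosomes, in inconsistent orientations, or farther
    apart than a plausible spliced intron (max_intron is an assumed cutoff)."""
    if hit_a.chrom != hit_b.chrom:
        return True
    if hit_a.strand != hit_b.strand:
        return True
    return abs(hit_a.pos - hit_b.pos) > max_intron
```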

TRANSLOCATION DETECTION IN CLINICAL PRACTICE

In the last 2 years, detection of translocations by NGS has become feasible in the clinical laboratory and has several advantages over conventional methods such as FISH and cytogenetics. For example, in the case of MLL rearrangements, where over 100 known fusion partners have been identified, NGS-based methods have the potential to identify all known and unknown partner genes; similar testing by fusion FISH probes would be cost-prohibitive. NGS-based methods also promise to streamline clinical testing workflows by allowing a full spectrum of mutations, including single nucleotide variants, copy number variants, and translocations, to be detected from a single assay, in contrast to the disparate methods currently employed in the laboratory. NGS-based translocation identification also allows single-base resolution of chromosomal breakpoints, whereas FISH can only localize breakpoints to within 100–300 kb; this enables the identification of noncanonical breakpoints that may not respond to chemotherapy.


FIGURE 10.4 Effect of duplicate reads. (A) The circos plot shows all translocation events within the 151 targeted genes on the panel in the 6 FISH-positive KMT2A-rearranged cases called by Breakdancer (red lines) and ClusterFAST (blue) when duplicate reads were not removed prior to analysis. (B) Box and whisker plot of log10 counts of all SVs, interchromosomal rearrangements, and ALK/KMT2A interchromosomal rearrangements detected by Breakdancer (red) and ClusterFAST (blue) for all reads (dark red/blue) and with duplicates removed (light red/blue).

Finally, it has been shown that DNA derived from archival formalin-fixed tissue is an adequate substrate for NGS-based analysis and can be used when fresh tissue for cytogenetic studies or metaphase FISH is not available [47].

Laboratory Issues

Like all laboratory assays, NGS-based detection of rearrangements must be carefully evaluated to determine adequate performance characteristics. While most research studies have focused on detection sensitivity, specificity is also critically important in separating false positives from true positives, especially when calling rearrangements, as the false discovery rate can be quite high. When implementing NGS-based rearrangement detection, care should be taken to run an adequate number of known translocation-positive and -negative cases, as determined by an orthogonal method such as FISH or cytogenetics. Several analysis steps can reduce the number of false positive translocation calls, including use of combined discordant-pair and split-end read analysis methods, removal of duplicate reads, removal of multiply mapped reads, and requiring multiple reads supporting a particular rearrangement to be present in order to call a translocation event (Figure 10.4), as in the sketch below. Sensitivity (limit of detection) for rearrangements should be determined and can be modeled by simple physical or in silico dilution series studies using cases for which the percentage of rearranged cells in the undiluted specimen is known from FISH-based studies (Figure 10.5). Another laboratory consideration is the design of capture probes for targeted sequence enrichment. As most rearrangements occur in introns, obtaining adequate sequencing coverage in intronic regions is essential to maintain high assay sensitivity. Most predesigned, commercially available capture probe sets do not include intron coverage, so intronic probes must be added manually. Many probe design pipelines may also exclude repeat regions by using a "repeat masker" type function. Since introns commonly contain a number of repetitive regions, such regions may be inadvertently excluded from probe design pipelines. As most "repeat regions" tend to be short compared with the current sequencing read lengths obtained from Illumina chemistries, adequate coverage may still be obtained, and such regions may be further evaluated for alignability using the Duke alignability data available on the University of California Santa Cruz (UCSC) genome browser.
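The following fragment illustrates those filters in combination. It is a schematic sketch, not a production caller: the input tuples, the event_key windowing, and the MIN_SUPPORTING_FRAGMENTS threshold are all assumptions chosen for illustration and would be set during assay validation.

```python
from collections import Counter

MIN_SUPPORTING_FRAGMENTS = 4   # assumed threshold; set during assay validation

def filter_translocation_calls(fragments):
    """fragments: iterable of (is_duplicate, is_multimapped, event_key) tuples,
    one per sequenced fragment, where event_key is a hypothetical identifier
    for a putative junction (e.g., both breakpoint positions rounded to a
    1 kb window). Returns events with enough independent support."""
    support = Counter()
    for is_dup, is_multi, event_key in fragments:
        if is_dup or is_multi:  # drop PCR duplicates and ambiguous mappings
            continue
        support[event_key] += 1
    return {event: n for event, n in support.items()
            if n >= MIN_SUPPORTING_FRAGMENTS}
```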


FIGURE 10.5 Sensitivity of Breakdancer (black), Hydra (blue), and ClusterFAST (red) to detect the breakpoints in 13 ALK- and MLL-rearranged cases, in randomly downsampled BAM files. Squares indicate the mean sensitivity per tool (over three random samples), and error bars indicate the standard error of the mean.

TABLE 10.1 Translocations and Inversions

Tool | Comment | Download link

Discordant paired-end:
Breakdancer | Fast, simple to run | http://breakdancer.sourceforge.net
Hydra | Considers multiple mappings of discordant pairs | https://code.google.com/p/hydra-sv/
Variation Hunter | Considers multiple mappings of discordant pairs | http://variationhunter.sourceforge.net/Home
PEMer | Simulates SVs | http://sv.gersteinlab.org/pemer/introduction.html
GASVPro | Improved specificity by combining information from discordant pairs and coverage depth | http://compbio.cs.brown.edu/software.html

Split-end read methods:
CREST | Requires soft-clipped reads generated during alignment | http://www.stjuderesearch.org/site/lab/zhang
Slope | Replaced by ClusterFAST | https://github.com/eduncavage/clusterfast

Obtaining adequate intronic sequencing coverage is also complicated by the fact that some intronic regions have high GC content and are therefore difficult to capture and sequence (Figure 10.2). Problems associated with low coverage in GC-rich areas may be corrected either by sequencing to higher overall coverage depths or by optimizing library construction and capture protocols.
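Because GC extremes predict capture dropout, it can be useful to screen a probe design for problem windows before it is ordered. The sketch below is one simple way to do so; the window size and GC cutoffs are arbitrary assumptions, and `fasta` is assumed to be any object exposing fetch(chrom, start, end), such as pysam.FastaFile.

```python
def gc_fraction(seq):
    """Fraction of G/C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

def flag_gc_extreme_windows(fasta, targets, window=120, lo=0.30, hi=0.70):
    """Yield (chrom, window_start, gc) for target sub-windows whose GC
    content predicts poor capture or sequencing; cutoffs are illustrative."""
    for chrom, start, end in targets:
        for ws in range(start, end, window):
            gc = gc_fraction(fasta.fetch(chrom, ws, min(ws + window, end)))
            if gc < lo or gc > hi:
                yield chrom, ws, gc
```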

Online Resources

Numerous software tools are available for translocation detection. These include both publicly available options (summarized in Table 10.1) and commercial offerings. When implementing NGS-based translocation detection, multiple tools should be evaluated to determine which has optimal performance characteristics for the particular assay under consideration. Depending on the design of capture probes and the regions being sequenced, there may be large differences in the sensitivity or specificity of different translocation detection tools.


SUMMARY AND CONCLUSION

This chapter has summarized the importance of translocations in clinical medicine and their detection in the clinical laboratory by NGS. While most translocation detection to date involves cytogenetics or FISH, the adoption of NGS-based translocation detection will continue to grow during the next several years. This growth will be driven in part by decreasing sequencing costs, the widespread use of panel-based oncology testing in the clinical laboratory, and the ability of NGS-based methods to rapidly and cost-effectively interrogate numerous genes for rearrangements. Within the next 10 years, NGS-based translocation detection methods will likely eclipse FISH. Since the methods for NGS-based translocation detection will continue to evolve rapidly, laboratories interested in applying such methods should look to the literature for the most up-to-date information.

References
[1] Rowley JD. Letter: a new consistent chromosomal abnormality in chronic myelogenous leukaemia identified by quinacrine fluorescence and Giemsa staining. Nature 1973;243(5405):290–3.
[2] Lejeune J, Gautier M, Turpin R. [Study of somatic chromosomes from 9 mongoloid children]. C R Hebd Seances Acad Sci 1959;248(11):1721–2.
[3] Rudkin CT, Hungerford DA, Nowell PC. DNA contents of chromosome Ph1 and chromosome 21 in human chronic granulocytic leukemia. Science 1964;144(3623):1229–31.
[4] Rowley JD. Identification of a translocation with quinacrine fluorescence in a patient with acute leukemia. Ann Genet 1973;16(2):109–12.
[5] Lieber MR. The mechanism of human nonhomologous DNA end joining. J Biol Chem 2008;283(1):1–5.
[6] Fugmann SD, Lee AI, Shockett PE, Villey IJ, Schatz DG. The RAG proteins and V(D)J recombination: complexes, ends, and transposition. Annu Rev Immunol 2000;18:495–527.
[7] Lin C, Yang L, Tanasa B, Hutt K, Ju BG, Ohgi K, et al. Nuclear receptor-induced chromosomal proximity and DNA breaks underlie specific translocations in cancer. Cell 2009;139(6):1069–83.
[8] Mani RS, Tomlins SA, Callahan K, Ghosh A, Nyati MK, Varambally S, et al. Induced chromosomal proximity and gene fusions in prostate cancer. Science 2009;326(5957):1230.
[9] Nambiar M, Raghavan SC. How does DNA break during chromosomal translocations? Nucleic Acids Res 2011;39(14):5813–25.
[10] Roukos V, Misteli T. The biogenesis of chromosome translocations. Nat Cell Biol 2014;16(4):293–300.
[11] Zhang Y, Gostissa M, Hildebrand DG, Becker MS, Boboila C, Chiarle R, et al. The role of mechanistic factors in promoting chromosomal translocations found in lymphoid and other cancers. Adv Immunol 2010;106:93–133.
[12] Chiarle R, Zhang Y, Frock RL, Lewis SM, Molinie B, Ho YJ, et al. Genome-wide translocation sequencing reveals mechanisms of chromosome breaks and rearrangements in B cells. Cell 2011;147(1):107–19.
[13] Misteli T. Beyond the sequence: cellular organization of genome function. Cell 2007;128(4):787–800.
[14] Nikiforova MN, Stringer JR, Blough R, Medvedovic M, Fagin JA, Nikiforov YE. Proximity of chromosomal loci that participate in radiation-induced rearrangements in human cells. Science 2000;290(5489):138–41.
[15] Parada LA, McQueen PG, Misteli T. Tissue-specific spatial organization of genomes. Genome Biol 2004;5(7):R44.
[16] Roukos V, Voss TC, Schmidt CK, Lee S, Wangsa D, Misteli T. Spatial dynamics of chromosome translocations in living cells. Science 2013;341(6146):660–4.
[17] Ishida S, Yoshida K, Kaneko Y, Tanaka Y, Sasaki Y, Urano F, et al. The genomic breakpoint and chimeric transcripts in the EWSR1-ETV4/E1AF gene fusion in Ewing sarcoma. Cytogenet Cell Genet 1998;82(3–4):278–83.
[18] Barr FG, Womer RB. Molecular diagnosis of Ewing family tumors: too many fusions...? J Mol Diagn 2007;9(4):437–40.
[19] Kobayashi K, Hagiwara K. Epidermal growth factor receptor (EGFR) mutation and personalized therapy in advanced nonsmall cell lung cancer (NSCLC). Target Oncol 2013;8(1):27–33.
[20] Bos M, Gardizi M, Schildhaus HU, Heukamp LC, Geist T, Kaminsky B, et al. Complete metabolic response in a patient with repeatedly relapsed non-small cell lung cancer harboring ROS1 gene rearrangement after treatment with crizotinib. Lung Cancer 2013;81(1):142–3.
[21] Zambrano RM, Wohler E, Anneren G, Thuresson AC, Cutting GR, Batista DA. Unbalanced translocation 9;16 in two children with dysmorphic features, and severe developmental delay: evidence of cross-over within derivative chromosome 9 in patient #1. Eur J Med Genet 2011;54(2):189–93.
[22] Rauch A, Hoyer J, Guth S, Zweier C, Kraus C, Becker C, et al. Diagnostic yield of various genetic approaches in patients with unexplained developmental delay or mental retardation. Am J Med Genet A 2006;140(19):2063–74.
[23] Hochstenbach R, van Binsbergen E, Engelen J, Nieuwint A, Polstra A, Poddighe P, et al. Array analysis and karyotyping: workflow consequences based on a retrospective study of 36,325 patients with idiopathic developmental delay in the Netherlands. Eur J Med Genet 2009;52(4):161–9.
[24] Utami KH, Hillmer AM, Aksoy I, Chew EG, Teo AS, Zhang Z, et al. Detection of chromosomal breakpoints in patients with developmental delay and speech disorders. PLoS One 2014;9(6):e90852.
[25] Kavalier F. Investigation of recurrent miscarriages. BMJ 2005;331(7509):121–2.
[26] Joyce CA, Dennis NR, Howard F, Davis LM, Thomas NS. An 11p;17p telomeric translocation in two families associated with recurrent miscarriages and Miller-Dieker syndrome. Eur J Hum Genet 2002;10(11):707–14.
[27] Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med 2013;368(22):2059–74.
[28] Duncavage EJ, Abel HJ, Szankasi P, Kelley TW, Pfeifer JD. Targeted next generation sequencing of clinically significant gene mutations and translocations in leukemia. Mod Pathol 2012;25(6):795–804.


[29] Abel HJ, Al-Kateb H, Cottrell CE, Bredemeyer AJ, Pritchard CC, Grossmann AH, et al. Detection of gene rearrangements in targeted clinical next-generation sequencing. J Mol Diagn 2014;16(4):405–17.
[30] Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature 2009;458(7234):97–101.
[31] Cabanski CR, Magrini V, Griffith M, Griffith OL, McGrath S, Zhang J, et al. cDNA hybrid capture improves transcriptome analysis on low-input and archived samples. J Mol Diagn 2014;16(4):440–51.
[32] Hormozdiari F, Hajirasouliha I, Dao P, Hach F, Yorukoglu D, Alkan C, et al. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics 2010;26(12):i350–7.
[33] Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 2009;6(9):677–81.
[34] Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME, et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 2010;20(5):623–35.
[35] Korbel JO, Abyzov A, Mu XJ, Carriero N, Cayting P, Zhang Z, et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol 2009;10(2):R23.
[36] Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010;26(5):589–95.
[37] Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods 2011;8(8):652–4.
[38] Suzuki S, Yasuda T, Shiraishi Y, Miyano S, Nagasaki M. ClipCrop: a tool for detecting structural variations with single-base resolution using soft-clipping information. BMC Bioinformatics 2011;12(Suppl. 14):S7.
[39] Rausch T, Zichner T, Schlattl A, Stutz AM, Benes V, Korbel JO. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012;28(18):i333–9.
[40] Sindi SS, Onal S, Peng LC, Wu HT, Raphael BJ. An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol 2012;13(3):R22.
[41] Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol 2014;15(6):R84.
[42] Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol 2011;12(8):R72.
[43] Jia W, Qiu K, He M, Song P, Zhou Q, Zhou F, et al. SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome Biol 2013;14(2):R12.
[44] McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G, Sun MG, et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput Biol 2011;7(5):e1001138.
[45] Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 2013;14(4):R36.
[46] Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 2010;38(18):e178.
[47] Spencer DH, Sehn JK, Abel HJ, Watson MA, Pfeifer JD, Duncavage EJ. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J Mol Diagn 2013;15(5):623–33.


11 Copy Number Variant Detection Using Next-Generation Sequencing

Alex Nord1, Stephen J. Salipante2 and Colin Pritchard2
1Center for Neuroscience, Departments of Neurobiology, Physiology and Behavior and Psychiatry, University of California at Davis, CA, USA; 2Department of Laboratory Medicine, University of Washington, Seattle, WA, USA

OUTLINE

Overview of Copy Number Variation and Detection via Clinical Next-Generation Sequencing
  Introduction
  CNV Definition and Relationship to Other Classes of Structural Variation
  Clinical CNV Screening and Potential for NGS-Based Discovery
Sources, Frequency, and Functional Consequences of Copy Number Variation in Humans
  Mechanisms of CNV Generation
  Frequency in the Human Genome
  CNVs and Disease: Functional Consequences
CNV Detection in Clinical NGS Applications
  Historical and Current Methods for Clinical Detection of CNVs
  NGS in the Clinic: The Promise of Cost-Effective Comprehensive Mutation Testing
  Targeted Sequencing of Candidate Genes
  Exome Sequencing
  Whole Genome Sequencing
  Cell-Free NGS DNA Screening
Conceptual Approaches to NGS CNV Detection
  Introduction
  Discordant Mate Pair Methods
  Depth of Coverage
  SNP Allele Frequency
  Split Reads and Local De Novo Assembly
Detection in the Clinic: Linking Application, Technical Approach, and Detection Methods
  Targeted Gene Screening
  Exome Sequencing for Unbiased Coding Variant Discovery
  Cell-Free DNA
  Whole Genome Sequencing and Emerging Technologies
Reference Standards
  Genome Structural Variation Consortium Data Set
  1000 Genomes Project Structural Variant Map Data Set
Orthogonal CNV Validation
Summary and Conclusion
References
Glossary
List of Acronyms and Abbreviations


KEY CONCEPTS

• CNVs are an important class of genetic variation for many human diseases, including inherited syndromes and cancer-acquired mutations such as oncogene amplification.
• Understanding the basic mechanisms of CNV formation and the functional consequences informs the interpretation of CNV detection from NGS data.
• CNVs can be detected from NGS data by multiple, complementary methods including mate pair, relative depth of coverage, single nucleotide polymorphism allele frequency, and split read analysis.
• Choice of the optimal CNV detection strategy and sensitivity of CNV detection is heavily dependent on the method of DNA preparation.
• Whole genome shotgun sequencing and mate pair sequencing afford the greatest number of CNV detection options but are limited by cost and labor.
• Targeted NGS can also successfully detect CNVs, offering comprehensive mutation screening when combinations of approaches are used.

OVERVIEW OF COPY NUMBER VARIATION AND DETECTION VIA CLINICAL NEXT-GENERATION SEQUENCING

Introduction

Copy number variation is a general term used to refer to population-level genetic differences characterized by the gain or loss of specific regions of DNA in individual genomes. Individual copy number variants (CNVs) may be inherited in the germline or acquired as somatic mutations, as in cancer genomes. Unlike single nucleotide variants, CNVs vary greatly in size and structure, ranging from tens to millions of nucleotides in length, and often involve complex DNA rearrangements. Inherited CNVs can be rare or common depending on how recently the variant arose in evolution, population history, and selective pressure, and CNVs may spontaneously arise via a variety of molecular processes. While less common in frequency than single nucleotide polymorphisms (SNPs), copy number variation accounts for the majority of nucleotide differences between any two individual genomes because of the large size of individual CNVs [1]. Both common and rare inherited CNVs are associated with human disease [2–4], acting through a variety of different mechanistic pathways including altered gene dosage, formation of novel gain-of-function alleles, or inherited deletion of a tumor suppressor gene predisposing to cancer. Stand-alone screening tests for CNVs, both genome-wide and at disease-causing loci, are currently used in medical genetics and in oncology, but the necessity of dedicated tests to detect CNVs increases costs, and clinical CNV screening tests are often limited in the spectrum and size of CNVs that are detectable. The emergence of next-generation sequencing (NGS) as a clinical diagnostic tool offers the potential for comprehensive mutation screening; however, substantial complexity exists regarding the identification of CNVs from short-read sequence data for both whole genome and targeted NGS approaches.

CNV Definition and Relationship to Other Classes of Structural Variation Copy number variation refers to gain or loss of DNA, however, varying definitions (e.g., .1 kb or .500 bp) have been used for events of particular sizes. This has resulted in a spectrum of defined variation classes describing DNA gain or loss, with inconsistently applied terminology and definitions. On one end of the spectrum is the gain or loss of one or several bases, referred to as insertions/deletions or indels, which are smaller than CNVs and often identified based on local sequence alignment. The size range of CNVs is bound on the other side by very large changes that can be visible on a karyotype, from microdeletions/microduplications which generally span greater than a megabase, to the gain or loss of whole chromosomes. This chapter covers methods that are used to detect copy number changes that are larger than the maximum size of indel events that can be identified by basic short read alignment algorithms. For such changes in copy number, the same detection approaches apply for detecting gains and losses of any size, thus this discussion covers detection of CNVs varying in size from tens of bases to millions of bases, including


On the molecular level, CNVs often involve complex rearrangements of DNA accompanying deletion or duplication, including insertion of exogenous DNA or inversion of duplicated sequence. While these rearrangements may be resolvable via mate pair or split read methods of sequence assembly, as described later in this chapter, targeted NGS approaches are generally limited to detecting gain or loss of assayed sequence only and may not characterize the accompanying structural complexity of a CNV.

Clinical CNV Screening and Potential for NGS-Based Discovery

Massively parallel DNA sequencing, more commonly referred to as NGS, is a cost-effective and powerful tool with many clinical applications. It is attractive as a diagnostic platform because it provides the potential for comprehensive mutation detection over many genomic targets simultaneously, high-throughput analysis of patient panels across disease loci, and numerous specialized applications. The focus of this chapter is on approaches to NGS data generation and methods for NGS data analysis that allow identification of duplicated or deleted DNA regions in a patient sample relative to a reference genome, with particular attention given to applications where comprehensive mutation screening is the goal and CNV detection complements the detection of other variants. Specialized applications are also covered, for example, where NGS is used as a sensitive assay to detect low-prevalence genomic changes, such as identification of fetal trisomy via cell-free DNA sequencing from maternal plasma, or tumor cell changes in a background of admixture with nonneoplastic material. While specific methods are referenced where possible, it should be recognized that methods are constantly evolving. The focus is primarily on the conceptual description of CNV identification from NGS data, including aspects of experimental design, an overview of detection algorithms, and linking identified CNVs to actionable medical care.

SOURCES, FREQUENCY, AND FUNCTIONAL CONSEQUENCES OF COPY NUMBER VARIATION IN HUMANS Mechanisms of CNV Generation De novo formation of copy number variation can occur both in the germline and in somatic cells and is generated through two generalized mechanisms: (i) nonallelic homologous recombination (NAHR) and (ii) events that are not dependent on significant homology, including nonhomologous end joining (NHEJ) and microhomologymediated end joining (MMEJ) (Figure 11.1) [5]. NAHR is driven by extended sequence homology between two regions of the genome, where incorrect pairing during meiosis/mitosis or DNA repair across homologous regions can result in gain or loss of intervening sequence. Nonhomologous mechanisms of CNV formation appear to occur during DNA repair and during replication, where DNA breaks are repaired by annealing to nonhomologous DNA. Regions of the genome with duplicated structure, such as segmental duplications or Alu and LINE/L1 repetitive elements, are more likely to be copy number polymorphic due to NAHR, nonetheless, the majority of human copy number variation appears to have arisen via nonhomologous mechanisms [6]. With regard to clinically relevant CNVs, the presence of duplicated sequences or copy number polymorphic regions can decrease detection sensitivity and increase the likelihood of CNV formation at particular loci based on genome structure. For example, the high Alu content across BRCA1 leads to a greater proportion of causal CNVs relative to single nucleotide variants and indels in comparison to BRCA2 [7].

Frequency in the Human Genome

Estimates of de novo CNV formation, based on mapping of recombination events in sperm and on the frequency of disease-causing CNVs, fall between 10⁻⁶ and 10⁻⁴ per gamete, a higher rate of formation than for other classes of mutations [8,9]. The majority of CNVs in a given human genome are inherited, and a large proportion of the human genome is estimated to be copy number polymorphic [10,11]. The allele frequency landscape for human CNVs is similar to that of point mutations, where the majority of CNVs across the population are rare, but the majority of CNVs carried in an individual genome are common across the population [12]. Until recently, estimates of CNV frequency were biased toward larger CNVs that affect nonduplicated sequence, which are often driven by NAHR. Advances in NGS-based CNV identification will clarify the genomic location, population frequency, and mechanisms of formation for CNVs of all size classes.


FIGURE 11.1 Mechanisms of CNV generation. NAHR (top panel) is a common mechanism of CNV generation. In the figure each word represents a genomic locus. The word “The” in the top panel represents a homologous region of the genome. When homologous recombination occurs at the incorrect locus, two reciprocal CNVs are produced, one chromosome with a tandem duplication and one with a deletion. More complex mechanisms of CNV production include NHEJ (bottom panel) and MMEJ. Understanding the mechanisms by which CNVs are produced is helpful for correctly inferring that a CNV is present from NGS data.

Such advances will be critical for predicting large-effect causal CNVs during clinical screening, where filtering against population variation is a standard process in mutation screening.

CNVs and Disease: Functional Consequences

The gain or loss of DNA sequence can produce a spectrum of functional effects and disease phenotypes (Figure 11.2, Table 11.1), leading to complexity in predicting the effect of a particular CNV. Complete duplication or deletion of coding sequence can change gene dosage and protein expression, or lead to increased susceptibility to cancer via somatic loss of the remaining allele. Partial gain or loss of coding sequence can produce a number of different alleles, including both loss and gain of function. For example, deletion of internal exons could result in a frameshift and subsequent loss of function through truncation or nonsense-mediated decay, or could result in a protein lacking functional domains that acts as a dominant negative. Chimeric proteins can be produced when CNV breakpoints lie within two separate genes, leading to fusion of two partial coding regions. CNVs in noncoding regions can also generate a number of position effects via the deletion or transposition of critical regulatory elements. For example, deletion of noncoding sequence may position a gene proximal to regulatory elements that drive ectopic expression. Furthermore, many large CNVs overlap numerous genes on a chromosome, leading to syndromic effects that can be difficult to dissect [13].


FIGURE 11.2 Functional consequences of CNVs. A schematic representation of two genes and corresponding regulatory elements is shown in the top panel. CNVs commonly result in a simple dosage effect of loss or gain of gene/protein (second panel from top). New alleles, including gain-of-function proteins and fusion proteins may be created (third panel from top). Finally, CNVs may juxtapose regulatory elements for one gene with another, in some cases causing aberrant ectopic expression of a protein through this position effect (bottom panel).

TABLE 11.1 Selected Diseases Associated with Germline CNVs

Disease/Syndrome | CNV(s)
Velo-cardio-facial syndrome | 22q11.2 deletion
Williams syndrome | 7q11.2 deletions
Autism | e.g., 16p11.2 deletion, many others
Lynch syndrome | e.g., MSH2 exon 16 deletion, many others
Schizophrenia | 16p11.2 duplications, VIPR2 duplication, many others
Charcot-Marie-Tooth | PMP22 duplication, many others
Alpha thalassemia | HBA1 and HBA2 deletions

While the functional consequences of a CNV may be difficult to predict, many CNVs do generate alleles with a clear-cut impact and thus inform actionable clinical care, such as an inherited loss-of-function allele of a tumor suppressor in a patient with Lynch syndrome [14], deletions and duplications of different genomic regions leading to schizophrenia [15], or characterized syndromic microdeletions, such as identification of a 22q11.2 deletion in a child with a mixture of developmental phenotypes [16]. Depending on the particular clinical application of NGS, prior hypotheses can be used to guide mutation interpretation, such as in the targeted screening of tumor suppressor genes, though even in these cases it may be difficult to separate deleterious from benign mutations.


CNV DETECTION IN CLINICAL NGS APPLICATIONS

Patient whole genome sequencing is likely to become routine in the near future given continued decreases in the cost of sequencing, data analysis, and data storage, and the concomitant development of the infrastructure needed to effectively apply individual genome data in clinical care. However, until whole genome sequencing is mature as a clinical technology, NGS will continue to be used in a more focused manner in the clinic, with a suite of approaches for different applications. Many of these specialized NGS approaches are currently being incorporated into clinical care and are expected to eventually supplant existing standardized tests for mutation detection. This section reviews current clinical CNV screening technologies and discusses emerging and future NGS technology in the context of CNV detection across various clinical applications.

Historical and Current Methods for Clinical Detection of CNVs

High-resolution karyotyping, which allowed the identification of microdeletions and microduplications, arose in the late 1970s [17]. In the late 1980s, fluorescence in situ hybridization (FISH) was incorporated to test for specific gains and losses in the clinic [18]. Improvements and refinements to these technologies have progressively increased specificity and resolution, and karyotyping and FISH remained the dominant technologies until recently. Comparative genomic hybridization (CGH) was developed in the early 1990s [19], and recent improvements in detection resolution via array-based methods (array comparative genomic hybridization, aCGH) now permit the detection of CNVs down to tens to hundreds of kilobases in length [20]. aCGH is commonly used in clinical screening for germline mutations [21] and to profile somatic CNVs in cancer [22], even as the NGS-based methods that will likely replace aCGH are being rapidly developed. In parallel to genome-wide CNV discovery, screening for small CNVs at known disease loci is currently performed using polymerase chain reaction (PCR) or other amplification-dependent technologies, such as multiplex ligation-dependent probe amplification (MLPA) [23]. Major limitations of these targeted CNV screening methods are dependency on prior hypotheses in the selection of targets and restrictions in the number of targets that can be assayed in one test. NGS applications have the potential to replace many of these existing technologies.

NGS in the Clinic: The Promise of Cost-Effective Comprehensive Mutation Testing

Two main factors are driving the incorporation of NGS in the clinic: the decreased relative cost of testing, and the increased sensitivity of mutation detection together with more complete mutation characterization. With respect to CNVs, cost-effectiveness is improved for NGS platforms because their use can potentially detect all classes of variants in a single test [24], and targeted screening of candidate loci can be performed for large numbers of samples in parallel (on the same run) via sample indexing [24,25]. With respect to sensitivity and specificity gains, NGS approaches to CNV detection escape some size- or target-based limitations of existing tests, can generate complementary supporting evidence of CNV presence, and can potentially characterize CNVs at sequence-level resolution. There are four general categories of NGS approaches to CNV detection that differ in technical DNA preparation (Figure 11.3), CNV detection capability, trade-off between cost and scope, comparison with existing technologies, and clinical utility. These categories are: (i) targeted sequencing of selected regions, (ii) exome sequencing to profile coding variation genome-wide, (iii) whole genome sequencing of cellular DNA, and (iv) whole genome sequencing of cell-free DNA in plasma (Table 11.2). Relevant conceptual approaches to CNV detection, specifically depth of coverage, mate pair, and sequence-based evidence, are mentioned in the description of the general categories of NGS application below, and then described in detail in the following section.

Targeted Sequencing of Candidate Genes

The use of NGS for targeted mutation screening of characterized disease genes has been rapidly incorporated into clinical genetic testing as a method to screen large numbers of candidate genes for pathogenic mutations. As targeted genes are typically selected based on known or suspected contribution to disease, identified mutations are likely to be interpretable, clinically relevant, and actionable. This approach has been used for identification of germline mutations across a variety of disease phenotypes [24,26], and for profiling variation in tumors [27]. Targeted sequencing approaches rely on sequence selection via hybridization capture [26,28] or amplification-based methods [29], which may differ in CNV detection capabilities in downstream analysis.


FIGURE 11.3 DNA preparation for NGS sequencing. The method of NGS DNA template preparation affects CNV detection options. First, genomic DNA is typically fragmented by sonication or enzymatic methods to ~300 bp sheared fragments, or prepared by specialized methods in a mate pair library (e.g., ~3 kb inserts). For mate pair libraries, short sequencing reads are done at either end of the ~3 kb insert, allowing inference of copy number gains or losses when the two paired sequences do not map to a reference genome at the expected ~3 kb distance. Sheared DNA may be sequenced by shotgun sequencing, in which all fragments are sequenced, or by targeted approaches including PCR-based amplification or hybridization capture to enrich only selected regions. At the bottom, a schematic of mapped NGS sequence reads is shown for a mate pair library (black), shotgun sequencing (blue), and targeted sequencing (red).

Targeted regions can be continuous across genomic loci [30], or discontinuous, as in the case of exon-only approaches [27]. NGS reads generated by high-coverage targeted sequencing are amenable to simultaneous single nucleotide, indel, and CNV detection, assuming coverage is deep enough, allowing near-comprehensive mutation screening in a single test [27]. As targeted NGS is both comprehensive and cost-effective on a per-sample basis, this technology compares very favorably with existing single-sample Sanger sequencing combined with CNV screening using MLPA or a similar approach (e.g., the Myriad Genetics BRACAnalysis Large Rearrangement Test (BART)).


TABLE 11.2 Clinical NGS Approaches and CNV Calling

Approach | Scope | Clinical example | Per-sample cost | DNA preparation | Primary CNV methods | CNV detection resolution
Targeted sequencing: selected regions | Selected genes/exons | Cancer gene panel | Low | Capture or amplification | Depth of coverage | Coverage dependent, all CNVs possible
Targeted sequencing: whole exome | All coding exons | Rare disorder | Medium | Capture or amplification | Depth of coverage | Coverage dependent, currently multi-exon
Whole genome | Full genome | Tumor profiling | High | Shotgun, mate pair library | Depth of coverage, mate pair, split read, allele frequency | Variable, all structural variants possible
Cell-free DNA | Variable | Trisomy 21 detection | Variable | Variable | Depth of coverage, mate pair | Variable, generally very large CNVs

Despite clear successes in CNV detection in targeted NGS applications using a combination of relative depth-of-coverage methods and direct sequence evidence [31], as well as other approaches, methods are not standardized and CNV screening is not yet consistently performed in all assays.

Exome Sequencing

Exome sequencing is a specific variant of the targeted approach discussed above that has also been rapidly incorporated into clinical genetic screening. Instead of focusing on candidate genes, all coding regions across the genome are targeted for unbiased screening of coding variants. Exome sequencing has proven exceptionally useful for molecular diagnosis of rare monogenic disorders [32], identification of de novo variants implicated in complex diseases such as autism [33], and tumor/normal analysis to identify driver genes in cancer [34]. Large-scale exome sequencing is underway to identify rare variants that contribute to a number of common complex human traits. In contrast to more targeted gene capture panels, expansion of target sequences to cover nearly all coding regions requires sequencing a much higher proportion of the genome, reducing the capacity for sample multiplexing and the overall sequencing coverage depth per site. Regardless, analogous to targeted NGS strategies that focus on coding regions, the primary approach for CNV detection in exome sequencing is overall depth of sequencing coverage. When coverage is sufficiently deep, relative depth-of-coverage methods have been successful for robust CNV detection from NGS exome data [35]. While exome sequencing was the first technology to offer cost-effective unbiased genomic screening for single nucleotide variants and indels (albeit limited to coding regions), aCGH is already well established for clinical full genome CNV screening [21]. In some respects, CNV detection by aCGH has advantages over current exome sequencing technology, for example, by incorporating consistently spaced probes across the genome rather than tiling over the inconsistent biological spacing of exons. In the future, exome designs could be extended to include targeting of specific intergenic regions, similar to how SNP arrays have been extended for increased utility in CNV screening through the inclusion of specific copy number probes. However, even focusing on coding sequences alone, the comparison between aCGH and exome sequencing for clinical CNV discovery is becoming more favorable, with the ability to detect coding CNVs from exome data growing increasingly comparable to aCGH-based calls. With increased overall sequencing depth and decreased bias in local sequence coverage [36], it is likely that performing aCGH as a complementary method alongside or prior to exome sequencing will become unnecessary in the near future. The final two classes of clinical NGS applications are qualitatively different from targeted approaches, relying on shotgun sequencing of the full genome.

Whole Genome Sequencing

Whole genome sequencing using NGS is anticipated to revolutionize clinical care, yet the general promise of whole genome sequencing is far from being fulfilled.


While sequencing cost is still prohibitive for general clinical application, it continues to drop rapidly, and so the larger barrier to clinical utility is now the difficulty of interpreting whole genome variation data in the context of the high number of rare variants and the lack of CNV annotation for noncoding sequence. Already, exome sequencing may uncover large numbers of candidate variants, and verification can require customized functional testing [37,38]. Nonetheless, several major initiatives are underway to generate whole genome sequence data on a population level [39] and for larger patient populations. In the near term, whole genome sequencing will likely be applied when the proportional average cost of medical treatment is significantly more expensive than that of whole genome sequencing, and when molecular diagnosis by whole genome sequencing may lead to directly actionable information, such as implicating genetic disorders or informing cancer treatment. Emerging sequencing technologies are likely to play a significant role in changing this balance, with the potential for longer sequencing reads, lower costs, and single-molecule sequencing expanding the potential clinical applications of whole genome sequencing [40]. With regard to CNV detection, whole genome shotgun sequence data have several advantages compared with targeted NGS data. While overall coverage is typically far lower for whole genome data, as is necessary to cover a far larger sequence space while keeping sequencing costs economical, a combination of depth of coverage, direct sequence evidence, and mate pair sequencing strategies can be used to identify not only CNVs but all classes of structural variation. In contrast to targeted or exome sequencing, whole genome sequencing has no comparable existing clinical technology: while aCGH can detect CNVs spanning the entire genome, it is restricted in resolution and will fail to identify other forms of structural variation.

Cell-Free NGS DNA Screening

Conventional whole genome sequencing has the goal of mapping all variants present in an individual genome, with many limitations remaining regarding application in a clinical setting. In comparison, applications that use whole genome NGS for the more focused purpose of CNV detection from cell-free DNA are quickly being adopted to identify fetal or cancer DNA signatures that occur at low frequency and are admixed with maternal or patient DNA, respectively. These methods take advantage of naked DNA fragments circulating in plasma, specifically genome fragments from cells at anatomically distant sites, including tumors or developing fetuses, that are present at a low but detectable frequency in peripheral blood. For analysis of fetal DNA, whole genome NGS of cell-free DNA in maternal plasma was first demonstrated to be useful for noninvasive diagnosis of fetal trisomies [41,42]; compared with amniocentesis or chorionic villus sampling, this approach confers virtually no safety risks to the mother or developing fetus and is no more expensive. Sensitivity is dependent on sequencing depth, and CNV detection resolution will likely improve quickly, permitting screening for other relevant mutations. Cell-free DNA sequencing can similarly be used to detect the presence of cancer cells harboring aneuploidy, allowing screening for cancer remission or potentially early detection of cancer-causing mutations [43]. Due to the fragmented nature and limited supply of sequencing template, these applications rely on relative depth of coverage, often with very low resolution, such as the detection of whole chromosome arm duplication or deletion, and occasionally incorporate mate pair methods. Given the significant potential advantages over existing clinical technology with respect to assay sensitivity and noninvasiveness of sampling, cell-free DNA sequencing is likely to eventually become commonplace for fetal genetic screening and cancer surveillance.
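The depth-of-coverage logic behind cell-free trisomy screening can be reduced to a simple over-representation test. The sketch below is illustrative only: it compares the fraction of reads mapping to chromosome 21 in a test sample against a panel of euploid controls, and the z > 3 convention noted in the comments is an assumption rather than a validated clinical cutoff.

```python
import statistics

def chr21_read_fraction(per_chrom_counts):
    """per_chrom_counts: dict mapping chromosome name to aligned read count."""
    return per_chrom_counts["chr21"] / sum(per_chrom_counts.values())

def trisomy_z_score(case_frac, control_fracs):
    """Compare a test pregnancy's chr21 read fraction with the distribution
    observed in euploid control pregnancies; fetal trisomy 21 shifts the
    fraction upward in proportion to the fetal DNA fraction."""
    mu = statistics.mean(control_fracs)
    sd = statistics.stdev(control_fracs)
    return (case_frac - mu) / sd

# A z-score above ~3 (an illustrative threshold) would flag the sample
# for follow-up rather than serve as a diagnosis on its own.
```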

CONCEPTUAL APPROACHES TO NGS CNV DETECTION

Introduction

NGS generates relatively short read sequences that can be aligned to a reference genome. There are several signatures, both indirect and direct, that are used to identify CNVs from NGS data, and they are described here in detail (Figure 11.4). Indirect signatures rely on patterns indicative of the presence of CNVs, often without fully characterizing the exact boundaries or sequence content within the CNV. Early NGS applications used discordant mate pair mapping to identify insertions and deletions, along with balanced structural variants such as inversions and translocations [44,45]. The most common generalized approach is comparison of relative depth of coverage [46–48], which was first used to identify CNVs in whole genome NGS data [20] and has since been applied to targeted sequencing panels [49] and to exomes [50]. A related indicator of CNVs is the allele frequency of common SNPs, where allele counts that depart from expected homozygous/heterozygous ratios indicate departure from a diploid copy state. Finally, with continued increases in read length, the sequence library may include split reads that capture CNV breakpoints and exact sequence content.


FIGURE 11.4 Conceptual approaches to CNV detection. An overview of four conceptual approaches to CNV detection by NGS, including the strengths and limitations of each approach; each approach is covered in detail in Figures 11.5–11.8.

Discordant mate pair. Overview: mate pair reads that map farther apart or closer together than expected indicate a deletion or insertion; paired reads that map to disparate regions or DNA strands indicate balanced structural variation. Strengths: detects all classes of structural variation at low coverage. Limitations: requires a mate pair library, incompatible with targeted approaches, no CNV characterization. Primary applications: whole genome structural variation profiling of cellular or cell-free DNA.

Depth of coverage. Overview: sequencing depth of coverage is dependent on the copy number of the sequenced region; relative depth of coverage is compared across samples to identify regions where copy number changes are present. Strengths: compatible with all NGS approaches, can detect all sizes of CNV. Limitations: resolution dependent on high sequencing depth, subject to capture/sequencing bias. Primary applications: all NGS applications.

Allele frequency ratio. Overview: SNP allele frequency ratios are used to predict change in copy number across a genomic region. Strengths: compatible with all NGS approaches as complementary evidence, detects LOH. Limitations: dependent on presence of heterozygous SNPs, low resolution. Primary applications: secondary evidence supporting CNV calls, LOH screening.

Split read/assembly. Overview: individual NGS reads that partially map to disparate regions indicate the presence of a structural variant breakpoint; local assembly of split reads can characterize CNV breakpoints at base-level resolution. Strengths: complete characterization of CNVs (base-level sequence resolution), can detect all classes of SV. Limitations: limited to detection of CNVs/SVs whose breakpoints are sequenced. Primary applications: secondary evidence supporting CNV calls, high-coverage whole genome CNV/SV screening.


The following subsections describe each of these approaches in detail; a short sketch of the allele frequency signature follows below. Published methods exist to perform CNV discovery using each of the approaches described; however, as preferences for particular computational methods change quickly, this discussion focuses on the conceptual approach rather than on specific algorithms. Where possible, selected citations to specific methods are referenced.
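As one concrete example of the indirect signatures above, the fragment below evaluates the SNP allele frequency signature: in a diploid region, heterozygous sites should show a B-allele fraction near 0.5, a duplication pushes heterozygous sites toward roughly 0.33/0.67, and a deletion removes heterozygosity altogether. It is a toy illustration with arbitrary depth and deviation cutoffs, not any published caller.

```python
def allele_frequency_signal(ref_counts, alt_counts, min_depth=30):
    """Summarize the B-allele fraction (BAF) over SNPs in one region.
    ref_counts/alt_counts are per-SNP read counts for the two alleles."""
    bafs = []
    for ref_n, alt_n in zip(ref_counts, alt_counts):
        depth = ref_n + alt_n
        if depth >= min_depth:
            bafs.append(alt_n / depth)
    # Crude heterozygosity filter: homozygous sites cluster near 0 or 1
    het = [b for b in bafs if 0.1 < b < 0.9]
    if not het:
        return "no heterozygous SNPs: possible deletion/LOH (or uninformative)"
    # Mean absolute deviation from 0.5 is ~0 for a diploid region and
    # ~0.17 if BAFs cluster at 1/3 and 2/3, as expected for three copies
    dev = sum(abs(b - 0.5) for b in het) / len(het)
    if dev > 0.1:
        return "allelic imbalance: possible duplication"
    return "consistent with diploid"
```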

Discordant Mate Pair Methods

Mate pair libraries refer specifically to sequence libraries generated from long DNA fragments (e.g., 2–5 kb long) that are then incompletely sequenced using a paired-end strategy [51]. The paired ends are mapped back to the reference genome using standard alignment algorithms, and the distance between the mapped reads is considered proportional to the length of sequence contained in the original fragment. While the majority of paired ends will map to the genome at a distance distributed around the expected target fragment size, a minority will map with an intervening distance significantly greater or less than predicted by the fragment size, or the paired reads will map in the opposite of the expected orientation. Because they deviate from expectations, these reads are referred to as "discordant." Based on the mapping orientation and distance, the presence of discordant reads indicates the presence of a structural variant (Figure 11.5). Because each fragment spans a much longer stretch of DNA than what is sequenced, even at low base coverage (e.g., less than 1×) the clone or fragment coverage across the genome is proportionately much higher, and multiple paired-end reads are likely to overlap any given structural variant. Low-coverage sequencing can thus generate a robust map of genome-wide structural variation. As such, mate pair strategies for CNV and structural variant detection have been employed for both genomic DNA and low-abundance cell-free DNA originating from cancer cells [52]. While this method can identify the presence of insertions, it does not resolve the inserted sequence. Furthermore, there are limitations regarding the detection of variants under a certain size, depending on the tightness of the size distribution of the fragment library. As hybridization- and amplification-based sequence capture are designed to target short sequences (roughly 100–300 bp), mate pair sequencing is not compatible with targeted NGS approaches. Finally, preparing mate pair libraries is technically much more challenging and time consuming, and requires more input DNA, than generating whole genome shotgun or targeted sequencing libraries, making this strategy more difficult to implement from a practical standpoint.
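The decision rule applied to each mapped pair can be expressed compactly. This is a schematic sketch of the logic in Figure 11.5, assuming a library with a ~3 kb insert; the mean, standard deviation, and the ±4 SD concordance window are illustrative values that would in practice be estimated from the library itself.

```python
def classify_mate_pair(same_chrom, same_strand_orientation, mapped_distance,
                       mean_insert=3000, sd_insert=300, n_sd=4):
    """Classify one mate pair against the expected insert-size distribution.
    Mates mapping to different chromosomes suggest a translocation; an
    unexpected relative orientation suggests an inversion or tandem
    duplication; distance outliers suggest deletion/insertion."""
    if not same_chrom:
        return "candidate translocation"
    if same_strand_orientation:
        return "candidate inversion or tandem duplication"
    lo = mean_insert - n_sd * sd_insert
    hi = mean_insert + n_sd * sd_insert
    if mapped_distance > hi:
        return "pairs map too far apart: deletion in sample"
    if mapped_distance < lo:
        return "pairs map too close: insertion in sample"
    return "concordant"
```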


FIGURE 11.5 CNV detection via discordant mate pairs. Mate pair libraries are prepared with a long insert, typically between 2 and 5 kb. The size of the insert is tightly controlled through size selection. Paired-end sequencing is done at either side of the long insert, and paired reads are mapped to a reference genome. Differences from the expected mapping distance of the paired reads, or differences in the direction of the mapped pairs, allow inferences to be made about CNVs. When pairs straddle a deletion in a test sample, they map too far apart in the reference genome; when straddling an insertion, they map too close together; and when straddling a tandem duplication, they map in the wrong order.


Depth of Coverage In a diploid human genome, there are two copies of the majority of genomic regions in every cell. Departure from diploid copy number of regions of the genome results in proportional changes in the relative DNA content within the CNV. Assuming deep enough NGS sequencing coverage, this relative change in DNA content will be reflected in the number of reads mapping to within the CNV. Relative depth-of-coverage methods take advantage of this signature by evaluating the relative number of reads across sequenced regions (Figure 11.6). Depth of coverage is similar in principle to the use of probe intensity comparison for CNV calling using aCGH or SNP arrays, wherein the amount of signal (sequence reads) from a region is proportional to its copy number state. Relative depth of coverage for a region must be interpreted with respect to an external baseline reference. One strategy is to normalize relative depth of coverage to “average” genomic read depth across the same sample. However, because variability across local genomic regions is typically present, comparison of coverage across the region of the test sample versus a reference panel is frequently preferred. Considering the median read depth for

FIGURE 11.6 CNV detection via depth of coverage. (1) Map sample reads (shotgun or targeted) to the reference genome. (2) Generate base- or window-level coverage estimates from the mapped reads; window size depends on the approach. For exome data, coverage is often summed across each exon; for high-coverage targeted experiments, base-level resolution is possible; for heterogeneous DNA from tumors or from cell-free approaches, larger windows are required. (3) Normalize sample coverage for capture and sequence bias; a standard approach is to normalize for GC content and local fluctuations in read depth, with normalization approaches varying by method. (4) Compare normalized sample coverage to expected/reference coverage; comparison against reference coverage derived using the same technical approach is necessary to account for differences in sequencing depth across the genome, and a standard approach is to generate a sample:reference coverage ratio in which the reference is the average coverage across many experiments. (5) Using the expected sample:reference ratio for copy number variant regions, predict quantitative sample copy number. Expected ratios vary across applications: for normal cellular DNA, the CNV signal via depth of coverage is robust and quantitative relative to diploid coverage, with current resolution for exome data at multi-exon CNVs; in cell-free DNA contexts (e.g., cell-free fetal/maternal DNA illustrating trisomy 21, or tumor versus normal illustrating aneuploidy), copy number changes in the tumor/fetal fraction produce a much weaker signal against the background of normal/germline DNA.


Considering the median read depth for a given region across many samples can reduce noise and increase CNV detection sensitivity by virtue of producing a consensus reference that buffers the variability of any single individual [31], although over- or underrepresentation of common CNVs at the population level may potentially distort results. A unique normalization strategy employable for analyzing malignancies is to examine paired tumor-normal samples, allowing normalization of relative read depth against an individual-matched genomic baseline. It should be noted that CNVs detected by any normalization strategy are prone to artifacts in data interpretation; for example, copy number gain in a reference sample would be interpreted as copy number loss in a test sample, and copy number loss in a reference sample would be interpreted as copy number gain in a test sample [53]. A major limitation of depth of coverage, although not unique to this approach, is that CNV calling in repetitive DNA may be impossible if NGS reads cannot be mapped uniquely. Other limitations pertain to bias. Depth of coverage is less reliable in instances where NGS produces highly variable coverage across experiments, such as in high-GC regions [54]. As both hybridization- and amplification-based capture produce biased sampling across the target regions, sensitive and specific comparison of the same DNA region across samples requires that they are prepared and analyzed using the same technologies. In addition to generalized biases in sequence capture, there are also differences dependent on individual sample quality, which can be correlated with intrinsic sequence features including local guanine-cytosine (GC) content, or with the index/barcode sequence used during library preparation. In order to handle this variability, normalization is required across samples, which typically includes corrections for local GC content and overall sample coverage, and accounts for technical considerations including batch and barcode. Additional normalization procedures have also been used, such as correction for local patterns via regional normalization, as well as more sophisticated approaches, including wavelet transformation [55] and singular value decomposition [50]. There are two major strengths of depth-of-coverage methods relative to other approaches. The first is that depth of coverage can be applied for both targeted and whole genome applications, and for single read or paired-end NGS data alike, since the relative depth-of-coverage signature is present regardless of whether the actual CNV boundaries were within the sequenced DNA. As such, relative depth of coverage is currently the primary signature used for CNV detection from targeted or exome sequencing data. The second advantage is that sensitivity and specificity are directly proportional to overall sequencing depth, allowing detection of even short CNVs in nonrepetitive sequence regions if coverage is deep enough. For example, single exon deletions could be detected in screening of BRCA1 for inherited loss-of-function mutations with an average read depth of at least 50-fold coverage per base [30]. In comparison, for lower coverage exomes, CNVs that overlapped an average of three consecutive exons were readily detected using a different algorithm [50]. Another strength of depth-of-coverage approaches is that the expected ratio of coverage relative to a reference can be used to infer quantitative copy number state. In the simplest case of a homozygous deletion, no reads will be present within the CNV.
For a germline hemizygous deletion, a depth-of-coverage ratio of 0.5:1 for sample versus diploid reference is expected for the region spanning the deletion. Amplification events generate higher depth-of-coverage ratios indicative of absolute copy number (e.g., 1.5:1 indicates three copies, 2:1 indicates four copies). Copy number state quantitation can be particularly useful in cancer samples, though the observed ratios are dependent on both tumor purity and intratumoral heterogeneity. Even though high-confidence assignment of copy number may be impossible for such mixed samples, the presence of an amplification or deletion can still often be detected. For lower coverage applications, such as cell-free DNA whole genome sequencing, size limitations may be much greater, permitting robust depth-of-coverage copy number estimation only for large-scale CNVs or whole chromosomes. As the number of sequence reads that can be economically generated for a sample continues to increase, it is likely that relative depth of coverage will be able to consistently detect single exon CNVs across most targeted regions from exome sequence data in a typical experiment. Depth-of-coverage approaches can be combined with other approaches, such as mate pair strategies or allele frequency algorithms, providing complementary evidence for CNVs.
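As a concrete illustration of the ratio logic described above, the following Python sketch normalizes per-window read counts by each sample's median window count and compares the test sample to the median of a reference panel. It is a toy example that omits the GC-content, batch, and barcode corrections discussed above, and all names and numbers are illustrative.

```python
import statistics

def copy_number_ratios(sample_cov, reference_covs):
    """Per-window sample:reference coverage ratios (~1.0 diploid, ~0.5 for a
    heterozygous deletion, ~1.5 for three copies).

    sample_cov: raw read counts per window for the test sample.
    reference_covs: list of per-window count lists from reference samples
        prepared with the same capture/sequencing method.
    """
    def normalize(cov):
        # Median-based depth normalization is robust to a few CNV windows.
        med = statistics.median(cov)
        return [c / med for c in cov]

    s = normalize(sample_cov)
    refs = [normalize(r) for r in reference_covs]
    # Per-window baseline: the median across the reference panel buffers
    # the variability of any single individual.
    baseline = [statistics.median(r[i] for r in refs) for i in range(len(s))]
    return [s[i] / baseline[i] if baseline[i] else float("nan")
            for i in range(len(s))]

# Toy example: window 3 (index 2) carries a heterozygous deletion signature.
sample = [100, 105, 52, 98, 101]
panel = [[100, 100, 100, 100, 100],
         [110, 108, 112, 109, 111],
         [95, 97, 96, 94, 98]]
print([round(r, 2) for r in copy_number_ratios(sample, panel)])
# -> approximately [1.0, 1.05, 0.52, 0.99, 1.0]
```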

SNP Allele Frequency

CNV calling by conventional SNP arrays is based on consideration of allele frequency, where signals from heterozygous SNPs at a particular site are expected to occur in equal ratios, and deviation in favor of one allele suggests CNV gain of that allele [56]. Likewise, monitoring allele frequency at commonly occurring SNPs can be a useful indicator of CNVs or loss of heterozygosity (LOH) in NGS data (Figure 11.7) [57].


[Figure 11.7 schematic: shotgun and targeted NGS reads mapped over heterozygous and homozygous SNPs. A 2:1 allele ratio at heterozygous SNPs indicates duplication, a 1:1 ratio indicates a normal region, and an absence of heterozygous SNPs indicates deletion/LOH.]

FIGURE 11.7 CNV detection via SNP allele frequency. Contiguous runs of adjacent SNPs that are either homozygous or have an allele frequency ratio that departs from 1:1 in the sample reads are suggestive of chromosomal gains or losses. Shotgun sequencing allows more accurate CNV calling by the allele frequency method than targeted approaches because more informative SNPs are captured. Even by shotgun sequencing, multiple informative SNPs are required for accurate CNV detection. Inference of CNVs by allele frequency is often done in combination with an alternative strategy (e.g., depth of coverage) for higher confidence.

While single SNPs are often not informative, the combined information of SNPs present across CNV regions can provide complementary evidence when combined with more sensitive methods such as depth of coverage. Long stretches of homozygous alleles can be indicative of deletion or gene conversion, while ratios that vary from the predicted 1:1 for diploid heterozygous SNPs indicate gains. Allele frequency evidence can be especially useful where coverage is lower than ideal for high-confidence depth-of-coverage CNV detection, and where the detection limit for CNVs is correspondingly large and likely to include common SNPs, such as in lower coverage exome or whole genome sequence data. CNV calling by SNP allele frequency also allows detection of copy-neutral LOH, which is not detectable by depth-of-coverage approaches.
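The following Python sketch illustrates the allele frequency signature in its simplest form: given alt-allele read counts at known polymorphic sites, it flags windows with no heterozygous-looking sites (deletion/LOH candidates) or with ratios shifted toward 2:1 (duplication candidates). The window size, thresholds, and input format are illustrative assumptions; real implementations model the counts statistically and combine this signal with depth of coverage.

```python
def scan_allele_frequency(snps, window=10, het_low=0.4, het_high=0.6):
    """Flag runs of SNPs whose allele fractions depart from diploid expectations.

    snps: list of (position, alt_reads, total_reads) tuples at known
        polymorphic sites ordered along a chromosome. Heterozygous sites
        should cluster near 0.5; a window with none suggests deletion or
        copy-neutral LOH, while fractions shifted toward 1/3 and 2/3
        suggest a duplication (2:1 allele ratio).
    """
    flags = []
    for i in range(0, len(snps) - window + 1, window):
        chunk = snps[i:i + window]
        fractions = [alt / total for _, alt, total in chunk if total > 0]
        hets = [f for f in fractions if het_low <= f <= het_high]
        shifted = [f for f in fractions
                   if 0.1 <= f <= 0.9 and not (het_low <= f <= het_high)]
        start, end = chunk[0][0], chunk[-1][0]
        if not hets and not shifted:
            flags.append((start, end, "no heterozygous SNPs: deletion/LOH candidate"))
        elif len(shifted) > len(hets):
            flags.append((start, end, "skewed allele ratios: duplication candidate"))
    return flags

# Toy example: ten consecutive SNPs with alt-allele fractions near 2:1.
region = [(1_000 + 500 * i, 67, 100) for i in range(10)]
print(scan_allele_frequency(region))
# -> [(1000, 5500, 'skewed allele ratios: duplication candidate')]
```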

Split Reads and Local De Novo Assembly

All the approaches for CNV detection discussed so far rely on the analysis of an indirect signal resulting from CNV events and are affected by sequence uniqueness in the genome and by statistical variability in fragment size (mate pair approaches) or in sequence representation and coverage (relative depth of coverage and allele frequency). In contrast, direct evidence of a copy number change will be present in sequences overlapping the breakpoints of CNVs (Figure 11.8). These reads are referred to as split reads, and can be used to anchor the breakpoints of a CNV [44,58]. Split read mapping requires reads long enough to permit partial read alignment on both sides of the breakpoint, a technical issue with early NGS platforms that has been largely resolved by the longer read lengths offered by current platforms. The major limitation of split read CNV mapping is that it requires at least one of the CNV breakpoints to be physically located within a single sequencing read. For whole genome sequencing data, this property markedly reduces the number of sequencing reads which can effectively contribute to CNV detection (Figure 11.8).


FIGURE 11.8 CNV detection via split reads. Split reads partially map to multiple DNA spans that may or may not be present in the reference. A split read can be divided into the first mapping span (known as the prefix) and the second mapping span (known as the suffix); split reads may also contain other sequences in addition to the mappable spans. In addition, split reads may be made up of one mappable span and one unmappable span, which are referred to as one-end anchored. Short deletions and insertions can be identified using gapped alignment in a local region. Larger deletions, duplications, and other classes of structural variants are identified by mapping the prefix and suffix to divergent genomic regions. Novel structural/sequence variation can be identified using one-end anchored reads. A major limitation of split read methods for CNV detection in targeted sequencing is the requirement that the CNV breakpoint is targeted (gray box). Because CNV breakpoints are often in introns, many or most CNVs will not be detected by split read analysis using common targeted NGS approaches that focus on exons only, including whole exome sequencing (gray box).

For targeted sequencing approaches, the consequences may be more severe: even though a small proportion of CNVs may have breakpoints within exons, the majority of CNVs affecting coding sequence are likely to have breakpoints in intronic or intergenic DNA, regions which are not typically captured by exome or targeted gene sequencing. The greatest advantage of split read mapping over other methods is that it can identify the precise genomic coordinates of a CNV, to the resolution of single nucleotides. Further, split read mapping can be used to identify CNVs that are larger than the maximum indel size achievable by standard alignment algorithms and smaller than those required for detection by depth-of-coverage methods (e.g., tandem duplications in the range of 30–150 bp). Like relative depth of coverage, split read mapping approaches can be applied for both targeted and whole genome sequencing, and for both single and paired-end reads. For these reasons, split read mapping followed by local de novo genome assembly has the potential to become the standard approach for CNV mapping, at least for whole genome sequencing contexts, especially in light of continued read-length improvements in existing NGS platforms and the emergence of technologies capable of very long sequence reads [59] (Table 11.3).
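In practice, split read candidates are often collected from soft-clipped alignments. The sketch below uses the pysam library to tally positions where reads carry long soft-clipped tails, clusters of which mark candidate breakpoints for downstream local realignment or de novo assembly. It assumes a coordinate-sorted, indexed BAM file, and the thresholds, file name, and coordinates are illustrative only.

```python
import pysam
from collections import Counter

SOFT_CLIP = 4  # CIGAR operation code for soft clipping in the BAM specification

def candidate_breakpoints(bam_path, chrom, start, end,
                          min_clip=20, min_mapq=20, min_support=3):
    """Tally reference positions where read alignments end in long soft clips.

    A cluster of reads whose alignments all stop at the same base, with the
    remainder of each read soft-clipped, is a candidate CNV/SV breakpoint;
    the clipped tails can then be realigned or locally assembled to resolve
    the event. The thresholds here are illustrative, not validated cutoffs.
    """
    clips = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            if read.is_unmapped or read.mapping_quality < min_mapq:
                continue
            cigar = read.cigartuples or []
            # Clip at the alignment start: candidate breakpoint at reference_start.
            if cigar and cigar[0][0] == SOFT_CLIP and cigar[0][1] >= min_clip:
                clips[read.reference_start] += 1
            # Clip at the alignment end: candidate breakpoint at reference_end.
            if cigar and cigar[-1][0] == SOFT_CLIP and cigar[-1][1] >= min_clip:
                clips[read.reference_end] += 1
    return [(pos, n) for pos, n in sorted(clips.items()) if n >= min_support]

# Hypothetical usage over a targeted region (file and coordinates invented):
# print(candidate_breakpoints("sample.bam", "chr17", 41_196_312, 41_277_500))
```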

TABLE 11.3 Recommended Integrated CNV Detection Approaches

Targeted sequencing:
• Depth-of-coverage analysis to identify deletions and duplications over targeted regions
• For lower coverage, larger target sets (e.g., exome), use larger windows (e.g., exon-level) to identify multi-exon CNVs
• For small target regions, high coverage, or larger contiguous target regions, use base-level coverage to increase sensitivity for smaller CNVs
• Split reads to identify insertions/deletions and other structural variant (SV) classes where breakpoints are located in target regions

Whole genome:
• Depth of coverage (window size dependent on coverage)
• Allele frequency/LOH scanning
• Split read analysis/de novo assembly of structural variants
• Mate pair approach where possible

Cell-free DNA:
• Mate pair if structural variation is a goal (e.g., tumors)
• Depth of coverage for fetal CNV screening in maternal plasma

DETECTION IN THE CLINIC: LINKING APPLICATION, TECHNICAL APPROACH, AND DETECTION METHODS

Targeted Gene Screening

NGS technology is an ideal platform for screening a large panel of known disease genes for causal CNV variants, offering the potential for comprehensive detection of all types of genomic variation and production of clinically actionable results. The most common uses of candidate gene sequencing are to identify all potentially causative variants of a disease phenotype, typically reflecting inherited or de novo germline mutations, and to profile somatic mutations at disease-relevant loci, such as amplification of oncogenes in tumor samples. The design of targeted candidate gene screening on NGS platforms impacts CNV detection. Design considerations for targeted sequencing include:

1. Whether to use hybridization- or amplification-based sequence selection? While both technologies likely introduce bias, hybridization-based capture has been demonstrated to generate NGS reads suitable for depth-of-coverage analyses. Depth-of-coverage CNV analysis with amplification-based approaches has, at present, not been well documented. This point merits special consideration for clinical labs in selecting an appropriate NGS assay design, as amplification-based approaches are generally inexpensive and fast but may not be an optimal choice if CNV detection is required. Further assessment of CNV detection from amplification-based targeting approaches is necessary to determine whether this approach will be as successful as hybridization for depth-of-coverage methods.

2. Which sequences to target? The potential target for sequencing ranges from coding sequence only, to full gene regions including introns and untranslated regions (UTRs), even extending into flanking intergenic sequence. Limiting targeting to exons and splice junctions decreases the number of bases that are interrogated, allowing for deeper coverage with a fixed number of samples, or higher numbers of multiplexed samples screened at a fixed read depth. However, shorter contiguous stretches may increase across-sample variance, leading to a decrease in CNV call confidence. In addition, profiling noncoding regions can be useful when causal variants are suspected to lie within these intervals; however, the interpretation of noncoding variants is complex and less likely to be clinically actionable at present.

3. How much total sequence to target? The number of loci to screen ranges from a single locus to large panels of genes. Similar to the above, the total number of bases interrogated is important to consider in establishing the trade-off between higher coverage and more bases/samples tested.

4. How many samples to multiplex for cost-effectiveness? Again, the number of samples run in parallel is a variable in the equation determining coverage depth.

5. Is the primary goal sensitive detection of all variants or profiling copy number changes for larger regions? For causative variant discovery, a high premium must be placed on the capability to detect small CNVs, and the primary factor controlling CNV detection sensitivity in targeted NGS applications is depth of coverage. Estimates of necessary sequencing depth for depth-of-coverage analysis are platform-specific and dependent on sequencing biases that contribute to variance in coverage estimates.


As a rough estimate, coverage between 50- and 100-fold appears sufficient for detection of single exon CNVs for many targeted regions (that is, coverage of between 50- and 100-fold of the cell population of interest; in cancer testing, where tumor cells are only a minority of the overall cell population, a higher depth of coverage is therefore required to ensure sufficient sensitivity of detection of CNVs in the neoplastic cells). It is useful to emphasize that an assay with overall average coverage of 500-fold or even 1000-fold will usually have multiple exons in which coverage is less than 50–100-fold. In targets where GC content is high, or for duplicated regions such as segmental duplications, CNV calling and absolute copy number estimation via relative depth of coverage may be less sensitive. While split read mapping is unlikely to identify breakpoints for large CNVs, the approach can bridge between alignment-based indel detection and depth-of-coverage-based CNV calling, identifying copy number changes that are larger than ~30 bp but smaller than a single exon. For profiling somatic copy number changes in cancer, sensitive detection of small germline CNVs is often less important than detection of amplifications and deletions encompassing cancer-relevant genes. For this purpose, overall depth of coverage can be lower and detection algorithms may require different tuning parameters, such as averaging coverage across a genomic window to reduce noise. With proper design and incorporation of depth of coverage with split read analysis methods, comprehensive CNV detection is an attainable goal for targeted sequencing applications.
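Because overall average coverage can mask poorly covered exons, a simple per-exon QC step like the Python sketch below is useful for documenting where single exon CNV calling is and is not reliable. The 50-fold threshold follows the rough estimate above, the cell-fraction adjustment is an approximate heuristic, and the exon names and depths are invented.

```python
def flag_low_coverage_exons(exon_depths, min_depth=50):
    """Report exons below the depth needed for reliable single exon CNV calling.

    exon_depths: dict of exon name -> mean per-base depth, e.g., computed
        from per-base coverage over the BED intervals of the assay.
    """
    return {exon: depth for exon, depth in exon_depths.items() if depth < min_depth}

def required_overall_depth(min_depth_of_interest=50, cell_fraction=1.0):
    """Approximate overall depth needed when the cells of interest (e.g.,
    tumor cells) contribute only a fraction of the sequenced DNA."""
    return min_depth_of_interest / cell_fraction

# Toy example: an assay averaging >500x can still have underperforming exons.
depths = {"MSH2_ex1": 610, "MSH2_ex5": 38, "PMS2_ex11": 44, "MLH1_ex12": 525}
print(flag_low_coverage_exons(depths))   # -> {'MSH2_ex5': 38, 'PMS2_ex11': 44}
print(required_overall_depth(50, 0.2))   # a 20% tumor fraction -> 250x overall
```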

Exome Sequencing for Unbiased Coding Variant Discovery

In contrast to custom-targeted sequencing panels, exome sequencing is more standardized and is performed with the goal of unbiased variant discovery. Despite these differences, the technical considerations for CNV discovery by exome sequencing largely overlap those discussed above for general targeted sequencing approaches. A relatively large proportion of the genome is targeted by exome sequencing (about 3–6%, or 30–60 Mb), resulting in lower coverage than is feasible using smaller target panels. As such, the detection sensitivity decreases and the minimum reliably detectable CNV size increases. Nonetheless, read depths currently achieved through routine exome sequencing are sufficient for robust detection of larger CNVs, with resolution comparable to that of aCGH. Multiple methods are now available for specialized CNV analysis of exome data, and CNV calling is quickly becoming a standardized component of exome analysis. Numerous methods relying on depth of coverage for detection of CNVs have been released, with some tailored to different clinical applications including germline CNV screening or tumor-normal comparisons of somatic copy number changes. As described above, depth-of-coverage approaches can be combined with split read mapping for comprehensive screening. Increases in coverage or changes in design could drive the sensitivity of exome-based CNV detection even higher in the near future.

Cell-Free DNA

Two main clinical applications exist at present for cell-free DNA: interrogation of fetal DNA from a mother's plasma, or of tumor DNA from a body fluid of a cancer patient. In the context of fetal cell-free DNA in maternal plasma, whole genome or targeted capture NGS approaches are used to profile CNVs via relative depth of coverage. The fetal contribution to the total cell-free DNA in maternal plasma is small but not insignificant, typically detectable at around 3% of total cell-free DNA by 5 weeks after conception and increasing during pregnancy to an average of ~13%, although individual cases may vary widely [60–62]. Because it represents a relatively rare population, the signal differences between normal diploid DNA and regions affected by CNV in the fetus may be subtle. However, with sufficient depth of coverage obtained by NGS, even subtle differences produce a robust signal, with current detection limits for CNVs under one megabase [63]. Normalization is still required, as GC content and sources of local coverage bias may influence read depth as in other depth-of-coverage applications. While stand-alone methods are not yet published for this relatively new application, normalized relative coverage averaged across genomic regions (e.g., 100 kb bins) has been compared to identify regions where coverage significantly diverges from the genome average [63]. Somatic cell CNV and structural variant profiling using cell-free tumor DNA in plasma or other body fluids has the potential to be used for early detection of malignancies. In this application, somatic cell variants present in cell-free DNA stand out from the background diploid germline sequence variation. Clinical care could be informed by the early detection of cancer cells or the reappearance of cancer cells after disease remission [43,64] and guided by identification of individual driver variants, such as amplification of specific oncogenes or detection of actionable translocation events.


If full structural variant profiling is desired, mate pair mapping followed by depth of coverage and allelic variant frequency analysis can be used to identify balanced structural variants, imbalanced gains and losses, and LOH. Similar to cell-free fetal DNA sequencing, stand-alone analysis tools combining these different approaches are not yet available, and CNV analysis for this purpose currently requires custom-built analysis pipelines.
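For the fetal aneuploidy application described above, the depth-of-coverage comparison is often expressed as a z-score of each chromosome's read fraction against a panel of known-euploid samples. The Python sketch below shows that calculation in its barest form, deliberately omitting the GC and bin-level normalizations a production pipeline would apply; the counts and sample structure are invented.

```python
import statistics

def chromosome_zscores(sample_counts, euploid_reference):
    """Z-scores of per-chromosome read fractions against euploid controls.

    sample_counts: dict of chrom -> mapped read count for the test sample.
    euploid_reference: list of dicts with the same keys from known-euploid
        samples sequenced with the same protocol.
    """
    def fractions(counts):
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()}

    sample_frac = fractions(sample_counts)
    zscores = {}
    for chrom in sample_counts:
        ref = [fractions(r)[chrom] for r in euploid_reference]
        mu, sd = statistics.mean(ref), statistics.stdev(ref)
        zscores[chrom] = (sample_frac[chrom] - mu) / sd if sd > 0 else float("nan")
    return zscores

# Toy example (three pseudo-chromosomes, three euploid controls): the chr21
# read fraction in the test sample is modestly elevated, giving a high z-score.
refs = [{"chr1": 1000, "chr2": 800, "chr21": 200},
        {"chr1": 1010, "chr2": 795, "chr21": 198},
        {"chr1": 990, "chr2": 805, "chr21": 202}]
test = {"chr1": 1000, "chr2": 800, "chr21": 212}
print({c: round(z, 1) for c, z in chromosome_zscores(test, refs).items()})
```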

Whole Genome Sequencing and Emerging Technologies

Whole genome sequencing via shotgun NGS approaches permits identification of CNVs using a combination of all the methods described above. Primary design considerations relevant to CNV discovery by this approach include:

1. Whether to use a mate pair library approach, and what size fragment to target? Incorporation of mate pair libraries requires additional DNA preparation and should not be used alone if reliable coverage across the genome is required, due to biased sampling and mapping as a function of library complexity and read length. However, mate pair strategies allow mapping of all classes of structural variation, can provide ancillary evidence of CNVs detected using read depth alone, and can aid de novo assembly in regions divergent from the reference genome. When considering mate pair fragment length, use of longer fragments allows mapping across larger repeat elements but may result in reduced library complexity (standard mate pair fragment size is between 2 and 5 kb). Mate pair library preparation requires much more input DNA than shotgun approaches, so it can be wasteful when source DNA is limited. For the specific application of structural variant mapping, low coverage mate pair mapping alone may be sufficient, and thus warrants careful consideration of the available material.

2. How deep to sequence? Unless the primary goal is mapping large structural variants, for which mate pair libraries can be used alone, coverage must be sufficient to allow calling DNA sequence variants. Single nucleotide and indel detection requires much lower overall coverage than read depth-based identification of small CNVs, and for most applications it is likely preferable to aim for a coverage that confidently calls single nucleotide variants and indels, while accepting lower resolution CNV detection by read depth-based analysis. Even at lower coverage, read depth can be an effective tool for discovery of CNVs as small as 1 kb [65] in whole genome sequence data. Furthermore, a combination of evidence from read depth, split reads, allele frequency, and discordant mate pairs (if available) increases the reliability of CNV detection for whole genome sequencing [10] (see the sketch following this list).

3. How long a read length to sequence, and whether to use paired-end methods? Increased read length and paired-end sequencing will both generally improve CNV detection, given the increased mappability of paired ends and of split reads. With regard to depth-of-coverage methods, read length is not an important factor beyond general alignment considerations. For mate pair reads, technical factors may constrain read length, and read length is less important for that strategy overall.

In a recent analysis of population-level genome data, all approaches described above were incorporated, and multiple algorithms for each method were used [10]. This level of methodological and bioinformatics pipeline complexity is currently prohibitive to implement for most clinical labs. However, as personal genome sequencing becomes more prevalent, commercially available robust integrated pipelines for CNV detection will likely be developed, with each CNV prediction supported by multiple types of evidence. In addition, emerging sequencing technologies will eventually offer much longer reads from even single DNA molecules [59]. These future developments will transform the preferred approaches for CNV detection, allowing for robust sequence-based characterization of CNVs and accurate de novo assembly of CNV regions.
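A minimal version of such evidence integration is interval intersection: keep or upweight CNV calls from a primary signal (e.g., read depth) that are corroborated by an orthogonal signal at some reciprocal-overlap threshold. The Python sketch below illustrates this; the 50% reciprocal-overlap criterion is a commonly used but adjustable convention, and all call sets shown are invented.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap fraction between intervals a=(start, end) and b."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))

def annotate_support(primary_calls, other_callsets, min_ro=0.5):
    """Annotate primary (e.g., read-depth) CNV calls with orthogonal support.

    primary_calls: list of (chrom, start, end) tuples.
    other_callsets: dict of method name -> list of (chrom, start, end) calls,
        e.g., from split read and discordant mate pair analyses.
    """
    annotated = []
    for call in primary_calls:
        support = [name for name, calls in other_callsets.items()
                   if any(call[0] == oc[0] and
                          reciprocal_overlap(call[1:], oc[1:]) >= min_ro
                          for oc in calls)]
        annotated.append((call, support))
    return annotated

# Toy example: the chr5 deletion is corroborated by two other signals.
read_depth = [("chr5", 1_000_000, 1_250_000), ("chrX", 500_000, 520_000)]
others = {"split_read": [("chr5", 1_001_200, 1_249_100)],
          "mate_pair": [("chr5", 998_000, 1_252_000)]}
for call, support in annotate_support(read_depth, others):
    print(call, "supported by:", support or ["read depth only"])
```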
As CNV detection methodology continues to improve and mature, the main future challenge for diagnostic laboratories is likely to be related to clinical interpretation. The following sections describe strategies for determining the accuracy of CNV calling methods, mutation validation, and functional characterization, toward the goal of evaluating clinical performance of CNV detection and translating CNVs detected in patient material to actionable care.

REFERENCE STANDARDS

There are currently no universally accepted reference standards for benchmarking CNV calls using NGS data, aCGH, or SNP array-based platforms [53,66,67], posing obvious challenges in establishing standardized approaches to assay validation.


Further, the choice of platforms, the methods and samples used for establishing a baseline, and the analysis algorithms used to interrogate potential reference standards can result in substantial variability in both the CNVs that are identified and the size of the CNVs resolvable [53,67], making comparison of experimental results with published data sometimes difficult. For clinical purposes, it would be ideal to use a single sample as a reference that is highly cross-validated and characterized by many different platforms. This approach would allow not only for adequate method validation with a minimal investment of time and resources but would also permit routine inclusion of the sample as an internal quality control during clinical CNV testing. Studies examining the relative performance of CNV detection have frequently used one or more HapMap reference individuals to evaluate results [68]. An important practical consideration is that immortalized cell lines have been established from these individuals, and that both cell lines and purified DNA are available from a centralized repository (Coriell). Therefore, there exists an easily obtained and (theoretically) limitless supply of material for use, although sequential passaging of cell lines and its potential effects on the genome warrant further investigation. HapMap sample NA12878 has been proposed as a universal reference standard for evaluation of NGS data, including both sequence variants and structural variants, since there presently exist numerous publicly available whole genome sequencing and targeted gene sequencing studies which have interrogated NA12878 using a variety of NGS platforms. Based on these data, and the likelihood of additional data sets in the future, the Genome in a Bottle Consortium has advocated validation and distribution of NA12878 as National Institute of Standards and Technology (NIST) reference material, which at the time of writing was pending Institutional Review Board (IRB) approval (http://genomeinabottle.org/blog-entry/post-ashg-update-genome-bottle). Thus it is possible that this particular HapMap sample will one day serve as a singular reference standard, at least in the context of clinical use. Acknowledging the challenges inherent to calling CNVs, two data sets, both of which include NA12878, have recurrently emerged in the literature as unofficial "gold standard" CNV calls against which methods can be compared.

Genome Structural Variation Consortium Data Set

Forty HapMap individuals have been characterized using ultrahigh-density aCGH [1]. These studies employed a set of NimbleGen arrays tiling across the nonrepetitive portion of the genome with approximately 42 million long oligo probes (50–75 bp) distributed across 20 arrays, with a median spacing of about 56 bp across the genome [1,69]. This unusual experimental design has allowed CNVs in the HapMap samples to be cataloged to a resolution of approximately 500 bp, considerably higher resolution than previous array-based studies [20]. The absolute copy number of each genotyped CNV has also been inferred from these data. A subset of the individual CNV calls made in this study has been independently validated by multiple array-based data sets generated by independent laboratories, as well as experimentally using a combination of quantitative real-time PCR (qPCR), aCGH, and other methods [1]. It should be noted that these studies were controlled using a specific HapMap sample (NA10851), so that all CNVs are called in relation to that individual. As such, copy number gain in the control would be interpreted as copy number loss in a test sample, and copy number loss in the control would be interpreted as copy number gain in a test sample [53]. Regardless, the majority of CNVs predicted in this project were successfully validated using alternative methods [1], suggesting that artifacts resulting from this normalization approach are not limiting.

1000 Genomes Project Structural Variant Map Data Set

Sequence-based CNV detection has been performed using whole genome sequence information from 179 unrelated HapMap individuals generated for the July 2010 data release of the 1000 Genomes Project [10,12]. In this data set, CNVs were called using 19 different computational approaches, utilizing a combination of read depth, split read, and paired-end sequencing data. Events were mapped to single base-pair resolution, and CNVs of 50 bp or greater were detectable. Putative CNV calls were subjected to extensive experimental validation using aCGH and/or PCR. The validated CNV calls from this data set have identified 3997 unique CNVs in NA12878 [1]. Not every CNV predicted in the above studies could be validated experimentally, and it is therefore difficult to assess the accuracy of the data sets overall. However, validated CNV calls from the 1000 Genomes Project Structural Variant Map are generally considered to be the most accurate set available to date [10,69]. As a general rule of thumb, experimentally validated CNVs from either data set can be considered trustworthy, although the potential for inaccuracies should be considered when discrepant results arise.
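When such a data set is used to benchmark a pipeline, a common (though simplistic) metric is the fraction of validated gold standard CNVs recovered at a 50% reciprocal-overlap criterion, as in the Python sketch below. The intervals and threshold are illustrative, and real validations also assess false positive calls and breakpoint accuracy.

```python
def sensitivity_vs_truth(calls, truth, min_ro=0.5):
    """Fraction of validated gold standard CNVs recovered by a callset.

    calls, truth: lists of (chrom, start, end) intervals. A truth CNV counts
    as detected when some call achieves the reciprocal-overlap threshold.
    """
    def ro(a, b):
        if a[0] != b[0]:
            return 0.0
        inter = min(a[2], b[2]) - max(a[1], b[1])
        if inter <= 0:
            return 0.0
        return min(inter / (a[2] - a[1]), inter / (b[2] - b[1]))

    detected = sum(1 for t in truth if any(ro(c, t) >= min_ro for c in calls))
    return detected / len(truth) if truth else float("nan")

# Toy example: two of three "validated" CNVs are recovered by the callset.
truth = [("chr1", 100_000, 150_000), ("chr2", 5_000, 9_000), ("chr3", 1_000, 2_000)]
calls = [("chr1", 101_000, 149_000), ("chr2", 5_500, 8_800)]
print(round(sensitivity_vs_truth(calls, truth), 2))  # -> 0.67
```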


ORTHOGONAL CNV VALIDATION

Depending on the level of analytical certainty that is needed, it is typically prudent to verify NGS CNV calls using an alternative technology. When considering orthogonal validation, it is important to understand the strengths and limitations of the particular NGS approach being used with regard to CNV detection. Relevant questions to ask are: What is the smallest size of CNV that can be reliably detected in a given genomic region by my NGS method? What is the smallest magnitude of change or lowest quality score metric that reflects a reliable CNV call for my assay? Is the variant in a GC-rich region where CNV artifacts using depth-of-coverage analysis are more common? Is the CNV in a duplicated or highly homologous region of the genome? Is the CNV present in population databases [70] at a frequency that decreases the likelihood of it being a large-effect/high-penetrance clinically actionable mutation? Is the CNV a previously characterized pathogenic variant that is consistent with the clinical presentation? These questions are particularly salient for clinical labs, where accuracy is paramount but rapid turnaround time of results is also an important consideration. Common technologies used for orthogonal CNV validation are qPCR, MLPA, interphase FISH, and aCGH. The optimal choice depends on the size and scope of the CNV(s) being validated. For predicted CNVs of small size, such as single exon deletions, targeted qPCR is a good option because primers can be designed to sensitively interrogate copy number in a region as small as 50 bp. Several commercial options exist for qPCR-based CNV detection, which include software to facilitate accurate determination of copy number from very small changes in cycle threshold (Ct) values by normalizing to a reference gene and a diploid (negative) control. MLPA is a common CNV detection method in clinical labs that is well suited for CNVs that span from two exons to an entire gene. The biggest limitation of MLPA is that, unlike qPCR, it is challenging to rapidly develop a new assay if an "off-the-shelf" test for the gene or region of interest does not already exist. Interphase FISH is appropriate for validation of larger CNVs, especially CNVs >100 kb, and FISH is commonly employed for validation of gene amplification in tumor samples. For global validation of CNV calls by NGS, aCGH or SNP microarrays are the best options; array-based approaches may be appropriate even for targeted validation of single CNVs, especially if the array technology is already well established in the laboratory. An advantage of this approach for clinical labs is that a different reference assay does not need to be validated for each new CNV identified by NGS.
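The qPCR normalization mentioned above is typically computed with the comparative Ct (delta-delta-Ct) method: the target assay is normalized to a reference gene in both the test sample and a diploid control, and each cycle of difference corresponds to a twofold change under the usual assumption of roughly 100% amplification efficiency. A minimal Python version follows; the Ct values are invented.

```python
def qpcr_copy_number(ct_target_test, ct_ref_test,
                     ct_target_ctrl, ct_ref_ctrl, control_copies=2):
    """Estimate copy number from qPCR Ct values (comparative Ct method).

    The target assay is normalized to a reference gene in both the test
    sample and a known-diploid control; each Ct unit of difference is a
    twofold change, assuming ~100% amplification efficiency.
    """
    ddct = (ct_target_test - ct_ref_test) - (ct_target_ctrl - ct_ref_ctrl)
    return control_copies * 2 ** (-ddct)

# Toy example: the target crosses threshold one cycle later in the test
# sample than expected, consistent with a single-copy (heterozygous) deletion.
print(qpcr_copy_number(26.0, 24.0, 25.0, 24.0))  # -> 1.0
```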

SUMMARY AND CONCLUSION

CNVs remain a challenging class of genomic variation to classify accurately by NGS, but several effective complementary approaches are now available. Depending on the genomic library preparation method, mate pair, depth of coverage, and split read analyses may be used to correctly infer most large deletions and duplications, and are especially robust when used in conjunction. Accuracy is critically dependent on high sequencing depth, and also improves with longer sequencing read lengths and greater capture density. When assay-specific limitations are thoroughly understood, and orthogonal validation is appropriately applied, NGS-based CNV detection is ready for use in clinical diagnostic settings. As both the sequencing technologies and bioinformatics algorithms continue to improve, NGS is likely to quickly become a gold standard method for CNV detection.

References

[1] Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, et al. Origins and functional impact of copy number variation in the human genome. Nature 2010;464(7289):704–12.
[2] Girirajan S, Campbell CD, Eichler EE. Human copy number variation and complex genetic disease. Annu Rev Genet 2011;45:203–26.
[3] Ionita-Laza I, Rogers AJ, Lange C, Raby BA, Lee C. Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis. Genomics 2009;93(1):22–6.
[4] Shelling AN, Ferguson LR. Genetic variation in human disease and a new role for copy number variants. Mutat Res 2007;622(1–2):33–41.
[5] Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet 2009;10(8):551–64.
[6] Kim PM, Lam HY, Urban AE, Korbel JO, Affourtit J, Grubert F, et al. Analysis of copy number variants and segmental duplications in the human genome: evidence for a change in the process of formation in recent evolutionary history. Genome Res 2008;18(12):1865–74.
[7] De Brakeleer S, De Greve J, Lissens W, Teugels E. Systematic detection of pathogenic Alu element insertions in NGS-based diagnostic screens: the BRCA1/BRCA2 example. Hum Mutat 2013.
[8] Lupski JR. Genomic rearrangements and sporadic disease. Nat Genet 2007;39(7 Suppl.):S43–7.
[9] Turner DJ, Miretti M, Rajan D, Fiegler H, Carter NP, Blayney ML, et al. Germline rates of de novo meiotic deletions and duplications causing several genomic disorders. Nat Genet 2008;40(1):90–5.


[10] Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, et al. Mapping copy number variation by population-scale genome sequencing. Nature 2011;470(7332):59–65.
[11] Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, et al. Diversity of human copy number variation and multicopy genes. Science 2010;330(6004):641–6.
[12] Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467(7319):1061–73.
[13] Girirajan S, Rosenfeld JA, Cooper GM, Antonacci F, Siswara P, Itsara A, et al. A recurrent 16p12.1 microdeletion supports a two-hit model for severe developmental delay. Nat Genet 2010;42(3):203–9.
[14] Wagner A, Barrows A, Wijnen JT, van der Klift H, Franken PF, Verkuijlen P, et al. Molecular analysis of hereditary nonpolyposis colorectal cancer in the United States: high mutation detection rate among clinically selected families and characterization of an American founder genomic deletion of the MSH2 gene. Am J Hum Genet 2003;72(5):1088–100.
[15] Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 2008;320(5875):539–43.
[16] McDonald-McGinn DM, Emanuel BS, Zackai EH. 22q11.2 deletion syndrome. In: Pagon RA, Bird TD, Dolan CR, Stephens K, Adam MP, editors. GeneReviews. Seattle, WA: University of Washington; 1993.
[17] Yunis JJ, Chandler ME. High-resolution chromosome analysis in clinical medicine. Prog Clin Pathol 1978;7:267–88.
[18] Trask BJ. Fluorescence in situ hybridization: applications in cytogenetics and gene mapping. Trends Genet 1991;7(5):149–54.
[19] Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, et al. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 1992;258(5083):818–21.
[20] Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res 2009;19(9):1586–92.
[21] Kang JU, Koo SH. Evolving applications of microarray technology in postnatal diagnosis (review). Int J Mol Med 2012;30(2):223–8.
[22] Craddock KJ, Lam WL, Tsao MS. Applications of array-CGH for lung cancer. Methods Mol Biol 2013;973:297–324.
[23] Chou LS, Lyon E, Mao R. Molecular diagnosis utility of multiplex ligation-dependent probe amplification. Expert Opin Med Diagn 2008;2(4):373–85.
[24] Walsh T, Lee MK, Casadei S, Thornton AM, Stray SM, Pennil C, et al. Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proc Natl Acad Sci USA 2010;107(28):12629–33.
[25] Pritchard CC, Smith C, Salipante SJ, Lee MK, Thornton AM, Nord AS, et al. ColoSeq provides comprehensive Lynch and polyposis syndrome mutational analysis using massively parallel sequencing. J Mol Diagn 2012;14(4):357–66.
[26] Shearer AE, DeLuca AP, Hildebrand MS, Taylor KR, Gurrola II J, Scherer S, et al. Comprehensive genetic testing for hereditary hearing loss using massively parallel sequencing. Proc Natl Acad Sci USA 2010;107(49):21104–9.
[27] Wagle N, Berger MF, Davis MJ, Blumenstiel B, Defelice M, Pochanard P, et al. High-throughput detection of actionable genomic alterations in clinical tumor samples by targeted, massively parallel sequencing. Cancer Discov 2012;2(1):82–93.
[28] Horn S. Target enrichment via DNA hybridization capture. Methods Mol Biol 2012;840:177–88.
[29] Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet 2007;39(12):1522–7.
[30] Walsh T, Casadei S, Lee MK, Pennil CC, Nord AS, Thornton AM, et al. Mutations in 12 genes for inherited ovarian, fallopian tube, and peritoneal carcinoma identified by massively parallel sequencing. Proc Natl Acad Sci USA 2011;108(44):18032–7.
[31] Nord AS, Lee M, King MC, Walsh T. Accurate and exact CNV identification from targeted high-throughput sequence data. BMC Genomics 2011;12:184.
[32] Ng SB, Bigham AW, Buckingham KJ, Hannibal MC, McMillin MJ, Gildersleeve HI, et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet 2010;42(9):790–3.
[33] O'Roak BJ, Deriziotis P, Lee C, Vives L, Schwartz JJ, Girirajan S, et al. Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nat Genet 2011;43(6):585–9.
[34] Barbieri CE, Baca SC, Lawrence MS, Demichelis F, Blattner M, Theurillat JP, et al. Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer. Nat Genet 2012;44(6):685–9.
[35] Fromer M, Moran JL, Chambert K, Banks E, Bergen SE, Ruderfer DM, et al. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am J Hum Genet 2012;91(4):597–607.
[36] Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 2011;12(2):R18.
[37] Haack TB, Haberberger B, Frisch EM, Wieland T, Iuso A, Gorza M, et al. Molecular diagnosis in mitochondrial complex I deficiency using exome sequencing. J Med Genet 2012;49(4):277–83.
[38] Worthey EA, Mayer AN, Syverson GD, Helbling D, Bonacci BB, Decker B, et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med 2011;13(3):255–62.
[39] Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491(7422):56–65.
[40] Shendure J, Lieberman Aiden E. The expanding scope of DNA sequencing. Nat Biotechnol 2012;30(11):1084–94.
[41] Ehrich M, Deciu C, Zwiefelhofer T, Tynan JA, Cagasan L, Tim R, et al. Noninvasive detection of fetal trisomy 21 by sequencing of DNA in maternal blood: a study in a clinical setting. Am J Obstet Gynecol 2011;204(3):205.e1–205.e11.
[42] Chiu RW, Akolekar R, Zheng YW, Leung TY, Sun H, Chan KC, et al. Non-invasive prenatal assessment of trisomy 21 by multiplexed maternal plasma DNA sequencing: large scale validity study. BMJ 2011;342:c7401.
[43] Leary RJ, Sausen M, Kinde I, Papadopoulos N, Carpten JD, Craig D, et al. Detection of chromosomal alterations in the circulation of cancer patients with whole-genome sequencing. Sci Transl Med 2012;4(162):162ra54.
[44] Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 2009;6(9):677–81.


[45] Mills RE, Luttig CT, Larkins CE, Beauchamp A, Tsui C, Pittard WS, et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res 2006;16(9):1182–90.
[46] Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 2011;21(6):974–84.
[47] Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet 2009;41(10):1061–7.
[48] Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods 2009;6(1):99–103.
[49] Li J, Lupat R, Amarasinghe KC, Thompson ER, Doyle MA, Ryland GL, et al. CONTRA: copy number analysis for targeted resequencing. Bioinformatics 2012;28(10):1307–13.
[50] Krumm N, Sudmant PH, Ko A, O'Roak BJ, Malig M, Coe BP, et al. Copy number variation detection and genotyping from exome sequence data. Genome Res 2012;22(8):1525–32.
[51] Medvedev P, Fiume M, Dzamba M, Smith T, Brudno M. Detecting copy number variation with mated short reads. Genome Res 2010;20(11):1613–22.
[52] Murphy SJ, Cheville JC, Zarei S, Johnson SH, Sikkink RA, Kosari F, et al. Mate pair sequencing of whole-genome-amplified DNA following laser capture microdissection of prostate cancer. DNA Res 2012;19(5):395–406.
[53] Scherer SW, Lee C, Birney E, Altshuler DM, Eichler EE, Carter NP, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007;39(7 Suppl.):S7–15.
[54] Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res 2012;40(10):e72.
[55] Amarasinghe KC, Li J, Halgamuge SK. CoNVEX: copy number variation estimation in exome sequencing data using HMM. BMC Bioinformatics 2013;14(Suppl. 2):S2.
[56] Wang K, Bucan M. Copy number variation detection via high-density SNP genotyping. CSH Protoc 2008;2008:16.
[57] Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008;40(10):1253–60.
[58] Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods 2011;8(8):652–4.
[59] Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T, et al. The potential and challenges of nanopore sequencing. Nat Biotechnol 2008;26(10):1146–53.
[60] Guibert J, Benachi A, Grebille AG, Ernault P, Zorn JR, Costa JM. Kinetics of SRY gene appearance in maternal serum: detection by real time PCR in early pregnancy after assisted reproductive technique. Hum Reprod 2003;18(8):1733–6.
[61] Lo YM, Tein MS, Lau TK, Haines CJ, Leung TN, Poon PM, et al. Quantitative analysis of fetal DNA in maternal plasma and serum: implications for noninvasive prenatal diagnosis. Am J Hum Genet 1998;62(4):768–75.
[62] Kitzman JO, Snyder MW, Ventura M, Lewis AP, Qiu R, Simmons LE, et al. Noninvasive whole-genome sequencing of a human fetus. Sci Transl Med 2012;4(137):137ra76.
[63] Srinivasan A, Bianchi DW, Huang H, Sehnert AJ, Rava RP. Noninvasive detection of fetal subchromosome abnormalities via deep sequencing of maternal plasma. Am J Hum Genet 2013;92(2):167–76.
[64] Dawson SJ, Tsui DW, Murtaza M, Biggs H, Rueda OM, Chin SF, et al. Analysis of circulating tumor DNA to monitor metastatic breast cancer. N Engl J Med 2013;368(13):1199–209.
[65] Szatkiewicz JP, Wang W, Sullivan PF, Sun W. Improving detection of copy-number variation by simultaneous bias correction and read-depth segmentation. Nucleic Acids Res 2013;41(3):1519–32.
[66] Zheng X, Shaffer JR, McHugh CP, Laurie CC, Feenstra B, Melbye M, et al. Using family data as a verification standard to evaluate copy number variation calling strategies for genetic association studies. Genet Epidemiol 2012;36(3):253–62.
[67] Pinto D, Darvishi K, Shi X, Rajan D, Rigler D, Fitzgerald T, et al. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat Biotechnol 2011;29(6):512–20.
[68] The International HapMap Consortium. A haplotype map of the human genome. Nature 2005;437(7063):1299–320.
[69] Haraksingh RR, Abyzov A, Gerstein M, Urban AE, Snyder M. Genome-wide mapping of copy number variation in humans: comparative analysis of high resolution array platforms. PLoS One 2011;6(11):e27859.
[70] Pinto D, Marshall C, Feuk L, Scherer SW. Copy-number variation in control population cohorts. Hum Mol Genet 2007;16(Spec No. 2):R168–73.

Glossary

Allele frequency analysis  A method of inferring CNVs from NGS data using imbalances in the ratios of SNP alleles. For example, long contiguous stretches of homozygous SNPs can indicate a genomic deletion.

Local de novo assembly  De novo assembly refers to creating a genome sequence without a full reference genome as a scaffold. CNVs can sometimes be characterized using de novo assembly in a specific "local" genomic region when split read sequence data crossing the breakpoints of the event are available.

Relative depth-of-coverage analysis  A method of CNV detection that compares the number of sequencing reads mapped to a genomic region in a sample to the number of sequencing reads mapped in a control or an averaged group of controls. Deletions have lower relative depth of coverage compared to the control, while insertions or duplications have higher depth of coverage. This method is among the most widely used for NGS CNV detection.


Mate pair  Mate pair refers to a specialized method of genomic library preparation in which genomic DNA is size-selected to a specified length (typically between 2 and 5 kb) and cloned into a circular vector. Paired-end sequencing of a mate pair library provides linked sequence data at either end of the long insert. When the paired sequences are mapped to a reference genome, CNVs can be inferred if the mapping distance differs from what is expected.

Paired-end sequencing  A common NGS method in which the two ends of a DNA insert are sequenced in a manner such that the two sequences are known to have come from the same DNA fragment. Paired-end reads allow powerful inferences for CNV detection, particularly when combined with mate pair libraries.

Prefix  The first mapping span of a split read.

Shotgun sequencing  A method in which DNA is fragmented and sequenced without further target enrichment. Shotgun sequencing is often used for whole genome sequencing.

Split read analysis  A method for CNV/structural variant detection in which a sequencing read maps to two different places in the genome.

Suffix  The second mapping span of a split read.

List of Acronyms and Abbreviations

aCGH  Array comparative genomic hybridization, also known as genomic microarray
CNV  Copy number variant
FISH  Fluorescent in situ hybridization
Indel  Small insertion and deletion variants
LOH  Loss of heterozygosity
MLPA  Multiplex ligation-dependent probe amplification
NAHR  Nonallelic homologous recombination
NGS  Next-generation sequencing, also known as massively parallel sequencing
NHEJ  Nonhomologous end joining
SNP  Single nucleotide polymorphism



SECTION III

INTERPRETATION


CHAPTER 12

Reference Databases for Disease Associations

Wendy S. Rubinstein¹, Deanna M. Church² and Donna R. Maglott¹

¹National Center for Biotechnology Information/National Library of Medicine/National Institutes of Health, Bethesda, MD, USA
²Personalis, Inc., Menlo Park, CA, USA

OUTLINE

Introduction

Identification and Validation of Human Variation
  Methods for Identifying Human Variation
    Targeting Known Sequences
    Sequence Discovery
    Sequencing and Identifying Differences Relative to a Reference
  Understanding a Reference Assembly
  Validation of Variant Calls

Identification of Common Variation
  Overview
  HapMap
  1000 Genomes Project
  NHLBI-ESP

Interpretation of Common Variation
  Determining Association Between Phenotype and Common Variation
    Health-Related Phenotypes (GWAS)
    Molecular Phenotypes (Gene Expression/GTex)
  Inferring Lack of Causality Based on Allele Frequency

Defining Diseases and Phenotypes
  UMLS and MedGen
  OMIM
  Orphanet
  Human Phenotype Ontology
  Combining Phenotypic Features

Representation of Variation Data in Public Databases
  Archives of Variants
    dbSNP
    dbVar
  Archives of Variant/Phenotype Relationships
    dbGaP
    ClinVar
    OMIM
    DECIPHER
    COSMIC
    PubMed/PubMedCentral

Data Access and Interpretation
  By One or More Genomic Locations
  By Gene
  By Condition or Reported Phenotype
  By Attributes of a Particular Variant
  Variants in the ACMG Incidental Findings Gene List

Determination of Variant Pathogenicity
  Expert Panels and Professional Guidelines
  ClinGen: A Centralized Resource of Clinically Annotated Genes and Variants
  Standard Setting for Clinical Grade Databases

Global Data Sharing
  GA4GH and the Beacon Project

Conclusion

References

List of Acronyms and Abbreviations


KEY CONCEPTS

• The approach of testing specific genes and variants based on pre-established evidence of causation is shifting toward the use of NGS techniques that can provide sensitivity at lower cost. However, the detection of many novel sequence variants requires automated approaches to facilitate interpretation.

• Interpretation of the clinical significance of sequence variation requires evaluation of the relevant evidence, but such evidence may be difficult to gather or altogether lacking. Centralized resources such as those at the National Center for Biotechnology Information (NCBI) support many reference databases that are designed for representing variation and for the reporting of relationships to phenotypes. The major databases that archive submissions include the Database of Short Genetic Variations (dbSNP) and the Database of Structural Variation (dbVar) (for small and large variation, respectively), the Database of Genotypes and Phenotypes (dbGaP) (for variation-disease associations), and the Database of Clinical Variation (ClinVar) (for clinical interpretations of variation). The Medical Genetics resource at NCBI (MedGen) harmonizes phenotype terminologies and supports computational access to phenotype data.

• Interpretation of variants requires understanding of the role of the human reference assembly and its annotation, which are periodically versioned based on new knowledge. Professionals using NGS techniques need to be aware of problematic regions within the human reference assembly.

• Large-scale efforts such as HapMap, the 1000 Genomes Project, and NHLBI-ESP, which characterize genetic variation in diverse population groups, support assessments of the frequency of human variation. ClinVar, dbSNP, and Variation Viewer support searching and filtering functions related to allele frequencies. dbSNP provides variant call format (VCF) files of common variants not known to be disease-related.

• Laboratories that utilize NGS technology to detect sequence variation encounter variants that have not previously been reported, and proprietary databases may be inadequate for the reliable interpretation of clinical significance. Consequently, there is increasing participation in public data sharing, and several community efforts are under way to develop standards that promote the development of clinical grade databases.

• Professional societies, laboratories, accrediting organizations, and federal agencies have been developing professional guidelines and quality assurance programs that support the application of NGS technology in the clinical realm. The GeT-RM NGS dataset and browser, as well as the Genome in a Bottle project, support assessment of the analytical validity of any variant call.

INTRODUCTION

Medical professionals must increasingly evaluate genetic variant data generated for their patients by clinical laboratories, research studies, and direct-to-consumer testing. Not only is there a growing number of conditions for which testing is available, but the previous paradigm of focused testing for genes with well-accepted clinical consequences, and for specific variants in those genes, has shifted toward the use of massively parallel sequencing to identify common and rare variants across many or all genes in the genome. Traditional single gene tests have begun to be replaced by complex panels that assay multiple genes with varying levels of evidence for disease association and limited characterization of the variation spectrum. Also, patients with presumably genetic, undiagnosed diseases are increasingly undergoing whole exome and whole genome analyses using next-generation sequencing (NGS) methods. Not only does this pose challenges to the evaluation of variants in genes with a plausible contribution to the phenotype, but variants can also be uncovered in genes with unexpected but important clinical consequences. Technical advances in sequencing coupled with reduced cost have led to an abundance of genetic variant data across the clinical realm, with the promise of more to come. The interpretation of genetic variant data has become a rate-limiting step in the utilization of this information. Factors that make interpretation rate limiting include a lack of clinical standards for interpreting primary NGS results, inconsistencies in reporting clinical variants in commonly used file formats, limited adoption of variant nomenclature standards, information gaps between those capturing variant data and those evaluating the phenotypes of the test subjects, a lack of standard interoperable phenotype vocabularies, the relative immaturity of resources containing population-specific variant frequency data, insufficient cross-training between the research-based groups practiced in managing big data


and the clinical community, and limited data sharing, which slows well-powered analyses. Perhaps the single most important barrier to the interpretation of variant data is the lack of systematized communication among countless clinical and research settings and the scientific literature. This chapter discusses databases and tools that support the evaluation of variation data, with an emphasis on clinical applications. It begins with the current reliance of genome interpretation on a high-quality reference assembly and briefly discusses how the analytical validity of variant calls from assorted NGS platforms can be determined. The chapter then reviews databases and tools to identify common variation in multiple populations, and how those data can be used to filter out variants not likely to be pathogenic. It continues with an overview of resources and tools to represent rare variation and the phenotypes of individuals who have those variants, and how those data can be applied to review the medical importance of any variant. The emphasis is on resources available from the NIH's National Center for Biotechnology Information (NCBI); there are too many resources worldwide to summarize all of them in complete detail. More details about any resource can be found at NCBI's Variation portal (http://www.ncbi.nlm.nih.gov/variation/) or in The NCBI Handbook (http://www.ncbi.nlm.nih.gov/books/NBK143764/).

IDENTIFICATION AND VALIDATION OF HUMAN VARIATION

Methods for Identifying Human Variation

Targeting Known Sequences
As the reference sequence of the genome was established [1,2] and variant analysis was performed, it became possible to assess variation in individuals by determining which alleles they carried at known locations. Probes were developed to detect single nucleotide changes at specific sites and to detect the copy number of genomic regions [3]. To enable large-scale analyses, these probes were immobilized on solid supports or arrays, so that hybridization with sample sequence could be read out as the allele observed at each probe, as determined by its position on the array. There are many commercial vendors of single nucleotide polymorphism (SNP) arrays and arrays for comparative genomic hybridization (array CGH); the location at which the variation was assessed can usually be determined from the rs# (see the dbSNP section), the probe name (http://www.ncbi.nlm.nih.gov/probe/), or the name of the clone used to represent a genomic region (http://www.ncbi.nlm.nih.gov/clone).

Sequence Discovery
Whole exome sequencing (WES) and whole genome sequencing (WGS) both begin with DNA extraction from nucleated cells, breaking the DNA into short fragments, and determination of the sequences of those fragments with various sequencing technologies (covered in more detail in Chapters 1–3). The chemistries of sequencing technologies vary, but the organizing principle is the generation of millions of sequence reads captured as data. The current landscape is dominated by technologies that produce short reads (on the order of 100–250 bp); although longer-read technologies are being rapidly developed, they are not currently in common use. Computer algorithms are then employed to align the short sequence reads to specific positions on one of the human genome reference assemblies. (De novo assembly of sequence reads is not currently performed routinely, and thus interpretation currently relies on computer-based alignment to a reference assembly.) Comparison with the reference assembly is tabulated for each position with respect to the genotype and the depth of coverage, i.e., the number of sequence reads across each position. Depending on the objective of the analysis, computational filtering is then applied, e.g., by variant frequency or predicted type of variation [4,5]. About 1% of the human genome consists of exons, and current understanding of the clinical significance of variation predominantly concerns variation residing in the exome. WES attempts to enrich for coding sequence using capture techniques that couple fragmented DNA to artificial DNA linkers and select the fragments using complementary sequence baits. In practice, less than 100% of the exome is captured, and flanking intronic regions, which may contain pathogenic splice variants, are not reliably included. The sequence reads generated by WGS are nearly randomly distributed across the genome but are enriched within the coding regions of genes. WGS is more accurate than WES for detecting structural variation, but sensitivity is nonetheless limited for both methods. Neither method reliably detects trinucleotide repeats or small copy number variants (see Ref. [4] and the supplementary material in Ref. [6] for more details). Positive results have high accuracy but are nonetheless usually validated by clinical laboratories with an orthogonal technique such as Sanger sequencing.
The false-negative rate varies by genomic region; it is higher, for example, in regions with high guanine-cytosine (GC) content.


There remains a role for tests that are optimized for a gene (e.g., detection of trinucleotide repeat expansion of HTT for Huntington disease) or set (panel) of genes, particularly for defined phenotypes with genetic heterogeneity (e.g., retinitis pigmentosa).
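For readers who script their own analyses, the depth-of-coverage tabulation described above can be reproduced in a few lines of code. The following is a minimal sketch in Python using the pysam library; the BAM file name and coordinates are hypothetical placeholders, and a coordinate-sorted, indexed BAM file is assumed.

    import pysam

    def mean_depth(bam_path, contig, start, stop):
        """Mean per-base read depth across the half-open interval [start, stop)."""
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            # count_coverage returns four parallel arrays of A, C, G, and T counts
            a, c, g, t = bam.count_coverage(contig, start, stop)
            depths = [a[i] + c[i] + g[i] + t[i] for i in range(stop - start)]
        return sum(depths) / len(depths) if depths else 0.0

    # e.g., average coverage across a region of interest (0-based coordinates)
    print(mean_depth("sample.bam", "chr17", 41196311, 41197819))

Positions with depth well below the run's average are candidates for the false-negative problem noted above and may warrant orthogonal testing.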

Sequencing and Identifying Differences Relative to a Reference
After sequence data are generated from an NGS platform, a variant file is produced by aligning the reads to a reference genome and applying variant calling algorithms. This stage of analysis, often called a pipeline, can introduce various types of bias even before the resulting variant file is evaluated. Typically, different pipelines are used to call small- and large-scale variants, and integration of these data can be challenging. This chapter will not cover basic factors that can affect the analytical validity of sequencing results (see Chapters 7–11).

Understanding a Reference Assembly
It is currently not possible to interpret an individual genome without the use of a high-quality reference assembly. The reference assembly performs two major roles. First, it is a necessary component for reconstructing the genotype of an individual, via alignment of sequence reads to the reference followed by identification of sites that vary from the reference. Second, the reference assemblies are annotated with knowledge about genes, transcripts, regulatory sites, and variation that has been aggregated from multiple studies. The reference assembly used by both the research community and the clinical community is based on the assembly developed by the public Human Genome Project (HGP) [2]. The HGP started prior to the advent of high-throughput sequencing (HTS). The HGP chose to take a mapping-based approach to producing a reference assembly [7], while others pursued a whole genome shotgun strategy [8]. Both strategies produced draft assemblies, though subsequent analysis revealed that the whole genome shotgun approach collapsed multicopy sequences [9]. These multicopy sequences, known as segmental duplications, are greater than 1 kb long with sequence identity greater than 90% to some other sequence within the genome. Assembly models adopted by the HGP assumed relatively low variation across populations, suggesting that an assembly model representing a haploid genome would be sufficient for further analysis. While it was known that some regions of the genome, such as the major histocompatibility complex (MHC), exhibited extreme population diversity, it was assumed this was rare. Initial versions of the reference assembly contained sequences representing one haploid equivalent of each chromosome, plus some sequence that could not be ordered or oriented with respect to the chromosome [1]. Analysis of these early versions of the reference assembly revealed that large-scale structural variation was much more common than previously appreciated [10]. When the reference assembly transitioned to the Genome Reference Consortium (GRC), the assembly model was updated in an effort to more faithfully represent regions of high allelic diversity [11]. When using the reference assembly to interpret individual genomes, several key issues must be considered. The reference sequence is constructed primarily from individual large-insert clones, each of which represents a single haplotype from the donor. The phasing of these clones is unknown in most cases, so the reference sequence can abruptly switch haplotypes. The source DNA was obtained from a number of individuals who anonymously volunteered to donate their DNA for this project. As such, the reference assembly does contain disease-causing and risk alleles for several disorders [12].
Care should be taken at these loci to ensure that a homozygous reference genotype is interpreted correctly, because an individual whose sequence matches the reference may be at risk for a disorder. To prevent such misinterpretations and to provide stable numbering systems for reporting the location of variation, RefSeqGene and the Locus Reference Genomic (LRG) collaboration [13] provide genomic reference sequences that are curated to retain common alleles. Regions with high allelic diversity increase the chance of an assembly error. In some cases two haplotypes are both represented in the reference chromosome assembly (haplotype expansion), and in many such cases these regions are associated with gaps. Additionally, highly duplicated, often human-specific duplications can become collapsed into a single locus. In both situations, analysis of an individual genome can be complicated, as the incorrect genomic representation can lead to both false-positive and false-negative variant calls. Perhaps the most vexing problem in using the human reference is that human sequences are still missing from it. In many cases, the missing sequence contains paralogous copies of sequences already present in the current reference [14]. When analyzing an individual genome, reads from the missing paralogs can align well to the related copies in the reference. In some cases, these paralogous variants can be


miscalled as single nucleotide variants (SNVs). In other cases, the region cannot be assessed at all, because interpretation of the alignments is too difficult for many variant callers. In the GRC assembly model, regions of extreme allelic diversity may have more than one sequence representation in the assembly. In these cases, one sequence is incorporated into the Primary Assembly while allelic representations are added as alternate loci. These alternate loci are aligned to the Primary Assembly so that the allelic representation is retained. Currently, inclusion of these sequences in genome-wide analysis is complicated because most analysis tools cannot distinguish between paralogous duplication and allelic duplication. The GRC has created a public database of known problematic regions within the human reference assembly (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/issues/). This database is available on the GRC web site and as a data track in various genome browsers. As the GRC corrects known issues in the reference, it releases the corrected sequences as patch sequences, so that the correct sequence version is available prior to the next major assembly update.

Validation of Variant Calls
Determining the analytical validity of any variant call requires both reference sequences and supporting reference biological materials that can be assayed to assess the accuracy of variant calls. One approach to addressing this problem is the Genetic Testing Reference Materials Coordination Program (GeT-RM) at the Centers for Disease Control and Prevention (CDC) (http://wwwn.cdc.gov/clia/Resources/GetRM/default.aspx). This program supports not only the reference biological materials but also the development of a browser (http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/) to review genotype calls from participating laboratories. Participating labs analyzed the genomes NA12878 and NA19240 (CDC Cell and DNA Repository, Coriell Institute, Camden, NJ) using their standard laboratory protocols and submitted variant calls to NCBI. A subset of labs also provided validation information for their calls, alignments of NGS sequence (via Binary Alignment/Map (BAM) files), and/or trace data when they had validated a variant call by Sanger sequencing. Additionally, data from the Genome in a Bottle (GIAB) project were incorporated into the GeT-RM dataset [15]. This dataset covers over 70% of the GRCh37 reference assembly and asserts regions where the NA12878 genome is homozygous reference, providing a resource for labs to assess analytical validity over a large part of the genome. The GeT-RM browser supports download of the data for participating and nonparticipating laboratories (Church et al., in preparation). Predefined datasets are available for download, including high-quality variants (validated by Sanger sequencing) for both the NA12878 and NA19240 genomes, and a Browser Extensible Data (BED) file of regions that have been fully sequenced using Sanger technology (all nonvariant bases are asserted to be homozygous reference). Extensive documentation is available to support browsing and download activities (http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/help/).
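As one illustration of how such a dataset can be used, the sketch below checks whether variant calls fall within regions asserted to be fully assessed, read here from a BED file like the one described above. This is a minimal sketch in plain Python; the file name is a placeholder, and BED intervals are treated as 0-based and half-open, per the BED convention.

    from bisect import bisect_right

    def load_bed(path):
        """Read a BED file into per-chromosome sorted interval lists."""
        regions = {}
        with open(path) as fh:
            for line in fh:
                chrom, start, end = line.split()[:3]
                regions.setdefault(chrom, []).append((int(start), int(end)))
        return {c: sorted(iv) for c, iv in regions.items()}

    def is_assessed(regions, chrom, pos):
        """True if a 0-based position falls inside an assessed region."""
        intervals = regions.get(chrom, [])
        i = bisect_right(intervals, (pos, float("inf"))) - 1
        return i >= 0 and intervals[i][0] <= pos < intervals[i][1]

    assessed = load_bed("getrm_sanger_regions.bed")
    print(is_assessed(assessed, "1", 1000123))

A call inside an assessed region can be scored as concordant or discordant against the reference dataset; a call outside those regions simply cannot be evaluated this way.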

IDENTIFICATION OF COMMON VARIATION

Overview
Large-scale databases of genetic variation are essential tools for assessing the novelty and frequency of variants in a dataset. Gauging pathogenicity by testing allele frequency in ethnically matched control samples is a standard approach that has largely been supplanted by the availability of large, publicly available datasets. It can reasonably be assumed that a variant contributing to a highly penetrant, rare disorder has a low population frequency (in a population reflecting the sample source) or is novel (i.e., has not been previously reported). Similarly, causal variants for conditions with high penetrance and disease severity should not exist at high population frequencies. Thus, filtering of a variant result dataset commonly begins with exclusion of variants observed at "too high" a frequency to be causal for the phenotype of interest. Multiple projects in the last decade were implemented to identify common variation in diverse human populations. The primary goals of these projects included the characterization of regions of shared inheritance (haplotype blocks) in multiple population groups, as well as identification of allele frequencies. Earlier approaches were based on SNP arrays and thus determined allele frequencies at preidentified variant locations (see the HapMap section). Later approaches applied HTS and called variants by alignment, which supported identifying novel variation as well as determining its frequency (see the 1000 Genomes Project and National Heart, Lung, and Blood Institute Exome Sequencing Project (NHLBI-ESP) sections).

TABLE 12.1 Reference Databases and Datasets for Human Variation

SELECTED DATASETS

International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/): International collaboration to assess common variation in multiple populations and identify haplotype blocks. Data captured primarily via SNP arrays. [16,17]

1000 Genomes Project (http://www.1000genomes.org/): International collaboration to assess common variation in multiple populations using NGS approaches. Has had multiple releases, each detecting less frequent alleles. With Phase 3, there is sufficient information to impute phase and determine haplotype blocks. [18–20]

NHLBI GO-Exome Sequencing Project (ESP) (https://esp.gs.washington.edu/drupal/): Multicenter collaboration using NGS approaches to identify variation in exomes. Includes rich phenotypic data. The goal of ESP is to discover novel genes and mechanisms contributing to heart, lung, and blood disorders. [21,22]

REPRESENTATIVE ARCHIVAL DATABASES

dbSNP, the Database of Short Genetic Variations (http://www.ncbi.nlm.nih.gov/snp): Archive of short (<50 bp) sequence variants. Includes both common and rare variants, both germ line and somatic. [23]

dbVar (http://www.ncbi.nlm.nih.gov/dbvar): Archive of longer (>50 bp) sequence variants. Includes copy number changes and complex rearrangements. [24]

Table 12.1 summarizes reference databases and datasets for human variation. The 1000 Genomes Project (http://www.1000genomes.org/) [18–20] and the NHLBI GO-ESP [21,22] Exome Variant Server (http://evs.gs.washington.edu/EVS/) provide population frequency information about genetic variants. NCBI resources such as Variation Viewer (Figure 12.1) and the Database of Clinical Variation (ClinVar) (Figure 12.2) report minor allele frequencies from the 1000 Genomes Project and NHLBI-ESP. The Database of Short Genetic Variations (dbSNP) (see the dbSNP section) [23] also enables searching and filtering of human variation by reporting the global minor allele frequency (MAF) for each dbSNP record identifier of the type "rs#" (reference SNP (refSNP) cluster ID number). The MAF is the frequency of the second most frequent allele. The default global population currently used is the 1000 Genomes Phase 1 genotype data from 1094 individuals from diverse populations. Extensive documentation is available to support searching for SNP frequency data, including how to find data for specific populations (http://www.ncbi.nlm.nih.gov/books/NBK44431/).
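As an illustration of what these frequency annotations look like in practice, the sketch below extracts per-allele frequencies from the INFO column of a VCF record. It assumes an INFO tag in the style of dbSNP's CAF tag (comma-delimited allele frequencies from the 1000 Genomes Project, reference allele first); readers should confirm the tag name and its semantics against the ##INFO header lines of the file they actually use.

    def allele_frequencies(info, tag="CAF"):
        """Parse a comma-delimited frequency tag from a VCF INFO string."""
        for field in info.split(";"):
            if field.startswith(tag + "="):
                values = field.split("=", 1)[1].split(",")
                return [None if v == "." else float(v) for v in values]
        return None

    info = "RS=334;CAF=0.9559,0.04413"   # illustrative INFO string
    freqs = allele_frequencies(info)
    known = sorted((f for f in freqs if f is not None), reverse=True)
    # the MAF is the frequency of the second most frequent allele
    maf = known[1] if len(known) > 1 else 0.0
    print(maf)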

HapMap
The aim of the multi-institutional International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/) [16,17] is to determine the common patterns and differences of DNA sequence variation in the human genome by characterizing sequence variants, their frequencies, and the correlations between them in DNA samples from four populations with ancestry from parts of Africa, Asia, and Europe. The publicly available information generated from this important effort continues to help investigators find candidate genes in commonly shared genomic regions and link these regions to human disease. The project has had several releases, with the last (HapMap3) [16] identifying more than 1.6 million common DNA variants comprising SNPs and copy number polymorphisms (CNPs). Although these data were identified on earlier versions of the genome, the variants identified by the project have been placed on current assemblies by dbSNP. By identifying blocks of linkage disequilibrium, the project early on supported the use of imputation to "call" untyped variants, i.e., variants in the same linkage disequilibrium block that are not directly evaluated by genotyping arrays. HapMap-based discoveries have revolutionized the study of hereditary factors in human disease. HapMap has been instrumental in identifying hundreds of new genes that contribute to conditions such as diabetes,


FIGURE 12.1 Screen shot of a result in Variation Viewer. (A) Pick Assembly supports selection of the assembly on which the query should be based. (B) Search box. Enter a query term and review the results in the list that opens underneath, either by Gene (shown) or by other features (second tab). The report provides the name of the feature and its location; clicking on an entry refreshes the result in the view box (D) and the table (E). In this example, noonan was entered as the query, and SOS1 was selected from the list. The results in the tabular report (E) can be filtered (C) to selected categories. One filter not included in this screen shot is the filter by allele frequency, which is particularly useful for separating rare from common variation. In this example the In ClinVar filter was selected, and the result set included several variants with frequencies reported by 1000 Genomes or GO-ESP. Note the Download button at the top of the tabular display. This example does not show use of Your Data (where you can upload your own data for display in the graphical view (D)) or use of History (where you can restore the display of a recently reviewed gene without redoing the query). For more details on the use of Variation Viewer, see the YouTube video (link at the upper right).

primary biliary cirrhosis, schizophrenia, elevated cholesterol levels, myocardial infarction, rheumatoid arthritis, systemic lupus erythematosus, Crohn's disease, bipolar disorder, and many other complex multifactorial diseases in multiethnic backgrounds.

1000 Genomes Project
Samples used for the 1000 Genomes Project [18–20] were obtained from more than 2500 deidentified, unrelated individuals from 26 populations and do not have associated phenotypic information. NGS was performed with the goal of identifying most genetic variants with frequencies of at least 1% in the populations evaluated [19]. At the time of this writing, the data from Phase 3 of the project had not yet been submitted to the public databases, but the initial call set included more than 79 million variant sites, including SNPs, indels, deletions, and other variant classes. The richness of the data is reported to be sufficient to establish phase, i.e., which alleles are on the same chromosome, over large parts of the genome.


FIGURE 12.2 Screen shot of a result of querying ClinVar by a gene symbol. The term ACTG1 entered in the query box (A) was recognized as a gene symbol (B). The Search ClinVar for ACTG1 link (B) thus pops up in case some records were retrieved because ACTG1 occurred somewhere in the text other than in the context of the symbol. This search retrieved 35 records, but only 12 are shown because the Multiple submitters filter (C) was checked; to see all 35, click Clear all. The tabular result provides the description of the variant, the associated phenotype, minor allele frequencies from the 1000 Genomes or GO-ESP projects, the current interpretation, the review status (bolded here because the filter is checked), and the location on chromosome 17 based on GRCh38. These data can be downloaded via the Send to link (E) at the upper right. The right sidebar can be displayed by clicking on SIDEBAR; it supports navigating to information in other databases (PubMed selected here), reviewing your search, and repeating recent queries.

There are multiple ways to access the data generated by the 1000 Genomes Project, including NCBI (which enables free, programmatic access to the entire 1000 Genomes Project dataset by mirroring files via FTP) (ftp.ncbi.nlm.nih.gov/1000genomes/), Aspera (www.ncbi.nlm.nih.gov/public/1000genomes/), and the Amazon Cloud (s3.amazonaws.com/1000genomes). The 1000 Genomes Browser (http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/) supports download of sequence data (BAM, SAM, FASTQ, or FASTA formats) and genotypes (variant call file (VCF) format) for selected individuals in a displayed region. Navigating the vast number of 1000 Genomes Project FTP files can be daunting; a web interface (http://www.1000genomes.org/ftpsearch) [20] provides access to a file called current.tree, at the root of the FTP site, which is a complete listing of all files and directories. Filtering can be applied to exclude file types with large numbers of results (e.g., FASTQ or BAM files). Download of variants or alignments from specific genomic regions, rather than the entire dataset, is supported, and variation data (VCF) can be sliced by population.
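Region-based slicing of this kind can also be scripted. The sketch below uses pysam's tabix interface to pull records for one region out of a bgzip-compressed, tabix-indexed 1000 Genomes VCF; the file name is a placeholder, and a matching .tbi index file is assumed to be present alongside the VCF.

    import pysam

    # coordinates passed to fetch() are 0-based and half-open
    vcf = pysam.TabixFile("ALL.chr17.phase3.genotypes.vcf.gz")
    for row in vcf.fetch("17", 41196311, 41277500):
        chrom, pos, rsid, ref, alt = row.split("\t")[:5]
        print(chrom, pos, rsid, ref, alt)
    vcf.close()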

NHLBI-ESP
The goal of ESP is to discover novel genes and mechanisms contributing to heart, lung, and blood disorders. Exome sequencing and deep phenotyping were performed, with a strategy of examining the extremes of specific traits along with inclusion of unaffected controls [21,22]. Individual-level phenotype information is not available in the public ESP data but can be obtained through application to the Database of Genotypes and Phenotypes (dbGaP) (http://www.ncbi.nlm.nih.gov/gap/ or https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login). The ESP Exome Variant Server has detailed documentation about its use (http://evs.gs.washington.edu/EVS/). Because ESP is an exome sequencing project focused on heart, lung, and blood disorders, it is of limited use for frequency-based filtering of WGS variant data generated by studies of other phenotypes.


INTERPRETATION OF COMMON VARIATION

Determining Association Between Phenotype and Common Variation
Once a large set of well-characterized and well-spaced SNPs has been identified, it is possible to launch studies to determine the regions of the genome that contribute to a phenotype, usually by a case–control approach. Individuals with (cases) and without (controls) a particular phenotype are genotyped, and the results are assessed to determine for which SNPs the allele frequencies in cases are indistinguishable from controls, and for which SNPs the allele frequencies differ significantly (a minimal numeric sketch of this comparison follows at the end of this section). By knowing where these SNPs are located on the genome, it is possible to identify regions of interest. These studies do not necessarily identify a causative variant, but they may identify the haplotype block in which the causative variant lies.

Health-Related Phenotypes (GWAS)
In mid-2014, a query of PubMed for citations in which genome-wide association studies (GWAS) are a major topic returned more than 4000 entries, and the National Human Genome Research Institute (NHGRI) GWAS catalog (http://www.genome.gov/gwastudies/) showed steady growth in the number of citations, approaching 2000 [25]. Thus the GWAS approach to teasing out areas of the genome that are associated with particular phenotypes remains quite active. The primary data from such studies may be contributed to data archives such as dbGaP or the European Genome–Phenome Archive (EGA). Access to primary data in these resources is controlled, but when studies are released, summary data can be retrieved through such interfaces as the Phenotype–Genotype Integrator (PheGenI) and dbGaP's Association Results Browser (http://www.ncbi.nlm.nih.gov/projects/gapplusprev/sgap_plus.htm). The HuGENavigator resource (http://hugenavigator.net/HuGENavigator) is a product of the Human Genome Epidemiology Network (HuGENet) that combines population-based studies of human genes in PubMed with machine learning [26] in the domains of genetic epidemiology and genetic associations. Study types include GWAS, observational studies, and metaanalyses, and data types include gene–disease association, gene–environment interaction, and pharmacogenetics. PubMed articles are indexed by Medical Subject Headings (MeSH) terms and NCBI's Gene database. Several tools are available, including Phenopedia for phenotype-centric and Genopedia for genotype-centric views of genetic associations [27].

Molecular Phenotypes (Gene Expression/GTEx)
It is also possible to apply GWAS to localize areas of the genome that affect molecular phenotypes such as the level of gene expression (http://www.gtexportal.org/home/). By identifying SNPs that have a significant association with transcript levels, it is possible to explore loci that affect the maintenance of RNA molecules. These expression quantitative trait loci (eQTLs) can be categorized on the basis of whether they alter expression of a gene near the variant (cis) or not (trans), the significance of the association, the relative change in expression, and the tissues in which expression was assessed [28]. A new round of funding will expand functional measures into protein and epigenomic consequences of germ line and somatic variation (http://www.genome.gov/27558185).
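The sketch promised above makes the case–control comparison concrete with a standard allelic test: allele counts at one SNP are compared between cases and controls with a 2 × 2 chi-square test. The counts are invented for illustration; real GWAS analyses add quality control, covariates, and correction for the very large number of SNPs tested.

    from scipy.stats import chi2_contingency

    #            allele A, allele a
    cases    = [620, 380]   # 500 cases    -> 1000 chromosomes
    controls = [540, 460]   # 500 controls -> 1000 chromosomes
    chi2, p, dof, expected = chi2_contingency([cases, controls])
    print(f"chi2={chi2:.2f}, p={p:.3g}")   # a small p suggests association at this SNP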

Inferring Lack of Causality Based on Allele Frequency
When variant calls for an individual have been established, a next challenge is to determine which ones represent normal variation and which have been determined to cause disease or to be associated with a phenotype of interest. Often the first step is to remove from consideration all variants that are common in one or more populations and that, although common, are not known to be associated with disease. A major benefit of the large-scale projects described above has thus been to determine the frequency of alleles at so many sites in so many populations. dbSNP provides VCF files of common variants not known to be disease-related at ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/common_no_known_medical_impact-latest.vcf.gz.
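A minimal sketch of this first filtering step is shown below: any candidate variant whose chromosome, position, reference allele, and alternate allele match a record in the common-variants VCF named above is set aside. File names are placeholders, and a production pipeline would normalize variant representations (e.g., left-align indels) before comparing.

    import gzip

    def vcf_keys(path):
        """Collect (chrom, pos, ref, alt) keys from a (possibly gzipped) VCF."""
        opener = gzip.open if path.endswith(".gz") else open
        keys = set()
        with opener(path, "rt") as fh:
            for line in fh:
                if line.startswith("#"):
                    continue
                chrom, pos, _id, ref, alts = line.split("\t")[:5]
                for alt in alts.split(","):
                    keys.add((chrom, pos, ref, alt))
        return keys

    common = vcf_keys("common_no_known_medical_impact-latest.vcf.gz")
    candidates = vcf_keys("patient_calls.vcf")
    rare_or_unreported = candidates - common
    print(len(candidates) - len(rare_or_unreported), "variants removed as common")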

DEFINING DISEASES AND PHENOTYPES

Maintaining a clearly defined set of concepts and terms for phenotypes is necessary to support efforts to characterize genetic variation by its effects on specific phenotypes. The assignment of identifiers for those concepts allows computational access to phenotypic information, an essential requirement for the large-scale analysis of genomic data.


UMLS and MedGen
MedGen is NCBI's portal to information about human disorders and phenotypes having a genetic component (http://www.ncbi.nlm.nih.gov/books/NBK159970/). The purpose of MedGen is to serve health care professionals and the medical genetics community by harmonizing phenotype terminologies, supporting computational access to phenotype data, and adding value through a variety of data elements. Genetic disorders are often known by numerous names, and the proliferation of terms for the same condition can obscure what is meant conceptually and limit computational approaches. As an example, the following terms all refer to the same disorder: angiokeratoma corporis diffusum; hereditary dystopic lipidosis; alpha-galactosidase A deficiency; ceramide trihexosidase deficiency; ceramide trihexosidosis; GLA deficiency; deficiency of melibiase; Anderson–Fabry disease; and Fabry disease (http://www.ncbi.nlm.nih.gov/medgen/?term=C0002986). MedGen aggregates the terms used for a particular disorder from multiple vocabulary sources into a specific concept and assigns a unique, stable identifier (Concept Unique Identifier, CUI) to that concept. Where possible, the identifier is the same as that used by the Unified Medical Language System (UMLS; www.nlm.nih.gov/research/umls/). UMLS is maintained by the National Library of Medicine and provided to researchers without charge via a license agreement. The UMLS Metathesaurus is a large vocabulary database of biomedical and health information which is used to seed terms for MedGen. MedGen has been anchored to UMLS to enhance standardization of genetic terminologies and to facilitate their utilization in electronic medical record systems. Concepts are categorized by semantic type, which allows concepts that share terms to be differentiated by scope. For example, the term "autism" is a synonym for the condition called "autism spectrum disorders" (semantic type "mental or behavioral dysfunction") as well as a clinical feature of many conditions (semantic type "finding"). MedGen provides multiple types of descriptors for concepts, including names, synonyms, acronyms, semantic type, abbreviations, sources of descriptors, attribution, and identifiers (e.g., from Online Mendelian Inheritance in Man (OMIM) and the Human Phenotype Ontology (HPO)), textual definitions from multiple sources (http://www.ncbi.nlm.nih.gov/medgen/docs/definitionsources/), hierarchical relationships between terms, cytogenetic locations, causative genes and variants, mode of inheritance, tests in the NIH Genetic Testing Registry (GTR), molecular resources, professional guidelines, reviews (e.g., from GeneReviews and Medical Genetics Summaries), consumer resources, clinical trials, and Web links to other related NCBI and non-NCBI resources (http://www.ncbi.nlm.nih.gov/books/NBK159970/table/MedGen.T.a_list_of_data_elements_aggrega/?report=objectonly).
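Computational access of this kind is available through NCBI's E-utilities, a documented web API that covers MedGen along with the other Entrez databases. The sketch below searches MedGen for a disease name and prints the matching concept identifiers; the query term is just an example, and heavy users should register for an API key and respect NCBI's rate limits.

    import json
    import urllib.parse
    import urllib.request

    BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {"db": "medgen", "term": "Fabry disease", "retmode": "json"}
    url = BASE + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)["esearchresult"]
    # MedGen UIDs, which can be passed to esummary.fcgi for full records
    print(result["count"], result["idlist"][:5])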

OMIM
The home of OMIM is omim.org [29]. NCBI, however, continues to process the full-text records, index each record for searching, and maintain reciprocal links within NCBI between OMIM and other NCBI databases such as ClinVar (see the ClinVar section), Gene, GTR, dbSNP, the Database of Structural Variation (dbVar), MedGen, and PubMed. OMIM is an authoritative compendium of human genes and genetic phenotypes, authored and edited at the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders, with a focus on the relationship between phenotype and genotype. Entries are also richly linked to other genetics resources.

Orphanet
Orphanet (http://www.orpha.net/) is a reference portal, intended for all audiences, for information on rare diseases and orphan drugs. About 40 countries participate in the consortium that leads Orphanet, which is coordinated by the French INSERM team. National teams are responsible for collecting information about activities in their country, including expert centers, medical laboratories, research activities, and patient organizations. In 2014, Orphanet, in collaboration with the European Bioinformatics Institute (EBI), released a representation of disease-related data as a formal ontology (ORDO, http://www.orphadata.org/cgi-bin/inc/ordo_orphanet.inc.php). Updated monthly, this ontology includes diseases, genes, epidemiologic data, and connections with other terminologies.


Human Phenotype Ontology
The HPO (http://www.human-phenotype-ontology.org) [30] provides a formal ontology of human phenotypic abnormalities, with approximately 10,000 terms annotating more than 7200 disorders represented in OMIM, Orphanet, and the DatabasE of Genomic Variants and Phenotype in Humans Using Ensembl Resources (DECIPHER). This ontology is freely accessible and is used in MedGen to support querying of disorders based on these phenotypic terms.

Combining Phenotypic Features
Evaluation of phenotype–genotype relationships is aided by analysis of observed traits, not only putative diagnoses. Thus there are several approaches to representing the set of features either observed in an individual or used to define a diagnosis. For example, ClinVar accepts submissions characterizing the set of clinical features observed in individuals with a variant to support future analyses. The resources enumerated below use one or more features to identify diagnostic terms that share clinical features, or to suggest a diagnosis. MedGen supports the identification of conditions with a constellation of clinical features based on HPO terms; query results are ranked on the basis of search relevance. Factors important to the generation of a clinical differential diagnosis, such as the sensitivity and specificity of features for each condition, age-dependent penetrance, and the prevalence of conditions, are not currently part of the MedGen search algorithm. Phenomizer (http://compbio.charite.de/phenomizer/) produces a ranked list of diagnostic possibilities based on combinations of HPO terms; it is not intended for clinical use. SimulConsult (www.simulconsult.com) is a medical decision support software product targeted to clinicians that combines clinical and laboratory findings to generate a probability-ranked differential diagnosis. Age of onset of findings (or the lack thereof) and family history are integral features factored into the Bayesian analysis. SimulConsult has added the Genome–Phenome Analyzer, which integrates pertinent genetic variant test results into the differential diagnosis. In addition to providing a differential diagnosis, the software suggests the tests most useful for narrowing the differential diagnosis while taking cost into account. SimulConsult performed well in the CLARITY genome interpretation contest, with the fastest analysis times [31].
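The idea underlying these feature-based lookups can be reduced to comparing sets of phenotype terms. The sketch below ranks a toy list of disorders by the overlap (Jaccard similarity) between each disorder's annotated HPO terms and a patient's observed terms; the disorder names and term assignments are invented for illustration, and real tools such as Phenomizer use weighted, ontology-aware measures rather than raw set overlap.

    def jaccard(a, b):
        """Set overlap: size of intersection divided by size of union."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    disorder_terms = {
        "Disorder X": {"HP:0001250", "HP:0001263", "HP:0000252"},
        "Disorder Y": {"HP:0001250", "HP:0002133"},
    }
    patient = {"HP:0001250", "HP:0000252"}

    for name, terms in sorted(disorder_terms.items(),
                              key=lambda kv: jaccard(patient, kv[1]),
                              reverse=True):
        print(f"{name}: {jaccard(patient, terms):.2f}")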

REPRESENTATION OF VARIATION DATA IN PUBLIC DATABASES

Archives of Variants
When information about human variation is deposited in comprehensive, well-supported public databases, there are multiple benefits. First, representation is standardized and submissions are integrated with contributions from investigators with diverse interests, making it easier to evaluate information from both clinical and research domains. Second, the data are stable and are tracked with accessions, so there is a history of submissions and reports. Finally, with standard representation, it is easier to develop and maintain basic tools that support analysis of variation no matter the version of the human genome assembly. The two major data archives that maintain the primary definitions of variants at NCBI are dbSNP [23], for variation less than about 50 bp in length, and dbVar [24], for longer structural variation. Both germ line and somatic variants are archived in these databases. Establishing the initial definition of a variant requires a computable description of the location and type of sequence change. Note that there is no restriction on the frequency of any allele in either dbSNP or dbVar. Although the majority of human variation in dbSNP meets the criterion of being polymorphic, by having an MAF greater than 1% in at least one population, the public archives do accession rare variants, including those that may be highly penetrant in causing disorders. It is important to recognize that filtering out variants in dbSNP from a variant result list will remove medically relevant variation, both common and uncommon.

dbSNP
In 1998, NCBI established dbSNP in response to a need for a general catalog of genome variation. In the earlier days of dbSNP, when variation was discovered from primary sequencing, submitters would provide a short sequence with the location of the variation and a representation of the types of alleles observed at that location.


These primary submissions were assigned an accession beginning with "ss" and could include metadata such as the population in which the variation was observed, the method of variant detection, and the frequencies of different alleles. Now that most variation is called based on differences relative to a defined reference, data flows are changing to define a variant by its location on an assembly. Thus submissions in the format of a VCF file (http://www.ncbi.nlm.nih.gov/projects/SNP/docs/dbSNP_VCF_Submission.pdf) or a Human Genome Variation Society (HGVS) expression (http://www.hgvs.org/mutnomen/) are becoming more common. Previously, the sequence data from different submitters were aligned to each other, and to an assembly, to determine whether they described the same variant (i.e., based to a large degree on local sequence). Now, however, dbSNP aggregates data by location in the genome and by type of variation (e.g., single nucleotide change, deletion, insertion). No matter the method, the result of this aggregation is assigned a refSNP (rs) identifier, which is commonly used to refer to that variant location in subsequent studies and publications. It must be emphasized that the rs identifier does not indicate the explicit sequence change at a location. In other words, one rs may be assigned to a location on the genome where single nucleotide variation has been reported, even if all four nucleotides have been observed at that location.

dbVar
dbVar, NCBI's database of structural variation, also archives submissions and provides a layer of aggregation of the data. Although no reference variants are provided, dbVar contains large-scale genomic variants such as deletions, translocations, and inversions. Studies are assigned unique identifiers of the form nstd (when the submission comes through NCBI) or estd (when the initial submission is made to the Database of Genomic Variants archive, DGVa). In this way, robust information about the methods and analyses used to make the variant calls can be captured. Each variant call is assigned an accession containing ssv (nssv or essv, as above). The submitter can also submit information about the region that the submitter identifies from those variant calls; the accession assigned to this variant region begins with either nsv or esv. Note that the nsv/esv accession is not comparable to the rs accession of dbSNP, because the nsv is an aggregate within a study or submission, not across submissions. Although information about variation is maintained in distinct databases, the representation in those databases is being standardized to improve searching, reporting, evaluation, and analysis. For example, representations of the types of variation (single nucleotide, insertion, copy number gain), the consequences of that variation (nonsense, missense, frameshift), and the molecular consequences (exon loss) are harmonized to terms from the Sequence Ontology (http://sequenceontology.org/).
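To make the two submission formats accepted by dbSNP concrete, the sketch below renders the same simple substitution as a VCF-style tuple and as an HGVS genomic (g.) expression. It deliberately handles single-base substitutions only; insertions, deletions, and normalization rules are out of scope, and the accession and coordinate shown are illustrative.

    def hgvs_g_substitution(accession, pos, ref, alt):
        """HGVS g. expression for a single-base substitution (1-based pos)."""
        if len(ref) == 1 == len(alt) and ref != alt:
            return f"{accession}:g.{pos}{ref}>{alt}"
        raise ValueError("this sketch handles simple substitutions only")

    # VCF columns CHROM/POS/REF/ALT map onto a sequence accession and g. position
    vcf_record = ("NC_000017.10", 41245466, "G", "A")
    print(hgvs_g_substitution(*vcf_record))   # NC_000017.10:g.41245466G>A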

Archives of Variant/Phenotype Relationships
Genotype calls produced by alignment software number on the order of 100,000 variants for an exome and 600,000 variants for a genome [32]. When restricting variant types to nonsynonymous, nonsense, insertion/deletion, and splice site alterations, filtering on allele frequency can reduce the number of potentially causal variants by about two orders of magnitude [32]. After such filters are applied, the remainder of the analysis relies on archives of variant/phenotype relationships. Multiple databases available from NCBI report relationships between sequence variation and phenotype. This section provides an overview of dbGaP, ClinVar, OMIM, DECIPHER, and the Catalog of Somatic Mutations in Cancer (COSMIC), and also touches on the value of the literature for informing genotype–phenotype correlations. Table 12.2 describes databases and datasets for variants in the context of phenotype.

dbGaP
dbGaP [33,47] archives, curates, and distributes information produced by studies investigating the interaction of genotype and phenotype. It was launched in response to the development of NIH's GWAS policy and provides unprecedented access to very large genetic and phenotypic datasets funded by the National Institutes of Health and other agencies worldwide. There is unrestricted access to a subset of data; scientists from the global research community may apply for access to what are deemed "Controlled Access" data. dbGaP includes individual-level molecular and phenotype data, analysis results, medical images, descriptions of studies, and documents that contextualize phenotypic variables, such as research protocols and questionnaires.


TABLE 12.2 Databases and Datasets for Variants in the Context of Phenotype

NHGRI GWAS Catalog (http://www.genome.gov/gwastudies/): Summary of GWAS studies, curated from published literature. [25]

OMIM (http://omim.org/): Subset of variants in genes reported to provide evidence about the relationship between human phenotypes and genes. Curated from published literature. [29]

dbGaP (http://www.ncbi.nlm.nih.gov/gap): Archive of studies and datasets investigating the interaction of genotype and phenotype. Provides both unrestricted and controlled forms of access. A designated NIH repository for NIH-funded GWAS results. [33]

ClinVar (http://ncbi.nlm.nih.gov/clinvar/): Archive of submitted interpretations of medically related variants. Aggregates data from multiple submitters to expose any inconsistencies and reports a ranking of the level of supporting evidence for any interpretation. [34]

Human Gene Mutation Database (HGMD) (http://www.hgmd.org/): Collation of published genetic germ line variation in nuclear genes underlying or associated with human inherited disease. Public and professional versions. [35]

HGVS Locus Specific Mutation Databases (http://www.hgvs.org/dblist/glsdb.html): Aggregation of gene-specific databases of observed variation. [36]

COSMIC, the Catalog of Somatic Mutations in Cancer (http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/): Curated catalog of genes that undergo somatic mutation in human cancers, supported by the Wellcome Trust Sanger Institute. Contains information on neoplasms and other related samples, with somatic mutation frequencies. Data are extracted from primary literature and curated. [37]

The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/abouttcga/overview): Comprehensive and coordinated effort to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing. [38,39]

DECIPHER (https://decipher.sanger.ac.uk/): Repository of genetic variation with associated phenotypes that facilitates the identification and interpretation of pathogenic genetic variation in patients with rare disorders. [40]

International Society for Gastrointestinal Hereditary Tumors (InSiGHT) (http://www.insight-group.org/variants/classifications/): International multidisciplinary scientific organization working to improve the quality of care of patients and families with hereditary gastrointestinal tumors. [41]

GeneReviews (http://www.ncbi.nlm.nih.gov/books/NBK1116/): Expert-authored, peer-reviewed disease descriptions presented in a standardized format and focused on clinically relevant and medically actionable information on the diagnosis, management, and genetic counseling of patients and families with specific inherited conditions. [42]

UniProtKB (UniProt Knowledgebase) (http://www.uniprot.org): Component of the UniProt Consortium; an expertly curated database and central access point for integrated protein information, with cross-references to multiple sources. [43]

PharmGKB (http://www.pharmgkb.org/): Knowledgebase that collects, curates, and disseminates information about the impact of human genetic variation on drug responses. [44,45]

ISCA/ICCG, the International Standards for Cytogenomic Arrays Consortium (ISCA), now known as the International Collaboration of Clinical Genomics (ICCG) (http://www.iccg.org/): International group of laboratories, physicians, and genetic counselors focused on standardization, data sharing, and collaboration to improve the quality of genomic testing. Initially organized around structural variation, ICCG now has an expanded focus that also includes sequence-level variation. [46]

GTEx, the Genotype-Tissue Expression project browser (http://www.ncbi.nlm.nih.gov/gtex/test/GTEX2/gtex.cgi/): Resource database and associated tissue bank for the scientific community to study the relationship between genetic variation and gene expression in human tissues. [28]


Submitted data are quality controlled and curated by dbGaP staff before being released. At present, phenotypic measures are mapped to MeSH terms, but a goal is to link to MedGen as well. Use of dbGaP depends on the user's level of access. Information about studies, summary-level data, and documents related to studies can be accessed freely on the dbGaP website (http://www.ncbi.nlm.nih.gov/gap). This makes it possible for the general user to determine whether a study is registered, and if so, the variables (phenotypes) being assessed and the groups that have been authorized for access. Individual-level data can be accessed only after a Controlled Access application, stating research objectives and demonstrating the ability to adequately protect the data, has been approved (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login).

ClinVar
ClinVar, NCBI's database of clinical variation [34], archives submitted interpretations of variation–phenotype relationships along with evidence related to the description of any variant and the interpretation of its clinical significance. Submissions are validated with respect to the description of the variant, and terms are mapped to standard values to support aggregation of data from multiple submitters. If descriptions of variants novel to NCBI are submitted via ClinVar, the data are also accessioned by dbSNP or dbVar as appropriate. Names of disorders are standardized to records in MedGen. As part of the aggregation process, records are rated via a star system to provide a quick visual check of the level of confidence that should be placed in any interpretation. ClinVar is not curated by NCBI staff; it depends on primary submissions, and on submissions by expert panels (3-star rating) or practice guidelines (4-star rating), to provide clinical grade data.

OMIM
In addition to descriptions of Mendelian disorders, the allelic variant section of a gene-specific OMIM record is an annotated extraction from the literature describing a subset of variants with functional significance. Each OMIM allelic variant record is loaded automatically into ClinVar and computationally converted to a genomic location whenever possible. OMIM adds links to ClinVar from each allelic variant.

DECIPHER
DECIPHER is an online database of variation and phenotypes for rare disorders, begun in 2004 to foster international collaboration in collecting and cataloging genotype–phenotype information [40]. By early 2014, DECIPHER contained data for over 21,000 patients, with consented data available for public release on about 9000 patients and 15,000 variants. Initiated as a resource for array-based detection of copy number variations (CNVs), DECIPHER now includes both CNV and sequence variation. Patient variant data that are uploaded or queried are shown in conjunction with potentially causal variants reported in other patients, and common phenotypes among deposited patients in overlapping positions are highlighted. Various tools are also provided to predict the molecular consequences of sequence variation. Phenotype data are now based on HPO, enabling interoperability with other databases such as ClinVar. Access to the data is based on two tiers: the first tier is within a single center, and the second tier is fully public, subject to explicit consent. Access to a bulk dataset is possible for research through signed data access agreements.
COSMIC
COSMIC, the Catalog of Somatic Mutations in Cancer (http://www.sanger.ac.uk/cosmic), contains comprehensive information about publications, samples, and variations relevant to human cancers [37]. Supported by the Wellcome Trust Sanger Institute, the database contains information on benign neoplasms and proliferations, in situ and invasive tumors, recurrences, metastases, and cancer cell lines. The provision of variant frequency data is enabled by also incorporating data about samples that are negative for specific variations. Data are extracted from the primary literature and curated. The data model supports querying by tissue, histology, or gene. COSMIC maintains and curates a list of genes that undergo somatic mutation in human cancer, with an emphasis on genes that are not the subject of dedicated databases (i.e., locus-specific databases (LSDBs)). Several types of data are available for download in various formats, such as a complete export of COSMIC data and files with assorted types of variations (e.g., fusions, structural rearrangements). The Cancer Gene Census is a curated catalog of genes for which variations have been implicated in cancer causation; it contains both somatic and germ line variations and can be sorted by these and other attributes, such as variation type and chromosomal location.


In addition to incorporating data from several WGS studies, COSMIC has begun to include gene expression data from TCGA. The CONAN copy number analysis tool evaluates loss of heterozygosity, homozygous deletions, and amplifications in the COSMIC dataset.

PubMed/PubMedCentral
The published literature, accessed at NCBI via PubMed (http://www.ncbi.nlm.nih.gov/pubmed) or PubMedCentral (free full text; http://www.ncbi.nlm.nih.gov/pmc/), continues to be an archive of information about the relationship between genotype and phenotype. A query as simple as "novel human mutation" (http://www.ncbi.nlm.nih.gov/pubmed/?term=novel+human+mutation), perhaps qualified with a gene symbol or a disease name (http://www.ncbi.nlm.nih.gov/pubmed/?term=novel+human+mutation+noonan), may be sufficient to identify sources of information of interest.

DATA ACCESS AND INTERPRETATION

The benefit of archiving data in central repositories, from both large-scale projects and more focused research, is especially apparent when trying to determine what information is available about human variation. Rather than jumping from one site to another and trying to integrate data generated by various projects, users can begin at one site, view the data that have been integrated, and then navigate to the resources that provided the data to follow up as required. Table 12.2 describes databases and datasets for variants in the context of phenotype, and Table 12.3 organizes tools and interfaces for variation resources. There are many approaches to determining whether variation has been identified in a region of the genome and what is known about that variation. Rather than attempting to provide detailed information about all approaches, this section summarizes some frequent use cases; a tabular summary is given in Table 12.4. More detailed documentation can be found by following links from http://www.ncbi.nlm.nih.gov/variation/ or http://www.ncbi.nlm.nih.gov/guide/variation/#howtos_.

By One or More Genomic Locations
Many resources at NCBI support querying by genomic location. Some allow querying on both current and previous reference assemblies; others are restricted to the current assembly only. Location-based queries can be explicit, i.e., by entering a chromosome and coordinates on that chromosome, or indirect, by querying on the name of a feature, such as a gene or a variant, which has a known genomic location. The results may often be filtered, e.g., by type of variant (single nucleotide, insertion, deletion, copy number loss), clinical significance, molecular consequence (e.g., missense, nonsense, splice donor site), or MAF in one or more large-scale projects; the latter is particularly useful for separating common from rare variation. Most resources support downloading the primary or filtered result set and can also be accessed through application programming interfaces (APIs). Figure 12.1 shows an example of an interactive approach using Variation Viewer, a relatively recent tool at NCBI that integrates data from ClinVar, dbSNP, and dbVar to provide a single interface for reviewing human variants. NCBI's 1000 Genomes Browser (Table 12.3) enables interactive, graphical review of sequence alignments and variant calls from the 1000 Genomes Project in the context of other variation submitted to dbSNP; this interface is especially useful for reviewing allele frequencies in selected populations across a region of interest, rather than one variant at a time. NCBI also provides a beacon function (Table 12.3) that allows querying of controlled-access data to determine whether information has been reported at a specific location, based on data from the Sequence Read Archive (SRA) and archived VCF files. For batch processing, NCBI provides Variation Reporter (Table 12.3). Users can upload a file containing locations of interest, and Variation Reporter reports what is known about each location, and the alleles at that location, with links to more information in dbSNP, dbVar, ClinVar, and PubMed. If NCBI's databases have no information at a location and the uploaded file reports alleles, Variation Reporter will predict the functional consequence of each allele based on a specific genome annotation release. Another approach to large-scale processing is to use files from NCBI's FTP sites. The archival databases maintain resource-specific sites, as well as one for the 1000 Genomes Project. These files include the genomic location

III. INTERPRETATION

206 TABLE 12.3

12. REFERENCE DATABASES FOR DISEASE ASSOCIATIONS

Tools and Interfaces

Name

Overview

References/URLs

PheGenI

Aggregates information from GWAS and GTex approaches and supports searches by phenotype, location, SNP, and gene.

[48]

Combines population-based studies of human genes in PubMed with machine learning and offers Genopedia and Phenopedia tools.

[26]

The browser displays current results of the Genetic Testing Reference Materials Coordination Program (GeT-RM) for NGS validation.

http://wwwn.cdc.gov/clia/Resources/ GetRM/

Tool to search, browse, and navigate variations in genomic context. You can review data from dbSNP, dbVar, or ClinVar, or you can upload your own data.

http://www.ncbi.nlm.nih.gov/books/ NBK169030/#AboutVariation.Related_Tools

Tool for accessing information about variation based on location information you upload.

http://www.ncbi.nlm.nih.gov/books/ NBK169030/#AboutVariation.Related_Tools

HuGENavigator

GeT-RM

Variation Viewer

Variation Reporter

http://www.ncbi.nlm.nih.gov/gap/phegeni

http://hugenavigator.net/HuGENavigator

http://www.ncbi.nlm.nih.gov/variation/ tools/get-rm/

http://www.ncbi.nlm.nih.gov/variation/ view/

http://www.ncbi.nlm.nih.gov/variation/ tools/reporter ftp://ftp.ncbi.nlm.nih.gov/pub/factsheets/ Factsheet_Variation_Reporter.pdf Clinical remap

Converts files of descriptions of variants to a genomic, RefSeqGene, or LRG coordinates.

http://www.ncbi.nlm.nih.gov/books/ NBK169030/#AboutVariation.Related_Tools http://www.ncbi.nlm.nih.gov/genome/tools/ remap#tab5rsg ftp://ftp.ncbi.nih.gov/pub/factsheets/ Factsheet_Remap.pdf

1000 Genomes Browser

After finding a region of interest by submitting a query or browsing, you can review alignments supporting variant calls, display allele frequencies, or allele counts.

ftp://ftp.ncbi.nih.gov/pub/factsheets/ Factsheet_1000genomes_browser.pdf

NCBI Beacon

Reports whether or not variation has been reported at a specified genomic location based on data to which access may be otherwise restricted.

http://www.ncbi.nlm.nih.gov/projects/ genome/beacon/

Leiden Open Variation Database (LOVD)

Open source database architecture to support processing and rendering information about variation.

[49]

Universal Mutation Database (UMD)

Freely available database architecture to support processing and rendering information about variation.

[50]

http://www.ncbi.nlm.nih.gov/variation/ tools/1000genomes/

http://www.lovd.nl

www.umd.be

These files include the genomic location of variants, in addition to other information. For example, ClinVar's FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/README.txt) points to files containing information about short variants in the commonly used VCF format. These files range in scope from comprehensive, to the subset of variants in ClinVar, to the set of common variants not reported to be medically relevant. The latter is quite useful for identifying locations on the genome where, on the argument of allele frequency, an allele is unlikely to be pathogenic. ClinVar also provides a comprehensive dataset in XML format, which packages location data with the rich information about phenotype and other observations associated with each allele observed at a location.
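To make the use of these files concrete, the sketch below screens a laboratory's variant list against the common, not-medically-relevant VCF just described. It is a minimal illustration in Python, not production code: the file names are hypothetical placeholders, and matching is done naively on (chromosome, position, ref, alt) without the allele normalization a real pipeline would need.

```python
# Minimal sketch: remove known common, not-medically-relevant alleles from a
# laboratory VCF. File names are hypothetical placeholders.

def vcf_alleles(path):
    """Collect (chrom, pos, ref, alt) keys from a VCF file."""
    alleles = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            chrom, pos, _vid, ref, alts = line.split("\t")[:5]
            for alt in alts.split(","):  # expand multiallelic records
                alleles.add((chrom, pos, ref, alt))
    return alleles

common_benign = vcf_alleles("common_no_known_medical_impact.vcf")

with open("patient_variants.vcf") as fh, open("filtered.vcf", "w") as out:
    for line in fh:
        if line.startswith("#"):
            out.write(line)  # pass header lines through unchanged
            continue
        chrom, pos, _vid, ref, alts = line.split("\t")[:5]
        # Drop a record only if every alternate allele is a known common,
        # not-medically-relevant allele; otherwise keep it for review.
        if all((chrom, pos, ref, a) in common_benign for a in alts.split(",")):
            continue
        out.write(line)
```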

TABLE 12.4 How to Find Variants at NCBI Based on Genomic Location, Gene, or Reported Phenotype

Tool, Database, or Dataset | Approach | Methods

FIND VARIANTS BASED ON GENOMIC LOCATION
Variation Viewer | Supports queries by chromosome and location and by cytogenetic band, for both GRCh37 and GRCh38. Can also identify genomic locations by the names of features that define a location, such as a gene symbol, variant identifier (rs, nsv, esv), or name of a disorder. If multiple results are returned by a query, you can select which to display. Within a gene, you can also navigate from exon to exon. The full result set can be downloaded, or filters can be applied to restrict the results to a subset of interest. | Interactive; API
Variation Reporter | Upload a file, or paste in data, describing locations of interest in HGVS, BED, VCF, or GVS format. Results are returned as a table for display or, with even more content, for download. | Interactive
Gene | The default display of the Genomic regions, transcripts, and products section of a specific Gene record renders a subset of variation annotation. The Variation section of the Gene record provides a link to Variation Viewer. | Interactive; E-Utilities
ClinVar | The Advanced search option helps you construct queries combining chromosome and base position (GRCh38 only). The full result set can be downloaded, or filters can be applied to restrict the results to a subset of interest before downloading (Figure 12.2). | Interactive; E-Utilities; FTP
1000 Genomes Browser | Supports queries by chromosome and location and by cytogenetic band, currently only for GRCh37/hg19. Can also identify genomic locations by the names of features that define a location, such as a gene symbol, rs number, or phenotype. If multiple results are returned by a query, graphical ideograms show an overview of the results and you can select which to display. Within a gene, you can also navigate from exon to exon and to neighboring genes. The full result set can be downloaded. | Interactive
dbSNP | The Advanced search option helps you construct queries combining chromosome and base position. For dbSNP, the organism human must also be supplied. The full result set can be downloaded, or filters can be applied to restrict the results to a subset of interest. | Interactive; E-Utilities; Batch query
PheGenI | The basic query interface supports query by chromosome and base position on that chromosome (GRCh38 only). | Interactive
NCBI Beacon | Specify a location and a variant allele, and the beacon will report whether there are any data at that location for that allele. | Interactive
VCF files | dbSNP and ClinVar provide VCF files reporting the locations of rare and common variation, and of common variation not known to have clinical significance. | Scripts

FIND VARIANTS WITHIN A GENE
ClinVar | The basic query interface supports query by gene symbol and provides a hint to restrict the results to records where the symbol represents the gene in which a variant is found. The full result set can be downloaded, or filters can be applied to restrict the results to a subset of interest before downloading (Figure 12.2). | Interactive; E-Utilities
dbSNP | The basic query interface supports query by gene symbol. Remember to restrict the results to human (easily done on the web interface by applying the human filter at the upper right). You can display a single record by clicking on the rs number or generate a gene-specific page by clicking on GeneView next to the gene symbol. | Interactive; E-Utilities
dbVar | The basic query interface supports query by gene symbol. The results can be either studies or variant call regions. | Interactive; E-Utilities
Gene | The basic query interface supports query by gene symbol. Remember to restrict the results to human. Other than the annotation viewed in the Genomic regions, transcripts, and products section, variation data are accessed primarily by following links in the Phenotypes and Variation sections. | Interactive
Variation Viewer, 1000 Genomes Browser, PheGenI | These viewers also support queries by gene. See the overview in Find Variants Based on Genomic Location above for more details. | Interactive

FIND VARIANTS BASED ON REPORTED PHENOTYPE
ClinVar | The basic query interface supports query by the name of a condition or clinical finding, a MIM number, or a MedGen concept identifier (CUI). Query by identifiers from HPO, Orphanet, and EFO will be added before the end of 2014. The query is currently limited to an explicit term, although query by nodes in a hierarchy is being discussed. | Interactive; E-Utilities
dbVar | The basic query interface supports query by phenotype (clinical finding). The finding terms are standardized according to MedGen, so it is possible to follow links from dbVar, to findings in MedGen, to all disorders with that clinical feature, to records in ClinVar for that set of disorders. | Interactive; E-Utilities
Gene | The basic query interface supports query by a disease name or a MIM number. Thus all the gene records retrieved by a query on a disease name can be used to find all variants for that set of genes. | Interactive; E-Utilities
PheGenI | The basic query interface supports query by one or more phenotypes, with the ability to set a P-value, or by genotype based on chromosomal location/base-pair range, gene symbol or ID, or SNP rs number. Gene expression data are integrated from GTEx. | Interactive; Download annotated tables

By Gene

In addition to approaches that retrieve variants by the genomic location a gene defines, it is also possible to generate reports based on text alone. Users can start at the Gene database, identify a gene of interest, and follow the links to ClinVar, dbSNP, dbVar, or PheGenI from the Variation or Phenotypes sections (alternatively, a search can begin at each specific resource). For comprehensive data, use dbSNP or dbVar; for variation submitted because of some connection to phenotype, use ClinVar; for results from GWAS, use PheGenI. Figure 12.2 shows an example of finding variants in a gene using ClinVar; a programmatic version of the same query is sketched below.
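For readers who prefer the E-Utilities route listed in Table 12.4, the following minimal Python sketch searches ClinVar for records in a gene. The esearch endpoint and its db/term/retmax parameters follow standard E-Utilities usage; the gene symbol and the retmax value are arbitrary examples, and the [gene] field restricts results to variants located in that gene.

```python
# Hedged sketch: list ClinVar records for a gene via NCBI E-Utilities.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = urlencode({
    "db": "clinvar",        # query the ClinVar database
    "term": "BRCA2[gene]",  # variants whose containing gene is BRCA2
    "retmax": 20,           # number of record UIDs to return
    "retmode": "json",
})
with urlopen(f"{EUTILS}?{params}") as response:
    result = json.load(response)["esearchresult"]

print(result["count"], "ClinVar records for BRCA2")
print(result["idlist"])     # UIDs usable with esummary/efetch for details
```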

By Condition or Reported Phenotype

NCBI's variation resources standardize disease or phenotypic terms via MedGen or MeSH. Although this goal is not perfectly realized yet, MedGen will be the major hub for identifying phenotypic terms of interest and then finding all variant data connected to those terms in variation-specific databases. As discussed above, MedGen categorizes phenotypic terms into classes called semantic types. Within resources such as ClinVar, assertions of clinical significance are provided relative to terms in the semantic type "disease," while phenotypes observed in individuals with a variant are represented in the semantic type "findings." NCBI's Gene database reports conditions (diseases) related to a gene. PheGenI reports diseases or findings associated with variation. In broad strokes: to find variants associated with a phenotype, filterable by P-value, query PheGenI by phenotype. To find names of conditions or phenotypes and then look for variants, start at MedGen and follow links to ClinVar. To find a list of genes that may cause a known disorder, and then find all pathogenic variants in those genes, start at Gene, follow the links to ClinVar, and then restrict on pathogenicity. Or simply query ClinVar or dbVar directly for the phenotypic term. All these strategies are listed because standards for reporting gene-disease relationships, disease-disease relationships, and the phenotypic terms themselves are evolving; a combination of approaches may therefore be required. Figure 12.3 diagrams the approach of querying MedGen by a clinical feature, finding all disorders with that feature, finding all variants related to those disorders, and then filtering by pathogenicity. The sketch below shows the same linkage performed programmatically.
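The MedGen-to-ClinVar linkage of Figure 12.3 can be approximated with E-Utilities, as sketched below. The plain-text MedGen query is a simplification: reproducing the exact "ichthyosis as a clinical feature" filter shown in the figure uses the web interface, so treat this as an outline of the linking step rather than an exact replica.

```python
# Sketch: search MedGen for a term, then follow MedGen-to-ClinVar links.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def eutil(tool, **params):
    """Call one E-Utilities endpoint and return the parsed JSON."""
    query = urlencode({**params, "retmode": "json"})
    with urlopen(f"{BASE}/{tool}.fcgi?{query}") as response:
        return json.load(response)

# 1. Find MedGen records matching the phenotypic term.
search = eutil("esearch", db="medgen", term="ichthyosis", retmax=100)
medgen_ids = search["esearchresult"]["idlist"]

# 2. Follow links from MedGen into ClinVar (the "Find Related Data" step).
links = eutil("elink", dbfrom="medgen", db="clinvar", id=",".join(medgen_ids))
linksetdbs = links["linksets"][0].get("linksetdbs", [])
clinvar_ids = linksetdbs[0]["links"] if linksetdbs else []

# 3. The returned ClinVar UIDs can now be filtered, e.g., by significance.
print(len(clinvar_ids), "ClinVar records linked to the MedGen results")
```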

By Attributes of a Particular Variant

Several strategies can be used to assess the effect of variation at a particular location. Users can determine whether data have been reported about that variant, evaluate the effect on translation if the variant is in a coding region (e.g., missense, frameshift, nonsense, read-through), infer the plausibility of pathogenicity from allele frequency, or predict the effect on function using multiple tools. Resources at NCBI currently support the first three approaches.
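As an illustration of the allele-frequency line of reasoning only (this is not a clinical decision rule), the sketch below turns a minor allele frequency into a coarse qualitative judgment. The thresholds and the Hardy-Weinberg shortcut for recessive disease are simplifying assumptions for demonstration; real cutoffs depend on disease prevalence, inheritance model, and penetrance.

```python
# Illustrative only: allele frequency as evidence against pathogenicity.

def frequency_evidence(maf, disease_prevalence=1e-4, recessive=False):
    """Turn a minor allele frequency (MAF) into a coarse judgment."""
    if maf is None:
        return "absent from population datasets; frequency is uninformative"
    # For a fully penetrant dominant allele, frequency cannot much exceed the
    # disease prevalence; for recessive disease, carrier alleles may be as
    # common as roughly the square root of the prevalence (Hardy-Weinberg).
    ceiling = disease_prevalence ** 0.5 if recessive else disease_prevalence
    if maf > max(ceiling, 0.05):
        return "common; strong evidence against pathogenicity"
    if maf > ceiling:
        return "too frequent for the assumed disease model; likely benign"
    return "rare; frequency does not argue against pathogenicity"

print(frequency_evidence(0.12))   # a common polymorphism
print(frequency_evidence(2e-5))   # a rare variant
```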


FIGURE 12.3 Finding variants in ClinVar based on disorders reported to share a clinical feature. In this example, ichthyosis is entered as a query to MedGen. Because the term is in the database as a clinical feature, a message is displayed (A) asking if you want to "See MedGen results with ichthyosis as a clinical feature (74)," with 74 being the number of results. After clicking on that link, the 74 diseases with that feature are displayed. You can then find records in ClinVar linked to that set of disorders by selecting ClinVar in the Find Related Data menu (B) and clicking on Find items. The result is a set of records in ClinVar that you can then filter as you wish (C), for example by clinical significance reported as Pathogenic or Likely pathogenic.

Thus, if data have been submitted about a particular allele, Variation Reporter or Variation Viewer can be used to guide the user to information in the archival databases (ClinVar, dbGaP, dbSNP, dbVar). These archival databases support assessment of reports of pathogenicity, allele frequencies in defined populations, molecular consequences, and reported functional consequences. If no data are found, Variation Reporter will predict functional consequence based on the current annotation of the genome (i.e., determine whether a sequence change affects translation or lands in a splice junction).

Variants in the ACMG Incidental Findings Gene List

ClinVar also supports adoption of the American College of Medical Genetics and Genomics (ACMG) recommendations for reporting of incidental findings in clinical exome and genome sequencing [51] by providing an interactive table containing the 56 genes for which "known pathogenic" and "expected pathogenic" variations should be reported by testing laboratories irrespective of the indication (phenotype) for testing. Accessible from the ClinVar home page, the table (http://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/) provides the disease name and MIM number (hyperlinked to the OMIM disease record), a MedGen link to explore phenotype information (e.g., professional guidelines), a Gene link via GTR with the MIM gene number, and the variations known to ClinVar. The link to variation records in ClinVar defaults to a preset, filtered list of pathogenic and likely pathogenic variations. The standard filters for molecular consequence enable exploration of the dataset to determine whether any observed variations meet the ACMG definition of "expected pathogenic": previously unreported variations with predicted molecular consequences that lead to loss of function, in genes known to cause disease through loss of function. In addition to being presented as a table, the entire ACMG incidental findings gene list can be queried; this retrieves all variants known to ClinVar in those genes, which can then be filtered. However, since ClinVar is still in its early stages of development, there are caveats for the use of this table, including that ClinVar is not yet comprehensive and may not contain variations that have been observed and possibly reported elsewhere (i.e., variations that are not truly novel). Importantly, the accuracy of designations of clinical significance relies on broad participation in ClinVar by the submitter community, curated submissions by expert panels, and the development of professional guidelines. Nonetheless, no other resource currently provides such a readily accessible collation of data for the ACMG incidental findings gene list along with an agile interface to explore the clinical significance and molecular consequences of the variants.

DETERMINATION OF VARIANT PATHOGENICITY

It is important to note that the assessments of pathogenicity reported by NCBI's databases are not determined within NCBI; the databases reflect what is submitted from groups that are funded for evidence-based review. Sources of data in ClinVar include genetic testing laboratories, semiautomatic data flows from OMIM and GeneReviews, LSDBs, research studies, and community projects such as the Sharing Clinical Reports Project for BRCA1 and BRCA2 (http://www.ncbi.nlm.nih.gov/clinvar/docs/datasources/). The promise of collaborations with groups such as the Clinical Genome Resource (ClinGen), Pharmacogenomics Knowledge Base (PharmGKB), Clinical Pharmacogenetics Implementation Consortium (CPIC), Evidence-Based Network for the Interpretation of Germline Mutant Alleles (ENIGMA), and International Society for Gastrointestinal Hereditary Tumors (InSiGHT) to evaluate available data and maintain assessments of clinical relevance will hopefully soon be realized.

The interpretation of sequence variants hinges on a standardized classification system [41,52,53]. A quantitative approach to variant classification (e.g., evidence to support that a variant has a 98% likelihood of pathogenicity for a given condition) is highly desirable and is available for select genes [54,55]. However, for most genes there are insufficient classification methods to perform reliable functional assessments and/or a lack of data to enable accurate probabilistic classification. Currently, clinical significance is therefore assigned using tiered levels ranging from benign through uncertain significance to pathogenic. NCBI represents the five-level classification scheme for the interpretation of sequence variants [56] recommended by the ACMG (i.e., Benign, Likely benign, Uncertain significance, Likely pathogenic, and Pathogenic; http://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/); a revised classification scheme is currently under development and is expected to be published in 2015.

The International Standards for Cytogenomic Arrays Consortium (ISCA), now known as the International Collaboration of Clinical Genomics (ICCG), has developed an evidence-based review process for genetic testing using chromosomal microarrays (CMA) to assist in the clinical interpretation of copy number variants and to help optimize array design for clinical utility [57]. The rating system was developed to evaluate the clinical significance of dosage sensitivity as assessed by CMA, but the evidence-based review model can be applied to other genomic technologies. Close collaboration between ISCA and NCBI demonstrated clear benefits of data sharing to laboratories and improvements in clinical care [58]. For example, the collaboration resulted in a consensus statement that CMA is a first-tier clinical diagnostic test, in place of the G-banded karyotype, for patients with unexplained developmental delay/intellectual disability, autism spectrum disorder, or multiple congenital anomalies [59]. Lessons learned from the ISCA experience were crystallized as important components of the ClinGen collaboration.
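Returning to the quantitative approach described above, the sketch below encodes the five-class scheme proposed for cancer susceptibility genes by Plon et al. [52], which maps a posterior probability of pathogenicity onto tiers. The thresholds follow that report; the function is illustrative only and is not a substitute for expert review.

```python
# Sketch of a probability-to-class mapping in the style of Plon et al. [52].

def classify(p_pathogenic):
    """Map a posterior probability of pathogenicity to a qualitative class."""
    if p_pathogenic > 0.99:
        return "Class 5: definitely pathogenic"
    if p_pathogenic >= 0.95:
        return "Class 4: likely pathogenic"
    if p_pathogenic >= 0.05:
        return "Class 3: uncertain significance"
    if p_pathogenic >= 0.001:
        return "Class 2: likely not pathogenic"
    return "Class 1: not pathogenic"

print(classify(0.98))   # -> Class 4: likely pathogenic
```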

Expert Panels and Professional Guidelines

The Clinical Genome Resource (ClinGen) supports evaluation of the level of review behind submissions through a four-star rating system. Currently, submitters default to the category of "single source" (one star), given the inability to gauge expertise for most submissions. Variants with two or more single-source assertions that are concordant for clinical significance are rated with two stars. A three-star rating is applied to submissions from groups designated as an "expert panel," as determined via application to the ClinGen program. A four-star rating is applied to variants for which professional practice guidelines have been published (e.g., Ref. [60]). Groups seeking Expert Panel designation are asked to submit information about the group membership, including names and member roles (e.g., medical professionals caring for patients relevant to the disease gene in question; medical geneticists, clinical laboratory diagnosticians, and/or molecular pathologists who report such findings; and researchers relevant to the disease, gene, functional assays, and statistical analyses). More than one academic or commercial institution should be represented, and potential conflicts of interest should be disclosed. Rules for variant classification should be described and made publicly available via publications and/or posting on a public web site, and agreement to post the information supporting expert panel status on the ClinVar web site is expected.


ClinGen: A Centralized Resource of Clinically Annotated Genes and Variants

The Clinical Genome Resource (ClinGen) is an NIH-funded program supporting a range of activities, encompassing collaboration with ClinVar as well as other projects. ClinGen includes three cooperative agreement grants primarily funded by NHGRI, with support from the National Institute of Child Health and Human Development (NICHD) and the National Cancer Institute (NCI). ClinGen and ClinVar are distinct but related projects: ClinVar (as discussed above) is a database developed and maintained by NCBI with input from the community, including ClinGen investigators, while ClinGen seeks to alleviate the isolation of data resources and domain experts by facilitating expert curation and sharing within the community. The purpose of ClinGen is to create a centralized resource of clinically annotated genes and variants to improve the understanding of genomic variation. Data will be harnessed from voluminous clinical and research testing results with careful attention to privacy, and expert curation efforts as well as the development of standards will be supported.

The goals of ClinGen are to (1) standardize the clinical assessment of variants and their deposition into ClinVar, (2) develop a consensus process for identifying clinically relevant variants, (3) curate genes and variants within multiple clinical domains, (4) develop machine learning algorithms to improve accuracy and throughput of variant interpretation, and (5) disseminate the results and explore integration with electronic health records. Clinically relevant variants identified through ClinGen expert curation will be submitted to ClinVar at the review level of "expert panel" (three-star rating). A clinical validity classification system will be developed at the level of the gene-disease relationship, with evidence levels ranging from definitive support for to definitive evidence against a causal role for a gene in a specific disease. Performance of the classification system will be systematically evaluated for different disease domains. An actionability scale for gene-disease pairs in presymptomatic individuals will be developed based on scoring criteria such as severity (morbidity and mortality), penetrance, effectiveness and acceptability of interventions, and degree of evidence. These and additional activities will be published, with updates accessible through http://www.clinicalgenome.org/.

Standard Setting for Clinical Grade Databases

The high specificity afforded by strategies that target genes and variants with well-documented clinical consequences is being replaced by NGS approaches that aim to increase overall sensitivity for identifying causal genetic variants. Essentially, NGS techniques have engendered a shift toward hypothesis-free genomic analyses and away from testing of specific genes and variants based on preestablished evidence of causation. Notably, the evidence-gathering and analysis activities for disease causation have migrated from the beginning of the genetic testing process (test development) to the end (interpretation of test results). Along with the relinquishment of a well-vetted set of test targets has come an outpouring of variant results, all of which are candidates for disease causality until they are either rejected based on evidence or placed into a large bin of uncertainty. The reliance of NGS result interpretation on the analysis of available evidence is magnified compared with traditional approaches, yet the evidence to support such analyses may be difficult to gather or altogether lacking. The interpretation of genetic variant data has become a rate-limiting step in the utilization of NGS data, not only because of the time and effort involved, but also because of a heavy dependence on several types of evidence that are difficult to collect for programmatic analysis and for which standards of interpretation are still under development.

As a cautionary note, good intentions to make a clinical genetic diagnosis must be tempered by strict adherence to protocols for interpretation that are established prior to the analysis of test results. An axiom of low-specificity variant results is that many candidates are generated, few or none of which are causative for the observed phenotype. Rational criteria, such as biological plausibility, may not be able to stand alone: finding a novel variant in a known disease-associated gene, or in a pathway with such genes, typically does not constitute sufficient evidence of disease causality, particularly against the background of a high false discovery rate. It should be remembered that false-positive results can have dire clinical consequences, including misdiagnosis, the delivery of inappropriate and potentially harmful medical interventions, cessation of inquiry into the true underlying cause of disease, and false reassurance for relatives who test negative for a variant with misattributed disease causation. These considerations emphasize the benefit of the review being fostered by the ClinGen project and other expert panels, which share information about the status of interpretation of clinical significance and disseminate that information via ClinVar and ClinGen.

Currently, the Food and Drug Administration (FDA) exercises enforcement discretion over laboratory-developed genetic tests, but less than 0.2% of clinical tests in the USA registered in GTR [61] report FDA approved/cleared status. In other words, FDA has not generally enforced applicable regulatory requirements. In July 2014, FDA notified Congress of its intention to publish a risk-based oversight framework for laboratory-developed tests (http://www.fda.gov/NewsEvents/Newsroom/PressAnnouncements/ucm407321.htm). Nine percent of clinical molecular tests in GTR use NGS methods, yet the FDA only recently authorized the first next-generation instrument platform and reagents [62]. The decision to grant marketing authorization rested heavily on demonstrated technical accuracy across the genome and on high-quality data to support the clinical validity of variations in the CFTR gene.

Professional societies, laboratories, accrediting organizations, and federal agencies have been developing professional guidelines and quality assurance programs that support the application of NGS technology in the clinical realm. For example, the College of American Pathologists (CAP) has published consensus recommendations to provide a framework for composing patient reports and has summarized elements for checklists used in the laboratory accreditation process for NGS [63]. In addition, the ACMG has published recommendations on clinical laboratory standards for NGS [5]. As part of these professional standards, ACMG recommends that all disease-focused and/or diagnostic testing include confirmation of the final result using a companion technology, particularly because false-positive rates are appreciable for most NGS platforms currently in use. In 2010, CAP and ACMG launched a methods-based Sequencing Educational Challenge Survey to validate laboratories' ability to correctly identify, name, and interpret sequence variants. This survey was based on electropherograms that were distributed to participants for analysis and interpretation of sequence variants (a dry challenge that did not provide DNA for participants to sequence). The survey analysis concluded that methods-based proficiency testing programs may be one part of the solution for evaluating NGS-based genetic testing [64].

The Next Generation Sequencing: Standardization of Clinical Testing (Nex-StoCT) workgroup has published guidelines as initial steps to ensure that results from NGS tests are reliable and useful for clinical decision making [6]. The Nex-StoCT workgroup collaborated to define platform-independent approaches for establishing technical process elements for analytical validity and to support compliance of NGS tests with existing regulatory and professional quality standards. The workgroup developed definitions of CLIA performance characteristics that are applicable to NGS, as well as a framework for establishing NGS test systems for clinical use based on validation of the platform, test, and informatics pipeline; quality control; proficiency testing, or alternate assessment when proficiency testing is not available; and reference materials.

The CDC has coordinated the GeT-RM program for many years and is collaborating with NCBI and participating laboratories to develop the GeT-RM web site at NCBI. The National Institute of Standards and Technology (NIST) has organized the Genome in a Bottle Consortium to develop reference materials for NGS; collaborators include FDA, NIH/NCBI/NHGRI/NCI, and CDC (http://www.genomeinabottle.org/). An important early deliverable of this consortium is the publication of a dataset of highly confident SNP and indel genotype calls for the NA12878 pilot reference genome, useful for laboratories benchmarking their NGS tests. The pilot genotype set was developed by integrating several datasets produced by different sequencing technologies, mapping algorithms, and variant callers [15]. A simple form of such a benchmark comparison is sketched below.
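The following toy comparison illustrates how a laboratory might score its calls against a truth set such as the Genome in a Bottle NA12878 genotypes. File names are hypothetical; a real comparison is restricted to high-confidence regions and handles genotype matching and variant representation, which this site/allele-level version ignores.

```python
# Toy benchmark: compare called variant alleles against a truth set.

def vcf_alleles(path):
    """Collect (chrom, pos, ref, alt) keys from a VCF file."""
    alleles = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _vid, ref, alts = line.split("\t")[:5]
            for alt in alts.split(","):
                alleles.add((chrom, pos, ref, alt))
    return alleles

truth = vcf_alleles("giab_na12878_truth.vcf")   # benchmark genotype calls
calls = vcf_alleles("lab_na12878_calls.vcf")    # the laboratory's pipeline

true_positives = len(truth & calls)
sensitivity = true_positives / len(truth)  # fraction of truth alleles found
ppv = true_positives / len(calls)          # fraction of calls in the truth set
print(f"sensitivity = {sensitivity:.4f}, PPV = {ppv:.4f}")
```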

GLOBAL DATA SHARING

The international community has been categorizing clinical variation for many years to create valuable resources. Hundreds of gene-centered collections of variation data, known as LSDBs, have been established [65,66] and further collected for ready access by the Leiden Open (source) Variation Database (LOVD). LOVD has recently released version 3.0, which extends gene-centered collection and display of DNA variations to storage of patient-centered and NGS data. The Human Variome Project (HVP; http://www.humanvariomeproject.org/) is an Australia-based organization developing standards, systems, and infrastructure for the worldwide collection and sharing of all genetic variations affecting human disease [67,68]. The ICCG is a strong proponent of data sharing and began as the ISCA Consortium, founded in 2007. ISCA showed the enormous benefits of data sharing surrounding structural variation detected by CMA testing for improving quality in clinical laboratories and patient care. The American Medical Association has adopted multiple resolutions stating that restricting data sharing is contrary to best practices of medical care and is unethical.

GA4GH and the Beacon Project

The Global Alliance for Genomics and Health (GA4GH) has organized key leaders and organizations to catalyze data sharing among the many systems that have been created. GA4GH plans to galvanize the community around two demonstration projects related to cancer, one for germline variation and the other for somatic (cancer) variation. The germline project will share data on the BRCA1 and BRCA2 genes, which cause hereditary breast-ovarian cancer. The somatic project focuses on the BRAF activating mutation known as V600E, which is used to guide therapy for melanoma [69]. An international GA4GH steering committee has been formed, and public and private diagnostic laboratories and research laboratories are invited to share anonymized genotype and phenotype data as partners in GA4GH. Key partners with data on the BRCA1 and BRCA2 genes needed to ensure success, such as the Breast Cancer Information Core (BIC), ClinVar, CIMBA (Consortium of Investigators of Modifiers of BRCA1/2), ENIGMA, LOVD, COGR (Canadian Open Genetics Repository), and others, have expressed enthusiasm about participating and sharing data.

The Beacon project is a GA4GH pilot project that addresses the feasibility of data sharing before embarking on the development of more complex infrastructure, such as the creation of APIs. It tests the willingness and technical ability of potential participants to share data by constructing a simple test of participants' intentions. Each site that intends to join GA4GH is invited to create a public web service that has no authorization restrictions and answers a simple request for data: whether or not (yes/no) the database contains information about any genomes with a specific nucleotide base (e.g., "A") at a particular genomic position (e.g., position 100,735 on chromosome 3). A "beacon" is a site offering this service. The pilot project is designed as a test with a minimal technical requirement, in which queries for genomic information cannot be construed as violating the privacy of any individual. Minimizing these potential barriers to data sharing enables a straightforward determination of the feasibility, for individual groups, of participating in GA4GH; groups that cannot create a beacon with such limited requirements are very unlikely to be able to share more complex data. Participation can be gauged by having a presence on the GA4GH beacon site (http://ga4gh.org/#/beacon). A minimal illustration of the beacon contract is sketched below.

NCBI has implemented a publicly accessible beacon (a web interface for genomic variant queries) at http://www.ncbi.nlm.nih.gov/projects/genome/beacon/. The NCBI beacon indexes raw sequence data from two sources. The first is sequence-based alleles in the SRA (http://www.ncbi.nlm.nih.gov/books/NBK47528/), aggregated from the NHLBI GO-ESP (http://www.ncbi.nlm.nih.gov/bioproject/165957). The second source is submitter-called variants submitted as VCF files from the Phase 1 data release of the 1000 Genomes Project and GO-ESP variants as reported by the Exome Variant Server.
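The beacon contract is simple enough to capture in a few lines. The sketch below is a toy in-memory version: a hypothetical set of observed (chromosome, position, allele) tuples stands in for a real database, and only a yes/no answer ever leaves the lookup. An actual beacon exposes the same question as an unauthenticated web service.

```python
# The beacon contract in miniature; the variant set is a made-up placeholder.

observed = {
    ("3", 100735, "A"),        # the example query used in the text
    ("13", 32900000, "T"),     # an arbitrary hypothetical entry
}

def beacon(chromosome, position, allele):
    """Answer a beacon query with only yes or no."""
    return "yes" if (chromosome, position, allele) in observed else "no"

print(beacon("3", 100735, "A"))   # -> yes
print(beacon("3", 100736, "C"))   # -> no
```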

CONCLUSION

There are many resources that manage information about the relationships between human variation and disease. The focus on a limited set of resources in this chapter is intended to provide enough background to improve general understanding of how data are generated, archived in public databases, and processed via multiple visualization and reporting tools. The benefits of data sharing are many. The continued evolution and adoption of standards for cataloging variant and phenotype data, and for gauging clinical significance, will support the goal of developing clinical grade variant resources for all disease-associated genes. Collaborations such as ClinGen bring this goal closer by helping to operationalize data sharing across the genetics ecosystem.

References

[1] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004;431(7011):931–45.
[2] Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature 2001;409(6822):860–921.
[3] Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
[4] Biesecker LG, Green RC. Diagnostic clinical genome and exome sequencing. N Engl J Med 2014;370(25):2418–25.
[5] Rehm HL, Bale SJ, Bayrak-Toydemir P, Berg JS, Brown KK, Deignan JL, et al. ACMG clinical laboratory standards for next-generation sequencing. Genet Med 2013;15(9):733–47.
[6] Gargis AS, Kalman L, Berry MW, Bick DP, Dimmock DP, Hambuch T, et al. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat Biotechnol 2012;30(11):1033–6.
[7] Green P. Against a whole-genome shotgun. Genome Res 1997;7(5):410–7.
[8] Weber JL, Myers EW. Human whole-genome shotgun sequencing. Genome Res 1997;7(5):401–9.
[9] She X, Jiang Z, Clark RA, Liu G, Cheng Z, Tuzun E, et al. Shotgun sequence assembly and recent segmental duplications within the human genome. Nature 2004;431(7011):927–30.
[10] Sharp AJ, Cheng Z, Eichler EE. Structural variation of the human genome. Annu Rev Genomics Hum Genet 2006;7:407–42.
[11] Church DM, Schneider VA, Graves T, Auger K, Cunningham F, Bouk N, et al. Modernizing reference genome assemblies. PLoS Biol 2011;9(7):e1001091.
[12] Chen R, Butte AJ. The reference human genome demonstrates high risk of type 1 diabetes and other disorders. Pac Symp Biocomput 2011;231–42.
[13] Dalgleish R, Flicek P, Cunningham F, Astashyn A, Tully RE, Proctor G, et al. Locus reference genomic sequences: an improved basis for describing human DNA variants. Genome Med 2010;2(4):24.
[14] Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, et al. Diversity of human copy number variation and multicopy genes. Science 2010;330(6004):641–6.
[15] Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 2014;32(3):246–51.
[16] International HapMap Consortium, Altshuler DM, Gibbs RA, Peltonen L, et al. Integrating common and rare genetic variation in diverse human populations. Nature 2010;467(7311):52–8.
[17] International HapMap Consortium. A haplotype map of the human genome. Nature 2005;437(7063):1299–320.
[18] 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491(7422):56–65.
[19] 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467(7319):1061–73.
[20] Clarke L, Zheng-Bradley X, Smith R, Kulesha E, Xiao C, Toneva I, et al. The 1000 Genomes Project: data management and community access. Nat Methods 2012;9(5):459–62.
[21] Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 2012;337(6090):64–9.
[22] Krumm N, Sudmant PH, Ko A, O'Roak BJ, Malig M, Coe BP, et al. Copy number variation detection and genotyping from exome sequence data. Genome Res 2012;22(8):1525–32.
[23] Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29(1):308–11.
[24] Church DM, Lappalainen I, Sneddon TP, Hinton J, Maguire M, Lopez J, et al. Public data archives for genomic structural variation. Nat Genet 2010;42(10):813–4.
[25] Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res 2014;42:D1001–6.
[26] Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ. A navigator for human genome epidemiology. Nat Genet 2008;40(2):124–5.
[27] Yu W, Clyne M, Khoury MJ, Gwinn M. Phenopedia and Genopedia: disease-centered and gene-centered views of the evolving knowledge of human genetic associations. Bioinformatics 2010;26(1):145–6.
[28] GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet 2013;45(6):580–5.
[29] Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®). Hum Mutat 2011;32(5):564–7.
[30] Kohler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D966–74.
[31] Brownstein CA, Beggs AH, Homer N, Merriman B, Yu TW, Flannery KC, et al. An international effort towards developing standards for best practices in analysis, interpretation and reporting of clinical genome sequencing results in the CLARITY Challenge. Genome Biol 2014;15(3):R53.
[32] Johnston JJ, Biesecker LG. Databases of genomic variation and phenotypes: existing resources and future needs. Hum Mol Genet 2013;22(R1):R27–31.
[33] Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, et al. The NCBI dbGaP Database of Genotypes and Phenotypes. Nat Genet 2007;39(10):1181–6.
[34] Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014;42:D980–5.
[35] Stenson PD, Ball EV, Mort M, Phillips AD, Shaw K, Cooper DN. The Human Gene Mutation Database (HGMD) and its exploitation in the fields of personalized genomics and molecular evolution. Curr Protoc Bioinformatics 2012; [Chapter 1: Unit 1.13].
[36] Horaitis O, Talbot Jr CC, Phommarinh M, Phillips KM, Cotton RG. A database of locus-specific databases. Nat Genet 2007;39(4):425.
[37] Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, et al. COSMIC: mining complete cancer genomes in the catalogue of somatic mutations in cancer. Nucleic Acids Res 2011;39:D945–50.
[38] Chin L, Hahn WC, Getz G, Meyerson M. Making sense of cancer genomic data. Genes Dev 2011;25(6):534–55.
[39] Chin L, Andersen JN, Futreal PA. Cancer genomics: from discovery science to personalized medicine. Nat Med 2011;17(3):297–303.
[40] Bragin E, Chatzimichali EA, Wright CF, Hurles ME, Firth HV, Bevan AP, et al. DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence and copy-number variation. Nucleic Acids Res 2014;42:D993–1000.
[41] Thompson BA, Spurdle AB, Plazzer JP, Greenblatt MS, Akagi K, Al-Mulla F, et al. Application of a 5-tiered scheme for standardized classification of 2,360 unique mismatch repair gene variants in the InSiGHT locus-specific database. Nat Genet 2014;46(2):107–15.
[42] Pagon RA, Adam MP, Ardinger HH, Bird TD, Dolan CR, Fong CT, et al., editors. GeneReviews®. Seattle, WA; 1993.
[43] UniProt Consortium. Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res 2011;39:D214–9.
[44] Thorn CF, Klein TE, Altman RB. PharmGKB: the Pharmacogenomics Knowledge Base. Methods Mol Biol 2013;1015:311–20.
[45] Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, Thorn CF, et al. Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Ther 2012;92(4):414–7.
[46] Riggs ER, Wain KE, Riethmaier D, Savage M, Smith-Packard B, Kaminsky EB, et al. Towards a Universal Clinical Genomics Database: the 2012 International Standards for Cytogenomic Arrays Consortium Meeting. Hum Mutat 2013;34(6):915–9.
[47] Tryka KA, Hao L, Sturcke A, Jin Y, Wang ZY, Ziyabari L, et al. NCBI's Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Res 2014;42:D975–9.
[48] Ramos EM, Hoffman D, Junkins HA, Maglott D, Phan L, Sherry ST, et al. Phenotype-Genotype Integrator (PheGenI): synthesizing genome-wide association study (GWAS) data with existing genomic resources. Eur J Hum Genet 2014;22(1):144–7.
[49] Fokkema IF, Taschner PE, Schaafsma GC, Celli J, Laros JF, den Dunnen JT. LOVD v.2.0: the next generation in gene variant databases. Hum Mutat 2011;32(5):557–63.
[50] Beroud C, Hamroun D, Collod-Beroud G, Boileau C, Soussi T, Claustres M. UMD (Universal Mutation Database): 2005 update. Hum Mutat 2005;26(3):184–91.
[51] Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med 2013;15(7):565–74.
[52] Plon SE, Eccles DM, Easton D, Foulkes WD, Genuardi M, Greenblatt MS, et al. Sequence variant classification and reporting: recommendations for improving the interpretation of cancer susceptibility genetic test results. Hum Mutat 2008;29(11):1282–91.
[53] Greenblatt MS, Brody LC, Foulkes WD, Genuardi M, Hofstra RM, Olivier M, et al. Locus-specific databases and recommendations to strengthen their contribution to the classification of variants in cancer susceptibility genes. Hum Mutat 2008;29(11):1273–81.
[54] Lindor NM, Guidugli L, Wang X, Vallee MP, Monteiro AN, Tavtigian S, et al. A review of a multifactorial probability-based model for classification of BRCA1 and BRCA2 variants of uncertain significance (VUS). Hum Mutat 2012;33(1):8–21.
[55] Spurdle AB, Healey S, Devereau A, Hogervorst FB, Monteiro AN, Nathanson KL, et al. ENIGMA – Evidence-Based Network for the Interpretation of Germline Mutant Alleles: an international initiative to evaluate risk and clinical significance associated with sequence variation in BRCA1 and BRCA2 genes. Hum Mutat 2012;33(1):2–7.
[56] Richards CS, Bale S, Bellissimo DB, Das S, Grody WW, Hegde MR, et al. ACMG recommendations for standards for interpretation and reporting of sequence variations: revisions 2007. Genet Med 2008;10(4):294–300.
[57] Riggs ER, Church DM, Hanson K, Horner VL, Kaminsky EB, Kuhn RM, et al. Towards an evidence-based process for the clinical interpretation of copy number variation. Clin Genet 2012;81(5):403–12.
[58] Riggs ER, Jackson L, Miller DT, Van Vooren S. Phenotypic information in genomic variant databases enhances clinical care and research: the International Standards for Cytogenomic Arrays Consortium experience. Hum Mutat 2012;33(5):787–96.
[59] Miller DT, Adam MP, Aradhya S, Biesecker LG, Brothman AR, Carter NP, et al. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. Am J Hum Genet 2010;86(5):749–64.
[60] Watson MS, Cutting GR, Desnick RJ, Driscoll DA, Klinger K, Mennuti M, et al. Cystic fibrosis population carrier screening: 2004 revision of American College of Medical Genetics mutation panel. Genet Med 2004;6(5):387–91.
[61] Rubinstein WS, Maglott DR, Lee JM, Kattman BL, Malheiro AJ, Ovetsky M, et al. The NIH Genetic Testing Registry: a new, centralized database of genetic tests to enable access to comprehensive information and improve transparency. Nucleic Acids Res 2013;41:D925–35.
[62] Collins FS, Hamburg MA. First FDA authorization for next-generation sequencer. N Engl J Med 2013;369(25):2369–71.
[63] Gulley ML, Braziel RM, Halling KC, Hsi ED, Kant JA, Nikiforova MN, et al. Clinical laboratory reports in molecular pathology. Arch Pathol Lab Med 2007;131(6):852–63.
[64] Richards CS, Palomaki GE, Lacbawan FL, Lyon E, Feldman GL, CAP/ACMG Biochemical and Molecular Genetics Resource Committee. Three-year experience of a CAP/ACMG methods-based external proficiency testing program for laboratories offering DNA sequencing for rare inherited disorders. Genet Med 2014;16(1):25–32.
[65] Samuels ME, Rouleau GA. The case for locus-specific databases. Nat Rev Genet 2011;12(6):378–9.
[66] Howard HJ, Beaudet A, Gil-da-Silva Lopes V, Lyne M, Suthers G, Van den Akker P, et al. Disease-specific databases: why we need them and some recommendations from the Human Variome Project Meeting, May 28, 2011. Am J Med Genet Part A 2012;158A(11):2763–6.
[67] Vihinen M, den Dunnen JT, Dalgleish R, Cotton RG. Guidelines for establishing locus specific databases. Hum Mutat 2012;33(2):298–305.
[68] Celli J, Dalgleish R, Vihinen M, Taschner PE, den Dunnen JT. Curating gene variant databases (LSDBs): toward a universal standard. Hum Mutat 2012;33(2):291–7.
[69] Chapman PB, Hauschild A, Robert C, Haanen JB, Ascierto P, Larkin J, et al. Improved survival with vemurafenib in melanoma with BRAF V600E mutation. N Engl J Med 2011;364(26):2507–16.

List of Acronyms and Abbreviations

ACMG: American College of Medical Genetics and Genomics
API: Application programming interface
BAM: Binary Alignment/Map
BED: Browser Extensible Data
BIC: Breast Cancer Information Core
CAP: College of American Pathologists
CDC: Centers for Disease Control and Prevention
CIMBA: Consortium of Investigators of Modifiers of BRCA1/2
ClinGen: Clinical Genome Resource
ClinVar: Database of Clinical Variation
CMA: Chromosomal microarray
CNP: Copy number polymorphism
CNV: Copy number variation
COGR: Canadian Open Genetics Repository
COSMIC: Catalog of Somatic Mutations in Cancer
CUI: Concept Unique Identifier
dbGaP: Database of Genotypes and Phenotypes
dbSNP: Database of Short Genetic Variations
dbVar: Database of Structural Variation
DECIPHER: DatabasE of Genomic Variants and Phenotype in Humans Using Ensembl Resources
DGVa: Database of Genomic Variants Archive
EBI: European Bioinformatics Institute
ENIGMA: Evidence-Based Network for the Interpretation of Germline Mutant Alleles
eQTL: Expression quantitative trait locus
FDA: US Food and Drug Administration
GA4GH: Global Alliance for Genomics and Health
GeT-RM: Genetic Testing Reference Materials Coordination Program
GIAB: Genome In A Bottle
GRC: Genome Reference Consortium
GTEx: Genotype-Tissue Expression project
GTR: Genetic Testing Registry
GWAS: Genome-wide association study
HGMD: Human Gene Mutation Database
HGP: Human Genome Project
HGVS: Human Genome Variation Society
HPO: Human Phenotype Ontology
HuGENet: Human Genome Epidemiology Network
HVP: Human Variome Project
InSiGHT: International Society for Gastrointestinal Hereditary Tumors
ISCA/ICCG: International Standards for Cytogenomic Arrays Consortium/International Collaboration of Clinical Genomics
LOVD: Leiden Open Variation Database
LRG: Locus reference genomic
LSDB: Locus-specific database
MedGen: Medical Genetics resource at NCBI
MeSH: Medical Subject Headings
MHC: Major histocompatibility complex
NCBI: National Center for Biotechnology Information
NCI: National Cancer Institute
Nex-StoCT: Next-Generation Sequencing: Standardization of Clinical Testing
NGS: Next-generation sequencing
NHGRI: National Human Genome Research Institute
NHLBI-ESP: National Heart, Lung, and Blood Institute Exome Sequencing Project
NICHD: National Institute of Child Health and Human Development
NIST: National Institute of Standards and Technology
OMIM: Online Mendelian Inheritance in Man
PharmGKB: Pharmacogenomics Knowledge Base
PheGenI: Phenotype-Genotype Integrator
refSNP: Reference SNP
SNP: Single nucleotide polymorphism
SNV: Single nucleotide variant
SRA: Sequence read archive
TCGA: The Cancer Genome Atlas
UMD: Universal Mutation Database
UMLS: Unified Medical Language System
UniProtKB: UniProt Knowledgebase
VCF: Variant call file
VUS: Variant of uncertain significance
WES: Whole exome sequencing
WGS: Whole genome sequencing


CHAPTER 13

Reporting of Clinical Genomics Test Results

Kristina A. Roberts (1,2), Rong Mao (1,2), Brendan D. O'Fallon (2) and Elaine Lyon (1,2)
1 Department of Pathology, University of Utah School of Medicine, Salt Lake City, UT, USA
2 ARUP Laboratories, Salt Lake City, UT, USA

OUTLINE

Introduction
Components of the Written NGS Report
    Patient Demographics and Indication for Testing
    Summary Statement of Test Interpretation
    Variants That May Explain the Patient's Phenotype
    Gene Name and Transcript Number
    Variant Nomenclature and Zygosity
    Variant Classification and Supporting Evidence
    Results of Familial Testing
    Confirmation of Variants by a Secondary Method
    Interpretation of the Test Result
    Recommendations
    Incidental or Secondary Findings
    Technical Information About the Assay Performed
    Methodology and Platform
    Types of Mutations Detected by the Assay
    Data Analysis and Variant Interpretation Algorithms
    Depth of Coverage and Areas of No/Low Coverage
    Analytical Sensitivity and Specificity
    Clinical Sensitivity and Specificity
    Additional Test Limitations
    Disclaimer about FDA Approval
Signature of the Person Approving the Report
Beyond the Written Report: Other NGS Reporting Issues to Consider
    Providing Raw Data to Clinicians and Patients
    Reanalysis and Reinterpretation of NGS Data
    Data Storage
Conclusion
References
List of Acronyms and Abbreviations

KEY CONCEPTS

• The product delivered by a clinical laboratory to the ordering clinician is a written report containing a detailed description of the result, interpretation of the result within the clinical context of the patient, recommendations for further testing, and technical information about the assay.
• NGS reports require special consideration due to the amount and complexity of the data generated by genomic assays. The number of variants reported may be greater than for single gene tests, and variants need to be presented and explained in a systematic fashion using standard nomenclature.
• Variant interpretation integrates multiple lines of evidence, such as the variant type and its relationship to known disease mechanisms, literature reports, population frequency, computational predictions, functional studies, and segregation studies in affected families. Variants are often classified as Pathogenic, Likely Pathogenic, Uncertain, Likely Benign, or Benign. Standardized classification guidelines are currently being developed by professional organizations.
• In addition to the variant interpretation, NGS reports must provide detailed descriptions of reported genes and their relevance to the patient's disease or phenotype. Similar to variant classifications, gene classifications may include known association, likely or possible association, and unknown association with the disease or symptoms.
• Analyzing samples from a patient's family members (either affected or unaffected) can help identify important variants and may assist in variant interpretation. NGS testing of relatives followed by appropriate variant filtering (e.g., for de novo variants or variants that fit the suspected inheritance pattern) may be more effective and efficient than testing the proband alone. If familial NGS testing is not performed, targeted testing of relatives for specific identified variants can help determine whether they are inherited or de novo. Reporting of familial results on the proband's report should be done with the family member's consent or in a manner that maintains the privacy of health information.
• The American College of Medical Genetics and Genomics recommends reporting incidental or secondary findings detected in anyone undergoing exome or genome sequencing, including patients and their relatives. These incidental findings are unrelated to the patient's symptoms, but cause "medically actionable" conditions for which preventive measures or surveillance is available. Fifty-six genes are currently included on the recommended list for the reporting of incidental findings.
• Technical information about the assay should be included on the clinical report. The methodology and limitations of the assay, as well as performance characteristics, should be described in detail. For NGS tests, some performance characteristics may be patient-specific (e.g., depth of coverage, areas of no/low coverage, filtering parameters), and any variation from the general technical performance criteria should be noted. Information on data analysis and variant interpretation algorithms may also be included.
• Written NGS reports should include information about the test's analytical sensitivity and specificity. Determining analytical sensitivity (i.e., which specific mutations an assay can and cannot detect) can be very challenging, especially for exome and genome sequencing tests. At a minimum, the types of mutations detectable by the test (e.g., single nucleotide variations, small insertions/deletions, intronic variants at intron/exon borders, copy number variations), as determined during assay validation, should be listed. The analytical specificity of NGS tests can be quite high if clinically significant variants are confirmed by a secondary method prior to reporting.
• The clinical sensitivity and specificity of traditional molecular genetic tests are typically determined through research studies of well-defined patient populations. The patient populations being tested by targeted gene panels, and especially by exome and genome sequencing, are comparatively more diverse phenotypically and less well-studied. Therefore, it may be more difficult to calculate the clinical sensitivity and specificity of these NGS tests. Estimates should be provided whenever possible.
• Clinical genomics laboratories should carefully consider a number of additional reporting issues and develop policies to manage them. Clinicians or patients may request laboratories to provide them with raw NGS data, or to reanalyze or reinterpret data as genomic knowledge improves or as the patient's symptoms evolve. The types of data stored, as well as the length of time the data will be available to clinicians, should also be considered.
• Clinical genomics is a relatively new and rapidly developing field. Best practices for data analysis, variant interpretation, and reporting are still evolving. Many of these issues are being actively debated within the genetics and pathology communities, and additional standards and guidelines are likely to be developed in coming years.


INTRODUCTION

Clinical laboratories provide genetic testing services, but the product delivered to the ordering clinician is a written report that is incorporated into the patient's medical record. This report includes patient demographics, the indication for testing, a description and interpretation of the test result, and recommendations for follow-up. Background information about the test, such as methodology, sensitivity and specificity, and technical limitations of the assay, should also be included in the report. Written reports for next-generation sequencing (NGS) tests contain the same basic components as reports for DNA sequence testing by traditional methods. However, these genomic assays and the data they generate are often very complex, and it can be challenging to communicate the details of test results effectively. Quality patient care depends on a written report that can be easily understood and properly acted upon by physicians, genetic counselors, and other medical professionals. This chapter will explore the essential elements of NGS reporting by clinical laboratories. The components of the written report will be discussed in detail, but other important reporting considerations will also be covered. These considerations include variant classification, clinician and patient access to raw NGS data, reanalysis and reinterpretation of NGS data by the clinical laboratory, and data storage. At present, clinical NGS is most often used for exome sequencing (ES) and for sequencing targeted genes on symptom-specific or disease-specific panels, but genome sequencing (GS) is also clinically available. While the reporting process for ES/GS and targeted panels is similar, there are unique considerations for each assay type, and some of these differences are discussed below. This chapter focuses on the interpretation and reporting of germline variants, but issues specific to NGS tests for somatic mutations are also described. Clinical genomics is a relatively new and rapidly developing field; best practices for data analysis, variant interpretation, and reporting are still evolving, and many of these issues are being actively debated within the genetics community.

COMPONENTS OF THE WRITTEN NGS REPORT

One source for general principles for the reporting of molecular genetic test results is the American College of Medical Genetics and Genomics (ACMG) Standards and Guidelines for Clinical Genetics Laboratories, Section G17 [1]. The ACMG Practice Guideline on NGS provides additional guidance on reporting standards for genomic tests, including targeted gene panels and ES [2]. Sample reports for both types of tests can be found in the online supplementary data that accompanies the ACMG guideline. In this section, we describe the main components of the written report and highlight aspects of the reporting process that are unique to NGS.

Patient Demographics and Indication for Testing

Patient demographic information, such as name, gender, date of birth, and ethnicity/race, is typically placed toward the top of the report, along with the indication for testing and other pertinent clinical information. Traditionally, molecular genetic testing has focused on a specific disease, a single gene, or targeted mutations. Indications for this type of testing include molecular confirmation of a specific clinical diagnosis, determination of carrier status, or predictive testing in a presymptomatic individual. However, for genomic assays and especially for ES/GS, the indication for testing may be a set of symptoms observed in the patient but not yet linked to a particular disease or gene. In this case, a description of the patient's clinical findings and relevant family history is included on the report and genomic variants are interpreted within this specific clinical context. As the clinical uses of NGS expand, the indications for genomic testing are also expanding. Some carrier screening panels now use NGS technology to determine an individual's carrier status for a wide array of recessive disorders. For NGS assays that detect somatic mutations in tumor tissue, indications may include diagnosis, prognostic prediction, and/or identification of potential therapeutic targets to guide treatment decisions.

Summary Statement of Test Interpretation

A succinct interpretive statement summarizing the test results should be included near the top of the report. This statement typically indicates whether one or more pathogenic or potentially pathogenic mutations were identified during testing. Since ES and GS yield a large number of potential mutations, summary statements for these tests may instead indicate whether mutations were identified in any genes that provide a likely or plausible explanation of the patient's symptoms. Reports of somatic NGS testing may indicate whether mutations were detected in genes for which targeted therapies are currently available. Both the gene and the identified variant(s) are addressed briefly in this summary statement (e.g., "A pathogenic mutation was identified in a gene that is likely associated with the patient's symptoms"). The details of any identified variants are not included in this brief result line but are listed later in the report.

Variants That May Explain the Patient's Phenotype

Determining which sequence variants to include on the written report is a complex task that requires expert analysis and interpretation. The expanded sequencing capacity of NGS is changing the way clinicians and laboratories approach genetic diagnosis, but the size of the data sets results in significant interpretation challenges. These challenges stem from the fact that there is limited information about the clinical significance of many identified variants and, in the case of ES/GS, a significant number of variants are located in genes or genomic regions of unknown function. A variety of filtering strategies can be employed to reduce the number of variants under consideration and to focus on a subset most likely to be clinically significant. Initial variant filtering is often based on population frequency and the suspected inheritance pattern in the family, as a variant with a population frequency greater than the incidence of disease (or greater than the carrier rate for recessive disorders) is unlikely to be causative. Intergenic variants, intronic and synonymous variants unlikely to affect splicing, or variants outside of a predetermined region of interest (e.g., areas of homozygosity or loss of heterozygosity identified by cytogenomic microarray) are often filtered out. Additional parameters, such as NGS results from other affected or unaffected family members, or comparison of germline mutations to those identified in a tumor sample, may also be included in the filtering algorithm. The choice of filtering strategy and settings is frequently case-specific and requires careful consideration to avoid either overfiltering or underfiltering the detected variants. Laboratories may choose to take a stepwise approach to variant filtering, which allows them to look first for obvious mutations that clearly explain the patient's symptoms before expanding their search to include additional variants.

Once a basic filtering algorithm has been applied to the data set, the remaining variants must be evaluated in the context of the patient's clinical findings and family history. This may involve manually investigating whether the phenotypes associated with disruption of a given gene match the patient's phenotype, and/or the use of an additional automated analysis tool (e.g., Ingenuity Variant Analysis or VarRanker) to help identify and prioritize genes that may be functionally or phenotypically relevant. Such tools are particularly useful for ES/GS data sets, where thousands of variants may remain after initial filtering steps. Ingenuity's proprietary Variant Analysis software leverages a large, curated database to prioritize variants in genes known to be associated with the phenotypes in question. VarRanker provides more limited functionality and uses several publicly available databases to identify gene-phenotype associations [3]. Variants in genes deemed to be relevant to the patient's phenotype must be investigated in further detail, as described below.

When developing targeted NGS panels, laboratories ensure that each gene included on the panel has established clinical validity. Because these panels are usually disease-specific or symptom-specific, every gene is likely to be relevant to the patient's clinical phenotype, and therefore, in most instances, all variants identified on a targeted gene panel (after basic filtering algorithms are applied) should be evaluated and classified.
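The first-pass filtering logic described above can be made concrete with a short sketch. The following Python fragment is illustrative only: the Variant fields, the consequence labels, and the 0.1% frequency cutoff are assumptions for demonstration rather than values drawn from this chapter, and a production pipeline would operate on annotated VCF records rather than an in-memory list.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Variant:
        gene: str
        consequence: str                 # e.g., "missense", "synonymous", "intergenic"
        population_af: Optional[float]   # highest allele frequency in reference populations
        in_region_of_interest: bool      # e.g., inside a homozygosity region from microarray

    # Consequence classes routinely removed in a first pass (unless splicing is suspected).
    LOW_PRIORITY = {"intergenic", "synonymous", "deep_intronic"}

    def passes_first_pass(v: Variant, max_credible_af: float) -> bool:
        """Mimic the initial filters described in the text: discard variants more
        common than the disease allows, low-priority consequence classes, and
        calls outside the predefined region of interest."""
        if v.population_af is not None and v.population_af > max_credible_af:
            return False
        if v.consequence in LOW_PRIORITY:
            return False
        return v.in_region_of_interest

    variants = [
        Variant("ABCA4", "missense", 0.0002, True),
        Variant("TTN", "synonymous", 0.01, True),
    ]
    candidates = [v for v in variants if passes_first_pass(v, max_credible_af=0.001)]
    # -> keeps the ABCA4 missense call; the common synonymous TTN call is removed

A stepwise workflow would simply rerun such a filter with progressively looser settings, which is one way to manage the overfiltering risk noted above.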
The written NGS report should list sequence variants that are known or likely to be relevant given the clinical indication for testing, as well as variants of uncertain significance in genes related to the patient's symptoms. Variants should be ordered according to their predicted relevance to the patient. Laboratories may choose to include additional variants, such as those in genes of unknown function or in genes not known to be associated with the patient's phenotype, in an expanded or supplemental report. The criteria for including or excluding variants from the main and/or expanded report are determined independently by each clinical genomics laboratory, as there is currently no consensus guideline within the genetics community. The following information should be provided for all variants and mutations included on the written NGS report.

Gene Name and Transcript Number

Genes listed on the report should be described using gene symbols and/or gene names approved by the HUGO Gene Nomenclature Committee. Information on approved gene names and symbols can be found at www.genenames.org. The transcript reference sequence should also be provided and is typically defined by an NCBI RefSeq accession number (e.g., NM number) along with the transcript version number. RefSeq accessions and transcript sequences are available at www.ncbi.nlm.nih.gov/refseq. If a gene has multiple transcripts, the variant should be reported using the major transcript unless it is predicted to have a greater impact on another transcript. In this case, the impact on both transcripts should be described in the report [2].

Variant Nomenclature and Zygosity

All variants should be reported in accordance with Human Genome Variation Society (HGVS) recommendations, which are described in detail at www.hgvs.org/mutnomen. Variants should be listed using both coding DNA sequence (c.) and amino acid (p.) nomenclature; nucleotide numbering should start at the A of the translation initiation codon ATG. For some variants, it may be useful to include the genomic (g.) position. If chromosomal translocations are detected, they too should be described using standard HGVS nomenclature. The zygosity of each variant (e.g., heterozygous, homozygous, or hemizygous) should also be indicated on the report. Zygosity is difficult to evaluate for somatic mutations identified in tumor samples; however, some laboratories may choose to report the allelic fraction for each somatic mutation. Mutations detected in the mitochondrial genome should be described using HGVS mitochondrial (m.) nomenclature, along with an indication of whether the variant is heteroplasmic or homoplasmic in the tissue tested.

Variant Classification and Supporting Evidence

Sequence variants are typically classified as Pathogenic, Likely Pathogenic, Variant of Uncertain Significance (VUS), Likely Benign, or Benign. The ACMG has published guidelines for the interpretation of sequence variations, which state that laboratories should provide their best estimate of a variant's clinical significance based on the evidence available [4]. However, since the evidence used for classification has not been standardized across laboratories, a variant may be classified differently by different laboratories. The laboratory genetics community, led by the ACMG, the Association for Molecular Pathology (AMP), and the College of American Pathologists (CAP), is currently working to standardize evidence guidelines and provide clinical laboratories with a more structured framework for variant classification. Reports for all NGS tests should list the laboratory's classification of each reported variant and document the supporting evidence used to classify variants with respect to their known or potential role in disease. Many different types of evidence may be considered during the variant classification process; some of the most common considerations are described below. Integrating these pieces of information to determine an appropriate classification is not a trivial task and can be especially difficult when conflicting data must be weighed.
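As a small illustration of how the elements above (approved gene symbol, versioned transcript, HGVS c. and p. nomenclature, zygosity, and classification) come together on a single report line, consider the following sketch. The function and formatting are assumptions for demonstration, not a prescribed layout; the example variant is the well-known CFTR p.Phe508del allele.

    def format_variant_line(gene: str, transcript: str, c_dot: str, p_dot: str,
                            zygosity: str, classification: str) -> str:
        """Assemble one reported variant from the elements discussed above:
        HUGO gene symbol, versioned RefSeq transcript, HGVS c. and p. names,
        zygosity, and the laboratory's classification."""
        return f"{gene} ({transcript}): {c_dot} ({p_dot}), {zygosity}, {classification}"

    line = format_variant_line("CFTR", "NM_000492.3", "c.1521_1523del",
                               "p.Phe508del", "heterozygous", "Pathogenic")
    # -> "CFTR (NM_000492.3): c.1521_1523del (p.Phe508del), heterozygous, Pathogenic"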

MUTATION EFFECT

The most basic level of variant analysis involves consideration of the type of variant detected. Mutations that are predicted to have a drastic effect on protein production (e.g., nonsense mutations, frameshifts, canonical splice site mutations, and missense mutations of the translation initiation codon) are often presumed to be pathogenic, and little supporting evidence is required to justify that classification. However, even these variants must be interpreted with caution and correlated with the molecular mechanism of disease. For example, a nonsense mutation that results in loss of protein function may not be pathogenic when found in a gene that causes disease solely through gain of function mutations. Missense variants, small in-frame deletions or insertions, and variants located near a splice site (but outside of the invariant nucleotides at the -2, -1, +1, and +2 positions) can be difficult to classify and often require significant investigation. Synonymous variants and deeper intronic variants are more likely to be benign than other variant types, but do in some cases cause disease (e.g., through creation of a cryptic splice site). When analyzing ES/GS cases, laboratories may choose to exclude synonymous and deep intronic variants from their first tier analysis and focus initially on variants that are more likely to be pathogenic.
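A coarse triage of the logic above might look like the following sketch. The consequence vocabulary and mechanism labels are invented for illustration; real classification requires the full evidence framework rather than variant type alone.

    PRESUMED_LOF = {"nonsense", "frameshift", "canonical_splice", "initiation_codon"}

    def triage_by_variant_type(consequence: str, disease_mechanism: str) -> str:
        """Apply the caveat from the text: a presumed loss-of-function variant is
        only presumed pathogenic when loss of function is the gene's known
        disease mechanism."""
        if consequence in PRESUMED_LOF:
            if disease_mechanism == "loss_of_function":
                return "presumed pathogenic (pending full review)"
            return "needs full review (mechanism mismatch)"
        if consequence in {"missense", "inframe_indel", "splice_region"}:
            return "needs full review"
        return "lower priority (synonymous/deep intronic)"

    # triage_by_variant_type("nonsense", "gain_of_function")
    # -> "needs full review (mechanism mismatch)"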

ONLINE MUTATION DATABASES

Variant investigation typically involves consulting one or more mutation databases to determine whether detected variants have been previously observed in patient populations. A variety of locus-specific databases (LSDBs) are available online; each contains information about the variants found in a particular disease-associated gene. LSDBs must be viewed critically because they are independently operated and vary widely in their standards for mutation inclusion, frequency of updates, and level of curation (e.g., some provide curated variant classifications, while others simply list all variants detected in reportedly symptomatic individuals). In addition to these gene-specific resources, a number of consolidated mutation databases are also available. Online Mendelian Inheritance in Man (OMIM; www.omim.org) is a comprehensive catalog of genes with known disease associations, and each entry includes brief descriptions of some of the mutations reported in the literature. The Human Gene Mutation Database (HGMD; a free version is available at www.hgmd.org, but access to the updated Professional version requires a subscription) provides extensive mutation information for thousands of different genes, with each mutation entry linked to the associated literature report(s). While both OMIM and HGMD are relatively well-curated, their variant classifications are often taken directly from published reports; therefore, it is important to go back to the cited primary literature to independently assess the evidence used for classification and determine whether it meets clinical standards. Specialized databases, such as the mitochondrial database MITOMAP (www.mitomap.org) and the Catalog of Somatic Mutations in Cancer (COSMIC; cancer.sanger.ac.uk), may also be useful for specific NGS applications. The genetics and pathology communities are currently working to develop clinical-grade variant databases (e.g., ClinVar; www.ncbi.nlm.nih.gov/clinvar) that are freely available, updated regularly, and include variant classifications along with detailed supporting evidence, population frequency data, and information on genotype-phenotype correlations.
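Databases like ClinVar can be queried programmatically as well as through their web interfaces. The sketch below uses NCBI's public E-utilities endpoint to retrieve ClinVar record identifiers for a variant; the query term format and error handling are simplified for illustration, and NCBI usage policies (rate limits, API keys) apply in practice.

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    def clinvar_record_ids(term: str) -> list:
        """Return ClinVar UIDs matching a search term via NCBI E-utilities."""
        params = urlencode({"db": "clinvar", "term": term, "retmode": "json"})
        with urlopen(f"{EUTILS}/esearch.fcgi?{params}") as resp:
            result = json.load(resp)
        return result["esearchresult"]["idlist"]

    # ids = clinvar_record_ids('"NM_000492.3:c.1521_1523del"')
    # Each UID can then be passed to esummary.fcgi for classification details.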

MEDICAL AND SCIENTIFIC LITERATURE

The medical and scientific literature is a key source of information about mutations identified in affected individuals. As noted above, online mutation databases often provide primary literature citations, but an independent search is also recommended to ensure that all reports, especially recent publications, are identified. Literature searches can be performed in PubMed (www.ncbi.nlm.nih.gov/pubmed), but general Internet search engines (e.g., Google) often give more comprehensive results because they search for mutation and gene names in an article's text and tables, not just in the abstract and keywords. When reading literature reports, it is essential to remember that variant classifications proposed and published by researchers may not be appropriate in a clinical diagnostic setting. Literature reports should be compiled, and correlated with additional variant information, to determine an appropriate clinical-grade interpretation of the variant's significance. References for relevant publications should be included on the written report, as this information will help clinicians understand the evidence behind a variant's classification, as well as any reported genotype-phenotype correlations that may help guide patient care.

POPULATION FREQUENCY

Information about the population frequency of detected variants can be gathered from a variety of online resources, including dbSNP (www.ncbi.nlm.nih.gov/snp), the 1000 Genomes Project (browser.1000genomes.org), and the NHLBI-ESP Exome Variant Server (evs.gs.washington.edu/EVS). Each of these resources contains frequency data for different human populations, including from diverse geographical regions or from groups with specific phenotypes. It is important to understand the makeup of each of these populations because some include patient cohorts, while others include only apparently healthy controls. In those that include patients, the population data must be interpreted with extreme care, especially if the phenotype of the patient cohort is similar to that of the patient in which the variant was detected. It is also necessary to take the inheritance pattern, age of onset, prevalence, and penetrance of the patient's condition into account when evaluating the population frequency of an identified variant. Unfortunately, in many cases, especially for ES/GS approaches, this information may not be readily available. Only independent observations (i.e., those not in the same patient or in related individuals) should be considered when analyzing variant frequency data; however, this information can sometimes be difficult to discern from databases and literature reports.
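The relationship between disease incidence and a plausible causal allele frequency can be made explicit with elementary Hardy-Weinberg arithmetic. The sketch below is a deliberately simplified model, assuming full penetrance and a single causal locus; it ignores allelic and locus heterogeneity, age of onset, and sampling error, all of which matter in practice.

    import math

    def max_credible_allele_frequency(disease_incidence: float, inheritance: str) -> float:
        """Upper bound on the population frequency of a fully penetrant causal
        allele. Recessive: incidence = q^2, so q = sqrt(incidence). Dominant
        (simplified): roughly one causal allele per affected individual."""
        if inheritance == "recessive":
            return math.sqrt(disease_incidence)
        return disease_incidence

    # A recessive condition with incidence 1/10,000 gives q = 0.01 (1%); a variant
    # observed at 5% in healthy controls is therefore an implausible sole cause.
    print(max_credible_allele_frequency(1e-4, "recessive"))  # 0.01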

COMPUTATIONAL PREDICTION PROGRAMS

Many computational programs are available to help predict whether an identified variant is pathogenic or benign. Missense variants can be evaluated using algorithms that attempt to predict the impact of an amino acid substitution by computing a numeric score representing the probability that the variant causes disease. SIFT (Sorting Intolerant from Tolerant; available at sift.jcvi.org and sift-dna.org) and PolyPhen2 (genetics.bwh.harvard.edu/pph2) are two of the most popular online prediction algorithms, but many others are available. Generally, SIFT predictions are based on the degree of amino acid conservation in closely related sequences, while PolyPhen2 uses sequence, phylogenetic, and structural information to characterize the substitution. These predictions may be useful as supporting evidence in variant classification, but they cannot be used as stand-alone evidence of pathogenicity. Potential splice site variants can be evaluated using splice site prediction tools, such as Berkeley Splice Site Prediction by Neural Network (NNSplice; www.fruitfly.org/seq_tools/splice.html) and NetGene2 (www.cbs.dtu.dk/services/NetGene2); these programs detect and score potential splice donors and splice acceptors within a submitted sequence, and comparison of the predictions for wild-type and variant sequences can indicate whether a detected variant is likely to either create a new splice site or decrease the efficiency of an existing one. The sensitivity and specificity of these prediction programs are highest at positions nearest the splice site and decline at positions further from the intron/exon boundary.
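When these scores are recorded alongside a variant, combining them into a single supporting-evidence label is straightforward. The cutoffs in the sketch below (SIFT scores of 0.05 or less called deleterious; PolyPhen2 probabilities above 0.85 called probably damaging) follow the tools' commonly cited defaults but should be verified against the versions actually run; the function itself is illustrative.

    def in_silico_support(sift_score: float, polyphen2_prob: float) -> str:
        """Summarize SIFT and PolyPhen2 output as supporting evidence only;
        per the text, these predictions are never stand-alone proof."""
        sift_deleterious = sift_score <= 0.05    # SIFT: low score = not tolerated
        pp2_damaging = polyphen2_prob > 0.85     # PolyPhen2: high prob = damaging
        if sift_deleterious and pp2_damaging:
            return "supports pathogenicity"
        if not sift_deleterious and not pp2_damaging:
            return "supports benign impact"
        return "conflicting predictions"

    print(in_silico_support(0.01, 0.97))  # "supports pathogenicity"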

EVOLUTIONARY CONSERVATION

Assessing the evolutionary conservation of the amino acid at which a substitution occurs can also assist with variant classification. Typically, this simply involves looking at a sequence alignment of the protein in multiple species; there is no standard number or group of species that should be analyzed. Variants that occur at nonconserved residues are generally considered less likely to be pathogenic, while sequence conservation may indicate that the residue is important for proper protein structure or function. Evolutionary conservation should be used only as supporting evidence in variant classification and should not be considered independent evidence if computational prediction programs that incorporate conservation information are also being used.
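In its simplest form, the alignment check described above reduces to counting how many species retain the human residue at the variant position. The one-character-per-species encoding below is an assumption for illustration; real assessments use curated multiple alignments and weighted conservation scores.

    def fraction_conserved(alignment_column: str, human_residue: str) -> float:
        """Fraction of aligned species whose residue at the variant position
        matches the human reference residue (one character per species)."""
        if not alignment_column:
            return 0.0
        return alignment_column.count(human_residue) / len(alignment_column)

    # Human plus six other species, one divergent residue:
    print(round(fraction_conserved("FFFFFYF", "F"), 2))  # 0.86 -> well conserved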

PROTEIN STRUCTURE AND FUNCTIONAL DOMAINS

Many proteins contain important structural motifs and/or functional domains. It is important to try to assess whether a detected variant falls within one of these regions and, if so, how it might impact that domain. This type of assessment may be more difficult for some proteins identified by ES/GS that are less well-studied than those with documented disease or phenotype associations.

FUNCTIONAL STUDIES

For some genetic variants, it may be possible to perform functional studies (e.g., measurement of enzyme activity, transporter activity, and metabolite levels) to help clarify the variant's significance. Some of these studies can be performed as clinical assays on patient samples, while others are only done on a research basis in academic laboratories. Information on functional studies of a particular variant may also be available in the medical and scientific literature. Functional studies must be interpreted with caution, especially if the functional deficit can be caused by mutation of more than one gene or by environmental factors (e.g., diet).

Results of Familial Testing

Familial testing is frequently done to assist in variant filtering and interpretation. Testing of a proband's parents or other family members, either by NGS or targeted sequencing, can help determine whether a detected variant is inherited or de novo, clarify whether specific variants are found in cis or trans, and indicate whether a variant segregates with disease. The information gained through familial testing is often crucial to the variant and overall test interpretations, and some indication of these findings must be included on the report. However, to comply with Health Insurance Portability and Accountability Act (HIPAA) regulations, only the minimal amount of information required to interpret the variants should be provided in the proband's report. Specific names and relationships should be avoided if possible. For example, a report can indicate that a variant was confirmed to be inherited without specifying which parent was found to carry it. Alternatively, laboratories can obtain consent from the family members to include their results on the proband's report.

Confirmation of Variants by a Secondary Method

Some NGS protocols have relatively high analytical false-positive rates compared with traditional molecular genetic tests, and for those assays it is strongly recommended that the final results of clinical NGS testing be confirmed by a companion technology, such as Sanger sequencing. NGS reports should clearly state whether the reported variants have been confirmed by a secondary method. If the report includes any nonconfirmed variants, it should indicate that follow-up confirmatory testing is necessary if the result will be used for patient care.
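The genotype logic behind trio-based familial testing can be sketched compactly. The allele-count encoding and status strings below are invented for illustration; a real pipeline would also check coverage, sample identity, and the possibility of nonpaternity before calling anything de novo.

    def inheritance_status(proband: int, mother: int, father: int) -> str:
        """Interpret a proband variant from parental genotypes, encoded as
        alternate-allele counts (0 = absent, 1 = heterozygous, 2 = homozygous)."""
        if proband >= 1 and mother == 0 and father == 0:
            return "apparent de novo (confirm coverage and sample identity)"
        if proband == 2 and mother >= 1 and father >= 1:
            return "homozygous, one allele inherited from each parent"
        if proband == 1 and (mother >= 1 or father >= 1):
            # The report can state "inherited" without naming the carrier parent.
            return "inherited"
        return "inheritance unresolved"

    print(inheritance_status(1, 0, 0))  # apparent de novo (...)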

Interpretation of the Test Result

The interpretation section of the report should discuss the potential relevance of each reported variant to the patient's phenotype. Many of the genes mentioned in genomics reports may be unfamiliar to the ordering clinician, especially if they have been linked to disease in only a few patients. Providing a brief overview of each reported gene (e.g., information about function, reported disease associations, and the inheritance pattern(s) associated with mutation of the gene) will help the clinician understand why the variant may be relevant. Including references to literature reports about the gene may also be helpful, even if the references do not describe the specific variant or mutation detected in the patient. Since NGS tests may identify variants in multiple genes that provide a plausible explanation of the patient's symptoms, the report may contain a number of alternative result interpretations. Consideration should also be given to the possibility that a complex phenotype is caused by mutations in a unique combination of genes.

Recommendations

After interpreting any detected variants in the context of the patient's clinical and family history, recommendations for additional testing are often provided in the written report. If the results of the NGS test do not explain the patient's symptoms, sequencing of additional genes may be considered (e.g., ES or GS after a targeted gene panel fails to identify causative mutations). If the sequencing test is not validated as a quantitative test, formal analysis of deletions and duplications may be advised; such testing may include cytogenomic microarray (if not previously performed), gene-specific interrogation by MLPA, or exonic level microarray. In particular, if only one variant is discovered in a recessive gene known to be related to the patient's symptoms, deletion/duplication analysis of that gene may be indicated. As with all genetic tests, at-risk family members, including those who may be affected and potential carriers, should be offered targeted testing for any identified pathogenic mutations. Genetic consultation is also recommended. Variants identified through ES/GS may point toward a possible disease mechanism that had not been previously considered. In some of these cases, nonmolecular or even nongenetic tests may prove useful in clarifying the significance of variants detected in a gene of known function. For example, one of the early ES tests identified a novel mutation in the XIAP gene, which is associated with immune disease; other laboratory methods, including flow cytometry and enzyme-linked immunosorbent assay, were able to confirm that the mutated protein was not functioning properly [5]. If an NGS result suggests that a specific functional, biochemical, or other nongenetic assay may be helpful, that information should also be included in the Recommendations section of the written report.

Incidental or Secondary Findings

NGS analysis of exomes and genomes may yield medically significant genetic findings that are unrelated to the clinical indication for testing. There is controversy over whether mutations known to cause medically actionable conditions (i.e., those for which some type of prevention or surveillance is available) should be reported to ordering physicians and their patients. The ACMG has developed recommendations regarding such incidental findings, which indicate that reporting some incidental findings would likely have medical benefit for patients and families [6,7]. According to these recommendations, individuals who undergo ES/GS and any relatives also analyzed by NGS (including unaffected members of a trio, such as parents) should receive a report of medically useful incidental findings. These reports would be provided to the ordering clinician, who would then be responsible for communicating the findings to the patient and/or family member. The ACMG incidental findings working group has developed a list of 56 genes that they deem medically actionable: 23 genes associated with increased cancer risk, 31 genes associated with cardiovascular diseases, and two genes associated with anesthesia complications. In the ACMG's view, this list represents the minimum set of genes that should be evaluated for incidental findings; laboratories may choose to expand the list of genes they analyze and report.

The ACMG working group recommends that only "known pathogenic" and "expected pathogenic" mutations be included in the report of incidental findings. Known pathogenic mutations are sequence variants that have been previously reported in the scientific literature and for which evidence is available to clearly establish that they are causative of disease. Expected pathogenic mutations have not been previously reported but are of a type that is expected to cause disease; generally, these are clear loss of function mutations (e.g., nonsense mutations, frameshifts, canonical splice site mutations, and missense mutations of the translation initiation codon) in genes where loss of function is a known mechanism of disease. The written report should include information about the disease associated with the incidental finding and relevant references to support the pathogenic variant classification. Since incidental findings are sent to the ordering physician, who may not have specialized training in management of the identified disease, the report may recommend that the patient be evaluated and, if necessary, followed by an appropriate specialist. Testing of at-risk family members is also recommended.

It is very important that the written report also provide information about the limitations of incidental findings. Reports should not only include a list of the genes analyzed for incidental findings but also make clear when the analysis is not an exhaustive evaluation of all variation within those genes. With respect to the ACMG recommendations, laboratories are to analyze the sequence data generated by their standard NGS test, but they are not expected to ensure adequate coverage of every exon by "filling in" with Sanger sequencing. In addition, as noted above, only mutations known or expected to be pathogenic at the time of the analysis should be reported; variants of uncertain clinical significance within the ACMG-recommended genes should not be included on the report. Furthermore, pathogenic mutations that lie outside of coding regions and intron/exon boundaries may not be identified by ES assays, and certain types of mutations (e.g., large deletions and rearrangements) may not be reliably detected by NGS; therefore, individuals who are "negative" for incidental findings may still have a pathogenic mutation in one or more of the ACMG-recommended genes, a fact which should also be noted in the report.

The ACMG guideline recommends providing incidental findings for all patients and family members tested by ES/GS. However, to respect patient autonomy, some laboratories currently allow patients and their relatives to choose whether the laboratory reports these findings. "Opt-in" policies require patients to indicate that they want incidental findings returned, whereas "opt-out" policies return incidental findings unless the patient directs otherwise. Reporting incidental findings, especially in genes that are not included on the ACMG list, may convey risk to the laboratory, although the level of risk depends on a number of factors, including disease severity, ability to treat the disease, the penetrance or relative risk for disease, and other risk versus benefit considerations. It is clear that laboratories performing clinical NGS testing should be aware of the ethical and patient management issues associated with incidental findings and have a policy for how these results will be reported.

Technical Information About the Assay Performed

All genetic test reports should include technical information about the assay and its performance characteristics. This information is intended to help the ordering clinician understand how the test was performed, as well as the limitations of the assay. Technical information may also be useful to outside laboratories that receive a copy of the NGS report if and when follow-up testing is requested. For traditional molecular genetic tests, general technical information is provided for the assay and these characteristics should not differ from run to run. For genomic tests, however, some technical information may be specific to an individual patient's result and must be customized on each report. To maintain clarity in the body of the report, technical information may be described in a separate section at the bottom of the written report.

Methodology and Platform

All reports should include a description of the test methodology, as well as information about the platform used for testing. Terms such as "massively parallel sequencing," "next-generation sequencing," and "genomic sequencing" are accurate general descriptions of the methodology but may not provide sufficient detail about the assay. Listing the sequencing platform and target-enrichment method (e.g., a specific PCR-based or probe capture technique) provides additional insight into the assay and its limitations. However, listing this information should not be considered a substitute for detailing the test limitations elsewhere in the technical information section of the report. Since exome enrichment is not fully efficient and there are refractory regions in each design, the expected capture efficiency based on the laboratory's validation data should be described. Secondary methods used to confirm the NGS results should also be listed, such as Sanger sequencing for single nucleotide variants, or pyrosequencing or allele-specific amplification for low-level somatic mutations, heteroplasmic mitochondrial mutations, and mosaic mutations.

Types of Mutations Detected by the Assay

The types of mutations that can and cannot be detected by each NGS assay should be clearly communicated. These are often determined during test validation or predicted by the combination of enrichment, sequencing, and bioinformatics methods used in the assay. NGS platforms typically detect single nucleotide variants and small insertions and deletions, while repetitive regions (e.g., in triplet repeat diseases) do not align well and are generally not interrogated. Copy number variation (i.e., large deletions or duplications at either the exon or gene level) should only be listed as a detected variant type if the test has been validated as a quantitative assay. Some NGS applications may target specific mutations, such as carrier testing of known recessive disease mutations, or cancer panels that include common somatic mutations, gene rearrangements, and translocations. In these cases, a list of interrogated mutations may be included with the technical background information on the report.


Data Analysis and Variant Interpretation Algorithms

Although bioinformatics pipelines are complex, including some technical information about data analysis in the NGS report may be useful. At a minimum, the specific reference sequence used for alignment (e.g., Hg19) should be indicated. It is currently recommended that exomes and targeted panels be aligned to the full genome, rather than to gene-specific reference sequences, to reduce mismapping of off-target reads. If a different approach is taken, that should be noted on the report. Information about the alignment, variant calling, annotation, and variant filtering algorithms used should also be provided. If commercial or publicly available annotation tools are utilized in the pipeline, these algorithms should be described or referenced, since different algorithms may vary in their sensitivity to detect insertions and deletions, large copy number variations, and structural chromosomal rearrangements. Variant filtering parameters, such as population frequency, suspected inheritance pattern, or filtering using familial NGS results, should also be described. A general description of variant interpretation processes and procedures may also be considered for inclusion on the report.

Depth of Coverage and Areas of No/Low Coverage

One unique aspect of NGS assays is that coverage is unlikely to be uniform throughout the targeted regions or across the genome. In general, variant calls are more reliable as depth of coverage increases. Low coverage increases the risk of missing real variants, of incorrect zygosity assignment to detected variants, and of false positives due to sequencing artifacts. Laboratories should establish a minimum coverage threshold for each of their assays, and this information should be included on the written report. For targeted gene panels, most laboratories use Sanger sequencing to "fill in" poorly covered regions. However, filling in all regions of low coverage in ES/GS is impractical; therefore, ES/GS reports should provide information about the percentage of bases that meet an absolute minimum coverage threshold (e.g., 90.2% of targeted regions have coverage greater than 10×) or a similar metric (a minimal computation of this kind is sketched at the end of this section). If specific genes related to the patient's phenotype are to be evaluated following ES (e.g., at the request of the ordering physician), any regions of these genes with no or low coverage should be listed on the report, unless they are analyzed by Sanger sequencing. If ES is used or marketed as a disease-specific test, coverage of the regions of interest should be noted.

Analytical Sensitivity and Specificity

NGS reports should include information about the analytical sensitivity and specificity of the assay. The analytical sensitivity describes the assay's ability to detect true mutations within the interrogated regions, while the analytical specificity measures the proportion of samples without pathogenic mutations that are correctly classified as negative. These parameters are generally calculated using data collected during assay validation. Greater experience with NGS platforms may allow sources and regions of systematic errors to be recognized, reducing false positives. However, because it is not possible to test all known variants by alternative methods, the empiric false-negative rate is unknown. One method used by clinical laboratories is to compare single nucleotide variants detected by NGS to data from SNP array analysis of the same sample. However, this method applies only to SNPs and may not accurately assess false-negative rates. Efforts to provide high-quality reference materials are underway through multiple organizations, including the Centers for Disease Control (CDC) and the National Institute of Standards and Technology (NIST). Ultimately, the availability of reference samples in which all mutations are characterized will be critical for accurate determination of both sensitivity and specificity. In lieu of such materials, laboratories can examine smaller, well-characterized regions, or samples that have undergone extensive Sanger sequencing. Assessment of both sensitivity and specificity should include all of the mutation types that are reported, including single nucleotide variants, small insertions and deletions, and large gene rearrangements (i.e., translocations, copy number variants, and gene conversions).

Clinical Sensitivity and Specificity

The clinical sensitivity and specificity of an assay describe its ability to correctly identify mutations in patients with the disease and to not identify mutations in individuals without the disease. For traditional genetic tests, these numbers are typically derived from research studies using patient populations with well-defined clinical symptoms. It may be possible to calculate the clinical sensitivity and specificity of targeted gene panels by aggregating gene-specific data, but it is difficult to define these parameters for ES/GS tests, as the clinical indications for testing vary widely from case to case. Clinical ES has recently been reported to have a molecular diagnostic rate, defined as identification of a mutation highly likely to be causative of the disease, of 25% [8]. This number may be improved when trios (e.g., a proband and both parents) are analyzed by NGS to assist with variant filtering and interpretation.
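Returning to the coverage metric described under Depth of Coverage above, the percentage of targeted bases meeting a threshold can be computed directly from an aligned BAM. The sketch below uses the pysam library as one common option; the file path, region list, and 10× threshold are illustrative, and a validated assay would use the laboratory's locked-down pipeline tooling rather than an ad hoc script.

    import pysam  # one common choice for reading BAM files; assumes an indexed BAM

    def percent_bases_at_depth(bam_path, regions, min_depth=10):
        """Percentage of targeted bases covered at >= min_depth, the style of
        metric suggested for ES/GS reports. 'regions' holds (chrom, start, end)
        tuples in 0-based, half-open coordinates (BED convention)."""
        covered = total = 0
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for chrom, start, end in regions:
                # count_coverage returns four per-base depth arrays (A, C, G, T)
                for site in zip(*bam.count_coverage(chrom, start, end)):
                    total += 1
                    covered += sum(site) >= min_depth
        return 100.0 * covered / total if total else 0.0

    # e.g., percent_bases_at_depth("proband.bam", [("chr7", 117119, 117308)])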


Additional Test Limitations

Any additional test limitations (i.e., those not addressed in other sections of the technical information) should be clearly stated on the report. This information will be assay-specific, but may include details such as (a) the potential for rare variants in probe or primer sites to compromise analytical sensitivity, (b) that variants in genes of unknown function are not reported, and (c) that the presence of pseudogenes may interfere with variant annotation. Pseudogenes and other homologous sequences pose a challenge for all short-read sequencing approaches. With hybridization-based enrichment, co-capture cannot be avoided, which can lead to false positives and false negatives. Careful primer design and PCR enrichment may help circumvent problematic regions.

Disclaimer about FDA Approval

Currently, most NGS assays are developed and validated by the clinical laboratories offering the test. A disclaimer should be included on reports for all tests that are not approved by the Food and Drug Administration (FDA). The ACMG suggests the following wording [1]: "This test was developed and its performance characteristics determined by the XXXX Laboratory. It has not been cleared or approved by the FDA. FDA approval is not required for clinical use of the test, and therefore validation was performed as required under the Clinical Laboratory Improvement Amendments of 1988." CAP recommends similar wording in its Molecular Pathology Checklist.

Signature of the Person Approving the Report

NGS tests are technically complex and present a wide range of interpretive challenges. The ACMG recommends that clinical reporting and oversight of NGS-based testing be performed by individuals with appropriate genetics training and certification by the American Board of Medical Genetics in medical/laboratory genetics or by the American Board of Pathology in molecular genetic pathology [2]. Individuals signing NGS reports should also have experience in the interpretation of sequence variants and expertise in the technical aspects of sequencing technologies.

BEYOND THE WRITTEN REPORT: OTHER NGS REPORTING ISSUES TO CONSIDER

Providing Raw Data to Clinicians and Patients

After the clinical genomics laboratory completes testing and issues its written report of the results, clinicians and/or patients will sometimes request access to the "raw" NGS data. In these cases, they are often interested in looking for potentially important variants (e.g., those in genes of unknown function) that were not included on the laboratory's report. They may wish to follow up on these unreported findings with research testing or simply monitor the literature for new information about the function and/or disease associations of these genes. Clinical laboratories should develop a clear policy on how they handle such requests and what type(s) of data they provide. Laboratories rarely provide truly "raw" NGS data, as some processing is required to make the data useful to the clinician or patient. NGS results may be provided as sequence alignment data, contained in a BAM file that can be viewed in the freely available Integrative Genomics Viewer (IGV; www.broadinstitute.org/igv/home). Alternatively, laboratories may provide a variant call format (VCF) file, which encodes information about the detected sequence variations. Variants included in the VCF file may differ based on the amount and type of filtering applied; therefore, laboratories need to clearly communicate the level of data they are providing. Some degree of genetics training is required to properly use BAM or VCF files and evaluate their contents. One major concern in providing raw NGS data is that patients or clinicians may over-interpret findings, attaching undue significance to certain variants. Also, it is possible that variants identified in BAM or VCF files could be false positives, and that the absence of a variant could be a false negative. Additionally, laboratories often have internal data from other patients' results that are useful for variant interpretation but, due to HIPAA requirements and technical constraints, this information may not be accessible to clinicians and patients. Whenever raw data is provided, laboratory professionals should be available to discuss results with the clinician or patient, particularly if a question arises about a laboratory's classification or omission of a variant. It should be stressed that any putative variant(s) identified by the patient or clinician that will influence patient care should be confirmed by the clinical laboratory; if a variant detected by the clinician or patient is found to be relevant to the patient's phenotype, an amended or corrected report should be issued by the laboratory so that the patient's medical record accurately reflects this finding. This two-way communication between the laboratory and the clinician is crucial for optimal variant interpretation and quality patient care.
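For context, the VCF files discussed above are tab-delimited text, which is why they are relatively easy to share and inspect. The fragment below walks the data lines of an uncompressed VCF; it is a simplified illustration, and production code would use a dedicated parser (e.g., pysam.VariantFile) that handles bgzip compression, indexing, and the full specification.

    def iter_vcf_records(path):
        """Yield the core fields of each data line in an uncompressed VCF."""
        with open(path) as handle:
            for line in handle:
                if line.startswith("#"):
                    continue  # skip meta-information and column-header lines
                chrom, pos, _id, ref, alt, _qual, filt = line.rstrip("\n").split("\t")[:7]
                yield {"chrom": chrom, "pos": int(pos), "ref": ref,
                       "alt": alt, "filter": filt}

    # for record in iter_vcf_records("proband.vcf"):
    #     print(record["chrom"], record["pos"], record["ref"], ">", record["alt"])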

Reanalysis and Reinterpretation of NGS Data

Clinical genomics laboratories provide a rigorously validated analysis of NGS data and their best interpretation of the results based on current understanding of genetic variation. However, even after the written report is issued, genetic and technical knowledge continues to evolve. New information may be published about the functions and disease associations of previously uncharacterized genes and about the clinical significance of specific sequence variants. This new knowledge may significantly alter the interpretation of a patient's NGS test results and impact patient care. Technical improvements to bioinformatics pipelines and revision of the human genome reference sequence may also affect analysis and interpretation. Clinical laboratories should develop explicit policies regarding reassessment of previously reported cases in the face of new genetic and technical information. Laboratories may choose to do a systematic review of reported cases at predetermined intervals or to review individual cases at the request of the clinician. Variants in genes of previously unknown function and uncertain variants in known genes may be periodically reinterpreted to determine whether new evidence affects their classification or relevance to the patient's clinical presentation. Reinterpretation may also be requested by the clinician if a patient's symptoms have changed. Laboratories may choose to reanalyze sequence data from previously reported cases when significant improvements are made to their bioinformatics pipeline.

Data Storage

NGS tests generate a tremendous amount of data, and laboratories must consider how the data will be stored. Data may be stored at the laboratory, at an offsite location, or in a cloud computing environment. However, any storage solution must be HIPAA compliant and, at present, many cloud computing services do not meet this requirement. Due to the multistep nature of NGS bioinformatics pipelines, the testing process produces many different data files that vary in both content and size. Long-term storage of the sequencing image files (which are of terabyte size) would add significant overhead costs to the laboratory and is, therefore, not a common practice; FASTQ, BAM, and VCF files are a more manageable size and can be stored at a more reasonable cost. Laboratories should develop a policy that details which file types will be stored and for how long. Current recommendations suggest that, for a minimum of 2 years, stored data should be sufficient to allow regeneration of the primary results and to permit future reanalysis with improved bioinformatics pipelines; FASTQ and BAM files may be sufficient for this purpose. VCF files and the final written report should be stored for as long as possible, as these files will allow for future reinterpretation of variant classifications [2]. All laboratory data retention policies should be developed in accordance with local, state, and federal regulations.

CONCLUSION

The importance of a well-organized and well-written clinical genomics report cannot be overstated. Clinicians, patients, and other laboratories must be able to efficiently extract not only the variant data but also the interpretation and significance of these results for the patient and his or her family. The importance of including detailed information on test limitations, quality measures, and testing recommendations may be easily overlooked, but these components are critical to understanding the significance of both positive and negative NGS results. Clinical laboratories are still determining how best to convey complex NGS results to the ordering clinician, and those in the clinic are providing feedback about the number and types of variants they would like to have reported. The clinical genomics reporting process will continue to evolve in coming years, and new standards will likely be developed by the genetics community.

References

[1] American College of Medical Genetics and Genomics Standards and Guidelines for Clinical Genetics Laboratories. <https://www.acmg.net/ACMG/Publications/Laboratory_Standards___Guidelines.aspx> [accessed 03.07.14].


[2] Rehm HL, Bale SJ, Bayrak-Toydemir P, Berg JS, Brown KK, Deignan JL, et al.; Working Group of the American College of Medical Genetics and Genomics Laboratory Quality Assurance Committee. ACMG clinical laboratory standards for next-generation sequencing. Genet Med 2013;15(9):733-47.
[3] O'Fallon BD, Wooderchak-Donahue W, Bayrak-Toydemir P, Crockett D. VarRanker: rapid prioritization of sequence variants associated with human disease. BMC Bioinformatics 2013;14(Suppl. 13):S1.
[4] Richards CS, Bale S, Bellissimo DB, Das S, Grody WW, Hegde MR, et al.; Molecular Subcommittee of the ACMG Laboratory Quality Assurance Committee. ACMG recommendations for standards for interpretation and reporting of sequence variations: revisions 2007. Genet Med 2008;10(4):294-300.
[5] Worthey EA, Mayer AN, Syverson GD, Helbling D, Bonacci BB, Decker B, et al. Making a definitive diagnosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med 2011;13(3):255-62.
[6] Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med 2013;15(7):565-74.
[7] American College of Medical Genetics and Genomics. Incidental findings in clinical genomics: a clarification. Genet Med 2013;15(8):664-6.
[8] Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, et al. Clinical whole-exome sequencing for the diagnosis of Mendelian disorders. N Engl J Med 2013;369(16):1502-11.

List of Acronyms and Abbreviations

ACMG  American College of Medical Genetics and Genomics
AMP  Association for Molecular Pathology
BAM  Binary Alignment/Map
CAP  College of American Pathologists
CDC  Centers for Disease Control
COSMIC  Catalog of Somatic Mutations in Cancer
ES  Exome sequencing
FDA  Food and Drug Administration
GS  Genome sequencing
HGMD  Human Gene Mutation Database
HGVS  Human Genome Variation Society
HIPAA  Health Insurance Portability and Accountability Act
HUGO  Human Genome Organisation
IGV  Integrative Genomics Viewer
LSDB  Locus-specific database
MLPA  Multiplex Ligation-Dependent Probe Amplification
NCBI  National Center for Biotechnology Information
NGS  Next-generation sequencing
NHLBI-ESP  National Heart, Lung, and Blood Institute Exome Sequencing Project
NIST  National Institute of Standards and Technology
OMIM  Online Mendelian Inheritance in Man
PCR  Polymerase chain reaction
SIFT  Sorting Intolerant from Tolerant
VCF  Variant call format
VUS  Variant of uncertain significance


C H A P T E R

14
Reporting Software

Rakesh Nagarajan
Department of Pathology and Immunology, and Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA

O U T L I N E

Introduction 232
Clinical Genomic Test Order Entry 232
Laboratory Information Management Systems (LIMS) Tracking 233
Analytics: From Reads to Variant Calls 233
  Analytical Validation 234
  Provenance Tracking and Versioning 234
  Pipeline Orchestration and Management 234
Analytics: Variant Annotation and Classification 235
Variant Interpretation 236
Final Report Transmission to the EMR 236
Leveraging Standards in Clinical Genomics Software Systems 237
Regulatory Compliance 237
Support Personnel 237
Conclusion 238
References 238
List of Acronyms and Abbreviations 239

KEY CONCEPTS

• Reporting software for clinical next-generation sequencing assays should comply with regulatory guidelines for information technology (IT) components operating in a clinical laboratory.

• Reporting software for clinical next-generation sequencing assays should integrate with computerized physician order entry (CPOE) and laboratory information systems (LIS) as well as the electronic medical record (EMR) to receive orders and submit results, respectively.

• Reporting software should only support execution of clinically validated bioinformatics pipelines, which may need to be customized for each assay and the variant type(s) that are to be detected.

• Reporting software should maintain complete provenance tracking of the informatics tools, databases, parameters, and filters used to identify, classify, annotate, and interpret variants.

• Reporting software should facilitate variant interpretation and variant data sharing by integrating with national initiatives aimed at facilitating these goals.

• Reporting software should leverage vocabulary and ontology standards, well-defined and accepted file formats, and messaging and application programming interface specifications.

• Technical personnel needed to support a clinical molecular diagnostic laboratory include IT administrators knowledgeable in the hardware and software required to manage and analyze "big data"; clinical bioinformaticians who are skilled in classic computational biology techniques and are educated in clinical validation approaches; and variant scientists who are adept at variant review and preliminary interpretation.

INTRODUCTION

The advent of genome-wide profiling technologies and their routine use in basic science and translational research is now leading to the promotion of their application in clinical settings. Significant reductions in cost, together with advancing knowledge about clinically significant variants in single gene and complex genetic constitutional and somatic disorders, are making next-generation sequencing (NGS) based clinical diagnostics possible and necessary, respectively. However, there are several potent informatics and information technology (IT) barriers that must be overcome before clinical NGS can become routine. First, Good Laboratory Practice-based quality assurance metrics must be established and enforced by IT systems to guarantee the accuracy required to make medical decisions. Second, genetic variants identified by sequencing must be systematically annotated and interpreted so that a clinical genomicist can decide which are medically actionable. Third, software applications and technologies are required to support the complete workflow from order entry of a clinical genomic test in computerized physician order entry (CPOE) or laboratory information systems (LIS) to final delivery of an interpretive report to the electronic medical or health record (EMR/EHR). These concepts and processes are described in further detail below (Figure 14.1).

FIGURE 14.1 The general workflow and lifecycle of a clinical diagnostic assay. The process starts when a treating physician orders a clinical NGS assay to better manage his or her patient. Order entry systems message this order to the reporting software and, after sample procurement, nucleic acid isolation, and library preparation, the sample is sequenced. Data from the sequencer (sequencer image downloaded from http://support.illumina.com/sequencing/sequencing_instruments/hiseq_2000.html) are processed by the reporting software to identify, annotate, classify, and support the interpretation of variants. These steps require the integration of clinically validated computational pipelines and the agglomeration of numerous genome annotation, variant, and medical knowledge bases to generate a draft report. The clinical genomicist may then review variants, alignments, supporting evidence, and other information to finalize the report (report icon downloaded and utilized in original format from http://www.iconarchive.com/show/medical-icons-by-dapino/medical-report-icon.html) and sign it out in the reporting system. Results are then messaged to the EMR, where they are reviewed by the treating physician in collaboration with the patient to make medical treatment and management decisions.

CLINICAL GENOMIC TEST ORDER ENTRY

Inpatient and outpatient environments typically utilize a CPOE as part of an enterprise EMR/EHR to facilitate ordering of clinical tests, an approach that will likely be employed to order a new clinical genomic test. However, such tests have special considerations that may not be accounted for in existing clinical systems; namely, a genomic test may be ordered on more than one specimen or sample as part of an overall case.

For example, a tumor sample may be sequenced in conjunction with the patient's nonmalignant sample to identify somatic variants that may be clinically actionable, or a proband's germline DNA may be sequenced along with a set of unaffected or affected family members to identify possible disease-causing variants. Thus, existing CPOE and LIS will need to accommodate such complexities in a generalized and scalable fashion to support clinical genomic testing broadly and comprehensively.

Once a clinical genomic test has been entered into a CPOE or LIS, the order must be tracked, processed, analyzed, annotated, interpreted, and signed out by a board-certified molecular pathologist or medical geneticist, termed a "clinical genomicist," a new clinical discipline wherein medical personnel are trained not only in genetics but also in genomics concepts. Thus, the order must be replicated in a system or set of systems capable of supporting the above workflow; in clinical environments, messaging using Health Level Seven (HL7), typically version 2, is used to transmit such information from system to system. The clinical genomic test order, together with patient demographics and clinically relevant information such as the clinical indication, pathological diagnosis, clinical phenotype(s), age, race, ethnicity, and sex, as well as each accessioned specimen's data per the test's specifications (e.g., date collected, date received, specimen procurement site (liver, lung, brain, median cubital vein), and quality criteria such as qualification of specimens as low input, hemodiluted, hemolyzed, thawed, or clotted), must therefore be messaged to the system or systems that facilitate the clinical NGS workflow.
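As a concrete illustration of this messaging step, the minimal sketch below assembles an HL7 version 2 order message of the kind a CPOE might transmit to reporting software. The MSH/PID/ORC/OBR segment grammar and the ORM^O01 message type are standard HL7 v2, but the application names, test code, order number, and field population here are hypothetical and greatly simplified relative to a real, conformant interface:

    from datetime import datetime

    def hl7_field(*components):
        # Join HL7 components with the default component separator "^".
        return "^".join(components)

    def build_order_message(mrn, name_last, name_first, dob, sex,
                            test_code, test_name, indication):
        """Assemble a minimal ORM^O01-style order message (illustrative only)."""
        now = datetime.now().strftime("%Y%m%d%H%M%S")
        segments = [
            # MSH: message header (sending/receiving systems, timestamp, type)
            "|".join(["MSH", "^~\\&", "CPOE", "HOSPITAL", "NGS_REPORTING", "LAB",
                      now, "", "ORM^O01", "MSG00001", "P", "2.3"]),
            # PID: patient identification (MRN, name, date of birth, sex)
            "|".join(["PID", "1", "", mrn, "", hl7_field(name_last, name_first),
                      "", dob, sex]),
            # ORC: common order segment ("NW" = new order)
            "|".join(["ORC", "NW", "ORD12345"]),
            # OBR: observation request (the genomic test plus clinical indication)
            "|".join(["OBR", "1", "ORD12345", "", hl7_field(test_code, test_name),
                      "", "", now, "", "", "", "", indication]),
        ]
        return "\r".join(segments)  # HL7 v2 segments are carriage-return delimited

    print(build_order_message("123456", "DOE", "JANE", "19700101", "F",
                              "NGSPANEL1", "Somatic Cancer NGS Panel",
                              "Lung adenocarcinoma"))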

LABORATORY INFORMATION MANAGEMENT SYSTEMS (LIMS) TRACKING

Once a clinical genomic test has been ordered and specimens have been received by the lab performing the assay, an LIMS is required to track the test and its specimen components through the specimen processing protocol. Namely, LIMS need to track specimens to orders/accessions, support unique patient/accession/specimen labeling and barcoding, and track sample processing and sequencing steps. Existing commercially available as well as internally developed and validated LIMS [1–3] may support each of these steps. While the overarching IT and informatics systems do not need to support these detailed steps directly for clinical NGS, basic outputs/fields from the LIMS are required for further informatics processing and interpretation by informaticians and clinical genomicists. Such required data include the instrument(s), run identifier(s), lane(s), and nucleic acid barcode(s)/index(ices) used to sequence each sample in the case. These basic pieces of information then enable downstream systems to access primary sequence data (typically contained in single-end or paired-end FASTQ files) automatically from file systems for downstream processing.

Note that these LIMS must be able to relate multiple samples to a case, and must be able to associate multiple sequencing data sets (whether from the same or different runs, lanes, libraries, or indices), in order to support diverse experimental protocols such as those using multiple libraries for a sample (e.g., the Illumina TruSight Tumor Panel), or those for which multiple specimens must be sequenced in concert and coanalyzed, such as tumor/nonmalignant and trio sequencing. Furthermore, basic DNA/RNA quality criteria from the primary specimen processing, the library preparation steps, and the actual sequencing results may need to be reviewed by medically trained personnel to ensure that the sequencing process is likely to yield high-quality sequences of sufficient complexity. Such quality criteria include, but are not limited to, DNA concentration, DNA quality, library fragment sizes, library DNA concentration, cluster density, quality scores, phasing/prephasing values, and per-cycle images (discussed in Chapter 7). Thus, such data should be made available from the LIMS within reporting software systems, so that the clinical genomicist may evaluate the success of a particular sequencing run and the associated input DNA and library preparation prior to evaluating the specific case's sequencing metrics.
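A minimal sketch of how these LIMS outputs might be modeled downstream is shown below. The case/sample/data set schema and the FASTQ directory layout are assumptions for illustration only, not the structure of any particular LIMS product:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SequencingDataSet:
        """One sample's data from one run/lane/index combination."""
        instrument: str
        run_id: str
        lane: int
        index: str          # nucleic acid barcode used to demultiplex the sample
        paired_end: bool = True

        def fastq_paths(self, root="/data/runs"):
            # Hypothetical directory layout; real LIMS exports will differ.
            base = f"{root}/{self.instrument}/{self.run_id}/L{self.lane:03d}/{self.index}"
            if self.paired_end:
                return [f"{base}_R1.fastq.gz", f"{base}_R2.fastq.gz"]
            return [f"{base}_R1.fastq.gz"]

    @dataclass
    class CaseSample:
        sample_id: str
        sample_type: str    # e.g., "tumor", "nonmalignant", "proband", "parent"
        data_sets: List[SequencingDataSet] = field(default_factory=list)

    @dataclass
    class Case:
        accession: str
        samples: List[CaseSample] = field(default_factory=list)

    # A tumor/nonmalignant pair whose data came from two lanes of one run.
    case = Case("ACC-0001", [
        CaseSample("S1", "tumor",
                   [SequencingDataSet("HiSeq01", "140801_RUN42", 1, "ATCACG")]),
        CaseSample("S2", "nonmalignant",
                   [SequencingDataSet("HiSeq01", "140801_RUN42", 2, "CGATGT")]),
    ])
    for s in case.samples:
        for ds in s.data_sets:
            print(s.sample_type, ds.fastq_paths())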

ANALYTICS: FROM READS TO VARIANT CALLS

Sequencing instrument vendors typically provide basic software to convert raw cycle images to raw intensities, normalize those data, and apply phasing/prephasing corrections to generate per-base sequence and quality information, which is then output in a file format called FASTQ [4]. FASTQ is a derivative of the FASTA format [5] in which, for each read generated during each phase of a run (e.g., single-end vs. paired-end sequencing), there is a descriptive header, the read's sequence, and the quality of each base in that sequence. Instrument software also typically demultiplexes sample sequence data using data from the index sequencing phase(s) of the run, ultimately generating either one (single-end sequencing) or two (paired-end sequencing) FASTQ files per library sequenced.

Once FASTQ files are generated for all the samples/libraries in a case, they need to be subjected to appropriate analytics to identify variants with high sensitivity and positive predictive value. Clinical NGS reporting systems that incorporate analytics thus need to support multiple concepts in order to serve this overall goal: they must have clinically validated pipelines, use those pipelines in appropriate settings and combinations to generate variant calls, track which pipelines and versions were used in each case, and be able to manage multiple analyses concurrently and robustly. These concepts, namely analytical validation, provenance tracking, versioning, and analytical pipeline orchestration and management, are described in greater detail below.
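The FASTQ layout just described is simple enough to parse directly. The sketch below assumes the common four-line record structure and the Sanger-style Phred+33 quality encoding documented by Cock et al. [4]; note that historical Solexa/Illumina variants of the format used different quality offsets:

    def read_fastq(handle):
        """Yield (header, sequence, qualities) from a FASTQ stream.

        Assumes the four-line record layout and Sanger Phred+33
        quality encoding described in Cock et al. [4].
        """
        while True:
            header = handle.readline().rstrip()
            if not header:
                return                      # end of file
            seq = handle.readline().rstrip()
            handle.readline()               # "+" separator line (ignored)
            qual = handle.readline().rstrip()
            phred = [ord(c) - 33 for c in qual]   # ASCII offset 33 -> Phred score
            yield header, seq, phred

    import io
    record = "@read1\nACGT\n+\nIIFF\n"
    for header, seq, phred in read_fastq(io.StringIO(record)):
        print(header, seq, phred)           # @read1 ACGT [40, 40, 37, 37]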

Analytical Validation

Analytical pipelines that detect somatic versus constitutional variants, those that detect variants using data sets derived from amplicon versus targeted capture versus whole genome sequencing, and those that detect variants from FASTQ files generated by different sequencing platforms all need to be appropriately validated for the particular experimental, disease, and instrument context. Furthermore, analytical validation must be conducted specifically for each variant type required by the assay (e.g., single nucleotide variations, indels, and structural variants including gene fusions and copy number variations). Such validation is typically performed using a combination of well-characterized samples (e.g., HapMap and COSMIC cell lines), patient samples that have previously been characterized using orthogonal technologies, and reference standards that are now becoming available from commercial entities (e.g., Horizon Diagnostics) and the College of American Pathologists (CAP). However, the number of genes (each having different sequence characteristics, such as GC content, repeat regions, and homopolymer runs) and variant types (having different lengths, positions, and so on) can make analytical validation difficult using such samples, since certain variant types and the sequence contexts in which they occur (e.g., a gene fusion wherein the partner gene may be one of many dozens) are unique. Given this level of complexity, organizations such as CAP have embarked on a project to deliver in silico generated data sets capable of interrogating virtually any variant type in any gene across a range of variant characteristics.
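A toy version of such an in silico data set conveys the idea. The sketch below spikes a single SNV into error-free simulated reads at a chosen variant allele fraction; real validation sets (such as the CAP project just described) additionally model error profiles, indels, fusions, and varied sequence contexts:

    import random

    def simulate_reads(reference, variant_pos, alt_base, vaf,
                       read_len=10, n_reads=20, seed=7):
        """Generate toy reads over a reference with an SNV spiked in
        at a given variant allele fraction (VAF). Deliberately simple:
        no sequencing errors, no indels, uniform coverage."""
        rng = random.Random(seed)
        reads = []
        for _ in range(n_reads):
            start = rng.randrange(0, len(reference) - read_len + 1)
            bases = list(reference[start:start + read_len])
            # Spike the alternate allele into a fraction of overlapping reads.
            if start <= variant_pos < start + read_len and rng.random() < vaf:
                bases[variant_pos - start] = alt_base
            reads.append((start, "".join(bases)))
        return reads

    ref = "ACGTACGTACGTACGTACGT"
    for start, read in simulate_reads(ref, variant_pos=10, alt_base="T", vaf=0.5):
        print(start, read)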

Provenance Tracking and Versioning

Once an analytical pipeline for a particular assay has been validated, the tools, their versions, source databases, parameter settings, filter settings, and analytical steps all need to be maintained by software systems such that the exact, fully specified pipeline that was determined to meet the assay's performance needs is the one used to process clinical cases. Detailed cataloging of all of these components is termed provenance tracking, and it is a key feature of any clinical reporting system. Thus, tools, their versions, the parameter settings for each tool, how each tool is orchestrated in the pipeline, and the filter settings in each step all need to be tracked unambiguously. Databases and their versions, including the genome build (e.g., Build 37.13), the genome annotation (e.g., Ensembl version 75), and dbSNP (e.g., Build 141), as well as input file versions such as assay target region BED files or vendor manifest files, also need to be appropriately tracked. Finally, the overall pipeline itself should be versioned, since a change to any one of the subcomponents described above triggers a new version of the pipeline. This provenance tracking is vital for reporting systems to relate which pipeline version (and thus which steps, tools, versions, and so on) was used to process any one patient's case.
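One lightweight way to make such provenance explicit is to serialize the fully specified pipeline configuration and derive a version identifier from it, as sketched below; the field names and tool entries are hypothetical:

    import hashlib, json

    # Fully specified pipeline description; names and versions are illustrative.
    pipeline = {
        "pipeline_name": "somatic_panel_v2",
        "genome_build": "GRCh37.p13",
        "annotation": {"source": "Ensembl", "version": "75"},
        "dbsnp_build": "141",
        "target_bed": {"file": "panel_targets.bed", "version": "3"},
        "steps": [
            {"tool": "aligner", "version": "1.2.3", "params": {"seed_len": 19}},
            {"tool": "variant_caller", "version": "4.5", "params": {"min_vaf": 0.05}},
        ],
    }

    def pipeline_version_hash(spec):
        """Hash the canonicalized spec so that any change to a tool, parameter,
        or database version yields a new pipeline version identifier."""
        canonical = json.dumps(spec, sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()[:12]

    # Store this identifier with every case processed by the pipeline.
    print("pipeline version:", pipeline_version_hash(pipeline))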

Pipeline Orchestration and Management

Clinical NGS reporting systems that incorporate analytics to process reads to variant calls should be able to handle analyses for multiple patient cases concurrently. Namely, these systems should be able to schedule analyses, including individual steps (e.g., alignment, recalibration, variant calling), across multiple cases efficiently and robustly. The systems should be capable of recovering gracefully from failed analyses and should be able either to relaunch a case's pipelines in their entirety or to resume from the step that first failed, thereby avoiding reexecution of steps that completed successfully. Furthermore, such systems should generate logs during each step in the analysis of a patient's case, so that failures or discrepancies during analysis may be reviewed by informaticians to address the underlying issues, and so that such analysis exceptions can be tracked and documented in the laboratory's Clinical Laboratory Improvement Amendments (CLIA) recording system (paper or electronic).
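A minimal sketch of this resume-from-failure behavior follows. It uses per-step checkpoint files and standard logging; production systems typically rely on dedicated workflow engines and cluster schedulers rather than anything this simple:

    import logging, os

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

    def run_case(case_id, steps, checkpoint_dir="checkpoints"):
        """Run pipeline steps in order, recording a checkpoint file per
        completed step so that a relaunch resumes from the first failure."""
        os.makedirs(checkpoint_dir, exist_ok=True)
        for name, func in steps:
            marker = os.path.join(checkpoint_dir, f"{case_id}.{name}.done")
            if os.path.exists(marker):
                logging.info("case %s: skipping completed step %s", case_id, name)
                continue
            try:
                logging.info("case %s: running %s", case_id, name)
                func()
            except Exception:
                # Leave no marker: the next relaunch resumes at this step.
                logging.exception("case %s: step %s failed", case_id, name)
                raise
            open(marker, "w").close()

    steps = [
        ("align", lambda: None),        # placeholders for real step commands
        ("recalibrate", lambda: None),
        ("call_variants", lambda: None),
    ]
    run_case("ACC-0001", steps)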

ANALYTICS: VARIANT ANNOTATION AND CLASSIFICATION

Once variants have been identified using an assay's clinically validated pipelines, reporting software must next annotate the variants using available genome, gene, and variant databases to enable the clinical genomicist to rapidly determine each variant's biological impact. This step involves determining which gene or genes the variant impacts and what that impact is; it also involves associating each variant with information from other variant databases. This last step is performed, for example, to determine the variant's allele frequency in the human population (possibly even within the patient's race and ethnicity), to mark its previous identification in research and/or clinical human variant databases, to predict its protein impact computationally, and to ascertain the conservation of that base/amino acid/region in orthologous genes in other species. Each of these steps, the relevant databases, and the associated software concepts are described in greater detail below.

To determine a variant's impact on a gene or genes, genome builds and gene/mRNA/protein annotations must be utilized. Sources such as NCBI [6], Ensembl [7], and Locus Reference Genomic (LRG) [8] provide both, although there are differences among these three sources: gene and transcript annotations may differ for the same gene, and NCBI and Ensembl version both genome builds and gene/mRNA/protein coordinates. The LRG provides stable sequences curated by the clinical community and is targeted to support clinical reporting, but it does not contain all genes and corresponding transcripts (only about 400 genes were curated at the time of this writing).

Once a variant's location in the genome, as well as its association with or impact on a gene's mRNA and/or protein, is determined, a common next step is to determine whether the variant has been reported previously in clinical databases, has been seen in prior published research genomic projects, and/or is likely to be a common polymorphism. Clinical variant databases such as ClinVar [9] and the Human Gene Mutation Database (HGMD) [10], along with emerging initiatives like the ClinGen consortium, aim to address the first goal. Additionally, locus-specific databases [11], such as those for TP53 [12], may provide supporting information with respect to the variant's likelihood of affecting a disease state. Databases such as the Catalogue of Somatic Mutations in Cancer (COSMIC) [13] and The Cancer Genome Atlas (TCGA) [14] frequently support variant classification in somatic cancer testing, since they provide basic information as to whether the variant has been observed in cancer as well as the frequency with which it appears in each tumor type. In silico prediction tools such as PolyPhen [15], SIFT [16], and Condel [17] are used to predict pathogenicity based on structural predictions and other criteria in the absence of other supporting information in research databases and/or published functional studies. Finally, databases such as the database of Single Nucleotide Polymorphisms (dbSNP) [18] and the National Heart, Lung, and Blood Institute's (NHLBI) Exome Sequencing Project (ESP) [19] provide allele frequency information from different populations (e.g., European American and African American populations in the NHLBI ESP), which may then be used within each clinical NGS assay's context to determine which variants are likely benign polymorphisms.
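The aggregation logic can be sketched as below, with tiny in-memory dictionaries standing in for versioned copies of dbSNP/ESP population frequencies and ClinVar/HGMD assertions; the BRAF V600E GRCh37 coordinates appear purely as a familiar example, and the 1% cutoff is an illustrative, assay-specific choice:

    # Stand-in lookups; a production system would query versioned copies of
    # dbSNP, the NHLBI ESP, and HGMD/ClinVar, among others.
    POPULATION_AF = {("chr7", 140453136, "A", "T"): 0.0,
                     ("chr1", 11856378, "G", "A"): 0.21}
    CLINVAR = {("chr7", 140453136, "A", "T"): "Pathogenic"}

    def annotate(variant, common_af_cutoff=0.01):
        """Attach population frequency and knowledge-base hits to a variant
        and flag likely benign polymorphisms by allele frequency."""
        key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
        af = POPULATION_AF.get(key)
        return {
            **variant,
            "population_af": af,
            "clinvar": CLINVAR.get(key),
            "likely_polymorphism": af is not None and af > common_af_cutoff,
        }

    v = {"chrom": "chr7", "pos": 140453136, "ref": "A", "alt": "T", "gene": "BRAF"}
    print(annotate(v))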
While multiple databases and knowledge bases may be used to determine the clinical relevance of a variant, until recently a standard, defined clinical classification scheme for grouping variants identified by NGS testing was not available, prompting groups performing clinical NGS to develop their own approaches. However, the American College of Medical Genetics and Genomics (ACMG) last year recommended a classification scheme that groups variants based on pathogenicity, together with a logical scheme for assigning variants to a classification based on the level of evidence available in clinical and research databases, the published literature, and computational predictions. This classification scheme is composed of five levels, termed pathogenic, likely pathogenic, variant of unknown significance, likely benign, and benign [20]. While this classification system works well for constitutional diseases, its applicability and fit in somatic cancer testing are still being actively debated, and the ACMG is considering variations of this scheme to better reflect the need to classify variants based on their ability to influence treatment with targeted therapies.

Given these requirements, clinical reporting software must first maintain versioning of each database leveraged by the end user within each assay, so that the provenance of a variant classification can be determined and reliably reviewed at a later point in time. Second, the software should be able to facilitate variant classification, especially in large-scale assays such as clinical whole exome and genome studies. Third, logical and defined approaches to variant classification should be codified and versioned in the software system and should be customizable by assay to meet each disease indication's needs; for example, criteria for classifying variants as benign may differ between somatic and constitutional disorders.
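A deliberately simplified sketch of such a codified, versioned rule set follows; the thresholds and evidence flags are illustrative and do not reproduce the full ACMG criteria [20]:

    def classify(evidence):
        """Toy five-tier classifier; real ACMG-style criteria [20] weigh many
        more evidence types and are assay- and disease-specific."""
        if evidence.get("well_established_pathogenic"):
            return "pathogenic"
        if evidence.get("predicted_damaging") and evidence.get("absent_from_controls"):
            return "likely pathogenic"
        if evidence.get("population_af", 0) > 0.05:
            return "benign"
        if evidence.get("population_af", 0) > 0.01:
            return "likely benign"
        return "variant of unknown significance"

    # The rule thresholds themselves should be versioned per assay, e.g.:
    RULESET = {"name": "constitutional_default", "version": "1.0", "common_af": 0.01}

    print(classify({"population_af": 0.2}))                        # benign
    print(classify({"predicted_damaging": True,
                    "absent_from_controls": True}))                # likely pathogenic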

VARIANT INTERPRETATION

Once variants are classified, those deemed to require additional clinical interpretation that will ultimately alter patient treatment or management are typically described in further detail in a summary interpretation or recommendation section of the clinical report. Here, the published literature, well-established guidelines (e.g., from the American Society of Clinical Oncology (ASCO) and the National Comprehensive Cancer Network (NCCN)), and opportunities for participation in clinical trials or investigational studies (e.g., from ClinicalTrials.gov) all need to be integrated to provide a comprehensive interpretation. This task is especially daunting, and correspondingly amenable to software facilitation, as the published medical literature exceeds 60 million articles and guidelines, and new trials are frequently evolving and becoming available. National, government-funded initiatives, such as ClinVar and ClinGen, as well as commercial systems, such as Ingenuity Clinical, Thomson Reuters products, CollabRx, and others, aim to fill this need.

Software systems must be able to integrate data from one or many of these resources, as well as primary information from the published literature, in a logical fashion to facilitate review and interpretation by clinical genomicists. At the same time, they must provide comprehensive information quickly enough to allow clinical genomicists to draft appropriate clinical interpretations in each clinical context. Furthermore, systems should be able to store such interpretations in appropriate disease and patient phenotype contexts so that, in the absence of newer information, future inferences can be made automatically, thus speeding the interpretation process. Finally, systems should be capable of complex inferencing across the set of variants identified in each patient, as the unique combination in each patient may lead to unique interpretations. Namely, variants must be capable of being considered together and not in isolation, since the ultimate interpretation may differ when two or more variants in the same or different genes are found together rather than in isolation.

As described above, because this step is very labor intensive and dependent upon review of prior clinical or research evidence, databases such as ClinVar allow laboratories to share not only clinically relevant variants, but also how each laboratory interpreted each variant in a patient's context. Furthermore, the ClinGen project aims to create a clinically relevant variant resource that catalogs and curates medically important variants as determined by physician expert committees within each disease focus area (e.g., somatic cancer, cardiac disorders, and autism). These data and information sharing initiatives not only facilitate the sign-out of individual patient tests, but also enable the medical community to "peer review" and crowd-source the clinical variant knowledge base, such that prevailing clinical interpretations for a given set of variants within a disease context can be made reliably and reproducibly, regardless of where the clinical assay is performed or by whom the data are interpreted.

FINAL REPORT TRANSMISSION TO THE EMR

Once the interpretive report is finalized by the clinical genomicist, it needs to be signed out and submitted to the EMR/EHR for access by the treating physician, so that he or she may use this information to manage and/or treat the patient. Thus, software systems should be able to integrate with downstream clinical systems (EHRs, EMRs, and LIS as appropriate at each organization) to send reported results using standard HL7 messaging approaches. Because the overwhelming majority of healthcare providers and their systems still use HL7 version 2, it is most pragmatic to transmit result data using this version. Note that most downstream systems are likely only capable of receiving either unformatted free text or a formatted document (e.g., a PDF); discrete genomic results reporting and ingest capabilities are not available in virtually any EMR/EHR (e.g., EPIC and Cerner products) today. However, as these capabilities become available, results reporting using the HL7 Clinical Genomics model, LOINC codes for particular variants, or some to-be-developed standard for broad genomics reporting will likely be most appropriate.
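By way of illustration, the sketch below assembles a minimal ORU^R01-style result message with the interpretive report carried as free text in an OBX segment. Real interfaces populate many more required fields and, as noted above, often embed a formatted PDF instead; the identifiers and codes here are hypothetical:

    def build_result_message(mrn, order_id, report_text):
        """Minimal ORU^R01-style result message carrying the signed-out
        report as free text in an OBX segment (illustrative only)."""
        segments = [
            "|".join(["MSH", "^~\\&", "NGS_REPORTING", "LAB", "EMR", "HOSPITAL",
                      "20150101120000", "", "ORU^R01", "MSG00002", "P", "2.3"]),
            "|".join(["PID", "1", "", mrn]),
            "|".join(["OBR", "1", order_id, "",
                      "NGSPANEL1^Somatic Cancer NGS Panel"]),
            # OBX carries the observation; "TX" marks an unstructured text value
            # and "F" in OBX-11 marks the result as final.
            "|".join(["OBX", "1", "TX", "REPORT^Interpretive Report", "",
                      report_text, "", "", "", "", "", "F"]),
        ]
        return "\r".join(segments)

    print(build_result_message("123456", "ORD12345",
                               "Pathogenic BRAF p.V600E detected; see full report."))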


LEVERAGING STANDARDS IN CLINICAL GENOMICS SOFTWARE SYSTEMS

Adoption of vocabulary/ontology, interface, and file specification standards, whether established or prevailing, is a best practice that software systems should embrace. Many of these standards have already been introduced above in the description of the clinical genomic assay order, analysis, interpretation, and reporting workflow. First, the HL7 messaging standard should be leveraged to support order and result message receipt and submission. Second, NGS file specification standards, such as the use of BAM and VCF files to store and transmit alignment and variant data, respectively, should be employed; here, extensions to the VCF specification that support complete representation of structural variants, including complex chromosomal rearrangements, may be adopted to store the diverse range of variants detected by any NGS assay. Third, the Human Genome Variation Society (HGVS) recommendations on variant nomenclature should be used to represent variants at the genomic, cDNA, and protein levels; these recommendations include complex representations of structural variants of all types (e.g., gene fusions, inversions, and copy number variants). Fourth, standard and versioned genome builds and genome annotations, such as those from NCBI, Ensembl, and LRG, should be utilized to name variants per HGVS in the context of the specific build and annotation version used for each patient's data. Similarly, versioned variant databases, data sources, and computational tools and pipelines should be employed to ensure complete provenance tracking from primary results generation on the sequencer through final reporting. Fifth, appropriate clinical vocabulary/ontology standards should be adopted to represent diseases, disease indications, medications, procedures, laboratory results, and phenotypes. Such standards include SNOMED-CT, ICD-9/ICD-10, RxNorm, CPT, LOINC, and HPO, and ontology integration resources such as BioPortal [21] and the Unified Medical Language System (UMLS) [22] may be used to meet this need.

Although professional society guidelines do not currently exist, the CDC, as part of a collaborative multi-stakeholder exercise, has finalized a draft of clinical bioinformatics standards that is expected to be published in 2015; these standards will serve as guideposts for clinical laboratories implementing NGS-based testing.

REGULATORY COMPLIANCE

Clinical NGS software systems should comply with the general IT requirements specified in the CAP checklists; ensure HIPAA compliance, including (but not limited to) meeting criteria for physical, technical, and administrative safeguards; and adopt encryption using well-established standards (e.g., Secure Sockets Layer) for data at rest and in motion to protect the privacy of patient data. Furthermore, systems should have appropriate redundancy (e.g., failover web, application, and database servers and geographically distributed data approaches) and backup/recovery policies in place to ensure the uptime needed to support clinical operations. Compliance with HIPAA may be assessed through internal and external audits.

SUPPORT PERSONNEL

Consistent with molecular diagnostic laboratories running non-NGS assays, a basic level of IT support consisting of systems, network, and database administrators is required to maintain laboratory information and clinical reporting systems. In addition to these requirements typical of clinical informatics IT, however, systems and database administrators must also be proficient in "big data" IT: they need the ability to deploy and manage high-performance computing clusters with the requisite scheduling software and tools, and to use data management and query approaches such as NoSQL databases [23] and MapReduce frameworks like Hadoop [24,25].

In addition, specialized personnel who are trained in bioinformatics and are also educated in the principles and concepts of clinical laboratory operations are essential. Such "clinical bioinformaticians" understand and recognize the need for and importance of clinically validated bioinformatics pipelines, documentation of all informatics components of the reporting system, and tracking of all changes in pipelines and tools for each assay as the laboratory evolves each diagnostic test. Finally, PhD-level, molecularly trained personnel who are adept at reading the literature and at interpreting results in the clinical context are vital to the reporting process, as such personnel are frequently utilized in the review and sign-out process. These "variant scientists" are especially invaluable in nonteaching laboratories, where they perform the initial review of case results and make an initial recommendation to the genomicist, who can then sign out the case more expeditiously. Specialized training programs to educate and cultivate both clinical bioinformaticians and variant scientists are required to enable clinical NGS to truly scale to the level of medical genomic testing that is anticipated.

CONCLUSION

The clinical use of NGS in molecular diagnostic assays is poised to become widespread and routine in the management of patients with virtually any diagnosis. However, IT and informatics barriers, such as the need to manage the large data sets generated for each patient and then to identify, annotate, classify, and interpret variants, must all be addressed by reporting software while remaining compliant with clinical standards for computer systems. Furthermore, these applications must integrate seamlessly into the clinical workflow typical of a molecular diagnostic laboratory by interoperating with upstream and downstream clinical systems such as CPOE, LIS, LIMS, and EMRs, and should leverage vocabulary and engineering standards as a best practice. Finally, as the biomedical knowledge base is growing at breakneck speed, reporting software must be able to integrate and share data nationally and globally so that the medical community can use this agglomerated information to better interpret patient genomic results in the future.

References
[1] Core LIMS. [cited August 2014]. Available from: http://www.corelims.com/ngs-lims/.
[2] Clarity LIMS. [cited August 2014]. Available from: https://www.genologics.com/clinical-genomics.
[3] Scholtalbers J, Rossler J, Sorn P, de Graaf J, Boisguerin V, Castle J, Sahin U. Galaxy LIMS for next-generation sequencing. Bioinformatics 2013;29(9):1233–4.
[4] Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 2010;38(6):1767–71.
[5] FASTA Format. [cited August 2014]. Available from: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml.
[6] National Center for Biotechnology Information (NCBI). [cited August 2014]. Available from: http://www.ncbi.nlm.nih.gov/.
[7] Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic Acids Res 2014;42:D749–55.
[8] Locus Reference Genomic. [cited August 2014]. Available from: http://www.lrg-sequence.org/.
[9] Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014;42:D980–5.
[10] Stenson PD, Ball EV, Mort M, Phillips AD, Shaw K, Cooper DN. The Human Gene Mutation Database (HGMD) and its exploitation in the fields of personalized genomics and molecular evolution. Curr Protoc Bioinformatics 2012;Chapter 1:Unit 1.13.
[11] Soussi T. Locus-specific databases in cancer: what future in a post-genomic era? The TP53 LSDB paradigm. Hum Mutat 2014;35(6):643–53.
[12] Olivier M, Eeles R, Hollstein M, Khan MA, Harris CC, Hainaut P. The IARC TP53 database: new online mutation analysis and recommendations to users. Hum Mutat 2002;19(6):607–14.
[13] Forbes SA, Tang G, Bindal N, Bamford S, Dawson E, Cole C, et al. COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer. Nucleic Acids Res 2010;38:D652–7.
[14] The Cancer Genome Atlas (TCGA). [cited August 2014]. Available from: http://cancergenome.nih.gov/.
[15] Adzhubei I, Jordan DM, Sunyaev SR. Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet 2013;Chapter 7:Unit 7.20.
[16] Sim NL, Kumar P, Hu J, Henikoff S, Schneider G, Ng PC. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic Acids Res 2012;40:W452–7.
[17] Gonzalez-Perez A, Lopez-Bigas N. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet 2011;88(4):440–9.
[18] Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29(1):308–11.
[19] NHLBI Exome Variant Server. [cited August 2014]. Available from: http://evs.gs.washington.edu/EVS/.
[20] Duzkale H, Shen J, McLaughlin H, Alfares A, Kelly MA, Pugh TJ, et al. A systematic approach to assessing the clinical significance of genetic variants. Clin Genet 2013;84(5):453–63.
[21] Whetzel PL, Team N. NCBO Technology: powering semantically aware applications. J Biomed Semantics 2013;4(Suppl. 1):S8.
[22] Lindberg C. The Unified Medical Language System (UMLS) of the National Library of Medicine. J Am Med Rec Assoc 1990;61(5):40–2.
[23] Lee KK, Tang WC, Choi KS. Alternatives to relational database: comparison of NoSQL and XML approaches for clinical data storage. Comput Methods Programs Biomed 2013;110(1):99–109.
[24] Dong X, Bahroos N, Sadhu E, Jackson T, Chukhman M, Johnson R, et al. Leverage Hadoop framework for large scale clinical informatics applications. AMIA Jt Summits Transl Sci Proc 2013;2013:53.
[25] Nguyen T, Shi W, Ruden D. CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping. BMC Res Notes 2011;4:171.


List of Acronyms and Abbreviations

ACMG  American College of Medical Genetics and Genomics
ASCO  American Society of Clinical Oncology
BAM  Binary Alignment/Map
CLIA  Clinical Laboratory Improvement Amendments
COSMIC  Catalogue of Somatic Mutations in Cancer
CPT  Current Procedural Terminology
CPOE  Computerized Physician Order Entry
dbSNP  Database of Single Nucleotide Polymorphisms
EHR  Electronic Health Record
EMR  Electronic Medical Record
ESP  Exome Sequencing Project
HGMD  Human Gene Mutation Database
HGVS  Human Genome Variation Society
HL7  Health Level Seven
HPO  Human Phenotype Ontology
ICD-9/ICD-10  International Classification of Diseases
IT  Information Technology
LIMS  Laboratory Information Management Systems
LIS  Laboratory Information Systems
LOINC  Logical Observation Identifiers Names and Codes
LRG  Locus Reference Genomic
NCBI  National Center for Biotechnology Information
NCCN  National Comprehensive Cancer Network
NGS  Next-Generation Sequencing
NHLBI  National Heart, Lung, and Blood Institute
SNOMED-CT  Systematized Nomenclature of Medicine—Clinical Terms
TCGA  The Cancer Genome Atlas
UMLS  Unified Medical Language System
VCF  Variant Call Format


CHAPTER 15

Constitutional Diseases: Amplification-Based Next-Generation Sequencing

Vanessa L. Horner and Madhuri R. Hegde
Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA

O U T L I N E

Introduction  241
    Disease-Targeted Sequencing  241
    Target Enrichment  242
Multigene Panel Validation  243
    Select Genes  244
    Design and Inspect Primers for Each Exon  244
    Run Validation Samples  244
    Bioinformatics  245
Clinical Workflow  245
    Concurrent Testing  245
    Bioinformatics and Data Interpretation  245
Conclusion  247
    Advantages and Disadvantages of Amplification-Based NGS  247
    Future Directions  248
References  248
List of Acronyms and Abbreviations  249

KEY CONCEPTS

• This chapter describes multigene panel validation using an amplification-based approach, from gene selection to test validation, implementation, and interpretation.

INTRODUCTION

Disease-Targeted Sequencing

This is an exciting era for constitutional disorder genetic testing, as next-generation sequencing (NGS) technologies have broadened the scope of testing from single genes examined individually to multiple genes examined concurrently. This approach to genetic testing is invaluable for several classes of constitutional disorders. The first class includes disorders with genetic heterogeneity, in which one disorder can be caused by mutations in several genes. The classic example of genetic heterogeneity is retinitis pigmentosa, which is caused by mutations in at least 43 loci and can be inherited in an X-linked, autosomal dominant, or autosomal recessive fashion [1]. Another class of constitutional disorders that benefits from NGS comprises disorders in which many related genes in a pathway give rise to similar or overlapping phenotypes that are difficult to distinguish clinically. For instance, the so-called RASopathies include nine disorders in which various members of the RAS/MAPK signaling pathway are disrupted; the RAS/MAPK pathway is critical for cell cycle control and differentiation, and disruption of different components of the pathway causes similar craniofacial, cardiac, cutaneous, musculoskeletal, and ocular abnormalities [2].


TABLE 15.1 CMD Single-Gene Testing and NGS Panel Testing Clinical Yield Comparison
(For each period, columns give the number of patients tested, the number of pathogenic calls made, and the clinical yield in percent.)

                                        Through March 2012        Through March 2013
Test     Gene/Panel                     Tested  Path.  Yield (%)  Tested  Path.  Yield (%)

INDIVIDUAL GENE TEST BY SANGER SEQUENCING
SC6A1    COL6A1                         21      1      5          24      1      4
SC6A2    COL6A2                         24      0      0          26      0      0
SC6A3    COL6A3                         21      0      0          22      0      0
SFKRP    FKRP                           26      2      8          41      2      5
SFKTN    FKTN                           29      9      31         39      12     31
SITG7    ITG7                           7       0      0          8       0      0
SLAM2    LAMA2                          31      16     51         42      16     38
SLARG    LARGE                          15      0      0          17      1      6
SPOM1    POMT1                          34      3      9          37      3      8
SPOM2    POMT2                          29      2      7          32      2      6
SPOMG    POMGNT1                        27      3      11         28      3      11
SSEP1    SEPN                           29      2      7          32      3      9
TOTAL                                   293     38     11         348     43     10

NGS GENE PANEL TESTS
SCMDP    CMD comprehensive              41      20     49         68      40     59
SCO6P    Bethlem/Ullrich CMD            43      5      12         55      10     18
SMDCP    Merosin-deficient CMD          17      17     100        21      32     152
SMPCP    Merosin-positive CMD           15      1      7          16      1      6
TOTAL                                   116     43     42         160     83     59

Other constitutional disorders appropriate for NGS are those affecting a particular body system or organ, such as the heart or skeletal muscle. For example, the congenital muscular dystrophies (CMDs) are a group of genetically and phenotypically heterogeneous disorders characterized by congenital hypotonia and muscle weakness, contractures, and delayed motor development [3]. The traditional diagnostic approach includes targeted genetic testing via sequential Sanger sequencing of up to 20 genes known to cause CMD, an approach that has a diagnostic yield of only 10–11% (Table 15.1); in contrast, a CMD multigene panel, which includes NGS of 12 genes as well as deletion/duplication analysis, has a diagnostic yield of 42–59% (Table 15.1). Overall, multigene panel tests demonstrate consistently higher clinical yield than single-gene testing, and thus multigene panels have the potential to achieve a diagnosis faster and more cheaply than a series of single-gene tests. Furthermore, multigene panels allow clinicians to approach more complex congenital disorders by identifying the affected system and directly testing all the genes known to affect that system at once. Currently, there are more than 70 multigene panels offered by commercial labs (www.genetests.org). These panels span several biological categories; the categories, with selected examples, are shown in Table 15.2.

Target Enrichment

Multigene panel sequencing requires a strategy to select and enrich the genes of interest prior to sequencing. There are two broad target-enrichment strategies: amplification-based (discussed in this chapter) and hybridization-based (discussed in Chapter 16). Amplification-based target capture uses polymerase chain reaction (PCR)-based methods to amplify all the exons of the genes in a panel. Two technologies today employ a PCR-based target-enrichment strategy: RDT 1000 (RainDance Technologies, Billerica, MA) and Access Array (Fluidigm Corp, South San Francisco, CA). The methodology of amplification-based sequencing is discussed in Chapter 4. Briefly, the objective of both RainDance and Access Array is to perform thousands of PCR reactions efficiently and accurately. With the RainDance technology, each PCR reaction occurs independently in a picoliter-sized microdroplet emulsion, preventing possible primer pair interactions [4]. With the Fluidigm technology, PCR reactions occur in nanoliter-volume microfluidic reaction containers, each containing 1–10 primer pairs.


TABLE 15.2 Constitutional Disorders with Currently Available NGS Panels

Category                        Selected Examples of NGS Multigene Panels
Brain development disorders     Holoprosencephaly, lissencephaly, microcephaly, cerebellar hypoplasia
Bone disorders                  Osteogenesis imperfecta, short stature, skeletal dysplasia
Cancer syndromes                Hereditary nonpolyposis colon cancer; breast, ovarian, uterine cancer; pheochromocytomas, paragangliomas
Cardiac diseases                Cardiomyopathies, cardiac arrhythmia, aortic aneurysm, Brugada syndrome, long QT syndrome
Ciliopathies                    Bardet–Biedl, Joubert, primary ciliary dyskinesia
Connective tissue disease       Stickler syndrome, Marfan syndrome, Loeys–Dietz syndrome, familial thoracic aortic aneurysms
Epilepsy                        Early infantile epileptic encephalopathy, childhood onset, progressive myoclonic epilepsy, febrile seizures
Eye disorders                   Retinitis pigmentosa, macular dystrophies, Leber congenital amaurosis, Usher syndrome, progressive external ophthalmoplegia, cone-rod dystrophy
Hearing loss                    Syndromic, nonsyndromic
Ion channel diseases            Episodic ataxia, migraine, myotonia, neuropathic pain syndromes, Bartter syndrome
Immune disorders                Severe combined immunodeficiency syndrome, periodic fever
Metabolic diseases              Congenital disorders of glycosylation, urea cycle disorders, peroxisome biogenesis disorders, cobalamin metabolism, fatty acid oxidation disorders, glycogen storage disease
Mitochondrial                   Mitochondrial and nuclear gene panels
Neurodegenerative diseases      Amyotrophic lateral sclerosis, dementia, Parkinson, dystonia, heterotaxy
Neurodevelopmental disorders    Autism spectrum disorders, X-linked intellectual disability, Rett syndrome, and variant Rett syndrome
Neuromuscular diseases          Spinal muscular atrophy, muscular dystrophies, myopathies
RASopathies                     Noonan, LEOPARD, Costello, cardiofaciocutaneous syndrome
Vascular malformations          Hereditary hemorrhagic telangiectasia, Parkes–Weber syndrome


MULTIGENE PANEL VALIDATION

NGS technologies have made whole exome sequencing (WES) possible in the clinical lab. Early studies have confirmed the clinical utility of exome sequencing, with several labs reporting a diagnostic yield of about 20–25% [5]. Critically, however, approximately 5–10% of the exome has low or no sequencing coverage; therefore, WES is more accurately viewed as a screen rather than a diagnostic test. Poor sequencing coverage may be due to a number of factors, including repetitive or complex sequence, G-C-rich sequences, and A-T-rich sequences (see below). In contrast to WES, disease-targeted panel sequencing is a comprehensive diagnostic test with complete sequencing coverage of all the selected exons.

There are three main goals of amplification-based panel validation: (1) design of unique primers for each exon, for accurate and precise amplification; (2) confirmation that the primers can amplify both alleles; and (3) identification of exons with low or no sequencing coverage. With respect to the latter, it is virtually always the case that, despite extensive optimization, some exons in an amplification-based NGS assay will not have adequate coverage (again due to repetitive or complex sequences, G-C-rich or A-T-rich regions, and so on). Most labs compile a list of these exons, validate a simple Sanger-based assay for each, and then proactively perform the Sanger sequencing in parallel with the amplification-based NGS. The ultimate goal of the validation process is to ensure that the panel reliably covers all exons of every gene by using both amplification-based NGS and Sanger sequencing.

Select Genes

Careful selection of genes to include in a clinical panel is vital for downstream interpretation. The evidence that a gene is implicated in the disease of interest must be critically evaluated. Gene selection may begin by searching for available practice guidelines from a group such as the American College of Medical Genetics (ACMG), and recent literature review articles are additional excellent sources of relevant genes. For example, the ACMG practice guideline on genetic evaluation of short stature was used in conjunction with a specific review article as a starting point in gene selection for the short stature panel at one academic lab [6,7]. In general, there should be at least one published report implicating a gene in the disease of interest for that gene to be included on a panel. Finally, consultation with specialists, such as physicians or researchers who are experts on the disorder, can be an invaluable source of information on clinically relevant genes.

In addition to which genes to include, the number of genes to include is also important. There is an optimal range for the number of genes/exons in an amplification-based multigene panel. If there are too few genes (generally fewer than 10), Sanger sequencing is more economical. However, there is a limit to the number of exons that can be targeted using amplification-based methods, due to the high cost of primers and reagents and the DNA input requirements. Access Array permits amplification of 48 PCR reactions for 48 samples in a single run (48 amplicons), or up to 48 10-plex reactions for 48 samples in a single run (480 amplicons). In contrast, RainDance allows amplification of up to 4000 amplicons for 48 samples in a single run, or up to 20,000 multiplexed amplicons for 48 samples in a single run. For very large target regions (such as the 30 Mb human exome), hybridization-based target selection is most appropriate.

Design and Inspect Primers for Each Exon

Once the list of genes to include in the panel is complete, the next step is to design primers for each exon of each gene in the panel. The initial primer design can be completed by RainDance or Access Array, since both commercial vendors have automated pipelines for primer design. In this approach, the company is provided with a list containing the gene names and any aliases, genomic coordinates, and transcript numbers (in NM format); if there is more than one isoform of a gene, there will be more than one NM number. For example, MECP2 has two isoforms: MECP2A, which lacks exon 1 and includes exon 2 (NM_004992.3), and MECP2B, which includes exon 1 and lacks exon 2 (NM_001110792.1). The transcript numbers should be chosen by the laboratory after thorough investigation of the literature to ensure that all exons with known disease-causing mutations are included in the panel.

Once the initial primer design is completed, the laboratory must inspect the primers carefully for several features. First, the primers must be specific for the exon of interest. To determine whether the primers can amplify more than one product, an in silico PCR can be performed (http://genome.ucsc.edu/cgi-bin/hgPcr?command=start). If the primers amplify two exons with 95% or more identity, then the exon of interest has a corresponding pseudoexon. If an exon has a pseudoexon, primers must be designed so that they are specific for the active exon and not the pseudoexon (often, exons with pseudoexons must be Sanger-sequenced to confirm amplification of the active gene region). Second, primers must be able to amplify the target genes in diverse populations and in individuals with mutations. To ensure this, the primers should not contain single nucleotide polymorphisms (SNPs) or variants found in the Human Gene Mutation Database (HGMD) and the Exome Variant Server (EVS). Finally, the primers should be located 50–100 base pairs (bp) from the exon/intron borders, so that at least 10 bp of the flanking introns will be sequenced in order to detect any variants that affect splicing. These primer-checking steps are important, since in general approximately 5–10% of the primers designed by commercial vendors do not meet the above requirements. Since an average primer library is approximately 2000 amplicons, this means about 100–200 primers must be redesigned manually.
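The two checks that most often fail, overlap with known variant positions and placement relative to the exon, lend themselves to simple automation, as sketched below. The coordinates are illustrative (0-based genomic positions), and this is not a substitute for in silico PCR specificity checks:

    def check_primer(primer_start, primer_end, exon_start, exon_end,
                     known_snp_positions):
        """Flag the primer problems described in the text: overlap with known
        SNPs/variants (allele dropout risk) and placement relative to the
        exon/intron borders. Data are illustrative."""
        problems = []
        # Primers containing known variant positions risk allele dropout.
        overlapping = [p for p in known_snp_positions
                       if primer_start <= p < primer_end]
        if overlapping:
            problems.append(f"primer overlaps known variant(s) at {overlapping}")
        # Primers should sit 50-100 bp from the exon so >=10 bp of flanking
        # intron is sequenced for splice-site variant detection.
        distance = min(abs(exon_start - primer_end), abs(primer_start - exon_end))
        if not 50 <= distance <= 100:
            problems.append(f"primer is {distance} bp from the exon (want 50-100 bp)")
        return problems

    # Upstream primer ending 60 bp before an exon, clear of known SNPs: passes.
    print(check_primer(900, 920, 980, 1100, known_snp_positions=[875, 1500]))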

Run Validation Samples

The technical details of validation of amplification-based methods are covered elsewhere (Chapters 1 and 4); the discussion here will focus on operational issues. First, samples used for validation should include three wild-type controls and 17 positive controls containing a previously identified mutation in either a heterozygous or homozygous state. While this number of samples is only a recommendation, and laboratories can choose any number of samples deemed appropriate, the key is to include enough validation samples to demonstrate the test's ability to detect heterozygous, hemizygous (for X-linked diseases), and homozygous variants. This level of validation is required to ensure that no allele dropout occurs during the amplification process (although rare allele dropout may still occur due to the presence of as yet unidentified rare SNPs, which may fall under an amplification primer). Ideally, the wild-type controls will have been used extensively in NGS and exome validations, such as Coriell controls (http://ccr.coriell.org) or a laboratory's own in-house controls for Sanger sequencing, so that many of the expected sequence variants are known.

Bioinformatics

Once the sequencing is complete, the initial task is to align the patient sequence to a reference or control sequence. This is a computationally intensive process, accomplished with an in-house-developed pipeline or a commercial software package. Prior to aligning the sequence, however, any redundant sequences are removed. Redundant, or duplicate, sequences contain identical sequence and identical start and stop positions, and are generated during one of the PCR steps of the NGS workflow. They must be removed because any PCR-introduced errors can skew variant allele frequencies and reduce variant detection [8].

The nonredundant sequencing files are then aligned to a reference sequence, and sites where the sample sequence does not match the reference sequence are identified (variant calling). The reference sequence may comprise the entire human genome, or may be limited to the genes or exons included in the panel (the panel library). Aligning short sequence reads to the entire genome is a challenge; alignment to just the panel library is much simpler and increases the number of target-matched sequences. Since amplification-based NGS (like PCR itself) is very specific, there is less concern that a sequence will map to off-target sites, and alignment to the panel library is therefore appropriate.

Following alignment, many laboratories find it useful to develop in-house bioinformatics programs (so-called scripts) to automatically calculate and display various metrics of the sequencing run. A particularly useful script is one that calculates the sequencing run statistics, including the total number of sequencing reads, how many of the reads align to the exons of the panel, and the average sequencing coverage per target base. These statistics are then compared with lab-defined coverage QC parameters to make sure the run meets minimum standards. Each exon, plus the flanking 10 bp of intronic sequence (corresponding to the splice donor and acceptor sites), must meet minimum coverage requirements as well. Exons that do not meet coverage requirements are considered "low coverage" or "no coverage" based on defined parameters. The low-coverage and no-coverage exons thus identified during the validation process are placed on a "proactive" list of exons that must undergo Sanger sequencing each time the panel is run.
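A sketch of such a run-statistics script is shown below. The input format (per-exon lists of per-base depths, such as might be parsed from samtools depth output) and the 20x cutoff are illustrative, lab-defined choices rather than fixed standards:

    def run_metrics(total_reads, aligned_reads, exon_depths, min_depth=20):
        """Summarize a run the way an in-house QC script might: total reads,
        on-target alignment rate, mean per-base coverage, and exons whose
        minimum depth (exon plus 10 bp flanks) fails the lab-defined cutoff.
        exon_depths maps exon name -> list of per-base depths."""
        all_depths = [d for depths in exon_depths.values() for d in depths]
        low = [name for name, depths in exon_depths.items()
               if min(depths) < min_depth]
        return {
            "total_reads": total_reads,
            "pct_on_target": round(100.0 * aligned_reads / total_reads, 1),
            "mean_coverage": round(sum(all_depths) / len(all_depths), 1),
            "low_coverage_exons": low,   # candidates for Sanger fill-in
        }

    exon_depths = {
        "PNKP_ex9": [180, 175, 169, 171],
        "OPHN1_ex2": [15, 18, 22, 19],   # dips below 20x, so it is flagged
    }
    print(run_metrics(total_reads=1_000_000, aligned_reads=925_000,
                      exon_depths=exon_depths))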

CLINICAL WORKFLOW

Concurrent Testing

After a multigene panel has been validated, it is available for clinical testing in a diagnostic laboratory. When a sample is received, the amplification-based NGS and the proactive Sanger sequencing of regions anticipated to have low coverage are begun simultaneously. Any additional testing, such as deletion/duplication analysis of the genes on the panel, is also run concurrently (Figure 15.1). The strategy is to collect all the relevant data at once, to aid analysis and interpretation and to reduce turnaround time.

Bioinformatics and Data Interpretation

Upon completion of amplification-based NGS, sequence alignment and variant calling occur as described above. The sequence run statistics are analyzed to determine whether the run passes QC overall, and whether any run-specific exons have low or no coverage yet are not on the proactive list. Any such exons must subsequently undergo Sanger sequencing to ensure full coverage.

The next major step is to analyze each of the patient's sequence variants, which may number in the hundreds depending on the size of the panel and the number of exons. These variants must be sorted and filtered to generate a manageable list of disease-causing candidate variants.

FIGURE 15.1 Clinical workflow of amplification-based next-generation sequencing. After a sample arrives at the lab, a custom gene list is generated and phenotype data are collected; Sanger sequencing (proactive exons, low-coverage fill-ins, and confirmations), next-generation sequencing, and deletion/duplication analysis by targeted array CGH then proceed concurrently and converge on analysis, followed by interpretation and reporting.

There are commercially available data-filtering software packages, but their reliability and interpretive capacity are not sufficiently established, so many laboratories develop their own bioinformatics pipelines. Regardless of the bioinformatics software, each variant is annotated with information from a number of external and internal resources (Table 15.3), including basic identifying information (gene name, exon, variant position), quality information (sequencing coverage, quality score, percent wild-type and variant base), and functional classification (amino acid change or potential splice site effects). Information from external databases is presented, including the allele frequencies in dbSNP and EVS (if known) and whether or not a variant is in HGMD. Variants with high allele frequencies (generally greater than 1%) in dbSNP and EVS are more likely to be benign, whereas variants with an entry in HGMD are more likely to be pathogenic. Notably, however, these databases are only a guide, since they may be contaminated with true mutations (in the case of dbSNP) or benign polymorphisms (in the case of HGMD). Finally, internal database information is provided, including the number of times a variant has been observed and the classification of the variant if it has been characterized previously.

The classification of the biologic significance of a variant is usually presented via an interpretive scale that groups variants into similar functional categories. One general scheme in use by many NGS labs has the following categories: (1) pathogenic, (2) likely pathogenic, (3) variant of unknown clinical significance (VOUS), (4) likely benign, and (5) benign. The definitions of these five classifications, as well as the classification groups themselves, differ between labs. However, the precise definitions used by individual labs, as well as the way variants are classified, are usually publicly available (for example, http://genetics.emory.edu/egl/emvclass/emvclass.php).

One option for sorting and filtering the variants (including those that may require Sanger confirmation) is to begin with allele frequency: as a rule of thumb, any variant with an EVS allele frequency over 1% is excluded as potentially disease-causing. For example, in Table 15.3, the missense variant detected in the gene OPHN1 has an EVS allele frequency of 7.8%, which is too high to be disease-causing given that MRXSO (mental retardation, X-linked, OPHN1-related) has been reported in only approximately seven families worldwide [9]. However, exceptions to the rule occur when the disease prevalence is high; for example, the prevalence of sickle cell disease in African Americans is approximately 1 in 500, and the frequency of the common disease-causing variant (p.Glu6Val) in the HBB gene is 4% (EVS). Variants can also be sorted based on the number of times a variant has been observed in the samples run by a laboratory. A variant observed in over 10% of samples in general should be eliminated (and in fact may represent a sequencing error specific to that panel); for instance, in Table 15.3, the 3 bp deletion in the RAI1 gene has been observed in 40.5% of the autism panels and so is very likely a sequencing error.
Sorting by the two parameters of allele frequency and observed frequency generally reduces the list to a handful of variants that are likely to be real (not sequencing errors) and not benign (for example, the variant in the PNKP gene in Table 15.3). These short-listed candidate variants are then confirmed by Sanger sequencing, as necessary, and are interpreted and placed into one of the five categories described above.
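The two-parameter sort described above reduces to a few lines of code. The sketch below uses the rule-of-thumb thresholds and the Table 15.3 values given in the text; the dictionary layout is hypothetical:

    EVS_AF_CUTOFF = 0.01         # >1% population frequency: likely benign
    INTERNAL_FREQ_CUTOFF = 0.10  # seen in >10% of panel runs: likely artifact

    def shortlist(variants):
        """Apply the two sorting parameters from the text to produce the
        short list of candidates for interpretation and, as necessary,
        Sanger confirmation. Thresholds are rules of thumb, not constants;
        high-prevalence diseases (e.g., sickle cell) are exceptions."""
        kept = []
        for v in variants:
            if (v.get("evs_af") or 0) > EVS_AF_CUTOFF:
                continue   # common polymorphism
            if v["times_observed"] / v["total_runs"] > INTERNAL_FREQ_CUTOFF:
                continue   # recurrent, panel-specific sequencing error
            kept.append(v)
        return kept

    variants = [
        {"gene": "PNKP", "evs_af": None, "times_observed": 0, "total_runs": 84},
        {"gene": "OPHN1", "evs_af": 0.078, "times_observed": 6, "total_runs": 84},
        {"gene": "RAI1", "evs_af": None, "times_observed": 34, "total_runs": 84},
    ]
    print([v["gene"] for v in shortlist(variants)])   # ['PNKP']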


TABLE 15.3 Annotation of Three Variant Calls from an Autism NGS Panel Test Performed at Emory Genetics Lab (EGL)

                                             Splice Site Variant  Missense Variant   3 bp Deletion
1.  Gene                                     PNKP                 OPHN1              RAI1
2.  OMIM ID                                  605610               300127             607642
3.  Transcript number                        NM_007254.3          NM_002547.2        NM_030665.3
4.  Exon                                     9                    2                  3
5.  Variant (cDNA)                           c.817-5C>CG          c.115G>GA          c.870_872delGCA
6.  Variant (protein)                        -                    39V>IV             In-frame
7.  Coverage                                 169                  86                 273
8.  Quality score                            17.7                 15.2               19
9.  A (%)                                    0                    62.79              0
10. C (%)                                    59.17                0                  0
11. G (%)                                    40.83                37.21              66.67
12. T (%)                                    0                    0                  0
13. Insertion (%)                            0                    0                  0
14. Deletion (%)                             0                    0                  33.33
15. No. of times observed/total panel runs   000/084 (0%)         006/084 (7.1%)     034/084 (40.5%)
16. In RainDance primer?                     No                   No                 No
17. dbSNP number                             -                    rs41303733         -
18. dbSNP mutant allele frequency            -                    0.056522           -
19. EVS highest mutant allele frequency      -                    0.078180737        -
20. HGMD                                     -                    -                  -
21. EGL classification                       Unknown              Benign             Unknown
22. PolyPhen                                 -                    Possibly damaging  -
23. SIFT                                     -                    Deleterious        -
24. Synonymous (Y/N)                         -                    No                 -
25. Splice                                   Donor site created   -                  ?
26. Pseudogenes                              -                    ARHGAP42P3         -
27. Pseudoexons (Y/N)                        -                    No                 -
28. Chr                                      19                   X                  17
29. Strand                                   Reverse              Reverse            Forward
30. Genomic position                         50366000             67652748           17697132
31. Flanking sequence                        TCTCC(C/G)GTAG       AAGAC(A/G)TAAT     CAGCA(delG)CAAG
32. Reference sequence                       C                    G                  G
33. Selected for confirmation?               Yes                  No                 No

Of the three, only the splice site variant in PNKP was selected for Sanger confirmation.

CONCLUSION

Advantages and Disadvantages of Amplification-Based NGS

The major advantages of amplification-based target selection are the greater enrichment specificity and uniformity versus hybrid capture-based approaches (Table 15.4).


TABLE 15.4 Advantages and Disadvantages of Amplification-Based NGS Compared to Hybridization-Based NGS

| Advantages | Disadvantages |
|---|---|
| Greater enrichment specificity and uniformity | Larger genomic DNA input requirements |
| Less sensitive to base composition and repetitive sequence | Not as easily scalable: limited target region size and number of samples |
| More automated/less laborious | Expensive reagents and primers |

Hybrid capture is sensitive to base composition because sequences with high A-T or G-C content can be lost due to poor annealing and secondary structure. Furthermore, hybrid capture probes usually undergo repeat masking to avoid capture of homologous repetitive elements; as a result, 5-15% of the desired target region may be lost [10]. In contrast, amplification-based target capture uses primer pairs whose position and amplicon length can be adjusted to accommodate repetitive sequences and high A-T and G-C content (Table 15.4).

Two advantages of multigene NGS panel testing versus WES are the more complete coverage of all the genes on the panel and fewer incidental findings, the latter of which pose bioinformatics and ethical challenges. It is worth noting that as the number of panel-based tests increases, they will provide ever-greater amounts of information about variants, so the number of VOUS will decrease over time, leading to greater interpretability of panels compared with whole exome-based approaches. Increased certainty about the clinical significance of variants will significantly reduce the time needed to interpret and report test results, in turn reducing cost and turnaround time.

One disadvantage of amplification-based NGS compared to hybrid capture-based NGS is that it is not as easily scalable: hybrid capture can enrich large target regions in a single experiment more rapidly and cheaply than PCR (Table 15.4). Compared with WES, a disadvantage of multigene panel testing is that there is no chance of finding novel disease-associated mutations outside the target region. Moreover, as the field of constitutional genetics expands, new genes will be identified that are associated with a clinical phenotype, requiring time-consuming redesign and revalidation of the panel.

Future Directions

Amplification-based NGS relies on technologies that can accurately and reliably perform thousands of PCR reactions, generally achieved by physically separating each reaction in emulsion droplets or microfluidic chambers. This feature of amplification-based NGS allows the absolute number of droplets (or chambers) containing specific DNA variants to be compared with the number of droplets containing wild-type DNA, an absolute quantification without reference to standards or endogenous controls known as digital PCR [11]. Digital PCR is very sensitive, capable of detecting one mutant among 250,000 wild-type molecules, and has major applications for rare allele detection in samples containing low-level mosaicism, where the mutant allele occurs at a much lower frequency than wild type. Rare alleles are a frequent occurrence in cancer, for example, but may also occur in constitutional disorders. Moving forward in the amplification-based genomics era, continual improvements in sequencing technology will make rare allele detection more straightforward, which will lead to greater understanding and diagnosis of rare constitutional genetic disorders.
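The droplet arithmetic behind digital PCR is simple enough to sketch. Because more than one target molecule can end up in the same droplet, the positive-droplet fraction is Poisson-corrected before computing the mutant fraction; the droplet counts below are invented for illustration, not from any published assay.

```python
import math

def copies_per_droplet(positive, total):
    """Poisson-corrected mean target copies per droplet.

    With random partitioning, the fraction of negative droplets is
    exp(-lam), so lam = -ln(1 - positive/total).
    """
    return -math.log(1.0 - positive / total)

# Hypothetical droplet counts from a rare-allele assay.
total_droplets = 1_000_000
mut = copies_per_droplet(8, total_droplets)        # mutant-probe positives
wt = copies_per_droplet(420_000, total_droplets)   # wild-type-probe positives

print(f"mutant fraction ~ {mut / (mut + wt):.2e}")  # roughly 1 in 68,000
```

Because the estimate counts partitions directly, no standard curve or endogenous control is needed, which is what makes the quantification "absolute."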

References

[1] Kaplan J, Bonneau D, Frezal J, Munnich A, Dufier JL. Clinical and genetic heterogeneity in retinitis pigmentosa. Hum Genet 1990;85(6):635-42.
[2] Tidyman WE, Rauen KA. The RASopathies: developmental syndromes of Ras/MAPK pathway dysregulation. Curr Opin Genet Dev 2009;19(3):230-6.
[3] Reed UC. Congenital muscular dystrophy. Part I: a review of phenotypical and diagnostic aspects. Arq Neuropsiquiatr 2009;67(1):144-68.
[4] Tewhey R, Warner JB, Nakano M, Libby B, Medkova M, David PH, et al. Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat Biotechnol 2009;27(11):1025-31.
[5] Yu Y, Wu BL, Wu J, Shen Y. Exome and whole-genome sequencing as clinical tests: a transformative practice in molecular diagnostics. Clin Chem 2012;58(11):1507-9.
[6] Seaver LH, Irons M. ACMG practice guideline: genetic evaluation of short stature. Genet Med 2009;11(6):465-70.
[7] Johnston Rohrbasser LB. Genetic testing of the short child. Horm Res Paediatr 2011;76(Suppl. 3):136.
[8] Koboldt DC, Ding L, Mardis ER, Wilson RK. Challenges of sequencing human genomes. Brief Bioinform 2010;11(5):484-98.


[9] Online Mendelian Inheritance in Man, OMIM. Baltimore, MD: Johns Hopkins University; MIM Number: 300127: 06/03/2011. World Wide Web URL: http://omim.org/.
[10] Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods 2010;7(2):111-8.
[11] Pohl G, Shih Ie M. Principle and applications of digital PCR. Expert Rev Mol Diagn 2004;4(1):41-7.

List of Acronyms and Abbreviations

NGS   Next-generation sequencing
WES   Whole exome sequencing
ACMG  American College of Medical Genetics
EGL   Emory Genetics Lab
QC    Quality control
BLAT  BLAST-like alignment tool
SNP   Single nucleotide polymorphism
HGMD  Human Gene Mutation Database
EVS   Exome Variant Server
VOUS  Variant of unknown clinical significance


C H A P T E R

16 Targeted Hybrid Capture for Inherited Disease Panels

Sami S. Amr¹ and Birgit Funke²

¹Department of Pathology, Brigham and Women's Hospital/Harvard Medical School, Boston, MA, USA; Laboratory for Molecular Medicine, Partners Healthcare Personalized Medicine, Cambridge, MA, USA
²Department of Pathology, Massachusetts General Hospital/Harvard Medical School, Boston, MA, USA; Laboratory for Molecular Medicine, Partners Healthcare Personalized Medicine, Cambridge, MA, USA

O U T L I N E

Introduction
  Inherited Cardiomyopathies
  Costello Syndrome
  Hereditary Hearing Loss
  Evolution of Medical Sequencing in Molecular Diagnostics

Target Selection Using Hybridization-Based Capture

Design and Implementation of Targeted Hybridization-Based Capture Panels
  Technical Design Considerations
    Determining the ROI
    Ensuring Adequate Coverage Across the Entire ROI
    Sequencing Regions of Increased or Decreased GC Content
    Sequencing Regions with High Sequence Homology
  Operational Considerations: Workflow, Cost and TAT
    Workflow
    Sequencing Cost
    Cost-Reduction Measures
    Factors Impacting the Ability to Batch and Pool Samples
    Automation
    Turnaround Time
    Impact of the Type of Sequencing Machine
  Targeted Hybrid Capture: Analytical Sensitivity Across the Variant Spectrum

Targeted Hybrid Capture: Selecting a Panel for Constitutional Diseases
  Gene Panel Testing Strategy
  Anticipating Technical Limitations
    Regions with Low Coverage
    GC-Rich or Repetitive Regions
    Genes with Homology to Other Loci
  Anticipating Interpretive Challenges: Impact of Panel Size on Variant Interpretation
  Disease-Targeted Gene Panels: Comparison with Other Sequencing Strategies
    Whole Genome Sequencing
    Whole Exome Sequencing
    Amplification-Based Capture Methods
    Other Target Selection Methods

Applications in Clinical Practice: Lessons Learned
  Benefits of Targeted NGS Capture Panels
    Inherited Cardiomyopathies
    Hearing Loss and Related Disorders
  Challenges

Conclusion and Outlook

References

Clinical Genomics. DOI: http://dx.doi.org/10.1016/B978-0-12-404748-8.00016-2
© 2015 Elsevier Inc. All rights reserved.


INTRODUCTION

Over the past 20-30 years, Sanger sequencing has dominated medical sequencing. Despite its low throughput and high cost, this technology was used by the Human Genome Project (HGP) to sequence the entirety of the human genome. As knowledge of gene-disease associations expanded and genetic testing evolved from genotyping a handful of well-established disease-causing variants to sequencing entire genes, Sanger sequencing quickly became a bottleneck for comprehensive genetic testing. Of the approximately 22,000 genes in the human genome, more than 5000 have now been associated with constitutional disorders [1]. Some of these disorders are phenotypically well defined and can be attributed to a single or a few variants in one gene, such as achondroplasia, which is caused by the Gly380Arg variant in FGFR3 in about 99% of individuals tested [2]. Genetic testing for constitutional disorders therefore typically consisted of genotyping assays that were simple, inexpensive, and fast. This paradigm has changed dramatically with the realization that many disorders exhibit not only a high degree of genetic heterogeneity but often also a wider phenotypic spectrum than initially appreciated. Testing for single variants, or sequencing of one or a few genes, is therefore often no longer appropriate, and genetic testing is now rapidly moving toward testing large numbers of genes, up to the exome or the entire genome. The following examples illustrate this paradigm shift and its impact on the evolution of medical sequencing.

Allelic heterogeneity describes the identification of sometimes hundreds to thousands of pathogenic variants in one gene across a patient cohort. Therefore, in diseases displaying allelic heterogeneity, interrogation of every base in a gene or group of genes is necessary to increase variant detection across a patient cohort. As expected, fixed-content genotyping panels miss much of the pathogenic variation in most populations. One such example is cystic fibrosis (CF), in which over 1000 pathogenic variants in the CFTR gene have been identified [3]. While one common variant, F508del (also known as ΔF508), is detected in 30-80% of affected individuals, the vast majority of CF variants are present at low frequencies and may be unique to an ethnicity, race, or a family. The American College of Medical Genetics and Genomics (ACMG) recommends a panel of the 23 most common variants; however, the panel's detection rate is suboptimal in most populations and ranges from 94% in the Ashkenazi Jewish population to 49% in the Asian American population. The use of targeted genotyping tests such as the CF panel was long driven by their lower cost compared with sequencing-based tests. However, variants that are rare or are present in minority populations are often missed by these types of genotyping panels. This exposes the need for genetic testing to extend beyond the 23-variant panel by sequencing the entire gene in order to increase clinical sensitivity.

Locus heterogeneity refers to one disorder being caused by variants in one of several to many genes. This phenomenon is common among genetic disorders and amplifies the need for comprehensive sequencing to maximize clinical sensitivity. In some cases, many genes can harbor pathogenic variants, but the majority reside in a very small number of genes (e.g., 80% of pathogenic hypertrophic cardiomyopathy (HCM) variants are located in two genes, MYH7 and MYBPC3) [4].
Other diseases, for example dilated cardiomyopathy (DCM), can be caused by a single variant in any one of over 40 genes, most of which contribute a relatively small fraction (e.g., less than 1% to about 6%), with the main contributor identified to date (TTN) only accounting for about 15-25% (Laboratory for Molecular Medicine (LMM), unpublished data; [5,6]). To further complicate matters, variants identified in these genes are usually unique to an individual or a family, displaying an extreme degree of allelic heterogeneity. Therefore, the identification of a causative variant in any single patient requires interrogation of thousands of exons by sequence analysis.

Clinical overlap with related disorders is increasingly appreciated and can lead to diagnostic uncertainty. Traditionally, genetic testing panels have had a relatively narrow clinical focus, resulting in testing odysseys when initial testing was negative. Prominent examples include inherited cardiomyopathies, Noonan spectrum disorders, and hereditary hearing loss, where a clinical diagnosis alone is often insufficient to correctly select the most appropriate set of genes, as follows.

Inherited Cardiomyopathies

Clinical overlap is increasingly appreciated for DCM, which can represent an end-stage presentation of other cardiomyopathies such as HCM or arrhythmogenic right ventricular cardiomyopathy (ARVC) [7,8], sometimes leading to misdiagnosis in the clinic. Expanded gene panels now commonly cover all types of inherited cardiomyopathy, and the detection of pathogenic variants in genes that were previously thought to be unique to one cardiomyopathy in patients with another is evidence of the clinical utility of this approach [6].


Costello Syndrome

Approximately 80-90% of patients with this diagnosis are thought to carry a disease-causing variant in the HRAS gene [9]. However, genetic testing has shown that in broad referral populations drawn from many healthcare providers, not all of whom are specialists, the detection rate is much lower, and many patients are found to carry a pathogenic variant in overlapping RASopathies (LMM, unpublished data).

Hereditary Hearing Loss

Hearing loss, the most common birth defect, affects 1 in 500 newborns, and over 110 chromosomal loci and more than 65 genes have been associated with hearing loss [10]. The inheritance pattern of hereditary hearing loss can be dominant, recessive, X-linked, or mitochondrial depending on the aberrant gene and the type of variant. In addition, hearing loss can present in isolation (nonsyndromic) or in association with additional features as part of a broader disease (syndromic). In certain syndromes, the initial presentation appears to be congenital nonsyndromic hearing loss; however, additional symptoms manifest later in life. Examples include congenital hearing loss preceding the onset of retinitis pigmentosa (Usher syndrome) or a prolonged QT interval (Jervell and Lange-Nielsen syndrome). These complexities pose challenges for genetic testing in individuals with hearing loss, requiring the screening of genes associated with isolated hearing loss as well as genes responsible for syndromic disorders in which hearing loss is only one of several features.

Evolution of Medical Sequencing in Molecular Diagnostics

The primary objective of a molecular diagnostic laboratory is to provide comprehensive, low-cost, and accurate genetic testing in order to confirm a suspected diagnosis, assess recurrence risk to families, and inform on disease risk and treatment response. More comprehensive understanding of genetic diseases, such as the ones described above, and of the broad spectrum of genes and variants associated with any particular disorder, highlights the need for expanded gene panels to maximize clinical sensitivity. For over three decades, the "gold standard" for genetic analysis has been Sanger sequencing, which moved the field beyond limited genotyping by allowing interrogation at every nucleotide position within a gene. While being the clear method of choice for single-gene disorders with a well-defined clinical indication, Sanger sequencing can sequence only a single 500-600 bp amplicon at a time, thereby limiting throughput and scalability. Figure 16.1 illustrates the evolution of medical sequencing for DCM, which is associated with >40 genes. Until 2009, test panels included only 5-10 genes, covering a region <10-15 kb in size. By Sanger sequencing, the addition of the TTN gene, which is now known to have the highest contribution to disease, was not possible due to the gene's enormous size (about an 81 kb coding region).

FIGURE 16.1 Evolution of medical sequencing: dilated cardiomyopathy. [Figure: kb sequenced, number of genes per panel, and test price for DCM testing from 2007 (Sanger sequencing) through 2009 (resequencing microarrays) to 2011 (next-generation sequencing).]


As the need to expand gene panels in a cost-effective, high-throughput manner grew, novel DNA sequencing platforms, such as array-based oligo-hybridization sequencing, began to emerge [11,12]. These technologies can target a larger amount of sequence (10-300 kb) in a repetitive manner, making it possible to simultaneously sequence a greater number of genes. Improvements offered by array-based resequencing over Sanger sequencing led to increased clinical sensitivity due to increased content, while reducing cost and turnaround time (TAT) (Figure 16.1) [11]. However, limitations of array-based sequencing technology include a reduced ability to detect novel insertions and deletions, scalability only to target regions <300 kb, and static designs, creating difficulty in the addition of new content.

The development of high-throughput sequencing methods, referred to as next-generation sequencing (NGS) or massively parallel sequencing (MPS), has facilitated increased capacity to sequence larger regions while reducing the cost of sequencing per base at a pace far faster than Moore's law [13]. Without restrictions on the size of the target regions, NGS affords the ability to sequence an expanded number of genes, the exome, or the genome of a patient for the identification of disease-causing variants. The field is now rapidly moving to routine whole exome sequencing or even whole genome sequencing (WES and WGS, respectively). However, while WES/WGS have the greatest potential to achieve maximum clinical sensitivity, these strategies are still costly and require lengthy analysis and interpretation. Thus, for clinical testing of genetic diseases with an established repertoire of associated genes, selective enrichment of sets of genes (ranging from tens to hundreds) with subsequent sequencing provides the greatest cost-to-benefit ratio of NGS technology.

Increased understanding of genetic diseases such as the ones described above, as well as the realization of the broad spectrum of genes and variants associated with any particular disorder, highlights the need for expanded gene panels that provide greater clinical sensitivity. Clinical testing needs to include interrogation of all possible types of variants (single nucleotide variants (SNVs), small indels, large deletions and duplications, and other structural rearrangements). In the past, detection of larger copy number changes or structural rearrangements was difficult (Figure 16.2). However, NGS has the ability to detect all variant types, and clinical laboratories are beginning to use NGS assays for comprehensive variant detection. This chapter describes all aspects of a commonly used type of disease-targeted NGS sequencing, specifically the targeted hybrid capture approach, which is one of several possible ways to select genes of interest from an individual's genome prior to sequencing.

FIGURE 16.2 Sequencing technologies: capabilities and limitations in variant detection. [Figure panels: Sanger sequencing (panels with single to a few genes; detects SNVs and indels, but not CNVs; identifies novel variants); resequencing microarrays (multigenic disease panels but limits on size of target region; limited detection of the mutation spectrum, optimal for SNVs only; identifies novel variants); next-generation sequencing (multigenic disease panels without size limits, up to whole genome/exome; detects a broad spectrum of variants (SNVs, indels, and CNVs); identifies novel variants).]


TARGET SELECTION USING HYBRIDIZATION-BASED CAPTURE

Two target enrichment approaches have been developed and are commonly applied in the clinical laboratory setting: (1) targeted hybrid capture-based enrichment, and (2) amplification-based enrichment using the polymerase chain reaction (PCR). The applicability and success of different targeting strategies depend on several parameters, including (1) on-target coverage, (2) probe design efficiency, (3) uniformity, (4) analytical specificity and sensitivity, (5) cost, (6) ease of use, and (7) amount of starting DNA per capture [14,15]. Herein, the focus will be on the implementation and application of the most widely used targeted hybrid capture strategies in clinical testing, with comparisons to other NGS strategies (WGS, WES, and amplification-based enrichment) employed in molecular diagnostics.

In the targeted hybrid capture approach, custom-designed DNA or RNA "baits," either immobilized on an array (solid phase) or in solution (biotinylated oligomers), hybridize to the target sequence within a library of genomic fragments prepared from the patient sample using one of many available approaches (see Chapter 3). The captured fragments are purified and subsequently enriched by low-cycle PCR. This enriched target library can then be sequenced on an NGS platform. This type of enrichment strategy can target DNA regions ranging from ~100 kb to ~5 Mb, making the approach ideal for designing large gene panels for a disease or a group of closely related diseases. The solid-phase hybrid capture approach usually requires expensive specialized equipment, and a large amount of starting DNA (10-15 μg) is needed regardless of the target region size. In-solution methods can be carried out with smaller quantities of starting DNA and can be performed using 96-well thermocyclers common to molecular laboratories. Although both solid-phase and in-solution hybrid capture approaches have been successfully used by clinical laboratories, the latter method is reported to yield better coverage of target regions, does not require special equipment, has no limitations on the concentration of probes due to steric hindrance and array size, and can be run using 1 μg (or less) of starting DNA. Both types of methodologies are higher throughput and faster than amplification-based methods, and are less prone to contamination.

A wide variety of RNA and DNA "baits" have been described [14-19], with lengths ranging from 30 bp to more than 800 bp; however, the optimal length of a bait and the efficiency with which it binds its target depend on the length of the fragments in the genomic library. For clinical applications in which target regions represent noncontiguous chromosomal sequences (coding exons) across a large number of genes, fragment sizes and capture probes on the shorter end of that range provide the greatest coverage and specificity. The average length of a human protein-coding exon is 120 bp [20]; therefore, capture probes that can effectively hybridize to fragments that are about 150-200 bp long are desired to increase the on-target coverage of coding regions. However, fragments within a genomic library that are smaller than 500 bp yield reduced hybridization efficiency and target specificity [21], and testing can only be performed by increasing the input DNA.
On the other hand, by using excess bait in an in-solution capture reaction, shorter fragments (about 250 bp long) can be effectively captured with high specificity while maximizing on-target coverage for exon-sized target fragments [14].

Commercial hybridization kits, currently available through Roche NimbleGen, Agilent, and Illumina, offer immobilized (NimbleGen SeqCap array) or in-solution (NimbleGen SeqCap EZ, Agilent SureSelect, and Illumina TruSeq) capture probes that can be customized to the desired target region (Table 16.1). The type and length of baits vary across kits, with DNA baits of 60-90 bp and 95 bp used by the NimbleGen and Illumina kits, respectively; Agilent utilizes 120-160 bp RNA or DNA capture probes. Several studies have evaluated the performance of each type of baiting strategy [21-23], based on a custom design [23] or vendor-designed exome capture kits [21,22]. The different platforms were assessed for several performance parameters, including on-target coverage, uniformity, and analytical sensitivity. Higher on-target coverage depth (>30×) and coverage reproducibility across samples were observed with Agilent SureSelect over NimbleGen's array capture platform, while the latter provided greater uniformity of coverage across targeted regions [23]. A different outcome has been reported with NimbleGen's SeqCap EZ Exome Library v2.0, an in-solution hybrid capture kit using overlapping 60-90-mer baits, which provides slightly better depth and coverage of targeted bases (96.8% at ≥10× and 81.2% at ≥20×) than Agilent's SureSelect Human All Exon 50 Mb kit (89.6% at ≥10× and 60.3% at ≥20×), which uses adjacent 150-mer RNA baits [21,22]. The coverage performance of Illumina's TruSeq Exome Enrichment kit, with spaced 95-mer DNA baits, was comparable to the Agilent kit, with 90.0% of targeted bases at ≥10× coverage depth [21]. However, the Agilent and Illumina kits enriched a larger number of target bases with greater depth [21]. These differences in coverage performance are attributed to the probe design strategy applied in each kit: a higher probe density targeting a smaller genomic region, as employed by NimbleGen, results in more efficient hybridization, while longer probes with a lower density design, as is the case with the Agilent and Illumina kits, can cover a greater target region at the expense of completeness of on-target coverage [21].

TABLE 16.1 Descriptive and Workflow Parameters of Commercial Targeted Capture Kits

In-solution hybrid capture: Agilent SureSelect, NimbleGen SeqCap, Illumina TruSeq. Selector probe-based capture: Agilent HaloPlex. Amplification-based capture: RainDance.

| Probe and workflow parameters | Agilent SureSelect | NimbleGen SeqCap | Illumina TruSeq | Agilent HaloPlex | RainDance |
|---|---|---|---|---|---|
| Probe | 120-mer RNA baits (a) | 50-105-mer DNA baits | 95-mer DNA baits | Selector probes (b) | Multiplex PCR primers |
| Target size (per kit) (c) | <200 kb-50 Mb | 100 kb-50 Mb | 500 kb-25 Mb | 1 kb-5 Mb | 100 kb-10 Mb |
| Input DNA | 3 μg | 1-3 μg | 1 μg | 200 ng | Varies; greater target size requires more input |
| Hybridization duration | 24-72 h | 72 h | 16-20 h × 2 | 3 or 16 h (d) | N/A |
| Advantages | Longer baits tolerate mismatches; greater sensitivity for indels; better capture of low-GC regions; RNA-DNA hybrids stronger than DNA-DNA hybrids | Higher density of tiling using shorter probes provides better uniformity | Requires lower amount of input DNA for comparable target size region | Lowest amount of input DNA; shortest hybridization duration and fastest workflow | Faster workflow than in-solution hybrid capture methods; better ability to capture GC-rich regions and genes with pseudogenes |
| Disadvantages | Decreased capture efficiency due to off-target capture; longer workflow | Decreased capture efficiency due to off-target capture; longer workflow | Decreased capture efficiency due to off-target capture; longer workflow | Shearing dependent on restriction enzyme digest | Limited target size; requires increased DNA input with larger target regions |

(a) Agilent also offers 120-mer DNA baits.
(b) Selector probes contain target-complementary end sequences (~20 bp at each end).
(c) Recommended target size based on workflow considerations as well as maximum target limits of the kit.
(d) Probe designs with <20,000 probes require a 3 h incubation; those with >20,000 probes require a 16 h incubation.

For targeted hybrid capture of custom gene panels with target regions smaller than the exome, a dense tiling strategy for Agilent and Illumina probe designs can increase on-target coverage and depth. One limitation seen across all platforms was enrichment of target regions with high or low GC content. High GC content (60-80%) resulted in a sharp decrease in coverage across all platforms [21-23]. The same drop in performance was observed in the NimbleGen and Illumina designs at low GC content (20-40%), while the longer probes and/or efficient RNA baits afforded the Agilent platform efficient coverage of these regions [21].

Comparison of the analytical sensitivity of each platform, assessed by concordance of detected variants with those identified on an independent platform such as a genotyping chip, revealed that increased sensitivity does not necessarily correlate with increased capture efficiency. In these studies, to control for on-target coverage discrepancies across the platforms, only variants with 20× coverage were included in the analysis. One study reported that the analytical sensitivity for single nucleotide variants (SNVs) was comparable between NimbleGen's array capture and Agilent's SureSelect technology, with concordance rates of 99.7% (952/955 correct matches) and 99.8% (936/938 correct matches), respectively [23]. Similar concordance for SNVs was reported for NimbleGen's SeqCap in-solution kit and Illumina's TruSeq Exome kit [21,22]. In addition, all platforms showed minimal allelic bias (AB ratio = 0.53-0.55) and performed equally well in detection of heterozygous SNVs [21-23]. Lower sequencing depths (<20×) resulted in an increase in discordant SNV calls across all platforms [22].

Analysis of the ability of these hybrid capture platforms to detect insertions and deletions (indels) revealed that total coverage region and target coverage depth both contributed to identification of indels. The highest


sensitivity for indels within a common target region was observed when using the NimbleGen SeqCap exome v2.0 kit, owing to greater coverage depth, but this capture kit detected a lower total number of indels than both the Illumina and Agilent kits, owing to the larger size of the targeted region of the latter two. The Agilent kit detected the greatest number of indels of all three capture designs and notably identified indels at lower total read counts, presumably due to the superior binding properties of longer RNA baits [21]. However, no differences in the mean size of indels detected across the platforms were noted [21,22].

Although these studies focused on exome capture kits, their conclusions can be generalized to all sizes of targeted capture panels, and thus they reveal important considerations for choosing the type of probe and design strategy. In-solution designs have clear advantages over solid-phase designs in terms of technical requirements, as the latter have limits on the number and density of probes on the array and require hybridization stations [15]. It should be noted that NimbleGen's solid-phase hybrid capture arrays are no longer available, and only the in-solution SeqCap capture methods are offered. Assessment of in-solution probe designs illustrates that a higher density of probes covering a target region, as illustrated by the NimbleGen in-solution design, yields higher coverage of on-target regions. On the other hand, longer baits with increased hybridization efficiency, as is characteristic of Agilent's RNA probes, can more readily capture difficult regions such as those with low GC content or those containing indels. In addition, SNV detection and accurate zygosity calling depend largely on coverage depth at that position, with at least 20× sequencing depth necessary to provide high analytical sensitivity. By combining the high-density approach and the use of longer RNA baits, custom probe designs can provide optimal coverage and sequence depth with increased sensitivity for detection of SNVs and indels.
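Because GC content so strongly predicts capture efficiency, a design pipeline can screen candidate baits up front and earmark outliers for denser tiling or spike-in probes. The sketch below is a minimal illustration under the assumption of a roughly 40-70% "well-behaved" GC window; the bait names and sequences are invented.

```python
# Minimal sketch: flag candidate capture baits whose GC content falls outside
# an assumed efficiently hybridizing window (40-70%), so they can be tiled
# more densely or supplemented with spike-in probes. Sequences are invented.

def gc_fraction(seq: str) -> float:
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def flag_difficult_baits(baits, low=0.40, high=0.70):
    """Yield (name, gc) for baits likely to capture poorly."""
    for name, seq in baits.items():
        gc = gc_fraction(seq)
        if gc < low or gc > high:
            yield name, gc

baits = {
    "GENE1_ex1_bait3": "GCGCGGCCGGAGCCGCGGGC" * 6,  # GC-rich first exon
    "GENE2_ex5_bait1": "ATGCAGTCACGATCACGCTA" * 6,  # balanced GC content
}
for name, gc in flag_difficult_baits(baits):
    print(f"{name}: GC={gc:.0%} -> add spike-in probes / denser tiling")
```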

DESIGN AND IMPLEMENTATION OF TARGETED HYBRIDIZATION-BASED CAPTURE PANELS

Technical Design Considerations

In clinical practice today, targeted gene panels are typically used when the clinical diagnosis is relatively clear. Besides cost, the main reason is that panels enable sequencing at high coverage across the complete target, which in turn ensures optimal analytical and clinical sensitivity. However, NGS technologies have several important advantages and drawbacks, which need to be taken into account at the design stage. Important limitations include inefficiencies in capture and mapping of GC-rich regions, as well as difficulties sequencing regions with high sequence homology or repeat expansions. It is also vital to understand optimal placement of oligonucleotide capture probes to ensure complete and adequate coverage at the ends of a targeted region of interest (ROI).

Determining the ROI

The ROI of any given gene within a disease panel usually encompasses all sequences of that gene in which variation has been demonstrated to impact its transcription, or the structure or function of the translated gene product. This may include coding exons, conserved splice site regions within flanking introns (±15 bp from the exon), 5′ and 3′ untranslated regions (UTRs), and any additional sequences or loci outside of these regions that have been reported in the literature to harbor pathogenic variants. In reality, most laboratories do not yet systematically interrogate 5′ and 3′ UTRs or other deep intronic sequences because of limited understanding of the role of these regions in disease; while it is expected that these regions will eventually prove to be important contributors, most variation identified in them today would be of "unknown clinical significance" and thus of limited clinical utility for the patient. Intronic regions outside of the conserved splice consensus sequence are 5.5 kb in length on average [24], and inclusion of these regions in targeted capture panels will increase cost with little to no added clinical utility. The median size of a human exon, on the other hand, is about 200 bp, with genes containing 8-9 exons on average [24]. Therefore, targeting only exons and flanking intronic splice site regions allows inclusion of a large number of genes without compromising coverage.

Ensuring Adequate Coverage Across the Entire ROI

The captured territory usually extends beyond the last capture probe (as it will bind to the 5′ or 3′ end of some genomic DNA fragments that exist for that region). However, coverage drops at the ends of the targeted sequence as fewer fragments are captured, and therefore probes should be designed beyond the exon and the flanking ±15 bp intronic splice sequences. Baits extending ±50 bp outside exons into flanking introns usually ensure adequate coverage of the ROI (Figure 16.3).

FIGURE 16.3 Coverage at a sample ROI using a 3× probe-tiling strategy. [Figure: read depth and mapped reads across a targeted region, with coverage dropping off at the ends of the targeted region; 3× tiling of 120-mer biotinylated RNA probes across the ROI.]
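The interval arithmetic described above (exon ±15 bp for the ROI, with baits extending a further ±50 bp) is easy to express over BED-style coordinates. This is a minimal sketch with invented exon coordinates, not a production probe-design tool.

```python
# Minimal sketch of ROI and bait-territory padding over BED-like intervals
# (0-based, half-open). Exon coordinates below are invented examples.

SPLICE_PAD = 15  # conserved splice-site region retained in the ROI
BAIT_PAD = 50    # extra territory tiled with probes beyond the ROI

def pad(intervals, pad_bp):
    """Extend each (chrom, start, end) interval by pad_bp on both sides."""
    return [(chrom, max(0, start - pad_bp), end + pad_bp)
            for chrom, start, end in intervals]

exons = [("chr19", 50365900, 50366050), ("chr19", 50367200, 50367330)]
roi = pad(exons, SPLICE_PAD)
bait_territory = pad(exons, SPLICE_PAD + BAIT_PAD)

for (c, s, e), (_, bs, be) in zip(roi, bait_territory):
    print(f"{c}\tROI {s}-{e}\tbaits {bs}-{be}")
```

In practice adjacent padded intervals would also be merged, and the bait territory tiled at the chosen probe density (e.g., 3×, as in Figure 16.3).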

To further enhance coverage of the ROI, an increase in the density of probes targeting each region, also referred to as probe- or bait-tiling, is recommended. Higher probe density across a target region has been observed to increase coverage [18]. The extent of probe-tiling covering each base in the sequence will depend on limitations on the maximum size of the probe design and the total size of the gene panel. With most gene panels, a 3× probe-tiling frequency at each base pair is possible with long (about 120 bp) baits, yielding efficient capture and high coverage across the ROI. Increased probe-tiling density across difficult regions, such as GC-rich and AT-rich regions, can also improve their coverage.

Sequencing Regions of Increased or Decreased GC Content

The GC content of a gene region can impact its coverage: regions with 50-60% GC content receive the highest coverage, while regions with high (70-80%) or low (30-40%) GC content have significantly decreased coverage [14]. Yet adequate coverage of GC-rich regions (which are commonly present in the promoter and first exon of many genes) is necessary for high analytical sensitivity of a targeted gene panel. The coverage bias observed in GC-rich regions is attributed, in part, to suboptimal hybridization of baits to these regions, in addition to the introduction of compression artifacts of GC stretches during the PCR-amplification steps of library construction [25]. To mitigate loss of coverage, several modifications to both the probe design phase and the library construction process can be made. Longer probes (about 120 bp) capture GC-rich regions at a higher efficiency [19]. In addition, "spiking in" additional baits targeting these regions can improve their coverage. Often, however, GC content bias is introduced during PCR in the library amplification and post-capture enrichment steps of library construction, leading to a lower representation of GC-rich regions [19,25,26]. Omitting PCR-amplification steps may decrease GC content bias [26]; however, this modification requires a larger input amount of DNA. Modifications to the PCR protocol, including prolonging the denaturation step, decreasing the temperature of the primer-extension step, and using enhanced Taq polymerases, such as AccuPrime Taq HiFi, can offset PCR-induced depletion of GC-rich regions [25]. Alternatively, bias-free library amplification can be achieved using water-in-oil emulsion PCR, with GC-coverage profiles comparable to amplification-free libraries [19]. However, these measures do not protect against bias introduced even further downstream during cluster formation and sequencing [25]. Taken together, these studies demonstrate that the best approach to GC-rich regions in target panels combines adequate capture, via the addition of long "spike-in" probes, with modifications to the PCR-amplification steps during library construction.

FIGURE 16.4 Targeted hybrid capture workflow. [Figure: genomic DNA shearing → adapter ligation → library amplification → pooling prior to hybridization → target hybridization → target enrichment → target library elution → pooling prior to sequencing → NGS.]

Sequencing Regions with High Sequence Homology

Regions with homology to more than one genomic locus pose another major obstacle to hybrid capture target enrichment. Examples include short tandem repeats that lie in promoter, intronic, and, less commonly, exonic regions; SINEs; highly homologous functional genes that arose from recent gene duplication events; and genes that have processed pseudogenes. Here, the problem lies in the co-capture of homologous sequences, which, depending on the read length and degree of homology, may align to the wrong locus and therefore generate false positive (or false negative) variant calls [27]. This can be partially resolved by the use of longer sequencing reads and/or paired-end sequencing [27]. However, depending on the length of the homologous sequence, it may be necessary to exclude such regions from the probe design, for example by using the RepeatMasker program (http://www.repeatmasker.org) [14,18]. The human genome is estimated to contain approximately 20,000 pseudogenes, including about 8000 "processed" pseudogenes that share an average of 86% of nucleotides and 75% of amino acids with their closest corresponding gene [28,29]. Unfortunately, many of these are medically relevant; some examples include GBA (Gaucher disease), NF1 (neurofibromatosis type 1), PKD1 (polycystic kidney disease), and STRC (nonsyndromic hearing loss). The NF1 gene has seven pseudogenes and other homologous sequences at different loci, and the STRC gene has a nearby pseudogene that is 99.6% identical across exonic and intronic sequences. These genes are currently not analyzable by standard NGS, and their sequencing necessitates development of alternate or complementary assays (typically standard or long-range PCR-based approaches, provided that unique primer binding sites are available).
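One practical way to spot targets compromised by homology after sequencing is to measure the fraction of reads the aligner could not place uniquely (low mapping quality). The hedged sketch below assumes pysam and a coordinate-sorted, indexed BAM; the file name, coordinates, and MAPQ cutoff are illustrative assumptions, not fixed recommendations.

```python
# Sketch: flag capture targets dominated by ambiguously mapped reads, a
# symptom of co-captured homologous loci. Requires pysam and an indexed BAM;
# the path, region, and cutoff below are hypothetical.
import pysam

def low_mapq_fraction(bam_path, chrom, start, end, mapq_cutoff=20):
    """Fraction of mapped reads in a region with MAPQ below the cutoff."""
    low = total = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            if read.is_unmapped:
                continue
            total += 1
            if read.mapping_quality < mapq_cutoff:
                low += 1
    return low / total if total else 0.0

# A target where most reads cannot be placed uniquely (e.g., STRC and its
# 99.6%-identical pseudogene) warrants an alternate assay such as
# long-range PCR followed by Sanger sequencing.
frac = low_mapq_fraction("sample1.bam", "chr15", 43891000, 43910000)
print(f"low-MAPQ read fraction: {frac:.1%}")
```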

Operational Considerations: Workflow, Cost and TAT

Workflow

Generally, hybrid capture approaches, regardless of probe design and strategy, require the same workflow for processing samples, which includes shearing of the DNA, adapter ligation, library enrichment, hybrid capture, targeted library enrichment, and quantification for pooling and/or loading onto the sequencing instrument (Figure 16.4). Several additional quality control steps, to check the size range and amount of nucleic acid of the


sheared DNA sample and to check the sample after library preparation, may be added to ensure successful library construction. Most of these steps can be carried out with instruments readily available in most molecular genetics laboratories. On average, the entire workflow can be performed in 2-5 workdays with 6-8 h of technician hands-on time. The biggest impact on workflow time is the length of the hybridization reaction, which may affect the efficiency of capture. Manufacturer-recommended hybridization times range from 3 to 72 h (Table 16.1).

Sequencing Cost

A major catalyst for increased clinical genetic testing is the reduction in cost and TAT for large-scale sequencing of genes and samples. Recent dramatic improvements in sequencing technology significantly increased the number of DNA templates that can be sequenced simultaneously (about 10² templates for Sanger sequencing vs. about 10⁹ for NGS), resulting in an approximately 100,000-fold decrease in the per-base cost of DNA sequencing [13]. However, the cost per sequencing run, as well as all additional sample processing charges associated with a clinical test, still amounts to several thousand dollars (in 2014), and thus the cost-to-benefit ratio of developing and implementing an NGS-targeted gene panel correlates with the size of the target gene or genes [30]. Although the costs for larger gene panels are driven down by the sequencing efficiency afforded by NGS, the much-publicized "1000 dollar genome" has not yet been realized, and Sanger sequencing is still the less expensive option for tests with small content (less than 10 exons or so).

Cost-Reduction Measures

In order to reduce NGS costs, clinical laboratories commonly use sample batching (processing several samples through the library construction steps simultaneously) and sample pooling (adding more than one sample into the target capture and/or sequencing reaction). Both of these processing strategies lower the per-sample cost by spreading the labor- and/or reagent-associated cost across several samples. For example, batching 10 samples for library construction, with subsequent pooling for hybrid capture and sequencing in one lane on the same platform, can reduce cost by nearly an order of magnitude, to a few hundred dollars per sample (LMM, unpublished data).

Factors Impacting the Ability to Batch and Pool Samples

Clinical laboratories have to adhere to set TATs, and therefore the ability to batch or pool samples depends on the number of samples received for a particular test. Because NGS testing is still a lengthy process, clinical samples need to be processed almost immediately after receipt to meet on-time results delivery; therefore, batching and pooling of samples for genetic tests will be limited when the volume of samples is low. Another factor that impacts the ability to pool samples is the size of the target region, as the output of an NGS sequencing reaction is fixed, and therefore increasing the sequenced region will decrease coverage. Currently, many laboratories sequence targeted panels at an average depth of several hundred fold to minimize the number of bases that are suboptimally covered (most laboratories consider bases with less than 15-30× as failed). While there are no set standards, the number of samples that can be pooled per lane (in 2014) typically ranges between 5-10 for large disease-targeted gene panels and 2-4 for exomes.
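The pooling trade-off noted above is back-of-the-envelope arithmetic: for a fixed lane output, expected mean depth scales inversely with target size times the number of samples pooled. The numbers below (lane output, on-target rate) are illustrative assumptions, not instrument specifications.

```python
# Back-of-the-envelope sketch of the pooling trade-off: fixed lane output
# divided across (target size x samples pooled). All numbers are assumptions.

def mean_coverage(lane_output_gb, target_mb, samples, on_target=0.6):
    """Expected mean depth per sample for one pooled lane."""
    usable_bases = lane_output_gb * 1e9 * on_target
    return usable_bases / (target_mb * 1e6 * samples)

lane_gb = 35.0   # assumed usable output of one lane
target = 0.5     # 0.5 Mb disease-targeted panel
for n in (5, 10, 20):
    print(f"{n:>2} samples/lane -> ~{mean_coverage(lane_gb, target, n):,.0f}x")
```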
It is important to remember that NGS platforms differ vastly in the number and length of sequencing reads generated per run [31]; as such, a pooling level that does not affect coverage on one platform may result in incomplete coverage or inadequate read depth of some regions within the gene panel on another.

Automation

Automation of workflows using liquid handling robots can decrease the time and cost of processing samples for sequencing. Although automation requires specialized robots with an associated initial capital investment, it can dramatically increase the scale and throughput of sample processing. Because in-solution targeted hybrid capture consists of a series of liquid handling events, it can be easily automated, thereby increasing output from 10-20 captures per week per technician to greater than 384 captures per week [32]. The clear advantage of automating workflows is reduced cost and TAT for target capture and library construction, but extensive development and optimization efforts are needed to ensure that the quality of the library produced is equivalent to that of the manual process. However, sample volume will determine the utility of automation.


Turnaround Time

The time from receiving a sample to reporting a result depends on several steps, including DNA extraction, library construction and target capture, the sequencing instrument run, and analysis and interpretation. Library construction workflows that include targeted hybrid capture approaches differ from those using amplification-based approaches. In addition, sequencing instruments differ in their capacity and run length, which will impact overall TAT. In general, sequencing run times have greatly improved since the inception of NGS (see below). The sequencing run time for targeted gene panels has been reduced from weeks to 1-2 days depending on the sequencing machine; even an exome can be sequenced in 27 h on Illumina's HiSeq 2500 "rapid run mode" platform. However, as the technical process has become more and more streamlined, an emerging and critical bottleneck is the interpretation of sequence variants. Therefore, while the sample processing time is increasingly independent of the target region, the overall TAT is increasingly proportional to the number of genes sequenced. For an exome or genome, the entire clinical testing process can still take several months, but for targeted gene panels that cover from 10 to 100 genes with known association to a disease or a spectrum of closely related diseases, the entire test cycle (sequencing and interpretation) can be completed in a few weeks.

Impact of the Type of Sequencing Machine

One final consideration affecting both cost and TAT is the sequencing platform used. Several sequencing instruments are available on the market that differ in sequence output and run time. For example, the Illumina HiSeq 2000 can produce 100 bp paired-end reads in 4-6 days, while a similar read length on the Illumina MiSeq takes less than 24 h but with a tenfold decrease in total read output. The Ion PGM Sequencer can produce longer reads (200-400 bp) in a shorter time (2-4 h), but yields a significantly lower number of reads than both the HiSeq and MiSeq. For low-throughput applications, the Ion PGM Sequencer provides the greatest time benefit; the MiSeq can achieve a relatively fast TAT and handle a greater number of samples or larger gene panels. However, for WGS or WES, the output of the HiSeq 2000 is needed to ensure adequate coverage. The launch of the HiSeq 2500 in 2012, an upgrade to the HiSeq 2000 instrument, introduced a "rapid run mode" which yields a comparable number of reads per lane (187 M vs. 150 M single-end reads) in a much shorter time frame (27 h vs. 11 days) than its predecessor; however, only two lanes at a time can be run in this mode. Improvements in these sequencing platforms, as well as new platforms developed by other manufacturers, continue to break the barriers of output and TAT, and it is expected that a high-capacity, short-TAT platform will be available in the near future.

Targeted Hybrid Capture: Analytical Sensitivity Across the Variant Spectrum

The sensitivity of a genetic test depends on its ability to accurately detect a broad spectrum of variants, including SNVs, insertions and/or deletions (indels), copy number variants (CNVs), and structural rearrangements such as translocations or inversions. While Sanger sequencing remains the "gold standard" for the detection of SNVs and small indels, it usually does not detect CNVs and structural rearrangements. NGS technologies offer an all-encompassing platform for simultaneous detection of a broad spectrum of variants without a priori knowledge of the genetic variation within a sample. Harnessing this ability depends on read length, coverage depth, and uniformity, as well as sophisticated bioinformatics analyses. Targeted hybrid capture approaches ensure that these parameters are met for genes within a panel, while allowing for simpler bioinformatics than WGS/WES.

Assuming successful capture and adequate coverage, the targeted hybrid capture approach has been shown to have near-perfect sensitivity for SNVs, such that NGS is now believed to be at least equal, if not superior, to Sanger sequencing. This has been best demonstrated in a recent comparison between targeted hybrid capture NGS and Sanger sequencing of a 48-gene panel [33]. However, it must be emphasized that this success of the NGS target approach was contingent on a well-designed capture strategy with 120 bp RNA probes and additional tiling of difficult regions, as well as on long sequencing reads (150 bp paired-end) and deep coverage across the full target (≥30×) [33].

In contrast, indels are more difficult to detect than SNVs, as both probe hybridization and simple alignment are compromised by the presence of this class of variants. This is illustrated by the technical validation of a targeted hybrid capture panel containing 46 cardiomyopathy genes, sequenced using 50 bp paired-end reads. SNV detection was optimal (100% detection of 258 previously identified SNVs), but the analysis of 39 known indels ranging from 1 to 30 bp in size revealed decreased sensitivity (89.7%) (LMM, unpublished data). All 18


indels that were 1-2 bp in size were detected, while the 4 missed indels were all greater than 3 bp long, indicating that the size of the indel is a major determinant of detection. Probe length and density can influence capture of regions containing larger indels [21], and improvements in alignment and mapping tools will lead to better identification of indels.

Detection of CNVs and structural rearrangements is also heavily dependent on probe hybridization and complex bioinformatics tools. Uniform and deep coverage of genome sequence can be used to accurately predict copy number changes of genes and chromosomal loci [34], but this is a challenge in hybrid capture since uniformity and depth of coverage vary due to differences in hybridization efficiency across small and noncontiguous regions [35]. Bioinformatics methods that focus on normalization of coverage, correction of capture bias, and comparison of coverage ratios of pooled samples sequenced within a lane have yielded promising results, with an analytical sensitivity ranging from 75% to 85% for CNV detection [35,36]. However, coupled with limited knowledge of the false discovery rate, the need to follow up called CNVs with a second molecular method, such as multiplex ligation-dependent probe amplification (MLPA) or droplet digital PCR, is critical in a clinical setting. Similarly, breakpoints and structural rearrangements are impossible to detect by hybrid capture techniques if the relevant regions are not covered in the probe design. Again, the onus is on improved capture technology, with longer, more efficient baits, and on advances in bioinformatics tools to increase detection of these types of aberrations.
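The coverage-ratio idea behind these CNV methods can be sketched compactly: normalize each sample's per-exon coverage to its own total, then compare each exon with the median across samples pooled in the same lane. The data and thresholds below are invented for illustration; production tools also correct for capture bias and model noise before calling.

```python
# Minimal sketch of coverage-ratio CNV screening across pooled samples.
# Coverage values and the 0.7/1.3 thresholds are illustrative assumptions.
import statistics

def cnv_ratios(per_exon_cov):
    """per_exon_cov: {sample: [coverage per exon]} -> {sample: [ratios]}."""
    # Normalize each sample to its own total to remove per-sample yield effects.
    norm = {s: [c / sum(cov) for c in cov] for s, cov in per_exon_cov.items()}
    samples = list(norm)
    n_exons = len(next(iter(norm.values())))
    # Per-exon median across the pooled batch serves as the diploid baseline.
    medians = [statistics.median(norm[s][i] for s in samples)
               for i in range(n_exons)]
    return {s: [norm[s][i] / medians[i] for i in range(n_exons)]
            for s in samples}

cov = {
    "s1": [210, 190, 220, 200],
    "s2": [205, 195, 110, 210],   # exon 3 at ~half depth: candidate deletion
    "s3": [198, 202, 215, 190],
}
for s, ratios in cnv_ratios(cov).items():
    calls = ["del?" if r < 0.7 else "dup?" if r > 1.3 else "." for r in ratios]
    print(s, [f"{r:.2f}" for r in ratios], calls)
```

Any exon flagged this way would still require orthogonal confirmation (e.g., MLPA or droplet digital PCR), as the text emphasizes.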

TARGETED HYBRID CAPTURE: SELECTING A PANEL FOR CONSTITUTIONAL DISEASES

The clinical and genetic heterogeneity common to many inherited diseases has catalyzed a rapid transition in genetic testing from Sanger sequencing to NGS. While the jump to WGS or WES may seem the most promising genetic testing strategy for patients, technical and analytical challenges currently limit their widespread implementation in molecular diagnostic laboratories (discussed in more detail in the following section). Thus, for a clinically defined genetic disease or a group of closely related diseases, a targeted gene panel approach is currently the more favorable testing strategy. Targeted hybrid capture panels, which offer the greatest design flexibility and the largest target size among targeted capture approaches, can target up to 5 Mb of genomic sequence, enabling the simultaneous interrogation of about 10% of the exome.

Gene Panel Testing Strategy

NGS technologies have created the need for increased up-front evaluation of genes for inclusion in a clinical panel. In the past, gene panels were of such limited size that their contents were naturally restricted to well-characterized disease genes. Today, it is possible and tempting to analyze all genes with a published disease association; however, it has become more apparent that many disease-gene associations do not withstand the rigor of a thorough clinical evaluation. Inclusion of such genes will almost always result in an increased number of "variants of unknown significance," which can reduce the clinical utility of the test. Most experts believe that the type of targeted gene panel should be carefully chosen based on the specific clinical situation. Smaller disease-targeted panels are usually preferred when the patient presents with a well-defined clinical diagnosis where the likelihood of detecting a pathogenic variant is high. In contrast, exome/genome sequencing can be a good option for clinically complex cases in patients who have already exhausted routine testing choices [37]. Many laboratories have designed large gene panels (typically encompassing multiple, clinically overlapping disorders) that enable targeted bioinformatic analysis of subsets of the captured genes; this type of design provides enhanced flexibility for the ordering physician. Whether to order the largest panel, including candidate genes, or more disease-focused subsets depends on a multitude of factors, including how firmly the clinical diagnosis has been established (for clinically complex cases it may be beneficial to cast a wider net), cost considerations, patient preferences (such as the ability to deal with inconclusive findings), and whether or not a family history of disease is present (in which case family testing can aid in clarifying the significance of variants that are not well characterized in the literature). Importantly, this design retains the ability to easily expand the analysis to include more genes, since the sequenced region will always include all genes; the "genetic test" becomes a bioinformatics exercise in which sets of genes can be selectively analyzed.


Anticipating Technical Limitations

Regions with Low Coverage

To ensure complete coverage across the sequenced region of a targeted gene panel, many laboratories "fill in" regions that are not adequately covered by NGS using alternate technologies, often Sanger sequencing. The extent of NGS coverage is usually very reproducible, and thus it is critical to examine coverage early in the test development process to estimate the amount of such "fill-in" sequencing that will be required, as this will be associated with significantly increased cost and TAT. The extent to which a laboratory can sustain "fill-in" sequencing depends on factors such as available automation.

GC-Rich or Repetitive Regions

These sequences consistently yield low coverage due to inefficient probe capture or poor alignment, as discussed above. Some laboratories simply opt to omit some of these regions, especially if they lie in noncoding exons that contain the 5′ UTR, since variants in the 5′ UTR are not well understood and are difficult to interpret in the context of the disease.

Genes with Homology to Other Loci

Regions of high homology are a challenge regardless of the sequencing technology used, but they are particularly challenging for targeted gene panels as they can cause problems at several steps of testing. As outlined in detail above, capture probes may hybridize to both a target gene and its pseudogene, and it may be difficult or impossible to map reads from the target gene while avoiding reads from the pseudogene, which may lead to false positive or false negative variant calls. When possible, regions of high homology should be omitted from the capture assay, and alternative approaches (such as amplicon-based assays where primers can be placed in unique sequences) should be considered. It is therefore extremely important to perform a bioinformatic analysis of all selected genes early in test design to identify whether regions of high homology are present. Regions with high sequence homology can also complicate variant confirmation; even by Sanger sequencing it may be impossible to confirm a variant in genes whose sequence identity to other loci extends beyond the coding sequence. Depending on the length of the homologous sequence, variants identified may therefore require nonstandard assays (e.g., long-range PCR assays avoiding the regions of homology followed by Sanger sequencing using internal primers), which can create challenges in clinical laboratories that strive for standardized assay designs to maximize operational efficiency. Since the advantages of NGS targeted hybrid capture are not realized for genes that have regions of high sequence homology, many laboratories simply offer analysis of these regions as an adjunct Sanger sequencing test.

Anticipating Interpretive Challenges: Impact of Panel Size on Variant Interpretation

The impact of sequencing large numbers of genes on data analysis and interpretation is often underestimated. A larger number of genes on a panel will inevitably lead to an increase in the number of variants detected (Figure 16.5). This poses an interpretation challenge due to the limited availability of clinically curated variant databases and the fact that the majority of variants are rare and may be unique to a family or an individual. For example, of 7772 variants across 113 genes identified in a large patient cohort, 58.1% (4514) were identified in one proband or family and only 14% (1101) were identified in 10 or more probands (LMM, unpublished data). Variant assessment is still a highly manual process that requires highly trained staff. In the absence of automation, it can take anywhere from 10 min to several hours per variant, depending on the information that needs to be reviewed. This can have a severe impact on a test's TAT, an often underappreciated fact, and adds labor expenses to the overall cost of the NGS panel (a rough workload model is sketched after this section).

FIGURE 16.5 Impact of launching an expanded cardiomyopathy NGS panel on the novel variant interpretation bottleneck. (The plot shows the number of novel cardiomyopathy variants detected, on a scale of 0–300, rising sharply after the cardiomyopathy NGS panel launch.)

To summarize, when designing a gene panel, an understanding of the technical capabilities and limitations of the capture method, along with an awareness of data interpretation complexities, should be considered to maximize the benefit-to-cost ratio for patients. Inclusion of genes and gene regions should focus on a limited set of genes with an established causative role in the disease. Addition of genes with little or no evidence of disease causation will burden the interpretation process and reduce clinical utility. In addition, since technical parameters such as analytical sensitivity and specificity depend on adequate depth of coverage across genes, conserving coverage by limiting inclusion to genes with an established role in the disease provides high clinical sensitivity at a reasonable cost and TAT. Sanger follow-up is often needed for gene regions with difficult-to-capture sequences, and the inclusion of these regions should be evaluated based on their interpretability.
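A back-of-the-envelope model can translate the per-variant assessment times quoted above into staffing impact at design time. All numbers below are placeholders chosen for illustration, not published benchmarks.

```python
# Hypothetical workload model for manual variant assessment.
novel_variants_per_case = 12      # placeholder: grows with panel size
cases_per_month = 80              # placeholder test volume
minutes_per_variant = 45          # within the 10 min to several hours range above

hours_per_month = novel_variants_per_case * cases_per_month * minutes_per_variant / 60
ftes_needed = hours_per_month / 160   # assuming ~160 working hours per FTE-month
print(f"{hours_per_month:.0f} assessment hours/month (approx. {ftes_needed:.1f} FTEs)")
# -> 720 assessment hours/month (approx. 4.5 FTEs)
```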

Disease-Targeted Gene Panels: Comparison with Other Sequencing Strategies

NGS technologies have reduced the cost and the time needed to sequence a set of target genes or even the whole genome. Yet different sequencing strategies and complementary target enrichment technologies have emerged that offer varying levels of utility and applicability (Table 16.2). Besides the targeted hybrid capture method, the most widely used sequencing strategies in molecular diagnostic laboratories are WGS, WES, and PCR-amplification target enrichment methods. In the following section, each method is briefly described with an emphasis on how it compares with targeted hybrid capture.

Whole Genome Sequencing

Complete sequencing of all 3 billion bp of the human genome is attempted by the WGS method. All variants within coding regions as well as noncoding regions (5′ and 3′ UTRs, introns, and intergenic regions) and across the variant spectrum (SNVs, indels, CNVs, and structural rearrangements) are investigated for each individual. Ideally, WGS would achieve uniform coverage across the entire genome at an adequate depth (>30×) to enable accurate and sensitive variant detection, but complete WGS coverage has yet to be achieved. Reported coverage across several studies ranges from 79% to 96% total coverage at an average coverage depth of 26.1–65.8× [38], but even achieving this level of total coverage and depth required from 74 to 188 Gb of sequencing [38], equivalent to 2–4 lanes of 100 bp paired-end sequencing on the Illumina HiSeq 2000 platform. In addition, a total of about 3.3 million variants were identified per individual, including about 2 million SNVs (roughly 19,000 in coding regions) and about 400,000 indels (roughly 400 in coding regions) [38]. These staggering variant numbers highlight the burden on laboratory infrastructure, bioinformatics pipelines, and data interpretation for molecular diagnostic laboratories. Furthermore, while the WGS laboratory workflow omits the library capture step, the same technical and analytical challenges, such as amplification bias of GC-rich regions and coverage needs for variant detection, are observed with WGS.

TABLE 16.2 Comparison of Common NGS Strategies

WGS: cost, high; difficulty of analysis/interpretation, high; clinical application, rare genetic diseases with unknown etiology.
WES: cost, high; difficulty of analysis/interpretation, high; clinical application, rare genetic diseases with unknown etiology.
Target hybrid capture gene panels: cost, medium; difficulty of analysis/interpretation, low–medium; clinical application, genetic diseases with a medium to high number of known causative genes.
PCR-based capture gene panels: cost, medium–high; difficulty of analysis/interpretation, low–medium; clinical application, genetic diseases with a low to medium number of causative genes.
Selector probe gene panels: cost, low–medium; difficulty of analysis/interpretation, low; clinical application, somatic tumor testing with limited DNA input; genetic diseases with a small number of causative genes.

Whole Exome Sequencing

WES targets known protein-coding genes, including coding and noncoding exons and flanking intronic splice regions. Targeted hybrid capture kits, commercially available from several manufacturers, are used to capture the exome with workflows that are identical to those of targeted hybrid capture gene panels. As demonstrated by several studies, incomplete or low coverage remains a major limitation of WES [21], raising the possibility of missing disease-causing variants. Albeit on a smaller scale than WGS, the amount of sequencing data and the number of variants generated through WES are far greater than those generated with a targeted gene panel, thereby increasing the load on bioinformatics analyses and interpretation. WES is less costly than WGS, yet is still more expensive than targeted gene panels, due to lower pooling of samples per lane as well as the additional effort required for the complexities of variant analysis and interpretation. The technical and interpretive challenges associated with WGS and WES favor targeted gene panels for diseases with well-established clinical diagnoses. In a policy statement on the use of WGS/WES for clinical diagnostic testing, the ACMG recommends that WGS/WES be considered if an individual is affected with a heterogeneous genetic disease that is not covered by available targeted panels, or as a second-tier test once the targeted panel yields a negative result [39].

Amplification-Based Capture Methods

Target enrichment using high-throughput PCR is an alternative method to focus sequencing on a limited number of genes. Simplex or multiplex PCR methods can be used to enrich single or a handful of targets; however, their low throughput and capacity make them less attractive for gene panels with more than 10 genes. Specialized chips allow multiple PCR reactions to be carried out simultaneously by simplex PCR in small chambers on a microfluidic chip (Fluidigm, http://www.fluidigm.com/) or using microdroplet-based multiplex PCR, also known as emulsion PCR (RainDance, http://raindancetech.com/) (Table 16.1). These target enrichment strategies are susceptible to the limitations associated with all PCR assays, including the availability of suitable primer targets and amplification bias in GC-rich regions and repeats. In addition, they may require a much larger amount of input DNA and target smaller genomic regions (0.1–10 Mb) than hybrid capture (1–50 Mb). Furthermore, the primers and PCR reagents needed for amplification-based enrichment incur a progressively higher cost as target size increases, further compounded by the need for expensive specialized equipment for the microdroplet-based PCR enrichment method. Yet these methods have advantages over hybrid capture, such as increased target design efficiency across GC-rich and repetitive regions, as well as for genes with pseudogenes, due to the ability to optimize primer position and amplicon length [23]. On the other hand, both amplification and hybrid capture target enrichment methods produce comparable sensitivity, specificity, coverage uniformity, and reproducibility [15,23], although in the hybrid capture method these parameters are affected by probe design and tiling strategy. In addition, both methods are amenable to the addition of new target regions to existing panels.
Overall, the choice of capture method (amplification-based vs. hybrid capture) will depend on the anticipated size of the target region within a panel and the availability of specialized equipment. Smaller gene panels can be easily set up using amplification-based capture methods; however, targeted hybrid enrichment approaches make it possible to capture larger or multiple gene panels in one reaction with minimal impact on cost and TAT.

Other Target Selection Methods

The already large number of available technologies for target selection continues to expand. Another commercially available capture method is the HaloPlex technology (Agilent Technologies, http://www.genomics.agilent.com/), which combines both hybridization and amplification for fast enrichment of targets using small input amounts of DNA (200 ng) (Table 16.1). In this technique, sometimes referred to as the selector method, DNA is fragmented by restriction enzyme digestion and then hybridized to a probe that selects, circularizes, and introduces a standard sequence into target fragments. Selected fragments are then enriched in parallel using a universal primer specific to the introduced standard sequence [40]. This method does not require specialized equipment and is not labor intensive, with only 1 day (6–8 h) required to prepare sequence-ready libraries. Although there is limited literature on the technical parameters of the selector method, its utility appears to lie in its high-throughput capacity for targeting multiple genes with low amounts of input DNA [41,42].


APPLICATIONS IN CLINICAL PRACTICE: LESSONS LEARNED

Molecular diagnostic laboratories are offering an increasing number of targeted gene panels [30,43,44]. These panels vary in the size of the target region and the number of genes, and may target one gene for single-gene disorders or tens to hundreds of genes for genetically heterogeneous diseases or diseases with a degree of clinical overlap. For diseases that have a clearly defined clinical picture and a strong association with relevant genes, many clinical laboratories still opt to invest in the development and implementation of targeted gene panels in lieu of WES or WGS. While it is expected that WES/WGS will be the primary testing strategy for all genetic diseases in the future, the high cost, incomplete coverage, and the analysis and interpretation bottlenecks of WES/WGS strategies currently make targeted gene panels more advantageous for the laboratory as well as for physicians and their patients. Targeted gene panels using different capture technologies (hybrid capture vs. amplification) that have been successfully implemented in the clinical setting have been extensively described (reviewed in [30] and [43]). To highlight the advantages of targeted gene panels, specifically targeted hybrid capture panels, the experience of the LMM, a CLIA-certified molecular diagnostic laboratory, is described here for two large disease areas: inherited cardiomyopathy and hearing loss. Both disease areas are genetically heterogeneous, cover overlapping clinical entities, and carry a high degree of interpretive complexity. Prior to the development of NGS targeted gene panels, testing for both diseases was performed using Sanger sequencing and then transitioned to resequencing microarrays (see Figure 16.1). The transition from this platform to NGS has clearly improved clinical sensitivity, but has also introduced new challenges; both are discussed below.

Benefits of Targeted NGS Capture Panels

Inherited Cardiomyopathies

Inherited cardiomyopathies represent a genetically heterogeneous group of disorders that includes HCM, DCM, ARVC, left ventricular noncompaction (LVNC), and restrictive cardiomyopathy (RCM), all of which affect ventricular morphology and function [4]. Historically, each type of cardiomyopathy was considered a distinct entity with unique clinical features; however, growing evidence reveals considerable overlap in clinical manifestations. Over 50 genes have been reported to be causative for the various cardiomyopathies [4], and the significant phenotypic overlap across cardiomyopathies is also observed genetically. Previously, genetic testing using gene panel Sanger sequencing or resequencing microarrays targeted genes specific to a particular cardiomyopathy based on the clinical diagnosis. These methods were either expensive and time consuming, as with Sanger sequencing, or had low sensitivity for indels and variants in GC-rich regions, as with resequencing arrays. In addition, the choice of gene panel relied heavily on the physician's clinical impression and did not account for the heterogeneous phenotypic and genetic nature of this group of diseases. An NGS targeted hybrid capture pan-cardiomyopathy gene panel targeting 46 genes (approximately 350 kb) illustrates the benefits of the NGS panel approach over these older methods. The inclusion of the TTN gene depended on the power of NGS, as its enormous size (about 400 exons) precluded analysis using older technologies. The comprehensive NGS test resulted in significantly improved detection rates (the inclusion of TTN alone is estimated to increase clinical sensitivity by about 15–25% [5]) and, as expected, shortened diagnostic odysseys by avoiding the need for a multistep testing strategy in which suspected sets of genes were analyzed sequentially. The following case example illustrates the benefits of this multidisease panel testing strategy. The proband was initially diagnosed with DCM and had received traditional DCM gene panel testing without a tangible result. Subsequent reevaluation of the clinical features prompted additional testing of genes associated with ARVC, a cardiomyopathy that is now known to overlap heavily with DCM. This second test detected a likely pathogenic variant in an ARVC gene that segregated in all affected family members tested. This stepwise testing strategy was lengthy and costly. Today, the causative variant would be identified earlier, as pan-cardiomyopathy testing covers both disorders. Examples like this are becoming increasingly frequent and are catalyzing a dramatic paradigm shift in genetic medicine, in which genetic testing is moving from a confirmatory exercise to being part of establishing a clinical diagnosis.


Hearing Loss and Related Disorders

Hearing loss is the most common sensory impairment and can be syndromic or nonsyndromic. With older sequencing platforms, only a small subset of the large number of associated genes was tested, resulting in limited clinical sensitivity. In addition, older sequencing platforms were unable to capture CNVs that spanned one or more exons; since CNVs are common in these genes, supplemental testing using MLPA or a cytogenetic microarray was often needed. Development of a hybrid capture NGS panel that expanded the number of targeted genes for syndromic and nonsyndromic hearing loss from 19 to 71 increased detection rates in hearing loss patients, not only because of the greater number of genes examined but also because of the ability to use NGS sequencing data directly to call CNVs [35,36]. Implementation of CNV analysis using hybrid capture NGS data identifies CNVs in a significant number of patients (LMM, unpublished data); in fact, considering only the added CNV detection provided by this NGS target capture platform, a >15% improvement in clinical sensitivity is achieved without the costs and additional time required for supplemental testing. The ability to use one data set to call nearly all types of genetic variants is a tremendous advantage of NGS panels; as NGS technology becomes commoditized, it will open avenues for more comprehensive variant detection by smaller laboratories that have no access to traditional CNV detection methods such as cytogenomic microarray analysis.
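Read-depth CNV calling of the kind implemented by tools such as CONTRA [36] compares a sample's normalized per-exon coverage against a reference set. The sketch below shows only the core ratio test, with illustrative thresholds; production tools add GC-bias correction, segmentation, and statistical calling, none of which is modeled here.

```python
from statistics import median

def flag_exon_cnvs(sample_depth, reference_depths, del_ratio=0.6, dup_ratio=1.4):
    """Flag candidate CNV exons by depth ratio against a reference median.

    `sample_depth` maps exon IDs to mean coverage for the test sample;
    `reference_depths` is a list of such dicts from normal control samples
    (assumed nonzero). Thresholds are illustrative: a heterozygous deletion
    is expected near a ratio of 0.5, a duplication near 1.5.
    """
    calls = {}
    # Normalize each sample by its own total to remove library-size effects.
    total = sum(sample_depth.values())
    for exon in sample_depth:
        ref = median(r[exon] / sum(r.values()) for r in reference_depths)
        ratio = (sample_depth[exon] / total) / ref
        if ratio < del_ratio:
            calls[exon] = ("possible deletion", round(ratio, 2))
        elif ratio > dup_ratio:
            calls[exon] = ("possible duplication", round(ratio, 2))
    return calls

# Toy example: exon E2 sequenced at roughly half the expected relative depth.
sample = {"E1": 100, "E2": 48, "E3": 105}
refs = [{"E1": 100, "E2": 100, "E3": 100}, {"E1": 90, "E2": 95, "E3": 92}]
print(flag_exon_cnvs(sample, refs))   # E2 flagged as a possible deletion
```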

Challenges

Although the benefits afforded by NGS assays are clear, NGS assays, sequencing chemistries, and platforms are still evolving. Remaining technical challenges include the need for more complete and uniform coverage; remaining bioinformatics challenges include the need for new analysis approaches to identify all variant classes. All these changes are difficult for clinical laboratories, as the need to remain current must be weighed against the requirement to revalidate tests when changes are introduced. As the size of gene panels increases, it is also increasingly difficult to adhere to the traditional "complete and confirmed" paradigm. Clinical diagnostic laboratories were traditionally able to return complete results (i.e., data for every coding base of a gene) and confirm all identified variants using an orthogonal technology (such as Sanger sequencing). As discussed above, by far the most dramatic (and often underappreciated) challenge is the interpretation of the increased number of variants generated from expanded NGS panels. Tens to hundreds of variants are identified in targeted multigene panels for each patient. Bioinformatic pipelines can filter out benign variants, which constitute the majority of variants identified, based on their frequency in public databases, though this process may erroneously filter common pathogenic variants [30] (a schematic filter with a safeguard for such variants is sketched below). The remaining novel variants require not only laborious manual assessment through searches of the literature and public databases, but also computational analyses to predict the impact of the variant on the translation or function of the protein. Often limited data are available and the accuracy of computational predictions is moderate to low, resulting in a large number of "variants of unknown significance," which can be difficult to understand and integrate into a definitive management plan, and which therefore reduce the clinical utility of NGS panels for physicians and patients alike. Since NGS technology and capture methods are relatively new, technological advances in the near future can be expected to improve performance and ease some of these challenges. Issues with probe inefficiencies need to be addressed to ensure uniform coverage across all regions. In addition, advances in sequencing instrumentation and chemistry that provide longer reads and a greater number of reads in a shorter time frame are necessary for higher throughput and better mappability of difficult targets such as GC-rich regions and pseudogenes.
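The frequency-based filtering step, together with a safeguard against discarding common pathogenic variants, can be expressed schematically as follows. The 1% cutoff and whitelist contents are illustrative assumptions (GJB2 c.35delG is one real example of a pathogenic allele common enough to trip a naive frequency filter); real pipelines draw frequencies from curated population databases.

```python
# Known pathogenic variants that are too common for a naive frequency filter;
# the contents here are illustrative (a common founder allele in hearing loss).
PATHOGENIC_WHITELIST = {"GJB2:c.35delG"}

def filter_variants(variants, max_pop_freq=0.01):
    """Keep rare variants, plus whitelisted common pathogenic ones.

    Each variant is a dict such as {"id": "GJB2:c.35delG", "pop_freq": 0.02};
    `pop_freq` would come from a population database annotation.
    """
    kept = []
    for v in variants:
        if v["id"] in PATHOGENIC_WHITELIST:
            kept.append(v)                      # never filter known pathogenic
        elif v["pop_freq"] <= max_pop_freq:
            kept.append(v)                      # rare: retain for assessment
    return kept
```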

Conclusion and Outlook

Genetic testing of patients has evolved rapidly since the completion of the HGP, driven by the decreased cost of sequencing due to NGS technologies and a richer understanding of the underlying genetic etiology of many diseases. Currently, the $1000 genome seems to be within reach. Yet technical and interpretation challenges, which incur a burden in both cost and time, need to be resolved before WGS transitions into the clinic as a first-tier test. Targeted capture panels for well-defined genetic disorders with known causative genes offer a cost-effective and timely intermediate solution. Several targeting approaches are available to molecular diagnostic laboratories, including amplification-based capture and oligonucleotide hybrid capture; though both have technical advantages and limitations, their benefits for patient care have already been realized.


Integrated efforts across national and international laboratories to curate knowledge surrounding genetic variants and genes in publicly available databases are under way and will eventually alleviate remaining interpretive challenges.

References

[1] Human Gene Mutation Database [Internet]. The human gene mutation database at the institute of medical genetics in Cardiff. <http://www.hgmd.cf.ac.uk/ac/index.php> [accessed 03.06.2013].
[2] Pauli RM. Achondroplasia. October 12, 1998 [Updated February 16, 2012]. In: Pagon RA, Adam MP, Bird TD, et al., editors. GeneReviews [Internet]. Seattle, WA: University of Washington; 1993–2013. Available from: <http://www.ncbi.nlm.nih.gov/books/NBK1152/> [accessed 03.06.2013].
[3] Cystic Fibrosis Mutation Database [Internet]. Cystic fibrosis centre at the hospital for sick children in Toronto. <http://www.genet.sickkids.on.ca/cftr/> [updated April 25, 2011; accessed 03.10.2013].
[4] Teekakirikul P, Kelly MA, Rehm HL, Lakdawala NK, Funke BH. Inherited cardiomyopathies: molecular genetics and clinical genetic testing in the postgenomic era. J Mol Diagn 2013;15(2):158–70.
[5] Herman DS, Lam L, Taylor MR, Wang L, Teekakirikul P, Christodoulou D, et al. Truncations of titin causing dilated cardiomyopathy. N Engl J Med 2012;366(7):619–28.
[6] Hershberger RE, Morales A. Dilated cardiomyopathy overview. July 27, 2007 [Updated May 9, 2013]. In: Pagon RA, Adam MP, Bird TD, et al., editors. GeneReviews [Internet]. Seattle, WA: University of Washington; 1993–2013. Available from: <http://www.ncbi.nlm.nih.gov/books/NBK1309/> [accessed 03.10.2013].
[7] Biagini E, Coccolo F, Ferlito M, Perugini E, Rocchi G, Bacchi-Reggiani L, et al. Dilated-hypokinetic evolution of hypertrophic cardiomyopathy: prevalence, incidence, risk factors, and prognostic implications in pediatric and adult patients. J Am Coll Cardiol 2005;46(8):1543–50.
[8] Sen-Chowdhry S, Syrris P, Prasad SK, Hughes SE, Merrifield R, Ward D, et al. Left-dominant arrhythmogenic cardiomyopathy: an under-recognized clinical entity. J Am Coll Cardiol 2008;52(25):2175–87.
[9] Gripp KW, Lin AE. Costello syndrome. August 29, 2006 [Updated January 12, 2012]. In: Pagon RA, Adam MP, Bird TD, et al., editors. GeneReviews [Internet]. Seattle, WA: University of Washington; 1993–2013. Available from: <http://www.ncbi.nlm.nih.gov/books/NBK1507/> [accessed 03.06.2013].
[10] Smith RJH, Shearer AE, Hildebrand MS, et al. Deafness and hereditary hearing loss overview. February 14, 1999 [Updated January 3, 2013]. In: Pagon RA, Adam MP, Bird TD, et al., editors. GeneReviews [Internet]. Seattle, WA: University of Washington; 1993–2013. Available from: <http://www.ncbi.nlm.nih.gov/books/NBK1434/> [accessed 03.12.2013].
[11] Zimmerman RS, Cox S, Lakdawala NK, Cirino A, Mancini-DiNardo D, Clark E, et al. A novel custom resequencing array for dilated cardiomyopathy. Genet Med 2010;12(5):268–78.
[12] Teekakirikul P, Cox S, Funke B, Rehm HL. Targeted sequencing using Affymetrix CustomSeq Arrays. Curr Protoc Hum Genet 2011;69:7.18.1–7.18.17.
[13] Lander ES. Initial impact of the sequencing of the human genome. Nature 2011;470(7333):187–97.
[14] Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 2009;27(2):182–9.
[15] Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods 2010;7(2):111–8.
[16] Maricic T, Whitten M, Pääbo S. Multiplexed DNA sequence capture of mitochondrial genomes using PCR products. PLoS One 2010;5(11):e14004.
[17] Teer JK, Bonnycastle LL, Chines PS, Hansen NF, Aoyama N, Swift AJ, NISC Comparative Sequencing Program, et al. Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res 2010;20(10):1420–31.
[18] Tewhey R, Nakano M, Wang X, Pabón-Peña C, Novak B, Giuffre A, et al. Enrichment of sequencing targets from the human genome by solution hybridization. Genome Biol 2009;10(10):R116.
[19] Querfurth R, Fischer A, Schweiger MR, Lehrach H, Mertes F. Creation and application of immortalized bait libraries for targeted enrichment and next-generation sequencing. Biotechniques 2012;52(6):375–80.
[20] Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, et al. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci USA 2007;104(49):19428–33.
[21] Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 2011;29(10):908–14.
[22] Sulonen AM, Ellonen P, Almusa H, Lepistö M, Eldfors S, Hannula S, et al. Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol 2011;12(9):R94.
[23] Hedges DJ, Guettouche T, Yang S, Bademci G, Diaz A, Andersen A, et al. Comparison of three targeted enrichment strategies on the SOLiD sequencing platform. PLoS One 2011;6(4):e18595.
[24] Sakharkar MK, Chow VT, Kangueane P. Distributions of exons and introns in the human genome. In Silico Biol 2004;4(4):387–93.
[25] Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol 2011;12(2):R18.
[26] Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat Methods 2009;6(4):291–5.
[27] Sipos B, Massingham T, Stütz AM, Goldman N. An improved protocol for sequencing of repetitive genomic regions and structural variations using mutagenesis and next generation sequencing. PLoS One 2012;7(8):e43359.
[28] Torrents D, Suyama M, Zdobnov E, Bork P. A genome-wide survey of human pseudogenes. Genome Res 2003;13(12):2559–67.
[29] Zhang Z, Harrison PM, Liu Y, Gerstein M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res 2003;13(12):2541–58.
[30] Rehm HL. Disease-targeted sequencing: a cornerstone in the clinic. Nat Rev Genet 2013;14(4):295–300.
[31] Liu L, Li Y, Li S, Hu N, He Y, Pong R, et al. Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012;2012:251364.
[32] Fisher S, Barry A, Abreu J, Minie B, Nolan J, Delorey TM, et al. A scalable, fully automated process for construction of sequence-ready human exome targeted capture libraries. Genome Biol 2011;12(1):R1.
[33] Sikkema-Raddatz B, Johansson LF, de Boer EN, Almomani R, Boven LG, van den Berg MP, et al. Targeted next-generation sequencing can replace Sanger sequencing in clinical diagnostics. Hum Mutat 2013;34(7):1035–42.
[34] Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, 1000 Genomes Project, et al. Diversity of human copy number variation and multicopy genes. Science 2010;330(6004):641–6.
[35] Nord AS, Lee M, King MC, Walsh T. Accurate and exact CNV identification from targeted high-throughput sequence data. BMC Genomics 2011;12:184.
[36] Li J, Lupat R, Amarasinghe KC, Thompson ER, Doyle MA, Ryland GL, et al. CONTRA: copy number analysis for targeted resequencing. Bioinformatics 2012;28(10):1307–13.
[37] Rehm HL, Bale SJ, Bayrak-Toydemir P, Berg JS, Brown KK, Deignan JL, et al. ACMG clinical laboratory standards for next-generation sequencing. Genet Med 2013;15(9):733–47.
[38] Shen H, Li J, Zhang J, Xu C, Jiang Y, Wu Z, et al. Comprehensive characterization of human genome variation by high coverage whole-genome sequencing of forty four Caucasians. PLoS One 2013;8(4):e59494.
[39] ACMG Board of Directors. Points to consider in the clinical application of genomic sequencing. Genet Med 2012;14(8):759–61.
[40] Dahl F, Gullberg M, Stenberg J, Landegren U, Nilsson M. Multiplex amplification enabled by selective circularization of large sets of genomic DNA fragments. Nucleic Acids Res 2005;33(8):e71.
[41] Dahl F, Stenberg J, Fredriksson S, Welch K, Zhang M, Nilsson M, et al. Multigene amplification and massively parallel sequencing for cancer mutation discovery. Proc Natl Acad Sci USA 2007;104(22):9387–92.
[42] Johansson H, Isaksson M, Sörqvist EF, Roos F, Stenberg J, Sjöblom T, et al. Targeted resequencing of candidate genes using selector probes. Nucleic Acids Res 2011;39(2):e8.
[43] Zhang W, Cui H, Wong LJ. Application of next generation sequencing to molecular diagnosis of inherited diseases. Top Curr Chem 2014;336:19–45.
[44] National Center for Biotechnology Information. GTR: genetic testing registry. Available from: <http://www.ncbi.nlm.nih.gov/gtr/> [accessed 05.20.2013].


CHAPTER 17

Constitutional Disorders: Whole Exome and Whole Genome Sequencing

Benjamin D. Solomon
Medical Genetics Branch, National Human Genome Research Institute/National Institutes of Health, Bethesda, MD, USA

OUTLINE

Introduction
  Historical Perspective
    Early Chromosomal Studies
    Genomic-Based Studies: Genetic Markers
    The Microarray
    The GWAS Era
    The Human Genome Project
    Modern Sequencing Technologies
Genomic Sequencing
  Advantages of Genomic Sequencing
  Disadvantages of Genomic Sequencing
  Comparison: Exomes Versus Genomes
    What Regions Are Targeted/Covered?
    Depth of Coverage
    Type of Variants Detected
    Resource-Based Considerations
Analyzing Individual and Multiple Data Sets for Causal Mutation Discovery
  Phenotypically Similar Unrelated Probands
  The Continued Importance of Clinical Analyses in the Era of Genomic Sequencing
  Familial Studies
    Recessive Diseases
    De Novo Mutations
  Using Databases of Population Variation
    Issues and Concerns with the Use of Population Variation Databases to Filter Genomic Data Sets
    Penetrance and Expressivity
    The Accuracy and Reproducibility of Databases
  Incorporation of Pathway-Related Data
  Recognizing and Managing Artifacts
    The Necessity of Independent Validation
  Functional Interpretation of Variants
  Combinatorial Approaches
Clinical Genomic Sequencing
  Determining the Optimal Scope of Genetic/Genomic Investigations
  Clinical Utility: Translating Genomic Knowledge from Rare Disease Research to More General Health Care Situations
  The Clinical Timeline
  Integrating the Management of Incidental Genomic Information
  Managing the Data Load in Clinical Scenarios
  Consequences of Genomic Sequencing
  Genetic Counseling and Ethical Issues
Conclusion and Future Directions
Acknowledgment
References
Glossary


KEY CONCEPTS

• Introduction
  • Modern genomic sequencing technologies are rapidly changing clinical practices related to constitutional disorders, but genomic analysis is not conceptually new—what is new is our ability to efficiently, cheaply, and accurately generate genomic sequence data at the base level.
• Advantages and disadvantages of genomic sequencing
  Advantages of genomic sequencing
  • Advantages include the abilities to rapidly query many genomic regions and to detect a wide range of variants. Genomic approaches can be used in a targeted or hypothesis-free fashion, and data can be banked for future study as data sets grow and knowledge evolves.
  Disadvantages of genomic sequencing
  • Genomic sequencing can result in a potentially overwhelming amount of information, and tools must be used to help best manage derived data.
  • Genomic sequencing will inevitably result in both false-negative and false-positive data, and analysis must take these and related issues into account.
  Comparison: exomes versus genomes
  • While genome sequencing may eventually replace exome sequencing, exome sequencing may require fewer resources and offer faster results with higher coverage.
  • Genome sequencing includes many potentially important regions that are outside the exome, and can detect certain types of variants (such as structural variants) that may be missed in exome sequencing.
• Analyzing individual and multiple data sets for causal mutation discovery
  Phenotypically similar unrelated probands
  • Unrelated probands with highly similar clinical phenotypes can be used to identify common genes or pathways with potentially pathogenic variants.
  Familial studies
  • Analysis of small or large families through genomic sequencing can help prioritize causal variants, even in cases where the inheritance pattern is unclear.
  Using databases of population variation
  • Filtering through publicly available and private databases of population variation is a key step in ranking potentially causal variants.
  Incorporation of pathway-related data
  • Candidate genes may be suggested by knowledge of relevant biological networks and pathways.
  Recognizing and managing artifacts
  • The large amount of data derived through genomic sequencing will result in the detection of artifacts, and the ability to recognize these will improve analysis.
  Functional interpretation of variants
  • Predictive models of variant effect are an important part of genomic analysis.
  • Ultimately, in order to understand disease processes and basic biological mechanisms, predictive functional models cannot substitute for bench-based investigations.
  Combinatorial approaches
  • The above considerations and approaches can and should be used combinatorially in order to expedite and strengthen analysis.


• Genomic sequencing in the clinical realm
  Clinical utility: translating genomic knowledge to more general health care situations
  • The nature of clinical medicine includes critical time pressure, and there is strong incentive to generate, analyze, and interpret genomic sequencing data quickly and accurately.
  • Preplanning for the management of incidental medical information derived through clinical sequencing will benefit genomicists, health care workers, and patients.
  Consequences of genomic sequencing
  • The increased use of genomic sequencing is changing how we view the clinical spectrum and severity of many constitutional disorders, and will affect penetrance estimates in numerous conditions, as well as how we view inheritance patterns.
  Genetic counseling and ethical issues
  • Though not the focus of this chapter, the use of genomic sequencing in clinical contexts raises many critical issues related to genetic counseling and bioethics.
• Conclusions and future directions
  • Genomic sequencing data will be most powerful when considered in the context of other biomedical and health parameters.
  • Genomicists must strive to ensure that the benefits of new genomic sequencing methodologies are not limited to certain individuals in specific parts of the world.

INTRODUCTION

Due to relatively recent advances in sequencing technologies, the use of high-throughput genomic sequencing (in this chapter, the general phrase "genomic sequencing" will be used to include both whole exome and whole genome sequencing) is becoming an increasingly prevalent means to interrogate and understand diverse causes of human disease, and genomic techniques are being applied in more and more situations. These situations include research-based endeavors focused on the discovery of novel causes of rare diseases, as well as efforts to unravel the genetic underpinnings of more common health conditions. Equally importantly, and in parallel with the shift from pure research to everyday clinical applications, the use of whole exome and whole genome sequencing has progressively spread from rarefied, highly funded academic research centers and consortiums to many smaller laboratories, often located at diverse national and international institutions, and for clinical purposes other than the discovery of novel causes of a genetic disorder. These clinical purposes include the ability to rapidly interrogate many genes in order to efficiently answer health-related questions in "real time."

A number of examples demonstrate the pace and extent of the explosion of genomic sequencing in the world of research genetics. A PubMed search of "whole genome sequencing" restricted to human studies shows an increase of approximately 0–25% in citations per year from 2000 through 2008; after 2008, the increase ranges from 60% to 75% per year. In terms of exome-based studies, after the first publications describing the use of whole exome sequencing (in late 2009 and early 2010) [1,2], over 1000 studies on exome sequencing were published within the next 3 years, with a rate of increase similar to that seen in whole genome sequencing. As a reflection of this, clinical exome and genome sequencing is becoming increasingly available. Prices have also decreased considerably; for example, the current cost of the technical component of exome sequencing is slightly more than 10% of the cost when it was first offered 3 years ago, and is anticipated to continue to decrease. Similarly, in the fall of 2011, CLIA-certified exome sequencing, along with clinical analysis, was first offered; within the next 2 years, a dozen other laboratories, representing seven separate countries, offered such exome sequencing and analysis (http://www.ncbi.nlm.nih.gov/gtr/labs/?term=whole+exomes), and four laboratories offered whole genome sequencing (http://www.ncbi.nlm.nih.gov/gtr/labs/?term=whole+genome). Some laboratories, importantly, offer users a choice of different levels of service (with pricing scaled accordingly), such that clinically oriented analysis can be performed by the laboratory or left to the referring clinician or laboratory.


This chapter will focus on the use of "whole exome" and "whole genome" (or simply, "exome" and "genome") sequencing to study constitutional human disorders, where constitutional disorders refers to innate or inborn disorders that may present congenitally or much later in life, and which involve the entire affected individual. Unless otherwise noted, exome or genome sequencing described here refers to the use of modern genomic technologies involving massively parallel or next-generation sequencing. Traditionally, the constitutional nature of a disease reflects a strong genetic influence, though there are clearly gray zones. One gray zone involves multifactorial disorders, many of which likely involve multiple interacting genetic and environmental factors that may vary considerably from one patient to the next. Examples of these multifactorial disorders include congenital anomalies such as nonsyndromic cleft lip and/or palate and neural tube defects [3–5], as well as conditions that may present later in life, like schizophrenia and diabetes mellitus [6,7]. Another gray zone involves conditions resulting from somatic mutations (in addition to classic oncologically related somatic mutations). For example, exome sequencing was used to demonstrate the cause of Proteus syndrome, a rare but fascinating overgrowth condition in which manifestations typically develop early in childhood; to discover the cause of disease, exome sequencing was used to compare affected versus unaffected tissue, demonstrating the presence of somatic activating mutations in AKT1 in the former [8].

Regardless of how exactly constitutional disorders are defined, it is clear that they are very common. According to early estimates, up to 5% of live-born infants have congenital malformations [9,10], and almost 8% of individuals can be identified with a congenital malformation and/or genetic disorder by 25 years of age [11]. With such a high incidence rate, and the complexities involved in the medical care of patients with many constitutional disorders, it should be unsurprising that constitutional disorders place a heavy burden on the health care system. Over 70% of all admissions (and a higher percentage of chronic disorders) at one pediatric hospital during a 1-year period were related to an underlying disorder with a "significant genetic component"; further, as patients with constitutional disorders frequently require long stays and many medical interventions, they account for a disproportionate amount of total hospital charges [12]. In separate estimates, over one-third of deaths at another pediatric hospital over a 4-year period were attributed to congenital malformations and/or genetic disorders [13]. At a pediatric rehabilitation hospital, approximately one-half of all admissions, and over one-half of end-of-life admissions, involved patients with these types of constitutional disorders [14]. Thus, for constitutional disorders, any method that has diagnostic, prognostic, and/or therapeutic relevance can have a strong impact at both the individual patient and societal levels. Genomic sequencing certainly represents this type of method.

This chapter is divided into two major sections. The first section focuses on using genomic sequencing to identify novel (or rare) causes of constitutional disorders.
The second section focuses on the use of genomic technologies in more general health care situations; these situations may still involve constitutional disorders, but the emphasis is on ways to harness genomic technologies in clinical settings. Naturally, these sections will overlap to a considerable degree.

Historical Perspective

Despite the recent (and rapidly growing) emphasis on genomic sequencing, genomic interrogation is not a new phenomenon. New genomic technologies have resulted in two main differences. First, the level of resolution has increased enormously: older techniques of genomic analysis, such as karyotype- and microarray-based studies, were not able to visualize the genome at the level of the individual base. Second, sequencing of the first full human genomes was a massive and very slow-moving project that was only possible with vast sums of money, and required a staggering level of resources in terms of personnel and physical space. With new sequencing technologies, the second change thus involves the ability to generate large amounts of sequence data cheaply and quickly. The short historical description presented here is not meant to be comprehensive, and will admittedly gloss over many important technical aspects and ignore a number of key milestones (especially early in the history of human genetics) and other specific details; it instead focuses on different ways to view, and therefore analyze, the genome as a whole. The central theme, which continues to the present day, again hinges on the differences mentioned in terms of resolution, speed, and expense: at each step, larger amounts of higher resolution genomic data can be generated more quickly and for relatively less expense. Further, these changes have allowed genomic technologies to become useful tools in a wider spectrum of situations and to be used in more than the most esoteric scenarios.


Early Chromosomal Studies

Chromosomal studies initially demonstrated the presence of cytogenomic anomalies as a source of human disease; it is stunning that these discoveries, which may seem so ingrained and rather ancient, were in fact made rather recently. In 1959 and 1960, for example, the chromosomal causes of many of the common aneuploidies were described, including trisomy 21 (Down syndrome), monosomy X (Turner syndrome), trisomy 13 (Patau syndrome), and trisomy 18 (Edwards syndrome) [15–19]. In addition to differences affecting chromosome number, early cytogenetic investigations identified common structural anomalies, such as deletions and duplications, in conditions like Wolf–Hirschhorn syndrome and Cri-du-Chat syndrome [20,21]. Cytogenetic techniques have improved, and remain a clinically important means of looking for large genomic imbalances. "High-resolution banding" allows visualization of a greater number of chromosomal regions, or bands (at least 650 bands can be differentiated), and enables the detection of, for example, deletions or duplications of at least approximately 5 Mb in size [22]. The karyotype does have one advantage over later methods, such as the microarray, or even exome sequencing. Specifically, karyotypic analysis allows the detection of balanced translocations and genomic rearrangements, which can be clinically important for a number of reasons. For example, a parent of a child with a large genomic deletion might carry a balanced translocation, and knowing this may be important for reproductive planning [23]. Second, evolving techniques show that many balanced translocations, which had earlier been thought to be largely benign, in fact involve small imbalances and/or disrupt key genes and signaling mechanisms [24,25].

Genomic-Based Studies: Genetic Markers

Several techniques evolved to examine targeted genomic regions as well as the genome more generally. Many of these techniques largely allowed the detection of relatively large cytogenetic anomalies, such as deletions or duplications, but would typically not demonstrate the presence of a point mutation at the base level. Subtelomeric copy number studies, first using techniques such as variable number tandem repeats (VNTR) [26], and then fluorescence in situ hybridization (FISH) [27], became a standard practice in the evaluation of patients with constitutional disorders. Like many early genomic studies, these endeavors focused on, and were applied in clinical practice to, patients with obvious, severe clinical manifestations, such as otherwise unexplained neurocognitive impairment [28]. The increasing use of genetic markers and a better understanding of genomic architecture also allowed more frequent application of techniques such as positional cloning through linkage analysis, which led to the identification of many causes of Mendelian disorders [29]. Positional cloning through linkage analysis first and famously allowed the identification of bi-allelic mutations in CFTR as the cause of cystic fibrosis [30].

The Microarray

One of the next major steps in the evolution of genomic investigation of constitutional disorders involved the development of the DNA microarray. DNA microarrays allow analysis of the amount of genomic material in an individual so that, for example, a part of the genome with missing or extra genomic information can be detected.
In general, microarrays were first used to study the development of somatic variants in oncologic processes by comparing tumor to normal tissue, initially in animal experiments and then with human neoplasms [31–33]. Whole genome arrays to study constitutional disorders came soon after uses aimed at studying cancer and related diseases, with the advent of array comparative genomic hybridization (aCGH). In order to search for genomic abnormalities, several different types of probes were developed, including bacterial artificial chromosomes (BACs) [34], cDNA [35], and single nucleotide polymorphism (SNP) probes [36], which yield a higher resolution view of the genome—in other words, with the use of more probes, smaller genomic anomalies could be confidently identified. The increasing use of microarrays in the study of constitutional disorders allowed the identification, refinement, and easier detection of common "microdeletion" and "microduplication" syndromes, or contiguous gene syndromes, such as deletion 22q11.2 syndrome (often historically called DiGeorge syndrome) and 1p36 deletion syndrome [37,38]. Equally importantly, using microarrays to determine the minimal critical regions involved in these contiguous gene syndromes led to the identification of intragenic point mutations (in a gene contained within the minimal critical region) that could result in the same or a similar phenotype. For example, Smith–Magenis syndrome is a relatively common contiguous gene syndrome caused by heterozygous deletions of chromosome 17p11.2. Smith–Magenis syndrome was first identified through traditional chromosomal (karyotype) studies, but microarray analyses refined the minimal critical region, and it was ultimately found that point mutations in the gene RAI1 can result in an almost indistinguishable phenotype [39,40]. Finally, in addition to showing causality of recognized syndromes, microarrays have broad clinical utility in diagnosing individuals with findings such as developmental delay or congenital anomalies, even if the exact syndrome or underlying disorder is unclear [41].
The GWAS Era

These array-based analyses ushered in the era of the genome-wide association study (GWAS), which in large part was used to look for genetic markers associated with complex/multifactorial traits, such as ophthalmologic disturbances, diabetes mellitus, and many psychiatric conditions [42,43]. In essence, a GWAS looks for genetic markers that are statistically significantly more likely to be found in affected than in unaffected individuals. One important point is that the GWAS approach aims to identify genetic markers (representing preselected, known gene variants) that are associated with disease, but the findings in isolation do not imply causality. The marker could, for example, be co-inherited with an actual pathogenic mutation that simply happens to be located near the detected associated genetic variant. This is very different from many exome/genome approaches, in which identifying an associated genetic variant may imply direct causality related to the disease in question.

The Human Genome Project

It might be said that the human genome project, a massive project costing over 10 billion dollars, involving huge national and international collaborations, and lasting over a decade, truly launched the modern genomic age [44]. And while this was an enormous accomplishment in many respects, coming only about 50 years after the structure of DNA was described [45] and substantially altering numerous biomedical fields, the methodology used to accomplish the human genome project (Sanger sequencing) was not something that could be scaled to be readily used as a clinical tool [46,47].

Modern Sequencing Technologies

The last step that allowed genomic sequencing to become a commonly used tool in more general clinical situations was the development of "next generation" or "massively parallel" sequencing, allowing much faster and more affordable availability of interpretable sequence data [48,49]. With the advent of massively parallel sequencing, the main issue shifted from the ability to generate data to the problem of how to interpret the data in a thorough, clinically meaningful manner.

GENOMIC SEQUENCING

Advantages of Genomic Sequencing

There are many benefits to being able to employ genomic sequencing in both research spheres and clinical realms. These advantages include the ability to efficiently query many genes and genetic regions (the latter especially in the case of true whole genome sequencing), and to detect a wide range of mutations in terms of variant type. Genomic approaches can be used in a targeted approach or, if a mutation in a high-likelihood gene is not found, in a hypothesis-free manner aimed at identifying novel disease causes (this latter approach is typically more related to research enquiries, and benefits from considerations discussed below). Further, even if immediate answers are not forthcoming from the initial analysis, genetic data can be maintained for future queries to help understand the human genome and how specific variants may contribute to health issues. For example, if a person eventually requires a certain treatment for a condition, their genome can be interrogated for specific pharmacogenomically relevant variants [50]. An additional advantage of genomic sequencing is that once technical challenges have been met (e.g., sequencing data are adequate and bioinformatics pipelines have been established to manage the data), genomic sequencing can be employed without the time and expense related to the creation, optimization, and validation of individual panels. Changes to specific sequencing parameters are inevitable, and will allow improvements in the time needed to generate sequence data and in the quality of those data, but this does not lessen the impact of genomic sequencing in its current state.


Disadvantages of Genomic Sequencing

While genomic sequencing through new methodologies clearly has many advantages over more traditional sequencing methods, including, primarily, the ability to generate large amounts of analyzable data, this sea of genetic information can be overwhelming. After generation of genomic data, the main challenge frequently hinges on bioinformatic issues. In addition to bioinformatics problems related to genomic data management and analysis, genomic sequencing technologies have several other difficulties. Even in extremely high-quality sequencing, there will inevitably be false-positive data (i.e., the detection of artifactual variants). Despite overall high rates of accuracy, the amount of data generated means that any single exome or genome will contain large numbers of artifacts, and appropriate filters must be designed to manage these problems. With modern genomic technologies, there can also be relatively large amounts of false-negative data (i.e., due to incomplete coverage or variants that are not identified). In general, there are portions of the genome that are not amenable to sequencing through some of the next-generation technologies. For example, genomic sequencing is not adept at portions of the genome that are highly repetitive or of low sequence complexity [51]. Among other issues, this can result in clinical problems; for example, if a person has a strong personal and family history of cancer, and a mutation in a cancer susceptibility gene is suspected, using next-generation sequencing to interrogate these genes may not be sufficient, as a causative mutation may be located in a portion of the gene that is not well covered [52]. In this case, reverting to bidirectional dideoxynucleotide (Sanger) sequencing, along with copy number analysis, may be the best way to rule out the presence of a mutation, although another alternative involves combining next-generation sequencing with, for example, Sanger sequencing to cover problematic regions. Finally, exomes and genomes, while containing large amounts of information, are limited to what they query, and will not be able to answer every genetic or biomedical question—for this, one must consider additional factors that can impact health parameters such as the presence, type, and severity of constitutional disorders. These issues will be addressed in more detail below.

Comparison: Exomes Versus Genomes There are a number of important factors when comparing exome to genome sequencing, and assuming that other genetic investigations are not feasible, these factors may influence the choice of modality in terms of exome versus genome sequencing. In addition, careful consideration of the relative advantages and disadvantages of the methods may highlight important adjunctive tests that may be used after or simultaneous with genomic sequencing. What Regions Are Targeted/Covered? The first and perhaps overall most important issue involves the sequence target in exome versus genome sequencing, and there are several key points that should be emphasized. The “exome,” or all exons (coding portions of genes) of the genome, accounts for approximately 12% of the genome, depending on the exome sequencing platform used. However, approximately 85% of mutations that exert a large effects on diseaserelated trait may be located within the exome (at least according to one estimate, though other studies suggest a relatively higher importance to variants in the rest of the genome) [53,54]. In addition to the protein-coding regions, some standard exome sequencing kits typically also include the sequencing of regions surrounding the exons, such as 30 and 50 untranslated regions (UTRs) and noncoding RNA so that this information can be analyzed along with variants affecting protein-coding regions. Again, what is covered in addition to the exome varies between different available kits (as does the coverage and accuracy) [55,56]. Exome sequencing of a single individual will typically reveal tens to hundreds of thousands of variants, while genome sequencing of single individual will result in, on average, approximatey 4.5 million variants [57]. In human disease processes that involve genetic factors, a large proportion of causal mutations, especially related to severe constitutional disorders, may reside in exons [53]. However, there are a number of important exceptions to this, such that nonexomic regions should not be ignored. For example, there are a number of instances where mutations in gene-regulatory regions have been shown to cause constitutional disorders. Mutations in genes such as SHH and ZIC2 (as well as other genes) have long been known to be major causes of holoprosencephaly, a severe neuroanatomical disorder resulting from failed or incomplete forebrain separation [5860]; over a decade after the discovery of causative mutations in these genes, variants in key regulatory regions were also shown to be involved in the pathogenesis of holoprosencephaly [61,62]. Diverse processes of disease may also be related to these

Diverse disease processes may also be related to these regulatory regions; for example, point mutations in a long-range SHH enhancer can result in limb anomalies such as preaxial polydactyly [63]. While the mapping of the human genome has helped tremendously, our knowledge of these regulatory regions lags behind the identification of the protein-coding genes themselves. A large challenge, therefore, centers on the identification and dissection of these regulatory elements and an understanding of how they affect gene expression and protein activity. Mounting evidence shows that much of the genome, previously thought to be relatively inactive "junk DNA" existing simply as remnants of the evolutionary process, may actually serve important functional roles: initial large-scale estimates ascribe a functional biochemical role to approximately 80% of the human genome [54]. As each gene may be influenced by multiple interacting regulatory elements, teasing out how these elements interact and modulate gene expression will be a key task, but the bottom line is that this genomic "dark matter" is fertile ground for genomic discovery, and may well contain variants that explain genetic causes of disease. As a result, variants in (and factors affecting) such regulatory regions may be important to examine after exomic mutations have been excluded.

Depth of Coverage

Closely related to this, a second factor is the depth of coverage provided by exome versus genome sequencing. The average depth of coverage for research-based exomes is approximately 30-50x, depending on the particular kit selected; at this depth, 90% to 99% of bases in the target region are covered at least 1x [55,56,64]. Clinical exome sequencing, on the other hand, typically requires an average depth of coverage of at least 100x; depending on the kit selected, at this depth from 93% to over 99% of bases in the target regions are covered at least 1x. Above 100x, the return for higher coverage diminishes (150x coverage increases the percentage of covered bases by only about 1% over 100x coverage), but in clinical situations such an increase may be necessary. Currently, genome sequencing at anything approaching this depth is extremely costly, and genome sequencing may thus miss variants that exome sequencing would detect. This will undoubtedly change, but at this point genome sequencing will not achieve depth approaching that easily and affordably available with exomes, resulting in lower sensitivity and specificity.

Type of Variants Detected

A third factor involves the importance of detecting structural variants that may contribute to disease, in addition to single point mutations. It is currently difficult to detect structural variants from exome sequencing, though various methodologies do allow detection of copy number variants (CNVs) (deletions and duplications) in massively parallel sequencing data [65]. In the future, the fact that structural variants are important causes of, and contributors to, many types of human disease may provide impetus for a natural shift toward genomic (rather than exomic) sequencing [25,66]. To return to the example of holoprosencephaly, whole (or larger) gene deletions of SHH or ZIC2 may cause disease almost as often as intragenic point mutations, and it is therefore important to test for these CNVs in addition to point mutations [60,67]. Techniques are evolving to allow CNV analysis from massively parallel data [68].
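Although the specific algorithms vary and the field is evolving, the intuition behind read-depth CNV detection can be illustrated with a minimal sketch; the function and data structures below are hypothetical, and a production pipeline would add steps such as GC-content correction, batch normalization, and segmentation across adjacent targets:

    # Minimal read-depth CNV sketch: compare a sample's per-exon coverage
    # with the median coverage of the same exon across a reference panel.
    # All names are illustrative, not taken from any particular tool.
    from statistics import median

    def call_cnvs(sample_depth, panel_depths, del_cutoff=0.6, dup_cutoff=1.4):
        """sample_depth: {exon_id: mean depth}; panel_depths: list of such dicts."""
        calls = []
        for exon, depth in sample_depth.items():
            ref_depths = [p[exon] for p in panel_depths if exon in p]
            if not ref_depths:
                continue  # exon absent from the panel; cannot assess
            ref = median(ref_depths)
            if ref == 0:
                continue  # exon not covered in the panel; cannot assess
            ratio = depth / ref
            if ratio < del_cutoff:
                calls.append((exon, round(ratio, 2), "possible deletion"))
            elif ratio > dup_cutoff:
                calls.append((exon, round(ratio, 2), "possible duplication"))
        return calls

The ratio cutoffs here are arbitrary placeholders; in practice they would be calibrated against known CNVs and the noise characteristics of the capture platform.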
Pragmatically, in the short term, in order to ensure that clinically important CNVs are identified, a frequently employed alternative strategy involves combining high-density microarrays, to search for pathogenic CNVs, with exome sequencing, to detect exonic point mutations. This approach may be especially important in patients who are affected by severe constitutional disorders for which a clear diagnosis is lacking (e.g., in many cases of neurocognitive impairment or congenital malformations). In this instance, it can be helpful to first search for explanatory CNVs before proceeding to genomic sequencing; for example, if a causal deletion is found in a patient, genomic sequencing may not be necessary. CNV analysis can also be helpful in cases of recessive diseases. For example, a patient may be found to have a deletion including a recessive disease-associated gene on one chromosome (inherited from one parent), and a point mutation in that same gene on the other chromosome (inherited from the other parent).

Resource-Based Considerations

The fourth issue, which may or may not be paramount depending on financial constraints, involves the cost of sequencing. Since the advent of modern genomic sequencing through new technologies, the expense involved has decreased considerably. In the several years since new sequencing technologies allowed genomic approaches to be commonly used to study constitutional disorders [1], the cost of human exome sequencing has held relatively steady at approximately one-sixth that of genome sequencing.

We are currently well into the era of the "$1000 exome" (or considerably less), and the promised "$1000 genome" appears inevitable in the relatively near future. The fifth difference between exome and genome sequencing, which is directly related to the cost of sequencing, involves the time required to analyze the resulting data. Genome sequence analysis requires greater computing resources than exome sequence analysis; current estimates describe a ratio of approximately 15:1 when comparing genome to exome sequencing. This practical consideration, involving requirements for laboratory space, personnel, and data storage, may affect the choice of modality when a project is being planned or a commercial laboratory is being established [69].

ANALYZING INDIVIDUAL AND MULTIPLE DATA SETS FOR CAUSAL MUTATION DISCOVERY

The use of high-throughput sequencing methods, including exome and genome sequencing, results in the production of large amounts of data. Managing these data efficiently can be challenging, especially given the burden of potentially pathogenic variants each individual carries [70]. In this sense, looking for the genetic cause of a single constitutional disorder, especially in a situation where a novel cause is likely, can be like looking for a "needle in a stack of needles" [71]. This section will review several general strategies that can help with mutation prioritization in order to identify the causal genetic change. Although the major issues will be discussed in isolation, these areas of consideration can and should be used combinatorially through the implementation of automated probabilistic algorithms (Figure 17.1). For example, combining data on inheritance patterns with scoring systems and data sets involving factors such as population variant frequency and genotype quality can be an effective way to rank the most likely causative variants [72,73]. While these methods are imperfect, they can help prioritize targets for further manual inspection. Further, while the focus of this chapter is on traditional "Mendelian" constitutional disorders rather than more complex traits, the availability of large-scale genomic data can also be used, again through the application of sophisticated statistical modeling, to search for combinations of genetic factors that may together account for complex/multifactorial disorders [74]. These methods may yield more success than GWAS in identifying causal genetic changes, rather than statistical associations of unclear pathogenic significance.

Phenotypically Similar Unrelated Probands

While the advent of new sequencing technologies makes data acquisition much easier, studying single probands frequently yields too much data for ready analysis, especially when the genetic cause is novel. Several of the earliest demonstrations that exome sequencing could be used to discover disease alleles used an approach focusing on cohorts of unrelated probands with strong clinical similarity to one another (as mentioned, these approaches used other analyses to help prioritize candidate genes).

FIGURE 17.1 Key filtering steps in the identification of causal mutations in genomic sequencing analysis of patients with constitutional disorders (filtering steps shown: variant type; population variants; familial studies; phenotypic findings; pathway-related and biological analyses; candidate variants). In different specific situations, certain steps may become more or less relevant, and the approach may not necessarily occur in this order. Many automated methods take such considerations into account combinatorially in order to produce a probabilistic ranking of candidate genes.

This approach is based on the premise that phenotypically similar individuals (i.e., individuals with highly similar constitutional disorders) may share a common underlying cause, such as different point mutations involving the same gene. This was the primary approach initially used to identify causes of constitutional disorders through exome sequencing, though other methods were used in concert, including inheritance-pattern analysis and filtering against databases of population variation. The approach was first demonstrated in patients with Freeman-Sheldon syndrome, a dominant condition involving limb contractures, scoliosis, and craniofacial anomalies caused by heterozygous mutations in MYH3 (as a proof-of-principle exercise, since the cause had previously been identified). In addition to filtering based on genotype quality, this groundbreaking study prioritized variants by requiring that presumed deleterious mutations in the same gene be found in each proband. However, this naturally results in the inclusion of thousands of variants (in the analysis of four exomes in this study), and further filtering by removing common variants contained in dbSNP and HapMap exomes allowed the identification of a single (and correct) candidate gene [1]. This method was next applied to a disorder of previously unknown etiology, Miller syndrome, an autosomal recessive disorder involving craniofacial, ophthalmologic, and limb anomalies. Sequencing of two siblings and two additional unrelated probands showed that Miller syndrome is caused by bi-allelic mutations in the gene DHODH [1,2]. When considering only the siblings using the methods described above (for Freeman-Sheldon syndrome), filtering produced 228 and 9 candidate genes under dominant and recessive inheritance models, respectively. Combining the results from the two siblings with those from the other unrelated probands reduced the candidate genes for the dominant and recessive models to 26 and 1, respectively (the latter being DHODH). Sequencing of additional probands confirmed that bi-allelic mutations in DHODH cause Miller syndrome [2].
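The overlap-based filtering logic used in these early studies can be summarized in a short sketch; the data structures are hypothetical, and the set of "common variants" stands in for resources such as dbSNP and HapMap:

    # Sketch of the shared-candidate-gene filter used in early exome studies:
    # keep genes hit by a presumed deleterious variant in every proband, then
    # require that population filtering does not explain away the hits.
    def shared_candidates(probands, common_variants):
        """probands: list of {gene: set of deleterious variants};
        common_variants: set of variants seen in population resources."""
        # Genes carrying a presumed deleterious variant in every proband
        shared_genes = set.intersection(*(set(p) for p in probands))
        candidates = []
        for gene in shared_genes:
            # Each proband must retain at least one variant that is not
            # present in population resources such as dbSNP or HapMap
            if all(any(v not in common_variants for v in p[gene]) for p in probands):
                candidates.append(gene)
        return sorted(candidates)

Allowing for causal heterogeneity, as in the Kabuki syndrome study discussed next, would correspond to relaxing the intersection so that a gene need only be hit in most, rather than all, probands.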
Another interesting example, which highlights the need for detailed and thoughtful clinical analysis in concert with genomic sequencing, arose in the use of exome sequencing to delineate the cause of Kabuki syndrome, an autosomal dominant disorder whose cardinal features include developmental delay, a distinctive dysmorphic facial appearance, and cardiovascular and skeletal defects, and which was found to result from heterozygous mutations in MLL2 [75]. In this study, initial analyses of exome sequencing data from 10 patients who had been clinically diagnosed with Kabuki syndrome did not reveal an obvious underlying genetic cause. In fact, the single candidate gene that met filtering criteria (MUC16) was discarded because of its large size and polymorphic nature. Next, by taking into account causal heterogeneity (i.e., the possibility that individuals might have Kabuki syndrome because of mutations in different genes), analyses were performed such that not all individuals were required to have pathogenic mutations in the same gene. This resulted in candidate gene lists of 3, 6, and 19 genes, but there was no efficient and satisfying way to rank these genes further without, for example, performing detailed functional assays, potentially including the establishment of high-fidelity animal models. Subsequent ranking was based on two factors. First, the functional impact of the variants: variant types predicted to be most highly pathogenic, such as truncating mutations, were prioritized. Second, patients were clinically grouped through dysmorphology analysis of the canonical facial characteristics of Kabuki syndrome. That is, determining which patients were most phenotypically similar played a key role in allowing the identification of the genetic cause in these individuals. Interestingly, Sanger sequencing subsequently identified mutations in MLL2 in the individuals in whom exome sequencing had not initially detected MLL2 mutations. This highlights the potentially imperfect nature of exome (and genome) sequencing, especially with lower-coverage kits. Clearly, this imperfect sensitivity has important implications for clinical applications of modern sequencing technologies. As with other studies, many other patients with Kabuki syndrome next underwent MLL2 sequencing to confirm that mutations in this gene result in the disease (26 of 43 subsequently tested patients with clinically diagnosed Kabuki syndrome were found to have mutations in MLL2) [75]; reasons why some patients diagnosed with Kabuki syndrome did not have mutations identified in MLL2 include the possibility that they have other, entirely distinct genetic causes not related to MLL2, mutations in interacting genes, or mutations in MLL2 regulatory regions (i.e., regions that are not included in most exome panels).

The Continued Importance of Clinical Analyses in the Era of Genomic Sequencing

The experience of the discovery of MLL2 as a genetic cause of Kabuki syndrome anecdotally emphasizes the role of the clinician, particularly the medical geneticist/genomicist, in the modern genomic era. Thorough clinical analysis can and should be synergized with high-throughput sequencing in order to help arrive at answers quickly and reliably [76]. Such clinical analysis in constitutional disorders can include a standard (and thorough) personal and family medical history, as well as a physical examination with special emphasis on dysmorphic or anomalous physical features. Such an examination is often best performed by a trained and experienced clinical dysmorphologist.

In addition to the history and physical examination, other parts of the clinical work-up can be tailored to the specific patient and/or the conditions that are part of the differential diagnosis. For example, imaging studies would be an important part of the work-up in an individual with a suspected constitutional skeletal dysplasia, and an individual who has facial features reminiscent of a condition that includes congenital cardiac anomalies should undergo appropriate screening and/or testing, such as an echocardiogram. These testing modalities may inform clinical care as well as aid the search for genetic explanations. With the shift toward increased use of genomic sequencing, thorough phenotypic analysis may be used in the same way in order to prioritize panels of genes as part of the "genomic differential diagnosis." Within a genomic data set, it may be most efficient to examine certain genes first, even though sequencing data for many more genes will be available. For example, in a patient with features that appear congruent with a RASopathy, such as a patient with a Noonan-syndrome-like phenotype (a constitutional disorder including a characteristic facial appearance, developmental delay, short stature, and congenital cardiovascular, hematologic, lymphatic, and renal anomalies), a handful of known genes (i.e., PTPN11, SOS1, RAF1, KRAS, and NRAS) might first be examined before moving on to mining the genomic data set for novel causes [77]. Further, to continue the example of Noonan syndrome, in addition to keeping in mind the relative proportion of mutations in these genes among patients clinically diagnosed with Noonan syndrome, specific genotype-phenotype correlations may help shift attention to a certain gene: the presence of a bleeding diathesis and short stature may point to PTPN11 as the most likely candidate gene, while hypertrophic cardiomyopathy and certain skin findings may suggest that a mutation in RAF1 may be present. In addition, so as to synergize clinical data with large-scale genomic data, efforts are underway to establish databases with rich phenotypic as well as genotypic data. These efforts center on the tenet that identifying the genetic etiology of a given disorder requires adequate phenotypic data and knowledge, including information on the natural history and the range and type of manifestations of a disease, sometimes termed the "phenome" [78]. On a pragmatic level, this will be especially important when different patient cohorts are combined. Further, an emphasis on exquisitely phenotyped cohorts will be even more critical as attention shifts from using genomic analyses to find the causes of single-gene Mendelian disorders toward the causes of complex disorders.
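As a rough illustration of this tiered "genomic differential diagnosis," the sketch below examines a phenotype-suggested panel before the remainder of the data set; the panel contents simply echo the Noonan syndrome genes named above, and the function and data structures are hypothetical:

    # Tiered analysis sketch: review phenotype-suggested genes first, in
    # order of prior probability, then the rest of the data set. The tier
    # here is illustrative only.
    NOONAN_PANEL = ["PTPN11", "SOS1", "RAF1", "KRAS", "NRAS"]

    def tiered_review(variants_by_gene, first_tier):
        """variants_by_gene: {gene: [variants]}; first_tier: ordered gene list."""
        for gene in first_tier:  # phenotype-driven tier
            if gene in variants_by_gene:
                yield gene, variants_by_gene[gene]
        for gene, variants in variants_by_gene.items():  # remainder of data set
            if gene not in first_tier:
                yield gene, variants

In practice, the first tier would more likely be generated from structured phenotype annotations (e.g., ontology-based matching against disease databases) than from a hard-coded list.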

Familial Studies

In many constitutional conditions, causative mutations may be inherited dominantly or recessively, or may involve more complex patterns of inheritance. As discussed above, considering inheritance patterns has been a key part of genomic sequencing analysis since the early successes in determining the causes of constitutional disorders using modern sequencing technologies [2,79]. In some pedigrees, there will be a high index of suspicion for a specific inheritance pattern. For example, a highly consanguineous kindred with multiple affected siblings may suggest autosomal recessive inheritance as the most likely possibility, though a heterozygous causative mutation with incomplete penetrance, or germline mosaicism, is entirely possible. In searching for the causal mutation, filtering genomic data based on the hypothesized inheritance pattern will yield different numbers of candidate genes. Recessive diseases may result in fewer candidate genes, and constitutional conditions with autosomal recessive inheritance patterns may therefore be easier to evaluate than autosomal dominant disorders; this is because, in any one individual, there will be fewer genes with homozygous or compound heterozygous mutations than genes with heterozygous mutations. For example, in the study described above demonstrating the cause of Miller syndrome, the two affected siblings were also affected by primary ciliary dyskinesia, and bi-allelic mutations in DNAH5, a gene associated with primary ciliary dyskinesia, could be identified through study of the quartet (the two affected siblings and both parents) [79]. Analysis that includes family members has the added advantage of improving accuracy by eliminating some artifacts [79].

Recessive Diseases

Modern genomic sequencing is frequently combined with more traditional tools of disease gene analysis in order to identify causal mutations. Traditionally (prior to the availability of genomic sequencing), linkage analysis or homozygosity mapping would yield a genomic region of high interest, for which statistical analyses showed a high likelihood of harboring a causative mutation; the identification of this genomic region would then lead to what was frequently a lengthy "house-to-house" search in which every gene in the region of interest was sequenced individually.

In populations with high rates of consanguinity, these approaches can be especially useful, both in single-family situations and in larger cohorts affected by relatively common disorders. For example, one of the first uses of exome sequencing to explain a clinical scenario was in the investigation of a patient who was the product of a consanguineous union and who presented with failure to thrive and dehydration of unclear etiology. Homozygosity mapping (via an Illumina 370K SNP microarray) was used to identify 462 Mb that were identical by descent, of which 5.3 Mb (including 2459 genes) were within the exome. Filtering for novel and rare variants identified 2405 homozygous variants (1493 single nucleotide variants, comprising 668 nonsynonymous substitutions, 791 synonymous coding variants, 12 canonical splice-site variants, 19 coding-region insertion-deletions, and 3 premature termination codons, as well as 931 variants in introns, UTRs, or intergenic regions). Missense variants were ranked according to predicted functional effects, and clinically oriented analysis of known disease-associated genes was performed. SLC26A3 was identified as a strong explanatory gene, as recessive loss-of-function mutations in this gene cause congenital chloride-losing diarrhea. This was consistent with the patient's clinical work-up, though the underlying reason for his presentation had not been obvious from the initial medical evaluation. Ultimately, other similarly affected patients with mutations in the same gene were also found [53]. Related, more targeted approaches involving homozygosity mapping can also be used. For example, in an analysis of 136 consanguineous Middle Eastern families with evidence that they could be affected by recessive cognitive disorders, homozygosity mapping was used to identify genomic regions that were homozygous by descent, and exon enrichment was then used to interrogate those regions; in other words, next-generation sequencing was performed only on the regions included in the linked loci [80]. Exome (or genome) sequencing could also have been used to interrogate these regions, but considerations such as price, time, other required resources, and the quality of sequence data would need to be taken into account.

De Novo Mutations

Familial analysis can also be conducted to detect de novo mutations, which explain some constitutional disorders. This can be a powerful approach when used alongside other filtering approaches, such as selecting rare, novel, and presumed pathogenic alterations. For example, in the identification of X-linked dominant mutations in WDR45 as a cause of a clinically sporadic disorder, static encephalopathy of childhood with neurodegeneration in adulthood (SENDA), family-based exome sequencing was employed in two kindreds, and 180 and 187 novel protein-altering or splice-site variants were found in the two probands, respectively. In the first proband, two of these variants were de novo, while only one was de novo in the second proband, with WDR45 being the only gene with novel de novo mutations shared between the two [81]. This approach may be used both in the setting of rare Mendelian disorders and in other conditions with substantial heterogeneity and complex, multifactorial patterns of causality, such as nonsyndromic intellectual disability or autism [82-84].
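The core of trio-based de novo filtering can be sketched as simple set operations; the call representation below is hypothetical, and real pipelines additionally model genotype quality, coverage, and site-specific error rates in both the child and the parents:

    # De novo candidate sketch: flag child variants absent from both parents,
    # restricted to novel, presumed-damaging consequence classes.
    def de_novo_candidates(child, mother, father, known_variants):
        """child/mother/father: sets of (gene, variant, consequence) calls;
        known_variants: calls already catalogued in population resources."""
        damaging = {"missense", "nonsense", "frameshift", "splice_site"}
        return sorted(
            call for call in child - mother - father
            if call not in known_variants and call[2] in damaging
        )

Because apparent de novo calls are enriched for sequencing artifacts (a variant missed in a parent looks de novo), candidates from a filter like this particularly require the independent validation discussed later in this section.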

Using Databases of Population Variation

Before genomic sequencing became ubiquitous, standard practice dictated that, in addition to other proofs of pathogenicity, a certain number of ethnically matched control samples be tested for the mutation in question (usually a few hundred for a rare Mendelian disease, though the number would certainly shift for a more common condition). The advent of large-scale databases now, to some extent, obviates this need [51,85]. As mentioned above, information from these databases can be built into probabilistic models that rank variants in an attempt to identify causal mutations. Population variant databases can be used in several ways. One is simply to eliminate (or at least assign a lower priority to) variants in genes that are found in individuals unaffected by the disease. Another involves ranking genes based on the frequency of presumed deleterious variants; that is, mutations in genes that are less polymorphic, and therefore less tolerant of variation, may be more likely to cause a constitutional disorder [73].

Issues and Concerns with the Use of Population Variation Databases to Filter Genomic Data Sets

While these databases are extremely valuable and population variation data are a key part of genomic analysis, several caveats are important to note. First, deleterious alleles will be found in the databases used for population-based filtering [70,86]. This can make analysis challenging, and it is important not to dismiss variants outright simply because they are found in the databases. Second, there are considerations that must be taken into account when studying recessive disorders.

As part of the overall burden of genomic variants, all individuals are "carriers" of potentially pathogenic heterozygous mutations that, in the bi-allelic state, would cause disease. Although the exact number of deleterious variants is difficult to estimate due to incomplete knowledge of genomic architecture and of the functional consequences of many variants, it is clear that all individuals harbor hundreds of deleterious variants [70,86]. Setting aside the issue of highly consanguineous groups, heterozygous mutations associated with certain recessive disorders are common in all populations, though the specific disorder may vary between populations and ethnic groups. For example, approximately 1 in 25 Caucasians is a carrier of a deleterious mutation in CFTR, in which bi-allelic mutations cause cystic fibrosis, and approximately 8-10% of African-Americans carry the sickle cell trait (Hb AS) [87,88]. Estimates based upon rare diseases demonstrate this point more generally. The Office of Rare Diseases Research generally considers a rare (or orphan) disease to be one affecting fewer than 200,000 individuals in the United States, which corresponds to a maximum prevalence of approximately 1 in 1600 [89]. For an autosomal recessive disorder at this prevalence, roughly 1 in 20 individuals would carry a disease allele for that particular condition, such that, even for a very polymorphic disorder, many such alleles would be found in any sizable database. Thus, in resources that are frequently used for filtering, such as publicly available databases like 1000 Genomes (http://www.1000genomes.org/) and the Exome Variant Server (EVS) (http://evs.gs.washington.edu/EVS/) [51,85], a substantial proportion of alleles will contain deleterious mutations in such genes, and this must be taken into account when analyzing genomic data sets.
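The "1 in 20" figure follows from a simplified Hardy-Weinberg calculation, sketched below; this assumes a single disease allele at equilibrium and ignores consanguinity, selection, and allelic heterogeneity:

    # Hardy-Weinberg sketch: from a recessive disease prevalence of ~1/1600
    # (q^2), estimate the disease allele (q) and carrier (2pq) frequencies.
    from math import sqrt

    prevalence = 1 / 1600          # affected individuals (q^2)
    q = sqrt(prevalence)           # disease allele frequency = 0.025 (1/40)
    p = 1 - q
    carrier_freq = 2 * p * q       # heterozygous carriers ~ 0.049
    print(f"q = {q:.4f}; carrier frequency ~ 1 in {1 / carrier_freq:.1f}")

Running this gives a carrier frequency of about 1 in 20.5, consistent with the rounded estimate in the text.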
Penetrance and Expressivity

Along the lines of the above discussion, another important consideration is that many mutations, even in classic autosomal dominant conditions, are not completely penetrant, or have highly variable expressivity, such that mildly or subtly affected individuals may not come to clinical attention or realize their disease status, especially early in the course of the condition. As the association between a deleterious mutation and a particular disease state may not be recognized in any given individual, such variants can end up reported in databases of population variation. This is similar to the case of heterozygous mutations in recessive diseases, which are also found at a certain frequency in control populations. For example, the same heterozygous mutations in holoprosencephaly-related genes can result in severe midline brain malformations that have obvious and devastating effects, or can result in subtle midline facial differences (such as close-set eyes or a narrow nasal bridge). This is frequently observed in many families segregating mutations in genes like SHH and SIX3, though a satisfying biological explanation is still lacking. In many families, subtly affected individuals are only ascertained after the birth or conception of a severely affected individual, when retrospective clinical and genetic analyses determine that many individuals in the kindred carry the causative mutation [60,90-92].

This issue can be demonstrated directly by querying databases like the EVS, a database of population variation [85], alongside the Human Gene Mutation Database (HGMD) (https://portal.biobase-international.com/hgmd/pro/), a database of disease-associated mutations [93]. In querying the gene SIX3 (in March 2013), three rare missense variants were found in the EVS that were not found in HGMD and were not recorded as known SNPs. These three variants might be dismissed as simply being extremely rare variants not previously assayed, but, even more interestingly, two missense variants were found in the EVS that had both been previously shown to cause human holoprosencephaly, both by clinical studies and through functional proof derived from animal-based models [91,94]. These latter two missense variants were found in 20 of 10,646 and 4 of 12,334 sequenced exomes, respectively. Whether this argues against the variants' actual primary pathogenicity (i.e., perhaps, in reality, the variant has little or no effect on the patient's constitutional disorder), or whether these individuals in the population databases are actually likely to have subtle disease manifestations and thus be at relatively high risk of having a severely affected child or other relative, remains unclear but is an important question. The converse problem can also occur: certain regions are difficult to sequence using modern sequencing technologies and will thus be underrepresented in databases constructed primarily from next-generation sequencing data. Finding a variant in such a region could initially seem to suggest a pathogenic mutation simply because the region was not well annotated in population databases, making the area appear "variant poor."
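This type of cross-check can be framed as simple set comparisons; the variant sets below are illustrative stand-ins for EVS, HGMD, and dbSNP content, not an interface to those resources:

    # Sketch of the EVS-versus-HGMD style cross-check described above: which
    # variants in a gene appear in population data, and are any of them also
    # reported as disease causing?
    def cross_reference(population_variants, disease_variants, known_snps):
        """Each argument is a set of variant identifiers for one gene."""
        rare_unreported = population_variants - disease_variants - known_snps
        pathogenic_in_population = population_variants & disease_variants
        return {
            "rare, not previously reported": rare_unreported,
            "in population data AND reported pathogenic": pathogenic_in_population,
        }

Variants falling into the second category are exactly the cases discussed above, where reduced penetrance, variable expressivity, or database error must be weighed against true pathogenicity.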

The Accuracy and Reproducibility of Databases

For a number of reasons, the information included in databases must be examined and considered carefully. Information contained in the databases may not have undergone uniform validation/confirmation and may thus contain artifacts (or, in the case of databases constructed from low-coverage next-generation sequencing, certain information may be imputed) [51,70]. Often, the degree to which variants have been validated can be queried in the databases. Further, information in databases may have been generated using different platforms, making direct comparisons of allele frequencies challenging in some situations (e.g., one version of a sequencing platform may not have provided adequate coverage for a specific region, while later versions may include better coverage). Even setting aside the question of the reliability of the data, certain issues regarding the database population may present challenges. While some information is available on the ethnic/racial origins of the included individuals, populations are not equally represented in these databases. This can be problematic, for example, in the study of an individual from an ethnic group that is not well represented in the variant resources. In this latter case, reverting to actively testing panels of ethnically matched control samples would be recommended.

Incorporation of Pathway-Related Data

In many instances, the strategies described above may not reveal a causative mutation. In that scenario, another strategy involves using known biological data as part of the interrogation, such as a pathway-type analysis to select candidate genes within a genomic data set [95]. For example, a patient may have a clinical presentation strongly suggesting a mutation in a single gene (or a gene family). If analyzing this gene's sequence data, along with looking for causal CNVs or variants affecting its known regulatory regions, does not offer an answer, it may be possible to interrogate known interacting genes, even if they are not (yet) known to be related to human disease. One caveat is that the success of this approach depends upon how well a particular pathway is understood and described.
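A minimal sketch of this pathway-type expansion is shown below; the interaction map is a hypothetical stand-in for curated pathway or protein-interaction resources:

    # Pathway expansion sketch: if the known disease genes are negative,
    # consider their direct interactors as new candidates.
    def expand_candidates(negative_genes, interactions):
        """negative_genes: genes already excluded for this patient;
        interactions: {gene: set of interacting genes} from a pathway resource."""
        expanded = set()
        for gene in negative_genes:
            expanded |= interactions.get(gene, set())
        return expanded - set(negative_genes)  # only genes not already excluded

A real analysis might iterate this expansion (second-degree interactors) or weight candidates by interaction confidence, but, as noted above, the yield is bounded by how completely the pathway has been characterized.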

Recognizing and Managing Artifacts

One important way to filter out variants that may not contribute to the disease in question is by using population data, as described above. Another very important, related approach involves tools designed to weed out likely artifacts, or at least to assign a lower priority to detected genotypes that appear likely to represent false-positive data. There are a number of considerations here. First, there are clearly technical factors that can affect sequencing accuracy, such as the enrichment method and the platform chosen, as discussed above [55,56,64]. Bioinformatic issues are of equal importance. In terms of genotype calling, the main issue is balancing the system such that there is not an overwhelming amount of false-positive data, while the likelihood of missing potentially important variants remains low. This part of genome sequencing analysis is based on a genotype score, calculated by algorithms that attempt to maximize sensitivity without sacrificing accuracy when calling genotypes. For example, Most Probable Genotype (MPG) is a Bayesian genotype-assigning algorithm that uses aligned sequence data; when calculated from well-aligned Illumina reads, genotypes with MPG scores of at least 10 are concordant with SNP-chip microarray genotypes approximately 99.8% of the time (a score cutoff of 10 is appropriate, as accuracy is not greatly improved with higher scores) [64]. Achieving an MPG score of at least 10 is feasible with approximately 10-20x coverage, arguing that lower coverage may still produce analyzable and useful information [64]. Further, using a cutoff of at least 0.5 for the ratio of MPG score to coverage can reduce false-positive data arising from alignment errors, similar to other methods [96]. Beyond genotype calling, there are several other key issues related to decreasing noise in genomic sequencing data. One consideration is that specific platforms, whether using custom-designed or widely commercially available probe sets, will each have their own peccadilloes, resulting in specific regions or base positions that can be problematic. One way to approach this problem is to compare sequence data with other data sets derived on the same platform in the same setting. For example, one study examining the use of exome sequencing in constitutional disorders looked at exome sequence data from 118 individuals (representing 29 families) [97]. From the exome data of these 118 individuals, analysis produced a list of genes that may initially be ranked lower in terms of the likelihood of being disease-associated because (1) they were located in highly polymorphic genomic regions; (2) sequence data suggested misalignment; or (3) variants were called because the reference genome data were misleading (i.e., the reference sequence used for comparison at that position actually represented a less common genotype).

Using these 118 individuals (and an additional 401 control exomes) [98], and applying Hardy-Weinberg equilibrium analysis, the study found over 23,000 positions at which alignment errors were suggested (because of the frequency of heterozygous variants), as well as over 1000 positions that were initially read as containing homozygous variants because the reference sequence contains a minor allele designation. The analysis also identified several thousand polymorphic genes that might be considered or prioritized differently when looking for causative mutations in a rare disorder [96]. As with other aspects of genomic sequencing, manually recalibrating these parameters can be difficult, and the incorporation of machine learning can be a key part of the automation process, allowing tools to evolve and improve [96].
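A genotype-quality filter along these lines can be sketched as follows; the thresholds (a score of at least 10 and a score-to-coverage ratio of at least 0.5) are those quoted above for MPG, while the call representation itself is hypothetical:

    # Genotype-quality filter sketch using the cutoffs quoted in the text:
    # an MPG-like genotype score of at least 10, and a score/coverage ratio
    # of at least 0.5 (which helps flag alignment artifacts).
    def passes_quality(call, min_score=10, min_ratio=0.5):
        """call: dict with 'score' (Bayesian genotype score) and 'coverage'."""
        if call["coverage"] == 0:
            return False  # no reads; nothing to assess
        return (call["score"] >= min_score
                and call["score"] / call["coverage"] >= min_ratio)

For example, a call with score 12 at 30x coverage would fail the ratio test (12/30 = 0.4), consistent with the idea that a modest score despite deep coverage suggests misalignment rather than a confident genotype.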

The Necessity of Independent Validation

It should be clear that the large amount of data generated by genomic sequencing, even if highly accurate, will include a substantial amount of false-positive data. Thus, independent validation of detected genotypes is critical. Familial studies can be helpful (e.g., in the case of inherited homozygous, compound heterozygous, or heterozygous variants), though they will not be informative in the case of a putative de novo variant. For point mutations, validation is frequently performed through Sanger sequencing, though other methods may also be feasible, depending on the situation and the nature of the variant detected.

Functional Interpretation of Variants

In addition to the methods outlined above (which largely center on general bioinformatic strategies), the predicted pathogenicity of a detected variant remains a key part of prioritization. This step can be especially challenging, even when population frequency data are taken into account [72,73]. There are a number of in silico predictive models available, some of which focus on single nucleotide substitutions, and which take into account factors such as the type of variant (e.g., missense substitution, including the type of amino acid change, truncating mutation, or in-frame deletion or insertion), the evolutionary conservation of the affected amino acid or portion of the gene (as well as the surrounding regions), and the location of the variant in relation to known functional motifs [99-101]. As described below, these predictive models can be incorporated into probabilistic frameworks that help rank variants in the search for genetic causes [64,72,101]. Naturally, these predictive algorithms depend on the current state of genomic knowledge and on manually constructed databases, and are thus inherently imperfect predictors of functional effect. To a certain degree, this problem may be less overwhelming when the goal of genomic sequencing is to identify a novel genetic cause of a constitutional disorder; that is, the presence of apparently severe mutations segregating in a family may overall provide strong evidence for the role of those genetic variants in the particular condition. However, at the individual level, where a proband undergoes genomic sequencing for clinical reasons, it can be extremely challenging to judge whether a variant simply represents a rare familial change or a true disease-causing mutation. This is an especially difficult problem in polymorphic genes. This is not a new issue by any means; it has long been a challenge in the interpretation of data derived from Sanger sequencing, DNA microarrays, and other modalities. The difference, of course, is that new genomic sequencing methods simply generate more variants of unknown significance (VUSs), by virtue of the amount of sequence data produced. Thus, even in the age of ubiquitous genomic sequencing, classic bench-based functional approaches must not be abandoned entirely. In some instances, these functional approaches can be obviated by statistical evidence, though bench work remains critical to explore and explain the underlying biology. Both cell-based functional assays and animal models will remain important steps in demonstrating the causality of a given mutation detected by genomic sequencing in constitutional disorders. The type of assay and/or animal model selected will naturally depend upon the disorder. Just as pipelines are being developed to sequence, align, and analyze large-scale genomic data sets, similar research pipelines are necessary for functional analysis in order to allow better functional predictions [102]. An illustration of one of the emerging resources that can be coupled with the increased ability to generate sequence data is the Zebrafish Insertion Collection (ZInC) (available at http://research.nhgri.nih.gov/ZInC/), a genome-wide knockout resource targeting all protein-coding genes in zebrafish [103,104]. Such resources can be highly beneficial in the study of new candidate genes that may provide novel explanations for constitutional human disorders.
Notably, ZInC makes all generated mutant zebrafish freely available to the scientific community, offering great potential for future study.

Combinatorial Approaches

A number of tools exist to automatically filter and rank variants, and these may be employed to examine individuals with hypothesized Mendelian diseases or more complex traits (though the latter naturally requires larger sample sizes to achieve statistical significance) [72,73]. For example, one tool, the "Variant Annotation, Analysis and Search Tool" (VAAST, available at http://www.yandell-lab.org/software/vaast.html), is a probabilistic algorithm that combines elements of amino acid substitution analysis [105] with aggregative approaches to construct a single, unified likelihood framework; combining the types of considerations outlined above can help achieve better accuracy and statistical power [72]. From a genomic data set, while taking into account the inheritance pattern suggested by a particular pedigree, VAAST annotates single nucleotide variants based on their effect on the coding sequence and performs a statistical analysis (which can be restricted to a certain chromosome, such as the X chromosome, or to a linked region) in order to flag the variants most likely to cause the disease in question. When selecting candidate genes, VAAST uses a likelihood ratio test that takes into account both amino acid substitution frequencies and allele frequencies to prioritize candidate genes. The power of this method was demonstrated by the identification of mutations in NAA10 as the cause of X-linked N-terminal acetyltransferase deficiency, a condition that is lethal in males and involves a prematurely aged appearance, neurological impairment, and cardiac, craniofacial, and genitourinary anomalies [106]. Another tool using a similar approach is VAR-MD (available at http://research.nhgri.nih.gov/software/var-md/), which focuses on identifying the causes of rare Mendelian disorders from exome data and which, like VAAST, can work well with small pedigrees [73]. VAR-MD, like VAAST, produces a list of ranked variants, taking into consideration the predicted pathogenicity of the detected variant, population variant frequency, genotype quality, and the inheritance pattern. VAR-MD was used to find the cause of disease in three families affected by, respectively, AFG3L2-related spastic ataxia-neuropathy (through homozygosity mapping, as the parents were first cousins, followed by exome sequencing of a quartet including two affected siblings), GM1 gangliosidosis (through sequencing of a quartet including one affected and one unaffected sibling, as well as a first cousin), and fatty acid hydroxylase-associated neurodegeneration (through the identification of a heterozygous deletion in FA2H by SNP microarray, followed by exome sequencing of a quartet with one affected and one unaffected sibling, which identified a mutation in trans with the deletion in the proband) [73].
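For illustration only (this is not the actual VAAST or VAR-MD algorithm), the sketch below combines the evidence types these tools integrate into a single ranking score; all field names and weights are hypothetical:

    # Illustrative combinatorial prioritization: combine inheritance fit,
    # population rarity, predicted deleteriousness, and genotype quality
    # into one ranking score. Real tools use calibrated likelihood models.
    def rank_variants(variants):
        """variants: list of dicts with keys 'fits_inheritance' (bool),
        'pop_freq' (0..1), 'deleterious_score' (0..1), 'genotype_ok' (bool)."""
        def score(v):
            if not (v["fits_inheritance"] and v["genotype_ok"]):
                return 0.0  # hard filters: inheritance model and call quality
            rarity = 1.0 - min(v["pop_freq"] * 100, 1.0)  # ~0 above 1% frequency
            return rarity * v["deleterious_score"]
        return sorted(variants, key=score, reverse=True)

The design point this toy example shares with the real tools is that no single line of evidence is decisive; rarity, predicted impact, segregation, and data quality are multiplied together so that weakness in any one dimension demotes a variant.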

FIGURE 17.2 Example algorithm for the consideration of genetic/genomic investigation in a patient with a constitutional disorder (decision points shown in the flowchart: thorough phenotypic (clinical) analysis; is the disorder recognizable?; is there a known genetic cause?; is there evidence for a genetic etiology?; is there a targeted test available?; targeted testing and analysis of known genes, with positive results leading to familial studies and further studies as necessary (e.g., functional analyses, study of similar patients), and negative results leading to consideration of genomic studies, assuming genetic factors are likely, with analysis per Figure 17.1). Depending on disease recognition, the availability of targeted testing, and the results of initial testing, true genomic studies may be warranted. If causes are not initially found through genomic sequencing, analysis steps as outlined in Figure 17.1 may be attempted.

Clinical Genomic Sequencing

There are a number of additional points that focus on clinical applications of genomic sequencing, particularly involving how new sequencing technologies are changing the practice of clinical genetics and related fields.

Determining the Optimal Scope of Genetic/Genomic Investigations

While genomic sequencing will make a great deal of genetic data available for a single patient, it should be clear that simply sequencing a patient's exome or genome and analyzing the data may be far from ideal in many scenarios. Frequently, it is better to proceed with a more targeted approach, and there are many ways to do this depending on available resources and the nature of the clinical question (Figure 17.2). One way to limit the scope of genetic examination (and thus decrease analysis time) relies on preselecting targets for analysis. At one end of the spectrum, some conditions may involve known common mutations, as occurs with achondroplasia (the most common form of dwarfism) or certain types of craniosynostosis (in which there is premature fusion of some or all of the sutures of the skull) [107,108]. In other words, a lack of allelic heterogeneity may allow easy and inexpensive assays that obviate the need for large-scale sequencing. To extend this point to conditions with considerable allelic heterogeneity, but which are known to result from mutations in a single gene or a small number of genes, custom panels of selected variants may be used to quickly screen for common mutations. This is frequently done in cystic fibrosis screening, for instance in preconception planning. An advantage of these common panels is well-established sensitivity and specificity, as well as the ability to circumvent the challenges involved in interpreting VUSs. Conversely, limiting the data set to preselected mutations will miss variants that are not part of the chosen panel, and negative results should be interpreted in light of Bayesian risk calculations [109] (a worked example appears at the end of this subsection). Further, "next-generation" methodologies may also be used to explore larger panels of genes. In this case, massively parallel sequencing (even without sequencing the full exome or genome) can make large amounts of sequencing data readily available. One advantage of this approach is that probes can be specifically designed to capture important regions, including regions outside the exome; such designs can help improve genotype quality for the targeted regions. To an extent, exome or genome sequencing can also be used to interrogate a preselected panel of genes, but the caveat here goes back to the issue of incomplete coverage, which may be especially problematic in the clinical setting. For example, patients with ciliopathies may demonstrate a recognizable combination of findings. This phenotype may include a combination of renal, ophthalmologic, neurologic, and other manifestations, and there is considerable overlap between conditions. The genetic cause may be bi-allelic mutations in any one of dozens of genes [110]. Genomic sequencing can allow rapid analysis of these genes and, further, if causative explanations are not found, may eventually allow identification of novel causes [111]. Similarly, genomic sequencing might be combined with biochemical assays in order to augment newborn screening.
By performing sequencing of selected genes involved in the disorders that newborn screening currently assays, along with traditional biochemical methods, clinically important information may be obtained more quickly, especially as genotypic data can be used to make clinical decisions and to parse variants that may affect enzymatic activity (and cause anomalies on biochemical assays) but may not result in clinical disease [112]. One of the great benefits of genomic sequencing is that the availability of the genomic data allows both the rapid testing of clinical hypotheses (in the case of ciliopathies, the analysis of the dozens of implicated genes) and a more hypothesis-free approach involving larger gene sets, or even the entire exome or genome. After ruling out genes that are known to be disease-associated, novel candidate genes can be explored (such as by pathway analysis). If the genetic cause remains unknown at that point, mutations in other genes may be investigated even without independent biological evidence that a given gene is involved in the particular disease; for example, if a number of individuals with the same clinical condition all have truncating mutations in a certain gene, that gene may be causative even if relatively little is known about the underlying pathophysiologic mechanism. Finally, if a causal mutation has still not been established, the genomic data can be banked until more similar patients are available for study, as a larger cohort may be necessary for gene identification, especially if the condition is not monogenic. Data can also be reanalyzed at a later time, when the cause may have been found independently.
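Returning to the point above about interpreting negative panel results with Bayesian risk calculations [109], a worked sketch follows; the 1-in-25 prior carrier risk and 90% panel detection rate are illustrative values, not properties of any specific commercial panel:

    # Bayesian residual carrier risk after a negative mutation panel.
    # Illustrative inputs: prior carrier risk 1/25 (on the order of CFTR
    # carrier frequency in Caucasians), panel detecting 90% of carrier alleles.
    prior = 1 / 25
    detection = 0.90
    # P(carrier | negative) = P(negative | carrier) * P(carrier) / P(negative)
    posterior = (prior * (1 - detection)) / (prior * (1 - detection) + (1 - prior))
    print(f"Residual carrier risk ~ 1 in {1 / posterior:.0f}")  # ~1 in 241

The calculation shows why a negative panel result reduces, but does not eliminate, carrier risk: the residual risk depends on both the prior and the fraction of disease alleles the panel covers.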

Clinical Utility: Translating Genomic Knowledge from Rare Disease Research to More General Health Care Situations

The Clinical Timeline

Clearly, the use of genomic technologies will increase our knowledge about the causes of both extremely rare and more common genetic diseases, since it is now relatively easy to generate large amounts of genomic data quickly and cheaply. However, as mentioned, a key obstacle arises in attempting to apply genomic sequencing in "real-time" situations, in which medical decisions may hinge on knowledge of the genetic diagnosis [113]. In many clinical situations, knowledge of the presence of a genetic mutation could potentially have a number of effects on clinical care, including the treatments used and the surveillance indicated for disease manifestations, as well as considerations related to prognosis, medical decision making, and, eventually, reproductive choices. Unfortunately, the advantages of genetic diagnosis in many of these situations may be lost (or greatly diminished) without efficient turnaround of results. For example, there are a number of genes that can be mutated in autosomal recessive "malignant" osteopetrosis, a condition resulting in severely increased bone density, which can manifest with bone fractures, compressive neuropathies, tetanic seizures due to hypocalcemia, and life-threatening pancytopenia. Prompt recognition of the genetic lesion may be important in order to allow early treatment with hematopoietic stem cell transplantation [114]. In other constitutional conditions, such as certain types of severe combined immunodeficiency [115] or Wiskott-Aldrich syndrome (an X-linked disorder whose features include thrombocytopenia), identification of the genetic lesion can allow treatment with gene therapy [116]. Less dramatically perhaps, knowledge of a genetic disorder can also be used to institute long-term treatments aimed at ameliorating disease sequelae, as in the classic case of phenylketonuria, in which the avoidance of phenylalanine, coupled with dietary supplements, can decrease the risk of permanent neurological impairment (in patients with phenylketonuria, neurocognitive impairment resulting either from failure to recognize the disease or from less-than-ideal adherence to treatment regimens is irreversible) [117]. In other constitutional disorders, preventive measures and surveillance can help detect disease sequelae early and prevent or decrease significant morbidity and mortality. To illustrate this point, individuals affected by Ehlers-Danlos syndrome type IV (due to heterozygous mutations in COL3A1) typically manifest features including characteristic facies, thin and translucent skin, easy bruisability, and arterial, intestinal, and/or uterine fragility. Often, the condition may not be recognized until a severe presentation brings an affected patient to clinical attention; these presentations can include vascular dissection or rupture, gastrointestinal perforation, or organ rupture. If patients are diagnosed early, surveillance with CT or MRI can help identify arterial anomalies to allow early management, pregnant women can be followed at specialized centers that can plan the gestation and delivery optimally, and preventive measures can be instituted, such as avoidance of contact sports and other activities that put patients at risk of traumatic injury [118,119].
With strong evidence that genetic information can be important for the management of many constitutional disorders, the key question is how to use modern sequencing technologies to help as many patients as possible, at the earliest possible time. Early experiences show the increasing potential of genomic sequencing as a clinical tool, and the value of increased automation of pipelines for sequencing, variant analysis and confirmation, and clinical interpretation [66,120].

Integrating the Management of Additional Genomic Information

Genomic sequencing can be medically important for health issues other than the genetic cause of a patient's constitutional disorder. Additional medical information refers, in this context, to genetic information that may not be related to the primary research or clinical question. While the goal of DNA sequencing may be to explain the presence of a congenital malformation or other constitutional disorder, the patient's genome will inevitably include other potentially relevant data that may, in some instances, be just as important as the cause of the congenital malformation [121-123]. For example, individuals may be found to harbor previously unsuspected high-penetrance alleles that confer a high risk of cancer. These individuals may benefit from surveillance that allows early detection and treatment of neoplasms, which can ameliorate the resultant morbidity and mortality. As many of these medically important variants are inherited, such findings may have important health implications for the proband (the individual who initially comes to clinical attention) as well as for family members [124].

As a specific example of the potential significance of medical information resulting from large-scale sequencing, exome sequencing in a patient with features of VACTERL association, a multimalformation disorder involving multiple congenital anomalies [125], did not reveal an obvious cause of VACTERL association, but did reveal a potential explanation for why the patient suffered a very difficult recovery from his neonatal surgery for the anomalies that are part of the condition. The patient was found to have a hypomorphic allele of CPS1, which encodes an enzyme involved in the urea cycle, and which may have affected nitric oxide production, causing susceptibility to pulmonary artery hypertension [126,127]. In addition to identifying causes of other comorbid constitutional conditions, genomic data will also encompass risk factors for multifactorial disorders (examples include genetic risk factors for many relatively common diseases, such as psychiatric disorders, cardiovascular disease, and diabetes mellitus), though these can be very difficult to interpret in a clinically meaningful way at the individual patient level [128]. As mentioned above, genomic data may also have pharmacogenomic importance, such as for rates of drug metabolism as well as other sources of potential adverse effects from certain agents [50].

Managing the Data Load in Clinical Scenarios

It must be emphasized that, with currently available tools and resources, each individual's genome inevitably contains far too many potentially clinically meaningful variants for any one health care worker to interpret and manage, even if every single person were assigned their own "clinical genomicist" to analyze and handle the medical consequences arising from the data. For example, one prediction suggested that each patient who undergoes genome sequencing might need to be informed of the presence of approximately 100 genetic variants detected in their genome; extrapolating from this, the investigators estimated that if each variant were discussed for only 3 min with a given patient, the initial return of results alone would require at least 5 h of discussion. Obviously, the "3 min" per variant used here may be a vast underestimate if any reasonably full discussion were attempted, or if the patient were not already extremely facile with many basic concepts of genetics and genomics. Moreover, this estimate does not include requirements for pretest counseling, early and late long-term follow-up, or the additional counseling and testing that would be needed for at-risk family members [129,130]. Several related strategies have been devised to optimize the analysis and use of a patient's total set of genomic information. One general strategy involves the a priori "binning" of genes and types of variants. As part of this binning process, certain genes would be preselected such that mutations would be automatically flagged and brought to clinical attention. For example, any time a pathogenic mutation in the genes MLH1 or MSH2 (associated with a type of hereditary colon cancer) was found, it would be highlighted and immediately brought to the attention of the managing researcher or clinician [131,132]. According to this strategy, as genomic sequencing becomes increasingly common, the highest clinical yield, especially in the early phases, will relate to conditions and specific variants with the strongest evidence of a highly penetrant, pathogenic mutation.
Several related strategies have been devised to optimize the analysis and use of a patient's total set of genomic information. One general strategy involves the a priori "binning" of genes and types of variants. As part of this binning process, certain genes would be preselected such that mutations in them would be automatically flagged and brought to clinical attention. For example, any time a pathogenic mutation was found in MLH1 or MSH2 (genes associated with a type of hereditary colon cancer), it would be highlighted and immediately brought to the attention of the managing researcher or clinician [131,132]. According to this strategy, as genomic sequencing becomes increasingly common, the highest clinical yield, especially in the early phases, will relate to conditions and specific variants with the strongest evidence for a highly penetrant, pathogenic mutation.

This binning process, in other words, hinges on two requirements. The first requirement involves the gene or locus where the variant is found: there must be extremely strong evidence that mutations in that locus cause the disease in question. The second requirement involves the precise variant itself: there must again be extremely strong evidence that the variant is pathogenic, either because of the nature of the variant (such as a clear loss-of-function truncating mutation in a scenario where a heterozygous loss-of-function mutation has been shown to cause disease) or because that exact variant has been demonstrated to be disease causing (such as a missense variant or an in-frame deletion or insertion previously shown to cause disease) [131,132]. In addition to the many obvious challenges related to this binning process, a major issue is that genetic knowledge is constantly changing, and so the constructed bins must be dynamic to keep pace with scientific discovery.

In addition to considerations related to the disease and the identified variants, there is a third factor, which is perhaps even more challenging and controversial: the clinical utility of knowledge of the variant. That is, would knowledge of the variant make a difference in the medical management of the patient? This question hearkens back to the philosophies that form the foundation for decisions about which clinically significant disorders are included in newborn screening programs [133]. Considerations for inclusion may be summarized by the following points: (1) there must be a currently available, beneficial, and necessary intervention; and (2) the natural history of the disease must be well understood. Committees deciding what should be included in newborn screening panels must also consider the central question of whether a test is available; after all, a devastating condition may be easily and effectively treated when diagnosed early, but if there is no way to test for the condition as part of newborn screening, it cannot be included.

With genomic sequencing, this latter consideration (i.e., whether a test is available) is often moot: in conducting genomic sequencing, thousands of Mendelian/constitutional conditions are being tested for, not to mention the many thousands of potential genetic risk factors such as those identified through association-based studies. This is not to imply that genomic sequencing is a perfect test in many situations, but the fact remains that many of these conditions can be genomically assayed.

An example of a tool constructed to help manage this issue is the Clinical Genomic Database (CGD) (available at http://research.nhgri.nih.gov/CGD/) [134]. The CGD is a searchable, freely web-accessible database of conditions with genetic causes, focusing on the clinical utility of genetic diagnosis and the availability of specific medical interventions. The CGD includes thousands of genes and conditions for which the finding of pathogenic mutations could be expected to justify specific intervention(s), as well as thousands of genes and conditions for which specific interventions are not yet available, but for which genetic knowledge may be important to, for example, select supportive care, inform medical decisions, enable prognostic considerations, inform reproductive decisions, and allow avoidance of unnecessary testing. For each entry, the CGD includes the gene symbol, conditions, allelic conditions, clinical categorization, mode of inheritance, affected age group, description of interventions/rationale, links to other complementary databases (including databases of variants and presumed pathogenic mutations), and links to individual PubMed references. The CGD is actively maintained to keep pace with changing medical knowledge and may assist in the rapid, clinically oriented curation of individual genomes [134].

One challenge that arises with tools like the CGD, and with binning strategies more generally, is reaching agreement about which conditions truly have clinical utility in the genomic context; there is wide divergence of opinion, even among expert genomicists, regarding specific conditions [135]. Tackling this problem will require a great amount of cooperative work, which will undoubtedly not be without controversy. However, it is both impossible and pointless to attempt to arrive at a single algorithm that governs how every possible genomic variant is treated regardless of the clinical situation. In the use of genomic sequencing, it is vital to consider the context in which sequencing is performed, whether it is carried out on a research or clinical basis, and, within each of these broad categories, the exact circumstances surrounding the particular genomic question [136]. The bottom line is that, while there is no single best way to manage all the information in a person's genome, careful and conscientious preplanning is critical to avoid overwhelming genomic and other practitioners, to avoid overly diverting resources away from the primary research or clinical question, and, most importantly, to enable the best possible outcomes for patients and research participants [123,131,132].
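To make the two-tier binning logic concrete, the sketch below filters variant calls against a small CGD-style lookup table. The gene entries, field names, and evidence labels are illustrative assumptions, not the actual CGD schema:

```python
# A minimal sketch of a priori variant "binning," in the spirit of the
# CGD-style curation described above. The gene table, evidence labels, and
# "actionable" flag are illustrative assumptions, not the real CGD schema.

# Requirement 1: locus-level evidence that mutations in the gene cause disease,
# plus a clinical-utility category (is a specific intervention available?).
GENE_BINS = {
    "MLH1": {"condition": "hereditary colon cancer", "actionable": True},
    "MSH2": {"condition": "hereditary colon cancer", "actionable": True},
}

def bin_variant(gene, variant_evidence):
    """Assign a reporting bin to a single variant call.

    variant_evidence captures Requirement 2: evidence that this precise
    variant is pathogenic (e.g., a clear loss-of-function allele, or a
    variant previously demonstrated to cause disease).
    """
    locus = GENE_BINS.get(gene)
    if locus is None or variant_evidence != "pathogenic":
        return "no automatic flag"  # fails one of the two requirements
    if locus["actionable"]:
        return f"flag for clinical attention ({locus['condition']})"
    return "report, but no specific intervention yet available"

print(bin_variant("MLH1", "pathogenic"))   # flagged for clinical attention
print(bin_variant("MLH1", "uncertain"))    # a VUS is not auto-flagged
print(bin_variant("ABCB1", "pathogenic"))  # gene not (yet) in the bins
```

Because genetic knowledge changes, a real implementation would treat the gene table as a versioned, actively curated resource rather than a static dictionary, mirroring the maintenance model of the CGD itself.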

Consequences of Genomic Sequencing

The availability of new sequencing technologies is unquestionably answering many biomedical questions related to the causes of human disease. Just as importantly, these tools are allowing complex new hypotheses to be formulated and addressed. Even in the early days of the "genomic revolution," the advent of these sequencing technologies is changing how human disease is viewed in several key ways. These changes, at this point, apply especially to classic constitutional/Mendelian disorders.

First, genomic sequencing is helping to find the causes of constitutional disorders in many individuals who, on purely clinical grounds, do not fit easily into a known syndrome or condition. In some cases, the simple explanation is that the condition is due to a novel cause. In other instances, however, the patient may have a mutation in a gene that is already known to cause disease; in these situations, the disease may be allelic with a previously known condition (i.e., more than one disease caused by mutations in the same gene). There are many examples of these types of allelic conditions; one of the most interesting involves Pallister–Hall syndrome and Greig cephalopolysyndactyly syndrome, both of which are due to mutations in the GLI3 gene. The diseases do share some phenotypic similarities, but Pallister–Hall syndrome is distinguished by features such as insertional polydactyly, hypothalamic hamartoma, and bifid epiglottis, while the cardinal manifestations of Greig cephalopolysyndactyly syndrome include preaxial polydactyly and hypertelorism. In these conditions, the clinical expression (i.e., which of the two phenotypic categories the patient falls into) largely depends on the type and location of the mutation within GLI3 [137,138]. With the increased use of genomic sequencing, many more such allelic diseases will likely be identified. Sequencing individuals with unknown genetic etiologies will also identify individuals with known syndromes (like Pallister–Hall or Greig syndromes) whose manifestations include features that have not previously been recognized. Related to this, many constitutional disorders are initially described in the most severely affected individuals. This is logical, as the most severely affected individuals tend to come to clinical attention earliest and most frequently.

As genomic sequencing is performed for more and more affected individuals, it is anticipated that the less severe end of the spectrum of any given constitutional disorder will also become better recognized. In other words, with the increased use of genomic sequencing in patients with unknown genetic etiologies, the spectrum of disease is expanding considerably, both in terms of the range of severity and with respect to novel findings not previously known to be part of a disorder.

Second, just as the limits of variable expressivity are expanding in many constitutional disorders, the use of high-throughput sequencing in large numbers of individuals is changing the penetrance estimates for many conditions. A fascinating example relates to cancer-predisposition genes, in which predicted high-penetrance mutations in genes such as BRCA1 and BRCA2 are increasingly found in asymptomatic individuals who do not have a suggestive family history, and who would thus not meet criteria for genetic screening [124]. Analysis of large-scale data sets will have important implications for many disorders in addition to those involving cancer-predisposition genes. These analyses will help to provide more accurate risk–benefit ratios and are anticipated to answer questions about the clinical utility of genetic screening and testing in the population as a whole, as well as in smaller subgroups.

Third, and again related to the above points, genomic sequencing will ultimately challenge how traditional inheritance patterns of constitutional disorders are viewed. For example, even a classic autosomal dominant condition like holoprosencephaly has highly variable expressivity, ranging from subtle, isolated midline facial differences to severe neuroanatomical anomalies incompatible with extrauterine life [60]. Genomic sequencing may eventually bring clarity to many such disorders by revealing the interplay of multiple genetic factors, rather than one mutation acting in isolation. As these multiple factors become better understood, and now that there are tools to examine many genes and regions simultaneously, it may eventually become routine to look beyond a single mutation in order to better understand modifying variants.

Genetic Counseling and Ethical Issues

There are complex ethical, legal, and social considerations involved in managing and disseminating genomic sequencing results. These issues pertain to both research and clinical applications of genomic sequencing, as discussed in Chapter 24. Key ethical issues include a lack of empirical data to form the basis of existing guidelines, concerns about children's autonomy in the case of pediatric genomic sequencing (as is most frequent in the investigation of constitutional disorders), intrafamily issues relating to inherited mutations, concerns about genetic self-determinism, the problem of medical paternalism, questions regarding the "right not to know," thresholds of clinical utility, management of VUSs, and the possibility of identifiability [130,139–148].

CONCLUSION AND FUTURE DIRECTIONS

Modern genomic sequencing technologies are dramatically changing the landscape of clinical practice, and the pace of this change is expected to accelerate. Currently, due to the sudden availability and affordability of exomic and genomic sequencing, heavy emphasis has been placed upon maximizing the yield from genomic technology. This is logical, but it should not be imagined that the genetic code acts in a vacuum. To gain a fuller understanding of human health and disease, multiple interacting genetic, epigenetic, and environmental factors must be considered. In the future, it may become commonplace to examine many "omes" along with the genome, and to analyze these data together, as this approach may provide the most satisfying and complete explanations. For example, humans may have congenital vertebral malformations due to biallelic mutations in a number of genes, including DLL3, HES7, LFNG, and MESP2; heterozygous mutations, also detected in affected patients, have been shown to act as susceptibility factors in combination with environmental agents (such as hypoxia, in one mouse model) [149]. In another illustration, both genetic factors (e.g., mutations in or affecting the Hedgehog pathway) and environmental agents (such as maternal ethanol ingestion and diabetes mellitus) may contribute to holoprosencephaly pathogenesis [150,151]. These findings suggest that, even in classic Mendelian conditions, gene–environment interactions are important modulators of expressivity, may help explain questions of incomplete penetrance, and thus must be considered in concert with genomic sequencing data. In other words, the combination of genomic sequencing and additional temporally scaled high-throughput methods (e.g., transcriptomic, proteomic, metabolomic, and autoantibody profiles) may provide a powerful means for diagnosing disease, determining prognosis and optimal treatment, and monitoring response to therapy, among other advantages [152].

Just as achieving a better understanding of how genomic variation explains and contributes to human disease is an important goal in the ongoing evolution of true genomic medicine, it will be critical to ensure that technological advances that may potentially benefit many individuals are not limited to those who live in certain geographic areas or nations. Otherwise, many individuals are at risk of being "left behind in [the] genomic revolution," which may well contribute to and increase global and ethnic inequalities in health and economic status [153]. Multiple institutions and funding agencies are attempting to address these concerns, but much dedication and care will be necessary to ensure that the advantages of genomic medicine are widely and globally available.

Acknowledgment

This work was supported by the Division of Intramural Research, National Human Genome Research Institute (NHGRI), National Institutes of Health, Department of Health and Human Services, USA. The author thanks all the research participants who have taken part in his genomic research; the NIH Intramural Sequencing Center faculty and staff for their dedication, expertise, and patient guidance; Dr. Max Muenke for his support and mentorship; and Drs. Neil Boerkoel, Erich Roessler, and Murat Sincan for their helpful discussion and insights.

References [1] Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009;461:2726. [2] Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, et al. Exome sequencing identifies the cause of a Mendelian disorder. Nat Genet 2010;42:305. [3] Jugessur A, Murray JC. Orofacial clefting: recent insights into a complex trait. Curr Opin Genet Dev 2005;15:2708. [4] Grosen D, Chevrier C, Skytthe A, Bille C, Mølsted K, Sivertsen A, et al. A cohort study of recurrence patterns among more than 54,000 relatives of oral cleft cases in Denmark: support for the multifactorial threshold model of inheritance. J Med Genet 2010;47:1628. [5] De Marco P, Merello E, Cama A, Kibar Z, Capra V. Human neural tube defects: genetic causes and prevention. Biofactors 2011;37:2618. [6] Glessner JT, Hakonarson H. Common variants in polygenic schizophrenia. Genome Biol 2009;10:236. [7] Sanghera DK, Blackett PR. Type 2 diabetes genetics: beyond GWAS. J Diabetes Metab 2012;3:198. [8] Lindhurst MJ, Sapp JC, Teer JK, Johnston JJ, Finn EM, Peters K, et al. A mosaic activating mutation in AKT1 associated with the Proteus syndrome. N Engl J Med 2011;365:6119. [9] McIntosh R, Merritt KK, Richards MR, Samuels MH, Bellows MT. The incidence of congenital malformations: a study of 5,964 pregnancies. Pediatrics 1954;14:50522. [10] Green CR. The incidence of human maldevelopment. Am J Dis Child 1963;105:30112. [11] Baird PA, Anderson TW, Newcombe HB, Lowry RB. Genetic disorders in children and young adults: a population study. Am J Hum Genet 1988;42:67793. [12] McCandless SE, Brunger JW, Cassidy SB. The burden of genetic disease on inpatient care in a children’s hospital. Am J Hum Genet 2004;74:1217. [13] Stevenson DA, Carey JC. Contribution of malformations and genetic disorders to mortality in a children’s hospital. Am J Med Genet A 2004;126A:3937. [14] O’Malley M, Hutcheon RG. Genetic disorders and congenital malformations in pediatric long-term care. J Am Med Dir Assoc 2007;8:3324. [15] Lejeune L, Gautier M, Turpin RA. Mongolisme; une maladie chromosomique (trisomy). Bull Acad Natl Med 1959;143:25665. [16] Ford CE, Jones KW, Polani PE. A sex chromosomal anomaly in a case of gonadal dysgenesis (Turner’s syndrome). Lancet 1959;1:7713. [17] Edwards JH, Harnden DG, Cameron AH, Crosse VM, Wolff OH. A new trisomic syndrome. Lancet 1960;1:78790. [18] Patau K, Smith DW, Therman E, Inhorn SL, Wagner HP. Multiple congenital anomaly caused by an extra autosome. Lancet 1960;1:7903. [19] Arnason U. 50 years after—examination of some circumstances around the establishment of the correct chromosome number of man. Hereditas 2006;143:20211. [20] Cooper H, Hirschhorn K. Apparent deletion of short arms of one chromosome (4 or 5) in a child with defects of midline fusion. Mamm Chrom Nwsl 1961;4:14. [21] Lejeune J, Lafourcade J, Berger R, Vialatte J, Boeswillwald M, Seringe P, et al. 3 cases of partial deletion of the short arm of a 5 chromosome. C R Hebd Seances Acad Sci 1963;257:3098102. [22] Yunis JJ. High resolution of human chromosomes. Science 1976;191:126870. [23] Therman E, Susman B, Denniston C. The nonrandom participation of human acrocentric chromosomes in Robertsonian translocations. Ann Hum Genet 1989;53:4965. [24] Baptista J, Prigmore E, Gribble SM, Jacobs PA, Carter NP, Crolla JA. Molecular cytogenetic analyses of breakpoints in apparently balanced reciprocal translocations carried by phenotypically normal individuals. 
Eur J Hum Genet 2005;13:120512. [25] Talkowski ME, Rosenfeld JA, Blumenthal I, Pillalamarri V, Chiang C, Heilbut A, et al. Sequencing chromosomal abnormalities reveals neurodevelopmental loci that confer risk across diagnostic boundaries. Cell 2012;149:52537.

[26] Flint J, Wilkie AO, Buckle VJ, Winter RM, Holland AJ, McDermid HE. The detection of subtelomeric chromosomal rearrangements in idiopathic mental retardation. Nat Genet 1995;9:13240. [27] Knight SJ, Lese CM, Precht KS, Kuc J, Ning Y, Lucas S, et al. An optimized set of human telomere clones for studying telomere integrity and architecture. Am J Hum Genet 2000;67:32032. [28] Ravnan JB, Tepperberg JH, Papenhausen P, Lamb AN, Hedrick J, Eash D, et al. Subtelomere FISH analysis of 11 688 cases: an evaluation of the frequency and pattern of subtelomere rearrangements in individuals with developmental disabilities. J Med Genet 2006;43:47889. [29] Botstein D, Risch N. Discovering genotypes underlying human phenotypes: past successes for Mendelian disease, future approaches for complex disease. Nat Genet 2003;33(Suppl.):22837. [30] Kerem B, Rommens JM, Buchanan JA, Markiewicz D, Cox TK, Chakravarti A, et al. Identification of the cystic fibrosis gene: genetic analysis. Science 1989;245:107380. [31] Augenlicht LH, Kobrin D. Cloning and screening of sequences expressed in a mouse colon tumor. Cancer Res 1982;42:108893. [32] Augenlicht LH, Wahrman MZ, Halsey H, Anderson L, Taylor J, Lipkin M. Expression of cloned sequences in biopsies of human colonic tissue and in colonic carcinoma cells induced to differentiate in vitro. Cancer Res 1987;47:601721. [33] Augenlicht LH, Taylor J, Anderson L, Lipkin M. Patterns of gene expression that characterize the colonic mucosa in patients at genetic risk for colonic cancer. Proc Natl Acad Sci USA 1991;88:32869. [34] Ishkanian AS, Malloff CA, Watson SK, DeLeeuw RJ, Chi B, Coe BP, et al. A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet 2004;36:299303. [35] Dhami P, Coffey AJ, Abbs S, Vermeesch JR, Dumanski JP, Woodward KJ, et al. Exon array CGH: detection of copy-number changes at the resolution of individual exons in the human genome. Am J Hum Genet 2005;76:75062. [36] Bignell GR, Huang J, Greshock J, Watt S, Butler A, West S, et al. High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Res 2004;14:28795. [37] Driscoll DA, Budarf ML, Emanuel BS. A genetic etiology for DiGeorge syndrome: consistent deletions and microdeletions of 22q11. Am J Hum Genet 1992;50:92433. [38] Wu YQ, Heilstedt HA, Bedell JA, May KM, Starkey DE, McPherson JD, et al. Molecular refinement of the 1p36 deletion syndrome reveals size diversity and a preponderance of maternally derived deletions. Hum Mol Genet 1999;8:31321. [39] Smith AC, McGavran L, Robinson J, Waldstein G, Macfarlane J, Zonona J, et al. Interstitial deletion of (17)(p11.2p11.2) in nine patients. Am J Med Genet 1986;24:393414. [40] Slager RE, Newton TL, Vlangos CN, Finucane B, Elsea SH. Mutations in RAI1 associated with SmithMagenis syndrome. Nat Genet 2003;33:4668. [41] Miller DT, Adam MP, Aradhya S, Biesecker LG, Brothman AR, Carter NP, et al. Consensus statement: chromosomal microarray is a firsttier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. Am J Hum Genet 2010;86:74964. [42] Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, et al. Complement factor H polymorphism in age-related macular degeneration. Science 2005;308:3859. [43] Manolio TA. Genomewide association studies and assessment of the risk of disease. N Engl J Med 2010;363:16676. 
[44] Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, , et al.International Human Genome Sequencing Consortium Initial sequencing and analysis of the human genome. Nature 2001;409:860921. [45] Watson JD, Crick FH. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature 1953;171:7378. [46] Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 1975;94:4418. [47] Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci USA 1977;74:54637. [48] Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 2000;18:6304. [49] Schuster SC. Next-generation sequencing transforms today’s biology. Nat Methods 2008;5:168. [50] Urban TJ. Whole-genome sequencing in pharmacogenetics. Pharmacogenomics 2013;14:3458. [51] 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467:106173. [52] Chan M, Ji SM, Yeo ZX, Gan L, Yap E, Yap YS, et al. Development of a next-generation sequencing method for BRCA mutation screening: a comparison between a high-throughput and a benchtop platform. J Mol Diagn 2012;14:60212. [53] Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA 2009;106:19096101. [54] ENCODE Project Consortium, Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, et al. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:5774. [55] Clark MJ, Chen R, Lam HY, Karczewski KJ, Chen R, Euskirchen G, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 2011;29:90814. [56] Sulonen AM, Ellonen P, Almusa H, Lepisto¨ M, Eldfors S, Hannula S, et al. Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol 2011;12:R94. [57] Frazer KA, Murray SS, Schork NJ, Topol EJ. Human genetic variation and its contribution to complex traits. Nat Rev Genet 2009;10:24151. [58] Roessler E, Belloni E, Gaudenz K, Jay P, Berta P, Scherer SW, et al. Mutations in the human Sonic Hedgehog gene cause holoprosencephaly. Nat Genet 1996;14:35760. [59] Brown SA, Warburton D, Brown LY, Yu CY, Roeder ER, Stengel-Rutkowski S, et al. Holoprosencephaly due to mutations in ZIC2, a homologue of Drosophila odd-paired. Nat Genet 1998;20:1803. [60] Solomon BD, Mercier S, Ve´lez JI, Pineda-Alvarez DE, Wyllie A, Zhou N, et al. Analysis of genotypephenotype correlations in human holoprosencephaly. Am J Med Genet C Semin Med Genet 2010;154C:13341.

[61] Jeong Y, Leskow FC, El-Jaick K, Roessler E, Muenke M, Yocum A, et al. Regulation of a remote Shh forebrain enhancer by the Six3 homeoprotein. Nat Genet 2008;40:134853. [62] Roessler E, Hu P, Hong SK, Srivastava K, Carrington B, Sood R, et al. Unique alterations of an ultraconserved non-coding element in the 30 UTR of ZIC2 in holoprosencephaly. PLoS One 2012;7:e39026. [63] Lettice LA, Heaney SJ, Purdie LA, Li L, de Beer P, Oostra BA, et al. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum Mol Genet 2003;12:172535. [64] Teer JK, Bonnycastle LL, Chines PS, Hansen NF, Aoyama N, Swift AJ, , et al.NISC Comparative Sequencing Program Systematic comparison of three genomic enrichment methods for massively parallel DNA sequencing. Genome Res 2010;20:142031. [65] Zhu M, Need AC, Han Y, Ge D, Maia JM, Zhu Q, et al. Using ERDS to infer copy-number variants in high-coverage genomes. Am J Hum Genet 2012;91:40821. [66] Talkowski ME, Ordulu Z, Pillalamarri V, Benson CB, Blumenthal I, Connolly S, et al. Clinical diagnosis by whole-genome sequencing of a prenatal sample. N Engl J Med 2012;367:222632. [67] Pineda-Alvarez DE, Dubourg C, David V, Roessler E, Muenke M. Current recommendations for the molecular evaluation of newly diagnosed holoprosencephaly patients. Am J Med Genet C Semin Med Genet 2010;154C:93101. [68] Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, et al. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 2014;15:25678. [69] Biesecker LG, Shianna KV, Mullikin JC. Exome sequencing: the expert view. Genome Biol 2011;12:128. [70] 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491:5665. [71] Cooper GM, Shendure J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet 2011;12:62840. [72] Yandell M, Huff C, Hu H, Singleton M, Moore B, Xing J, et al. A probabilistic disease-gene finder for personal genomes. Genome Res 2011;21:152942. [73] Sincan M, Simeonov DR, Adams D, Markello TC, Pierson TM, Toro C, et al. VAR-MD: a tool to analyze whole exome-genome variants in small human pedigrees with Mendelian inheritance. Hum Mutat 2012;33:5938. [74] Hartley SW, Sebastiani P. PleioGRiP: genetic risk prediction with pleiotropy. Bioinformatics 2013;29:10868. [75] Ng SB, Bigham AW, Buckingham KJ, Hannibal MC, McMillin MJ, Gildersleeve HI, et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet 2010;42:7903. [76] Hennekam RC, Biesecker LG. Next-generation sequencing demands next-generation phenotyping. Hum Mutat 2012;33:8846. [77] Roberts AE, Allanson JE, Tartaglia M, Gelb BD. Noonan syndrome. Lancet 2013;381:33342. [78] Oetting WS, Robinson PN, Greenblatt MS, Cotton RG, Beck T, Carey JC, et al. Getting ready for the Human Phenome Project: the 2012 forum of the Human Variome Project. Hum Mutat 2013;34:6616. [79] Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 2010;328:6369. [80] Najmabadi H, Hu H, Garshasbi M, Zemojtel T, Abedini SS, Chen W, et al. Deep sequencing reveals 50 novel genes for recessive cognitive disorders. Nature 2011;478:5763. [81] Saitsu H, Nishimura T, Muramatsu K, Kodera H, Kumada S, Sugai K, et al. 
De novo mutations in the autophagy gene WDR45 cause static encephalopathy of childhood with neurodegeneration in adulthood. Nat Genet 2013;45:4459, 449e1. [82] de Ligt J, Willemsen MH, van Bon BW, Kleefstra T, Yntema HG, Kroes T, et al. Diagnostic exome sequencing in persons with severe intellectual disability. N Engl J Med 2012;367:19219. [83] O’Roak BJ, Vives L, Girirajan S, Karakoc E, Krumm N, Coe BP, et al. Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 2012;485:24650. [84] Rauch A, Wieczorek D, Graf E, Wieland T, Endele S, Schwarzmayr T, et al. Range of genetic mutations associated with severe non-syndromic sporadic intellectual disability: an exome sequencing study. Lancet 2012;380:167482. [85] Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA. Available at: ,http://evs.gs.washington.edu/EVS/.. [86] Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Res 2009;19:155361. [87] Ashley-Koch A, Yang Q, Olney RS. Sickle hemoglobin (HbS) allele and sickle cell disease: a HuGE review. Am J Epidemiol 2000;151:83945. [88] Rohlfs EM, Zhou Z, Heim RA, Nagan N, Rosenblum LS, Flynn K, et al. Cystic fibrosis carrier testing in an ethnically diverse US population. Clin Chem 2011;57:8418. [89] Office of Rare Disease Research. Available at: ,http://rarediseases.info.nih.gov/RareDiseaseList.aspx.. [90] Solomon BD, Lacbawan F, Jain M, Domene´ S, Roessler E, Moore C, et al. A novel SIX3 mutation segregates with holoprosencephaly in a large family. Am J Med Genet A 2009;149A:91925. [91] Lacbawan F, Solomon BD, Roessler E, El-Jaick K, Domene´ S, Ve´lez JI, et al. Clinical spectrum of SIX3-associated mutations in holoprosencephaly: correlation between genotype, phenotype and function. J Med Genet 2009;46:38998. [92] Solomon BD, Bear KA, Wyllie A, Keaton AA, Dubourg C, David V, et al. Genotypic and phenotypic analysis of 396 individuals with mutations in Sonic Hedgehog. J Med Genet 2012;49:4739. [93] Human Gene Mutation Database, Professional version. Available at: ,https://portal.biobase-international.com/hgmd/pro/.. [94] Domene´ S, Roessler E, El-Jaick KB, Snir M, Brown JL, Ve´lez JI, et al. Mutations in the human SIX3 gene in holoprosencephaly are loss of function. Hum Mol Genet 2008;17:391928. [95] Dand N, Sprengel F, Ahlers V, Schlitt T. BioGranat-IG: a network analysis tool to suggest mechanisms of genetic heterogeneity from exome-sequencing data. Bioinformatics 2013;29:73341. [96] DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:4918.

[97] Fuentes Fajardo KV, Adams D, NISC Comparative Sequencing Program, Mason CE, Sincan M, Tifft C, et al. Detecting false-positive signals in exome sequencing. Hum Mutat 2012;33:60913. [98] Biesecker LG, Mullikin JC, Facio FM, Turner C, Cherukuri PF, Blakesley RW, , et al.NISC Comparative Sequencing Program The ClinSeq Project: piloting large-scale genome sequencing for research in genomic medicine. Genome Res 2009;19:166574. [99] Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009;4:107381. [100] Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. A method and server for predicting damaging missense mutations. Nat Methods 2010;7:2489. [101] Teer JK, Green ED, Mullikin JC, Biesecker LG. VarSifter: visualizing and analyzing exome-scale sequence variation data on a desktop computer. Bioinformatics 2012;28:599600. [102] Mullins JG. Structural modelling pipelines in next generation sequencing projects. Adv Protein Chem Struct Biol 2012;89:11767. [103] Varshney GK, Huang H, Zhang S, Lu J, Gildea DE, Yang Z, et al. The Zebrafish Insertion Collection (ZInC): a web based, searchable collection of zebrafish mutations generated by DNA insertion. Nucleic Acids Res 2013;41:D8614. [104] Varshney GK, Lu J, Gildea DE, Huang H, Pei W, Yang Z, et al. A large-scale zebrafish gene knockout resource for the genome-wide study of gene function. Genome Res 2013;23:72735. [105] Ng PC, Henikoff S. Predicting the effects of amino acid substitutions on protein function. Annu Rev Genomics Hum Genet 2006;7:6180. [106] Rope AF, Wang K, Evjenth R, Xing J, Johnston JJ, Swensen JJ, et al. Using VAAST to identify an X-linked disorder resulting in lethality in male infants due to N-terminal acetyltransferase deficiency. Am J Hum Genet 2011;89:2843. [107] Rousseau F, Bonaventure J, Legeai-Mallet L, Pelet A, Rozet JM, Maroteaux P, et al. Mutations in the gene encoding fibroblast growth factor receptor-3 in achondroplasia. Nature 1994;371:2524. [108] Bellus GA, Gaudenz K, Zackai EH, Clarke LA, Szabo J, Francomano CA, et al. Identical mutations in three different fibroblast growth factor receptor genes in autosomal dominant craniosynostosis syndromes. Nat Genet 1996;14:1746. [109] Ogino S, Wilson RB, Gold B, Hawley P, Grody WW. Bayesian analysis for cystic fibrosis risks in prenatal and carrier screening. Genet Med 2004;6:43949. [110] Davis EE, Katsanis N. The ciliopathies: a transitional model into systems biology of human genetic disease. Curr Opin Genet Dev 2012;22:290303. [111] Dixon-Salazar TJ, Silhavy JL, Udpa N, Schroth J, Bielas S, Schaffer AE, et al. Exome sequencing can improve diagnosis and alter patient management. Sci Transl Med 2012;4:138ra78. [112] Solomon BD, Pineda-Alvarez DE, Bear KA, Mullikin JC, Evans JP, NISC Comparative Sequencing Program. Applying genomic analysis to newborn screening. Mol Syndromol 2012;3:5967. [113] Gahl WA, Markello TC, Toro C, Fajardo KF, Sincan M, Gill F, et al. The National institutes of health undiagnosed diseases program: insights into rare diseases. Genet Med 2012;14:519. [114] Stark Z, Savarirayan R. Osteopetrosis. Orphanet J Rare Dis 2009;4:5. [115] Montiel-Equihua CA, Thrasher AJ, Gaspar HB. Gene therapy for severe combined immunodeficiency due to adenosine deaminase deficiency. Curr Gene Ther 2012;12:5765. [116] Galy A, Thrasher AJ. Gene therapy for the WiskottAldrich syndrome. Curr Opin Allergy Clin Immunol 2011;11:54550. 
[117] Be´langer-Quintana A, Burlina A, Harding CO, Muntau AC. Up to date knowledge on different treatment strategies for phenylketonuria. Mol Genet Metab 2011;104(Suppl.):S1925. [118] Pepin M, Schwarze U, Superti-Furga A, Byers PH. Clinical and genetic features of EhlersDanlos syndrome type IV, the vascular type. N Engl J Med 2000;342:67380. [119] Leistritz DF, Pepin MG, Schwarze U, Byers PH. COL3A1 haploinsufficiency results in a variety of EhlersDanlos syndrome type IV with delayed onset of complications and longer life expectancy. Genet Med 2011;13:71722. [120] Saunders CJ, Miller NA, Soden SE, Dinwiddie DL, Noll A, Alnadi NA, et al. Rapid whole-genome sequencing for genetic disease diagnosis in neonatal intensive care units. Sci Transl Med 2012;4:154ra135. [121] Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, Nazareth L, et al. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med 2010;362:118191. [122] Tong P, Prendergast JG, Lohan AJ, Farrington SM, Cronin S, Friel N, et al. Sequencing and analysis of an Irish human genome. Genome Biol 2010;11:R91. [123] Solomon BD, Hadley DW, Pineda-Alvarez DE, NISC Comparative Sequencing Program, Kamat A, Teer JK, et al. Incidental medical information in whole-exome sequencing. Pediatrics 2012;129:e160511. [124] Johnston JJ, Rubinstein WS, Facio FM, Ng D, Singh LN, Teer JK, et al. Secondary variants in individuals undergoing exome sequencing: screening of 572 individuals identifies high-penetrance mutations in cancer-susceptibility genes. Am J Hum Genet 2012;91:97108. [125] Solomon BD. VACTERL/VATER Association. Orphanet J Rare Dis 2011;6:56. [126] Solomon BD, Pineda-Alvarez DE, Hadley DW, NISC Comparative Sequencing Program, Hansen NF, Kamat A, et al. Exome sequencing and high density microarray testing in monozygotic twin pairs discordant for features of VACTERL association. Mol Syndroml 2013;4:2731. [127] Solomon BD, Pineda-Alvarez DE, Hadley DW, NISC Comparative Sequencing Program, Teer JK, Cherukuri PF, et al. Personalized genomic medicine: lessons from the exome. Mol Genet Metab 2011;104:18991. [128] Roberts NJ, Vogelstein JT, Parmigiani G, Kinzler KW, Vogelstein B, Velculescu VE. The predictive capacity of personal genome sequencing. Sci Transl Med 2012;4:133ra58. [129] Ashley EA, Butte AJ, Wheeler MT, Chen R, Klein TE, Dewey FE, et al. Clinical assessment incorporating a personal genome. Lancet 2010;375:152535. [130] Ormond KE, Wheeler MT, Hudgins L, Klein TE, Butte AJ, Altman RB, et al. Challenges in the clinical application of whole-genome sequencing. Lancet 2010;375:174951. [131] Berg JS, Khoury MJ, Evans JP. Deploying whole genome sequencing in clinical practice and public health: meeting the challenge one bin at a time. Genet Med 2011;13:499504.

[132] Berg JS, Adams M, Nassar N, Bizon C, Lee K, Schmitt CP, et al. An informatics approach to analyzing the incidentalome. Genet Med 2013;15:3644. [133] Newborn screening: toward a uniform screening panel and system. Genet Med 2006;8(Suppl. 1):1S252S. [134] The Clinical Genomic Database. Available at: ,http://research.nhgri.nih.gov/CGD/.. [135] Green RC, Berg JS, Berry GT, Biesecker LG, Dimmock DP, Evans JP, et al. Exploring concordance and discordance for return of incidental findings from clinical sequencing. Genet Med 2012;14:40510. [136] Beskow LM, Burke W. Offering individual genetic research results: context matters. Sci Transl Med 2010;2:38cm20. [137] Johnston JJ, Olivos-Glander I, Killoran C, Elson E, Turner JT, Peters KF, et al. Molecular and clinical analyses of Greig cephalopolysyndactyly and PallisterHall syndromes: robust phenotype prediction from the type and position of GLI3 mutations. Am J Hum Genet 2005;76:60922. [138] Johnston JJ, Sapp JC, Turner JT, Amor D, Aftimos S, Aleck KA, et al. Molecular analysis expands the spectrum of phenotypes associated with GLI3 mutations. Hum Mutat 2010;31:114254. [139] Kohane IS, Masys DR, Altman RB. The incidentalome: a threat to genomic medicine. JAMA 2006;296:2125. [140] McGuire AL, Caulfield T, Cho MK. Research ethics and the challenge of whole-genome sequencing. Nat Rev Genet 2008;9:1526. [141] McBride CM, Alford SH, Reid RJ, Larson EB, Baxevanis AD, Brody LC. Putting science over supposition in the arena of personalized genomics. Nat Genet 2008;40:93942. [142] McGuire AL, Lupski JR. Personal genome research: what should the participant be told? Trends Genet 2010;26:199201. [143] Evans JP, Berg JS. Next-generation DNA sequencing, regulation, and the limits of paternalism: the next challenge. JAMA 2011;306:23767. [144] Kohane IS. No small matter: qualitatively distinct challenges of pediatric genomic studies. Genome Med 2011;3:62. [145] Lantos JD, Artman M, Kingsmore SF. Ethical considerations associated with clinical use of next-generation sequencing in children. J Peds 2011;159: 87980.e1. [146] Hens K. Whole genome sequencing of children’s DNA for research: points to consider. J Clin Res Bioeth 2011;2:7. [147] Tabor HK, Berkman BE, Hull SC, Bamshad MJ. Genomics really gets personal: how exome and whole genome sequencing challenge the ethical framework of human genetics research. Am J Med Genet A 2011;155A:291624. [148] Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science 2013;339:3214. [149] Sparrow DB, Chapman G, Smith AJ, Mattar MZ, Major JA, O’Reilly VC, et al. A mechanism for gene-environment interaction in the etiology of congenital scoliosis. Cell 2012;149:295306. [150] Bae GU, Domene´ S, Roessler E, Schachter K, Kang JS, Muenke M, et al. Mutations in CDON, encoding a hedgehog receptor, result in holoprosencephaly and defective interactions with other hedgehog receptors. Am J Hum Genet 2011;89:23140. [151] Hong M, Krauss RS. Cdon mutation and fetal ethanol exposure synergize to produce midline signaling defects and holoprosencephaly spectrum disorders in mice. PLoS Genet 2012;8:e1002999. [152] Chen R, Mias GI, Li-Pook-Than J, Jiang L, Lam HY, Chen R, et al. Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012;148:1293307. [153] H3 Africa. Human heredity and health in Africa. Available at: ,http://h3africa.org/..

Glossary

Allele  A single version of a gene or locus.
Allelic heterogeneity  A description of the scenario in which different mutations in the same gene or locus can result in the same effect.
Bi-allelic (biallelic)  Involving mutations in both alleles of a gene.
Constitutional disorder  An innate or inborn disorder, which may present congenitally or later in life, and which involves the entire affected individual.
Copy number variant  A structural genetic difference involving the amount of a portion of the genome.
De novo  A newly occurring (not inherited) genetic change.
Exome  The region of the genome made up of exons, or all known coding regions of the genes.
Functional analysis  Study of the functional (potentially pathogenic) impact of a genetic variant.
Genome  The entire genetic make-up of a person (or organism).
Genotype  An individual's genetic information, either at a single locus or involving multiple loci.
Incidental medical information  Information not related to the direct clinical or research question.
Karyotype  An organized, viewable profile of an individual's chromosomal make-up.
Linkage analysis  A technique used to find the location of a mutation by studying genetic variants that segregate with a specific condition.
Mendelian disorder  A genetic condition caused by mutations in a single gene (or small number of genes), usually involving a recognizable inheritance pattern.
Microarray  A genomic analysis technique based on large numbers of probes arranged on a membrane or slide in order to (depending on the type of microarray) investigate gene expression, assess CNVs, or identify and analyze known SNPs.
Mutation  A pathogenic (disease-causing) genetic change.
Next-generation sequencing  The process, through massively parallel sequencing, of performing many sequencing reactions simultaneously, allowing rapid investigation of large amounts of genetic material.
Phenotype  A medical condition or state, often referring to the effects of a particular genotype.
Point mutation  A mutation resulting from a single base variant (or, more loosely but more commonly interpreted, involving a small number of bases).
Sanger sequencing  A traditional method of genetic sequencing, which involves the incorporation of chain-terminating dideoxynucleotides as part of the sequencing reaction.
Variant  A genetic change.


C H A P T E R

18 Somatic Diseases (Cancer): Amplification-Based Next-Generation Sequencing

Fengqi Chang¹, Geoffrey L. Liu², Cindy J. Liu³ and Marilyn M. Li¹,⁴

¹Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA; ²Department of Human Genetics, University of Chicago, Chicago, IL, USA; ³Research and Computing Services, Harvard Business School, Cambridge, MA, USA; ⁴Dan Duncan Cancer Center, Baylor College of Medicine, Houston, TX, USA

O U T L I N E

Introduction
NGS Technologies
    Pyrosequencing-Based NGS: Roche 454 Genome Sequencer
    Reversible Dye-Terminator-Based NGS: Illumina HiSeq and MiSeq Systems
    Ion Semiconductor-Based NGS: Life Technologies PGM and Proton Systems
    Sequencing by Ligation-Based NGS: Life Technologies ABI SOLiD Sequencer
Amplification-Based NGS Technologies
    DNA Sequencing
        Targeted DNA Analysis Using Multiplex Amplification
        Targeted DNA Analysis Using Single-Plex Amplification
        Targeted DNA Analysis Using Targeted Capture Followed by Multiplex Amplification
        General Considerations
    RNA Sequencing
        Targeted RNA Analysis by Multiplex Amplification
        Targeted RNA Analysis by Single-Plex Amplification
        Targeted RNA Analysis Using Targeted Capture Followed by Multiplex Amplification
Methylation Analysis
Advantages and Disadvantages of Amplification-Based NGS
Clinical Application of Amplification-Based NGS in Cancer
    Sample Requirements
    DNA/RNA Extraction and Quality Control
    Cancer-Specific Targeted Panels
        AmpliSeq™ Cancer Hotspot Panel v2
        Ion AmpliSeq™ Comprehensive Cancer Panel
        AmpliSeq Custom Cancer Panels
        Ion AmpliSeq™ RNA Cancer Panels
        RainDance ONCOSeq™ Panel
        Illumina TruSeq Amplicon Cancer Panel
Data Analysis
Interpretation and Reporting
Challenges and Perspectives
References

INTRODUCTION

The development of DNA sequencing techniques has greatly facilitated our understanding of genetics and human biology over the last 40 years. From Frederick Sanger [1] and Walter Gilbert [2], the pioneers who developed the first widely used DNA sequencing methods in the late 1970s, to the completion of the Human Genome Project (HGP) in 2003 using Sanger and shotgun sequencing methods [3–5], progress in sequencing and computing technologies has been immense and continues to evolve. The HGP compiled the initial draft of the DNA sequence of the human genome in 2001, and the final version was completed in 2003 through the collaboration of 23 laboratories from many countries, at a cost of approximately US$3 billion over 13 years. This human genome sequence data has provided the molecular basis for understanding many human diseases at the genetic level [4] and has made possible the identification of causal genes for human disorders and the detection of germline and somatic mutations responsible for human diseases [6].

Typically, Sanger sequencing generates long read lengths with excellent raw base accuracy, but it is low-throughput as well as costly when multiple genes are being analyzed. It is therefore most appropriate for the analysis of small target regions. In many clinical situations, however, mutations in several different genes may be responsible for the same phenotype. For example, in Noonan syndrome, causal mutations have been reported in at least 11 different genes; in hereditary nonpolyposis colorectal cancer (HNPCC), a mutation in any one of the mismatch repair genes can be causal; and in cancer, multiple and often numerous mutations can be found in a single tumor. Diagnosis in these situations thus requires fast, sensitive, high-throughput, and cost-effective methods for gene/mutation analysis. This great demand for improvements on Sanger sequencing has led to rapid technological innovation over the past 10 years. The emergence of next-generation sequencing (NGS), also known as massively parallel or multiplex cyclic sequencing, has completely changed the way genomic research and genomic medicine are done by greatly decreasing the cost of sequencing while simultaneously increasing throughput.

NGS TECHNOLOGIES

NGS technologies allow sequencing of millions of DNA templates simultaneously and can generate millions of sequence reads at a tiny fraction of the cost of conventional Sanger sequencing [7,8]. The NGS pipeline consists of library preparation and enrichment, sequencing, sequence alignment, and variant calling. As discussed in more detail in Chapter 1, several NGS platforms are currently available, including the Roche 454; Illumina HiSeq and MiSeq; Life Technologies SOLiD, Personal Genome Machine (PGM), and Proton; Pacific Biosciences PacBio RS II; and Oxford Nanopore GridION. Extensive reviews concerning the technology, amplification method, chemistry, read length, throughput, and run time of the different platforms have been published [8–12]. This chapter will focus on the latest commercially available NGS platforms that are suitable for clinical applications using amplification-based approaches, namely the Roche 454™ GS FLX Titanium and GS Junior systems; the Life Technologies SOLiD™, PGM, and Proton systems; and the Illumina HiSeq™ and MiSeq™ systems.

Pyrosequencing-Based NGS: Roche 454 Genome Sequencer

The 454 Genome Sequencer (454 Life Sciences, Roche Applied Science), released in 2004, was the first commercially available NGS instrument on the market [13]. All 454 systems work on the principle of the pyrosequencing reaction, using the pyrophosphate molecule released during nucleotide incorporation by DNA polymerase and ultimately producing light through the luciferase-catalyzed conversion of luciferin to oxyluciferin [14]. First, DNA samples are cut into blunt-end fragments, and oligonucleotide adapters are attached to both ends of the fragments. The adapter-ligated DNA fragments are amplified by emulsion PCR (emPCR) [15] on the surface of agarose beads, each coated with millions of oligomers complementary to the adapter sequences ligated to the ends of the DNA fragments during library preparation (Figure 18.1). Each bead surface then carries up to 1 million copies of the original annealed DNA fragment, enough to produce detectable signals from the sequencing reactions. Several hundred thousand beads are loaded into the sequencing reaction wells of a "picotiter well" plate made from a fused fiber-optic bundle [14]. Subsequently, much smaller magnetic and latex beads carrying the active enzymes (polymerase, sulfurylase, and luciferase) and other reagents needed for pyrosequencing are added to the picotiter well plate.


FIGURE 18.1 Roche/454 pyrosequencing technology. APS, adenosine 5′-phosphosulfate; PPi, pyrophosphate.

The addition of one or more nucleotides to the growing DNA chain generates a light signal that is recorded by the charge-coupled device (CCD) camera in the instrument. The signal strength is directly proportional to the number of molecules of that particular nucleotide incorporated. The current 454 instrument, the GS FLX+ system, produces an average read length of about 700 bp, with the longest reads over 1 kb in length, approaching those of Sanger sequencing. However, the total sequence output of this platform is far less than those of other instruments, which generate many more sequence reads, albeit of much shorter lengths.
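To illustrate why the signal magnitude is informative, and why homopolymer runs are this chemistry's characteristic weak point, the following is a minimal sketch of an idealized, noise-free flowgram. The flow order and example template are arbitrary illustrative choices, and the same flow-space logic also underlies the ion semiconductor platforms described below:

```python
# Idealized pyrosequencing flowgram: nucleotides are flowed one at a time in a
# fixed order, and each flow yields a light signal proportional to the number
# of identical bases incorporated (zero if the flowed base does not match).
# Noise-free signals, this flow order, and the template are illustrative.

FLOW_ORDER = "TACG"

def flowgram(template, n_flows=12):
    signals, pos = [], 0
    for i in range(n_flows):
        base = FLOW_ORDER[i % len(FLOW_ORDER)]
        run = 0
        # A homopolymer run is extended completely within a single flow.
        while pos < len(template) and template[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals

for base, intensity in flowgram("TTTAGCC"):
    print(base, "#" * intensity, intensity)
# The TTT homopolymer appears as a single flow at triple intensity; estimating
# long homopolymer lengths from signal magnitude is the dominant error mode.
```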

Reversible Dye-Terminator-Based NGS: Illumina HiSeq and MiSeq Systems

While Illumina platforms are often employed with hybrid capture-based NGS approaches, they can also be used to perform sequence analysis through amplification-based approaches (though briefly reviewed here, more detail can be found in Chapter 1). Illumina platforms use reversible dye-terminator sequencing-by-synthesis (SBS) chemistry, involving iterative cycles of single-base incorporation, imaging, and cleavage of the terminator chemistry. The technology was first commercially launched by Solexa in 2006 as the Genome Analyzer [10], before Illumina purchased Solexa in 2007. Illumina library preparation includes fragmentation of high-molecular-weight DNA, addition of specific adapters/indices by PCR or DNA ligase, hybridization, and enrichment. The Illumina microfluidic conduit is a flow cell decorated by covalent attachment of adapter sequences (anchors) complementary to the library adapters. A precisely diluted library is amplified in situ on the flow cell surface using a "bridge" amplification technique to produce colonies of sequences (clusters) for sequencing. The "bridge" amplification technique relies on captured DNA strands "arching" over and hybridizing to an adjacent anchor oligonucleotide; multiple amplification cycles convert the single-molecule DNA template to a clonally amplified arching cluster (Figure 18.2). Sequencing primers complementary to the adapter sequences are added to initiate the sequencing reactions, followed by the addition of polymerase and a mixture of the four nucleotides, each labeled with a different fluorescent dye and a reversible terminator. Labeled nucleotides are incorporated in each strand of a clonal cluster. A camera takes images of the fluorescently labeled nucleotides, after which the dye and the terminal 3′ blocker are chemically removed from the DNA, allowing the next cycle to begin. For reads on the reverse strand (paired end), the instrument (such as the MiSeq) removes the synthesized strands by denaturation and regenerates the clusters by "bridge" amplification. The reverse sequencing primer is then added and sequencing is initiated again as described above.


FIGURE 18.2 Illumina SBS technology.

The Illumina HiSeq 2000 was released in early 2010 with an output of 600 Gb per run, with runs finishing in anywhere from 2 to 11 days. In early 2012, Illumina introduced the HiSeq 2500, which is also capable of generating up to 600 Gb of data per run. Sequencing on the HiSeq 2500 can be finished in 7 h with a 1 × 36 bp read length and in 40 h with 2 × 150 bp reads using the RAPID Run Mode. The MiSeq, a benchtop sequencer, was launched in 2011. It is a compact all-in-one platform that incorporates cluster generation, paired-end fluidics, SBS chemistry, and complete data analysis in one system. The MiSeq is the only benchtop sequencer that can produce 2 × 300 bp paired-end reads, generating up to 15 Gb of data from 25 million reads in a single run (www.illumina.com).
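The quoted instrument outputs follow directly from read count and read length; a quick consistency check (the HiSeq cluster count below is an assumed round figure, not a vendor specification):

```python
# Total yield is simply (number of clusters) x (bases sequenced per cluster).

def yield_gb(clusters_millions, read_length_bp, paired=True):
    bases = clusters_millions * 1e6 * read_length_bp * (2 if paired else 1)
    return bases / 1e9

# MiSeq: 25 million clusters at 2 x 300 bp
print(f"MiSeq: {yield_gb(25, 300):.0f} Gb")         # -> 15 Gb

# HiSeq 2500 at 2 x 150 bp, assuming ~2 billion clusters per run
print(f"HiSeq 2500: {yield_gb(2000, 150):.0f} Gb")  # -> 600 Gb
```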

Ion Semiconductor-Based NGS: Life Technologies PGM and Proton Systems

In 2010, Ion Torrent Systems Inc. (now owned by Life Technologies) commercially released its benchtop sequencer, the PGM. Ion Torrent uses ion semiconductor sequencing technology: the Ion Torrent chip is an ultrasensitive pH meter that detects the hydrogen ions released when nucleotides are incorporated during DNA synthesis. Each Ion Chip contains millions of ion-sensitive field-effect transistor (ISFET) sensors that allow parallel detection of multiple sequencing reactions (Figure 18.3). Ion Torrent library construction includes DNA fragmentation, partial digestion of primer sequences, ligation of adapters, and library purification.


FIGURE 18.3 The architecture of the Ion Torrent chips used for the detection of pH change after nucleotide incorporation by DNA polymerase (http://www3.appliedbiosystems.com/cms/groups/applied_markets_marketing/documents/generaldocuments/cms_094273.pdf). Used with permission from Life Technologies.

Sequence template preparation can be performed manually or by use of a OneTouch™ machine (Life Technologies). Template preparation is carried out using an emPCR and enrichment system on Ion Sphere™ Particles (ISPs). The ISPs have covalently linked complementary adapter sequences on their surfaces to facilitate amplification on the particles. Enriched particles are primed for sequencing by annealing a sequencing primer and are then loaded into the wells of an Ion Chip. The Ion Chip has an upper surface that serves as a microfluidic conduit to deliver the reagents needed for the sequencing reactions, and a lower surface that interfaces directly with a hydrogen ion detector, which translates the hydrogen ions released in each well into a quantitative readout of nucleotide bases. The average read length obtained on the PGM has increased from 100 to 200 bp over the last several years. Mass production of the Ion Chip using standard semiconductor techniques, together with reaction volume miniaturization, makes this technology relatively inexpensive and fast. It is therefore ideal for smaller laboratories that wish to use NGS in their work but do not require extremely large data sets [16]. In January 2012, Life Technologies officially launched its second semiconductor platform, the Ion Torrent Proton™ sequencer, which uses a novel complementary metal-oxide semiconductor (CMOS) chip with 165 million 1.3 μm-diameter microwells, automatically templated submicron particles, and integrated hardware and software that enable acquisition of about 5 billion data points per second over a 2–4 h run time with on-instrument signal processing [17]. The system can generate read lengths of up to 200 bp and produce up to 15 Gb of sequence data with 60–80 million reads.
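The quoted acquisition rate is consistent with reading out every well of the chip at video-like frame rates; the frame rate below is an assumption used only to reconstruct the figure:

```python
# Reconstructing the ~5 billion data points per second figure: every ISFET
# well is sampled at each readout frame. Only the well count comes from the
# text; the ~30 Hz frame rate is an illustrative assumption.

wells = 165e6        # microwells on the Proton chip
frame_rate_hz = 30   # assumed readout frames per second

print(f"{wells * frame_rate_hz:.2e} samples per second")  # ~4.95e9, i.e., ~5 billion
```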

Sequencing by Ligation-Based NGS: Life Technologies ABI SOLiD Sequencer

Sequencing by Oligonucleotide Ligation and Detection (SOLiD) technology was developed in the laboratory of George Church, Professor of Genetics at Harvard Medical School, and published in 2005 [18]. The sequencer was commercially released at the end of 2007 by Applied Biosystems. The technology utilizes a unique sequencing process catalyzed by DNA ligase (sequencing-by-ligation chemistry). Its library preparation process is similar to that of other technologies: DNA fragments are ligated to specific adapters, attached to beads, and clonally amplified by emPCR. Templates on the selected beads undergo a 3′ modification to allow covalent attachment to a SOLiD flow cell surface. Sequencing is achieved using a sequencing primer complementary to the P1 adapter and four sets of 8-base probes that contain the ligation site (first base), the cleavage site (fifth base), and four different fluorescent dyes (linked to the last base).


FIGURE 18.4 Applied Biosystems SOLiD sequencing by ligation technology. Used with permission from Life Technologies.

The probe is an octamer which contains (in the 3′ to 5′ direction) two probe-specific bases followed by six degenerate bases, with one of four fluorophores linked to the 5′ end of the probe. The sequencing primers hybridize to the P1 adapter on the templated beads, and the probes compete for ligation to the sequencing primer. Specificity of the probe is achieved by interrogating every first and second base in each ligation reaction. Following a series of ligation, detection, and cleavage cycles, the extension product is removed and the template is reset with a primer complementary to the n−1 position for a second round of ligation cycles (Figure 18.4) (http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing/next-generation-systems/solid-sequencing-chemistry.html). The 5500xl SOLiD™ sequencer was released in late 2010. The system generates up to 180 Gb per run with 2 × 60 bp reads, and up to 30 Gb/day at 2.8 billion paired-end reads (1.4 billion beads) per run. Additional improvements are expected with the transition from microbeads to nanobeads. The new version has several upgrades compared with the original SOLiD system, including improved read length (85 bp) and data output (30 Gb/day). A summary of key features of a few NGS platforms is shown in Table 18.1. See Chapter 1 for more details.
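Because each color encodes a two-base transition rather than a single base, SOLiD reads are reported in "color space" and must be decoded against a known first base (the last base of the sequencing primer). A minimal decoding sketch follows; the four-color dibase map shown is the standard SOLiD encoding, and error handling is omitted.

```python
# SOLiD two-base (color-space) decoding: each color encodes the transition
# between consecutive bases, so decoding requires the known primer base.
TRANSITION = {
    0: {"A": "A", "C": "C", "G": "G", "T": "T"},  # color 0: same base
    1: {"A": "C", "C": "A", "G": "T", "T": "G"},
    2: {"A": "G", "G": "A", "C": "T", "T": "C"},
    3: {"A": "T", "T": "A", "C": "G", "G": "C"},
}

def decode_color_space(first_base, colors):
    """Decode a color-space read into base space."""
    seq = [first_base]
    for color in colors:
        seq.append(TRANSITION[color][seq[-1]])
    return "".join(seq[1:])  # by convention the primer base is not reported

# Example: starting from primer base 'T', colors 2,0,1,3 decode to "CCAT"
print(decode_color_space("T", [2, 0, 1, 3]))  # CCAT
```

One consequence of this encoding is that a single color error corrupts every downstream base when decoded naively, which is why SOLiD alignment is normally performed in color space.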

AMPLIFICATION-BASED NGS TECHNOLOGIES

Amplicon-based library preparation offers a powerful option for targeted sequencing of regions of interest (ROI). Primers can be designed to avoid or minimize the amplification of pseudogenes or genomic regions with high sequence homology to the ROI. The technologies can be applied to DNA sequencing, RNA sequencing (RNA-seq), and targeted methylation analysis. At least three amplicon-based library preparation approaches are currently being used for preparation and enrichment of target sequences of interest: multiplex PCR, single-plex PCR, and targeted capture followed by multiplex PCR.


TABLE 18.1 Comparisons of the Latest Commercially Available NGS Platforms

Platform | Amplification method | Chemistry | Maximum read length (bp) | Maximum throughput | Run time

ROCHE 454
GS FLX Titanium | Clonal emPCR | Pyrosequencing | 600–1000 | 450–700 Mb | 10–23 h
GS Junior | Clonal emPCR | Pyrosequencing | 400 | 35 Mb | 10 h

ILLUMINA
HiSeq 2000 | Clonal bridge PCR | Reversible dye terminator | 2 × 100 | 600 Gb | 11 days
HiSeq 2500 | Clonal bridge PCR | Reversible dye terminator | 2 × 100 to 2 × 150 | 600 Gb | 27 h–11 days
MiSeq | Clonal bridge PCR | Reversible dye terminator | 2 × 300 | 15 Gb | 4–48 h

LIFE TECHNOLOGY
ABI/5500 SOLiD | Clonal emPCR | Sequencing by ligation | 85 | 90–300 Gb | 1–7 days
Ion Torrent PGM | Clonal emPCR | SBS, H+ detection | 35–400 | 40 Mb–2 Gb | 2–4 h
Ion Torrent Proton | Clonal emPCR | SBS, H+ detection | 200 | 10 Gb | 2–4 h


DNA Sequencing

Targeted DNA Analysis Using Multiplex Amplification

Multiplex PCR combines multiple primer sets in a single PCR mixture to produce amplicons specific to different DNA sequences. Multiplex PCR-based NGS technology is well represented by the AmpliSeq technology from Life Technologies. Ion AmpliSeq™ (Life Technologies) uses a proprietary ultrahigh multiplex PCR technology to generate thousands of amplicons for massively parallel sequencing. Over 25,000 primer pairs that selectively amplify the ROI can be pooled together in a single tube for PCR. The number of PCR cycles for the first round of PCR is determined by the primer pool size and sample type. As a rule of thumb, the higher the number of primer pairs, the lower the number of PCR cycles. In general, the lower range limit is used for DNA from blood and fresh tissue and the upper range limit for FFPE samples. After the first round of PCR, sequence-specific primers are removed and the PCR products are phosphorylated. Ion-compatible adapters are then ligated to the amplicons in preparation for the second round of PCR amplification. After five to seven cycles of PCR, depending on sample type, amplicons are purified and quantified for template preparation. Sequence template preparation can be done manually or through use of a OneTouch™ machine (Life Technologies), which employs emPCR for template preparation. It is critical to determine the optimal concentration of template for emPCR, since excessively high or low concentrations often result in a high percentage of polyclonal or empty wells, respectively, leading to reduced usable reads (see the sketch following this paragraph). The total workflow from DNA to templates ready for sequencing can take as little as 6.5 h (http://www.lifetechnologies.com/us/en/home/life-science/sequencing/next-generation-sequencing/ion-torrent-next-generation-sequencing-workflow/ion-torrent-next-generation-sequencing-select-targets/ampliseq-target-selection.html). The AmpliSeq technology requires only 10 ng of input DNA and works well with different types of tumor samples, including archived FFPE samples. Templates generated with AmpliSeq are ready for sequencing on the PGM without further enrichment. The turnaround time (TAT) from sample receipt to reporting for profiling 50–100 cancer genes using AmpliSeq technology can be as short as 3–5 days. The technology has also recently scaled up from amplifying targeted gene panels to amplifying the whole exome: the Ion AmpliSeq Exome kit includes 294,000 primer pairs that amplify the whole exome in 12 primer pools using as little as 50 ng of DNA (Ion AmpliSeq™ Exome Solution flyer, http://tools.invitrogen.com/content/sfs/brochures/Ion-AmpliSeq-Exome-Kit-Product-Flyer.pdf).
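The sensitivity of usable-read yield to template concentration follows directly from Poisson statistics: if templates are distributed randomly onto particles, the fractions of empty, monoclonal (usable), and polyclonal particles depend only on the mean number of templates per particle. The sketch below is an idealized model of this reasoning, not the vendor's titration protocol.

```python
import math

def isp_fractions(mean_templates_per_bead):
    """Idealized Poisson model of emPCR template loading.

    Returns (empty, monoclonal, polyclonal) fractions of particles for a
    given mean number of templates per particle (lambda).
    """
    lam = mean_templates_per_bead
    p_empty = math.exp(-lam)           # P(k = 0): no template, empty well
    p_mono = lam * math.exp(-lam)      # P(k = 1): single template, usable
    p_poly = 1.0 - p_empty - p_mono    # P(k >= 2): polyclonal, unusable
    return p_empty, p_mono, p_poly

# Monoclonal yield peaks at lambda = 1 (with ~37% empty particles);
# overloading mostly trades empty particles for unusable polyclonal ones.
for lam in (0.5, 1.0, 2.0):
    empty, mono, poly = isp_fractions(lam)
    print(f"lambda={lam}: empty={empty:.2f} mono={mono:.2f} poly={poly:.2f}")
```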


Targeted DNA Analysis Using Single-Plex Amplification

Single-plex PCR-based target enrichment was pioneered by Microdroplet PCR from RainDance Technologies Inc. (Billerica, MA). This technology utilizes picoliter-sized droplets as individual reaction vessels to perform over 1 million unique PCR reactions per sample in less than 1 day. This discrete encapsulation of microdroplet PCR reactions prevents unwanted primer–primer interactions, allows highly efficient simultaneous amplification of up to 4000 target sequences per sample in a single tube, and greatly reduces the amount of reagents required [19]. The primer pairs that cover the ROI are individually encapsulated into primer-pair droplets using a microfluidic chip; the primer-pair droplets are then mixed together to ensure equal representation of each library element. Genomic DNA is fragmented, biotinylated, and purified before being mixed with the other PCR reaction components; the genomic DNA template mixture is then made into droplets on the microfluidic chip. In the merging area within the microfluidic chip, one template droplet is paired and merged with a primer-pair droplet, and the merged droplets are collected for emPCR. The droplet PCR reactions are then destabilized, and the PCR products are released and purified for sequencing [19]. A critical part of this technology is the quality of the primer library; the mixed primer library should be tested for droplet size and uniformity as well as equal representation of library elements. Microdroplet technology is well suited to parallel amplification of sequencing targets in a large number of samples. However, it does require additional specialized equipment.

The Access Array System (Fluidigm, San Francisco, CA) is another amplicon-based library preparation and enrichment approach. The technology uses single-plex and low-level multiplex PCR. Centered on the 48.48 Access Array Integrated Fluidic Circuit (IFC), it facilitates parallel amplification of 48 unique samples with up to 48 different primer pairs in a microfluidic device, generating 2304 parallel PCR reactions in a single amplification run with 50 ng of input genomic DNA per sample. Primers covering the ROI are tagged with sample-specific barcodes and universal adapters before PCR amplification, and the PCR products from each sample are harvested and pooled for sequencing. Using the Access Array System, the overall time for genotyping 192 samples can be within 4–8 days [20]. This technique is particularly suitable for sequencing a large number of samples for a few genes but, again, requires specific equipment.

Targeted DNA Analysis Using Targeted Capture Followed by Multiplex Amplification

This group of hybrid techniques utilizes probes to capture the ROI, with subsequent ROI enrichment by PCR, and includes Illumina's TruSeq Amplicon and Agilent's HaloPlex technologies (Agilent Technologies, Inc., Santa Clara, CA). The TruSeq Amplicon technology uses a pair of oligonucleotides (upstream and downstream) specifically designed to hybridize to each target region to capture the ROI. The input DNA amount should be in the range of 150–250 ng, depending on the sample type; for FFPE samples, 250 ng of DNA is recommended. After hybridization, unbound oligonucleotides are removed and the hybridized templates are cleaned in preparation for extension–ligation of the bound oligonucleotides. The extension–ligation process connects the upstream and downstream oligos and generates products containing the ROI flanked by the sequences required for amplification. The extension–ligation products are amplified for 27 cycles, and the PCR products are then purified. The library quality is assessed at this point using either gel electrophoresis or a Bioanalyzer (Agilent Technologies). Each library is normalized before being pooled for sequencing to ensure equal library representation in the pooled sample. The total assay time from DNA to templates ready for loading onto either the HiSeq or MiSeq instruments is approximately 7 h (http://supportres.illumina.com/documents/myillumina/02fe2a31-7867-495f-9783-de30d3ccc919/truseq_amplicon_cancer_panel_guide_15031875_a.pdf).

The HaloPlex targeted-enrichment system also uses probes specifically designed to capture the ROI, but DNA samples are first fragmented using restriction enzymes and denatured before the probe library is added and hybridized. Each HaloPlex probe is an oligonucleotide designed to hybridize to both ends of a targeted DNA restriction fragment, thereby guiding the targeted fragments to form circular DNA molecules. The probe also contains a method-specific sequencing motif that is incorporated during circularization; in addition, a sample barcode sequence is incorporated in this step. The HaloPlex probes are biotinylated, so the targeted fragments can be retrieved with magnetic streptavidin beads. The circular molecules are closed by ligation, a very precise reaction that ensures that only perfectly hybridized fragments are circularized. The circular DNA targets are then amplified, providing an enriched and barcoded amplification product ready for sequencing. The assay requires 200–250 ng of input DNA per sample. The total assay time from DNA to templates ready for sequencing is approximately 6 h (http://www.chem.agilent.com/library/usermanuals/Public/G9900-90021.pdf).
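Equal library representation in a pool is usually achieved by converting each library's mass concentration into molarity from its average fragment length and diluting all libraries to a common molarity. A back-of-the-envelope sketch, using the standard ~660 g/mol per base pair approximation for double-stranded DNA and hypothetical input values, follows.

```python
# Convert a dsDNA library concentration (ng/uL) to molarity (nM) using the
# ~660 g/mol per base pair average mass of double-stranded DNA.
def library_nM(conc_ng_per_ul, mean_fragment_bp):
    grams_per_mole = 660.0 * mean_fragment_bp
    # ng/uL is the same as mg/L; dividing by g/mol and scaling gives nmol/L
    return conc_ng_per_ul / grams_per_mole * 1e6

def dilution_volumes(conc_nM, target_nM, final_ul):
    """Volumes of library and diluent needed to reach target_nM in final_ul."""
    v_lib = target_nM / conc_nM * final_ul
    return v_lib, final_ul - v_lib

# Example: a 12 ng/uL amplicon library with ~180 bp fragments (made-up values)
c = library_nM(12.0, 180)                # ~101 nM
print(f"{c:.1f} nM")
print(dilution_volumes(c, 4.0, 20.0))    # dilute to 4 nM in a 20 uL volume
```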


General Considerations

When choosing library preparation and enrichment approaches for clinical oncology applications, the following parameters should be considered: (1) sensitivity, the percentage of targeted bases that are sequenced at a predefined depth based on the clinical application; (2) specificity, the percentage of on-target reads; (3) uniformity, the evenness of sequence coverage across targeted regions; (4) the amount of input DNA required for each sample; (5) cost per sample; and (6) speed, the time required to get from DNA to templates ready for sequencing. Both the Microdroplet technology and the Access Array System are suitable for sequencing a few genes in a large number of samples; neither is economical for sequencing a large number of genes in a few samples, as the amount of input DNA as well as the cost of the numerous primers and PCR reactions can be very high. Both technologies also require additional specialized equipment, which is a major drawback. Although both approaches are compatible with FFPE samples, their low tolerance for poor-quality DNA and the large amount of input DNA required limit their applicability to sequencing cancer samples with gene panels. The AmpliSeq, TruSeq Amplicon, and HaloPlex techniques fit clinical practice particularly well, since they are suitable for sequencing up to hundreds of genes in a small or large number of samples, and library preparation for all three can be completed within a day. The AmpliSeq technique consumes the least input DNA and offers the shortest TAT, a major advantage in cancer diagnosis, where tissue suitable for NGS is often limited and short TAT is highly desired. The sensitivity and specificity of these techniques are comparable. While the HaloPlex technology provides higher uniformity, the TruSeq Amplicon technique is easier to use and requires less hands-on time.
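The first three parameters can be computed directly from an aligned run. The sketch below uses simple illustrative definitions: sensitivity as the fraction of targeted bases at or above a chosen depth, specificity as the on-target read fraction, and uniformity as the fraction of targeted bases covered at ≥20% of the mean depth (the definition used later in Table 18.4); the depth threshold is an assay-specific assumption.

```python
def enrichment_metrics(per_base_depths, on_target_reads, total_mapped_reads,
                       min_depth=500):
    """Summarize a targeted-enrichment run from per-base coverage.

    per_base_depths: one coverage depth per targeted base
    min_depth: predefined clinical depth threshold (assay-specific)
    """
    n = len(per_base_depths)
    mean_depth = sum(per_base_depths) / n
    sensitivity = sum(d >= min_depth for d in per_base_depths) / n
    specificity = on_target_reads / total_mapped_reads
    uniformity = sum(d >= 0.2 * mean_depth for d in per_base_depths) / n
    return {"mean_depth": mean_depth, "sensitivity": sensitivity,
            "specificity": specificity, "uniformity": uniformity}

# Toy example with made-up depths for a six-base target region
print(enrichment_metrics([900, 1200, 450, 1500, 1100, 80],
                         on_target_reads=96_000, total_mapped_reads=100_000))
```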

RNA Sequencing

Microarray technology remains, as of this writing, the most commonly used technique for measuring gene expression, since it allows high-throughput analysis of thousands of target genes in parallel. Nonspecific hybridization and limited dynamic range are the main drawbacks of microarray analysis; it also cannot easily detect splice events or unknown transcripts [21]. In contrast, next-generation RNA-seq permits the simultaneous discovery of single nucleotide variants (SNVs) [22–24], indels, chimeric transcripts [25], novel transcripts, alternative splicing [26], allelic imbalances [27], differentially expressed transcripts [28], and translocations [29]. RNA-seq also has many advantages when comparing transcription levels across different genes, samples, and time points: it offers a greater dynamic range, increased sensitivity, a higher level of technical reproducibility, and higher accuracy in expression quantification of the human transcriptome compared with array-based technologies [30–32]. However, the cost and complexity of whole-transcriptome RNA-seq data sets have heretofore prevented this method from being used in routine molecular diagnostic testing. Recently developed targeted approaches have drastically reduced data complexity and cost due to their focused nature [33–37]. Targeted-enrichment strategies were originally developed for genomic DNA resequencing, but slight modifications for cDNA applications have allowed the development of targeted RNA-seq. Targeted RNA-seq approaches have been used to detect fusion transcripts, allele-specific expression, mutations, and RNA-editing events in a subset of transcripts [33,38–41].

Targeted RNA Analysis by Multiplex Amplification

The AmpliSeq technology can also be used for RNA-seq. A prerequisite step in RNA-seq using AmpliSeq technology is conversion of extracted total RNA to single-stranded cDNA. This is achieved by adding reverse transcription (RT) enzyme and Dynabeads® Cleanup Modules to the existing Ion AmpliSeq™ reagents (Life Technologies). Primers for the genes of interest can be generated using the Ion AmpliSeq™ Designer, a free online assay design tool with over 20,000 targeted genes to select from (https://www.ampliseq.com/browse.action). Amplicons are designed to account for GC content and filtered for repeats and SNPs; for genes with multiple documented transcript isoforms, amplicons are designed to target the exons that represent the maximum number of transcripts for each gene. The workflow of Ion AmpliSeq™ RNA-seq is outlined in Figure 18.5. At the moment, Ion AmpliSeq™ RNA technology in combination with RNA-specific primer design parameters enables up to 300 genes to be assayed in one amplification reaction with as little as 5 ng of total input RNA. Raw data are processed on the Ion PGM™ Sequencer and transferred to the Torrent Server for base calling and variant calling. Further analysis can be performed using the Coverage Analysis plug-in, which reports amplicon counts in various formats so that users can import the data into other software packages for differential expression analysis. The whole process can be carried out in about 15 h, spread over 1.5 days.
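Amplicon count tables exported from the Coverage Analysis plug-in must be normalized for sequencing depth before any between-sample comparison; the sketch below shows a generic counts-per-million normalization (illustrative amplicon names, not the plug-in's own method).

```python
def counts_per_million(amplicon_counts):
    """Normalize raw per-amplicon read counts to counts per million (CPM)."""
    total = sum(amplicon_counts.values())
    return {amp: 1e6 * c / total for amp, c in amplicon_counts.items()}

# Hypothetical counts for three target amplicons in one library
raw = {"EGFR_ex19": 5400, "KRAS_ex2": 2100, "BRAF_ex15": 900}
cpm = counts_per_million(raw)
print({k: round(v) for k, v in cpm.items()})
# {'EGFR_ex19': 642857, 'KRAS_ex2': 250000, 'BRAF_ex15': 107143}
```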

FIGURE 18.5 Targeted RNA-seq with Ion AmpliSeq™ RNA technology: RNA is reverse transcribed to cDNA; targets are amplified using an Ion AmpliSeq™ RNA panel; primer sequences are partially digested; adapters (A/P1) or barcode adapters (X/P1) are ligated; and the nonbarcoded or barcoded library is amplified. Used with permission from Life Technologies.

A recent study [42] using competitive multiplex PCR-based amplicon library preparation on the Ion Torrent NGS platform for targeted quantitative RNA-seq calculated the intertarget variation in PCR amplification during library preparation by measuring each transcript's native template relative to a known number of synthetic competitive template internal standard copies. The study demonstrated that competitive multiplex PCR amplicon library preparation for targeted quantitative RNA-seq can provide quantitative transcript abundance data for selected gene targets, detecting twofold changes with 97% accuracy and high concordance across days, library preparations, and laboratories. In addition, the method could reduce the sequencing reads required for transcript abundance quantification by more than 10,000-fold. The benefits of internal quality control, high reproducibility, and the reduced number of sequencing reads necessary make this technology attractive for the development and implementation of targeted quantitative RNA-seq in molecular diagnostic testing.

Targeted RNA Analysis by Single-Plex Amplification

The Fluidigm C1™ Single-Cell Auto Prep System provides simplified transcriptome analysis of single cells for RNA-seq. This system is designed specifically for detailed expression profiling in diverse cell populations, with a very low sample requirement that is compatible with single-cell input (as little as 10 pg per reaction). It is an integrated microfluidic system for routine preparation of 96 full-length cDNA libraries in parallel, providing a complete workflow from cell isolation, wash, live/dead cell staining, cell lysis, and RT to long-range PCR amplification through a microfluidic IFC. Overall hands-on time from cells to sequence-ready libraries is 3–4 h, with an overall run time of less than 14 h (http://www.fluidigm.com/single-cell-mrna-sequencing.html).

Targeted RNA Analysis Using Targeted Capture Followed by Multiplex Amplification

Based on MiSeq DNA sequencing technology, Illumina developed the TruSeq Targeted RNA Expression assay. The assay supports quantitative multiplexed gene expression profiling of 12–1000 targets per sample and up to 384 samples in a single MiSeq run. Illumina DesignStudio allows the creation of custom panels by selecting up to 1000 ROI from a database of over 400,000 predesigned assays targeting genes, exons, splice junctions, SNPs, and fusions. The TruSeq Targeted RNA Expression protocol is optimized for 50 ng of high-quality total RNA, and the assay is compatible with FFPE samples (although for FFPE samples with an average fragment size equal to or longer than 200 bp, the RNA input should be increased to 100 ng). Targets are amplified in a single reaction to minimize potential bias (Figure 18.6), and once the library is constructed, the rest of the steps are similar to those for TruSeq targeted DNA sequencing.


FIGURE 18.6 Illumina TruSeq Targeted RNA Expression assay. ULSO, upstream locus-specific oligo; smRNA, small RNA-seq primer binding site; DLSO, downstream locus-specific oligo; SBS3, Read 2 sequencing primer binding site; P7, flow cell binding site; P5, flow cell binding site.

From sample receipt to data analysis, the entire process takes less than 2 days. Sequence data are automatically aligned and analyzed and can be viewed using MiSeq Reporter, which permits customizable significance thresholds and pairwise comparisons of relative expression between samples or groups of samples (http://res.illumina.com/documents/products/datasheets/datasheet_truseq_targeted_rna_expression.pdf).
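As a worked illustration of the competitive internal-standard approach described earlier in this section [42], the native transcript abundance can be recovered from the read ratio of native to internal-standard amplicons, scaled by the known number of internal-standard copies spiked into the reaction. A hypothetical sketch:

```python
def native_copies(native_reads, internal_std_reads, internal_std_copies):
    """Estimate native template copies from a competitive PCR amplicon pair.

    Native and internal-standard templates share primer sites, so PCR
    amplification bias cancels in the read ratio; the known spike-in count
    then scales the ratio back to absolute copies.
    """
    return native_reads / internal_std_reads * internal_std_copies

# Hypothetical: 8,400 native vs 2,100 internal-standard reads with a
# 1,000-copy spike-in -> ~4,000 native copies in the reaction
print(native_copies(8400, 2100, 1000))  # 4000.0
```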

Methylation Analysis

DNA methylation is a major form of epigenetic modification and plays essential roles in many diseases. Since epigenetic alterations may represent key indicators of disease onset and progression, methylation analysis bears great potential for clinical applications. Bisulfite conversion of DNA is required to distinguish methylated cytosine from unmethylated cytosine. The method is based on the discovery that treatment of denatured genomic DNA with sodium bisulfite chemically deaminates unmethylated cytosine residues much more rapidly than methylated cytosine residues [43,44]. The chemical treatment thus effectively converts all unmethylated cytosine residues to uracil, which is amplified as thymine during PCR. In contrast, methylated cytosine residues are not affected by the treatment and are still detected as cytosine in the final sequencing result. In combination with bisulfite conversion, many NGS platforms have been used to perform targeted methylation analysis. A novel approach [45] has been reported for conducting multisample, multigene, ultradeep bisulfite sequencing analysis to evaluate DNA methylation patterns in 25 gene-related CpG-rich regions across more than 40 samples in a single run on a Roche 454 platform. A method called Bisulfite Patch PCR has also been described [46], which enables highly multiplexed bisulfite PCR and sequencing across many samples. RainDance Technologies has also recently expanded the capabilities of its primer design pipeline to enable analysis of methylated regions of the human genome using bisulfite sequencing and the RainDance MethylSeq™ platform for targeted methylation analysis (http://raindancetech.com/targeted-dna-sequencing/tndas-applications/targeted-methylation-analysis-methylseq/). Its primer design algorithm is capable of interrogating all regions of the methylome, and up to 4000 customer-defined ROI can be interrogated at the same time. Multiple cytosine residues in a single CpG island on the same DNA strand can also be measured. Based on this platform, it is possible to combine bisulfite treatment with microdroplet PCR and NGS to simultaneously assess the DNA sequence and methylation status of human cell lines and cancer samples [47,48].
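The logic of bisulfite sequencing reduces to a substitution rule and a per-site ratio: unmethylated cytosines read out as thymine after conversion and PCR, methylated cytosines remain cytosine, and the methylation level at a site is the fraction of C calls among C+T calls. A minimal sketch, assuming idealized complete conversion on the plus strand only:

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate idealized bisulfite conversion of the plus strand.

    Unmethylated C -> U (read as T after PCR); methylated C is unchanged.
    methylated_positions: set of 0-based indices of methylated cytosines.
    """
    return "".join(
        "T" if b == "C" and i not in methylated_positions else b
        for i, b in enumerate(seq)
    )

def methylation_level(base_calls_at_site):
    """Fraction methylated at one CpG site: C calls / (C + T) calls."""
    c = base_calls_at_site.count("C")
    t = base_calls_at_site.count("T")
    return c / (c + t) if (c + t) else float("nan")

print(bisulfite_convert("ACGTCGCC", {1}))  # 'ACGTTGTT': only position 1 kept
print(methylation_level("CCCTTCCT"))       # 0.625 from 5 C and 3 T calls
```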

ADVANTAGES AND DISADVANTAGES OF AMPLIFICATION-BASED NGS

The two most commonly used targeted library preparation approaches are hybridization-based capture (either on-array or in-solution) and PCR-based targeted amplification (either highly multiplex PCR or parallel single-plex PCR).


The foremost advantage of amplification-based NGS is its speed. Amplification-based NGS relies on a PCR approach to amplify the ROI for sequencing, which can be done within a few hours and translates into shorter TAT, whereas hybridization-based NGS requires longer reaction times (typically 24–48 h of incubation) and, therefore, longer TAT. Hybridization-based targeted NGS may also suffer from design restrictions, including problems with high-GC-content or repetitive regions and issues with gene family members that share sequence homology [49] or with pseudogenes. In contrast, with amplification-based NGS, primers can be designed to avoid or minimize the amplification of pseudogenes or genomic regions with high sequence homology to the ROI, resulting in fewer off-target reads. Tumor samples are often a mixture of tumor cells and normal stromal cells. In addition, intratumoral heterogeneity (i.e., the presence of multiple clones within a tumor) requires high sensitivity of mutation detection. Amplicon-based NGS enables deep sequencing, which allows the detection of low-level sequence variants and the identification of different tumor clones. Unlike most constitutional genomic studies, which use normal blood samples, tumor samples are very diverse and often associated with necrosis, which affects DNA quality. In addition, most samples for cancer genetic analysis are FFPE tumor materials containing cross-linked DNA. Amplification-based NGS can tolerate degraded DNA and generally requires only 10–200 ng of DNA per assay. Finally, the cost of amplicon-based NGS is usually lower than that of hybridization-based NGS because of its smaller target regions, simpler data analysis pipeline, easier optimization for high sensitivity and specificity, and the low cost of PCR reagents. Because of these advantages, amplicon-based NGS is especially suitable for accurate somatic mutation analysis in clinical cancer diagnosis. One disadvantage of amplification-based NGS is PCR bias, such as GC-content bias and amplicon size bias. Usually, GC-rich fragments are underrepresented in sequencing results and shorter fragments are overrepresented, especially in multiplex amplification; amplicons can also be completely lost owing to sequence variation at primer binding sites. These PCR biases can be minimized by careful primer and amplicon design, a decrease in the number of PCR cycles, and normalization of individual PCR amplicons or multiplex products to equal molarities when pooling amplicons prior to template preparation. Amplification-based NGS also requires prior knowledge of the targeted sequences, the nature of the mutations to be targeted, and any pseudogenes and/or sequence homology of the targeted sequences to other genomic regions. Another significant disadvantage is the inability to identify novel disease-associated mutations outside the targeted regions.

CLINICAL APPLICATION OF AMPLIFICATION-BASED NGS IN CANCER

Cancer is a complex disease typically caused by the accumulation of genomic and epigenomic alterations. Using NGS technologies, researchers have identified novel genetic alterations contributing to oncogenesis, cancer progression, and metastasis [50,51]. Significant advances have been achieved using NGS technologies in cancer genomics research for breast cancer, ovarian cancer, colorectal cancer, lung cancer, liver cancer, renal cell carcinoma, head and neck cancer, melanoma, and acute myeloid leukemia (AML) ([51]; http://cancergenome.nih.gov/cancersselected). The technology has also boosted the discovery of novel cancer genes: about 1500 cancer genes (genes that are causally implicated in tumor initiation and progression) have been identified and collected in the Network of Cancer Genes database [52]. Retrospective and prospective clinical trials based on current knowledge of the cancer genome have begun to illustrate the clinical significance of genomic alterations in cancer diagnosis, prognosis, and treatment (http://cancergenome.nih.gov/cancergenomics/impact). Identification of cancer-associated mutations has become standard of care for cancer diagnosis and risk stratification; examples of such mutations include the PML-RARA gene fusion in acute promyelocytic leukemia and the FLT3 ITD and NPM1 mutations in AML [53]. Many drugs have been developed to specifically target certain mutations and the associated altered biological signaling pathways or mutant gene products, for example, Trastuzumab (Genentech, South San Francisco, CA), a monoclonal antibody that specifically interferes with the HER2/neu receptor, for HER2/neu-amplified breast cancer; Imatinib (Novartis, Basel, Switzerland), a tyrosine-kinase inhibitor, for BCR-ABL-positive chronic myelogenous leukemia [54]; Gefitinib (AstraZeneca, London, UK) and Erlotinib (Genentech, South San Francisco, CA), also tyrosine-kinase inhibitors, for lung cancer with EGFR mutations; and Crizotinib (Pfizer, New York City, NY), an ALK inhibitor, for ALK-rearranged lung cancer [55–58]. Additionally, because of the heterogeneity of the cancer genome, mutation profiles may differ significantly among patients with histologically similar tumors; conversely, patients with histologically different tumors may carry similar mutation profiles [59]. Therefore, accurate detection of cancer mutations is a critical step in personalized cancer care. Typically, Sanger sequencing is used clinically to identify mutations with well-described clinical phenotypes in a one-gene-one-test or one-mutation-one-test manner [60].


When several genes are queried for mutations, however, this approach becomes too time- and resource-intensive [60]. NGS can overcome these limitations, and NGS technologies have been adopted in clinical laboratories, where they now play a major role in cancer diagnosis. However, clinical NGS involves important principles and practical considerations in sample requirements, DNA/RNA quality control, and data analysis, interpretation, and reporting.

Sample Requirements

Amplification-based NGS may be performed on any specimen that can yield DNA or RNA, such as peripheral blood, bone marrow, saliva, fresh or frozen tissue, FFPE tissue, and prenatal specimens. As the quality and quantity of DNA/RNA extracted from different sample types may vary significantly, clinical laboratories need to validate the NGS test on each sample type and establish minimum requirements for sample quantity and quality, as well as shipping and handling requirements for each sample type. In cancer diagnosis, FFPE samples are the most common tissue source, and some may have been archived for years. Formalin fixation causes DNA/RNA cross-linking and fragmentation [61]. The level of DNA damage depends on the length of the fixation procedure [62], the conditions and duration of sample storage [63], and the type of tissue [64]. Tissue necrosis is also common in tumor samples, which additionally affects DNA/RNA quality. Typically, a cancer specimen contains a mixture of cancer and normal cells and therefore a mixture of cancer and normal genomes; tumor samples consequently must be reviewed by a pathologist for tumor content, and the regions of tumor should be at least macrodissected, if not laser captured, to enrich tumor cells for DNA/RNA extraction. For metastatic cancers, the sample sources are often core biopsy or fine needle aspiration specimens, and after pathologic evaluation for diagnosis, the amount of sample remaining for genetic testing can be extremely limited. Thus, when selecting amplification-based NGS platforms for cancer testing, it is important to consider the size of the amplicons and the amount of input DNA required, to accommodate the often low yield and relatively poor quality of DNA/RNA from FFPE samples. The amount of input DNA for library preparation and the DNA-to-sequence-ready-template time required for different library preparation and enrichment techniques and applications are listed in Table 18.2. Some RNA library preparation kits used for RNA-seq are also shown in the table.

DNA/RNA Extraction and Quality Control

The quality of DNA and RNA can depend greatly on the tissue type, the fixation (fresh vs. FFPE), and the extraction method used. A standardized and efficient workflow for DNA/RNA extraction, quantification, and qualification is essential in clinical molecular diagnosis, and DNA/RNA extraction kits and automation equipment are commercially available from several vendors. For large laboratories, in addition to validating automated DNA/RNA extraction by machines, it is important to validate manual extraction methods for especially difficult samples. After extraction, the quality and quantity (integrity and purity) of the genomic DNA or total RNA should be evaluated, which can be done using several different methodologies. The integrity and size of gDNA or RNA can be checked by regular or pulsed-field electrophoresis on an agarose gel. DNA/RNA purity can be assessed from the ratio of the absorbance at 260 and 280 nm (A260/A280) on a spectrophotometer; pure nucleic acids have an A260/A280 ratio of 1.8–2.1 [65]. RNA concentration can be determined by measuring the absorbance at 260 nm on a spectrophotometer such as a NanoDrop. For high-throughput NGS testing, quantification of double-stranded DNA (dsDNA) is necessary; this can be done with real-time quantitative PCR (qPCR), with PicoGreen® dsDNA quantitation kits (Life Technologies), which are compatible with most fluorescence-based microplate readers and fluorometers, or with the Qubit® dsDNA BR Assay Kit (Life Technologies) in conjunction with the Qubit® 2.0 fluorometer. The Qubit® 2.0 fluorometer can be used for RNA quantitation as well. qPCR can accurately quantify dsDNA; however, it is labor-intensive, time-consuming, and costly. RNA integrity can be more accurately assessed on a Bioanalyzer with an RNA LabChip® and analyzed using the RNA Integrity Number (RIN) software algorithm (all from Agilent Technologies). However, using a Bioanalyzer for routine RNA integrity assessment can be very expensive because of the cost of the instrument and consumables. For these reasons, many laboratories use a sequential combination of NanoDrop and Qubit measurements to assess the quantity and quality of dsDNA, since Qubit measurements have proved to be highly reproducible and consistent with qPCR measurements. However, some protocols, such as Illumina TruSight™ Tumor, require the qPCR method for the DNA quantification step (http://supportres.illumina.com/documents/documentation/chemistry_documentation/samplepreps_trusight/trusight_tumor_sample_prep_guide_15042911_a.pdf).
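A laboratory's acceptance rules from this paragraph translate naturally into a small QC gate. The sketch below flags samples on purity and concentration; the thresholds are illustrative defaults, not a validated cutoff set.

```python
def qc_nucleic_acid(a260, a280, conc_ng_per_ul, min_conc=10.0):
    """Simple pre-library QC gate on purity and concentration.

    Pure nucleic acid is expected at A260/A280 of ~1.8-2.1; min_conc is an
    assay-specific minimum concentration (illustrative default).
    """
    issues = []
    ratio = a260 / a280
    if not (1.8 <= ratio <= 2.1):
        issues.append(f"A260/A280 {ratio:.2f} outside 1.8-2.1")
    if conc_ng_per_ul < min_conc:
        issues.append(f"concentration {conc_ng_per_ul} ng/uL below {min_conc}")
    return ("PASS", issues) if not issues else ("REVIEW", issues)

print(qc_nucleic_acid(a260=0.95, a280=0.50, conc_ng_per_ul=47.5))
# ('PASS', [])  -- ratio 1.90, adequate concentration
print(qc_nucleic_acid(a260=0.80, a280=0.50, conc_ng_per_ul=4.0))
# ('REVIEW', [...])  -- ratio 1.60 and low concentration
```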


TABLE 18.2 Comparison of Library Preparation Kits of Different NGS Platforms

DNA SEQUENCING

Platform | Kit | Input DNA | Duration time

Life Technologies Ion Torrent (a)
AmpliSeq Library Kit 2.0 | 10 ng (c,d) | 6 h
Ion Xpress Plus Fragment Library Kit | 50 ng–1 μg | 2.5–5.5 h
Ion Plus Fragment Library Kit | 50 ng–1 μg | 2.5–5.5 h
5500 SOLiD Fragment 48 Library Core Kit | 5 ng–5 μg | N/A

Illumina (b)
Nextera enrichment | 50 ng | 3 h
TruSeq DNA Sample Preparation kits | 100 ng–2 μg | 5 h–2 days
TruSight Cancer | 50 ng | 1.5 days
TruSight Tumor | 30–300 ng | >20 h
TruSight Inherited Disease | 50 ng | 1.5 days

Agilent
HaloPlex Target Enrichment | 225 ng | >6 h
SureSelect Capture system | 3 μg | N/A
SureSelect Target Enrichment | 500 ng (for Roche 454), 3 μg (for Illumina) | 2.25–4.25 days

Roche 454
GS FLX Titanium Rapid Library Preparation Kit | 500 ng | N/A

RNA-SEQ

Platform | Kit | Input RNA | Duration time

Life Technologies/Ion Torrent (a)
AmpliSeq RNA Library Kit | 10 ng (d) of total RNA | 4.5 h
SOLiD Total RNA-Seq Kit | 5–500 ng mRNA | >20 h

Illumina
TruSeq Stranded mRNA and Total RNA Sample Preparation Kits | 100 ng–4 μg of total RNA | N/A

Agilent
Strand-Specific RNA Library (on the Illumina platform) | 200 ng–4 μg total RNA | >5 h

(a) Needs about 4 h of emPCR to generate the templates.
(b) If used for HiSeq systems, needs about 4–5 h to generate clonal clusters.
(c) For the AmpliSeq Comprehensive Cancer Panel (CCP), 40 ng.
(d) Compatible with FFPE samples.
Cancer-Specific Targeted Panels

Cancer-specific targeted gene panels interrogate known cancer-associated genes. Targeting a limited number of genes allows greater sequencing depth and therefore increased analytical sensitivity for detecting sequence variants with low allele frequencies. Furthermore, focusing on genes and mutations with established clinical significance makes it easier to annotate and interpret the sequencing data. Sequencing a panel of genes can be done on benchtop sequencers, such as the PGM and MiSeq, that are much cheaper than higher-throughput instruments such as the HiSeq 2500. Because the total targeted region is small, more samples can be batched in one run to further decrease cost, and the data volume and storage requirements are also more manageable. Several ready-to-use cancer panels from different companies currently have potential for use in clinical testing, and laboratories may also develop customized panels to meet their research and clinical needs.


TABLE 18.3 The Gene List Covered by the Ion AmpliSeq™ Cancer Hotspot Panel v2

ABL1 | EGFR | GNAS | KRAS | PTPN11
AKT1 | ERBB2 | GNAQ | MET | RB1
ALK | ERBB4 | HNF1A | MLH1 | RET
APC | EZH2 | HRAS | MPL | SMAD4
ATM | FBXW7 | IDH1 | NOTCH1 | SMARCB1
BRAF | FGFR1 | IDH2 | NPM1 | SMO
CDH1 | FGFR2 | JAK2 | NRAS | SRC
CDKN2A | FGFR3 | JAK3 | PDGFRA | STK11
CSF1R | FLT3 | KDR | PIK3CA | TP53
CTNNB1 | GNA11 | KIT | PTEN | VHL
AmpliSeq™ Cancer Hotspot Panel v2

The AmpliSeq™ Cancer Hotspot Panel v2 (Panel v2) (http://www.lifetechnologies.com/order/catalog/product/4475346) is one of the ready-to-use panels offered by Life Technologies. The panel utilizes highly multiplex PCR with a single pool of primers to generate an amplicon library that interrogates 2855 hotspot mutations in 50 genes frequently altered in cancer. The genes covered by Panel v2 are listed in Table 18.3. The total targeted region of the panel is approximately 22 Kb, and coverage of the targeted region is 100%. A total of 207 primer pairs are amplified in a single tube to generate 207 different amplicons. The amplicon lengths range from 111 to 187 bp (average 154 bp). The required amount of input DNA is 10 ng, and reportedly can be as low as 5 ng. The panel is compatible with FFPE samples and is optimized for library construction with the Ion AmpliSeq™ Library Kit 2.0. The basic protocol is as follows: 10 ng of genomic DNA is used as the template for multiplex PCR to prepare the amplicon library. After partial digestion of the primer sequences, Ion AmpliSeq™ Adapters or Ion Xpress™ Barcode Adapters are ligated onto the amplicons, followed by library purification and reamplification. The amplified library is quantified on the Bioanalyzer instrument with the High Sensitivity DNA Kit (Agilent Technologies Inc, Santa Clara, CA), followed by sequencing template preparation, which can be performed using either the automated Ion OneTouch™ System (Life Technologies, Carlsbad, CA) with the Ion OneTouch™ 200 Template Kit v2 DL or manual emPCR. The library stocks are diluted to an appropriate working concentration (based on laboratory validations) and pooled (if barcoded); the templated ISPs are enriched with the automated Ion OneTouch™ ES instrument or via a manual method if emPCR is used. The quality and quantity of the enriched ISPs are assessed using the Guava® easyCyte™ 5 Flow Cytometer (Millipore, Billerica, MA), and the library is then ready for sequencing. Sequence data are automatically analyzed in the Torrent Suite software v3.0 (or a later version) for base calling and alignment. Variant calling can be done using the Variant Caller Plugin, Ion Reporter™ Software, or any commercial or laboratory-developed software. A few sample multiplexing strategies and associated specifications for the Ion AmpliSeq™ Cancer Hotspot Panel v2 are listed in Table 18.4 (http://tools.lifetechnologies.com/content/sfs/brochures/Ion-AmpliSeq-Cancer-Hotspot-Panel-Flyer.pdf). In general, the coverage of targeted regions and the sequence uniformity of the Ion AmpliSeq™ Cancer Hotspot Panel v2 are much better than those of the original Ion AmpliSeq™ Cancer Hotspot Panel; in most cases, 100× coverage of 100% of the targeted regions can be achieved. With the release of Ion Torrent's new chips (Chip v2), the performance specifications of the assay have been further improved. Figure 18.7 shows performance statistics for a representative assay performed on an Ion Torrent PGM platform with a 316 Chip v2 using the AmpliSeq™ Cancer Panel v2.


TABLE 18.4 Sample Multiplexing Strategies and Some Specifications for Panel v2

Sample multiplexing (observed performance):
Ion 314™ Chip: 2 samples, ~1400× average coverage
Ion 316™ Chip: 8 samples, ~1400× average coverage
Ion 318™ Chip: 16 samples, ~1400× average coverage

Metric | Specification | Observed performance (Ion 314™ chip)
Coverage uniformity (a) | ≥95% | ≥98%
On-target reads (b) | ≥90% | ≥96%
Average depth of coverage | N/A | >2000×
SNP detection sensitivity | N/A | 98% detection rate for 5% variant frequency at positions with average sequencing coverage from 1000× to 4000×

(a) Coverage uniformity: percentage of bases covered at ≥20% of the mean coverage.
(b) On-target reads: percentage of reads that mapped to target regions out of total mapped reads per run.

FIGURE 18.7 Performance statistics of a run performed on the Ion Torrent PGM platform with a 316 Chip v2 and the AmpliSeq™ Cancer Panel v2. (A) ISP loading density (average ~90% loading); (B) ISP summary (90% loading, 100% enrichment, 68% clonal, 86% final library); (C) Variant Caller report for four barcoded samples (>91% of reads on target, mean read depths of approximately 2860–4355×, 100% of targeted bases covered at 1×, 20×, and 100×, and 12–16 variants detected per sample).

Ion AmpliSeq™ Comprehensive Cancer Panel

The Ion AmpliSeq™ CCP is another cancer panel offered by Life Technologies. The CCP covers 409 cancer genes with 15,992 primer pairs in four multiplex pools. The total targeted region of the panel is 1.29 Mb, and coverage of the targeted region is approximately 95.35%. The amplicon lengths range from 125 to 175 bp (average 155 bp), and the required DNA input is as low as 40 ng total (10 ng per pool). The CCP is also optimized for library construction with the Ion AmpliSeq™ Library Kit 2.0, and it shares the same workflow as Panel v2. Because the in silico coverage of this panel is not 100%, it is necessary to polish the panel to avoid false negatives by employing additional methodologies to cover the low- to no-coverage regions and to avoid confusion with pseudogenes and unintended sequences.

AmpliSeq Custom Cancer Panels

Laboratories can design their own panels based on their clinical needs using the Ion AmpliSeq™ Designer (www.ampliseq.com). Many clinical labs have found this tool helpful for designing a variety of panels, including somatic mutation panels, panels for hematological malignancies, and panels for overgrowth syndromes caused by somatic mutations.


TABLE 18.5 Genes Covered by the Illumina TruSeq Amplicon Cancer Panel

ABL1 | AKT1 | ALK | APC | ATM | BRAF | CDH1 | CDKN2A
CSF1R | CTNNB1 | EGFR | ERBB2 | ERBB4 | FBXW7 | FGFR1 | FGFR2
FGFR3 | FLT3 | GNA11 | GNAQ | GNAS | HNF1A | HRAS | IDH1
JAK2 | JAK3 | KDR | KIT | KRAS | MET | MLH1 | MPL
NOTCH1 | NPM1 | NRAS | PDGFRA | PIK3CA | PTEN | PTPN11 | RB1
RET | SMAD4 | SMARCB1 | SMO | SRC | STK11 | TP53 | VHL

Ion AmpliSeq™ RNA Cancer Panels

The Ion AmpliSeq™ RNA Cancer Panel was developed as the RNA complement to the Ion AmpliSeq™ Cancer Hotspot Panel v2 using Ion AmpliSeq™ technology. The panel is a single pool of primers representing the 50 oncogenes and tumor suppressor genes included in Panel v2 (http://tools.lifetechnologies.com/content/sfs/brochures/AmpliSeq-RNA-Cancer-Flyer.pdf). The 1.5-day workflow starts with RNA library preparation using the Ion AmpliSeq™ RNA Library kit. RNA library preparation includes RT to produce single-stranded cDNA and amplification of target regions with a high-fidelity DNA polymerase. In general, 19 and 22 amplification cycles are used for unfixed and fixed (FFPE) RNA, respectively, to amplify the targeted regions. After ligation of adapters or barcodes to the amplicons and purification, the library is ready for the second round of amplification and purification. After the library is quantitated using the Agilent Bioanalyzer and Agilent High Sensitivity DNA kit (Agilent), it is ready for template preparation using the same methods as for gDNA. In addition to the Ion AmpliSeq™ RNA Cancer Panel, Life Technologies also offers the AmpliSeq™ RNA Apoptosis Panel. This panel is designed as a screening tool to probe 267 genes involved in the cellular apoptosis pathway, including genes associated with death receptor-mediated apoptosis, c-Myc, and p53-mediated apoptosis. It contains 267 amplicons, each approximately 150 bp in length. The workflow of this panel is similar to that of the Ion AmpliSeq™ RNA Cancer Panel (http://tools.lifetechnologies.com/content/sfs/brochures/AmpliSeq-RNA-Apoptosis-Flyer.pdf).

RainDance ONCOSeq™ Panel

The RainDance ONCOSeq™ Panel enables simultaneous sequence analysis of 142 cancer genes with 99% coverage of the target region in a single assay. The panel targets the exons, intron–exon junctions (for splice site mutations), and 1000 bp of the 5′ and 3′ flanking regions (to cover the 5′ promoters and 3′ UTRs of each gene). Using RainDance Microdroplet technology, the ONCOSeq™ Panel can minimize enrichment bias to maintain allelic representation of heterozygous alleles and underrepresented mutant alleles. The panel requires 250 ng of genomic DNA. One benefit of this technology is that it is relatively easy to modify the panel by adding or subtracting sequencing targets according to the laboratory's specific needs (http://raindancetech.com/targeted-dnasequencing/tdnas-content-panels/oncoseq-panels/).

Illumina TruSeq Amplicon Cancer Panel

The TruSeq Amplicon Cancer Panel uses predesigned oligonucleotide probes to capture sequencing targets, followed by multiplex PCR. The panel targets mutation hotspots in 48 genes that are almost identical to those of the AmpliSeq™ Cancer Hotspot Panel. The panel includes 212 amplicons ranging from 170 to 190 bp in length, and the total genomic region covered is about 35 Kb. Table 18.5 contains the complete list of oncogenes included in the panel. Prior to amplicon preparation, a DNA quality check by qPCR is required to predict assay performance from FFPE DNA. The TruSeq Amplicon assay starts with hybridization of the premixed oligonucleotide probes upstream and downstream of the ROI; each probe includes a target capture sequence and an adapter sequence to be used in the subsequent amplification reaction. The captured templates are then PCR amplified, and two unique sample-specific indices are incorporated in this step. The final reaction product contains amplicons that are ready for sequencing. In addition, for multisample library preparation, an integrated bead-based normalization procedure is performed to allow simple volumetric library pooling. Pooled amplicon libraries can be loaded directly onto the MiSeq system without additional processing (http://res.illumina.com/documents/products/datasheets/datasheet_truseq_amplicon_cancer_panel.pdf). Illumina also offers 10 targeted RNA expression panels, which are listed in Table 18.6.

TABLE 18.6 TruSeq Targeted RNA Panels

Panel name | Number of genes covered | Cellular pathways
Cardiotoxicity Panel | 76 | Pathways affected by cardiotoxic compounds or stress
Apoptosis Panel | 117 | Pro-apoptotic and anti-apoptotic genes associated with cellular apoptosis
Wnt Panel | 93 | Upstream and downstream signal transduction, as well as transcription factors and target genes involved in the Wnt signaling pathway
p53 Panel | 52 | Upstream and downstream signal transduction, as well as transcription factors and target genes involved in the p53 signaling pathway
NFκB Panel | 105 | Upstream and downstream signal transduction, as well as transcription factors and target genes involved in the NFκB signaling pathway
Hedgehog Panel | 76 | Upstream and downstream signal transduction, as well as transcription factors and target genes involved in the Hedgehog signaling pathway
Stem Cell Panel | 100 | Stem cell markers, differentiation markers, pluripotency markers, cytokines, and growth factors
Cytochrome p450 Panel | 28 | Cytochrome P450 genes involved in drug and toxin metabolism
Cell Cycle Panel | 63 | All phases of the cell cycle as well as DNA replication
Neurodegeneration Panel | 77 | Neurodegenerative and neurotoxic pathways such as those implicated in Alzheimer's disease

DATA ANALYSIS

Data analysis is a critical part of any clinical NGS assay. It consists of three steps: base calling and quality score computation (primary analysis); assembly and alignment (secondary analysis); and variant calling and annotation (tertiary analysis). Each company uses its own proprietary analysis software to call bases and generate associated quality scores, so the data output from an NGS instrument essentially consists of a text file containing millions of raw sequence reads. Since assembly is usually required only when no reference genome exists for the DNA sequenced, for clinical applications secondary analysis often consists only of alignment of the reads to the human reference sequence. The latest version of the human reference genome provided by the Genome Reference Consortium (GRCh37 at the time of writing) should be used for alignment (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml). In tertiary analysis, the majority of SNVs identified are synonymous or benign changes, which can be filtered using commercially available or laboratory-developed software with a specific set of filter criteria. Many databases for variant annotation are publicly available, such as the 1000 Genomes Project (http://www.1000genomes.org/) and the dbSNP database (http://www.ncbi.nlm.nih.gov/projects/SNP/). For targeted sequencing, custom-designed or commercially available .bed files can be used for calling variants, which produces a much smaller number of variants than when the whole genome is used. The variants that remain after filtering (sometimes imprecisely referred to as "rare variants") then need to be further annotated. Commonly used alignment and variant calling algorithms and software tools for data viewing are listed in Table 18.7. Unfortunately, all currently available analytical tools have limitations, owing to the many different types of data generated by the different NGS platforms, the varying reference sequences used for alignment, the databases used for variant annotation and filtering, and so on. The advantages and limitations of existing tools must be objectively evaluated by clinical laboratories for their specific sequencing needs. Furthermore, for clinical NGS data analysis, the acceptable thresholds for data quality and depth of coverage should be determined during the assay development and validation process. The minimum depth of coverage depends on the required sensitivity of the assay, the sequencing method, and the type of mutations to be detected. The extent of the analysis, including the ROI and the level of mosaicism to be detected, must be clearly defined.
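As a concrete illustration of tertiary-analysis filtering, the sketch below applies two of the common criteria named here, minimum depth of coverage and maximum population allele frequency, to minimal VCF-style records. The INFO tag names (DP, AF_POP) and thresholds are illustrative assumptions; production pipelines use validated, assay-specific criteria.

```python
def parse_info(info_field):
    """Parse a VCF INFO field such as 'DP=812;AF_POP=0.0001' into a dict."""
    out = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out

def passes_filters(vcf_line, min_depth=500, max_pop_af=0.01):
    """Keep variants with adequate depth that are rare in the population.

    Assumes DP (depth) and AF_POP (population allele frequency, e.g. from
    a 1000 Genomes annotation step) tags in INFO -- illustrative tag names.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the eighth VCF column
    depth_ok = int(info.get("DP", 0)) >= min_depth
    rare = float(info.get("AF_POP", 0.0)) <= max_pop_af
    return depth_ok and rare

line = "7\t55242465\t.\tGGAATTAAGAGAAGC\tG\t310\tPASS\tDP=1840;AF_POP=0.0\n"
print(passes_filters(line))  # True: deeply covered and absent from population
```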


TABLE 18.7 Computational Tools for NGS Data Analysis

Program | Functions | URL | References
Bowtie2 | Alignment | http://bowtie-bio.sourceforge.net/bowtie2 | [66]
BWA | Alignment | http://bio-bwa.sourceforge.net | [67,68]
SOAP2 | Alignment | http://soap.genomics.org.cn/soapaligner.html | [38–40]
MAQ | Alignment and assembly | http://maq.sourceforge.net | [69]
TopHat | Splice junction mapping | http://tophat.cbcb.umd.edu | [70]
Samtools | Variant calling | http://samtools.sourceforge.net | [38–40]
VARiD | Variant calling | http://compbio.cs.utoronto.ca/varid | [71]
VarScan2 | Variant calling | http://varscan.sourceforge.net | [72]
SpliceMap | Spliced alignment | http://www.stanford.edu/group/wonglab/SpliceMap | [73]
SOAPfuse | Gene fusion | http://soap.genomics.org.cn/soapfuse.html | [74]
FusionAnalyser | Gene fusion | http://www.ilte-cml.org/FusionAnalyser | [75]
NextGENe | Integrated tool | http://softgenetics.com | N/A
SeqMan NGen | Integrated tool | www.dnastar.com | N/A
Savant | Viewer | www.savantbrowser.com | [86]
Integrative Genomics Viewer | Viewer | http://www.broadinstitute.org/igv/ | N/A
Ion Reporter | Integrated tool | https://ionreporter.lifetechnologies.com/ir/ | N/A
Ingenuity Variant Analysis | Integrated tool | http://www.ingenuity.com/products/variant-analysis | N/A

INTERPRETATION AND REPORTING

Variants should be classified as pathogenic, benign, or variant of unknown significance (VUS). As previously stated, many tools and databases are publicly or commercially available for variant interpretation. Software to predict the functional effect of variants is also available, such as SIFT and PolyPhen; however, results from these functional prediction tools should be evaluated cautiously, especially in cancer diagnosis. Several tools and resources useful in variant annotation and interpretation are listed in Table 18.8. NGS testing by amplification-based approaches is expected to reveal large numbers of SNVs and small insertions or deletions (indels). As noted above, amplification-based methods are not ideally suited to revealing copy number variations (CNVs) or structural variants (SVs) such as translocations. Most labs report not only cancer-associated mutations but also VUSs, since some VUSs may be reclassified as pathogenic or benign in the future. All variants should be described in accordance with Human Genome Variation Society (HGVS) recommendations. The reference coding sequences should preferably be derived from the RefSeq database (http://www.ncbi.nlm.nih.gov/refseq/), and the particular transcripts used for variant annotation should be specified. Clinically actionable mutations (i.e., mutations of diagnostic, prognostic, and/or therapeutic significance) should be described in detail, and relevant publications and databases should be appropriately cited in the report. Confirmation of variants by a second, independent, laboratory-established technology is an important quality assurance step in NGS testing, as false-positive calls associated with specific platforms or specific tissue types have been observed. The extent of confirmation can be determined based on the quality of the variant calls and the rarity of the variants. For example, common mutations that have been repeatedly confirmed with a second technique may not need to be confirmed again if the coverage depth and the sequence quality score meet the predefined laboratory cutoffs; many labs independently confirm all rare mutations, especially private mutations. The interpretation of cancer NGS results is unavoidably complicated by common as well as private germline variants, which can be pathogenic or benign. Ideally, both tumor and normal tissues would be tested at the same time; however, only a tumor sample is tested in the majority of cases referred for cancer mutation panel analysis, owing to financial constraints. When a germline mutation is suspected, testing of a germline sample for confirmation is advised, and genetic counseling should be recommended.
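The confirmation policy described in this section can be encoded as a simple decision rule. In the sketch below, the depth and quality thresholds and the "previously confirmed" list are illustrative laboratory choices, not published cutoffs: a variant is sent for orthogonal confirmation unless it is a recurrently confirmed mutation whose call meets the lab's quality cutoffs.

```python
def needs_confirmation(variant, confirmed_recurrent, min_depth=500,
                       min_qual=100):
    """Decide whether an NGS variant call requires orthogonal confirmation.

    variant: dict with 'hgvs', 'depth', and 'qual' keys (illustrative schema)
    confirmed_recurrent: set of HGVS strings the laboratory has repeatedly
    confirmed with a second, independent technology.
    """
    quality_ok = variant["depth"] >= min_depth and variant["qual"] >= min_qual
    if variant["hgvs"] in confirmed_recurrent and quality_ok:
        return False                 # well-known hotspot with a good call
    return True                      # rare/private or low-quality: confirm

known = {"NM_004333.4:c.1799T>A"}    # e.g. BRAF V600E, previously confirmed
print(needs_confirmation({"hgvs": "NM_004333.4:c.1799T>A",
                          "depth": 2100, "qual": 310}, known))   # False
print(needs_confirmation({"hgvs": "NM_000546.5:c.817C>T",
                          "depth": 2100, "qual": 310}, known))   # True
```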

TABLE 18.8 Resources for Mutation Function Prediction and Interpretation

Program | Description | URL | References
PolyPhen-2 | Mutation function prediction | http://genetics.bwh.harvard.edu/pph2 | [76]
SIFT | Mutation function prediction | http://sift.jcvi.org | [77]
CHASM | Mutation function prediction | http://wiki.chasmsoftware.org | [78]
ANNOVAR | Annotation | http://www.openbioinformatics.org/annovar/ | [79]
COSMIC | Catalogue of Somatic Mutations in Cancer | http://www.sanger.ac.uk/genetics/CGP/cosmic | N/A
UCSC Cancer Genomics Browser | Web-based tools to visualize, integrate, and analyze cancer genomics data and associated clinical data | https://genome-cancer.soe.ucsc.edu/ | N/A
Cancer Genome Workbench | Hosts mutation, copy number, expression, and methylation data from the TCGA, TARGET, COSMIC, GSK, and NCI60 projects; tools for visualizing sample-level genomic and transcription alterations in various cancers | https://cgwb.nci.nih.gov/ | N/A
HGVS | Human Genome Variation Society; recommendations for the annotation of variants | www.hgvs.org/mutnomen | N/A
RefSeq database | Derivation of coding sequences | www.ncbi.nlm.nih.gov/RefSeq | N/A
dbSNP | Single Nucleotide Polymorphism Database | http://www.ncbi.nlm.nih.gov/projects/SNP | [80]
HGMD | Human Gene Mutation Database | http://www.biobase-international.com/product/hgmd | [81]
Illumina VariantStudio | Annotation of variants | http://www.illumina.com/clinical/clinical_informatics/illuminavariantstudio.ilmn | N/A
Ingenuity NGS Clinical Test interpretation and reporting solution beta program | Interpretation and reporting | http://www.ingenuity.com/ngsclinical-beta | N/A

CHALLENGES AND PERSPECTIVES

As described above, NGS has already been implemented in clinical diagnostics in many laboratories. However, comprehensive analysis and accurate interpretation of the large amount of sequence data produced by NGS technology requires a multidisciplinary team with expertise in genetics, pathology, oncology, bioinformatics, and data storage, among other areas. In addition, establishing well-curated genomic databases with phenotypic information, and crowdsourcing the labor-intensive interpretation of DNA variants, are both required for the widespread clinical implementation of NGS-based tests [82]. Ethical and socioeconomic challenges, such as incidental findings, patient confidentiality and data security, and test-reimbursement issues [83], must also be addressed. Genetic testing can identify individuals at risk long before disease onset, raising concerns about stigmatization and discrimination [84]. Still other legal issues that must be resolved include access to proprietary databases and the legality of gene patenting [85]. Despite these challenges, the clinical application of NGS in cancer will undoubtedly move forward to meet the needs of precision medicine, owing to its high efficiency and low cost. Close collaborations between research laboratories, clinical practitioners of different disciplines, diagnostic genetic laboratories, and vendors offering NGS instruments and analytic software will ensure that patients receive the maximum possible benefit from NGS technology.


References

[1] Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 1975;94(3):441-8.
[2] Maxam AM, Gilbert W. A new method for sequencing DNA. Proc Natl Acad Sci USA 1977;74(2):560-4.
[3] Watson JD. The human genome project: past, present, and future. Science 1990;248(4951):44-9.
[4] The International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004;431(7011):931-45.
[5] Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature 2001;409(6822):860-921.
[6] Taylor BS, Ladanyi M. Clinical cancer genomics: how soon is now? J Pathol 2011;223(2):318-26.
[7] Mardis ER. The impact of next-generation sequencing technology on genetics. Trends Genet 2008;24(3):133-41.
[8] Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008;26(10):1135-45.
[9] Metzker ML. Sequencing technologies-the next generation. Nat Rev Genet 2010;11(1):31-46.
[10] Liu L, Li Y, Li S, Hu N, He Y, Pong R, et al. Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012;2012:251364.
[11] Zhou X, Ren L, Meng Q, Li Y, Yu Y, Yu J. The next-generation sequencing technology and application. Protein Cell 2010;1(6):520-36.
[12] Voelkerding KV, Dames SA, Durtschi JD. Next-generation sequencing: from basic research to diagnostics. Clin Chem 2009;55(4):641-58.
[13] Rothberg JM, Leamon JH. The development and impact of 454 sequencing. Nat Biotechnol 2008;26(10):1117-24.
[14] Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005;437(7057):376-80.
[15] Dressman D, Yan H, Traverso G, Kinzler KW, Vogelstein B. Transforming single DNA molecules into fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci USA 2003;100(15):8817-22.
[16] Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem 2013;6:287-303.
[17] Ghosh S, Bee G, Reddy M, Pickle L, Dudas M, Mistro GD, et al. Semiconductor sequencing of human exomes on the Ion Proton system. J Biomol Tech 2013;24(Suppl.):S44.
[18] Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 2005;309(5741):1728-32.
[19] Tewhey R, Warner JB, Nakano M, Libby B, Medkova M, David PH, et al. Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nat Biotechnol 2009;27(11):1025-31.
[20] Moonsamy PV, Williams T, Bonella P, Holcomb CL, Höglund BN, Hillman G, et al. High throughput HLA genotyping using 454 sequencing and the Fluidigm Access Array system for simplified amplicon library preparation. Tissue Antigens 2013;81(3):141-9.
[21] Tang F, Lao K, Surani MA. Development and application of single-cell transcriptome analysis. Nat Methods 2011;8(4 Suppl.):S6-11.
[22] Ruan Y, Ooi HS, Choo SW, Chiu KP, Zhao XD, Srinivasan KG, et al. Fusion transcripts and transcribed retrotransposed loci discovered through comprehensive transcriptome analysis using Paired-End diTags (PETs). Genome Res 2007;17(6):828-38.
[23] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B, et al. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods 2008;5(7):621-8.
[24] Morin RD, Mendez-Lago M, Mungall AJ, Goya R, Mungall KL, Corbett RD, et al. Frequent mutation of histone-modifying genes in non-Hodgkin lymphoma. Nature 2011;476(7360):298-303.
[25] Roberts KG, Morin RD, Zhang J, Hirst M, Zhao Y, Su X, et al. Genetic alterations activating kinase and cytokine receptor signaling in high-risk acute lymphoblastic leukemia. Cancer Cell 2012;22(2):153-66.
[26] Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 2012;486(7403):395-9.
[27] Jones DT, Jäger N, Kool M, Zichner T, Hutter B, Sultan M, et al. Dissecting the genomic complexity underlying medulloblastoma. Nature 2012;488(7409):100-5.
[28] Hammerman PS, Lawrence MS, Voet D, Jing R, Cibulskis K, Sivachenko A, et al. Comprehensive genomic characterization of squamous cell lung cancers. Nature 2012;489(7417):519-25.
[29] Ameur A, Wetterbom A, Feuk L, Gyllensten U. Global and unbiased detection of splice junctions from RNA-seq data. Genome Biol 2010;11(3):R34.
[30] Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10(1):57-63.
[31] Mwenifumbo JC, Marra MA. Cancer genome-sequencing study design. Nat Rev Genet 2013;14(5):321-32.
[32] Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y, et al. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008;18(9):1509-17.
[33] Levin JZ, Berger MF, Adiconis X, Rogov P, Melnikov A, Fennell T, et al. Targeted next-generation sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts. Genome Biol 2009;10(10):R115.
[34] Halvardson J, Zaghlool A, Feuk L. Exome RNA sequencing reveals rare and novel alternative transcripts. Nucleic Acids Res 2013;41(1):e6.
[35] Mercer TR, Gerhardt DJ, Dinger ME, Crawford J, Trapnell C, Jeddeloh JA, et al. Targeted RNA sequencing reveals the deep complexity of the human transcriptome. Nat Biotechnol 2012;30(1):99-104.
[36] Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods 2010;7(2):111-8.
[37] Schageman J, Cheng A, Bramlett K. RNA sequencing and quantitation using targeted amplicons. J Biomol Tech 2013;24(Suppl.):S42.
[38] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25(16):2078-9.
[39] Li JB, Levanon EY, Yoon JK, Aach J, Xie B, Leproust E, et al. Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science 2009;324(5931):1210-3.


[40] Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, et al. An improved ultrafast tool for short read alignment. Bioinformatics 2009;25(15):1966-7.
[41] Zhang K, Li JB, Gao Y, Egli D, Xie B, Deng J, et al. Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human. Nat Methods 2009;6(8):613-8.
[42] Blomquist TM, Crawford EL, Lovett JL, Yeo J, Stanosezk LM, Levin A, et al. Targeted RNA-sequencing with competitive multiplex PCR amplicon libraries. PLoS One 2013;8(12):e79120.
[43] Clark SJ, Harrison J, Paul CL, Frommer M. High sensitivity mapping of methylated cytosines. Nucleic Acids Res 1994;22(15):2990-7.
[44] Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci USA 1992;89(5):1827-31.
[45] Taylor KH, Kramer RS, Davis JW, Guo J, Duff DJ, Xu D, et al. Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res 2007;67(18):8511-8.
[46] Varley KE, Mitra RD. Bisulfite patch PCR enables multiplexed sequencing of promoter methylation across cancer samples. Genome Res 2010;20(9):1279-87.
[47] Komori HK, LaMere SA, Torkamani A, Hart GT, Kotsopoulos S, Warner J, et al. Application of microdroplet PCR for large-scale targeted bisulfite sequencing. Genome Res 2011;21(10):1738-45.
[48] Herrmann A, Haake A, Ammerpohl O, Martin-Guerrero I, Szafranski K, Stemshorn K, et al. Pipeline for large-scale microdroplet bisulfite PCR-based sequencing allows the tracking of hepitype evolution in tumors. PLoS One 2011;6(7):e21332.
[49] Huentelman MJ. Targeted next-generation sequencing: microdroplet PCR approach for variant detection in research and clinical samples. Expert Rev Mol Diagn 2011;11(4):347-9.
[50] Rusk N, Kiermer V. Primer: sequencing: the next generation. Nat Methods 2008;5(1):15.
[51] Shyr D, Liu Q. Next generation sequencing in cancer research and clinical application. Biol Proced Online 2012;15(1):4.
[52] D'Antonio M, Pendino V, Sinha S, Ciccarelli FD. Network of cancer genes (NCG 3.0): integration and analysis of genetic and network properties of cancer genes. Nucleic Acids Res 2012;40(Database issue):D978-83.
[53] Abdel-Wahab O. Molecular genetics of acute myeloid leukemia: clinical implications and opportunities for integrating genomics into clinical practice. Hematology 2012;17(Suppl. 1):S39-42.
[54] Druker BJ, Talpaz M, Resta DJ, Peng B, Buchdunger E, Ford JM, et al. Efficacy and safety of a specific inhibitor of the BCR-ABL tyrosine kinase in chronic myeloid leukemia. N Engl J Med 2001;344(14):1031-7.
[55] Mitsudomi T, Morita S, Yatabe Y, Negoro S, Okamoto I, Tsurutani J, et al. Gefitinib versus cisplatin plus docetaxel in patients with non-small-cell lung cancer harbouring mutations of the epidermal growth factor receptor (WJTOG3405): an open label, randomised Phase 3 trial. Lancet Oncol 2009;11(2):121-8.
[56] Mok TS, Wu YL, Thongprasert S, Yang CH, Chu DT, Saijo N, et al. Gefitinib or carboplatin-paclitaxel in pulmonary adenocarcinoma. N Engl J Med 2009;361(10):947-57.
[57] Rosell R, Moran T, Queralt C, Porta R, Cardenal F, Camps C, et al. Screening for epidermal growth factor receptor mutations in lung cancer. N Engl J Med 2009;361(10):958-67.
[58] Peters S, Taron M, Bubendorf L, Blackhall F, Stahel R, et al. Treatment and detection of ALK-rearranged NSCLC. Lung Cancer 2013;81(2):145-54.
[59] Schweiger MR, Kerick M, Timmermann B, Isau M. The power of NGS technologies to delineate the genome organization in cancer: from mutations to structural variations and epigenetic alterations. Cancer Metastasis Rev 2011;30(2):199-210.
[60] Klee EW, Hoppman-Chaney NL, Ferber MJ. Expanding DNA diagnostic panel testing: is more better? Expert Rev Mol Diagn 2011;11(7):703-9.
[61] Fraenkel-Conrat H. Reaction of nucleic acid with formaldehyde. Biochim Biophys Acta 1954;15(2):307-9.
[62] Greer CE, Peterson SL, Kiviat NB, Manos MM. PCR amplification from paraffin-embedded tissues: effects of fixative and fixation time. Am J Clin Pathol 1991;95(2):117-24.
[63] Ferrer I, Armstrong J, Capellari S, Parchi P, Arzberger T, Bell J, et al. Effects of formalin fixation, paraffin embedding, and time of storage on DNA preservation in brain tissue: a BrainNet Europe study. Brain Pathol 2007;17(3):297-303.
[64] Jewell SD, Srinivasan M, McCart LM, Williams N, Grizzle WH, LiVolsi V, et al. Analysis of the molecular quality of human tissues: an experience from the Cooperative Human Tissue Network. Am J Clin Pathol 2002;118(5):733-41.
[65] Green MR, Sambrook J. Molecular cloning: a laboratory manual. 4th ed. New York, NY: Cold Spring Harbor Laboratory Press; 2012.
[66] Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357-9.
[67] Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010;26(5):589-95.
[68] Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25(14):1754-60.
[69] Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008;18(11):1851-8.
[70] Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009;25(9):1105-11.
[71] Dalca AV, Rumble SM, Levy S, Brudno M. VARiD: a variation detection framework for color-space and letter-space platforms. Bioinformatics 2010;26(12):i343-9.
[72] Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 2012;22(3):568-76.
[73] Au KF, Jiang H, Lin L, Xing Y, Wong WH. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res 2010;38(14):4570-8.
[74] Jia W, Qiu K, He M, Song P, Zhou Q, Zhou F, et al. SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome Biol 2013;14(2):R12.
[75] Piazza R, Pirola A, Spinelli R, Valletta S, Redaelli S, Magistroni V, et al. FusionAnalyser: a new graphical, event-driven tool for fusion rearrangements discovery. Nucleic Acids Res 2012;40(16):e123.


[76] Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. A method and server for predicting damaging missense mutations. Nat Methods 2010;7(4):248-9.
[77] Kumar P, Henikoff S, Ng PC. Predicting the effects of coding nonsynonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009;4(7):1073-81.
[78] Wong WC, Kim D, Carter H, Diekhans M, Ryan MC, Karchin R. CHASM and SNVBox: toolkit for detecting biologically important single nucleotide mutations in cancer. Bioinformatics 2011;27(15):2147-8.
[79] Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010;38(16):e164.
[80] Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2008;36(Database issue):D13-21.
[81] Cooper DN, Stenson PD, Chuzhanova NA. The Human Gene Mutation Database (HGMD) and its exploitation in the study of mutational mechanisms. Curr Protoc Bioinformatics 2006;1(13):120.
[82] Cook-Deegan R, Conley JM, Evans JP, Vorhaus D. The next controversy in genetic testing: clinical data as trade secrets? Eur J Hum Genet 2013;21(6):585-8.
[83] Soden SE, Farrow EG, Saunders CJ, Lantos JD. Genomic medicine: evolving science, evolving ethics. Pers Med 2012;9(5):523-8.
[84] Korf BR, Rehm HL. New approaches to molecular diagnosis. J Am Med Assoc 2013;309(14):1511-21.
[85] Fialho AM, Chakrabarty AM. Patent controversies and court cases: cancer diagnosis, therapy and prevention. Cancer Biol Ther 2012;13(13):1229-34.
[86] Fiume M, Williams V, Brook A, Brudno M. Savant: genome browser for high-throughput sequencing data. Bioinformatics 2010;26(16):1938-44.


C H A P T E R

19

Targeted Hybrid-Capture for Somatic Mutation Detection in the Clinic

Catherine E. Cottrell, Andrew J. Bredemeyer and Hussam Al-Kateb

Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA

Clinical Genomics. DOI: http://dx.doi.org/10.1016/B978-0-12-404748-8.00019-8
© 2015 Elsevier Inc. All rights reserved.

O U T L I N E

Introduction
Clinical Utility of Somatic Mutation Detection in Cancer
Description of Hybridization-Based Methodology
  Solid-Phase Versus In-Solution Phase Capture
  Comparison of In-Solution Hybridization Capture-Based and Amplification-Based Targeted Enrichment Methods for Molecular Oncology Testing
Utility of Targeted Hybrid Capture
  Analysis of Large Number of Genes Involved in Cancer
  Useful for Precious Samples
  Amenable to Multiplexing
  Detection of a Full Range of Mutations
    Detection of Structural Rearrangements (Translocations, Inversions, and Indels)
    Copy Number Variation Detection
  Cost-effectiveness
  Provides High Depth of Coverage
NGS in a Clinical Laboratory Setting
  Design of the Clinical Assay
  Specimen Requirements for Somatic Variant Detection
  Pathologic Assessment
  Reportable Range
  Genetic Targets
  QC Metrics
  Validation
Conclusion
References

KEY CONCEPTS

• Sequence analysis of tumor samples enables detection of genomic alterations which may aid in diagnosis, prognosis, and therapeutic decision making.
• Next-generation sequencing (NGS) by targeted hybridization capture is a sensitive and specific method to detect somatic alterations in tumor samples.
• Use of a targeted hybridization capture approach for NGS may enable detection of all classes of genomic alterations, including single nucleotide variants, insertion/deletion events, copy number alterations, and structural rearrangements.
• Characteristics of targeted hybridization capture-based NGS include the ability to enrich for large target regions, the ability to detect multiple classes of genomic alteration, utility with low input amounts of DNA, the ability to multiplex specimens, and reasonable cost.
• Key performance characteristics for a targeted hybridization capture assay include analytic sensitivity, analytic specificity, limit of detection, and reproducibility.
• Pathologic assessment of a tumor sample is a critical component of a clinical workflow utilizing targeted hybridization capture, to verify the presence of malignant tissue and to ensure that quality and quantity standards are met.
• Quality control metrics for hybridization capture-based clinical assays should include measures at specimen intake as well as during DNA extraction, library preparation, and sequencing.

INTRODUCTION

The first successful targeted therapy in oncology grew out of the landmark discovery of a selective ABL proto-oncogene 1 (ABL) tyrosine kinase inhibitor in chronic myeloid leukemia (CML) in 1996, a breakthrough in the battle against cancer. In that study, a selective ABL tyrosine kinase inhibitor achieved a 92-98% reduction in the number of BCR-ABL1 colonies grown from blood or bone marrow of patients with CML, without affecting normal colonies [1]. In solid tumors, Herceptin (trastuzumab) was the first example of gene-based cancer drug therapy. Herceptin was designed by humanizing the mouse monoclonal antibody 4D5 [2], which binds HER2/neu on the surface of tumor cells, thereby inducing receptor internalization and inhibiting cell cycle progression [3].

The response rate achieved by ABL1-targeted inhibitors in CML patients intensified the search for activating mutations in oncogenes in different types of cancer, driven by the hope that selective targeting of those oncogenes could yield similar response rates. These efforts led to the discovery of activating mutations in exon 11 of the KIT gene [4] in gastrointestinal stromal tumor (GIST). Until the year 2000, the prognosis of patients with metastatic GISTs was very bleak. A dramatic clinical and radiographic response to treatment with imatinib mesylate was then shown in a single patient with advanced, chemotherapy-resistant GIST [5]. Many other types of activating mutations have since been discovered in different types of cancer. These discoveries quickly found their way into clinical application, where patients with certain types of cancer are screened for well-documented activating mutations for which an approved drug is available. Clinical screening tests require a high accuracy rate so that the results produced are highly reliable.

Since the first cancer genome was sequenced in 2008, several thousand cancer genomes have been sequenced, generating an explosion in the number of mutations identified, many of which have clinical significance. Translation of these discoveries into the clinic required the adoption of tumor profiling techniques by oncologists and pathologists, whereby information of diagnostic and prognostic relevance (dependent upon the context in which an assay is applied) can be obtained and applied to clinical patient management. Because of the heterogeneity of the types of alterations identified even within the same type of cancer, and the complexity of those mutations, a methodology was needed that could adequately address these challenges. One such methodology, uniquely adaptable for the purpose of somatic variant detection, is next-generation sequencing (NGS). As a nucleic acid sequencing technique, NGS allows for the identification of bases across a short read length of DNA through multiple parallel reactions (also known as massively parallel sequencing). This methodology is highly scalable, can be used to produce sequence for a small targeted region or an entire genome, and allows for rapid data generation. The utility of any clinical assay has to be well defined before the assay is offered, as required by CLIA regulations.
Since clinical utility has been defined for only several hundred to a few thousand mutations in a limited subset of genes across different types of cancer, interrogation of the entire genome in a clinical setting is currently unjustified. Therefore, targeted interrogation of the genome is currently the most reasonable approach to apply in the clinic. Through the use of PCR amplification of the targeted regions, or solid-phase or in-solution hybrid capture of the targeted regions, NGS allows for the design of a strategic and highly scalable methodology for use in a clinical setting.

CLINICAL UTILITY OF SOMATIC MUTATION DETECTION IN CANCER

Cancer constitutes the second most common cause of death in the USA, preceded only by heart disease. According to the American Cancer Society, around 600,000 people died of cancer in the USA in 2013, and by 2030, it is estimated that 13 million people per year will die of cancer worldwide [6]. The National Institutes of Health (NIH) estimates $77.4 billion in direct medical costs and $124 billion in indirect mortality costs due to productivity loss caused by premature death.

Cancer is considered a genetic disease. The genomes of all cancer cells carry somatic alterations [7]. Some of these are named "driver mutations" because they confer selective clonal growth advantage or evasion from apoptosis and are causally involved in oncogenesis. The remainder are passenger mutations, which do not contribute to the development of cancer but are carried along with driver mutations. Because cancer cells are reliant upon driver mutations, targeting these mutations by pharmacologic inhibition can be highly effective. Identifying somatic driver mutations in cancer can have one or more of the following clinical applications.

First, predictive somatic mutations can serve as pharmacogenomic biomarkers that predict how a patient may respond to a drug with respect to toxicity or efficacy. For example, alterations in exon 19 of EGFR in patients with non-small cell lung cancer (NSCLC) were found to confer responsiveness to treatment with gefitinib [8]; a 93% disease control rate (DCR) and a 100% DCR were observed in patients with lung adenocarcinoma carrying insertions in exon 20 of the ERBB2 gene when treated with trastuzumab- and afatinib-based therapy, respectively [9]. Similarly, patients with NSCLC or lung adenocarcinoma who carried inversions in ALK or translocations in ROS1 achieved a 61% objective response or showed tumor shrinkage, respectively, when treated with crizotinib [10,11]. Other somatic mutations, however, predict resistance to therapy with tyrosine kinase inhibitors (TKIs), such as KRAS mutations in lung cancer [12]. Use of TKIs in NSCLC patients with EGFR mutations has been associated with better clinical outcome, whereas use of the same drugs in NSCLC patients who carry KRAS mutations has been associated with poor clinical outcome [12]. In addition, certain somatic mutations arise after targeted therapy and cause resistance to therapy, e.g., the T790M mutation in EGFR exon 20 in lung cancer [12] or the T315I mutation in CML [13].

Second, prognostic somatic mutations can provide information on the risk of disease progression or relapse. For example, the internal tandem duplication (ITD) in the fms-related tyrosine kinase 3 (FLT3) gene is associated with poor prognosis in acute myeloid leukemia (AML), whereas mutations in nucleophosmin (NPM1) are associated with a favorable prognosis [14].

DESCRIPTION OF HYBRIDIZATION-BASED METHODOLOGY

NGS technologies such as those employed by the Illumina and Life Technologies platforms require molecular biology steps, known as library preparation, prior to loading on the sequencing instrument (for a detailed discussion of NGS techniques including library preparation, see Chapter 1). Library preparation for use in these technologies yields DNA fragments, generated by PCR or by fragmentation, with the appropriate flanking linker sequences for primer annealing and extension so that sequence information can be read out during synthesis. Nucleic acid barcodes, or indexes, of defined sequence are often added during library preparation as well, to enable sequence multiplexing and, less commonly, pooled hybridization capture. Sequencing of targeted subsets of the genome, such as the exome or custom sets of genes, requires enrichment of the prepared library for the targets of interest by one of a number of technologies.

Solid-Phase Versus In-Solution Phase Capture

Hybridization capture-based enrichment of templates for sequencing was an extension of existing high-density microarray technologies and, as such, began as capture of genomic DNA by oligonucleotides fixed to a solid substrate. Instead of using hybridization of DNA to the array per se as a readout, hybridized DNA was recovered from the array for sequence analysis. Several simultaneous reports described this approach as a means of efficiently enriching for large contiguous regions, or many megabases of smaller dispersed targets [15-17], which was impossible by standard PCR-based enrichment, and inefficient and imprecise by selection with bacterial artificial chromosomes [18]. This approach was demonstrated to work for whole exome sequencing [19]. While microarray-based capture is capable of enriching for relatively large genomic regions and is sufficiently high throughput for many targeted NGS needs, it is not immediately accessible to laboratories without expensive microarray instrumentation [20]. The adaptation of microarray-based oligonucleotide synthesis to a fluid-phase capture approach [21,22] enabled wide adoption of hybridization capture-based target enrichment by research and clinical laboratories. In one approach, array-synthesized oligos are cleaved from the solid support, PCR


amplified, and biotinylated and, in some cases, transcribed into copy RNA (cRNA) before biotinylation. In another approach, individually synthesized oligonucleotide probes can be employed to generate a set of baits for hybridization capture. By either method, the biotinylated DNA or cRNA probes are hybridized to sheared genomic DNA in solution, followed by separation from unhybridized genomic DNA using streptavidin-coated beads.

The in-solution capture approach offers a number of advantages. Amplification of probes allows a large molar excess of probes relative to the genomic targets and therefore more favorable kinetics. This reduces the genomic DNA input required to achieve the needed depth of unique coverage by an order of magnitude, from 10 μg to 1 μg or less. Additionally, solution capture is more efficient than solid-phase capture for fragments much smaller than 500 bp [16]. Because most surgical oncology specimens are formalin-fixed, paraffin-embedded (FFPE) archival tissues, nucleic acids are routinely degraded to fragment sizes under 1 kb. These relaxed nucleic acid quantity and quality requirements make solution capture ideal for clinical oncology targeted NGS applications. In-solution capture offers another key benefit to clinical laboratories, namely the production of large batches of enrichment probes that can be tested, validated, and frozen as aliquots for routine use over weeks or months [21]. Synthesis of new solid-phase capture probes for each set of clinical specimens would not only be expensive but would pose a challenge for quality control. Relative to in-solution capture, solid-phase capture can yield more evenly distributed sequence coverage: probe synthesis on the fixed substrate proceeds at controlled quantities, but once probes are cleaved and amplified by PCR, probe ratios may become unbalanced [22] (although individual synthesis of probes rather than array-based synthesis largely overcomes this limitation). On the whole, in-solution capture is superior for the clinical laboratory due to the efficiency of capture, the selection of vendors offering custom probe design, and laboratory workflow considerations.

Several vendors currently offer custom in-solution capture probes suitable for somatic mutation testing. The most prominent include Agilent Technologies (Santa Clara, CA), as used by several clinical laboratories [16,23]; Roche NimbleGen (Madison, WI); Life Technologies (Carlsbad, CA); and Integrated DNA Technologies (IDT; Coralville, IA), as used by several clinical laboratories [24,25]. Agilent Technologies' SureSelect probes are 120-mer cRNAs transcribed from photochemically synthesized oligonucleotide baits; the other three products are DNA probes that can be of variable length but are in general shorter than 120 bp. IDT individually synthesizes oligonucleotide probes, setting it apart from the other major vendors of hybridization capture reagents. Life Technologies' Ion TargetSeq enrichment kits are optimized for Life Technologies' Ion PGM and Proton sequencers and may outperform other probe designs on those platforms, while the other vendors' probe products are largely sequencing-platform agnostic. Comparisons between Agilent SureSelect and Roche NimbleGen SeqCap EZ capture baits have demonstrated similar performance on Illumina sequencing platforms [20]. Any of these vendors can conduct probe design and synthesis capable of yielding good performance, but any de novo design comes with some expectation of redesign and optimization.

Comparison of In-Solution Hybridization Capture-Based and Amplification-Based Targeted Enrichment Methods for Molecular Oncology Testing

As discussed in Chapter 3, approaches for targeted enrichment can be divided into amplification-based enrichment and hybridization capture-based enrichment. The former uses genomic or other DNA as a template for PCR, using strategically designed primers to selectively amplify regions of interest. As discussed above, hybridization capture-based enrichment methods rely upon strong noncovalent interactions between complementary single-stranded nucleic acid probes, or "baits," and fragmented genomic DNA in order to separate genomic sequences of interest from the rest of the genome. Amplification-based enrichment offers advantages for the clinical laboratory because it is quick and inexpensive and is amenable to multiplexing. However, as discussed below, amplification-based enrichment is not feasible for testing large gene panels (greater than roughly 500 kb), is prone to artifact from PCR errors and amplification bias, and is unable to detect all of the types of variation relevant to molecular oncology. Microfluidic PCR technologies [26] can enable the amplification of hundreds of amplicons in a single assay, but only with short amplicons and target regions of less than 1 Mb in size. It is important to emphasize that even capture methodologies depend upon limited-cycle amplification of libraries. However, hybridization capture-based approaches overcome PCR errors and amplification bias because they are based on shotgun library preparation, whereby PCR duplicates can be distinguished from distinct, independently sampled alleles. Because hybridization capture uses sheared genomic DNA as its starting material,


most captured genomic fragments are unique, i.e., the combination of the genomic positions of a fragment's two ends is not shared with other fragments in the library. Demonstrating that the sequenced genomic fragments are unique ensures that a single base call is not overrepresented relative to its true frequency among the genomes sampled and, perhaps more importantly, limits the contribution of propagated PCR errors to the pool of base calls used for variant calling. In addition, the ability to distinguish sequence fragments derived from separate genomes makes it possible to evaluate more precisely the depth of coverage, as well as library complexity. Relative to amplification-based enrichment methods, then, hybridization capture-based methods limit false-positive (FP) calls attributable to PCR error. In the context of somatic mutation testing in cancer, this can be particularly important for detecting variants at lower allele fractions. With amplicon enrichment, PCR errors in early cycles are prone to magnification in low-complexity libraries, or as a result of PCR bias; this severely constrains the limit of detection (LOD) of an assay because specificity suffers at variant allele fractions much below 10%.

Importantly, amplicon-based library preparation limits the types of variation that can be detected. Because library generation by this technique depends upon successful annealing of primers that flank small (often 100-300 bp) stretches of genomic DNA, large insertions and deletions as well as structural rearrangements are difficult or impossible to detect. Amplicons containing evidence of this type of variation may not be generated during library preparation, either because no primer pair flanks the variation (e.g., a translocation breakpoint) or because the flanked region is of an unexpected size (e.g., due to a large insertion), rendering it difficult to amplify under the defined PCR conditions. Alternatively, a successfully amplified region containing a large insertion may yield reads that are difficult to map and/or provide incomplete size and sequence information about the insertion. Capture-based methodologies enable detection of larger insertion/deletion events and structural rearrangements in three ways, all attributable to the approach's reliance on shotgun library preparation. First, targeted enrichment by hybridization is permissive of off-target capture. For example, it will yield sequence reads from a translocation partner that is part of a genomic fragment linked to a targeted sequence; the translocated segment will therefore be sequenced despite the translocation event not being explicitly targeted in the capture reagent design. Second, the often broad distribution of fragment sizes, when coupled with paired-end sequencing, can help pinpoint unexpected and/or untargeted sequences adjacent to the targeted sequence, providing analysis algorithms with evidence of large insertions, duplications, or rearrangements. Third, the random fragment ends yield many sequence start positions, enabling more even coverage of large structural changes than is achievable under even the best circumstances with amplification-based enrichment strategies, which rely on a relatively small number of primer binding sites and coarser blocks of coverage.
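The value of fragment uniqueness described above can be illustrated with a short sketch: when reads are collapsed by their shared fragment endpoints before base counting, a PCR error propagated through duplicates contributes only a single observation. This is a simplified stand-in for the duplicate-marking step of standard pipelines, and the read tuples are invented for illustration.

```python
# Sketch: collapsing PCR duplicates in a shotgun library by fragment
# endpoints. Reads sharing (chrom, start, end) derive from one original
# fragment, so a propagated PCR error is counted once. Data invented.

from collections import defaultdict

# (chrom, fragment_start, fragment_end, base_at_variant_position)
reads = [
    ("chr7", 55242400, 55242650, "T"),
    ("chr7", 55242400, 55242650, "T"),  # PCR duplicate of the first read
    ("chr7", 55242400, 55242650, "T"),  # PCR duplicate of the first read
    ("chr7", 55242380, 55242630, "C"),  # independent fragment
    ("chr7", 55242412, 55242660, "C"),  # independent fragment
]

one_read_per_fragment = {}
for chrom, start, end, base in reads:
    one_read_per_fragment.setdefault((chrom, start, end), base)

counts = defaultdict(int)
for base in one_read_per_fragment.values():
    counts[base] += 1

print(dict(counts))  # {'T': 1, 'C': 2}: duplicates no longer inflate 'T'
```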
Enrichment by hybridization capture does suffer from several drawbacks relative to amplification-based methods that are relevant for the clinical laboratory. Library construction and capture-based enrichment require 2-4 days to complete, while library construction by amplification can be completed in a day or less, which is more palatable for clinical testing time frames. Per base enriched and sequenced, capture is usually more expensive, especially at medium to high throughput; this is attributable to the increased labor of longer library preparation, the higher reagent cost of capture probes (particularly custom array-synthesized probes) relative to PCR primers, and the added cost of sequencing the off-target regions captured by this method. Specimen requirements for the two approaches differ as well. Experience with capture-based methods for molecular oncology testing suggests that variants at a low allele fraction may not be reliably detected with DNA input amounts of less than 200 ng [23], as discussed below, although some laboratories performing capture-based somatic mutation testing require as little as 50 ng [24]. In contrast, several recently described amplification-based clinical molecular oncology assays use as little as 10 ng of genomic DNA as starting material, with good reported sensitivity [27,28]. The ability to perform sequence analysis with very small DNA input quantities is attractive because biopsy specimens are often depleted by prior diagnostic testing, or are limited because the biopsy is very small, as with fine needle aspiration specimens [28].

Pseudogenes and other genomic sequences that are homologous to targeted genes of interest can interfere with accurate mutation detection. Such homologous sequences are difficult to exclude from hybridization capture-enriched libraries because capture probes can tolerate mismatches while retaining high affinity. Instead, informatic methods must be employed to minimize the effects of this off-target capture and sequencing. Paired-end sequencing of hybridization capture-enriched libraries offers superior mappability relative to single-end sequencing because information from both reads is available to anchor a pair to a unique genomic position. In theory, amplification-based enrichment can outperform hybrid capture in particularly difficult-to-map regions if PCR primers are strategically designed to avoid amplification of homologous genomic sequences. However, in cases of large regions of 100% identity, small amplicon libraries may be unable to generate reads that map uniquely.


UTILITY OF TARGETED HYBRID CAPTURE

Analysis of Large Number of Genes Involved in Cancer

While the cost of sequencing an entire genome is decreasing due to the development and commercialization of a new generation of sequencing methodologies and instruments, the majority of diagnostic goals may be achieved by targeting a specific subset of the genome. In clinical settings, efficient and cost-effective targeting of genes of interest can significantly lower the sequencing costs of a test while maximizing its diagnostic utility.

Genetic heterogeneity is a well-known feature of cancer, particularly in solid tumors. For example, in lung cancer, the leading cause of cancer-related death in the world [29], recurrent mutations in potentially therapeutic targets have been identified in the HGF, MET, JAK2, and EPHA3 genes, and fusion genes for which targeted inhibitors are available have been identified, including KDELR2-ROS1 and EML4-ALK [30]. A whole genome and exome sequencing study of NSCLC identified a total of 54 genes with potentially druggable alterations, including point mutations, copy number variations, and high gene expression levels, with a median of 11 potentially druggable targets per patient, illustrating the heterogeneity of potentially therapeutic targets in this type of cancer [30]. In lung squamous cell carcinoma, a large study that included 178 tumors indicated that 96% of tumors harbor one or more mutations in tyrosine kinases, serine/threonine kinases, PI3K catalytic and regulatory subunits, nuclear hormone receptors, G-protein-coupled receptors, proteases, and tyrosine phosphatases [31]. Of note, 39% of tyrosine kinase and 42% of serine/threonine kinase gene mutations were located in the kinase domain. However, the frequency of each mutation represents only a small percentage of all mutations identified, implying that a large number of genes need to be sequenced in each patient in order to capture such mutations. Hybrid capture methodologies have the flexibility to target a wide range of genes, from one gene to the entire exome. Currently available liquid capture kits from Agilent have a target size that ranges from 1 kb to 24 Mb and can accommodate 16 to 480 samples; liquid capture kits are also available from NimbleGen, with a target size that ranges from 100 kb to 64 Mb, accommodating 4 to 960 samples.

Applications for Precious Samples

Tumor specimens, whether excision specimens or cytology specimens, can vary greatly in size and consequently in DNA yield. The estimated DNA yield from cellular material obtained from a single FNA slide is variable, ranging from 10 to 3000 ng [32]. On-array capture protocols require relatively large amounts of DNA, around 10-15 μg, irrespective of target size [20]. Currently, protocols provided by Agilent for in-solution capture of the entire exome require 1 μg of input DNA; however, they can be modified to accommodate samples with low levels of input DNA without affecting the yield of library preparation or downstream sequence output. Such protocol changes should minimize the number of PCR cycles required to produce enough DNA for the prehybridization step, since excessive PCR amplification can introduce severe bias due to decreased representation of genomic regions with high G+C content [20]. In one study, use of different amounts of input DNA with the same number of prehybridization PCR cycles (six cycles) showed no effect on library complexity when using 500 ng versus 3000 ng of input DNA for a capture size of roughly 350 kb [33]. However, two factors need to be considered when testing samples with low input DNA: first, the impact of low DNA input on the ability to generate enough library product for the hybridization step (generally 500 ng), and second, the impact of low DNA input on library complexity. In another study, the impact of low DNA input on the number of PCR cycles needed to produce enough DNA for hybridization (500 ng) was evaluated for a 960 kb capture region; for input DNA samples of at least 200 ng, eight PCR cycles were needed to generate the required amount of DNA for hybridization, but additional PCR cycles were needed for samples with lower DNA input (submitted). Similarly, when library complexity is defined as 400× unique coverage at greater than 80% of positions, using a capture size of 960 kb for DNA from FFPE tissue, the percentage of positions that achieved 400× unique coverage was the same for 200, 400, and 1000 ng of input DNA, but dropped to around 45% for 100 ng of input DNA (Figure 19.1). Likewise, for cases with between 200 and 399 ng of input DNA, an average of about 75% of positions achieved 400× unique coverage, sufficient to permit sensitive variant detection, but for specimens yielding 100-199 ng of DNA that metric fell to roughly 50% of positions, which may compromise variant detection (Figure 19.2).
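The library-complexity metric used above, the percentage of targeted positions reaching a given unique-coverage depth, can be computed directly from a per-base depth track, as in the sketch below; the 400× threshold follows the text, while the depth values are invented.

```python
# Sketch: fraction of targeted positions at or above a unique-coverage
# threshold (400x, as in the text). The per-base depths are invented.

def fraction_at_depth(depths, threshold=400):
    """Fraction of positions with unique coverage >= threshold."""
    return sum(1 for d in depths if d >= threshold) / len(depths)

per_base_unique_depth = [523, 610, 388, 402, 450, 397, 720, 415]  # toy data
pct = 100 * fraction_at_depth(per_base_unique_depth)
print(f"{pct:.1f}% of positions at >=400x unique coverage")  # 75.0%
```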

FIGURE 19.1 Target base coverage as a function of input mass (DNA dilution).

FIGURE 19.2 Percentage of bases covered at different levels of sequencing depth for different amounts of input DNA.

Amenable to Multiplexing

Current sequencing platforms provide physical separation of samples during the sequencing run. 454/Roche pyrosequencers can accommodate up to 16 independent samples, physically separated by using manifolds;

each of the 16 regions can yield an average of 0.63 and 2.88 Mb per run on the GS20 and GS FLX sequencing platforms, respectively. The Illumina HiSeq 2500 system uses flow cells with eight lanes, which can likewise provide physical separation of eight samples. A single flow cell on the 2500 system can generate up to 1.5 billion single reads or 3 billion paired-end reads, which provides an estimated coverage of greater than 500× for whole exome sequencing. However, for smaller applications, when only a few genes need to be sequenced, running one sample per lane is impractical for logistical reasons and cost considerations. To overcome such limitations, a barcoding approach, whereby sequence tags are introduced into the DNA fragments during library generation, is a better alternative. However, such an approach must be efficient and sensitive enough to trace sequences to their origins within an environment where background sequences lacking sequence tags, misassignment of sequences to their sample of origin, and heterogeneous sequence representation among samples can result from sequencing errors and incomplete reactions.

One approach to tagging samples uses barcoding adapters and a restriction system able to exclude background sequences. For the 454 pyrosequencing platform, 48 forward-reverse barcode pairs, each with at least four unique nucleotide positions, have been designed [34], and the combination of forward and reverse barcodes can be used to sequence as many as n² independent libraries for each set of n forward and n reverse barcodes (e.g., 2304 barcode combinations for 48 barcodes). The barcoded primers used for the PCR amplification step can be 45-46 nucleotides (nt) in length and are composed of the forward or reverse adapter, followed by the forward or reverse barcode (10 nt plus an additional four unique nucleotides), and then the cloning linker. The length of the unique portion was based on the observation that the false-discovery rate, defined as the percentage of "perfect but unexpected" occurrences, decreased to less than 0.005% when three bases of the linker were used in addition to the 10-base barcode to assign sample identities; a motif of 13 nt was therefore determined to be the ideal sequence tag to minimize the false-discovery rate and maximize the number of sequences with error-free hits [34].

With indexing, there is a chance of cross-sample contamination during library preparation; multiplexing cases in one sequencing lane can also cause sample cross-contamination. However, it has been shown that the frequency of reads with the most commonly encountered spurious index due to cross-sample contamination is negligible [23], as is the crossover between indexed samples [23]. Multiplexing enabled by hybrid capture-based methods therefore maximizes the capacity of NGS, allowing highly parallel sequencing of pooled libraries generated using different sets of baits, or of pooled libraries generated using the same set of baits but from different individuals. In addition, in-solution hybrid capture allows indexes to be introduced during the posthybridization PCR amplification. Depending on the target size and the required level of coverage, the number of clinical samples that can be pooled in one flow cell of Illumina's HiSeq and MiSeq systems will vary.
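The combinatorial dual-barcode idea discussed above is sketched below: n forward and n reverse barcodes distinguish n² libraries, and candidate barcodes can be screened so that any two differ at several positions, making it unlikely that sequencing errors convert one barcode into another. The barcode sequences here are made up for illustration and are not the published 454 barcode set.

```python
# Sketch: combinatorial dual barcodes. n forward x n reverse barcodes
# distinguish n**2 libraries (48 -> 2304, as in the text). The barcode
# sequences below are invented for illustration.

from itertools import product

def hamming(a, b):
    """Number of mismatched positions between equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

forward = ["ACGTACGTAC", "TGCATGCATG", "CATGCATGCA"]
reverse = ["GTACGTACGT", "ATGCATGCAT", "TACGTACGTA"]

# Require every pair within a set to differ at >= 4 positions, so that
# a few sequencing errors cannot turn one barcode into another.
for barcode_set in (forward, reverse):
    for a, b in product(barcode_set, repeat=2):
        if a != b:
            assert hamming(a, b) >= 4, (a, b)

combinations = list(product(forward, reverse))
print(len(combinations), "combinations from",
      len(forward), "forward x", len(reverse), "reverse barcodes")
# With 48 forward and 48 reverse barcodes: 48**2 = 2304 combinations.
```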

Detection of a Full Range of Mutations

Somatic mutations in cancer are highly heterogeneous. The number of mutations in a tumor differs by tumor type; generally, solid tumors exhibit more mutations than hematological tumors. A typical solid tumor has, on average, 33-66 somatically mutated genes, but the number is highly variable and can reach 163 in small cell lung cancer or be as few as four in rhabdoid cancer [35]. Leukemias and pediatric tumors carry fewer point mutations in comparison, averaging just under 10 per tumor [35]. In addition to point mutations, cancers present with a wide range of numerical chromosomal changes, including whole chromosome gains and losses; segmental aneuploidy, such as deletions and amplifications; and structural variation, including inversions, translocations, and complex mutations. While single nucleotide variations (SNVs) constitute the vast majority of mutations in all types of cancer studied so far, the frequency of other types of mutations (small indels, copy number changes, and structural variants) differs by cancer type (Figure 19.3).

FIGURE 19.3 Contribution of different types of alterations (translocations, deletions, amplifications, indels, and single-base substitutions) affecting genetic coding regions in different types of tumors. SBS, single-base substitutions; indels, small insertions and deletions. Figure adapted from Vogelstein et al. [35].

Detection of Structural Rearrangements (Translocations, Inversions, and Indels)

Recurrent balanced translocations have been identified in almost every tumor type (Mitelman database, http://cgap.nci.nih.gov/Chromosomes/Mitelman). Over 267 such translocations have been identified in AML, the most comprehensively studied type of tumor. Many translocations have a remarkable specificity to tumor type and clinical presentation [36,37] and to global gene expression profile [38].


In addition, many translocations are associated with diagnostic, prognostic, or therapeutic information (Table 19.1). Translocations exert their action either by deregulating gene expression or by producing a fusion of two genes. Burkitt lymphoma is representative of the first mechanism and is characterized by one of three translocations: t(8;14)(q24;q32), t(2;8)(p11;q24), or t(8;22)(q24;q11). In each, the chromosome 8 breakpoint is within or adjacent to the MYC gene. The other breakpoint always occurs within an immunoglobulin gene, encoding either the heavy chain (IGH) or the kappa (IGK) or lambda (IGL) light chains. The consequence of the translocation is that MYC becomes constitutively expressed due to the influence of the regulatory elements of the immunoglobulin gene [39]. The second mechanism creates a chimeric gene [39]. Fused genes can derive from the same chromosome (such as the fusion of PDGFRA to FIP1L1 in myeloproliferative leukemia, which results from a deletion of the segment separating these two genes) or from two different chromosomes (such as the ABL1 and BCR genes from chromosomes 9 and 22, respectively, which fuse in CML).

The frequency of translocations in cancer is estimated to be tenfold lower than that of SNVs [35]. Solid tumors exhibit, on average, a dozen translocations; however, most of them are passenger mutations, whose breakpoints often lie in gene-poor regions known as gene deserts.

Historically, translocations have been detected by chromosome analysis and fluorescence in situ hybridization (FISH), the latter of which is typically used for the detection of chromosomal rearrangements when at least one partner gene is known. Chromosome analysis provides a tool for whole genome analysis; however, it is a relatively low-resolution methodology, it requires dividing cells so that metaphases can be produced and analyzed, and it requires skilled interpretation. In addition, because of the low resolution of cytogenetic studies, the estimation of breakpoints can vary greatly, necessitating finer mapping techniques. FISH provides higher sensitivity for the detection of structural rearrangements and aneuploidy and does not require living cells; however, this technique is most useful when it is used to assay for discrete copy number or structural alterations (e.g., deletion 5q in AML, BCR/ABL1 t(9;22)(q34.1;q11.2) in CML, or PML/RARA t(15;17)(q24;q22) in acute promyelocytic leukemia). FISH provides less specific information in cases where the gene of interest recombines with many different partners. Two examples are rearrangements involving KMT2A (also known as mixed lineage leukemia, or MLL) and ETV6, where the former has 121 known partners [40] and the latter 30 [41]. Rearrangement detection in these scenarios can demand the use of multiple probes, leading to increased cost and test complexity [42,43].

Bioinformatic algorithms can be devised to allow for the detection of structural rearrangements and copy number alterations (CNAs), in addition to the SNVs and insertion/deletion events historically derived from sequencing data. One such example, known as paired-end methods (PEMs), makes use of NGS data generated by paired-end sequencing to detect structural abnormalities [44].

TABLE 19.1 Examples of Balanced Chromosomal Translocations and Their Clinical Significance

Chromosome Rearrangement | Gene Fusion | Clinical Characteristics

HEMATOLOGIC TUMORS
t(8;14)(q24;q32) | IGH-MYC | Burkitt lymphoma/leukemia, highly aggressive but good prognosis with intensive chemotherapy treatment
t(8;21)(q22;q22) | RUNX1-RUNX1T1 | AML, M2 type, with dysplastic features and Auer rods, good prognosis
t(9;22)(q34;q11) | BCR-ABL1 | CML, responds to treatment with imatinib and other TKIs
t(15;17)(q22;q21) | PML-RARA | Acute promyelocytic leukemia, responds to treatment with all-trans-retinoic acid, good prognosis

MALIGNANT SOLID TUMORS
t(11;22)(q24;q12) | EWSR1-FLI1 | Ewing sarcoma, mainly in children and adolescents and mainly skeletal
t(15;19)(q14;p13) | BRD4-NUT | Poorly differentiated carcinoma affecting midline structures in children and adolescents, very poor prognosis
t(4;6)(q15.2;q22) | SLC34A2-ROS1 | NSCLC, responds to treatment with crizotinib

Adapted from Ref. [43a].

After alignment to the reference genome, PEMs determine whether the paired-end reads are concordant, i.e., they map to the correct location with the correct orientation and the correct mapping distances with regard to the known insert size, or discordant, when the location, mapped distances, or orientation is not correct. Discordance can arise in a number of scenarios. Two ends can map to different reference chromosomes, potentially reflecting an interchromosomal rearrangement. Similarly, an intrachromosomal rearrangement is implicated when both ends map to the correct chromosome but the distance between the two ends is significantly larger than the estimated insert size. Large indels can also be detected by PEMs based on a discrepancy between the distance between the two ends and the estimated insert size; the distribution of insert sizes can be estimated from the mapped paired-end sequences on the reference genome [45]. PEMs can also be used to detect inversions, when a single read from a pair spanning one of the breakpoints of the inversion is in the incorrect orientation relative to the other read in the pair. Insertions from an extragenomic source (e.g., a viral insertion) can be suggested when one of the paired-end reads maps to the reference genome and the other does not. A combination of the above-mentioned structural abnormalities can exist in the same genome, and needless to say, disentangling such events is challenging.

The algorithms used for PEMs are either clustering-based or distribution-based. The clustering-based algorithms classify paired-end reads as concordant or discordant pairs, then cluster the discordant pairs according to the positions of the paired-end reads. The discordant pairs include those that have incorrect orientation or whose mapped distances fall below or above a fixed range of insert sizes. The size of indel that can be detected by these methods depends on the characteristics of the library sequenced and the bioinformatics algorithms used. For example, if the fixed size range has a cutoff of the mean insert size ±3 SD, any insertion smaller than the mean insert size +3 SD will not be called, because those pairs will be classified as concordant; in this setting, a library with a small insert size will permit detection of smaller insertions. Some distribution-based methods [46] compare the local distribution of insert sizes with the genome-wide distribution; the two distributions will be identical if there is no local indel and will be shifted if there is a local hetero- or homozygous indel. The clustering methods are able to identify large indels as well as translocations and inversions, whereas the distribution-based methods can detect smaller indels but not translocations and inversions. A combination of these methods has been described in an algorithm called BreakDancer [47], which consists of two complementary algorithms, BreakDancerMax (which can detect translocations, inversions, and larger indels) and BreakDancerMini (which can detect smaller indels). A more recent algorithm, the PeSV-Fisher toolkit [48], utilizes a combination of methods based on paired-read and read-depth strategies, making use of information provided by paired-end and mate-pair libraries (the two library types differ by insert size, which is less than 600 bp in the former and over 2 kb in the latter). PeSV-Fisher classifies structural variants as balanced or unbalanced. For unbalanced SVs, it defines the affected region and the breakpoints; for balanced SVs, it calls the SV type as an inversion or an intra- or interchromosomal translocation, along with its orientation.
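The concordance test at the heart of the clustering-based PEMs above can be captured in a few lines. The sketch below follows the mean insert size ±3 SD convention from the text; the library parameters, field layout, and example pairs are invented for illustration and do not correspond to any specific tool.

```python
# Sketch of paired-end (PEM) concordance classification: a pair is
# concordant when both ends map to the same chromosome in opposite
# orientation with a mapped distance within mean +/- 3 SD of the
# library insert size. Parameters and example pairs are invented.

MEAN_INSERT, SD_INSERT = 350, 30        # assumed library parameters
LO = MEAN_INSERT - 3 * SD_INSERT        # 260
HI = MEAN_INSERT + 3 * SD_INSERT        # 440

def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2):
    if chrom1 != chrom2:
        return "discordant: possible interchromosomal rearrangement"
    if strand1 == strand2:
        return "discordant: possible inversion breakpoint"
    distance = abs(pos2 - pos1)
    if distance > HI:
        return "discordant: possible deletion or intrachromosomal rearrangement"
    if distance < LO:
        return "discordant: possible insertion"
    return "concordant"

print(classify_pair("chr9", 1000, "+", "chr22", 5000, "-"))  # interchromosomal
print(classify_pair("chr1", 1000, "+", "chr1", 1330, "-"))   # concordant
print(classify_pair("chr1", 1000, "+", "chr1", 9000, "-"))   # distance too large
```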
The aforementioned algorithms have mostly been developed and tested in research environments. As a proof of concept that data generated by targeted hybrid-capture NGS can be analyzed with publicly available software to identify a broad spectrum of clinically important gene mutations, 20 leukemia-associated genes were captured from five leukemia cell lines harboring clinically important translocations, and the captured DNA was sequenced in multiplex on the Illumina HiSeq platform [49]. The investigators identified translocations in all three cell lines in which the translocations were covered by the capture probes used, and they also identified all of the known mutations in clinically important genes such as NPM1, FLT3, and KIT. Several issues must be considered when designing a targeted NGS test that aims to detect translocations:
1. Probe design: The vast majority of translocation breakpoints occur in intronic regions. Therefore, intronic regions that are hotspots for translocations should be targeted by the capture baits. Often there are no known breakpoint hotspots within the gene of interest, requiring that a large genomic region be targeted.
2. Target space: Depending on the clinical application of the test, space may become a limiting factor. The size of the capture space can increase significantly with the inclusion of large intronic regions for rearrangement detection.
3. Depth of coverage: Deep coverage is critical for sensitive detection of somatic mutations, including rearrangements. The coverage levels established during assay validation, in combination with the size of the target space, dictate the sequencing requirements of a single test, which in turn determine the number of samples that can be run in a single lane on a given sequencing platform.


The performance of several algorithms for detecting rearrangements involving ALK and KMT2A (MLL) in clinical use has been evaluated [50]. Using six cases of lung carcinoma with ALK rearrangement, one case of anaplastic large cell lymphoma with ALK rearrangement, six cases of leukemia with KMT2A (MLL) rearrangement, and 77 cancers negative for ALK and KMT2A (MLL) rearrangement by FISH, the algorithm ClusterFAST correctly detected all 13 positive cases (100%), while BreakDancer and Hydra [51] each detected translocations in 12 of 13 cases (92%). Multiple factors were found to affect the sensitivity of detection, including tumor heterogeneity, the presence of SNVs or small indels around the breakpoint, coverage depth, and the mappability of the breakpoint region. Coverage depth in particular was critical for high sensitivity; in down-sampling experiments on 3 of the 13 cases, the sensitivity of ClusterFAST, BreakDancer, and Hydra dropped to 92% at a mean coverage of 332× and to 54–85% at a mean coverage of 133× over the targeted ALK and KMT2A (MLL) regions. With respect to specificity, neither ClusterFAST nor BreakDancer detected false positives in the 77 cases that were negative for ALK and KMT2A (MLL) rearrangement by FISH.

The use of available algorithms for calling indels in the clinical setting has rarely been tested. The ability of a variety of NGS analysis tools to detect FLT3 ITDs ranging from 17 to 185 bp (as determined by capillary electrophoresis) in AML specimens has been evaluated [52], in a system in which the sequence data were generated by targeted hybrid capture of multiple genes on an Illumina platform. The investigators examined eight software tools, including SAMtools, Maq, Dindel, Pindel, the Genome Analysis Toolkit, CLC, SLOPE, and BreakDancer; Pindel achieved 100% sensitivity and specificity and called accurate ITD insertion sizes at an allele burden as low as 1%. Another group demonstrated a bioinformatics approach that successfully called 129 of 130 known mutations (99.2% sensitivity), including indels as well as SNVs; ITDs; gene copy number losses, gains, and amplifications; chromosomal gains and losses; and actionable genomic rearrangements, including ALK-EML4, ROS1, PML-RARA, and BCR-ABL, in FFPE samples analyzed with a targeted hybrid-capture panel of 194 clinically relevant cancer genes [53]. Thus, targeted NGS by hybrid capture can maximize the use of the data generated by reliably calling different types of mutations in a single test.

Copy Number Variation (CNV) Detection

It is estimated that around 15% of the human genome is subject to copy number changes [54]. Both germ line and somatic CNVs are recognized as frequent contributors to the spectrum of mutations leading to cancer development. One study in acute lymphoblastic leukemia (ALL) identified CNVs affecting genes that encode key regulators of B-lymphocyte development and differentiation in 40% of cases [55]. In breast and colorectal cancers, 7 and 18 copy number changes were found, respectively, affecting on average 24 and 9 protein-coding genes by either amplification or homozygous deletion [56]. FISH and microarray-based techniques have been widely used for the detection of somatic CNVs in cancer [57]. However, one limitation of microarrays is the number of probes that can be placed on the array.
The median spacing between probes on the newest generation of oligonucleotide arrays is 2 kb, making reliable calls possible at no better than about 10–50 kb resolution [58]. In addition, some copy number gains are inserted into areas of the genome other than where they originated, and in such cases microarray analysis cannot provide information about the location of the gained copies. Finally, the use of microarrays requires the purchase of expensive equipment. With the proper depth of sequencing, targeted NGS can identify CNVs and their exact breakpoints at base pair resolution, and it can localize the extra copies of DNA in the setting of copy number gain.

Algorithms have been developed to fully utilize the information in the sequence data to call CNVs, and five general approaches are in use. First, read-depth methods assess the density of aligned reads along the chromosomes. These methods partition the genome into nonoverlapping windows and count the reads in each window; potential CNVs are then identified as consecutive genomic windows in which the observed depth of coverage differs substantially from the expected coverage. Read density is expected to increase or decrease in proportion to the copy number of the region in the case of copy gain or loss, respectively. However, read density is also affected by guanine-cytosine (GC) content [59] and by the mappability of the aligned reads, since a significant portion of reads may align to multiple positions in the reference genome [60]. Depth-of-coverage methods are cost-effective because they do not require a control genome; however, they have limited power to detect small deletions and insertions because the genome must be partitioned into windows of at least 100 bp. Second are algorithms that use read depth from a control genome as a reference to make CNV calls, which avoids the bias introduced by GC content. This strategy can be used to detect somatic gains or losses by comparing the tumor genome with a matched normal genome [61,62]; the SegSeq algorithm [61], for example, focuses on detecting breakpoints rather than regions of CNV per se. Third, split-read [63] and clip-read [64] algorithms directly identify sequence reads that contain the breakpoints of structural variants. Fourth, sequence assembly algorithms enable fine-scale discovery of CNVs, including nonreference sequence insertions [63–65]. Fifth, the tool ExomeCNV, developed specifically to detect CNVs in the setting of cancer [66], uses depth of coverage and B-allele frequencies from mapped short sequence reads to estimate CNV and loss of heterozygosity (LOH); this tool was tested on melanoma samples and detected CNVs ranging in size from 120 bp to 240 Mb.

The resolution of CNV identification in capture-based techniques depends on how the baits are designed. If only the coding exons of the genome are targeted, the actual breakpoint can lie anywhere in the region between the terminal exon called within a CNV region and the adjacent exon in non-CNV space. This interval can be anywhere between 125 bp and 22.8 Mb, with a median of 5 kb (numbers are based on the SureSelect Human All Exon Kit G3362). If the target is not the entire exome but instead several selected genes, the resolution becomes much lower and can be tens of megabases, depending on how far the next targeted gene lies from the terminal exon called within a CNV region.
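As an illustration of the first (read-depth) approach, the Python sketch below counts coverage in fixed nonoverlapping windows, takes the genome-wide median window depth as the expected two-copy level, and reports runs of consecutive windows whose depth deviates beyond a ratio threshold. The window size, ratio cutoff, and minimum run length are hypothetical parameters; real callers additionally correct for GC content and mappability, as noted above.

```python
import statistics

def read_depth_cnv(depth, window=100, min_ratio=1.5, min_windows=3):
    """Call candidate CNVs from a list of per-base depths by comparing
    nonoverlapping windows against the genome-wide expectation (the
    median window depth, assuming most of the target is two-copy)."""
    wins = [statistics.mean(depth[i:i + window])
            for i in range(0, len(depth) - window + 1, window)]
    expected = statistics.median(wins)

    calls, start, state = [], 0, None
    for i, w in enumerate(wins + [expected]):  # sentinel flushes the last run
        s = ("gain" if w >= expected * min_ratio
             else "loss" if w <= expected / min_ratio
             else None)
        if s != state:
            if state is not None and i - start >= min_windows:
                calls.append((start * window, i * window, state))
            start, state = i, s
    return calls  # list of (start_bp, end_bp, "gain"/"loss") tuples
```

Because the call boundaries snap to window edges, the breakpoint resolution of this strategy is limited to the window size, consistent with the resolution considerations discussed above.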

Cost-Effectiveness

Traditionally, the detection of different types of clinically important somatic mutations in cancer has depended on different technologies. One analysis considered the cost of the molecular evaluation of AML under an academic protocol [67] that includes chromosome analysis by cytogenetics; FISH for the translocations associated with AML, including t(15;17), t(8;21), inv(16), and t(9;22); and molecular testing for FLT3, NPM1, and KIT mutations, which are now recognized as important predictors of resistance to standard therapy in AML patients with normal cytogenetics [9]. The list price of this "standard of care" molecular evaluation at a national reference laboratory was about $4575 (in 2011) and did not include the cost of molecular testing for other recurring mutations (e.g., CEBPA, NRAS, IDH1, IDH2, TET2, DNMT3A, RUNX1, and ASXL1). Analysis of these targets currently requires an expensive infrastructure, with investment in multiple technologies and in personnel skilled in those technologies. Well-designed genetic tests, with genes selected for clinical relevance to the cancer of interest, using targeted NGS by hybrid capture may prove more cost-effective and streamlined, providing a higher diagnostic yield at a lower cost. Such a test may also offer a comprehensive analysis, eliminating the need to send testing out to multiple laboratories, and will likely produce gains in efficiency in quality management and quality assurance, as well as in interpretation and reporting. Sample multiplexing by NGS offers a powerful means of reducing the per-sample cost and tuning the sequencing depth to the desired level. Finally, hybrid-capture methods are very scalable, and advances in library preparation, sequencing reagents, and instrument upgrades will all likely contribute to cost reduction and shorter turnaround times.

High Depth of Coverage

Identification of somatic sequence variants by NGS requires special consideration to ensure that an assay with adequate sensitivity is attained. While a low depth of coverage may be appropriate in a constitutional test aimed at detecting germ line alterations, read depths of hundreds to thousands are necessary to ensure that somatic variants present at a low allelic fraction can be detected. The ability to detect low-frequency variants depends on the variant caller used during bioinformatic analysis as well as on the depth of coverage, with higher mean depths associated with increased sensitivity [68]. The need for high read depths reflects the complexity of somatic variant detection. Tumor biopsy specimens represent a heterogeneous mixture of tissue encompassing malignant cells as well as supporting stromal cells, inflammatory cells, and uninvolved tissue; malignant cells harboring somatic variants can be diluted out in this admixture. In addition, intratumoral heterogeneity creates tumor subclones, so that only a small proportion of the total tumor cell population may harbor a given mutation. The read depth of the assay should be sufficiently high to compensate for such occurrences. Aside from tumor heterogeneity, an additional confounder in somatic testing is copy number abnormality. Divergence from a two-copy state is common during tumorigenesis [68], and alterations in copy number may include whole chromosomal gains and/or losses as well as smaller duplications and/or deletions. These CNAs can further complicate NGS analysis through associated loss of heterozygosity or amplification, which may obfuscate the sequencing results at an affected locus.
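The relationship between depth, allele fraction, and sensitivity can be illustrated with a simple binomial model. The sketch below computes the probability of observing at least a minimum number of variant-supporting reads, a common (here hypothetical) calling requirement; it deliberately ignores sequencing error and coverage variability, so it overstates what a real caller achieves.

```python
from math import comb

def p_detect(depth, vaf, min_alt_reads=5):
    """Probability of seeing >= min_alt_reads variant-supporting reads
    at a locus sequenced to `depth` when the variant is present at
    allele fraction `vaf` (binomial model; ignores sequencing error)."""
    return sum(comb(depth, k) * vaf**k * (1 - vaf)**(depth - k)
               for k in range(min_alt_reads, depth + 1))

# A 5% variant is almost always detectable at 400x unique coverage but
# is usually missed at 50x with the same calling threshold:
print(round(p_detect(400, 0.05), 3))   # ~1.0
print(round(p_detect(50, 0.05), 3))    # ~0.1
```

Under these assumptions, a variant diluted to a 5% allele fraction by stromal admixture or subclonality requires several hundredfold unique coverage for reliable detection, which is why read depths of hundreds to thousands are quoted above.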


NGS IN A CLINICAL LABORATORY SETTING

Design of the Clinical Assay

A critical component in the design of a clinical assay is selection of the appropriate methodology for the desired performance characteristics of the test. NGS methodologies are compatible with clinical somatic variant detection at both the level of sample input and data output, a paradigm illustrated by the ability to utilize fragmented DNA from FFPE tumor specimens and the capability to detect low-frequency somatic alterations with high sensitivity, as achieved by read depths in the hundreds or thousands [23,24,53]. Further considerations for clinical NGS assay development include an assessment of sequencing platforms for error rate, read length, requirements for physical and computing space, sample volume, type of assays, data output, and run time. In addition, the clinical testing strategy should evaluate the necessary specimen type and quality, nucleic acid input requirements, the variant detection capabilities of the assay, and cost.

Specimen Requirements for Somatic Variant Detection

The type of clinical specimen impacts all downstream workflows. Whether fresh, frozen, or fixed, and whether from peripheral blood or solid tissue, the pathologic specimen is the starting point for a successful result. Suitability for downstream analysis depends on the ability to obtain high-quality DNA (free of inhibitors) and on a determination that the sample contains adequate tumor involvement. For solid tumors, the most frequently available specimen type is FFPE tumor tissue. The nature of the biopsy dictates the form and cellularity of the tissue observed in an FFPE specimen, which may include material obtained from a fine needle aspirate, core needle biopsy, or excisional biopsy, among others. Hematologic specimens may include fresh bone marrow aspirates, tumor-involved peripheral blood, or an FFPE bone marrow core biopsy. As is inherent with all specimen types, the performance characteristics of these samples must be assessed and documented during assay validation. It is reassuring that data in the literature show that FFPE specimens and fresh tumor samples are both amenable to NGS analysis [69]. While specimens derived from FFPE tissue have smaller library insert sizes, greater variability in coverage, and increased C-to-T transitions in comparison with fresh tissue, performance characteristics including sequencing error rates, library complexity, and enrichment statistics are not significantly different [69]. Similarly, cytology specimens, whether derived from ethanol- or methanol-based fixation protocols, have been shown to be amenable to clinical NGS analysis by hybrid capture-based methods [70]. As with FFPE tissue, the sensitivity and specificity of NGS were not significantly impacted by routine cytological preparation methods, although the paucity of malignant cells in many cytology samples frequently limits the usefulness of NGS in routine practice. The utility of cytology specimens for the detection of somatic mutations in cancer has also been demonstrated for amplification-based NGS methods [28]. Specimens from a wide range of sources should be tested, allowing for the observation of complicating events that may preclude downstream analysis. Inhibitors, suboptimal collection and transport conditions, and other factors likely to impact assay performance should be documented in detail so that these confounders are known in advance of offering the clinical test.

Pathologic Assessment

Prior to nucleic acid isolation, the pathologic specimen must be reviewed by an anatomic pathologist to confirm the presence of malignant tissue and to assess the quality and quantity of the material submitted for testing. This component of the clinical workflow is a requirement of the College of American Pathologists (MOL.32395, Molecular Pathology Checklist, 07.29.2013), which dictates that the neoplastic cell content of FFPE specimens be assessed and documented whenever DNA is extracted for downstream molecular testing reliant upon somatic variant detection. The pathologic assessment is not only a quality control evaluation of the submitted tumor tissue but also allows for evaluation of possible analytic confounders. Examples include tumor cell admixture with stromal/normal tissues, necrosis, and suboptimal sample preparation such as acid decalcification or the use of unbuffered formalin. It is valuable to extend the pathologic assessment to all specimen types submitted for somatic variant detection to ensure that both the quantity of malignant cells and the tissue quality are sufficient for assay performance. For example, evaluation of a hematoxylin and eosin (H&E) stained slide from a fresh bone marrow aspirate can be used to ensure that the percentage of malignant cells within the specimen is sufficient and at a level compatible with the reported sensitivity of the assay. Records of the surgical pathology workup for a given specimen provide useful information regarding diagnosis, tumor involvement, immunohistochemistry, prior cytogenetic and molecular studies, and flow cytometric analysis. An approximate value for the percentage of tumor involvement in a given sample may also be useful during interpretation of the sequencing data with regard to the variant allele fraction. As previously mentioned, confounding variables have been described in pathologic specimens that may affect downstream NGS data. The effects can be relatively minor (e.g., tumor/normal admixture resulting in slightly reduced sensitivity to detect low-frequency variants) or so detrimental to the NGS workflow that no usable data are obtained (e.g., bone biopsy specimens decalcified in formic acid, in which DNA hydrolysis breaks the N-glycosidic link between the base and the deoxyribose). The effect of acid decalcification is of particular note because it produces profound degradation of DNA [71]; use of EDTA in place of formic acid during decalcification should be considered if specimens may undergo molecular testing [72]. Similar in nature is the effect of unbuffered formalin used during fixation of a pathologic specimen, as unbuffered formalin can oxidize to formic acid, degrading the DNA contained within a biopsy specimen.

Reportable Range

Owing to variability among test platforms and methodologies, the reportable range of somatic variation within an NGS assay depends on the size of the target space, the type of nucleic acid being sequenced, the specimen type, and the ability to detect multiple categories of genetic variation. SNVs, insertions and deletions (indels), CNAs, and structural variants are all theoretically detectable by NGS, but doing so in practice requires an optimally designed assay and bioinformatic pipeline. Desired assay characteristics will influence target space selection, which may be restricted to regions of specific interest (mutational hotspots) or may encompass genomic sequence more broadly (e.g., all coding exons of a given gene). The inclusion of intronic regions of select genes may allow for the detection of structural rearrangements such as inversions and translocations at the DNA level. Likewise, an assay may be strategically designed to detect CNAs in genes of interest.

Genetic Targets

The genetic loci included in the assay will be influenced by the disease type under study, with the inclusion of genes or locus-specific hotspots that have clinical utility in patient management. For instance, a hematologic disease panel may be expected to contain gene regions of interest that are most relevant to a specific disease subset (e.g., MDS or AML) or a larger number of genes recurrently mutated in a broader range of hematologic diseases. Here again, the variant classes to be detected must be considered when designing the assay. For example, the ability to detect indel events such as FLT3 ITDs (about 40 bp or larger) and NPM1 insertions (most often only 4 bp long), in addition to SNVs and specific translocations, in a hematologic disease panel requires careful assay design and optimization as well as thorough vetting of the bioinformatic analyses [73,74]. For this reason, a thorough review of the literature and examination of mutation databases are critical when selecting gene targets, to ensure that the test is designed to capture all desired clinically relevant mutational targets.

QC Metrics

NGS-based somatic mutation testing is a multistep, complex assay that employs multiple distinct molecular techniques and hand-offs, as well as sophisticated bioinformatics analysis. The assay's complex nature makes it prone to numerous pitfalls, including specimen quality issues, inefficient enzymatic steps, human error, and instrument failure. While somatic mutation testing is not a quantitative assay per se, it is amenable to quantitative performance assessment at numerous steps to identify problematic specimens, troubleshoot the source of a failure, and take corrective action to generate an accurate reportable result. Recent publications describing clinical NGS assays for the detection of somatic mutations, both amplicon based [27,75,76] and hybridization capture based [24,53], have detailed validation methodologies but notably have not outlined the QC metrics employed in routine clinical use. However, the QC steps and metrics for one hybridization capture-based oncology assay performed routinely for patient care have been described [23]. Tables 19.2 and 19.3 detail the steps at which QC measurements are taken, as well as the specific metrics recorded at each step.


TABLE 19.2 QC Measurements Assessed During Specimen Intake, DNA Extraction, Library Preparation and Enrichment, and Sequencing for NGS Somatic Mutation Testing

Specimen intake

Percent tumor cell nuclei in the selected areas
• Ensures that tumor cells represent a large enough fraction of the cells to be assayed to allow detection of somatic tumor variants, given the assay's LOD.
• May demand consideration of the tumor cell viability metric (below) to ensure sufficient tumor cell DNA yield.

Viability of tumor cells in the selected areas
• Ensures sufficient tumor cell viability to yield good-quality tumor DNA for sequencing.
• May demand consideration of the percent tumor cell nuclei metric (above) to ensure sufficient tumor cell DNA yield.

Heterogeneity (tumor cellularity and/or viability) within or between the selected areas
• Qualitative description of the uniformity of the tumor sample; perhaps weakly correlated with the degree of heterogeneity of tumor somatic mutations.

DNA extraction

0.8% agarose gel image of extracted DNA
• Gross assessment of the degree of nucleic acid degradation. The presence of only very low molecular weight material correlates with library amplification failure.

DNA yield (µg)
• Ensures sufficient material exists to begin the library preparation protocol.

260/280 ratio
• A ratio of <1.6 or >2.1 can indicate poor-quality or contaminated DNA.

Library preparation/enrichment

Peak fragment size post sonication
• Ensures successful acoustic shearing of genomic DNA.
• Abnormally small fragment size can indicate over-shearing of DNA or extensive nucleic acid degradation due to preanalytic factors.

Peak fragment size after amplification of adapter-ligated product
• Ensures proper size increase given successful adapter ligation.

DNA yield after amplification of adapter-ligated product
• Ensures ample PCR product to proceed to the hybridization step. Insufficient yield can indicate extensively degraded DNA that is not amenable to amplification, or failure to recover product between steps.

Peak fragment size and DNA yield after posthybridization amplification
• Ensures successful capture, posthybridization amplification, and recovery of product between steps.

Peak fragment size and DNA yield of multiplexed pool
• Ensures proper behavior after pooling samples.

Sequencing (lane level)

Concentration of multiplexed pool by quantitative PCR
• For precise loading of multiplexed pool(s) on the sequencer.

Cluster density
• Indicates whether the proper loading concentration was achieved in the Illumina flow cell lane; underloading leads to a lower read yield, while overloading leads to more clusters being discarded due to overlap.

Number of paired-end reads
• Indicates the amount of sequence data generated in the lane. Correlates positively with cluster density within range. Indicates the level of coverage for the samples assayed in the lane.

Pass filter percent
• The percent of clusters that were sufficiently well separated from neighboring clusters to distinguish optically. A cluster density above the desired range can yield a low pass filter percent.
• Multiplied by the number of paired-end reads metric, indicates the total number of usable paired-end reads from the lane.

Read 1, Read 2 error rates
• Based on the PhiX control, the rate of errant base incorporation in each read of the pair.

Read 1, Read 2 phasing/prephasing
• The degree to which individual molecules in a cluster fall out of phase with the rest of the cluster. When values are within spec, this issue is corrected for computationally.

Percent of bases with base quality score ≥30 (Q30%)
• Indication of the overall quality of base calls across clusters. Base quality scores are taken into account during alignment and variant calling, but a low Q30% results in reduced power to make accurate variant calls.


TABLE 19.3 Final Specimen-Level Sequencing QC Metrics

Total reads
Definition: The total number of single-end 101 bp reads assigned to a specimen's sequence index.
Value:
• Relative to other specimen indexes on the same lane, indicates how well balanced the multiplexed samples are.
• Extremely low numbers of reads can indicate a sequence index entry error.

Percent mapped
Definition: The percent of total reads that map to the human genome.
Value:
• A low percent mapped can indicate many reads with low mapping quality, pointing to instrument error or contamination with nonhuman DNA.

Percent on target
Definition: The percent of mapped reads that map to the defined target region.
Value:
• Indicates the efficiency of the hybridization capture step.

Percent unique on target
Definition: The percent of paired on-target reads that are unique, i.e., whose genomic ends are not shared with other sequenced fragments.
Value:
• Corresponds to the molecular diversity of the library. A low percent unique on target indicates low library complexity.
• In samples with very high numbers of total reads, sequenced fragments can share genomic ends by chance, and this metric can be misleading.

Unique on target reads
Definition: The number of reads deriving from uniquely sampled genomic fragments. Equals (percent unique on target) × (number of on-target reads).
Value:
• Corresponds to the true unique coverage of the specimen. Coverage depth distribution metrics derive from these reads only.

Percent of positions covered at ≥50×, ≥400×, ≥1000× unique reads
Definition: The percent of positions across the defined target region that achieve the indicated threshold of unique coverage.
Value:
• Describes the distribution of coverage depth across the target space in a more complete fashion than average unique coverage.
• The ≥400× metric indicates the percent of the target space with sufficient depth for high-sensitivity detection of variants at an allele fraction of 10%; the ≥50× metric indicates the percent of the target space with sufficient depth to make any reference or variant call.

Average unique coverage
Definition: The average number of unique reads covering a given base in the target space. Equals (unique on-target reads × read length)/(size of target space in bp).
Value:
• Convenient metric for a rough assessment of the sequencing performance of a given specimen; blind to specific genomic regions of poor performance.

Failed exons
Definition: Targeted coding exons that do not achieve 50× unique coverage across 95% of exonic coding positions. The percent of positions covered uniquely at 50×, 200×, and 400× is also shown.
Value:
• Identifies poorly performing exons, which are declared in the clinical report.
• A failed exon may warrant employing an orthogonal approach if its genotype is important for a given indication.
• A string of adjacent failing exons may indicate a deletion or a disproportionate effect of a preanalytic factor on a particular locus.

Failed hotspots
Definition: Cancer hotspots (from a curated, maintained list) that do not achieve 50× unique coverage. The percent of positions covered uniquely at 50×, 200×, and 400× is also shown.
Value:
• Identifies any key targetable or prognostic hotspots across the targeted genes that did not achieve the minimum depth required to make a reference or variant call. Hotspot failure is extremely rare but can be the result of a deletion. Hotspot failure may warrant repeating the assay or performing a targeted orthogonal test.
Key stages at which QC is performed are pathologist review at specimen intake, after DNA extraction, during library preparation, and after sequencing. Table 19.3 pays particular attention to metrics that describe the distribution of unique coverage across the target space, which are the bottom-line indicators of performance for a given specimen. The specimen-level QC metrics in Table 19.3 are the key data used to gauge the performance of a given specimen. In addition to the values for the clinical specimen under review, a reference average for each metric, calculated from a set of production cases, should be evaluated. The clinical genomicists reviewing the QC data can then compare the metrics for the case at hand against the reference average to assess performance in each area (although admittedly, for most metrics, minimum required values that would trigger the addition of a disclaimer or limited-study comment in a clinical report have not been established). In this regard, it is important to note that the identification of QC metrics for NGS-based assays is a dynamic process by which the performance of the assay is closely monitored and additional QC metrics are developed and added as required.

An average unique coverage metric provides an at-a-glance read of how well a sample performed, as does the distribution of coverage depth across the target space, which ensures that important regions and positions are not missed. Automatic review of a curated list of clinically actionable positions across the reported genes for each specimen is another useful QC measure; failures require an assessment of the underlying technical cause as well as of the relevance of the hotspot to the clinical indication for testing. Should the position be deemed important, a repeat of the assay or an orthogonal test should be performed. Of note, a failed QC hotspot can be a telltale sign of a locus-specific issue or event; for example, a deletion could be the underlying cause of dramatically lowered coverage at a hotspot. Commonly, exon-level metrics indicate that one or more exons failed to achieve the necessary unique coverage. This is most common in exon 1 of some genes, which is notoriously difficult to capture, amplify, and/or sequence, although the method of capture probe synthesis may have a bearing on performance in such regions. Any exons that fail this metric should be declared in the clinical report; unless a significant number of exons, or exons critical for the test indication, have failed, these findings do not preclude issuing a report, and, in fact, the exons may not necessarily perform better if the assay is repeated. In cases with a high average depth of coverage and good coverage distribution but a handful of adjacent failed exons, the poor performance of those exons can often be attributed to locus-specific phenomena inherent to the specimen rather than to that iteration of the assay.

Some laboratories evaluate the distribution of coverage depth by asking what fraction of the target space is covered to different depth thresholds. This is important because, despite an average coverage of 1000× across the capture space of a target region, a test may achieve that coverage depth at only a minority of positions for a particular specimen. Coverage depth metrics that are highly variable across a target space enriched by hybridization capture indicate the need to aim for a higher average depth of coverage, to redesign the probes, or both. The percent of targeted positions with unique coverage is also a key indicator of library complexity [23]; lower unique coverage can result from a number of assay performance issues, including a reduced number of total reads (due to out-of-range cluster densities or other sequencing platform technical problems) or poor hybrid capture efficiency, but it is most often due to low molecular complexity of the library, which leads to fewer unique reads (i.e., many duplicate reads are discarded during bioinformatics analysis) and indicates that the pool of successfully sampled independent starting molecules was not ideal.
In general, samples that yield 200 ng or less of high-quality DNA (as described above), or those with extensively degraded nucleic acids, are prone to suboptimal unique sequencing depth. Improved DNA extraction and library preparation techniques may permit rescue of such borderline samples in the future.
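The specimen-level coverage metrics in Table 19.3 are straightforward to compute once per-position unique (deduplicated) depths are available. The Python sketch below is a simplified illustration rather than any laboratory's production pipeline; it summarizes the coverage distribution and flags failed exons using the 50×/95% rule described in the table, and its input data structures are hypothetical.

```python
def coverage_summary(unique_depths, thresholds=(50, 400, 1000)):
    """Summarize the unique-coverage distribution across every targeted
    position, mirroring the specimen-level metrics in Table 19.3."""
    n = len(unique_depths)
    summary = {"average_unique_coverage": sum(unique_depths) / n}
    for t in thresholds:
        pct = 100.0 * sum(1 for d in unique_depths if d >= t) / n
        summary["pct_positions_ge_%dx" % t] = pct
    return summary

def failed_exons(exon_depths, min_depth=50, min_fraction=0.95):
    """Flag exons in which fewer than 95% of coding positions reach
    50x unique coverage (the failure rule described in Table 19.3)."""
    failed = []
    for exon, depths in exon_depths.items():
        frac = sum(1 for d in depths if d >= min_depth) / len(depths)
        if frac < min_fraction:
            failed.append(exon)
    return failed

# Toy usage: a mostly well-covered target with one poorly covered exon.
depths = {"GENE1_exon1": [12, 20, 35, 40], "GENE1_exon2": [450, 500, 620]}
print(failed_exons(depths))                               # ['GENE1_exon1']
print(coverage_summary([d for v in depths.values() for d in v]))
```

Exon names and depth values here are invented for demonstration; in practice the per-position depths would be extracted from deduplicated alignments over the validated target intervals.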

Validation

Clinical NGS assay validation has largely proceeded based on clinical laboratories' best judgment and on molecular genetic testing guidelines created to address previously available (and uniformly less complex) technologies. Only recently have professional societies such as the College of American Pathologists (CAP), government agencies including the Centers for Disease Control and Prevention (CDC) and the Food and Drug Administration, and the New York State Department of Health begun to set expectations for the demonstration of analytical performance characteristics of NGS-based testing [77]. Key performance characteristics for any qualitative assay include analytic sensitivity, analytic specificity, LOD, and reproducibility. Three published reports describing the validation of hybridization capture-based NGS somatic variant detection assays [23,24,53] outline similar but distinct approaches, highlighting the regulatory uncertainty surrounding the expected experimental design. Ideally, assay sensitivity and specificity for SNVs and indels should be determined on samples of quality similar to the clinical samples on which the assay will be run. However, the availability of such samples is a limiting factor; sensitivity and specificity of SNV and indel detection can therefore be assessed using cell lines, tumor samples, or both. Well-characterized cell lines, while not equivalent to FFPE tumor tissue, enable assessment of sensitivity and specificity across many genomic positions exhibiting a wide range of sequence contexts and coverage levels [23,24,53].
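Once an orthogonally characterized truth set is in hand, per-variant sensitivity and per-position specificity reduce to simple set arithmetic, as in the minimal Python sketch below. The variant keys and the coordinates in the usage example are invented for demonstration and do not correspond to any published validation dataset.

```python
def validation_metrics(called_variants, known_variants, wildtype_positions):
    """Per-variant sensitivity and per-position specificity against an
    orthogonally characterized truth set.

    called_variants / known_variants: sets of (chrom, pos, ref, alt).
    wildtype_positions: set of (chrom, pos) confirmed reference positions.
    """
    tp = len(called_variants & known_variants)          # true positives
    fn = len(known_variants - called_variants)          # missed variants
    fp = sum(1 for (c, p, _r, _a) in called_variants
             if (c, p) in wildtype_positions)           # calls at wild-type sites
    tn = len(wildtype_positions) - fp
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Toy example with hypothetical coordinates:
called = {("chr13", 28608258, "C", "CTCTG"), ("chr4", 55599321, "A", "T")}
truth  = {("chr13", 28608258, "C", "CTCTG"), ("chr17", 7577120, "C", "T")}
wt     = {("chr4", 55599321)}   # orthogonally confirmed wild-type position
print(validation_metrics(called, truth, wt))   # (0.5, 0.0)
```

In a real validation, the truth set spans thousands of positions across diverse sequence contexts, which is precisely why well-characterized cell lines are attractive despite their differences from FFPE tissue.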


Detection performance for other classes of variants, including larger indels, copy number variants, and structural variants, is difficult to assess using cell lines because few such constitutional variants exist within a limited capture space. For validation of these variant types in clinical NGS assays, some groups have relied exclusively on clinical tumor samples with known somatic alterations [53]. Other groups have collected multiple indel- and CNV-containing tumor cell lines to perform indel and CNV sensitivity and specificity calculations [24], and still others have utilized a combined approach [23,50].

The LOD for somatic cancer variants is a crucial validation parameter. Mixing samples containing variants at known allele fractions is one means of creating variants at artificially low allele fractions near an assay's expected LOD. Normal cell lines carrying germ line polymorphisms can be mixed at different ratios to simulate cancer mutations found in a subset of assayed cells, such as in a tumor subclone or in a tumor sample admixed with normal tissue. Mixing primary tumor sample DNA can be more challenging to accomplish with precision, especially with DNA derived from FFPE tissue. Again, the approach used for validation has differed among clinical laboratories. Some laboratories have employed serial dilutions of hematopoietic tumor DNA with nonneoplastic control DNA to estimate the SNV and indel LOD [53]; others have utilized mixes of normal cell line DNA to calculate the SNV LOD and mixes of tumor cell line DNA for the indel and CNV LOD (with down-sampling of the mixed datasets in silico to probe the dependence of performance at the LOD on coverage) [24]. Still other groups have used a combination of both [23,50,68].

Assay reproducibility has generally been assessed using either HapMap cell line DNA or patient tumor samples, with library preparation for one or more DNA samples repeated on different days, in different batches, by different laboratory technologists, and/or on different sequencing instruments [23,24]. Uniformly, NGS assays for somatic mutation detection have been shown to be highly repeatable and reproducible.
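The expected allele fraction in such a dilution series follows directly from the mixing proportions. The sketch below computes it under the simplifying assumptions of a diploid locus with no copy number change at the variant position; the function and parameter names are hypothetical.

```python
def expected_vaf(mix_fraction, tumor_purity=1.0, variant_copies=1,
                 total_copies=2):
    """Expected variant allele fraction after diluting variant-bearing
    DNA into normal DNA, assuming a diploid locus with no copy number
    change at the variant position."""
    return mix_fraction * tumor_purity * variant_copies / total_copies

# Mixing a heterozygous cell line at 10% into a wild-type line should
# yield a ~5% VAF, a common target when probing an assay's LOD:
print(expected_vaf(0.10))   # 0.05
```

Deviations of the observed VAF from this expectation in a validation run can flag pipetting imprecision, copy number effects at the locus, or allele-specific capture bias.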

CONCLUSION

The advent of NGS has enabled researchers to uncover the molecular basis of numerous types of cancer, identifying tens of thousands of somatic variants. A small portion of these variants has an established and well-characterized clinical significance, whether as targets for therapy with small molecule drugs or as prognostic markers. Studies are under way to establish the clinical significance of many of the newly discovered variants in different cancer types and to establish the efficacy of targeted therapy with new small molecule drugs. Targeted NGS, especially by hybrid-capture approaches, provides an optimal means of translating these discoveries into clinical applications, and it overcomes many challenges characteristic of somatic mutation detection in cancer, such as low tumor content, admixture with normal tissue, clonal and mutational heterogeneity, limited sample volume, and compromised DNA quality. However, a clinical test that uses targeted NGS to detect somatic mutations in cancer requires careful design, with clear expectations of what the test can achieve and a clear strategy for meeting those expectations with the existing infrastructure and bioinformatic analysis tools. Validation of such highly complex tests likewise requires a clear strategy for validating both the technical and bioinformatic components of the assay, as well as for monitoring the QC metrics of the test in routine clinical use.

References
[1] Druker BJ, Tamura S, Buchdunger E, Ohno S, Segal GM, Fanning S, et al. Effects of a selective inhibitor of the Abl tyrosine kinase on the growth of Bcr-Abl positive cells. Nat Med 1996;2(5):561-6.
[2] Fendly BM, Kotts C, Vetterlein D, Lewis GD, Winget M, Carver ME, et al. The extracellular domain of HER2/neu is a potential immunogen for active specific immunotherapy of breast cancer. J Biol Response Mod 1990;9(5):449-55.
[3] Shepard HM, Lewis GD, Sarup JC, Fendly BM, Maneval D, Mordenti J, et al. Monoclonal antibody therapy of human cancer: taking the HER2 protooncogene to the clinic. J Clin Immunol 1991;11(3):117-27.
[4] Hirota S, Isozaki K, Moriyama Y, Hashimoto K, Nishida T, Ishiguro S, et al. Gain-of-function mutations of c-kit in human gastrointestinal stromal tumors. Science 1998;279(5350):577-80.
[5] Joensuu H, Roberts PJ, Sarlomo-Rikala M, Andersson LC, Tervahartiala P, Tuveson D, et al. Effect of the tyrosine kinase inhibitor STI571 in a patient with a metastatic gastrointestinal stromal tumor. N Engl J Med 2001;344(14):1052-6.
[6] <http://www.cancer.org>.
[7] Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature 2009;458(7239):719-24.
[8] Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, et al. Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 2004;350(21):2129-39.


[9] Mazieres J, Peters S, Lepage B, Cortot AB, Barlesi F, Beau-Faller M, et al. Lung cancer that harbors an HER2 mutation: epidemiologic characteristics and therapeutic perspectives. J Clin Oncol 2013;31(16):1997-2003.
[10] Camidge DR, Bang YJ, Kwak EL, Iafrate AJ, Varella-Garcia M, Fox SB, et al. Activity and safety of crizotinib in patients with ALK-positive non-small-cell lung cancer: updated results from a phase 1 study. Lancet Oncol 2012;13(10):1011-9.
[11] Bergethon K, Shaw AT, Ou SH, Katayama R, Lovly CM, McDonald NT, et al. ROS1 rearrangements define a unique molecular class of lung cancers. J Clin Oncol 2012;30(8):863-70.
[12] Campos-Parra AD, Zuloaga C, Manriquez ME, Aviles A, Borbolla-Escoboza J, Cardona A, et al. KRAS mutation as the biomarker of response to chemotherapy and EGFR-TKIs in patients with advanced non-small cell lung cancer: clues for its potential use in second-line therapy decision making. Am J Clin Oncol 2013.
[13] Deininger M, Buchdunger E, Druker BJ. The development of imatinib as a therapeutic agent for chronic myeloid leukemia. Blood 2005;105(7):2640-53.
[14] Estey EH. Acute myeloid leukemia: 2013 update on risk-stratification and management. Am J Hematol 2013;88(4):318-27.
[15] Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods 2007;4(11):903-5.
[16] Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet 2007;39(12):1522-7.
[17] Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME. Microarray-based genomic selection for high-throughput resequencing. Nat Methods 2007;4(11):907-9.
[18] Bashiardes S, Veile R, Helms C, Mardis ER, Bowcock AM, Lovett M. Direct genomic selection. Nat Methods 2005;2(1):63-9.
[19] Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009;461(7261):272-6.
[20] Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al. Target-enrichment strategies for next-generation sequencing. Nat Methods 2010;7(2):111-8.
[21] Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 2009;27(2):182-9.
[22] Blumenstiel B, Cibulskis K, Fisher S, DeFelice M, Barry A, Fennell T, et al. Targeted exon sequencing by in-solution hybrid selection. Curr Protoc Hum Genet 2010 [chapter 18: unit 18.4].
[23] Cottrell CE, Al-Kateb H, Bredemeyer AJ, Duncavage EJ, Spencer DH, Abel HJ, et al. Validation of a next-generation sequencing assay for clinical molecular oncology. J Mol Diagn 2014;16(1):89-105.
[24] Frampton GM, Fichtenholtz A, Otto GA, Wang K, Downing SR, He J, et al. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat Biotechnol 2013;31(11):1023-31.
[25] Hagemann IS, Cottrell CE, Lockwood CM. Design of targeted, capture-based, next generation sequencing tests for precision cancer therapy. Cancer Genet 2013;206(12):420-31.
[26] Wang J, Lin M, Crenshaw A, Hutchinson A, Hicks B, Yeager M, et al. High-throughput single nucleotide polymorphism genotyping using nanofluidic dynamic arrays. BMC Genomics 2009;10:561.
[27] Singh RR, Patel KP, Routbort MJ, Reddy NG, Barkoh BA, Handal B, et al. Clinical validation of a next-generation sequencing screen for mutational hotspots in 46 cancer-related genes. J Mol Diagn 2013;15(5):607-22.
[28] Kanagal-Shamanna R, Portier BP, Singh RR, Routbort MJ, Aldape KD, Handal BA, et al. Next-generation sequencing-based multi-gene mutation profiling of solid tumors using fine needle aspiration samples: promises and challenges for routine clinical diagnostics. Mod Pathol 2014;27(2):314-27.
[29] Ferlay J, Shin HR, Bray F, Forman D, Mathers C, Parkin DM. Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008. Int J Cancer 2010;127(12):2893-917.
[30] Govindan R, Ding L, Griffith M, Subramanian J, Dees ND, Kanchi KL, et al. Genomic landscape of non-small cell lung cancer in smokers and never-smokers. Cell 2012;150(6):1121-34.
[31] Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 2011;39(17):e118.
[32] Arcila ME. Simple protocol for DNA extraction from archival stained FNA smears, cytospins, and thin preparations. Acta Cytol 2012;56(6):632-5.
[33] Shearer AE, Hildebrand MS, Smith RJ. Solution-based targeted genomic enrichment for precious DNA samples. BMC Biotechnol 2012;12:20.
[34] Parameswaran P, Jalili R, Tao L, Shokralla S, Gharizadeh B, Ronaghi M, et al. A pyrosequencing-tailored nucleotide barcode design unveils opportunities for large-scale sample multiplexing. Nucleic Acids Res 2007;35(19):e130.
[35] Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz Jr LA, Kinzler KW. Cancer genome landscapes. Science 2013;339(6127):1546-58.
[36] Harrison CJ, Foroni L. Cytogenetics and molecular genetics of acute lymphoblastic leukemia. Rev Clin Exp Hematol 2002;6(2):91-113 [discussion 2002].
[37] Mrozek K, Harper DP, Aplan PD. Cytogenetics and molecular genetics of acute lymphoblastic leukemia. Hematol Oncol Clin North Am 2009;23(5):991-1010, v.
[38] Davicioni E, Finckenstein FG, Shahbazian V, Buckley JD, Triche TJ, Anderson MJ. Identification of a PAX-FKHR gene expression signature that defines molecular classes and determines the prognosis of alveolar rhabdomyosarcomas. Cancer Res 2006;66(14):6936-46.
[39] Aplan PD. Causes of oncogenic chromosomal translocation. Trends Genet 2006;22(1):46-55.
[40] Meyer C, Hofmann J, Burmeister T, Groger D, Park TS, Emerenciano M, et al. The MLL recombinome of acute leukemias in 2013. Leukemia 2013;27(11):2165-76.
[41] De Braekeleer E, Douet-Guilbert N, Morel F, Le Bris MJ, Basinko A, De Braekeleer M. ETV6 fusion genes in hematological malignancies: a review. Leuk Res 2012;36(8):945-61.


[42] Welch JS, Westervelt P, Ding L, Larson DE, Klco JM, Kulkarni S, et al. Use of whole-genome sequencing to diagnose a cryptic fusion oncogene. JAMA 2011;305(15):1577-84.
[43] Meyer C, Kowarz E, Hofmann J, Renneville A, Zuna J, Trka J, et al. New insights to the MLL recombinome of acute leukemias. Leukemia 2009;23(8):1490-9.
[43a] Mitelman F, Johansson B, Mertens F. The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer 2007;7(4):233-45.
[44] Xi R, Kim TM, Park PJ. Detecting structural variations in the human genome using next generation sequencing. Brief Funct Genomics 2010;9(5-6):405-15.
[45] Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727-32.
[46] Lee S, Hormozdiari F, Alkan C, Brudno M. MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nat Methods 2009;6(7):473-4.
[47] Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 2009;6(9):677-81.
[48] Escaramis G, Tornador C, Bassaganyas L, Rabionet R, Tubio JM, Martinez-Fundichely A, et al. PeSV-Fisher: identification of somatic and non-somatic structural variants using next generation sequencing data. PLoS One 2013;8(5):e63377.
[49] Duncavage EJ, Abel HJ, Szankasi P, Kelley TW, Pfeifer JD. Targeted next generation sequencing of clinically significant gene mutations and translocations in leukemia. Mod Pathol 2012;25(6):795-804.
[50] Abel HJ, Al-Kateb H, Cottrell CE, Bredemeyer AJ, Pritchard CC, Grossmann AH, et al. Detection of gene rearrangements in targeted clinical next-generation sequencing. J Mol Diagn 2014;16(4):405-17.
[51] Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME, et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 2010;20(5):623-35.
[52] Spencer DH, Abel HJ, Lockwood CM, Payton JE, Szankasi P, Kelley TW, et al. Detection of FLT3 internal tandem duplication in targeted, short-read-length, next-generation sequencing data. J Mol Diagn 2013;15(1):81-93.
[53] Pritchard CC, Salipante SJ, Koehler K, Smith C, Scroggins S, Wood B, et al. Validation and implementation of targeted capture and sequencing for the detection of actionable mutation, copy number variation, and gene rearrangement in clinical cancer specimens. J Mol Diagn 2014;16(1):56-67.
[54] Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med 2010;61:437-55.
[55] Mullighan CG, Goorha S, Radtke I, Miller CB, Coustan-Smith E, Dalton JD, et al. Genome-wide analysis of genetic alterations in acute lymphoblastic leukaemia. Nature 2007;446(7137):758-64.
[56] Leary RJ, Lin JC, Cummins J, Boca S, Wood LD, Parsons DW, et al. Integrated analysis of homozygous deletions, focal amplifications, and sequence alterations in breast and colorectal cancers. Proc Natl Acad Sci USA 2008;105(42):16224-9.
[57] Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455(7216):1061-8.
[58] Pinto D, Darvishi K, Shi X, Rajan D, Rigler D, Fitzgerald T, et al. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat Biotechnol 2011;29(6):512-20.
[59] Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res 2008;36(16):e105.
[60] Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, et al. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 2009;27(1):66-75.
[61] Chiang DY, Getz G, Jaffe DB, O'Kelly MJ, Zhao X, Carter SL, et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods 2009;6(1):99-103.
[62] Kim TM, Luquette LJ, Xi R, Park PJ. rSW-seq: algorithm for detection of copy number alterations in deep sequencing data. BMC Bioinformatics 2010;11:432.
[63] Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25(21):2865-71.
[64] Wang J, Mullighan CG, Easton J, Roberts S, Heatley SL, Ma J, et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods 2011;8(8):652-4.
[65] Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, et al. Mapping copy number variation by population-scale genome sequencing. Nature 2011;470(7332):59-65.
[66] Sathirapongsasuti JF, Lee H, Horst BA, Brunner G, Cochran AJ, Binder S, et al. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV. Bioinformatics 2011;27(19):2648-54.
[67] Welch JS, Link DC. Genomics of AML: clinical applications of next-generation sequencing. Hematology Am Soc Hematol Educ Program 2011;2011:30-5.
[68] Spencer DH, Tyagi M, Vallania F, Bredemeyer AJ, Pfeifer JD, Mitra RD, et al. Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J Mol Diagn 2014;16(1):75-88.
[69] Spencer DH, Sehn JK, Abel HJ, Watson MA, Pfeifer JD, Duncavage EJ. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J Mol Diagn 2013;15(5):623-33.
[70] Karnes HE, Duncavage EJ, Bernadt CT. Targeted next-generation sequencing using fine-needle aspirates from adenocarcinomas of the lung. Cancer Cytopathol 2014;122:104-13.
[71] Wickham CL, Sarsfield P, Joyner MV, Jones DB, Ellard S, Wilkins B. Formic acid decalcification of bone marrow trephines degrades DNA: alternative use of EDTA allows the amplification and sequencing of relatively long PCR products. Mol Pathol 2000;53(6):336.
[72] Brown RS, Edwards J, Bartlett JW, Jones C, Dogan A. Routine acid decalcification of bone marrow samples can preserve DNA for FISH and CGH studies in metastatic prostate cancer. J Histochem Cytochem 2002;50(1):113-5.


[73] Stirewalt DL, Kopecky KJ, Meshinchi S, Engel JH, Pogosova-Agadjanyan EL, Linsley J, et al. Size of FLT3 internal tandem duplication has prognostic significance in patients with acute myeloid leukemia. Blood 2006;107(9):3724-6.
[74] Verhaak RG, Goudswaard CS, van Putten W, Bijl MA, Sanders MA, Hugens W, et al. Mutations in nucleophosmin (NPM1) in acute myeloid leukemia (AML): association with other gene abnormalities and previously established gene expression signatures and their favorable prognostic significance. Blood 2005;106(12):3747-54.
[75] Hadd AG, Houghton J, Choudhary A, Sah S, Chen L, Marko AC, et al. Targeted, high-depth, next-generation sequencing of cancer genes in formalin-fixed, paraffin-embedded and fine-needle aspiration tumor specimens. J Mol Diagn 2013;15(2):234-47.
[76] Rechsteiner M, von Teichman A, Ruschoff JH, Fankhauser N, Pestalozzi B, Schraml P, et al. KRAS, BRAF, and TP53 deep sequencing for colorectal carcinoma patient diagnostics. J Mol Diagn 2013;15(3):299-311.
[77] <http://www.wadsworth.org/labcert/TestApproval/forms/NextGenSeq_ONCO_Guidelines.pdf>.


C H A P T E R

20 Somatic Diseases (Cancer): Whole Exome and Whole Genome Sequencing

Jennifer K. Sehn
Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA

O U T L I N E

Introduction to Exome and Genome Sequencing in Cancer
Interpretative Considerations in Exome and Genome Cancer Sequencing
  Spectrum of Somatic Mutations in Cancer
    Codon Level Mutations
    Exon Level Mutations
    Gene Level Mutations
    Chromosome Level Mutations
  Paired Tumor–Normal Testing
    Tumor–Normal Comparison for Somatic Mutational Status
    Determination of Somatic Status Without Paired Normal Tissue
  Variants of Unknown Significance
    General Categories of VUS
    Statistical Models of Mutation Effect
    Pathway Analysis
    Driver Mutation Analysis
    Clonal Architecture Analysis
Analytic Considerations for Exome and Genome Sequencing in Cancer
  Specimen Requirements
  Limitations
    Decreased Depth of Coverage, Sensitivity, and Specificity
  Advantages
    Validation of a Single Assay
    Evaluate Many Genes Simultaneously from One Sample
    Improved Copy Number Variant Detection
    Improved SV Detection—Genomes
Summary
References

KEY CONCEPTS

• Somatic genetic alterations in cancer can be identified by next-generation sequencing (NGS) of clinical samples, with well-established pathogenic variants following a few general patterns of mutation.
• There are relatively few genes for which sufficient clinical-grade evidence exists to support interpretation of functional or therapeutic consequences for variants identified in those genes.
• Most variants identified by exome or genome sequencing in cancer are variants of unknown significance (VUS), with insufficient evidence to support meaningful clinical interpretation or to contribute to patient management.

Clinical Genomics. DOI: http://dx.doi.org/10.1016/B978-0-12-404748-8.00020-4
© 2015 Elsevier Inc. All rights reserved.

INTRODUCTION TO EXOME AND GENOME SEQUENCING IN CANCER

Exome and genome sequencing by next-generation sequencing (NGS) methodologies have played a primarily investigative role in the study and treatment of cancer. Such broad-scale strategies have provided insights into the mutations that underlie many cancer types, with hints at possible targets for current or future treatments [1–8]. Additionally, exome- and genome-level sequencing projects have elucidated other fascinating aspects of tumor biology, including tumoral heterogeneity comprised of clones and subclones, as well as clonal evolution through development of resistance mechanisms following initial treatment and other interventions [9–12]. In the research laboratory, exome or genome sequence analysis is often paired with sequencing of the RNA transcriptome and expression assays at the protein level, to provide a comprehensive view of cancer biology from genetics through protein expression. Intensive quality assessment and analysis are required to manage the vast amounts of data provided by such large-scale investigations. The methods and bioinformatics approaches that support whole exome or genome evaluation are covered in detail in Sections 1 and 2 of this book (see Chapters 1–3 and 7–11). However, the primary barrier to clinical adoption of exome or genome sequencing for clinical cancer testing is the astronomical number of variants, particularly variants of unknown significance (VUS), that are identified per case. As will be discussed in this chapter, there is insufficient evidence to support clinical interpretation of many of the variants identified in cancer samples through exome or genome sequencing, either because the identified variants do not follow the typical pattern of mutation that has been documented for well-studied genes, or because the variants occur in genes that have not been systematically evaluated in the literature. Some general approaches to analyzing VUS, including testing of paired tumor and normal tissue samples, will be reviewed. This chapter will also cover the downstream consequences of sequencing such large DNA target regions, with a discussion of the impact of target size on depth of coverage and mutation detection. Additionally, this chapter will discuss the advantages of exome and genome sequencing for specific purposes in cancer testing, mostly related to copy number and structural variant (SV) detection, as well as research and discovery applications.

INTERPRETATIVE CONSIDERATIONS IN EXOME AND GENOME CANCER SEQUENCING

Spectrum of Somatic Mutations in Cancer

Broadly focused NGS investigations like exome and genome sequencing are optimally suited for the study of a large number of genes related to a disease, with or without the requirement for a priori knowledge of which genes may be involved. This is especially true when sequence variants show no predilection for a particular mutation site or particular class of mutations. In cancer, the spectrum of mutations between tumor types and between genes within a single tumor type can be quite variable, ranging from a handful of single nucleotide variants (SNVs) at a small number of codon “hotspots” in a few oncogenes or tumor suppressors, to a wide variety of point mutations or large deletions that can inactivate a tumor suppressor gene. Examples of these highly variable types of mutations are discussed in this section, to set the stage for consideration of the role of exome and genome sequencing in clinical and research cancer testing.


Codon Level Mutations

Many genes that are currently targetable by specific drugs have mutations at a relatively limited number of hotspots. A classic example is the epidermal growth factor receptor (EGFR) gene, which is mutated in a subset of carcinomas, predominantly lepidic pattern lung adenocarcinoma arising in nonsmokers [13]. Activating mutations in EGFR occur most commonly as in-frame indels in exon 19, specifically involving the ATP-binding pocket of the tyrosine kinase domain (codons 745–759) [14]. As long as a co-occurring resistance mutation is not present, tumors harboring exon 19 indels are virtually always susceptible to inhibition by reversible EGFR tyrosine kinase inhibitors (TKIs) like erlotinib and gefitinib [15–21]. Correspondingly, the most common EGFR mutation conferring resistance to TKIs is the threonine to methionine Thr790Met amino acid substitution resulting from a SNV in exon 20, which also encodes part of the tyrosine kinase domain. This so-called gatekeeper mutation negates the EGFR-inhibitor susceptibility conferred by exon 19 indels and renders the tumor nonresponsive to reversible EGFR inhibitors [19,22]. Less common somatic EGFR mutations occurring in cancer samples are significantly more difficult to classify in terms of potential response to TKIs. For example, patients with tumors harboring EGFR exon 20 in-frame indels involving codons 763–767 seem to respond to TKI therapy, whereas patients harboring in-frame indels involving codons 769–773 of exon 20 are reportedly resistant to reversible TKIs [19,23–26]. This difference in TKI response is observed in spite of the fact that the mutations occur very near each other and involve the tyrosine kinase domain. As this brief discussion illustrates, the clinical interpretation of mutations in genes that typically harbor a small number of hotspot variants is challenging, and it is unclear what value is added to clinical patient management when no comment can be made regarding whether a tumor harboring a novel or rare sequence variant is likely to respond to a particular treatment strategy. Genes like EGFR, with a limited number of actionable mutations in cancer, can be effectively evaluated by conventional PCR, Sanger sequencing, and capillary electrophoresis techniques and do not always require more advanced methods like NGS; even in these cases, though, NGS assays do have the benefit of simultaneously evaluating multiple genes from limited tissue samples, as discussed in Chapter 19. However, conventional laboratory methods for mutational analysis become increasingly complex as the clinical demand for more comprehensive genetic analysis of cancer samples grows. Although conventional techniques are effective for detection of a few hotspot mutations, they are limited in their technical capacity to be multiplexed for additional loci or genes of interest, or to assess multiple classes of mutations simultaneously.
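For illustration only, the codon-level interpretation rules just described can be expressed as a simple lookup. In the Python sketch below, the codon ranges and response associations are taken from the discussion above, but the function, its rule table, and the variant-class labels are hypothetical conveniences, not a validated clinical classifier:

```python
# Toy codon-level lookup for EGFR variants (illustrative only). Rules mirror
# the hotspot associations discussed in the text; anything outside the
# enumerated hotspots falls through to VUS.
EGFR_RULES = [
    # (variant class, codon range, literature-reported TKI association)
    ("inframe_indel", range(745, 760), "exon 19 indel: sensitive to reversible TKIs"),
    ("missense_T790M", range(790, 791), "exon 20 gatekeeper: TKI-resistant"),
    ("inframe_indel", range(763, 768), "exon 20 indel: reported TKI-responsive"),
    ("inframe_indel", range(769, 774), "exon 20 indel: reported TKI-resistant"),
]

def interpret_egfr(variant_class: str, codon: int) -> str:
    """Return the reported association for a hotspot, or VUS otherwise."""
    for vclass, codons, association in EGFR_RULES:
        if variant_class == vclass and codon in codons:
            return association
    return "variant of unknown significance (no codon-level rule applies)"

print(interpret_egfr("inframe_indel", 746))   # exon 19 -> TKI-sensitive
print(interpret_egfr("missense_T790M", 790))  # gatekeeper -> resistant
print(interpret_egfr("inframe_indel", 780))   # outside known hotspots -> VUS
```

The fall-through branch is the important part: any variant not covered by an enumerated rule defaults to VUS, which is exactly the interpretive gap described above.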

Exon Level Mutations

Other cancer genes, such as KIT, have slightly broader mutational spectra. KIT, like EGFR, encodes a tyrosine kinase that is sometimes activated by somatic mutation in human cancers, most commonly gastrointestinal stromal tumor and acute myeloid leukemia. Unlike EGFR, where relatively few codons are hotspots for mutation in cancer, the functional and therapeutic consequences of somatic KIT mutations are typically grouped by the exon in which the mutations occur. Tumors harboring activating mutations in the juxtamembrane domain of KIT (encoded by exons 9 and 11), including both SNVs and small in-frame indels, typically are responsive to TKI therapy with imatinib or other TKIs [27–29]. Interestingly, exon 9 mutations are usually less sensitive to imatinib than exon 11 mutations, leading to a recommendation for increased doses to achieve therapeutic response in patients harboring exon 9 mutations versus those with exon 11 mutations [27,30,31]. Correspondingly, point mutations in the ATP-binding domain (encoded by exons 13 and 14) or the activation loop (encoded by exon 17) are commonly described as conferring secondary resistance to TKIs following an initial period of TKI responsiveness in patients with KIT-mutated tumors [27,30,32]. Though subtleties still exist in the TKI response profile observed between different KIT mutations, consistency of the overall response (i.e., sensitive vs. resistant) across each exon facilitates clinically meaningful classification and interpretation of mutations. As such, evaluation of entire exons of KIT can provide meaningful clinical information; however, little data exist that can aid in the interpretation of mutations occurring in exons other than those discussed above. KIT SNVs and indels occurring in the relatively limited number of exons involved in solid tumors (exons 9, 11, 13, 14, and 17) or hematologic malignancies (exons 8 or 17) can be efficiently evaluated in clinical cancer samples by PCR followed by Sanger sequencing and/or capillary electrophoresis. When working from formalin-fixed, paraffin-embedded tissue blocks, these conventional techniques can be applied easily to targets in the range of 10s to 100s of bases. However, as mentioned previously, NGS does provide the added benefit of evaluating multiple gene targets simultaneously, which is especially useful when dealing with small tissue samples.


Gene Level Mutations

Still other genes that are recurrently mutated in cancer show very little reproducibility in the location or types of sequence variants that can be observed. Many of these genes encode tumor suppressors, whose function or expression can be lost by multiple mechanisms including missense or nonsense SNVs, small frameshift indels, or large deletions encompassing all or part of the gene and/or its regulatory elements. Correspondingly, oncogenes may be activated via amplification of the gene. TP53 is an example of a tumor suppressor gene with innumerable documented somatic mutations occurring across many cancer types. According to the International Agency for Research on Cancer (IARC) TP53 Database, over 28,000 confirmed somatic mutations in TP53 have been reported in cancer [33]. At the same time that TP53 somatic mutations can be identified in a wide range of cancers, their ubiquity also limits their clinical relevance. For example, TP53 is mutated in over 90% of ovarian epithelial carcinomas, and it is difficult to derive clinical meaning from TP53 mutation status when almost all of the tumors harbor a TP53 mutation and no specific therapy is indicated in tumors with TP53 mutation [34]. In contrast, PTEN is a tumor suppressor gene that is often inactivated via somatic mutation but for which mutational status does directly impact therapy. PTEN is an intrinsic inhibitor of the PI3K/Akt/mTOR cell growth and proliferation pathway, and intact PTEN additionally inhibits RET/RAS/ERK proliferative signaling. Inactivation of PTEN by somatic mutation releases its inhibition of both pathways and facilitates cell growth and proliferation through downstream Akt/mTOR and ERK signals [35]. Interestingly, targeted inhibitors of both the PI3K/Akt/mTOR pathway and the RET/RAS/ERK pathway are currently being investigated in the treatment of cancer, lending potential clinical significance not only to identification of inactivating mutations in PTEN but also to concurrent evaluation of other genes in those pathways to identify simultaneous mutations that could independently activate proliferative signaling [36]. For example, focused inhibition of PI3K in a patient whose tumor harbors a PTEN inactivating mutation is unlikely to be effective if the tumor also has an activating AKT1 mutation; the activating AKT1 mutation could be sufficient to drive proliferative signaling autonomously, without requiring any signal from PI3K. Hence, inhibiting PI3K alone would not inhibit cell growth and proliferation. In this context, the benefit of more broad analysis of genes related through interacting signaling pathways becomes clearer. However, it is important to emphasize that the significant finding is really loss of tumor suppressor expression or function, not DNA mutation per se. In fact, tumor suppressor expression can also be lost by regulatory or epigenetic alterations (e.g., promoter methylation) without changing the DNA sequence at all [37,38]. In this way, evaluation of all the genes in a pathway by DNA sequencing still may not yield a complete picture of the underlying tumor biology. It is also worth noting that other existing clinical methods, including immunohistochemical (IHC) staining of tissue sections or flow cytometry of liquid specimens, can be used to evaluate protein loss, aberrant expression, or abnormal localization in clinical cancer samples, which can eliminate the need for DNA sequence analysis.
Indeed, IHC stains and flow cytometry are commonly used in routine evaluation of tumor biopsy specimens for both diagnostic and prognostic indications.

Chromosome Level Mutations

A final class of common mutations in cancer is structural variation, including translocations and inversions. The classic instance observed in cancer is the t(9;22) BCR-ABL translocation of chronic myelogenous leukemia, which results in an active fusion kinase that can be effectively inhibited by imatinib [39]. Relatively few breakpoints have been identified for this translocation, so clinical testing is effectively accomplished by relatively straightforward RT-PCR of RNA extracted from peripheral blood leukemic cells or bone marrow aspirates [40]. On the opposite end of the spectrum, KMT2A (also known as MLL) translocations in poor-prognosis acute leukemia are known to involve at least 70 different partners [41]. While rearrangements involving multiple possible partners can be efficiently evaluated in a clinical setting using break-apart fluorescence in situ hybridization (BA-FISH), BA-FISH does not provide any information about the identity of the rearrangement partner. In many cases, however, including KMT2A (MLL) rearrangements, identification of the translocation partner actually bears no clinical significance beyond that conferred by the presence of a KMT2A (MLL) translocation, not further classified [42]. For this reason, FISH is currently the most widely used technique in clinical cancer testing to identify SV involving genes related to cancer diagnosis, prognosis, or treatment. Additionally, karyotyping can be performed to identify large structural rearrangements, as is routinely done for hematologic malignancies. In summary, the spectrum of mutations within and between genes, as well as within and between cancer types, can be quite variable. As discussed for each class of variants, methods for clinical mutation testing in cancer depend on the types of mutations expected to be present, the level of classification (i.e., codon, exon, gene, or

III. INTERPRETATION

347

INTERPRETATIVE CONSIDERATIONS IN EXOME AND GENOME CANCER SEQUENCING

chromosome alteration) required to guide patient care, the number of genes that need to be evaluated from the same patient sample, and the clinical relevance of identified variants.

Paired Tumor–Normal Testing

Extensive NGS analysis of very large target regions, including exomes and genomes, categorically requires evaluation of corresponding normal samples for assessment of somatic mutations, essentially in order to filter out germline variants. Unfortunately, current reimbursement paradigms do not support clinical sequencing of nontumor samples for comparison to tumor samples, one of several reasons why exome or genome sequencing of cancer samples is not currently practical in the clinical laboratory. The number of somatic mutations present in a tumor sample is highly variable between cancer types, with some cancers having fewer than 1 mutation per Mb of coding DNA sequence and others having over 100 mutations per Mb (Table 20.1) [1–5,43–46]. Unlike targeted NGS analysis of relatively limited and well-described sets of hotspot mutations and cancer genes (described in Chapters 18 and 19), exome targeting captures 30 Mb to over 75 Mb of sequence (depending on the reagent used for capture) and produces hundreds of nonsynonymous coding sequence variants from each cancer sample. In so-called “hypermutator phenotype” tumors characterized by unusually high rates of somatic mutation, over 1000 somatic nonsynonymous variants (SNVs and indels) can be seen [1,43]. Lung squamous cell carcinoma (SCC), which has one of the highest described somatic mutation rates, harbors an average of 228 nonsilent protein-coding sequence mutations, 165 structural rearrangements, and 323 copy number changes per tumor [2]. Inclusion of noncoding (e.g., intronic or untranslated region (UTR)) variants increases the average number of variants per case still further. The number of variants that can be identified via whole genome sequencing of cancer samples is nearly incomprehensible (e.g., a median of over 18,000 single nucleotide changes for lung adenocarcinoma), as the much less conserved nongenic sequences harbor greater genetic variation than the comparatively highly conserved genes targeted by exome sequencing [44]. The importance of this mutational spectrum in tumors is further complicated by the fact that an average individual harbors anywhere from 140 to 420 nonsilent (nonsynonymous SNV, gain or loss of stop codon, frameshift or in-frame indel, or change in splice site) germline variants that are not present in any significant proportion of other individuals (i.e., they are variants with a minor allele frequency of <0.5%) [47]. The vast majority of these are not expected to contribute directly to carcinogenesis and likely represent benign variation seen in healthy humans [48]. However, the number of “novel” germline variants increases dramatically for individuals from ethnicities that are less well genetically characterized, based solely on inadequate sampling of rare benign polymorphisms in those populations. Since benign polymorphisms are generally indistinguishable from tumor-associated mutations in cancer samples if matched normal tissue is not available for comparison, these polymorphisms in the patient’s background germline genetic profile cannot be separated from tumor-associated acquired mutations.

TABLE 20.1 Number of Somatic Mutations Identified by Exome Sequencing in Various Cancer Types

Tumor type | Mean somatic mutation rate (# of mutations/Mb of coding DNA sequence)(a) | Mean number of somatic copy number alterations(a) | Mean number of somatic structural variants(b)
Lung squamous cell carcinoma [2] | 8.1 | 323 | 165
Lung adenocarcinoma [1,44] | 0.6 (nonsmokers), 10.5 (smokers) | N/A | 98
Colorectal carcinoma [4] | 3.2 | 213 | 75
Glioblastoma multiforme [45] | 2.3 | 282 | 6
Ovarian serous carcinoma [3] | 2.1 | 477 | N/A
Clear cell renal cell carcinoma [5] | 1.1 | 156 | N/A
Breast carcinoma [46] | 1.0 | 282 | 90
Acute myeloid leukemia [43] | 0.56 | N/A | N/A

(a) Determined by exome sequencing.
(b) Determined by genome sequencing.
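The practical consequence of these mutation rates is easy to estimate: multiplying a rate from Table 20.1 by the size of the captured coding target approximates the somatic coding variant load per case. A back-of-envelope Python sketch (the 30 Mb target is an illustrative round number from the text, not a specific kit):

```python
# Rough estimate of the somatic coding variant load an exome assay would
# report, using mean mutation rates from Table 20.1.
rates_per_mb = {
    "lung squamous cell carcinoma": 8.1,
    "colorectal carcinoma": 3.2,
    "acute myeloid leukemia": 0.56,
}
target_size_mb = 30  # CCDS-sized exome; larger kits capture more sequence

for tumor, rate in rates_per_mb.items():
    expected = rate * target_size_mb
    print(f"{tumor}: ~{expected:.0f} somatic coding mutations expected")
# Lung SCC comes out at ~243, consistent with the average of 228 nonsilent
# coding mutations per tumor quoted in the text.
```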


Tumor–Normal Comparison for Somatic Mutational Status

Comparison of tumor tissue versus normal (or, more accurately, nonneoplastic) tissue from the same patient is thus extremely valuable for two purposes: (i) determining whether an identified variant is a germline variant or a somatic mutation, and (ii) decreasing the overall number of variants to be evaluated by assays intended to identify somatic mutations in cancer samples. Variants identified in the nonneoplastic tissue are not expected to be specific to the malignancy being tested, but instead are expected to represent germline variants that are present in all of that individual’s tissues. A caveat to this approach is that somatic mutations are in fact acquired in dividing cells over the lifespan of an individual [49], but most of these mutations do not result in neoplasia and are considered benign or nonpathologic [48]. Hence, if a variant is present in nonneoplastic tissue, it is generally not considered further in assessment of cancer-related mutations identified only from the paired tumor sample. It is worth mentioning that there is one caveat to removal of germline variants from subsequent analyses, namely that some germline variants are very relevant in cancer. Examples include the association of germline variants in BRCA1 or BRCA2 in families with hereditary breast cancer, or germline variants in TP53 in families with Li–Fraumeni syndrome [50,51]. However, standard clinical evaluation of these patients, including family history and personal past medical history, should indicate whether hereditary cancer syndromes are a diagnostic consideration requiring further germline genetic evaluation, which is beyond the scope of this chapter.
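Conceptually, this germline filter is a set subtraction over variant calls; production pipelines additionally weigh read depth, VAF, and sequencing error at each site. A minimal Python sketch (the variant coordinates are illustrative examples; the TP53 position is the polymorphism discussed in the next section):

```python
# Minimal sketch of tumor-normal subtraction: variants also seen in the
# patient's nonneoplastic sample are treated as germline and removed from
# the somatic candidate list. Keys are (chrom, pos, ref, alt).
tumor_calls = {
    ("chr17", 7579472, "G", "C"),   # TP53 p.P72R, a common germline polymorphism
    ("chr12", 25398284, "C", "T"),  # KRAS codon 12 change, a plausible somatic driver
}
normal_calls = {
    ("chr17", 7579472, "G", "C"),   # present in normal tissue -> germline
}

somatic_candidates = tumor_calls - normal_calls
print(somatic_candidates)  # only the KRAS call survives the germline filter
```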

Determination of Somatic Status Without Paired Normal Tissue

Myriad underlying factors complicate the interpretation of germline versus somatic status for each mutation identified from a cancer sample when a separate normal tissue sample from the same patient is not available for comparison. Many approaches to this problem rely on the variant allele frequency (VAF), the fraction of sequence reads at a position that carry a particular allele. For example, a common polymorphism in TP53 is p.P72R (substitution of arginine for proline at codon 72 of TP53), which is caused by substitution of cytosine for the reference nucleotide guanine at the corresponding position in the TP53 gene (chr17:7579472G>C). Sequencing a patient who is heterozygous for this variant should result in a VAF of 0.5 (50%) for the G>C substitution, as one of the two alleles in each cell harbors the variant rather than the reference allele. However, when evaluating VAFs for variants identified from cancer samples, it is important to consider what cells are actually being sequenced. Tumor samples in general, and solid tumor samples in particular, are inherently heterogeneous, consisting not only of the tumor cells themselves but also associated inflammatory cells (lymphocytes, neutrophils, macrophages), stromal cells (fibroblasts), endothelial cells (blood vessels and lymphatics), and normal parenchymal cells (e.g., adipocytes in a sample of invasive breast cancer). The relative proportion of these various cell types is highly variable between different tumor samples, and even between different areas of the same tumor. Unfortunately, quantification of tumor cellularity by visual estimation, even when performed by an experienced pathologist, is notoriously inaccurate, with poor interobserver agreement and significant deviation from tumor cellularity quantified by counting individual cells [52]. The difficulty with visual estimation of tumor cellularity stems from the widely variable shapes, sizes, and growth patterns observed for cells admixed in a cancer sample. If a sample submitted for sequencing has 50% tumor cellularity and the patient is heterozygous for a germline variant, the tumor cells (as well as normal cells) would have an expected VAF of 0.5. If the patient is germline homozygous for a reference allele but there is a heterozygous somatic mutation at that locus in the tumor, the VAF would be expected to be 0.25. If the mutation were homozygous instead of heterozygous in the tumor cells, the expected VAF would be 0.5. Thus, based on the VAF alone, it is often uncertain whether a variant is a somatic mutation or a germline polymorphism. The VAF also does not provide clear insight into the underlying distribution of a mutation that is known to be somatic (e.g., a heterozygous mutation present in all the tumor cells could have the same VAF as a homozygous mutation present in half of the tumor cells). Interpretation of VAFs is further complicated by a complex interplay of various classes of mutation affecting the same locus. For example, a lung cancer may harbor a clonal SNV in EGFR that results in overactivation of progrowth signaling. If the patient is then treated with targeted inhibitors, the tumor cells may escape cell death via amplification of the EGFR gene.
If the allele that harbors the SNV is amplified, the VAF will increase relative to the number of individual cells that harbor the mutation, as each cell will contribute more than two alleles to the total number of sequenced alleles. Alternatively, EGFR amplification may occur in a different subpopulation of the tumor cells that does not harbor the SNV, which would result in a lower VAF relative to the number of tumor cells harboring the SNV. Even with integration of copy number analysis, direct consideration of the


observed VAF or correlation of VAFs with tumor cellularity for assessment of clonality or germline versus somatic status is not straightforward and can be misleading. Using VAFs to infer somatic versus germline status for an identified variant is further complicated by analytic factors inherent to capture and/or amplification techniques that introduce technical sources of bias. Even capture-based methods involve a few cycles of amplification, as discussed in depth in Chapters 2–4 and briefly covered below. The end result of these technical biases is that one allele may be over- or underrepresented compared with the other allele(s) present in the sample [53–58]. Taken together, then, variability in tumor cellularity, co-occurring genetic alterations, and technical biases combine to make simple measures like VAFs hopelessly convoluted and virtually useless in assessment of germline versus somatic status for a given variant identified by NGS, if a paired normal tissue is not available for comparison. A clear example of a misleading VAF for an identified variant has been reported previously [59]. A 63-year-old smoker initially presented with a keratinizing SCC of the oropharynx; 2 years after she underwent partial pharyngectomy with negative margins, a new 1.5-cm lung keratinizing SCC was identified. Targeted clinical NGS (of a panel of 40 genes) was performed on both tumors to assess whether the lung tumor was a metastasis from the oropharyngeal cancer or a new lung cancer. Both specimens contained a p.P58R (c.173C>G) proline to arginine substitution in TP53, with a VAF of 0.30 (30%) in the oropharyngeal SCC and 0.32 (32%) in the lung tumor, and copy number analysis performed on this sample showed no CNV at the affected locus. Sanger sequencing of a nontumor tissue sample from the patient showed that the TP53 p.P58R variant was in fact a germline heterozygous variant (and thus the VAF should be 0.5) (Figure 20.1). Based on the VAF alone, without knowledge of the germline status, the variant would likely have been incorrectly interpreted as somatic, and erroneously taken as evidence that the two carcinomas were related to each other (i.e., primary oropharyngeal tumor with lung metastasis).

FIGURE 20.1 TP53 variant identified in cancer samples with VAF <0.5. NGS was performed on two separate tumor samples from one patient. A TP53 variant was identified in both samples (chr17:7579514G>C). The patient’s oropharyngeal SCC contained the variant with a VAF of 0.30 (A). The same variant was present in the lung SCC with a VAF of 0.32 (B). Sanger sequencing was performed on nonneoplastic tissue from the same patient, which demonstrated that the variant was heterozygous in the germline (C). Copy number analysis at this position showed no copy number alterations. Based on the VAF alone, this variant may have been incorrectly interpreted as a somatic mutation that was shared between the two tumors.
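The purity and zygosity arithmetic worked through above can be made explicit. The following Python sketch, a simplified model assuming diploid cells and no copy number alteration at the locus, reproduces the expected VAFs and shows why a single observed VAF is ambiguous:

```python
# Expected VAF under the simplest model: diploid cells, no CNV at the locus.
# tumor_fraction is the proportion of sequenced cells that are tumor; mutant
# alleles per cell is 1 (heterozygous) or 2 (homozygous).
def expected_vaf(tumor_fraction: float, mut_alleles_tumor: int,
                 mut_alleles_normal: int = 0) -> float:
    mutant = (tumor_fraction * mut_alleles_tumor
              + (1 - tumor_fraction) * mut_alleles_normal)
    return mutant / 2.0  # each diploid cell contributes two alleles

print(expected_vaf(0.5, 1))                        # somatic het, 50% purity -> 0.25
print(expected_vaf(0.5, 2))                        # somatic hom, 50% purity -> 0.50
print(expected_vaf(0.5, 1, mut_alleles_normal=1))  # germline het -> 0.50
# A VAF of 0.5 is therefore compatible with either a homozygous somatic
# mutation or a heterozygous germline variant, the ambiguity described above.
```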

Variants of Unknown Significance

The single most important point in considering the application of exome or genome sequencing in the clinical laboratory is what clinically relevant information can be derived from the testing. Although the interpretation of variants identified by NGS is covered in Chapter 13, because of its paramount importance it is discussed again here with a specific focus on cancer testing. Presumably, advances in sequencing platforms and reagents will result in decreased cost of sequencing that could facilitate performing exome or genome sequencing at a clinically meaningful depth of coverage with


clinically acceptable test performance characteristics (discussed later). However, even if an exome- or genome-level study was economically feasible and optimized for clinically relevant sensitivity and specificity, the overwhelming majority of variants identified—even if only nonsynonymous variants were considered—would fall into the category of “VUS.” By definition, for a VUS there is insufficient existing evidence to support an interpretation regarding the effect of the variant on protein function, cell function, tumor behavior, and/or response to treatment.

General Categories of VUS

Several general types of VUS can be defined. A sequence change affecting the coding region of a well-established gene related to cancer, but not following the pattern of somatic mutation observed for that gene, would be a VUS. For example, constitutive activation of ABL1 in myeloid neoplasms typically occurs via translocation with a resulting BCR-ABL fusion protein kinase [39,60]. A nonsynonymous SNV in ABL1 that has not been described as a polymorphism in healthy individuals would be a VUS, as the effect of the SNV on ABL1 kinase function cannot be extrapolated from the literature about BCR-ABL activity or response to targeted kinase inhibition. Previously undescribed variants occurring in DNA sequences that do not encode amino acids, including splice site or other UTR variants, are even more difficult to interpret. Splice donor site variants may result in inclusion of intronic sequence in the translated portion of the mRNA, leading to inclusion of extra amino acids in the protein product; the extra coding sequence contributed by the intron may be in frame or out of frame with the normal protein sequence and may or may not have a significant impact on the protein’s function, depending on how the extra amino acids affect the protein’s structure. Variants in the splice acceptor site may result in exon skipping, which again may or may not have a significant impact on the protein’s function [48]. Furthermore, it is well established that more than one isoform exists for many proteins, and not all exons in any particular protein are included in the isoforms observed for that protein in normal tissue. Indeed, it is typically nearly impossible to sort out which isoform is primarily expressed in any particular tissue type. Consequently, loss of one exon of a gene in a tissue type that does not usually contain that exon anyway is unlikely to significantly affect protein function in that tissue, while loss of the same exon in a tissue type where that exon is used for some critical function (e.g., protein localization) may have a dramatic effect on protein and cell function [48,61,62]. Variants occurring in noncoding regulatory regions, including promoters, enhancers, UTRs, or even more broadly acting regulatory elements like microRNAs or long noncoding RNAs, can have unanticipated and widespread consequences in terms of gene expression or cell function. VUS occurring in these regions mostly are identified through genome sequencing and to a lesser extent through exome sequencing (when flanking regulatory regions surrounding protein-coding genes and/or RNA genes are captured in addition to the protein-coding gene sequences themselves). There is very little existing literature to guide interpretation of noncoding variants for clinical applications.
Conversely, many genes that are known to be recurrently mutated in cancer currently have no direct implications for patient management in terms of treatment, prognosis, or diagnosis. These are mutations for which the functional consequences are known, but that are not treatable or otherwise actionable. For example, somatic mutation of APC is a characteristic genetic alteration involved in progression of colon polyps to in situ and invasive colonic adenocarcinomas [63]. While APC plays an important role in colorectal carcinogenesis, an APC mutation does not bear any particular clinical significance when identified from a lesion that is already known to be malignant. Somatic APC mutation is not correlated with prognosis or response to standard therapy, and loss of APC protein function does not provide alternative targets for treatment [64]. Indeed, it is important to note that somatic APC mutations can even be seen in colorectal lesions that are not malignant [65]. The number of genes harboring mutations that currently, based on the published literature, can be classified as having a direct role in patient management is quite small in comparison with the number of genes evaluated in an exome or genome. Thus, in a research/discovery context and in clinical interpretation of VUS, several approaches are used to understand the information provided by exome or genome sequencing, namely: (i) does the sequence variant result in altered protein function and (ii) does the altered protein function affect tumor phenotype (growth, survival, invasion, metastasis, response to treatment, recurrence, etc.)?

Statistical Models of Mutation Effect

As discussed in Chapter 13, a number of different statistical approaches can be applied to predict the likely impact of a mutation on the function of a gene product. Features like evolutionary conservation of a particular base or amino acid, frequency of a given variant in the “healthy” population, location of a variant within the


protein structure (which domain, near an enzymatically active site, etc.), and more comprehensive in silico models of protein function (e.g., SIFT, PROVEAN, PolyPhen) can all be considered when trying to determine whether an identified variant is likely to alter the function of a gene [66–70]. However, these approaches must be applied with caution when assigning functional significance to mutations in a clinical setting, as the effect of a mutation on protein function (and more broadly on cell signaling) is rarely straightforward [71].

Pathway Analysis

Clustering of somatically mutated genes by the functional pathways in which they are involved is one technique by which additional information can be gleaned from the vast amount of sequence data provided by exome or genome analysis of cancer samples [72–75]. As discussed above, a very broad spectrum of mutations occurring in PTEN, PIK3CA, AKT, mTOR, RET, RAS, ERK, and/or any other member of these signaling pathways could all have a downstream effect of inducing cell growth and proliferation. Clustering of genes that are involved through overlapping pathway mediators facilitates understanding of shared biological implications of mutations occurring in otherwise seemingly unrelated genes. Although many of the key proteins in the abovementioned pathways have been extensively studied and currently are “targetable” by existing drugs (or drugs in development), there are many cell pathways involved in proliferation, metabolism, chromatin regulation, and so on, mutations of which do not yet provide clinically useful insights for patient management.

Driver Mutation Analysis

Another interesting approach has been the attempt to identify so-called driver mutations that are responsible for oncogenesis in various cancer types [76–78]. Driver mutation analysis aims to distinguish genes that are important for tumor growth, survival, or metastasis (driver genes) from genes that are mutated as collateral damage in genetically deranged cancer cells that have impaired DNA repair mechanisms (passenger mutations) [76]. In addition to providing insights into the underlying tumor biology, identification of driver genes may provide targets for novel treatment strategies. Presently, driver mutation analysis is a highly statistical approach and requires further investigation before it becomes a routine method in clinical NGS testing.

Clonal Architecture Analysis

A critically important aspect of tumor biology that has recently been revealed by research sequencing of cancer samples is the rich clonal genetic diversity that can be seen in morphologically homogeneous tumors. A classic study that used statistical and spatial analyses of variants identified through exome sequencing of multiple areas of a renal cell carcinoma unequivocally demonstrated that malignant neoplasms are frequently (always?) composed of many tumor subclones, each of which may independently gain the ability to metastasize [79]. Although the initial sequencing approach taken in this study, evaluating multiple samples from primary and metastatic lesions from one patient, is not currently clinically applicable, the findings of this study are immediately relevant in the clinic.
The research discovery of separately metastasizing tumor subclones that harbor distinct mutational profiles clearly indicates that clinical sequencing of one area of a patient’s primary tumor or one of a patient’s multiple metastatic lesions does not necessarily fully represent the genetic architecture of the patient’s cancer as a whole [10,11,79]. In fact, a growing number of studies have demonstrated that malignancies are extremely dynamic in terms of the tumor clones and subclones that are present. Exome and genome sequencing studies of acute myeloid leukemia samples both at initial diagnosis and at relapse following treatment have clearly shown that environmental selective pressures (e.g., the administration of chemotherapy) cause the expansion or development of tumor clones that harbor mutations enabling escape from the antitumor treatment [11]. Interestingly, it seems as though these evasive clones harbor a relatively limited repertoire of escape mutations, indicating a potential bottleneck that could be exploited as a target for future therapeutics. Additionally, studies in melanoma and multiple myeloma have demonstrated that targeted therapy of one tumor subclone can actually induce growth in a different tumor clone; in particular, targeting of a BRAF-mutated subpopulation of tumor cells with a BRAF inhibitor actually induces growth in a different tumor subclone with wild-type BRAF but mutated NRAS [80,81]. Even more worrisome, it has been shown that small subclones of tumor cells harboring mutations that confer worse prognosis can be induced to expand through treatment, whereas the clones may persist at low levels (i.e., not expand) for extended periods of time without treatment [82]. As with driver mutation analysis, rigorous basic science studies and clinical research trials are required to address how these intriguing and complicated findings translate into treating patients in the clinic.
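To build intuition for clonal architecture analysis, the toy sketch below groups somatic VAFs into crude bins that suggest a founding clone and subclones. The VAF values are invented, and published analyses infer clonality with formal statistical mixture models over large variant sets, corrected for tumor purity and copy number; simple binning like this is for illustration only:

```python
# Toy clonality illustration: bin somatic VAFs and label the clusters.
from collections import defaultdict

vafs = [0.48, 0.51, 0.47, 0.24, 0.22, 0.08, 0.07]  # hypothetical somatic VAFs

clusters = defaultdict(list)
for v in vafs:
    clusters[round(v, 1)].append(v)  # crude binning to one decimal place

for center, members in sorted(clusters.items(), reverse=True):
    label = "founding clone?" if center >= 0.4 else "possible subclone"
    print(f"VAF ~{center:.1f}: {len(members)} variants ({label})")
```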


In summary, clinical mutation testing in cancer depends on being able to assign clinical relevance to the identified variants. Discovery of driver genes or key pathways involved in cancer biology through exome or genome sequencing is both interesting and essential to continued understanding of human oncobiology and development of possible treatment strategies, but the broad approaches that are so valuable in the research laboratory currently yield too many VUS for exome or genome sequencing to be practical in the clinical laboratory.

ANALYTIC CONSIDERATIONS FOR EXOME AND GENOME SEQUENCING IN CANCER

Specimen Requirements

As with other NGS approaches, many specimen types are acceptable substrates for exome or genome testing, including fresh/frozen, formalin-fixed, and alcohol-fixed tissues [83,84]. Specimens that have been decalcified with acid are virtually never acceptable for sequence analysis; if decalcification is necessary, chelating agents (e.g., EDTA) are preferred [85,86]. The same considerations taken in amplicon or gene panel capture-based approaches regarding adequate tumor cellularity, DNA input, and specimen library complexity apply for exome and genome sequencing (see Chapter 3). Library complexity describes the number of individual DNA molecules (and thus, the number of individual cells) that are present in the input DNA sample. A specimen with high cellularity and an adequate amount of DNA (generally >200 ng) usually yields a library with good complexity. Paucicellular specimens are at risk for generating DNA libraries with low complexity, meaning that few unique DNA molecules are present at the start of the sequencing assay. DNA libraries with low complexity (few unique molecules) are at risk for sampling bias; if a variant is present at low frequency in a tumor and only a few cells from the tumor are sequenced, the variant simply may not be present in the genomes of the cells from which the sequence is produced. If the variant is not sampled, it cannot be detected, which is referred to as sampling bias or “allelic dropout.” Alternatively, a variant may be present in the sample subjected to sequencing but may be under- or overrepresented in the sequence data due to technical amplification bias (recall that amplification cycles are a component of all NGS methods, including exome and genome methods, as discussed in Chapter 1), which can occur with as little as one nucleotide difference between the variant and reference alleles [53–58]. Additionally, errors that occur early during amplification cycles can be overrepresented in samples with few unique input DNA template molecules; any form of early PCR bias or error that results in overamplification of a product in relation to its frequency in the original sample is generally referred to as “jackpotting” [87]. If the sequencing assay is intended to identify somatic variants, paired tumor and normal tissue from the same patient are required to filter out germline variants during downstream analysis and interpretation, and to enable accurate assignment of somatic versus germline status for identified variants, as described above. Unfortunately, current clinical reimbursement paradigms do not support sequencing of normal tissue samples for comparison with tumor samples. Further, if sequence results from paired normal tissue are not available for use in filtering VUS identified in the cancer sample, overwhelming numbers of VUS will be identified from each cancer sample, as also discussed above. The cost and time required to interpret many VUS per case—especially when the resulting interpretations lack sufficient evidence to support changes in patient management—are major reasons why exome and genome sequencing are not well suited for clinical cancer testing, despite their usefulness in the research setting.
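The sampling risk described above can be illustrated with a simple model: if a variant is carried on a fraction f of the unique template molecules in a library, the probability that none of n unique molecules covering the locus carries it is (1 - f)^n. A Python sketch with illustrative numbers:

```python
# Allelic dropout as a sampling problem: probability that a low-frequency
# allele is never sampled among the unique template molecules at a locus.
def dropout_probability(f: float, n_unique_molecules: int) -> float:
    return (1 - f) ** n_unique_molecules

for n in (10, 50, 200):
    p = dropout_probability(0.075, n)  # 7.5% allele fraction, as in the KRAS example below
    print(f"{n} unique molecules: P(variant never sampled) = {p:.3f}")
# With only 10 unique molecules the allele is missed ~46% of the time;
# low-complexity libraries can simply fail to sample the variant at all.
```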

Limitations

Decreased Depth of Coverage, Sensitivity, and Specificity

The relationship between depth of coverage and the reproducibility of variant detection from a given sample is straightforward: a higher number of high quality sequence reads lends confidence to the base called at a particular location, whether the base call from the sequenced sample is the same as the reference base (no variant identified) or is a nonreference base (variant identified) [88,89]. In a multiplexed clinical assay where multiple samples are sequenced simultaneously, the depth of coverage that can be achieved in each sample depends on the composite size of the targeted region to be sequenced (e.g., 400 kb for a typical panel of genes, vs. 30 Mb to over 75 Mb for an exome, vs. over 3 Gb for a genome), the sequencing platform being used, and the number of barcoded samples loaded onto each lane. The matter is further complicated by the facts that coverage is not equal across all positions being evaluated in an assay, and that not all exome kits target the same “whole exome.”
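The coverage trade-off described above reduces to simple arithmetic: mean depth scales with platform output and inversely with target size and the number of multiplexed samples. In the Python sketch below, the run output and on-target fraction are illustrative assumptions, not vendor specifications:

```python
# Rough per-sample coverage budget for a multiplexed run.
def mean_depth(run_output_gb: float, on_target_fraction: float,
               target_size_mb: float, n_samples: int) -> float:
    on_target_bases = run_output_gb * 1e9 * on_target_fraction
    return on_target_bases / (target_size_mb * 1e6 * n_samples)

# Hypothetical 100 Gb run with 70% of bases on target:
print(mean_depth(100, 0.7, 0.4, 24))   # 400 kb gene panel, 24-plex -> ~7300x
print(mean_depth(100, 0.7, 30, 4))     # 30 Mb exome, 4-plex       -> ~580x
print(mean_depth(100, 0.7, 3000, 1))   # 3 Gb genome, one sample   -> ~23x
```

Under these assumed numbers, the same run that gives a small panel thousands-fold coverage yields only tens-fold coverage for a genome, which is the core of the depth problem discussed next.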


FIGURE 20.2 Variability in exome sequencing target capture kits. Several different bait designs are used for targeted exome capture in commonly used kits, with variable degrees of overlap between each probe (A). Although there is extensive overlap between the “exome” targeted by each kit, there are also numerous positions that are captured by only one of the kits (B). Each kit partially covers each of the reference exome data sets (RefSeq exons, Ensembl exons, and RefSeq UTRs shown in panels C–E, respectively). Reprinted by permission from Macmillan Publishers Ltd: Nature Biotechnology [88], copyright 2011.

Multiple reference human exomes exist, including the consensus coding sequence project assembly (CCDS, ~30 Mb total) and the GENCODE collaboration assembly (~35 Mb total) [90,91]. Most commonly used exome kits are based on the CCDS reference exome, including SureSelect Human All Exon +UTR (Agilent, Santa Clara, CA), SeqCap EZ Exome (Roche/NimbleGen, Madison, WI), and TruSeq Exome (Illumina, San Diego, CA). Even though these kits are based on the same reference, they have different bait designs with varying but overlapping coverage of the “whole exome” (Figure 20.2) [88]. The depth of coverage that can be achieved in an assay is intrinsically affected by factors such as mappability of the target region (which is lower for areas with repeating sequences or homology to multiple regions in the genome) and GC content (which decreases the efficiency of methods like exome sequencing) [88,89]. In fact, the coverage over a target region can be highly variable from position to position, and a depth of coverage as high as 1000× is often required in order to achieve at least 400× coverage at most targeted positions (Figure 20.3) [92]. The sensitivity of variant detection itself depends on the depth of coverage, with sensitivity for detection of low frequency (VAF of 5–10%) variants typically dropping off precipitously at depths of coverage below roughly 400× [93]. Specificity likewise decreases with decreasing depth of coverage, with as many as 10% of variants identified by low-coverage research sequencing (usually about 80× for the exome and about 30× for the genome) being unreproducible by repeat sequencing or orthogonal methods using the same DNA sample [81]. Unfortunately, increasing the average coverage to 800–1000× for an exome or 40–60× for a genome increases the amount of generated data to around 15 GB for an exome BAM file or >200 GB for a genome BAM file. Thus, in the clinical laboratory, where both positive and negative results are used to inform decision making for patient management, and false-positive and false-negative rates of 10% are not acceptable, the size of data files required for high depth sequencing of exomes and genomes for improved sensitivity and specificity is unwieldy. For example, consider a colon cancer sample that consists of 75% tumor cells, the remaining 25% of cells being inflammatory cells, stromal cells, and parenchymal cells. If only one-fifth of the sampled tumor cells (15% of total tissue cellularity) represent a subclone that harbors a heterozygous KRAS codon 12 mutation, the VAF for the mutation is only 7.5%, and a bioinformatics pipeline that has a lower limit of detection around 10% is likely to falsely call this codon wild type. In a setting where wild-type results are informative in and of themselves, as is the case with KRAS codon 12 variants, the implications of missing a low frequency variant in terms of patient management can be dramatic. If a KRAS codon 12 substitution mutation is present in a colon cancer, the tumor is not likely to respond to EGFR inhibitors like cetuximab; if a KRAS G12 substitution is called by the pipeline, a


different therapeutic regimen that is not based on cetuximab or similar drugs will be used. Thus, it is impossible to overstate the critical fact that a clinical assay must have adequate depth of coverage to confidently and reproducibly identify low frequency variants that have direct implications for patient management.

FIGURE 20.3 Coverage variability for exome and genome sequencing. NGS sequence coverage is not distributed evenly over a given target region. As an example, the exome coverage track above shows the coverage achieved (Y-axis) at each position across chromosome 7 (X-axis) by sequencing DNA from a liver tumor using the Agilent SureSelect V5 human exome kit (Agilent, Santa Clara, CA) at a mean depth of about 600×. The genome coverage track shows the coverage at each genomic position in chromosome 7 for genome sequencing of HapMap sample NA12878 at a mean depth of about 50× (1000 Genomes Project) [47]. Duplicate reads were removed from the aligned sequence files prior to calculating coverage. For reference, the locations of RefSeq genes are also indicated, and a schematic of chromosome 7 is provided (from the UCSC Genome Browser, http://genome.ucsc.edu). Peaks and valleys in sequence coverage occur as a result of multiple factors, including assay design and systematic technical biases. In the exome plot, gaps in coverage are observed in segments of DNA that do not contain genes and are not targeted by the capture kit. In both the exome and genome plots, coverage is not evenly distributed across targeted regions due to variability in factors including GC content and mappability. For example, a gap in coverage is clearly seen over the region of the centromere (indicated by triangles in the chromosome 7 schematic), because it is not possible to map the highly repetitive reads generated from centromeric DNA to a unique position in the genome.
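The depth-sensitivity relationship can be illustrated with a simple binomial model: treat the variant-supporting reads at a site as Binomial(depth, VAF) and ask how often a caller requiring a minimum number of variant reads would detect the 7.5% VAF KRAS subclone discussed above. The threshold in this Python sketch is an illustrative stand-in for a real caller's statistics:

```python
# Binomial sketch of variant detection sensitivity versus depth of coverage.
from math import comb

def detection_probability(depth: int, vaf: float, min_alt_reads: int) -> float:
    # P(at least min_alt_reads variant-supporting reads at this site)
    p_below = sum(comb(depth, k) * vaf**k * (1 - vaf)**(depth - k)
                  for k in range(min_alt_reads))
    return 1 - p_below

for depth in (50, 100, 400, 1000):
    p = detection_probability(depth, 0.075, 10)
    print(f"{depth}x: P(>=10 variant reads at VAF 0.075) = {p:.3f}")
# Detection is poor at 50-100x but essentially certain at 400x and above,
# consistent with the coverage recommendations discussed in the text.
```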

Advantages

Validation of a Single Assay

Selection of genes for inclusion in a comprehensive clinical cancer sequencing assay is not straightforward. Clinical and research knowledge about genes involved in cancer diagnosis, prognosis, and treatment is continuously evolving, with new studies being published on a monthly if not weekly or daily basis. Maintaining an up-to-date gene panel is particularly difficult in the clinical environment, where major changes to assay protocols require revalidation of the assay. Hypothetically, a clinical laboratory could choose to validate a protocol for sequencing the exome, while only reporting the relevant panel of genes that apply in a particular clinical setting. For example, if a lab receives one sample from a colon cancer and one from an acute myeloid leukemia, it could sequence the exome on both cases and report EGFR and KRAS for the colon cancer but DNMT3A and CEBPA for the leukemia sample. A “one assay” approach would also have the benefit of streamlining laboratory operations, such that all samples would be processed according to one protocol, instead of using multiple different protocols for multiple different sample types. An obvious downside to this approach is the additional cost of sequencing extraneous DNA that is not going to be reported. Additionally, fewer separate cases can be run in multiplex per lane while achieving the same depth of coverage as compared with a smaller panel of genes. It is important to keep in mind that the depth of coverage required for sensitive and specific variant detection from heterogeneous solid tumor samples is much higher (roughly 1000×, as discussed above) than for constitutional testing (about 100×), where variants are expected to be present in at least half of the cells being evaluated [92,94]. As the price of sequencing continues


to decrease, technical factors related to cost may become less important, but it is still unlikely that labs will choose to perform genome sequencing on all samples, because there is little added benefit in evaluating nongenic sequences as opposed to only genic sequences, given the lack of clinical utility in identifying VUS that are not clinically actionable. Regardless of the cost of sequencing, the question of whether or not all DNA that is sequenced must be evaluated and reported from an ethical perspective persists. Exome or genome sequencing as a backbone for reporting mutations in genes that are intended by the clinician to be evaluated will, by definition, result in sequence information for genes beyond those that were ordered by the clinician (unless a bioinformatics pipeline is used that has been specifically designed to suppress analysis of sequence reads that do not map to the target region). Even within the genes of specific clinical interest, germline variants will be identified along with somatic variants, and germline variants may involve genes that are known to be involved in cancer predisposition syndromes or other genetic diseases. There are currently no widely accepted guidelines for the laboratory’s responsibility to bioinformatically analyze and report “incidental” findings in genes that are not included in the ordered test but are likely to have a major impact on the patient’s or their family members’ health [95,96], an issue discussed in much more detail in Chapter 24. An excellent example of genes evaluated by exome sequencing that may harbor clinically relevant germline variants is BRCA1 or BRCA2. There is no shortage of controversy surrounding clinical evaluation of BRCA1 and BRCA2 in general, but the example of discovery of a potentially pathogenic BRCA1 mutation in a patient afflicted by a tumor without a known BRCA1 association illustrates the ethical conundrum; reporting of the BRCA1 variant could have major implications in terms of patient and family clinical management, despite the fact that the clinical setting for testing did not indicate any rationale for analysis of BRCA1.

Evaluate Many Genes Simultaneously from One Sample

In a clinical environment where there is a growing emphasis on minimally invasive diagnostic procedures, including the preferential use of core needle biopsies or fine needle aspiration biopsies of patient tumors, there is increasing pressure to do more with a smaller amount of tumor tissue. In some cases, only one needle biopsy from a patient’s tumor may be available for both assessment of the histologic diagnosis and molecular analysis of prognostic and therapeutic genetic markers. A prime example of this clinical scenario is biopsy of lung tumors. The surgical pathologist or cytologist must first determine whether a lung mass is a primary lung tumor or a metastasis, as numerous tumors commonly metastasize to the lung. Arriving at the correct histologic diagnosis often requires the use of at least a few IHC markers. If the mass is a lung adenocarcinoma, further evaluation for EGFR mutation may be requested, which typically requires at least five unstained tissue slides (or the equivalent tissue volume). Additional evaluation for gene amplification or rearrangements involving ALK, ROS, RET, MET, or HER2 may also be needed, requiring an additional 5–10 unstained slides.
Tissue utilization considerations are further complicated by the required use of so-called “companion diagnostic” assays, specific test kits required by the Food and Drug Administration (FDA) to establish the presence of a genetic mutation (e.g., BRAF V600E or V600K) prior to initiation of treatments that target that mutation [97]. There sometimes is simply not enough tumor tissue available from several millimeters of biopsy material to perform all the desired testing if the various molecular analyses are performed separately. Exome and genome sequencing can both be used to evaluate multiple genes simultaneously, thereby maximizing the information that can be gathered from limited tissue samples. Thoughtfully designed gene panels with optimized bioinformatics pipelines are also able to detect all classes of genetic variation, so exome or genome sequencing are not required in order to take advantage of this useful feature of NGS techniques, as discussed above [92,94]. However, some variant types, especially CNVs and SVs, are more readily detected from exome- or genome-level sequence information, as discussed later.

Improved Copy Number Variant Detection

SNVs and small insertions and deletions (indels) up to around 100 bp in length can be readily identified from cancer samples using small capture or amplification-based NGS gene panels, if adequate coverage is obtained over the genes of interest, the bioinformatics pipelines are optimized to identify variants in heterogeneous samples, and the specimens used for testing are of adequate quality (see Chapters 18 and 19). This is true also for exome and genome sequencing, with the same caveats. However, identification of copy number variants (CNVs) is accomplished much more readily via exome or genome sequencing than by more narrowly targeted approaches. In general, bioinformatics tools for copy number analysis in NGS data sets require a normalization process to compensate for variations in coverage observed

III. INTERPRETATION

356

20. SOMATIC DISEASES (CANCER): WHOLE EXOME AND WHOLE GENOME SEQUENCING

as a result of intrinsic assay performance characteristics, such as decreased coverage over GC-rich areas (which are difficult to capture and amplify) or areas of low sequence complexity (where sequence reads are difficult to map). Many of the tools also use SNV genotypes identified across regions of the targeted sequence to corroborate coverage-based CNV calls (see Chapter 11 for further discussion). While smaller gene panels make CNV analysis more difficult, exome and genome sequencing are both well suited to copy number analysis because of the large areas available for coverage normalization and SNP detection [98].

Improved SV Detection—Genomes

SV detection in NGS pipelines is challenging but possible. The challenges arise from the mechanisms by which SVs are generated. The breakpoints of inter- and intrachromosomal rearrangements are most often located in noncoding (e.g., intronic or intergenic) DNA sequences, often in highly repetitive regions, and are therefore difficult both to capture and to map to the reference genome using NGS techniques. In addition, SV breakpoints often contain superimposed sequence variation ranging from small indels to fragments from several chromosomes [99,100]. NGS sequence reads spanning complex rearrangements with multiple contributing chromosomes are difficult to map to the reference genome because only small stretches (in general only tens of bases long) map to each of the contributing chromosomes [98].

A number of SV detection tools have been developed to manage these issues (see Chapter 11 for in-depth discussion). Each of these requires a sufficient number of high-quality sequence reads that either span the breakpoint (discordant read pairs, in which one read maps to one chromosome and its mate maps to the second chromosome) or contain the breakpoint itself (split reads, in which one read of a pair maps partially to the same location as its mate and partially to a distant location) [98]. Since the discordant read pairs and split reads that are used to identify SV breakpoints are often located in noncoding sequences, those intronic and intergenic regions must be targeted for DNA analysis. Exome sequencing capture probes have limited coverage of introns and UTRs flanking known genes, and no coverage of intergenic regions, and as such are not particularly amenable to SV detection. Although genome sequencing by definition targets the entire genome, including introns and intergenic spans, the achievable coverage over repetitive regions is actually quite poor, both because of the huge size of the targeted DNA space and because nonunique repeating sequences are difficult to align correctly to the reference genome. In contrast, targeted capture-based gene panels designed to include the introns and intergenic sequences that are known to harbor rearrangements in cancer can focus coverage on those areas to achieve the number of reads required for SV detection (see Chapters 10 and 19).

Besides issues related to obtaining adequate coverage of breakpoint regions, SV detection is also plagued by a high false-positive call rate [98]. Most of these false-positive calls occur outside of regions that are known to harbor SVs in cancer and involve partners other than those that have been described and/or are known to have biological relevance.
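For illustration only, the following minimal sketch shows how discordant read pairs of the kind just described might be tallied from an aligned BAM file using the pysam library; the file name, target region, and thresholds are hypothetical assumptions, and clinically validated SV callers layer split-read analysis, assembly, and error modeling on top of this basic idea.

# Minimal sketch: flag discordant read pairs that may support an
# interchromosomal rearrangement. Assumes a coordinate-sorted, indexed
# BAM ("tumor.bam" is a placeholder) and an illustrative target region;
# the thresholds are arbitrary examples, not validated cutoffs.
from collections import Counter

import pysam

MIN_MAPQ = 30      # ignore poorly mapped reads
MIN_SUPPORT = 5    # require several independent pairs per partner

def discordant_partners(bam_path, chrom, start, end):
    """Count mate chromosomes for read pairs anchored in a target region."""
    partners = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            if (read.is_paired
                    and not read.is_unmapped
                    and not read.mate_is_unmapped
                    and not read.is_duplicate
                    and read.mapping_quality >= MIN_MAPQ
                    and read.reference_name != read.next_reference_name):
                partners[read.next_reference_name] += 1
    # Keep only partner chromosomes with enough independent support
    return {p: n for p, n in partners.items() if n >= MIN_SUPPORT}

# Example (illustrative coordinates for an intronic breakpoint cluster):
# print(discordant_partners("tumor.bam", "chr2", 42_000_000, 42_600_000))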
SV calls involving genes that are known to be rearranged in cancer can be filtered by whether or not the identified rearrangements fit into the described spectrum of SV for the genes involved, but this approach does not address “novel” rearrangements which may actually be present in the sample. Supplemental techniques like RNA sequencing, FISH, or karyotyping are required to sort out which of many novel SV calls made from exome- or genome-level data are true positives.
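A hedged sketch of the filtering logic just described follows; the partner table is a hypothetical excerpt rather than a clinical knowledge base, and real implementations also consider breakpoint position, orientation, and read support before routing novel calls to orthogonal confirmation.

# Minimal sketch: triage SV calls against a curated list of rearrangement
# partners known for each gene. Novel calls are flagged for confirmation
# (e.g., by FISH or RNA sequencing) rather than discarded.
KNOWN_PARTNERS = {            # hypothetical excerpt of a curated table
    "ALK": {"EML4", "KIF5B"},
    "ROS1": {"CD74", "SLC34A2"},
}

def triage_sv_call(gene_a, gene_b):
    """Classify a fusion call as 'described', 'novel partner', or 'unknown'."""
    for g1, g2 in ((gene_a, gene_b), (gene_b, gene_a)):
        if g1 in KNOWN_PARTNERS:
            if g2 in KNOWN_PARTNERS[g1]:
                return "described"        # fits the reported spectrum
            return "novel partner"        # needs orthogonal confirmation
    return "unknown"                      # neither gene in the table

# Example
# triage_sv_call("EML4", "ALK")    -> "described"
# triage_sv_call("ALK", "GENE_X")  -> "novel partner"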

SUMMARY

Exome and genome sequencing are valuable tools for the research study of cancer because these techniques facilitate a relatively unbiased investigation of all known genes or the whole genome, with the ability to detect all four classes of mutations. Genome sequencing is especially well suited to the identification of translocations and other SVs. While either an exome or a genome approach could hypothetically be used to validate a "one size fits all" clinical assay protocol and analysis pipeline from which subpanels of genes could be reported, the majority of identified variants are of unknown clinical significance, and the large target space encompassed by an exome or genome means that fewer samples can be multiplexed in one sequencing lane if a clinically acceptable depth of coverage is to be achieved. As sequencing costs continue to decrease, it is becoming more practical, from a depth of coverage and cost perspective, to perform exome sequencing in cancer. However, from a clinical point of view, the primary limitation of exome and genome sequencing is the high number of VUS found by sequencing genes for which there is insufficient clinical-grade evidence to support any meaningful interpretation of how a variant affects diagnosis, prognosis, or treatment. Thus, for clinical purposes, there is little added benefit to sequencing an entire
genome from a cancer specimen, as the additional information does not contribute directly to patient management. Decreasing the technical costs of sequencing does nothing to change the lack of clinical relevance of variants identified in under-characterized genes or in other regions of the genome for which no function is known.

To summarize, useful clinical mutation testing requires sufficient supporting evidence to assign clinical relevance to identified variants. Although identification of the full spectrum of DNA variants present in a tumor is both interesting and valuable from a research perspective, evidence-based data from clinical studies, case reports, and trials are required to move genes and pathways from the realm of research into the clinical setting, where variants can be used in real time to determine how patients should be most effectively managed to improve outcomes. Thus, the primary and persistent limitation to performing exome or genome sequencing clinically in cancer testing is the assessment of the clinical relevance and actionability of identified variants. Regardless of the technical or financial feasibility of exome- or genome-level sequencing in a clinical setting, the innumerable VUS identified by such broad-scale sequencing methods do not contribute meaningful information for clinical patient management. Instead, at present, smaller panels of genes with known clinical significance are better suited for the identification of actionable variants in multiple genes from limited clinical tissue samples.

References

[1] Govindan R, Ding L, Griffith M, Subramanian J, Dees ND, Kanchi KL, et al. Genomic landscape of non-small cell lung cancer in smokers and never-smokers. Cell 2012;150(6):1121–34.
[2] Network CGAR. Comprehensive genomic characterization of squamous cell lung cancers. Nature 2012;489(7417):519–25.
[3] Network CGAR. Integrated genomic analyses of ovarian carcinoma. Nature 2011;474(7353):609–15.
[4] Network CGA. Comprehensive molecular characterization of human colon and rectal cancer. Nature 2012;487(7407):330–7.
[5] Network CGAR. Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 2013;499(7456):43–9.
[6] Borad MJ, Champion MD, Egan JB, Liang WS, Fonseca R, Bryce AH, et al. Integrated genomic characterization reveals novel, therapeutically relevant drug targets in FGFR and EGFR pathways in sporadic intrahepatic cholangiocarcinoma. PLoS Genet 2014;10(2):e1004135.
[7] Brastianos PK, Taylor-Weiner A, Manley PE, Jones RT, Dias-Santagata D, Thorner AR, et al. Exome sequencing identifies BRAF mutations in papillary craniopharyngiomas. Nat Genet 2014;46(2):161–5.
[8] Kandoth C, Schultz N, Cherniack AD, Akbani R, Liu Y, Shen H, et al. Integrated genomic characterization of endometrial carcinoma. Nature 2013;497(7447):67–73.
[9] Gerlinger M, Horswell S, Larkin J, Rowan AJ, Salm MP, Varela I, et al. Genomic architecture and evolution of clear cell renal cell carcinomas defined by multiregion sequencing. Nat Genet 2014;46(3):225–33.
[10] Yachida S, Jones S, Bozic I, Antal T, Leary R, Fu B, et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature 2010;467(7319):1114–7.
[11] Ding L, Ley TJ, Larson DE, Miller CA, Koboldt DC, Welch JS, et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 2012;481(7382):506–10.
[12] Johnson BE, Mazor T, Hong C, Barnes M, Aihara K, McLean CY, et al. Mutational analysis reveals the origin and therapy-driven evolution of recurrent glioma. Science 2014;343(6167):189–93.
[13] Shigematsu H, Lin L, Takahashi T, Nomura M, Suzuki M, Wistuba II, et al. Clinical and biological features associated with epidermal growth factor receptor gene mutations in lung cancers. J Natl Cancer Inst 2005;97(5):339–46.
[14] Sharma SV, Bell DW, Settleman J, Haber DA. Epidermal growth factor receptor mutations in lung cancer. Nat Rev Cancer 2007;7(3):169–81.
[15] Mitsudomi T, Morita S, Yatabe Y, Negoro S, Okamoto I, Tsurutani J, et al. Gefitinib versus cisplatin plus docetaxel in patients with non-small-cell lung cancer harbouring mutations of the epidermal growth factor receptor (WJTOG3405): an open label, randomised phase 3 trial. Lancet Oncol 2010;11(2):121–8.
[16] Maemondo M, Inoue A, Kobayashi K, Sugawara S, Oizumi S, Isobe H, et al. Gefitinib or chemotherapy for non-small-cell lung cancer with mutated EGFR. N Engl J Med 2010;362(25):2380–8.
[17] Zhou C, Wu YL, Chen G, Feng J, Liu XQ, Wang C, et al. Erlotinib versus chemotherapy as first-line treatment for patients with advanced EGFR mutation-positive non-small-cell lung cancer (OPTIMAL, CTONG-0802): a multicentre, open-label, randomised, phase 3 study. Lancet Oncol 2011;12(8):735–42.
[18] Mitsudomi T, Kosaka T, Endoh H, Horio Y, Hida T, Mori S, et al. Mutations of the epidermal growth factor receptor gene predict prolonged survival after gefitinib treatment in patients with non-small-cell lung cancer with postoperative recurrence. J Clin Oncol 2005;23(11):2513–20.
[19] Mok TS, Wu YL, Thongprasert S, Yang CH, Chu DT, Saijo N, et al. Gefitinib or carboplatin–paclitaxel in pulmonary adenocarcinoma. N Engl J Med 2009;361(10):947–57.
[20] Han SW, Kim TY, Hwang PG, Jeong S, Kim J, Choi IS, et al. Predictive and prognostic impact of epidermal growth factor receptor mutation in non-small-cell lung cancer patients treated with gefitinib. J Clin Oncol 2005;23(11):2493–501.
[21] Chou TY, Chiu CH, Li LH, Hsiao CY, Tzen CY, Chang KT, et al. Mutation in the tyrosine kinase domain of epidermal growth factor receptor is a predictive and prognostic factor for gefitinib treatment in patients with non-small cell lung cancer. Clin Cancer Res 2005;11(10):3750–7.
[22] Tokumo M, Toyooka S, Ichihara S, Ohashi K, Tsukuda K, Ichimura K, et al. Double mutation and gene copy number of EGFR in gefitinib refractory non-small-cell lung cancer. Lung Cancer 2006;53(1):117–21.
[23] Wu JY, Wu SG, Yang CH, Gow CH, Chang YL, Yu CJ, et al. Lung cancer with epidermal growth factor receptor exon 20 mutations is associated with poor gefitinib treatment response. Clin Cancer Res 2008;14(15):4877–82.
[24] Yasuda H, Park E, Yun CH, Sng NJ, Lucena-Araujo AR, Yeo WL, et al. Structural, biochemical, and clinical characterization of epidermal growth factor receptor (EGFR) exon 20 insertion mutations in lung cancer. Sci Transl Med 2013;5(216):216ra177.
[25] Sequist LV, Martins RG, Spigel D, Grunberg SM, Spira A, Jänne PA, et al. First-line gefitinib in patients with advanced non-small-cell lung cancer harboring somatic EGFR mutations. J Clin Oncol 2008;26(15):2442–9.
[26] Eberhard DA, Johnson BE, Amler LC, Goddard AD, Heldens SL, Herbst RS, et al. Mutations in the epidermal growth factor receptor and in KRAS are predictive and prognostic indicators in patients with non-small-cell lung cancer treated with chemotherapy alone and in combination with erlotinib. J Clin Oncol 2005;23(25):5900–9.
[27] Corless CL, Barnett CM, Heinrich MC. Gastrointestinal stromal tumours: origin and molecular oncology. Nat Rev Cancer 2011;11(12):865–78.
[28] Singer S, Rubin BP, Lux ML, Chen CJ, Demetri GD, Fletcher CD, et al. Prognostic value of KIT mutation type, mitotic activity, and histologic subtype in gastrointestinal stromal tumors. J Clin Oncol 2002;20(18):3898–905.
[29] Zhi X, Zhou X, Wang W, Xu Z. Practical role of mutation analysis for imatinib treatment in patients with advanced gastrointestinal stromal tumors: a meta-analysis. PLoS One 2013;8(11):e79275.
[30] Heinrich MC, Maki RG, Corless CL, Antonescu CR, Harlow A, Griffith D, et al. Primary and secondary kinase genotypes correlate with the biological and clinical activity of sunitinib in imatinib-resistant gastrointestinal stromal tumor. J Clin Oncol 2008;26(33):5352–9.
[31] Gastrointestinal Stromal Tumor Meta-Analysis Group (MetaGIST). Comparison of two doses of imatinib for the treatment of unresectable or metastatic gastrointestinal stromal tumors: a meta-analysis of 1,640 patients. J Clin Oncol 2010;28(7):1247–53.
[32] Debiec-Rychter M, Cools J, Dumez H, Sciot R, Stul M, Mentens N, et al. Mechanisms of resistance to imatinib mesylate in gastrointestinal stromal tumors and activity of the PKC412 inhibitor against imatinib-resistant mutants. Gastroenterology 2005;128(2):270–9.
[33] Petitjean A, Mathe E, Kato S, Ishioka C, Tavtigian SV, Hainaut P, et al. Impact of mutant p53 functional properties on TP53 mutation patterns and tumor phenotype: lessons from recent developments in the IARC TP53 database. Hum Mutat 2007;28(6):622–9.
[34] Ross JS, Ali SM, Wang K, Palmer G, Yelensky R, Lipson D, et al. Comprehensive genomic profiling of epithelial ovarian cancer by next generation sequencing-based diagnostic assay reveals new routes to targeted therapies. Gynecol Oncol 2013;130(3):554–9.
[35] Zbuk KM, Eng C. Cancer phenomics: RET and PTEN as illustrative models. Nat Rev Cancer 2007;7(1):35–45.
[36] Janku F, Hong DS, Fu S, Piha-Paul SA, Naing A, Falchook GS, et al. Assessing PIK3CA and PTEN in early-phase trials with PI3K/AKT/mTOR inhibitors. Cell Rep 2014;6(2):377–87.
[37] García JM, Silva J, Peña C, Garcia V, Rodríguez R, Cruz MA, et al. Promoter methylation of the PTEN gene is a common molecular change in breast cancer. Genes Chromosomes Cancer 2004;41(2):117–24.
[38] Esteller M, Corn PG, Baylin SB, Herman JG. A gene hypermethylation profile of human cancer. Cancer Res 2001;61(8):3225–9.
[39] Kantarjian H, Sawyers C, Hochhaus A, Guilhot F, Schiffer C, Gambacorti-Passerini C, et al. Hematologic and cytogenetic responses to imatinib mesylate in chronic myelogenous leukemia. N Engl J Med 2002;346(9):645–52.
[40] Bagg A. Chronic myeloid leukemia: a minimalistic view of post-therapeutic monitoring. J Mol Diagn 2002;4(1):1–10.
[41] Krivtsov AV, Armstrong SA. MLL translocations, histone modifications and leukaemia stem-cell development. Nat Rev Cancer 2007;7(11):823–33.
[42] Shih LY, Liang DC, Fu JF, Wu JH, Wang PN, Lin TL, et al. Characterization of fusion partner genes in 114 patients with de novo acute myeloid leukemia and MLL rearrangement. Leukemia 2006;20(2):218–23.
[43] Kandoth C, McLellan MD, Vandin F, Ye K, Niu B, Lu C, et al. Mutational landscape and significance across 12 major cancer types. Nature 2013;502(7471):333–9.
[44] Imielinski M, Berger AH, Hammerman PS, Hernandez B, Pugh TJ, Hodis E, et al. Mapping the hallmarks of lung adenocarcinoma with massively parallel sequencing. Cell 2012;150(6):1107–20.
[45] Network CGAR. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455(7216):1061–8.
[46] Network CGA. Comprehensive molecular portraits of human breast tumours. Nature 2012;490(7418):61–70.
[47] Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491(7422):56–65.
[48] Strachan T, Read AP. Human genetic variability and its consequences. In: Human molecular genetics. 4th ed. New York, NY: Garland Science; 2011. pp. 405–440.
[49] Welch JS, Ley TJ, Link DC, Miller CA, Larson DE, Koboldt DC, et al. The origin and evolution of mutations in acute myeloid leukemia. Cell 2012;150(2):264–78.
[50] Marcus JN, Watson P, Page DL, Narod SA, Lenoir GM, Tonin P, et al. Hereditary breast cancer: pathobiology, prognosis, and BRCA1 and BRCA2 gene linkage. Cancer 1996;77(4):697–709.
[51] Malkin D, Li FP, Strong LC, Fraumeni JF, Nelson CE, Kim DH, et al. Germ line p53 mutations in a familial syndrome of breast cancer, sarcomas, and other neoplasms. Science 1990;250(4985):1233–8.
[52] Smits AJ, Kummer JA, de Bruin PC, Bol M, van den Tweel JG, Seldenrijk KA, et al. The estimation of tumor cell percentage for molecular testing by pathologists is not accurate. Mod Pathol 2014;27(2):168–74.
[53] Walsh PS, Erlich HA, Higuchi R. Preferential PCR amplification of alleles: mechanisms and solutions. PCR Methods Appl 1992;1(4):241–50.
[54] Ogino S, Wilson RB. Quantification of PCR bias caused by a single nucleotide polymorphism in SMN gene dosage analysis. J Mol Diagn 2002;4(4):185–90.
[55] Barnard R, Futo V, Pecheniuk N, Slattery M, Walsh T. PCR bias toward the wild-type k-ras and p53 sequences: implications for PCR detection of mutations and cancer diagnosis. Biotechniques 1998;25(4):684–91.
[56] Liu Q, Thorland EC, Sommer SS. Inhibition of PCR amplification by a point mutation downstream of a primer. Biotechniques 1997;22(2):292–4, 296, 298, passim.
[57] Mutter GL, Boynton KA. PCR bias in amplification of androgen receptor alleles, a trinucleotide repeat marker used in clonality studies. Nucleic Acids Res 1995;23(8):1411–8.
[58] Polz MF, Cavanaugh CM. Bias in template-to-product ratios in multitemplate PCR. Appl Environ Microbiol 1998;64(10):3724–30.
[59] Sehn JK, Hagemann IS, Pfeifer JD, Cottrell CE, Lockwood CM. Diagnostic utility of targeted next-generation sequencing in problematic cases. Am J Surg Pathol 2014.
[60] Melo JV. The diversity of BCR-ABL fusion proteins and their relationship to leukemia phenotype. Blood 1996;88(7):2375–84.
[61] Onozato R, Kosaka T, Kuwano H, Sekido Y, Yatabe Y, Mitsudomi T. Activation of MET by gene amplification or by splice mutations deleting the juxtamembrane domain in primary resected lung cancers. J Thorac Oncol 2009;4(1):5–11.
[62] Wang GS, Cooper TA. Splicing in disease: disruption of the splicing code and the decoding machinery. Nat Rev Genet 2007;8(10):749–61.
[63] Smith G, Carey FA, Beattie J, Wilkie MJ, Lightfoot TJ, Coxhead J, et al. Mutations in APC, Kirsten-ras, and p53—alternative genetic pathways to colorectal cancer. Proc Natl Acad Sci USA 2002;99(14):9433–8.
[64] Zauber NP, Wang C, Lee PS, Redondo TC, Bishop DT, Goel A. Ki-ras gene mutations, LOH of the APC and DCC genes, and microsatellite instability in primary colorectal carcinoma are not associated with micrometastases in pericolonic lymph nodes or with patients' survival. J Clin Pathol 2004;57(9):938–42.
[65] Sugai T, Habano W, Uesugi N, Jiao YF, Nakamura S, Sato K, et al. Molecular validation of the modified Vienna classification of colorectal tumors. J Mol Diagn 2002;4(4):191–200.
[66] Grantham R. Amino acid difference formula to help explain protein evolution. Science 1974;185(4154):862–4.
[67] Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 2010;6(12):e1001025.
[68] Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009;4(7):1073–81.
[69] Choi Y, Sims GE, Murphy S, Miller JR, Chan AP. Predicting the functional effect of amino acid substitutions and indels. PLoS One 2012;7(10):e46688.
[70] Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. A method and server for predicting damaging missense mutations. Nat Methods 2010;7(4):248–9.
[71] Tchernitchko D, Goossens M, Wajcman H. In silico prediction of the deleterious effect of a mutation: proceed with caution in clinical genetics. Clin Chem 2004;50(11):1974–8.
[72] Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 2010;26(12):i237–45.
[73] Wendl MC, Wallis JW, Lin L, Kandoth C, Mardis ER, Wilson RK, et al. PathScan: a tool for discerning mutational significance in groups of putative cancer genes. Bioinformatics 2011;27(12):1595–602.
[74] Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, et al. PID: the pathway interaction database. Nucleic Acids Res 2009;37(Database issue):D674–9.
[75] Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res 2012;40(Database issue):D109–14.
[76] Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, Bignell G, et al. Patterns of somatic mutation in human cancer genomes. Nature 2007;446(7132):153–8.
[77] Carter H, Chen S, Isik L, Tyekucheva S, Velculescu VE, Kinzler KW, et al. Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res 2009;69(16):6660–7.
[78] Cheng WC, Chung IF, Chen CY, Sun HJ, Fen JJ, Tang WC, et al. DriverDB: an exome sequencing database for cancer driver gene identification. Nucleic Acids Res 2014;42(Database issue):D1048–54.
[79] Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med 2012;366(10):883–92.
[80] Poulikakos PI, Zhang C, Bollag G, Shokat KM, Rosen N. RAF inhibitors transactivate RAF dimers and ERK signalling in cells with wild-type BRAF. Nature 2010;464(7287):427–30.
[81] Lohr JG, Stojanov P, Carter SL, Cruz-Gordillo P, Lawrence MS, Auclair D, et al. Widespread genetic heterogeneity in multiple myeloma: implications for targeted therapy. Cancer Cell 2014;25(1):91–101.
[82] Rossi D, Khiabanian H, Spina V, Ciardullo C, Bruscaggin A, Famà R, et al. Clinical impact of small TP53 mutated subclones in chronic lymphocytic leukemia. Blood 2014;123(14):2139–47.
[83] Spencer DH, Sehn JK, Abel HJ, Watson MA, Pfeifer JD, Duncavage EJ. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J Mol Diagn 2013;15(5):623–33.
[84] Karnes HE, Duncavage EJ, Bernadt CT. Targeted next-generation sequencing using fine-needle aspirates from adenocarcinomas of the lung. Cancer Cytopathol 2014;122(2):104–13.
[85] Wickham CL, Sarsfield P, Joyner MV, Jones DB, Ellard S, Wilkins B. Formic acid decalcification of bone marrow trephines degrades DNA: alternative use of EDTA allows the amplification and sequencing of relatively long PCR products. Mol Pathol 2000;53(6):336.
[86] Reineke T, Jenni B, Abdou MT, Frigerio S, Zubler P, Moch H, et al. Ultrasonic decalcification offers new perspectives for rapid FISH, DNA, and RT-PCR analysis in bone marrow trephines. Am J Surg Pathol 2006;30(7):892–6.
[87] Lou DI, Hussmann JA, McBee RM, Acevedo A, Andino R, Press WH, et al. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proc Natl Acad Sci USA 2013;110(49):19872–7.
[88] Clark MJ, Chen R, Lam HY, Karczewski KJ, Euskirchen G, Butte AJ, et al. Performance comparison of exome DNA sequencing technologies. Nat Biotechnol 2011;29(10):908–14.
[89] Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet 2014;15(2):121–32.
[90] Farrell CM, O'Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, et al. Current status and new features of the consensus coding sequence database. Nucleic Acids Res 2014;42(Database issue):D865–72.
[91] Coffey AJ, Kokocinski F, Calafato MS, Scott CE, Palta P, Drury E, et al. The GENCODE exome: sequencing the complete human exome. Eur J Hum Genet 2011;19(7):827–31.
[92] Cottrell CE, Al-Kateb H, Bredemeyer AJ, Duncavage EJ, Spencer DH, Abel HJ, et al. Validation of a next-generation sequencing assay for clinical molecular oncology. J Mol Diagn 2014;16(1):89–105.
[93] Spencer D, Vallania F, Tyagi M, Bredemeyer A, Pfeifer J, Mitra R, et al. Performance of common methods for detecting low frequency single nucleotide variants in targeted next generation sequence data. J Mol Diagn 2014;16(1):75–88.
[94] Pritchard CC, Salipante SJ, Koehler K, Smith C, Scroggins S, Wood B, et al. Validation and implementation of targeted capture and sequencing for the detection of actionable mutation, copy number variation, and gene rearrangement in clinical cancer specimens. J Mol Diagn 2014;16(1):56–67.
[95] Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med 2013;15(7):565–74.
[96] Burke W, Matheny Antommaria AH, Bennett R, Botkin J, Clayton EW, Henderson GE, et al. Recommendations for returning genomic incidental findings? We need to talk! Genet Med 2013;15(11):854–9.
[97] Anderson S, Bloom KJ, Vallera DU, Rueschoff J, Meldrum C, Schilling R, et al. Multisite analytic performance studies of a real-time polymerase chain reaction assay for the detection of BRAF V600E mutations in formalin-fixed, paraffin-embedded tissue specimens of malignant melanoma. Arch Pathol Lab Med 2012;136(11):1385–91.
[98] Abel HJ, Duncavage EJ. Detection of structural DNA variation from next generation sequencing data: a review of informatic approaches. Cancer Genet 2013;206(12):432–40.
[99] Berger M, Dirksen U, Braeuninger A, Koehler G, Juergens H, Krumbholz M, et al. Genomic EWS-FLI1 fusion sequences in Ewing sarcoma resemble breakpoint characteristics of immature lymphoid malignancies. PLoS One 2013;8(2):e56408.
[100] Rajaram V, Knezevich S, Bove KE, Perry A, Pfeifer JD. DNA sequence of the translocation breakpoints in undifferentiated embryonal sarcoma arising in mesenchymal hamartoma of the liver harboring the t(11;19)(q11;q13.4) translocation. Genes Chromosomes Cancer 2007;46(5):508–13.

SECTION IV

REGULATION, REIMBURSEMENT, AND LEGAL ISSUES


CHAPTER 21

Assay Validation

Amy S. Gargis1, Lisa Kalman2 and Ira M. Lubin2

1Division of Preparedness and Emerging Infections, Laboratory Preparedness and Response Branch, Centers for Disease Control and Prevention, Atlanta, GA, USA; 2Division of Laboratory Programs, Services, and Standards, Centers for Disease Control and Prevention, Atlanta, GA, USA

OUTLINE

Introduction 364
NGS Workflow 364
The Regulatory and Professional Framework for Assuring Quality 367
Assay Validation 367
Accuracy 369
Precision 370
Analytical Sensitivity and Analytical Specificity 371
Reportable and Reference Ranges 372
Quality Control 372
Reference Materials 373
Conclusion 373
Acknowledgment 374
References 374
List of Acronyms and Abbreviations 376

KEY CONCEPTS

• Next-generation sequencing (NGS) is a complex test method that can be divided into four steps:
  • Sample preparation
  • Sequencing of the physical patient sample to generate a large number of sequence reads
  • Alignment of the reads against a human reference sequence to identify variations from the reference sequence
  • Analysis to identify clinically relevant variants.
• Assay validation is the process of establishing and documenting the performance specifications of a clinical assay.
• The performance characteristics for which specifications are established include accuracy, precision, analytical sensitivity and specificity, reportable range, reference range, and other characteristics that are applicable for a particular clinical assay.



Disclaimer: The findings and conclusions in this chapter are those of the authors and do not necessarily represent the views of the Centers for Disease Control and Prevention/The Agency for Toxic Substances and Disease Registry.


• NGS, like Sanger sequencing, presents unique challenges to the application of regulatory and professional standards for assuring quality, primarily because every clinically relevant sequence variation cannot be individually validated or quality controlled.
• The scope of considerations is different for assay validation based on the regions targeted for analysis (e.g., gene panel, exome, or whole genome sequencing) and the types of sequence variations that are targeted (e.g., single nucleotide variants, insertions and deletions).
• Not all regions of the human genome can be sequenced with acceptable quality. In some instances, Sanger sequencing or other methods can be used to sequence regions where NGS fails to provide a sequence with acceptable quality.
• Clinical NGS tests, especially exome and whole genome analyses, are prone to false-positive results. Confirmatory testing is typically used to identify these occurrences.

INTRODUCTION

Next-generation sequencing (NGS) is an evolving set of technologies capable of deriving sequence information for large regions of the human genome. The first manuscript describing the use of NGS in the research setting was published in 2005 [1]. NGS has subsequently become cost effective and time efficient, allowing its rapid migration into clinical laboratory use. NGS has been used successfully to diagnose rare disease and to guide the choice of cancer therapies when other test methods were either unavailable or not informative. The use of NGS for clinical testing evolved from translational research initiatives at major academic institutions and commercial laboratories. These institutions were positioned to use their significant research and clinical expertise and infrastructure to set up workflows able to produce reliable results. As the usefulness of clinical NGS testing became apparent and the cost of testing decreased, a number of laboratories added NGS to their test menus. However, the adoption of these new and complex technologies presents challenges to the existing laboratory quality framework.

This chapter addresses the components of quality management in a clinical environment that are essential for validating an NGS assay and for informing the development of quality control (QC) procedures used to assure and maintain accurate test results. Reference materials will also be discussed from the perspective of their importance in establishing performance specifications that provide confidence that an NGS test is reliable. The discussion will focus on the detection of germ line sequence variations, but principles relevant to the use of NGS for other applications, such as cancer, will be discussed when applicable.

NGS WORKFLOW

The clinical testing process is divided into three phases: preanalytic, analytic, and postanalytic (Figure 21.1). The preanalytic phase of testing involves test selection, ordering, specimen collection, processing, handling, and delivery to the testing site [2,3]. The analytic phase involves establishing the performance of test procedures, monitoring the accuracy and reliability of results, and the documentation of test findings. Result reporting is part of the postanalytic phase of testing, and this process also includes the archival of records, reports, and tested specimens [2].

FIGURE 21.1 The clinical NGS testing workflow is divided into the preanalytic, analytic, and postanalytic phases of testing. (Flowchart: the preanalytic phase comprises indication for testing, counseling, and test selection, followed by specimen collection and submission; the analytic phase comprises sample (library) preparation, sequence generation, and sequence analysis by the informatics pipeline, i.e., base calling to generate sequence reads, alignment and mapping of reads to a reference sequence, and variant/genotype calling and annotation; the postanalytic phase comprises clinical interpretation and result reporting.)

For NGS, the analytical phase of testing is composed of two general processes: sequence generation and sequence analysis using an informatics pipeline. Prior to sequence generation, the patient specimen is processed to develop a "library" of size-selected DNA fragments. Gene panel or exome analysis requires a capture or enrichment step to isolate a targeted subpopulation of DNA fragments; this enrichment step is not performed for whole genome sequencing. A finished library contains millions of size-selected individual DNA fragments that represent the regions of the genome to be sequenced.

NGS permits the analysis of multiple sample libraries simultaneously using a technique known as multiplexing. This is performed by adding a library-specific oligonucleotide label (sometimes referred to as a "tag" or "index") to each fragment, which allows the sequence associated with each fragment to be assigned to a specific patient specimen using a process known as demultiplexing. The multiplexing/demultiplexing protocols recommended by the NGS platform manufacturer have been
optimized for their particular instrument, and comparative data are available to assess the fidelity of the process. Therefore, many laboratories use the indexes, protocols, and demultiplexing software recommended by the platform manufacturers in their assay development and validation. In some instances, custom index design can provide superior results. When custom indexes are used, careful consideration needs to be given to the design of tags, the labeling of the library fragments, and the capacity to distinguish reads associated with different patients [4].

The next step is the generation of DNA sequence data. Current NGS technologies employed in the clinical setting utilize clonal amplification of the DNA fragments, followed by sequencing in a massively parallel fashion, reviewed elsewhere [5–11]. The output from an NGS sequencing instrument is a digital representation of short sequences referred to as "reads." NGS data analysis is complex and requires significant computer processing. Initially, the quality of each read is assessed, and reads that do not meet criteria established by the laboratory are discarded by the sequencing software. Criteria can include base call quality scores, duplicate reads due to polymerase chain reaction (PCR) artifacts, and the ability to map reads to the reference sequence [12]. A key feature of NGS is the capacity to interrogate each base in a given DNA sequence multiple times to derive a statistical concordance for the identity of each position in a read. The number of times a base is interrogated by multiple overlapping reads is referred to as the "depth of coverage." Adequate depth of coverage is essential for deriving an accurate sequence because each read is subject to error that can vary with the chemistry, sequence context, and other factors.

The remaining stages of the analysis utilize a set of software tools referred to as the "informatics pipeline." These computational tools are used to map and align the reads to a reference sequence and then to identify sequence variations, which are further analyzed to identify those relevant to the clinical indication for testing. A variety of software tools, from commercial sources and in the public domain, are available to analyze the data generated by NGS. The design and optimization of the informatics pipeline will depend on the intended clinical application. It is desirable to have professionals with informatics expertise inform the development of the pipeline because the various software tools require optimization for the intended clinical application.
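As a concrete illustration of the demultiplexing step described above, the following minimal sketch assigns reads to patient libraries by index sequence. The index sequences, sample names, and one-mismatch tolerance are illustrative assumptions; production demultiplexers (typically the vendor's own software) also handle base qualities, dual indexes, and index collisions.

# Minimal sketch: assign each read to a patient library by its index
# sequence, tolerating one mismatch. All values are illustrative.
SAMPLE_INDEX = {            # hypothetical 6-bp library indexes
    "ATCACG": "patient_01",
    "CGATGT": "patient_02",
    "TTAGGC": "patient_03",
}

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_read(index_read, max_mismatch=1):
    """Return the sample whose index uniquely matches within tolerance."""
    hits = [s for idx, s in SAMPLE_INDEX.items()
            if hamming(index_read, idx) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None   # None = unassigned

# Example: a one-error index read still demultiplexes unambiguously
# assign_read("ATCACG")  -> "patient_01"
# assign_read("ATCACT")  -> "patient_01"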


Most informatics pipelines contain processes to improve the quality of the data and to remove spurious results and variants that are not related to the analysis. Initially, individual reads are "trimmed" to remove a few bases at each end of the sequence that typically have lower quality. Duplicate reads are also removed to mitigate their interference with alignment and base calling. These steps have been reported to reduce the occurrence of false positives [13]. Next, the sequence reads of acceptable quality are aligned to a reference sequence. The quality of the alignment can vary with the software application, the region of the genome analyzed, and the types of sequence variations under investigation. A range of alignment software tools are available [14–18]. Some aligners are optimized for use with a particular platform, and some are designed to address particular platform-specific errors (e.g., difficulty sequencing homopolymer tracts). For example, achieving a correct alignment of homologous sequences can be challenging, and software varies in its capacity to perform this function effectively, particularly with respect to differentiating functional genes from pseudogenes [19].

After alignment, sequence variations between the patient's sample and the reference sequence are identified. The variant calling process will initially generate between 20,000 and 50,000 variants per exome, and approximately 3 million single nucleotide variants (SNVs) for whole genome sequencing [20,21]. Many of these variants are not real but are a consequence of mapping error or other errors that can be detected and filtered by software [21,22]. The propensity to generate incorrect calls can be reduced using several processes [19,23] that include:

• Local realignment
• Removal of duplicate reads that result from PCR artifacts during library creation
• Recalibration of base call quality scores.

The next step is to use the aligned sequence to identify variations that differ from the reference. Software tools to identify these sequence variations have been described and compared elsewhere [19,23,24]. At the present time, no single software tool can identify all types of variations, and multiple variant callers are used to analyze the same data set [18,22]. For example, an algorithm optimized to call SNVs will not necessarily perform well for calling insertions and deletions (indels) [18,22].

Once a set of sequence variations is identified, the final step of the process is to determine which one(s) are relevant to the clinical indication for testing. This process begins by annotating each sequence variation with information pertaining to its established or predicted effects on the protein product and/or the disease in question. This annotation process requires the use of a number of different informatics analysis programs and databases [25]. For example, two programs, SIFT and POLYPHEN, are used to predict the effect of an amino acid change on protein structure [26]. Other annotation features can include:

• Segregation of the variant relative to disease in an affected family
• Reported evidence that the sequence variation or associated gene is related to the medical condition in question
• Population prevalence of the variant relevant to the patient.

After the identified variants are annotated, they are analyzed to identify those that are relevant to the clinical question. This step typically combines software automation with manual review.
Software filters are used to remove sequence variations that are not relevant to the clinical question; for example, filters may remove variants that are prevalent in a healthy population. The use of appropriate software solutions and databases can significantly reduce the number of sequence variations that must be analyzed, but reviewing those that remain can take significant time and involves a manual operation requiring a high level of expertise.

Confirmatory testing is currently recommended for clinically actionable variants [12,27–30], both because false positives are likely and because it is not practical to directly validate, prior to patient testing, every clinically relevant variant that may be detected. Confirmatory testing should be performed using a separate clinically validated method, for example, a different NGS platform and informatics pipeline, Sanger sequencing, SNV array analysis, or another method. This can be a complex undertaking because the confirmatory test must be validated for each region of the genome targeted. The confirmatory test should be designed and validated prior to patient testing to minimize turnaround time. For example, if Sanger sequencing is used to confirm results of an NGS gene panel assay, the laboratory can design and validate primer pairs and the associated sequencing reaction across all genomic regions targeted by the NGS test. While this is possible for targeted gene panels, it becomes less practical for exome and whole genome analysis because of the cost and time required to design and validate a separate test for such large regions of the genome.
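A minimal sketch of one such filter, assuming hypothetical annotated variants and an illustrative 1% population allele frequency cutoff, is shown below; real pipelines draw frequencies from curated population databases and combine many filters with manual review.

# Minimal sketch: remove variants that are prevalent in a healthy
# reference population. Variant records and the cutoff are illustrative.
POPULATION_AF_CUTOFF = 0.01   # common variants are unlikely to be causal

variants = [   # hypothetical annotated calls: (id, population allele freq)
    ("gene1 c.123A>G", 0.152),    # common polymorphism -> filtered out
    ("gene2 c.456del", 0.0001),   # rare -> retained for manual review
    ("gene3 c.789C>T", None),     # absent from the database -> retained
]

def passes_population_filter(pop_af, cutoff=POPULATION_AF_CUTOFF):
    """Retain variants that are rare in, or absent from, the population data."""
    return pop_af is None or pop_af < cutoff

retained = [(vid, af) for vid, af in variants if passes_population_filter(af)]
# retained -> [("gene2 c.456del", 0.0001), ("gene3 c.789C>T", None)]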


THE REGULATORY AND PROFESSIONAL FRAMEWORK FOR ASSURING QUALITY

Results from clinical laboratory testing must be reliable and must inform appropriate clinical decision making. Regulatory, accreditation, and professional standards include requirements to help assure reliable laboratory testing. The oversight of clinical laboratory testing varies worldwide, but many countries base their requirements on International Organization for Standardization (ISO) standards, chiefly ISO 15189—Medical Laboratories—Particular Requirements for Quality and Competence [31]. In the United States, clinical laboratories are regulated under the Clinical Laboratory Improvement Amendments (CLIA) regulations administered by the Centers for Medicare and Medicaid Services (CMS) [32]. Several mechanisms exist for laboratories to become CLIA-certified: (1) direct certification of compliance by CMS, (2) accreditation from a CMS-approved provider [e.g., the College of American Pathologists (CAP) Laboratory Accreditation Program (LAP)], or (3) fulfilling the requirements of a state (New York and Washington) that is deemed by CMS to provide regulatory oversight comparable to that of the federal CLIA program [33–35].

Until recently, no quality standards or guidelines specifically addressed NGS. One of the first efforts in the United States, organized by the Centers for Disease Control and Prevention (CDC) in 2011, established a national working group that translated existing technical quality standards to apply more specifically to NGS clinical testing [12]. Other efforts, developed in parallel with or following the CDC effort, have been undertaken by the Clinical and Laboratory Standards Institute (CLSI) [36] and the American College of Medical Genetics and Genomics (ACMG) [30]. The CAP published a checklist for NGS as a component of its LAP [33]. The US Food and Drug Administration (FDA) hosted a public panel to gather input to inform its processes for reviewing clinical NGS tests submitted for regulatory clearance or approval [37].

Outside the US, several notable publications are relevant to the clinical use of NGS. A position paper facilitated by a Metaforum Leuven working group addressed the broader implications of NGS in clinical practice [38]. The Clinical Molecular Genetics Society of the United Kingdom published practice guidelines for NGS sequencing analysis and interpretation [39]. The European Society of Human Genetics released recommendations for performing whole genome sequencing in health care settings [40]. As of 2013, Eurogentest was considering the development of a European guideline (personal communication, Professor Gert Matthijs, University of Leuven).

Despite the current efforts to develop guidance for NGS, it is interesting to note that many of the same challenges apply to Sanger sequencing, which was introduced into clinical testing circa 1975–1980 and is considered the "gold" standard for sequence analysis. For example, it is difficult or impossible to validate the ability of either test to detect every sequence variation that can occur within the targeted region of the genome, because of the large number and variety of variations that can be detected. Therefore, strategies have been developed to demonstrate the validity and limitations of sequencing tests without having to assess every possible outcome. Surprisingly, limited formal guidance has been developed for clinical Sanger sequencing.
Therefore, the development of guidance for clinical NGS testing cannot draw on comparable efforts.

ASSAY VALIDATION

Documentation of clinical laboratory test validation is central to assuring the quality of clinical testing [2,32]. In the United States, the CLIA regulations require laboratories to verify the performance specifications established by the manufacturer for unmodified tests that are approved or cleared by the FDA [32]. For tests that are not approved or cleared by the FDA, or for cleared tests that are modified by the laboratory, i.e., laboratory-developed tests (LDTs), laboratories must perform a validation to establish and document performance specifications for the following performance characteristics:

• Accuracy
• Precision
• Analytical sensitivity
• Analytical specificity
• Reportable range
• Reference range
• Other characteristics, as applicable.

Outside the United States, ISO 17025 [41] and ISO 15189 [31] standards are used to broadly define verification and validation. For the purposes of this chapter, we will use the term "validation."

The current generation of NGS instruments and software was primarily designed to serve the research community and subsequently adapted for clinical applications. In 2013, the FDA cleared one NGS instrument system and reagent kit, and two NGS-based tests for cystic fibrosis [72]. As of 2014, the majority of clinical laboratories that perform NGS continue to design, optimize, and validate their assays as laboratory-developed tests. This places the responsibility for establishing and documenting performance specifications for all aspects of an NGS test on the clinical laboratory professionals. The sequencing platform, which is installed with default settings and software, is optimized by the laboratory for the intended clinical application prior to test validation. Currently, informatics pipelines are not available "out of the box" and must be assembled and optimized by each laboratory. Platform and software developers, once focused primarily on research uses, are becoming more adept at helping laboratories properly utilize their platforms for clinical applications.

Assay validation establishes the parameters by which a clinical test must perform during patient testing. The assay validation protocol for NGS is performed as a single workflow in which the platform, test, and informatics pipeline are evaluated (Figure 21.2). Platform validation is the process of establishing the performance specifications for the detection of each type of sequence variation targeted by the assay (e.g., SNVs, indels). Test validation establishes the performance specifications for the detection of sequence variations within the regions of the genome targeted by the test that are relevant to the intended clinical application [12]. The informatics validation documents that the informatics pipeline can reliably analyze sequence data derived from the sequencing test.

FIGURE 21.2 The NGS assay validation protocol is performed as a single workflow that includes evaluation of the platform, test, and informatics pipeline. (Diagram: assay development and optimization precedes assay validation of the platform, test, and IT/pipeline, which precedes patient testing; QC is performed daily and PT/AA periodically.)

Once the assay is implemented, any subsequent changes made to procedures or reagents may require a revalidation to reestablish the performance specifications [12,31–33]. Platform and software developers frequently update their software; for example, a software change to improve alignment efficiency represents a substantial change that requires a revalidation. The time and expense required to revalidate an assay can be significant for the clinical laboratory. In some instances, a total revalidation of the NGS workflow is not necessary; laboratories may only need to revalidate the portion of the NGS workflow from the point the change was introduced through to the end of the procedure. For example, an update to the annotation software does not impact upstream processes, such as the chemical sequencing of the patient sample or the alignment, and these upstream processes may therefore not require revalidation. Reagent replacement and routine instrument maintenance typically do not require a revalidation but do require the use of QC procedures to ensure that the change does not compromise the quality of the assay; for example, a new lot of reagent must perform comparably to the previous one.

The definitions for the performance characteristics (e.g., accuracy, precision) have been adapted to DNA sequence analysis and specifically to NGS [12] (Table 21.1).
These specifications represent the metrics that describe the performance of the entire analytic testing process. The following sections provide additional details about the establishment of performance specifications and QC procedures for NGS applications.


TABLE 21.1 CLIA Performance Characteristics Defined for NGS

Performance characteristic | Definition for NGS applications^a
Accuracy | The degree of agreement between the nucleic acid sequences derived from the assay and a reference sequence
Precision | The degree to which repeated sequence analyses give the same result—repeatability (within-run precision) and reproducibility (between-run precision)
Analytical sensitivity | The likelihood that the assay will detect the targeted sequence variations, if present
Analytical specificity | The probability that the assay will not detect a sequence variation when none are present (the false-positive rate is a useful measurement for sequencing assays)
Reportable range | The region of the genome in which sequence of an acceptable quality can be derived by the laboratory test
Reference range | Establishment of reportable sequence variations the assay can detect that are expected to occur in an unaffected population

^a Definitions originally described in Ref. [12].

ACCURACY

Accuracy is defined as the closeness of agreement between a test result and the accepted reference value [42,43]. For NGS applications, accuracy can be defined as "the degree of agreement between the nucleic acid sequences derived from the assay and a reference sequence" [8]. Several factors throughout an NGS workflow influence the accuracy of test results. For tests in which patient samples are multiplexed, laboratory professionals take precautions to minimize the possibility of an index becoming associated with the wrong patient sample. The reliability of multiplexing is greatly influenced by several factors, including index design, the method by which the indexes are added to each fragment, and the software tools used to demultiplex sequence reads.

Evaluation of the accuracy of each read depends on several factors that include the base call quality scores, the depth of coverage, the sequence content, and the inherent error rate of the NGS technology [12,44]. NGS technologies assign a quality score, or Q score, to each base that is sequenced. These Q scores are a quantitative measure of base call accuracy based on the quality (Phred) scores used in Sanger sequencing [45]. Phred Q scores represent the log value of the error probability for a given base called by the sequencing instrument [46]. These scores define the likelihood that a base call is accurate; for example, a Q score of 20 corresponds to a 1 in 100 likelihood of error, or 99% base call accuracy, and a Q score of 30 corresponds to a 1 in 1000 likelihood of error, or 99.9% base call accuracy [46]. Each NGS platform produces errors inherent to the technology used, and base-calling procedures vary among platforms. Q scores do not always accurately reflect the true base-calling error rate because they do not necessarily assess all sources of error [23]. To address these issues, Q scores can be recalibrated using algorithms that take into account the confidence of alignment to a reference sequence, error profile aspects of the sequencing technology, the depth of coverage achieved, and other criteria [15,47] to generate more accurate quality scores [19].

The depth of coverage necessary to make accurate base calls is established during assay validation. Coverage is typically reported as the average depth of coverage, or the average number of overlapping reads within the region of the genome sequenced [12]. The depth of coverage needed to make an accurate variant call depends on a number of factors, including the type of sequence variation to be evaluated and the flanking sequence context [12]. For example, less coverage is typically needed to detect homozygous than heterozygous SNVs. The depth of coverage will vary across the genome, so minimal coverage thresholds are set to assure that all regions under investigation are covered at a depth adequate to make variant calls. The average depths of coverage reported in the literature for clinical applications range from 15–100× and above; however, the required depth will depend on a given laboratory's assay design, the region of the genome targeted, the type of variant to be detected, and the choice of sequencing technology [28,29]. Some regions of the genome are prone to low coverage. This is frequently seen in the first exon of many genes, which is often GC rich. This is not unique to NGS; Sanger sequencing has the same limitation with respect to GC-rich regions [48].
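The Phred relationship described above can be made concrete with a short sketch; the functions below simply restate Q = -10 * log10(P), where P is the probability that the base call is wrong, and reproduce the Q20 and Q30 examples from the text.

# Minimal sketch of the Phred quality score relationship.
import math

def error_probability(q_score):
    """Probability that a base call with the given Q score is incorrect."""
    return 10 ** (-q_score / 10)

def q_score(p_error):
    """Phred quality score corresponding to an error probability."""
    return -10 * math.log10(p_error)

assert abs(error_probability(20) - 0.01) < 1e-12    # Q20 -> 1 in 100
assert abs(error_probability(30) - 0.001) < 1e-12   # Q30 -> 1 in 1000
assert round(q_score(0.001)) == 30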
For regions where adequate coverage cannot be achieved, an alternate validated method (e.g., Sanger sequencing, SNV array analysis) must be used [28]. NGS may be prone to both false-positive and false-negative findings. This can be monitored, to an extent, by metrics established during validation that include the allelic read percentage (allelic fraction) and identification of strand bias [12].
The allelic read percentage is the established proportion of unique reads required to make an accurate variant call [5]. In theory, every read that contributes to a homozygous variant call would contain the variant, resulting in an allelic read percentage of 100. Likewise, a heterozygous variant would ideally contain the variant base in 50% of reads, an allelic read percentage of 50. In practice, the allelic read percentage may not reflect these expected percentages due to amplification bias or the presence of PCR duplicate reads [5,12]. PCR duplicates are reads that contain the same start and end positions and are generated by clonal amplification of the NGS sample library prior to sequencing. Duplicate reads (except for the read with the highest quality score) are typically removed because they can alter the allelic fraction or incorrectly indicate the presence of strand bias (preferential sequencing of one strand but not the other). They are also likely to represent the unequal amplification and subsequent sequencing of identical fragments that may contain PCR errors and lead to false-positive or false-negative variant calls [5,19].

Ideally, the sequencing reactions will produce an equal distribution of forward and reverse strand reads; however, technical artifacts that arise during the sequencing process can result in reads mapping to the reference from only one strand, generating strand bias. The genotype inferred from the forward and reverse strands can sometimes differ; therefore, a bias of reads from either the forward or reverse direction can produce false-positive or false-negative calls. Laboratory professionals monitor strand bias during test validation and patient testing to detect instances where this may be a problem.

Automated and manual evaluations of sequence variations are performed to identify those that are relevant to the clinical question posed. The accuracy of this process depends on several factors, including appropriate annotation, filtering, and classification. For example, the population prevalence of a sequence variation may determine whether it is common in an asymptomatic population and therefore less likely to be disease associated. Estimates of genotype prevalence vary across populations, and the accuracy of the reported figures varies within the published literature. Laboratory professionals strive to recognize these limitations and account for them in the annotation process. Classification of variants depends in part on what has been reported in databases that link variants to a disease phenotype, and the quality of these databases also varies. For example, one study reported that 27% of variations cited in the literature and deposited into databases as potentially disease associated were found to be either common variations or otherwise mislabeled [49]. In the absence of high-quality, curated, clinically useful databases, laboratory and other professionals are required to carefully evaluate the available data in making decisions about the quality of the evidence that links sequence variations to a medical condition and ultimately to the patient.
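For illustration, the sketch below computes the two per-variant metrics just discussed: the allelic fraction and a simple strand-bias check. Applying Fisher's exact test to forward/reverse read counts is one common approach, not a method prescribed by this chapter, and the counts are invented for the example.

# Minimal sketch: allelic fraction and a strand-bias test on illustrative
# read counts. Production variant callers use more elaborate models.
from scipy.stats import fisher_exact

def allelic_fraction(alt_reads, total_reads):
    """Fraction of unique reads supporting the variant allele."""
    return alt_reads / total_reads

def strand_bias_p(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Small p-value suggests variant support comes mostly from one strand."""
    _, p = fisher_exact([[ref_fwd, ref_rev], [alt_fwd, alt_rev]])
    return p

# A heterozygous call near the expected 50% allelic fraction:
print(allelic_fraction(alt_reads=48, total_reads=100))   # 0.48

# Balanced strands vs. variant reads from one strand only (bias):
print(strand_bias_p(25, 27, 24, 24))   # large p: no evidence of bias
print(strand_bias_p(25, 27, 48, 0))    # very small p: suspicious call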

PRECISION

For both Sanger sequencing and NGS, precision is defined as the degree of agreement between replicate measurements of the same material [12]. To establish precision, samples are analyzed to assess the reproducibility (between-run precision) and repeatability (within-run precision) of the test. Reproducibility is established to assess the consistency of results when testing the same sample under different conditions, for example, between different runs, between different sample preparations, by different technicians, and on different instruments. Repeatability is established by sequencing the same samples multiple times under the same conditions and evaluating the concordance of variant detection and performance. Because of both time and cost, it is not practical to run a sufficient number of samples by NGS to derive a statistically meaningful result to establish precision [12]. It is also unlikely that a reasonable number of samples will contain the full spectrum of clinically relevant sequence variations associated with a given medical condition for which NGS testing would be ordered. Nonetheless, a US national workgroup considered this challenge and proposed that three different reference samples be sequenced between 3 and 5 times in the same and different runs [12]. The workgroup also suggested that concordance of QC metrics established during assay validation, such as depth of coverage, uniformity of coverage, allelic read percentage, and other metrics such as the transition/transversion ratio (the ratio of transition base substitutions to transversion base substitutions for a pair of sequences), be evaluated and compared, since concordance may be indicative of precision. Use of electronic data files may help to define the precision of the informatics pipeline at several stages of the analysis. For example, data files that contain real or simulated sequence reads, alignments, and/or variant calls can in principle be used to evaluate some or all of the informatics pipeline. The use of these materials has not been extensively studied, and the design and use of such data files has not been established in general practice.
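As an illustration of how such concordance metrics might be computed, the sketch below derives the transition/transversion ratio of a call set and the concordance between two replicate runs; the tuple representation of variant calls and the use of a Jaccard index are simplifying assumptions made for this example.

    TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

    def titv_ratio(calls):
        """Transition/transversion ratio over single-base substitutions,
        where each call is a (chrom, pos, ref, alt) tuple."""
        snvs = [(r, a) for (_, _, r, a) in calls if len(r) == len(a) == 1]
        ti = sum(1 for pair in snvs if pair in TRANSITIONS)
        tv = len(snvs) - ti
        return ti / tv if tv else float("inf")

    def concordance(run_a, run_b):
        """Fraction of calls shared between replicate runs (Jaccard index)."""
        a, b = set(run_a), set(run_b)
        return len(a & b) / len(a | b)

    run1 = [("chr7", 55259515, "T", "G"), ("chr1", 115258747, "C", "T")]
    run2 = run1 + [("chr12", 25398284, "C", "A")]
    print(titv_ratio(run1), round(concordance(run1, run2), 2))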


ANALYTICAL SENSITIVITY AND ANALYTICAL SPECIFICITY

Analytical sensitivity is defined as the lower limit of detection (LOD) or the proportion of biological samples that have a positive test result and are correctly classified as positive [2]. With NGS and Sanger sequencing, analytical sensitivity may be expressed as the likelihood that an assay will detect a sequence variation when one is present within the genomic region that is sequenced [12]. Analytical specificity is the probability that the assay will not detect sequence variation(s) when none are present within the genomic region that is analyzed [12]. During assay validation, analytical sensitivity and specificity are typically established for a set of sequence variations anticipated to represent the spectrum of potential clinically relevant findings, through comparison of NGS test results to those obtained by another method(s). Sanger sequencing and SNV arrays are two methods that can be useful for this purpose. However, the usefulness of an SNV array for establishing sensitivity and specificity will depend on the location, number, and distribution of SNVs included in the assay, because SNV arrays generally do not include regions of the genome that are difficult to sequence [50]. In some instances, SNV arrays cannot be used because of the lack of overlap between the SNVs contained in the array and those targeted within the genomic regions to be tested.

The analytical sensitivity and analytical specificity of NGS assays vary across the genome [12]. Sensitivity and specificity are influenced by many factors, including:

• read quality and the cutoffs established by filters
• the depth and uniformity of coverage achieved
• the capacity to correctly align the sequence reads to a reference.

The ability to measure sensitivity and specificity can also be affected by the distribution and type of disease-associated and naturally occurring sequence variations in the reference materials used to measure these assay performance metrics. SNVs are more prevalent than other types of sequence variations, occurring approximately once every 500–1000 bases across the human genome [51]. Therefore, assessment of the sensitivity and specificity of SNV detection is likely to be more accurate than for other types of variants. It may not be possible to measure analytical sensitivity for the detection of disease-associated indels in genomic regions where they may occur because of the lack of reference materials. However, analytical sensitivity and specificity may sometimes be inferred by analysis of nondisease-associated, naturally occurring indels in the genomic regions targeted by the test. Previously characterized genomic DNA reference materials can be useful for this assessment. There is no formal recommendation specifying the number of samples needed to establish the analytical sensitivity and specificity of clinical NGS testing, but an accepted practice in the validation of chromosome microarray (CMA) tests has been proposed as a model for NGS [12,52]. For CMA, it is recommended that a minimum of 30 specimens with disease-associated chromosomal abnormalities be evaluated during test validation [52]. This approach takes into account the almost limitless variation that can be detected by CMA analysis, as well as an acknowledgment that it is time consuming and costly to run hundreds of samples for validation of this type of assay.
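Once each targeted position has been classified against the comparator method as a true positive, false positive, true negative, or false negative, the two parameters reduce to simple proportions, as in the brief sketch below; the counts are invented for illustration, and zygosity errors are ignored in this simplified model.

    def analytical_performance(tp, fp, tn, fn):
        """Analytical sensitivity and specificity from a positionwise
        comparison of NGS calls against an orthogonal method."""
        sensitivity = tp / (tp + fn)  # variants present that were detected
        specificity = tn / (tn + fp)  # invariant positions correctly called
        return sensitivity, specificity

    # Example: 198 of 200 known variants detected, with 3 false positives
    # across 99,800 positions that are concordant with the reference
    sens, spec = analytical_performance(tp=198, fp=3, tn=99_797, fn=2)
    print(f"sensitivity = {sens:.4f}, specificity = {spec:.5f}")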
NGS is primarily used to detect homozygous and heterozygous sequence variations, which have allele fractions of 100% and 50%, respectively. For many genetic analyses, however, such as the detection of mosaicism or mitochondrial heteroplasmy, the detection of alleles present at lower allelic fractions is required. The lower LOD refers to the minimum allele fraction that the assay is able to detect and relates to the analytical sensitivity of the assay. Other factors that influence the lower LOD for low-frequency alleles include artifacts introduced by PCR or library contamination and the quality of the alignment and variant calling algorithms [53]. Successful approaches that achieve high analytical sensitivity during mitochondrial analysis take advantage of the capacity to sequence at a high depth of coverage, typically >5000× [54]. This depth of coverage is possible because of the relatively small size of the mitochondrial genome compared to its nuclear counterpart, coupled with the presence of many copies of the mitochondrial genome per cell. Deep sequencing permits the detection of sequence variants present at a frequency as low as 1.3%. In comparison, the LOD for Sanger sequencing is approximately 15% [54]. The clinical significance of this LOD has not been thoroughly studied, but in some instances important information is derived that might otherwise be missed using other methods. For example, the tRNA mutation A3243G, associated with mitochondrial encephalomyopathy, lactic acidosis, and stroke-like episodes (MELAS), is associated with maternally inherited diabetes mellitus and deafness when present at a low percentage (below the LOD for Sanger sequencing) [55].

NGS is also used as a method for the detection of somatic mutations related to cancer. There are several challenges to developing a sensitive and specific NGS assay that can detect somatic sequence variations. Many neoplasms contain an abundance of stroma, admixtures of different cell lineages (vasculature, blood cells, etc.), necrotic tissue, and other normal cells, which effectively dilutes the cancer-associated sequence variations in the mix. Laser microdissection and other cellular enrichment methods to isolate cancer tissue have been helpful, and the success of these methods influences the analytical sensitivity and specificity of the NGS test. The LOD is established during assay validation. Generally, NGS testing of tumor tissue requires higher coverage to detect tumor sequence within a mixed sample. In addition, DNA extracted from formalin-fixed paraffin-embedded (FFPE) tumor samples contains chemical crosslinks between the formalin and DNA, which increase over time. These crosslinks cause lesions in the DNA that result in damage, and higher coverage has proven useful for achieving high-quality DNA sequencing from these samples [56].
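The relationship between depth of coverage and the lower LOD can be illustrated with a simple binomial model, assuming that sequencing errors are neglected and that a variant is called only when a minimum number of supporting reads is observed; the threshold of 10 reads used here is an arbitrary illustration, not a recommendation.

    from scipy.stats import binom

    def detection_probability(depth, allele_fraction, min_alt_reads):
        """Probability of observing at least min_alt_reads variant reads
        at a site of the given depth and allele fraction."""
        # sf(k - 1) gives P(X >= k) for X ~ Binomial(depth, allele_fraction)
        return binom.sf(min_alt_reads - 1, depth, allele_fraction)

    # At 5000x, a 1.3% variant is expected in ~65 reads and is detected
    # almost surely; at 100x it is expected in ~1 read and is missed
    for depth in (100, 1000, 5000):
        print(depth, round(detection_probability(depth, 0.013, 10), 4))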

REPORTABLE AND REFERENCE RANGES

The “reportable range” is defined as “the span of test result values over which the laboratory can establish or verify the accuracy of the instrument or test system measurement response” [32]. For NGS, the reportable range can be defined as the portion of the genome for which sequence information can be reliably derived for a defined test system [12]. The reportable range is not necessarily a contiguous region of the genome, because exons are not contiguous within genes and genes are not contiguous within the genome; in addition, gene panel tests target only specific genes, which may themselves be noncontiguous. The reportable range is inclusive of sequence determined by all methods (NGS and any others) used to interrogate the targeted region(s) of the genome.

The “reference range” (or reference interval) is defined as “the range of test values expected for a designated population of persons” [32]. For NGS, the reference range is defined as the range of normal sequence variations that the assay is designed to detect within a defined population [12]. Establishing the reference range can be problematic because a clear definition of “normal variation” is not always possible. Variant frequencies may differ among populations and may correlate inconsistently with a disease association. When NGS results fall outside the established reference range (e.g., detection of an indel not normally found in the sequenced region), additional investigation to establish the clinical significance is needed. Sequence variant databases that are used to determine the frequency of sequence variations in defined populations need to be updated with accurate, current information. Certain types of sequence variations do not readily fit into a reference range, such as those relevant to pharmacogenetics. Pharmacogenetic alleles generally affect drug metabolism but do not affect a disease state; therefore, they would be considered normal sequence variations by the current definition. For example, the CYP2C19*2 allele is a loss-of-function haplotype (splicing defect) that significantly diminishes conversion of clopidogrel (Plavix) to its active form [57]. Clopidogrel in conjunction with aspirin is typically ordered for patients following heart catheterization, and use of an alternate drug is indicated for patients with a *2/*2 haplotype. In healthy patients, the presence of a *2 allele is not known to be disease associated and does not have a disease-associated phenotype [57].

QUALITY CONTROL

The QC procedures used during patient testing to monitor test performance are determined during test validation. All aspects of the NGS testing process, including DNA extraction, library preparation, DNA sequencing, and the informatics analysis pipeline, are quality controlled. QC procedures are established to detect and correct errors caused by test system failure before patient test results are reported [2]. Internal control materials (e.g., combinations of spiked-in, synthetic, and normal sequence variations within the patient’s sample) as well as external controls, such as previously characterized genomic DNA, are useful for ongoing quality assessment [12]. Controls are selected to assess the performance of the assay and its capacity to detect sequence variations within the targeted regions of the genome. However, it is not possible to have a control for each clinically relevant sequence variation, and this presents a risk of false-positive and false-negative results [12]. In order to detect failures in an NGS test early in the process, “quality checkpoints” are included in the assay design. These checkpoints can serve as indicators that one or more procedures have failed, or that there has been a significant deviation from the specifications established during validation. If these checkpoints fail, the sequencing run may be terminated prior to completion, before additional time and costs accrue. These checkpoints may include [12]:

• assessment of the quality and size uniformity of the DNA used for library preparation
• the quality of the first base read from the sequencing platform
• attainment of the coverage established during test validation
• achievement of the minimal percentage of reads that map to regions of the genome targeted by the assay.
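In practice, such checkpoints are often encoded as automated pass/fail gates in the analysis pipeline. The fragment below is a hypothetical sketch of one such gate; the metric names and threshold values are placeholders that a laboratory would replace with the specifications established during its own validation.

    # Hypothetical run-level QC thresholds established during validation
    QC_THRESHOLDS = {
        "mean_target_coverage": 100,   # minimum mean depth over targets
        "pct_targets_at_20x": 0.95,    # fraction of targets covered >= 20x
        "pct_reads_on_target": 0.70,   # mapping specificity of the capture
        "pct_bases_q30": 0.80,         # fraction of base calls with Q >= 30
    }

    def evaluate_run(metrics):
        """Return the list of checkpoints a sequencing run has failed."""
        return [name for name, minimum in QC_THRESHOLDS.items()
                if metrics.get(name, 0) < minimum]

    run_metrics = {"mean_target_coverage": 142, "pct_targets_at_20x": 0.97,
                   "pct_reads_on_target": 0.64, "pct_bases_q30": 0.86}
    failed = evaluate_run(run_metrics)
    if failed:
        print("Run halted before completion; failed checkpoints:", failed)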

REFERENCE MATERIALS

The term reference materials describes a number of different material types, including certified or standard reference materials, quality control materials, and calibrators [58]. These materials are used by laboratory professionals for assay development, assay validation, QC, and proficiency testing. Clinical laboratories follow standards and recommendations from professional organizations regarding the use of reference materials to assure the quality of their clinical tests [2,31–35,59–62]. Reference materials are used in the development of laboratory tests and to monitor their performance, and it is recommended that they resemble patient specimens as closely as possible [43,61]. Patients’ samples and/or genomic DNA derived from characterized cell lines with disease-associated sequence or naturally occurring sequence variations targeted by the assay can be used as reference materials [12]. While it is not practical or even possible to obtain reference materials that contain every possible variant in every targeted gene, laboratories often use reference materials that contain a variety of mutation types (SNVs, indels, etc.) that the assay is designed to detect. These reference materials can be used to establish the performance specifications for a small subset of the disease-associated and naturally occurring sequence variations that may be detected. The use of reference materials primarily serves to establish confidence that a clinically relevant sequence variation can be detected within the regions of the genome assessed. A variety of materials, including genomic DNA from blood or cell lines, manufactured DNA such as plasmids or oligonucleotides, and electronic data files, can be developed as reference materials (RMs), and each has advantages and limitations [12]. Approaches for the development of reference materials for NGS have been reviewed [12].

At present, the human genome reference sequence is incomplete. The current sequence is a culmination of multiple efforts to define a consensus sequence [63]. Patches and updates are routinely issued to correct mistakes and add sequence that was missing from previous analyses. When updates are issued, clinical laboratories may use the revised sequence to reanalyze previous NGS data in which no clinically relevant findings were found; there is no estimate of the extent to which this has resulted in new clinical findings. The National Institute of Standards and Technology (NIST) is developing Standard Reference Materials using a number of genomic DNA samples [64]. When finished, these DNA samples will comprise a set of well-characterized whole genome and synthetic DNA reference materials, along with the methods (documentary standards) and reference data necessary for use as reference materials. In addition, the CDC’s Genetic Testing Reference Material Coordination Program (GeT-RM) is working with the National Center for Biotechnology Information and a number of volunteer clinical and research laboratories to collect preexisting sequence information and generate new sequence data (NGS and Sanger) from clinical NGS panel assays, as well as data from whole exome and whole genome analysis, for two publicly available HapMap samples. Data have been and continue to be received, analyzed, and formatted to populate a custom browser (http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/) that displays the data and consensus sequence together with metrics that laboratories can use to compare and troubleshoot their NGS assays [65].
Electronic data files constructed to simulate actual patient data can also be useful for test optimization and validation, although limited work in this area had been reported in the literature as of 2014 [66]. The use of this type of simulated data may help to assess the informatics pipeline and better understand the potential for false positives and negatives within regions that cannot be readily evaluated using previously characterized genomic DNA samples [12].
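As a toy example of this idea, the snippet below writes a small FASTQ file of simulated reads that tile a short reference fragment carrying a known, engineered substitution; the sequence, file name, and uniform Q30 quality string are arbitrary choices for illustration, and purpose-built simulators (such as the one described in [66]) model error profiles far more realistically.

    reference = "ACGTTAGCCTAGGATCCGATACGTAGGCTAACGGTTACCGTA"
    variant_pos, variant_base = 20, "T"  # engineered SNV (0-based position)

    mutated = reference[:variant_pos] + variant_base + reference[variant_pos + 1:]

    with open("simulated_reads.fastq", "w") as fq:
        read_len = 25
        for i, start in enumerate(range(0, len(mutated) - read_len + 1, 3)):
            seq = mutated[start:start + read_len]
            # '?' encodes Q30 in the Phred+33 convention
            fq.write(f"@sim_read_{i}\n{seq}\n+\n{'?' * read_len}\n")

A pipeline that aligns these reads but fails to call the engineered variant, or calls it at an unexpected allele fraction, signals a problem in the informatics steps rather than in the chemistry.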

CONCLUSION

NGS technologies are transforming clinical genetic testing by enabling large regions of the human genome to be sequenced to identify variations that are clinically relevant for individual patients. Practice guidelines for clinical NGS are beginning to emerge [12,30,33,34,36,39]. NGS testing is a multistep process that utilizes technologies and software that continue to evolve. For example, it is anticipated that future NGS technologies will be capable of producing longer read lengths with lower error rates. As a consequence, current practices and guidance to assure quality testing will need to be updated. NGS can be validated for clinical testing, but current error profiles may require the use of complementary alternative methods to reliably sequence all targeted regions of the genome, and confirmatory testing to account for false positives. The availability of well-characterized reference materials, and of databases that reliably report variant and gene associations with medical conditions, will be critical for the success of clinical NGS. Efforts to establish variant databases are anticipated to provide laboratory professionals with curated and reliable data for the interpretation of sequence variations. For example, the eMERGE network, a consortium for conducting genomic studies, is developing a database to link patients’ specimens deposited in biorepositories with clinical data [67]. This and similar efforts may provide high-quality, clinically useful databases for more effective sharing of insights about the clinical significance of genomic variations.

ACMG has published a policy recommending that laboratories offering exome or whole genome clinical testing also report incidental findings of significant clinical relevance to the patient that are separate from the reason the test was ordered [68]. The guideline provides a list of genes associated with medical conditions that were considered to be medically important and actionable (http://www.acmg.net). The recommended list of incidental findings is relevant to a spectrum of medical conditions that include heritable cancer disorders, cardiomyopathy, and rare conditions such as malignant hyperthermia susceptibility. Adoption of this policy recommendation will require laboratories to adjust their validation protocols to establish that they can reliably detect sequence variations associated with the targeted conditions. This poses a challenge for laboratories that lack expertise in the detection, analysis, and reporting of sequence variations associated with these incidental findings. NGS has transformed the world of clinical genetic testing; it has been used to make diagnoses not attainable by other methods, in some cases resulting in lifesaving measures [69–71]. The quality of clinical NGS applications depends on the capacity of laboratory professionals to validate these tests now and into the future.

Acknowledgment

The work was supported in part by appointment of Amy S. Gargis to the Research Participation Program at the Centers for Disease Control and Prevention, administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the US Department of Energy and CDC.

References

[1] Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005;437:376–80.
[2] Chen B, Gagnon M, Shahangian S, Anderson NL, Howerton DA, Boone JD. Good laboratory practices for molecular genetic testing for heritable diseases and conditions. MMWR Recomm Rep 2009;58(RR-6):1–37.
[3] CLSI. Quality management for molecular genetic testing; approved guideline. CLSI document MM20-A. Wayne, PA: Clinical and Laboratory Standards Institute; 2012.
[4] Bystrykh LV. Generalized DNA barcode design based on Hamming codes. PLoS ONE 2012;7:e36852.
[5] Voelkerding KV, Dames S, Durtschi JD. Next generation sequencing for clinical diagnostics-principles and application to targeted resequencing for hypertrophic cardiomyopathy: a paper from the 2009 William Beaumont Hospital Symposium on Molecular Pathology. J Mol Diagn 2010;12:539–51.
[6] Su Z, Ning B, Fang H, Hong H, Perkins R, Tong W, et al. Next-generation sequencing and its applications in molecular diagnostics. Expert Rev Mol Diagn 2011;11:333–43.
[7] Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 2008;9:387–402.
[8] ten Bosch JR, Grody WW. Keeping up with the next generation: massively parallel sequencing in clinical diagnostics. J Mol Diagn 2008;10:484–92.
[9] Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008;26:1135–45.
[10] Metzker ML. Sequencing technologies: the next generation. Nat Rev Genet 2010;11:31–46.
[11] Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature 2011;475:348–52.
[12] Gargis AS, Kalman L, Berry MW, Bick DP, Dimmock DP, Hambuch T, et al. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat Biotechnol 2012;30:1033–6.
[13] Liu Q, Guo Y, Li J, Long J, Zhang B, Shyr Y. Steps to ensure accuracy in genotype and SNP calling from Illumina sequencing data. BMC Genomics 2012;13(Suppl. 8):S8.
[14] Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nat Methods 2009;6(11s):S6–12.
[15] Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010;11:473–83.
[16] Ruffalo M, LaFramboise T, Koyutürk M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics 2011;27:2790–6.
[17] Yu X, Guda K, Willis J, Veigl M, Wang Z, Markowitz S, et al. How do alignment programs perform on sequencing data with varying qualities and from repetitive regions? BioData Min 2012;5(1):6.
[18] Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, et al. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 2013;15(2):256–78.
[19] Coonrod EM, Durtschi JD, Margraf RL, Voelkerding KV. Developing genome and exome sequencing for candidate gene identification in inherited disorders: an integrated technical and bioinformatics approach. Arch Pathol Lab Med 2013;137:415–33.
[20] Stitziel NO, Kiezun A, Sunyaev S. Computational and statistical approaches to analyzing variants identified by exome sequencing. Genome Biol 2011;12:227.


[21] Gilissen C, Hoischen A, Brunner HG, Veltman JA. Disease gene identification strategies for exome sequencing. Eur J Hum Genet 2012;20:490–7.
[22] Altmann A, Weber P, Bader D, Preuss M, Binder EB, Müller-Myhsok B. A beginner's guide to SNP calling from high-throughput DNA-sequencing data. Hum Genet 2012;131:1541–54.
[23] Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 2011;12:443–51.
[24] Neuman JA, Isakov O, Shomron N. Analysis of insertion-deletion from deep-sequencing data: software evaluation for optimal detection. Brief Bioinform 2013;14:46–55.
[25] Pope BJ, Nguyen-Dumont T, Odefrey F, Hammer F, Bell R, Tao K, et al. FAVR (Filtering and Annotation of Variants that are Rare): methods to facilitate the analysis of rare germline genetic variants from massively parallel sequencing datasets. BMC Bioinformatics 2013;14:19.
[26] Flanagan SE, Patch AM, Ellard S. Using SIFT and PolyPhen to predict loss-of-function and gain-of-function mutations. Genet Test Mol Biomarkers 2010;14:730.
[27] Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 2009;10:R32.
[28] Gowrisankar S, Lerner-Ellis JP, Cox S, White ET, Manion M, LeVan K, et al. Evaluation of second-generation sequencing of 19 dilated cardiomyopathy genes for clinical applications. J Mol Diagn 2010;12:818–27.
[29] Jones MA, Bhide S, Chin E, Ng BG, Rhodenizer D, Zhang VW, et al. Targeted polymerase chain reaction-based enrichment and next generation sequencing for diagnostic testing of congenital disorders of glycosylation. Genet Med 2011;13:921–32.
[30] Rehm HL, Bale SJ, Bayrak-Toydemir P, Berg JS, Brown KK, Deignan JL, et al., Working Group of the American College of Medical Genetics and Genomics Laboratory Quality Assurance Committee. ACMG clinical laboratory standards for next-generation sequencing. Genet Med 2013;15:733–47.
[31] ISO/IEC 15189. Medical laboratories—particular requirements for quality and competence. Geneva: International Organization for Standardization; 2007.
[32] The Clinical Laboratory Improvement Amendments (CLIA) regulations; laboratory requirements, 42 C.F.R. Part 493 (2004). Available at: http://wwwn.cdc.gov/CLIA/Regulatory/default.aspx [accessed 23.06.14].
[33] College of American Pathologists Laboratory Accreditation Program. Available at: http://www.cap.org/apps/cap.portal [accessed 27.03.13].
[34] New York State Department of Health. Clinical laboratory evaluation program, laboratory standards; 2008. Available at: http://www.wadsworth.org/labcert/clep/standards.htm [accessed 27.03.13].
[35] Washington State Office of Laboratory Quality Assurance. Available at: http://www.doh.wa.gov/LicensesPermitsandCertificates/FacilitiesNewReneworUpdate/LaboratoryQualityAssurance.aspx [accessed 27.03.13].
[36] CLSI. Nucleic acid sequencing methods in diagnostic laboratory medicine; approved guideline. CLSI document MM09-A2. Wayne, PA: Clinical and Laboratory Standards Institute; 2014.
[37] The Food and Drug Administration. Workshops and Conferences. Ultra high throughput sequencing for clinical diagnostic applications—approaches to assess analytical validity, June 23, 2011. Available at: www.fda.gov/MedicalDevices/NewsEvents/WorkshopsConferences/ucm255327.htm [accessed 27.03.13].
[38] Metaforum Leuven. Full sequencing of the human genome. Hollands College, Leuven, Belgium. Available at: http://www.kuleuven.be/metaforum/page.php?LAN=E&FILE=wg_docs [accessed 27.03.13].
[39] Ellard S, Charlton R, Lindsay H, Camm N, Watson C, Abbs S, et al. Practice guidelines for targeted next generation sequencing analysis and interpretation. Ratified by the CMGS Executive Committee in December 2012. Available at: http://www.cmgs.org/BPGs/BPG%20for%20targeted%20next%20generation%20sequencing%20final.pdf [accessed 24.05.13].
[40] van El CG, Cornel MC, Borry P, Hastings RJ, Fellmann F, Hodgson SV, et al. Whole-genome sequencing in health care: recommendations of the European Society of Human Genetics. Eur J Hum Genet 2013;21:580–4.
[41] ISO/IEC 17025. General requirements for the competence of testing and calibration laboratories. Geneva: International Organization for Standardization; 2005.
[42] ISO/IEC 3534-1. Statistics—vocabulary and symbols—Part 1: general statistical terms and terms used in probability. Geneva: International Organization for Standardization; 2006.
[43] CLSI. Verification and validation of multiplex nucleic acid assays; approved guideline. CLSI document MM17-A. Wayne, PA: Clinical and Laboratory Standards Institute; 2008.
[44] Ajay SS, Parker SC, Abaan HO, Fajardo KV, Margulies EH. Accurate and comprehensive sequencing of personal genomes. Genome Res 2011;21:1498–505.
[45] Richterich P. Estimation of errors in “raw” DNA sequences: a validation study. Genome Res 1998;8:251–9.
[46] Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 1998;8:175–85.
[47] DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–8.
[48] Chin EL, da Silva C, Hegde M. Assessment of clinical analytical sensitivity and specificity of next-generation sequencing for detection of simple and complex mutations. BMC Genet 2013;14:6.
[49] Bell CJ, Dinwiddie DL, Miller NA, Hateley SL, Ganusova EE, Mudge J, et al. Carrier testing for severe childhood recessive diseases by next-generation sequencing. Sci Transl Med 2011;65:114.
[50] Shaffer LG, Beaudet AL, Brothman AR, Hirsch B, Levy B, Martin CL, et al., Working Group of the Laboratory Quality Assurance Committee of the American College of Medical Genetics. Microarray analysis for constitutional cytogenetic abnormalities. Genet Med 2007;9:654–62.
[51] Taillon-Miller P, Gu ZJ, Li Q, Hillier L, Kwok PY. Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res 1998;8:748–54.


[52] Lam HY, Clark MJ, Chen R, Chen R, Natsoulis G, O’Huallachain M, et al. Performance comparison of whole-genome sequencing platforms. Nat Biotechnol 2011;30:78–82.
[53] Li M, Stoneking M. A new approach for detecting low-level mutations in next-generation sequence data. Genome Biol 2012;13:R34.
[54] Zhang W, Cui H, Wong LJ. Comprehensive one-step molecular analyses of mitochondrial genome by massively parallel sequencing. Clin Chem 2012;58:1322–31.
[55] Liang MH, Wong LJ. Yield of mtDNA mutation analysis in 2000 patients. Am J Med Genet 1998;77:395–400.
[56] Mardis ER. Applying next-generation sequencing to pancreatic cancer treatment. Nat Rev Gastroenterol Hepatol 2012;9:477–86.
[57] Ned R. Genetic testing for CYP450 polymorphisms to predict response to clopidogrel: current evidence and test availability. PLoS Curr 2010;2:RRN1180.
[58] Emons H, Fajgelj A, van der Veen AMH, Watters R. New definitions on reference materials. Accred Qual Assur 2006;10(10):576–8.
[59] Association for Molecular Pathology statement: recommendations for in-house development and operation of molecular diagnostic tests. Am J Clin Pathol 1999;111:449–63.
[60] Chen B, O’Connell CD, Boone DJ, Amos JA, Beck JC, Chan MM, et al. Developing a sustainable process to provide quality control materials for genetic testing. Genet Med 2005;7:534–49.
[61] CLSI. Molecular methods for clinical genetics and oncology testing; approved guideline, third edition. CLSI document MM01-A3. Wayne, PA: Clinical and Laboratory Standards Institute; 2012.
[62] American College of Medical Genetics. Standards and guidelines for clinical genetics laboratories, 2006 edition. Available at: http://www.acmg.net/Pages/ACMG_Activities/stds-2002/g.htm [accessed 24.05.13].
[63] Church DM, Schneider VA, Graves T, Auger K, Cunningham F, Bouk N, et al. Modernizing reference genome assemblies. PLoS Biol 2011;9:e1001091.
[64] National Institute of Standards and Technology, Genome in a Bottle Consortium. Available at: http://www.genomeinabottle.org/ [accessed 27.03.13].
[65] Genetic Testing Reference Materials Coordination Program (GeT-RM). Available at: http://wwwn.cdc.gov/dls/genetics/RMMaterials/ [accessed 27.03.13].
[66] Frampton M, Houlston R. Generation of artificial FASTQ files to evaluate the performance of next-generation sequencing pipelines. PLoS One 2012;7:e49110.
[67] McCarty CA, Chisholm RL, Chute CG, Kullo IJ, Jarvik GP, Larson EB, et al., eMERGE Team. The eMERGE network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics 2011;4:13.
[68] Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med 2013;15(7):565–74.
[69] Bick DP, Dimmock DP. Whole exome and whole genome sequencing. Curr Opin Pediatr 2011;23:594–600.
[70] Link DC, Schuettpelz LG, Shen D, Wang J, Walter MJ, Kulkarni S, et al. Identification of a novel TP53 cancer susceptibility mutation through whole-genome sequencing of a patient with therapy-related AML. JAMA 2011;305:1568–76.
[71] Welch JS, Westervelt P, Ding L, Larson DE, Klco JM, Kulkarni S, et al. Use of whole-genome sequencing to diagnose a cryptic fusion oncogene. JAMA 2011;305:1577–84.
[72] Collins FS, Hamburg MA. First FDA authorization for next-generation sequencer. N Engl J Med 2013;369:2369–71.

List of Acronyms and Abbreviations

ACMG  American College of Medical Genetics and Genomics
CAP LAP  College of American Pathologists, Laboratory Accreditation Program
CDC  Centers for Disease Control and Prevention
CLIA  Clinical Laboratory Improvement Amendments
CLSI  Clinical and Laboratory Standards Institute
CMA  Chromosome microarray
CMS  Centers for Medicare and Medicaid Services
DNA  Deoxyribonucleic acid
FDA  Food and Drug Administration
FFPE  Formalin-fixed paraffin-embedded
GeT-RM  Genetic Testing Reference Material Coordination Program
Indel  Insertions and deletions
ISO  International Organization for Standardization
LDTs  Laboratory-developed tests
LOD  Limit of detection
NGS  Next-generation sequencing
NIST  National Institute of Standards and Technology
PCR  Polymerase chain reaction
Q score  Quality score
QC  Quality control
SNVs  Single nucleotide variations
USA  United States of America


C H A P T E R

22

Regulatory Considerations Related to Clinical Next Generation Sequencing

Shashikant Kulkarni1 and John Pfeifer2

1Department of Pathology and Immunology, Department of Pediatrics, and Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA; 2Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA

O U T L I N E

Introduction  378
Regulatory Standards  378
FDA Oversight of Clinical NGS  379
Total Quality Management: QC  381
  Preanalytic Variables  381
    In Traditional Tests Not Shared by NGS  381
    In Common with Traditional Tests  381
    Unique to NGS  382
  Analytic Variables  382
    Sequencing Platform  383
    Wet-Bench Procedures  383
    Bioinformatic Pipeline  384
  Postanalytic Variables  385
Total Quality Management: QA  386
  Objectives of the QA Program  386
  Proficiency Testing  386
    Sample Exchange Programs  387
    Analyte-Specific Versus Methods-Based PT  387
    Cell Lines  387
    Comprehensive PT Challenges Versus In Silico PT Challenges  388
Conclusion  388
References  389

KEY POINTS

• Next generation sequencing (NGS) is subject to the same regulatory standards as other molecular genetic tests.
• Regulatory oversight of clinical laboratories, and therefore of clinical NGS testing, varies worldwide. Many countries base compliance on standards set by the International Standards Organization (ISO), but in the USA, oversight is federally regulated based on the Clinical Laboratory Improvement Amendments act of 1988 (CLIA ’88).
• In the USA, the Food and Drug Administration (FDA) regulates the manufacture of equipment, devices, and assay reagent kits used in clinical testing, which in the context of NGS includes the sequencing platforms themselves, the kits used for library preparation and specific tests, and the bioinformatics pipelines used to analyze the data.
• Currently, most NGS tests are categorized as laboratory developed tests (LDTs), which are also subject to FDA oversight.
• NGS, like other laboratory tests, has a test cycle that includes a preanalytic phase, an analytic phase, and a postanalytic phase. The quality control (QC) and quality assurance (QA) principles that govern these phases for more traditional molecular genetic tests are also applicable to NGS tests.
• NGS tests are somewhat unique in that the analytic portion of the test consists of three individual components, specifically the sequencing platform itself; the so-called wet-bench procedures involved in DNA library preparation; and the bioinformatics associated with base calling, reference genome alignment, variant identification, variant annotation, and variant interpretation.
• The fact that there are three independent aspects of NGS complicates proficiency testing (PT) of NGS assays. The emphasis to date has been on the development of comprehensive PT surveys that evaluate all three aspects of the tests; to specifically address the bioinformatics component of NGS assays, a novel type of PT, termed in silico PT, has recently been developed.

INTRODUCTION

Next generation sequencing (NGS) is subject to the same regulatory standards as other molecular genetic tests. However, governmental and regulatory agencies have increasingly come to understand that the complexity of testing and the associated bioinformatics pipelines that are intrinsic to massively parallel sequencing require modifications to traditional regulatory paradigms. The range of sequence variants that can be detected in a given assay, including single nucleotide variants (SNVs), insertions and deletions (indels), copy number variants (CNVs), and structural variants (SVs, including such aberrations as translocations and inversions), as well as the large number of genetic loci that can be evaluated in a given assay, have created a need for new regulatory paradigms.

NGS tests, like all other laboratory tests, have a test cycle that includes a preanalytic phase, an analytic phase, and a postanalytic phase. The principles that govern these phases for more traditional molecular genetic tests are also applicable to NGS tests, but again the complexity of NGS methods, and the range of genetic abnormalities that can be detected, have created novel test cycle features that need to be considered. The analytic phase of NGS is of particular interest inasmuch as it consists of three separate, operationally distinct components [1], namely platforms (which, depending on the vendor, require different assay designs to optimize detection of different types of variants); library preparation steps (the so-called wet-bench part of NGS, usually structured around amplification-based or hybrid capture-based assay designs); and bioinformatics pipelines (the so-called dry-bench part of NGS, which must be optimized for each of the four different classes of variants, for the platform from which the data were generated, and for whether the assay is designed to detect germ line variants versus somatically acquired variants). Consequently, quality control (QC) and quality assurance (QA) activities for NGS must address both the wet-bench and dry-bench components of the testing, and given the complexity of NGS, the techniques pose specific challenges for proficiency testing (PT).

Because NGS is a relatively new type of DNA sequence analysis, different laboratories have developed different models for the associated QC and QA activities to meet appropriate regulatory guidelines and best practice standards, although it is not yet clear which of these models will prove most useful in routine clinical testing. This chapter will present the many different approaches that have been developed (including their strengths and weaknesses) and highlight areas of overlap that suggest an emerging consensus as to best practices. The chapter will also discuss those aspects of NGS testing for which there is disagreement as to the most appropriate QC and QA activities, aspects which highlight uncertainty about the most critical parts of NGS in routine clinical practice.

REGULATORY STANDARDS

Regulatory oversight of clinical laboratories, and therefore of clinical NGS testing, varies worldwide. Many countries base compliance on standards set by the International Standards Organization (ISO) [2], but in the USA, oversight is federally regulated based on the Clinical Laboratory Improvement Amendments act of 1988 (CLIA ’88) [3]. CLIA regulations provide minimum standards for laboratories that offer clinical testing (for all parts of the test cycle, including the testing itself, personnel training, PT, QC, and QA), and only licensed laboratories can perform testing on patient samples if the results will be used to guide patient management. Clinical laboratories can demonstrate that they meet the minimum standards for licensure through federal certification by the Centers for Medicare and Medicaid Services (CMS) or via an accreditation program with standards that are deemed comparable to the federal regulatory requirements (e.g., the College of American Pathologists Laboratory Accreditation Program) [4]. A small number of states have regulations comparable to the federal standards and thus are exempt from CLIA regulations [5,6].

In the USA (and in most other countries as well), research NGS testing is not regulated by CLIA. Research testing can be distinguished from clinical testing in several ways. First, clinical NGS testing is performed on a patient specimen to yield information used to direct the medical management of that patient [7,8]. If the information is more likely to provide insight into disease mechanisms or to guide diagnosis and treatment of future patients, then the testing is more likely research (i.e., experimental or investigational). Second, in clinical testing, the findings are conveyed to the requesting medical professional in a formal report, integrated with other laboratory results for the same patient or specimen, rather than as anonymized results deposited in a research database.

The emerging use of NGS approaches in clinical laboratories has led to the development of guidelines to ensure that it is performed to the same rigorous standards as more conventional clinical tests that focus on the analysis of nucleic acids, such as DNA sequence analysis by Sanger methodology, microarray analysis, conventional cytogenetics, and metaphase or interphase FISH. The College of American Pathologists (CAP) has developed a checklist specific for NGS [9]; although the CAP checklist addresses both the technical and bioinformatics components of NGS, it is structured as a series of requirements with little guidance as to how the requirements should be met in routine clinical practice. Similarly, the Next Generation Sequencing-Standardization of Clinical Testing (Nex-StoCT) working group facilitated by the US Centers for Disease Control and Prevention (CDC) recently provided a detailed document covering the validation of clinical NGS tests (precise Nex-StoCT guidelines for validation of the bioinformatics pipeline are currently being finalized) [10], as has the NY State Department of Health [11]; the recommendations from both groups are comprehensive and cover both the laboratory-based and bioinformatics components of NGS, but again there is little detail as to how the requirements should be met in routine clinical practice. Likewise, while the Clinical and Laboratory Standards Institute (CLSI) is expected to offer descriptive guidance on the implementation of clinical NGS (projected for 2014, as documented in MM09-A2) [12], it is not expected to directly address issues such as validation or QC. The US FDA is aware of the need for increased oversight of NGS, particularly of laboratory developed tests (LDTs) [13], but to date has not issued any specific regulatory guidance.

FDA OVERSIGHT OF CLINICAL NGS

In the USA, the Food and Drug Administration (FDA) regulates the manufacture of equipment, devices, and assay reagent kits used in clinical testing, which in the context of NGS includes the sequencing platforms themselves, the kits used for library preparation and specific tests, and the bioinformatics pipelines used to analyze the data. The agency defines tests that are developed and used within the same clinical laboratory, and that are not provided to other clinical laboratories, as LDTs. A laboratory that develops an LDT must inform the medical professional who requested the test, in a statement appended to the test report, that “This test was developed and its performance characteristics determined by {Name of the Laboratory}. It has not been cleared or approved by the US FDA” [14]. Of note, even though specimens may cross state boundaries, the LDT assay itself cannot be distributed beyond the testing laboratory. Because the number of FDA cleared or approved platforms, tests, and bioinformatic pipelines is currently so small, virtually all current clinical NGS assays are by default LDTs, as discussed more fully below.

The FDA traces its authority for LDT oversight to legislation which defines a medical device as an “instrument, apparatus, implement, machine, contrivance, implant, in vitro reagent, or other similar or related article, including the component part, or accessory” that is “intended for use in the diagnosis of disease or other conditions, or in the cure, mitigation, treatment, or prevention of disease in man or other animals” [15–17]. The agency uses a three-tiered, risk-based classification scheme for medical devices, and regulatory control increases from Class I to Class III (Table 22.1). Although the risk the device poses to patients is the major factor used to determine the class to which it is assigned, classification also depends on the device’s intended use and indications for use. Class I includes devices with the lowest risk, and most Class I devices are exempt from 510(k) premarket notification; Class III includes devices with the greatest risk, and most require premarket approval [19,20].

TABLE 22.1 Risk Classification Principles (modified from Ref. [18])

Low risk
Classification principles: Test result is typically used in conjunction with other clinical findings to establish or confirm a diagnosis. No claim is made that the test result alone determines prognosis or direction of therapy. The consequence of an incorrect result or incorrect interpretation is unlikely to lead to serious morbidity/mortality. Most are exempt from premarket notification 510(k).
Oversight: Lab assesses risk. Lab performs validation. Lab places test in service. Accreditor inspects.

Moderate risk
Classification principles: Laboratory may make claims about test results that inform prognosis or direction of therapy. The consequence of an incorrect result or incorrect interpretation may lead to serious morbidity/mortality AND the test methodology is well understood and independently verifiable. Most require premarket notification 510(k).
Oversight: Lab assesses risk. Lab performs validation. Independent desk review of materials before clinical testing. Accreditor inspects.

High risk
Classification principles: Test result predicts risk of, progression of, or patient eligibility for a specific therapy, AND the test uses proprietary algorithms or computations such that the test result cannot be tied to the methods used or interlaboratory comparisons cannot be performed. The consequence of an incorrect result or incorrect interpretation could lead to serious morbidity/mortality AND the test methodology is not well understood or is not independently verifiable. Most require premarket approval.
Oversight: Lab assesses risk. Lab performs validation. FDA review before clinical testing. Accreditor inspects.

Currently, the FDA has cleared the MiSeqDx platform for clinical use, and there is no doubt that the number of FDA cleared or approved platforms and tests will expand in the next several years. In 2010, the FDA announced its intention to begin exercising oversight of LDTs [21]. Among the factors responsible for the new approach [18] are the considerable growth in the volume and types of LDTs; the increasing development of LDTs by biotechnology companies and commercial laboratories, which has led to the evolution of many LDTs from tests primarily performed in a local medical setting (which inherently encourages direct communication between the ordering physician and the laboratory medical staff) to testing performed in national reference labs (with less immediate consultation between the ordering physician and laboratory medical staff); and the increasingly aggressive marketing of LDTs to clinicians, and even to patients through direct-to-consumer advertising, to achieve targeted revenue goals. The FDA’s change in oversight philosophy follows decades of hearings, task force and oversight reports, and action plans that have all emphasized a need for regulatory oversight of this class of laboratory tests [18]. As part of the continuing evolution of regulatory oversight of LDTs, the FDA issued revised guidance in 2013 on the distribution of in vitro diagnostic (IVD) products labeled for research use only or investigational use only [22], and at the same time, professional societies have continued to propose regulations to help guide the FDA toward best methods for ensuring appropriate oversight and validation of molecular diagnostic procedures [23,24]. One viewpoint holds that the CLIA program administered by the CMS is the most appropriate vehicle through which LDTs should be regulated [23], because most LDTs are categorized as high complexity and are therefore subject to rigorous CLIA regulations that are addressed by current laboratory accreditation programs [4,25] or state regulations [5,6] which, in general, are more stringent for LDT assays than for FDA-reviewed assays.

NGS testing is emblematic of the regulatory issues surrounding LDTs. First, the rapid pace of advancements in understanding the genetic basis of disease (whether inherited constitutional diseases in germ line testing, or acquired genetic abnormalities in oncology testing) is driving adoption of NGS. Massively parallel sequencing technologies have been adopted for clinical testing because they provide a rapid, accurate, and cost-effective approach to the sequence analysis of large genetic regions that is increasingly used to stratify patients into different treatment groups. Given the rapid rate at which basic science discoveries can be translated into genetic tests that can be offered in a clinical setting, the demand for clinical testing far outstrips the pace at which clinical laboratories and vendors can complete the regulatory requirements for FDA clearance or approval as an IVD. Thus, at least as far as NGS is concerned, attempts by the FDA (or individual states) to more tightly regulate LDTs will slow the pace of translation of basic science discoveries into clinical practice and thereby limit the improvement in patient outcomes that can be achieved through care guided by the large-scale genetic analysis enabled by NGS technologies.

Second, NGS illustrates the way in which technological advancements can render well-intentioned regulatory approaches obsolete. At its core, the current paradigm of “companion diagnostics” was designed to link a specific test with the indication for use of a specific drug in a particular patient care setting [26], and there is no doubt that this paradigm has provided pharmaceutical companies with an approach to meet FDA regulatory requirements, as well as the potential to offset some of the large costs incurred in the development of new drugs. However, given the rapid pace of discovery, it has become clear that the companion diagnostic model is unresponsive to clinical needs for genetic testing in routine patient care, since a companion diagnostic approved by the FDA for a specific gene, therapy, and disease setting quickly becomes antiquated. As an example, the companion diagnostic test for BRAF V600E analysis was approved for testing of formalin-fixed paraffin-embedded (FFPE) tissue to select melanoma patients for treatment with vemurafenib [27]. Hence, use of the test on fresh tissue, peripheral blood, or tumor types other than melanoma to demonstrate the V600E mutation as a basis for therapy, for example, in hairy cell leukemia and non-small cell lung cancer [28–30] (as is now routinely performed, but based on medical advances after the companion diagnostic was developed), by definition reclassifies the companion diagnostic as an LDT. In this context, it is worth noting that the companion diagnostic model may nonetheless provide opportunities for a broader approach that employs NGS; e.g., a companion diagnostic could be designed to include several genetic regions, from a variety of tissue types, covering all four major classes of genetic variants, a model that would fit nicely with treatment paradigms based on the pattern of genetic abnormalities rather than strictly on anatomic sites of disease.

TOTAL QUALITY MANAGEMENT: QC

As with any laboratory test, the goals of an NGS total quality management program are to ensure that the test results are accurate and reliable and that the service meets the needs of clinicians; to ensure that the performance of the clinical testing is frequently evaluated and approved; and to ensure compliance with the requirement for a continuous QA program that is mandated for clinical laboratory certification by CLIA. QC procedures involve all components of the test cycle, including the preanalytic, analytic, and postanalytic phases, and must be incorporated into each step of the test cycle not only to monitor the performance of the test but also to immediately detect any errors when they occur.

Preanalytic Variables

The preanalytic phase has traditionally been the most difficult for quality management, since most of the preanalytic variables that impact laboratory testing lie outside traditional laboratory boundaries.

In Traditional Tests Not Shared by NGS

Many of the variables that are so important in assuring an accurate result for many traditional laboratory tests have little impact on NGS results. For example, the DNA sequence of loci of interest is not expected to be influenced by recent food ingestion, fasting, or starvation; alcohol or recreational drug use; environmental factors such as ambient temperature, altitude, season, or time of day; or coexisting diseases.

In Common with Traditional Tests

Some aspects of the preanalytic phase are just as prone to error in NGS as in any other laboratory testing method, such as incorrect patient identification. Errors of this type have been shown to occur in about 6% of accessioned cases in routine laboratory testing [31]. In addition, specimen contamination issues can arise when a presumably extraneous tissue contaminant is present in a surgical or cytology specimen (a problem more likely encountered in NGS testing performed on processed tumor tissue). Contamination can be demonstrated in almost 3% of cases on focused review [32]; of particular note, approximately 30% of contaminants were abnormal/neoplastic, and about 10% presented some degree of diagnostic uncertainty. Despite over a century of process improvements, there is little indication that this class of preanalytic errors is substantially diminishing. In fact, recent studies have drawn attention to a previously unrecognized class of specimen acquisition errors, specifically occult provenance errors [33,34].


Unique to NGS

The first factor is time to specimen fixation, and fixation itself [35]. Although it has been demonstrated numerous times that FFPE tissue is an acceptable substrate for NGS testing, the chemical changes in DNA that result from formalin fixation affect the quantity of intact DNA that can be recovered from tissue and have more subtle impacts on the sequence itself because of deamination reactions that result from formalin exposure [36–41]; these changes become more significant the longer the period of exposure to formalin [36]. Similarly, it has been demonstrated that ethanol- and methanol-fixed specimens (e.g., cytology specimens) are also suitable substrates for NGS testing by both amplification-based and hybrid capture-based approaches, although again subtle differences in the quantity of nucleic acids that can be recovered from the specimens are apparent [42,43].

Second, since exposure to acid efficiently hydrolyzes phosphodiester links (and also damages nucleotides, leading to abasic sites) in both DNA and RNA, acid decalcification renders tissue samples unacceptable for NGS analysis [44]. When decalcification is required, calcium chelating agents such as EDTA should be used, since they have no significant impact on nucleic acids.

Specimen size is the third important preanalytic factor. Most amplification-based NGS approaches require a minimum DNA input in the range of 10 ng; while hybrid capture-based approaches have been described that utilize as little as 10 ng of DNA, concerns about library complexity lead most clinical laboratories to require 100–200 ng of DNA as a minimum for NGS testing by this approach.

Fourth, the tissue sample must be appropriate. In the setting of oncology testing, this variable is far more nuanced than the simple requirement that the tissue contain the neoplasm. For analysis of tumor specimens, histopathologic evaluation is required to determine neoplastic cellularity and viability, although it has recently been shown that routine estimation of tumor cellularity may not be a reliable method for evaluating tumor content [45,46]. In addition, there can be a question as to the most appropriate site to sample tumor in a patient with recurrent disease; inasmuch as it has been demonstrated that clonal heterogeneity is often present within the primary tumor, and that tissue metastases often represent divergent tumor clones [47–49], the tumor site sampled is an important preanalytic variable that may impact NGS test results. Finally, in patients who are status post an allogeneic bone marrow transplant, it can be difficult to collect tissue that does not show a level of chimerism high enough to influence the detection and interpretation of variants present at low allele frequencies. In the setting of constitutional testing, the presence of mosaicism may complicate the interpretation of the presence (or absence) of a variant, which is not a trivial issue, since it is clear that a large number of diseases [50], including, for example, NF1 [51], McCune–Albright syndrome [52], and PIK3CA-related segmental overgrowth [53], are characterized by mosaicism.
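The impact of neoplastic cellularity on the signal available to an NGS assay can be illustrated with a simple dilution calculation, assuming a diploid tumor and a heterozygous somatic variant; copy number alterations, which this toy model ignores, shift the expectation further.

    def expected_somatic_vaf(tumor_fraction):
        """Expected variant allele fraction for a heterozygous somatic
        variant when only tumor_fraction of the cells are neoplastic
        (diploid genome assumed)."""
        return 0.5 * tumor_fraction

    # A specimen that is 20% tumor presents the variant in only ~10% of
    # reads, near the LOD of many standard NGS configurations
    for purity in (1.0, 0.5, 0.2):
        print(f"{purity:.0%} tumor -> expected VAF {expected_somatic_vaf(purity):.0%}")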

Analytic Variables

Quality management of the analytic portion of an NGS assay starts with validation of the test [1,54–59]. As discussed in more detail in Chapter 21, test validation in general includes three steps, namely establishment of the analytic sensitivity and analytic specificity of the test, definition of the range of detectable mutations and the limits of detection of the assay, and demonstration of the capability of the NGS test to detect mutations in undiagnosed patients. However, NGS tests are somewhat unique in that the analytic portion of the test itself consists of three individual components, specifically the sequencing platform itself; the so-called wet-bench procedures that are involved in the extraction of nucleic acids and DNA library preparation (which may or may not be associated with amplification-based or hybrid capture-based targeting of specific regions of interest); and the bioinformatics associated with base calling, reference genome alignment, variant identification, variant annotation, and variant interpretation [1]. Because NGS involves these three interconnected components, many of the QC and QA procedures that are traditionally used in DNA sequence analysis are inadequate. Most laboratory tests involve the measurement of continuous variables, and consequently measures of analytical variability and calibration rely on statistical methods to detect laboratory test errors. For example, sophisticated QC approaches based on the Gaussian (normal) distribution have been developed to detect imprecision or systematic bias [60], power function graphs are often used to evaluate the ability of different QC interpretive rules to detect systematic errors [61], Levey–Jennings plots are used to detect variability that cannot be explained by a Gaussian distribution [62,63], and so on; the sketch below illustrates the flavor of such rules.
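For contrast with the sequence-specific approaches discussed next, the following sketch shows conventional continuous-variable QC in miniature, flagging control results by two common Westgard-style rules. It is a simplified illustration, not a complete implementation of the published multirule procedure [63], and the control values are hypothetical:

    def levey_jennings_flags(values, mean, sd):
        """Flag control results with two Westgard-style rules:
        1_3s, one result beyond 3 SD of the mean; 2_2s, two consecutive
        results beyond 2 SD on the same side of the mean."""
        z = [(v - mean) / sd for v in values]
        flags = []
        for i, zi in enumerate(z):
            if abs(zi) > 3:
                flags.append((i, "1_3s"))
            if i and ((z[i - 1] > 2 and zi > 2) or (z[i - 1] < -2 and zi < -2)):
                flags.append((i, "2_2s"))
        return flags

    # Hypothetical daily control results against an established mean/SD
    print(levey_jennings_flags([100.1, 99.8, 106.2, 105.9, 91.0],
                               mean=100.0, sd=2.5))   # [(3, '2_2s'), (4, '1_3s')]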


However, DNA sequencing requires fundamentally different QC approaches to evaluate the integrity of test results, because a DNA sequence is a discontinuous, nominal variable [64]. As discussed in more detail in Chapter 23, it is generally acceptable in this setting to substitute an estimate of the probability of correctness in place of a quantitative estimate of uncertainty, and several methods have been proposed to address this issue [65–68].

Sequencing Platform

Most laboratories "spike in" a small number of DNA molecules of known sequence to check the quality of the data generated by the sequencing platform. For example, on Illumina platforms, bacteriophage PhiX is most commonly used to assess the quality of cluster generation, signal detection, and phasing; the PhiX genome is ideally suited for this use because of its small size and well-balanced base composition. In addition, each individual base call generated from a sequencing run is assigned a PHRED probability score for base call quality (known as a Q score), defined as -10 times the base-10 logarithm of the probability that the instrument made an incorrect base call (a short conversion sketch follows this subsection). Acceptable Q values for an NGS-based test must be determined by each laboratory. In general, the pattern of errors associated with a particular NGS platform is intrinsic to that instrument, relatively constant, and shows little variability as long as the sequencing chemistry and detection approach remain constant. However, error rates do show significant variation depending on the quality of the DNA sample, the DNA library preparation method, indexing and multiplexing approaches, and so on. To monitor for errors in these steps, most laboratories utilize an external QC sample (often referred to as an ExQC) that is subjected to exactly the same laboratory procedures as an individual clinical sample [69]. The use of synthetic DNA fragments as ExQC samples has particular advantages, since they can be designed to incorporate specific sequence variants at known positions and in known allelic ratios, thereby simultaneously evaluating many aspects of not only platform performance but also library preparation and bioinformatics analysis [70].
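The PHRED relationship referenced above is simple to compute. This minimal sketch converts between Q scores and error probabilities; the values shown are illustrative only, since acceptable thresholds must be established by each laboratory:

    import math

    def q_to_p(q):
        """P(error) for a PHRED score: Q = -10 * log10(P), so P = 10**(-Q/10)."""
        return 10 ** (-q / 10)

    def p_to_q(p):
        """PHRED Q score for a given probability of an incorrect base call."""
        return -10 * math.log10(p)

    for q in (10, 20, 30):
        print(f"Q{q}: error probability {q_to_p(q):g}")   # 0.1, 0.01, 0.001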

Wet-Bench Procedures

Laboratories typically utilize a range of quality checkpoints as part of the quality management program for the wet-bench procedures of library preparation, many of which admittedly overlap with quality checkpoints for the platform itself. Many regulatory guidelines include the following metrics, and so they are part of the quality management program common to most laboratories.

ACCURACY

Accuracy is "the closeness of agreement between measured value and the true value" [10]. Although PHRED scores (Q scores) are recommended as the quantitative measure for base call accuracy, overall test accuracy depends more broadly on assay design. Laboratories typically use a lower threshold (termed the minimum base coverage) to determine whether adequate sequence is present within an assay for sensitive and specific detection of variants, and to detect systematic errors. A number of quality measures are used to set this coverage threshold, such as the total number of reads, the percentage of reads mapped to the target region, the percentage of unique reads mapped to the target region, the percentage of bases in a specific region that achieve various coverage thresholds (e.g., 1000×, 400×, or 50×), and the distribution of forward and reverse strand reads (a sketch of such calculations follows below). Some laboratories have developed specific mathematical formulas to quantitate the interaction of multiple variables of this type [69]. When validation studies demonstrate that the required minimum base coverage is not consistently achieved for a particular region, many laboratories use an alternative method (e.g., Sanger sequencing) to "backfill" the problematic loci.
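As a concrete illustration of the coverage metrics just described, the sketch below computes mean depth and the fraction of targeted positions meeting illustrative thresholds. In practice the per-base depths would come from the aligned reads (e.g., the output of a depth-reporting tool); the values here are toy data:

    def coverage_metrics(depths, thresholds=(50, 400, 1000)):
        """Summarize per-base depth over a target region: mean depth and
        the percentage of positions at or above each coverage threshold."""
        n = len(depths)
        summary = {"mean_depth": sum(depths) / n}
        for t in thresholds:
            summary[f"pct_ge_{t}x"] = 100.0 * sum(d >= t for d in depths) / n
        return summary

    # Toy per-base depths across five targeted positions
    print(coverage_metrics([480, 512, 390, 45, 610], thresholds=(50, 400)))
    # {'mean_depth': 407.4, 'pct_ge_50x': 80.0, 'pct_ge_400x': 60.0}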

PRECISION

The agreement between replicate measurements of an analyte, or group of analytes, is referred to as precision [71]. Precision is evaluated by assessing both repeatability (within-run precision) and reproducibility (between-run precision).
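Because sequence results are nominal rather than continuous, precision for an NGS assay is often expressed as concordance between replicate variant call sets rather than as a coefficient of variation. A minimal sketch, with hypothetical calls keyed by chromosome, position, reference, and alternate allele:

    def replicate_concordance(calls_run1, calls_run2):
        """Concordance of variant calls between replicate runs, one simple
        way to express repeatability/reproducibility for a nominal assay."""
        a, b = set(calls_run1), set(calls_run2)
        shared = len(a & b)
        return {"shared": shared,
                "only_run1": len(a - b),
                "only_run2": len(b - a),
                "jaccard": shared / len(a | b) if a | b else 1.0}

    r1 = {("chr7", 55259515, "T", "G"), ("chr17", 7577121, "G", "A")}
    r2 = {("chr7", 55259515, "T", "G")}
    print(replicate_concordance(r1, r2))   # jaccard 0.5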

ANALYTIC SENSITIVITY AND ANALYTIC SPECIFICITY

Analytic (or technical) sensitivity and analytic specificity are particularly troublesome metrics to monitor in clinical NGS testing because the range of sequence variants that can be detected by NGS approaches (e.g., SNVs, CNVs, indels, and SVs) makes it difficult to design comprehensive quality measures. A number of different platforms are usually required for orthogonal validation (e.g., Sanger sequencing, chromosomal microarray analysis, and interphase FISH), but orthogonal testing approaches may lack the sensitivity of NGS performed with optimized bioinformatic pipelines. In addition, it is cumbersome and expensive to accumulate well-characterized reference materials for the full range of variants.
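When an orthogonal truth set is available for a validation sample, analytic sensitivity and specificity are often summarized as positive and negative percent agreement over the assessed positions. A minimal sketch with toy site sets:

    def ppa_npa(test_pos, truth_pos, assessed_sites):
        """Positive/negative percent agreement of an NGS assay versus an
        orthogonal method over a shared set of assessed sites."""
        tp = len(test_pos & truth_pos)
        fn = len(truth_pos - test_pos)
        fp = len(test_pos - truth_pos)
        tn = len(assessed_sites - test_pos - truth_pos)
        return {"PPA": 100.0 * tp / (tp + fn), "NPA": 100.0 * tn / (tn + fp)}

    sites = {f"site{i}" for i in range(100)}    # toy assessed positions
    truth = {"site3", "site11", "site42"}       # orthogonal-method positives
    test = {"site3", "site42", "site77"}        # NGS assay positives
    print(ppa_npa(test, truth, sites))          # PPA ~66.7, NPA ~99.0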


REPORTABLE RANGE AND REFERENCE RANGE

The reportable range is defined by CLIA regulations as "the span of test result values of which the laboratory can establish or verify the accuracy of the instrument or test system measurement response," while the reference range is defined as "the range of test values expected for a designated population of persons" [3]. While it is straightforward to define the reference range for an NGS test as the range of normal sequence variants within a population [1], this definition is problematic because normal variation differs among populations, may be biased by inaccurate disease associations and genotype–phenotype correlations, and is recorded inconsistently in different databases.

SEQUENCE VERIFICATION

Current guidelines recommend the use of confirmatory testing for positive (or unexpected positive) results, especially for tests that have a propensity to produce false positive results [10,72,73]. However, confirmatory testing that verifies only positive results raises the problem of discrepant analysis (also known as discordant analysis or review bias), an approach that is intrinsically flawed because it generates significant overestimates of test performance [74–77]; the toy calculation below illustrates the problem. Thus, a more appropriate QC approach for confirmatory testing (as well as a more appropriate approach for assay validation in the first place) also includes confirmatory testing of negative (i.e., wild type) results to evaluate the frequency of false negative results.
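A small numeric example makes the bias concrete: if only positives are confirmed, false negatives are never counted, so sensitivity appears perfect. The counts below are invented purely for illustration:

    # Invented counts: an assay makes 100 positive calls, of which 95 are
    # confirmed true; confirming a sample of negative (wild type) results
    # would reveal 10 false negatives that positives-only confirmation
    # never sees.
    confirmed_tp, called_pos = 95, 100
    hidden_fn = 10

    ppv = confirmed_tp / called_pos                    # 0.95, visible either way
    sens = confirmed_tp / (confirmed_tp + hidden_fn)   # ~0.905, invisible when
    print(f"PPV {ppv:.1%}, sensitivity {sens:.1%}")    # negatives go unconfirmed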

SPECIMEN PROVENANCE

A comprehensive quality management program also provides checkpoints for specimen provenance issues, including sample switches and sample contamination. Traditional methods of short tandem repeat (STR) analysis can be utilized to address specimen switches and some contamination issues [33,34]. While some groups utilize a set of internal markers that are processed and sequenced along with patient samples [69], this approach will not detect contamination that arises prior to the library preparation step or misidentified reads that arise from index swapping. Consequently, other laboratories have developed bioinformatic approaches that rely on population-based data to identify the presence of reads from different individuals and thereby detect contamination [78,79]; a simplified version of the underlying idea is sketched below. The power of these bioinformatic approaches lies in the fact that they cover both the preanalytic and analytic phases of the test cycle, and since they utilize the actual sequence files from which the genetic variants are identified, they do not incur the added cost of an independent genetic test.
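One intuition behind these population-based methods: at sites where the patient is homozygous, any consistent excess of the other allele beyond sequencing error suggests reads from a second individual. The sketch below is a deliberately crude version of that idea; production tools such as those in Refs. [78,79] use likelihood models over population allele frequencies rather than a simple median, and the factor of 2 is a heuristic assumption (a contaminant heterozygous at an informative site contributes roughly half its DNA fraction as the minor allele):

    def estimate_contamination(minor_fractions):
        """Crude contamination estimate from minor-allele fractions observed
        at sites called homozygous in the patient: a pure sample shows only
        sequencing error at such sites, so a consistent excess suggests DNA
        from a second individual."""
        s = sorted(minor_fractions)
        return 2 * s[len(s) // 2]    # 2 * median, per the heuristic above

    # Toy minor-allele fractions at population-polymorphic homozygous sites
    print(estimate_contamination([0.012, 0.011, 0.013, 0.010, 0.012]))  # ~0.024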

Bioinformatic Pipeline

The professional recommendations and regulatory requirements for clinical NGS also require quality management of the bioinformatic pipelines utilized in testing. These pipelines include software that aligns the individual sequence reads to a reference genome, identifies sequence variants, annotates the variants, determines the likely clinical significance of the variants, and generates a clinical report. Each of these steps must be a component of ongoing laboratory QC and QA. The quality management program varies greatly between laboratories because the bioinformatic pipelines themselves are quite different: some clinical laboratories utilize software supplied by the sequencing platform manufacturers, others employ pipelines licensed from software vendors, and still others rely on software packages developed in-house. However, regardless of the bioinformatics solution, the general components of pipeline QC and QA are the same.

INDEPENDENT EVALUATION OF THE DIFFERENT CLASSES OF MUTATIONS

It is well established that software packages optimized to detect one class of variants in routine clinical use are not necessarily optimized for detecting other classes of variants [36,54,80–83], and that optimized pipelines for constitutional versus somatic analysis differ [65,84]. These differences reflect not only the limitations of the algorithms themselves but also the impact of test design (e.g., amplification-based versus hybrid capture-based target enrichment, depth of coverage, and read length).

VERSIONING AND REVALIDATION

Any time significant changes are introduced into any component of the bioinformatic pipeline (often referred to as reversioning), such as updates of the reference database or upgrades/changes to individual software modules (including changes in default parameter settings), the NGS assay must be revalidated.


The revalidation must include steps to ensure that the changes do not have unanticipated consequences in the components of the bioinformatic pipeline that were not updated. Most NGS laboratories retain a copy of all prior versions of their bioinformatic pipeline for reference, to meet regulatory standards, and for medicolegal reasons; one lightweight way to make versions auditable is sketched below.
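Although no specific implementation is mandated by the guidelines discussed here, one common practice is to emit a manifest of component versions and a checksum of the reference data with every run, so that any reversioning is detectable after the fact. The sketch below is illustrative; the component names, versions, and file path are placeholders:

    import hashlib
    import json

    def file_md5(path, chunk=1 << 20):
        """Checksum a (potentially large) file in chunks."""
        h = hashlib.md5()
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def run_manifest(components, reference_path):
        """Serialize the pipeline configuration used for a run so that any
        later change to a component or reference file is detectable."""
        return json.dumps({"components": components,
                           "reference_md5": file_md5(reference_path)},
                          indent=2, sort_keys=True)

    # Placeholder component names/versions and reference path
    print(run_manifest({"aligner": "bwa 0.7.x", "caller": "gatk 3.x",
                        "annotator": "in-house 2.1"}, "GRCh37.fa"))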

"CLINICAL GRADE" DATABASES AND REFERENCE MATERIALS

The process of annotation is one aspect of the bioinformatic pipeline that requires particularly rigorous QC oversight. Annotation refers to the process by which variants are coded to provide their location, type, and functional consequence in a standardized format. With respect to the nomenclature required to unambiguously indicate variant location and type, standardized annotation formats are still under development. The current lack of standardization makes it difficult to develop databases and registries, report variant data in the patient record, and evaluate the level of proficiency between laboratories. Work groups have been established by the National Human Genome Research Institute (NHGRI), CDC, CAP, and other organizations to develop standards for so-called clinical grade variant annotations [10,85,86]. For example, ClinGen is a National Institutes of Health (NIH)-funded resource dedicated to harnessing both research data and the data from the hundreds of thousands of clinical genetic tests performed each year, as well as supporting expert curation to determine which variants are most relevant to patient care [87]. The annotation of the functional consequence of variants is particularly problematic at present. In the most general sense, the assignment of a variant to a clinical category or specific disease is the most critical point in an NGS test, since it is the point at which a genotype–phenotype correlation is established [88]. Unfortunately, given the current lack of so-called clinical grade databases in the public domain [87,89–92], individual NGS laboratories are forced to perform manual ad hoc investigation of multiple databases that do not contain the same information and that show marked variability in their curation and maintenance [93], a scenario that virtually ensures a lack of reproducibility between laboratories. To address this lack of standardization, approaches for the development of clinical databases have been proposed [87,94,95], and, as mentioned above, working groups have been formed to develop guidelines for so-called clinical grade databases. Although still in its infancy, one such effort, funded and housed at the National Center for Biotechnology Information (NCBI), is currently ongoing; this database of clinical variants, known as ClinVar, provides a freely accessible public archive of reports of the relationships among human variations and phenotypes, with supporting evidence [96]. However, it is clear that without a mechanism for clinical laboratories to share large, rapidly evolving databases of variant classifications, annotation of sequence variants will continue to be inconsistent [96]. The lack of clinical grade databases clearly complicates the design of QC approaches within individual laboratories; until standardized variant databases are developed, each laboratory will need quality measures for internal comparison of the functional consequences reported by the various members of the laboratory for specific variants, as well as quality measures that compare those functional annotations with those of outside laboratories.
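To make the idea of a standardized annotation concrete, the sketch below defines one possible record layout tying location and type (on a stated reference build, with an HGVS-style description) to an asserted classification and its provenance. The field set is illustrative, not a published standard; the example variant is the well-known BRAF V600E:

    from dataclasses import dataclass, asdict

    @dataclass
    class VariantAnnotation:
        """One illustrative annotation record: standardized location/type
        plus the asserted significance and the evidence behind it."""
        chrom: str
        pos: int             # 1-based position on the stated build
        ref: str
        alt: str
        build: str           # e.g., "GRCh37"
        hgvs_c: str          # coding-level HGVS description
        classification: str  # e.g., "pathogenic", "VUS"
        evidence_source: str # database/version/date supporting the call

    v = VariantAnnotation("7", 140453136, "A", "T", "GRCh37",
                          "NM_004333.4:c.1799T>A", "pathogenic",
                          "curated database accession, version, and date")
    print(asdict(v))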

Postanalytic Variables

The postanalytic phase of NGS testing requires interpretation and reporting by appropriately qualified personnel, which in most laboratories means laboratory professionals with board certification in molecular genetic testing from the American Board of Medical Genetics and Genomics (ABMGG) or in molecular genetic pathology from the American Board of Pathology. From a regulatory perspective, reporting and archiving of results must follow the general laboratory requirements that apply to other clinical tests, which must be compliant with CLIA and the Health Insurance Portability and Accountability Act of 1996 (HIPAA), as well as state requirements [3,97]. Although professional organizations have recommended more specific guidelines for reporting constitutional variants [90,98], no consensus has emerged; several work groups are currently addressing guidelines for reporting somatic variants. Standard guidelines recommend that NGS test results be retained (while CLIA regulations require retention of the clinical report itself for at least 2 years, current recommendations for molecular genetic tests are that reports and related data files be retained for at least 25 years) [1,3,96], not only as part of good laboratory practice for archiving but also so that the data are available for reanalysis or reinterpretation should new information become available that may alter the test interpretation. However, the cost involved in retention of sequence files is nontrivial, especially as the complexity of test panels increases. Similarly, it is not clear how often and under what circumstances review should be performed, inasmuch as there is currently no regulatory governance on this issue; it is likewise unclear who will pay for reanalysis.


TOTAL QUALITY MANAGEMENT: QA

Objectives of the QA Program

QA programs are designed to evaluate and monitor, systematically and objectively, the appropriateness and quality of test results. In the USA, an active QA program is mandated by CLIA '88; there are similar requirements in Europe (e.g., the United Kingdom National External Quality Assessment Service). Good laboratory practice mandates that the QA program address every aspect of the test cycle, including the preanalytic, analytic, and postanalytic processes, and include written policies and documentation for education and training of personnel, continuing medical education, internal and external inspections, PT, and corrective actions for all identified deficiencies. Although the QC program is an integral part of the overall QA enterprise, the QA program is focused on those aspects of a clinical test that do not directly bear on the analytical validity of the test result, and it is thus separate from the QC program. Among the metrics that a comprehensive QA program should monitor are turnaround times, review of normal and abnormal test results, review of specimen rejection criteria, and participation in PT programs. One of the important attributes of QA activities is the concept of independence of review, which builds quality improvements into laboratory activities through empowered analysis of performance standards, identification of outliers, and problem solving [99] (Table 22.2).

Proficiency Testing

CLIA '88 mandates PT for external quality assessment (EQA) as part of the laboratory accreditation process [100–102], although the precise rules and regulations that govern PT continue to evolve.

TABLE 22.2  NGS Testing Processes and Their Potential Errors

Process                   Potential errors
PREANALYTIC
  Test ordering           Specimen identification; inappropriate test; special requirements not specified; delayed order; transport conditions
  Specimen                Specimen identification; inadequate specimen (incorrect tube, container, or size); specimen type (germ line or tumor); improper fixation
ANALYTIC
  Specimen                Specimen identification; inadequate DNA (or RNA)
  Platform                Instrument bias
  Library preparation     Technical error; methodological bias; confirmatory testing bias
  Bioinformatic pipeline  Appropriateness for different classes of mutations (SNVs, indels, CNVs, SVs); reference databases
POSTANALYTIC
  Test reporting          Patient identification; delay in reporting; transcriptional/typographical errors
  Test interpretation     Specificity/sensitivity of test not considered; appropriateness of test not understood; previous values not available for comparison; clinical setting not understood; reference databases

Modified from Ref. [99].


Numerous organizations are accredited by CLIA to provide PT programs, the largest of which is CAP; as noted above, some states also sponsor programs [5,6,103]. However, there are many analytes for which EQA surveys are not available, and in this setting laboratories must implement so-called alternative PT assessment procedures. These alternative assessment procedures include split sample analysis with other laboratories, assessment of split samples with an established in-house method on previously assayed material, exchange of blind samples with another laboratory, retesting of de-identified patient samples, analysis of DNA from cell lines with previously determined genotypes, or analysis of synthetic DNA of known genotype [104,105].

Sample Exchange Programs

Several professional societies, such as the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP), sponsor laboratory sample exchange programs. It is worth mentioning that alternative assessment programs, while useful surrogates for participation in an EQA, have some shortcomings. Retesting of samples within one laboratory is unlikely to identify systematic errors, and sample exchanges within a small number of laboratories may likewise fail to identify systematic errors or biases. Similarly, sample exchanges within a limited group may be insufficient to detect significant performance differences between methods and, when discrepancies do arise, may be insufficient to identify the source of the discrepancies. Finally, anonymity may be difficult to maintain when sample exchanges involve only a small number of laboratories.

Analyte-Specific Versus Methods-Based PT

Many PT programs are based on an individual analyte and are appropriately termed analyte-specific or disease-specific PT programs. The utility of disease-specific approaches for DNA analysis has been well documented [106–108], and it is well established that laboratories that do not perform disease-specific surveys have more errors than laboratories that do [109]. However, given the number of genes that are routinely evaluated in clinical practice by NGS-based approaches, and the range of mutations for which NGS testing is performed, it is virtually impossible for laboratories to follow an analyte-specific PT approach in routine clinical practice. For this reason, so-called methods-based proficiency testing (MBPT) paradigms have been developed that are centered on the method of analysis rather than the specific analyte being tested [102,110]. The Centers for Medicare and Medicaid Services understands that analyte-specific PT is not possible for the many different mutations that can occur in the many different genes targeted by NGS assays, and it therefore supports the concept of MBPT. While the strengths of an analyte-specific PT approach include assessment of a laboratory's competence at identification and interpretation of specific mutations, and analyte-specific approaches can be designed to replicate clinical samples very closely so as to assess performance of an entire test, MBPT has some distinct advantages. First, MBPT is scalable, in that the approach makes it possible to provide comparisons between laboratories for dozens (if not hundreds or thousands) of genes analyzed by very complex methods such as NGS.
MBPT approaches also make it possible to evaluate proficiency in the detection of a wide range of variants, rather than one specific mutation type; similarly, laboratories that participate in MBPT challenges are not penalized for the inability to detect a sequence variant that lies in a region outside the scope of their validated test, or a type of sequence variant that is not validated within their NGS approach. The MBPT approach has been endorsed by CAP and the ACMG [102,111], but it is important to recognize that, at present, MBPT developed internally by an individual laboratory cannot be used for PT; MBPT is acceptable only when administered by an external PT provider.

Cell Lines

Within an individual laboratory, alternative assessment approaches can employ synthetic DNA with a known mutational profile, DNA from patient samples with known variants, or cell lines that are genetically well characterized. Because cell lines are an inexhaustible reagent, and because FFPE cell blocks can easily be produced from them, they are a particularly useful source of reference material for PT (as well as for assay validation, especially characterization of assay sensitivity and limit of detection, and for ongoing QC and QA, especially comparison of assay performance between platforms and test kits). The CDC's genetic reference material coordination program (GeT-RM) has developed several well-characterized cell lines for variants specific to many genetic conditions [112]. In addition, NGS-specific reference materials have been developed by the GeT-RM program and by NIST (as detailed in Chapter 23). Several commercial vendors [113,114] and professional organizations (e.g., CAP) incorporate cell lines into the PT materials they offer for NGS.


Comprehensive PT Challenges Versus In Silico PT Challenges

As discussed above, NGS is a unique testing paradigm in that it involves a sequencing platform, wet-bench protocols, and bioinformatic analysis of the sequence reads, and the fact that there are three independent aspects of NGS complicates surveys designed for PT of NGS assays, whether via an analyte-specific or a methods-based paradigm. The emphasis to date has been on the development of comprehensive PT surveys that evaluate all three aspects of an NGS test based on well-characterized genomic DNA samples. However, as discussed above, a recurring theme in clinical NGS testing is that bioinformatic pipelines are not standardized across laboratories. The pipelines show marked variability in sensitivity and specificity of detection within a given class of variants, and between the four major classes of variants, and thus the bioinformatic analysis of NGS sequence reads is a major source of variability between laboratories performing the same NGS test on the same analyte. To specifically address this issue, a novel type of MBPT focused solely on the bioinformatics component of NGS tests has recently been developed: so-called in silico proficiency testing [102,105]. By this approach, the actual sequence files from NGS of a well-characterized specimen are manipulated by computerized algorithms that introduce relevant sequence variants into the reference sequence files [115]; actual specimen sequence files are usually used as the substrate for this computerized manipulation because it is difficult to mimic the heterogeneity of actual biologic specimens, and the distribution of actual sequence reads, in an entirely de novo generated file. The resulting in silico data files (also referred to as simulated data files; a toy illustration of the idea follows this section) are an ideal vehicle for MBPT of bioinformatic pipelines for several reasons. First, they challenge every step in the bioinformatics pipeline from alignment through variant detection, annotation, and interpretation. Second, in silico data files can be designed for all four major classes of variants, either alone or in combination. Third, simulated data files can be developed for any genetic locus, either alone or in combination. Fourth, the variant allele frequency can be manipulated to test the sensitivity and specificity of variant detection. Fifth, different variants can be mixed within the same in silico data file, creating complex mixtures of variants that mimic the complexity of clinical samples. The in silico approach is ideally suited for PT of bioinformatic pipelines used to identify germ line variants in constitutional disease testing as well as acquired mutations in oncology testing. It is also well suited to mixed-genotype samples, such as occur in mosaicism or mitochondrial heteroplasmy, and the approach is scalable to the genetic complexity encountered in clinical testing, which would be far too cumbersome and expensive to achieve via MBPT utilizing synthetic oligonucleotides, plasmid-based DNA preparations, or customized cell lines. Finally, since in silico data files can be produced from any file format, the method can be used in MBPT of the sequence files produced by any of the sequencing platforms in clinical use.
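The following sketch caricatures the core manipulation: flipping a base in a chosen fraction of reads to simulate a variant at a target allele fraction. Real in silico PT tools operate on aligned read files and preserve base qualities and mapping context [115]; everything here (sequences, coordinates) is toy data:

    import random

    def spike_snv_into_reads(reads, offset, alt_base, target_vaf, rng=random):
        """Flip the base at `offset` to `alt_base` in roughly `target_vaf`
        of reads, a cartoon of simulated-data-file generation."""
        mutated = []
        for seq in reads:
            if rng.random() < target_vaf:
                seq = seq[:offset] + alt_base + seq[offset + 1:]
            mutated.append(seq)
        return mutated

    reads = ["ACGTACGTAC"] * 8
    print(spike_snv_into_reads(reads, 4, "T", target_vaf=0.25,
                               rng=random.Random(1)))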

CONCLUSION

The rapid pace of advancement in understanding the genetic basis of disease is driving the adoption of NGS. Massively parallel sequencing technologies have been adopted for clinical testing because they provide a rapid, accurate, and cost-effective approach to the sequence analysis of large genetic regions, which is increasingly used to stratify patients into different treatment groups. However, because NGS is a new type of DNA sequence analysis that has only recently been introduced into clinical laboratories for the analysis of tissue specimens to guide patient care, regulatory paradigms are still evolving. Since novel techniques and assays within the NGS landscape continue to emerge (e.g., amplification-based versus hybrid capture-based methods; tests for constitutional versus somatically acquired mutations; analysis of SNVs, indels, CNVs, and SVs versus only a subset of variant classes), it is not surprising that state and federal rules governing clinical NGS (especially those promulgated by the FDA) are frequently revised. In this changing regulatory environment, different laboratories have developed different quality management models for the associated QC and QA activities, to ensure that NGS is performed to the same rigorous standards as more conventional clinical tests that focus on the analysis of nucleic acids, such as DNA sequence analysis by Sanger methodology, microarray analysis, conventional cytogenetics, and metaphase or interphase FISH. However, NGS tests are somewhat unique in that the analytic portion of the test consists of three individual components (the sequencing platform itself, the so-called wet-bench procedures, and bioinformatics), and so many of the QC and QA procedures traditionally used in DNA sequence analysis, including approaches for PT, are inadequate. Fortunately, a variety of expert panels from professional organizations and governmental agencies are actively working to address these issues, and the guidelines they provide should help standardize NGS performed in patient care settings.


References

[1] Gargis AS, Kalman L, Berry MW, et al. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat Biotechnol 2012;30:1033–6.
[2] ISO/IEC 15189. Medical laboratories – particular requirements for quality and competence. Geneva: ISO; 2007.
[3] Centers for Medicare & Medicaid Services, Centers for Disease Control and Prevention. 42 CFR Part 493. Medicare, Medicaid, and CLIA programs; laboratory requirements relating to quality systems and certain personnel qualifications. Final Rule: 3640–3714, <http://www.gpo.gov/fdsys/browse/collectionCfr.action?collectionCode=CFR>; 2003.
[4] College of American Pathologists Laboratory Accreditation Program, <http://www.cap.org/apps/cap.portal>; 2012.
[5] New York State Department of Health (2012) Clinical laboratory evaluation program, laboratory standards, <http://www.wadsworth.org/labcert/clep/standards.htm>; 2008.
[6] Washington State Office of Laboratory Quality Assurance, <http://www.doh.wa.gov/LicensesPermitsandCertificates/FacilitiesNewReneworUpdate/LaboratoryQualityAssurance>; 2012.
[7] Federal Food, Drug, and Cosmetic Act (FD&C Act), <http://www.fda.gov/regulatoryinformation/legislation/federalfooddrugandcosmeticactfdcact/default.htm>; 2012.
[8] Department of Health and Human Services, Centers for Medicare and Medicaid Services. Clinical laboratory improvement amendments of 1988: final rule. Federal Register 1992;7170 [42CFR493.1265], 7176 [42CFR493.1445(3)(51)].
[9] College of American Pathologists. Molecular pathology checklist. Northfield, IL: College of American Pathologists; 2012.
[10] CLSI. Molecular methods for clinical genetics and oncology testing; approved guideline. 3rd ed. Wayne, PA: Clinical and Laboratory Standards Institute; 2012. [CLSI document MM01-A3].
[11] <http://www.wadsworth.org/labcert/TestApproval/forms/NextGenSeq_ONCO_Guidelines.pdf>.
[12] National Committee for Clinical Laboratory Standards (NCCLS). CLSI (NCCLS) document MM09-A. Nucleic acid sequencing methods in diagnostic laboratory medicine: approved guideline. Wayne, PA: Clinical and Laboratory Standards Institute; 2004.
[13] <http://www.fda.gov/MedicalDevices/NewsEvents/WorkshopsConferences/ucm255327.htm>.
[14] Department of Health and Human Services, Food and Drug Administration. Medical devices; classification/reclassification; restricted devices; analyte specific reagents. Final rule. Federal Register 1997;November 21:62243–5. [21CFR809, 21CFR864].
[15] The Medical Device Amendments of 1976 (MDA). 21 USC §360c et seq.; 1976.
[16] Federal Food, Drug and Cosmetic Act of 1994 (FD&C), 21 USC 301–395; 1994.
[17] Javitt GH. In search of a coherent framework: options for FDA oversight of genetic tests. Food Drug Law J 2007;62:617–52.
[18] Vance GH. College of American Pathologists proposal for the oversight of laboratory-developed tests. Arch Pathol Lab Med 2011;135:1432.
[19] Wagner JK. Understanding FDA regulation of DTC genetic tests within the context of administrative law. Am J Hum Genet 2010;87:451–6.
[20] Yustein A. The FDA's process of regulatory premarket review for new medical devices. Gastroenterol Hepatol Ann Rev 2006;1:1424.
[21] US Food and Drug Administration. FDA/CDRH public meeting: oversight of laboratory developed tests (LDTs), <http://www.fda.gov/medicaldevices/newsevents/workshopsconferences/ucm212830.htm>.
[22] <http://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm[insertspecificnumber].htm>.
[23] Ferreira-Gonzalez A, Emmadi R, Day SP, et al. Revisiting oversight and regulation of molecular-based laboratory-developed tests: a position statement of the Association for Molecular Pathology. J Mol Diagn 2014;16:3–6.
[24] Association for Molecular Pathology. Recommendations for in-house development and operation of molecular diagnostic tests. Am J Clin Pathol 1999;111:449–63.
[25] Aziz N, Qin Z, Bry L, et al. College of American Pathologists' laboratory standards for next generation sequencing clinical tests, <http://dx.doi.org/10.5858/arpa.2014-0250-CP>.
[26] <http://www.fda.gov/medicaldevices/deviceregulationandguidance/guidancedocuments/ucm262292.htm>.
[27] <http://www.accessdata.fda.gov/cdrh_docs/pdf11/p110020a.pdf>.
[28] <http://www.gsk.com/media/press-releases/2014/tafinlar--receives-fda-breakthrough-therapy-designation-for-non-.html>.
[29] Dietrich S, Glimm H, Andrulis M, et al. BRAF inhibition in refractory hairy-cell leukemia. N Engl J Med 2012;366:2038–40.
[30] Samuel J, Macip S, Dyer MJS. Efficacy of vemurafenib in hairy-cell leukemia. N Engl J Med 2014;370:286–8.
[31] Nakhleh RE, Zarbo RJ. Surgical pathology specimen identification and accessioning: a College of American Pathologists Q-Probes study of 1 004 115 cases from 417 institutions. Arch Pathol Lab Med 1996;120:227–33.
[32] Gephardt GN, Zarbo RJ. Extraneous tissue in surgical pathology: a College of American Pathologists Q-Probes study of 275 laboratories. Arch Pathol Lab Med 1996;120:1009–14.
[33] Pfeifer JD, Liu J. Rate of occult specimen provenance complications in routine clinical practice. Am J Clin Pathol 2013;139:93–100.
[34] Pfeifer JD, Payton J, Zehnbauer BA. The changing spectrum of DNA-based specimen provenance testing in surgical pathology. Am J Clin Pathol 2011;135:132–8.
[35] Ransohoff DF, Gourlay ML. Sources of bias in specimens for research about molecular markers for cancer. J Clin Oncol 2010;28:698–704.
[36] Spencer DH, Sehn JK, Abel HJ, et al. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J Mol Diagn 2013;15:623–33.
[37] Auerbach C, Moutschen-Dahmen M, Moutschen J. Genetic and cytogenetical effects of formaldehyde and related compounds. Mutat Res 1977;39:317–61.
[38] Feldman MY. Reactions of nucleic acids and nucleoproteins with formaldehyde. Prog Nucleic Acid Res Mol Biol 1973;13:1–49.
[39] Karlsen F, Kalantari M, Chitemerere M, et al. Modifications of human and viral deoxyribonucleic acid by formaldehyde fixation. Lab Invest 1994;71:604–11.
[40] Loudig O, Brandwein-Gensler M, Kim RS, et al. Illumina whole-genome complementary DNA-mediated annealing, selection, extension and ligation platform: assessing its performance in formalin-fixed, paraffin-embedded samples and identifying invasion pattern-related genes in oral squamous cell carcinoma. Hum Pathol 2011;42:1911–22.


[41] Kerick M, Isau M, Timmermann B, et al. Targeted high throughput sequencing in clinical cancer settings: formaldehyde fixed-paraffin embedded (FFPE) tumor tissues, input amount and tumor heterogeneity. BMC Med Genomics 2011;4:68.
[42] Karnes H, Duncavage ED, Bernadt CT. Targeted next-generation sequencing using fine-needle aspirates from adenocarcinomas of the lung. Cancer Cytopathol 2014;122:104–13.
[43] Kanagal-Shamanna R, Portier BP, Singh RR, et al. Next-generation sequencing-based multi-gene mutation profiling of solid tumors using fine needle aspiration samples: promises and challenges for routine clinical diagnostics. Mod Pathol 2013;27:314–27.
[44] Williams NH. DNA hydrolysis: mechanism and reactivity. In: Zenkova MA, editor. Nucleic acids and molecular biology. Berlin: Springer-Verlag; 2004. p. 3–18.
[45] Smits AJ, Kummer JA, de Bruin PC, et al. The estimation of tumor cell percentage for molecular testing by pathologists is not accurate. Mod Pathol 2014;27:168–74.
[46] Viray H, Li K, Long T, et al. A prospective, multi-institutional diagnostic trial to determine pathologist accuracy in estimation of percentage of malignant cells. Arch Pathol Lab Med 2013;137:1545–9.
[47] Renovanz M, Kim EL. Intratumoral heterogeneity, its contribution to therapy resistance and methodological caveats to assessment. Front Oncol 2014;4:142.
[48] Gerlinger M, Rowan A, Horswell S, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med 2012;366:883–92.
[49] Yachida S, Jones S, Bozic I, et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature 2010;467:1114–7.
[50] Biesecker LG, Spinner NB. A genomic view of mosaicism and human disease. Nat Rev Genet 2013;14:307–20.
[51] Kehrer-Sawatzki H, Cooper DN. Mosaicism in sporadic neurofibromatosis type 1: variations on a theme common to other hereditary cancer syndromes? J Med Genet 2008;45:622–31.
[52] Narumi S, Matsuo K, Ishii T, et al. Quantitative and sensitive detection of GNAS mutations causing McCune–Albright syndrome with next generation sequencing. PLoS One 2013;8:e60525.
[53] Kurek KC, Luks VL, Ayturk UM, et al. Somatic mosaic activating mutations in PIK3CA cause CLOVES syndrome. Am J Hum Genet 2012;90:1108–15.
[54] Pritchard CC, Salipante SJ, Koehler K, et al. Validation and implementation of targeted capture and sequencing for the detection of actionable mutation, copy number variation, and gene rearrangement in clinical cancer specimens. J Mol Diagn 2014;16:56–67.
[55] Singh RR, Patel KP, Routbort MJ, et al. Clinical validation of a next-generation sequencing screen for mutational hotspots in 46 cancer-related genes. J Mol Diagn 2013;15:607–22.
[56] Lin MT, Mosier SL, Tiess M, et al. Clinical validation of KRAS, BRAF, and EGFR mutation detection using next-generation sequencing. Am J Clin Pathol 2014;141:856–66.
[57] Frampton GM, Fichtenholtz A, Otto GA. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat Biotechnol 2013;31:1023–31.
[58] Cottrell CE, Al-Kateb H, Bredemeyer AJ, et al. Validation of a next-generation sequencing assay for clinical molecular oncology. J Mol Diagn 2014;16:89–105.
[59] Rehm HL, Bale SJ, Bayrak-Toydemir P, et al. ACMG clinical laboratory standards for next-generation sequencing. Genet Med 2013;15:733–47.
[60] Ryan TP. Statistical methods for quality control. New York, NY: Wiley; 1989.
[61] Westgard JO, Groth T. Power functions for statistical control rules. Clin Chem 1979;25:863–9.
[62] Levey S, Jennings ER. The use of control charts in the clinical laboratory. Am J Clin Pathol 1950;20:1059–66.
[63] Westgard JO, Barry PL, Hunt MR. A multi-rule Shewhart chart for quality control in clinical chemistry. Clin Chem 1981;27:493–501.
[64] Pyzdek T. What every engineer should know about quality control. New York, NY: Marcel Dekker, Inc.; 1989.
[65] DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43:491–8.
[66] McKenna A, Hanna M, Banks E, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20:1297–303.
[67] Danecek P, Auton A, Abecasis G, et al. The variant call format and VCFtools. Bioinformatics 2011;27:2156–8.
[68] Ajay SS, Parker SCJ, Abaan HO, et al. Accurate and comprehensive sequencing of personal genomes. Genome Res 2011;21(9):1498–505.
[69] Zhang W, Cui H, Wong LJ. Comprehensive one-step molecular analyses of mitochondrial genome by massively parallel sequencing. Clin Chem 2012;58:1322–31.
[70] Zook JM, Samarov D, McDaniel J, Sen SK, et al. Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing. PLoS One 2012;7:10.
[71] American College of Medical Genetics Policies and Standards, <www.acmg.net>; 2012.
[72] American College of Medical Genetics. ACMG standards and guidelines for clinical genetic laboratories, <http://www.acmg.net/AM/Template.cfm?Section=Laboratory_Standards_and_Guidelines&Template=/CM/HTML>; 2008.
[73] NCCLS. Nucleic acid sequencing methods in diagnostic laboratory medicine; approved guideline. NCCLS document MM9-A [ISBN 1-56238-558-5]. Wayne, PA: NCCLS; 2004.
[74] Lipman HB, Astles JR. Quantifying the bias associated with use of discrepant analysis. Clin Chem 1998;44:10815.
[75] Hadgu A. The discrepancy in discrepant analysis. Lancet 1996;348:592–3.
[76] Hadgu A. Discrepant analysis is an inappropriate and unscientific method. J Clin Microbiol 2000;38:4301–2.
[77] Miller WC. Bias in discrepant analysis: when two wrongs don't make a right. J Clin Epidemiol 1998;51:219–31.
[78] Jun G, Flickinger M, Hetrick KN, et al. Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. Am J Hum Genet 2012;91:839–48.
[79] Sehn JK, Spencer DH, Duncavage ED, et al. Human specimen admixture in clinical next generation sequencing data [submitted].


[80] Spencer DH, Tyagi M, Vallania F, et al. Performance of common analysis methods for detecting low-frequency single nucleotide variants in targeted next-generation sequence data. J Mol Diagn 2014;16:75–88.
[81] Mardis ER. The $1,000 genome, the $100,000 analysis? Genome Med 2010;2:84.
[82] Spencer DH, Abel HJ, Lockwood CM, et al. Detection of FLT3 internal tandem duplication in targeted, short-read-length, next-generation sequencing data. J Mol Diagn 2013;15:81–93.
[83] Sharma MK, Phillips J, Agarwal S, et al. Clinical genomicist workstation. AMIA Jt Summits Transl Sci Proc 2013;2013:1567.
[84] Li H, Handsaker B, Wysoker A, et al.; 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9.
[85] Lubin IM, Aziz N, Babb L, et al. The clinical next-generation sequencing variant file: advances, opportunities, challenges for the clinical laboratory [submitted].
[86] Ramos EM, Din-Lovinescu C, Berg JS, et al. Characterizing genetic variants for clinical action. Am J Med Genet C Semin Med Genet 2014;166C:93–104.
[87] <http://www.iccg.org/about-the-iccg/clingen/>.
[88] Sboner A, Mu XJ, Greenbaum D, et al. The real cost of sequencing: higher than you think! Genome Biol 2011;12:125.
[89] Ogino S, Gulley ML, den Dunnen JT, et al. Standard mutation nomenclature in molecular diagnostics: practical and educational challenges. J Mol Diagn 2007;9:1–6.
[90] Richards CS, Bale S, Bellissimo DB, et al. ACMG recommendations for standards for interpretation and reporting of sequence variations: revisions 2007. Genet Med 2008;10:294–300.
[91] Gulley ML, Braziel RM, Halling KC, et al. Clinical laboratory reports in molecular pathology. Arch Pathol Lab Med 2007;131:852–63.
[92] Stanley CM, Sunyaev SR, Greenblatt MS, et al. Clinically relevant variants – identifying, collecting, interpreting, and disseminating: the 2013 annual scientific meeting of the Human Genome Variation Society. Hum Mutat 2014;35:505–10.
[93] Soussi T. Locus-specific databases in cancer: what future in a post-genomic era? The TP53 LSDB paradigm. Hum Mutat 2014;35:643–53.
[94] Eggington JM, Bowles KR, Moyes K, et al. A comprehensive laboratory-based program for classification of variants of uncertain significance in hereditary cancer genes. Clin Genet 2014;86(3):229–37.
[95] Kenna KP, McLaughlin RL, Hardiman O, et al. Using reference databases of genetic variation to evaluate the potential pathogenicity of candidate disease variants. Hum Mutat 2013;34:836–41.
[96] <http://www.ncbi.nlm.nih.gov/clinvar/>.
[97] Chen B, Gagnon M, Shahangian S, Anderson NL, Howerton DA, Boone JD. Good laboratory practices for molecular genetic testing for heritable diseases and conditions. MMWR Recomm Rep 2009;58(RR-6):1–37.
[98] <http://www.amp.org/Webinars/2014.cfm>.
[99] Westgard JO, Klee GG. Quality management. In: Burtis CA, Ashwood ER, Bruns DE, editors. Tietz textbook of clinical chemistry and molecular diagnostics. 4th ed. St. Louis, MO: Elsevier; 2006. p. 485–529 [chapter 19].
[100] Public Law 100–578. Clinical Laboratory Improvement Amendments of 1988. Stat 42 USC 201, HR 5471; October 31, 1988.
[101] US Department of Health and Human Services. Clinical laboratory improvement amendments of 1988: final rules and notice. 42 CFR Part 493. Federal Register 1992;57:7188288.
[102] Schrijver I, Aziz N, Jennings LJ, et al. Methods-based proficiency testing in molecular genetic pathology. J Mol Diagn 2014;16:283–7.
[103] Rej R, Norton C. External assessment of laboratory cholesterol measurements using patient specimens. Clin Chem 1989;35:1069.
[104] CLSI. Assessment of laboratory tests when proficiency testing is not available; approved guideline. 2nd ed. [CLSI document GP29-A2]. Wayne, PA: Clinical and Laboratory Standards Institute; 2008.
[105] Kalman LV, Lubin IM, Barker S, et al. Current landscape and new paradigms of proficiency testing and external quality assessment for molecular genetics. Arch Pathol Lab Med 2013;137:983–8.
[106] Palomaki GE, Richards CE. Assessing the analytic validity of molecular testing for Huntington disease using data from an external proficiency testing survey. Genet Med 2012;14:69–75.
[107] Weck KE, Zehnbauer B, Datto M, et al. Molecular genetic testing for fragile X syndrome: laboratory performance on the College of American Pathologists proficiency surveys (2001–2009). Genet Med 2012;14:306–12.
[108] Feldman GL, Schrijver I, Lyon E, et al. Results of the College of American Pathology/American College of Medical Genetics and Genomics external proficiency testing from 2006 to 2013 for three conditions prevalent in the Ashkenazi Jewish population. Genet Med 2014;1.
[109] Hudson KL, Murphy JA, Kaufman DJ, et al. Oversight of US genetic testing laboratories. Nat Biotechnol 2006;24:1083–90.
[110] Richards CS, Palomaki GE, Lacbawan FL, et al. Three-year experience of a CAP/ACMG methods-based external proficiency testing program for laboratories offering DNA sequencing for rare inherited disorders. Genet Med 2014;16:25–32.
[111] Maddalena A, Bale S, Das S, et al. Technical standards and guidelines: molecular genetic testing for ultra-rare disorders. Genet Med 2005;7:571–83.
[112] <http://wwwn.cdc.gov/clia/Resources/GetRM/default.aspx>.
[113] <http://www.horizondx.com/>.
[114] <http://www.lgcstandards.com/epages/LGC.sf/en_GB/?ObjectPath=/Shops/LGC/Categories/Proficiency_testing_information>.
[115] Frampton M, Houlston R. Generation of artificial FASTQ files to evaluate the performance of next generation sequencing pipelines. PLoS One 2012;7:e49110.


CHAPTER 23

Genomic Reference Materials for Clinical Applications

Justin Zook and Marc Salit

Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA

OUTLINE

Introduction
  Challenges in Developing a Whole Genome Reference Material
Genome in a Bottle Consortium
  Reference Material Selection and Design
  Reference Material Characterization
  Bioinformatics, Data Integration, and Data Representation
    Data Representation
    Performance Metrics and Figures of Merit
Reference Data
Other Reference Materials for Genome-Scale Measurements
  Microbial Genome RMs
  Gene Expression RMs
Conclusion
References

INTRODUCTION

Reference materials (RMs) are frequently used in clinical laboratories to understand accuracy or calibrate instruments. An RM is defined as a "material, sufficiently homogeneous and stable with respect to one or more specified properties, which has been established to be fit for its intended use in a measurement process." Because RMs are homogeneous and stable, they can be used to compare performance in different laboratories at different times. Standard Reference Materials® (SRMs), which are Certified RMs produced by the National Institute of Standards and Technology (NIST), are "NIST RMs characterized ... for one or more specified properties, accompanied by a certificate that provides the value of the specified property, its associated uncertainty, and a statement of metrological traceability." SRMs, in addition to being homogeneous and stable, have been characterized for certified values, which are values "for which NIST has the highest confidence in its accuracy in that all known or suspected sources of bias have been fully investigated or accounted for by NIST" (http://www.nist.gov/srm/definitions.cfm). An SRM is a tool that any laboratory can use to benchmark its results against those in which NIST has the highest confidence, letting that laboratory establish its performance.

Challenges in Developing a Whole Genome Reference Material

Most SRMs are characterized for a single certified property, usually a quantitative one such as the mass fraction of cholesterol (NIST SRM 911c), the length of a DNA short tandem repeat (NIST SRM 2399), or DNA concentration (NIST SRM 2366, for cytomegalovirus).


These SRMs are typically used either for calibration of a measurement result to establish metrological traceability to the properties of the SRM, or for evaluation of bias by comparison of measured results to certified properties. Metrological traceability makes it possible to compare measurement results across space and time, by referring all results to be compared to a common reference. Evaluating bias with a reference material helps establish validity (providing evidence that "I've measured what I set out to measure"), and understanding bias helps in building a quantitative "measurement uncertainty budget" (critical for meaningful comparison of results such as "I am 95% confident that this value is higher/lower/different than that value"). Usually only one or a few values are characterized in an SRM, even in complex matrices such as blood serum (e.g., NIST SRM 955c). In contrast, the human genome has billions of properties to be characterized, specifically the genotype at every position in the genome. Also, genotype calls are "nominal properties," or values for which no algebraic manipulation is sensible. The development of nominal property RMs is an immature branch of metrology, largely emerging from the application of metrology to biological measurements. Concepts analogous to metrological traceability, measurement uncertainty, or validation are yet to be established in wide practice. In addition, biases in whole genome sequencing measurements are only partially understood. For these reasons, creating an SRM certified for whole genome sequence is a daunting task, unprecedented in the scale and type of measurements. In consideration of these challenges, NIST will likely release samples of a whole human genome as an RM characterized for homogeneity and stability but without certified values. A set of "information values" for single nucleotide polymorphism (SNP) and indel genotypes will be released with the RM, but "all known or suspected sources of bias" may not be "fully investigated or accounted for by NIST." Alternatively, NIST may release the genomic DNA as an SRM with certified values for SNP and/or indel genotypes in well-understood regions of the genome, and information values for the rest of the genome. As additional measurements are made on the RM/SRM, with new technologies and maturing bioinformatics methods, certified values may be added for additional types of variants and more difficult regions of the genome. It is expected that the utility of a stable and homogeneous RM/SRM will improve over time as read lengths increase, errors diminish and are better understood, bioinformatics methods improve, and sequencing costs decrease. Clinical translation of human genome sequencing calls for well-documented, standard measures of sequencing performance. Homogeneous and stable RMs/SRMs will help make this possible, enabling regulatory oversight by the Food and Drug Administration (FDA), and laboratory accreditation by the College of American Pathologists (CAP) and Clinical Laboratory Improvement Amendments (CLIA). The first RMs NIST plans to develop for these applications will be extracted genomic DNA. The stability of DNA can be assured better than that of cells (live or fixed), since cells can be measured from a variety of tissues stored in different forms (e.g., frozen vs. FFPE, formalin-fixed paraffin-embedded). The first RMs will be limited in scope to the parts of the generic sequencing process highlighted in Figure 23.1.
For current sequencing processes, the scope includes library preparation, sequencing, mapping, and variant calling, but does not include preanalytical steps such as DNA extraction or clinical interpretation of the variants.

GENOME IN A BOTTLE CONSORTIUM

The NIST-hosted Genome in a Bottle (GIAB) Consortium was formed to develop the reference materials, reference methods, and reference data needed to enable clinical translation and regulatory oversight of human genome sequencing. NIST organized multiple invitational meetings in 2011 and 2012 to gauge interest in establishing a consortium. The first large public meeting was held at NIST on August 16–17, 2012, with about 100 attendees from government, private companies, academic sequencing centers, and clinical laboratories. Four working groups were formed at this meeting: (1) Reference Material Selection and Design, (2) Reference Material Characterization, (3) Bioinformatics, Data Integration, and Data Representation, and (4) Performance Metrics and Figures of Merit.

Reference Material Selection and Design

The RM Selection and Design Working Group is tasked with selecting genomic DNA for RMs and designing synthetic DNA constructs for RMs. The working group extensively explored a variety of perspectives on the appropriate consent for a genomic RM. The discussion particularly focused on whether older consents, such as the HapMap consent for the highly characterized sample NA12878, are appropriate for a NIST RM.

[FIGURE 23.1 Overall measurement process for sequencing DNA (sample and gDNA isolation, then library preparation, sequencing, alignment/assembly, variant calling, confidence estimates, and downstream analysis), with black boxes indicating the parts of the measurement process that will be assessed by candidate whole genome DNA NIST RMs.]

The HapMap consent acknowledged that reidentification might be possible, but the risk was thought to be small at the time, and the consent noted that risks may change. The NA12878 sample had previously been characterized extensively in numerous academic studies and is frequently used as a de facto RM by many private companies and clinical laboratories. It is therefore ideal for developing bioinformatics methods that can be applied to other RMs, and the consortium currently plans to use it as a pilot RM. NIST received >8300 10-μg units of DNA from NA12878 in April 2013, which is candidate NIST RM 8398/SRM 2398. Future RMs will be developed from father–mother–child trios in the Personal Genome Project (PGP). The PGP genomes have a broad open consent, including consent for reidentification and broad commercial use such as redistribution of products derived from the cell lines. The working group also discussed potential sources of DNA for RMs and decided that Epstein–Barr virus-immortalized lymphocyte cell lines were the best option because they can easily be renewed. Mutations can occur in cell lines, so the RMs will be DNA extracted from large homogenized growths of cells. This homogenized DNA may have some de novo or low-frequency mutations particular to the batch, but each vial of the RM is expected to be essentially the same. With the consortium, NIST will characterize the homogeneity within vials and between vials, as well as the stability of the DNA over time. Immortalized cell lines may have some differences from DNA in blood or other tissues, but these differences will be characterized and are expected to be sufficiently small that the cell lines should be a reasonable surrogate for assessing performance of sequencing other tissues. Synthetic DNA constructs are also being discussed as possible RMs to help understand performance. The NIST SRM 2374 DNA plasmids were recently used to analyze and recalibrate base quality scores [1], which was more accurate than recalibrating using the genome. The GIAB Consortium is discussing additional synthetic DNA constructs that could be used to assess DNA sequencing and bioinformatics, and some consortium members have designed DNA plasmids that include known cancer-associated mutations with a short sequence barcode near the mutation, so that the DNA can be spiked into any sample at any given ratio (a toy sketch of such a construct follows this section). The Consortium has also discussed designing pairs of synthetic DNA sequences that would be modeled after the types of variants and sequence contexts found in the genome but would have sequence content different from any known genome, so that they could be added to any sequencing experiment. These constructs could allow testing of particular sequencing problems such as complex variants, structural variants, homopolymers, tandem repeats, and copy number variants (CNVs).
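A toy sketch of the spike-in idea discussed above: a construct pairing a unique barcode with a known mutation so that spiked molecules can be recognized in the read data. The sequences and coordinates below are arbitrary placeholders, not an actual GIAB design:

    def make_spikein(backbone, barcode, variant_offset, alt_base):
        """Assemble a synthetic control sequence: a unique barcode adjacent
        to a backbone carrying a known mutation, so spiked-in molecules are
        distinguishable from patient-derived reads."""
        mutated = (backbone[:variant_offset] + alt_base
                   + backbone[variant_offset + 1:])
        return barcode + mutated

    # Arbitrary placeholder sequences
    print(make_spikein("GGATCCGTACTGAAC", barcode="ACTGACTG",
                       variant_offset=7, alt_base="A"))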


Reference Material Characterization

After selecting which genomes or synthetic DNA constructs to use as RMs, they need to be characterized. An RM is defined by NIST as a “material, sufficiently homogeneous and stable with respect to one or more specified properties, which has been established to be fit for its intended use in a measurement process.” Testing homogeneity and stability helps ensure that measurements made on different vials at different times are all measuring essentially the same DNA. For genomic DNA RMs intended to assess sequencing performance, homogeneous means that each vial should have a sufficiently similar mixture of sequences. Since DNA mutations can occur during cell line propagation, the genomes in the cell culture may not be completely homogeneous, and there can be differences between genomes in different expansions of the same cell line [2,3]. Therefore, the Consortium proposed that NIST purchase a large batch (e.g., about 80 mg of DNA) from a well-mixed combination of expansions of cells for whole genome RMs. An individual unit of the RM may have a mixture of genomes due to mutations occurring during cell line expansions, but each vial should have approximately the same mixture of genomes because the cells and DNA were well mixed. The homogeneity between vials will be characterized during RM development to determine whether there are any detectable differences in allele frequency or CNVs between vials. To ensure the ability to discriminate small differences in allele frequency, an important property of a homogeneous RM, careful attention was paid to mixing the DNA prior to aliquoting while taking care to avoid shearing the DNA. Experience suggests that the DNA will be stable when stored frozen, and also that it may become more fragmented when exposed to freeze–thaw cycles or room-temperature storage. Fragmentation may be a secondary consideration for current short-read technologies (so long as it is random), but it may have a deleterious effect on results from future longer-read technologies. Therefore, the stability of the DNA will be tested in a variety of conditions, including after freeze–thaw cycles, when stored frozen, and when stored at higher temperatures. In addition to homogeneity and stability, the RMs will be characterized for their sequence so that labs can understand their performance. Since every characterization method has strengths and weaknesses, multiple sequencing technologies, library preparation methods, and other DNA characterization methods will be combined to provide the best, most comprehensive results. Currently planned sequencing methods include Illumina, SOLiD, Ion Torrent Proton, Pacific Biosciences, Complete Genomics, and 454, as well as emerging technologies such as nanopore sequencing. Library preparation methods will likely include short paired-end, longer mate-pair/paired-end, fosmid sequencing, and limited dilution methods such as those described by Moleculo, Tile-seq [4], and Complete Genomics Long Fragment Read [5], as well as chromosome sorting [6]. Other characterization methods may include genotyping microarrays, array comparative genomic hybridization (CGH), and optical and nanopore-based mapping techniques. Selected SNP and indel sites may be confirmed by Sanger sequencing, high-depth next-generation sequencing (NGS), and manual curation of alignments. Structural variants may be confirmed by microarrays, polymerase chain reaction (PCR), and mapping technologies.
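One way to frame the between-vial homogeneity check is as a contingency-table test on allele depths at a given site. The sketch below is purely illustrative and is not the consortium’s stated statistical plan: it assumes deep sequencing of two vials and applies Fisher’s exact test from SciPy, with invented read counts and an arbitrary significance threshold.

from scipy.stats import fisher_exact

def allele_frequency_discordant(ref_a, alt_a, ref_b, alt_b, alpha=1e-3):
    """True if the alt-allele fraction differs detectably between two vials."""
    _, p_value = fisher_exact([[ref_a, alt_a], [ref_b, alt_b]])
    return p_value < alpha

# Vial A: 4950 ref / 5050 alt reads; vial B: 5200 ref / 4800 alt reads.
print(allele_frequency_discordant(4950, 5050, 5200, 4800))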
In addition, the GIAB Consortium decided that sequencing of family members is an important way to understand accuracy and characterize phasing of variants. Mendelian inheritance can be used to identify sequencing errors, particularly when larger pedigrees are used. Haplotype phasing (i.e., identifying whether heterozygous variants fall on the same chromosome or opposite copies of the chromosome) can be achieved through long-read technologies, limited dilution methods, fosmid sequencing, or inheritance patterns, and the consortium plans to use a combination of these methods.
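To make the Mendelian-inheritance idea concrete, the following sketch checks whether a child’s genotype at a biallelic site is consistent with transmission of one allele from each parent; inconsistent sites flag either a de novo variant or an error in one of the three call sets. The encoding (genotypes as alternate-allele counts) is an illustrative choice, not a GIAB specification.

from itertools import product

def mendelian_consistent(child: int, mother: int, father: int) -> bool:
    """True if the child's genotype can arise from one allele per parent.
    Genotypes are alt-allele counts at a biallelic site: 0, 1, or 2."""
    def gametes(gt):                       # alt alleles a parent can transmit
        return {0} if gt == 0 else {1} if gt == 2 else {0, 1}
    return child in {m + f for m, f in product(gametes(mother), gametes(father))}

# A het child of two hom-ref parents implies a de novo variant or a call error.
print(mendelian_consistent(child=1, mother=0, father=0))   # False
print(mendelian_consistent(child=1, mother=0, father=2))   # True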

Bioinformatics, Data Integration, and Data Representation

After the experimental characterization of the RMs is performed, the data will be analyzed, integrated, and represented in a useful format. Many bioinformatics methods have been developed to map, realign, perform local de novo assembly, call variants/genotypes, and filter potential false-positive variants. For most variant callers, an important first step is mapping reads to the proper location in the reference genome and locally aligning the bases in the read to the bases in the reference genome. Alternatively, some methods have recently been developed to perform global de novo assembly of the reads and then call variants with respect to the reference genome. While mapping-only methods are more mature and robust, global and local de novo assembly methods can detect larger variants that are difficult to detect with mapping-only techniques, so it will be important to incorporate both types of methods in the characterization of the RMs. In addition, different bioinformatics algorithms are used for small variants (e.g., SNPs and small indels) versus larger structural variants like CNVs, inversions, and rearrangements.


To capture the individual strengths of the different methods (including library preparations, sequencing technologies, mapping/de novo assembly algorithms, and variant calling methods) and provide a robust characterization of their deficits, an integration approach will be established to create the well-characterized RMs. Data can be integrated in multiple ways. Simple voting or “majority rules” methods are easiest to implement and understand, but systematic errors shared across multiple methods can cause the majority of methods to be incorrect. Voting methods can also be biased if one type of sequencing or analysis method is overrepresented. Therefore, methods have been developed that arbitrate between genotype calls using information about biases in each data set. In the arbitration method, if data sets have discordant genotypes at a particular position, data sets that have evidence of bias at that position are down-weighted. Evidence of bias includes technical characteristics such as atypical mapping quality, base quality, strand bias, distance from the end of the read (i.e., “soft clipping”), high coverage, variant quality divided by coverage, and other characteristics that are associated with systematic sequencing errors, local alignment errors, and global mapping errors. As noted above, no widely accepted best practice has been established for estimation of uncertainty for nominal properties. For nominal properties, internationally accepted documentary standards allow an estimate of the probability of correctness to be used in place of a quantitative estimate of uncertainty. Several approaches have been proposed to estimate uncertainty for diploid genomes, including expression of uncertainty as the probability of a genotype being incorrect, genotype likelihoods [7–9], or genotype likelihood ratios [10]. Generally, genotype likelihoods for SNPs and indels are calculated from the pileup of reads at each genomic position, using a Bayesian statistical model that assumes a binomial distribution with a sequencing error rate derived from the quality score of each base. Unfortunately, these models do not currently account well for many systematic sequencing errors, global mapping errors, and local alignment errors. Therefore, genotype likelihoods frequently underestimate uncertainty, particularly with high-depth sequencing. A better-informed estimate will use the variety of annotations developed to identify potential systematic errors, such as strand bias, base quality score, mapping quality score, and soft clipping of reads. These annotations can be used in a framework such as the Genome Analysis ToolKit’s (GATK) Variant Quality Score Recalibration (VQSR), which identifies variant sites with unusual characteristics. VQSR can potentially be used both to arbitrate between data sets where they have discordant genotypes and to identify sites with lower confidence. In an integrated approach such as the one proposed for RM characterization, it is not currently possible to assign accurate quantitative probabilistic uncertainties, so the current plan is to use qualitative categories of uncertainty for the RM based on genotype likelihoods and characteristics of bias. Parts of the genome cannot be accurately characterized by current technologies, so the RM characterization will include regions and variants that are uncertain.
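The binomial genotype-likelihood model mentioned above can be sketched in a few lines. The toy example below collapses per-base qualities into a single mean quality and uses a flat prior over the three diploid genotypes; real callers such as GATK work per base and model many more effects, so this is an illustration of the idea only, and the depths and quality value are invented.

from math import comb

def genotype_likelihoods(n_ref: int, n_alt: int, mean_qual: float = 30.0):
    """Return P(reads | genotype) for hom-ref, het, and hom-alt genotypes."""
    e = 10 ** (-mean_qual / 10)          # Phred scale: Q30 -> 0.001 error rate
    p_alt = {"hom_ref": e, "het": 0.5, "hom_alt": 1 - e}   # P(read shows alt)
    n = n_ref + n_alt
    return {gt: comb(n, n_alt) * p ** n_alt * (1 - p) ** (n - n_alt)
            for gt, p in p_alt.items()}

likes = genotype_likelihoods(n_ref=14, n_alt=16)
total = sum(likes.values())              # flat prior -> normalize likelihoods
for gt, lk in likes.items():
    print(f"{gt}: posterior = {lk / total:.3g}")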
As sequencing technologies and bioinformatics methods improve (e.g., longer reads or improved de novo assembly methods), the characterization will improve and uncertainties will decrease. Therefore, the value of the RM will increase over time as additional data are accrued.

Data Representation

The characterization of the genomic RM could be represented in different formats. Because the characterization will be used to assess false-positive variant calls, it is essential that confident homozygous reference locations be distinguished from uncertain locations, which is not typically done in most variant file formats (e.g., VCF). However, a new file format called gVCF was recently proposed to extend VCF to specify regions with homozygous reference calls. Alternatively, standard VCF can be used along with a BED file that specifies genomic regions in which confident genotype calls can be made. Phasing information and structural variants will also need to be represented, which is sometimes difficult in VCF. Some variants can be represented correctly in multiple ways with respect to the reference assembly (see Figure 23.2), so standardized ways to represent these variants, or methods to compare different representations of variants, are important. To address many of these problems, the RM characterization could be represented as an assembly graph, ideally as paternal and maternal contigs for each pair of chromosomes, since even the parental origin of a haplotype can affect function [11]. To assess variant calling in an experiment, these contigs could be mapped to the reference assembly (e.g., GRCh37 or hg19) to determine variants.

Performance Metrics and Figures of Merit

Perhaps the key application of genomic RMs is to understand performance of the sequencing process, including library preparation, sequencing, and bioinformatics (mapping and variant calling), as depicted in Figure 23.1.


FIGURE 23.2 Depiction of four different correct alignments around the same homozygous complex variant CAGTGA→TCTCT, which would result in four different sets of variant calls. BWA has a T insertion followed by a 2-base deletion followed by two SNPs. Ssaha2 has a 1-base deletion followed by four SNPs. Complete Genomics CGTools has an SNP followed by a 1-base deletion followed by three SNPs. Novoalign has a 6-base deletion followed by a 5-base insertion. All are correct alignments, but they would result in very different variant calls, which complicates comparison of variant calls from different aligners and data sets.

Because the RM is characterized for homogeneity and stability, the RM provides a constant benchmark that can be used to compare performance of different methods, including new methods developed in the future. The GIAB Performance Metrics and Figures of Merit Working Group is tasked with developing a framework for assessing performance of a sequencing process. This framework would allow any laboratory that has sequenced the RM to compare their variants, genotype calls, and/or assembly to the consensus characterization of the RM. Regulatory and accreditation bodies can use standard methods of performance assessment and reporting to establish a consistent enterprise-scale method to compare performance and make confident decisions. Laboratories could refine and optimize their protocols and procedures and learn about the different types of biases and errors affecting their results. Assessing performance of genome sequencing poses a variety of challenges:
1. Sensitivity, specificity, false-positive, and false-negative rates for variant calls are oft-used measures to specify performance, where “positives” typically refer to any type of variant and “negatives” refer to homozygous reference. These two categories oversimplify performance assessment, since at least three possible genotypes exist at any genome position (homozygous reference, heterozygous, and homozygous variant). At sites with more than one possible alternate allele, even more than three possible genotypes exist. Therefore, genotype comparison tables, in which genotype calls from two methods are compared, give a more comprehensive description of the different types of genotyping error rates (a minimal sketch appears after this list).
2. In most current clinical genetic tests, samples with the mutation(s) of interest are used as “positives” and samples without the mutations are “negatives.” In this way, laboratories can measure accuracy for each mutation. This paradigm becomes untenable for whole genome, whole exome, and even multigene panels because it is not possible to have RMs with every possible variant that might be seen in clinical samples.


Fortunately, since a single sample or a few samples will generally contain many examples of most types of variants, it becomes possible to test sequencing performance for different classes of variants with a limited number of samples. However, dividing variants into different classes is not trivial, since sequencing accuracy can be affected by variant type, sequence context, and genome region in complex ways.
3. Current sequencing technologies and bioinformatics methods have different accuracies for different variant types (e.g., SNPs, indels, CNVs, and structural variants such as rearrangements). For example, SNPs tend to be easier to detect than indels and CNVs, and bioinformatics methods are more mature for SNPs. Complex variants (clustered SNPs and indels) and moderately long indels (in the range of 10–100 bp) can be particularly difficult to detect with current mappers and variant callers, though new local and global de novo assembly methods often help.
4. Some sequence contexts are particularly difficult for current sequencing technologies, particularly repeat sequences (e.g., homopolymers and tandem repeats) that are longer than the sequencing read length. Certain sequence contexts can also cause systematic sequencing errors on different platforms (e.g., homopolymers for 454, Ion Torrent, and Sanger capillary sequencing, or GGT motifs for Illumina).
5. Some regions of the genome are more difficult to sequence. Regions with high or low GC content often have low or no coverage due to PCR bias when sequencing with NGS. In addition, a small fraction of the human reference assembly is not finished, so reads cannot be mapped to it. The functionally important HLA (human leukocyte antigen) region requires specialized bioinformatics methods due to its high sequence diversity [12]. Centromeres and telomeres have low sequence diversity, which makes them difficult to sequence and map. Large tandem duplications, mobile element insertions, pseudogenes, and other regions of the genome with high homology also cause significant problems for most current sequencing technologies. For these regions, it is often impossible to determine from which copy a particular sequencing read originates. It is important to identify these duplicated regions in the reference assembly as well as duplications in the reference material sample that differ from the reference assembly. Duplicated regions in the reference assembly can be identified from low mapping quality of reads, but duplicated regions in the sample of interest require specialized methods for CNV analysis.
Because accuracy can vary by variant type, sequence context, and genomic region, overall performance assessment may change as characterization of the RM improves. As more difficult variants, sequence contexts, and genomic regions can be characterized, overall accuracy of a particular method will likely decrease when assessed against the RM characterization. To avoid accuracy changing as the RM characterization improves, the genome and variants could be divided into different regions and types of variants. However, the genome and variants could be divided in numerous ways, and some divisions could depend on sequencing platform and library preparation. For example, longer reads can resolve longer repeat regions, long mate-pair libraries can help resolve duplications, and some platforms have higher error rates for homopolymers or other specific sequence motifs.
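As promised in item 1 above, here is a minimal sketch of a genotype comparison table: calls from two methods are cross-tabulated over shared sites, so each kind of disagreement (e.g., heterozygous called as homozygous reference) is counted separately rather than being folded into a single false-positive or false-negative rate. The call-set encoding and example data are hypothetical.

from collections import Counter

STATES = ("hom_ref", "het", "hom_var")

def concordance_table(calls_a, calls_b):
    """Count (genotype_A, genotype_B) pairs at positions shared by both sets."""
    return Counter((calls_a[pos], calls_b[pos])
                   for pos in calls_a.keys() & calls_b.keys())

a = {("chr1", 101): "het", ("chr1", 202): "hom_ref", ("chr1", 303): "hom_var"}
b = {("chr1", 101): "het", ("chr1", 202): "het",     ("chr1", 303): "hom_var"}
table = concordance_table(a, b)
for ga in STATES:                        # rows: method A; columns: method B
    print(ga, [table[ga, gb] for gb in STATES])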

REFERENCE DATA

In addition to distributing the physical genomic DNA as a NIST Reference Material, data collected for the RM will be made available as Reference Data. These data will likely include mapped and unmapped sequence reads (e.g., BAM files), and genotype and variant calls across the whole genome. In addition, these data may be visualized through a genome browser to view alignments and variants in a particular region (e.g., the browser being developed by the National Center for Biotechnology Information (NCBI) for the Genetic Testing Reference Materials (GeT-RM) project). A lab sequencing and analyzing the RM could look in this browser at any location at which its call differs from the integrated consensus genotype call to help determine why its answer differs. The Reference Data can also be used to help understand the performance of bioinformatics pipelines. Typically, bioinformatics pipelines are assessed using synthetic, “in silico” generated reads. Synthetic reads are used so that the truth about the location of the reads and variants in the genome is known. Unfortunately, synthetic read generators do not model all systematic error processes that occur during sequencing, so they generally overestimate performance of the bioinformatics programs. Nevertheless, synthetic reads can be useful for understanding errors, particularly in mapping and alignment. Alternatively, the genome reference assembly to which the reads are mapped can be altered in strategic ways. For example, variants can be introduced that are not in the genome being sequenced, and the ability to detect these variants can be assessed. For microbial genomes, the reads from one strain can be mapped to the genome reference assembly from a related strain, and the variants found can be compared to the known variants between the two strains [13].


While changing the genome reference assembly can be useful for assessing variant detection by microbial bioinformatics software, it can only assess detection of homozygous variants, so it is less useful for diploid organisms like humans. Reference Data from the RMs can provide a useful way to assess bioinformatics pipelines with real human sequence data from a well-characterized genome. These Reference Data could include data sets from multiple sequencing platforms, so that bioinformatics pipelines could be tested with multiple data sets from different platforms or versions of library preparation and sequencing chemistry. Because the RMs will be well characterized, the results of the bioinformatics analyses of the Reference Data could be compared to the integrated consensus variant calls to assess accuracy. Using the Reference Data as a benchmark, the effect of changing parameters in the bioinformatics software could be analyzed. A challenge of assessing performance of bioinformatics pipelines with Reference Data is that the sequencing platforms and library preparation methods are changing rapidly, which can strongly affect the bioinformatics results. An advantage of having a homogeneous and stable RM is that Reference Data for this RM will continue to be accumulated for new versions of sequencing methods, as users of the RM choose to deposit their data in public databases.
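Once variants are normalized to a single representation, the comparison against the integrated consensus reduces to set arithmetic (the hard part, as Figure 23.2 illustrates, is the normalization itself, which this sketch deliberately skips). The sketch below restricts both call sets to confident regions and reports sensitivity and precision; all data structures and example values are hypothetical.

def in_confident_region(variant, regions):
    """True if the variant's position falls inside any confident region."""
    chrom, pos, _, _ = variant
    return any(c == chrom and start <= pos < end for c, start, end in regions)

def benchmark(test_calls, truth_calls, regions):
    """Sensitivity and precision of test_calls vs. truth_calls within regions."""
    test = {v for v in test_calls if in_confident_region(v, regions)}
    truth = {v for v in truth_calls if in_confident_region(v, regions)}
    tp = len(test & truth)                 # true positives; rest are FP/FN
    return {"sensitivity": tp / len(truth), "precision": tp / len(test)}

regions = [("chr1", 0, 1_000_000)]         # (chrom, start, end) confident region
truth = {("chr1", 100, "A", "G"), ("chr1", 500, "T", "C")}
calls = {("chr1", 100, "A", "G"), ("chr1", 900, "G", "T")}
print(benchmark(calls, truth, regions))    # sensitivity 0.5, precision 0.5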

OTHER REFERENCE MATERIALS FOR GENOME-SCALE MEASUREMENTS

Microbial Genome RMs

NIST is also working with the FDA to develop whole genome microbial DNA RMs (and/or SRMs) similar to the human DNA RMs the GIAB Consortium is developing. These whole genome RMs will be DNA extracted from large-scale cultures of several bacterial organisms spanning a range of GC content (to enable evaluation of sequencing platform performance for low- and high-GC content genomes, a challenge for some current platforms). As with the human RMs, a significant value of these RMs will be homogeneity and stability; each RM vial will contain a sample of the same DNA, so it will not be subject to changes over time due to mutations that occur during growth of the organisms. The DNA will also be characterized on a whole genome scale with multiple methods, with the expectation of a highly confident de novo assembly for the particular genome (or genomes, since despite the care taken in preparing the samples from a clonal population, these single-strain samples might in fact contain mixtures of genomes arising from mutation in culture) contained in the vials. These RMs could then be used to understand performance of sequencing instruments and bioinformatics pipelines used for microbial sequencing. Because bacterial genomes are haploid and do not have heterozygous variants, the genome reference assembly to which reads are mapped can indeed be changed to understand the performance of bioinformatics pipelines. If the reads from the RM are mapped to the genome reference assembly generated from the RM, no variants should be detected. While this could help in understanding certain types of errors, mapping to a genome reference assembly different from the RM assembly is a more realistic test of how bioinformatics software is typically used. Therefore, the genome reference assembly generated from the RM could either be modified with variants, or the reads from the RM could be mapped to the genome reference assembly from a related strain or species that has known differences from the RM [14]. By sequencing the RM and mapping to different genome reference assemblies, multiple steps in the sequencing process could be systematically tested, including library preparation, sequencing, mapping/alignment, and variant calling. As with the human genomic DNA samples, these microbial DNA RMs will not test preanalytical steps such as DNA extraction, so they would not be useful for understanding the effect of (differential) DNA extraction on quantitation (such as in metagenomic studies). In addition, these RMs will be from known strains of only a few species, so they will not comprehensively assess the ability of laboratories to assign identity to an unknown microbial sample. However, they will provide a way to understand performance of sequencing and bioinformatics, including random and systematic errors introduced by these methods.

Gene Expression RMs

NIST has recently released SRM 2374—“DNA Sequence Library for External RNA Controls”—as an RM to support confidence in genome-scale gene expression measurements. This reference material is intended to be used as a library of templates from which RNA controls can be in vitro transcribed (IVT) and added to (“spiked into”) samples of interest. Genome-scale gene expression measurements are impractical to calibrate; there are too many mRNAs to prepare exogenous calibration materials, and there are no reliable methods to establish the purity of calibration transcripts.


While unable to provide a calibration, the addition of exogenous control RNA molecules is a reasonable approach to building evidence to assess confidence in a gene expression experiment. This approach was described and initiated at a NIST-hosted industry workshop, out of which grew the External RNA Control Consortium (ERCC). This consortium was established to develop a common set of RNA controls for use in gene expression assays; standard methods for their use; and standard, objective, quantitative analysis approaches that would allow the technical performance of an experiment to be reported in a comparable fashion. The controls are designed to mimic natural mammalian mRNA and to be useful in a variety of assay formats. There are 96 different controls represented in the reference set, averaging about 1000 nucleotides in length and ranging in GC content from about 33% to about 54%. Each control is a DNA sequence inserted in a common plasmid vector, engineered for simple IVT of either “sense” or “antisense” RNA, and flanked with restriction and sequencing promoter sites. The 96 controls contain 86,319 bases of certified sequence, with a confidence estimate for each base. This SRM was a pilot for further developments in sequence RMs. It was the first material with a scale of many thousands of certified properties, and it was the material for which an “ordinal scale” was developed to describe confidence in the certified properties (this scale is “Most Confident,” “Very Confident,” “Confident,” and “Ambiguous”). It was also developed in partnership with the ERCC, which was composed of the end-user community, reagent manufacturers, technology developers, other federal agencies (including regulators), academic labs, and professional societies. The consortium model ensured that the RM would be relevant and useful, and the partnership with the technology and reagent developers assured that assay content would be available for the standard. It is anticipated that this model will prove useful in the context of the GIAB Consortium, as that effort gets fully underway. Another type of reference material appropriate for genome-scale gene expression measurements is a mixed-tissue reference material [15]. Such a material relies on fractions of material from different tissues mixed into a sample pair in different known proportions. While the absolute abundances of the mRNA molecules in the sample pair are unknown, their relationship can be established through the mixing proportions and characterization of the signals from assays of the pure components of the mixture. NIST is actively evaluating this approach as a way to establish reference materials for validation of genome-scale measurements.
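The logic of the mixed-tissue design can be shown with a toy calculation: given signals measured on the pure tissue components and the known mixing fractions, the expected signal in each mixture, and hence the expected ratio between the sample pair, follows from a simple linear blend. The numbers below are invented for illustration and do not come from any NIST material.

def expected_mixture_signal(signal_a: float, signal_b: float, frac_a: float) -> float:
    """Linear blend of pure tissue A and tissue B signals at fraction frac_a."""
    return frac_a * signal_a + (1 - frac_a) * signal_b

# Hypothetical transcript measured in pure tissues A and B:
pure_a, pure_b = 800.0, 200.0
mix1 = expected_mixture_signal(pure_a, pure_b, frac_a=0.75)   # 650.0
mix2 = expected_mixture_signal(pure_a, pure_b, frac_a=0.25)   # 350.0
print(f"expected ratio between the sample pair: {mix1 / mix2:.3f}")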

CONCLUSION

Reference materials can play an important role in enabling clinical translation of new sequencing technologies. The GIAB Consortium and NIST are developing well-characterized whole human and microbial genomes as NIST reference materials, which will be used by clinical and research laboratories to understand performance of sequencing and bioinformatics pipelines. In the future, these pure DNA reference materials could be supplemented by additional types of reference materials for genome-scale measurements, such as whole transcriptome and proteome materials, which might be developed from induced pluripotent stem cell lines from the same individuals from whom the DNA reference materials are being developed. Reference materials for genome-scale measurements, including the genomic materials currently being developed, are a critical part of the measurement infrastructure needed to have confidence in clinical measurements of billions of analytes, such as a human genome sequence.

References

[1] Zook JM, Samarov D, McDaniel J, Sen SK, Salit M. Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing. PLoS One 2012;7(7):10.
[2] Londin E, Keller M, D’Andrea M, Delgrosso K, Ertel A, Surrey S, et al. Whole-exome sequencing of DNA from peripheral blood mononuclear cells (PBMC) and EBV-transformed lymphocytes from the same donor. BMC Genomics 2011;12(1):464.
[3] Saito S, Morita K, Kohara A, Masui T, Sasao M, Ohgushi H, et al. Use of BAC array CGH for evaluation of chromosomal stability of clinically used human mesenchymal stem cells and of cancer cell lines. Human Cell 2011;24(1):2–8.
[4] Lundin S, Gruselius J, Nystedt B, Lexow P, Kaller M, Lundeberg J. Hierarchical molecular tagging to resolve long continuous sequences by massively parallel sequencing. Sci Rep 2013;3:1186.
[5] Peters BA, Kermani BG, Sparks AB, Alferov O, Hong P, Alexeev A, et al. Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 2012;487(7406):190–5.
[6] Fan HC, Wang JB, Potanina A, Quake SR. Whole-genome molecular haplotyping of single cells. Nat Biotechnol 2011;29(1):51.
[7] DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 2011;43(5):491.


[8] McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20(9):1297–303.
[9] Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics 2011;27(15):2156–8.
[10] Ajay SS, Parker SCJ, Abaan HO, Fajardo KVF, Margulies EH. Accurate and comprehensive sequencing of personal genomes. Genome Res 2011;21(9):1498–505.
[11] Howey R, Cordell HJ. PREMIM and EMIM: tools for estimation of maternal, imprinting and interaction effects using multinomial modelling. BMC Bioinformatics 2012;13:13.
[12] Liu C, Yang X, Duffy B, Mohanakumar T, Mitra RD, Zody MC, et al. ATHLATES: accurate typing of human leukocyte antigen through exome sequencing. Nucleic Acids Res 2013;41(14):e142.
[13] Kisand V, Lettieri T. Genome sequencing of bacteria: sequencing, de novo assembly and rapid analysis using open source tools. BMC Genomics 2013;14:211.
[14] Farrer RA, Henk DA, MacLean D, Studholme DJ, Fisher MC. Using false discovery rates to benchmark SNP-callers in next-generation sequencing projects. Sci Rep 2013;3:1512.
[15] Thompson KL. Use of a mixed tissue RNA design for performance assessments on multiple microarray formats. Nucleic Acids Res 2005;33:e187.


C H A P T E R

24 Ethical Challenges to Next-Generation Sequencing

Stephanie Solomon
Albert Gnaegi Center for Health Care Ethics, Saint Louis University, Salus Center, St. Louis, MO, USA

O U T L I N E

Introduction
  Respect for Autonomy
  Beneficence/Nonmaleficence
  Justice
  Conclusion
Challenging Existing Frameworks
  Diagnostics Versus Screening
  Research and Clinical Care
  Individuals and Families
  “What to Disclose” Is Becoming “What Not to Disclose”
Notifying of Results
  Introduction
  Different Kinds of Results
    Research Results Versus Clinical Results
    Raw Data
    Probabilistic and Susceptibility Information
    Variants of Unknown Significance
    Incidental Findings
    Changing Status
  How Do We Categorize Which Results to Return?
    Analytic Validity
    Clinical Validity
    Clinical Utility
    Personal Utility
    ELSI (Ethical, Legal and Social Implications) Recommendations
  Recommendations
Privacy and Confidentiality
  Introduction
  Concepts
  Data Protection Methods
  Data Environment
  Untrustworthy People
  Reidentification
  Required/Permitted Sharing
  Recommendations
Informed Consent
  Introduction
  Recommendations
    Balance the Amount of Information with Patient Initiative
    The Right to Know and the Right Not to Know
    Negotiation of Clinical and Personal Risks and Benefits
    Evolving Results
    Counseling
    Transparency
Conclusion
References
Glossary
List of Acronyms and Abbreviations


KEY CONCEPTS

• The ethical challenge of examining the implications of NGS is to determine its clinical use in ways that are consonant with the ethical principles of biomedical ethics: respect for autonomy, nonmaleficence, beneficence, and justice.
• Whether due to historical distrust or other reasons, there is a documented differential in the amount of genetic research performed on different populations. As a result, minorities such as African Americans are more likely to have variants of uncertain significance [1,2]. Until this reality is remedied, it is important for clinicians to be aware of it, for it may alter the potential cost/benefit ratio of undergoing NGS for some of their patients.
• NGS challenges existing frameworks that are used to evaluate the ethics of clinical practice: the distinction between diagnosis and screening, the divide between research and clinical care, the impact on individuals versus their family members, and the status quo of disclosure asking “what to disclose” as opposed to “what not to disclose.”
• NGS introduces the challenge of determining what constitutes a clinical result in the first place. Clinicians need to reflect on the status of research results, raw data, probability and susceptibility information, variants of unknown significance, incidental findings, and findings that may constitute results in the future for their clinical practice.
• Clinicians should use the ACCE model (analytic validity, clinical validity, clinical utility, and ELSI) to categorize the implications of NGS results and also realize the complexities that underlie each of these concepts. Further, the notion of personal utility is growing in influence, and clinicians should seek to understand the role that results can play (either positively or negatively) in their patients’ lives.
• Clinicians need to negotiate the pros and cons of the current strategies of Listing and Binning to determine which results to return to their patients, and be as familiar as possible with the evolving status of different types of results.
• The adequacy of data protection methods can only be assessed in the context of the current and evolving data environment.
• The challenge of informed consent for NGS is to balance the amount of information that could be provided with the amount of patient initiative to decide how much information he or she wants or needs.
• An ethical fundamental of informed consent is to facilitate patient choice in deciding both what they want to know and what they do not want to know.
• Informed consent must reflect not only the clinical aspects of NGS and returning results but also their potential personal utility to the patient.
• Genetic counseling is an ideal approach to returning NGS results, but it will probably be increasingly challenging to provide. Clinicians should be prepared for the task of conducting NGS even without genetic counselors as a resource.
• The transparency of the informed consent process regarding the potential benefits and potential drawbacks of NGS, both in terms of clinical utility and potential harms, is paramount.

INTRODUCTION The analysis I had done tested one million places in my DNA. But this is just the beginning. Soon, probably within the next 5 or 7 years, each of us will have the opportunity to have our complete DNA sequenced, all the three billion letters of the code, at a cost of less than $1000. This information will be very complex and powerful. Careful analysis of the complete content of your genome will allow a considerably more useful estimate of your future risks of illness than is currently possible, enabling a personalized plan of preventive medicine to be established. —Francis Collins, director of NIH, in his book The Language of Life, 2010

Today, we are on the cusp of a revolution in healthcare delivery where clinicians will be able to access patients’ genetic information cheaply, quickly, and entirely. As the much-anticipated $1000 genome comes closer to a clinical reality for patients and their healthcare providers, we must face, with forethought, the numerous ethical questions that arise.


Unfortunately, the history of science and medicine has taught us that the question, “Can we do it?” usually long precedes reflection on the questions, “How should we do it?” or sometimes even, “Should we do it at all?” Fortunately, this has not been the case with genetic and genomic advances. From the inception of the Human Genome Project in 1990, there was a general awareness that increasing knowledge of something as intrinsic and essential to humans as their genetics would bring with it corresponding concerns about the ethical, legal, and social implications of this knowledge. In rare explicit acknowledgment that scientific advances should be married to research into their ethical implications, the funding for the Human Genome Project earmarked 3–5% of its annual budget toward studying these issues [3]. Using these and other sources of support, multidisciplinary and international scholars, government organizations, advisory councils, and numerous others have been exploring the ethical challenges of genetics research and publishing thousands of papers, ranging from examinations of ethical issues in genetic and genomic research and clinical translation to analyses of the impact of these issues on legal, public policy, and other societal questions [4]. These include organizations such as the American College of Medical Genetics and Genomics (ACMG), the Electronic Medical Records (EMRs) and Genomics Network, the Clinical Sequencing Exploratory Research Consortium, and the College of American Pathologists, as well as commissions such as the National Bioethics Advisory Commission (NBAC) and the Presidential Commission for the Study of Bioethical Issues. Not surprisingly, positions on the ethical approaches to genetic information have had to adapt and evolve in lockstep with the constantly changing field itself. Positions on these issues have been divided depending upon where people land in the debate on “genetic exceptionalism,” or the question of whether genetic/genomic information should be treated differently from other health information, especially in the context of privacy protections, data access, and permissible use. The position in favor of genetic exceptionalism argues that genetic/genomic data exhibit several characteristics that, when taken as a whole, are unique to these data. Based on this fact, those in favor of genetic exceptionalism believe that genetic and genomic information have individual and societal implications that far exceed the impact of other health information. On the other side, some argue either that genetic information is not qualitatively different from other health information or even suggest “that the proper question to ask is not whether genetic information should be treated like other medical information—but instead, why other medical information should not be treated like genetic information” [5]. The nine characteristics of genetic information are:
1. Uniqueness: Each individual (except identical twins) has a unique genetic/genomic code.
2. Predictive capability: Genetic/genomic information can predict, with various degrees of probability, future disease or drug response.
3. Immutability: Other than somatic mutations, an individual’s genetic code does not change throughout life.
4. Requirement of testing: Although there are phenotypic indicators of genotype, many genetic markers can only be known through a genetic or genomic test.
5. Historical misuse: Genetic information has historically been used to harm people, whether through eugenics programs, discrimination, stigmatization, or stereotypes.
6. Variability in public views: Different individuals, and different cultures, have very different understandings, sensitivities, and feelings about genetics.
7. Impact on family: Genetic/genomic information has the potential to impact not only the individual who gets tested but also that person’s blood relatives, ancestors, and descendants.
8. Temporality: Knowledge about, use of, and societal implications of genetic/genomic information are inevitably going to change and evolve over time.
9. Ubiquity and ease of procurement: DNA can be obtained from any number of sources, including cheek swabs, saliva, hair, and blood. It can also be obtained with or without a person’s permission [6].
These qualities are just as true and problematic for last-generation sequencing as they are for next-generation sequencing (NGS). What has changed with NGS is the opportunity to cheaply and quickly view a person’s entire genome or exome. NGS technologies like whole exome sequencing (WES) and whole genome sequencing (WGS) began to be used extensively in the research realm in 2006; a mere 3 years later, they began to be used in the clinical context [7]. On the face of it, NGS seems like a mere quantitative shift from past-generation sequencing, but this is misleading. Recent technological advances in genetic sequencing pose new challenges to our previous views of what this information is, what it means for patients, and how it is useful. While we currently ask questions about whether Ashkenazi Jews should get screened for genetic carrier status for rare diseases found in their populations,


with NGS a Californian company can identify over 100 rare, recessively inherited conditions with a single carrier test [8]. As this example demonstrates, while the costs of sequencing have dropped dramatically, the yield of sequencing has exploded. As bioethicist Richard Sharp put it, “the $1000 genome may create a million dollar headache” [9]. The ethical headache of examining the implications of all of this available information is to determine how to use it clinically in ways that are consonant with the principles of biomedical ethics: respect for autonomy, nonmaleficence, beneficence, and justice. As principles, they do not constitute a complete moral theory for clinical practice, but more usefully provide a framework through which to evaluate questions and actions in clinical care, including the ethical uses of NGS.

Respect for Autonomy

Grounded in an idea common to most Western democracies and, increasingly, the rest of the world, the principle of autonomy emphasizes the importance of an individual’s right to freedom and choice. Legitimately or illegitimately, autonomy can be limited in numerous ways, both externally by constraint, coercion, laws, and regulations, and internally by lack of understanding, lack of appreciation, or lack of decision-making capacity. The fundamental idea behind this principle is that individuals, all things being equal, are entitled to determine their own destiny. The most obvious challenges to respecting autonomy in medicine come from cases where individuals make choices (or want to make choices) that society deems not to be in their best interest. Clear cases include Jehovah’s Witnesses’ refusals of blood transfusions for religious reasons or cases of suffering patients desiring euthanasia. Thankfully, in the context of NGS, the tension between autonomous choice and individual harm is not so stark. In this context, questions of autonomy manifest in three general areas:
1. People’s right to choose (as opposed to their doctor, the analyzing laboratory, or regulating bodies) how much and what type of genetic information they want to know
2. People’s right to choose what genetic information to disclose to implicated relatives, share with researchers, or make available to other interested parties
3. The rights of parents to choose to know genetic information about their children.
While some lean toward a paternalistic stance that worries that much genetic information, especially with unclear risk or unavailable clinical response, may be more harmful to patients than they know, evidence is emerging that “adults have shown themselves to be more capable of dealing with troubling information, potentially bad news and uncertainty than they were once thought to be” [5] (this point will be elaborated in the section on Return of Results below). More controversy emerges with issues of an individual’s right to limit disclosure to impacted family members, especially when knowledge of genetic or genomic traits could affect their welfare or reproductive decision-making. The legal obligations in this sector are controversial and ever-changing (this point will be elaborated in the section on Privacy and Confidentiality below). A final issue with autonomy and NGS arises with the consideration of the rights of children, since many NGS technologies can provide health information about children either before they are born or afterward. Whether parents should have the right to choose what their children will know, or whether certain genomic information should be withheld from parents (and their children) until the children reach a legal age to autonomously decide for themselves, is a central issue in NGS. This type of potential autonomy right in children has been called “anticipatory autonomy” [10], and NGS forces us to ask how much this future autonomy should be respected, especially when faced with the actual autonomous choices of parents.

Beneficence/Nonmaleficence

The most clear and uncontroversial ethical principle of medical practice is beneficence, since healthcare is explicitly aimed at promoting the welfare of its patients. The challenge of this principle is that it is often used to refer to two distinct obligations: (1) the obligation to do no harm and (2) the obligation to actively provide benefit. These two aspects of beneficence are distinct because some decisions require choosing one over the other. Sometimes the obligation to benefit can supersede the obligation to do no harm; for example, the negligible “harm” of a venipuncture is clearly outweighed by the sizable benefit of a blood transfusion. Other times, a slight benefit, like protecting a person’s confidential disclosure of intending suicide, is outweighed by a significant potential harm, such as the harm of allowing that suicide to take place.


A further complexity, especially with the obligation to provide benefit, is the question of how far that obligation extends. Does a physician have the obligation to track down a patient who is no longer at her practice to tell him that an unknown mutation found 5 years ago is now known to cause a disorder? As the last example shows, preventing harm and promoting benefit in NGS most often reflect issues of the benefit and harm of genetic information. As discussed below, these benefits and harms can be divided into the clinical benefits and harms of the information and the nonclinical benefits and harms, such as psychological, economic, and social implications, among others. In some cases, the benefits or harms are clear, but sometimes they are more probabilistic or uncertain. Issues of when and how to return NGS results hinge crucially on the state of empirical evidence about the types of benefits and harms that this knowledge provides to patients, which unfortunately is in a very nascent state in most cases.

Justice

The final and often forgotten principle of bioethics is justice. What justice means has been debated at least since Plato in Ancient Greece, but the common understanding of it refers to giving people what is fair, due, or owed. Justice, unlike autonomy and beneficence, has a clear societal implication. While it is tempting in a clinical encounter to think of the patient, or maybe even his or her family, as the sole bearers of benefit and harm in a medical encounter, this is not strictly the case. As is no longer deniable, society bears strong economic benefits and burdens depending on choices we make in the healthcare sector. Less broadly, genetic and genomic information has implications not only for individuals and their families but also for those with whom they share ancestry and for their communities. This means that the benefits and harms of genetic information can impact broader groups that must also be taken into account. A commitment to justice in medicine is also a commitment to ensuring that the unavoidable burdens of achieving technological advances in healthcare do not fall disproportionately on any individual or group, and that the benefits and use of technological advances are widely distributed. “Distributive justice” refers to this ideal of a fair, equitable, and appropriate allocation of goods and burdens within a society [11]. In the context of healthcare services, and especially new and potentially expensive ones like NGS, the principle of justice suggests reflection on who will have access to NGS. With a technology that is so close to its research roots, the question of access to research participation and the generalizability of research results is also pertinent. In a letter written by the Secretary’s Advisory Committee on Genetics, Health, and Society of the National Institutes of Health (NIH) in 2010, it is stated that

The application of genetics and genetic technologies should aim to enhance equity in health outcomes and reduce health disparities, and access to genetic technologies should be equitable and fair. Coverage and reimbursement policies also play an important role in patient access, and a number of adjustments are needed before such policies can bring about equity and fairness [12].

This goal may be more difficult to achieve than it first appears. In the context of predictive genetic testing for Huntington’s disease, it has been noted that “requiring multiple, in-person appointments may contravene the principle of justice and fair resource allocation, as those who do not reside in a major urban center face barriers (in terms of costs, time away from family and stress)” [13]. Besides the challenges of access, different populations may have different attitudes toward genetic testing due to differing histories of abuse by the medical system, and by genetic knowledge specifically, in those populations. For example, “[v]irtually all the approaches to return of [W]ES/WGS results currently being studied focus on testing in European American populations. Little effort has been made to consider how the perceived benefits and possible harms...might differ in populations who have been historically at the margins of genomics” [14]. Whether due to historical distrust or other reasons, there is a documented differential in the amount of genetic research performed on different populations. As a result, minorities such as African Americans are more likely to have variants of uncertain significance (VUSs) [14]. Until this reality is remedied, it is important for clinicians to be aware of it, for it may alter the potential cost/benefit ratio of undergoing NGS for some of their patients.

Conclusion

Given the changes in healthcare posed by NGS, the ongoing conversation regarding the ethical use of genetic information needs to be revisited in light of these principles.


While particular recommendations and guidance become obsolete quickly, the general trends of the ethical concerns, as well as the general trajectory of insights on these issues, reflect patterns of thought and acceptable behavior that can be useful to those who find themselves practicing medicine in this new frontier.

CHALLENGING EXISTING FRAMEWORKS

While at first blush it seems that NGS is merely an expansion of earlier genetic sequencing techniques, NGS uniquely challenges traditional distinctions and frameworks used to ethically examine healthcare practices. Even when employed for diagnostic purposes, NGS will yield numerous results relating to conditions that are currently asymptomatic in the patient, and NGS therefore challenges the traditional distinction between diagnosis and screening. As a novel technology that has moved rapidly from the research context to clinical uses, it is challenging the traditional divide between research and clinical care. While all genomic information implicates both individuals and their families, the scope of information yielded by NGS expands the challenge of deciding whether test results are only those of the individual patient or whether they are also results for closely related family members. Finally, by yielding a vast number of results with one test, NGS shifts the status quo of disclosure from the question of “what to disclose” to the question of “what not to disclose.” Each of these issues will be discussed in this section.

Diagnostics Versus Screening

The traditional distinction between diagnosis and screening is based on the motivation for acquiring health information. Diagnosis is conducted within the context of clinical care, in response to a symptom or complaint, justified by previous findings or the family history of a patient. Screening, on the other hand, is most often found in the context of public health, is frequently unrequested or even required, and is given to an asymptomatic person or population. The individual and social trade-offs of diagnostic tests and screening tests differ, as do the expectation and likelihood of a result. Even without a clinical hypothesis, if NGS is employed to determine the cause of a clinical symptom or symptoms, it is still being employed diagnostically [15]. The challenge emerges for NGS because (1) NGS is being increasingly recommended for use in patients before they (or their children) manifest any symptoms and (2) the technology will inevitably bring back information well beyond the answer sought. In the past, the attention of genetic science and testing was on rare, single- or several-gene diseases with low prevalence and high penetrance [16]. These types of diseases are amenable to targeted testing in the clinical context for individuals who have undiagnosed symptoms and/or have a high likelihood of carrying the variant(s). These targeted tests are hypothesis driven and are amenable to genetic testing that focuses on particular sections of the genome. These genetic tests are analogous to nongenetic tests, such as metabolic panels, complete blood counts (CBCs), and glucose tests, that are used diagnostically to ascertain whether a patient has a disorder indicated by clinical symptoms. There is little ethical divergence on the appropriate use of this targeted testing, especially when there is a clear clinical recourse to the information. While it is possible that NGS will be used in this straightforwardly diagnostic way, it is more often used in different clinical scenarios. When a clinician cannot diagnose a patient based on clinical findings, there is a growing impetus to do more broad-scale testing of the patient’s genome through WES or WGS. These techniques are uniquely useful for searching for novel disease-causing variants when the disease etiology is unknown, so there is no hypothesis to work with [17]. Another important use, especially for WES, is to search for de novo disorders that are genetic but not inherited, such as many examples of intellectual disability. There is already evidence that these sequencing technologies are causing a shift from “known disease-causing variants to novel, previously unreported rare variants with uncertain disease causality, which include (1) novel frameshifts or nonsense mutations in known disease genes, (2) smaller insertions or deletions without frameshifts, (3) substitution of an amino acid with similar chemical properties” [18]. More broadly, many are beginning to suggest NGS for asymptomatic individuals. This information can subsequently be used to deliver personalized medicine in the form of tailored lifestyle advice, preventive measures, and treatments, as well as potentially informed reproductive decisions. These uses of NGS, when employed without a concrete medical indication, are more akin to screening practices than diagnostic testing [15].
Closely aligned with the anticipation of fully personalized medicine, NGS is seen as crucial to “a ‘proactive’ model of healthcare, where one does not wait for people to exhibit disease symptoms, but health risks are mapped out while they are still healthy...The goal becomes the provision of preventive, diagnostic and therapeutic interventions across the entire continuum of health to sickness in a personalised [sic] manner based on the person’s individual risk profile” [15].

IV. REGULATION, REIMBURSEMENT, AND LEGAL ISSUES

CHALLENGING EXISTING FRAMEWORKS

409

interventions across the entire continuum of health to sickness in a personalised[sic] manner based on the person’s individual risk profile” [15]. The different implications of this “proactive” model of healthcare for nonreproductive and reproductive purposes of screening are significant. “[A]n important distinction is that while non-reproductive screening for conditions for which there is no treatment or prevention is considered problematic, reproductive screening is often specifically focused on such conditions” [15]. Clinicians should reflect on the difference in clinical utility assessments in nonreproductive and reproductive decision-making. These different uses will have implications for all three major ethical challenges to NGS, how to determine which results to return, privacy and confidentiality practices, and informed consent. Traditionally, different models of all three are in place for diagnostic, clinical tests as opposed to screenings, but NGS brings these two practices much closer together.

Research and Clinical Care

NGS technology first emerged in the research realm in 2006, and as early as 2009 it was being used selectively for clinical purposes. This expedited introduction of a novel and still investigational technology blurs the already precarious distinction between research and clinical care. While the same ethical principles of autonomy, beneficence, and justice apply to both the research and clinical realms, these principles, and the actions they entail, are quite different. For example, while clinicians clearly have an obligation to make choices that are strictly motivated by the welfare of their patients, researchers are more often required simply to do no harm, since their ultimate goal is not the benefit of individual patients but increasing knowledge to benefit patients as a whole, in the future. This difference manifests in more robust requirements for informed consent in the research context than in the clinical one, since it is central to the ethics of research that a person autonomously chooses to undertake the risks posed by research and recognizes that he or she is not receiving clinical care. On the other hand, the duty to promote individual welfare is more extensive in the clinical context than in the research one. While researchers are usually not required to return genetic results that emerge through their scientific investigations, and rarely return results as new knowledge emerges in the future, the default requirement for clinicians is to return results that could be of use to their patients.

Another challenge to the research/clinical care distinction is that many NGS laboratories are currently research laboratories. On the face of it, this is merely a pragmatic distinction, since "[w]hole genome sequence data collected in the clinical setting are indistinguishable from whole genome sequence data collected in the course of research, and data increasingly move back and forth between the clinical and research settings" [19]. But in spite of the practical equivalence between research sequencing results and clinical sequencing results, there are important differences. The fact that much of current NGS technology is housed in research laboratories means that the handling, use, and interpretation of the genetic information is not intended for clinical use.

In 1988, as a result of public and congressional concerns about the quality of laboratory testing in the United States (prompted specifically by deaths from incorrect Pap smear results), the Department of Health and Human Services (HHS) developed specific standards for all laboratories that conduct tests on human specimens whose analytic results are to be used for diagnosis or patient management (i.e., clinical care). These standards and the regulations that enforced them became the Clinical Laboratory Improvement Amendments (CLIA) [20]. At the time CLIA was written, genetic testing was in its infancy, so no specific provisions were made for regulating genetic tests. Although numerous bodies have recommended specific standards for genetic testing, such standards still do not exist. That being said, genetic testing for clinical use must satisfy the same CLIA standards as all other clinical testing; this requirement does not apply to investigative tests intended only for research use [21]. The implications of this divide for NGS, which currently is not available to all clinical laboratories and is often accessed through research venues, are vast.
Although the trend toward "bench to bedside" translational research seeks to smooth the transition from research data to clinical use, the oft-mentioned bottleneck of information that is not quite ready for clinical use continues to grow, and clinical demand sometimes brings this information into the clinic before it is validated and quality controlled. Thus, whether lab results come from a clinical or a research laboratory is a crucial ethical consideration, since the harms of false positives, false negatives, or other analytic errors can have huge implications for individual patients. This issue will be revisited in the discussion of analytic validity below.

Another issue is the blurring of the justifications for clinical and research uses of NGS. While research knowledge increases proportionately with the increased employment of NGS, clinical justification requires caution and further patient considerations. One example of this tension comes from a plea for expanded newborn screening: "The question of whether the trend toward increasingly broad tests is in part driven by the need to obtain data for scientific research also applies to other forms of diagnostic testing and screening. As long as this is a supplemental goal that does not itself determine the scope of the test, this does not have to be a problem" [15]. But it is not clear that research benefit would be a mere supplemental goal of increased clinical uses of NGS. One researcher puts the point plainly: "There is hope of developing and evaluating effective therapies only with early presymptomatic identification of the disorder and the availability of sufficient numbers of presymptomatic patients with rare disorders" [22].

Individuals and Families

Another ethical framework that is troubled by the emergence of genetic testing in general, and even more so by NGS, is the distinction between the ethical obligations to individual patients and the obligations to their blood relatives. The idea of an autonomous choice is that an individual has the right to determine his or her own destiny, but in the context of decisions about obtaining or avoiding genetic knowledge, this directly impacts the destiny of others. As was mentioned as a rationale for genetic exceptionalism, "genetic/genomic information has the potential to impact not only the individual who gets tested, but also that person's blood-relatives, ancestors and descendants" [6]. This point, and concerns about its ethical implications, has been echoed throughout the literature [17,23–25]. The amount of impact varies from large to small, starting with immediate blood relatives and extending, in lessening amounts, to other relatives, ancestors, and descendants. "An individual's genome reveals half of the genome of his parents and children and a substantial fraction of his siblings" [23].

A further challenge comes when the person at issue is not a living patient but a deceased one. Patients' relatives, like the patients themselves, may have a legitimate interest in receiving some results or at least having access to them. This issue becomes especially salient when the patient has passed away, which is not uncommon in studies involving patients with cancer [26]. Thus, issues of disclosure to relatives, whether the patient is alive or deceased, should be anticipated and addressed explicitly ahead of time, in the informed consent process.

"What to Disclose" Is Becoming "What Not to Disclose"

NGS poses a final challenge to existing ethical frameworks due to the sheer amount of information it can provide. This exponential increase in yield creates a shift in the default of information disclosure. Traditionally, the key ethical question has been, "How much health information should be disclosed to a patient?" The discussion that followed would focus on the information it was possible to obtain, and would determine its type, its ethical implications, and whether it should be tested for and ultimately disclosed. With the emergence of technologies that test for vast amounts of DNA sequence at once, for the same price, the question of what to disclose no longer takes place at the point of discovery. The trend is toward testing everything that can be tested for, and then asking which results should not be returned. "[T]he question becomes not what should be tested for, but rather what should not" [15].

This shift from deciding what people should know to deciding what people shouldn't know is huge, for two reasons. First, it is much more ethically problematic to limit information disclosure once that information exists. Previously, if a test wasn't run, there was no information to be withheld. Now, the test obtains the information, and the question is whether this information can and should be kept from patients. Second, research on status quo bias shows that when given the choice between a default and an alternative, people are much more likely to choose the default [27]. Prior to NGS, the status quo was not to learn anything besides what a targeted test revealed, and the choice was whether to have a targeted test or not. With NGS, the default is to test for everything, and the choice is to limit that information. For both of these reasons, it is likely that people will choose to know much more about their genetic selves in the world of NGS than they chose to before, even if only the technology, and not their views of the role of genetics in their lives, has changed.

One arena where a parallel to this movement can already be seen is the screening of newborns. With the advent of tandem mass spectrometry (TMS), the technological and economic capacity emerged to screen newborns for a broad spectrum of disorders. These disorders far exceed those which, unless detected and treated early, cause physical problems, developmental delay, and in some cases, death [28]. Tests for the latter types of disorders, like Phenylketonuria, Congenital Hypothyroidism, and Maple Syrup Urine Disease, are clearly beneficial and ethically unambiguous. But with TMS, and even more so with NGS, the ability and likelihood of screening for disorders well beyond this inner circle of ethically straightforward diseases increases exponentially. Several organizations have voiced concerns about this technological momentum.

Once personalized genomic medicine becomes standard medical practice for adults, the logic of providing physicians with this powerful tool earlier and earlier in the patient's life may prove inescapable. —President's Council on Bioethics [29]

The technology could be expanded to screen for additional disorders as mutational analysis or other multiplex technology become available, with decisions being based more on what not to screen for (perhaps Huntington's Disease) than on what to include. —National Institute of Child Health and Human Development (NICHD) [22]

In addition to the momentum identified by the President's Council and the NICHD, social pressure from parent organizations and consumer organizations will further motivate expanded uses of NGS [15]. Each of these challenges to existing frameworks will resurface in the three major ethical issues discussed below. They make the ethical analysis of NGS much more challenging, since they do not map neatly onto preexisting healthcare practices and their associated ethical norms.

NOTIFYING OF RESULTS

Introduction

At an abstract level, the obligation to return results derives from both the ethical principle of beneficence and the ethical principle of autonomy. At least some results of NGS are likely to be beneficial to a patient's physical or psychological wellbeing, or else to benefit his or her reproductive decision-making and life-planning decisions. Especially in the clinical, as opposed to the research, context, there is a strong obligation to provide benefit to patients, even in a situation of limited resources or capabilities. Likewise, respect for autonomy supports returning the results of NGS to patients. Even if laboratories or clinicians have reservations about the beneficence of returning certain results to patients, the default should be to allow patients to decide for themselves whether the risks of receiving certain results are outweighed by the benefits [30].

Once this obligation to return results is acknowledged, however, the ethical discussion is not over. Like many other aspects of clinical care, the devil is in the details. Once a clinician has determined that he or she does have an obligation to return NGS results, especially those that were not explicitly sought, he or she must then make many further determinations. Some facets of this question include defining what kinds of results there are, and the implications that correspond to each one. Much NGS output is in the form of raw data, sometimes from a research lab, with unclear validity and interpretation. This type of information is a far cry from the traditional notion of a result of a clinical test that yields a definitive diagnosis. Another facet of the question is whether and how to distinguish between the results that NGS is employed to find and the other information that it will inevitably yield, which is termed "incidental findings." Finally, even if decisions are made in the present on how to interpret the yield of NGS and under what conditions to return results, the ever-changing nature of the technology and of genetic knowledge means that information that may not constitute a returnable result today may meet those conditions in the future.

Even if a determination of a returnable result can be made, numerous questions remain. When should results be returned to patients (when they take the test, when they come of age, as sufficient information emerges, continuously, etc.)? How should we evaluate the results and the conditions under which they should be returned? And finally, how should clinicians make the practical decision on which NGS results ought to be returned to their patients in a given situation? Each of these issues will be discussed in this section, culminating in a discussion of several recommendations for clinicians who must decide which results to return.

Different Kinds of Results

Research Results Versus Clinical Results

As mentioned above, most genetic data have been accrued in a research context. This raises data quality issues, and the question of whether the yield of NGS performed in a research context constitutes "results." Currently, there is a low threshold for allowing technologies to move from the research to the clinical context, which places a greater burden on healthcare providers to use their own discretion to determine whether genetic testing used in research is ready for prime time [31]. Some worry that this leads to genetic tests being used in the clinical environment with insufficient scientific foundation [9,32,33]. This issue hearkens back to the discussion of the CLIA certification of clinical laboratories, which is not required of research laboratories, and the differing reliability of results from the two settings.

Raw Data

A second issue emerges because the information first yielded by most genetic sequencing technologies is in a raw form, and is meaningless both scientifically and clinically until it is analyzed in some way [34]. Raw sequence data must go through several phases of interpretation before they constitute meaningful results, including "1) short-read mapping, or alignment of each sequence read to a reference genome to identify the genomic sequence represented on the short read; 2) base calling at every genomic position covered by aligned short reads; and 3) identification of sequence variation from the reference genome" [35]. (A toy sketch of these phases appears at the end of this subsection.) Views on the appropriateness of returning raw data seem to be divided between clinicians and patients. In one study, the former worried that releasing raw data would lead to costly and unnecessary follow-up and cause unnecessary anxiety. Patients, on the other hand, viewed their raw data as their property and theirs by right, to explore or neglect as they so chose [36].

Probabilistic and Susceptibility Information

A third issue is to determine when probabilistic or susceptibility information constitutes a "result." There are vast differences between "predictive" genetic testing that identifies a dominant and highly penetrant gene that is a sufficient cause for a disease, and tests that yield information on "susceptibility," where there is no single causative gene, but the gene found must interact with both other genes and the environment before a disease occurs. This yields a spectrum of genetic "results," from 100% deterministic, like the gene for Huntington's Disease, to more or less probabilistic variants, such as those implicated in asthma, heart disease, and many cancers. For many conditions, environmental factors have a stronger influence on risk than genetic factors, and in others multiple different genes play a role in disease risk [15]. Some argue that this distinction is not merely a quantitative difference in severity, but rather that the distinction between simple (predictive) and complex (probabilistic, susceptibility) is a qualitative one, with vastly different implications for the patient [37].

Variants of Unknown Significance (VUS)

Another questionable "result" from NGS is the VUS. Although NGS such as WES has the capability of focusing on coding regions with more likely functional significance, this does not mean that we know what those functions are (yet) [38]. Do VUSs constitute "results"? A further challenge with VUSs is that they do not affect all populations equally. These types of "results" are more likely to be found in minority populations, even in the context of genes with mutations of known clinical significance, like BRCA1 and BRCA2 [39]. Clinical decisions on how to treat VUSs will have different impacts on different populations, especially minorities.
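Returning to the raw-data issue above, the three interpretive phases quoted from [35] can be made concrete with a deliberately naive sketch. The Python fragment below is a toy model only, assuming a short reference string and exact per-base tallying; real pipelines use sophisticated statistical aligners and variant callers, and every name here is invented for illustration.

from collections import Counter, defaultdict

REFERENCE = "ACGTACGTACGT"  # toy stand-in for a reference genome

def align(read, reference):
    # Phase 1, naive "short-read mapping": place the read at the position
    # with the fewest mismatches against the reference.
    best_pos, best_mm = None, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = sum(a != b for a, b in zip(read, reference[pos:pos + len(read)]))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos

def call_variants(reads, reference):
    # Phase 2, "base calling": tally the bases observed at every covered
    # position. Phase 3: report positions whose consensus base differs
    # from the reference.
    pileup = defaultdict(Counter)
    for read in reads:
        pos = align(read, reference)
        if pos is None:  # read could not be placed; treat as unmapped
            continue
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    variants = []
    for pos, counts in sorted(pileup.items()):
        consensus = counts.most_common(1)[0][0]
        if consensus != reference[pos]:
            variants.append((pos, reference[pos], consensus))
    return variants

# Two reads support a T-to-A change at position 3; one matches the reference.
reads = ["ACGAAC", "CGAACG", "GTACGT"]
print(call_variants(reads, REFERENCE))  # prints [(3, 'T', 'A')]

Even in this toy form, the sketch shows why raw reads are not yet "results": nothing in the read files themselves says which positions vary, let alone what a variant means clinically.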
Incidental Findings

An important yield of NGS that has been discussed at length in the ethical literature is the status of incidental findings identified during the course of sequencing or analysis. Incidental findings have traditionally been defined as "a finding concerning an individual research participant [sic] that has potential health or reproductive importance and is discovered in the course of conducting research but is beyond the aims of the study" [40]. Although incidental findings have largely been defined in the research domain, the same issue of findings that were not part of the initial diagnostic question or screening focus applies to the clinical context as well. Examples of incidental findings in medicine broadly include

variant[s] of potential clinical importance beyond the variants or genotype/phenotype associations directly under study. . .a finding of misattributed paternity or parentage in a genetic family study, an unexpected mass or aneurysm visualized in the course of structural magnetic resonance imaging (MRI) of the brain, and an unexpected mass at the base of the lung discovered in computed tomography (CT) colonography. [41]

There are two tensions in calling unintended findings in NGS testing "incidental." The first is that, while certain findings may not be intended in the motivation for testing, it is difficult to claim that, when searching a person's whole genome or exome, one does not expect to find any mutations or variants with implications beyond that motivation. The notion of incidental findings in genetics is rooted in the traditional approaches, where the tests focus on targeting a few genes, or else sequence more broadly nonfunctional variants that may or may not be linked to functional changes. These two approaches have minimal chances of finding results that are both unexpected and clinically significant. NGS, on the other hand, is almost assured to find unrelated variants with clinical utility, including autosomal dominant diseases, carrier status for recessive diseases, and genetic risk factors [38]. Such findings become more likely the broader the array of genetic screening, and they are almost unavoidable in NGS, where upward of four million variants from the reference sequence can be observed at once. Of these, each person has an average of 50–100 heterozygous variants suspected to cause inherited disorders [42]. "It is already dubious to label findings of genome-wide diagnostic testing using current techniques as 'chance findings' or 'unexpected findings'; once all sequences within the genome are analyzed, the position becomes essentially untenable" [15]. For this reason, some have replaced the concept of "incidental" findings with more appropriate terms like "unsought for findings" or "unsolicited findings" [26], "serendipitous and iatrogenic findings" [43], "nonincidental secondary findings" [43], "unanticipated findings" [43], "off-target results" [43], and, most reflectively, "secondary variants" [44]. This linguistic debate is a reaction to the implication of the word "incidental" that these results are unexpected or accidental.

Once the notion of these findings as unexpected is abandoned, different ethical obligations result. Some groups, like the ACMG, now recommend that clinicians and laboratories "actively search for the specified types of mutations in the specified genes" that they consider appropriate for return [43]. This recommendation radically changes the discussion. Instead of distinguishing between results that are intended and those that are not, it urges clinicians to look at all results of NGS on the same level and then determine which should be examined and returned and which should not. As a result, among the recommendations below, some apply the criteria of return only to incidental findings, while others apply them to all results of NGS.

The second threat to the notion of "incidental" findings is that many (although by no means all) uses of NGS may not have any particular question in mind. In other words, one major reason for using NGS is to identify all variants in an exome/genome [38]. In the research context, this may involve secondary analyses looking for patterns, while in clinical testing it may manifest as a whole genome test seeking some explanation for a clinical symptom or problem. In either case, untargeted analysis is taking place, so the idea that a finding is "incidental" is more difficult to establish [45,46].

Changing Status

A final challenge is that all of these qualities that bring the status of "result" into question are not static but can change over time. "At the time of the initial test, we will have a clear set of known parameters to work within when interpreting the findings, but over time these parameters will change and with this the significance of genetic findings at an individual level will also alter. For example, a test result from an analysis conducted in 2012 may be inconclusive, but that data may become clinically relevant in 2020 in light of new scientific advances" [17].
Those performing NGS in the clinic need not only to establish procedures for determining what constitutes a "result" in the present, but also to plan ahead for dealing with variants that may emerge as "results" in the future.

How Do We Categorize Which Results to Return?

At first, this appears to be the question regarding returning results that is easiest to answer. Bioethicists and policymakers have converged on a framework to assess when genomic information is ethical to bring to the clinic, often called the "ACCE model" [47]. This model defines four components: Analytic validity; Clinical validity; Clinical utility; and Ethical, Legal, and Social Issues (ELSI) (Table 24.1).

Analytic Validity

Analytic validity is defined as a test's ability to accurately and reliably identify a particular genetic characteristic [30]. As NGS is still a relatively new technology, if not performed in a rigorously validated CLIA-certified laboratory it may not yet be as reliable as Sanger sequencing or other conventional methods of DNA sequence analysis [43]. Further, many existing publicly available genome annotation databases contain high amounts of error, by some estimates affecting more than 25% of the entries [35]. This high error rate is part of the motivation for confirmatory tests through Sanger sequencing, especially when the initial NGS was run in a non-CLIA-certified laboratory.


TABLE 24.1 ACCE Model

Analytic validity: Test accurately and reliably identifies particular genetic characteristic
Clinical validity: Genetic characteristic accurately and reliably identifies or predicts a phenotypic characteristic
Clinical utility: Leads to improved health outcome, patient welfare
  - Clinical actionability: Therapy or preventive action available
  - Personal utility: Personal response exists
ELSI: Ethical, Legal and Social implications of the test for individual and society

Clinical Validity

Clinical validity refers to how accurately and reliably a genotype identified by a test predicts or corresponds to a phenotype [30]. Unlike analytic validity, which depends on the quality of the sequencing test itself, clinical validity depends much more on the state of the genetics field in which the test is employed, including how much knowledge there is about genotype/phenotype relationships, as well as gene/gene interactions and gene/environment interactions. For many identified genotypes, research has not yet discovered with sufficient certainty the phenotypes that correspond to or are associated with them. The ACMG recommends only returning results for "variants that have been previously reported and are a recognized cause of the disorder or variants that are previously unreported and are of the type which is expected to cause the disorder" [43]. An important note about both analytic and clinical validity is that they are concepts of degree; there are rarely cutoffs or solid conclusions about validity, but more often the amount of validity depends on the state of the field, the context of use, and the discretion of clinicians regarding the sufficiency of the evidence.

Clinical Utility

A result has clinical utility when it leads to an improved health outcome [30]. Assessing clinical utility requires taking into account the natural history of the clinical disorder (at what age does it manifest?), the availability and effectiveness of treatments for the disorder, as well as possible actions to slow or treat the symptoms of the disorder. Some guidelines limit clinical utility to those findings for which confirmatory approaches to diagnosis exist, for which preventative measures or treatments are available, or for disorders in which individuals with the mutations would be asymptomatic for a long time [43]. Current scholarship often speaks of clinical utility in terms of clinical actionability. Criteria for clinical actionability are [48]:

1. Practice guidelines for the genetic condition exist, in resources such as GeneTests, Online Mendelian Inheritance in Man (OMIM), guidelines.gov, PubMed, OrphaNet, and Clinical Utility Gene Cards.
2. Practice guidelines suggest an action in (a) patient management, (b) surveillance or screening, (c) family management, or (d) circumstances to avoid.
3. Actions are effective.
4. Actions are acceptable to the individual in terms of burdens or risks.

The question of clinical actionability is the most straightforward for clinicians to assess, as opposed to NGS laboratories or patients. "Actionable means that disclosure has the potential to lead to an improved health outcome; there must be established therapeutic or preventive interventions available or other available actions that may change the course of disease. Actionable may include surveillance and interventions to improve clinical course, such as by delaying onset, leading to earlier diagnosis, increasing likelihood of less burdensome disease, or expanding treatment options" [49]. Results can have clinical validity yet lack a corresponding action in a clinical context. Examples of clinically valid results that are not clinically actionable include variants with pharmacogenomic implications (which are only actionable if the relevant drug is prescribed), incurable single gene disorders, and risk alleles like apolipoprotein E (APoE) [42]. But these examples open up a new question: are there results that have clinical utility (i.e., can increase patient welfare) that do not derive it from clinical actionability?
A growing movement argues that, in order to adequately assess overall utility, clinicians should look beyond clinical actionability and ask whether results have personal utility to patients.


Personal Utility

The main insight behind the notion of personal utility is that health information may have meaning and use for patients beyond any clear clinical response. Personal utility can be defined as the advantages or disadvantages of genetic information for individuals even in the absence or ambiguity of remediating clinical action, encompassing both psychological effects and behavior change [50]. Some evidence suggests that genetic risk information may decrease preferences for health behavior changes and increase preferences for biological interventions [51–54]. Other evidence suggests that it may improve patient compliance with their physician's recommendations. More broadly, having genetic information may improve an individual's general sense of self-control and guide informed decision-making in his or her life choices. These choices may include reproductive decisions, end-of-life preparations, family dynamics considerations, and many others.

Ethicists, policymakers, and clinicians agree that returned results should be valuable, and not harmful, to patients' physical and social wellbeing. But beyond this agreement, the determination of which types of results have positive and negative effects on this wellbeing is a matter of much contention. Many scholars argue that returning broad genetic results that encompass illnesses lacking preventative or therapeutic responses, like Alzheimer's Disease, is ill-advised, as it will lead to negative psychosocial outcomes such as distress and anxiety [55–57]. Other scholars and working groups believe that, in spite of the current lack of therapeutic or preventative responses, results of broad genetic testing that include the APoE gene should be returned precisely because doing so will lead to positive psychosocial outcomes such as a sense of psychological empowerment, positive life-planning responses, and positive health behavioral changes [58,59]. While professional policies and the clinicians who implement them are often reluctant to return predictive genetic results for diseases with no clinical recourse, preliminary research on the issue shows that the psychological harms of doing so are more hypothetical than real.

Some studies have shown negative psychosocial impacts of genetic testing on both those who receive negative results and those who receive positive results. These include guilt, which encompasses attitudes such as "survivor guilt" and the guilt of knowing that one has transmitted a disease to a child [60], as well as depression and anxiety. On the other hand, emerging studies are showing that many of these concerns may be blown out of proportion, or else not balanced against the positive psychological effects that these tests can provide. Groundbreaking work in this area has been done in the context of the Risk Evaluation and Education for Alzheimer's disease (REVEAL) study, which evaluated the impact of APoE ε4-positive results given to asymptomatic people with one living or deceased parent affected by AD after the age of 60. Unlike the predictive testing available for Huntington's Disease, genetic tests for Alzheimer's are for susceptibility only, further decreasing the likelihood of clinical actionability. One year after disclosure, participants who were told they were APoE ε4-positive were significantly more likely to perform AD-specific behavior changes (including diet, exercise, and vitamin/medications) than either those who were told they were APoE ε4-negative or those in the control group.
Further, there were no significant differences in anxiety, depression, or test-related distress between the groups (although those receiving a negative result had lower levels of test-related distress than the positive group) [59]. Surveys of public attitudes regarding genetic incidental findings have found that, at least hypothetically, 69% of people wanted information of unclear risk [61]. In one study, among focus group participants, the majority wanted to receive results of genetic research on their samples. Their articulated reasons went well beyond clinical actionability, mentioning such motivations as empowerment, control, planning their lives, or living them more fully. Even when explicitly told that a result would implicate a disease with no existing treatment or prevention, study participants still felt that the knowledge would be useful. “Even if there’s nothing you can necessarily do about it, at least you kind of know. So, you can change your diet, or, you know, exercise more, or, whatever. . .at least feel like you have some kind of control over what’s going to end up happening even though you really don’t” [62]. The desire to receive results with unknown significance varied among the participants, indicating that there is not a unanimous sentiment for or against receiving ambiguous or uncertain results. Some have spoken of this diversity of patient preferences by breaking people down into types: the information seekers, the information avoiders, fatalists, and altruists [63]. Potential patients give multiple examples of personal utility of results, even without clinical actionability, including the possibility of changing the way one lives his or her life; choice of retirement, career, and school plans; reproductive plans; changing priorities; opportunities to get one’s affairs in order; and finally as a meaningful explanation for their health and sickness [64]. These types of findings have been borne out in the context of other results, such as those that implicate Huntington’s Disease [13], Type 2 Diabetes [65], and even autism [66].


These findings indicate that (1) even in the absence of proven preventive measures for a disease (i.e., clinical actionability), genetic information empowers many people to make changes in their lives, and (2) genetic information has a power to impact people beyond the information gained from clear family history risk factors. These both weigh in favor of the beneficence of returning results to patients, even without clear clinical actionability. Further, respect for autonomy suggests that the weighing of the advantages and disadvantages of returning results (both clinical actionability and personal utility) should at least partially be up to the patient.

Deciding whether to return results solely on the basis of personal utility is difficult. One reason for caution is that personal utility is notoriously difficult to measure [67] and may vary by context within the same individual. On the other hand, the nonclinical risks and benefits to patients of receiving their NGS results may outweigh the clinical risks and benefits. In addition to personal utility, NGS can subject patients to personal harms as well, because "whole genome sequence information is uniquely connected to our conceptions of self" [19], which has implications for individuals in society. Examining how patients may be harmed by NGS leads to consideration of the realm of ELSI.

ELSI (Ethical, Legal and Social Implications)

ELSI considerations refer to the ethical, legal, and social implications of genetic testing in society, both for the individual and his or her relatives, and for broader issues that affect communities and society at large. On the individual level, many people worry about two specific harms that could impact the decision to engage in NGS: stigma and discrimination.

The word "stigma" originally referred to a physical scar but evolved to refer to a mark of shame or discredit [68]. In the case of genetic information, the concern is that, if known, certain genetic conditions or predispositions will be treated by others as a mark of shame or discredit. Although this appears inconceivable in the cases of genetic predispositions to cancer or rare genetic disorders, today genetic sequencing reflects a much broader swath of conditions, such as psychiatric disorders, that are stigmatized by society. Further, NGS will yield extensive nonhealth information about patients, such as behavior, cognition, and personality. Unique characteristics "like refractive errors. . .digital clubbing. . .cryptorchidism. . .ear wax. . .bitter taste reception. . .freckling. . .male baldness. . .or hair morphology. . .is predictable together with behavioural traits like aggression. . .or anxiety type disorders. . .There is a gene thought to influence divorce rate while alcohol dependency. . .and addictive smoking may be detectable as well. . .One even became famous as the 'god gene'. . .for being connected to religiosity, and another gene was assumed to influence intelligence. . ." [69].

Beyond the concern of stigma is the concern that others will take discriminatory action based on knowledge of a person's genetic traits. Commonly cited discrimination worries include insurance (health, life, long-term care, disability), employment, financial backing or loan approval, educational opportunities, sports eligibility, military acceptance, and eligibility for adoption. Genetic discrimination refers "to the perceived unfair treatment of individuals or their family members based on presumed or actual genetic differences as opposed to physical features" [70]. Discrimination in health, life, and disability insurance is a prevailing concern with these types of testing, especially if results are entered into a patient's medical records; numerous surveys of patients indicate this worry [6]. The Congressional enactment of the Genetic Information Nondiscrimination Act (GINA) on May 21, 2008 may provide some solace to patients and healthcare providers regarding this ELSI concern. A further consideration is that the number of documented cases of discrimination on the basis of genetic test results is small [70]. On the other hand, it must be noted that GINA has rarely been tested in court, nor does it apply to life, long-term care, or disability insurance. It also does not apply to group-based, as opposed to individual, discrimination.

When companies make decisions based on genomic information, there is some debate regarding whether this always constitutes unfair discrimination. For example, some genetic factors may increase a person's susceptibility to harm in certain job environments. In 2002, the Supreme Court found in favor of the Chevron Corporation when it refused to hire a man based on a preexisting liver condition that was likely to be exacerbated when exposed to toxins on the job. Analogous findings are increasingly likely in the case of NGS, and the implications for hiring may be vast.
This employment use, along with cases where an employer offers health or genetic services as part of a wellness program, is an exception to GINA. Despite frequent calls for research, there is little evidence of either the nature or the prevalence of genetic discrimination against asymptomatic people who carry clinically significant genetic mutations (independent of family history) [51]. A study in Canada of 233 people at risk (both tested and untested) for the Huntington's Disease variant found that 39.9% of respondents had experienced genetic discrimination, most often in life and disability insurance (29.9%), family (most often reproductive decision-making and relationships) (15.5%), and social (12.4%) settings [70]. On the other hand, discrimination in employment, healthcare, and public settings was much less frequently reported. Also, a majority believed that they experienced discrimination due to their family history rather than genetic test results. This suspicion was borne out in the recent case that first invoked GINA, which was resolved in favor of a woman who was not hired due to the disclosure of her family history of carpal tunnel syndrome (it is of note that although this case invoked GINA, no direct genetic testing was utilized in the discriminatory practice) [71].

While many ELSI issues overlap with the issues of personal utility mentioned above, their scope is greater. One ELSI concern is the cost-effectiveness of genetic testing. For example, it has been shown that warfarin-related genotyping is not cost-effective for typical patients with atrial fibrillation because it resulted in a cost of more than $170,000 for every quality-adjusted life year gained from the testing [72]; on the other hand, this testing may be cost-effective in populations at high risk for hemorrhage. Then again, cost-effectiveness may become less of a consideration as the cost of NGS continues to drop. The implications for practice when cost is no longer a limiting factor are vast. Much as with neonatal screening technologies, when high-throughput technologies are capable of presenting large amounts of information for the same cost as, or even less than, more directed tests, the inclination is that more is better. This tendency is strengthened by pressure from patient groups and from companies looking to develop and patent diagnostic measures [73]. It may not be resistible in medical practice, but it puts a greater burden on ELSI considerations to make sure that the large amounts of information are interpreted and disseminated in a thoughtful and ethical way. Some have cautioned against introducing NGS into the clinical setting before the knowledge base and clinical infrastructure are developed to do so appropriately [9,32]. Even with tests that are increasingly affordable, the cost of subsequent testing and follow-up may burgeon out of control. Further, even if the costs of sequencing go down, the costs of adequately informing and counseling patients on NGS, both before and after the test, will most likely increase. The US medical reimbursement system would need to change in order to cover the extensive counseling and patient education required to ethically return genetic results on this scale.

A final important ELSI concern is how genetic knowledge impacts not only individuals and families, but communities. While there is much debate on the definition of community, the NIH defines it broadly as a group with "characteristics such as biological relatedness, geographical dispersion, social interactions, cultural values, and past experiences" [74]. Especially with regard to small communities such as the Havasupai Indians, the Amish, and Ashkenazi Jews, genetic information on individuals, even if deidentified, can have implications for the broader community. Many communities are concerned about the potential for community stigma and discrimination, as well as potential legal implications of genetic knowledge surfacing about their group [75,76].

Recommendations

Even with the aid of ACCE as a decision tool, practically deciding which results to return, and in what contexts, is a difficult challenge. Decision strategies are helpful whether the decision lies solely with the clinician or serves as a conversation tool between the clinician and his or her patients. Two options can be rejected outright when deciding which results of NGS to return to patients. Full disclosure, if this entails returning all potentially meaningful results, is not currently feasible or ethical. It would require an amount of expertise, time, and labor to perform appropriately that far exceeds the current capabilities of the clinical encounter (this claim is clearly provisional, for as technology, resources, knowledge, and guidance improve, the burden of returning more results will lessen). Not returning any results, on the other hand, is not ethically acceptable, because at least a subset of results will have serious medical implications for the patient. At the very least, these results invoke the doctor's obligations of beneficence and nonmaleficence toward his or her patient. The key question, then, is not whether to return results, but which results to return and how.

Current recommendations vary both in strategy and in particulars. One popular strategy is to divide results into different categories or "bins" and then make different recommendations based upon which category applies to the results. These options range from an obligation to return (Yes), through a negotiation between the clinician and the patient (Yes/no), to an obligation not to return (No). Clinicians would then consult the prevailing scientific evidence at the time to decide which results fit into which category. These strategies usually appeal to the ACCE framework discussed above to determine a positive or negative obligation, but then incorporate patient choice (acknowledging personal utility) in some categories.


TABLE 24.2 Binning Option #1

                                       Known        Presumed            Presumed  Known
Bin                                    deleterious  deleterious  VUS    benign    benign
Bin 1: Clinical utility
  (medically actionable)               Yes          Yes          No     No        No
Bin 2: Clinical validity
  2a: Low risk                         Yes/no       N/A          N/A    N/A       No
  2b: Medium risk                      Yes/no       Yes/no       No     No        No
  2c: High risk                        Yes/no       Yes/no       No     No        No
Bin 3: Unknown clinical implications   N/A          No           No     No        No

Adapted from Berg et al. [48].

TABLE 24.3 Binning Option #2

Bin 1: Genetic findings useful for the current diagnosis of the disease which initially led to the analysis
  Return results? Yes
Bin 2: Any clinically relevant genetic findings which may have immediate benefits for the patient related to present diseases or clinical conditions
  2a: Diseases for which possible treatment is available
    Return results? Yes
  2b: Diseases for which no treatment is available
    Return results? Yes/no (with minors, wait until legal age)
Bin 3: Genetic mutations related to high risks for future Mendelian diseases
  3a: Information about risks of preventable or treatable diseases
    Return results? Yes
  3b: Information about risks of nonpreventable, nontreatable future diseases
    Return results? Yes/no (with minors, wait until legal age)
Bin 4: Information about carrier status of mutations for an X-linked or an autosomal recessive disorder impacting reproductive life decisions
  Return results? Yes
Bin 5: Information of variable risk for future diseases (high disposition for complex diseases; pharmacogenetic variants)
  Return results? Yes/no
Bin 6: Information of unknown significance
  Return results? Yes/no

Adapted from Ayuso et al. [7].

Several binning recommendations currently exist. One noteworthy strategy divides results into three "bins" based upon their clinical actionability and clinical validity, as well as the level of harm they pose to the patient [48]. Only in the case of Yes/no would patients be given the option of choosing whether the results are returned, in conversation with their clinicians (Table 24.2). A more complex binning scheme has been described that divides results differently and incorporates more nuances, placing more emphasis on patient autonomy when clinical actionability (the Yes/no option) is in doubt [7] (Table 24.3). Another simple list breaks down results not by the type of result but by the clinical purpose of the test, differentiating between three types of genetic tests: those intended to make a diagnosis with the goal of improving medical treatment, those intended to assist in reproductive choices, and those used, usually in a presymptomatic manner, to predict future health outcomes [77] (Table 24.4).
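To see how a binning scheme of this kind might operate in practice, consider a minimal sketch that encodes the decisions of Table 24.2 as a lookup function. This is a hypothetical illustration of the logic only, not software endorsed by Berg et al.; the bin labels and pathogenicity class names are invented here, and "N/A" cells are passed through as-is.

def bin_decision(bin_label, pathogenicity):
    # Return "Yes", "Yes/no" (patient chooses with the clinician), "No",
    # or "N/A" for a single variant, following Table 24.2.
    deleterious = pathogenicity in ("known deleterious", "presumed deleterious")
    if bin_label == "bin1":  # clinical utility: medically actionable
        return "Yes" if deleterious else "No"
    if bin_label == "bin2a":  # clinical validity, low risk
        if pathogenicity == "known deleterious":
            return "Yes/no"
        return "No" if pathogenicity == "known benign" else "N/A"
    if bin_label in ("bin2b", "bin2c"):  # clinical validity, medium/high risk
        return "Yes/no" if deleterious else "No"
    if bin_label == "bin3":  # unknown clinical implications
        return "N/A" if pathogenicity == "known deleterious" else "No"
    raise ValueError("unknown bin: " + bin_label)

print(bin_decision("bin1", "known deleterious"))      # Yes
print(bin_decision("bin2c", "presumed deleterious"))  # Yes/no
print(bin_decision("bin3", "VUS"))                    # No

The "Yes/no" return value is the point at which such software would defer to the informed consent conversation rather than decide on its own.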


TABLE 24.4 Binning Option #3

Bin 1: Testing to diagnose and improve treatment (diagnostic genetic tests; pharmacogenetic tests)
  Return results? Yes
Bin 2: Testing to assist in reproductive choice (genetic carrier screening; preimplantation genetic diagnosis; prenatal genetic screening)
  Return results? Yes/no
Bin 3: Predictive genetic testing
  Return results? Protected access: return on condition of clinician training and decision support tools

Adapted from Darcy et al. [77].

TABLE 24.5 Consensus and Dissensus about Binning
(percentage of surveyed genetics professionals who would include, sometimes include, or exclude each type of result)

Type                                  Include   Sometimes include   Exclude
Serious and treatable                 94        6                   0
Pharmacogenetic information           75        22                  3
Serious and untreatable               57        40                  3
Carrier status                        73        26                  1
Adult-onset                           67        31                  2
Risk for multifactorial condition     41        41                  18
Information of unknown significance   29        46                  25
Social implications                   13        48                  39

Adapted from Lohn et al. [78].

The major problem with binning is that clinicians may diverge on how they place results into these categories. In a survey of geneticists and genetic counselors in Canada regarding their views on returning results to patients [78], there was convergence, with regard to adult patients, on returning some categories of results, such as serious and treatable conditions and those that indicate pharmacogenetic information, but much more divergence in other categories. With regard to pediatric patients, there was even less consensus (Table 24.5). Perhaps the divergence is less marked when particular patients are at issue, since clinicians can then utilize contextual factors and personal utility to make consistent decisions. On the other hand, a clinician's lack of knowledge of these different categories and their respective impacts on patients is likely to far exceed that of genetic counselors and geneticists, so one would expect the divergence to be even more stark in the nonspecialist clinical setting.

Utilizing a consensus of genetics experts seems a useful and beneficial strategy to avoid this issue, and this is the strategy employed with "Listing." Unlike Binning, Listing is a strategy where experts in genetics come together and decide precisely which results should be returned. While it is likely that experts will invoke considerations similar to those underlying Binning strategies, the decision-making process takes place solely at the expert level, and only a list of specific recommendations (return results for variant X, do not return results for variant Y) is utilized at the clinical level. The Working Group on Incidental Findings in Clinical Exome and Genome Sequencing of the ACMG published a recommended list in March 2013. First, they recommend that clinical sequencing laboratories actively seek out the variants that they list and return them to clinicians, who will in turn return them to their patients. Second, they believe that the rationale for returning or not returning results is objective, and thus does not require the input of patient preferences or even clinician discretion (it should be noted that these recommendations have been strongly criticized by some groups [79]). The Working Group ultimately settled on a "minimum list" of findings to report from clinical sequencing, based on the following considerations [43]:

1. The existence of confirmatory testing.
2. Disorders where preventative measures or treatments are available.
3. Disorders which may be asymptomatic for long periods of time.
4. Variants that are "Pathogenic," defined as a recognized cause of a disorder or of a type expected to cause a disorder.
5. Variants with a high likelihood of causing disease.

A similar but more extensive list was provided not by a working group but through the consensus of a group of specialists in clinical genetics or molecular medicine [80]. This longer list was chosen from variants in 99 common conditions, and, as would be expected, the level of agreement between the specialists waned as conditions became less pathogenic, less clinically valid, and less analytically valid. That being said, concordance was 100% for 21 conditions or genes, and 80% or higher for another 64 conditions or genes.

The major problem with listing is that it is relatively inflexible in the face of changes in genetic knowledge and in clinical responses to genetic disorders. While a 2013 list of reportable conditions is appropriate now, the list would need to be updated extremely frequently in order to keep up with the changing reality of the field. Perhaps this could be remedied with a standing committee of experts or constantly updated resources, but that may not be practical. So where does this leave the clinicians of today? Should they rely on potentially outdated lists, or choose a strategy of binning that may require more background knowledge than they have?
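The mechanics of Listing, whatever its content, are simple to express in software: the laboratory reports a finding only if it falls on the curated list. The sketch below is purely illustrative; the two-gene list is a toy stand-in for an expert panel's actual minimum list, and the record fields are invented.

MINIMUM_LIST = {"BRCA1", "BRCA2"}  # toy placeholder for an expert-curated list

def reportable(called_variants):
    # Keep only the findings whose gene appears on the curated list.
    return [v for v in called_variants if v["gene"] in MINIMUM_LIST]

calls = [{"gene": "BRCA1", "variant": "c.68_69delAG"},
         {"gene": "APOE", "variant": "e4 allele"}]
print(reportable(calls))  # only the BRCA1 finding survives the list

The simplicity is the point, and also the weakness: the entire ethical judgment lives in the contents of the list, which is exactly the component that goes stale as the field changes.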

Recommendations

One recommendation is to bring binning and listing together in order to remedy the weaknesses of both. For example, in the context of cancer genomics, it has been suggested to break results down into bins defined by the nature of the benefit (clear, possible, unlikely, and unknown), with each bin containing a corresponding list of results (or "package") that fall under that category, informed by the current state of the science [26]. This compromise allows packages to be a little out of date, as long as clinicians also know the rationale for each result belonging to each package (Table 24.6).

TABLE 24.6 Compromise Binning and Listing

| Bin | Description | Return results? | Which results? |
| --- | --- | --- | --- |
| Bin 1: Clear benefit | Life-threatening conditions which can be either prevented or treated, or which influence reproductive decisions | Yes | Default package |
| Bin 2: Possible benefit | Potential or moderate clinical or reproductive benefit | Yes/no | Package 2 |
| Bin 3: Unlikely benefit | No clear clinical utility | Yes/no | Package 3 |
| Bin 4: Unknown benefit | Unclassified variants | No | No disclosure^a |

^a For a list of what is included in each package, see [77]. Adapted from Lolkema et al. [26].

A second recommendation is to make NGS more closely approximate last-generation sequencing, so that the decision returns to being "what information to return" as opposed to "what information not to return." This has been recommended through the use of filters, which take advantage of the interpretive gap between raw data and interpreted findings. Filters can be implemented at two stages of the NGS process. At the laboratory stage, filters can be used to decide which data are analyzed and interpreted, regardless of the extent of the genome that is sequenced. This way, there is no direct withholding of information, since raw data are only by a stretch considered meaningful information. "Use of filters allows the implementation of whole genome sequencing as the basis for targeted genetic screening. . . Filters can be used to prevent uncovering unsought information wherever possible" [15]. The second locus of filtering is the clinician stage of the process. For example, the ACMG recommends that laboratories sequence the entire subset of recommended genes and pathogenic mutations defined by its expert panel, and that "patients cannot opt-out of the laboratory's reporting of incidental findings to the ordering clinician." The clinician, however, can filter these findings, drawing on his or her knowledge of the clinical context at hand and of patient preferences to determine which results are returned to the patient [81]. With the emergence of electronic medical records (EMRs), the capability to filter the interpretations and results made available to both clinicians and patients will increase. "A strength of EMR systems is that they not only provide a mechanism to store and organize patient health information but also provide the ability to filter health information based on clinical utility or relevance to individual clinical specialties, provide education and clinical decision support, and implement patient preferences around access to health information" [77].
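To make the two-stage filter concrete, here is a minimal, hypothetical sketch in Python. The variant records, panel, and preference flag are invented for illustration; they do not correspond to any actual laboratory pipeline or to the ACMG gene list.

```python
# Hypothetical illustration of two-stage filtering (not any vendor's pipeline).
# Stage 1 (laboratory): restrict analysis to an ordered gene panel even though
# the whole genome/exome was sequenced. Stage 2 (clinician): apply patient
# preferences before results are returned.

from dataclasses import dataclass

@dataclass
class Variant:
    gene: str
    classification: str  # e.g., "pathogenic", "VUS", "benign"
    category: str        # e.g., "indication", "incidental"

def laboratory_filter(variants, ordered_panel):
    # Only panel genes are analyzed and interpreted; the rest stays raw data.
    return [v for v in variants if v.gene in ordered_panel]

def clinician_filter(variants, patient_declines_incidental):
    # The clinician honors patient preferences about incidental results and
    # returns only interpreted, pathogenic findings.
    reportable = []
    for v in variants:
        if v.category == "incidental" and patient_declines_incidental:
            continue  # withheld per patient preference
        if v.classification == "pathogenic":
            reportable.append(v)
    return reportable

raw = [
    Variant("MYH7", "pathogenic", "indication"),
    Variant("BRCA1", "pathogenic", "incidental"),
    Variant("TTN", "VUS", "indication"),
]
panel = {"MYH7", "TTN", "BRCA1"}
analyzed = laboratory_filter(raw, panel)
returned = clinician_filter(analyzed, patient_declines_incidental=True)
print([v.gene for v in returned])  # ['MYH7']
```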

The challenge with filters is the same as with listing: filtering decisions are based on current levels of knowledge. "What appears to be innocent noise now, may prove to be meaningful later" [15]. But as an electronic method, filtering may be more amenable to updating and revision than listing.

No matter which strategy is used to determine the appropriate results to return, certain recommendations hold in general. First, clinicians utilizing NGS need to make themselves as familiar as possible with the existing databases of known mutations and their clinical status with regard to ACCE. These databases are plentiful, including comprehensive databases of rare mutations such as the consensus coding sequence database, RefSeq, the University of California Santa Cruz (UCSC) KnownGenes database, and the GENCODE and ENSEMBL databases; Mendelian databases such as the Human Gene Mutation Database and Online Mendelian Inheritance in Man (OMIM); disease-specific databases; and NGS catalogs [81]. They are also of varying quality, and so must be used with care as resources for assessing the significance of different results and the appropriateness of returning them. Second, research has consistently shown that what matters most to patients about results is that they are not given surprises. This puts more pressure on the informed consent process (see below) to prepare patients for the types of results that they may encounter, to help them make educated decisions about what they do and do not want to know, and to clarify what they do not have choices about.

PRIVACY AND CONFIDENTIALITY

Introduction

"As we unlock the secrets of the human genome, we must work simultaneously to ensure that new discoveries never pry open the doors of privacy. And we must guarantee that genetic information cannot be used to stigmatize or discriminate against any individual or group." —President Bill Clinton [82]

The commitment of the medical profession to respect a patient's privacy and uphold a patient's confidentiality is as old as the profession itself. Every American Medical Association (AMA) Code of Ethics since the first, issued in 1847, has insisted that physicians safeguard the privacy and confidentiality of what they see and hear in the medical encounter. The centrality of this trust to the medical profession lies in the need for patients to be honest and open with their clinicians, which they will be reluctant to do if they do not feel their information is safe. Moreover, patients are in a particularly vulnerable position vis-à-vis their physicians, where they are more susceptible to harm than in other contexts:

"Patients who are ill already have a diminished sense of autonomy. . . Patients must often expose their bodies, thoughts, behaviors and habits to the power and gaze of medical judgment. The promise of confidentiality not only promotes the free exchange of information between patients and providers, but also protects patients by assuring them that their vulnerability will not be exploited or that the intimate details of their personal life will not otherwise be exposed outside the context of the provider-patient relationship." [83]

Public opinion polls show that patients are not only concerned about the confidentiality of their health information but often go to extensive lengths to avoid disclosing sensitive information where this trust is lacking. These lengths have included "seeking a different doctor, paying out-of-pocket instead of filing a claim with an insurer, avoiding care to prevent disclosures to employers, giving inaccurate or incomplete information on medical histories, and asking physicians to either not record some information or to misrepresent that information in the record" [84]. Unfortunately, these practices reflect concerns that physicians are not able to keep private information confidential. One study showed that while a vast majority of participants wanted genetic results returned to them (95.7% of veterans and 93.1% of nonveterans), only a small minority wanted the results returned to both themselves and their physicians (4.3% of veterans and 6.9% of nonveterans) [85]. Not only are patients concerned about illegitimate access to their health information; they also have little protection from potentially harmful access by those who are currently authorized to access it, including employers, insurers, and managed care companies. These entities cannot access health information without patients' permission, but they can compel this permission as a condition for employment or services [86]. Thus, anyone who hopes to engage NGS ethically in the clinical setting must address issues of privacy and confidentiality. The following section will first clarify some concepts in this domain that are often confused, conflated, or ambiguous. It will then evaluate potential data protection methods that clinicians can utilize to best


protect their patients’ genetic information. Finally, it will discuss the implications of protecting genetic privacy in the current data environment, and how this environment provides both threats and promises to the confidentiality of patients.

Concepts

Before discussing the ethical issues regarding privacy and NGS, it is important to clarify some often confused concepts. The term "privacy" refers specifically to the limitation or restriction of access to persons, personal information, or personal property. "Information privacy" specifically refers to "how much personal information is available from sources other than the individual to whom it pertains" [84]. Informational privacy is important to people because by controlling information about themselves, as well as who has access to it, they can control and shape how they are perceived and treated by others. If privacy is not respected, the result is an "invasion of privacy." When a patient is aware that genetic information is obtained but unaware of the extent of the genetic information obtained (as may be the case if consent is insufficient for NGS), this could constitute an invasion of privacy. In contrast, "confidentiality" refers to entrusting access to information considered private to specific persons, institutions, or entities, which are then obligated to allow further access only with specific permission. It is unlikely in the medical context that private genetic information will be obtained without a patient's permission. The bigger concern is with the confidentiality of this information once it is entrusted to health care providers for the purposes of individual care. If private information that is entrusted to one or a set of entities is shared without permission, this is considered a "breach of confidentiality." In other words, confidentiality refers to the upholding of a promise.

A further set of distinctions concerns how access to private information is allowed. Private information can be given to others, allowing them to possess it and use it as they wish. Although those who gain possession of data may promise to use it only in certain ways, it is nearly impossible to track or enforce violations of these promises. A less permissive approach is to allow others access to information but not possession of it. Many current technologies provide capabilities to directly control how and when things are accessed, and thus to restrict usage in various ways. Finally, people could be allowed to use specimens or data without having either possession or access, such as when people request particular results from a third party without having access to or ownership of the raw data itself [19].

Data Protection Methods

The way that clinicians and health institutions manage the data itself is a key first defense against breaches of confidentiality. "Anonymity" refers to the complete removal of identifiers, so that even if someone obtained access to private information, they would not know whose private information they were accessing. Achieving anonymity of private information, if possible, would be the most secure form of data protection against breaches of confidentiality. On the other hand, without any corresponding information, anonymous data lose most of their usefulness, whether for the patient him or herself, for the healthcare institution, or for researchers. "Data can be either useful or perfectly anonymous, but never both" [87]. For this reason, genetic information is rarely anonymized but is most often merely deidentified. There is a spectrum of data deidentification practices that are less secure than anonymity (but often provide balancing benefits), including coding, third-party coding, encryption, and other methods. In these cases, identifying information is available, but it is not directly linked to the genomic information and requires several steps (which usually require permission) in order to be reidentified. The question remains whether genomic data, and especially whole genome and whole exome data, are ever really secure from reidentification. While some argue that the genome is by definition identifiable information about a person, it is plausible to claim that even with access to whole genome sequence data an individual can maintain "practical obscurity," as "his or her identity is not readily ascertainable from the data" [19]. According to the US Department of Health and Human Services (HHS), genetic and genomic information does not currently count as an identifier, although it is at least plausible that it would fall under the Health Insurance Portability and Accountability Act (HIPAA) category of either "biometric identifier" or "any other identifying number, characteristic, or code." Others argue that the genome is the "ultimate identifier" [88]. Even the Advance Notice of Proposed Rulemaking issued by HHS in 2011, to start the process of amending the regulations that govern human subjects research, explicitly posed the question of whether genomic data are ever truly deidentifiable. In spite of these increasing arguments based on the uniqueness of a person's genetic code, the fact currently remains that those who want to reidentify a person "must expend a lot of effort to do so" [19]. In other words,


"while whole genome sequence data are uniquely identifiable, they are not currently readily identifiable" [19], and it is current coding and deidentification practices that make it so. This belief is what currently makes it possible to assure patients that their genetic information will be safe and protected.
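As an illustration of where "coding" sits on this spectrum, the sketch below labels a record with a keyed pseudonym instead of the patient's identity. The names and key handling are invented for illustration; this is a toy, not a production scheme.

```python
# Minimal sketch of coding as a data protection method: the record carries a
# keyed pseudonym rather than an identity, so reidentification requires the
# separately held key (here an honest-broker key, purely hypothetical).

import hmac
import hashlib

SECRET_KEY = b"held-by-honest-broker"  # in practice, kept by a third party

def pseudonym(patient_id: str) -> str:
    # An HMAC rather than a plain hash: the code cannot be recomputed
    # (and records reidentified by dictionary attack) without the key.
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

record = {"subject": pseudonym("MRN-0012345"), "variant": "BRCA2 c.5946delT"}
print(record)  # identity is not recoverable from the record alone
```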

Data Environment

While comforting and plausible, these strategies leave out something crucial to the appropriate protection of patient confidentiality. Even with the most rigorous deidentification practices, protecting confidentiality requires a "spectrum of conditions to be in place, including ethical and trustworthy behavior by researchers and clinicians, sufficient security of information technology, and policies and laws that hold violators accountable" [19]. Another way of putting this is that protecting information requires not only certain qualities to be true of the data, like appropriate coding practices, but also certain qualities to be true of the broader context in which those data exist, or the "data environment." There are numerous aspects of the current data environment that can threaten these protections.

Untrustworthy People

Confidentiality is intrinsically connected with the idea of trust and trusting relationships. No matter how many careful policies or information systems are in place, if those handling patients' health information are malintended or negligent, no protections can overcome it. For example, in a recent case the unencrypted private health information of over 800 people was accidentally included in public, online PowerPoint presentations [19].

Reidentification

In order to identify someone from their genetic code (without identifiers), one needs only a reference code elsewhere that has connected identifiers [89]. As genetic information is increasingly gathered in numerous places outside the clinical context, in increasing quantities, with varying levels of identifiers throughout the United States and the world, the likelihood of cross-referencing these genetic repositories increases. NIH has the database of Genotypes and Phenotypes (dbGaP), the FBI has the Combined DNA Index System (CODIS), and the Veterans Administration has the Million Veteran Program, to name just a few of the most prominent; sequence data are also being generated by direct-to-consumer companies. Beyond genetic reference databases, there are numerous public databases that include other private information about patients which can be cross-referenced to reidentify them, including voter registration data, census data, and the numerous commercial data sets aggregated by supermarkets, geographical information systems, and Internet engines like Google and Amazon [89]. Various scholars have demonstrated how the data environment can be used to reidentify genetic and genomic data in striking ways [90–92]. In one case, an MIT graduate student felt challenged when William Weld, then the governor of Massachusetts, assured the public that the managed care organization that insured state employees could release deidentified hospital data to researchers with no threat to patient privacy. She was able to reidentify William Weld's medical records by combining the "anonymously" shared medical records with a publicly available voter registration list [90,91]. This was possible because, for 87% of the US population, the combination of ZIP code, birth date (with year), and sex uniquely identifies a single person. A similar point was made in light of James Watson's decision to make his sequenced genome public, with the specific proviso that he wanted all gene information about APOE redacted (Watson was concerned about the association between this gene and late-onset Alzheimer's disease, and did not want this information available).
Even without information on that specific gene, it was shown that Dr. Watson's APOE status could be accurately predicted with the use of advanced computational tools, publicly available data (HapMap data in this case), and the rest of his genome [92]. The implication of this demonstration is that deidentification may prove increasingly insufficient to protect a person's intention to hide genetic information, and even their desire not to know, in the context of the modern data environment. These cases, along with a similar one showing how easy it is to reidentify a Netflix subscriber (and all his or her movie choices) from an anonymized data set and access to the Internet Movie Database [93], should cause clinicians to be concerned about promises of confidentiality that may not be supportable.
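The ZIP code/birth date/sex result can be made concrete with a few lines of code. The sketch below, using invented records, applies the k-anonymity idea from Sweeney's work [90]: any quasi-identifier combination shared by only one record is a candidate for linkage reidentification against an external list such as a voter roll.

```python
# Illustrative check of quasi-identifier uniqueness in the spirit of
# k-anonymity [90]: how many records share each (ZIP, birth date, sex)
# combination? Groups of size 1 are uniquely reidentifiable by linkage.
# The records below are invented for demonstration only.

from collections import Counter

records = [
    ("02138", "1945-07-31", "M"),
    ("02138", "1945-07-31", "M"),
    ("02139", "1962-01-15", "F"),  # unique -> linkable to, e.g., a voter roll
]

groups = Counter(records)
k = min(groups.values())           # the data set is k-anonymous for this k
unique = [q for q, n in groups.items() if n == 1]
print(f"k = {k}; uniquely identifying combinations: {unique}")
```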


Another violation of confidentiality takes the form not of reidentification of an individual, but of what is called "attribute disclosure," "which happens when an individual or group is known to be present in a data set which shows that everybody has a particular trait" [89]. Even in this aggregate form, attribute disclosure can support inferences that individuals within a population are likely to have a trait, such as a disease, predisposition, or behavior. The concern with attribute disclosure is that even without individual information, many companies (e.g., employers, insurers) make decisions based on probabilities. "For example, an individual could be excluded from an insurance policy based on the fact that the majority of individuals living in the same geographic area had suffered from unusually poor health" [89]. This issue invokes the ELSI worry mentioned above about the relationship between genetic information and community risks and benefits. A final concern with reidentification is that genomic information continues to be useful long into the future, both the future of the individual patient and that of his or her descendants. "An encryption scheme considered strong today might gradually weaken in the long term. Consequently, it is not too far-fetched to imagine that a third-party in possession of an encrypted genome might be able to decrypt it, e.g., 20 or 50 years later" [88].

Required/Permitted Sharing

While the notion of confidentiality is based on the idea that individuals have a right to decide who has access to their private information, this right is not absolute. There are three categories of cases in which the right to decide who has access is not entirely in the patient's hands: (1) as a condition for services, (2) as a research resource, and (3) when the wellbeing of someone else is at stake. "At least 25 million times per year, individuals are compelled to sign authorizations to release their health records as conditions of employment, life insurance, or other application processes" [6]. The services that require release of health information range broadly, but include health insurance, life insurance, long-term care insurance, disability insurance, automobile insurance, social security disability insurance, workers' compensation, veterans' disability, and personal injury lawsuits [86]. With the enactment of GINA in 2008, only health insurance and employment discrimination are prohibited. While not "involuntary" in a straightforward sense, making access to genetic information a condition for services means patients may often feel compelled to make a problematic trade-off between their desire for privacy and their need for the services at issue. While it is not within the purview of clinicians to change this reality, it is their duty to ensure patients are adequately aware of it. Clinicians should also be concerned that "every information privacy law or regulation grants a get-out-of-jail-free card to those who anonymize their data," by allowing open sharing and use (most often by researchers) without consent and without oversight in many cases [87]. This sentiment, that anonymized genetic information need no longer be controlled by the patient, is reflected both in HIPAA, which applies only to health information with identifiers, and in Office for Human Research Protections (OHRP) guidance determining that research using only deidentified materials does not constitute "human subjects research" and is therefore not subject to the Common Rule [94].
Several cases show that this perspective is not shared by large portions of the public, such as the class action lawsuits against hospitals in Minnesota and Texas that allowed researchers to access deidentified newborn bloodspots for research, as well as the infamous lawsuit brought by the Havasupai Tribe against the University of Arizona. A final mode of sharing is the possible compulsion to share genetic results that pose risks to a patient's relatives. This is an issue of much contention in scholarship, the law, and medical societies. Three court cases have been brought against physicians for not warning family members of genetic results (Pate v. Threlkel 1995, Safer v. Estate of Pack 1996, and Molloy v. Meier); each ultimately found, in different ways, that the physician has an obligation to a patient's relatives, but that this obligation is not absolute. A physician must take "reasonable steps" to guarantee that immediate family members are warned, but reasonable steps can be taken by imparting to the patient him or herself the importance of warning relatives who are at increased risk. The Presidential Commission in 1998 and the Institute of Medicine Committee on Assessing Genetic Risks attempted to define the extenuating circumstances under which physicians can breach patient confidentiality and warn relatives, but even in these cases the decision is left to the physician. Finally, professional societies such as the AMA and the American Society of Clinical Oncology (ASCO) take the position that patients should be adequately informed about the risks results pose to their family members and assisted in communicating with them, but that no obligation to breach confidentiality exists [24]. The default ethical position from experts in the field is that there should be an attempt to warn relatives through the patient, not around him or her. At the same time, it is conceivable that circumstances could arise in which a physician feels morally compelled to warn a relative of genetic risks, especially those that are clinically actionable, against the will of the patient. In these cases, the most important obligation is to make this possibility clear to the patient before he or she is tested, so that an informed choice can be made.


Recommendations

While these features of the data environment may seem to indicate that there is no way to adequately protect NGS results, all may not be lost. Several recommendations have emerged which maintain that genomic information can be utilized ethically and honestly, even within this existing and evolving data environment. One recommendation is that genetic information should automatically receive heightened protection along with other "sensitive" information in the medical record [6]. A precedent exists for recognizing certain aspects of an individual's health information, such as psychiatric data, as more sensitive than other types of health information. On the other hand, "there is currently no comprehensive Federal legislation that limits access to sensitive health data" [6]. Patient views support the idea of folding genetic information in with other sensitive health information: one study showed that the public tend to see genetic information much like other information, but want sensitive information to have heightened protections [95]. For this recommendation to have any efficacy, the legal and regulatory environment would have to recognize, at a national level, that sensitive information, especially in the healthcare setting, requires added protections. One possible strategy has been suggested: change HIPAA's rule that health data in general may be broadly released for treatment, payment, and health care operations, and limit the sharing of sensitive data to treatment uses [4]. Another recommended strategy is to allow only "controlled access" to "masked" genetic or genomic information in medical records, as with other sensitive information. "[R]esults of genetic tests that seek specific mutations, including tests within research protocols, could be afforded additional protection, as psychotherapy notes have been, because both are sources of potentially inconclusive or stigmatizing information" [84]. On the other hand, since it is important that physicians are aware when information in the medical record is being masked, there should be some indication to providers so that they can request the information directly from the patient, if need be. Another option suggested for clinicians is a "break the glass" mechanism, whereby an authorized user could override masking and other protections of data in the medical record only under emergency circumstances [77] (a toy illustration appears at the end of this section).

Another recommendation is to shift our focus from concern about access to confidential information to concern about its misuse for purposes of discrimination, stigma, and so on. Many emphasize that the risk of identification or reidentification of private information is not necessarily the same as the risk of harm to an individual. For a person to be harmed by a breach of confidentiality, the information not only must be accessed, it must be misused. "Unauthorized access to data is not necessarily a problem in and of itself—despite having access to information, one can choose to not use it, and thereby not produce any harm" [38]. Misuse of information can therefore be more ethically significant than unauthorized access [19]. In this spirit, Schadt suggests "education and legislation aimed less at protecting privacy and more at preventing discrimination. . ." [96]. Likewise, "[Clinicians and] Genomics researchers can only do so much to protect privacy. Ultimately, many of the concerns . . . relate to matters such as insurance and employment; these concerns are really only satisfactorily addressed by appropriate legislative and governance responses, at a higher policy level, and by ensuring fair access to employment and healthcare" [89].

A final recommendation focuses specifically on the implications of sharing requirements for the informed consent process. Most patients have little information about the requirements and allowances for sharing information that is in their medical records. Here, the focus is less on protecting information or punishing those who access it than on ensuring that patients realize, and give explicit consent to, access to their genetic and genomic information. If consent is not required at the current time, disclosure of the avenues by which their information will be shared is at a minimum mandatory. Although this has traditionally not been required as long as the clinical data are "deidentified," with the increasing recognition of the inherent and unavoidable risks of genetic and genomic information, this existing practice is no longer ethically acceptable without consent or acknowledgment from patients willing to undertake that risk [77].
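As promised above, here is a toy sketch of the masking plus "break the glass" idea for a sensitive EMR field: the genetic result is hidden by default, an emergency override is possible, and every override leaves a reviewable audit trace. The record layout and function names are hypothetical, not an actual EMR interface.

```python
# Hypothetical masking/break-glass sketch; not a real EMR API.

import datetime

AUDIT_LOG = []

def read_field(record, field, user, emergency=False, reason=""):
    value = record[field]
    if field in record.get("masked_fields", set()):
        if not emergency:
            return "*** masked: ask the patient, or break glass ***"
        # Break-the-glass: access is allowed, but a reviewable trace remains.
        AUDIT_LOG.append((datetime.datetime.now(), user, field, reason))
    return value

chart = {
    "name": "patient-001",
    "apoe_status": "e4/e4",
    "masked_fields": {"apoe_status"},
}
print(read_field(chart, "apoe_status", user="dr_a"))            # masked by default
print(read_field(chart, "apoe_status", user="dr_a",
                 emergency=True, reason="emergency workup"))    # returned, audited
print(len(AUDIT_LOG))  # 1
```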

INFORMED CONSENT

Introduction

"Particularly where whole genome analysis (WGA) is concerned, the question arises whether approaching the issue via the classic informed consent model is realistic. Can care providers themselves maintain a grasp of the potential outcomes of genome-wide diagnostic testing and the chances and implications thereof? Insofar as they can, do they have the time and the means to present this information to the person in question in an understandable fashion? Won't the amount and variety of the information, as well as the fact a great deal remains unclear or uncertain, preclude a well-informed choice?" [15]


Informed consent (IC) is the cornerstone of both ethical research and clinical care, although it differs between the two [97]. Research consents are often longer and more involved, emphasizing the voluntariness of the procedure and the risks of participating. This is due to the unique context of research, in which there is an inherent tension between the goals of the research (e.g., to expand knowledge and improve care for future patients) and the interests of the subject (e.g., to secure his or her own benefit). In addition, researchers, especially those who are not clinician researchers, do not have the same ethical obligations of rescue and care to research participants as clinicians do for their patients. Clinical consents, on the other hand, may cover some of the same issues required of research consents, like a summary of procedures and explanations of risks and benefits, but are often shorter. The risks and voluntariness of the procedure are often less emphasized, because clinical procedures are always intended to benefit the individual patient upon whom they are performed [19].

Even acknowledging these distinctions, there are several reasons why research and clinical consents are often blurred. Current law and culture consider both patients and participants as having the fundamental right to adequate information about proposed procedures, and to make an autonomous decision to participate based on that knowledge. In both cases, the goal of informed consent is to ensure that people appreciate what may be revealed to them, make informed decisions about whether or not they want to take the test and how they want to receive the results, and are prepared for the implications of the results. Whether genetics is involved in the clinical or the research setting, adequately informing lay people is extremely difficult due to the intricacies of the genetic science itself, as well as the proven difficulties lay people have in accurately assessing risks and the complexity of gene–gene and gene–environment interactions [98]. The general health literacy of the US population is extremely low, and by all accounts research and genetic literacy is even lower [99]. Understanding the complexity of NGS, as well as the variety of types of results it can yield, is even more difficult. Finally, the current investigational status of most NGS makes it very unclear whether it is an investigational test being employed for a clinical purpose, or research on an investigational test being conducted in a clinical context. While these two activities may look identical from the patient perspective, they invoke very different moral and legal obligations, as the former is intended primarily for the benefit of the patient (with incidental research findings) and the latter is intended primarily to advance research (with incidental patient benefit).

Using prior scholarship to determine the appropriate clinical consent for NGS is challenging for two reasons. First, most discussions of NGS focus on consent in the research context. As discussed above, while covering overlapping issues, research consents must deal with inherent conflicts of interest that do not exist in the clinical context, while dealing less with the direct obligations of care that are intrinsic to a clinical context. Further, research consent is subject to Common Rule requirements and under the purview of Institutional Review Boards, while clinical consent is not.
Second, most prior discussions focus on consent for conventional genetic testing, not NGS. That being said, since there is so little literature discussing clinical consent for NGS, it is helpful to look at recommendations for genetic testing in the clinical setting in general. For example, the American Society of Clinical Oncology set out suggested elements for inclusion in informed consent for cancer susceptibility testing in 2003 [100] (Table 24.7). Likewise, a more limited set of five basic components of informed consent for genetic testing has been suggested [101] (Table 24.8). The key questions are how and whether to adapt these recommendations to NGS in the clinical context. Informed consent in the clinical setting for NGS is a very recent discussion, but even so, recommendations have emerged from review articles, expert opinions, clinical practice guidelines, position papers, and consensus statements. The existing recommendations in the literature have recently been distilled into 10 elements of critical information that should always be included in informed consent for NGS [7] (Table 24.9). Although these aggregated requirements are ethically responsible and thorough, the challenge of implementing them in a clinical setting should not be overlooked. As an example, in an attempt to perform a thorough informed consent discussion for WGS with two families with Miller syndrome, the clinical team sent the IC form ahead of time for advance review and convened a "consent conference" (in one case face-to-face, in the other using telephone and live streaming video) in order to adequately cover all the elements they found necessary [102]. In the resulting interviews about their experience of the consent process, the two families voiced frustration with the length of time required, and still more than half of the participants found WGS difficult to understand. Although this study was done in the context of a research consent as opposed to a clinical consent, the burden of time and the trade-off with understanding raise important questions about whether patients can adequately understand NGS from an IC discussion, no matter how long and involved.


TABLE 24.7 General Informed Consent Elements for Genetic Testing

| Element | Description |
| --- | --- |
| Target of testing | Information on the specific genetic mutation(s) or genomic variant(s) being tested, including whether the range of risk associated with the variant will impact medical care |
| Implications | Implications of a positive or negative result |
| Limitations | Possibility that test will not be informative |
| Alternatives | Options for risk estimation without genetic or genomic testing |
| Risks | Reproductive, psychological, discrimination |
| Benefits | Psychological |
| Accuracy | Technical accuracy of test, including licensure of testing laboratory |
| Costs | Fees involved with testing and counseling |
| Protections | Against breaches of confidentiality, against discrimination |
| Future use | Researchers, quality improvement, etc. |
| Follow-up | Plans for follow-up after testing |

Adapted from Bruinooge [100].

TABLE 24.8 Concise Informed Consent Elements for Genetic Testing

| Element | Description |
| --- | --- |
| Nature and scope | Background information; purpose of test (determine genetic cause); possible result outcomes: (1) normal; (2) abnormal; (3) variant of unknown significance; (4) incidental/secondary |
| Benefits | May identify the genetic cause/diagnosis; medical and psychosocial benefits to diagnosis |
| Limitations | Does not rule out all genetic conditions; may not lead to definitive cure or treatment; may require further testing |
| Risks | Ambiguous results; unexpected/unrelated information; familial implications |
| Costs | Check with insurance; advise on out-of-pocket expenses |

Adapted from Cohen et al. [101].

Recommendations

Building from all this work, there are several important points that ought to be emphasized to patients, no matter what precise form of informed consent is used, in order to honor the underlying ethical obligations of the clinical setting.

Balance the Amount of Information with Patient Initiative

In the context of low health and genetic literacy, combined with the increased complexity of explaining NGS to patients, clinicians should not set themselves the goal of ensuring that patients fully grasp the notion of NGS. An attempt at a complete and thorough consent process has been described [103]:

1. During genetic counseling, discuss the possibility that clinical assessment incorporating a personal genome might uncover high risk of a serious disease, including some for which there is no treatment.
2. Additionally, describe the reproductive implications of heterozygous status for autosomal recessive diseases such as cystic fibrosis, which might not be predictable from family history. Also warn of increases or decreases of genetic risk of common diseases.
3. Note that most of the sequence information is difficult to interpret, and discuss error rates and validation processes.
4. Discuss that risk alleles might be discovered that have reproductive or familial importance rather than personal importance (such as those for breast cancer or ovarian cancer).
5. Address the possibility of discrimination on the basis of genetics . . . offer extended access to clinical geneticists, genetic counselors, and clinical lab directors to interpret the information presented.


TABLE 24.9 Recommended Informed Consent Elements for NGS

| Element | Description | Examples |
| --- | --- | --- |
| Scope | The scope of the test | Carrier testing, presymptomatic, newborn screening |
| Description | Description of test process | The procedure, its limitations, its implications |
| Benefits | Benefits that are expected | Increased knowledge of disease risk and predispositions, personalized lifestyle recommendations, more tailored drug therapy |
| Risks | Possible disadvantages, risks, complications | Medical, psychosocial, reproductive, third party, discrimination, stigma |
| Voluntary | Voluntary nature of the test | |
| Refusal | Possible refusal at any time without consequences | |
| Alternative test | Description of alternative diagnostic methods, if any | |
| Confidentiality | Measures taken to protect privacy and confidentiality, both now and in the future | Whether data are recorded in medical records; who will have access and in what circumstances (like researchers); whether to inform relatives |
| Future use | Destination of samples when study ends | Storage, encryption, anonymization or destruction; possibility of recontacting patients; reinterpreting results |
| Incidental findings | Management of incidental findings and the right not to know | How they will be returned and under what conditions; state standards for what is reportable |

Adapted from Ayuso et al. [7].

It has been estimated that an average person would obtain results from WGS for roughly 100 genetic risks. "Even if that information averaged only 3 min per disorder, this process would take more than 5 h of direct patient contact, after many hours of background research into the importance of the various genomic findings" [104]. This type of process could require time, energy, and expertise far beyond the capacity of most clinical settings (and current reimbursement paradigms), and still not be successful. Rather, the goal of the informed consent process should be to ensure that patients understand a basic minimum amount of information and, if they want to know more, to assure that the information is made accessible to them (upon inquiry or via supplied or referred resources). For example, consent should cover the major issues mentioned above [nature and scope, benefits, limitations, risks (including confidentiality risks), and costs] and then provide access to more extensive information if the patient so desires. If patients want to understand how NGS works, the difference between a genome and an exome, and so on, this information should be made available upon inquiry but not included unless necessary. One suggested strategy is to provide a shortened version of the informed consent document with the necessary information, along with a full-length version as a reference for patients to use as they see fit [105].

The Right to Know and the Right Not to Know

The nature and scope of information that NGS can yield should be made clear to patients before the test is employed. This should include the purpose of performing the test in the patient's particular situation, as well as the possible data that NGS could provide. The benefits and drawbacks of any alternatives (more directed tests, no tests at all) should be explained, as well as any options regarding how and how many results


are returned. This conversation is also a good time to mention required sharing and potential confidentiality risks, so that they can be incorporated appropriately into the patient's choice.

Negotiation of Clinical and Personal Risks and Benefits

In order to aid patients in making a decision, it is important to acknowledge that most people, until faced with the decision, do not reflect much on the role of genetic knowledge in their lives. Clinicians should explain the role and strength of the clinical utility of testing in the patient's particular situation. Further, the clinician should discuss with the patient the possible personal advantages or drawbacks of having the test. One suggested strategy is to categorize results according to type, but unfortunately there is no clear consensus on an appropriate typology [36]. While clinicians tend to feel more comfortable discussing clinical factors such as seriousness of the condition, urgency of treatment, and probability of disease, as discussed above this is not the extent of clinical utility. Clinicians should explicitly include risks and benefits that reflect the possible personal utility to the patient, such as the cost of the test and potential risks to confidentiality and discrimination, as well as potential personal benefits such as a feeling of control, life-planning options, and potential lifestyle adjustments. In light of existing evidence, there is no reason to suppose a priori that information will necessarily harm or benefit a patient, since the harm or benefit appears to be determined much more by the character, needs, and context of the particular patient than by any external facts. Thus, an informed negotiation between clinician and patient is the best approach to determining the advantages or disadvantages of embarking on NGS. "For the lay groups, it was important that they were able to undertake a 'risk analysis' for themselves, rather than have the professionals make decisions on their behalf" [36].

Evolving Results

A major question in the clinical setting is the duty to revisit results in a field where the technology and the information derived from it evolve so quickly. The most important aspect for informed consent is to plan ahead. If revisiting results in the future is possible, clinicians should make sure to inform patients of (i) whether it will be initiated by the physician or the patient and (ii) the conditions under which the decision will be made. If patients are given a choice, the choice should be made clear. Ethically, revisiting results should be the default position, for if certain conditions make disclosure of results obligatory at the present time, the attainment of those conditions in the future would make disclosure of those results obligatory as well. The administrative or financial burden of this, if it arises, could necessitate having patients take the initiative to inquire, as opposed to clinicians scanning all patient records when new information surfaces [15]. If neither of these options is possible, patients should at least be informed that they will not be notified of new information, and why.

Counseling

Although the ACMG and other societies currently recommend both pre- and posttest counseling for NGS, especially for asymptomatic individuals in the setting of germline testing, this is not likely to be a practical reality if NGS gains widespread adoption throughout the clinical sector.
There are currently not enough genetic counselors, nor is there an adequate approach to funding their services, to incorporate them generally into the clinical context. This is not to suggest that thorough counseling is not an ideal approach, but rather that it is vital to plan an ethical approach to gaining informed consent and preparing people for NGS results in contexts where these services are not available. Two recommendations emerge from this recognition. First, healthcare professionals should work at different levels to create a sustainable model for the role of genetic counselors in the future of NGS. This could include increased incentives for people to enter the field, as well as ways to gain coverage for their services more easily through common insurance programs. More radically, new models of counseling could be created in which genetic counselors serve as a resource so that clinicians can adequately counsel their own patients, or in which newer technologies are leveraged so that counseling can be provided through telecommunications or other modes. Second, as mentioned above, clinicians need to work to educate themselves, or avail themselves of the resources needed, to adequately counsel patients on the pros and cons of NGS in their particular circumstances. "Physicians must have ready access to the significance and risks of positive genomic results so they can understand the real risks to their patients, explain them, and make cost-effective decisions about subsequent testing" [106]. Likewise, materials need to be developed that can streamline and incorporate NGS into clinical practice smoothly to support clinicians. Current resources do not have the necessary functionalities; for example, it has been noted that "No centrally maintained repository of all rare and disease-associated variants currently exists. . . interpretation of the patient's genome involved the work of a clinical geneticist, a genetic counselor, and


experts in bioinformatics, genetic cardiology, internal medicine, and pharmacogenomics, among others" [104]. The ACMG is developing a set of clinical decision support tools called ACT Sheets to guide referring physicians [81], but additional clinician education needs include other "point of care" educational resources such as "hyperlinks to disease information and resources or brief education modules that physicians would have the opportunity to view when pertinent to the case at hand" [5]. Web-based educational tools, or the utilization of telemedicine, are other methods that could be leveraged to assist clinicians on a day-to-day basis.

Transparency

The most important aspect of the information given to patients, either before testing in the informed consent or after testing when returning results, is that it is honest and avoids surprises as much as possible. People don't mind having tentative results as long as they are not presented as definitive and they don't come as a surprise [36,62]. As mentioned in the Privacy section of this chapter, NGS consents must be honest and direct about the scope and limits of the confidentiality protections that clinicians can provide. "Classic consents must therefore transition away from attempting to guarantee individuals' privacy. Rather, new forms of consent should aim at educating research subjects on what the data collected on them can say and the degree to which it can or cannot be protected" [96]. This includes reference to research uses, sharing with service or marketing organizations, and whether clinicians are obligated to share genetic information with others implicated by the information (such as family members or spouses).

CONCLUSION

Beyond all of the expert advice and ethical nuances of incorporating NGS into the clinical setting, the biggest hurdle to its implementation is the novelty of the technology and the corresponding lack of literacy among both its providers and its consumers. NGS will produce a volume of genetic data that exceeds any individual clinician's ability and expertise. Clinicians, perhaps in concert with their patients, will need to be prepared for the types of results that they will receive, and to prioritize which results are the most relevant and most important to reveal [9]. For clinicians, the challenge will be to understand the next generation of genetic knowledge, its implications for organ systems that may be outside of their training and practice, and the approaches to communicating this information to their patients [43]. For patients, the challenge will be to make a decision about the risks and benefits of NGS for their particular situation, acknowledging that they will almost certainly be working with incomplete information and understanding.

References

[1] Nanda R, Schumm LP, Cummings S, et al. Genetic testing in an ethnically diverse cohort of high-risk women: a comparative analysis of BRCA1 and BRCA2 mutations in American families of European and African ancestry. JAMA 2005;294:1925–33.
[2] Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, et al.; NHLBI Exome Sequencing Project. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 2012;337:64–9.
[3] Ethical, legal, and social issues research. Available from: <http://web.ornl.gov/sci/techresources/Human_Genome/elsi/index.shtml>.
[4] McEwen JE, Boyer JT, Sun KY. Evolving approaches to the ethical management of genomic data. Trends Genet 2013;9:375–82.
[5] Soden SE, Farrow EG, Saunders CJ, Lantos JD. Genomic medicine: evolving science, evolving ethics. Pers Med 2012;9:523–8.
[6] McGuire AL, Fisher R, Cusenza P, Hudson K, Rothstein MA, McGraw D, et al. Confidentiality, privacy, and security of genetic and genomic test information in electronic health records: points to consider. Genet Med 2008;10:495–9.
[7] Ayuso C, Millán JM, Mancheño M, Dal-Ré R. Informed consent for whole-genome sequencing studies in the clinical setting. Proposed recommendations on essential content and process. Eur J Hum Genet 2013;21:1054–9.
[8] Couzin-Frankel J. Chasing a disease to the vanishing point. Science 2010;328(5976):298–300.
[9] Sharp RR. Downsizing genomic medicine: approaching the ethical complexity of whole-genome sequencing by starting small. Genet Med 2011;13(3):191–4.
[10] Feinberg J. The child's right to an open future. In: Aikin W, LaFollette H, editors. Whose child? Children's rights, parental authority and state power. Totowa, NJ: Rowman and Littlefield; 1980. p. 124–53.
[11] Deutsch M. Distributive justice: a social-psychological perspective. New Haven, CT: Yale University Press; 1985.
[12] Secretary's Advisory Committee on Genetics, Health, and Society (SACGHS). Letter to the Secretary. 2010. Available from: <http://oba.od.nih.gov/oba/SACGHS/SACGHS_Letter_to_the_Secretary_November_9_2010.pdf>.
[13] Hawkins AK, Ho A, Hayden MR. Lessons from predictive testing for Huntington disease: 25 years on. J Med Genet 2011;48(10):649–50.


[14] Yu J-H, Crouch J, Jamal SM, Tabor HK, Bamshad MJ. Attitudes of African Americans toward return of results from exome and whole genome sequencing. Am J Med Genet A 2013;161(5):1064–72.
[15] Dondorp WJ, de Wert GM. The 'thousand-dollar genome': an ethical exploration. Eur J Hum Genet 2013;21(Suppl. 1):S6–26.
[16] Biesecker LG. Opportunities and challenges for the integration of massively parallel genomic sequencing into clinical practice: lessons from the ClinSeq project. Genet Med 2012;14(4):393–8.
[17] Thompson R, Drew CJ, Thomas RH. Next generation sequencing in the clinical domain: clinical advantages, practical, and ethical challenges. Adv Protein Chem 2012;89:27–63.
[18] Haas J, Katus HA, Meder B. Next-generation sequencing entering the clinical arena. Mol Cell Probes 2011;25(5):206–11.
[19] Privacy and progress in whole genome sequencing. Presidential Commission for the Study of Bioethical Issues; 2012. Available from: <http://bioethics.gov/node/764>.
[20] U.S. Government Printing Office. 42 CFR 493.1850 Part 493. Available from: <http://www.gpo.gov/fdsys/granule/CFR-2010-title42-vol5/CFR-2010-title42-vol5-sec493-1850/content-detail.html>.
[21] 42 CFR § 493.3 b(2).
[22] Alexander D, van Dyck PC. A vision of the future of newborn screening. Pediatrics 2006;117(Suppl. 3):S350–4.
[23] Greenbaum D, Du J, Gerstein M. Genomic anonymity: have we already lost it? Am J Bioeth 2008;8(10):71–4.
[24] Offit K, Groeger E, Turner S, Wadsworth EA, Weiser MA. The "duty to warn" a patient's family members about hereditary disease risks. JAMA 2004;292(12):1469–73.
[25] Stol YH, Menko FH, Westerman MJ, Janssens RM. Informing family members about a hereditary predisposition to cancer: attitudes and practices among clinical geneticists. J Med Ethics 2010;36(7):391–5.
[26] Lolkema MP, Gadellaa-van Hooijdonk CG, Bredenoord AL, Kapitein P, Roach N, Cuppen E, et al. Ethical, legal, and counseling challenges surrounding the return of genetic results in oncology. J Clin Oncol 2013;31(15):1842–8.
[27] Samuelson W, Zeckhauser R. Status quo bias in decision making. J Risk Uncertain 1988;1(1):7–59.
[28] Newborn screening tests. Available from: <http://kidshealth.org/parent/system/medical/newborn_screening_tests.html>.
[29] Pellegrino ED, Bloom FE, Carson BS, Dresser RS, Eberstadt NN, Elshtain JB. The changing moral focus of newborn screening: an ethical analysis by the President's Council on Bioethics. Washington, DC: President's Council on Bioethics; 2008.
[30] Bredenoord AL, Kroes HY, Cuppen E, Parker M, van Delden JJ. Disclosure of individual genetic data to research participants: the debate reconsidered. Trends Genet 2011;27(2):41–7.
[31] Khoury MJ, Berg A, Coates R, Evans J, Teutsch SM, Bradley LA. The evidence dilemma in genomic medicine. Health Aff (Millwood) 2008;27(6):1600–11.
[32] Hunter DJ, Khoury MJ, Drazen JM. Letting the genome out of the bottle—will we get our wish? N Engl J Med 2008;358(2):105–7.
[33] Sijmons RH, Van Langen IM, Sijmons JG. A clinical perspective on ethical issues in genetic testing. Account Res 2011;18(3):148–62.
[34] McGuire AL, Caulfield T, Cho MK. Research ethics and the challenge of whole-genome sequencing. Nat Rev Genet 2008;9(2):152–6.
[35] Dewey FE, Pan S, Wheeler MT, Quake SR, Ashley EA. DNA sequencing: clinical applications of new DNA sequencing technologies. Circulation 2012;125(7):931–44.
[36] Townsend A, Adam S, Birch PH, Lohn Z, Rousseau F, Friedman JM. "I want to know what's in Pandora's box": comparing stakeholder perspectives on incidental findings in clinical whole genomic sequencing. Am J Med Genet A 2012;158A(10):2519–25.
[37] Arribas-Ayllon M. The ethics of disclosing genetic diagnosis for Alzheimer's disease: do we need a new paradigm? Br Med Bull 2011;100(1):7–21.
[38] Tabor HK, Berkman BE, Hull SC, Bamshad MJ. Genomics really gets personal: how exome and whole genome sequencing challenge the ethical framework of human genetics research. Am J Med Genet A 2011;155(12):2916–24.
[39] Nanda R, Schumm LP, Cummings S, Fackenthal JD, Sveen L, Ademuyiwa F, et al. Genetic testing in an ethnically diverse cohort of high risk women. JAMA 2005;294(15):1925–33.
[40] Wolf SM, Lawrenz FP, Nelson CA, Kahn JP, Cho MK, Clayton EW, et al. Managing incidental findings in human subjects research: analysis and recommendations. J Law Med Ethics 2008;36(2):219–48.
[41] Wolf SM, Crock BN, Van Ness B, Lawrenz F, Kahn JP, Beskow LM, et al. Managing incidental findings and research results in genomic research involving biobanks and archived data sets. Genet Med 2012;14(4):361–84.
[42] Goddard KA, Whitlock EP, Berg JS, Williams MS, Webber EM, Webster JA, et al. Description and pilot results from a novel method for evaluating return of incidental findings from next-generation sequencing technologies. Genet Med 2013;15(9):721–8.
[43] Kalia S, Korf BR, McGuire A, Nussbaum RL, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Available from: <https://www.genome.gov/Pages/Health/HealthCareProvidersInfo/ACMG_Incidental_Findings_Report.pdf>.
[44] Christenhusz GM, Devriendt K, Dierickx K. Secondary variants: in defense of a more fitting term in the incidental findings debate. Eur J Hum Genet 2013;21(12):1331–4.
[45] Biesecker LG. Incidental variants are critical for genomics. Am J Hum Genet 2013;92(5):648–51.
[46] Simon CM, Williams JK, Shinkunas L, Brandt D, Daack-Hirsch S, Driessnack M. Informed consent and genomic incidental findings: IRB chair perspectives. J Empir Res Hum Res Ethics 2011;6(4):53–67.
[47] Haddow JE, Palomaki GE. ACCE: a model process for evaluating data on emerging genetic tests. In: Khoury M, Little J, Burke W, editors. Human genome epidemiology: a scientific foundation for using genetic information to improve health and prevent disease. Oxford University Press; 2003. p. 217–33.
[48] Berg JS, Khoury MJ, Evans JP. Deploying whole genome sequencing in clinical practice and public health: meeting the challenge one bin at a time. Genet Med 2011;13(6):499.
[49] Fabsitz RR, McGuire A, Sharp RR, Puggal M, Beskow LM, Biesecker LG, et al. Ethical and practical guidelines for reporting genetic research results to study participants: updated guidelines from a National Heart, Lung, and Blood Institute working group. Circ Cardiovasc Genet 2010;3(6):574–80.


[50] Khoury MJ, McBride CM, Schully SD, Ioannidis JP, Feero WG, Janssens ACJ, et al. The scientific foundation for personal genomics: recommendations from a National Institutes of Health–Centers for Disease Control and Prevention multidisciplinary workshop. Genet Med 2010;11(8):559–67.
[51] Marteau TM, Weinman J. Self-regulation and the behavioural response to DNA risk information: a theoretical analysis and framework for future research. Soc Sci Med 2006;62(6):1360–8.
[52] Marteau T, Senior V, Humphries SE, Bobrow M, Cranston T, Crook MA, et al. Psychological impact of genetic testing for familial hypercholesterolemia within a previously aware population: a randomized controlled trial. Am J Med Genet A 2004;128(3):285–93.
[53] Phelan JC, Yang LH, Cruz-Rojas R. Effects of attributing serious mental illnesses to genetic causes on orientations to treatment. Psychiatr Serv 2006;57(3):382–7.
[54] Picot J, Bryant J, Cooper K, Clegg A, Roderick P, Rosenberg W, et al. Psychosocial aspects of DNA testing for hereditary hemochromatosis in at-risk individuals: a systematic review. Genet Test Mol Biomarkers 2009;13(1):7–14.
[55] Post SG, Whitehouse PJ, Binstock RH, Bird TD, Eckert SK, Farrer LA, et al. The clinical introduction of genetic testing for Alzheimer disease: an ethical perspective. JAMA 1997;277(10):832.
[56] Miller FA, Giacomini M, Ahern C, Robert JS, de Laat S. When research seems like clinical care: a qualitative study of the communication of individual cancer genetic research results. BMC Med Ethics 2008;9(1):4.
[57] Genetic testing and Alzheimer disease—Program in Genomics, Ethics, and Society (PGES)—Stanford Center for Biomedical Ethics (SCBE)—Stanford University School of Medicine. Available from: <http://bioethics.stanford.edu/pges/alzheimer_paper.html>.
[58] Roberts JS, Shalowitz DI, Christensen KD, Everett JN, Kim SY, Raskin L, et al. Returning individual research results: development of a cancer genetics education and risk communication protocol. J Empir Res Hum Res Ethics 2010;5(3):17–30.
[59] Green RC, Roberts JS, Cupples LA, Relkin NR, Whitehouse PJ, Brown T, et al. Disclosure of APOE genotype for risk of Alzheimer's disease. N Engl J Med 2009;361(3):245–54.
[60] Murakami Y. Guilt from negative genetic test findings. Am J Psychiatry 2001;158(11):1929.
[61] Haga SB, O'Daniel JM, Tindall GM, Lipkus IR, Agans R. Survey of US public attitudes toward pharmacogenetic testing. Pharmacogenomics J 2011;12(3):197–204.
[62] Bollinger JM, Scott J, Dvoskin R, Kaufman D. Public preferences regarding the return of individual genetic research results: findings from a qualitative focus group study. Genet Med 2012;14(4):451–7.
[63] Murphy J, Scott J, Kaufman D, Geller G, LeRoy L, Hudson K. Public expectations for return of results from large-cohort genetic research. Am J Bioeth 2008;8(11):36–43.
[64] Daack-Hirsch S, Driessnack M, Hanish A, Johnson VA, Shah LL, Simon CM, et al. "Information is information": a public perspective on incidental findings in clinical and research genome-based testing. Clin Genet 2013;84(1):11–8.
[65] Cho AH, Killeya-Jones LA, O'Daniel JM, Kawamoto K, Gallagher P, Haga S, et al. Effect of genetic testing for risk of type 2 diabetes mellitus on health behaviors and outcomes: study rationale, development and design. BMC Health Serv Res 2012;12(1):16.
[66] Miller FA, Hayeems RZ, Bytautas JP. What is a meaningful result? Disclosing the results of genomic research in autism to research participants. Eur J Hum Genet 2010;18(8):867–71.
[67] Grosse SD, McBride CM, Evans JP, Khoury MJ. Personal utility and genomic information: look before you leap. Genet Med 2009;11(8):575.
[68] Stigma—definition and more from the free Merriam-Webster dictionary. Available from: <http://www.merriam-webster.com/dictionary/stigma>.
[69] Wjst M. Caught you: threats to confidentiality due to the public release of large-scale genetic data sets. BMC Med Ethics 2010;11(1):21.
[70] [a] Bombard Y, Veenstra G, Friedman JM, Creighton S, Currie L, Paulsen JS, et al. Perceptions of genetic discrimination among people at risk for Huntington's disease: a cross sectional survey. BMJ 2009;338:b2175. [b] <http://www.eeoc.gov/eeoc/statistics/enforcement/charges.cfm>.
[71] [a] Chevron U.S.A., Inc. v. Echazabal, 122 S. Ct. 2045 (2002). [b] <http://www.eeoc.gov/eeoc/newsroom/release/5-7-13b.cfm>.
[72] Eckman MH, Rosand J, Greenberg SM, Gage BF. Cost-effectiveness of using pharmacogenetic information in warfarin dosing for patients with nonvalvular atrial fibrillation. Ann Intern Med 2009;150(2):73–83.
[73] Jordan BR, Tsai DFC. Whole-genome association studies for multigenic diseases: ethical dilemmas arising from commercialization—the case of genetic testing for autism. J Med Ethics 2010;36(7):440–4.
[74] Points to consider when planning a genetic study that involves members of named populations—bioethics resources on the web—NIH. Available from: <http://bioethics.od.nih.gov/named_populations.html>.
[75] Hausman D. Protecting groups from genetic research. Bioethics 2008;22(3):157–65.
[76] Drabiak-Syed K. Lessons from Havasupai Tribe v. Arizona State University Board of Regents: recognizing group, cultural, and dignity harms as legitimate risks warranting integration into research practice. J Health Biomed Law 2010;6:175.
[77] Darcy D, Lewis E, Ormond K, Clark D, Trafton J. Practical considerations to guide development of access controls and decision support for genetic information in electronic medical records. BMC Health Serv Res 2011;11(1):294.
[78] Lohn Z, Adam S, Birch P, Townsend A, Friedman J. Genetics professionals' perspectives on reporting incidental findings from clinical genome-wide sequencing. Am J Med Genet A 2013;161(3):542–9.
[79] Allyse M, Michie M. Not-so-incidental findings: the ACMG recommendations on the reporting of incidental findings in clinical whole genome and whole exome sequencing. Trends Biotechnol 2013;8:439–41.
[80] Green RC, Berg JS, Berry GT, Biesecker LG, Dimmock DP, Evans JP, et al. Exploring concordance and discordance for return of incidental findings from clinical sequencing. Genet Med 2012;14(4):405–10.
[81] American College of Medical Genetics. Incidental findings in clinical genomics: a clarification. Available from: <http://www.acmg.net/docs/Incidental_Findings_in_Clinical_Genomics_A_Clarification.pdf>.
[82] Office of the Press Secretary. White House remarks by the President on the completion of the first survey of the entire Human Genome Project; 2000. Available from: <http://www.ornl.gov/sci/techresources/Human_Genome/project/clinton2.shtml>.
[83] Naser CR, Alpert SA. Protecting the privacy of medical records: an ethical analysis: a white paper. The Coalition; 1999.

IV. REGULATION, REIMBURSEMENT, AND LEGAL ISSUES

REFERENCES

433

[84] Alpert SA. Protecting medical privacy: challenges in the age of genetic information. J Soc Issues 2003;59(2):30122. [85] Arar N, Seo J, Lee S, Abboud HE, Copeland LA, Noel P, et al. Preferences regarding genetic research results: comparing veterans and nonveterans responses. Public Health Genomics 2010;13:4319. [86] Rothstein MA, Talbott MK. Compelled authorizations for disclosure of health records: magnitude and implications. Am J Bioeth 2007;7 (3):3845. [87] Ohm P. Broken promises of privacy: responding to the surprising failure of anonymization. Ucla Law Rev 2010;57:1701. [88] De Cristofaro E. Whole genome sequencing: innovation dream or privacy nightmare? Eprint Arxiv 2012. [89] Heeney C, Hawkins N, De Vries J, Boddington P, Kaye J. Assessing the privacy risks of data sharing in genomics. Public Health Genomics 2010;14(1):1725. [90] Sweeney L. k-anonymity: a model for protecting privacy. Int J Uncertainty Fuzziness Knowledge Based Syst 2002;10(05):55770. [91] Barth-Jones D. The ‘Re-Identification’of Governor William Weld’s Medical Information: a critical re-examination of health data identification risks and privacy protections, then and now. 2012; Available from: ,http://papers.ssrn.com/sol3/papers.cfm?. abstract_id52080162. [92] Nyholt DR, Yu C-E, Visscher PM. On Jim Watson’s APOE status: genetic information is hard to hide. Eur J Hum Genet 2008;17 (2):1479. [93] Narayanan A, Shmatikov V. How to break anonymity of the Netflix prize dataset. Eprint Arxiv 2006; Cs0610105. Available from: ,http://arxiv.org/abs/cs/0610105.. [94] OHRP—Guidance on Research Involving Coded Private Information or Biological Specimens. Available from: ,http://www.hhs.gov/ ohrp/policy/cdebiol.html.. [95] Diergaarde B, Bowen DJ, Ludman EJ, Culver JO, Press N, Burke W. Genetic information: special or not? Responses from focus groups with members of a health maintenance organization. Am J Med Genet A 2007;143(6):5649. [96] Schadt EE. The changing privacy landscape in the era of big data. Mol Syst Biol 2012;8:612. [97] Beauchamp TL. The Belmont Report. Oxford textbook of clinical research ethics; 20081798 [98] Michie S, Smith JA, Senior V, Marteau TM. Understanding why negative genetic test results sometimes fail to reassure. Am J Med Genet A 2003;119A(3):3407. [99] Yin HS, Johnson M, Mendelsohn AL, Abrams MA, Sanders LM, Dreyer BP. The health literacy of parents in the United States: a nationally representative study. Pediatrics 2009;124(Suppl):S28998. [100] Bruinooge SS. American society of clinical oncology policy statement update: genetic testing for cancer susceptibility. J Clin Oncol 2003;21(12):2397406. [101] Cohen J, Hoon A, Wilms Floet AM. Providing family guidance in rapidly shifting sand: informed consent for genetic testing. Dev Med Child Neurol 2013;55(8):7668. [102] Tabor HK, Stock J, Brazg T, McMillin MJ, Dent KM, Yu J-H, et al. Informed consent for whole genome sequencing: a qualitative analysis of participant expectations and perceptions of risks, benefits, and harms. Am J Med Genet A 2012;158(6):13109. [103] Ashley EA, Butte AJ, Wheeler MT, Chen R, Klein TE, Dewey FE, et al. Clinical assessment incorporating a personal genome. Lancet 2010;375(9725):152535. [104] Ormond KE, Wheeler MT, Hudgins L, Klein TE, Butte AJ, Altman RB, et al. Challenges in the clinical application of whole-genome sequencing. Lancet Lond Engl 2010;375(9727):1749. [105] McGuire AL, Achenbaum LS, Whitney SN, Slashinski MJ, Versalovic J, Keitel WA, et al. Perspectives on human microbiome research ethics. 
J Empir Res Hum Res Ethics Int J 2012;7(3):114. [106] Kohane IS. The incidentalome: a threat to genomic medicine. JAMA 2006;296(2):2125.

Glossary

Analytic validity  First component of the ACCE model: a test's ability to accurately and reliably identify a particular gene characteristic.
Anonymity  Complete removal of identifiers.
Attribute disclosure  Disclosure that occurs when an individual or group is known to be represented in a data set in which everyone shares a particular trait.
Autonomy  A basic ethical principle of bioethics; technically "self-rule," it refers to the freedom to choose a course of action in accordance with one's own values.
Beneficence  A basic ethical principle of bioethics: to promote health and welfare.
Binning  One strategy for deciding which results to return, consisting of creating categories, or "bins," of types of results, each with specific recommendations for return.
Clinical actionability  The existence of established therapeutic or preventive interventions that may change the course of the tested disorder.
Clinical utility  Third component of the ACCE model: a test result that leads to an improved health outcome.
Clinical validity  Second component of the ACCE model: a test in which a genotype accurately and reliably identifies or predicts a phenotype.
Confidentiality  Entrusting access to information to specific persons, institutions, or entities, who then allow access by a third party only under conditions acceptable to the provider of the information.
Diagnosis  The process of identifying the nature or cause of a health problem.
Distributive justice  A fair allocation of goods in a society.
ELSI  Fourth component of the ACCE model: the ethical, legal, and social implications of genetic and genomic information.
Genetic exceptionalism  The view that genetic/genomic information should be treated differently from other health information, especially in the context of privacy protections, data access, and permissible use.
Informational privacy  Limitation or restriction of access to personal information.
Justice  A basic ethical principle of bioethics: "to treat equals equally" or to "give each his due" (of course, who are equals and what is due are subjects of much debate). In the most straightforward terms, it means to treat people fairly.
Listing  One strategy for deciding which results to return, consisting of compiling a list of recommended findings to return based on expert consensus.
Nonmaleficence  A basic ethical principle of bioethics: "do no harm."
Personal utility  The advantages or disadvantages of health information for individuals.
Privacy  The limitation or restriction of access to persons, personal information, or personal property.
Screening  The process of identifying an undiagnosed health problem in individuals with no signs or symptoms.
Stigma  Originally a physical scar; now a mark of shame or discredit.

List of Acronyms and Abbreviations

ACCE  Analytic validity, clinical validity, clinical utility, ELSI
ACMG  American College of Medical Genetics and Genomics
AMA  American Medical Association
ASCO  American Society of Clinical Oncology
CLIA  Clinical Laboratory Improvement Amendments
CODIS  Combined DNA Index System
CT  Computed tomography
dbGAP  Database of genotypes and phenotypes
ELSI  Ethical, Legal and Social Implications (of Genetics)
EMR  Electronic medical records
GINA  Genetic Information Nondiscrimination Act
HHS  Department of Health and Human Services
HIPAA  Health Insurance Portability and Accountability Act
IC  Informed consent
MRI  Magnetic resonance imaging
NBAC  National Bioethics Advisory Commission
NGS  Next-generation sequencing
NICHD  National Institute of Child Health and Human Development
NIH  National Institutes of Health
OHRP  Office of Human Research Protection
TMS  Tandem mass spectrometry
VUS  Variants of unknown significance
WES  Whole exome sequencing
WGA  Whole genome analysis
WGS  Whole genome sequencing


C H A P T E R

25 Legal Issues

Roger D. Klein
Department of Molecular Pathology, Robert J. Tomsich Pathology and Laboratory Medicine Institute, Cleveland Clinic, Cleveland, OH, USA

O U T L I N E

Introduction  435
Patent Overview  436
History of Gene Patents  436
Arguments for and Against Gene Patents  438
Important Legal Cases  438
Implication of Recent Court Decisions for Genetic Testing  443
Genetic Information Nondiscrimination Act  443
References  445

INTRODUCTION

Patents on human gene sequences and on relationships between human gene variants and clinical phenotypes have proven extremely controversial in clinical diagnostics. Many patient and consumer groups, pathologists, geneticists, and other laboratory professionals contend that such patents increase test costs, decrease innovation, reduce patient access, restrict patients' choices of providers and their access to second opinion testing, inhibit clinical and basic research, and foster the development of proprietary databases of medically significant genetic findings [1-3]. Author Michael Crichton joined the chorus of gene patent critics in his 2006 novel Next, going so far as to include an appendix to the book that exposed the "evils" of gene patents and advocated a ban on them, views he also expressed in a New York Times column [4]. In February 2007, Congressmen Xavier Becerra (D-Calif.) and David Weldon (R-Fla.) introduced "The Genomic Research and Accessibility Act" (HR 977), a bill that would have banned future patents on all nucleic acid sequences. Conversely, proponents of gene patents have argued that these patents incentivize gene discovery, and stimulate investments in and commercialization of genetic tests. Gene patents, it has been argued, benefit patients by encouraging discoveries of genetic relationships and the development and introduction of new assays that, in the absence of patents, would not have been brought to fruition.

In parallel, the introduction of molecular inherited disease testing into medical practice has raised concerns about potential discrimination in employment and insurance on the basis of genetic test results and other genetic information. The nature of the information derived from heritable disease testing differs from that obtained from other laboratory investigations because of its innate character, the potential strength of its predictive power, and the implications of test results for other family members. These concerns underlay the passage of the Genetic Information Nondiscrimination Act (GINA), which was signed into law by President George W. Bush on May 21, 2008 [5]. This chapter will discuss GINA and its implications for hereditary disease testing.


This chapter will chronicle the history of human gene patents, discuss arguments for and against gene patents, and present important legal cases that bear on or directly address the validity and permissible scope of such patents. Finally, the chapter will discuss the implications of these recent legal developments for diagnostic testing, including next-generation sequencing.

PATENT OVERVIEW

A US utility patent confers upon the patent holder the right to exclude others from making, using, selling, offering to sell, or importing an invention or a product made by a patented process, for 20 years from the filing date [6]. The basis for the U.S. patent system is found in the Constitution, which in Article I, Section 8, Clause 8 states, "The Congress shall have Power . . . To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries; . . ." The incentives patents generate for inventors to create, commercialize, and disclose new inventions have served as the justification for the governmental grant of exclusivity, the benefits of which are believed to accrue to society at large.

Congress enacted the first U.S. patent laws in 1790. The Patent Act of 1790 was repealed and replaced in 1793, and the patent laws have subsequently been modified on numerous occasions. Congress established the basic structure of the current Patent Act in 1952, when it last reenacted the patent laws in their entirety. Since the passage of the 1952 Act, the patent laws have been amended several times, recently undergoing significant revisions as part of the America Invents Act of 2011 [7].

Under U.S. patent law, patentable inventions must be novel, nonobvious, and useful [8]. In addition, under the "written description" and "enablement" requirements, a patent must describe the patented invention in what is termed its "specification," "in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same." Moreover, the specification must set forth the "best mode," in the mind of the inventor, of practicing the invention [9]. Within the specification, patent "claims" define the invention's features, establishing the boundaries of what is claimed, much as a real estate deed delineates the boundaries of a plot of land. Patent applications are submitted to the Patent Office (USPTO), where they are either rejected, or allowed and issued. "Processes, machines, manufactures, and compositions of matter" can be patented [10], but patents may not be obtained on products of nature or, under the "natural phenomenon doctrine," on "laws of nature, natural phenomena, and abstract ideas" [11].

Patent infringement, the making, using, selling, offering to sell, or importing of a patented product or a product made by a patented process, can occur through direct infringement of the patent [12]; inducement of others to infringe the patent [13]; or contribution to another's infringement of the patent [14]. For example, prior to the recent U.S. Supreme Court decisions in Myriad and Mayo, a laboratory could have been found to have directly infringed a gene patent if it tested for mutations in a patented gene, or for variants claimed in a patented genotype-phenotype association. In order to be found liable for inducing another to infringe a patent, a party must have actively, intentionally, and knowingly solicited or assisted another to infringe the patent, with the solicited individual or entity itself having directly infringed the patent.
Thus, a laboratory that used educational materials to promote an offered test for a patented genetic association to physicians who then ordered the test, received the results, and thought about the association during management of their patients could, until the Mayo and Myriad decisions, have been found to have induced the direct infringement of the patent by the ordering physicians. Finally, sale of a material component of a patented invention that has no substantial use other than as a component of the invention constitutes contributory infringement. Applying this definition to testing for a patented genetic association, the laboratory in the preceding example could, prior to Mayo and Myriad, also have been found liable for contributory infringement, with the ordering physicians who thought about the patented genetic association during the course of their patients' care directly infringing the patent.

HISTORY OF GENE PATENTS

The legitimization of gene patents in the United States appears to have been an outgrowth of legal and political changes that were initiated in response to the economic dislocations of the late 1970s and early 1980s. During this period, the country was plagued by high unemployment, high inflation, and a decline in economic confidence. In response, Congress took a number of steps to encourage the growth of domestic technology industries. Among the most significant of these were changes to the U.S. patent system. To maximize the economic value derived from our substantial federal investments in basic science research, Congress in 1980 passed the Bayh-Dole Act [15]. Bayh-Dole encouraged universities to patent inventions arising out of government-sponsored research grants in order to aid in the inventions' commercialization. In addition, following the passage of Bayh-Dole, Congress dramatically increased federal financial commitments to biomedical research. National Institutes of Health funding of biomedical research ballooned from approximately $5 billion in the late 1970s to $26 billion in 2003 [16,17]. Because of these governmental actions, the number of patents assigned to universities increased from 264 in 1979 to 3291 in 2002 [18,19].

Key developments also occurred in the courts. In 1980, the U.S. Supreme Court ruled in Diamond v. Chakrabarty [20] that man-made, living organisms could be patented. Ananda Chakrabarty, a research scientist at General Electric, applied for a patent on a Pseudomonas bacterium that had been bioengineered to carry multiple plasmids. These plasmids conferred upon the newly created microorganism the ability to digest complex mixtures of hydrocarbons, potentially offering a biotechnological solution to oil spill cleanup. The Patent Office rejected Chakrabarty's patent application, finding that living organisms could not be patented. Chakrabarty appealed the USPTO's decision to the courts, with the case ultimately accepted by the U.S. Supreme Court. In its Chakrabarty decision, the Supreme Court urged a broad interpretation of patent eligibility, reciting from the legislative history of the 1952 Patent Act in ruling that "anything under the sun that is made by man," including living organisms, can be patented.

In Diamond v. Diehr [11], the Supreme Court held that an automated process for molding and curing rubber that incorporated a general-purpose computer programmed with the Arrhenius equation was eligible for patenting. The USPTO had found that the process, which included continual measurement of the temperature within the curing press, transfer of that value to a computer, calculation of the requisite cure time, and automated discontinuation of the process when the optimal cure time equaled the elapsed time, was in substance directed toward an unpatentable mathematical algorithm. Ultimately, the Supreme Court, in a 5-4 decision, disagreed, holding rather that the process constituted an application of the algorithm sufficient to render it a patentable process. Like Chakrabarty, Diehr was perceived to represent a liberalization of the Supreme Court's view of patent-eligible subject matter, helping to pave the way for the eventual patenting of processes claiming biological relationships such as genotype-phenotype associations. Finally, in an effort to provide national uniformity and add greater certainty and expertise to the application of patent law, in 1982 Congress created the Court of Appeals for the Federal Circuit (CAFC), with exclusive appellate jurisdiction over patent cases [21].
Since its inception, the Federal Circuit's decisions have significantly affected the biotech sector, generally expanding patent-eligible subject matter and strengthening the rights of patent holders relative to potential infringers. Since the Chakrabarty and Diehr decisions, and the implementation of the CAFC, many patents have been issued on a range of biotech inventions, from transgenic mice and leukemia-derived cell lines to recombinant drugs and vaccines. Thousands of patents have also been awarded on human gene sequences, genetic variants, and, more recently, genotype-phenotype correlations [22].

The coalescence of the preceding events set the stage for the enormous growth of the U.S. biotech industry. From 1982 through 2002, U.S. Food and Drug Administration (FDA) approvals for biotech drugs and vaccines grew from 2 to 35. The number of U.S. biotech companies expanded from 225 in 1977 to 1457 in 2001. Biotech employment mushroomed from 700 in 1980 to 191,000 in 2001. In addition, the industry's growth created hundreds of thousands of jobs in related industries [23,24].

It has been argued that in awarding gene patents the U.S. Patent and Trademark Office and the CAFC merely followed the Supreme Court's instruction in Chakrabarty to interpret patent eligibility broadly [25]. Importantly, post-Chakrabarty, our patent system looked to chemical law precedents as a basis for awarding gene sequence patents, and treated DNA itself as a chemical despite its dual roles as a physical substance and a store of biological information. In Amgen v. Chugai Pharmaceutical Co., the CAFC wrote, "A gene is a chemical compound, albeit a complex one" [26]. Prior precedents in chemical law upheld the patenting of isolated, purified compounds such as aspirin, epinephrine, vitamin B12, and prostaglandins [27-30]. The Patent Office applied these legal precedents to isolated DNA sequences. This direct superimposition of chemical law precedents onto DNA permitted circumvention of the "product of nature" doctrine's long-standing prohibition against patenting natural substances, and allowed for the issuance of patents on isolated, purified human genes.


ARGUMENTS FOR AND AGAINST GENE PATENTS

Practitioners in the field argue that gene patents have significantly inhibited the provision of genetic testing services [31]. Many providers have discontinued, or have been prevented from providing, molecular genetic testing for inherited breast and ovarian cancer, Duchenne muscular dystrophy, the spinocerebellar ataxias, the genes causing long QT syndrome, the FLT3 internal tandem duplication in patients with intermediate-risk acute myeloid leukemia (AML), the JAK2 V617F variant in myeloproliferative neoplasms, and many others. Intuitively, monopolistic behavior would be expected to lead to increased prices and decreased patient access to testing. Although there is some support for this contention, true markets do not exist for health care services in the United States because of the roles of third-party insurance and of government as a major payer. This framework may militate against, or obscure, the predicted economic consequences of genetic test monopolies. Further, actual test prices are often difficult to obtain, making comparisons of true charges difficult. Finally, gene patent enforcement and licensing practices have been inconsistent, presenting obstacles to study of the impacts of gene patents and making generalizations about their effects difficult [32,33].

For single gene discoveries and their subsequent introduction into clinical testing, the notion that gene patents have been a necessary stimulus seems dubious. In general, rather than encouraging the introduction of new tests, gene patents have tended to cause labs to discontinue tests they had already been performing. Most human genes on which clinical testing has been performed have been discovered by university faculty members. For these professors, publication and the attainment of grants based upon the quality of their research are required for academic promotion and even survival, rendering patents unnecessary as a motivator. Consistent with this contention, inherited diseases are frequently rare, offering very limited market potential. Yet many such genes have been discovered despite an apparent lack of significant commercial or monetary potential, because they fell within the research interests of investigators. Lastly, it is typically inexpensive to design, develop, validate, and perform genetic tests using justifiably patented tools and techniques. This is in contrast to pharmaceutical development, which requires costly, extensive periods of discovery and testing, and compliance with an expensive approval process, features that support the need for robust patent protection for "commercialization" [34].

Although the preceding discussion regarding the adverse effects of gene patents on the introduction of new molecular genetic assays holds true for most assays, the relative impact of gene-related patents on some tests based on multianalyte gene expression profiling seems less clear. A central feature of these assays is a reliance on proprietary mathematical algorithms that proponents claim allow for correlation of the expression patterns of, for example, multiple mRNAs, sometimes in combination with other parameters, with relevant clinical characteristics such as diagnosis, prognosis, or response to drug therapy. A number of such tests are oncology related. Implementation of these types of expression profiling assays typically requires prolonged and potentially expensive periods of study in order to establish sufficient clinical validity and utility to justify their use.
At the time of this writing, few such assays have crossed this threshold of support by demonstrating high-level evidence of clinical utility. Further, expression profiling tests that are offered as laboratory-developed tests (LDTs) may in the future require FDA approval or clearance, adding to development costs. Therefore, exclusivity may be necessary to attract sufficient funding to advance those assays that ultimately prove worthy into clinical care. Arguably, some inventive work has occurred in generating such assays through establishment of the gene "signature." Moreover, although gene expression profiling assays incorporate or rely on natural biological associations, they also can generally be presented as not involving a risk of tying up essential natural phenomena. Finally, while patent protection may be needed to bring assays of this type to market, patents on individual genes or nucleic acids could otherwise obstruct development by restricting the use of the genes available for inclusion in a test.

IMPORTANT LEGAL CASES

Bilski v. Kappos. In Bilski v. Kappos, the USPTO rejected a patent application for a computerized process of hedging commodities against price fluctuations. The method involved having purchasers contract to buy commodities at fixed prices from sellers who wanted to hedge against a fall in prices. The purchasers also contracted to sell the commodities at fixed prices to consumers who were hedging against a rise in prices. On appeal, the CAFC upheld the USPTO's denial of a patent [35]. Prior to its Bilski decision, the rule at the CAFC was that a patent-eligible process had to produce a "useful, concrete, and tangible result." In Bilski, the CAFC articulated a new standard, termed its "machine or transformation test." The CAFC, sitting en banc, held that patentable processes must be tied to a particular machine or apparatus or must transform a particular article into a different state or thing, and that this transformation must be central to the purpose of the process. Bilski's hedging process, the CAFC ruled, failed to meet the machine or transformation test, and therefore was ineligible to receive a patent.

The U.S. Supreme Court affirmed the lack of patent eligibility of the claimed hedging process, but refined the CAFC's reasoning [36]. Although the machine or transformation test may be a "useful and important clue or investigative tool" for deciding whether some processes are patent-eligible inventions under 35 U.S.C. section 101 of the Patent Act, the Supreme Court held that it is not the sole test of patent eligibility by which such processes are to be evaluated. Some gene patent claims that assert ownership over genotype-phenotype associations have framed these natural laws as a series of steps, thus characterizing them as processes. The Bilski decision influences the framework under which the patent eligibility of process claims is evaluated. Therefore, although the Supreme Court's opinion was narrowly crafted to the specific set of business facts before it in the case, Bilski has relevance for the assessment of the patent eligibility of process claims involving human genes.

KSR Int'l Co. v. Teleflex Inc. In the 2007 case of KSR Int'l Co. v. Teleflex Inc., a unanimous U.S. Supreme Court relaxed the legal standards for determining patent obviousness under section 103 of the Patent Act [37]. KSR added a sensor to one of its previously designed automobile throttle pedals. Teleflex then sued KSR for infringement of a patent that claimed the combination of an adjustable automobile accelerator pedal and an electronic sensor. In response, KSR argued that the patent was invalid because its subject matter was obvious. The district court agreed with KSR, ruling that the accelerator-sensor combination was obvious. The CAFC reversed the lower court decision. In upholding the patent, the CAFC applied what was termed its "teaching, suggestion, or motivation" (TSM) test for obviousness determinations. Under this test, a patent claim could be found obvious only if there was "some motivation or suggestion to combine the prior art teachings" present in the previous body of knowledge within the field, the nature of the problem the solution sought to solve, or the knowledge of a person who possessed ordinary skill in the field. That an approach was "obvious to try," the CAFC wrote, had under previous precedents long been irrelevant [38].

The Supreme Court rejected the CAFC's rigid, formalistic, and narrow process for establishing obviousness in favor of a more "expansive and flexible approach," ruling that the throttle pedal-sensor combination was obvious at the time of the patent application. Importantly, the Supreme Court held that obviousness to try a problem-solving approach can in fact render a patent obvious under circumstances in which there is a demonstrated need for a discovery and a finite number of identified, predictable solutions to the problem. The Court wrote,

When there is a design need or market pressure to solve a problem and there are a finite number of identified, predictable solutions, a person of ordinary skill has good reason to pursue the known options within his or her technical grasp. If this leads to the anticipated success, it is likely the product not of innovation but of ordinary skill and common sense. In that instance the fact that a combination was obvious to try might show that it was obvious under § 103 [37].

Many patented genes were initially mapped to a chromosomal region prior to their discovery. Significant numbers of medically important genes are involved in sequential biochemical pathways. Frequently, disease-related perturbations in such pathways were known before particular genetic associations were identified. Therefore, it would have been obvious to look for variants in a finite number of pathway genes during genetic studies of the relevant disorder. Finally, cDNA sequences are directly derived from the exon sequences of native genes, and can also be deduced from the amino acid sequences of the proteins they encode, likely rendering significant numbers of patent claims on cDNA obvious. In light of the preceding, the Supreme Court's decision in KSR potentially affects the validity of many gene-related patents.

In re Kubin. In 2009, the case of In re Kubin provided the CAFC with an early opportunity to apply the obviousness paradigm the Supreme Court set forth in KSR [39]. In Kubin, the USPTO refused to award a patent on the full gene and cDNA sequences of the Natural Killer Cell Activation Inducing Ligand (NAIL), an NK cell surface receptor that plays a role in cellular activation. The Patent Office rejected the application both on obviousness grounds under 35 U.S.C. section 103 and for inadequate written description under 35 U.S.C. section 112. On appeal, the CAFC affirmed the Patent Office's decision, agreeing that delineation of the NAIL gene sequences was obvious in light of the prior art, which included knowledge of the existence of the NAIL protein but not its protein sequence. Citing the case of Graham v. John Deere Co. [40], the CAFC reviewed the factual inquiries necessary for a legal finding of obviousness. These, the CAFC wrote, include: "(1) the scope and content of the prior art; (2) the differences between the prior art and the claims at issue; (3) the level of ordinary skill in the art at the time the invention was made; and (4) objective evidence of nonobviousness, if any." Under the aforesaid criteria, the CAFC found that the NAIL gene sequences were obvious [39].

In applying the Supreme Court's KSR decision, the CAFC reversed one of its previous DNA cases, In re Deuel [41]. In Deuel, the CAFC had overturned the Patent Office's conclusion that the existence of a prior art reference describing a method of gene cloning, together with the partial amino acid sequence of a protein, rendered the underlying cDNA sequence obvious. Instead, the Deuel court found that knowledge of the protein sequence was in and of itself insufficient to generate the sequence of the underlying cDNA and, therefore, that the sequence was nonobvious. Moreover, in Deuel the CAFC stated that "obvious to try" was an inappropriate test for obviousness. The Court wrote:

[T]he existence of a general method of isolating cDNA or DNA molecules is essentially irrelevant to the question whether the specific molecules themselves would have been obvious, in the absence of other prior art that suggests the claimed DNAs . . . . "Obvious to try" has long been held not to constitute obviousness. A general incentive does not make obvious a particular result, nor does the existence of techniques by which those efforts can be carried out [41].

In light of the Supreme Court's prior rejection of the CAFC's "obvious to try" doctrine in KSR, the CAFC in Kubin found that the combination of elements required to obtain the NAIL cDNA and full gene sequences was obvious to try, and therefore obvious, under 35 U.S.C. section 103. The CAFC stated,

In light of the concrete, specific teachings of Sambrook and Valiante, artisans in this field, as found by the Board in its expertise, had every motivation to seek and every reasonable expectation of success in achieving the sequence of the claimed invention. In that sense, the claimed invention was reasonably expected in light of the prior art and "obvious to try" [39].

Mayo v. Prometheus. In Mayo Collaborative Services v. Prometheus Laboratories, Inc. [42], Prometheus Labs sued Mayo Clinic in the District Court for the Southern District of California for infringement of two patents covering the postadministration correlation of blood levels of the thiopurine metabolites 6-methylmercaptopurine and 6-thioguanine with thiopurine efficacy and related side effects. Both patents were written in the form of stepwise processes, the relevant claims of which included the generic steps of: (1) administering the drug; (2) measuring the metabolite levels; and (3) describing the metabolite concentrations above and below which are associated with an increased likelihood of toxicities or lack of efficacy, respectively, and informing the ordering physician of the potential need to decrease or increase the drug dose. Thus, the patents in effect claimed the reference ranges for thiopurine drugs. Mayo Clinic had been utilizing Prometheus' test, but in 2004 announced that it was going to offer its own internally developed test for the thiopurine metabolites. Prometheus sued Mayo for patent infringement. Mayo Clinic argued that the association between thiopurine metabolite levels and physiological response was an unpatentable natural phenomenon, and that Prometheus' patents were therefore invalid as a matter of law under section 101 of the Patent Act. The District Court agreed with Mayo, and ruled that Prometheus' patents were invalid. The CAFC reversed the District Court, instead holding that the patents claimed methods of treatment. Moreover, the CAFC held, the in vivo metabolism of thiopurine agents constituted transformations of matter under that Court's "machine or transformation test," a test discussed earlier in this chapter in connection with Bilski v. Kappos. In Bilski, the Supreme Court clarified that although the "machine or transformation test" is an important and useful clue to patent eligibility, it is not a definitive test for it. Mayo appealed to the Supreme Court, which, following its decision in Bilski, accepted Mayo v. Prometheus and immediately returned it to the CAFC for reconsideration. On remand, the CAFC reaffirmed its earlier decision reversing the District Court's determination that Prometheus' patents were invalid. Mayo again appealed to the Supreme Court, and the Court accepted the case.

In a 9-0 decision, the Supreme Court held that the processes claimed in Prometheus' patents were not patent eligible. The Court recognized that an unpatentable biological correlation lay at the center of Prometheus' patents. In order to receive a process patent that purports to claim an application of a natural law, the Court noted, sufficient inventive effort must be added to the natural law so as to ensure that the patent is "significantly more than a patent upon the natural law itself." Moreover, the Court emphasized that the addition of routine steps cannot convert a natural law into a patentable process. As the Court explained, "If a law of nature is not patentable, then neither is a process of reciting a law of nature, unless that process has additional features that provide practical assurance that the process is more than a drafting effort designed to monopolize the law of nature itself." The Court succinctly summarized, "[T]o transform an unpatentable law of nature into a patent-eligible application of such a law, one must do more than simply state the law of nature while adding the words 'apply it.'"


The unanimity, clarity, and strength of the Supreme Court's opinion in support of this ruling, standing alone, implies that analogous patents covering genotype-phenotype associations are also invalid. This conclusion is bolstered by the CAFC's affirmance of the District Court's finding of invalidity of Myriad Genetics' sequence comparison claims in Association for Molecular Pathology v. Myriad Genetics [43], discussed subsequently, and is reinforced and further strengthened by the Supreme Court's decision in the Myriad case that human DNA sequences cannot be patented. Mayo v. Prometheus and Association for Molecular Pathology v. Myriad Genetics have important implications for genomic analyses performed using next-generation sequencing, and for genetic testing as a whole.

Association for Molecular Pathology v. Myriad Genetics, Inc. Finally, in Association for Molecular Pathology v. Myriad Genetics, Inc. [44], a lawsuit sponsored by the American Civil Liberties Union, a number of medical and professional societies, health care providers, and breast cancer patients sued Myriad Genetics, the University of Utah Research Foundation, and the USPTO, seeking to invalidate key composition of matter and process claims of patents covering the wild-type and mutated sequences of the BRCA1 and BRCA2 genes, as well as correlations between variants in those sequences and the predisposition to breast and ovarian cancer. In total, the plaintiffs challenged 15 claims contained in 7 patents. They argued that these patent claims were invalid under section 101 of the Patent Act of 1952, and unconstitutional under Article I, Section 8, Clause 8 and the First and Fourteenth Amendments, because they asserted ownership of natural products, natural laws, natural phenomena, abstract ideas, and basic human knowledge or thought. In response, Myriad argued that although its patents claimed DNA sequences that were identical to those present in the human body, because the sequences were isolated from the body they constituted human inventions. Myriad also asserted that its patented associations between variants in BRCA1 and BRCA2 and the hereditary predisposition to breast and ovarian cancers were actually diagnostic methods involving sequence comparisons, not patents on abstract thought or the biological relationships themselves. The District Court distilled the lawsuit into a single fundamental question: "Are isolated human genes and the comparison of their sequences patentable?" [45] The Judge, Robert W. Sweet, emphasized the centrality of knowledge of molecular biology to the proper disposition of the case, as well as the importance of any potentially relevant additional inventive steps [43]. On page 27 of its opinion, Judge Sweet wrote, "An understanding of the basics of molecular biology is required to resolve the issues presented and to provide the requisite insight into the fundamentals of the genome, that is, the nature which is at the heart of the dispute between the parties. . . ." The Court devoted the next 19 pages of the opinion to a thorough review of generally accepted principles of molecular biology.
It concluded the section with the recognition that some inventive work was involved in the initial sequencing of the BRCA1 and BRCA2 genes, stating, "However, because sequencing requires knowledge of the sequence of a portion of the target sequence, some ingenuity and effort is required for the initial sequencing of a target DNA." Expert declarations helped the Court sort out the extent, significance, and relevance of this work to the validity of the claims at issue [46,47]. In the pertinent sections of their respective declarations, Dr. Mark A. Kay attempted to emphasize the inventive aspects of sequencing a newly discovered product. Dr. Roger D. Klein emphasized the breadth of the patents; the natural products and laws they claimed; the routine, insubstantial, and nontransformative steps involved in performing genetic testing; and the relationship of genetic testing to other forms of medical diagnosis. In paragraph 183 of his declaration, Dr. Kay described the steps involved in sequencing a newly identified product:

To sequence a particular target, at least part of the target sequence must be known to design a suitable primer. The initial sequencing of a target sequence requires ingenuity far beyond the mere application of routine laboratory techniques and usually involves a significant amount of trial and error. A primer is used to initiate the sequencing reaction at the desired location of a target sequence. A primer is an artificial DNA fragment, usually between 15 and 30 nucleotides long, that binds specifically to the target nucleotide sequence. The nucleotide sequence of the primer is complementary to the target sequence such that the bases of the primer and the bases of the target sequence bind to each other.

By contrast, in paragraphs 32-34 of his declaration, Dr. Klein wrote:

The claims at issue in this case do not cover diagnostic tools or actual methods used in genetic testing. Nor are they analogous to patents on medical instruments. Rather they claim DNA sequences which are themselves the subject of medical inquiry. Further, they incorporate generic steps in an effort to describe the biological relationships between mutations in BRCA1 and BRCA2 and the predisposition to cancer in the abstract patent language of a "process." However, the key steps in genetic testing, DNA extraction, amplification, and sequencing can now be performed using routine, automated methods. Nevertheless, the defendants claim the exclusive right to read and compare BRCA1 and BRCA2 sequences irrespective of the method used, whether that method is in existence now or will be invented in the future. Correlating a patient's gene sequence with the predisposition to disease is simply another form of medical diagnosis, similar to correlating elevations in blood glucose with diabetes, a heart murmur with mitral stenosis, or the patterns on a pathology slide with a particular type of tumor and its optimal therapy. Automated sequencers reveal the sequence of the nucleotides visually in what is called a chromatogram. That chromatogram is then "read" (by software and visual inspection) to determine a patient's gene sequence. DNA extraction and sequencing are not transformative activities. Rather extraction is a routine, nonsubstantial preparatory step that allows for PCR amplification and sequencing. Sequencing is an automated procedure. DNA extraction, PCR, and sequencing do not involve transformations that are central to the purpose of the process of reading a patient's gene sequence. Unlike "tanning, dyeing, making waterproof cloth, vulcanizing India rubber, or smelting ores," which are performed for the purpose of physically transforming substances so as to create what are essentially new materials for their own sake, the purpose of genetic testing is solely to read the sequence of the DNA, not to transform it into something else. Only in this way can the patient and her physician learn whether a medically relevant mutation is present in her body.

On March 29, 2010, in a landmark decision, the District Court held that the composition of matter claims on the BRCA1 and BRCA2 gene sequences and their cDNAs, and the process claims covering the correlations between mutations in BRCA1 and BRCA2 and the predisposition to breast cancer and ovarian cancer, were invalid as a matter of law [45]. In evaluating the composition of matter claims on the isolated gene sequences, the Court emphasized the unique informational characteristics contained in the DNA sequence, and the preservation of that native sequence in isolated DNA, stating, "Because the claimed isolated DNA is not markedly different from native DNA as it exists in nature it constitutes unpatentable subject matter under 35 U.S.C. section 101." Similarly, the Court found the comparison claims of known wild-type and patient sequences for diagnosis, claims that in effect asserted ownership over the biological relationships between BRCA1 and BRCA2 mutations and the predisposition to breast cancer, invalid as merely claiming abstract mental processes.

On appeal, the CAFC on July 29, 2011, in a 2-1 decision, reversed the District Court, holding that isolated human gene sequences are patent eligible [43]. However, the CAFC upheld the lower court's ruling that Myriad's sequence comparison claims were invalid. The plaintiffs appealed the case to the Supreme Court. Immediately after deciding Mayo v. Prometheus, the Supreme Court accepted Association for Molecular Pathology v. Myriad Genetics, vacated the CAFC's decision, and sent the case back to the CAFC for further consideration in light of its decision in Mayo. After rebriefing of the case and a second round of oral arguments, the CAFC again held 2-1 that isolated human genes were patent eligible, ruling that they represented new compositions of matter that did not exist in nature. As in the CAFC's previous decision in the case, each judge wrote a separate opinion. All three judges agreed that BRCA1 and BRCA2 cDNA should be patent eligible, reasoning that cDNA is not naturally occurring and is made by man. The central disagreement among these judges was whether separating human DNA from its chromosome and other cellular constituents renders it a patent-eligible invention. The two judges who determined that human DNA is patent eligible came to the same conclusion using different reasoning. Judge Alan Lourie, who authored what was nominally the primary opinion for the Court, opined that because separating a gene from its chromosome involves breaking covalent bonds, a DNA sequence removed from its natural environment is a new chemical. Judge Kimberly Moore relied at least in part on the past practice of the USPTO in granting such patents, and the reliance of companies and inventors on that practice; she said she might have voted differently had the question come before her on a "blank canvas." Judge William Bryson, in dissent, wrote that the breaking of covalent bonds alone did not create a new molecule, and was not determinative of patent eligibility. Rather, he concluded that the genes' DNA sequences are identical whether the genes are within or outside the body, and because of this that these DNAs are fundamentally the same molecules, irrespective of location. For the dissenting judge, the importance of the sequence of nucleotides in the DNA molecules substantially outweighed the importance of any chemical differences between DNA in the body and DNA isolated from it.
Finally, Judge Bryson recognized the potential threat gene patents posed for large-scale sequencing, writing: "[T]he court's decision, if sustained, will likely have broad consequences, such as preempting methods for whole-genome sequencing, even though Myriad's contribution is not remotely consonant with such effects. . . ." The CAFC ultimately chose to disregard the constancy of a gene's most fundamental and relevant property, its coding sequence. On behalf of the Court, Judge Lourie wrote, "The isolated DNA molecules before us are not found in nature. They are obtained in the laboratory and are man-made, the product of human ingenuity." He maintained that native and isolated gene sequences have distinct chemical structures and identities because the isolated genes have been separated from associated proteins and from the chromosomes on which they naturally reside, either through the cleaving of covalent bonds or by synthesis. In addition, the CAFC again held that Myriad's sequence comparison claims were invalid. The plaintiffs once more appealed the case to the Supreme Court.


On June 13, 2013, in a historic 9-0 decision authored by Justice Clarence Thomas, the Supreme Court held that naturally occurring DNA sequences are "products of nature" that are not patent eligible [44]. The Court acknowledged Myriad's contribution to the field, but noted that its discoveries were limited to identifying the precise location and sequence of the BRCA1 and BRCA2 genes. The Court stated, "In this case . . . Myriad did not create anything. To be sure, it found an important and useful gene, but separating that gene from its surrounding genetic material is not an act of invention." The Court referred to Myriad's patent claims, which themselves confirmed that the fundamental essence of DNA lies in its information content:

Myriad's claims are simply not expressed in terms of chemical composition, nor do they rely in any way on the chemical changes that result from the isolation of a particular section of DNA. Instead, the claims understandably focus on the genetic information encoded in the BRCA1 and BRCA2 genes. If the patents depended upon the creation of a unique molecule, then a would-be infringer could arguably avoid at least Myriad's patent claims on entire genes . . . by isolating a DNA sequence that included both the BRCA1 or BRCA2 gene and one additional nucleotide pair. Such a molecule would not be chemically identical to the molecule "invented" by Myriad. But Myriad obviously would resist that outcome because its claim is concerned primarily with the information contained in the genetic sequence, not with the specific chemical composition of a particular molecule.

Finally, the Court did rule that cDNA is patent eligible because it is not naturally occurring. However, patent eligibility, as the Court pointed out in a footnote, does not necessarily equate to patentability under other sections of the Patent Act that this decision did not address. Moreover, because cDNA is not essential for the performance of most genetic testing, the ruling that cDNA is patent eligible is unlikely to have a significant impact on molecular genetic testing going forward.

IMPLICATION OF RECENT COURT DECISIONS FOR GENETIC TESTING

In two recent decisions relevant to genetic testing, both unanimous, the Supreme Court reaffirmed its longstanding prohibitions against patenting natural laws and products of nature. In Mayo, the Court was clear that characterizing a biological association as a process does not, without the addition of a truly inventive step, convert the association into a patent-eligible application of a natural law. Mayo was an extremely important decision, the holding of which seemingly means that method patents that attempt to claim associations between genetic variants and clinical phenotypes are invalid. In Association for Molecular Pathology, the Supreme Court found that naturally occurring human DNA sequences are not patentable, rendering patents on human genes invalid. Read together, these two cases appear to have removed the intellectual property barriers associated with testing for genetic mutations and their relationships to clinical phenotypes, whether testing is for identification of predisposition to disease, therapeutic responsiveness, medicinal side effects, or tumor behavior. Thus, the Supreme Court has helped facilitate the introduction of large-scale sequencing into clinical practice, and has thereby encouraged the advancement, development, and implementation of personalized medicine.

GENETIC INFORMATION NONDISCRIMINATION ACT

The introduction of hereditary disease testing into clinical medicine has been accompanied by concerns about possible misuse of information obtained from this testing. These apprehensions reflect a general perspective on the nature of genetic information that has been termed "genetic exceptionalism," i.e., the notion that awareness of an individual's predisposition to or development of a heritable disease or condition differs from other medical information in a manner that raises unique social issues [48]. Underlying this viewpoint is an understanding of the innate and typically limited modifiability of the inherited disorders or conditions to which an individual may be predisposed, the often highly predictive nature of genetic test results, and the implications of genetic testing for family members. Thus, many people believe that it is generally unjust to permit differential treatment of an individual or his or her family members based on knowledge of inborn characteristics that are beyond their control. These considerations were evidenced in the passage of the Genetic Information Nondiscrimination Act of 2008 (GINA), which was signed into law by President George W. Bush on May 21, 2008 [5,49]. The statute's provisions became effective on May 21, 2009.

GINA prohibits discrimination on the basis of genetic information in the individual and group health insurance markets, and in employment-related decisions. Title I of GINA amended the Employee Retirement Income Security Act (ERISA), which is administered by the Department of Labor and sets minimum standards for most employee health and pension plans; the Public Health Service Act, administered by the Department of Health and Human Services (HHS); the Internal Revenue Code, administered by the Department of the Treasury and the Internal Revenue Service (IRS); and the Social Security Act, under the jurisdiction of HHS, to prohibit group health plans and health insurance companies from discriminating on the basis of genetic information in the group, individual, and supplemental insurance markets.

GINA defines genetic information about an individual to include genetic tests; the genetic tests of family members; the manifestation of a genetic disease or disorder in an individual's family members; a request for or receipt of genetic services, including counseling and education, by that individual or his or her family members; and participation in genetic research. However, the statute's definition of genetic information excludes information about an individual's age or sex. The Act defines a genetic test as "an analysis of human DNA, RNA, chromosomes, proteins, or metabolites, that detects genotypes, mutations, or chromosomal changes." Analyses of proteins and metabolites that do not detect genotypes, mutations, or chromosomal changes, or that are related to an already manifested disease, are excluded from the definition of a genetic test.

Under GINA, group health plans and health insurers are not permitted to adjust premium or contribution amounts based on genetic information; request or require genetic testing from the applicant or participant (including fetuses and embryos) or his or her family members; or request, require, or purchase genetic information from an individual prior to or in connection with enrollment in a group plan. GINA also outlaws the use of genetic information to impose preexisting condition exclusions, although the subsequent passage of the Affordable Care Act, which requires insurers to cover all applicants and forbids the use of preexisting health conditions in setting insurance rates, has largely superseded this GINA provision. Title I's amendments to the Social Security Act required that the Secretary of HHS revise the Health Insurance Portability and Accountability Act of 1996 (HIPAA) privacy regulations to explicitly include genetic information as health information, and to prohibit the use or disclosure of such information by group, individual, and Medicare supplemental policy insurers.

Title II prohibits employers, employment agencies, labor organizations, and joint labor-management committees that control training and apprenticeship programs from using genetic information to discriminate in any aspect of employment, including hiring; discharge; compensation; job assignments; promotions; layoffs; fringe benefits; membership; admission; participation decisions; and terms, conditions, or privileges of employment; or to otherwise classify employees in a manner that would negatively impact their employment opportunities or status. In addition, employment agencies, labor organizations, and training programs are prohibited from causing or attempting to cause employers to discriminate on the basis of genetic information. Further, it is unlawful for joint labor-management committees that control training and apprenticeship programs to intentionally request, require, or purchase genetic information about an employee or the employee's family members.
GINA does contain limited exceptions to these latter provisions, which include the dispensing of health services, voluntary participation in employee wellness programs, and compliance with the Family and Medical Leave Act of 1993 or similar state laws; voluntary or legally mandated genetic monitoring of the biological effects of toxic substances in the workplace in a manner compliant with applicable statutes and regulations; and information obtained through the purchase of certain types of commercially and publicly available documents such as newspapers and magazines.

Title II requires that if an employer, employment agency, labor organization, or joint labor-management committee possesses genetic information about an employee or member, the information must be maintained on separate forms and in separate medical files, and be treated as a confidential medical record of the employee or member. Nor can these entities disclose genetic information regarding an employee or member except to that employee or member after written request; to a health researcher as part of federally compliant research; in response to a court order; to government officials investigating GINA Title II compliance; in connection with an employee's compliance with federal and state family and medical leave laws; or to federal, state, or local public health agencies in relation to a contagious disease that presents an imminent hazard of death or life-threatening illness. GINA does not prohibit covered entities from use or disclosure of health information authorized by the HIPAA regulations.

Historically, there have been few documented examples of discrimination in health insurance or employment based on genetic testing or the results of genetic testing. Prior to the enactment of GINA, the majority of states had enacted laws that outlawed the use of genetic tests in determining eligibility for health insurance or its costs. Thirty-four states had laws banning discrimination in employment based on the results of genetic testing. In addition, the Equal Employment Opportunity Commission (EEOC) had interpreted the Americans with Disabilities Act (ADA) to include the inherited predisposition to disease. Finally, under HIPAA, group health insurers were prohibited from applying "preexisting condition" exclusions to genetic conditions diagnosed solely on the basis of genetic testing, as opposed to symptoms [50].


Since the implementation of GINA, the number of genetic discrimination complaints filed with the EEOC has been a small proportion of the total complaints filed. Moreover, it seems likely that these accusations have largely been tacked on to other employment-related discrimination complaints, for example, claims filed under the ADA, or related to age, race, and/or sex. Further, an online review of published court decisions provides little evidence that many such complaints have been substantiated, or in particular that genetic testing has been a significant contributor to employment or health insurance discrimination [51,52].

However, there are gaps in GINA that are important to understand with the emergence of large-scale genomic sequencing. Importantly, GINA's health insurance-related protections do not extend to long-term care, disability, or life insurance, and the law's reach is limited to genetic information uncovered in the presymptomatic stage. Pretest counseling and the decision to report incidental findings in whole exome, whole genome, and other multigene sequencing tests must therefore be informed by the understanding that insurance companies can request, and legally make use of, the information reported from these procedures in life, long-term care, and disability coverage and pricing decisions.

References
[1] Klein RD. Gene patents and genetic testing in the United States. Nat Biotechnol 2007;25:989–90.
[2] Klein RD. Legal developments and practical implications of gene patenting on targeted drug discovery and development. Clin Pharmacol Ther 2010;87:633–5.
[3] Cook-Deegan R, Conley JM, Evans JP, Vorhaus D. The next controversy in genetic testing: clinical data as trade secrets? Eur J Hum Genet 2013;21:585–8.
[4] Crichton M. Patenting life. New York Times, February 13, 2007, at A23. Available at: <http://www.nytimes.com/2007/02/13/opinion/13crichton.html>; [accessed 21.07.14].
[5] Pub. L. No. 110–233, 122 Stat. 881 (2008).
[6] 35 U.S.C. § 154(a) (2012).
[7] Pub. L. No. 112–29, 125 Stat. 284 (2011).
[8] 35 U.S.C. §§ 101–103 (2012).
[9] 35 U.S.C. § 112 (2012).
[10] 35 U.S.C. § 101 (2012).
[11] Diamond v. Diehr, 450 U.S. 175 (1981).
[12] 35 U.S.C. § 271(a) (2012).
[13] 35 U.S.C. § 271(b) (2012).
[14] 35 U.S.C. § 271(c) (2012).
[15] 35 U.S.C. §§ 200–212 (2012).
[16] Moses H, Dorsey ER, Matheson DHM, Thier SO. Financial anatomy of biomedical research. JAMA 2005;294:1333–42.
[17] <http://www.aaas.org/spp/rd/discip05.pdf>.
[18] Rai AK, Eisenberg RS. Bayh–Dole reform and the progress of medicine. Law Contemp Probl 2003;66:289–314.
[19] <http://www.uspto.gov/web/offices/ac/ido/oeip/taf/univ/asgn/table_1_2005.htm>; [accessed 21.07.14].
[20] Diamond v. Chakrabarty, 447 U.S. 303 (1980).
[21] 28 U.S.C. § 1295 (2012).
[22] Caulfield T, Cook-Deegan RM, Kieff FS, Walsh JP. Nat Biotechnol 2006;24:1091–4.
[23] <http://www.bio.org/ataglance/bio/>; [accessed 21.07.14].
[24] Lee SB, Wolfe LB. Biotechnology industry. Encyclopaedia of occupational health and safety. 4th ed. International Labour Organization; 2011, <http://www.ilo.org/oshenc/part-xii/chemical-processing/examples-of-chemical-processing-operations/item/382-biotechnologyindustry>; [accessed 29.06.14].
[25] Brief for Respondent, Association for Molecular Pathology v. Myriad Genetics, No. 12-398, 569 U.S. ___ (March 7, 2013); U.S. Patent and Trademark Office Utility Guidelines, 66 Fed. Reg. 1092 (January 5, 2001).
[26] Amgen, Inc. v. Chugai Pharmaceutical Co., 927 F.2d 1200 (Fed. Cir. 1991), cert. denied, 502 U.S. 856 (1991).
[27] Kuehmsted v. Farbenfabriken, 179 F. 701 (7th Cir. 1910), cert. denied, 220 U.S. 622 (1911) (acetyl salicylic acid).
[28] Parke-Davis & Co. v. H.K. Mulford & Co., 189 F. 95 (S.D.N.Y. 1911), aff'd, 196 F. 496 (2d Cir. 1912) (epinephrine).
[29] Merck & Co. v. Olin Mathieson Chemical Corp., 253 F.2d 156 (4th Cir. 1958) (vitamin B12).
[30] In re Bergstrom, 427 F.2d 1394 (CCPA 1970) (PGE, PGF).
[31] Cho MK, Illangasekare S, Weaver MA, Leonard DGB, Merz JF. Effects of patents and licenses on the provision of clinical genetic testing services. J Mol Diagn 2003;5:3–8.
[32] Secretary's Advisory Committee on Genetics, Health, and Society. Gene patents and licensing practices and their impact on patient access to genetic tests. Available at: <http://oba.od.nih.gov/oba/SACGHS/reports/SACGHS_patents_report_2010.pdf>; 2010.
[33] Klein RD. Public comments submitted May 15, 2009 in response to Secretary's Advisory Committee on Genetics, Health and Society: Gene Patents and Licensing Practices and Patient Access to Genetic Tests, Draft Report, March 2009 (available from author on request).
[34] Bessen J, Meurer MJ. Patent failure: how judges, bureaucrats, and lawyers put innovators at risk. Princeton, NJ: Princeton University Press; 2008.
[35] In re Bilski, 545 F.3d 943 (Fed. Cir. 2008) (en banc).
[36] Bilski v. Kappos, 130 S.Ct. 3218 (2010).
[37] KSR Int'l Co. v. Teleflex Inc., 550 U.S. 398 (2007).
[38] Teleflex, Inc. v. KSR Int'l Co., 119 F. App'x 282 (Fed. Cir. 2005).
[39] In re Kubin, 561 F.3d 1351 (Fed. Cir. 2009).
[40] Graham v. John Deere Co., 383 U.S. 1 (1966).
[41] In re Deuel, 51 F.3d 1552 (Fed. Cir. 1995).
[42] Mayo Collaborative Services v. Prometheus Laboratories, Inc., 132 S.Ct. 1289 (2012).
[43] Association for Molecular Pathology v. Myriad Genetics, Inc., 689 F.3d 1303 (Fed. Cir. 2012).
[44] Association for Molecular Pathology v. Myriad Genetics, Inc., No. 12-398, 569 U.S. ___ (2013).
[45] Association for Molecular Pathology v. United States Patent and Trademark Office, No. 09-4515 (S.D.N.Y. filed May 12, 2009).
[46] Declaration of Dr. Mark A. Kay, Association for Molecular Pathology v. United States Patent and Trademark Office, No. 09-4515 (S.D.N.Y. filed May 12, 2009).
[47] Declaration of Dr. Roger D. Klein, Association for Molecular Pathology v. United States Patent and Trademark Office, No. 09-4515 (S.D.N.Y. filed May 12, 2009).
[48] Ross LF. Genetic exceptionalism v. paradigm shift: lessons from HIV. J Law Med Ethics 2001;29:141–8.
[49] Slaughter L. Genetic Information Nondiscrimination Act. Harv J Legis 2013;50:41–66.
[50] Klein RD, Mahoney MJ. Medical legal issues in prenatal diagnosis. Clin Perinatol 2007;34:287–97.
[51] <http://www.eeoc.gov/eeoc/statistics/enforcement/charges.cfm>; [accessed 29.07.14].
[52] <http://www.eeoc.gov/eeoc/statistics/enforcement/genetic.cfm>; [accessed 29.07.14].


C H A P T E R

26

Billing and Reimbursement

Kris Rickhoff, Andrew Drury and John Pfeifer

Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO, USA

O U T L I N E

Introduction 448
Insurance Payers 448
Reimbursement Processes 448
  Reimbursement Rate 448
  Diagnosis and Procedure Codes 450
  Predetermination of Coverage and Benefits 452
Test Design Factors That Impact Reimbursement 452
Patient Protection and Affordable Care Act 454
  Entities Focused on Healthcare Expenditures 454
  Accountable Care Organizations 455
  Health Outcomes 455
Cost Structure 456
Summary 456
References 457
Glossary 457
List of Acronyms and Abbreviations 458

KEY CONCEPTS

• Insurance payers in the healthcare industry are categorized into two groups, private and government funded. Private insurance payers are further categorized as Health Maintenance Organizations (HMOs), Preferred Provider Organizations (PPOs), and Commercial Fee-for-Service; these insurance payers are accessed directly by individuals or through employers. Government-funded payers include Medicare, Medicaid, Tricare, and the Veterans Administration and are accessed directly by individuals.

• The most common government-funded payers are regulated through the Centers for Medicare and Medicaid Services (CMS).

• Rates established by a governmental entity are fixed; in contrast, laboratories have the ability to negotiate terms with private insurance carriers.

• In order for insurance payers to understand what services are being provided and for what reason, providers submit claims that include diagnosis and procedure codes. In the USA, the National Center for Health Statistics and CMS create, remove, and revise diagnosis codes that correspond to the WHO International Classification of Diseases (ICD); the American Medical Association (AMA) creates, removes, and revises procedure codes (known as Current Procedural Terminology or CPT codes) to report services provided.

• In routine patient care, direct clinical utility has the most impact on the rate and level at which services are reimbursed.

• Although it is unclear how the expanded coverage mandated by the Affordable Care Act (ACA) will impact rising healthcare-associated expenditures, including those related to clinical DNA sequence analysis, the ACA included provisions to create three new entities focused on decreasing healthcare expenditures.

Clinical Genomics. DOI: http://dx.doi.org/10.1016/B978-0-12-404748-8.00026-5


© 2015 Elsevier Inc. All rights reserved.


INTRODUCTION

Billing in a laboratory environment is broken down into a professional component and a technical component. The professional component of a laboratory service covers the cost of the interpretive work provided (typically by a pathologist). The technical component of a service covers the cost of the Practice Expense (PE), such as equipment, consumables (reagents), and technical staff. Not all services provided in a laboratory have a professional (interpretive) component.

Although laboratories have several different revenue stream opportunities, the most common revenue stream is payments received for care provided to patients. Laboratories have the ability to contract and enroll with insurance payers (private and governmental) and bill the payers directly for services provided; in this scenario, amounts not paid by the insurance payers become the responsibility of the patient in accordance with the insurance plan benefits. Laboratories may also opt to become a sole source reference laboratory, meaning they do not bill insurance payers or patients, but rather directly bill the referring institution, hospital, or other laboratory for the services provided (in this scenario, the referring institution will likely seek reimbursement for the services provided from the patient or the patient's insurance payer). Most laboratories employ a combination of both strategies.

INSURANCE PAYERS

As noted above, insurance payers in the healthcare industry are typically categorized into two groups, private and government funded. Private insurance payers are further categorized as Health Maintenance Organizations (HMOs), Preferred Provider Organizations (PPOs), and Commercial Fee-for-Service; these insurance payers are accessed directly by individuals or through employers. Government-funded payers include Medicare, Medicaid, Tricare, and the Veterans Administration and are accessed directly by individuals.

The most common government-funded payers are regulated through the Centers for Medicare and Medicaid Services (CMS). CMS establishes policies, coverage determinations (which laboratory services are covered in which disease settings), and fee schedules at the national level, and utilizes Medicare Administrative Contractor (MAC) jurisdictions to administer the Medicare program at the local level in accordance with national policy [1]. State MAC assignments are listed in Table 26.1 and can also be found on CMS's web site (www.cms.gov). MACs have the capability to establish a coverage determination at the local level (if a coverage determination has not already been established by CMS at the national level), which is referred to as a Local Coverage Determination (LCD). It is not uncommon for CMS to later adopt an LCD into a National Coverage Determination (NCD), although NCDs always supersede LCDs. Adjustments to the CMS fee schedules are also made at the local level.

REIMBURSEMENT PROCESSES

CMS reimburses laboratories through either Part A benefits or Part B benefits. Part A provides reimbursement to hospitals when a patient has an inpatient status, utilizing a diagnosis-related group (DRG) model (the DRG model pays a set lump sum for various diagnoses regardless of the procedures performed); the payment is global and includes laboratory technical services provided during the hospital admission. Part B provides reimbursement for professional services provided by an MD (professional services provided by a PhD are considered "technical" services and are not reimbursed separately) when a patient has an inpatient status, and for professional and technical services when the patient has a non-inpatient status. Laboratories receive fee-for-service payments from Part B, and there are two fee schedules for reimbursement for the services provided. The Medicare Physician Fee Schedule (MPFS) is utilized for services that contain a professional interpretation component provided by a pathologist; the Clinical Laboratory Fee Schedule (CLFS) is utilized for services that do not have a professional interpretation component provided by a pathologist.

Reimbursement Rate

The MPFS is updated annually, at a minimum. To determine payment amounts, CMS multiplies the Conversion Factor (CF) by the Relative Value Unit (RVU) assigned to the procedure code(s).


TABLE 26.1 Medicare Administrative Contractors (MACs) as of November 2013

MAC Jurisdiction (Previous Jurisdiction) | Processes Part A & Part B Claims for the Following States | MAC
DME A (DME A) | Connecticut, Delaware, District of Columbia, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, Vermont | NHIC, Inc.
DME B (DME B) | Illinois, Indiana, Kentucky, Michigan, Minnesota, Ohio, Wisconsin | National Government Services, Inc.
DME C (DME C) | Alabama, Arkansas, Colorado, Florida, Georgia, Louisiana, Mississippi, New Mexico, North Carolina, Oklahoma, South Carolina, Tennessee, Texas, Virginia, West Virginia, Puerto Rico, U.S. Virgin Islands | Cigna Administrators, LLC
DME D (DME D) | Alaska, Arizona, California, Hawaii, Idaho, Iowa, Kansas, Missouri, Montana, Nebraska, Nevada, North Dakota, Oregon, South Dakota, Utah, Washington, Wyoming, American Samoa, Guam, Northern Mariana Islands | Noridian Healthcare Solutions, LLC
E (1) | California, Hawaii, Nevada, American Samoa, Guam, Northern Mariana Islands | Noridian Healthcare Solutions, LLC
F (2 & 3) | Alaska, Arizona, Idaho, Montana, North Dakota, Oregon, South Dakota, Utah, Washington, Wyoming | Noridian Healthcare Solutions, LLC
5 (5) | Iowa, Kansas, Missouri, Nebraska | Wisconsin Physicians Service Insurance Corporation
6* (6) | Illinois, Minnesota, Wisconsin; HH+H for Alaska, American Samoa, Arizona, California, Guam, Hawaii, Idaho, Michigan, Minnesota, Nevada, New Jersey, New York, Northern Mariana Islands, Oregon, Puerto Rico, U.S. Virgin Islands, Wisconsin, and Washington | National Government Services, Inc.
H (4 & 7) | Arkansas, Colorado, New Mexico, Oklahoma, Texas, Louisiana, Mississippi | Novitas Solutions, Inc.
8 (8) | Indiana, Michigan | Wisconsin Physicians Service Insurance Corporation
9 (9) | Florida, Puerto Rico, U.S. Virgin Islands | First Coast Service Options, Inc.
10 (10) | Alabama, Georgia, Tennessee | Cahaba Government Benefit Administrators, LLC
11* (11) | North Carolina, South Carolina, Virginia, West Virginia (excludes Part B for the counties of Arlington and Fairfax in Virginia and the city of Alexandria in Virginia); HH+H for Alabama, Arkansas, Florida, Georgia, Illinois, Indiana, Kentucky, Louisiana, Mississippi, New Mexico, North Carolina, Ohio, Oklahoma, South Carolina, Tennessee, and Texas | Palmetto GBA, LLC
L (12) | Delaware, District of Columbia, Maryland, New Jersey, Pennsylvania (includes Part B for the counties of Arlington and Fairfax in Virginia and the city of Alexandria in Virginia) | Novitas Solutions, Inc.
K* (13 & 14) | Connecticut, New York, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont; HH+H for Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont | National Government Services, Inc.
15* (15) | Kentucky, Ohio; HH+H for Delaware, District of Columbia, Colorado, Iowa, Kansas, Maryland, Missouri, Montana, Nebraska, North Dakota, Pennsylvania, South Dakota, Utah, Virginia, West Virginia, and Wyoming | CGS Administrators, LLC

*Also processes home health and hospice (HH+H) claims.


The current-year CF and RVU values can be found on CMS's web site. The CF converts the RVU of a procedure into a dollar amount for reimbursement to the provider. An RVU is assigned to each procedure or service provided, and an RVU has three components. The Work RVU is assigned for the level of effort and skill required by a physician to provide the service. The PE RVU is assigned to cover the nonphysician costs of maintaining a practice, such as equipment, supplies, rent, and staff. The Malpractice (MP) RVU is assigned to cover the professional liability associated with each procedure provided. At the local level, RVUs are adjusted based on Geographic Practice Cost Indices (GPCIs), which are also established by CMS and reviewed every 3 years, to adjust for differences in practice costs around the USA. Hence, the formula utilized by MACs to calculate physician reimbursement rates for services paid under the MPFS is:

Reimbursement rate = [(Work RVU × Work GPCI) + (PE RVU × PE GPCI) + (MP RVU × MP GPCI)] × CF

The CLFS is established quite differently from the MPFS. Reimbursement per service is set at the lesser of the amount billed, a fee established at the local geographic level, or a national limit. The MACs establish local payment amounts utilizing claims data for services provided in each geographic area; the national limit is set at a percentage of the median of all local payment amounts. The CLFS is updated at least annually, and the update is based on changes in the Consumer Price Index (CPI). All local-level fee schedules can be found on the MACs' web sites.

Rates established by a governmental entity are fixed; in contrast, laboratories have the ability to negotiate terms with private insurance carriers. Many private insurance carriers utilize CMS policies and procedures as a guide for establishing their own coverage determinations, and during fee schedule negotiations it is not uncommon for private insurance carriers to negotiate at a percentage of the current CMS fee schedules. Thus, it is important for a laboratory to understand each element of a contract presented by a private insurance carrier, as CMS fee schedule rates do not always cover costs.

In order for a service to be reimbursed by an insurance payer, governmental and private alike, it must meet the payer's specific medical coverage criteria. As discussed previously, Medicare has both national and local coverage determinations. Private payers also establish their own sets of medical coverage policies. Medical policies or coverage determinations established by insurance payers are written in narrative format and describe what the payer deems to be medically necessary services for a particular diagnosis or symptom. In addition to the narrative, policies include acceptable procedure and diagnosis code combinations. Many private insurance payers follow CMS's lead in medical coverage determinations; some private insurers point directly to CMS's medical coverage determinations.

CMS has a clearly defined process, outlined on its web site, for establishing an NCD [2,3]. Of utmost importance, the service provided must be reasonable and necessary for diagnosis and/or treatment of an illness or injury. The determination process takes 9 months. The first 6 months culminate in CMS publishing a draft coverage determination. The public then has 30 days to review and comment. There follows a 60-day period for CMS to review the comments, make edits, and publish the completed Final Decision Memorandum and Implementation Instructions.
However, it is also possible for CMS to initiate an Appeal or Reconsideration phase that delays implementation of the final NCD. It is therefore vital that clinical laboratories performing next-generation sequencing (NGS) actively engage in the comment periods as CMS NCDs are established.
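To make the fee-schedule arithmetic described above concrete, the following is a minimal sketch of the MPFS calculation and the CLFS "lesser of" rule. All RVU, GPCI, CF, and fee values in the sketch are hypothetical placeholders chosen only to illustrate the arithmetic; current values are published by CMS.

```python
# Illustrative sketch of the two Medicare fee-schedule calculations described
# above. Every numeric value below is hypothetical, not a published CMS figure.

def mpfs_rate(work_rvu, pe_rvu, mp_rvu, work_gpci, pe_gpci, mp_gpci, cf):
    """MPFS payment: geographically adjusted RVUs multiplied by the conversion factor."""
    adjusted_rvus = (work_rvu * work_gpci) + (pe_rvu * pe_gpci) + (mp_rvu * mp_gpci)
    return adjusted_rvus * cf

def clfs_payment(amount_billed, local_fee, national_limit):
    """CLFS payment: the lesser of the billed amount, the local fee, or the national limit."""
    return min(amount_billed, local_fee, national_limit)

# Hypothetical service: 1.2 work RVUs in a locality with above-average practice
# costs, using a $35.00 conversion factor.
print(mpfs_rate(1.2, 0.9, 0.1, 1.0, 1.05, 1.1, 35.00))  # ~78.93 dollars
print(clfs_payment(amount_billed=120.00, local_fee=95.00, national_limit=88.00))  # 88.0
```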

Diagnosis and Procedure Codes

In order for insurance payers to understand what services are being provided and for what reason, providers submit claims that include diagnosis and procedure codes. The National Center for Health Statistics (NCHS) and CMS create, remove, and revise diagnosis codes that correspond to the World Health Organization's (WHO) International Classification of Diseases. The current version of the codes is the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) [4]. These codes are reviewed annually; October 1 is the start of each new reporting year (i.e., October 1, 2013 started the reporting period for the 2014 ICD-9-CM codes). In 2009, the US Department of Health & Human Services adopted the transition from the ninth revision to the tenth revision (ICD-10-CM). The differences between ICD-9-CM and ICD-10-CM are substantial: the ICD-10-CM codes are much more specific, are longer (adding alphanumeric characters), and add more than 50,000 codes [5]. Because of their complexity, implementation of the ICD-10-CM codes has been delayed and is set for October 1, 2014.


In the USA, the American Medical Association (AMA) creates, removes, and revises procedure codes (known as Current Procedural Terminology (CPT) codes) to report services provided. CPT codes are also reviewed and modified annually, with an effective date of January 1 of each year. The AMA-CPT Editorial Panel owns the responsibility of making modifications to CPT codes; the panel includes physicians nominated by medical societies such as the College of American Pathologists (CAP) or the Association for Molecular Pathology (AMP), members from the insurance payer industry, a member from the American Hospital Association, and two members from the Health Care Professional Advisory Committee. The panel is supported by a much larger AMA-CPT Advisory Committee that includes members from many medical specialty organizations, such as the AMP and the American College of Medical Genetics and Genomics (ACMG). It is the responsibility of Advisory Committee members to bring forward suggested modifications and revisions to ensure that their medical specialty practices have appropriate codes established as treatment paradigms evolve.

CPT and ICD codes are reported together on claims submitted to insurance payers for reimbursement; the codes are linked on the claim to describe to the insurance payers what service was provided and for what diagnosis or symptom. Insurance payers, based on the medical policies and coverage determinations established within their payment systems, evaluate claims submitted by providers to ensure that they meet the established medical coverage criteria for the CPT and ICD coding combination. As new treatment paradigms evolve, the AMA-CPT Editorial Panel also works to modify (or delete) existing CPT codes, as well as establish new CPT codes, to accurately and transparently reflect new services being provided.

In late 2009, the AMA-CPT Editorial Panel assembled the Molecular Pathology Coding Workgroup (MPCW) as a result of concerns brought to its attention by insurance payers and providers relating to CPT coding of molecular diagnostic tests [6]. The workgroup included participants from CMS, private insurance carrier medical directors, various medical specialty members, and other relevant healthcare industry representatives. The charge of the workgroup was to modify the existing so-called "stacked CPT coding" methodology, which required reporting several different CPT codes ("stacking"), one for each step of a molecular diagnostic test (i.e., cell lysis, nucleic acid stabilization, extraction, digestion, amplification, detection, and interpretation). In this context, the AMP Economic Affairs Committee submitted a "Proposal for CPT Coding Reform in Molecular Diagnostics" to the AMA-CPT Editorial Panel that aimed to reconstruct CPT coding for molecular diagnostic tests to generate consistency among laboratories and provide transparency to assist insurance payers [7]. The first action of the MPCW (which still convenes regularly) was the addition of an entirely new Molecular Pathology subsection within the Pathology and Laboratory section of the 2012 AMA-CPT codebook [8]. The fundamentally different coding structure for molecular diagnostic tests was reflected in the inclusion of more than 100 new molecular pathology CPT codes in 2012 and the subsequent addition of more than 40 new codes in 2013.
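Because payers adjudicate claims by checking the linked CPT and ICD codes against their coverage policies, the pairing can be illustrated with a short sketch. The coverage table and the CPT–ICD pairing below are hypothetical; only CPT code 81210 (BRAF gene analysis, V600E variant, described in the next paragraph) is taken from the chapter.

```python
# Hedged sketch of a claim line linking one CPT procedure code to ICD diagnosis
# codes, screened against a hypothetical payer coverage policy.

from dataclasses import dataclass

# Hypothetical policy: CPT code -> diagnosis codes deemed medically necessary.
COVERAGE_POLICY = {
    "81210": {"153.9"},  # hypothetical pairing with a colorectal malignancy code
}

@dataclass
class ClaimLine:
    cpt_code: str
    icd_codes: list[str]

def meets_coverage_criteria(line: ClaimLine) -> bool:
    """Payable only if at least one linked diagnosis is covered for the CPT code."""
    covered = COVERAGE_POLICY.get(line.cpt_code, set())
    return any(icd in covered for icd in line.icd_codes)

line = ClaimLine(cpt_code="81210", icd_codes=["153.9"])
print(meets_coverage_criteria(line))  # True under this hypothetical policy
```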
Molecular diagnostic test CPT coding is no longer based on "code stacking"; instead, "code stacks" have been bundled into single CPT codes, and within the Molecular Pathology subsection of the AMA-CPT codebook there are now Tier 1 Molecular Pathology Procedures and Tier 2 Molecular Pathology Procedures. Tier 1 Molecular Pathology Procedures (codes 81200–81383) are reported when a specific gene or a very specific target within a gene is being evaluated; for example, 81210 is defined as BRAF gene analysis, V600E variant. Tier 2 Molecular Pathology Procedures (codes 81400–81408) are reported when there is no available Tier 1 code, or when two or more genes are analyzed at the same time. For example, 81407 is defined as Molecular Pathology Procedure, Level 8 (e.g., analysis of 26–50 exons by DNA sequence analysis, mutation scanning or duplication/deletion variants of >50 exons, sequence analysis of multiple genes on one platform).

At its January 2013 meeting, the MPCW added next steps for next-generation and whole exome sequencing CPT codes, with the goal of deploying new codes in 2015 [9]. Additionally, the MPCW continues to review CPT code proposals submitted by various laboratories throughout the country, each designed to accurately reflect the particular molecular diagnostic test that laboratory provides. CPT coding for Molecular Pathology Procedures has thus been a work in progress since 2009 and will continue to evolve as clinical needs progress and technologies advance.

In a timely step, the AMP Economic Affairs Committee submitted its "Proposal to Address CPT Coding for Genomic Sequencing Procedures" to the AMA-CPT Editorial Panel in March 2013 to address the advancing DNA sequencing technologies [10]. AMP's proposal noted that current technologies used for genomic sequencing are only in their infancy, and the proposal therefore uses the generic term "genomic sequencing procedure" (GSP) in an effort to capture not only what is employed in practice today but also techniques that are yet to come. Currently, GSPs are also known as NGS or Massively Parallel Sequencing (MPS).


The AMP approach, however, is to focus on the clinical question addressed by the test rather than the technology; given the broad spectrum of sequencing technologies that already exists, the AMP proposal is geared to provide transparency and consistency among laboratories regardless of technique. The major difference acknowledged by the AMP proposal is that GSPs have the capability to provide clinically useful results on multiple genes or gene targets relevant to a given clinical condition, and they therefore supersede historic approaches in which testing is limited to single genetic targets. Additionally, AMP noted that because GSPs are able to sequence a whole exome or genome, they provide a pathway to produce a data set that can be reevaluated to answer multiple clinical questions as the need arises. In this context it is worth mentioning that professional CPT codes also provide a way to bill for reevaluations of previously sequenced data to answer additional clinical questions at a later date.

Because GSPs are designed to assess multiple genes for a given clinical setting, AMP's proposal also discusses the need to include the clinical setting within the CPT code description. By this approach, the coding structure would be organized from least to greatest degree of technical and analytical work required (i.e., targeted multiple-gene sequencing, to whole exome sequencing, to whole genome sequencing) and would provide separate CPT codes for technical work and professional work. The rationale for separate codes is that GSPs produce a vast amount of complex data (of a scale not shared by standard sequencing techniques such as Sanger sequencing), and that this level of complexity requires a higher level of professional interpretation to address the associated clinical question. It is unknown how the AMA-CPT Editorial Panel will respond to the AMP proposal, although it is certain that the panel will be adding new molecular pathology procedure codes, and revising or deleting existing ones, for many years to come.

Predetermination of Coverage and Benefits

It is worth noting that although insurance payers were a driving force behind the decision by the AMA-CPT Editorial Panel to reconstruct the Molecular Pathology Procedure codes, now that the new codes are available, insurance payers may not clearly understand how to interpret them. Thus, as new services become available, it is imperative for laboratories to make sure their insurance payers understand the services that are being provided. To accomplish this, laboratories may adopt a policy whereby they always contact insurance companies prior to providing a service in order to predetermine coverage and benefits. This process can be initiated by calling or submitting a written request to the insurance payer (before initiating the predetermination of coverage and benefits process, it is important to consult the insurance plan's web site or plan documents to understand the plan's specific procedures). During the predetermination process, the insurance payer may approve the service being requested, may require a peer-to-peer discussion between the ordering physician or laboratory professional and the medical director of the insurance payer, or may deny the service. In the latter case, laboratories may have the option to appeal in writing (the appeal process depends on the insurance payer, and often the appeal is conducted by an external source). When submitting an appeal, it is important to include any and all relevant documentation, including supporting peer-reviewed literature.

Medicare is quite different from private insurance payers in that it does not have a process that allows laboratories to submit a predetermination of coverage and benefits. Instead, the expectation is that the submitting provider understands Medicare's coverage policies and submits claims for reimbursement only for medically necessary services. However, Medicare does have an appeals process for denied claims. There are five levels of appeal for Medicare Part A and Part B Fee-for-Service claims; the process is outlined in Table 26.2 and can also be found on CMS's web site.

TABLE 26.2 Original Medicare (Parts A & B Fee-for-Service) Initial Determination/Appeals Process

Stage | Expedited Process (some Part A only) | Standard Process (Parts A and B)
Initial Determination | Notice of Discharge or Service Termination | FI, Carrier, or MAC Initial Determination
Time to file first appeal | Noon the next calendar day | 120 days
First Appeal Level | Quality Improvement Organization Redetermination (72-hour time limit) | FI, Carrier, or MAC Redetermination (60-day time limit)
Time to file second appeal | Noon the next calendar day | 180 days
Second Appeal Level | Qualified Independent Contractor Reconsideration (72-hour time limit) | Qualified Independent Contractor Reconsideration (60-day time limit)
Third Appeal Level (60 days to file) | – | Office of Medicare Hearings and Appeals ALJ Hearing, AIC ≥ $140* (90-day time limit)
Fourth Appeal Level (60 days to file) | – | Medicare Appeals Council (90-day time limit)
Judicial Review (60 days to file) | – | Federal District Court, AIC ≥ $1,400*

AIC = Amount in Controversy; ALJ = Administrative Law Judge; FI = Fiscal Intermediary; MAC = Medicare Administrative Contractor.
*The AIC requirement for an ALJ hearing and Federal District Court review is adjusted annually in accordance with the medical care component of the Consumer Price Index; the table reflects the amounts for calendar year (CY) 2013.

TEST DESIGN FACTORS THAT IMPACT REIMBURSEMENT

As discussed above, complicated rules and regulations govern reimbursement for laboratory testing, including clinical NGS. In routine patient care, however, one overarching practical factor has the most impact on the rate and level at which NGS is reimbursed, namely direct clinical utility. Insurance payers (whether governmental or private) only reimburse for laboratory testing that provides diagnostic, predictive, or prognostic information that directly impacts patient care. In general terms, payers make the determination of utility based on three criteria: the testing must be standard of care (i.e., in widespread clinical use); use of the test must be supported by the medical literature (the strength of the evidence also matters, e.g., case reports versus randomized prospective clinical trials); and the test must lead to improved patient outcomes. Testing that does not meet these three general criteria is, by default, viewed as experimental or investigational and is not reimbursed, since insurance payers are not research organizations. Similarly, payers will not generally reimburse for testing that is performed in order to identify whether or not a patient is a candidate for a specific clinical trial.

Thus, even though NGS testing provides the opportunity to sequence hundreds of genes, the exome, or the whole genome, and even though NGS provides the opportunity to collect data on the oncobiology of tumor development, progression, and escape from therapy, insurance carriers usually will not reimburse for the genetic analysis if the data on a specific gene (or group of genes) are not directly diagnostic, predictive, or prognostic. The existence of a CPT code does not guarantee payment for NGS testing; if a large panel of hundreds of genes, the exome, or the whole genome has no clinical utility in the clinical setting of a specific patient, it will not be reimbursed. Consequently, clinical NGS testing currently must often be limited to a relatively small set of genes that provide actionable information.


Clinical utility is directly related to the so-called use case under which NGS is performed. For example, in the care of an oncology patient, the utility of NGS testing is often not the same at all stages of the disease process. At the time of initial diagnosis, the choice of an appropriate initial chemotherapy regimen often depends on the mutational status of a relatively limited set of genes (e.g., in non-small cell lung cancer and colorectal adenocarcinoma) [11], and it is difficult to make an argument for analysis of a broader panel of genes that do not impact patient care. However, when the same patient develops recurrent or metastatic disease, NGS testing to evaluate the mutational status of a much larger number of genes to guide targeted therapy has demonstrated clinical utility, in that it has been shown to improve patient outcomes [12,13].

It is noteworthy that the importance of direct clinical utility for reimbursement of NGS in cancer testing mirrors the current reimbursement paradigm for DNA sequence analysis to identify germ line mutations in inherited diseases. In the setting of constitutional diseases, the precedent is for testing limited to the genes that are directly relevant for diagnosis or therapy, rather than dozens or hundreds of genes without a clear association with the patient's particular disease, or that have not been shown to provide any directly actionable information. This policy persists despite the facts that there is disagreement among experts regarding the diseases for which genomic information has clinical utility [14]; that limiting testing to preselected genes (or even regions within a gene) with a known disease association may miss variants that are clearly also disease causing; and that the scope of clinical utility of genetic analysis likely should be more broadly defined to include information important for planning supportive care, informing reproductive decisions, avoiding unnecessary testing, and so on.

In an even broader laboratory context, the importance of clinical utility for reimbursement of NGS also mirrors the reimbursement paradigm for automated chemistry testing. Despite the extremely low incremental cost of additional tests performed by automated chemistry and immunoassay instruments, reimbursement is generally limited to those tests necessary for immediate patient management (in fact, attempts to manipulate panel content to force physicians into ordering additional tests at extra cost have in some settings [15] resulted in criminal prosecution).

The practical constraints on reimbursement of clinical NGS by private or governmental insurance carriers have implications for test design and utilization, exerting pressure to test only small sets of genes in limited clinical settings. However, other sources of reimbursement provide more flexibility in NGS test panel design and clinical use, and may support testing that involves hundreds of genes, the exome, or even the whole genome. These other revenue streams include institutional contracts, research grant support (which can often be used to support clinical testing as part of a clinical or basic science trial), philanthropy, and private investment (such as venture capital).

PATIENT PROTECTION AND AFFORDABLE CARE ACT

The Patient Protection and Affordable Care Act (ACA), signed into law in March 2010, brought sweeping changes to the healthcare landscape in the USA. The law was created in response to a 2009 presidential study that showed a dim future for healthcare financing in the USA, emphasized by the fact that roughly 18% of the gross domestic product of the USA in 2010 was spent on healthcare expenditures, with a projected growth to 34% by 2040 [16]. The new law requires every US citizen to hold health insurance, via an employer-sponsored plan; private insurance; or a government-run Medicare, Medicaid, or Veterans Administration program. The law also provides penalties for those individuals who do not carry insurance. Finally, the law prohibits insurance plans from denying coverage based on preexisting conditions. Ideally, the law will result in coverage for the estimated 49.9 million Americans who were uninsured in 2010 [17]. Although it is unclear how expanded coverage will impact rising healthcare-associated expenditures, the ACA included provisions to create three new entities focused on decreasing healthcare expenditures.

Entities Focused on Healthcare Expenditures

The first new entity is the independent, public–private funded Patient-Centered Outcomes Research Institute (PCORI), created to conduct research aimed at examining the relative health outcomes, clinical effectiveness, and appropriateness of the clinical care delivered in the USA [18]. The intent of this institute is to produce valid scientific evidence that supports a broad range of clinical decisions. Ultimately, the results of "patient-centered outcomes" studies will be used in coverage determinations for insurance programs.


Another program created by the ACA is the Center for Medicare and Medicaid Innovation. The mission of the Innovation Center as defined by Congress is to "innovate payment and service delivery models to reduce program expenditures … while preserving or enhancing the quality of care" for those who receive Medicare, Medicaid, or Children's Health Insurance Program (CHIP) benefits [19]. Functionally, the goal of the Innovation Center is to support researchers who will develop models that utilize and evaluate CMS data to answer critical policy questions, models that can potentially define future payment and service delivery for healthcare services. Currently the Innovation Center has three main priorities, namely testing new payment and service delivery models, evaluating results and advancing best practices, and engaging a broad range of stakeholders to develop additional models.

Lastly, the ACA created the Independent Payment Advisory Board (IPAB), a 15-member nonelected board composed of industry experts and healthcare advocates. The IPAB is charged with recommending changes to Medicare, with a focus on curbing the increasing costs of healthcare through a spending target system and a fast-track legislative approval process. In the event that projected per-beneficiary spending growth exceeds targets, Congress must consider the Medicare reforms proposed by the board, with strict rules that make it hard for Congress to overturn the board's recommendations [20–22]. Given the potential power of the board, the IPAB has been a hotbed of public debate [23]. Currently, the political future of a board charged with making controversial decisions regarding the nation's largest social programs is uncertain [23].

Accountable Care Organizations

Another provision of the ACA is the creation of Accountable Care Organizations (ACOs), which are groups of doctors, hospitals, and other healthcare providers that volunteer to provide coordinated, high-quality care to their Medicare patients [24]. The Shared Savings Program from CMS incentivizes ACOs to provide patients with coordinated care at the "right time" and "right place" to avoid unnecessary duplication of services and help prevent medical errors. The Shared Savings Program requires ACOs to meet quality metrics in four areas. In 2013, the metrics included 33 quality measures in the four areas of patient and caregiver experience; care coordination and patient safety; preventive health; and at-risk populations (including patients with diabetes, hypertension, ischemic vascular disease, heart failure, and coronary artery disease) [25]. When an ACO succeeds in meeting the quality metrics, it receives back a portion of the Medicare savings.

Health Outcomes

A key focus of all of the new ACA initiatives is the development of methods for achieving better health outcomes in the most efficient and effective way possible. The impact of this focus on clinical laboratories, especially those that provide molecular testing, is uncertain. It could reasonably be argued that the immediate impact for clinical laboratories is an increase in laboratory testing, since 49.9 million previously uninsured individuals will have access to the health insurance that is now required. On the other hand, it is difficult to predict the impact of the ACA on clinical laboratories over the long term.

There is no doubt that the current fee-for-service model in the US healthcare system is flawed. It is a system that incentivizes and rewards over-utilization and abuse. For example, in 2011 the US Government Accountability Office (GAO) estimated that $48 billion was spent on improper Medicare payments in fiscal year 2010 [26]. With so much governmental attention and focus on reducing healthcare expenses in programs such as Medicare, it is not unreasonable to foresee changes in the mechanisms by which clinical laboratories receive payment for their services. One thought is that the industry as a whole will move from a fee-for-service model to a model similar to the DRG model that Medicare uses for inpatient hospital services (the DRG model pays a set lump sum for various diagnoses regardless of the procedures performed, and it is worth noting that Medicare has recently begun to include quality metrics in DRG payments). Although there are provisions for extra payments when unforeseen complications arise, the effect of DRGs has been to incentivize hospitals to focus on containing inpatient costs to maximize revenue from DRG payments. A DRG model for molecular testing reimbursement would likely, on the positive side, incentivize appropriate utilization; on the negative side, however, it would likely exert direct and indirect pressure to lower the fees paid for testing.


[FIGURE 26.1 Comparison of the per-base cost of DNA sequence analysis by Sanger and NGS methods when performed in a CLIA-licensed clinical laboratory (technical component of testing only). Estimated cost ($0–$35,000) is plotted against kilobases of sequence (0–12 kb) for both methods; the Sanger cost curve rises steeply with target size, while the NGS curve remains comparatively flat.]

COST STRUCTURE

Determining fees for molecular genetic testing is a complex business exercise that involves understanding a laboratory's internal cost structure, determining the appropriate CPT code to apply, and setting fee rates. Specifically with reference to NGS, the intrinsic cost structure is governed by the fact that the technical cost per kilobase is relatively fixed, and thus above a certain threshold the cost is lower than that of DNA sequencing by traditional methods such as Sanger sequencing (Figure 26.1). NGS is therefore ideally suited for analyses in which diagnosis, prediction, or prognosis requires information about the presence of mutations in relatively large areas of the same gene or in multiple different genes. In contrast, NGS is not ideally suited for analysis of a very small target region (for which Sanger or indirect sequencing methods may have more utility) or of a limited number of specific structural rearrangements (for which classical cytogenetics or interphase FISH may be more appropriate).

Molecular genetic testing is a highly specialized field, and costs per test are higher because of the high complexity of the testing. In order to ensure an effective pricing strategy, it is crucial to understand all the direct and indirect costs associated with an assay (direct costs are those that can be directly related and tracked to an assay, such as reagents and consumables; indirect costs are the less tangible costs, such as administration and management time, building rent, and service contracts, that can be more difficult to calculate per assay). If a laboratory is going to offer testing to other institutions, additional information is required, including a market analysis to establish the reference laboratory fees of competitors that offer comparable testing. Finally, governmental fee schedules must also be evaluated. While the 2014 National Limit for Medicare on the Clinical Diagnostic Laboratory Fee Schedule governs the reimbursement from Medicare for the relevant CPT codes, most laboratories have a patient population covered by a mixture of government and private payers. Since private payers often have contracted rates above those of Medicare, it is prudent to review the contracted rates with all payers to set fees at a level that ensures that the maximum possible revenue is captured.

Once the cost of the assay has been determined, the CPT code that best applies to the testing must be determined. As noted above, the codes for molecular testing can be found in the 81200 through 81479 series of the 2014 CPT codebook. When deciding which code to assign to a given assay, the code that most accurately describes the assay must be used. Additionally, if there is a code that specifically describes the assay, that code must be used. Failure to use the correct code is considered fraud by CMS and carries penalties ranging from fines and repayment to criminal prosecution.
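The break-even logic behind Figure 26.1 can be sketched in a few lines. Here, Sanger cost is modeled as scaling linearly with the size of the targeted region, while the NGS cost is dominated by a fixed per-run cost; all dollar figures are hypothetical placeholders for a laboratory's own direct and indirect cost data, not actual prices.

```python
# Hedged sketch of the Sanger-versus-NGS cost crossover. The constants below
# are hypothetical placeholders, not measured laboratory costs.

SANGER_COST_PER_KB = 2500.0  # hypothetical fully loaded cost per kilobase
NGS_FIXED_RUN_COST = 5000.0  # hypothetical fully loaded cost per run
NGS_COST_PER_KB = 50.0       # hypothetical marginal cost per additional kilobase

def sanger_cost(kb):
    """Sanger cost grows roughly linearly with the size of the targeted region."""
    return SANGER_COST_PER_KB * kb

def ngs_cost(kb):
    """NGS cost is dominated by the fixed run cost, with a small per-kb increment."""
    return NGS_FIXED_RUN_COST + NGS_COST_PER_KB * kb

# Find the smallest target size (in kb) at which NGS becomes the cheaper method.
kb = 0.0
while sanger_cost(kb) <= ngs_cost(kb):
    kb += 0.1
print(f"NGS is cheaper above roughly {kb:.1f} kb of targeted sequence")
```

With these placeholder values the crossover falls at roughly 2 kb; a laboratory substituting its own cost data would find its own threshold, which is the practical point of the figure.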

SUMMARY

Three factors are critical when it comes to billing for clinical molecular testing. The first is an understanding of the current regulatory environment. Because NGS is an emerging medical field (both in terms of technologies and applications), it is important to stay abreast of CPT coding, ICD-10 codes, and the regulatory environment in order to bill effectively for services rendered in this field. The second factor is determination of the appropriate fees for NGS testing.


Proper coding for testing, with knowledge of Medicare fee schedules and contracted rates from private payers for the relevant CPT codes, will ensure maximum revenue for the testing being performed. The third factor is an understanding of the laboratory's cost structure for NGS testing, especially since the costs of equipment and kits for the testing are changing so rapidly. Since NGS tests are expensive to perform, large losses can accrue quickly in the absence of a careful (and up-to-date) cost analysis.

References
[1] MAC Jurisdictions, 2014. Centers for Medicare & Medicaid Services. Retrieved March 22, 2014, from <http://www.cms.gov/Medicare/Medicare-Contracting/Medicare-Administrative-Contractors/MACJurisdictions.html>.
[2] Medicare Coverage Determination Process, 2013. Centers for Medicare & Medicaid Services. Retrieved January 3, 2014, from <http://www.cms.gov/Medicare/Coverage/DeterminationProcess/index.html>.
[3] Medicare Modernization Act (MMA); Coverage Flowchart, 2013. Centers for Medicare & Medicaid Services. Retrieved January 3, 2014, from <http://www.cms.gov/Medicare/Coverage/DeterminationProcess/Downloads/8a.pdf>.
[4] International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), 2013. Centers for Disease Control and Prevention. Retrieved January 3, 2014, from <http://www.cdc.gov/nchs/icd/icd9cm.htm>.
[5] International Classification of Diseases (ICD-10-CM/PCS) Transition, 2013. Centers for Disease Control and Prevention. Retrieved January 3, 2014, from <http://www.cdc.gov/nchs/icd/icd10cm_pcs_background.htm>.
[6] Request for Molecular Pathology Code Review and Feedback (n.d.). American Medical Association. Retrieved July 30, 2013, from <http://www.apcprods.org/advocacy/documents/AMA%20Request%20for%20Molecular%20Pathology%20Coding%20Feedback.pdf>.
[7] Proposal for CPT Coding Reform in Molecular Diagnostics, 2009. Association for Molecular Pathology. Retrieved June 30, 2014, from <http://www.amp.org/committees/economics/AMPCPTReformProposal_Final.pdf>.
[8] CPT 2014: current procedural terminology. Professional ed. Chicago: American Medical Association; 2013.
[9] Molecular Pathology Coding Workgroup Meeting, 2013. American Medical Association. Retrieved July 30, 2013, from <http://www.ama-assn.org/resources/doc/cpt/mpwg-presentation-jan312013-randhyatttampabay.pdf>.
[10] Proposal to Address CPT Coding for Genomic Sequencing Procedures, 2013. Association for Molecular Pathology. Retrieved July 30, 2013, from <http://www.amp.org/documents/AMPProposaltoAddressCodingforGenomicSequencingProcedures_FINAL.pdf>.
[11] NCCN Guidelines for Treatment of Cancer by Site (n.d.). NCCN Clinical Practice Guidelines in Oncology. Retrieved July 30, 2013, from <http://nccn.org/professionals/physician_gls/f_guidelines.asp#site>.
[12] Tsimberidou AM, Iskander MK, et al. Personalized medicine in a phase I clinical trials program: the MD Anderson Cancer Center initiative. Clin Cancer Res 2012;18:6373–83.
[13] Von Hoff DD, Stephenson JJ Jr, Rosen P, Loesch DM, Borad MJ, Anthony S, et al. Pilot study using molecular profiling of patients' tumors to find potential targets and select treatments for their refractory cancers. J Clin Oncol 2010;28:4877–83.
[14] Green RC, Berg JS, Berry GT, Biesecker LG, Dimmock DP, Evans JP, et al. Exploring concordance and discordance for return of incidental findings from clinical sequencing. Genet Med 2012;14:405–10.
[15] Operation LABSCAM, 2003. Michael Wynne. Retrieved July 30, 2013, from <http://www.bmartin.cc/dissent/documents/health/labscam.html>.
[16] The Economic Case for Health Care Reform, 2009. Executive Office of the President, Council of Economic Advisers. Retrieved July 30, 2013, from <http://www.whitehouse.gov/assets/documents/CEA_Health_Care_Report.pdf>.
[17] U.S. Census Bureau. Income, poverty, and health insurance coverage: 2010. <www.census.gov>; issued September 2011.
[18] H.R. 3590, 111th Congress: Patient Protection and Affordable Care Act, 2009. <www.GovTrack.us>. Retrieved March 22, 2013, from <http://www.govtrack.us/congress/bills/111/hr3590>.
[19] Centers for Medicare and Medicaid Services, Innovation Center (n.d.). About Us. Retrieved July 30, 2013, from <http://innovation.cms.gov/About/index.html>.
[20] Newman D, Davis CM. The Independent Payment Advisory Board. Washington, DC: Congressional Research Service; 2010, <http://assets.opencrs.com/rpts/R41511_20101130.pdf>.
[21] Jost TS. The Independent Payment Advisory Board. N Engl J Med 2010;363:103–5.
[22] Ebeler J, Neuman T, Cubanski J. The Independent Payment Advisory Board: a new approach to controlling Medicare spending. Washington, DC: Henry J. Kaiser Family Foundation; 2011, <http://kaiserfamilyfoundation.files.wordpress.com/2013/01/8150.pdf>.
[23] Oberlander J, Morrison M. Failure to launch? The Independent Payment Advisory Board's uncertain prospects. N Engl J Med 2013;369:105–7.
[24] Centers for Medicare and Medicaid Services (n.d.). Accountable Care Organizations (ACO). Retrieved July 31, 2013, from <http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/ACO/index.html?redirect=/aco/>.
[25] RTI International & Telligen, 2012. Accountable Care Organization 2013 Program Analysis. Retrieved July 31, 2013, from <http://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/sharedsavingsprogram/Downloads/ACO-NarrativeMeasures-Specs.pdf>.
[26] King KM, Daly KL. Medicare and Medicaid fraud, waste, and abuse. Retrieved July 31, 2013, from <http://www.gao.gov/assets/130/125646.pdf>.

Glossary

Billing  The process of submitting claims to health insurance payers to receive payment for services provided by healthcare providers.

Coding  The process of converting descriptions of medical diagnoses and procedures into universal diagnosis and procedure codes.

Medical Coverage Criteria  A set of criteria established by insurance payers that determines whether or not the insurance payer will pay for an item or service.


Medically Necessary  Healthcare services or supplies needed to prevent, diagnose, or treat an illness, injury, condition, or disease or its symptoms, and that meet accepted standards of medicine.

Predetermination  The process of submitting a request for a coverage determination before a service is performed, to establish in advance whether or not an insurance payer will pay for the item or service.

Professional Component  The physician's professional interpretive component of a service.

Reimbursement  Compensation for services provided.

Relative Value Unit  A measure of value used to establish the Medicare fee for a service, ranked on a common scale according to the resources used to provide the service (the fee calculation is sketched below).

Technical Component  The component of a service that covers the use of equipment, facilities, nonphysician medical staff, and supplies.
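To make the Relative Value Unit and Conversion Factor concrete, the following is a minimal sketch of the standard Medicare Physician Fee Schedule calculation, in which each RVU component (work, practice expense, and malpractice) is adjusted by its locality's Geographic Practice Cost Index and the sum is multiplied by the Conversion Factor; the RVU, GPCI, and Conversion Factor values shown are illustrative placeholders, not actual fee-schedule figures.

def mpfs_payment(work_rvu, pe_rvu, mp_rvu,
                 work_gpci, pe_gpci, mp_gpci,
                 conversion_factor):
    # Each RVU component is scaled by its locality's GPCI before summing.
    adjusted_rvus = (work_rvu * work_gpci
                     + pe_rvu * pe_gpci
                     + mp_rvu * mp_gpci)
    # The Conversion Factor translates adjusted RVUs into a dollar amount.
    return adjusted_rvus * conversion_factor

# Hypothetical service: 1.50 work, 0.80 PE, and 0.10 MP RVUs in a locality
# with neutral (1.0) GPCIs and a $35.00 Conversion Factor.
print(round(mpfs_payment(1.50, 0.80, 0.10, 1.0, 1.0, 1.0, 35.00), 2))  # 84.0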

List of Acronyms and Abbreviations

ACA  Affordable Care Act
ACMG  American College of Medical Genetics
ACO  Accountable Care Organization
AMA  American Medical Association
AMP  Association for Molecular Pathology
CAP  College of American Pathologists
CF  Conversion Factor
CHIP  Children's Health Insurance Program
CLFS  Clinical Laboratory Fee Schedule
CMS  Centers for Medicare and Medicaid Services
CPT  Current Procedural Terminology
DRG  Diagnosis-Related Group
GAO  Government Accountability Office
GPCI  Geographic Practice Cost Indices
HMO  Health Maintenance Organization
ICD-9-CM  International Classification of Diseases, Ninth Revision, Clinical Modification
IPAB  Independent Payment Advisory Board
LCD  Local Coverage Determination
MAC  Medicare Administrative Contractor
MP  Malpractice
MPCW  Molecular Pathology Coding Workgroup
MPFS  Medicare Physician Fee Schedule
NCD  National Coverage Determination
PE  Practice Expense
PPO  Preferred Provider Organization
RVU  Relative Value Unit


Index

Note: Page numbers followed by “f” and “t” refer to figures and tables, respectively.

A ABI. See Applied Biosystems (ABI) ABI PRISM dGTP BigDye Terminator v3.0 kit, 5 Abnormal protein aggregation, 137 ACA. See Affordable Care Act (ACA) Accountable Care Organizations (ACO), 455 Accuracy clinical testing, 369370 defined, 369 depth of coverage, 369 population variation databases, 283284 wet-bench procedures, 383 ACMG. See American College of Medical Genetics and Genomics (ACMG) ACO. See Accountable Care Organizations (ACO) ADA. See Americans with Disabilities Act (ADA) Advance Notice for Proposed Rulemaking, 422 Affordable Care Act (ACA), 444, 454455 ACO, 455 Center for Medicare and Medicaid Innovation, 455 healthcare expenditures, 454455 health outcomes, 455 inception of, 454 IPAB, 455 PCORI, 454 Agarose gel electrophoresis, 10 Agencourt Bioscience, 11 Agilent SureSelect Human All Exon Kit, 48 Alignment, read mapping factors, 102103 formats, 102 processing, 103 Alignment tools, read mapping, 101102 Bowtie, 101 BWA, 101 Isaac, 102 MAQ, 101 Mosaik, 102 Novoalign, 101 TMAP, 102 Allele frequency, 199 Allelic heterogeneity, constitutional disorders, 252 Altered splicing, 137 Alternative splicing, RNA-sequencing and, 84 AMA. See American Medical Association (AMA) AMA-CPT Editorial Panel, 451 Amazon, 423

American Civil Liberties Union, 441 American College of Medical Genetics and Genomics (ACMG), 367, 451 American Hospital Association, 451 American Medical Association (AMA), 451 Code of Ethics, 421 American Society for Clinical Oncology, 426 Americans with Disabilities Act (ADA), 444 Amgen v. Chugai Pharmaceutical Co., 437 AMP. See Association for Molecular Pathology (AMP) Amplification-based NGS capture-based methods vs., 4748, 6264 clinical applications, 6466 cancer gene panels, 6465 targeted amplification-based sequencing, 64, 64f cost-effectiveness, 4748 disease-targeted gene panels, 265 library preparation, 6162, 61f multiplex PCR, 6061, 61f nucleic acids preparation, 59 overview, 5758 principles of, 5859 samples, 59 sequencing workflow, 58, 58f Amplification-based NGS, constitutional diseases advantages of, 247248 bioinformatics, 245246 concurrent testing, 245 disadvantages of, 247248 disease-targeted sequencing, 241242, 242t exons, design and inspect primers, 244 future directions, 248 multigene panel validation, 243245 run validation samples, 244245 select genes, 244 target enrichment, 242243 Amplification-based NGS, in cancer advantages of, 307308 challenges and perspectives, 316 clinical application of, 308313 AmpliSeq Cancer Hotspot Panel v2, 311 AmpliSeq custom cancer panels, 312 cancer-specific targeted panels, 310313 DNA/RNA extraction and quality control, 309 Illumina TruSeq Amplicon Cancer Panel, 313 Ion AmpliSeq CCP, 311312 Ion AmpliSeq RNA Cancer Panel, 313


RainDance ONCOSeq Panel, 313 sample requirements, 309 data analysis, 314, 315t disadvantages of, 307308 DNA sequencing, 303305 general considerations, 305 multiplex amplification, 303 single-plex amplification, 304 targeted capture, 304 interpretation and reporting, 315, 316t methylation analysis, 307 overview, 298 RNA sequencing, 305307 multiplex amplification, 305306, 306f single-plex amplification, 306 targeted capture, 306307 technologies, 298302 dye terminator sequencing by synthesis (SBS), 299, 300f Illumina HiSeq and MiSeq systems, 299300 ion semiconductor-based NGS, 300301 Roche 454 genome sequencing, 298299, 299f SOLiD, 301302, 302f AmpliSeq, 16 AmpliSeq Cancer Hotspot Panel v2, 311 AmpliSeq Cancer Panels, 65 AmpliSeq custom cancer panels, 312 Analyte-specific proficiency testing, 387 Analytic considerations cancer, exome and genome sequencing in, 352356 advantages, 354356 CNV detection, 355356 depth of coverage, 352354 limitations, 352354 single assay validation, 354355 specimen requirements, 352 SV detection-genomes, 356 Analytics, reporting software, 233235 Analytic validity, results, 413 Analytic variables, 382385 bioinformatic pipeline, 384385 databases, 385 independent evaluation, 384 reference materials, 385 revalidation, 384385 versioning, 384385 sequencing platform, 383 wet-bench procedures, 383384 accuracy, 383

460 Analytic variables (Continued) precision, 383 reference range, 384 reportable range, 384 sensitivity, 383 sequence verification, 384 specificity, 383 specimen provenance, 384 Anonymity. See also Ethical challenges defined, 422 of private information, 422 Applied Biosystems (ABI), 5, 11 Archives of variant/phenotype relationships, 202205, 203t ClinVar, 204 COSMIC, 204205 dbGaP, 202204 DECIPHER, 204 OMIM, 204 PubMed, 205 Archives of variants, 201202 dbSNP, 201202 dbVar, 202 Array comparative genomic hybridization (aCGH), 170 Artifacts, exome and genome sequencing, 284285 Assay validation. See Clinical testing Association for Molecular Pathology (AMP), 387, 451 Association for Molecular Pathology v. Myriad Genetics, Inc., 441, 443 Automation, 28, 260 Autonomy, 406

B Baits. See Capture probes, for hybridcapture-based NGS Barcodes, to DNA libraries, 5758, 61f, 62 Base calling, 9293 Illumina platforms, 9596, 9899 density of DNA clusters, 9596 intensity by cycle plot, 96 intensity versus cycle (IVC) plot, 96 percentage phasing/prephasing, 96 PhiX-based quality metrics, 96 quality score distribution, 96 quality score heatmap, 96 quality scoring, 99 template generation, 98 platform-specific, 94100 sources of error, 100 Torrent platforms, 97100 clonal ISP, 97 key processes, 99100 library and test ISP, 97 live ISP, 97 loading or ISP density, 97 postprocessing, 100 test fragment metrics, 98 usable sequence, 97 Batch and pool samples, 260 Bayh-Dole Act, 437 Beacon project, 213 Becarra, Xavier, 435


Beckman Coulter, 11 Benchtop MPS technology, 65 Benchtop sequencing, 65 Beneficence, 406407 BigDye products, 5 Billing, 448 Bilski v. Kappos, 438439 Binary alignment map (BAM) file, 102 Binning, 418420, 419t Bioinformatic approaches, to SNV detection, 117122 clinical implications, 121 high sensitivity tools, 121 orthogonal validation, 122 parameters, 118120 software, 117118, 118t tumor/normal analyses, 121 Bioinformatic pipeline, 384385 databases, 385 independent evaluation, 384 reference materials, 385 revalidation, 384385 versioning, 384385 Bioinformatics, 2930, 396397 constitutional diseases, 245 insertion/deletion events (indels), 142147 left alignment, 143144 local realignment, 142 probabilistic modeling using mapped reads, 144145 split-read analysis, 145146 Bioinformatic workflow, for methylome sequencing, 86f Bisulfite conversion, methylome sequencing, 85 Bowtie, 101 Bryson, William, 442 Burrows Wheeler Aligner (BWA), 101 Burrow Wheeler transform, 101 Bush, George W., 435, 443 BWA. See Burrows Wheeler Aligner (BWA) BWA-GATK pipeline, 102

C Cancer, amplification-based NGS advantages of, 307308 challenges and perspectives, 316 clinical application of, 308313 AmpliSeq Cancer Hotspot Panel v2, 311 AmpliSeq custom cancer panels, 312 cancer-specific targeted panels, 310313 DNA/RNA extraction and quality control, 309 Illumina TruSeq Amplicon Cancer Panel, 313 Ion AmpliSeq CCP, 311312 Ion AmpliSeq RNA Cancer Panel, 313 RainDance ONCOSeq Panel, 313 sample requirements, 309 data analysis, 314, 315t disadvantages of, 307308 DNA sequencing, 303305 general considerations, 305 multiplex amplification, 303 single-plex amplification, 304

targeted capture, 304 interpretation and reporting, 315, 316t methylation analysis, 307 overview, 298 RNA sequencing, 305307 multiplex amplification, 305306, 306f single-plex amplification, 306 targeted capture, 306307 technologies, 298302 dye terminator sequencing by synthesis (SBS), 299, 300f Illumina HiSeq and MiSeq systems, 299300 ion semiconductor-based NGS, 300301 Roche 454 genome sequencing, 298299, 299f SOLiD, 301302, 302f Cancer, exome and genome sequencing in analytic considerations, 352356 advantages, 354356 CNV detection, 355356 depth of coverage, 352354 limitations, 352354 single assay validation, 354355 specimen requirements, 352 SV detection-genomes, 356 overview, 344 somatic mutations, 344347 chromosome level mutations, 346347 codon level mutations, 345 exon level mutations, 345 gene level mutations, 346 paired tumor-normal testing, 347349 variants of unknown significance, 349352 clonal architecture analysis, 351352 driver mutation analysis, 351 general categories, 350 pathway analysis, 351 statistical models of mutation effect, 350351 Cancer genes, somatic mutation detection, 326 Cancer-specific targeted panels, 310313 Capillary electrophoresis, 5 Capture probes, for hybrid-capture-based NGS, 39 Carcinomas, 155 Causal mutation discovery, exome and genome sequencing, 279291 de novo mutations, 282 familial studies, 281282 phenotypically similar unrelated probands, 279281 recessive diseases, 281282 CCDS. See Consensus coding sequence (CCDS) CDC. See Centers for Disease Control and Prevention (CDC) Cell-free DNA screening, CNV, 173, 181182 Cell lines, proficiency testing, 387 Cellularity, indels, 142 Center for Medicare and Medicaid Innovation, 455 Centers for Disease Control and Prevention (CDC), 367, 379


Centers for Medicare and Medicaid Services (CMS), 367, 448 CGH. See Comparative genomic hybridization (CGH) Changing status, results, 413 Chemical mutagens, 112 Children’s Health Insurance Program (CHIP), 455 Chimeric proteins, 168 CHIP. See Children’s Health Insurance Program (CHIP) Chromosomal studies, 275 Chromosome level mutations, 346347 Chromosome microarray (CMA) tests, 371 CLFS. See Clinical Laboratory Fee Schedule (CLFS) CLIA. See Clinical Laboratory Improvement Amendments (CLIA) ClinGen, 385 Clinical and Laboratory Standards Institute (CLSI), 367, 379 Clinical care, research and, 409410 Clinical Laboratory Fee Schedule (CLFS), 448 Clinical Laboratory Improvement Amendments (CLIA), 409 Clinical Molecular Genetics Society, 367 Clinical results, research results vs., 411412 Clinical testing accuracy, 369370 analytic phase, 364365 confirmatory testing, 366 DNA sequence data, 365 informatics pipeline, 365366 overview, 364 postanalytic phase, 364365 preanalytic phase, 364365 precision, 370 QC procedures, 372373 quality assurance, 367 reference materials, 373 reference ranges, 372 reportable range, 372 sensitivity, 371372 sequence variations, 366 software tools, 365 specificity, 371372 validation, 367368, 368f variant calling process, 366 workflow, 364366, 365f Clinical utility exome and genome sequencing, 288290 data load, 289290 incidental genomic information, 288289 timeline, 288 results, 414 somatic mutation detection, 322323 Clinical validity, results, 414 ClinVar, 201, 204, 385 Clonal architecture analysis, 351352 CLSI. See Clinical and Laboratory Standards Institute (CLSI) CMA. See Chromosome microarray (CMA) tests

CMOS. See Complementary metal-oxide semiconductor (CMOS) technology CNV. See Copy number variants (CNV) Code of Ethics, AMA, 421 Coding regions, single nucleotide variant (SNV), 113 missense SNV, 113 nonsense SNV, 113 synonymous SNV, 113 CODIS. See Combined DNA Index System (CODIS) Codon level mutations, 345 College of American Pathologists (CAP), 379, 451 Combined DNA Index System (CODIS), 423 Commercial Fee-for-Service, 448 Common variation HapMap, 196197 identification of, 195198 interpretation of, 199 NHLBI-ESP, 198 phenotype and, 199 allele frequency, 199 GWAS, 199 molecular, 199 1000 Genomes Project, 197198 Comparative genomic hybridization (CGH), 170 Complementary metal-oxide semiconductor (CMOS) technology, 14 Computational prediction programs, 222223 Conceptual approaches, CNV detection, 173179, 174f depth of coverage, 176177, 176f discordant mate pair methods, 175, 175f SNP allele frequency, 177178, 178f split reads, 178179, 179f Concurrent testing, constitutional diseases, 245 Confidentiality breach of, 422 data environment, 423424 reidentification, 423424 untrustworthy people, 423 data protection methods and, 422423 anonymity, 422 data deidentification, 422 reidentification, 422423 required/permitted sharing, 424 defined, 422 overview, 421422 recommendations, 425 controlled access, 425 informed consent, 425 misuse concerns, 425 Confirmatory testing, 366 Consensus coding sequence (CCDS), 48 Constitutional diseases, amplification-based NGS advantages of, 247248 bioinformatics, 245246 concurrent testing, 245 disadvantages of, 247248 disease-targeted sequencing, 241242, 242t

461 exons, design and inspect primers, 244 future directions, 248 multigene panel validation, 243245 run validation samples, 244245 select genes, 244 target enrichment, 242243 Constitutional disorders, exome and genome sequencing in artifacts, recognizing and managing, 284285 causal mutation discovery, 279291 de novo mutations, 282 familial studies, 281282 phenotypically similar unrelated probands, 279281 recessive diseases, 281282 clinical utility, 288290 data load, 289290 incidental genomic information, 288289 timeline, 288 combinatorial approaches, 286 genetic/genomic investigations, 287 consequences of, 290291 ethical issues, 291 genetic counseling, 291 genomic sequencing, 276279 advantages of, 276 CNV detection, 278 depth of coverage, 278 disadvantages of, 277 exomes vs. genomes, 277279 resource-based considerations, 278279 targeted/covered regions, 277278 historical perspective, 274276 chromosomal studies, 275 genetic markers, 275 GWAS, 276 human genome project, 276 microarray, 275276 modern sequencing technologies, 276 overview, 273276 pathway-related data, 284 population variation databases, 282284 accuracy and reproducibility of, 283284 expressivity, 283 filtering genomic data sets, 282283 penetrance, 283 variants, functional interpretation of, 285 Constitutional disorders, targeted hybrid capture for allelic heterogeneity, 252 challenges, 267 clinical application, 266268 hearing loss and related disorders, 267 inherited cardiomyopathies, 266 clinical overlap with related disorders, 252 costello syndrome, 253 disease-targeted gene panels, 264265 amplification-based capture, 265 target selection methods, 265 whole exome sequencing, 265 whole genome sequencing, 264265 gene selection, 262

462 Constitutional disorders, targeted hybrid capture for (Continued) hereditary hearing loss, 253 inherited cardiomyopathies, 252 interpretive challenges, 263264 locus heterogeneity, 252 molecular diagnostics, 253254 operational considerations automation, 260 batch and pool samples, 260 CNV, 261262 cost-reduction measures, 260 sequencing cost, 260 sequencing machine, 261 turnaround time, 261 workflow, 259260 overview, 252254 target selection, 255257 technical challenges, 263 GC-rich or repetitive regions, 263 target size, 263 technical design considerations, 257259 Conversion Factor (CF), 448450 Copy number variants (CNV) cell-free DNA screening, 173, 181182 CGH, 170 chimeric proteins, 168 clinical detection, 170173 conceptual approaches, 173179, 174f depth of coverage, 176177, 176f discordant mate pair methods, 175, 175f SNP allele frequency, 177178, 178f split reads, 178179, 179f constitutional disorders, targeted hybrid capture for, 261262 cost-effective mutation testing, 170 defined, 166167 disease phenotypes, 168169, 169t exome sequencing, 172 unbiased variant discovery, 181 FISH, 170 formation mechanisms, 167 microhomology-mediated end joining (MMEJ), 167, 168f nonhomologous end joining (NHEJ), 167, 168f functional consequences of, 168169, 169f historical and current methods, 170 human genome frequency, 167168 orthogonal validation, 184 overview of, 166 reference standards for, 182183 somatic mutation detection, 331332 target capture methods, 52 targeted gene sequencing, 170172, 180181 whole genome sequencing, 172173 emerging technologies, 182 COSMIC, 204205 Cost-effective mutation testing, CNV, 170 Cost-effectiveness, somatic mutation detection, 332 Costello syndrome, 253 Cost-reduction measures, 260 Cost structure, 456


Court of Appeals for the Federal Circuit (CAFC) formation of, 437 Coverage, RNA-sequencing, 84 Crichton, Michael, 435 Current Procedural Terminology (CPTs) codes, 451452 Cyclic array sequencing, 7 Cytogenetics, 3

D Data, reanalysis and reinterpretation, 228 Data access and interpretation, 205210 ACMG incidental genes list, 209210 by condition/reported phenotype, 208 by gene, 208 genomic locations, 205207, 207t by particular variant attributes, 208210 Data analysis cancer, amplification-based NGS, 314, 315t technical information about test, 226 Databases, 201205. See also Population variation databases archives of variant/phenotype relationships, 202205, 203t ClinVar, 204 COSMIC, 204205 dbGaP, 202204 DECIPHER, 204 OMIM, 204 PubMed, 205 archives of variants, 201202 dbSNP, 201202 dbVar, 202 genetic reference, 423 online mutation, 221222 SNV detection, 122 variant pathogenicity, 211212 Data load, 289290 Data representation, reference materials (RM), 397 Data storage, reporting issues, 228 DbGaP, 202204 DECIPHER, 204 Decreased protein activity, 138 Decreased transcription, 136137 Defective mismatch repair, indels, 134 De novo read assembly, 81 Department of Health and Human Services (HHS), 443444 Depth of coverage accuracy, 369 cancer, exome and genome sequencing in, 352354 constitutional disorders, exome and genome sequencing in, 278 copy number variants (CNV), 176177, 176f insertion/deletion events (indels), 141 metrics for assessing genotype quality, 105 somatic mutation detection, 332 somatic mutation detection, targeted hybrid-capture for, 332 written report, 226 Depth of sequencing, 116

Developmental delay, 155 Diagnosis-related group (DRG) model, 448 Diagnostics vs. screening, 408409 Diamond v. Chakrabarty, 437 Diamond v. Diehr, 437 Differential expression, RNA-sequencing, 83 Disclosure of information, 410411 Discordant mate pair methods, CNV detection, 175, 175f Discordant paired-end analysis, translocation detection, 158 Disease, translocations, 152 carcinomas, 155 developmental delay, 155 hematologic malignancies, 153154 hereditary cancer syndromes, 155 inherited disorders, 155 leukemias, 153154 lymphomas, 154 recurrent miscarriages, 155 sarcomas, 154 tumors, 154155 Disease-associated exome testing, 51 advantage of, 51 disadvantage of, 51 sequencing, 51 WES vs., 51 Disease phenotypes, copy number variants (CNV), 168169, 169t Disease-targeted gene panels, targeted hybrid capture for, 264265 amplification-based capture, 265 target selection methods, 265 whole exome sequencing, 265 whole genome sequencing, 264265 Disease-targeted sequencing, 241242, 242t DNA preparation, hybrid-capture-based NGS, 3839 DNA replication errors, SNV, 112 DNA/RNA extraction and quality control, 309 DNA sequencing, cancer, amplificationbased NGS, 303305 general considerations, 305 multiplex amplification, 303 single-plex amplification, 304 targeted capture, 304 Down syndrome, 3 DRG model. See Diagnosis-related group (DRG) model Driver mutation analysis, 351 Dye terminator sequencing by synthesis (SBS), 299, 300f

E EEOC. See Equal Employment Opportunity Commission (EEOC) Electronic sequencing, 75 ELSI (ethical, legal and social implications) considerations, 416417 Emory Genetics Laboratory, 48 Employee Retirement Income Security Act (ERISA), 443444 EMR/EHR, 236



Epigenetic modifications, SMRT sequencing, 72, 72f Equal Employment Opportunity Commission (EEOC), 444445 ERCC. See External RNA Control Consortium (ERCC) ERISA. See Employee Retirement Income Security Act (ERISA) Ethical challenges autonomy, 406 beneficence, 406407 confidentiality. See Confidentiality diagnostics vs. screening, 408409 individuals and families, 410 information disclosure, 410411 informed consent. See Informed consent (IC) justice, 407 overview, 404408 privacy. See Privacy research and clinical care, 409410 results. See Results European Society of Human Genetics, 367 Evolutionary conservation, 223 Exome and genome sequencing, in cancer analytic considerations, 352356 advantages, 354356 CNV detection, 355356 depth of coverage, 352354 limitations, 352354 single assay validation, 354355 specimen requirements, 352 SV detection-genomes, 356 overview, 344 somatic mutations, 344347 chromosome level mutations, 346347 codon level mutations, 345 exon level mutations, 345 gene level mutations, 346 paired tumor-normal testing, 347349 variants of unknown significance, 349352 clonal architecture analysis, 351352 driver mutation analysis, 351 general categories, 350 pathway analysis, 351 statistical models of mutation effect, 350351 Exome and genome sequencing, in constitutional disorders artifacts, recognizing and managing, 284285 causal mutation discovery, 279291 de novo mutations, 282 familial studies, 281282 phenotypically similar unrelated probands, 279281 recessive diseases, 281282 clinical utility, 288290 data load, 289290 incidental genomic information, 288289 timeline, 288 combinatorial approaches, 286 genetic/genomic investigations, 287 consequences of, 290291

ethical issues, 291 genetic counseling, 291 genomic sequencing, 276279 advantages of, 276 CNV detection, 278 depth of coverage, 278 disadvantages of, 277 exomes vs. genomes, 277279 resource-based considerations, 278279 targeted/covered regions, 277278 historical perspective, 274276 chromosomal studies, 275 genetic markers, 275 GWAS, 276 human genome project, 276 microarray, 275276 modern sequencing technologies, 276 overview, 273276 pathway-related data, 284 population variation databases, 282284 accuracy and reproducibility of, 283284 expressivity, 283 filtering genomic data sets, 282283 penetrance, 283 variants, functional interpretation of, 285 Exome capture, 4850. See also Whole exome sequencing (WES) challenging aspect, 48 kits, 48, 49t Exome sequencing copy number variants (CNV), 172 unbiased variant discovery, 181 Exon level mutations, 345 Exons, constitutional diseases, 244 Expressivity, exome and genome sequencing, 283 External RNA Control Consortium (ERCC), 400401

F Familial testing, reports, 223 Families, ethical challenges, 410 Family and Medical Leave Act, 444 FASTQ format, 102 FDA. See US Food and Drug Administration (FDA) 5500 Genetic Analyzers, 14 Filtering, RNA-sequencing, 82 Filtering genomic data sets, 282283 FISH. See Fluorescence in situ hybridization (FISH) Fluidigm Access array system, 62 Fluorescence in situ hybridization (FISH), 170 Fluorescent dye-terminator method, 5. See also Sanger sequencing Fluorometery, 59 Food and Drug Administration (FDA), 367, 437 LDT oversight, 379381 regulatory oversight, 379381 risk classification, 379380, 380t Fourth-generation sequencing nanopore sequencing, 7374, 73f overview, 70t, 7374 Frameshift, 137138

Functional consequences copy number variants (CNV), 168169, 169f insertion/deletion events (indels), 136139 abnormal protein aggregation, 137 altered splicing, 137 decreased protein activity, 138 decreased transcription, 136137 frameshift, 137138 increased protein activity, 138 microsatellite instability/rapid repeat expansion, 137 in silico prediction, 139 synonymous, missense or nonsense variants, 138 Functional domains, 223 Functional studies, 223 Fusion detection, RNA-sequencing, 8384

G GAO. See US Government Accountability Office (GAO) GATK. See Genome Analysis Toolkit (GATK) Gene expression profiling assays, 438 Gene expression RM, 400401 calibration, 400401 controls, 401 mixed tissue, 401 SRM, 401 SRM 2374, 400 Gene level mutations, 346 Gene name, 220221 Gene panels, targeted enrichment, 5051 Gene patents adverse effects of, 438 arguments for and against, 438 expression profiling assays, 438 GINA. See Genetic Information Nondiscrimination Act (GINA) history of, 436437 key developments, 437 legal cases, 438443 overview, 435436 General Electric, 437 Gene selection, constitutional disorders, targeted hybrid capture for, 262 Genetic counseling, 291 Genetic exceptionalism, 405 Genetic information characteristics of, 405 genetic exceptionalism and, 405 Genetic Information Nondiscrimination Act (GINA), 435, 443445 discrimination complaints, 445 gaps in, 445 on genetic information, 444 on genetic test, 444 Title II of, 444 Title I of, 443444 Genetic markers, 275 Genetic reference databases, 423 Genetic test cost, 456 court decisions, 443 GINA on, 444 Supreme Court on, 443

464 Genetic Testing Reference Material Coordination Program (GeT-RM), 373 Genome Analysis Toolkit (GATK), 11 Genome in a Bottle (GIAB) Consortium, 394399 Performance Metrics and Figures of Merit Working Group, 397399 reference material characterization, 396 RM Selection and Design Working Group, 394395 Genome-Phenome Analyzer, 201 Genome-scale measurements, reference materials for, 400401 gene expression, 400401 microbial DNA, 400 Genome-wide measurement, methylome sequencing, 85 Genomic sequencing, constitutional disorders, 276279 advantages of, 276 CNV detection, 278 depth of coverage, 278 disadvantages of, 277 exomes vs. genomes, 277279 resource-based considerations, 278279 targeted/covered regions, 277278 Genotyping, 92 base calling. See Base calling library preparation, 92 read mapping. See Read mapping strategies template amplification, 92 Geographic Practice Cost Indices (GPCI), 448450 GeT-RM. See Genetic Testing Reference Material Coordination Program (GeT-RM) GINA. See Genetic Information Nondiscrimination Act (GINA) Global Alliance for Genomics and Health (GA4GH), 212213 Google, 423 Government-funded insurance payers, 448 GPCI. See Geographic Practice Cost Indices (GPCI) Graham v. John Deere Co., 439440 GS FLX 1 system, 18 GS FLX Titanium chemistry, 18 454 GS20 Genome Sequencer, 18, 18t Guanine-cytosine (GC) content, 4142 GWAS, 199, 276

H Haplotype phasing, 396 HapMap, 196197 HapMap consent, 394395 Havasupai Tribe, 424 Health Care Professional Advisory Committee, 451 Health Insurance Portability and Accountability Act (HIPAA), 424 Health Maintenance Organizations (HMO), 448 Hearing loss and related disorders, 267 Heliscope genetic analysis system, 72


Hereditary cancer syndromes, 155 Hereditary hearing loss, 253 Heterogeneity, indels, 142 High sensitivity tools, SNV detection, 121 HIPAA. See Health Insurance Portability and Accountability Act (HIPAA) HiSeq system, Illumina sequencing, 10 Homopolymers Ion Torrent sequencing, 1415 pyrosequencing, 18 Human disease, translocations, 152 carcinomas, 155 developmental delay, 155 hematologic malignancies, 153154 hereditary cancer syndromes, 155 inherited disorders, 155 leukemias, 153154 lymphomas, 154 recurrent miscarriages, 155 sarcomas, 154 tumors, 154155 Human genome frequency, copy number variants (CNV), 167168 Human genome project, 276 Human Genome Project (HGP), 92, 405 Human variation identification methods, 193194 sequence discovery, 193194 targeting known sequences, 193 reference assembly, 194195 variant call validation, 195 Human Variome Society, 212 Hybrid-capture-based NGS capture probes for, 39 clinical applications of, 4851 disease-associated exome testing, 51 exome capture, 4850, 49t selected gene panels, 5051 coverage, 41 DNA preparation for, 3839 library complexity, 42 library preparation, 3940 obstacles, 4142 overview, 38 principles of, 3842 sensitivity, 41 specificity, 41 specimens for, 3839 target capture enrichment, 4348 amplification-based enrichment vs., 4748 comparison of performance characteristics, 47, 47t molecular inversion probes (MIP), 46 practical and operational considerations, 5253, 53t solid-phase hybrid capture, 4344, 43f solution-based hybrid capture, 4445, 45t target region of interest (ROI), 39 turnaround time (TAT), 5253 uniformity, 41 variant detection, 5152 workflow, 5253

I ICD-9-CM, 450 ICD-10-CM, 450 Identification, of variation, 195198 IGV. See Integrative Genome Viewer (IGV) Illumina Clinical Services Laboratory (ICSL), 28 Illumina HiSeq and MiSeq systems, 299300 Illumina platforms, base calling, 9596, 9899 density of DNA clusters, 9596 intensity by cycle plot, 96 intensity versus cycle (IVC) plot, 96 percentage phasing/prephasing, 96 PhiX-based quality metrics, 96 quality score distribution, 96 quality score heatmap, 96 quality scoring, 99 template generation, 98 Illumina sequencing, 811, 9f distinguishing feature of, 8 estimated sequencing yield for, 11t HiSeq system, 10 library preparation, 910 A-tailing (50 adenylation) step, 10 end repair step, 10 gel electrophoresis, 10 index sequence, 10 input DNA, 910 Klenow fragment, 10 purification procedure, 10 MiSeq system, 10 phasing, 1011 steps for, 9t variants, 1011 Illumina TruSeq Amplicon Cancer Panel, 313 Imperfect double-strand DNA break repair, indels, 134 Incidental findings, 412413 Incidental genomic information, 288289 Increased protein activity, 138 Independent Payment Advisory Board (IPAB), 455 Individuals, ethical challenges, 410 Informatic approaches, to translocation detection, 158160 discordant paired-end analysis, 158 inversions, 158159 RNA-Seq, 159160 split read-based analysis, 158 Informatics pipeline, 365366 Information disclosure, 410411 Informed consent (IC), 425430 for Cancer Susceptibility Testing, 426, 427t clinical consents, 426 components of, 427t goal of, 426, 428 recommendations, 427430 counseling, 429430 critical information elements, 426, 428t patient initiative, 427428 recontacting, 429 risks and benefits, negotiation of, 429 transparency, 430 research consents, 426 sharing information, 425



Inherited cardiomyopathies, 252, 266 Inherited disorders, translocations in, 155 developmental delay, 155 hereditary cancer syndromes, 155 recurrent miscarriages, 155 In re Deuel, 439440 In re Kubin, 439 Insertion/deletion events (indels) bioinformatics, 142147 left alignment, 143144 local realignment, 142 probabilistic modeling using mapped reads, 144145 split-read analysis, 145146 constitutional and somatic disease, 131133 defined, 130 formulation mechanisms, 133134 defective mismatch repair, 134 imperfect double-strand DNA break repair, 134 secondary structure formation, 134 slipped strand mispairing, 133, 133f unequal meiotic recombination, 134 frequency, 134136 functional consequences, 136139 abnormal protein aggregation, 137 altered splicing, 137 decreased protein activity, 138 decreased transcription, 136137 frameshift, 137138 increased protein activity, 138 microsatellite instability/rapid repeat expansion, 137 in silico prediction, 139 synonymous, missense or nonsense variants, 138 overview, 130133 reference standards, 147 sensitivity and specificity issues, 146147 annotation, 146147 length, 146 truth definition, 147 SNV, 130131, 131f specimen issues, 141142 cellularity and heterogeneity, 142 library complexity, 142 technical issues, 139141 assay design, 141 depth of coverage, 141 library preparation technique, 141 sequence read type and alignment, 140141 sequencing platform chemistry, 139140 In silico-based proficiency tests, 388 In silico prediction, 139 In situ sequencing, 74 Insurance payers, 448 government-funded, 448 private, 448 Integrative Genome Viewer (IGV), 102 Intensity by cycle plot, 96 Intensity versus cycle (IVC) plot, 96 Internal Revenue Code, 443444 International Classification of Diseases (ICD), 450451

Ninth Revision, Clinical Modification (ICD-9-CM). See ICD-9-CM Tenth Revision, Clinical Modification (ICD-10-CM). See ICD-10-CM International Standards Organization (ISO), 378379 Interpretation, 3133 SNV detection, 122123 databases, 122 kindred testing, 123 missense variants, 122 online resources, 122 paired tumor-normal testing, 123 possible splice effects, 122123 of variation, 199 Ion AmpliSeq CCP, 311312 Ion AmpliSeq RNA Cancer Panel, 313 Ion PGM, 15, 16t Ion Proton, 15, 16t Ion semiconductor-based NGS, 300301 Ion Sphere Particle (ISP), 97. See also Torrent platforms, base calling clonal, 97 library and test, 97 live, 97 usable sequence, 97 Ion Torrent sequencing, 1416, 15f chemistry, 15 disadvantage of, 15 estimated sequencing yield for, 16t homopolymers, 1415 instruments Ion PGM, 15, 16t Ion Proton, 15, 16t library preparation for, 14 AmpliSeq, 16 CMOS technology, 14 paired-end modes, 15 software, 15 substrate in, 14 wells per chip, 14 IPAB. See Independent Payment Advisory Board (IPAB) Isaac, 102 ISO. See International Standards Organization (ISO)

J Justice, ethical challenges, 407

K Kay, Mark A., 441 Kindred testing, SNV detection, 123 Klein, Roger D., 441442 KSR Int’l Co. v. Teleflex Inc., 439

L Laboratory-developed tests (LDT), 438 FDA, 379381 Laboratory issues, translocation detection, 161162 LCD. See Local Coverage Determination (LCD) Left alignment, indels, 143144 Legal cases, on gene patents, 438443

Leiden Open (source) Variation Database (LOVD), 212 Leveraging standards, reporting software, 237 Library complexity hybrid-capture-based NGS, 42 SNV detection, 116 Library preparation amplification-based NGS, 6162, 61f cyclic array sequencing, 7 genotyping, 92 hybrid-capture-based NGS, 3940 DNA fragmentation, 39 Escherichia coli DNA polymerase, 39 T4 DNA polymerase, 39 Illumina sequencing, 910 A-tailing (50 adenylation) step, 10 end repair step, 10 gel electrophoresis, 10 index sequence, 10 input DNA, 910 Klenow fragment, 10 purification procedure, 10 insertion/deletion events (indels), 141 Ion Torrent sequencing, 14 AmpliSeq, 16 CMOS technology, 14 nanopore sequencing, 7374 Roche 454 genome sequencing, 1718, 17f Limit of detection (LOD), 371 LIMS tracking, 233 Listing, 419420, 419t Local Coverage Determination (LCD), 448 Local realignment, indels, 142 Locus heterogeneity, 252 LOD. See Limit of detection (LOD) Long QT syndrome, 438 Lourie, Alan, 442 LOVD. See Leiden Open (source) Variation Database (LOVD)

M Malpractice (MP) RVU, 448450 MAQ, 101 Mayo Collaborative Services v. Prometheus Laboratories, Inc., 440443 MBPT. See Methods-based proficiency testing (MBPT) MedGen, 200 Medical and scientific literature, 222 Medical EmExome, 48 Medicare Administrative Contractors (MAC), 448450 Medicare Physician Fee Schedule (MPFS), 448450 Mendelian inheritance, 396 Metal ions, 111 Methods-based proficiency testing (MBPT), 388 analyte-specific vs., 387 Methylome sequencing, 8586 bioinformatic workflow for, 86f bisulfite conversion, 85 genome-wide measurement, 85

466 Metrics, for assessing genotype quality, 103106 depth of coverage, 105 library fragment length, 104105 percent of mapped reads, 104 of read pairs aligned, 104 reads in the exome or target region, 104 of unique reads, 105106 performance and diagnostic, 103106 target coverage grap, 106 total read number, 104 Microarray, 275276 Microarray analysis, RNA-sequencing, 78, 79t Microbial DNA RM, 400 Microhomology-mediated end joining (MMEJ), 167, 168f Microsatellite instability/rapid repeat expansion, 137 Million Veteran Program, 423 MIP. See Molecular inversion probes (MIP) Miscarriages, 155 MiSeq system, Illumina sequencing, 10 Missense SNV, 113 Missense variants, SNV detection, 122 Mitochondrial heteroplasmy, 7 Mixed tissue RM, 401 MLPA. See Multiplex ligation-dependent probe amplification (MLPA) Modern sequencing technologies, 276 Molecular diagnostics, constitutional disorders, targeted hybrid capture for, 253254 Molecular inversion probes (MIP), 46 advantages of, 46 clinical applications, 46 disadvantages of, 46 ParAllele Bioscience, 46 Molecular oncology testing, 324325 Molecular phenotype, 199 Moore’s law, 14 Mosaik, 102 MPFS. See Medicare Physician Fee Schedule (MPFS) Multigene panel validation, constitutional diseases, 243245 Multiplex amplification, 305306, 306f Multiplexing, amenable to, somatic mutation detection, 326328 Multiplex ligation-dependent probe amplification (MLPA), 170 Multiplex PCR, 6061, 61f Mutation effect, 221 Mutations, technical information about test, 225 Mutation Surveyor, 7 Myriad Genetics, 441

N NA12878, 394395 Nanopore sequencing, 7374, 73f electrical current variations, 73 library preparation, 7374 technical modifications, 74 National Bioethics Advisory Commission (NBAC), 405


National Center for Biotechnology Information (NCBI), 385 National Center for Health Statistics (NCHS), 450 National Coverage Determination (NCD), 448 National Human Genome Research Institute (NHGRI), 385 National Institute of Standards and Technology (NIST), 393 Standard Reference Materials, 373 NBAC. See National Bioethics Advisory Commission (NBAC) NCBI. See National Center for Biotechnology Information (NCBI) NCD. See National Coverage Determination (NCD) NCHS. See National Center for Health Statistics (NCHS) Next (Crichton), 435 Next-generation sequencing (NGS), 4 analytical processes, 2729 applications, 2426 bioinformatics, 2930 interpretation, 3133 overview, 2122 preanalytical and quality laboratory processes, 2627 proficiency testing, 31 reporting, 3133 test information, 2426 challenges, 2526 validation, 3031 Next Generation Sequencing-Standardization of Clinical Testing (Nex-StoCT) working group, 379 NHGRI. See National Human Genome Research Institute (NHGRI) NHLBI-ESP, 198 NimbleGen SeqCap EZ, 48 NimbleGenSeqCap EZ Comprehensive Cancer Panel, 6465 NIST. See National Institute of Standards and Technology (NIST) Nomenclature, 221 Nonhomologous end joining (NHEJ), 167, 168f Nonsense SNV, 113 Nonsmall cell lung carcinoma (NSCLC), 155 Novel technologies, 70t, 7475 electronic sequencing, 75 in situ sequencing, 74 transmission electron microscopy (TEM), 7475 Novoalign, 101 Novocraft Technologies, 101 NSCLC. See Nonsmall cell lung carcinoma (NSCLC) Nucleic acids isolation of, 59 preparation, 59

O Office of Human Research Protection (OHRP), 424 OHRP. See Office of Human Research Protection (OHRP)

OMIM, 200, 204 Online mutation databases, 221222 Online resources SNV detection, 122 to translocation detection, 162 Orphanet, 200 Orthogonal validation copy number variants (CNV), 184 Orthogonal validation, SNV detection, 122

P Paired tumor-normal testing, 123, 347349 ParAllele Bioscience, 46 Patent, 436. See also Gene patents applications, 436 Constitution and, 436 infringement, 436 laws, 436 Patent Act in 1952, 436 Patent Act of 1790, 436 Patient-Centered Outcomes Research Institute (PCORI), 454 Patient Protection and Affordable Care Act (ACA). See Affordable Care Act (ACA) Patient’s privacy. See Privacy PCORI. See Patient-Centered Outcomes Research Institute (PCORI) Penetrance, exome and genome sequencing, 283 Performance Metrics and Figures of Merit Working Group, 397399 Personal Genome Project (PGP), 394395 Personal utility, results, 415416 Pfu polymerase, 60 PGP. See Personal Genome Project (PGP) Phenomizer, 201 Phenotype, 199 allele frequency, 199 GWAS, 199 molecular, 199 PhiX-based quality metrics, 96 Polymorphisms, single nucleotide variant (SNV), 110111 Population frequency, 222 Population variation databases, exome and genome sequencing, 282284 accuracy and reproducibility of, 283284 expressivity, 283 filtering genomic data sets, 282283 penetrance, 283 Practice Expense (PE) RVU, 448450 Prader-Willi syndrome, 3 Precision defined, 370 electronic data files, 370 reproducibility, 370 wet-bench procedures, 383 Preferred Provider Organizations (PPO), 448 Privacy defined, 422 information, 422 invasion of, 422 overview, 421422 Private insurance payers, 448


Probabilistic information, 412 Probabilistic modeling using mapped reads, 144145 Proficiency testing, 31 analyte-specific vs. methods-based, 387 cell lines, 387 EQA, 386387 MBPT, 387 sample exchange programs, 387 Protein structure, 223 Provenance tracking and versioning, 234 Pseudomonas bacterium, 437 Public Health Service Act, 443444 PubMed, 205 Pyrococcus furiosus, 60 Pyrosequencing, in Roche 454 system, 1618, 16f apyrase, 1617 ATP sulfurylase, 1617 dATPαS, 1617 luciferin, 1617 non-high-throughput configuration, 1617 template DNA strand, 1617

Q Quality assessment (QA) proficiency testing analyte-specific vs. methods-based, 387 cell lines, 387 EQA, 386387 MBPT, 387 proficiency testing, 386388 sample exchange programs, 387 programs, 386 Quality control (QC) procedures, 381385 analytic variables, 382385 postanalytic variables, 385 preanalytic variables, 381382 for testing, 372373 Quality score distribution, 96 Quality score heatmap, 96 Quality scoring, 99

R Radiation, single nucleotide variant (SNV), 112113 RainDance ONCOSeq Panel, 313 RainDance Thunderstorm system, 62 Raw data reporting issues, 227228 results, 412 Reactive oxygen species (ROS), 111, 112f Read alignment, RNA-sequencing, 8182 reference genome, 82, 83f Read mapping strategies alignment factors, 102103 formats, 102 processing, 103 alignment tools, 101102 Bowtie, 101 BWA, 101 Isaac, 102 MAQ, 101 Mosaik, 102

Novoalign, 101 TMAP, 102 genotyping, 9394 reference genome, 100 RNA-sequencing, 8182 de novo read assembly, 81 read alignment, 8182 Recessive diseases, 281282 Recurrent miscarriages, 155 Reference Data, 399400 Reference genome, 100 to read alignment, 82, 83f Reference materials (RM) bioinformatics, 396397 characterization, 396 clinical testing and, 373 data integration, 396397 data representation, 397 defined, 393 developing, challenges in, 393394 gene expression, 400401 homogeneity, 396 microbial DNA, 400 overview, 393394 stability, 396 Reference range, 372 wet-bench procedures, 384 Reference standards, copy number variants (CNV), 182183 Regulatory compliance, reporting software, 237 Regulatory regions, single nucleotide variant (SNV), 114 Regulatory standards CAP checklist, 379 CDC, 379 CLIA, 378379 FDA, 379381 overview, 378 Reimbursement processes, 448452 test design factors affecting, 452454 Reimbursement rate, 448450 calculation formula, 448450 medical coverage determinations, 450 NCD, 450 Relative Value Unit (RVU), 448450 components, 448450 GPCI and, 448450 Reportable range, 372 wet-bench procedures, 384 Reporting, 3133 single nucleotide variant (SNV), 123 Reporting issues, 227228 data, reanalysis and reinterpretation, 228 data storage, 228 raw data, 227228 Reporting software analytics, 233235 clinical genomic test, 232233 EMR/EHR, 236 leveraging standards, 237 LIMS tracking, 233 overview, 232 regulatory compliance, 237


468 Roche 454 genome sequencing (Continued) homopolymers, 18 luciferin, 1617 non-high-throughput configuration, 1617 template DNA strand, 1617 workflow in, 17f RVU. See Relative Value Unit (RVU)

S Sample exchange programs, 387 Sample requirements, 309 Sanger sequencing, 47 in clinical genomics, 56 DNA polymerase in, 4 electropherogram for, 57, 5f gel electrophoresis for, 4, 5f Klenow fragment, 4 steps, 4 chase, 4 labeling and termination, 4 Taq polymerase, 4 T7 DNA polymerase, 4 technical constraints, 67 input DNA, 67 read lengths, 6 sensitivity limitation, 7 20 ,30 -dideoxynucleotide triphosphates (ddNTPs), 4 variants of, 4 SCARF format, 102 Screening, diagnostics vs., 408409 Secondary structure formation, indels, 134 Sensitivity assays, 371372 hybrid-capture-based NGS, 41 insertion/deletion events (indels), 146147 annotation, 146147 length, 146 truth definition, 147 wet-bench procedures, 383 Sequence alignment map (SAM), 102 Sequence verification, wet-bench procedures, 384 Sequencing by Oligo Ligation Detection (SOLiD), 1114 advantages, 1314 disadvantages, 14 estimated sequencing yield for, 14t fluorophores, 1113 ligation step, 11 paired-end, 14 primer elongation, 11 principle of, 13f substrate for, 11 Sequencing cost, 260 Sequencing machine, 261 Shared Savings Program, 455 SimulConsult, 201 Single-molecule real-time (SMRT) sequencing, 7172, 71f advantages, 72 epigenetic modifications, 72, 72f zero-mode wave guides (ZMW), 71 Single nucleotide variant (SNV)

INDEX

bioinformatic approaches, 117122 clinical implications, 121 high sensitivity tools, 121 orthogonal validation, 122 parameters, 118120 software, 117118, 118t tumor/normal analyses, 121 coding regions, 113 missense SNV, 113 nonsense SNV, 113 synonymous SNV, 113 consequences, 113114 insertion/deletion events (indels), 130131, 131f interpretation, 122123 databases, 122 kindred testing, 123 missense variants, 122 online resources, 122 paired tumor-normal testing, 123 possible splice effects, 122123 polymorphisms, 110111 regulatory regions, 114 reporting, 123 RNA processing and, 113114 sources, 111113 chemical mutagens, 112 DNA replication errors, 112 metal ions, 111 radiation, 112113 reactive oxygen species (ROS), 111, 112f spontaneous chemical reactions, 111 target capture methods, 52 technical issues, 114117 anticipated sample purity, 116 depth of sequencing, 116 library complexity, 116 platforms, 114 sample type, 117 target enrichment approach, 115116 target size, 114115 Single-plex amplification, 306 Slipped strand mispairing, indels, 133, 133f SNP allele frequency, CNV detection, 177178, 178f SNV. See Single nucleotide variant (SNV) Social Security Act, 443444 Software. See also Reporting software RNA-sequencing, 79, 80t SNV detection, 117118, 118t SOLiD, 301302, 302f Solid-phase hybrid capture, 4344, 43f advantages of, 44 microarrays, 4344 Solid-phase vs. in-solution phase, somatic mutation detection, 323324 Solution-based hybrid capture, 4445, 45t advantages, 45 LDT, 45 microarray enrichment, 4445 oligonucleotide probes, 4445 published description, 45 Somatic mutation detection, targeted hybridcapture for cancer genes, 326

clinical laboratory setting, 333338 design, 333 genetic targets, 334 pathologic assessment, 333334 QC metrics, 334337, 335t reportable range, 334 specimen requirements, 333 validation, 337338 clinical utility, 322323 CNV, 331332 cost-effectiveness, 332 depth of coverage, 332 molecular oncology testing, 324325 multiplexing, amenable to, 326328 overview, 322 samples, 326 solid-phase vs. in-solution phase, 323324 structural rearrangements, 328331 Somatic mutations, cancer, exome and genome sequencing in, 344347 chromosome level mutations, 346347 codon level mutations, 345 exon level mutations, 345 gene level mutations, 346 paired tumor-normal testing, 347349 Specificity assays, 371372 hybrid-capture-based NGS, 41 insertion/deletion events (indels), 146147 annotation, 146147 length, 146 truth definition, 147 wet-bench procedures, 383 Specimen issues, insertion/deletion events (indels), 141142 cellularity and heterogeneity, 142 library complexity, 142 Specimen provenance, wet-bench procedures, 384 Split-read analysis indels, 145146 translocation detection, 158 Split reads, CNV detection, 178179, 179f Spontaneous chemical reactions, SNV, 111 SRM. See Standard Reference Materials (SRM) SRM 2374, 400 Standard Reference Materials (SRM), 373, 393 Statistical models of mutation effect, 350351 Structural variants (SV) target capture methods, 52 Supporting evidence, 221223 Support personnel, reporting software, 237238 Supreme Court on genetic testing, 443 Susceptibility information, 412 SV. See Structural variants (SV) Sweet, Robert W., 441 Synonymous, missense or nonsense variants, 138 Synonymous SNV, 113


T Taq polymerase, 60 Target capture enrichment, 4348. See also Hybrid-capture-based NGS amplification-based enrichment. See Amplification-based NGS comparison of performance characteristics, 47, 47t molecular inversion probes (MIP), 46 advantages of, 46 clinical applications, 46 disadvantages of, 46 ParAllele Bioscience, 46 practical and operational considerations, 5253, 53t solid-phase hybrid capture, 4344, 43f advantages of, 44 microarrays, 4344 solution-based hybrid capture, 4445, 45t advantages, 45 LDT, 45 microarray enrichment, 4445 oligonucleotide probes, 4445 published description, 45 Targeted capture, 306307 Targeted gene sequencing copy number variants (CNV), 170172, 180181 translocation detection, 156157 Targeted hybrid capture, for constitutional disorders allelic heterogeneity, 252 challenges, 267 clinical application, 266268 hearing loss and related disorders, 267 inherited cardiomyopathies, 266 clinical overlap with related disorders, 252 costello syndrome, 253 disease-targeted gene panels, 264265 amplification-based capture, 265 target selection methods, 265 whole exome sequencing, 265 whole genome sequencing, 264265 gene selection, 262 hereditary hearing loss, 253 inherited cardiomyopathies, 252 interpretive challenges, 263264 locus heterogeneity, 252 molecular diagnostics, 253254 operational considerations automation, 260 batch and pool samples, 260 CNV, 261262 cost-reduction measures, 260 sequencing cost, 260 sequencing machine, 261 turnaround time, 261 workflow, 259260 overview, 252254 target selection, 255257 technical challenges, 263 GC-rich or repetitive regions, 263 target size, 263 technical design considerations, 257259

Targeted hybrid-capture, for somatic mutation detection cancer genes, 326 clinical laboratory setting, 333338 design, 333 genetic targets, 334 pathologic assessment, 333334 QC metrics, 334337, 335t reportable range, 334 specimen requirements, 333 validation, 337338 clinical utility, 322323 CNV, 331332 cost-effectiveness, 332 depth of coverage, 332 molecular oncology testing, 324325 multiplexing, amenable to, 326328 overview, 322 samples, 326 solid-phase vs. in-solution phase, 323324 structural rearrangements, 328331 Target enrichment, constitutional diseases, 242243 Target enrichment approach, SNV detection, 115116 Target region of interest (ROI), 39 Target selection methods, disease-targeted gene panels, 265 Target size, SNV detection, 114115 Technical challenges, constitutional disorders, 263 GC-rich or repetitive regions, 263 target size, 263 Technical information about test, 225227 additional test limitations, 227 analytical sensitivity and specificity, 226 clinical sensitivity and specificity, 226 data analysis, 226 depth of coverage, 226 FDA approval, 227 methodology and platform, 225 mutations, 225 no/low coverage, 226 variant interpretation algorithms, 226 Technical issues insertion/deletion events (indels), 139141 assay design, 141 depth of coverage, 141 library preparation technique, 141 sequence read type and alignment, 140141 sequencing platform chemistry, 139140 SNV detection, 114117 anticipated sample purity, 116 depth of sequencing, 116 library complexity, 116 platforms, 114 sample type, 117 target enrichment approach, 115116 target size, 114115 Technologies, amplification-based NGS for cancer, 298302 TEM. See Transmission electron microscopy (TEM)


U
UMLS. See Unified Medical Language System (UMLS)
Unequal meiotic recombination, indels, 134
Unified Medical Language System (UMLS), 200
Uniformity
  hybrid-capture-based NGS, 41
University of Arizona, 424
US Food and Drug Administration (FDA). See Food and Drug Administration (FDA)
US Government Accountability Office (GAO), 455

V
VAF. See Variant allele frequency (VAF)
Validation, 30–31
  somatic mutation detection, 337–338
Validation, of clinical testing, 367–368, 368f
  accuracy
    clinical testing, 369–370
    defined, 369
    depth of coverage, 369
  precision
    defined, 370
    electronic data files, 370
    reproducibility, 370
  reference range, 372
  reportable range, 372
  sensitivity, 371–372
  specificity, 371–372
Variant allele frequency (VAF), 7
Variant calling, RNA-sequencing, 82
Variant calls reporting software, 233–235
  analytical validation, 234
  pipeline orchestration and management, 234–235
  provenance tracking and versioning, 234
Variant detection. See also Copy number variants (CNV)
  hybrid-capture-based NGS, 51–52
    CNV, 52
    SNV, 52
    structural variants (SV), 52
  RNA-sequencing, 84–85
Variant interpretation algorithms, 226
Variant pathogenicity, 210–212
  ClinGen, 211
  clinical grade databases, 211–212
  expert panels, 210
  ICCG, 210
  ISCA, 210
  professional guidelines, 210
Variant Quality Score Recalibration (VQSR), 397
Variants
  classes of, 51–52
  constitutional disorders, exome and genome sequencing in, 285
  reporting software
    annotation and classification, 235–236
    interpretation, 236
  written report, 220–223
    classification, 221–223
    computational prediction programs, 222–223
    evolutionary conservation, 223
    familial testing, 223
    functional domains, 223
    functional studies, 223
    gene name, 220–221
    medical and scientific literature, 222
    mutation effect, 221
    nomenclature, 221
    online mutation databases, 221–222
    population frequency, 222
    protein structure, 223
    supporting evidence, 221–223
    transcript number, 220–221
    zygosity, 221
Variants of unknown significance (VUS), 49
  cancer, exome and genome sequencing in, 349–352
    clonal architecture analysis, 351–352
    driver mutation analysis, 351
    general categories, 350
    pathway analysis, 351
    statistical models of mutation effect, 350–351
  results, 412
Variation data, in public databases, 201–205
  archives of variant/phenotype relationships, 202–205, 203t
    ClinVar, 204
    COSMIC, 204–205
    dbGaP, 202–204
    DECIPHER, 204
    OMIM, 204
    PubMed, 205
  archives of variants, 201–202
    dbSNP, 201–202
    dbVar, 202
VQSR. See Variant Quality Score Recalibration (VQSR)
VUS. See Variants of unknown significance (VUS)

W
Watson, James, 423
Weld, William, 423
Weldon, David, 435
Wet-bench procedures, 383–384
  accuracy, 383
  precision, 383
  reference range, 384
  reportable range, 384
  sensitivity, 383
  sequence verification, 384
  specificity, 383
  specimen provenance, 384
Whole exome sequencing (WES), 48, 243
  application, 50
  constitutional disease and, 50
  coverage, 48
  in diagnosis and treatment, 49
  disease-targeted gene panels, targeted hybrid capture for, 265
  disease-targeted panel sequencing, 243–244
  patient management improvement, 50
  variants identified in, 49
Whole genome sequencing (WGS)
  copy number variants (CNV), 172–173
    emerging technologies, 182
  disease-targeted gene panels, targeted hybrid capture for, 264–265
  translocation detection, 156
Workflow
  amplification-based NGS, 58, 58f
  clinical testing, 364–366, 365f
  hybrid-capture-based NGS, 52–53
  methylome sequencing, 86f
  RNA-sequencing (RNA-Seq), 79–84, 80f
  Roche 454 genome sequencing, 17f
Work RVU, 448–450
Written report
  components of, 219–227
  incidental/secondary findings, 224–225
  patient demographics, 219
  recommendations, 224
  signature, 227
  summary statement, 219–220
  technical information about test, 225–227
    additional test limitations, 227
    analytical sensitivity and specificity, 226
    clinical sensitivity and specificity, 226
    data analysis, 226
    depth of coverage, 226
    FDA approval, 227
    methodology and platform, 225
    mutations, 225
    no/low coverage, 226
    variant interpretation algorithms, 226
  testing indication, 219
  test result interpretation, 223–224
  variants, 220–223
    classification, 221–223
    computational prediction programs, 222–223
    evolutionary conservation, 223
    familial testing, 223
    functional domains, 223
    functional studies, 223
    gene name, 220–221
    medical and scientific literature, 222
    mutation effect, 221
    nomenclature, 221
    online mutation databases, 221–222
    population frequency, 222
    protein structure, 223
    supporting evidence, 221–223
    transcript number, 220–221
    zygosity, 221

X
X-linked inhibitor of apoptosis (XIAP) gene, 49

Z
Zero-mode wave guides (ZMW), 71
ZMW. See Zero-mode wave guides (ZMW)
Zygosity, 221