Genomics of Rare Diseases: Understanding Disease Genetics Using Genomic Approaches 0128201401, 9780128201404

Genomics of Rare Diseases: Understanding Disease Genetics Using Genomic Approaches, a new volume in the Translational an

English Pages 316 [318] Year 2021

Table of contents :
Front Cover
Genomics of Rare Diseases
Copyright Page
Contents
List of contributors
About the editors
Preface
References
Acknowledgments
1 Introduction to concepts of genetics and genomics
1.1 Introduction
1.2 The human genome: structure and function
1.3 Genetic variation
1.4 Nomenclature in human genetics and genomics
1.5 Mendelian patterns of inheritance
1.6 Other modes of inheritance
1.7 Considerations of Mendelian disorders and genetic inheritance
1.8 Conclusion
Further reading
2 Karyotyping as the first genomic approach
2.1 Introduction
2.2 Numerical chromosome aberrations
2.2.1 Autosomal aneuploidy
2.2.2 Sex chromosome aneuploidy
2.3 Structural chromosome aberrations
2.3.1 Reciprocal translocations
2.3.2 Paracentric and pericentric inversions
2.3.3 Robertsonian translocations
2.3.4 Deletions and duplications
2.3.5 Intrachromosomal and interchromosomal insertions
2.3.6 Isochromosomes
2.3.7 Marker chromosomes
2.3.8 Complex rearrangements and chromothripsis
2.4 Uniparental disomy and genomic imprinting
2.5 Clinical indications and special considerations for chromosome analysis
References
3 Genomic disorders in the genomics era
3.1 Introduction
3.2 Chromosomal microarray analysis for copy-number variant detection and diagnosis of genomic disorders
3.3 The evolution of next-generation sequencing and bioinformatics for the detection of genomic rearrangements
3.4 Molecular mechanisms of genomic rearrangement generation
3.5 Genomic disorders and next-generation sequencing-based testing in the clinic
3.6 Interpretation of genomic structural and copy-number variants
3.7 Outlook
References
4 Genomic sequencing of rare diseases
4.1 Introduction: a human genome reference sequence
4.2 Sequencing of genomes and exomes
4.3 The process of genomic sequencing
4.3.1 DNA preparation
4.3.2 Library preparation
4.3.3 Sequencing
4.3.4 Data analysis
4.4 Overview of sequencing technologies
4.4.1 First generation sequencing
4.4.2 Second generation sequencing technologies
4.4.3 Third generation sequencing technologies
4.5 Sequence data analysis
4.6 Genomic databases
4.7 Genomic sequencing of rare diseases
4.7.1 Large-scale genomic sequencing projects for rare diseases
4.7.2 Genomic sequencing in the clinic
4.8 Outlook
References
5 Recessive diseases and founder genetics
5.1 Introduction
5.2 Autosomal recessive inheritance
5.2.1 The role of consanguinity in recessive diseases
5.2.2 The founder effect
5.2.3 Hardy–Weinberg equilibrium and inbreeding
5.3 Disease gene mapping of autosomal recessive disorders
5.3.1 Homozygosity mapping in consanguineous pedigrees
5.3.2 Genomic sequencing of rare recessive disorders
5.3.3 Genomics of founder populations
5.4 Outlook
References
6 Dominant and sporadic de novo disorders
6.1 Introduction
6.2 Autosomal dominant disorders
6.2.1 Mechanisms of dominant disease
6.2.2 Incomplete penetrance and variable expressivity of dominant disorders
6.3 Sporadic disorders
6.3.1 The human de novo mutation rate
6.3.2 Mechanisms of disease of de novo mutations
6.3.3 Genomic studies of sporadic disorders and identification of de novo mutations
6.4 Outlook
References
7 X-linked and mitochondrial disorders
7.1 Introduction
7.2 X Chromosome disorders
7.2.1 X-linked recessive disorders
7.2.2 X-linked dominant disorders
7.2.3 X chromosome inactivation as a modifier of X-linked disorders
7.3 Mitochondrial disorders
7.4 Outlook
References
8 Mosaicism in rare disease
8.1 Introduction
8.1.1 Rate of somatic mutations
8.2 Strategies/technologies to identify mosaic variation
8.2.1 Clinical observation
8.2.2 Cytogenetics
8.2.3 Array comparative genome hybridization
8.2.4 Single nucleotide polymorphism arrays
8.2.5 Next-generation sequencing to identify mosaic variants
8.2.6 Mosaic disease in ClinVar
8.3 Mosaic aneuploidy in rare disease
8.3.1 Introduction to aneuploidy
8.3.2 General principles of mosaic aneuploidy
8.3.3 Mosaic autosome aneuploidies
8.3.3.1 Trisomy 1 mosaicism
8.3.3.2 Trisomy 2 mosaicism
8.3.3.3 Trisomy 3 mosaicism
8.3.3.4 Trisomy 4 mosaicism
8.3.3.5 Trisomy 5 mosaicism
8.3.3.6 Trisomy 6 mosaicism
8.3.3.7 Trisomy 7 mosaicism
8.3.3.8 Trisomy 8 mosaicism
8.3.3.9 Trisomy 9 mosaicism
8.3.3.10 Trisomy 10 mosaicism
8.3.3.11 Trisomy 11 mosaicism
8.3.3.12 Trisomy 12 mosaicism
8.3.3.13 Trisomy 13 mosaicism
8.3.3.14 Trisomy 14 mosaicism
8.3.3.15 Trisomy 15 mosaicism
8.3.3.16 Trisomy 16 mosaicism
8.3.3.17 Trisomy 17 mosaicism
8.3.3.18 Trisomy 18 mosaicism
8.3.3.19 Trisomy 19 mosaicism
8.3.3.20 Trisomy 20 mosaicism
8.3.3.21 Trisomy 21 mosaicism
8.3.3.22 Trisomy 22 mosaicism
8.3.4 Mosaic disorders of chromosome X
8.3.5 Mosaic disorders of chromosome Y
8.3.6 Variegated mosaic aneuploidy
8.3.7 Ring chromosomes
8.3.8 Mitochondrial genome mosaicism
8.3.9 Mosaic mobile element insertions
8.4 Categories of mosaic variation
8.4.1 Germline mosaicism
8.4.2 Cryptic mosaicism
8.5 Obligate mosaicism in rare disease
8.5.1 GNAS and McCune–Albright syndrome
8.5.2 Proteus syndrome
8.5.3 PIK3CA and CLOVES syndrome
8.5.4 GNAQ and Sturge–Weber syndrome
8.5.5 TSC1, TSC2, and the tuberous sclerosis complex
8.6 Cancer as a series of rare mosaic diseases
8.6.1 Somatic single nucleotide variants in cancer
8.6.2 Clonal evolution and cancer field effect (field cancerization)
8.6.3 Somatic mutations along lines of Blaschko
8.7 Mendelian disorders in mosaic form
8.7.1 Neurofibromatosis type I
8.7.2 Incontinentia pigmenti
8.7.3 Darier–White disease
8.7.4 Hereditary hemorrhagic telangiectasia
8.8 Chimerism
8.9 Outlook
References
9 Multilocus inheritance and variable disease expressivity in rare disease
9.1 Introduction
9.2 Dual molecular diagnoses
9.2.1 Delineating dual molecular diagnoses
9.2.2 Dual molecular diagnoses lead to syndrome disintegration
9.2.3 Agnostic molecular techniques reveal multiple molecular diagnoses
9.2.4 Dissecting dual molecular diagnoses
9.3 Phenotypic expansion
9.3.1 Delineating phenotypic spectrum and expansion
9.3.2 Multiple molecular diagnoses masquerading as phenotypic expansion
9.3.3 Dissection of genotype–phenotype relationships through intrafamilial genotypic and phenotypic variability
9.4 Variable expressivity
9.4.1 Variable expressivity may portend mutational burden
9.5 Incomplete penetrance
9.5.1 Rare + common alleles at a single locus explain some cases of incomplete penetrance
9.5.2 Rare + common alleles at unlinked loci explain some cases of incomplete penetrance
9.6 A comprehensive model: Clan Genomics
References
10 Statistical approaches to rare disease analyses
10.1 Introduction
10.2 Pedigree-based statistical methods
10.2.1 Linkage analysis
10.2.2 Transmission disequilibrium testing
10.3 Association analyses for rare diseases
10.3.1 Rare variant association testing in phenotype ascertained studies
10.3.2 Rare variant association testing of unascertained population-based studies
10.4 Conclusion
References
11 Transcriptomics in rare diseases
11.1 Introduction
11.2 The transcriptome and transcriptomic methodologies
11.3 Transcriptomics in rare diseases
11.3.1 Mechanisms underlying RNA-seq-based genetic diagnoses
11.3.2 Transcriptomic analysis highlights disease modifiers
11.3.3 Tools for transcriptomics analyses in rare disease diagnosis
11.4 Single-cell resolution transcriptomics
11.5 Limitations of using RNA-seq in clinical molecular diagnosis
References
12 Other omics approaches to the study of rare diseases
12.1 Introduction
12.2 Epigenomics
12.2.1 Definition of epigenetics and epigenomics
12.2.2 DNA methylation as the most prominent epigenetic mechanism
12.3 Landscape of epigenomic technologies
12.3.1 DNA methylation
12.3.2 Modification of chromatin states
12.3.3 Alternative methods to dissect chromatin modifications
12.3.4 Single-cell ChIP-seq
12.4 Dissecting chromatin structures
12.4.1 Profiling nucleosome positioning and chromatin accessibility
12.4.2 Evaluating higher-order chromatin architecture
12.5 Epigenomic studies in rare diseases
12.6 Proteomics
12.6.1 Protein arrays for biomarker discovery
12.6.2 Bottom-up and top-down proteomics approaches
12.6.3 Proteomics approaches to study rare diseases
12.7 Metabolomics
12.7.1 Metabolomics workflow and data analysis
12.7.2 Untargeted versus targeted metabolomics
12.7.3 Metabolomics studies in rare diseases
12.8 Outlook
References
13 Challenges and opportunities in rare diseases research
13.1 Introduction
13.2 Challenges in rare diseases research
13.2.1 The N=1 Problem
13.2.2 Underrepresentation of non-European genetic ancestries
13.2.3 Unequal access to genomic initiatives aimed at understanding health and disease
13.2.4 Genetic heterogeneity, clinical variability, and phenotypic expansion of rare diseases
13.2.5 Insufficient knowledge of gene and genome function
13.3 Opportunities in rare diseases research
13.3.1 Insights into novel biology
13.3.2 Therapy development for rare diseases
13.3.3 Implementation of precision medicine and population health
13.3.4 Drug development derived from rare diseases insights
13.4 Outlook
References
Index
Back Cover

Translational and Applied Genomics Series

Genomics of Rare Diseases: Understanding Disease Genetics Using Genomic Approaches

Edited by

Claudia Gonzaga-Jauregui International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Mexico

James R. Lupski The Cullen Professor of Genetics and Genomics and Professor of Pediatrics, Baylor College of Medicine and Texas Children’s Hospital, Houston, TX, United States

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2021 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-820140-4

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Andre Gerhard Wolff
Acquisitions Editor: Peter B. Linsley
Editorial Project Manager: Kristi Anderson
Production Project Manager: Swapna Srinivasan
Cover Designer: Matthew Limbert
Series Editor: George P. Patrinos, Department of Pharmacy, University of Patras School of Health Sciences, Patras, Greece; Department of Pathology, United Arab Emirates University, College of Medicine and Health Sciences, Al-Ain, UAE; United Arab Emirates University, Zayed Center of Health Sciences, Al-Ain, UAE
Typeset by MPS Limited, Chennai, India

Contents

List of contributors
About the editors
Preface
Acknowledgments

CHAPTER 1 Introduction to concepts of genetics and genomics
Karlla Welch Brigatti
1.1 Introduction
1.2 The human genome: structure and function
1.3 Genetic variation
1.4 Nomenclature in human genetics and genomics
1.5 Mendelian patterns of inheritance
1.6 Other modes of inheritance
1.7 Considerations of Mendelian disorders and genetic inheritance
1.8 Conclusion
Further reading

CHAPTER 2 Karyotyping as the first genomic approach
Amy Breman and Paweł Stankiewicz
2.1 Introduction
2.2 Numerical chromosome aberrations
2.2.1 Autosomal aneuploidy
2.2.2 Sex chromosome aneuploidy
2.3 Structural chromosome aberrations
2.3.1 Reciprocal translocations
2.3.2 Paracentric and pericentric inversions
2.3.3 Robertsonian translocations
2.3.4 Deletions and duplications
2.3.5 Intrachromosomal and interchromosomal insertions
2.3.6 Isochromosomes
2.3.7 Marker chromosomes
2.3.8 Complex rearrangements and chromothripsis
2.4 Uniparental disomy and genomic imprinting
2.5 Clinical indications and special considerations for chromosome analysis
References

CHAPTER 3 Genomic disorders in the genomics era
Cinthya J. Zepeda Mendoza and Claudia Gonzaga-Jauregui
3.1 Introduction
3.2 Chromosomal microarray analysis for copy-number variant detection and diagnosis of genomic disorders
3.3 The evolution of next-generation sequencing and bioinformatics for the detection of genomic rearrangements
3.4 Molecular mechanisms of genomic rearrangement generation
3.5 Genomic disorders and next-generation sequencing-based testing in the clinic
3.6 Interpretation of genomic structural and copy-number variants
3.7 Outlook
References

CHAPTER 4 Genomic sequencing of rare diseases
Claudia Gonzaga-Jauregui and Cinthya J. Zepeda Mendoza
4.1 Introduction: a human genome reference sequence
4.2 Sequencing of genomes and exomes
4.3 The process of genomic sequencing
4.3.1 DNA preparation
4.3.2 Library preparation
4.3.3 Sequencing
4.3.4 Data analysis
4.4 Overview of sequencing technologies
4.4.1 First generation sequencing
4.4.2 Second generation sequencing technologies
4.4.3 Third generation sequencing technologies
4.5 Sequence data analysis
4.6 Genomic databases
4.7 Genomic sequencing of rare diseases
4.7.1 Large-scale genomic sequencing projects for rare diseases
4.7.2 Genomic sequencing in the clinic
4.8 Outlook
References

CHAPTER 5 Recessive diseases and founder genetics
Erik G. Puffenberger
5.1 Introduction
5.2 Autosomal recessive inheritance
5.2.1 The role of consanguinity in recessive diseases
5.2.2 The founder effect
5.2.3 Hardy–Weinberg equilibrium and inbreeding
5.3 Disease gene mapping of autosomal recessive disorders
5.3.1 Homozygosity mapping in consanguineous pedigrees
5.3.2 Genomic sequencing of rare recessive disorders
5.3.3 Genomics of founder populations
5.4 Outlook
References

CHAPTER 6 Dominant and sporadic de novo disorders
Claudia Gonzaga-Jauregui, Lauretta El Hayek and Maria Chahrour
6.1 Introduction
6.2 Autosomal dominant disorders
6.2.1 Mechanisms of dominant disease
6.2.2 Incomplete penetrance and variable expressivity of dominant disorders
6.3 Sporadic disorders
6.3.1 The human de novo mutation rate
6.3.2 Mechanisms of disease of de novo mutations
6.3.3 Genomic studies of sporadic disorders and identification of de novo mutations
6.4 Outlook
References

CHAPTER 7 X-linked and mitochondrial disorders
Lauretta El Hayek and Maria Chahrour
7.1 Introduction
7.2 X Chromosome disorders
7.2.1 X-linked recessive disorders
7.2.2 X-linked dominant disorders
7.2.3 X chromosome inactivation as a modifier of X-linked disorders
7.3 Mitochondrial disorders
7.4 Outlook
References

CHAPTER 8 Mosaicism in rare disease
Bracha Erlanger Avigdor, Ikeoluwa A. Osei-Owusu and Jonathan Pevsner
8.1 Introduction
8.1.1 Rate of somatic mutations
8.2 Strategies/technologies to identify mosaic variation
8.2.1 Clinical observation
8.2.2 Cytogenetics
8.2.3 Array comparative genome hybridization
8.2.4 Single nucleotide polymorphism arrays
8.2.5 Next-generation sequencing to identify mosaic variants
8.2.6 Mosaic disease in ClinVar
8.3 Mosaic aneuploidy in rare disease
8.3.1 Introduction to aneuploidy
8.3.2 General principles of mosaic aneuploidy
8.3.3 Mosaic autosome aneuploidies
8.3.4 Mosaic disorders of chromosome X
8.3.5 Mosaic disorders of chromosome Y
8.3.6 Variegated mosaic aneuploidy
8.3.7 Ring chromosomes
8.3.8 Mitochondrial genome mosaicism
8.3.9 Mosaic mobile element insertions
8.4 Categories of mosaic variation
8.4.1 Germline mosaicism
8.4.2 Cryptic mosaicism
8.5 Obligate mosaicism in rare disease
8.5.1 GNAS and McCune–Albright syndrome
8.5.2 Proteus syndrome
8.5.3 PIK3CA and CLOVES syndrome
8.5.4 GNAQ and Sturge–Weber syndrome
8.5.5 TSC1, TSC2, and the tuberous sclerosis complex
8.6 Cancer as a series of rare mosaic diseases
8.6.1 Somatic single nucleotide variants in cancer
8.6.2 Clonal evolution and cancer field effect (field cancerization)
8.6.3 Somatic mutations along lines of Blaschko
8.7 Mendelian disorders in mosaic form
8.7.1 Neurofibromatosis type I
8.7.2 Incontinentia pigmenti
8.7.3 Darier–White disease
8.7.4 Hereditary hemorrhagic telangiectasia
8.8 Chimerism
8.9 Outlook
References

CHAPTER 9 Multilocus inheritance and variable disease expressivity in rare disease
Jennifer E. Posey
9.1 Introduction
9.2 Dual molecular diagnoses
9.2.1 Delineating dual molecular diagnoses
9.2.2 Dual molecular diagnoses lead to syndrome disintegration
9.2.3 Agnostic molecular techniques reveal multiple molecular diagnoses
9.2.4 Dissecting dual molecular diagnoses
9.3 Phenotypic expansion
9.3.1 Delineating phenotypic spectrum and expansion
9.3.2 Multiple molecular diagnoses masquerading as phenotypic expansion
9.3.3 Dissection of genotype–phenotype relationships through intrafamilial genotypic and phenotypic variability
9.4 Variable expressivity
9.4.1 Variable expressivity may portend mutational burden
9.5 Incomplete penetrance
9.5.1 Rare + common alleles at a single locus explain some cases of incomplete penetrance
9.5.2 Rare + common alleles at unlinked loci explain some cases of incomplete penetrance
9.6 A comprehensive model: Clan Genomics
References

CHAPTER 10 Statistical approaches to rare disease analyses
Cristopher V. Van Hout
10.1 Introduction
10.2 Pedigree-based statistical methods
10.2.1 Linkage analysis
10.2.2 Transmission disequilibrium testing
10.3 Association analyses for rare diseases
10.3.1 Rare variant association testing in phenotype ascertained studies
10.3.2 Rare variant association testing of unascertained population-based studies
10.4 Conclusion
References

CHAPTER 11 Transcriptomics in rare diseases
Maria Kousi
11.1 Introduction
11.2 The transcriptome and transcriptomic methodologies
11.3 Transcriptomics in rare diseases
11.3.1 Mechanisms underlying RNA-seq-based genetic diagnoses
11.3.2 Transcriptomic analysis highlights disease modifiers
11.3.3 Tools for transcriptomics analyses in rare disease diagnosis
11.4 Single-cell resolution transcriptomics
11.5 Limitations of using RNA-seq in clinical molecular diagnosis
References

CHAPTER 12 Other omics approaches to the study of rare diseases
Giusy Della Gatta
12.1 Introduction
12.2 Epigenomics
12.2.1 Definition of epigenetics and epigenomics
12.2.2 DNA methylation as the most prominent epigenetic mechanism
12.3 Landscape of epigenomic technologies
12.3.1 DNA methylation
12.3.2 Modification of chromatin states
12.3.3 Alternative methods to dissect chromatin modifications
12.3.4 Single-cell ChIP-seq
12.4 Dissecting chromatin structures
12.4.1 Profiling nucleosome positioning and chromatin accessibility
12.4.2 Evaluating higher-order chromatin architecture
12.5 Epigenomic studies in rare diseases
12.6 Proteomics
12.6.1 Protein arrays for biomarker discovery
12.6.2 Bottom-up and top-down proteomics approaches
12.6.3 Proteomics approaches to study rare diseases
12.7 Metabolomics
12.7.1 Metabolomics workflow and data analysis
12.7.2 Untargeted versus targeted metabolomics
12.7.3 Metabolomics studies in rare diseases
12.8 Outlook
References

CHAPTER 13 Challenges and opportunities in rare diseases research
Claudia Gonzaga-Jauregui
13.1 Introduction
13.2 Challenges in rare diseases research
13.2.1 The N = 1 Problem
13.2.2 Underrepresentation of non-European genetic ancestries
13.2.3 Unequal access to genomic initiatives aimed at understanding health and disease
13.2.4 Genetic heterogeneity, clinical variability, and phenotypic expansion of rare diseases
13.2.5 Insufficient knowledge of gene and genome function
13.3 Opportunities in rare diseases research
13.3.1 Insights into novel biology
13.3.2 Therapy development for rare diseases
13.3.3 Implementation of precision medicine and population health
13.3.4 Drug development derived from rare diseases insights
13.4 Outlook
References
Index


List of contributors

Amy Breman, PhD
Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States

Karlla Welch Brigatti, MS, CGC
Clinic for Special Children, Strasburg, PA, United States

Maria Chahrour, PhD
Eugene McDermott Center for Human Growth and Development, Department of Neuroscience, Center for the Genetics of Host Defense, Department of Psychiatry, Peter O’Donnell Jr. Brain Institute, University of Texas Southwestern Medical Center, Dallas, TX, United States

Giusy Della Gatta, PhD
Regeneron Genetics Center, Regeneron Pharmaceuticals, Tarrytown, NY, United States

Bracha Erlanger Avigdor, PhD
Department of Neurology, Kennedy Krieger Institute, Baltimore, MD, United States

Lauretta El Hayek, MS
Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, United States

Claudia Gonzaga-Jauregui, PhD
International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano (LIIGH), Universidad Nacional Autónoma de México (UNAM), Mexico

Maria Kousi, PhD
CSAIL, Massachusetts Institute of Technology, Cambridge, MA, United States; The Broad Institute of Harvard and MIT, Cambridge, MA, United States

Ikeoluwa A. Osei-Owusu, PhD
Department of Neurology, Kennedy Krieger Institute, Baltimore, MD, United States; Department of Human Genetics, Johns Hopkins School of Medicine, Baltimore, MD, United States

Jonathan Pevsner, PhD
Department of Neurology, Kennedy Krieger Institute, Baltimore, MD, United States; Department of Human Genetics, Johns Hopkins School of Medicine, Baltimore, MD, United States; Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, United States

Jennifer E. Posey, MD, PhD
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States

Erik G. Puffenberger, PhD
Clinic for Special Children, Strasburg, PA, United States

Paweł Stankiewicz, MD, PhD
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States

Cristopher V. Van Hout, PhD
International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano (LIIGH), Universidad Nacional Autónoma de México (UNAM), Mexico

Cinthya J. Zepeda Mendoza, PhD
Cytogenetics and Genomic Microarray, ARUP Laboratories, Salt Lake City, UT, United States; Department of Pathology, University of Utah, Salt Lake City, UT, United States

About the editors

Claudia Gonzaga-Jauregui, PhD
Claudia Gonzaga-Jauregui grew up in Mexico, where she did her undergraduate studies in Genomic Sciences at the National Autonomous University of Mexico (UNAM). She performed undergraduate research training at UNAM and the Weizmann Institute of Science in Israel. Claudia obtained her PhD in Molecular and Human Genetics from the Baylor College of Medicine in Houston, Texas, where she contributed to large population genomic studies such as HapMap 3 and pioneered the analyses of genomic sequence data of personal genomes and exomes for the identification of disease genes and molecular diagnoses. Since then, she has led large-scale projects for genomic analyses of patients and families with rare and undiagnosed genetic disorders in academia and industry. Dr. Gonzaga-Jauregui’s research focuses on the investigation of human pathogenic and polymorphic genomic variation that contributes to human traits and diseases, leveraging family-based analyses of rare and common genetic disorders to better understand disease mechanisms and pathophysiology. She believes that the application and understanding of human genetics and genomics can lead to improved treatments and the realization of precision genomic medicine. Because the beginning of precision medicine is an accurate genetic diagnosis, throughout her career Dr. Gonzaga-Jauregui has identified numerous novel genetic disorders and studied known and novel disease-associated variation responsible for rare genetic conditions to provide molecular answers to patients with undiagnosed diseases.

James R. Lupski, MD, PhD, DSc (hon)
Jim Lupski is Cullen Professor of Genetics and Genomics and Professor of Pediatrics at the Baylor College of Medicine and Texas Children’s Hospital in Houston, Texas. He received his initial scientific training at Cold Spring Harbor Laboratory (CSHL) as an Undergraduate Research Participant (URP) and at New York University, receiving his undergraduate degree in chemistry and biology (1979) and completing the MD/PhD program in 1985. In 1986 he moved to Houston, TX, for clinical training in pediatrics (1986–89) and medical genetics (1989–92), and established his laboratory at the Baylor College of Medicine. He is an elected member of AAAS (1996), ASCI (1998), IOM/NAM (2002), and the American Academy of Arts and Sciences (2013). For his work in human genomics and the elucidation of genomic disorders, he received a DSc Honoris Causa in 2011 from the Watson School of Biological Sciences at CSHL. He has coauthored more than 800 scientific publications, has coedited 3 books including the definitive text on genomic disorders, is a coinventor on more than a dozen US and European patents, and has delivered over 536 invited lectures in 38 countries. Jim is board certified in Clinical Genetics and Clinical Molecular Genetics from the American Board of Medical Genetics and Genomics.


Jim’s research studies focus on trying to understand mutational mechanisms and linking specific genomic mutations and gene variants to human disease. As a pioneer in personal genome sequencing, he documented the tremendous variation in both single nucleotide (SNV) and copy number variants (CNV) in personal genomes. His contributions have advanced whole-genome and exome sequencing to gain mechanistic insights into disease origins, initially on his own genome, but also toward clinical genomics implementation to understand and manage the treatment of individual patients. He proposed the Clan Genomics hypothesis highlighting the roles of new mutation and rare variant contribution to disease. In summary, as a physician-scientist and leader in the fields of human genetics, mutagenesis, and clinical genomics, he has substantially defined the field of genomic medicine and his lab’s work has helped to foster precision medicine.

Preface

“Nature is nowhere accustomed more openly to display her secret mysteries than in cases where she shows traces of her workings apart from the beaten path; nor is there any better way to advance the proper practice of medicine than to give our minds to the discovery of the usual law of nature by the careful investigation of cases of rarer forms of disease. For it has been found in almost all things, that what they contain of useful or of applicable nature, is hardly perceived unless we are deprived of them, or they become deranged in some way.”
William Harvey (1578–1657)

It is challenging to anticipate what might materialize with the introduction of a disruptive technology. It is perhaps impossible to fathom all the possibilities when a multitude of disruptive technologies, including computer science, the world wide web, and massively parallel DNA sequencing, all converge and enable a personal genome [1]. Suffice it to say that the ability to characterize genome variation in a human has the capability to inform human biology, evolution, development, genetics, genomics, clinical practice, and human health. The first two years of this decade, 2020 and 2021, mark the 30th anniversary of the launch of the Human Genome Project and the 20th anniversary of the publication of the first draft of the human genome reference sequence [2], respectively. Perhaps foretelling the pace of the new discipline of human genomics, the Human Genome Project, originally projected to take 15 years, accomplished the generation of a first draft of the sequence containing the blueprint for the human species in only 10 years. The decade following the initial publication of the reference sequence and its later refinement also witnessed a meteoric advancement in our understanding of the genes encoded in the genome, its architecture, its variation, and its relation with human disease. It has now been a decade since the initial “clinical genomes” appeared [3,4], demonstrating the effectiveness of genomic sequencing for molecular diagnosis and patient management. Currently, massive amounts of genomic and variant data are being generated every day for clinical genomics and human genomics research [5]. The challenge remains to interpret and contextualize these variant data to provide accurate molecular diagnoses, assist physicians to narrow differential diagnoses, inform prognosis and counseling, identify therapies, and improve overall patient health.
With the continued almost logarithmic pace of “disease gene discovery” [6], increasing ability to identify and resolve variants of all types [7], and better appreciation by the clinical communities and stakeholders of the role of genomics in medicine [8], genomics is destined now more than ever to become an integral part of medicine and clinical practice with far-reaching implications for the understanding of human disease and health. In this context, the Editors sought to synthesize and structure an “actionable book” that was both conceptual and practical, written by genomic scientists, physician-scientists, and genetic counselors actively engaged in and contributing to the field of human genetics and genomics. Much of the information content embodied in this book has not yet made its way into any textbooks or other teaching tools as a coherent and contextualized body of knowledge about the study of human genetic disorders from discovery to therapy development. Moreover, it can be challenging to cull such knowledge from the world’s literature, even with the power of electronic searches, for a field as dynamic, prolific, and rapidly evolving as human genetics and genomics. Nevertheless, the path forward and the endless opportunities to learn and improve as a field that we have tried to capture in this book, including the need for expanding access and increasing diversity and equity in genomics, will likely remain as relevant for those entering science and medicine in the future as they are for all of us today. Suffice it to say that we are at an unprecedented time in human history, when the disciplines of inquiry and scientific fields of genetics and genomics merge in the study of humans and human biology. Moreover, disease as we know it, which may be thought of from an informed viewpoint as a perturbation of biological homeostasis, has never been more amenable to experimental science than through the harnessing of human genetics and genomics. The insights derived from rare disease studies and rare variant family-based genomic approaches to understand an individual’s disease process, and the specific disease trait associated with pathogenic variation at a given locus, are applicable to our broader understanding of human disease and incredibly revealing for all stakeholders involved, including the individuals affected by these conditions [9], their families, physicians caring for patients, and researchers in all areas of genetics, genomics, and other clinical and biological sciences. In some respects, the convergence of genetics and genomics is perhaps making Homo sapiens the model organism of choice for experimental studies. In doing so, we are learning more about our human biology and physiology in health and disease states and also about the evolution of our species and the evolution of other species’ genomes, in the context of the rich biological diversity of our planet.
Integrated models of genome variation, evolution, and phenotypic consequence, such as the clan genomics hypothesis [10,11] can now be tested and expanded leveraging genomic information for thousands of individuals, forging a direct link between the theory of evolution and both medicine and health. This book should act as a resource to all stakeholders in the human genome, that is, all of us, including research participants, research scientists, physician investigators, and all humans studying or caring for other humans and invested in realizing the promise of precision medicine. Claudia Gonzaga-Jauregui and James R. Lupski

References
[1] Gonzaga-Jauregui C, Lupski JR, Gibbs RA. Human genome sequencing in health and disease. Annu Rev Med 2012;63:35–61.
[2] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 2001;409(6822):860–921.
[3] Bainbridge MN, Wiszniewski W, Murdock DR, Friedman J, Gonzaga-Jauregui C, Newsham I, et al. Whole-genome sequencing for optimized patient management. Sci Transl Med 2011;3(87):87re3.
[4] Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, et al. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med 2010;362(13):1181–91.
[5] Lupski JR. Clinical genomics: from a truly personal genome viewpoint. Hum Genet 2016;135(6):591–601.
[6] Liu P, Meng L, Normand EA, Xia F, Song X, et al. Reanalysis of clinical exome sequencing data. N Engl J Med 2019;380(25):2478–80.
[7] Lupski JR, Liu P, Stankiewicz P, Carvalho CMB, Posey JE. Clinical genomics and contextualizing genome variation in the diagnostic laboratory. Expert Rev Mol Diagn 2020;20(10):995–1002.
[8] Stankiewicz P, Lupski JR. The genomic basis of medicine. In: Firth J, Conlon C, Cox T, editors. Oxford textbook of medicine. 6th ed. Oxford University Press; 2020. p. 149.
[9] Lupski JR. A human in human genetics. Cell 2019;177(1):9–15.
[10] Gonzaga-Jauregui C, Yesil G, Nistala H, Gezdirici A, Bayram Y, et al. Functional biology of the Steel syndrome founder allele and evidence for clan genomics derivation of COL27A1 pathogenic alleles worldwide. Eur J Hum Genet 2020;28(9):1243–64.
[11] Lupski JR, Belmont JW, Boerwinkle E, Gibbs RA. Clan genomics and the complex architecture of human disease. Cell 2011;147(1):32–43.


Acknowledgments

The editors and authors of “Genomics of Rare Diseases: Understanding Disease Genetics Using Genomic Approaches” wish to thank the patients and families living with rare genetic disorders who over decades have participated in human genetics research. Their contributions have enabled a better understanding of disease genetics and human biology for the benefit of patients and humans around the world. Author Paweł Stankiewicz was supported by the US National Institutes of Health Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD) grant R01HD087292. Authors Bracha Erlanger Avigdor, Ikeoluwa A. Osei-Owusu, and Jonathan Pevsner wish to thank Sanjana Tyagi for support, members of the Brain Somatic Mosaicism Network (BSMN) for helpful discussions, and N. Varg for comments on the Chapter 8 manuscript. Jonathan Pevsner was supported by the Intellectual and Developmental Disabilities Research Center (Grant U54 HD079123) from the Eunice Kennedy Shriver National Institute of Child Health and Human Development and by a grant from the National Institute of Mental Health (U01 MH106884) as part of the BSMN. Ikeoluwa Osei-Owusu was supported by NIH grant R36 MH118005.


CHAPTER 1 Introduction to concepts of genetics and genomics

Karlla Welch Brigatti
Clinic for Special Children, Strasburg, PA, United States

1.1 Introduction
Genetics is broadly considered to be the study of biological inheritance and traits, whereas the totality of the genetic information of an organism is known as the genome. The large-scale study of the information contained in the genome is referred to as genomics. While genomic studies began with the sequencing of the whole genome of the bacterium Haemophilus influenzae in 1995, the completion of the human genome reference sequence by the Human Genome Project (HGP) in 2003 (see Chapter 4: Genomic Sequencing of Rare Diseases) and the continued advances in molecular biology, biochemistry, biophysics, computational sciences, and biotechnology ushered in a new discipline of study, human genomics. The breathtaking advancements of human genomics in the last decades have allowed scientists to understand the human genome, and its variation, to a precise and unprecedented degree, further enabling the application of this knowledge in clinical genomics. Bioinformatics is the interdisciplinary field of biology and computer science that analyzes complex genomic information to interpret biological variant data and predict gene function or dysfunction.

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00009-0
© 2021 Elsevier Inc. All rights reserved.

1.2 The human genome: structure and function
With the exception of mature red blood cells and cornified hair cells, all cells in the human body contain a nucleus that houses the majority of the human genome (nuclear genome). A much smaller genome can also be found elsewhere in the cell within the mitochondria (mitochondrial genome), the organelles responsible for producing the energy needed for cell function. Disruptions, or pathogenic variation, to either genome can lead to human disease.
The genetic information of the human genome is maintained as deoxyribonucleic acid (DNA), a double-stranded macromolecule bound together in stable form as a double helix. Each DNA strand is made of a sugar and phosphate backbone coupled to a sequence of bases in one of four versions: adenine (A), cytosine (C), guanine (G), and thymine (T). These bases pair with one another across the two strands by hydrogen bonds in a prescribed Watson–Crick complementary base-pairing fashion: guanine on one strand pairs with cytosine on the other, as do adenine and thymine. This strict pairing of bases makes the sequence of one strand represent the reverse complement of the sequence on the opposite strand. Approximately six billion base pairs make up the diploid nuclear genome, which consists of two sets of chromosomes and genes and is consequently double the three billion base pairs that comprise the haploid reference human genome sequence (see Chapter 4: Genomic Sequencing of Rare Diseases).
Long stretches of DNA are organized, supported, and packaged into 46 rod-shaped structures called chromosomes within the nucleus of the cell, arranged in 23 homologous pairs of matching DNA sequence. Twenty-two of these pairs are named and numbered from 1 to 22 according to size and DNA content, with chromosome 1 being the largest and chromosomes 21 and 22 the smallest. These 22 pairs of chromosomes are called autosomes and are the same in males and females. The 23rd pair makes up the sex chromosomes: two X chromosomes are found in biologically female individuals, while an X and a Y chromosome are found in biologically male individuals. The study of chromosomes, their structure, and their inheritance is known as cytogenetics. The chromosomal complement in a given cell is the karyotype, which also refers to the photographic arrangement of the magnified chromosome pairs after specific preparation and under certain staining conditions. Karyotyping and its role in rare disease discovery and diagnosis are explored further in Chapter 2, Karyotyping as the First Genomic Approach.
The genome is characterized by long stretches of noncoding DNA sequences interspersed with smaller stretches of sequence (coding DNA) that make up genes. Genes contain the instructions to direct the production of proteins or ribonucleic acid (RNA) products necessary for cells to perform their given function (Fig. 1.1). The human genome contains about 25,000 genes.
Genes vary in length, but they share similar characteristics that relate to their function and help differentiate them from the surrounding noncoding sequence, which makes it possible to annotate and map them computationally on the genome. Structurally, genes are composed of different regions and recognizable sequence features. Exons are the regions of a gene that determine the amino acid sequence of its protein product. Conversely, introns are regions of noncoding sequence separating exons from one another that are eliminated from the mature messenger RNA (mRNA) after transcription. In addition, a sequence of noncoding DNA known as the promoter can be found adjacent to the beginning of a gene (classically defined as the 5′ end) and acts as the region where certain regulatory proteins bind sequence elements or motifs to enhance gene expression or silence it altogether. Alterations to the canonical DNA sequence, in the form of mutations or variants, in any of these structural elements of genes can disrupt normal function and expression of the gene, leading to human disease. The majority of variants currently associated with genetic conditions are found in the exons, which make up only about 1% of the haploid human genome and are maintained or constrained by selection and evolution; the aggregate sum of all exons is known as the exome. Following the HGP, the development of massively parallel sequencing technologies, also known as next-generation sequencing (NGS), enabled the rapid sequencing of millions of short DNA fragments in parallel, significantly reducing the cost of sequencing individual human genomes.
The main applications of NGS in rare disease have focused on sequencing the protein-coding fraction of the genome through whole-exome sequencing (WES), together with whole-genome sequencing (WGS), which involves sequencing the totality of the DNA in the human genome, including the non-protein-coding regions; these technologies are explored in depth in Chapter 4, Genomic Sequencing of Rare Diseases. Both techniques are commonly used in genetics research and clinical genetics settings to identify rare variants and investigate their potential contribution to the clinical presentation of the disorder under investigation, thereby rendering a molecular diagnosis.
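The Watson–Crick pairing rule introduced at the start of this section can be made concrete with a short sketch. The function name `reverse_complement` and the example sequence are illustrative assumptions, not anything defined in this chapter; the code simply swaps each base for its partner and reverses the result so that the opposite strand is also read 5′ to 3′.

```python
# Watson-Crick pairing: A pairs with T, and G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite DNA strand, written 5' to 3'."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence may contain only A, C, G, and T")
    # Complement every base, then reverse the string.
    return seq.translate(COMPLEMENT)[::-1]


top_strand = "ATGGCCTTA"  # hypothetical example sequence
print(reverse_complement(top_strand))  # TAAGGCCAT
```

Applying the function twice returns the original sequence, which is one quick way to check that the two strands carry the same information.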


FIGURE 1.1 Gene expression through transcription and translation, simplified for a hypothetical gene containing four exons and three introns. During transcription in the nucleus, the DNA sequence of a gene is used as a template to produce a pre-mRNA transcript that includes introns and exons. The four bases of DNA are shown in exon 2. In mature mRNA, the introns are spliced out, such that the coding sequence is continuous. The mRNA moves to the cytoplasm for translation, where ribosomes attach to the mRNA template and protein synthesis occurs. tRNA molecules carrying specific amino acids bind to the mRNA as determined by the sequence of mRNA codons, groups of three mRNA bases that each correspond to one of 20 amino acids or to one of three stop codons. Peptide bonds form between successive amino acids in the growing chain until a stop codon is reached and the polypeptide is released. The polypeptide chain undergoes folding and posttranslational modifications to become a functional protein. DNA, Deoxyribonucleic acid; mRNA, messenger RNA; tRNA, transfer RNA.


The flow of genetic information from DNA to RNA to protein product is known as the “central dogma” of molecular biology. It can be predicted thanks to the elucidation of the genetic code, which establishes the rules of translation, via a three-base (triplet) code, from DNA sequence to the amino acid composition of proteins. RNA is the intermediary through which the genetic information stored in DNA is expressed and delivered to the cell machinery that produces bioactive molecules in the form of proteins or noncoding RNA. RNA is similar to DNA, except that the sugar in its backbone is ribose, the thymine base is replaced by uracil (U), and RNA is single-stranded rather than double-stranded like the DNA double helix. When the product of a particular gene is needed, the portion of DNA containing the gene unwinds and, through a process known as transcription, a single strand of complementary RNA is generated; the intronic sequences are then spliced out to produce a mature mRNA. Transcription takes place in the nucleus, where the DNA resides. The mRNA then moves from the nucleus to the cytoplasm, where organelles known as ribosomes use the mRNA template for protein synthesis through a process known as translation. During translation, the ribosome moves along the mRNA strand and pairs the mRNA template with a second type of RNA known as transfer RNA (tRNA), which delivers specific amino acids as determined by three consecutive mRNA bases (known as codons), each of which encodes one of 20 possible amino acids (Table 1.1). The genetic code is said to be “degenerate” in that most amino acids are encoded by more than one codon. The standard start codon for translation of a gene is “AUG,” which encodes the amino acid methionine (Met or M) and establishes the reading frame for the ribosome to follow as it adds the corresponding amino acids to a polypeptide chain.
The translation complex halts the process of protein production once it reaches a stop codon (encoded by one of the three codons UAA, UAG, or UGA), and the completed polypeptide is released from the ribosome for posttranslational modification. This process is illustrated in Fig. 1.1. As previously mentioned, in addition to the genome housed in the nucleus of cells (nuclear genome), human cells also contain another smaller genome that resides within the energy-producing organelles of the cells, the mitochondria. The mitochondrial genome (mtDNA) is made up of a little over 16,000 DNA bases arranged in a circle that contains two related promoter sequences, one for each strand, which are transcribed in their entirety. All cells contain multiple mitochondria, each of which has several copies of their mitochondrial genome. The 37 genes encoded by the mtDNA are specific to the structure and function of the mitochondria itself, which are integral to the production of cellular energy. Unlike the nuclear genome, the mitochondrial genome is inherited only through the maternal line, as the sperm cell contributes no mitochondria during conception. A change in the mtDNA that alters the production of proteins necessary to meet the energy requirements of the cell can cause mitochondrial disease, often affecting the organs with high energy requirements, such as the brain, heart, eyes, and skeletal muscles. Nuclear genes also contribute to mitochondrial function, so mitochondrial disorders can result from alterations in nuclear or mitochondrial genes. Mitochondrial disorders are examined in depth in Chapter 7, X-linked and Mitochondrial Disorders. Replication of nuclear DNA occurs during cell division of somatic cells in a process known as mitosis, in which two genetically identical daughter cells are produced from the original parent cell, and which maintains the diploid (46) chromosomal content. Meiosis is the biological process of germ cell production. 
It is specific to the cells of the reproductive system and results in four haploid gametes (23 chromosomes each) that are genetically distinct from one another and from the parent cell, due to the process of meiotic recombination. During fertilization, the egg and sperm join together and the full chromosomal complement is restored. The biological significance of these two processes is that together they ensure the constancy of genetic information from one generation to the next while promoting genetic diversity.
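The transcription and translation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a model of the biology: it assumes an intron-free coding strand, ignores splicing and untranslated-region signals, and simply starts at the first AUG. The function names (`transcribe`, `translate`) and the example sequence are hypothetical; the codon table is the standard genetic code written in its usual compact U, C, A, G order, with one-letter amino acid symbols and `*` for stop codons.

```python
from itertools import product

# Standard genetic code in compact form: codons enumerated in U, C, A, G order
# (third base varying fastest), amino acids as one-letter symbols, "*" = stop.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a coding (sense) DNA strand: same sequence, U for T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no reading frame is established
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG, or UGA halts translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example sequence: ...ATG GCC TGG TAA...
mrna = transcribe("GGCATGGCCTGGTAAGC")
print(mrna, translate(mrna))

# Degeneracy of the code: several codons map to the same amino acid.
leucine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "L")
print(len(leucine_codons), leucine_codons)
```

Building the table from the compact string keeps the sketch short; a real analysis would normally rely on an established library rather than a hand-written table.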


Table 1.1 The genetic code determines the translation of the DNA sequence encoded in genes into the corresponding sequence of amino acids to produce proteins.

During the replication of DNA, either in mitosis or meiosis, changes in the DNA can occur, known as mutations. Several cellular proofreading and repair processes exist to ensure the integrity of the nuclear genome and fidelity of the code, though changes can sometimes escape detection. Certain exposures, such as ionizing radiation, can also increase the rate of mutation. Additionally, other biological factors in humans may contribute to an increased rate of mutation in offspring, such as advanced maternal age for chromosome aneuploidies like trisomy 21, commonly known as Down syndrome (see Chapter 2: Karyotyping as the First Genomic Approach), and advanced paternal age for single gene defects (see Chapter 6: Dominant and Sporadic De Novo Disorders).

1.3 Genetic variation

DNA sequence variation is a constant feature of both germ and somatic cells and can occur on a scale varying from single DNA nucleotide changes to deletions or duplications of entire


chromosomes. The genetic information and variation encoded in the DNA, combined with environmental influences, determine individual characteristics and susceptibility to disease, and together make up the clinical characteristics or traits known as phenotype. The effect of DNA variation on gene expression and ultimately phenotype often depends on where the change occurs. For example, changes that occur in genes can ultimately alter proteins, whereas DNA alterations in the noncoding regions of the genome tend to have no obvious or strong effect on cellular function. Such silent or subtle changes are generally considered to be benign polymorphisms. Some of these benign polymorphisms can also occur in coding sequences of genes, but if they do not confer a deleterious effect on the cell, they can be passed on through generations, contributing to common variation in the human population and to traits such as hair, skin, or eye color. Sequence variants that change one nucleotide in the DNA sequence and that differ between individuals, or even between humans and other species, are known as single-nucleotide polymorphisms (SNPs) and occur quite frequently across the human genome (see Chapter 4: Genomic Sequencing of Rare Diseases). SNPs have been extensively studied in association with disease and drug response. When certain DNA changes occur in introns, exons, or promoters, or span entire genes or chromosomes, they can abolish or alter the normal function of the encoded proteins and consequently exert a profound phenotypic effect. These changes are often referred to as deleterious variants or mutations, depending on whether they have been observed in other individuals in the population or occurred as a new event in a given person, respectively. Single base-pair mutations (also referred to as single or simple nucleotide variants, SNVs) within an exon can alter the coding sequence, as illustrated in Fig. 1.2. 
Synonymous variants, also called silent mutations, occur when the single base-pair substitution maintains the same amino acid, due to the degenerate nature of the genetic code, and consequently do not alter the final protein product. Nonsynonymous or missense mutations cause a codon change from one amino acid to another. A missense mutation may not exert a strong phenotypic effect if the new amino acid shares similar physicochemical properties with the original conserved amino acid at that position or occurs at a nonessential site along the protein. Other missense mutations alter the protein configuration or enzymatic function and may introduce novel properties that exert a deleterious effect, or disrupt an important existing property, rendering the protein inefficient or nonfunctional. Nonsense mutations result from substitutions that introduce a premature stop codon in the mRNA sequence. If the nonsense mutation occurs early in the transcribed mRNA, the cell can identify the abnormal location of the stop codon and dispose of the defective transcript through a mechanism known as nonsense-mediated decay (NMD), which leads to the destruction of the mRNA and the absence of a protein product, resulting in a loss-of-function (LoF). Conversely, if the nonsense mutation occurs toward the end of the transcript, the mRNA can escape NMD and go on to be translated into a truncated and nonfunctional protein product. In some instances, this truncated form of the protein, although unable to perform normal biological functions, can act as a toxic protein product that interferes with other proteins it may interact with, causing disease through a dominant negative effect (see Chapter 6: Dominant and Sporadic De Novo Disorders). Frameshift mutations are caused by the insertion or deletion (indel) of a number of nucleotides that is not divisible by three, typically one to a few. 
This disrupts the reading frame such that all ensuing sequence is read incorrectly and improper amino acids are incorporated during translation from the location where the indel occurred. A frameshift mutation may also introduce a premature stop codon, resulting in a nonfunctional and truncated protein product or leading to degradation of the mRNA through NMD.


FIGURE 1.2 Types of mutations that can occur to the reference sequence of nucleotide base pairs and their effect on the resulting protein.
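The variant classes discussed in this section (synonymous, missense, nonsense, frameshift, and in-frame indels) can be told apart with a small amount of code. The sketch below is only an illustration under simplifying assumptions: it compares a reference and an altered coding sequence that both begin at the start codon, it does not model splice site variants, stop-loss variants, or NMD, and the function names (`translate`, `classify`) and the 12-base example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code on the DNA alphabet (T, C, A, G order), "*" = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence from its first base up to and including a stop."""
    protein = ""
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        protein += amino_acid
        if amino_acid == "*":
            break
    return protein


def classify(ref_cds: str, alt_cds: str) -> str:
    """Label the predicted coding effect of replacing ref_cds with alt_cds."""
    length_change = len(alt_cds) - len(ref_cds)
    if length_change % 3 != 0:
        return "frameshift"       # indel length not divisible by three
    if length_change != 0:
        return "in-frame indel"   # whole codons gained or lost
    ref_protein, alt_protein = translate(ref_cds), translate(alt_cds)
    if ref_protein == alt_protein:
        return "synonymous"       # same amino acids despite the base change
    if len(alt_protein) < len(ref_protein):
        return "nonsense"         # premature stop codon shortens the protein
    return "missense"             # one amino acid replaced by another


reference = "ATGGGACGATAA"  # hypothetical: Met-Gly-Arg-stop
examples = {
    "ATGGGCCGATAA": "GGA>GGC keeps Gly",
    "ATGCGACGATAA": "GGA>CGA changes Gly to Arg",
    "ATGGGATGATAA": "CGA>TGA creates a stop",
    "ATGGACGATAA": "one base deleted",
    "ATGCGATAA": "one codon deleted",
}
for alt, note in examples.items():
    print(f"{note}: {classify(reference, alt)}")
```

Checking the length change first mirrors the order of reasoning in the text: whether the reading frame is preserved determines which of the finer distinctions are even meaningful.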

When nucleotides are inserted or deleted in the DNA sequence by multiples of three, the reading frame is conserved, although amino acids may be added or missing from the final protein product; these mutations are known as nonframeshifting or in-frame mutations. The addition or deletion of in-frame amino acids can sometimes occur in regions of the protein that are important for proper function or affect amino acids essential for enzymatic or catalytic functions. Lastly, splice site mutations occur at the junctions between exons and introns and may cause exons to be removed or intronic sequence to remain in the mature mRNA, altering the amino acid sequence and exerting a functional effect on the gene product. Copy number variants (CNVs) are a class of structural variation (SV), meaning variation that modifies the architecture of the genome, involving alterations in the number of copies of specific regions of DNA, which can either be deleted or duplicated (see Chapter 3: Genomic Disorders in the Genomics Era). These involve large stretches of DNA varying from thousands of base pairs to


segments or entire chromosomes (chromosomal aneuploidy; see Chapter 2: Karyotyping as the First Genomic Approach). Some large CNVs do not have any impact on gene function, while some small ones can exert a strong effect by removing sections of a coding gene or altering the expression or dosage of a given gene. While large CNVs can be evident on a karyotype, changes smaller than 3-5 Mb are below the resolution of chromosome studies; therefore the most common and precise technique in use for identifying submicroscopic CNVs is chromosomal microarray analysis (CMA). As discussed in depth in Chapter 3, Genomic Disorders in the Genomics Era, CMA will not identify balanced rearrangements of genetic material, such as balanced translocations, where different chromosomal segments are joined together. Intrachromosomal submicroscopic inversions, although copy number neutral, can alter the normal expression of genes or disrupt those located at the breakpoints of the genomic rearrangement. Even if a balanced translocation maintains the full genetic complement, it can lead to abnormalities in copy number during meiosis and introduce CNVs in the gametes. Implementation of genomic sequencing technologies is allowing better detection and characterization of CNVs and SV in human genomes.

1.4 Nomenclature in human genetics and genomics

The Human Genome Variation Society (www.hgvs.org) maintains the standards for consistent nomenclature for the description of sequence variants. Human genes are named using symbols designated by the HUGO Gene Nomenclature Committee (HGNC) and are generally capitalized and italicized in print (e.g., SMN1 is the name for the survival of motor neuron 1 gene), while the protein product of the gene uses a nonitalicized symbol (e.g., SMN1 is the survival of motor neuron 1 protein). Various symbols and abbreviations are used to refer to designated variants or changes to the reference sequence and their impact on different molecules. References to particular molecules generally use the RefSeq database maintained by the National Center for Biotechnology Information. Common abbreviations and nomenclature conventions are listed in Table 1.2.

Table 1.2 Common abbreviations used for nomenclature of genetic variants.

Abbreviation | Interpretation | Example
“g.” | Refers to a genomic reference sequence. Followed by genomic coordinates including chromosome and position of a given nucleotide or variant. If multiple assemblies exist, the version of the reference assembly should be included. | NC_000009.12:g.114195977G>C
“c.” | Refers to a coding DNA reference sequence. Generally given in reference to a particular transcript isoform. Followed by the position within the coding sequence and the nucleotide or variant. | NM_032888(COL27A1):c.2089G>C
“p.” | Refers to a protein reference sequence. Generally given in reference to a particular protein isoform. Followed by the position within the protein sequence and the amino acid change using the 3-letter code. | NP_116277(COL27A1):p.Gly697Arg
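The structure of the “g.” and “c.” descriptions in Table 1.2 can be made concrete with a short parser. The sketch below is an assumption-laden illustration, not an implementation of the full HGVS grammar: it handles only single-nucleotide substitutions written with the “>” symbol, and the pattern and function names (`HGVS_SUBSTITUTION`, `parse_substitution`, `codon_number`) are hypothetical. It also shows the arithmetic linking a coding position to a codon number, which for the table's example places c.2089 in codon 697, in agreement with the protein-level description p.Gly697Arg.

```python
import re

# Simplified pattern for single-nucleotide substitutions only (an assumption;
# the real nomenclature covers many more variant types).
HGVS_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+(?:\.\d+)?)"  # e.g. NC_000009.12 or NM_032888
    r"(?:\((?P<gene>[A-Za-z0-9-]+)\))?"        # optional gene symbol in parentheses
    r":(?P<level>[gc])\."                      # g. = genomic, c. = coding DNA
    r"(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(description: str) -> dict:
    """Split a substitution description into its named parts."""
    match = HGVS_SUBSTITUTION.match(description.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a simple substitution: {description!r}")
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return fields


def codon_number(coding_position: int) -> int:
    """Return the codon that contains a 1-based coding (c.) position."""
    return (coding_position - 1) // 3 + 1


coding = parse_substitution("NM_032888(COL27A1):c.2089G>C")
genomic = parse_substitution("NC_000009.12:g.114195977G>C")
print(coding)
print(genomic)
print(codon_number(coding["position"]))
```

For real data, a maintained HGVS parsing and validation tool is the safer choice, since valid descriptions include deletions, duplications, intronic offsets, and other forms that this pattern rejects.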


A commonly used resource in human genetics is the Online Mendelian Inheritance in Man (OMIM) database (www.omim.org). OMIM is a continuously updated comprehensive compendium of human genes and phenotypes with a presumed genetic basis. It focuses on the relationships between phenotype and genotype and documents established gene-disease associations of so-called Mendelian disorders, based on literature review and curation. Throughout this book, we refer to many different genetic disorders by name and also by acronyms and provide their designated six-digit identification number or MIM number. The reader can then look up disorders of interest in OMIM using these unique identifiers to learn more about their clinical features and associated information. The # symbol prior to the MIM numbers of genetic disorders referenced throughout indicates that the molecular basis or gene affected in that disorder has been identified and documented in the scientific literature.

From a clinical perspective, variant classification is primarily based on the predicted impact of the variant on disease expression. A framework of criteria for designating individual variants as disease-causing (pathogenic) or as not resulting in disease expression (benign) has been established by the American College of Medical Genetics and Genomics (ACMG), based on factors such as predicted functional effect, population frequency, evolutionary conservation, segregation of the allele among related individuals, or laboratory studies. As these factors may not all be known, or may even conflict with one another, variants can also be classified as “likely pathogenic,” “likely benign,” or simply as a “variant of uncertain significance.” These designations can vary between laboratories and often change based on new evidence to support one assertion over another. 
The open-access ClinVar database (www.ncbi.nlm.nih.gov/clinvar/) aggregates and maintains these interpretations from variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and supporting data utilized in that designation.

1.5 Mendelian patterns of inheritance

The discipline of medical genetics traditionally has focused on chromosomal syndromes and Mendelian disorders, the latter group caused by single gene mutations exhibiting a significant effect on gene function and responsible for the phenotype of the condition. These mutations are rare in the general population due to selection against them, as they often impact health and potentially reduce the ability to reproduce. In fact, over 80% of Mendelian disorders described to date have a pediatric-onset phenotype, though Mendelian disorders that manifest in adulthood are increasingly appreciated in the clinic. Sometimes pediatric versus adult onset can relate to the severity of a variant allele or mutation and its manifestation over time. “Mendelian disorders” were named as such because they segregate in families in a way that follows Gregor Mendel’s laws of segregation and independent assortment. The law of segregation observes that each parent passes one of two versions of any given gene (known as alleles) to their offspring. The law of independent assortment states that any two separate genes are passed independently of one another to offspring. Put simply, for each gene in the diploid genome, one of the two copies or alleles is generally inherited from each parent. Exceptions to this rule are genes found on the sex chromosomes and those physically close to one another on a given chromosome, which


are often transmitted together (a phenomenon known as genetic linkage). A genotype is a particular combination of alleles for a particular gene or locus. In Mendelian genetics, the trait or disease in question is transmitted through specific modes of inheritance depending on the expression of the phenotype and the location of the gene. These Mendelian inheritance patterns include autosomal recessive, autosomal dominant, X-linked recessive, and X-linked dominant, and they are the subjects of Chapters 5, 6, and 7 in this book. One of the main tools in medical genetics to help clarify the mode of inheritance for a potential Mendelian disorder is a three-generation family tree or pedigree. This includes information about the index patient (or proband) and immediate and extended relatives, both living and deceased, as well as infertility or pregnancy loss, adoption, ethnic background, and consanguinity. Examples of the symbols used in constructing a pedigree are found in Fig. 1.3. Additionally, certain patterns observed through constructing family pedigrees can help distinguish the different inheritance patterns. In general, autosomal conditions affect males and females equally, though some may be sex-limited due to other genetic and nongenetic factors. Autosomal dominant inheritance can be observed to follow so-called “vertical transmission,” where affected individuals appear in every generation, whereas autosomal recessive disorders show “horizontal transmission,” with affected individuals typically appearing among siblings within a single generation. X-linked recessive disorders are classically recognized by the presence of multiple affected males in the pedigree born to unaffected carrier females. 
In autosomal recessive conditions, the phenotype is expressed only in individuals who harbor mutations in both their maternal and paternal alleles (i.e., biallelic pathogenic variation); they are said to be homozygous when both alleles carry the same mutation in the gene and compound heterozygous when each allele of the gene has a different mutation. In general, autosomal recessive conditions arise from LoF or hypomorphic mutations, where gene function is eliminated or reduced, respectively, rather than altered as in the dominant disorders described below. Heterozygous carriers are generally asymptomatic because they have one allele that functions appropriately, compensating for the decreased or absent function of the mutant one. For two parents who are heterozygous for a recessive condition, each conception has a 25% risk of being affected, and clinically normal siblings of an affected individual have a 67% (two in three) chance of being carriers like their parents. For this reason, affected individuals appear to “skip generations” in the pedigree. Recessive conditions are explored in more detail in Chapter 5, Recessive Diseases and Founder Genetics, especially as they relate to founder populations, where a shared genetic background in genetic isolates enriches for recessive alleles within regions of homozygosity in the genome and manifests in a higher prevalence of population-specific recessive disorders. One of the characteristics of autosomal dominant disorders is the evidence of phenotype in successive generations, affecting both males and females. In dominant conditions, a single copy (heterozygous, also referred to as monoallelic at the genetic locus) of the disease-associated mutation or variant is sufficient to cause the disease, even when the second allele is normal. 
Homozygotes for the dominant phenotype may be more severely affected than heterozygotes, or a homozygous state for a dominant mutation may be incompatible with life altogether. As heterozygous individuals for dominant diseases have a 50% chance of passing on the mutant allele, each conception in a couple in which one member is heterozygous for a dominant disease mutation has a 50% chance of being affected. Autosomal dominant conditions can be the result of gain-of-function (GoF) mutations in genes, altering the canonical function of the encoded protein; haploinsufficiency, in which the LoF of one allele and the associated decrease in protein production is enough to cause


FIGURE 1.3 Pedigree symbols and conventions and examples of pedigrees demonstrating the different types of Mendelian inheritance patterns. These examples assume conditions that are fully penetrant and do not affect reproductive fitness.


the disease phenotype; or dominant negative effects, where the mutant protein “poisons” the cellular machinery. In pedigrees with no evident family history, dominant disorders often arise as sporadic conditions associated with de novo mutation in a germ cell (egg or sperm) of one of the parents or in the fertilized egg itself early in embryogenesis. In such cases, the recurrence risk for future affected offspring to couples without the constitutional mutation is low but difficult to assess; while the potential for independent spontaneous mutations in the same gene is on the order of 1:10,000 to 1:1,000,000, it is currently challenging to predict or determine what proportion of the unaffected parents’ gametes may contain the germline mutation. Dominant and sporadic disorders are the focus of Chapter 6, Dominant and Sporadic De Novo Disorders. In X-linked recessive disorders, the phenotype is more obvious in males due to being hemizygous for the gene on their single X chromosome inherited from their mothers, as they inherit the Y chromosome from their fathers. If a gene on the X chromosome has a mutation, those males will be affected. A female who has one mutant allele and one unaffected allele (heterozygous) may be only mildly affected (a manifesting heterozygote) or completely asymptomatic (carrier). However, she has a 50% chance to pass the mutant allele to any offspring. Each male offspring has a 50% chance of being affected and each female offspring has a 50% probability of being a carrier like her mother. Males who inherit the mutation will pass it to all their daughters and none of their sons. In contrast, X-linked dominant disorders are often lethal or very severe in males, and so appear predominantly in affected females. While those mutant alleles can be passed to successive generations, X-linked dominant disorders are often associated with reduced reproductive fitness due to the severity of their phenotypes. 
X-linked disorders are discussed in detail in Chapter 7, X-linked and Mitochondrial Disorders. Other forms of inheritance do not follow canonical Mendelian inheritance patterns. As mentioned earlier, multiple copies of the mitochondrial genome are inherited from the mother to her offspring through the egg, and this specific feature of the mitochondria creates distinctive features for mitochondrial disease inheritance. Mitochondrial replication does not adhere to the tightly controlled segregation of chromosomes that takes place in the nuclear genome, but randomly distributes mtDNA copies between two daughter cells during cell division through a process known as replicative segregation. When a mutation is present in the mtDNA, replicative segregation causes differential distribution of the mutation carrying mtDNA in daughter cells. Heteroplasmy is observed when a proportion of the mitochondria in a cell have the mutation but others do not; while a daughter cell with all normal or all mutant mtDNA is said to be homoplasmic. These properties contribute to the variable expressivity and reduced penetrance characteristic of these disorders. Furthermore, they complicate recurrence risk estimation for offspring of heteroplasmic females, as it can vary considerably depending on the percent of mutant mitochondria passed through replicative segregation, something impossible to determine. Mitochondrial inheritance and its associated considerations are also discussed in Chapter 7, X-linked and Mitochondrial Disorders.
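The recurrence risks quoted in this section for autosomal recessive, autosomal dominant, and X-linked recessive inheritance follow directly from enumerating the four equally likely allele combinations a couple can transmit. The sketch below does that enumeration with exact fractions. It assumes full penetrance and no new mutations, as in Fig. 1.3, and the function name `cross` and the allele labels (for example `"Xm"` for an X chromosome carrying the mutation) are hypothetical conventions chosen for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Offspring genotype probabilities; each parent is a pair of alleles."""
    counts = Counter(tuple(sorted(pair)) for pair in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Autosomal recessive: both parents are carriers ("a" = mutant allele).
recessive = cross(("A", "a"), ("A", "a"))
risk_affected = recessive[("a", "a")]
# Among clinically normal children, the share who are carriers.
carrier_if_unaffected = recessive[("A", "a")] / (1 - risk_affected)

# Autosomal dominant: one heterozygous parent ("D" = disease allele).
dominant = cross(("D", "d"), ("d", "d"))
risk_dominant = dominant[("D", "d")]

# X-linked recessive: carrier mother, unaffected father.
x_linked = cross(("X", "Xm"), ("X", "Y"))
p_son = sum(p for g, p in x_linked.items() if "Y" in g)
affected_among_sons = x_linked[("Xm", "Y")] / p_son
carriers_among_daughters = x_linked[("X", "Xm")] / (1 - p_son)

# Affected father, noncarrier mother: all daughters carry, no sons inherit.
from_father = cross(("X", "X"), ("Xm", "Y"))

print(risk_affected, carrier_if_unaffected, risk_dominant)
print(affected_among_sons, carriers_among_daughters, from_father)
```

The two-in-three carrier figure for unaffected siblings comes from conditioning on the child not being affected, which removes one of the four outcomes from the denominator.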

1.6 Other modes of inheritance

Beyond the classical inheritance patterns of Mendelian disorders, rare genetic disorders that follow alternative modes of inheritance or special conditions have been described in the literature. As genomic sequencing continues to be applied to the study of rare diseases, our understanding of classical and nonclassical modes of inheritance for genetic disorders expands.





• Repetitive DNA sequences constitute 30% of the human genome, including regions of short trinucleotide repeats near promoters and exons of protein-coding genes. In general, the number of these repeats varies somewhat from person to person and has no effect on gene expression. However, in a small number of genes, the triplet repeat number can expand during meiosis to become unstable and disrupt gene function. The larger the repeat, the more unstable it is and the more likely it is to expand into the pathogenic range as it is passed from one generation to the next. There are now more than 40 known triplet repeat expansion disorders, including Fragile X (MIM #300624), myotonic dystrophy (MIM #160900), and Friedreich ataxia (MIM #229300). Triplet repeat expansion disorders share certain characteristics: they tend to be progressive and neurodegenerative in nature and exhibit genetic anticipation, a phenomenon in which the condition becomes more severe and earlier in onset with successive generations, depending in some cases on the parent of origin. In pedigrees, triplet repeat expansion disorders will be observed to segregate dominantly, either autosomal or X-linked, depending on the chromosome containing the gene involved in the disease.

• Some disorders can be the result of mutations occurring in the embryo during early development, resulting in only a fraction of the cells in the individual having the genetic defect, a phenomenon known as mosaicism. Mosaicism is discussed in great detail in Chapter 8, Mosaicism in Rare Disease.

• About 1% of nuclear genes follow the conventions of Mendelian inheritance, but their expression is dependent on the parent of origin of the inherited allele, a phenomenon known as genomic imprinting. In genes that undergo imprinting, methyl groups are chemically attached to specific segments of DNA (methylation) in a sex-specific way during gametogenesis. This methylation silences the expression of different genes in the maternal and paternal gametes. 
An individual normally has one active and one inactive copy of each imprinted gene, but imbalances in that process, through mutation or other mechanisms as discussed below, that lead to two active or two inactive copies of the imprinted gene will result in abnormalities in growth and development. The classic examples of imprinting defects are Angelman syndrome (AS, MIM #105830) and Prader-Willi syndrome (PWS, MIM #176270). These two conditions are characterized by distinctly different phenotypes but are associated with one another by the same imprinted region on chromosome 15. AS is paternally imprinted, such that the genes in the imprinted region inherited from the mother are the only ones expressed; the paternal equivalent is silenced through methylation. When the maternal UBE3A allele in this imprinted region of the chromosome is nonfunctional, either due to a gross deletion (70% of cases) or point mutation (11% of cases) of the UBE3A gene, any children who inherit the maternal mutant UBE3A will have AS, as they are left without a functional copy of that gene. Uniparental disomy, the inheritance of two copies of a chromosome from a single parent, or abnormalities in the imprinting center itself may also underlie the condition. Conversely, PWS is maternally imprinted, so the expression of the genes in the imprinted region of the maternal chromosome 15 is repressed. While most cases of PWS are due to a deletion of the region in the paternal chromosome (70%-80% of cases), uniparental disomy of the maternal chromosome 15 generally accounts for the remaining ones. Indeed, PWS and AS display very different


inheritance patterns in a pedigree. Uniparental disomy is further discussed in Chapter 2, Karyotyping as the First Genomic Approach.

• Certain rare disorders have been characterized that manifest in individuals harboring pathogenic variants as two distinct alleles from different genes, such that their disease phenotype is driven by heterozygous alleles in two different but related genes; this is known as digenic inheritance. Unaffected parents have a 25% chance for each of their offspring to be affected, and while children of affected individuals also have a 25% risk of the condition, the inheritance of this condition will appear dominant within a pedigree. Certain types of retinitis pigmentosa, Usher syndrome, and Bardet-Biedl syndrome have been described in the medical literature with this more complex form of inheritance.

• Finally, genome-wide interrogation using agnostic NGS approaches such as WES or WGS has identified a growing number of complex genetic phenotypes that arise from two or more separate conditions within one person. Except when the underlying loci are genetically linked, these conditions segregate independently of one another. Recent estimates suggest that such dual diagnoses occur in approximately 5% of individuals, and their likelihood increases with consanguinity. Dual diagnoses and multilocus pathogenic variation and inheritance are the focus of Chapter 9, Multilocus Inheritance and Variable Disease Expressivity in Rare Disease.
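The two 25% figures given above for digenic inheritance can be checked by enumerating gametes at two loci. The sketch below is a simplified illustration that assumes the two genes assort independently (they are unlinked), that disease requires at least one pathogenic allele in each gene, and that penetrance is complete. The names `gametes` and `digenic_risk`, and the convention that lowercase letters mark pathogenic alleles, are hypothetical choices for this example.

```python
from fractions import Fraction
from itertools import product


def gametes(genotype):
    """Equally likely gametes for a two-locus genotype, e.g. (("A","a"),("B","B"))."""
    return list(product(*genotype))


def digenic_risk(parent1, parent2):
    """Probability that a child carries a pathogenic allele in both genes."""
    affected = 0
    total = 0
    for g1, g2 in product(gametes(parent1), gametes(parent2)):
        total += 1
        variant_in_gene1 = "a" in (g1[0], g2[0])
        variant_in_gene2 = "b" in (g1[1], g2[1])
        if variant_in_gene1 and variant_in_gene2:
            affected += 1
    return Fraction(affected, total)


# Unaffected parents, each heterozygous at a different gene.
unaffected_couple = digenic_risk((("A", "a"), ("B", "B")),
                                 (("A", "A"), ("B", "b")))

# Affected double heterozygote with a partner carrying neither variant.
affected_parent = digenic_risk((("A", "a"), ("B", "b")),
                               (("A", "A"), ("B", "B")))

print(unaffected_couple, affected_parent)
```

In both crosses the child must receive two specific alleles, each with probability one half, which is where the one-in-four figure comes from.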

1.7 Considerations of Mendelian disorders and genetic inheritance

While a disease phenotype is often obvious, some additional variables and phenomena are worth considering that may influence phenotypic differences among affected individuals sharing the same disease genotype. Penetrance of a gene mutation or variant refers to whether individuals with a given genotype present the corresponding phenotype. When the disease manifests in all individuals with the disease genotype, it is said that there is complete penetrance; an example of this is achondroplasia (MIM #100800). Penetrance may be age-dependent, such that the phenotype in a person with the mutation will be observed only after a certain age, as in Huntington disease (MIM #143100). If some individuals with the disease-associated genotype do not develop features of the disorder (i.e., remain asymptomatic), it is said to have incomplete (or reduced) penetrance. Sometimes penetrance is expressed as the fraction or probability that individuals with the disease-associated genotype will exhibit the phenotype; one example is retinoblastoma (MIM #180200), in which about 10% of patients who harbor the heterozygous RB1 disease allele do not develop intraocular tumors. Additional genetic or environmental factors may influence penetrance of genetic disorders. A related term that is often confused with penetrance is variable expressivity, which refers to the range of phenotypic features (signs and symptoms of the disease) and differences in severity of the disorder among individuals who share the same disease-associated genotype. Some affected individuals may exhibit only a few features of a given condition, while other affected patients have more or different features of the condition. 
A condition with variable expressivity is Waardenburg syndrome type 4A (MIM #277580) due to biallelic EDNRB mutations, where about 50% of homozygotes have aganglionic megacolon and some combination of hearing loss, bicolor irides (heterochromia), and white forelock.


In correlating phenotype and genotype, it is worth bearing in mind that genetic disorders with similar phenotypes can be due to myriad genotypes at different loci in the genome, a phenomenon known as locus or genetic heterogeneity. One example is nonsyndromic hearing loss, for which mutations in over 60 different genes have been found to underlie the phenotype. Furthermore, different mutations in the same gene can cause the same disease, referred to as allelic heterogeneity, while different diseases can converge on the same genetic locus, referred to as allelic affinity. As an example of the latter, LoF mutations in the RYR1 gene cause central core disease (MIM #117000), while GoF variants in the same gene are associated with susceptibility to malignant hyperthermia (MIM #145600).

1.8 Conclusion

Scientific and technical advances of the recent past are enriching and expanding our understanding of the human genome and the role of genetic variants in health and development, informing the practice of medicine in an unprecedented way. Rare genetic disorders offer insights into the discovery of fundamental and novel biological processes. As these findings increasingly inform medical care, every healthcare provider, scientist, and individual interested in traits that might be specifically observed within families should have a framework for understanding the basic concepts underlying human genetics and genomics.

Further reading

Balci T, Hartley T, Xi Y, Dyment D, Beaulieu C, Bernier F, et al. Debunking Occam’s razor: diagnosing multiple genetic diseases in families by whole-exome sequencing. Clin Genet 2017;92(3):281-9.
Bamshad M, Nickerson D, Chong J. Mendelian gene discovery: fast and furious with no end in sight. Am J Hum Genet 2019;105(3):448-55.
Claussnitzer M, Cho J, Collins R, Cox N, Dermitzakis E, Hurles M, et al. A brief history of human disease genetics. Nature 2020;577(7789):179-89.
Gonzaga-Jauregui C, Lupski JR, Gibbs RA. Human genome sequencing in health and disease. Annu Rev Med 2012;63:35-61.
Haendel M, Vasilevsky N, Unni D, Bologa C, Harris N, Rehm H, et al. How many rare diseases are there? Nat Rev Drug Discov 2020;19(2):77-8.
Lee CE, Singleton KS, Wallin M, Faundez V. Rare genetic diseases: nature’s experiments on human development. iScience 2020;23(5):101123.
Lupski JR. Clinical genomics: from a truly personal genome viewpoint. Hum Genet 2016;135(6):591-601.
Nussbaum R, McInnes R, Willard H. Thompson & Thompson genetics in medicine. 8th edition. Elsevier; 2017.


CHAPTER 2

Karyotyping as the first genomic approach

Amy Breman1 and Paweł Stankiewicz2

1Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States 2Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States

2.1 Introduction For more than five decades, the analysis of human chromosomes has been an established method for obtaining genetic diagnoses across the prenatal, postnatal, and oncology patient population. Few laboratory disciplines have the potential to affect such a broad range of medical specialties, yet with the advancement of genomic technologies, cytogenetic testing is perhaps more poorly understood than ever. The human genome is organized within 23 pairs of large DNA molecules known as chromosomes. Karyotyping, or conventional chromosome analysis, refers to the evaluation of banded metaphase chromosomes from cultured cells (Fig. 2.1). Clinical cytogeneticists analyze karyotypes to detect a wide range of chromosomal abnormalities, including changes in chromosome number associated with aneuploid conditions, such as trisomy 21 (Down syndrome). Analysis of karyotypes can also reveal more subtle structural changes, such as chromosomal deletions, duplications, inversions, or translocations, which can be balanced or unbalanced (Fig. 2.2). Not long after the first recognition that variations in the normal diploid chromosome number and structure could lead to severe disability and disease, it became clear that a standardized system of terminology was needed to describe chromosome abnormalities. An international standard was ultimately developed that came to be known as the International System for Human Cytogenetic Nomenclature (ISCN), which includes chromosome band locations, symbols, and abbreviated terms used in the description of chromosome abnormalities. Today, the ISCN is the central reference for the description of karyotyping, fluorescence in situ hybridization (FISH), and chromosomal microarray results. In karyotype nomenclature, the first number listed is the total number of chromosomes, followed by a comma (,). The sex chromosomes are listed next and represent the complement of X and Y chromosomes. 
Thus, the normal human karyotype is designated as either 46,XX or 46,XY. When an abnormality is present, the chromosome number and region involved in the change are specified within parentheses immediately following the type of rearrangement. For example, the nomenclature 46,XY,del(3)(q29) indicates a male individual (XY) with the normal diploid number of chromosomes (46), where one chromosome 3 has a deletion on the long arm (q) at region 2, band 9.

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00002-8 © 2021 Elsevier Inc. All rights reserved.
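The field order described above (chromosome count, then sex chromosomes, then any abnormalities) can be illustrated with a small parser. This is a minimal sketch that handles only simple, non-mosaic karyotype strings. The `Karyotype` class and `parse_karyotype` function are hypothetical names introduced for illustration and do not implement the full ISCN grammar.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Karyotype:
    chromosome_count: int   # total number of chromosomes, e.g. 46
    sex_chromosomes: str    # sex chromosome complement, e.g. "XY"
    abnormalities: List[str] = field(default_factory=list)


def parse_karyotype(text: str) -> Karyotype:
    """Split a simple, non-mosaic karyotype string into its comma-separated fields."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) < 2 or not parts[0].isdigit():
        raise ValueError(f"expected '<count>,<sex chromosomes>[,...]', got {text!r}")
    sex = parts[1]
    # Illustrative check only: the complement must consist of X and Y symbols.
    if not sex or set(sex) - {"X", "Y"}:
        raise ValueError(f"unexpected sex chromosome complement: {sex!r}")
    return Karyotype(int(parts[0]), sex, parts[2:])


example = parse_karyotype("46,XY,del(3)(q29)")
print(example.chromosome_count, example.sex_chromosomes, example.abnormalities)
```

Mosaic karyotypes, which separate clones with a slant line, would need additional handling that this sketch deliberately leaves out.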


FIGURE 2.1 Normal female karyogram.

The standard G-banded karyotype at a 500- to 600-band level of resolution allows for detection of large (microscopically visible) chromosomal rearrangements that are greater than approximately 5–10 Mb in size. Over the years, several key technical advances in cytogenomic techniques have allowed for a substantially higher level of resolution across the entire genome. Implementation of molecular methods in classical cytogenetics such as FISH, first introduced in the late 1980s, led to the emergence of molecular cytogenetics, which has changed the standard of care for patients suspected of having a chromosomal disorder. FISH involves the hybridization of a labeled DNA probe to a chromosomal target, allowing for the detection of subtle deletions, duplications, or rearrangements that are not routinely delineated with standard G-banded chromosome analysis. Chromosomal microarray analysis (CMA) using array comparative genomic hybridization (aCGH) or single nucleotide polymorphism (SNP) arrays has the capacity to detect microscopic and submicroscopic chromosomal imbalances on a genome-wide scale. The principle behind aCGH is the detection of copy number imbalance by comparing equal amounts of genomic DNA from a patient and a normal control (see Chapter 3: Genomic Disorders in the Genomics Era). SNP arrays, meanwhile, detect copy number imbalance based on hybridization intensities without the need for a normal control DNA being co-hybridized to a patient’s DNA sample. Many high-resolution CMA platforms have been developed to detect CNVs that utilize aCGH or SNP technology or both, and CMA now represents the first-tier diagnostic method for developmental disorders and congenital anomalies [1].

FIGURE 2.2 Frequencies of chromosomal aberrations. Chromosome aberrations are common: 1 in ~160 live births.
Numerical (unbalanced): 1 in ~250
  Sex chromosomes: 1 in ~440
    Turner syndrome: 1 in 4000 females
    Klinefelter syndrome: 1 in 1000 males
  Autosomes: 1 in ~700
    Trisomy 21: 1 in 830
    Trisomy 18: 1 in 7500
    Trisomy 13: 1 in 22,700
Structural: 1 in ~375
  Balanced: 1 in ~490
    Reciprocal translocations: 1 in 650
    Robertsonian translocations: 1 in 1000
  Unbalanced: 1 in ~1600

Regarding the types of specimens most appropriate for chromosome analysis, it is important to consider that the cells must be capable of proliferation in culture. This means that formalin-fixed paraffin-embedded (FFPE) or frozen specimens are typically not acceptable. The most common sample type used for chromosome analysis is peripheral blood T-lymphocytes, although bone marrow, skin fibroblasts, and prenatal specimens derived from amniotic fluid or chorionic villus biopsy are also suitable. Saliva and buccal swab specimens, while gaining popularity for molecular-based methods, are not typically compatible with chromosome analysis. Notably, rare mosaic chromosomal abnormalities may be difficult to detect in the peripheral blood and require analysis of cultured skin fibroblasts or buccal cells. This is due to the potential for selection against the abnormal cell line during the cell culture process carried out in the laboratory.

The contribution of disease-causing genomic imbalances or copy number variants (CNVs) has been increasingly recognized in recent years and has led to the rapid development of next-generation sequencing (NGS) pipelines that have the capacity to detect CNVs in addition to single nucleotide variants (SNVs). Nonetheless, classical cytogenetics still has an important role in diagnostic genetics and genomics and can be viewed as a complementary testing strategy in many diagnostic testing scenarios.
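The aCGH principle described earlier, comparing patient DNA against a normal control, is commonly summarized as a log2 ratio of the two signals. The sketch below shows the idealized values such a ratio would take for nonmosaic copy number changes. The log2 convention and the function name `expected_log2_ratio` are assumptions for illustration, not something specified in this chapter, and real array data are noisier and often compressed toward zero.

```python
import math


def expected_log2_ratio(patient_copies: int, control_copies: int = 2) -> float:
    """Idealized log2(patient/control) signal for a nonmosaic copy number state."""
    if patient_copies <= 0 or control_copies <= 0:
        # A homozygous deletion (0 copies) has no finite log2 ratio in this model.
        raise ValueError("copy numbers must be positive for a finite log2 ratio")
    return math.log2(patient_copies / control_copies)


for label, copies in [("heterozygous deletion", 1), ("normal", 2), ("duplication", 3)]:
    print(f"{label}: {expected_log2_ratio(copies):+.2f}")
```

The asymmetry between a single-copy loss and a single-copy gain in this idealized model is one reason deletions tend to stand out more clearly than duplications on a ratio scale.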

2.2 Numerical chromosome aberrations The recognition of aneuploidy as a mechanism of human disease dates back to the 1950s, when the cause of Down syndrome was first identified as an extra copy of chromosome 21 [2,3]. In the years since, chromosomal aneuploidy has been shown to be the leading cause of congenital birth defects and a major contributor to anomalies in first-trimester pregnancy loss [4]. Numerical chromosome abnormalities primarily result from errors in meiosis and are detected in approximately 5% of clinically recognized pregnancies and 0.3% of live births [4–6]. Most cases of aneuploidy result either from trisomy (three copies of a particular chromosome) or monosomy (one copy of a particular chromosome), with important distinctions with regard to phenotypic consequences for each. Two other abnormal chromosome complements, triploid (3n) and tetraploid (4n), are occasionally observed in liveborn individuals with mosaicism for a normal (euploid) cell line, but nonmosaic triploidy or tetraploidy is almost always incompatible with life. Karyotyping is still considered a first-line test for the diagnosis of the most frequent and clinically important autosomal aneuploidies (trisomies 13, 18, and 21) as well as sex chromosome aneuploidies (SCAs) (Table 2.1).

Table 2.1 Most common numerical chromosomal abnormalities observed in humans.

Disorder | Chromosomal abnormality | Main clinical features
Down syndrome | Trisomy 21: full trisomy 21 (94%); mosaic trisomy 21 (2.4%); translocation 21 (3.3%) | Short stature, characteristic facial features, congenital heart defects, joint laxity, short broad hands, hypotonia, intellectual disability, Alzheimer disease in adulthood, leukemia
Patau syndrome | Trisomy 13; mosaic forms have been reported | Polydactyly, cleft lip and palate, holoprosencephaly, profound intellectual disability, brain abnormalities, microphthalmia, congenital heart defects, omphalocele, feeding difficulties, failure to thrive, 80% die within the first year of life
Edwards syndrome | Trisomy 18; mosaic forms have been reported | Intrauterine growth retardation (IUGR), profound intellectual disability, micrognathia, microcephaly, congenital heart defects, overlapping fingers and toes, clenched hands, rocker-bottom feet, omphalocele, feeding difficulties, failure to thrive, 90% die in the first year of life
Turner syndrome | Monosomy X (45,X); variety of associated karyotypes, including mosaicism and/or structural abnormalities of the X chromosome | Phenotypic female, short stature, delayed puberty, infertility, congenital heart and renal defects, webbed neck, lymphedema of hands and feet. Most affected females have normal intelligence but some have mild intellectual disability and learning difficulties
Klinefelter syndrome | Extra X chromosome in males (47,XXY) | Phenotypic male, low testosterone, reduced muscle mass, facial and body hair, gynecomastia, tall stature, delayed puberty, small testes and penis, infertility, learning difficulties
Triple X syndrome | Trisomy X (47,XXX) | Phenotypic female, tall stature, hypotonia, clinodactyly, language delay, dyspraxia. Undiagnosed in most cases
XYY syndrome | Extra Y chromosome in males | Phenotypic male, tall stature, acne in puberty, learning difficulties, behavioral problems. Undiagnosed in most cases


2.2.1 Autosomal aneuploidy Chromosomal aneuploidy is the most common genetic abnormality in humans and is associated with errors in chromosome segregation. Despite its frequency, only three autosomal aneuploidies are compatible with postnatal life: trisomy 21 (Down syndrome), trisomy 18 (Edwards syndrome), and trisomy 13 (Patau syndrome). Down syndrome is by far the most well-characterized of the autosomal aneuploidies and is the most common form of inherited intellectual disability at a frequency of approximately one in 830 live births [7]. But that is not to say the significance of autosomal aneuploidy is limited to these three well-known disorders. In the era of molecular genomic analyses, the significance of chromosomal aneuploidy in the identification of rare chromosomal disorders may be overlooked. In reality, the advent of whole-genome sequencing (WGS) has allowed us to appreciate that aberrations of chromosome number arising from errors in meiosis or mitosis are not uncommon events, and the associated pathologies of chromosomal mosaicism and uniparental disomy (UPD), discussed in detail later, are likely more prevalent than previously thought. It has been suggested that undetected mosaicism could be a frequent cause of phenotypic abnormalities even in the presence of a normal karyotype (see Chapter 8: Mosaicism in Rare Disease). This is because routine constitutional chromosome studies from stimulated blood samples consist of analysis of 20 metaphase cells, unless mosaicism is suspected, for which a total of at least 30 cells must be counted [8]. Even then, analysis of 30 cells is only able to rule out mosaicism at a level of 10% or greater [9]. It is also noteworthy that mosaic aneuploidy has the potential to be missed in some cases due to factors such as selective growth of normal diploid cells in culture or tissue-limited mosaicism. 
In the context of rare disease, this means that some cases of mosaic aneuploidy likely remain undetected due to relatively mild, overlapping, and nonspecific symptoms that could be missed by a standard chromosome analysis, especially if a mosaicism workup is not requested by the ordering physician. Trisomy 9 is an example of a rare chromosome disorder that is often seen in a mosaic form and can be described by the abbreviation mos in cytogenetic nomenclature [e.g., mos 47,XX,+9/46,XX], although mosaicism is more commonly represented by separating each clone by a slant line (/) with square brackets [], placed after the karyotype description to indicate the number of cells in each clone, for example, 47,XX,+9[5]/46,XX[15].
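The relationship between the number of cells counted and the level of mosaicism that can be excluded follows from a simple binomial argument. The sketch below assumes that counted cells are an independent random sample of the tissue (ignoring the culture selection and tissue-limited mosaicism noted above) and uses a 95% confidence threshold. Both the threshold and the function names are assumptions introduced for illustration.

```python
import math


def detection_probability(cells_counted: int, mosaic_fraction: float) -> float:
    """Probability of seeing at least one abnormal cell among those counted."""
    if not 0.0 < mosaic_fraction < 1.0:
        raise ValueError("mosaic_fraction must be between 0 and 1 (exclusive)")
    # Chance that every counted cell is normal is (1 - f)^n; take the complement.
    return 1.0 - (1.0 - mosaic_fraction) ** cells_counted


def cells_needed(mosaic_fraction: float, confidence: float = 0.95) -> int:
    """Smallest cell count whose detection probability reaches the confidence."""
    if not 0.0 < mosaic_fraction < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("both arguments must be between 0 and 1 (exclusive)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - mosaic_fraction))


for n in (20, 30):
    print(n, "cells:", round(detection_probability(n, 0.10), 3))
print("cells needed for 10% mosaicism:", cells_needed(0.10))
```

Under these assumptions, a 20-cell routine study falls short of 95% confidence for 10% mosaicism while a 30-cell count exceeds it, which is consistent with the counting thresholds quoted in the text.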

2.2.2 Sex chromosome aneuploidy SCAs as a whole are relatively common, with an estimated incidence of approximately 1 in 440 births, and are characterized by the loss or gain of one or more sex chromosomes. The most well-known SCAs include Turner syndrome (monosomy X), Klinefelter syndrome (XXY), trisomy X (XXX), XYY, and XXYY. Despite their frequency, SCAs are almost certainly underdiagnosed because the phenotypes associated with these disorders are typically less severe than those associated with autosomal chromosome disorders. Affected individuals may have few, if any, recognizable phenotypes and may present later in adolescence or even adulthood with subtle features, including learning difficulties, behavioral challenges, or in some cases delayed puberty and infertility [10]. For example, it has been estimated that 50%–85% of individuals with XXY (Klinefelter syndrome) or XYY syndrome remain undiagnosed [11]. Other indications that raise the possibility of an SCA include primary or secondary amenorrhea, infertility, or ambiguous genitalia. Mosaicism and structural abnormalities of the sex chromosomes may also confound diagnosis. For instance, approximately 50% of women with Turner syndrome have a 45,X karyotype; the remaining patients have a variety of associated karyotypes, including mosaicism and/or structural abnormalities of the X chromosome, for example, isochromosome of the long arms, 45,X[20]/46,X,i(X)(q10)[10]. Patients with 45,X/46,XY mosaicism can show a wide spectrum of phenotypes, ranging from a typical Turner female phenotype to a predominantly male phenotype with some common Turner stigmata. For these patients, the clinical presentation likely depends on the proportion and tissue distribution of each cell line. Phenotypic females with a Y chromosome may be at an increased risk of developing gonadoblastoma if gonadal tissue remains in the abdomen [12]. In contrast, there is considerable variability in the cognitive and behavioral phenotype in individuals with Klinefelter syndrome, but this variability is only rarely due to mosaicism, which is found in approximately 15%–20% of patients, typically having a 46,XY/47,XXY chromosome complement [13].

2.3 Structural chromosome aberrations 2.3.1 Reciprocal translocations Reciprocal translocations are a form of structural variation in which segments of two nonhomologous chromosomes are exchanged without any apparent cytogenetically observed changes in copy number. Of these, the vast majority are inherited from a parent and are not associated with any phenotypic effect. Carriers of balanced reciprocal translocations are, however, at increased risk for infertility, recurrent miscarriages, or having children with abnormalities due to the unbalanced gametes that are produced. While it is estimated that approximately 1/500–1/650 individuals (and thus 1/250–1/325 couples) carry a de novo balanced translocation, only around 6% are thought to have an associated disease phenotype [14,15]. Of note, using CMA, 40% of patients with an apparently balanced translocation and a “chromosomal phenotype” were found to be unbalanced [16]. More recently, large-scale studies examining the breakpoints of apparently balanced chromosomal rearrangements have become possible as a result of advances in NGS technologies. Analysis of the breakpoint junction events at the DNA level in translocation carriers has allowed for the study of the molecular mechanisms leading to the formation of these aberrations [17]. The resulting data have enabled elucidation of risk factors for their formation and identification of key genes responsible for rare genetic disorders [18–22]. The disruption of a critical gene directly by the translocation breakpoint or indirectly by position effect leading to gene dysregulation may underlie the cause of disease in apparently balanced de novo translocation carriers, particularly if it involves a dosage-sensitive disease gene with clinical traits associated with a dominant inheritance pattern.
Because there are two major mechanisms leading to disease, the first being a copy number imbalance at the chromosomal level and the second being a disruption of gene function at the molecular level, the same testing method does not easily apply to all cases. Notably, in the case of balanced reciprocal translocations, the minimum detectable imbalance is likely to be larger, particularly when both chromosomal regions involved in the exchange are similar in size. Currently, CMA is the testing method of choice for patients with a suspected chromosomal abnormality and, while it varies based on the lab and array platform, it can reliably detect most copy number imbalances above 100 kb. A limitation of CMA, however, is its inability to detect balanced chromosomal rearrangements, such as inversions.
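The step from the per-individual carrier frequency to the per-couple frequency quoted above can be written out explicitly. This is a minimal worked calculation, assuming that the carrier statuses of the two partners are independent (an assumption not stated in the text), with $p$ denoting the per-individual frequency:

```latex
P(\text{at least one carrier in a couple}) = 1 - (1 - p)^2 = 2p - p^2 \approx 2p,
\qquad
p = \tfrac{1}{500} \;\Rightarrow\; 2p = \tfrac{1}{250},
\qquad
p = \tfrac{1}{650} \;\Rightarrow\; 2p = \tfrac{1}{325}.
```

The $p^2$ term, which corresponds to both partners being carriers, is negligible at these frequencies.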


2.3.2 Paracentric and pericentric inversions Inversions are intrachromosomal rearrangements where a single chromosome undergoes two breaks and the material between the two breakpoints reverses in orientation. Inversions of chromosomal segments are classified into two types, defined by whether they involve a single chromosome arm (paracentric, excluding the centromere) or both chromosome arms (pericentric, including the centromere). Much like reciprocal translocations, the majority of inversions are inherited and not associated with a clinical phenotype, but carriers are at increased risk of abnormal gametes and associated fertility issues, resulting from meiotic crossover events within the inverted segment. When the inversion is pericentric, a recombination event within the inverted region can result in unbalanced recombinant chromosomes with deletion and duplication of chromosome segments distal to the inversion breakpoints. The risk for a carrier of a pericentric inversion to have a child with an unbalanced karyotype is estimated to be 5%–10%, depending on the size of the inverted segment. When the inversion is paracentric, the risk that a carrier will have a liveborn child with an abnormal karyotype is very low because any unbalanced recombinant chromosomes would be acentric or dicentric and typically would not lead to viable offspring. The karyotype of an inversion carrier could be described as, for example, 46,XX,inv(3)(p24.2q26.1) (Fig. 2.3). Similar to reciprocal translocations, in rare cases, the inversion breakpoints may disrupt a gene or result in a position effect, leading to disease even in the absence of a copy number imbalance. The capacity to detect inversions is also influenced by the testing methodology and whether the inversion is balanced or unbalanced. A large, microscopically visible inversion could be detected by conventional chromosome analysis regardless of whether it is balanced or unbalanced, but CMA would only detect unbalanced inversions.
Neither method could detect disruption of a gene at the breakpoint junction, which would rely on sequencing methods. The case depicted in Fig. 2.3 exemplifies the different levels of information that can be obtained through various cytogenetic methodologies.
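The paracentric versus pericentric distinction can be read directly from the breakpoint designations: if both breakpoints lie on the same arm the centromere is excluded, and if they lie on different arms it is included. The sketch below applies that rule to simple inversion terms. The function name `classify_inversion` and the second example term are hypothetical, and the pattern covers only the basic two-breakpoint form.

```python
import re

# Matches terms such as inv(3)(p24.2q26.1): chromosome, then two arm+band breakpoints.
_INVERSION = re.compile(
    r"inv\((\d{1,2}|X|Y)\)\(([pq])(\d+(?:\.\d+)?)([pq])(\d+(?:\.\d+)?)\)"
)


def classify_inversion(term: str) -> str:
    """Return 'paracentric' or 'pericentric' for a simple inversion term."""
    match = _INVERSION.fullmatch(term.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a simple two-breakpoint inversion term: {term!r}")
    first_arm, second_arm = match.group(2), match.group(4)
    # Same arm: centromere lies outside the inverted segment (paracentric).
    # Different arms: the inverted segment spans the centromere (pericentric).
    return "paracentric" if first_arm == second_arm else "pericentric"


print(classify_inversion("inv(3)(p24.2q26.1)"))
print(classify_inversion("inv(3)(q21q26.2)"))  # illustrative same-arm example
```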

FIGURE 2.3 Multiple cytogenetic testing methods applied to a single case. This patient was referred for cytogenetic studies due to developmental delay and a family history of a pericentric inversion of chromosome 3. (A) Conventional chromosome analysis revealed an apparently balanced pericentric inversion, 46,XX,inv(3)(p24.2q26.1) (arrows indicate breakpoints). However, due to the patient’s developmental delay and limited resolution of G-banded chromosome analysis, further studies were requested to determine if there was a deletion at the breakpoints of the inversion. (B) FISH studies revealed a deletion at the breakpoint on the q-arm, minimally involving the probe region at 3q26.1 (green signal, present on only one homolog). (C) CMA studies were then performed to determine the size and number of genes involved in the deleted interval, which revealed the deletion is confined to a region of 864 kb (red arrow) that includes no annotated genes, indicating the CNV is likely benign.

2.3.3 Robertsonian translocations Robertsonian translocations (RTs) are the most common recurrent constitutional chromosomal aberrations in humans, with an incidence of ~1/1000 individuals. They are most often formed by centric or short arm fusion between two different acrocentric chromosomes (13–15 or 21–22) or, rarely, two acrocentric homologs. Of note, a mirror image chromosome with the long arms of the same homologous acrocentric chromosome can represent a product of an isochromosome formation (see Section 2.3.6) [23,24]. When a chromosomal break occurs within the centromeres, the karyotype of the carrier of such aberration is described as, for example, 45,XX,der(14;21)(q10;q10) or 45,XX,rob(14;21)(q10;q10). When a break maps near the centromeres and they can be ascertained (e.g., by C-banding or by FISH), the chromosome is designated as dicentric or pseudodicentric [25,26]. All combinations of acrocentric chromosomes have been observed, but those involving chromosomes 13 and 14 or 14 and 21 are the most common, with rob(13;14)(q10;q10) found in 1 in 1300, being considered the most common chromosomal aberration in humans [27]. In carriers of RT, the number of chromosomes is reduced to 45, but the chromosomal complement is considered balanced, as the loss of the short arms harboring repetitive satellite and ribosomal DNA sequences (they are observed in certain circumstances) is not deleterious. The carriers of the balanced RT are typically healthy; however, because meiotic segregation results in nullisomic or disomic gametes that lead to monosomic or trisomic zygotes, respectively, they have an increased risk of miscarriage or offspring with aneuploidy or, rarely, UPD. When the rearrangements (RT or isochromosomes) involve homologous acrocentrics like chromosomes 14 or 15, they manifest a genomic imprinting disorder, that is, Temple/Kagami–Ogata syndromes or Prader–Willi/Angelman syndromes, respectively, due to UPD (see below). In general, the risk of an affected liveborn child for carriers of RTs between different acrocentric chromosomes is ~1% with the exception of female carriers of rob(14;21)(q10;q10), where it has been estimated as 15%. When chromosome 13 is involved, the surviving offspring can have trisomy 13 (Patau syndrome), that is, 46,XX,+13,der(13;21)(q10;q10) or 46,XX,+13,rob(13;21)(q10;q10). In the case of a homologous 21;21 RT, balanced carriers have 100% risk of progeny with Down syndrome [46,XX,der(21;21)(q10;q10),+21 or 46,XX,rob(21;21)(q10;q10),+21] unless the rearrangement in the parent is mosaic or the other parent has the same aberration. Importantly, a product of RT is found in up to 4% of individuals with Down syndrome, and in contrast to free trisomy 21, the recurrence risk is higher and unrelated to the age of the mother carrying the RT or isochromosome involving chromosome 21.
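The link between a carrier's nullisomic or disomic gametes and monosomic or trisomic zygotes can be made concrete by enumerating the segregation products. The sketch below is a simplified illustration for an RT between two different acrocentric chromosomes: it lists the six gamete types expected from segregation of the translocation chromosome and the two free chromosomes, then combines each with a normal gamete. It counts long arms only, does not model how often each gamete type arises or survives, and the function name `robertsonian_zygotes` is hypothetical.

```python
def robertsonian_zygotes(chrom_a: int, chrom_b: int) -> dict:
    """Map each carrier gamete type to the zygote outcome after normal fertilization."""
    if chrom_a == chrom_b:
        raise ValueError("this sketch only covers RTs between different chromosomes")
    # Each element is the tuple of long arms it carries.
    free_a, free_b, rob = (chrom_a,), (chrom_b,), (chrom_a, chrom_b)
    gametes = {
        f"{chrom_a} + {chrom_b}": [free_a, free_b],
        "rob": [rob],
        f"rob + {chrom_b}": [rob, free_b],
        f"{chrom_a} only": [free_a],
        f"rob + {chrom_a}": [rob, free_a],
        f"{chrom_b} only": [free_b],
    }
    outcomes = {}
    for label, elements in gametes.items():
        arms = [arm for element in elements for arm in element]
        # The partner's normal gamete contributes one copy of each chromosome.
        copies = {c: arms.count(c) + 1 for c in (chrom_a, chrom_b)}
        findings = [
            f"{'trisomy' if n == 3 else 'monosomy'} {c}"
            for c, n in copies.items()
            if n != 2
        ]
        if not findings:
            findings = ["balanced carrier" if rob in elements else "normal"]
        outcomes[label] = ", ".join(findings)
    return outcomes


for gamete, outcome in robertsonian_zygotes(14, 21).items():
    print(f"{gamete:>10} -> {outcome}")
```

For a rob(14;21) carrier, only the outcome with an extra copy of the chromosome 21 long arm is expected among liveborn unbalanced offspring, which is why the empirical risks quoted in the text are far lower than a naive count of gamete types would suggest.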

2.3.4 Deletions and duplications Chromosomal deletions and duplications occur when a segment of chromosomal material is lost or gained, respectively. These events can be further classified as terminal, when there is one break in the chromosome arm and the entire end region distal (telomeric) to the breakpoint is lost or gained; or interstitial, where there are two breaks in the chromosome arm and only the region in between is lost or gained. The karyotype of an individual with a large deletion or duplication visible by conventional chromosome analysis could be described as, for example, 46,XY,del(5)(q22q33) or 46,XY,dup(1)(q21q25), respectively. Submicroscopic events are referred to as microdeletions or microduplications. A CNV can be defined as gain or loss of a DNA segment ranging in size from 1 kb to several Mb that differs in copy number among individuals due to deletion, duplication, or rarely more complex chromosomal rearrangement (CCR). Although some CNVs cause disease, many others represent benign polymorphic variants in the population [28]. The burden of large CNVs over 1 Mb in size is enriched in individuals with more severe developmental and intellectual disability phenotypes or multiple congenital abnormalities [29,30]. Owing to the structural characteristics of the human genome such as segmental duplications or low-copy repeats (LCRs), there are a number of recurrent deletion/duplication events that are associated with well-characterized genomic disorders [31] (see Chapter 3: Genomic Disorders in the Genomics Era). Recurrent CNVs generally arise by nonallelic homologous recombination (NAHR) between LCRs during meiosis, and since these repeat DNA sequences are localized in clusters or “hotspots” throughout the genome, the affected region is essentially the same even in unrelated individuals [32]. 
While large deletions and duplications may be detectable by standard chromosome analysis, CMA is the testing method of choice for detection of genomic copy number changes due to its increased resolution. Although FISH technology provides a level of resolution approaching that of CMA, particularly for deletions, its use is limited by the need to target a specific genomic region based on a clinical diagnosis or suspicion. Nonetheless, it is a practical tool for confirmation of suspected chromosome abnormalities by karyotype or CMA, as well as a tool to confirm the presence of a specific copy number change in a family member after it has been identified by CMA testing in a patient.
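The terminal versus interstitial distinction described above is reflected in how many breakpoints a deletion or duplication term lists. The sketch below reads simple `del(...)` and `dup(...)` terms and labels them accordingly. It assumes the convention that a single listed band denotes one break (terminal) and two listed bands denote two breaks (interstitial), and the function name `classify_imbalance` is hypothetical.

```python
import re

# Matches terms such as del(5)(q22q33) or del(3)(q29).
_TERM = re.compile(r"(del|dup)\((\d{1,2}|X|Y)\)\(((?:[pq]\d+(?:\.\d+)?)+)\)")
_BAND = re.compile(r"[pq]\d+(?:\.\d+)?")


def classify_imbalance(term: str) -> dict:
    """Describe a simple deletion or duplication term by type and extent."""
    match = _TERM.fullmatch(term.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a simple del/dup term: {term!r}")
    kind, chromosome, breakpoints = match.groups()
    bands = _BAND.findall(breakpoints)
    if len(bands) == 1:
        extent = "terminal"       # one break; everything distal is lost or gained
    elif len(bands) == 2:
        extent = "interstitial"   # two breaks; only the segment between them
    else:
        raise ValueError(f"unexpected number of breakpoints in {term!r}")
    return {
        "type": "deletion" if kind == "del" else "duplication",
        "chromosome": chromosome,
        "extent": extent,
        "breakpoints": bands,
    }


print(classify_imbalance("del(5)(q22q33)"))
print(classify_imbalance("dup(1)(q21q25)"))
```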

2.3.5 Intrachromosomal and interchromosomal insertions Insertions, or insertional translocations, are chromosomal aberrations with a genomic fragment inserted into a nonhomologous chromosome (interchromosomal rearrangements) or into a different locus (the same or other arm) on the same chromosome or the other homolog (intrachromosomal rearrangements). As at least three breaks in the chromosomes are required for insertion formation, they qualify as CCRs. Similar to other aberrations, insertions can be de novo or inherited (familial) events and balanced or unbalanced. When the inserted fragment is situated in the same orientation, the insertion is referred to as direct (dir ins). For example, insertion of chromosomal region 2q21q31 into 2p13 is described as ins(2)(p13q21q31). When the same fragment is inserted in opposite orientation, the insertion is referred to as inverted (inv ins) and designated as ins(2)(p13q31q21). Insertion of the same chromosomal region 2q21q31 into chromosome 5p14 is designated as ins(5;2)(p14;q21q31) for dir ins and ins(5;2)(p14;q31q21) for inv ins. By traditional cytogenetic techniques, insertions were considered rare aberrations, with the incidence of microscopically visible aberrations being 1 in 10,000–80,000 live births [33]. However, application of molecular techniques, including CMA and FISH (the latter can also determine the genomic position or location of the additional material), revealed a much higher frequency, at approximately 1 in 500 individuals tested [34,35]. Recent preliminary studies show that submicroscopic insertions are much more frequent than previously thought; however, many of the small events are most likely polymorphic. The real incidence and scale of insertion events await more systematic analyses using NGS techniques. An abnormal phenotype in patients with an insertion can result from (i) increased copy number of a dosage-sensitive gene(s), (ii) gene disruption or deletion, or (iii) abnormal gene expression due to position effects of a flanking gene. It has been demonstrated that ~2.1% of apparently de novo, interstitial CNV gains were imbalances inherited from parents with balanced insertions [36]. Initially, insertions were thought to arise from random breaks and nonhomologous end-joining (NHEJ) events. However, recent detailed molecular analyses of apparently balanced insertions revealed submicroscopic imbalances (both deletions and duplications) with microhomologies at the breakpoint junctions. These findings, resembling chromothripsis-like or chromoanasynthesis events, have implied replication-based mechanism(s) and iterative template switches as more plausible mechanisms of their formation (see below) [37,38].
Chromothripsis, initially identified in cancer, refers to the shattering of a chromosome with random reassembly of tens to thousands of pieces by NHEJ events [39]. In contrast, chromoanasynthesis is described as chromosome reconstitution or reassortment due to DNA repair errors during the DNA replication process [40].

2.3.6 Isochromosomes Isochromosomes are mirror image chromosomes that arise when one part of the chromosome is duplicated and separated from the other (U-type exchange). Isochromosomes can be monocentric, when the breakpoint occurs within the centromere (centromere misdivision, centric fission), or dicentric, when the chromosomal break maps outside the centromere. Isodicentrics, also sometimes referred to as inverted duplication chromosomes, are unstable unless one of the centromeres becomes inactivated (pseudodicentrics, pseudoisodicentrics). Isochromosomes should be distinguished from interhomologous translocations (see RTs above). The most common is isochromosome of the long arms of chromosome X, i(Xq). Constitutional nonmosaic autosomal isochromosomes are usually lethal. Exceptions are isochromosome 12p or i(12p) identified in patients with Pallister–Killian syndrome, typically detected in cultured fibroblasts but not in lymphocytes, and isochromosome 9p or i(9p). Mechanistically, isochromosomes usually occur due to meiotic nondisjunction error [e.g., i(Xq) in a female with a 46,X,i(X)(q10) karyotype] or NAHR between inverted LCRs [e.g., inv dup(22) in patients with Cat-eye syndrome or inv dup(15)] [41,42], or due to mitotic error [e.g., isochromosome 17q or i(17q), the most common isochromosome in human neoplasia] [43]. Isochromosome i(18p) is usually found with mosaic i(18q), likely representing a fission event.


Isodicentric 15 is a bisatellited supernumerary dicentric chromosome described as idic(15), inv dup(15), or pseudodicentric 15. Its phenotypic manifestations are variable and depend on its size, parental origin, mosaic status, and segregation of the normal chromosomes 15. If the idic(15) contains the Prader–Willi/Angelman syndrome critical regions and is maternal in origin, it is associated with developmental delay, autistic spectrum disorders, or epilepsy [42]. The inv dup(15) has been associated with advanced maternal age and UPD15 as a likely by-product of nondisjunction and trisomy rescue. When nonallelic LCRs serve as NAHR substrates, the resulting chromosome is asymmetric in shape.

2.3.7 Marker chromosomes Constitutional small marker chromosomes (SMCs) are chromosomes that are equal or smaller in size than chromosome 20 in the same metaphase spread, and cannot be unambiguously identified by conventional banding techniques alone. In most cases, SMCs are supernumerary (sSMCs) and typically exist in a mosaic state because of their mitotic instability. They can be metacentric (incl. isochromosomes, 35%), acrocentric (with satellites), or dicentric in shape. SMCs can form different chromosomal structures, for example, inverted duplicated chromosomes, complex rearranged chromosomes, minutes, rings, or neocentric chromosomes [44]. The ring structure predisposes them to mitotic instability (e.g., double-ring structure formation). Due to their small size and heterogeneous morphology, SMCs are difficult to characterize by conventional cytogenetic methods. Development of molecular cytogenetic techniques, that is, FISH, chromosome microdissection with reverse painting, and CMA, enabled identification and detailed characterization of many SMCs that were challenging to characterize by conventional chromosome analysis. Once the chromosomal origin of an SMC is recognized, it is referred to as a derivative chromosome (der). The most commonly detected sSMCs are derived from autosomal chromosomes, for example, 15 (accounts for more than a half of all marker chromosomes), 22, 12, or 18. Nonacrocentric autosomal markers are rare and found in only B15% of cases. SMCs originating from sex chromosomes are typically detected as nonsupernumerary in female patients with Turner syndrome with a mosaic karyotype designated as 46,XX/46,X, 1 mar. Importantly, when the origin of the marker chromosome in patients with Turner syndrome is determined as being derived from the Y chromosome, it poses an increased risk of gonadoblastoma, necessitating removal of the undeveloped gonads (see above). 
sSMCs are detected with a frequency of ~0.24/1000 in newborns, ~0.4–1.5/1000 in prenatal studies, and in ~0.5/1000 individuals in the general population. The prevalence among phenotypically abnormal individuals is much higher and was estimated as ~2–3/1000 [45–48]. Except for inv dup(15) and inv dup(22) (see Section 2.3.6), for which phenotypic consequences are now well known, most SMCs require more detailed cytogenetic and clinical data for better insight into karyotype-phenotype correlations. Interpretation of their clinical consequences is based on their morphology, size, parental and chromosomal origin, and level of somatic mosaicism [49–51]. The risk of an abnormal phenotype in de novo cases has been estimated to be up to 28% [47,52]. When inherited, the risk of an abnormal phenotype is considered smaller. The exception to this size rule is marker chromosomes derived from the X chromosome. Small X-chromosome-derived chromosomes that do not harbor the XIST gene, involved in X chromosome inactivation in females, are associated with a more severe phenotype than those that contain XIST, due to their inability to undergo X-inactivation. Recently, studies using NGS revealed that some SMCs may result from multiple-step mechanisms, including maternal meiotic nondisjunction followed by postzygotic anaphase lagging of the supernumerary chromosome and its subsequent DNA repair-based chromothripsis- or chromoanasynthesis-like events [53,54].

2.3.8 Complex rearrangements and chromothripsis

Complex chromosome rearrangements (CCRs) are rare structural chromosomal abnormalities involving at least two chromosomes and more than two breakpoints [55–57]. Traditionally, CCRs have been categorized according to the number of breakpoints. Those with four or fewer breaks, for example, the most common three-way translocations with one breakpoint in each chromosome, belong to group I; group II includes those with more than four breaks [58]. The vast majority of CCRs have been found to have arisen as de novo events, whereas in the rarer familial cases, CCRs are typically transmitted to progeny through more permissive oogenesis. Balanced CCRs are often associated with abnormal reproduction (e.g., recurrent spontaneous abortions) and not uncommonly with abnormal phenotypes, including developmental delay, dysmorphisms, and congenital anomalies. These clinical consequences most likely result from chromosome breaks disrupting dosage-sensitive genes or their regulatory sequences (position effect), or the presence of submicroscopic CNVs. Application of molecular cytogenetic techniques, including FISH and later CMA, revealed that many CCRs are actually much more complex than initially ascertained. A substantial fraction of them were found to have additional cryptic complexities or genomic imbalances such as loss, gain, or inversion at or near the breakpoints. In reproductive counseling, estimation of the recurrence risk for liveborn children with abnormal phenotypes or pregnancy losses depends on the size and location of the rearranged chromosomal fragments and the mode of meiotic segregation (symmetrical or asymmetrical) [59]. As recently as 2011, accumulated data obtained using NGS techniques enabled further insight into the genomic structure of some CCRs [39,40,60–63].
Analyses of DNA samples from individuals with cancer and from individuals with congenital disorders revealed massive numbers (dozens, hundreds, or thousands) of clustered chromosomal rearrangements that likely occurred during a single catastrophic event and were confined to one or a few chromosomes. Depending on the underlying molecular mechanism, these events were termed chromothripsis or chromoanasynthesis as described above [64].
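The breakpoint-based grouping of CCRs described in this section can be written down as a small rule. The following is a minimal sketch, assuming the definition and the group I/group II cut-off exactly as stated in the text; the function name and its arguments are hypothetical and chosen only for illustration.

```python
def classify_ccr(num_chromosomes: int, num_breakpoints: int) -> str:
    """Return 'group I' or 'group II' for a complex chromosome rearrangement."""
    # Definition used in the text: at least two chromosomes and
    # more than two breakpoints are required for a CCR.
    if num_chromosomes < 2 or num_breakpoints <= 2:
        raise ValueError("not a CCR: needs >= 2 chromosomes and > 2 breakpoints")
    # Group I: four or fewer breaks; group II: more than four breaks.
    return "group I" if num_breakpoints <= 4 else "group II"


# A three-way translocation with one breakpoint in each of three chromosomes.
print(classify_ccr(num_chromosomes=3, num_breakpoints=3))
```

The count of breakpoints seen by karyotype is a lower bound, since FISH, CMA, and NGS often reveal additional cryptic breaks, so a rearrangement can move from group I to group II after higher-resolution testing.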

2.4 Uniparental disomy and genomic imprinting

UPD is a phenomenon of copy-number-neutral lack of biparental transmission of a chromosome or its fragment from the biological parents [65,66]. When the two different homologs of one parent are both inherited, they are referred to as heterodisomic. In contrast, two homologous chromosomes are isodisomic when they derive from a single parental homolog (typically the result of monosomy rescue) [67,68]. UPD is usually identified molecularly using either SNP arrays (isodisomy) or methylation-sensitive PCR-based analysis of informative polymorphic markers.


The most common cause of UPD is meiotic chromosome nondisjunction, resulting in nullisomic or disomic gametes and then, respectively, zygotic monosomy or trisomy of a particular chromosome. Early postzygotic random trisomy rescue events, for example, mitotic nondisjunction due to a failure of segregation of homologous chromosomes or sister chromatids during meiotic or mitotic cell division, can restore disomic status. However, in one-third of such cases, UPD is formed. Reciprocal nullisomy or monosomy can be rescued, respectively, through gametic complementation, that is, union of aneuploid gametes that complement one another, or zygotic chromosome duplication. When nondisjunction occurs in meiosis I, pericentromeric regions will be heterodisomic, whereas distal fragments can be isodisomic due to crossing-over events. Conversely, the presence of pericentromeric isodisomy and stretches of distal heterodisomy indicates nondisjunction in meiosis II. As the frequency of nondisjunctions positively correlates with advanced maternal age, maternal UPD is usually heterodisomic, whereas paternal UPD is isodisomic in most cases. Also, somatic recombination events during postzygotic mitotic divisions leading to segmental UPD can be clinically relevant, as described, for example, in some patients with Beckwith-Wiedemann syndrome (Table 2.2). Recently, a novel DNA replication-based mutational mechanism of formation of segmental uniparental isodisomy due to template switching invasion of the homologous chromosome has been described for runs of distal homozygosity identified adjacent to de novo CNV triplications mapping next to copy number neutral genomic intervals [69,70]. A few coding and noncoding disease genes are expressed from one allele only in a parent-of-origin-specific manner. This non-Mendelian epigenetic phenomenon of inheritance is referred to as genomic imprinting (silencing of transcription).
It typically utilizes DNA methylation or histone methylation without altering the nucleotide DNA sequence. The epigenetic marks are established ("imprinted") in the parental germlines and are typically maintained through mitotic cell divisions in the offspring.

Table 2.2 Summary of the known genomic imprinting disorders due to uniparental disomy.

| Chr | Parental UPD | Phenotype | Chr region | Gene |
|---|---|---|---|---|
| 6 | Paternal | Transient neonatal diabetes | 6q24 | PLAGL1 |
| 7 | Maternal | Silver-Russell syndrome (7%–10%) | 7p12, 7q11.2-qter, 7q31-qter | GRB10 |
| 11 | Maternal | Silver-Russell syndrome (35%–50%) | 11p15.5 | H19/IGF2 imprinting center 1 |
| 11 | Paternal | Beckwith-Wiedemann syndrome | 11p15.5 | p57, CDKN1C, H19, KCNQ1OT1 |
| 14 | Maternal | Temple syndrome | 14q32 | DLK1/GTL2 |
| 14 | Paternal | Kagami-Ogata | 14q32 | DLK1/GTL2 |
| 15 | Maternal | Prader-Willi syndrome (20%–30%) | 15q11.2-q12 | snoRNA gene cluster |
| 15 | Paternal | Angelman syndrome (7%) | 15q11.2-q12 | UBE3A (tissue specific) |
| 16 | Maternal | IUGR and variety of congenital malformations | (Mosaic trisomy 16) | |
| 20 | Paternal | Pseudohypoparathyroidism type 1B | 20q13.32 | GNAS1 |
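The one-third figure quoted earlier in this section for UPD after trisomy rescue follows from simple counting. The sketch below is an illustration only: it assumes a trisomic zygote carrying two homologs from one parent and one from the other, and it assumes that each of the three homologs is equally likely to be lost. The function name and the homolog labels are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def trisomy_rescue_outcomes(homologs):
    """Enumerate the disomic pairs that remain after losing one of three homologs.

    `homologs` is a list of (parent, label) tuples describing the trisomic zygote.
    """
    outcomes = []
    for kept in combinations(homologs, 2):
        parents = {parent for parent, _ in kept}
        # Both remaining homologs from one parent means uniparental disomy.
        kind = "UPD" if len(parents) == 1 else "biparental"
        outcomes.append((kept, kind))
    return outcomes


# Two maternal homologs (from a disomic oocyte) plus one paternal homolog.
zygote = [("mat", "M1"), ("mat", "M2"), ("pat", "P1")]
outcomes = trisomy_rescue_outcomes(zygote)
upd_fraction = Fraction(sum(kind == "UPD" for _, kind in outcomes), len(outcomes))
print(upd_fraction)  # 1/3
```

Only the loss of the single paternal homolog leaves two maternal copies, which is one of three equally weighted outcomes under the stated assumption.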


UPD has been demonstrated for all chromosomes; however, for most chromosomes, it has no clinical consequences. It can manifest clinically when the disomic (hetero- or iso-) regions harbor imprinted gene(s) or when an isodisomic interval encompasses a pathogenic variant within an autosomal recessive disease gene. In addition, a remaining (undetected) low-level mosaic trisomy (that was rescued) can also contribute to an abnormal phenotype [e.g., in maternal UPD(16)] [71,72]. Of note, a few cases of mosaic genome-wide paternal uniparental isodisomy have been described [73,74]. One should also consider UPD when a patient carries certain numerical or structural chromosome aberrations, for example, RTs, isochromosomes, marker chromosomes, derivative chromosomes, and reciprocal translocations. Notably, UPD is found in ~60% of acrocentric isochromosomes and ~21% of RTs between nonhomologous chromosomes [e.g., 45,XY,upd der(15;15)(q10;q10)pat]. Moreover, mosaic or nonmosaic trisomy for the imprinted chromosomes detected prenatally in CVS increases the disease risk in apparently trisomy-free fetuses [75].
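The rule given earlier in this section for timing the nondisjunction event (pericentromeric heterodisomy points to meiosis I, pericentromeric isodisomy with distal heterodisomy points to meiosis II) can be sketched as a short function. This is a simplified illustration, assuming the UPD chromosome has already been split into segments labeled as heterodisomic or isodisomic; the segment representation, the function name, and the returned labels are hypothetical.

```python
def infer_upd_origin(segments):
    """Suggest the likely origin of a UPD chromosome from its segment states.

    `segments` is a list of (distance_from_centromere_mb, state) tuples,
    where state is 'hetero' (heterodisomy) or 'iso' (isodisomy).
    """
    if not segments:
        raise ValueError("at least one segment is required")
    states = {state for _, state in segments}
    if not states <= {"hetero", "iso"}:
        raise ValueError("state must be 'hetero' or 'iso'")
    # Isodisomy along the whole chromosome fits duplication of a single homolog.
    if states == {"iso"}:
        return "complete isodisomy (consistent with monosomy rescue)"
    # Otherwise the segment nearest the centromere decides the meiotic stage.
    nearest_state = min(segments, key=lambda seg: seg[0])[1]
    if nearest_state == "hetero":
        return "meiosis I nondisjunction"
    return "meiosis II nondisjunction"


print(infer_upd_origin([(0.0, "hetero"), (40.0, "iso")]))
print(infer_upd_origin([(0.0, "iso"), (40.0, "hetero")]))
```

The pericentromeric segment is the informative one because crossing-over can change the state of distal regions but rarely affects the region next to the centromere.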

2.5 Clinical indications and special considerations for chromosome analysis

In the era of molecular cytogenetics and whole-genome NGS, the clinical utility of a conventional chromosome analysis can sometimes be overlooked. Not only is a karyotype a more economical approach to the detection of certain suspected chromosomal abnormalities, but there are also limitations to the detection of structural chromosome abnormalities that have yet to be fully resolved with modern NGS approaches [76]. Clinical indications where chromosome analysis is warranted include: phenotypic features typical for a specific chromosome syndrome; short stature, delayed puberty, amenorrhea, and ambiguous genitalia (to rule out sex chromosome abnormalities); stillbirth, neonatal death, and products of conception (to rule out aneuploidy); couples with a history of infertility or recurrent miscarriage (to rule out balanced chromosomal rearrangements); and a family history of a chromosome abnormality detected by cytogenetic methods. A CMA, meanwhile, is recommended for individuals with autism spectrum disorders, developmental delay, and those with congenital anomalies that are not well defined by a known syndrome. It has also been appreciated for some time that karyotype analysis is useful in conjunction with CMA to identify balanced abnormalities, such as translocations and inversions, that CMA is not designed to detect. A number of laboratories now offer the option to order a combined CMA with abbreviated (five cells) chromosome analysis as an adjunct test to identify balanced rearrangements. The value of this combination is exemplified by the detection of aneuploidy by CMA. When CMA identifies a trisomy of an acrocentric chromosome (namely, 13 and 21), this could be the result of either a free trisomy (which is most likely sporadic) or a parental RT (associated with increased recurrence risk) (see above). To make this important distinction, chromosome studies are necessary.
This concept applies also to the use of NGS techniques for detection of copy number imbalances, where confirmatory FISH and/or chromosome studies may be warranted to accurately determine the recurrence risk in cases involving a copy number abnormality. In conclusion, chromosome analysis remains helpful and, in some cases, necessary for maximum diagnostic yield for patients and their families.
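The follow-up reasoning in this section can be summarized as a toy decision rule. The sketch below only restates the two situations named in the text (a trisomy 13 or 21 found by CMA, and a copy number imbalance found by CMA or NGS); the function name, the finding labels, and the returned strings are hypothetical and are not drawn from any laboratory guideline.

```python
# The two acrocentric trisomies named in the text.
ACROCENTRIC_TRISOMIES = {"13", "21"}


def suggested_follow_up(finding: str, chromosome: str) -> str:
    """Map a CMA/NGS finding to the cytogenetic follow-up discussed in the text."""
    if finding == "trisomy" and chromosome in ACROCENTRIC_TRISOMIES:
        # A free trisomy and a Robertsonian translocation look alike by dosage.
        return "chromosome analysis to distinguish free trisomy from an RT"
    if finding in {"copy number gain", "copy number loss"}:
        # Dosage data alone do not show where the extra or missing material sits.
        return "consider FISH and/or chromosome analysis to assess recurrence risk"
    return "not covered by this sketch"


print(suggested_follow_up("trisomy", "21"))
print(suggested_follow_up("copy number gain", "17"))
```

Both branches rest on the same point made in the text: copy number methods report dosage, whereas recurrence risk depends on the structural arrangement that only chromosome-level studies reveal.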

References


[1] Miller DT, Adam MP, Aradhya S, et al. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. Am J Hum Genet 2010;86(5):749–64.
[2] Jacobs PA, Baikie AG, Court Brown WM, Strong JA. The somatic chromosomes in mongolism. Lancet 1959;1(7075):710.
[3] Lejeune J, Turpin R, Gautier M. Mongolism: a chromosomal disease (trisomy). Bull Acad Natl Med 1959;143(11-12):256–65.
[4] Hassold T, Abruzzo M, Adkins K, et al. Human aneuploidy: incidence, origin, and etiology. Env Mol Mutagen 1996;28(3):167–75.
[5] Hassold T, Hunt P. To err (meiotically) is human: the genesis of human aneuploidy. Nat Rev Genet 2001;2(4):280–91.
[6] Nagaoka SI, Hassold TJ, Hunt PA. Human aneuploidy: mechanisms and new insights into an age-old problem. Nat Rev Genet 2012;13(7):493–504.
[7] Bunt CW, Bunt SK. Role of the family physician in the care of children with Down syndrome. Am Fam Phys 2014;90(12):851–8.
[8] Wiktor AE, Bender G, Van Dyke DL. Identification of sex chromosome mosaicism: is analysis of 20 metaphase cells sufficient? Am J Med Genet A 2009;149A(2):257–9.
[9] Hook EB. Exclusion of chromosomal mosaicism: tables of 90%, 95% and 99% confidence limits and comments on use. Am J Hum Genet 1977;29(1):94–7.
[10] Skuse D, Printzlau F, Wolstencroft J. Sex chromosome aneuploidies. Handb Clin Neurol 2018;147:355–76.
[11] Hong DS, Reiss AL. Cognitive and neurological aspects of sex chromosome aneuploidies. Lancet Neurol 2014;13(3):306–18.
[12] Cools M, Drop SL, Wolffenbuttel KP, Oosterhuis JW, Looijenga LH. Germ cell tumors in the intersex gonad: old paths, new directions, moving frontiers. Endocr Rev 2006;27(5):468–84.
[13] Abdelmoula NB, Amouri A, Portnoi MF, et al. Cytogenetics and fluorescence in situ hybridization assessment of sex-chromosome mosaicism in Klinefelter’s syndrome. Ann Genet 2004;47(2):163–75.
[14] Jacobs PA, Browne C, Gregson N, Joyce C, White H. Estimates of the frequency of chromosome abnormalities detectable in unselected newborns using moderate levels of banding. J Med Genet 1992;29(2):103–8.
[15] Shaffer LG, Lupski JR. Molecular mechanisms for constitutional chromosomal rearrangements in humans. Annu Rev Genet 2000;34:297–329.
[16] De Gregori M, Ciccone R, Magini P, et al. Cryptic deletions are a common finding in “balanced” reciprocal and complex chromosome rearrangements: a study of 59 patients. J Med Genet 2007;44(12):750–62.
[17] Nilsson D, Pettersson M, Gustavsson P, et al. Whole-genome sequencing of cytogenetically balanced chromosome translocations identifies potentially pathological gene disruptions and highlights the importance of microhomology in the mechanism of formation. Hum Mutat 2017;38(2):180–92.
[18] Carvalho CM, Lupski JR. Mechanisms underlying structural variant formation in genomic disorders. Nat Rev Genet 2016;17(4):224–38.
[19] Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med 2010;61:437–55.
[20] Talkowski ME, Rosenfeld JA, Blumenthal I, et al. Sequencing chromosomal abnormalities reveals neurodevelopmental loci that confer risk across diagnostic boundaries. Cell 2012;149(3):525–37.
[21] Suzuki T, Tsurusaki Y, Nakashima M, et al. Precise detection of chromosomal translocation or inversion breakpoints by whole-genome sequencing. J Hum Genet 2014;59(12):649–54.


[22] Weckselblatt B, Rudd MK. Human structural variation: mechanisms of chromosome rearrangements. Trends Genet 2015;31(10):587–99.
[23] Robinson WP, Bernasconi F, Basaran S, et al. A somatic origin of homologous Robertsonian translocations and isochromosomes. Am J Hum Genet 1994;54(2):290–302.
[24] Berend SA, Horwitz J, McCaskill C, Shaffer LG. Identification of uniparental disomy following prenatal detection of Robertsonian translocations and isochromosomes. Am J Hum Genet 2000;66(6):1787–93.
[25] Sullivan BA, Jenkins LS, Karson EM, Leana-Cox J, Schwartz S. Evidence for structural heterogeneity from molecular cytogenetic analysis of dicentric Robertsonian translocations. Am J Hum Genet 1996;59:167–75.
[26] Page SL, Shin J-C, Han J-Y, Choo KHA, Shaffer LG. Breakpoint diversity illustrates distinct mechanisms for Robertsonian translocation formation. Hum Mol Genet 1996;5:1279–88.
[27] Han J-Y, Choo KHA, Shaffer LG. Molecular cytogenetic characterization of 17 rob(13q14q) Robertsonian translocations by FISH, narrowing the region containing the breakpoints. Am J Hum Genet 1994;55:960–7.
[28] Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
[29] Cooper GM, Coe BP, Girirajan S, et al. A copy number variation morbidity map of developmental delay. Nat Genet 2011;43(9):838–46.
[30] Girirajan S, Brkanac Z, Coe BP, et al. Relative burden of large CNVs on a range of neurodevelopmental phenotypes. PLoS Genet 2011;7(11):e1002334.
[31] Lupski JR. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet 1998;14(10):417–22.
[32] Lindsay SJ, Khajavi M, Lupski JR, Hurles ME. A chromosomal rearrangement hotspot can be identified from population genetic variation and is coincident with a hotspot for allelic recombination. Am J Hum Genet 2006;79(5):890–902.
[33] Van Hemel JO, Eussen HJ. Interchromosomal insertions. Identification of five cases and a review. Hum Genet 2000;107(5):415–32.
[34] Kang SH, Shaw C, Ou Z, et al. Insertional translocation detected using FISH confirmation of array-comparative genomic hybridization (aCGH) results. Am J Med Genet A 2010;152A(5):1111–26.
[35] Neill NJ, Ballif BC, Lamb AN, et al. Recurrence, submicroscopic complexity, and potential clinical relevance of copy gains detected by array CGH that are shown to be unbalanced insertions by FISH. Genome Res 2011;21(4):535–44.
[36] Nowakowska BA, de Leeuw N, Ruivenkamp CA, et al. Parental insertional balanced translocations are an important cause of apparently de novo CNVs in patients with developmental anomalies. Eur J Hum Genet 2012;20(2):166–70.
[37] Gu S, Yuan B, Campbell IM, et al. Alu-mediated diverse and complex pathogenic copy-number variants within human chromosome 17 at p13.3. Hum Mol Genet 2015;24(14):4061–77.
[38] Lindstrand A, Eisfeldt J, Pettersson M, et al. From cytogenetics to cytogenomics: whole-genome sequencing as a first-line test comprehensively captures the diverse spectrum of disease-causing genetic variation underlying intellectual disability. Genome Med 2019;11(1):68.
[39] Stephens PJ, Greenman CD, Fu B, et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 2011;144:27–40.
[40] Liu P, Erez A, Nagamani SC, et al. Chromosome catastrophes involve replication mechanisms generating complex genomic rearrangements. Cell 2011;146:889–903.
[41] McDermid HE, Morrow BE. Genomic disorders on 22q11. Am J Hum Genet 2002;70(5):1077–88.
[42] Wang NJ, Parokonny AS, Thatcher KN, Driscoll J, Malone BM, Dorrani N, et al. Multiple forms of atypical rearrangements generating supernumerary derivative chromosome 15. BMC Genet 2008;9:2.


[43] Barbouti A, Stankiewicz P, Nusbaum C, et al. The breakpoint region of the most common isochromosome, i(17q), in human neoplasia is characterized by a complex genomic architecture with large, palindromic, low-copy repeats. Am J Hum Genet 2004;74(1):1–10.
[44] Liehr T. ChromoSomics database. Available from: http://cs-tl.de.
[45] Buckton KE, Spowart G, Newton MS, Evans HJ. Forty four probands with an additional “marker” chromosome. Hum Genet 1985;69:353–70.
[46] Sachs ES, Van Hemel JO, Den Hollander JC, Jahoda MGJ. Marker chromosomes in a series of 10,000 prenatal diagnoses. Cytogenetic and follow-up studies. Prenat Diagn 1987;7:81–9.
[47] Warburton D. De novo balanced chromosome rearrangements and extra marker chromosomes identified at prenatal diagnosis: clinical significance and distribution of breakpoints. Am J Hum Genet 1991;49:995–1013.
[48] Liehr T, Claussen U, Starke H, et al. Small supernumerary marker chromosomes (sSMC) in humans. Cytogenet Genome Res 2004;107:55–67.
[49] Blennow E, Nielsen KB, Telenius H, et al. Fifty probands with extra structurally abnormal chromosomes characterized by fluorescence in situ hybridization. Am J Med Genet 1995;55:85–94.
[50] Starke H, Nietzel A, Weise A, et al. Small supernumerary marker chromosomes (SMCs): genotype-phenotype correlation and classification. Hum Genet 2003;114:51–67.
[51] Liehr T, Mrasek K, Weise A, et al. Small supernumerary marker chromosomes - progress towards a genotype-phenotype correlation. Cytogenet Genome Res 2006;112:23–34.
[52] Crolla JA. FISH and molecular studies of autosomal supernumerary marker chromosomes excluding those derived from chromosome 15: II. Review of the literature. Am J Med Genet 1998;75:367–81.
[53] Grochowski CM, Gu S, Yuan B, et al. Marker chromosome genomic structure and temporal origin implicate a chromoanasynthesis event in a family with pleiotropic psychiatric phenotypes. Hum Mutat 2018;39(7):939–46.
[54] Kurtas NE, Xumerle L, Leonardelli L, et al. Small supernumerary marker chromosomes: a legacy of trisomy rescue? Hum Mutat 2019;40(2):193–200.
[55] Pai GS, Thomas GH, Mahoney W, Migeon BR. Complex chromosome rearrangements. Clin Genet 1980;18:436–44.
[56] Kleczkowska A, Fryns JP, Van Den Berghe H. Complex chromosomal rearrangements (CCR) and their genetic consequences. J Genet Hum 1982;30:199–214.
[57] Zhang F, Carvalho CM, Lupski JR. Complex human chromosomal and genomic rearrangements. Trends Genet 2009;25(7):298–307.
[58] Kousseff BG, Nichols P, Essig YP, Miller K, Weiss A, Tedesco TA. Complex chromosome rearrangements and congenital anomalies. Am J Med Genet 1987;26:771–82.
[59] Batista DAS, Shashidhar Pai G, Stetten G. Molecular analysis of complex chromosomal rearrangement and review of familial cases. Am J Med Genet 1994;53:255–63.
[60] Kloosterman WP, Guryev V, van Roosmalen M, Duran KJ, de Bruijn E, Bakker SC, et al. Chromothripsis as a mechanism driving complex de novo structural rearrangements in the germline. Hum Mol Genet 2011;20(10):1916–24.
[61] Kloosterman WP, Hoogstraat M, Paling O, Tavakoli-Yaraki M, Renkens I, Vermaat JS, et al. Chromothripsis is common mechanism driving genomic rearrangements in primary and metastatic colorectal cancer. Genome Biol 2011;12(10):R103.
[62] Chiang C, Jacobsen JC, Ernst C, et al. Complex reorganization and predominant non-homologous repair following chromosomal breakage in karyotypically balanced germline rearrangements and transgenic integration. Nat Genet 2012;44(4):390–7.
[63] Crasta K, Ganem NJ, Dagher R, Lantermann AB, Ivanova EV, Pan Y, et al. DNA breaks and chromosome pulverization from errors in mitosis. Nature 2012;482(7383):53–8.


[64] Holland AJ, Cleveland DW. Chromoanagenesis and cancer: mechanisms and consequences of localized, complex chromosomal rearrangements. Nat Med 2012;18(11):1630–8.
[65] Engel E. A new genetic concept: uniparental disomy and its potential effect, isodisomy. Am J Med Genet 1980;6(2):137–43.
[66] Spence JE, Perciaccante RG, Greig GM, et al. Uniparental disomy as a mechanism for human genetic disease. Am J Hum Genet 1988;42(2):217–26.
[67] Ledbetter DH, Engel E. Uniparental disomy in humans: development of an imprinting map and its implications for prenatal diagnosis. Hum Mol Genet 1995;4:1757–64.
[68] Kotzot D. Abnormal phenotypes in uniparental disomy (UPD): fundamental aspects and a critical review with bibliography of UPD other than 15. Am J Med Genet 1999;82(3):265–74.
[69] Carvalho CM, Pfundt R, King DA, et al. Absence of heterozygosity due to template switching during replicative rearrangements. Am J Hum Genet 2015;96(4):555–64.
[70] Carvalho CMB, Coban-Akdemir Z, Hijazi H, et al. Interchromosomal template-switching as a novel molecular mechanism for imprinting perturbations associated with Temple syndrome. Genome Med 2019;11(1):25.
[71] Eggermann T, Curtis M, Zerres K, Hughes HE. Maternal uniparental disomy 16 and genetic counseling: new case and survey of published cases. Genet Couns 2004;15(2):183–90.
[72] Schulze KV, Szafranski P, Lesmana H, et al. Novel parent-of-origin-specific differentially methylated loci on chromosome 16. Clin Epigenetics 2019;11(1):60.
[73] Wilson M, Peters G, Bennetts B, McGillivray G, Wu ZH, Poon C, et al. The clinical phenotype of mosaicism for genome-wide paternal uniparental disomy: two new reports. Am J Med Genet A 2008;146A(2):137–48.
[74] Kalish JM, Conlin LK, Bhatti TR, et al. Clinical features of three girls with mosaic genome-wide paternal uniparental isodisomy. Am J Med Genet A 2013;161A(8):1929–39.
[75] Shaffer LG, McCaskill C, Adkins K, Hassold TJ. Systematic search for uniparental disomy in early fetal losses: the results and a review of the literature. Am J Med Genet 1998;79(5):366–72.
[76] Hochstenbach R, Liehr T, Hastings RJ. Chromosomes in the genomic age. Preserving cytogenomic competence of diagnostic genome laboratories. Eur J Hum Genet 2020, in press. Available from: https://doi.org/10.1038/s41431-020-00780-y.

CHAPTER 3

Genomic disorders in the genomics era

Cinthya J. Zepeda Mendoza¹,² and Claudia Gonzaga-Jauregui³

¹Cytogenetics and Genomic Microarray, ARUP Laboratories, Salt Lake City, UT, United States
²Department of Pathology, University of Utah, Salt Lake City, UT, United States
³International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano (LIIGH), Universidad Nacional Autónoma de México (UNAM), Mexico

3.1 Introduction Six decades ago, the importance of chromosome abnormalities in human disease was illustrated by the identification of trisomies 13, 18, and 21 (see Chapter 2: Karyotyping as the First Genomic Approach). These were linked to Patau, Edwards, and Down syndromes [14], respectively, with recognizable clinical manifestations that facilitated subsequent individual diagnosis, prognosis, and medical management. In addition to numerical abnormalities, pathogenic structural rearrangements were rapidly discovered. Cytogenetic karyotype analysis and in situ hybridization approaches revealed microscopically and submicroscopically visible chromosome insertions, inversions, deletions, duplications, and translocations, many of them associated with diseases known as “genomic disorders” [510]. The term was first coined by Dr. James R. Lupski following the study of the etiology of Charcot-MarieTooth disease type 1 A (CMT1A, MIM #118220) due to recurrent duplications of the 17p12-p11.2 region involving the dosage-sensitive gene, PMP22 [11]. In essence, genomic disorders arise from DNA rearrangements involving relatively large regions of the genome that change their expected order or diploid state through inversion, deletion, or amplification events, as opposed to classically described genetic disorders caused by simple nucleotide variants [8,10]. Following the initial association of 17p11.2 deletions with Smith-Magenis syndrome (SMS, MIM #182290) [12,13], 15q11-q13 deletions with Prader-Willi (PWS, MIM #176270) [14] / Angelman syndromes (AS, MIM #105830) [15], and 17p12-p11.2 duplications with CMT1A [11], the medical genetics field geared toward a systematic assessment of the contribution of submicroscopic deletions and amplifications of genomic regions, denominated as copy-number variants (CNVs), to genomic disease. 
Analysis of subjects with congenital disorders using chromosome microarray analysis and massively parallel sequencing technology (also known as next-generation sequencing, NGS) pinpointed to a number of shared and recurrent, disease-specific duplications or deletions, as well as nonrecurrent, individual-restricted de novo CNVs [8,10,16] (see Table 3.1 for a list of selected examples). Several diseases were soon linked to CNVs, most notably autism [1724], schizophrenia [18,2529], congenital anomalies [3032], and multiple intellectual and developmental disorders (Table 3.1) [3336]. These studies played crucial roles in the Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00004-1 © 2021 Elsevier Inc. All rights reserved.

35

Table 3.1 Selected examples of genomic disorders with recurrent and nonrecurrent chromosome rearrangements. MIM number

SV type

Genomic rearrangement occurrence

Dosagesensitive genes

Major phenotypic features

References

RERE

Characteristic craniofacial features; brachy/ camptodactyly; short feet; developmental delay/ intellectual disability (DD/ID); hypotonia; structural brain anomalies; congenital heart defects

[37]

Nonrecurrent

ATAD3A cluster

Respiratory insufficiency; hypotonia; cardiomyopathy; death in infancy

[38]

Deletion

Recurrent Nonrecurrent

NSD1

Overgrowth; macrocephaly; characteristic facial features; learning disabilities; advanced bone age

[39]

Duplication

Recurrent Nonrecurrent

Microcephaly; short stature; DD; delayed bone maturation;

[40]

Deletion

Recurrent Nonrecurrent

Cardiovascular disease; characteristic facial features; connective tissue abnormalities; ID; unique personality characteristics (including overfriendliness); growth abnormalities; endocrine abnormalities

[41]

Duplication

Recurrent Nonrecurrent

Cardiovascular disease; characteristic facial features; speech disorders; behavioral problems (including social anxiety); ID

[42]

Deletion

Nonrecurrent

Coloboma; heart defects; choanal atresia; retarded growth and development; genital hypoplasia; ear anomalies (including deafness)

[43]

Duplication

Nonrecurrent

Hearing loss; congenital heart defects; ID; hypotonia in infancy; Duane anomaly

[44]

Deletion

Nonrecurrent

Characteristic facial features; ID; autistic-like features; severe expressive speech delay; childhood hypotonia

[45,46]

Duplication

Nonrecurrent

Characteristic facial appearance; DD; speech difficulties; learning disabilities; behavioral problems (autism and ADHD most common)

[47]

Deletion

Nonrecurrent

Characteristic facial features; congenital heart disease; ID; Paris Trousseau bleeding disorder; structural kidney defects; immunodeficiency

[48]

Locus

Genomic disorder

1pterp36.31

1p36 deletion syndrome

607872

Deletion

Nonrecurrent

1p36 duplication syndrome

618815

Duplication

Sotos syndrome

117550

5q35.2q35.3

5q35 duplication syndrome 7q11.23

Williams-Beuren syndrome

194050

7q11.23 duplication syndrome 8q12.2

CHARGE syndrome

214800

8q12 duplication syndrome 9q34.2

Kleefstra syndrome

610253

9q34 duplication syndrome 11q

Jacobsen syndrome

147791

ELN

CHD7

EHMT1

FLI1, ETS-1

15q11q13

16p11.2

17p11.2

17p12

22q11.2

Prader-Willi syndrome

176270

Deletion (paternal)

Recurrent

SNRPN

Severe hypotonia and feeding difficulties in early infancy; excessive eating; morbid obesity (if eating is not externally controlled); delayed motor and language development; cognitive impairment; distinctive behavioral anomalies; hypogonadism

[49,50]

Angelman syndrome

105830

Deletion (maternal)

Recurrent

UBE3A

Severe DD/ID; severe speech impairment; gait ataxia and/or tremulousness of the limbs; distinctive behavior (happy demeanor with frequent laughing and excitability); microcephaly; seizures

[50,51]

15q11-q13 duplication syndrome

608636

Duplication (maternal)

Recurrent Nonrecurrent

Hypotonia; motor delays; ID; autism; epilepsy

[50,52]

16p11.2 deletion syndrome

611913

Deletion

Recurrent Nonrecurrent

DD/ID; autism; macrocephaly; structural brain anomalies

[20,24,53]

16p11.2 duplication syndrome

614671

Duplication

Recurrent Nonrecurrent

DD/ID; autism; microcephaly; speech and language delays

[24,53]

Smith-Magenis syndrome

182290

Deletion

Recurrent Nonrecurrent

RAI1

Characteristic facial features that progress with age; DD; cognitive impairment; behavioral abnormalities; sleep disturbance; childhood-onset abdominal obesity

Potocki-Lupski syndrome

610883

Duplication

Recurrent Nonrecurrent

RAI1

DD/ID; behaviorally abnormalities; autism; hypotonia; congenital heart disease; hypoglycemia associated with growth hormone deficiency; mildly dysmorphic facial features

[54]

Hereditary neuropathy with liability to pressure palsies (HNPP)

162500

Deletion

Recurrent Nonrecurrent

PMP22

Recurrent acute sensory and motor neuropathy in a single or multiple nerves

[55,56]

Charcot-MarieTooth disease type 1A

118220

Duplication

Recurrent Nonrecurrent

PMP22

Progressive, symmetric distal motor neuropathy of the arms and legs beginning in the first to third decade of life; distal muscle weakness and atrophy; weak ankle dorsiflexion; depressed tendon reflexes; pes cavus foot deformity

[11,56]

DiGeorge/ Velocardiofacial syndrome

188400

Deletion

Recurrent Nonrecurrent

TBX1, others

Congenital heart disease; palatal abnormalities; immune deficiency; characteristic facial features; learning difficulties

[57,58]

22q11.2 duplication syndrome

608363

Duplication

Recurrent Nonrecurrent

DD/ID; short stature; hypotonia; palatal anomalies; variable heart defects

[59,60]

ADHD, Attention deficit hyperactivity disorder; DD, developmental delay; ID, intellectual disability.


characterization of multiple dosage-sensitive genes, transcriptional regulation, as well as novel mechanisms of rearrangement generation [7]. CNVs are included in a broader category of genomic variation called “structural variants” (SVs), which include genomic insertion, inversion, deletion, and duplication events greater than 50 base pairs (bp) in size, as well as chromosome translocations [61]. Importantly, structural variation is a common feature of the human genome and is not always associated with genomic disorders. It was not until 2004 that the extent of polymorphic copy-number and structural variants in the human genome was recognized [62,63]. Since then, the application of genomic technologies has revealed extensive structural variation in personal genomes, accounting for 5%–10% of interindividual genomic variation, as well as the complex architecture of the human genome. While the study of genomic disorders brought banded karyotyping, fluorescence in situ hybridization (FISH), and high-resolution chromosome microarray to the clinic as first-tier testing strategies in congenital disease [64–66], NGS technologies have steadily come to dominate the medical genetics field in the last decade. Given NGS’s ability to detect SVs and single-nucleotide variants (SNVs) with a high degree of sensitivity and specificity, the number of clinicians and laboratories that order, perform, and interpret clinical NGS tests is on the rise. Individuals with apparently normal molecular and cytogenetic observations have benefitted the most from this transition, as we recognize the increasing number of disease-associated abnormalities that fall below the scope and resolution levels of classic techniques [67]. The constant enhancement of chromosome microarray and NGS technologies has led to faster, better, and more economical options in clinical genetic testing in less than two decades, with important insights into the molecular mechanisms leading to structural variation and genomic disorders.

3.2 Chromosomal microarray analysis for copy-number variant detection and diagnosis of genomic disorders

Chromosomal microarray analysis (CMA), first introduced into clinical diagnostics in 2004 [68], has become a first-tier test in the diagnosis of many rare genetic disorders. At its core, CMA is based on the molecular technique known as array comparative genomic hybridization (aCGH) [69], which relies on the base complementarity of DNA to perform competitive hybridization between a test or proband’s DNA and a control or reference DNA from apparently healthy individuals. The two samples are labeled with different fluorophores, and competitive hybridization is then performed on a solid array platform containing probes linked to the surface of a slide and tiled to interrogate the whole genome or specific regions of interest. After extended hybridization, the array is washed and scanned using laser wavelengths that excite each of the fluorophores, and the signal intensity of each probe in the array is measured. The images obtained are then processed, merged, and analyzed. If the signal intensity for a particular probe is greater when exciting the fluorophore of the proband’s DNA, it indicates that more test DNA was bound and there is a copy-number gain of that probe in the test DNA. Conversely, if the signal intensity for a different probe is greater when exciting the fluorophore of the control DNA, then this indicates a copy-number loss of that region in the test DNA. If the signal intensity for both fluorophores is about the same for a given probe, then it is


FIGURE 3.1 Visualization of array comparative genomic hybridization (aCGH) to identify genomic rearrangements. The upper panel shows the recurrent duplication in the 17p12-11.2 region that includes the dosage-sensitive gene PMP22. Duplication of this region is the most common molecular cause of Charcot-Marie-Tooth disease type 1A (CMT1A). The lower panel shows the reciprocal deletion of the same region, which causes the disorder Hereditary Neuropathy with Liability to Pressure Palsies (HNPP). These recurrent and mirror rearrangements are mediated by nonallelic homologous recombination (NAHR) between pairs of low-copy repeats (LCRs) in direct orientation that flank the region.

assumed that there is no copy-number change between the proband’s and control’s DNA for the region interrogated by the oligonucleotide probe [70,71] (Fig. 3.1). Initially, the grid array platforms used large fragments of DNA cloned into bacterial artificial chromosomes (BACs) to tile the genome and interrogate for CNVs [72]; however, this made the calculation of CNV sizes and boundaries difficult and inaccurate, with sizes often overestimated. With the development of aCGH platforms, and a reference human genome sequence to design probes, came the tiling of the genome using covalently bound oligonucleotide (~60-mer) probes that could be selected to tile the whole genome or specific regions at higher resolution; the major constraint in genome resolution was linked to the number of probes utilized per array design, that is, the number of “pixels” used for genome resolution. The performance of the oligonucleotide probes can vary depending on their GC content, and therefore it is usually necessary to have a change in the same direction in the intensity of at least five consecutive probes in order to be able to detect a signal


representing a true CNV by aCGH. The technique has proven to be very accurate and successful at identifying large CNVs; however, some of its limitations include: (1) not being able to confidently resolve CNVs of less than 50 kbp genome-wide, (2) relying on a “control” DNA, which will itself harbor CNVs of unknown clinical significance, (3) not being able to detect genomic gains of more than four copies due to the fluorescence signal’s dynamic range, (4) being “blind” to balanced structural rearrangements such as inversions and translocations as well as small insertions, and (5) being unable to determine the orientation of inserted sequences within the genomic context (i.e., direct vs inverted orientation). Another array-based method that can be utilized for identifying CNVs is single-nucleotide polymorphism (SNP) array genotyping, also known as SNP chips. SNP detection platforms perform allelic discrimination to interrogate polymorphisms at specific positions in the genome by measuring the intensity of hybridization of the labeled test sample DNA to the probes in the array, reported as the log2 ratio of the test versus reference intensities. SNP array genotype data can be analyzed downstream based on the B-allele frequency (BAF), which in a diploid state presents three possible states for a reference allele (A) and an alternate allele (B) at a given genomic position, namely AA = 0, AB = 0.5, and BB = 1. When the BAF deviates from the expected trimodal distribution, this can indicate allelic imbalance relating to genomic events such as CNVs or regions of absence of heterozygosity (AOH) or uniparental disomy (UPD), all detected within the same assay [71,73]. These data can also be visualized in a BAF plot. Some platforms have subsequently integrated classical aCGH probes with SNP genotyping into the same array platform for CMA to maximize the detection capabilities of the assay [74,75].
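The two array signals described above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not how any particular CMA software works: the function names, the log2-ratio threshold of 0.3, and the idea of treating expected BAF as (number of B alleles) / (total copy number) are all assumptions made here for clarity. The only values taken from the text are the requirement of at least five consecutive probes and the diploid BAF states of 0, 0.5, and 1.

```python
from fractions import Fraction
from itertools import groupby


def probe_state(log2_ratio, threshold):
    """Label one probe as gain, loss, or neutral (threshold is an assumed cutoff)."""
    if log2_ratio >= threshold:
        return "gain"
    if log2_ratio <= -threshold:
        return "loss"
    return "neutral"


def call_cnv_runs(log2_ratios, threshold=0.3, min_probes=5):
    """Return (start, end, state) for runs of >= min_probes consecutive probes
    shifted in the same direction; `end` is exclusive."""
    calls = []
    index = 0
    for state, group in groupby(log2_ratios, key=lambda r: probe_state(r, threshold)):
        length = len(list(group))
        if state != "neutral" and length >= min_probes:
            calls.append((index, index + length, state))
        index += length
    return calls


def expected_baf_clusters(copy_number):
    """Expected BAF values if BAF = B alleles / total copies (simplifying assumption)."""
    if copy_number < 1:
        raise ValueError("copy_number must be at least 1")
    return [Fraction(b, copy_number) for b in range(copy_number + 1)]


# Hypothetical probe values: a five-probe gain is called, a two-probe loss is not.
ratios = [0.0, 0.05, 0.6, 0.55, 0.62, 0.58, 0.6, 0.01, -1.0, -0.9, 0.0]
print(call_cnv_runs(ratios))
print(expected_baf_clusters(2))  # diploid: AA, AB, BB
print(expected_baf_clusters(3))  # single-copy gain: four clusters instead of three
```

Under this simplification, a one-copy deletion leaves only the 0 and 1 clusters, which is also the pattern expected for a copy-neutral region with absence of heterozygosity; distinguishing the two requires the intensity (log2 ratio) signal as well.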

3.3 The evolution of next-generation sequencing and bioinformatics for the detection of genomic rearrangements

The feasibility of developing clinical DNA sequencing tests was bolstered by next-generation massively parallel sequencing, otherwise known as NGS [76–80]. Compared to the multiplexed Sanger sequencing strategy employed in the reference Human Genome Project [81], NGS methodologies are able to sequence billions of DNA base pairs within a single run in a fully automated fashion (see Chapter 4: Genomic Sequencing of Rare Diseases). NGS technologies accomplish this by sequencing small DNA fragments many times over and producing short sequence “reads” that vary in size depending on the technology. However, the small size of these NGS sequence reads can present a challenge when trying to identify genomic rearrangements and breakpoint sites. In the last few years, the most commonly used NGS platform, Illumina, has evolved to offer longer paired-end (PE) read sequencing capabilities (2 × 150 bp and 2 × 300 bp), with up to 20 billion sequenced reads per run (NovaSeq 6000, S4 flow cell), making the platform more attractive for the analysis of single-nucleotide and structural genomic variation in one assay. Long-read technology is a thriving sequencing approach suitable for de novo genome assemblies and detection of larger SVs. The most popular long-read sequencing platforms include Pacific Biosciences (PacBio) and Oxford Nanopore, which rely on single-molecule sequencing (SMS) [82]. SMS identifies individual nucleotides in longer DNA segments, either from their addition by a polymerase (PacBio) [83,84], or through ionic signals produced by transport of DNA through a


nanometer-scale pore (Oxford Nanopore) [85–87]. Currently, PacBio produces >1 kbp reads, averaging 10 kbp read lengths and up to 60 kbp (RS II system). Oxford Nanopore routinely produces 10–100 kbp sized reads depending on the sample preparation protocol and has now achieved ultralong reads of 1.3 million base pairs (Mbp) and up to 2.2 Mbp in length [88,89]. Long-read technology has been applied in the research setting for the study of Fragile X and other repeat expansion diseases [90–96], HLA typing [97,98], and the identification of complex, disease-associated SVs and their mechanisms of generation [99–111]. The incursion of long-read sequencing into the clinic has been predicted to overcome many of the short-read NGS pitfalls, particularly when dealing with long homologous or GC-rich DNA regions; however, additional technological development is needed before its widespread adoption in the clinic to reduce costs, standardize sequencing protocols, and develop computational data processing and analysis tools. As expected from the intrinsic variability among sequencing technologies and their data outputs, multiple bioinformatic strategies and algorithms have emerged to identify SVs from NGS data. As of January 2020, there were 66 tools for whole-exome sequencing (WES) CNV annotation and 135 whole-genome sequencing (WGS) tools for SV detection [112–115]. Both short and long reads have inherent assets and disadvantages in SV analysis, the most obvious coming from read length

FIGURE 3.2 Analysis strategies for the identification of structural variants (SVs) using (A) short reads, (B) long reads, and (C) both (assembly method). Distinct types of structural rearrangements are represented, including deletion, tandem duplication, inversion, insertion and translocation events. In (A), all strategies use paired-end (PE) sequencing. Mapping orientation is indicated with arrows.


itself. Short-read technologies have historically performed poorly in detecting SVs with complex breakpoints, high GC content, or within repetitive regions, whereas long reads display higher sequencing error rates (~8%–20%), which can hamper alignment and precise breakpoint detection [116]. To date, classic strategies for SV detection using short reads are based on PE read mapping, split-read analysis (SR), read-depth (RD), de novo genome assembly (GA), and a combination of the aforementioned approaches [117] (Fig. 3.2). Supplementary to these strategies are technical adaptations of short reads that focus on SV discovery, including modified mate-pair sequencing protocols [118], Hi-C [119,120], and the assembly of longer reads through barcoded linked short reads [100], all with unique biases and specialized SV calling algorithms. Likewise, SV detection with long reads relies on de novo GA and comparison to the reference or other genome assembly, single split long-read analysis, within-read alignments and coverage, and a combination of these methods [115,121] (Fig. 3.2). It is generally accepted that, given the different scales of structural chromosome rearrangements, no single bioinformatic algorithm can detect the full spectrum of SVs [117,121–123]. So far, the combination of multiple sequencing analysis approaches has provided better SV detection rates while reducing the number of false positives, with high-quality genome assemblies and SV maps resulting from the coupling of short and long reads [109,123].
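To make the PE read-mapping strategy more concrete, the sketch below labels a single read pair according to the conventional discordance signatures for a forward/reverse library: mates on different chromosomes, mates on the same strand, everted mates, and mates mapping too far apart or too close together. It is a minimal illustration rather than the method of any published SV caller; the `ReadPair` type, the insert-size parameters (mean 400 bp, standard deviation 50 bp, three standard deviations), and the use of leftmost mapping positions as a proxy for insert size are assumptions introduced here. Real callers cluster many supporting pairs and combine them with split-read and read-depth evidence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """One mapped read pair; positions are leftmost alignment coordinates."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


def classify_pair(pair, mean_insert=400, sd_insert=50, n_sd=3):
    """Suggest the SV type a single read pair is consistent with.

    Assumes a library in which concordant pairs map as forward (+) then
    reverse (-) on the same chromosome, roughly mean_insert apart.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if left_strand == right_strand:
        return "inversion"  # both mates on the same strand
    if left_strand == "-" and right_strand == "+":
        return "tandem_duplication"  # everted (outward-facing) mates
    span = right_pos - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"  # mates map farther apart than the library allows
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"  # mates map closer together than expected
    return "concordant"


# Hypothetical pairs for illustration only.
print(classify_pair(ReadPair("chr17", 1000, "+", "chr17", 6000, "-")))
print(classify_pair(ReadPair("chr17", 1000, "-", "chr17", 5000, "+")))
```

A single discordant pair is weak evidence on its own; the labels are only meaningful once several independent pairs support the same event.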

3.4 Molecular mechanisms of genomic rearrangement generation

The combined application of chromosome karyotyping, FISH, CMA, and NGS to the study of genomic disorders has unveiled multiple avenues by which SVs arise in the human genome. At its core, SV formation is tied to two basic cellular mechanisms, namely DNA replication and recombination, both involved in the generation of recurrent and nonrecurrent chromosomal rearrangements. Molecular characterization of the Smith-Magenis syndrome, CMT1A, and HNPP deletion/duplication regions showed the presence of long stretches of flanking repetitive segments, subsequently termed “low-copy repeats” (LCRs) [124,125]. Genomic repetitive elements including short tandem repeats (microsatellites and satellites) and interspersed repeats [such as short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs)], which can be present at hundreds or thousands of copies in the human genome [126,127], make up a large proportion of human genomic content and represent remnants of the human genome’s evolution. In contrast to these highly repetitive elements, LCRs are large genomic regions (≥1 kbp) with high degrees of sequence identity present at fewer copies (from two to sometimes dozens), usually derived as clusters of paralogous sequences in direct or inverted orientations. Some LCRs have been shown to overlap segmental duplications, which are pairs of genomic sequences larger than 1 kbp in size and with ≥90% sequence identity between the duplicons, typically located in pericentromeric chromosome regions and sometimes associated with gene clusters, with apparent roles in genome function and evolution [128–130]. LCRs were initially postulated, and have since been shown, to mediate meiotic recombination events between nonallelic chromosome regions through nonallelic homologous recombination (NAHR), where misalignment and unequal crossovers lead to the recurrent chromosome alterations observed in genomic disorders [131].
Thereupon, the presence of flanking LCRs was shown for additional genomic disorders, facilitated by positional cloning as well as the initial sequencing and analysis of the reference human genome sequence [15,132,133].


Genome-wide analyses demonstrated that LCRs constitute ~5% of the human sequence, frequently overlapping regions associated with genomic disorders, further reinforcing the idea of their involvement in recombination events [134]. By definition, NAHR can result in interchromosomal and intrachromosomal deletions, duplications, and inversions, influenced by factors such as size, orientation, degree of sequence identity, intervening distance between LCRs, and the presence of PRDM9 recombination hotspots [135,136]. At the sequence level, intrachromatid recombination between direct-oriented LCRs results in deletions; inter-chromatid/chromosome recombination can result in deletions and duplications; whereas intrachromosomal inverted repeats can lead to sequence inversions (Fig. 3.3). The interchromosome and interchromatid exchange can result in the generation of reciprocal rearrangements, often generating

FIGURE 3.3 Comparison of genomic mechanisms leading to genomic disorders. The two lines represent double strands of DNA, shaded arrows (black, white, and gray) represent repeated sequences, and shaded boxes represent two different regions on a chromosome. (A) Nonhomologous end joining (NHEJ). Double-strand breaks between nonhomologous chromosome regions are repaired and rejoined, leading to breakpoints with either blunt ends or small microhomologies. (B) Nonallelic homologous recombination (NAHR). Recombination between nonallelic homologous sequences, such as low-copy repeats (LCRs), can lead to diverse types of rearrangements, including deletions and duplications/amplifications derived from recombination events between direct repeats, and inversions derived from recombination between inverted repeats. (C) Microhomology-mediated break-induced replication and fork stalling and template switching (MMBIR/FoSTeS). Replication along the fork is indicated with the black long dash arrow, and sites of fork instability and switching with gray dashed lines. The initiating repair templates can be 2–5 bp in length and, depending on the utilization of homologous or paralogous sequences, the produced rearrangements can involve complex patterns of gains, losses, inversions, translocations, or insertions.


human disease traits that are mirror images of each other (see Table 3.1). Sequence analyses of cohorts of individuals presenting with genomic disorders have established a positive correlation between rearrangement frequencies, LCR length and degree of identity, and the presence of PRDM9 recombination hotspots, with a marked decrease in crossovers with longer inter-LCR distances [135]. For example, LCRs longer than 10 kbp, located less than 10 Mb apart, and with sequence similarity greater than 97% have been shown to be more prone to elicit NAHR-mediated genomic rearrangements [137]. While LCRs can display varying degrees of homology, rearrangement breakpoints associated with genomic disorders have been seen to cluster within specific, small (<2 kb) and highly homologous segments within the flanking LCRs [138–140]. Such observations fall in line with previous reports of homologous recombination, where strand exchanges preferentially occur between highly identical sequences with minimal sizes of 134–232 bp [141]. In individuals harboring genomic duplications associated with genomic disorders, the availability of additional identical sequences can further mediate NAHR events, in which more complex but nonrecurrent rearrangements have been observed [135]. While NAHR favors the formation of recurrent rearrangements commonly associated with classical genomic disorders, nonrecurrent rearrangements have also been linked to dozens of genomic disease cases [142]. Compared to recurrent breakpoints in which NAHR generally achieves accurate DNA repair, nonrecurrent SV breakpoints are characterized by blunt ends or 1–33 bp microhomologies, with or without the insertion of random nucleotides or small segments (<100 bp) from neighboring regions [143].
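The LCR properties listed above (length greater than 10 kbp, separation under 10 Mb, sequence similarity above 97%) and the role of repeat orientation can be expressed as a small check on a pair of repeats. The Python sketch below is only an illustration of those stated criteria: the `LCR` type, the function name, the way distance is measured (end of the first repeat to start of the second), and the example coordinates are assumptions made here, and the result is a rule-of-thumb flag rather than a prediction of rearrangement frequency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LCR:
    """A low-copy repeat copy on the reference genome (hypothetical record type)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-", relative orientation of this repeat copy

    @property
    def length(self):
        return self.end - self.start


def nahr_assessment(a, b, identity,
                    min_length=10_000, max_distance=10_000_000, min_identity=0.97):
    """Return (prone, outcome) for an intrachromosomal LCR pair.

    prone:   True if the pair meets the length, distance, and identity
             criteria quoted in the text.
    outcome: rearrangement type expected from the relative orientation.
    """
    if a.chrom != b.chrom:
        raise ValueError("this sketch only handles LCRs on the same chromosome")
    left, right = sorted([a, b], key=lambda lcr: lcr.start)
    distance = right.start - left.end
    prone = (
        min(a.length, b.length) > min_length
        and 0 <= distance < max_distance
        and identity > min_identity
    )
    outcome = "deletion_or_duplication" if a.strand == b.strand else "inversion"
    return prone, outcome


# Illustrative coordinates only, not taken from a real locus annotation.
proximal = LCR("chr17", 1_000_000, 1_024_000, "+")
distal = LCR("chr17", 2_400_000, 2_424_000, "+")
print(nahr_assessment(proximal, distal, identity=0.987))
```

Direct repeats yield the reciprocal deletion and duplication products, which is why the same flanking LCR pair can underlie two mirror-image disorders.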
Nonrecurrent rearrangements are primarily generated by nonhomologous repair or recombination processes, including nonhomologous end joining (NHEJ) and microhomology-mediated end joining (MMEJ), as well as replication-based mechanisms, including break-induced replication (BIR), microhomology-mediated break-induced replication (MMBIR), fork stalling and template switching (FoSTeS), and serial replication slippage [137,143] (Fig. 3.3). As surmised from the name, nonhomologous recombination requires zero (NHEJ) to minimal (MMEJ) sequence homology to repair double-strand breaks. NHEJ can result in accurate blunt breakpoints, leading to small deletions (1–4 bp) or insertions of free-floating DNA [144]; whereas MMEJ invariably leads to deletion of sequences between annealed microhomologies [145]. Importantly, recombination-based repair mechanisms, either homologous (NAHR) or nonhomologous (NHEJ, MMEJ), aim to repair double-strand breaks in the DNA; whereas replication-based mechanisms (such as MMBIR) repair single-ended, double-stranded DNA (seDNA) breaks that may result from collapsed forks or at chromosome telomeres. Replication-based mechanisms came to light through the study of more complex pathogenic rearrangements where more than one breakpoint was observed. In BIR, homologous recombination mediated by Rad51 repairs broken or collapsed replication forks, usually resulting in faithful sequence repair; however, if the template used for repair involves a homologous or paralogous sequence in a different chromosome position (i.e., nonallelic), it can produce deletion, duplication, and translocation events, essentially constituting an alternative NAHR mechanism [146]. Curiously, BIR has also been observed to occur in a Rad51-independent manner (MMBIR), utilizing 2–5 bp microhomologies as initiating repair templates compared to the longer stretches of high sequence identity utilized in homologous recombination [147].
MMBIR has been proposed as one of the major contributors to the generation of nonrecurrent breakpoints in human genomic disorders, based on the pervasive presence of microhomology and microhomeology (highly similar but imperfect sequence matches) found at rearrangement breakpoints, the large genomic distances between rearranged genomic segments (explained by replication template


switches in the FoSTeS model), as well as the higher degree of rearrangement complexity [147,148]. NGS analyses of disease-associated SVs have revealed a tremendous amount of genomic rearrangement breakpoint complexity, much of which can be explained by MMBIR, including de novo formation of triplications and interspersed copy-number gains with additional structural complexity such as inversions and insertions (often associated with a phenomenon known as chromoanasynthesis) [149,150]. Chromoanasynthesis is one of three major and newly recognized phenomena of chromosomal rearrangements (chromothripsis and chromoplexy being the other two), collectively named “chromoanagenesis” (from the Greek “chromo” for ‘chromosome’ and “anagenesis” for ‘rebirth’) and characterized by “catastrophic” chromosome shattering and restructuring leading to very complex structural rearrangements and breakpoints. These massive chromosome restructuring processes are believed to play a role in large-scale genome evolution [151]. Overall, the study of genome architecture and disease-associated SVs has opened our eyes to the diverse array of cellular mechanisms that participate in the repair of DNA sequence. Understanding the properties of such processes will provide the clinical genomics community with an improved understanding of the frequency of these disorders, as well as better detection assays able to tackle their often-observed genomic complexity.
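Because the mechanisms discussed in this section are largely inferred from what is found at breakpoint junctions, it can help to see how junction microhomology is measured once a breakpoint has been resolved to the nucleotide. The Python sketch below handles only the simplest case, a deletion on a single reference sequence, and counts the bases that are identical on both sides of the junction (equivalently, how far the breakpoint could be shifted without changing the resulting sequence). The function names and example sequences are hypothetical, and the descriptive labels merely echo the size ranges quoted in this section; they are an assumption-laden convenience, not a mechanism classifier, since junction features alone do not identify the repair pathway.

```python
def junction_microhomology(ref, start, end):
    """Microhomology length (bp) at the junction of a deletion removing ref[start:end]."""
    if not 0 <= start < end <= len(ref):
        raise ValueError("expected 0 <= start < end <= len(ref)")
    deleted = end - start
    # Bases just before the proximal breakpoint that match the end of the deleted segment.
    left = 0
    while (left < deleted and start - left - 1 >= 0
           and ref[start - left - 1] == ref[end - left - 1]):
        left += 1
    # Bases at the start of the deleted segment that match the retained distal flank.
    right = 0
    while (right < deleted and end + right < len(ref)
           and ref[start + right] == ref[end + right]):
        right += 1
    return left + right


def describe_junction(microhomology_bp):
    """Coarse label using size ranges quoted in the text (illustrative only)."""
    if microhomology_bp == 0:
        return "blunt"
    if microhomology_bp <= 33:
        return "microhomology"
    if microhomology_bp >= 134:
        return "extended homology"
    return "intermediate"


# Hypothetical sequence: the 3 bp motif CAT is present at both breakpoints.
ref = "TTGACG" + "CAT" + "GGGGAAAA" + "CAT" + "TCCGA"
length = junction_microhomology(ref, 6, 17)
print(length, describe_junction(length))
```

In tandem repeats or homopolymer runs the same count reflects breakpoint ambiguity rather than microhomology in the mechanistic sense, so such junctions need separate handling.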

3.5 Genomic disorders and next-generation sequencing-based testing in the clinic

As CMA’s resolution increased and its production cost decreased, it became the first-tier clinical testing strategy for the analysis of known and suspected genomic disorders, as well as multiple congenital anomalies and developmental and intellectual disorders of unknown etiology [152]. CNV burden in these individuals was found to be significantly increased compared to the normal population [36,153], and new rare pathogenic microdeletions and microduplications were uncovered [154–158]. Compared to initially published genomic disorders (e.g., CMT1A, PW, AS), which had recognizable collections of clinical features and recurrent rearrangements facilitated by LCRs, many of the newly discovered genomic disorders were characterized by variable phenotypes and nonrecurrent rearrangements arising from a variety of mechanisms such as NHEJ and FoSTeS [143,159–161]. Overall, clinical CMA diagnostic yields are currently estimated at 15%–20% [152], adding to a growing list of rare, albeit true, genomic disorders. Since the wider adoption of DNA sequencing through NGS, for which availability has increased and production costs have steadily decreased over the past decade [162] (see Chapter 4: Genomic Sequencing of Rare Diseases), clinical CMA testing has been widely challenged. Short-read protocols were quickly improved and modified to capture clinically relevant regions for targeted gene panels or WES analyses [163]. In addition, mate-pair (MPseq) and PE short-read sequencing approaches were developed and broadly utilized for the detection of SVs at various levels of resolution [164–168], complementing initial copy-number studies performed with chromosome microarray [62,63,169]. Compared to CMA, NGS testing can deliver nucleotide-level resolution of CNV breakpoints, spot low-frequency mosaic SVs, and detect highly complex rearrangements such as those derived from chromoanagenesis events [67].
In addition, NGS testing can identify submicroscopic


balanced and apparently balanced SVs (i.e., inversions and translocations). It is estimated that a substantial number of balanced and apparently balanced rearrangements are associated with genomic developmental and intellectual disorders, as revealed by NGS studies tailored for the high-resolution detection of SVs [170,171]. For instance, subjects with autism spectrum disorder display an average of 14 large and newly reported complex SVs with minimal genomic loss undetected by CMA [170]. This and other studies of balanced SVs in congenital disease have expanded our understanding of genomic disorders: even when genomic imbalance is not present, SVs can potentially affect the regulatory and chromosome organization landscape of a genomic region and largely modify expression patterns of distal neighboring genes. The best characterized example of this phenomenon pertains to the study of rare human limb malformations, in which large SVs have been shown to change the integrity of the topologically associated domain containing the WNT6/IHH/EPHA4/PAX3 locus, thus causing ectopic gene misexpression through altered enhancer–promoter interactions [172]. Subsequent reports of balanced SV position effects linked to the modification of chromosome organization have emerged in diseases old and new [67,171,173–175]. While the prediction of long-term morbidity for balanced and apparently balanced SVs is still in its infancy, it has become clear that the identification and study of these events necessitates nucleotide-level resolution of breakpoints, attainable only through DNA sequencing [171,173,174,176]. The clinical relevance of WES has been extensively demonstrated throughout the last decade, achieving diagnostic yields up to 30% and ending diagnostic odysseys for hundreds of affected individuals with diverse genetic disorders [177–182].
To improve diagnostic yields, CNV calling has been routinely incorporated into WES, although more work is needed to systematically decrease false positive and false negative calls on CNVs involving fewer than three exons [183–188], which have been shown to be major contributors to disease and can typically go undetected in clinical WES and microarray tests [189,190]. In the prenatal setting, WES studies have successfully identified pathogenic and likely pathogenic variants in fetuses with ultrasound anomalies without aneuploidy or large CNVs; a 10%–12% diagnostic yield has been reported, with as much as 5.6% of variants being CNVs [191,192]. When WES, CMA, and panel testing results are normal, clinical WGS has become the follow-up testing of choice. Compared to gene panels and WES, which assess smaller portions of the genome, WGS has more uniform read coverage, thus increasing the sensitivity and specificity of SV calling, facilitating identification of larger and more complex rearrangements, and assessing SNVs and SVs in noncoding portions of the genome. In critically ill infants with neurodevelopmental disorders, use of rapid short-read WGS as first-tier genetic testing has reached diagnostic yields of up to 70% and 40% for individuals with previous genetic studies, shortening diagnostic odysseys up to 77 months [193]. In these studies, SVs smaller than 10 kbp were common, with detected insertions as large as 236 bp. Compared to CMA, CNV calling using short-read WGS has been shown to have greater sensitivity, particularly for events <50 kbp in size [194]. Upon filtering for data quality and population frequencies, WGS produces an average of 10 CNVs for clinical interpretation per undiagnosed individual, with 15% of cases harboring pathogenic/uncertain significance CNVs, with additional balanced and unbalanced SV events, aneuploidies, and absence of heterozygosity events [194].
In prenatal diagnosis, low-pass WGS (~15 million short reads per sample) has been shown to parallel CMA results and has increased diagnostic rates by 1.7%. For a small additional cost (~$300), the application of low-pass WGS to prenatal diagnosis


can provide more accurate and faster diagnoses while further enhancing our understanding of chromosome alterations leading to genomic disorders [195,196].

3.6 Interpretation of genomic structural and copy-number variants

Coincidentally, the study of genomic disorders with CMA and NGS-based technologies has additionally revealed an unexpected abundance of submicroscopic CNVs in apparently healthy individuals from diverse populations, ranging from 50 bp to upwards of 50 kbp in size and overlapping as many as 100 genes [123,197]. Current analyses estimate that 4.8%–9.5% of the genome contributes to copy-number variation, surpassing the number of base pairs involved in SNPs [123,197]. These observations have fueled diverse functional and population genetics studies [168,198,199], which have made apparent the evolutionary impact of CNVs in our species [200–202]. The sizable amount of naturally occurring CNVs can pose a major challenge to the study of genomic disorders, as rarely seen SVs in affected individuals could be benign variants that are nonetheless interpreted as variants of uncertain significance, likely pathogenic, or pathogenic based on limited population frequency data. As with single-nucleotide variation, the human genetics community has created SV population databases of apparently healthy control individuals, thus allowing the querying and comparison of candidate SVs against a background of SV catalogs in diverse populations, and providing additional resources for SV pathogenicity classification. Among the first catalogs available was the Database of Genomic Variants, established shortly after the discovery of the extent of copy-number variation in human populations [203]. The 1000 Genomes Project ensued, creating in the span of seven years a comprehensive compendium of genomic variation by NGS in individuals from varied populations [168]. Conversely, databases containing genomic and phenotypic information of morbidly affected individuals were also created, including ClinVar [204] and DECIPHER [205].
Both projects benefit from the submissions of laboratories and clinicians to establish publicly accessible compendiums of subjects with diverse genetic disorders. DECIPHER has created a cohort of well-curated phenotypic and genomic information from undiagnosed pediatric patients affected by severe developmental disorders and other conditions. Other projects, such as the 100,000 Genomes Project, have taken on a more ambitious and comprehensive collection of morbidly affected individuals at the national level in the United Kingdom [206]. Additionally, the characterization of SVs in large population datasets is also adding to the catalog of polymorphic and potentially clinically relevant SVs [188,207,208]. These resources aim to help differentiate pathogenic from polymorphic variation and heavily depend on the consenting and reporting of phenotype information by the medical genetics community to advance the development of personalized medicine. Additionally, guidelines have been created to standardize the clinical classification of SVs. Among these, recommendations by the Association for Molecular Pathology, American College of Medical Genetics (ACMG), and ClinGen have focused on the analysis and interpretation of SVs and SNVs, with several SV-related working groups focused on clinical dosage sensitivity maps, variant interpretation guidelines, and other classification and reporting topics [209–212]. These guidelines homogenize genomic, computational, familial, population, and other data into a methodical strategy for the evaluation of SVs in the clinic. Notably, these guidelines specifically state the

48

Chapter 3 Genomic disorders in the genomics era

uncoupling of variant classification from clinical significance in the context of an individual’s phenotypic presentation or diagnosis, in order to achieve objectivity and consistency in CNV interpretation among different laboratories [210]. The application of these resources and guidelines have led to a large-scale reinterpretation of CNVs, including the reclassification of 155/246 (63%) of CNVs in ClinVar [213]. Like the advancement of NGS technologies, interpretation guidelines are an everevolving subject, as new and improved sequencing platforms, cohort studies, databases, and gene functional annotations become available.
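The comparison of a candidate CNV against control catalogs described in this section is often implemented as a reciprocal-overlap match. The following is a minimal Python sketch of that idea, not the method of any particular database or guideline: the `CNV` record, the `control_frequency` helper, the 50% overlap threshold, and all coordinates and carrier counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNV:
    chrom: str
    start: int  # 0-based start coordinate
    end: int    # end coordinate (exclusive)
    kind: str   # "DEL" or "DUP"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Smaller of the two overlap fractions; 0.0 if disjoint or of different type."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def control_frequency(candidate, catalog, n_controls, min_overlap=0.5):
    """Highest carrier frequency among catalog CNVs that match the candidate.

    `catalog` is a list of (CNV, carrier_count) pairs; the 0.5 threshold is an
    assumed, illustrative cutoff.
    """
    matches = [
        carriers
        for cnv, carriers in catalog
        if reciprocal_overlap(candidate, cnv) >= min_overlap
    ]
    return max(matches, default=0) / n_controls


# Hypothetical candidate and control catalog (not real data).
candidate = CNV("chr1", 1_000_000, 1_200_000, "DEL")
catalog = [
    (CNV("chr1", 1_050_000, 1_250_000, "DEL"), 12),  # 75% reciprocal overlap: match
    (CNV("chr1", 900_000, 3_000_000, "DEL"), 300),   # contains candidate, but far larger
    (CNV("chr1", 1_000_000, 1_200_000, "DUP"), 40),  # same interval, different type
]
freq = control_frequency(candidate, catalog, n_controls=10_000)
print(f"matching control frequency: {freq:.4f}")  # prints 0.0012
```

Requiring the overlap to be reciprocal keeps a small candidate from being matched to a much larger control CNV that merely contains it. A low or zero frequency under such a rule is only one line of evidence among those the guidelines combine.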

3.7 Outlook

From karyotype, to FISH, to CMA and NGS, geneticists have consistently linked structural chromosome changes to human disease. While classical genomic disorders have been defined based on their frequency and associated genomic recombination hotspots, the ongoing analyses of balanced and unbalanced SVs and the recruitment of morbidly affected individuals in phenotype- and genotype-driven studies suggest the presence of various potentially new genomic disorders, whose existence has been obscured by their clinical variability, low population frequencies, and our technological limitations. Genomic microarrays and NGS-based assays, such as WES and WGS, have been particularly useful in the discovery and further characterization of new and known human structural variation, opening a window to a more precise identification of SVs, their mechanisms of generation, and their roles in human disease, especially genomic disorders. The rapid evolution of NGS technologies and their versatility in structural variation testing have positioned them as future substitutes for karyotyping, FISH, and CMA. WGS stands out, given its ability to detect structural, exonic, and nonexonic sequence variants on a single platform and the recently published support and endorsements of it as a viable replacement for WES and CMA as a first-tier test for individuals with rare genetic conditions [177,179,182,194,214–217]. Even though the benefits of research and clinical WGS testing are undeniable, the technology needs to overcome several technical, regulatory, and economic hurdles to become a comprehensive replacement for karyotyping, FISH, CMA, and WES. For example, detection of SVs with short-read NGS is restricted by size, the complexity of SV breakpoints, and mapping to repetitive and GC-rich regions, which hinders identification of common pathogenic balanced SVs with breakpoints in repetitive regions, such as Robertsonian translocations.
While the use of long-read strategies, alone or in combination with short-read approaches, can overcome these limitations and refine SV calls [123], the pricing of long-read technology is currently prohibitive for routine clinical genomic testing, and clinical bioinformatic platforms tailored to interpret this new type of sequence data are lacking outside the research setting. In addition, WGS has been shown to underperform in the detection of low-level mosaicism compared to WES or targeted panels, which may be significant for certain rare disorders [218] (see Chapter 8: Mosaicism in Rare Disease). The known caveats and variability added by bioinformatic analyses also make it necessary to perform orthogonal confirmations for most clinically relevant SVs, adding to the turnaround time and cost of WGS and NGS-based testing in general. The regulation of quality assurance is another major concern for clinical WGS test development, given the broad availability of commercial sequencing platforms and analysis algorithms; however, progress has been made toward the development of best practices for clinical WGS test validations in germline conditions [215]. As the human genetics field heads toward universal implementation of WGS in the clinical laboratory, the study of human structural variation still requires the use of diverse cytogenetic and molecular techniques, as has been highlighted and exemplified throughout this chapter. The integration of mechanistic, structural, and regulatory genetic information is crucial to deciphering SV pathogenicity and is a means of discovering and classifying new forms of genomic disease. Following in the footsteps of previous genomic disorder characterizations, future studies will keep making important contributions to understanding coding and noncoding functional sequence annotations, expanding our knowledge of the molecular processes that participate in the generation of human structural variation, and providing further insights into the regulatory landscape of the human genome. Genomic disorders have been a pillar of our understanding of the molecular basis of human disease; as our technology advances, so will our capability to detect and eventually predict with high accuracy the effects of SVs on human health and disease.
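The lower sensitivity of WGS for low-level mosaicism noted in this section is commonly attributed to its lower read depth relative to WES or targeted panels. As a rough illustration, and assuming that variant-supporting reads follow a binomial distribution with no sequencing error, the sketch below computes the chance of observing a minimum number of variant reads at a single site. The depths (30x, 100x, 500x), the 5% variant allele fraction, and the three-read threshold are illustrative assumptions, not values taken from the cited studies.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_alt_reads=3):
    """P(at least `min_alt_reads` variant reads) under Binomial(depth, allele_fraction)."""
    p_below = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below


# Assumed, illustrative depths for each assay type.
assumed_depths = {"WGS": 30, "WES": 100, "targeted panel": 500}
for assay, depth in assumed_depths.items():
    p = detection_probability(depth, allele_fraction=0.05)
    print(f"{assay:>14} at {depth:>3}x: P(detect 5% mosaic variant) = {p:.3f}")
```

Under these assumptions the detection probability rises with depth, which is consistent with the comparison made in the text; real pipelines also have to separate true mosaic reads from sequencing and alignment errors, which this sketch ignores.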

References

[1] Edwards JH, Harnden DG, Cameron AH, Crosse VM, Wolff OH. A new trisomic syndrome. Lancet 1960;1(7128):787–90.
[2] Smith DW, Patau K, Therman E, Inhorn SL. A new autosomal trisomy syndrome: multiple congenital anomalies caused by an extra chromosome. J Pediatrics 1960;57:338–45.
[3] Patau K, Smith DW, Therman E, Inhorn SL, Wagner HP. Multiple congenital anomaly caused by an extra autosome. Lancet 1960;1(7128):790–3.
[4] Lejeune J, Gautier M, Turpin R. Study of somatic chromosomes from 9 mongoloid children. Comptes rendus Hebd des seances de l’Academie des Sci 1959;248(11):1721–2.
[5] Schmickel RD. Contiguous gene syndromes: a component of recognizable syndromes. J Pediatrics 1986;109(2):231–41.
[6] Theisen A, Shaffer LG. Disorders caused by chromosome abnormalities. Appl Clin Genet 2010;3:159–74.
[7] Lupski JR, Stankiewicz P. Genomic disorders: the genomic basis of disease. 2007.
[8] Lupski JR. Genomic disorders ten years on. Genome Med 2009;1(4):42.
[9] Carvalho CMB, Zhang F, Lupski JR. Genomic disorders: a window into human gene and genome evolution. Proc Natl Acad Sci 2010;107(Suppl. 1):1765–71.
[10] Harel T, Lupski JR. Genomic disorders 20 years on-mechanisms for clinical manifestations. Clin Genet 2018;93(3):439–49.
[11] Lupski JR, de Oca-Luna RM, Slaugenhaupt S, et al. DNA duplication associated with Charcot-Marie-Tooth disease type 1A. Cell 1991;66(2):219–32.
[12] Smith A, McGavran L, Waldstein G. Deletion of the 17 short arm in two patients with facial clefts. Am J Hum Genet 1982;34.
[13] Chen KS, Potocki L, Lupski JR. The Smith-Magenis syndrome [del(17)p11.2]: clinical review and molecular advances. Ment Retard Dev Disabil Res Rev 1996;2(3):122–9.
[14] Butler MG, Meaney FJ, Palmer CG. Clinical and cytogenetic survey of 39 individuals with Prader-Labhart-Willi syndrome. Am J Med Genet 1986;23(3):793–809.


[15] Amos-Landgraf JM, Ji Y, Gottlieb W, et al. Chromosome breakage in the Prader-Willi and Angelman syndromes involves recombination between large, transcribed repeats at proximal and distal breakpoints. Am J Hum Genet 1999;65(2):370–86.
[16] Watson CT, Marques-Bonet T, Sharp AJ, Mefford HC. The genetics of microdeletion and microduplication syndromes: an update. Annu Rev Genomics Hum Genet 2014;15:215–44.
[17] Rosenfeld JA, Ballif BC, Torchia BS, et al. Copy number variations associated with autism spectrum disorders contribute to a spectrum of neurodevelopmental disorders. Genet Med 2010;12(11):694–702.
[18] Kushima I, Aleksic B, Nakatochi M, et al. Comparative analyses of copy-number variation in autism spectrum disorder and schizophrenia reveal etiological overlap and biological insights. Cell Rep 2018;24(11):2838–56.
[19] Christian SL, Brune CW, Sudi J, et al. Novel submicroscopic chromosomal abnormalities detected in autism spectrum disorder. Biol Psychiatry 2008;63(12):1111–17.
[20] Kumar RA, KaraMohamed S, Sudi J, et al. Recurrent 16p11.2 microdeletions in autism. Hum Mol Genet 2008;17(4):628–38.
[21] Marshall CR, Noor A, Vincent JB, et al. Structural variation of chromosomes in autism spectrum disorder. Am J Hum Genet 2008;82(2):477–88.
[22] Sebat J, Lakshmi B, Malhotra D, et al. Strong association of de novo copy number mutations with autism. Science 2007;316(5823):445–9.
[23] Szatmari P, Paterson AD, Zwaigenbaum L, et al. Mapping autism risk loci using genetic linkage and chromosomal rearrangements. Nat Genet 2007;39(3):319–28.
[24] Weiss LA, Shen Y, Korn JM, et al. Association between microdeletion and microduplication at 16p11.2 and autism. N Engl J Med 2008;358(7):667–75.
[25] Kirov G, Grozeva D, Norton N, et al. Support for the involvement of large copy number variants in the pathogenesis of schizophrenia. Hum Mol Genet 2009;18(8):1497–503.
[26] Need AC, Ge D, Weale ME, et al. A genome-wide investigation of SNPs and CNVs in schizophrenia. PLoS Genet 2009;5(2):e1000373.
[27] Stefansson H, Rujescu D, Cichon S, et al. Large recurrent microdeletions associated with schizophrenia. Nature 2008;455(7210):232–6.
[28] International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008;455(7210):237–41.
[29] Walsh T, McClellan JM, McCarthy SE, et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science 2008;320(5875):539–43.
[30] Erdogan F, Larsen LA, Zhang L, et al. High frequency of submicroscopic genomic aberrations detected by tiling path array comparative genome hybridisation in patients with isolated congenital heart disease. J Med Genet 2008;45(11):704–9.
[31] Mefford HC, Clauin S, Sharp AJ, et al. Recurrent reciprocal genomic rearrangements of 17q12 are associated with renal disease, diabetes, and epilepsy. Am J Hum Genet 2007;81(5):1057–69.
[32] Richards AA, Santos LJ, Nichols HA, et al. Cryptic chromosomal abnormalities identified in children with congenital heart disease. Pediatr Res 2008;64(4):358–63.
[33] de Vries BB, Pfundt R, Leisink M, et al. Diagnostic genome profiling in mental retardation. Am J Hum Genet 2005;77(4):606–16.
[34] Sagoo GS, Butterworth AS, Sanderson S, Shaw-Smith C, Higgins JP, Burton H. Array CGH in patients with learning disability (mental retardation) and congenital anomalies: updated systematic review and meta-analysis of 19 studies and 13,926 subjects. Genet Med 2009;11(3):139–46.
[35] Shaffer LG, Kashork CD, Saleki R, et al. Targeted genomic microarray analysis for identification of chromosome abnormalities in 1500 consecutive clinical cases. J Pediatrics 2006;149(1):98–102.


[36] Cooper GM, Coe BP, Girirajan S, et al. A copy number variation morbidity map of developmental delay. Nat Genet 2011;43(9):838–46.
[37] Heilstedt HA, Ballif BC, Howard LA, Kashork CD, Shaffer LG. Population data suggest that deletions of 1p36 are a relatively common chromosome abnormality. Clin Genet 2003;64(4):310–16.
[38] Gunning AC, Strucinska K, Munoz Oreja M, et al. Recurrent de novo NAHR reciprocal duplications in the ATAD3 gene cluster cause a neurogenetic trait with perturbed cholesterol and mitochondrial metabolism. Am J Hum Genet 2020;106(2):272–9.
[39] Tatton-Brown K, Douglas J, Coleman K, et al. Multiple mechanisms are implicated in the generation of 5q35 microdeletions in Sotos syndrome. J Med Genet 2005;42(4):307–13.
[40] Rosenfeld JA, Kim KH, Angle B, et al. Further evidence of contrasting phenotypes caused by reciprocal deletions and duplications: duplication of NSD1 causes growth retardation and microcephaly. Mol Syndromol 2013;3(6):247–54.
[41] Antonell A, Del Campo M, Magano LF, et al. Partial 7q11.23 deletions further implicate GTF2I and GTF2IRD1 as the main genes responsible for the Williams-Beuren syndrome neurocognitive profile. J Med Genet 2010;47(5):312–20.
[42] Berg JS, Brunetti-Pierri N, Peters SU, et al. Speech delay and autism spectrum behaviors are frequently associated with duplication of the 7q11.23 Williams-Beuren syndrome region. Genet Med 2007;9(7):427–41.
[43] Hsu P, Ma A, Wilson M, et al. CHARGE syndrome: a review. J Paediatr Child Health 2014;50(7):504–11.
[44] Lehman AM, Friedman JM, Chai D, et al. A characteristic syndrome associated with microduplication of 8q12, inclusive of CHD7. Eur J Med Genet 2009;52(6):436–9.
[45] Harada N, Visser R, Dawson A, et al. A 1-Mb critical region in six patients with 9q34.3 terminal deletion syndrome. J Hum Genet 2004;49(8):440–4.
[46] Yatsenko SA, Brundage EK, Roney EK, Cheung SW, Chinault AC, Lupski JR. Molecular mechanisms for subtelomeric rearrangements associated with the 9q34.3 microdeletion syndrome. Hum Mol Genet 2009;18(11):1924–36.
[47] Allderdice PW, Eales B, Onyett H, et al. Duplication 9q34 syndrome. Am J Hum Genet 1983;35(5):1005–19.
[48] Mattina T, Perrotta CS, Grossfeld P. Jacobsen syndrome. Orphanet J Rare Dis 2009;4:9.
[49] Angulo MA, Butler MG, Cataletto ME. Prader-Willi syndrome: a review of clinical, genetic, and endocrine findings. J Endocrinol Invest 2015;38(12):1249–63.
[50] Kalsner L, Chamberlain SJ. Prader-Willi, Angelman, and 15q11-q13 duplication syndromes. Pediatr Clin North Am 2015;62(3):587–606.
[51] Buiting K, Williams C, Horsthemke B. Angelman syndrome—insights into a rare neurogenetic disorder. Nat Rev Neurol 2016;12(10):584–93.
[52] Al Ageeli E, Drunat S, Delanoe C, et al. Duplication of the 15q11-q13 region: clinical and genetic study of 30 new cases. Eur J Med Genet 2014;57(1):5–14.
[53] Fernandez BA, Roberts W, Chung B, et al. Phenotypic spectrum associated with de novo and inherited deletions and duplications at 16p11.2 in individuals ascertained for diagnosis of autism spectrum disorder. J Med Genet 2010;47(3):195–203.
[54] Potocki L, Bi W, Treadwell-Deering D, et al. Characterization of Potocki-Lupski syndrome (dup(17)(p11.2p11.2)) and delineation of a dosage-sensitive critical interval that can convey an autism phenotype. Am J Hum Genet 2007;80(4):633–49.
[55] Potocki L, Chen KS, Koeuth T, et al. DNA rearrangements on both homologues of chromosome 17 in a mildly delayed individual with a family history of autosomal dominant carpal tunnel syndrome. Am J Hum Genet 1999;64(2):471–8.


[56] Zhang F, Seeman P, Liu P, et al. Mechanisms for nonrecurrent genomic rearrangements associated with CMT1A or HNPP: rare CNVs as a cause for missing heritability. Am J Hum Genet 2010;86(6):892–903.
[57] Nogueira SI, Hacker AM, Bellucco FT, et al. Atypical 22q11.2 deletion in a patient with DGS/VCFS spectrum. Eur J Med Genet 2008;51(3):226–30.
[58] Yamagishi H, Garg V, Matsuoka R, Thomas T, Srivastava D. A molecular pathway revealing a genetic basis for human cardiac and craniofacial defects. Science 1999;283(5405):1158–61.
[59] Ou Z, Berg JS, Yonath H, et al. Microduplications of 22q11.2 are frequently inherited and are associated with variable phenotypes. Genet Med 2008;10(4):267–77.
[60] Portnoi MF. Microduplication 22q11.2: a new chromosomal syndrome. Eur J Med Genet 2009;52(2-3):88–93.
[61] Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet 2011;12(5):363–76.
[62] Sebat J, Lakshmi B, Troge J, et al. Large-scale copy number polymorphism in the human genome. Science 2004;305(5683):525–8.
[63] Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36(9):949–51.
[64] Meyer LM, Ramin KD, Ramsey PS, Ogburn PL, Jalal SM. Fluorescent in situ hybridization technique for the rapid detection of chromosomal abnormalities. Obstet Gynecol 2000;95(4, Supplement 1):S64.
[65] Kearney L. Molecular cytogenetics. Best Pract Res Clin Haematology 2001;14(3):645–68.
[66] Bishop R. Applications of fluorescence in situ hybridization (FISH) in detecting genetic aberrations of medical significance. Biosci Horizons: Int J Stud Res 2010;3(1):85–95.
[67] Zepeda-Mendoza CJ, Morton CC. The Iceberg under water: unexplored complexity of chromoanagenesis in congenital disorders. Am J Hum Genet 2019;104(4):565–77.
[68] Vissers LE, de Vries BB, Osoegawa K, et al. Array-based comparative genomic hybridization for the genomewide detection of submicroscopic chromosomal abnormalities. Am J Hum Genet 2003;73(6):1261–70.
[69] Pinkel D, Segraves R, Sudar D, et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 1998;20(2):207–11.
[70] Pinkel D, Albertson DG. Comparative genomic hybridization. Annu Rev Genomics Hum Genet 2005;6:331–54.
[71] Brady PD, Vermeesch JR. Genomic microarrays: a technology overview. Prenat Diagn 2012;32(4):336–43.
[72] Snijders AM, Nowak N, Segraves R, et al. Assembly of microarrays for genome-wide measurement of DNA copy number. Nat Genet 2001;29(3):263–4.
[73] Yau C, Holmes CC. CNV discovery using SNP genotyping arrays. Cytogenet Genome Res 2008;123(1-4):307–12.
[74] Scionti F, Di Martino MT, Pensabene L, Bruni V, Concolino D. The cytoscan HD array in the diagnosis of neurodevelopmental disorders. High Throughput 2018;7(3).
[75] Wiszniewska J, Bi W, Shaw C, et al. Combined array CGH plus SNP genome analyses in a single assay for optimized clinical testing. Eur J Hum Genet 2014;22(1):79–87.
[76] Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008;26(10):1135–45.
[77] Levy S, Sutton G, Ng PC, et al. The diploid genome sequence of an individual human. PLoS Biol 2007;5(10):e254.
[78] Wheeler DA, Srinivasan M, Egholm M, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 2008;452(7189):872–6.
[79] Bentley DR, Balasubramanian S, Swerdlow HP, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456(7218):53–9.


[80] Wang J, Wang W, Li R, et al. The diploid genome sequence of an Asian individual. Nature 2008;456(7218):60–5.
[81] Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature 2001;409(6822):860–921.
[82] Logsdon GA, Vollger MR, Eichler EE. Long-read human genome sequencing and its applications. Nat Rev Genet 2020;21(10):597–614.
[83] Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW. Zero-mode waveguides for single-molecule analysis at high concentrations. Science 2003;299(5607):682–6.
[84] Eid J, Fehr A, Gray J, et al. Real-time DNA sequencing from single polymerase molecules. Science 2009;323(5910):133–8.
[85] Laszlo AH, Derrington IM, Ross BC, et al. Decoding long nanopore sequencing reads of natural DNA. Nat Biotechnol 2014;32(8):829–33.
[86] Bayley H. Nanopore sequencing: from imagination to reality. Clin Chem 2015;61(1):25–31.
[87] Deamer D, Akeson M, Branton D. Three decades of nanopore sequencing. Nat Biotechnol 2016;34(5):518–24.
[88] Jain M, Koren S, Miga KH, et al. Nanopore sequencing and assembly of a human genome with ultralong reads. Nat Biotechnol 2018;36(4):338–45.
[89] The long view on sequencing. Nat Biotechnol 2018;36(4):287.
[90] Loomis EW, Eid JS, Peluso P, et al. Sequencing the unsequenceable: expanded CGG-repeat alleles of the fragile X gene. Genome Res 2013;23(1):121–8.
[91] Ardui S, Race V, Zablotskaya A, et al. Detecting AGG interruptions in male and female FMR1 premutation carriers by single-molecule sequencing. Hum Mutat 2017;38(3):324–31.
[92] McFarland KN, Liu J, Landrian I, et al. SMRT sequencing of long tandem nucleotide repeats in SCA10 reveals unique insight of repeat expansion structure. PLoS One 2015;10(8):e0135906.
[93] Schüle B, McFarland KN, Lee K, et al. Parkinson’s disease associated with pure ATXN10 repeat expansion. npj Parkinson’s Dis 2017;3(1):27.
[94] Tsai Y-C, Greenberg D, Powell J, et al. Amplification-free, CRISPR-Cas9 targeted enrichment and SMRT sequencing of repeat-expansion disease causative genomic regions. bioRxiv 2017;203919.
[95] Wenzel A, Altmueller J, Ekici AB, et al. Single molecule real time sequencing in ADTKD-MUC1 allows complete assembly of the VNTR and exact positioning of causative mutations. Sci Rep 2018;8(1):4170.
[96] Hoijer I, Tsai YC, Clark TA, et al. Detailed analysis of HTT repeat elements in human blood using targeted amplification-free long-read sequencing. Hum Mutat 2018;39(9):1262–72.
[97] Albrecht V, Zweiniger C, Surendranath V, et al. Dual redundant sequencing strategy: full-length gene characterisation of 1056 novel and confirmatory HLA alleles. HLA 2017;90(2):79–87.
[98] Mayor NP, Robinson J, McWhinnie AJ, et al. HLA typing for the next generation. PLoS One 2015;10(5):e0127153.
[99] Chaisson MJ, Huddleston J, Dennis MY, et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature 2015;517(7536):608–11.
[100] Zheng GX, Lau BT, Schnall-Levin M, et al. Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol 2016;34(3):303–11.
[101] Cretu Stancu M, van Roosmalen MJ, Renkens I, et al. Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nat Commun 2017;8(1):1326.
[102] Merker JD, Wenger AM, Sneddon T, et al. Long-read genome sequencing identifies causal structural variation in a Mendelian disease. Genet Med 2018;20(1):159–63.
[103] Reiner J, Pisani L, Qiao W, et al. Cytogenomic identification and long-read single molecule real-time (SMRT) sequencing of a Bardet-Biedl Syndrome 9 (BBS9) deletion. NPJ Genom Med 2018;3:3.


[104] Aneichyk T, Hendriks WT, Yadav R, et al. Dissecting the causal mechanism of X-linked Dystonia-Parkinsonism by integrating genome and transcriptome assembly. Cell 2018;172(5):897–909 e821.
[105] Goncalves A, Oliveira J, Coelho T, et al. Exonization of an intronic LINE-1 element causing Becker muscular dystrophy as a novel mutational mechanism in dystrophin gene. Genes (Basel) 2017;8(10).
[106] Masset H, Hestand MS, Van Esch H, et al. A distinct class of chromoanagenesis events characterized by focal copy number gains. Hum Mutat 2016;37(7):661–8.
[107] Wang M, Beck CR, English AC, et al. PacBio-LITS: a large-insert targeted sequencing method for characterization of human disease-associated chromosomal structural variations. BMC Genomics 2015;16:214.
[108] Gong L, Wong CH, Cheng WC, et al. Picky comprehensively detects high-resolution structural variants in nanopore long reads. Nat Methods 2018;15(6):455–60.
[109] English AC, Salerno WJ, Hampton OA, et al. Assessing structural variation in a personal genome-towards a human reference diploid genome. BMC Genomics 2015;16:286.
[110] Carvalho CMB, Coban-Akdemir Z, Hijazi H, et al. Interchromosomal template-switching as a novel molecular mechanism for imprinting perturbations associated with Temple syndrome. Genome Med 2019;11(1):25.
[111] Beck CR, Carvalho CMB, Akdemir ZC, et al. Megabase length hypermutation accompanies human structural variation at 17p11.2. Cell 2019;176(6):1310–24 e1310.
[112] omicX. Whole-genome sequencing data analysis software tools. 2020; https://omictools.com/wholegenome-resequencing-category.
[113] Gambin T, Akdemir ZC, Yuan B, et al. Homozygous and hemizygous CNV detection from exome sequencing data in a Mendelian disease cohort. Nucleic Acids Res 2017;45(4):1633–48.
[114] Coban-Akdemir Z, White JJ, Song X, et al. Identifying genes whose mutant transcripts cause dominant disease traits by potential gain-of-function alleles. Am J Hum Genet 2018;103(2):171–87.
[115] Song X, Beck CR, Du R, et al. Predicting human genes susceptible to genomic instability associated with Alu/Alu-mediated rearrangements. Genome Res 2018;28(8):1228–42.
[116] Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2016;17(6):333–51.
[117] Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinforma 2013;14(11):S1.
[118] Hanscom C, Talkowski M. Design of large-insert jumping libraries for structural variant detection using Illumina sequencing. Curr Protoc Hum Genet 2014;80:219 7.22.
[119] Putnam NH, O’Connell BL, Stites JC, et al. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res 2016;26(3):342–50.
[120] Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol 2013;31(12):1119–25.
[121] Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet 2018;19(6):329–46.
[122] Zhang L, Bai W, Yuan N, Du Z. Comprehensively benchmarking applications for detecting copy number variation. PLoS Comput Biol 2019;15(5):e1007069.
[123] Chaisson MJP, Sanders AD, Zhao X, et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun 2019;10(1):1784.
[124] Chen KS, Manian P, Koeuth T, et al. Homologous recombination of a flanking repeat gene cluster is a mechanism for a common contiguous gene deletion syndrome. Nat Genet 1997;17(2):154–63.
[125] Kiyosawa H, Lensch MW, Chance PF. Analysis of the CMT1A-REP repeat: mapping crossover breakpoints in CMT1A and HNPP. Hum Mol Genet 1995;4(12):2327–34.


[126] Schmid CW, Deininger PL. Sequence organization of the human genome. Cell 1975;6(3):345–58.
[127] Jurka J, Kapitonov VV, Kohany O, Jurka MV. Repetitive sequences in complex genomes: structure and evolution. Annu Rev Genomics Hum Genet 2007;8:241–59.
[128] Wong Z, Royle NJ, Jeffreys AJ. A novel human DNA polymorphism resulting from transfer of DNA from chromosome 6 to chromosome 16. Genomics 1990;7(2):222–34.
[129] Tomlinson IM, Cook GP, Carter NP, et al. Human immunoglobulin VH and D segments on chromosomes 15q11.2 and 16p11.2. Hum Mol Genet 1994;3(6):853–60.
[130] Eichler EE, Budarf ML, Rocchi M, et al. Interchromosomal duplications of the adrenoleukodystrophy locus: a phenomenon of pericentromeric plasticity. Hum Mol Genet 1997;6(7):991–1002.
[131] Lupski JR. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits. Trends Genet 1998;14(10):417–22.
[132] Christian SL, Fantes JA, Mewborn SK, Huang B, Ledbetter DH. Large genomic duplicons map to sites of instability in the Prader-Willi/Angelman syndrome chromosome region (15q11-q13). Hum Mol Genet 1999;8(6):1025–37.
[133] Edelmann L, Pandita RK, Morrow BE. Low-copy repeats mediate the common 3-Mb deletion in patients with velo-cardio-facial syndrome. Am J Hum Genet 1999;64(4):1076–86.
[134] Bailey JA, Gu Z, Clark RA, et al. Recent segmental duplications in the human genome. Science 2002;297(5583):1003–7.
[135] Liu P, Lacaria M, Zhang F, Withers M, Hastings PJ, Lupski JR. Frequency of nonallelic homologous recombination is correlated with length of homology: evidence that ectopic synapsis precedes ectopic crossing-over. Am J Hum Genet 2011;89(4):580–8.
[136] Dittwald P, Gambin T, Szafranski P, et al. NAHR-mediated copy-number variants in a clinical population: mechanistic insights into both genomic disorders and Mendelizing traits. Genome Res 2013;23(9):1395–409.
[137] Carvalho CM, Lupski JR. Mechanisms underlying structural variant formation in genomic disorders. Nat Rev Genet 2016;17(4):224–38.
[138] Lopes J, Ravise N, Vandenberghe A, et al. Fine mapping of de novo CMT1A and HNPP rearrangements within CMT1A-REPs evidences two distinct sex-dependent mechanisms and candidate sequences involved in recombination. Hum Mol Genet 1998;7(1):141–8.
[139] Reiter LT, Hastings PJ, Nelis E, De Jonghe P, Van Broeckhoven C, Lupski JR. Human meiotic recombination products revealed by sequencing a hotspot for homologous strand exchange in multiple HNPP deletion patients. Am J Hum Genet 1998;62(5):1023–33.
[140] Bi W, Park SS, Shaw CJ, Withers MA, Patel PI, Lupski JR. Reciprocal crossovers and a positional preference for strand exchange in recombination events resulting in deletion or duplication of chromosome 17p11.2. Am J Hum Genet 2003;73(6):1302–15.
[141] Waldman AS, Liskay RM. Dependence of intrachromosomal recombination in mammalian cells on uninterrupted homology. Mol Cell Biol 1988;8(12):5350–7.
[142] Vissers LE, Stankiewicz P. Microdeletion and microduplication syndromes. Methods Mol Biol 2012;838:29–75.
[143] Hastings PJ, Lupski JR, Rosenberg SM, Ira G. Mechanisms of change in gene copy number. Nat Rev Genet 2009;10(8):551–64.
[144] Lieber MR. The mechanism of human nonhomologous DNA end joining. J Biol Chem 2008;283(1):1–5.
[145] McVey M, Lee SE. MMEJ repair of double-strand breaks (director’s cut): deleted sequences and alternative endings. Trends Genet 2008;24(11):529–38.
[146] Malkova A, Ira G. Break-induced replication: functions and molecular mechanism. Curr Opin Genet Dev 2013;23(3):271–9.


[147] Hastings PJ, Ira G, Lupski JR. A microhomology-mediated break-induced replication model for the origin of human copy number variation. PLoS Genet 2009;5(1):e1000327.
[148] Bahrambeigi V, Song X, Sperle K, et al. Distinct patterns of complex rearrangements and a mutational signature of microhomeology are frequently observed in PLP1 copy number gain structural variants. Genome Med 2019;11(1):80.
[149] Liu P, Erez A, Nagamani SC, et al. Chromosome catastrophes involve replication mechanisms generating complex genomic rearrangements. Cell 2011;146(6):889–903.
[150] Smith CE, Llorente B, Symington LS. Template switching during break-induced replication. Nature 2007;447(7140):102–5.
[151] Pellestor F, Gatinois V. Chromoanagenesis: a piece of the macroevolution scenario. Mol Cytogenet 2020;13:3.
[152] Miller DT, Adam MP, Aradhya S, et al. Consensus statement: chromosomal microarray is a first-tier clinical diagnostic test for individuals with developmental disabilities or congenital anomalies. Am J Hum Genet 2010;86(5):749–64.
[153] Itsara A, Cooper GM, Baker C, et al. Population analysis of large copy number variants and hotspots of human genetic disease. Am J Hum Genet 2009;84(2):148–61.
[154] Sharp AJ, Hansen S, Selzer RR, et al. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat Genet 2006;38(9):1038–42.
[155] Koolen DA, Sharp AJ, Hurst JA, et al. Clinical and molecular delineation of the 17q21.31 microdeletion syndrome. J Med Genet 2008;45(11):710–20.
[156] Shaw-Smith C, Pittman AM, Willatt L, et al. Microdeletion encompassing MAPT at chromosome 17q21.3 is associated with developmental delay and learning disability. Nat Genet 2006;38(9):1032–7.
[157] Mefford HC, Rosenfeld JA, Shur N, et al. Further clinical and molecular delineation of the 15q24 microdeletion syndrome. J Med Genet 2012;49(2):110–18.
[158] Tuzun E, Sharp AJ, Bailey JA, et al. Fine-scale structural variation of the human genome. Nat Genet 2005;37(7):727–32.
[159] Carvalho CM, Pehlivan D, Ramocki MB, et al. Replicative mechanisms for CNV formation are error prone. Nat Genet 2013;45(11):1319–26.
[160] Zhang F, Khajavi M, Connolly AM, Towne CF, Batish SD, Lupski JR. The DNA replication FoSTeS/MMBIR mechanism can generate genomic, genic and exonic complex rearrangements in humans. Nat Genet 2009;41(7):849–53.
[161] Hijazi H, Coelho FS, Gonzaga-Jauregui C, et al. Xq22 deletions and correlation with distinct neurological disease traits in females: further evidence for a contiguous gene syndrome. Hum Mutat 2020;41(1):150–68.
[162] Levy SE, Myers RM. Advancements in next-generation sequencing. Annu Rev Genomics Hum Genet 2016;17(1):95–115.
[163] Ng SB, Turner EH, Robertson PD, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature 2009;461(7261):272–6.
[164] Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007;318(5849):420–6.
[165] McKernan KJ, Peckham HE, Costa GL, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res 2009;19(9):1527–41.
[166] Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453(7191):56–64.
[167] Kidd JM, Graves T, Newman TL, et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 2010;143(5):837–47.

References

57

[168] Genomes Project C, Abecasis GR, Altshuler D, et al. A map of human genome variation from population-scale sequencing. Nature 2010;467(7319):106173. [169] Redon R, Ishikawa S, Fitch KR, et al. Global variation in copy number in the human genome. Nature 2006;444(7118):44454. [170] Collins RL, Brand H, Redin CE, et al. Defining the diverse spectrum of inversions, complex structural variation, and chromothripsis in the morbid human genome. Genome Biol 2017;18(1):36. [171] Redin C, Brand H, Collins RL, et al. The genomic landscape of balanced cytogenetic abnormalities associated with human congenital anomalies. Nat Genet 2017;49(1):3645. [172] Lupianez DG, Kraft K, Heinrich V, et al. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell 2015;161(5):101225. [173] Ordulu Z, Kammin T, Brand H, et al. Structural chromosomal rearrangements require nucleotide-level resolution: lessons from next-generation sequencing in prenatal diagnosis. Am J Hum Genet 2016;99 (5):101533. [174] Zepeda-Mendoza CJ, Ibn-Salem J, Kammin T, et al. Computational prediction of position effects of apparently balanced human chromosomal rearrangements. Am J Hum Genet 2017;101(2):20617. [175] Spielmann M, Lupianez DG, Mundlos S. Structural variation in the 3D genome. Nat Rev Genet 2018;19(7):45367. [176] Zepeda-Mendoza CJ, Bardon A, Kammin T, et al. Phenotypic interpretation of complex chromosomal rearrangements informed by nucleotide-level resolution and structural organization of chromatin. Eur J Hum Genet 2018;26(3):37481. [177] Vissers LELM, van Nimwegen KJM, Schieving JH, et al. A clinical utility study of exome sequencing versus conventional genetic testing in pediatric neurology. Genet Med 2017;19(9):105563. [178] Perucca P, Scheffer IE, Harvey AS, et al. Real-world utility of whole exome sequencing with targeted gene analysis for focal epilepsy. Epilepsy Res 2017;131:18. [179] Stark Z, Schofield D, Alam K, et al. 
Prospective comparison of the cost-effectiveness of clinical whole-exome sequencing with that of usual care overwhelmingly supports early use and reimbursement. Genet Med 2017;19(8):86774. [180] Posey JE, Rosenfeld JA, James RA, et al. Molecular diagnostic experience of whole-exome sequencing in adult patients. Genet Med 2016;18(7):67885. [181] Stark Z, Tan TY, Chong B, et al. A prospective evaluation of whole-exome sequencing as a first-tier molecular test in infants with suspected monogenic disorders. Genet Med 2016;18(11):10906. [182] Clark MM, Stark Z, Farnaes L, et al. Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases. NPJ Genom Med 2018;3(1):16. [183] Yamamoto T, Shimojima K, Ondo Y, et al. Challenges in detecting genomic copy number aberrations using next-generation sequencing data and the eXome Hidden Markov Model: a clinical exome-first diagnostic approach. Hum Genome Var 2016;3(1):16025. [184] Zare F, Dow M, Monteleone N, Hosny A, Nabavi S. An evaluation of copy number variation detection tools for cancer using whole exome sequencing data. BMC Bioinforma 2017;18(1):286. [185] Wang W, Wang W, Sun W, Crowley JJ, Szatkiewicz JP. Allele-specific copy-number discovery from whole-genome and whole-exome sequencing. Nucleic Acids Res 2015;43(14):e90. [186] Pfundt R, del Rosario M, Vissers LELM, et al. Detection of clinically relevant copy-number variants by exome sequencing in a large cohort of genetic disorders. Genet Med 2017;19(6):66775. [187] Hong CS, Singh LN, Mullikin JC, Biesecker LG. Assessing the reproducibility of exome copy number variations predictions. Genome Med 2016;8(1):82. [188] Ruderfer DM, Hamamsy T, Lek M, et al. Patterns of genic intolerance of rare copy number variation in 59,898 human exomes. Nat Genet 2016;48(10):110711.

58

Chapter 3 Genomic disorders in the genomics era

[189] Dharmadhikari AV, Ghosh R, Yuan B, et al. Copy number variant and runs of homozygosity detection by microarrays enabled more precise molecular diagnoses in 11,020 clinical exome cases. Genome Med 2019;11(1):30. [190] Yuan B, Wang L, Liu P, et al. CNVs cause autosomal recessive genetic diseases with or without involvement of SNV/indels. Genet Med 2020;22(10):163341. [191] Lord J, McMullan DJ, Eberhardt RY, et al. Prenatal exome sequencing analysis in fetal structural anomalies detected by ultrasonography (PAGE): a cohort study. Lancet 2019;393(10173):74757. [192] Petrovski S, Aggarwal V, Giordano JL, et al. Whole-exome sequencing in the evaluation of fetal structural anomalies: a prospective cohort study. Lancet 2019;393(10173):75867. [193] Soden SE, Saunders CJ, Willig LK, et al. Effectiveness of exome and genome sequencing guided by acuity of illness for diagnosis of neurodevelopmental disorders. Sci Transl Med 2014;6(265):265ra168. [194] Gross AM, Ajay SS, Rajan V, et al. Copy-number variants in clinical genome sequencing: deployment and interpretation for rare and undiagnosed disease. Genet Med 2019;21(5):112130. [195] Wang H, Dong Z, Zhang R, et al. Low-pass genome sequencing versus chromosomal microarray analysis: implementation in prenatal diagnosis. Genet Med 2019;. [196] Dong Z, Yan J, Xu F, et al. Genome sequencing explores complexity of chromosomal abnormalities in recurrent miscarriage. Am J Hum Genet 2019;105(6):110211. [197] Zarrei M, MacDonald JR, Merico D, Scherer SW. A copy number variation map of the human genome. Nat Rev Genet 2015;16(3):17283. [198] Sjodin P, Jakobsson M. Population genetic nature of copy number variation. Methods Mol Biol 2012;838:20923. [199] Chen W, Hayward C, Wright AF, et al. Copy number variation across European populations. PLoS one 2011;6(8):e23087. [200] Cheng Z, Ventura M, She X, et al. A genome-wide comparison of recent chimpanzee and human segmental duplications. Nature 2005;437(7055):8893. [201] Rodrigo G. 
Evolutionary impact of copy number variation rates. BMC Res Notes 2017;10(1):393. [202] Iskow RC, Gokcumen O, Lee C. Exploring the role of copy number variants in human adaptation. Trends Genet 2012;28(6):24557. [203] MacDonald JR, Ziman R, Yuen RK, Feuk L, Scherer SW. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Res 2014;42(Database issue): D98692. [204] Landrum MJ, Lee JM, Riley GR, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014;42(Database issue):D9805. [205] Firth HV, Richards SM, Bevan AP, et al. DECIPHER: database of chromosomal imbalance and phenotype in humans using ensembl resources. Am J Hum Genet 2009;84(4):52433. [206] Mark C., Jim D., Martin D., et al. The National Genomics Research and Healthcare Knowledgebase. 2019. [207] Collins RL, Brand H, Karczewski KJ, et al. A structural variation reference for medical and population genetics. Nature 2020;581(7809):44451. [208] Maxwell E.K., Packer J.S., O’Dushlaine C., et al. Profiling copy number variation and disease associations from 50,726 DiscovEHR Study exomes. 2017:119461. [209] Riggs ER, Church DM, Hanson K, et al. Towards an evidence-based process for the clinical interpretation of copy number variation. Clin Genet 2012;81(5):40312. [210] Riggs ER, Andersen EF, Cherry AM, et al. Technical standards for the interpretation and reporting of constitutional copy-number variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics (ACMG) and the Clinical Genome Resource (ClinGen). Genet Med 2019;.

References

59

[211] Richards S, Aziz N, Bale S, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 2015;17(5):40524. [212] Monaghan KG, Leach NT, Pekarek D, et al. The use of fetal exome sequencing in prenatal diagnosis: a points to consider document of the American College of Medical Genetics and Genomics (ACMG). Genet Med 2020;22:67580. [213] Riggs ER, Nelson T, Merz A, et al. Copy number variant discrepancy resolution using the ClinGen dosage sensitivity map results in updated clinical interpretations in ClinVar. Hum Mutat 2018;39(11): 16509. [214] Lionel AC, Costain G, Monfared N, et al. Improved diagnostic yield compared with targeted gene sequencing panels suggests a role for whole-genome sequencing as a first-tier genetic test. Genet Med 2018;20(4):43543. [215] Marshall CR, Chowdhury S, Taft RJ, et al. Best practices for the analytical validation of clinical whole-genome sequencing intended for the diagnosis of germline disease. NPJ Genom Med 2020;5:47. [216] Bick D, Jones M, Taylor SL, Taft RJ, Belmont J. Case for genome sequencing in infants and children with rare, undiagnosed or genetic diseases. J Med Genet 2019;56(12):78391. [217] Stavropoulos DJ, Merico D, Jobling R, et al. Whole genome sequencing expands diagnostic utility and improves clinical management in pediatric medicine. NPJ Genom Med 2016;1:15012. [218] D’Gama AM, Walsh CA. Somatic mosaicism and neurodevelopmental disease. Nat Neurosci 2018; 21(11):150414.


CHAPTER 4

Genomic sequencing of rare diseases

Claudia Gonzaga-Jauregui1 and Cinthya J. Zepeda Mendoza2,3

1 International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano (LIIGH), Universidad Nacional Autónoma de México (UNAM), Mexico
2 Cytogenetics and Genomic Microarray, ARUP Laboratories, Salt Lake City, UT, United States
3 Department of Pathology, University of Utah, Salt Lake City, UT, United States

4.1 Introduction: a human genome reference sequence

In 2001 the first draft of the human genome reference sequence was published by the International Human Genome Sequencing Consortium [1]. The Human Genome Project (HGP) was a decade-long multinational effort that aimed to produce the first comprehensive and representative sequence of nucleotides that spell out the totality of instructions for every living human being. While the human genome reference sequence is unquestionably one of the greatest scientific achievements of humanity, the work of the HGP did not end with that first draft; refinement of the reference sequence has continued through the years and up to the present [2,3]. We continue to learn about the sequence, structure, architecture, and variation of the human genome and, in so doing, understand more about our species.

A reference assembly for any organism is a linear representation, in the form of A’s, C’s, G’s, and T’s, of the totality of the genetic information of that organism. When read in the correct order and context, the sequence of letters or nucleotides provides the instructions to produce the bioactive molecules that perform the essential biological functions for that organism. Research efforts around the world have produced genomic sequences for about 15,000 organisms, most of them microbial. For the majority of complex organisms that have had their genomes sequenced, such as the dog [4] or the chimpanzee [5], the reference sequence represents that of a single individual of that particular species. In the case of the human genome, the reference sequence is a composite of a handful of individuals, as it aimed to be a pan-human representation of our species’ genome.

Two major insights from the HGP and the initial analyses of the human genome reference sequence were the realization that the human genome has fewer genes than anticipated and that it is more variable than previously thought.
These two major features of the genome remain the cornerstones of current genomic analyses and of our understanding and interpretation of genomic data.

A simple and broad definition of “gene” is a DNA sequence that encodes a bioactive molecule. Generally, the major category of genes encompasses those that encode proteins, the major structural and functional molecules of cells. However, research has shown that other genes can encode multiple types of noncoding RNA (ncRNA) molecules, which are not translated into proteins but can exert biological functions. Some of these ncRNAs participate in important biological processes like protein translation via ribosomal RNA (rRNA) and transfer RNA (tRNA) molecules, whereas others are involved in the regulation of expression of other genes.

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00001-6 © 2021 Elsevier Inc. All rights reserved.

Prior to the HGP, it was predicted that protein-coding gene content was proportional to organismal complexity; that is, the more complex the organism, the more genes its genome would need in order to achieve such complexity. Therefore bacteria would only have a handful of genes; worms, flies, or mice a few thousand; but humans were predicted to have between 50,000 and 100,000 genes [6]. Surprisingly, initial analyses of the human reference sequence estimated about 35,000 protein-coding genes in the human genome. Twenty years later, we have refined our estimate to about 20,000 protein-coding genes.

Another major realization derived from the HGP was the degree of variation contained in the human genome. While most of the genome is identical between two individuals, a striking observation was that about 1 in 1300 base pairs (bp) in the human genome was different between two distinct human genomes. These one-letter changes in the DNA were named single-nucleotide polymorphisms (SNPs), and more than 1.4 million SNPs were identified in the first draft of the human genome [1]. The recognition that this small fraction of variation in the genome was responsible for the characteristics that make each person unique, together with the observation that some of these SNPs were shared among individuals of the same ancestry background and passed together from generation to generation in large blocks named haplotypes, led to the launch of projects aiming to assess genomic variation across world populations, such as the HapMap [7–10] and the 1000 Genomes [11–13] projects. Genomic sequencing and characterization of more human genomes not only unveiled much more variation, but also revealed that the genome is not static and that individuals vary between each other by more than a few million SNPs.
Another major source of human genomic variation, namely structural variation, was therefore later appreciated. Structural variation (SV) refers to the deviation in order, orientation, or number of copies of DNA segments with respect to the human genome reference. As such, SV encompasses insertions, translocations, inversions, and copy-number variants (CNVs), which are large deletions, duplications, or amplifications of DNA. Some of these CNVs and other types of SV can cause disease by altering gene dosage or expression through multiple mechanisms, as discussed in Chapter 3, Genomic Disorders in the Genomics Era.

In addition to the main goal of generating and characterizing the first genome sequence of the human species, the HGP had other ancillary goals to enable the understanding of the human genome and harnessing its potential [6]. Some of these goals included the development of sequencing technologies that could enable faster and cheaper genomic sequencing and, in parallel, the development of informatic tools that would facilitate the analyses of the produced genomic sequence data. Additionally, the HGP considered the sequencing of model organisms such as Mus musculus (house mouse), Caenorhabditis elegans (nematode worm), Drosophila melanogaster (fruit fly), and Saccharomyces cerevisiae (baker’s yeast) an important task not only to ensure the technology and approaches were appropriate for a project of the scale of sequencing the entire human genome, but also as a means to understand this information in the context of evolution through comparative genomics.

Even after the completion of the main goal of the HGP and the initial [1] and later [2] publications of the reference human genome sequence, the aims of the HGP have continued to guide the field of genomics.
The development of faster, cheaper, and higher-throughput sequencing technologies has undoubtedly propelled the implementation of genomics in research and clinical settings in the last decade like no other discipline before it. Similarly, larger and more ambitious projects aiming to sequence additional nonhuman species have been launched, with objectives that go beyond understanding the human species and seek to derive knowledge that may contribute to preserving the biodiversity of the planet.

In summary, the human genome reference sequence is an average representation of the entire DNA content of the human species. The length of the haploid human genome is generally and simply referred to as 3.2 billion letters (bp), although to be accurate the latest human genome reference assembly (GRCh38) is composed of 3,209,286,105 nucleotides [3]. Although first published almost 20 years ago, the human genome reference sequence is still a work in progress. The human genome is the most studied, refined, and well-curated genome of any species on Earth, and yet there is much to learn and improve. The most recent assembly or version of the human reference sequence (GRCh38) was first released in December 2013, adding and refining sequence in complex genomic regions, closing gaps, and improving annotations to better serve its purpose in research and in the clinic [3].
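As a rough consistency check on the figures quoted in this section, the short Python sketch below multiplies the reported rate of about 1 difference per 1300 bp by the GRCh38 length given above. The function name and the assumption of a uniform rate across the genome are illustrative simplifications introduced here, not part of the original analyses.

```python
# Back-of-the-envelope estimate. Assumes differences are spread uniformly
# along the genome, which real genomes do not strictly obey.
GRCH38_LENGTH_BP = 3_209_286_105   # reference length quoted in the text
BP_PER_DIFFERENCE = 1300           # about 1 difference per 1300 bp


def expected_differences(genome_length_bp: int, bp_per_difference: int) -> float:
    """Expected single-nucleotide differences between two haploid genomes."""
    return genome_length_bp / bp_per_difference


if __name__ == "__main__":
    n = expected_differences(GRCH38_LENGTH_BP, BP_PER_DIFFERENCE)
    print(f"about {n / 1e6:.1f} million expected differences")
```

Under these assumptions the estimate comes to roughly 2.5 million single-nucleotide differences between two haploid genomes, the same order of magnitude as the "few million SNPs" mentioned in the text.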

4.2 Sequencing of genomes and exomes

As previously mentioned, the human genome is made up of about 3200 million bp (Mbp), or 3.2 Gb, but the reference sequence is a haploid representation of the genome. In reality, humans are diploid, meaning that every individual inherits one entire haploid copy of the genome from their biological mother and another from their biological father. Therefore the actual genomic content of a human diploid cell is twice the size of the reference sequence, 6.4 Gb. However, as discussed previously, the genes that encode the functional molecules (either proteins or RNA) comprise just a small fraction of the entirety of the human genome. Most protein-coding genes in humans and other organisms are not made up of continuous sequence; rather, they are formed by exons or “coding sequences” that encode the amino acid sequence of the protein (based on the genetic code, see Chapter 1: Introduction to Concepts of Genetics and Genomics), interspersed with “noncoding sequences” or introns, which do not encode amino acids and are removed during messenger RNA (mRNA) processing. The totality of the coding sequences or exons in the human genome has been termed the “exome” and accounts for about 1% of the human genome sequence. Consequently, a diploid human individual carries two copies of the exome, which together still amount to approximately 1% of their total DNA.

A major source of debate in current genomic research is the choice between performing whole-genome sequencing (WGS) or whole-exome sequencing (WES). The decision can be based on different factors, including cost, throughput, analysis, and relevance to the scientific question [14]. After the HGP, the development of newer DNA sequencing technologies drove down the cost of sequencing dramatically (Fig. 4.1).
However, at the beginning of the personal genomics era, when single individuals began to be sequenced and analyzed, the cost of sequencing a full human genome ranged from $50,000 [15] to a few million dollars [16,17]. The development of targeted capture technologies and exome sequencing provided an alternative that focuses on the protein-coding portion of the genome at a fraction of the cost of sequencing the whole genome. In addition to the sequencing cost itself, it is important to note that genomic projects require computational processing, storage, and analysis of the generated data. Since the exome represents about 1% of the genome, a WGS project would generate roughly 100× more data than a WES project.

FIGURE 4.1 Decrease in sequencing cost per genome since the Human Genome Project. The graph shows the steep decrease in costs for sequencing an entire human genome from the end of the Human Genome Project (HGP) to 2020. The cost of producing the first human genome reference sequence by the HGP has been estimated at about $2.7 billion. Since then, development of next-generation sequencing (NGS) technologies has driven the cost of sequencing DNA down by increasing throughput and parallelizing the process. In 2007 the first personal genomes from Craig Venter and James Watson were published, marking the beginning of the “Personal genomics era,” where sequencing of single individual whole genomes became a reality. Later, in 2010, the sequencing of Jim Lupski’s whole genome to identify the cause of a genetic disorder and the development of targeted capture methodologies for exome sequencing marked the beginning of the application of genomic technologies for the identification of the molecular causes of genetic disorders. Shortly after, at the end of 2011, whole-exome sequencing-based tests for clinical diagnostics became available. The improvement in throughput and scalability of genomic sequencing platforms has enabled the large-scale implementation of human genomics for research and in the clinic and will likely continue to drive the field of human genomics for the foreseeable future. With data from the National Human Genome Research Institute (NHGRI), USA (www.genome.gov).

Although great progress has been made on deciphering the function of many noncoding regions and sequences in the human genome, the reality remains that we can interpret changes found in the coding sequences with a significantly higher degree of accuracy, thanks to our knowledge of the genetic code and the translation of DNA into proteins via mRNA. Additionally, most known genetic disorders and rare diseases are caused by changes in the sequence of genes that encode proteins. Therefore WES can be a suitable and more affordable alternative to WGS. Nevertheless, the continuous improvement of genomic sequencing technologies, making them more efficient and high-throughput, has reduced the cost gap between WGS and WES significantly over the last decade. However, the analysis and interpretation of the large data output resulting from WGS remains an ongoing challenge for the genomics community, given the sheer number of variants identified and the limitations in our interpretation of their significance, particularly when present in noncoding sequence. Consequently, current WGS approaches generally focus their analyses on the “interpretable” fraction of the genome by prioritizing a computational exome analysis. While we may not yet be able to interpret the vast majority of the noncoding genome sequence, the argument may be made to have the sequence in hand and work on interpretation later. An additional advantage of WGS is the identification of SV that is not as accurately identified through WES alone (see Chapter 3: Genomic Disorders in the Genomics Era), enabling a more comprehensive view of the genomic variation in an individual’s genome. Ultimately, the choice between WGS and WES needs to be guided by the genetics question to be answered and the resources available to do so.
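The data-volume comparison between WGS and WES discussed in this section can be sketched with simple arithmetic. The Python example below is a minimal illustration that assumes a 3.2 Gb target for WGS and an exome of about 1% of that; the function names and the depth values (30× and 100×) are hypothetical choices made here for illustration and are not taken from the chapter.

```python
# Illustrative arithmetic only; depths are hypothetical assumptions.
GENOME_BP = 3_200_000_000      # haploid reference length, about 3.2 Gb
EXOME_BP = GENOME_BP // 100    # exome taken as about 1% of the genome


def sequenced_bases(target_bp: int, mean_depth: int) -> int:
    """Total bases that must be sequenced to cover a target at a mean depth."""
    return target_bp * mean_depth


def wgs_to_wes_ratio(wgs_depth: int, wes_depth: int) -> float:
    """Ratio of raw sequence output for WGS versus WES at the given depths."""
    return sequenced_bases(GENOME_BP, wgs_depth) / sequenced_bases(EXOME_BP, wes_depth)


if __name__ == "__main__":
    print(wgs_to_wes_ratio(30, 30))    # both assays at the same mean depth
    print(wgs_to_wes_ratio(30, 100))   # exome sequenced more deeply
```

At equal depth the ratio is 100, which corresponds to the difference in target size alone. If the exome is sequenced more deeply than the genome, as assumed in the second call, the gap in raw output narrows (to 30 in this hypothetical case), so the practical difference depends on the coverage chosen for each assay.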

4.3 The process of genomic sequencing

The process to sequence a human genome or exome is now relatively straightforward, and the methodological differences arise mainly from the preferred capture, amplification, and sequencing platforms used [18]. In broad strokes, the process can be reduced to four different steps: (1) DNA preparation, (2) library construction, (3) sequencing, and (4) data analysis (Fig. 4.2).

4.3.1 DNA preparation

Human genomic DNA can be isolated from different sources, with peripheral blood being generally preferred as the starting biological material. However, available reagents to stabilize other biological fluids such as saliva have proven useful when a blood draw is not possible or yields insufficient material, providing an adequate yield and quality of DNA for sequencing when collection is done properly. Extraction of DNA from tissue biopsies, preferably fresh tissues, can be performed by first disrupting the tissue using proteinase K. DNA extraction from formalin-fixed paraffin-embedded (FFPE) tissues is often suboptimal in yield and molecular weight integrity; however, various methodologies have been developed for more efficient FFPE DNA isolation, which include better deparaffinization and repair strategies [19]. The DNA yield from tissue biopsies is lower due to the reduced amount of available starting material, and there is a risk of DNA degradation during extraction and purification. Current next-generation technologies allow the preparation of sequencing libraries for WGS with as little as 500 ng of genomic DNA.
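To give a sense of scale for the 500 ng input mentioned above, the following Python sketch estimates how many diploid genome copies that mass represents. It assumes an average molar mass of 650 g/mol per base pair (a commonly used approximation, not a value from this chapter) and the 6.4 Gb diploid genome size given earlier; the function names are hypothetical.

```python
# Rough estimate under stated assumptions; not a laboratory protocol.
AVOGADRO = 6.022e23              # molecules per mole
AVG_BP_MASS_G_PER_MOL = 650.0    # assumed average mass of one base pair
DIPLOID_GENOME_BP = 6.4e9        # diploid genome size used in the text


def genome_mass_pg(n_bp: float) -> float:
    """Approximate mass, in picograms, of one double-stranded genome copy."""
    grams = n_bp * AVG_BP_MASS_G_PER_MOL / AVOGADRO
    return grams * 1e12


def genome_copies(dna_ng: float, n_bp: float) -> float:
    """Approximate number of genome copies contained in a DNA mass (ng)."""
    return dna_ng * 1000.0 / genome_mass_pg(n_bp)


if __name__ == "__main__":
    print(f"{genome_mass_pg(DIPLOID_GENOME_BP):.1f} pg per diploid genome")
    print(f"{genome_copies(500, DIPLOID_GENOME_BP):,.0f} diploid copies in 500 ng")
```

With these assumptions, one diploid genome weighs roughly 7 pg, so 500 ng corresponds to on the order of 70,000 diploid genome copies.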

4.3.2 Library preparation

After extraction and purification, genomic DNA is sheared via mechanical (such as nebulization or sonication) or enzymatic methods into fragments of ~200–400 bp. Fragment ends are enzymatically repaired and adapters, which can be barcoded, are ligated to the ends. For WGS, the whole-genome library is amplified and subjected to next-generation sequencing (NGS). For WES or other targeted approaches, target capture and enrichment are performed prior to amplification.

FIGURE 4.2 Sample preparation, sequencing, and analysis workflow for human genomic sequencing projects.

Prior to massively parallel sequencing, human genomic sequencing had already used target enrichment approaches, such as polymerase chain reaction (PCR) for specific segment amplification, or cloning of discrete genomic segments in bacterial vectors to construct bacterial artificial chromosome (BAC) and fosmid libraries, but these were laborious and not highly scalable methods. The currently used targeted capture methodologies were originally developed using oligonucleotide probes covalently bound to a solid array glass slide and designed to specifically bind target regions or the exons of target genes [20–23]. Later, solution-based capture was developed [24,25], in which target fragments specifically hybridize to biotinylated probes that are then pulled down using streptavidin-coated magnetic beads. During exome capture hybridization, it is important to block repetitive DNA; this is achieved by adding an excess of human Cot-1 DNA to the hybridization solution to avoid nonspecific cross-hybridization. The fragmented genomic DNA hybridizes to the complementary probes, either on the array or in solution as biotinylated oligonucleotides; any non-target fragments that do not hybridize are subsequently washed away and therefore not captured for sequencing. The capture efficiency depends on the target fragment length, sequence complexity, and GC content of the region. Solution-based capture is cheaper and more scalable than microarray-based capture; thus most commercially available exome capture reagents use solution-based capture [26].

The above-mentioned methodologies are well suited for enriching the “whole exome” with capture libraries of ~50 Mb. While “whole-exome sequencing” intends to capture and subsequently sequence all exons in the human genome, it is recognized that different targeted capture designs may differ based upon the genome/gene annotation(s) used for design. In cases where a more targeted approach is desired, such as for gene panels or specific regions, approaches such as molecular inversion probes (MIPs) and other multiplex or modified PCR-based amplification of targets can be used to enrich for the desired regions on a reduced scale [27,28]. MIPs were originally developed, improved, and applied for high-throughput multiplex SNP genotyping [29,30]. The current MIP technology relies on the specific design of ~70-mer capture probes. The MIP structure is composed of a common linker sequence flanked by homologous targeting arms that hybridize upstream and downstream of the genomic region of interest. A synthesis reaction follows in which a DNA polymerase copies the target sequence using the upstream targeting arm as an extension primer. After extension, the extended 3′ end is ligated to the 5′ end of the downstream targeting arm, and the probe is circularized. Further postcapture library amplification, barcoding, and sequencing adapter ligation can later be performed using the common MIP linker sequence [31–34].
Current MIP designs and approaches have proven effective at capturing ~55,000 targets or ~6 Mb [32,33].

Most of the commonly used sequencing technologies rely on the amplification of the template to be sequenced in order to form clusters of clonally amplified molecules termed “polonies,” a name that derives from “PCR” and “colony” and refers to the original bacterial colonies needed to amplify DNA BACs for sequencing. There are two predominant approaches for pre-sequencing library amplification. Emulsion PCR amplification is performed in a water–oil emulsion that contains the captured fragment library, deoxynucleotides (dNTPs), polymerase, and beads with oligonucleotide primers complementary to the adapters initially ligated to the DNA fragments. In the test tube, each of the droplets formed by the water–oil emulsion acts as an individual isolated PCR reaction in which the template fragments are clonally amplified. These beads are then washed and cross-linked or spread onto a slide or solid platform on which the sequencing reaction is performed. The second approach is solid-phase amplification, in which pairs of oligonucleotide amplification primers are covalently bound to a solid phase and the template is amplified by bridge amplification of the target fragment using a pair of primers, generating clusters of clonally amplified target molecules that are already immobilized. Solid-phase bridge amplification is primarily used by the most commonly preferred sequencing platform, Illumina. Third-generation sequencing technologies that aim to perform single-molecule DNA sequencing do not require this amplification step.

The efficiency of the targeted capture enrichment step can be easily assessed by quantitative PCR (qPCR), testing a few target loci in the initial non-amplified input DNA versus the amplified captured DNA and comparing their cycle threshold (CT) values. However, this quality control step does not provide information on the specificity and sensitivity of the capture method, just the efficiency of the enrichment [22].
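The qPCR comparison of CT values described above can be turned into an approximate fold enrichment. The Python sketch below is a minimal illustration that assumes equal amounts of DNA are loaded in both reactions and that each PCR cycle doubles the product (an amplification factor of 2); the function name, the locus labels, and the CT values are hypothetical and serve only as an example.

```python
# Minimal sketch of a CT-based enrichment estimate under stated assumptions.
def fold_enrichment(ct_input: float, ct_captured: float,
                    amplification_factor: float = 2.0) -> float:
    """Fold enrichment of a target locus in captured versus input DNA.

    A lower CT in the captured library means the target became detectable
    earlier, i.e. it is more abundant than in the input DNA.
    """
    return amplification_factor ** (ct_input - ct_captured)


if __name__ == "__main__":
    # Hypothetical CT values for two target loci: (input DNA, captured DNA).
    example_loci = {"locus_A": (26.0, 18.0), "locus_B": (25.0, 19.0)}
    for name, (ct_in, ct_cap) in example_loci.items():
        print(name, fold_enrichment(ct_in, ct_cap))
```

A difference of 8 cycles under perfect doubling corresponds to a 256-fold enrichment, and 6 cycles to a 64-fold enrichment. Real assays rarely achieve perfect doubling, so the `amplification_factor` parameter is exposed to allow a lower measured value.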


4.3.3 Sequencing

After library preparation and amplification (if necessary), the sample can be sequenced using a variety of different technologies with different underlying chemistries (see Table 4.1). Sequencing technologies can be historically divided into three generations: first generation, of which Sanger sequencing is the major, most reliable, and still widely used, albeit expensive, technology; second generation, better known as NGS technologies, which include most of the massively parallel sequencing platforms; and third generation, which encompasses single-molecule sequencing (SMS) technologies. The characteristics, chemistry, and advantages/disadvantages of the major sequencing technologies will be further discussed in the next section.

4.3.4 Data analysis

After sequencing has been performed, depending on the platform, the data will be processed by the instrument and turned into sequence reads, that is, short fragments of sequence in their A, C, T, and G representation. In order to make sense of the resulting genomic sequence data, downstream bioinformatic processes and analyses must be performed. Broadly, the data analysis step encompasses processes for mapping and alignment of the short-sequence reads to the human reference genome, variant and genotype calling, variant annotation, and interpretation. These steps will be discussed in more depth in following sections.
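To make the mapping and variant-calling steps above more concrete, the following toy Python sketch places short reads on a reference by ungapped comparison, builds a per-position pileup, and reports positions where the reads consistently disagree with the reference. It is a deliberately simplified illustration: the reference, the reads, the thresholds, and all function names are hypothetical, and real analyses rely on dedicated aligners and variant callers rather than this brute-force approach.

```python
from collections import Counter, defaultdict

# Hypothetical 20-bp reference and four 8-bp reads; three reads carry an
# A-to-G change at 0-based position 9.
REFERENCE = "ACGTTAGCCATGGATCCGTA"
READS = ["TAGCCGTG", "GCCGTGGA", "CGTGGATC", "ACGTTAGC"]


def best_alignment(read, reference, max_mismatches=2):
    """Return (position, mismatches) of the best ungapped placement, or None."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def pileup(reads, reference, max_mismatches=2):
    """Count the bases observed at each reference position across mapped reads."""
    columns = defaultdict(Counter)
    for read in reads:
        hit = best_alignment(read, reference, max_mismatches)
        if hit is None:
            continue  # unmapped read
        pos, _ = hit
        for offset, base in enumerate(read):
            columns[pos + offset][base] += 1
    return columns


def call_variants(reads, reference, min_depth=2, min_fraction=0.8):
    """Report (position, reference base, observed base) for well-supported changes."""
    calls = []
    columns = pileup(reads, reference)
    for pos in sorted(columns):
        counts = columns[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos] and support / depth >= min_fraction:
            calls.append((pos, reference[pos], base))
    return calls


if __name__ == "__main__":
    print(call_variants(READS, REFERENCE))
```

The sketch ignores insertions and deletions, base qualities, diploid genotypes, and repeats, all of which matter in practice; it is meant only to show how read placement and per-position evidence lead to a variant call.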

4.4 Overview of sequencing technologies

In its most basic definition, DNA sequencing is the process of determining the series and order of nucleotides in a molecule of DNA. After the determination of the double helical structure of DNA in 1953 by Watson and Crick, using the X-ray crystallography images and data obtained by Raymond Gosling and Rosalind Franklin, many scientists set out to develop molecular biology techniques that would allow the sequencing of DNA. Protein and RNA sequencing techniques, in fact, came before DNA sequencing methodologies. However, the understanding of the inherent base complementarity of DNA has been instrumental to the development of DNA sequencing technologies. In 1977 Frederick Sanger published his DNA sequencing method using chain-terminating nucleotides [35], a method that became the foundation of some of the most used modern sequencing technologies.

Sequencing technologies can be categorized by the enzymatic chemistry and mechanism leveraged to determine the sequence of nucleotides in a DNA strand. Broadly, they are divided into sequencing-by-synthesis (SBS), which encompasses most of the sequencing technologies to date, and sequencing-by-ligation (SBL). As the name indicates, SBS technologies use a single-stranded DNA template to synthesize a new complementary strand, “reading” the sequential, one-at-a-time incorporation of nucleotides to determine the DNA sequence. SBS technologies utilize a DNA polymerase as the processing enzyme and are therefore bound to the inherent error rate of the polymerases used, even though most of these are engineered to be thermostable and to have high processivity, specificity, and fidelity. Conversely, SBL uses a DNA ligase that incorporates short oligonucleotides of predetermined sequence that are complementary to the template DNA strand. In all cases, a short oligonucleotide primer complementary to one end of the template strand is used to kick-start the sequencing reaction and the incorporation of new nucleotides.

Table 4.1 Characteristics of major sequencing technologies in human genomics.

| Generation | Common/commercial name | Protein/enzyme used for DNA sequencing | Chemistry | Detection | Average read length | Advantages | Limitations |
|---|---|---|---|---|---|---|---|
| First | Sanger sequencing | Sequencing by synthesis (SBS), DNA polymerase | Dideoxynucleotide (terminators) sequencing | Fluorescence | ~1 kb | Long reads, high accuracy | High cost, low throughput, hard to multiplex and scale up |
| Second (next-generation sequencing, NGS) | Illumina (previously Solexa) | SBS, DNA polymerase | Reversible terminators | Fluorescence | 300 bp | High throughput, multiplexing, and scalability. Low cost. | Higher error rate compared to Sanger sequencing. Short read length makes it difficult to phase variants and identify structural variation. |
| Second (NGS) | 454 (Roche) | SBS, DNA polymerase | Pyrosequencing | Light | 300 bp | Longer reads for NGS, fast run time | High error rate; low resolution of homopolymer tracts due to signal saturation |
| Second (NGS) | Ion Torrent | SBS, DNA polymerase | Semiconductor ion sequencing | Hydrogen ion (H+) | 100–300 bp | Fast, scalable, and miniaturized | High error rate; low resolution of homopolymer tracts due to signal saturation |
| Second (NGS) | SOLiD | Sequencing by ligation (SBL), DNA ligase | Sequencing by oligonucleotide ligation and detection | Fluorescence | 50–75 bp | Lower error rate for NGS | Short read length, long running times, complicated bioinformatics for data processing. Did not achieve high throughput |
| Third | PacBio | SBS, DNA polymerase | Real-time single-molecule sequencing | Fluorescence | >10 kb | Long reads | High but random error rate (~10%); slower run times and lower throughput compared to NGS technologies |
| Third | Oxford Nanopore | Motor/pore protein complex | Nanopore sequencing | Ionic current | Up to 900 kb | Ultra long reads, suitable for de novo assembly. Portability | High error rate (~13%); low throughput |

4.4.1 First generation sequencing

As mentioned above, for historical purposes, sequencing technologies have been categorized in generations, with Sanger sequencing being the most widely used first-generation sequencing technology. The sequencing of the human genome was in fact performed and completed using Sanger sequencing in its entirety [1]. Sanger sequencing is an SBS technology that utilizes modified dideoxynucleotide triphosphate analogs (ddNTPs) that lack the 3′-hydroxyl (OH) group of native dNTPs, which is required to form the covalent phosphodiester bond with the 5′ end of the next nucleotide for DNA chain extension. Originally, the sequencing of a single DNA template was divided into four separate synthesis reactions, each containing the DNA template, an oligonucleotide primer, a mixture of normal dNTPs (A, C, G, and T), and a small amount of one of the four modified ddNTPs (ddATP, ddCTP, ddGTP, or ddTTP). In each of the four reactions, the oligonucleotide primer anneals to the beginning of the template and the DNA chain is extended by a thermostable DNA polymerase using the regular dNTPs. However, whenever one of the ddNTPs is incorporated during the extension phase, extension of the complementary strand halts. Repeating this over multiple cycles of synthesis yields a pool of fragments, each terminated at the position where a modified nucleotide was incorporated at random. The four reactions, containing thousands of amplified DNA fragments, were then loaded into individual lanes of a polyacrylamide electrophoresis gel. Electrophoresis is a technique that applies an electrical current across a matrix, in this case the gel, and separates molecules based on their size or molecular weight. Therefore the fragments from the four reactions are separated by size across the gel, with the smaller fragments running faster toward the bottom of the gel and the larger fragments moving more slowly and accumulating in bands toward the top. Since each lane corresponds to the individual ddNTP added to that reaction, the sequence of the newly synthesized strand (the complement of the template) is easily read by following the order of the bands from the bottom to the top of the gel across the four lanes.

Initially, the ddNTPs were labeled using radioactive phosphorus (32P), and X-ray films exposed to the radioactive gels were used to read the sequence. Later, with the development of fluorescent dyes, the ddNTPs became fluorescently labeled, allowing the simplification and eventual automation of the technique. Using four different fluorescent tags that emit light at different wavelengths when excited by a laser, it became possible to run a single sequencing reaction instead of four separate ones. The resulting mix of terminated fragments is then separated by capillary electrophoresis within a machine, aptly named a sequencer, that has a laser to excite the fluorophores and a detector that captures the light emitted by each fragment passing through the capillary, translating this into a graphical output known as a chromatogram. To this date, Sanger sequencing is performed using fluorescently labeled dideoxynucleotide terminators and, although it is the slowest and most costly, it remains the most reliable sequencing technology owing to its high accuracy and read lengths of ~1 kb.
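The logic of reading a four-lane gel from bottom to top can be illustrated with a toy model. This is an idealized sketch, not a simulation of real chemistry: it assumes that every position of the synthesized strand yields exactly one terminated fragment, it ignores the primer length, and the function names are hypothetical.

```python
def sanger_lanes(synthesized: str) -> dict[str, list[int]]:
    """Return, for each ddNTP lane, the lengths of the chain-terminated fragments.

    Idealization: each position produces one fragment, found in the lane whose
    ddNTP matches the base incorporated at that position.
    """
    lanes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesized, start=1):
        lanes[base].append(position)  # fragment length = position of the terminator
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the bands from the bottom (shortest fragment) to the top (longest)."""
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))


lanes = sanger_lanes("GATTACA")
print(lanes)            # fragment lengths found in each of the four lanes
print(read_gel(lanes))  # sequence recovered by ordering the bands by size
```

Sorting the bands by fragment length is the computational counterpart of following the gel upward lane by lane; with four fluorescent dyes in a single capillary, the same ordering is obtained from the arrival time of each fragment at the detector.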

4.4.2 Second generation sequencing technologies

The second generation of sequencing technologies, most popularly known as NGS, encompasses a broad range of technologies that resulted from the push for technology development that was one of the objectives of the HGP. Parallel multiplexed short-read sequencing was the first NGS breakthrough. This strategy depended on preparing a library of thousands of short DNA fragment templates derived from an
original DNA sample and amplified in vitro using PCR, followed by sequencing of all the fragments (multiplexed) at the same time (parallel) to detect the identities of the incorporated base pairs (bp). Compared to Sanger sequencing, this approach sped up the sequencing process significantly while also decreasing the cost substantially. Between 2005 and 2012, an assortment of NGS technologies was developed and offered as integrated sequencing platforms by several commercial companies. The majority of the NGS technologies, with one exception, were SBS technologies that differed from each other in the chemistry employed to detect the molecule used as readout for nucleotide incorporation during the synthesis phase of the reaction (Table 4.1).

• Reversible terminators: The most commonly used technology, Illumina (previously Solexa), uses the same principle as Sanger sequencing but with fluorescently labeled reversible terminator nucleotides. SBS using reversible terminators relies on clusters generated by solid-phase bridge amplification on a flow cell slide. The sequencing cycle starts by flushing in a mixture of four fluorescently labeled 3′-modified nucleotide terminators and an engineered DNA polymerase that is able to incorporate these modified nucleotides. If a nucleotide is complementary to the next base in the primed template, it is added by the polymerase; extension is then blocked, and the fluorescent signal produced by laser excitation of each fluorophore is detected by a high-resolution camera. After imaging, the terminating group of the modified nucleotide is cleaved along with the fluorophore, regenerating the 3′-OH for the addition of the next nucleotide and starting a new cycle. The terminator is key to this technology’s chemistry, as it ensures that no additional or nonspecific nucleotides are added in the same cycle, so that just one nucleotide per template is imaged per cycle. All the four-color images are processed to derive the actual nucleotide sequence [17,36].

• Pyrosequencing: Once commercialized as Roche’s 454 sequencing technology, pyrosequencing used amplified DNA molecules attached to beads that were arrayed into a picotiter plate (PTP) containing millions of microwells, each large and deep enough to hold only a single bead carrying one clonally amplified DNA molecule. During the sequencing process, smaller beads with attached sulfurylase and luciferase enzymes are flushed in and allowed to diffuse into the wells to surround the target beads. Each dNTP is flushed individually, one at a time, through the PTP and allowed to diffuse into each of the sequencing wells. When the DNA polymerase incorporates a nucleotide, a pyrophosphate group is released; the sulfurylase converts this pyrophosphate into ATP, which luciferase then uses to oxidize luciferin, emitting light. The light produced by the specific incorporation of a nucleotide in that cycle is recorded by a digital camera in the instrument, producing an output known as a flowgram or pyrogram. The height of a peak in a pyrogram is proportional to the bioluminescence signal intensity, which in turn is proportional to the number of nucleotides incorporated in that cycle. This is both an advantage and a disadvantage of the system: the specificity of the incorporated base is greater, but the detector can be saturated if more than six nucleotides are incorporated in the same cycle, making the method inaccurate for sequencing homopolymer tracts. Between cycles, there is a wash with apyrase, an enzyme that degrades any remaining unincorporated nucleotides and ATP produced in the previous cycle [16,37–39].

• Semiconductor ion sequencing: Semiconductor sequencing is an SBS technology based on detecting the hydrogen ion that is released during the DNA synthesis reaction using a very sensitive pH meter (a microchip sensor). After template amplification by emulsion PCR, template-bound acrylamide beads are loaded into the wells of the semiconductor chip. Nucleotides are allowed to flow
through the chip one nucleotide species per cycle. When the DNA polymerase incorporates the next complementary dNTP, the reaction produces pyrophosphate (as mentioned above) and a hydrogen ion due to the hydrolysis of the triphosphate of the incorporating nucleotide. The hydrogen ion (H+) released produces a change in pH proportional to the number of dNTPs incorporated in that nucleotide flow cycle, which is detected by a tantalum oxide-coated sensor that provides increased proton sensitivity. The change of 0.02 pH units per incorporated nucleotide is registered by the sensor and converted into a voltage signal that is finally digitized into a sequence output [40,41].

• Sequencing by oligonucleotide ligation and detection: Once commercialized by Applied Biosystems, Sequencing by Oligonucleotide Ligation and Detection (SOLiD) is the only non-SBS NGS technology. As the name indicates, SOLiD is an SBL technology that uses a DNA ligase during the sequencing process. Once amplified library fragments are bound through adapters to a sequencing flow cell slide, SBL is initiated by the hybridization of universal primers, followed by the addition of a pool of fluorescently labeled 8-mer oligonucleotides that are labeled according to their last two bases (two-base encoding). This produces 16 different dinucleotide combinations labeled by four different fluorophores on their 5′ end. During the sequencing reaction, only the oligonucleotide that is complementary to the template strand will hybridize and be ligated to the nascent strand by the DNA ligase. Four-color imaging is performed by exciting each of the fluorophores and detecting the fluorescent signal across all the spots on the flow cell slide. Silver ions are then flushed in to cleave the recently ligated oligonucleotide, releasing the fluorophore and leaving its 5′-phosphate end free to bind the next one, and the cycle is repeated. The sequencing is said to be performed in color space, which for downstream analysis requires mapping to a color-space reference sequence in order to infer the nucleotide identities. This color-space mapping and the two-base encoding give SOLiD increased accuracy over other NGS technologies; however, its throughput is lower and the analysis of the data produced is more time consuming [42,43].

In addition to the fundamental differences in the chemistry used, the various NGS technologies differed in a few other aspects that made them more or less attractive to researchers and clinical laboratories. For example, read lengths, or the size of the fragments they were able to sequence, varied from dozens of base pairs (SOLiD) up to hundreds (454). Additionally, as previously mentioned, Sanger remains to date the most accurate sequencing technology, given that NGS technologies suffer from various degrees and types of sequencing errors, as well as signal detection thresholds [44–46]. However, these drawbacks are compensated by the high throughput and reduced cost per bp, which allow severalfold redundancy of sequencing and oversampling of the same genomic region (higher DNA sequence coverage) in a short amount of time. Consequently, when sequencing with NGS technologies it is most important to achieve adequate read depth and coverage across most sites to reduce the false positive rate. High coverage per site reduces the effective error rate by increasing the signal-to-noise ratio at any specific site, with up to 20 billion sequenced reads per run in the latest available instruments (Illumina NovaSeq 6000, S4 flow cell). In spite of the diversity of NGS technologies once developed and available, its scalable output and decreased per-bp sequencing costs have positioned Illumina as the most popular and widely used NGS technology in research and clinical settings.
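The homopolymer limitation of flow-based chemistries (pyrosequencing and semiconductor sequencing) described in the list above can be sketched with a small model. In this idealized example, one nucleotide species is offered per flow, the signal equals the number of bases incorporated, and the signal is capped at six incorporations, following the saturation point mentioned in the text. The flow order `TACG` and the function names are assumptions made for illustration, and real instruments report noisy, non-integer signals.

```python
def flowgram(synthesized: str, flow_order: str = "TACG", saturation: int = 6) -> list[tuple[str, int]]:
    """Simulate idealized flow signals for the strand being synthesized.

    Each flow offers one nucleotide species; the recorded signal is the number
    of consecutive bases incorporated, capped at `saturation`.
    """
    if set(synthesized) - set(flow_order):
        raise ValueError("sequence contains a base that is never flowed")
    signals: list[tuple[str, int]] = []
    pos = 0
    flow = 0
    while pos < len(synthesized):
        nucleotide = flow_order[flow % len(flow_order)]
        run = 0
        # The polymerase extends through the whole homopolymer in a single flow.
        while pos < len(synthesized) and synthesized[pos] == nucleotide:
            run += 1
            pos += 1
        signals.append((nucleotide, min(run, saturation)))
        flow += 1
    return signals


def call_bases(signals: list[tuple[str, int]]) -> str:
    """Convert flow signals back into a base sequence."""
    return "".join(nucleotide * height for nucleotide, height in signals)


print(call_bases(flowgram("GATTACA")))       # short runs are recovered exactly
print(call_bases(flowgram("TTAGGGGGGGGC")))  # the run of eight Gs is truncated
```

Because the whole homopolymer is incorporated within one flow, its length must be inferred from signal height alone, which is why long runs are where these platforms lose accuracy.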

4.4.3 Third generation sequencing technologies

The third generation of sequencing technologies is characterized by single-molecule sequencing (SMS) technologies that focus on producing long reads while removing the amplification step of
NGS technologies and sequencing the DNA molecules directly. The most popular long-read SMS platforms include Pacific Biosciences (PacBio) and Oxford Nanopore. SMS identifies individual nucleotides in longer DNA molecules, either from their addition by a polymerase (PacBio) or through ionic signals produced by translocation of DNA through a nanometer-scale pore (Oxford Nanopore) [44–46].

• PacBio is an SBS technology that performs real-time SMS using nanophotonic structures (zero-mode waveguide, ZMW, detectors) that can sensitively detect the binding of fluorescent phospho-linked dNTPs to the nascent strand as DNA polymerases attached to the bottom of the wells incorporate nucleotides [47]. The circularized template DNA is diluted such that only one molecule is sequenced by one polymerase in each well. When a nucleotide is in the active site of the polymerase, a pulse of fluorescence at the specific wavelength is detected by the ZMW detector. Once the correct nucleotide is covalently bound by the DNA polymerase, the fluorophore is released, the pulse ends, and the nascent strand is ready for the next dNTP incorporation. The processivity of the DNA polymerase allows the sequencing of long molecules using this technology [48,49]. However, the error rate of the PacBio SMS technology is higher than that of NGS technologies. These errors are random during the sequencing process, and therefore multiple passes through the same molecule (increased coverage) can reduce the probability of false variant calls.

• Nanopore sequencing differs from the rest of the technologies in that it does not rely on the synthesis of a new DNA strand; instead, it detects ionic changes produced by the translocation of linear DNA target molecules through nanopores embedded in a synthetic membrane with the help of a motor protein. The passage of each nucleotide, one at a time, through the nanopore causes distinctive disruptions in electric current that are detected and translated into sequence data. A major advantage of this technology, compared to PacBio and NGS platforms, is its portability [50,51]. However, it also has a higher error rate, and the errors are not randomly distributed, although progress in the bioinformatic analysis of nanopore sequence data has mitigated this drawback.

Currently, PacBio produces >1 kbp reads, with the RS II system averaging 10 kbp read lengths and reaching up to 60 kbp. Oxford Nanopore routinely produces 10–100 kbp reads depending on the sample preparation protocol and has achieved ultra-long reads of 1.3 Mbp and up to 2.2 Mbp in length [51–53]. As previously mentioned, both of these SMS technologies suffer from higher error rates than more standardized NGS technologies such as Illumina. Although higher coverage of the sequenced molecule can help reduce the probability of false calls, as in the case of NGS technologies, the cost of long-read SMS technologies is still higher in comparison and sequencing is generally performed at lower coverage. Nevertheless, given the length of the reads produced, long-read SMS is a thriving sequencing approach suitable for de novo genome assemblies [50], telomere-to-telomere chromosome assemblies [54], sequencing of complex genomic regions rich in repeat elements, and detection of large structural variants [51].
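The statement that random errors can be averaged out by repeated passes or higher coverage can be made concrete with a simple binomial model. The sketch below is an illustration under stated assumptions rather than a description of how any vendor's consensus algorithm works: errors are taken to be independent between passes, all erroneous passes are pessimistically assumed to agree on the same wrong base, and ties are counted as errors. The function name is hypothetical; the 10% per-pass error rate is the approximate figure listed for PacBio in Table 4.1.

```python
from math import comb


def consensus_error(per_read_error: float, depth: int) -> float:
    """Upper bound on the probability that a majority vote is wrong at one base.

    Assumes independent errors between reads or passes; ties count as errors.
    """
    wrong_needed = (depth + 1) // 2  # number of wrong reads needed to win or tie the vote
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(wrong_needed, depth + 1)
    )


for depth in (1, 5, 15, 31):
    print(f"depth {depth:>2}: consensus error <= {consensus_error(0.10, depth):.2e}")
```

Under these assumptions the bound falls rapidly with depth. The model does not apply to systematic errors, which recur at the same positions regardless of how many reads are stacked, and this is why the non-random error profile noted for nanopore data is harder to correct by coverage alone.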

4.5 Sequence data analysis

Following the sequencing step using a chosen platform, the signals detected by the sequencer are processed depending on the chemistry used by the sequencing instrument and turned into a
sequence representation suitable for downstream analyses, a series of As, Cs, Gs, and Ts. The major steps of sequence data analysis are outlined below.

1) Mapping and alignment: The data generated by the different technologies are exported into files containing the sequence of each of the individual DNA fragments that were sequenced (referred to as sequence reads) plus encoded quality information for each individual read. The next step generally involves the mapping and alignment of each of these sequence reads to the human genome reference sequence, which is used as a template to find the exact and unique locations in the genome where the sequence reads belong. Because an entire genome cannot be assembled from short-read sequences without a reference scaffold, all the individual sequencing reads that do not map to the reference genome used are generally discarded at this step, along with duplicate and low-quality reads. In the case of long-read SMS technologies, this step may be skipped and substituted by other bioinformatic methods for de novo assembly [55].

There are different alignment algorithms that can be used for mapping sequence reads to the reference genome. Mapping algorithms vary in their approaches and in how exhaustive their mapping is, which affects both their accuracy and their computational speed. Alignment algorithms can be broadly divided into two groups. The first group builds a “hash,” or associative array, of either the reference genome or the sequence reads to use as seeds or anchors for the alignment; once the seeds have been aligned to the reference genome, a smaller local alignment is performed to extend the alignment and ensure more accurate mapping. The second group uses the Burrows-Wheeler transform (BWT), in which the reference genome is sorted, reordered, and indexed for more efficient access and read alignment, which makes these algorithms faster [56,57].

The output of the alignment algorithms is generally a sequence alignment/map (SAM) file. This file is generally very large, as it contains the mapping of each read to the reference and its quality. Therefore, for more efficient handling of this information, SAM files are converted into binary alignment/map (BAM) files, which can be more readily accessed, read, and handled computationally [57]. More recently, given the large production of genomic data and the need for better storage and manipulation of alignment data files, CRAM, a more compressed file format, has begun to be adopted [58]. CRAM files use reference-based compression against the assembly to which the sequence reads were mapped and aligned, and can reduce the storage footprint by about 30% relative to a BAM file.

When comparing mapping and alignment for WGS and WES, the differences in coverage and depth between the two sequencing approaches are clear (Fig. 4.3). For WGS, sequence reads map throughout the genome (with the exception of highly repetitive and complex regions, as pointed out above), providing more even coverage across both coding and noncoding sequences, whereas the targeted nature of WES results in mapping and alignment of the sequence reads only to the exons and captured genomic regions, depending on the reagent used for targeted capture. Because coverage is more even across the genome in WGS, a tradeoff is made with depth to contain cost; therefore WGS is generally done at an average read depth of 30×, meaning that on average every base in the genome is covered by 30 different short reads, whereas WES, by focusing on a smaller genomic space, is generally performed at about 100× or higher average read depth.
FIGURE 4.3 Comparison of whole-genome sequencing (WGS) and whole-exome sequencing (WES) coverage and depth across a gene. Sequence coverage and depth across a representative gene shown for WGS and WES. While coverage is more even across the genome for WGS, read depth is generally shallower (~30×), whereas for WES, short sequence reads (gray lines in bottom panel) pile up only over the targeted regions, mainly exons, at much greater read depth (~100×).

Through the use of current NGS technologies, most human genome sequencing projects are in fact resequencing projects, meaning that they rely on the haploid reference genome sequence assembly to map and align the sequence reads produced from any individual personal genome and identify variants compared to the haploid human genome reference. This approach of resequencing and mapping to the reference genome with short-read NGS technologies also presents challenges when mapping to highly repetitive regions or low-complexity polynucleotide regions. Current long-read SMS technologies offer an alternative for studying diseases caused by this type of variation, such as Fragile X syndrome (FXS, MIM #300624) and other tandem repeat expansion disorders [59–61]. Furthermore, the availability and improvement of long-read SMS technologies, and their combination with short-read NGS approaches, is enabling the production of more individual genomes through de novo assembly methods, allowing the capture of variation, alternate haplotypes, and structural variants not present in the reference human genome assembly [50,51].

FIGURE 4.4 Schematic of mapping, alignment, pileup, and variant calling processes in genomics projects. Once individual reads are mapped to the reference human genome sequence, a pileup of these reads is generated and every base observed at each aligned position in the genome is reported. To facilitate the processing of these data and the files generated, symbols have been assigned to represent a reference base on the forward strand (.), a reference base on the reverse strand (,), and capital and small letters to represent specific variant calls on the forward or reverse strand, respectively.

2) Variant and genotype calling: From the alignment file, information about each base in the genome can be extracted and reported in a pileup file. This file documents, for every position in the genome, the nucleotides observed by the alignment or “piling up” of all the reads at that specific position (Fig. 4.4). The majority of the nucleotides in the genome will be identical to the reference. For genomic analyses, a major assumption is that the patients or test samples will differ from the reference sequence, and therefore the main focus of downstream analyses is on those positions or regions that are variable. After obtaining the pileup, the variable positions are extracted into another file in a process known as variant calling. At this step, the genotype of the
variant is also evaluated: whether there is evidence for the reference (R) allele and/or the variant or alternate (A) allele at that position based on the pileup, and therefore whether the genotype is heterozygous (RA), homozygous reference (RR), or homozygous alternate (AA). There are several algorithms that perform variant and genotype calling from NGS data [62–64]. The file produced by these “callers,” containing the positions that differ from the reference sequence, is known as a Variant Call Format (VCF) file. Once variants are “called,” it is important to evaluate their quality, the number of reads across the position (coverage), the number of reads supporting the variant, strand bias, genotype quality, and the likelihood of being a true variant, in order to ensure that the majority are true positives and to filter out sequencing errors and low-quality calls.

3) Variant annotation: The next step in the analysis of genomic data involves gathering as much relevant information as is available about the variants identified in the test sample, through a process known as variant annotation. In general, the major pieces of information obtained through annotation include whether the variants detected map to genic or intergenic regions, whether they are coding or noncoding, and whether they represent previously observed polymorphisms or novel variants. When annotating variants, several available databases are queried to gather as much information as possible about the variant identified at that specific genomic position and to assess its functional impact. Some of the most widely used databases included in annotation pipelines are described in the next section.

4) Variant data interpretation: The last, and probably the most important and time-consuming, step of genomic data analysis is variant data interpretation. In this step the genomic variants and the information gathered through sequencing and annotation are analyzed and interpreted by an expert genomicist who, generally aided by additional bioinformatic algorithms and analysis workflows tailored to the type of study, evaluates the relevance of a particular variant or variants to the phenotype being studied. In the case of rare diseases, important considerations when evaluating variants include the frequency of the identified variant in the general population, the predicted functional effect of the variant, the known evidence for the function and phenotype association of the gene where the variant was identified, and the segregation of the variant with the phenotype in the family of the proband with the rare disease. For example, the general assumption is that if a given disorder is rare, few or no people in the general adult population will have the same variant as the sequenced affected proband. However, this assumption needs to be evaluated every time in consideration of the type of disorder being studied, the severity of the disorder, and the possibility of incomplete penetrance (see Chapter 6: Dominant and Sporadic De Novo Disorders and Chapter 9: Multilocus Inheritance and Variable Disease Expressivity in Rare Disease). In terms of functional effect, loss-of-function (LoF) variants are generally considered more deleterious to protein function and more likely to be associated with disease; these are variants predicted to disrupt the protein by creating a premature termination codon (nonsense variants) or by inserting or deleting bases (frameshift indel variants), which changes the sequence and can also cause early truncation of the protein. To this end, it is important to evaluate the frequency of LoF variants in the candidate disease genes in “control” individuals, as some genes are more prone to accumulating LoF variation, whereas genes that perform important cellular functions are generally more constrained and have fewer LoF variants. Variants that change an amino acid in the protein (missense variants) need to be evaluated based on the evolutionary conservation of the residue and whether it resides in a functional domain or in an active or interaction site important for protein function.

Information obtained on the particular gene or genes where variants occur is also relevant, as in some cases, some genes may have been already reported to be associated with the disorder observed in the patient. Conversely, when variants are found in genes with no previous disease associations, information about their expression pattern and known or predicted function can shed light upon their likely involvement in the phenotype characterized in the patient with a rare disease. For rare diseases, the major analysis paradigm implemented leverages family relationships, either through trio-based or family-based genomic analyses. When genomic sequencing is performed in the parents of a child with a suspected rare genetic disorder, the genomic information of the parents is invaluable to inform the likelihood of the variants in the proband to be associated with the phenotype. For example, heterozygous variants that are shared between the child and one of the parents can be excluded unless evidence or suspicion of dominant inheritance (Chapter 6: Dominant and Sporadic De Novo Disorders). Alternatively, when both parents are heterozygous carriers of a given variant and the child is homozygous, this can suggest a recessive disorder in the proband (Chapter 5: Recessive Diseases and Founder Genetics). The availability of the genomic data from both biological parents can be utilized to phase (assign the parent of origin) heterozygous variants in the proband and identify compound heterozygous variants in trans for recessive disorders and possible de novo mutations (see Chapter 6: Dominant and Sporadic De Novo Disorders). Even in cases where the causative mutation did not occur de novo, the inclusion of unaffected parents into the study design over probands alone is able to reduce up to ten times the number of candidate variants identified in the proband, significantly increasing the probability of achieving an accurate molecular diagnosis for the proband [65]. 
When genomic sequencing can be performed in additional family members, such as affected and unaffected siblings of the proband or other members of the family in large multigenerational pedigrees for dominant disorders, this provides additional power to perform genetic analyses and exclude variants that are not associated with the disease. The effectiveness and power of family-based genomic analyses for rare diseases have been extensively demonstrated [65,66].
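The trio logic described above (excluding inherited heterozygous variants, flagging homozygous variants with two carrier parents, phasing by parent of origin, and flagging possible de novo events) can be sketched as follows. This is a minimal illustration under stated assumptions, not a production pipeline: genotypes are reduced to alternate-allele counts (0, 1, or 2), and the `TrioGenotype` record, the `classify` labels, and the `compound_het_genes` helper are hypothetical names introduced only for this example. Real analyses would also need genotype quality, coverage, and population frequency checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrioGenotype:
    """Alternate-allele counts (0, 1, or 2) at one site for a proband and both parents.

    Hypothetical record type used only for this illustration.
    """
    gene: str
    variant_id: str
    proband: int
    mother: int
    father: int


def classify(site):
    """Return a coarse inheritance label for a single site in a trio."""
    if site.proband == 0:
        return "absent_in_proband"
    if site.mother == 0 and site.father == 0:
        # Present in the child, absent in both parents.
        return "candidate_de_novo"
    if site.proband == 2 and site.mother == 1 and site.father == 1:
        # Both parents are heterozygous carriers and the child is homozygous.
        return "homozygous_recessive_candidate"
    if site.proband == 1 and site.mother + site.father == 1:
        # Heterozygous in exactly one parent, so the parent of origin is unambiguous.
        return "inherited_maternal" if site.mother == 1 else "inherited_paternal"
    return "other"


def compound_het_genes(sites):
    """Genes carrying at least one maternally and one paternally inherited
    heterozygous variant, i.e. candidate compound heterozygotes in trans."""
    origins = {}
    for site in sites:
        label = classify(site)
        if label.startswith("inherited_"):
            origins.setdefault(site.gene, set()).add(label)
    return sorted(gene for gene, labels in origins.items() if len(labels) == 2)


if __name__ == "__main__":
    # Made-up example sites; gene and variant names are placeholders.
    sites = [
        TrioGenotype("GENE_A", "var1", proband=1, mother=0, father=0),
        TrioGenotype("GENE_B", "var2", proband=2, mother=1, father=1),
        TrioGenotype("GENE_C", "var3", proband=1, mother=1, father=0),
        TrioGenotype("GENE_C", "var4", proband=1, mother=0, father=1),
    ]
    for site in sites:
        print(site.gene, site.variant_id, classify(site))
    print("compound heterozygous candidates:", compound_het_genes(sites))
```

Sites labeled as inherited from a single parent are the ones that would be excluded under a de novo or recessive hypothesis unless they pair up in trans within the same gene, which is what `compound_het_genes` checks.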

4.6 Genomic databases

One of the initial core goals of the HGP was to enable data collection and distribution. The HGP considered it critical to develop informatic tools and systems that would allow the collection, storage, and distribution of the genomic sequencing data produced by the project [6]. Although it was unclear at the time what form these tools and systems would take, it was clear that integrated databases would be needed to achieve the goals of the project then and in the future. Since then, genomic databases have proliferated and evolved substantially, allowing access to massive amounts of biological data that scientists use every day to answer questions, perform analyses, or derive hypotheses.

When analyzing genomic data, probably the first major question to explore is which and how much of the variation observed in a given individual or group of individuals with a certain trait or disorder is unique or specific to this group, as opposed to variation also found in individuals who do not have the trait or disease. As previously mentioned, one of the major realizations of the HGP and subsequent studies like the HapMap and 1000 Genomes projects was the vast amount of variation found between individuals and across human populations around the world. These studies produced


Chapter 4 Genomic sequencing of rare diseases

large sets of variants, mostly SNPs, that have been cataloged in databases to inform genetic and genomic studies and to compare the frequencies of these variants between groups of individuals with a particular trait and those without it. The database of SNPs (dbSNP) was established in 1998 by the National Center for Biotechnology Information (NCBI) to store and catalog SNPs as the most common form of genetic variation between individuals [67,68]. Since its creation, the database has been greatly expanded to include other types of variants such as indels, microsatellites, and short tandem repeat sequences, assigning individual identifiers (rs numbers) to ease the ongoing annotation and classification of human genomic variation. Initially the database was populated with the SNPs discovered during the HGP and later expanded with variants discovered during the HapMap and 1000 Genomes projects. The latest additions to dbSNP are variants identified through the myriad WGS and WES projects of the last decade.

In addition to the catalog of variants that dbSNP constitutes, specific projects have made the genomic variation data they produced publicly available; one such resource is the Genome Aggregation Database (gnomAD) [69]. This database began as an effort by multiple groups to pool genomic data from different large-scale exome sequencing projects and harmonize it using the same bioinformatic pipelines and methods; this endeavor first produced the Exome Aggregation Consortium (ExAC) database [70]. The addition of new cohorts and datasets containing whole-genome sequence data resulted in the incorporation of ExAC into the larger gnomAD database, which as of 2020 contains data from 125,748 exomes and 15,708 genomes from unrelated individuals sequenced as part of various disease-specific and population genomic studies.
Information on the allele frequencies of identified variants in multiple populations, and on the number of alleles and homozygous individuals observed for these variants, is reported; however, no individual-level data or phenotype information is available. Nevertheless, the frequency information is of great use in the analysis of rare diseases, as it can be used to exclude variants observed in affected individuals that are present in other subjects and populations at high frequencies and therefore likely represent benign polymorphisms. However, caution must be exercised when a variant is observed in one or a few individuals in these large population databases, partly because some of the studies contributing to ExAC, and by extension gnomAD, ascertained participants based on particular disorders or phenotypes such as cardiovascular or inflammatory bowel disease. Therefore, when evaluating genetic variants identified in probands with phenotypes represented in gnomAD, the presence of the variant in the database is not enough to exclude it as a candidate. Additionally, depending on the phenotype under study, variants found in probands that are also present in databases like gnomAD must be interpreted cautiously: although the database is mostly depleted of individuals with severe childhood disorders, the individuals sequenced as part of the projects constituting gnomAD are not necessarily “healthy” and, in addition to any disease-specific ascertainment, they may be affected by, or be carriers of variants for, disorders of adult onset or with incomplete penetrance.

In addition to large genomic databases that contain variant and population frequency data, other specialized databases that focus on disease-associated genes and variants are paramount for genomic studies of rare diseases. The Online Mendelian Inheritance in Man (OMIM) database (www.omim.org) is a compendium of human diseases and phenotypes of genetic or suspected genetic etiology [71].
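Returning briefly to the frequency-based exclusion discussed above, the idea can be sketched in a few lines. This is an illustrative sketch only: the function names, the dictionary layout of the candidate list, and the 1% cutoff are assumptions made for the example, not values prescribed by the text or by any database. The allele frequency is computed as the number of observed alternate alleles divided by the total number of alleles genotyped at the site.

```python
def allele_frequency(allele_count, allele_number):
    """Alternate-allele frequency: observed alternate alleles over all alleles genotyped."""
    if allele_number == 0:
        # Site not covered in the reference population; treat as unobserved.
        return 0.0
    return allele_count / allele_number


def is_rare(allele_count, allele_number, max_af=0.01):
    """True if the variant is at or below the chosen frequency cutoff.

    The default cutoff of 1% is an assumption for illustration; the appropriate
    value depends on the disorder's prevalence and inheritance model.
    """
    return allele_frequency(allele_count, allele_number) <= max_af


def keep_rare_candidates(candidates, max_af=0.01):
    """Drop variants that are common in the population database.

    Rare variants are kept even when they are present in the database, because
    low-count observations alone do not rule a variant out as a candidate.
    """
    return [v for v in candidates if is_rare(v["ac"], v["an"], max_af)]


if __name__ == "__main__":
    # Made-up counts; "ac" is the alternate-allele count and "an" the allele number.
    candidates = [
        {"id": "var1", "ac": 0, "an": 250_000},      # never observed
        {"id": "var2", "ac": 5, "an": 250_000},      # observed, but very rare
        {"id": "var3", "ac": 5_000, "an": 250_000},  # 2% frequency, likely benign
    ]
    print([v["id"] for v in keep_rare_candidates(candidates)])
```

Note that the filter only removes common variants; as the text cautions, a variant seen in one or a few database individuals stays on the candidate list.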
OMIM was originally established in the 1960s by Dr. Victor McKusick as a catalog of observed human mendelian phenotypes. Later, in 1995, it became a publicly available online resource hosted by NCBI [71,72]. OMIM is a comprehensive and curated knowledgebase of known human mendelian phenotypes and genes, and it is an invaluable resource to human


geneticists worldwide. As of December 2020, OMIM reports almost 6000 genetic disorders or mendelian phenotypes for which the molecular cause is known, and 1533 previously documented phenotypes for which the molecular cause has not been identified.

Another disease-oriented resource is the Human Gene Mutation Database (HGMD), a catalog of known mutations reported to be associated with particular genetic diseases or phenotypes [73]. HGMD includes more than 150,000 variants that have been published as associated with genetic conditions; however, some of these variants are derived from single-case reports with insufficient support for variant pathogenicity. Genomic sequence data from thousands of individuals support the argument that about 30% of the variants in HGMD are not damaging or do not cause the disease they were reported to be associated with, as they reach appreciable frequencies (>2%–5%) within normal populations. Unfortunately, the public version of HGMD is not the most up to date, and access to the most current version of the database requires a commercial license. Consequently, an effort to create a public database of disease-associated variation resulted in the creation of ClinVar [74]. Originally intended to aggregate clinically relevant human disease variation reported by clinical laboratories, ClinVar has expanded to allow submissions from research and clinical laboratories across the world. The variants in ClinVar also undergo curation by experts in the different diseases and phenotypes represented, through the efforts of the Clinical Genome Resource (ClinGen) project (PMID: 26014595). This curation is denoted by a scoring system that reflects the degree of confidence of the genetics and genomics community in the association of a given variant with disease.
The Pharmacogenomics Knowledge database (PharmGKB) documents reported, well-characterized human genetic variants and polymorphisms related to the metabolism of a wide variety of drugs and compounds. It is a valuable resource to inform the analysis of potential medically actionable variants associated with drug metabolism, as well as suggested dosages and guidelines for the use of common medications depending on an individual’s genotype [75].

In addition to the above-mentioned genomic and disease/phenotype-focused databases, the field of genomics has been fertile ground for the development of bioinformatic algorithms to help analyze and interpret the increasing volumes of genomic data. Some of the most useful bioinformatic resources for the analysis of families and variants associated with rare diseases are described next.

• Conservation scores: PhyloP computes P-values of nucleotide conservation based on a tree model of neutral evolution and multispecies alignments [76,77]. The likelihood ratio test considers all possible ancestral sequences to estimate the “deleteriousness” of a particular substitution based on a comparative genomic approach across 32 vertebrate species, assuming neutrality of synonymous changes and treating all nucleotide substitutions equally [78]. The Genomic Evolutionary Rate Profiling (GERP) approach aims to identify evolutionarily constrained regions that have lower nucleotide substitution rates, which may reflect past purifying selection, through the sequence analysis and alignment of 29 mammalian species [79]. PhastCons attempts to identify evolutionarily conserved regions through multiple species alignments (46 placental mammals) based on a phylogenetic Hidden Markov Model (phylo-HMM) that uses statistical models for unequal nucleotide substitution rates [80].
• Functional prediction algorithms: The “Sorting Tolerant From Intolerant” (SIFT) algorithm predicts the functional effect of an amino acid substitution based on the conservation of that amino acid residue in the protein through multiple sequence alignment of closely related proteins from PSI-BLAST [81]. The scores and predictions given by SIFT range from (1 = tolerated to


0 = damaging); the amino acid change is predicted to be damaging if the score is ≤0.05 and tolerated if the score is >0.05. The “Polymorphism Phenotyping” (PolyPhen) algorithm predicts the possible functional impact of an amino acid substitution based on protein structure, phylogenetic conservation, and sequence information, calculating a naive Bayes posterior probability that the mutation might be damaging [82]. The score reported by PolyPhen is the probability of the mutation being damaging when it is not damaging over the probability of the mutation being damaging when it is actually damaging; therefore the scores range from “probably damaging” (score = 0) to “benign” (score = 1). MutationTaster utilizes a Bayes probabilistic algorithm to predict the functional impact of a given nucleotide change, either a SNP or a small indel, based on a training set of known disease-causing mutations and common polymorphisms. This algorithm calculates the probability of the change being a polymorphism or a damaging mutation and reports back P-values (not scores) of the prediction being correct [83]. The Combined Annotation-Dependent Depletion (CADD) score aims to measure the deleteriousness of genomic variants by combining annotations from the previously described algorithms, conservation scores, and other annotation resources [84].

Other useful information resources to interpret variants in novel candidate disease genes can be provided by pathway and protein–protein interaction network databases, such as the Kyoto Encyclopedia of Genes and Genomes for biological pathways [85,86] and the Human Integrated Protein–Protein Interaction rEference (HIPPIE) [87] or the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [88] databases for protein–protein interaction data.
These databases can provide information on which gene products directly interact with each other or function upstream or downstream of other known disease-associated genes, and can help elucidate possible functions and roles of novel disease-associated genes.

Information on the tissue expression patterns of candidate disease genes may be relevant to determine whether their normal expression overlaps with the tissues and organs affected in the disease or phenotype under study. For example, genes involved in neurodevelopmental disorders tend to have expression patterns biased toward the brain. Publicly available expression databases can be queried to obtain such information [89,90]. In recent years the Genotype-Tissue Expression (GTEx) project has leveraged the development of large-scale RNA sequencing (RNA-seq) technologies using NGS platforms to create a resource for the study of human gene expression and regulation [91]. Since RNA-seq is more sensitive than previous transcriptomics platforms such as microarrays, GTEx provides a more quantitative resource to study gene expression, as well as gene isoform profiles, across multiple human tissues. More recently, the development of high-throughput single-cell RNA sequencing (scRNA-seq) methodologies has enabled an even more detailed view of the expression of specific genes in single cells or specific cell populations within organs. Initiatives such as the Human Cell Atlas Project aim to establish a comprehensive map of expression in healthy human tissues at different developmental timepoints to better understand gene regulation, protein interactions, and how dysregulation of these processes leads to disease states [92].

Finally, information regarding the known phenotypes of model organisms with variants or deficiencies of genes of interest may be informative to elucidate the relationship of a novel candidate disease gene with the phenotypes observed in human patients [93].
Initiatives and databases exist that report the phenotypes observed in animal models such as flies [94], worms [95], zebrafish [96], rats [97], and mice [98–100].
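As a small worked illustration of the SIFT cutoff described earlier in this section (scores run from 0 to 1, with a substitution predicted damaging at 0.05 or below and tolerated above 0.05), the rule can be written as a single function. The function name and the range check are assumptions introduced for this sketch; it only applies the threshold stated in the text and does not compute SIFT scores.

```python
SIFT_DAMAGING_CUTOFF = 0.05  # threshold stated in the text


def sift_prediction(score):
    """Map a SIFT score in [0, 1] to the categorical prediction described in the text."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT scores lie between 0 and 1, got {score}")
    return "damaging" if score <= SIFT_DAMAGING_CUTOFF else "tolerated"


if __name__ == "__main__":
    for score in (0.0, 0.05, 0.06, 1.0):
        print(score, sift_prediction(score))
```

Because lower SIFT scores indicate a more damaging prediction, care is needed when combining them with scores from other tools that may use the opposite orientation.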

4.7 Genomic sequencing of rare diseases


Prior to the sequencing of the human genome, the identification of genes and variants associated with human diseases was achieved through a combination of linkage analysis in large families and positional cloning approaches. The sequencing of the human genome facilitated the mapping and Sanger sequencing of candidate genes within previously identified large linkage regions. However, the advent of genomics and the development of NGS technologies boosted the identification of genes associated with human genetic disorders like no other technological advancement before [14,101,102]. It is calculated that as of mid-2020, almost a third (32.5%) of all known genetic disorders (1754 out of 5397) have been molecularly defined through the implementation of NGS approaches in the last decade, primarily WES (Fig. 4.5). Toward the end of the first decade of the 2000s, the genomics field had been able to characterize a large amount of human variation, and the first examples of personal human genomes were published, showing the extent of interindividual genomic variation [16,103]. However, the question remained whether genomics could be applied to the study of genetic diseases and the elucidation of

FIGURE 4.5 Growth of new gene reports over time using conventional versus NGS discovery methods. Number of new gene discoveries per year as reported in OMIM from the first reports in 1986 of mapping and identification of genes associated with genetic disorders through linkage and positional cloning approaches to mid-2020. The implementation of next-generation genomic sequencing approaches, either whole-genome (WGS) or whole-exome sequencing (WES), in the last decade has revolutionized the study and identification of genes associated with rare genetic disorders. Graph courtesy of Dr. Jessica Chong based on the analyses reported in Bamshad MJ, Nickerson DA, Chong JX. Mendelian gene discovery: fast and furious with no end in sight. Am J Hum Genet 2019;105(3):448–455.


the causative molecular defects. In 2009, Ng et al. performed targeted capture exome sequencing of twelve individuals to quantify the amount of common and rare variation in the coding regions of the human genome [104]. Among the individuals sequenced, four were unrelated individuals with Freeman-Sheldon syndrome (FSS, MIM #193700), an autosomal dominant disorder characterized by joint contractures, scoliosis, muscle weakness, and characteristic facial features. Through analysis of the WES data focusing on rare variants in genes shared by all affected individuals, the authors were able to identify the myosin heavy chain 3 gene, MYH3, as the only gene shared among all FSS patients. This was a proof-of-principle study, as MYH3 had previously been reported to cause FSS [105] and the disease-causing variants in these patients were already known, but it showed that rare variants associated with genetic disorders were identifiable through WES approaches.

Shortly after, in 2010, Lupski et al. published the first example of the application of WGS to the study of a human genetic disorder of unknown molecular cause [15]. In this study the authors sequenced the whole genome of an individual affected with Charcot-Marie-Tooth (CMT) disease, a genetically heterogeneous disorder of the peripheral nervous system, using the SOLiD NGS technology. CMT can be caused by variants in more than 100 genes involved in different biological pathways and processes, and it can follow all possible inheritance patterns [106]. Therefore this disorder made an ideal proof-of-concept example for the application of genomic technologies to look at possible causal variation in the whole genome. Without knowledge of the possible disease-causing gene in this individual and his similarly affected siblings, the authors successfully identified, through WGS data analyses and segregation studies in the family, compound heterozygous variants in the gene SH3TC2 [15,107].
The gene had been reported to cause autosomal recessive demyelinating CMT disease type 4C (CMT4C, MIM #601596). These initial proof-of-principle studies on the utility of NGS technologies to identify variants causing human genetic disorders showed that it was possible to identify the genetic lesion causing a rare mendelian disorder through genomic DNA sequencing, opening the door to the broader application of these technologies to the study of rare diseases [14,15,101,102,104,108,109].

4.7.1 Large-scale genomic sequencing projects for rare diseases

Following on the achievements of the initial examples of NGS application to mendelian disease gene diagnosis and discovery, the human genetics community rapidly adopted these new genomic approaches. While success stories of disease gene identification through genomic sequencing of patients and families with genetic disorders span the human genetics community, certain large-scale efforts have contributed significantly to the diagnosis and discovery of rare mendelian diseases. A pioneering initiative in the diagnosis and characterization of rare and mendelian diseases has been the Undiagnosed Diseases Program (UDP) of the National Institutes of Health (NIH) in the United States. Established in 2008 with the goal of providing answers to patients with undiagnosed medical conditions that had eluded a definitive diagnosis after exhaustive clinical evaluation, the program evaluated 863 patients in its initial 7-year phase [110,111]. With the expansion of the UDP into the Undiagnosed Diseases Network (UDN), which incorporated WGS and WES approaches into the diagnosis of patients evaluated by the program, the UDN has contributed to the diagnosis of many patients with undiagnosed diseases and to novel disease gene identification [112,113]. In December 2011, the Centers for Mendelian Genomics (CMGs) supported by the National Human Genome Research Institute (NHGRI) and the National Heart, Lung, and Blood Institute (NHLBI)


of the NIH were instituted with the goal of ascertaining, sequencing, and analyzing samples from around the world for all mendelian disorders with an unknown molecular cause; determining the genetic basis of these disorders; and disseminating the methods, data, and results to the broader genetics community [114]. As of 2018, the CMGs had sequenced 47,129 exomes and 1,169 whole genomes for samples encompassing 22,742 families segregating mendelian phenotypes. These efforts have led to the identification of 616 novel gene-phenotype associations [66]. Given their initial conception and goals, most of the discoveries reported by the CMGs have been made in collaboration with clinicians and research groups from around the world, encompassing 80 countries. This collaborative effort has been further spearheaded by the CMGs through the creation of the GeneMatcher portal (www.genematcher.org) [115], which allows researchers worldwide to share their interest and findings in specific genes, enabling the rapid exchange of results and observations that may eventually lead to novel gene-phenotype associations. GeneMatcher is also part of the Matchmaker Exchange project, which aims to further enhance international collaboration and communication by providing a federated platform that facilitates the matching of patient cases with similar phenotypes or molecular candidates [116].

Outside of the United States, major efforts for the diagnosis of rare and mendelian disorders include the Finding Of Rare Disease GEnes (FORGE) and the Deciphering Developmental Disorders (DDD) initiatives in Canada and the United Kingdom, respectively. The FORGE Canada Consortium was a national effort to apply WES for the identification of genetic variants causing suspected genetic disorders in children across the country, the majority of which had remained unexplained even after extensive clinical and molecular testing [117].
Two hundred and sixty-four rare disorders covering broad clinical spectra were initially selected for study by the FORGE Consortium; however, phenotypes were enriched for neurodevelopmental and syndromic presentations. In the 2 years of FORGE, more than 500 pediatric patients from 362 families were ascertained for WES. Of these families, 51.9% (N = 188) received a positive molecular finding that could explain the clinical presentation of the proband; 105 (29%) of the findings reported back were in known genes, whereas 83 were in novel disease-associated genes [118]. Following in the footsteps of FORGE and building on its success, Canada launched Care4Rare, a similar nationwide collaborative project that expanded on the original goals of FORGE to include the identification and development of therapeutic opportunities for rare diseases.

The Deciphering Developmental Disorders (DDD) study, a UK-wide effort to elucidate the molecular causes of severe neurodevelopmental disorders, was established in 2011 with the aim of recruiting 12,000 children and their parents for genomic analyses [119]. An initial report of 1133 children from 1101 distinct families recruited and sequenced through the DDD study achieved a diagnostic yield of 27% (N = 311 patients) with variants in known disease-associated genes [120,121]. Additionally, 12 genes were associated with novel neurodevelopmental disorders, increasing the diagnostic rate to 31% [121]. Follow-up analyses of 4293 DDD children focused on the prevalence and characteristics of de novo mutations in this cohort and their contribution to developmental disorders [122]. The DDD study closed recruitment in April of 2015, surpassing its initial target with more than 13,600 patients enrolled along with their family members. Following on the success of large-scale projects like the DDD, the United Kingdom has set out to sequence one million genomes from patients enrolled and consented through its National Health Service (NHS) [123].
Moreover, in 2019 the UK NHS announced that it would offer WGS to all children with a suspected genetic disorder born in the United Kingdom. Initial analyses of 9802 UK patients with a rare disease or extreme phenotype have provided a genetic diagnosis to 1138 of them [124].
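The diagnostic yields quoted for these cohorts are simple proportions, and recomputing them from the counts given in the text is a quick consistency check. The sketch below does only that arithmetic; the `percent` helper and the dictionary layout are assumptions for the example. The figure for the UK cohort (1138 of 9802) is not stated as a percentage in the text and is derived here.

```python
def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# (diagnosed or reported, cohort size) pairs taken from the counts in the text.
cohorts = {
    "FORGE families with a positive finding": (188, 362),
    "FORGE findings in known genes": (105, 362),
    "FORGE findings in novel genes": (83, 362),
    "DDD initial report, known genes": (311, 1133),
    "UK rare disease WGS, initial analyses": (1138, 9802),
}

if __name__ == "__main__":
    for label, (part, whole) in cohorts.items():
        print(f"{label}: {part}/{whole} = {percent(part, whole)}%")
```

The recomputed values agree with the rounded percentages reported for FORGE (51.9% and 29%) and DDD (27%), and they put the initial UK yield at roughly 12%.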


The past, ongoing, and future work of large-scale genomic initiatives contributes greatly to a better understanding of human disease architecture and has been characterized by collaborative efforts across the international genetics community, implementation of data sharing strategies, and recognition of the important role that patients with rare diseases play in the research efforts.

4.7.2 Genomic sequencing in the clinic

In the last decade the insights from research efforts into the study of rare genetic diseases have been rapidly applied to the clinic (Fig. 4.1). The capacity of NGS technologies to sequence exomes and genomes within a day or two for a price that is orders of magnitude lower than the cost of the original reference human genome has revolutionized the medical genetics field. Clinical WES has become increasingly available to referring physicians who care for patients with suspected genetic disorders [125–128]. Because the purpose of clinical WES is to identify a molecular diagnosis, analyses focus primarily on reported disease genes, and solve rates are on par with the 21%–30% reported by many research projects when subsetting to findings in known disease-associated genes. These rates can be even higher, up to ~50%–60%, in instances where clinical WES is recommended as a first-tier molecular diagnostic test instead of targeted mutation testing for suspected candidate genes or gene panels for heterogeneous genetic disorders with a number of known genes.

Interestingly, the solve rate for clinical WES was observed to differ at the extremes of age. The application of clinical WES for the diagnosis of infants in the intensive care unit (ICU) achieved a molecular diagnosis in 36.7% of the cases and up to 48.1% in infants who died while in the ICU, suggesting a burden of more deleterious variants in these severe early-onset cases [129]. Conversely, application of this approach for the retrospective diagnosis of adult patients yielded a molecular diagnosis in 17.5% of cases, lower than the average for clinical WES [130] but suggestive of perhaps less penetrant variants or less severe phenotypes in adults versus children referred for genetics evaluation.
Clinically, the superiority of WGS over WES testing has not been demonstrated, and their diagnostic utility has been reported as not significantly different [131]; this may be partly due to our limited ability to interpret the contribution of noncoding variants to disease, as previously discussed. However, it has become clear that genomic sequencing approaches provide a significantly higher rate of diagnosis compared with chromosomal microarray alone when performed as first-tier testing, for a cost that is not significantly different from standard of care. Both WGS and WES have now been proposed as first-tier testing strategies for the diagnosis of suspected developmental, intellectual, and biochemical genetic disorders, as well as for individuals with heterogeneous or nonspecific phenotypes, given their dual SNV-SV discovery opportunities [131–137]. NGS-based genetic testing has demonstrated improved, cost-effective molecular diagnostic rates with downstream consequences for medical management and clinical outcomes. Additionally, studies have shown its impact on the economic and psychological burden of disease, alleviating years of misdiagnoses and diagnostic odysseys for patients and families [138–140]. While only a handful of genomes were available 10 years ago, presently we have thousands of publicly available individual genomes and exomes, making clinical NGS not a matter of if or when, but of now, with questions and procedures aiming to effectively validate, regulate, store, and interpret genomic variation in the clinic. According to the NIH Genetic Testing Registry (GTR) [141], as of January 2020, there are 25,352 clinical exome sequencing, 8369 clinical genome

4.8 Outlook


sequencing, and 1373 clinical hereditary panel tests offered by laboratories worldwide. The establishment of best practices and professional guidelines has thus been necessary to guarantee reliable results and ensure proper patient care. In the United States, NGS-based laboratory diagnostic tests without clearance from the US Food and Drug Administration (FDA) are overseen by the Centers for Medicare and Medicaid Services through the Clinical Laboratory Improvement Amendments (CLIA). CLIA regulations establish analytical validity specifications such as test accuracy, precision, sensitivity, and specificity, among others [142]. Subsequent guidelines for standardization of clinical NGS tests have been published by various professional groups, addressing reporting, validation, quality control, proficiency testing, and accrediting checklists, including bioinformatic analysis pipelines [143–148]. The latest effort in the regulation of clinical NGS-based testing was the publication of a guidance document by the FDA, which proposes a framework for test design considerations and analytical performance evaluation for tests used in germline diseases [149]. A point to consider, nonetheless, is that validation and verification of clinical NGS tests is a moving target, as better and cheaper chemistries and platforms are continuously developed or improved; this may prompt modifications of current regulatory guidelines and recommendations. Predictably, laboratories will have to revalidate or redevelop tests based on technological improvements, potentially imposing interpretative hurdles on the medical community, whose members must familiarize themselves with the appropriateness of new tests, their sensitivity and specificity, and reimbursement options.
Although the diagnostic rate of current NGS-based tests is ~25%–30% based on pathogenic and predicted pathogenic variation in known disease genes, this is significantly higher than for other molecular assays such as karyotype or CMA alone. Consideration and implementation of WES or WGS as first-tier diagnostic tests for patients with suspected genetic disorders in clinical practice should become routine. Further analyses of the “unsolved” cases under research protocols can raise the discovery rate to as much as 60%–70% by adding likely pathogenic rare variants in novel candidate disease genes. However, challenges remain to validate these novel findings and provide evidence of causality for disease association (see Chapter 13: Challenges and Opportunities in Rare Diseases Research). The cross-talk between research and clinical implementation of human genetics findings and insights has never been more relevant than in the current genomic era. As more personal genomes for individuals with and without rare genetic disorders are sequenced and their genotype–phenotype associations studied and uncovered, clinical ordering and interpretation will become easier and more common in practice, with NGS-based testing costs, regulations, and bioinformatics being balanced through innovations in precision medicine.

4.8 Outlook

Whether through WES or WGS, the high throughput and scalability of NGS technologies have enabled the sequencing of thousands of individuals, from single patients and families to large population cohorts, in order to better understand the genetic basis of human diseases. The power of mendelian genetics and rare variant segregation in families in the era of genomics cannot be overstated. The reduction in genomic sequencing cost has allowed the logical inclusion of parents and relatives in family-based genomic studies and the systematic identification of the contribution of rare and de novo variation to human disease.


The improvement and development of new sequencing technologies promise to make personal genomic sequencing routine for diagnosis and preventive health care. New approaches such as long-read SMS and de novo assembly bioinformatic methods will provide further insights into the genome’s architecture and expand our assessment of human variation and its relationship with traits and diseases. The path toward personalized medicine continues to be paved with better technologies, analysis algorithms, and the study of the wide spectrum of variation in the human genome fueled by genomic sequencing. While the technological advancements will enable massive genomic data generation in the years to come, the challenge remains to derive meaningful biological insights from these data that can inform diagnoses and human biology.

References [1] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 2001;409:860921. [2] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004;431:93145. [3] Schneider VA, Graves-Lindsay T, Howe K, Bouk N, Chen HC, Kitts PA, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res 2017;27(5):84964. [4] Lindblad-Toh K, Wade CM, Mikkelsen TS, Karlsson EK, Jaffe DB, Kamal M, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 2005;438(7069):80319. [5] Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437(7055):6987. [6] US Department of Health and Human Services. Understanding our genetic inheritance. The US Human Genome Project: the first five years FY 19911995. NIH Publication; 1990. [7] The International HapMap Consortium. The International HapMap Project. Nature 2003;426:78996. [8] International HapMap Consortium. A haplotype map of the human genome. Nature 2005;437:1229320. [9] International HapMap Constortium. A second generation human haplotype map of over 3.1 million SNPs. Nature 2007;449:85162. [10] International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 2010;467(7311):528. [11] 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 2010;467(7319):106173. [12] 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491(7422):5665. [13] 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 2015;526 (7571):6874. [14] Gonzaga-Jauregui C, Lupski JR, Gibbs RA. Human genome sequencing in health and disease. Annu Rev Med 2012;63:3561. 
[15] Lupski JR, Reid JG, Gonzaga-Jauregui C, Rio Deiros D, Chen DC, Nazareth L, et al. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med 2010;362(13):118191. [16] Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452(7189):8726.

References

89

[17] Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):539. [18] Gonzaga-Jauregui C. Genome-wide approaches and technologies to assess human variation. PeerJ PrePrints 2013;1:e147v1. [19] McDonough SJ, Bhagwate A, Sun Z, Wang C, Zschunke M, Gorman JA, et al. Use of FFPE-derived DNA in next generation sequencing: DNA extraction methods. PLoS One 2019;14(4):e0211400. [20] Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods 2007;4(11):9035. [21] Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME. Microarray-based genomic selection for high-throughput resequencing. Nat Methods 2007;4(11):9079. [22] Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet 2007;39(12):15227. [23] Hodges E, Rooks M, Xuan Z, Bhattacharjee A, Benjamin Gordon D, Brizuela L, et al. Hybrid selection of discrete genomic intervals on custom-designed microarrays for massively parallel sequencing. Nat Protoc 2009;4(6):96074. [24] Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol 2009;27 (2):1829. [25] Bainbridge MN, Wang M, Burgess DL, Kovar C, Rodesch MJ, D’Ascenzo M, et al. Whole exome capture in solution with 3 Gbp of data. Genome Biol 2010;11(6):R62. [26] Parla JS, Iossifov I, Grabill I, Spector MS, Kramer M, McCombie WR. A comparative analysis of exome capture. Genome Biol 2011 Sep 29;12(9):R97. [27] Turner EH, Ng SB, Nickerson DA, Shendure J. Methods for genomic partitioning. Annu Rev Genomics Hum Genet 2009;10:26384. [28] Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al. 
Target-enrichment strategies for next-generation sequencing. Nat Methods 2010;7(2):11118. [29] Hardenbol P, Ban´er J, Jain M, Nilsson M, Namsaraev EA, Karlin-Neumann GA, et al. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat Biotechnol 2003;21(6):6738. [30] Hardenbol P, Yu F, Belmont J, Mackenzie J, Bruckner C, Brundage T, et al. Highly multiplexed molecular inversion probe genotyping: over 10,000 targeted SNPs genotyped in a single tube assay. Genome Res 2005;15(2):26975. [31] Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, et al. Multiplex amplification of large sets of human exons. Nat Methods 2007;4(11):9316. [32] Turner EH, Lee C, Ng SB, Nickerson DA, Shendure J. Massively parallel exon capture and library-free resequencing across 16 genomes. Nat Methods 2009;6(5):31516. [33] O’Roak BJ, Vives L, Fu W, Egertson JD, Stanaway IB, Phelps IG, et al. Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders. Science 2012;338(6114):161922. [34] Almomani R, Marchi M, Sopacua M, Lindsey P, Salvi E, Koning B, et al., on behalf on the PROPANE Study Group, Evaluation of molecular inversion probe versus TruSeqs custom methods for targeted next-generation sequencing. PLoS One 2020;15(9):e0238467. [35] Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 1977;74(12):54637. Available from: https://doi.org/10.1073/pnas.74.12.5463 PMID: 271968; PMCID: PMC431765. [36] Ju J, Kim DH, Bi L, Meng Q, Bai X, Li Z, et al. Sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators. Proc Natl Acad Sci U S A 2006;103(52):1963540. [37] Ronaghi M, Karamohamed S, Pettersson B, Uhl´en M, Nyr´en P. Real-time DNA sequencing using detection of pyrophosphate release. Anal Biochem 1996;242(1):849.

90

Chapter 4 Genomic sequencing of rare diseases

[38] Ronaghi M, Uhl´en M, Nyr´en P. A sequencing method based on real-time pyrophosphate. Science 1998;281(5375):3635. [39] Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):37680. [40] Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011;475(7356):34852. [41] Merriman B, Ion Torrent R&D Team, Rothberg JM. Progress in ion Torrent semiconductor chip based sequencing. Electrophoresis. 2012;33(23):3397417. [42] Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005;309(5741):172832. [43] McKernan KJ, Peckham HE, Costa GL, McLaughlin SF, Fu Y, Tsung EF, et al. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using twobase encoding. Genome Res 2009;19(9):152741. [44] Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 2016;17(6):33351. [45] Shendure J, Balasubramanian S, Church GM, et al. DNA sequencing at 40: past, present and future. Nature. 2017;550(7676):34553. [46] McCombie WR, McPherson JD, Mardis ER. Next-generation sequencing technologies. Cold Spring Harb Perspect Med 2019;9:11. [47] Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW. Zero-mode waveguides for single-molecule analysis at high concentrations. Science. 2003;299(5607):6826. [48] Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, et al. Sequencing from singles polymerase molecules. Science. 2009 Jan 2;323(5910):1338. [49] English AC, Richards S, Han Y, Wang M, Vee V, Qu J, et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 2012;7(11):e47768. 
[50] Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 2018;36(4):33845. [51] Logsdon GA, Vollger MR, Eichler EE. Long-read human genome sequencing and its applications. Nat Rev Genet 2020;21(10):597614. [52] Laszlo AH, Derrington IM, Ross BC, et al. Decoding long nanopore sequencing reads of natural DNA. Nat Biotechnol 2014;32(8):82933. [53] Bayley H. Nanopore sequencing: from imagination to reality. Clin Chem 2015;61(1):2531. [54] Miga KH, Koren S, Rhie A, Vollger MR, Gershman A, Bzikadze A, et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature. 2020;585(7823):7984. [55] Wee Y, Bhyan SB, Liu Y, Lu J, Li X, Zhao M. The bioinformatics tools for the genome assembly and analysis based on third-generation sequencing. Brief Funct Genomics 2019;18(1):112. [56] Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nat Methods 2009;6(11 Suppl):S612. [57] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. 1000 genome project data processing subgroup; the sequence alignment/map format and SAMtools. Bioinformatics 2009;25(16):20789. [58] Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 2011;21(5):73440. [59] Loomis EW, Eid JS, Peluso P, Yin J, Hickey L, Rank D, et al. Sequencing the unsequenceable: expanded CGG-repeat alleles of the fragile X gene. Genome Res 2013;23(1):1218. [60] Ardui S, Race V, Zablotskaya A, Hestand MS, Van Esch H, Devriendt K, et al. Detecting AGG interruptions in male and female FMR1 premutation carriers by single-molecule sequencing. Hum Mutat 2017;38 (3):32431.

References

91

[61] McFarland KN, Liu J, Landrian I, Godiska R, Shanker S, Yu F, et al. SMRT sequencing of long tandem nucleotide repeats in SCA10 reveals unique insight of repeat expansion structure. PLoS One 2015;10(8): e0135906. [62] Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet 2011;12(6):44351. [63] Shen Y, Wan Z, Coarfa C, Drabek R, Chen L, Ostrowski EA, et al. Discovery method to assess variant allele probability from next-generation resequencing data. Genome Res 2010;20(2):27380. [64] McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010;20(9):1297303. [65] Strauss KA, Gonzaga-Jauregui C, Brigatti KW, Williams KB, King AK, Van Hout C, et al., Genomic diagnostics within a medically underserved population: efficacy and implications. Genet Med 2018;20(1):3141. [66] Posey JE, O’Donnell-Luria AH, Chong JX, Harel T, Jhangiani SN, Coban Akdemir ZH, et al., Centers for Mendelian genomics. Insights into genetics, human biology and disease gleaned from family based genomic studies. Genet Med 2019;21(4):798812. [67] Sherry ST, Ward M, Sirotkin K. dbSNP-database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res 1999;9(8):6779. [68] Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29(1):30811. [69] Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alfo¨ldi J, Wang Q, Genome Aggregation Database Consortium, Neale BM, Daly MJ, MacArthur DG, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):43443. [70] Karczewski KJ, Weisburd B, Thomas B, Solomonson M, Ruderfer DM, Kavanagh D, The Exome Aggregation Consortium, Daly MJ, MacArthur DG, et al. 
The ExAC browser: displaying reference data information from over 60 000 exomes. Nucleic Acids Res 2017;45(D1):D8405. [71] Online Mendelian Inheritance in Man. OMIMs. McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD). Available from: http://omim.org/. [72] McKusick VA. Mendelian inheritance in man and its online version, OMIM. Am J Hum Genet 2007;80 (4):588604. Available from: https://doi.org/10.1086/514346 Epub 2007 Mar 8. PMID: 17357067; PMCID: PMC1852721. [73] Cooper DN, Ball EV, Krawczak M. The human gene mutation database. Nucleic Acids Res 1998;26 (1):2857. [74] Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42(Database issue):D9805. Available from: https://doi.org/10.1093/nar/gkt1113 Epub 2013 Nov 14. PMID: 24234437; PMCID: PMC3965032. [75] Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, Thorn CF, et al. Pharmacogenomics knowledge for personalized medicine. Clinical Pharmacology & Therapeutics 2012;92(4):41417. [76] Siepel A, Pollard KS, Haussler D. New methods for detecting lineage-specific selection. Proceedings of the 10th International Conference on Research in Computational Molecular Biology (RECOMB) 2006;190205. [77] Lindblad-Toh K, Garber M, Zuk O, Lin MF, Parker BJ, Washietl S, et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature. 2011;478(7370):47682. [78] Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Res 2009;19(9):155361.

92

Chapter 4 Genomic sequencing of rare diseases

[79] Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res 2005 Jul;15(7):90113. [80] Margulies EH, Blanchette M, NISC Comparative Sequencing Program, Haussler D, Green ED. Identification and characterization of multi-species conserved sequences. Genome Res 2003;13(12):250718. [81] Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009;4(7):107381. [82] Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, et al. A method and server for predicting damaging missense mutations. Nat Methods 2010;7(4):2489. [83] Schwarz JM, Ro¨delsperger C, Schuelke M, Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat Methods 2010;7(8):5756. [84] Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 2014;46(3):31015. Available from: https://doi.org/10.1038/ng.2892 Epub 2014 Feb 2. PMID: 24487276; PMCID: PMC3992975. [85] Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 1999;27(1):2934. [86] Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res 2012;40(Database issue):D10914. [87] Schaefer MH, Fontaine JF, Vinayagam A, Porras P, Wanker EE, Andrade-Navarro MA. HIPPIE: Integrating protein interaction networks with experiment based quality scores. PLoS One 2012;7(2): e31826. [88] Search Tool for the Retrieval of Interacting Genes/Proteins (STRING): http://string-db.org/. [89] Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, et al. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. 
Genome Biol 2009;10(11):R130. [90] Kapushesky M, Emam I, Holloway E, Kurnosov P, Zorin A, Malone J, et al. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res 2010;38(Database issue):D6908. [91] GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet 2013;45(6):5805. [92] Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, et al. Human cell atlas meeting participants. The human cell atlas. Elife. 2017;6:e27041. [93] Wang J, Al-Ouran R, Hu Y, Kim SY, Wan YW, Wangler MF, et al. MARRVEL: integration of human and model organism genetic resources to facilitate functional annotation of the human genome. Am J Hum Genet 2017;100(6):84353. [94] Thurmond J, Goodman JL, Strelets VB, Attrill H, Gramates LS, Marygold SJ, et al. FlyBase Consortium. FlyBase 2.0: the next generation. Nucleic Acids Res 2019;47(D1):D75965. [95] Harris TW, Arnaboldi V, Cain S, Chan J, Chen WJ, Cho J, et al. WormBase: a modern model organism information resource. Nucleic Acids Res 2020;48(D1):D7627. [96] Ruzicka L, Howe DG, Ramachandran S, Toro S, Van Slyke CE, Bradford YM, et al. The Zebrafish information network: new support for non-coding genes, richer gene ontology annotations and the alliance of genome resources. Nucleic Acids Res 2019;47(D1):D86773. [97] Smith JR, Hayman GT, Wang SJ, Laulederkind SJF, Hoffman MJ, Kaldunski ML, et al. The year of the rat: the rat genome database at 20: a multi-species knowledgebase and analysis platform. Nucleic Acids Res 2020;48(D1):D73142. [98] Bult CJ, Blake JA, Smith CL, Kadin JA, Richardson JE. Mouse genome database group. mouse genome database (MGD) 2019. Nucleic Acids Res 2019;47(D1):D8016. [99] Brown SD, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Model Mech 2012;5(3):28992.

References

93

[100] Cacheiro P, Mun˜oz-Fuentes V, Murray SA, Dickinson ME, Bucan M, Nutter LMJ, et al., Genomics England Research Consortium; International Mouse Phenotyping Consortium. Human and mouse essentiality screens as a resource for disease gene discovery. Nat Commun 2020;11(1):655. [101] Chong JX, Buckingham KJ, Jhangiani SN, Boehm C, Sobreira N, Smith JD, Centers for Mendelian Genomics, Bamshad MJ, et al. The genetic basis of mendelian phenotypes: discoveries, challenges, and opportunities. Am J Hum Genet 2015;97(2):199215. [102] Bamshad MJ, Nickerson DA, Chong JX. Mendelian gene discovery: fast and furious with no end in sight. Am J Hum Genet 2019;105(3):44855. [103] Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, et al. The diploid genome sequence of an individual human. PLoS Biol 2007;5(10):e254. [104] Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461(7261):2726. [105] Toydemir RM, Rutherford A, Whitby FG, Jorde LB, Carey JC, Bamshad MJ. Mutations in embryonic myosin heavy chain (MYH3) cause Freeman-Sheldon syndrome and Sheldon-Hall syndrome. Nat Genet 2006;38(5):5615. [106] Stavrou M, Sargiannidou I, Christofi T, Kleopa KA. Genetic mechanisms of peripheral nerve disease. Neurosci Lett 2020;742:135357. [107] Wade N. March 10, Disease Cause Is Pinpointed With Genome. New York Times; 2010. [108] Choi M, Scholl UI, Ji W, Liu T, Tikhonova IR, Zumbo P, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci U S A 2009;106(45):19096101. [109] Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet 2010;42(1):305. [110] Tifft CJ, Adams DR. The National Institutes of Health undiagnosed diseases program. Curr Opin Pediatr 2014;26(6):62633. [111] Gahl WA, Mulvihill JJ, Toro C, Markello TC, Wise AL, Ramoni RB, et al. 
The NIH undiagnosed diseases program and network: applications to modern medicine. Mol Genet Metab 2016;117 (4):393400. [112] Ramoni RB, Mulvihill JJ, Adams DR, Allard P, Ashley EA, Bernstein JA, Undiagnosed Diseases Network, Wise AL, et al. The undiagnosed diseases network: accelerating discovery about health and disease. Am J Hum Genet 2017;100(2):18592. [113] Schoch K, Esteves C, Bican A, Spillmann R, Cope H, McConkie-Rosell A, Undiagnosed Diseases Network, Cogan J, Bernstein JA, Adams DR, McCray AT, Shashi V, et al. Clinical sites of the undiagnosed diseases network: unique contributions to genomic medicine and science. Genet Med 2020. [114] Bamshad MJ, Shendure JA, Valle D, Hamosh A, Lupski JR, Gibbs RA, et al., Centers for Mendelian Genomics. The Centers for Mendelian Genomics: a new large-scale initiative to identify the genes underlying rare Mendelian conditions. Am J Med Genet A. 2012;158A(7):15235. [115] Sobreira N, Schiettecatte F, Valle D, Hamosh A. GeneMatcher: a matching tool for connecting investigators with an interest in the same gene. Hum Mutat 2015;36(10):92830. [116] Philippakis AA, Azzariti DR, Beltran S, Brookes AJ, Brownstein CA, Brudno M, et al. The matchmaker exchange: a platform for rare disease gene discovery. Hum Mutat 2015;36:91521. [117] Beaulieu CL, Majewski J, Schwartzentruber J, Samuels ME, Fernandez BA, Bernier FP, FORGE Canada Consortium, Friedman JM, Michaud JL, Boycott KM, et al. FORGE Canada Consortium: outcomes of a 2-year national rare-disease gene-discovery project. Am J Hum Genet 2014;94(6):80917. [118] Sawyer SL, Hartley T, Dyment DA, Beaulieu CL, Schwartzentruber J, Smith A, FORGE Canada Consortium; Care4Rare Canada Consortium, Majewski J, Boycott KM, et al. Utility of whole-exome sequencing for those near the end of the diagnostic odyssey: time to address gaps in care. Clin Genet 2016;89(3):27584.

94

Chapter 4 Genomic sequencing of rare diseases

[119] Firth HV, Wright CF, DDD Study. The deciphering developmental disorders (DDD) study. Dev Med Child Neurol 2011;53(8):7023. [120] Wright CF, Fitzgerald TW, Jones WD, Clayton S, McRae JF, van Kogelenberg M, et al., DDD Study. Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet. 2015;385(9975):130514. [121] Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature. 2015;519(7542):2238. [122] Deciphering Developmental Disorders Study, et al. Prevalence and architecture of de novo mutations in developmental disorders. Nature. 2017;542(7642):4338. [123] https://www.genomicsengland.co.uk. [124] Turro E, Astle WJ, Megy K, Gra¨f S, Greene D, Shamardina O, NIHR BioResource for the 100,000 Genomes Project, Kingston N, Walker N, Bradley JR, Ashford S, Penkett CJ, Freson K, Stirrups KE, Raymond FL, Ouwehand WH, et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature. 2020;583(7814):96102. [125] Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N Engl J Med 2013;369(16):150211. [126] Yang Y, Muzny DM, Xia F, Niu Z, Person R, Ding Y, et al. Molecular findings among patients referred for clinical whole-exome sequencing. JAMA. 2014;312(18):18709. [127] Lee H, Deignan JL, Dorrani N, Strom SP, Kantarci S, Quintero-Rivera F, et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA. 2014;312(18):18807. [128] Trujillano D, Bertoli-Avella AM, Kumar Kandaswamy K, Weiss ME, Ko¨ster J, Marais A, et al. Clinical exome sequencing: results from 2819 samples reflecting 1000 families. Eur J Hum Genet 2017;25(2):17682. [129] Meng L, Pammi M, Saronwala A, Magoulas P, Ghazi AR, Vetrini F, et al. 
Use of exome sequencing for infants in intensive care units: ascertainment of severe single-gene disorders and effect on medical management. JAMA Pediatr 2017;171(12):e173438. [130] Posey JE, Rosenfeld JA, James RA, Bainbridge M, Niu Z, Wang X, et al. Molecular diagnostic experience of whole-exome sequencing in adult patients. Genet Med 2016;18(7):67885. [131] Clark MM, Stark Z, Farnaes L, Tan TY, White SM, Dimmock D, et al. Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases. NPJ Genom Med 2018;3:16. [132] LELM Vissers, Nimwegen van, Schieving KJM, Kamsteeg JH, Kleefstra EJ, Yntema T, et al. A clinical utility study of exome sequencing versus conventional genetic testing in pediatric neurology. Genet Med 2017;19(9):105563. [133] Stark Z, Schofield D, Alam K, Wilson W, Mupfeki N, Macciocca I, et al. Prospective comparison of the cost-effectiveness of clinical whole-exome sequencing with that of usual care overwhelmingly supports early use and reimbursement. Genet Med 2017;19(8):86774. [134] Stark Z, Tan TY, Chong B, Brett GR, Yap P, Walsh M, Melbourne Genomics Health Alliance, Gaff C, White SM, et al. A prospective evaluation of whole-exome sequencing as a first-tier molecular test in infants with suspected monogenic disorders. Genet Med 2016;18(11):10906. [135] Bick D, Jones M, Taylor SL, Taft RJ, Belmont J. Case for genome sequencing in infants and children with rare, undiagnosed or genetic diseases. J Med Genet 2019;56(12):78391. [136] Stavropoulos DJ, Merico D, Jobling R, Bowdin S, Monfared N, Thiruvahindrapuram B, et al. Whole genome sequencing expands diagnostic utility and improves clinical management in pediatric medicine. NPJ Genom Med 2016;1:15012. [137] Srivastava S, Love-Nichols JA, Dies KA, Ledbetter DH, Martin CL, Chung WK, et al., NDD Exome Scoping Review Work Group. Meta-analysis and multidisciplinary consensus statement: exome

References

[138]

[139]

[140]

[141]

[142] [143]

[144] [145] [146]

[147]

[148]

[149]

95

sequencing is a first-tier clinical diagnostic test for individuals with neurodevelopmental disorders. Genet Med 2019;21(11):241321. Vrijenhoek T, Middelburg EM, Monroe GR, van Gassen KLI, Geenen JW, Ho¨vels AM, et al. Whole-exome sequencing in intellectual disability; cost before and after a diagnosis. Eur J Hum Genet 2018;26 (11):156671. Schofield D, Rynehart L, Shresthra R, White SM, Stark Z. Long-term economic impacts of exome sequencing for suspected monogenic disorders: diagnosis, management, and reproductive outcomes. Genet Med 2019;21(11):258693. Robinson JO, Wynn J, Biesecker B, Biesecker LG, Bernhardt B, Brothers KB, et al. Psychological outcomes related to exome and genome sequencing result disclosure: a meta-analysis of seven Clinical Sequencing Exploratory Research (CSER) Consortium studies. Genet Med 2019;21(12):278190. Rubinstein WS, Maglott DR, Lee JM, et al. The NIH genetic testing registry: a new, centralized database of genetic tests to enable access to comprehensive information and improve transparency. Nucleic Acids Res 2013;41(Database issue):D92535. Prevention. USCfDCa. Clinical Laboratory Improvement Amendments (CLIA). https://wwwn.cdc.gov/ clia/. Published 2019. Accessed. Aziz N, Zhao Q, Bry L, Driscoll DK, Funke B, Gibson JS, et al. College of American Pathologists’ laboratory standards for next-generation sequencing clinical tests. Arch Pathol Lab Med 2015;139 (4):48193. Gargis AS, Kalman L, Berry MW, Bick DP, Dimmock DP, Hambuch T, et al. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat Biotechnol 2012;30(11):10336. Gargis AS, Kalman L, Bick DP, da Silva C, Dimmock DP, Funke BH, et al. Good laboratory practice for clinical next-generation sequencing informatics pipelines. Nat Biotechnol 2015;33(7):68993. Rehm HL, Bale SJ, Bayrak-Toydemir P, Berg JS, Brown KK, Deignan JL, et al., Working Group of the American College of Medical Genetics and Genomics Laboratory Quality Assurance Commitee. 
ACMG clinical laboratory standards for next-generation sequencing. Genet Med 2013;15(9):73347. Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al., American College of Medical Genetics and Genomics. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med 2013;15(7):56574. Kalia SS, Adelman K, Bale SJ, Chung WK, Eng C, Evans JP, et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genet Med. 2017;19 (2):24955. FDA. USFaDA. Considerations for Design, Development, and Analytical Validation of Next Generation Sequencing-Based In Vitro Diagnostics Intended to Aim in the Diagnosis of Suspected Germline Diseases. https://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/Guidance Documents/UCM509838.pdf. Published 2018.


CHAPTER 5

Recessive diseases and founder genetics

Erik G. Puffenberger

Clinic for Special Children, Strasburg, PA, United States

5.1 Introduction

Dr. Victor McKusick (1921–2008), a cardiologist turned medical geneticist, is widely considered the founding father of medical genetics. In the 1960s, he incidentally learned about a high prevalence of achondroplasia in the Amish of Lancaster County, Pennsylvania. As his practice at Johns Hopkins Hospital was little more than an hour's drive from Amish country, he traveled to Lancaster to meet the local physician who first reported the rare cluster of achondroplasia cases. This was the first of many trips to Lancaster County in Dr. McKusick’s long and prolific career in medical genetics. He recognized the value of the Amish population for the study of human genetic traits and conducted landmark studies in this population that helped establish the field of medical genetics.

The Old Order Amish and Mennonite founder populations of Lancaster County, Pennsylvania, are contemporary Anabaptist groups that fall under the collective term “Plain people,” which denotes Christian individuals who live a modest and traditionally agrarian life, dress simply, and exist within the modern world yet shun many of its material trappings. Due to their religious beliefs, they live in communities that are culturally and genetically isolated from outside influence.

In the early years of medical genetics, researchers produced dozens of clinical monographs describing rare genetic diseases in Amish populations. In 1978 McKusick compiled these research papers into a book entitled “Medical Genetic Studies of the Amish.” These articles described 16 new genetic syndromes first reported in Amish cohorts and 18 previously described syndromes. All of them centered on unusually large clusters of globally rare genetic diseases in Amish demes throughout North America. For the past 60 years, since McKusick’s first excursion to Lancaster County, the Plain people of Pennsylvania and elsewhere have been at the forefront of genetic research and translational medicine. The genetic insights gained through Plain community studies have tracked in lockstep with the growth of medical genetics as a discipline. Today, these groups continue to encourage and cooperate with medical genetic investigations in the belief that the knowledge attained might benefit “special children” everywhere.

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00012-0 © 2021 Elsevier Inc. All rights reserved.

5.2 Autosomal recessive inheritance

As discussed in Chapter 1, Introduction to Concepts of Genetics and Genomics, monogenic or single-gene disorders exhibit four classical inheritance patterns: autosomal dominant, autosomal recessive, X-linked dominant, and X-linked recessive. The term “autosomal” simply refers to the location of the causative gene in the human genome: it is found on one of the autosomes (chromosomes 1–22) and not on the X or Y sex chromosomes.

Autosomal recessive inheritance accounts for the majority of pediatric genetic disease. Even though most recessive diseases are individually rare in the population, their cumulative contribution to pediatric disease and death is high. In these genetic conditions, each parent harbors both a normal allele and an abnormal allele of the same gene. These heterozygous parents, carrying only one abnormal copy of the gene (allele), are generally unaffected by the disease. However, when a child inherits an abnormal allele from each parent (biallelic), and thus has no normal functioning copy of the gene, the proper function of the corresponding protein is abolished and the genetic disorder manifests. If the child inherited the same abnormal allele of the gene from each parent, the child is said to be homozygous; if each parent passed on a different abnormal allele, the child is compound heterozygous at the disease locus.

In carrier parents, the segregation of alleles during meiosis results in defined probabilities for the three possible genotype classes in their offspring. For each pregnancy, the probability of an affected child is 25%, the probability of a carrier child is 50%, and the probability of a child with two normal copies of the gene is 25% (Fig. 5.1). The autosomal nature of the inheritance pattern ensures that males and females are equally likely to be affected.

Most autosomal recessive disorders are caused by loss-of-function (LoF) mutations or variants that render the protein non-functional.
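As a worked illustration of the segregation probabilities quoted for two carrier parents (25% affected, 50% carrier, 25% with two normal copies), the following minimal Python sketch enumerates the four equally likely allele combinations at a single autosomal locus. The allele labels "A" (normal) and "a" (abnormal) and the function name are illustrative assumptions, not notation taken from this chapter, and the sketch assumes each parent transmits either allele with probability 1/2.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotype_probabilities(parent1, parent2):
    """Enumerate the Punnett square for one autosomal locus.

    Each parent is a two-character string of alleles, e.g. "Aa", where
    "A" is the normal allele and "a" the abnormal allele (hypothetical
    labels). Each allele is assumed to be transmitted with probability 1/2.
    """
    counts = Counter(
        "".join(sorted(a1 + a2))  # sort so "aA" and "Aa" are one genotype
        for a1, a2 in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Two heterozygous (carrier) parents.
probs = offspring_genotype_probabilities("Aa", "Aa")
for genotype, p in sorted(probs.items()):
    print(genotype, p)
# AA 1/4  (two normal copies)
# Aa 1/2  (carrier)
# aa 1/4  (affected)
```

The same function can be called with other parental genotypes, for example "Aa" and "AA", to see that a single carrier parent cannot have an affected child under this simple model.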
As reviewed in Chapter 1, Introduction to Concepts of Genetics and Genomics, LoF variants can include nonsense or stop-gain mutations that introduce a premature stop codon in the protein, frameshift mutations due to insertion or deletion (indel) of base pairs that alter the reading frame and change the protein sequence downstream of the indel; or splice-altering mutations that can modify the specificity of the splicing process. However, nonsynonymous or missense variants that change one amino acid of the translated protein can also cause recessive disorders through LoF or hypomorphic effects, depending on the degree of retained protein function or activity. There are many autosomal recessive diseases in humans, and some are relatively common. Inborn errors of metabolism (IEMs) are a major group of genetic disorders, most of which are inherited in an autosomal recessive manner, that affect the proper functioning of the cellular metabolic pathways involved in the break-down or storage of carbohydrates, fatty acids, and/or proteins. Individual IEM disorders are very rare; however, as a group, they occur in 1 in 2500 births, making them quite common in the population. Several well-known examples include phenylketonuria (PKU, MIM #261600), Gaucher disease (GD1, MIM #230800), Pompe disease (GSD2, MIM #232300), among many others. Other notorious examples of autosomal recessive diseases include cystic fibrosis (CF, MIM #219700) and spinal muscular atrophy (SMA, MIM #253300). In individuals of European ancestry, the incidence of these disorders ranges from 1/4000 for CF to 1/15,000 for PKU. Less well-known disorders, such as glutaric aciduria type 1 (GA1, MIM #231670), are detected in about 1/100,000 births. However, in specific populations around the world, a particular disorder may be found at much higher frequency; such is the case with GA1 in the Old Order Amish of Lancaster County, Pennsylvania. GA1 was first recognized in the Amish community by Dr. 
Holmes Morton, who studied 14 children from 7 different sibships [1]. Although the disease was known locally as “Amish cerebral palsy,” Dr. Morton realized that it was not caused by a birth injury since nearly all affected children became symptomatic between 3 and 18 months of life, and they often had affected siblings, suggesting an inherited disorder. The natural history of the disease was unusually variable, with severe cases marked by acute infantile encephalopathy and sudden death, whereas milder cases demonstrated extrapyramidal cerebral palsy. At least one child was completely asymptomatic. In 1996, using knowledge of the biochemical pathway and the DNA sequence of the cloned mouse gene, the disease-causing mutation in the Amish was identified in the glutaryl-CoA dehydrogenase gene, GCDH. The specific DNA change was c.1262C>T, and the predicted amino acid change in the GCDH protein was p.Ala421Val [2].

FIGURE 5.1 Autosomal recessive pedigree with genotype classes. In autosomal recessive inheritance, each parent harbors two alleles at the locus, a normal allele and an abnormal allele depicted by the red bar. Their offspring inherit one allele from the father and one allele from the mother. In carrier parents, the random segregation of alleles during meiosis results in defined probabilities for the three possible genotype classes in their offspring. For each pregnancy, the probability of a child with two normal copies of the gene is 25%, the probability of an affected child is 25%, and the probability of a carrier child is 50%. The autosomal nature of the inheritance pattern ensures that males and females are equally likely to be affected.


In any population, the incidence of a recessive disease is governed by the carrier frequency within that population. The carrier frequency is simply defined as the proportion of individuals within a population that harbor a pathogenic allele for a particular disease. For GA1, an empirical carrier frequency in the Amish population was calculated by genotyping a set of healthy Amish controls for the GCDH c.1262C>T variant. Of 298 Amish control individuals, 35 heterozygotes were detected, yielding a carrier frequency of 11.74% (35/298). Thus, more than 1 in 10 Amish individuals is a carrier of the abnormal copy of GCDH associated with GA1. To determine how this affects the incidence of disease in this community, the allele frequency for the c.1262C>T variant must be calculated. As we shall discuss later, the allele frequency can be estimated at 5.87% from the same genotype data presented above [i.e., 35/(2 × 298)]. This yields a birth incidence of roughly 1/290 [i.e., 0.0587 × 0.0587] [2]. This represents a 345-fold increase in the incidence of GA1 in the Amish compared to the general population.

With an estimated population size of 50,000 individuals, these data predict that 5910 Amish individuals in Lancaster are carriers of the pathogenic allele for glutaric aciduria, while approximately 199 GA1 patients have been born in the past few generations. Thus the number of GA1 disease alleles carried by heterozygotes (5910) is 14-fold higher than the number of abnormal alleles found in affected individuals (199 × 2 = 398). These statistics emphasize that incidence rates for specific diseases can be much higher in certain populations, and that the major reservoir for any specific mutation or disease-associated variant within a population is healthy carriers and not affected individuals.
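The frequency arithmetic in this paragraph can be reproduced in a few lines. The following is a minimal sketch, assuming only the control genotype counts quoted above (35 heterozygotes among 298 individuals, with no homozygotes among the healthy controls) and the general-population incidence of about 1/100,000 cited earlier. The function name `carrier_stats` is an illustrative choice and is not taken from the original study.

```python
def carrier_stats(heterozygotes, sample_size):
    """Return (carrier frequency, allele frequency, expected incidence).

    Assumes every variant allele in the control sample sits in a
    heterozygote, so the allele count equals the heterozygote count.
    """
    carrier_freq = heterozygotes / sample_size
    allele_freq = heterozygotes / (2 * sample_size)  # two alleles per person
    incidence = allele_freq ** 2                     # q squared
    return carrier_freq, allele_freq, incidence


carrier_freq, allele_freq, incidence = carrier_stats(35, 298)
general_incidence = 1 / 100_000  # GA1 incidence quoted for the general population

print(f"carrier frequency: {carrier_freq:.2%}")
print(f"allele frequency:  {allele_freq:.2%}")
print(f"birth incidence:   about 1 in {1 / incidence:.0f}")
print(f"fold increase:     about {incidence / general_incidence:.0f}")
```

The sketch covers only the frequencies derived from the 298 controls; the population-level counts of carriers and patients given in the text are taken as reported.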

5.2.1 The role of consanguinity in recessive diseases

The occurrence of recessive genetic disorders is significantly higher in populations that practice consanguineous marriage (from Latin consanguineus, “of the same blood”), meaning the marriage of closely related individuals such as first or second cousins. The probability that two individuals harbor pathogenic alleles in the same gene is significantly increased if they are closely related. For example, on average 6.25% of the genome of a child born to first cousins will be homozygous by descent from common ancestors; that is, both copies of the genome are identical in these regions. The child will have significant regions of homozygosity scattered across the genome, and if a pathogenic allele lies in any of those regions, the child will be affected by a recessive condition.

The degree of relatedness among individuals in a given population can be quantified by the inbreeding coefficient, which denotes the probability that the two alleles at a given locus are identical-by-descent (IBD), meaning descended from a common ancestor. Consanguinity increases the overall coefficient of inbreeding in the population. The closer the relationship between the two parents, the higher the proportion of homozygosity in the offspring and the greater the risk for recessive genetic diseases to occur in these offspring.

Some populations in the world preferentially practice consanguineous marriage due to cultural or religious traditions. Generally, a consanguineous marriage is defined as a union between two individuals related as second cousins or closer. Consanguinity rates vary from one population to another due to culture, religion, and/or geography [3]. The highest rates of consanguinity in human populations occur in the Middle East, Northern Africa, and West Asia [4,5]. In these populations, consanguineous marriages can account for 20% to 50% of all marriages.
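The 6.25% figure for the children of first cousins follows from Wright's path-counting formula, in which each common ancestor of the parents contributes (1/2)^(n1 + n2 + 1), where n1 and n2 are the numbers of generations separating each parent from that ancestor. The sketch below assumes that the common ancestors are themselves not inbred. The function name and the tuple encoding of a pedigree are illustrative choices, not notation used in this chapter.

```python
from fractions import Fraction


def inbreeding_coefficient(common_ancestors):
    """Inbreeding coefficient of a child by path counting.

    common_ancestors: one (n1, n2) tuple per ancestor shared by the two
    parents, giving the generations from each parent up to that ancestor.
    Assumes the common ancestors are not inbred themselves.
    """
    return sum(Fraction(1, 2) ** (n1 + n2 + 1) for n1, n2 in common_ancestors)


# First cousins share two grandparents, each two generations above both parents.
first_cousins = [(2, 2), (2, 2)]
# Second cousins share two great-grandparents, three generations above both parents.
second_cousins = [(3, 3), (3, 3)]

f_first = inbreeding_coefficient(first_cousins)
f_second = inbreeding_coefficient(second_cousins)
print(f"first cousins:  F = {f_first} ({float(f_first):.2%})")
print(f"second cousins: F = {f_second} ({float(f_second):.2%})")
```

Exact fractions are used so that the results can be read directly as 1/16 and 1/64 rather than as rounded decimals.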
Autosomal recessive diseases in the offspring of first cousin marriages demonstrate a 1.7% to 2.8% higher prevalence than the background population risk. Although contributing to a significant burden of disease, the high degree of autozygosity (genomic regions homozygous IBD) present in individuals from consanguineous populations has also been leveraged to identify homozygous pathogenic variation associated with novel recessive disorders at scale [6-8]. In populations with high rates of consanguinity, these genetic risks are measured against the social and cultural benefits of consanguinity such as more stable marital relationships, greater compatibility with in-laws, lower domestic violence rates, lower divorce rates, and the familial maintenance of any landholdings [5].

Of note, consanguinity is not the most common cause of recessive genetic diseases in general or outbred populations, such as those of the United States, Western Europe, or other cosmopolitan settings. The majority of cases arise between unrelated individuals who carry different pathogenic alleles in the same gene and transmit these different alleles to their offspring in a pattern of compound heterozygosity, in contrast to individuals who are homozygous for the same abnormal allele inherited from each of their parents, as described above. For example, a recent study examined the pathogenic allele spectrum for 137 GA1 patients: 92 of these patients were Old Order Amish and 45 were from the general US population [9]. Only 3 out of 45 patients (7%) from the outbred population were homozygous for their respective disease variants. Meanwhile, as expected, 91 of the Amish patients (99%) were homozygous for the GCDH c.1262C>T disease allele. Interestingly, the parents of the 91 Amish GA1 patients self-reported relationships no closer than third cousin. Therefore close consanguinity per se, as described above, does not adequately explain the large number of GA1 cases in the Amish.

5.2.2 The founder effect

Nearly 60 years ago, Dr. McKusick visited Lancaster County to investigate anecdotal evidence of a high prevalence of achondroplasia in the Amish. By studying large pedigrees with multiple affected individuals, he discovered that the disorder in these families was in fact not achondroplasia but two different autosomal recessive skeletal dysplasias [10,11]. The first, Ellis-van Creveld syndrome (EVC, MIM #225500), also known then as “six-fingered dwarfism,” was characterized by short stature with normal head circumference, abnormal dentition, congenital heart defects, and skeletal and genital anomalies. The second was a novel disorder, named cartilage-hair hypoplasia (CHH, MIM #250250) by Dr. McKusick, and it was characterized by normocephalic short-limb dwarfism, lumbar lordosis, metaphyseal dysplasia, limited elbow extension, short hands, joint hyperextensibility, and fine, sparse hair. Although the first medical description of this form of dwarfism was reported in the Amish, it was later found to be relatively common in the Finnish population [12].

In addition to the description of these two rare genetic disorders, these early studies of autosomal recessive disorders in the Old Order Amish revealed many of the core characteristics that make genetically isolated populations so attractive for genetics research, namely: (1) a self-defined and closed population, (2) a small number of founding individuals, (3) concentrated population settlements, (4) a relatively homogeneous lifestyle, and (5) large family sizes. Additional data, such as extensive pedigrees demonstrating descent from a common ancestor for all parents of affected children, further strengthened the hypothesis at the time that a single mutation was responsible for all cases in the population.
The founder mutation model presumes that a rare pathogenic allele was carried by one of the founders of the population and that this allele has been passed to descendants of the founder. Depending on the allele frequency in the descendants, some number of affected individuals are born in the community who are homozygous for the mutation carried by the common ancestor. The assumptions in the founder mutation model can lead to surprisingly high carrier frequency estimates in the population. For example, the estimates for the two forms of dwarfism identified by Dr. McKusick suggested carrier frequencies of ~13% and ~10% for EVC and CHH, respectively. This phenomenon is more formally known as the founder effect.

In humans the founder effect is defined as a decrease of genetic variation in the population due to a population bottleneck followed by random genetic drift. The first part of the definition, a population bottleneck, occurs via a significant, even drastic, decrease in population size due to famine, disease, war, or migration. As the population recovers and increases in size, random genetic drift of the alleles present in the original founders can lead to significant shifts in allele frequencies over time compared to the population of origin.

As Dr. McKusick foretold, the Old Order Amish are an excellent example of a genetic isolate or founder population. The extant population of Lancaster Amish today are all descended from a core group of ~80 effective founders. Due to religious and social isolation, no new members have joined the group, and the Amish population has experienced high growth rates throughout its history. Currently the Lancaster population is estimated to be about 50,000, and all individuals are descended from the 80 original founders. For perspective, consider your own genealogy at a depth of 12 generations (roughly back to the mid-1700s when the Amish immigrated to America). At this depth, you have 4096 ancestors in your family tree, and most of them will be unique unless your parents were related to one another.
For the average Amish individual, those 4096 positions are occupied by the 80 founders, so there are multiple lines of descent from each founder into any living Amish person.

Although this chapter predominantly presents examples from Plain populations, the value of founder and consanguineous populations for the investigation of genetic disorders has been realized elsewhere. Numerous founder populations have been identified, and genetic studies in these groups recapitulate the same lessons learned in the Plain communities, albeit on different scales. These founder populations have all arisen through genetic isolation facilitated by geographic, social, and/or religious mechanisms after a major population bottleneck event, generally involving migration of a group of founders from the parental population. The new founder populations, due to reduced numbers, recapitulate neither the breadth of genetic diversity that was present in the parental population nor the exact allele frequencies for genetic variants. Some of these populations, such as the entire country of Finland, are well established, quite large, relatively old, and maintained by geography. The Finnish population has been extensively studied, and the genetic heritage of the population includes dozens of founder mutations [13]. The Ashkenazi Jewish population has supported substantial research into its own disease heritage [14]; it is a large but widespread founder group that maintains its isolation through social and religious means. A significant number of genetically isolated populations exist on islands where geographic barriers maintain separation; examples include the populations of Iceland [15], the Faroe Islands [16], Reunion Island [17], and Puerto Rico [18,19], among others. Indeed, throughout human history, small founder populations were likely the norm rather than the exception.

The founder effect can be best understood through a relevant example as shown in Fig. 5.2.
Here the change in the allele frequency of the BCKDHA c.1312T>A (p.Tyr438Asn) variant, associated with maple syrup urine disease (MSUD, MIM #248600) in the Mennonite population, is depicted. In the Swiss parental population, the carrier frequency for this variant was 0.033% (roughly 1 in 3030 individuals). The Mennonite founders numbered roughly 240 individuals, and by chance, one of them was a heterozygote for this variant. Thus in the founders, the carrier frequency jumped to 0.416%, a 12.6-fold increase. Following the founding event, genetic drift led to increases in the number of heterozygotes in each subsequent generation by chance. Based on extensive carrier testing of Mennonite individuals, we know that the current extant Mennonite population exhibits an MSUD allele frequency of 8.5% and a carrier frequency of 15.5%. Compared to the carrier frequency in the Swiss parental population, this is a 470-fold increase.

FIGURE 5.2 The founder effect. The figure depicts the change in carrier frequency for the BCKDHA c.1312T>A variant associated with maple syrup urine disease (MSUD) from the parental Swiss population to present-day Old Order Mennonite descendants. Each open circle represents an individual in the population. The red-filled circles represent carriers for the variant. In the Swiss parental population, the allele frequency was 0.017% and the carrier frequency was 0.033%. Thus roughly 1 in 3030 Swiss individuals was a carrier. The bottleneck occurred when a group of Swiss Mennonite founders traveled to the New World and settled in Lancaster County, Pennsylvania. This founding population numbered about 240 individuals, and by chance, one of them was a carrier for MSUD. This sampling effect alone altered the carrier frequency from 0.033% to 0.416%, a 12.6-fold increase. The next step in the process was random genetic drift. Imagine that the nascent Mennonite population consisted of 120 couples (240 people), and the lone carrier and their spouse had a large family, perhaps 10 children, and by chance, 8 of their 10 children were also carriers. This is akin to flipping a coin 10 times and getting 8 heads. In the next generation, the carrier frequency increased due to the large number of carrier individuals derived from this initial carrier couple. The exact change depended on the cumulative number of offspring born to the other 119 couples. However, we can imagine that if the MSUD carriers continue to randomly produce more carriers each generation, the carrier frequency will continue to increase. Notably, this effect is most pronounced when the population is small. As a population increases in size, the increase in the number of carriers in some families will be counteracted by decreases in carrier numbers in others. Based on extensive carrier testing at the Clinic for Special Children, the current extant Mennonite population exhibits an MSUD allele frequency of 8.5% and a carrier frequency of 15.5%. Compared to the carrier frequency in the Swiss parental population, this is a 470-fold increase.

This example provides an extreme case where a single disease-causing variant has increased in frequency due to


the founder effect. However, random genetic drift can influence allele frequencies at many different loci in the human genome, albeit with more modest effects. As an example, using whole-exome sequencing (WES) data from Old Order Amish and Mennonite control DNA samples, important findings related to genetic drift can be derived. First, the allele frequency shifts were greater in the Amish population compared to the Mennonite founder population, and this was due to the greater effect of drift on the smaller Amish population compared to the larger Mennonite population. Second, the data demonstrate that the number of alleles in each population that have increased in frequency is comparable to the number of alleles that have decreased in frequency. This exemplifies the random nature of genetic drift. Although equal numbers of alleles appear to increase and decrease in frequency, it might be plausible that discrepancies might exist based on the functional consequence of the variants. For example, LoF alleles might experience negative selection pressure and might be more likely to decrease in frequency within a founder population. The focus on large allele frequency shifts for disease-causing variants often obscures another more subtle effect of genetic drift. In addition to substantially increasing the frequency of some alleles, genetic drift can also lead to decreased genetic diversity due to the loss of low frequency (rare) alleles in the founding population. It is impossible to count alleles that have been lost from a population, but simulations can estimate the potential magnitude of the effect. By modeling hypothetical populations [20] with characteristics similar to the Amish and Mennonite founding populations, it can be shown that a surprisingly large proportion of rare alleles are likely lost from the population in the early generations due to genetic drift. 
Assuming an allele frequency of 1% in the founders, the simulations suggest that roughly 80% and 55% of rare alleles disappeared from the Amish and Mennonite populations, respectively. Beyond the handful of alleles that increased more than 10-fold, the notable finding from these simulations was the loss of alleles due to drift.
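To give a feel for how drift removes rare alleles, the following is a minimal sketch of a neutral Wright-Fisher simulation. It is not the model used in the cited simulations [20]: it assumes a constant population size (real founder populations grow), no selection, 12 generations, and founder counts of 80 and 240 taken from the figures quoted in this chapter. The function name, parameters, and seed are all illustrative assumptions, so the printed fractions should not be expected to match the published estimates.

```python
import random


def fraction_lost(n_founders, start_freq=0.01, generations=12,
                  n_alleles=400, seed=1):
    """Fraction of independent neutral alleles lost to drift.

    Constant-size Wright-Fisher model: each generation, every chromosome
    is drawn at random from the previous generation's allele pool.
    """
    rng = random.Random(seed)
    n_chrom = 2 * n_founders
    # Whole number of starting copies closest to the requested frequency.
    start_copies = max(1, round(start_freq * n_chrom))
    lost = 0
    for _allele in range(n_alleles):
        copies = start_copies
        for _generation in range(generations):
            p = copies / n_chrom
            # Binomial sampling of the next generation, one chromosome at a time.
            copies = sum(rng.random() < p for _ in range(n_chrom))
            if copies == 0:
                lost += 1
                break
    return lost / n_alleles


lost_small = fraction_lost(n_founders=80)
lost_large = fraction_lost(n_founders=240)
print(f"80 founders:  {lost_small:.0%} of rare alleles lost")
print(f"240 founders: {lost_large:.0%} of rare alleles lost")
```

Under these assumptions the smaller population loses a larger share of its rare alleles, which is the qualitative pattern described above; the exact percentages depend heavily on the growth model and starting copy number.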

5.2.3 Hardy-Weinberg equilibrium and inbreeding

While inbreeding is frequently maligned as the cause of most genetic disease, random genetic drift alone can explain the allele frequency shifts necessary to increase incidence rates for recessive genetic diseases in founder populations. However, this does not imply that inbreeding plays no role. As previously discussed, allele frequencies can change over time due to random genetic drift. Inbreeding does not alter allele frequencies directly but affects disease incidence through a separate mechanism. The major effect that inbreeding exerts is to alter the distribution of individuals (and thus alleles) in the three genotype classes.

Under Hardy-Weinberg equilibrium, the proportion of individuals in a population who fall into each genotype class can be calculated based on the frequency of the reference allele (i.e., the normal allele) and the alternate allele (i.e., mutant allele) at a particular locus. Given that $p$ = the frequency of the normal allele, $q$ = the frequency of the mutant allele, and $p + q = 1$, then the proportion of individuals in each of the three genotype classes can be calculated using the Hardy-Weinberg equilibrium formula as:

$$p^2 + 2pq + q^2$$

Here, $p^2$ represents the proportion of homozygotes for the normal allele, $2pq$ equals the proportion of heterozygotes in the population, and $q^2$ denotes the proportion of alternate (or mutant) allele homozygotes. We also refer to $q^2$ as the incidence for the corresponding disease in the population.
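As a quick numerical illustration of the three terms, the sketch below evaluates them for a mutant allele frequency of 1%, the value used in the inbreeding example of Fig. 5.3. The function name `hardy_weinberg` is an illustrative choice.

```python
def hardy_weinberg(q):
    """Return (p^2, 2pq, q^2) for a mutant allele frequency q, with p = 1 - q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    p = 1.0 - q
    return p * p, 2 * p * q, q * q


normal_hom, carriers, affected = hardy_weinberg(0.01)
print(f"normal homozygotes (p^2): {normal_hom:.4%}")
print(f"heterozygotes (2pq):      {carriers:.4%}")
print(f"mutant homozygotes (q^2): {affected:.4%}")
```

Because $p + q = 1$, the three proportions always sum to 1, which is a convenient check on any hand calculation.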


In the presence of inbreeding, the proportion of individuals in each genotype class deviates from the Hardy-Weinberg estimate in a calculable way [21]. Specifically, inbreeding increases the proportion of homozygotes (for both the normal and the mutant allele) at the expense of heterozygotes. Allele frequencies do not change; rather, a higher proportion of alleles is partitioned into the homozygous genotype classes. Fig. 5.3 provides a simple graphic for understanding this phenomenon. Given a locus with two alleles, we expect the number of people in each genotype class to follow Hardy-Weinberg expectations. When inbreeding is present in the population, the proportions shift: heterozygotes become less frequent than expected and homozygotes more frequent. Notably, homozygotes for the mutant allele increase in the population even as heterozygotes decrease, a counter-intuitive result.
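The chapter does not spell out the formula behind this shift. A common textbook form, assumed here, adds $Fpq$ to each homozygous class and scales the heterozygous class by $(1 - F)$, where $F$ is the inbreeding coefficient. The sketch below applies that assumed form to the allele frequencies used in Fig. 5.3 (99% and 1%, with F = 0.05); the function name is illustrative.

```python
def genotype_proportions(q, inbreeding=0.0):
    """Genotype class proportions for mutant allele frequency q.

    Assumed textbook form: AA = p^2 + Fpq, AB = 2pq(1 - F), BB = q^2 + Fpq.
    With inbreeding = 0 this reduces to Hardy-Weinberg proportions.
    """
    p = 1.0 - q
    shift = inbreeding * p * q
    return p * p + shift, 2 * p * q * (1.0 - inbreeding), q * q + shift


hwe = genotype_proportions(0.01)
inbred = genotype_proportions(0.01, inbreeding=0.05)

for label, (aa, ab, bb) in (("F = 0   ", hwe), ("F = 0.05", inbred)):
    print(f"{label}: AA {aa:.4%}  AB {ab:.4%}  BB {bb:.4%}")
print(f"fold change in BB homozygotes: {inbred[2] / hwe[2]:.2f}")

# The allele frequency itself is unchanged: q = BB + AB / 2 in both cases.
q_after = inbred[2] + inbred[1] / 2
```

Under this form the BB class rises roughly sixfold while the heterozygote class falls by 5%, and the recovered allele frequency `q_after` is still 1%.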

FIGURE 5.3 The effect of inbreeding on genotype class proportions. Given a locus with two alleles, A and B, that are present in the population at frequencies of 99% and 1%, respectively, we expect the genotype classes to follow the Hardy-Weinberg principle such that AA homozygotes represent 98.01% of the population, AB heterozygotes represent 1.98% of the population, and BB homozygotes represent 0.01% of the population. When inbreeding is present in the population (F = 0.05 in this example), the proportions shift, leading to a lower-than-expected frequency of heterozygotes and an increase in the proportion of homozygous individuals. Notably, BB individuals increase sixfold in the population even as heterozygotes decrease, a counter-intuitive result. This re-partitioning effectively increases the number of homozygotes in the population without altering the underlying allele frequencies.


As mentioned earlier, the inbreeding coefficient is the probability of an individual having two alleles IBD at a given locus. Both consanguineous and founder populations can experience relatively high coefficients of inbreeding, resulting in an increased number of homozygotes observed for any particular allele. Genetic data suggest that inbreeding coefficients for Amish and Mennonite individuals are roughly 4.1% and 2.5%, respectively [22]. The significance of these estimates is that the amount of homozygosity in an average Amish genome is 4.1% (or ~120 megabases of DNA), and any recessive pathogenic mutations found in those homozygous regions will produce disease in the individual.

5.3 Disease gene mapping of autosomal recessive disorders

5.3.1 Homozygosity mapping in consanguineous pedigrees

Although there were some notable successes for gene mapping in the 1980s (e.g., Huntington’s disease [23] and cystic fibrosis [24]), the field of molecular genetics suffered from methods that were tedious, time-consuming, and expensive. However, in the late 1980s, polymerase chain reaction (PCR) was invented [25], and this breakthrough allowed scientists to copy DNA quickly and inexpensively as opposed to the more laborious method of cloning. In addition, new highly polymorphic genetic markers (microsatellites) were discovered, and they quickly supplanted the older restriction fragment length polymorphism (RFLP) markers. The new microsatellite markers were easy to use and, with multiple alleles at each marker locus, provided more information content than the older RFLP markers.

In 1987 Lander and Botstein published a method to map a disease to a chromosome using affected children from consanguineous marriages [26]. The method relied on homozygosity not only at the disease locus but also at nearby loci. The closer the relationship between the parents, the larger the region of homozygosity surrounding the disease gene. With a sufficiently dense map of genetic markers (which was soon to come), one could search the genome of affected children for homozygous regions. The position of the disease gene would be inferred by identifying the region of the genome where all affected individuals were homozygous (although not necessarily identically homozygous). This method was named homozygosity mapping and greatly contributed to the mapping and identification of genes for recessive conditions segregating in large multiplex pedigrees. In the special case of founder populations, the method relied on the assumption of genetic homogeneity (i.e., a single mutation) for rare recessive diseases.
This assumption means that all affected individuals will be identically homozygous surrounding the location of the disease gene. This intuitively simple idea is depicted in Fig. 5.4A. Since the disease is globally rare, it is assumed that the mutation was introduced into the founder group by a single individual, an ancestral carrier. Over the course of 12-15 generations, the chromosome carrying the mutation was passed down from parent to child, and genetic recombination during meiosis reduced the length of the ancestral haplotype. The deeper the pedigree, the smaller the shared homozygous block becomes in affected individuals. The individual ancestral haplotypes are different lengths in each patient due to their different recombinational histories and the randomness of recombination. Occasionally, a recombination event will occur very near the disease gene, and there will be little of the ancestral haplotype remaining, whereas other chromosomes will retain large portions of the ancestral haplotype.

FIGURE 5.4 Identity-by-descent and homozygosity mapping of recessive disorders. (A) Panel A depicts the special case of founder populations where genetic homogeneity (i.e., a single mutation) for rare recessive diseases is the most common finding. This means that all affected individuals in the population will be identically homozygous for the disease mutation and much of the surrounding DNA. At the top of the figure, two homologous chromosomes in a founder individual are represented such that one member of the pair (red shading) harbors the disease mutation, whereas the other is normal. This individual is the ancestral carrier. Over the span of 12-15 generations (estimated from Amish and Mennonite genealogies), the chromosome carrying the mutation is passed down from parent to child, but due to genetic recombination during meiosis, the original haplotype on which the mutation existed is not inherited in its original form. Over time, recombination whittles away the ancestral haplotype. The deeper the pedigree, that is, the more generations back to the ancestral carrier, the smaller the shared homozygous block will become in affected individuals as depicted at the bottom of the figure. The remaining fragments of the haplotype found in each affected child will be different lengths due to the randomness of recombination. The shared homozygous block found in all four affected children demonstrates identity-by-descent from the ancestral carrier. (B) Genetic mapping data generated with a single-nucleotide polymorphism (SNP) microarray for a disorder called posterior column ataxia with retinitis pigmentosa (AXPC1, MIM #609033) are summarized. The graph plots the maximum number of SNPs within shared homozygous blocks for eight patients. The data are arranged by chromosome and physical position based on the human reference sequence (hg19). On chromosome 1q32, there was a cluster of 21 SNPs that demonstrated homozygosity among all patients (arrow). The genotype data for the SNPs in this region are found in the embedded table. The genotypes demonstrate that all eight affected individuals are homozygous for the same alleles at each SNP in this region, suggesting identity-by-descent from a common ancestor (shaded rows). Therefore it was deduced that the disease gene resided somewhere between rs708726 and rs118002, the two SNPs that bracketed the homozygous block.

The publication of several dense genetic maps of the human genome in the early 1990s [27,28] enabled widespread gene mapping to occur. Genetic maps provided the necessary framework or backbone to permit diseases, and their corresponding genes, to be accurately localized to a specific region of a chromosome by virtue of their linkage to a genetic marker. These advancements (PCR, highly polymorphic markers, genetic maps) ushered in an era of robust gene mapping and identification. The use of homozygosity mapping was desirable in founder and consanguineous populations due to the relative analytical simplicity of the technique. While many researchers continued to use classical linkage analysis, as the density of genetic maps increased, it became easier to identify shared regions of homozygosity directly.

The Plain populations of Pennsylvania, Ohio, and Indiana were used to identify dozens of rare and/or novel genetic diseases and their causative genes. Notable examples included gene identification for Hirschsprung disease associated with Waardenburg syndrome type 4A (WS4A, MIM #277580) [29], progressive familial intrahepatic cholestasis also known as Byler disease (PFIC1, MIM #211600) [30], McKusick-Kaufman syndrome (MKKS, MIM #236700) [31], Amish nemaline myopathy (NEM5, MIM #605355) [32], sitosterolemia (STSL1, MIM #210250) [33], Amish lethal microcephaly due to thiamine metabolism dysfunction (THMD3, MIM #607196) [34], and Amish infantile epilepsy syndrome due to GM3 synthase deficiency (SPDRS, MIM #609056) [35], among many others. The two forms of dwarfism that originally intrigued Dr. McKusick in the early 1960s were also solved. Ellis-van Creveld syndrome was mapped to chromosome 4p16, and subsequently in 2000 a splice site mutation in the EVC gene was found in all Amish cases [36]. In 2003, the cause of cartilage-hair hypoplasia was identified as well, and it was found to be the same as the major mutation for CHH in the Finnish population [37].


In the early 2000s, a new technology became widely and commercially available that substantially increased the speed and ease of gene identification studies. Single-nucleotide polymorphisms (SNPs) were the most plentiful variants in the human genome. Unfortunately, they suffered from low information content (i.e., only two alleles at a locus), and they were hard to genotype in sufficient numbers to adequately cover the entire genome. However, the development of microarray technology, also known as SNP chips, allowed tens of thousands of SNPs to be genotyped at one time in a single sample. Suddenly, projects that had taken years to perform could now be completed in weeks. Fig. 5.4B displays genetic mapping data generated with a SNP microarray for a Mennonite disorder called posterior column ataxia with retinitis pigmentosa (AXPC1, MIM #609033). The graph plots the maximum number of SNPs within shared homozygous blocks, arranged by chromosome and physical position, for eight patients with this disorder. On chromosome 1q32, there was a cluster of 21 SNPs that demonstrated homozygosity among all patients. Notably, the genotype data showed that all eight affected individuals were homozygous for the same allele at each SNP marker in this region suggesting IBD from a common ancestor. 
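The search for a block of SNPs that is homozygous for the same allele in every patient can be sketched in a few lines of code. This is a minimal illustration, not the analysis pipeline used in the studies cited here. The function name, the two-letter genotype encoding ("AA", "AB", "BB"), the "NN" no-call code, and the toy data are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def shared_homozygous_runs(
    genotypes: Dict[str, List[str]], min_snps: int = 2
) -> List[Tuple[int, int]]:
    """Return (start, end) SNP index ranges, inclusive, in which every
    patient carries the same homozygous genotype.

    genotypes maps a patient ID to a list of calls ordered by physical
    position. Calls are hypothetical two-letter codes such as "AA" or "AB";
    "NN" is treated as a no-call that breaks a run.
    """
    calls_by_patient = list(genotypes.values())
    n_snps = len(calls_by_patient[0])

    shared = []
    for i in range(n_snps):
        calls = {calls[i] for calls in calls_by_patient}
        only = next(iter(calls))
        # Shared only if one genotype is seen and it is homozygous and called.
        shared.append(
            len(calls) == 1 and only[0] == only[1] and "N" not in only
        )

    runs, start = [], None
    for i, flag in enumerate(shared + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_snps:
                runs.append((start, i - 1))
            start = None
    return runs


# Toy data: three patients, five SNPs ordered along one chromosome.
toy = {
    "patient1": ["AB", "AA", "BB", "AA", "AB"],
    "patient2": ["AA", "AA", "BB", "AA", "BB"],
    "patient3": ["AA", "AA", "BB", "AA", "AB"],
}
runs = shared_homozygous_runs(toy)
```

In the toy data only SNPs 1 through 3 are homozygous for the same allele in all three patients, so a single run is reported. The two SNPs flanking such a run play the role of the bracketing markers described for the AXPC1 interval.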
The application of microarray technologies and other developments led to gene identification for many more rare recessive diseases in the Plain populations through SNP chip homozygosity mapping and candidate gene Sanger sequencing, including cortical dysplasia and focal epilepsy (MIM #610042) [38], polyhydramnios, megalencephaly, and symptomatic epilepsy (PMSE, MIM #611087) due to STRADA deficiency [39]; nephrocerebellar syndrome, also known as Galloway-Mowat syndrome type 1 (GAMOS1, MIM #251300) [40], hypertrophic cardiomyopathy associated with mitochondrial DNA depletion syndrome (MTDPS12B, MIM #615418) [41], multisystem autoimmune disease with facial dysmorphism (ADMFD, MIM #613385) due to ITCH deficiency [42], Usher syndrome type IIIB (USH3B, MIM #614504) [43], and CODAS syndrome (MIM #600373) [44]. As successful as these methods and technologies had been for disease gene mapping of recessive conditions, occasionally these studies failed to identify the associated disease gene. Commonly, this occurred when the homozygous regions linked to the disease were large and contained too many genes to efficiently PCR amplify and Sanger sequence all of them. The size of the region of IBD is heavily dependent on the age of the founder population. For Plain populations the number of generations from the founding is relatively small; so, homozygous blocks are remarkably large. For older founder groups such as the Ashkenazi Jewish or the Finnish populations, the homozygous blocks are relatively small due to the increased number of generations back to the founders. This phenomenon stems from genetic recombination which occurs each generation and serves to prune the size of haplotypes over time. This makes young founder populations excellent for mapping disease genes, even with a low-resolution genome screen, but makes gene identification much more difficult due to the large number of genes within the mapped region.
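The dependence of block size on population age can be made concrete with a back-of-the-envelope model. The sketch below assumes that crossovers fall along the chromosome as a Poisson process at one per Morgan per meiosis, so that a segment traced through m meioses has an expected length of about 2/m Morgans (200/m cM) around a fixed locus. It also assumes that both copies in an affected individual trace back to the founder through the same number of generations g, giving m = 2g. These are simplifying assumptions that ignore chromosome ends, crossover interference, and the further shrinkage that comes from requiring the block to be shared by several patients. The function name is hypothetical.

```python
def expected_homozygous_block_cm(generations_to_founder: int) -> float:
    """Approximate expected length (in cM) of the homozygous block around a
    founder mutation in one affected individual.

    Assumes both parental lineages reach the founder in the same number of
    generations, so the two haplotypes are separated by 2 * g meioses, and
    that the expected IBD segment around a locus is 200 / m cM for m meioses.
    """
    if generations_to_founder < 1:
        raise ValueError("generations_to_founder must be at least 1")
    meioses = 2 * generations_to_founder
    return 200.0 / meioses


young_population = expected_homozygous_block_cm(12)   # e.g. a young founder group
older_population = expected_homozygous_block_cm(50)   # e.g. a much older founder group
```

Under these assumptions the expected block is 100/g cM, which gives roughly 7 to 8 cM for the 12–15 generations estimated for the Plain populations and only about 2 cM at 50 generations, consistent with the qualitative contrast drawn above between young and old founder groups.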

5.3.2 Genomic sequencing of rare recessive disorders

The development of next-generation sequencing (NGS) technologies and exome capture methods (see Chapter 4: Genomic Sequencing of Rare Diseases) in the 2000s and their widespread application in the last decade revolutionized the study of genetic disorders and allowed revisiting mapped genetic diseases that could not be definitively linked to a particular gene through prior approaches. As an example, WES on DNA samples from seven separate stalled homozygosity mapping studies in Amish and Mennonite families unveiled the underlying disease-causing variants for all seven


diseases [43]. The number of genes in each mapped interval ranged from 22 to 187, which precluded disease gene identification via candidate gene Sanger sequencing. One of these disorders, chorioretinopathy with microcephaly (MCCRP1, MIM #251270), was first described in a large Mennonite family by Dr. McKusick in 1966 [45]. The development of NGS technologies seemed to obviate the need for genetic mapping studies, as it became possible to study small families with few affected individuals (or even single patients) using WES and confidently identify the genetic cause of disease. Although it may seem that most recessive genetic disorders would have been recognized to date, the application of WES analyses in Amish and Mennonite families has continued to identify novel gene-disease associations. Some novel genetic syndromes include trichohepatoneurodevelopmental syndrome (THNS, MIM #618268), characterized by developmental delay, liver dysfunction, pruritus, dysmorphic features, and woolly coarse hair due to biallelic pathogenic variants in CCDC47 [46], which encodes a poorly characterized protein involved in calcium regulation in the endoplasmic reticulum; and a novel multisystemic disorder caused by a homozygous founder mutation (c.499C>A; p.Pro167Thr) that affects the catalytic domain of the tyrosyl-tRNA synthetase, YARS1, in the Amish population [47]. Even in the era of large-scale genomics, many of the attributes that make founder populations desirable for genetic research still remain important [42]. Large family sizes continue to be useful and informative, facilitating segregation analyses of candidate variants identified by genomic sequencing and enabling the identification of disease genes and their causative variants. Most WES studies focus on trio analyses, which involve sequencing and analyzing genomic data for the proband and parents only. While this approach is extremely powerful for phasing alleles and assigning parent-of-origin transmission (Fig. 5.5), the application of family-based analyses, which use DNA samples from affected and unaffected siblings in the family, helps to identify the disease-causing variant in the proband faster and more accurately. For example, the assumption of homozygosity for recessive disorders in founder populations can be leveraged well by using data from the unaffected siblings in a family. Genotypes for the siblings provide further evidence to either confirm or exclude any potential pathogenic homozygous variant. If one or more unaffected siblings are also homozygous for a candidate variant, then the variant can be excluded from further analysis. Family-based genomic analyses can decrease the average number of filtered candidate variants, reducing the time and effort needed to arrive at the correct diagnosis [48].
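The sibling-based exclusion rule described above can be sketched as a simple filter. This is an illustrative example under stated assumptions, not the method of the cited study: it assumes a fully penetrant recessive disorder caused by a homozygous variant, and the record layout, genotype labels ("hom_ref", "het", "hom_alt"), sample names, and function name are all hypothetical.

```python
from typing import Dict, List, Sequence


def filter_recessive_candidates(
    variants: List[Dict],
    proband: str,
    affected_sibs: Sequence[str] = (),
    unaffected_sibs: Sequence[str] = (),
) -> List[Dict]:
    """Keep variants consistent with a fully penetrant homozygous recessive
    model: homozygous in the proband and every affected sibling, and not
    homozygous in any unaffected sibling."""
    kept = []
    for variant in variants:
        gt = variant["genotypes"]
        if gt[proband] != "hom_alt":
            continue  # proband must be homozygous for the alternate allele
        if any(gt[s] != "hom_alt" for s in affected_sibs):
            continue  # affected siblings must share the homozygous genotype
        if any(gt[s] == "hom_alt" for s in unaffected_sibs):
            continue  # a homozygous unaffected sibling excludes the variant
        kept.append(variant)
    return kept


# Toy family: proband, one affected sibling, one unaffected sibling.
example_variants = [
    {"id": "v1", "genotypes": {"proband": "hom_alt", "aff_sib": "hom_alt", "unaff_sib": "het"}},
    {"id": "v2", "genotypes": {"proband": "hom_alt", "aff_sib": "hom_alt", "unaff_sib": "hom_alt"}},
    {"id": "v3", "genotypes": {"proband": "hom_alt", "aff_sib": "het", "unaff_sib": "hom_ref"}},
    {"id": "v4", "genotypes": {"proband": "het", "aff_sib": "het", "unaff_sib": "het"}},
]
kept = filter_recessive_candidates(
    example_variants, "proband", affected_sibs=["aff_sib"], unaffected_sibs=["unaff_sib"]
)
```

Only v1 survives in the toy family: v2 is excluded by the homozygous unaffected sibling, v3 by the heterozygous affected sibling, and v4 because the proband is not homozygous. A real analysis would also need to allow for reduced penetrance, genotyping error, and compound heterozygosity, which this sketch ignores.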

5.3.3 Genomics of founder populations

Due to the founder effect, many variants have drifted to allele frequencies that are significantly different from those in available population controls. Therefore it is important to genetically study and sample individuals from these populations to create population-specific genomic databases. These databases are necessary to provide accurate population-specific allele frequencies for variants to enable appropriate filtering during analyses of genomic data for patients with rare diseases from these populations. Variants with frequencies that are too high to be reasonably considered pathogenic can be more confidently excluded. Simultaneously, these databases enable the identification of healthy control individuals who are homozygous for rare alleles in the population, which marks these variants as unlikely to be pathogenic. This is especially true for founder populations where underlying levels of genome-wide homozygosity are high due to shared common ancestry, but also for consanguineous populations that will have large genome-wide homozygous regions due


FIGURE 5.5 Identification and confirmation of homozygous variants using whole-exome sequence data for autosomal recessive disorders. Pedigree and sequence reads for a family with two children affected by an autosomal recessive condition. The affected female proband is denoted by the full black circle, affected male sibling is denoted as a full black square; unaffected parents and unaffected male sibling are denoted by empty symbols. WES sequence reads over the regions where two autosomal recessive variants were identified as shared between both affected siblings. Reference bases are shown in gray denoting matching sequence with the human reference. The positions for the two identified homozygous variants are highlighted showing the change from a “C” nucleotide in the reference to a “T” nucleotide in red for the variant in the SLC26A2 gene; the change from “G” to “A” in the gene SH3TC2 is also shown. The upper tracks above the sequence reads represent the coverage for each of the bases in the region depicted and show that both affected siblings have all reads (100%) across the variant positions calling the alternate alleles, whereas the heterozygous parents and unaffected sibling show the presence of both the reference and variant alleles (approximately 50:50). These two children were found to be homozygous for a known pathogenic missense variant in SLC26A2 (c.835C>T; p.R279W) causing diastrophic dysplasia and a pathogenic nonsense variant in SH3TC2 (c.2860C>T; p.R954X) known to cause Charcot-Marie-Tooth disease 4C, as reported in Strauss et al., 2018 [48].

to inbreeding. Furthermore, in cases of pathogenic variants, having a catalog of disease-associated variation in these populations enables rapid, early, and accurate diagnoses that may improve patient outcomes due to informed clinical management and early interventions. The explosion of population-level genomic data for founder and consanguineous populations has also highlighted new avenues for research beyond disease gene identification. Owing to the presence of drifted alleles, founder populations have provided new insights into variant interpretation and classification which could not have been accomplished previously. The presence of large numbers of heterozygotes for a globally rare allele affords the unusual ability to study the phenotypic impacts


of these alleles at unprecedented resolution. For example, an allele in the KCNQ1 gene (c.671C>T, p.Thr224Met) was previously identified in two patients with long QT syndrome, but was still classified as a “variant of unknown significance (VUS)”. The variant is extremely rare in the general population (1 heterozygote in 112,482 European control individuals), but relatively common in the Amish (1 in 45) [49]. The high prevalence of this allele allowed researchers to identify 124 Amish heterozygotes and to perform detailed EKG studies on 88 carriers. The variant was shown to be associated with a 20 ms longer QT interval than the normal allele. This information allowed the variant to be reclassified as a “pathogenic” variant and led to culturally appropriate return of results, including recommendations for treatment and cascade testing. The pursuit to understand the function of each human gene has been most strongly advanced by the study of Mendelian diseases. The LoF of a gene in a diseased individual informs researchers about the normal function of that gene. A related and complementary method is to identify healthy individuals who are homozygous for rare LoF alleles (so-called “human knockouts”) and determine whether there is an associated phenotype with the natural absence of that gene [44]. In these studies, founder populations, and especially populations with a high level of consanguinity, have helped to elucidate the function of genes that contribute more subtly to human phenotypic variation [50]. It is particularly important to acknowledge that many of these phenotypic effects are actually medically beneficial, such as the finding of LoF variants in LPA in the Finnish population which confer protection against cardiovascular disease or the identification of a CCR5 homozygous LoF variant conferring resistance to HIV infection [51,52].
These studies enable a richer understanding of the genetic contribution to phenotypic variation in humans without relying solely on a classical disease model.
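The scale of allele frequency drift in this section can be illustrated with the KCNQ1 figures quoted above. The sketch below converts a carrier (heterozygote) frequency into an allele frequency under Hardy-Weinberg proportions, which is an idealization that founder populations only approximate. Treating "1 in 45" as a carrier frequency is an assumption made for the example, and the function name is hypothetical.

```python
import math


def allele_freq_from_carrier_freq(carrier_freq: float) -> float:
    """Solve 2q(1 - q) = carrier_freq for the minor allele frequency q,
    assuming Hardy-Weinberg proportions."""
    if not 0.0 <= carrier_freq <= 0.5:
        raise ValueError("carrier_freq must be between 0 and 0.5")
    return (1.0 - math.sqrt(1.0 - 2.0 * carrier_freq)) / 2.0


# Figures quoted in the text; reading "1 in 45" as a carrier frequency is an assumption.
q_amish = allele_freq_from_carrier_freq(1 / 45)
q_european = allele_freq_from_carrier_freq(1 / 112_482)

fold_enrichment = q_amish / q_european
expected_homozygote_freq_amish = q_amish ** 2
```

Under these assumptions the allele is enriched by roughly three orders of magnitude in the Amish relative to the European controls. This is why a frequency cutoff taken from general population databases can mislead in either direction when applied to a founder population, and why population-specific frequencies are needed for filtering.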

5.4 Outlook

The progress over the past four decades in disease gene identification has been remarkable and has provided us with invaluable insights into human diseases and health. The role of recessive genetic disorders in this quest has been significant, leveraging characteristics of the populations of origin to facilitate disease gene mapping and identification. The implementation of large-scale genomic sequencing to study founder and consanguineous populations continues to derive insights applicable to patients and populations around the world. The contributions of founder and consanguineous populations to the field of human and medical genetics cannot be overstated. Since Dr. McKusick’s first trip to Lancaster County, the promise and utility of founder populations has garnered intense interest. From the first clinical descriptions of rare disease, through genetic mapping and disease gene identification, to WES, whole-genome sequencing, and rare human knockouts, these special populations have continually remained relevant and at the cutting edge of genetics.

References

[1] Morton DH, Bennett MJ, Seargeant LE, Nichter CA, Kelley RI. Glutaric aciduria type I: a common cause of episodic encephalopathy and spastic paralysis in the Amish of Lancaster County, Pennsylvania. Am J Med Genet 1991;41:89–95. Available from: https://doi.org/10.1002/ajmg.1320410122.


[2] Biery BJ, Stein DE, Morton DH, Goodman SI. Gene structure and mutations of glutaryl-coenzyme A dehydrogenase: impaired association of enzyme subunits that is due to an A421V substitution causes glutaric acidemia type I in the Amish. Am J Hum Genet 1996;59:1006–11.
[3] Bener A, Mohammad RR. Global distribution of consanguinity and their impact on complex diseases: genetic disorders from an endogamous population. Egypt J Med Hum Genet 2017;18:315–20. Available from: https://doi.org/10.1016/j.ejmhg.2017.01.002.
[4] Hamamy H, et al. Consanguineous marriages, pearls and perils: Geneva International Consanguinity Workshop Report. Genet Med 2011;13:841–7. Available from: https://doi.org/10.1097/GIM.0b013e318217477f.
[5] Bittles AH, Black ML. Consanguinity, human evolution, and complex diseases. Proc Natl Acad Sci U S A 2010. Available from: https://doi.org/10.1073/pnas.0906079106.
[6] Alkuraya FS. The application of next-generation sequencing in the autozygosity mapping of human recessive diseases. Hum Genet 2013;132(11):1197–211. Available from: https://doi.org/10.1007/s00439-013-1344-x.
[7] Alazami AM, et al. Accelerating novel candidate gene discovery in neurogenetic disorders via whole-exome sequencing of prescreened multiplex consanguineous families. Cell Rep 2015;10(2):148–61. Available from: https://doi.org/10.1016/j.celrep.2014.12.015.
[8] Maddirevula S, et al. Autozygome and high throughput confirmation of disease genes candidacy. Genet Med 2019;21(3):736–42. Available from: https://doi.org/10.1038/s41436-018-0138-x.
[9] Strauss KA, et al. Glutaric acidemia type 1: Treatment and outcome of 168 patients over three decades. Mol Genet Metab 2020;131:325–40. Available from: https://doi.org/10.1016/j.ymgme.2020.09.007.
[10] McKusick V, Eldridge R, Hostetler J, Ruangwit U, Egeland J. Dwarfism in the Amish. II. Cartilage-hair hypoplasia. Bull Johns Hopkins Hosp 1965;116:285–326.
[11] McKusick VA, Egeland JA, Eldridge R, Krusen DE. Dwarfism in the Amish I. The Ellis-Van Creveld Syndrome. Bull Johns Hopkins Hosp 1964;115:306–36.
[12] Mäkitie O, Sulisalo T, de La Chapelle A, Kaitila I. Cartilage-hair hypoplasia. J Med Genet 1995;32:39–43. Available from: https://doi.org/10.1136/jmg.32.1.39.
[13] Peltonen L, Jalanko A, Varilo T. Molecular genetics of the Finnish disease heritage. Hum Mol Genet 1999;8:1913–23. Available from: https://doi.org/10.1093/hmg/8.10.1913.
[14] Ferreira JCP, et al. Carrier testing for Ashkenazi Jewish disorders in the prenatal setting: navigating the genetic maze. Am J Obstet Gynecol 2014;211:197–204. Available from: https://doi.org/10.1016/j.ajog.2014.02.001.
[15] Sunna Ebenesersdóttir S, et al. Ancient genomes from Iceland reveal the making of a human population. Science 2018;360. Available from: https://doi.org/10.1126/science.aar2625.
[16] Kim CY, et al. Involuntary thumb flexion on neurological examination: an unusual form of upper limb dystonia in the Faroe Islands. Tremor Other Hyperkinetic Mov 2019;20:9. Available from: https://doi.org/10.7916/tohm.v0.686.
[17] Lerat J, et al. High prevalence of congenital deafness on Reunion Island is due to a founder variant of LHFPL5. Clin Genet 2019;95:177–81. Available from: https://doi.org/10.1111/cge.13460.
[18] Gonzaga-Jauregui C, Gamble CN, Yuan B, Penney S, Jhangiani S, Muzny DM, et al. Mutations in COL27A1 cause Steel syndrome and suggest a founder mutation effect in the Puerto Rican population. Eur J Hum Genet 2015;23(3):342–6. Available from: https://doi.org/10.1038/ejhg.2014.107.
[19] Gonzaga-Jauregui C, et al. Functional biology of the Steel syndrome founder allele and evidence for clan genomics derivation of COL27A1 pathogenic alleles worldwide. Eur J Hum Genet 2020;28(9):1243–64. Available from: https://doi.org/10.1038/s41431-020-0632-x.
[20] PopG, https://evolution.gs.washington.edu/popgen/popg.html.
[21] Wright S. Systems of mating. II. The effects of inbreeding on the genetic composition of a population. Genetics 1921;6:124–43.


[22] Strauss KA, Puffenberger EG. Genetics, medicine, and the plain people. Annu Rev Genomics Hum Genet 2009;10.
[23] Gusella JF, et al. A polymorphic DNA marker genetically linked to Huntington’s disease. Nature 1983;306:234–8. Available from: https://doi.org/10.1038/306234a0.
[24] Knowlton RG, et al. A polymorphic DNA marker linked to cystic fibrosis is located on chromosome 7. Nature 1985;318:380–2. Available from: https://doi.org/10.1038/318380a0.
[25] Mullis K, et al. Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction. Cold Spring Harb Symposia Quant Biol 1986;51:263–73. Available from: https://doi.org/10.1101/sqb.1986.051.01.032.
[26] Lander ES, Botstein D. Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science 1987;236:1567–70. Available from: https://doi.org/10.1126/science.2884728.
[27] Murray JC, et al. A comprehensive human linkage map with centimorgan density. Cooperative Human Linkage Center (CHLC). Science 1994;265:2049–54. Available from: https://doi.org/10.1126/science.8091227.
[28] Weissenbach J, et al. A second-generation linkage map of the human genome. Nature 1992. Available from: https://doi.org/10.1038/359794a0.
[29] Puffenberger EG, et al. A missense mutation of the endothelin-B receptor gene in multigenic Hirschsprung’s disease. Cell 1994;79:1257–66. Available from: https://doi.org/10.1016/0092-8674(94)90016-7.
[30] Bull LN, et al. A gene encoding a P-type ATPase mutated in two forms of hereditary cholestasis. Nat Genet 1998;18:219–24. Available from: https://doi.org/10.1038/ng0398-219.
[31] Stone DL, et al. Mutation of a gene encoding a putative chaperonin causes McKusick-Kaufman syndrome. Nat Genet 2000;25:79–82. Available from: https://doi.org/10.1038/75637.
[32] Johnston JJ, et al. A novel nemaline myopathy in the Amish caused by a mutation in troponin T1. Am J Hum Genet 2000;67:814–21. Available from: https://doi.org/10.1086/303089.
[33] Berge KE, et al. Accumulation of dietary cholesterol in sitosterolemia caused by mutations in adjacent ABC transporters. Science 2000;290:1771–5. Available from: https://doi.org/10.1126/science.290.5497.1771.
[34] Rosenberg MJ, et al. Mutant deoxynucleotide carrier is associated with congenital microcephaly. Nat Genet 2002;32:175–9. Available from: https://doi.org/10.1038/ng948.
[35] Simpson MA, et al. Infantile-onset symptomatic epilepsy syndrome caused by a homozygous loss-of-function mutation of GM3 synthase. Nat Genet 2004;36:1225–9. Available from: https://doi.org/10.1038/ng1460.
[36] Ruiz-Perez VL, et al. Mutations in a new gene in Ellis-van Creveld syndrome and Weyers acrodental dysostosis. Nat Genet 2000;24:283–6. Available from: https://doi.org/10.1038/73508.
[37] Ridanpää M, Jain P, McKusick VA, Francomano CA, Kaitila I. The major mutation in the RMRP gene causing CHH among the Amish is the same as that found in most Finnish cases. Am J Med Genet C Semin Med Genet 2003;121C:81–3. Available from: https://doi.org/10.1002/ajmg.c.20006.
[38] Strauss KA, et al. Recessive symptomatic focal epilepsy and mutant contactin-associated protein-like 2. N Engl J Med 2006;354:1370–7. Available from: https://doi.org/10.1056/nejmoa052773.
[39] Puffenberger EG, et al. Polyhydramnios, megalencephaly and symptomatic epilepsy caused by a homozygous 7-kilobase deletion in LYK5. Brain 2007;130.
[40] Jinks RN, et al. Recessive nephrocerebellar syndrome on the Galloway-Mowat syndrome spectrum is caused by homozygous protein-truncating mutations of WDR73. Brain 2015;138:2173–90.
[41] Strauss KA, et al. Severity of cardiomyopathy associated with adenine nucleotide translocator-1 deficiency correlates with mtDNA haplogroup. Proc Natl Acad Sci U S A 2013;110.
[42] Lohr NJ, et al. Human ITCH E3 ubiquitin ligase deficiency causes syndromic multisystem autoimmune disease. Am J Hum Genet 2010;86:447–53.
[43] Puffenberger EG, et al. Genetic mapping and exome sequencing identify variants associated with five novel diseases. PLoS One 2012;7:e28936. Available from: https://doi.org/10.1371/journal.pone.0028936.


[44] Strauss KA, et al. CODAS syndrome is associated with mutations of LONP1, encoding mitochondrial AAA+ Lon protease. Am J Hum Genet 2015;96:121–35.
[45] McKusick VA, Stauffer M, Knox DL, Clark DB. Chorioretinopathy with hereditary microcephaly. Arch Ophthalmol 1966;75:597–600.
[46] Morimoto M, Waller-Evans H, Ammous Z, Song X, Strauss KA, Pehlivan D, et al. Bi-allelic CCDC47 variants cause a disorder characterized by woolly hair, liver dysfunction, dysmorphic features, and global developmental delay. Am J Hum Genet 2018;103(5):794–807. Available from: https://doi.org/10.1016/j.ajhg.2018.09.014.
[47] Williams KB, Brigatti KW, Puffenberger EG, Gonzaga-Jauregui C, Griffin LB, et al. Homozygosity for a mutation affecting the catalytic domain of tyrosyl-tRNA synthetase (YARS) causes multisystem disease. Hum Mol Genet 2019;28(4):525–38. Available from: https://doi.org/10.1093/hmg/ddy344.
[48] Strauss KA, et al. Genomic diagnostics within a medically underserved population: efficacy and implications. Genet Med 2018;20:31–41. Available from: https://doi.org/10.1038/gim.2017.76.
[49] Streeten EA, et al. KCNQ1 and long QT syndrome in 1/45 Amish: the road from identification to implementation of culturally appropriate precision medicine. Circulation: Genomic Precis Med 2020;13. Available from: https://doi.org/10.1161/circgen.120.003133.
[50] Alsalem AB, Halees AS, Anazi S, Alshamekh S, Alkuraya FS. Autozygome sequencing expands the horizon of human knockout research and provides novel insights into human phenotypic variation. PLoS Genet 2013;9:e1004030. Available from: https://doi.org/10.1371/journal.pgen.1004030.
[51] Lim ET, et al. Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS Genet 2014;10:e1004494. Available from: https://doi.org/10.1371/journal.pgen.1004494.
[52] Liu R, et al. Homozygous defect in HIV-1 coreceptor accounts for resistance of some multiply-exposed individuals to HIV-1 infection. Cell 1996;86:367–77. Available from: https://doi.org/10.1016/S0092-8674(00)80110-5.


CHAPTER 6

Dominant and sporadic de novo disorders

Claudia Gonzaga-Jauregui1, Lauretta El Hayek2 and Maria Chahrour3

1International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano (LIIGH), Universidad Nacional Autónoma de México (UNAM), Mexico
2Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, United States
3Eugene McDermott Center for Human Growth and Development, Department of Neuroscience, Center for the Genetics of Host Defense, Department of Psychiatry, Peter O’Donnell Jr. Brain Institute, University of Texas Southwestern Medical Center, Dallas, TX, United States

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00010-7 © 2021 Elsevier Inc. All rights reserved.

6.1 Introduction

The observation that inherited traits were seen in families and transmitted from generation to generation laid out the clues for the systematic study of family pedigrees to identify the molecular defects and genes associated with rare genetic disorders. Historically, dominant disorders were “easier” to study through the identification of families with several individuals across multiple generations affected by these disorders. Prior to the implementation of current genomic sequencing approaches, methodologies such as linkage analysis were developed and successfully implemented to study these large pedigrees segregating dominant traits, identifying some of the first genes and rare pathogenic alleles associated with human disease traits. A classic example of the systematic genetic and molecular study of autosomal dominant disorders is Huntington disease (HD, MIM #143100). Huntington disease is a neurodegenerative disorder characterized by loss of motor control and a progressive decline in mental abilities. It is an autosomal dominant disease caused by a CAG trinucleotide repeat expansion in the HD-associated gene, huntingtin (HTT), leading to a gain-of-function (GoF) dominant negative form of the protein. Individuals with 6–35 CAG repeats are unaffected, those with 36–39 repeats are at risk of developing HD symptoms, and those with 40 or more repeats display the HD phenotype [1–3]. Individuals with an intermediate CAG repeat length (between 36 and 39) show reduced penetrance and symptoms tend to be delayed until later in life [4]. The HD gene was first mapped to chromosome 4 in 1983 by Gusella and colleagues using polymorphic DNA markers [5]. Samples from an American family affected with HD and a large family from Venezuela with many individuals affected across multiple generations were studied using linkage analysis to identify DNA markers associated with the putative HD gene (see Chapter 10: Statistical Approaches to Rare Disease Analyses). The DNA marker G8 showed a distinctive restriction fragment length polymorphism pattern in HD patients’ DNA following HindIII restriction enzyme digestion and Southern blot analysis. Using human-mouse somatic cell hybrids, the authors then mapped the G8 probe to


human chromosome 4, close to where HTT is located [5]. It was not until 1993 that HTT was identified as the gene that, when altered by these repeat expansions, causes HD [1]. In the last decade the development and application of genomic sequencing technologies in genetics research and in the clinic has revolutionized the discovery of disease-associated genes and variants and the diagnosis of patients with rare genetic disorders. This chapter reviews and discusses autosomal dominant and sporadic disorders with a focus on the genomic approaches, disease mechanisms, and considerations associated with these classes of disorders, such as incomplete penetrance and de novo mutation (DNM).
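The CAG repeat ranges given in this introduction for HD can be written as a small lookup, shown here purely as an illustration of the thresholds in the text. The function names and category labels are hypothetical, counting the longest uninterrupted run of CAG in a sequence string is a simplification, and this sketch is not a description of how repeat length is measured in a diagnostic laboratory.

```python
import re


def longest_cag_run(sequence: str) -> int:
    """Return the number of CAG units in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_htt_repeat_count(n_repeats: int) -> str:
    """Map a CAG repeat count to the ranges described in the text:
    6-35 unaffected, 36-39 reduced penetrance, 40 or more full penetrance."""
    if n_repeats >= 40:
        return "full penetrance"
    if n_repeats >= 36:
        return "reduced penetrance"
    if n_repeats >= 6:
        return "normal range"
    return "below the range described"


# Toy sequence: a 38-unit CAG tract followed by an interruption and one more CAG.
toy_sequence = "AT" + "CAG" * 38 + "CAACAG"
toy_count = longest_cag_run(toy_sequence)
toy_class = classify_htt_repeat_count(toy_count)
```

The toy tract of 38 repeats falls in the intermediate range that the text associates with reduced penetrance and later onset.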

6.2 Autosomal dominant disorders

The human genome is composed of 23 pairs of chromosomes. Twenty-two of these pairs are known as autosomes, whereas the last pair is composed of the sex chromosomes, X and Y, which determine the biological sex of a given individual (see Chapter 1: Introduction to Concepts of Genetics and Genomics and Chapter 2: Karyotyping as the First Genomic Approach). Every person inherits 22 autosomes and a sex chromosome from each of their parents, and so generally, for each gene in a given chromosome, there are two copies, one that was inherited from the mother and one from the father. For a particular disorder, autosomal dominant inheritance means that one of the two copies (monoallelic) of a given gene on autosomal chromosomes has a defect that causes disease (Fig. 6.1). An individual affected by an autosomal dominant disorder will be heterozygous, having a normal allele and a defective one. Additionally, with each pregnancy, there is a 50% chance that a child born from an affected parent will also have the disorder, inheriting the defective copy of the gene from the affected parent and one normal copy from the unaffected parent. In some cases, the autosomal dominant disorder is caused by a DNM in the germ cells of one of the parents or very early on in the development of the embryo, so the disease is not considered to be inherited from the parents and is instead considered a sporadic disorder. Many disorders affecting various biological pathways in different organs follow autosomal dominant inheritance.
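The 50% transmission risk per pregnancy translates directly into binomial probabilities for a sibship. The sketch below is a worked illustration that assumes one heterozygous affected parent, complete penetrance, and independent transmission at each pregnancy; the function name is hypothetical.

```python
from math import comb


def prob_k_affected(n_children: int, k_affected: int, p_transmit: float = 0.5) -> float:
    """Binomial probability that exactly k of n children inherit the
    disease-associated allele from a heterozygous affected parent."""
    if not 0 <= k_affected <= n_children:
        raise ValueError("k_affected must be between 0 and n_children")
    return (
        comb(n_children, k_affected)
        * p_transmit ** k_affected
        * (1 - p_transmit) ** (n_children - k_affected)
    )


# Probability that none of four children is affected, and that at least one is.
p_none = prob_k_affected(4, 0)
p_at_least_one = 1 - p_none
```

With four children the chance that none inherits the allele is 1 in 16, so unaffected sibships are expected in some families even for a fully penetrant dominant disorder.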

6.2.1 Mechanisms of dominant disease

There are three major mechanisms by which a defective copy of a given gene can cause an autosomal dominant disorder, namely haploinsufficiency, dominant negative, or gain-of-function effects (Fig. 6.2). These different mechanisms of disease and some examples of autosomal dominant disorders that arise from them are discussed next.

• Haploinsufficiency is a mechanism of disease for autosomal dominant disorders that occurs when a mutation or pathogenic variant in one copy of the gene results in the loss-of-function (LoF) of that allele, generally through the introduction of a nonsense (stop-gain) variant that translates into an early/premature termination codon (PTC); or a frameshift indel (insertion/deletion) variant that results in the shifting of the reading frame generally followed by the introduction of a PTC that results in the early termination of the gene transcript and its subsequent degradation through the mechanism of nonsense-mediated decay (NMD) [6]. The degradation of the defective copy of the gene leads to an approximate 50% decrease of the total protein produced, mostly


FIGURE 6.1 Autosomal dominant modes of inheritance are illustrated where one parent is affected by a dominant disorder resulting in 50% of the children inheriting the disease associated variant and being affected. The red bars indicate locations of the disease associated variant.

coming from the normal allele but unable to function in full capacity required by the different biological processes in which the protein is needed and consequently causing perturbed biological homeostasis and disease. One example of a disorder caused by haploinsufficiency is Dravet syndrome (MIM #607208), a severe childhood epilepsy, often caused by mutations resulting in loss of one copy of the gene SCN1A [7]. Another example is a recently described disorder characterized by short stature, facial dysmorphism, skeletal anomalies, and variable cardiac anomalies (SSFSC, MIM #617877) caused by heterozygous LoF variants involving the gene BMP2. Haploinsufficiency for this gene was shown to be the mechanism causing this disorder by recapitulating some of the features observed in human patients in a mouse model that lacked one copy of the mouse Bmp2 gene [8]. In some instances, missense variants in genes causing autosomal dominant disorders do not result in complete loss of protein function but they can reduce the normal function of the affected


FIGURE 6.2 The three major mechanisms of disease in autosomal dominant disorders are illustrated: haploinsufficiency, dominant negative, and gain-of-function effects, where the mutated copy of a gene leads to no protein, a toxic protein, or an excessive or new protein function, respectively.

proteins, for example, by reducing the catalytic activity of enzymes or altering their structure such that they cannot properly interact or perform their normal functions. These variants that significantly reduce the function of the mutant protein are said to be hypomorphic, and although less severe than LoF variants, they can also cause dominant diseases when the affected protein levels or activity are below the threshold needed for proper cellular functioning, similarly to haploinsufficiency. For some genes causing autosomal dominant disorders due to haploinsufficiency, the homozygous state for pathogenic LoF variants may be incompatible with life resulting in embryonic or neonatal lethality, whereas homozygous hypomorphic variants in haploinsufficient genes may be


more permissive yet display a more severe phenotype than heterozygotes. For example, heterozygous LoF variants in the low-density lipoprotein receptor gene (LDLR) are the most common cause of familial hypercholesterolemia (FHCL1, MIM #143890) resulting in high levels of circulating plasma LDL-cholesterol (350–550 mg/dL) that can build up in tissues such as tendons causing xanthomas, the skin (xanthelasma), or the coronary arteries resulting in atherosclerosis [9]. Heterozygous carriers of LoF variants in LDLR are at increased risk of cardiovascular disease in their fourth or fifth decade of life. However, in some cases and populations, it is possible to find individuals homozygous for LoF variants in LDLR. These individuals display a much more severe phenotype than heterozygotes, with very high LDL-cholesterol levels in plasma (650–1000 mg/dL) and risk of myocardial infarctions and other cardiovascular events in their 20s. Interestingly, some genes that had been historically associated with autosomal dominant disorders have been recently found through genomic sequencing to lead to recessive disorders with a different clinical presentation than the known autosomal dominant diseases they were first identified in. Examples include genes encoding aminoacyl-tRNA synthetases (aaRS), the enzymes responsible for attaching specific amino acids to their corresponding cognate tRNA (transfer RNA) to be used during protein translation. Previously, mutations in several aaRS genes (including YARS, GARS, HARS, MARS, and AARS) had been associated with autosomal dominant forms of Charcot-Marie-Tooth disease (CMT) [10]. Although the mechanism by which disease associated variants in these aaRS genes cause CMT is still not completely understood, functional characterization of some of the variants has shown that these are LoF or hypomorphic alleles affecting the encoded proteins and possibly proper protein translation in the axons of peripheral neurons [11–13].
In the past few years, patients with biallelic LoF or hypomorphic pathogenic variants in aaRS genes have been reported presenting with autosomal recessive disorders different from CMT with more severe clinical manifestations including multisystemic and neurological involvement [14–17]. • Dominant negative effects occur when instead of the defective copy of the gene leading to LoF and reduced or half of the functionality required of the corresponding protein, the genetic mutation or pathogenic variant causes the defective copy of the gene to produce an abnormal protein that has a negative or “toxic” effect in the cell. The effects of dominant negative alleles are more prevalent in proteins that function by forming complexes either with themselves or with other proteins because an abnormal version of the protein would render the whole complex unable to function properly. Alternatively, some pathogenic variants may exert their dominant negative effects by preventing proper protein folding, secretion, and consequent function. Misfolded proteins can also accumulate in the endoplasmic reticulum (ER) leading to ER stress and upregulation of the unfolded protein response (UPR), which can ultimately result in apoptosis of the affected cells [18]. Prolonged ER stress and activation of the UPR have been shown to play an important role in neurodegenerative [19], cardiovascular [20], and many other diseases [18]. Notable examples of autosomal dominant disorders caused by dominant negative effects are type II collagenopathies. Collagens are the most abundant proteins of the extracellular matrix in vertebrates [21]. Collagen secretion and deposition in the growing bones during development is important for proper skeletal formation. There are 43 genes that encode collagen proteins, but the most abundant collagens are divided into three types. Type I collagen is present in bone, skin, and blood vessels.
Type II is present in cartilage and type III is found in skin, vessels, and uterus alongside type I collagen [22]. The majority of collagen proteins function as homotrimers or heterotrimers where three proteins of the same or different types of collagen assemble together to form


collagen fibers in tissues. COL2A1 codes for the alpha chain of type II collagen, which is the major type of collagen expressed in chondrocytes (cartilage forming cells). Mutations in this gene, specifically substitutions of glycine (Gly) residues in the Gly-X-Y repeats (where X and Y can be any amino acid), that damage the collagen triple helix, act as dominant negative causing a spectrum of genetic disorders known as type II collagenopathies [23] ranging from hypochondrogenesis (MIM #200610), a very severe form of dwarfism, to osteoarthritis with mild chondrodysplasia (OSCDP, MIM #604864) presenting as joint stiffness and osteoarthritis in adolescence or early adulthood [24]. • Gain of function (GoF) is a third mechanism by which pathogenic genetic variants can cause autosomal dominant disorders. Contrary to haploinsufficiency, a GoF mutation or variant causes the defective copy of the gene to perform its function in an excessive or unregulated manner, consequently causing the disease. A classic example of an autosomal dominant disorder caused by this mechanism is achondroplasia, the most common form of dwarfism. Achondroplasia (MIM #100800) is caused by GoF variants in the fibroblast growth factor receptor 3 gene (FGFR3), that disrupt endochondral bone formation leading to short stature and other skeletal dysmorphisms [25]. The gene associated with achondroplasia was first mapped to the genomic region 4p16.3 using linkage analysis [26], and shortly after, pathogenic mutations in FGFR3 were identified [27]. When FGF (fibroblast growth factor) binds to its receptor FGFR3, the receptor dimerizes, activating its cytoplasmic tyrosine kinase function and initiating a signaling cascade inside the cell. FGFR3 regulates normal bone growth by downstream inhibition of bone growth signals through the MAPK pathway and STAT1 [28]. 
The mutation in FGFR3 responsible for most cases of achondroplasia is the p.Gly380Arg substitution in the transmembrane domain of the receptor, which leads to overactivation of the receptor, excessively inhibiting the downstream bone growth formation signals [29]. Interestingly, there are genes that can cause autosomal dominant disorders through multiple of the above-mentioned mechanisms, and while some clinical features may be shared among patients with variants that cause disease through different mechanisms, some other features are different and specific to the molecular defect of the mutant protein. A notable example of this is pathogenic variation and allelic heterogeneity in the fibrillin-1 gene, FBN1. Mutations or variants altering the normal function of fibrillin-1 have been associated with a variety of autosomal dominant disorders broadly known as fibrillinopathies, of which Marfan syndrome is the most well-known and studied disorder. Marfan syndrome (MIM #154700) is a disorder that affects connective tissue, leading to defects in many systems including cardiovascular, skeletal, and ocular, with major phenotypic heterogeneity. Diagnosing this disorder is therefore challenging, and since untreated patients risk premature death, timely diagnosis and intervention is essential [30]. Marfan syndrome is highly penetrant and is caused by autosomal dominant mutations in FBN1 [31]. Both haploinsufficient and dominant negative mutations in FBN1 have been reported to cause Marfan syndrome [32]. Haploinsufficiency of fibrillin-1 due to LoF alleles in FBN1 is the main mechanism underlying Marfan syndrome [32]; however, dominant negative variants have been reported where the protein produced by the mutant allele has a toxic effect when multimerizing to form fibrils, disrupting the normal function of fibrillin-1 and effectively leading to loss of proper protein function. 
In 75% of Marfan syndrome cases, mutations in FBN1 are inherited from an affected parent, whereas in the 25% remaining cases the disorder is caused by DNMs in the gene [32]. Other mutations in FBN1 have been associated with a spectrum of connective tissue and skeletal disorders including acromicric and geleophysic


dysplasias (MIM #102370 and MIM #614185, respectively), Weill-Marchesani syndrome type 2 (MIM #608328), stiff skin syndrome (MIM #184900), familial ectopia lentis (MIM #129600), MASS syndrome (MIM #604308), and Marfan lipodystrophy syndrome (MIM #616914). Interestingly, this last disorder which shares some features of Marfan syndrome but where the main feature is severe congenital lipodystrophy, is caused exclusively by mutations that affect the C-terminus of the fibrillin-1 protein, thought to be important for the secretion of the protein. These mutations in the C-terminal region do not result in the loss of the fibrillin-1 protein, but rather in a smaller protein with a truncated or abnormal C-terminus [33]. Following the first Marfan syndrome diagnosis, a Marfan-like syndrome was described, where patients had similar cardiovascular and skeletal features as observed in typical Marfan patients but did not display abnormalities in their eyes and did not have mutations in FBN1 [30,34]. It was later found that the gene associated with this Marfan-like syndrome referred to as Marfan syndrome type II was TGFBR2 [35], which encodes the type II subunit of the membrane receptor for transforming growth factor β (TGF-β). This different but related disorder to Marfan syndrome was later renamed as Loeys-Dietz syndrome type 2 (LDS2, MIM #610168). Autosomal dominant variants in the other subunit of the TGF-β receptor, TGFBR1, are the cause of Loeys-Dietz syndrome type 1 (LDS1, MIM #609192). Of note, fibrillin-1 regulates the bioavailability of TGF-β [35], which upon binding its receptor is responsible for regulating a plethora of biological processes including cell cycle progression, cell proliferation and differentiation, extracellular matrix production, wound healing, and immune regulation. 
Another interesting example is Robinow syndrome, a disorder characterized by multiple congenital abnormalities including limb dwarfism, dysmorphic facial features, and genital anomalies among others [36]. Autosomal dominant forms of Robinow syndrome have been described caused by mutations in WNT5A [37] (DRS1, MIM #180700), DVL1 [38] (DRS2, MIM #616331), and DVL3 [39] (DRS3, MIM #616894). In Robinow syndrome caused by mutations in DVL1 and DVL3, all the different pathogenic variants observed localized to the penultimate exon and disrupted the reading frame leading to a −1 frameshift at the end of the transcript [38,39]. However, these abnormal transcripts appear to escape NMD, which was confirmed by the amplification of the abnormal mRNA in patients’ leukocytes, leading to a truncated stable protein with an abnormal highly basic C-terminal region. The exact mechanism by which these truncated proteins cause disease is not yet known, but it appears to result in increased canonical WNT activity. Interestingly, WNT5A and ROR2 are also associated with dominant and recessive forms of Robinow syndrome, respectively, and function along with DVL1 and DVL3 in the Wnt signaling pathway [40], which is essential for development. Furthermore, it has been shown that WNT5A activates the ROR2 signaling cascade in osteoblasts [41]. These disorders exemplify how disease-causing variants in different genes functioning in the same pathway lead to similar phenotypes. Also, they illustrate the importance of studying the genetics of rare diseases in elucidating essential players in key biological pathways and in understanding gene function and dynamics in these pathways.
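The contrast drawn in this section between PTC-bearing transcripts that are degraded by NMD and penultimate-exon frameshifts that escape it can be illustrated with a widely used rule of thumb. The rule is an assumption introduced here and is not a method described in this chapter: a PTC is expected to trigger NMD only when it lies more than roughly 50 nucleotides upstream of the last exon-exon junction. The function name, the threshold default, and the exon lengths below are hypothetical and chosen only for illustration.

```python
def predicts_nmd(exon_lengths, ptc_pos, threshold=50):
    """Rough prediction of whether a premature termination codon triggers NMD.

    exon_lengths: exon lengths in transcript order (nucleotides of mature mRNA).
    ptc_pos: 0-based position of the PTC along the mature mRNA.
    threshold: distance cutoff in nucleotides (assumed rule of thumb, about 50).

    Heuristic (an assumption, not taken from the chapter): NMD is predicted
    only if the PTC lies more than `threshold` nucleotides upstream of the
    last exon-exon junction.
    """
    total = sum(exon_lengths)
    if not 0 <= ptc_pos < total:
        raise ValueError("ptc_pos must fall inside the transcript")
    if len(exon_lengths) < 2:
        return False  # no exon-exon junction, so the rule does not apply
    last_junction = total - exon_lengths[-1]
    return (last_junction - ptc_pos) > threshold


# Hypothetical four-exon transcript; the last junction is at position 650.
toy_exons = [200, 300, 150, 120]
early_ptc = predicts_nmd(toy_exons, 100)        # far upstream of the junction
late_penultimate = predicts_nmd(toy_exons, 620) # within 50 nt of the junction
last_exon = predicts_nmd(toy_exons, 700)        # downstream of the junction
```

Under this heuristic, only the early PTC is predicted to be degraded. The other two would be expected to yield a stable truncated transcript, which fits the escape from NMD reported for the DVL1 and DVL3 variants.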

6.2.2 Incomplete penetrance and variable expressivity of dominant disorders Autosomal dominant disorders can manifest incomplete penetrance, where an individual may carry a particular genotype but may not express the corresponding phenotype. In contrast, variable


expressivity is the phenomenon where individuals with the same genotype can have different disease manifestations. Many factors affect penetrance and expressivity of disease, including modifier genes, epigenetic changes, sex, and age [42]. Furthermore, the type of mutation in a given gene and its impact on the resulting protein can affect penetrance. Gene expression is modulated by regulatory elements in the genome; thus mutations within these elements can also have consequences on penetrance or influence the severity of the disease presentation, demonstrating variable expressivity. Epigenetic modifications, including histone acetylation and DNA and histone methylation, are other molecular factors that may affect penetrance and influence expressivity of the disease by altering gene expression, for example, through methylation of the gene or altering X chromosome inactivation patterns (see Chapter 7: X-linked and Mitochondrial Disorders). Biological sex also affects penetrance and expressivity of inherited variants in an individual. As an example, autosomal dominant mutations in BRCA1 and BRCA2 increase the risk of breast and ovarian cancers significantly but exhibit incomplete penetrance [43]. Additionally, among carriers of BRCA2 mutations, only 6% of men compared to 86% of women are predicted to develop breast cancer by age 70 [44]. In some cases, increased age has been shown to drastically affect disease expressivity. For example, in Parkinson’s disease arising from a p.Arg1441Gly mutation in LRRK2, only 12.5% of individuals present symptoms before the age of 65; however, disease onset occurs in 83% of individuals by age 80 [45]. Our understanding of penetrance and variable expressivity for different diseases will continue to improve as additional disease alleles are identified and patient phenotyping becomes more systematic, detailed, and refined through reverse phenotyping. 
Penetrance and variable expressivity are further discussed in Chapter 9, Multilocus Inheritance and Variable Disease Expressivity in Rare Disease.
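Age-dependent penetrance figures such as those quoted above for LRRK2 are, in their simplest form, the fraction of variant carriers who have developed disease by a given age. The sketch below shows that arithmetic on invented data. The function name and the carrier list are hypothetical, and the calculation ignores censoring (carriers who have not yet reached the age in question), which real studies handle with survival-analysis methods.

```python
def cumulative_penetrance(onset_ages, age):
    """Fraction of carriers with disease onset at or before `age`.

    onset_ages: one entry per carrier; the age at onset, or None if the
    carrier has remained unaffected. Simplification: censoring is ignored.
    """
    if not onset_ages:
        raise ValueError("need at least one carrier")
    affected = sum(1 for a in onset_ages if a is not None and a <= age)
    return affected / len(onset_ages)


# Hypothetical cohort of eight carriers (toy data, not from any study).
toy_carriers = [58, 63, 70, 72, 75, 78, 79, None]
by_65 = cumulative_penetrance(toy_carriers, 65)  # 2 of 8 carriers
by_80 = cumulative_penetrance(toy_carriers, 80)  # 7 of 8 carriers
```

The same genotype gives a low penetrance estimate in a young cohort and a high one in an older cohort, which is why the age structure of a study matters when penetrance is reported.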

6.3 Sporadic disorders Undoubtedly, the type of disorders that has benefited the most from the implementation of genomic sequencing approaches has been sporadic disorders arising from DNMs (Fig. 6.3). Sporadic disorders refer to cases where there is an affected child born to unaffected parents, and therefore there is no family history of genetic disease. Historically, these disorders have been challenging to study due to their rarity in the majority of cases and the lack of large family pedigrees leveraged for the study of classic dominant, recessive, and X-linked disorders. In the case of sporadic disorders, these are caused by DNMs that occur in the germinal cells, spermatozoids or oocytes, of one of the parents or very early during embryonic development of the affected child.

6.3.1 The human de novo mutation rate De novo means “new” and in genetics DNM refers to a change in the DNA sequence that occurred as a new event in the child and consequently was not inherited from either parent. DNMs are a normal phenomenon, and they can occur any time a cell replicates its DNA. The DNA of all cells in the body will have some new mutations that were not present in the parent cell from which they were derived. With 99% of the human genome sequence composed of intergenic and nonprotein coding sequence, many of these new mutations will not have much consequence and most of them will


FIGURE 6.3 Sporadic disorders caused by de novo mutations (DNMs) are recognizable by trio structures where an affected child is born to unaffected parents. The red bar indicates the disease associated variant causing the disorder which is only present in the child and not in the parents.

occur in nongenic regions. However, when these new mutations occur in the germ cells or very early during the development of the embryo after conception, and they occur in genes, there is potential for causing a disease or phenotype unique to the offspring. The rate of DNM in cells is determined by a combination of the error rate of the DNA polymerase that copies the DNA during each replication cycle and the proofreading mechanisms that ensure the high fidelity of the process. Multiple studies have looked at parent-offspring trios to estimate the DNM rate in humans. Although with some variability, most of these studies have calculated the human germline DNM rate to be approximately 1.2 × 10⁻⁸ (range from 1.0 × 10⁻⁸ to 1.5 × 10⁻⁸) mutations per nucleotide per generation [46–51], meaning that given the size of the human genome (6.4 Gb for


a diploid individual), each person will have approximately 60–100 new mutations that were not present in their parents. Other factors such as parental age or environmental exposures can increase the number of DNMs in the offspring. For example, it has been documented and consistently replicated that paternal age at conception is positively and strongly correlated with a higher number of DNMs in the children of older fathers [47,50,52]. As males age, the number of DNA replications and cell divisions that the spermatogonial stem cells undergo during the process of spermatogenesis to continuously derive sperm cells increases substantially compared to the replications that occur in female oogenesis. Therefore the probability of new mutations occurring in the male germ cells increases with paternal age. However, the parental age effect is not exclusive to fathers, and studies have shown that there is also a positive correlation between the age of mothers and occurrence of DNMs in their children, albeit less pronounced than for paternal age [50,53,54]. Additionally, exposures of the gonads or of the embryo in utero to mutagens, such as ionizing radiation or chemical agents, can induce DNA damage that can lead to new mutations. Most DNMs occur at random throughout the genome; however, it is now recognized that certain characteristics of the DNA can promote the occurrence of DNMs in specific regions [55–57]. One of these features is CG content, meaning the percentage of cytosine and guanine bases in a given DNA region and the presence of CpG dinucleotides (cytosine followed by a guanine base) [58,59]. These regions are more prone to gather DNMs because deamination of methyl cytosine residues in the DNA can result in C > T or G > A transitions, altering the DNA sequence. When DNMs affect the same nucleotide in unrelated patients, it is said that these mutations are recurrent.
In some disorders, such as achondroplasia (MIM #100800), the most common form of dwarfism, the p.Gly380Arg mutation in FGFR3 has been reported to occur recurrently de novo in unrelated individuals and to be the molecular cause of the disease in a large fraction of patients [26,60]. Similarly, the recurrent DNM in the ACVR1 gene that causes the p.Arg206His substitution in the activin A receptor is the most common mutation found in approximately 95% of patients with fibrodysplasia ossificans progressiva (FOP, MIM #135100) [61]. FOP is a very rare disorder where bone formation is dysregulated and occurs in places where it should not occur (heterotopic ossification), such as cartilage, tendons, and muscle, leading to disability and early death due to complications from thoracic insufficiency. In most sporadic cases, the DNM occurs in the germ cells of one of the parents or early in the embryo after conception; therefore the probability of the family having additional affected children with the same disorder is very low and close to zero. However, in some cases the DNM can occur in the germinal stem cells, the gonads, or later in the fetal development of one of the parents affecting the tissues that give rise to the reproductive organs, leading to germline mosaicism in one of the parents (see Chapter 8: Mosaicism in Rare Disease). In these cases, a larger number of germline cells will harbor the mutation because of the mosaicism in the parental tissues, and the likelihood of having another spermatozoid or oocyte with the same mutation as the one observed in an affected child is nonzero, leading to a higher risk of recurrence of a sporadic disorder in the same family.
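The per-genome estimate given earlier in this section follows from multiplying the per-nucleotide germline rate by the size of the diploid genome. The short sketch below reproduces that arithmetic for a rate of about 1.2 × 10⁻⁸ per nucleotide per generation and a 6.4 Gb diploid genome. The function and parameter names are illustrative only.

```python
def expected_dnms(rate_per_nucleotide=1.2e-8, diploid_genome_bp=6.4e9):
    """Expected number of de novo mutations per individual per generation.

    Simple product of the per-nucleotide mutation rate and the number of
    nucleotides in a diploid genome; parental age and other modifiers of
    the rate are not modeled here.
    """
    return rate_per_nucleotide * diploid_genome_bp


central = expected_dnms()         # about 77 new mutations
low = expected_dnms(1.0e-8)       # about 64, lower end of the reported range
high = expected_dnms(1.5e-8)      # about 96, upper end of the reported range
```

The three values bracket the figure of roughly 60 to 100 new mutations per person.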

6.3.2 Mechanisms of disease of de novo mutations Disorders caused by DNMs are by definition dominant in that it is sufficient for only one allele of the gene to be abnormal to cause the disease. As discussed previously, dominant disorders are


caused by variants that can affect the gene in multiple ways, through haploinsufficiency, dominant negative, or GoF effects (Fig. 6.2). Similarly, DNMs can cause disease through these same mechanisms. If the new mutation introduces an early stop codon in the gene leading to loss of the functional protein and the one normal copy of the gene is insufficient to perform its proper functions, then the mechanism of disease will be haploinsufficiency. Some genes have been observed to be depleted of LoF variants in large population studies and databases. The calculation of the number of observed versus expected LoF variants in a given gene as a function of its length indicates the degree of constraint of such gene. Higher constraint metrics suggest that the function of the gene may be of high biological relevance, and therefore disruption or alterations are not tolerated; consequently, individuals with deleterious variants in a highly constrained gene are more likely to have a detrimental phenotype. This calculation can be captured as a probability of LoF intolerance (pLI) score [62], that can be a useful piece of information when evaluating novel candidate disease genes where patients have DNMs. As discussed before, other dominant mutations can exert their effect in a dominant negative manner causing proteins to be abnormal, disrupting the formation and function of multiprotein complexes, or accumulating in a toxic form in the cell resulting in ER stress and ultimately causing cell death. An example of a sporadic disorder caused by dominant negative DNMs is NESCAV syndrome (MIM #614255) clinically characterized by neurodegeneration, difficulty walking due to progressive spasticity, and variable features of cerebellar atrophy, cortical visual impairment, learning disabilities, and behavioral abnormalities. NESCAV syndrome is caused by missense DNMs in the gene KIF1A, which encodes a neuron-specific kinesin motor protein involved in fast anterograde axonal transport [63]. 
To perform its kinesin functions transporting cargo along the axons of neurons, KIF1A associates in a homodimer. De novo missense mutations that alter highly conserved residues in the motor domain of the protein cause NESCAV through a dominant negative mechanism by altering the structure of the protein and impacting its dimerization and proper motility [63,64]. In other instances, DNMs can induce a GoF effect, causing the protein to be overactive or lose its normal inhibitory mechanisms. An example of this is a recently described neurodevelopmental disorder characterized by intellectual disability, macrocephaly, and seizures and caused by de novo missense mutations in the gene PAK1 [65,66]. PAK1 is a kinase enzyme that phosphorylates target proteins in serine and threonine residues to activate them and trigger multiple cellular processes. Additionally, it is known to be activated or to activate other genes important for neurodevelopment, such as CDC42 associated with Takenouchi-Kosaki syndrome (MIM #616737) and RAC1 associated with autosomal dominant intellectual disability (MIM #617751). When inactive, PAK1 is present in the cell as a dimer of two PAK1 proteins that interact by binding to each other’s autoregulatory domain and masking their kinase domains. Upon activation by CDC42 or RAC1, the dimer destabilizes and dissociates, uncovering the active kinase domain and allowing PAK1 to phosphorylate downstream targets [67]. Pathogenic missense DNMs in PAK1 in patients with macrocephaly and intellectual disability cluster in the autoregulatory domain or disrupt the proper binding of the dimers to maintain an inactive complex, causing PAK1 to lose its autoregulatory mechanism and be overactive [66]. A variation to the GoF effect of some dominant mutations that perhaps is more common in sporadic disorders than in inherited autosomal dominant disorders, is the generation of a new function for the affected protein, this is called a neomorphic mutation. 
An example of this is the previously


mentioned most common recurrent DNM p.Arg206His in ACVR1 that causes fibrodysplasia ossificans progressiva (FOP, MIM #135100). Previously, it had been hypothesized that the mutation was likely a GoF, causing an overactive ACVR1 receptor to signal to form more bone in the affected patients, thereby resulting in FOP. However, it has now been shown that the mutation instead causes the receptor to signal upon binding of activin, a molecule to which it is normally not responsive [68]. Therefore this neomorphic mutation causes the receptor to acquire a new function by being responsive to a molecule it normally does not recognize and signal intracellularly to initiate processes of ossification in a different context.
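The comparison of observed versus expected LoF variants described earlier in this section can be sketched as a simple ratio. This is a deliberately simplified stand-in and not the pLI calculation itself, which comes from the probabilistic model in [62]. The function name and the variant counts below are hypothetical.

```python
def lof_observed_expected_ratio(observed, expected):
    """Ratio of observed to expected LoF variants for one gene.

    A ratio well below 1 means fewer LoF variants are seen in the
    population than expected, which suggests the gene is under constraint.
    This is an illustrative simplification, not the pLI model.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    if observed < 0:
        raise ValueError("observed count cannot be negative")
    return observed / expected


# Hypothetical counts for two genes with the same expected number of LoF variants.
constrained_gene = lof_observed_expected_ratio(2, 40)   # strongly depleted
tolerant_gene = lof_observed_expected_ratio(38, 40)     # close to expectation
```

In this toy comparison, a DNM causing LoF in the first gene would be a more plausible disease candidate than one in the second, all else being equal.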

6.3.3 Genomic studies of sporadic disorders and identification of de novo mutations Prior to the sequencing of the human genome, the study of rare genetic disorders focused on large families with enough affected and unaffected individuals to perform informative linkage analyses using genetic markers across the genome (see Chapter 10: Statistical Approaches to Rare Disease Analyses); however, this approach was inapplicable to genetic disorders with no evidence of familial segregation. Currently, chromosomal microarray analysis (CMA) is the recommended first-tier genomic testing to identify genetic causes of disease in children with complex neurodevelopmental disorders including autism spectrum disorder (ASD), developmental delay, and intellectual disability [69]. The technique, reviewed in detail in Chapter 3, Genomic Disorders in the Genomics Era, identifies microscopic and submicroscopic gains or losses of chromosomal segments known as copy-number variants (CNVs). CNVs can cause diseases in a dominant fashion when they affect dosage-sensitive genes or their regulatory regions, especially in nervous system tissues like the brain and peripheral nerve, where slight changes in gene dosage can have a significant impact on normal brain development and both brain and nerve function. The application of CMA in research and in the clinic has identified a vast number of CNVs and genomic rearrangements that occur de novo and cause rare genetic disorders [70–73]. Next-generation sequencing (NGS) of genomic DNA through whole-exome sequencing (WES) or whole-genome sequencing (WGS), discussed in detail in Chapter 4, Genomic Sequencing of Rare Diseases, is becoming more commonly recommended in individuals with suspected genetic disorders and uninformative CMA findings.
The implementation and reduction in cost of NGS technologies to diagnose patients with rare diseases have allowed for the study of smaller nuclear family structures such as trios comprising both parents and an affected child, facilitating the analyses of sporadic genetic disorders. Trio-based genomic analysis of WES or WGS data allows accurate phasing of variants, determining whether they were inherited from the mother or the father, and identification of those variants that are not present in either parent as potential DNMs in the child (Fig. 6.4). Genomic sequencing and trio-based analyses have revolutionized the study of sporadic disorders and enabled the identification of hundreds of new disease entities, some of which had not been previously described in the literature. Furthermore, the contribution of mosaic somatic variation (see Chapter 8: Mosaicism in Rare Disease) to the presentation of sporadic genetic disorders has been made evident and highlighted by the implementation of NGS approaches. Perhaps a major disease category where the study of sporadic cases and identification of DNMs associated with disease has been most impactful is neurodevelopmental disorders. Patients affected


FIGURE 6.4 Pedigree and sequence reads for a trio where a de novo mutation (DNM) was identified in the ARID1B gene associated with Coffin-Siris syndrome-1 (CSS1, MIM #135900) by whole-exome sequencing (WES). The affected female child is denoted by the full black circle, whereas unaffected parents are denoted by empty symbols. The WES sequence reads over the region where the mutation has been identified are shown in gray denoting matching sequence with the human reference; the position of the de novo mutation is highlighted showing the change from a “C” nucleotide in the reference to a “T” nucleotide in red. The topmost track represents the coverage for each of the bases in the region depicted and shows an approximate equal (50%) presence of reference (C: blue) and variant (T: red) reads for the DNM position.

with developmental delay and intellectual disability (DD/ID) have generally been difficult to accurately diagnose, especially in nonsyndromic cases, being assigned to broad categories of intellectual disability, cerebral palsy, or ASD. In the last 10 years, genomic sequencing of cohorts of patients with these broad diagnoses has shown that DD/ID and ASD are genetically heterogeneous and unspecific clinical diagnoses that cover a broad spectrum of genetic disorders caused by hundreds of genes, each responsible for a small percentage of cases, but in aggregate accounting for a large fraction of patients that receive these diagnoses [74]. DNMs, sometimes observed recurrently, have been identified as major contributors of disease in patients with DD/ID and ASD without a family history of disease [49,75–77]. Analyses of 4293 children recruited as part of the Deciphering Developmental Disorders (DDD) project focused on the prevalence and characteristics of DNMs in this cohort and their contribution to developmental disorders [78]. Through WES, 8361 candidate DNMs were identified affecting


coding and splicing regions, with a mean of 1.95 DNMs per patient, higher than in previous studies. Consistent with previous studies, analyses of these DNMs confirmed a significant paternal age effect and an independent weaker effect for maternal age, with estimated 0.0306 and 0.0172 DNMs per year for exonic DNMs derived from paternal and maternal lineages, respectively. This observation translates into an estimate of 1/448 to 1/213 children born with a developmental disorder due to DNM, accounting for approximately 400,000 children born per year worldwide [78]. WES has identified DNMs in genes that follow autosomal dominant inheritance and contribute to around 15%20% of ASD cases [79]. Some notable examples include haploinsufficiency of ARID1B, encoding the AT-rich interaction domain 1B protein, which is part of the chromatin remodeling complex SWI/SNF and one of the most commonly mutated genes in ASD [8082]. Another example is CHD8, chromodomain helicase DNA-binding protein 8, a chromatin remodeler which functions in regulating β-catenin target genes, and also one of the most frequently mutated ASD genes, with mutations in one allele resulting in haploinsufficiency [77,83,84]. The evaluation of variant constraints has shown that CHD8 and ARID1B are intolerant to LoF variation [85], a pattern observed for many ASD/DD/ID associated genes. WES and WGS studies in large ASD cohorts have shown that no one gene accounts on its own for more than 0.2% of cases, and that additional genetic and environmental modifier factors play a role in the clinical heterogeneity of this phenotype [86,87]. Inherited recessive and DNMs have been identified in the same gene in association with broader neurodevelopmental disorders that may have ASD as a feature, for example, in KDM5A [88] and KDM5B [89]. Histone lysine methyltransferases (KMTs) and demethylases (KDMs) are chromatin remodelers essential for proper gene expression. 
Pathogenic variants in KMTs and KDMs have been linked to numerous developmental disorders, including ASD and DD/ID. A significant number of the identified variants are DNMs, and some instances of autosomal dominant or dominant X-linked inheritance have been reported. Many of the variants identified are protein-truncating, suggesting that haploinsufficiency is the main mechanism by which the associated disorders arise [90]. Kabuki syndrome (KS, MIM #147920), for example, is a rare developmental disorder characterized by intellectual disability, distinctive facial features, and many organ malformations [91]; most cases are caused by DNMs in the gene KMT2D. Specifically, heterozygous LoF variants lead to haploinsufficiency of KMT2D, a histone methyltransferase that methylates the lysine 4 residue of histone H3, causing the disorder [90]. The X-linked dominant form of KS, referred to as Kabuki syndrome type 2 (MIM #300867), is caused by mutations in KDM6A. The gene encodes the lysine demethylase 6A protein, which specifically demethylates lysine 27 of histone H3 [92,93], activating genes normally repressed by this histone mark. Haploinsufficiency for other KMTs has been shown to result in other sporadic developmental disorders. Examples include KMT2C, an H3K4 methyltransferase associated with Kleefstra syndrome type 2 (KLEFS2, MIM #617768), which is characterized by ID with delayed psychomotor development and mild dysmorphic features; ASH1L, an H3K36 di- and tri-methyltransferase associated with autosomal dominant mental retardation type 52 (MRD52, MIM #617796); and KMT5B, an H4K20 di- and tri-methyltransferase, in which mutations lead to intellectual disability (MRD51, MIM #617788) [90].
Genomic sequencing of large cohorts of patients with broad clinical diagnoses of ASD, DD/ID, or neurodevelopmental disorders has made evident that these are in fact a collection of individual rare diseases, each with its own mechanism of disease and affecting distinct processes and biological pathways.
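
The parental age effects quoted above for the DDD cohort can be turned into a small worked calculation. The sketch below is only illustrative: it assumes the two per-year rates act linearly and additively, and the function name and the 20-year age shift are hypothetical choices, not part of the original study.

```python
# Per-year increases in exonic DNMs reported in the text for the DDD cohort.
PATERNAL_RATE = 0.0306  # additional exonic DNMs per year of paternal age
MATERNAL_RATE = 0.0172  # additional exonic DNMs per year of maternal age


def extra_exonic_dnms(paternal_years: float, maternal_years: float) -> float:
    """Expected change in exonic DNM count for a shift in parental ages.

    Assumes linear, additive effects (a simplification for illustration).
    """
    return PATERNAL_RATE * paternal_years + MATERNAL_RATE * maternal_years


# Hypothetical comparison: both parents 20 years older at conception.
delta = extra_exonic_dnms(20, 20)
print(f"Expected additional exonic DNMs: {delta:.3f}")

# The prevalence range quoted in the text, expressed as percentages.
for denominator in (448, 213):
    print(f"1/{denominator} = {100 / denominator:.2f}% of births")
```

Under these assumptions the paternal contribution dominates the total, which is consistent with the stronger paternal age effect described in the text.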

6.4 Outlook

Significant progress has been made over the past decades in our understanding of the genetic mechanisms of disease, particularly in uncovering causative genes and variants for rare genetic disorders; however, much remains to be understood. The identification of DNMs in sporadic cases of a novel disorder is often the first step but is not enough to demonstrate the association of a gene with disease. The rarity of these disorders makes it hard to identify additional patients who may help support a novel candidate gene for sporadic disorders due to DNMs. Much remains to be uncovered, especially regarding a deeper molecular understanding of dominant and sporadic disorders, where mechanisms like haploinsufficiency, dominant negative, and GoF provide a partial interpretation of the underlying causes of disease but further investigation of the affected biological pathways is necessary. Dominant and sporadic disorders also present additional challenges, including variable severity of disease phenotypes, incomplete penetrance, variable expressivity, hypomorphic alleles, and mosaicism. Mechanisms that regulate gene expression and may contribute to disease variability and incomplete penetrance of dominant disorders are gradually being elucidated with the implementation of more advanced molecular technologies, illuminating their role in dominant disease traits and phenotype modulation. The technological advances of NGS in the past decade have enabled the identification of disease-causing genes and a deeper mechanistic understanding of biological processes and, in many cases, provided genetic diagnoses to patients. Despite these advances, larger cohorts of patients are still needed to identify pathogenic variants and assign causality to novel disease-associated genes in rare diseases, especially when dealing with genetically heterogeneous disease traits.
Inclusion of more diverse populations in genetic and genomic studies is of utmost importance for obtaining a more comprehensive picture of genetic diversity and establishing a catalog of benign polymorphic variation, which may facilitate the identification of rare pathogenic variants and expand our understanding of disease genetics and novel gene functions and mechanisms.

References

[1] The Huntington’s Disease Collaborative Research Group. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell 1993;72(6):971–83.
[2] Bates GP. History of genetic disease: the molecular genetics of Huntington disease—a history. Nat Rev Genet 2005;6(10):766–73.
[3] Chial H. Huntington’s disease: the discovery of the Huntingtin gene. Nat Educ 2008;1(1):71.
[4] Panegyres PK, Shu CC, Chen HY, Paulsen JS. Factors influencing the clinical expression of intermediate CAG repeat length mutations of the Huntington’s disease gene. J Neurol 2015;262(2):277–84.
[5] Gusella JF, Wexler NS, Conneally PM, et al. A polymorphic DNA marker genetically linked to Huntington’s disease. Nature 1983;306(5940):234–8.
[6] Brogna S, Wen J. Nonsense-mediated mRNA decay (NMD) mechanisms. Nat Struct Mol Biol 2009;16(2):107–13.
[7] Escayg A, Goldin AL. Sodium channel SCN1A and epilepsy: mutations and mechanisms. Epilepsia 2010;51(9):1650–8.

[8] Tan TY, Gonzaga-Jauregui C, Bhoj EJ, et al. Monoallelic BMP2 variants predicted to result in haploinsufficiency cause craniofacial, skeletal, and cardiac features overlapping those of 20p12 deletions. Am J Hum Genet 2017;101(6):985–94.
[9] Di Taranto MD, Giacobbe C, Fortunato G. Familial hypercholesterolemia: a complex genetic disease with variable phenotypes. Eur J Med Genet 2020;63(4):103831.
[10] Wei N, Zhang Q, Yang XL. Neurodegenerative Charcot-Marie-Tooth disease as a case study to decipher novel functions of aminoacyl-tRNA synthetases. J Biol Chem 2019;294(14):5321–39.
[11] Jordanova A, Irobi J, Thomas FP, et al. Disrupted function and axonal distribution of mutant tyrosyl-tRNA synthetase in dominant intermediate Charcot-Marie-Tooth neuropathy. Nat Genet 2006;38(2):197–202.
[12] Antonellis A, Lee-Lin SQ, Wasterlain A, et al. Functional analyses of glycyl-tRNA synthetase mutations suggest a key role for tRNA-charging enzymes in peripheral axons. J Neurosci 2006;26(41):10397–406.
[13] Nangle LA, Zhang W, Xie W, Yang XL, Schimmel P. Charcot-Marie-Tooth disease-associated mutant tRNA synthetases linked to altered dimer interface and neurite distribution defect. Proc Natl Acad Sci U S A 2007;104(27):11239–44.
[14] Williams KB, Brigatti KW, Puffenberger EG, et al. Homozygosity for a mutation affecting the catalytic domain of tyrosyl-tRNA synthetase (YARS) causes multisystem disease. Hum Mol Genet 2019;28(4):525–38.
[15] Kopajtich R, Murayama K, Janecke AR, et al. Biallelic IARS mutations cause growth retardation with prenatal onset, intellectual disability, muscular hypotonia, and infantile hepatopathy. Am J Hum Genet 2016;99(2):414–22.
[16] van Meel E, Wegner DJ, Cliften P, et al. Rare recessive loss-of-function methionyl-tRNA synthetase mutations presenting as a multi-organ phenotype. BMC Med Genet 2013;14:106.
[17] Simons C, Griffin LB, Helman G, et al. Loss-of-function alanyl-tRNA synthetase mutations cause an autosomal-recessive early-onset epileptic encephalopathy with persistent myelination defect. Am J Hum Genet 2015;96(4):675–81.
[18] Hetz C, Zhang K, Kaufman RJ. Mechanisms, regulation and functions of the unfolded protein response. Nat Rev Mol Cell Biol 2020;21(8):421–38.
[19] Ghemrawi R, Khair M. Endoplasmic reticulum stress and unfolded protein response in neurodegenerative diseases. Int J Mol Sci 2020;21(17):6127.
[20] Kubra KT, Akhter MS, Uddin MA, Barabutis N. Unfolded protein response in cardiovascular disease. Cell Signal 2020;73:109699.
[21] Lodish H, Berk A, Zipursky SL, et al. Collagen: the fibrous proteins of the matrix. In: Freeman WH, editor. Molecular Cell Biology. 4th edition. New York: W. H. Freeman; 2000.
[22] Chan TF, Poon A, Basu A, et al. Natural variation in four human collagen genes across an ethnically diverse population. Genomics 2008;91(4):307–14.
[23] Barat-Houari M, Sarrabay G, Gatinois V, et al. Mutation update for COL2A1 gene variants associated with type II collagenopathies. Hum Mutat 2016;37(1):7–15.
[24] Gregersen PA, Savarirayan R. Type II collagen disorders overview. In: Adam MP, Ardinger HH, Pagon RA, et al., editors. GeneReviews®. Seattle (WA): University of Washington; 1993.
[25] Richette P, Bardin T, Stheneur C. Achondroplasia: from genotype to phenotype. J Bone Spine 2008;75(2):125–30.
[26] Le Merrer M, Rousseau F, Legeai-Mallet L, et al. A gene for achondroplasia-hypochondroplasia maps to chromosome 4p. Nat Genet 1994;6(3):318–21.
[27] Shiang R, Thompson LM, Zhu YZ, et al. Mutations in the transmembrane domain of FGFR3 cause the most common genetic form of dwarfism, achondroplasia. Cell 1994;78(2):335–42.

[28] Murakami S, Balmes G, McKinney S, Zhang Z, Givol D, de Crombrugghe B. Constitutive activation of MEK1 in chondrocytes causes Stat1-independent achondroplasia-like dwarfism and rescues the Fgfr3-deficient mouse phenotype. Genes Dev 2004;18(3):290–305.
[29] Bellus GA, Hefferon TW, Deluna RIO, et al. Achondroplasia is defined by recurrent G380R mutations of FGFR3. Am J Hum Genet 1995;56(2):368–73.
[30] Canadas V, Vilacosta I, Bruna I, Fuster V. Marfan syndrome. Part 1: pathophysiology and diagnosis. Nat Rev Cardiol 2010;7(5):256–65.
[31] Dietz HC, Cutting GR, Pyeritz RE, et al. Marfan syndrome caused by a recurrent de novo missense mutation in the fibrillin gene. Nature 1991;352(6333):337–9.
[32] Judge DP, Biery NJ, Keene DR, et al. Evidence for a critical contribution of haploinsufficiency in the complex pathogenesis of Marfan syndrome. J Clin Invest 2004;114(2):172–81.
[33] Passarge E, Robinson PN, Graul-Neumann LM. Marfanoid-progeroid-lipodystrophy syndrome: a newly recognized fibrillinopathy. Eur J Hum Genet 2016;24(9):1244–7.
[34] Boileau C, Jondeau G, Babron MC, et al. Autosomal dominant Marfan-like connective-tissue disorder with aortic dilation and skeletal anomalies not linked to the fibrillin genes. Am J Hum Genet 1993;53(1):46–54.
[35] Forrester WC. The Ror receptor tyrosine kinase family. Cell Mol Life Sci 2002;59(1):83–96.
[36] Roifman M, Brunner H, Lohr J, Mazzeu J, Chitayat D. Autosomal dominant Robinow syndrome. In: Adam MP, Ardinger HH, Pagon RA, et al., editors. GeneReviews®. Seattle (WA): University of Washington; 1993.
[37] Person AD, Beiraghi S, Sieben CM, et al. WNT5A mutations in patients with autosomal dominant Robinow syndrome. Dev Dyn 2010;239(1):327–37.
[38] White J, Mazzeu JF, Hoischen A, et al. DVL1 frameshift mutations clustering in the penultimate exon cause autosomal-dominant Robinow syndrome. Am J Hum Genet 2015;96(4):612–22.
[39] White JJ, Mazzeu JF, Hoischen A, et al. DVL3 alleles resulting in a -1 frameshift of the last exon mediate autosomal-dominant Robinow syndrome. Am J Hum Genet 2016;98(3):553–61.
[40] White JJ, Mazzeu JF, Coban-Akdemir Z, et al. WNT signaling perturbations underlie the genetic heterogeneity of Robinow syndrome. Am J Hum Genet 2018;102(1):27–43.
[41] Liu Y, Rubin B, Bodine PV, Billiard J. Wnt5a induces homodimerization and activation of Ror2 receptor tyrosine kinase. J Cell Biochem 2008;105(2):497–502.
[42] Cooper DN, Krawczak M, Polychronakos C, Tyler-Smith C, Kehrer-Sawatzki H. Where genotype is not predictive of phenotype: towards an understanding of the molecular basis of reduced penetrance in human inherited disease. Hum Genet 2013;132(10):1077–130.
[43] Evans DG, Shenton A, Woodward E, Lalloo F, Howell A, Maher ER. Penetrance estimates for BRCA1 and BRCA2 based on genetic testing in a Clinical Cancer Genetics service setting: risks of breast/ovarian cancer quoted should reflect the cancer burden in the family. BMC Cancer 2008;8:155.
[44] Coleman WB, Tsongalis GJ. Essential Concepts of Molecular Pathology. Academic Press; 2010.
[45] Ruiz-Martinez J, Gorostidi A, Ibanez B, et al. Penetrance in Parkinson’s disease related to the LRRK2 R1441G mutation in the Basque country (Spain). Mov Disord 2010;25(14):2340–5.
[46] Conrad DF, Keebler JE, DePristo MA, et al. Variation in genome-wide mutation rates within and between human families. Nat Genet 2011;43(7):712–14.
[47] Kong A, Frigge ML, Masson G, et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature 2012;488(7412):471–5.
[48] Campbell CD, Chong JX, Malig M, et al. Estimating the human mutation rate using autozygosity in a founder population. Nat Genet 2012;44(11):1277–81.
[49] Neale BM, Kou Y, Liu L, et al. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature 2012;485(7397):242–5.

[50] Goldmann JM, Wong WS, Pinelli M, et al. Parent-of-origin-specific signatures of de novo mutations. Nat Genet 2016;48(8):935–9.
[51] Rodriguez-Galindo M, Casillas S, Weghorn D, Barbadilla A. Germline de novo mutation rates on exons versus introns in humans. Nat Commun 2020;11(1):3304.
[52] Aitken RJ, De Iuliis GN, Nixon B. The sins of our forefathers: paternal impacts on de novo mutation rate and development. Annu Rev Genet 2020;54:1–24.
[53] Crow JF. The origins, patterns and implications of human spontaneous mutation. Nat Rev Genet 2000;1(1):40–7.
[54] Wong WS, Solomon BD, Bodian DL, et al. New observations on maternal age effect on germline de novo mutations. Nat Commun 2016;7:10486.
[55] Campbell CD, Eichler EE. Properties and rates of germline mutations in humans. Trends Genet 2013;29(10):575–84.
[56] Rahbari R, Wuster A, Lindsay SJ, et al. Timing, rates and spectra of human germline mutation. Nat Genet 2016;48(2):126–33.
[57] Smith TCA, Arndt PF, Eyre-Walker A. Large scale variation in the rate of germ-line de novo mutation, base composition, divergence and diversity in humans. PLoS Genet 2018;14(3):e1007254.
[58] Cooper DN, Youssoufian H. The CpG dinucleotide and human genetic disease. Hum Genet 1988;78(2):151–5.
[59] Youk J, An Y, Park S, Lee JK, Ju YS. The genome-wide landscape of C:G > T:A polymorphism at the CpG contexts in the human population. BMC Genomics 2020;21(1):270.
[60] Rousseau F, Bonaventure J, Legeai-Mallet L, et al. Mutations in the gene encoding fibroblast growth factor receptor-3 in achondroplasia. Nature 1994;371(6494):252–4.
[61] Kaplan FS, Xu M, Seemann P, et al. Classic and atypical fibrodysplasia ossificans progressiva (FOP) phenotypes are caused by mutations in the bone morphogenetic protein (BMP) type I receptor ACVR1. Hum Mutat 2009;30(3):379–90.
[62] Karczewski KJ, Francioli LC, Tiao G, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 2020;581(7809):434–43.
[63] Lee JR, Srour M, Kim D, Hamdan FF, Lim SH, Brunel-Guitton C, et al. De novo mutations in the motor domain of KIF1A cause cognitive impairment, spastic paraparesis, axonal neuropathy, and cerebellar atrophy. Hum Mutat 2015;36(1):69–78.
[64] Esmaeeli Nieh S, Madou MR, Sirajuddin M, Fregeau B, McKnight D, Lexa K, et al. De novo mutations in KIF1A cause progressive encephalopathy and brain atrophy. Ann Clin Transl Neurol 2015;2(6):623–35.
[65] Harms FL, Kloth K, Bley A, et al. Activating mutations in PAK1, encoding p21-activated kinase 1, cause a neurodevelopmental disorder. Am J Hum Genet 2018;103(4):579–91.
[66] Horn S, Au M, Basel-Salmon L, et al. De novo variants in PAK1 lead to intellectual disability with macrocephaly and seizures. Brain 2019;142(11):3351–9.
[67] Parrini MC, Lei M, Harrison SC, Mayer BJ. Pak1 kinase homodimers are autoinhibited in trans and dissociated upon activation by Cdc42 and Rac1. Mol Cell 2002;9(1):73–83.
[68] Hatsell SJ, Idone V, Wolken DM, et al. ACVR1R206H receptor mutation causes fibrodysplasia ossificans progressiva by imparting responsiveness to activin A. Sci Transl Med 2015;7(303):303ra137.
[69] Ho KS, Twede H, Vanzo R, et al. Clinical performance of an ultrahigh resolution chromosomal microarray optimized for neurodevelopmental disorders. Biomed Res Int 2016;2016:3284534.
[70] Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, et al. Strong association of de novo copy number mutations with autism. Science 2007;316(5823):445–9.
[71] Sanders SJ, Ercan-Sencicek AG, Hus V, Luo R, Murtha MT, Moreno-De-Luca D, et al. Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism. Neuron 2011;70(5):863–85.

[72] Vicari S, Napoli E, Cordeddu V, Menghini D, Alesi V, Loddo S, et al. Copy number variants in autism spectrum disorders. Prog Neuropsychopharmacol Biol Psychiatry 2019;92:421–7.
[73] Hilger AC, Dworschak GC, Reutter HM. Lessons learned from CNV analysis of major birth defects. Int J Mol Sci 2020;21(21):8247.
[74] de Ligt J, Willemsen MH, van Bon BW, et al. Diagnostic exome sequencing in persons with severe intellectual disability. N Engl J Med 2012;367(20):1921–9.
[75] Michaelson JJ, Shi Y, Gujral M, et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell 2012;151(7):1431–42.
[76] Sanders SJ, Murtha MT, Gupta AR, et al. De novo mutations revealed by whole-exome sequencing are strongly associated with autism. Nature 2012;485(7397):237–41.
[77] O’Roak BJ, Vives L, Girirajan S, et al. Sporadic autism exomes reveal a highly interconnected protein network of de novo mutations. Nature 2012;485(7397):246–50.
[78] Deciphering Developmental Disorders Study, et al. Prevalence and architecture of de novo mutations in developmental disorders. Nature 2017;542(7642):433–8.
[79] Iossifov I, O’Roak BJ, Sanders SJ, et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 2014;515(7526):216–21.
[80] O’Roak BJ, Vives L, Fu W, et al. Multiplex targeted sequencing identifies recurrently mutated genes in autism spectrum disorders. Science 2012;338(6114):1619–22.
[81] Celen C, Chuang JC, Luo X, et al. Arid1b haploinsufficient mice reveal neuropsychiatric phenotypes and reversible causes of growth impairment. Elife 2017;6.
[82] Shibutani M, Horii T, Shoji H, et al. Arid1b haploinsufficiency causes abnormal brain gene expression and autism-related behaviors in mice. Int J Mol Sci 2017;18(9).
[83] Thompson BA, Tremblay V, Lin G, Bochar DA. CHD8 is an ATP-dependent chromatin remodeling factor that regulates beta-catenin target genes. Mol Cell Biol 2008;28(12):3894–904.
[84] Bernier R, Golzio C, Xiong B, et al. Disruptive CHD8 mutations define a subtype of autism early in development. Cell 2014;158(2):263–76.
[85] Karczewski KJ, Francioli LC, Tiao G, et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv. 2019.
[86] Guo H, Wang T, Wu H, et al. Inherited and multiple de novo mutations in autism/developmental delay risk genes suggest a multifactorial model. Mol Autism 2018;9:64.
[87] de la Torre-Ubieta L, Won H, Stein JL, Geschwind DH. Advancing the understanding of autism disease mechanisms through genetics. Nat Med 2016;22(4):345–61.
[88] El Hayek L, Tuncay IO, Nijem N, et al. KDM5A mutations identified in autism spectrum disorder using forward genetics. eLife 2020;9.
[89] Lebrun N, Mehler-Jacob C, Poirier K, et al. Novel KDM5B splice variants identified in patients with developmental disorders: functional consequences. Gene 2018;679:305–13.
[90] Faundes V, Newman WG, Bernardini L, et al. Histone lysine methylases and demethylases in the landscape of human developmental disorders. Am J Hum Genet 2018;102(1):175–87.
[91] Niikawa N, Matsuura N, Fukushima Y, Ohsawa T, Kajii T. Kabuki make-up syndrome: a syndrome of mental retardation, unusual facies, large and protruding ears, and postnatal growth deficiency. J Pediatr 1981;99(4):565–9.
[92] Van Laarhoven PM, Neitzel LR, Quintana AM, et al. Kabuki syndrome genes KMT2D and KDM6A: functional analyses demonstrate critical roles in craniofacial, heart and brain development. Hum Mol Genet 2015;24(15):4443–53.
[93] Lan F, Bayliss PE, Rinn JL, et al. A histone H3 lysine 27 demethylase regulates animal posterior development. Nature 2007;449(7163):689–94.

CHAPTER 7

X-linked and mitochondrial disorders

Lauretta El Hayek1 and Maria Chahrour2

1Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, United States
2Eugene McDermott Center for Human Growth and Development, Department of Neuroscience, Center for the Genetics of Host Defense, Department of Psychiatry, Peter O’Donnell Jr. Brain Institute, University of Texas Southwestern Medical Center, Dallas, TX, United States

7.1 Introduction

As discussed in the previous chapter, the majority of the initial observations of inherited traits occurred in families with dominant (see Chapter 6: Dominant and Sporadic De Novo Disorders) or X-linked disorders, where the traits could be traced from one generation to the next (i.e., vertical transmission), although in some cases sparing family members, which also provided clues to the X chromosome mapping of the associated genes. A classic example of an X-linked rare disease that affected historical figures is Hemophilia B (MIM #306900), a condition characterized by an inability of the blood to clot, leading to abnormal and excessive bleeding upon injury or surgery. Hemophilia B is a notable genetic disorder that segregated in the European royal families and resulted in the premature death of many heirs to the thrones of Europe. The mutation affects the gene on the X chromosome encoding the clotting factor IX protein (F9) and was passed down by Queen Victoria to her children in an X-linked recessive pattern of inheritance in which, as discussed later in the chapter, the sons who inherited the mutation were affected with Hemophilia B, whereas the daughters carrying the mutation did not have the disease but passed it on to later generations of royals. Interestingly, it was not until 2009 that a specific splice-altering mutation was identified as the likely molecular cause of Hemophilia B in the European royal families, using genomic approaches to study historical samples from the Romanov branch of the family [1]. Similarly, the more common form of blood clotting disease, Hemophilia A (MIM #306700), is caused by inherited or de novo mutations in the gene encoding clotting factor VIII (F8), which also maps to chromosome X. About 45% of severe Hemophilia A cases are caused by an inversion involving intron 22 of the factor VIII gene [2].
In this chapter, X-linked and mitochondrial disorders are discussed with a focus on the genomic approaches and considerations associated with the study and diagnosis of these classes of disorders, such as sex, penetrance, X chromosome skewing, and heteroplasmy.

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00006-5 © 2021 Elsevier Inc. All rights reserved.


7.2 X Chromosome disorders

7.2.1 X-linked recessive disorders

Humans have 22 pairs of chromosomes known as autosomes and an additional pair formed by the sex chromosomes X and Y. In humans, females carry two X chromosomes (karyotype 46,XX), one inherited from each parent, whereas males have one X chromosome inherited from their mothers and a Y chromosome inherited from their fathers (karyotype 46,XY). X-linked disorders are caused by mutations in genes that are located on the X chromosome. X-linked recessive disorders (Fig. 7.1A) affect males carrying pathogenic variants in genes located on their only X chromosome (hemizygous), whereas females are typically not affected because they have a second normal X chromosome. For X-linked recessive disorders, females are generally considered heterozygous carriers, which means that one of their two X chromosomes carries a defective copy of the gene causing the disorder. Each daughter of a carrier therefore has a 50% chance of inheriting the X chromosome with the abnormal gene copy and also being a carrier, and each son has a 50% chance of inheriting the abnormal copy and having the disorder. Since males always inherit their Y chromosome from their fathers, father-to-son transmission is not possible, and affected males will transmit their only X chromosome carrying the defective gene to all their daughters but none of their sons. Finally, females affected by an X-linked recessive disorder because of biallelic pathogenic variants affecting the gene copies on both X chromosomes will pass a defective copy to all of their children: all male offspring will be affected, whereas, if the father is unaffected, all females will be asymptomatic carriers.

FIGURE 7.1 X-linked recessive (A) and X-linked dominant (B) modes of inheritance are illustrated. The red bars indicate locations of the disease mutation.

A common trait that is inherited in an X-linked recessive fashion is red–green color blindness, characterized by the inability to distinguish shades of red, green, or yellow. Normal color vision in humans is trichromatic due to the function of three classes of photoreceptors in retinal cells called cones, which are sensitive to light at different wavelengths (blue, green, and red). Detection of light by the different photoreceptors in the cones and the processing of these signals by the brain allow the perception of red, yellow, green, and blue colors individually and in combinations. The red- and green-pigment genes (opsins) are located on the X chromosome in a head-to-tail tandem array, with one red-pigment gene, OPN1LW (long-wave-sensitive opsin 1), followed by one or more green-pigment genes, OPN1MW (medium-wave-sensitive opsin 1). The high homology between these genes promotes relatively common recombination events that result in the deletion of green-pigment genes or the formation of red/green hybrid genes. Such genomic rearrangements at this locus constitute the most common cause of red–green color vision defects [3], followed by mutations in the OPN1LW or OPN1MW genes themselves [4]. Interestingly, this genome architecture of paralogous genomic segments mapping in close proximity can result in color blindness by nonallelic homologous recombination (NAHR), but it also seems to result in genomic instability favoring iterative template switching by microhomology-mediated break-induced replication (see Chapter 3: Genomic Disorders in the Genomics Era), resulting in duplication and triplication of the nearby MECP2 locus [5].
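
The X-linked recessive transmission rules described at the start of this section can be checked by enumerating which parental chromosomes each child can receive. The following is a minimal sketch, not a clinical risk model: the allele labels ('A' for a normal copy, 'a' for a pathogenic copy) and the function name are assumptions made for illustration, and the code ignores de novo mutation, skewed X inactivation, and incomplete penetrance.

```python
from collections import Counter
from fractions import Fraction

# 'A' marks an X chromosome with the normal allele, 'a' one with the pathogenic allele.
DAUGHTER_STATUS = ("unaffected", "carrier", "affected")  # indexed by number of 'a' alleles


def x_linked_recessive_offspring(mother: str, father_x: str) -> dict:
    """Per-sex outcome probabilities for one mating.

    mother   -- two characters, one per maternal X chromosome, e.g. "Aa"
    father_x -- one character for the father's single X chromosome
    """
    # Sons receive the father's Y, so only the maternal X matters.
    sons = Counter("affected" if x == "a" else "unaffected" for x in mother)
    # Daughters receive the father's X plus one maternal X.
    daughters = Counter(DAUGHTER_STATUS[(x + father_x).count("a")] for x in mother)
    n = len(mother)
    return {
        "son": {status: Fraction(c, n) for status, c in sons.items()},
        "daughter": {status: Fraction(c, n) for status, c in daughters.items()},
    }


for label, mother, father_x in [
    ("carrier mother, unaffected father", "Aa", "A"),
    ("unaffected mother, affected father", "AA", "a"),
    ("affected mother, unaffected father", "aa", "A"),
]:
    print(label, x_linked_recessive_offspring(mother, father_x))
```

The three matings correspond to the scenarios in the text: a carrier mother gives each son a 1/2 chance of being affected and each daughter a 1/2 chance of being a carrier, an affected father has only unaffected sons and carrier daughters, and an affected mother with an unaffected father has only affected sons and carrier daughters.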
Other notable examples of X-linked recessive disorders are Duchenne muscular dystrophy (DMD, MIM #310200) and Becker muscular dystrophy (BMD, MIM #300376), which are characterized by muscle weakness caused by an absent or partially functional dystrophin protein, respectively. Dystrophin is mainly expressed in skeletal and cardiac muscles and is encoded by DMD, a large gene (2.09 Mb in length) located on the X chromosome. When mutations in DMD disrupt the reading frame and lead to no dystrophin production, they result in the severe DMD phenotype in males. Immunostaining analyses have revealed that dystrophin is either partially or completely absent in muscle tissue from DMD patients [6]. However, when the mutations in DMD are in-frame, leading to a dystrophin that is shorter but retains some partial functionality or 10% to 40% of the normal dystrophin expression, the resulting phenotype is milder, highly heterogeneous, and is known as BMD [6,7].
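
The distinction between frame-disrupting and in-frame DMD mutations comes down to whether the number of coding bases removed is a multiple of three. The sketch below illustrates that arithmetic only; the function names and the deletion sizes are hypothetical, and real genotype-phenotype prediction has exceptions that this simple rule does not capture.

```python
def preserves_reading_frame(deleted_coding_bases: int) -> bool:
    """True when the number of deleted coding bases is a multiple of three."""
    if deleted_coding_bases <= 0:
        raise ValueError("deletion length must be a positive number of bases")
    return deleted_coding_bases % 3 == 0


def expected_presentation(deleted_coding_bases: int) -> str:
    """Map the frame status of a deletion to the presentation described in the text."""
    if preserves_reading_frame(deleted_coding_bases):
        return "in-frame: shorter, partially functional dystrophin (BMD-like)"
    return "frameshift: little or no dystrophin (DMD-like)"


# Hypothetical deletion sizes, chosen only to exercise both branches.
for size in (150, 173):
    print(size, expected_presentation(size))
```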

7.2.2 X-linked dominant disorders

Similar to autosomal dominant disorders, X-linked dominant disorders (Fig. 7.1B) occur when a defective copy of a gene located on one X chromosome is sufficient to cause the disorder. Interestingly, X-linked dominant families can have twice as many females affected as males, whereas autosomal dominant traits have equal numbers of males and females affected. Unlike X-linked recessive disorders, males are not more likely to be affected by X-linked dominant disorders; rather, the pattern of inheritance depends on which parent is transmitting the X chromosome with the defective gene. If the mother carries the defective copy of the gene on one of her X chromosomes, she is affected and each of her daughters and sons has a 50% chance of being affected, the same as in autosomal dominant disease traits. However, if the father is affected, the defective copy of the gene is present on his single X chromosome and therefore none of his sons will be affected, because the father only transmits his Y chromosome to his sons. On the other hand, all of his daughters will be affected, since they will all inherit the X chromosome with the defective gene from their father. Generally, in dominant X-linked disorders, males with the disorder are more severely affected than females and, in some cases, the phenotype may be lethal during embryonic development or shortly after birth, so only affected females survive.

An interesting example of an X-linked dominant disorder is Rett syndrome (RTT, MIM #312750), one of the most common neurodevelopmental disorders in girls [8,9]. RTT is generally caused by de novo mutations in the MECP2 (methyl-CpG-binding protein 2) gene [10] and occurs almost exclusively in females; other mutational mechanisms affecting MECP2 can lead to related disorders observed in males, such as MECP2 duplication syndrome discussed below [11,12]. RTT follows an X-linked dominant inheritance pattern, which causes males, when viable, to be more severely affected. In females, RTT commonly presents with microcephaly, growth retardation, and muscle hypotonia, later followed by stereotypic hand wringing and flapping, loss of speech, autistic features, ataxia, and seizures [9]. Mapping of the disease-causing gene was not performed with traditional linkage analysis, since almost all cases of RTT are sporadic. Instead, exclusion mapping in families with RTT identified the Xq28 locus [13], and later candidate gene screening identified mutations in MECP2 in patients [10]. MECP2 mutations include nonsense, missense, and frameshifting variants [14] and in most cases are predicted to result in MECP2 loss-of-function (LoF). Conversely, MECP2 duplication syndrome is characterized by intellectual disability, hypotonia, poor or absent speech, epilepsy, and autism [12,15,16]. MECP2 duplication is usually inherited [17], although some de novo cases have been reported [18].
Gain-of-function (GoF) of MECP2 through its duplication on the X chromosome is fully penetrant and gives rise to the syndrome in males. In females, a duplication on one of the X chromosomes generally does not lead to the disorder due to skewed X chromosome inactivation (XCI; explained below), which is responsible for silencing the X chromosome with the duplication. However, some females could still develop psychiatric symptoms like anxiety and depression later in life. This could be due to tissue-specific XCI patterns, which might lead to activation of the X chromosome with the duplication in the brain, in addition to the fact that MECP2 is one of the most dosage-sensitive genes in the brain [12,19]. Triplication of the gene results in a more severe phenotype than observed with duplication [5]. Fragile X syndrome (FXS, MIM #300624) is another example of an X-linked dominant disorder and the most prevalent inherited cause of intellectual disability [20]. FXS arises when the promoter of the fragile X mental retardation 1 gene (FMR1) accumulates over 200 CGG trinucleotide repeats. Individuals carrying 55 to 200 CGG repeats are known as premutation carriers and they present with other late-onset neuropsychiatric phenotypes [20,21]. Expanded CGG repeats get methylated and silence the expression of FMR1, and this silencing has been shown to depend on histone deacetylation. Inhibiting DNA methyltransferases leads to reestablishment of the acetylated marks on the histones at the FMR1 locus and its transcriptional activation [22]. FMR1 codes for the fragile X mental retardation protein (FMRP), an RNA-binding protein involved in regulating RNA stability [23], neuronal development, and synaptic plasticity [24]. The severity of this disorder depends on the levels of FMRP present, which in turn depend on the number of repeats and their methylation status at the gene promoter. 
When FMRP is almost absent, the most severe phenotype is observed, characterized by intellectual disability, cognitive impairments, and a spectrum of other neurodevelopmental phenotypes [20,25].
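The repeat-length ranges given above for FMR1 can be written as a simple classification rule. The sketch below is illustrative only: the function name `classify_fmr1_repeats` and the category labels are hypothetical, only the two ranges stated in the text (55 to 200 repeats, and over 200 repeats) are encoded, and shorter alleles are grouped into a single category because the text does not subdivide them. Methylation status, which the text notes also influences FMRP levels, is ignored.

```python
def classify_fmr1_repeats(cgg_repeats: int) -> str:
    """Classify an FMR1 allele by its CGG repeat count (illustrative sketch)."""
    if cgg_repeats < 0:
        raise ValueError("repeat count cannot be negative")
    if cgg_repeats > 200:
        # Over 200 repeats: the range associated with FXS in the text.
        return "full mutation"
    if cgg_repeats >= 55:
        # 55 to 200 repeats: premutation carriers.
        return "premutation"
    # The text does not subdivide shorter alleles, so neither does this sketch.
    return "below premutation range"


for repeats in (30, 55, 200, 201):
    print(repeats, classify_fmr1_repeats(repeats))
```

A repeat count alone is a coarse summary; in practice the clinical picture also depends on the methylation status of the promoter, as described above.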



7.2.3 X chromosome inactivation as a modifier of X-linked disorders

A major source of variation in the expressivity of X-linked disorders in females is X chromosome inactivation (XCI), also referred to as Lyonization after Mary Lyon, who formulated the Lyon hypothesis of this mechanism [26,27], in which every cell in a female silences one of its two X chromosomes. XCI ensures dosage compensation of X chromosome genes between the sexes, so that females do not express X-linked genes at twice the level seen in males. Females silence one copy of their X chromosomes through a highly organized and complex process that turns the inactive X chromosome into a stable, silent, chromatin-dense structure called a Barr body [28,29]. XCI can be either random or imprinted. Imprinted inactivation is achieved in the preimplantation embryo and consists of inactivating only the paternal X chromosome [30,31]. Conversely, in random inactivation during early embryonic development, either the paternal or the maternal X chromosome is inactivated. Random inactivation is a controlled sequential process that starts with the cell counting the number of X chromosomes present, then deciding on their mutually exclusive destinies: one X to stay active and the other to be silenced [28]. Two complementary noncoding RNAs, Xist and Tsix, control this inactivation [29]. Once XCI begins, the Xist transcript coats, in cis, the X chromosome to be inactivated, leading to its silencing. This is coupled with Tsix silencing on that same X chromosome. The inactivation of the X chromosome is retained through subsequent cell divisions, with Xist coating maintained [32]. It has been suggested that mammals might have evolved this random inactivation mechanism to better cope with mutations on the maternal X chromosome.
Although there is a 50% chance that either X will be silenced, this mechanism is not always completely random, and XCI skewing is often observed (Fig. 7.2). Skewed XCI is evident when female carriers of a recessive X-linked mutation manifest symptoms of the disease or when a dominant X-linked disorder shows different phenotypic severity in different females [33]. Several explanations for this unbalanced inactivation have been proposed. One is selective pressure against cells whose active X chromosome carries a lethal mutation or a mutation that reduces survival. This is seen in females with linear skin defects with multiple congenital anomalies 1 (LSDMCA1, MIM #309801) and incontinentia pigmenti (MIM #308300), where cells expressing the wild-type X chromosome are favored for survival [33–35]. In other cases, a growth advantage is observed in cells expressing a mutation on their active X chromosome. An example of this is some mutations in ABCD1 causing adrenoleukodystrophy (ALD, MIM #300100), a lipid storage disorder that results in neurodegeneration early in childhood in males, while only 15% of females show signs of the disease [33,36]. Skin fibroblasts carrying some ALD mutations showed a proliferative advantage in vitro over those without the mutation, and increased fatty acids in the plasma of heterozygous females suggest that favoring of the X chromosome carrying the mutant allele also happens in vivo and results in disease manifestation [36]. In addition, chromosomal rearrangements or translocations can lead to skewed XCI; for example, females with DMD who have translocations spanning the DMD gene tend to preferentially inactivate the wild-type X chromosome [33,37]. A correlation has also been reported between age and skewed XCI, which is more frequent in older females [33,38].
Another mechanism for skewing is asymmetric splitting of the inner cell mass during monozygotic twinning resulting in discordant female twins [39]. X-inactivation skewing can be detected by coupling whole-genome sequencing (WGS) data and RNA sequencing studies or through a less costly approach of using quantitative reverse transcription polymerase chain reaction (qRT-PCR) for several heterozygous alleles on the X chromosomes [40].
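The expression-based approach to detecting skewing can be sketched as a pooled allelic ratio across heterozygous X-linked sites. This is a minimal illustration under stated assumptions, not the method of the cited study: the function names are hypothetical, the read counts are invented example values, the alleles are assumed to be already phased so that "haplotype A" refers to the same X chromosome at every site, and the 80:20 cutoff for calling skewing is an assumed convention rather than a value given in the text.

```python
from typing import Iterable, Tuple


def xci_ratio(allele_counts: Iterable[Tuple[int, int]]) -> float:
    """Pooled fraction of RNA reads from haplotype A across heterozygous X-linked sites.

    Each tuple is (reads supporting haplotype A, reads supporting haplotype B).
    """
    total_a = 0
    total_b = 0
    for reads_a, reads_b in allele_counts:
        total_a += reads_a
        total_b += reads_b
    total = total_a + total_b
    if total == 0:
        raise ValueError("no informative reads")
    return total_a / total


def is_skewed(ratio: float, threshold: float = 0.8) -> bool:
    """Call skewing when either X accounts for at least `threshold` of expression.

    The default of 0.8 is an assumption made for this sketch.
    """
    return max(ratio, 1.0 - ratio) >= threshold


# Invented example counts at three heterozygous sites.
sites = [(90, 10), (45, 5), (170, 30)]
ratio = xci_ratio(sites)
print(round(ratio, 3), is_skewed(ratio))
```

Pooling reads across sites weights highly expressed genes more heavily; a per-site median would be an alternative design if a few genes dominate the counts.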

Chapter 7 X-linked and mitochondrial disorders

[Figure 7.2 labels: X chromosome inactivation, random or skewed; skewing factors: selective pressure, growth advantage, chromosomal translocations, old age; outcomes: normal, X-linked recessive disorder, X-linked dominant disorder.]

FIGURE 7.2 X chromosome inactivation could occur randomly, but some factors could skew this inactivation: selective pressure favoring survival of cells expressing genes from the wild-type X chromosome, growth advantage of the cells expressing genes from the X chromosome carrying a disease mutation, chromosomal translocations leading to preferred silencing of one X chromosome, and old age leading to a skewed inactivation of the X chromosome in older females. Skewed inactivation can lead to females presenting with a phenotype for an X-linked recessive disorder, where more wild-type X chromosomes are silenced than the ones carrying the mutation, or an X-linked dominant phenotype presenting with different severities due to variable expression from the X chromosome carrying the mutation. Finally, this skewing could occur in normal females without any X-linked mutations.

7.3 Mitochondrial disorders

In addition to the nuclear genome, human cells carry a second genome. Mitochondria, the energy-producing organelles of human cells, have a genome of their own. However, whereas human cells generally harbor a single nuclear genome, they can contain from a few hundred to several thousand mitochondria, each with its own genome.

[Figure 7.3 labels: Mother, Father, Child.]

FIGURE 7.3 Mitochondrial inheritance and heteroplasmy. Offspring only inherit mitochondria from their mothers, and some of the inherited mitochondria could share the same mtDNA or have different mtDNA. Heteroplasmy is the presence of mitochondria with different mitochondrial genomes within the same cell.

Mitochondrial DNA (mtDNA) is exclusively inherited from the mother in the majority of multicellular organisms [41] (Fig. 7.3). During fertilization, even if the sperm contributes some mitochondria, these are selected against; in some mammals, sperm mitochondria are tagged with ubiquitin for degradation [42]. A recent study challenged exclusive maternal mtDNA transmission by reporting that, in some rare cases, paternal mtDNA is transmitted to offspring [43]. However, soon after the publication of this work, other scientists questioned the claim, suggesting the data were insufficient to support it, and paternal mitochondrial inheritance currently remains a controversial idea [44,45]. The mutation rate of mtDNA exceeds that of nuclear DNA [46,47]. This high mutation rate was originally attributed to DNA damage caused by reactive oxygen species (ROS) produced as part of oxidative phosphorylation during cellular respiration [48], but more recent studies found no major role for oxidative damage in the high mtDNA mutation rate [49,50]. In addition, mtDNA mutations have been shown to accumulate with age [51,52], although this might not be true in all tissues [53]. While the exact mechanisms that lead to the accumulation of mutations in mtDNA are not yet completely understood, it is argued that errors in mtDNA replication and repair are the main contributors to mutation accumulation [50]. Mutations accumulating in mtDNA lead to the same cell having mitochondria with different mtDNA, a phenomenon referred to as heteroplasmy (Fig. 7.3) and a form of genetic mosaicism (see Chapter 8: Mosaicism in Rare Disease). In contrast to the tightly regulated segregation of nuclear DNA, mtDNA copies are sorted randomly between daughter cells during cell division, a process known as replicative segregation.
Therefore, when a mutation occurs in mtDNA, the distribution and fraction of mtDNA genomes carrying the mutation will differ among cells due to replicative segregation, leading to different degrees of heteroplasmy. Heteroplasmy is particularly important to consider when the mutations in the mitochondrial genome lead to disease, since clinical manifestations of the disease could depend on the number of mitochondria carrying the mutation relative to the number of normal mitochondria, and on their distribution in specific cells and tissues. For example, mutations in the mitochondrial gene ATPase 6 can lead to Leigh syndrome (LS, MIM #256000), a disease that affects the basal ganglia and brain stem, and eventually results in death [54]. ATPase 6 encodes a subunit of the ATP synthase enzyme, also known as complex V, which converts ADP to ATP during oxidative phosphorylation [55]. Mutations in ATPase 6 are heteroplasmic, and individuals with these mutations will have LS when 90% or more of their mitochondria carry the mutation. However, individuals with heteroplasmy levels between 70% and 90% do not have LS but instead develop a phenotype of neuropathy, ataxia, and retinitis pigmentosa known as NARP syndrome (MIM #551500). Individuals with a mutation load below 70% are often asymptomatic [54,56]. The presence and contribution of heteroplasmy in mitochondrial disorders complicates the estimation of recurrence risk for offspring of heteroplasmic females, as this depends on the fraction of mitochondria carrying the pathogenic variant that is passed to the offspring through replicative segregation. In contrast, some mtDNA mutations are homoplasmic, meaning that a particular mutation is present in all, or nearly all, of the mitochondria and is inherited by the next generation from the mother. When a homoplasmic mutation is associated with a disease and is carried by the maternal mitochondria, the children will inherit the mutation but not necessarily the disease, demonstrating that a lot remains to be understood in mitochondrial disease genetics [57,58].
An example of this is Leber hereditary optic neuropathy (LHON, MIM #535000), a mitochondrial disease that affects the optic nerve and eventually leads to loss of vision in both eyes in young adults. In most patients, the disease is caused by homoplasmic mtDNA mutations in the NADH dehydrogenase genes ND1, ND4, and ND6. When carried by maternal mtDNA, the mutation in these genes is inherited by all the children; however, not all of them may show symptoms of the disease [57,59–61]. Mitochondrial disorders can affect all tissues, but they are particularly common in, and more severely affect, tissues with high energy demand such as the brain, heart, and muscles [56,62]. These tissues are also especially sensitive to the heteroplasmic mutation load and have a lower threshold at which clinical manifestations occur [56]. Mutations in mtDNA can be identified through next-generation sequencing approaches. Whole-exome sequencing (WES) can be used for mtDNA disease diagnosis, since mtDNA may be detected as off-target reads with high recall and precision [63]. However, when heteroplasmy levels are very low, it is challenging to detect these mtDNA mutations with WES. Each cell contains many copies of mtDNA, allowing for very high depth of sequencing with WGS (1000- to 2000-fold) and enabling even very low levels of mtDNA heteroplasmy (0.5%) to be detected [64,65]. WGS can therefore be used to diagnose mtDNA disorders and has recently played an essential role in the identification of three new mitochondrial disease-associated genes: COX6A1, which when mutated leads to a recessive axonal or mixed form of Charcot-Marie-Tooth disease [66]; TIMMDC1, where a frameshift mutation in intron 5 of this gene was identified in association with hypotonia, severe neurological damage, and developmental delay [67]; and COQ5, where duplications of this gene cause cerebellar ataxia, seizures, encephalopathy, and cognitive disability [68].
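The link between sequencing depth and the ability to detect low-level heteroplasmy can be illustrated with a simple binomial sampling model. This is a rough sketch under stated assumptions rather than a description of any variant caller: the function name `prob_detect` is hypothetical, the requirement of at least three variant-supporting reads is an arbitrary choice, the 50-fold comparison depth is invented for contrast, and sequencing error is ignored.

```python
from math import comb


def prob_detect(depth: int, heteroplasmy: float, min_alt_reads: int = 3) -> float:
    """P(at least `min_alt_reads` variant reads) under binomial read sampling."""
    if depth < 0:
        raise ValueError("depth cannot be negative")
    if not 0.0 <= heteroplasmy <= 1.0:
        raise ValueError("heteroplasmy must be between 0 and 1")
    # Probability of seeing fewer than min_alt_reads variant reads.
    p_fewer = sum(
        comb(depth, k) * heteroplasmy**k * (1.0 - heteroplasmy) ** (depth - k)
        for k in range(min(min_alt_reads, depth + 1))
    )
    return 1.0 - p_fewer


# 0.5% heteroplasmy at the mtDNA depth quoted for WGS versus a much lower depth.
for depth in (50, 1000, 2000):
    print(depth, round(prob_detect(depth, 0.005), 3))
```

Under this model, a variant present at 0.5% is expected in only about five reads at 1000-fold depth, which is why depth in the thousands matters for low-level heteroplasmy.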
Alternatively, when a mitochondrial disorder is specifically suspected based on family history, inheritance pattern, and clinical presentation, mitochondrial genome sequencing can be performed by selectively enriching for mtDNA and sequencing at very high coverage to detect heteroplasmy. Mutations in nuclear genes are also known to cause mitochondrial diseases. Cytochrome c oxidase (COX), also known as complex IV, has 10 of its 13 subunits encoded by nuclear genes. Mutations in any of these nuclear genes can cause deficiency of the complex and lead to diseases like Leigh syndrome (LS, MIM #256000) [69,70]. Mutations in the nuclear gene COX10 have been linked to isolated COX deficiency in patients with multiple clinical phenotypes including LS and anemia [71,72]. Additionally, many nuclear genes function in replication, maintenance, and repair of mtDNA, and mutations in these genes are known to lead to intergenomic communication disorders. For example, mtDNA depletion syndromes (MDS), which are autosomal recessive disorders resulting in lower copy number of mtDNA, are caused by mutations in nuclear genes including POLG, MPV17, and TK2, among others [70,73]. Mutations in nuclear genes that function in mtDNA maintenance usually lead to MDS and other disorders like Kearns–Sayre syndrome (MIM #530000), Pearson syndrome (MIM #557000), and progressive external ophthalmoplegia [70,74]. Mutations in nuclear genes can also lead to defects in mtDNA translation, causing respiratory chain deficiency [75]. The integrity of nuclear–mitochondrial intergenomic communication is essential for maintaining healthy and functional mitochondria, and defects in this communication result in severe mitochondrial disorders.
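The effect of replicative segregation on heteroplasmy, and the mutation-load thresholds quoted earlier in this section for ATPase 6, can be sketched with a small simulation. It is a toy model under stated assumptions: all names are hypothetical, every mtDNA copy is assumed to replicate exactly once per division, each daughter cell receives exactly half of the doubled pool chosen at random, and selection and mtDNA turnover are ignored. The category boundaries follow the percentages given in the text.

```python
import random
from typing import List


def classify_atpase6_load(mutant_fraction: float) -> str:
    """Map a mutant mtDNA fraction to the ranges quoted in the text."""
    if mutant_fraction >= 0.90:
        return "Leigh syndrome range"
    if mutant_fraction >= 0.70:
        return "NARP range"
    return "often asymptomatic"


def segregate(n_copies: int, n_mutant: int, generations: int,
              rng: random.Random) -> List[float]:
    """Follow one daughter lineage; return the mutant fraction after each division."""
    if not 0 <= n_mutant <= n_copies:
        raise ValueError("n_mutant must be between 0 and n_copies")
    fractions = []
    for _ in range(generations):
        # Each genome is copied once, then half of the doubled pool is drawn
        # at random (without replacement) into the daughter cell we follow.
        pool = [True] * (2 * n_mutant) + [False] * (2 * (n_copies - n_mutant))
        daughter = rng.sample(pool, n_copies)
        n_mutant = sum(daughter)
        fractions.append(n_mutant / n_copies)
    return fractions


rng = random.Random(1)  # fixed seed so the run is repeatable
for lineage in range(3):
    final = segregate(n_copies=100, n_mutant=80, generations=50, rng=rng)[-1]
    print(lineage, final, classify_atpase6_load(final))
```

Starting several lineages from the same 80% load and following them through repeated divisions shows how daughter cells can drift toward different mutation loads, which is the behavior that complicates recurrence-risk estimates for heteroplasmic mothers.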

7.4 Outlook

Despite the significant progress in our understanding of the genetic mechanisms of disease over the past decade, especially in uncovering causative genes and variants for Mendelian disorders, many disease traits remain to be molecularly characterized and understood from the standpoint of perturbations in biological homeostasis. Furthermore, it remains challenging to attribute a clear underlying genetic contribution or cause to many complex disorders where the interplay between multiple genes or between genes and the environment is more difficult to unravel. X-linked disorders pose their own challenges, including variable severity of disease phenotypes in females and males and XCI skewing in females with variable contribution to disease severity. Mitochondrial disorders have challenges of their own due to the presence of multiple mitochondrial genomes per cell, heteroplasmy, differences in the number of mitochondria required in different tissues, the interplay between nuclear- and mitochondria-encoded genes, and the way combinations of these factors influence disease presentation and severity. Advances in sequencing techniques, like single-cell WGS and single-cell RNA sequencing, are playing a pivotal role in addressing these challenges and further understanding these disorders. With the rapid advances in genomic technologies, disorders that were previously considered as one are being dissected into multiple genetic subtypes, each having its own specific molecular contributors and disrupted mechanisms, paving the road for targeted therapeutic approaches for each subtype. In other cases, genomic technologies are identifying common pathways affected in multiple disorders, providing more insight into the origin of diseases and the impacted biological processes. Finally, advances in genomic technologies have been driving our understanding of the genetics of inherited disorders, including X-linked and mitochondrial disorders, and will lead the way in addressing the remaining challenges associated with these disorders.

References
[1] Rogaev EI, Grigorenko AP, Faskhutdinova G, Kittler EL, Moliaka YK. Genotype analysis identifies the cause of the “royal disease”. Science 2009;326(5954):817.
[2] Lakich D, Kazazian Jr HH, Antonarakis SE, Gitschier J. Inversions disrupting the factor VIII gene are a common cause of severe haemophilia A. Nat Genet 1993;5(3):236–41.
[3] Deeb SS. The molecular basis of variation in human color vision. Clin Genet 2005;67(5):369–77.
[4] Nathans J, Thomas D, Hogness DS. Molecular genetics of human color vision: the genes encoding blue, green, and red pigments. Science 1986;232(4747):193–202.
[5] Carvalho CM, Ramocki MB, Pehlivan D, et al. Inverted genomic segments and complex triplication rearrangements are mediated by inverted repeats in the human genome. Nat Genet 2011;43(11):1074–81.
[6] Hoffman EP, Fischbeck KH, Brown RH, et al. Characterization of dystrophin in muscle-biopsy specimens from patients with Duchenne’s or Becker’s muscular dystrophy. N Engl J Med 1988;318(21):1363–8.
[7] Coote D, Davis MR, Cabrera M, Needham M, Laing NG, Nowak KJ. Clinical utility gene card for: Becker muscular dystrophy. Eur J Hum Genet 2018;26(7):1065–71.
[8] Hagberg B. Rett’s syndrome: prevalence and impact on progressive severe mental retardation in girls. Acta Paediatr Scand 1985;74(3):405–8.
[9] Chahrour M, Zoghbi HY. The story of Rett syndrome: from clinic to neurobiology. Neuron 2007;56(3):422–37.
[10] Amir RE, Van den Veyver IB, Wan M, Tran CQ, Francke U, Zoghbi HY. Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat Genet 1999;23(2):185–8.
[11] Van Esch H, Bauters M, Ignatius J, et al. Duplication of the MECP2 region is a frequent cause of severe mental retardation and progressive neurological symptoms in males. Am J Hum Genet 2005;77(3):442–53.
[12] Ramocki MB, Peters SU, Tavyev YJ, et al. Autism and other neuropsychiatric symptoms are prevalent in individuals with MeCP2 duplication syndrome. Ann Neurol 2009;66(6):771–82.
[13] Ellison KA, Fill CP, Terwilliger J, et al. Examination of X chromosome markers in Rett syndrome: exclusion mapping with a novel variation on multilocus linkage analysis. Am J Hum Genet 1992;50(2):278–87.
[14] Christodoulou J, Grimm A, Maher T, Bennetts B. RettBASE: the IRSA MECP2 variation database-a new mutation database in evolution. Hum Mutat 2003;21(5):466–72.
[15] Friez MJ, Jones JR, Clarkson K, et al. Recurrent infections, hypotonia, and mental retardation caused by duplication of MECP2 and adjacent region in Xq28. Pediatrics 2006;118(6):e1687–95.
[16] Smyk M, Obersztyn E, Nowakowska B, et al. Different-sized duplications of Xq28, including MECP2, in three males with mental retardation, absent or delayed speech, and recurrent infections. Am J Med Genet B Neuropsychiatr Genet 2008;147B(6):799–806.
[17] Yon DK, Park JE, Kim SJ, Shim SH, Chae KY. A sibship with duplication of Xq28 inherited from the mother; genomic characterization and clinical outcomes. BMC Med Genet 2017;18(1):30.
[18] Grasshoff U, Bonin M, Goehring I, et al. De novo MECP2 duplication in two females with random X-inactivation and moderate mental retardation. Eur J Hum Genet 2011;19(5):507–12.
[19] Collins AL, Levenson JM, Vilaythong AP, et al. Mild overexpression of MeCP2 causes a progressive neurological disorder in mice. Hum Mol Genet 2004;13(21):2679–89.
[20] Saldarriaga W, Tassone F, Gonzalez-Teshima LY, Forero-Forero JV, Ayala-Zapata S, Hagerman R. Fragile X syndrome. Colomb Med (Cali) 2014;45(4):190–8.
[21] Bagni C, Tassone F, Neri G, Hagerman R. Fragile X syndrome: causes, diagnosis, mechanisms, and therapeutics. J Clin Invest 2012;122(12):4314–22.

[22] Coffee B, Zhang F, Warren ST, Reines D. Acetylated histones are associated with FMR1 in normal but not fragile X-syndrome cells. Nat Genet 1999;22(1):98–101.
[23] Chen L, Yun SW, Seto J, Liu W, Toth M. The fragile X mental retardation protein binds and regulates a novel class of mRNAs containing U rich target sequences. Neuroscience 2003;120(4):1005–17.
[24] Darnell JC, Van Driesche SJ, Zhang C, et al. FMRP stalls ribosomal translocation on mRNAs linked to synaptic function and autism. Cell 2011;146(2):247–61.
[25] Fragile X syndrome: diagnosis, treatment, and research. 3rd ed.; 2002.
[26] Lyon MF. Gene action in the X-chromosome of the mouse (Mus musculus L.). Nature 1961;190:372–3.
[27] Harper PS. Mary Lyon and the hypothesis of random X chromosome inactivation. Hum Genet 2011;130:169–74.
[28] Boumil RM, Lee JT. Forty years of decoding the silence in X-chromosome inactivation. Hum Mol Genet 2001;10(20):2225–32.
[29] Ahn J, Lee J. X chromosome: X inactivation. Nat Educ 2008;1(1):24.
[30] Huynh KD, Lee JT. X-chromosome inactivation: a hypothesis linking ontogeny and phylogeny. Nat Rev Genet 2005;6(5):410–18.
[31] Okamoto I, Heard E. The dynamics of imprinted X inactivation during preimplantation development in mice. Cytogenet Genome Res 2006;113(1-4):318–24.
[32] Panning B. X-chromosome inactivation: the molecular basis of silencing. J Biol 2008;7(8):30.
[33] Van den Veyver IB. Skewed X inactivation in X-linked disorders. Semin Reprod Med 2001;19(2):183–91.
[34] Lindsay EA, Grillo A, Ferrero GB, et al. Microphthalmia with linear skin defects (MLS) syndrome: clinical, cytogenetic, and molecular characterization. Am J Med Genet 1994;49(2):229–34.
[35] Migeon BR, Axelman J, Jan de Beur S, Valle D, Mitchell GA, Rosenbaum KN. Selection against lethal alleles in females heterozygous for incontinentia pigmenti. Am J Hum Genet 1989;44(1):100–6.
[36] Migeon BR, Moser HW, Moser AB, Axelman J, Sillence D, Norum RA. Adrenoleukodystrophy: evidence for X linkage, inactivation, and selection favoring the mutant allele in heterozygous cells. Proc Natl Acad Sci USA 1981;78(8):5066–70.
[37] Richards CS, Watkins SC, Hoffman EP, et al. Skewed X inactivation in a female MZ twin results in Duchenne muscular dystrophy. Am J Hum Genet 1990;46(4):672–81.
[38] Sharp A, Robinson D, Jacobs P. Age- and tissue-specific variation of X chromosome inactivation ratios in normal women. Hum Genet 2000;107(4):343–9.
[39] Lupski JR, Garcia CA, Zoghbi HY, Hoffman EP, Fenwick RG. Discordance of muscular dystrophy in monozygotic female twins: evidence supporting asymmetric splitting of the inner cell mass in a manifesting carrier of Duchenne dystrophy. Am J Med Genet 1991;40(3):354–64.
[40] Shvetsova E, Sofronova A, Monajemi R, et al. Skewed X-inactivation is common in the general female population. Eur J Hum Genet 2019;27(3):455–65.
[41] Hutchison 3rd CA, Newbold JE, Potter SS, Edgell MH. Maternal inheritance of mammalian mitochondrial DNA. Nature 1974;251(5475):536–8.
[42] Sutovsky P, Moreno RD, Ramalho-Santos J, Dominko T, Simerly C, Schatten G. Ubiquitin tag for sperm mitochondria. Nature 1999;402(6760):371–2.
[43] Luo S, Valencia CA, Zhang J, et al. Biparental inheritance of mitochondrial DNA in humans. Proc Natl Acad Sci USA 2018;115(51):13039–44.
[44] Lutz-Bonengel S, Parson W. No further evidence for paternal leakage of mitochondrial DNA in humans yet. Proc Natl Acad Sci USA 2019;116(6):1821–2.
[45] Luo S, Valencia CA, Zhang J, et al. Reply to Lutz-Bonengel et al.: biparental mtDNA transmission is unlikely to be the result of nuclear mitochondrial DNA segments. Proc Natl Acad Sci USA 2019;116(6):1823–4.

[46] Marcelino LA, Thilly WG. Mitochondrial mutagenesis in human cells and tissues. Mutat Res 1999;434(3):177–203.
[47] Khrapko K, Coller HA, Andre PC, Li XC, Hanekamp JS, Thilly WG. Mitochondrial mutational spectra in human cells and tissues. Proc Natl Acad Sci USA 1997;94(25):13798–803.
[48] Richter C, Park JW, Ames BN. Normal oxidative damage to mitochondrial and nuclear DNA is extensive. Proc Natl Acad Sci USA 1988;85(17):6465–7.
[49] Itsara LS, Kennedy SR, Fox EJ, et al. Oxidative stress is not a major contributor to somatic mitochondrial DNA mutations. PLoS Genet 2014;10(2):e1003974.
[50] Szczepanowska K, Trifunovic A. Different faces of mitochondrial DNA mutators. Biochim Biophys Acta 2015;1847(11):1362–72.
[51] Kadenbach B, Munscher C, Frank V, Muller-Hocker J, Napiwotzki J. Human aging is associated with stochastic somatic mutations of mitochondrial DNA. Mutat Res 1995;338(1-6):161–72.
[52] Kennedy SR, Salk JJ, Schmitt MW, Loeb LA. Ultra-sensitive sequencing reveals an age-related increase in somatic mitochondrial mutations that are inconsistent with oxidative damage. PLoS Genet 2013;9(9):e1003794.
[53] Pallotti F, Chen X, Bonilla E, Schon EA. Evidence that specific mtDNA point mutations may not accumulate in skeletal muscle during normal human aging. Am J Hum Genet 1996;59(3):591–602.
[54] Tatuch Y, Christodoulou J, Feigenbaum A, et al. Heteroplasmic mtDNA mutation (T----G) at 8993 can cause Leigh disease when the percentage of abnormal mtDNA is high. Am J Hum Genet 1992;50(4):852–8.
[55] Lenaz G, Baracca A, Carelli V, D’Aurelio M, Sgarbi G, Solaini G. Bioenergetics of mitochondrial diseases associated with mtDNA mutations. Biochim Biophys Acta 2004;1658(1-2):89–94.
[56] Schmiedel J, Jackson S, Schafer J, Reichmann H. Mitochondrial cytopathies. J Neurol 2003;250(3):267–77.
[57] Taylor RW, Turnbull DM. Mitochondrial DNA mutations in human disease. Nat Rev Genet 2005;6(5):389–402.
[58] Chial H, Craig J. mtDNA and mitochondrial diseases. Nat Educ 2008;1(1):2017.
[59] Wallace DC, Singh G, Lott MT, et al. Mitochondrial DNA mutation associated with Leber’s hereditary optic neuropathy. Science 1988;242(4884):1427–30.
[60] Howell N, Bindoff LA, McCullough DA, et al. Leber hereditary optic neuropathy: identification of the same mitochondrial ND1 mutation in six pedigrees. Am J Hum Genet 1991;49(5):939–50.
[61] Johns DR, Neufeld MJ, Park RD. An ND-6 mitochondrial DNA mutation associated with Leber hereditary optic neuropathy. Biochem Biophys Res Commun 1992;187(3):1551–7.
[62] Meyers DE, Basha HI, Koenig MK. Mitochondrial cardiomyopathy: pathophysiology, diagnosis, and management. Tex Heart Inst J 2013;40(4):385–94.
[63] Stenton SL, Prokisch H. Genetics of mitochondrial diseases: Identifying mutations to help diagnosis. EBioMedicine 2020;56:102784.
[64] Duan M, Chen L, Ge Q, et al. Evaluating heteroplasmic variations of the mitochondrial genome from whole genome sequencing data. Gene 2019;699:145–54.
[65] Schon KR, Ratnaike T, van den Ameele J, Horvath R, Chinnery PF. Mitochondrial diseases: a diagnostic revolution. Trends Genet 2020;36(9):702–17.
[66] Tamiya G, Makino S, Hayashi M, et al. A mutation of COX6A1 causes a recessive axonal or mixed form of Charcot-Marie-Tooth disease. Am J Hum Genet 2014;95(3):294–300.
[67] Kremer LS, Bader DM, Mertes C, et al. Genetic diagnosis of Mendelian disorders via RNA sequencing. Nat Commun 2017;8:15824.
[68] Malicdan MCV, Vilboux T, Ben-Zeev B, et al. A novel inborn error of the coenzyme Q10 biosynthesis pathway: cerebellar ataxia and static encephalomyopathy due to COQ5 C-methyltransferase deficiency. Hum Mutat 2018;39(1):69–79.

[69] Shoubridge EA. Cytochrome c oxidase deficiency. Am J Med Genet 2001;106:46–52.
[70] Goldstein AC, Bhatia P, Vento JM. Mitochondrial disease in childhood: nuclear encoded. Neurotherapeutics 2013;10:212–26.
[71] Valnot I, von Kleist-Retzow JC, Barrientos A, et al. A mutation in the human heme A:farnesyltransferase gene (COX10) causes cytochrome c oxidase deficiency. Hum Mol Genet 2000;9:1245–9.
[72] Antonicka H, Leary SC, Guercin GH, et al. Mutations in COX10 result in a defect in mitochondrial heme A biosynthesis and account for multiple, early-onset clinical phenotypes associated with isolated COX deficiency. Hum Mol Genet 2003;12(20):2693–702.
[73] Copeland WC. Inherited mitochondrial diseases of DNA replication. Annu Rev Med 2008;59:131–46.
[74] Goldstein A., Falk M.J. Mitochondrial DNA deletion syndromes. In: Adam M.P., Ardinger H.H., Pagon R.A., Wallace S.E., Bean L.J.H., Amemiya A., editors. Seattle (WA); 2003.
[75] Kemp JP, Smith PM, Pyle A, et al. Nuclear factors involved in mitochondrial translation cause a subgroup of combined respiratory chain deficiency. Brain 2011;134(Pt 1):183–95.


CHAPTER 8

Mosaicism in rare disease

Bracha Erlanger Avigdor1, Ikeoluwa A. Osei-Owusu1,2 and Jonathan Pevsner1,2,3
1 Department of Neurology, Kennedy Krieger Institute, Baltimore, MD, United States
2 Department of Human Genetics, Johns Hopkins School of Medicine, Baltimore, MD, United States
3 Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, United States

8.1 Introduction

Mosaicism is the occurrence of two or more genomes in the body of an individual derived from a single zygote. This definition distinguishes mosaicism from chimerism, the condition in which an individual has two distinct, complete genomes in his or her body that are derived from two zygotes. Mosaicism may occur in somatic cells as a postzygotic process, and is often referred to as “somatic mosaicism.” It may also occur as germline mosaicism. The German evolutionary biologist August Weismann (1834–1914) first made the distinction between germplasm (in germ cells of the reproductive organs), which transmits hereditary information, and somatic cells (all other cells in the body) [1]. By the 1920s examples of mosaicism had been recorded in plants (such as mutants having patches of wild-type color) and in Drosophila (such as evidence of the mosaic loss of the X chromosome in mitosis) [2]. When germline mosaicism (also called gonadal mosaicism) occurs in germ cells, mutations may be transmitted to offspring and lead to constitutive expression (Fig. 8.1, arrow A). For example, gonadal mosaicism may be a cause of recurrent miscarriage due to an aneuploidy confined to sperm or egg [3]. However, most mosaic variation occurs postzygotically and is referred to as somatic mosaicism. In this case, daughter cells harbor one mutated allele (Fig. 8.1, arrow B). The cleavage-stage embryo consists of 4–8 cells, and the consequence of a mosaic mutation at such an early time of development is likely to be more pervasive than at later times. Mechanistically, chromosomal mosaicism occurs in several ways [4]. (1) In nondisjunction, sister chromatids fail to separate during mitosis. This creates one cell with a single copy of the chromosome (monosomy) and one with three copies (trisomy). In a preimplantation embryo, mosaic cells may form that affect the entire body.
(2) In anaphase lagging, a single chromatid may fail to separate because of improper spindle formation, leading to monosomy in one cell and disomy in the other. Depending on the timing of this event, the result may be general mosaicism across the body or confined placental mosaicism (introduced in Section 8.4). (3) Endoreplication results in the gain of a single chromosome. This occurs when a chromosome replicates without division. (4) Mosaicism can also occur without a change in chromosome number (that is, without aneuploidy). This occurs in uniparental disomy (UPD), in which both copies of a chromosome are derived from one parent. If the chromosomes are the same, the condition is uniparental isodisomy, and the affected chromosome or chromosomal segment is homozygous. If two different chromosomes are present, derived from one parent, the condition is uniparental heterodisomy. UPD is associated with a series of disorders including those that involve imprinted genes [5].

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00003-X © 2021 Elsevier Inc. All rights reserved.

[Figure 8.1 labels: Sperm; Egg; Zygote (fertilized egg); A, germline mosaicism; B, early postzygotic mutation; C, postzygotic mutation in later development; D, independent mosaic variant; E, second-hit mosaicism; F, revertant mosaicism.]

FIGURE 8.1 Germline and somatic mosaicism. Germ cells (egg and sperm) are indicated at top; if mutations occur in these gametes (arrow A), gonadal mosaicism occurs in which the parents appear phenotypically normal and a mutation is transmitted to the offspring. A postzygotic mutation occurring in the first mitosis (arrow B) may be transmitted to daughter cells and may affect half the cells in the affected individual (bottom, orange cells). If a mutation occurs relatively later (arrow C, green cells), relatively fewer cells may be affected in the individual. Some cells harboring a mosaic mutation may undergo selective advantage or disadvantage. A somatic mutation may be lethal to a cell (arrow D, red cell). In some cases a second mosaic mutation occurs (arrow E, purple cell) and in combination the initial and second mutations may cause a disease phenotype. Revertant mosaicism (arrow F) results in daughters of a cell with a mutation reverting to wild type. The group of cells at the bottom schematically represents a later developmental stage.

When a postzygotic mutation occurs later in development, fewer mutant cells are ultimately produced in the proband (Fig. 8.1, arrow C). The stage of embryonic development in which mosaicism arises may govern the clinical outcome of associated genetic disorders. A meiotic error would become mosaic only if partially rescued at the preimplantation stages (cleavage and blastocyst). At these stages a mitotic error would give rise to mosaicism at a high rate (e.g., 15%–90%) with potential lethality or severe clinical outcome. Indeed, a survey of 815 embryos at the cleavage stage found a high prevalence of diploid–aneuploid mosaicism (59%–73%) [6]. In the blastocyst stage the rate of mosaicism is lower (16%–30%) in both the trophectoderm, which will give rise to the fetal supporting structures (including the placenta), and the inner cell mass, which will give rise to the fetus. In the postimplantation stage, confined placental mosaicism occurs at a much lower rate [as low as 6% from chorionic villus sampling (CVS) at 10–12 weeks], and in many cases the abnormalities are autosomal trisomies [3].

8.1.1 Rate of somatic mutations

It has been shown that the rate of mutation in somatic cells is substantially higher than that seen in the germline [7]. This is an inevitable outcome of a natural process of mutagenesis that occurs during a lifetime and is attributed to two main factors. First, cell division is susceptible to replication errors, including mistakes by error-prone DNA polymerases, failures of DNA repair mechanisms, and mitotic recombination. These errors give rise to several types of variants, from point mutations to loss of heterozygosity. Such mutations are rarely detected because the cell's sophisticated DNA repair machinery corrects most of them. Mutations are also essential to promote diversity but, over time, may arise in DNA repair genes (most famously BRCA1, BRCA2, and TP53) or in an oncogene or a tumor-suppressor gene. Such changes may increase the mutation rate as much as 10-fold. Second, a multitude of environmental factors are deemed carcinogenic. These include exposure to ultraviolet light, ionizing radiation, asbestos, polycyclic aromatic hydrocarbons from combustion of organic material (e.g., open fire cooking, vehicles, and industrial processes), and nitrosamines from tobacco smoking. Each of these sources may cause DNA damage by intercalation with DNA, causing double-stranded breaks and mutations in cancer-causing genes [8,9].

The consequences of mosaic events are variable [4]. In some cases a mosaic mutation may be lethal to a cell (Fig. 8.1, arrow D) or may confer a selective growth disadvantage (or advantage). This may also depend on the genetic background. In 1971 Knudson proposed a two-hit model of cancer in which, for dominantly inherited retinoblastoma, one mutation is inherited in the germline and a second, somatic mutation then leads to disease [10]. It is also possible that two mosaic variants are required to cause a disease phenotype (Fig. 8.1, arrow E). Another mechanism is revertant mosaicism (Fig. 8.1, arrow F), in which cells spontaneously undergo back mutations to correct errors [11,12].

This chapter reviews the current strategies to detect mosaic variation and describes the role of mosaic variation in rare diseases, including mosaic single nucleotide variants (SNVs), mosaic aneuploidy, and mosaicism in cancer.

8.2 Strategies/technologies to identify mosaic variation

Our knowledge of the types of mosaic variation has improved as increasingly sensitive techniques have been introduced. The most common methods to detect mosaicism are depicted in Fig. 8.2.


FIGURE 8.2 Approaches to detecting mosaic variation in disease. (A) Mosaicism may be evident by clinical observation, such as the occurrence of capillary hemangioma or port-wine birthmarks. (B) Karyotyping enables microscopic visualization and ordering of the complement of chromosomes, typically from metaphase spreads. Mosaicism is indicated when one set of observations is apparently normal and other cells display abnormalities. (C) SNP arrays provide data on genotype (B allele frequency, top panel) and copy number (log R ratio, bottom panel). Mosaic events may be detected by deviations of BAF from 50% for heterozygous calls in the absence of copy number changes. (D) Next-generation sequencing can be used to detect mosaic single nucleotide variants and mosaic structural variants [13]. Here each gray horizontal line is a sequence read spanning some chromosomal position (e.g., ~100 base pairs of genomic DNA; x-axis). A mosaic variant candidate (black; right arrow) is called for some fraction of reads (here ~20%). A neighboring, proximal SNP defines two parental haplotypes (green and not green; left arrow). The mosaic variant is phased to the neighboring SNP, providing strong supporting evidence that it is a true positive. Three haplotypes are defined: haplotype 1 (corresponding to the position of the SNP shaded green); haplotype 2 (from the other parent); and new haplotype 3 corresponding to the mosaic variant. BAF, B allele frequency; SNP, single nucleotide polymorphism. From (A) https://history.nih.gov/exhibits/rodbell/images/MA_syndrome.jpg; (B) https://ghr.nlm.nih.gov/primer/mutationsanddisorders/chromosomalconditions; (C) authors' laboratory; (D) adapted from Freed D, Pevsner J. The contribution of mosaic variants to autism spectrum disorder. PLoS Genet 2016;12(9):e1006245.


8.2.1 Clinical observation

Some early descriptions of mosaicism relied on clinical observation, such as the occurrence of visible changes to skin pigmentation or hair. This was seen in patients with neurocutaneous disorders [14] as well as in organisms such as sheep that exhibit mosaic changes in fleece [15]. The changes may appear random or may follow striking patterns such as the lines of Blaschko, a pattern of lines on the skin that follow the pathways of embryonic cell migration (Fig. 8.3).

8.2.2 Cytogenetics

The first correct description of the number of human chromosomes came in 1956 when Tjio and Levan applied improved techniques for karyotyping [17]. By 1959 Lejeune and colleagues reported that Down syndrome (DS) was caused by the trisomy of a group G chromosome, soon confirmed to be chromosome 21 [18]. Then by 1961 two groups reported the occurrence of mosaic trisomy 21, in which lymphocytes from DS individuals contained a mixture of trisomic and euploid cells [19,20]. From the 1960s to the present, cytogenetics (reviewed in Chapter 2: Karyotyping as the First Genomic Approach) has continued to provide fundamental evidence for mosaicism (see Section 8.4). A limitation when detecting mosaicism is that karyotyping is typically performed on metaphase spreads, often analyzing ~30 cells, so levels of mosaicism below ~3% are difficult to detect. An alternative is to use interphase spreads, in which several hundred cells can be measured.
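The dependence of detection on the number of cells counted can be illustrated with a simple calculation. The sketch below is not taken from the chapter. It assumes that cells are sampled independently and that the cultured cells are representative of the tissue (the function name is hypothetical). Under those assumptions, the chance of seeing at least one abnormal cell among n counted cells is 1 - (1 - f)^n, where f is the true fraction of abnormal cells.

```python
def prob_detect_mosaicism(fraction_abnormal, cells_counted):
    """Probability of observing at least one abnormal cell.

    Assumes independent sampling of cells (an idealization; culture
    selection and tissue differences are ignored).
    """
    return 1 - (1 - fraction_abnormal) ** cells_counted


# 3% mosaicism with a typical metaphase count versus a larger interphase count
for n in (30, 300):
    print(n, round(prob_detect_mosaicism(0.03, n), 3))
```

With 30 cells, a 3% mosaic cell line would be seen at least once in only about 60% of analyses, and a single abnormal cell is usually not sufficient to call mosaicism. Counting several hundred cells makes detection far more likely under the same assumptions.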

FIGURE 8.3 Patterns of cutaneous mosaicism showing café-au-lait skin pigmentation in fibrous dysplasia/McCune–Albright syndrome (OMIM 174800). (A) Skin lesions in a newborn. Pigmentation patterns follow the developmental lines of Blaschko, reflecting embryonic cell migration patterns. (B) Mosaic pattern on the chest, face, and arm. (C) Lesions on the crease of the buttocks and the nape of the neck [16]. From Boyce AM, Florenzano P, de Castro LF, Collins MT. Fibrous dysplasia/McCune-Albright syndrome. GeneReviews®; 2019. Available from: https://www.ncbi.nlm.nih.gov/books/NBK274564/ [Cited 2020].


In cytogenetic nomenclature, mos (for mosaic) precedes the karyotype designation, with distinct clones separated by a slant (/) symbol [21]. For example, mos 45,X/46,XX indicates a mosaic loss of one copy of the X chromosome. The diploid clone is listed last.
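As an illustration of how such a designation is structured, the following sketch splits a mosaic karyotype string into its clones. It is a deliberately simplified, hypothetical helper rather than a full parser of cytogenetic nomenclature. It assumes each clone is written as a chromosome count, then the sex chromosomes, then optional abnormalities, all separated by commas.

```python
def parse_mosaic_karyotype(designation):
    """Split a designation such as 'mos 45,X/46,XX' into its clones.

    Simplified illustration only: assumes comma-separated fields and
    clones separated by a slant (/).
    """
    text = designation.strip()
    is_mosaic = text.startswith("mos")
    if is_mosaic:
        text = text[3:].strip()
    clones = []
    for clone in text.split("/"):
        fields = [field.strip() for field in clone.split(",")]
        clones.append({
            "chromosome_count": int(fields[0]),
            "sex_chromosomes": fields[1],
            "abnormalities": fields[2:],
        })
    return is_mosaic, clones


is_mosaic, clones = parse_mosaic_karyotype("mos 45,X/46,XX")
print(is_mosaic, clones)
```

For the example in the text, the first clone has 45 chromosomes with a single X, and the last clone is the diploid 46,XX line.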

8.2.3 Array comparative genome hybridization

Comparative genome hybridization (CGH), introduced in 1992, allowed copy number estimation through the labeling of two samples [22]. Array CGH (aCGH) extended this technology to microarrays, allowing the systematic analysis of thousands of bacterial artificial chromosome or other genomic DNA fragments, as reviewed in Chapter 3: Genomic Disorders in the Genomics Era. aCGH is commonly used in cytogenetics centers as a replacement for G-banded karyotyping. In a study of 8794 patient samples analyzed through aCGH, 2596 had potentially pathogenic imbalances and 2% of those cases had mosaic imbalances [23].

Pallister–Killian syndrome (PKS; MIM #601803), a multisystem developmental disorder, provides an example of the usefulness of aCGH. PKS is caused by the mosaic, tissue-specific occurrence of a supernumerary marker isochromosome 12p, or i(12p) [24]. This chromosomal abnormality often manifests at low frequency in phytohemagglutinin-stimulated peripheral blood lymphocytes, necessitating cytogenetic analysis of fibroblasts obtained via skin biopsy. aCGH technology facilitates the detection of even low levels of mosaic i(12p) from DNA obtained from blood.

8.2.4 Single nucleotide polymorphism arrays

High-density single nucleotide polymorphism (SNP) arrays permit the measurement of copy number variation, based on the intensity of hybridization of labeled fragments to DNA immobilized to the surface of an array (see Chapter 3: Genomic Disorders in the Genomics Era). This is typically reported as the log2 ratio (denoted LRR) of intensities of a test sample to a reference panel. Some technologies also allow the measurement of genotypes (heterozygous AB, homozygous AA or BB, and no calls, NC). These are typically represented as B allele frequency (BAF), which ranges from 0 (AA genotype) to 0.5 (AB) to 1 (BB). The occurrence of chromosomal mosaicism is readily identified by deviations of heterozygous calls from BAF values near 0.5. An example is shown in Fig. 8.2, panel C. Genotypes are interpreted in the context of possible copy number changes such as heterozygous deletions (which result in genotypes A/- or B/- and a lack of heterozygous calls) or amplifications such as trisomy (for which BAF values are observed corresponding to the genotypes AAA, AAB, ABB, and BBB).
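The way a mosaic event shifts BAF and LRR can be sketched with an idealized mixture model. The code below is an illustration under stated assumptions, not a description of any array vendor's software: a fraction f of cells carries the event, the remaining cells are normal diploid, signal is exactly proportional to allele dosage, and the site is heterozygous (AB) in normal cells. The function names are hypothetical. Real arrays typically show compressed LRR values.

```python
import math


def mosaic_gain(f):
    """Single-copy gain of the B-bearing homolog in a fraction f of cells."""
    baf = (1 + f) / (2 + f)        # B copies / total copies
    lrr = math.log2((2 + f) / 2)   # total copies relative to diploid
    return baf, lrr


def mosaic_loss(f):
    """Single-copy loss of the A-bearing homolog in a fraction f of cells."""
    baf = 1 / (2 - f)
    lrr = math.log2((2 - f) / 2)
    return baf, lrr


def mosaic_copy_neutral_loh(f):
    """Copy-neutral change to BB (e.g., isodisomy) in a fraction f of cells."""
    baf = (1 + f) / 2
    return baf, 0.0                # copy number is unchanged


for f in (0.0, 0.2, 1.0):
    print(f, mosaic_gain(f), mosaic_loss(f), mosaic_copy_neutral_loh(f))
```

At f = 0 every model returns BAF 0.5 and LRR 0. At f = 1 the gain model gives the nonmosaic trisomy value of 2/3 (genotype ABB). Sites where the other homolog is affected appear as the mirror band at 1 - BAF. A BAF shift without an LRR shift is the signature of a copy-neutral event in this idealized picture.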

8.2.5 Next-generation sequencing to identify mosaic variants

The advent of next-generation sequencing (NGS) technologies (reviewed in Chapter 4: Genomic Sequencing of Rare Diseases) provided increased resolution and enabled the identification of smaller mosaic events and lower levels of mosaicism. Two main categories of mosaicism can be detected through NGS: mosaic SNVs and structural variants (SVs).

One main approach to detecting mosaic SNVs depends on a direct comparison of paired samples (such as tumor and normal). Consider a chromosomal position at which a normal sample has 100 reads calling a cytosine (C) residue, while a paired tumor sample has 80 C residues and 20 thymine (T) residues. This is a candidate mosaic site. A common computational approach is to perform a binomial test to determine whether heterozygous variants deviate from the expected 50% frequency. Several key variables influence the confidence of such a call.

• Read depth. An exome is typically sequenced to an average depth of coverage of 100-fold (denoted 100×), while genomes are often sequenced to 30× average depth. At 30×, two alternate alleles would constitute 2/30 or ~7% minor allele fraction. Adequate read depth is necessary to ensure adequate sensitivity to detect mosaic variants. Furthermore, regions with extremes of read depth (e.g., >95th or <5th percentiles relative to the entire genome) may harbor false-positive calls.
• Sequencing error rates. At a base quality score threshold of Q30, there is an error rate of 1:1000 base calls, while at Q20, there is an error rate of 1:100. There are also mapping quality scores, which may reflect genomic regions at which alignment is problematic.
• Copy number variants (CNVs). In an amplification region a 2:1 ratio of alleles at heterozygous sites can be expected, leading to apparent minor allele frequencies near 33%. Such positions are unlikely to harbor mosaic variants.
• Repetitive DNA including homopolymeric repeats. Homopolymeric repeats, as well as other repeat types, can be difficult to sequence and align: base quality and mapping quality scores may be low. Their occurrence may lead to false-positive calls of mosaic variation. Segmentally duplicated regions span ~5% of the human genome and represent another category of variation that is routinely filtered out and excluded from mosaicism analyses.
• Clusters of mosaic variant candidates. For SNVs, it is highly unlikely that clusters of candidates are true positives. Instead, they likely reflect sequence, alignment, or assembly errors.
• Phasing information. The availability of SNPs in the vicinity of a mosaic variant candidate offers an opportunity for computational validation of that candidate. The SNP defines two copies (maternal and paternal) of the sequence reads. If the mosaic variant is a true positive, it will be assigned to only one of those two parental haplotypes and will define a third haplotype (Fig. 8.2, panel D).
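The binomial test and two of the variables listed above (read depth and base quality) can be made concrete with a short sketch. This is a minimal illustration using only the Python standard library, not the method of any particular variant caller. The function names are hypothetical, and the two-sided P-value follows one common convention (summing the probabilities of all outcomes no more likely than the observed one).

```python
from math import comb


def phred_to_error(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def allele_fraction(alt_reads, depth):
    """Fraction of reads supporting the alternate allele."""
    return alt_reads / depth


def binomial_two_sided_p(alt_reads, depth, p=0.5):
    """Exact two-sided binomial P-value against expected allele fraction p."""
    pmf = [comb(depth, k) * p**k * (1 - p) ** (depth - k)
           for k in range(depth + 1)]
    observed = pmf[alt_reads]
    # Sum outcomes at least as extreme (no more probable) than the observed one
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)


print(phred_to_error(30), phred_to_error(20))  # Q30 and Q20 error rates
print(allele_fraction(2, 30))                  # two alternate reads at 30x
print(binomial_two_sided_p(20, 100))           # 20 T reads out of 100
```

For the example of 20 alternate reads out of 100, the P-value is far below conventional thresholds, so the site deviates from the 50% expected of a germline heterozygous variant. A deviation alone does not establish mosaicism, which is why the copy number, repeat, clustering, and phasing filters described above are applied as well.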

For identification of mosaic SNVs, bioinformatic analyses of NGS data are key, and a variety of software packages have been developed. One, MuTect2, is associated with the best practices pipeline of the Genome Analysis ToolKit (GATK). MuTect2 accepts as input sequence data in the binary sequence alignment/map (BAM) format from matched tumor and normal DNA. Such input data have undergone standard processing steps such as marking of duplicate reads, base quality score recalibration, and local realignment. MuTect2 proceeds in four steps. First, it identifies active regions of possible mosaic variation (based on a simplified somatic genotyping model that employs a log odds score). Second, it assembles haplotypes through local de novo assembly, using a process shared by the GATK germline variant calling workflow. Third, it performs somatic genotyping using a pair-hidden Markov model. Fourth, it performs filtering to distinguish somatic from germline variants or sequencing errors.

Mosaic SVs can also be detected by NGS [25] using approaches similar to those used for identification of constitutive CNVs and other SVs (see Chapter 3: Genomic Disorders in the Genomics Era). One approach is based on counting read depth, analogous to the estimation of mosaic CNVs from SNP array data analysis. Many packages use deviations in allele fractions (extracting BAF from BAM files) and read coverage to detect SVs. For example, for a sample having 100-fold (100×) average depth of coverage across a chromosome, a region having 150× depth might be classified as trisomic. A region of 125× depth might be interpreted as a segmental mosaic amplification. NGS enables additional approaches to SV detection such as the analysis of split reads (in which a read maps to two genomic positions, as may happen at a translocation or inversion breakpoint) and discordant reads (in which read pairs map to separate genomic loci, as occurs in translocations).

Digital polymerase chain reaction (dPCR) is an example of a technology used for absolute quantification of nucleic acids. It is based on partitions, whether physical chambers or liquid droplets, containing a random distribution of molecules (target sequences) that are amplified and detected by fluorescence [26]. Such methods can sensitively detect mosaic variation.
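The read-depth reasoning and the partition-based logic of dPCR can both be written as short calculations. The sketch below rests on simplifying assumptions that are not from the chapter: depth is taken to be proportional to copy number with a diploid baseline, and dPCR targets are taken to be distributed among partitions according to Poisson statistics, the model commonly used for this technology. All function names and the example counts are hypothetical.

```python
import math


def mosaic_gain_fraction(region_depth, mean_depth):
    """Fraction of cells carrying a single-copy gain.

    Assumes depth ratio = (2 + f) / 2 for a diploid baseline.
    """
    return 2 * (region_depth / mean_depth - 1)


def dpcr_copies_per_partition(positive, total):
    """Poisson-corrected mean number of target copies per partition."""
    if positive >= total:
        raise ValueError("all partitions positive: concentration too high")
    return -math.log(1 - positive / total)


def dpcr_variant_fraction(mutant_positive, wildtype_positive, total):
    """Variant allele fraction from separate mutant and wild-type signals."""
    mutant = dpcr_copies_per_partition(mutant_positive, total)
    wildtype = dpcr_copies_per_partition(wildtype_positive, total)
    return mutant / (mutant + wildtype)


print(mosaic_gain_fraction(150, 100))  # consistent with a nonmosaic trisomy
print(mosaic_gain_fraction(125, 100))  # consistent with a gain in half the cells
print(dpcr_variant_fraction(500, 9000, 20000))  # hypothetical droplet counts
```

Under this model the 150× region in the text corresponds to a gain in all cells and the 125× region to a gain in about half of them. In practice, depth varies with GC content and mappability, so callers normalize coverage before drawing such inferences.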

8.2.6 Mosaic disease in ClinVar

An estimate of the number of identified mosaic variants to date can be derived from examining ClinVar, a National Center for Biotechnology Information (NCBI) repository of clinically relevant, interpreted variants. ClinVar currently contains ~503,000 variants, the majority of which are germline (Fig. 8.4). There are about 2900 somatic variants occurring in ~1100 genes, with comparable numbers of somatic and de novo variants (Fig. 8.4, inset). Many factors determine which somatic and de novo data are deposited into ClinVar, including the availability of parental data to call de novo variants and the approach to somatic variant calling. The somatic and de novo variants have comparable numbers of entries across clinical significance categories (Fig. 8.4, x-axis). Pathogenic somatic variants have been reported in 32 genes according to ClinVar (Table 8.1). Most of the ClinVar somatic genes are represented in OMIM, an actively curated database of human genes and inherited traits.

8.3 Mosaic aneuploidy in rare disease

8.3.1 Introduction to aneuploidy

Aneuploidies are changes in chromosomal copy number, such as a change in the complement of human chromosomes from the euploid state of 46 to 45 (monosomy) or 47 (trisomy). While most trisomies cause embryonic lethality, DS (MIM #190685), caused by trisomy 21, is a prominent exception. It occurs in 220,000 live births per year worldwide and is the most common genetic cause of intellectual disability. Trisomies 13 (Patau syndrome) and 18 (Edwards syndrome) can be compatible with life, but only 5%–12% of affected individuals survive to 1 year of age [27]. Trisomy 13 (1 in 22,000 live births), first described in 1960 [28], is characterized by facial dysmorphism, holoprosencephaly, ocular anomalies, and intellectual disability; heart defects are common. Trisomy 18 (1 in 7500 live births), also described in 1960 [29], involves multiple abnormalities including intellectual disability, dysmorphism (involving an abnormally shaped head), and severe congenital heart defects. Among the monosomies, Turner syndrome (45,X) is the most common; it affects females and is associated with a webbed neck, short stature, and heart defects.

Most human autosomal trisomies and monosomies are not compatible with postnatal survival. Many constitutional trisomies are identified in spontaneously aborted fetuses, while fewer aneuploidies occur in liveborns. In 1980 Hassold and colleagues karyotyped 1000 spontaneous abortions, finding 463 with abnormal chromosomal constitution [30]. These included trisomy of all autosomes (except chromosomes 1 and 19), sex chromosome monosomy, triploidy, and tetraploidy. More recent studies report consistent findings [31,32].

[Figure 8.4: bar chart. x-axis: ClinVar significance (benign, likely benign, likely pathogenic, pathogenic, uncertain significance); y-axis: number of variants; series: germline, somatic, de novo. Inset: somatic and de novo variants on an expanded y-axis.]

FIGURE 8.4 Representation of somatic variants in the ClinVar database. Main panel: Somatic and de novo variants represent <1% of the number of annotated germline variants. Somatic and de novo variants are assigned comparable numbers of the functional categories (pathogenic, likely pathogenic, uncertain significance) relative to germline mutations. Inset: There are twice as many de novo variants as somatic variants that are classified as pathogenic. However, there are three times more somatic variants of uncertain significance than de novo variants. There are likely ascertainment biases particular to both somatic variants (for which identification depends on the methods and technologies used) and de novo variants (which require the availability of parent/child trio data). Note that the three categories (germline, somatic, and de novo) are a subset of the ClinVar allelic origin categories, and ClinVar also includes additional clinical significance categories (x-axis).

Table 8.1 32 genes classified by ClinVar as pathogenic somatic variant genes, with OMIM gene-phenotype annotations.

| Gene symbol | pLI | Gene MIM# | Disease | Disease MIM# |
|---|---|---|---|---|
| ASXL1 | 0.00 | 612990 | Bohring–Opitz syndrome; Myelodysplastic syndrome, somatic | 605039; 614286 |
| BAP1 | 0.39 | 603089 | Tumor predisposition syndrome | 614327 |
| BRAF | 1.00 | 164757 | Adenocarcinoma of lung, somatic; Cardiofaciocutaneous syndrome; Colorectal cancer, somatic; LEOPARD syndrome 3; Melanoma, malignant, somatic; Nonsmall cell lung cancer, somatic; Noonan syndrome 7 | 211980; 115150; 613707; 613706 |
| BRCA1 | 0.00 | 113705 | Fanconi anemia, complementation group S; {Breast-ovarian cancer, familial, 1}; {Pancreatic cancer, susceptibility to, 4} | 617883; 604370; 614320 |
| BRCA2 | 0.00 | 600185 | Fanconi anemia, complementation group D1; Wilms tumor; {Breast cancer, male, susceptibility to}; {Breast-ovarian cancer, familial, 2}; {Glioblastoma 3}; {Medulloblastoma}; {Pancreatic cancer 2}; {Prostate cancer} | 605724; 194070; 114480; 612555; 613029; 155255; 613347; 176807 |
| DICER1 | 1.00 | 606241 | GLOW syndrome, somatic mosaic; Goiter, multinodular 1, with or without Sertoli–Leydig cell tumors; Pleuropulmonary blastoma; Rhabdomyosarcoma, embryonal, 2 | 618272; 138800; 601200; 180295 |
| EGFR | 1.00 | 131550 | Inflammatory skin and bowel disease, neonatal, 2; Adenocarcinoma of lung, response to tyrosine kinase inhibitor in; Nonsmall cell lung cancer, response to tyrosine kinase inhibitor in; {Nonsmall cell lung cancer, susceptibility to} | 616069; 211980; 211980; 211980 |
| EGFR-AS1 | N/A | N/A | N/A | N/A |
| ERBB2 | 1.00 | 164870 | Adenocarcinoma of lung, somatic; Gastric cancer, somatic; Glioblastoma, somatic; Ovarian cancer, somatic | 211980; 613659; 137800 |
| FGFR1 | 0.99 | 136350 | Encephalocraniocutaneous lipomatosis, somatic mosaic; Hartsfield syndrome; Hypogonadotropic hypogonadism 2 with or without anosmia; Jackson–Weiss syndrome; Osteoglophonic dysplasia; Pfeiffer syndrome; Trigonocephaly 1 | 613001; 615465; 147950; 123150; 166250; 101600; 190440 |
| FLT3 | 0.61 | 136351 | Leukemia, acute lymphoblastic, somatic; Leukemia, acute myeloid, reduced survival in, somatic; Leukemia, acute myeloid, somatic | 613065; 601626; 601626 |
| GNAQ | 0.98 | 600998 | Capillary malformations, congenital, 1, somatic, mosaic; Sturge–Weber syndrome, somatic, mosaic | 163000; 185300 |
| KIF26B | 1.00 | 614026 | N/A | N/A |
| KRAS | 0.00 | 190070 | Arteriovenous malformation of the brain, somatic; Bladder cancer, somatic; Breast cancer, somatic; Cardiofaciocutaneous syndrome 2; Gastric cancer, somatic; Leukemia, acute myeloid; Lung cancer, somatic; Noonan syndrome 3; Oculoectodermal syndrome, somatic; Pancreatic carcinoma, somatic; RAS-associated autoimmune leukoproliferative disorder; Schimmelpenning–Feuerstein–Mims syndrome, somatic mosaic | 108010; 109800; 114480; 615278; 137215; 601626; 211980; 609942; 600268; 260350; 614470; 163200 |
| LOC107303340 | N/A | N/A | N/A | N/A |
| MEN1 | 1.00 | 613733 | Adrenal adenoma, somatic; Angiofibroma, somatic; Carcinoid tumor of lung; Lipoma, somatic; Multiple endocrine neoplasia 1; Parathyroid adenoma, somatic | 131100 |
| MLH1 | 0.74 | 120436 | Colorectal cancer, hereditary nonpolyposis, type 2; Mismatch repair cancer syndrome; Muir–Torre syndrome | 609310; 276300; 158320 |
| MSH2 | 0.87 | 609309 | Colorectal cancer, hereditary nonpolyposis, type 1; Mismatch repair cancer syndrome; Muir–Torre syndrome | 120435; 276300; 158320 |
| MYO3A | 0.00 | 606808 | Deafness, autosomal recessive 30 | 607101 |
| NF2 | 1.00 | 607379 | Meningioma, NF2-related, somatic; Neurofibromatosis, type 2; Schwannomatosis, somatic | 607174; 101000; 162091 |
| NPM1 | 0.96 | 164040 | Leukemia, acute myeloid, somatic | 601626 |
| PAX2 | 0.12 | 167409 | Glomerulosclerosis, focal segmental, 7; Papillorenal syndrome | 616002; 120330 |
| PDGFRB | 1.00 | 173410 | Basal ganglia calcification, idiopathic, 4; Kosaki overgrowth syndrome; Myeloproliferative disorder with eosinophilia; Myofibromatosis, infantile, 1; Premature aging syndrome, Penttinen type | 615007; 616592; 131440; 228550; 601812 |
| PIK3CA | 1.00 | 171834 | Breast cancer, somatic; CLAPO syndrome, somatic; CLOVE syndrome, somatic; Colorectal cancer, somatic; Cowden syndrome 5; Gastric cancer, somatic; Hepatocellular carcinoma, somatic; Keratosis, seborrheic, somatic; Macrodactyly, somatic; Megalencephaly–capillary malformation–polymicrogyria syndrome, somatic; Nevus, epidermal, somatic; Nonsmall cell lung cancer, somatic; Ovarian cancer, somatic | 114480; 613089; 612918; 114500; 615108; 613659; 114550; 182000; 155500; 602501; 162900; 211980; 167000 |
| PLCB4 | 0.27 | 600810 | Auriculocondylar syndrome 2 | 614669 |
| POLK | 0.00 | 605650 | N/A | N/A |
| PTPN11 | 1.00 | 176876 | LEOPARD syndrome 1; Leukemia, juvenile myelomonocytic, somatic; Metachondromatosis; Noonan syndrome 1 | 151100; 607785; 156250; 163950 |
| SF3B1 | 1.00 | 605590 | Myelodysplastic syndrome, somatic | 614286 |
| SLC35A2 | 0.78 | 314375 | Congenital disorder of glycosylation, type IIm | 300896 |
| SMO | 0.06 | 601500 | Basal cell carcinoma, somatic; Curry–Jones syndrome, somatic mosaic | 605462; 601707 |
| TPP1 | 0.03 | 607998 | Ceroid lipofuscinosis, neuronal, 2; Spinocerebellar ataxia, autosomal recessive 7 | 204500; 609270 |
| VHL | 0.03 | 608537 | Erythrocytosis, familial, 2; Hemangioblastoma, cerebellar, somatic; Pheochromocytoma; Renal-cell carcinoma, somatic; von Hippel–Lindau syndrome | 263400; 171300; 144700; 193300 |

This table includes the corresponding OMIM gene and phenotype entries, when available, as well as pLI scores. Multiple diseases and MIM numbers within a cell are separated by semicolons. 14 of these genes have a pLI score >0.9, indicative of functional constraint and a potential role in disease. ClinVar was accessed in January 2020. pLI, Probability of loss of function intolerance. CLOVE, Congenital lipomatous overgrowth, vascular malformations.
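The count of constrained genes quoted in the table note can be reproduced directly from the pLI column. The sketch below simply transcribes the pLI values of Table 8.1 into a dictionary (genes without a score are recorded as None) and applies the >0.9 threshold. The variable names are illustrative only.

```python
# pLI scores transcribed from Table 8.1; None marks genes listed as N/A
PLI_SCORES = {
    "ASXL1": 0.00, "BAP1": 0.39, "BRAF": 1.00, "BRCA1": 0.00,
    "BRCA2": 0.00, "DICER1": 1.00, "EGFR": 1.00, "EGFR-AS1": None,
    "ERBB2": 1.00, "FGFR1": 0.99, "FLT3": 0.61, "GNAQ": 0.98,
    "KIF26B": 1.00, "KRAS": 0.00, "LOC107303340": None, "MEN1": 1.00,
    "MLH1": 0.74, "MSH2": 0.87, "MYO3A": 0.00, "NF2": 1.00,
    "NPM1": 0.96, "PAX2": 0.12, "PDGFRB": 1.00, "PIK3CA": 1.00,
    "PLCB4": 0.27, "POLK": 0.00, "PTPN11": 1.00, "SF3B1": 1.00,
    "SLC35A2": 0.78, "SMO": 0.06, "TPP1": 0.03, "VHL": 0.03,
}

# Genes whose pLI exceeds the 0.9 threshold used in the table note
constrained = sorted(
    gene for gene, pli in PLI_SCORES.items()
    if pli is not None and pli > 0.9
)
print(len(PLI_SCORES), len(constrained))
print(constrained)
```

The dictionary holds all 32 genes, and 14 of them pass the threshold, matching the figure given in the note.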


8.3.2 General principles of mosaic aneuploidy

While constitutional aneuploidies are often embryonic lethal, they may manifest in the mosaic state, in which only a subset of cells in the body harbor the aneuploidy [4]. Trisomy or other aneuploidies (such as monosomy) of all chromosomes can persist in the mosaic form. Mosaicism for trisomy of chromosomes 8, 9, 14, 20, and 22 is particularly viable [33]. In the Hassold et al. study [30], 8% of the trisomy cases were mosaic. In recent decades a large number of case studies have been reported. Common themes emerge, such as the following.

• Some partial (segmental) trisomies and monosomies are compatible with life and are associated with severe phenotypes, but many others are not compatible with life. Many of these persist in the mosaic form.
• The identical chromosomal complement is expected in the fetus and its placenta. About 2% of viable pregnancies are associated with confined placental mosaicism [34]. In this condition an aneuploid cell line is observed only in the placenta and not in the fetus, therefore representing tissue-specific mosaicism. The outcome may include intrauterine growth restriction. The fetus is usually nonmosaic diploid (i.e., karyotypically normal) but may be nonmosaic aneuploid when the cytotrophoblast is diploid.
• Diagnosis of mosaic trisomy is often by prenatal karyotyping. This is challenging because prenatal screening of extrafetal samples, by CVS or amniocentesis, may reveal an abnormal karyotype due to confined placental mosaicism while the fetus has a normal complement of chromosomes. In other cases, mosaicism is not evident from prenatal screening but is observed in particular cell lineages of the newborn, such as fibroblasts. This presents serious dilemmas in prenatal diagnosis and genetic counseling.
• A postzygotic duplication of one chromosome may result in cell lineages having trisomic cells.
• Trisomic rescue is a mechanism by which UPD may occur. A trisomic zygote loses one chromosome, resulting in a diploid state with either (1) one chromosomal copy from each parent; (2) two different homologs of a chromosomal pair from the same parent, called paternal or maternal uniparental heterodisomy; or (3) paternal or maternal uniparental isodisomy. Trisomic rescue can lead to mosaic trisomy in the embryo.
• There is variability in phenotypic expression even for a given mosaic trisomy. The clinical features of these trisomies have been reviewed for prenatal assessment (at birth or termination) or postnatally [32]. Commonly occurring conditions include intrauterine growth retardation (IUGR), congenital heart defects, craniofacial abnormalities, and postnatal developmental delay. In most cases, no clinically distinct syndrome has been elaborated.
• Higher percentages of cells harboring a mosaic mutation are sometimes associated with more severe clinical phenotypes. In other cases, there is no such association.
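The three outcomes of trisomic rescue listed above can be enumerated explicitly. The sketch below is a toy model under a labeled assumption: each of the three homologs in the trisomic zygote is equally likely to be lost, which is an idealization and not a claim from the chapter. Homologs are represented as hypothetical (parent, homolog identifier) pairs.

```python
def trisomic_rescue_outcomes(homologs):
    """Classify the disomic state after losing each homolog in turn.

    homologs: list of three (parent, homolog_id) tuples.
    """
    outcomes = []
    for lost in range(3):
        kept = [h for i, h in enumerate(homologs) if i != lost]
        parents = {parent for parent, _ in kept}
        if len(parents) == 2:
            label = "biparental disomy"
        elif kept[0][1] == kept[1][1]:
            label = "uniparental isodisomy"
        else:
            label = "uniparental heterodisomy"
        outcomes.append(label)
    return outcomes


# Two different maternal homologs plus one paternal homolog
hetero = [("maternal", "M1"), ("maternal", "M2"), ("paternal", "P1")]
# Two copies of the same maternal homolog plus one paternal homolog
iso = [("maternal", "M1"), ("maternal", "M1"), ("paternal", "P1")]

print(trisomic_rescue_outcomes(hetero))
print(trisomic_rescue_outcomes(iso))
```

In both examples, loss of either maternal copy restores biparental disomy, and loss of the single paternal copy produces maternal UPD. Under the equal-loss assumption this gives UPD in one of three rescues. Whether the UPD is heterodisomy or isodisomy depends on whether the two retained homologs differ.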

Although rare, with some having a dozen or fewer case reports, mosaic trisomies and other mosaic aneuploidies (further reviewed in Refs. [31,32]) have been reported for every human chromosome; these are summarized below.

8.3.3 Mosaic autosome aneuploidies

8.3.3.1 Trisomy 1 mosaicism

Trisomy 1 is an extremely rare condition, with only 12 reported cases of trisomy of the entire chromosome or of 1q. The small number of cases suggests the condition is embryonic lethal, in part because chromosome 1 is the largest human chromosome (and harbors over 2000 protein-coding genes). These cases typically involve an unbalanced translocation with the Y chromosome or the short arm of an acrocentric chromosome (e.g., 14, 15, and 22) [35]. An additional six cases were all mosaic, having karyotypes mos 46,X,der(Y)t(Y;1)(q12;q12)/46,XY or mos 46,X,der(Y)t(Y;1)(q12;q21)/46,XY [35]. These findings were detectable by aCGH of cultured amniocytes, but low-level mosaicism (~20%) would have been challenging to detect without prior karyotyping results. These mosaic cases reflect a recurrent postzygotic event in which the loss of heterochromatic regions on the Y chromosome is not thought to be clinically relevant.

8.3.3.2 Trisomy 2 mosaicism

About 1%–6% of all spontaneous abortions have complete trisomy of chromosome 2 [36]. However, only ~20 cases of prenatally detected mosaic trisomy 2 have been reported [37]. Some cases are associated with hypomelanosis of Ito, a neurocutaneous condition involving hypopigmented anomalies that follow the lines of Blaschko [38].

8.3.3.3 Trisomy 3 mosaicism

To date, only four cases of live births with trisomy 3 mosaicism have been reported [39]. There were several common features such as short stature and prominent forehead, but the distribution of aneuploid cells in different tissues likely impacts the phenotypic expression. No live births have been reported for nonmosaic trisomy 3.

8.3.3.4 Trisomy 4 mosaicism

Four children with trisomy 4 mosaicism have been identified, sharing common features of IUGR, congenital heart defects, low birth weight, skin abnormalities, dysmorphism, and intellectual disability. UPD originating from trisomic rescue has been identified in cases of confined placental mosaicism for chromosome 4 [40].

8.3.3.5 Trisomy 5 mosaicism

Mosaic trisomy 5 is rare but has been identified at amniocentesis. In two patients, the VACTERL association (vertebral defects, anorectal malformations, cardiac defects, tracheoesophageal defects, renal malformations, and limb defects) has been reported [41].

8.3.3.6 Trisomy 6 mosaicism

Mosaic trisomy 6 has been identified in spontaneous abortions and prenatally when confined to extraembryonic tissue such as chorionic cells [42]. Trisomy 6 appears to be incompatible with postnatal survival but has been demonstrated in neoplastic tissues such as Merkel cell carcinoma and squamous cell carcinoma. Patients with heterodisomic and isodisomic UPD of chromosome 6 have been identified prenatally, likely contributing to intrauterine growth restriction and preterm delivery [43]. Of nine cases analyzed, four had mosaicism in lymphocytes.

8.3.3.7 Trisomy 7 mosaicism Prenatal mosaicism of trisomy 7 is evident in CVS as a phenomenon of confined placental mosaicism, usually with a normal karyotype in the fetus. About 10 live-born cases of trisomy 7 mosaicism have been identified.

8.3.3.8 Trisomy 8 mosaicism Mosaicism of trisomy 8 is common relative to those of chromosomes 1 through 7. Over 120 liveborn cases have been reported, with a prevalence of 1:25,000 to 1:50,000 newborns [44]. Trisomy 8 mosaicism syndrome has a variable phenotype including intellectual disability, dysmorphic features such as a prominent forehead and domed occiput, malformations such as agenesis of the corpus callosum, and renal abnormalities.

8.3.3.9 Trisomy 9 mosaicism Over 40 cases of mosaic trisomy 9 have been reported. Features may include growth retardation, intellectual disability, facial dysmorphism, atrial or ventricular septal defects, micrognathia, and microcephaly [45].

8.3.3.10 Trisomy 10 mosaicism Mosaic trisomy 10 has been reported in only 10 cases [46]. In one case, mosaicism was accompanied by maternal UPD in diploid cultured lymphocytes, suggesting trisomy rescue. The phenotypes include growth retardation, dysmorphism, and congenital heart disease.

8.3.3.11 Trisomy 11 mosaicism Trisomy 11 has been detected in spontaneous miscarriages, and only several cases of fetal mosaic trisomy 11 are known. In one case, there was bilateral renal agenesis [47], while acardia (congenital absence of the heart) has also been described. A series of partial trisomy 11 cases have also been described, affecting 11q or 11p. A pair of monozygotic twins each had mosaic trisomy 11p but with discordant phenotypes, attributed to differences in mosaicism at the blastocyst stage [48].

8.3.3.12 Trisomy 12 mosaicism Ten cases of live births of mosaic trisomy 12 have been reported [49,50]. Most cases include IUGR, developmental delay, congenital heart defects, microcephaly, cutaneous spots, and other conditions. Notably, PKS (OMIM 601803) is caused by mosaicism for tetrasomy of chromosome 12p.

8.3.3.13 Trisomy 13 mosaicism While trisomy 13 occurs in ~1:10,000 to 1:20,000 live births, mosaic trisomy 13 accounts for approximately 5% of those cases. The clinical phenotype is variable, a phenomenon that is poorly understood [51].

8.3.3.14 Trisomy 14 mosaicism Trisomy 14 (Temple syndrome; MIM #616222) is an imprinting disorder caused by abnormal expression of genes on chromosome 14q32. It is caused by maternal UPD of chromosome 14 and is compatible with live birth only in the mosaic form. The features include short stature, low birth weight, hypotonia and motor delay, feeding problems, and early puberty. Fewer than 100 cases of trisomy 14 and fewer than 50 cases of mosaic trisomy 14 are known. Trisomy 14 and UPD(14)mat can emerge by the same mechanism of trisomic rescue, but the combination of both those conditions in one individual has been described in only nine cases [52].

8.3.3.15 Trisomy 15 mosaicism Angelman syndrome (MIM #105830) usually results from de novo maternal deletions of chromosome 15q11.2-q13, while ~2% of cases result from paternal UPD of this region. Prader-Willi syndrome (MIM #176270) is caused by deletion of a paternal portion of 15q11-q13. For the entire chromosome, trisomy 15 is rare and accounts for ~8% of all trisomic abortions. The phenotype of mosaic trisomy 15 is severe and can include pre- and postnatal growth retardation, organ malformation, and facial dysmorphisms [53]. In one case a father had a balanced 8;15 reciprocal translocation, while in the fetus the translocation was accompanied by mosaic trisomy 15, with either 46,XY,t(8;15)(q24;q14) or 47,XY,t(8;15)(q24;q14),+15 across different tissues [54].

8.3.3.16 Trisomy 16 mosaicism Complete trisomy 16 is embryonic lethal and occurs more often in male fetuses. It also occurs in 1%–1.5% of all pregnancies, making it the most commonly occurring trisomy in spontaneous abortions [55]. In contrast, mosaic trisomy 16 occurs more often in females [55]. Both mosaic trisomy 16 and confined placental mosaicism for trisomy 16 are associated with variable complications (such as low birth weight and congenital anomalies), but the majority of individuals have normal neurodevelopmental outcome and good health-related quality of life [56].

8.3.3.17 Trisomy 17 mosaicism Several dozen cases of mosaicism for trisomy 17 have been reported [57]. In several cases, mosaicism for this trisomy is associated with severe malformations, while in others, the clinical phenotype of newborns is normal although amniocytes have up to 100% mosaicism.

8.3.3.18 Trisomy 18 mosaicism Complete trisomy 18 syndrome (Edwards syndrome) is the second most common autosomal trisomy [58]. About 5% of these individuals have mosaicism for trisomy 18, associated with a variable clinical phenotype.

8.3.3.19 Trisomy 19 mosaicism Mosaic trisomy 19 is extremely rare, perhaps because chromosome 19 has the highest gene density. A survey of rare trisomy mosaicism across 17 cytogenetic laboratories identified a single case of trisomy 19 mosaicism [59]. That individual was phenotypically normal at birth.

8.3.3.20 Trisomy 20 mosaicism Mosaic trisomy 20 is one of the most commonly identified karyotypes following prenatal sampling [60]. Although common, the risk for association with an abnormal phenotype is low, with more than 90% of cases associated with healthy babies [61]. UPD of chromosome 20 has also been reported, with mild clinical features such as growth delay.

8.3.3.21 Trisomy 21 mosaicism Several reports in 1961 described mosaicism in DS cases [19,20]. In DS, mosaicism manifests as the presence of a mixed population of trisomic and disomic cells. The prevalence of mosaicism has been estimated as 1.3%–5% [31,62–64]. In a study of ~248,000 consecutive postnatal cases karyotyped in one reference laboratory, 7133 (2.9%) had trisomy 21 [65]; of these, 105 cases (1.5%) were mosaic and generally had milder clinical features. It has been speculated that mosaicism affecting neurons could confer increased risk for Alzheimer’s disease [66].
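The two percentages quoted from the reference laboratory study use different denominators, which the short sketch below makes explicit. The counts are those given in the text (the cohort size is approximate); the variable names and the helper `percent` are illustrative only.

```python
# Counts quoted in the text; the cohort size is an approximation (~248,000).
total_karyotyped = 248_000
trisomy21_cases = 7_133
mosaic_cases = 105


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Trisomy 21 as a share of all karyotyped cases.
trisomy21_pct = percent(trisomy21_cases, total_karyotyped)
# Mosaic cases as a share of trisomy 21 cases, not of the whole cohort.
mosaic_pct = percent(mosaic_cases, trisomy21_cases)

print(f"Trisomy 21 among karyotyped cases: {trisomy21_pct:.1f}%")
print(f"Mosaic among trisomy 21 cases: {mosaic_pct:.1f}%")
```

The second figure falls near the lower end of the 1.3%–5% prevalence range cited above.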

8.3.3.22 Trisomy 22 mosaicism Trisomy 22 occurs frequently in spontaneous abortions with an incidence of ~2.9% [30]. Newborns survive only several days. In the mosaic form the clinical features range from normal to severe developmental delay with IUGR, intellectual disability, failure to thrive, and craniofacial asymmetry. In rare cases, mosaic monosomy 22 has been reported [67].

8.3.4 Mosaic disorders of chromosome X The X chromosome is the only diploid chromosome for which monosomy (whether mosaic or constitutional) is compatible with life [31]. In humans and other placental mammals, females undergo random X-chromosome inactivation (XCI) [68,69]. A proportion of somatic cells actively express the paternal copy of the X chromosome, while other cells express the maternal copy. Females are, therefore, mosaic in the sense of functional expression of a particular copy of the X chromosome across the body. However, they are not mosaic in the sense of having distinct genomes, any more than having two distinct copies of an autosome makes an individual mosaic. XCI has important consequences for disease and is likely a major factor in the greater risk of mortality for males compared to females at every age. A heterozygous X-linked mutation in females leads to a less severe phenotype than in males who, being hemizygous for the X chromosome, express only the mutant allele.

For Turner syndrome (denoted as 45,X or 45,XO), mosaicism occurs in 15%–50% of cases [70,71]. It has been hypothesized that all live-born 45,X individuals in fact carry a cryptic “rescue line” with a viable 46,XX karyotype, since many patients with Turner syndrome do not appear to be mosaic. Cryptic mosaicism for the rescue line may be confined to the long-discarded placenta or to tissues other than peripheral blood or buccal tissue [71]. Increasingly sensitive techniques such as droplet digital PCR (ddPCR) and single-cell sequencing may increase our ability to detect such cryptic mosaic variants and lead to more accurate diagnoses in pre- and postnatal screening. Mosaicism is also common for trisomy X (47,XXX), occurring in 10% of cases. For other sex chromosome aneuploidies, such as Klinefelter syndrome (47,XXY), XYY (47,XYY), and XXYY (48,XXYY), mosaicism is rare [72].

8.3.5 Mosaic disorders of chromosome Y Chromosome Y has a role in sex determination. It is inherited in a uniparental fashion and is therefore always haploid, allowing the male-specific portion of Y (MSY) to escape recombination. This leads to the accumulation of repeats and deleterious mutations and to chromosome Y degeneration. Of 567 genes on the chromosome, only ~66 are protein coding, and these are specific to testis determination and spermatogenesis. Mosaic loss of chromosome Y (mLOY), first described in 1963 [73], is the most common postzygotic mosaic variant in humans, owing to a selective advantage that allows clonal expansion of cells with LOY [74]. Initially, it was regarded as part of the natural process of aging, since genes on chromosome Y are not essential for cell survival, and females live well without that chromosome. However, chromosome Y has a greater role beyond sex determination. Associations have been identified between mLOY and elevated risk for overall mortality and various age-related diseases in men, including Alzheimer’s disease, primary biliary cirrhosis, cardiovascular events, and cancer. Cells lacking chromosome Y may have altered tumor suppression and reduced immunosurveillance [74]. Age, multiple autosomal genetic determinants, structural aberrations, and environmental factors such as smoking, drinking, and obesity are among potential mLOY risk factors [75,76]. mLOY in aging men may contribute to a shorter life expectancy in comparison with women.

8.3.6 Variegated mosaic aneuploidy In rare cases, multiple chromosomal loci can be mosaic aneuploid [77]. Mosaic variegated aneuploidy syndrome 1 (MAVS1; MIM 257300), MAVS2 (OMIM 614114), and MAVS3 (OMIM 617598) are autosomal recessive disorders caused by mutations in BUB1B, CEP57, and TRIP13, respectively. In each of these conditions, homozygous or compound heterozygous mutations lead to mosaic aneuploidies (mostly monosomies and trisomies) on multiple chromosomes and in multiple tissues. Mosaic translocations can also affect multiple chromosomes. The Philadelphia chromosome results from a translocation between chromosomes 9 and 22 that creates a BCR/ABL1 fusion gene encoding a tyrosine kinase. This translocation is the most frequent cause of chronic myeloid leukemia (MIM #608232), a cancer of white blood cells. The translocation causes genomic instability and is often associated with a variety of additional aneuploidies, such as trisomy of 17q. These conditions are often mosaic.

8.3.7 Ring chromosomes Ring chromosomes are circular DNA molecules that occur commonly in organisms such as viruses and bacteria, but only rarely in humans, with an incidence of ~1:50,000 [78]. Ring chromosomes typically form due to two double-stranded DNA breaks. As the DNA circularizes, the ring chromosome can persist if it includes a centromere or if it gains a neocentromere (ectopic centromere). Ring chromosomes are usually mosaic and affect chromosomes that are rarely found with mosaic trisomy (e.g., chromosome 19) [79] or that are commonly mosaic for trisomy (e.g., chromosome 20) [80]. Ring chromosome 20 syndrome is marked by a characteristic seizure phenotype. Depending on the amount of chromosomal loss and associated mosaicism, ring(20) can be associated with macrocephaly, mild-to-moderate intellectual deficit, or behavioral problems. In rare cases, brain, kidney, or heart malformations may be present.

8.3.8 Mitochondrial genome mosaicism Mitochondrial DNA (mtDNA), separate from the nuclear genome, is organized in a circular genome of 16,569 base pairs. In humans, mtDNA is inherited uniparentally from the mother (with possible exceptions being assessed) and is present in multiple copies in every cell. mtDNA heterogeneity, also called heteroplasmy, is a dynamic process that begins in oocyte development with a genetic bottleneck in which only a subset of maternal mtDNA is incorporated in each oocyte. The bottleneck effect may continue during development and in later stages in life [81]. mtDNA has a relatively high mutation rate, possibly due to its constant exposure to reactive oxygen species or to replication errors. This results in the presence of different mtDNA mutational loads within a cell or an individual. While heteroplasmy is present in all individuals, an increasing load of deleterious mtDNA mutations leads to mitochondrial and other disorders including cancer as well as aging, with variable expression proportional to the level of heteroplasmy [82]. In mitochondrial disorders, levels of heteroplasmy govern the expression of the disorder. Leigh syndrome (MIM #256000) provides an example [83]. The syndrome requires >80% heteroplasmy for clinical expression. Other examples include mitochondrial myopathy, encephalomyopathy, lactic acidosis, and stroke-like episodes (MELAS), with >60%, and myoclonus epilepsy with ragged red fibers (MERRF), with >70%.
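The threshold effect described above can be sketched as a simple comparison of a measured heteroplasmy level against the approximate thresholds quoted in the text. This is a minimal illustration only: the names `THRESHOLDS`, `heteroplasmy`, and `exceeds_threshold` are hypothetical, the copy counts are invented, and real thresholds vary by variant and tissue.

```python
# Approximate expression thresholds quoted in the text (fraction of mutant mtDNA).
THRESHOLDS = {
    "Leigh syndrome": 0.80,
    "MERRF": 0.70,
    "MELAS": 0.60,
}


def heteroplasmy(mutant_copies: int, total_copies: int) -> float:
    """Fraction of mtDNA copies (or sequencing reads) carrying the variant."""
    if total_copies <= 0 or not 0 <= mutant_copies <= total_copies:
        raise ValueError("counts must satisfy 0 <= mutant_copies <= total_copies, total > 0")
    return mutant_copies / total_copies


def exceeds_threshold(level: float, disorder: str) -> bool:
    """True when the heteroplasmy level is above the quoted threshold."""
    return level > THRESHOLDS[disorder]


# Hypothetical sample: 650 of 1000 mtDNA copies carry the variant.
level = heteroplasmy(650, 1000)
for disorder in THRESHOLDS:
    status = "above" if exceeds_threshold(level, disorder) else "below"
    print(f"{disorder}: heteroplasmy {level:.0%} is {status} the quoted threshold")
```

In this invented sample, 65% heteroplasmy is above the level quoted for MELAS but below those quoted for MERRF and Leigh syndrome.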

8.3.9 Mosaic mobile element insertions Half the human genome is derived from transposable DNA elements, some of which retain the ability to mobilize [84]. Retrotransposition events represent a form of mosaic variation that may have significant roles in genomic rearrangements and disease. For example, 31 disease-causing LINE-1 insertions have been described, causing diseases such as pyruvate dehydrogenase deficiency and neurofibromatosis. The role of mobile element insertions in neuropsychiatric disease is under investigation [85].

8.4 Categories of mosaic variation In addition to mosaic aneuploidy (described above), there are other classes of mosaic variation that are relevant to disease.

8.4.1 Germline mosaicism De novo variants often appear to be sporadic when in fact one parent has germline (gonadal) mosaicism for the same variant. The recurrence of rare dominant disorders in families is commonly attributed to parental germline mosaicism. Recently, the improved sensitivity of genomic technologies has revealed parental mosaicism in apparently normal parents who transmitted mosaic SNVs as well as CNVs [86]. An example is seen in early infantile epileptic encephalopathy 6 (Dravet syndrome; MIM #607208), a form of epilepsy related to mutations in the SCN1A gene. Dravet syndrome may arise from somatic mosaicism or, in a relatively high proportion of cases, be inherited from a mosaic parent with a milder form of the disorder. Allele fractions in parents well below the detection limit of Sanger sequencing have been detected using new technologies, including ddPCR, showing that as many as 10% of probands with apparently de novo variants had a parent with the same mosaic variant [87,88].
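As a rough illustration of how ddPCR can quantify a low-level parental variant, the sketch below applies the standard Poisson correction to droplet counts and reports the mutant fractional abundance. It assumes a two-channel assay in which mutant and wild-type targets are scored independently; the function names and the droplet counts are hypothetical and are not taken from the cited studies.

```python
import math


def copies_per_droplet(positive_droplets: int, total_droplets: int) -> float:
    """Mean target copies per droplet, from the Poisson zero class.

    If a fraction p of droplets is positive, then lambda = -ln(1 - p).
    """
    if total_droplets <= 0 or not 0 <= positive_droplets < total_droplets:
        raise ValueError("need 0 <= positive_droplets < total_droplets")
    return -math.log(1.0 - positive_droplets / total_droplets)


def fractional_abundance(mutant_positive: int,
                         wildtype_positive: int,
                         total_droplets: int) -> float:
    """Mutant copies as a fraction of all (mutant plus wild-type) copies."""
    lam_mut = copies_per_droplet(mutant_positive, total_droplets)
    lam_wt = copies_per_droplet(wildtype_positive, total_droplets)
    if lam_mut + lam_wt == 0:
        raise ValueError("no positive droplets in either channel")
    return lam_mut / (lam_mut + lam_wt)


# Hypothetical run: 15,000 droplets, 150 mutant-positive, 6,000 wild-type-positive.
fa = fractional_abundance(mutant_positive=150,
                          wildtype_positive=6000,
                          total_droplets=15000)
print(f"Mutant fractional abundance: {fa:.2%}")
```

An allele fraction of roughly 2%, as in this invented run, would ordinarily be invisible on a Sanger trace.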

8.4.2 Cryptic mosaicism Cryptic mosaicism refers to mosaic variants in abnormal cells that are confined to specific noncutaneous tissues and are present in blood at levels too low to be detected by standard screening of bulk DNA from peripheral blood [74]. Cryptic mosaicism is a particular hindrance in prenatal diagnosis of rare mosaic trisomies and other mosaic variants [89]. An example is Alport syndrome (AS), a hereditary condition causing hearing loss, eye abnormalities, and kidney disease that often goes undetected in parents due to postulated cryptic mosaicism. X-linked AS type 1 (ATS1, MIM #301050) is caused by pathogenic variants in the COL4A5 gene. In a male patient with ATS1 and an identified mutation in COL4A5, the same mutation was not detected in a maternal peripheral blood sample. However, NGS analysis of maternal DNA from both peripheral blood and urine-derived podocyte-lineage cells revealed the pathogenic COL4A5 mutation in the podocyte-lineage cells only. This finding indicates an early postzygotic mutation involving both the renal lineage and the germline, which was passed on to the son [90]. Another example of cryptic mosaicism, as mentioned above, is Turner syndrome (45,X karyotype), where it has been hypothesized that all live-born 45,X individuals in fact carry a cryptic “rescue line” with a viable 46,XX karyotype. Increasingly sensitive techniques such as ddPCR and single-cell sequencing may increase our ability to detect such cryptic mosaic variants and lead to more accurate diagnoses in pre- and postnatal screening.

8.5 Obligate mosaicism in rare disease Many mosaic disorders are clinically evident by patterns of variation in the skin or hair, or by visible asymmetric overgrowth. A group of neurocutaneous disorders (affecting the nervous system and skin) were hypothesized to represent mosaic disorders that would be lethal if occurring in the germline [14]. These disorders never follow a Mendelian pattern but instead are seen only in sporadic cases. Furthermore, monozygotic twins are discordant for the phenotype because the mutation is postzygotic. In 1987 Rudolf Happle hypothesized that sporadic disorders that cannot be explained by Mendelian inheritance patterns and are found in mosaic form are driven by lethal genes [14], meaning that the effect of such mutations in the germline would be so deleterious as to be embryonic lethal. Such mutations arise in the later stages of embryonic development, and the mutant cells are found in close proximity to apparently normal cells. Many rare disorders of obligate somatic mosaicism are caused by activating mutations in oncogenes (Table 8.2), yet these mutations rarely cause cancers except in the context of malignant tumors.

8.5.1 GNAS and McCune-Albright syndrome The GNAS complex locus (having gene symbol GNAS) expresses multiple, highly imprinted transcripts from alternative promoters and an antisense transcript. Variants in this locus result in a number of disorders including pseudohypoparathyroidism 1a (MIM #103580) and 1b (MIM #603233), Albright hereditary osteodystrophy, progressive osseous heteroplasia (MIM #166350), polyostotic fibrous dysplasia of bone, and pituitary tumors (MIM #617686), and the somatic disorder McCune-Albright syndrome (MIM #174800). GNAS encodes Gs-alpha, the alpha subunit of the stimulatory guanine nucleotide-binding protein (G protein). Happle’s hypothesis was first confirmed by the discovery of postzygotic somatic mosaic mutations in the GNAS gene that lead to McCune-Albright syndrome. These mutations are embryonic lethal in a nonmosaic form but, when occurring early in embryonic development, result in a clinically heterogeneous disorder that manifests in a classic triad of precocious puberty, café-au-lait skin pigmentation (Fig. 8.3), and polyostotic bone dysplasia [128].

8.5.2 Proteus syndrome Proteus syndrome (MIM #176920) is characterized by progressive postnatal growth of body parts, often affecting the central nervous system, adipose tissue, skeleton, and skin. The syndrome is one of the first mosaic disorders for which, in 2011, the causal gene mutation was identified by whole-exome or whole-genome sequencing [91]. Whole-exome sequencing allows the assessment of mosaic variation across exonic regions of nearly 20,000 protein-coding genes. Twenty-six of 29 patients with Proteus syndrome harbored an activating mutation in AKT1, encoding AKT serine/threonine kinase 1. AKT1 acts downstream of phosphoinositide 3-kinase (PI3K), a signal transduction enzyme involved in several cell survival processes, including cell growth, differentiation, and proliferation. The approach of sequencing the exome (or genome) of genomic DNA derived from both affected and unaffected tissue has been successfully applied to dozens of other mosaic disorders.

8.5.3 PIK3CA and CLOVES syndrome A number of disorders in addition to the Proteus syndrome are driven by variants in the PI3K-AKT pathway [129]. For example, mutations in AKT1 can cause Cowden syndrome 6 (MIM #615109) and breast, colorectal, and ovarian cancer. The PI3K p110α subunit, encoded by the PIK3CA gene, is frequently mutated in multiple tumor types, including breast, ovarian, and colon cancers, and also in benign overgrowth syndromes collectively named PROS (PIK3CA-related overgrowth spectrum). Interestingly, activating hotspot mutations (H1047R, E545K, and E542K) most commonly found in cancer are also found in a subset of the more severe overgrowth syndromes in similar frequencies. These somatic conditions occur without an elevated incidence of cancer (with the exception of Wilms tumor), possibly due to differences in tissue type and the need for additional mutations to drive tumorigenesis. Rare, nonhotspot mutations are associated with milder overgrowth forms [92,130]. Congenital lipomatous overgrowth, vascular malformations, epidermal nevi, and skeletal/spinal abnormalities (CLOVES syndrome; MIM #612918) is a sporadic nonheritable disorder. CLOVES involves progressive, complex, and mixed truncal vascular malformations, dysregulated adipose tissue, varying degrees of scoliosis, and enlarged bony structures without progressive bony overgrowth. It is caused by postzygotic mutations in PIK3CA [93]. Mutant allele frequencies have been reported from ~3% to nearly 50%.
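The mutant allele frequencies quoted above can be related to the proportion of cells carrying the variant with a short calculation. The sketch below assumes a heterozygous variant in a copy-neutral diploid region, so that the fraction of mutant cells is about twice the variant allele fraction; the function name and read counts are hypothetical.

```python
from typing import Tuple


def mutant_cell_fraction(alt_reads: int, total_reads: int) -> Tuple[float, float]:
    """Return (variant allele fraction, estimated fraction of mutant cells).

    Assumption: each mutant cell carries the variant on one of two alleles,
    with no copy-number change at the locus.
    """
    if total_reads <= 0 or not 0 <= alt_reads <= total_reads:
        raise ValueError("need 0 <= alt_reads <= total_reads and total_reads > 0")
    vaf = alt_reads / total_reads
    # Cap at 1.0 because sampling noise can push the VAF slightly above 0.5.
    return vaf, min(1.0, 2.0 * vaf)


# Hypothetical read counts spanning the range mentioned in the text.
for alt, total in ((30, 1000), (250, 1000), (480, 1000)):
    vaf, cells = mutant_cell_fraction(alt, total)
    print(f"VAF {vaf:.0%} -> about {cells:.0%} of sampled cells carry the variant")
```

Under this assumption a 3% allele fraction corresponds to roughly 6% of sampled cells, while a value near 50% means that nearly all cells in the sampled tissue carry the mutation.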

8.5.4 GNAQ and Sturge-Weber syndrome Sturge-Weber syndrome (MIM #185300) is a neurocutaneous disorder characterized by leptomeningeal angiomatosis, an intracranial vascular anomaly, seizures, and glaucoma. It is accompanied by a congenital capillary malformation or nonsyndromic port-wine birthmark (MIM #163000). A mosaic, activating point mutation in GNAQ was identified in genomic DNA from both affected skin and brain from Sturge-Weber syndrome patients as well as in skin of individuals with nonsyndromic port-wine birthmarks [117]. GNAQ encodes G-protein subunit alpha q (Gαq), a member of the q class of G-protein alpha subunits that mediates signals between the transmembrane G protein-coupled receptors and downstream effectors. The identical somatic mutation of GNAQ, when occurring in melanocytes, is a cause of uveal melanoma, an ocular cancer [131]. In both Sturge-Weber syndrome and uveal melanoma the consequence of the mosaic variant is to activate intracellular signaling pathways. The developmental timing and cellular location are crucial for determining the phenotypic consequence.

Table 8.2 Mosaic mutations in genes and their associated signaling pathways and diseases.

| Gene(s) | Signaling pathway(s) | Disease(s) | Cellular function(s) | Cancer(s) | Cancer role | References |
|---|---|---|---|---|---|---|
| PIK3CA | PI3K-AKT-mTOR | HME, mosaic overgrowth syndrome, type 2 segmental, CLOVES, MCAP | PI3K subunit, serine/threonine kinase, metabolism | Cervical, various neoplasms, colorectal | Oncogene | [91–94] |
| PIK3R2 | PI3K-AKT-mTOR | MPPH1 | PI3K subunit, serine/threonine kinase | | | [94] |
| PTEN | PI3K-AKT-mTOR | Cowden syndrome | Phosphatase, negative regulator of AKT/PKB pathway | Prostate, colorectal, carcinoma, primary glioblastoma, other | Tumor suppressor | [95] |
| AKT1 | PI3K-AKT-mTOR | Proteus syndrome | Serine/threonine kinase | Human breast, ovarian, colorectal | Oncogene | [91] |
| AKT2 | PI3K-AKT-mTOR | Diabetes mellitus | Serine/threonine kinase | Ovarian, pancreatic, breast, colorectal, lung cancer | Oncogene | [96] |
| AKT3 | PI3K-AKT-mTOR | HME, MCAP, MPPH2 | Serine/threonine kinase | Melanoma, glioma, ovarian cancer | Oncogene | [94,97–99] |
| MTOR | PI3K-AKT-mTOR | FCD type II | Serine/threonine kinase | Carcinoma, glioblastoma, melanoma | Oncogene | [99,100] |
| DEPDC5 | PI3K-AKT-mTOR | Epilepsy w/ FCD | mTORC1 repressor | Glioblastoma and ovarian tumors | Tumor suppressor | [101,102] |
| TSC1 | PI3K-AKT-mTOR | TSC | Negative regulator of mTORC1 | Renal angiomyolipomas | Tumor suppressor | [103,104] |
| TSC2 | PI3K-AKT-mTOR | TSC | Negative regulator of mTORC1 | Renal angiomyolipomas | Tumor suppressor | [103,104] |
| NRAS, BRAF, FGFR3, PIK3CA | ras, PI3K-AKT-mTOR | Congenital melanocytic, other nevi; seborrheic keratosis | Cell cycle regulation | (FGFR3) bladder, cervical, urothelial | Oncogene | [105–109] |
| NF2 | ras, PI3K-AKT-mTOR | NF type 2 | Negative regulator of ras, mTOR pathways | Neurofibromas | Tumor suppressor | [109] |
| NF1 | ras | NF type 1, Watson syndrome | Negative regulator of ras pathway | Neurofibromas, leukemia | Tumor suppressor | [110–112] |
| BRAF, NRAS, KRAS | ras | Pyogenic granuloma | Cell cycle regulation | (KRAS) breast, colorectal, other; (NRAS) thyroid, melanoma, other; (BRAF) melanoma, colorectal | Oncogene | [113] |
| HRAS, KRAS | ras | Schimmelpenning syndrome | Cell cycle regulation | (KRAS) bladder, breast, colorectal, pancreatic, other; (HRAS) colorectal, bladder, kidney, other | Oncogene | [114] |
| KRAS | ras | RALD | Cell cycle regulation | Breast, bladder, other | Oncogene | [115,116] |
| GNAQ | GPCR, MAPK | Sturge-Weber syndrome | G-protein alpha subunit | Melanoma | Oncogene | [117] |
| GNAQ, GNA11 | GPCR, MAPK | Dermal melanocytosis and phakomatosis pigmentovascularis | G-protein alpha subunit | Melanoma | Oncogene | [118] |
| MAP3K3 | MAPK | Verrucous venous malformation | Cell cycle regulation | Breast, colon, rectal cancers | Oncogene | [119] |
| GNAS1 | GPCR | McCune-Albright syndrome | G-protein alpha subunit | Adenomas, carcinomas, ovarian neoplasms | Oncogene | [120] |
| IDH1 | Glutathione and carbon metabolism | Ollier disease, Maffucci syndrome | Isocitrate dehydrogenase, metabolism | Enchondromas, spindle cell hemangiomas | Oncogene | [121] |
| IDH2 | Glutathione and carbon metabolism | Ollier disease, Maffucci syndrome | Isocitrate dehydrogenase, metabolism | Enchondromas, spindle cell hemangiomas | Oncogene | [121] |
| JAK2 | JAK-STAT | Myelofibrosis, polycythemia vera, and essential thrombocythemia | Cell cycle regulation | Leukemia | Oncogene | [122,123] |
| SCN1A | Sodium channel | Dravet syndrome | Neural excitation | | | [124] |
| NLRP3 | Caspase/inflammasome | CINCA/NOMID syndrome | Inflammasome subunit | | | [125] |
| PORCN | Wnt | Focal dermal hypoplasia | O-Acyltransferase | | | [126] |
| PIGA | Hematopoiesis | Paroxysmal nocturnal hemoglobinuria | ER protein processing | | | [127] |

CLOVES, congenital lipomatous overgrowth, vascular malformations, epidermal nevi, and skeletal/spinal abnormalities; FCD, focal cortical dysplasia; GPCR, G protein-coupled receptor; HME, hemimegalencephaly; MCAP, megalencephaly-capillary malformation-polymicrogyria syndrome; MPPH1, megalencephaly-polymicrogyria-polydactyly-hydrocephalus syndrome 1; NF, neurofibromatosis; RALD, Ras-associated autoimmune leukoproliferative disorder; TSC, tuberous sclerosis complex.

8.5.5 TSC1, TSC2, and the tuberous sclerosis complex TSC1 encodes the growth inhibitory protein hamartin, which interacts with the TSC2 gene product tuberin that activates GTPases. The hamartin-tuberin complex negatively regulates cell growth signaling. TSC1 and TSC2 are therefore considered tumor-suppressor genes. Somatic mutations in TSC1 have been found in focal cortical dysplasia, tuberous sclerosis, as well as cases of bladder cancer [132]. Tuberous sclerosis (TSC) is a disorder that is caused by heterozygous mutations in either TSC1 or TSC2 genes and is characterized by development of hamartomas (benign tumor-like growths) in multiple organs, including facial angiofibromas, cerebral cortical tubers, brain lesions that cause autism, epilepsy and intellectual disabilities, renal angiomyolipomas, and cysts that may cause renal failure and renal-cell carcinomas. The tuberous sclerosis phenotype is highly variable, from very mild to severe manifestations of the disease. Although described as autosomal dominant, and seen in pedigrees with multiple affected members, many cases are sporadic. Both gonadal and somatic mosaicism have been described. Patients often have one inactivating germline mutation in either TSC gene and a second-hit mutation in the hamartomas; this is compatible with tumor-suppressor gene behavior. Mosaicism in the TSC1 and TSC2 genes manifests in milder forms of tuberous sclerosis and is often difficult to detect, hindering accurate diagnosis [133].

8.6 Cancer as a series of rare mosaic diseases Cancer is perhaps the most common example of somatic mosaicism in disease. Tumorigenesis may be attributed to inherited germline or postzygotic somatic mutations, or both. Theodor Boveri in his 1914 monograph Concerning the Origin of Malignant Tumours [134] reported a number of remarkably precise observations on the genetic origins of cancer. These ideas correspond to mosaic forms in cancers, including oncogenes, genomic instability, aneuploidy, clonal expansion, genetic mosaicism, and tumor-suppressor genes. A prediction was that individuals who are carriers of cancer-causing germline mutations are less likely to avoid the development of malignant tumors [135]. Knudson’s two-hit hypothesis, proposed in 1971 [10], supported Boveri’s predictions. This states that inactivation of tumor-suppressor genes requires a mutation in the second allele. Knudson showed that patients with retinoblastoma (MIM #180200) carry germline deletions or mutations in the RB1 gene and that most retinoblastoma tumors harbor a deleterious somatic variant in the second RB1 allele. This was later corroborated in other cancer-causing genes such as APC, BRCA1/2, and TP53 [136]. Later, Hanahan and Weinberg defined the “hallmarks of cancer”: proliferative signaling, evading growth suppressors, immune evasion, replicative immortality, inflammation, invasion and metastasis, angiogenesis, genome instability and mutation, resisting cell death, and deregulating cellular energetics [137], many of which support Boveri’s predictions. Germline cancer susceptibility disorders are rare genetic diseases where individuals are born with de novo or inherited deleterious variants in tumor-suppressor genes that make them more susceptible to the development of a wide variety of tumors later in life through somatic mutation of the second allele in the corresponding tissues.

8.6.1 Somatic single nucleotide variants in cancer In cancer, SNVs may be classified as “driver” mutations when they confer a selective advantage, or as “passenger” mutations when they have little or no consequence for tumor development. Driver mutations often occur in oncogenes (PIK3CA, AKT, and RAS) and tumor suppressors (PTEN, TSC1, and TSC2) [138]. As a consequence, they may constitutively activate signal transduction cascades. The number of mutations in a tumor depends on age, tissue, and cell type. Childhood cancers tend to harbor far fewer somatic mutations than cancers influenced by environmental factors, in which most mutations are considered passenger mutations. However, recently, it has become evident that a high tumor mutational burden is prognostically favorable to patients, improving response rate to immunotherapies [139].

8.6.2 Clonal evolution and cancer field effect (field cancerization) A signature of tumor development is clonal expansion, in which mutations accumulate and expand in a clonal fashion. Only a subset of these contributes to the process of malignancy. Field cancerization is an example of tissue mosaicism that confers a cancer risk for individuals with precancerous lesions or normal-appearing tissues that acquire age- and environment-related variants that are required for the development of malignant tumors [140].

8.6.3 Somatic mutations along lines of Blaschko Often mosaic congenital skin diseases are characterized by lesions along lines of Blaschko. Many of these skin lesions harbor mutations in oncogenes and tumor suppressors and can help identify embryonic origins of a somatic cancer mutation [140].

8.7 Mendelian disorders in mosaic form

De novo mutations disrupting the function of single genes are known to cause several thousand Mendelian disorders, often with autosomal dominant inheritance (see Chapter 7: X-linked and Mitochondrial Disorders). Segmental mosaicism accounts for attenuated forms of the same disorders. These mosaic disorders arise sporadically from postzygotic, rather than germline, de novo mutations.
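One way to illustrate the distinction between a postzygotic and a germline heterozygous variant is through the variant allele fraction (VAF) in sequencing reads: a constitutional heterozygous variant is expected near 0.5, whereas a mosaic variant present in only some cells tends to fall below that. The sketch below applies an exact two-sided binomial test against an expected fraction of 0.5. It is an illustrative assumption rather than a method described in this chapter: the function names, the significance level, and the read counts are hypothetical, and a real analysis would also need to account for sequencing error, reference bias, and copy-number changes.

```python
from math import comb


def two_sided_binomial_p(alt_reads: int, depth: int, expected_fraction: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count (modest depths only)."""
    if depth <= 0 or not 0 <= alt_reads <= depth:
        raise ValueError("need depth > 0 and 0 <= alt_reads <= depth")
    probabilities = [
        comb(depth, k) * expected_fraction**k * (1 - expected_fraction) ** (depth - k)
        for k in range(depth + 1)
    ]
    observed = probabilities[alt_reads]
    # Small relative tolerance so symmetric outcomes are not lost to rounding.
    tail = sum(p for p in probabilities if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


def flag_candidate_mosaic(alt_reads: int, depth: int, alpha: float = 0.01) -> bool:
    """Flag a variant whose VAF is significantly below the 0.5 expected for a
    constitutional heterozygote (alpha is an assumed, illustrative cutoff)."""
    p_value = two_sided_binomial_p(alt_reads, depth)
    return (alt_reads / depth) < 0.5 and p_value < alpha


# Hypothetical read counts at 100x depth.
print(flag_candidate_mosaic(12, 100))  # VAF 0.12
print(flag_candidate_mosaic(48, 100))  # VAF 0.48
```

A flag from such a test is only a starting point; confirming mosaicism generally requires an orthogonal assay or additional tissues.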

Some classic examples, including relevant signaling pathways and protein functions, are summarized below and in Table 8.2.

8.7.1 Neurofibromatosis type I

Neurofibromatosis type I (NF1) (MIM #162200) is caused by a germline autosomal dominant heterozygous mutation in the neurofibromin 1 gene (NF1). It is characterized by widely distributed café-au-lait spots, Lisch nodules (benign hamartomas) of the eye, and benign fibromatous tumors on the skin, with an increased risk of developing malignant tumors. NF1 has variable expressivity and may itself present in a localized manner. Segmental neurofibromatosis, by comparison, is seen in patients with nonsystemic disease localized to a single unilateral segment of the body, mostly with a negative family history [141]. Segmental neurofibromatosis is caused by a variety of postzygotic NF1 mutations specific to primitive neural crest cells, which may be identical to the mutations implicated in the Mendelian form of NF1 [11,142].

8.7.2 Incontinentia pigmenti

Familial incontinentia pigmenti (MIM #308300) is an X-linked dominantly inherited genodermatosis (cutaneous genetic disorder) caused by mutation in the IKK-gamma (IKBKG) gene on chromosome Xq28. This disorder manifests with skin pigmentation lesions and other malformations, including systemic, dental, ocular, and, more rarely, brain abnormalities. Mutations in the IKBKG gene lead to total loss of the protein in males and are embryonic lethal. Females survive this otherwise lethal mutation because they are heterozygous and because of X inactivation. Cells in which the active X carries the mutant allele may be eliminated, which explains the continuously changing patterns of skin lesions [143]. However, some male fetuses do survive, owing to mosaicism associated with either an abnormal karyotype such as 47,XXY (Klinefelter syndrome) or a postzygotic mutation confined to a subset of tissues [144].

8.7.3 Darier-White disease

Darier-White disease (keratosis follicularis; MIM #124200) is a skin disorder characterized by loss of adhesion between epidermal cells and abnormal keratinization. In the familial form, the disorder has high penetrance and variable expressivity, with severe cutaneous and neuropsychiatric abnormalities. It is caused by autosomal dominant mutations in the Ca(2+)-transporting ATPase gene (ATP2A2). In the segmental form, lesions are localized and sometimes present in a unilateral fashion along the lines of Blaschko [145,146].

8.7.4 Hereditary hemorrhagic telangiectasia

Hereditary hemorrhagic telangiectasia (HHT) is a vascular disorder of angiodysplasia with a high risk of gastrointestinal bleeding and of various other arteriovenous malformations in multiple organ systems [147,148]. HHT1 (MIM #187300) is caused by autosomal dominant pathogenic variants in the ENG gene, encoding endoglin, whereas HHT2 (MIM #600376) is caused by pathogenic variants in the ACVRL1 gene, encoding an activin A receptor. Mosaic mutations in the ENG and ACVRL1 genes have been identified in family members of probands with clinically confirmed HHT. Mosaic family members showed a range of clinical expression. Proposed mechanisms of mosaicism in these cases include a recent blood transfusion, revertant mosaicism, and microchimerism [149].

8.8 Chimerism

Chimerism is a phenomenon that may be mistaken for mosaicism. Whereas mosaicism is defined as the presence of two or more genetically distinct genomes in one individual, all derived from one zygote, chimerism originates from two or more distinct zygotes. Classic chimerism (macrochimerism) is a rare condition in which two zygotes fuse to form a single embryo. Microchimerism is a common phenomenon in which a small fraction of cells in the soma originate from another person. Artificial or iatrogenic microchimerism is induced by organ transplantation or blood transfusion, where donor cells and DNA disseminate in the recipient beyond the transplant site. Natural microchimerism includes pregnancy-related microchimerism. This may occur as maternal microchimerism (the persistence of maternal cells in the offspring into adulthood) [150] or fetal microchimerism (the persistence of fetal cells in the mother, even decades postpartum) [151]. It may arise via transplacental bidirectional transfer of fetal and maternal cells, as well as in cases of monochorionic dizygotic twins [74].

Naturally acquired microchimerism is not fully understood. Fetal microchimerism may be advantageous in some types of solid and liquid cancers. In patients with breast cancer, circulating fetal cells are found in lower numbers than in healthy controls. Furthermore, circulating fetal cells tend to be found at tumor sites, suggesting a possible role in injury repair. In contrast, fetal microchimerism may be harmful in autoimmune diseases such as systemic sclerosis, primary biliary cirrhosis, rheumatoid arthritis, and multiple sclerosis. It has been proposed that progenitor fetal cells can persist, migrate to specific tissue sites, and differentiate into immune cells that in turn give rise to autoimmune disease [152,153]. Distinguishing chimerism from mosaicism is technically not trivial, but it is important for diagnosis and for genetic counseling of individuals and their family members.
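Because a chimera derives from more than one zygote, it can carry more than two distinct alleles at a polymorphic locus, whereas cells descended from a single zygote are expected to carry at most two. The sketch below uses that expectation as a rough screen. It is an illustration under stated assumptions, not a diagnostic procedure from this chapter: the function name, the locus names, the allele labels, and the requirement of at least two supporting loci are all hypothetical, and the multi-locus requirement is there because a single locus with three alleles may have other explanations, such as a trisomy or a new mutation.

```python
from typing import Dict, List, Tuple


def suggests_chimerism(
    allele_calls: Dict[str, List[str]], min_loci: int = 2
) -> Tuple[bool, List[str]]:
    """Return (flag, loci), where loci are those showing more than two
    distinct alleles and flag is True when at least min_loci such loci exist."""
    loci_over_two = [
        locus for locus, alleles in allele_calls.items() if len(set(alleles)) > 2
    ]
    return len(loci_over_two) >= min_loci, loci_over_two


# Hypothetical genotype calls (allele labels are arbitrary repeat counts).
single_zygote_like = {
    "locus_1": ["12", "14"],
    "locus_2": ["9", "9"],
    "locus_3": ["15", "17"],
}
two_zygote_like = {
    "locus_1": ["12", "14", "16"],
    "locus_2": ["9", "10", "11", "13"],
    "locus_3": ["15", "17"],
}

print(suggests_chimerism(single_zygote_like))
print(suggests_chimerism(two_zygote_like))
```

A negative result does not exclude chimerism, since two zygotes from the same parents can share alleles at any given locus; testing more loci makes the screen more informative.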

8.9 Outlook

Mosaicism is a pervasive and inevitable phenomenon in apparently normal individuals, promoting diversity in the human genome. Mosaicism also has central roles in a variety of aneuploidies, in de novo and Mendelian rare diseases [36], and in cancer. The clinical outcome and the severity of any given disorder are governed by the specific stage of embryonic development at which the event arises and by the type of mosaic variation. Mosaic variation spans the gamut of genomic variation better known in the germline, including SNVs and structural variation in many forms. The increasingly sensitive techniques available for variant discovery allow greater accuracy in identifying causative variants. Thus we continue to improve our knowledge of the role of mosaic variation in the human genome and our ability to diagnose and develop therapies for rare diseases.

References

[1] Weismann A. The germ-plasm: a theory of heredity. London: W. Scott; 1893.

[2] Darlington CD. Recent advances in cytology. Philadelphia (PA): P. Blakiston’s Son & Co., Inc; 1932. [3] Delhanty JD. Inherited aneuploidy: germline mosaicism. Cytogenet Genome Res 2011;133(24):13640. [4] Taylor TH, et al. The origin, mechanisms, incidence and clinical consequences of chromosomal mosaicism in humans. Hum Reprod Update 2014;20(4):57181. [5] Kotzot D. Abnormal phenotypes in uniparental disomy (UPD): fundamental aspects and a critical review with bibliography of UPD other than 15. Am J Med Genet 1999;82(3):26574. [6] van Echten-Arends J, et al. Chromosomal mosaicism in human preimplantation embryos: a systematic review. Hum Reprod Update 2011;17(5):6207. [7] Milholland B, et al. Differences between germline and somatic mutation rates in humans and mice. Nat Commun 2017;8:15183. [8] Barnes JL, et al. Carcinogens and DNA damage. Biochem Soc Trans 2018;46(5):121324. [9] Lichtenstein AV. Genetic mosaicism and cancer: cause and effect. Cancer Res 2018;78(6):13758. [10] Knudson Jr. AG. Mutation and cancer: statistical study of retinoblastoma. Proc Natl Acad Sci USA 1971;68(4):8203. [11] Lim YH, Moscato Z, Choate KA. Mosaicism in cutaneous disorders. Annu Rev Genet 2017;51 (1):12341. [12] Pasmooij AM, Jonkman MF, Uitto J. Revertant mosaicism in heritable skin diseases: mechanisms of natural gene therapy. Discov Med 2012;14(76):16779. [13] Freed D, Pevsner J. The contribution of mosaic variants to autism spectrum disorder. PLoS Genet 2016;12(9):e1006245. [14] Happle R. Lethal genes surviving by mosaicism: a possible explanation for sporadic birth defects involving the skin. J Am Acad Dermatol 1987;16(4):899906. [15] Burnet M. Auto-immune disease. II. Pathology of the immune response. Br Med J 1959;2(5154):7205. [16] Boyce AM, Florenzano P, de Castro LF, Collins MT. Fibrous dysplasia/McCune-Albright syndrome. GeneReviewss; 2019. Available from: https://www.ncbi.nlm.nih.gov/books/NBK274564/ [Cited 2020]. [17] Tjio JH, Levan A. The chromosome number of man. 
Hereditas 1956;42:16. [18] Lejeune J, Gautier M, Turpin R. Etudes des chromosomes somatique de neuf enfants mongoliens. CR Acad Sci Paris 1959;248:17212. [19] Clarke CM, Edwards JH. 21-Trisomy/normal mosaicism in an intelligent child with some mongoloid characters. Lancet 1961;1:102830. [20] Fitzgerald PH, Lycette RR. Mosaicism in man involving the autosome associated with mongolism. Heredity; 1961. [21] McGowan-Jordan J, Simons A, Schmid M. ISCN: an international system for human cytogenomic nomenclature (2016). Cytogenet Genome Res 2016;149(12):38. [22] Kallioniemi A, et al. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 1992;258(5083):81821. [23] Ahn JW, et al. Array CGH as a first line diagnostic test in place of karyotyping for postnatal referrals—results from four years’ clinical application for over 8,700 patients. Mol Cytogenet 2013;6(1):16. [24] Izumi K, Krantz ID. Pallister-Killian syndrome. Am J Med Genet C Semin Med Genet 2014;166C (4):40613. [25] Dumanski JP, Piotrowski A. Structural genetic variation in the context of somatic mosaicism. Methods Mol Biol 2012;838:24972. [26] Quan PL, Sauzade M, Brouzes E. dPCR: a technology review. Sensors (Basel) 2018;18(4). [27] McCaffrey MJ. Trisomy 13 and 18: selecting the road previously not taken. Am J Med Genet C Semin Med Genet 2016;172(3):2516. [28] Smith DW, et al. A new autosomal trisomy syndrome: multiple congenital anomalies caused by an extra chromosome. J Pediatr 1960;57:33845.

[29] Edwards JH, et al. A new trisomic syndrome. Lancet 1960;1(7128):78790. [30] Hassold T, et al. A cytogenetic study of 1000 spontaneous abortions. Ann Hum Genet 1980;44 (2):15178. [31] Iourov IY, et al. Ontogenetic and pathogenetic views on somatic chromosomal mosaicism. Genes (Basel) 2019;10(5). [32] Jackson-Cook C. Constitutional and acquired autosomal aneuploidy. Clin Lab Med 2011;31 (4):481511. [33] Daber R, et al. Mosaic trisomy 17: variable clinical and cytogenetic presentation. Am J Med Genet A 2011;155a(10):248995. [34] Kalousek DK, Vekemans M. Confined placental mosaicism. J Med Genet 1996;33(7):52933. [35] Bone KM, et al. Mosaic trisomy 1q: a recurring chromosome anomaly that is a diagnostic challenge and is associated with a Fryns-like phenotype. Prenat Diagn 2017;37(6):60210. [36] Tug E, et al. Confirmation of the prenatal mosaic trisomy 2 via fetal USG and cytogenetic analyses. J Matern Fetal Neonatal Med 2017;30(13):157983. [37] Chen CP, et al. Mosaic trisomy 2 at amniocentesis: prenatal diagnosis and molecular genetic analysis. Taiwan J Obstet Gynecol 2012;51(4):60311. [38] Ponti G, et al. Hypomelanosis of Ito with a trisomy 2 mosaicism: a case report. J Med Case Rep 2014;8:333. [39] Yang YJ, et al. Trisomy 3 mosaicism in a 5-year-old boy with multiple anomalies: a very rare case. Am J Med Genet A 2016;170(6):15904. [40] Kuchinka BD, et al. Two cases of confined placental mosaicism for chromosome 4, including one with maternal uniparental disomy. Prenat Diagn 2001;21(1):369. [41] Hwang S, et al. VACTERL phenotype with mosaic trisomy 5 and uniparental disomy 5. Am J Med Genet A 2018;176(2):5024. [42] Gupta N, et al. Prenatally diagnosed trisomy 6 mosaicism. Prenat Diagn 2004;24(10):8414. [43] Eggermann T, et al. The maternal uniparental disomy of chromosome 6 (upd(6)mat) “phenotype”: result of placental trisomy 6 mosaicism? Mol Genet Genomic Med 2017;5(6):66877. [44] Cassina M, et al. 
Prenatal detection of trisomy 8 mosaicism: pregnancy outcome and follow up of a series of 17 consecutive cases. Eur J Obstet Gynecol Reprod Biol 2018;221:237. [45] Bruns DA, Campbell E. Twenty-five additional cases of trisomy 9 mosaic: birth information, medical conditions, and developmental status. Am J Med Genet A 2015;167a(5):9971007. [46] Gao Y, et al. Mosaicism trisomy 10 in a 14-month-old child with additional neurological abnormalities: case report and literature review. BMC Pediatr 2018;18(1):266. [47] Balasubramanian M, Peres LC, Pelly D. Mosaic trisomy 11 in a fetus with bilateral renal agenesis: co-incidence or new association? Clin Dysmorphol 2011;20(1):479. [48] Marcus-Soekarman D, et al. Mosaic trisomy 11p in monozygotic twins with discordant clinical phenotypes. Am J Med Genet A 2004;124a(3):28891. [49] Hong B, et al. Clinical features of trisomy 12 mosaicism—report and review. Am J Med Genet A 2017;173(6):16816. [50] Gasparini Y, et al. Mosaic trisomy 12 associated with overgrowth detected in fibroblast cell lines. Cytogenet Genome Res 2019;157(3):1537. [51] Jinawath N, et al. Mosaic trisomy 13: understanding origin using SNP array. J Med Genet 2011;48 (5):3236. [52] Yakoreva M, et al. A new case of a rare combination of temple syndrome and mosaic trisomy 14 and a literature review. Mol Syndromol 2018;9(4):1829. [53] Prontera P, et al. Trisomy 15 mosaicism owing to familial reciprocal translocation t(1;15): implication for prenatal diagnosis. Prenat Diagn 2006;26(6):5716.

[54] Natacci F, et al. Delineating the mosaic trisomy 15 phenotype using a serendipitous mechanism as a clue. Cytogenet Genome Res 2015;146(1):4450. [55] Benn P. Trisomy 16 and trisomy 16 mosaicism: a review. Am J Med Genet 1998;79(2):12133. [56] Sparks TN, Thao K, Norton ME. Mosaic trisomy 16: what are the obstetric and long-term childhood outcomes? Genet Med 2017;19(10):116470. [57] Chen CP, et al. Mosaic trisomy 17 at amniocentesis: prenatal diagnosis, molecular genetic analysis, and literature review. Taiwan J Obstet Gynecol 2016;55(5):71217. [58] Cereda A, Carey JC. The trisomy 18 syndrome. Orphanet J Rare Dis 2012;7:81. [59] Hsu LY, et al. Rare trisomy mosaicism diagnosed in amniocytes, involving an autosome other than chromosomes 13, 18, 20, and 21: karyotype/phenotype correlations. Prenat Diagn 1997;17(3):20142. [60] Hsu LY, et al. Proposed guidelines for diagnosis of chromosome mosaicism in amniocytes based on data derived from chromosome mosaicism and pseudomosaicism studies. Prenat Diagn 1992;12(7):55573. [61] Bianca S, et al. Prenatally detected trisomy 20 mosaicism and genetic counseling. Prenat Diagn 2005;25 (8):7256. [62] Antonarakis SE. Down syndrome and the complexity of genome dosage imbalance. Nat Rev Genet 2017;18(3):14763. [63] Papavassiliou P, et al. Mosaicism for trisomy 21: a review. Am J Med Genet A 2015;167a(1):2639. [64] Hulten MA, et al. Germinal and somatic trisomy 21 mosaicism: how common is it, what are the implications for individual carriers and how does it come about? Curr Genomics 2010;11(6):40919. [65] Zhao W, et al. Postnatal identification of trisomy 21: an overview of 7,133 postnatal trisomy 21 cases identified in a diagnostic reference laboratory in China. PLoS One 2015;10(7):e0133151. [66] Potter H, Granic A, Caneus J. Role of trisomy 21 mosaicism in sporadic and familial Alzheimer’s disease. Curr Alzheimer Res 2016;13(1):717. [67] Pinto-Escalante D, et al. 
Full mosaic monosomy 22 in a child with DiGeorge syndrome facial appearance. Am J Med Genet 1998;76(2):1503. [68] Payer B. Developmental regulation of X-chromosome inactivation. Semin Cell Dev Biol 2016;56:8899. [69] Migeon BR. Why females are mosaics, X-chromosome inactivation, and sex differences in disease. Gend Med 2007;4(2):97105. [70] Machiela MJ, Chanock SJ. Detectable clonal mosaicism in the human genome. Semin Hematol 2013;50 (4):34859. [71] Hook EB, Warburton D. Turner syndrome revisited: review of new data supports the hypothesis that all viable 45,X cases are cryptic mosaics with a rescue cell line, implying an origin by mitotic loss. Hum Genet 2014;133(4):41724. [72] Skuse D, Printzlau F, Wolstencroft J. Sex chromosome aneuploidies. Handb Clin Neurol 2018;147:35576. [73] Jacobs PA, et al. Change of human chromosome count distribution with age: evidence for a sex differences. Nature 1963;197:10801. [74] Forsberg LA, Gisselsson D, Dumanski JP. Mosaicism in health and disease—clones picking up speed. Nat Rev Genet 2017;18(2):12842. [75] Guo X, et al. Mosaic loss of human Y chromosome: what, how and why. Hum Genet 2020. [76] Forsberg LA. Loss of chromosome Y (LOY) in blood cells is associated with increased risk for disease and mortality in aging men. Hum Genet 2017;136(5):65763. [77] Cho CH, et al. A case report of a fetus with mosaic autosomal variegated aneuploidies and literature review. Ann Clin Lab Sci 2015;45(1):1069. [78] Pristyazhnyuk IE, Menzorov AG. Ring chromosomes: from formation to clinical potential. Protoplasma 2018;255(2):43949.

[79] Demily C, et al. Complex phenotype with social communication disorder caused by mosaic supernumerary ring chromosome 19p. BMC Med Genet 2014;15:132. [80] Batista DA, et al. An accessory marker derived from chromosome 20 and its co-existence with a mosaic trisomy 20 cell line. Prenat Diagn 1995;15(2):1237. [81] Zhang H, Burr SP, Chinnery PF. The mitochondrial DNA genetic bottleneck: inheritance and beyond. Essays Biochem 2018;62(3):22534. [82] van den Ameele J, et al. Mitochondrial heteroplasmy beyond the oocyte bottleneck. Semin Cell Dev Biol 2020;97:15666. [83] Lake NJ, et al. Leigh syndrome: neuropathology and pathogenesis. J Neuropathol Exp Neurol 2015;74 (6):48292. [84] Kazazian Jr. HH, Moran JV. Mobile DNA in health and disease. N Engl J Med 2017;377(4):36170. [85] McConnell MJ, et al. Intersection of diverse neuronal genomes and neuropsychiatric disease: the brain somatic mosaicism network. Science 2017;356(6336). [86] Campbell IM, et al. Parental somatic mosaicism is underrecognized and influences recurrence risk of genomic disorders. Am J Hum Genet 2014;95(2):17382. [87] Myers CT, et al. Parental mosaicism in “de novo” epileptic encephalopathies. N Engl J Med 2018;378 (17):16468. [88] Mei D, et al. Dravet syndrome as part of the clinical and genetic spectrum of sodium channel epilepsies and encephalopathies. Epilepsia 2019;60(Suppl. 3):S27. [89] Daniel A, et al. Issues arising from the prenatal diagnosis of some rare trisomy mosaics—the importance of cryptic fetal mosaicism. Prenat Diagn 2004;24(7):52436. [90] Pinto AM, et al. Detection of cryptic mosaicism in X-linked Alport syndrome prompts to reevaluate living-donor kidney transplantation. Transplantation 2019;104(11):23604. [91] Lindhurst MJ, et al. A mosaic activating mutation in AKT1 associated with the Proteus syndrome. N Engl J Med 2011;365(7):61119. [92] Bachman KE, et al. The PIK3CA gene is mutated with high frequency in human breast cancers. Cancer Biol Ther 2004;3(8):7725. 
[93] Kurek KC, et al. Somatic mosaic activating mutations in PIK3CA cause CLOVES syndrome. Am J Hum Genet 2012;90(6):110815. [94] Riviere JB, et al. De novo germline and postzygotic mutations in AKT3, PIK3R2 and PIK3CA cause a spectrum of related megalencephaly syndromes. Nat Genet 2012;44(8):93440. [95] Liaw D, et al. Germline mutations of the PTEN gene in Cowden disease, an inherited breast and thyroid cancer syndrome. Nat Genet 1997;16(1):647. [96] Hussain K, et al. An activating mutation of AKT2 and human hypoglycemia. Science 2011;334 (6055):474. [97] Evrony GD, et al. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 2012;151(3):48396. [98] Poduri A, et al. Somatic activation of AKT3 causes hemispheric developmental brain malformations. Neuron 2012;74(1):418. [99] Lee JH, et al. De novo somatic mutations in components of the PI3K-AKT3-mTOR pathway cause hemimegalencephaly. Nat Genet 2012;44(8):9415. [100] Lim JS, et al. Brain somatic mutations in MTOR cause focal cortical dysplasia type II leading to intractable epilepsy. Nat Med 2015;21(4):395400. [101] Baulac S, et al. Familial focal epilepsy with focal cortical dysplasia due to DEPDC5 mutations. Ann Neurol 2015;77(4):67583. [102] Dibbens LM, et al. Mutations in DEPDC5 cause familial focal epilepsy with variable foci. Nat Genet 2013;45(5):54651.

[103] European Chromosome 16 Tuberous Sclerosis Consortium. Identification and characterization of the tuberous sclerosis gene on chromosome 16. Cell 1993;75(7):130515. [104] van Slegtenhorst M, et al. Identification of the tuberous sclerosis gene TSC1 on chromosome 9q34. Science 1997;277(5327):8058. [105] Papp T, et al. Mutational analysis of the N-ras, p53, p16INK4a, CDK4, and MC1R genes in human congenital melanocytic naevi. J Med Genet 1999;36(8):61014. [106] Pollock PM, et al. High frequency of BRAF mutations in nevi. Nat Genet 2003;33(1):1920. [107] Hafner C, et al. Multiple oncogenic mutations and clonal relationship in spatially distinct benign human epidermal tumors. Proc Natl Acad Sci USA 2010;107(48):207805. [108] Hafner C, Toll A, Real FX. HRAS mutation mosaicism causing urothelial cancer and epidermal nevus. N Engl J Med 2011;365(20):19402. [109] Bourdeaut F, et al. Mosaicism for oncogenic G12D KRAS mutation associated with epidermal nevus, polycystic kidneys and rhabdomyosarcoma. J Med Genet 2010;47(12):85962. [110] Rasmussen SA, Friedman JM. NF1 gene and neurofibromatosis 1. Am J Epidemiol 2000;151 (1):3340. [111] Garcia-Linares C, et al. Dissecting loss of heterozygosity (LOH) in neurofibromatosis type 1-associated neurofibromas: importance of copy neutral LOH. Hum Mutat 2011;32(1): 7890. [112] Cawthon RM, et al. A major segment of the neurofibromatosis type 1 gene: cDNA sequence, genomic structure, and point mutations. Cell 1990;62(1):193201. [113] Groesser L, et al. BRAF and RAS mutations in sporadic and secondary pyogenic granuloma. J Invest Dermatol 2016;136(2):4816. [114] Groesser L, et al. Postzygotic HRAS and KRAS mutations cause nevus sebaceous and Schimmelpenning syndrome. Nat Genet 2012;44(7):7837. [115] Niemela JE, et al. Somatic KRAS mutations associated with a human nonmalignant syndrome of autoimmunity and abnormal leukocyte homeostasis. Blood 2011;117(10):28836. [116] Takagi M, et al. 
Autoimmune lymphoproliferative syndrome-like disease with somatic KRAS mutation. Blood 2011;117(10):288790. [117] Shirley MD, et al. Sturge-Weber syndrome and port-wine stains caused by somatic mutation in GNAQ. N Engl J Med 2013;368(21):19719. [118] Thomas AC, et al. Mosaic activating mutations in GNA11 and GNAQ are associated with phakomatosis pigmentovascularis and extensive dermal melanocytosis. J Invest Dermatol 2016;136 (4):7708. [119] Couto JA, et al. A somatic MAP3K3 mutation is associated with verrucous venous malformation. Am J Hum Genet 2015;96(3):4806. [120] Schwindinger WF, Francomano CA, Levine MA. Identification of a mutation in the gene encoding the alpha subunit of the stimulatory G protein of adenylyl cyclase in McCune-Albright syndrome. Proc Natl Acad Sci USA 1992;89(11):51526. [121] Pansuriya TC, et al. Somatic mosaic IDH1 and IDH2 mutations are associated with enchondroma and spindle cell hemangioma in Ollier disease and Maffucci syndrome. Nat Genet 2011;43 (12):125661. [122] Baxter EJ, et al. Acquired mutation of the tyrosine kinase JAK2 in human myeloproliferative disorders. Lancet 2005;365(9464):105461. [123] Kralovics R, et al. A gain-of-function mutation of JAK2 in myeloproliferative disorders. N Engl J Med 2005;352(17):177990. [124] Depienne C, et al. Mechanisms for variable expressivity of inherited SCN1A mutations causing Dravet syndrome. J Med Genet 2010;47(6):40410.

[125] Tanaka N, et al. High incidence of NLRP3 somatic mosaicism in patients with chronic infantile neurologic, cutaneous, articular syndrome: results of an International Multicenter Collaborative Study. Arthritis Rheum 2011;63(11):362532. [126] Grzeschik KH, et al. Deficiency of PORCN, a regulator of Wnt signaling, is associated with focal dermal hypoplasia. Nat Genet 2007;39(7):8335. [127] Takeda J, et al. Deficiency of the GPI anchor caused by a somatic mutation of the PIG-A gene in paroxysmal nocturnal hemoglobinuria. Cell 1993;73(4):70311. [128] Brillante B, Guthrie L, Van Ryzin C. McCune-Albright syndrome: an overview of clinical features. J Pediatr Nurs 2015;30(5):81517. [129] Follo MY, et al. PLC and PI3K/Akt/mTOR signalling in disease and cancer. Adv Biol Regul 2015;57:1016. [130] Madsen RR, Vanhaesebroeck B, Semple RK. Cancer-associated PIK3CA mutations in overgrowth disorders. Trends Mol Med 2018;24(10):85670. [131] Van Raamsdonk CD, et al. Frequent somatic mutations of GNAQ in uveal melanoma and blue naevi. Nature 2009;457(7229):599602. [132] Zhang X, Zhang Y. Bladder cancer and genetic mutations. Cell Biochem Biophys 2015;73(1):659. [133] Kwiatkowska J, et al. Mosaicism in tuberous sclerosis as a potential cause of the failure of molecular diagnosis. N Engl J Med 1999;340(9):7037. [134] Boveri T. Concerning the origin of malignant tumours by Theodor Boveri [Harris H, Translated and annotated] J Cell Sci 2008;121(Suppl. 1):184. [135] Hansford S, Huntsman DG. Boveri at 100: Theodor Boveri and genetic predisposition to cancer. J Pathol 2014;234(2):1425. [136] Berger AH, Knudson AG, Pandolfi PP. A continuum model for tumour suppression. Nature 2011;476 (7359):1639. [137] Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell 2011;144(5):64674. [138] Vogelstein B, et al. Cancer genome landscapes. Science 2013;339(6127):154658. [139] Yarchoan M, Hopkins A, Jaffee EM. Tumor mutational burden and response rate to PD-1 inhibition. 
N Engl J Med 2017;377(25):25001. [140] Fernandez LC, Torres M, Real FX. Somatic mosaicism: on the road to cancer. Nat Rev Cancer 2016;16(1):4355. [141] Miller RM, Sparkes RS. Segmental neurofibromatosis. Arch Dermatol 1977;113(6):8378. [142] Biesecker LG, Spinner NB. A genomic view of mosaicism and human disease. Nat Rev Genet 2013;14 (5):30720. [143] Migeon BR, et al. Selection against lethal alleles in females heterozygous for incontinentia pigmenti. Am J Hum Genet 1989;44(1):1006. [144] Alabdullatif Z, et al. Postzygotic mosaicism and incontinentia pigmenti in male patients: molecular diagnosis yield. Br J Dermatol 2018;178(4):e2612. [145] Medeiros PM, et al. Segmental Darier’s disease: a presentation of difficult diagnosis. An Bras Dermatol 2015;90(3 Suppl. 1):625. [146] Horton L, Mehregan D. Unilateral segmental Darier’s disease associated with neuropsychiatric disorders. Clin Case Rep 2019;7(7):13624. [147] Marchuk DA. Genetic abnormalities in hereditary hemorrhagic telangiectasia. Curr Opin Hematol 1998;5(5):3328. [148] Macri A, et al. Osler-Weber-Rendu disease (hereditary hemorrhagic telangiectasia, HHT). StatPearls. Treasure Island (FL): StatPearls Publishing StatPearls Publishing LLC; 2020. [149] Lee NP, et al. Identification of clinically relevant mosaicism in type I hereditary haemorrhagic telangiectasia. J Med Genet 2011;48(5):3537.

[150] Maloney S, et al. Microchimerism of maternal origin persists into adult life. J Clin Invest 1999;104(1):417. [151] Bianchi DW, et al. Male fetal progenitor cells persist in maternal blood for as long as 27 years postpartum. Proc Natl Acad Sci USA 1996;93(2):7058. [152] Nelson JL. The otherness of self: microchimerism in health and disease. Trends Immunol 2012;33 (8):4217. [153] Fugazzola L, Cirello V, Beck-Peccoz P. Microchimerism and endocrine disorders. J Clin Endocrinol Metab 2012;97(5):145261.

Chapter 9

Multilocus inheritance and variable disease expressivity in rare disease

Jennifer E. Posey

Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00005-3 © 2021 Elsevier Inc. All rights reserved.

9.1 Introduction

Incremental, albeit substantial, developments in genetic and genomic technologies [karyotyping, chromosomal microarray analysis (CMA), whole-exome sequencing (WES), whole-genome sequencing (WGS)] have enabled the rare disease research field to charge toward disease locus and gene discovery over the past 2-3 decades [1-7]. This is perhaps best illustrated by the steady increase in both the number of phenotypes with an established molecular etiology and the number of genes with a phenotype-causing variant in the Online Mendelian Inheritance in Man (OMIM) database [8]. Yet even as worldwide rare disease research consortia demonstrate a steady rate of novel discovery (approximately 263 discoveries per year in the US National Institutes of Health-supported Centers for Mendelian Genomics alone) [9], fewer than one-quarter of the approximately 20,000 human protein-coding genes have been reported to have a disease-causing variant, and clinical diagnostic rates for unbiased molecular testing (CMA, WES, WGS) have reached only 25%-50%, varying by patient age and reason for referral [10-16]. It has become increasingly clear that one-to-one cataloging of gene-phenotype relationships is going to be critical, but not sufficient, for rare disease diagnosis.

Historically, the field of human molecular genetics and genomics has applied a “one-gene-one-disease” paradigm to the identification of genotype-phenotype relationships in rare, Mendelian disease, but this standard is insufficient to address the complexity of such disease and can lead to an inability to reach a molecular diagnosis or even to errors in genetic diagnosis. Several rare disease observations have challenged this “one-gene-one-disease” thinking as applied to both discovery and diagnostics. In particular, genes associated with both dominantly and recessively inherited diseases have been recognized for some time, but these are often thought to be the exception rather than the rule. LMNA is a striking example of this phenomenon: first described in association with Emery-Dreifuss muscular dystrophy (EDMD2, MIM #181350) in 1994, the locus is now associated with multiple disease traits that can be broadly categorized as lipodystrophies, diseases of skeletal and cardiac muscle, neuropathies, and premature aging syndromes [17,18]. The phenotypes associated with such genes can be distinct, such as with LMNA; or demonstrate substantial clinical overlap, such as the dominant or recessive epidermolysis bullosa associated with monoallelic and biallelic pathogenic variation in COL7A1 (MIM #226600, #132000); or even demonstrate partial overlap, such as the primordial dwarfism and microcephaly described in individuals with biallelic DONSON variants, compared to the short stature, hypoplastic femur and tibia, and micrognathia described in an individual with Femoral-facial syndrome (MIM #134780) and a de novo monoallelic DONSON variant [19]. Elucidation of molecular mechanisms of disease can reveal the differential impact of particular variant types on a locus (e.g., APC) [20-22], pathogenic variation in different protein domains (e.g., NALCN, EGR2) [23,24], and trigger or escape of nonsense-mediated decay (e.g., RHO) [25].

Adding another layer to the above-described molecular and mechanistic complexity that may accompany an individual locus, the interplay of rare, and sometimes even common, variation at more than one locus (multilocus variation) can have substantial impacts on disease expression and penetrance. The following sections will discuss dual (or multiple) molecular diagnoses, phenotypic expansion, mutational burden, and incomplete penetrance. These genetic “confounders” teach us about the true elegance and complexity of genotype-phenotype relationships within an individual and provide some clues for how to approach the challenge of maximizing rare disease diagnosis.

9.2 Dual molecular diagnoses

9.2.1 Delineating dual molecular diagnoses

Dual (or multiple) molecular diagnoses are defined as a combination of pathogenic variants at two or more loci, leading to a blended phenotype resulting from two or more rare diseases [11]. Dual molecular diagnoses are by no means novel, and in fact, reports of individuals with at least two rare disease diagnoses were published as early as the 1960s. Cahill and Ley described a 35-year-old woman of Mediterranean descent who presented with jaundice during the 6th month of her second pregnancy following the ingestion of 2 pounds of fava beans during the previous 1.5 days [26]. A low glucose-6-phosphate dehydrogenase (G6PD) activity level and the demonstration of an increased alpha globulin hemoglobin fraction confirmed her dual diagnosis: G6PD deficiency (MIM #300908) and alpha-thalassemia (MIM #604131), both rare conditions known to impact individuals of Mediterranean ancestry. Expansion of this testing to relatives confirmed independent segregation of each trait, a hallmark of dual molecular diagnoses, with alpha-thalassemia having been inherited from her father and G6PD deficiency from her mother. Despite this and other early reports, the relative frequency of dual molecular diagnoses in rare disease populations was not fully appreciated until recently, and such cases can confound even the most skilled clinical and molecular diagnosticians.

9.2.2 Dual molecular diagnoses lead to syndrome disintegration

Fitzsimmons syndrome is one condition that illustrates the inherent challenges in rare disease diagnosis. In 1987, Fitzsimmons and Guilbert reported monozygotic male twins who presented at 20 years of age with a slowly progressive spastic paraplegia that had become apparent by 4 years of age [27]. This was accompanied by borderline intellectual disability, dysarthria, brachydactyly with shortened fourth and fifth metacarpals, and cone-shaped epiphyses on X-ray. The authors hypothesized that although the mode of inheritance was not clear from this single family, a monogenic condition was likely. Three further reports would provide additional support for Fitzsimmons syndrome as a novel and distinctive disease trait, with a total of six individuals from four unrelated families (the original male twins, one pair of female twins, one female proband, and one male proband) described with shared features that included spastic paraplegia, dysarthria, brachydactyly, and intellectual disability [28–30].

With the causative gene remaining elusive, WES was performed on one of the male monozygotic twins reported in 1987 and remarkably revealed a dual molecular diagnosis for the twins: biallelic variants in SACS, conferring a diagnosis of autosomal recessive spastic ataxia of Charlevoix-Saguenay (MIM #270550), and a heterozygous pathogenic variant in TRPS1, conferring a diagnosis of trichorhinophalangeal syndrome type 1 (TRPS1, MIM #190350) [31]. In contrast, exome sequencing of the male proband described in 2009 revealed a single de novo variant in TBL1XR1, at that time associated with intellectual disability and autism, which was felt to provide only a partial molecular diagnosis for his phenotype; the possibility of a second molecular diagnosis to explain this individual’s spasticity and brachydactyly was raised. Although the other two families previously reported to have Fitzsimmons syndrome were not available for further study, the dual molecular diagnosis reported in the original Fitzsimmons syndrome twins and the partial diagnosis identified in an unrelated individual with Fitzsimmons syndrome were sufficient to retire Fitzsimmons syndrome (formerly MIM ^270710) as a single and clinically recognizable disease entity.

9.2.3 Agnostic molecular techniques reveal multiple molecular diagnoses

As the history of Fitzsimmons syndrome suggests, cases of dual and multiple molecular diagnoses can be difficult to ascertain clinically, even for the most skilled clinical genetics diagnosticians. Molecular diagnostics have changed substantially with the increasing development of genome-wide technologies that do not require the ordering clinician to consider the correct diagnosis in the differential. One outcome of clinical implementation of these unbiased techniques, particularly WES, has been the elucidation of diagnoses in individuals with atypical presentations of disease [32,33]. This may be particularly true for adult presentations of rare diseases, which may be confounded by environmental factors and other common disease conditions [10].

Another outcome of these unbiased techniques has been the discovery that some rare diseases, while seemingly well-defined from a clinical standpoint, may actually represent a combination of other established rare disease conditions. Recent WES or WGS analysis of a cohort of 31 individuals with Dubowitz syndrome (MIM %223370), first described in 1965 in individuals with intrauterine growth restriction, microcephaly, mild intellectual disability, short stature, and distinctive facial features, uncovered distinct presumptive molecular diagnoses in 13/27 (48%) of families, with only a single locus (HDAC8) identified in two families [34]. Although the authors proposed extensive locus heterogeneity for Dubowitz syndrome, it is entirely possible that Dubowitz syndrome itself does not represent a distinct condition.
Considering atypical presentations of monogenic conditions, and the observation that individuals with seemingly hallmark presentations of a rare condition may in fact have a different molecular diagnosis with similar clinical features, it is not surprising that cases with dual or multiple molecular diagnoses add even further clinical diagnostic complexity. It is only with the broader use of unbiased molecular testing that such cases have been more regularly uncovered. The earliest report of diagnostic exome sequencing in a clinical, non-cancer referral cohort in 2013 described dual molecular diagnoses in 6.5% of cases for which a molecular diagnosis was rendered by WES (Table 9.1) [11]. Thereafter, additional reports of dual molecular diagnoses from clinically referred, primarily pediatric diagnostic cohorts in the United States ranged from ~3.2% to 7.2% [12,13,15] and more recently were reported to be 3.1% in a clinical WES cohort from Saudi Arabia [35]. This dual molecular diagnostic rate was similar in age-limited cohorts: an adult referral cohort [10], and a neonatal intensive care unit cohort [36]. The dual molecular diagnostic rate varied more widely with certain phenotypes in research settings: 3.5% in a mixed undiagnosed cohort from Canada, 13.5% in a neurometabolic disorder cohort of individuals of primarily European descent, and in individuals of Turkish ancestry ranged from 2.8% when brain malformations were studied to 12.2% when arthrogryposis was studied (as high as 22.0% when all diagnoses, including novel genes, were considered) [37–41]. With each of these cohorts, the number of cases identified as having multiple molecular diagnoses was relatively modest, ranging from 3 to 30 (Table 9.1).

In 2017, a cohort of 101 clinically referred cases was reported to have two or more molecular diagnoses, 4.9% of 2076 cases for which WES was diagnostic [42]. Despite this fairly substantial rate of multiple molecular diagnoses, statistical analyses of the molecular diagnostic rate of this cohort suggested that an even higher rate of multiple molecular diagnoses (14.0% using a Poisson model; 26.4% using an independent model; one-sided binomial test P < 0.0001 for both models) would have been predicted [42]. This analysis prompted consideration that multiple molecular diagnoses may simply still be under-recognized. Indeed, reanalysis of clinical WES data from the two Yang et al. cohorts in 2019 revealed an almost-doubling of the number of multiple molecular diagnosis cases (to 48 total) [11,12,16], with new disease genes, updated clinical data, variant reclassification, familial segregation studies, and copy-number variant (CNV) analysis each contributing to this increase. Notably, concurrent CMA and WES provide the best opportunity for clinical laboratory studies to identify more than one molecular diagnosis, as diagnosticians may forego further clinical testing once a single plausible molecular diagnosis has been identified [44].

Table 9.1 Frequency of multiple molecular diagnoses across clinical and research cohorts.

| Study characteristics | Manuscript | Total # families | # Families with molecular Dx | # Families with ≥ 2 molecular Dx | % of Dx’d families with ≥ 2 molecular Dx |
|---|---|---|---|---|---|
| Clinical referral laboratory | Yang et al. [11] | 250 | 62 | 4 | 6.5 |
| Clinical referral laboratory | Yang et al. [12] | 2000 | 504 | 23 | 4.6 |
| Clinical referral laboratory | Farwell et al. [15] | 500 | 152 | 11 | 7.2 |
| Clinical referral laboratory | Retterer et al. [13] | 3040 | 875 | 28 | 3.2 |
| Clinical referral laboratory | Posey et al. [42] | 7374 | 2076 | 101 | 4.9 |
| Clinical referral laboratory | Monies et al. [35] | 2219 | 961 | 30 | 3.1 |
| Clinical referral, age-limited | Posey et al. [10,43] (adult cases) | 486 | 85 | 6 | 7.1 |
| Clinical referral, age-limited | Meng et al. [36] (neonatal cases) | 278 | 102 | 4 | 3.9 |
| Research cohort, phenotype-focused | Balci et al. [37] (mixed undiagnosed) | 802 | 226 | 8 | 3.5 |
| Research cohort, phenotype-focused | Tarailo-Graovac et al. [38] (neurometabolic) | 41 | 37 | 5 | 13.5 |
| Research cohort, phenotype-focused | Karaca et al. [39] (brain malformation) | 128 | 108 | 3 | 2.8 |
| Research cohort, phenotype-focused | Pehlivan et al. [40] (arthrogryposis, known genes) | 119 | 82 | 10 | 12.2 |
| Research cohort, phenotype-focused | Pehlivan et al. [40] (arthrogryposis, known + novel genes) | 119 | 82 | 18 | 22.0 |

The total number of studied families, the number of families with a molecular diagnosis, and the number of families with two or more molecular diagnoses are indicated. Finally, the percentage of families with a molecular diagnosis for whom two or more molecular diagnoses were identified is indicated.
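The final column of Table 9.1 is the number of families with two or more molecular diagnoses divided by the number of families with any molecular diagnosis. The short sketch below recomputes that column from the counts in the table. The dictionary layout and the function name `multiple_dx_percent` are illustrative choices, not part of any cited study's pipeline, and the rows are not pooled, because some cohorts may overlap.

```python
# Counts transcribed from Table 9.1 as
# (families with a molecular diagnosis, families with >= 2 molecular diagnoses).
# The data structure is an illustrative assumption, not a published format.
TABLE_9_1 = {
    "Yang et al. [11]": (62, 4),
    "Yang et al. [12]": (504, 23),
    "Farwell et al. [15]": (152, 11),
    "Retterer et al. [13]": (875, 28),
    "Posey et al. [42]": (2076, 101),
    "Monies et al. [35]": (961, 30),
    "Posey et al. [10,43] (adult)": (85, 6),
    "Meng et al. [36] (neonatal)": (102, 4),
    "Balci et al. [37]": (226, 8),
    "Tarailo-Graovac et al. [38]": (37, 5),
    "Karaca et al. [39]": (108, 3),
    "Pehlivan et al. [40] (known genes)": (82, 10),
    "Pehlivan et al. [40] (known + novel genes)": (82, 18),
}


def multiple_dx_percent(diagnosed: int, multiple: int) -> float:
    """Percentage of diagnosed families that carry >= 2 molecular diagnoses."""
    if diagnosed <= 0:
        raise ValueError("number of diagnosed families must be positive")
    if not 0 <= multiple <= diagnosed:
        raise ValueError("multiple-diagnosis count must lie between 0 and diagnosed")
    return 100.0 * multiple / diagnosed


if __name__ == "__main__":
    for study, (diagnosed, multiple) in TABLE_9_1.items():
        pct = multiple_dx_percent(diagnosed, multiple)
        print(f"{study}: {multiple}/{diagnosed} = {pct:.1f}%")
```

Because the denominator is the number of diagnosed families rather than all studied families, a cohort with a low overall diagnostic yield can still show a high multiple-diagnosis percentage.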

9.2.4 Dissecting dual molecular diagnoses

In individuals with dual molecular diagnoses, the resulting “disease pairs” can be conceptualized in terms of their phenotypic similarity or dissimilarity along a spectrum (Fig. 9.1) [42]. At one end of the spectrum are individuals with highly dissimilar disease pairs, for which two distinct disease traits are blended within a single individual. At the other end of the spectrum are individuals with very similar disease pairs, for which two overlapping disease traits are blended within a single individual. This differentiation between distinct and overlapping disease traits has clinical relevance, as both can create a clinical diagnostic conundrum: individuals with a blended phenotype resulting from distinct disease traits may in fact seem clinically to have an altogether new disease trait, with a new combination of clinical phenotypes; in contrast, individuals with overlapping disease traits may appear clinically to have a single known condition, perhaps with a more exaggerated phenotype than might have been anticipated. Both scenarios can lead to challenges in clinical ascertainment of dual molecular diagnoses, underscoring the need for unbiased testing approaches for clinical diagnosis.

FIGURE 9.1 Dual molecular diagnoses confer phenotypes that are blended within a single individual. The disease traits resulting from each pair of molecular diagnoses may result in a blending of distinct clinical phenotypes, for example, a combination of Aarskog-Scott syndrome (MIM #305400) due to pathogenic variation in FGD1, and lissencephaly due to pathogenic variation in PAFAH1B1 (MIM #607432), two conditions that involve different organ systems. In contrast, dual molecular diagnoses may also produce a blending of overlapping clinical phenotypes, for example, an individual with KBG syndrome due to pathogenic variation in ANKRD11 (MIM #148050) and Coffin-Siris syndrome (MIM #135900) due to pathogenic variation in ARID1B. DD/ID, developmental delay/intellectual disability.

One hypothesis born from these observations was the notion that the gene pairs identified in cases with overlapping disease traits are more likely than those identified in cases with distinct disease traits to be close pathway interactors [42] and thus lead to exaggerated phenotypic expression. Several subsequent case reports would seem to further support this hypothesis. The first describes a 6-year-old male with paternally inherited Okur-Chung neurodevelopmental syndrome (MIM #617062) and maternally inherited TRPS1 (MIM #190350), both dominant conditions [45]. While both affected parents demonstrated short stature (−3.63 SD, mother; −4.01 SD, father), a component of both conditions, the proband had exaggerated short stature (−5.8 SD). Separately, another family was reported with short stature, distinctive craniofacial features, and brachydactyly in an affected father and his two daughters, one of whom was observed to be more severely affected [46].
The family was given a clinical diagnosis of autosomal dominant Robinow syndrome, yet WES revealed a heterozygous pathogenic BMP2 variant in all three affected individuals. Reanalysis of the WES data of the more severely affected daughter revealed a de novo 13-basepair deletion in DVL1, thus conferring a dual molecular diagnosis of short stature, facial dysmorphism, and skeletal anomalies (SSFSC, MIM #617877) and autosomal dominant Robinow syndrome (MIM #616331). It is notable that the more severely affected proband again had more striking short stature (−6.2 SD) compared to her affected sister (−2.2 SD) and father (−2.77 SD), likely due to the Wnt pathway interaction of BMP2 and DVL1.

Whereas one hallmark feature of dual molecular diagnoses is that they result from two independently segregating disease traits, digenic conditions are single disease traits that result from pathogenic variation in two loci. At first glance, it may not be clear whether variation at two loci in fact results in expression of one or two disease traits, and lack of available and genetically informative relatives can obscure this distinction. To begin to tackle this challenge and provide objective data characterizing digenic conditions in comparison to a set of cases with either dual molecular diagnoses or a highly penetrant variant plus a genetic modifier (the authors use the term “composite” to represent this comparison cohort), Gazzo et al. performed an analysis of molecular features associated with the different classes of multilocus variation in DIDA (DIgenic diseases DAtabase) [47,48]. Modeling a combination of variant-level, gene-level, and high-level features for each multilocus case, they were able to computationally distinguish true digenic conditions from other multilocus scenarios [47].

The inherent clinical complexity brought about by blended phenotypes has led to the use of phenotypic similarity scores (PSS) to measure the degree of similarity between the two disease traits of a disease pair.
To quantify PSS, the sets of phenotypic features conferred by each disease trait of a disease pair are computationally compared, with lower PSS representing distinct disease traits, and higher PSS indicating overlapping disease traits. Critical to such analyses is the use of a controlled vocabulary to create an ontologically structured clinical feature term set, such as the Human Phenotype Ontology (HPO), and the prior mapping of OMIM disease traits to HPO terms (the OMIM MorbidMap) [8,49,50]. One challenge to this approach, however, is that it often utilizes a “snapshot in time” of clinical features, which may not always be fully representative of an individual’s spectrum of disease over the course of their lifetime.

This is perhaps best illustrated by a case report of monozygotic twin girls who demonstrated high arched palates, a weak cry, axial hypotonia, hypoactivity, and small hands and feet during the neonatal period; feeding difficulties led to the placement of nasogastric tubes for nutrition [51]. By 9 months of age, a CMA had demonstrated a ~4.9 Mb deletion at 15q11.2-q13.1 on the paternal allele, consistent with a diagnosis of Prader-Willi syndrome (PWS, MIM #176270). Children with PWS almost universally have hypotonia and feeding problems during infancy but then progress to develop hyperphagia and distinctive behavioral phenotypes by early childhood. Uncharacteristically, these twin girls re-presented at age 12 years with very limited speech, an inability to walk, and distinctive repetitive movements; these observed clinical features and a complete lack of hyperphagia prompted further investigation by WES, which ultimately demonstrated a heterozygous pathogenic variant in TCF4 inherited from their mosaic mother and conferring a second molecular diagnosis of Pitt-Hopkins syndrome (MIM #610954). Yet another example of sequential disease trait expression was reported in a male with dysmorphic craniofacial features, intellectual disability, and short stature [52].
His early infancy was notable for failure to thrive, but at 15 years of age, he began to develop weight gain that did not respond to dietary intervention. An evaluation at 18 years of age was notable for moderate obesity, and WES at that time revealed dual molecular diagnoses: Bohring-Opitz syndrome (BOS, ASXL1, MIM #605039) and monogenic obesity (MC4R, MIM #618406). Curiously, the monogenic obesity may have counterbalanced the patient’s early failure to thrive, potentially minimizing the full impact of that particular trait of BOS. In both of these examples, nonclassical phenotypes prompted the WES testing that ultimately led to dual molecular diagnoses. That their phenotypes would first resemble one, and then the other, molecular diagnosis over time illustrates the degree of clinical and phenotypic complexity that may result from blended phenotypes.

It is important to note that not all dual molecular diagnoses result from combinations of germline single-nucleotide variants (SNVs) or small indel variants identified by next-generation sequencing. There have now been numerous examples of dual diagnoses resulting from an SNV at one locus and a CNV at another, or even two CNVs at distinct loci or a combination of CNV plus chromosomal aneuploidy [39,42,44,53–55]. In addition, there has been at least one report of a dual molecular diagnosis resulting from one SNV and one duplication CNV at 15q11.2-q13.1 [56]. This duplication was present on the maternal allele and unmethylated in three siblings with hereditary spastic paraplegia and intellectual disability, leading to overexpression of UBE3A and ATP10A. The duplication was also detected in the unaffected mother, in whom it was methylated and on her paternal allele. Dual molecular diagnoses resulting from structural variants (SVs) may be particularly challenging to dissect, as the two molecular diagnoses may result from the effects of a single SV through impacts on both gene dosage and chromatin structure [57].
In addition, mosaic variants have been described to contribute to a dual molecular diagnosis in an individual with ARID1B-associated intellectual disability (MIM #135900), and renal and hepatic angiomyolipomas resulting from mosaic TSC2 pathogenic variation [58]. And finally, mosaicism for two abnormal cell populations has also been reported, for example, in a female mosaic for Turner syndrome and 68,XX triploidy [43]. One feature that many of these cases share is the observation of an evolving phenotype that was not typical for a recognized condition, prompting the astute clinician caring for the patient to consider unbiased genome-wide molecular testing rather than locus-specific studies, and often requiring more than one molecular technique to uncover the dual molecular diagnosis.

Finally, while most cases of dual molecular diagnoses reported have resulted from the combinatorial effects of independently segregating loci, the contribution of variants in linkage disequilibrium segregating in the same haplotype in consanguineous populations or those with high degrees of inbreeding cannot be ignored. An example of this is a report of a pair of siblings from a Mennonite family with clinical findings of skeletal dysplasia, scoliosis, and clubfoot. Family WES showed that both affected siblings shared homozygous pathogenic variants in SLC26A2, associated with diastrophic dysplasia (MIM #222600), and SH3TC2, known to cause Charcot-Marie-Tooth disease (CMT4C, MIM #601596). Whereas the dominating phenotype of skeletal dysplasia can be attributed to the reduced function of SLC26A2, a sulfate transporter important for chondrocyte differentiation and skeletal development, scoliosis and clubfoot are shared features of both disorders. The two pathogenic variants were determined to be in linkage, segregating in the same ancestral haplotype in the Mennonite population [59].
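The phenotypic similarity scores (PSS) discussed in Section 9.2.4 compare the feature sets of the two disease traits in a disease pair. As a minimal sketch of the idea, the code below scores a pair by the Jaccard index of two feature sets. This is a deliberately simplified stand-in: published PSS analyses use HPO terms and ontology-aware similarity measures, whereas this sketch treats features as flat labels. The feature lists and the function name `jaccard_similarity` are hypothetical and do not describe any specific disease.

```python
from typing import Iterable


def jaccard_similarity(features_a: Iterable[str], features_b: Iterable[str]) -> float:
    """Shared features divided by all features observed in either trait.

    Returns a value in [0, 1]; 0 means no shared features (distinct traits),
    values near 1 mean heavily overlapping traits. In a real analysis the
    labels would be HPO term identifiers, and ancestor terms in the ontology
    would typically be taken into account, which this sketch does not do.
    """
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0  # no features recorded for either trait
    return len(a & b) / len(a | b)


# Hypothetical feature sets, for illustration only.
overlapping_pair = (
    {"short stature", "intellectual disability", "facial dysmorphism", "seizures"},
    {"short stature", "intellectual disability", "facial dysmorphism", "hypotonia"},
)
distinct_pair = (
    {"short stature", "facial dysmorphism"},
    {"lissencephaly", "seizures"},
)

if __name__ == "__main__":
    print("overlapping:", jaccard_similarity(*overlapping_pair))
    print("distinct:", jaccard_similarity(*distinct_pair))
```

A score computed this way inherits the “snapshot in time” limitation noted in the text: it reflects only the features recorded at one evaluation, so rescoring as the phenotype evolves may change the result.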

9.3 Phenotypic expansion

9.3.1 Delineating phenotypic spectrum and expansion

Following the elucidation and definition of a novel syndrome, and with confirmatory findings from clinicians and researchers in different groups, a syndrome is considered by the rare disease field to be defined, and its key clinical characteristics fully established (phenotypic spectrum). After this point, there is the tendency for novel features described in the setting of that rare disease to be considered incidental to the condition rather than components of it. This thinking is well documented in reports of Muenke syndrome (MIM #602849), a rare autosomal dominant condition whose key features are craniosynostosis involving the coronal sutures, fusions of the carpal and tarsal bones, and broad toes [60]. Individuals with Muenke syndrome have a well-studied p.Pro250Arg missense variant in FGFR3 originating on the paternal allele (and associated with advanced paternal age) [61], and many with milder involvement of the aforementioned key phenotypes have been described. Despite a seemingly clearly defined genotype-phenotype correlation, since the original description of Muenke syndrome in 1997, there have been several case reports and cohorts detailing additional features that some suspected could be incidental to the Muenke syndrome diagnosis: hearing loss, intellectual disability, temporal lobe dysgenesis, and Klippel-Feil anomaly [60,62,63].

To address this uncertainty, Maximilian Muenke and his team identified a cohort of nine individuals from five unrelated families in whom to pursue deeper phenotyping studies, in combination with a complete literature review of the condition, 10 years after the initial definition of the syndrome [64]. Strikingly, they found that as many as 11% of the 317 previously reported cases of Muenke syndrome had a normal cranial exam, despite craniosynostosis being a hallmark feature of this condition; females were more likely than males to have craniosynostosis recorded. While hearing loss had previously been reported in 40% of cases, Muenke’s own investigations suggested that when evaluated, most patients (95%) will be found to have mild-to-moderate sensorineural hearing loss, as well as developmental delay. These findings were described as an expansion of the previously established Muenke syndrome phenotype, and they led to changes in recommendations for clinical assessment of individuals with Muenke syndrome. Therefore, especially for novel rare disorders, it is important to document and carefully phenotype newly diagnosed patients to better understand the true phenotypic spectrum of the disease beyond the initial report [65,66].

9.3.2 Multiple molecular diagnoses masquerading as phenotypic expansion

As the pace of novel syndrome and gene discovery has increased, there are now increasing examples of well-established syndromes for which novel phenotypic features are described and ultimately found to represent phenotypic expansion for a previously defined condition [9,67]. Unbiased molecular testing methods have enabled gene-first approaches to assemble cohorts of individuals with rare pathogenic variation at a particular locus to understand the full spectrum of phenotypic effects [68–75]. With the growing number of cases with phenotypic expansion reported, however, the possibility that some of these may represent missed dual molecular diagnoses masquerading as phenotypic expansion has been raised.

A comprehensive and standardized WES reanalysis approach was applied to a cohort of Turkish individuals with developmental delay, intellectual disability, and brain malformation to investigate the role of dual molecular diagnoses in phenotypic expansion [76]. In this study, the rate of multiple molecular diagnoses was quantified in a “control” cohort of 87 families previously identified to have a single molecular diagnosis that explained the phenotype, and a “case” cohort of 19 families for which a previously identified single molecular diagnosis was felt to represent a phenotypic expansion in the studied family [39]. The rate of dual molecular diagnoses differed significantly, with 31.6% (6/19) of families previously classified as having phenotypic expansion identified in this analysis to have two or more molecular diagnoses explaining their observed phenotype, compared to only 2.3% (2/87) of the control families found to have an additional molecular diagnosis (Fisher’s exact P = 0.0004) [76]. This difference was even greater when potentially pathogenic variation in candidate genes was considered [42.1% (8/19) compared to 2.3% (2/87)].
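The comparison of 6/19 “case” families with 2/87 “control” families is a 2 × 2 contingency table, for which Fisher’s exact test sums hypergeometric probabilities. The sketch below is a standard-library implementation of the two-sided test, written for illustration; it assumes the common convention of summing all tables that are no more probable than the observed one, and it is not taken from the cited study, which does not state its software or sidedness here.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. case, control); columns are outcome present/absent.
    With margins fixed, the top-left cell follows a hypergeometric
    distribution; the P value sums the probabilities of all tables that are
    no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(k: int) -> float:
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(k) for k in range(k_min, k_max + 1) if prob(k) <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # 6 of 19 phenotypic-expansion families vs 2 of 87 control families
    # had two or more molecular diagnoses.
    p = fisher_exact_two_sided(6, 19 - 6, 2, 87 - 2)
    print(f"P = {p:.5f}")
```

Under these assumptions the computed value is about 0.00035, which agrees with the reported P = 0.0004 after rounding to four decimal places.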

9.3.3 Dissection of genotype-phenotype relationships through intrafamilial genotypic and phenotypic variability

A crucial proof of principle in the study of phenotypic expansion versus multilocus pathogenic variation derived from the fortuitous observation of intrafamilial phenotypic variability. In two unrelated families, sibling pairs born to consanguineous couples demonstrated what might have been clinically interpreted as variable clinical severity of disease trait expression. Yet upon familial segregation of the multiple molecular diagnoses identified in each more severely affected proband, the milder siblings shared only one of the three molecular diagnoses. Given that the phenotype of the study was brain malformation, the multiple molecular diagnoses unsurprisingly conferred overlapping disease traits which would have been otherwise difficult to clinically dissect.

Intrafamilial phenotypic and genotypic variability was again used to dissect genotype-phenotype relationships in a family with multiple loops of consanguinity, in whom two cousins presented with microcephaly [77]. The less severely affected child demonstrated pachygyria on brain MRI that was mostly limited to the posterior brain. In contrast, the more severely affected cousin demonstrated widespread pachygyria, a thin corpus callosum, mild cerebellar atrophy, and autism. Both cousins shared a homozygous pathogenic variant in TUBGCP2 (MIM #618737), and the more severely affected cousin was found to additionally have a de novo duplication at 2q23.1, including MBD5, which had previously been associated with intellectual disability (MIM #156200).

Similarly, in a Japanese family with autosomal dominant type 1 neurofibromatosis (NF1, MIM #162200) diagnosed in a proband, his sister, his mother, and one maternal aunt, the sister was noted to display phenotypic variability characterized by absent meaningful speech at 5 years of age, a small head size (−1.7 SD) unlike the macrocephaly typically observed in NF1, hypotonia, and ataxia [78]. Brain imaging further revealed that she had cerebellar and brainstem hypoplasia. Consistent with their clinical NF1 phenotype, both children and the affected mother were shown to have a pathogenic NF1 variant. However, the more severely affected sister, who had additional features not typical of NF1, also harbored a de novo CASK variant, as well as low-level mosaicism (6%) for trisomy 8, which may have additionally contributed to her phenotype. In each of these reports, substantial phenotypic variability among affected individuals in each family raised suspicion against phenotypic expansion and led to a true dissection of genotype-phenotype relationships within the nuclear family.

9.4 Variable expressivity

Perhaps one of the most challenging aspects of clinical genetics has been the observation of variable phenotypic expressivity for some rare conditions. Differences in disease severity, age of onset, or even specific combinations of phenotypic features can all contribute to variable expressivity. Specific genotype-phenotype correlations have been identified to explain some of these observations. One example is the more severe end of the spectrum of NF1 phenotypes characterized by somatic overgrowth, large hands and feet, severe cognitive difficulties, extensive and widespread café-au-lait macules, and substantial craniofacial dysmorphism that is associated with NF1 deletion [79,80], in contrast to the more “typical” NF1 phenotype associated with NF1 SNVs and indels. However, for some conditions, studies of genotype-phenotype correlations have brought limited clarity.

9.4.1 Variable expressivity may portend mutational burden

Charcot-Marie-Tooth (CMT) neuropathy is one such condition that displays a broad range of phenotypic severity. In a phenotypic analysis of 474 children and adolescents with CMT, a CMTPedS score was used to quantify overall disease severity in each case [81]. While the mean severity score was 21.5 (SD of 8.9), scores ranged widely from 1 to 44, with CMT type and age at evaluation explaining much of this variation in overall severity, as well as specific features related to functional dexterity. Even with such extensive studies on gene-phenotype relationships, the frustrating clinical reality has been that individuals with CMT in the same family (thus, both a shared etiologic variant and shared genetic background) can display different severity of the disease, challenging expectant management and counseling for newly diagnosed relatives.

In 2015 Gonzaga-Jauregui et al. considered the possibility, first predicted in 1941 [82], that a genetic burden of nonpenetrant variants in CMT genes might be responsible for modulating the CMT phenotype observed in families [83]. The authors started with a cohort of 37 unrelated families with 40 affected individuals, applying WES as a comprehensive diagnostic tool to first identify highly penetrant Mendelizing variants (HPMVs) responsible for the neuropathy disease trait and segregating with disease in each family. Through this approach, a molecular diagnosis was revealed in 45% (17/37) of families. They then hypothesized that other rare, nonpenetrant variants might further contribute to disease expression in this cohort. Indeed, they identified an average genetic burden of 2.3 nonsynonymous rare variants in established neuropathy genes per affected individual, compared to only 1.3 nonsynonymous variants in an ethnically matched European control cohort (P < 0.0001). This mutational burden remained significant even after subtraction of the identified HPMVs from their analysis and was confirmed with an independent cohort of 30 families with neuropathy of Turkish origin. They then demonstrated that these multilocus effects resulted in epistatic interactions in zebrafish, with combinations of subeffective morpholino knockdowns of gene pairs identified from the burden analysis demonstrating substantial worsening of the peripheral nervous system phenotype [83].
In a separate study focused on inherited axonopathies [CMT type 2, hereditary spastic paraplegia (HSP)], Bis-Brewer et al. performed another mutational burden analysis [84]. Their cohort included 343 individuals with CMT, 515 individuals with HSP, and a cohort of 935 nonneurological control cases studied by WES. In support of the findings from Gonzaga-Jauregui et al., they demonstrated a statistically significant cumulative mutational burden of both rare nonsynonymous and loss-of-function (LoF) variation in disease genes in CMT and HSP cohorts compared to controls. These studies dovetail nicely to support the hypothesis that an aggregation of rare variant alleles in disease genes can modulate phenotypic severity and explain the observed clinical heterogeneity of the disease. These data also suggest that elucidation of multilocus variation across gene networks or biological pathways could be critical to our understanding of disease penetrance and expressivity [83–85], a prerequisite of which is complete functional and phenotypic annotation of the ~20,000 human protein-coding genes.
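The burden analyses described in Section 9.4.1 count, for each individual, the rare nonsynonymous variants that fall in established disease genes, and then compare the average count between affected individuals and controls. The sketch below illustrates only that counting step under stated assumptions: the `Variant` record, the four-gene panel, the 1% allele-frequency cutoff, and all variant data are hypothetical and are not the gene list, thresholds, or data used in the cited studies. The optional `exclude_ids` argument mirrors the idea of subtracting an already identified highly penetrant variant before recomputing the burden.

```python
from typing import Collection, NamedTuple, Sequence


class Variant(NamedTuple):
    variant_id: str
    gene: str
    nonsynonymous: bool
    allele_freq: float  # population allele frequency, 0..1


# Illustrative panel and cutoff; both are assumptions for this sketch.
NEUROPATHY_GENES = frozenset({"PMP22", "MFN2", "GJB1", "SH3TC2"})
MAX_ALLELE_FREQ = 0.01


def rare_variant_burden(
    variants: Sequence[Variant],
    genes: Collection[str] = NEUROPATHY_GENES,
    max_af: float = MAX_ALLELE_FREQ,
    exclude_ids: Collection[str] = frozenset(),
) -> int:
    """Count rare nonsynonymous variants in the gene panel for one individual."""
    return sum(
        1
        for v in variants
        if v.gene in genes
        and v.nonsynonymous
        and v.allele_freq <= max_af
        and v.variant_id not in exclude_ids
    )


def mean_burden(cohort: Sequence[Sequence[Variant]], **kwargs) -> float:
    """Average per-individual burden across a cohort."""
    if not cohort:
        raise ValueError("cohort must contain at least one individual")
    return sum(rare_variant_burden(ind, **kwargs) for ind in cohort) / len(cohort)


# Hypothetical individuals, each a list of called variants.
affected = [
    [
        Variant("v1", "MFN2", True, 0.0001),    # counted
        Variant("v2", "SH3TC2", True, 0.004),   # counted
        Variant("v3", "TTN", True, 0.0002),     # not in panel
        Variant("v4", "GJB1", False, 0.0001),   # synonymous
    ],
    [
        Variant("v5", "PMP22", True, 0.0005),   # counted
        Variant("v6", "MFN2", True, 0.2),       # too common
    ],
]
controls = [[Variant("v7", "GJB1", True, 0.003)], []]

if __name__ == "__main__":
    print("affected mean burden:", mean_burden(affected))
    print("control mean burden:", mean_burden(controls))
    print("affected, HPMVs removed:", mean_burden(affected, exclude_ids={"v1", "v5"}))
```

A real analysis would add a formal comparison between groups and control for ancestry and sequencing coverage, none of which is attempted here.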

9.5 Incomplete penetrance

The term “incomplete penetrance” is often used to describe multigenerational families in which genetic pedigrees reveal one or more individuals harboring a rare disease pathogenic allele, that is, obligate carriers of a dominant trait disease gene, who do not demonstrate the associated disease trait. Age- or sex-dependent traits may, at first glance, appear to demonstrate incomplete penetrance. Furthermore, differential allelic expression, epigenetics, and environmental factors each contribute to the molecular genetic explanation of a proportion of these observations. Beyond these pathomechanisms, there are increasing data supporting the hypothesis that compound inheritance of a rare allele plus a common variant may be required for penetrance of certain phenotypes.

9.5.1 Rare + common alleles at a single locus explain some cases of incomplete penetrance

In 2012 Albers et al. described a cohort of 55 individuals with thrombocytopenia with absent radii (TAR, MIM #274000) syndrome [86]. TAR syndrome was known to be associated with a 1q21.1 deletion (present in this cohort) that could be either de novo or inherited from an unaffected parent, leading the authors to suspect an apparently biallelic, or recessive, mode of inheritance. They applied WES to five cases to investigate the hypothesis that an additional causative allele was necessary for phenotypic expression. While no second etiologic protein-coding variant was identified, the authors identified a single-nucleotide polymorphism (SNP) (rs139428292) in four of the five studied individuals. This SNP was located in the 5′ UTR of RBM8A, one of the genes mapping within the 1q21.1 deletion region. They then genotyped another 48 individuals in the cohort, ultimately demonstrating the presence of this SNP in 73.6% (39/53) of the cohort; a second novel SNP, located in the first intron of RBM8A, was identified in a further 22.6% (12/53) of the cohort. In each case for which the 1q21.1 deletion was inherited from an unaffected parent, the more common SNP [minor allele frequencies 3.05% (rs139428292), 0.42% (rs201779890)] was in trans with the RBM8A null allele (1q21.1 deletion), confirming biparental inheritance and providing a plausible molecular explanation for the observed incomplete penetrance of the 1q21.1 deletion. The authors hypothesized that the SNPs conferred a hypomorphic effect on the remaining RBM8A allele, resulting in high phenotypic expression.

More recently, a similar biallelic model has become evident for seemingly sporadic birth defects of the spine and was revealed by studies at the TBX6 locus, where null TBX6 alleles, most often caused by a deletion at 16p11.2, are associated with congenital scoliosis.
Parents harboring the TBX6 null allele are unaffected, thus demonstrating incomplete penetrance. Wu et al. demonstrated that a specific “T-C-A” haplotype comprising three SNPs common to the Chinese Han population (present in 44% of Chinese Han individuals), when inherited in trans with a null TBX6 allele, led to high penetrance of the congenital scoliosis phenotype now known as TBX6-associated congenital scoliosis (TACS) [87]. The authors were able to use functional assays to demonstrate the hypomorphic effect of the common T-C-A haplotype on TBX6 function, supporting their hypothesis that the hypomorphic allele in trans with a null or gene-disrupting missense TBX6 allele led to a sufficient reduction in gene dosage to allow expression of the TACS phenotype [87,88], elucidating their compound inheritance gene dosage model. Notably, the TACS phenotype is clinically distinct, characterized primarily by butterfly vertebrae and hemivertebrae in the lower thoracic and upper lumbar regions, making the spinal lesions more accessible to surgical intervention compared to non-TACS scoliosis phenotypes [89]. Modeling these compound null/mild hypomorphic Tbx6 alleles in mice confirmed the TBX6 gene dosage model: a single null allele is not sufficient to cause the scoliosis phenotype, and a homozygous null allele is lethal, but the compound inheritance of one null and one hypomorphic Tbx6 allele leads to a murine TACS phenotype very similar to that observed in humans [90]. Expanding on this gene dosage and developmental biology model, they further demonstrated that increased TBX6 dosage resulted in cervical spine involvement characterized by cervical fusions and vertebral clefts [91].

Another study supporting the compound inheritance model of rare and common variants contributing to disease penetrance has been reported in association with lethal lung disorders, namely acinar dysplasia, congenital alveolar dysplasia, and alveolar capillary dysplasia with misalignment of the pulmonary veins. In an analysis of 26 families with lethal lung developmental disorders, Karolak et al. found that heterozygosity for rare coding variants in TBX4 or FGF10, both previously associated with severe lung developmental disorders, was not sufficient for disease expression [92]. Rather, biallelic variation at either locus, conferred by a second rare coding variant, or a second noncoding variant that could be rare or common, provided sufficient mutational burden for expression of the lethal lung disease phenotype. The authors followed this study with a case report of a 4-month-old child who died from abnormal lung growth and pulmonary arterial hypertension [93]. Trio WES demonstrated a de novo nonsense variant in CTNNB1, as well as a heterozygous missense variant in TBX4 that was inherited from an unaffected father, leading the authors to hypothesize that this multilocus mutational burden contributed to the severe clinical features in this child. A more recent case with lethal lung disorder harbored two de novo CNV deletions (17q23.1q23.2, 16p11.2), uncovering seven potential regulatory noncoding SNVs that may contribute to the complex compound inheritance observed in lethal lung disorders associated with a loss of TBX4 [53].
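The compound inheritance gene dosage model discussed above for TBX6 can be made concrete with a toy calculation. This is an illustrative sketch only: the relative allele activities and the disease threshold are hypothetical numbers chosen so that the outcomes match the qualitative pattern reported in the text (null/wild-type unaffected, null/hypomorph affected, null/null lethal in mice); they are not measured values from the cited studies.

```python
# Hypothetical relative activity of each allele class (wild-type = 1.0).
ALLELE_ACTIVITY = {"wt": 1.0, "hypomorph": 0.6, "null": 0.0}


def relative_dosage(allele_a, allele_b):
    """Mean activity of the two alleles, as a fraction of the wt/wt level."""
    return (ALLELE_ACTIVITY[allele_a] + ALLELE_ACTIVITY[allele_b]) / 2


def classify(allele_a, allele_b, disease_threshold=0.4):
    """Map a diplotype to a qualitative outcome under a simple threshold model.

    disease_threshold is a hypothetical cut-off on relative dosage.
    """
    dosage = relative_dosage(allele_a, allele_b)
    if dosage == 0.0:
        return "no functional product (lethal in the mouse model)"
    if dosage < disease_threshold:
        return "below threshold: phenotype expected"
    return "above threshold: phenotype not expected"


for diplotype in [("wt", "wt"), ("wt", "null"), ("hypomorph", "hypomorph"),
                  ("hypomorph", "null"), ("null", "null")]:
    print(diplotype, round(relative_dosage(*diplotype), 2), classify(*diplotype))
```

The point of the sketch is that a common allele with only a mild effect can move a carrier of a rare null allele across a threshold that neither allele crosses alone, which is one way to read the incomplete penetrance of the null allele in unaffected parents.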

9.5.2 Rare + common alleles at unlinked loci explain some cases of incomplete penetrance

Whereas the examples above each describe the compound inheritance of rare and common variants at a single locus, there are increasing examples for which compound inheritance involving two unlinked loci results in digenic inheritance of the condition. One particular challenge for investigators pursuing such inheritance models has been the identification of the relevant noncoding variant at a distant locus. Early discoveries in this regard have taken advantage of existing knowledge for specific common variant alleles to identify examples of compound inheritance of rare and common variants at unlinked loci. Facioscapulohumeral dystrophy type 2 (FSHD2, MIM #158901) results from the digenic inheritance of a rare pathogenic variant in SMCHD1 on chromosome 18, and a common, permissive DUX4 allele on chromosome 4 [94,95]. The SMCHD1 variant results in loss of SMCHD1 function and a relaxed chromatin configuration, which in the setting of a permissive DUX4 allele enables DUX4 expression and disease trait manifestation. Lemmers et al. built on previous work describing the etiology of FSHD1 (MIM #158900), a monogenic condition involving 4q35.2 that was caused by a contracted D4Z4 repeat array which led to an open chromatin configuration. Individuals with this repeat contraction only developed FSHD1 in the presence of a permissive DUX4 allele, which Lemmers et al. ultimately found to be critical for FSHD2 disease expression as well.

In a separate study, Timberlake et al. described a cohort of 17 individuals from 13 families with nonsyndromic midline craniosynostosis [96]. They identified rare pathogenic variants in SMAD6 in all affected individuals. In two cases, these variants were de novo, but for the remaining 15 cases, the SMAD6 variant was inherited from an unaffected parent. The authors noted that a common BMP2 variant (located ~345 kb downstream of BMP2) had been identified in a genome-wide association study involving cases with nonsyndromic sagittal craniosynostosis, with a large effect size. Genotyping for this SNP revealed that 14 of 17 affected individuals had inherited this BMP2 allele, and segregation in available relatives demonstrated compound inheritance of this common variant and the rare SMAD6 variant. The authors proposed that the low (<60%) penetrance observed with the rare SMAD6 variant alone was explained by the compound inheritance of this BMP2 common variant. In a subsequent case report, the authors described a strikingly severe case of craniosynostosis in a child who demonstrated very unusual recurrence of the craniosynostosis phenotype 2 months after surgical correction [97]. In this case, a rare frameshifting SMAD6 variant was inherited from an unaffected parent, and a de novo TCF12 frameshift variant was also identified; the common BMP2 allele was not detected. As with SMAD6, TCF12 had been associated with incomplete penetrance of the craniosynostosis phenotype. The authors hypothesized that the combination of LoF variation at both loci led to the very severe and rapidly recurring phenotype. Taken together, these studies of FSHD2 and nonsyndromic craniosynostosis illustrate how incomplete penetrance may sometimes be a hallmark of more complex molecular mechanisms influencing disease expression.
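The reasoning in this section, that penetrance of a rare variant may depend on the genotype at a second locus, amounts to estimating penetrance separately within each second-locus stratum. The sketch below shows that calculation under stated assumptions: the function name and the carrier records are hypothetical and are not data from the cited craniosynostosis or FSHD2 studies.

```python
from collections import defaultdict


def stratified_penetrance(carriers):
    """Estimate penetrance among rare-variant carriers, stratified by a second locus.

    carriers: iterable of (has_common_allele, affected) boolean pairs, one per
    individual who carries the rare variant.
    Returns {has_common_allele: fraction of carriers in that stratum affected}.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_affected, n_total]
    for has_common_allele, affected in carriers:
        counts[has_common_allele][1] += 1
        if affected:
            counts[has_common_allele][0] += 1
    return {stratum: n_aff / n_tot for stratum, (n_aff, n_tot) in counts.items()}


# Hypothetical rare-variant carriers (NOT data from the cited studies).
carriers = (
    [(True, True)] * 9 + [(True, False)] * 1      # rare variant + common allele
    + [(False, True)] * 1 + [(False, False)] * 9  # rare variant alone
)

print(stratified_penetrance(carriers))
```

In practice, families are usually ascertained through affected probands, so naive estimates of this kind tend to overstate penetrance unless the ascertainment scheme is accounted for.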

9.6 A comprehensive model: Clan Genomics

Clan Genomics provides a comprehensive model through which to unify concepts of multilocus variation, multiple molecular diagnoses, mutational burden, and compound inheritance of rare and common variation [98]. The Clan Genomics hypothesis posits that the expression of rare disease traits is primarily influenced by recently arisen, rare variants within a family or clan. Indeed, a substantial proportion of rare disease is caused by de novo variation, and this represents the most immediate demonstration of the Clan Genomics hypothesis (Fig. 9.2). In the decade since the original formulation of this hypothesis, Lupski and colleagues have tested more complex corollaries that derive from it. In particular, they describe populations in which a high rate of parental consanguinity, and thus identity-by-descent, supports the rapid “homozygosing” of rare variants within a family [40]. In essence, the very structure of the clan, characterized by autozygosity and substantial absence of heterozygosity, drives this observation. Multilocus effects, whether through pairs of de novo variants, or through pairs of homozygous variants, lead to combinatorial diversity that is unique to a studied family, even when variant recurrence (e.g., recurrent de novo pathogenic variants due to C > T transitions in CpG dinucleotides; recurrent CNVs mediated by nonallelic homologous recombination) or founder alleles are observed [39,66,71,99–101]. The combinatorial impact of aggregated rare variants leading to a mutational load, or complex inheritance of rare and common variants, further supports the Clan Genomics hypothesis. Indeed, Clan Genomics provides an intellectual framework through which to conceptualize complex molecular mechanisms underlying rare disease, including multilocus inheritance, variable expressivity, and incomplete penetrance.
The full extent of the contributions of these molecular mechanisms is not yet known, and it will be critical to continue to develop testable molecular hypotheses through which to explore the Clan Genomics model.
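The link drawn above between parental consanguinity and the rapid "homozygosing" of rare variants can be quantified with the inbreeding coefficient, the probability that the two alleles at a locus in an individual are identical by descent. The following sketch computes it from a pedigree using the standard recursive definition of the kinship coefficient; the pedigree, the individual labels, and the function names are hypothetical and serve only as an illustration.

```python
from functools import lru_cache


def make_kinship(pedigree):
    """Return a kinship function for a pedigree.

    pedigree: dict mapping individual -> (father, mother); founders use
    (None, None). Individuals must be listed with parents before children
    (dict insertion order is used as the ordering).
    """
    order = {name: idx for idx, name in enumerate(pedigree)}

    @lru_cache(maxsize=None)
    def kinship(a, b):
        if a is None or b is None:
            return 0.0  # unknown parents of founders are treated as unrelated
        if a == b:
            father, mother = pedigree[a]
            return 0.5 * (1.0 + kinship(father, mother))
        # Always expand the later-listed individual, who cannot be an
        # ancestor of the earlier-listed one.
        if order[a] > order[b]:
            a, b = b, a
        father, mother = pedigree[b]
        return 0.5 * (kinship(a, father) + kinship(a, mother))

    return kinship


def inbreeding_coefficient(pedigree, individual):
    """Inbreeding coefficient = kinship coefficient of the individual's parents."""
    father, mother = pedigree[individual]
    return make_kinship(pedigree)(father, mother)


# Hypothetical pedigree: the proband's parents are first cousins.
first_cousin_pedigree = {
    "grandfather": (None, None),
    "grandmother": (None, None),
    "sib_a": ("grandfather", "grandmother"),
    "sib_b": ("grandfather", "grandmother"),
    "spouse_a": (None, None),
    "spouse_b": (None, None),
    "cousin_a": ("sib_a", "spouse_a"),
    "cousin_b": ("spouse_b", "sib_b"),
    "proband": ("cousin_a", "cousin_b"),
}

print(inbreeding_coefficient(first_cousin_pedigree, "proband"))  # 0.0625 (1/16)
```

For the offspring of first cousins this gives 1/16, meaning that on average about 6% of the genome is expected to be autozygous, so any rare variant carried by a shared ancestor has a non-negligible chance of reaching homozygosity within a few generations.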


FIGURE 9.2 Complex molecular models of disease support the Clan Genomics hypothesis. New (de novo) or recently arisen variants within a family or clan have a substantial impact on human disease expression. Phenotypic expression is further influenced by population substructure and the combinatorial effects of rare + common or rare + rare variants at one or more loci. Highlighted regions of the pedigrees demonstrate individuals who share a given variant. Red lines represent new or recent mutational events. (A) De novo, or even somatic (not shown), mutational events are private to an individual and result in expression of an apparently sporadic rare disease trait. (B) Recently arisen rare variants are private to the immediate family. A variant shared from a single ancestor may be rapidly brought to homozygosity through parental consanguinity and absence of heterozygosity, leading to expression of a rare, recessive disease trait. (C) Compound inheritance of a rare and common variant can result in single locus (shown here) or multilocus variant combinations that are private to an individual or family. These molecular scenarios often modulate gene dosage to influence expression of disease traits that have been observed to demonstrate incomplete penetrance. (D) A single common allele may influence disease risk for common, complex disorders, or in rare disease when paired with a pathogenic rare variant (see panel C). (E) Blended phenotypes result from multilocus pathogenic variation at two or more loci, leading to dual or multiple molecular diagnoses. These multilocus variant combinations are private to an individual or family.


References [1] Jacobs PA, Strong JA. A case of human intersexuality having a possible XXY sex-determining mechanism. Nature 1959;183:3023. [2] Lejeune J, Gautier M, Turpin R. Study of somatic chromosomes from 9 mongoloid children. C R Hebd Seances Acad Sci 1959;248:17212. [3] Kallioniemi A, et al. Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science 1992;258:81821. [4] Veltman JA, et al. Definition of a critical region on chromosome 18 for congenital aural atresia by arrayCGH. Am J Hum Genet 2003;72:157884. [5] Shaw CJ, et al. Comparative genomic hybridisation using a proximal 17p BAC/PAC array detects rearrangements responsible for four genomic disorders. J Med Genet 2004;41:11319. [6] Lupski JR, et al. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. N Engl J Med 2010;362:118191. [7] Bainbridge MN, et al. Whole-genome sequencing for optimized patient management. Sci Transl Med 2011;3:87re3. [8] Amberger JS, Bocchini CA, Scott AF, Hamosh A. OMIM.org: leveraging knowledge across phenotypegene relationships. Nucleic Acids Res 2019;47:D103843. [9] Posey JE, et al. Insights into genetics, human biology and disease gleaned from family based genomic studies. Genet Med 2019;21:798812. [10] Posey JE, et al. Molecular diagnostic experience of whole-exome sequencing in adult patients. Genet Med 2016;18:67885. [11] Yang Y, et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N Engl J Med 2013;369:150211. [12] Yang Y, et al. Molecular findings among patients referred for clinical whole-exome sequencing. JAMA 2014;312:18709. [13] Retterer K, et al. Clinical application of whole-exome sequencing across clinical indications. Genet Med 2016;18:696704. [14] Lee H, et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA 2014;312:18807. [15] Farwell KD, et al. 
Enhanced utility of family-centered diagnostic exome sequencing with inheritance model-based analysis: results from 500 unselected families with undiagnosed genetic conditions. Genet Med 2015;17:57886. [16] Liu P, et al. Reanalysis of clinical exome sequencing data. N Engl J Med 2019;380:247880. [17] Worman HJ, Bonne G. “Laminopathies”: a wide spectrum of human diseases. Exp Cell Res 2007;313:212133. [18] Bione S, et al. Identification of a novel X-linked gene responsible for Emery-Dreifuss muscular dystrophy. Nat Genet 1994;8:3237. [19] Karaca E, et al. Biallelic and de novo variants in DONSON reveal a clinical spectrum of cell cycleopathies with microcephaly, dwarfism and skeletal abnormalities. Am J Med Genet A 2019;179:205666. [20] Groden J, et al. Identification and characterization of the familial adenomatous polyposis coli gene. Cell 1991;66:589600. [21] Nishisho I, et al. Mutations of chromosome 5q21 genes in FAP and colorectal cancer patients. Science 1991;253:6659. [22] Patel N, et al. A novel APC mutation defines a second locus for Cenani-Lenz syndrome. J Med Genet 2015;52:31721.


[23] Warner LE, Svaren J, Milbrandt J, Lupski JR. Functional consequences of mutations in the early growth response 2 gene (EGR2) correlate with severity of human myelinopathies. Hum Mol Genet 1999;8:124551. [24] Chong JX, et al. De novo mutations in NALCN cause a syndrome characterized by congenital contractures of the limbs and face, hypotonia, and developmental delay. Am J Hum Genet 2015;96:46273. [25] Roman-Sanchez R, Wensel TG, Wilson JH. Nonsense mutations in the rhodopsin gene that give rise to mild phenotypes trigger mRNA degradation in human cells by nonsense-mediated decay. Exp Eye Res 2016;145:4449. [26] Cahill KM, Ley AB. Favism and thalassemia minor in a pregnant woman. JAMA 1962;180:11921. [27] Fitzsimmons JS, Guilbert PR. Spastic paraplegia associated with brachydactyly and cone shaped epiphyses. J Med Genet 1987;24:7025. [28] Hennekam RC. Spastic paraplegia, dysarthria, brachydactyly, and cone shaped epiphyses: confirmation of the Fitzsimmons syndrome. J Med Genet 1994;31:2512. [29] Lacassie Y, et al. Identical twins with mental retardation, dysarthria, progressive spastic paraplegia, and brachydactyly type E: a new syndrome or variant of Fitzsimmons-Guilbert syndrome? Am J Med Genet 1999;84:903. [30] Armour CM, Humphreys P, Hennekam RC, Boycott KM. Fitzsimmons syndrome: spastic paraplegia, brachydactyly and cognitive impairment. Am J Med Genet A 2009;149A:22547. [31] Armour CM, et al. Syndrome disintegration: exome sequencing reveals that Fitzsimmons syndrome is a co-occurrence of multiple events. Am J Med Genet A 2016;170:18205. [32] Posey JE, et al. Adult presentation of X-linked Conradi-Hunermann-Happle syndrome. Am J Med Genet A 2015;167:130914. [33] Posey JE, et al. Lysinuric protein intolerance presenting with multiple fractures. Mol Genet Metab Rep 2014;1:17683. [34] Dyment DA, et al. Alternative genomic diagnoses for individuals with a clinical diagnosis of Dubowitz syndrome. Am J Med Genet A 2021;185:11933. [35] Monies D, et al. 
Lessons learned from large-scale, first-tier clinical exome sequencing in a highly consanguineous population. Am J Hum Genet 2019;105:879. [36] Meng L, et al. Use of exome sequencing for infants in intensive care units: ascertainment of severe single-gene disorders and effect on medical management. JAMA Pediatr 2017;171:e173438. [37] Balci TB, et al. Debunking Occam’s razor: diagnosing multiple genetic diseases in families by wholeexome sequencing. Clin Genet 2017;92:2819. [38] Tarailo-Graovac M, et al. Exome sequencing and the management of neurometabolic disorders. N Engl J Med 2016;374:224655. [39] Karaca E, et al. Genes that affect brain structure and function identified by rare variant analyses of Mendelian neurologic disease. Neuron 2015;88:499513. [40] Pehlivan D, et al. The genomics of arthrogryposis, a complex trait: candidate genes and further evidence for oligogenic inheritance. Am J Hum Genet 2019;105:13250. [41] Bayram Y, et al. Molecular etiology of arthrogryposis in multiple families of mostly Turkish origin. J Clin Invest 2016;126:76278. [42] Posey JE, et al. Resolution of disease phenotypes resulting from multilocus genomic variation. N Engl J Med 2017;376:2131. [43] Posey JE, et al. Triploidy mosaicism (45,X/68,XX) in an infant presenting with failure to thrive. Am J Med Genet A 2016;170:6948. [44] Dharmadhikari AV, et al. Copy number variant and runs of homozygosity detection by microarrays enabled more precise molecular diagnoses in 11,020 clinical exome cases. Genome Med 2019;11:30.


[45] Xu S, Lian Q, Wu J, Li L, Song J. Dual molecular diagnosis of tricho-rhino-phalangeal syndrome type I and Okur-Chung neurodevelopmental syndrome in one Chinese patient: a case report. BMC Med Genet 2020;21:158. [46] Mishra R, et al. Robinow syndrome and brachydactyly: an interplay of high-throughput sequencing and deep phenotyping in a kindred. Mol Syndromol 2020;11:439. [47] Gazzo A, et al. Understanding mutational effects in digenic diseases. Nucleic Acids Res 2017;45: e140. [48] Gazzo AM, et al. DIDA: a curated and annotated digenic diseases database. Nucleic Acids Res 2016;44: D9007. [49] Kohler S, et al. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res 2014;42:D96674. [50] James RA, et al. A visual and curatorial approach to clinical variant prioritization and disease gene discovery in genome-wide diagnostics. Genome Med 2016;8:13. [51] Jehee FS, et al. Dual molecular diagnosis contributes to atypical Prader-Willi phenotype in monozygotic twins. Am J Med Genet A 2017;173:24515. [52] Rohanizadegan M, Siddharath A, Retterer K, Hung C, Bodamer O. The tale of two genes: from nextgeneration sequencing to phenotype. Cold Spring Harb Mol Case Stud 2020;6, a004846. [53] Karolak JA, et al. A de novo 2.2 Mb recurrent 17q23.1q23.2 deletion unmasks novel putative regulatory non-coding SNVs associated with lethal lung hypoplasia and pulmonary hypertension: a case report. BMC Med Genomics 2020;13:34. [54] Yuan B, et al. Clinical exome sequencing reveals locus heterogeneity and phenotypic variability of cohesinopathies. Genet Med 2019;21:66375. [55] Harel T, et al. Recurrent de novo and biallelic variation of ATAD3A, encoding a mitochondrial membrane protein, results in distinct neurological syndromes. Am J Hum Genet 2016;99:83145. [56] Aguilera-Albesa S, et al. Hereditary spastic paraplegia and intellectual disability: clinicogenetic lessons from a family suggesting a dual genetics diagnosis. 
Front Neurol 2020;11:41. [57] Middelkamp S, et al. Prioritization of genes driving congenital phenotypes of patients with de novo genomic structural variants. Genome Med 2019;11:79. [58] Popp B, et al. Dissecting TSC2-mutated renal and hepatic angiomyolipomas in an individual with ARID1B-associated intellectual disability. BMC Cancer 2019;19:435. [59] Strauss KA, et al. Genomic diagnostics within a medically underserved population: efficacy and implications. Genet Med 2018;20(1):3141. [60] Muenke M, et al. A unique point mutation in the fibroblast growth factor receptor 3 gene (FGFR3) defines a new craniosynostosis syndrome. Am J Hum Genet 1997;60:55564. [61] Moloney DM, et al. Prevalence of Pro250Arg mutation of fibroblast growth factor receptor 3 in coronal craniosynostosis. Lancet 1997;349:105962. [62] Lowry RB, Jabs EW, Graham GE, Gerritsen J, Fleming J. Syndrome of coronal craniosynostosis, Klippel-Feil anomaly, and sprengel shoulder with and without Pro250Arg mutation in the FGFR3 gene. Am J Med Genet 2001;104:11219. [63] Grosso S, et al. Medial temporal lobe dysgenesis in Muenke syndrome and hypochondroplasia. Am J Med Genet A 2003;120A:8891. [64] Doherty ES, et al. Muenke syndrome (FGFR3-related craniosynostosis): expansion of the phenotype and review of the literature. Am J Med Genet A 2007;143A:320415. [65] Bend R, et al. Phenotype and mutation expansion of the PTPN23 associated disorder characterized by neurodevelopmental delay and structural brain abnormalities. Eur J Hum Genet 2020;28:7687. [66] Nistala H, et al. NMIHBA results from hypomorphic PRUNE1 variants that lack short-chain exopolyphosphatase activity. Hum Mol Genet 2021;29(21):351631.


[67] Chong JX, et al. The genetic basis of Mendelian phenotypes: discoveries, challenges, and opportunities. Am J Hum Genet 2015;97:199215. [68] White J, et al. POGZ truncating alleles cause syndromic intellectual disability. Genome Med 2016;8:3. [69] Assia Batzir N, et al. Phenotypic expansion of POGZ-related intellectual disability syndrome (WhiteSutton syndrome). Am J Med Genet A 2020;182:3852. [70] Wangler MF, et al. Heterozygous de novo and inherited mutations in the smooth muscle actin (ACTG2) gene underlie megacystis-microcolon-intestinal hypoperistalsis syndrome. PLoS Genet 2014;10: e1004258. [71] Assia Batzir N, et al. Recurrent arginine substitutions in the ACTG2 gene are the primary driver of disease burden and severity in visceral myopathy. Hum Mutat 2020;41:64154. [72] Bostwick BL, et al. Phenotypic and molecular characterisation of CDK13-related congenital heart defects, dysmorphic facial features and intellectual developmental disorders. Genome Med 2017;9:73. [73] Jiang Y, et al. The phenotypic spectrum of Xia-Gibbs syndrome. Am J Med Genet A 2018;176:131526. [74] Xia F, et al. De novo truncating mutations in AHDC1 in individuals with syndromic expressive language delay, hypotonia, and sleep apnea. Am J Hum Genet 2014;94:7849. [75] Hansen AW, et al. A genocentric approach to discovery of Mendelian disorders. Am J Hum Genet 2019;105:97486. [76] Karaca E, et al. Phenotypic expansion illuminates multilocus pathogenic variation. Genet Med 2018;20:152837. [77] Mitani T, et al. Bi-allelic pathogenic variants in TUBGCP2 cause microcephaly and lissencephaly spectrum disorders. Am J Hum Genet 2019;105:100515. [78] Murakami H, et al. Discordant phenotype caused by CASK mutation in siblings with NF1. Hum Genome Var 2019;6:20. [79] Mautner VF, et al. Clinical characterisation of 29 neurofibromatosis type-1 patients with molecularly ascertained 1.4 Mb type-1 NF1 deletions. J Med Genet 2010;47:62330. [80] Pasmant E, et al. 
NF1 microdeletions in neurofibromatosis type 1: from genotype to phenotype. Hum Mutat 2010;31:E150618. [81] Cornett KM, et al. Phenotypic variability of childhood Charcot-Marie-Tooth Disease. JAMA Neurol 2016;73:64551. [82] Haldane JBS. The Relative Importance of Principal and Modifying Genes in Determining some Human Diseases. J Genet 1941;41:14957. [83] Gonzaga-Jauregui C, et al. Exome sequence analysis suggests that genetic burden contributes to phenotypic variability and complex neuropathy. Cell Rep 2015;12:116983. [84] Bis-Brewer DM, et al. Assessing non-Mendelian inheritance in inherited axonopathies. Genet Med 2020;22:211419. [85] Bis-Brewer DM, Danzi MC, Wuchty S, Zuchner S. A network biology approach to unraveling inherited axonopathies. Sci Rep 2019;9:1692. [86] Albers CA, et al. Compound inheritance of a low-frequency regulatory SNP and a rare null mutation in exon-junction complex subunit RBM8A causes TAR syndrome. Nat Genet 2012;44(435-9):S12. [87] Wu N, et al. TBX6 null variants and a common hypomorphic allele in congenital scoliosis. N Engl J Med 2015;372:34150. [88] Chen W, et al. TBX6 missense variants expand the mutational spectrum in a non-Mendelian inheritance disease. Hum Mutat 2020;41:18295. [89] Liu J, et al. TBX6-associated congenital scoliosis (TACS) as a clinically distinguishable subtype of congenital scoliosis: further evidence supporting the compound inheritance and TBX6 gene dosage model. Genet Med 2019;21:154858.


[90] Yang N, et al. TBX6 compound inheritance leads to congenital vertebral malformations in humans and mice. Hum Mol Genet 2019;28:53947. [91] Ren X, et al. Increased TBX6 gene dosages induce congenital cervical vertebral malformations in humans and mice. J Med Genet 2020;57:3719. [92] Karolak JA, et al. Complex compound inheritance of lethal lung developmental disorders due to disruption of the TBX-FGF pathway. Am J Hum Genet 2019;104:21328. [93] Karolak JA, et al. Heterozygous CTNNB1 and TBX4 variants in a patient with abnormal lung growth, pulmonary hypertension, microcephaly, and spasticity. Clin Genet 2019;96:36670. [94] Lemmers RJ, et al. Digenic inheritance of an SMCHD1 mutation and an FSHD-permissive D4Z4 allele causes facioscapulohumeral muscular dystrophy type 2. Nat Genet 2012;44:13704. [95] Lupski JR. Digenic inheritance and Mendelian disease. Nat Genet 2012;44:12912. [96] Timberlake AT, et al. Two locus inheritance of non-syndromic midline craniosynostosis via rare SMAD6 and common BMP2 alleles. Elife 2016;5, e20125. [97] Timberlake AT, et al. Co-occurrence of frameshift mutations in SMAD6 and TCF12 in a child with complex craniosynostosis. Hum Genome Var 2018;5:14. [98] Lupski JR, Belmont JW, Boerwinkle E, Gibbs RA. Clan genomics and the complex architecture of human disease. Cell 2011;147:3243. [99] Harel T, Lupski JR. Genomic disorders 20 years on-mechanisms for clinical manifestations. Clin Genet 2018;93:43949. [100] Gonzaga-Jauregui C, et al. Functional biology of the Steel syndrome founder allele and evidence for clan genomics derivation of COL27A1 pathogenic alleles worldwide. Eur J Hum Genet 2020;28 (9):124364. [101] Zollo M, et al. PRUNE is crucial for normal brain development and mutated in microcephaly with neurodevelopmental impairment. Brain 2017;140:94052.

CHAPTER 10

Statistical approaches to rare disease analyses

Cristopher V. Van Hout

International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano (LIIGH), Universidad Nacional Autónoma de México (UNAM), Mexico

10.1 Introduction

Statistical genetics is the scientific discipline that focuses on the development and application of analytical methods to derive inferences from genetic data. When it is possible to collect phenotype and genotype data from a sufficient number of individuals who are affected by a suspected genetic disorder, a number of statistical approaches can be used to quantify the risk and statistical significance of putatively disease-causing variants. Several factors determine whether these approaches are likely to be sufficiently powered, some of which can only be estimated with nontrivial uncertainty during research study design and analysis. Perhaps the most relevant factors in the analytical study design of any rare disease are its prevalence and mode of inheritance. These factors determine whether affected individuals are likely to aggregate in families or appear sporadically and, crucially, affect the number of affected individuals that may be available for clinical evaluation and inclusion in subsequent genetic studies. For example, linkage analysis requires segregation of a phenotype in multiple related individuals within pedigrees. Association approaches, by contrast, are perhaps better suited to studying large numbers of unrelated individuals, whereas other methods may be more appropriate for analyzing closely related individuals or mixtures of related and unrelated participants. Additionally, as discussed in previous chapters, the penetrance and expressivity of Mendelian disorders affect the power to detect statistical relationships between genetic variants and disease regardless of the methodological approach.
This chapter reviews statistical methods that may be applied to study rare disease genetics from the perspective that the study participants and relevant genotype and phenotype data have already been collected, although it may also be useful for those designing rare disease genetic studies prior to participant recruitment.

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00011-9 © 2021 Elsevier Inc. All rights reserved.

10.2 Pedigree-based statistical methods

10.2.1 Linkage analysis

Family-based statistical approaches, that is, those for which a detailed pedigree can be recorded (see Chapter 1: Introduction to Concepts of Genetics and Genomics), have been a mainstay in rare disease gene identification since the advent of directly measured genotyping approaches, whether using restriction fragment length polymorphisms (RFLPs), mini- or microsatellite markers, single-nucleotide polymorphism (SNP) markers, or, more recently, agnostic genomic sequencing. Given the high cost of sequencing DNA using Sanger dideoxy sequencing prior to the development of next-generation sequencing (NGS) technologies (see Chapter 4: Genomic Sequencing of Rare Diseases), their early popularity was driven by the ability to scan the entirety of the genome for statistical associations with disease without measuring every genomic variant, in contrast with modern genomic sequencing approaches, which do not assume prior knowledge of genomic variation. First proposed by Morton [1] and extended by Ott [2,3] and many others, linkage analyses have been used to implicate many regions of the genome in both common and rare diseases. Additionally, this approach is well characterized for discrete clinical outcomes, that is, affected and unaffected (case-control), as well as for continuously distributed biomarkers and endophenotypes that may be predictors or consequences of disease. These methods generally cannot, on their own, establish causal relationships between biomarkers and clinical outcomes; inferring such relationships remains a major research theme in both genetic and nongenetic epidemiology. Specialized methods, such as Mendelian Randomization approaches [4], can provide evidence for causality amongst different phenotypic measures; however, they are typically very modestly powered for the sample sizes usually collected for rare disease studies. Because meiotic recombination introduces only a few recombination events per chromosome during gametogenesis, a relatively sparse map of common variants or “markers” can build a high-resolution view of the inheritance patterns or vectors across the whole genome and between all members of a pedigree.
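Because the markers in such a sparse map are typically placed on a genetic map measured in centimorgans (cM), it can help to see how genetic distance relates to the recombination fraction and how a genetic position can be assigned to a marker that lies between two mapped points. The sketch below uses the Haldane map function, which assumes no crossover interference, together with simple linear interpolation. The function names and the three-point map are hypothetical; real linkage software may use other map functions and published recombination maps.

```python
import math
from bisect import bisect_right


def haldane_recombination_fraction(distance_cm):
    """Haldane map function: genetic distance (cM) -> recombination fraction."""
    if distance_cm < 0:
        raise ValueError("genetic distance must be non-negative")
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def interpolate_genetic_position(pos_bp, map_bp, map_cm):
    """Linearly interpolate a genetic position (cM) for a physical position (bp).

    map_bp must be sorted in ascending order; positions outside the map are
    clamped to the first or last mapped value.
    """
    if pos_bp <= map_bp[0]:
        return map_cm[0]
    if pos_bp >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_right(map_bp, pos_bp)  # index of first map point to the right
    left_bp, right_bp = map_bp[i - 1], map_bp[i]
    frac = (pos_bp - left_bp) / (right_bp - left_bp)
    return map_cm[i - 1] + frac * (map_cm[i] - map_cm[i - 1])


# Hypothetical three-point genetic map for one chromosome arm.
map_bp = [1_000_000, 2_000_000, 4_000_000]
map_cm = [0.0, 1.5, 2.5]

marker_cm = interpolate_genetic_position(3_000_000, map_bp, map_cm)
theta = haldane_recombination_fraction(marker_cm - map_cm[0])
print(f"marker at {marker_cm:.2f} cM; theta to first map point = {theta:.4f}")
```

For short distances the recombination fraction is close to the distance in morgans (1 cM is roughly a 1% chance of recombination per meiosis), whereas for loci far apart it approaches, but never exceeds, 0.5.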
From unevenly spaced markers, genotype data can leverage recombination maps to interpolate map positions and jointly estimate inheritance vectors on an evenly spaced grid using nondeterministic Markov Chain Monte Carlo (MCMC) approaches [5]; the resulting estimates are known as multipoint Identity-By-Descent (IBD) matrices. Thus the sparse sampling of markers is used to build a model of the genome and to identify which segments of the genome segregate with the trait more often than expected by random chance. Ultimately, for each position, the IBD matrix and phenotype information are used to estimate a logarithm of the odds (LOD) score, that is, the base-10 logarithm of the odds that the marker or position is linked to (physically near) a disease-associated variant versus unlinked (cosegregating only by random chance). Canonically, LOD scores greater than 3 were considered evidence for a linkage signal, eschewing more formal inference or hypothesis testing [6]. As with all statistical approaches, violations of analysis assumptions, such as accurate pedigree, genotype, and phenotype specification (for example, through misdiagnosis, incomplete penetrance, or phenocopies), can result in loss of power. Additionally, linkage analysis loses resolution in low-complexity regions with fewer available markers and in regions with low recombination rates such as centromeres, and tends to implicate large regions of the genome, often including tens or hundreds of genes, which complicates follow-up studies. Historically, either targeted Sanger sequencing of genes within the linked regions or evaluation of single-gene knock-out model organisms was the standard follow-up to linkage analysis studies to help identify the causative variants in new candidate disease-associated loci. However, both approaches are costly and time consuming.
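As a minimal sketch of the LOD score described above, consider the simplest textbook case of phase-known, fully informative meioses, in which r recombinants are observed among n meioses and the likelihood at recombination fraction θ is θ^r (1 − θ)^(n − r). The function names `two_point_lod` and `max_lod`, the grid search, and the example counts are illustrative assumptions, not taken from any study cited here; multipoint, MCMC-based analyses of real pedigrees are considerably more involved.

```python
import math


def two_point_lod(recombinants: int, meioses: int, theta: float) -> float:
    """LOD score at recombination fraction theta for phase-known meioses.

    LOD(theta) = log10(L(theta) / L(0.5)),
    with L(theta) = theta**r * (1 - theta)**(n - r).
    """
    r, n = recombinants, meioses
    if not 0 <= r <= n:
        raise ValueError("recombinants must lie between 0 and meioses")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")

    def xlog10(k: int, p: float) -> float:
        # k * log10(p), with 0 * log10(0) treated as 0 and log10(0) as -inf.
        if k == 0:
            return 0.0
        if p == 0.0:
            return float("-inf")
        return k * math.log10(p)

    log_l_linked = xlog10(r, theta) + xlog10(n - r, 1.0 - theta)
    log_l_unlinked = n * math.log10(0.5)  # theta = 0.5 means no linkage
    return log_l_linked - log_l_unlinked


def max_lod(recombinants: int, meioses: int, grid_size: int = 500):
    """Grid search over theta in [0, 0.5]; returns (best theta, maximum LOD)."""
    thetas = [0.5 * i / grid_size for i in range(grid_size + 1)]
    best = max(thetas, key=lambda t: two_point_lod(recombinants, meioses, t))
    return best, two_point_lod(recombinants, meioses, best)
```

Under these assumptions, zero recombinants in ten informative meioses give a maximum LOD of 10 × log10(2) ≈ 3.01 at θ = 0, just above the canonical threshold of 3, which illustrates why large pedigrees were needed for convincing linkage signals.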
Nevertheless, prior to the development of high-throughput and cost-efficient NGS technologies, linkage analysis using large multigenerational pedigrees was the main analytical approach to the study of Mendelian disorders. One of the earliest and perhaps most widely known examples of the application of linkage analysis in the identification of a rare disease gene was the resolution of the locus on chromosome 4 associated with Huntington disease (HD, MIM #143100), a late-onset autosomal dominant neurodegenerative disorder (see Chapter 6: Dominant and Sporadic De Novo Disorders), from pedigrees collected in Venezuela and the United States with ancestry tracing to the Lake Maracaibo region of Venezuela [7]. Expansion of a trinucleotide repeat in the HTT gene, encoding the broadly expressed nuclear protein huntingtin involved in transcription regulation of many genes, would ultimately be identified as causal for the disease. Linkage analysis showed that a DNA probe marker, now named D4S10 (previously known as G8), toward the terminal end of the short (p) arm of chromosome 4 had a LOD score of 8.53 with the Huntington disease phenotype, with a 99% confidence interval containing an approximately 10 cM region around the marker. Further, the authors observed that the yet unknown Huntington disease variant and D4S10 shared a recombination fraction near zero and proposed the possibility of the marker as a diagnostic probe for Huntington disease carriers in presymptomatic and prenatal contexts. We now know that D4S10 is less than 1 Mb from the encoded polyglutamine tract in the HTT gene that, when abnormally expanded, causes Huntington disease. While state of the art at the time, the molecular tools to measure genetic variation and statistical approaches for linkage analyses formed the foundation for research to follow. In the modern era of inexpensive whole-exome sequencing (WES) and whole-genome sequencing (WGS), the application of linkage analysis in the identification of novel rare disease genes in pedigrees has changed considerably.
The ability to directly measure genetic variation agnostically, by low-cost and high-throughput NGS approaches, may obviate the utility of linkage analysis, in that the putatively causal variant is not specifically modeled as “linked” to any marker, but rather cosegregates directly with the disease or trait of study. However, instead of multipoint linkage analysis, which can entail computationally intensive MCMC approaches to estimate the inheritance vectors among all study participants, single-point linkage estimates are possible for all directly measured variants. Thus the computational and bioinformatic investments may be lower for sequencing-based linkage studies than for studies of genotypes measured by array, and the resolution of the linkage region is maximized, although sensitivity to genotyping error and missingness potentially increases. Single-point linkage analyses can serve as an orthogonal statistical approach to complement pedigree-based rare variant segregation analyses in large families (see Chapter 4: Genomic Sequencing of Rare Diseases). However, linkage analyses are very modestly powered for the analysis of small family structures, such as single or few nuclear families (parents and offspring) or trios (parents and an affected child), making rare variant filtering and segregation analyses best suited for these analyses.

10.2.2 Transmission disequilibrium testing

An additional statistical approach that explicitly leverages close relationships between study participants is the Transmission Disequilibrium Test (TDT) [8] and variations thereof. The TDT tests for linkage in the presence of association; the test is therefore robust to inflation of test statistics that can arise from population stratification or admixture in the presence of association but the absence of linkage. Briefly, this approach quantifies the extent to which the observed transmission of a genetic variant from heterozygous parents to one or more affected offspring (transmission events) departs from the rate expected under random Mendelian segregation. As an example, consider a number of trios in which affected offspring disproportionately inherit a variant from parents. The expectation is that there is an equal chance of inheriting either of the heterozygous parents’ alleles. If the heterozygous variant increases risk of disease, the variant may be observed to have an increased transmission rate amongst trios that are ascertained for affected offspring. Data may be organized in a 2 × 2 contingency table of counts of transmissions, with statistical inference using exact methods or binomial models. Transmission-based tests have been generalized to other pedigree configurations [9] as well as to arbitrary pedigree configurations, that is, family-based association tests [10,11].

In summary, there have been extensive methodological developments addressing the analysis of relatively sparse genome-wide genotype data in pedigree samples for the discovery of genes involved in disease, both rare and common. However, ascertainment of variants for inclusion on many SNP genotyping arrays, that is, common variants in relatively cosmopolitan continental ancestry groups and perhaps rare variants that were previously identified as associated with disease, results in genotype data that are unlikely to have directly measured the causal variant(s) for a given rare disease. Increasingly, direct and agnostic interrogation of genomic variation in large numbers of individuals by NGS approaches is leading to the identification of genes for rare disorders by methods that do not assume a priori knowledge of a pedigree.
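The transmission disequilibrium test described in Section 10.2.2 can be sketched in a few lines. The sketch assumes that transmissions from heterozygous parents to affected offspring have already been counted; the function name `tdt` and the example counts are hypothetical. It reports both the McNemar-type chi-square statistic, (b − c)² / (b + c), and an exact two-sided binomial P-value under the Mendelian expectation of a 50% transmission rate.

```python
import math


def tdt(transmitted: int, untransmitted: int):
    """Transmission disequilibrium test from heterozygous-parent transmissions.

    transmitted   -- times the candidate allele was passed to an affected child
    untransmitted -- times the other allele was passed instead
    Returns (chi-square statistic with 1 df, exact two-sided binomial P-value).
    """
    b, c = transmitted, untransmitted
    n = b + c
    if n == 0:
        raise ValueError("no informative transmissions")
    chi_square = (b - c) ** 2 / n
    # Under random Mendelian segregation b ~ Binomial(n, 0.5), which is
    # symmetric, so sum the probabilities of all outcomes at least as far
    # from n / 2 as the observed count.
    observed_distance = abs(b - n / 2)
    tail = sum(
        math.comb(n, k)
        for k in range(n + 1)
        if abs(k - n / 2) >= observed_distance
    )
    return chi_square, tail / 2 ** n


# Hypothetical example: 15 transmissions versus 5 non-transmissions.
chi_square, p_exact = tdt(15, 5)
```

With these hypothetical counts the statistic is 5.0 and the exact P-value is approximately 0.041. For the small transmission counts typical of rare disease studies, the exact binomial result is generally the safer of the two.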

10.3 Association analyses for rare diseases

As asserted in the previous section, the choice of statistical methodology for the discovery of genes involved in rare disease is driven in large part by the number and characteristics of the study participants, including whether there was deliberate ascertainment of affected individuals and close family members or the sample was population-based. While pedigree-based methods such as linkage tend to focus on case-ascertained study participants, association-based methods offer additional flexibility for the analysis of much larger sample sizes available for the study of genetic disorders from both ascertained and unascertained samples. In general, the statistical methods developed for common diseases using genome-wide association studies (GWAS) are similarly applied to rare disease traits in the same types of samples, albeit with several additional considerations.

10.3.1 Rare variant association testing in phenotype ascertained studies

For distinct clinical disorders where there are readily available and accurate clinical records, or due care has been taken to derive them, further methodological challenges to association analyses in large studies often include case:control ratio imbalance, model choice for collapsing-based tests, genotype quality control (QC), and association modeling covariable choices, as these factors impact the control of falsely inflated statistical significance (type 1 error) and the power to detect novel and known association signals (related to type 2 error). In theory, GWAS may have reasonable power to detect statistically significant associations without deep sampling of ancestry-matched unaffected controls (or statistically controlled ancestry). For example, for a fully penetrant variant (100% of variant carriers are affected, and no noncarriers are affected), five affected variant carriers and 100 unaffected noncarriers would be required to reach a commonly accepted genome-wide statistical significance threshold of approximately 10⁻⁸ using Fisher’s Exact Test (FET). Note that statistical tests in this context must be robust to low cell counts in a 2 × 2 contingency table, and methods such as FET, Firth bias-reduced logistic regression [12], SAIGE (Scalable and Accurate Implementation of GEneralized mixed models) [13], and regenie [14] may have considerably improved type 1 error control compared to chi-square tests and standard logistic regression approaches. Additionally, Firth regression, SAIGE, and regenie are regression-based and accommodate inclusion of covariables, and SAIGE and regenie are furthermore linear mixed models (LMMs) that incorporate random effects components estimated from genomic data to model relatedness and ancestry-based stratification to further control type 1 error. To date, only regenie has the ability to account for close relatedness using an LMM approach, maintain robust control over type 1 error in the presence of case:control imbalance, and easily test additive and recessive genetic models without recoding of the primary genetic data files. In practice, substantially greater numbers of cases and controls are required than in the hypothetical example above to robustly identify associations with very rare diseases and low-frequency variants, and such analyses often necessitate careful postanalysis QC and iterative revision of modeling choices. The power to detect associations between a single genetic variant and a trait decreases with allele frequency, other factors being equal. For rare disease and rare variant analyses, this is a salient challenge. Addressing this, there are a number of approaches that include multiple variants within a gene or specified genomic region and perform an aggregate test for several variants at once [15], thus potentially increasing power. As a motivating example, extremely rare disorders caused by de novo mutations may have few documented cases worldwide. Given the sporadic nature of de novo variants, the likelihood of observing multiple individuals with the same exact single variant is small.
Therefore, rare variant aggregation over individual genes to perform association testing is better powered to identify an association signal in large-scale studies of rare diseases. Borrowing from the example above, if there are five affected individuals with five different pathogenic variants across a gene, collapsing these and testing for association with the rare disorder versus 100 unaffected controls, none of whom carry a pathogenic variant in the causative gene, would be much more powerful than single-variant testing (Fig. 10.1). Two major groups of collapsing tests are burden tests and the Sequence Kernel Association Test (SKAT). A number of methods have been proposed to test rare variant gene burden; they differ in the inclusion or exclusion of variants based on their predicted function, the variant frequency in a relevant external population, and how rarer and less rare variants are up- or down-weighted, respectively, in the test statistic [15]. As a class of tests, they leverage the notion that deleterious alleles are likely to be rare due to purifying selection [16] and are most powerful when the aggregated variants have the same direction of effect on the outcome, by assumption, risk increasing. A distinct type of collapsing test is SKAT [17], which explicitly models both risk-increasing and risk-decreasing variants in the aggregate test statistic. This approach does not rely on an assumption of purifying selection. However, the power and statistical robustness of any test tend to be optimal when its underlying assumptions are met but not exceeded. Which of the varieties of burden tests or SKAT will be optimal in any given situation is generally unknown at the time of analysis. Collapsing tests have also been well characterized in the context of family-based association approaches [18].
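The five-carrier example above can be checked numerically. The following is a minimal sketch, assuming a simple carrier/noncarrier collapsing rule and a two-sided Fisher’s Exact Test computed directly from the hypergeometric distribution; the function names, sample identifiers, and variant identifiers are hypothetical, and the single-variant table follows the framing used in Fig. 10.1 (one case carrier versus 100 noncarrier controls).

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's Exact Test for the 2 x 2 table [[a, b], [c, d]].

    Rows are cases/controls, columns are carriers/noncarriers. The P-value
    sums the probabilities of all tables with the same margins that are no
    more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def collapse_carriers(genotypes: dict) -> dict:
    """Map each sample to True if it carries any qualifying variant in the gene."""
    return {
        sample: any(count > 0 for count in variants.values())
        for sample, variants in genotypes.items()
    }


# Hypothetical data: five cases, each with a different pathogenic variant,
# and 100 controls carrying none of them.
variant_ids = [f"var{i}" for i in range(1, 6)]
cases = {f"case{i}": {v: int(v == f"var{i}") for v in variant_ids} for i in range(1, 6)}
controls = {f"ctrl{i}": {v: 0 for v in variant_ids} for i in range(1, 101)}

case_carriers = sum(collapse_carriers(cases).values())
control_carriers = sum(collapse_carriers(controls).values())

p_collapsed = fisher_exact_two_sided(
    case_carriers, len(cases) - case_carriers,
    control_carriers, len(controls) - control_carriers,
)
p_single = fisher_exact_two_sided(1, 0, 0, 100)  # one case carrier vs 100 controls
```

Here `p_single` is 1/101 ≈ 0.0099 and `p_collapsed` is 1/96,560,646 ≈ 1.04 × 10⁻⁸, matching the values quoted in the text and in Fig. 10.1. A real burden analysis would additionally filter variants by predicted function and frequency and would adjust for covariables, which this sketch does not attempt.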
FIGURE 10.1 An illustrative example of a gene burden collapsing test. In a hypothetical gene with three exons (gray boxes), only affected study participants (cases) are carriers of pathogenic variants (red bars). No unaffected study participants (controls) are carriers of pathogenic variants. Testing using Fisher’s Exact Test is preferable to the parametric chi-square test, since cells of the 2 × 2 table contain counts less than 5. Testing single variants, one case carrier versus 100 noncarrier controls, is significant at P = 0.0099. After correction for five independent tests, the threshold for significance is 0.01, that is, each of the tests meets the threshold of significance, but only just. Collapsing the variants into the 2 × 2 table above, 5 case carriers versus 100 noncarrier controls, is much more powerful, significant at approximately P = 10⁻⁸, meeting the stringent significance thresholds required by genome-wide screening.

Phenotype-specific sampling designs can be powerful for the identification of genes involved in rare disorders and characterization of their genetic etiology [19,20]. Case-control sampling designs and case-only ascertainment each have unique advantages and challenges in the design and analysis phases. As in the example above, if the disorder is extremely rare, suitable ancestry-balanced controls may be challenging to recruit. Alternatively, it may be possible to choose ancestry-matched control individuals from a separate study, perhaps collected for completely different purposes, or from publicly available databases with individual-level genetic data. Crucially, ancestry matching and rigorous control over sequencing platform differences are required to maintain nominal type 1 error. At this time, the advantages of this approach over the use of population references for variant frequency information are unclear. Alternatively, where there is heterogeneity in disease severity, age of onset, or disease-associated quantitative biomarkers, case-only studies may screen for association of variants and genes with these severity traits. Genes that are associated with the progression, severity, or other endophenotypes may also be causally involved in the initiation of disease. Such groups of case-only research participants are likely to be aggregated in patient advocacy group registries or by specialist research clinicians. These cohorts provide a valuable new avenue for both the molecular diagnosis of individuals living with a rare disease and the identification of the gene(s) and spectrum of genetic variation that cause it, as well as insights into subclinical characteristics and quantitative phenotypes.

A frequently underexplored opportunity in the large-scale screening for associations between variants and phenotypes in case-control and population-based studies is the analysis of multiple genetic models. Rare disorders often exhibit departure from the simple additive model of allelic effects assumed for common variants in GWAS. Thus dominant or recessive models may offer considerably more power to detect associations and facilitate a deeper understanding of disease risk. For example, while the association between genetic variation in NOD2 and inflammatory bowel disease has been reported and widely replicated, the substantially increased risk for variant homozygosity and compound heterozygosity was reported only recently [21].
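The additive, dominant, and recessive models mentioned above differ only in how alternate-allele counts are coded before testing. The sketch below illustrates one common coding; the function names are hypothetical, and the gene-level recessive indicator assumes, in the absence of phase information, that two heterozygous variants in the same gene lie on opposite haplotypes (compound heterozygosity), an assumption that would need to be verified in practice.

```python
from typing import Sequence


def encode_genotype(alt_allele_count: int, model: str) -> int:
    """Code a biallelic genotype (0, 1, or 2 alternate alleles) under a model."""
    if alt_allele_count not in (0, 1, 2):
        raise ValueError("alt_allele_count must be 0, 1, or 2")
    if model == "additive":
        return alt_allele_count             # 0, 1, 2
    if model == "dominant":
        return int(alt_allele_count >= 1)   # any copy confers risk
    if model == "recessive":
        return int(alt_allele_count == 2)   # two copies required
    raise ValueError(f"unknown model: {model}")


def gene_level_recessive(alt_counts: Sequence[int]) -> int:
    """1 if a sample is homozygous at any variant in the gene, or heterozygous
    at two or more variants (assumed, without phasing, to be in trans)."""
    homozygous = any(count == 2 for count in alt_counts)
    n_heterozygous = sum(count == 1 for count in alt_counts)
    return int(homozygous or n_heterozygous >= 2)
```

Testing the same variant or gene under each coding, with an appropriate correction for the additional tests, is one way to capture recessive signals that an additive-only scan would underweight.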

10.3.2 Rare variant association testing of unascertained population-based studies

Broad investigator access to resources with both genomic sequencing data and Electronic Health Records (EHR) information, such as the UK Biobank project [22], has enabled the unbiased evaluation of rare variants contributing to rare and common diseases in large numbers of mostly unrelated adults, but without specific disease ascertainment or clinical reports consistent with apparent Mendelian conditions. Crucially, the success of a search for genes underlying a rare disorder is dependent on the accurate characterization of the affected individuals in the sample, including both the true prevalence of the disease and the accuracy with which it is captured by the EHR information. Indeed, many rare disorders do not have a specific and unique identifier, that is, an International Classification of Diseases (ICD) code, which forms the basis of many modern EHRs. Furthermore, clinicians may not correctly diagnose patients with a specific disease, resulting in phenotype misspecification in the EHR, that is, some record of apparent morbidity relevant to the true disease state but imprecisely captured. In these scenarios, EHR-based resources may still be of substantial value to rare disease research, but require more detailed evaluations of the totality of the available EHR information and an appreciation of both the clinical manifestations of the disorder of interest and the nuances of EHR data capture in clinical practice. For some research questions related to the penetrance and expressivity of a disease with known or presumed genetic etiology, evaluation of available evidence for carriers of pathogenic or likely pathogenic variants in the gene of interest can yield novel insights. For example, lipodystrophy, a disorder of lipid storage and metabolism, has an estimated clinical prevalence ranging between 1 and 5 per million in the literature.
However, estimates based on molecular prevalence of lipodystrophy-associated genetic variants in a clinical cohort with available genomic and EHR data were much higher than those from clinical diagnoses alone, highlighting that individuals carrying pathogenic variants in genes associated with the disease are more common than previously reported and are often underdiagnosed [23]. Additionally, commonly measured quantitative laboratory traits, such as those collected in the course of routine medical care, may also serve as endophenotypes, that is, genetic correlates of liability in the pathway(s) involved in disease and clinical correlates of rare disorders; an example is triglyceride measures in familial chylomicronemia syndrome due to lipoprotein lipase deficiency (MIM #238600) [24]. In summary, while there is great potential for discovery in biobank-scale data resources, granular EHR information in large unascertained samples is often challenging to extract and to subject to robust QC measures, particularly for rare disorders.

In contrast to association testing approaches that include individual-level genetic data for both cases and controls, publicly available WES and WGS data from worldwide populations aggregated in databases such as ExAC [25] and gnomAD [26] often serve as a reference for the population frequency of genetic variants. However, the aggregate nature of the sequencing data is incompatible with case-control association analysis, and the lack of individual-level phenotype data, including the possibility of affected individuals in the resource, results in reduced statistical power compared to properly phenotyped controls. Additionally, any comparisons between NGS datasets are sensitive to even subtle differences in the sequencing methodology, which can confound analysis and interpretation. Often, large studies with less than WGS data may use sequence imputation to estimate variation in the rest of the genome. Very briefly, this approach uses WGS data from a separate reference population together with observed genome-wide, often common-variant, data in the target study population to impute unobserved sequence [27]. Notably, because imputation accuracy decreases for rare variants (with no ability to detect de novo variants) and is sensitive to ancestry differences between the reference and target populations, tests of imputed sequence, whether single variant or collapsing, have decreased power compared to directly measured genotype studies and must be interpreted with caution [28].

10.4 Conclusion

The application of statistical approaches to human disease genetics began with rare disorders. In 1983 the chromosome 4 locus that harbors HTT was implicated in Huntington disease in a large kindred of approximately 4000 individuals, with substantially higher prevalence of the disease amongst descendants from the Lake Maracaibo region of Venezuela [7]. As technology has advanced in terms of cost and feasibility for collecting genetic markers, from SNP arrays to WGS, so have the statistical methods and computational tools and algorithms to analyze the relationships between genes and disease. Linkage analysis was the preferred analytical approach for low-density genotyping data in pedigrees enriched for affected individuals. In the modern era of widely accessible genome-wide sequencing, linkage analysis can still be a useful analytical method for large pedigrees. Increasingly, sampling designs that are not directly focused on ascertaining individuals with a specific disorder are available to investigators, enabling a different view into the variable expressivity of disease, and consequently motivating the development of different analysis methods that are robust to the statistical pitfalls of very rare events in large samples. For rare diseases in general, there has never been a brighter future of widely available sequencing, detailed health records, patient advocacy groups, and researchers to build a body of evidence for the molecular etiology of disease, and to translate that understanding toward treatment options.

References

[1] Morton N. Sequential tests for the detection of linkage. Am J Hum Genet 1955;7(3):277–318.
[2] Ott J. Estimation of the recombination fraction in human pedigrees: efficient computation of the likelihood for human linkage studies. Am J Hum Genet 1974;26(5):588–97.
[3] Ott J. Analysis of Human Genetic Linkage. Johns Hopkins University Press; 1985.
[4] Burgess S, et al. A review of instrumental variable estimators for Mendelian randomization. Stat Methods Med Res 2017;26(5):2333–55.
[5] Ott J, et al. Genetic linkage analysis in the age of whole-genome sequencing. Nat Rev Genet 2015;16:275–84.
[6] Churchill G, Doerge R. Empirical threshold values for quantitative trait mapping. Genetics 1994;138:963–71.
[7] Gusella J, et al. A polymorphic DNA marker genetically linked to Huntington’s disease. Nature 1983;306:234–8.
[8] Spielman R, et al. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet 1993;52(3):506–16.
[9] Wittkowski K, Liu X. A statistically valid alternative to the TDT. Hum Heredity 2002;54(3):157–64.
[10] Laird N, et al. Implementing a unified approach to family-based tests of association. Genet Epidemiol 2000;19(Suppl 1):S36–42.
[11] Ionita-Laza I, et al. Family-based association tests for sequence data, and comparisons with population-based association tests. European J Hum Genet 2013;21(10):1158–62.
[12] Firth D. Bias reduction of maximum likelihood estimates. Biometrika 1993;80(1):27–38.
[13] Zhou W, et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet 2018;50:1335–41.
[14] Mbatchou J, et al. Computationally efficient whole genome regression for quantitative and binary traits. bioRxiv. Available from: https://doi.org/10.1101/2020.06.19.162354.
[15] Povysil G, et al. Rare-variant collapsing analyses for complex traits: guidelines and applications. Nat Rev Genet 2019;20:747–59.
[16] Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet 2011;13:135–45.
[17] Ionita-Laza I, et al. Sequence kernel association tests for the combined effect of rare and common variants. Am J Hum Genet 2013;92:841–53.
[18] De G, et al. Rare variant analysis for family-based design. PLoS ONE 2013;8(1):e48495.
[19] Zhu N, et al. Rare variants in SOX17 are associated with pulmonary arterial hypertension with congenital heart disease. Genome Med 2018;10(1):56.
[20] Zhu N, Pauciulo MW, Welch CL, Lutz KA, Coleman AW, Gonzaga-Jauregui C, et al. Novel risk genes and mechanisms implicated by exome sequencing of 2572 individuals with pulmonary arterial hypertension. Genome Med 2019;11(1):69.
[21] Horowitz J, et al. Mutation spectrum of NOD2 reveals recessive inheritance as a main driver of Early Onset Crohn’s Disease. Sci Rep 2021;11(1).
[22] Van Hout CV, et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 2020;586:749–56.
[23] Gonzaga-Jauregui C, et al. Clinical and molecular prevalence of lipodystrophy in an unascertained large clinical care cohort. Diabetes 2020;69(2):249–58.
[24] Braham A, Hegele R. Hypertriglyceridemia. Nutrients 2013;5(3):981–1001.
[25] Lek M, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016;536:285–91.
[26] Karczewski K, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 2020;581:434–43.
[27] Das S, et al. Next-generation genotype imputation service and methods. Nat Genet 2016;48:1284–7.
[28] Auer P, Lettre G. Rare variant association studies: considerations, challenges and opportunities. Genome Med 2015;7:16.


Chapter 11 Transcriptomics in rare diseases

Maria Kousi1,2
1 CSAIL, Massachusetts Institute of Technology, Cambridge, MA, United States
2 The Broad Institute of Harvard and MIT, Cambridge, MA, United States

11.1 Introduction

The advent of next-generation sequencing (NGS) has significantly propelled the genetic diagnosis of rare genetic disorders without necessitating the collection of large pedigrees with several affected individuals. Trio-based analyses in which the affected proband and unaffected parents are sequenced have resulted in a genetic diagnosis for a significant portion of sporadic cases with rare diseases, as they allow an initial analytic focus on de novo rare variant alleles. As described in previous chapters, the use of whole-exome sequencing (WES), which interrogates the coding portion of the genome, has greatly accelerated the identification of novel causative genes, enabling genetic resolution in 25%–50% of patients [1–5]. Implementation of whole-genome sequencing (WGS) has further increased the yield, with the diagnostic rate varying per disease group [6], by extending sequencing coverage, uncovering noncoding, regulatory changes, and structural and copy number variants (CNVs) which are difficult to detect by WES, and essentially ascertaining all pathogenic variation captured by existing technologies in a single test (Fig. 11.1A) [6–9]. Despite these advances, a large fraction of patients with rare disorders still remains undiagnosed for a specific contributory genetic lesion.
The next challenges toward improving clinical molecular diagnostic rates lie in our ability to (1) better identify all variant types with notable examples comprising repeat expansions, large insertions and deletions (indels), CNVs, and copy-number neutral structural variants like inversions, (2) interpret the functional relevance of variants of unknown significance (VUS) encompassing largely synonymous, noncoding, and potential gene regulatory changes [10,11], and (3) accurately catalog Mendelian phenotypes [12], as it is now established that in many cases a patient’s clinical manifestation can be due to more than one disease locus [13] (see Chapter 9: Multilocus Inheritance and Variable Disease Expressivity in Rare Disease). Toward bridging this gap, integration of transcriptomics with genetic analyses has proven to provide insights into mechanisms that would have remained elusive if genomics approaches had been used in isolation. Such integrative approaches are crucial in advancing our understanding of regulation across the genome and transcriptome and constitute a building block toward implementation of personalized medicine. This chapter provides an overview of the utility of transcriptomic analyses in elucidating the mechanisms underlying rare diseases and potentially achieving higher molecular diagnostic rates in cases that had remained elusive through genomic DNA analysis alone.

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00007-7 © 2021 Elsevier Inc. All rights reserved.


FIGURE 11.1 (A) Diagnostic yield upon implementation of exome-based, genome-based, and integrated genomic and transcriptomic approaches. (B) Schematic representation of RNA-seq analysis for diagnostic purposes. The three accessible tissues traditionally sampled comprise fibroblasts, muscle, and blood. Following RNA isolation, analyses that yield diagnostic results in rare diseases typically comprise the identification of splicing aberrations, cases of allele-specific expression (ASE), expression level imbalances, and/or variant calling either directly through the RNA sequence or through the integration of RNA with DNA analyses. RNA-seq, RNA sequencing; RNA, ribonucleic acid; DNA, deoxyribonucleic acid.


11.2 The transcriptome and transcriptomic methodologies

The term “transcriptome” was first introduced in the 1990s in an effort to bridge the gap between the structure of the genome and biology, through the study of the short-lived intermediary ribonucleic acid (RNA) molecules [14]. In its entirety, the transcriptome comprises the complete set of all RNA molecules, both coding and noncoding, in a cell, tissue, or individual [15]. Coding RNA molecules in particular are information carriers that relay the instructions encoded at the DNA level for making proteins, in the form of messenger RNA (mRNA) molecules, and comprise the subset of the transcriptome that has received more attention for molecular diagnostic utility and disease relevance. Noncoding RNAs, on the other hand, are mainly associated with regulatory roles in gene transcription and translation that have less frequently been highlighted in disease causality and diagnostics. Initial transcriptomic analyses encompassed methods like expressed sequence tags (ESTs) that involve short cDNA sequences (reverse transcribed mRNA molecules converted into complementary copies of DNA) as a means to gene discovery and gene-sequence determination [16]; and serial analysis of gene expression (SAGE) that uses short sequence tags along concatenated random transcript fragments to uniquely identify transcripts [17]. SAGE also allowed for expression level quantification, given that the number of times each tag is detected corresponds to the transcript’s expression levels in the biospecimen/material assayed. These methodological forerunners of transcriptomic analyses were hampered by the study of short sequences that did not allow for the assessment of complete transcripts, splicing events, and other structural insights [18]. Such limitations have been overcome with the development of contemporary transcriptomic techniques.
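The tag-counting idea behind SAGE quantification, in which the number of times a tag is observed stands in for the abundance of its transcript, can be illustrated with a short sketch. The tag sequences, gene names, and the function `tags_per_million` are hypothetical, and scaling to tags per million is an assumption made here for comparability between samples rather than a step described in the text.

```python
from collections import Counter


def tags_per_million(observed_tags, tag_to_gene):
    """Count tags per gene and scale to tags per million mapped tags.

    observed_tags -- iterable of tag sequences read from the library
    tag_to_gene   -- dict mapping each known tag to the transcript it identifies
    Tags absent from tag_to_gene are ignored.
    """
    counts = Counter(tag_to_gene[tag] for tag in observed_tags if tag in tag_to_gene)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# Hypothetical library: three tags for GENE_A, one for GENE_B, one unmapped.
tag_to_gene = {"AAACCCGGGT": "GENE_A", "TTTGGGCCCA": "GENE_B"}
observed = ["AAACCCGGGT"] * 3 + ["TTTGGGCCCA", "ACGTACGTAC"]
expression = tags_per_million(observed, tag_to_gene)
```

In this toy library GENE_A receives 750,000 and GENE_B 250,000 tags per million. The same counting logic underlies read-count-based quantification in RNA-seq, although real pipelines also handle alignment, transcript length, and library composition.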
Today, there are two main technologies used for the study of the transcriptome: expression microarrays and RNA-seq. Both employ isolation of RNA from other cellular components, enrichment of the extracted molecules, reverse transcription to produce cDNA, and subsequent analysis through hybridization or sequencing, respectively [19]. Expression microarrays were first introduced in the mid-1990s and involve solid glass surfaces on which short oligonucleotides known as “probes,” complementary to the sequence of genes of interest, are arrayed [20]. Fluorescently tagged cDNA from samples of interest and controls is subsequently applied to the surface of the array and allowed to competitively hybridize to the probes. A laser scan is used to visualize the hybridized molecules and establish transcript abundance across the conditions tested, with fluorescence intensity correlating with transcript abundance and expression levels. Microarrays require prior knowledge of the genomic sequence and expressed genes of the organism of interest for the design of the probes. Technological advances have allowed for the assessment of nearly all known genes and the detection of even low-abundance transcripts through improved fluorescent detection; however, microarrays have almost entirely been replaced by NGS techniques during the past decade.

RNA sequencing (RNA-seq) is the predominant methodology used today for the analysis of the transcriptome. The method makes use of the advances in high-throughput NGS and computational data processing methods. The initial steps of RNA-seq are typically similar among the existing methodological variations, involving RNA extraction and quality control. Enrichment of the desired class of RNA species to be analyzed is a necessary step, with mRNA molecules typically representing only 3% of the total isolated RNA pool from a given sample. Library construction
ensues with the reverse transcription of mRNA species to the more stable cDNA intermediary, followed either by fragmentation to produce short cDNA molecules or by the use of recently introduced long-read RNA-sequencing platforms to uncover more complex transcriptomic phenomena [21]. The length of the sequences produced with RNA-seq can consequently range from 30 base pairs to thousands of base pairs depending on the method used. The resulting reads are sequenced either unidirectionally or in a paired-end manner and are then either aligned to the reference genome or used in a de novo transcriptome assembly, in which the reads are mapped onto each other [18]. In addition to the opportunity to interrogate the whole transcriptome at a high sequencing depth, RNA-seq requires much lower quantities of starting material (nanograms as opposed to the micrograms needed for microarrays), allowing for applications of this methodology to finer structures, such as single cells [22]; the sequencing approach is also more “quantitative” than either arrays or SAGE.
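For the competitive hybridization design of expression microarrays described in this section, relative transcript abundance is commonly summarized as a log ratio of the sample and control fluorescence intensities. The following Python sketch shows that calculation under stated assumptions: the probe names, intensity values, and pseudocount are hypothetical, and background correction and normalization, which real analyses require, are omitted.

```python
import math


def log2_ratio(sample_intensity, control_intensity, pseudocount=1.0):
    """Log2 ratio of sample to control fluorescence for one probe.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a probe shows no signal in one channel.
    """
    return math.log2((sample_intensity + pseudocount) / (control_intensity + pseudocount))


# Hypothetical per-probe intensities: (sample channel, control channel).
probe_intensities = {
    "PROBE_1": (799.0, 199.0),
    "PROBE_2": (99.0, 99.0),
}

log_ratios = {
    probe: log2_ratio(sample, control)
    for probe, (sample, control) in probe_intensities.items()
}
```

A positive value indicates higher signal in the sample than in the control, a negative value indicates lower signal, and a value of zero indicates equal signal in both channels.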

11.3 Transcriptomics in rare diseases

RNA-seq provides an open-platform, detailed view into the whole transcriptome of an individual at a given time point, in the cells and/or tissues assayed, lending the opportunity to both quantify RNA species and also detect rare and novel RNA transcript variants [23]. Consequently, RNA-seq enables the detection of lesions involving single nucleotide variants (SNVs), indels, and events leading to aberrant transcript regulation such as expression level imbalances (eQTLs), monoallelic or allele-specific expression (ASE), differential isoform expression, and alternative splicing (Fig. 11.1B). When used as a complementary tool to WES and WGS, RNA-seq has been reported to facilitate molecular diagnosis of rare genetic disorders, as it allows for a direct functional assessment of the effect of VUSs on RNA transcript expression and abundance [24–26]. Overall, it has been reported that the complementary use of genomic and transcriptomic approaches results in an increased molecular diagnostic yield by 10%–35% across a spectrum of different disorders [24–27], with the rate of diagnostic resolution limited by and depending upon relevant tissue accessibility.

11.3.1 Mechanisms underlying RNA-seq-based genetic diagnoses

Studies using RNA-seq to provide molecular diagnoses in cohorts of patients with unidentified bona fide changes based on WES and/or WGS have reported four major mechanisms through which previously uninterpretable variation or elusive regulatory effects result in the manifested phenotype. These mechanisms affected both known and potentially novel candidate disease genes, expanding our understanding of disease genomics and disease contributory variant types and alleles.

1. Aberrant splicing: Alterations in pre-mRNA splicing have long been recognized as a major cause of Mendelian disorders [28,29]. The disease relevance of this mechanism was highlighted through seminal work in hemoglobinopathies, a genetically heterogeneous group of disorders resulting from mutations in the beta-globin (HBB) gene that result in either the complete (β0) or partial (β+) loss of β-globin expression [30]. Specifically, it was only through the analysis of RNA with S1 nuclease mapping and subsequent cDNA sequencing that the effects of single-nucleotide changes in both β0 and β+ were shown to be disease-causing, as they induced splicing defects leading to premature globin termination [31,32]. These historical studies relied
on the expression of mutant transcripts in heterologous systems, therefore limiting the scale at which such analyses could be performed. High-throughput analyses have been enabled with the advent of NGS and computational tools relying on the comparative analyses of patient and control cohorts. In such instances, the identification of aberrantly spliced transcripts is dependent upon comparison of the patient sample(s) against a set of control individuals or databases such as the one described by the Genotype-Tissue Expression (GTEx) Consortium Project [33]. According to some estimates, up to 62% of disease-causing SNVs, 25% of which are seemingly benign synonymous changes, can disrupt normal splicing [34,35]. However, when these VUSs lie beyond the consensus donor GT and acceptor AG sequences of the splice junctions, their impact on splicing cannot be confidently predicted with the currently available bioinformatic tools. Several of these variants represent synonymous changes, are located in deep intronic sequences, or exert their effects through other complex cis-regulatory elements. Systematic analysis of splicing patterns in a cohort of 94 individuals with different pathologies showed that the number of putative splicing aberrations is influenced by the number of junctions in each assessed gene and is further enriched in genes with high intolerance for protein-truncating variants, with a higher frequency of events in affected individuals versus controls [36]. A separate study on 25 patients with an unresolved neuromuscular diagnosis reported the identification of a median of five new “low-frequency novel junctions” (LFNJ) in the patient cohort, with some patients having as few as one LFNJ and others having as many as twenty [26].
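The comparison of patient splice junctions against a control cohort can be sketched in a few lines of Python. This is a simplified illustration and not the method of any of the cited studies: the junction representation (chromosome, donor position, acceptor position), the read-support threshold, the control-frequency threshold, and all coordinates are assumptions made for this example.

```python
def find_novel_junctions(patient_junctions, control_junction_sets,
                         min_reads=5, max_control_fraction=0.0):
    """Return patient junctions that are well supported and rare in controls.

    patient_junctions: dict mapping (chrom, donor, acceptor) to supporting read count.
    control_junction_sets: list with one set of junctions per control individual.
    min_reads and max_control_fraction are illustrative thresholds.
    """
    n_controls = len(control_junction_sets)
    novel = []
    for junction, reads in patient_junctions.items():
        if reads < min_reads:
            continue  # too little read support to trust the junction
        n_seen = sum(1 for control in control_junction_sets if junction in control)
        control_fraction = n_seen / n_controls if n_controls else 0.0
        if control_fraction <= max_control_fraction:
            novel.append(junction)
    return sorted(novel)


# Hypothetical data: one annotated junction, one candidate novel junction,
# and one weakly supported junction that is filtered out.
patient = {
    ("chrX", 100, 500): 40,
    ("chrX", 100, 320): 12,
    ("chrX", 600, 900): 2,
}
controls = [
    {("chrX", 100, 500)},
    {("chrX", 100, 500)},
    {("chrX", 100, 500), ("chrX", 600, 900)},
]

candidates = find_novel_junctions(patient, controls)
```

Raising `max_control_fraction` above zero would retain junctions that are present but uncommon in controls, which is closer in spirit to a "low-frequency" definition; the appropriate cutoff depends on the size and composition of the control set.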
These novel splice junctions can result in a broad range of splicing aberrations involving exon skipping, exon extension, and exonic and intronic splice-gain, leading to the expression of new isoforms but also influencing transcript expression levels, because a substantial portion of these events leads to premature termination and possibly degradation through the mechanism of nonsense-mediated decay (NMD). It is noteworthy that not all protein-truncating variants result in loss-of-function (LoF). Rather, systematic analysis of control population databases has shown that a significant number of genes tolerant to LoF variants are hampered by truncating variants that underlie disease through a gain-of-function (GoF) mechanism, whereby the resulting transcripts escape NMD and represent antimorphic or neomorphic alleles [37]. The utility of RNA-seq in identifying diagnostic splice defects is highlighted in the study by Cummings et al., in which candidate splice-disrupting mutations increased the diagnostic rate by 35% in a cohort of 50 individuals with rare muscle disorders [24]. The main mechanisms identified to date through which LoF and GoF alleles affect normal transcript splicing are:

• Intronic gain: Deep intronic variants in the dystrophin (DMD) gene were shown to result in the creation of a pseudo-exon that, when present, leads to premature termination in one patient with Duchenne muscular dystrophy (DMD, MIM #310200) and was combined with residual normal splicing in another two patients with a milder phenotype consistent with Becker muscular dystrophy (BMD, MIM #300376), highlighting how the combinatorial assessment of causative variants and their effect on expression levels can inform phenotypic expressivity and prognosis [24].
This is not an isolated example of DMD dysregulation; an independent study reported the inclusion of a pseudo-exon in intron 1 of DMD in a second patient with DMD and lower gene expression due to an independent event leading to a premature stop codon in a third patient [26]. Moreover, intronic gains are not specific to
DMD either. The report of a pseudo-exon inclusion led to the clinical genetic diagnosis of a patient with DYSF deficiency underlying Miyoshi myopathy (MMD1, MIM #254130) [26]. In a separate example, TIMMDC1 deficiency was reported to have occurred independently in probands with mitochondrial complex I deficiency (MC1DN31, MIM #618251) from three unrelated families and three different ethnicities, due to a frameshift variant (p.Gly199_Thr200ins5*) and consequent protein loss caused by a homozygous deep intronic variant (c.596+2146A>G) in intron 5 that results in the inclusion of a pseudo-exon [25]. The most prominent example of intronic sequence inclusion was identified in COL6A1, which together with its paralogs COL6A2 and COL6A3 causes collagen VI-related dystrophies. Specifically, a GC>GT variant creating a novel donor splice site that results in the in-frame inclusion of a pseudo-exon was identified in four unrelated patients [24]. This variant was absent from GTEx, but when assessed in an expanded cohort of patients with a suspected collagen VI-like dystrophy, it was reported to have occurred de novo in 27 of them, leading to the conclusion that this particular change alone is likely to account for up to 25% of WES mutation-negative patients with a suspected collagen VI-related dystrophy [24]. To further expand the aberrant splicing landscape of collagen VI-related dystrophies, a similar approach uncovered a 2094 bp deletion in intron 1 of COL6A2, which Mendelizes a second small deletion in exon 28 in trans in a patient with Bethlem myopathy (BTHLM1, MIM #158810) [38].
• Exonic extension: In a patient with a muscle disorder and a previously identified variant in nebulin (NEB), which is known to underlie autosomal recessive nemaline myopathy (NEM2, MIM #256030), RNA-seq revealed a second variant at position +3 of an exonic donor site resulting in exonic extension and the introduction of a premature termination codon that Mendelizes the previously identified variant and corroborates the NEB molecular genetic diagnosis in the patient [24]. Similarly, in a patient with congenital myopathy, a synonymous point mutation (c.2862G>A; p.Pro894Pro) in the last nucleotide of RYR1 exon 21 resulted in suppression of the canonical donor splice site of exon 21 and the activation of two alternative donor splice sites extending exon 21 by 44 and 141 bp, respectively, and introducing frameshifts that reduce RYR1 expression by 2.1-fold. This event Mendelized a previously identified heterozygous variant (c.9859C>T; p.Arg3287Cys), confirming a RYR1-positive diagnosis [26]. An exonic extension induced by a variant (c.28114G>A) in titin (TTN) was also reported in compound heterozygosity with a previously detected LoF variant (c.74837_74840dupTTAG; p.Arg24947Ser 2) in a patient with congenital myopathy [26]. Finally, the identification of exonic extensions caused by the homozygous LAMA2 c.92121G>A and the hemizygous DMD c.93+1G>C variants further increases the repertoire of such events in rare genetic disorders [26].

• Exon skipping: Defects encompassed in this category involve both complete skipping of an exon and partial skipping of exonic sequence. Several examples have been reported of complete or partial exon skipping that either is sufficient per se to provide a genetic diagnosis or Mendelizes a second pathogenic variant through compound heterozygosity.
In the simpler cases of exon skipping, SNVs either disrupt the canonical splice site sequence or create new sequence motifs that result in complete exon skipping and expression of alternate isoforms. Exemplifying this phenomenon, homozygous missense or synonymous variants close to the splice junction have been
described in POMGNT1, LAMA2, and CLPP in patients with muscle disorders [24–26]. In more complex examples of exon skipping, competing splice sites can be created in the exon and result in partial exonic inclusion. One such example was described in a patient with multi/minicore congenital myopathy carrying a missense variant that created a new donor splice site that was stronger than the canonical site and resulted in gain of splicing from the exon body of TTN [24]. A similar event caused by a synonymous variant was identified in RYR1 and, together with the prior identification of a second pathogenic variant in the same gene, confirmed a genetic diagnosis for the patient carrying these lesions [24].

• Splice disruptions in repeat regions: The tripartite regions of NEB and TTN harbor highly similar sequences that are hard to deconvolute through short-read NGS due to poor mapping quality [39,40]. Cummings and colleagues developed a method based on remapping of the tripartite regions to a detriplicated pseudoreference and subsequently performing hexaploid variant calling [24]. This approach allowed for the identification within the repeat regions of a nonsense variant in NEB and a frameshift-introducing allele in TTN, each of which Mendelized preexisting heterozygous alleles identified by genomic sequencing, therefore providing sufficient evidence to finalize the clinical molecular diagnoses [24].

Importantly, implementation of RNA-seq not only attributed pathogenicity to previously identified VUSs but also resolved benign VUSs that were erroneously considered “likely pathogenic” due to their proximity to canonical exon–intron junctions. Exemplifying this, no evidence of aberrant splicing was found in two patients with identified putative extended splice site VUSs, demonstrating the utility of this approach in functionally assessing candidate disease variants [24].

2. Allele-specific expression (ASE): Also referred to as allelic imbalance or monoallelic expression, ASE is established after quantifying the difference in expression of two individual alleles or haplotypes at a specific genetic locus [41]. Allelic imbalance occurs when transcription from one allele is selectively favored or suppressed, or when transcripts undergo post-transcriptional degradation via mechanisms such as NMD. The two most prominent examples of monoallelic expression are X-chromosome inactivation, in which the double dosage of X-linked genes in females is accounted for by randomly selecting and silencing one of the two X-chromosomes (see Chapter 7: X-linked and Mitochondrial Disorders) [42], and genomic imprinting, in which genes/alleles are expressed in a parent-of-origin-selective manner [43,44]. In addition to these better-characterized mechanisms, ASE has also been reported for autosomal genes through a plethora of mechanisms including, but not limited to, the presence of CNVs, rare changes in cis-regulatory elements and expression quantitative trait loci (eQTLs) [45], long noncoding RNAs [43], and allele-specific chromatin marks [46]. Studies suggest that between 0.5% and 15% of autosomal genes exhibit random monoallelic expression [47] and that the extent of this phenomenon is both cell-type-specific and developmentally regulated, with the number of mitotically stable monoallelically expressed genes increasing 5.6-fold upon differentiation from stem cells to neural progenitor cells in the mouse [48]. The intersection of genomic and transcriptomic data offers a unique opportunity to finely characterize the frequency and mechanistic properties of this biological phenomenon, which can greatly influence disease causality through preferential expression of pathogenic alleles or, conversely, confer disease protection through the preferential expression of wild-type haplotypes.
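Quantifying the difference in expression between two alleles at a heterozygous site can be sketched as an allelic ratio combined with a test against the balanced 50:50 expectation. The Python example below uses an exact two-sided binomial test built from the standard library; the read counts, the choice of test, and the function names are assumptions of this sketch, and published ASE methods additionally account for reference mapping bias and overdispersion.

```python
from math import comb


def allelic_ratio(ref_reads, alt_reads):
    """Fraction of reads at a heterozygous site that carry the alternate allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for observing k alternate reads out of n.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one under the null hypothesis of balanced expression.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(prob for prob in pmf if prob <= threshold))


# Hypothetical site covered by 100 reads, 85 of which carry the alternate allele.
ratio = allelic_ratio(ref_reads=15, alt_reads=85)
p_value = binomial_two_sided_p(k=85, n=100)
```

A ratio near 0.5 with a large p-value is compatible with balanced biallelic expression, whereas a ratio near 0 or 1 with a small p-value points to allelic imbalance that may warrant follow-up.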

The role of ASE in disease is exemplified in a patient with congenital muscular dystrophy who had previously been found to carry a heterozygous missense variant (c.2584T>C; p.Cys862Arg) in LAMA2. RNA-seq revealed that 85% of the reads covered the maternal transcript harboring the missense variant, whereas the paternal copy, which was subsequently found to carry a c.4860G>A SNV causing an exon 33 skipping event, showed almost complete loss of expression [26]. A second example highlighting the diagnostic potential of RNA-seq involves a patient with a mucolipidosis diagnosis (ML4, MIM #252650) but no detectable deficiency on routine enzymatic tests. DNA analysis had revealed a heterozygous nonsense change (c.832C>T; p.Gln278*) in MCOLN1 but no other variants. Transcriptomic evaluation revealed almost complete loss of expression of the nonsense allele, and 91% expression of a trans allele harboring an intronic variant (c.681-19A>C) resulting in intronic sequence retention and premature transcript termination (p.Lys227_Leu228ins16*), corroborating the presence of two LoF alleles in MCOLN1 and establishing the genetic diagnosis in this patient [25]. In parallel with the reported isolated examples, systematic parsing for ASE in a cohort of individuals with congenital heart disease (CHD) revealed an enrichment for ASE in patients versus GTEx controls (P = .00011), denoting the role of this mechanism in disease [36]. Specifically, 45 events of extreme monoallelic expression were reported in CHD probands, of which ASE was due to NMD in only 16% (7/45) of instances, with the remaining 84% of cases possibly arising due to mutations in regulatory elements through mechanisms that remain to be identified [36].

3. Aberrant expression: In a reverse causality approach, RNA-seq can facilitate the identification of extreme expression changes in case samples.
Under such a scenario, the gene expression levels of patients are compared against those of control individuals, such as the ones present in GTEx, to identify genes whose expression is significantly higher or lower than expected and that thus potentially represent disease candidates. These candidates are referred to as expression outliers and are often associated with rare noncoding variants whose effect on expression is challenging to ascertain. To date, a number of reports have exemplified this causal link between rare noncoding changes and extreme expression changes in single tissues [49,50]. In a systematic analysis of 94 individuals spanning 16 diverse disease categories, each sample was reported to harbor an average of 343 expression outliers, denoting that prioritization among those candidates is crucial [36]. To facilitate dissection of the disease-relevant outliers, the authors of the study proposed a filtering process whereby restriction to (1) underexpression outliers, which are more likely to be due to LoF mutations, (2) genes with a nearby deleterious rare variant (MAF < 0.1%), and (3) genes linked to the patient’s phenotype by representing a human phenotype ontology match significantly reduces the candidate gene list to less than 1% of the initial set of outliers [36]. A similar approach was described in an independent study introducing the ANEVA method, through which genetic variation within a genomic window of the outlier genes is quantified and ranked to facilitate the identification of genes with heterozygous variants of unusually strong effect on gene expression [51]. These methodologies are able to narrow down the number of candidate expression outliers to an average of 10–11 per individual, with the presence of a phenotype-relevant gene in at least 36% of the evaluated patients and the promise to uncover new biology in the remaining individuals [36,52].
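The logic of calling expression outliers against controls and then applying the three-step restriction described above can be sketched as follows. This is an illustrative simplification rather than the published pipeline: the z-score definition, the cutoff of -2, the assumption that expression values are already normalized and log-transformed, and all gene names and values are hypothetical.

```python
from statistics import mean, stdev


def expression_z_scores(patient_expr, control_expr):
    """Per-gene z-score of the patient value relative to the control distribution.

    patient_expr: dict gene -> normalized expression in the patient.
    control_expr: dict gene -> list of normalized expression values in controls.
    Genes with no variability across controls are skipped.
    """
    z_scores = {}
    for gene, value in patient_expr.items():
        controls = control_expr[gene]
        spread = stdev(controls)
        if spread == 0:
            continue
        z_scores[gene] = (value - mean(controls)) / spread
    return z_scores


def prioritize_outliers(z_scores, genes_with_rare_variant, phenotype_genes, z_cutoff=-2.0):
    """Keep underexpression outliers that also have a nearby rare deleterious
    variant and match the patient's phenotype (for example, through HPO terms)."""
    return sorted(
        gene
        for gene, z in z_scores.items()
        if z <= z_cutoff
        and gene in genes_with_rare_variant
        and gene in phenotype_genes
    )


# Hypothetical normalized expression values.
patient_expression = {"GENE_A": 2.0, "GENE_B": 5.0, "GENE_C": 10.0}
control_expression = {
    "GENE_A": [10.0, 11.0, 9.0, 10.0, 10.0],
    "GENE_B": [5.0, 6.0, 4.0, 5.0, 5.0],
    "GENE_C": [20.0, 22.0, 18.0, 20.0, 20.0],
}

z = expression_z_scores(patient_expression, control_expression)
shortlist = prioritize_outliers(
    z,
    genes_with_rare_variant={"GENE_A", "GENE_C"},
    phenotype_genes={"GENE_A", "GENE_B"},
)
```

In this toy example two genes are strongly underexpressed, but only one of them also satisfies the variant and phenotype criteria, illustrating how successive filters shrink a long outlier list to a short set of candidates.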
An independent approach to narrowing down the list of candidate expression outliers is to evaluate aberrant expression across tissues. This methodology reduced the number of
candidates from 83 genes per tissue to a median of 10 per individual across tissues, with higher correlation levels across related tissues [51]. Furthermore, genes that represent expression outliers across multiple tissues are also enriched for structural, splice-site, frameshift, and regulatory variants that mediate ASE among other effects [51]. In more complex cases, the exact variants leading to expression outliers remain unidentified and necessitate the implementation of other approaches that often involve intersection with other omics methodologies such as proteomics or metabolomics (see Chapter 12: Other Omics Approaches to the Study of Rare Diseases), in order to corroborate the disease causality uncovered by aberrant expression. This is exemplified through the report of three patients with muscular hypotonia, developmental delay, and neurological decline for whom WES analysis had revealed no plausible candidate variants [25]. RNA-seq revealed diminished expression of MGST1 in one patient and TIMMDC1 in the remaining two. This was further corroborated using quantitative proteomics that showed only ~2% of MGST1 expression and no detectable traces of TIMMDC1, respectively, confirming the genetic diagnosis for these individuals and further providing clues for candidate therapeutic targets [25]. The variants associated with these extreme expression observations remained elusive.

4. Identification of genomic variants: As mentioned, RNA-seq is typically employed as a complementary tool to genomic DNA sequencing through WES and/or WGS. In practice, RNA-seq could be used to combinatorially decipher the genotype and also provide information on expression levels and splicing at the same time.
The drawback of this approach is that, unlike DNA sequencing, transcriptomic analyses are typically hampered by uneven coverage, variable transcript expression, noncomprehensive exonic inclusions through the detection of alternative isoforms, and RNA editing events, to name a few considerations [53]. A systematic comparison between RNA-seq and WGS in a cohort of patients revealed that RNA-seq alone is only able to recover 90% of the variants identified through WGS, with the undetected variants being either filtered out or lying in areas with poor/no coverage or exons not present in the expressed isoforms [26]. Similar results were reported for WES in a study of patients presenting with a clinical diagnosis of Cornelia de Lange syndrome [54]. In contrast to the number of VUSs uncovered by WGS, RNA-seq only slightly increases the candidate variation to be considered compared to WES, through the inclusion of pre-mRNA untranslated regions (UTRs). In a patient presenting with congenital muscular dystrophy, a heterozygous variant in the 5′ UTR of GMPPB created a novel start codon adding 29 novel amino acids to the protein and Mendelizing a previously identified heterozygous change in trans (c.94C>T; p.Pro32Ser) [26]. It is noteworthy that the same study reported that 2% of the variants detected through RNA-seq were not seen on WGS. These changes represent either false discoveries or likely somatic mosaic variation with a possible causal role in genetic disorders (see Chapter 8: Mosaicism in Rare Disease) that is progressively being highlighted by the advent of novel sequencing technologies allowing us to study these more complex mechanisms [26].
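The comparison between RNA-seq and WGS call sets described above amounts to simple set operations on variant identifiers. The Python sketch below illustrates how the recovered fraction and the RNA-only fraction could be computed; the variant representation (chromosome, position, reference allele, alternate allele) and the toy call sets are hypothetical, and real comparisons also require harmonized variant normalization and filtering.

```python
def compare_call_sets(rna_variants, wgs_variants):
    """Summarize the overlap between RNA-seq and WGS variant calls.

    Both inputs are sets of (chrom, pos, ref, alt) tuples.
    Returns the fraction of WGS variants recovered by RNA-seq, the fraction
    of RNA-seq variants absent from WGS, and the RNA-only variants themselves.
    """
    shared = rna_variants & wgs_variants
    rna_only = rna_variants - wgs_variants
    recovered_fraction = len(shared) / len(wgs_variants) if wgs_variants else 0.0
    rna_only_fraction = len(rna_only) / len(rna_variants) if rna_variants else 0.0
    return {
        "recovered_fraction": recovered_fraction,
        "rna_only_fraction": rna_only_fraction,
        "rna_only": sorted(rna_only),
    }


# Hypothetical call sets: WGS reports ten variants; RNA-seq recovers nine of
# them and reports one additional variant that WGS did not call.
wgs_calls = {("chr1", 1000 + i, "A", "G") for i in range(10)}
rna_calls = {("chr1", 1000 + i, "A", "G") for i in range(9)} | {("chr2", 500, "C", "T")}

summary = compare_call_sets(rna_calls, wgs_calls)
```

Variants in the RNA-only set are the ones that would need follow-up to distinguish false discoveries and RNA editing events from candidate somatic mosaic variation.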

11.3.2 Transcriptomic analysis highlights disease modifiers

In addition to the diagnostic utility of RNA-seq, reports have highlighted its use in uncovering genome-wide expression changes occurring in response to primary locus dysfunction that can be mined to uncover disease modifiers. An elegant study exemplifying this capability of RNA-seq
involved the transcriptomic analysis of two “escaper” dogs from a pedigree of spontaneously occurring DMD in Golden Retrievers that exhibited an unusually mild phenotype [55,56]. In particular, the “escaper” dogs displayed functional muscle and normal lifespan despite the complete absence of dystrophin, which in typically affected dogs results in early progressive muscle degeneration, fibrosis, muscle atrophy, contractures, and premature death by age 2 [55]. Comparative transcriptomic analysis among wild-type, affected, and “escaper” dogs, integrated with genomic analysis for the identification of variants leading to the candidate transcriptional imbalances, identified Jagged 1 as a candidate modifier gene underlying the milder phenotype in the two “escaper” dogs, providing insights for novel candidate therapeutic approaches [55]. More recently, a study reporting the transcriptomic analysis of 4000 patients with Huntington disease (HD, MIM #143100) across 48 tissues revealed disease modifiers such as the DNA repair-associated PMS2 gene, as well as the combinatorial dysregulation of expression across several genes within a pathway [57]. One such coexpression module reported in this study involved the identification of a cortical gene expression set correlating with cognitive decline and clinical onset, therefore presenting a candidate target pathway for future approaches aiming at disease modulation and treatment [57].

11.3.3 Tools for transcriptomics analyses in rare disease diagnosis

The recognition that our ability to detect transcriptomic aberrations can increase the genetic diagnostic yield has led to a significant effort in developing tools aiming to facilitate the identification of these phenomena. In addition to the plethora of tools designed for data preprocessing and quality control, several options for either de novo or annotation-guided aligners have also been generated. Analyses of differential splicing and isoform-specific expression are central in the development of such resources, with Cufflinks and other packages being widely used to date [67]. More recently, tools that jointly analyze the genome and transcriptome in an attempt to weigh the effect of regulatory and other variants on transcript expression, such as ANEVA-DOT [52] and RIVER (RNA-informed variant effect on regulation) [51], have been developed. Implementation of these novel methodologies was reported to significantly decrease the number of candidate genes underlying genetic clinical diagnoses and to increase the resolution of variant interpretation when assessed jointly with other resources. Finally, the increasing deposition and public availability of large RNA-seq datasets from different tissues and ages, which serve as controls against which cases with undetermined genetic etiologies can be queried, are also expected to propel the discoveries made through the use of RNA-seq methodologies [64,68].

11.4 Single-cell resolution transcriptomics

In the past decade, methodologies enabling transcriptomic analyses at the single-cell level have been developed [58]. Single-cell RNA-sequencing (scRNA-seq) represents a genomic approach that permits interrogation of all RNA molecules at single-cell granularity. Initial studies have shown that scRNA-seq adequately captures events pertinent to monoallelic expression [59], expression of specific isoforms or aberrantly spliced transcripts [60], and gene coexpression patterns corresponding to coregulated modules [61], while maintaining a high degree of signal specificity over the technical
noise that is likely to be introduced due to the limited source material used [62]. This technology has enabled the profiling of cell types and cell states across several healthy tissues, leading to the assembly of a reference catalog of all human cells presented through the Human Cell Atlas initiative [63]. These resources are paramount for comparisons that will highlight how the number, composition, state, and homeostasis of individual cells or cell-types change in disease versus health conditions. Initial studies have been performed in the context of common diseases, uncovering a level of biological complexity that is undetectable by bulk transcriptomic analyses. Prominent examples involve the identification of differentially expressed genes in different neuronal subpopulations from Alzheimer disease patients [64], and islet cells from Type 2 diabetes patients [65], and also involve the identification of novel disease-associated cell types and cellular subpopulations through lineage hierarchy, as exemplified in the delineation of the airway epithelium in Cftr expressing mouse airway epithelial cells [66]. Providing cell-type-specific context to the transcriptome will be important toward understanding disease manifestation and expressivity. Affectedness of a specific cell type or group of cell types is likely to inform why mutations that are present in all cells exhibit such tissue or cell-type-specific effects. Similarly, gene expression profiling across cell types and states in health and disease holds the promise to elucidate the mechanisms underlying disease onset, progression, and overall expressivity. Though scRNA-seq has not yet been applied in the elucidation of the mechanisms underlying rare diseases, it certainly represents the next frontier in transcriptomic studies aiming to better refine our understanding of the basis and complex pathomechanisms of disease.

11.5 Limitations of using RNA-seq in clinical molecular diagnosis

Though currently used to a much lesser extent compared to genomic sequencing approaches like WES and WGS, the utility of RNA-seq in genetic diagnosis, especially when implemented combinatorially with DNA sequencing, has been well established in recent years. Nevertheless, a number of important considerations need to be addressed in the future, before RNA-seq can routinely be performed in parallel with WES/WGS toward seeking clinical genetic diagnoses. Harmonization of methods and protocols used is the first step toward allowing for comparisons across samples and studies. This involves homogenization of sequencing protocols; specification of tissue characteristics involving site of acquisition, storage and processing; quality and quantity of extracted RNA; library preparation; algorithms used for processing RNA-seq data; filtering steps; information on coverage; and ultimately a thorough description of the phenotype of the patient analyzed. The aforementioned considerations are meant to control the technical variability that inherently exists in RNA-seq-based experiments. Another important confounding factor that needs to be considered in such analyses is associated with the biological variability of the transcriptome highlighted through tissue and cell-type-specific expression, alternate splicing, ASE, and other mechanisms described in this chapter. In contrast to the genomic code, the transcriptome is characterized by its very dynamic nature, whereby not all genes are expressed at all times, developmental stages, or tissues. It is therefore advisable to control for biases that can arise due to variables such as age, sex, and biopsy site when evaluating transcriptomic data.


Tissue-specific expression, in particular, is a pertinent factor in RNA-seq analysis, owing to expression-level discrepancies and the wide spectrum of isoforms and alternative splicing across diverse cell types and tissues [69]. Access to the disease-relevant tissue is often prohibitive because sampling it requires invasive procedures. The first studies employing RNA-seq used blood as a source tissue, showing that only a fraction of tissue-relevant genes is expressed therein, with approximately 76% of genes associated with neurological and muscular phenotypes and 71% of all OMIM genes being expressed in blood [24,26,36]. Systematic comparisons between the three traditionally sampled accessible tissues that comprise blood, muscle, and fibroblasts, and expression profiles from all other nonaccessible tissues through GTEx, reported that whole blood inadequately represents 34% of genes, lymphoblasts misrepresent 17%, and fibroblasts 11% of genes from nonaccessible tissues [70]. Choosing cell types that show higher expression concordance with the disease-relevant tissue than blood does improved diagnostic accuracy by 10% in a study using fibroblasts as a source material for RNA-seq in patients with mitochondrial disease [25,26]. The remarkable progress made over the past two decades in stem cell reprogramming protocols, which allow the generation of essentially any human tissue of interest, provides an appealing alternative for overcoming this important hurdle in transcriptomic analyses. Stem-cell-derived relevant cell types not only confer a tissue-specific profiling advantage but also allow regular sampling, which in principle enables the capture of developmental or other time-point-relevant transcriptomic differences. Detection of aberrant expression patterns is contingent upon comparison of the affected transcriptomic profile to matched control profiles.
Similar to genetic studies, the first transcriptomic analyses required the assembly of an in-house control cohort, but as resources are continuously being built and the need to evaluate individual samples rises, databases covering several tissues across several healthy donors, such as GTEx, have been built [71]. Despite the advantages offered by this invaluable reference database, it is important to point out that this resource is based on a collection of adult tissues, which can differ strikingly from the pediatric tissue that is often analyzed when seeking a clinical genetic diagnosis. Comparison of the same tissue across GTEx adults and pediatric samples showed a wider variability in the expression profile of children [26]. It is therefore recommended to expand the efforts toward building reference databases comprising samples across different sexes, developmental ages, and tissues, in order to achieve higher resolution in the comparisons made and maximize the potential benefits of RNA-seq in clinical diagnosis.
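The dependence of aberrant-expression detection on matched controls can be illustrated with a minimal sketch. It assumes a simple per-gene z-score on log2-transformed expression values and an arbitrary cutoff of 3; the function names, the cutoff, and the toy values are hypothetical and are not taken from the studies cited above, which use more elaborate statistical models.

```python
import math


def expression_zscores(patient, controls):
    """Per-gene z-scores of log2(x + 1) patient expression versus controls.

    patient:  dict gene -> expression value (e.g., TPM) in the affected sample
    controls: dict gene -> list of values from matched control samples
    """
    zscores = {}
    for gene, value in patient.items():
        ref = [math.log2(v + 1.0) for v in controls.get(gene, [])]
        if len(ref) < 2:
            continue  # too few controls to estimate the spread
        mean = sum(ref) / len(ref)
        var = sum((r - mean) ** 2 for r in ref) / (len(ref) - 1)
        sd = math.sqrt(var)
        if sd == 0.0:
            continue  # no variation in controls; z-score undefined
        zscores[gene] = (math.log2(value + 1.0) - mean) / sd
    return zscores


def flag_outliers(zscores, threshold=3.0):
    """Genes whose absolute z-score exceeds the (assumed) cutoff."""
    return sorted(g for g, z in zscores.items() if abs(z) > threshold)


# Toy data (hypothetical): GENE_A is strongly reduced in the patient.
patient = {"GENE_A": 1.0, "GENE_B": 52.0, "GENE_C": 10.0}
controls = {
    "GENE_A": [48, 55, 50, 60, 52],
    "GENE_B": [49, 51, 50, 53, 47],
}
outliers = flag_outliers(expression_zscores(patient, controls))
```

A gene without control values (GENE_C here) is skipped rather than scored, which mirrors the point made above: without an appropriate reference, an expression value cannot be judged aberrant.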

References

[1] Retterer K, Juusola J, Cho MT, Vitazka P, Millan F, Gibellini F, et al. Genet Med 2016;18:696-704.
[2] Yang Y, Muzny DM, Xia F, Niu Z, Person R, Ding Y, et al. JAMA 2014;312:1870-9.
[3] Taylor JC, Martin HC, Lise S, Broxholme J, Cazier J-B, Rimmer A, et al. Nat Genet 2015;47:717-26.
[4] Liu P, Meng L, Normand EA, Xia F, Song X, Ghazi A, et al. N Engl J Med 2019;380:2478-80.
[5] Posey JE, O’Donnell-Luria AH, Chong JX, Harel T, Jhangiani SN, Coban Akdemir ZH, et al. Genet Med 2019;21:798-812.
[6] Lionel AC, Costain G, Monfared N, Walker S, Reuter MS, Hosseini SM, et al. Genet Med 2018;20:435-43.
[7] The 1000 Genomes Project Consortium. Nature 2015;526:68-74.
[8] Sudmant PH, Rausch T, Gardner EJ, Handsaker RE, Abyzov A, Huddleston J, et al. Nature 2015;526:75-81.
[9] Werling DM, Brand H, An J-Y, Stone MR, Zhu L, Glessner JT, et al. Nat Genet 2018;50:727-36.
[10] Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Genet Med 2015;17:405-24.
[11] Cooper GM. Genome Res 2015;25:1423-6.
[12] Chong JX, Buckingham KJ, Jhangiani SN, Boehm C, Sobreira N, Smith JD, et al. Am J Hum Genet 2015;97:199-215.
[13] Posey JE, Harel T, Liu P, Rosenfeld JA, James RA, Coban Akdemir ZH, et al. N Engl J Med 2017;376:21-31.
[14] Piétu G, Mariage-Samson R, Fayein NA, Matingou C, Eveno E, Houlgatte R, et al. Genome Res 1999;9:195-209.
[15] Morozova O, Hirst M, Marra MA. Annu Rev Genomics Hum Genet 2009;10:135-51.
[16] Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, et al. Science 1991;252:1651-6.
[17] Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Science 1995;270:484-7.
[18] Wang Z, Gerstein M, Snyder M. Nat Rev Genet 2009;10:57-63.
[19] Bryant S, Manning DL. Methods Mol Biol 1998;86:61-4.
[20] Schena M, Shalon D, Davis RW, Brown PO. Science 1995;270:467-70.
[21] Soneson C, Yao Y, Bratus-Neuenschwander A, Patrignani A, Robinson MD, Hussain S. Nat Commun 2019;10:3359.
[22] Hashimshony T, Wagner F, Sher N, Yanai I. Cell Rep 2012;2:666-73.
[23] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Nat Methods 2008;5:621-8.
[24] Cummings BB, Marshall JL, Tukiainen T, Lek M, Donkervoort S, Foley AR, et al. Sci Transl Med 2017;9.
[25] Kremer LS, Bader DM, Mertes C, Kopajtich R, Pichler G, Iuso A, et al. Nat Commun 2017;8:15824.
[26] Gonorazky HD, Naumenko S, Ramani AK, Nelakuditi V, Mashouri P, Wang P, et al. Am J Hum Genet 2019;104:1007.
[27] McKean DM, Homsy J, Wakimoto H, Patel N, Gorham J, DePalma SR, et al. Nat Commun 2016;7:12824.
[28] Tazi J, Bakkour N, Stamm S. Biochim Biophys Acta 2009;1792:14-26.
[29] Singh RK, Cooper TA. Trends Mol Med 2012;18:472-82.
[30] Weatherall DJ, Clegg JB. The Thalassaemia Syndromes. John Wiley & Sons; 2008.
[31] Busslinger M, Moschonas N, Flavell RA. Cell 1981;27:289-98.
[32] Treisman R, Proudfoot NJ, Shander M, Maniatis T. Cell 1982;29:903-11.
[33] GTEx Consortium. Science 2015;348:648-60.
[34] López-Bigas N, Audit B, Ouzounis C, Parra G, Guigó R. FEBS Lett 2005;579:1900-3.
[35] Pagani F, Raponi M, Baralle FE. Proc Natl Acad Sci U S A 2005;102:6368-72.
[36] Frésard L, Smail C, Ferraro NM, Teran NA, Li X, Smith KS, et al. Nat Med 2019;25:911-19.
[37] Coban-Akdemir Z, White JJ, Song X, Jhangiani SN, Fatih JM, Gambin T, et al. Am J Hum Genet 2018;103:171-87.
[38] Bovolenta M, Neri M, Martoni E, Urciuolo A, Sabatelli P, Fabris M, et al. BMC Med Genet 2010;11:44.
[39] Kiiski K, Lehtokari V-L, Löytynoja A, Ahlstén L, Laitila J, Wallgren-Pettersson C, et al. Eur J Hum Genet 2016;24:574-80.
[40] Bang ML, Centner T, Fornoff F, Geach AJ, Gotthardt M, McNabb M, et al. Circ Res 2001;89:1065-72.
[41] Lappalainen T, Sammeth M, Friedländer MR, ’t Hoen PAC, Monlong J, Rivas MA, et al. Nature 2013;501:506-11.
[42] Lyon MF. Nature 1986;320:313.
[43] Lee JT, Bartolomei MS. Cell 2013;152:1308-23.
[44] Peters J. Nat Rev Genet 2014;15:517-30.
[45] Fu X, Cheng Y, Yuan J, Huang C, Cheng H, Zhou R. Hum Genet 2015;134:147-58.
[46] Chotalia M, Smallwood SA, Ruf N, Dawson C, Lucifero D, Frontera M, et al. Genes Dev 2009;23:105-17.
[47] Eckersley-Maslin MA, Spector DL. Trends Genet 2014;30:237-44.
[48] Eckersley-Maslin MA, Thybert D, Bergmann JH, Marioni JC, Flicek P, Spector DL. Dev Cell 2014;28:351-65.
[49] Montgomery SB, Lappalainen T, Gutierrez-Arcelus M, Dermitzakis ET. PLoS Genet 2011;7:e1002144.
[50] Zhao J, Akinsanmi I, Arafat D, Cradick TJ, Lee CM, Banskota S, et al. Am J Hum Genet 2016;98:299-309.
[51] Li X, Kim Y, Tsang EK, Davis JR, Damani FN, Chiang C, et al. Nature 2017;550:239-43.
[52] Mohammadi P, Castel SE, Cummings BB, Einson J, Sousa C, Hoffman P, et al. Science 2019;366:351-6.
[53] Byron SA, Van Keuren-Jensen KR, Engelthaler DM, Carpten JD, Craig DW. Nat Rev Genet 2016;17:257-71.
[54] Rentas S, Rathi KS, Kaur M, Raman P, Krantz ID, Sarmady M, et al. Genet Med 2020;22:927-36.
[55] Vieira NM, Elvers I, Alexander MS, Moreira YB, Eran A, Gomes JP, et al. Cell 2015;163:1204-13.
[56] Kornegay JN, Tuler SM, Miller DM, Levesque DC. Muscle Nerve 1988;11:1056-64.
[57] Wright GEB, Caron NS, Ng B, Casal L, Casazza W, Xu X, et al. Hum Mol Genet 2020;29:2788-802.
[58] Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, et al. Nat Methods 2009;6:377-82.
[59] Deng Q, Ramsköld D, Reinius B, Sandberg R. Science 2014;343:193-6.
[60] Shalek AK, Satija R, Adiconis X, Gertner RS, Gaublomme JT, Raychowdhury R, et al. Nature 2013;498:236-40.
[61] Wagner A, Regev A, Yosef N. Nat Biotechnol 2016;34:1145-60.
[62] Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, et al. Nat Methods 2013;10:1093-5.
[63] Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, et al. eLife 2017;6.
[64] Mathys H, Davila-Velderrain J, Peng Z, Gao F, Mohammadi S, Young JZ, et al. Nature 2019;570:332-7.
[65] Segerstolpe Å, Palasantza A, Eliasson P, Andersson E-M, Andréasson A-C, Sun X, et al. Cell Metab 2016;24:593-607.
[66] Montoro DT, Haber AL, Biton M, Vinarsky V, Lin B, Birket SE, et al. Nature 2018;560:319-24.
[67] Evans C, Hardin J, Stoebel DM. Brief Bioinform 2018;19:776-92.
[68] Zhang Y, Chen K, Sloan SA, Bennett ML, Scholze AR, O’Keeffe S, et al. J Neurosci 2014;34:11929-47.
[69] Melé M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, et al. Science 2015;348:660-5.
[70] Aicher JK, Jewell P, Vaquero-Garcia J, Barash Y, Bhoj EJ. Genet Med 2020;22:1181-90.
[71] Gibson G. Science 2015;348:640-1.

CHAPTER 12

Other omics approaches to the study of rare diseases

Giusy Della Gatta

Regeneron Genetics Center, Regeneron Pharmaceuticals, Tarrytown, NY, United States

12.1 Introduction

The previous chapters explained and illustrated how genomics and transcriptomics have transformed the diagnostic landscape of rare diseases and, in some cases, helped identify disease-related molecular mechanisms. However, as with any omics methodology, these approaches have limitations. More specifically, whole-exome sequencing (WES) and whole-genome sequencing (WGS) face challenges in variant calling and prioritization, and in distinguishing true pathogenic variants contributing to disease from the plethora of benign and polymorphic variants in the genome [1,2]. A recent review on the utility of genomics for diagnostic purposes showed that this approach typically reaches a diagnosis in 35%-50% of cases with suspected genetic disorders [3]. In addition, a number of studies have shown that, in some cases, transcript levels do not fully recapitulate protein expression levels [4,5]. There is therefore a need to explore and develop alternative omics methods that can be integrated with genomics and transcriptomics to better assess disease-associated variation. In this context, the rapid advancement of epigenomics, proteomics, and metabolomics methods is creating a new definition of personalized medicine. Rare diseases might particularly benefit from the use of these approaches, singly or in combination, to (1) improve disease diagnosis; (2) identify new prognostic markers; (3) identify novel treatments and drug repurposing opportunities; and (4) study disease pathophysiology.

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00008-9 © 2021 Elsevier Inc. All rights reserved.

12.2 Epigenomics

12.2.1 Definition of epigenetics and epigenomics

The word “epigenetics” can be literally translated as “on top of or beyond genetics,” referring to the existence of an additional DNA code modulating gene activity and expression beyond the traditional genetic code. In the early 2000s, the Human Genome Project (HGP) revealed that only a small portion of the entire human genome (approximately 1%-2%) contains protein-coding sequences (see Chapter 4: Genomic Sequencing of Rare Diseases). This surprising discovery opened a tumultuous debate in the scientific community on the function of the remaining 98%-99% of the genome packed in the human chromosomes [6-9]. Additional studies revealed that, although identical copies of the genomic material are present in the roughly 200 different cell types of the human body, each cell type is specialized to express a unique set of genes in order to carry out its particular functions. Both these observations prompted scientists to hypothesize the existence of the “epigenome,” defined as a sequence-independent genetic code regulating gene expression [10-12]. Twenty years on, the scientific community has acquired a better understanding of the multidimensional organization of the epigenome, whose ultimate function is to inhibit or facilitate transcription factor (TF) binding to DNA and dynamically mediate cell-specific gene expression. The epigenome can modulate gene expression through (1) chemical modifications of the DNA itself (i.e., DNA methylation), (2) chromatin remodeling through nucleosome positioning and histone modification (e.g., acetylation, methylation, phosphorylation, and ubiquitination of the histone tails), and (3) noncoding RNA-associated mechanisms [13-17] (Fig. 12.1). Multiple studies have demonstrated specific patterns of chromatin marks and rearrangements associated with cells at different developmental stages, in response to stimuli, or in disease states, highlighting the fact that epigenetic modifications are shaped by environmental exposures, lifestyle, genetic factors, age, and disease [18,19]. Additionally, the epigenome has been shown to play a role in a variety of nuclear processes such as DNA replication and damage response, chromatin folding and packaging, and cell-type-specific gene expression patterns. The establishment of several epigenetic consortia defined a framework to connect researchers and share protocols and resources in the field of epigenomics [13,20-22].
Among other things, these initiatives have contributed to the advancement of experimental techniques to process DNA and the application of sequencing technologies, which translated into a significant step toward understanding the function of the epigenome and its changes in structure and composition in health and disease.

12.2.2 DNA methylation as the most prominent epigenetic mechanism

DNA methylation represents a “sophisticated molecular mechanism for annotating genetic information,” and studying how this annotation works might be essential to understand why and how certain genes are dysregulated in a disease state [23]. DNA methylation was the first epigenetic modification to be identified and remains the best characterized [24]. Chemically speaking, DNA methylation consists of the addition of a methyl group to the fifth carbon of the cytosine of a CpG dinucleotide to form 5-methylcytosine (5mC) (Fig. 12.1). The addition and removal of this covalent modification are regulated by “writer” enzymes that perform de novo methylation (DNMT3A and DNMT3B) and by enzymes that remove methyl groups (the ten-eleven translocation family of dioxygenases: TET1, TET2, and TET3). At the same time, the UHRF1/DNMT1 tandem protein complex is responsible for maintaining DNA methylation patterns through DNA replication. Finally, “reader” proteins such as MeCP2 bind to methylated DNA and mediate the transcriptional repression of marked genes. CpG dinucleotides are found at about 1% frequency in the human genome, in both coding and noncoding regions. Moreover, CpG dinucleotides tend to cluster in genomic regions with a high frequency of CpG sites (CpG islands) [9,24,25]. Approximately 70% of CpG islands are found in gene promoter regions and gene bodies [26-28]. These CpG islands are usually not methylated and are frequently marked by trimethylation of lysine 4 of histone H3 (H3K4me3), which indicates a transcriptionally active state of the chromatin [29-31]. Interestingly, genome-wide studies pointed out that some intragenic “orphan” CpG islands might be not-yet-defined alternative promoters [30,32], whereas intergenic CpG islands might represent transcription start sites for long noncoding RNA molecules [15].

FIGURE 12.1 Epigenome overview and epigenomics technologies. Panels and listed technologies:
• Nucleosome positioning and chromatin accessibility (condensed versus open chromatin): DNase-Seq, ATAC-Seq, MNase-Seq, NOMe-Seq.
• DNA methylation: Bisulfite-Seq, TAB-Seq, MeDIP/MDBCap-Seq, Nanopore-Seq.
• Nucleosome and histone modification (histones, histone tails): ChIP-Seq, N-ChIP and ULI N-ChIP, MOW-ChIP, CUT&RUN and ULI CUT&RUN; single cell: Drop-ChIP, sc-itChIP-Seq, CUT&Tag, ACT-Seq, and CoBATCH.

The establishment of cell-specific DNA methylation patterns occurs very early in human embryonic development through mechanisms such as imprinting. More precisely, germline DNA methylation is erased in the blastocyst, and de novo methylation occurs at the time of implantation following a bimodal pattern, which methylates the entire genome except for a group of promoter sequences (CpG islands) that is specifically protected from this event and remains unmethylated. Finally, tissue-specific methylation occurs during postimplantation development. Thus, the epigenetic make-up represents the scaffold of the basic gene expression pattern for the whole organism. Additionally, early studies found CpG methylation mainly associated with a repressed state during retrotransposon silencing and X chromosome inactivation [33]. Over time, refinement of experimental platforms for high-throughput epigenetic analysis helped deepen our understanding of how nuanced and dynamic DNA methylation is. Lister et al. showed that 5mC can also occur in non-CpG sites, and that this mark can be added or removed at the promoters or gene bodies of actively transcribed genes. This led to the hypothesis that active and repressive methylation states are both needed for a tailored and context-specific regulation of gene expression [27]. Over the past 10 years, several studies demonstrated that, in addition to 5mC, TET1, TET2, and TET3 are able to generate additional epigenetic marks whose abundance is extremely variable throughout the genome. We now know that, together with 5mC, 5-hydroxymethylcytosine (5hmC) is the most common epigenetic mark [34,35]. In addition, TET enzymes can further oxidize 5hmC into 5-formylcytosine and 5-carboxylcytosine, which exist at much lower levels in the mammalian genome than 5mC and 5hmC [36].
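The description of CpG islands as regions with a high frequency of CpG sites can be made concrete with a minimal sketch. The thresholds used here (length of at least 200 bp, GC content of at least 50%, and an observed/expected CpG ratio of at least 0.6) are a commonly used heuristic and are an assumption of this example, not criteria stated in this chapter; the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    observed = seq.count("CG")   # "CG" cannot overlap itself, so count() is exact
    expected = (c * g) / n       # CpG count expected from base composition alone
    gc_content = (c + g) / n
    obs_exp = observed / expected if expected else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed heuristic thresholds to a candidate region."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and obs_exp >= min_obs_exp


# Toy sequences (hypothetical): a CpG-dense repeat and an AT-only repeat.
dense = "CG" * 100
sparse = "AT" * 100
```

In practice, such a test would be applied in sliding windows along a chromosome; the sketch only scores a single candidate region.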

12.3 Landscape of epigenomic technologies

12.3.1 DNA methylation

Several methods have enriched the toolkit for analysis of DNA methylation; they can essentially be categorized as (1) bisulfite and oxidative bisulfite conversion, (2) affinity enrichment-based methods, and (3) third-generation sequencing. Currently, the gold standard for quantitative genome-wide DNA methylation analysis is bisulfite sequencing [27,37]. Briefly, genomic DNA is treated with sodium bisulfite, which chemically converts unmethylated cytosine to uracil while leaving 5mC and/or 5hmC intact. During PCR amplification and sequencing, all the unmethylated cytosines from the bisulfite-treated DNA fragments are read as thymine, and the remaining cytosines therefore correspond to methylated bases (5mC/5hmC). Although bisulfite sequencing provides DNA methylation analysis at single-base resolution, it has major technical drawbacks, including a harsh chemical treatment that degrades most of the DNA sample, resulting in poor sequencing quality and uneven genome coverage, as well as an inability to distinguish between 5mC and 5hmC, because both are resistant to bisulfite conversion. Over the past 10 years, several methods have been developed as alternatives to standard bisulfite sequencing, such as TET-assisted bisulfite sequencing [38] and oxidative bisulfite sequencing [39], both of which enable discrimination of 5mC and 5hmC at single-base resolution. Moreover, additional bisulfite-free protocols have been developed with the aim of preserving the integrity of the DNA and achieving a more even sequencing coverage [40-42]. Methylated DNA immunoprecipitation (IP) sequencing and methyl-binding domain capture sequencing are affinity enrichment methods used to pull down methylated fragments from fragmented genomic DNA. The unmethylated genome fragments are eliminated by stringent washes, and the remaining methylated DNA is then sequenced.
Even though these methods have low base resolution and depend heavily on the quality of the starting DNA sample and the specificity of the reagents used for the IP step, they are mainly used to study the methylation status of promoters and CpG islands [43,44]. Finally, nanopore sequencing, which requires no chemical or enzymatic treatment, is one of the most promising third-generation sequencing approaches currently being developed (see Chapter 4: Genomic Sequencing of Rare Diseases). This technology is based on the principle that each nucleobase induces a different ionic current when passing through a nanopore [45]. In the context of DNA methylation, this technique was first applied to small viral and microbial genomes [46,47]. Further developments showed that nanopore sequencing can now successfully be applied to map 5mC at approximately 80% accuracy in the human genome [48,49] and that it can be used to study disorders of imprinting [50].
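The logic of bisulfite sequencing described above can be sketched in a few lines. This is a toy, single-strand model under simplifying assumptions (complete conversion, no sequencing errors, reads already aligned to the reference); the function names are hypothetical. As in the real assay, a retained cytosine only indicates a protected base and cannot distinguish 5mC from 5hmC.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on one strand.

    Unmethylated cytosines are read as T; cytosines at the given 0-based
    positions are protected and stay C.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, read):
    """Compare an aligned, converted read with the reference.

    Returns {position: True/False} for every reference cytosine:
    True if the read still shows C (methylated), False if it shows T.
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True
        elif read_base == "T":
            calls[i] = False
    return calls


# Toy example (hypothetical): only the cytosine at position 1 is methylated.
reference = "ACGTCCGA"
read = bisulfite_convert(reference, {1})
calls = call_methylation(reference, read)
```

Real pipelines must also handle the reverse strand, where conversion appears as G-to-A changes, and incomplete conversion; both are deliberately left out of this sketch.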


12.3.2 Modification of chromatin states

Chromosomal DNA is condensed into nucleosomes, the basic repeating structural and functional units of chromatin. Each nucleosome consists of approximately 146 base pairs of DNA wrapped around an octamer of histone proteins comprising two copies each of the H2A, H2B, H3, and H4 subunits [51]. N-terminal histone tails undergo several posttranslational modifications (PTMs), including methylation, acetylation, phosphorylation, and ubiquitination, as well as more recently described ones such as crotonylation, succinylation, and malonylation [52-54] (Fig. 12.1). The concerted action of these PTMs affects DNA accessibility and, as a consequence, regulates the activation and silencing of gene expression. Histone acetylation and methylation are two of the best-characterized histone modifications. While acetylation is typically associated with a state of open chromatin and transcriptional activation (e.g., in the case of H3K27 acetylation [55]), histone methylation shows a more nuanced pattern of modulation. Lysine residues can acquire different functions based on their mono-, di-, or trimethylated status (e.g., H3K4me1, H3K4me3, H3K9me3, H3K36, and H3K79). Several groups observed that methylation can have both activating and repressive transcriptional functions. H3K4me3 (read as trimethylation (me3) of lysine residue 4 (K4) of histone H3) is classically associated with open chromatin and transcriptional activation [52], while H3K9me3 is found in condensed, closed chromatin and hence functions as a transcriptional repressor [56]. In a fascinating study on human early development, Xia et al. highlighted the existence of a bivalent active and repressive chromatin status in embryonic stem cells (ES cells), raising the possibility of an “intermediate” chromatin structure needed for the correct maternal-to-zygotic transcriptional transition [57].
Other reports identified the bivalent chromatin status in cells other than ES cells, hypothesizing that some regulatory regions can act as promoters in some tissues and as enhancers in others (cREDS: cis-regulatory elements with dynamic signatures) [22,58]. Currently, the most widely used high-throughput technique to study genome-wide histone modifications is chromatin immunoprecipitation followed by sequencing of the enriched DNA fragments (ChIP-seq). This technique consists of three specific steps, which can be summarized as (1) crosslinking of the DNA-protein complexes with formaldehyde; (2) sonication of DNA into small fragments of 200-600 bp; and (3) IP of the DNA-protein complexes of interest with an antibody specific to the target (e.g., a modified histone or a TF). After de-crosslinking, the DNA is purified and subjected to end repair, adapter ligation, and library preparation, followed by next-generation sequencing (NGS). The resulting sequence reads are then computationally aligned to the reference genome sequence to identify the genomic regions occupied by the protein(s) of interest. ChIP-seq has been widely used to dissect the epigenetic code of the different histone modifications and the genomic occupancy of TFs, as well as to identify new proteins binding to chromatin [59-62].
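The naming convention for histone marks explained above (H3K4me3 read as trimethylation of lysine 4 of histone H3) can be illustrated with a small parser. This is a minimal sketch covering only a few residues and modifications; the abbreviations "ph" and "ub" and the function name `describe_mark` are assumptions of this example rather than notation used in the chapter.

```python
import re

# Lookup tables for the (assumed) subset of the nomenclature handled here.
AMINO_ACIDS = {"K": "lysine", "R": "arginine", "S": "serine", "T": "threonine"}
MODIFICATIONS = {"ac": "acetylation", "ph": "phosphorylation", "ub": "ubiquitination"}
METHYL_DEGREE = {"1": "mono", "2": "di", "3": "tri"}

# histone subunit, residue letter, residue position, modification
_MARK = re.compile(r"^(H2A|H2B|H3|H4)([KRST])(\d+)(me[123]|ac|ph|ub)$")


def describe_mark(mark):
    """Spell out a histone mark such as 'H3K4me3' in words."""
    match = _MARK.match(mark)
    if match is None:
        raise ValueError(f"unrecognized histone mark: {mark!r}")
    histone, residue, position, mod = match.groups()
    if mod.startswith("me"):
        modification = METHYL_DEGREE[mod[2]] + "methylation"
    else:
        modification = MODIFICATIONS[mod]
    return f"{modification} of {AMINO_ACIDS[residue]} {position} of histone {histone}"
```

For instance, `describe_mark("H3K9me3")` spells out the repressive mark discussed above, whereas a bare residue such as "H3K36" is rejected because it names no modification.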

12.3.3 Alternative methods to dissect chromatin modifications

Although ChIP-seq is one of the most mature techniques for high-throughput epigenomic analyses, its implementation still suffers from limitations due to (1) potential sequencing bias introduced by PCR amplification, (2) length limitations of the isolated amplified fragments, (3) the need for a large quantity of starting material (10^5-10^7 cells), which makes it difficult to use this technique to study epigenomic events in rare cell populations or single cells, (4) lack of specificity of the antibody used for IP, and (5) potential for epitope masking due to formaldehyde crosslinking [63,64]. To circumvent the issue of epitope masking, micrococcal nuclease (MNase) digestion has been proposed as an alternative method of DNA fragmentation (also called native ChIP; N-ChIP) that does not require formaldehyde crosslinking [65,66]. Although MNase treatment is quick and can preserve the native chromatin structure, this approach is mainly used for histone analysis, as some weaker DNA-protein interactions, such as those between DNA and TFs, might be lost in the absence of a crosslinking agent. To tackle the high number of cells required by standard ChIP-seq, alternative methods have been developed to identify the DNA binding sites of histone modifications and TFs in small populations of cells. In this context, ChIPmentation combines chromatin IP with sequencing library preparation by Tn5 transposase in a single-step reaction, allowing histone ChIP-seq to be performed with only 10,000 cells [67]. Ultralow-input MNase-based native ChIP (ULI N-ChIP) is an improved N-ChIP method, which relies on collecting the fluorescence-activated cell sorting (FACS)-sorted cell population directly in lysis buffer, significantly shortening the ChIP protocol and utilizing as little as 106 cells [68]. Finally, microfluidic oscillatory washing-based ChIP-seq (MOW-ChIP) is able to analyze histone modifications from 100 cells [69]. While most of these technologies can be used effectively to profile histone modifications, they are less effective for studying TFs, which require profiling of a much higher number of cells because of their smaller number of binding sites across the genome. In this context, the “cleavage under targets and release using nuclease” (CUT&RUN) technique was recently developed for reproducible TF analysis in a small number of cells (100-1000 cells). Similar to the methods described above, CUT&RUN does not require formaldehyde fixation or sonication [70].
Unfixed nuclei are bound to lectin-coated magnetic beads, then permeabilized and incubated with antibodies and protein A-MNase. Once digested, the DNA-histone/TF complexes are free to diffuse into the surrounding medium, while the undigested chromatin remains largely insoluble and is eliminated by centrifugation. The enriched supernatant is then used to extract DNA directly for sequencing. This procedure gives the CUT&RUN approach a much higher signal-to-noise ratio and a more even sequencing coverage than other approaches [70,71]. Ultralow-input CUT&RUN (uliCUT&RUN) is a reworking of the original CUT&RUN protocol [72]. Comparative analysis of the binding peaks of three different TFs (SOX2, NANOG, and OCT4) obtained by standard ChIP-based approaches and by uliCUT&RUN analysis of 50 cells showed medium to high percent overlap in the binding maps obtained, suggesting that uliCUT&RUN might be a promising approach to map specific TFs in smaller sample sets.
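The percent overlap between two binding maps, as used in the comparison above, can be computed with a minimal sketch. It assumes peaks are given as (chromosome, start, end) tuples with half-open coordinates and counts a peak as shared if it overlaps any peak in the other set by at least one base; the function name and the toy coordinates are hypothetical and do not reproduce the published analysis.

```python
def percent_overlap(peaks_a, peaks_b):
    """Percentage of peaks in peaks_a that overlap at least one peak in peaks_b.

    Each peak is a (chrom, start, end) tuple with half-open coordinates.
    """
    if not peaks_a:
        return 0.0
    # Group the second peak set by chromosome to limit comparisons.
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = 0
    for chrom, start, end in peaks_a:
        candidates = by_chrom.get(chrom, [])
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < b_end and b_start < end for b_start, b_end in candidates):
            hits += 1
    return 100.0 * hits / len(peaks_a)


# Toy peak sets (hypothetical coordinates).
chip_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150), ("chr3", 10, 20)]
uli_peaks = [("chr1", 150, 250), ("chr2", 140, 160), ("chr1", 600, 700)]
shared = percent_overlap(chip_peaks, uli_peaks)
```

Note that the measure is asymmetric: the fraction of ChIP peaks recovered by the low-input method generally differs from the fraction of low-input peaks supported by ChIP, so both directions are worth reporting.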

12.3.4 Single-cell ChIP-seq

Abbreviated as scChIP-seq, this technique allows the study of histone modifications and TF occupancy at the single-cell level. It allows cells to be clustered based on their chromatin status and chromatin-specific features acquired in a specific cell state, such as during development or in response to a specific stimulus or insult, to be identified. The first attempt to perform ChIP in single cells was published in 2015 by Rotem et al., who optimized droplet-based single-cell ChIP-seq (Drop-ChIP) by combining microfluidics technology with single-cell barcoding [73]. In Drop-ChIP, each cell is encapsulated in a single droplet and incubated with lysis buffer and MNase to allow chromatin fragmentation and subsequent barcoding. Fragmented, barcoded chromatin undergoes IP followed by amplification and sequencing. The low sequence coverage obtained by this approach (1000 uniquely mapped reads for H3K4me3 ChIP per cell) did not allow epigenetic characterization at single-cell resolution, but rather the identification of specific cellular subpopulations characterized by the same H3K4me3 epigenetic profile. Nevertheless, Rotem’s experiment opened a new avenue for the development of alternative single-cell protocols. Single-cell simultaneous indexing and tagmentation-based ChIP-seq (sc-itChIP-seq) represents a cheaper and more feasible alternative to Drop-ChIP. This technology is based on the Tn5 transposase, which is widely used in various NGS assays, including ChIP-seq. Briefly, fixed single cells are sorted by FACS and isolated in multiwell plates, where an “all-in-one” strategy is applied, using a combination of T5 and T7 barcodes, to perform cellular indexing and chromatin tagmentation in the same tube. Indexed chromatin is then processed for IP and library amplification using low-input ChIP protocols [74]. Finally, several “ChIP-free” methods have recently been developed with the objective of analyzing the genomic distribution of chromatin-binding proteins in single cells while bypassing the technical biases of the IP step. CUT&Tag [75], ACT-seq [76], and CoBATCH [77] are all based on the same principle of fusing a Tn5 transposase to protein A to perform binding, indexing, and tagmentation directly in the same tube, followed by amplification and NGS. Performing all the reactions in the same tube drastically reduces the experimental procedure time, ensuring a faster readout and a significantly increased signal-to-noise ratio; however, these methods cannot be applied to samples composed of fewer than 100 cells (i.e., rare cell populations or preimplantation embryos).

12.4 Dissecting chromatin structures

12.4.1 Profiling nucleosome positioning and chromatin accessibility

Over the past few years, it has become clear that the study of chromatin 3D structure is key to understanding how cells that share the same genetic material acquire and maintain different cellular identities [78–81]. Regulation of gene expression consists of a series of events whereby TFs compete with histones and other chromatin-binding proteins to modify the status of nucleosome occupancy and, ultimately, access to DNA to initiate transcription. Along these lines, histones are usually densely located in regions of closed and inactive chromatin (Fig. 12.1). Conversely, histones are depleted in transcriptionally active chromatin regions [82,83]. This setup allows TFs to bind and organize chromatin in a more permissive conformation such that the whole transcriptional apparatus can access DNA and be assembled [84–89]. Chromatin accessibility is a dynamic process that can change not only in response to physiological stimuli and during development but also in relation to a defined disease state [90–95]. The advancements in NGS and microfluidics technologies have allowed the development of cutting-edge methods to study nucleosome positioning and chromatin accessibility in very small cell populations and at a single-cell level. Moreover, the drastic reduction in the amount of starting material required is broadening the application of such methods to clinical samples that would otherwise not be suitable for these analyses. Chromatin accessibility can essentially be studied by assessing susceptibility to enzymatic methylation or enzymatic cleavage of the accessible DNA sites.

• DNase I hypersensitive site sequencing (DNase-seq): After isolation, nuclei undergo endonuclease DNase I digestion to cut accessible and nucleosome-depleted chromatin regions,
followed by NGS. This methodology allows the genome-wide identification of DNase I hypersensitive sites corresponding to open chromatin, generally associated with DNA regulatory regions.

• Assay for transposase-accessible chromatin using sequencing (ATAC-seq): This method takes advantage of the transposition mechanism to cut and insert ad hoc sequences in the target DNA fragment. In ATAC-seq, a hyperactive Tn5 transposase is first assembled in vitro with custom adapters for efficient library preparation of the digested DNA fragments [96,97]. After nuclei isolation, the chromatin is fragmented by the engineered Tn5 transposase and tagged with sequencing adapters in the same reaction [98]. Although the transposition process might be affected by sequence-dependent bias, ATAC-seq chromatin accessibility results are very similar to those from DNase-seq experiments [98–100]. Additionally, ATAC-seq offers several other advantages, such as the generation of complex libraries with as few as 500 cells and a fast and straightforward protocol [98,99], making it one of the most widely used methods to study chromatin accessibility. Optimizations and variations of ATAC-seq that broaden the application of this technology have been developed [99,101–103].

• Micrococcal nuclease sequencing (MNase-seq): Micrococcal nuclease digestion of chromatin followed by NGS (MNase-seq) is the most common method used to study genome-wide nucleosome positioning. This technique takes advantage of the MNase enzyme, which has both endonuclease and exonuclease activity and therefore cleaves and eliminates linker DNA. This unique property allows isolation and sequencing of nucleosomal DNA [104,105].

• Nucleosome occupancy and methylome sequencing (NOMe-seq): This method represents an alternative to MNase-seq and uses DNA methyltransferase accessibility to footprint nucleosome positions. As opposed to a normal methylation pattern, whereby cytosines are methylated in the CpG context (5mC or 5hmC), the DNA methyltransferase (M.CviPI) methylates cytosine only in the GpC context. In this case, nucleosome positioning is identified by “methylation exclusion” since nucleosomal nucleotides are protected from methylation. Hence, this analysis can reveal four different chromatin states: CpG unmethylated and nucleosome-depleted, CpG unmethylated and nucleosome-occupied, CpG methylated and nucleosome-occupied, and CpG methylated and nucleosome-depleted. Finally, an important advantage of NOMe-seq is that both nucleosome and DNA methylation analysis are performed in the same cell, which generates a more quantitative view of the chromatin accessibility landscape compared to other methods [106].
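The four chromatin states listed for NOMe-seq follow from combining two independent readouts at a locus: endogenous CpG methylation and M.CviPI-induced GpC methylation. The following is a minimal sketch of that logic only; the function name and the boolean inputs are illustrative assumptions, not part of any published NOMe-seq pipeline.

```python
def chromatin_state(cpg_methylated: bool, gpc_methylated: bool) -> str:
    """Combine the two NOMe-seq readouts for one locus (illustrative only)."""
    methylation = "CpG methylated" if cpg_methylated else "CpG unmethylated"
    # GpC methylation means M.CviPI could reach the DNA, i.e. no nucleosome
    # protected it ("methylation exclusion" marks occupied positions).
    occupancy = "nucleosome-depleted" if gpc_methylated else "nucleosome-occupied"
    return f"{methylation} and {occupancy}"


# Enumerate all combinations of the two readouts.
all_states = {
    chromatin_state(cpg, gpc) for cpg in (False, True) for gpc in (False, True)
}
```

In real data both readouts are fractional methylation levels per site, so a threshold would have to be chosen before a call like this could be made.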

12.4.2 Evaluating higher-order chromatin architecture

The cell nucleus contains roughly 2 meters of unwound nuclear DNA, which is ~200,000 times the average diameter of the nucleus itself. Hence, condensing a DNA molecule inside its own nucleus poses significant organizational challenges. Like histone marks and chromatin accessibility, the spatial conformation of the chromosome affects a number of biological processes including gene expression, transcriptional regulation, DNA repair, and DNA replication [107–110]. Historically, studies of chromosome conformation relied on fluorescence microscopy imaging. A paradigm shift occurred with the introduction of Chromosome Conformation Capture (3C) [111,112], a high-throughput biochemical method to map interaction frequencies between genomic loci in cell populations. Briefly, chromatin is crosslinked inside cells with formaldehyde and then cut with restriction enzymes recognizing six or more base pairs. In the following step, the
sticky ends of the crosslinked DNA are re-ligated under very dilute conditions to promote joining of ends in close proximity. The ligated fragments are then de-crosslinked and purified. Quantitative PCR (qPCR) is performed to provide an estimate of the ligation frequency and, hence, of the interaction frequency between two non-neighboring genomic sites [107,111,113,114]. Over the past 10 years, additional techniques based on 3C but differing in the selection and number of loci tested have been developed [115–117], including 4C [118], 5C [119], Hi-C [120], TCC [121], ChIA-PET [122,123], and, more recently, single-cell Hi-C [124]. Hi-C is the most widely used alternative to 3C. This technique pairs the use of NGS with a slightly modified 3C protocol. More specifically, before the ligation step the enzyme-digested ends are filled in with biotin-labeled nucleotides. This additional step allows enrichment of the labeled restriction fragments for downstream sequencing analysis. Sequencing data are used to map the interactions for all sequenced fragments across the entire genome, including intra- and inter-chromosomal interactions. Several studies have shown how Hi-C can be used to study genome folding, gene regulation, and enhancer/promoter interactions [125–127]. Two technical limitations of Hi-C are the restriction enzyme cutting frequency and the sequencing depth, both of which can limit the resolution at which genomic architecture is assayed. Despite these limitations, this approach is the most widely used methodology to study chromosome topology. The refinement of Hi-C through deeper sequencing and improved data analysis tools will be essential to study how the spatial conformation of chromosomes can affect the expression of genes in health and disease.
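To illustrate why sequencing depth limits the resolution of a Hi-C map, the following minimal sketch counts ligation read pairs per pair of genomic bins. It assumes the read pairs have already been mapped; the input format, the function name, and the toy coordinates are hypothetical and do not correspond to any specific Hi-C software.

```python
from collections import Counter


def contact_counts(read_pairs, bin_size):
    """Count ligation read pairs per pair of genomic bins.

    read_pairs: iterable of ((chrom1, pos1), (chrom2, pos2)) tuples
                (hypothetical input format for mapped read pairs).
    bin_size:   bin width in base pairs; this sets the map resolution.
    """
    counts = Counter()
    for (chrom1, pos1), (chrom2, pos2) in read_pairs:
        bin1 = (chrom1, pos1 // bin_size)
        bin2 = (chrom2, pos2 // bin_size)
        # Contacts are symmetric, so store each bin pair in sorted order.
        counts[tuple(sorted((bin1, bin2)))] += 1
    return counts


# Toy read pairs (invented coordinates, for illustration only).
pairs = [
    (("chr1", 10_500), ("chr1", 250_200)),
    (("chr1", 260_000), ("chr1", 12_000)),  # same two bins, reversed order
    (("chr1", 15_000), ("chr2", 40_000)),   # inter-chromosomal contact
]
counts = contact_counts(pairs, bin_size=100_000)
```

Halving `bin_size` roughly quadruples the number of bin pairs over which the same reads are spread, which is one way to see why finer maps require deeper sequencing.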

12.5 Epigenomic studies in rare diseases

Some of the new epigenomics methods described above have improved our understanding of the role of 3D chromatin structure in the function of the (epi)genome and, consequently, highlighted its relevance in the regulation of gene expression. However, much remains to be investigated to understand its role in health and disease. More mature methods to explore DNA methylation have been extensively used to study a wide range of diseases including cancer [128,129], autoimmune diseases [130,131], and metabolic [132–134] and neurological disorders [135–138]. Deregulation of the epigenome in human disease can be caused by two main mechanisms observed to date, namely (1) a coding mutation or pathogenic variant in a gene that is part of the epigenetic modification machinery; or (2) disruption of DNA methylation during genomic imprinting [139]. As discussed in Chapter 2, Karyotyping as the First Genomic Approach, genomic imprinting is determined by a complex epigenetic process that regulates embryogenesis, reproduction, and gametogenesis. In contrast to most genes, imprinted genes are organized in chromosomal gene clusters (imprinted domains) and undergo monoallelic gene expression, meaning that they are only expressed from a single (maternal or paternal) allele [140]. Loss of imprinting results from deletions, mutations, epimutations, or from uniparental disomy (UPD) [141–143]. Rare syndromic disorders such as Prader–Willi (MIM #176270), Angelman (MIM #105830), or Beckwith–Wiedemann (MIM #130650) syndromes are usually caused by genetic aberrations in the expressed allele of an imprinted gene, thus causing a full loss of gene expression [144]. Several genetic disorders have been linked to DNA methylation defects and genetic alteration of histone modifier genes [145]. Given its complex structure and organization, the nervous system is particularly sensitive to epigenetic alterations resulting
in neurodevelopmental and neuropsychiatric disorders [146–148]. Epigenomic studies have helped to better understand the molecular mechanisms of these disorders and, in some cases, provide a tool for early detection and intervention. An example of epigenomics applied to better characterize rare genetic disorders is that of Huntington disease (HD, MIM #143100), an autosomal dominant neurodegenerative disease characterized by a progressive movement disorder, cognitive dysfunction, and psychiatric impairment (see Chapter 6: Dominant and Sporadic De Novo Disorders). HD is the most prevalent polyglutamine (polyQ) disorder, with 5–10 cases per 100,000 individuals. The disease is caused by a variable expansion of CAG repeats coding for glutamine (Q) in exon 1 of the huntingtin (HTT) gene. This expansion produces an abnormally long polyglutamine sequence that leads to HTT fragmentation, resulting in neuronal dysfunction and death [149]. Given the complexity and severity of this disease, HD has been very well studied over the years. However, the molecular mechanisms driving the pathogenesis of this disorder have yet to be fully elucidated and, currently, no disease-modifying treatment is available. The main molecular feature of HD is the progressive deregulation of gene expression in the brain cortex and striatum, including transcriptional downregulation of key neurotransmitters, growth factors, and their receptors [150]. This observation led to the hypothesis that deregulation of the epigenetic machinery might result in the observed coordinated deregulation of multiple genes [151–153]. An epigenomics study of HD, using Reduced Representation Bisulfite Sequencing (RRBS), mRNA-Seq, and ChIP-Seq in cell lines derived from mouse striatal neurons expressing full-length HTT with either the wild-type or an expanded polyglutamine repeat, found an overall change of DNA methylation with a concomitant broad change in gene expression. 
In particular, some genes involved in neurogenesis showed a high degree of methylation and reduced expression, leading to the hypothesis of a possible role of methylation in neurogenesis and cognitive decline in patients with HD [154]. Furthermore, several studies using classic ChIP and high-throughput ChIP-Seq approaches have confirmed an overall decrease of histone acetylation and H3K4 trimethylation, with a corresponding decrease in gene expression, in HD models [152,155–159]. These results prompted studies to test the efficacy of histone deacetylase (HDAC) inhibitors and of targeting H3K4me3 demethylases in HD mouse models, human cell lines, and a fly model. These treatments were able to restore the expression of key downregulated genes in HD disease models by acting directly on the histone modification, supporting a therapeutic value for histone-modifying drugs in HD. Epigenomic studies in postmortem HD brains have further confirmed epigenetic deregulation affecting key processes, such as neuronal development and cellular activity [160,161]. Another epigenomics study observed accelerated epigenetic aging in blood and brain tissue from HD patients, suggesting a possible age-dependent alteration of epigenetic regulation in this disorder [162,163]. A positive correlation between epigenetic modifiers and progression of motor-related HD symptoms has been observed, with HD samples showing the existence of relevant methylation sites in close proximity to the CAG repeat expansion. These results are similar to those observed for the FMR1 and frataxin genes in Fragile X syndrome (MIM #300624) and Friedreich ataxia (MIM #229300), respectively.

12.6 Proteomics

The word proteome was first used in the early 1990s to define the “total protein complement of a genome” [164–166]. After 10 years, the Human Proteome Project was born as a consortium of
international laboratories with the aim of identifying the entire human protein set, elucidating biological and molecular functions, and advancing disease diagnosis and treatment [167]. As a corollary, proteomics can be defined as the scientific field that studies changes in the proteome of an organism at different biological states. Following the natural flow of biological information, after genomics, transcriptomics, and epigenomics, proteomics represents the next step in the study of biological systems. In terms of complexity, the proteome is at least one order of magnitude more complex than the genome and transcriptome. In 2004, Jensen estimated that the human proteome encompasses 1 million different protein species, while the human genome contains approximately 20,000 protein-coding genes [168]. Several factors contribute to such a difference between the number of human genes and the number of proteins synthesized, including the transition from a 4-nucleotide code to a more complex 20-amino-acid code during protein translation (see Chapter 1: Introduction to Concepts of Genetics and Genomics) and the existence of more than 300 PTMs. In addition, there are several other factors to be considered when studying the proteome, including, for example, the fact that in human plasma the estimated range of protein concentrations spans 10^10–10^12, and this can vary by 6–8 orders of magnitude depending on the protein and the cellular context [169]. Proteins rarely function as single molecules; rather, they tend to form homo- or hetero-dimers, -trimers, or -tetramers, and the catalogue of synthesized proteins can differ from cell to cell and over time, in response to physiological, pathophysiological, and environmental conditions. Despite these challenges, remarkable progress has been achieved by the proteomics field in the past two decades. 
Even if we are still far from measuring and understanding the totality of the human proteome, technological progress in this field has enabled significant advances in the identification of important biomarkers and in the understanding of the molecular mechanisms of several disease processes such as cancer [170–173], multiple sclerosis [174], and schizophrenia [175,176].
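As a rough worked figure, using only the two estimates quoted above (about 1 million protein species and about 20,000 protein-coding genes) and treating both as order-of-magnitude values rather than exact counts, the ratio of protein species to genes is:

```latex
\frac{N_{\text{protein species}}}{N_{\text{genes}}}
  \approx \frac{10^{6}}{2 \times 10^{4}} = 50,
\qquad
\log_{10} 50 \approx 1.7
```

On these estimates there are roughly 50 protein species per gene, that is, between one and two orders of magnitude more proteins than genes.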

12.6.1 Protein arrays for biomarker discovery

Protein array technology was born during the early 2000s with the aim of performing multiplexed protein measurements to study protein function, protein–protein interactions, and modeling networks and pathways with the same throughput as DNA arrays. However, the different physical and biochemical properties of proteins versus nucleic acids required the development of specific immobilization supports such as nylon membranes, plastic microwells, planar glass slides, gel-based arrays, and beads in suspension arrays, which could guarantee the integrity of the structure and biological function of each protein identified on the arrays [177–184]. As for protein detection, three main methods have been used: (1) direct labeling of a protein mixture, which allows detection of a fluorescent signal following incubation of the protein mix on an antibody array; (2) the sandwich-immunoassay method, which consists of detection of the protein/antibody binding complex through the use of an additional mixture of fluorescently tagged detector antibodies; and finally, (3) an antigen-capture array, where antigens are synthesized and immobilized onto an array and incubated with the sample of interest to detect antibodies, which are subsequently recognized by a second read-out antibody. Over the past years, protein arrays have been successfully used in a multitude of research fields ranging from cancer to infectious and immune diseases, as well as rare diseases, to significantly advance the identification of new diagnostic markers and to deepen our understanding of disease pathophysiology [185]. In 2002, Robinson et al. described for the first time
the generation of fluorescence-based autoantigen chips containing 196 distinct biomolecules representing major autoantigens targeted by autoantibodies from patients with autoimmune rheumatic diseases [186]. In addition, autoantibody and protein marker arrays have been used to perform large-scale serum profiling of autoimmune diseases such as systemic lupus erythematosus (SLE, MIM #152700) and rheumatoid arthritis (MIM #180300) [187–190]. SLE is a complex autoimmune disease characterized by autoantibodies recognizing self-antigens, which trigger a broad variety of symptoms and phenotypes that can vary in severity and manifestation [191]. Several studies have shown that autoantibodies circulate in the blood of SLE patients years before the clinical onset of SLE and that, over time, there might be an enrichment of specific autoantibodies over others [192,193]. Autoantibody panels currently used for SLE diagnosis lack specificity and need to be used in concert with other clinical criteria for a more accurate diagnosis [194]. In this context, profiling the autoantibody repertoire of SLE patients might be an effective tool to identify new markers and establish a correlation between specific autoantibodies and clinical manifestations over time. In addition to the discovery of new markers, the same protein array technology was used in peripheral blood from SLE patients to fully characterize the expression of known autoantigens previously associated with other autoimmune disorders [189,195]. Moreover, integrative studies with gene expression arrays from peripheral blood showed a deregulation of the interferon (IFN) pathway both in SLE and in incomplete SLE. Interestingly, high expression of IFN-related genes was correlated with high levels of IgG autoantibodies, suggesting that IFN might play a role in IgM-IgG class switching [196]. 
In another example, nucleic acid programmable protein arrays have been used to simultaneously measure 6000 human proteins and identify novel antigens associated with type 1 diabetes [197]. Finally, protein microarrays have also been used to study a wide array of PTMs such as phosphorylation, acetylation, SUMOylation, and glycosylation [197,198]. In recent years, protein array platforms have been improved, reaching an unprecedented high-quality resolution despite being non-mass spectrometry-based approaches. In this regard, SomaLogic offers a multiplexed proteomics discovery platform that is based on aptamer technology and is able to measure the whole secreted proteome, consisting of approximately 3500 proteins, extracted from biofluids. Aptamers are single-stranded nucleic acid molecules that are generated and selected to bind tightly to a specific target protein. These molecules are highly stable and resistant to degradation [199,200]. After synthesis and selection, each aptamer is labeled with photo-cleavable biotin and a fluorescent label and bound to the target protein. This complex is then tagged with biotin and captured by streptavidin-coated magnetic beads. Following this step, the enriched aptamers are eluted and quantified by using a customized DNA microarray [201,202]. Similarly, the Olink proteomics platform allows profiling of up to 1536 proteins simultaneously in a given sample. This platform uses the Proximity Extension Assay (PEA) technology, which relies on pairs of oligonucleotide-labeled antibodies. When both antibodies of a pair bind to the target protein, the oligonucleotides are close enough to generate a PCR product by proximity-dependent DNA polymerization. The PCR product contains a specific barcode for each target protein, and the amount of the amplified DNA barcode is quantified by microfluidic qPCR [203–206].

12.6.2 Bottom-up and top-down proteomics approaches

Bottom-up (shotgun) proteomics (BUP) uses combined methods of liquid chromatography (LC) and tandem mass spectrometry (MS) to identify proteins from complex biological mixtures in a high-throughput fashion [207,208]. The first step for an MS-based proteomics analysis is sample collection, followed by protein extraction and peptide digestion, which generates thousands of peptides that need to be prefractionated or enriched in order to reduce sample complexity. Prefractionation methods are based on physicochemical properties of the peptides and include two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) methods or LC (Fig. 12.2). For 2D-PAGE, solubilized proteins are loaded onto an isoelectric focusing gel, which allows each protein to migrate until it reaches its own isoelectric point, that is, the pH at which the net charge of the protein is zero. Following this, the proteins are loaded into an SDS-PAGE gel to allow separation on an orthogonal second axis, based on the proteins’ molecular weight. At the end of this process, the proteins are resolved along two axes, isoelectric point versus molecular weight, and are ready to be used for MS analysis [209]. Alternative enrichment methods rely on the use of affinity-based columns or IP methods able to enrich for specific peptide modifications such as phosphorylation, glycosylation, or acetylation [210]. The enriched samples undergo a second round of separation achieved by LC methods, such as ultrahigh-performance liquid chromatography (UHPLC), to further reduce protein complexity before proceeding with MS analysis [211–213] (Fig. 12.2).

FIGURE 12.2 Overview of the bottom-up and top-down proteomics. Figure labels: Tissue or cells; Protein extraction; Bottom-up proteomics (BUP); Peptide digestion; Prefractionation; Ion exchange or affinity resin or protein IP; Top-down proteomics (TDP); Protein IP or immune enrichment; Fractionation; LC-MS; Data collection; Data analysis.

In broad strokes, MS analysis consists of three main steps:

1. Ionization: This can be performed by electrospray or matrix-assisted laser desorption/ionization (MALDI). This process creates positively or negatively charged ions that are then directed by magnetic or electric fields to a mass analyzer.

2. Mass selection: This step is typically performed by a mass analyzer, which separates the ions according to their mass-to-charge ratio. Several mass analyzers have been developed over time, which take advantage of the different physical properties of ions. The time of flight (TOF) analyzer uses an electric field to accelerate the ions and measures the time needed by each ion to reach the detector [214]; the quadrupole mass analyzer, also known as a mass filter, consists of four parallel cylindrical metal rods, which are charged by direct current and radiofrequency voltages to generate oscillating electrical fields. The name mass filter derives from the ability of these analyzers to select ions of a single mass-to-charge ratio (m/z) value based on oscillating electrical fields [215]. Finally, orbitrap instruments belong to the family of Fourier-transform mass analyzers, an ultrahigh-resolution method characterized by the highest resolving power and m/z measurement accuracy [216]. In particular, orbitrap instruments electrostatically trap the ions in an orbit around an electrode and record the frequency of each ion oscillation. The charge of the separated ions is then measured and quantified by a detector system, which calculates the m/z ratio [217].

3. Peak detection and analysis: The final step of MS analysis consists of peak detection performed by detector instruments that are able to record the ion charge or the current induced by the ion trajectory following the ionization step. The final product of this measurement is a mass spectrum, which is used for downstream peak analysis and identification. Finally, bioinformatics analysis is used to associate peaks with peptide identities and calculate any difference in their representation [218,219].

As opposed to the BUP approach, the top-down proteomics (TDP) approach uses undigested proteins for downstream MS analysis. 
Briefly, the intact protein ions generated by electrospray ionization are subjected to a mass analyzer for the ion separation step, as described above. After ion separation, a dissociation step is carried out using methods such as gas-phase fragmentation, collision-induced dissociation, or electron transfer dissociation. This step is critical to obtain fragmented ions to be analyzed by MS [220,221] (Fig. 12.2). Although TDP seems faster and more intuitive than BUP approaches, there are quite a few drawbacks to consider, such as potentially poor ionization due to the size of the protein analyzed, lower sensitivity due to the presence of multiple charges, and ineffective chromatographic separation of the higher molecular weight proteins [222]. On the other hand, BUP approaches might offer limited protein coverage and ambiguity in detecting specific proteins due to the redundancy in the sequence of some peptide fragments [223]. Considering the complementarity of the BUP and TDP approaches, new proteomics studies might want to consider combining the two methodologies [224]. Overall, the recent progress of proteomics approaches on multiple fronts, such as (1) improvement of prefractionation techniques to allow a deeper proteome coverage; (2) refinement of shotgun and targeted MS methods to increase the sensitivity of peptide detection; and (3) technological advancement of sample preparation to decrease the amount of tissue required for analysis [211,225–234], has promoted the use of proteomics platforms not only for basic research but also for clinical applications, marker identification, and disease diagnosis [235,236].
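Two quantities from the mass selection step can be made concrete with a short calculation: the m/z of a multiply protonated ion, and the idealized flight time in a TOF analyzer. The sketch below assumes positive-mode ions that carry their charge as added protons and an ideal linear TOF tube with no initial energy spread; the function names and the example voltage and tube length are illustrative assumptions, not instrument specifications.

```python
import math

PROTON_MASS_DA = 1.007276              # mass of a proton in daltons
DALTON_KG = 1.66053906660e-27          # one dalton in kilograms
ELEMENTARY_CHARGE_C = 1.602176634e-19  # elementary charge in coulombs


def mz_positive_ion(neutral_mass_da, charge):
    """m/z of a molecule carrying `charge` added protons."""
    return (neutral_mass_da + charge * PROTON_MASS_DA) / charge


def tof_flight_time_s(mz, accel_voltage_v, drift_length_m):
    """Idealized flight time from z*e*U = (1/2)*m*v^2, so t = L*sqrt(m/(2*z*e*U))."""
    mass_per_charge = mz * DALTON_KG / ELEMENTARY_CHARGE_C  # kg per coulomb
    return drift_length_m * math.sqrt(mass_per_charge / (2.0 * accel_voltage_v))


# Hypothetical peptide of neutral mass 1000.0 Da observed at charge 1 and 2.
mz_1 = mz_positive_ion(1000.0, 1)
mz_2 = mz_positive_ion(1000.0, 2)

# Hypothetical instrument settings: 20 kV acceleration, 1 m flight tube.
t_1 = tof_flight_time_s(mz_1, accel_voltage_v=20_000.0, drift_length_m=1.0)
t_2 = tof_flight_time_s(mz_2, accel_voltage_v=20_000.0, drift_length_m=1.0)
```

Because flight time scales with the square root of m/z, the doubly charged ion arrives earlier than the singly charged one, which is how a TOF analyzer separates them.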

12.6.3 Proteomics approaches to study rare diseases

The study of protein changes in health and disease might represent a valuable tool, especially in the case of rare diseases, where a fast and correct diagnosis and identification of the causative gene could
immensely help patients and clinicians to inform disease management and therapy options. One of the first successful proteomics-based biomarker identification studies was for Creutzfeldt–Jakob disease (CJD, MIM #123400), a rapidly progressive and fatal neurodegenerative disorder caused by prion proteins that undergo conformational changes, resulting in a fatal transmissible spongiform encephalopathy [237]. In the study, the authors identified an increase of 14-3-3 protein in the cerebrospinal fluid (CSF) as a marker of acute brain injury in CJD patients [238,239]. In the context of rare diseases, there are several examples of studies where proteomics approaches have been utilized for prognostic and diagnostic marker discovery. However, the low number of individuals analyzed makes these studies underpowered (often with noisy outcomes due to interindividual variability) and difficult to replicate. One such application of proteomics for marker identification was the study of inherited muscular dystrophies [240]. Ayoglu et al. used affinity proteomics approaches to identify circulating protein markers in Duchenne muscular dystrophy (DMD, MIM #310200) and Becker muscular dystrophy (BMD, MIM #300376), X-linked inherited disorders characterized by progressive muscle weakness (see Chapter 7: X-linked and Mitochondrial Disorders). Both DMD and BMD are caused by mutations in the dystrophin (DMD) gene, which can result in a complete loss or reduced levels of the encoded protein product, respectively [241–243]. DMD is the most common muscle disease in children, with an incidence of around 1 in 3500 live-born males [244]. Affected boys are usually wheelchair-bound by 12 years of age and die early in their third decade of life from respiratory or cardiac failure. BMD is a milder form of the disease with a later age of onset and a slower clinical progression. 
Plasma and serum from control and muscular dystrophy patients were profiled using a bead array platform of 384 antibodies. Hierarchical clustering of protein profiles showed a clear separation between disease and control groups. By directly comparing control and disease samples, eleven protein candidates whose levels were differentially represented in the two groups were identified. Four proteins were found to be elevated in serum and plasma of muscular dystrophy patients: carbonic anhydrase III (CA3), myosin light chain 3, mitochondrial malate dehydrogenase 2 (MDH2), and electron transfer flavoprotein A. Further differential analysis showed that CA3 protein was also a good classifier between DMD and BMD patients. In another study aiming to identify proteins associated with DMD progression, a longitudinal cohort of 157 DMD patients with clinical data available was studied [245]. A protein bead array was used to analyze 303 blood samples from 78 patients with a median follow-up time of 2 years. Data analysis was performed by using a linear mixed model in which the fluorescence intensity signal of each protein was modeled in relation to age, hospitalization, wheelchair dependence, and the type of corticosteroids used. Using this methodology, 21 proteins whose levels significantly decreased over time in DMD patients and nine proteins whose levels significantly increased were identified. Most importantly, the MDH2 protein was identified as a good prognostic marker of disease progression, as it was associated with an increased risk of wheelchair dependence and with response to treatment with corticosteroids [245].
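To give an intuition for what a longitudinal analysis of protein levels estimates, the sketch below fits a least-squares slope of signal intensity against age for each patient and then averages the slopes. This two-stage summary is a deliberately simplified stand-in, not the linear mixed model used in the study: it ignores the covariates (hospitalization, wheelchair dependence, corticosteroid type) and does not weight patients by number of visits. All names and values are invented for illustration.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def mean_patient_slope(samples):
    """Average per-patient slope of intensity versus age.

    samples: {patient_id: [(age_years, intensity), ...]}, with at least two
             visits at distinct ages per patient (hypothetical data layout).
    """
    slopes = []
    for visits in samples.values():
        ages, intensities = zip(*visits)
        slopes.append(ols_slope(ages, intensities))
    return mean(slopes)


# Toy data for one protein (invented values, arbitrary intensity units).
toy = {
    "patient_A": [(6.0, 10.0), (7.0, 9.0), (8.0, 8.0)],
    "patient_B": [(9.0, 7.5), (11.0, 4.5)],
}
average_slope = mean_patient_slope(toy)
```

A negative average slope corresponds to a protein whose level decreases with age. A mixed model would estimate a comparable fixed effect of age while also modeling patient-level variation and the other covariates jointly.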

12.7 Metabolomics

Metabolites represent a very diversified family of bio-products involved in a wide variety of biological processes, including intracellular and extracellular signaling, regulation of cell cycle, energy
conversion, epigenetic mechanisms, and many other processes. Metabolites are a direct indicator of the physiological status of an organism, and changes in their distribution can be considered as the endpoint of a cascade of events triggered by a number of factors, including genetic variation, diet and gut microbiome, aging, lifestyle, environment, and disease state [246–251]. In this context, metabolomics, defined as the comprehensive study of an organism’s metabolites, might represent a useful tool for detection of biochemical changes in early disease onset and an opportunity to identify predictive disease markers for quicker therapeutic intervention [252]. Moreover, integration of an organism’s metabolomic profile with data derived from other “omics” approaches, namely, genomics, transcriptomics, and proteomics, might open new avenues into discovering molecular mechanisms driving disease pathophysiology [253,254]. Metabolites are a diverse class of compounds with a wide concentration range and different physical and biochemical properties, and they are sensitive to a wide variety of factors. Particular attention is therefore needed in the experimental design of a metabolomics study. It is important to choose the right number of cases and controls and appropriate sampling procedures. In addition, clear guidelines need to be followed for sample preparation, storage, and processing because the abundance and integrity of some metabolites might be affected by all of these factors. Finally, because of the broad diversity of metabolites’ chemical properties, no single analytical method can currently cover the analysis of all metabolites at once. As such, attention should be paid while choosing the appropriate tissue to analyze and the specific analytical platform to use [255,256].

12.7.1 Metabolomics workflow and data analysis

Metabolomics analysis relies on two main technological platforms, namely, nuclear magnetic resonance (NMR) spectroscopy [257] and MS [258]. NMR is based on the principle that an atomic nucleus placed in a constant magnetic field absorbs and re-emits energy at a frequency in the radio range. This energy is measured and processed to create an NMR spectrum, which is specific for each nucleus [259]. As opposed to the MS methods, NMR spectroscopy is a highly reproducible technique and requires little or no chromatographic separation or sample treatment. However, its main drawback resides in its lack of sensitivity, which is instead one of the main features offered by MS-based methods. In general, a typical MS-based metabolomics workflow includes four main steps: (1) generation of the main hypothesis and experimental design, which will define the type of metabolite to be analyzed as well as the analytical technique to be adopted; (2) implementation of the experiment, which includes tissue preparation, protein precipitation by using ad hoc protocols depending on the nature of the metabolomics analysis (i.e., untargeted vs targeted), and metabolite separation by LC, followed by MS analysis to quantify the m/z peaks; (3) peak annotation and identification by using known metabolic annotation databases; and (4) data analysis and interpretation (Fig. 12.3). In the context of metabolite separation and MS analysis, liquid chromatography electrospray ionization mass spectrometry (LC-ESI-MS) is one of the most common approaches for metabolite purification and profiling. After LC purification, a high voltage is applied to the liquid phase containing the metabolites of interest. This ionization process disperses the liquid solution into an aerosol and ultimately generates gas-phase ions ready to be evaluated by a mass spectrometer for peak quantification [255]. 
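Step (3) of the workflow, putative peak annotation, is commonly done by matching each observed m/z value against reference values within a mass tolerance expressed in parts per million (ppm). The sketch below is a minimal illustration of that matching, assuming a small in-memory reference table of protonated ions; the dictionary, the function names, and the default 10 ppm tolerance are assumptions for illustration and do not represent the interface of any real annotation database.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def annotate_peak(observed_mz, reference, tolerance_ppm=10.0):
    """Return names of reference ions whose m/z lies within the tolerance."""
    return [
        name
        for name, ref_mz in reference.items()
        if abs(ppm_error(observed_mz, ref_mz)) <= tolerance_ppm
    ]


# Hypothetical mini-reference of [M+H]+ ions
# (monoisotopic mass plus one proton, rounded to four decimals).
reference = {
    "glucose [M+H]+": 181.0707,
    "phenylalanine [M+H]+": 166.0863,
}

close_match = annotate_peak(181.0712, reference)  # about 3 ppm from glucose
no_match = annotate_peak(181.0800, reference)     # about 51 ppm away
```

A match on m/z alone is only putative, since isomers share the same mass; this is why the workflow distinguishes putative peak identification from metabolite identification.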
Additional LC-MS methods characterized by different physicochemical properties have been developed over the years to assure increased sensitivity and robustness in reproducibility of results for a longer list of biologically relevant metabolites. Ultrahigh-performance liquid chromatography (UHPLC) coupled with tandem mass spectrometry (MS/MS) pumps the sample in the mobile phase through the solid adsorbent phase under ultrahigh-pressure conditions (i.e., >689 bar for particles <2 μm). The advantage of this technique lies in increased flow rate and decreased retention time [260].

FIGURE 12.3 Overview of untargeted metabolomics approaches. Workflow stages shown: (1) biological question, experimental design, and sample preparation (tissue processing, protein precipitation); (2) NMR or LC-MS data acquisition, data preprocessing, and data analysis; (3) putative peak identification by database query, metabolite identification, and pathway and network analysis. Adapted from Tebani A, et al. Int J Mol Sci 2016;17:1167. doi:10.3390/ijms17071167.

Furthermore, recent advances in ambient ionization techniques [261], such as desorption electrospray ionization [262,263] and rapid evaporative ionization MS [264], provide real-time, interpretable MS data from in vivo and ex vivo tissues and fluids, supporting the possibility of introducing metabolomic measurement in clinical testing for diagnosis and treatment. Finally, an alternative method that bypasses chromatography-based metabolite separation uses an automated flow-injection electrospray ionization mass spectrometry procedure, which allows MS analysis of cell-free extracts without prior chromatographic separation of the samples [265]. Compared to chromatography-MS, flow-injection MS approaches are faster and offer similar sensitivity and accuracy [266]. Once separated, the metabolites undergo MS analysis, including ionization, mass selection, and peak detection, as described above. Finally, MS data analysis requires elaborate preprocessing steps and the use of metabolomics software for the identification, visualization, and quantification of differential peaks [267,268]. MS peaks of each identified feature are then cross-referenced against annotation databases like the Human Metabolome Database, the Small
Molecule Pathway Database, Recon2, BioCyc, and METLIN [269–271] to look for matches with known metabolites. Despite the exponential increase in the number of annotated metabolites, one caveat of the current untargeted approaches is that the comprehensive identification of all human metabolites is still far from complete. As a consequence, several hundred unmatched metabolites are usually set aside before any downstream pathway enrichment analysis is performed.
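The source does not specify how pathway enrichment is computed. One common option is an over-representation test based on the hypergeometric distribution, sketched below as an assumption rather than as the method used in the cited work. The function name and all counts are hypothetical. Note that only annotated metabolites can enter the background set, which is why the unmatched features mentioned above are excluded beforehand.

```python
from math import comb


def enrichment_p_value(n_background, n_pathway, n_hits, n_hits_in_pathway):
    """Upper-tail hypergeometric probability.

    n_background      -- annotated metabolites measured in the experiment
    n_pathway         -- background metabolites that belong to the pathway
    n_hits            -- metabolites called differential
    n_hits_in_pathway -- differential metabolites that belong to the pathway

    Returns P(X >= n_hits_in_pathway) when n_hits metabolites are drawn
    at random, without replacement, from the background.
    """
    upper = min(n_pathway, n_hits)
    total = comb(n_background, n_hits)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_hits - k)
        for k in range(n_hits_in_pathway, upper + 1)
    )
    return tail / total
```

For example, `enrichment_p_value(400, 20, 45, 8)` asks how surprising it would be to find at least 8 of 45 differential metabolites in a 20-member pathway when 400 annotated metabolites were measured. In practice the test is repeated for every pathway, so the resulting p-values need correction for multiple testing.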

12.7.2 Untargeted versus targeted metabolomics

Untargeted metabolomics aims at measuring global changes in metabolites in an unbiased fashion. As such, untargeted metabolomics holds the promise to become a powerful and hypothesis-free tool for biomarker discovery. Following tissue preparation and protein precipitation [272–274], the metabolites are usually separated by LC followed by MS analysis (LC-MS). LC-MS is currently the most common methodology used for metabolite separation. However, because of the complex biophysical properties of the metabolome, multiplexing approaches are often used to maximize the number of molecules detected [275,276]. These include gas and liquid chromatography (GC/LC), which rely on a gaseous or liquid mobile phase, respectively, to transport the metabolites, and capillary electrophoresis, which separates metabolites by ionic mobility. In contrast, targeted metabolomics consists of the analysis of a predefined set of metabolites, which can be chosen based on their chemical features or because they are associated with a specific branch of metabolism, such as nucleic acid, lipid, or amino acid metabolism, among others [277]. This approach allows for a very sensitive and robust analysis of metabolites and is particularly suitable for hypothesis-driven experiments. Here, after tissue homogenization and protein precipitation, the next step is metabolite extraction, which is usually customized to the physicochemical attributes of a metabolite subset and its relative abundance [272–274]. A range of factors must be considered during the extraction process, including but not limited to pH, temperature, and the ratio of aqueous to organic solvents, as these factors could affect the stability and abundance of metabolites. Following the extraction step, a GC or LC technique coupled with MS is typically used for metabolite separation and quantification, as described above.

12.7.3 Metabolomics studies in rare diseases

In the context of rare disease, considerable effort has gone into applying metabolomics approaches to disease diagnosis, target discovery, and studies of disease mechanism. Metabolomics analyses of plasma or urine samples provide a very good proxy for the biochemical status of an organism. The use of such minimally invasive procedures can be preferable for diagnosis and monitoring of patients with rare diseases. In concert with WES, metabolomics has helped make significant progress in the diagnosis and functional exploration of inborn errors of metabolism (IEMs), a very diverse class of genetic disorders characterized by deregulation of enzymes and transporters involved in several metabolic processes including amino acid, carbohydrate, purine or pyrimidine catabolism, fatty acid oxidation, and mitochondrial function, among others [278]. Early detection and diagnosis of IEMs are of utmost importance, as management to avoid metabolic crises can improve disease outcomes and quality of life for patients [279]. Consequently, newborn screening has been particularly successful for IEM diagnosis [280]. Yet, the intrinsic clinical and genetic
heterogeneity of IEMs can often lead to confusion in choosing the appropriate diagnostic test, leading to delays in treatment. Application of genomics and metabolomics approaches is changing the diagnostic landscape of IEMs. The biggest advantage offered by the omics approaches is time: together, genomics and metabolomics provide comprehensive information on the gene and pathway defect and on metabolite and enzyme status, and can thus accelerate the diagnosis and treatment of new IEM patients [281,282]. An example of the successful application of metabolomics approaches to IEMs is the study and characterization of Fabry disease (FD; MIM #301500), the most prevalent lysosomal storage disorder caused by pathogenic variants in the alpha-galactosidase A (GLA) gene that result in loss or decreased function of the encoded enzyme. This enzymatic defect leads to the accumulation of globotriaosylsphingosine (lyso-Gb3) and globotriaosylceramide (Gb3) in biological fluids and tissues, followed by progressive renal failure, cardiomyopathy, early stroke, and gastrointestinal problems. Unlike patients with other lysosomal storage diseases, most FD patients remain clinically asymptomatic during the very first years of life. The first clinical symptoms arise during childhood and progress over time, eventually leading to organ failure [283]. Enzyme replacement therapy is available, and studies have shown that biweekly intravenous infusion of the recombinant functional enzyme can improve quality of life and alter disease progression [284].
In addition to Gb3 and lyso-Gb3 accumulation, the application of UHPLC electrospray ionization MS-based metabolomic approaches has allowed the identification of additional glycosphingolipids, such as globotriaosylceramide- and globotriaosylsphingosine-related molecules and galabiosylceramide isoforms/analogs, in the urine of untreated Fabry patients, contributing to a better understanding of the pathophysiology of FD and its specific deregulated pathways and metabolites [285–288].

Mitochondrial disorders are a clinically and genetically heterogeneous group of diseases characterized by dysfunction of the mitochondrial respiratory chain, which produces the energy for cellular function. These represent the most prevalent class of inherited metabolic disorders, with an incidence of 1:5000 worldwide, and are associated with high morbidity and mortality rates. In spite of this, little is known about the biochemical processes governing these disorders and no proven therapy currently exists. This class of disorders is unusual from a genetics standpoint, as the genes involved in mitochondrial processes can be encoded by both the nuclear and the mitochondrial genomes (see Chapter 7: X-linked and Mitochondrial Disorders). The clinical manifestation of these disorders can happen at any stage of life and can target one or multiple organs, particularly those with large energy requirements. Signature clinical features can include lactic acidosis, skeletal myopathy, deafness, blindness, subacute neurodegeneration, intestinal dysmotility, and peripheral neuropathy, among others [289]. Given their wide genetic heterogeneity, diagnosis of mitochondrial disorders may be laborious, requiring detailed family history, blood and/or CSF lactate concentration, neuroimaging, cardiac evaluation, and molecular genetic testing.
Recently, a renewed impetus to study mitochondrial disorders has emerged, with the aim of improving diagnosis and management for this class of diseases but also of gaining more insight into mitochondrial function and its role in the pathophysiology of rare and common diseases [290]. Metabolomic approaches have been implemented to identify deregulated metabolic pathways in Leigh syndrome French Canadian variant (LSFC, MIM #220111), a genetically homogeneous recessive disorder manifesting some of the main clinical features of mitochondrial disorders, including lactic acidosis and necrotizing encephalopathy [291–293]. Causal mutations of this syndrome have been identified in the nuclear leucine-rich pentatricopeptide-repeat-motif-containing (LRPPRC) gene, which encodes
a mitochondrially localized RNA-binding protein [294,295]. Two different targeted MS-based methods were used to identify deregulated metabolites in plasma and urine from eight patients homozygous for the LSFC p.Ala354Val founder mutation, one individual compound heterozygous for this mutation, and another individual carrying a premature stop variant. A total of 45 differential metabolites were identified as commonly deregulated. In addition to the known elevated levels of lactate, elevated levels of metabolites not previously implicated, such as α/β-hydroxybutyrate and acylcarnitine, were found, which suggested impairment of amino acid catabolism and metabolic dysfunction, respectively. Moreover, decreased levels of kynurenine and 3-hydroxyanthranilic acid indicated changes in NAD+ synthesis. Finally, changes in adiponectin and insulin levels suggested an altered risk of cardiovascular disease in LSFC patients. The identification of these metabolites in LSFC syndrome might open new avenues for the identification of biomarkers for diagnosis and for monitoring disease progression and severity. The surprising findings of insulin and adiponectin deregulation in LSFC patients might suggest the existence of new and previously unexplored pathways connecting mitochondrial function and risk of cardiovascular events. Overall, these findings demonstrate the power of metabolomics approaches and their potential to contribute to understanding complex and rare diseases.

Clinical metabolomics testing is an innovative untargeted metabolomics approach for testing individual patients. In clinical metabolomic studies, a relative measurement of metabolites obtained from a single patient sample is compared to a reference cohort. As a consequence, one of the challenges of this approach is to find an experimental and statistical analysis workflow that ensures reliable and reproducible results across different patients and time points.
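The comparison of a single patient sample with a reference cohort can be sketched as a per-metabolite z-score. This is a minimal illustration under stated assumptions: intensities are log2-transformed before scoring, and an absolute z-score of 2 or more is treated as an outlier. Neither choice, nor the function names, comes from the cited studies, whose statistical workflows are more elaborate.

```python
from math import log2
from statistics import mean, stdev


def z_scores(patient, reference_cohort):
    """Score each patient metabolite against the reference distribution.

    patient          -- dict mapping metabolite name to relative intensity
    reference_cohort -- list of such dicts, one per reference sample
    Intensities are log2-transformed first (an assumption of this sketch).
    """
    scores = {}
    for metabolite, value in patient.items():
        reference = [log2(sample[metabolite]) for sample in reference_cohort]
        scores[metabolite] = (log2(value) - mean(reference)) / stdev(reference)
    return scores


def flag_outliers(scores, threshold=2.0):
    """Keep metabolites whose |z| meets the (assumed) threshold."""
    return {name: z for name, z in scores.items() if abs(z) >= threshold}
```

With a hypothetical reference cohort `[{"lactate": 1.0}, {"lactate": 2.0}, {"lactate": 4.0}]` and a patient value of 16.0, the lactate z-score is 3, so the metabolite is flagged. Because the score depends on the reference distribution, batch effects between the patient run and the reference runs translate directly into false outliers, which is the reproducibility challenge described above.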
A study looking at single- and multi-day metabolomic changes in 10 pediatric IEM samples and controls utilized four different chromatographic methods (LC-MS/MS Pos Polar, LC-MS/MS Pos Lipid, LC-MS/MS Neg, and LC-MS/MS Polar) followed by MS analysis. Using a new statistical approach, Ford et al. were able to identify and quantitatively measure known and previously unknown metabolite biomarkers, allowing for a more tailored diagnosis for IEM patients [296]. Future work focused on the development of routine clinical metabolomics approaches holds the promise of high diagnostic and disease-monitoring value for patients with rare diseases [296,297].

As discussed, metabolomics offers a readout of tissue and body fluid metabolites in response to a wide variety of factors, such as genetics, environment, drugs, or diet. In addition, the rapid progress of MS techniques like ambient ionization (which allows the generation of interpretable real-time MS data from biofluids and tissues) makes them particularly suitable for clinical applications. Multiple studies have begun to show the utility of high-throughput metabolite profiling and pharmacometabolomics to predict drug metabolism, pharmacokinetics, drug safety, and drug efficacy [298–300].

12.8 Outlook

Due to the overall increased scalability and robustness of genomics and other omics approaches, the new concept of Systems Medicine might help reshape current medical practice in the very near future. At the basis of this concept is the observation that gene–environment interactions are likely to influence normal and disease-related processes, and hence a more in-depth characterization of
these interactions might offer new clues for disease management, diagnosis, and treatment. Systems Medicine supports the idea of routinely adopting comprehensive omics approaches for disease diagnosis and monitoring [301].

References
[1] Goldstein DB, Allen A, Keebler J, Margulies EH, Petrou S, Petrovski S, et al. Sequencing studies in human genetics: design and interpretation. Nat Rev Genet 2013;14(7):460–70.
[2] MacArthur DG, Manolio TA, Dimmock DP, Rehm HL, Shendure J, Abecasis GR, et al. Guidelines for investigating causality of sequence variants in human disease. Nature 2014;508(7497):469–76.
[3] Clark MM, Stark Z, Farnaes L, Tan TY, White SM, Dimmock D, et al. Meta-analysis of the diagnostic and clinical utility of genome and exome sequencing and chromosomal microarray in children with suspected genetic diseases. NPJ Genom Med 2018;3:16.
[4] Battle A, Khan Z, Wang SH, Mitrano A, Ford MJ, Pritchard JK, et al. Genomic variation. Impact of regulatory variation from RNA to protein. Science 2015;347(6222):664–7.
[5] Chick JM, Munger SC, Simecek P, Huttlin EL, Choi K, Gatti DM, et al. Defining the consequences of genetic variation on a proteome-wide scale. Nature 2016;534(7608):500–5.
[6] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004;431(7011):931–45.
[7] Dunham I, Shimizu N, Roe BA, Chissoe S, Hunt AR, Collins JE, et al. The DNA sequence of human chromosome 22. Nature 1999;402(6761):489–95.
[8] Hood L, Galas D. The digital code of DNA. Nature 2003;421(6921):444–8.
[9] Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science 2001;291(5507):1304–51.
[10] Bernstein BE, Meissner A, Lander ES. The mammalian epigenome. Cell 2007;128(4):669–81.
[11] Bonasio R, Tu S, Reinberg D. Molecular signals of epigenetic states. Science 2010;330(6004):612–16.
[12] Lander ES. Initial impact of the sequencing of the human genome. Nature 2011;470(7333):187–97.
[13] The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489(7414):57–74.
[14] Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 2011;473(7345):43–9.
[15] Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 2009;458(7235):223–7.
[16] Heintzman ND, Ren B. Finding distal regulatory elements in the human genome. Curr Opin Genet Dev 2009;19(6):541–9.
[17] Zhu J, Adli M, Zou JY, Verstappen G, Coyne M, Zhang X, et al. Genome-wide chromatin state transitions associated with developmental and environmental cues. Cell 2013;152(3):642–54.
[18] Sun YV. The influences of genetic and environmental factors on methylome-wide association studies for human diseases. Curr Genet Med Rep 2014;2(4):261–70.
[19] Sun YV, Lazarus A, Smith JA, Chuang YH, Zhao W, Turner ST, et al. Gene-specific DNA methylation association with serum levels of C-reactive protein in African Americans. PLoS One 2013;8(8):e73480.
[20] ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) project. Science 2004;306(5696):636–40.
[21] Adams D, Altucci L, Antonarakis SE, Ballesteros J, Beck S, Bird A, et al. BLUEPRINT to decode the epigenetic signature written in blood. Nat Biotechnol 2012;30(3):224–6.
[22] Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. Integrative analysis of 111 reference human epigenomes. Nature 2015;518(7539):317–30.
[23] Dor Y, Cedar H. Principles of DNA methylation and their implications for biology and medicine. Lancet 2018;392(10149):777–86.
[24] Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature 2001;409(6822):860–921.
[25] Myers EW, Sutton GG, Smith HO, Adams MD, Venter JC. On the sequencing and assembly of the human genome. Proc Natl Acad Sci U S A 2002;99(7):4145–6.
[26] Hellman A, Chess A. Gene body-specific methylation on the active X chromosome. Science 2007;315(5815):1141–3.
[27] Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
[28] Ziller MJ, Gu H, Müller F, Donaghey J, Tsai LT, Kohlbacher O, et al. Charting a dynamic DNA methylation landscape of the human genome. Nature 2013;500(7463):477–81.
[29] Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.
[30] Illingworth RS, Gruenewald-Schneider U, Webb S, Kerr AR, James KD, Turner DJ, et al. Orphan CpG islands identify numerous conserved promoters in the mammalian genome. PLoS Genet 2010;6(9):e1001134.
[31] Saxonov S, Berg P, Brutlag DL. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci U S A 2006;103(5):1412–17.
[32] Maunakea AK, Nagarajan RP, Bilenky M, Ballinger TJ, D’Souza C, Fouse SD, et al. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature 2010;466(7303):253–7.
[33] Bird A. DNA methylation patterns and epigenetic memory. Genes Dev 2002;16(1):6–21.
[34] Kriaucionis S, Heintz N. The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science 2009;324(5929):929–30.
[35] Tahiliani M, Koh KP, Shen Y, Pastor WA, Bandukwala H, Brudno Y, et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 2009;324(5929):930–5.
[36] Ito S, Shen L, Dai Q, Wu SC, Collins LB, Swenberg JA, et al. Tet proteins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine. Science 2011;333(6047):1300–3.
[37] Lister R, Mukamel EA, Nery JR, Urich M, Puddifoot CA, Johnson ND, et al. Global epigenomic reconfiguration during mammalian brain development. Science 2013;341(6146):1237905.
[38] Yu M, Hon GC, Szulwach KE, Song CX, Zhang L, Kim A, et al. Base-resolution analysis of 5-hydroxymethylcytosine in the mammalian genome. Cell 2012;149(6):1368–80.
[39] Booth MJ, Branco MR, Ficz G, Oxley D, Krueger F, Reik W, et al. Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science 2012;336(6083):934–7.
[40] Ličytė J, Gibas P, Skardžiūtė K, Stankevičius V, Rukšėnaitė A, Kriukienė E, et al. Approach for base-resolution analysis of genomic 5-carboxylcytosine. Cell Rep 2020;32(11):108155.
[41] Liu Y, Siejka-Zielińska P, Velikova G, Bi Y, Yuan F, Tomkova M, et al. Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nat Biotechnol 2019;37(4):424–9.
[42] Raiber E-A, Hardisty R, van Delft P, Balasubramanian S. Mapping and elucidating the function of modified bases in DNA. Nat Rev Chem 2017;1(9):0069.
[43] Brinkman AB, Simmer F, Ma K, Kaan A, Zhu J, Stunnenberg HG. Whole-genome DNA methylation profiling using MethylCap-seq. Methods 2010;52(3):232–6.
[44] Serre D, Lee BH, Ting AH. MBD-isolated genome sequencing provides a high-throughput and comprehensive survey of DNA methylation in the human genome. Nucleic Acids Res 2010;38(2):391–9.
[45] Wang Y, Yang Q, Wang Z. The evolution of nanopore sequencing. Front Genet 2014;5:449.
[46] Laszlo AH, Derrington IM, Brinkerhoff H, Langford KW, Nova IC, Samson JM, et al. Detection and mapping of 5-methylcytosine and 5-hydroxymethylcytosine with nanopore MspA. Proc Natl Acad Sci U S A 2013;110(47):18904–9.
[47] Quick J, Loman NJ, Duraffour S, Simpson JT, Severi E, Cowley L, et al. Real-time, portable genome sequencing for Ebola surveillance. Nature 2016;530(7589):228–32.
[48] Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 2018;36(4):338–45.
[49] Simpson JT, Workman RE, Zuzarte PC, David M, Dursi LJ, Timp W. Detecting DNA cytosine methylation using nanopore sequencing. Nat Methods 2017;14(4):407–10.
[50] Carvalho CMB, Coban-Akdemir Z, Hijazi H, Yuan B, Pendleton M, Harrington E, et al. Interchromosomal template-switching as a novel molecular mechanism for imprinting perturbations associated with Temple syndrome. Genome Med 2019;11(1):25.
[51] Van Holde KE, Sahasrabuddhe CG, Shaw BR. A model for particulate structure in chromatin. Nucleic Acids Res 1974;1(11):1579–86.
[52] Kouzarides T. Chromatin modifications and their function. Cell 2007;128(4):693–705.
[53] Tan M, Luo H, Lee S, Jin F, Yang JS, Montellier E, et al. Identification of 67 histone marks and histone lysine crotonylation as a new type of histone modification. Cell 2011;146(6):1016–28.
[54] Tian Z, Tolić N, Zhao R, Moore RJ, Hengel SM, Robinson EW, et al. Enhanced top-down characterization of histone post-translational modifications. Genome Biol 2012;13(10):R86.
[55] Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci U S A 2010;107(50):21931–6.
[56] Hublitz P, Albert M, Peters AH. Mechanisms of transcriptional repression by histone lysine methylation. Int J Dev Biol 2009;53(2–3):335–54.
[57] Xia W, Xu J, Yu G, Yao G, Xu K, Ma X, et al. Resetting histone modifications during human parental-to-zygotic transition. Science 2019;365(6451):353–60.
[58] Leung D, Jung I, Rajagopal N, Schmitt A, Selvaraj S, Lee AY, et al. Integrative analysis of haplotype-resolved epigenomes across human tissues. Nature 2015;518(7539):350–4.
[59] Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell 2007;129(4):823–37.
[60] Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science 2007;316(5830):1497–502.
[61] Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 2007;448(7153):553–60.
[62] Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 2007;4(8):651–7.
[63] Baranello L, Kouzine F, Sanford S, Levens D. ChIP bias as a function of cross-linking time. Chromosome Res 2016;24(2):175–81.
[64] Jain D, Baldi S, Zabel A, Straub T, Becker PB. Active promoters give rise to false positive ‘phantom peaks’ in ChIP-seq experiments. Nucleic Acids Res 2015;43(14):6959–68.
[65] Kasinathan S, Orsi GA, Zentner GE, Ahmad K, Henikoff S. High-resolution mapping of transcription factor binding sites on native chromatin. Nat Methods 2014;11(2):203–9.
[66] O’Neill LP, Turner BM. Immunoprecipitation of native chromatin: NChIP. Methods 2003;31(1):76–82.
[67] Schmidl C, Rendeiro AF, Sheffield NC, Bock C. ChIPmentation: fast, robust, low-input ChIP-seq for histones and transcription factors. Nat Methods 2015;12(10):963–5.
[68] Brind’Amour J, Liu S, Hudson M, Chen C, Karimi MM, Lorincz MC. An ultra-low-input native ChIP-seq protocol for genome-wide profiling of rare cell populations. Nat Commun 2015;6:6033.
[69] Cao Z, Chen C, He B, Tan K, Lu C. A microfluidic device for epigenomic profiling using 100 cells. Nat Methods 2015;12(10):959–62.
[70] Skene PJ, Henikoff JG, Henikoff S. Targeted in situ genome-wide profiling with high efficiency for low cell numbers. Nat Protoc 2018;13(5):1006–19.
[71] Skene PJ, Henikoff S. An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. Elife 2017;6.
[72] Hainer SJ, Bošković A, McCannell KN, Rando OJ, Fazzio TG. Profiling of pluripotency factors in single cells and early embryos. Cell 2019;177(5):1319–29.e11.
[73] Rotem A, Ram O, Shoresh N, Sperling RA, Goren A, Weitz DA, et al. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. Nat Biotechnol 2015;33(11):1165–72.
[74] Ai S, Xiong H, Li CC, Luo Y, Shi Q, Liu Y, et al. Profiling chromatin states using single-cell itChIP-seq. Nat Cell Biol 2019;21(9):1164–72.
[75] Kaya-Okur HS, Wu SJ, Codomo CA, Pledger ES, Bryson TD, Henikoff JG, et al. CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nat Commun 2019;10(1):1930.
[76] Carter B, Ku WL, Kang JY, Hu G, Perrie J, Tang Q, et al. Mapping histone modifications in low cell number and single cells using antibody-guided chromatin tagmentation (ACT-seq). Nat Commun 2019;10(1):3747.
[77] Wang Q, Xiong H, Ai S, Yu X, Liu Y, Zhang J, et al. CoBATCH for high-throughput single-cell epigenomic profiling. Mol Cell 2019;76(1):206–16.e7.
[78] Allis CD, Jenuwein T. The molecular hallmarks of epigenetic control. Nat Rev Genet 2016;17(8):487–500.
[79] Kaplan N, Moore IK, Fondufe-Mittendorf Y, Gossett AJ, Tillo D, Field Y, et al. The DNA-encoded nucleosome organization of a eukaryotic genome. Nature 2009;458(7236):362–6.
[80] Kornberg RD, Thomas JO. Chromatin structure; oligomers of the histones. Science 1974;184(4139):865–8.
[81] Olins DE, Olins AL. Chromatin history: our view from the bridge. Nat Rev Mol Cell Biol 2003;4(10):809–14.
[82] Lee CK, Shibata Y, Rao B, Strahl BD, Lieb JD. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat Genet 2004;36(8):900–5.
[83] Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, et al. The accessible chromatin landscape of the human genome. Nature 2012;489(7414):75–82.
[84] Bednar J, Horowitz RA, Grigoryev SA, Carruthers LM, Hansen JC, Koster AJ, et al. Nucleosomes, linker DNA, and linker histone form a unique structural motif that directs the higher-order folding and compaction of chromatin. Proc Natl Acad Sci U S A 1998;95(24):14173–8.
[85] Felsenfeld G, Boyes J, Chung J, Clark D, Studitsky V. Chromatin structure and gene expression. Proc Natl Acad Sci U S A 1996;93(18):9384–8.
[86] Fyodorov DV, Zhou BR, Skoultchi AI, Bai Y. Emerging roles of linker histones in regulating chromatin structure and function. Nat Rev Mol Cell Biol 2018;19(3):192–206.
[87] Kim JM, Kim K, Punj V, Liang G, Ulmer TS, Lu W, et al. Linker histone H1.2 establishes chromatin compaction and gene silencing through recognition of H3K27me3. Sci Rep 2015;5:16714.
[88] Krebs AR, Imanci D, Hoerner L, Gaidatzis D, Burger L, Schübeler D. Genome-wide Single-molecule footprinting reveals high RNA polymerase II turnover at paused promoters. Mol Cell 2017;67(3):411–22.e4.
[89] McBryant SJ, Adams VH, Hansen JC. Chromatin architectural proteins. Chromosome Res 2006;14(1):39–51.
[90] Gillette TG, Hill JA. Readers, writers, and erasers: chromatin as the whiteboard of heart disease. Circ Res 2015;116(7):1245–53.
[91] Moccia A, Martin DM. Nervous system development and disease: a focus on trithorax related proteins and chromatin remodelers. Mol Cell Neurosci 2018;87:46–54.
[92] Pal S, Tyler JK. Epigenetics and aging. Sci Adv 2016;2(7):e1600584.
[93] Sahu RK, Singh S, Tomar RS. The mechanisms of action of chromatin remodelers and implications in development and disease. Biochem Pharmacol 2020;180:114200.
[94] Snyder-Mackler N, Sanz J, Kohn JN, Voyles T, Pique-Regi R, Wilson ME, et al. Social status alters chromatin accessibility and the gene regulatory response to glucocorticoid stimulation in rhesus macaques. Proc Natl Acad Sci U S A 2019;116(4):1219–28.
[95] Vinci MC, Polvani G, Pesce M. Epigenetic programming and risk: the birthplace of cardiovascular disease? Stem Cell Rev Rep 2013;9(3):241–53.
[96] Goryshin IY, Reznikoff WS. Tn5 in vitro transposition. J Biol Chem 1998;273(13):7367–74.
[97] Reznikoff WS. Tn5 as a model for understanding DNA transposition. Mol Microbiol 2003;47(5):1199–206.
[98] Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 2013;10(12):1213–18.
[99] Corces MR, Trevino AE, Hamilton EG, Greenside PG, Sinnott-Armstrong NA, Vesuna S, et al. An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat Methods 2017;14(10):959–62.
[100] Meyer CA, Liu XS. Identifying and mitigating bias in next-generation sequencing methods for chromatin biology. Nat Rev Genet 2014;15(11):709–21.
[101] Corces MR, Buenrostro JD, Wu B, Greenside PG, Chan SM, Koenig JL, et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat Genet 2016;48(10):1193–203.
[102] Sos BC, Fung HL, Gao DR, Osothprarop TF, Kia A, He MM, et al. Characterization of chromatin accessibility with a transposome hypersensitive sites sequencing (THS-seq) assay. Genome Biol 2016;17:20.
[103] Wu J, Xu J, Liu B, Yao G, Wang P, Lin Z, et al. Chromatin analysis in human early development reveals epigenetic transition during ZGA. Nature 2018;557(7704):256–60.
[104] Chung HR, Dunkel I, Heise F, Linke C, Krobitsch S, Ehrenhofer-Murray AE, et al. The effect of micrococcal nuclease digestion on nucleosome positioning data. PLoS One 2010;5(12):e15754.
[105] Lorzadeh A, Bilenky M, Hammond C, Knapp D, Li L, Miller PH, et al. Nucleosome Density ChIP-Seq identifies distinct chromatin modification signatures associated with MNase accessibility. Cell Rep 2016;17(8):2112–24.
[106] Kelly TK, Liu Y, Lay FD, Liang G, Berman BP, Jones PA. Genome-wide mapping of nucleosome positioning and DNA methylation within individual DNA molecules. Genome Res 2012;22(12):2497–506.
[107] Dekker J, Marti-Renom MA, Mirny LA. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat Rev Genet 2013;14(6):390–403.
[108] Gorkin DU, Leung D, Ren B. The 3D genome in transcriptional regulation and pluripotency. Cell Stem Cell 2014;14(6):762–75.
[109] Splinter E, de Laat W. The complex transcription regulatory landscape of our genome: control in three dimensions. EMBO J 2011;30(21):4345–55.
[110] Woodcock CL, Dimitrov S. Higher-order structure of chromatin and chromosomes. Curr Opin Genet Dev 2001;11(2):130–5.
[111] Dekker J. The three ‘C’s of chromosome conformation capture: controls, controls, controls. Nat Methods 2006;3(1):17–21.
[112] Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science 2002;295(5558):1306–11.
[113] Splinter E, Heath H, Kooren J, Palstra RJ, Klous P, Grosveld F, et al. CTCF mediates long-range chromatin looping and local histone modification in the beta-globin locus. Genes Dev 2006;20(17):234954. [114] Wu¨rtele H, Chartrand P. Genome-wide scanning of HoxB1-associated loci in mouse ES cells using an open-ended chromosome conformation capture methodology. Chromosome Res 2006;14(5):47795. [115] de Wit E, de Laat W. A decade of 3C technologies: insights into nuclear organization. Genes Dev 2012;26(1):1124. [116] Han J, Zhang Z, Wang K. 3C and 3C-based techniques: the powerful tools for spatial genome organization deciphering. Mol Cytogenet 2018;11:21. [117] Schmitt AD, Hu M, Ren B. Genome-wide mapping and analysis of chromosome architecture. Nat Rev Mol Cell Biol 2016;17(12):74355. [118] Simonis M, Klous P, Splinter E, Moshkin Y, Willemsen R, de Wit E, et al. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C). Nat Genet 2006;38(11):134854. [119] Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, et al. Chromosome conformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res 2006;16(10):1299309. [120] Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009;326(5950):28993. [121] Kalhor R, Tjong H, Jayathilaka N, Alber F, Chen L. Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat Biotechnol 2011;30(1):908. [122] Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, et al. An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 2009;462(7269):5864. [123] Li G, Fullwood MJ, Xu H, Mulawadi FH, Velkov S, Vega V, et al. 
ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome Biol 2010;11(2):R22. [124] Nagano T, Lubling Y, Stevens TJ, Schoenfelder S, Yaffe E, Dean W, et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 2013;502(7469):5964. [125] Fraser P, Bickmore W. Nuclear organization of the genome and the potential for gene regulation. Nature 2007;447(7143):41317. [126] Misteli T. Beyond the sequence: cellular organization of genome function. Cell 2007;128(4):787800. [127] Ron G, Globerson Y, Moran D, Kaplan T. Promoter-enhancer interactions identified from Hi-C data using probabilistic models and hierarchical topological domains. Nat Commun 2017;8(1):2237. [128] Feinberg AP, Koldobskiy MA, Go¨ndo¨r A. Epigenetic modulators, modifiers and mediators in cancer aetiology and progression. Nat Rev Genet 2016;17(5):28499. [129] Timp W, Feinberg AP. Cancer as a dysregulated epigenome allowing cellular growth advantage at the expense of the host. Nat Rev Cancer 2013;13(7):497510. [130] Chung SA, Nititham J, Elboudwarej E, Quach HL, Taylor KE, Barcellos LF, et al. Genome-wide assessment of differential DNA methylation associated with autoantibody production in systemic lupus erythematosus. PLoS One 2015;10(7):e0129813. [131] Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol 2013;31(2):1427. [132] Dayeh T, Volkov P, Salo¨ S, Hall E, Nilsson E, Olsson AH, et al. Genome-wide DNA methylation analysis of human pancreatic islets from type 2 diabetic and non-diabetic donors identifies candidate genes that influence insulin secretion. PLoS Genet 2014;10(3):e1004160. [133] Dick KJ, Nelson CP, Tsaprouni L, Sandling JK, Aı¨ssi D, Wahl S, et al. DNA methylation and bodymass index: a genome-wide analysis. Lancet 2014;383(9933):19908. [134] Ling C, Ro¨nn T. 
Epigenetics in human obesity and Type 2 diabetes. Cell Metab 2019;29(5):102844.

References

255

[135] Amir RE, Van den Veyver IB, Wan M, Tran CQ, Francke U, Zoghbi HY. Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat Genet 1999;23(2):1858. [136] Dashtipour K, Tafreshi A, Adler C, Beach T, Chen X, Serrano G, et al. Hypermethylation of synphilin-1, alpha-synuclein-interacting protein (SNCAIP) gene in the cerebral cortex of patients with sporadic Parkinson’s disease. Brain Sci 2017;7:7. [137] Iossifov I, O’Roak BJ, Sanders SJ, Ronemus M, Krumm N, Levy D, et al. The contribution of de novo coding mutations to autism spectrum disorder. Nature 2014;515(7526):21621. [138] Ovenden ES, McGregor NW, Emsley RA, Warnich L. DNA methylation and antipsychotic treatment mechanisms in schizophrenia: Progress and future directions. Prog Neuropsychopharmacol Biol Psychiatry 2018;81:3849. [139] Brookes E, Shi Y. Diverse epigenetic mechanisms of human disease. Annu Rev Genet 2014;48 (1):23768. [140] Elhamamsy AR. Role of DNA methylation in imprinting disorders: an updated review. J Assist Reprod Genet 2017;34(5):54962. [141] Yamazawa K, Ogata T, Ferguson-Smith AC. Uniparental disomy and human disease: an overview. Am J Med Genet C Semin Med Genet 2010;154c(3):32934. [142] Delaval K, Wagschal A, Feil R. Epigenetic deregulation of imprinting in congenital diseases of aberrant growth. Bioessays 2006;28(5):4539. [143] Horsthemke B. Mechanisms of imprint dysregulation. Am J Med Genet C Semin Med Genet 2010;154c(3):3218. [144] Horsthemke B. In brief: genomic imprinting and imprinting diseases. J Pathol 2014;232(5):4857. [145] Berdasco M, Esteller M. Genetic syndromes caused by mutations in epigenetic genes. Hum Genet 2013;132(4):35983. [146] Mastrototaro G, Sessa A. Chapter 9—Emerging role of epigenetics in human neurodevelopmental disorders. In: 2nd ed. Tollefsbol TO, editor. Epigenetics in human disease, 6. Academic Press; 2018. p. 269304. [147] Miyake K, Hirasawa T, Koide T, Kubota T. 
Epigenetics in autism and other neurodevelopmental diseases. Adv Exp Med Biol 2012;724:918. [148] Zahir FR, Brown CJ. Epigenetic impacts on neurodevelopment: pathophysiological mechanisms and genetic modes of action. Pediatric Res 2011;69(8):92100. [149] Bates GP, Dorsey R, Gusella JF, Hayden MR, Kay C, Leavitt BR, et al. Huntington disease. Nat Rev Dis Prim 2015;1:15005. [150] Cha JH. Transcriptional signatures in Huntington’s disease. Prog Neurobiol 2007;83(4):22848. [151] Lee J, Hwang YJ, Kim KY, Kowall NW, Ryu H. Epigenetic mechanisms of neurodegeneration in Huntington’s disease. Neurotherapeutics 2013;10(4):66476. [152] Valor LM. Transcription, epigenetics and ameliorative strategies in Huntington’s Disease: a genomewide perspective. Mol Neurobiol 2015;51(1):40623. [153] Wood H. Neurodegenerative disease: altered DNA methylation and RNA splicing could be key mechanisms in Huntington disease. Nat Rev Neurol 2013;9(3):119. [154] Ng CW, Yildirim F, Yap YS, Dalin S, Matthews BJ, Velez PJ, et al. Extensive changes in DNA methylation are associated with expression of mutant huntingtin. Proc Natl Acad Sci U S A 2013;110 (6):23549. [155] Achour M, Le Gras S, Keime C, Parmentier F, Lejeune FX, Boutillier AL, et al. Neuronal identity genes regulated by super-enhancers are preferentially down-regulated in the striatum of Huntington’s disease mice. Hum Mol Genet 2015;24(12):348196. [156] McFarland KN, Das S, Sun TT, Leyfer D, Xia E, Sangrey GR, et al. Genome-wide histone acetylation is altered in a transgenic mouse model of Huntington’s disease. PLoS One 2012;7(7):e41423.

256

Chapter 12 Other omics approaches to the study of rare diseases

[157] Sadri-Vakili G, Bouzou B, Benn CL, Kim MO, Chawla P, Overland RP, et al. Histones associated with downregulated genes are hypo-acetylated in Huntington’s disease models. Hum Mol Genet 2007;16 (11):1293306. [158] Sadri-Vakili G, Cha JH. Mechanisms of disease: histone modifications in Huntington’s disease. Nat Clin Pract Neurol 2006;2(6):3308. [159] Vashishtha M, Ng CW, Yildirim F, Gipson TA, Kratter IH, Bodai L, et al. Targeting H3K4 trimethylation in Huntington disease. Proc Natl Acad Sci U S A 2013;110(32):E302736. [160] De Souza RA, Islam SA, McEwen LM, Mathelier A, Hill A, Mah SM, et al. DNA methylation profiling in human Huntington’s disease brain. Hum Mol Genet 2016;25(10):201330. [161] Dong X, Tsuji J, Labadorf A, Roussos P, Chen JF, Myers RH, et al. The role of H3K4me3 in transcriptional regulation is altered in Huntington’s disease. PLoS One 2015;10(12):e0144398. [162] Horvath S. DNA methylation age of human tissues and cell types. Genome Biol 2013;14(10):3156. [163] Horvath S, Langfelder P, Kwak S, Aaronson J, Rosinski J, Vogt TF, et al. Huntington’s disease accelerates epigenetic aging of human brain and disrupts DNA methylation levels. Aging (Albany NY) 2016;8 (7):1485512. [164] Persidis A. Proteomics. Nat Biotechnol 1998;16(4):3934. [165] Wasinger VC, Cordwell SJ, Cerpa-Poljak A, Yan JX, Gooley AA, Wilkins MR, et al. Progress with gene-product mapping of the mollicutes: mycoplasma genitalium. Electrophoresis 1995;16(7):10904. [166] Wilkins MR, Gasteiger E, Sanchez JC, Appel RD, Hochstrasser DF. Protein identification with sequence tags. Curr Biol 1996;6(12):15434. [167] HUPO (Human Proteome Organization) 1st World Congress. 2124 November 2002, Versailles, France. Abstracts. Mol Cell Proteomics 2002;1(9):651752. [168] Jensen ON. Modification-specific proteomics: characterization of post-translational modifications by mass spectrometry. Curr Opin Chem Biol 2004;8(1):3341. [169] Lescuyer P, Hochstrasser D, Rabilloud T. 
How shall we use the proteomics toolbox for biomarker discovery? J Proteome Res 2007;6(9):33716. [170] Frantzi M, Latosinska A, Flu¨he L, Hupe MC, Critselis E, Kramer MW, et al. Developing proteomic biomarkers for bladder cancer: towards clinical application. Nat Rev Urol 2015;12(6):31730. [171] Pan S, Brentnall TA, Chen R. Proteomics analysis of bodily fluids in pancreatic cancer. Proteomics 2015;15(15):270515. [172] Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002;359(9306):5727. [173] Tsai TH, Song E, Zhu R, Di Poto C, Wang M, Luo Y, et al. LC-MS/MS-based serum proteomics for identification of candidate biomarkers for hepatocellular carcinoma. Proteomics 2015;15(13):236981. [174] Farias AS, Pradella F, Schmitt A, Santos LM, Martins-de-Souza D. Ten years of proteomics in multiple sclerosis. Proteomics 2014;14(4-5):46780. [175] Al Awam K, Haußleiter IS, Dudley E, Donev R, Bru¨ne M, Juckel G, et al. Multiplatform metabolome and proteome profiling identifies serum metabolite and protein signatures as prospective biomarkers for schizophrenia. J Neural Transm (Vienna) 2015;122(Suppl 1):S11122. [176] Nascimento JM, Martins-de-Souza D. The proteome of schizophrenia. NPJ Schizophr 2015;1:14003. [177] Angenendt P, Glo¨kler J, Sobek J, Lehrach H, Cahill DJ. Next generation of protein microarray support materials: evaluation for protein and antibody microarray applications. J Chromatogr A 2003;1009 (12):97104. [178] Huang RP. Detection of multiple proteins in an antibody-based protein microarray system. J Immunol Methods 2001;255(12):113. [179] Moreno-Bondi MC, Alarie JP, Vo-Dinh T. Multi-analyte analysis system using an antibody-based biochip. Anal Bioanal Chem 2003;375(1):1204.

References

257

[180] Nallur G, Marrero R, Luo C, Krishna RM, Bechtel PE, Shao W, et al. Protein and nucleic acid detection by rolling circle amplification on gel-based microarrays. Biomed Microdev. 2003;5(2):11523. [181] Pawlak M, Schick E, Bopp MA, Schneider MJ, Oroszlan P, Ehrat M. Zeptosens’ protein microarrays: a novel high performance microarray platform for low abundance protein analysis. Proteomics 2002;2 (4):38393. [182] Rubina AY, Dementieva EI, Stomakhin AA, Darii EL, Pan’kov SV, Barsky VE, et al. Hydrogel-based protein microchips: manufacturing, properties, and applications. Biotechniques 2003;34(5) 1008-14, 1016-20, 1022. [183] Ruiz-Taylor LA, Martin TL, Zaugg FG, Witte K, Indermuhle P, Nock S, et al. Monolayers of derivatized poly(L-lysine)-grafted poly(ethylene glycol) on metal oxides as a class of biomolecular interfaces. Proc Natl Acad Sci U S A 2001;98(3):8527. [184] Schweitzer B, Roberts S, Grimwade B, Shao W, Wang M, Fu Q, et al. Multiplexed protein profiling on microarrays by rolling-circle amplification. Nat Biotechnol 2002;20(4):35965. [185] Kingsmore SF. Multiplexed protein measurement: technologies and applications of protein and antibody arrays. Nat Rev Drug Discov 2006;5(4):31020. [186] Robinson WH, DiGennaro C, Hueber W, Haab BB, Kamachi M, Dean EJ, et al. Autoantigen microarrays for multiplex characterization of autoantibody responses. Nat Med 2002;8(3):295301. [187] de Seny D, Fillet M, Meuwis MA, Geurts P, Lutteri L, Ribbens C, et al. Discovery of new rheumatoid arthritis biomarkers using the surface-enhanced laser desorption/ionization time-of-flight mass spectrometry ProteinChip approach. Arthritis Rheum 2005;52(12):380112. [188] Hinchliffe TE, Lin ZT, Wu T. Protein arrays for biomarker discovery in lupus. Proteom Clin Appl 2016;10(6):62534. [189] Li QZ, Zhou J, Wandstrat AE, Carr-Johnson F, Branch V, Karp DR, et al. Protein array autoantibody profiles for insights into systemic lupus erythematosus and incomplete lupus syndromes. 
Clin Exp Immunol 2007;147(1):6070. [190] Szodoray P, Alex P. Protein array diagnostics for guiding therapy in rheumatoid arthritis. Mol Diagn Ther 2011;15(5):24754. [191] Shi Y, Li M, Liu L, Wang Z, Wang Y, Zhao J, et al. Relationship between disease activity, organ damage and health-related quality of life in patients with systemic lupus erythematosus: a systemic review and meta-analysis. Autoimmun Rev 2021;20(1):102691. [192] Arbuckle MR, McClain MT, Rubertone MV, Scofield RH, Dennis GJ, James JA, et al. Development of autoantibodies before the clinical onset of systemic lupus erythematosus. N Engl J Med 2003;349 (16):152633. [193] Heinlen LD, McClain MT, Merrill J, Akbarali YW, Edgerton CC, Harley JB, et al. Clinical criteria for systemic lupus erythematosus precede diagnosis, and associated autoantibodies are present before clinical symptoms. Arthritis Rheum 2007;56(7):234451. [194] Ippolito A, Wallace DJ, Gladman D, Fortin PR, Urowitz M, Werth V, et al. Autoantibodies in systemic lupus erythematosus: comparison of historical and current assessment of seropositivity. Lupus 2011;20 (3):2505. [195] Li QZ, Xie C, Wu T, Mackay M, Aranow C, Putterman C, et al. Identification of autoantibody clusters that best predict lupus disease activity using glomerular proteome arrays. J Clin Invest 2005;115 (12):342839. [196] Li QZ, Zhou J, Lian Y, Zhang B, Branch VK, Carr-Johnson F, et al. Interferon signature gene expression is correlated with autoantibody profiles in patients with incomplete lupus syndromes. Clin Exp Immunol 2010;159(3):28191. [197] Miersch S, Bian X, Wallstrom G, Sibani S, Logvinenko T, Wasserfall CH, et al. Serological autoantibody profiling of type 1 diabetes by protein arrays. J Proteom 2013;94:48696.

258

Chapter 12 Other omics approaches to the study of rare diseases

[198] Atak A, Mukherjee S, Jain R, Gupta S, Singh VA, Gahoi N, et al. Protein microarray applications: autoantibody detection and posttranslational modification. Proteomics 2016;16(19):255769. [199] Gold L, Ayers D, Bertino J, Bock C, Bock A, Brody EN, et al. Aptamer-based multiplexed proteomic technology for biomarker discovery. PLoS One 2010;5(12):e15004. [200] Vaught JD, Bock C, Carter J, Fitzwater T, Otis M, Schneider D, et al. Expanding the chemistry of DNA for in vitro selection. J Am Chem Soc 2010;132(12):414151. [201] Brody E, Gold L, Mehan M, Ostroff R, Rohloff J, Walker J, et al. Life’s simple measures: unlocking the proteome. J Mol Biol 2012;422(5):595606. [202] Williams SA, Kivimaki M, Langenberg C, Hingorani AD, Casas JP, Bouchard C, et al. Plasma protein patterns as comprehensive indicators of health. Nat Med 2019;25(12):18517. [203] Agasing AM, Wu Q, Khatri B, Borisow N, Ruprecht K, Brandt AU, et al. Transcriptomics and proteomics reveal a cooperation between interferon and T-helper 17 cells in neuromyelitis optica. Nat Commun 2020;11(1):2856. [204] Ali AS, Perren A, Lindskog C, Welin S, Sorbye H, Gro¨nberg M, et al. Candidate protein biomarkers in pancreatic neuroendocrine neoplasms grade 3. Sci Rep 2020;10(1):10639. [205] Berbers R-M, Drylewicz J, Ellerbroek PM, van Montfrans JM, Dalm VASH, van Hagen PM, et al. Targeted proteomics reveals inflammatory pathways that classify immune dysregulation in common variable immunodeficiency. J Clin Immunol 2020. [206] Bjo¨rk A, Da Silva Rodrigues R, Richardsdotter Andersson E, Ramı´rez Sepu´lveda JI, Mofors J, Kvarnstro¨m M, et al. Interferon activation status underlies higher antibody response to viral antigens in patients with systemic lupus erythematosus receiving no or light treatment. Rheumatology (Oxf.) 2020. [207] Chait BT. Mass spectrometry: bottom-up or top-down? Science 2006;314(5796):656. [208] Resing KA, Ahn NG. Proteomics strategies for protein identification. FEBS Lett 2005;579(4):8859. 
[209] Westermeier R. Looking at proteins from two dimensions: a review on five decades of 2D electrophoresis. Arch Physiol Biochem 2014;120(5):16872. [210] Altelaar AF, Munoz J, Heck AJ. Next-generation proteomics: towards an integrative view of proteome dynamics. Nat Rev Genet 2013;14(1):3548. [211] Di Palma S, Hennrich ML, Heck AJ, Mohammed S. Recent advances in peptide separation by multidimensional liquid chromatography for proteome analysis. J Proteom 2012;75(13):3791813. [212] Motoyama A, Yates 3rd JR. Multidimensional L.C. separations in shotgun proteomics. Anal Chem 2008;80(19):718793. [213] Zhang X, Fang A, Riley CP, Wang M, Regnier FE, Buck C. Multi-dimensional liquid chromatography in proteomics—a review. Anal Chim Acta 2010;664(2):10113. [214] Vestal ML, Campbell JM. Tandem time-of-flight mass spectrometry. Methods Enzymol 2005;402:79108. [215] Syed SU, Maher S, Taylor S. Quadrupole mass filter operation under the influence of magnetic field. J Mass Spectrom 2013;48(12):132539. [216] Scigelova M, Hornshaw M, Giannakopulos A, Makarov A. Fourier transform mass spectrometry. Mol Cell Proteom 2011;10(7):M111. [217] Deracinois B, Flahaut C, Duban-Deweer S, Karamanos Y. Comparative and quantitative global proteomics approaches: an overview. Proteomes 2013;1(3):180218. [218] Chen C, Hou J, Tanner JJ, Cheng J. Bioinformatics methods for mass spectrometry-based proteomics data analysis. Int J Mol Sci 2020;21:8. [219] Li S, Tang H. Computational methods in mass spectrometry-based proteomics. Adv Exp Med Biol 2016;939:6389. [220] Parks BA, Jiang L, Thomas PM, Wenger CD, Roth MJ, Boyne MT, et al. Top-down proteomics on a chromatographic time scale using linear ion trap fourier transform hybrid mass spectrometers. Anal Chem 2007;79(21):798491.

References

259

[221] Tran JC, Zamdborg L, Ahlf DR, Lee JE, Catherman AD, Durbin KR, et al. Mapping intact protein isoforms in discovery mode using top-down proteomics. Nature 2011;480(7376):2548. [222] Hoofnagle AN, Bystrom C. Proteomics. In: Rifai N, Horvath AR, Wittwer CT, editors. Principles and applications of clinical mass spectrometry. Elsevier; 2018. p. 181201. [223] Yates JR, Ruse CI, Nakorchevsky A. Proteomics by mass spectrometry: approaches, advances, and applications. Annu Rev Biomed Eng 2009;11:4979. [224] Zhang Y, Fonslow BR, Shan B, Baek MC, Yates 3rd JR. Protein analysis by shotgun/bottom-up proteomics. Chem Rev 2013;113(4):234394. [225] Aebersold R, Mann M. Mass-spectrometric exploration of proteome structure and function. Nature 2016;537(7620):34755. [226] Alves G, Yu YK. Improving peptide identification sensitivity in shotgun proteomics by stratification of search space. J Proteome Res 2013;12(6):257181. [227] Anand S, Samuel M, Ang CS, Keerthikumar S, Mathivanan S. Label-based and label-free strategies for protein quantitation. Methods Mol Biol 2017;1549:3143. [228] Bakalarski CE, Kirkpatrick DS. A biologist’s field guide to multiplexed quantitative proteomics. Mol Cell Proteom 2016;15(5):148997. [229] Bantscheff M, Lemeer S, Savitski MM, Kuster B. Quantitative mass spectrometry in proteomics: critical review update from 2007 to the present. Anal Bioanal Chem 2012;404(4):93965. [230] Chandramouli K, Qian PY. Proteomics: challenges, techniques and possibilities to overcome biological sample complexity. Hum Genomics Proteom 2009;2009:239204. [231] Chen Y, Vu J, Thompson MG, Sharpless WA, Chan LJG, Gin JW, et al. A rapid methods development workflow for high-throughput quantitative proteomic applications. PLoS One 2019;14(2):e0211582. [232] Donnelly DP, Rawlins CM, DeHart CJ, Fornelli L, Schachner LF, Lin Z, et al. Best practices and benchmarks for intact protein analysis for top-down mass spectrometry. Nat Methods 2019;16 (7):58794. 
[233] Proffitt JM, Glenn J, Cesnik AJ, Jadhav A, Shortreed MR, Smith LM, et al. Proteomics in non-human primates: utilizing RNA-Seq data to improve protein identification by mass spectrometry in vervet monkeys. BMC Genomics 2017;18(1):877. [234] Timp W, Timp G. Beyond mass spectrometry, the next step in proteomics. Sci Adv 2020;6(2) eaax8978. [235] Geyer PE, Holdt LM, Teupser D, Mann M. Revisiting biomarker discovery by plasma proteomics. Mol Syst Biol 2017;13(9):942. [236] Savaryn JP, Catherman AD, Thomas PM, Abecassis MM, Kelleher NL. The emergence of top-down proteomics in clinical research. Genome Med 2013;5(6):53. [237] Hsich G, Kenney K, Gibbs CJ, Lee KH, Harrington MG. The 14-3-3 brain protein in cerebrospinal fluid as a marker for transmissible spongiform encephalopathies. N Engl J Med 1996;335(13):92430. [238] Poser S, Mollenhauer B, Kraubeta A, Zerr I, Steinhoff BJ, Schroeter A, et al. How to improve the clinical diagnosis of Creutzfeldt-Jakob disease. Brain 1999;122(Pt 12):234551. [239] Satoh J, Kurohara K, Yukitake M, Kuroda Y. The 14-3-3 protein detectable in the cerebrospinal fluid of patients with prion-unrelated neurological diseases is expressed constitutively in neurons and glial cells in culture. Eur Neurol 1999;41(4):21625. [240] Ayoglu B, Chaouch A, Lochmu¨ller H, Politano L, Bertini E, Spitali P, et al. Affinity proteomics within rare diseases: a BIO-NMD study for blood biomarkers of muscular dystrophies. EMBO Mol. Med. 2014;6 (7):91836. [241] Bushby K, Finkel R, Birnkrant DJ, Case LE, Clemens PR, Cripe L, et al. Diagnosis and management of Duchenne muscular dystrophy, part 1: diagnosis, and pharmacological and psychosocial management. Lancet Neurol 2010;9(1):7793.

260

Chapter 12 Other omics approaches to the study of rare diseases

[242] Bushby KM, Thambyayah M, Gardner-Medwin D. Prevalence and incidence of Becker muscular dystrophy. Lancet 1991;337(8748):10224. [243] Koenig M, Beggs AH, Moyer M, Scherpf S, Heindrich K, Bettecken T, et al. The molecular basis for Duchenne versus Becker muscular dystrophy: correlation of severity with type of deletion. Am J Hum Genet 1989;45(4):498506. [244] Emery AE. Population frequencies of inherited neuromuscular diseases—a world survey. Neuromuscul Disord 1991;1(1):1929. [245] Signorelli M, Ayoglu B, Johansson C, Lochmu¨ller H, Straub V, Muntoni F, et al. Longitudinal serum biomarker screening identifies malate dehydrogenase 2 as candidate prognostic biomarker for Duchenne muscular dystrophy. J Cachexia Sarcopenia Muscle 2020;11(2):50517. [246] Avanesov AS, Ma S, Pierce KA, Yim SH, Lee BC, Clish CB, et al. Age- and diet-associated metabolome remodeling characterizes the aging process driven by damage accumulation. Elife 2014;3:e02077. [247] Berthoud HR. Mind versus metabolism in the control of food intake and energy balance. Physiol Behav 2004;81(5):78193. [248] Ellis JK, Athersuch TJ, Thomas LD, Teichert F, P´erez-Trujillo M, Svendsen C, et al. Metabolic profiling detects early effects of environmental and lifestyle exposure to cadmium in a human population. BMC Med 2012;10:61. [249] Everard A, Lazarevic V, Derrien M, Girard M, Muccioli GG, Neyrinck AM, et al. Responses of gut microbiota and glucose and lipid metabolism to prebiotics in genetic obese and diet-induced leptinresistant mice. Diabetes 2011;60(11):277586. [250] Fujisaka S, Avila-Pacheco J, Soto M, Kostic A, Dreyfuss JM, Pan H, et al. Diet, genetics, and the gut microbiome drive dynamic changes in plasma metabolites. Cell Rep 2018;22(11):307286. [251] Vigneau-Callahan KE, Shestopalov AI, Milbury PE, Matson WR, Kristal BS. Characterization of dietdependent metabolic serotypes: analytical and biological variability issues in rats. J Nutr 2001;131 (3):924s32s. 
[252] Graham E, Lee J, Price M, Tarailo-Graovac M, Matthews A, Engelke U, et al. Integration of genomics and metabolomics for prioritization of rare disease variants: a 2018 literature review. J Inherit Metab Dis 2018;41(3):43545. [253] Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol 2017;18(1):83. [254] Karczewski KJ, Snyder MP. Integrative omics for health and disease. Nat Rev Genet 2018;19 (5):299310. [255] Dettmer K, Aronov PA, Hammock BD. Mass spectrometry-based metabolomics. Mass Spectrom Rev 2007;26(1):5178. [256] Patti GJ, Yanes O, Siuzdak G. Innovation: metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol 2012;13(4):2639. [257] Larive CK, Barding Jr GA, Dinges MM. NMR spectroscopy for metabolomics and metabolic profiling. Anal Chem 2015;87(1):13346. [258] Gowda GAN, Djukovic D. Overview of mass spectrometry-based metabolomics: opportunities and challenges. Methods Mol Biol (Clifton, NJ) 2014;1198:312. [259] Wong KC. Review of NMR spectroscopy: basic principles, concepts and applications in chemistry. J Chem Educ 2014;91(8):11034. [260] Xiang Y, Liu Y, Lee ML. Ultrahigh pressure liquid chromatography using elevated temperature. J Chromatogr A 2006;1104(1-2):198202. [261] Twohig M, Shockcor JP, Wilson ID, Nicholson JK, Plumb RS. Use of an atmospheric solids analysis probe (ASAP) for high throughput screening of biological fluids: preliminary applications on urine and bile. J Proteome Res 2010;9(7):35907.

References

261

[262] Eberlin LS, Norton I, Orringer D, Dunn IF, Liu X, Ide JL, et al. Ambient mass spectrometry for the intraoperative molecular diagnosis of human brain tumors. Proc Natl Acad Sci U S A 2013;110 (5):161116. [263] Kerian KS, Jarmusch AK, Pirro V, Koch MO, Masterson TA, Cheng L, et al. Differentiation of prostate cancer from normal tissue in radical prostatectomy specimens by desorption electrospray ionization and touch spray ionization mass spectrometry. Analyst 2015;140(4):10908. [264] Balog J, Sasi-Szabo´ L, Kinross J, Lewis MR, Muirhead LJ, Veselkov K, et al. Intraoperative tissue identification using rapid evaporative ionization mass spectrometry. Sci Transl Med 2013;5(194) 194ra93. [265] Vaidyanathan S, Kell DB, Goodacre R. Flow-injection electrospray ionization mass spectrometry of crude cell extracts for high-throughput bacterial identification. J Am Soc Mass Spectrometry 2002;13 (2):11828. [266] Nanita SC, Kaldon LG. Emerging flow injection mass spectrometry methods for high-throughput quantitative analysis. Anal Bioanal Chem 2016;408(1):2333. [267] Jonsson P, Johansson AI, Gullberg J, Trygg J, Jiye A, Grung B, Marklund S, et al. High-throughput data analysis for detecting and identifying differences between samples in GC/MS-based metabolomic analyses. Anal Chem 2005;77(17):563542. [268] Katajamaa M, Oresic M. Data processing for mass spectrometry-based metabolomics. J Chromatogr A 2007;1158(12):31828. [269] Graham E, Lee J, Price M, Tarailo-Graovac M, Matthews A, Engelke U, et al. Integration of genomics and metabolomics for prioritization of rare disease variants: a 2018 literature review. J Inherit Metab Dis 2018;41(3):43545. [270] Smith CA, O’Maille G, Want EJ, Qin C, Trauger SA, Brandon TR, et al. METLIN: a metabolite mass spectral database. Ther Drug Monit 2005;27(6):74751. [271] Williams EG, Wu Y, Jha P, Dubuis S, Blattmann P, Argmann CA, et al. Systems proteomics of liver mitochondria function. Science 2016;352(6291) aad0189. 
[272] Villas-Boˆas SG, Højer-Pedersen J, Akesson M, Smedsgaard J, Nielsen J. Global metabolite analysis of yeast: evaluation of sample preparation methods. Yeast 2005;22(14):115569. [273] Want EJ, O’Maille G, Smith CA, Brandon TR, Uritboonthai W, Qin C, et al. Solvent-dependent metabolite distribution, clustering, and protein extraction for serum profiling with mass spectrometry. Anal Chem 2006;78(3):74352. [274] Yanes O, Tautenhahn R, Patti GJ, Siuzdak G. Expanding coverage of the metabolome for global metabolite profiling. Anal Chem 2011;83(6):215261. [275] Jones DP, Park Y, Ziegler TR. Nutritional metabolomics: progress in addressing complexity in diet and health. Annu Rev Nutr 2012;32:183202. [276] Patti GJ. Separation strategies for untargeted metabolomics. J Sep Sci 2011;34(24):34609. [277] Dudley E, Yousef M, Wang Y, Griffiths WJ. Targeted metabolomics and mass spectrometry. Adv Protein Chem Struct Biol 2010;80:4583. ` . Inborn errors of metabolism overview: pathophysiology, manifesta[278] Saudubray JM, Garcia-Cazorla A tions, evaluation, and management. Pediatr Clin North Am 2018;65(2):179208. [279] Vernon HJ. Inborn errors of metabolism: advances in diagnosis and therapy. JAMA Pediatr 2015;169 (8):77882. [280] Kanungo S, Patel DR, Neelakantan M, Ryali B. Newborn screening and changing face of inborn errors of metabolism in the United States. Ann Transl Med 2018;6(24):468. [281] Adhikari AN, Gallagher RC, Wang Y, Currier RJ, Amatuni G, Bassaganyas L, et al. The role of exome sequencing in newborn screening for inborn errors of metabolism. Nat Med 2020;26(9):13927. [282] Mordaunt D, Cox D, Fuller M. Metabolomics to improve the diagnostic efficiency of inborn errors of metabolism. Int J Mol Sci 2020;21:4.

262

Chapter 12 Other omics approaches to the study of rare diseases

CHAPTER 13

Challenges and opportunities in rare diseases research

Claudia Gonzaga-Jauregui

International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano (LIIGH), Universidad Nacional Autónoma de México (UNAM), Mexico

13.1 Introduction

As reviewed and discussed throughout this book, the study and understanding of human variation and disease have been closely linked to the development of technologies that allow us to interrogate the human genome at increasing resolution. The last 30 years have witnessed a revolution in the way we approach human and clinical genetics with the establishment of genomics as a bona fide discipline in the biological sciences that is not only technically doable but can also deliver insights into human biology. This revolution has been driven in great part by the development and application of human genomic sequencing, primarily through whole-genome sequencing (WGS) and targeted whole-exome sequencing (WES). October of 2020 marked the 30th anniversary of the beginning of the Human Genome Project (HGP), a major scientific feat of humanity that has transformed science and human health. The birth and development of human genomics in the last three decades, starting with the HGP, have changed modern biomedical sciences, our study of genetic disorders, both rare and common, our understanding of human genetics and human biology, and the practice of medicine for generations to come. Before the first report of the identification of the cause of a rare Mendelian disorder by WES in 2010, there were about 3000 Mendelian human phenotypes with a known molecular cause. Since then, through the application of WES and WGS, links between rare genetic phenotypes and their corresponding molecular cause have been established for about 3000 more [1–4]. The application of genomic approaches to the study of rare Mendelian diseases has been remarkably successful, as evidenced by the surge of publications reporting novel gene–disease associations in the last decade.
Beyond discovery, the implementation of this technology is providing accurate molecular diagnoses for patients with suspected genetic disorders, many of whom had undergone diagnostic odysseys before WES or WGS was able to provide a definitive answer. Although tremendous progress has been made and new understanding of rare diseases has been acquired in the last decade through the increasing application of genomic sequencing technologies to study and diagnose genetic disorders, challenges remain. Several of the major current challenges in the study and understanding of rare diseases are discussed in this chapter, echoing themes raised throughout this book (Fig. 13.1). The opportunities that rare diseases research provides to improve the lives of patients living with these conditions and increase our understanding of human biology are also explored.

Genomics of Rare Diseases. DOI: https://doi.org/10.1016/B978-0-12-820140-4.00013-2 © 2021 Elsevier Inc. All rights reserved.

FIGURE 13.1 Major challenges discussed in rare diseases research.

13.2 Challenges in rare diseases research

13.2.1 The N = 1 Problem

As the name denotes, individual rare diseases are rare. The definition of a rare disease varies slightly by country, but in general it refers to diseases that affect a relatively small number of people at any given time. In the United States (US), a rare disease is one that affects fewer than 200,000 people [5], whereas in Europe the definition refers to a disorder that affects fewer than 1 in 2000 people [6]. However, although individually rare, collectively rare diseases are common. It is estimated that between 6% and 7% of the world population may live with a rare disorder, resulting in approximately 350 million people worldwide. Even these numbers are likely underestimated given the number of “rare genetic diseases” that can be found among the phenotypic extremes of adults with “common disease phenotypes” and the estimated number of yet-to-be-discovered genetic disorders.

There are approximately 7000 to 8000 rare genetic disorders that have been described or suspected to date. The Online Mendelian Inheritance in Man (OMIM: www.omim.org) database catalogs these and keeps a record of the disease traits for which a molecular cause has been identified and published in the scientific literature, versus those that still remain to be molecularly characterized. As of December 2020, the molecular culprit had been defined for almost 6000 Mendelian genetic disease traits, while about 1500 remain unsolved [7]. Longitudinal data from OMIM show that the number of known and unresolved disorders is not static and that every month a few more disorders that had not been previously documented are published and enter the OMIM catalog. In the last 10 years, almost 300 novel disorders have been curated and added to OMIM per year, many of which had no clinical description in the literature prior to their identification through genomic approaches (see Fig. 4.5). These data support the notion, held by many researchers in the field of rare disease genomics, that there are in fact many more rare genetic disorders than generally appreciated. Different estimates put the number on the order of at least 10,000 and up to 25,000, considering that some genes may be associated with multiple genetic disorders [4,8]. However, the majority of these additional genetic disorders have likely not yet been described clinically in the literature because they affect very few people or occur in populations that have not been well studied.

Historically, geneticists have studied genetic disorders that mostly aggregate and segregate in families or specific populations, as described in earlier chapters (see Chapters 5–7). The availability of multiple affected individuals in large pedigrees allowed the development and application of linkage analyses and the implementation of human “disease gene” mapping strategies (see Chapter 10: Statistical Approaches to Rare Disease Analyses) through positional cloning approaches aimed to “fine map” the gene(s) and disease variants associated with the disease. However, we have perhaps now studied most of the “common” rare diseases or clinically recognizable and definable syndromes. The vast majority of the remaining and newly described disorders may occur in individual patients scattered around the world, small nuclear families dispersed among genetically uncharacterized populations, or among patients and families with no or limited access to modern healthcare and genomic medicine.
As discussed in Chapter 6, Dominant and Sporadic De Novo Disorders, the implementation of genomic sequencing through WES or WGS has allowed the study of mosaic and sporadic disorders caused by de novo mutations in the children of unaffected parents, enabling diagnoses in cases that would have been unsolvable before and increasing our understanding of mutagenesis processes, the rate of mutation in different human cells, and the contribution of de novo variation in the human genome to disease. Genomic sequencing of individual patients with rare genetic diseases may reveal variants in known disease-associated genes, enabling accurate and precise molecular diagnoses; however, in many other cases, especially when the disorder is clinically nonspecific or has not been previously described, genomic sequencing reveals potentially pathogenic variants in genes not previously associated with disease. Some of these variants may be very rare or even novel (new, not previously reported in databases or elsewhere), and their functional effect may be predicted to be very severe, causing loss-of-function (LoF) of the gene or significantly altering the structure or function of the encoded protein. Even so, this is not enough evidence to confidently associate such a candidate gene with the phenotype in the patient. One patient or family with a rare predicted deleterious variant in a new, uncharacterized gene represents insufficient evidence to robustly prove causality for novel gene discovery and phenotype association. This is known as “the N = 1 (N-of-1) Problem” of rare disease research. In order to confidently associate a previously unreported gene with a specific phenotype, enough evidence is necessary to support the association hypothesis being explored.
This evidence can be obtained through different and ideally multiple approaches, including experiments to demonstrate the functional effects of identified deleterious variants in the gene’s protein product or development of animal models that recapitulate the human observed phenotype when the gene is missing, altered, or the specific variants are introduced [9–11]. However, the strongest evidence for a novel gene–disease association is the identification of other individuals with the same variant or other deleterious variants observed in the same gene, sharing the same or similar clinical features of the disease.

Because of their rarity, as discussed above, many of these disorders will occur sporadically or at such low frequencies that it would be extremely challenging to find another similarly affected individual in the same study population. However, when considering the entirety of the human population, the probability of finding another patient elsewhere in the world increases. To facilitate the identification of patients affected with the same disorder anywhere in the world, the rare disease community has implemented informatics platforms that allow scientists to share their interest in a given gene with the global rare disease research community and connect with other scientists, clinicians, and families that may have identified similarly affected patients with variants in the gene of interest. Matchmaker Exchange is a federated network of platforms developed to share such data and interests among all rare disease stakeholders (patients, families, physicians, and scientists), and facilitate the solving of undiagnosed rare disorders [12,13]. As part of Matchmaker Exchange, GeneMatcher [14], PhenomeCentral [15], DECIPHER [16], and MyGene2 are some of the major platforms [17] that allow the capture and submission of genes of interest to share with the community. Implementation of initiatives such as Matchmaker Exchange has enabled the rapid dissemination and exchange of information among groups working on rare diseases, including families wanting to actively participate. However, further collaborative and data sharing efforts are necessary to facilitate and maximize the identification of additional patients and families with variants in novel candidate disease-associated genes that may help confirm or refute the gene–disease association hypothesis.
Improved data sharing capabilities, practices, and platforms to facilitate exchange around the world, in combination with in vitro and in vivo experimental data for poorly characterized genes, will likely reduce the impact of the “N-of-1 Problem” and facilitate novel disease gene discoveries and precise molecular diagnoses for patients with rare diseases.
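
The gene-centric matching idea behind such platforms can be illustrated with a toy sketch. The code below is not the Matchmaker Exchange API: the `Submission` record, the `find_matches` function, the gene symbols, and the phenotype term identifiers are all hypothetical, and ranking pairs by Jaccard similarity of phenotype terms is an assumption made here for simplicity. It only shows how two independent N = 1 observations might be connected through a shared candidate gene.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Submission:
    """A hypothetical candidate-gene submission from one research group."""
    submitter: str
    gene: str              # candidate gene symbol (placeholder names below)
    phenotypes: frozenset  # phenotype terms (illustrative HPO-style identifiers)


def jaccard(a, b):
    """Jaccard similarity between two sets of phenotype terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_matches(submissions):
    """Pair submissions from different submitters that share a candidate gene.

    Returns (submitter_1, submitter_2, gene, phenotype_similarity) tuples,
    sorted from the most to the least phenotypically similar pair.
    """
    matches = []
    for s1, s2 in combinations(submissions, 2):
        if s1.gene == s2.gene and s1.submitter != s2.submitter:
            matches.append((s1.submitter, s2.submitter, s1.gene,
                            jaccard(s1.phenotypes, s2.phenotypes)))
    return sorted(matches, key=lambda m: m[3], reverse=True)


# Invented example data: two groups independently flag the same gene.
submissions = [
    Submission("lab_A", "GENE1", frozenset({"HP:0001250", "HP:0001263"})),
    Submission("lab_B", "GENE1",
               frozenset({"HP:0001250", "HP:0001263", "HP:0000252"})),
    Submission("lab_C", "GENE2", frozenset({"HP:0000365"})),
]

for match in find_matches(submissions):
    print(match)
```

In this invented example, only the submissions from `lab_A` and `lab_B` are paired, because they name the same gene; the shared phenotype terms then give a rough indication of whether the two patients resemble each other clinically. Real platforms apply richer phenotype and variant comparisons together with consent and privacy controls that are not modeled here.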

13.2.2 Underrepresentation of non-European genetic ancestries

Historically, genetic and genomic studies have been mostly conducted in countries and populations of predominantly European descent. Undoubtedly, this is in large part due to the unequal distribution of wealth in the world and the greater economic power and science funding of European countries, the US, and Canada. However, even in countries with wide population genetic diversity, such as the US, few directed efforts had been made to include underrepresented groups in large-scale genomics projects that study diseases and health-related traits, such as genome-wide association studies (GWAS). This is due in part to the study design and statistical considerations of this type of project, which relied on single nucleotide polymorphisms (SNPs) used as markers of haplotypes in the population of study that would be in linkage disequilibrium with common variants thought to be associated with the disease. Because population haplotypes tend to break with time due to the process of meiotic recombination occurring with each successive generation, the older the population, the smaller and more diverse the haplotypes will be, whereas younger populations with fewer generations will preserve larger haplotype blocks. From a population genetics perspective, populations of African descent are much older than other major world populations such as European or East Asian populations, and therefore will have smaller haplotype blocks. Furthermore, because prehistoric populations that emigrated out of Africa to populate the rest of the world, giving rise to European, East Asian, or later Native American populations, underwent population bottlenecks and were subjected to founder effects many generations ago (see Chapter 5: Recessive Diseases and Founder Genetics), the genomic diversity of these populations is comparatively lower than that of populations that remained in Africa. The extent of genomic variation among world populations was initially observed by projects like HapMap [18,19] and 1000 Genomes [20,21] and further confirmed by WGS projects of individuals of different ancestries [1,22–28].

Because GWAS were initially performed in European ancestry populations and, as described, they interrogate SNPs across the genome that tag the different haplotypes segregating in the European population, association signals identified through these studies are rarely replicated in, or transferable to, other populations because the haplotypes are often different. Therefore efforts to replicate initial association signals would recruit additional cohorts of same-ancestry individuals so that the haplotype structure would be comparable between discovery and replication studies. This design of GWAS has exacerbated the underrepresentation of non-European ancestry individuals in genetics and genomics research, with almost 80% of individuals in these studies being of European background even though this ancestry represents only 16% of the world population [29–32]. These pervasive differences in population representation in genomics consequently increase health disparities among ancestry groups, as not all populations can benefit equally from insights derived from these studies. In the case of rare disease research, this underrepresentation problem is less pronounced because patients with rare genetic disorders can be anywhere in the world and of any genetic ancestry.
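
The haplotype argument made earlier in this section can be put in numbers with a standard population-genetic result: under random mating, linkage disequilibrium between two loci decays geometrically, D_t = D_0 (1 - r)^t, where r is the recombination fraction per generation and t is the number of generations. The sketch below uses illustrative values for r and for the generation counts; these are assumptions chosen for the example, not estimates for any real population, and the model ignores drift, selection, and admixture.

```python
def ld_after(d0, r, generations):
    """Expected LD coefficient D after a number of generations of random mating.

    Implements D_t = D_0 * (1 - r) ** t for two loci with recombination
    fraction r per generation.
    """
    return d0 * (1.0 - r) ** generations


D0 = 0.25  # maximum possible D for two biallelic loci at allele frequency 0.5
R = 0.001  # assumed recombination fraction between the two markers

# Hypothetical generation counts since the haplotype arose or a bottleneck occurred.
for label, t in [("younger population", 500), ("older population", 5000)]:
    print(f"{label}: t={t}, D={ld_after(D0, R, t):.4f}")
```

Under these assumptions the older population retains far less disequilibrium between the two markers, so a tag SNP chosen in the younger population is a poorer proxy for the same nearby variant in the older one.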
Since most rare diseases occur due to highly penetrant variants that affect the function of protein-coding genes, the gene–disease association is less dependent on population background: instead of tag SNPs used as proxies, or surrogate measures of association through linkage disequilibrium, as in GWAS, the causative variants are actually identified. Therefore, the gene–disease association will hold true no matter the genetic ancestry of patients or the world population from which they derive. Furthermore, rare disease research has historically looked at generally underrepresented populations because, in many cases, their structure and history make them the most valuable and amenable for family-based genetics and genomics studies, such as the Amish and Mennonite populations (as described in Chapter 5: Recessive Diseases and Founder Genetics), the Ashkenazi Jewish, Turkish, and other Middle Eastern populations. Family-based genomics approaches in some respects “democratize” human genetics and genomics throughout the world. However, even in the current genomic era with a few million exomes and genomes sequenced globally, the number of individuals with rare genetic disorders and family members sequenced as part of large-scale sequencing projects is dwarfed by the numbers of individuals genotyped and sequenced by GWAS initiatives [33]. Furthermore, when analyzing genomic data from rare disease patients, we use population frequency databases in order to inform and exclude high-frequency polymorphic and likely benign variants (see Chapter 4: Genomic Sequencing of Rare Diseases). Many of the publicly available databases that can be used for these purposes are in fact built by combining samples from many of these GWAS projects, such as the NHLBI Exome Sequencing Project [34], the Exome Aggregation Consortium [35], and more recently the Genome Aggregation Database (gnomAD) [36], which actually aggregates the previous two.
Because GWAS cohorts have had a historical bias towards European ancestry populations (for the reasons mentioned above), utilizing them as population controls has led to biases in the allelic spectrum that we have captured and that serves as a framework for identifying, cataloging, predicting, and nominating likely pathogenic variants in individuals with rare diseases. Rare disease patients of European ancestry are more likely to obtain an accurate molecular diagnosis by WES or WGS than individuals of other ancestries because reference databases are currently skewed toward a better interpretation of variation in a European background. Furthermore, variant misclassification and misdiagnosis are more likely to occur in individuals of underrepresented populations because of the lack of sampling of polymorphic variation in other world populations [37].

In recent years the genetics and genomics community has come to grips with the reality of underrepresentation of non-European ancestry individuals in genetic studies and genomic databases and the impact that this has in exacerbating health disparities. In fact, addressing it is one of the core objectives of the vision of the US National Human Genome Research Institute for the future of human genomics, to “strive for global diversity in all aspects of genomics research, committing to the systematic inclusion of ancestrally diverse and underrepresented individuals in major genomic studies” [38]. Initiatives like the “All of Us” research program from the US National Institutes of Health (NIH) aim to address this problem and diversify genomic efforts by recruiting a large percentage of participants from underrepresented populations [39]. The genetics and genomics community worldwide might all benefit from directing more attention and resources to increasing the inclusion and diversity of participants from various ancestry backgrounds and underrepresented world population groups in genomics studies in responsible and ethical ways [40]. A broader ascertainment of the allelic architecture of populations from around the world will not only expand our understanding of pathogenic and benign variation but will facilitate and improve the analyses and molecular diagnoses of non-European ancestry patients with rare diseases.
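
To illustrate how the composition of a reference database affects variant filtering, the sketch below applies a simple maximum-allele-frequency filter. All variant names, ancestry labels, frequencies, and the 1% cutoff are hypothetical values invented for this example; they are not taken from gnomAD or any other resource, and real pipelines choose cutoffs according to the disease model and inheritance pattern.

```python
RARE_CUTOFF = 0.01  # assumed cutoff, for illustration only

# Hypothetical allele frequencies per ancestry group for three variants.
variants = {
    "var1": {"european": 0.0000, "african": 0.0000, "east_asian": 0.0000},
    "var2": {"european": 0.0002, "african": 0.0450, "east_asian": 0.0001},
    "var3": {"european": 0.1200, "african": 0.0900, "east_asian": 0.1500},
}


def rare_candidates(variants, reference_groups, cutoff=RARE_CUTOFF):
    """Keep variants whose maximum frequency across the given groups is below cutoff."""
    kept = []
    for name, freqs in variants.items():
        max_af = max(freqs[group] for group in reference_groups)
        if max_af < cutoff:
            kept.append(name)
    return kept


# Reference restricted to one ancestry group versus a broader reference.
print(rare_candidates(variants, ["european"]))
# prints ['var1', 'var2']
print(rare_candidates(variants, ["european", "african", "east_asian"]))
# prints ['var1']
```

With the European-only reference, `var2` survives the filter as an apparently rare candidate even though, in this invented dataset, it is a common polymorphism in another group. This mirrors the misclassification risk described above for patients from populations that are poorly sampled in reference databases.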

13.2.3 Unequal access to genomic initiatives aimed at understanding health and disease

The value and impact of genomic sequencing for disease diagnosis and management have been repeatedly demonstrated, including its economic advantages and benefits to public health by reducing the time from disease presentation to diagnosis [41–47]. However, the majority of large-scale genomic research projects have been conducted in Europe, the US, and Canada, including those focused on rare diseases. Consequently, in spite of the clear and revolutionary impact of genomic sequencing approaches, the insights and benefits derived from these studies do not reach patients around the world equally. Even in developed countries such as the US, the cost of clinical WES may be prohibitive for low-income families, sometimes depriving them of a definitive molecular diagnosis. Research projects such as the NIH Undiagnosed Diseases Program [48,49] and the Centers for Mendelian Genomics [50] can compensate for some of these access issues through research-based genomic sequencing projects; however, not all patients and families who may benefit from a genomic analysis are guaranteed to be included in these studies. This access problem is much more pronounced in underdeveloped countries, where the lack of research funding and the cost of genomic technologies stymie the implementation of these approaches. Initiatives such as the Human Heredity and Health in Africa (H3Africa) initiative, supported by the US NIH and the UK Wellcome Trust, have stimulated the development of the field of human genomics in African countries [51]. However, similar initiatives are needed to bring the reality and benefits of human genomics and precision health to other regions of the world.
Nationwide programs to facilitate access to genomic diagnostics for patients with suspected genetic disorders, such as those implemented in the United Kingdom [52], are needed in all countries, further capitalizing on knowledge gained through human genomics research efforts to date. Special emphasis should be placed on ensuring that patients and families from underserved populations and communities, such as low-income and indigenous populations, can benefit from the advances in human genomics [53,54]. Because an accurate molecular diagnosis is the first step to enable proper disease management and therapeutic options for patients living with rare diseases, access to genomic technologies for disease diagnosis and research should become available to all patients regardless of ancestry background, socioeconomic status, or geographical location.

13.2.4 Genetic heterogeneity, clinical variability, and phenotypic expansion of rare diseases

Diseases and specific clinical syndromes have been historically cataloged and studied based on their clinical signs and symptoms. The application of genomic technologies to molecularly characterize patients with disease phenotypes has demonstrated that clinical entities do not always share the same molecular etiology, and many disorders previously considered clinically well defined are more genetically heterogeneous than assumed. It has been shown that diagnostic rates are inversely correlated with the level of genetic heterogeneity for a given phenotype. Moreover, the observation of multiple molecular defects leading to combined or blended phenotypes in rare disease patients not only highlights the difficulty in clinical assessment of patients with rare disorders but also exemplifies the usefulness of genome-wide approaches for identifying pathogenic variation in multiple loci that contribute to disease presentation and phenotypic variability [55,56]. The generally held paradigm of “one-gene-one-disease” in medical genetics has been repeatedly challenged by the implementation of unbiased genomic sequencing approaches such as WES and WGS. “Umbrella clinical designations” and terminology for phenotypically heterogeneous disorders such as autism spectrum disorder (ASD), intellectual disability and developmental delay, or congenital heart defects have been broken down into a multitude of individual genetic disorders based on molecular diagnostics. ASD is a particularly relevant example: the majority of cases had escaped a genetic diagnosis until the advent of genomic sequencing, which revealed through trio-based analysis approaches the vast contribution and array of de novo mutations in hundreds of genes contributing to the disorder (see Chapter 6: Dominant and Sporadic De Novo Disorders).
Each of these individual rare genetic disorders, in turn, has features consistent with ASD (impaired social interaction and communication exhibiting restricted interests and/or repetitive behaviors) but may also encompass other neurodevelopmental, neuropsychiatric, or syndromic phenotypes such as intellectual disability, seizures, behavioral problems, or dysmorphic features. Therefore patient characterization, disease management, and therapeutic opportunities will likely improve as the field moves from clinical phenotype-based diagnoses of patients with rare diseases to molecular-based diagnostics. It is important to also consider that patients with pathogenic variants in the same gene often present with similar but not identical clinical features and outcomes, posing the question of what constitutes the clinical spectrum of a given molecularly defined disorder versus the role of genetic modifiers that influence phenotype expressivity among patients. Newly identified rare diseases need to be studied in depth beyond the first report of a novel gene–disease association, as documenting additional patients based on a molecular diagnosis helps to identify the core versus variable features of these new diseases [57]. As delineated in Chapter 9, Multilocus Inheritance and Variable Disease Expressivity in Rare Disease, the occurrence of pathogenic variants in a second gene or additional genes involved in similar disease-associated pathways, through multilocus inheritance or mutational burden phenomena, can modify the disease presentation and severity. Therefore deep phenotyping and molecular characterization of patient cohorts ascertained based on their primary molecular diagnosis will help to better understand the phenotypic spectrum of individual rare disorders and the role of genetic modifiers, potentially opening avenues to therapeutic options.

Additionally, genomic studies have identified well-characterized disease genes that give rise to allelic disorders with novel clinical presentations due to multiple inheritance patterns or different effects of mutations altering the function of the encoded protein beyond dosage effects. Genes that had been historically associated with autosomal dominant disorders have been found to harbor biallelic pathogenic variants in rarer and more severe disease cases that may or may not share clinical features with the autosomal dominant presentation. Such is the case for the tyrosyl-tRNA synthetase gene, YARS1, where heterozygous pathogenic variants are associated with dominant intermediate Charcot-Marie-Tooth neuropathy [58] (CMTDIC, MIM #608323), but more recently identified homozygous rare variants have been reported in patients with a multisystem rare disorder characterized by poor growth, developmental delay, brain dysmyelination, sensorineural hearing loss, metabolic and immune abnormalities, and chronic pulmonary disease [59,60]. Conversely, heterozygous mutations affecting different protein domains or leading to truncated abnormal proteins versus haploinsufficiency can result in different allelic disorders for a particular gene.
A premier example of allelic heterogeneity is fibrillin 1 (FBN1), where mutations in the gene have been associated with a spectrum of genetic disorders known as fibrillinopathies, ranging from the more commonly observed Marfan syndrome (MIM #154700), generally due to LoF variants, to neonatal progeroid Marfan lipodystrophy syndrome [61] (MIM #616914), due to truncating variants in the ultimate or penultimate exons of the gene that result in a transcript that escapes nonsense-mediated decay and produces a truncated protein with an abnormal C-terminal region [62]. The concept of allelic affinity, wherein different diseases converge on the same locus, can only be appreciated through in-depth mutational studies at a given locus. The more individuals are sequenced and phenotypically characterized, the better and more complete our understanding of disease architecture and pathogenic variation will become. The medical genetics community ought to move to a more objective definition of disease based on molecular etiology rather than fulfillment of a list of clinical signs and symptoms. Only then can one begin to dissect the main effects of pathogenic variants in specific genes, perturbations of biological homeostasis, and resulting clinical phenotypes, and begin interrogating modifying effects of other genes in the context of molecularly defined disease entities.

13.2.5 Insufficient knowledge of gene and genome function

While significant strides and achievements have been made in the study and characterization of the function of a multitude of genes in humans and other organisms, it is not until we encounter patients with rare disease phenotypes that we realize the extent of knowledge that remains to be acquired on the function and biology of many genes. This, in conjunction with the “N = 1 Problem”, accounts for the delay in reporting novel disease–gene associations, as the body of evidence to support such discoveries needs to be built around the normal and altered function of the gene and the mechanism of disease. The estimated number of protein-coding genes in the human genome is about 20,000; however, according to OMIM, we only know the phenotypic consequences of altering about 4000 of them. Of course, we may not expect to observe a phenotype for every single protein-coding gene, since some of them might be important for embryonic development and their dysfunction is likely to be incompatible with life, leading to embryonic lethality. WES analyses of consanguineous families with recurrent fetal loss have started to identify some of these embryonic lethal genes [63]. Combined with systematic mouse knockout and phenotypic studies [64], we can start to build a catalog of human genes necessary for proper embryonic development that are unlikely to have a postnatal phenotype unless occurring in mosaic forms (see Chapter 8: Mosaicism in Rare Disease). Nevertheless, hypomorphic alleles and allelic combinations, perhaps including noncoding regulatory variants, may underlie certain birth defects rather than embryonic lethality, as some initial studies are starting to illustrate [65,66]. Notwithstanding, a large fraction of protein-coding genes is likely not embryonic lethal, yet a human phenotype remains to be identified in association with alterations to them.

In the last 5 years, the rate of publication of novel Mendelian disorders has started to decline compared to the steep increase of reports of novel gene–disease associations observed in the first half of the 2010s [4]. A major contributor to the slowing and apparent decline in the reporting of novel gene–phenotype associations is the increased need to functionally characterize ab initio the variants and novel candidate disease genes to demonstrate the gene–phenotype association.
In many cases, this involves the exploration of genes for which nothing but their expression profile has been described; in other cases, the new genes have only been predicted to be protein-coding based on their genomic structure, and no function or other information is known. Ultimately, the fact remains that our knowledge of the function of the majority of the genes in the human genome is shallow at best, let alone our understanding of variation in regulatory and noncoding regions of the genome and its possible phenotypic consequences. Large-scale profiling initiatives such as the Genotype-Tissue Expression (GTEx) project are building publicly available catalogs of RNA sequencing data for most genes in the genome across multiple tissues [67]. Large expression datasets in humans and other organisms [68] can help prioritize candidate disease genes based on their expression pattern in tissues relevant to the disease under study. Similarly, large-scale projects and resources that systematically alter orthologous genes in model organisms and report the functional effects and phenotypic consequences of their disruption provide invaluable information on gene function in vivo [64,69–76]. Other resources that systematically probe the function of genes in vitro or in vivo will continue to be necessary and valuable to provide relevant information on previously uncharacterized or poorly characterized genes. Additionally, the implementation of other omics approaches, including transcriptomics, proteomics, metabolomics, and epigenomics (reviewed in Chapter 11: Transcriptomics in Rare Diseases and Chapter 12: Other Omics Approaches to the Study of Rare Diseases), is likely to become more common and streamlined as part of the characterization of the downstream effects of pathogenic variation identified by genomics, as well as of the upstream epigenomic regulatory landscape of the genome. 
These methods and technologies will provide additional information on the molecular phenotypes resulting from alterations of genes, such as up/downregulation of genes/transcripts/proteins involved in specific biological pathways or the presence/absence of metabolites in blood or other bodily fluids or tissues resulting from alterations in certain metabolic and biological pathways. The integration of multiple data sources will become increasingly necessary to inform our understanding of the biology of human genes. However, ad hoc experimentation and data-driven hypothesis testing in vitro and in vivo, based on phenotypes observed in patients with rare diseases, will be essential to dissect and fully understand the underlying biology, pathways, and mechanisms of disease for newly identified disease-associated genes.
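
As a minimal illustration of the expression-based prioritization of candidate disease genes described above, the following Python sketch ranks candidates by how strongly they are expressed in a tissue relevant to the disease. The gene names, TPM values, the specificity score, and the 1 TPM threshold are all hypothetical placeholders chosen for illustration; they are not taken from GTEx or from any published pipeline.

```python
from typing import Dict, List, Tuple

# Hypothetical expression table: gene -> {tissue: median TPM}.
# Gene names and values are placeholders, not real GTEx measurements.
EXPRESSION_TPM: Dict[str, Dict[str, float]] = {
    "GENE_A": {"brain": 85.0, "liver": 2.0, "muscle": 1.5},
    "GENE_B": {"brain": 0.4, "liver": 60.0, "muscle": 3.0},
    "GENE_C": {"brain": 12.0, "liver": 11.0, "muscle": 13.0},
}


def tissue_specificity(profile: Dict[str, float], tissue: str) -> float:
    """Fraction of a gene's summed expression found in the tissue of interest."""
    total = sum(profile.values())
    return profile.get(tissue, 0.0) / total if total > 0 else 0.0


def prioritize(
    candidates: List[str], tissue: str, min_tpm: float = 1.0
) -> List[Tuple[str, float]]:
    """Keep candidates expressed at or above min_tpm in the tissue, ranked by specificity."""
    scored: List[Tuple[str, float]] = []
    for gene in candidates:
        profile = EXPRESSION_TPM.get(gene)
        # Skip genes with no expression data or too little expression in the tissue.
        if profile is None or profile.get(tissue, 0.0) < min_tpm:
            continue
        scored.append((gene, tissue_specificity(profile, tissue)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = prioritize(["GENE_A", "GENE_B", "GENE_C"], tissue="brain")
for gene, score in ranked:
    print(f"{gene}\t{score:.2f}")
```

With these placeholder values, GENE_B is dropped because it falls below the threshold in brain, and GENE_A ranks ahead of GENE_C. In practice, expression would be only one line of evidence, weighed alongside model organism phenotypes and functional data.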

13.3 Opportunities in rare diseases research

13.3.1 Insights into novel biology

In addition to the diagnostic value and the evident direct benefit that molecular diagnostics represents for rare disease patients and families, the application of genomic sequencing to rare diseases research offers insights into human biology not previously explored. As described above, the sparse knowledge, and in some cases complete lack of knowledge, of the function of the majority of genes in the human genome is a major challenge when it comes to validating and acquiring enough evidence to confidently associate a gene with a novel disorder. It also represents an untapped opportunity to unveil novel biological relationships and functions of these genes and to learn more about human biology.

13.3.2 Therapy development for rare diseases

A major opportunity of research in rare diseases in terms of benefit and impact, although also a great challenge, is the development of therapeutic avenues for the millions of people in the world who live with a rare disease. The study and understanding of the molecular defects that lead to rare genetic disorders is only the beginning of the road toward therapeutic intervention, therapy selection, and possible therapy development. Obtaining an accurate molecular diagnosis for individuals with rare diseases not only provides an answer to patients and families but also opens the possibility of improved disease management that may enhance quality of life and of the development of molecularly tailored therapeutics. In some cases, knowledge of the molecular defect and the biological pathways affected can identify already available drugs and compounds that can be used or repurposed for treatment [77]. Table 13.1 summarizes the major therapeutic approaches to the treatment of rare genetic disorders.

Table 13.1 Major therapeutic approaches to treat rare diseases, their mechanisms, possible disadvantages, and selected examples.

Therapeutic approach | Therapeutic mechanism | Possible disadvantages | Examples of rare diseases where applicable
Dietary restriction and management | Focuses on restricting the ingestion of the “toxic” substrate or metabolite to bypass the defective metabolic pathway | Long-term compliance to diet may be a problem | Phenylketonuria (PKU)
Dietary therapy | Focuses on providing specialized dietary mixtures of compounds designed specifically to bypass the metabolic defect while providing proper nutrition for growth and development | Nonpalatable; prone to patient noncompliance | Glutaric aciduria type 1 (GA1)
Product/cofactor supplementation | Focuses on providing the missing metabolite exogenously to rescue the metabolic pathway | Possible lack of compliance | Biotinidase deficiency; L-Dopa and 5-HTP in SRD
Small molecule/antibody therapy | Aims to utilize pharmacological small molecule compounds or therapeutic antibodies to agonize or antagonize/inhibit the function of target proteins | Nonspecific mechanism of action of small molecules; antibodies can be immunogenic | JAK inhibitors; anti-PCSK9 antibodies in familial hypercholesterolemia (FH)
Pharmacological chaperone therapy | Aims to aid in the proper folding of mutant proteins so that they can perform their proper function | Not suitable for patients with LoF mutations | Fabry disease (FD)
Enzyme replacement therapy (ERT) | Replaces the defective enzyme in the pathway by providing an exogenous functional enzyme, via intravenous infusion, to perform the necessary biochemical reactions | Costly; infusions can be burdensome to patients and family; potentially immunogenic | Lysosomal storage diseases: Gaucher disease (GD), Pompe disease, etc.
Organ/HSC transplantation (Tx) | Aimed at replacing the major organ where the defective enzyme is most expressed and the function is needed | Dependent on organ availability; possibility of graft versus host disease (GvHD) and rejection of organ | Liver Tx in maple syrup urine disease (MSUD); HSCT in primary immunodeficiency disorders (PIDs)
Gene expression/modulation therapy | Uses oligonucleotide molecules (siRNAs, miRNAs, aptamers, antisense oligonucleotides [ASOs]) to modulate gene expression | Do not cross the BBB, requiring invasive delivery methods for the CNS | Spinraza in spinal muscular atrophy (SMA)
Gene/cell therapy | Aims to provide a normal functioning copy of the defective gene directly to cells using viral vectors | Immunogenic, possibly oncogenic | Primary immunodeficiency disorders (PIDs)
Genome editing | Promises to perform targeted individualized gene repair to restore normal function by directly modifying the DNA | Nonspecific genome editing, off-target effects | To be determined, technically all diseases

Prior to the genomics revolution, the study and identification of the genetic defects in classic rare disorders such as inborn errors of metabolism (IEMs) laid the groundwork for precision medicine guided by genetic information. Early diagnosis of IEMs through newborn screening, biochemical testing, and/or targeted molecular testing is essential for preventing metabolic crises that may lead to irreparable organ damage, including irreversible neurological damage and long-term disability in many cases. Successful management of IEMs depends on an accurate genetic diagnosis, and some interventions have been as simple as diet management and restriction of certain foods or compounds that may trigger metabolic crises, or substitution of nutrition with medical foods designed specifically to address the nutritional needs and limitations of patients with certain
IEMs [78,79]. An example of the successful management of an inborn error of metabolism through dietary therapy is glutaric aciduria type 1 (GA1, MIM #231670) caused by biallelic pathogenic variants in the GCDH gene encoding the glutaryl-CoA dehydrogenase enzyme involved in lysine catabolism. If undiagnosed and untreated at an early age, GA1-associated metabolic crises can cause striatal degeneration of the brain in affected children before age two. However, if identified early ideally through newborn screening, management through strict dietary therapy of lysine-free, arginine-enriched metabolic formula supplemented with L-carnitine during the first 2 years of life has been shown to be safe and effective, supporting normal growth and psychomotor development while preventing brain injury in these patients [80,81]. Another major therapeutic approach to the management of rare metabolic diseases has been enzyme replacement therapy (ERT), particularly successful for the treatment of lysosomal storage disorders. In genetic disorders where pathogenic variants lead to loss or reduced function of protein enzymes important for metabolic processes, exogenous supplementation of purified or recombinant proteins can salvage the metabolic defect avoiding the build-up of toxic metabolites [78]. Gaucher disease type 1 (MIM #230800) is caused by biallelic LoF variants in the GBA gene resulting in deficient activity of the beta-glucocerebrosidase enzyme that breaks down glucosylceramide into ceramide and glucose. Without a functioning enzyme, glucosylceramide (GlcCer, also known as glucosylcerebroside) accumulates intracellularly in the lysosomes of cells in multiple tissues and organs, including the liver, spleen, bone marrow, and lungs; and resulting in hepatosplenomegaly, anemia, thrombocytopenia, interstitial lung disease, bone necrosis, among other clinical features. 
Exogenous infusion of purified or recombinant glucocerebrosidase enzyme or analogs can restore the pathway function in GD patients and improve the major organ problems. However, ERT can be costly and burdensome to patients and families due to the need for frequent life-long intravenous (IV) infusions necessary to maintain enzyme levels in the organism, and in some cases, patients can develop neutralizing antibodies against the exogenous enzyme, rendering ERT inefficient. Therefore alternatives to ERT such as substrate reduction therapies have been developed to reduce the accumulation of GlcCer through small molecule compounds that inhibit the enzyme ceramide glucosyltransferase involved in the production of GlcCer. These small molecule compounds can be taken orally as daily capsules or tablets, an improvement over IV administration of ERTs without the risk of developing neutralizing antibodies. However, the indication for the use of these compounds relies on GD patients having residual activity of the beta-glucocerebrosidase enzyme since the objective is to reduce the accumulation of the substrate allowing the endogenous enzyme to still process GlcCer, even if at reduced efficiency. Similarly, in cases where pathogenic variants in GBA lead to protein misfolding and accumulation in the endoplasmic reticulum (ER), pharmacological chaperone therapy (PCT) can be utilized to help the mutant protein fold properly and be exported out of the ER to perform its function if the catalytic activity of the enzyme is preserved [71]. PCT has shown promising results, comparable with ERT [82], in treating Fabry disease (MIM #301500), another lysosomal storage disorder due to deficiency of the alpha-galactosidase A enzyme, in patients with missense pathogenic variants in the GLA gene that result in abnormal folding but retain catalytic activity. 
One of the first examples of precision medicine in the genomic era was the report of dizygotic twins who were diagnosed with cerebral palsy in infancy and, through WGS, were eventually diagnosed in their teenage years with dopa-responsive dystonia associated with sepiapterin reductase (SPR) deficiency (MIM #612716) [83]. The twins were found to share compound heterozygous variants affecting the function of SPR, an enzyme important for the biosynthesis of tetrahydrobiopterin (BH4). Appropriate levels of BH4 are necessary for the proper function of enzymes in the dopamine and serotonin neurotransmitter pathways, as well as for the metabolism of phenylalanine and other compounds. Experimental treatment of the twins with levodopa (L-dopa), a dopamine precursor that is converted to dopamine in the brain, before they obtained an accurate molecular diagnosis greatly improved their movement problems; however, they still had breathing, gastrointestinal, and other medical problems that escaped proper management. Once SPR deficiency was diagnosed through genomic sequencing, the twins were treated with 5-hydroxytryptophan (5-HTP), a naturally occurring amino acid and precursor in serotonin biosynthesis, thereby rescuing the other major metabolic pathway affected by SPR deficiency and resolving their outstanding medical issues [83]. Another example of successful therapeutic management derived from molecular diagnostic insights is the indication for hematopoietic stem cell transplantation (HSCT) in primary immunodeficiency disorders (PIDs). Without treatment, many PIDs can be lethal in the first years of life due to recurrent infections and the inability to mount, develop, or regulate appropriate immune responses [84]. Depending on the genetic defect, reconstitution of the immune system through HSCT can be indicated and can be the best treatment option in PID patients, with a high success rate (>90%), making it mostly curative for the disease. Although highly successful, HSCT from related or matched donors in PID patients can entail risks of infection and graft versus host disease (GvHD) that can be lethal in some cases. Therefore, over recent decades, gene therapy has been a primary focus of therapeutic development for patients with PIDs. 
Previous gene therapy approaches based on in vitro supplementation of a functional copy of the affected gene using viral vectors proved effective in curing the PID phenotype; however, adverse events associated with the use of the viral vectors limited the clinical implementation of gene therapy in PID cases. Recent advances in gene modification technologies using CRISPR/Cas9 and its variations promise to overcome the limitations of viral vector-mediated gene therapy approaches [85]. Ex vivo gene modification of autologous hematopoietic stem cells is a promising curative approach that may eliminate some of the risks associated with GvHD in patients with PIDs. These examples illustrate how knowledge of the exact molecular defects in patients with rare diseases can open an array of therapeutic possibilities, some of them better suited to particular types of mutations, while others are aimed at rescuing the specific metabolic pathways affected. Moreover, in most rare diseases an early genetic diagnosis is essential to the success of the therapeutic approach. Therefore, early and accurate diagnosis of patients with rare diseases through comprehensive newborn screening programs followed by molecular testing and/or implementation of genomic sequencing approaches is fundamental to enable precision medicine and optimize individual patient-tailored therapy selection.
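
The examples in this section suggest that the choice among therapeutic approaches depends on the kind of molecular defect involved. The following Python sketch encodes only the prerequisites mentioned in the text for an enzyme deficiency such as Gaucher disease. The class, its field names, and the rules are hypothetical simplifications for illustration, not a clinical decision tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EnzymeDefect:
    """Hypothetical summary of a patient's enzyme defect (illustration only)."""

    residual_activity: bool        # some endogenous enzyme activity remains
    misfolded_but_catalytic: bool  # protein misfolds yet retains catalytic activity
    neutralizing_antibodies: bool  # antibodies against the infused enzyme were detected


def candidate_approaches(defect: EnzymeDefect) -> List[str]:
    """List the approaches whose prerequisites, as described in the text, are met."""
    options: List[str] = []
    # Neutralizing antibodies can render enzyme replacement ineffective.
    if not defect.neutralizing_antibodies:
        options.append("enzyme replacement therapy")
    # Substrate reduction relies on residual endogenous enzyme activity.
    if defect.residual_activity:
        options.append("substrate reduction therapy")
    # Chaperones help only if the misfolded enzyme is still catalytically active.
    if defect.misfolded_but_catalytic:
        options.append("pharmacological chaperone therapy")
    return options


missense_case = EnzymeDefect(
    residual_activity=True,
    misfolded_but_catalytic=True,
    neutralizing_antibodies=False,
)
print(candidate_approaches(missense_case))
```

For a complete loss of function with no residual activity, this sketch would return only enzyme replacement therapy, which mirrors the reasoning in the text that substrate reduction and chaperone therapies depend on a partially functional endogenous enzyme.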

13.3.3 Implementation of precision medicine and population health

The objective of precision medicine should be not only to treat patients and their genetic diseases reactively but also to assess individuals based on their genetic susceptibilities and characteristics in order to prevent disease prior to clinical onset. The GA1 example above illustrates this point: early genetic diagnosis and therapeutic intervention can prevent metabolic disease and life-long morbidity.
In milder presentations of disease or those of adult onset, individuals can often go undiagnosed and untreated. Underappreciation of the contribution of genetics to adult-onset diseases can also prevent patients from being evaluated by medical geneticists and obtaining an accurate diagnosis for their conditions. Large-scale genomic sequencing projects of population cohorts unascertained for any particular disease phenotype are contributing to our understanding of the prevalence and clinical manifestations of rare diseases in adult cohorts that are more representative of the general population. Genetic analyses in cohorts such as the Geisinger DiscovEHR cohort [86] and the UK Biobank [87] are making it possible to assess the prevalence and burden of Mendelian genetic disorders in the general population. For example, genotype-driven analyses of participants in the Geisinger DiscovEHR cohort showed that genetic lipodystrophy disorders are more prevalent than previously reported in the literature and that partial lipodystrophy is an important cause of metabolic disorder and type 2 diabetes diagnoses in this cohort and likely in the general population [88]. Additionally, a survey of pathogenic variation in this cohort showed that about 3.5% of the participants carried variants in clinically actionable Mendelian disease genes that met the criteria for return of actionable secondary findings according to the guidelines recommended by the American College of Medical Genetics and Genomics [89,90]. A similar survey of the genomic data for UK Biobank participants showed that 2% of individuals harbored actionable pathogenic variants in Mendelian disease genes [87]. These population cohorts allow for unbiased surveys of pathogenic and likely pathogenic variation in known and recently discovered Mendelian genes and can provide insights into their phenotypic and variation spectrum, as well as into disease prevalence for rare Mendelian diseases and for those that are more common yet underdiagnosed. 
As the genomic endeavors continue to move toward recruitment and study of larger unascertained population cohorts, the lessons learned from the study of rare diseases ought to be kept in mind and applied to the study of these populations. Assessment of the burden of Mendelian disease in these large population cohorts can provide better estimates of disease prevalence agnostic to patient ascertainment bias. Conversely, molecularly driven ascertainment of Mendelian diseases in these cohorts can unveil patients suffering from yet undiagnosed rare disorders or at risk of developing these diseases, as evidenced by multiple examples of actionable clinical variants being returned to patient-participants of genomic sequencing projects [91]. Furthermore, the general population is, in reality, a conglomerate of multiple families, each segregating their own rare variants and phenotypes. Therefore, leveraging genetically derived familial relationships [92] and exploring within these large population cohorts the “Clan Genomics” hypothesis, that posits recent rare variation and mutations arising in a family or “clan” as main contributors to disease-related traits compared to common variants inherited from distant ancestors [93], can provide additional opportunities for disease gene identification. Understanding the variation spectrum in a given population associated with genetic disorders can help develop and implement public health policies that aim to identify at-risk individuals before they present disease, either through newborn screening programs or targeted screening approaches. Early detection and prevention of diseases based on molecular diagnostics can reduce morbidity and mortality in the population while being economically beneficial, avoiding diagnostic odysseys and unnecessary and inefficient tests and treatments.
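
The cohort surveys described in this section amount to counting the participants who carry at least one variant from a curated list of actionable variants. A minimal Python sketch follows; the participant identifiers, variant labels, and the cohort size of 100,000 are hypothetical, and only the 3.5% and 2% fractions come from the studies cited above.

```python
from typing import Dict, Set


def carrier_fraction(genotypes: Dict[str, Set[str]], actionable: Set[str]) -> float:
    """Fraction of participants carrying at least one actionable variant."""
    if not genotypes:
        return 0.0
    carriers = sum(1 for variants in genotypes.values() if variants & actionable)
    return carriers / len(genotypes)


def expected_carriers(cohort_size: int, fraction: float) -> int:
    """Expected number of carriers in a cohort for a given carrier fraction."""
    return round(cohort_size * fraction)


# Toy cohort: identifiers and variant labels are placeholders.
toy_cohort: Dict[str, Set[str]] = {
    "P1": {"varA"},
    "P2": set(),
    "P3": {"varB", "varC"},
    "P4": set(),
}
toy_actionable: Set[str] = {"varC", "varD"}

toy_fraction = carrier_fraction(toy_cohort, toy_actionable)
print(f"Toy cohort carrier fraction: {toy_fraction:.2f}")

# Scale the reported fractions to a hypothetical cohort of 100,000 participants.
for label, fraction in [("3.5% carrier fraction", 0.035), ("2% carrier fraction", 0.02)]:
    print(label, expected_carriers(100_000, fraction))
```

At these fractions, a hypothetical cohort of 100,000 participants would contain on the order of 3500 or 2000 individuals with a returnable finding, which gives a sense of the scale at which molecularly driven ascertainment operates.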

13.3.4 Drug development derived from rare diseases insights

Another important area of opportunity in the study of rare diseases is the derivation of insights from patients with rare diseases and the genes affected to inform therapeutic development for both rare and
common diseases. It has been reported that the main reasons for drug candidate attrition during clinical development are efficacy, followed by safety, with about 50% and 20% of drug failures due to these two predominant factors, respectively [94]. Human genetics insights have gained increasing recognition as an important and determinant factor in drug development and success. It has been observed that target genes for “successful” approved drugs are significantly enriched for having associations to defined human traits. This observation is even more significant for genes associated with Mendelian traits and phenotypes cataloged in OMIM versus those with associations to traits based solely on GWAS approaches [95]. Therefore efforts to complete the catalog of human Mendelian diseases and traits not only provide diagnostic relevant information but also insights into therapeutically actionable pathways and target genes for efficacious and safe drug development. The “poster child” for drug targeting based on human genetics insights is the development of PCSK9 inhibitors. Familial hypercholesterolemia (FH) is a genetic disorder characterized by elevated levels of plasma LDL-cholesterol that lead to accumulation of cholesterol in the skin (xanthelasma), tendons (xanthomas), eyes (corneal arcus), and arteries (atherosclerosis), and even early death due to premature cardiovascular disease and myocardial infarction. FH is inherited as an autosomal dominant/semidominant disease trait wherein heterozygous individuals are affected and at elevated risk for coronary heart disease and myocardial infarctions; however, homozygous individuals with biallelic pathogenic variants have a more severe phenotype and can die of cardiovascular problems in their second decade of life. LoF variants in the LDL receptor gene (LDLR) are the most common cause of FH due to the absence or defective function of the receptor to bind plasma cholesterol and internalize it for degradation inside the cell. 
Pathogenic variants in two other genes, apolipoprotein B (APOB) and proprotein convertase subtilisin/kexin type 9 (PCSK9), are also associated with FH. PCSK9 is exported into the plasma and binds the LDL receptor in the membrane, targeting it for internalization and lysosomal degradation when bound to LDL-cholesterol. In 2003, heterozygous missense variants in PCSK9 were identified segregating in two families with FH after linkage and positional cloning studies [96], implicating PCSK9 in the pathogenesis of familial hypercholesterolemia type 3 (MIM #603776). After this initial report, it was shown in 2004 that overexpression of PCSK9 in the liver of mice resulted in hypercholesterolemia, phenocopying the loss of LDLR and supporting the gain-of-function hypothesis for the missense variants identified in the families [97]. Consequently, researchers hypothesized that LoF variants in the gene could have the opposite effect and result in lower circulating cholesterol levels. In 2005, Cohen et al. reported PCSK9 LoF variants in a handful of African American individuals with low levels of plasma LDL-cholesterol [98]. This study confirmed the hypothesis that reducing PCSK9 protein levels could be leveraged pharmacologically to decrease circulating LDL-cholesterol levels. Follow-up studies demonstrated that individuals with LoF variants in PCSK9 were healthy and that reduced levels of the protein did not appear to correlate with any negative phenotypes or outcomes, pointing to the safety of a potential therapeutic inhibition of the protein. This body of evidence culminated in the development of monoclonal antibodies that target and bind PCSK9 for therapeutic purposes [99]. Genetic or pharmacologically induced deficiency of PCSK9 prevents the degradation of LDLR, allowing the receptor to be recycled and keeping more receptors available at the plasma membrane for cholesterol uptake, thereby reducing overall circulating LDL-cholesterol levels in the plasma. 
These PCSK9 inhibitors can lower plasma LDL-cholesterol by up to 62% and have been proposed as a new class of lipid-lowering drugs that can be used in addition to statins to lower and control LDL-cholesterol levels in patients with FH and those with difficulties to control their cholesterol levels on standard therapy, reducing cardiovascular complications and morbidity.
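
As a worked example of the figure quoted above, the following Python sketch applies a fractional LDL-cholesterol reduction to a baseline value. The baseline of 200 mg/dL is a hypothetical number chosen for illustration; 62% is the maximum reduction stated in the text, so the result represents a best case rather than a typical response.

```python
def ldl_after_reduction(baseline_mg_dl: float, fractional_reduction: float) -> float:
    """Return the LDL-cholesterol level after a given fractional reduction."""
    if not 0.0 <= fractional_reduction <= 1.0:
        raise ValueError("fractional_reduction must be between 0 and 1")
    return baseline_mg_dl * (1.0 - fractional_reduction)


baseline = 200.0      # hypothetical baseline LDL-cholesterol, mg/dL
max_reduction = 0.62  # "up to 62%" as stated in the text

best_case = ldl_after_reduction(baseline, max_reduction)
print(f"{baseline:.0f} mg/dL -> {best_case:.0f} mg/dL at a {max_reduction:.0%} reduction")
```

Under these assumptions, a baseline of 200 mg/dL would fall to about 76 mg/dL.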

13.4 Outlook

The majority of the estimated 350 million patients currently living with a rare disorder in the world do not have a molecular diagnosis for their condition. In developed countries, about half of rare disease patients spend their lives undiagnosed or experiencing diagnostic odysseys without reaching a definitive answer or appropriate treatment. In the last decade, the decrease in cost and increase in throughput of genomic sequencing technologies have enabled the application of these approaches not only to scientific research on rare disorders but also to clinical diagnostics and even to research into therapeutic development (Fig. 13.2). While the most common estimate is that there are between 7000 and 8000 rare disorders, genomic sequencing and genetic analyses of patients with rare diseases suggest that many more rare genetic diseases await discovery and molecular definition. Genomic sequencing of large cohorts of patients with clinically identified disorders has revealed the genetic heterogeneity of clinically defined disorders, showing that many are but a conglomerate of molecularly distinct, albeit related, rare entities with shared clinical symptoms but a variable phenotypic spectrum. Approximately 300 novel gene-phenotype associations have been added to OMIM each year since the implementation of genomic techniques to study the molecular causes of genetic diseases. While many argue that the discovery rate for rare diseases has plateaued and will start decreasing as we have reached saturation, an alternative explanation is that we are moving from the rare spectrum of disease to the ultrarare. The challenges of investigating rare diseases are therefore exacerbated as we move into the rarer and ultrarare disease/variant spectrum, and so the strategies developed or identified to facilitate rare diseases research need to continue, evolve, or be implemented to overcome these challenges. 
Data sharing must continue to be a priority for the rare disease community in order to connect scientists, clinicians, and patients throughout the world. Ultrarare variant studies and characterization of the variation spectrum of diverse populations from around the world are likely to open up many more gene discoveries. Functional validation and characterization of novel variants and genes must be a priority and evolve to higher-throughput approaches that may allow more systematic evaluations of genomic variation and allelic series for disease genes. This in turn will enable a better understanding of variant effects and mechanisms of disease increasing our understanding of human physiology in health and disease. No other technological development has impacted human health and disease as large-scale high-throughput genomic sequencing has done in the last 30 years, enabling the advent of human genomics. The future is clear in that genomics will continue to have a prominent and growing role in how we diagnose, treat, understand, and prevent diseases, both common and rare. Harnessing the information provided by genomic and other omic approaches to guarantee equitable, predictive, preventive, precise, and participatory genomic medicine will be the major challenge in the years to come.

FIGURE 13.2 Major applications of rare diseases research through genomic approaches: provide accurate molecular diagnoses to patients, identify novel disease-gene associations, and inform therapeutic management and development.

References

[1] Gonzaga-Jauregui C, Lupski JR, Gibbs RA. Human genome sequencing in health and disease. Annu Rev Med 2012;63:35–61.
[2] Chong JX, Buckingham KJ, Jhangiani SN, Boehm C, Sobreira N, Smith JD, et al. The genetic basis of mendelian phenotypes: discoveries, challenges, and opportunities. Am J Hum Genet 2015;97(2):199–215.
[3] Posey JE, O’Donnell-Luria AH, Chong JX, Harel T, Jhangiani SN, Coban Akdemir ZH, et al. Insights into genetics, human biology and disease gleaned from family based genomic studies. Genet Med 2019;21(4):798–812.
[4] Bamshad MJ, Nickerson DA, Chong JX. Mendelian gene discovery: fast and furious with no end in sight. Am J Hum Genet 2019;105(3):448–55.
[5] Orphan Drug Act of 1983. US Food and Drug Administration. 4 January 1983.
[6] European Commission. Available from: https://ec.europa.eu/info/research-and-innovation/research-area/health-research-and-innovation/rare-diseases_en.
[7] Online Mendelian Inheritance in Man (OMIM), NCBI. Available from: https://omim.org/statistics/entry.
[8] Hartley T, Balcı TB, Rojas SK, Eaton A, Canada CR, Dyment DA, et al. The unsolved rare genetic disease atlas? An analysis of the unexplained phenotypic descriptions in OMIM®. Am J Med Genet C Semin Med Genet 2018;178(4):458–63.
[9] Gonzaga-Jauregui C, Harel T, Gambin T, Kousi M, Griffin LB, Francescatto L, et al. Exome sequence analysis suggests that genetic burden contributes to phenotypic variability and complex neuropathy. Cell Rep 2015;12(7):1169–83.
[10] Tan TY, Gonzaga-Jauregui C, Bhoj EJ, Strauss KA, Brigatti K, Puffenberger E, et al. Monoallelic BMP2 variants predicted to result in haploinsufficiency cause craniofacial, skeletal, and cardiac features overlapping those of 20p12 deletions. Am J Hum Genet 2017;101(6):985–94.
[11] Gonzaga-Jauregui C, Yesil G, Nistala H, Gezdirici A, Bayram Y, Nannuru KC, et al. Functional biology of the Steel syndrome founder allele and evidence for clan genomics derivation of COL27A1 pathogenic alleles worldwide. Eur J Hum Genet 2020;28(9):1243–64.
[12] Philippakis AA, Azzariti DR, Beltran S, Brookes AJ, Brownstein CA, Brudno M, et al. The matchmaker exchange: a platform for rare disease gene discovery. Hum Mutat 2015;36:915–21.
[13] Azzariti DR, Hamosh A. Genomic data sharing for novel mendelian disease gene discovery: the matchmaker exchange. Annu Rev Genomics Hum Genet 2020;21:305–26.
[14] Sobreira N, Schiettecatte F, Valle D, Hamosh A. GeneMatcher: a matching tool for connecting investigators with an interest in the same gene. Hum Mutat 2015;36(10):928–30.
[15] Buske OJ, Girdea M, Dumitriu S, Gallinger B, Hartley T, Trang H, et al. PhenomeCentral: a portal for phenotypic and genotypic matchmaking of patients with rare genetic diseases. Hum Mutat 2015;36(10):931–40.
[16] Firth HV, Richards SM, Bevan AP, Clayton S, Corpas M, Rajan D, et al. DECIPHER: database of chromosomal imbalance and phenotype in humans using ensembl resources. Am J Hum Genet 2009;84(4):524–33.
[17] Sobreira NLM, Arachchi H, Buske OJ, Chong JX, Hutton B, Foreman J, et al. Matchmaker exchange. Curr Protoc Hum Genet 2017;95:9.31.1–9.31.15.
[18] The International HapMap Consortium. The international HapMap project. Nature 2003;426:789–96.
[19] International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 2010;467(7311):52–8.
[20] 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 2010;467(7319):1061–73.
[21] 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 2015;526(7571):68–74.
[22] Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 2008;452(7189):872–6.
[23] Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008;456(7218):53–9.
[24] Schuster SC, Miller W, Ratan A, Tomsho LP, Giardine B, Kasson LR, et al. Complete Khoisan and Bantu genomes from southern Africa. Nature 2010;463(7283):943–7.
[25] Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, et al. The diploid genome sequence of an Asian individual. Nature 2008;456(7218):60–5.
[26] Ahn SM, Kim TH, Lee S, Kim D, Ghang H, Kim DS, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Res 2009;19(9):1622–9.
[27] Kim JI, Ju YS, Park H, Kim S, Lee S, Yi JH, et al. A highly annotated whole-genome sequence of a Korean individual. Nature 2009;460(7258):1011–15.
[28] Choudhury A, Aron S, Botigué LR, Sengupta D, Botha G, Bensellak T, et al. High-depth African genomes inform human migration and health. Nature 2020;586(7831):741–8.
[29] Need AC, Goldstein DB. Next generation disparities in human genomics: concerns and remedies. Trends Genet 2009;25(11):489–94.
[30] Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature 2016;538(7624):161–4.
[31] Sirugo G, Williams SM, Tishkoff SA. The missing diversity in human genetic studies. Cell 2019;177(1):26–31. Available from: https://doi.org/10.1016/j.cell.2019.02.048.
[32] Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet 2019;51(4):584–91.
[33] Antonarakis SE, Beckmann JS. Mendelian disorders deserve more attention. Nat Rev Genet 2006;7(4):277–82.
[34] Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP), Seattle, WA. Available from: http://evs.gs.washington.edu/EVS/.
[35] Karczewski KJ, Weisburd B, Thomas B, Solomonson M, Ruderfer DM, Kavanagh D, et al. The ExAC browser: displaying reference data information from over 60000 exomes. Nucleic Acids Res 2017;45(D1):D840–5.
[36] Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alfo¨ldi J, Wang Q, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 2020;581(7809):43443. [37] Manrai AK, Funke BH, Rehm HL, Olesen MS, Maron BA, Szolovits P, et al. Genetic misdiagnoses and the potential for health disparities. N Engl J Med 2016;375(7):65565. [38] Green ED, Gunter C, Biesecker LG, Di Francesco V, Easter CL, Feingold EA, et al. Strategic vision for improving human health at the forefront of genomics. Nature 2020;586(7831):68392. [39] All of Us Research Program Investigators, Denny JC, Rutter JL, Goldstein DB, Philippakis A, Smoller JW, et al. The “All of Us” research program. N Engl J Med 2019;381(7):66876. [40] ASHG. Advancing diverse participation in research with special consideration for vulnerable populations. Am J Hum Genet 2020;107(3):37980. [41] Yang Y, Muzny DM, Reid JG, Bainbridge MN, Willis A, Ward PA, et al. Clinical whole-exome sequencing for the diagnosis of mendelian disorders. N Engl J Med 2013;369(16):150211. [42] Yang Y, Muzny DM, Xia F, Niu Z, Person R, Ding Y, et al. Molecular findings among patients referred for clinical whole-exome sequencing. JAMA 2014;312(18):18709. [43] Stark Z, Tan TY, Chong B, Brett GR, Yap P, Walsh M, et al. A prospective evaluation of whole-exome sequencing as a first-tier molecular test in infants with suspected monogenic disorders. Genet Med 2016;18(11):10906. [44] Stavropoulos DJ, Merico D, Jobling R, Bowdin S, Monfared N, Thiruvahindrapuram B, et al. Whole genome sequencing expands diagnostic utility and improves clinical management in pediatric medicine. NPJ Genom Med 2016;1:15012. [45] Stark Z, Schofield D, Alam K, Wilson W, Mupfeki N, Macciocca I, et al. Prospective comparison of the cost-effectiveness of clinical whole-exome sequencing with that of usual care overwhelmingly supports early use and reimbursement. Genet Med 2017;19(8):86774. [46] Bick D, Jones M, Taylor SL, Taft RJ, Belmont J. 
Case for genome sequencing in infants and children with rare, undiagnosed or genetic diseases. J Med Genet 2019;56(12):78391. [47] Schofield D, Rynehart L, Shresthra R, White SM, Stark Z. Long-term economic impacts of exome sequencing for suspected monogenic disorders: diagnosis, management, and reproductive outcomes. Genet Med 2019;21(11):258693. [48] Gahl WA, Mulvihill JJ, Toro C, Markello TC, Wise AL, Ramoni RB, et al. The NIH undiagnosed diseases program and network: applications to modern medicine. Mol Genet Metab 2016;117(4):393400. [49] Schoch K, Esteves C, Bican A, Spillmann R, Cope H, McConkie-Rosell A, et al. Clinical sites of the undiagnosed diseases network: unique contributions to genomic medicine and science. Genet Med 2020. [50] Bamshad MJ, Shendure JA, Valle D, Hamosh A, Lupski JR, Gibbs RA, et al. The Centers for Mendelian Genomics: a new large-scale initiative to identify the genes underlying rare Mendelian conditions. Am J Med Genet A 2012;158A(7):15235. [51] H3Africa Consortium, et al. Research capacity. Enabling the genomic revolution in Africa. Science 2014;344(6190):13468. [52] Turro E, Astle WJ, Megy K, Gra¨f S, Greene D, Shamardina O, et al. Whole-genome sequencing of patients with rare diseases in a national health system. Nature 2020;583(7814):96102. [53] Strauss KA, Gonzaga-Jauregui C, Brigatti KW, Williams KB, King AK, Van Hout C, et al. Genomic diagnostics within a medically underserved population: efficacy and implications. Genet Med 2018;20 (1):3141. [54] D’Angelo CS, Hermes A, McMaster CR, Prichep E, Richer E´, van der Westhuizen FH, et al. Barriers and considerations for diagnosing rare diseases in indigenous populations. Front Pediatr 2020;8: 579924. [55] Posey JE, Harel T, Liu P, Rosenfeld JA, James RA, Coban Akdemir ZH, et al. Resolution of disease phenotypes resulting from multilocus genomic variation. N Engl J Med 2017;376(1):2131.

282

Chapter 13 Challenges and opportunities in rare diseases research

[56] Karaca E, Posey JE, Coban Akdemir Z, Pehlivan D, Harel T, Jhangiani SN, et al. Phenotypic expansion illuminates multilocus pathogenic variation. Genet Med 2018;20(12):152837. [57] Bend R, Cohen L, Carter MT, Lyons MJ, Niyazov D, Mikati MA, et al. Phenotype and mutation expansion of the PTPN23 associated disorder characterized by neurodevelopmental delay and structural brain abnormalities. Eur J Hum Genet 2020;28(1):7687. [58] Jordanova A, Irobi J, Thomas FP, Van Dijck P, Meerschaert K, Dewil M, et al. Disrupted function and axonal distribution of mutant tyrosyl-tRNA synthetase in dominant intermediate Charcot-Marie-Tooth neuropathy. Nat Genet 2006;38(2):197202. [59] Nowaczyk MJ, Huang L, Tarnopolsky M, Schwartzentruber J, Majewski J, Bulman DE, et al. A novel multisystem disease associated with recessive mutations in the tyrosyl-tRNA synthetase (YARS) gene. Am J Med Genet A 2017;173(1):12634. [60] Williams KB, Brigatti KW, Puffenberger EG, Gonzaga-Jauregui C, Griffin LB, Martinez ED, et al. Homozygosity for a mutation affecting the catalytic domain of tyrosyl-tRNA synthetase (YARS) causes multisystem disease. Hum Mol Genet 2019;28(4):52538. [61] Passarge E, Robinson PN, Graul-Neumann LM. Marfanoid-progeroid-lipodystrophy syndrome: a newly recognized fibrillinopathy. Eur J Hum Genet 2016;24(9):12447. [62] Inoue K, Khajavi M, Ohyama T, Hirabayashi S, Wilson J, Reggin JD, et al. Molecular mechanism for distinct neurological phenotypes conveyed by allelic truncating mutations. Nat Genet 2004;36(4): 3619. [63] Shamseldin HE, Tulbah M, Kurdi W, Nemer M, Alsahan N, Al Mardawi E, et al. Identification of embryonic lethal genes in humans by autozygosity mapping and exome sequencing in consanguineous families. Genome Biol 2015;16(1):116. [64] Cacheiro P, Mun˜oz-Fuentes V, Murray SA, Dickinson ME, Bucan M, Nutter LMJ, et al. Human and mouse essentiality screens as a resource for disease gene discovery. Nat Commun 2020;11(1):655. 
[65] Wu N, Ming X, Xiao J, Wu Z, Chen X, Shinawi M, et al. TBX6 null variants and a common hypomorphic allele in congenital scoliosis. N Engl J Med 2015;372(4):34150. [66] Karolak JA, Vincent M, Deutsch G, Gambin T, Cogn´e B, Pichon O, et al. Complex compound inheritance of lethal lung developmental disorders due to disruption of the TBX-FGF pathway. Am J Hum Genet 2019;104(2):21328. [67] GTEx Consortium. The genotype-tissue expression (GTEx) project. Nat Genet 2013;45(6):5805. [68] Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, et al. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol 2009;10(11):R130. [69] Lloyd KC. A knockout mouse resource for the biomedical research community. Ann N Y Acad Sci 2011;1245:246. [70] Wang J, Al-Ouran R, Hu Y, Kim SY, Wan YW, Wangler MF, et al. MARRVEL: integration of human and model organism genetic resources to facilitate functional annotation of the human genome. Am J Hum Genet 2017;100(6):84353. [71] Thurmond J, Goodman JL, Strelets VB, Attrill H, Gramates LS, Marygold SJ, et al. FlyBase 2.0: the next generation. Nucleic Acids Res 2019;47(D1):D75965. [72] Harris TW, Arnaboldi V, Cain S, Chan J, Chen WJ, Cho J, et al. WormBase: a modern model organism information resource. Nucleic Acids Res 2020;48(D1):D7627. [73] Ruzicka L, Howe DG, Ramachandran S, Toro S, Van Slyke CE, Bradford YM, et al. The Zebrafish information network: new support for non-coding genes, richer gene ontology annotations and the alliance of genome resources. Nucleic Acids Res 2019;47(D1):D86773. [74] Smith JR, Hayman GT, Wang SJ, Laulederkind SJF, Hoffman MJ, Kaldunski ML, et al. The year of the rat: the rat genome database at 20: a multi-species knowledgebase and analysis platform. Nucleic Acids Res 2020;48(D1):D73142.

References

283

[75] Bult CJ, Blake JA, Smith CL, Kadin JA, Richardson JE, Mouse Genome Database Group. Mouse genome database (MGD) 2019. Nucleic Acids Res 2019;47(D1):D8016. [76] Brown SD, Moore MW. Towards an encyclopaedia of mammalian gene function: the International Mouse Phenotyping Consortium. Dis Model Mech 2012;5(3):28992. [77] Pushpakom S, Iorio F, Eyers PA, Escott KJ, Hopper S, Wells A, et al. Drug repurposing: progress, challenges and recommendations. Nat Rev Drug Discov 2019;18(1):4158. [78] Gambello MJ, Li H. Current strategies for the treatment of inborn errors of metabolism. J Genet Genomics 2018;45(2):6170. [79] Berry SA, Brown CS, Greene C, Camp KM, McDonough S, Bocchini Jr JA, et al. Medical foods for inborn errors of metabolism: history, current status, and critical need. Pediatrics 2020;145(3):e20192261. [80] Bouchereau J, Schiff M. Inherited disorders of lysine metabolism: a review. J Nutr 2020;150(Suppl 1): 2556S60S. [81] Strauss KA, Williams KB, Carson VJ, Poskitt L, Bowser LE, Young M, et al. Glutaric acidemia type 1: treatment and outcome of 168 patients over three decades. Mol Genet Metab 2020;131(3):32540. [82] Hughes DA, Nicholls K, Shankar SP, Sunder-Plassmann G, Koeller D, Nedd K, et al. Oral pharmacological chaperone migalastat compared with enzyme replacement therapy in Fabry disease: 18-month results from the randomised phase III ATTRACT study. J Med Genet 2017;54(4):28896. [83] Bainbridge MN, Wiszniewski W, Murdock DR, Friedman J, Gonzaga-Jauregui C, Newsham I, et al. Whole-genome sequencing for optimized patient management. Sci Transl Med 2011;3(87):87re3. [84] Bucciol G, Meyts I. Recent advances in primary immunodeficiency: from molecular diagnosis to treatment. F1000Res. 2020;9 F1000 Faculty Rev-194. [85] Blanco E, Izotova N, Booth C, Thrasher AJ. Immune reconstitution after gene therapy approaches in patients with X-linked severe combined immunodeficiency disease. Front Immunol 2020;11:608653. 
Available from: https://doi.org/10.3389/fimmu.2020.608653. [86] Dewey FE, Murray MF, Overton JD, Habegger L, Leader JB, Fetterolf SN, et al. Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study. Science 2016;354(6319):aaf6814. [87] Van Hout CV, Tachmazidou I, Backman JD, Hoffman JD, Liu D, Pandey AK, et al. Exome sequencing and characterization of 49,960 individuals in the UK Biobank. Nature 2020;586(7831):74956. [88] Gonzaga-Jauregui C, Ge W, Staples J, Van Hout C, Yadav A, Colonie R, et al. Clinical and molecular prevalence of lipodystrophy in an unascertained large clinical care cohort. Diabetes 2020;69(2):24958. [89] Green RC, Berg JS, Grody WW, Kalia SS, Korf BR, Martin CL, et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet Med 2013;15 (7):56574. [90] Kalia SS, Adelman K, Bale SJ, Chung WK, Eng C, Evans JP, et al. Recommendations for reporting of secondary findings in clinical exome and genome sequencing, 2016 update (ACMG SF v2.0): a policy statement of the American College of Medical Genetics and Genomics. Genet Med 2017;19(2):24955. [91] Schwartz MLB, McCormick CZ, Lazzeri AL, Lindbuchler DM, Hallquist MLG, Manickam K, et al. A model for genome-first care: returning secondary genomic findings to participants and their healthcare providers in a large research cohort. Am J Hum Genet 2018;103(3):32837. [92] Staples J, Maxwell EK, Gosalia N, Gonzaga-Jauregui C, Snyder C, Hawes A, et al. Profiling and leveraging relatedness in a precision medicine cohort of 92,455 exomes. Am J Hum Genet 2018;102 (5):87489. [93] Lupski JR, Belmont JW, Boerwinkle E, Gibbs RA. Clan genomics and the complex architecture of human disease. Cell 2011;147(1):3243. [94] Arrowsmith J, Miller P. Trial watch: phase II and phase III attrition rates 2011-2012. Nat Rev Drug Discov 2013;12(8):569.

284

Chapter 13 Challenges and opportunities in rare diseases research

[95] Nelson MR, Tipney H, Painter JL, Shen J, Nicoletti P, Shen Y, et al. The support of human genetic evidence for approved drug indications. Nat Genet 2015;47(8):85660. [96] Abifadel M, Varret M, Rabe`s JP, Allard D, Ouguerram K, Devillers M, et al. Mutations in PCSK9 cause autosomal dominant hypercholesterolemia. Nat Genet 2003;34(2):1546. [97] Maxwell KN, Breslow JL. Adenoviral-mediated expression of Pcsk9 in mice results in a low-density lipoprotein receptor knockout phenotype. Proc Natl Acad Sci U S A 2004;101(18):71005. [98] Cohen J, Pertsemlidis A, Kotowski IK, Graham R, Garcia CK, Hobbs HH. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nat Genet 2005;37 (2):1615. [99] El Khoury P, Elbitar S, Ghaleb Y, Khalil YA, Varret M, Boileau C, et al. PCSK9 Mutations in Familial Hypercholesterolemia: from a Groundbreaking Discovery to Anti-PCSK9 Therapies. Curr Atheroscler Rep 2017;19(12):49.

Index

Note: Page numbers followed by “f” and “t” refer to figures and tables, respectively.

A
ABCD1 gene, 141
Aberrant expression, 222–223
Aberrant splicing, 218–219
ACVR1 gene, 126–128
Adenine (A), 12
ADMFD. See Multisystem autoimmune disease with facial dysmorphism (ADMFD)
Adrenoleukodystrophy (ALD), 141
Agnostic molecular techniques, 187–189
AKT1 gene, 171
“All Of Us” research program, 268
Allele-specific expression (ASE), 218, 221–222
Allelic affinity, 15
Allelic heterogeneity, 15
Allelic imbalance. See Allele-specific expression (ASE)
Alport syndrome (ATS), 170
  ATS1, 170
American College of Medical Genetics (ACMG), 9, 47–48
Aminoacyl-tRNA synthetases (aaRS), 121
Amish cerebral palsy, 98–99
Aneuploidy, 19–20, 158–162
Angelman syndromes (AS), 35–38
Antigen-capture array, 239–240
Apolipoprotein B (APOB), 277
ARID1B gene, 130
Array comparative genomic hybridization (aCGH), 18–19, 38–39, 39f, 156
ASH1L gene, 130
Assay for transposase-accessible chromatin using sequencing (ATAC-seq), 236
Association analyses for rare diseases, 208–212
  rare variant association testing in phenotype ascertained studies, 208–211
    gene burden collapsing test, 210f
  rare variant association testing of unascertained population-based studies, 211–212
ATP10A gene, 191–192
ATPase 6, 143–144
Autism spectrum disorder (ASD), 128, 269
Automated flow-injection electrospray ionization mass spectrometry procedure, 244–246
Autosomal aneuploidy, 21
Autosomal dominant disorders, 118–124
  incomplete penetrance and variable expressivity of dominant disorders, 123–124
  mechanisms of dominant disease, 118–123, 120f
Autosomal dominant inheritance, 118

Autosomal recessive disorders, 10
Autosomal recessive inheritance, 97–106
  consanguinity in recessive diseases, 100–101
  founder effect, 101–104
  Hardy–Weinberg equilibrium, 104–106
  inbreeding, 104–106
Autosomes, 2, 118
AXPC1. See Posterior column ataxia with retinitis pigmentosa (AXPC1)

B
B allele frequency (BAF), 40, 156
Baker’s yeast (Saccharomyces cerevisiae), 62–63
Barr body, 141
Base pairs (bp), 62
Becker muscular dystrophy (BMD), 139, 219–220, 243
Beckwith–Wiedemann syndrome, 29
Bidimensional polyacrylamide gel electrophoresis (2D-PAGE), 240–242
Binary sequence alignment/map (BAM), 75, 157
BioCyc, 244–246
Bioinformatics, 1
  for detection of genomic rearrangements, 40–42
Bisulfite sequencing, 232
Bmp2 gene, 118–119, 197–198
Bohring–Opitz syndrome (BOS), 191
Bottom-up proteomics (BUPs), 240–242
BRCA1 gene, 123–124
BRCA2 gene, 123–124
Break-induced replication (BIR), 44–45
BUB1B gene, 168
Burrows-Wheeler transform algorithm (BWT algorithm), 75
Byler disease. See Progressive familial intrahepatic cholestasis (PFIC1)

C
Caenorhabditis elegans. See Nematode worm (Caenorhabditis elegans)
Cancer as series of rare mosaic diseases, 174–175
Carbonic anhydrase III (CA3), 243
5-Carboxylcytosine, 231–232
Care4Rare project, 85
Carrier frequency, 100
Cartilage-hair hypoplasia (CHH), 101, 108
Case-control studies, 210–211
CDC42 gene, 127
Cell-specific DNA methylation patterns, 231–232

Centers for Mendelian Genomics (CMGs), 84–85, 268–269
Central dogma of molecular biology, 4
CEP57 gene, 168
Cerebrospinal fluid (CSF), 242–243
CGG repeats, 140
Charcot-Marie-Tooth disease (CMT disease), 83–84, 121, 192, 194–195
CHD8 gene, 130
Chimerism, 177
“ChIP-free” methods, 235
ChIP-Seq, 233
Chorionic villus sampling (CVS), 152–153
Chorioretinopathy with microcephaly (MCCRP1), 109–110
Chromatin
  accessibility, 235–237
  alternative methods to dissect chromatin modifications, 233–234
  dissecting chromatin structures, 235–237
  immunoprecipitation, 233
  states modification, 233
Chromatogram, 71
Chromoanasynthesis, 44–45
Chromosomal aneuploidy, 21
Chromosomal deletions, 17
Chromosomal DNA, 233
Chromosomal microarray analysis (CMA), 7–8, 18–19, 22, 25, 35–38, 128, 185
  for copy-number variant detection and diagnosis of genomic disorders, 38–40
Chromosomal mosaicism, 151–152
Chromosome analysis, clinical indications and special considerations for, 30
Chromosome Conformation Capture (3C), 236–237
Chromosomes, 17
Chromothripsis, 26, 28
Clan Genomics, 198–199
  hypothesis, 276
Classic chimerism, 177
Cleavage under targets and release using nuclease (CUT&RUN), 234
ClinGen, 47–48
Clinical Laboratory Improvement Amendments (CLIA), 86–87
Clinical metabolomics testing, 248
Clinical variability of rare diseases, 269–270
Clinical WES, 86
ClinVar, 47
  database, 9
Clotting factor VIII (F8), 137
Clotting factor IX (F9), 137
CODAS syndrome, 109
Codons, 4
COL2A1 gene, 121–122
COL7A1 gene, 185–186

Collagens, 121–122
Comparative genome hybridization (CGH), 156
Complex chromosome rearrangements (CCRs), 25, 28
  balanced CCRs, 28
Complex inheritance, 198
Complex rearrangements, 28
Cones, 139
Confined placental mosaicism, 151–152
Congenital heart disease (CHD), 221–222
Congenital lipomatous overgrowth, vascular malformations, epidermal nevi, and skeletal/spinal abnormalities syndrome (CLOVES syndrome), 171
Consanguineous marriage, 100–101
Consanguinity in recessive diseases, 100–101
Conservation scores, 81
Copy-number variants (CNVs), 19, 25, 35–38, 62, 128, 188–189, 215
  interpretation of genomic structural and, 47–48
  large, 78
COQ5 gene, 144–145
CpG dinucleotides, 230–231
Cryptic mosaicism, 170
CTNNB1 gene, 197
Cystic fibrosis (CF), 98
Cytochrome c Oxidase (COX), 145
  COX10 gene, 145
  COX6A1 gene, 144–145
Cytogenetics, 2, 155–156
  karyotype analysis, 35
Cytosine (C), 12, 156–157

D
Darier–White disease, 176
Data analysis, 68
Database of SNPs (dbSNP), 79–80
De novo mutations (DNM), 118, 175–176
  mechanisms of disease of, 126–128
DECIPHER, 47, 266
Deciphering Developmental Disorders project (DDD project), 85, 129–130
Deletions, 25
Demethylases, 130
Deoxyribonucleic acid (DNA), 12
  methylation, 232
    as epigenetic mechanism, 230–232
  sequence variation, 56
  sequencing, 68
Derivative chromosome (der chromosome), 27
Developmental delay and intellectual disability (DD/ID), 128–129
Dideoxynucleotide phosphate analogs (ddNTPs), 67, 71
DIgenic diseases DAtabase (DIDA), 190
Digenic inheritance, 197

Digital polymerase chain reaction (dPCR), 158
Direct labeling of protein mixture, 239–240
Disease gene mapping of autosomal recessive disorders, 106–112
  genomic sequencing of rare recessive disorders, 109–110
  genomics of founder populations, 110–112
  homozygosity mapping in consanguineous pedigrees, 106–109
DNase I hypersensitive site sequencing (DNase-seq), 235–236
Dominant disorders, 117
Dominant negative effects, 121
DONSON variants, 185–186
Down syndrome (DS), 4–5, 19–20, 155, 158
Dravet syndrome, 118–119
Driver mutations, 175
Droplet-based single-cell ChIP-seq (Drop-ChIP), 234–235
Drosophila, 151
Drosophila melanogaster. See Fruit fly (Drosophila melanogaster)
Drug development derived from rare diseases insights, 276–277
Dual molecular diagnoses, 186–192, 188t, 189f
  syndrome disintegration, 186–187
Duchenne muscular dystrophy (DMD), 139, 219–220, 243
Duplications, 17, 25
DUX4 allele, 197
DVL1 gene, 123
DVL3 gene, 123
DYSF deficiency underlying Miyoshi myopathy (MMD1), 219–220
Dystrophin, 139

E
Electronic Health Records (EHR), 211
Electrophoresis, 71
Ellis-van Creveld syndrome (EVC syndrome), 101, 108
Embryonic stem cells (ES cells), 233
Emery–Dreifuss muscular dystrophy (EMD2), 185–186
Endoplasmic reticulum (ER), 121, 274
Endoreplication, 151–152
Enzyme replacement therapy (ERT), 274
Epigenetics, 229–230
Epigenomics, 229–232
  epigenetics and, 229–230
  landscape of epigenomic technologies, 232–235
  studies in rare diseases, 237–238
Exome Aggregation Consortium database (ExAC database), 80
Exome sequencing, 63–65
Exon(s), 2
  skipping, 220–221
Exonic extension, 220

Expression level imbalances, 218
Expression microarrays, 217
Expression outliers, 222–223
Expression quantitative trait locus/loci (eQTL), 221–222

F
Fabry Disease, 247
Facioscapulohumeral dystrophy type 2 (FSHD2), 197
Familial hypercholesterolemia (FH), 277
  FHCL1, 120–121
Family-based analyses, 110
Family-based statistical approaches, 205–206
FDA. See US Food and Drug Administration (FDA)
Fetal microchimerism, 177
Fibrillin-1 gene (FBN1), 122–123, 270
Fibrillinopathies, 122, 270
Fibroblast growth factor (FGF), 122
  FGF10, 197
  FGFR3, 122
Fibrodysplasia ossificans progressiva (FOP), 126
Field cancerization, 175
Finding Of Rare Disease GEnes (FORGE), 85
First generation sequencing, 71
Fisher’s Exact Test (FET), 208–209
Fitzsimmons syndrome, 186–187
Flowgram, 72
Fluorescence in situ hybridization (FISH), 17–19, 38
Fluorescence-activated cell sorting (FACS), 234
Fork stalling and template switching (FoSTeS), 44–45
Formalin-fixed, paraffin-embedded tissues (FFPE tissues), 65
5-Formylcytosine, 231–232
Founder effect, 101–104
Founder populations, 110–112
Fragile X mental retardation 1 gene (FMR1 gene), 140
Fragile X mental retardation protein (FMRP), 140
Fragile X syndrome (FXS), 75–76, 140
Frameshift mutation, 67
Freeman-Sheldon syndrome (FSS), 83–84
Fruit fly (Drosophila melanogaster), 62–63
Functional prediction algorithms, 81–82

G
G-protein subunit alpha q (Gαq), 171–174
Gain-of-function (GoF), 117–118, 122, 140, 219
  effect, 127–128
  mutations, 10–12
Galactosidase A (GLA), 247
Galloway-Mowat syndrome type 1 (GAMOS1). See Nephrocerebellar syndrome
Gas and liquid chromatography (GC/LC), 246
Gaucher disease (GD), 98
  type 1, 274

GCDH gene, 272–274
Geisinger DiscovEHR cohort, 276
GeneMatcher, 84–85, 266
Genes, 2, 61–62
  expression, 219–220
Genetic Testing Registry (GTR), 86–87
Genetic(s), 1
  code, 4, 5t
  drift, 102
  heterogeneity of rare diseases, 269–270
  inheritance, 14–15
  maps, 108
  variants, 8t
  variation, 58
Genome, 1
  insufficient knowledge of genome function, 270–272
Genome Aggregation Database (gnomAD), 80, 267
Genome Analysis ToolKit (GATK), 157
Genome assembly (GA), 41–42
Genome-wide association studies (GWAS), 208, 266
1000 Genomes Project, 62
Genomic Evolutionary Rate Profiling (GERP) score, 81
Genomic(s), 1, 229
  databases, 79–82
  disorders, 35
    chromosomal microarray analysis for copy-number variant detection and diagnosis, 38–40
    interpretation of genomic structural and copy-number variants, 47–48
    molecular mechanisms of genomic rearrangement generation, 42–45, 43f
    next-generation sequencing and bioinformatics for detection of genomic rearrangements, 40–42
    with recurrent and nonrecurrent chromosome rearrangements, 36t
  of founder populations, 110–112
  identification of genomic variants, 223
  imprinting, 13–14, 28–30, 29t, 237–238
  rearrangements, next-generation sequencing and bioinformatics for detection, 40–42
  sequencing of rare diseases, 83–87, 109–110
    in clinic, 86–87
    decrease in sequencing cost per genome, 64f
    human genome reference sequence, 61–63
    mapping, alignment, pileup, and variant calling processes, 77f
    sequencing of genomes and exomes, 63–65, 69t
  studies of sporadic disorders and identification of de novo mutations, 128–130
Genotype, 9–10
Genotype calling, 76–78
Genotype-Tissue Expression project (GTEx project), 82, 218–219, 271
Germline mosaicism, 151–152, 152f, 169

Germplasm, 151
Glucose-6-phosphate dehydrogenase (G6PD), 186
Glucosylceramide (GlcCer), 274
Glucosylcerebroside, 274
Glutaric aciduria type 1 (GA1), 98, 272–274
GNAQ, 171–174
GNAS, 170–171
Gonadal mosaicism. See Germline mosaicism
Graft versus host disease (GvHD), 275
Guanine (G), 12

H
Hamartin, 174
Haploinsufficiency, 118–119
HapMap, 62
Happle’s hypothesis, 171
Hardy–Weinberg equilibrium, 104–106
Hematopoietic stem cell transplantation (HSCT), 275
Hemophilia A, 137
Hemophilia B, 137
Hereditary hemorrhagic telangiectasia (HHT), 176–177
Hereditary spastic paraplegia (HSP), 195
Heterodisomic UPD, 28
Heteroplasmy, 143–144
Heterozygous, 10
Hi-C, 236–237
Higher-order chromatin architecture, 236–237
Highly penetrant Mendelizing variants (HPMVs), 194–195
Histone, 233
  acetylation and methylation, 233
  lysine methyltransferases, 130
Homoplasmic mutation, 144
Homozygosity mapping in consanguineous pedigrees, 106–109
Homozygote, 10–12
Horizontal transmission, 10
House mouse (Mus musculus), 62–63
Human Cell Atlas Project (HCA), 82
Human de novo mutation rate, 124–126
Human Gene Mutation Database (HGMD), 81
Human genetics and genomics, 89
Human genome, 1–5, 11f, 118
  reference sequence, 61–63
  sample preparation, sequencing, and analysis workflow, 66f
Human Genome Project (HGP), 61–62, 229–230, 263
Human Heredity and Health in Africa (H3Africa), 268–269
Human Integrated Protein–Protein Interaction rEference (HIPPIE), 82
Human knockouts, 112
Human Metabolome Database, 244–246
Human Phenotype Ontology (HPO), 190–191

Huntingtin gene (HTT gene), 117–118, 238
Huntington disease (HD), 117–118, 238
5-Hydroxymethylcytosine (5hmC), 231–232
5-Hydroxytryptophan (5-HTP), 274–275

I
Identity-By-Descent (IBD), 100, 206
IKK-gamma gene (IKBKG gene), 176
Immunoprecipitation (IP), 232
In-frame mutations, 67
Inborn errors of metabolism (IEMs), 98, 246–247, 272–274
Inbreeding, 104–106
  coefficient, 100
Incomplete penetrance, 123–124, 195–198
Incontinentia pigmenti, 176
Inheritance, Mendelian patterns of, 9–12
Intensive care unit (ICU), 86
Interchromosomal insertions, 25–26
Interferon (IFN), 239–240
International Classification of Diseases (ICD), 211
International System for Human Cytogenetic Nomenclature (ISCN), 17
Intrachromosomal insertions, 25–26
Intrachromosomal submicroscopic inversions, 78
Intrauterine growth retardation (IUGR), 163
Intravenous infusions (IV infusions), 274
Intronic gain, 219–220
Inversions, 17
Inverted duplication chromosomes, 26
Ionization, 241
Isochromosomes, 26–27
Isodicentrics, 26
  isodicentric 15, 27

J
Jagged 1 (JAG1), 223–224

K
Kabuki syndrome (KS), 130
Karyotype, 2, 17
Karyotyping, 17
  clinical indications and special considerations for chromosome analysis, 30
  frequencies of chromosomal aberrations, 19f
  numerical chromosome aberrations, 19–22
  structural chromosome aberrations, 22–28
KCNQ1 gene, 111–112
Keratosis follicularis. See Darier-White disease
KIF1A gene, 127
Kleefstra syndrome type (KLEFS2), 130
Klinefelter syndrome, 21–22
Kyoto Encyclopedia of Genes and Genomes (KEGG), 82

L
Leigh syndrome (LS), 143–144, 169
Leucine-rich pentatricopeptide-repeat-motif-containing gene (LRPPRC gene), 247–248
Levodopa (L-dopa), 274–275
Library preparation, 65–67
Linear mixed models (LMMs), 208–209
Linear skin defects with multiple congenital anomalies 1 (LSDMCA1), 141
Linkage analysis, 117, 205–207
Linkage disequilibrium, 9–10
Liquid chromatography (LC), 240–242
Liquid chromatography electrospray ionization mass spectrometry, 244–246
Liquid chromatography-mass spectrometry (LC-MS), 246
LMNA, 185–186
Locus or genetic heterogeneity, 15
Loeys-Dietz syndrome type 1 (LDS1), 123
Loeys-Dietz syndrome type 2 (LDS2), 123
Log of Odds score (LOD score), 206
Long interspersed nuclear elements (LINEs), 42–43
Long-read sequencing technology, 40–41
Loss-of-function (LoF), 67, 118–119, 209, 219, 265. See also Gain-of-function (GoF)
  mutations, 98
  variants, 78–79, 120–121
Low-copy repeats (LCRs), 25, 42–43
Low-density lipoprotein receptor gene (LDLR gene), 120–121, 277
Low-frequency novel junctions (LFNJ), 219
Lyonization. See X-chromosome inactivation (XCI)

M
Macrochimerism, 177
Male specific portion of Y (MSY), 167–168
Maple syrup urine disease (MSUD), 102–104
Mapping, 75
Marfan syndrome, 122–123
Marker chromosomes, 27–28
Markov Chain Monte Carlo approaches (MCMC approaches), 206
Mass filter. See Quadrupole mass analyzer
Mass selection, 242
Mass spectrometry (MS), 240–242
Mass-to-charge ratio (m/z ratio), 242
Matchmaker Exchange, 266
Mate-pair sequencing (MPseq), 45–46
MCCRP1. See Chorioretinopathy with microcephaly (MCCRP1)
McCune–Albright syndrome, 170–171
McKusick-Kauffman syndrome (MKKS), 108
MCOLN1, 221–222
MECP2 duplication syndrome, 140

Meiosis, 45
Meiotic chromosome nondisjunction, 29
Mendelian disorders, 9–10, 14–15
  mosaic forms in, 175–177
Mendelian patterns of inheritance, 9–12
Mendelian randomization approaches, 205–206
Messenger RNA (mRNA), 2, 4
Metabolite(s), 243–244
  extraction, 246
Metabolomics, 243–248
  studies in rare diseases, 246–248
  workflow and data analysis, 244–246
Methyl-binding domain capture sequencing, 232
5-Methylcytosine (5mC), 230–231
METLIN, 244–246
MGST1, 222–223
Microarray technology, 109
Microchimerism, 177
Micrococcal nuclease (MNase), 233–234
Micrococcal nuclease sequencing (MNase-seq), 236
Microdeletions, 25
Microduplications, 25
Microhomology-mediated break-induced replication (MMBIR), 44–45
Microhomology-mediated end joining (MMEJ), 44–45
Million bp (Mbp), 63
Missense mutations. See Nonsynonymous mutations
Mitochondria, 142–143
Mitochondrial disorders, 142–145, 247–248
  heteroplasmy, 143f
  mitochondrial inheritance, 143f
Mitochondrial DNA (mtDNA), 4, 143, 168–169
Mitochondrial DNA depletion syndrome (MTDPS12B), 109
Mitochondrial genome mosaicism, 168–169
Mitochondrial malate dehydrogenase 2 (MDH2), 243
Mitochondrial myopathy, encephalomyopathy, lactic acidosis, and stroke-like episodes (MELAS), 169
Mitosis, 45
Molecular cytogenetics, 18–19
Molecular inversion probes (MIPs), 67
Monoallelic expression. See Allele-specific expression (ASE)
Mosaic aneuploidy in rare disease, 158–169
Mosaic de novo variants, 169
Mosaic disorders
  of X chromosome, 167
  of Y chromosome, 167–168
Mosaic loss of Y chromosome (mLOY), 168
Mosaic mobile element insertions, 169
Mosaic mutation, 151–152
Mosaic structural variation, 157–158
Mosaic variegated aneuploidy syndrome 1 (MAVS1), 168
Mosaicism, 13, 21–22, 151, 177
  cancer as series of rare mosaic diseases, 174–175
  categories of mosaic variation, 169–170

  chimerism, 177
  mosaic aneuploidy in rare disease, 158–169
  mosaic forms in Mendelian disorders, 175–177
  obligate mosaicism in rare diseases, 170–174
  strategies and technologies to identify mosaic variation, 153–158
    array comparative genomic hybridization, 156
    clinical observation, 155
    cytogenetics, 155–156
    next-generation sequencing, 156–158
    single-nucleotide polymorphism arrays, 156
MPV17 gene, 145
mtDNA depletion syndromes (MDS), 145
Muenke syndrome, 192–193
Multipoint linkage analysis, 207
Multisystem autoimmune disease with facial dysmorphism (ADMFD), 109
Mus musculus. See House mouse (Mus musculus)
Mutagenesis, 153
Mutation(s), 4–6, 7f
  in mtDNA, 144–145
  mutational burden, 194–195
  rate, 124–126
MyGene2, 266
MYH3 gene, 83–84
Myoclonic epilepsy associated with ragged-red fibers (MERRF), 169
Myosin light chain 3, 243

N
N = 1 problem, 264–266
Nanopore sequencing, 74
NARP syndrome, 143–144
National Center for Biotechnology Information (NCBI), 79–80, 158
National Health Service (NHS), 85
National Institutes of Health (NIH), 84–85, 268
  Undiagnosed Diseases Program, 268–269
Natural ChIP (N-ChIP), 233–234
Natural microchimerism, 177
Nebulin (NEB), 220
Nematode worm (Caenorhabditis elegans), 62–63
Neomorphic mutation, 127–128
Nephrocerebellar syndrome, 109
NESCAV syndrome, 127
Neurofibromatosis type I (NF1), 176
Next-generation sequencing (NGS), 2, 19, 35–38, 65–68, 71–72, 109–110, 128, 205–206, 215, 233
  for detection of genomic rearrangements, 40–42
  genomic disorders and, 45–47
  to identify mosaic variants, 156–158
Non-European genetic ancestries, underrepresentation of, 266–268

Nonallelic homologous recombination (NAHR), 25, 42–45, 139
Noncoding RNA (ncRNA), 61–62
Nonframeshifting, 67
Nonhomologous end joining (NHEJ), 44–45
Nonsense mutations, 67
Nonsense-mediated decay (NMD), 67, 118–119, 123, 219
Nonsynonymous mutations, 67
Nuclear genes, 4
Nuclear magnetic resonance spectroscopy (NMR spectroscopy), 244
Nucleosome, 233
  positioning, 235–237
Nucleosome occupancy and methylome sequencing (NOMeseq), 236
Nucleotides, 67
Numerical chromosome aberrations, 19–22. See also Structural chromosome aberrations
  autosomal aneuploidy, 21

O
Obligate mosaicism in rare diseases, 170–174
  GNAQ and Sturge-Weber syndrome, 171–174
  GNAS and McCune-Albright syndrome, 170–171
  PIK3CA and CLOVES syndrome, 171
  Proteus syndrome, 171
  TSC1, TSC2 and tuberous sclerosis complex, 174
Olink proteomics platform, 240
Omics approaches, 229
  DNA methylation, 230–232
  epigenomics, 229–232
  evaluating higher-order chromatin architecture, 236–237
  metabolomics studies in rare diseases, 246–248
    workflow and data analysis, 244–246
  protein arrays for biomarker discovery, 239–240
  proteomics, 238–243
  proteomics approaches to study rare diseases, 242–243
  untargeted vs. targeted metabolomics, 246
Online Mendelian Inheritance in Man database (OMIM database), 9, 80–81, 185, 264–265
OPN1LW gene, 139
OPN1MW gene, 139
Osteoarthritis with mild chondrodysplasia (OSCDP), 121–122
Oxford Nanopore, 40–41, 73–74

P
Pacific Biosciences (PacBio), 40–41, 73–74
Paired-end read sequencing (PE read sequencing), 40
PAK1 gene, 127

Paracentric inversions, 23
Partial lipodystrophy, 276
Pedigree-based methods, 208
  pedigree-based statistical methods, 205–208
    linkage analysis, 205–207
    transmission disequilibrium testing, 207–208
Penetrance, 14
Peptide digestion, 240–242
Pericentric inversions, 23
Pharmacogenomics Knowledge database (PharmGKB), 81
Pharmacological chaperone therapy (PCT), 274
Phasing, 157
PhastCons, 81
Phenome Central, 266
Phenotype, 56
Phenotypic expansion, 192–194
  delineating phenotypic spectrum and expansion, 192–193
  dissection of genotype–phenotype relationships, 193–194
  multiple molecular diagnoses masquerading as phenotypic expansion, 193
  of rare diseases, 269–270
Phenotypic similarity scores (PSS), 190–191
Phenylketonuria (PKU), 98
Philadelphia chromosome, 168
Phosphoinositide 3-kinase (PI3K), 171
Phylogenetic Hidden Markov Model (phylo-HMM), 81
PhyloP, 81
Picotiter plate (PTP), 72
PIK3CA syndrome, 171
PMP22 gene, 35
PMS2 gene, 223–224
POLG gene, 145
Polonies, 67
Polyglutamine (polyQ), 238
Polyhydramnios, megalencephaly, and symptomatic epilepsy (PMSE), 109
Polymerase chain reaction (PCR), 71–72, 106
Polymorphism phenotyping algorithm (PolyPhen algorithm), 81–82
Pompe disease, 98
Population-based studies, 210–211
Posterior column ataxia with retinitis pigmentosa (AXPC1), 109
Posttranslational modifications (PTMs), 233
Postzygotic mutation, 152–153
Prader-Willi syndrome (PWS), 13–14, 35–38, 191
PRDM9 recombination hotspots, 43–44
Precision health, 268–269
Precision medicine, 275–276
Prefractionation methods, 240–242
Premature termination codon (PTC), 118–119
Primary immunodeficiency disorders (PIDs), 275
Probability of LoF intolerance (pLI), 126–127
Progressive familial intrahepatic cholestasis (PFIC1), 108

Promoter, 2
Proprotein convertase subtilisin/kexin type 9 (PCSK9), 277
Protein arrays for biomarker discovery, 239–240
Proteomics, 238–243
Proteus syndrome, 171
Proximity Extension Assay technology, 240
Pyrogram, 72
Pyrosequencing, 72

Q
Quadrupole mass analyzer, 242
Quality control (QC), 208
Quantitative laboratory traits, 211
Quantitative PCR (qPCR), 236–237

R
RAC1 gene, 127
Rad51, 44–45
Random inactivation, X chromosome, 141
Rare diseases, 12–14, 79
  epigenomic studies in, 237–238
  metabolomics studies in, 246–248
  mosaic aneuploidy in, 158–169
  obligate mosaicism in, 170–174
  proteomics approaches to study rare diseases, 242–243
  rare Mendelian diseases, 263
  research, 263
    challenges in, 264–272
    opportunities in, 272–277
  transcriptomics in, 218–224
Rare variant association testing
  gene burden collapsing test, 210f
Reactive oxygen species (ROS), 169
Read-depth (RD), 41–42
Reads. See Sequence reads
Recessive disorders, 98
Reciprocal translocations, 22
Repetitive DNA sequences, 13
Replicative segregation, 12, 143–144
Restriction fragment length polymorphisms (RFLP), 106
Reversible terminators, 72
Ribonucleic acid (RNA), 2, 4
  RNA-binding protein, 140
Ribosomal RNA (rRNA), 61–62
Ribosomes, 4
Ring chromosome mosaicism, 168
RNA sequencing (RNA-seq), 82, 217–218
  limitations, 225–226
  mechanisms underlying RNA-seq-based genetic diagnoses, 218–223
Robertsonian translocations (RTs), 23–25
Robinow syndrome, 123
Roche’s 454 sequencing technology, 72
ROR2 gene, 123
RYR1 gene, 15

S
Saccharomyces cerevisiae. See Baker’s yeast (Saccharomyces cerevisiae)
SACS gene, 186–187
Sandwich-immunoassay method, 239–240
Sanger sequencing, 68, 71
Scalable and Accurate Implementation of GEneralized mixed models (SAIGE), 208–209
Search Tool for the Retrieval of Interacting Genes/Proteins (STRING), 82
Second generation sequencing technologies, 71–73
Segmental duplications, 25
Semiconductor ion sequencing, 72–73
Sepiapterin reductase (SPR), 274–275
Sequence alignment/map file (SAM file), 75
Sequence data analysis, 74–79
Sequence Kernel Association Test (SKAT), 209
Sequence reads, 75
Sequencing, 68
  of genomes and exomes, 63–65
    first generation sequencing, 71
    second generation sequencing, 71–73
    third generation sequencing, 73–74
Sequencing by Oligonucleotide Ligation and Detection (SOLiD), 73
  NGS technology, 83–84
Sequencing-by-ligation (SBL), 68
Sequencing-by-synthesis (SBS), 68
Serial analysis of gene expression (SAGE), 217
Serial replication slippage, 44–45
Sex chromosome aneuploidies (SCAs), 19–22
SH3TC2 gene, 83–84
Short stature, facial dysmorphism, skeletal anomalies, and variable cardiac anomalies (SSFSC), 118–119
Short-interspersed nuclear elements (SINEs), 42–43
Short-read technologies, 41–42
Silent mutations. See Synonymous variants
Single base-pair mutations, 67
Single or simple nucleotide variants. See Single base-pair mutations
Single-cell ChIP-seq (scChIP-seq), 234–235
Single-cell resolution transcriptomics, 224–225
Single-cell RNA sequencing methodologies (scRNA-seq), 82, 224–225
Single-ended, double-stranded DNA (seDNA), 44–45
Single-molecule sequencing (SMS), 40–41, 68
Single-nucleotide polymorphisms (SNPs), 56, 18–19, 62, 109, 196, 205–206, 266
  array genotyping, 40, 156

Single-nucleotide variants (SNVs), 38, 156, 191–192, 218
Single-point linkage, 207
Sitosterolemia (STSL1), 108
Six-fingered dwarfism, 101
Small marker chromosomes (SMCs), 27
Small Molecule Pathway Database, 244–246
Smith-Magenis syndrome (SMS), 35–38
SMN1 gene, 8
Sodium bisulfite, 232
Solexa, 72
Solid-phase amplification, 67
Somatic mosaicism, 151–152, 152f
Somatic mutation, 153
Somatic SNVs, 175
Sorting tolerant from intolerant algorithm (SIFT algorithm), 81–82
Southern blot analysis, 117–118
Spinal muscular atrophy (SMA), 98
Splice site mutations, 67
Split-read analysis (SR analysis), 41–42
Sporadic disorders, 124–130
  genomic studies and identification of de novo mutations, 128–130
  human de novo mutation rate, 124–126
  mechanisms of disease of de novo mutations, 126–128
Statistical approaches
  association analyses for rare diseases, 208–212
  pedigree-based statistical methods, 205–208
Statistical genetics, 205
Structural chromosome aberrations, 22–28. See also Numerical chromosome aberrations
  complex rearrangements and chromothripsis, 28
  deletions and duplications, 25
  intrachromosomal and interchromosomal insertions, 25–26
  isochromosomes, 26–27
  marker chromosomes, 27–28
  paracentric and pericentric inversions, 23
  reciprocal translocations, 22
  Robertsonian translocations, 23–25
Structural variants (SVs), 38, 156
Structural variation, 62
Sturge-Weber syndrome, 171–174
Supernumerary small marker chromosomes (sSMCs), 27
Synonymous variants, 67
Systemic lupus erythematosus (SLE), 239–240

T
Targeted metabolomics, 246
TBX4 gene, 197
TBX6-associated congenital scoliosis (TACS), 196–197
TCF12 variant, 197–198
TET, 231–232
Tetrahydrobiopterin (BH4), 274–275

Therapy development for rare diseases, 272–275
Thiamine metabolism dysfunction (THMD3), 108
Third generation sequencing technologies, 73–74
Thrombocytopenia with absent radii (TAR), 196
Thymidine, 156–157
Thymine (T), 12
TIMMDC1
  deficiency, 219–220
  gene, 144–145
Tissue-specific expression, 226
Titin (TTN), 220
TK2 gene, 145
Top-down proteomics (TDP), 240–242
Transcription, 4
Transcription factor (TF), 230
Transcriptome, 217–218
  methodologies, 217–218
Transcriptomics, 229
  limitations of using RNA-seq in clinical molecular diagnosis, 225–226
  mechanisms underlying RNA-seq-based genetic diagnoses, 218–223
  in rare diseases, 218–224
  single-cell resolution transcriptomics, 224–225
Transfer RNA (tRNA), 4, 61–62, 121
Transforming growth factor β (TGF-β), 123
Translation, 4
Translocations, 17
Transmission disequilibrium tests (TDT), 207–208
Trichohepatoneurodevelopmental syndrome (THNS), 109–110
Trichorhinophalangeal syndrome type 1 (TRPS1), 186–187
Trio-based analyses, 215
TRIP13 gene, 168
Trisomy 1 mosaicism, 163–164
Trisomy 2 mosaicism, 164
Trisomy 3 mosaicism, 164
Trisomy 4 mosaicism, 164
Trisomy 5 mosaicism, 164
Trisomy 6 mosaicism, 164
Trisomy 7 mosaicism, 165
Trisomy 8 mosaicism, 165
Trisomy 9, 21
  mosaicism, 165
Trisomy 10 mosaicism, 165
Trisomy 11 mosaicism, 165
Trisomy 12 mosaicism, 165
Trisomy 13, 21, 158
  mosaicism, 165
Trisomy 14 mosaicism, 165–166
Trisomy 15 mosaicism, 166
Trisomy 16 mosaicism, 166
Trisomy 17 mosaicism, 166
Trisomy 18, 21, 158
  mosaicism, 166
Trisomy 19 mosaicism, 166
Trisomy 20 mosaicism, 166
Trisomy 21, 21, 158
  mosaicism, 167
Trisomy 22 mosaicism, 167
Trisomy X syndrome, 21–22
Tsix transcript, 141
Tuberous sclerosis complex (TSC)
  TSC1, 174
  TSC2, 174
TUBGCP2 gene, 194
Turner syndrome, 21–22, 158, 167, 170
Two-hit model of cancer, 153
Type 1 error, 208–210
Type I collagen, 121–122
Type II collagen, 121–122
Type II collagenopathies, 121–122
Type III collagen, 121–122

U
UBE3A gene, 13–14, 191–192
UK Biobank, 276
Ultrahigh-performance liquid chromatography (UHPLC), 244–246
Ultralow input CUT&RUN (uliCUT&RUN), 234
Ultralow-input MNase-based native ChIP, 234
Undiagnosed Diseases Network (UDN), 84–85
Undiagnosed Diseases Program (UDP), 84–85
Unfolded protein response (UPR), 121
Uniparental disomy (UPD), 23–24, 28–30, 151–152
Uniparental heterodisomy, 151–152
Uniparental isodisomy, 151–152
Untargeted metabolomics, 246
Untranslated regions (UTRs), 223
Uracil (U), 4
US Food and Drug Administration (FDA), 86–87
Usher syndrome type IIIB (USH3B), 109

V
Variable expressivity, 14, 194–195
  of dominant disorders, 123–124
  mutational burden, 194–195
Variant annotation, 78
Variant calling, 76–78
Variant data interpretation, 78
Variants of unknown significance (VUS), 215
Variegated mosaic aneuploidy, 168
Vertebral defects, anorectal malformations, cardiac defects, tracheoesophageal defects, renal malformations, and limb defect (VACTERL), 164
Vertical transmission, 10

W
Waardenburg syndrome type 4A (WS4A), 14, 108
Watson–Crick complementary base-pairing, 12
Whole-exome sequencing (WES), 2, 41–42, 63–64, 67, 102–104, 128, 144–145, 185, 215, 229, 263
Whole-genome sequencing (WGS), 2, 41–42, 63–64, 128, 141, 185, 212, 215, 263
WNT5A gene, 123

X
X chromosome, 138–141
  mosaic disorders of, 167
X-chromosome inactivation (XCI), 140, 167
  as modifier of X-linked disorders, 141
  skewing, 141, 145
X-inactivation skewing, 141
X-linked and mitochondrial disorders, 137
  mitochondrial disorders, 142–145
  X chromosome disorders, 138–141
  X chromosome inactivation as modifier of X-linked disorders, 141
  X-linked dominant disorders, 139–140
  X-linked recessive disorders, 138–140, 138f
Xist transcript, 141
XXYY syndrome, 21–22
XYY syndrome, 21–22

Y
Y chromosome, 138–140
  mosaic disorders of, 167–168
YARS1, 270

Z
Zero-mode waveguide detectors (ZMW detectors), 74