Epigenome-Wide Association Studies: Methods and Protocols (Methods in Molecular Biology, 2432) 1071619934, 9781071619933

This volume details features of DNA methylation data, data processing pipelines, quality control measures, data normaliz


English Pages 239 [232] Year 2022



Table of contents :
Preface
Contents
Contributors
Chapter 1: Quantification Methods for Methylation Levels in Illumina Arrays
1 Introduction
2 Materials
2.1 Obesity Data Set
2.2 Monte Carlo Simulation Settings
3 Methods
3.1 β-Value
3.2 M-Value
3.3 N-Value
3.4 Comparison of Three Methylation Quantification Methods
3.4.1 Features of Signal Intensities
3.4.2 Distributions of Measures of Methylation Level
3.4.3 Identification of Differential Methylation by Three Methods
3.4.4 Simulation Study
4 Notes
References
Chapter 2: Evaluating Reliability of DNA Methylation Measurement
1 Introduction
2 ICC Estimation and Modeling
3 Real Data Analysis Example
3.1 Replicates from Methylation Chips of Same Model
3.2 Replicates from Different Methylation Arrays
4 Summary and Other Remarks
References
Chapter 3: Accurate Measurement of DNA Methylation: Challenges and Bias Correction
1 Introduction
1.1 Incomplete Bisulfite Conversion
1.2 PCR Bias
1.3 Region-Specific Bias
1.4 Coverage Bias
2 Materials
3 Methods
3.1 Statistical Methods to Correct Methylation Levels
3.2 Model Outline
3.3 Calibration Experiment with Technical Replicates
3.4 Corrected Methylation Degree
3.5 Real Case Applications
3.5.1 Application in Clinical Diagnostic of Celiac Patients
3.5.2 Application in Clinical Diagnostic: Mosaic Beckwith-Wiedemann Syndrome
4 Notes
5 Conclusions
References
Chapter 4: Using R for Cell-Type Composition Imputation in Epigenome-Wide Association Studies
1 Introduction
2 Reference-Based Methods
3 Reference-Free Methods
3.1 ReFACTor
3.2 SVA
4 Discussion
References
Chapter 5: Cell Type-Specific Signal Analysis in Epigenome-Wide Association Studies
1 Introduction: Functional Overlap Analysis, FORGE and eFORGE
2 Materials
3 Methods
3.1 Functional Overlap Analysis and EWAS: The Design of eFORGE
3.2 Cell Type-Specific and Cell Type-Confounding Effects in eFORGE
4 Specific Notes on eFORGE Analysis
4.1 Number of Probes
4.2 Separating Hypo- and Hypermethylated Sites
4.3 Tissue-Specific, General, and Mixed Signals
4.3.1 Tissue-Specific eFORGE Results
4.3.2 General Enrichment Results
4.3.3 Mixed Enrichment
4.4 Choosing Different Datasets to Analyze in eFORGE
4.5 False-Positive Rate
4.6 Reproducibility of eFORGE Results
4.7 Evaluating Enrichments Across Replicate Samples
4.8 Number of Background Repetitions
4.9 Enrichment or Depletion
4.10 Significance Threshold
4.11 Analysis Label
4.12 Broad View and Future Directions
References
Chapter 6: Controlling Batch Effect in Epigenome-Wide Association Study
1 Introduction
1.1 Adjusting Raw Methylation Data Matrix
1.2 Adjusting Model as Included Covariates
2 Methods
2.1 COMBAT: Empirical Bayes Method
2.2 SVA: Surrogate Variable Analysis
2.3 Control Probes
2.4 LMM: Linear-Mixed Effect Model
2.5 PCA
3 Discussion
4 Notes
References
Chapter 7: DNA Methylation and Atopic Diseases
1 Introduction
2 Cross-Sectional Studies
2.1 EWAS on Total Serum IgE Levels
2.2 EWAS on Atopic Asthma
3 Birth Cohort and Longitudinal Studies
3.1 Birth Cohort Studies
3.2 Longitudinal Prospective Studies
4 Tissue Type and Methylation Profiles
5 Discussion
6 Conclusion
References
Chapter 8: Meta-Analysis for Epigenome-Wide Association Studies
1 Introduction
2 Methods
2.1 Fixed-Effects Model
2.2 Random-Effects Model
3 Application
3.1 Data
3.2 Install metafor Package
3.3 Load metafor Package and Import Datasets into R
3.4 Conduct Meta-Analysis
3.5 Calculate Inflation Factor
3.6 Quantile-Quantile (QQ) Plot
3.7 Heterogeneity Analysis
4 Discussion
References
Chapter 9: Increase the Power of Epigenome-Wide Association Testing Using ICC-Based Hypothesis Weighting
1 Introduction
2 Materials
3 Methods
3.1 ICC Calculation
3.2 Surrogate Variable Analysis
3.3 Association Testing
3.4 p-Value Adjusting by IHW
3.5 Diagnostic Plots for ICC-Based p-Value Adjusting
4 Notes
References
Chapter 10: A Review of High-Dimensional Mediation Analyses in DNA Methylation Studies
1 Introduction
2 Single Mediator Model
3 Multiple Mediators Model
4 High-Dimensional Mediators Model
4.1 Continuous Outcome
4.2 Binary Outcome
5 Applications
5.1 Normative Aging Study
5.2 Epithelial Ovarian Cancer
6 Concluding Remarks
References
Chapter 11: DNA Methylation Imputation Across Platforms
1 Introduction
2 Materials
3 Methods
3.1 Data Harmonization Via Local Smoothing Among Training Samples
3.2 Penalized Functional Regression (PFR) Model
3.2.1 Estimation of Xi(t)
3.2.2 Estimation of β(t)
3.2.3 Selection of Tuning Parameters
3.2.4 Selection of Local Covariates Z
3.3 Post-imputation Quality Filter
3.4 Imputation Quality Assessment
3.4.1 Cross Validation
3.4.2 Quality Measures
4 Method Performance
5 Future Directions
References
Chapter 12: Workflow to Mine Frequent DNA Co-methylation Clusters in DNA Methylome Data
1 Introduction
2 Materials
3 Methods
3.1 Data Acquisition
3.2 Data Pre-processing
3.3 Individual DNA Co-methylation Cluster (DCMC) Mining
3.4 Cluster Comparison and Visualization (Optional)
3.5 Calculate "Eigengene" Using Singular Value Decomposition for Every Cluster Mined from Individual Dataset
4 Example Study with This Workflow
5 Discussion
References
Chapter 13: BCurve: Bayesian Curve Credible Bands Approach for the Detection of Differentially Methylated Regions
1 Introduction
2 The BCurve Methodology
3 The BCurve Package Implementation and Sample Analysis
3.1 Implementation
3.2 Simulation and Analysis of BS-Seq Data
3.2.1 Simulating BS-Seq Dataset
3.2.2 Input Data for bcurve
3.2.3 Numerical and Graphical Summaries of DMR/DMC Results
3.3 Simulation and Analysis of Microarray Data
3.3.1 Simulating Microarray Dataset
3.3.2 Analysis and Inference of Microarray Data
4 Two Real Data Analysis Examples
4.1 Analysis of a Mouse Brain BS-seq Dataset
4.2 Analysis of the TCGA LUAD Microarray Data
5 Summary and Other Remarks
5.1 Software Availability
References
Chapter 14: Predicting Chronological Age from DNA Methylation Data: A Machine Learning Approach for Small Datasets and Limited...
1 Introduction
2 Materials
2.1 Dataset
2.2 Software
3 Data Preprocessing
3.1 Calculating the β-Values
3.2 Transformation to M-Values
3.3 Centering the Dataset
3.4 Importing the Dataset in R
4 Marker Selection
4.1 Forward Selection and Backward Elimination
4.1.1 Forward Selection
4.1.2 Backward Elimination
4.2 Boruta
4.3 Genetic Algorithm
4.4 Regression
5 Prediction Modeling
5.1 Splitting the Data
5.2 Multiple Linear Regression
5.3 Support Vector Machine with Polynomial Function
6 Notes
References
Chapter 15: Application of Correlation Pre-Filtering Neural Network to DNA Methylation Data: Biological Aging Prediction
1 Introduction
1.1 Source Code Repository
1.2 Getting Started
2 Materials
3 Method
3.1 Build CPFNN
3.1.1 Training Data Preparation
3.1.2 Model Building
3.1.3 Grid Search for Optimal Parameters
3.2 CPFNN User Instruction
3.2.1 Data Parsing
3.2.2 Age Prediction
3.3 Potential Use of CPFNN
3.3.1 Age Acceleration Detection
3.3.2 Unknown Age Prediction
4 Discussion
References
Chapter 16: Differential Methylation Analysis for Bisulfite Sequencing (BS-Seq) Data
1 Introduction
2 Materials
2.1 Data
2.2 Software
3 Methods
3.1 Download Data
3.2 Install R Language
3.3 Install DSS and Dependency Packages
3.4 Set Working Directory and Load Software
3.5 Read in the Data
3.6 Conduct DM Analysis
3.6.1 Regular Two-Group DM Analysis
3.6.2 Two-Group DM Analysis Without Replicates
3.6.3 Multifactor Design Comparisons
4 Notes
References
Index

Methods in Molecular Biology 2432

Weihua Guan Editor

Epigenome-Wide Association Studies: Methods and Protocols

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible, step-by-step fashion, opening with an introductory overview and a list of the materials and reagents needed to complete the experiment, followed by a detailed procedure that is supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Epigenome-Wide Association Studies Methods and Protocols

Edited by

Weihua Guan University of Minnesota, Minneapolis, MN, USA


ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-1993-3 ISBN 978-1-0716-1994-0 (eBook) https://doi.org/10.1007/978-1-0716-1994-0 © Springer Science+Business Media, LLC, part of Springer Nature 2022 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

DNA methylation is an important epigenetic mechanism. Without necessarily changing the DNA sequence, a methyl group can be added to the carbon-5 position of a cytosine to form 5-methylcytosine. It can potentially mediate gene expression, perhaps by inhibiting transcription factor binding. DNA methylation patterns can vary throughout the lifetime, a dynamic process involving de novo DNA methylation and demethylation. DNA methylation has been observed to be associated with a variety of phenotypic traits (e.g., aging and smoking) and human diseases (e.g., cancer and heart failure). Recent technological advances have provided multiple platforms, either microarray-based or sequencing-based, for systematically interrogating DNA methylation across the genome. Analogous to genome-wide association studies (GWASs) of single nucleotide polymorphisms (SNPs), epigenome-wide association studies (EWASs) of DNA methylation can be carried out for a "hypothesis-free" search of the entire epigenome for disease-associated genes. However, features of DNA methylation data raise unique challenges compared to GWASs. For example, DNA methylation data are typically measured as continuous values instead of discrete genotypes, with skewed distributions or outlying values; they are more sensitive to batch effects than SNP data; DNA methylation is cell-type or tissue specific, and cell-type composition can be a significant confounder in EWASs; and, while SNP imputation methods are well developed, correlation between nearby CpG sites is generally lower than the linkage disequilibrium between SNPs, which makes it difficult to impute methylation levels across different versions of microarray platforms. In addition to traditional association tests, DNA methylation markers can also be incorporated in machine learning models to predict disease-related factors, e.g., biological aging clocks.

This volume in the series ranges from the features of DNA methylation data, data processing pipelines, and quality control measures to statistical methods for association analysis, control of confounding and batch effects, and identification of differentially methylated regions. Other chapters discuss how to carry out meta-analysis to combine results from multiple EWASs, which has become a common practice to improve statistical power; how to estimate or account for cell-type compositions; how to develop predictive models for epigenetic aging using machine learning methods; and how to conduct mediation analysis using high-dimensional DNA methylation data. One chapter discusses the application of EWAS to clinical outcomes, aiming to identify DNA methylation markers associated with atopic diseases. The last chapter of the book introduces a pipeline to analyze whole genome bisulfite sequencing data.

Analysis of genome-wide DNA methylation data has many interesting challenges and applications. We have tried to bring together chapter authors from different research areas, covering aspects from statistical methodology to clinical application to modern machine learning techniques. Because the field is evolving rapidly, we are certainly unable to cover every topic related to the analysis of DNA methylation data, but we would like our readers to be exposed to some important concepts and methodology. We hope that this volume will help researchers successfully carry out EWASs to address important research questions while being aware of the challenges.


Finally, we would like to extend our sincere gratitude and appreciation for all of the hard work and contribution provided by the authors of the chapters. We also want to thank John Walker (series editor) and Anna Rakovsky (editor at Springer) for their support and assistance throughout the project.

Weihua Guan
Minneapolis, MN, USA

Contents

Preface
Contributors

1 Quantification Methods for Methylation Levels in Illumina Arrays
   Duchwan Ryu and Hao Shen
2 Evaluating Reliability of DNA Methylation Measurement
   Rui Cao and Weihua Guan
3 Accurate Measurement of DNA Methylation: Challenges and Bias Correction
   Eguzkine Ochoa, Verena Zuber, and Leonardo Bottolo
4 Using R for Cell-Type Composition Imputation in Epigenome-Wide Association Studies
   Chong Wu
5 Cell Type-Specific Signal Analysis in Epigenome-Wide Association Studies
   Charles E. Breeze
6 Controlling Batch Effect in Epigenome-Wide Association Study
   Yale Jiang, Jianjiao Chen, and Wei Chen
7 DNA Methylation and Atopic Diseases
   Yale Jiang, Erick Forno, and Wei Chen
8 Meta-Analysis for Epigenome-Wide Association Studies
   Nan Wang and Shuilin Jin
9 Increase the Power of Epigenome-Wide Association Testing Using ICC-Based Hypothesis Weighting
   Bowen Cui, Shuya Cui, Jinyan Huang, and Jun Chen
10 A Review of High-Dimensional Mediation Analyses in DNA Methylation Studies
   Haixiang Zhang, Lifang Hou, and Lei Liu
11 DNA Methylation Imputation Across Platforms
   Gang Li, Guosheng Zhang, and Yun Li
12 Workflow to Mine Frequent DNA Co-methylation Clusters in DNA Methylome Data
   Jie Zhang and Kun Huang
13 BCurve: Bayesian Curve Credible Bands Approach for the Detection of Differentially Methylated Regions
   Chenggong Han, Jincheol Park, and Shili Lin
14 Predicting Chronological Age from DNA Methylation Data: A Machine Learning Approach for Small Datasets and Limited Predictors
   Anastasia Aliferi and David Ballard
15 Application of Correlation Pre-Filtering Neural Network to DNA Methylation Data: Biological Aging Prediction
   Lechuan Li, Chonghao Zhang, Hannah Guan, and Yu Zhang
16 Differential Methylation Analysis for Bisulfite Sequencing (BS-Seq) Data
   Hao Feng, Karen Conneely, and Hao Wu

Index

Contributors

ANASTASIA ALIFERI • King’s Forensics, Department of Analytical, Environmental and Forensic Sciences, Faculty of Life Sciences and Medicine, King’s College London, London, UK
DAVID BALLARD • King’s Forensics, Department of Analytical, Environmental and Forensic Sciences, Faculty of Life Sciences and Medicine, King’s College London, London, UK
LEONARDO BOTTOLO • Department of Medical Genetics, University of Cambridge, Cambridge, UK; MRC Biostatistics Unit, University of Cambridge, Cambridge, UK; The Alan Turing Institute, London, UK
CHARLES E. BREEZE • Altius Institute for Biomedical Sciences, Seattle, WA, USA; UCL Cancer Institute, University College London, London, UK
RUI CAO • Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
JIANJIAO CHEN • Division of Pulmonary Medicine, Department of Pediatrics, UPMC Children’s Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, PA, USA
JUN CHEN • Department of Quantitative Health Sciences and Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
WEI CHEN • Division of Pulmonary Medicine, Department of Pediatrics, UPMC Children’s Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, PA, USA
KAREN CONNEELY • Department of Biostatistics and Bioinformatics, Emory University Rollins School of Public Health, Atlanta, GA, USA; Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA
BOWEN CUI • State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Rui Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
SHUYA CUI • Shanghai Jiao Tong University School of Medicine, Shanghai, China
HAO FENG • Department of Biostatistics and Bioinformatics, Emory University Rollins School of Public Health, Atlanta, GA, USA
ERICK FORNO • Division of Pulmonary Medicine, Department of Pediatrics, UPMC Children’s Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, PA, USA
HANNAH GUAN • Department of Computer Science, Trinity University, San Antonio, TX, USA
WEIHUA GUAN • Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN, USA
CHENGGONG HAN • Interdisciplinary Ph.D. Program in Biostatistics, The Ohio State University, Columbus, OH, USA
LIFANG HOU • Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
JINYAN HUANG • State Key Laboratory of Medical Genomics, Shanghai Institute of Hematology, Rui Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
KUN HUANG • Department of Biostatistics and Health Data Science, School of Medicine, Indiana University, Indianapolis, IN, USA
YALE JIANG • Division of Pulmonary Medicine, Department of Pediatrics, UPMC Children’s Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, PA, USA; School of Medicine, Tsinghua University, Beijing, China
SHUILIN JIN • School of Mathematics, Harbin Institute of Technology, Harbin, Heilongjiang, China
GANG LI • Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
LECHUAN LI • Department of Computer Science, Trinity University, San Antonio, TX, USA
YUN LI • Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
SHILI LIN • Department of Statistics, The Ohio State University, Columbus, OH, USA
LEI LIU • Division of Biostatistics, Washington University in St. Louis, St. Louis, MO, USA
EGUZKINE OCHOA • Department of Medical Genetics, University of Cambridge, Cambridge, UK; Cambridge NIHR Biomedical Research Centre, Cambridge, UK
JINCHEOL PARK • Department of Statistics, Keimyung University, South Korea
DUCHWAN RYU • Department of Statistics and Actuarial Science, Northern Illinois University, DeKalb, IL, USA
HAO SHEN • Department of Statistics and Actuarial Science, Northern Illinois University, DeKalb, IL, USA
NAN WANG • School of Mathematics, Harbin Institute of Technology, Harbin, Heilongjiang, China
CHONG WU • Department of Statistics, Florida State University, Tallahassee, FL, USA
HAO WU • Department of Biostatistics and Bioinformatics, Emory University Rollins School of Public Health, Atlanta, GA, USA
CHONGHAO ZHANG • Department of Computer Science, Trinity University, San Antonio, TX, USA
GUOSHENG ZHANG • Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
HAIXIANG ZHANG • Center for Applied Mathematics, Tianjin University, Tianjin, China
JIE ZHANG • Department of Medical & Molecular Genetics, School of Medicine, Indiana University, Indianapolis, IN, USA
YU ZHANG • Department of Computer Science, Trinity University, San Antonio, TX, USA
VERENA ZUBER • Department of Epidemiology and Biostatistics, Imperial College London, London, UK; MRC Biostatistics Unit, University of Cambridge, Cambridge, UK

Chapter 1
Quantification Methods for Methylation Levels in Illumina Arrays
Duchwan Ryu and Hao Shen

Abstract
Genome-wide patterns of methylation levels and the identification of differentially methylated genetic loci or regions of CpG sites are increasingly important, because the methylome has the potential for large effects in disease etiology. Several methods quantify the methylation level at each CpG site for a given sample, including the β-value, M-value, and N-value. The performance of these three quantification methods has been examined with a simulation study and with 27K Illumina array data for obesity.

Key words Identification of differential methylation, Measure of methylation level, Normalization of signal intensity

1 Introduction

DNA methylation is an important epigenetic modification that has shown close relationships with many cancers; hence, genome-wide DNA methylation has drawn attention for its potential role in human disease etiology [1, 2]. BeadChip technology developed by Illumina enables epigenome-wide association studies of human disease [3], and it allows comprehensive and expert-selected coverage of CpG sites. The recent BeadChip array contains probes for over 480K CpG sites that cover 99% of RefSeq genes with multiple probes per gene and 96% of CpG islands from the UCSC database. Specifically, the Infinium I assay examines the 27K array by using two separate probes for each CpG site, one for the methylated intensity and the other for the unmethylated intensity, whereas the Infinium II assay examines the 480K array by using only one probe per site for both methylated and unmethylated intensities [4]. Both assays produce red- and green-channel intensities and provide normalized intensities by a proprietary method in BeadStudio [5].

Weihua Guan (ed.), Epigenome-Wide Association Studies: Methods and Protocols, Methods in Molecular Biology, vol. 2432, https://doi.org/10.1007/978-1-0716-1994-0_1, © Springer Science+Business Media, LLC, part of Springer Nature 2022


The methylation level can be quantified by the relative methylated/unmethylated signal intensity among experimental conditions [5]. From the standard output of BeadStudio, the ratio of the methylated probe intensity to the sum of the methylated and unmethylated intensities has been used as the β-value to describe the methylation level at each CpG site. As an improved measurement, a logistic transformation of the β-value has been proposed as the M-value, which may yield better results [6]. However, neither the β-value nor the M-value considers the variability in the methylated and unmethylated signal intensities. To further improve the detection of differential methylation, a weighted logistic transformation of the methylated and unmethylated signal intensities has been proposed as the N-value to summarize methylation levels [7].

Data normalization is another important aspect of ensuring accurate results for statistical testing of differential methylation. The Illumina software normalizes probe signal intensities when calculating the methylated and unmethylated signal intensities. Much work has focused on normalizing data obtained using BeadArrays for gene expression, while less research has focused on normalization of the methylation arrays [5]. Beyond the normalization supplied through BeadStudio, quantile normalization [8, 9] and mean normalization [10, 11] can be considered. Because signal intensity distributions can still differ markedly across arrays after normalization by the Illumina software, between-array normalization is also a concern. The use of various normalization methods for adjusting for batch effects has been examined [12]. Some approaches to normalization with respect to several factors that could affect the analyses, including batch, DNA input, and bisulfite conversion efficiency as measured using control probes, have been examined [8]. β-values normalized to follow the standard normal distribution have also been used in analyses [13]. The N-value has the advantage that it appropriately normalizes relative methylation levels [7].
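As an illustration of the quantile normalization mentioned above, the following is a minimal Python sketch that assumes NumPy and a sites-by-samples intensity matrix. The function name and the tie handling (ties broken by position) are illustrative assumptions, not the procedure used in the cited references or in BeadStudio.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns of a (sites x samples) array.

    Each sample (column) is forced onto a common reference distribution:
    the mean, across samples, of the k-th smallest intensity.
    Ties are broken by position, an illustrative simplification.
    """
    x = np.asarray(x, dtype=float)
    # Row indices that would sort each column.
    order = np.argsort(x, axis=0, kind="stable")
    # Rank (0-based) of every entry within its own column.
    ranks = np.argsort(order, axis=0, kind="stable")
    # Reference distribution: mean of the sorted columns, one value per rank.
    reference = np.sort(x, axis=0).mean(axis=1)
    # Replace each entry by the reference value at its rank.
    return reference[ranks]
```

After this transformation every sample has the same empirical distribution of intensities, which is the property that makes the method attractive when arrays differ markedly in overall signal.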

2 Materials

The performance of three quantification methods, the β-value, M-value, and N-value, in identifying differentially methylated CpG sites is examined through an obesity data set and a simulation study.

2.1 Obesity Data Set

Methylation data collected from 7 age-matched lean control samples and 7 obese case samples have been processed by the 27K Illumina array that contains 27,578 CpG sites [14] with NCBI’s Gene Expression Omnibus accession number GSE25301. The control samples are labeled as sample #1 through sample #7, and the case samples are labeled as sample #8 through sample #14.


These samples are 14 subjects identified from 534 participants in the Lifestyle, Adiposity, and Cardiovascular Health in Youth (LACHY) study. The LACHY study recruited male and female subjects from high schools in the Augusta, Georgia area, with roughly equal numbers of African American and European American adolescents aged 14 to 18 years [15]. The identified subjects are male, of African American ancestry, and have leukocyte DNA; obese subjects have a body mass index (BMI) ≥ 99th percentile and lean subjects have a BMI ≤ 10th percentile for the corresponding age and sex.

2.2 Monte Carlo Simulation Settings

In the simulation study, as in the obesity data, 27,578 CpG sites are considered. For each CpG site, the methylation data of each sample are generated from a statistical distribution. When a CpG site is supposed to be differentially methylated, the distribution of the methylation data for the control samples differs from that for the case samples, whereas when a CpG site is not differentially methylated, the two distributions are the same. Specifically, the means and standard deviations of signal intensity are used to calculate the β-, M-, and N-values. The methylated and unmethylated signal intensities are simulated separately to match the obesity data, in which the two signal intensities have slightly different distributions. For CpG site $i$ and sample $j$, the standard deviations of the methylated signal, $S^m_{ij}$, and of the unmethylated signal, $S^u_{ij}$, are generated from lognormal distributions such that

$$\log S^m_{ij} \sim N(\mu^m_s, \sigma^{m2}_s) \quad \text{and} \quad \log S^u_{ij} \sim N(\mu^u_s, \sigma^{u2}_s),$$

where $\mu^m_s$ and $\sigma^m_s$ are the mean and standard deviation for methylated probes and $\mu^u_s$ and $\sigma^u_s$ are the mean and standard deviation for unmethylated probes, respectively. The means and standard deviations of these distributions are estimated using the obesity data. For the methylated probes, the standard deviations are generated by

$$\log S^m_{ij} \sim N[\mu^m_s, (0.3\,\sigma^m_s)^2].$$

The unmethylated probes are simulated using larger standard deviations than the methylated probes, such that

$$\log S^u_{ij} \sim N[\mu^u_s, (0.5\,\sigma^u_s)^2].$$

Intensities are simulated next, using simple linear regression models of log intensity on log standard deviation fitted to the intensities of the obesity data, such that

$$\log \hat{y}^m_{ij} = \gamma^m_0 + \gamma^m_1 \log S^m_{ij} + \varepsilon^m_{ij},$$
$$\log \hat{y}^u_{ij} = \gamma^u_0 + \gamma^u_1 \log S^u_{ij} + \varepsilon^u_{ij},$$

where $\gamma^m_0$ and $\gamma^u_0$ are random intercepts, $\gamma^m_1$ and $\gamma^u_1$ are random slopes, and $\varepsilon^m_{ij}$ and $\varepsilon^u_{ij}$ are random errors for methylated and unmethylated probes, respectively. For methylated probes, denoting by $\mu^m_\gamma$ and $\Sigma^m_\gamma$ the mean and variance of the estimated regression coefficients and by $\sigma^{m2}_\varepsilon$ the best quadratic unbiased estimator of the error variance from the regression of the obesity data on $\log S^m_{ij}$, the regression coefficients and error term are generated by assuming $(\gamma^m_0, \gamma^m_1) \sim N(\mu^m_\gamma, \Sigma^m_\gamma)$ and $\varepsilon^m_{ij} \sim N(0, \sigma^{m2}_\varepsilon)$. Similarly, for unmethylated probes, denoting by $\mu^u_\gamma$ and $\Sigma^u_\gamma$ the mean and variance of the estimated regression coefficients and by $\sigma^{u2}_\varepsilon$ the best quadratic unbiased estimator of the error variance from the regression of the obesity data on $\log S^u_{ij}$, the regression coefficients and error term are generated by assuming $(\gamma^u_0, \gamma^u_1) \sim N(\mu^u_\gamma, \Sigma^u_\gamma)$ and $\varepsilon^u_{ij} \sim N(0, \sigma^{u2}_\varepsilon)$.

The distribution of signal intensity is heterogeneous across samples in the obesity data. To maintain the linear relationship between the standard deviations and the intensities, regression coefficients are sampled from $(\gamma^m_0, \gamma^m_1) \sim N(\mu^m_\gamma + 0.7, \Sigma^m_\gamma)$ for methylated probes and $(\gamma^u_0, \gamma^u_1) \sim N(\mu^u_\gamma + 1, \Sigma^u_\gamma)$ for unmethylated probes. In addition, differences between case and control groups are generated by shifting the standard deviations in the case samples. In one set of simulations these standard deviations are shifted by 1, and in another set they are shifted by −1. Among all CpG sites, four different rates of differentially methylated CpG sites are considered: 5, 10, 15, or 20% of CpG sites; within each rate, the differentially methylated sites are randomly selected. For all scenarios, 1000 simulations have been conducted with three sample sizes for each of the case and control groups: 7, 10, and 20.
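The generating scheme of Subheading 2.2 can be sketched in Python for one probe type at one CpG site. This is a rough sketch under stated assumptions rather than the authors' code: NumPy is assumed, all numeric parameter values are hypothetical placeholders (the chapter estimates them from the obesity data and does not list them here), the case shift is assumed to act on the log standard deviation, and the regression coefficients are assumed to be drawn once per sample.

```python
import numpy as np

def simulate_site_intensities(rng, n_control, n_case, mu_s, sigma_s, sd_scale,
                              mu_gamma, cov_gamma, sigma_eps, shift=0.0):
    """Simulate standard deviations and intensities for one probe type.

    Returns (S, y) for n_control + n_case samples; controls come first.
    shift=0 corresponds to a site that is not differentially methylated.
    """
    n = n_control + n_case
    # log S_ij ~ N(mu_s, (sd_scale * sigma_s)^2)
    log_s = rng.normal(mu_s, sd_scale * sigma_s, size=n)
    # Assumption: the case/control difference is a shift on the log scale.
    log_s[n_control:] += shift
    # (gamma_0, gamma_1) ~ N(mu_gamma, cov_gamma); assumed one draw per sample.
    gamma = rng.multivariate_normal(mu_gamma, cov_gamma, size=n)
    # eps_ij ~ N(0, sigma_eps^2)
    eps = rng.normal(0.0, sigma_eps, size=n)
    # log y_ij = gamma_0 + gamma_1 * log S_ij + eps_ij
    log_y = gamma[:, 0] + gamma[:, 1] * log_s + eps
    return np.exp(log_s), np.exp(log_y)

rng = np.random.default_rng(0)
cov = [[0.01, 0.0], [0.0, 0.01]]  # hypothetical coefficient covariance
# Methylated probes: scale 0.3, case samples shifted by 1 (hypothetical values).
s_m, y_m = simulate_site_intensities(rng, 7, 7, mu_s=5.0, sigma_s=1.0, sd_scale=0.3,
                                     mu_gamma=[2.0, 1.0], cov_gamma=cov,
                                     sigma_eps=0.1, shift=1.0)
# Unmethylated probes: scale 0.5, no shift (hypothetical values).
s_u, y_u = simulate_site_intensities(rng, 7, 7, mu_s=5.0, sigma_s=1.0, sd_scale=0.5,
                                     mu_gamma=[2.0, 1.0], cov_gamma=cov,
                                     sigma_eps=0.1, shift=0.0)
```

The offsets of 0.7 and 1 added to the coefficient means in the text would be folded into `mu_gamma` by the caller. Repeating the call over all 27,578 sites, with a nonzero `shift` for the randomly chosen differentially methylated sites, would give one simulated data set.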

3 Methods

Illumina provides the average and standard deviation of both methylated and unmethylated signal intensities for each CpG site. It also calculates an estimate of the percentage of chromosomes that are methylated, known as the β-value. The β-value and two other quantification methods that use the Illumina measures, the M-value and the N-value, are summarized in the following subheadings.

3.1 β-Value

For the ith CpG site on the jth sample, the β-value is defined as the ratio of the average intensity of the methylated probe to the total average intensity of both the methylated and unmethylated probes,

$$\beta_{ij} = \frac{\max(y^m_{ij}, 0)}{\max(y^m_{ij}, 0) + \max(y^u_{ij}, 0) + \alpha}, \qquad (1)$$

where $y^m_{ij}$ is the average methylated signal intensity, $y^u_{ij}$ is the average unmethylated signal intensity, and $\alpha$ is a constant offset. In the simulation study and data analysis, the constant offset is set to $\alpha = 100$. The β-value ranges from 0 to 1, with 0 indicating that none of the DNA molecules in that sample is methylated and 1 indicating that all of the DNA molecules in that sample are methylated.

3.2 M-Value

As a generalization of the β-value, a logistic transformation might be appropriate when analyzing proportions. For CpG site i and sample j, the M-value, which is not bounded by 0 and 1, can be defined by

$$M_{ij} = \log_2\left(\frac{\beta_{ij}}{1 - \beta_{ij}}\right) = \log_2\left(\frac{y^m_{ij} + \alpha}{y^u_{ij} + \alpha}\right), \qquad (2)$$

which is equivalent to a logistic transformation of the β-value with a base 2 logarithm instead of the natural logarithm [6]. It should be noted that M-values still use only the average intensities of probes and provide unbounded values through the logistic transformation.
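As a small worked illustration of Eqs. 1 and 2, the following R sketch computes β-values from mean intensities and then applies the base 2 logistic transformation. The function names beta_value and m_value_from_beta and the input intensities are hypothetical and chosen only for illustration; the offset default of 100 follows the text.

```r
# beta-value (Eq. 1): negative intensities are floored at 0, alpha is the offset
beta_value <- function(y_m, y_u, alpha = 100) {
  y_m <- pmax(y_m, 0)
  y_u <- pmax(y_u, 0)
  y_m / (y_m + y_u + alpha)
}

# M-value as the base 2 logistic transformation of the beta-value (Eq. 2);
# a beta-value of exactly 0 or 1 maps to -Inf or Inf
m_value_from_beta <- function(beta) log2(beta / (1 - beta))

y_m <- c(900, 500, 100)   # made-up mean methylated intensities
y_u <- c(0, 400, 800)     # made-up mean unmethylated intensities
b <- beta_value(y_m, y_u) # 0.9, 0.5, 0.1
m <- m_value_from_beta(b) # log2(9), 0, -log2(9)
```

The example shows the symmetry of the M-value around 0 for β-values that are symmetric around 0.5.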

3.3 N-Value

In contrast to the β-value and M-value, the N-value incorporates the sample standard deviations of signal intensities, assuming that the mean intensity is linearly related to the standard deviation of intensities [7]. For the ith CpG site on the jth sample, let $y^m_{ij}$ and $y^u_{ij}$ denote the means of the methylated and unmethylated intensities and $S^m_{ij}$ and $S^u_{ij}$ denote the standard deviations of the methylated and unmethylated intensities, respectively. The linear relationship between the mean and the standard deviation of the intensities is then described using two regression models, for methylated and unmethylated probes,

$$\log(y^m_{ij}) = \gamma^m_0 + \gamma^m_1 \log(S^m_{ij}) + \varepsilon^m_{ij},$$
$$\log(y^u_{ij}) = \gamma^u_0 + \gamma^u_1 \log(S^u_{ij}) + \varepsilon^u_{ij},$$

where $\gamma^m_0$, $\gamma^m_1$, $\gamma^u_0$, and $\gamma^u_1$ are the regression coefficients, and $\varepsilon^m_{ij}$ and $\varepsilon^u_{ij}$ are normal random errors with zero means and constant variances, respectively. From the regression models, with estimated parameters $\hat{\gamma}^m_0$, $\hat{\gamma}^m_1$, $\hat{\gamma}^u_0$, and $\hat{\gamma}^u_1$, the estimated intensities are given by

$$\log(\hat{y}^m_{ij}) = \hat{\gamma}^m_0 + \hat{\gamma}^m_1 \log(S^m_{ij}),$$
$$\log(\hat{y}^u_{ij}) = \hat{\gamma}^u_0 + \hat{\gamma}^u_1 \log(S^u_{ij}),$$

for the ith CpG site and jth sample, respectively. The residuals $\hat{\varepsilon}^m_{ij}$ and $\hat{\varepsilon}^u_{ij}$ are the differences between the observed and estimated intensities on the log scale,

$$\hat{\varepsilon}^m_{ij} = \log(y^m_{ij}) - \log(\hat{y}^m_{ij}), \qquad \hat{\varepsilon}^u_{ij} = \log(y^u_{ij}) - \log(\hat{y}^u_{ij}).$$

Using the exponential-scale residuals, the scaled estimates of the mean methylated signal intensity $y^{ms}_{ij}$ and the mean unmethylated signal intensity $y^{us}_{ij}$ are obtained by

$$y^{ms}_{ij} = \exp(\hat{\varepsilon}^m_{ij}) = \frac{y^m_{ij}}{\exp(\hat{\gamma}^m_0)\,(S^m_{ij})^{\hat{\gamma}^m_1}},$$
$$y^{us}_{ij} = \exp(\hat{\varepsilon}^u_{ij}) = \frac{y^u_{ij}}{\exp(\hat{\gamma}^u_0)\,(S^u_{ij})^{\hat{\gamma}^u_1}}.$$

Based on the scaled signal intensities $y^{ms}_{ij}$ and $y^{us}_{ij}$, for the ith CpG site and jth sample, the scaled β-value, denoted by $\beta^{s}_{ij}$, is defined as

$$\beta^{s}_{ij} = \frac{y^{ms}_{ij}}{y^{us}_{ij} + y^{ms}_{ij}} = \frac{1}{1 + \exp(\hat{\varepsilon}^u_{ij} - \hat{\varepsilon}^m_{ij})}.$$

The N-value is then defined as a logistic transformation of $\beta^{s}_{ij}$ such that

$$N_{ij} = \log_2\left(\frac{\beta^{s}_{ij}}{1 - \beta^{s}_{ij}}\right). \qquad (3)$$

It should be noted that $\beta^{s}_{ij}$ removes the relationship between the mean and standard deviation of the signal intensities for each probe when the standard deviation of the intensities depends on their mean. The N-value can be viewed as a version of the M-value rescaled by the standard deviation. Furthermore, the N-value uses the standard deviations of intensities that are provided by Illumina, while the β-value and M-value do not.

3.4 Comparison of Three Methylation Quantification Methods

3.4.1 Features of Signal Intensities

From the obesity data, the distributions of the mean and standard deviation of the signal intensities vary by sample in both methylated and unmethylated probes (Fig. 1). In particular, sample #12 has a wider range in the mean and standard deviation of intensities. Further, the distributions showed greater variability among cases than among controls. In addition, the contour plots of the pairs of log-scaled standard deviations and means of signal intensities indicate linear relationships between the log-scaled standard deviations and means (Fig. 2).

3.4.2 Distributions of Measures of Methylation Level

The methylation levels of case samples and control samples at a CpG site are compared by the summary measures that quantify the methylation levels based on the signal intensities. Conventionally,

Fig. 1 Distribution of signal intensities and standard deviations. First 14 plots describe the distributions of the log-scaled means of signal intensities of 14 samples, and next 14 plots describe the distributions of the log-scaled standard deviations of signal intensities of 14 samples. At each plot, the solid line indicates the distribution of intensities from the methylated probes, and the dotted line indicates the distribution of intensities from the unmethylated probes

Fig. 2 Linear relationship between the standard deviations and the means of signal intensities. Each plot is a contour plot for the pairs of standard deviations and means of intensities, with log(std of intensities) on the horizontal axis and log(mean of intensities) on the vertical axis. (a) is based on the pairs from methylated probes for sample #1 through sample #7; (b) is based on the pairs from unmethylated probes for sample #1 through sample #7; (c) is based on the pairs from methylated probes for sample #8 through sample #14; and (d) is based on the pairs from unmethylated probes for sample #8 through sample #14

the β-value (Eq. 1) has been used, and the M-value (Eq. 2) and N-value (Eq. 3) have been proposed as quantification methods for the methylation levels. Although the β-value is an easily interpretable measure of methylation level, its distribution for each sample tends to be highly

Fig. 3 Distributions of three quantification methods for methylation levels over all CpG sites. The first 14 plots are distributions of β-values for each sample, next 14 plots are distributions of M-values for each sample, and last 14 plots are distributions for N-values for each sample

skewed and slightly bimodal, with values bounded between 0 and 1 (Fig. 3). As an improved version, the M-value has a distribution with unbounded range. However, its distribution is still bimodal and skewed, with the two peaks more prominent than those of the β-value.


Furthermore, the distributions of the β-value and M-value for case sample #12 are similar to the distributions for the other samples, while the variability in the distributions of both the β-value and M-value is greater among the case samples #8 through #14 than among the control samples #1 through #7. The N-value normalizes the heterogeneity that appears as different variabilities across samples. Given the linear relationship observed between the mean signal intensity and the standard deviation of the intensities, the sample-specific signal intensity for any given CpG site can be considered as the deviation from the expected relationship between the mean and standard deviation (Fig. 2). The N-value uses these adjusted signal intensities. The distribution of the N-value is desirably symmetric and, more importantly, has less variability among samples than the distributions observed for the β-value and M-value (Fig. 3). Both the β-value and M-value show a relationship between their standard deviations and means among the samples (Fig. 4). For the β-value, this relationship occurs mostly when its value is between 0.2 and 0.8. However, the N-value shows no obvious relationship between the mean and the standard deviation (Fig. 4). This property indicates that the N-value is more appropriate for testing for differences in methylation levels, for example, when the t-test is used.

3.4.3 Identification of Differential Methylation by Three Methods

Using the t-test, the three methylation quantification methods, β-value, M-value, and N-value, may identify different numbers of differentially methylated CpG sites among all 27,578 sites. Based on all samples from the obesity data, at the 0.05 significance level, the t-tests with the β-value, M-value, and N-value identified 2091, 2115, and 1508 CpG sites, respectively. Excluding sample #12, which showed larger variation and a lower mean in intensity as mentioned in the previous subheading, the β-value, M-value, and N-value identified 1901, 2137, and 1549 sites, respectively. Although the β-value and M-value identified more differentially methylated CpG sites than the N-value, the β-value showed the largest change in the number of identified sites when one sample was excluded (2091 to 1901), compared with the M-value and N-value, while the N-value identifies differential methylation more conservatively than the M-value does.
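The N-value of Subheading 3.3 and the site-wise t-test used here can be sketched in R as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names n_value and site_pvalues are hypothetical, the input data are simulated with made-up parameters, a single pooled regression is fitted because the coefficients in the regression equations carry no site or sample index (fitting by sample or by site is an equally plausible reading), and R's default Welch t-test is used because the text does not state which variant was applied.

```r
set.seed(2)
n_sites <- 200; n_samples <- 14
group <- rep(c("control", "case"), each = 7)

# Made-up standard deviations and mean intensities (sites x samples)
S_m <- matrix(rlnorm(n_sites * n_samples, 5, 0.3), nrow = n_sites)
S_u <- matrix(rlnorm(n_sites * n_samples, 5, 0.5), nrow = n_sites)
y_m <- exp(2 + log(S_m) + rnorm(n_sites * n_samples, 0, 0.2))
y_u <- exp(2 + log(S_u) + rnorm(n_sites * n_samples, 0, 0.2))

n_value <- function(y_m, y_u, S_m, S_u) {
  # residuals from regressing log mean intensity on log standard deviation
  log_resid <- function(y, S) {
    ly <- log(as.vector(y))
    ls <- log(as.vector(S))
    residuals(lm(ly ~ ls))
  }
  e_m <- log_resid(y_m, S_m)
  e_u <- log_resid(y_u, S_u)
  beta_s <- 1 / (1 + exp(e_u - e_m))            # scaled beta-value
  matrix(log2(beta_s / (1 - beta_s)), nrow = nrow(y_m))  # Eq. 3
}

# Two-sample t-test at each CpG site (Welch version, R's default)
site_pvalues <- function(values, group) {
  apply(values, 1, function(v)
    t.test(v[group == "case"], v[group == "control"])$p.value)
}

N <- n_value(y_m, y_u, S_m, S_u)
pvals <- site_pvalues(N, group)
n_significant <- sum(pvals < 0.05)  # sites called at the 0.05 level
```

Because the simulated data contain no true group difference, the sites counted in n_significant are false positives; with real data, the same two functions could be applied to β- or M-value matrices for comparison.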

3.4.4 Simulation Study

Using a t-test at the 0.05 significance level, the true positive rates (in percent) for differentially methylated CpG sites are presented in Table 1. Over 1000 simulations without sample heterogeneity, the N-value showed the best rate of correct identification of differentially methylated CpG sites and the β-value the worst, with the M-value in between. Also, as the sample size increases and the rate of differentially methylated CpG sites increases, all three methods correctly identify more differentially methylated CpG sites. Interestingly, when sample heterogeneity is added, the M-value shows increased accuracy relative to the β-value and N-value, most clearly at the smaller sample sizes (n = 7 and 10).

Fig. 4 Relationship between means and standard deviations of summary measures of methylation levels over all samples. The top plot is from β-values, the middle plot is from M-values, and the bottom plot is from the N-values. In each plot, the horizontal axis is the mean and the vertical axis is the standard deviation

Regarding the consistency between results with and without sample heterogeneity, all three methods showed improved performance as the sample size increased; that is, the robustness of the identification increased. The consistency under this change of samples is presented in Table 2. Among the three methods, the N-value showed the best consistency, while the M-value showed slightly better consistency than the β-value.

Table 1 True positive rates for all CpG sites in simulation study without and with sample heterogeneity

n     R     β-value         M-value         N-value
Without heterogeneity
7     5     32.56 (1.03)    39.35 (0.93)    45.75 (0.85)
7     10    46.70 (1.09)    54.63 (0.88)    63.25 (0.72)
7     15    60.23 (0.92)    67.25 (0.75)    73.11 (0.61)
7     20    67.58 (0.83)    73.94 (0.67)    79.38 (0.51)
10    5     40.02 (1.04)    45.17 (0.88)    50.05 (0.73)
10    10    55.02 (0.90)    60.82 (0.78)    66.95 (0.63)
10    15    67.77 (0.74)    72.45 (0.61)    76.59 (0.50)
10    20    74.42 (0.66)    78.50 (0.55)    82.12 (0.43)
20    5     49.45 (0.76)    50.92 (0.69)    52.42 (0.69)
20    10    65.10 (0.66)    66.86 (0.64)    68.72 (0.58)
20    15    75.95 (0.55)    77.17 (0.50)    78.41 (0.47)
20    20    81.36 (0.45)    82.40 (0.43)    83.43 (0.40)
With heterogeneity
7     5     32.86 (1.17)    47.66 (1.26)    33.82 (1.15)
7     10    47.33 (1.24)    62.66 (1.19)    52.83 (0.93)
7     15    60.52 (1.07)    74.21 (0.87)    62.62 (0.85)
7     20    68.16 (0.95)    79.67 (0.76)    71.27 (0.73)
10    5     39.83 (1.13)    51.36 (1.08)    45.17 (0.88)
10    10    55.08 (1.04)    66.66 (0.90)    63.24 (0.69)
10    15    67.57 (0.87)    77.13 (0.68)    72.87 (0.59)
10    20    74.45 (0.75)    82.35 (0.56)    79.38 (0.53)
20    5     47.81 (0.81)    51.99 (0.76)    52.24 (0.68)
20    10    63.87 (0.72)    67.83 (0.71)    68.59 (0.59)
20    15    74.70 (0.64)    77.78 (0.55)    78.25 (0.49)
20    20    80.37 (0.52)    82.94 (0.47)    83.36 (0.41)

The average (standard deviation) of the true positive percent in 1000 simulations. The per group sample size is n, and the percent of CpG sites that are differentially methylated is denoted by R. Results shown are for the scenario in which the case standard deviation was shifted by 1

Table 2 Consistency of test results between analyses with and without heterogeneity

n     R     β-value         M-value         N-value
7     5     38.14 (0.95)    39.97 (1.01)    52.67 (1.09)
7     10    40.64 (0.94)    44.94 (0.93)    56.82 (0.99)
7     15    43.33 (0.82)    49.92 (0.78)    55.97 (0.89)
7     20    45.02 (0.82)    51.97 (0.75)    58.73 (0.83)
10    5     46.22 (0.91)    51.08 (0.89)    66.63 (0.96)
10    10    49.61 (0.88)    57.58 (0.78)    72.87 (0.76)
10    15    52.88 (0.73)    62.68 (0.66)    73.66 (0.67)
10    20    54.76 (0.73)    65.26 (0.55)    76.80 (0.65)
20    5     58.45 (0.77)    61.02 (0.80)    81.94 (0.68)
20    10    63.26 (0.65)    68.33 (0.66)    87.57 (0.48)
20    15    66.51 (0.50)    73.13 (0.50)    90.80 (0.36)
20    20    68.35 (0.46)    76.11 (0.41)    92.79 (0.29)

The average (standard deviation) of the percent of sites with p-value < 0.05 in both simulations with and without sample heterogeneity over 1000 simulations. The per group sample size is n, and the percent of CpG sites that are differentially methylated is denoted by R. Results shown are for the scenario in which the case standard deviation was shifted by 1

4 Notes

The N-value, as a weighted logistic transformation of the β-value, normalizes data across arrays by utilizing the standard deviations of signal intensities. In the comparison of the N-value with both the β-value and M-value on the obesity data, the N-value yielded more stable results, although it identified fewer sites. The effect sizes in an experiment comparing obese subjects with matched controls are expected to be small, and the number of sites showing differential methylation is likely to be small. The p-values in this case are more likely to be close to uniformly distributed. Hence, the smaller number of identifications with the N-value may be more reflective of true signal intensities. The higher accuracy of the N-value relative to the β-value and M-value has been confirmed in the simulation study. The obesity data in this analysis were obtained using the Infinium I assay. However, most of the probes on the 450K array are based on the Infinium II assay, which may suggest that the results are not applicable. The distribution of the β-value is known to differ between Infinium I and Infinium II assays, and the distribution of the signal values will also differ between Infinium I and Infinium II. Considering the N-value that only depends on separate methylated and unmethylated signal intensities in which there is a


relationship between the signal standard deviation and the signal mean, it accounts for the uncertainty in mean signal intensity and appropriately adjusts for the variability in signal intensities. Therefore, the N-value should show an improved performance with the Infinium II assay as well. Although the N-value is based on the linear regression model at each CpG site, any other regression relationship can be used, since the normalized signal intensities are defined as residuals obtained from the regression of the mean signal intensity on the standard deviation of signal intensities. The assumption of linearity can easily be examined with each data set, and then the regression function can be adjusted appropriately. For the preliminary data obtained from the 450K chip, the relationship between mean signal and signal standard deviation is linear. Such results suggest that the N-value is a potentially important summary measure that can be used in testing for differential methylation in methylome-wide association studies. Eventually, the efficacy of different methods will need to be evaluated based on confirmatory biological findings.

References

1. Kulis M, Esteller M (2010) DNA methylation and cancer. In: Herceg Z, Ushijima T (eds) Epigenetics and cancer, part A (advances in genetics), vol 70. Academic Press, Waltham, pp 27–56. https://doi.org/10.1016/B978-0-12-380866-0.60002-2
2. Rakyan VK, Down TA, Balding DJ, Beck S (2011) Epigenome-wide association studies for common human diseases. Nat Rev Genet 12:529–541
3. Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, et al (2011) High density DNA methylation array with single CpG site resolution. Genomics 98:288–295
4. Bibikova M, Le J, Barnes B, Saedinia-Melnyk S, Zhou L, et al (2009) Genome-wide DNA methylation profiling using Infinium assay. Epigenomics 1:177–200
5. Siegmund K (2011) Statistical approaches for the analysis of DNA methylation microarray data. Human Genet 129:585–595
6. Du P, Zhang X, Huang CC, Jafari N, Kibbe W, et al (2010) Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinfor 11:587
7. Ryu D, Xu H, George V, Su S, Wang X, et al (2013) Quantifying and normalizing methylation levels in Illumina arrays. J Biomet Biostatist 4:1000164
8. Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Weisenberger DJ, et al (2010) Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res 20:440–446
9. Kim YJ, Yoon HY, Kim SK, Kim YW, Kim EJ, et al (2011) EFEMP1 as a novel DNA methylation marker for prostate cancer: Array-based DNA methylation and expression profiling. Clin Cancer Res 17:4523–4530
10. Choufani S, Shapiro JS, Susiarjo M, Butcher DT, Grafodatskaya D, et al (2011) A novel approach identifies new differentially methylated regions (DMRs) associated with imprinted genes. Genome Res 21:465–476
11. Hinoue T, Weisenberger DJ, Lange CP, Shen H, Byun HM, et al (2012) Genome-scale analysis of aberrant DNA methylation in colorectal cancer. Genome Res 22:271–282
12. Sun Z, Chai H, Wu Y, White W, Donkena K, et al (2011) Batch effect correction for genome-wide methylation data with Illumina Infinium platform. BMC Med Genom 4:84
13. Bell J, Pai A, Pickrell J, Gaffney D, Pique-Regi R, et al (2011) DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol 12:R10
14. Wang X, Zhu H, Snieder H, Su S, Munn D, et al (2010) Obesity related methylation changes in DNA of peripheral blood leukocytes. BMC Med 8:87
15. Gutin B, Johnson MH, Humphries MC, Hatfield-Laube JL, Kapuku GK, et al (2007) Relationship of visceral adiposity to cardiovascular disease risk factors in black and white teens. Obesity 15:1029–1035

Chapter 2

Evaluating Reliability of DNA Methylation Measurement

Rui Cao and Weihua Guan

Abstract

DNA methylation is a widely studied epigenetic phenomenon. Alterations in methylation patterns influence human phenotypes and risk of disease. The Illumina Infinium HumanMethylation450 (HM450) and MethylationEPIC (EPIC) BeadChip are widely used microarray-based platforms for epigenome-wide association studies (EWASs). In this chapter, we discuss the use of the intraclass correlation coefficient (ICC) for assessing technical variation induced by methylation arrays at the single-CpG level. The ICC compares the variation of methylation levels within and between replicate measurements and ranges between 0 and 1. We further characterize the distribution of ICCs using a mixture of truncated normal and normal distributions, and cluster CpG sites on the arrays into low- and high-reliability groups. In practice, we recommend that extra caution be taken for associations at CpG sites with low ICC values.

Key words DNA methylation, Technical error, Intraclass correlation, Normal mixture models

1 Introduction

DNA methylation is one of the most commonly occurring epigenetic marks in the human genome. It is a major regulator of gene transcription and plays a vital role in many cellular processes. In the last decade, numerous studies have shown that abnormal methylation patterns are linked to phenotypic variation and development of disease [1–6]. Recent technological advances have provided multiple platforms for systematically interrogating DNA methylation variation across the genome [7]. Among them, the Illumina MethylationEPIC BeadChip (EPIC) (Illumina, Inc.) is a new-generation microarray platform that substantially extends the previous Infinium HumanMethylation450 BeadChip (HM450) (Illumina, Inc.) and can be used to assess the methylation status of more than 850,000 cytosines distributed over the entire genome. As with other microarray experiments (e.g., RNA expression), it is important to evaluate the impact of technical variation in the measurement. A well-known source of bias for epigenome-wide association studies (EWASs) of DNA methylation is the

Weihua Guan (ed.), Epigenome-Wide Association Studies: Methods and Protocols, Methods in Molecular Biology, vol. 2432, https://doi.org/10.1007/978-1-0716-1994-0_2, © Springer Science+Business Media, LLC, part of Springer Nature 2022


so-called batch effect, which is largely caused by technical differences from one chip to another. Although many statistical methods have been proposed to correct for batch effects [8–17], no approach has been generally accepted as standard. Alternatively, it is useful to consider statistical measures that quantify the extent to which the measured methylation level at a specific CpG site is affected by technical errors. For CpG sites with large inter-individual variation, it may be reasonable to assume that technical differences will have relatively low impact. However, the EPIC chip contains a large number of CpG sites that demonstrate little inter-individual variation and likely have low statistical power in association tests. It is critical to consider statistical measures to evaluate the impact of technical errors and identify the CpG sites with low inter-individual variation and high within-individual variation. In experiments, technical replicates are often included and can be used to evaluate the consistency of measurement. Meng et al. [14] suggested using technical replicates to identify and exclude “non-variable” CpG sites, at which the technical “noise” outweighs true biological variation. Bose et al. [15] further evaluated the consistency of methylation measurement at each CpG site based on the intraclass correlation coefficient (ICC), which compares the within- and between-replicate variations; the ICC values can then be modeled by a mixture model approach to classify CpG sites into high and low reliability clusters. Through such a procedure, we can identify CpG sites that give consistent values for methylation levels in DNA extracted from peripheral blood using a methylation profiling array, thereby providing insight into the performance of microarray platforms across CpG sites.

2 ICC Estimation and Modeling

At a specific CpG site, the methylation level can be modeled using a random effects ANOVA model:

$$y_{ij} = \mu + \tau_j + \varepsilon_{ij},$$

where $y_{ij}$ is the measured methylation level for the ith replicate of biological sample (replicate set) j, $\mu$ is the overall mean methylation at the site, $\tau_j \sim N(0, \sigma_b^2)$ is a random effect shared by all measures for sample j reflecting sample characteristics, and $\varepsilon_{ij} \sim N(0, \sigma_w^2)$ is a random noise term including technical errors for multiple measures of the same biological sample. $\tau_j$ and $\varepsilon_{ij}$ are assumed to be uncorrelated. Here $\sigma_b^2$ represents the variance between replicate sets and $\sigma_w^2$ represents the variance within replicate sets. The ICC is calculated as

$$\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2},$$


which takes values in [0, 1] and measures the extent to which the measurements in a replicate set resemble each other. An ICC of 1 indicates perfect measurement accuracy at the CpG site, and 0 implies little variation between independent samples but large measurement error between technical replicates. Thus, we can classify sites into clusters of varying degrees of reliability based on ICC values. The ICC value can serve as a measure of the impact of batch effects on each probe. Consider a widely used random-effects model for batch effects:

$$y_{ij} = b_{l(ij)} + \sum_p X_{jp}\beta_p + e_{ij},$$

where $l(ij)$ indexes the batch (chip) on which replicate i of sample j is located, so that $b_{l(ij)}$ is the corresponding batch effect, and $X_{jp}$ is the pth covariate (characteristic of the biological sample), such as age or sex, for sample j. Comparing this model to the random-effects ANOVA model above, the within replicate set variation $\sigma_w^2$ is decomposed into the variances of the batch-specific measurement error ($b_{l(ij)}$) and other technical error ($e_{ij}$), and $\sum_p X_{jp}\beta_p$ corresponds to the between replicate set (biological sample) variation. When the measure of methylation is vulnerable to batch effects at a given CpG site, the variance of $b_{l(ij)}$ is large, leading to a small ICC.

Based on the calculated ICC values, we can cluster the CpG sites into groups with different levels of reliability. Given that the ICC values are bounded in [0, 1], and especially taking into account the lower bound of 0, we can fit a censored or truncated normal mixture model to the observed ICC values. For the censored normal mixture model, we assume that the CpG sites come from two clusters: a low reliability cluster with ICC censored at 0, and a high reliability cluster with ICC modeled by a normal distribution:

$$f(x; \Theta) = p\left[\Phi(0; \mu_1, \sigma_1^2)\right]^{\delta_i}\left[\phi(x; \mu_1, \sigma_1^2)\right]^{1-\delta_i} + (1 - p)\,\phi(x; \mu_2, \sigma_2^2), \quad p \in [0, 1],$$

where $\delta_i = 1$ for the censored observations (i.e., ICC = 0) and 0 otherwise, p is the mixing proportion for the first cluster, $\Theta = (p, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$ is the set of parameters, and $\Phi(\cdot)$ and $\phi(\cdot)$ denote the cumulative distribution function and probability density function of the normal distribution with the given mean and variance, respectively. Alternatively, to allow excess zeros in the distribution of ICC, we separate the CpG sites with ICC = 0 and model the non-zero ICC values using a mixture of a truncated normal and a normal distribution:

$$f(x; \Theta) = p\,\frac{\phi(x; \mu_1, \sigma_1^2)}{1 - \Phi(0; \mu_1, \sigma_1^2)} + (1 - p)\,\phi(x; \mu_2, \sigma_2^2), \quad p \in [0, 1].$$


The parameters in the mixture models were estimated using an Expectation-Maximization (EM) algorithm [16, 17]. For the purpose of comparison, we revise the log-likelihood function to include the CpG sites with ICC = 0:

$$\log L(\Theta; x) = n_0 \log p_0 + \sum_{x \in \mathrm{ICC} \neq 0} \log\left(p_1\,\frac{\phi(x; \mu_1, \sigma_1^2)}{1 - \Phi(0; \mu_1, \sigma_1^2)}\right) + \sum_{x \in \mathrm{ICC} \neq 0} \log\left(p_2\,\phi(x; \mu_2, \sigma_2^2)\right),$$

where $p_0$ and $n_0$ denote the sample proportion and number of sites with ICC = 0, respectively, and $p_1 = p(1 - p_0)$ and $p_2 = (1 - p)(1 - p_0)$. The two models are then compared using the maximized log-likelihood values. The CpG sites can be classified based on the corresponding posterior probabilities, $\pi \equiv P(\mathrm{icc} \in \text{high reliability cluster} \mid \mathrm{icc}; \hat{\Theta})$, computed from either of the two models.
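The posterior classification step for the truncated normal mixture can be sketched in base R as follows. This is a minimal illustration, not the chapter's code: the function name posterior_high, the parameter values, and the 0.5 cut-off on the posterior probability are assumptions made for the example, and only non-zero ICC values are passed in, as the truncated model requires.

```r
# Posterior probability that a non-zero ICC value belongs to the
# high-reliability (second) cluster of the truncated normal mixture
posterior_high <- function(x, p, mu, sigma) {
  # first component: N(mu[1], sigma[1]^2) truncated to [0, Inf)
  f1 <- dnorm(x, mu[1], sigma[1]) / (1 - pnorm(0, mu[1], sigma[1]))
  # second component: ordinary normal density
  f2 <- dnorm(x, mu[2], sigma[2])
  (1 - p) * f2 / (p * f1 + (1 - p) * f2)
}

icc <- c(0.05, 0.20, 0.50, 0.75, 0.90)   # made-up non-zero ICC values
post <- posterior_high(icc, p = 0.5,      # hypothetical fitted parameters
                       mu = c(0.10, 0.75), sigma = c(0.10, 0.10))
cluster <- ifelse(post > 0.5, "high", "low")  # 0.5 cut-off is an assumption
```

In practice the parameters would be the EM estimates, and the cut-off could be made stricter when false assignment to the high-reliability cluster is the main concern.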

3 Real Data Analysis Example

3.1 Replicates from Methylation Chips of Same Model

Assume we have DNA methylation data measured from either the HM450 or EPIC chip with replicates. We will demonstrate the programming using R (version 4.0.3). First, we import the methylation data into an R object data, where each row represents one replicate of one individual, the first column, named id, holds the individual id, and each remaining column represents one CpG site. We use the following code to calculate the ICC value for each CpG site:

>library(parallel) # provides mclapply()
>library(multilevel) # assumed source of ICC1(), which accepts an aov object
>site.list = setdiff(colnames(data), "id") # CpG column names (assumed definition; not given in the original)
>icc.new = mclapply(site.list, function(x){
> # one-way ANOVA of the methylation level at site x on individual id
> formula = as.formula(paste(x,"~as.factor(id)", sep=""))
> icc.aov = do.call("aov",args =list(formula,data=data))
> c(x, ICC1(icc.aov))
> }
>)
>icc = do.call(rbind.data.frame, icc.new) # one row per site: site name and ICC

Here we use the parallel package to run the job with multiple processes. The distribution of ICC from a sample HM450 dataset is shown in Fig. 1. The median ICC value is 0.30 in this example. We observed one cluster of sites with relatively high ICC values (mode ~ 0.75), one cluster with relatively low ICC values (mode ~ 0.10), and an additional cluster of sites with ICC of 0 (n = 36,017), which is consistent with the truncated/censored normal mixture model we assumed.
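For a single CpG site, the ICC can also be computed directly from the one-way ANOVA mean squares, which makes the link to the variance components of Subheading 2 explicit. The sketch below is an illustration under stated assumptions: the function name icc_oneway and the toy data are hypothetical, the design is assumed to be balanced (the same number of replicates per individual), the estimator is the standard one-way ICC(1) formula, and truncating negative estimates to 0 is an assumption suggested by the cluster of sites with ICC of 0 described above.

```r
# One-way ICC(1) from ANOVA mean squares, balanced design assumed:
# ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), with k replicates per individual
icc_oneway <- function(y, id) {
  id <- as.factor(id)
  tab <- anova(lm(y ~ id))
  msb <- tab[["Mean Sq"]][1]   # between-individual mean square
  msw <- tab[["Mean Sq"]][2]   # within-individual (residual) mean square
  k <- length(y) / nlevels(id)
  icc <- (msb - msw) / (msb + (k - 1) * msw)
  max(icc, 0)                  # negative estimates set to 0 (assumption)
}

id <- rep(1:3, each = 2)                        # 3 individuals, 2 replicates each
y_good  <- c(1.0, 1.2, 2.0, 2.2, 3.0, 3.2)      # replicates agree closely
y_noisy <- c(1, 3, 3, 1, 1, 3)                  # replicates disagree

icc_good  <- icc_oneway(y_good, id)    # 1.98 / 2.02, about 0.98
icc_noisy <- icc_oneway(y_noisy, id)   # raw estimate is -1, truncated to 0
```

For unbalanced replicate sets, k would need to be replaced by an adjusted group size, which is not handled in this sketch.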


Fig. 1 Distribution of intraclass correlation coefficients (ICC) on a HM450 array

Next, we can import all ICC values into a vector x, and calculate the standard deviations and means of the ICC values >0.5 and <0.5, as well as the proportion of ICC values <0.5:

>s=sqrt(var(x))
>s1=sqrt(var(x[x<0.5]))
>s2=sqrt(var(x[x>0.5]))
>m1=mean(x[x<0.5])
>p=length(x[x<0.5])/length(x)
>m2=mean(x[x>.5])
>mu=c(m1, m2) # starting values for the component means
>sigma=c(s1, s2) # starting values for the component standard deviations

The following EM algorithm estimates the parameters of the truncated normal mixture model and fits the model:

>library(truncnorm) # provides dtruncnorm() and rtruncnorm()
>em=function(x, p, mu, sigma, itmax, eps)
>{
> it=0
> diff=1
> while( (it<itmax) && (diff>eps))
> {
> # E-step: component densities and posterior weight of the first cluster
> comp1=p*dtruncnorm(x,0,Inf,mu[1],sigma[1])
> comp2=(1-p)*dnorm(x,mu[2],sigma[2])
> prop=comp1/(comp1+comp2)
> # M-step: update mixing proportion, means, and standard deviations
> p1=mean(prop)
> delta_m=sigma[1]*dnorm(mu[1]/sigma[1])/pnorm(mu[1]/sigma[1])
> mu1=c(mean(x*prop)/p1-delta_m,mean(x*(1-prop))/(1-p1))
> delta_v=sigma[1]^2-sigma[1]^2*(-mu1[1]/sigma[1]*dnorm(mu1[1]/sigma[1])+pnorm(mu1[1]/sigma[1]))/pnorm(mu1[1]/sigma[1])
> S1=(x-mu1[1])^2
> S2=(x-mu1[2])^2
> sigma1=c(mean(S1*prop)/p1,mean(S2*(1-prop))/(1-p1))
> sigma1[1]=sigma1[1]+delta_v
> sigma1=sqrt(sigma1)
> # average absolute change over the five parameters
> diff=abs(p1-p)/5+abs(mu1[1]-mu[1])/5+abs(sigma1[1]-sigma[1])/5+abs(mu1[2]-mu[2])/5+abs(sigma1[2]-sigma[2])/5
> it=it+1
> p=p1
> mu=mu1
> sigma=sigma1
> cat(it,diff,p,mu,sigma,sum(log(comp1+comp2)),"\n")
> }
> post=cbind(comp1, comp2)
> return(list(par=list(p=p,mu=mu,sigma=sigma,post=post), conv=list(it=it,diff=diff)))
>}
>obj=em(x,itmax=1000,eps=0.00001,p=p,mu=mu,sigma=sigma)
># Draw a sample from the fitted mixture. The right-hand sides of the samp1,
># samp2, and finalsample assignments were lost from the printed code and are
># reconstructed here as an assumption.
>samp1 <- rtruncnorm(2*485577,0,Inf,obj$par$mu[1],obj$par$sigma[1])
>samp2 <- rnorm(2*485577,obj$par$mu[2],obj$par$sigma[2])
>ind = runif(2*485577) < obj$par$p
>finalsample <- samp1
>finalsample[!ind] <- samp2[!ind]
>z=finalsample[1:length(x)]

Here we fit a truncated normal mixture model to the non-zero ICC values, to allow flexibility in modeling the cluster of ICC equal to 0. The observed and fitted distributions of ICC values for the model are shown in Fig. 2.

3.2 Replicates from Different Methylation Arrays

In the section above, we have demonstrated how to calculate the ICC using technical replicates from a specific methylation array; in this section, we will show an example where the individuals are measured using different arrays, e.g., HM450 vs EPIC. In an example dataset, we have obtained DNA methylation measures at 450,099 CpG sites in common between the two arrays, after standard quality control filters. We will use this example to demonstrate the reliability of methylation measures across platforms.


Fig. 2 Observed and fitted distributions of ICC values using the truncated normal mixture model (for nonzero ICC)

Figure 3 shows the overall agreement between DNA methylation measures from the EPIC and HM450 arrays. We calculate the ICC values between replicates as described in Subheading 3.1. The ICC distribution (Fig. 4) is concentrated further to the left (toward zero) than that based on the same array (Fig. 1), and more than half of the CpG sites have ICC = 0. The correlation between these ICCs and the ICCs from the previous example (Subheading 3.1) is 0.55. The relatively low ICCs between different arrays could be attributed to the variation between array designs, but they can also be due to strong batch effects. We urge caution when analyzing data generated from multiple methylation arrays. Low cross-platform reliability can also pose challenges to the effort of imputing missing DNA methylation values across platforms.
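As an illustration of the calculation for a single CpG site, the sketch below estimates the ICC as the between-sample variance divided by the sum of the between-sample and within-sample (replicate) variances, using the one-way ANOVA mean squares. It is a hedged example rather than the chapter's own procedure: the function name `icc_oneway`, the truncation of negative between-sample variance estimates at zero, and the β-values in the usage lines are assumptions made for illustration.

```r
# One-way random-effects ICC for a matrix with one row per sample and one
# column per replicate (e.g., column 1 = HM450, column 2 = EPIC).
icc_oneway <- function(m) {
  n <- nrow(m)                       # number of samples
  k <- ncol(m)                       # replicates per sample
  row_means <- rowMeans(m)
  # between-sample and within-sample mean squares
  msb <- k * sum((row_means - mean(m))^2) / (n - 1)
  msw <- sum((m - row_means)^2) / (n * (k - 1))
  var_b <- max((msb - msw) / k, 0)   # between-sample variance, truncated at 0
  var_w <- msw                       # within-sample (replicate) variance
  var_b / (var_b + var_w)
}

# Made-up beta-values for four samples measured on both arrays
beta <- cbind(hm450 = c(0.10, 0.52, 0.88, 0.31),
              epic  = c(0.14, 0.49, 0.91, 0.35))
icc_oneway(beta)
```

Applying such a function row by row to the matrix of shared CpG sites would give one ICC per site; a site whose between-sample variance estimate is not positive receives ICC = 0 under the truncation assumed here.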

Fig. 3 Sample-by-sample comparison for EPIC vs. HM450 array (four randomly selected samples)

Fig. 4 Distribution of ICC between the EPIC and HM450 data

4 Summary and Other Remarks

Recent technological advances have provided high-throughput arrays for systematically interrogating DNA methylation across the genome. This allows investigators to evaluate regions of the genome in which variation in DNA methylation may influence gene expression and ultimately disease risk. One of the potential problems that could impact methylation-phenotype association results is the technical variation in experiments (e.g., batch effects). Here we demonstrate how to use technical replicates to assess the reliability of methylation measurements from a commercial methylation array, but the approach can also be generalized to bisulfite sequencing data. The ICC value can serve as a measure to quantify the impact of technical errors. Using a mixture model approach, we classify CpG sites into low and high reliability groups. In practice, CpG sites with low reliability could be excluded from subsequent analyses to improve the power of EWASs by reducing the burden of multiple hypothesis tests.

In EWASs, it is well known that batch effects can threaten the validity of association results [14]. Commonly used methods for batch effect correction include empirical Bayes methods [17], surrogate variables [15, 16], and linear mixed-effects models (LMMs). However, it is difficult to completely control for or remove batch effects in association tests. Alternatively, we can use the ICC to quantify to what extent batch effects can affect the methylation measures. The probes for CpG sites with low ICC are expected to be more vulnerable to technical variability, including batch effects. Similar to the ICC for replicates, we calculated the ICC for chip effects (ICCchip), which is the proportion of variation in measured methylation levels due to chip differences in the experiment. Bose et al. [15] showed that when ICCchip is close to 1, most of the variation in methylation measures is due to technical "errors." Reliability of methylation measures is therefore low at these CpG sites, with the ICC for replicates close to 0. On the other hand, when the ICC for replicates is close to 1, technical variability contributes minimally to the methylation measures, and little batch effect is observed (ICCchip ≈ 0).

The ICC value is determined by two variance components: $\sigma_b^2$, the variance between independent samples, and $\sigma_w^2$, the variance within replicates of the same sample. When sample characteristics (e.g., cell type composition, sample ethnicity, or genetic variants) are controlled for, $\sigma_b^2$ becomes smaller, which leads to a smaller ICC than the current estimates. Although chip effects are considered part of batch effects, $\sigma_b^2$ can be overestimated when samples are placed on different chips. Bose et al. [15] showed that when chip effects are controlled for, both $\sigma_b^2$ and $\sigma_w^2$ can be reduced, and the ICC can either increase or decrease at different CpG sites.

In summary, we have examined the reliability of methylation measurements using the ICC and have demonstrated that the CpG sites assayed on a microarray platform can potentially be classified according to different levels of reliability. We have also compared the methylation levels acquired from different DNA methylation arrays. We hope that the procedures above can provide additional guidance on the inclusion or exclusion of CpG sites for future EWASs using microarray platforms.


References

1. Jones PA, Takai D (2001) The role of DNA methylation in mammalian epigenetics. Science 293(5532):1068–1070
2. Robertson KD (2005) DNA methylation and human disease. Nat Rev Genet 6(8):597–610
3. Scarano MI, Strazzullo M, Matarazzo MR et al (2005) DNA methylation 40 years later: its role in human health and disease. J Cell Physiol 204(1):21–35
4. Heyn H, Esteller M (2012) DNA methylation profiling in the clinic: applications and challenges. Nat Rev Genet 13(10):679–692
5. Qiu P, Zhang L (2012) Identification of markers associated with global changes in DNA methylation regulation in cancers. BMC Bioinformatics 13(Suppl 13):S7
6. Liu J, Chen J, Ehrlich S et al (2014) Methylation patterns in whole blood correlate with symptoms in schizophrenia patients. Schizophr Bull 40(4):769–776
7. Laird PW (2010) Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet 11(3):191–203
8. Sun Z, Chai HS, Wu Y et al (2011) Batch effect correction for genome-wide methylation data with Illumina Infinium platform. BMC Med Genomics 4:84
9. Chen C, Grennan K, Badner J et al (2011) Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One 6(2):e17238
10. Leek JT, Scharpf RB, Bravo HC et al (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11(10):733–739
11. Leek JT, Johnson WE, Parker HS et al (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28(6):882–883
12. Teschendorff AE, Zhuang J, Widschwendter M (2011) Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies. Bioinformatics 27(11):1496–1505
13. Johnson WE, Li C, Rabinovic A (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1):118–127
14. Meng H, Joyce AR, Adkins DE et al (2010) A statistical method for excluding non-variable CpG sites in high-throughput DNA methylation profiling. BMC Bioinformatics 11:227
15. Bose M, Wu C, Pankow JS et al (2014) Evaluation of microarray-based DNA methylation measurement using technical replicates: the Atherosclerosis Risk In Communities (ARIC) study. BMC Bioinformatics 15(1):1–10
16. Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the E-M algorithm. J R Stat Soc B 39:1–38
17. Lee G, Scott C (2012) EM algorithms for multivariate Gaussian mixture models with truncated and censored data. Comput Stat Data Anal 56(9):2816–2829

Chapter 3

Accurate Measurement of DNA Methylation: Challenges and Bias Correction

Eguzkine Ochoa, Verena Zuber, and Leonardo Bottolo

Abstract

DNA methylation is a key epigenetic modification involved in gene regulation whose contribution to disease susceptibility is still not fully understood. As the cost of genome sequencing technologies continues to drop, it will soon become commonplace to perform genome-wide quantification of DNA methylation at single base-pair resolution. However, the demand for its accurate quantification might vary across studies. When the scope of the analysis is to detect regions of the genome with different methylation patterns between two or more conditions, e.g., case vs. control or treatment vs. placebo, accuracy is not crucial. This is the case in epigenome-wide association studies used as genome-wide screening of methylation changes to detect new candidate genes and regions associated with a specific disease or condition. If the aim of the analysis is to use DNA methylation measurements as a biomarker for disease diagnosis and treatment (Laird, Nat Rev Cancer 3:253–266, 2003; Bock, Epigenomics 1:99–110, 2009), it is instead recommended to produce accurate methylation measurements. Furthermore, if the objective is the detection of DNA methylation in subclonal tumor cell populations, in circulating tumor DNA, or in any case of mosaicism, accuracy becomes critical. The aim of this chapter is to describe the factors that could affect the precise measurement of methylation levels, together with a recent Bayesian statistical method called MethylCal, and its extension, that has been proposed to minimize this problem.

Key words Incomplete bisulfite conversion, Region and coverage bias, Calibration technical replicates, Corrected methylation degree, Differential calibrated analysis

Weihua Guan (ed.), Epigenome-Wide Association Studies: Methods and Protocols, Methods in Molecular Biology, vol. 2432, https://doi.org/10.1007/978-1-0716-1994-0_3, © Springer Science+Business Media, LLC, part of Springer Nature 2022

1 Introduction

The deviation between the observed and the expected methylation level is mainly due to the chemical treatment used to discriminate between methylated and unmethylated cytosines. The detection of a methyl group at the 5-carbon of cytosines in a CpG dinucleotide is challenging because the methylation marks are lost after one cycle of PCR and because the large majority of high-throughput technologies can detect only four bases (C, T, G, A). In 1970, two independent groups, Shapiro et al. [3] and Hayatsu et al. [4], discovered a chemical reaction between sodium bisulfite and pyrimidines capable of deaminating cytosine into uracil. Ten years later, Wang et al. [5] observed significant differences in the speed of this reaction between 5-methylcytosine (5mC) and cytosine (C). Inspired by these works, Frommer et al. [6] proposed in 1992 to use the reaction to distinguish 5mC from C in the analysis of DNA methylation levels in genomic DNA. The deamination into uracil is much faster at cytosine residues than at 5mC. Since then, this chemical treatment has been called "bisulfite conversion"; it consists of the conversion of unmethylated cytosines into uracil while methylated cytosines remain unchanged.

The conversion begins with the denaturation of the genomic DNA at 95 °C in order to separate the DNA strands and make cytosines accessible to bisulfite treatment (Fig. 1). After denaturation, single-stranded genomic DNA undergoes a nucleophilic addition of bisulfite to the C-6 position of cytosine, allowing the deamination into 5,6-dihydrouracil-6-sulfonate. Treatment with an alkaline solution eliminates the sulfonate group and regenerates the double bond, yielding uracil (Fig. 1). Uracil is then replaced by thymine after one polymerase chain reaction cycle. Finally, after bisulfite conversion, DNA becomes single-stranded, fragmented, and no longer complementary.

The revolution brought by bisulfite conversion to the detection of DNA methylation has led in recent years to an increasing number of DNA methylation studies as well as to the identification of new epigenetic biomarkers and epigenetic signatures associated with specific diseases. All high-throughput technologies capable of detecting DNA methylation, including pyrosequencing, amplicon-based bisulfite sequencing, capture-based bisulfite sequencing, genome-wide methylation arrays, whole-genome bisulfite sequencing (WGBS), and reduced representation bisulfite sequencing (RRBS), use bisulfite conversion to discriminate methylated from unmethylated cytosines.

However, the bisulfite conversion of the genomic DNA also introduces some technical problems in the measurement of methylation levels that should be taken into account in downstream analyses.

1.1 Incomplete Bisulfite Conversion

During bisulfite conversion, only unmethylated cytosines are transformed into uracil, so any unconverted unmethylated cytosine will be read in the downstream analysis as a methylated cytosine, i.e., a false positive. The main reason for incomplete bisulfite conversion is incomplete denaturation or partial renaturation. Since double-stranded DNA is unreactive to bisulfite, incomplete denaturation causes cytosines to fail to react with bisulfite, so that unmethylated cytosines are not converted into uracil (Fig. 2).

Fig. 1 Bisulfite conversion reaction. During the treatment of double-stranded genomic DNA with bisulfite, DNA undergoes denaturation and conversion of unmethylated cytosines into uracil. This conversion begins with the addition of bisulfite to cytosine, producing cytosine sulfonate, which is deaminated into uracil sulfonate. After treatment with an alkaline solution, uracil sulfonate is converted into uracil

Fig. 2 Incomplete bisulfite conversion (left) produces a different sequence of bisulfite DNA compared to the sequence produced after complete conversion (right). The PCR product obtained after PCR amplification will also have a different sequence. In the example, two unmethylated cytosines (red asterisk) elude the conversion and produce biased results that lead to misinterpretation in downstream analyses. In this case, the incomplete bisulfite conversion of two cytosines will be read as two false methylated cytosines

Commercial bisulfite conversion kits like EZ DNA Methylation-Gold (Zymo©) assure a conversion efficiency greater than 99%. The conversion rate is calculated as the number of non-CpG cytosines converted into thymine over the total number of non-CpG sites analyzed [7, 8]. It is estimated at non-CpG sites because the majority of non-CpG sites are unmethylated, so their conversion should be close to 100%. Another option to assess the bisulfite conversion rate is the inclusion of completely unmethylated DNA controls. In the past, the most commonly used unmethylated DNA control was Lambda phage DNA, but many companies now provide commercial unmethylated DNA controls, especially for human and mouse studies. There are multiple methods to produce unmethylated DNA, such as DNA purification from cells that are knockout for both DNA methyltransferases DNMT1 (−/−) and DNMT3b (−/−), or DNA from whole-genome amplification. None of these methods produces exact values, and all have a similar error rate, around 5%.

In summary, the conversion rate gives a general overview of the efficiency of the bisulfite reaction. A conversion rate greater than 97% is considered good, while a rate below 90% is considered very poor for methylation analysis. However, since it is estimated at non-CpG sites, it may fail to provide an accurate assessment of the bisulfite conversion at CpG sites.

1.2 PCR Bias

Some technologies, like amplicon-based bisulfite sequencing or pyrosequencing, require the amplification of the region of interest to detect methylation levels. These technologies may suffer from an additional problem called "PCR bias," which is the preferential amplification of an allele and strand in the PCR due to its methylation state [9]. After bisulfite conversion, bisulfite DNA is single-stranded, fragmented, and no longer complementary, with a reduced number of cytosines, which results in a low-diversity sequence. Depending on the methylation status of each allele, the amplification of the region of interest can show variable efficiency, which affects the results and their accuracy.

In order to obtain accurate results, it is important to minimize the PCR-bias effect as much as possible. To this end, investigators [10, 11] have proposed to redesign primers by looking at strand-specific as well as bisulfite-specific flanking primers, but this solution is expensive and time-consuming and might not solve the problem completely. Instead, PCR bias can be calculated and corrected in silico [9, 12, 13] by using standard controls with known methylation levels (Fig. 3). The apparent level of methylation after PCR in the standard controls is used to correct the observed methylation levels in the case and control samples. The effect of PCR bias in amplicon-based approaches depends on the region and should be calculated during the development of new assays. The remaining technologies (pyrosequencing, methylation array, capture-based BS, WGBS, and RRBS) use PCR only for the enrichment of the library, and this step is considered to contribute little to the deviation between the observed and the expected DNA methylation. Despite the cost, the analysis of at least three standard controls with known methylation levels (0%, 50%, and 100% actual methylation) is desirable to obtain accurate calibrated methylation measurements.

Fig. 3 The observed degree of bias introduced by PCR amplification in (a) KCNQ1OT1, (b) H19/IGF2, and (c) PLAGL1 DMRs by amplicon-based bisulfite sequencing presented in Ochoa et al. [13]. The apparent level of methylation observed after amplification (y-axis) is plotted as a function of the actual methylation percentage (x-axis). The black dashed line represents Moskalev's cubic polynomial interpolation curve [12] on the average (across CpGs) apparent level of methylation. The gray dot-dashed line represents an unbiased plot. The deviation of observed DNA methylation from the expected DNA methylation is similar in the first two target regions (low degree of preferential PCR amplification of unmethylated alleles), but it is different in the PLAGL1 DMR (high degree of methylated DNA loss)

1.3 Region-Specific Bias

A study of the influence of DNA sequence and methylation status on bisulfite conversion revealed non-B DNA structures that can form monomolecular double-stranded regions resistant to bisulfite treatment [14]. In addition, the same authors [15] reported that the bisulfite reactivity of neighboring cytosines, within a DNA sequence that has the potential to form secondary structures, was influenced by certain methyl groups. The majority of repetitive DNA motifs in the genome form the canonical right-handed B-form, but some may also adopt alternative conformations, called non-B DNA structures, such as hairpins/cruciforms, Z-DNA, triplexes (H-DNA), G-quadruplexes, and slipped DNA (Fig. 4). These non-B DNA structures are more resistant to denaturation and therefore resistant to bisulfite conversion. In these cases, a high bisulfite conversion rate, which demonstrates the general success of the bisulfite conversion, will not warn about region-specific bisulfite conversion errors.

Fig. 4 Non-B DNA structures formed by genomic repetitive sequences (Figure adapted from Bacolla et al. [17]). (a) Slipped (hairpin) structure; (b) Triplex; (c) Left-handed Z-DNA; (d) Quadruplex; (e) Cruciform; (f) DNA unwinding element

In 2008, Clark et al. [16] described a region inaccessible to bisulfite treatment in a DNA methylation hotspot for breast cancer, produced by a non-B DNA structure formed by tandem repeat elements. Repetitive motifs folded in non-B structures are responsible for repeat expansions or genomic rearrangements associated with neurodegenerative disorders and cancer [18]. These structures are chromosomal targets for DNA repair, recombination, and aberrant DNA synthesis, indicating the crucial role of these regions in human disease [17]. Repeated elements folded in non-B conformations are also frequent in imprinted regions, and they are considered to take part in the regulation of these regions [19].

The majority of DNA methylation studies focus their effort on CpG sites (28 million in the human genome), which are predominantly methylated [20] and heterogeneously distributed across the genome. However, only 21.8% of autosomal CpGs show dynamic regulation, and most of these are located distal to transcription start sites, co-localizing with gene regulatory elements (enhancers and transcription factor-binding sites) and repetitive elements [20]. Therefore, there is a high probability that some clinically relevant genomic regions show resistance to denaturation during bisulfite treatment, which, in turn, will affect methylation measurements. These observations suggest that regions with repetitive elements should be carefully studied to avoid misinterpretation of the observed methylation levels. The analysis of standard controls with different actual methylation levels could help to understand the response of these regions to bisulfite conversion and improve the interpretation of the results.

1.4 Coverage Bias

The accuracy of methylation detection also depends on the coverage, or read count, obtained per CpG. Recently, a comparison between whole-genome bisulfite sequencing (WGBS) and methylation array data revealed that a minimum coverage of 100× is recommended for WGBS to achieve a level of precision broadly comparable to that of the methylation array [21] (Fig. 5). Ziller et al. [22] showed that at the previously recommended sequencing depth of 30×, the technical variability of WGBS is two- to threefold higher than that of the methylation array. Insufficient read counts lead to less precise quantification. This problem mainly affects WGBS, because pyrosequencing, amplicon-based bisulfite sequencing, capture-based bisulfite sequencing, and RRBS can easily achieve very high sequencing depth (>100×) for a reasonable price.

Fig. 5 Comparison of the standard deviation (SD) (y-axis) of methylation levels between methylation array and WGBS at different levels of the average read depths (x-axis) (Figure adapted from Zhou et al. [21])

Table 1 Factors that could affect methylation measurements stratified by technology. BS bisulfite sequencing, WGBS whole-genome bisulfite sequencing, RRBS reduced representation bisulfite sequencing

Technology           BS conversion   PCR bias   Region bias   Coverage bias
Pyrosequencing       ✓               ✓          ✓
Amplicon-based BS    ✓               ✓          ✓
Methylation array    ✓                          ✓
Capture-based BS     ✓                          ✓
WGBS                 ✓                          ✓             ✓
RRBS                 ✓                          ✓

In summary, there are some limitations to obtaining accurate methylation measurements. Some of these limitations, like incomplete BS conversion or region bias, affect all technologies equally, whereas others, like PCR bias and coverage bias, affect only some technologies (Table 1). Therefore, DNA methylation studies should consider these factors during the design, analysis, and interpretation of DNA methylation levels.
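To make the link between read depth and precision concrete, the sketch below treats the number of methylated reads at a CpG as binomial, so that the standard deviation of the estimated methylation level is sqrt(p(1 − p)/n) for true level p and depth n. This accounts for sampling variability only and is an assumption of this illustration, not the analysis performed by Zhou et al. [21] or Ziller et al. [22]; the function name `sd_methylation` is hypothetical.

```r
# Binomial sampling SD of a methylation estimate at true level p and read depth n
sd_methylation <- function(p, depth) sqrt(p * (1 - p) / depth)

depths <- c(10, 30, 100, 500)
data.frame(depth = depths, sd = sd_methylation(0.5, depths))

# Ratio of SDs at depth 30 versus depth 100; equals sqrt(100 / 30)
# for any methylation level strictly between 0 and 1
sd_methylation(0.5, 30) / sd_methylation(0.5, 100)
```

Under this simplified model, moving from 30× to 100× reduces the sampling SD by a factor of about 1.8, independent of the methylation level.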

2 Materials

• MethylCal is available as an R package at https://github.com/lb664/MethylCal.

• Online R help files are associated with MethylCal's functions. They provide a detailed description of the syntax and the arguments of the functions as well as the class of the values returned by each function.

• A vignette included in the R package illustrates, by means of simple examples, data pre-processing, the input data format, and the output generated by MethylCal.

• A markdown step-by-step user's guide that explains how to perform the analysis and produce the plots presented in this book chapter is also available at https://github.com/lb664/MethylCal.

3 Methods

The first step to improve the accuracy of the methylation levels is to control the most critical point, the bisulfite conversion, and to design the study using an appropriate coverage per region. The second step is the use of methylation controls to determine the risk of each region suffering incomplete conversion. The third step is the estimation and correction of the deviation of the observed DNA methylation from the expected DNA methylation using appropriate statistical algorithms. The efficiency of the conversion reaction and the structure of the genomic region influence the "observed:expected" (O:E) ratio. If this quantity is small, the correction produced by the calibration will be negligible, but if the ratio is large, the calibration will produce results that could depart significantly from the observed methylation levels.

3.1 Statistical Methods to Correct Methylation Levels

PCR bias can be calculated and corrected in silico by using standard controls with known methylation levels. Different statistical models have been proposed to correct the observed methylation levels, specifically the best-fit hyperbolic curve [9], the cubic polynomial curve [12], and a recently proposed Bayesian calibration model called MethylCal [13]. MethylCal can jointly analyze all CpGs within a CpG island (CGI) or a differentially methylated region (DMR), avoiding both "one-at-a-time" CpG calibration and the calibration of the average methylation level across CpGs, which ignores the variability and dependence across CpGs [23]. The second key feature of MethylCal is the inclusion of random effects [24] that capture the unobserved heterogeneity (variability) of the apparent methylation levels at each actual methylation percentage, at each CpG, or at a combination of both. This enables more precise modeling of the methylation measurements observed in the standard controls than the alternative methods [9, 12], which assume either a single calibration curve across all CpGs or independent calibration curves for each CpG. MethylCal's modularity also allows the analysis of complex experimental designs. While the derivation of calibration curves from a single standard control sample is the typical experimental design in clinical diagnostic and research laboratories, technical replicates might also be available. Rather than taking the average across them, MethylCal can be extended, with negligible computational cost, to include a further random effect that models the variability of the apparent level of methylation across repeated control experiments (technical replicates), allowing precise quantification of biological and technical variability.
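The simplest of these ideas, a single cubic polynomial calibration curve fitted to the standard controls and then inverted to correct an observed value, can be sketched in a few lines of base R. This is an illustration only and not MethylCal, since it includes no random effects and no Bayesian inference; the control measurements are made-up numbers, and the names `predict_apparent` and `correct_methylation` are hypothetical.

```r
# Standard controls: actual methylation percentage (AMP) and the apparent
# level measured after PCR (made-up numbers for illustration).
controls <- data.frame(
  amp      = c(0, 25, 50, 75, 100),
  apparent = c(0, 13.3, 31.2, 58.6, 100)
)

# Cubic polynomial calibration curve:
# apparent = b0 + b1 * amp + b2 * amp^2 + b3 * amp^3
fit <- lm(apparent ~ poly(amp, 3, raw = TRUE), data = controls)

# Predicted apparent methylation for a candidate actual methylation level
predict_apparent <- function(x) {
  as.numeric(predict(fit, newdata = data.frame(amp = x)))
}

# Corrected value: the actual level in [0, 100] whose predicted apparent
# level is closest (in squared error) to the observed one.
correct_methylation <- function(y_obs) {
  optimize(function(x) (y_obs - predict_apparent(x))^2,
           interval = c(0, 100))$minimum
}

correct_methylation(50)   # corrected value for an observed level of 50%
```

Because the fitted curve in this example lies below the diagonal, an observed level of 50% is corrected upward. The numerical inversion relies on the fitted curve being monotone over 0% to 100%, which should be checked for real control data.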

3.2 Model Outline

MethylCal's regression model and its extension for repeated control experiments can be described as follows:

$$y_{ijk} = \beta_{0k} + \beta_{1k} x_{ijk} + \beta_{2k} x_{ijk}^{2} + \beta_{3k} x_{ijk}^{3} + \mathrm{RE}_{ij} + \mathrm{RE}_{k} + \epsilon_{ijk}, \qquad (1)$$

where $y_{ijk}$ is the apparent level of methylation after PCR at the $i$th actual methylation percentage (AMP) ($i = 1, \ldots, I$), $j$th CpG ($j = 1, \ldots, J$), and $k$th technical replicate ($k = 1, \ldots, K$); $x_{ijk}$ is the $i$th AMP ($x_{1jk} = 0\%$ and $x_{Ijk} = 100\%$), which is assumed constant across CpGs and technical replicates; and $\beta_{0k}, \ldots, \beta_{3k}$ are the intercept and coefficients of the cubic polynomial regression. In a typical experimental design, the percentage control points $x_{ijk}$ are equally spaced and placed at least at 0%, 50%, and 100% actual methylation. Finally, $\mathrm{RE}_{ij}$ and $\mathrm{RE}_{k}$ are random effects and $\epsilon_{ijk} \sim N(0, \sigma^{2})$.

The specification of the random effects $\mathrm{RE}_{ij}$ is key in MethylCal's formulation (Eq. 1) (full details regarding the technical replicate random effects $\mathrm{RE}_{k}$ are presented in Subheading 3.3). Specifically,

• Simple random effects model, $\mathrm{RE}_{ij} = \mathrm{AMP}_{i}$: The random effects $\mathrm{AMP}_{i}$ are introduced to model the variability of the apparent level of methylation after PCR at different AMPs that is not explained by the cubic polynomial regression;

• Crossed random effects model, $\mathrm{RE}_{ij} = \mathrm{AMP}_{i} + \mathrm{CpG}_{j}$: Besides the random effects $\mathrm{AMP}_{i}$, the crossed random effects $\mathrm{CpG}_{j}$ are added to capture the heterogeneity of the apparent level of methylation across CpGs;

• Paired random effects and latent Gaussian field model, $\mathrm{RE}_{ij} = \mathrm{AMP}_{i} + \mu_{j}$: The crossed random effects $\mathrm{CpG}_{j}$ are replaced by the latent Gaussian field $\boldsymbol{\mu} = (\mu_{1}, \ldots, \mu_{J})^{T}$ to smooth the spatial dependence of the apparent levels of methylation across CpGs;

• Crossed random effects with random-intercepts and random-slopes model, $\mathrm{RE}_{ij} = \mathrm{AMP}_{i} + \mathrm{CpG}_{j} + \widetilde{\mathrm{CpG}}_{j} \times x_{ij}$: The random slopes $\widetilde{\mathrm{CpG}}_{j}$ are added to model the variability of the apparent level of methylation after PCR across CpGs at lower/higher AMPs (we dropped the index $k$ in $x_{ijk}$ since the AMPs are assumed constant across technical replicates).
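To make the indexing of Eq. 1 concrete, the sketch below simulates a long-format data set under the crossed random effects specification with an additional replicate effect. All numerical values are arbitrary, the column and object names are hypothetical, and the cubic coefficients are shared across replicates for brevity, whereas Eq. 1 allows them to differ by replicate.

```r
set.seed(1)
amp_levels <- c(0, 25, 50, 75, 100)   # I = 5 actual methylation percentages
n_cpg <- 4                            # J = 4 CpGs in the region
n_rep <- 3                            # K = 3 technical replicates

# One row per (AMP, CpG, replicate) combination
d <- expand.grid(amp = amp_levels, cpg = seq_len(n_cpg), repl = seq_len(n_rep))

beta   <- c(1, 0.5, 0, 0.00005)                 # cubic coefficients (arbitrary)
re_amp <- rnorm(length(amp_levels), sd = 1)     # AMP_i random effects
re_cpg <- rnorm(n_cpg, sd = 2)                  # CpG_j random effects
re_rep <- rnorm(n_rep, sd = 0.5)                # RE_k replicate random effects

# Apparent methylation: cubic trend + random effects + residual noise
d$y <- beta[1] + beta[2] * d$amp + beta[3] * d$amp^2 + beta[4] * d$amp^3 +
  re_amp[match(d$amp, amp_levels)] + re_cpg[d$cpg] + re_rep[d$repl] +
  rnorm(nrow(d), sd = 1)

nrow(d)   # I * J * K rows
head(d)
```

A data frame of this shape, with one row per control point, CpG, and replicate, is the natural input for fitting any of the four random effects specifications listed above.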

The prior distributions on all unknown quantities (intercept and coefficients of the cubic polynomial regression, random effects, and the latent Gaussian field) are discussed in detail in Ochoa et al. [13]. MethylCal uses INLA [25] for fast inference and for model selection among the random effects specifications, selecting the REij model that fits the data best based on the DIC [26]. Figures 6 and 7 show the departure of MethylCal from the simple hypothesis of cubic polynomial regression (black dots) under different specifications of the random effects REij.

Fig. 6 Schematic representation of MethylCal's models. The apparent level of methylation observed after amplification (y-axis) is plotted as a function of the actual methylation percentages (x-axis). The black dots and the black dotted line represent the apparent level of methylation predicted by the best-fit cubic polynomial regression across all CpGs, whereas dots and red dotted lines depict the level of methylation predicted by MethylCal's models. Panels (a), (b), (c), and (d) correspond to different specifications of the random effects REij: simple random effects, crossed random effects, paired random effects and latent Gaussian field, and crossed random effects with random intercepts and random slopes, respectively. In (a), the calibration curves for different CpGs overlap

3.3 Calibration Experiment with Technical Replicates

While the cost of running a control experiment may prevent its wide applicability, such an experiment can greatly reduce the effects of PCR bias, as well as of BS conversion and region bias (Table 1). We advise including at least three percentage control points at 25%, 50%, and 75% actual methylation, in addition to 0% and 100%, prepared by mixing different ratios of nonmethylated (0%) and methylated (100%) bisulfite-converted control DNA.

To raise the sensitivity of the correction curve, rather than increasing the number of AMPs, a different strategy is to generate technical replicates, i.e., repeated measurements of the same control experiment that represent independent measures of the random noise associated with the preparation of different control DNA ratios. The main problem connected to the generation of these ratios is that currently only 0% and 100% methylated DNA are commercially available for humans. The rest, 25%, 50%, and 75%, are generated manually by mixing appropriate volumes of 0% and 100% methylated DNA. During this process, any pipetting error will create an incorrect expected ratio. In this scenario, MethylCal's regression model (Eq. 1) is extended to include the random effects REk (k = 1, ..., K) that model the heterogeneity (variability) of the levels of apparent methylation across replicates. Instead of fitting a separate MethylCal model for each control experiment, all technical replicates are jointly analyzed using the whole vector of observations y of dimension n = I × J × K.
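The effect of a pipetting error on the prepared control can be illustrated with simple arithmetic. The sketch below assumes that the 0% and 100% methylated stocks have the same DNA concentration, so that the mixing ratio by volume equals the methylation percentage; the function names, the 20 µL total volume, and the 10% error are illustrative assumptions.

```r
# Volumes of 100% and 0% methylated control DNA needed for a target
# percentage, assuming both stocks have the same DNA concentration.
mix_volumes <- function(target_pct, total_ul = 20) {
  v_meth <- total_ul * target_pct / 100
  c(methylated_ul = v_meth, unmethylated_ul = total_ul - v_meth)
}

# Actual methylation percentage obtained from the pipetted volumes
actual_pct <- function(v_meth, v_unmeth) 100 * v_meth / (v_meth + v_unmeth)

v <- mix_volumes(25)   # 5 uL methylated + 15 uL unmethylated
actual_pct(v[["methylated_ul"]], v[["unmethylated_ul"]])         # 25
actual_pct(1.1 * v[["methylated_ul"]], v[["unmethylated_ul"]])   # about 26.8
```

In this example, pipetting 10% too much methylated DNA turns a nominal 25% control into one of roughly 26.8%, which is the kind of replicate-level deviation that the random effects REk are meant to absorb.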

Fig. 7 The apparent level of methylation observed after amplification (y-axis) is plotted as a function of the CpGs (x-axis) in the target region. The black dots at x = 0 represent the level of methylation predicted by the best-fit cubic polynomial regression across all CpGs, whereas dots and red dotted lines depict the level of methylation predicted by MethylCal's models. Panels (a), (b), (c), and (d) correspond to different specifications of the random effects REij as in Fig. 6

3.4 Corrected Methylation Degree

The specification of a further random effect for the technical replicates leads to a modification of the original MethylCal algorithm for the derivation of the corrected methylation degree. Following Ochoa et al. [13], for each CpG ($j = 1, \ldots, J$) and for each technical replicate ($k = 1, \ldots, K$), the PCR-bias-corrected methylation degree $\hat{x}_{jk} \in [0\%, 100\%]$ is the solution of

$$\hat{x}_{jk} = \arg\min_{x_{jk}} \left[ y_{j}^{\mathrm{obs}} - \left\{ \mathrm{E}(\boldsymbol{\beta}_{k} \mid \boldsymbol{y})\, \boldsymbol{x}_{jk} + \eta(x_{jk}) + C_{jk} \right\} \right]^{2}, \qquad (2)$$

where $y_{j}^{\mathrm{obs}} \in [0\%, 100\%]$ is the observed level of methylation at the $j$th CpG, $\mathrm{E}(\boldsymbol{\beta}_{k} \mid \boldsymbol{y})$ is the posterior mean of the intercept and regression coefficients of the cubic polynomial regression in the $k$th control experiment, $\boldsymbol{x}_{jk} = (1, x_{jk}, x_{jk}^{2}, x_{jk}^{3})$, and $\eta(x_{jk})$ is the predicted value, evaluated at $x_{jk}$, of the smoothed posterior mean of the random effects $\mathrm{AMP}_{i}$ obtained from a cubic spline that interpolates $\mathrm{E}(\mathrm{AMP}_{i} \mid \boldsymbol{y})$ ($i = 1, \ldots, I$). Finally, according to the specification of the random effects $\mathrm{RE}_{ij}$, the term $C_{jk}$ is one of

$C_{jk} = \mathrm{E}(\mathrm{RE}_{k} \mid \boldsymbol{y})$,
$C_{jk} = \mathrm{E}(\mathrm{CpG}_{j} \mid \boldsymbol{y}) + \mathrm{E}(\mathrm{RE}_{k} \mid \boldsymbol{y})$,
$C_{jk} = \mathrm{E}(\mu_{j} \mid \boldsymbol{y}) + \mathrm{E}(\mathrm{RE}_{k} \mid \boldsymbol{y})$, and
$C_{jk} = \mathrm{E}(\mathrm{CpG}_{j} \mid \boldsymbol{y}) + \mathrm{E}(\widetilde{\mathrm{CpG}}_{j} \mid \boldsymbol{y}) \times x_{jk} + \mathrm{E}(\mathrm{RE}_{k} \mid \boldsymbol{y})$,

where $\mathrm{E}(\mathrm{CpG}_{j} \mid \boldsymbol{y})$, $\mathrm{E}(\widetilde{\mathrm{CpG}}_{j} \mid \boldsymbol{y})$, $\mathrm{E}(\mu_{j} \mid \boldsymbol{y})$, and $\mathrm{E}(\mathrm{RE}_{k} \mid \boldsymbol{y})$ are the posterior means of the random intercepts $\mathrm{CpG}_{j}$, the random slopes $\widetilde{\mathrm{CpG}}_{j}$, the latent Gaussian field $\mu_{j}$, and the technical replicate random effects $\mathrm{RE}_{k}$, respectively. Note that $C_{jk}$ depends on the unknown PCR-bias-corrected methylation degree only when the random slopes $\widetilde{\mathrm{CpG}}_{j}$ are specified in $\mathrm{RE}_{ij}$.

The solution of the minimization problem (Eq. 2) produces $K$ corrected methylation degrees $\hat{x}_{jk}$ for each observed level of methylation $y_{j}^{\mathrm{obs}}$ at each CpG. This is expected since there is a different correction depending on which technical replicate is considered, although the posterior estimates are obtained by using all $n$ observations at once. To get a unique calibration curve, we interpolate $\hat{x}_{jk}$ by a cubic spline across the technical replicates and along CpGs and finally obtain a single corrected value $\hat{x}_{j} \in [0\%, 100\%]$ ($j = 1, \ldots, J$).

3.5 Real Case Applications

Interest in DNA methylation and its correlation with diagnosis, prognosis, and treatment is constantly growing. There are well-known DNA methylation biomarkers that support the diagnosis of imprinting disorders [27] and clinical decisions in various cancers [28–32], and that are also used for noninvasive prenatal testing [33]. In clinical practice, accurate DNA methylation levels are crucial because they can influence treatment decisions or further actions.
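Before turning to the applications, the correction step of Subheading 3.4 can be sketched numerically. The Python code below is a simplified stand-in for Eq. 2 and not MethylCal's implementation: it keeps only a cubic calibration curve per technical replicate (the spline term and the random-effect term of Eq. 2 are omitted), solves the minimization by grid search over [0%, 100%], and averages the replicate-specific solutions instead of interpolating them with a cubic spline. The coefficients and the observed value are invented for illustration.

```python
def cubic(coefs, x):
    """Apparent methylation predicted at actual methylation x (both in %)."""
    b0, b1, b2, b3 = coefs
    return b0 + b1 * x + b2 * x**2 + b3 * x**3


def corrected_degree(y_obs, coefs, step=0.01):
    """Grid-search argmin over x in [0, 100] of (y_obs - cubic(coefs, x))**2."""
    n_steps = int(round(100 / step))
    grid = [i * step for i in range(n_steps + 1)]
    return min(grid, key=lambda x: (y_obs - cubic(coefs, x)) ** 2)


# Hypothetical calibration curves for K = 2 technical replicates.
# Each one maps 0% -> 0% and 100% -> 100% but under-reports in between.
replicate_coefs = [
    (0.0, 0.50, 0.005, 0.0),  # y = 0.50 x + 0.005 x^2
    (0.0, 0.60, 0.004, 0.0),  # y = 0.60 x + 0.004 x^2
]

y_obs = 37.5  # observed (apparent) methylation at one CpG, in %
per_replicate = [corrected_degree(y_obs, c) for c in replicate_coefs]

# Simple average across replicates (a stand-in for the spline interpolation).
x_hat = sum(per_replicate) / len(per_replicate)

print([round(x, 2) for x in per_replicate], round(x_hat, 2))
```

The two replicate-specific solutions differ even though the observed value is the same, which is why a single combined value has to be derived when several control replicates are available.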

3.5.1 Application in Clinical Diagnostic of Celiac Patients

Celiac disease (CD) is a common, immune-mediated enteropathy induced by exposure to gluten. Multiple studies in the last decade have revealed an important relationship between autoimmunity and DNA methylation [34, 35]. Recently, a methylation array study in celiac disease was published, confirming the involvement of DNA methylation in the pathogenesis of this disease [36]. The validation and identification of new diagnostic, prognostic, and/or therapeutic biomarkers related to DNA methylation in celiac disease will require approaches that interrogate specific targets, such as pyrosequencing or amplicon-based bisulfite sequencing. As we explained earlier in this chapter, these methodologies may suffer from PCR bias, which should be considered and corrected in order to obtain accurate DNA methylation measurements. In the following example, we show pyrosequencing results from three assays targeting the NFKBIA, RELA, and TNFAIP3 genes, which were previously associated with susceptibility to celiac disease. Figure 8 presents the calibration curves for the target regions considered. It is apparent that there is a large variability of the apparent level of methylation in the control experiment, especially at medium and


Fig. 8 Calibrated methylation levels of (a, b) NFKBIA, (c, d) RELA, and (e, f) TNFAIP3 assays in celiac patients using MethylCal. (a, c, e) The apparent level of methylation observed after PCR (y-axis) is plotted as a function of the actual methylation percentage (AMP) (x-axis). Circles depict the apparent level of methylation for each CpG at different AMPs, whereas red dotted lines show the predicted level of methylation. Black dots highlight potential outliers. (b, d, f) The apparent level of methylation (y-axis) is plotted (circles) against the CpGs in the target region (x-axis), stratified by AMP (top panel box and gray dot-dashed line). For each stratum, red dotted and dot-dashed lines show the level of methylation and the 95% prediction credible interval. All panels have been generated using MethylCal’s R package. For example, (a) and (b) are obtained by typing in the R command line: MethylCalCalibrationOutlierPlot(Celiac_data, Target = "NFKBIA")

high percentages of actual methylation. Figure 8b, d, f shows that this is mainly due to abnormally low values of the apparent level of methylation measured at CpGs located at the end of each target region, in particular in the NFKBIA and RELA assays. MethylCal is able to identify several of these observations and automatically mark them as possible outliers (black dots). They can then be removed from the analysis (together with the corresponding CpGs in the observed levels of methylation $y^{\mathrm{obs}}$), or the data generation process can be investigated further to understand the possible causes of this unusual pattern, including biological and biochemical factors. For instance, several of the factors reported in Table 1 that could affect methylation measurements may also play an important role in the derivation of the calibration curves, exerting their influence on the apparent level of methylation measured in the control experiment. A combination of these factors can explain the anomalous pattern of the methylation levels observed at CpGs located at the end of the target regions.


Fig. 9 Corrected methylation degree of (a, b) NFKBIA, (c, d) RELA, and (e, f) TNFAIP3 assays in healthy controls (top panels) and celiac patients (bottom panels) using MethylCal. For each individual (x-axis), the boxplot depicts the range and the mean (circle) of the corrected methylation degree (y-axis) across CpGs, while the dashed-dotted gray lines show the healthy controls’ 3SD confidence interval. Top red triangles in bottom panels indicate patients classified as having undergone gain of methylation. All panels have been generated using MethylCal’s R package. For example, (a) and (b) are obtained by typing in the R command line: MethylCalCorrection(Celiac_data, Target = "NFKBIA", n_Control = 13, n_Case = 17)

In all assays analyzed, the crossed random effects model with random intercepts and random slopes, $RE_{ij} = AMP_i + CpG_j + CpG_j^{*}\, x_{ij}$, is chosen as the best-fit model. Looking at Fig. 8a, c, e, this is expected since the variability of the apparent level of methylation after PCR across CpGs increases at higher AMPs. Once the calibration curves had been estimated, we applied MethylCal’s PCR-bias correction to obtain the methylation degree $\hat{x}_j \in [0\%, 100\%]$ ($j = 1, \ldots, J$) as the solution to the minimization problem (Eq. 2) with $k = 1$. Figure 9 shows the corrected methylation degree of the NFKBIA, RELA, and TNFAIP3 assays in healthy controls (top panels) and celiac patients (bottom panels). The statistical test used to declare individuals as having undergone loss/gain of methylation is based on the mean of the corrected methylation degree across CpGs for each individual in the case and control groups. Patients with an average corrected methylation level below healthy individuals’ three standard deviations (3SD) confidence interval were considered to undergo loss of methylation, and those with a level above the 3SD confidence interval were considered to


experience gain of methylation. To avoid false positives, we chose healthy individuals’ 3SD confidence interval since it guarantees a very low type-I error ($\alpha = 2.7 \times 10^{-3}$). The results of the differential calibrated analysis displayed in the bottom panels of Fig. 9 show that, while the RELA assay is not clinically relevant for distinguishing patients from healthy controls, in the NFKBIA and TNFAIP3 assays we were able to classify patients 09D, 12D, and 14D and patients 09D, 10D, and 11D, respectively, as having undergone gain of methylation. For the NFKBIA and TNFAIP3 assays, it is also evident that almost all celiac patients considered were borderline cases that are difficult to diagnose. More sophisticated in silico statistical calibration tools, and consequently more reliable corrections of the methylation levels in target regions, are therefore essential for improving diagnosis based on methylation data.

3.5.2 Application in Clinical Diagnostic: Mosaic Beckwith–Wiedemann Syndrome

Beckwith–Wiedemann syndrome (BWS) is an imprinting disorder caused by genetic and epigenetic abnormalities that dysregulate imprinted genes on 11p15.5 [37]. This rare disorder is characterized by overgrowth, macroglossia, hemihyperplasia, exomphalos, and embryonal tumors, and is produced by increased expression of the growth factor IGF2 or reduced expression of the growth suppressor CDKN1C. The most frequent epigenetic alterations in BWS are loss of methylation of maternal KCNQ1OT1, gain of methylation of maternal H19/IGF2, and paternal uniparental disomy of chromosome 11. Furthermore, BWS frequently presents as mosaic due to alterations during embryonic development. In mosaic cases, methylation measurements will depend on the percentage of cells carrying the alteration in the tissue tested. Depending on the level of mosaicism, methylation levels may be borderline for diagnostic criteria, and accuracy may be critical in these cases. Figure 10 shows the observed degree of bias introduced by PCR amplification in three target regions chosen for their clinical diagnostic power in BWS (KCNQ1OT1 and H19/IGF2 DMRs) or for their high degree of methylated DNA loss (PLAGL1 DMR) at intermediate percentages of actual methylation (25%, 50%, and 75%). In these target regions, rather than increasing the number of AMPs, we generated data from two independent technical replicates to assess the variability associated with the preparation of different control DNA ratios. In Fig. 10, significant differences were observed at 25%, 50%, and 75% actual methylation in the KCNQ1OT1 DMR, while values diverge slightly between replicates at 75% and 100% in the H19/IGF2 DMR and at 25% in the PLAGL1 DMR. The straightforward modification of MethylCal’s algorithm presented in Subheading 3.3, which applies when independent technical replicates are considered in the control experiment, allows us to estimate the calibrated methylation levels in this new experimental design.
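To see why mosaic cases can end up near the diagnostic borderline, consider a simplified model that is an assumption made for this illustration and not a result from the chapter: an imprinted DMR is fully methylated on one parental allele and unmethylated on the other, so unaffected cells contribute 50% methylation, while affected cells have completely lost (0%) or gained (100%) methylation. The expected tissue-level methylation is then a weighted average over the two cell populations, as in this short Python sketch with a hypothetical function name.

```python
def expected_dmr_methylation(affected_fraction, alteration):
    """Expected methylation (%) at an imprinted DMR in a mosaic tissue.

    affected_fraction: fraction of cells carrying the alteration (0 to 1).
    alteration: "loss" (affected cells at 0%) or "gain" (affected cells at 100%).
    Unaffected cells are assumed to sit at 50% (one methylated allele).
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    affected_level = {"loss": 0.0, "gain": 100.0}[alteration]
    return (1.0 - affected_fraction) * 50.0 + affected_fraction * affected_level


for fraction in (0.0, 0.1, 0.2, 0.5, 1.0):
    loss = expected_dmr_methylation(fraction, "loss")
    gain = expected_dmr_methylation(fraction, "gain")
    print(f"{fraction:.0%} affected cells: loss -> {loss:.1f}%, gain -> {gain:.1f}%")
```

Under this model, 10% affected cells shift the expected level by only 5 percentage points, a change small enough that an uncorrected amplification bias could plausibly hide it.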


Fig. 10 The observed degree of bias introduced by PCR amplification in (a) KCNQ1OT1, (b) H19/IGF2, (c) PLAGL1 DMRs by amplicon-based bisulfite sequencing. The apparent level of methylation observed after amplification (y-axis) is plotted as a function of the actual methylation percentage (x-axis). Color-coded circles (white and gray) indicate two independent calibration replicates. The black dashed line represents Moskalev’s cubic polynomial interpolation curve of the average (across CpGs) apparent level of methylation for each technical replicate. The gray dot-dashed line represents an unbiased plot. The deviation of observed DNA methylation from the expected DNA methylation is significantly different between the technical replicates at 25% and 75% actual methylation in KCNQ1OT1 DMR while differences between replicates are apparent at 75% and 100% in H19/IGF2 and 25% in PLAGL1 DMRs. All panels have been generated using MethylCal’s R package. For example, (a) is obtained by typing in the R command line: ExploratoryPlot(BWS_data, Target = "KCNQ1OT1")

Figure 11 displays the output of MethylCal’s algorithm in the three target regions considered. While there is a negligible difference between the replicate-specific calibration curves (dark red dotted lines) and the cubic spline interpolant (red dashed line) in the H19/IGF2 and PLAGL1 DMRs (Fig. 11c, e), in the KCNQ1OT1 DMR the cubic spline interpolant lies between the calibration curves obtained in each technical replicate (Fig. 11a). A closer inspection of the results shows that, in order to generate well-separated replicate-specific calibration curves, there should be discordant values of the apparent level of methylation between replicates in at least two AMPs (Fig. 11b at 25% and 75%), and the overall magnitude of the observed degree of bias at different AMPs should be the same across technical replicates, in contrast to the H19/IGF2 DMR, where at 75% and 100% the levels of the apparent methylation after PCR are flipped between technical replicates (Fig. 11d). Finally, in the PLAGL1 DMR abnormally low values of the apparent level of methylation are measured at CpGs located near the end of the target region (Fig. 11f), and this phenomenon is reproduced in both technical replicates. The cause of this behavior is not completely clear and may depend on a combination of factors, but the additional information obtained from standard controls at different AMPs shows that it is CpG specific, suggesting a possible region bias. In addition, the behavior observed in Fig. 11f indicates that the observed bias grows with the level of apparent methylation. The


Fig. 11 Calibrated methylation levels of (a, b) KCNQ1OT1, (c, d) H19/IGF2, and (e, f) PLAGL1 DMRs in mosaic Beckwith–Wiedemann syndrome using MethylCal. (a, c, e) The apparent level of methylation observed after PCR (y-axis) is plotted as a function of the actual methylation percentage (AMP) (x-axis). Color-coded circles (white and gray) depict the apparent level of methylation for each CpG at different AMPs in two independent calibration replicates. In each panel, dark red dotted lines show the predicted level of methylation for each technical replicate while the red dashed line displays the cubic spline interpolant of the predicted level of methylation curves. (b, d, f) The apparent level of methylation (y-axis) is plotted (circles) against the CpGs in the DMR (x-axis), stratified by AMP (top panel box and gray dot-dashed line) with different colors (white and gray) for the independent calibration replicates. For each stratum, dark red dotted lines show the predicted level of methylation for each technical replicate while the red dashed line displays the cubic spline interpolant of the predicted level of methylation curves. All panels have been generated using MethylCal’s R package. For example, (a) and (b) are obtained by typing in the R command line: MethylCalCalibrationPlot(BWS_data, Target = "KCNQ1OT1")

most plausible explanation of this pattern could be a secondary DNA structure, formed in the presence of methyl groups, that hinders access of the M.SssI methyltransferase enzyme used to produce commercial fully methylated DNA. For the KCNQ1OT1 and H19/IGF2 DMRs, the best-fit MethylCal model is the simple random-effects model $RE_{ij} = AMP_i$; for the PLAGL1 assay, the crossed random effects model with random intercepts and random slopes, $RE_{ij} = AMP_i + CpG_j + CpG_j^{*}\, x_{ij}$, is selected to fit the increasing variability of the apparent level of methylation after PCR across CpGs at higher AMPs. Figure 12 presents the results of the differential calibrated analysis. Interestingly, the higher sensitivity of the calibration curve obtained by adding a technical replicate allows a different classification of patient B5B37 (Fig. 12b) as undergoing loss of methylation in KCNQ1OT1 DMR compared to the single calibration curve


Fig. 12 Corrected methylation degree of (a, b) KCNQ1OT1, (c, d) H19/IGF2, and (e, f) PLAGL1 DMRs in healthy controls (top panels) and alleged Beckwith–Wiedemann syndrome patients (bottom panels) using MethylCal. For each individual (x-axis), the boxplot depicts the range and the mean (circle) of the corrected methylation degree (y-axis) across CpGs, while the dashed-dotted gray lines show the healthy controls’ confidence interval. Top red triangles in bottom panels indicate patients classified as having undergone gain/loss of methylation. All panels have been generated using MethylCal’s R package. For example, (a) and (b) are obtained by typing in the R command line: MethylCalCorrection(BWS_data, Target = "KCNQ1OT1", n_Control = 15, n_Case = 18)

presented in Ochoa et al. [13], where the same patient was considered as undergoing gain of methylation in H19/IGF2 DMR. This final example highlights the importance of independent calibration replicates for quantifying the degree of bias introduced by PCR, since they can reduce the variability associated with the preparation of different control DNA ratios, which in turn can have a large impact on the derivation of the corrected methylation degree and on the classification of borderline cases.
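The 3SD classification rule used in both applications can be sketched as follows. This is a minimal Python illustration and not the code behind MethylCalCorrection: it takes each individual's mean corrected methylation across CpGs, builds the interval of mean plus or minus 3 SD from the healthy controls, and flags the cases that fall outside it. The individual labels and all numbers are invented, and the use of the sample standard deviation is an assumption.

```python
import math
import statistics


def classify_cases(control_means, case_means, n_sd=3.0):
    """Label each case as 'gain', 'loss', or 'normal' relative to controls."""
    center = statistics.mean(control_means)
    spread = statistics.stdev(control_means)  # sample SD (assumption)
    lower, upper = center - n_sd * spread, center + n_sd * spread
    labels = {}
    for case_id, value in case_means.items():
        if value > upper:
            labels[case_id] = "gain"
        elif value < lower:
            labels[case_id] = "loss"
        else:
            labels[case_id] = "normal"
    return (lower, upper), labels


# Two-sided tail probability of a normal variable beyond 3 SD.
alpha = math.erfc(3.0 / math.sqrt(2.0))

# Hypothetical mean corrected methylation (%) per individual.
controls = [48.0, 50.0, 52.0, 49.0, 51.0]
cases = {"P1": 50.5, "P2": 57.0, "P3": 43.9}

interval, labels = classify_cases(controls, cases)
print(f"alpha = {alpha:.4f}, interval = ({interval[0]:.2f}, {interval[1]:.2f})")
print(labels)
```

The tail probability computed here corresponds to the type-I error quoted for the 3SD interval under a normality assumption; with few controls the estimated interval is itself uncertain, which is one reason borderline cases are hard to call.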

4 Notes

1. MethylCal’s R functions automatically detect from the input data whether independent control replicates have been included in the analysis. However, they assume that the same levels of actual methylation are used across all technical replicates. Different sets of CpGs are allowed across technical replicates, although MethylCal’s R functions consider only the CpGs that are in common.


2. The automatic detection of outliers may provide information about possible factors that cause the observed bias. While MethylCal is robust to outliers (see calibration curves in Fig. 8a, c), they are not removed automatically from the analysis. The corrected methylation degree in Fig. 9b, d is obtained using all standard controls. If CpGs are manually removed from the standard controls, they should likewise be excluded from the observed methylation levels.

3. When technical replicates are included in the derivation of the calibration curves, the automatic detection of possible outliers has not been implemented in MethylCal’s R package. Instead, we suggest checking for outliers independently in each control replicate and marking observations as outliers if they are detected in more than half of the technical replicates. As in Note 2, they can be removed manually from the technical replicates and from the observed methylation levels before running MethylCal.

4. In MethylCal’s formulation (Eq. 1), we made the assumption that the error term is normally distributed. While this assumption seems to be reasonable in the pyrosequencing real data example (continuous observations, although both $y_{ijk}$ and $y_j^{\mathrm{obs}}$ are restricted to $[0\%, 100\%]$), in the second example with amplicon-based bisulfite sequencing, we obtained beta intensities by taking the ratio between the read counts and the read depth for each CpG in the target regions. The same strategy could be employed for WGBS and RRBS data, although a better approach would be to use a negative-binomial likelihood. However, in count data regression models, the inclusion of crossed or nested random effects or a latent Gaussian field may produce biased estimates and overly narrow confidence intervals [38] when posterior inference is performed by using numerical integration or Laplace approximation [25].
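The suggestion in Note 3 amounts to a majority vote across control replicates, and the beta intensities in Note 4 are simple count ratios. A minimal Python sketch of both follows; it is not part of MethylCal's R package. The function names, the (AMP, CpG) keys, and the example flags are hypothetical, and the outlier flags are assumed to have been obtained beforehand by inspecting each replicate separately.

```python
from collections import Counter


def majority_outliers(flags_per_replicate):
    """Observations flagged as outliers in more than half of the replicates.

    flags_per_replicate: one set per technical replicate, each holding the
    (AMP, CpG) pairs flagged as outliers in that replicate.
    """
    n_replicates = len(flags_per_replicate)
    counts = Counter(obs for flags in flags_per_replicate for obs in flags)
    return {obs for obs, n in counts.items() if n > n_replicates / 2}


def beta_value(methylated_reads, read_depth):
    """Methylation level (%) as methylated read count over read depth."""
    if read_depth <= 0:
        raise ValueError("read depth must be positive")
    return 100.0 * methylated_reads / read_depth


# Hypothetical flags from three control replicates: (AMP in %, CpG index).
flags = [
    {(75, 12), (100, 12), (50, 3)},
    {(75, 12), (100, 12)},
    {(100, 12)},
]
consensus = majority_outliers(flags)
print(sorted(consensus))
print(beta_value(180, 400))
```

Whatever is flagged by the vote would then be removed by hand from both the control replicates and the observed methylation levels, as the notes recommend.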

5 Conclusions

DNA methylation is an epigenetic mechanism responsible for multiple biological processes, including genomic imprinting, X-inactivation, tissue-specific gene expression, and transgenerational effects. Accumulating evidence also indicates a critical role for DNA methylation in human diseases [39] such as rheumatoid arthritis, obesity, and autism [40–42]. Interest in the study of DNA methylation changes and their relationship with disease is constantly growing. However, the available technologies and methodologies are still evolving, and for now the analysis of DNA methylation requires a deep understanding of their limitations and possible solutions.


Acknowledgments

Alan Turing Institute under the Engineering and Physical Sciences Research Council [EP/N510129/1 to L.B.]. The authors thank Jose Ramon Bilbao for providing the celiac data as well as Nora Fernandez-Jimenez and Eamonn Maher for useful discussion regarding the results of the celiac data and mosaic Beckwith–Wiedemann syndrome.

Conflict of Interest

The views expressed are those of the authors and not necessarily those of the NHS or Department of Health.

References

1. Laird PW (2003) The power and the promise of DNA methylation markers. Nat Rev Cancer 3:253–266. https://doi.org/10.1038/nrc1045
2. Bock C (2009) Epigenetic biomarker development. Epigenomics 1:99–110. https://doi.org/10.2217/epi.09.6
3. Shapiro R, Weisgras JM (1970) Bisulfite-catalyzed transamination of cytosine and cytidine. Biochem Biophys Res Commun 40:839–843. https://doi.org/10.1016/0006-291X(70)90979-4
4. Hayatsu H, Wataya Y, Kai K (1970) The addition of sodium bisulfite to uracil and to cytosine. J Am Chem Soc 92:724–726. https://doi.org/10.1021/ja00706a062
5. Wang RYH, Gehrke CW, Ehrlich M (1980) Comparison of bisulfite modification of 5-methyldeoxycytidine and deoxycytidine residues. Nucleic Acids Res 8:4777–4790. https://doi.org/10.1093/nar/8.20.4777
6. Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, Molloy PL, Paul CL (1992) A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci U S A 89:1827–1831. https://doi.org/10.1073/pnas.89.5.1827
7. Yu H, Hahn Y, Yang I (2015) Reference materials for calibration of analytical biases in quantification of DNA methylation. PLoS One 10:8–19. https://doi.org/10.1371/journal.pone.0137006
8. Masser DR, Berg AS, Freeman WM (2013) Focused, high accuracy 5-methylcytosine quantitation with base resolution by benchtop next-generation sequencing. Epigenetics Chromatin 6:1. https://doi.org/10.1186/1756-8935-6-33
9. Warnecke PM, Stirzaker C, Melki JR, Millar DS, Paul CL, Clark SJ (1997) Detection and measurement of PCR bias in quantitative methylation analysis of bisulphite-treated DNA. Nucleic Acids Res 25:4422–4426. https://doi.org/10.1093/nar/25.21.4422
10. Wojdacz TK, Borgbo T, Hansen LL (2009) Primer design versus PCR bias in methylation independent PCR amplifications. Epigenetics 4:231–234. https://doi.org/10.4161/epi.9020
11. Wojdacz TK, Hansen LL, Dobrovic A (2008) A new approach to primer design for the control of PCR bias in methylation studies. BMC Res Notes 1:3–5. https://doi.org/10.1186/1756-0500-1-54
12. Moskalev EA, Zavgorodnij MG, Majorova SP, Vorobjev IA, Jandaghi P, Bure IV, Hoheisel JD (2011) Correction of PCR-bias in quantitative DNA methylation studies by means of cubic polynomial regression. Nucleic Acids Res 39. https://doi.org/10.1093/nar/gkr213
13. Ochoa E, Zuber V, Fernandez-Jimenez N, Bilbao JR, Clark GR, Maher ER, Bottolo L (2019) MethylCal: Bayesian calibration of methylation levels. Nucleic Acids Res 47:e81. https://doi.org/10.1093/nar/gkz325
14. Rother KI, Silke J, Georgiev O, Schaffner W, Matsuo K (1995) Influence of DNA sequence and methylation status on bisulfite conversion of cytosine residues. Anal Biochem 231:263–265. https://doi.org/10.1006/abio.1995.1530
15. Zhao J, Bacolla A, Wang G, Vasquez KM (2010) Non-B DNA structure-induced genetic instability and evolution. Cell Mol Life Sci 67:43–62. https://doi.org/10.1007/s00018-009-0131-2
16. Clark J, Smith SS (2008) Secondary structure at a hot spot for DNA methylation in DNA from human breast cancers. Cancer Genomics Proteomics 5:241–252
17. Bacolla A, Cooper DN, Vasquez KM, Tainer JA (2018) Non-B DNA structure and mutations causing human genetic disease. In: eLS. Wiley, pp 1–15
18. Bacolla A, Wells RD (2009) Non-B DNA conformations as determinants of mutagenesis and human disease. Mol Carcinog 48:273–285. https://doi.org/10.1002/mc.20507
19. Walter J, Hutter B, Khare T, Paulsen M (2006) Repetitive elements in imprinted genes. Cytogenet Genome Res 113:109–115. https://doi.org/10.1159/000090821
20. Ziller MJ, Gu H, Müller F, Donaghey J, Tsai LTY, Kohlbacher O, De Jager PL, Rosen ED, Bennett DA, Bernstein BE, Gnirke A, Meissner A (2013) Charting a dynamic DNA methylation landscape of the human genome. Nature 500:477–481. https://doi.org/10.1038/nature12433
21. Zhou L, Ng HK, Drautz-Moses DI, Schuster SC, Beck S, Kim C, Chambers JC, Loh M (2019) Systematic evaluation of library preparation methods and sequencing platforms for high-throughput whole genome bisulfite sequencing. Sci Rep 9:1–16. https://doi.org/10.1038/s41598-019-46875-5
22. Ziller MJ, Hansen KD, Meissner A, Aryee MJ (2015) Coverage recommendations for methylation analysis by whole-genome bisulfite sequencing. Nat Methods 12:230–232. https://doi.org/10.1038/nmeth.3152
23. Zhang W, Spector TD, Deloukas P, Bell JT, Engelhardt BE (2015) Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements. Genome Biol 16. https://doi.org/10.1186/s13059-015-0581-9
24. (2006) Linear mixed-effects models: basic concepts and examples. In: Mixed-effects models in S and S-PLUS. Springer-Verlag, pp 3–56
25. Rue H, Martino S, Chopin N (2009) Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Ser B Stat Methodol 71:319–392. https://doi.org/10.1111/j.1467-9868.2008.00700.x
26. Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A (2002) Bayesian measures of model complexity and fit. J R Stat Soc Ser B Stat Methodol 64:583–616. https://doi.org/10.1111/1467-9868.00353
27. Monk D, Mackay DJG, Eggermann T, Maher ER, Riccio A (2019) Genomic imprinting disorders: lessons on how genome, epigenome and environment interact. Nat Rev Genet 20:235–248. https://doi.org/10.1038/s41576-018-0092-0
28. Claus R, Lucas DM, Ruppert AS, Williams KE, Weng D, Patterson K, Zucknick M, Oakes CC, Rassenti LZ, Greaves AW, Geyer S, Wierda WG, Brown JR, Gribben JG, Barrientos JC, Rai KR, Kay NE, Kipps TJ, Shields P, Zhao W, Grever MR, Plass C, Byrd JC (2014) Validation of ZAP-70 methylation and its relative significance in predicting outcome in chronic lymphocytic leukemia. Blood 124:42–48. https://doi.org/10.1182/blood-2014-02-555722
29. DeVos T, Tetzner R, Model F, Weiss G, Schuster M, Distler J, Steiger KV, Grützmann R, Pilarsky C, Habermann JK, Fleshner PR, Oubre BM, Day R, Sledziewski AZ, Lofton-Day C (2009) Circulating methylated SEPT9 DNA in plasma is a biomarker for colorectal cancer. Clin Chem 55:1337–1346. https://doi.org/10.1373/clinchem.2008.115808
30. Giovannetti E, Codacci-Pisanelli G, Peters GJ (2012) TFAP2E-DKK4 and chemoresistance in colorectal cancer. N Engl J Med 366:966. https://doi.org/10.1056/NEJMc1201170
31. Hegi ME, Diserens AC, Gorlia T, Hamou MF, De Tribolet N, Weller M, Kros JM, Hainfellner JA, Mason W, Mariani L, Bromberg JEC, Hau P, Mirimanoff RO, Cairncross JG, Janzer RC, Stupp R (2005) MGMT gene silencing and benefit from temozolomide in glioblastoma. N Engl J Med 352:997–1003. https://doi.org/10.1056/NEJMoa043331
32. Queirós AC, Villamor N, Clot G, Martinez-Trillos A, Kulis M, Navarro A, Penas EMM, Jayne S, Majid A, Richter J, Bergmann AK, Kolarova J, Royo C, Russiñol N, Castellano G, Pinyol M, Bea S, Salaverria I, López-Guerra M, Colomer D, Aymerich M, Rozman M, Delgado J, Giné E, González-Díaz M, Puente XS, Siebert R, Dyer MJS, López-Otín C, Rozman C, Campo E, López-Guillermo A, Martín-Subero JI (2015) A B-cell epigenetic signature defines three biologic subgroups of chronic lymphocytic leukemia with clinical impact. Leukemia 29:598–605. https://doi.org/10.1038/leu.2014.252
33. Papageorgiou EA, Karagrigoriou A, Tsaliki E, Velissariou V, Carter NP, Patsalis PC (2011) Fetal-specific DNA methylation ratio permits noninvasive prenatal diagnosis of trisomy 21. Nat Med 17:510–513. https://doi.org/10.1038/nm.2312
34. McDermott E, Ryan EJ, Tosetto M, Gibson D, Burrage J, Keegan D, Byrne K, Crowe E, Sexton G, Malone K, Harris RA, Kellermayer R, Mill J, Cullen G, Doherty GA, Mulcahy H, Murphy TM (2016) DNA methylation profiling in inflammatory bowel disease provides new insights into disease pathogenesis. J Crohns Colitis 10:77–86. https://doi.org/10.1093/ecco-jcc/jjv176
35. Imgenberg-Kreuz J, Almlöf JC, Leonard D, Sjöwall C, Syvänen AC, Rönnblom L, Sandling JK, Nordmark G (2019) Shared and unique patterns of DNA methylation in systemic lupus erythematosus and primary Sjögren’s syndrome. Front Immunol 10:1686. https://doi.org/10.3389/fimmu.2019.01686
36. Fernandez-Jimenez N, Garcia-Etxebarria K, Plaza-Izurieta L, Romero-Garmendia I, Jauregi-Miguel A, Legarda M, Ecsedi S, Castellanos-Rubio A, Cahais V, Cuenin C, Degli Esposti D, Irastorza I, Hernandez-Vargas H, Herceg Z, Bilbao JR (2019) The methylome of the celiac intestinal epithelium harbours genotype-independent alterations in the HLA region. Sci Rep 9:1–13. https://doi.org/10.1038/s41598-018-37746-6
37. Maher ER, Reik W (2000) Beckwith-Wiedemann syndrome: imprinting in clusters revisited. J Clin Invest 105:247–252
38. Lea AJ, Tung J, Zhou X (2015) A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 11(11):e1005650. https://doi.org/10.1371/journal.pgen.1005650
39. Jin Z, Liu Y (2018) DNA methylation in human diseases. Genes Dis 5:1–8. https://doi.org/10.1016/j.gendis.2018.01.002
40. Liu Y, Aryee MJ, Padyukov L, Fallin MD, Hesselberg E, Runarsson A, Reinius L, Acevedo N, Taub M, Ronninger M, Shchetynsky K, Scheynius A, Kere J, Alfredsson L, Klareskog L, Ekström TJ, Feinberg AP (2013) Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat Biotechnol 31:142–147. https://doi.org/10.1038/nbt.2487
41. Dick KJ, Nelson CP, Tsaprouni L, Sandling JK, Aïssi D, Wahl S, Meduri E, Morange PE, Gagnon F, Grallert H, Waldenberger M, Peters A, Erdmann J, Hengstenberg C, Cambien F, Goodall AH, Ouwehand WH, Schunkert H, Thompson JR, Spector TD, Gieger C, Trégouët DA, Deloukas P, Samani NJ (2014) DNA methylation and body-mass index: a genome-wide analysis. Lancet 383:1990–1998. https://doi.org/10.1016/S0140-6736(13)62674-4
42. Andrews SV, Ellis SE, Bakulski KM, Sheppard B, Croen LA, Hertz-Picciotto I, Newschaffer CJ, Feinberg AP, Arking DE, Ladd-Acosta C, Fallin MD (2017) Cross-tissue integration of genetic and epigenetic data offers insight into autism spectrum disorder. Nat Commun 8:1–10. https://doi.org/10.1038/s41467-017-00868-y

Chapter 4

Using R for Cell-Type Composition Imputation in Epigenome-Wide Association Studies

Chong Wu

Abstract

Adjusting for cell-type composition is challenging but critical in epigenome-wide association studies (EWAS). In this chapter, we describe how to apply reference-based and reference-free methods in R to impute cell-type composition in whole blood samples.

Key words Cell-type heterogeneity, DNA methylation, Reference-based method, Reference-free method

1 Introduction

DNA methylation (DNAm) is a widely studied epigenetic mechanism. It often results in repression of gene expression and has been shown to change over time and in response to environmental factors, and also to be altered by diseases. Epigenome-wide association studies (EWASs), which interrogate DNAm changes across the genome for association with a phenotype of interest, such as disease status or risk factors, have become increasingly popular. However, EWAS is often hampered by the complex nature of the tissues (such as whole blood) in which DNAm is measured. DNAm level is often cell-type specific [1]; failing to adjust for cell-type composition may result in spurious findings because cell-type composition is often a confounder of the relationship between DNAm and phenotype. For example, rheumatoid arthritis (RA) cases and controls typically differ in their cell-type composition; ignoring the difference in granulocyte-to-lymphocyte ratio between cases and controls resulted in many false positives [2]. Even when variation in cell-type composition is unrelated (or orthogonal) to the phenotype of interest, not adjusting for cell-type composition may still result in loss of statistical power. For instance, due to a relatively small

Weihua Guan (ed.), Epigenome-Wide Association Studies: Methods and Protocols, Methods in Molecular Biology, vol. 2432, https://doi.org/10.1007/978-1-0716-1994-0_4, © Springer Science+Business Media, LLC, part of Springer Nature 2022


Table 1 Summary of reference-based and reference-free methods for adjusting cell-type heterogeneity in EWAS

| Method | Tissues applied | Adjusts for other confounders | Availability | Package | References |
|---|---|---|---|---|---|
| Reference-based methods | | | | | |
| Projection-based method | Whole blood, PBMC, cord blood, breast | No | R | minfi (Available on Bioconductor) | [4] |
| VI + MI | Whole blood | Yes, if references for other confounders are provided | R | https://github.com/ChongWu-Biostat/MethyImpute | [5] |
| CIBERSORT | Whole blood, PBMC, breast | No | R, JAVA, Online tool | https://cibersort.stanford.edu/ | [6] |
| Reference-free methods | | | | | |
| RefFreeEWAS | Any | Yes | R | RefFreeEWAS (Available on CRAN) | [7] |
| ReFACTor | Whole blood | Yes | R, Python | www.cs.tau.ac.il/~heran/cozygene/software/refactor.html | [8] |
| EWASher | Whole blood | Yes | R | FaST-LMM-EWASher (Available on Microsoft) | [9] |
| SVA | Any | Yes | R | SVA (Available on Bioconductor) | [10] |
| ISVA | Any | Yes | R | isva (Available on CRAN) | [11] |

Note: PBMC, peripheral blood mononuclear cell

sample size and failure to adjust for cell-type composition, researchers could not identify many established smoking-associated DNAm sites in blood [3].

Many statistical methods have been proposed to adjust for cell-type heterogeneity in EWAS. Table 1 summarizes some existing methods, categorized into two groups: reference-based and reference-free methods. Reference-based methods typically use pre-measured DNAm reference profiles for each cell or tissue type, identify CpG sites that are differentially methylated across cell types, and impute the unmeasured cell-type composition in a new set of samples. Generally speaking, the imputation can be done either by projecting a sample's DNAm profile onto these reference profiles [4] or by applying a modified multiple imputation method [5]. Although


reference-based methods have been widely used in practice and work well on whole blood samples [4, 5], one major disadvantage is that they require appropriate reference DNAm profiles for the underlying cell types, which might be unknown or challenging to obtain. Reference-free methods have been proposed to partially overcome this limitation. For example, EWASher [9] and ReFACTor [8] adjust for cell-type composition by assuming that cell-type composition drives the top components of variation in the DNAm data. Although this assumption is generally valid, some true biological signals may be removed in the process, leading to overadjustment. This type of method can be extended to adjust for any type of confounder (not only cell-type composition) as long as it contributes to variation in DNA methylation across multiple CpG sites. A comprehensive review of reference-based and reference-free methods can be found in Teschendorff and Zheng [12].

In this chapter, to help researchers adjust for cell-type composition in their own studies, we describe how to implement widely used reference-based and reference-free methods in R with examples.
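The two ideas above can be made concrete with a small toy simulation in base R. This is only an illustrative sketch, not the implementation used by any package in Table 1. All names (ref_a, ref_b, prop_a, beta) and all numeric settings are hypothetical. The constrained least squares step uses optim() with box constraints as a simple stand-in for a quadratic programming solver, and it constrains only non-negativity.

```r
# Toy simulation; every value below is hypothetical and chosen for illustration.
set.seed(1)
n_samples <- 200
n_cpgs <- 500

# Hypothetical cell-type-specific mean methylation (beta scale) at each CpG
ref_a <- runif(n_cpgs, 0.1, 0.9)
ref_b <- runif(n_cpgs, 0.1, 0.9)

# Proportion of cell type A in each sample; cell type B is the remainder
prop_a <- runif(n_samples, 0.3, 0.7)

# Mixture model: bulk methylation is a weighted average of the two
# cell-type profiles plus a small amount of measurement noise
signal <- outer(prop_a, ref_a) + outer(1 - prop_a, ref_b)
noise <- matrix(rnorm(n_samples * n_cpgs, sd = 0.02), n_samples, n_cpgs)
beta <- signal + noise  # samples in rows, CpGs in columns

# Reference-free idea: the top principal component of the DNAm matrix
# should track cell-type composition when composition drives the variation
pca <- prcomp(beta, center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]
abs_cor <- abs(cor(pc1, prop_a))

# Reference-based idea: project one sample onto the reference profiles by
# least squares with non-negative weights (stand-in for quadratic programming)
ref <- cbind(A = ref_a, B = ref_b)
loss <- function(w) sum((beta[1, ] - ref %*% w)^2)
fit <- optim(c(0.5, 0.5), loss, method = "L-BFGS-B", lower = 0)
est_prop_a <- fit$par[1]

print(abs_cor)                                   # PC1 vs. true proportion
print(c(true = prop_a[1], estimated = est_prop_a))
```

In this toy setting cell-type composition is the only systematic source of variation, so both approaches recover it closely. In real data the top components may also carry phenotype-related signal, which is the overadjustment risk noted above.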

2 Reference-Based Methods

Because CpG sites that are differentially methylated across cell types can distinguish cell types with high sensitivity and specificity, one popular reference-based method [4] infers cell-type composition from an existing DNAm reference data set. This method projects the target DNAm data (a heterogeneous mixture of cells) onto reference DNAm data from purified (isolated) cell types by quadratic programming. Specifically, it applies least squares minimization under the constraint that cell-type proportions cannot be negative. This algorithm has been used extensively in practice and has been shown to work well on whole blood [4, 5]. Here, we demonstrate this method using the R package minfi. minfi provides a function, estimateCellCounts(), to estimate the cell-type composition. This function takes the raw DNAm data (in RGChannelSet format) as input and returns the estimated cell-type proportions for each sample.

library(minfi)
library(minfiData)
# get the raw DNAm data
baseDir