Hi-C Data Analysis: Methods and Protocols (Methods in Molecular Biology, 2301) 1071613898, 9781071613894

This volume details a comprehensive set of methods and tools for Hi-C data processing, analysis, and interpretation. Cha


English Pages 367 [355] Year 2021


Table of contents :
Preface
Contents
Contributors
Chapter 1: Normalization of Chromosome Contact Maps: Matrix Balancing and Visualization
1 Introduction
2 Materials
2.1 Hardware
2.2 Software
3 Methods
3.1 Alignment of Read Pairs
3.2 Filtering of Non-informative Events
3.3 Iterative Procedure to Balance the Signal
3.4 Scalogram: Alternative Visualization Tool for Normalized Contact Data
4 Notes
References
Chapter 2: Methods to Assess the Reproducibility and Similarity of Hi-C Data
1 Introduction
1.1 Similarity Measures for Hi-C Data
1.1.1 HiCRep
1.1.2 GenomeDISCO
1.1.3 HiC-Spector
2 Data Sets
2.1 Description of the Example Data
2.2 Download the Data
2.3 Data Formats for Contact Maps
2.4 Converting .bam Files into Matrix Formats
3 Assessing Reproducibility Scores
3.1 HiCRep
3.1.1 System Requirements and Installation
3.1.2 Input Format for Contact Maps
3.1.3 Compute Similarity Score Using HiCRep
3.1.4 Demo Code
3.1.5 Output
3.2 HiC-Spector
3.2.1 System Requirements and Installation
3.2.2 Input Format for Contact Maps
3.2.3 Compute Reproducibility Score Using HiC-Spector
3.2.4 Parameters in HiC-Spector
3.2.5 Demo Code
3.2.6 Output
3.3 GenomeDISCO
3.3.1 System Requirement and Installation
3.3.2 Input Format for Contact Maps
3.3.3 Other Required Input Files
3.3.4 Parameters in GenomeDISCO
3.3.5 Compute the Reproducibility Score Using GenomeDISCO
3.3.6 Output
4 Notes
References
Chapter 3: Methods for the Analysis of Topologically Associating Domains (TADs)
1 Introduction
1.1 The 3D Genome and Topologically Associating Domains (TADs)
1.2 Detecting TADs in Hi-C Maps
1.3 Comparing and Assessing the "Quality" of Genomic Partitions
2 Materials
2.1 Required Resources
2.1.1 Hardware
2.1.2 Software
2.2 Datasets
3 Methods
3.1 TAD Calling
3.1.1 TopDom
Input format
Parameters
Running TopDom
Output format
3.1.2 CaTCH
Input format
Parameters
Running CaTCH
Output format
3.1.3 Arrowhead
Input format
Parameters
Running Arrowhead
Output format
3.1.4 HiCseg
Input format
Parameters
Running HiCseg
Output format
3.2 Comparison of TAD Partitions
3.2.1 The Measure of Concordance
3.2.2 Enrichment of Structural Proteins at TAD Boundaries
3.2.3 Enrichment of either Activating or Repressing Histone Marks within TADs
4 Notes
References
Chapter 4: Methods for the Differential Analysis of Hi-C Data
1 Introduction
1.1 What We Can C
1.2 From Contacts to Bytes
2 Materials
2.1 Necessary Resources
2.1.1 Hardware
2.1.2 Software
2.2 Data Sets
2.2.1 Description of the Example Data
2.2.2 Data Download
2.2.3 Data Conversion
3 Methods
3.1 Differential Interactions Analysis with no Replicates
3.1.1 HiCCUPS Diff
3.1.2 Selfish
3.2 Differential Interaction Analysis with Replicates
3.2.1 diffHic
3.2.2 multiHiCcompare
3.3 Results Visualization
3.4 Integration with Gene Expression
4 Notes
References
Chapter 5: Visualizing and Annotating Hi-C Data
1 Introduction
2 Materials
2.1 Required Resources
2.1.1 Hardware Requirements
2.1.2 Software Requirements
2.1.3 Installation Commands
Homebrew and Linuxbrew
Installing R
Installing R with brew
Installing Renv and R from Source
Installing R Packages
2.2 Datasets
2.2.1 Example Hi-C Dataset
2.2.2 Example Histone Marks ChIP-Seq Data
3 Methods
3.1 Importing Hi-C Data into R
3.2 Visualizing Hi-C Contact Maps
3.2.1 Selecting the Region to Visualize
3.2.2 Standard Genomic Loci Based Heatmaps
3.2.3 Rotated Distance-Based Heatmaps
3.2.4 Annotating Hi-C Data with Domain Information
Calling Chromatin Domains
Filter Domains for Non-informative Regions
Annotate the Contact Maps with Domains
3.2.5 Annotate Contact Maps with Histone Marks and TADs
Fetch the Bin Positions of the Region of Interest
Import Histone Marks
Summarize Histone Marks Data at the Same Resolution as Hi-C Data
Convert Histone Marks Track to Coordinates
Assemble the Complete Plot
3.3 Hi-C Interaction Decay Analysis
3.3.1 Retrieve average interactions for each distance
3.3.2 Plot Interaction Decay Profile
3.4 Aggregated Hi-C Contact Profiles
3.4.1 Average Domain
Identify Domains across all Chromosomes
Compute Expected Values
Rescale Sub-Matrices to a Fixed Dimension
Plot the Average Profile
3.4.2 Paired-End Spatial Chromatin Analysis
Fetch the Domain Boundary Bins
Create and Filter Boundary Pair List
Fetch All Sub-Matrices
Plot the Average Profile
4 Notes
References
Chapter 6: Hi-C Data Formats
1 Introduction
2 Data Formats Per Processing Steps
2.1 Formats for Alignments
2.2 Formats for a Contact List
2.3 Formats for a Contact Matrix
2.3.1 The .hic Format
2.3.2 The .cool and .mcool Formats
2.3.3 Other Formats
2.4 Loops, Domains, and Compartments
2.4.1 Calls
2.4.2 Profiles
2.5 Future Direction
3 Notes
References
Chapter 7: Analysis of Hi-C Data for Discovery of Structural Variations in Cancer
1 Introduction
2 Materials
2.1 Computation Resource Requirement
2.1.1 System Requirement
2.1.2 Software Requirement
2.2 Tool Installation
2.2.1 Install BWA
2.2.2 Install Samtools
2.2.3 Install Conda
2.2.4 Install Pairtools
2.2.5 Install Cooler
2.2.6 Install HiGlass
2.2.7 Install Hi-C Breakfinder
2.3 Example Data Sets
3 Methods
3.1 Overview of the SV Discovery Pipeline Using Hi-C Breakfinder
3.2 Hi-C Read Alignment
3.3 SV Discovery with Hi-C Breakfinder
3.4 Visualization of SVs on Normal Hi-C Maps
3.5 Visualization of SVs on Reconstructed Hi-C Maps
4 Notes
Appendix Python script for making reconstructed Hi-C map at SV loci
References
Chapter 8: Metagenomes Binning Using Proximity-Ligation Data
1 Introduction
2 Materials
2.1 Required Resources
2.1.1 Hardware
2.1.2 Software
3 Methods
3.1 Preliminary Assembly
3.2 Contigs Network of Interactions
3.2.1 Mapping Reads along the Metagenome Assembly
3.2.2 Computation of Contigs Data
3.2.3 Filtering Out of Non-informative Contacts and Construction of the Network
3.3 Partitioning of the Network
3.3.1 Louvain Iterative Procedure
3.3.2 Louvain Data Treatment
3.3.3 Subnetwork Extraction
3.3.4 Louvain Recursive Procedure on Subnetwork
3.4 Downstream Analysis
4 Notes
References
Chapter 9: Generating High-Resolution Hi-C Contact Maps of Bacteria
1 Introduction
2 Materials
2.1 Equipment
2.2 Consumables for Hi-C Library Preparation
2.3 Consumables for Sequencing Library Preparation
3 Methods
3.1 Generation of a Bacterial Hi-C Library
3.1.1 Cell Fixation
3.1.2 Hi-C Library Construction
3.1.3 Reverse Cross-Linking and DNA Purification
3.2 Preparation of Hi-C Sequencing Libraries
3.2.1 DNA Sonication and Size Selection
3.2.2 Biotin Pull-Down
3.2.3 End-Repair
3.2.4 A-Tailing
Adapter Ligation
3.2.5 Library Amplification by PCR
4 Notes
References
Chapter 10: Computational Tools for the Multiscale Analysis of Hi-C Data in Bacterial Chromosomes
1 Introduction
2 Methods
2.1 Using Our Framework
2.2 Input Data
2.3 Matrix-Based Approach Using numpy Arrays
2.4 Estimating the Contact Law P(s)
2.5 Highlighting the Frontiers of a Hi-C Heat Map
2.6 Frontier Indexes
2.6.1 Estimating a p-Value
3 Notes
References
Chapter 11: Analysis of HiChIP Data
1 Introduction
2 Materials
2.1 Required Resources
2.1.1 Hardware
2.1.2 Software
2.2 Dataset
2.2.1 Example Dataset: Generalities
2.2.2 Downloading HiChIP Data
2.2.3 Downloading Additional Data
3 Methods
3.1 Preprocessing of Raw Data with HiC-Pro
3.1.1 Input and Configuration Files
3.1.2 Alignment
3.1.3 Filtering
3.2 Identification of Loops
3.2.1 ChIP-Seq Peak Calling
3.2.2 Identification of HiChIP Loops
3.2.3 Calling Differential Loops
3.3 Loop Visualization
3.3.1 Plotting Loops with Diffloop
3.3.2 Exploring Interactions with WashU Epigenome Browser
4 Notes
References
Chapter 12: The Physical Behavior of Interphase Chromosomes: Polymer Theory and Coarse-Grain Computer Simulations
1 On a Few Fundamental Facts Concerning Chromosome Folding
2 Polymer Physics "in a Nutshell": Theory and Simulation
2.1 Theory
2.2 Simulation
2.2.1 Monte Carlo Methods: Metropolis Algorithm
2.2.2 Molecular Dynamic Methods
3 Untangled Ring Polymers in Melt and the Physical Modeling of Interphase Chromosomes
3.1 Ring Polymers in Melt as a Model for Interphase Chromosomes: A Biological Basis
3.2 Physics of Ring Polymers in Melt: Models and Methods
3.3 Constructing Double-Folded, Randomly Branched Ring Polymers in Melt: An Efficient Monte Carlo/Molecular Dynamic Multi-scal...
3.4 Model Predictions and Comparison to Experimental Results
4 Conclusions and Outlook
References
Chapter 13: Polymer Folding Simulations from Hi-C Data
1 Introduction
2 Methods
2.1 The Polymeric Model
2.2 Determination of the Effective Potential
2.3 Analysis of the Results
3 Notes
References
Chapter 14: Predictive Polymer Models for 3D Chromosome Organization
1 Introduction
2 Mechanisms Driving Chromosome Organization
2.1 Diffusing Transcription Factor Model: Bridging-Induced Phase Separation
2.2 The Loop Extrusion Model: Cohesin-Mediated Domains
2.3 Block Copolymer Models: Chromatin Self-Interactions and A/B Compartments
3 The HiP-HoP Model
3.1 Simulation Methods
3.2 Simulation Input Data
4 HiP-HoP Simulation Results
4.1 HiP-HoP at Specific Gene Loci
4.2 HiP-HoP at the Chromosome Scale
5 Discussion and Future Outlook
6 Notes
References
Chapter 15: Polymer Modeling of 3D Epigenome Folding: Application to Drosophila
1 Introduction
2 Methods
2.1 Coarse-Grained Polymer Model for Chromosome
2.2 Block Copolymer Model for Epigenome Folding
2.3 Lattice Model and Numerical Simulations
2.4 Comparison with Experiments
3 Notes
References
Chapter 16: A Polymer Physics Model to Dissect Genome Organization in Healthy and Pathological Phenotypes
1 Introduction
2 SBS Polymer Model and PRISMR Inference Method
3 Predicting the Impact of SVs on the EPHA4 Locus Architecture
4 Pitx1 Regulation Is Explained by Tissue-Specific 3D Conformations
5 Discussion
References
Chapter 17: The 3D Organization of Chromatin Colors in Mammalian Nuclei
1 Introduction
2 Materials
2.1 Required Resources
2.1.1 Datasets
2.1.2 File Formats
2.1.3 Software
3 Methods
3.1 Overview of the Results
3.2 3D Reconstruction in MATLAB
3.3 3D Annotation in MATLAB
3.4 3D Reconstruction and Epigenetic Annotation in Python
3.4.1 Main.py
3.4.2 HiCtoolbox.py
4 Notes
References
Chapter 18: Modeling the 3D Genome Using Hi-C and Nuclear Lamin-Genome Contacts
1 Introduction
2 Materials
2.1 Required Software Needed to Run the Pipeline
2.2 Preparations Prior to Running the Pipeline
2.2.1 Setting up HiC-Pro
2.2.2 Downloading and Organizing the Required Hi-C Data
2.2.3 Downloading and Organizing the Required LAD Data
2.2.4 Downloading and Installing the Required Processing Scripts
3 Methods
3.1 Run HiC-Pro to Process Hi-C Data
3.2 Run Armatus to Call TADs
3.3 Convert Called TADs into a Segmented Genome to Define Chrom3D Beads
3.4 Computing Inter-Bead Contact/Interaction Frequencies
3.5 Identifying Statistically Significant Inter-Bead Interactions within Chromosomes
3.6 Identifying Statistically Significant Inter-Bead Interactions between Chromosomes
3.7 Creating a Model Setup File in GTrack Format
3.8 Adding LAD Information to the Model Setup File (GTrack)
3.9 Making the Model Setup File (GTrack) Diploid
3.10 Running Chrom3D
3.11 Visualizing the Resulting Chrom3D Model in ChimeraX
4 Notes
References
Index

Methods in Molecular Biology 2301

Silvio Bicciato Francesco Ferrari Editors

Hi-C Data Analysis Methods and Protocols

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.


Edited by

Silvio Bicciato Department of Life Sciences, University of Modena and Reggio Emilia, Modena, Italy

Francesco Ferrari IFOM, The FIRC Institute of Molecular Oncology, Milan, Italy

ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-0716-1389-4
ISBN 978-1-0716-1390-0 (eBook)
https://doi.org/10.1007/978-1-0716-1390-0

© Springer Science+Business Media, LLC, part of Springer Nature 2022

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

The three-dimensional (3D) architecture of chromatin has been gaining increasing attention as a crucial layer in the regulation of genome functionality, both at the level of individual genes and of larger genomic regions. The mechanisms controlling genome organization are critical for the physiological processes of the cell, such as development or differentiation, and their alteration might contribute to pathological states, such as cancer and genetic diseases. For these reasons, the study of chromatin 3D architecture has become of paramount importance in biomedical research and in the life sciences in general.

The two copies of three billion nucleotides composing the human genome add up to a DNA fiber of about 2 m. Fitting such a long polymer inside a cell nucleus, which has an average diameter of around 10 μm, is first of all a structural challenge. However, such structural compaction must comply with complex functional needs. Indeed, most regions of the genome must be accessible for functional interactions with specific protein complexes during interphase, and all portions of the genome must be accessible to DNA polymerase during replication. In essence, the structural and functional organization of chromatin inside the cell is governed by biophysical principles and molecular biology processes that require sophisticated techniques and interdisciplinary approaches to be fully elucidated.

Given these premises, the development of high-resolution microscopy techniques and molecular biology methods has been pivotal to understanding the mechanisms involved in the regulation of chromatin 3D structure. In particular, genome-wide high-throughput chromosome conformation capture (Hi-C) rapidly became a widespread technique, as it allows probing physical proximity between potentially any pair of genomic loci. Hi-C offered the possibility to discover and study crucial features of chromatin architecture at multiple levels of resolution. On a large scale, chromatin is organized in compartments, where active or inactive genomic regions co-localize and interact preferentially with other regions in the same state of activity. On a finer scale, chromatin domains are organized and separated from each other by the formation of Topologically Associating Domain (TAD) structures. Finally, at a higher resolution, individual regulatory elements can interact with distant ones through the formation of specific chromatin loops.

The ever-increasing resolution and size of Hi-C datasets, which are needed to characterize chromatin at multiple levels of resolution, have posed significant challenges in terms of data analysis. Indeed, in order to increase the coverage for any pair of genomic loci in a Hi-C experiment, the sequencing depth must grow quadratically with the number of genomic bins, as compared to the linear increase required by other high-throughput sequencing-based epigenomics methods that analyze the one-dimensional structure of the genome. Moreover, despite the relatively short history of the technique, several Hi-C protocol variations and data analysis methods have been proposed, in particular to improve the signal-to-noise ratio or the data resolution. This proliferation of technical advancements has further complicated the data analysis procedures, thus requiring ad hoc solutions for different applications.

In this context, large collaborative projects and consortia, created to investigate the epigenome and its three-dimensional architecture, have been making concerted efforts to develop shared solutions for data analysis, reproducibility, and interoperability. These consortia have also been instrumental in generating and disseminating a large number of datasets, covering several organisms, cell types, and conditions. Among them, it is worth mentioning the NIH-funded 4D Nucleome program (www.4dnucleome.org) and the European-centered International Nucleome Consortium (https://inc-cost.eu/), as they are bringing together researchers to develop and share common solutions for handling and analyzing data in the field of chromatin 3D architecture.

In this book, we collect articles covering a comprehensive set of methods and tools for Hi-C data processing, analysis, and interpretation. The volume also covers applications of Hi-C to address a variety of biological problems, with a specific focus on the state-of-the-art computational procedures adopted for the data analysis, and computational approaches stemming from the field of biophysics to study specific chromatin biology questions.

The first part of the volume covers methods going from Hi-C raw data to biologically meaningful results. Namely, the first section on “Data preprocessing” is dedicated to those computational approaches which are indispensable in every Hi-C-based experiment, as every study requires the application of normalization strategies (Matthey-Doret et al.) and the assessment of data reproducibility (Yang et al.). Then, the section on “Downstream analyses” covers specific steps of data analysis that are commonly adopted in multiple applications, as they revolve around the key types of information extracted from Hi-C contact matrices. These include the definition of TADs (Zufferey et al.), the differential analysis of Hi-C data (Nicoletti), and solutions to visualize and annotate Hi-C contact matrices (Pal and Ferrari). This section also includes a chapter on the file formats proposed by large consortia to specifically enhance Hi-C data interoperability (Lee).

The third section on “Specific applications of Hi-C” delves more deeply into methods adopted for the data analysis in special applications. This section takes advantage of expert insights, while at the same time maintaining a style of presentation accessible to a broad readership interested in exploring new applications. It includes a chapter on the analysis of Hi-C to discover structural variants (Song et al.) and a chapter on metagenomic applications of Hi-C (Marbouty and Koszul), both relying on Hi-C data characteristics to reconstruct the genome that they originate from. Two other chapters focus on the generation of high-resolution bacterial Hi-C data (Thierry and Cockram) and on the multiscale analysis of bacterial Hi-C data (Varoquaux et al.), i.e., specific applications of Hi-C to bacterial genomes that pose distinct computational challenges. Finally, the section comprises a chapter on HiChIP data analysis (Dori and Forcato), a technique combining Hi-C and chromatin immunoprecipitation for a more focused characterization of chromatin interactions.

The second part of the volume is dedicated to methods at the intersection between biology, bioinformatics, and biophysics. This part is intended to acknowledge the deep integration of interdisciplinary approaches that has been playing a crucial role in understanding the 3D architecture of chromatin and its functional and mechanistic implications. In particular, methods derived from the biophysics toolset have been instrumental in formulating mechanistic hypotheses on the role of specific features of chromatin organization. This is the case covered in the first section on polymer folding methods, which have been historically applied to describe and interpret the physical behavior of chromosomes (Rosa). From a practical point of view, these methods have been used to simulate Hi-C data (Zhan et al.), based on investigator-defined constraints, so as to verify the impact of specific features on the 3D architecture of chromatin (Chiang et al.). This part also covers a couple of more specialized applications which illustrate the potential of such approaches. In particular, one chapter describes the use of polymer folding models to clarify the connections between the one-dimensional epigenomic structure of chromatin and its 3D organization in Drosophila (Jost), and another presents an application of polymer folding models to dissect the genome organization changes due to pathological structural variants (Conte et al.).

The last section of the book addresses applications of biophysics modelling approaches to infer and reconstruct the 3D chromatin architecture from Hi-C data. Specifically, these two final chapters describe how the 3D genome organization can be reconstructed through the integration of chromatin state definitions (Carron et al.) and the integration of information on lamina-associated domains (Paulsen and Collas).

In conclusion, this volume on Hi-C data analysis provides an overview of methods spanning from the basic analyses to more specialized applications. This collection is intended for a large readership comprising computational and molecular biologists, as well as biophysicists, working in the fields of chromatin 3D architecture, epigenetics, and transcription regulation. The book is meant to address the needs and interests of a broad and growing community encompassing researchers with different backgrounds; thus, we are confident it will be useful for both experts and neophytes.

Silvio Bicciato (Emilia, Italy)
Francesco Ferrari (Milan, Italy)

Contents

Preface
Contributors

1 Normalization of Chromosome Contact Maps: Matrix Balancing and Visualization
  Cyril Matthey-Doret, Lyam Baudry, Shogofa Mortaza, Pierrick Moreau, Romain Koszul, and Axel Cournac
2 Methods to Assess the Reproducibility and Similarity of Hi-C Data
  Tao Yang, Xi He, Lin An, and Qunhua Li
3 Methods for the Analysis of Topologically Associating Domains (TADs)
  Marie Zufferey, Daniele Tavernari, and Giovanni Ciriello
4 Methods for the Differential Analysis of Hi-C Data
  Chiara Nicoletti
5 Visualizing and Annotating Hi-C Data
  Koustav Pal and Francesco Ferrari
6 Hi-C Data Formats
  Soohyun Lee
7 Analysis of Hi-C Data for Discovery of Structural Variations in Cancer
  Fan Song, Jie Xu, Jesse Dixon, and Feng Yue
8 Metagenomes Binning Using Proximity-Ligation Data
  Martial Marbouty and Romain Koszul
9 Generating High-Resolution Hi-C Contact Maps of Bacteria
  Agnès Thierry and Charlotte Cockram
10 Computational Tools for the Multiscale Analysis of Hi-C Data in Bacterial Chromosomes
  Nelle Varoquaux, Virginia S. Lioy, Frédéric Boccard, and Ivan Junier
11 Analysis of HiChIP Data
  Martina Dori and Mattia Forcato
12 The Physical Behavior of Interphase Chromosomes: Polymer Theory and Coarse-Grain Computer Simulations
  Angelo Rosa
13 Polymer Folding Simulations from Hi-C Data
  Yinxiu Zhan, Luca Giorgetti, and Guido Tiana
14 Predictive Polymer Models for 3D Chromosome Organization
  Michael Chiang, Giada Forte, Nick Gilbert, Davide Marenduzzo, and Chris A. Brackley
15 Polymer Modeling of 3D Epigenome Folding: Application to Drosophila
  Daniel Jost
16 A Polymer Physics Model to Dissect Genome Organization in Healthy and Pathological Phenotypes
  Mattia Conte, Luca Fiorillo, Simona Bianco, Andrea M. Chiariello, Andrea Esposito, Francesco Musella, Francesco Flora, Alex Abraham, and Mario Nicodemi
17 The 3D Organization of Chromatin Colors in Mammalian Nuclei
  Leopold Carron, Jean-Baptiste Morlot, Annick Lesne, and Julien Mozziconacci
18 Modeling the 3D Genome Using Hi-C and Nuclear Lamin-Genome Contacts
  Jonas Paulsen and Philippe Collas

Index

Contributors ` di Napoli Federico II, and INFN ALEX ABRAHAM • Dipartimento di Fisica, Universita Napoli, Complesso di Monte Sant’Angelo, Naples, Italy LIN AN • Bioinformatics and Genomics Program, Pennsylvania State University, University Park, PA, USA LYAM BAUDRY • Institut Pasteur, Unite´ Re´gulation Spatiale des Ge´nomes, Paris, France; Sorbonne Universite´, Colle`ge Doctoral, Paris, France ` di Napoli Federico II, and INFN SIMONA BIANCO • Dipartimento di Fisica, Universita Napoli, Complesso di Monte Sant’Angelo, Naples, Italy FRE´DE´RIC BOCCARD • Institute for Integrative Biology of the Cell (I2BC), Universite´ ParisSaclay, CEA, CNRS, Gif-sur-Yvette, France CHRIS A. BRACKLEY • SUPA, School of Physics and Astronomy, University of Edinburgh, Edinburgh, UK LEOPOLD CARRON • Sorbonne Universite´, CNRS, Laboratoire de Physique The´orique de la Matie`re Condense´e, Paris, France; Sorbonne Universite´, CNRS, Laboratory of Computational and Quantitative Biology, Paris, France MICHAEL CHIANG • SUPA, School of Physics and Astronomy, University of Edinburgh, Edinburgh, UK ` di Napoli Federico II, and ANDREA M. 
CHIARIELLO • Dipartimento di Fisica, Universita INFN Napoli, Complesso di Monte Sant’Angelo, Naples, Italy GIOVANNI CIRIELLO • Department of Computational Biology, University of Lausanne (UNIL), Lausanne, Switzerland; Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland CHARLOTTE COCKRAM • Institut Pasteur, Unite´ Re´gulation Spatiale des Ge´nomes, CNRS, Paris, France PHILIPPE COLLAS • Department of Molecular Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway; Department of Immunology and Transfusion Medicine, Oslo University Hospital, Oslo, Norway ` di Napoli Federico II, and INFN MATTIA CONTE • Dipartimento di Fisica, Universita Napoli, Complesso di Monte Sant’Angelo, Naples, Italy AXEL COURNAC • Institut Pasteur, Unite´ Re´gulation Spatiale des Ge´nomes, Paris, France JESSE DIXON • Peptide Biology Lab, Salk Institute for Biological Studies, La Jolla, CA, USA MARTINA DORI • Department of Life Sciences, University of Modena and Reggio Emilia, Modena, Italy ` di Napoli Federico II, and INFN ANDREA ESPOSITO • Dipartimento di Fisica, Universita Napoli, Complesso di Monte Sant’Angelo, Naples, Italy FRANCESCO FERRARI • IFOM—The FIRC Institute of Molecular Oncology, Milan, Italy; Institute of Molecular Genetics “Luigi Luca Cavalli-Sforza”, National Research Council, Pavia, Italy ` di Napoli Federico II, and INFN LUCA FIORILLO • Dipartimento di Fisica, Universita Napoli, Complesso di Monte Sant’Angelo, Naples, Italy ` di Napoli Federico II, and INFN FRANCESCO FLORA • Dipartimento di Fisica, Universita Napoli, Complesso di Monte Sant’Angelo, Naples, Italy

xi

xii

Contributors

MATTIA FORCATO • Department of Life Sciences, University of Modena and Reggio Emilia, Modena, Italy GIADA FORTE • SUPA, School of Physics and Astronomy, University of Edinburgh, Edinburgh, UK; MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, UK NICK GILBERT • MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, UK LUCA GIORGETTI • Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland XI HE • Bioinformatics and Genomics Program, Pennsylvania State University, University Park, PA, USA DANIEL JOST • University of Lyon, ENS de Lyon, Univ Claude Bernard, CNRS, Laboratoire de Biologie et Mode´lisation de la Cellule, Lyon, France IVAN JUNIER • TIMC-IMAG, CNRS, Univ. Grenoble Alpes, Grenoble, France ROMAIN KOSZUL • Institut Pasteur, Unite´ Re´gulation Spatiale des Ge´nomes, CNRS, Paris, France SOOHYUN LEE • Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA ANNICK LESNE • Sorbonne Universite´, CNRS, Laboratoire de Physique The´orique de la Matie`re Condense´e, Paris, France; Institut de Ge´ne´tique Mole´culaire de Montpellier, University of Montpellier, CNRS, Montpellier, France VIRGINIA S. 
LIOY • Institute for Integrative Biology of the Cell (I2BC), Universite´ ParisSaclay, CEA, CNRS, Gif-sur-Yvette, France QUNHUA LI • Department of Statistics, Pennsylvania State University, University Park, PA, USA MARTIAL MARBOUTY • Institut Pasteur, Unite´ Re´gulation Spatiale des Ge´nomes, CNRS, Paris, France DAVIDE MARENDUZZO • SUPA, School of Physics and Astronomy, University of Edinburgh, Edinburgh, UK CYRIL MATTHEY-DORET • Institut Pasteur, Unite´ Re´gulation Spatiale des Ge´nomes, Paris, France; Sorbonne Universite´, Colle`ge Doctoral, Paris, France PIERRICK MOREAU • Institut Pasteur, Unite´ Re´gulation Spatiale des Ge´nomes, Paris, France JEAN-BAPTISTE MORLOT • Sorbonne Universite´, CNRS, Laboratoire de Physique The´orique de la Matie`re Condense´e, Paris, France SHOGOFA MORTAZA • Institut Pasteur, Unite´ Re´gulation Spatiale des Ge´nomes, Paris, France JULIEN MOZZICONACCI • Sorbonne Universite´, CNRS, Laboratoire de Physique The´orique de la Matie`re Condense´e, Paris, France; Muse´um National d’Histoire Naturelle, Structure et Instabilite´ des Genomes, Paris, France; Institut Universitaire de France (IUF), Paris, France ` di Napoli Federico II, and INFN FRANCESCO MUSELLA • Dipartimento di Fisica, Universita Napoli, Complesso di Monte Sant’Angelo, Naples, Italy ` di Napoli Federico II, and INFN MARIO NICODEMI • Dipartimento di Fisica, Universita Napoli, Complesso di Monte Sant’Angelo, Naples, Italy; Berlin Institute of Health (BIH), MDC-Berlin, Berlin, Germany CHIARA NICOLETTI • Development, Aging and Regeneration Program, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, USA KOUSTAV PAL • IFOM—The FIRC Institute of Molecular Oncology, Milan, Italy JONAS PAULSEN • EVOGENE, Department of Biosciences, Faculty of Natural Sciences, University of Oslo, Oslo, Norway


ANGELO ROSA • SISSA (Scuola Internazionale Superiore di Studi Avanzati), Trieste, Italy
FAN SONG • Department of Biochemistry and Molecular Genetics, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA; Bioinformatics and Genomics Graduate Program, Huck Institutes of the Life Sciences, Penn State University, State College, PA, USA
DANIELE TAVERNARI • Department of Computational Biology, University of Lausanne (UNIL), Lausanne, Switzerland; Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
AGNÈS THIERRY • Institut Pasteur, Unité Régulation Spatiale des Génomes, CNRS, Paris, France
GUIDO TIANA • Department of Physics, University of Milano and INFN, Milan, Italy
NELLE VAROQUAUX • TIMC-IMAG, CNRS, Univ. Grenoble Alpes, Grenoble, France
JIE XU • Department of Biochemistry and Molecular Genetics, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
TAO YANG • Bioinformatics and Genomics Program, Pennsylvania State University, University Park, PA, USA
FENG YUE • Department of Biochemistry and Molecular Genetics, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA; Robert H. Lurie Comprehensive Cancer Center of Northwestern University, Chicago, IL, USA
YINXIU ZHAN • Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
MARIE ZUFFEREY • Department of Computational Biology, University of Lausanne (UNIL), Lausanne, Switzerland; Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland

Chapter 1

Normalization of Chromosome Contact Maps: Matrix Balancing and Visualization

Cyril Matthey-Doret, Lyam Baudry, Shogofa Mortaza, Pierrick Moreau, Romain Koszul, and Axel Cournac

Abstract

Over the last decade, genomic proximity ligation approaches have reshaped our vision of the 3D organization of chromosomes, from bacterial nucleoids to larger eukaryotic genomes. The different protocols (3Cseq, Hi-C, TCC, MicroC [XL], Hi-CO, etc.) rely on common steps (chemical fixation, digestion, ligation, etc.) to detect pairs of genomic positions in close proximity. The most common way to represent these data is a matrix, or contact map, which allows visualizing the different chromatin structures (compartments, loops, etc.) that can be associated with other signals such as transcription or protein occupancy and, in some instances, with biological functions. In this chapter we present and discuss the filtering of the events recovered in proximity ligation experiments as well as the application of the balancing normalization procedure to the resulting contact map. We also describe a computational tool, dubbed Scalogram, for visualizing normalized contact data. The different processes described here are illustrated and supported by the laboratory's custom-made scripts pooled into “hicstuff,” an open-access python package accessible on github (https://github.com/koszullab/hicstuff).

Key words Chromatin folding, Genome architecture, 3C, Hi-C, Normalization, Proximity ligation, Chromosome organization

1 Introduction

Since the conception of the original chromosome conformation capture (3C) protocol and its application to budding yeast chromosome III [1] (see Note 1), numerous derivatives of the 3C technique (also referred to as C approaches) have been developed and applied to many species. Those approaches provide insights into the higher-order folding of genomes that, combined with imaging and other molecular data, unveil the functional interplay between chromosome architecture and metabolism [2, 3]. C approaches have notably allowed the visualization of chromatin loop signals in a variety of genomes, such as those that appear during the meiotic

Silvio Bicciato and Francesco Ferrari (eds.), Hi-C Data Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 2301, https://doi.org/10.1007/978-1-0716-1390-0_1, © Springer Science+Business Media, LLC, part of Springer Nature 2022


and mitotic metaphase stages of the baker’s yeast Saccharomyces cerevisiae [4, 5]. Genetic analyses have further allowed the dissection of the regulatory mechanisms involved in their maintenance, positioning, and features [6]. Chromosomal domains, i.e., regions displaying preferential contacts within themselves rather than with their flanking regions, have been given different names (topologically associating domains or TADs, chromosome interacting domains, micro-domains, etc.) [7]. The formation of these domains results from mechanisms that remain actively investigated and involve, among others, roadblocks of various natures along the chromatin interplaying with dynamic loop extrusion or transcription [8–11]. The genomes of several animals, and more recently of an archaeon, display a partition into two main compartments: the transcriptionally inactive and active ones [12]. The biological significance of these different levels of architecture remains to be understood, as do the precise molecular mechanism(s) responsible for their formation and maintenance.

While the general principles behind C approaches remain similar, some variations have been introduced to improve or refine the resolution of the captured contact signal. The chemical fixation step of the experiment has principally been carried out by formaldehyde cross-linking, using paraformaldehyde [13], whereas ethylene glycol bis(succinimidyl succinate) (EGS), dimethyl adipimidate (DMA), as well as dual cross-links have been used on occasion [14]. The latter cross-linkers generate longer bridges between the reactive molecules, hence their interest. The genomic digestion step can also be adapted, from the use of cocktails of restriction enzymes [15] to the use of MNase in the Micro-C [16] and Hi-CO [17] protocols, all aiming at a higher cutting frequency and thus higher read coverage and short-scale resolution [4]. The genomic template can even be engineered to display regularly spaced restriction sites, which results in polymorphisms that allow two homologous sequences to be tracked in otherwise isogenic strains [4].

The first steps of Hi-C data analysis consist of data processing, filtering, and normalization. They aim at improving the signal-to-noise ratio and thus the characterization of relevant contact features in the matrices. In this chapter we describe standard procedures to process contact data using hicstuff, an open-access python package available at https://github.com/koszullab/hicstuff/. We also describe a visualization tool called Scalogram, already used to display bacterial contact maps, which can be used to plot normalized contact data while providing some insights into the local behavior of the DNA fiber, possibly reflecting dynamic properties (see [10]).

2 Materials

2.1 Hardware

To process genomic contact data, we recommend a machine with ~10 CPUs and at least 8 GB of RAM, but this is largely dependent on the size of the genome. To visualize chromatin loops along the human genome, a local Hi-C resolution of 10 kb or finer is necessary, resulting in matrices of 10,000 × 10,000 bins or more that require several GB of available memory (see Note 2).
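As a rough illustration of how memory grows with resolution, the sketch below estimates the size of a densely stored square contact matrix. The function names are ours, and the two storage widths (8 bytes for a 64-bit float in an array, 24 bytes per value as used in Note 2) are assumptions chosen for comparison, not measurements of any particular tool.

```python
def n_bins_for(sequence_length_bp: int, bin_size_bp: int) -> int:
    """Number of bins of size bin_size_bp needed to cover a sequence (ceiling division)."""
    return -(-sequence_length_bp // bin_size_bp)


def dense_matrix_bytes(n_bins: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store an n_bins x n_bins matrix densely."""
    return n_bins * n_bins * bytes_per_value


if __name__ == "__main__":
    # A 240 Mb chromosome binned at 10 kb, as in the example of Note 2.
    n = n_bins_for(240_000_000, 10_000)
    for label, width in (("8 bytes per value", 8), ("24 bytes per value", 24)):
        gigabytes = dense_matrix_bytes(n, width) / 1e9
        print(f"{n} x {n} bins, {label}: {gigabytes:.1f} GB")
```

Since most entries of a contact map far from the diagonal are zero, a sparse representation usually needs far less memory than these dense estimates.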

2.2 Software

The recommended software is listed in Table 1 and detailed below:

1. The python package used below is hicstuff, which contains several modules and all the functions needed for matrix manipulation (NumPy), computation (SciPy), and visualization (Matplotlib). The easiest way to install the program is to use the python package installer:

   pip3 install -U hicstuff

2. Alignment software such as bowtie2, bwa, or minimap2 must also be installed, as well as the samtools suite to process the aligned reads. Of the three aligners, bowtie2 is the most comprehensive and retains the most contacts, whereas minimap2 is the fastest but may discard alignments along the way.

Table 1 Required software. The table reports the list of required software along with their function and URL for download

Name | Function | URL
fasterq-dump | Reads extraction | https://www.ncbi.nlm.nih.gov/sra
bowtie2 | Alignment | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
minimap2 | Alignment | https://github.com/lh3/minimap2
bwa | Alignment | https://github.com/lh3/bwa
hicstuff | Hi-C pipeline | https://github.com/koszullab/hicstuff
samtools | Processing of sam files | http://samtools.sourceforge.net/
Tutorial for 3C data | Codes to process contact data | https://github.com/axelcournac/3C_tutorial
Scalogram | Scalogram visualization tool | https://github.com/koszullab/E_coli_analysis
Spyder | IDE | https://www.spyder-ide.org/


3. Finally, we recommend installing the integrated development environment Spyder which allows an interactive use of the python language and thus facilitates an exploratory approach of genomic data processing.

3 Methods

To launch hicstuff, an example of command line can be:

hicstuff pipeline -t 8 -a bowtie2 -e DpnII --matfmt bg2 --nocleanup -F -p -o out/ -g /home/sacCer3/sacCer3 SRR8769549_1.fastq SRR8769549_2.fastq

The different options are explained in detail on github (https://github.com/koszullab/hicstuff) or can be read by calling the help:

hicstuff pipeline --help

Raw reads can be extracted from the Sequence Read Archive (SRA) server (https://www.ncbi.nlm.nih.gov/sra) using the fasterq-dump program from the SRA Toolkit (sra_sdk/2.9.6) and the following command:

fasterq-dump --split-3 SRR8769549 -O .

3.1 Alignment of Read Pairs

As in most technologies involving high-throughput sequencing, one of the first steps is to align the reads to a reference genome. This results in a first filter, since reads whose alignment is ambiguous are discarded from subsequent analysis. This quality filtering is applied by setting a minimum threshold on the Mapping Quality (MQ) reported in the sam file output by the aligner; MQ > 30 is generally used by the community (see Note 3). Another important filter is the removal of duplicates arising from PCR amplification. A commonly used approach consists of discarding pairs with identical genomic positions: because the probability of independently recovering the same pair of positions several times is very low, such events can be considered duplicates.
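The two filters described above can be sketched in a few lines of python. This is a minimal illustration on an in-memory list of pairs, not the implementation used by hicstuff; the ReadPair record and the filter_pairs function are hypothetical names, and including the strands in the duplicate key is our assumption.

```python
from typing import Iterable, Iterator, NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical record holding the alignment of both reads of a pair."""
    chrom1: str
    pos1: int
    strand1: str
    mapq1: int
    chrom2: str
    pos2: int
    strand2: str
    mapq2: int


def filter_pairs(pairs: Iterable[ReadPair], min_mapq: int = 30) -> Iterator[ReadPair]:
    """Yield pairs whose two reads both have MQ > min_mapq, skipping exact duplicates."""
    seen = set()
    for pair in pairs:
        # Discard the pair if either read aligns ambiguously.
        if pair.mapq1 <= min_mapq or pair.mapq2 <= min_mapq:
            continue
        # Pairs with identical coordinates are treated as PCR duplicates.
        key = (pair.chrom1, pair.pos1, pair.strand1,
               pair.chrom2, pair.pos2, pair.strand2)
        if key in seen:
            continue
        seen.add(key)
        yield pair
```

For a full-size library the set of seen coordinates can become large; sorting the pairs by coordinate first and comparing neighbors is a common alternative.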

3.2 Filtering of Non-informative Events

The next step is to assign to each read position the corresponding restriction fragment when applicable. This assignment will allow visualizing the different events present in the library and eventually filter the ones that are not directly informative in terms of physical contacts and spatial organization. In our original description of a Hi-C contact map normalization procedure [18], we distinguished several different types of ligation events (see Note 4):


Fig. 1 Types of events in Hi-C libraries. (a) Configurations of the different events present in a Hi-C library. (b) Pie charts of the different types of events that can be found in a genomic contact library for different protocols: 3Cseq (data from [19]), Hi-C (data from [5]), Micro-C [XL] (data from [14]), and Hi-CO (data from [17])

1. “uncuts” events: non-digested collinear fragments (also named “dangling ends”). They can represent a large proportion of the events in the library, especially if the process of biotin enrichment of the ligation events is absent (3Cseq, Fig. 1a) or has not functioned correctly.


2. “circularization” events: one or more collinear fragments have circularized (also named self circles). If their proportion is very low, it may indicate that the ligation step has not worked well.
3. “weird” events: pairs of reads with the same orientation aligned onto the same restriction fragment. These events are not possible with a single copy of the restriction fragment. They could be explained either by sequencing errors or by events involving two copies of the DNA fragment (e.g., sister chromatids post-replication, or in the presence of diploid genomes) (see Note 5).
4. “Intrachromosomal” events: pairs of reads that lie on the same chromosome, have correctly passed the different steps of the protocol, and can be considered as physical contacts between two loci.
5. “Interchromosomal” events: pairs of reads involving two different chromosomes.

Figure 1 illustrates the distribution of these categories of events for different genomic contact techniques, from various representative experiments involving the baker's yeast Saccharomyces cerevisiae (see Note 6). The information contained in these pie charts can give indications on the proper conduct of the protocol (digestion efficiency, ligation, enrichment with biotin). The proportion of interchromosomal events can also be a good indicator of the noise content of the library (see Note 7). A 3Cseq library (the protocol does not include a biotin-based selection of ligation events) contains a large proportion of undigested collinear fragment events compared to a Hi-C library. The use of MNase allows a good digestion of the genome into small fragments. The use of long cross-linkers as proposed in the Micro-C [XL] protocol allows capturing relevant physical contacts [20]. The rate of interchromosomal events is very high in the Hi-CO protocol. This may be due to a high level of random ligation caused by the mild cross-linking procedure adopted (1% formaldehyde). It is probable that a strong cross-link, as in Micro-C [XL], would have drastically decreased this proportion.

Filtering out non-informative events is crucial when analyzing small-scale (a few kb) signals. For instance, we recently showed a positive correlation between the short-scale contact signal (~2 kb) detected with 3Cseq and the transcription level measured with RNAseq in several bacteria (see Supplementary Fig. S2 in [10]). In other words, the more a region is transcribed, the more it makes contacts with its close neighbors. Such a correlation would have been very difficult to demonstrate without filtering the events.
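The five categories listed above can be illustrated with a simplified classifier based on the chromosome, restriction fragment index, and strand of each read. This is only a sketch under our own assumptions: the function names are hypothetical, reads are assumed to be ordered so that read 1 lies upstream of read 2, and the max_sites threshold is an arbitrary illustrative value rather than one prescribed by the protocol (see [18, 33] for the detailed rules).

```python
from collections import Counter


def classify_pair(chrom1, frag1, strand1, chrom2, frag2, strand2, max_sites=1):
    """Return a coarse event label for one read pair.

    frag1 and frag2 are restriction fragment indices along the chromosome.
    max_sites is a hypothetical threshold on the number of restriction
    sites separating the two reads.
    """
    if chrom1 != chrom2:
        return "inter"
    n_sites = abs(frag2 - frag1)
    orientation = strand1 + strand2
    if orientation == "+-" and n_sites <= max_sites:
        return "uncut"  # reads facing each other on collinear DNA
    if orientation == "-+" and n_sites <= max_sites:
        return "circularization"  # reads pointing away from each other
    if orientation in ("++", "--") and n_sites == 0:
        return "weird"  # same orientation on the same fragment
    return "intra"


def event_proportions(pairs):
    """Fraction of each event type among an iterable of pair tuples."""
    counts = Counter(classify_pair(*pair) for pair in pairs)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

The dictionary returned by event_proportions holds the kind of information summarized by the pie charts of Fig. 1b.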

3.3 Iterative Procedure to Balance the Signal

To build the contact map, it is necessary to bin the pairs of reads that form the contacts. The size of the bin is a compromise between the desired spatial resolution and the sequencing depth. An example of a raw contact map binned at 2 kb for the model organism Saccharomyces cerevisiae is given in Fig. 2a (see Note 8). The purpose of balancing normalization is to mitigate the various biases that may be present. Several biases due to the protocol have been identified in most contact techniques, notably:

1. The density of restriction sites, or cut sites, is one of the most important biases. Differences can come from heterogeneity in the GC content of the genome (coming from the presence of horizontally transferred elements) and can result in restriction fragments of very different sizes. The probability of capture depends on the fragment size, gradually reaching a plateau around 1 kb [18].
2. The presence of repeated and non-mappable sequences can also generate differences in detectability. An unknown amount of signal can be “lost” in these regions. In practice, matrices are often riddled with empty columns and rows that represent these repeat “gaps.”
3. Other biases are more difficult to quantify: the local accessibility of chromatin for certain genomic regions, PCR amplification or sequencing biases, etc.

These variations can be corrected, or at least attenuated, by the iterative normalization procedure, which consists in dividing each matrix element by the detectability of the bins it belongs to (i.e., the sum of the elements of each row and then of each column). This normalization assumes that each region must have similar detectability: if one bin is under- or over-covered, this may be due to protocol limitations [18, 21]. This assumption may not always be valid (see below). The main advantage of this type of method is that it makes no a priori assumption about the nature of the biases present in the library.
One example of such methods is the Sequential Component Normalization (SCN) [18]. Other such methods exist:

- The most commonly used in the community is the Iterative Correction and Eigenvector decomposition (ICE) [22]. The term “normalization” is misleading here, as the sums of rows and columns are not equal to one as with the SCN. It should be seen as a bias correction procedure instead. It relies on the first eigenvectors of the contact maps, whose values often correlate with biases such as GC content.
- The Knight-Ruiz balancing algorithm [23] is also a widely used method to quickly obtain a doubly stochastic matrix P (whose sums of rows and columns are equal to one) from a contact map M by finding diagonal scalings D and E such that M = DPE.


Fig. 2 Effect of matrix balancing normalization. (a) Contact map without and with balancing normalization for chromosome 5 of S. cerevisiae (bin 2 kb, mitotic state, data from study [5]). (b) Genomic distance law computed without and with the balancing normalization. (c) Agglomerated plot for peaks of cohesin. (d) The mean Hi-C coverage for bins centered at peaks of cohesin

- Another intuitive method, also called “de-trending” or median contact frequency scaling (MCFS), requires computing, for each genomic distance between any two loci, the median of all contacts found at that genomic distance. This draws a so-called “trend” of contacts as a function of the genomic distance. The contacts between two loci are then divided by the trend found at their distance. As with the ICE, it is not strictly speaking a normalization.
- Some normalizations effectively correct for specific biases, such as copy number variants (CNV) as in [24].
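The de-trending approach described in the list above can be sketched with NumPy as follows. This is a minimal illustration assuming a dense, square, single-chromosome matrix; the function names distance_trend and detrend are ours and do not refer to any published implementation.

```python
import numpy as np


def distance_trend(matrix: np.ndarray) -> np.ndarray:
    """Median contact value at each genomic distance, expressed in bins."""
    n = matrix.shape[0]
    # The d-th diagonal holds all pairs of bins separated by d bins.
    return np.array([np.median(np.diagonal(matrix, offset=d)) for d in range(n)])


def detrend(matrix: np.ndarray) -> np.ndarray:
    """Divide each contact by the median contact found at the same distance."""
    m = np.asarray(matrix, dtype=float)
    n = m.shape[0]
    trend = distance_trend(m)
    # Distance (in bins) between every pair of bins.
    idx = np.arange(n)
    dist = np.abs(np.subtract.outer(idx, idx))
    expected = trend[dist]
    out = np.zeros_like(m)
    # Positions whose trend is zero are left at zero instead of dividing by zero.
    np.divide(m, expected, out=out, where=expected > 0)
    return out
```

The trend computed in the first function is also a simple way to obtain a contact-versus-distance curve; using the mean instead of the median is an equally common choice.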

Before the iterative procedure, poorly interacting bins should be removed to avoid distorting the matrix elements involving those bins (see Note 3). As with the biases identified above, these bins may contain repeated sequences that were filtered out during the alignment procedure, or have few or no restriction sites. They can correspond to genomic regions whose GC content differs from that of the rest of the genome (which can be attributed to horizontally transferred elements, prophages, etc.).
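The principle of the iterative procedure, including the prior removal of poorly covered bins, can be sketched as follows. This is an SCN-like illustration written for this chapter, not the code used by hicstuff: the function name balance, the fixed number of iterations, and the min_contacts threshold are assumptions, and real implementations work on sparse matrices and usually stop on a convergence criterion.

```python
import numpy as np


def balance(matrix, n_iter=50, min_contacts=0.0):
    """Alternately divide rows and columns by their sums (SCN-like sketch).

    Bins whose total coverage is <= min_contacts are emptied beforehand
    so that they do not distort the other matrix elements.
    """
    m = np.array(matrix, dtype=float)
    coverage = m.sum(axis=1)
    bad = coverage <= min_contacts
    m[bad, :] = 0.0
    m[:, bad] = 0.0
    for _ in range(n_iter):
        row_sums = m.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0  # leave empty rows untouched
        m /= row_sums
        col_sums = m.sum(axis=0, keepdims=True)
        col_sums[col_sums == 0] = 1.0  # leave empty columns untouched
        m /= col_sums
    # Average with the transpose to restore exact symmetry.
    return (m + m.T) / 2.0
```

After enough iterations on a symmetric map, every retained row and column sums to approximately one, which is the property that distinguishes this type of normalization from a bias correction such as the ICE.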


Example of command line to normalize using the ICE procedure and visualize the contact map (see Note 9):

hicstuff view --normalize --binning 2kb --region "chr5:0-600,000" --frags fastq_sam/out/fragments_list.txt abs_fragments_contacts_weighted.bg2

An example of a normalized contact map is given in Fig. 2a for the model organism Saccharomyces cerevisiae. After normalization, the map appears more homogeneous: rows that were poor or, on the contrary, rich in contacts are more balanced. In particular, it becomes easier to distinguish chromosomal domains and loops.

To test the effect of normalization on the resulting contact map and on the following calculations, we compute the genomic distance law (Fig. 2b). This computation is a good metric that reflects the physical properties of chromatin [25] and can be a good check of the cross-linking step. The balancing normalization does not affect this plot: the averaging contained in this computation must indirectly normalize the possible biases present in the library.

We also compute the agglomerated plot of pairs of cohesin peaks separated by 10 kb to 50 kb (for more details on that procedure see [6]). This computation allows detecting the general contact pattern emerging from a particular group of genomic positions. In this example, we can see the loop pattern formed by pairs of cohesin peaks separated by 10 kb to 50 kb. Interestingly, the pattern is not exactly the same with and without normalization (Fig. 2c). The hot spot of contacts is clearly visible at the center of the plot in both computations. However, without balancing normalization, a cross pattern is also visible. To explain this difference, we compute the mean Hi-C coverage of the bins that contain cohesin peaks (Fig. 2d). This group of bins has a greater number of contacts. This enrichment may have been partially mitigated during normalization, which can explain why the cross pattern initially present has disappeared.

Two explanations can account for the greater number of contacts of this group of bins. These genomic positions may be cross-linked more efficiently (potentially due to a higher concentration of proteins at these positions). Alternatively, these bins may really make more physical contacts than the rest of the genome, which contradicts the initial hypothesis of uniformity in the number of contacts along the genome. These contact stripes could be compatible with a loop extrusion model [8]. In this model, cohesins, which are molecular motors, wind up the DNA and can be blocked at specific locations (thus forming a stable loop). These lines could correspond to genomic positions trapped by cohesins that have not yet been blocked and continue to extrude. It is also possible that


both effects (protocol bias and true biological effect) co-occur to explain the representation without normalization.

This example shows a possible limit of balancing normalizations. It therefore seems important to us to always keep both representations in mind when interpreting the data and the associated molecular mechanisms. It is possible that, with the decisive improvements brought to the different protocols in recent years, we are moving away from a uniform distribution of the number of contacts per bin. Indeed, some loci, due to their biological properties, could make a higher number of physical contacts than the rest of the genome and play a particular role in the general architecture of genomes. The loci enriched in cohesin appear to be good candidates for such a category. Many biological networks do not have uniform connectivity (gene regulation networks, protein interaction networks); it appears possible that the contact networks detected with C technologies do not either.

3.4 Scalogram: Alternative Visualization Tool for Normalized Contact Data

One of the most used computations when analyzing contact maps is the so-called genomic distance plot (see above). It represents the number of contacts as a function of the genomic distance and can reflect the structural properties of chromatin. The slope of the curves changes according to the type of chromatin (active or inactive) [25] or according to the color of chromatin in Drosophila data [26]. In this last section, we propose a simple visualization tool called Scalogram, an alternative visualization of normalized contact maps that aims at giving a local representation of the genomic distance law along a chromosome.

The algorithm takes as input a binned and normalized contact map for one chromosome, as well as the number of bins over which the computation is done. The Scalogram representation aims at showing the dispersion of the contact signal across the different spatial scales. For each spatial scale, the cumulative contact signal is computed as a percentage of the total contact signal. The use of contour lines, as commonly done in cartography, smooths out fluctuations and gives a more readable representation of the dispersion of the contact signal along a chromosome. The dispersion reflects the local constraint along a chromosome, i.e., whether the contact signal is concentrated at short scales or, on the contrary, rather dispersed across the spatial scales (see Note 10).

To obtain a scalogram visualization, you can use the following command line with the associated code (https://github.com/koszullab/3C_tutorial/blob/master/python_codes/scalogram_tool.py):

python users_scalogram.py MAT_RAW_chr5_2kb.txt chr5 150


Fig. 3 Normalized contact maps and Scalogram representation. (a) For the Escherichia coli genome, bin 5 kb, data from [10]. In the scalogram representation, mobility measurements are also included (blue dots). They correspond to MSD (Mean Square Displacement) measurements of fluorescent proteins attached to a specific locus [27, 28]. (b) For chromosome 5 of Saccharomyces cerevisiae, bin 2 kb, data from [5]. (c) For chromosome 3 of Homo sapiens, primary hepatocytes, bin 100 kb, data from [29]. (d) For chromosome 3 of Ursus maritimus (polar bear), bin 100 kb, genome assembly and contact data from [30, 31]

It takes three arguments as input:

- The name of the file containing the raw contact map
- The name given to the output file
- The number of bins up to which to compute the cumulative signal
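To make the notion of cumulative signal more concrete, the sketch below computes, for each bin and each spatial scale, the percentage of the contact signal found within that scale. It is one possible reading of the description above and not the published scalogram script: the function name cumulative_signal is ours, and taking the total within the largest window as the reference (so that the largest scale always reaches 100%) is an assumption.

```python
import numpy as np


def cumulative_signal(matrix, n_scales):
    """Percentage of each bin's contact signal found within s bins, for s = 0..n_scales.

    Returns an array of shape (n_scales + 1, n_bins). The reference total
    is the signal within n_scales bins on either side of the bin.
    """
    m = np.asarray(matrix, dtype=float)
    n = m.shape[0]
    out = np.zeros((n_scales + 1, n))
    for i in range(n):
        total = m[i, max(0, i - n_scales):min(n, i + n_scales + 1)].sum()
        if total == 0:
            continue  # empty bin: leave the column at zero
        for s in range(n_scales + 1):
            window = m[i, max(0, i - s):min(n, i + s + 1)]
            out[s, i] = 100.0 * window.sum() / total
    return out
```

Level lines of the resulting array (for example with the contour function of Matplotlib) then show, along the chromosome, the scale at which a given percentage of the signal is reached.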

The local structuring of chromosomes can be examined in this way for diverse organisms (Fig. 3). One of the main results obtained with the cumulative signal is an unexpected correlation between the contact signal and measurements from dynamics experiments based on time-lapse microscopy [10, 27, 28] (Fig. 3a). Using this approach in the model organism Escherichia coli, we recently observed a positive correlation between the cumulative signal extracted from contact data (level line in the Scalogram) and mobility measurements, i.e., the MSD (Mean Square Displacement, represented as blue dots) measured with time-lapse microscopy (for more details see [10]). It would be interesting to test if this type of


correlation between contact data and dynamics measurements also applies to other organisms. For human chromosome 3 (Fig. 3c), the sub-telomeric regions look more constrained than the rest of the chromosome. Interestingly, certain regions of inactive chromatin (around genomic position 50 Mb in the example shown) also look very constrained. We think that this alternative representation can bring out other aspects of spatial structuring and open up new hypotheses.

4 Notes

1. Interestingly, a preliminary conceptualization of a method based on the capture of physical contacts to infer 3D organization can be found in earlier works from the 1980s that focused on the 3D structure of the bacteriophage lambda and T7 genomes, using a chemically synthesized cross-linker called BAMO (bis(monoazidomethidium)-octaoxahexacosanediamine) capable of sticking together two double helices of DNA [32].
2. The human chromosome 1 is about 240,000,000 bp, resulting in a Hi-C matrix of ~24,000 × 24,000 bins of 10 kb. If the matrix is densely represented with float numbers, the required memory space is 24,000 × 24,000 × (24 bytes) ~ 14 GB. A sparse representation is necessary for it to be used on most common machines. This should be kept in mind when designing new algorithms for Hi-C normalization, extrapolation of data, or other operations on contact maps.
3. These discarded sequences represent a non-negligible part of the potential information contained in a genomic contact library. For example, for a bacterial genome such as that of Vibrio cholerae (which has a super-integron containing numerous repeated sequences), 15% of the pairs of sequences are removed and not exploited. For a yeast genome such as that of Saccharomyces cerevisiae, which contains several families of transposons repeated throughout the genome, 28% of the paired-end reads are removed. Finally, for a standard library of the human genome, which contains a large number of different repeated sequences, about 35% of the sequence pairs are not currently used.
4. These different events can be determined by looking at the positions and orientations of the reads in relation to the reference genome and the distance separating them (in number of restriction fragments or bins). For a detailed explanation, see [18, 33].


5. Interestingly, this type of event has recently been used in an analysis of genomic contact data from Drosophila [34]. These events have been linked to the positions of architectural proteins connecting homologous chromosomes in a diploid genome (in Kc167 cells). The authors computed the signal in G2, G1, and unsynchronized cells and found a high correlation between them, which indicates that the signal detects homolog pairing and not necessarily sister chromatid pairing. To our knowledge, this is the first study showing a biological interpretation (homologous pairing) of this type of event; it would be interesting to test this approach in other organisms with diploid genomes.
6. Before starting a Hi-C experiment on a new organism, it is advisable to compute the restriction map of the enzyme being considered for genomic digestion and to ensure that the number of restriction sites is sufficient and relatively homogeneous. A good average restriction fragment size is around 250 bp. The restriction map of the genome can be computed using hicstuff with the following command line:

   hicstuff digest --plot --outdir output_dir --enzyme DpnII /home/sacCer3/all_chr.fa

7. Another way to quantify the rate of random ligation present in a library of genomic contacts is to calculate the ratio of contacts made with the mitochondrial genome (if available). Since the mitochondrial genome is located in a separate compartment from the rest of the genome, it can be useful for counting purely random ligations that take place without physical contact. However, this metric has several limitations: the number of mitochondria may vary from one biological state to another, and some mitochondrial sequences may also be integrated into the main genome. Finally, the sequences of the mitochondrial genome can be difficult to access: for example, the mitochondrial genome of the yeast S. cerevisiae has a very low GC content (~17%) and may not be sufficiently cut by standard Hi-C protocols. In our experience, only a Micro-C XL protocol gives a realistic physical contact map for the mitochondrial genome of the yeast Saccharomyces cerevisiae.
8. To plot the contact map, since the signal is very strong on the main diagonal, a distortion is necessary to visualize the signal at large scales. A log representation can be used. We usually raise the initial matrix to an exponent lower than 1.0 (for example, 0.2) to make this distortion. This exponent can easily be adjusted by hand to make structures at a specific scale appear more clearly.
9. The Hi-C community has yet to come to a consensus on a file format to store Hi-C data. Among the many existing formats, hicstuff


supports bedgraph2d, cool, and graal. Bedgraph2d and graal are tabular text formats, which can be handy to quickly process the data with external scripts. However, when working on organisms with large genomes, storage of Hi-C data can become an issue due to space limitations. Hence, the Hi-C community is progressively adopting the cool file format as a standard. This compressed hierarchical file format is based on HDF5 and therefore inherits many of its perks, such as supporting out-of-core operations and having a small file size. The cool file format comes with an associated command line tool named cooler, also available as a python API. Since hicstuff fully supports all three formats, the choice boils down to the specific needs of the user; however, one should keep in mind that tools for the downstream processing of Hi-C data are most likely to require the cool format as input. Conversion between different file formats can be achieved using hicstuff. For example, to convert a matrix from cool to bedgraph2d (bg2) format, one could use:

   hicstuff convert --to bg2 example.cool converted_example

10. When using the Scalogram tool, playing with the bin size of the matrix and/or with the number of bins to compute the cumulative signals allows making different structures appear. Pictograms for each species come from http://phylopic.org/.
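The restriction map check recommended in Note 6 can be illustrated by a small in-silico digestion. The sketch below is not the code behind hicstuff digest: the function name fragment_sizes is ours, the default site GATC is the recognition sequence of DpnII, and placing the cut at the start of the site is a simplification.

```python
import re
from statistics import mean


def fragment_sizes(sequence: str, site: str = "GATC") -> list:
    """Sizes of the fragments delimited by every occurrence of `site`.

    The cut is placed at the start of each site; overlapping sites are
    found thanks to the lookahead pattern.
    """
    pattern = f"(?={re.escape(site)})"
    cuts = [match.start() for match in re.finditer(pattern, sequence.upper())]
    bounds = [0] + cuts + [len(sequence)]
    # Skip zero-length fragments (e.g., a site at the very start of the sequence).
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]


if __name__ == "__main__":
    toy_sequence = "AAGATCTTTTGATCCC"
    sizes = fragment_sizes(toy_sequence)
    print(sizes, "mean fragment size:", mean(sizes))
```

Applied to each chromosome of a reference genome, the mean and the distribution of these sizes indicate whether the enzyme cuts frequently and homogeneously enough.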

Acknowledgments

All members of the Régulation Spatiale des Génomes laboratory are thanked for the daily exchanges and feedback.

Chapter 2

Methods to Assess the Reproducibility and Similarity of Hi-C Data

Tao Yang, Xi He, Lin An, and Qunhua Li

Abstract

Hi-C experiments are costly to perform and involve multiple complex experimental steps. Reproducibility of Hi-C data is essential for ensuring the validity of the scientific conclusions drawn from the data. In this chapter, we describe several recently developed computational methods for assessing reproducibility of Hi-C replicate experiments. These methods can also be used to assess the similarity between any two Hi-C samples.

Key words Reproducibility, Similarity, Hi-C data, Quality control, HiCRep, HiC-spector, GenomeDISCO

1 Introduction

Hi-C is a powerful technique to understand the 3D organization of the genome [1]. The Hi-C assay involves multiple complex experimental steps, producing data with complicated multi-scale structures. How to quantify the similarity between two Hi-C data sets is a fundamental question of practical importance. Assessing the similarity between replicate experiments, i.e., reproducibility, is critical to ensure the validity of scientific conclusions drawn from the data and to indicate when an experiment should be repeated. This information is also important for deciding whether two replicates can be pooled, a strategy that is frequently used to obtain a large number of Hi-C interactions [2]. With the fast accumulation of Hi-C data collected from different cell types and different biological conditions, quantifying the similarity between different Hi-C samples is essential for identifying the biologically meaningful variation in 3D chromatin structure and understanding their roles

All authors contributed equally to this work. Silvio Bicciato and Francesco Ferrari (eds.), Hi-C Data Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 2301, https://doi.org/10.1007/978-1-0716-1390-0_2, © Springer Science+Business Media, LLC, part of Springer Nature 2022


in regulating cell differentiation and responding to biological perturbations. Previously, Pearson and Spearman correlation coefficients were commonly used to measure the similarity between Hi-C experiments [3–6]. However, standard correlation approaches implicitly treat all elements of the Hi-C matrix as independent measurements. They do not consider the spatial patterns in Hi-C contact maps, such as domain structures (for example, topologically associating domains and A/B compartments) and distance dependence, that is, the fact that the interaction frequency between two genomic loci decreases substantially, on average, as their genomic distance increases. As a result, they may produce incorrect assessments, often assigning higher correlations to unrelated samples than to real biological replicates [7]. To overcome this problem, several methods have recently been developed to assess the reproducibility of population Hi-C data, including HiCRep [7], HiC-spector [8], GenomeDISCO [9], and QuASAR [10]. These methods were specifically designed to capture the biologically relevant structure in a Hi-C dataset while reducing "uninteresting" similarities. They produce more accurate reproducibility assessments than Pearson or Spearman correlation for real and simulated datasets [2]. They can also be used to evaluate the pairwise similarities between different Hi-C samples. Two recent studies compared the performance of these methods on population [2] and single-cell Hi-C data [11]. In [2], a comparison on the population Hi-C data from 11 cancer cell lines shows that both HiC-spector and HiCRep separate all non-replicates from biological replicates with a clear margin, with a larger margin given by HiC-spector. However, the goal of most studies goes beyond simply discriminating between replicates and non-replicates.

In studies that aim to quantify similarities among various experiments, HiCRep has been shown to discriminate well among cell types in both bulk Hi-C data [7] and single-cell Hi-C data [11]. In particular, [11] compared the ability of these three methods to evaluate the cellular heterogeneity in single-cell Hi-C data when they are used as the similarity measures in multi-dimensional scaling (MDS). It found that HiCRep strongly outperforms the other two methods and a technique that has been used previously for scHi-C analysis, producing a low-dimensional embedding that successfully captures biologically meaningful variation in 3D chromatin structure and accurately separates cells at different stages of the cell cycle.
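The effect of distance dependence on plain correlation, described above, can be made concrete with a small simulation. The Python sketch below is only an illustration under stated assumptions: the two maps are simulated independently (Poisson noise around an arbitrary 1/(1 + distance) decay, which is not a model taken from this chapter), and all function names are hypothetical. The whole-matrix Pearson correlation is high simply because both maps share the same decay, whereas the correlation computed separately at each genomic distance stays close to zero.

```python
import numpy as np

def simulate_map(rng, n=100, scale=100.0):
    """Simulate a symmetric contact map whose expectation decays with distance."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    expected = scale / (1.0 + dist)              # assumed decay, illustration only
    upper = np.triu(rng.poisson(expected))       # independent noise per bin pair
    return upper + np.triu(upper, 1).T           # mirror to make it symmetric

def pearson_upper(a, b):
    """Pearson correlation over all upper-triangle entries (distance ignored)."""
    iu = np.triu_indices_from(a)
    return np.corrcoef(a[iu], b[iu])[0, 1]

def mean_stratum_pearson(a, b, max_dist=50):
    """Average of the Pearson correlations computed separately on each diagonal."""
    rhos = []
    for d in range(max_dist):
        x, y = np.diagonal(a, d), np.diagonal(b, d)
        if x.std() > 0 and y.std() > 0:          # skip constant strata
            rhos.append(np.corrcoef(x, y)[0, 1])
    return float(np.mean(rhos))

rng = np.random.default_rng(0)
map_a = simulate_map(rng)   # two maps with no shared structure
map_b = simulate_map(rng)   # beyond the common distance decay
print("whole-matrix Pearson:", round(pearson_upper(map_a, map_b), 3))
print("mean per-distance Pearson:", round(mean_stratum_pearson(map_a, map_b), 3))
```

This is the motivation for stratifying by distance before comparing maps, which is the approach HiCRep takes.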


1.1 Similarity Measures for Hi-C Data

In this section, we give a brief overview of HiCRep [7], GenomeDISCO [9], and HiC-spector [8].

1.1.1 HiCRep

HiCRep takes the spatial structures of Hi-C data into account in its similarity assessment. It explicitly corrects for the genomic distance effect through stratification and addresses the sparsity of contact matrices through smoothing. The method consists of two steps. In the first step, HiCRep smooths each Hi-C contact map using a 2D mean filter to improve the contiguity of regions with elevated interaction. The filter replaces the read count of each contact with the average count of all contacts in its neighborhood, thereby enhancing the domain structures. In the second step, HiCRep stratifies the smoothed Hi-C contact matrix according to the genomic distance between interacting loci and computes the Pearson correlation of the smoothed contacts within each stratum, in order to remove the bias introduced by distance in the similarity assessment. It then computes the stratum-adjusted correlation coefficient (SCC), which is a weighted average of the stratum-specific correlations across all strata. The weights are derived from the Cochran-Mantel-Haenszel (CMH) statistic, a common statistic for analyzing stratified data. The SCC lies in the range [-1, 1] and is interpreted in a way similar to the standard correlation coefficient.
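The two steps can be sketched in a few lines of Python. This is a simplified illustration, not the HiCRep implementation: the weight of each stratum is assumed here to be the number of contacts in the stratum times the product of the two standard deviations, the handling of matrix edges in the mean filter is a choice made for this sketch, and the function names are hypothetical.

```python
import numpy as np

def mean_filter(mat, h):
    """Replace each entry by the mean of its (2h+1) x (2h+1) neighborhood.

    Neighbors falling outside the matrix are ignored (sketch-level choice).
    """
    n = mat.shape[0]
    padded = np.pad(mat.astype(float), h, mode="constant")
    inside = np.pad(np.ones((n, n)), h, mode="constant")
    total = np.zeros((n, n))
    count = np.zeros((n, n))
    for di in range(2 * h + 1):
        for dj in range(2 * h + 1):
            total += padded[di:di + n, dj:dj + n]
            count += inside[di:di + n, dj:dj + n]
    return total / count

def scc_sketch(a, b, h=1, max_dist=None):
    """Weighted average of per-distance Pearson correlations of smoothed maps."""
    a_s, b_s = mean_filter(a, h), mean_filter(b, h)
    n = a.shape[0]
    max_dist = n - 3 if max_dist is None else max_dist
    num = den = 0.0
    for d in range(max_dist + 1):
        x, y = np.diagonal(a_s, d), np.diagonal(b_s, d)
        sx, sy = x.std(), y.std()
        if len(x) < 3 or sx == 0 or sy == 0:
            continue                      # stratum carries no usable signal
        rho = np.corrcoef(x, y)[0, 1]
        weight = len(x) * sx * sy         # assumed CMH-style weight
        num += weight * rho
        den += weight
    return num / den if den > 0 else float("nan")

# Small random symmetric maps, for illustration only
rng = np.random.default_rng(1)
raw1 = rng.poisson(5.0, size=(30, 30))
raw2 = rng.poisson(5.0, size=(30, 30))
m1 = raw1 + raw1.T
m2 = raw2 + raw2.T
print(scc_sketch(m1, m1), scc_sketch(m1, m2))
```

Comparing a map with itself gives a score of 1 up to rounding, and unrelated maps give values near 0, which matches the correlation-like interpretation of the SCC.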

1.1.2 GenomeDISCO

GenomeDISCO evaluates the similarity between two Hi-C contact maps by summarizing the similarities of domain structures across a range of scales. Assuming that information on the domain structures at different scales can be captured by smoothing the original contact maps at different levels, it smooths each contact matrix across a range of levels. Specifically, it treats the contact matrix as the adjacency matrix of a network and smooths a normalized contact matrix by raising the matrix to the tth power, where t is the smoothing level and is hard-coded as 3. This operation can be viewed as taking a t-step random walk on the network. To capture the information for domain structures at different scales, the smoothing process is conducted multiple times for each contact map across different values of t. It then aggregates the similarity across smoothing levels by summing up the L1 distance between the pair of smoothed maps at each smoothing level. The final similarity score is obtained as 1 - (combined L1 distance), which rescales the score to the range [-1, 1].
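The random-walk smoothing can be illustrated with the following Python sketch for a single smoothing level t. It is not the GenomeDISCO code: the normalization used here (dividing each row by its sum to obtain a transition matrix) and the scaling of the L1 distance by the average number of non-empty rows are assumptions made for this example, and the function names are hypothetical.

```python
import numpy as np

def transition_matrix(contacts):
    """Row-normalize a contact matrix so that each non-empty row sums to 1."""
    c = np.asarray(contacts, dtype=float)
    row_sums = c.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0         # leave empty rows as all zeros
    return c / row_sums

def disco_like_score(a, b, t=3):
    """1 - scaled L1 distance between the t-step random-walk matrices."""
    pa = np.linalg.matrix_power(transition_matrix(a), t)
    pb = np.linalg.matrix_power(transition_matrix(b), t)
    nodes_a = np.count_nonzero(np.asarray(a).sum(axis=1))
    nodes_b = np.count_nonzero(np.asarray(b).sum(axis=1))
    n_nodes = 0.5 * (nodes_a + nodes_b)   # assumed scaling factor
    distance = np.abs(pa - pb).sum() / n_nodes
    return 1.0 - distance

# Small random symmetric maps, for illustration only
rng = np.random.default_rng(2)
raw_a = rng.poisson(3.0, size=(25, 25)) + 1
raw_b = rng.poisson(3.0, size=(25, 25)) + 1
g1 = raw_a + raw_a.T
g2 = raw_b + raw_b.T
print(disco_like_score(g1, g1), disco_like_score(g1, g2))
```

Because every row of a t-step transition matrix sums to 1 when the map has no empty rows, the scaled distance cannot exceed 2 in that case, which is why the score falls between -1 and 1.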

1.1.3 HiC-Spector

HiC-Spector takes a dimension reduction approach to assessing the similarity between Hi-C contact matrices. Specifically, it transforms each chromosomal contact matrix into a normalized Laplacian matrix and calculates the eigenvectors of the Laplacian by matrix decomposition. The similarity between a pair of Hi-C contact matrices is then computed from the Euclidean distance between the 20 leading normalized eigenvectors. The similarity score is obtained by linearly rescaling the distance into the range of [0, 1].
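A minimal Python sketch of this spectral comparison is given below. It is an illustration under assumptions rather than the HiC-Spector code: the text above only states that the distance is rescaled linearly, so the choice of dividing by r times the square root of 2, the resolution of the eigenvector sign ambiguity, and the function names are all assumptions of this example.

```python
import numpy as np

def laplacian_eigenvectors(contacts, r):
    """Return the r eigenvectors with smallest eigenvalues of the normalized Laplacian."""
    w = np.asarray(contacts, dtype=float)      # symmetric contact matrix
    deg = w.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    lap = np.eye(len(w)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)              # eigenvalues in ascending order
    return vecs[:, :r]                         # columns have unit norm

def spector_like_score(a, b, r=20):
    """Rescale the summed eigenvector distances linearly into [0, 1]."""
    va, vb = laplacian_eigenvectors(a, r), laplacian_eigenvectors(b, r)
    dist = 0.0
    for i in range(r):
        # An eigenvector is defined up to sign, so take the closer orientation.
        dist += min(np.linalg.norm(va[:, i] - vb[:, i]),
                    np.linalg.norm(va[:, i] + vb[:, i]))
    # Each term is at most sqrt(2) for unit vectors (assumed rescaling).
    return 1.0 - dist / (r * np.sqrt(2.0))

# Small random symmetric maps, for illustration only
rng = np.random.default_rng(3)
raw_p = rng.poisson(4.0, size=(40, 40)) + 1
raw_q = rng.poisson(4.0, size=(40, 40)) + 1
s1 = raw_p + raw_p.T
s2 = raw_q + raw_q.T
print(spector_like_score(s1, s1), spector_like_score(s1, s2))
```

Identical maps score 1, while maps with unrelated eigenvectors score close to 0.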

2 Data Sets

2.1 Description of the Example Data

In this chapter, the Hi-C example data are obtained from the mouse cell line G1E-ER4 [15], corresponding to the sample G1E-ER4 uninduced. The dataset has two biological replicates, generated using the in situ Hi-C protocol and the restriction enzyme DpnII (a 4-cutter that recognizes the GATC sequence). In total, it comprises 323 million valid paired-end reads across the two replicates, with a read length of 150 base pairs (bp). The reproducibility of the two replicates will be assessed.

2.2 Download the Data

The example data are available at Gene Expression Omnibus under the accession number GSE95476 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE95476). Raw reads can be downloaded from the Sequence Read Archive (SRA) under BioProject PRJNA377236 and converted into FASTQ format using the SRA Toolkit. The code below downloads the raw reads and processes them into .bam files.

    # Generate file lists of raw read ids
    # SRX2598395: SRA for G1E-ER4 uninduced replicate 1 HiC
    # SRX2598398: SRA for G1E-ER4 uninduced replicate 2 HiC
    esearch -db sra -query PRJNA377236 | efetch -format runinfo | grep "SRX2598395" | grep SRR | awk -F "," '{print $1}' > id_rep1.txt
    esearch -db sra -query PRJNA377236 | efetch -format runinfo | grep "SRX2598398" | grep SRR | awk -F "," '{print $1}' > id_rep2.txt

    # Download all the raw reads based on their ids in the file lists and convert into FASTQ format
    # -O: specify the output directory
    cat id_rep1.txt | xargs -n 1 fastq-dump --split-files -O sra1
    cat id_rep2.txt | xargs -n 1 fastq-dump --split-files -O sra2

    # Download the reference genome (mm10) and unpack the chromosome FASTA files into ref/
    wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
    mkdir -p ref
    tar zxvf chromFa.tar.gz -C ref

    # Merge all chromosomes
    cat ref/*.fa > mm10.ref.fa

    # Build index
    # Required tools: samtools (https://github.com/samtools/samtools) and bwa
    samtools faidx mm10.ref.fa
    bwa index -a bwtsw mm10.ref.fa

    # Use bwa for alignment and mapping for replicate 1
    cat id_rep1.txt | xargs -I '{}' --max-procs 5 sh -c "bwa mem mm10.ref.fa sra1/{}_1.fastq sra1/{}_2.fastq > sra1/{}.sam"

    # Convert SAM file to BAM file, filter out unmapped reads, then sort and index
    cat id_rep1.txt | xargs -I '{}' --max-procs 5 sh -c "samtools view -Sb -F 12 sra1/{}.sam > sra1/{}.bam"
    cat id_rep1.txt | xargs -I '{}' --max-procs 5 sh -c "samtools sort sra1/{}.bam -o {}.sorted.bam"
    cat id_rep1.txt | xargs -I '{}' --max-procs 5 sh -c "samtools index {}.sorted.bam"

    # Use Picardtools to remove PCR duplicates
    # picardtools: https://broadinstitute.github.io/picard/
    # picarddir=/directory/to/picard-tools-1.88
    # File names follow the same pattern as for replicate 2 below
    cat id_rep1.txt | xargs -I '{}' --max-procs 5 sh -c "java -Xmx6g -jar picard.jar MarkDuplicates \
        I={}.sorted.bam \
        O=sra1/{}.nodup.bam \
        M={}.metrics.txt \
        ASSUME_SORTED=true \
        REMOVE_DUPLICATES=true \
        VALIDATION_STRINGENCY=LENIENT"

    # Merge separate SRRs
    samtools merge merged.nodup.bam sra1/*.nodup.bam
    samtools sort merged.nodup.bam -o merged_1.sorted.bam
    samtools index merged_1.sorted.bam

    # Repeat the same process for replicate 2
    cat id_rep2.txt | xargs -I '{}' --max-procs 5 sh -c "bwa mem mm10.ref.fa sra2/{}_1.fastq sra2/{}_2.fastq > sra2/{}.sam"
    cat id_rep2.txt | xargs -I '{}' --max-procs 5 sh -c "samtools view -Sb -F 12 sra2/{}.sam > sra2/{}.bam"
    cat id_rep2.txt | xargs -I '{}' --max-procs 10 sh -c "samtools sort sra2/{}.bam -o {}.sorted.bam"
    cat id_rep2.txt | xargs -I '{}' --max-procs 10 sh -c "samtools index {}.sorted.bam"
    cat id_rep2.txt | xargs -I '{}' --max-procs 5 sh -c "java -Xmx6g -jar picard.jar MarkDuplicates \
        I={}.sorted.bam \
        O=sra2/{}.nodup.bam \
        M={}.metrics.txt \
        ASSUME_SORTED=true \
        REMOVE_DUPLICATES=true \
        VALIDATION_STRINGENCY=LENIENT"
    samtools merge merged_2.nodup.bam sra2/*.nodup.bam
    samtools sort merged_2.nodup.bam -o merged_2.sorted.bam
    samtools index merged_2.sorted.bam
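The filter "-F 12" used with samtools view in the commands above excludes every record in which any bit of the mask 12 is set in the SAM FLAG field, that is, bit 4 (the read is unmapped) or bit 8 (its mate is unmapped). The short Python sketch below mimics that bit test so the effect on individual flag values can be checked; the function name and the list of flag values are chosen only for illustration.

```python
UNMAPPED = 0x4        # the read itself is unmapped
MATE_UNMAPPED = 0x8   # the mate of the read is unmapped

def passes_exclude_filter(flag, exclude_mask=UNMAPPED | MATE_UNMAPPED):
    """Mimic `samtools view -F <mask>`: keep a record only if no mask bit is set."""
    return (flag & exclude_mask) == 0

# A few typical paired-end FLAG values
for flag in (99, 147, 73, 133, 77):
    status = "kept" if passes_exclude_filter(flag) else "dropped"
    print(flag, status)
```

Only pairs in which both ends are mapped survive this filter, which is what is needed to count contacts between two loci.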

2.3 Data Formats for Contact Maps

Here we list the formats of Hi-C contact maps that are compatible with at least one of these packages, and provide the code to obtain some of these formats from the FASTQ format.

1. Squared matrix: The number of rows or columns of the matrix is the number of bins on a chromosome, i.e., the length of the chromosome divided by the resolution, rounded up to the nearest integer. The entries in the matrix can be either raw counts or normalized data after ICE normalization [12]. A 5 × 5 diagonal submatrix of the G1E-ER4 rep1 sample is shown below.

    230    64    12    13    13
     64   290    45    26    20
     12    45   155    51    27
     13    26    51   147    41
     13    20    27    41   204

2. 3-column matrix indexed by genomic coordinates: The file includes three tab-delimited columns. Each row represents one interaction between two bins on the chromosome, where the first two entries are the genomic coordinates (i.e., bin indices × resolution) of the upstream and downstream bins and the third entry is the read count of the interaction. The top 5 lines of a dataset with 40 kb resolution are shown below.

    0    0         20
    0    40000     18
    0    80000     13
    0    120000    5
    0    160000    3


3. 3-column matrix indexed by bin numbers: The file includes three tab-delimited columns. Each row represents one interaction, where the first two entries are the bin indices of the upstream and downstream bins and the third entry is the read count of the interaction. The top 5 lines of a dataset in this format are shown below.

    0    0    20
    0    1    18
    0    2    13
    0    3    5
    0    4    3

4. 5-column matrix indexed by chromosome number and genome coordinates: The file includes five tab-delimited columns. Each row represents one interaction, where entries 1–2 are the chromosome and the genomic coordinate of the first interacting bin, entries 3–4 are the chromosome and the genomic coordinate of the second interacting bin, and the last entry is the strength of the interaction. The top 5 lines of an example file, which contains the contact map of chromosome 18, are shown below.

    18    3040000    18    3040000    245
    18    3080000    18    3040000    69
    18    3080000    18    3080000    285
    18    3120000    18    3040000    29
    18    3120000    18    3080000    67

5. .cool file: The .cool format is a sparse, compressed, binary persistent storage format for Hi-C contact matrices based on the HDF5 file format, developed by the Mirny lab [13].

6. .hic file: .hic is a data format for efficient storage of Hi-C contact matrices, developed by Durand et al. [14].

2.4 Converting .bam Files into Matrix Formats

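We provide the code to convert the .bam files of contact maps to several matrix formats.

1. The script below converts .bam files into the squared matrices at 40 kb resolution for chr18.

    # Generate squared contact matrix from bam file using get_matrix.pl
    # Usage: ./get_matrix.pl <bam file> <bin size> <chr> <genome size file>
    # Example: generate squared contact matrix of 40k resolution for chr18
    # (the region queries in the script need a sorted, indexed BAM and the .fai index built earlier)
    perl ./get_matrix.pl merged_1.sorted.bam 40000 chr18 mm10.ref.fa.fai > chr18.40k.matrix

Below is the Perl script get_matrix.pl

    #!/usr/bin/perl
    use strict;

    MAIN : {
        my ($bam_file, $bin_size, $chr, $genome_size_file) = @ARGV;
        if ((not defined $bam_file) || (not defined $bin_size) ||
            (not defined $chr) || (not defined $genome_size_file)) {
            die ("Usage: ./get_matrix.pl <bam file> <bin size> <chr> <genome size file>\n");
        }

        # read genome size file (chromosome name and length in the first two columns)
        my %genome_size;
        open(FILE, $genome_size_file) || die("could not open file ($genome_size_file)\n");
        while (my $line = <FILE>) {
            chomp $line;
            my ($chr, $size) = split(/\t/, $line);
            $genome_size{$chr} = $size;
        }
        close(FILE) || die("could not close file ($genome_size_file)\n");

        # one output row per bin: count the mates that fall into every other bin
        for(my $left = 0 ; $left < $genome_size{$chr} ; $left += $bin_size) {
            my $query = $chr . ":" . $left . "-" . ($left + $bin_size - 1);
            print STDERR $query . "\n";
            my @counts;
            # NOTE: the printed listing is incomplete between "$i" and "1000))".
            # The lines from here to the "if" test are a reconstruction (an assumption):
            # zero the counters, then stream the reads of the region through samtools.
            for(my $i = 0 ; $i < $genome_size{$chr} / $bin_size ; $i++) {
                $counts[$i] = 0;
            }
            open(SAM, "samtools view $bam_file $query |") || die("could not run samtools\n");
            while (my $line = <SAM>) {
                chomp $line;
                my @fields = split(/\t/, $line);
                # SAM columns 7-9: mate reference name, mate position, template length
                my ($chr_to, $loc_to, $dist) = @fields[6, 7, 8];
                if (($chr_to eq "=") && (abs($dist) > 1000)) {
                    my $bin_to = int ($loc_to / $bin_size);
                    $counts[$bin_to]++;
                }
            }
            close(SAM);
            print(join("\t", $chr, $left, $left + $bin_size, @counts) . "\n");
        }
    }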

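2. The Python script below converts the squared matrix into the 3-column matrix format.

    # Generate 3-column matrix from squared contact matrix using matto3col.py
    # Usage: python matto3col.py <squared matrix file> <resolution>
    # Example: generate 3-column contact matrix of 40k resolution for chr18_40k.mat file
    python matto3col.py chr18_40k.mat 40000

Below is the Python script matto3col.py

    #!/usr/bin/env python
    import sys
    from sys import argv
    import numpy as np
    import pandas as pd

    file = str(argv[1])
    res = int(argv[2])
    df = pd.read_table(file, header=None)

    # drop the three leading columns (chromosome, bin start, bin end)
    square = df.iloc[:, 3:]
    square.index = np.arange(1, len(square) + 1)
    square.columns = np.arange(1, len(square) + 1)

    # long format: one row per (x, y) bin pair, keeping non-zero contacts only
    long_form = square.stack().rename_axis(['x', 'y']).reset_index(name='val')
    nonzero = long_form[long_form['val'] > 0]
    upper = nonzero.loc[nonzero['x'] <= nonzero['y']].copy()

    # NOTE: the printed listing breaks off after "ll['x']". The comparison above and
    # the lines below are a reconstruction (an assumption): convert the 1-based bin
    # numbers into genomic coordinates and write tab-delimited rows to standard output.
    upper['x'] = (upper['x'] - 1) * res
    upper['y'] = (upper['y'] - 1) * res
    upper.to_csv(sys.stdout, sep='\t', header=False, index=False)

3. The commands below convert the 3-column format into the 5-column format by adding the chromosome number in front of each coordinate.

    # The Rep1 command is reconstructed by symmetry with the Rep2 command; its input file name is an assumption
    cat G1E-ER4-uninduced.Rep1.chr18.txt | awk '{$1="18 "$1; $2="18 "$2; print}' > G1E_rep1_chr18.txt
    cat G1E-ER4-uninduced.Rep2.chr18.txt | awk '{$1="18 "$1; $2="18 "$2; print}' > G1E_rep2_chr18.txt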
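The formats described in Subheading 2.3 are related by simple index arithmetic: a genomic coordinate divided by the resolution gives the bin index, and the number of bins is the chromosome length divided by the resolution, rounded up. The Python sketch below goes in the opposite direction to matto3col.py and rebuilds a squared matrix from 3-column records indexed by genomic coordinates. The function names and the chromosome length of 200 kb are hypothetical; the five records are the example rows listed in Subheading 2.3.

```python
import math
import numpy as np

def n_bins(chrom_length, resolution):
    """Number of bins: chromosome length / resolution, rounded up."""
    return math.ceil(chrom_length / resolution)

def three_col_to_square(records, chrom_length, resolution):
    """Build a symmetric squared matrix from (coord1, coord2, count) records."""
    n = n_bins(chrom_length, resolution)
    mat = np.zeros((n, n))
    for coord1, coord2, count in records:
        i, j = coord1 // resolution, coord2 // resolution   # coordinates -> bin indices
        mat[i, j] = count
        mat[j, i] = count                                   # mirror across the diagonal
    return mat

# Example rows of the 3-column format at 40 kb resolution
records = [(0, 0, 20), (0, 40000, 18), (0, 80000, 13), (0, 120000, 5), (0, 160000, 3)]
# Hypothetical chromosome length, chosen so that the example fits in 5 bins
mat = three_col_to_square(records, chrom_length=200000, resolution=40000)
print(mat)
```

Only one triangle of the matrix needs to be stored in the 3-column formats because intra-chromosomal contact maps are symmetric.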

3 Assessing Reproducibility Scores

3.1 HiCRep

3.1.1 System Requirements and Installation

Hardware requirement: Linux, MacOS, or Windows system.
Software requirement: R version >3.3.0.
Installation: HiCRep can be installed in three ways by running the following commands in R.

1. Install from GitHub:

    devtools::install_github("qunhualilab/hicrep")

2. Install from Bioconductor:

    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("hicrep")

3. Install from source: Download the source package from GitHub (current version 'hicrep_1.11.0.tar.gz'):
Link: https://github.com/qunhualilab/hicrep/raw/master/hicrep_1.11.0.tar.gz

    install.packages("/PATH/TO/SOURCE/hicrep_1.11.0.tar.gz", repos = NULL, type = "source")

3.1.2 Input Format for Contact Maps

HiCRep takes a pair of intra-chromosomal contact matrices as the input. Each matrix contains the contacts from a single chromosome. HiCRep can take the following input formats.

1. Squared matrix format: This is the default format for HiCRep. It can read the files in this format directly.

2. 3-column matrix indexed by bin numbers: This format can be converted to the squared matrix format using the bed2mat() function in HiCRep.

    # hic.bin.txt is a 3-column contact matrix indexed by bin numbers
    bed