Methods in Molecular Biology 2242
Alessio Mengoni Giovanni Bacci Marco Fondi Editors
Bacterial Pangenomics Methods and Protocols Second Edition
METHODS IN MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
Bacterial Pangenomics Methods and Protocols Second Edition
Edited by
Alessio Mengoni, Giovanni Bacci, and Marco Fondi Department of Biology, University of Florence, Sesto Fiorentino, Firenze, Italy
Editors Alessio Mengoni Department of Biology University of Florence Sesto Fiorentino, Firenze, Italy
Giovanni Bacci Department of Biology University of Florence Sesto Fiorentino, Firenze, Italy
Marco Fondi Department of Biology University of Florence Sesto Fiorentino, Firenze, Italy
ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-1098-5 ISBN 978-1-0716-1099-2 (eBook) https://doi.org/10.1007/978-1-0716-1099-2 © Springer Science+Business Media, LLC, part of Springer Nature 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface

Bacterial genomics has become, over the last 10 years, a mature research field, with contributions from ecologists, geneticists, bacteriologists, molecular biologists, and evolutionary biologists. In 1999, Carl Woese wrote "Genome sequencing has come of age, and genomics will become central to microbiology's future. It may appear at the moment that the human genome is the main focus and primary goal of genome sequencing, but do not be deceived. The real justification in the long run, is microbial genomics" [1]. Indeed, microbial genomics, especially prokaryotic genomics, is now central in the life sciences, spanning from environmental studies to medical, agricultural, and industrial applications. The discovery of the importance of microbial communities, of their tremendous diversity, and of their impact on the life of multicellular organisms, including humans, has led to the emergence of new concepts, including evolutionary interpretations of biotic relationships. The concurrent improvement of sequencing technologies is allowing the interest to shift from the mere assembly/reconstruction and description of genomes and metagenomes to their functional interpretation. Under these premises, this second edition of the book Bacterial Pangenomics, belonging to the Methods in Molecular Biology series, became a challenge: its aim is to propose to readers a selection of up-to-date methods that are relevant and can bring forth novel discoveries in such a rapidly evolving field. Consequently, the book has been completely renewed with respect to the previous edition, paying special attention to the technical and computational improvements, to the methods for bacterial pangenome analysis that rely on microbiome studies and metagenomic data, and to the need for computational methods to be understandable to every researcher in the field rather than remaining a domain restricted to bioinformaticians. This book has been organized into five main parts, moving from up-to-date sequencing methods (Part I) to methods for deep phylogenetic analysis (Part II), to the central role of metagenomic data in understanding the genomics of the many yet uncultured bacteria (Part III), and to the current progress in genome-to-phenome inference (Part IV). The book ends with two chapters devoted to promoting the diffusion of computational genomic tools among graduate and undergraduate students (Part V). The aim of the present book is then, as for the previous edition, to serve as a "field guide" both for qualified investigators in bacterial genomics and for less experienced researchers (including students and teachers) who need references for approaching genomic analysis and genome data.

Sesto Fiorentino, Firenze, Italy
Alessio Mengoni Giovanni Bacci Marco Fondi
Reference 1. Woese C (1999) The quest for Darwin’s grail. ASM News 65:260–263
Contents

Preface
Contributors

Part I  Opportunities from Novel Sequencing Technologies

1  PacBio-Based Protocol for Bacterial Genome Assembly
   Agata Motyka-Pomagruk, Sabina Zoledowska, Michal Kabza, and Ewa Lojkowska
2  The Illumina Sequencing Protocol and the NovaSeq 6000 System
   Alessandra Modi, Stefania Vai, David Caramelli, and Martina Lari

Part II  Pangenomics of Cultured Isolates

3  Comparative Analysis of Core and Accessory Genes in Coexpression Network
   Biliang Zhang, Jian Jiao, Pan Zhang, Wen-Jing Cui, Ziding Zhang, and Chang-Fu Tian
4  Inferring Core Genome Phylogenies for Bacteria
   Alexander Keller and Markus J. Ankenbrand
5  Inferring Phylogenomic Relationship of Microbes Using Scalable Alignment-Free Methods
   Guillaume Bernard, Timothy G. Stephens, Raúl A. González-Pech, and Cheong Xin Chan
6  Fast Phylogeny Reconstruction from Genomes of Closely Related Microbes
   Bernhard Haubold and Fabian Klötzl
7  Comparative Genomics, from the Annotated Genome to Valuable Biological Information: A Case Study
   Sabina Zoledowska, Agata Motyka-Pomagruk, Agnieszka Misztak, and Ewa Lojkowska

Part III  Dark Matter Pangenomics

8  Accurate Annotation of Microbial Metagenomic Genes and Identification of Core Sets
   Chiara Vanni
9  Metagenomic Assembly: Reconstructing Genomes from Metagenomes
   Zhang Wang, Jie-Liang Liang, Li-Nan Huang, Alessio Mengoni, and Wen-Sheng Shu
10 Genome Recovery, Functional Profiling, and Taxonomic Classification from Metagenomes
   Davide Albanese and Claudio Donati
11 Functional Metagenomics for Identification of Antibiotic Resistance Genes (ARGs)
   Francesca Di Cesare
12 Host Trait Prediction from High-Resolution Microbial Features
   Giovanni Bacci

Part IV  Progresses in Genome-to-Phenome Inference

13 Phylogenetic Methods for Genome-Wide Association Studies in Bacteria
   Xavier Didelot
14 Simple, Reliable, and Time-Efficient Manual Annotation of Bacterial Genomes with MAISEN
   Mikolaj Dziurzynski, Przemyslaw Decewicz, Karol Ciuchcinski, Adrian Gorecki, and Lukasz Dziewit

Part V  Cookbook for Pangenomics

15 A Compendium of Bioinformatic Tools for Bacterial Pangenomics to Be Used by Wet-Lab Scientists
   Camilla Fagorzi and Alice Checcucci
16 A Protocol for Teaching Basic Next Generation Sequencing (NGS) Analysis Skills to Undergraduate Students Using Bash and R
   Marco Fondi and Giovanni Bacci

Index
Contributors

DAVIDE ALBANESE • Unit of Computational Biology, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Italy
MARKUS J. ANKENBRAND • Center for Computational and Theoretical Biology, Biocenter, University of Würzburg, Würzburg, Germany; Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany; Comprehensive Heart Failure Center, University Hospital Würzburg, Würzburg, Germany
GIOVANNI BACCI • Department of Biology, University of Florence, Sesto Fiorentino, Italy
GUILLAUME BERNARD • Sorbonne Universités, UPMC Université Paris 06, Institut de Biologie Paris-Seine (IBPS), Paris, France
DAVID CARAMELLI • Department of Biology, University of Firenze, Firenze, Italy
CHEONG XIN CHAN • Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia; School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia; Australian Centre for Ecogenomics, The University of Queensland, Brisbane, QLD, Australia
ALICE CHECCUCCI • Department of Agricultural and Food Science, University of Bologna, Bologna, Italy
KAROL CIUCHCINSKI • Faculty of Biology, Institute of Microbiology, Department of Environmental Microbiology and Biotechnology, University of Warsaw, Warsaw, Poland
WEN-JING CUI • State Key Laboratory of Agrobiotechnology, and College of Biological Sciences, China Agricultural University, Beijing, China
PRZEMYSLAW DECEWICZ • Faculty of Biology, Institute of Microbiology, Department of Environmental Microbiology and Biotechnology, University of Warsaw, Warsaw, Poland
FRANCESCA DI CESARE • Magnetic Resonance Center (CERM), University of Florence, Sesto Fiorentino, Italy; Department of Biology, University of Florence, Florence, Italy
XAVIER DIDELOT • School of Life Sciences and Department of Statistics, University of Warwick, Coventry, UK
CLAUDIO DONATI • Unit of Computational Biology, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Italy
LUKASZ DZIEWIT • Faculty of Biology, Institute of Microbiology, Department of Environmental Microbiology and Biotechnology, University of Warsaw, Warsaw, Poland
MIKOLAJ DZIURZYNSKI • Faculty of Biology, Institute of Microbiology, Department of Environmental Microbiology and Biotechnology, University of Warsaw, Warsaw, Poland
CAMILLA FAGORZI • Department of Biology, University of Florence, Florence, Italy
MARCO FONDI • Department of Biology, University of Florence, Sesto Fiorentino, Italy
RAÚL A. GONZÁLEZ-PECH • Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
ADRIAN GORECKI • Faculty of Biology, Institute of Microbiology, Department of Environmental Microbiology and Biotechnology, University of Warsaw, Warsaw, Poland
BERNHARD HAUBOLD • Research Group Bioinformatics, Max-Planck-Institut für Evolutionsbiologie, Plön, Germany
LI-NAN HUANG • School of Life Sciences, Sun Yat-Sen University, Guangzhou, Guangdong Province, China
JIAN JIAO • State Key Laboratory of Agrobiotechnology, and College of Biological Sciences, China Agricultural University, Beijing, China
MICHAL KABZA • Department of Integrative Genomics, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland
ALEXANDER KELLER • Center for Computational and Theoretical Biology, Biocenter, University of Würzburg, Würzburg, Germany; Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany
FABIAN KLÖTZL • Research Group Bioinformatics, Max-Planck-Institut für Evolutionsbiologie, Plön, Germany
MARTINA LARI • Department of Biology, University of Firenze, Firenze, Italy
JIE-LIANG LIANG • Institute of Ecological Science, School of Life Science, South China Normal University, Guangzhou, Guangdong Province, China
EWA LOJKOWSKA • Department of Plant Protection and Biotechnology, Intercollegiate Faculty of Biotechnology University of Gdansk & Medical University of Gdansk, University of Gdansk, Gdansk, Poland
ALESSIO MENGONI • Department of Biology, University of Florence, Florence, Italy
AGNIESZKA MISZTAK • Department of Plant Protection and Biotechnology, Intercollegiate Faculty of Biotechnology University of Gdansk & Medical University of Gdansk, University of Gdansk, Gdansk, Poland
ALESSANDRA MODI • Department of Biology, University of Firenze, Firenze, Italy
AGATA MOTYKA-POMAGRUK • Department of Plant Protection and Biotechnology, Intercollegiate Faculty of Biotechnology University of Gdansk & Medical University of Gdansk, University of Gdansk, Gdansk, Poland
WEN-SHENG SHU • Institute of Ecological Science, School of Life Science, South China Normal University, Guangzhou, Guangdong Province, China; Guangdong Magigene Biotechnology Co. Ltd., Guangzhou, Guangdong Province, China
TIMOTHY G. STEPHENS • Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
CHANG-FU TIAN • State Key Laboratory of Agrobiotechnology, and College of Biological Sciences, China Agricultural University, Beijing, China
STEFANIA VAI • Department of Biology, University of Firenze, Firenze, Italy
CHIARA VANNI • Max Planck Institute for Marine Microbiology, Bremen, Germany; Jacobs University, Bremen, Germany
ZHANG WANG • Institute of Ecological Science, School of Life Science, South China Normal University, Guangzhou, Guangdong Province, China
BILIANG ZHANG • State Key Laboratory of Agrobiotechnology, and College of Biological Sciences, China Agricultural University, Beijing, China
PAN ZHANG • State Key Laboratory of Agrobiotechnology, and College of Biological Sciences, China Agricultural University, Beijing, China
ZIDING ZHANG • State Key Laboratory of Agrobiotechnology, and College of Biological Sciences, China Agricultural University, Beijing, China
SABINA ZOLEDOWSKA • Department of Plant Protection and Biotechnology, Intercollegiate Faculty of Biotechnology University of Gdansk & Medical University of Gdansk, University of Gdansk, Gdansk, Poland; Institute of Biotechnology and Molecular Medicine, Gdansk, Poland
Part I Opportunities from Novel Sequencing Technologies
Chapter 1
PacBio-Based Protocol for Bacterial Genome Assembly
Agata Motyka-Pomagruk, Sabina Zoledowska, Michal Kabza, and Ewa Lojkowska

Abstract
Acquisition of high-quality bacterial genomes is fundamental when the aim is to investigate subtle intraspecies variation or to develop sensitive species-specific tools for detection and identification of pathogens. In this view, Pacific Biosciences technology is highly attractive, given the over 10,000 bp length of the generated reads. In this work, we describe a bacterial genome assembly pipeline based on open-source software that can also be handled by non-bioinformaticians interested in transforming sequencing data into reliable biological information. With the use of this method, we successfully closed six Dickeya solani genomes, while the assembly process was run on just a slightly upgraded desktop computer.

Key words: Next generation sequencing, Whole-genome sequencing, Single-molecule real-time sequencing, Soft rot bacteria, Dickeya spp., Pectobacterium spp.
1 Introduction

High throughput and low cost are the hallmarks of next generation sequencing (NGS) methods, which replaced Sanger-based approaches in numerous studies on whole bacterial genomes. In early 2011, the Pacific Biosciences (PacBio) RS sequencer was released, together with Ion Torrent's PGM and the Illumina MiSeq platforms [1]. The first technique exploits DNA polymerase as a single-molecule real-time (SMRT) sequencing engine, yielding significantly longer reads (approx. 10,000 bp) than the other above-mentioned methods. Besides, tracking the kinetic parameters of each individual enzyme at base-pair resolution opens perspectives for studying base methylation patterns, polymerase inhibitors, or DNA-binding proteins [2]. In SMRT technology, the insight into single-nucleotide incorporations is achieved within a zero-mode waveguide (ZMW), a nanophotonic visualization compartment operating at detection volumes of 100 zeptoliters [3]. As depicted in
Fig. 1, DNA polymerase attached to the ZMW builds a strand complementary to the DNA matrix from fluorescent phospholinked nucleotides in the presence of a primer. Owing to their different exposure times to the ZMW, fluorescently labeled nucleotides that merely approach the ZMW are easily discriminated from the matching tagged dNTPs that are actually processed. A notable peak in the fluorescence signal is recorded when the α-β phosphodiester bond is cleaved with the release of the fluorophore. The use of four distinct fluorescent labels makes it possible to decipher which nucleotide, namely dATP, dTTP, dGTP, or dCTP, has been introduced at a given position in the newly synthesized strand. The polymerase continues elongation of the nascent DNA after having shifted by one base [2, 3].

Fig. 1 The principle of PacBio sequencing technology. The polymerase immobilized at the bottom of the ZMW incorporates fluorescent phospholinked nucleotides into the nascent DNA strand. Attachment of each correct nucleotide is associated with a specific light emission signal followed by liberation of the fluorophore. Subsequently, the enzyme shifts to the next nucleotide until the end of the template strand is reached. Brown arrows mark the excitation light, while blue and yellow arrows refer to the emission spectra, corresponding in this case to T and G, respectively

Apart from the already mentioned dropping prices, the higher availability of NGS platforms has also contributed to the constantly increasing number of studies targeting the acquisition of whole-genome information. For instance, there were 5,800 sequencing projects registered in The Genomes Online Database (GOLD) in 2009 [4], in contrast to the current popularity of such efforts, which reached 317,275 projects in October 2019. Still, the vast majority of the genomes deposited in publicly available databases are in low-quality draft states, as the finishing steps are considered problematic, time-consuming, and laborious [5]. However, if the research is aimed at sequencing and subsequent comparison of the genomes of multiple strains classified within a
certain species, high-quality genomic assemblies are required to reveal subtle intraspecies variation [6]. The scientific interest of our group is focused on phytopathogens, especially pectinolytic plant pathogenic bacteria classified into the genera Dickeya and Pectobacterium within the recently established family Pectobacteriaceae [7]. These microorganisms cause soft rot on diverse crops (especially potato tubers), vegetables, and ornamentals, in addition to blackleg symptoms that are restricted to potato plants [8]. Significant economic losses, for example, reaching 30 million euro each year just in the seed potato production sector [9], result either from tuber tissue decay or from rejection/downgrading of the latently infected planting material. During our previous studies, we collected a pool of pectinolytic bacteria belonging to the Dickeya solani and Pectobacterium parmentieri species that differed notably in their virulence, both in the capacity to macerate plant tissue and in the ability to produce virulence factors [10–12]. Therefore, we decided to incorporate comparative genomic approaches to screen the unique and accessory pangenome fractions of two recently established pectinolytic species, D. solani and P. parmentieri, for the presence of genes discriminating highly virulent from low virulent strains [13, 14]. Regarding D. solani, a tremendous effort has been made in order to propose a genome assembly pipeline yielding sequences of sufficient quality for the intended downstream applications. A previous attempt, taking advantage of both 454 pyrosequencing (191,539 reads) and PacBio SMRT (118,344 reads) technologies, led to the acquisition of a draft genomic sequence of the D. solani IFB0099 strain that consisted of 97 contigs [15]. In that study, 454 reads were converted to FASTQ format and subsequently trimmed with the use of StreamingTrim version 1.0 [16]. Hybrid assembly of the trimmed 454 reads with PacBio data was achieved with pacBioToCA and the runCA program of the Celera assembler [17]. Prokka version 1.7.2 [18] was applied for functional annotation [15]. However, the resulting low-accuracy draft status of the genome turned out to be insufficient for the needs of the planned comparative genomic study. Therefore, we developed a novel genome assembly pipeline proven effective on ten D. solani strains (Table 1). It required obtaining solely PacBio reads (Table 2), took advantage of open-source software, and ran on just a slightly upgraded desktop computer handled by non-bioinformaticians. In this way we closed six D. solani genomes, while the others remain in just a few scaffolds (Table 3). As D. solani is a highly homogeneous species, it might also be interesting for other researchers to look up the alternative hybrid approach (based on PacBio and Illumina reads) that we incorporated for assembling the genomes of P. parmentieri [13], which is a much more heterogeneous species.
Table 1  Dickeya solani strains for which the herein described PacBio reads-based genome assembly pipeline was used

Strain no. | Country of isolation, year | Origin | References
IFB0099 (IPO2276, LMG28824) | Poland, 2005 | Potato, cv. Santa | [19, 20]
IFB0167 | Poland, 2009 | Potato, cv. Fresco | [11]
IFB0212 | Poland, 2010 | Potato | [21]
IFB0223 (457) (a) | Germany, 2005 | Rhizosphere of potato | [10]
IFB0231 (VIC-BL-25) | Finland, NA | Potato, cv. Victoria | [22]
IFB0311 | Poland, 2011 | Potato, cv. Innovator | [11]
IFB0417 | Portugal, 2012 | Potato, cv. Lady Rosetta | [23]
IFB0421 | Portugal, 2012 | Potato, cv. Lady Rosetta | [23]
IFB0487 | Poland, 2013 | Potato, cv. Vineta | [23]
IFB0695 | Poland, 2014 | Potato, cv. Arielle | [23]

IFB: bacterial strains collection of the Intercollegiate Faculty of Biotechnology University of Gdansk and Medical University of Gdansk, Gdansk, Poland. Numbers attributed to the analyzed D. solani strains in other bacterial collections are given in parentheses.
(a) The genomes of these two strains have been sequenced in the frame of a former project by the BaseClear (Leiden, The Netherlands) company (see Note 2).
2 Materials

2.1 General Purpose
NanoDrop 1000 Spectrophotometer (Thermo Scientific, USA).
2.2 Biological Material
Bacterial biomass (the strains analyzed herein are listed in Table 1) or 8 μg of high molecular RNA-free DNA (see Note 1).
2.3 Computer Hardware
System Linux Mint version 17.3 Cinnamon 64-bit, Memory 62.8 GiB, Hard Drives 190 GB, Processor Intel Core i7-5820K 3.30GHz x 6, NVIDIA Corporation GM206 GeForce GTX 960.
2.4 Computer Software
SMRT Analysis version 2.3, Canu version 1.5 [24], Quiver [25], Prokka version 1.12 [18], samtools, pbalign.
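Before starting the analysis, it can be convenient to confirm that the command-line tools listed above are installed and reachable from the shell. The short sketch below is not part of the original protocol; it only asks each tool for its version string and assumes the executables are available on the PATH (SMRT Analysis is a larger suite and is not covered by this check; version flags may differ slightly between releases).

```bash
#!/usr/bin/env bash
# Minimal environment check for the command-line tools used in this protocol.
set -u

canu -version                      # long-read assembler (expected here: version 1.5)
samtools --version | head -n 1     # FASTA/BAM utilities (expected here: 1.4.1)
pbalign --version                  # PacBio read aligner used before polishing
quiver --version                   # GenomicConsensus polishing algorithm
prokka --version                   # prokaryotic genome annotator (expected here: 1.12)
```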
3 Methods
3.1 Obtaining of PacBio Reads
• Bacterial biomass of 8 D. solani strains, spread in a reductive manner on Lysogeny Agar medium, was sent to GATC Biotech (Konstanz, Germany) (see Note 2) for multiplication of this material in Lysogeny Broth medium at 28 °C, cell disruption/lysis, isolation of high-molecular-weight DNA (see Note 1) with Maxi Prep Qiagen (Hilden, Germany), and quantification and quality control of the isolated genomic DNA.

Table 2  PacBio reads generated for the studied Dickeya solani genomes. For each strain, the table reports the number of reads, the number of bases (bp), the read length (bp), and the N50 of the prefiltered reads, the postfiltered reads, and the filtered subreads. Characteristics of the reads generated for the IFB0099 and IFB0223 strains by the BaseClear (Leiden, The Netherlands) company are listed in Note 2. 300,584 reads were generated by 2 SMRT cells/2 movies, while 150,292 reads were obtained from 1 SMRT cell/1 movie. N50: minimum length of contigs covering half of the assembly bases.

Table 3  Statistics of the assembled Dickeya solani genomes. For each genome, the table reports assembly statistics (number of scaffolds, number of N bases, genome size (bp), largest contig (bp), N50, L50, and %GC) and annotation statistics (number of genes and number of genes encoding proteins, rRNA, tRNA, and tmRNA). None of the assemblies contains N bases; the assemblies of IFB0099, IFB0167, IFB0223, IFB0231, IFB0417, and IFB0421 consist of a single scaffold, while IFB0212, IFB0311, IFB0487, and IFB0695 consist of 2, 3, 4, and 7 scaffolds, respectively. Genome sizes range from about 4.88 to 4.94 Mb, with a GC content of approximately 56.2% in all cases. N50: minimum length of contigs covering half of the assembly bases. L50: number of contigs containing half of the assembled genome.

Fig. 2 Pipeline for generation of PacBio reads. DNA is fragmented. Then fragments of the selected size are repaired and ligated to SMRTbell adapters. The primers and subsequently the DNA polymerase anneal to the SMRTbell templates. After binding to ZMWs, the SMRTbell templates are sequenced on the PacBio RS II platform. The generated reads are deprived of adapters and of short and low-quality sequences to form a set of filtered subreads
• Subsequently, a standard genomic PacBio library was prepared (Fig. 2, see Note 3), involving fragmentation of DNA (10,000 bp, see Notes 4 and 5), size selection, repair of DNA damage (see Note 6) and of DNA ends (see Note 7), adapter ligation (see Note 8), annealing of the primer to the SMRTbell templates (see Note 9), and annealing of the polymerase to the SMRTbell templates (see Note 10) [26, 27].
• SMRTbell templates are incorporated into ZMWs (see Note 11).
• Sequencing on the PacBio RS II platform proceeds with the use of the raw data package (see Note 12, Fig. 1).
3.2 Primary Analysis: Quality Control and Adapter Trimming with SMRT Analysis
• The reads generated by the PacBio RS II instrument were processed and filtered with the SMRT Analysis version 2.3 software (see Note 13) to generate the subreads (Fig. 2; Table 2). The Data Management and SMRT Analysis modules are compatible with the PacBio RS II data of interest.
3.3 Assembly of PacBio Data
• At first, all the FASTQ files were merged (see Note 14).
• Then Canu version 1.5 [24], the successor of the Celera assembler, was used for further correction, trimming, and subsequent assembly of the PacBio reads. This software is based on a hierarchical strategy and therefore profits from multiple rounds of read overlapping in order to increase the quality of the single-molecule reads before performing the assembly process (see Note 15).

3.4 Polishing the Assembled Genomes with PacBio Data
• First, the pacbio.fofn file (see Note 16) containing the paths to all .bas.h5 files (see Note 17) present in the pacbio folder was created.
• Second, the data were converted with samtools (version 1.4.1) faidx and pbalign [28] (see Notes 18 and 19).
• For consensus generation and variant calling, Quiver [25] was applied (see Note 20).

3.5 Functional Annotation of the Genomes
• Functional annotation was accomplished with Prokka version 1.12 [18] (see Note 21; Table 3).
4 Notes

1. DNA concentration should exceed 80 ng/μl, with a maximum volume of 100 μl. High DNA purity is required, meaning OD 260/280 ≥ 1.8 and OD 260/230 ≥ 1.9. It is important that the sample does not contain impurities, such as biological macromolecules (RNA, proteins, polysaccharides, lipids), chelating agents (e.g., EDTA), divalent metal cations (Mg2+), any detergents (SDS, Triton X-100), or denaturants (e.g., phenol, guanidinium salts), and that it has been suspended in RNase-, DNase-, and protease-free Tris–HCl buffer (pH 8.0–8.5). Usually, an additional sample (e.g., 20 μl) of the DNA elution buffer used is requested by the sequencing companies. DNA should not
have been subjected to multiple freeze-thawing cycles, high temperatures (e.g., >65 °C for over 1 h), UV light, or pH extremes (<6 or >9). DNA should preferably have been extracted with the use of commercial kits, as organic extraction methods (involving phenol or TRIzol™) may inhibit the enzymes utilized during library preparation. DNA may be shipped at room temperature; however, transfer of refrigerated material is highly recommended [26, 27].
2. Regarding two D. solani strains, IFB0099 and IFB0223, high-quality genomic dsDNA was sent, in the frame of a previous project, to another sequencing company, BaseClear (Leiden, The Netherlands). For IFB0099 and IFB0223, respectively, the following read statistics were achieved: number of reads, 118,344 and 102,248; sample yield, 395,000,000 bp and 356,000,000 bp; and average read length, 3,340 bp and 3,490 bp.
3. The SMRTbell Template Prep Kit is used for library preparation, while AMPure® PB beads are needed for the purification steps. The template preparation protocol lasts 3–6 h.
4. Characteristics of the introduced DNA sample, namely, organism of origin, quality, amount, and purity, are crucial in terms of the size distribution after fragmentation and the resulting insert size of the genomic PacBio library.
5. Achieved by shearing of DNA. A Covaris S2 or LE 220 System (500 bp to <5000 bp), Covaris g-Tube devices (>6000 bp), or a HydroShear instrument (>5000 bp) might be used.
6. Nicks, abasic regions, thymine dimers, cytosine deamination, blocked 3′ ends, and oxidation damage sites are treated with the use of a DNA-damage repair mix.
7. T4 DNA polymerase is utilized both for filling in 5′ overhangs and for the removal of 3′ overhangs. T4 polynucleotide kinase phosphorylates the 5′ hydroxyl group. Thus, blunt ends are formed.
8. A double-stranded DNA matrix capped by hairpin loops forms the SMRTbell template (Fig. 2). In terms of topology, the SMRTbell template is circular, while regarding structure, it is linear. The ligated adapters protect the ends of the DNA fragments from the action of the exonucleases (exonuclease III, exonuclease VII) utilized for degradation of failed ligation products or adapter dimers in the following AMPure® PB bead purification steps.
9. The sequencing primer-binding sites are located within the loops of the ligated hairpin adapters. The primers hybridize to both ends of the SMRTbell template. Higher stability of the primers is assured by 2′-methoxy modifications.
10. For binding of the DNA polymerase to the primer-annealed SMRTbell templates, incubation at 30 °C for 30 min is performed.
11. Efficacy of loading SMRTbell templates into ZMWs depends on the insert size: SMRTbell templates containing smaller fragments load more efficiently than those including larger inserts. Libraries larger than 1000 bp are immobilized at the bottom of the ZMWs by paramagnetic beads. The MagBead loading procedure starts from hybridization between the poly-A tail of the sequencing primer and the oligo dT on the surface of the magnetic beads. After the washing steps, the SMRTbell-MagBead sample is moved to a 96-well plate, loaded on the instrument, and then put into the SMRT Cell for immobilization. The MagBead station moves the beads around the surface of the SMRT Cell in order to immobilize the SMRTbell templates at the bottom of the ZMWs.
12. Specification: >500 Mb raw data per package (3%), polymerase reads >6000 bp.
13. Read-length (>50), subread-length (>50), and read quality (>0.75) filters were implemented.
14. cat pacbio/*.fastq > pacbio.fastq
15. canu -p canu -d canu genomeSize=4.9m maxMemory=20 maxThreads=12 minOverlapLength=500 -pacbio-raw pacbio.fastq
    (the assembly prefix and output directory are assumed here to be canu, so that the canu/canu.contigs.fasta path used in Notes 18-21 resolves)
16. .fofn refers to "file of file names".
17. These are the main output files generated by the primary analysis pipeline on the PacBio RS II.
18. samtools faidx canu/canu.contigs.fasta
19. pbalign --nproc 12 --forQuiver pacbio.fofn canu/canu.contigs.fasta quiver.cmp.h5
20. quiver -j 12 -r canu/canu.contigs.fasta -o quiver.gff -o quiver.fasta quiver.cmp.h5
21. prokka --cpus 12 --compliant --centre XXX --rfam --force --outdir prokka_1/ --prefix IFB_0099 --genus Dickeya --species solani --strain IFB_0099 genome.fasta
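The individual commands given in Notes 14-21 can also be chained into a single wrapper script. The sketch below is not part of the original protocol; it simply strings the published commands together and assumes the same layout used above, that is, a pacbio/ folder holding the *.fastq and *.bas.h5 files produced by the primary analysis, and canu, samtools, pbalign, quiver, and prokka available on the PATH. The strain name and genome size are placeholders to adapt, and here the polished quiver.fasta (rather than the generic genome.fasta of Note 21) is passed to Prokka.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the commands of Notes 14-21.
set -euo pipefail

STRAIN=IFB_0099     # placeholder strain name, adjust per genome
THREADS=12

# Note 14: merge all PacBio FASTQ files into a single input file
cat pacbio/*.fastq > pacbio.fastq

# Note 16: list the raw .bas.h5 files needed for the polishing step
ls pacbio/*.bas.h5 > pacbio.fofn

# Note 15: correct, trim, and assemble the reads with Canu
canu -p canu -d canu \
     genomeSize=4.9m maxMemory=20 maxThreads=$THREADS minOverlapLength=500 \
     -pacbio-raw pacbio.fastq

# Notes 18-19: index the contigs and align the raw data back to them
samtools faidx canu/canu.contigs.fasta
pbalign --nproc $THREADS --forQuiver pacbio.fofn canu/canu.contigs.fasta quiver.cmp.h5

# Note 20: polish the assembly with Quiver
quiver -j $THREADS -r canu/canu.contigs.fasta -o quiver.gff -o quiver.fasta quiver.cmp.h5

# Note 21: annotate the polished assembly with Prokka
prokka --cpus $THREADS --compliant --centre XXX --rfam --force \
       --outdir prokka_1/ --prefix $STRAIN \
       --genus Dickeya --species solani --strain $STRAIN quiver.fasta
```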
Acknowledgments
All sequencing and comparative genomics research tasks were conducted thanks to funding from the National Science Centre in Poland via 2014/14/M/NZ8/00501 granted to E.L. A.M.P. is supported by the National Science Centre in Poland via 2016/21/N/NZ1/02783.
References 1. Quail M, Smith ME, Coupland P et al (2012) A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13:341. https://doi.org/10.1186/ 1471-2164-13-341 2. Eid J, Fehr A, Gray J et al (2009) Real-time DNA sequencing from single polymerase molecules. Science 323:133–138. https://doi. org/10.1126/science.1162986 3. Korlach J, Bjornson KP, Chaudhuri BP et al (2010) Real-time DNA sequencing from single polymerase molecules. Methods Enzymol 472:431–455. https://doi.org/10.1016/ S0076-6879(10)72001-2 4. Liolios K, Chen I-MA, Mavromatis K et al (2010) The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 38:D346–D354. https://doi.org/10.1093/nar/gkp848 5. Galardini M, Biondi EG, Bazzicalupo M, Mengoni A (2011) CONTIGuator: a bacterial genomes finishing tool for structural insights on draft genomes. Source Code Biol Med 6:11. https://doi.org/10.1186/1751-0473-6-11 6. Tettelin H, Riley D, Cattuto C, Medini D (2008) Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol 11:472–477. https://doi.org/10.1016/J. MIB.2008.09.006 7. Adeolu M, Alnajar S, Naushad S, Gupta RS (2016) Genome-based phylogeny and taxonomy of the ‘Enterobacteriales’: proposal for Enterobacterales ord. nov. divided into the families Enterobacteriaceae, Erwiniaceae fam. nov., Pectobacteriaceae fam. nov., Yersiniaceae fam. nov., Hafniaceae fam. nov., Morganellaceae fam. nov., and Budviciaceae fam. nov. Int J Syst Evol Microbiol 66:5575–5599. https:// doi.org/10.1099/ijsem.0.001485 8. Perombelon MCM, Kelman A (1980) Ecology of the soft rot Erwinias. Annu Rev Phytopathol 18:361–387. https://doi.org/10.1146/ annurev.py.18.090180.002045 9. Toth IK, van der Wolf JM, Saddler G, Lojkowska E, Hellas V, Pirhonen M, Tsror L, Elphinstone JG (2011) Dickeya species: an emerging problem for potato production in Europe. Plant Pathol 60:385–399. https:// doi.org/10.1111/j.1365-3059.2011.02427.x 10. Potrykus M, Golanowska M, HugouvieuxCotte-Pattat N, Lojkowska E (2014) Regulators involved in Dickeya solani virulence, genetic conservation, and functional variability. Mol
Plant-Microbe Interact 27:700–711. https:// doi.org/10.1094/MPMI-09-13-0270-R 11. Potrykus M, Golanowska M, Sledz W, Zoledowska S, Motyka A, Kołodziejska A, Butrymowicz J, Lojkowska E (2016) Biodiversity of Dickeya spp. isolated from potato plants and water sources in temperate climate. Plant Dis 100:408–417. https://doi.org/10.1094/ PDIS-04-15-0439-RE 12. Zoledowska S, Motyka A, Zukowska D, Sledz W, Lojkowska E (2018) Population structure and biodiversity of Pectobacterium parmentieri isolated from potato fields in temperate climate. Plant Dis 102:154–164. https://doi. org/10.1094/PDIS-05-17-0761-RE 13. Zoledowska S, Motyka-Pomagruk A, Sledz W, Lojkowska E (2018) High genomic variability in the plant pathogenic bacterium Pectobacterium parmenieri deciphered from de novo assembled complete genomes. BMC Genomics 19:751. https://doi.org/10.1186/s12864018-5140-9 14. Golanowska M, Potrykus M, MotykaPomagruk A, Kabza M, Bacci G, Galardini M, Bazzicalupo M, Makalowska I, Smalla K, Mengoni A, Hugouvieux-Cotte-Pattat N, Lojkowska E (2018) Comparison of highly and weakly virulent Dickeya solani strains, with a view on the pangenome and panregulon of this species. Front Microbiol 9:1940. https:// doi.org/10.3389/fmicb.2018.01940 15. Golanowska M, Galardini M, Bazzicalupo M, Hugouvieux-Cotte-Pattat N, Mengoni A, Potrykus M, Slawiak M, Lojkowska E (2015) Draft genome sequence of a highly virulent strain of the plant pathogen Dickeya solani, IFB0099. Genome Announc 3:e00109–e00115. https:// doi.org/10.1128/genomeA.00109-15 16. Bacci G, Bazzicalupo M, Benedetti A, Mengoni A (2014) StreamingTrim 1.0: a Java software for dynamic trimming of 16S rRNA sequence data from metagenetic studies. Mol Ecol Resour 14:426–434. https://doi.org/10. 1111/1755-0998.12187 17. Koren S, Schatz MC, Walenz BP et al (2012) Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 30:693–700. https://doi.org/10.1038/ nbt.2280 18. Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069. https://doi.org/10.1093/bio informatics/btu153 19. Slawiak M, Łojkowska E, van der Wolf JM (2009) First report of bacterial soft rot on
potato caused by Dickeya sp. (syn. Erwinia chrysanthemi) in Poland. Plant Pathol 58:794–794. https://doi.org/10.1111/j.1365-3059.2009. 02028.x 20. Slawiak M, van Beckhoven JRCM, Speksnijder AGCL et al (2009) Biochemical and genetical analysis reveal a new clade of biovar 3 Dickeya spp. strains isolated from potato in Europe. Eur J Plant Pathol 125:245–261. https://doi.org/ 10.1007/s10658-009-9479-2 21. Golanowska M, Kielar J, Lojkowska E (2017) The effect of temperature on the phenotypic features and the maceration ability of Dickeya solani strains isolated in Finland, Israel and Poland. Eur J Plant Pathol 147:803–817. https://doi.org/ 10.1007/s10658-016-1044-1 22. Degefu Y, Potrykus M, Golanowska M, Virtanen E, Lojkowska E (2013) A new clade of Dickeya spp. plays a major role in potato blackleg outbreaks in North Finland. Ann Appl Biol 162:231–241. https://doi.org/10. 1111/aab.12020 23. Motyka-Pomagruk A (2019) Genotypic and phenotypic characterization of bacteria from
Dickeya solani species and development of novel control methods against phytopathogens. PhD thesis, University of Gdańsk 24. Koren S, Walenz BP, Berlin K et al (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27:722–736. https://doi.org/10.1101/gr.215087.116 25. Chin CS, Alexander DH, Marks P et al (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 10:563–569. https://doi.org/10.1038/nmeth.2474 26. Pacific Biosciences P/N 000-710-821-13 (2014) Template preparation and sequencing guide. Pacific Biosciences, Menlo Park, CA 27. Pacific Biosciences 100-338-500-01 (2014) Introduction to SMRTbell™ template preparation. Pacific Biosciences, Menlo Park, CA 28. Li H, Handsaker B, Wysoker A et al; 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079
Chapter 2
The Illumina Sequencing Protocol and the NovaSeq 6000 System
Alessandra Modi, Stefania Vai, David Caramelli, and Martina Lari

Abstract
The NovaSeq 6000 is a sequencing platform from Illumina that enables the sequencing of short reads with an output up to 6 Tb. The NovaSeq 6000 uses the typical Illumina sequencing workflow based on library preparation, cluster generation by in situ amplification, and sequencing by synthesis. Flexibility is one of the major features of the NovaSeq 6000. Several types of sequencing kits coupled with dual flow cell mode enable high scalability of sequencing outputs to match a wide range of applications from complete genome sequencing to metagenomics analysis. In this chapter, after explaining how to assemble a normalized pool of libraries for sequencing, we will describe the experimental steps required to run the pools on the NovaSeq 6000 platform.

Key words: Library quality control, Library quantification, Sequencing pool, Standard sequencing workflow, XP sequencing workflow, Run setup
1 Introduction

The NovaSeq 6000 System is a production-scale sequencing platform from Illumina, Inc. The instrument makes use of the Illumina sequencing by synthesis (SBS) chemistry and enables the massively parallel sequencing of billions of DNA fragments in the range of 50–500 bases, with an output up to 6 Tb. Besides high-throughput and cost-effective sequencing, the NovaSeq 6000 offers flexible output and run-time configuration. Multiple flow cell types support a wide output range. Users can choose among four flow cell types (SP, S1, S2, and S4) and different read lengths to easily adjust the output and sample throughput of a sequencing run to a specific project (Table 1). More flexibility is achieved with individual lane loading (Xp workflow): two lanes for SP, S1, and S2 flow cells; four lanes for S4 (Fig. 1). Additionally, the instrument can run one or two flow cells of the same type at a time. Due to its scalability, the NovaSeq 6000 could be applied in several applications spanning from complete genome and exome sequencing to target
Table 1  NovaSeq 6000 flow cells and outputs

Flow cell type | Sequencing output (Gb) | Number of lanes on flow cell | Reads passing filter, single reads (billion) | Reads passing filter, paired-end reads (billion) | Run time (h)
SP 2 × 50 bp | 65–80 | 2 | 0.6–0.8 | 1.3–1.6 | 13
SP 2 × 150 bp | 200–250 | 2 | 0.6–0.8 | 1.3–1.6 | 25
SP 2 × 250 bp | 325–400 | 2 | 0.6–0.8 | 1.3–1.6 | 38
S1 2 × 50 bp | 134–167 | 2 | 1.3–1.6 | 2.6–3.2 | 13
S1 2 × 100 bp | 266–333 | 2 | 1.3–1.6 | 2.6–3.2 | 19
S1 2 × 150 bp | 400–500 | 2 | 1.3–1.6 | 2.6–3.2 | 25
S2 2 × 50 bp | 333–417 | 2 | 3.3–4.1 | 6.6–8.2 | 16
S2 2 × 100 bp | 667–833 | 2 | 3.3–4.1 | 6.6–8.2 | 25
S2 2 × 150 bp | 1000–1250 | 2 | 3.3–4.1 | 6.6–8.2 | 36
S4 2 × 100 bp | 1600–2000 | 4 | 8–10 | 16–20 | 36
S4 2 × 150 bp | 2400–3000 | 4 | 8–10 | 16–20 | 44
Fig. 1 NovaSeq 6000 flow cell types. From left to right: SP, S1, S2, and S4 flow cells, respectively. Lane size and number according to kit output
approaches, transcriptome and gene expression analysis, and shotgun metagenomics. As in other Illumina NGS sequencing systems, the NovaSeq 6000 workflow includes four basic steps:
1. Library preparation. The first step is the conversion of the genomic DNA (gDNA), cDNA, or PCR amplicons into a sequencing library. DNA fragments are coupled with two oligonucleotide adapters (named P5 and P7) carrying specific sequences that enable the subsequent amplification and sequencing steps in the Illumina sequencing systems. During library preparation, short individual "barcode" sequences ("indexes") are also added to the standard adapters in order to allow the simultaneous sequencing of DNA fragments from different samples (multiplexing). Two main different approaches can be applied to obtain sequencing libraries from genomic DNA or cDNA. In the "TruSeq" approach, a sequencing library is generated by random mechanical fragmentation of the DNA sample (usually with an ultrasonicator), followed by 5′ and 3′ adapter ligation [1]. Alternatively, following the "Nextera" workflow, fragmentation and ligation reactions can be combined into a single step employing a transposon/transposase-mediated cleavage mechanism ("tagmentation") [1]. Genomic fragments are subsequently amplified using primers targeted to adaptor sequences. During the library amplification step, individual indexes are incorporated into the adaptors of each sample by means of sample-specific indexing primers; either single or double indexing can be performed in order to increase multiplexing. Several commercial library preparation kits are available. Most of the commercial kits have been optimized for specific applications (e.g., transcriptomics, epigenomics, ChIP-seq, PCR-free library). Alternatively, custom in-house protocols have been developed [2, 3]. When degraded material is analyzed, as in the case of DNA extracted from ancient biological samples, the fragmentation step is usually skipped, and adapters are ligated directly to the ends of each DNA molecule. Amplicon-targeted sequencing, as in a typical microbial metagenomic 16S analysis, can be simply achieved by using user-defined PCR primers with overhang adapters; a subsequent limited-cycle amplification step is performed to add multiplexing indexes and Illumina sequencing adapters [4]. For some applications where huge numbers of targets are required, specific amplicon panels can be designed and purchased. After quality control and quantification, indexed libraries are normalized, pooled at the required molarity, and then loaded into the sequencing platform along with buffers and reagents. Regardless of the original source material and/or the specific library preparation protocol followed, each type of library can be sequenced on virtually all Illumina sequencing platforms. The choice of a specific sequencer and of the appropriate sequencing kit is therefore mainly driven by the required output and cost-effectiveness.
2. Cluster generation. For cluster generation, the library is loaded into a flow cell, a glass slide containing small fluidic channels and a lawn of surface-bound oligonucleotides complementary to the library adapters. The DNA library hybridize to these oligonucleotides, temporarily immobilizing individual DNA fragments onto the flow cell. After library hybridization, polymerases, dNTPs, and buffers are pumped in the flow cell. Each single fragment is then copied several thousand times to generate distinct, clonal clusters through an in-situ amplification process known as “bridge PCR” [5]. When cluster generation is complete, the templates are ready for sequencing. The NovaSeq 6000 makes use of patterned flow cells that contain billions of nanowells at fixed locations across both surfaces of the flow cell [6]. The structured organization provides fixed spacing of sequencing clusters, making the flow cells less susceptible to overloading, and reduces run time (no need to map cluster sites during sequencing). Additionally, higher cluster density leads to more usable data per flow cell. 3. Sequencing. Within each cluster, templates are sequenced by means of Illumina sequencing by synthesis (SBS) chemistry. The SBS chemistry is a reversible terminator–based method using a single base extension and competitive addition of nucleotides [5]. As all four nucleotides are present during each sequencing cycle, the method minimizes incorporation bias and significantly reduces errors and missed calls associated with strings of repeated nucleotides. Illumina sequencing kits are designed to support both single-end and paired-end sequencing. In paired-end sequencing the DNA fragments of each cluster are first sequenced from one end (as in the singleend sequencing) and then from the opposite end. This approach generates high-quality sequence data improving alignment and genome assembly. Additionally, paired-end sequencing can help in identifying authentic reads from endogenous DNA when libraries from ancient and degraded samples are analyzed. During each sequencing cycle, a single florescent labeled deoxynucleotide triphosphate (dNTPs) is added to the nucleic acid chains. The label dyes act as a reversible 30 blockers and serve a terminator for DNA strand synthesis. As each dNTP is added, the fluorescently dye is imaged and then cleaved to allow incorporation of the next nucleotide. At each sequencing cycle, fluorescent emission from each cluster is registered and normalized. Emission wavelength and intensity are then used by control software of the instrument to call the bases. For nucleotide detection, the NovaSeq 6000 makes use of the two-channel SBS method that requires only two images per cycle, instead of four, and accelerates sequencing and data
processing times. Rather than four separate dyes, two-channel SBS uses a mix of dyes. Images are taken of each DNA cluster using only two different wavelength filter bands. Clusters seen in red or green images are interpreted as C and T bases, respectively. Clusters observed in both red and green images (yellow clusters) are recognized as A bases, while unlabeled clusters correspond to G bases (this decision rule is summarized in a short sketch at the end of this section). The NovaSeq 6000 usually produces 75–85% of sequenced bases above a quality score of 30 (Q30), which means a 0.1% probability that a base was called incorrectly.
4. Data analysis. After a sequencing run is completed, data are available for downstream analysis. In order to assign the reads to each different sample, raw data are first "demultiplexed" by means of index recognition. The sequenced reads are then mapped to a reference genome or sequence and analyzed according to specific approaches (e.g., SNP and indel identification, phylogenetic or metagenomic analysis).
While steps 1 (Library Preparation) and 4 (Data Analysis) can be carried out in any molecular biology laboratory or computational facility regardless of the sequencing machine and the run schedule, steps 2 (Cluster Generation) and 3 (Sequencing) are physically executed on the sequencing platform during the run. In this chapter, we will describe the experimental steps required to run a set of libraries on the NovaSeq 6000 sequencing platform (Fig. 2). We will first provide some tips for the following:
(a) To perform library quantification and quality control.
(b) To normalize concentration and prepare an equimolar pool of libraries.
Then, according to the specific manuals of instruction by Illumina [7], we will report the workflow that should be followed for the following steps:
(a) To prepare reagents for cluster generation and sequencing.
(b) To load the pools of libraries and reagents on the flow cell with both standard and XP workflow.
(c) To set the instrument for the run.
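The two-channel decision rule described under step 3 can be written down as a small lookup. The snippet below is only an illustrative sketch, not Illumina software: it works on simplified presence/absence flags for the two channels rather than on real, normalized image intensities.

```bash
#!/usr/bin/env bash
# Toy illustration of two-channel SBS base calling:
# red only -> C, green only -> T, red + green -> A, no signal -> G.
call_base() {
    local red=$1 green=$2          # 1 = signal detected in that channel, 0 = no signal
    if   [[ $red -eq 1 && $green -eq 1 ]]; then echo "A"
    elif [[ $red -eq 1 ]];                 then echo "C"
    elif [[ $green -eq 1 ]];               then echo "T"
    else                                        echo "G"
    fi
}

call_base 1 0   # prints C
call_base 0 1   # prints T
call_base 1 1   # prints A
call_base 0 0   # prints G
```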
2
Materials
2.1 General Equipment and Reagents
Automated electrophoresis system with related reagents and data processing software. Fluorometer with related reagents. Vortex.
20
Alessandra Modi et al.
Fig. 2 Schematic overview of the experimental steps for pooling sequencing libraries and run the pools on the NovaSeq 6000
Microspin. Micropipettes with disposable tips. Ice bath. 10 mM Tris–HCl, pH 8.5. 0.2 N NaOH. For library dilutions and pool assembling we suggest using LoBind or siliconized tubes. 2.2 Proprietary Reagents and Materials from Illumina1 1
PhiX Control v3 (FC-110-3001). NovaSeq Reagent Kits: several kits are available, according to different flow cells and outputs (Table 1). Each kit provides the following:
At the time of writing, the authors are not aware of alternative suppliers offering sequencing reagents and materials compatible with Illumina NovaSeq6000 System.
NovaSeq 6000 Sequencing Protocol
21
1 Library tube (store at T 15–30 C), 1 Flow cell (store at 2–8 C), 1 Buffer cartridge (store at 15–30 C), 1 Cluster cartridge (store at 25 C to 15 C), 1 SBS cartridge (store at 25 C to 15 C). For details see https://emea.illumina.com/products/by-type/ sequencing-kits/cluster-gen-sequencing-reagents/novaseqreagent-kits.html. NovaSeq XP 2-Lane Kit (20021664) or NovaSeq XP 4-Lane Kit (20021665). NovaSeq XP Flow Cell Dock (20021663).
3
Methods
3.1 Preparation of the Sequencing Pool
In the next sections, we will discuss how to prepare a pool of libraries for sequencing. A crucial step for a successful sequencing is the preliminary validation, quantification and quality control of the libraries to be sequenced. Quantification is also a key preliminary step in order to combine a set of libraries into a single equimolar pool.
3.1.1 Validation, Quality Control, and Quantification of the Sequencing Libraries
The quality of a library is most important in determining the success of the sequencing run, in terms of both number and quality of produced reads. For library quality control, miniaturized capillary electrophoresis is recommended. Proper kits should be used according to manufacturer’s instructions selecting the appropriate DNA analyses assay according to library concentration. This measure provides the size distribution of DNA library fragments, whose size depends on the target length. The electrophoresis profile of a library (Fig. 3) should show a single peak at the molecular weight corresponding to the expected insert size plus adapters. PCR-amplified libraries may show additional peaks representing primer or adapter dimers (at around 80–85 and 130 bases, respectively), or broader fragments of higher molecular weight than the expected peak, that are visualized on the electropherogram as hump-shaped forms. In most cases, adapter dimers could be more efficiently sequenced than the longer library fragments, thus library preparation and/or purification steps must be improved in order to minimize or remove dimers (e.g., by adjusting the adapters amount during ligation step). High molecular weight heteroduplexes are artefacts caused by an excess of amplification cycles during the final library amplification step. While a few amounts of artefacts are tolerable, high concentration of heteroduplexes should affect library quantification. For this reason, the formation of heteroduplexes could be avoided, or
Fig. 3 Electrophoresis results for library quality control. (a) Library with a single peak of the expected molecular weight. (b) Library with heteroduplexes visualized as hump-shaped forms, 265 bp long. (c) Library with adapter dimers, around 130 bp long
strongly limited, by reducing the number of amplification cycles during library preparation. Library quantification is critical when multiplexing. In order to obtain balanced read counts between pooled libraries, the concentration of each library should be precisely calculated. Although miniaturized capillary electrophoresis systems provide both quality and quantity assessments, an additional library quantification step with specific assays is also recommended. Accurate library quantification should be performed by quantitative real-time PCR (qPCR) or fluorometric dsDNA assays. While the latter method detects all double-stranded DNA in a library, qPCR is more sensitive and specific. qPCR of sequencing libraries is based on detection of the SYBR Green dye signal when the dye is incorporated into a newly synthesized amplicon. Since it uses specific amplification primers complementary to the adapter sequences, only library fragments with proper adapters at both ends are quantified. In most cases, PCR-amplified libraries can be reliably quantified with a fluorometer, whereas PCR-free libraries are best quantified by qPCR. By contrast, UV spectrophotometers should not be used for library quantification, since absorbance-based measurements often overestimate DNA concentration due to nonspecific nucleic acid detection.
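Because fluorometric assays report concentrations in ng/μl, these values have to be converted to molarity before pooling; the conversion formula and the equimolar dilution are described in Subheading 3.1.2, and the short R sketch below only illustrates the arithmetic. All concentrations, fragment lengths, and target values in the example are illustrative and not part of the protocol.

# convert a fluorometric reading (ng/ul) to molarity (nM),
# given the average library fragment length in bp
ng_ul_to_nM <- function(conc_ng_ul, fragment_bp) {
  conc_ng_ul / (660 * fragment_bp) * 1e6
}

# illustrative example: three libraries to be normalized to 2 nM
conc_ng_ul  <- c(lib1 = 4.2, lib2 = 6.8, lib3 = 5.1)   # assumed readings
fragment_bp <- c(lib1 = 450, lib2 = 470, lib3 = 440)   # assumed average sizes
conc_nM <- ng_ul_to_nM(conc_ng_ul, fragment_bp)

# volume of each library needed for 10 ul at 2 nM (Vi = Vf * Cf / Ci),
# topped up with 10 mM Tris-HCl, pH 8.5
target_nM <- 2; final_ul <- 10
lib_ul  <- final_ul * target_nM / conc_nM
tris_ul <- final_ul - lib_ul
round(data.frame(conc_nM, lib_ul, tris_ul), 2)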
3.1.2 Pooling Libraries for Sequencing
After quantification, libraries have to be normalized to the appropriate concentration and then combined into a single equimolar pool. If the library concentration is expressed in ng/μl (as when using a fluorometer), the value for each library should first be converted to nM using the following formula:

nM = [sample concentration in ng/μl / (660 g/mol × average fragment length in bp)] × 10^6

Equimolar libraries can then be prepared following the formula Vi × Ci = Vf × Cf, using 10 mM Tris–HCl, pH 8.5 for dilution. Normalized libraries should then be pooled using equivalent pooling volumes. The number of libraries to be pooled depends on the depth of sequencing that should be achieved for each sample. In Table 2, the number of expected reads and the recommended plexity by flow cell type are reported for some common applications. The optimal pooled concentration to be loaded on the sequencer depends on the library type, insert size, and workflow used to load the libraries into the flow cell (Table 3). Generally, libraries with small insert sizes must be loaded at a lower concentration, while for libraries with longer insert sizes higher loading concentrations might be necessary. Illumina provides online tools to estimate sequencing coverage and a pooling calculator to normalize libraries (see Note 1). Several critical issues should be considered during the preparation of the pool. As best practices, we suggest the following “Pooling Criteria”:
1. To reduce the chance of converting one index into another through sequencing and amplification errors, take care to pool libraries with adequately differentiated index sequences, particularly when using custom-designed indexing oligos.
2. To avoid read misassignment due to unexpected index combinations (“index hopping”), use unique index combinations; index hopping can be seen at slightly elevated levels on instruments using patterned flow cells such as the NovaSeq 6000.
3. Pool libraries with similar insert sizes; on patterned flow cells, shorter fragments generate clusters more efficiently than libraries with longer molecules, which may therefore be underrepresented in the final counts.
4. Pool only libraries prepared with the same approach (e.g., PCR-amplified libraries or PCR-free libraries).
5. Prepare the pool at a higher concentration than the molarity required in the sequencing protocol, then dilute the pool to the appropriate loading concentration for the application
Table 2 Recommended number of multiplexed libraries related to application and flow cell type

Illumina Standard Workflow
Application      Flow cell type   Libraries per flow cell   Number of paired-end reads passing filter per flow cell (B)
Human genomes    SP               ~2                        1.3–1.6
Human genomes    S1               ~4                        2.6–3.2
Human genomes    S2               ~10                       6.6–8.2
Human genomes    S4               ~24                       16–20
Exomes           SP               ~20                       1.3–1.6
Exomes           S1               ~40                       2.6–3.2
Exomes           S2               ~100                      6.6–8.2
Exomes           S4               ~250                      16–20
Transcriptomes   SP               ~16                       1.3–1.6
Transcriptomes   S1               ~32                       2.6–3.2
Transcriptomes   S2               ~82                       6.6–8.2
Transcriptomes   S4               ~200                      16–20

Illumina XP workflow
Application      Flow cell type   Libraries per lane        Number of paired-end reads passing filter per lane (B)
Human genomes    SP               1                         0.65–0.8
Human genomes    S1               ~2                        1.3–1.6
Human genomes    S2               ~5                        3.3–4.1
Human genomes    S4               ~6                        4.0–5.5
Exomes           SP               ~10                       0.65–0.8
Exomes           S1               ~20                       1.3–1.6
Exomes           S2               ~50                       3.3–4.1
Exomes           S4               ~62                       4.0–5.5
Transcriptomes   SP               ~8                        0.65–0.8
Transcriptomes   S1               ~16                       1.3–1.6
Transcriptomes   S2               ~41                       3.3–4.1
Transcriptomes   S4               ~50                       4.0–5.5
(Table 3). For example, if the sequencing protocol requires a pooled library concentration of 1.5 nM, prepare a library pool at 3.0 nM and then dilute it to 1.5 nM. Before diluting, we suggest additionally quantifying the pool by fluorometry.
6. Prepare the pool taking into account the final loading volume required in the different flow cells according to the specific workflow used (Table 4).
7. Once the sequencing pool is ready, add the appropriate percentage of PhiX control, a balanced library representing the bacteriophage PhiX genome. The PhiX amount depends on the pool features: in well-balanced libraries a 1% spike-in control is needed for alignment calculations and quantification efficiency, as well
Table 3 Recommended pooled loading concentrations based on libraries with insert sizes 450 bp

Library type                                 Pooled loading concentration (nM)
Illumina Standard Workflow
DNA PCR-free library pool                    0.875–1.75
DNA PCR-amplified library pool               1.5–3.0
Illumina XP workflow
DNA PCR-free library pool                    0.575–1.175
DNA PCR-amplified library pool               1.0–2.0
Table 4 Loading volumes of library pool required for sequencing based on flow cell type and sequencing workflow

Flow cell   Illumina Standard Workflow, volume per flow cell (μl)   Illumina XP workflow, volume per lane (μl)
SP/S1       100                                                     18
S2          150                                                     22
S4          310                                                     30
as for troubleshooting cluster generation problems; in unbalanced and low-complexity libraries (e.g., amplicon libraries) a higher concentration of spike-in control (at least 10%, or more) is required to balance the fluorescent signals at each cycle and improve the overall run quality. For best results, we suggest combining the library pool and the PhiX control immediately before sequencing. For dilutions, use 10 mM Tris–HCl, pH 8.5 (see Note 2).
3.2 NovaSeq 6000 Sequencing Workflow
In the next sections, we will describe the experimental steps required to load library pools on the NovaSeq 6000 according to two different workflows. In the Standard Workflow (see Note 1) a single pool is automatically distributed between all the lanes of the flow cell. In the XP Workflow (see Note 2) different pools are directly loaded into separate lanes. We will then describe how to load the flow cell and reagents onto the instrument and start a sequencing run (see Note 3). These essential instructions are reported according to the NovaSeq 6000 Sequencing System Guide [7] and complemented by our direct practical experience with the NovaSeq 6000 platform available at the Department of Biology of the University of Florence. Before approaching sequencing on the
NovaSeq 6000, we recommend that new users carefully read the most recent version of the NovaSeq 6000 Sequencing System Guide. The following kit components are necessary for the sequencing workflow:
Cluster cartridge.
SBS cartridge.
Flow cell.
Library tube.
In the XP workflow, additional components of the NovaSeq XP kits are required:
DPX1.
DPX2.
DPX3.
NovaSeq Xp Manifold.
XP Flow Cell Dock.
3.2.1 Standard Workflow
Thaw SBS and Cluster Cartridges
1. Remove the SBS and Cluster cartridges (Fig. 4) from the refrigerator.
2. Place the cartridges into a wire thaw rack. The racks are provided with the instrument and prevent capsizing in the water bath.
3. Thaw the cartridges in a room-temperature water bath (19–25 °C). Submerge them about halfway. Protect the cartridges from direct light. Thaw times are reported in Table 5. Do not use hot water to thaw the reagents, because data quality may be reduced or the run may fail. Once the reagents are completely thawed, thoroughly dry the cartridge bases using paper towels. Be sure to dry between the wells so that all water is removed. Inspect the foil seals for water; if water is present, blot dry with a lint-free tissue.
4. Invert the cartridges 10 times to mix the reagents.
5. Gently tap the bottom of each cartridge on the bench to reduce air bubbles.
6. If the thawed reagents cannot be loaded onto the instrument within 4 h, the cartridges can be stored at 2–8 °C for up to 24 h.
Prepare Flow Cell
The flow cell (Fig. 1) is stored dry and has to be used within 12 h of removing it from the package.
1. Remove a new flow cell package from 2–8 °C storage.
2. Set the sealed flow cell package aside for 10–15 min to allow the flow cell to reach room temperature. Remove the flow cell
Fig. 4 Reagent cartridges (used). (a) Cluster cartridge. (b) SBS cartridge

Table 5 Thawing time of the SBS and cluster cartridges
Cartridge                          Thawing time
SP, S1, and S2 SBS cartridge       4 h
SP, S1, and S2 cluster cartridge   Up to 2 h
S4 SBS cartridge                   4 h
S4 cluster cartridge               Up to 4 h
from the package only immediately before loading it into the instrument; grasp the flow cell by the sides to avoid touching the glass slide. 3. Inspect each glass slide surface: if particulate is visible, clean the surface with a lint-free alcohol wipe and dry with a low-lint lab tissue. Denature Library Pool and PhiX Control for Sequencing
After combining the library pool with the appropriate concentration of the PhiX control library, the sequencing pool must be denatured using 0.2 N NaOH solution (see Note 3).
1. Add 0.2 N NaOH directly to the tube containing the nondenatured library pool and PhiX control. The required volumes of 0.2 N NaOH solution according to flow cell type are reported in Table 6.
2. Vortex briefly and spin down.
3. Incubate at room temperature for 8 min to denature.
4. To neutralize the 0.2 N NaOH, add 400 mM Tris–HCl, pH 8.0 as described in Table 7.
5. Vortex briefly and spin down.
6. Transfer the full volume of denatured sequencing libraries to the Library Tube (Fig. 5) provided with the NovaSeq 6000 Reagent Kit.
7. Immediately proceed to prepare the SBS and cluster cartridges.
Table 6 Volumes of 0.2 N NaOH solution required to denature library pool following Standard Workflow
Flow cell   Nondenatured pool (μl)   0.2 N NaOH (μl)
SP/S1       100                      25
S2          150                      37
S4          310                      77
Table 7 Volumes of 400 mM Tris–HCl, pH 8 required to neutralize 0.2 N NaOH following Standard Workflow. The resulting final volume of the pool per flow cell is also reported
Flow cell   400 mM Tris–HCl, pH 8.0 (μl)   Denatured pool final volume (μl)
SP/S1       25                             150
S2          38                             225
S4          78                             465
Fig. 5 Library tube
Prepare SBS and Cluster Cartridges
1. Verify that all reagents are thawed.
2. Invert each cartridge 10 times to mix the reagents.
3. Gently tap the bottom of each cartridge on the bench to reduce air bubbles.
4. Insert the uncapped Library Tube into position #8 of the cluster cartridge (Fig. 6); take care that the denatured pool does not spill out of the tube.
5. The cluster cartridge, including the library tube, must be loaded onto the instrument within 30 min.
Fig. 6 Cluster cartridge (used) with the Library Tube in position #8
3.2.2 XP Workflow
Thaw SBS and Cluster Cartridge
Thaw the SBS and Cluster Cartridges as described for the Standard Workflow (Subheading 3.3, step 1).
Thaw ExAmp reagents
1. Remove one tube each of DPX1, DPX2, and DPX3 from −25 °C to −15 °C storage.
2. Thaw at room temperature for 10 min.
3. Set aside on ice during library pool denaturation. ExAmp reagents can be refrozen one time only, immediately after thawing.
Denature library pool and PhiX control for sequencing
After combining the library pool with the appropriate concentration of the PhiX control library, the sequencing pool must be denatured using 0.2 N NaOH solution (see Note 4).
1. Add 0.2 N NaOH directly to each tube containing nondenatured library pool and PhiX control. The required volumes of 0.2 N NaOH solution according to flow cell type are reported in Table 8.
2. Vortex briefly and spin down.
3. Incubate at room temperature for 8 min to denature.
4. To neutralize the 0.2 N NaOH, add 400 mM Tris–HCl, pH 8.0 as described in Table 9.
5. Vortex briefly and spin down.
6. Keep the denatured libraries on ice until ready to add the ExAmp master mix.
Prepare the flow cell and dock
1. Place the NovaSeq Xp flow cell dock (Fig. 7) on a flat surface. Keep the flow cell level until it is loaded onto the instrument.
2. Inspect the dock and make sure that it is free from particulate.
3. Remove a new flow cell package from 2–8 °C storage. Set the sealed flow cell package aside for 10–15 min to allow the flow cell to reach room temperature; open the package and grasp the flow cell by the sides to avoid touching the glass slide.
Table 8 Volumes of 0.2 N NaOH solution required to denature library pools following XP workflow
Flow cell   Nondenatured pool (μl)   0.2 N NaOH (μl)
SP/S1       18                       4
S2          22                       5
S4          30                       7
Table 9 Volumes of 400 mM Tris–HCl, pH 8 required to neutralize 0.2 N NaOH following XP workflow. The resulting final volume of the pool per lane is also reported
Flow cell   400 mM Tris–HCl, pH 8.0 (μl)   Denatured pool final volume per lane (μl)
SP/S1       5                              27
S2          6                              33
S4          8                              45
Fig. 7 Xp flow cell dock. The clamp at the bottom of the dock is open
4. Inspect each glass slide surface: if particulate is visible, clean the surface with a lint-free alcohol wipe and dry with a low-lint lab tissue. 5. Invert the flow cell so that the top surface faces downward and place the flow cell onto the dock (Fig. 8).
Fig. 8 Two-lane flow cell placed in the XP dock
Fig. 9 XP disposable manifold
6. With the wells facing up, load the NovaSeq Xp manifold (Fig. 9) over the inlet end of the flow cell. Make sure that the NovaSeq Xp manifold arms fit securely into the dock cutouts. 7. Close the clamp to secure the flow cell and NovaSeq Xp manifold and seal the gaskets (Fig. 10) (see Note 4). Prepare the ExAmp master mix
Fig. 10 Flow cell and manifold secured to the XP dock. The clamp at the bottom of the dock is closed. Note that the flow cell is inverted in the dock (top surface facing downward), so the lane numbering is reversed

Table 10 Volumes of each reagent needed to prepare ExAmp master mix
Addition order   Reagent            Volume for two-lane flow cell SP/S1/S2 (μl)   Volume for four-lane flow cell S4 (μl)
1                DPX1               126                                           315
2                DPX2               18                                            45
3                DPX3               66                                            165
                 ExAmp master mix   Vf = 210                                      Vf = 525
1. Mix ExAmp reagents DPX1, DPX2, and DPX3 by vortexing briefly and spin down (see Note 5). 2. Combine the volumes reported in Table 10 in a suitable microcentrifuge tube (see Note 6) in the order specified. 3. Pipet and dispense slowly to avoid bubbles; vortex for 20–30 s until the ExAmp Master Mix is thoroughly mixed. Spin down.
Table 11 Volumes of denatured library pool and ExAmp master mix required for each lane. The resulting final volume and the appropriate volume of Library–ExAmp mixture for each manifold well are also reported
Flow cell   Denatured library pool (μl)   ExAmp master mix (μl)   Final volume (μl)   Library–ExAmp mixture per lane (μl)
SP/S1       27                            63                      90                  80
S2          33                            77                      110                 95
S4          45                            105                     150                 130
4. For best results, use the ExAmp Master Mix immediately, proceeding to the next step.
Load libraries onto the flow cell
1. Add ExAmp Master Mix to each denatured library pool, combining the volumes as described in Table 11; vortex for 20–30 s to mix and spin down.
2. Add the appropriate volume of ExAmp–library mixture to each NovaSeq Xp manifold well according to Table 11 (see Note 7). To avoid creating bubbles, load the samples slowly.
3. After adding the ExAmp–library mixture to all manifold wells, wait approximately 2 min for the mixture to reach the opposite end of each lane. A small volume of the mix may remain in the manifold wells after the lane is completely filled.
Prepare SBS and cluster cartridges
1. Verify that all reagents are thawed.
2. Invert each cartridge 10 times to mix the reagents.
3. Gently tap the bottom of each cartridge on the bench to reduce air bubbles.
4. Insert an uncapped, empty Library Tube into position #8 of the cluster cartridge before you set up the sequencing run. The tube will be used to prepare the conditioning mix with reagents from the cluster cartridge before distribution to the flow cell. The conditioning mix helps boost clustering efficiency for sequencing.
3.3 Set Up Sequencing Run
In this step, the previously prepared reagent cartridges and flow cell are loaded onto the instrument along with the buffer cartridges (Fig. 11), and sequencing is started. To set up a sequencing run, users interact with the following components of the NovaSeq 6000 (Fig. 12):
Fig. 11 Buffer cartridge (used)
(a) The touch screen monitor (Fig. 13); (b) the flow cell compartment (Fig. 14); (c) the liquids compartment that holds the reagent and buffer cartridges (Fig. 15). The touch screen monitor (Fig. 13) displays the NovaSeq Control Software (NVCS), which guides the user through system configuration, run setup, and monitoring. If you have an Illumina BaseSpace account, you can sign in to BaseSpace Sequence Hub through the NVCS using your username and password. To send run data to BaseSpace Sequence Hub for remote monitoring and data analysis, select “Run Monitoring and Storage”; this option requires a sample sheet. When the run is not connected to BaseSpace Sequence Hub for storing data, only remote monitoring of the run is possible and the data will be stored locally in a specified output folder; in this case a sample sheet is not required. In all cases, before a flow cell run can begin, the minimum space
Fig. 12 Full view of the NovaSeq 6000 sequencing platform with the different components
requirements for Compute Engine (CE) and hard drive (C:\) must be met (Table 12). The NovaSeq 6000 can run one or two flow cells simultaneously. Since the two sides of the instrument work independently, flow cells can be loaded onto the instrument at different times, even while the first sequencing run is in progress (“staggered start”). Staggered runs are set up at specific times during a run; the instrument will indicate when a staggered start is available. If you have loaded the flow cells using different methods, the XP workflow flow cell must be loaded on the instrument first. Short videos will guide you in loading the consumables into the instrument.
1. From the Home screen, select “Sequence,” and then select a single or dual flow cell run: (1) select “A” to set up a single flow
Fig. 13 Touch screen monitor
Fig. 14 Flow cells compartment. (a) Single flow cell mode. (b) Dual flow cell mode
cell run on side A; (2) select “B” to set up a single flow cell run on side B; and (3) select “A + B” to set up a dual flow cell run.
2. The software initiates the series of run setup screens, starting with Load. Select OK to open the flow cell door.
3. If present, remove the flow cell from the previous run and load the new flow cell onto the instrument (Fig. 14). For the Standard Workflow, align the flow cell over the four raised clamps and place it on the flow cell stage. For NovaSeq XP workflow sequencing, unload the flow cell from the dock:
(a) Open the clamp that secures the flow cell and manifold.
Fig. 15 Fluidic compartment. Reagent cartridges are placed into the reagent chiller drawer (upper part of the compartment): the SBS Cartridge is placed in the left position and the Cluster Cartridge in the right position. The buffer cartridge is placed on the right side of the buffer drawer (lower part of the compartment)

Table 12 Minimum space requirements for CE and C:\ per flow cell pair. For single flow cell runs, the minimum space requirements are half of those reported here
Flow cell   CE space per cycle (Gb)   C:\ space per flow cell pair (Gb)
SP          0.5                       5
S1          1.35                      20
S2          2.7                       20
S4          4.3                       40
(b) Without allowing liquid to drip onto the flow cell, carefully remove and discard the manifold. (c) If liquid drips onto the flow cell, clean with a lint-free alcohol wipe and dry with a lint-free lab tissue. (d) Grasp the sides of the flow cell to remove it from the dock. Keep the flow cell level. If there is residual material on the gaskets, blot the four flow cell gaskets with a lint-free tissue to dry. Do not touch the gaskets. (e) Invert the flow cell around the long axis so that the top surface faces up. (f) Align the flow cell over the four raised clamps and place it on the flow cell stage.
4. Select “Close Flow Cell Door.”
5. Load the SBS and Cluster Cartridges as follows:
(a) Open the liquid compartment doors (Fig. 15), then open the reagent chiller door and remove the used SBS and cluster cartridges.
(b) Load the prepared cartridges into the reagent chiller drawer so that the Insert labels face the back of the instrument: place the SBS Cartridge (gray label) into the left position and the Cluster Cartridge (orange label) containing the uncapped library tube into the right position (recall: for the NovaSeq Xp Workflow, an empty, uncapped library tube is loaded into the cartridge).
(c) Slide the drawer into the chiller, and then close the reagent chiller door.
(d) The sensors and RFIDs are checked. The IDs for the library tube and the two cartridges appear on the screen.
6. Load the Buffer Cartridge:
(a) Pull the metal handle to open the buffer drawer and remove the used buffer cartridge from the right side of the buffer drawer (see Note 8).
(b) Place a new buffer cartridge into the buffer drawer (right side) so that the Illumina label faces the front of the drawer. Align the cartridge with the raised guides on the drawer floor and sides. When properly loaded, the buffer cartridge is evenly seated and the drawer can close.
(c) Select the checkbox acknowledging that both used reagent bottles are empty.
7. Select “Run Setup” and enter the run parameters:
(a) Select the workflow type, “NovaSeq Standard” or “NovaSeq Xp.”
(b) In the Run Name field, enter a name to identify the current run.
(c) Enter the number of cycles for each read and index length in the sequencing run. The number of cycles should be entered according to the specific sequencing kit (Table 1). The sum of the four values entered can exceed the indicated number of cycles for the selected reagent kit by up to 23 cycles for paired-end runs and 30 cycles for single-read runs. For a single-read run, the Read 2 value is 0. For a single-index library, the Index 2 value is 0.
(d) Expand “Advanced Options” to apply optional settings (see Note 9).
(e) Select “Review.” The software confirms that the specified parameters are appropriate for the recipe.
8. Check and confirm the run parameters, then select “Start Run.” After a prerun check, the run will start automatically.
9. During the run, monitor run progress, intensities, and quality scores as the metrics appear on the screen. If the run is connected to BaseSpace, remote monitoring of the run in real time is possible.
10. After the run and data transfer are complete, delete the current run from Process Management to clear space for a subsequent run.
(a) From the Main Menu, select “Process Management.”
(b) Select “Delete Run,” and then select “Yes” to confirm.
(c) Select “Done.”
4 Notes
1. Illumina provides several support resources, including the following, among others:
Coverage Calculator: https://support.illumina.com/downloads/sequencing_coverage_calculator.html
Pooling Calculator: https://support.illumina.com/help/pooling-calculator/pooling-calculator.html
2. The PhiX control library is provided as a 10 nM solution that has to be diluted to an appropriate concentration using 10 mM Tris–HCl, pH 8.5 (i.e., Resuspension Buffer). Follow the Illumina instructions to dilute the 10 nM PhiX control library to 2.5 nM for the Standard Workflow, or to 0.25 nM for the XP workflow. After dilution, check the concentration again with the fluorometer, then add the appropriate volume of diluted PhiX directly to the nondenatured library pool.
3. Prepare a fresh dilution of 0.2 N NaOH sufficient for the application. Freshly diluted 0.2 N NaOH is essential to the denaturation process, and improper denaturation can reduce yield. We suggest using the diluted NaOH solution within 12 h; do not keep the diluted NaOH solution for subsequent sequencing runs. For dilutions, use laboratory-grade water. For the Standard Workflow, 50 μl of 0.2 N NaOH is needed for one flow cell and 100 μl for two flow cells. For the XP workflow, 30 μl of 0.2 N NaOH is needed for one flow cell and 60 μl for two flow cells.
4. The manifold is single-use, so it must be discarded after loading the library pool into the flow cell.
5. ExAmp reagents are viscous, especially DPX2 and DPX3. Pay attention during ExAmp master mix preparation to minimize pipetting errors.
6. When preparing the ExAmp master mix, use a microcentrifuge tube that holds at least twice the required volume: for
two-lane flow cell, use a 0.5 ml or 1.7 ml tube; for a four-lane flow cell, use a 1.7 ml tube.
7. Loading the ExAmp–library mixture into the manifold is a critical step for the success of the sequencing. When loading the samples into the manifold wells, avoid contact with the filter at the bottom of the well. Remember that the flow cell is inverted in the dock (top surface facing downward), so the lane numbering is reversed: be sure to add the ExAmp–library mixture to the well that corresponds to the intended lane.
8. Before starting the run, make sure both used reagent bottles are empty. Failure to empty the used reagent bottles can result in a terminated run and overflow, which damages the instrument and poses a safety risk.
9. The “Advanced Options” are:
(a) Custom primers. Select this option for libraries prepared with custom protocols. Select the appropriate checkboxes: (1) Read 1, use a custom primer for Read 1; (2) Read 2, use a custom primer for Read 2; (3) Custom Index, use a custom primer for Index 1.
(b) Output folder. Select “Browse” to change the output folder for the current run. An output folder is required when the run is not connected to BaseSpace Sequence Hub for storage.
(c) Sample sheet. A sample sheet is required when using Sequence Hub not only for run monitoring but also for storage; select “Browse” to upload a sample sheet for the current run.
(d) Custom recipe. Select “Custom Recipe,” then “Browse” to use a custom recipe in XML format for this run.
Acknowledgments The authors acknowledge Chiara Natali for the help in revising protocols and for taking pictures. The Department of Biology at University of Firenze is supported by the Italian Ministry of Education, University and Research (project “Dipartimenti di Eccellenza 2018-2022”). References 1. Head SR, Kiyomi Komori H, LaMere SA et al (2014) Library construction for next-generation sequencing: overviews and challenges. BioTechniques 56(2):61–77. https://doi.org/10. 2144/000114133
2. Meyer M, Kircher M (2010) Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010(6):5448. https://doi.org/ 10.1101/pdb.prot5448
3. Gansauge MT, Meyer M (2013) Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA. Nat Protoc 8 (4):737–748. https://doi.org/10.1038/nprot. 2013.038 4. 16S Metagenomic sequencing library preparation, part # 15044223 Rev. B. https://support. illumina.com/documents/documentation/ chemistry_documentation/16s/16s-meta genomic-library-prep-guide-15044223-b.pdf 5. Bentley DR, Balasubramanian S, Swerdlow HP et al (2008) Accurate whole human genome sequencing using reversible terminator
chemistry. Nature 456(7218):53–59. https:// doi.org/10.1038/nature07517 6. Patterned flow cell technology technical note. https://emea.illumina.com/content/dam/ illumina-marketing/documents/products/ technotes/patterned-flow-cell-technology-tech nical-note-770-2015-010.pdf 7. NovaSeq 6000 sequencing system guide, document # 1000000019358 v14, material # 20023471. (2020). https://emea.support. illumina.com/content/dam/illumina-support/ documents/documentation/system_documen tation/novaseq/novaseq-6000-system-guide1000000019358-14.pdf
Part II Pangenomics of Cultured Isolates
Chapter 3
Comparative Analysis of Core and Accessory Genes in Coexpression Network
Biliang Zhang, Jian Jiao, Pan Zhang, Wen-Jing Cui, Ziding Zhang, and Chang-Fu Tian
Abstract
Prokaryotes harbor varying proportions of accessory genes in their genomes. The integration of accessory functions with the core regulation network is critical for environmental adaptation, particularly considering a theoretically unlimited number of niches on the earth for microorganisms. Comparative genomics can reveal a co-occurrence pattern between a subset of accessory genes (or variations in core genes) and an adaptation trait, while comparative transcriptomics can further uncover whether a coordinated regulation of gene expression is involved. In this chapter, we introduce a protocol for weighted gene coexpression network construction using well-developed open-source tools, and a further application of such a network in the comparative analysis of bacterial core and accessory genes.
Key words Coexpression, WGCNA, Orthologs, Core gene, Accessory gene, Network analysis
1 Introduction
Horizontal gene transfer can expand the gene pool of prokaryotic species and is a major driver for prokaryotes to explore diverse niches on the earth. However, these typically accessory genes need to be tightly regulated and integrated with the existing core regulation network to allow their beneficial functions to act under ever-changing circumstances. This field has been significantly advanced by pangenomics and subsequent pantranscriptomics, though step-by-step protocols involving state-of-the-art bioinformatic tools are rare for biological scientists. Networks based on similarity in gene expression are called gene coexpression networks (GCNs). A GCN is an undirected graph in which genes are represented by nodes and a pair of genes is connected by an edge if they have a significant coexpression relationship [1]. With transcription profiles of a set of genes for a considerable number of samples, a GCN can be constructed based on pairs of genes which show a strong positive or negative correlation in expression
pattern across samples. These features of GCNs are distinct from those of gene regulatory networks (GRNs), in which edges connect gene pairs representing actual biochemical processes such as a reaction, transformation, interaction, activation, or inhibition [2]. Nevertheless, GCNs provide informative network relationships between genes which are putatively subject to regulation by the same signal transduction program, or involved in the same pathway [3]. The GCN is a powerful approach for exploring biologically relevant information, for example for identifying genes not yet associated with explicit biological questions, and for accelerating the interpretation of the molecular mechanisms at the root of significant biological processes. Trends in this field include, among other approaches, the integration of coexpression analysis with other omics techniques, such as metabolomics, for estimating the coordinated behavior between gene expression and metabolites, as well as for assessing metabolite-regulated genetic networks [4]. In this chapter, gene expression data of Sinorhizobium fredii CCBAU45436 are used as an example to introduce methods for the construction and analysis of a gene coexpression network.
2 Materials
2.1 Genome and Gene Expression Data
Eleven complete genomes of Sinorhizobium strains were used in comparative genomics analysis to define core and accessory genes of S. fredii CCBAU45436 (Fig. 1a), which can either live saprophytically in soils or enter in nitrogen-fixing symbiosis with diverse legumes including soybeans [5]. Gene expression data used in coexpression network construction were obtained from 25 RNA-seq samples of CCBAU45436 under ten conditions.
Fig. 1 Workflow of this chapter (steps shown: prepare input data; genomes of sibling species; ortholog analysis; phylogenetic tree; define core/accessory genes; choose β power; calculate adjacency matrix; module detection; degree statistics; functional enrichment; key driver identification; network visualization; comparative analysis; other customized analyses)
Table 1 List of bioinformatic tools used in this chapter
Software        Reference   Source
OrthoFinder     [6]         https://github.com/davidemms/OrthoFinder/releases
MUSCLE          [7]         http://www.drive5.com/muscle/
RAxML           [8]         https://cme.h-its.org/exelixis/web/software/raxml/
Bowtie 2        [9]         http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
samtools        [10]        https://github.com/samtools/samtools
HTSeq-Count     [11]        https://htseq.readthedocs.io/en/release_0.11.1/count.html
pheatmap        [12]        https://github.com/raivokolde/pheatmap
WGCNA           [13]        https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/
pvclust         [14]        http://stat.sys.i.kyoto-u.ac.jp/prog/pvclust/
Cytoscape       [15]        https://cytoscape.org
ggpubr          [16]        https://rpkgs.datanovia.com/ggpubr/index.html
eggNOG-mapper   [17]        http://eggnog-mapper.embl.de/
2.2 Bioinformatic Tools
All bioinformatic tools used in this chapter are listed in Table 1.
3 Methods
3.1 Core Gene Definition
3.1.1 Ortholog Inference
OrthoFinder is a fast and accurate tool to identify orthogroups (see Note 1) [6]. When installing OrthoFinder, we strongly recommend also installing the dependency DIAMOND [18]; the key feature of DIAMOND is that it runs 500–20,000 times faster than BLAST while retaining a similar degree of sensitivity. All protein sequences in fasta files (one file per strain) should be placed in the same directory before running OrthoFinder with the following command:
$ orthofinder -f Dataset_directory
In the OrthoFinder results directory, a directory named Orthogroups can be found. It contains four files describing the orthogroups, the unassigned genes, the counts of the number of genes for each strain in each orthogroup, and a list of orthogroups that contain exactly one gene per strain.
3.1.2 Phylogenetic Tree Construction
To reconstruct the phylogenetic tree for 11 strains of Sinorhizobium, Ensifer adhaerens OV14 was used as an outgroup to root the tree (Fig. 2a). Orthologs for the 12 strains were inferred in the same way as in Subheading 3.1.1. By analyzing the result file “SingleCopyOrthogroups.txt”, 2170 single-copy genes were obtained. All single-copy ortholog sequences, which contain exactly one protein sequence per strain, were selected from the “Orthogroup Sequences” directory. Multiple sequence alignments were performed with MUSCLE using default parameters for each single-copy ortholog [7]. Then, all alignments were concatenated into one file. The concatenated file was used to perform a maximum likelihood analysis with RAxML, using the PROTGAMMAAUTO substitution model and 1000 bootstrap replicates [8]. The commands are shown below:
$ orthofinder -f Dataset_for_tree
## multiple sequence alignment with MUSCLE for one single-copy ortholog
$ muscle -quiet -in OG0000000.fa -out OG0000000.aln
## construct the ML species tree and specify the name of the outgroup with the -o parameter
$ raxmlHPC-PTHREADS-AVX -T 20 -f a -x 12456 -p 12345 -# 1000 -m PROTGAMMAAUTO -s concatenated.fa -n tree -o adhaerens OV14 > raxml.log
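No command is given above for the concatenation step, so a minimal base-R sketch is provided here. It assumes that each per-ortholog alignment (*.aln) sits in an alignments/ directory (an illustrative name) and that the sequence headers in every alignment have already been renamed to the corresponding strain names, so that sequences can be joined strain by strain.

# read a fasta alignment into a named character vector
read_fasta <- function(path) {
  lines <- readLines(path)
  idx <- grep("^>", lines)
  headers <- sub("^>", "", lines[idx])
  starts <- idx + 1
  ends <- c(idx[-1] - 1, length(lines))
  seqs <- mapply(function(s, e) paste(lines[s:e], collapse = ""), starts, ends)
  setNames(seqs, headers)
}

aln_files <- list.files("alignments", pattern = "\\.aln$", full.names = TRUE)
concat <- NULL
for (f in aln_files) {
  aln <- read_fasta(f)
  if (is.null(concat)) concat <- setNames(rep("", length(aln)), sort(names(aln)))
  concat <- setNames(paste0(concat, aln[names(concat)]), names(concat))
}
# write the supermatrix used as RAxML input
writeLines(paste0(">", names(concat), "\n", concat), "concatenated.fa")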
3.1.3 Core Gene Identification
All CDSs in CCBAU45436 were assigned to one of the following hierarchical subsets (Fig. 2b): Subset I, genes present in all Sinorhizobium strains; Subset II, those present in all five strains belonging to cluster II but excluding Subset I; Subset III, those shared by four S. fredii strains but not present in Subsets I and II; Subset IV, the remaining accessory genes of CCBAU45436 [5]. To achieve these assignments based on the output files from OrthoFinder, Orthogroups.GeneCount.csv was used (see Note 2).
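A minimal R sketch of this assignment from Orthogroups.GeneCount.csv (tab-separated in recent OrthoFinder releases) is shown below. The column selections and strain groupings are placeholders encoding the hierarchy described above and must be adapted to the headers of your own file.

counts <- read.delim("Orthogroups.GeneCount.csv", check.names = FALSE)

# placeholder strain groupings -- replace with the column names in your file
sino_strains <- colnames(counts)[2:12]   # the 11 Sinorhizobium strains (assumed column order)
cluster2     <- sino_strains[1:5]        # the five cluster II strains (assumed)
fredii       <- sino_strains[1:4]        # the S. fredii strains (assumed)
focal        <- sino_strains[1]          # CCBAU45436 (assumed to be the first strain column)

presence <- as.matrix(counts[, sino_strains]) > 0
subset <- ifelse(rowSums(presence) == length(sino_strains), "I",
          ifelse(rowSums(presence[, cluster2, drop = FALSE]) == length(cluster2), "II",
          ifelse(rowSums(presence[, fredii,  drop = FALSE]) == length(fredii),  "III", "IV")))

# keep only orthogroups containing at least one gene of the focal strain
res <- data.frame(Orthogroup = counts$Orthogroup, Subset = subset)[presence[, focal], ]
table(res$Subset)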
3.2 Construction of Coexpression Network
A coexpression network is fully specified by its adjacency matrix, which is an m × m matrix, where m is the number of genes to be analyzed. The most widely used software for coexpression analysis is Weighted Gene Correlation Network Analysis (WGCNA) [13].
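As an illustration of how such an adjacency matrix is obtained with the WGCNA R package, a minimal sketch is given below. It assumes a normalized, log-transformed expression matrix (genes in rows, samples in columns, as prepared in Subheading 3.2.1), and the soft-threshold power used is only an example value to be replaced by the β chosen from the scale-free topology fit.

library(WGCNA)

# datExpr must have samples in rows and genes in columns
datExpr <- t(tpm.log)

# inspect candidate soft-threshold powers (beta)
sft <- pickSoftThreshold(datExpr, powerVector = 1:20, networkType = "signed")

# adjacency matrix (m x m, one row/column per gene) and topological overlap
adj <- adjacency(datExpr, power = 12, type = "signed")   # 12 is an example beta
TOM <- TOMsimilarity(adj)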
3.2.1 Input Data for WGCNA
High-quality reads from 25 samples under ten conditions were independently mapped to the reference genome of S. fredii CCBAU45436 using Bowtie 2 with default parameters [9]. Samtools was used to retain mapped reads with a quality score higher than 30. The number of mapped reads for each protein-coding gene was extracted from filtered bam files by HTSeq-Count [11]. The commands are shown below:
Fig. 2 Phylogeny and comparative genomics analysis of Sinorhizobium strains. (a) Maximum likelihood phylogenetic tree based on the concatenated protein sequences of 2170 core genes shared by Sinorhizobium strains and an outgroup Ensifer adhaerens OV14. (b) Hierarchical divisions of core/accessory gene subsets for S. fredii CCBAU45436. Subset I, genus core genes shared by all Sinorhizobium strains; Subset II, genes shared by five strains of cluster II in the phylogenetic tree but excluding Subset I; Subset III, genes shared by three S. fredii strains excluding Subsets I and II; Subset IV, the remaining accessory genes in CCBAU45436
## create an index for the genome
$ bowtie2-build CCBAU45436_genome.fa index/CCBAU45436
## map reads to the genome
$ bowtie2 -p 20 -x index/CCBAU45436 -1 sample_reads_1.fq -2 sample_reads_2.fq -S sample.sam
## filter the sam file
$ samtools view -bhS -q 30 sample.sam > sample.bam
## sort the bam file and count reads with htseq-count
$ samtools sort sample.bam sample.sorted
$ htseq-count -f bam -t CDS -i ID -s reverse sample.sorted.bam CCBAU45436.gff 1>sample.counts 2>sample.counts.log
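The per-sample count files written by htseq-count can be combined into a matrix and converted to TPM before the log transformation used below (see Note 3). A minimal R sketch follows; the gene-length vector (gene_len, CDS lengths in bp parsed from the GFF) is an assumed input that must be prepared separately.

files <- list.files(pattern = "\\.counts$")
tabs  <- lapply(files, read.delim, header = FALSE, col.names = c("gene", "count"))
counts <- sapply(tabs, `[[`, "count")
rownames(counts) <- tabs[[1]]$gene
colnames(counts) <- sub("\\.counts$", "", files)
counts <- counts[!grepl("^__", rownames(counts)), ]   # drop htseq-count summary rows

# gene_len: named vector of CDS lengths (bp), e.g. parsed from CCBAU45436.gff
rpk <- counts / (gene_len[rownames(counts)] / 1e3)    # reads per kilobase
tpm <- t(t(rpk) / colSums(rpk)) * 1e6                 # scale each sample to one million
tpm.log <- log2(tpm + 1)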
For a normal coexpression network construction, the number of samples should not be less than 15; otherwise, the network may not exhibit a strong tendency to have a consistent expression pattern [13]. Log-transformed TPM (Transcripts Per Million) values are used to normalize the expression data (see Note 3). To have a global view of the reproducibility of gene expression in samples from the same conditions (Fig. 3a), the pheatmap R package was run using commands as follows:
# load packages
library("RColorBrewer")
library("pheatmap")
# tpm.log is log-transformed TPM data
# compute the distance between samples
sampleDists
>cl_rep
>seq1
MVEVGAGFSEKAYAKLNLYLDVVGKRSDGY[...]
>seq2
HDIVGLFQTIDMYDEIVVESNVPIEGQNLVERA[...]
2. gene_clu_cluster.tsv: a tab-separated file containing the cluster representative headers on the first field and the cluster member headers in the second field. 3. gene_clu_rep_seq.fasta: fasta file containing the cluster representative sequences. In addition, MMseqs2 generates and stores in the tmp folder a database of the gene sequences in input (named input) and the cluster databases (clu, clu_seqs (with sequence info)), both in a format that can be further processed by other MMseqs2 modules. 3.1.3 Functional Annotation of the Gene Clusters
For the functional annotation we can consider the gene clusters as units and use the cluster representative sequences that we obtained from the clustering (gene_clu_rep_seq.fasta). In alternative, the MMseqs2 module result2msa can be used to generate a multiple sequence alignment (MSA) for each cluster. mmseqs result2msa tmp/input tmp/input tmp/clu clu_msa
Otherwise, using the apply module an external multiple sequence aligner can be called. The only requirement is that the external aligner must be able to read from stdin and write the result to stdout. Therefore, this method can be used with programs like Clustal-omega [39], MUSCLE [39, 40], or FAMSA [41].
Annotation and Core Set of Microbial Metagenomic Genes
121
For example, we can install Clustal-omega with conda: conda install -c bioconda clustalo
And then align the cluster sequences using: mmseqs apply clu_seq clu_msa -- clustalo -i - --threads=1
From the MSA, we can then retrieve the profiles and extract the consensus sequence for each cluster using the msa2profile and the profile2consensus modules. mmseqs msa2profile clu_msa clu_profile mmseqs profile2consensus clu_profile clu_consensus
The cluster annotation can be performed with either the cluster representatives or consensus sequences. In this chapter we use the cluster representative sequences. The first search is against the highly curated Pfam database of protein domain families [16]. The Pfam database can be downloaded using the following: wget http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/ Pfam-A.hmm.gz And searched using the hmmsearch program of the HMMER3 software [43], that we can obtain using conda: conda install -c bioconda hmmer
The Pfam database of HMM profiles can then be searched using the following command, where search query and target are specified at the end: hmmsearch --cut_ga -Z n_seqs*pfam_entries \ --domtblout gene_rep_Pfam31.out \ Pfam-A.hmm gene_clu_rep_seq.fasta
The --cut_ga flag specify to use Pfam GA (gathering threshold) score cutoffs, Z defines the number of comparisons done, and is used for e-value calculation, and --domtblout specify that the output should be in a file in tabular format, space-delimited, summarizing the results per-domain, with one line per homologous domain detected in a query sequence. The methods described above is considered the standard method to search the Pfam database. However, the same search can be performed with MMseqs2 as well. A detailed explanation on how to perform the search can be found at https://github.com/ soedinglab/MMseqs2/wiki#how-to-create-a-target-profile-data base-from-pfam.
122
Chiara Vanni
Second, we can search our sequences against the eggNOG database [19] to identify potential orthologous groups. The eggNOG database can be downloaded and searched using the eggnogmapper program. The latest release of the mapper can be freely downloaded and decompressed from https://github.com/ jhcepas/eggnog-mapper/releases. Or it can be obtained by cloning the git repository: git clone https://github.com/jhcepas/eggnog-map per.git
In the folder you can find the script to retrieve the eggNOG database: download_eggnog_data.py, which will download the necessary data into a folder named data/. The script emapper.py runs the search of the cluster representatives, in fasta format, against the eggNOG database of HMM profiles (specified using the -d flag), using HMMER as a search algorithm ( m). The program requires python 2.7. python emapper.py -i gene_clu_rep_seq.fasta \ -d data/NOG_hmm/NOG_hmm.all_hmm \ --output gene_clu_rep_seq_NOG -m hmmer
The output consists of two main filles: 1. gene_clu_rep_seq_NOG.emapper.hmm_hits. Only returned when using the hmmer mode. It is a list of significant hits to eggNOG Orthologous Groups (OGs) for each query. The file contains the following fields: query_name,hit,evalue,sum_score,query_length,hmmfrom,hmmto, seqfrom,seqto,query_coverage
2. gene_clu_rep_seq_NOG.annotations. This file provides final annotations of each query. Tab-delimited columns in the file are as follows. (a) query_name: query sequence name (b) seed_eggNOG_ortholog: best protein match in eggNOG (c) seed_ortholog_evalue: best protein match (e-value) (d) seed_ortholog_score: best protein match (bit-score) (e) predicted_gene_name: Predicted gene name for query sequences (f) GO_terms: Comma-delimited list of predicted Gene Ontology terms (g) KEGG_pathways: Comma-delimited list of predicted KEGG pathways (h) Annotation_tax_scope: The taxonomic scope used to annotate this query sequence
Annotation and Core Set of Microbial Metagenomic Genes
123
(i) Matching_OGs: Comma-delimited list of matching eggNOG Orthologous Groups (j) best_OG|evalue|score: Best matching Groups (only in HMM mode)
Orthologous
(k) COG functional categories: COG functional category inferred from best matching OG (l) eggNOG_HMM_model_annotation: eggNOG functional description inferred from best matching OG The representative sequences can then be searched against general functional databases using the easy-search workflow of MMseqs2, which takes fasta/fastq files as input for both query and target. This workflow allows to perform sequence/sequence searches, but also profile/sequence or sequence/profile searches (400 times faster than PSI-BLAST). A good approach is to start searching broad but curated databases, like UniProtKB [42] or its clustered version the UniProt Reference Clusters (UniRef), like for example the version clustered at 90% (UniRef90) [21], to improve the computational efficiency of your search. To run the easy-search workflow you need to specify the search query file, the target, the results file name, and the name of a template folder. Additional options can be found running mmseqs easy-search -h. mmseqs easy-search gene_clu_rep_seq.fasta uniref90.fasta gene_clu_rep_uniref.tsv tmp
The output is a tab-separated file containing the following fields: query,target,pident,alnlen,mismatch,gapopen,qstart,qend, tstart,tend,evalue,bits. The sequences found without any match can then be searched for example, against the broader NCBI nr database [25]. There are many ways to retrieve the sequences found without any hit from a search. A fast way is to use the filterbyname.sh script from BBMAP (https://sourceforge.net/projects/bbmap/). The BBMAP tools can also be installed via conda: conda install -c bioconda bbmap
Then, the filtering script can be run using the searched fasta file as input, and the search results to filter out the already annotated sequences, specifying the option include as FALSE: filterbyname.sh in=gene_clu_rep_seq.fasta \ out=gene_clu_rep_seq_nohits.fasta \ names= uniprot.fasta
Create the taxonomy database in MMseqs2 format. mmseqs createdb uniprot.fasta uniprotDB mmseqs createtaxdb uniprotDB tmp mmseqs createindex uniprotDB tmp
Run the taxonomy workflow to annotate the cluster representatives. mmseqs easy-taxonomy gene_clu_rep_seq.fasta uniprotDB gene_clu_rep_seq_taxonomy tmp
The output consists of two main files: 1. gene_clu_rep_taxonomy_report, a summary report with the following fields: percent of mapped reads,count of mapped reads,rank,taxonomy identifier,taxonomic name
2. gene_clu_rep_taxonomy_lca.tsv, the main assignment output, where every line contains a taxonomical classification of input sequence: sequence name,taxonomic identifier,rank,taxonomic name
Annotation and Core Set of Microbial Metagenomic Genes 3.1.6 First Section Output
127
The ideal output of this multi-step gene cluster annotation is a combined tab-separated file, gene_cluster_annotat.tsv, containing the cluster and the cluster representative (or consensus) identifiers and the retrieved annotations from the various databases, using the following format: cluster_id
rep_id
database
accession
function
e_value
cl_1
1
Pfam
PF01029
NusB family
2.5e-30
cl_2
3
UniRef90
Q38950
AtA beta
1.3e-25
And gene_cluster_taxonomy.tsv. cl_id
gene_id
domain
phylum
class
order
[...]
cl_1
1
Bacteria
Firmicutes
NA
NA
...
From these outputs we can create a new table with expanded annotations to all the genes in each cluster, which will result useful for the next section. 3.2
Second Section
3.2.1 How to Use Gene Clusters to Explore Pangenomes Using Anvi’o
Once we have a catalog of gene clusters decorated with annotations, we can start using them to look for specific functions or specific organisms and to define core sets of genes. For this second part, we are going to need the metagenomic assemblies (contigs) in fasta format, the coverage information for each contig retrieved from the mapping of the reads back to the assemblies, in the form of BAM files, and the results from the previous section. We are going to import the data into Anvi’o, an open-source platform that provides analysis and visualization tools for ‘omics data [37]. We are going to use Anvi’o to perform a pangenomic analysis starting from our metagenomic data. Several detailed tutorials for metagenomic, pangenomic, and phylogenomic workflows are available on the Anvi’o website. This section will provide the basic steps to retrieve MAGs from the metagenomic assemblies and investigate them in a pangenomic context, using a set of gene clusters that have been annotated as shown in the previous section. Anvi’o can be installed with conda:
conda install -y -c conda-forge -c bioconda anvio
3.2.2 Create a Collection of Metagenome-Assembled Genomes (MAGs)
The starting point for every metagenomic analysis in Anvi’o is the contig database (contigs.db). We will generate the contig database, including also the gene prediction we already performed in the previous section. The gene prediction file needs to be formatted in “Anvi’o style.” To do this, we can use the information file we have created from the Prodigal output: my_gene_prediction.tsv.
128
Chiara Vanni anvi-gen-contigs-database -f my_contigs.fasta -o my_contigs.db \ -n "my_metagenome" \ --external-gene-calls my_gene_prediction.tsv
At this point, we can create a folder called "additional-files/" where we can store all the information tables we have from the clustering and from the functional and taxonomic annotation. We can import functional information about the genes into the contig database using the following command:
anvi-import-functions -c my_contigs.db \
 -i additional-files/gene_cluster_annot.tsv
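The per-gene table imported here can be derived from the first-section outputs by propagating each cluster representative's annotation to all cluster members. A minimal R sketch is shown below; it assumes that the gene identifiers used during clustering match the gene calls imported into the contig database, and that the column names follow the example tables of Subheading 3.1.6 (adjust both to your own files).

# cluster membership (representative -> member) from MMseqs2
clu <- read.delim("gene_clu_cluster.tsv", header = FALSE,
                  col.names = c("rep_id", "gene_id"))
# cluster-level annotations produced in the first section
ann <- read.delim("gene_cluster_annotat.tsv", check.names = FALSE)

# propagate each representative-level annotation to every member gene
genes <- merge(clu, ann, by = "rep_id")
out   <- genes[, c("gene_id", "database", "accession", "function", "e_value")]

write.table(out, "additional-files/gene_cluster_annot.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)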
Where the gene_cluster_annot.tsv table has to be in the following format: gene_id
Database
Accession
Function
e_value
In the same way we can import the taxonomic information retrieved with MMseqs2. anvi-import-taxonomy-for-genes -c my_contigs.db \ -i additional-files/gene_cluster_taxonomy.tsv
The taxonomy table gene_cluster_taxonomy.tsv has to have the following structure: gene_id
Domain
Phylum
Class
Order
Family
Genus
Species
The taxonomic annotations for your genes can also be inferred with Anvi’o (http://merenlab.org/2019/10/08/anvio-scg-taxon omy/). We can then complement the contig database with the HMM models of curated bacterial and archaeal single-copy gene collections, which are already contained in Anvi’o [47, 48]. This step will help to build the metagenome-assembled genomes and to estimate their completion and redundancy. anvi-run-hmms -c my_contigs.db
After this step, we have a complete and decorated contig database. Next, we need to initialize the BAM file with the mapping information and build the Anvi’o “profile” of our contig database. The “profile” database (my_profile.db) contains the mapping/coverage information for the contigs in each sample. The BAM file for each sample can be initialized using the following command: anvi-init-bam sample_01_raw.bam -o sample_01.bam
The profile is built running the anvi-profile command for each sample: anvi-profile -i sample_01.bam -c my_contigs.db
In case of multiple samples, we would have to repeat or iterate this command over all samples, and subsequently merge the profiles together using: anvi-merge */my_profile.db -o sample_merged -c my_contigs.db
The asterisk * is a wildcard that tells the computer to consider the file my_profile.db contained in all the directories/folders found in the current working directory. The anvi-merge command performs the automatic binning of your contig collection, using CONCOCT [32, 33], and calls the bin collection “CONCOCT”. You can also import an external collection of bins, obtained with other binning tools like MetaBat [31] or Maxbin [33] using the following: anvi-import-collection additional-files/my_binning_results. txt \ -p sample_merged/my_profile.db \ -c my_contigs.db \ -C collection_name
With automatic binning, however, there is a high probability to create contaminated/spurious bins. There are tools to check bin contamination and to evaluate in general bin completion and redundancy, like CheckM [36] or BUSCO [49]. However, this automatic control is usually not enough. Best practice would be to manually curate the bins using interactive visualization tools like Anvi’o. anvi-interactive -p sample_merged/my_profile.db \ -c my_contigs.db \ -C CONCOCT
The above command opens an interactive interface where you can visualize your bin collection, or the comparison of different binning approaches, the gene coverage in each bin, the gene clusters and other information stored in the profile and contig databases. On Anvi’o website, you can find detailed tutorials on how to use the interactive interface at http://merenlab.org/2016/02/ 27/the-anvio-interactive-interface/ and http://merenlab.org/ 2016/06/22/anvio-tutorial-v2/#anvi-interactive. To evaluate the completion and redundancy of the bin collection, we can run the following command:
130
Chiara Vanni anvi-estimate-genome-completion -p my_profile.db \ -c my_contigs.db \ -C CONCOCT
Specific bins can be manually refined using anvi-refine. anvi-refine -p my_profile.db \ -c my_contigs.db \ -C CONCOCT \ -b bin_008
The completion and redundancy of the bins can be checked using anvi-summarise, which produces an HTML report of the stored collection. anvi-summarize -c my_contigs.db \ -p my_profile.db \ -C CONCOCT \ -o my_collection_summary
Not all bins will have a good quality, and to select those that could represent good-quality MAGs we can follow the standard proposed by Bowers et al. [35]. Therefore, bins can be considered high-quality MAGs if they are >90% complete and <5% redundant, and medium-quality MAGs if they are >50% complete and <10% redundant.
MAPPING/sample_01_raw.bam
References 1. Yooseph S, Sutton G, Rusch DB et al (2007) The sorcerer II global ocean sampling expedition: expanding the universe of protein families. PLoS Biol 5:1–35 2. Sunagawa S, Coelho LP, Chaffron S et al (2015) Ocean plankton. Structure and function of the global ocean microbiome. Science 348:1261359 3. Gilbert JA, Jansson JK, Knight R (2014) The earth microbiome project: successes and aspirations. BMC Biol 12:69 4. Duarte CM (2015) Seafaring in the 21St century: the Malaspina 2010 circumnavigation expedition. Limnol Oceanog Bull 24:11–14 5. Kopf A, Bicak M, Kottmann R et al (2015) The ocean sampling day consortium. Gigascience 4:27 6. Lloyd-Price J, Mahurkar A, Rahnavard G et al (2017) Strains, functions and dynamics in the expanded human microbiome project. Nature 550:61–66 7. Luo R, Liu B, Xie Y et al (2012) SOAPdenovo2: an empirically improved memoryefficient short-read de novo assembler. Gigascience 1:18 8. Li D, Liu C-M, Luo R et al (2015) MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31:1674–1676 9. Peng Y, Leung HCM, Yiu SM, Chin FYL (2012) IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28:1420–1428
10. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA (2017) metaSPAdes: a new versatile metagenomic assembler. Genome Res 27:824–834 11. Mikheenko A, Saveliev V, Gurevich A (2016) MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32:1088–1090 12. Hyatt D, Chen G-L, LoCascio PF et al (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119 13. Pavlopoulos GA (2017) How to cluster protein sequences: tools, tips and commands. MOJ Proteom. Bioinform 5 14. Fu L, Niu B, Zhu Z et al (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150–3152 15. Steinegger M, So¨ding J (2018) Clustering huge protein sequence sets in linear time. Nat Commun 9:2542 16. El-Gebali S, Mistry J, Bateman A et al (2019) The Pfam protein families database in 2019. Nucleic Acids Res 47:D427–D432 17. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31:371–373 18. Marchler-Bauer A, Lu S, Anderson JB et al (2011) CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res 39:D225–D229 19. Huerta-Cepas J, Szklarczyk D, Heller D et al (2019) eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology
resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res 47:D309–D314 20. Tatusov RL, Fedorova ND, Jackson JD et al (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4:41 21. Suzek BE, Wang Y, Huang H et al (2015) UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31:926–932 22. Haft DH, DiCuccio M, Badretdin A et al (2018) RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res 46:D851–D860 23. Parks DH, Waite DW, Skarshewski A et al (2018) A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36:996–1004. https://doi.org/10.1038/nbt.4229 24. Steinegger M, Söding J (2017) MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35(11):1026–1028. https://doi.org/10.1038/nbt.3988 25. NCBI Resource Coordinators (2018) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 46:D8–D13 26. Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60 27. Remmert M, Biegert A, Hauser A, Söding J (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175 28. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359 29. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760 30. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079 31. Kang DD, Froula J, Egan R, Wang Z (2015) MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3:e1165 32. Alneberg J, Bjarnason BS, de Bruijn I et al (2014) Binning metagenomic contigs by coverage and composition. Nat Methods 11:1144–1146 33. Wu Y-W, Tang Y-H, Tringe SG et al (2014) MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome 2:26
34. Imelfort M, Parks D, Woodcroft BJ et al (2014) GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ 2:e603 35. Bowers RM, Kyrpides NC, Stepanauskas R et al (2017) Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol 35:725–731 36. Parks DH, Imelfort M, Skennerton CT et al (2015) CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25:1043–1055 37. Eren AM, Esen ÖC, Quince C et al (2015) Anvi'o: an advanced analysis and visualization platform for 'omics data. PeerJ 3:e1319 38. Steinegger M, Meier M, Mirdita M et al (2019) HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20:473 39. Sievers F, Wilm A, Dineen D et al (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7:539 40. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797 41. Deorowicz S, Debudaj-Grabysz A, Gudyś A (2016) FAMSA: fast and accurate multiple sequence alignment of huge protein families. Sci Rep 6:33964 42. UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–D515 43. Potter SC, Luciani A, Eddy SR et al (2018) HMMER web server: 2018 update. Nucleic Acids Res 46:W200–W204 44. Mirdita M, von den Driesch L, Galiez C et al (2017) Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res 45:D170–D176 45. Hingamp P, Grimsley N, Acinas SG et al (2013) Exploring nucleo-cytoplasmic large DNA viruses in Tara Oceans microbial metagenomes. ISME J 7:1678–1695 46. UniProt Consortium T (2018) UniProt: the universal protein knowledgebase. Nucleic Acids Res 46:2699 47. Lee MD (2019) GToTree: a user-friendly workflow for phylogenomics. Bioinformatics 35:4162–4164 48. Waterhouse RM, Seppey M, Simão FA et al (2018) BUSCO applications from quality
assessments to gene prediction and phylogenomics. Mol Biol Evol 35:543–548 49. Simão FA, Waterhouse RM, Ioannidis P et al (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31:3210–3212 50. Benson DA, Karsch-Mizrachi I, Lipman DJ et al (2008) GenBank. Nucleic Acids Res 36:D25–D30
51. Delmont TO, Eren AM (2018) Linking pangenomes and metagenomes: the Prochlorococcus metapangenome. PeerJ 6:e4320 52. Price MN, Dehal PS, Arkin AP (2010) FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One 5:e9490
Chapter 9 Metagenomic Assembly: Reconstructing Genomes from Metagenomes Zhang Wang, Jie-Liang Liang, Li-Nan Huang, Alessio Mengoni, and Wen-Sheng Shu Abstract Assembly of metagenomic sequence data into microbial genomes is of critical importance for disentangling community complexity and unraveling the functional capacity of microorganisms. The rapid development of sequencing technology and novel assembly algorithms have made it possible to reliably reconstruct hundreds to thousands of microbial genomes from raw sequencing reads through metagenomic assembly. In this chapter, we introduce a routinely used metagenomic assembly workflow including read quality filtering, assembly, contig/scaffold binning, and postassembly check for genome completeness and contamination. We also describe a case study to reconstruct near-complete microbial genomes from metagenomes using our workflow. Key words Metagenomics, Assembly, Contig/scaffold binning, Genome curation
1
Introduction
Microbial communities play a pivotal role in the global ecosystem. However, the vast majority of microorganisms are uncultivable (known as the "microbial dark matter"); hence, their metabolic capacity and ecological functionality remain largely unexplored [1]. Ever since the advent of next-generation sequencing, the metagenomic approach has found wide applicability in the field of microbial evolution and ecology. Metagenomics provides unprecedented resolution to unravel the functional diversity and metabolic potential of microorganisms in a community, and has been widely applied to elucidating microbial complexity in a variety of environments including the ocean [2], freshwater [3], soil [4], and the human body [5]. In particular, one of the most remarkable advances in metagenomics over the past years is the increasing possibility to reliably reconstruct hundreds to thousands of high-quality microbial genomes from raw sequencing reads through metagenomic assembly. In certain circumstances, it is even possible
Alessio Mengoni et al. (eds.), Bacterial Pangenomics: Methods and Protocols, Methods in Molecular Biology, vol. 2242, https://doi.org/10.1007/978-1-0716-1099-2_9, © Springer Science+Business Media, LLC, part of Springer Nature 2021
to recover complete microbial genomes through metagenomic assembly [6]. Metagenomic assembly is a complex computational task, mainly as a result of the inherent genetic diversity and genomic versatility of the microbial community. For example, intragenomic repeats such as mobile genetic elements have long been recognized as a challenge in assembly of isolated bacterial genomes [7]. Long stretches of intergenomic repeats can further arise from homologous regions of closely related bacterial strains that coexist in a community, adding to the complexity of metagenomic assembly. Recently, such burden is largely eased by the rapid development of sequencing technology that drastically reduces the cost of achieving high sequencing depth for metagenomic samples, the boosted computational power to facilitate memory-intensive assembly processes, and the development of novel algorithms and methods designed to tackle the assembly challenges. A typical metagenomic assembly process involves quality trimming of raw reads, assembly, contig/scaffold binning, manual curation, and postassembly quality evaluation. There is an increasing number of tools specifically designed for metagenomic assembly, most of which employ a technique called “de Bruijn graph” that involves splitting reads into k-mers, finding the k-1 base pair overlap between different k-mers and traversing through the graph of overlaps. Contiguous sequence fragments (or “contigs”) can be derived from this graph by walking all the paths formed by unambiguous stretches of “connected” sequences. In some situations, the contigs can be further oriented and linked to scaffolds based on the paired-end information, with gaps between contigs in the same scaffold. The output of the assembly is typically a set of contigs or scaffolds that may be further grouped though a process called genomic “binning.” The binning process is based on genomic signatures such as the k-mer profiles, contig coverage, oligonucleotide frequency, or a combination of these features. Presumably, such process will generate a set of bins each with a list of contigs belonging to the same genomes. However, owing to the complexity of metagenomes, the assembly processes can be incomplete and erroneous, making postassembly quality evaluation a critical next step. Of particular interest in the quality check are completeness of the genomes—how completely each bin represents a genome, and contamination—how many contigs in a bin were incorrectly assigned. In this chapter, we describe a routinely used workflow for reconstructing genomes from metagenomes, beginning with the quality filtering of raw reads and followed by assembly, binning, and postassembly quality check. For each step, we introduce one or two popular tools and provide the command for using the tools. Alternative methods for each step are also discussed. We also describe potential applications of the reconstructed genomes for
downstream analyses, such as taxonomic assignment, gene annotation, and phylogenetic analysis. Lastly, we provide a case study to recover near-complete microbial genomes from metagenomic data of an artificial acid mine drainage (AMD) system [8] using our workflow.
2
Materials In this section, we describe what is needed to assemble a metagenome from raw sequencing reads, including both hardware and software requirements to perform the analyses.
2.1
Raw Read Files
After the high-throughput sequencing is complete, one or more files are returned containing the sequencing reads in FASTQ format. Depending on sequencing strategy, single-end or paired-end reads can be generated. During sample preparation, individual “barcode” sequences are added to the DNA of each sample, which allows different samples to be mixed and sequenced together in one run. For a full paired-end Illumina HISEQ or MISEQ run, users will have R1, R2 files that contain the forward and reverse reads and I1 file that contains the associated barcode sequences for each read. Demultiplexing is the first step to segregate raw reads to individual samples according to the barcode sequences. This is usually performed by the sequencing facility but can also be done by users using tools such as Cutadapt [9]. The demultiplexed FASTQ files are the subject for downstream processing and analyses.
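If demultiplexing has to be performed by the user, a possible Cutadapt invocation is sketched below. This example assumes that the sample barcodes sit at the 5′ end of the forward reads and are listed in a FASTA file (here called barcodes.fasta, one entry per sample); the file names and the error-rate setting are illustrative only, not values prescribed in this chapter.

cutadapt -e 0.15 --no-indels -g file:barcodes.fasta \
    -o "{name}_R1.fastq" -p "{name}_R2.fastq" \
    R1.fastq R2.fastq

The {name} placeholder in the output names is replaced by Cutadapt with the name of the matching barcode, producing one pair of FASTQ files per sample.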
2.2
Hardware
The procedure of metagenomic assembly is typically done on a computer server equipped with a multicore processor and a GNU/Linux operating system. Many of the tools described here can also run on a desktop computer with a Windows or Mac operating system. However, the limiting factors to be considered are the CPU power and the amount of memory (RAM). The most computing-intensive step is read assembly, which requires computing resources that increase rapidly with the amount of sequencing reads (a naïve all-against-all comparison has time complexity O(n²), since every read must be compared with every other read). Hence, most of the assemblers can be configured to run in parallel with multiple threads. In comparison, other steps such as read quality filtering have a linear time complexity of O(n) and can therefore be readily performed on a desktop computer with optimized settings. Nevertheless, we strongly advocate the use of a multicore server with a GNU/Linux system, for which most of the bioinformatics tools are developed and optimized.
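Before launching an assembly, it can be useful to check the resources available on the server. On most GNU/Linux systems this can be done with standard commands (shown here only as a quick reference, not as part of the workflow itself):

nproc      # number of available CPU cores
free -h    # total and available RAM
df -h      # free disk space on mounted file systems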
2.3
Software
All the bioinformatics tools described in this chapter are freely available online.
FastQC (v0.11.8): https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Trimmomatic (v0.39): http://www.usadellab.org/cms/?page=trimmomatic
Cutadapt (v2.3): https://cutadapt.readthedocs.io/en/stable/guide.html
Bowtie 2 (v2.3.5.1): http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
MEGAHIT (v1.2.5): https://github.com/voutcn/megahit
SPAdes (v3.12): http://cab.spbu.ru/files/release3.12.0/manual.html
BBmap (v38.56): https://sourceforge.net/projects/bbmap/
Samtools (v1.9): https://github.com/samtools/samtools
MetaBAT (v2.12.1): https://bitbucket.org/berkeleylab/metabat
MaxBin (v2.2.7): https://sourceforge.net/projects/maxbin/
CheckM (v1.0.12): https://ecogenomics.github.io/CheckM/
GTDB-Tk (v0.3.0): https://github.com/Ecogenomics/GTDBTk
Prodigal (v2.6.3): https://github.com/hyattpd/Prodigal
Diamond (v0.9.24): https://github.com/bbuchfink/diamond
3
Methods
In this section, we describe the general workflow to perform metagenomic assembly, starting from quality filtering of raw sequencing reads, to contig/scaffold generation by read assembly, to assignment of assembled contigs/scaffolds into genomic bins, and finally to quality assessment of the assembly results (Fig. 1). For each step we focus on one or two tools that are widely used in the community. We also discuss alternative tools and methods at the end of each subsection.
3.1 Quality Assessment of Raw Sequencing Reads
Quality assessment of FASTQ sequencing reads is performed in FastQC [10], a Java-based tool for quality control analysis of high-throughput sequencing reads. FastQC can be executed both in a stand-alone mode with an interactive graphical interface and in a noninteractive mode with a command-line interface. To execute FastQC on paired-end FASTQ files on the command line, we would run:
fastqc R1.fq
fastqc R2.fq
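When several samples have to be processed, FastQC can analyze multiple (optionally gzip-compressed) files in a single call; the output directory and number of threads below are illustrative choices rather than values prescribed by the chapter:

mkdir -p fastqc_out
fastqc -t 4 -o fastqc_out R1.fq R2.fq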
Fig. 1 Overall workflow of metagenomic assembly. Raw sequencing reads undergo (1) quality assessment (FastQC) and (2) quality trimming (Trimmomatic, Cutadapt) to obtain clean sequencing reads; these are subjected to (3) read assembly (MEGAHIT, SPAdes) into contigs/scaffolds and (4) genomic binning (MetaBAT, MaxBin) into draft genomes with potential errors, followed by (5) quality check (CheckM) and (6) quality refinement (RefineM) to obtain error-corrected draft genomes, which can then be used for (7) taxon identification (GTDB-Tk, PhyloPhlan), (8) gene annotation (Prodigal, Diamond/NR), and (9) phylogenomics (PhyloPhlan, AMPHORA2)
Given the raw FASTQ reads, FastQC reports a set of statistics including basic statistics such as the total number of sequences and the sequence length distribution, and more sophisticated statistics such as sequence quality per base and per sequence, overrepresented sequences, and sequence duplication levels. The output of FastQC is a set of HTML files that can be viewed in the browser. For each quality control analysis module, the report includes a quick evaluation of whether the results of the module are normal ("Pass" with a green tick), slightly abnormal ("Warn" with an orange triangle), or very unusual ("Fail" with a red cross). While it gives a straightforward pass/fail result, such a result is based on a priori
defined assumptions and does not take into account the characteristics of the user's experiment. Therefore, users need to be cautious about relying on these flags and should interpret them in the context of their specific sequencing tasks. The result of FastQC can be used as a guideline for quality filtering of raw sequencing reads. In particular, one of the most relevant statistics is the per-base sequence quality, based on which low-quality stretches at both the 5′ and 3′ ends of the reads can be trimmed, as discussed below.
3.2 Quality Trimming of Raw Sequencing Reads
Read quality trimming and filtering is a critical step before metagenomic assembly, as errors arising from library preparation and sequencing can lead to significant assembly artifacts. Multiple software packages can be used for read trimming and filtering, among which the most widespread are Trimmomatic [11] and Cutadapt [9]. Both programs allow multiple trimming steps to be performed in order, including removal of adapters, low-quality sequences, and short sequences. The main difference between the two programs is that Trimmomatic performs quality pruning using a sliding-window cutting algorithm, whereas Cutadapt trims reads based on the per-base quality score. Compressed FASTQ files (.fq.gz) can also be used as input for both programs. To trim paired-end FASTQ reads using Trimmomatic, we can execute the command:
java -jar trimmomatic-0.39.jar PE R1.fq R2.fq R1_trim.fq R1_trim_unpaired.fq R2_trim.fq R2_trim_unpaired.fq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
This will perform the following tasks in the given order: (1) remove adapters in TruSeq3-PE.fa; (2) remove leading bases below a quality score of 3; (3) remove trailing bases below a quality score of 3; (4) scan the reads with a four-base sliding window and cut the sequence when the average quality per base drops below 15; and (5) discard reads shorter than 36 bp. Alternatively, we can run the command below in Cutadapt to trim the paired-end reads:
cutadapt -a forward_adapter_seq -b reverse_adapter_seq -A forward_adapter_seq -B reverse_adapter_seq --pair-filter=any -q 20:20 -m 36:36 -o R1_trim.fq -p R2_trim.fq R1.fq R2.fq
This will (1) remove the designated forward and reverse adapter sequences, (2) trim the bases below quality score of 20, and (3) discard reads shorter than 36 bp. Other tools used for read quality assessment and filtering include Fastx toolkit [12] which utilizes a collection of scripts for reads processing, and Fastp [13] which combines quality assessment, adapter trimming, and reads filtering steps in one program. Fastp also provides a base correction algorithm based on the
overlaps between the two reads of each pair. After read filtering, it is advisable to run FastQC again to ensure that the quality of the trimmed sequences has improved and satisfies the quality checks before proceeding to metagenomic assembly.
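As a minimal sketch of the all-in-one alternative mentioned above, Fastp can perform adapter detection, quality trimming, and length filtering in a single call; the thresholds below simply mirror the Trimmomatic/Cutadapt settings used in this chapter and are not values recommended by the Fastp authors:

fastp -i R1.fq -I R2.fq -o R1_trim.fq -O R2_trim.fq \
    --detect_adapter_for_pe -q 20 -l 36 -w 4 -h fastp_report.html

The HTML report written by -h summarizes before- and after-filtering quality statistics for both reads of each pair.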
3.3 Metagenomic Assembly to Generate Contigs or Scaffolds
After read trimming and filtering, the next step is to assemble the clean reads into longer contigs. Numerous tools have been developed for metagenomic assembly. Here we describe two widely used assemblers: MEGAHIT [14] and SPAdes [15], both designed to assemble large and complex metagenomic data within a feasible amount of computer memory. MEGAHIT employs a new data structure called succinct de Bruijn graphs (SdBG) and a fast parallel algorithm to achieve ultrafast and memory-efficient assembly. To apply MEGAHIT to paired-end clean reads we can execute the following:
megahit -1 R1_trim.fq -2 R2_trim.fq --k-min 31 --k-max 111 --k-step 20 -o sample_trim_megahit --out-prefix sample_trim --min-contig-len 500 -t 12
Here, the most important parameter in MEGAHIT and other assemblers is the size of the k-mers used to build the de Bruijn graphs. The selection of the k-mer size is a tradeoff. While a small k-mer size is favorable for filtering erroneous edges and filling gaps in low-coverage regions, a large k-mer size is useful for resolving repeats. MEGAHIT utilizes an incremental k-mer size strategy by building de Bruijn graphs iteratively from k-min to k-max. In the above command, MEGAHIT starts from k = 31, builds the SdBG, cleans the graph, generates contigs, and extracts (k + 20 + 1)-mers as the edges of the graph for the next iteration. This process ends when k = 111. The k-mer length can be set between 15 and 127. MEGAHIT allows for multithreaded parallel computing by setting the parameter -t (here 12 CPU threads are used). SPAdes is another powerful tool for metagenomic assembly. It was originally developed for the assembly of small bacterial or fungal genomes or single-cell sequencing data but has proven to be applicable to metagenomic assembly (as a separate pipeline called metaSPAdes). SPAdes employs an iterative multi-k-mer approach similar to MEGAHIT. But unlike the latter, SPAdes does not reduce the size of the data by replacing reads with preassembled contigs at each iteration of de Bruijn graph construction. Instead, it utilizes both the original reads and the preassembled contigs at each iteration, to be able to better account for small indels [16]. Hence, while SPAdes potentially generates more precise results, it can be more memory-intensive and time-consuming than MEGAHIT. It has been shown that, depending on the dataset, MEGAHIT can be 2–7 times more memory-efficient than SPAdes [17]. One advantage of SPAdes compared to MEGAHIT, however, is that it performs scaffolding, a process of resolving repeats and linking contigs to
longer stretches based on the paired-end information [18]. To perform assembly using SPAdes, we can execute the following: spades.py --pe1-1 R1_trim.fq --pe1-2 R2_trim.fq -o sample_trim_spades -t 12 -k 31,51,71,91,111 --meta
Other tools for metagenomic assembly include IDBA_UD [19], MetaVelvet [20], and SOAPdenovo2 [21], all based on the de Bruijn graph algorithm but each with a different architecture. Once assembly is completed, some basic assembly statistics need to be assessed. These include the number of assembled contigs, the length of the longest contig, the length distribution of all contigs, and the N50 of the assembly. The N50 is the length of the shortest contig in the set of longest contigs that together cover at least 50% of the total assembly length; the higher the N50, the more contiguous the assembly. Tools such as MetaQUAST [22] and FRCurve [23] can be used to compare the performance of different assemblies and identify the strategy that yields the optimal assembly results, as in the example shown below.
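A possible MetaQUAST invocation comparing the two assemblies produced above is sketched here; the file paths simply follow from the MEGAHIT and SPAdes commands shown earlier and may differ on your system:

metaquast.py -o metaquast_out -t 12 \
    sample_trim_megahit/sample_trim.contigs.fa \
    sample_trim_spades/scaffolds.fasta

The report written to metaquast_out summarizes, for each assembly, metrics such as the number of contigs, total length, largest contig, and N50.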
3.4 Assignment of Assembled Contigs/Scaffolds to Genomic "Bins"
After assembly, contigs or scaffolds can be further clustered into genomic "bins," which enables near-complete microbial genomes to be reconstructed. Binning can be performed through a variety of approaches based on genomic signatures such as k-mer profiles, contig coverage, GC content, tetranucleotide frequency, or a combination of these features. Here we introduce two methods, MetaBAT [24] and MaxBin [25], that utilize both tetranucleotide frequencies and sequence coverage for binning. We first map the clean reads against the contigs using BBMap [26]. We then use the built-in program "jgi_summarize_bam_contig_depths" in MetaBAT to generate a "depth.txt" file containing the read coverage depth of each contig across samples.
bbmap/bbmap.sh in1=R1_trim.fq in2=R2_trim.fq ref=scaffolds.fasta out=scaffolds.bbmap.bam k=14 minid=0.9 threads=12 build=1
samtools sort -o scaffolds.bbmap.sorted.bam scaffolds.bbmap.bam
metabat/jgi_summarize_bam_contig_depths --outputDepth depth.txt scaffolds.bbmap.sorted.bam
The MetaBAT tool takes the "depth.txt" file as input to assign contigs to different bins. We set the number of CPU threads to 12 with the -t option and use --unbinned to save unbinned contigs into a separate file.
metabat -i scaffolds.fasta -a depth.txt -o bins_dir -t 12 --unbinned
Likewise, we can use MaxBin with similar settings:
run_MaxBin.pl -thread 12 -contig scaffolds.fasta -out scaffolds.maxbin -abund depth.txt
Other alternative tools for genomic binning include CONCOCT [27], GroopM [28], and MyCC [29], each with its unique features. As for assembly, there are also tools, such as AMBER, to assess the binning performance of different binners [30]. Another strategy is to combine the results of different binning tools to achieve the optimal outcome, such as the scoring strategy used in DAS Tool [31]; a sketch of such a combination is shown below.
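The following is only a sketch of how the MetaBAT and MaxBin results could be combined with DAS Tool, under the assumption that the bins produced by each tool are stored as FASTA files in separate directories; the helper script that converts bin folders into contig-to-bin tables is shipped with DAS Tool, although its name (Fasta_to_Scaffolds2Bin.sh or Fasta_to_Contig2Bin.sh) differs between versions:

Fasta_to_Scaffolds2Bin.sh -i metabat_bins_dir -e fa > metabat.contigs2bin.tsv
Fasta_to_Scaffolds2Bin.sh -i maxbin_bins_dir -e fasta > maxbin.contigs2bin.tsv
DAS_Tool -i metabat.contigs2bin.tsv,maxbin.contigs2bin.tsv \
    -l metabat,maxbin -c scaffolds.fasta -o dastool_out -t 12

DAS Tool scores the candidate bins from both inputs and reports a dereplicated, non-redundant set of refined bins under the dastool_out prefix.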
3.5 Postassembly Quality Evaluation
After the genomic bins are generated, a postassembly quality check for genome completeness and contamination can be performed using CheckM [32]. CheckM provides estimates of completeness and contamination of a reconstructed genome using a set of "marker genes" specific to the position of a genome within a reference genome tree. It includes three different workflows: the recommended lineage-specific workflow for the analysis of each individual bin, the taxonomic-specific workflow that allows different bins to be analyzed with the same marker set (i.e., if they belong to the same taxonomic group), and the custom marker-gene workflow that allows users to specify their own marker genes of interest as hidden Markov models (HMMs). Assuming all binned sequences are stored in the bins_dir folder with ".fa" extensions, a typical lineage-specific workflow would run as follows:
checkm lineage_wf -t 12 -x fa bins_dir/ checkm.output
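If a machine-readable summary is preferred, the same workflow can also write its quality table to a tab-separated file; the file name below is only an example:

checkm lineage_wf -t 12 -x fa --tab_table -f checkm_summary.tsv bins_dir/ checkm.output

The resulting table reports, for each bin, the inferred marker lineage together with the completeness and contamination estimates discussed above.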
After the assembly quality check using CheckM, users can use RefineM [33] to improve completeness of a genome and identify contaminations. RefineM identifies potential contaminations based on the genomic properties of bins (GC content, tetranucleotide signature, coverage), along with their taxonomic assignment against a reference database. As RefineM is still in active development and can have major changes in newer versions, here we do not touch on the detailed usage for its current version. We encourage users to refer to its website for the most up-to-date software and instructions (https://github.com/dparks1134/RefineM). 3.6 Potential Applications of Recovered Genomes for Downstream Analyses
With the reconstructed microbial genomes from metagenome, users can perform further analyses to answer their biological questions of interest. For example, users can perform taxonomic identification and gene annotation to understand the composition and functional properties of an environmental sample. Taxonomic identification can be performed at read (i.e., via MEGAN [34], MetaPhlan [35]), contig (i.e., via Kraken [36], PhyloPythiaS+ [37]), and genome level (i.e., via PhyloPhlan [38], GTDB-Tk [39]). The taxonomic information for reads and contigs are often integrated in the process of taxonomic-based genomic binning. GTDB-Tk is a recently developed tool for assigning taxonomic classifications to microbial genomes based on a combination of phylogeny, Average Nucleotide Identity (ANI) values and relative
evolutionary divergence. To run GTDB-Tk, assuming the reconstructed genomes are in the genomes_dir folder with “.fa” extensions, we would execute the following: gtdbtk classify_wf --cpus 12 --genome_dir genomes_dir --out_dir genomes_taxa --extension fa
In addition, users can perform gene annotation to understand the functional potential of the reconstructed genomes. Prodigal can be used for open reading frame (ORF) prediction [40]. Functional annotation of the ORFs can then be performed by sequence alignment against the NCBI nonredundant (NR), KEGG, or eggNOG databases using aligners such as Diamond [41] (the NR database is used in the example below):
prodigal -i metagenome.fa -a proteins.faa -d nucleotides.fna -o coords.gff -f gff -p meta -m
diamond-v0.9.24 blastp -p 20 -q proteins.faa -a proteins.daa -d nr/nr.dmnd -e 1e-5 -k 5
diamond-v0.9.24 view -a proteins.daa -o proteins.m8
Finally, to understand the phylogenetic diversity and evolutionary relationship of the genomes, users can reconstruct whole genome phylogenetic trees using PhyloPhlan [38] or AMPHORA2 [42]. PhyloPhlan is an integrated pipeline for genome-scale phylogenetic analysis (known as phylogenomic analysis), based on a set of 400 conserved “marker” genes broadly distributed across bacteria and archaea. Similarly, AMPHORA2 is an automated pipeline for phylogenomic analysis based on 31 universal, single-copy “marker” genes. The program contains a set of manually curated HMMs that allows for a fully automated process for marker gene identification, alignment trimming, and phylogenetic reconstruction.
4
Case Study
Here we provide a case study to show how to reconstruct near-complete microbial genomes from metagenomes using our workflow. We use the metagenomic data from an artificial AMD system as an example [8]. About 434 GB of metagenomic raw data were generated by Illumina HiSeq and MiSeq sequencers. After removing duplicates and low-quality reads, all high-quality reads (~377 GB) were coassembled using SPAdes with the parameters "-k 21,33,55,77,99,127 --meta". A total of 18,270 assembled scaffolds (length ≥2000 bp) were obtained, with a total length of 186 Mb. To calculate scaffold coverage, all high-quality reads from the metagenomic datasets were mapped to the assembled scaffolds using BBMap with the parameter "minid=0.97". These scaffolds were binned using MetaBAT with the parameters "-m 2000 --unbinned", which considers both tetranucleotide
frequencies and the coverage of these scaffolds. The retrieved bins from MetaBAT were evaluated and refined based on taxonomic assignment, genome completeness, potential contamination, and strain heterogeneity, using CheckM. A total of 80 metagenome-assembled genomes (MAGs) were obtained, including 39 high-quality MAGs (completeness >90% and contamination <5%) and additional MAGs of medium quality (completeness >50% and contamination <10%). MAGs were further classified as abundant (relative abundance >1%) or rare (relative abundance <1%).
Note that MetaBAT 2 requires a minimum contig length of at least 1500 bp. Create the bins folder:
mkdir bins
Run MetaBAT 2:
metabat2 -i sample1_assembly/scaffolds.fasta \
    -a sample1_depth.txt -o bins/sample1_bin \
    --seed 42
The output FASTA files will be placed in the bins directory. For coassembly, see Note 5.
3.4 Quality Assessment and Dereplication
After binning, it is essential to assess the quality of the assembled bins. Besides the parameters usually reported for single-genome assembly (e.g., assembly size, contig N50/L50, and maximum contig length), it is important to estimate their completeness (i.e., the fraction of the genome that has been recovered) and contamination (i.e., the fraction of the draft genome that is of exogenous origin). Currently, there is no established standard procedure for the definition and estimation of these two parameters. Common approaches are (1) the mapping to closely related genomes or (2) the identification and characterization of sets of universal marker genes. While the former approach is often not possible due to the lack of suitable references for many microbial lineages and high levels of strain heterogeneity [38], the latter has become popular for prokaryotic genomes thanks to the availability of dedicated software tools [11, 39]. Recent guidelines [40] suggest classifying metagenome-assembled genomes (MAGs) as "high-quality draft" (HQ) if they are >90% complete with less than 5% contamination, "medium-quality draft" (MQ) if they have completeness estimates of at least 50% and less than 10% contamination, and "low-quality draft" (LQ) if completeness and contamination are less than 50% and 10%, respectively.
Fig. 4 Schematic representation of the quality assessment and dereplication performed by dRep. (a) The input files are the MAGs defined in the previous step. MAGs can be classified into HQ, MQ, and LQ according to their quality, as suggested in [40]. In this 2-dimensional representation MAGs are placed according to their distance, defined as 1-ANI. (b) LQ MAGs are removed from the set. (c) The remaining MAGs are clustered with an ANI strain-level threshold of 0.99, defining six clusters A–F. For each cluster, the genome with the highest quality score is designated as the representative
distributed in the different samples. The aim of dereplication is to remove redundant genomes and identify the minimal set of high-quality representatives. Individual sample assembly coupled with dereplication is a valid alternative to coassembly, improving the recovery process of genome bins. dRep [15] performs dereplication by implementing the following steps (see Fig. 4):
1. Filtering genomes using CheckM (see Note 6). A completeness threshold and a contamination threshold are applied so that low-quality genomes are removed from the set (Fig. 4b).
>contig_ID_information
GTTTGTGAACTTTGATATTTCATGTAGAGTATATAATATATATTTGGGGTACTTTG
>contig_ID_information
TATACTGTACAAAAAATATCAAAGTACCCAAGGTATATATTCTATACTGTACAAAA
Software
1. Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) [12] (version 3.0.0) is a fast microbial (bacteria and archaea) gene recognition tool which is able to predict protein-coding gene; to handle draft genomes and metagenomes; to handle gaps, scaffolds, and partial genes; to identify translation initiation sites; and to generate detailed statistics summary for each genome. Prodigal program can be run locally under UNIX-like operating platforms (i.e., Linux, Windows, and MacOSX). Sequences in FASTA, FASTQ, and GenBank formats can be used as input in Prodigal software. Software user manual is available on the following website: https://github.com/hyattpd/prodigal/wiki. To install using terminal in OS systems carry out the following steps: # install Prodigal make install INSTALLDIR=
2. Bowtie2 [13] (version 2.3.4.3) is an ultrafast and efficient tool for aligning sequencing reads to long reference sequences. In this work, it is used to map reads back to the assemblies. Bowtie2 can be run on Linux, Windows, and MacOSX. The software user manual is available on the following website: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml. To install from a terminal, carry out the following steps:
# create and enter the install directory, then download Bowtie2
mkdir -p $HOME/tools/bowtie2/
cd $HOME/tools/bowtie2/
wget https://sourceforge.net/projects/bowtie-bio/files/bowtie2/<version>/bowtie2-<version>-linux-x86_64.zip/download
# decompress
unzip download
# add the location to the system PATH and check the installation
export PATH=$HOME/tools/bowtie2/bowtie2-<version>:$PATH
bowtie2 --help
3. Bedtools [14] (version 2.27.0) is defined as an innovative tool for genome arithmetic because it allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF, and VCF. Bedtools can be run on UNIX-like operating platforms (i.e., Linux, Windows, and MacOSX). Software user manual is available on the following website: https://bedtools.readthedocs.io/en/latest/index. html. To install using terminal in OS systems carry out the following steps:
# create and enter the install directory, then download Bedtools
mkdir -p $HOME/tools/bedtools/
cd $HOME/tools/bedtools/
wget https://github.com/arq5x/bedtools2/releases/download/v<version>/bedtools-<version>.tar.gz
# decompress
tar -zxvf bedtools-<version>.tar.gz
# compile and check the installation
cd bedtools2
make
4. The Resistance Gene Identifier (RGI) (version 4.0.3) and Comprehensive Antibiotic Resistance Database (CARD) [15] (version 3.0.7). The first tool is used to predict ARGs from protein or nucleotide data based on homology and SNP models. This tool uses reference data from the CARD database. Analyses can be performed via website (using the website https://card.mcmaster.ca/analyze/rgi) via Galaxy platform or by installing it from Conda or Docker. Software user manual is available on the following website: https://github.com/ arpcard/rgi. To install using terminal in OS systems carry out the following steps:
# first of all, check that the required tools are available and install the dependencies
pip3 install six
pip3 install biopython
pip3 install filetype
pip3 install pytest
pip3 install mock
pip3 install pandas
pip3 install matplotlib
pip3 install seaborn
pip3 install pyfaidx
pip3 install pyahocorasick
# Install RGI from Project Root pip3 install git+https://github.com/arpcard/rgi.git
# Running RGI test
rgi -h
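Before running analyses, RGI also needs a local copy of the CARD reference data. A possible way to obtain and load it, based on the commands documented in the RGI repository (file names and paths may change between CARD releases), is:

wget https://card.mcmaster.ca/latest/data
tar -xvf data ./card.json
rgi load --card_json ./card.json --local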
3
Methods In this section, the crucial steps to identify ARGs starting from metagenomic data are described (Fig. 1).
3.1 Coverage from Metagenomic Data and Its Analysis
Once a contigs.fasta file has been obtained for each sample and contigs smaller than 500 bp have been removed, Prodigal can be run to predict the open reading frames (ORFs) present in the metagenomic data under study (gene calling). This information makes it possible to classify the translated protein sequences obtained from the contigs with a mapper tool against a specific database (e.g., eggNOG-mapper against the bactNOG database).
$ prodigal -i <contigs.fasta> -o <orfs.gff> -a <proteins.faa> -p anon -f gff
The command-line arguments -o and -f are the output table of predicted ORFs and its format (in this case .gff), respectively. For
Fig. 1 Experimental pipeline for identification of ARGs
metagenomic data, it is suggested that -p followed by anon (anonymous mode) be used. In order to identify the average coverage and the G/C content of contigs, the processed reads in FASTQ format must be mapped back to the contigs using Bowtie2 (redundant reads could be removed during this step). To perform this analysis, run the following command line:
$PATH/bowtie2 -x <bowtie2_index> -1 $PATH/<reads_R1.fastq> -2 $PATH/<reads_R2.fastq> -S <mapping.sam>
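Note that the index passed to -x must be built beforehand from the assembled contigs, and that bedtools multicov (used below) expects a coordinate-sorted and indexed BAM file rather than a SAM file. A possible way to prepare both, assuming samtools is installed (file names are placeholders), is:

$PATH/bowtie2-build <contigs.fasta> <bowtie2_index>
samtools view -bS <mapping.sam> | samtools sort -o <mapping.sorted.bam> -
samtools index <mapping.sorted.bam>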
To obtain the number of reads mapping to each ORF, the bedtools command "multicov" is used:
$ bedtools multicov -bams <mapping.sorted.bam> -bed <orfs.gff> > <multicov_output.txt>
(other command-line arguments could be added to this command)
An example of the content of a multicov file, in which the number of mapped reads is reported as the last field of each record, is shown here:
NODE_1_length_242767_cov_17.9369  Prodigal_v2.6.3  CDS  1  780  142.4  -  0  ID=1_1;partial=10;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.683;conf=100.00;score=141.72;cscore=123.45;sscore=18.28;rscore=15.89;uscore=-1.51;tscore=4.54;  281
NODE_1_length_242767_cov_17.9369  Prodigal_v2.6.3  CDS  976  2460  319.2  -  0  ID=1_2;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.688;conf=99.99;score=319.25;cscore=310.33;sscore=8.92;rscore=0.72;uscore=3.65;tscore=4.54;  482
To quantify gene content across different metagenomic samples, genes are collapsed by summing the number of reads that map to genes with the same annotation, using the bestOG assignment given by eggNOG-mapper (see the sketch below). With this information it is possible to evaluate the variation of microbial biodiversity, using the coverage values, considered as a quantitative indicator, together with the G/C content and the length of the contigs for each sample.
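The eggNOG-mapper step itself is not detailed in this chapter; a minimal sketch of how the predicted proteins could be annotated with it is given below (the tool is not listed among the materials above, and input/output names are placeholders):

emapper.py -i <proteins.faa> -o <sample_prefix> -m diamond --cpu 12

The resulting annotation table contains, among other columns, the orthologous group assignments from which the bestOG used for collapsing gene counts can be taken.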
3.2 ARGs Identification
The CARD database is used in combination with the Resistance Gene Identifier (RGI) tool to inspect the distribution of antibiotic resistance genes (ARGs), using the following command line:
rgi main -i <input_sequence> -o <output_file> -t <input_type> -a <alignment_tool> -n <num_threads>
RGI output contains the Open Reading Frame Identifier (ORF_ID), contig source sequence, start and stop coordinates of the ORF, strand of the ORF, bitscore value of the match to the top hit in the CARD database, Antibiotic Resistance Ontology (ARO) accession of the match to the top hit in CARD, Drug Class, Resistance Mechanism, Antimicrobial Resistance (AMR) Gene Family, ORF predicted nucleotide sequence, ORF predicted protein sequence, protein sequence of the top hit in CARD, CARD detection model ID, and Percentage Length of Reference Sequence, which is calculated using the following expression:
Percentage Length of Reference Sequence = (length of ORF protein / length of CARD reference protein) × 100
The same approach previously described can be used to quantify the ARGs predicted by the RGI tool, but, this time, the unique identifier provided by CARD is used to collapse counts.
4
Conclusions
This chapter has presented a potential pipeline that can be applied to perform a functional metagenomic analysis with the final aim of identifying and quantifying the ARGs detectable in environmental samples. In this scenario, scientists can detect the presence of ARGs in metagenomic samples and understand how, and to what extent, antibiotic resistance mechanisms evolve in microbial communities under selective pressure (the antibiotic era). For further information and utilities, please consult the related publications [6, 12] and the respective user manuals.
References 1. Handelsman J (2004) Metagenomics application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68:669–685 2. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM (1998) Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol 5:R245–R249 3. Pettersson E, Lundeberg J, Ahmadian A (2009) Generations of sequencing technologies. Genomics 93:105–111 4. Gupta S, Arango-Argoty G, Zhang L, Pruden A, Vikesland P (2019) Identification of discriminatory antibiotic resistance genes among environmental resistomes using extremely randomized tree algorithm. Microbiome 7:123 5. Lerminiaux NA, Cameron ADS (2019) Horizontal transfer of antibiotic resistance genes in
clinical environments. Can J Microbiol 65:34–44 6. Ewing B, Hillier LD, Wendl MC, Green P (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8:175–185 7. Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8:186–194 8. Bacci G, Bazzicalupo M, Benedetti A, Mengoni A (2014) StreamingTrim 1.0: a Java software for dynamic trimming of 16S rRNA sequence data from metagenetic studies. Mol Ecol Resour 14:426–434 9. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA (2017) MetaSPAdes: a new versatile metagenomic assembler. Genome Res 27:824–834
10. Mikheenko A, Saveliev V, Gurevich A (2016) MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32:1088–1090 11. Truong DT et al (2015) MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods 12:902–903 12. Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119
13. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357–359 14. Quinlan AR, Hall IM (2010) The BEDTools manual. Genome 16:1–77 15. Jia B, Raphenya AR, Alcock B et al (2017) CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res 45(D1):D566–D573
Chapter 12 Host Trait Prediction from High-Resolution Microbial Features Giovanni Bacci Abstract Predicting host traits from metagenomes presents new challenges that can be difficult to overcome for researchers without a strong background in bioinformatics and/or statistics. Profiling bacterial communities using shotgun metagenomics often leads to the generation of a large amount of data that cannot be used directly for training a model. In this chapter we provide a detailed description of how to build a working machine learning model based on taxonomic and functional features of bacterial communities inhabiting the lungs of cystic fibrosis patients. Models are built in the R environment by using different freely available machine learning algorithms. Key words Machine learning, Next generation sequencing, Metagenomics, Host trait prediction, Community profiling, Taxonomic profiling, Functional profiling
1
Introduction
Metagenomics, the direct extraction and sequencing of genetic material from bacterial cells inhabiting a given environment, has drastically increased our knowledge of the microbial world. Since the first study made by Antonie van Leeuwenhoek in the 1680s, where he compared microorganisms from fecal and oral samples of healthy and ill individuals [1], microbiologists have characterized thousands of different microbial strains in almost all districts of the human body [2–4]. The interaction between plants and microorganisms has been intensively explored during the last 20 years, shedding light on new possible methods of cultivation and defining groups of microorganisms associated with plant health [5]. In 2001 Joshua Lederberg coined the term microbiome, referring to "the ecological community of commensal, symbiotic, and pathogenic microorganisms that literally share our body space" [6]; but in more recent times researchers have used the same term to refer to other types of macroorganisms such as arthropods, fish, and plants. Many studies have been performed in animals and plants, reporting
Alessio Mengoni et al. (eds.), Bacterial Pangenomics: Methods and Protocols, Methods in Molecular Biology, vol. 2242, https://doi.org/10.1007/978-1-0716-1099-2_12, © Springer Science+Business Media, LLC, part of Springer Nature 2021
the description of the bacterial communities (microbiomes) found in several districts such as the gut, roots, skin, and leaves [7–9]. In the last decade the sequencing cost for a megabase of DNA has dropped while the output of sequencing machines has rapidly increased. The advent of third-generation sequencing technologies (also known as long-read sequencing) has enabled the production of long DNA sequences from single DNA molecules, increasing the resolving power of omics techniques including metagenomics, but all these technical advancements require the development of specific analysis methods suitable for different applications. Several methods have been developed to cope with this humongous amount of data, all dealing with sequence information that can be retrieved from public databases (extrinsic approaches) or directly from the sequences themselves (intrinsic approaches). These methods (almost) always produce abundance matrices describing different aspects of the bacterial community under study, depending on the analytic approach used. In this chapter we are going to use two matrices reporting the abundance of bacterial taxa and genes detected in the lungs of cystic fibrosis patients to inspect the link between host traits (i.e., the type of mutation in the CFTR gene) and microbial characteristics.
2
Materials This tutorial requires a working installation of R [10] along with a set of additional libraries mainly used for building and validating our final model. The workflow here proposed uses data coming from a metagenomic study on cystic fibrosis lung communities along time [11]. A complete description of datasets, hardware, and software requirements is given below.
2.1
Data Files
We will build a machine learning model based on bacterial features obtained from shotgun metagenomics sequencing. The data consist of three main tables reporting quantitative and qualitative information about the taxa detected in the lungs of the subjects included in the study, the genes harbored by those taxa, and the clinical characteristics of the subjects. Data can be downloaded from https://github.com/GiBacci/predicting_from_metagenomes/tree/master/data. Tables are available in the RDS format and can be easily imported into R using the function readRDS(). Since RDS is a native R data file format, we can load tables directly into the R environment without worrying about additional parameters such as field separator, decimal separator, character encoding format, and so on. A description of the data files is reported below: 1. taxa_ab.rds: taxa abundances in all subjects included in the study. Each row of the table is a different observation, whereas each column represents a different taxon detected. The
proportion of taxa is reported as relative abundance so that the sum of all taxa abundances in each observation is one. 2. gene_counts.rds: counts of metagenomic reads mapping to bacterial genes recovered from lung communities. The same standard used for taxa abundance was used here with each row reporting a different observation and each column reporting a different gene. In metagenomic studies genes are usually more than observations and they can be reported as rows instead of columns so to minimize the number of variables and reduce memory requirement. 3. sample_meta.rds: characteristics of patients included in the study. The table is the Table S1 of the paper reported above [11] and a complete description of columns is available in the work. In this chapter we will focus on the genotype of the patients aiming at building a machine learning algorithm that can predict a patient’s genotype from bacterial features. 4. gene_meta.rds: characteristics of genes included in the gene count table reported in 2. This table is a slightly modified version of the output produced by eggNOG mapper [12]. In principle any kind of metagenomics/transcriptomics study follow the scheme here proposed. The table reported in 1 and/or 2 could be replaced by the expression levels of genes found in a transcriptomic study or by counts of reads coming from a metabarcoding study based on 16S rRNA sequencing. Feel free to replace the tables reported above with any kind of data that fit the general scheme provided. 2.2 Software Requirements
Models are generated using a free software environment for statistical computing called R. R is part of many Linux distributions but it can be freely downloaded and installed from https://cran.r-proj ect.org/ by choosing the appropriate operation system in the “Download and Install R” window. Additional packages needed are listed below (the version of each package used in this tutorial is reported between brackets): 1. compositions [13] (version 1.40.3): a collection of functions for analyzing compositional data (quantitative data, strictly positive, which sum to a constant value). 2. vegan [14] (version 2.5.6): a package developed for studying multivariate data produced during ecological studies. It contains several functions for dimensionality reduction (such as correspondence analysis, nonmetric multidimensional scaling, and others) and for diversity analysis (either alpha or beta diversity). 3. DESeq2 [15] (version 1.26.0): a suite for the analysis of count data from many biological assays. The package was developed for analyzing RNA-seq data but it is also used for amplicon
sequence data (such as 16S rRNA metabarcoding) or ChIPSeq. It implements a transformation function (called variance stabilizing transformation or VST) useful to prepare count data for machine learning approaches. 4. caret [16] (version 6.0.85): a collection of functions for training and validating multiple machine learning algorithms. It contains methods for fine tuning classification and regression algorithms using a unified syntax. It also evaluates the performance of models produced using standard metrics such as root mean squared error, receiver operating characteristic curve, and accuracy. 5. randomForest [17] (version 4.6.14): implementation of the random tree forest algorithm described by Breiman in 2001 [18]. This package is used in combination with caret to build the final model. 6. kernlab [19] (version 0.9.29): implementation of the most used kernel-based machine learning methods [20]. In this chapter we will use the radial basis function kernel in combination with caret. 7. gbm [21] (version 2.1.5): implementation of gradient boosting machines. A stochastic gradient boosting approach will be used within caret. 8. pROC [22] (version 1.16.1): tool for visualizing receiver operating characteristic (ROC) curves. It contains also functions for comparing curves from different models. Curves are computed in sensitivity and specificity space defined as the probability to assign a true positive or a false negative given the model. 9. ggplot2 [23] (version 3.2.1): package for creating different types of graphics based on the book “The grammar of Graphics” by Leland Wilkinson [24]. 10. ggbeeswarm [25] (version 0.6.0): this package provides ggplot2 geoms for plotting categorical data minimizing overlaps. 11. multcompView [26] (version 0.1.8): package that converts vectors of p-values into a letter-based visualization useful for multiple comparisons across categories.
3
Methods
A working installation of R can run all the lines of code reported in this chapter. However, I would suggest writing the code in a script (a simple text file) using an integrated development environment (IDE) such as RStudio (https://rstudio.com/).
3.1
Importing Data
Before starting to build our models, data must be imported into R. Several functions can do this depending on the input file format. The data suggested in this chapter were saved using a native R format called RDS and can be imported using the readRDS function. In case of text data the function read.table can be used as well as one of its sister functions (to see the help of read.table simply run: ?read.table).
# importing gene counts genes