Metagenomics : Current Advances and Emerging Concepts [1 ed.] 9781910190609, 9781910190593


English Pages 154 Year 2017


Metagenomics: Current Advances and Emerging Concepts

https://doi.org/10.21775/9781910190593

Edited by Diana Marco
Faculty of Biological Sciences, Córdoba National University; and CONICET, Córdoba, Argentina

Caister Academic Press

Copyright © 2017 Caister Academic Press, Norfolk, UK
www.caister.com

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-1-910190-59-3 (paperback)
ISBN: 978-1-910190-60-9 (ebook)

Description or mention of instrumentation, software, or other products in this book does not imply endorsement by the author or publisher. The author and publisher do not assume responsibility for the validity of any products or procedures mentioned or described in this book or for the consequences of their use.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher. No claim to original U.S. Government works.

Cover design adapted from images provided by Diana Marco.

Ebooks
Ebooks supplied to individuals are single-user only and must not be reproduced, copied, stored in a retrieval system, or distributed by any means, electronic, mechanical, photocopying, email, internet or otherwise. Ebooks supplied to academic libraries, corporations, government organizations, public libraries, and school libraries are subject to the terms and conditions specified by the supplier.

Contents

Preface

1 Integration of Ecology and Environmental Metagenomics Conceptual and Methodological Frameworks
Diana Marco

2 Guidelines to Statistical Analysis of Microbial Composition Data Inferred from Metagenomic Sequencing
Vera Odintsova, Alexander Tyakht and Dmitry Alexeev

3 Methods for the Metagenomic Data Visualization and Analysis
Konstantin Sudarikov, Alexander Tyakht and Dmitry Alexeev

4 Comparing Viral Metagenomic Extraction Methods
Jeanette Klenner, Claudia Kohl, Piotr W. Dabrowski and Andreas Nitsche

5 Spatiotemporal Variations in the Abundance and Structure of Denitrifier Communities in Sediments Differing in Nitrate Content
David Correa-Galeote, Germán Tortosa, Silvia Moreno, David Bru, Laurent Philippot and Eulogio J. Bedmar

6 Using Metagenomics to Connect Microbial Community Biodiversity and Functions
Lucas W. Mendes, Lucas Palma Perez Braga, Acacio A. Navarrete, Dennis Goss de Souza, Genivaldo G. Z. Silva and Siu M. Tsai

7 Application of Omics Approaches to Studying Methylotrophs and Methylotroph Communities
Ludmila Chistoserdova

Index


Preface

Metagenomics continues to be one of the most dynamic scientific fields, due in part to the development of new and cheaper sequencing techniques. The new ease of obtaining data from microbiomes is a great step forward, but it also leads to a cascade of new methodological problems. As a contribution to addressing them, this book is mainly oriented to the new conceptual and methodological tools arising in metagenomics and other meta-omics. The diversity of habitats explored with metagenomics and other meta-omics has increased exponentially, from field to microcosm experiments through organism-specific microbiomes. However, although it lies at the very beginning of any metagenomics (or meta-omics) pipeline, the issue of how to get reliable data from adequate sampling, whether from field, microcosm or other types of habitats, is still largely disregarded. This may be due in part to an early divorce between the theoretical and methodological frameworks of ecology and metagenomics, in spite of the large common ground shared by the two disciplines. Fortunately, this is beginning to change, and a fruitful confluence between ecology and metagenomics (and other meta-omics) is under way (see Chapter 1). At present, another big challenge lies in the steps of organizing, classifying, analysing and interpreting the vast amount of data generated by metagenomics and meta-omics. Thanks to researchers’ efforts, new statistical and bioinformatic techniques are continuously appearing (see Chapters 2, 3 and 6). Exploring new microbiomes means that the diversity of samples taken is also increasing, thus creating new challenges for sample processing techniques, for example checking the effectiveness of nucleic acid extraction methods for viral clinical samples (see Chapter 4).
The next two chapters illustrate how a better understanding of microbial community ecology can be reached through metagenomics, focusing on both structure and function. In Chapter 5, a three-year study shows that the abundance and diversity of denitrifier communities in sediments were influenced by hydric seasonality and nitrate concentration along a stream in Southern Spain. Chapter 6 explores some advances in bioinformatics tools to connect microbial community biodiversity to potential metabolism, and shows how this information can be useful for a better understanding of the microbial role in tropical soils. Finally, Chapter 7 shows how some recent advances in the application of omics technologies like metaproteomics and DNA-SIP improve the study of the methylotroph guild, which performs an important function in natural environments. The book is especially aimed, among other users, at researchers and students interested in starting projects in this field, researchers already performing studies in metagenomics, teachers interested in the latest developments in the field, and persons involved in biotechnological applications like bioremediation. I would like to thank all the authors for their invaluable contributions. I am very grateful to Hugh Griffin for his unconditional support, and to Caister Academic Press for trusting me again with the editing of a volume on such a dynamic and changing field.

Diana Marco
Córdoba, Argentina

Integration of Ecology and Environmental Metagenomics Conceptual and Methodological Frameworks

1

Diana Marco

Faculty of Biological Sciences, Córdoba National University; and CONICET, Córdoba, Argentina. Correspondence: [email protected] https://doi.org/10.21775/9781910190593.01

Abstract

Although from its origin metagenomics was concerned with the composition of communities of microbial OTUs (operational taxonomic units) living in a given habitat and with their diversity and functional heterogeneity (concepts already well rooted in ecology), the new field was more ‘environmentally’ than ‘ecologically’ oriented. Probably for circumstantial reasons, metagenomics and ecology followed rather independent trajectories, and conceptual and methodological gaps appeared. Recently, calls have been made to integrate the theoretical basis and methodologies of metagenomics (and other meta-omics) and ecology. Here I will address some of the principles and methods of field ecology that, although useful in the context of environmental metagenomic studies, have been rather disregarded. In particular, I will emphasize the contribution of some well established concepts and methods of field ecology to appropriate field sampling and experimental design in environmental metagenomic studies.

Introduction

The early beginnings of metagenomics can be traced back to the 1980s. The pioneering work of Pace et al. (1985), using ribosomal RNA to study natural microbial populations without cultivation, the work of Woese and Fox (1977), proposing the use of ribosomal RNA as a tool for establishing phylogenetic relationships among microbial kingdoms, and the formal definition of the metagenome by Handelsman and colleagues (Handelsman et al., 1998) are the main milestones in the development of one of the most innovative scientific fields of recent decades. Since then, the number of works involving metagenomics and other meta-omics has grown exponentially, as a direct consequence of the advantages arising from the new ability to access microbial information bypassing cultivation,


and the development of new and cheaper sequencing techniques. The variety of habitats explored with metagenomics and other meta-omics has also increased exponentially, with virtually no limits on the diversity of samples taken, from field to microcosm experiments through organism-specific microbiomes. In parallel, the number of useful applications of metagenomics and meta-omics studies has also increased greatly, from agriculture to medicine, passing through the production of bioproducts like new enzymes. A typical pipeline for metagenomic studies (Fig. 1.1) begins with the delimitation of the microbiome of interest, the field and/or experimental sampling design, and the extraction of the genetic material. Extracting the desired material from an ever increasing range of metagenomic samples involves developing new and suitable methodologies. In the case of metagenomics, obtaining the metagenomic data requires more and more sophisticated and cheaper sequencing

Figure 1.1 A typical pipeline for metagenomic studies.


methods, and even sequencing-free strategies are being developed. But at present, one of the biggest challenges lies in the next steps of organizing, classifying, analysing and interpreting the vast amount of data generated by metagenomics and meta-omics. The other great challenge, and perhaps the most disregarded, lies however at the very beginning of the pipeline. While new statistical and bioinformatic techniques to treat the increasing amount of data produced are continuously appearing (see Chapters 2 and 3), the matter of how to get reliable data from adequate sampling, whether from field, microcosm or other types of habitats, is still largely overlooked. Several authors have drawn attention to this aspect in recent years, especially to the need for statistically adequate replicates in metagenomic studies (Prosser, 2010; Fierer et al., 2012; Knight, 2013; Creer et al., 2016), although with some controversy (Lennon, 2011). These calls for adequate, replicated sampling designs, and the subsequent controversy, may be related to the very origins of the metagenomics approach, which developed from molecular biology applied to microbial genomics (Pace et al., 1985; Woese and Fox, 1977) towards the challenge of linking the genomic information with the organism or ecosystem from which the DNA was isolated (Handelsman, 2004). Although from its origin metagenomics was concerned with the composition of communities of microbial OTUs (operational taxonomic units) living in a given habitat and with their diversity and functional heterogeneity (concepts already well established in ecology), the new field was more ‘environmentally’ than ‘ecologically’ oriented (O’Malley and Dupré, 2009). By the time metagenomics emerged as a new field, ecology was an already well established discipline with a high degree of formalization and a powerful theoretical and methodological background. 
However, probably for circumstantial reasons, the two disciplines followed rather independent trajectories, and conceptual and methodological gaps appeared. ‘Environmental genomics’, ‘microbial population genomics’ and ‘ecogenomics’ were some of the new terms coined to refer to metagenomics (DeLong, 2004), all of them resembling or alluding to ecological concepts. Some well-defined ecological concepts like ‘biodiversity’ or ‘niche’ were adopted in metagenomics studies, although with meanings or interpretations different from those already established in ecology. In particular, the term ‘niche’ began to be used in metagenomic and environmental genomics studies in spite of its original definition in ecology, centred on the traditional concept of species (Marco, 2008), and it is still being applied without further revision. In another, methodological example, the use of multivariate statistical analyses, appropriate for the multivariate data common in metagenomics and ecology, was still in its infancy in metagenomics as late as the beginning of the twenty-first century (Ramette, 2007), whereas such analyses have been among the most used statistical methods in ecology since the middle of the twentieth century (Goodall, 1954). Thus, among other issues, the aforementioned controversy over taking replicates for metagenomic studies clearly appears as a consequence of the parallel trajectories followed by the environmental genomics and ecological fields. Recently, a call has been made for environmental sequencing studies to adhere to robust ecological study design, allowing for an adequate number of sites/replicates to provide statistical power, as well as ensuring the collection of a robust set of environmental metadata (e.g. climate variables, soil pH) (Creer et al., 2016). 
Clearly, there is a strong need to integrate the theoretical basis and methodologies coming from metagenomics (and other meta-omics) and ecology, and it was recognized early on that the power of metagenomics would be realized when it is integrated with classical ecological approaches (Riesenfeld et al., 2004).


Here I will address some of the principles and methods of field ecology that, although useful in the context of environmental metagenomic studies, have been in my opinion rather disregarded. In particular, I will emphasize the contribution of some well established concepts and methods of field ecology to appropriate field sampling and experimental design in environmental metagenomic studies. For reasons of space I will not extensively address other ‘meta-omics’ approaches here, although clearly for metatranscriptomics, metaproteomics, metametabolomics, lipidomics, and other emerging approaches (Meiring et al., 2011), the considerations made here about metagenomics studies are, with some caveats, amply valid. I will refer mainly to soil studies for examples, since soil is one of the habitats where environmental metagenomics is first showing integration with ecological concepts and methods.

Some concepts and methods of field ecology useful in the context of environmental metagenomic studies

Looking for composition and function

Metagenomic studies usually have two main purposes, asking ‘who is there?’ (composition approach) or ‘what are they doing?’ (functional approach). The first approach aims to answer questions about OTUs/genes such as phylogenetic relationships, community structure (composition and relative abundances), diversity, etc. The second approach is oriented to the study of genes performing specific functions. The two approaches may be likened to the proposed classification of metagenomic studies into ‘open’ and ‘closed’ formats (Zhou et al., 2015). The ‘open’ format does not require previous knowledge of the metagenomic community, and it is used more in exploratory studies of composition and diversity, while still allowing for gene discoveries (that may later be related to functions). Massive sequencing techniques are the most conspicuous methodologies used in this approach. 
The ‘closed’ format, in contrast, is focused on already known genes performing functions of interest, and their detection is performed, for example, by functional gene arrays. In the same way, in ecological studies the focus may be on community structure (species composition and abundance) and diversity (commonly α, within community, and β, between communities), or on function, through the definition of functional guilds (groups of species performing similar functions) (Simberloff and Dayan, 1991). The functional approach in ecology is mainly based on functional traits of the species in a community, which allow them to be grouped into guilds or functional groups (Wilson, 1999); for example, birds with similar beak morphology are expected to feed on the same resources. In metagenomics, from the beginning, studies focused on community composition and diversity. However, by quantifying particular genes involved in a given metabolic route, functional metagenomic studies allow the existence of specific microbial functional guilds in the metagenomic community to be inferred. One well-known example is the determination of genes involved in denitrification pathways from soil microbiomes (Demanèche et al., 2009). Recently, fungal functional diversity has begun to be investigated through a bioinformatic tool, FUNGuild, that taxonomically parses fungal OTUs by ecological guild from high-throughput sequencing data (Nguyen et al., 2016). In recent years, a comprehensive approach combining metagenomics and other meta-omics like metatranscriptomics and metaproteomics has allowed us to understand the functioning of the methylotroph guilds,


to discover new pathways and new players in the cycle of methane and other methylated compounds, and to understand its relations with the N cycle (see Chapter 7). However, although functional diversity is increasingly recognized as an important component of biodiversity, methods for quantifying functional diversity are less well developed than those for taxonomic diversity. Petchey and Gaston (2002) proposed a measure of functional diversity (FD), defined as the total branch length of a functional dendrogram constructed using species functional traits. Various characteristics of FD make it preferable to other measures of functional diversity, such as the number of functional groups in a community. This method has recently begun to be used with metagenomic functional data as well. For example, Salles et al. (2015) found that for functions such as denitrification, the diversity of functional nir gene sequences is a better predictor of functioning than the diversity of sequences of phylogenetic markers. A unified, flexible and multifaceted framework to estimate microbial diversity based on taxonomic, phylogenetic or functional data and across temporal and spatial scales has recently been proposed (Escalas et al., 2013).

Spatial and temporal scales

Spatial scaling issues have been recognized since early on in ecology, mainly because the spatial scale chosen for sampling may have profound effects on the patterns found (Wiens, 1989). Two useful concepts for dealing with the scaling problem are the extent and the grain of a study (O’Neill et al., 1986). Extent is the overall area encompassed by a study and described by sampling. Grain is the size of the individual units of observation, for example the size of the grids used to count species in a plant community. 
Both the extent and the grain of a study should be defined by our knowledge of the system under study, for example by distinguishing the effects of physical processes that could act at broader scales from more local, edaphic or biological interactions. Thus, while vegetation patterns at biogeographical scales are mainly determined by climatic variables, the extent of a distinctive grassland may be determined by local, edaphic variables. Finding or not finding a pattern will depend on the homogeneity or heterogeneity of the extent considered, and on the grain size. As grain increases, a greater proportion of the spatial heterogeneity of the system is contained within a sample or grain and is lost to the study resolution, while between-grain heterogeneity decreases (Wiens, 1989). If the occurrence of species in quadrats is recorded, rare species will be less likely to be recorded as grain size increases; this effect is more pronounced if the species are widely scattered in small patches than if they are highly aggregated (Levin, 1989). Fig. 1.2 shows the effect of choosing a given grain size when the variable of interest is distributed in patches (grey) in a homogeneous matrix (white). Given an extent (large, black outer quadrat encompassing the study area), a given grain size (small red sampling quadrats) will reflect, for example, the smaller patchiness, but it will miss the heterogeneity at a broader scale (larger patches and matrix). Conversely, choosing a larger grain (larger red sampling quadrats) will result in missing the smaller patch heterogeneity, since now the sampling quadrat will encompass more spatial heterogeneity, while variance between sampling quadrats will decrease. In more technical terms, the variance (the degree of spatial autocorrelation among sampling points) will change with the extent and grain size chosen for the study. 
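The effect of grain size on between-quadrat variance can be illustrated with a minimal simulation. The sketch below is only an illustration under stated assumptions: the field is a hypothetical 64 × 64 grid with randomly placed 3 × 3 patches (value 1) in a homogeneous matrix (value 0), and the quadrats tile the whole extent rather than being a sample of it. None of the names or parameter values come from the studies cited here.

```python
import random
import statistics


def make_patchy_field(size=64, n_patches=40, patch=3, seed=1):
    """Return a size x size grid of 0s (matrix) with square patches of 1s."""
    rng = random.Random(seed)
    field = [[0.0] * size for _ in range(size)]
    for _ in range(n_patches):
        r0 = rng.randrange(size - patch + 1)
        c0 = rng.randrange(size - patch + 1)
        for r in range(r0, r0 + patch):
            for c in range(c0, c0 + patch):
                field[r][c] = 1.0
    return field


def quadrat_means(field, grain):
    """Mean of the variable inside each non-overlapping grain x grain quadrat."""
    size = len(field)
    means = []
    for r0 in range(0, size, grain):
        for c0 in range(0, size, grain):
            cells = [field[r][c]
                     for r in range(r0, r0 + grain)
                     for c in range(c0, c0 + grain)]
            means.append(sum(cells) / len(cells))
    return means


def between_quadrat_variance(field, grain):
    """Variance among quadrat means for a given grain size."""
    return statistics.pvariance(quadrat_means(field, grain))


# Hypothetical extent of 64 x 64 cells, sampled at four nested grain sizes.
field = make_patchy_field()
variances = {g: between_quadrat_variance(field, g) for g in (2, 4, 8, 16)}
for g, v in variances.items():
    print(f"grain {g:2d} x {g:2d}: variance between quadrats = {v:.4f}")
```

Because each larger quadrat is built from smaller ones, the heterogeneity that disappears from the between-quadrat variance at a coarser grain is exactly the heterogeneity now contained within the quadrats.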
Of course, the choice of the extent and the grain size (sampling quadrats, for example) should depend on the hypothesis and aim of the study. Choosing the relevant scale, extent and grain size for a study requires some previous


Figure 1.2 The effect of choosing a given grain (sampling unit) size when the variable of interest is distributed in patches (grey) in a homogeneous matrix (white). Given an extent (black outer quadrat encompassing the study area), a given grain size (small red sampling quadrats) will reflect, for example, the smaller patchiness, but it will miss the heterogeneity at a broader scale (larger patches and matrix). Conversely, choosing a larger grain (larger red sampling quadrats) will result in missing the smaller patch heterogeneity, and variance between sampling quadrats will decrease.

knowledge about the spatial distribution of the variable under study and the habitat variables that could influence its distribution. In the field, there may be domains of scale, regions of the spectrum over which, for a particular phenomenon in a particular ecological system, patterns either do not change or change monotonically with changes in scale. Domains are separated by relatively sharp transitions from dominance by one set of factors to dominance by other sets. If the focus is on phenomena at a particular scale domain, studies conducted at finer scales will fail to include important features of pattern or causal mechanisms; studies restricted to broader scales will fail to reveal the pattern or mechanistic relationships because such linkages are averaged out or are characteristic only of the particular domain (Wiens, 1989). Different methods have long been used in ecology to assess spatial heterogeneity and to detect scale domains. For a series of point samples, the average squared difference (semivariance) or the spatial autocorrelation between two points may be expressed in semivariograms as a function of the distance between them to estimate the scale of patchiness in a system (Sokal and Oden, 1978). Other methods used are spectral analysis (Legendre and Demers, 1984; Legendre and Gauthier, 2014), dimensional analysis (Lewis and Platt, 1982) and fractal geometry (Burrough, 1983). All these early methods, although with some refinements, are still in use in field ecology, while new methods, like graph theory, are beginning to be used (Fortin et al., 2012). Intimately related to the spatial heterogeneity of many ecological systems in nature is the problem of spatial pseudoreplication. Pseudoreplication is defined as the use of inferential statistics to test for treatment effects with data from experiments where either


treatments are not replicated (though samples may be) or replicates are not statistically independent (Hurlbert, 1984). In statistical terms, depending on the type of pseudoreplication incurred, two effects may arise: an increased probability of rejecting the null hypothesis when it is true (inflated type I error), or an increased probability of accepting the null hypothesis when it is false (inflated type II error) (for a detailed explanation see Chapter 2). In ‘simple’ pseudoreplication, there are no true replicates of treatment, while in ‘sacrificial pseudoreplication’, there is true replication of treatments but data from replicates are pooled prior to statistical analysis, or two or more samples or measurements taken from each experimental unit are treated as independent replicates. Information on the variance among treatment replicates exists in the original data, but is confounded with the variance among samples (within replicates) or else is effectively thrown away when the samples from the two or more replicates are pooled (hence ‘sacrificial’) (Hurlbert, 1984). Without entering into technical details, replication reduces the effects of ‘noise’, random variation or error, thereby increasing the precision of an estimate of, for example, the mean of a treatment (or field variable) or the difference between two treatments (or field variables) (Hurlbert, 1984). Thus, coming back to the example in Fig. 1.2, to detect any spatial pattern of a given field variable, for example a soil contaminant that could be conditioning the presence and abundance of metagenomic communities of microbes capable of metabolizing the contaminant, not only must the extent and grain of the study be taken into account, but an appropriate replicated sampling design is also needed. 
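The inflation of type I error under ‘simple’ pseudoreplication can be made concrete with a small simulation. This is a hedged sketch and not a method from Hurlbert (1984): it assumes normally distributed plot effects and within-plot noise (both with standard deviation 1), no true treatment effect, ten subsamples per plot, and a pooled two-sample t-test at the 5% level. The function names and all parameter values are hypothetical.

```python
import math
import random

N_SUB = 10     # subsamples taken inside each plot (assumed value)
N_PLOTS = 5    # true replicate plots per treatment in the replicated design
T_CRIT_PSEUDO = 2.101      # two-sided 5% critical t, df = 2 * N_SUB - 2 = 18
T_CRIT_REPLICATED = 2.306  # two-sided 5% critical t, df = 2 * N_PLOTS - 2 = 8


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def t_statistic(a, b):
    """Pooled two-sample t statistic for two groups of equal size."""
    se = math.sqrt((sample_variance(a) + sample_variance(b)) / len(a))
    return (mean(a) - mean(b)) / se


def simulate_plot(rng, plot_sd=1.0, sample_sd=1.0):
    """Subsamples from one plot: shared plot effect plus within-plot noise.
    No treatment effect is ever added, so the null hypothesis is true."""
    plot_effect = rng.gauss(0.0, plot_sd)
    return [plot_effect + rng.gauss(0.0, sample_sd) for _ in range(N_SUB)]


def false_positive_rates(n_sim=2000, seed=42):
    rng = random.Random(seed)
    pseudo = replicated = 0
    for _ in range(n_sim):
        # Simple pseudoreplication: one plot per treatment, and the
        # subsamples within it are wrongly treated as replicates.
        a, b = simulate_plot(rng), simulate_plot(rng)
        if abs(t_statistic(a, b)) > T_CRIT_PSEUDO:
            pseudo += 1
        # Replicated design: several plots per treatment, one mean per plot.
        a_means = [mean(simulate_plot(rng)) for _ in range(N_PLOTS)]
        b_means = [mean(simulate_plot(rng)) for _ in range(N_PLOTS)]
        if abs(t_statistic(a_means, b_means)) > T_CRIT_REPLICATED:
            replicated += 1
    return pseudo / n_sim, replicated / n_sim


pseudo_rate, replicated_rate = false_positive_rates()
print(f"false positives, subsamples as replicates: {pseudo_rate:.2f}")
print(f"false positives, plots as replicates:      {replicated_rate:.2f}")
```

Under these assumptions the pseudoreplicated analysis rejects the true null hypothesis far more often than the nominal 5%, because the variance among plots is confounded with the treatment difference, whereas the analysis based on plot means stays close to the nominal rate.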
A random sampling design, with a high number of sampling quadrats of the right grain covering a great part of the extent, may be adequate, but a systematic design may be more convenient to reflect the spatial pattern. However, systematic designs run the risk that the spacing interval may coincide with the period of some periodically varying property of the experimental area (Hurlbert, 1984), taking us back to the scale issue. The ecological principles and methods mentioned above are entirely valid for choosing the spatial scale, extent and grain in environmental metagenomic studies. Moreover, as metagenomic communities are increasingly being recognized as spatially heterogeneous, special care should be taken when choosing the spatial scale for a study. Although only recently, the soil microbiome has become one of the most studied at different spatial scales, from biogeographical extents to scales smaller than 1 m. At each length scale, different drivers of microbiome community organization are expected to act. The main soil drivers acting at ecosystem (regional and biogeographic) scales (> m) are factors like climatic patterns and biogeochemical processes; at meta-community scales (cm to m) environmental gradients (pH, soil moisture, etc.) are the main factors; while at the microbiome community level (10–10³ μm) very local ecological interactions shape the pattern and functioning of the microbial aggregations characteristic of such small scales (Cordero and Datta, 2016). While some evidence of defined distribution patterns has been found at regional and continental scales, examples of clear patterns at smaller scales are scarce (O’Brien et al., 2016). 
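The caution about systematic designs raised above can be shown with a toy transect. The sketch below assumes a hypothetical variable that alternates between 1 and 0 with a period of 10 sampling positions; the transect length, the sampling intervals and the sample size are all illustrative assumptions.

```python
import random

PERIOD = 10  # assumed period of the spatially varying property
# Transect of 200 positions: value 1 in the first half of each period, else 0.
TRANSECT = [1.0 if (i % PERIOD) < PERIOD // 2 else 0.0 for i in range(200)]


def mean(xs):
    return sum(xs) / len(xs)


def systematic_sample(values, start, interval):
    """Take every interval-th position, beginning at start."""
    return values[start::interval]


true_mean = mean(TRANSECT)

# Spacing equal to the period: every sample falls on the same phase.
aliased_high = mean(systematic_sample(TRANSECT, 0, PERIOD))
aliased_low = mean(systematic_sample(TRANSECT, 7, PERIOD))

# Spacing that does not coincide with the period visits all phases.
unaliased = mean(systematic_sample(TRANSECT, 0, 7))

# Simple random sample of 20 positions for comparison (seeded for repeatability).
rng = random.Random(0)
random_estimate = mean(rng.sample(TRANSECT, 20))

print(f"true mean                    : {true_mean:.2f}")
print(f"systematic, interval = period: {aliased_high:.2f} or {aliased_low:.2f}")
print(f"systematic, interval = 7     : {unaliased:.2f}")
print(f"random sample of 20          : {random_estimate:.2f}")
```

When the interval matches the period, the estimate depends entirely on where the first sample happens to fall, which is why some prior knowledge of the scale of variation is needed before fixing the spacing.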
However, as the choice of grain size has in general not been clearly addressed, it is not surprising that many studies have been unable to detect significant patterns in the distribution of either OTUs or genes, or significant correlations between metagenomic community variables and habitat variables like soil pH, moisture, and other factors assumed to be potentially relevant in shaping soil microbiome distributions. In addition, and related to the issue of the small spatial scales typical of microbiome communities, their highly patchy distribution, attributable to different factors in each


habitat, complicates the selection of the grain size for sampling even further. For example, soil appears to be a rather homogeneous habitat at cm scales, but it is extremely heterogeneous and patchy at the smaller μm scales that are more relevant to the microbiome. As described by Vos et al. (2016), at these small scales the soil is composed of microaggregates (at the 10 μm scale) with micropores filled with water, clustered into macro-aggregates with meso- and macro-pores (at the 100 μm scale) filled with water or air, depending on the moisture status of the soil. Thus, the patchy distribution of resources, large distances between bacterial cells and incomplete connectivity often restrict nutrient access and the ability to interact with other cells. Cell division also results in only short-distance dispersal, and thus many bacteria remain in microaggregates where micropores offer refuge against predators and dehydration, contributing to the microscale patchiness of microbial communities. This small-scale patchiness appears to be inherent to the widespread microbial activity of creating biofilms. Biofilms are ubiquitous, spatially heterogeneous systems that have high cell densities and typically comprise many microbial species. Biofilm heterogeneity may arise through local conditions of the substrate. Further sources of heterogeneity are the ability of cells in biofilms to undergo differentiation, and ecological interactions (competition, facilitation) among microbes in the biofilm, which sometimes create heterogeneity from homogeneous initial conditions (Nadell et al., 2016; Flemming et al., 2016). The same principles behind extent and grain selection and replication in a classical ecological study should be taken into account when formulating the hypothesis and designing the spatial sampling for a metagenomic study (Cordero and Datta, 2016).
Scales of domain and spatial heterogeneity can be assessed in the field in environmental metagenomics using the same methods that have been used in field ecology for decades (Gonzalez et al., 2012). In recent years, an increasing number of environmental metagenomic studies on the spatial distribution of metagenomic communities at different extents and grains, from cm to hundreds of km, have appeared (Correa-Galeote et al., 2013; Shi et al., 2015). On smaller scales, using a microcosm approach, Reim et al. (2012) sub-sampled the top 3 mm of a water-saturated soil at near in situ conditions in 100 μm steps, focusing on pmoA as a functional and phylogenetic marker in methane-oxidizing bacteria. Unfortunately, the lack of adequate replication in environmental metagenomic studies is still very common, sometimes through ‘simple pseudoreplication’ (no true replicates), but in many cases through ‘sacrificial pseudoreplication’ (pooling samples from true replicates). However, an increasing number of researchers are taking into account the need to design experiments and field studies with adequate replication. A global initiative, the Earth Microbiome Project (EMP; www.earthmicrobiome.org), seeks to systematically characterize microbial taxonomic and functional biodiversity across global ecosystems through the standardization of the protocols used to generate and analyse the data between studies. EMP is fully aware of the problem of pseudoreplication and is working towards a standardized protocol for sampling design to be adopted by all the research groups contributing samples (Knight et al., 2013). Temporal scales are inherently connected with spatial scaling in ecology, and the tendency is to integrate both scales in ecological studies (Legendre and Gauthier, 2014). As the spatial scale increases, the time scale of important processes also increases, because processes operate at slower rates, time lags increase, and indirect effects become increasingly important (Wiens, 1989).
The dynamics of different ecological phenomena in different systems follow different trajectories in space and time. For example, processes relevant to


perennial plants in grasslands, like species competition and grazing, may occur over hundreds of square metres and over decades, while processes relevant to soil arthropods, which are restricted to smaller, local spaces and have much shorter lives, may be defined in days and hours. In soil, characteristic short timescales range from hours to seasons. Soil microbes vary greatly in abundance and activity over timescales of hours to days (Bardgett et al., 2005). This variation is related to factors such as predation of microbes by bacteriophages and soil animals, the action of abiotic stresses (e.g. wet–dry and freeze–thaw cycles) (Mikola et al., 2002), and, importantly, temporal variation in the supply of carbon and other nutrients from roots to soil (Bardgett et al., 2005). Such variations also occur at seasonal time scales. There is a general idea that soil microbes are inactive during the winter. However, Schadt et al. (2003) found in alpine soils that microbial biomass reached its annual maximum in late winter, when the soil is still frozen, and declined significantly thereafter. Between winter and summer there is an almost complete turnover of the microbial community, with many novel DNA sequences (Schadt et al., 2003) with different functional attributes (Lipson and Schmidt, 2004). Thus, at least in alpine soils, a single snapshot sampling at a given time of the year may underestimate microbial diversity. Following temporal microbiome dynamics has recently made it possible to address the important role in community diversity of taxa that are typically in very low abundance but occasionally achieve prevalence (Shade and Gilbert, 2015). In ecological studies, a series of observations on the abundance of the species or variable of interest is often made at equal intervals over a period of time, to detect any hidden temporal pattern through statistical procedures.
Most of these statistical methods are based on time-series analysis, which makes it possible to extract information and to identify the scales of temporal patterns. One of the essential tools in time-series analysis is the periodogram or spectrum, used in what is called spectral analysis. The signal (the time series) is decomposed into harmonic components based on Fourier analysis, in a way similar to partitioning the variance of the series into its different oscillating components with different frequencies (periods). Peaks in the periodogram or in the spectrum indicate which periods contribute most to the variance of the series (Cazelles et al., 2008). Spectral analysis has a long history in ecology, going back to the work of Bartlett (1954), who analysed lynx temporal abundances using periodograms, and since then it has been widely used in ecology and population dynamics. Although the analysis of temporal variability has a long tradition in ecology, only recently has it begun to be implemented with metagenomic data. Classical time series and other related techniques are increasingly used to study microbiome data obtained by metagenomics and other meta-omics approaches to assess diversity, function and ecological interactions (exhaustively reviewed in Faust et al., 2015). These techniques have some specific requirements that should be taken into account when planning the field or experimental design. Increasing the sampling frequency generally provides higher resolution of metagenomic community dynamics, although at an increased cost, so a compromise should be reached. Sampling regularity is another important requirement for analysis techniques involving autocorrelation. Estimates for time points missing in samplings with irregular intervals can be used, but this technique can lead to misleading conclusions if specific statistical modelling assumptions are not met.
Another issue is that, although most time-series analyses require long records with short and regular sampling intervals, metagenomic time series generally tend to have few time points, many sampling gaps and many records with zero values, characteristics that create challenges for statistical analyses. Besides these problems, in metagenomic studies, just as in many ecological systems, linear correlation
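The periodogram described above can be sketched in a few lines. The example assumes a hypothetical, regularly sampled abundance series with a 12-step cycle plus noise; the function name `periodogram`, the simulated data and the scaling of the power values are illustrative assumptions rather than a method prescribed by the works cited.

```python
import numpy as np

def periodogram(series, dt=1.0):
    """Return frequencies and raw periodogram power for an equally spaced series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                      # remove the mean so only variation remains
    n = x.size
    freqs = np.fft.rfftfreq(n, d=dt)      # frequencies in cycles per time unit
    power = np.abs(np.fft.rfft(x)) ** 2 / n
    return freqs[1:], power[1:]           # drop the zero-frequency term

# Simulated data: 120 equally spaced samplings with a 12-step cycle and noise.
rng = np.random.default_rng(0)
t = np.arange(120)
abundance = 50 + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 2, t.size)

freqs, power = periodogram(abundance)
dominant_period = 1.0 / freqs[np.argmax(power)]  # period of the highest peak
print(f"dominant period: {dominant_period:.1f} sampling intervals")
```

The highest peak falls at the frequency of the simulated cycle. The sketch relies on equally spaced samples, which is the regularity requirement mentioned in the text; series with gaps or irregular intervals would need a different estimator.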


analyses are difficult to justify, since non-linear dynamics seem to be the norm rather than the exception. Rapidly changing relationships between variables in microbial community dynamics cause transient correlations that may result in spurious patterns. To overcome this problem, techniques like convergent cross-mapping can be applied to time-series data by examining the degree to which temporal components of a given variable are useful for predicting the state of another variable (Sugihara et al., 2012). Another important issue in temporal metagenomic studies is, again, pseudoreplication. Although replicates in time are not easily available for temporal metagenomic studies, combining information across multiple replicate time series can improve the inference of interactions from observations and help to distinguish stochastic fluctuations from real temporal patterns (Hekstra and Leibler, 2012). One more point should be made on the framework of temporal scales in ecology and metagenomics. Classically, evolutionary time and ecological time have been differentiated. Evolutionary time operates on a longer time scale, over which changes in gene frequencies in species populations can be described as trends. Ecological time operates on a shorter time scale, over which changes in populations occur with little or no change in gene frequencies (Schneider, 1994). These concepts were developed in the context of plant and animal ecology and evolution, based on general and well-known mechanisms of change in gene frequencies: mutation, migration, genetic drift, and natural selection. However, it is not clear whether this distinction can be directly extrapolated to microbial ecology and evolutionary time scales. Bacteria and fungi acquire genetic heterogeneity through other mechanisms besides mutation, like horizontal gene transfer by plasmids, transport of genetic material by phage, and capture of nucleic acids from the environment (Zaneveld et al., 2008; Fitzpatrick, 2012).
The horizontally (not genealogically) acquired genes generally contribute to the adaptation of bacteria to local competitive or environmental pressures (Cohan, 2002), encoding antibiotic resistance, novel metabolic functions, toxin production, symbiotic abilities, and other functions. Thus, this horizontally acquired genetic material confers a fitness advantage on recipient bacteria in the appropriate circumstances (Dobrindt et al., 2004), acting as a true evolutionary force. The horizontal transfer or acquisition of this extra genetic material occurs over very short times and may establish a new lineage with new functional abilities within a few years (Sullivan et al., 1995). This creates a conflict with the classical distinction between ecological and evolutionary time that should be taken into account when considering the issue of time scales in metagenomic studies. The spatial and temporal scales of a study thus determine the range of patterns and processes that can be detected. If we study a system at an inappropriate scale, we may fail to detect its actual dynamics and patterns, and may instead either detect no pattern at all or identify patterns that are artefacts of scale. One interesting concept, long used in ecology, is multiscale analysis: performing an analysis with respect to multiples of a unit of measurement (Schneider, 1994). By changing the unit of analysis, and thus the resolution, different patterns of the variable of interest are expected to emerge. For example, if the sampling quadrat size (the grain) is changed and soil microbiome diversity is recorded in nested quadrats of 1 cm², 10 cm² and 100 cm², diversity indexes or other metagenomic community variables will probably change. This is different from simply spreading many quadrats of any one of these sizes over a greater space (changing the extent of the study). For example, Shi et al.
(2015), in a study described as multiscaled, investigated the biogeographical patterns of microbial functional genes in 24 heath soils from across the Arctic using GeoChip-based metagenomics. Principal coordinates of neighbour matrices (PCNM)-based analysis was


used to analyse data across several spatial scales. However, although the sampling locations were scattered around the Canadian, Alaskan and European Arctic over a very broad extent, the grain used was the same throughout (sampling quadrats of 12 × 12 cm). Thus, this approach can be interpreted as not truly multiscaled, since the correlations were measured between quadrats of similar size at different distances. Multiscale analysis can be used to assess changes in time as well, although studies with a temporal multiscale approach in metagenomics are only beginning to appear (Stempfhuber, 2016). Multiscale approaches, in combination with unified spatial and temporal frameworks for metagenomic studies, will soon improve our understanding of the variability of microbial communities (Gonzalez et al., 2012; Gilbert and Henry, 2015). Finally, it should be stressed that all the considerations made about sampling metagenomic data also apply to the collection of environmental metadata (climate variables, soil parameters, etc.). There is an interesting tendency to include metadata in integrative workflows for processing and analysing metagenomic data on most of the currently available platforms (Ladoukakis et al., 2015). Metadata collection is an important and urgent problem that the metagenomics research community should address by developing and sharing standardized sampling protocols.

Mathematical modelling

An entire paper would be needed to address in detail the issue of mathematical modelling in ecology and its influence on the recent surge of microbial community modelling. However, as mathematical modelling is becoming an increasingly important topic in metagenomics, I will give a brief account here. In a broad sense, a model is any abstraction of a system, built using conceptual, mathematical, or logical frameworks, alone or combined.
In particular, mathematical modelling has been used in ecology since its early days. The origins of modern population ecology models can be traced back to the end of the eighteenth century, with the model describing exponential human population growth built by Thomas Malthus (1798), and to the middle of the nineteenth century, with the logistic growth model formulated by Pierre-François Verhulst (1845), also for human populations. In the first decades of the twentieth century, these models were rediscovered by the first population ecologists, like Anderson Gray McKendrick (for bacterial growth) and Alfred J. Lotka, who, together with Vito Volterra, are considered the founders of population ecology. Since then, mathematical modelling has been implemented in every ecological field and at every organization level, from population to ecosystem ecology. An exhaustive review of the huge variety of mathematical models used in ecology (deterministic, stochastic, discrete, continuous, mean-field, individual-based, etc.) (Müller and Kuttler, 2015) is beyond the scope of this work, but perhaps one of the most helpful classifications of ecological models is into phenomenological and mechanistic models. Phenomenological (also called statistical) models are based on observed patterns in the data, while mechanistic models are built by directly addressing the mechanisms that generate the observed processes and patterns. Phenomenological models provide no information about the underlying ecological mechanisms, since there is no unique relationship between statistical patterns and mechanisms, and their predictive power is largely restricted to conditions comparable to those under which the data used to build the model were collected. On the other hand, since mechanistic models attempt to understand the phenomenon modelled, they


are usually regarded as having more explanatory and predictive power than phenomenological models. For example, a phenomenological model of a species' dispersal distance, built as a regression on actual dispersal records taken in the field, does not tell us much about the mechanisms underlying the dispersal pattern found, and the model would be applicable only to a similar scenario and within the range of dispersal distances actually recorded. A mechanistic model, in contrast, that includes the main mechanisms involved in the dispersal of, for example, wind-dispersed seeds, like seed morphology, wind direction and velocity, and elevation and landscape topography, would inform about more general features of the dispersal, like interactions among the variables included, and would allow predictions to be extrapolated further. However, in some cases both kinds of models can be complementary, since some parts of a mechanistic model that cannot be backed by an explicit mechanism may contain statistical relationships (Kendall et al., 1999). For some time now, the tendency in ecology has been to move on from purely phenomenological models to more explanatory and predictive mechanistic models. This tendency is also beginning to permeate the work in microbial systems, thus contributing to the foundation of a modern microbial ecology (Gonzalez et al., 2012; Liberles et al., 2013). The development of mathematical models based on mechanistic understanding and integrated with controlled experiments will make it possible to convert the huge empirical knowledge gained through microbial metagenomics and other meta-omics into fundamental insights and testable predictions about microbiome composition, function and dynamics (Widder et al., 2016). In a succinct but informative review, Widder and colleagues show how metagenomic and other meta-omics data can be integrated with different modelling approaches.
Dynamic models of a deterministic and mechanistic nature (like difference equations and flux balance analysis), stochastic dynamical systems (like Markov chains and random walks), individual-based models, and other approaches can be used to find patterns at different spatial and temporal scales and at different ecological organization levels (from single cells to microbiomes at community and ecosystem levels), and to generate explanations and predictions about microbiome structure and function. Some modelling approaches, although essentially phenomenological, may nevertheless contribute to the generation of new hypotheses on microbiome structure and function. Network analysis has long been used in ecology to study co-occurrence networks, which are established by calculating correlations between the abundances of individual species in order to detect interactions among them in the community (Jordano, 1987). This approach has recently begun to be used in microbial ecology. For example, Barberán and colleagues calculated associations between microbial taxa and applied network analysis approaches to a 16S rRNA gene barcoded pyrosequencing dataset containing 4,160,000 bacterial and archaeal sequences from 151 soil samples from a broad range of ecosystem types. The analysis revealed habitat generalists and specialists, co-occurrence patterns including general non-random association, common life history strategies at broad taxonomic levels and unexpected relationships between community members. Thus, although not regarded as purely mechanistic, network analysis has the potential to explore inter-taxa correlations to gain a more integrated understanding of microbial community structure and the ecological rules guiding community assembly (Barberán et al., 2012). New modelling approaches tend to integrate different modelling tools to combine information from different sources. For example, Noecker et al.
(2016), in a systems biology approach, propose a comprehensive framework to systematically link variation in metabolomic data with community composition by utilizing taxonomic, genomic,
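For reference, the exponential (Malthus) and logistic (Verhulst) growth models named at the start of this section are usually written as below. The symbols are standard textbook notation and are an assumption here, since the text itself gives no formulas: N is population size, N_0 its initial value, r the intrinsic growth rate, K the carrying capacity and t time.

```latex
% Exponential growth (Malthus) and its solution
\frac{dN}{dt} = rN
\quad\Longrightarrow\quad
N(t) = N_0\, e^{rt}

% Logistic growth (Verhulst) and its solution
\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)
\quad\Longrightarrow\quad
N(t) = \frac{K}{1 + \dfrac{K - N_0}{N_0}\, e^{-rt}}
```

When N is much smaller than K, the logistic model reduces to the exponential one, and N(t) approaches K as t grows.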


and metabolic information. Their approach integrates available and inferred genomic data, metabolic network modelling, and a method for predicting community-wide metabolite turnover to estimate the biosynthetic and degradation potential of a given community.

Conclusions

Metagenomics and other meta-omics constitute, by their very nature, a complex field placed at the intersection of many disciplines, such as molecular biology, microbiology, ecology, chemistry and bioinformatics, among others, and new ones are rapidly becoming involved. The theoretical and methodological complexity arising from this multifaceted and dynamic field requires the integration of useful theoretical bases and methodologies from already well-established disciplines like ecology, and dealing with changes of paradigm, such as the shift from the traditional organism-centred approach to a new, organism- and species-free context. Thus, while some concepts and methodologies from ecology, like niche theory, should be revised for application in the metagenomics and meta-omics fields, others do not require great changes and are expected to be increasingly adopted by environmental metagenomics. The ecological principles behind spatial and temporal scales should be taken into account when formulating the hypothesis and sampling design for metagenomic and meta-omics studies, and the wealth of modelling approaches developed over decades by ecologists is proving extremely useful in the context of metagenomics.

Acknowledgements

DEM is a research member of the Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina.

References

Barberán, A., Bates, S.T., Casamayor, E.O., and Fierer, N. (2012). Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J. 6, 343–351. http://dx.doi.org/10.1038/ismej.2011.119
Bardgett, R.D., Bowman, W.D., Kaufmann, R., and Schmidt, S.K. (2005). A temporal approach to linking aboveground and belowground ecology. Trends Ecol. Evol. 20, 634–641.
Bartlett, M.S. (1954). Problèmes de l’analyse spectrale des séries temporelles stationnaires. Publ. Inst. Stat. Univ. Paris 3, 119–134.
Burrough, P.A. (1983). Multiscale sources of spatial variation in soil. I. The application of fractal concepts to nested levels of soil variation. J. Soil Sci. 34, 577–597.
Cazelles, B., Chavez, M., Berteaux, D., Ménard, F., Vik, J.O., Jenouvrier, S., and Stenseth, N.C. (2008). Wavelet analysis of ecological time series. Oecologia 156, 287–304. http://dx.doi.org/10.1007/s00442-008-0993-2
Chistoserdova, L. (2014). Functional metagenomics of the nitrogen cycle in freshwater lakes with focus on methylotrophic bacteria. In Metagenomics of the microbial nitrogen cycle. Theory, methods and applications, D. Marco, ed. (Norfolk, UK: Caister Academic Press), pp. 195–208.
Cohan, F.M. (2002). What are bacterial species? Annu. Rev. Microbiol. 56, 457–487. http://dx.doi.org/10.1146/annurev.micro.56.012302.160634
Cordero, O.X., and Datta, M.S. (2016). Microbial interactions and community assembly at microscales. Curr. Opin. Microbiol. 31, 227–234. http://dx.doi.org/10.1016/j.mib.2016.03.015
Correa-Galeote, D., Marco, D.E., Tortosa, G., Bru, D., Philippot, L., and Bedmar, E.J. (2013). Spatial distribution of N-cycling microbial communities showed complex patterns in constructed wetland sediments. FEMS Microbiol. Ecol. 83, 340–351.
Creer, S., Deiner, K., Frey, S., Porazinska, D., Taberlet, P., Thomas, W.K., Potter, C., and Bik, H.M. (2016). The ecologist’s field guide to sequence-based identification of biodiversity. Methods Ecol. Evol. 7, 1008–1018.


DeLong, E.F. (2004). Microbial population genomics and ecology: the road ahead. Environ. Microbiol. 6, 875–878. http://dx.doi.org/10.1111/j.1462-2920.2004.00668.x
Demanèche, S., Philippot, L., David, M.M., Navarro, E., Vogel, T.M., and Simonet, P. (2009). Characterization of denitrification gene clusters of soil bacteria via a metagenomic approach. Appl. Environ. Microbiol. 75, 534–537. http://dx.doi.org/10.1128/AEM.01706-08
Dobrindt, U., Hochhut, B., Hentschel, U., and Hacker, J. (2004). Genomic islands in pathogenic and environmental microorganisms. Nat. Rev. Microbiol. 2, 414–424. http://dx.doi.org/10.1038/nrmicro884
Escalas, A., Bouvier, T., Mouchet, M.A., Leprieur, F., Bouvier, C., Troussellier, M., and Mouillot, D. (2013). A unifying quantitative framework for exploring the multiple facets of microbial biodiversity across diverse scales. Environ. Microbiol. 15, 2642–2657.
Faust, K., Lahti, L., Gonze, D., de Vos, W.M., and Raes, J. (2015). Metagenomics meets time series analysis: unraveling microbial community dynamics. Curr. Opin. Microbiol. 25, 56–66.
Fierer, N., Lauber, C.L., Ramirez, K.S., Zaneveld, J., Bradford, M.A., and Knight, R. (2012). Comparative metagenomic, phylogenetic and physiological analyses of soil microbial communities across nitrogen gradients. ISME J. 6, 1007–1017.
Fitzpatrick, D.A. (2012). Horizontal gene transfer in fungi. FEMS Microbiol. Lett. 329, 1–8. http://dx.doi.org/10.1111/j.1574-6968.2011.02465.x
Flemming, H.C., Wingender, J., Szewzyk, U., Steinberg, P., Rice, S.A., and Kjelleberg, S. (2016). Biofilms: an emergent form of bacterial life. Nat. Rev. Microbiol. 14, 563–575. http://dx.doi.org/10.1038/nrmicro.2016.94
Fortin, M.J., James, P.M., MacKenzie, A., Melles, S.J., and Rayfield, B. (2012). Spatial statistics, spatial regression, and graph theory in ecology. Spat. Stat. 1, 100–109.
Gilbert, J.A., and Henry, C. (2015). Predicting ecosystem emergent properties at multiple scales. Environ. Microbiol. Rep. 7, 20–22. http://dx.doi.org/10.1111/1758-2229.12258
Gonzalez, A., King, A., Robeson, M.S., Song, S., Shade, A., Metcalf, J.L., and Knight, R. (2012). Characterizing microbial communities through space and time. Curr. Opin. Biotechnol. 23, 431–436. http://dx.doi.org/10.1016/j.copbio.2011.11.017
Goodall, D.W. (1954). Vegetational classification and vegetational continua. Angew. Pflanzensoziologie, Wien. Festschrift Aich. 1, 168–182.
Handelsman, J., Rondon, M.R., Brady, S.F., Clardy, J., and Goodman, R.M. (1998). Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem. Biol. 5, R245–9.
Handelsman, J. (2004). Metagenomics: application of genomics to uncultured microorganisms. Microbiol. Mol. Biol. Rev. 68, 669–685.
Hekstra, D.R., and Leibler, S. (2012). Contingency and statistical laws in replicate microbial closed ecosystems. Cell 149, 1164–1173. http://dx.doi.org/10.1016/j.cell.2012.03.040
Hurlbert, S.H. (1984). Pseudoreplication and the design of ecological field experiments. Ecol. Monogr. 54, 187–211.
Jordano, P. (1987). Patterns of mutualistic interactions in pollination and seed dispersal: Connectance, dependence asymmetries, and coevolution. Am. Nat. 129, 657–677.
Ju, F., and Zhang, T. (2015). Experimental design and bioinformatics analysis for the application of metagenomics in environmental sciences and biotechnology. Environ. Sci. Technol. 49, 12628–12640.
Kendall, B.E., Briggs, C.J., Murdoch, W.W., Turchin, P., Ellner, S.P., McCauley, E., Nisbet, R.M., and Wood, S.N. (1999). Why do populations cycle? A synthesis of statistical and mechanistic modeling approaches. Ecology 80, 1789–1805.
Knight, R., Jansson, J., Field, D., Fierer, N., Desai, N., Fuhrman, J.A., Hugenholtz, P., van der Lelie, D., Meyer, F., Stevens, R., et al. (2013). Unlocking the potential of metagenomics through replicated experimental design. Nat. Biotechnol. 30, 513–520.
Kurtz, Z.D., Müller, C.L., Miraldi, E.R., Littman, D.R., Blaser, M.J., and Bonneau, R.A. (2015). Sparse and compositionally robust inference of microbial ecological networks. PLOS Comput. Biol. 11, e1004226. http://dx.doi.org/10.1371/journal.pcbi.1004226
Ladoukakis, E., Kolisis, F.N., and Chatziioannou, A.A. (2015). Integrative workflows for metagenomic analysis. In Multi-omic Data Integration, Frontiers Research Topics, P. Tieri, Ch. Nardini and J.E. Dent, eds. pp. 115–125.
Legendre, L., and Demers, S. (1984). Towards dynamic biological oceanography and limnology. Can. J. Fish. Aquat. Sci. 41, 2–19.


Legendre, P., and Gauthier, O. (2014). Statistical methods for temporal and space-time analysis of community composition data. Proc. Biol. Sci. 281, 20132728. http://dx.doi.org/10.1098/rspb.2013.2728
Lennon, J.T. (2011). Replication, lies and lesser-known truths regarding experimental design in environmental microbiology. Environ. Microbiol. 13, 1383–1386.
Levin, S.A. (1989). Challenges in the development of a theory of ecosystem structure and function. In Perspectives in Ecological Theory, J. Roughgarden, R.M. May and S.A. Levin, eds. (Princeton, N.J: Princeton University Press), pp. 242–255.
Lewis, M.R., and Platt, T. (1982). Scales of variation in estuarine ecosystems. In Estuarine Comparisons, V.S. Kennedy, ed. (New York: Academic Press), pp. 3–20.
Liberles, D.A., Teufel, A.I., Liu, L., and Stadler, T. (2013). On the need for mechanistic models in computational genomics and metagenomics. Genome Biol. Evol. 5, 2008–2018. http://dx.doi.org/10.1093/gbe/evt151
Lipson, D.A., and Schmidt, S.K. (2004). Seasonal changes in an alpine soil bacterial community in the Colorado Rocky Mountains. Appl. Environ. Microbiol. 70, 2867–2879.
Linquist, S., Cottenie, K., Elliott, T.A., Saylor, B., Kremer, S.C., and Gregory, T.R. (2015). Applying ecological models to communities of genetic elements: the case of neutral theory. Mol. Ecol. 24, 3232–3242. http://dx.doi.org/10.1111/mec.13219
Malthus, T. (1798). An Essay on the Principle of Population, as it affects the future improvement of society with remarks on the speculations of Mr. Godwin, M. Condorcet, and other writers. Anonymously published.
Marco, D. (2008). Metagenomics and the niche concept. Theory Biosci. 127, 241–247. http://dx.doi.org/10.1007/s12064-008-0028-x
Meiring, T.L., Bauer, R., Scheepers, I., Ohloff, C., Tuffin, I.M., and Cowan, D.A. (2011). Metagenomics and beyond: current approaches and integration with complementary technologies. In Metagenomics: current innovations and future trends, D. Marco, ed. (Norfolk, UK: Caister Academic Press), pp. 1–19.
Mikola, J., Bardgett, R.D., and Hedlund, K. (2002). Biodiversity, ecosystem functioning and soil decomposer food webs. In Biodiversity and ecosystem functioning: synthesis and perspectives, M. Loreau, S. Naeem, and P. Inchausti, eds. (Oxford, UK: Oxford University Press), pp. 169–180.
Müller, J., and Kuttler, C. (2015). Methods and Models in Mathematical Biology (Berlin Heidelberg: Springer-Verlag).
Nadell, C.D., Drescher, K., and Foster, K.R. (2016). Spatial structure, cooperation and competition in biofilms. Nat. Rev. Microbiol. 14, 589–600. http://dx.doi.org/10.1038/nrmicro.2016.84
Nguyen, N.H., Song, Z., Bates, S.T., Branco, S., Tedersoo, L., Menke, J., Schilling, J.S., and Kennedy, P.G. (2016). FUNGuild: an open annotation tool for parsing fungal community datasets by ecological guild. Fungal Ecol. 20, 241–248.
Noecker, C., Eng, A., Srinivasan, S., Theriot, C.M., Young, V.B., Jansson, J.K., Fredricks, D.N., and Borenstein, E. (2016). Metabolic model-based integration of microbiome taxonomic and metabolomic profiles elucidates mechanistic links between ecological and metabolic variation. mSystems 1, e00013-15.
O’Brien, S.L., Gibbons, S.M., Owens, S.M., Hampton-Marcell, J., Johnston, E.R., Jastrow, J.D., Gilbert, J.A., Meyer, F., and Antonopoulos, D.A. (2016). Spatial scale drives patterns in soil bacterial diversity. Environ. Microbiol. 18, 2039–2051.
O’Malley, M.A., and Dupré, J. (2009). Philosophical themes in metagenomics. In Metagenomics: Theory, methods and applications, D. Marco, ed. (Norfolk, UK: Caister Academic Press), pp. 183–208.
O’Neill, R.V., DeAngelis, D.L., Waide, J.B., and Allen, T.F.H. (1986). A hierarchical concept of ecosystems (Princeton, New Jersey, USA: Princeton University Press).
Pace, N.R., Stahl, D.A., Lane, D.J., and Olsen, G.J. (1985). Analyzing natural microbial populations by rRNA sequences. ASM News 51, 4–12.
Petchey, O.L., and Gaston, K.J. (2002). Functional diversity (FD), species richness and community composition. Ecol. Lett. 5, 402–411.
Prosser, J.I. (2010). Replicate or lie. Environ. Microbiol. 12, 1806–1810. http://dx.doi.org/10.1111/j.1462-2920.2010.02201.x
Ramette, A. (2007). Multivariate analyses in microbial ecology. FEMS Microbiol. Ecol. 62, 142–160.
Reim, A., Lüke, C., Krause, S., Pratscher, J., and Frenzel, P. (2012). One millimetre makes the difference: high-resolution analysis of methane-oxidizing bacteria and their specific activity at the oxic–anoxic interface in a flooded paddy soil. ISME J. 6, 2128–2139.
Riesenfeld, C.S., Schloss, P.D., and Handelsman, J. (2004). Metagenomics: genomic analysis of microbial communities. Annu. Rev. Genet. 38, 525–552. http://dx.doi.org/10.1146/annurev.genet.38.072902.091216

16 | Marco

Salles, J.F., Le Roux, X., and Poly, F. (2015). Relating phylogenetic and functional diversity among denitrifiers and quantifying their capacity to predict community functioning. In The causes and consequences of microbial community structure, D.R. Nemergut, A. Shade, and C. Violle, eds. Frontiers Research Topics 140-153. Shade, A., and Gilbert, J.A. (2015). Temporal patterns of rarity provide a more complete view of microbial diversity. Trends Microbiol.23, 335–340. http://dx.doi.org/10.1016/j.tim.2015.01.007 Schadt, C.W., Martin, A.P., Lipson, D.A., and Schmidt, S.K. (2003). Seasonal dynamics of previously unknown fungal lineages in tundra soils. Science 301, 1359–1361. http://dx.doi.org/10.1126/ science.1086940 Schneider, D.C. (1994). Quantitative ecology: spatial and temporal scaling (San Diego, California, USA: Academic Press). Shi, Y., Grogan, P., Sun, H., Xiong, J., Yang, Y., Zhou, J., and Chu, H. (2015). Multi-scale variability analysis reveals the importance of spatial distance in shaping Arctic soil microbial functional communities. Soil Biol. Biochem. 86, 126-134. Simberloff, D., and Dayan, T. (1991). The guild concept and the structure of ecological communities. Annu. Rev. Ecol. Syst. 22, 115-143. Sokal, R.R., and Oden, N.L. (1978). Spatial autocorrelation in biology. 2. Some biological implications and four applications of evolutionary and ecological interest. Biol. J. Linn. Soc. 10, 229- 249. Stempfhuber, B.H.J. (2016). Drivers for the performance of nitrifying organisms and their temporal and spatial interaction in grassland and forest ecosystems (Doctoral dissertation, Dissertation, München, Technische Universität München). Sugihara, G., May, R., Ye, H., Hsieh, C.H., Deyle, E., Fogarty, M., and Munch, S. (2012). Detecting causality in complex ecosystems. Science 338, 496–500. http://dx.doi.org/10.1126/science.1227079 Sullivan, J.T., Patrick, H.N., Lowther, W.L., Scott, D.B., and Ronson, C.W. (1995). 
Nodulating strains of Rhizobium loti arise through chromosomal symbiotic gene transfer in the environment. Proc. Natl. Acad. Sci. U.S.A.92, 8985–8989. Verhulst, P.-F. (1845). Recherches mathématiques sur la loi d’accroissement de la population. Nouveaux Mémoires del’Académie Royale des Sciences et Belles-Lettres de Bruxelles 18, 1–42. Vos, M., Wolf, A.B., Jennings, S.J., and Kowalchuk, G.A. (2013). Micro-scale determinants of bacterial diversity in soil. FEMS Microbiol. Rev.37, 936–954. http://dx.doi.org/10.1111/1574-6976.12023 Widder, S., Allen, R.J., Pfeiffer, T., Curtis, T.P., Wiuf, C., Sloan, W.T., Cordero, O.X., Brown, S.P., Momeni, B., Shou, W.,et al. (2016). Challenges in microbial ecology: building predictive understanding of community function and dynamics. ISME. J. 10, 2557–2568. http://dx.doi.org/10.1038/ismej.2016.45 Wiens, J.A. (1989). Spatial scaling in ecology. Funct. Ecol. 3, 385-397. Wilson, J.B. (1999). Guilds, functional types and ecological groups. Oikos 86, 507-522. Woese, C.R., and Fox, G.E. (1977). Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl. Acad. Sci. U.S.A.74, 5088–5090. Zaneveld, J.R., Nemergut, D.R., and Knight, R. (2008). Are all horizontal gene transfers created equal? Prospects for mechanism-based studies of HGT patterns. Microbiology 154, 1–15. http://dx.doi. org/10.1099/mic.0.2007/011833-0 Zhou, J., He, Z., Yang, Y., Deng, Y., Tringe, S.G., and Alvarez-Cohen, L. (2015). High-throughput metagenomic technologies for complex microbial community analysis: open and closed formats. MBio 6, e02288-14.

Guidelines to Statistical Analysis of Microbial Composition Data Inferred from Metagenomic Sequencing

2

Vera Odintsova1*, Alexander Tyakht1,2 and Dmitry Alexeev1,2

1Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, Russian Federation.

2Moscow Institute of Physics and Technology, Institutskiy pereulok 9, Dolgoprudny, Russian Federation.

*Correspondence: [email protected] https://doi.org/10.21775/9781910190593.02

Abstract

Metagenomics, the application of high-throughput DNA sequencing to surveys of environmental samples, has revolutionized our view of the taxonomic and genetic composition of complex microbial communities. An enormous richness of microbiota keeps unfolding in fields ranging from biomedicine and the food industry to geology. Primary analysis of metagenomic reads yields semi-quantitative data describing the community structure. However, such compositional data have specific statistical properties that must be considered during preprocessing, hypothesis testing and interpretation of the test results. Failure to account for these properties may lead to fundamentally wrong conclusions. Here we introduce researchers in the field of metagenomics to the basic properties of microbial compositional data, including statistical power and proposed distribution models, review the publicly available software tools developed specifically for such data and outline recommendations for applying the methods.

Introduction

Microbiota, complex communities of microbial species, appear to inhabit virtually every environmental niche in the world. Recent advances in molecular genetic techniques have made it possible to study microbiota in a cultivation-independent way, leading to the discovery of enormous diversity. One of the most advanced and widely used techniques is metagenomic sequencing: classification and quantification of metagenomic sequences can be used to assess both the taxonomic and the functional composition of microbiota. However, when a researcher needs a proper statistical assessment of hypotheses about differences between samples from different groups, or about the dependence of the microbiota on multiple factors, quantification of metagenomic reads is only an intermediate step.

Generally, the specific features and problems of analysing metagenomic data are the same whether the focus is on the microbiota of a subway station, a saline lake or a host-associated community. Here we illustrate these concepts using the human microbiota, which is of particular interest to researchers because of its essential role in a biomedical context. The human body is a habitat for a large number of microbes (approximately 1–3% of body mass). They play important roles in the biological processes that maintain its vital activity. The microbial community consists of thousands of species, and its structure varies noticeably both among subjects and among body sites (Levy and Borenstein, 2013; Stein et al., 2013; Tyakht et al., 2013; Human Microbiome Project Consortium, 2012). As in vitro cultivation and examination of most microorganisms is difficult, metagenomic analysis is one of the most common ways to study the structure and functionality of the human-associated microbial community.

An important part of metagenomic analysis is testing hypotheses about associations between microbiome structure and certain factors. In human microbiome research, such factors may include age, diet type, the presence of a disease and so on. Factors may take discrete or continuous values. Characteristics such as sex, body site and disease stage are discrete. They allow us to distinguish two or more groups and compare the structure of their microbiota.
When designing the groups and the inclusion/exclusion criteria for a metagenomic survey, a researcher should make sure that the groups are matched and differ only in the selected features. For example, when the microbiota of healthy subjects and of patients with dysbiosis are compared, there should be no substantial differences in weight, age, sex and other parameters between the groups. Age, body mass index and drug dosage are examples of continuous factors. In such cases it is impossible to define distinct groups, so the researcher's aim is to obtain the functional dependence of microbiome structure on the factor value.

A study may aim to explore the association of the microbiome with more than one factor at a time. Multifactor analysis is useful when several characteristics may contribute to changes in microbial structure and each of them is presumed to have a substantial influence. Such analysis attempts to estimate the individual impact of each factor.

A common metagenomic study of the association between clinical data and microbiome composition consists of the following stages (Goodrich et al., 2014):

1. formulation of the aims and experiment design (stating the hypothesis, describing the groups of subjects, determining their minimal size, choosing the optimal sequencing technology, the targeted sequencing depth and the methods of experimental data analysis);
2. collection of the required number of microbial samples;
3. metagenomic sequencing of each sample;
4. taxonomic profiling of each metagenome;
5. statistical analysis of the compositional data.

This review describes the basic concepts and models of statistical inference of associations between microbiome structure and factors of interest, with the main focus on human microbiota. It is intended to help in choosing the correct statistical methods for metagenomic analysis implemented in R (R Core Team, 2015) and in applying them properly (stages 1 and 5). To state the problem mathematically, it is necessary to become familiar with the data format used in metagenomics. We first briefly describe the initial steps of a metagenomic study, from collecting samples to obtaining their taxonomic profiles. Next, we discuss the statistical formulation of the research aim and the design of an experiment. Then we focus on the main steps and specifics of metagenomic statistical analysis. The last section gives an overview of current and widely used approaches to such analysis implemented in R packages.

Preparing data for statistical analysis

Collection of microbiome samples, sample preparation and sequencing, as well as the primary bioinformatic analysis for taxonomic profiling, are complex processes, and each step has its own subtleties. A comprehensive description is beyond the scope of this text and can be found elsewhere (for example, in Goodrich et al., 2014). Here we outline only the essential points. After the microbiota samples are collected from the individuals under investigation, they are stored and transported at low temperature and under other conditions that prevent changes in the structure of the microbial community before the sample reaches the laboratory. There the DNA is extracted from the samples through a multistage process using special reagents. Then a so-called sequencing library is prepared for each sample. The libraries are subjected to DNA sequencing, which yields thousands to tens of millions of short nucleotide sequences (reads) corresponding to the genomes of all microorganisms and viruses present in a sample.
There are two main types of sequencing: shotgun sequencing (random sequences are obtained from the totality of the genetic material) and amplicon sequencing (reads belong to a fixed gene of each species, most commonly the 16S rRNA gene). The resulting reads are subjected to quality filtering. Then taxonomic classification is performed: each read is matched against an available taxonomically annotated database of microbial genomes or genes. It is also possible to perform a de novo analysis without a reference. The result of the taxonomic classification of all reads of a sample is a vector of feature abundances. Each element of this vector gives the number of reads assigned to some feature: a taxon, gene or gene group. Dividing the counts by the total number of reads in the sample gives relative abundances. In the context of 16S rRNA sequencing, the common type of feature is the operational taxonomic unit (OTU); as the concept of an OTU is not normally used in shotgun sequencing, we will use a microbial species as a feature in the text.

Shotgun sequencing allows us to describe the functional structure of a microbial community in addition to its taxonomic structure. It gives a semi-quantitative portrait of the genes, gene groups or metabolic pathways in the community. Currently, shotgun sequencing is a much more costly procedure than the amplicon format. It is therefore less widespread, although the amplicon format gives no information about the gene content of a microbiota. In this paper, without loss of generality, we describe the methods of statistical analysis using 16S rRNA data as an example. For such data each feature is a microbial taxon, and microbial communities are compared via the analysis of these feature vectors.
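The relationship between read counts and relative abundances can be made concrete with a short sketch. It is written in Python purely for illustration (the chapter's tools are R packages), and the function name and the read counts are hypothetical.

```python
def relative_abundance(counts):
    """Convert a vector of read counts into relative abundances summing to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no classified reads")
    return [c / total for c in counts]


# Hypothetical read counts for four taxa in two samples of different depth.
sample_a = [120, 30, 0, 50]       # 200 classified reads in total
sample_b = [1200, 300, 10, 490]   # 2000 classified reads in total

profile_a = relative_abundance(sample_a)
profile_b = relative_abundance(sample_b)
```

In this made-up example the third taxon is absent from the shallow sample but detected in the deep one, which is the kind of depth-dependent sparsity that the normalization steps discussed later have to handle.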


Statistical properties of metagenomic data

Basic steps

As an exhaustive survey of all gut microbiotas is not feasible, the statistical analysis is based on a representative sample from this entire population. The researcher is responsible for controlling the validity of the outcome. To identify changes in microbiota composition associated with certain factors, a researcher postulates and then tests a null hypothesis that the groups do not differ, for instance, that the observed difference in the relative abundance of a microbial species between healthy subjects and patients is due to chance alone. Each feature describing microbial composition (e.g. the relative abundance of a certain species) can be considered a random variable realized in each individual metagenome. The null hypothesis then states that the distribution of this random variable is independent of the examined factor. Often the differences between distributions are assessed by comparing their mean (or median) values, with the null hypothesis stating that the parameter is equal across the groups.

To test the hypothesis, a test statistic is introduced: a numeric function reflecting the degree of similarity between the samples. Then a P-value is computed: the probability of observing this or a more extreme value of the statistic, assuming the null hypothesis is true. Either the absolute value of the deviation (two-sided test) or the deviation in a certain direction (one-sided test) may be considered. A one-sided test is convenient for experiments in which the direction of the factor's influence is strongly supported a priori, for example, when assessing the deleterious effect of antibiotics on the abundance of sensitive species. One-sided testing allows changes with a smaller effect to be detected at the same significance level.
More often, however, the direction of the influence is not known, or the relative abundances of all taxa in a community are compared at once, so that when the abundance of some taxa increases, the abundance of others has to decrease. In such cases, a two-sided test is the right choice. A small P-value means that two such samples are unlikely to come from the same distribution. If the P-value is lower than a defined threshold, the null hypothesis is rejected. The threshold (known as the critical value or significance level of the test) is not fixed absolutely and may vary, but it is commonly set to 5% (P < 0.05).

| Package | Method | Comparison level | Sequencing depth | Variance | Parameterization | Notes | Power calculation | … | … (>2 allowed values) | Many continuous or discrete factors |
|---|---|---|---|---|---|---|---|---|---|---|
| stats | Wilcoxon, Kruskal–Wallis, Welch and Student's tests | Component-wise | Should be controlled manually (rarefaction or normalization) | Variance-stabilizing transformation is recommended for all tests except the Welch test, which is designed for unequal dispersions | Wilcoxon and Kruskal–Wallis tests are non-parametric; Welch and Student's tests assume a normal distribution | Variance-stabilizing data transformation is recommended for non-parametric methods and Student's test | n.ttest() from the samplesize package for Welch and Student's tests; https://fedematt.shinyapps.io/shinyMB/ and (Mattiello et al., 2016; Rahardja et al., 2009) for the unpaired Wilcoxon test | + | – | – |
| ALDEx2 | Wilcoxon, Welch and Kruskal–Wallis tests, ANOVA | Component-wise | Each sample is substituted with Monte Carlo samples from a Dirichlet distribution | Monte Carlo sampling provides a more accurate estimate of the dispersion; variance-stabilizing clr transformation; the Welch test is valid for data with non-uniform variance | Wilcoxon and Kruskal–Wallis tests are non-parametric; the Welch test and ANOVA assume a normal distribution | Allows an accurate estimate of significance for small sample sizes and a correct comparison of abundance vectors after low-abundance taxa are neglected | samplesize and pwr packages for Welch and Student's tests and ANOVA; https://fedematt.shinyapps.io/shinyMB/ and (Mattiello et al., 2016; Rahardja et al., 2009) for the unpaired Wilcoxon test | + | + | + |
| metagenomeSeq | Generalized linear model | Component-wise | Percentile normalization to avoid biases caused by the preferential amplification of certain nucleotide sequences | Variance-stabilizing logarithmic transformation; the statistic used is valid for data with non-uniform variance | Zero-inflated Gaussian distribution | Avoids biases caused by the preferential amplification of certain nucleotide sequences; designed for the sparse data inherent in 16S rRNA sequencing; needs sufficient sample size and sequencing depth | No | + | + | + |
| edgeR | Generalized linear model | Component-wise | Normalization: normalizing factors are selected to minimize the log-fold changes for the majority of taxa | Empirical Bayesian approach to dispersion estimation; the statistic used is valid for data with non-uniform variance | Negative binomial distribution | Some results suggest that the method yields too many false positives because the dispersion estimate is too low | No | + | + | + |
| DESeq2 | Generalized linear model | Component-wise | Normalization: normalization factors take the sequencing depth into account | Empirical Bayesian approach to dispersion estimation; the statistic used is valid for data with non-uniform variance | Negative binomial distribution | Provides an empirical Bayesian approach to effect size estimation | No | + | + | + |
| HMP | Generalized Wald-type statistics | Community-level | Should be rarefied manually | Dirichlet–multinomial distribution; the statistic used is valid for data with non-uniform variance | Dirichlet–multinomial distribution | Takes into account the compositional nature of metagenomic data | Implemented in the package; the required sample size may be calculated with the wrapper https://fedematt.shinyapps.io/shinyMB/ | + | + | – |
| vegan | PERMANOVA, ANOSIM | Community-level | Should be rarefied manually in the case of weighted metrics | The method is not designed for unequal variances. PERMANOVA is more robust than ANOSIM and some other non-parametric methods | Non-parametric | Takes into account the compositional nature of metagenomic data | micropower package (for Jaccard and UniFrac metrics) | + | + | – |


taxa resulting in hundreds of statistical tests. Therefore, multiple testing correction is necessary to adjust the resulting P-values. However, if the study aims from the outset to examine only a single species of interest among all community members and the other species are not analysed further, the correction is not required. The power of the comparison should be calculated according to the adjusted significance level.

The common solution for comparing two groups is Student's t-test for samples from normal distributions with equal variances. As mentioned above, this method cannot be applied directly to metagenomic data because of their poor conformance to the normal distribution and their heterogeneous variance. The latter problem may be overcome by applying a variance-stabilizing transformation to the feature vector or by using Welch's test, a generalization of Student's test to the case of unequal variances. These approaches give acceptable results for high-abundance taxa (Jonsson et al., 2016). Student's and Welch's tests for paired and unpaired comparisons are implemented in the standard R package stats. The power or required sample size for such a comparison without multiple testing correction can be calculated using the samplesize package.

An approach that overcomes the non-normality of the data distribution is the non-parametric Wilcoxon test. It also has versions for paired and unpaired comparisons and makes no assumptions about the type of distribution. This method is also implemented in the stats package. Power calculations for the Wilcoxon test in the case of single-feature abundance comparison are described in detail elsewhere (Mattiello et al., 2016; Rahardja et al., 2009). An online resource for calculating the required sample size for multiple features is available at https://fedematt.shinyapps.io/shinyMB/ (Mattiello et al., 2016).
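To illustrate what a multiple testing correction does to a set of per-taxon P-values, here is a minimal Python sketch of the Benjamini–Hochberg adjustment. The choice of this particular procedure is an assumption made for illustration, since the passage above does not name one; in R the same adjustment is available as p.adjust(p, method = "BH") in the stats package. The P-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P-values from five component-wise tests.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
```

With these invented values, four raw P-values fall below 0.05 but only two adjusted ones do, which shows why power has to be calculated against the adjusted significance level.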
The ALDEx2 package combines a variance-stabilizing transformation, Welch's and Wilcoxon tests and a tool for obtaining a more accurate P-value estimate for small sample sizes. In order to compare metagenomes with different sequencing depths, each feature vector is associated with a Dirichlet distribution whose mean equals the normalized feature vector. A random vector from this distribution has non-negative components summing to 1. The probability density of the Dirichlet distribution is

$$ f(x_1, \ldots, x_n; \alpha_1, \ldots, \alpha_n) = \frac{1}{B(\alpha)} \prod_{i=1}^{n} x_i^{\alpha_i - 1}, $$

where $x = (x_1, \ldots, x_n)$ is a random vector, $\alpha = (\alpha_1, \ldots, \alpha_n)$ is the feature vector (with a pseudocount of 0.5 added to each component) and $B(\alpha)$ is the multivariate beta function. The greater the taxon abundance and the smaller the total number of reads in the sample, the greater the variance (Fernandes et al., 2014). Substituting the original feature vector with several random vectors generated from the corresponding Dirichlet distribution leads to a more accurate estimate of the variance and thus of the significance of differences. At the next step, the centred log-ratio transformation is applied to the random vectors obtained. Besides stabilizing the variance, this transformation ensures the proper comparison of two components of the same vector (e.g. which of two species is more abundant within a single community) even if low-abundance taxa are excluded from the study, as is often done (Fernandes et al., 2014). The transformed abundances may be compared using either the Wilcoxon or Welch's test. For small groups, the authors of the package recommend Welch's test, as its power is less sensitive to the sample size, while for large groups the results of the two tests are similar.
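The two steps just described, Dirichlet Monte Carlo sampling followed by the centred log-ratio transformation, can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not the ALDEx2 implementation: the function names, the number of draws and the read counts are hypothetical.

```python
import math
import random


def dirichlet_sample(alpha, rng):
    """Draw one vector from Dirichlet(alpha) by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def clr(proportions):
    """Centred log-ratio: log of each part minus the mean log over all parts."""
    logs = [math.log(p) for p in proportions]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def monte_carlo_clr(counts, n_draws=128, seed=0):
    """Replace one count vector by CLR-transformed Dirichlet instances."""
    rng = random.Random(seed)
    alpha = [c + 0.5 for c in counts]  # pseudocount of 0.5, as in the text
    return [clr(dirichlet_sample(alpha, rng)) for _ in range(n_draws)]


counts = [120, 30, 0, 50]  # hypothetical read counts for four taxa
instances = monte_carlo_clr(counts)
```

Each transformed instance sums to zero by construction. The spread of a taxon's values across instances reflects the sampling uncertainty that a single normalized vector would hide, and that spread is largest for the taxon with zero reads.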


ALDEx2 contains two methods for multiple group comparisons. They test the hypothesis that a taxon's abundance is equal in all groups against the alternative that the abundance differs in at least one group. The first method is the non-parametric Kruskal–Wallis test, a generalization of the Wilcoxon test to more than two groups. The second is one-way ANOVA (analysis of variance), which is based on comparing inter-group and intra-group differences. It is a parametric method that assumes the transformed data are normally distributed. Both ANOVA and the Kruskal–Wallis test require equal variances in all groups.

The methods mentioned above are appropriate only for studies with discrete factors (e.g. disease severity index or country of residence). However, it is often necessary to identify associations between the microbial community structure and factors that take continuous values, for example, body mass index, age or drug dosage. One approach is to divide the range of the factor into intervals and thereby make it discrete. For example, age groups can be used instead of age in years. The methods for discrete factors can then be applied.

Another approach, which allows the analysis of both discrete and continuous factors and, moreover, multifactor analysis, is based on generalized linear models (GLMs). Essentially, the method assumes that the expectation of the dependent variable (a component of the microbial composition) is a function of a linear combination of covariates (factors): $E(y) = g^{-1}(X\beta)$, where $g$ is the so-called link function, which should be defined for the model, $g^{-1}$ denotes its inverse, $X$ is the predictor vector, $\beta$ is the coefficient vector estimated from the input data, $y$ is the dependent variable and $E$ denotes the mean.
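As a concrete instance of the relation between the mean, the link function and the linear predictor, the following Python sketch fits a GLM with a log link and a Poisson response to one continuous factor by Newton-Raphson iteration. The Poisson choice is an assumption made only to keep the example short, and none of the packages discussed here is claimed to fit its model this way; the function name and the data are hypothetical.

```python
import math


def fit_poisson_glm(x, y, n_iter=50):
    """Fit E(y) = exp(b0 + b1 * x) by Newton-Raphson (log link, Poisson)."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: derivative of the log-likelihood w.r.t. (b0, b1).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix [[h00, h01], [h01, h11]].
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01  # zero if all x values are identical
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-12:
            break
    return b0, b1


factor = [0, 1, 2, 3, 4, 5]     # hypothetical continuous factor values
reads = [3, 4, 8, 11, 22, 30]   # hypothetical read counts of one taxon
intercept, slope = fit_poisson_glm(factor, reads)
```

The fitted slope plays the role of the coefficient whose equality to zero is the null hypothesis; assessing its significance requires a standard error or a likelihood-ratio test, which this sketch leaves out.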
The dependent variable should follow a distribution from the exponential family (examples include normal, log-normal, Dirichlet, Poisson, binomial or negative binomial). In the case of the identity link function and a normally distributed random variable, the GLM reduces to a standard linear model. A paired comparison can be performed by including N additional binary factors, where N is the number of pairs and each factor indicates whether a metagenome belongs to a certain pair. In a GLM, each null hypothesis states that the coefficient of the corresponding factor equals zero. It is worth noting that such a model is useful only when the underlying association between the factor and the feature abundance (transformed by the link function) is indeed linear. Otherwise, the approach is inappropriate and may lead to biased conclusions.

A common function for fitting generalized linear models in R is glm2 (Marschner, 2014). Adaptations of such models to metagenomic data are implemented in the metagenomeSeq, MaAsLin and shotgunFunctionalizeR packages. The packages edgeR and DESeq2 for differential gene expression analysis based on RNA-seq data also contain implementations of GLMs. They are widely used in microbiome studies because RNA-seq and metagenomic data are similar in format and statistical properties: hundreds to thousands of discrete features with distributions varying over several orders of magnitude.

In the edgeR package, the dependent variable (the relative abundance of a taxon) is described by a negative binomial distribution. Between-group comparison is performed using an exact test based on this distribution model (in a way similar to a standard t-test). The variance is estimated using an empirical Bayesian approach: the estimate obtained from the data for an individual taxon is shrunk towards the value assessed across all taxa. The degree of shrinkage depends on the mean value. The edgeR package includes a correction for sequencing depth that minimizes the log-fold changes for the majority of the taxa. Several methods for determining the GLM coefficients β and their significance are available.

The approach implemented in the DESeq2 package is similar, with a few differences. Besides alternative normalization and variance estimation methods, DESeq2 uses an empirical Bayesian approach to obtain the effect size. The effect size estimate is shrunk towards zero: the less abundant the taxon, the stronger the shrinkage. The shrinkage also depends on the feature variance. This approach helps to avoid a situation in which most of the features found to be differentially abundant between the two groups have low abundance. The methods for estimating the significance also differ from those in edgeR. DESeq2 tends to produce fewer false positives than edgeR. However, unlike the metagenomeSeq package described below, DESeq2 works slowly on large datasets containing 100 or more samples per group, a typical scenario for a metagenomic study (Weiss et al., 2015).

The metagenomeSeq package, based on GLMs, was developed specifically for 16S rRNA metagenomic data, which were found to be sparser than gene expression data. To allow a correct comparison of metagenomes with different library sizes, the taxa abundances are normalized by a certain percentile determined from the given data. This method resolves the problem of varying OTU-specific PCR amplification efficiency, a known technical artefact. A variance-stabilizing logarithmic transformation is applied to the normalized data. The transformed feature abundances are assumed to follow a zero-inflated Gaussian distribution, which takes into account the dependence of the set of taxa detected in a sample on the sequencing depth and adjusts the data for their sparsity.
After this correction, the data distribution is closer to normal according to the Shapiro–Wilk test (Paulson et al., 2013). The empirical Bayesian approach implemented in the limma package (Ritchie et al., 2015) is used to test the null hypothesis that the linear model coefficient equals zero and to estimate the significance. The test is based on a moderated t-statistic and involves a shrunk variance estimate in a way similar to edgeR (Paulson et al., 2013; Ritchie et al., 2015). Benchmarking of various methods using two-group comparisons on both simulated and real-world metagenomic datasets showed that metagenomeSeq performed better in terms of AUC (area under the curve), especially for middle- and high-abundance taxa (Button et al., 2013; Jonsson et al., 2016). However, the package underestimates the FDR, especially for datasets with low sequencing coverage and small sample size, and tends to make more false discoveries than other packages (Jonsson et al., 2016; McMurdie et al., 2014; Weiss et al., 2015). Thus, taking into account its high speed compared with other methods (Weiss et al., 2015), metagenomeSeq is recommended for the analysis of 16S rRNA sequencing data provided that both the group size and the sequencing depth are high (Jonsson et al., 2016; McMurdie et al., 2014; Paulson et al., 2013; Weiss et al., 2015).

Generalized linear models for metagenomic data are also implemented in the packages MaAsLin and shotgunFunctionalizeR. While only limited information has been published on the details, the overdispersed Poisson generalized linear model implemented in shotgunFunctionalizeR showed good results on shotgun metagenomes (Jonsson et al., 2016). The advantage of MaAsLin is a boosting process that ranks the factors in order of their contribution to the observed differences in microbiota composition, which is useful when the number of factors is high and it is difficult for a researcher to infer the importance of each factor.
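The idea of shrinking per-taxon variance estimates towards a value shared across all taxa can be illustrated with a weighted average of the two, weighted by degrees of freedom. This Python sketch takes the prior variance to be the plain mean of the per-taxon variances and uses a fixed prior weight; both are simplifying assumptions, whereas the packages above estimate these quantities from the data. All names and numbers are hypothetical.

```python
def shrink_variances(variances, df_residual, df_prior):
    """Shrink per-taxon variances towards their common mean.

    Each result is a weighted average of the taxon's own variance (weight
    df_residual) and the prior variance (weight df_prior).
    """
    prior_var = sum(variances) / len(variances)
    return [
        (df_prior * prior_var + df_residual * v) / (df_prior + df_residual)
        for v in variances
    ]


# Hypothetical variances of three taxa, each estimated with 4 degrees of freedom.
raw_variances = [0.1, 1.0, 4.9]
shrunk = shrink_variances(raw_variances, df_residual=4, df_prior=4)
```

The smallest variance is pulled upwards and the largest downwards. Because a t-like statistic divides by the square root of the variance, this guards against taxa that appear significant only because their variance was underestimated from a few samples.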


Community-level comparison of microbial communities

Microbial communities can be compared not only component-wise but also as a whole, taking into account the possible interdependence of the relative abundance levels of the various bacterial species in the microbiota. This is reasonable in view of the many studies pointing to elaborate ecological relations between species within microbiota (Levy and Borenstein, 2013; Stein et al., 2013). One such method common in biostatistics is MANOVA (multivariate analysis of variance), a generalization of ANOVA to the multivariate case, but its application to metagenomic data is limited because it requires the taxa abundance levels to be normally distributed. A researcher can resort to non-parametric modifications of this approach such as ANOSIM [analysis of similarity, function anosim in the package vegan (Oksanen et al., 2012)] and PERMANOVA (permutational multivariate analysis of variance, function adonis in the same package). They are also based on the comparison of within-group and between-group variances. In metagenomics, the researcher may choose specific dissimilarity measures such as the weighted and unweighted Jaccard distance or UniFrac. The unweighted metrics are less sensitive to differences in sequencing depth between samples, since they are based only on the presence or absence of a taxon in a sample rather than on its abundance. The methods differ in their approach to the variance comparison: ANOSIM compares ranks of variances, similarly to the Kruskal–Wallis test, while PERMANOVA compares variance values and estimates the significance via a permutation method. A disadvantage of these methods is that they require equal within-group variances, which is usually not the case for metagenomic data (Warton et al., 2012). PERMANOVA was shown to be more robust to violation of this requirement than ANOSIM (Anderson and Walsh, 2013).
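The permutation logic behind PERMANOVA can be sketched in plain Python: a pseudo-F ratio of between-group to within-group sums of squared distances is computed for the observed labels and then recomputed after shuffling the labels. This is a simplified illustration for one factor with the unweighted Jaccard distance, not the adonis implementation in vegan; the function names and the data are hypothetical.

```python
import random


def jaccard_distance(a, b):
    """Unweighted Jaccard distance based on the presence or absence of taxa."""
    present_a = {i for i, v in enumerate(a) if v > 0}
    present_b = {i for i, v in enumerate(b) if v > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def pseudo_f(dist, labels):
    """Pseudo-F statistic computed from a distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        members = [i for i in range(n) if labels[i] == g]
        ss_within += sum(
            dist[i][j] ** 2 for k, i in enumerate(members) for j in members[k + 1:]
        ) / len(members)
    ss_between = ss_total - ss_within
    return (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permanova(samples, labels, n_perm=999, seed=0):
    """Return the observed pseudo-F and its permutation P-value."""
    rng = random.Random(seed)
    n = len(samples)
    dist = [[jaccard_distance(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical read counts: six samples, five taxa, two groups of three.
samples = [
    [10, 5, 0, 0, 1], [8, 7, 1, 0, 0], [12, 3, 0, 0, 0],
    [0, 0, 9, 6, 1], [0, 1, 7, 8, 0], [0, 0, 11, 4, 0],
]
labels = ["A", "A", "A", "B", "B", "B"]
f_obs, p_value = permanova(samples, labels)
```

With only three samples per group there are just 20 distinct ways to assign the labels, so the permutation P-value cannot fall much below 0.1 however well the groups separate, a small reminder of why group size has to be planned before sampling.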
The R package micropower provides functions for statistical power calculations for PERMANOVA based on weighted and unweighted Jaccard distance as well as UniFrac. A parametric approach to the problem of multivariate group comparison is implemented in the package HMP. It employs a generalized Wald-type statistic to compare the estimates of the statistical model parameters. The Dirichlet–multinomial distribution, a combination of the multinomial and Dirichlet distributions, is used to model the abundance vector in each group. Like the binomial and Poisson distributions, the multinomial distribution does not allow the mean and the variance to be estimated independently. The Dirichlet–multinomial distribution avoids this flaw, as it provides an overdispersion parameter that is estimated separately from the mean; when the overdispersion is zero it coincides with the multinomial distribution. La Rosa et al. (2012) showed that this approach indeed describes metagenomic data better. The parameters of the model are estimated with the method of moments or the maximum likelihood method. HMP provides no instrument for the correct comparison of samples with different sequencing depth, so one should be careful with data preprocessing. The package HMP also allows power calculations. A wrapper for sample size determination is available at https://fede.shinyapps.io/shinyMB/ (Mattiello et al., 2016).

Conclusions

It is important to emphasize that statistical analysis should be planned at the very beginning of a metagenomic study, before the sample collection and sequencing procedures.

34 | Odintsova et al.

A proper balance between the number of samples and the sequencing depth leads to high statistical power, and the results of the study will consequently be more valuable to the scientific community. Preliminary in silico experiments on published metagenomes of similar format and microbiota type, as well as on simulated data, contribute to the success of the analysis. The choice of a package for a specific problem in a metagenomic survey depends on several conditions: paired or unpaired design, continuity of factor values, component-wise or vector comparison, the need for power control, and the suggested sample size and sequencing depth. First of all, the aim of the comparison must be formulated: is the researcher interested in component-wise analysis or in beta-diversity among groups? The packages ALDEx2, metagenomeSeq, edgeR, DESeq2 and MaAsLin and the overdispersed model in shotgunFunctionalizeR are designed for the former task, while HMP and PERMANOVA coupled with the micropower package may be used for the latter. For continuous factors or multifactor analysis, the methods based on the generalized linear model are recommended (metagenomeSeq, edgeR, DESeq2, MaAsLin and the overdispersed model in shotgunFunctionalizeR). For a low number of samples, the DESeq2 package is recommended, while for larger sample sizes and deeper sequencing metagenomeSeq is preferable owing to its higher performance (Weiss et al., 2015). The ALDEx2 package is suitable for multiple group comparison with small groups, as it gives more accurate significance estimates. The existing evidence suggests that methods based on binomial, multinomial and Poisson distributions are not appropriate for metagenomic statistical evaluation because they produce a great number of false discoveries.
Overall, a researcher should perform an exploratory analysis to check whether the distributions in the particular dataset being analysed conform to the parameterization used by the package of choice, as this greatly influences the accuracy of the results. If the quality of the model fit is low, non-parametric methods such as the Wilcoxon test, the Kruskal–Wallis test and PERMANOVA should be used.

Acknowledgements

We thank Pavel Mazin for discussions of the text. This work was financially supported by the Ministry of Education and Science of the Russian Federation (agreement # RFMEFI60414X0119).

References

Anderson, M.J., and Walsh, D.C.I. (2013). PERMANOVA, ANOSIM, and the Mantel test in the face of heterogeneous dispersions: What null hypothesis are you testing? Ecol. Monogr. 83, 557–574. http://dx.doi.org/10.1890/12-2010.1
Baker, M. (2016). Statisticians issue warning over misuse of P values. Nature 531, 151. http://dx.doi.org/10.1038/nature.2016.19503
Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S., and Munafò, M.R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376. http://dx.doi.org/10.1038/nrn3475
Egshatyan, L., Kashtanova, D., Popenko, A., Tkacheva, O., Tyakht, A., Alexeev, D., Karamnova, N., Kostryukova, E., Babenko, V., Vakhitova, M., et al. (2016). Gut microbiota and diet in patients with different glucose tolerance. Endocr. Connect. 5, 1–9. http://dx.doi.org/10.1530/EC-15-0094
Fernandes, A.D., Reid, J.N., Macklaim, J.M., McMurrough, T.A., Edgell, D.R., and Gloor, G.B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome 2, 15. http://dx.doi.org/10.1186/2049-2618-2-15


Goodrich, J.K., Di Rienzi, S.C., Poole, A.C., Koren, O., Walters, W.A., Caporaso, J.G., Knight, R., and Ley, R.E. (2014). Conducting a microbiome study. Cell 158, 250–262. http://dx.doi.org/10.1016/j.cell.2014.06.037
Hair, J.F., Black, W.C., Babin, B.J., and Anderson, R.E. (2010). Multivariate Data Analysis (Pearson Prentice-Hall, Inc).
Human Microbiome Project Consortium. (2012). Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214. http://dx.doi.org/10.1038/nature11234
Jonsson, V., Österlund, T., Nerman, O., and Kristiansson, E. (2016). Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics. BMC Genomics 17, 78. http://dx.doi.org/10.1186/s12864-016-2386-y
Kelly, B.J., Gross, R., Bittinger, K., Sherrill-Mix, S., Lewis, J.D., Collman, R.G., Bushman, F.D., and Li, H. (2015). Power and sample-size estimation for microbiome studies using pairwise distances and PERMANOVA. Bioinformatics 31, 2461–2468. http://dx.doi.org/10.1093/bioinformatics/btv183
Kristiansson, E., Hugenholtz, P., and Dalevi, D. (2009). ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes. Bioinformatics 25, 2737–2738. http://dx.doi.org/10.1093/bioinformatics/btp508
Levy, R., and Borenstein, E. (2013). Metabolic modeling of species interaction in the human microbiome elucidates community-level assembly rules. Proc. Natl. Acad. Sci. U.S.A. 110, 12804–12809. http://dx.doi.org/10.1073/pnas.1300926110
Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550. http://dx.doi.org/10.1186/s13059-014-0550-8
Marschner, I. (2014). glm2: Fitting Generalized Linear Models. R package version 1.1.2. http://CRAN.R-project.org/package=glm2
Mattiello, F., Verbist, B., Faust, K., Raes, J., Shannon, W.D., Bijnens, L., and Thas, O. (2016). A web application for sample size and power calculation in case-control microbiome studies. Bioinformatics 32, 2038–2040. http://dx.doi.org/10.1093/bioinformatics/btw099
McCarthy, D.J., Chen, Y., and Smyth, G.K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 40, 4288–4297. http://dx.doi.org/10.1093/nar/gks042
McMurdie, P.J., and Holmes, S. (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLOS Comput. Biol. 10, e1003531. http://dx.doi.org/10.1371/journal.pcbi.1003531
Oksanen, J., Blanchet, F.G., Kindt, R., Legendre, P., Minchin, P.R., O’Hara, R.B., et al. (2012). vegan: Community Ecology Package. R package version 2.0-5. http://CRAN.R-project.org/package=vegan
Paulson, J.N., Stine, O.C., Bravo, H.C., and Pop, M. (2013). Differential abundance analysis for microbial marker-gene surveys. Nat. Methods 10, 1200–1202. http://dx.doi.org/10.1038/nmeth.2658
R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rahardja, D., Zhao, Y.D., and Qu, Y. (2009). Sample Size Determinations for the Wilcoxon–Mann–Whitney Test: A Comprehensive Review. Stat. Biopharm. Res. 1, 317–322. https://dx.doi.org/10.1198/sbr.2009.0016
Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47. http://dx.doi.org/10.1093/nar/gkv007
La Rosa, P.S., Brooks, J.P., Deych, E., Boone, E.L., Edwards, D.J., Wang, Q., Sodergren, E., Weinstock, G., and Shannon, W.D. (2012). Hypothesis testing and power calculations for taxonomic-based human microbiome data. PLOS ONE 7, e52078. http://dx.doi.org/10.1371/journal.pone.0052078
La Rosa, P.S., Zhou, Y., Sodergren, E., Weinstock, G., and Shannon, W.D. (2015). Hypothesis Testing of Metagenomic Data. In Metagenomics for Microbiology, J. Izard and M.C. Rivera, ed. (Academic Press), pp. 81–96.
Sham, P.C., and Purcell, S.M. (2014). Statistical power and significance testing in large-scale genetic studies. Nat. Rev. Genet. 15, 335–346. http://dx.doi.org/10.1038/nrg3706
Stein, R.R., Bucci, V., Toussaint, N.C., Buffie, C.G., Rätsch, G., Pamer, E.G., Sander, C., and Xavier, J.B. (2013). Ecological modeling from time-series inference: insight into dynamics and stability of intestinal microbiota. PLOS Comput. Biol. 9, e1003388. http://dx.doi.org/10.1371/journal.pcbi.1003388
Tyakht, A.V., Kostryukova, E.S., Popenko, A.S., Belenikin, M.S., Pavlenko, A.V., Larin, A.K., Karpova, I.Y., Selezneva, O.V., Semashko, T.A., Ospanova, E.A., et al. (2013). Human gut microbiota community structures in urban and rural populations in Russia. Nat. Commun. 4, 2469. http://dx.doi.org/10.1038/ncomms3469


Wang, F., Kaplan, J.L., Gold, B.D., Bhasin, M.K., Ward, N.L., Kellermayer, R., Kirschner, B.S., Heyman, M.B., Dowd, S.E., Cox, S.B., et al. (2016). Detecting Microbial Dysbiosis Associated with Pediatric Crohn Disease Despite the High Variability of the Gut Microbiota. Cell. Rep. 14, 945–955. http://dx.doi.org/10.1016/j.celrep.2015.12.088
Warton, D.I., Wright, S.T., and Wang, Y. (2012). Distance-based multivariate analyses confound location and dispersion effects. Methods Ecol. Evol. 3, 89–101. http://dx.doi.org/10.1111/j.2041-210X.2011.00127.x
Wasserstein, R.L., and Lazar, N.A. (2016). The ASA’s statement on p-values: context, process, and purpose. Am. Stat. 70, 129–133. http://dx.doi.org/10.1080/00031305.2016.1154108
Weiss, S.J., Xu, Z., Amir, A., Peddada, S., Bittinger, K., Gonzalez, A., Lozupone, C., Zaneveld, J.R., Vazquez-Baeza, Y., Birmingham, A., et al. (2015). Effects of library size variance, sparsity, and compositionality on the analysis of microbiome data. PeerJ 230313. http://dx.doi.org/10.7287/peerj.preprints.1157v1

3

Methods for the Metagenomic Data Visualization and Analysis

Konstantin Sudarikov1*, Alexander Tyakht2,3 and Dmitry Alexeev2

1The National Research University Higher School of Economics, Myasnitskaya ulitsa 20, Moscow, Russian Federation.
2Moscow Institute of Physics and Technology, Institutskiy pereulok 9, Dolgoprudny, Russian Federation.
3Federal Research and Clinical Centre of Physical-Chemical Medicine, Malaya Pirogovskaya 1a, Moscow, Russian Federation.

*Correspondence: [email protected]
https://doi.org/10.21775/9781910190593.03

Abstract

Surveys of environmental microbial communities using a metagenomic approach produce vast volumes of multidimensional data on the phylogenetic and functional composition of the microbiota. Faced with such complex data, a metagenomic researcher needs to select the means of data analysis carefully. Data visualization has become an indispensable part of exploratory data analysis and serves as a key to discoveries. While the molecular genetic analysis of even a single bacterium presents multiple layers of data to be properly displayed and perceived, studies of microbiota are significantly more challenging. Here we present a review of the state-of-the-art methods for the visualization of metagenomic data in a multilevel manner: from methods applicable to the in-depth analysis of a single metagenome to techniques appropriate for large-scale studies containing hundreds of environmental samples.

Introduction

Metagenomics is an interdisciplinary research field combining molecular genetics, microbial ecology and data analysis. Its central object of study is the metagenome, the total genomic content of the organisms and viruses present in an environmental sample. Metagenomics is based on culture-independent methods of bacterial identification, which allow the detection of the entire community of microbes (microbiota), including species that cannot be isolated and cultivated using existing microbiological techniques. In recent years, this advantage, together with the high throughput of DNA-sequencing platforms, has allowed researchers to reveal the previously unobserved richness of microbiota in

38 | Sudarikov et al.

various niches, from soils and oceans to urban environments and host-associated microbiota. Human microbiota is of particular interest to biomedical researchers: analysis of the balance and dynamics of the gut microbial community allows us to discover new biomarkers of disease and to predict more precisely the influence of diet, medical treatment and other factors on the homeostasis of the human organism, as well as to design efficient predictive and therapeutic approaches.

Faced with vast volumes of biological experimental data (both published and generated in-house), a researcher can process them efficiently only if adequate methods for the visual display of these complex datasets are available. In this sense, visualization is only partly about the graphical expression of the data; in fact, it is an essential tool of exploratory analysis in biology. Studies of microbiota are no exception: the datasets obtained in such surveys are characterized by intrinsic multidimensionality and by the presence of multiple levels of hierarchy and connectivity. Even a genomic study dedicated to a single isolated microbial species yields multidimensional data with a heterogeneous structure that are challenging to perceive, illustrate and navigate, and a microbiota contains hundreds or thousands of such entities. Visualization of metagenomic data is an active area of research: dozens of publications every year describe original methods and bring new tools for generating and testing biological hypotheses. Here we present a comprehensive overview of the existing methods for the visualization of metagenomic datasets.
While the general graphic design guidelines for choosing a proper palette, composition, proportions, fonts and other artistic elements are described elsewhere (Tufte, 1986; Steele and Iliinsky, 2010), our review focuses on visual techniques that prove particularly appropriate for mining data on microbiota and can be easily adopted by a beginning metagenomic researcher using publicly available software implementations (as a Web service or a stand-alone application). We have illustrated applications of the described implementations with specially constructed figures. We have also summarized the general methods for metagenomic data visualization in Table 3.1.

Depending on the environmental niche in focus, microbial studies involving metagenomics range widely in the scale of the generated datasets: the number of metagenomes can vary from a few (e.g. for a novel niche and/or samples sequenced with high coverage) to hundreds of samples (for previously studied niches such as human gut microbiota and/or studies performing meta-analysis of published data). Such variation in the number of samples suggests that different methods should be applied in order to navigate the different levels of visualization efficiently. Even within a project with thousands of samples, a researcher can choose to examine a single sample in detail or zoom out to overview the whole landscape of the metagenomes in the analysis. The fact that a researcher’s success depends on effective navigation between different scales of visual representation is neatly expressed by the so-called Visual Information Seeking Mantra: ‘Overview first, zoom and filter, then details on demand’ (Shneiderman, 1996).
With this consideration in mind, we have divided the description of the methods for metagenomic data visualization into three sections: the methods commonly intended for the visual display of a single metagenome, of several metagenomes (about 10) and of multiple metagenomes (tens to hundreds).

Table 3.1 Methods for metagenomic data visualization, with a short description, suitability for the visualization of single, several and multiple metagenomes, the advantages and drawbacks of each approach, and selected tools and articles in which the approach was implemented

| Method | Suggested usage and subtypes | Single metagenome | Several metagenomes | Multiple metagenomes | Advantages | Drawbacks | Selected implementations |
|---|---|---|---|---|---|---|---|
| Pie charts | Taxon abundance at all taxonomic ranks | + | + | – | Convenient for overviewing the community structure of a single metagenome | Poor comparability between several metagenomes | Krona, AmphoraVizu, Taxonomer |
| Pie charts | Taxon abundance at a fixed taxonomic rank with various characteristics of contigs | + | + | – | Multiple metagenomes can be represented as rings | Can be too large and contain too much information for easy perception | Anvi’o |
| Pie charts | Comparison of metagenomic features | + | + | – | Many metagenome features represented as rings | Can be too large and contain too much information for easy perception | Anvi’o |
| Bar charts | Taxon abundance at all taxonomic ranks | + | + | + | Summarizes the information about all metagenomes | Can contain too many coloured bars for easy perception | AmphoraVizu |
| Bar charts | Taxon abundance at a fixed taxonomic rank | – | + | – | Opportunities for the demonstrative comparison of several metagenomes | Too many taxa are shown | Community Analyzer, Phinch |
| Bar charts | Taxon abundance for any numerical meta-data category | + | + | + | Information about all samples is used | Difficult to perceive if the number of the categories is high | Phinch |
| Bar charts | Distribution of samples for each taxonomic rank | – | + | – | | Difficult to perceive if the number of taxa is large | Community Analyzer |
| Manhattan plots | Metagenomic SNPs distribution along the microbial genomes | + | – | – | The highest values of each metagenomic SNP are clearly distinguishable | Too many SNPs can be confusing | Explicet |
| Bubble charts | Contig graph of a single metagenome | + | – | – | Dimension reduction and nice representation in the form of densely concentrated contigs | Large number of contigs can be disorientating | Elviz, R package ‘gbtools’ |
| Bubble charts | Taxonomic graph at any taxonomic rank | + | + | + | If number of taxa is not very large, this method can be representative | When large, the representation is chaotic | Phinch |
| Rarefaction curves | Richness of the community (alpha-diversity) | + | – | – | Shows multilevel information | | QIIME, Eren et al. (Pacific Symposium on Biocomputing, 2011) |
| Parallel coordinate plots | Clustering of metagenomes into different groups by their taxonomic or other properties | + | + | + | Multiple simultaneous groupings | Too many clusters can lead to disorientation | Juxter |
| Pathways | Metabolic potential analysis | + | – | – | Detailed representation of the functional properties of microbiota | Large map does not allow the overall view of the whole pathway | iPath |
| Trees and dendrograms | Taxonomic composition | – | + | – | Taxonomic classification and abundance comparison of each taxon at the same time | Comparing only the same taxon in different samples (no between-taxa comparison) | MetaSee |
| Trees and dendrograms | Contig tree: hierarchical clustering of contigs based on their sequence composition and their distribution across the samples | + | + | + | | Too difficult for perception when the number of contigs is high | Anvi’o |
| Trees and dendrograms | Phylogenetic tree | + | + | + | | Too difficult for perception when the number of contigs is high | MetaSee, GraPhlAn, iTOL, MEGAN, Eren et al. (Pacific Symposium on Biocomputing, 2011) |
| Trees and dendrograms | Sample clustering tree (dendrogram according to the similarity of the samples’ composition) | – | + | + | If number of taxa is not very large, this method can be representative | Too difficult for perception when the number of contigs is high | PanPhlAn, Eren et al. (Pacific Symposium on Biocomputing, 2011) |
| Box plots | Distribution of a taxon across the samples | – | + | – | Visual display of the means and quartiles and their visual comparison | Not possible for easy comparison of many metagenomes | Anvi’o |
| Dot plots | Dots representing the presence of several taxa for several sample categories | – | + | + | Combination with a box plot results in a nice representation | Difficult comparison with the high number of samples | API ‘dimple’, R package ‘rCharts’, Eren et al. (Pacific Symposium on Biocomputing, 2011) |
| Heatmaps | Coloured matrix of nucleotide positions for each bin in each sample | – | + | – | Colour comparing of bins at the same positions of metagenomes | Many alternate bins | Anvi’o |
| Heatmaps | Taxa abundance in the samples | – | + | + | Special areas are highlighted | Difficulty with identifying a selected sample or taxon | R package ‘matR’, Eren et al. (Pacific Symposium on Biocomputing, 2011) |
| Heatmaps | Presence/absence of gene family profiles for the strains in samples | – | – | + | Exclusive areas are highlighted | Difficulty with identifying a selected sample or taxon | PanPhlAn |
| Heatmaps | Coloured table of taxa correlation | + | + | + | Selected correlation is displayed well | Too many numbers are not representative if the number of datasets is high | MetaFast, Community Analyzer |
| Slopegraphs | Connected taxa levels in two metagenomes | – | + | – | Good if the displayed number of the taxa is not high | Only two metagenomes, many taxa will lead to chaos | R package ‘ggplot2’ |
| Layouts | Bipartite graphs: graph with connections between the samples and taxa | – | + | – | Taxa are displayed according to their co-occurrence | Edges are superimposed, so they cannot be distinguished | Community Analyzer |
| Layouts | Bipartite graphs: graph with grouping of taxa near the samples where the taxa are abundant | – | + | – | Showing similarity of several samples | Using information only about the most abundant taxa | Sedlar et al. (Evolutionary Bioinformatics Online, 2016) |
| Layouts | Spring graph layout with both samples and taxa | – | + | – | Distances from sample to taxa are proportional to abundances of taxa in that sample | Edges are superimposed, so they cannot be distinguished | Community Analyzer |
| Layouts | PCA, PCoA and MDS | – | + | + | Dimension reduction | Can be low-descriptive | metaG, EMPeror, R packages ‘GrammR’ and ‘matR’, Arumugam et al. (Nature, 2011) |
| Layouts | BCA (between-class analysis) | – | + | + | Visual enhancement of clusters display | | Arumugam et al. (Nature, 2011) |
| Sankey diagrams | Diagram with taxa abundance and connections between different taxonomic ranks | + | + | + | Displaying any number of taxonomic ranks | | Phinch |
| Bacterial rose garden | Plot showing phylogenetic distances from the selected sample to other metagenomes in relation to a selected taxon | + | + | + | Interactive and original | Sometimes the figure is too large and carries too much information to be perceived | Alexeev et al. (BioData Mining, 2015) |
| Self-organizing maps | Large map that preserves the SOM projection topology | – | + | + | Coloured clusters of data | Difficulty with identifying a selected sample or taxon | Laczny et al. (Scientific Reports, 2014) |
| Co-occurrence graphs | Links between the species reflecting their simultaneous presence in the same environments | + | + | + | Visual identification of the clusters of co-occurring taxa | Large number of taxa will lead to a chaotic picture | CoNet, MEGAN, Lui et al. (BioData Mining, 2015); Freilich et al. (Nucleic Acids Research, 2010) |

Visualization of Metagenomic Data | 43

Visualizing a single metagenome

At the most detailed level, visualization of a single metagenomic dataset is needed to represent clearly the taxonomic, functional or other properties of a given metagenome in order to understand its structure and infer biological insights. Analysis of a metagenomic dataset involves feature extraction: the millions of metagenomic reads produced by DNA sequencing can hardly be visualized directly in a comprehensible way. One of the steps involved at this point is the classification of each read, either taxonomic (each read is assigned to a specific microbial taxon) or functional (it is assigned to a gene, gene group or metabolic pathway). The classified reads are then aggregated to form a relative abundance feature vector, with each position reflecting a taxon or a gene group, respectively. This vector represents the composition of a single metagenome, and its sum is frequently normalized to 100%. Thus, metagenomic data are inherently compositional.

The best-known visualization of compositional data is a pie chart: a circular graphic divided into slices, each of which shows the share of the corresponding group of data in per cent. In metagenomics, a pie chart can be used to visualize the community structure of an environmental sample. If the taxonomic rank (e.g. species, genus or family) is fixed, each slice represents a taxon of this rank. Usually every slice is also labelled with its percentage. The total number of slices equals the total number of taxa identified in the sample at the fixed taxonomic rank. This approach is common and is implemented in any spreadsheet application.
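As an illustration of the aggregation step described above, the following minimal Python sketch turns per-read taxon labels into a relative abundance vector normalized to 100% and derives the angles of the corresponding pie slices. The function names and the read labels are hypothetical and chosen only for the example; a real pipeline would take the labels from a read classifier.

```python
from collections import Counter


def relative_abundance(read_assignments):
    """Aggregate per-read taxon labels into percentages that sum to 100."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


def pie_slices(shares):
    """Return (taxon, start angle, end angle) in degrees, largest share first."""
    slices = []
    start = 0.0
    ordered = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    for taxon, percent in ordered:
        end = start + 3.6 * percent  # 1% of the community spans 3.6 degrees
        slices.append((taxon, start, end))
        start = end
    return slices


if __name__ == "__main__":
    # Hypothetical genus-level assignments of 100 classified reads.
    reads = ["Bacteroides"] * 50 + ["Prevotella"] * 30 + ["Faecalibacterium"] * 20
    shares = relative_abundance(reads)
    for taxon, start, end in pie_slices(shares):
        print(f"{taxon}: {shares[taxon]:.1f}% ({start:.0f} to {end:.0f} degrees)")
```

Because the shares are forced to sum to 100%, an increase in one taxon necessarily lowers the shares of the others, which is the practical meaning of the data being compositional.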
For a researcher with more advanced computer skills who prefers coding to a graphical interface, this and most of the other basic visualizations described in the text can be produced in a code-based statistical analysis environment, one of the most popular being the R programming language (R Core Team, 2014).

Advanced variations on the theme of a pie chart have been developed for metagenomic data. Scientists often want to explore the structure of a metagenome at a deeper level and to interact with it. For these purposes, there are approaches that visualize the relative abundance at all taxonomic ranks represented in a given sample. One such tool, popular in the life science research community, is Krona (Ondov et al., 2011). In this software, a metagenome is represented as nested concentric rings that together form a circle. Each ring corresponds to a single fixed taxonomic rank: the farther the ring is from the centre, the lower the rank. At each level, a taxon is shown as a part of the ring proportional to the abundance of the taxon in the sample. Thus, this visualization gives a multilevel view of the community structure. Krona is distinguished by its hierarchical interactivity: when a user clicks a sector or a segment, another pie chart is displayed that shows the embedded taxonomic hierarchy of this fragment. It thus becomes possible to examine each taxon in a metagenome in detail and to view the levels of its member taxa.

Sometimes it is necessary to display additional properties of metagenomes beyond the basic composition. Quite a few such layers of information arise when the metagenomic feature extraction includes de novo assembly, in which reads that appear to overlap are identified and joined to form longer sequences (contigs).
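Returning briefly to the nested-ring view: the data behind such a chart are simply the abundance table summed to each taxonomic rank in turn. The sketch below shows this roll-up under stated assumptions. It is not Krona's code or input format; the function name, the three-rank lineages and the abundance values are hypothetical.

```python
from collections import defaultdict


def ring_shares(lineage_abundance):
    """Sum abundances to every rank; ring 0 is the highest rank (innermost)."""
    total = sum(lineage_abundance.values())
    n_ranks = max(len(lineage) for lineage in lineage_abundance)
    rings = []
    for depth in range(1, n_ranks + 1):
        ring = defaultdict(float)
        for lineage, abundance in lineage_abundance.items():
            if len(lineage) >= depth:
                # A taxon is identified by its full path down to this rank.
                ring[lineage[:depth]] += abundance
        rings.append({path: 100.0 * value / total for path, value in ring.items()})
    return rings


if __name__ == "__main__":
    # Hypothetical abundances keyed by (phylum, family, genus).
    sample = {
        ("Bacteroidetes", "Bacteroidaceae", "Bacteroides"): 40.0,
        ("Bacteroidetes", "Prevotellaceae", "Prevotella"): 20.0,
        ("Firmicutes", "Ruminococcaceae", "Faecalibacterium"): 30.0,
        ("Firmicutes", "Lachnospiraceae", "Roseburia"): 10.0,
    }
    for depth, ring in enumerate(ring_shares(sample), start=1):
        print(depth, {path[-1]: round(share, 1) for path, share in ring.items()})
```

Each ring sums to 100%, and the arc of a parent taxon equals the combined arcs of its children in the next ring out, which is what lets a reader follow a lineage from the centre outwards.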
When a metagenome is transformed into a set of contigs, each contig is assigned various characteristics: GC-content (the percentage of guanine plus cytosine bases in the contig sequence), length, the number of ORFs (open reading frames) in the contig, taxonomic annotation, etc. One of the tools that visualize such a representation comprehensively is Anvi’o (Eren


et al., 2015). It draws a ring for the sample divided into contigs and represents each property as a bar with the value for each contig. Anvi’o is a flexible tool that is also applicable to comparing several metagenomes, so it is mentioned again in the next section.

Another well-known representation of a data distribution is a bar chart, which draws a rectangular coloured bar for each group of data. The length (or height) of each bar is proportional to the value of the corresponding group. For a single metagenome, a bar chart can be used to represent the abundance of taxa (or microbial genes). There is a bar for each taxon of the fixed taxonomic rank that is present in the sample, and the height of the bar shows the proportion of the taxon normalized by the total abundance of all taxa. Hence, the heights of all bars add up to a constant total. Such a representation can be generated, for instance, with the AmphoraVizu tool (Kerepesi et al., 2014). R packages such as ‘gplots’ (Warnes, 2016) and ‘metricsgraphics’ (Rudis et al., 2015) provide functions for constructing bar plots.

Considering the dimensionality of the features to visualize, even a single metagenome can yield tens of thousands of primitive values. An example is metagenomic single nucleotide polymorphisms (metagenomic SNPs), which can be detected in large numbers in each of the most prevalent genomes in the metagenome (Luo et al., 2015). For such cases, an approach called the Manhattan plot is especially useful. Genomic coordinates (for example, taxa) are displayed along the x-axis, while the negative logarithm of each SNP’s P-value is displayed along the y-axis. This approach is used in the Explicet tool (Robertson et al., 2013), which provides wide metagenomic analysis and visualization options.

When a metagenome is represented in the form of contigs (as a result of de novo assembly), the contigs can be grouped into bins based on the similarity of their characteristics.
This process is called binning, and one of the convenient methods for visualizing its results is a bubble chart. The chart consists of circles and can represent up to four dimensions of data through the x- and y-coordinates, the circle size and the colour. Every contig is placed on a grid where the two coordinates are chosen from the following three values: average fold coverage (a measure of contig abundance), GC-content and contig length; the circle size denotes the remaining third value. The contigs included in each bin are shown in the bin’s own colour. The bubble chart gives visual clues for discovering multiple microbial species (especially phylogenetically distant taxa) and detecting mobile genetic elements. The method is implemented, for example, in the ‘gbtools’ R package (Seah and Gruber-Vodicka, 2015). There is also an elegant tool called Elviz (Cantor et al., 2015) that allows us to construct interactive versions of such illustrations. It provides means for isolating and examining a specific group of contigs or for searching biological databases for any part of a contig sequence.

One of the basic characteristics of a single community structure is richness (the number of distinct species observed in the metagenome); combined with the data on the relative abundance of the individual species, it forms a diversity index, the so-called alpha-diversity. Obviously, the more reads are sequenced (and then classified), the higher the richness; as the number of reads increases, the diversity usually converges to a limiting value. With this in mind, and given a fixed number of reads per metagenome, a common procedure is to perform random rarefactions: randomly sampling a fixed number of reads from the metagenome and assessing the alpha-diversity for each sampling. Such data can be illustrated as a rarefaction curve that shows to what extent the richness increases

Visualization of Metagenomic Data | 45

when the read number is increased artificially. One of the tools that can plot alpha-diversity rarefaction curves for 16S rRNA datasets is QIIME (Kuczynski et al., 2012).

Much like cutting a pie, a single metagenome, being intrinsically compositional, can be divided into portions in multiple ways. For instance, 100% of the metagenome can be divided into the relative abundances of gene groups, or into the relative abundances of microbial phyla. One method for visualizing multiple divisions of a single metagenome at the same time is the parallel coordinate plot: every parallel line on this plot corresponds to a new division of the dataset into groups. For example, in the case of gene composition, each curve running from top to bottom is a gene that belongs to one of the groups, shown as dots, at every horizontal level. The highest and the lowest levels represent the taxonomic assignment of the genes into gene families, whereas the middle levels cluster the data according to the confidence value and the phylum. In the case of taxonomic composition, each level represents the taxonomic division into groups at a fixed taxonomic rank. Parallel coordinate plots can be drawn with the Juxter tool (Havre et al., 2005), which visualizes the clusters of metagenomic data using multiple colours.

As ‘shotgun’ metagenomics allows the composition of the microbial community to be assessed not only from the taxonomic but also from the functional perspective, a researcher also needs an appropriate visual representation for such a gene-centric profile. Genes and their groups are organized into metabolic pathways that can be illustrated as a pathway map, a convenient representation of functional data. The maps usually consist of nodes, denoting the genes encoding enzymes that are detected in the metagenome, and edges, linking the genes involved in consecutive biochemical reactions.
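A pathway map of this kind can be treated as a small graph. The sketch below is illustrative only: the gene identifiers and reaction links are hypothetical toy data, not taken from any pathway database, and the function name `highlight_pathway` is an assumption introduced here. It marks which nodes and edges of a reference map are covered by the genes detected in a metagenome.

```python
def highlight_pathway(reference_edges, detected_genes):
    """Return the nodes and edges of a reference map covered by detected genes.

    reference_edges: pairs (gene_a, gene_b) linking genes whose enzymes
        catalyse consecutive reactions in the reference pathway map.
    detected_genes: identifiers of the genes found in the metagenome.
    """
    edges = list(reference_edges)
    detected = set(detected_genes)

    # Collect every gene that appears anywhere on the reference map.
    nodes = set()
    for a, b in edges:
        nodes.update((a, b))

    # Detected genes that are absent from the map are simply ignored.
    highlighted_nodes = sorted(nodes & detected)
    # An edge is highlighted only when both of its genes were detected.
    highlighted_edges = [(a, b) for a, b in edges
                         if a in detected and b in detected]
    return highlighted_nodes, highlighted_edges


# Hypothetical toy map: a short chain with one branch.
toy_edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneB", "geneD")]
toy_detected = {"geneA", "geneB", "geneD", "geneX"}

nodes, edges = highlight_pathway(toy_edges, toy_detected)
```

In this toy case the branch through `geneC` stays unhighlighted because that gene was not detected, which is the kind of gap a pathway map makes visible at a glance.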
The tool iPath (Yamada et al., 2011) allows us to explore the metabolic, regulatory and biosynthetic pathway maps of a metagenome. Each biochemical process encoded in the metagenome is highlighted on the map and accompanied by relevant information from public databases on metabolism and biochemical reactions. Fig. 3.1 shows a visualization of the folate biosynthesis pathway made with the iPath tool.

Visualizing several metagenomes

Researchers in metagenomics often want not only to explore the structure of one metagenome, but also to compare it across multiple metagenomes. The methods described below were developed to represent clearly the differences between the community structures and functional compositions of samples.

The pie chart concept was introduced previously. When applied to several metagenomes, pie charts can be used in two ways. The first approach shows the taxonomic abundance of each sample at different taxonomic ranks. This method was mentioned for single-metagenome analysis: each ring of the pie chart denotes one taxonomic rank (e.g. phylum or family). Here the same idea is applied to several metagenomes. All samples are placed in the centre of the circle, with the size of each sample’s share proportional to its total abundance. Then, for each sample, the taxonomic abundance sectors are displayed as in the case of a single metagenome. This makes it possible to compare the shares that belong to the same taxonomic rank in different samples. Useful tools here include Krona and Taxonomer (Flygare et al., 2016); the latter depicts the taxonomic abundance of metagenomic data as ring charts. Although less functional than Krona, it allows the user to

Figure 3.1 Highlighting the pathway of folate biosynthesis (an important function of gut microbiota) within the global metabolic network using the iPath tool, http://pathways.embl.de/iPath2.cgi.


discard the low-abundance noisy taxa identifications before display. A general word of caution is that common pie charts should not be overused when comparing several metagenomes, because it is difficult for the eye to compare the angular sizes of more than 2–3 sectors.

Another way in which pie charts can be used is to compare metagenomes by features such as the average coverage or relative abundance of contigs in a given sample or across samples. With this approach, the information about every contig is shown as a bar chart, and the information about all contigs forms a circle. This method is implemented in Anvi’o (Eren et al., 2015). It is particularly clear for comparing the metadata about the samples.

Some of the commonly used methods for visualizing several metagenomes are based on the bar charts presented previously. Bar charts are suitable for displaying the taxonomic composition of samples. Each sample can be shown as a bar divided into the taxa detected in the sample according to their abundance. Every taxon has a unique colour, and bars can be shown for any taxonomic rank. This technique is used in Community Analyser (Kuntal et al., 2013) and Phinch (Bik, 2014). Both are publicly available services that can display the name, observational data and taxonomy for every sample, and both can show absolute or normalized numbers of observations.

Phinch is a versatile tool. In particular, if a factor from the metadata is quantitative (for example, the pH value), Phinch can display the taxon abundance summary (over all samples) for each category. A bar chart shows the taxon bars (of the selected rank), where every bar consists of sample bars that depict the abundance of the taxon in each sample. Each sample has its own unique colour. An example of using Phinch for taxon abundance representation is presented in Fig. 3.2. This approach is mentioned in Kuntal et al. (2013).
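The stacked bars described above rest on a simple normalization step followed by cumulative stacking. The following is a minimal sketch of that arithmetic in plain Python; the function names, the class names and the read counts are hypothetical and chosen only for illustration, and the sketch does not reproduce the internals of Phinch or Community Analyser.

```python
def relative_abundance(counts):
    """Convert per-sample read counts into fractions that sum to 1."""
    table = {}
    for sample, taxa in counts.items():
        total = sum(taxa.values())
        if total == 0:
            # An empty sample has no defined composition; report zeros.
            table[sample] = {taxon: 0.0 for taxon in taxa}
        else:
            table[sample] = {taxon: n / total for taxon, n in taxa.items()}
    return table


def stacked_segments(fractions, taxon_order):
    """Return (taxon, bottom, height) triples for one stacked bar.

    Using the same taxon_order for every sample keeps each taxon at a
    consistent position and colour across the bars.
    """
    segments = []
    bottom = 0.0
    for taxon in taxon_order:
        height = fractions.get(taxon, 0.0)
        segments.append((taxon, bottom, height))
        bottom += height
    return segments


# Hypothetical read counts at the class level for two samples.
counts = {
    "sample1": {"Bacteroidia": 60, "Clostridia": 30, "Bacilli": 10},
    "sample2": {"Bacteroidia": 5, "Clostridia": 15},
}
order = ["Bacteroidia", "Clostridia", "Bacilli"]

fractions = relative_abundance(counts)
bars = {s: stacked_segments(fractions[s], order) for s in fractions}
```

Each triple can be handed to any plotting library as the start and height of one coloured segment; a taxon absent from a sample simply yields a segment of zero height.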
When a researcher has more than one metagenome in the analysis, it is natural to ask how similar the metagenomes are in content and which metagenomes are closer to each other in their set of components, whether assessed from the taxonomic or the functional perspective. In microbial ecology, the corresponding measure of pairwise dissimilarity between microbiomes is called beta-diversity. Once computed for all the metagenomes in the study (and represented as a pairwise dissimilarity matrix), it is subsequently used for cluster analysis of the metagenomes. The resulting clusters are often represented using a tree diagram (or dendrogram) that shows how similar the datasets are at different hierarchical levels.

A static taxonomic tree including all taxa detected in the samples gives detailed information but is applicable only to a small number of samples at a time. For each node (taxon), a small bar chart near the taxon name displays the abundance distribution of this taxon across all samples. Each sample is drawn in its own specific colour. This was implemented in the MetaSee pipeline (Song et al., 2012).

Another way of using trees and dendrograms for the analysis of several metagenomes is a contig tree. It displays the hierarchical clustering of contigs based on their sequence composition and their distribution across samples. Anvi’o (Eren et al., 2015) includes such a dendrogram. The contigs are displayed with small bars as parts of a ring, and a circular clustering dendrogram is placed in the centre of this ring. Metagenomic data can also be represented as trees with the R package ‘phyloseq’ (McMurdie and Holmes, 2013).

Along with well-known visualization methods such as pie charts and bar charts, the box plot is another popular technique for representing numerical data that indicates their

48 | Sudarikov et al.

Figure 3.2 Bar chart showing taxonomic composition of microbial communities at the level of class. Constructed using PHINCH and the default test dataset from http://phinch.org as an input. The figure depicts the first 45 of 90 metagenomes.

variance. For instance, it can be used to show the distribution of the coverage of contigs in a bin. Box plots for several samples can be compared visually. They can be drawn with the R package ‘plotly’ (Ohri, 2014), as well as in the Anvi’o tool.

Box plots can be combined with scatter plots (dot plots) to complement the graph with additional information. If a box plot shows the relative abundance distribution of a taxon across all samples, then each dot represents the level of the taxon in a specific sample, which makes outliers easier to spot. The dot plot is overlaid on the box plot. In this way, the samples can be compared by their taxon abundances. Drawing dot plots combined with box plots is implemented in Framework (Eren et al., 2011). One advantage of this tool is that it uses different colours to denote samples divided into different groups (for example, on the basis of their functional properties). Dot plots can also be constructed using the JavaScript charting application programming interface (API) ‘dimple’ (Kiernander et al., 2014) or the R package ‘rCharts’ (Vaidyanathan, 2013).

It is worth mentioning an advanced variant of the box plot (the ‘violin plot’) that shows more details about the variable


distribution by adding a smoothed histogram (a density estimate) of the data, which is especially useful for data distributed in a non-normal way; the method is available in the R package ‘vioplot’ (Adler and Adler, 2014).

The standard way of representing the community structure inferred from metagenomic data is an abundance table, where the rows correspond to samples and the columns to features (microbial taxa); the values in the cells show the relative abundance of the respective taxa in the sample. However, a large table with hundreds of numbers is hard to grasp visually. A natural extension of the abundance table is the ‘heatmap’, a table in which each cell is filled with a colour, usually taken from a gradient whose end colours correspond to the lowest and the highest values. Another specific feature of the heatmap is the visualization of clustering: the rows are reordered so that the most similar rows are placed next to each other (and likewise the columns). In metagenomics, heatmaps usually combine the taxonomic abundances with the clustering of samples.

For a small number of samples, however, there is another use of heatmaps. For instance, with Anvi’o it is possible to draw a heatmap of variable nucleotide positions. Here each column is a sample and every row is a nucleotide base. While each of the four nucleotide bases is displayed in a different colour, the cells can also be coloured using a gradient according to the normalized ratio of the two bases occurring most frequently at the position.

The R package ‘d3heatmap’ (Cheng, 2016) is a multifunctional package with many options for microbiome analysis that allows many types of heatmaps to be constructed. The heatmaps are interactive and show information about any element of the heatmap table when the mouse hovers over it.

Layouts are visualization methods oriented towards the optimal location of data on the plane or in space.
A layout is usually a two- or three-dimensional plot that places dots representing the datasets according to a certain principle based on the mutual relations between the datasets. One type is the bipartite graph: a graph that consists of two groups of nodes in which the nodes within each group are not connected to one another. Each edge of this graph connects a vertex from one group to a vertex from the other group. Such graphs can be used to visualize data on several metagenomes. Metagenomic analysis involves many entities (microbial taxa of various ranks, metagenomic samples, relative abundance values, etc.), and it is useful to represent several types of entities in the same figure.

One implementation represents the metagenomes and the taxa together to reflect the community structure and the relations between them. This approach was used in Community Analyser, where each sample and each taxon is represented as a node of the graph. From each vertex depicting a sample, there is an edge to every taxon contained in that sample (usually above a certain threshold value). Moreover, taxa whose abundance levels are highly correlated are connected and set apart from taxa with which the correlation is low. Although this approach can display mutually exclusive relations between the taxa, its limitation is that it does not show the abundances of individual taxa.

One of the novel approaches to metagenomic visualization depicts taxonomic units as vertices (Sedlar et al., 2016). There are also larger vertices that represent groups of types or samples. Each taxon is connected to the groups in which it is one of the most prevalent taxa. The width of the edges is proportional to the abundance of the taxa in the sample. The taxa are connected only with the groups of samples, and the groups of samples are connected only with taxa, so the graph is bipartite. This approach allows highlighting the


taxa that are the most represented in samples. It is also an effective method for determining similarities and differences between groups of samples, based on commonalities and variations in the taxonomic composition of the groups.

Besides connectivity, the location of the vertices can also be used for visually exploring community structures. One such approach is the Spring Graph Layout (implemented in Community Analyser), which simulates a model in which vertices are treated as electrically charged particles and edges as forces of attraction and repulsion. When the system settles into equilibrium, the desired layout is achieved. Since the data are metagenomic, the vertices are samples (drawn in one colour) and taxa (drawn in another colour), while the edges connect taxa with all samples in which they occur.

A special case of the analysis of several metagenomes is the paired comparison, in which the samples are grouped into pairs. An example is the human gut microbiota of the same patient before and after antibiotic treatment. In such cases it is important to emphasize this pairing visually. For the display of individual taxa, there is a visual representation called a slopegraph, an approach introduced by Tufte (1986). The slopegraph shows the abundance level of a taxon in the two datasets (for example, before and after the experiment). Multiple slopegraphs help to understand the dynamics of the individual microbial members of the community. When it is necessary to compare the overall structure of the paired samples, a researcher can visualize the metagenomes using a dimension reduction plot such as a principal coordinates analysis (PCoA, described in the next section) plot and then connect the paired samples with arrows, using, for example, the R package ‘ggplot2’ (Wickham and Hadley, 2009).
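The data behind a slopegraph for one paired comparison are small: one line segment per taxon, joining its level before and after. As a minimal sketch, assuming the two samples are given as dictionaries of relative abundances, the segments can be prepared as follows. The function name, the genus names and all numbers are made up for illustration and do not come from any dataset discussed in this chapter.

```python
def slopegraph_segments(before, after):
    """Build one slopegraph segment per taxon for a pair of samples.

    A taxon missing from one of the samples is treated as having zero
    abundance there. Segments are ordered by the size of the change so
    that the most strongly shifting taxa can be labelled first.
    """
    segments = []
    for taxon in sorted(set(before) | set(after)):
        b = before.get(taxon, 0.0)
        a = after.get(taxon, 0.0)
        segments.append(
            {"taxon": taxon, "before": b, "after": a, "change": a - b}
        )
    segments.sort(key=lambda s: abs(s["change"]), reverse=True)
    return segments


# Made-up relative abundances for one patient before and after treatment.
before = {"Bacteroides": 0.40, "Prevotella": 0.35, "Escherichia": 0.05}
after = {"Bacteroides": 0.30, "Prevotella": 0.10,
         "Escherichia": 0.45, "Enterococcus": 0.15}

segments = slopegraph_segments(before, after)
ranked_taxa = [s["taxon"] for s in segments]
```

Each segment is then drawn as a straight line from the left axis (`before`) to the right axis (`after`); the sign of `change` can drive the line colour.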
Overall, a network (or graph) is a very descriptive form of metagenomic data representation because it can display the numerous interactions between the elements of the microbial system. Popular tools that work with graphs include Cytoscape (Smoot et al., 2011), which handles complex molecular interaction networks and provides for their analysis and visualization. Cytoscape has broad functionality, so it has recently also been used often for non-bioinformatic analyses. Many functions for constructing, arranging and drawing graphs are also available in R, for example, in the ‘igraph’ package (Csárdi and Nepusz, 2006).

Visualizing multiple metagenomes

The most challenging task is to visualize data calculated from a large number of metagenomes. Most methods used for single and few metagenomes are not applicable here because they would require a vast amount of space and overwhelm the visual perception of the researcher.

Bubble charts are useful for displaying the total distribution of each taxon across the samples. In this case, each bubble represents a taxon and is filled with a specific colour; its size is proportional to the total level of this taxon in the examined samples. This concept is implemented in the Phinch tool, which also provides the user with information on any taxon of interest and allows the plot to be arranged at any taxonomic rank.

A very common approach to visualization in many areas of applied science is the Sankey diagram. In these diagrams, the width of the arrows is proportional to the quantities that they represent. More specifically, it can be interpreted as a branching representation of the data in which the arrow width depends on the quantity of the grouped data.


It is possible to construct a branching diagram of relative abundance that groups metagenomic data into taxa at every taxonomic rank. Each rank is represented as a column of bars, and the width of each bar corresponds to the number of reads assigned to the taxon. Such a Sankey plot can be constructed using Phinch, which additionally allows an arbitrary number of taxonomic ranks to be used. Fig. 3.3 depicts a simple Sankey diagram constructed with Phinch and used for the taxonomic and quantitative representation of metagenomes. However, this approach generally tends to produce large maps that are too complex for a clear view of the taxa abundances and metagenomic taxonomy.

A dendrogram can be used to analyse multiple metagenomes efficiently; the researcher just has to be careful with the choice of text labels in the case of large trees, where the labels can be replaced with a colour legend. One frequent use of a phylogenetic tree is as a dendrogram of clustered microbial taxa. All taxa are clustered according to their co-occurrence across the set of samples and can be displayed as a simple or a circular dendrogram. The former is used in Framework (Eren et al., 2011) in combination with a heatmap. The latter is available in MetaSee and GraPhlAn (Asnicar et al., 2015). GraPhlAn has many additional options, such as drawing a bar chart for each taxon to represent its abundance, comparing the abundances for each group of data by drawing every group as a circle, and marking taxa of special interest with dedicated colours. iTOL, the interactive tree of life (Letunic and Bork, 2016), is an original tool for drawing simple as well as circular dendrograms with bar charts of taxon abundance and coloured nodes. MEGAN (Huson et al., 2016), one of the most popular tools for the analysis of metagenomic data, can visually represent the taxonomic abundance of a given dataset using different approaches.
However, a more common approach is to cluster and visualize a given set of samples based on the set of microbial taxa detected in each of them. Here the samples are represented as leaves. There are many tools for such tree visualization, including PanPhlAn (Scholz et al., 2016) and Framework (Eren et al., 2011). These tools construct typical dendrograms of samples placed alongside heatmaps. Framework can also accompany each of the leaves with a pie chart representing the taxonomic composition of the respective sample.

A heatmap is one of the most popular ways of visualizing quantitative compositional data with information about many objects. In metagenomics, this is often the taxon abundance in each metagenome. Although the approach is convenient for displaying a few samples, certain limitations appear when the number of metagenomes exceeds 20 to 30 or the number of features is high. For instance, the row-side labels and the cell colours can become indistinguishable. These problems can be solved by discarding the low-abundance species or pooling the samples into subgroups. An alternative form of heatmap is based on binarized values and can be used to display the presence and absence of features, for instance, the gene-family profiles of strains during pan-genome analysis, as demonstrated in PanPhlAn.

A table (or matrix) representation of the data can be used for more than heatmaps. For example, a correlation table reflects the pairwise correlations between multiple variables. A correlation table filled with a colour gradient corresponding to the correlation values shows clearly which variables are most correlated. In metagenomics, these variables can be taxa, and high correlations between them could hint at potential mutualism or symbiosis (inferred from the co-occurrence of the species).
This method is applied and described in Community Analyser or, for instance, in MetaFast (Ulyantsev et al., 2016).
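A correlation table of this kind can be computed directly from the abundance profiles of the taxa across samples. The sketch below assumes the Pearson coefficient, which is one possible choice and not necessarily the measure used by the tools cited above; the function names and the toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long abundance profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant profile; returning 0.0
        # is a convention adopted in this sketch only.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_table(profiles):
    """Pairwise correlations between taxa as a nested dictionary."""
    taxa = sorted(profiles)
    return {a: {b: pearson(profiles[a], profiles[b]) for b in taxa}
            for a in taxa}


# Hypothetical abundance of three taxa across four metagenomes.
profiles = {
    "taxonA": [1, 2, 3, 4],
    "taxonB": [2, 4, 6, 8],
    "taxonC": [4, 3, 2, 1],
}
table = correlation_table(profiles)
```

Mapping each cell of `table` to a colour gradient from -1 to 1 gives the coloured correlation table; in this toy input `taxonA` and `taxonB` rise together, while `taxonC` falls as they rise.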


Figure 3.3 Sankey diagram displaying the composition of microbiota at the levels of kingdom and included phyla. Constructed using PHINCH and the default test dataset from http://phinch.org as an input.


As mentioned before, the analysis of ‘shotgun’ metagenomes produces not only information about the taxonomic and gene composition of the microbiota, but also data on the genomic variability of environmental microbes. Commonly produced in the form of metagenomic SNPs, these data require specialized visualization methods. A novel method for displaying this data layer when the number of metagenomes is high was proposed by Alexeev et al. (2015) and applied to visualize the SNPs for a large set of human gut metagenomes. For each selected microbial species, a circular chart is drawn (a ‘bacterial rose’), where each ray shows the presence of the SNPs in an individual metagenome. A typical ‘bacterial rose’ is shown in Fig. 3.4.

Figure 3.4 ‘Bacterial rose garden’ visualization applied to display the genomic sub-species level diversity of a major gut species Prevotella copri in human populations of the world (shown at the level of all geographic regions).


When the number of metagenomes in the analysis reaches tens or hundreds, economy of space becomes an urgent requirement for a visualization technique. One of the most effective ways of visualizing multidimensional data is based on dimension reduction, including classical scatterplot layouts such as principal component analysis (PCA) plots (Vidal et al., 2016). Each metagenome, described by hundreds of features (the relative abundances of individual species), is subjected to dimension reduction and ultimately shown as a dot on a scatter plot of two (or three, in the case of 3D visualization) principal components. The underlying statistical algorithm ensures that the first principal component corresponds to the direction of the highest variance in the cloud of analysed metagenomes, while the second component is orthogonal to the first and corresponds to the direction of the next-highest variance. PCA is a very common method because it makes it possible to evaluate quickly the overall distribution of the metagenomes by composition, to identify samples with similar composition and to detect ‘outliers’.

In metagenomics, the variations usually used instead of PCA are principal coordinate analysis (PCoA) and multidimensional scaling (MDS), because the relative abundance values of taxa are distributed in a non-normal way and alternative measures of pairwise dissimilarity between samples are used, such as UniFrac and the Bray–Curtis measure. A good example of the application of PCoA to microbiota datasets is the study of adult human microbiota sampled from 18 body sites, including oral, vaginal, gut and skin sites, in the Human Microbiome Project (HMP). The samples on a PCoA plot can also be coloured by country of origin to highlight the country-specific features of microbiota in the populations of the world (Tyakht et al., 2013).
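The input to PCoA is a matrix of pairwise dissimilarities between samples. As a minimal sketch, the Bray–Curtis measure mentioned above can be computed as the sum of absolute abundance differences divided by the sum of all abundances in the two samples. The function names and the three toy samples below are hypothetical; the ordination step itself is left to dedicated software.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance profiles.

    u and v map taxon names to abundances; a taxon missing from a
    profile counts as zero. The result lies between 0 (identical
    composition) and 1 (no shared taxa) for non-negative abundances.
    """
    taxa = set(u) | set(v)
    numerator = sum(abs(u.get(t, 0.0) - v.get(t, 0.0)) for t in taxa)
    denominator = sum(u.get(t, 0.0) + v.get(t, 0.0) for t in taxa)
    # Two empty profiles are treated as identical in this sketch.
    return numerator / denominator if denominator else 0.0


def dissimilarity_matrix(samples):
    """Return sample names and the square matrix of pairwise distances."""
    names = sorted(samples)
    matrix = [[bray_curtis(samples[a], samples[b]) for b in names]
              for a in names]
    return names, matrix


# Toy relative abundances for three samples.
samples = {
    "s1": {"t1": 0.5, "t2": 0.5},
    "s2": {"t1": 0.5, "t3": 0.5},
    "s3": {"t3": 1.0},
}
names, matrix = dissimilarity_matrix(samples)
```

Here `s1` and `s3` share no taxa and are maximally dissimilar, while `s2` sits halfway between them, so an ordination of this matrix would place `s2` between the other two samples.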
The PCoA approach can also be used to track temporal dynamics, for example, of an infant gut metagenome with respect to adult samples: this visualization was performed using the EMPeror tool (Vázquez-Baeza et al., 2013). A variation of PCoA that includes an instrumental variable is called between-class analysis (BCA). It was used to visualize the enterotypes (the distinct types of human gut microbiota composition) in the original paper by Arumugam et al. (2011). Overall, PCoA and its variants are indispensable tools for the exploratory analysis of metagenomic data.

Sometimes the application of machine learning methods, including neural networks, to metagenomics is especially fruitful. One such approach is the self-organizing map (SOM). A SOM is an unsupervised neural network algorithm that represents multidimensional data in a two-dimensional space in a clustered way. This concept has an effective implementation, the emergent SOM (ESOM), which is simply a large map that preserves the topology of the SOM projection. On these maps, the colour of every cell represents the quantity of a selected feature. The ESOM approach is widely used in metagenomic projects for binning the data. For example, ESOM clustering has been used to classify the metagenomic sequence structures for selected metagenomes (Laczny et al., 2014).

As the microbiota contains many species that are in cooperative or competitive relations with each other, co-occurrence networks are a particularly relevant visualization method. Generally, these networks show the relationships between objects (organisms, social groups, words in texts) that reflect their presence in the same environment. Every object is depicted as a node. If two objects tend to co-occur (for example, microbial species across multiple metagenomes), an edge is drawn to connect them. The resulting network is called a co-occurrence graph (or network). With this approach, the


size of the nodes and the width of the links can vary according to the object abundance and the co-occurrence frequency, respectively. In metagenomics, this method is usually applied to the co-occurrence of bacterial species. Every taxon is drawn as a node, while a link between nodes denotes their co-occurrence (measured as the correlation between their respective levels across metagenomes). This approach was used for microbial network construction in which the vertices of the graph were individual genera, their size reflected the relative abundance of the genera and the colours distinguished network modules (Liu et al., 2015). Another implementation, with a human microbiota example, is given in the article on large-scale microbial network organization (Freilich et al., 2010). The MEGAN tool for in-depth analysis and visualization of metagenomic data also includes functionality for constructing co-occurrence plots. In particular, it allows the co-occurrence threshold and the lowest abundance threshold to be changed dynamically; similar functionality is provided by CoNet (Faust et al., 2012).

Metagenomic datasets are growing not only in number and volume but also in their associated metadata: the samples are accompanied by a description containing the type of environment, the date of collection and other details, including geographic coordinates. The geographic data lead to the challenging task of visualizing the data using a combination of metagenomic and geovisualization approaches. One recent metagenomic visualization tool that implements such a hybrid is ResistoMap (Yarygin et al., 2016), an interactive Web-based application showing the level of potential resistance to antibiotics (the resistome) in the human gut microbiome. This tool allows visual exploration of the resistome levels in more than 1600 gut metagenomes from populations of the world, for most known antimicrobial drug types, as an interactive heatmap.
The navigation and summary resistome information are implemented as a geographic map of the world, where the countries are filled with colour according to the median resistome levels of their populations. Thanks to the interactivity of the application, a researcher can quickly switch between the two visual forms. The ResistoMap interface is shown in Fig. 3.5. Such tools demonstrate that efficiently displaying metagenomic data together with the external factors describing the metagenomes can increase the value of the accumulated data and help to gain insights into the complex interactions between the factors.

Conclusions

Recent discoveries in molecular microbial ecology using metagenomics have revolutionized our understanding of the structure and functional potential of complex bacterial communities. Most of these insights would not have been possible without intense and in-depth data analysis, an important part of which is the visualization of metagenomic data. A bioinformatician approaching a novel metagenomic dataset should be skilful in applying the basic methods described in the article, as well as the advanced novel toolkits that continue to appear. Additional understanding can come from adopting known visualization methods not previously applied in metagenomics, and interactive tools are particularly valuable for mining such multilayered and complex data.


Figure 3.5 The genomic potential of gut microbes for antibiotic resistance (resistome) in world populations is shown on a heatmap combined with a geographic map using ResistoMap, http://resistomap.datalaboratory.ru. Here the resistance to fluoroquinolone is selected.

Acknowledgements

This work was financially supported by the Ministry of Education and Science of Russian Federation (RFMEFI57514X0075).

References

Adler, D., and Adler, M. (2014). Package ‘vioplot’, CRAN repository. https://CRAN.R-project.org/package=vioplot
Alexeev, D., Bibikova, T., Kovarsky, B., Melnikov, D., Tyakht, A., and Govorun, V. (2015). Bacterial rose garden for metagenomic SNP-based phylogeny visualization. BioData. Min. 8, 10. http://dx.doi.org/10.1186/s13040-015-0045-5
Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D.R., Fernandes, G.R., Tap, J., Bruls, T., Batto, J.M., et al. (2011). Enterotypes of the human gut microbiome. Nature 473, 174–180. http://dx.doi.org/10.1038/nature09944
Asnicar, F., Weingart, G., Tickle, T.L., Huttenhower, C., and Segata, N. (2015). Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ 3, e1029. http://dx.doi.org/10.7717/peerj.1029
Bik, H.M. (2014). Phinch: An interactive, exploratory data visualization framework for -Omic datasets. bioRxiv 009944. http://dx.doi.org/10.1101/009944
Cantor, M., Nordberg, H., Smirnova, T., Hess, M., Tringe, S., and Dubchak, I. (2015). Elviz - exploration of metagenome assemblies with an interactive visualization tool. BMC Bioinf. 16, 130. http://dx.doi.org/10.1186/s12859-015-0566-4


Cheng, J. (2016). Package ‘d3heatmap’, CRAN repository. https://CRAN.R-project.org/package=d3heatmap
Csárdi, G., and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal Complex Syst. 1695, 1695.
Eren, A.M., Ferris, M.J., and Taylor, C.M. (2011). A framework for analysis of metagenomic sequencing data. Pac. Symp. Biocomput., 131–141.
Eren, A.M., Esen, Ö.C., Quince, C., Vineis, J.H., Morrison, H.G., Sogin, M.L., and Delmont, T.O. (2015). Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ 3, e1319. http://dx.doi.org/10.7717/peerj.1319
Faust, K., Sathirapongsasuti, J.F., Izard, J., Segata, N., Gevers, D., Raes, J., and Huttenhower, C. (2012). Microbial co-occurrence relationships in the human microbiome. PLOS Comput. Biol. 8, e1002606. http://dx.doi.org/10.1371/journal.pcbi.1002606
Flygare, S., Simmon, K., Miller, C., Qiao, Y., Kennedy, B., Di Sera, T., Graf, E.H., Tardif, K.D., Kapusta, A., Rynearson, S., et al. (2016). Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling. Genome Biol. 17, 111. http://dx.doi.org/10.1186/s13059-016-0969-1
Freilich, S., Kreimer, A., Meilijson, I., Gophna, U., Sharan, R., and Ruppin, E. (2010). The large-scale organization of the bacterial network of ecological co-occurrence interactions. Nucleic Acids Res. 38, 3857–3868. http://dx.doi.org/10.1093/nar/gkq118
Havre, S.L., Webb-Robertson, B.J., Shah, A., Posse, C., Gopalan, B., and Brockman, F.J. (2005). Bioinformatic insights from metagenomics through visualization. Proc. IEEE. Comput. Syst. Bioinform. Conf. 341–350. http://dx.doi.org/10.1109/CSB.2005.19
Huson, D.H., Beier, S., Flade, I., Górska, A., El-Hadidi, M., Mitra, S., Ruscheweyh, H.J., and Tappu, R. (2016). MEGAN Community Edition - Interactive Exploration and Analysis of Large-Scale Microbiome Sequencing Data. PLOS Comput. Biol. 12, e1004957. http://dx.doi.org/10.1371/journal.pcbi.1004957
Kerepesi, C., Bánky, D., and Grolmusz, V. (2014). AmphoraNet: the webserver implementation of the AMPHORA2 metagenomic workflow suite. Gene 533, 538–540. http://dx.doi.org/10.1016/j.gene.2013.10.015
Kiernander, J. et al. (2014). Dimple, A simple charting API for d3 data visualisations. http://dimplejs.org/
Kuczynski, J., Stombaugh, J., Walters, W., González, A., Caporaso, J.G., and Knight, R. (2012). Using QIIME to analyze 16S rRNA gene sequences from microbial communities. Curr. Protoc. Microbiol. EmergingT, 1–20. http://dx.doi.org/10.1002/9780471729259.mc01e05s27
Kuntal, B.K., Ghosh, T.S., and Mande, S.S. (2013). Community-analyzer: a platform for visualizing and comparing microbial community structure across microbiomes. Genomics 102, 409–418. http://dx.doi.org/10.1016/j.ygeno.2013.08.004
Laczny, C.C., Pinel, N., Vlassis, N., and Wilmes, P. (2014). Alignment-free visualization of metagenomic data by nonlinear dimension reduction. Sci. Rep. 4, 4516. http://dx.doi.org/10.1038/srep04516
Letunic, I., and Bork, P. (2016). Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 44, W242–5. http://dx.doi.org/10.1093/nar/gkw290
Liu, Z., Lin, S., and Piantadosi, S. (2015). Network construction and structure detection with metagenomic count data. BioData. Min. 8, 40. http://dx.doi.org/10.1186/s13040-015-0072-2
Luo, C., Knight, R., Siljander, H., Knip, M., Xavier, R.J., and Gevers, D. (2015). ConStrains identifies microbial strains in metagenomic datasets. Nat. Biotechnol. 33, 1045–1052. http://dx.doi.org/10.1038/nbt.3319
McMurdie, P.J., and Holmes, S. (2013). phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLOS ONE 8, e61217. http://dx.doi.org/10.1371/journal.pone.0061217
Ohri, A. (2014). R with Cloud APIs. R Cloud Comput. Springer, New York. http://dx.doi.org/10.1007/978-1-4939-1702-0_7
Ondov, B.D., Bergman, N.H., and Phillippy, A.M. (2011). Interactive metagenomic visualization in a Web browser. BMC Bioinf. 12, 385. http://dx.doi.org/10.1186/1471-2105-12-385
R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
Robertson, C.E., Harris, J.K., Wagner, B.D., Granger, D., Browne, K., Tatem, B., Feazel, L.M., Park, K., Pace, N.R., and Frank, D.N. (2013). Explicet: graphical user interface software for metadata-driven

58 | Sudarikov et al.

management, analysis and visualization of microbiome data. Bioinformatics 29, 3100–3101. http:// dx.doi.org/10.1093/bioinformatics/btt526 Rudis, B., Almossawi, A., Ulmer, H., and jQuery Foundation and contributors (2015). Package ‘metricsgraphics’. CRAN repository. https://CRAN.R-project.org/package=metricsgraphics Scholz, M., Ward, D.V., Pasolli, E., Tolio, T., Zolfo, M., Asnicar, F., Truong, D.T., Tett, A., Morrow, A.L., and Segata, N. (2016). Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat. Methods 13, 435–438. http://dx.doi.org/10.1038/nmeth.3802 Seah, B.K.B., and Gruber-Vodicka, H.R. (2015). gbtools: Interactive Visualization of Metagenome Bins in R. Front. Microbiol. 6, 1451. http://dx.doi.org/10.3389/fmicb.2015.01451 Sedlar, K., Videnska, P., Skutkova, H., Rychlik, I., and Provaznik, I. (2016). Bipartite Graphs for Visualization Analysis of Microbiome Data. Evol. Bioinform. Online.12, 17–23. http://dx.doi.org/10.4137/EBO. S38546 Shneiderman, B. (1996). The eyes have it: a task by data type taxonomy for information visualizations. Proc. IEEE Symp. Vis. Lang. 336–343. http://dx.doi.org/10.1109/VL.1996.545307 Smoot, M.E., Ono, K., Ruscheinski, J., Wang, P.L., and Ideker, T. (2011). Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27, 431–432. http://dx.doi.org/10.1093/ bioinformatics/btq675 Song, B., Su, X., Xu, J., and Ning, K. (2012). MetaSee: an interactive and extendable visualization toolbox for metagenomic sample analysis and comparison. PLOS ONE7, e48998. http://dx.doi.org/10.1371/ journal.pone.0048998 Steele, J., and Iliinsky, N. (2010). Beautiful visualization: looking at data through the eyes of experts. O’Reilly Media, Inc. Sebastopol, Canada. Tufte, E. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut. 
Tyakht, A.V., Kostryukova, E.S., Popenko, A.S., Belenikin, M.S., Pavlenko, A.V., Larin, A.K., Karpova, I.Y., Selezneva, O.V., Semashko, T.A., Ospanova, E.A., et al. (2013). Human gut microbiota community structures in urban and rural populations in Russia. Nat. Commun. 4, 2469. http://dx.doi.org/10.1038/ ncomms3469 Ulyantsev, V.I., Kazakov, S.V., Dubinkina, V.B., Tyakht, A.V., and Alexeev, D.G. (2016). MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data. Bioinformatics 32, 2760–2767. http://dx.doi.org/10.1093/bioinformatics/btw312 Vaidyanathan, R. (2013). rCharts: Interactive charts using javascript visualization libraries. R package version 0.4 5. Vázquez-Baeza, Y., Pirrung, M., Gonzalez, A., and Knight, R. (2013). EMPeror: a tool for visualizing high-throughput microbial community data. GigaScience 2, 16. http://dx.doi.org/10.1186/2047217X-2-16 Vidal, R., Ma, Y., and Sastry, S. (2016). Generalized Principal Component Analysis. Springer Publishing Company. http://dx.doi.org/10.1093/0.1109/TPAMI.2005.244 Warnes, M., Bolker, B., Bonebakker, L., Gentleman, R., Liaw, W.H., Lumley, T., Maechler, M., Magnusson, A., Moeller, S., Schwartz, M., et al.(2016). Package ‘gplots’, CRAN repository. https://CRAN.R-project. org/package=gplots Hadley, W. (2009). ggplot2: Elegant graphics for data analysis. Springer Science & Business Media. http:// dx.doi.org/10.1007/978-0-387-98141-3 Yamada, T., Letunic, I., Okuda, S., Kanehisa, M., and Bork, P. (2011). iPath2.0: interactive pathway explorer. Nucleic Acids Res. 39, W412–5. http://dx.doi.org/10.1093/nar/gkr313 Yarygin, K., Kovarsky, B., Bibikova, T., Melnikov, D., Tyakht, A., and Alexeev, D. (2016). ResistoMap - online visualization of human gut microbiota antibiotic resistome (Cold Spring Harbor Labs Journals, Preprint).

4
Comparing Viral Metagenomic Extraction Methods

Jeanette Klenner1,2, Claudia Kohl1, Piotr W. Dabrowski1,3 and Andreas Nitsche1*

1Centre for Biological Threats and Special Pathogens 1, Robert Koch Institute, Berlin, Germany.
2Bundeswehr Research Institute for Protective Technologies and NBC Protection, Munster, Germany.
3Research Group Bioinformatics (NG4), Robert Koch Institute, Berlin, Germany.

*Correspondence: [email protected] https://doi.org/10.21775/9781910190593.04

Abstract
A crucial step in the molecular detection of viruses in clinical specimens is the efficient extraction of viral nucleic acids. The total yield of viral nucleic acid from a clinical specimen depends on the specimen’s volume, the initial virus concentration and the efficiency of the extraction method. Recent next generation sequencing (NGS)-based diagnostic approaches (i.e. metagenomics) provide a molecular ‘open view’ into the sample, as they theoretically generate sequence reads of any nucleic acid present in a specimen in a statistically representative manner. However, since a higher virus-related read output promises better sensitivity in the subsequent bioinformatic analysis, the extraction method selected determines the reliability of diagnostic NGS. In this study nine commercially available kits for nucleic acid extraction were compared by real-time PCR regarding the simultaneous isolation of DNA and RNA, and four of them were selected for subsequent comparison by NGS (QIAamp Viral RNA Mini Kit, QIAamp DNA Blood Mini Kit, QIAamp cador Pathogen Mini Kit and QIAamp MinElute Virus Spin Kit). The nucleic acid yields and the sequence read output were compared for four different model viruses – reovirus, orthomyxovirus, orthopoxvirus and paramyxovirus – each at defined but varying concentrations in the same sample. The total nucleic acid was processed for sequencing of the RNA (as cDNA) and the DNA, with quantification by Qubit and virus-specific quantitative real-time PCRs. NGS libraries were prepared for sequencing on the Illumina HiSeq 1500 system. Finally, the percentage of reads assignable to each virus was determined via mapping. Evaluation of different commercial nucleic acid extraction kits with four different viruses indicates little variation in the read numbers obtained by NGS for transcribed RNA or DNA. 
Since NGS is increasingly being used as a tool in diagnostics of infectious diseases, the individual steps of the complete process have to be validated carefully. Here we could show


that for virus identification in liquid clinical specimens, any nucleic acid extraction kit that performs well for PCR diagnostics can be used for NGS diagnostics as well and that the selection of the kit has only a minor impact on the yield of viral reads.

Introduction
The polymerase chain reaction (PCR) is currently the gold standard in nucleic acid-based diagnostics (Pabinger et al., 2014; Schmittgen et al., 2008). The multitude of rapid, highly specific and sensitive diagnostic PCR assays currently available for this well-understood and affordable method makes it invaluable in the clinical context (Perandin et al., 2004; Templeton et al., 2004). However, the high specificity comes at a price: since PCR primers target a specific region of an organism’s genome, usually only the targeted pathogen is detected (Klein, 2002; Yamamoto, 2002). A newer method of nucleic acid-based analysis, next generation sequencing (NGS), has risen rapidly in popularity, with several sequencing platforms developed independently of each other within less than a decade (Ahmadian and Svahn, 2011; Liu et al., 2012; Metzker, 2010). This massively parallel sequencing approach provides a molecular ‘open view’ into the specimen, which allows non-targeted pathogen detection (Hazelton and Gelderblom, 2003) and thus overcomes the limitations set by PCR. The application of NGS technologies (i.e. metagenomics) in the field of infectious diseases ranges from fundamental to applied research (Didelot et al., 2012; Lecuit and Eloit, 2014; Mardis, 2008; Quail et al., 2012; Renkema et al., 2014). NGS is also increasingly used for diagnostics of bacterial and viral pathogens in different clinical specimen matrices, e.g. faeces, tissue, plasma, blood, urine, cerebrospinal fluid and diagnostic cell culture (Batty et al., 2013; Cheval et al., 2011; Kohl et al., 2015; Law et al., 2013; Nakamura et al., 2008; Nassirpour et al., 2014). 
It is generally assumed that NGS, with its extremely rapid technical advancement and continuously decreasing costs, will become a favoured tool in clinical routine diagnostics (Desai and Jere, 2012; Hayden, 2014; Peng et al., 2013; Voelkerding et al., 2009). The process of generating sequencing libraries and the technique of sequencing itself are well established. However, the impact of nucleic acid extraction from clinical samples on how well viral pathogens can be detected using metagenomics has not yet been comprehensively examined. This is nevertheless a critical question when considering the adoption of metagenomics as a standard method for clinical diagnostics. As is generally accepted for clinical diagnostics, NGS-based diagnostics requires purification methods that are quick enough to allow immediate sample processing, simple enough to require no specialized equipment and, in the best case, suitable for automation. Due to the open view approach it is of particular importance to isolate DNA and RNA simultaneously. To meet biosafety demands, reliable inactivation plays a critical role when dealing with clinical specimens (Boom et al., 1990). In some kits the lysis buffer contains chaotropic salts, i.e. guanidine hydrochloride/thiocyanate, which denature macromolecules, so that virus inactivation is assumed to be likely (Blow et al., 2004; Boom et al., 1990). Limiting factors in nucleic acid extraction for NGS are specimen volume and low pathogen concentration, which could result in insufficient amounts of NGS starting material or insufficient fragment length of the extracted nucleic acid (Fahle and Fischer, 2000). Several nucleic acid extraction kits are commercially available, but they differ inter alia in cost, application, additional reagents required and hands-on effort. For PCR-based detection the


evaluation of several extraction kits for the recovery of pathogens such as Y. pestis, hepatitis A virus, mumps virus and cytomegalovirus has revealed that no kit is ideal under all conditions (Fahle and Fischer, 2000; Krause et al., 2006; Peng et al., 2013; Ribao et al., 2004). However, such an evaluation is missing for NGS-based pathogen detection, i.e. metagenomics. In this study we have therefore compared different commercially available RNA, DNA and combined RNA/DNA extraction kits regarding their performance when recovering four different model viruses present at different concentrations in fluid samples: a reovirus, an orthomyxovirus, an orthopoxvirus and a paramyxovirus. These model viruses were chosen in order to cover a wide range of different viral properties, i.e. non-enveloped particles, enveloped particles and single- or double-stranded genomic DNA or RNA. Samples were subjected to the identical diagnostic NGS process, and finally the number of reads mapping to the reference genome and the percentage of pathogen-specific reads in the total read amount obtained by sequencing on the Illumina HiSeq 1500 system were used as the performance measure.

Extraction kits
Eight commercially available kits [PureLink Viral RNA/DNA Kit (Invitrogen), ZR Viral DNA/RNA Kit (Zymo Research), TRIzol LS Reagent (Invitrogen), QIAamp Viral RNA Mini Kit, QIAamp DNA Blood Mini Kit, QIAamp cador Pathogen Mini Kit, QIAamp MinElute Virus Spin Kit and QIAamp Ultrasense Kit (all Qiagen)] were compared for the simultaneous isolation of DNA and RNA, including kits that are primarily designed for either DNA or RNA alone. The selection of the individual kits was based on their commercial availability, the ease of handling and the total preparation time required. Consequently, most of the kits are based on silica spin columns, which have proven to be the most efficient and popular format for nucleic acid purification during the past decade. 
The kits differ mainly in the chaotropic salts (i.e. guanidine hydrochloride/thiocyanate), detergents and other additives included in the lysis buffers. In some kits the addition of protease as a separate additive is required. Some kits recommend the use of carrier DNA or RNA; however, in this study carrier DNA and RNA were omitted intentionally. Since NGS typically sequences all nucleic acids in the specimen, carrier RNA or DNA would also be sequenced, and the viral sequencing read depth could be impaired. Pretests using quantitative real-time PCR were performed for nucleic acid extracts obtained from all kits (data not shown), and the four kits that scored well regarding the nucleic acid yields were chosen for further evaluation by NGS (QIAamp Viral RNA Mini Kit, QIAamp DNA Blood Mini Kit, QIAamp cador Pathogen Mini Kit and QIAamp MinElute Virus Spin Kit). All four kits fulfilled the general requirements set: speed, simplicity, safety and no need for additional equipment. In addition, all four kits can be run manually or automatically on the QIAcube.

Artificial virus specimen
An aliquoted mix of four different RNA and DNA viruses was used as a well-defined surrogate for a liquid clinical specimen. This mix contained a reovirus (family Reoviridae, subfamily Spinareovirinae, genus Orthoreovirus, species Orthoreovirus T3/342/08), an Orthomyxovirus (family Orthomyxoviridae, genus Influenzavirus A, species H1N1 PR8/1934), an


Orthopoxvirus (family Poxviridae, subfamily Chordopoxvirinae, genus Orthopoxvirus, species vaccinia virus) from cell culture supernatant and a paramyxovirus (family Paramyxoviridae, subfamily Paramyxovirinae, genus Respirovirus, species Sendai virus) from the allantoic fluid of an infected egg. To produce viruses from cell culture supernatant, Vero E6 cells together with the respective virus were grown to confluence in DMEM with 10% fetal calf serum (FCS) and 2 mM L-glutamine. Viral supernatant was collected in a tube on day five and centrifuged at 200 rpm for 10 minutes, and the pellet was resuspended in FCS. For Paramyxovirus propagation, allantoic fluid of infected chicken eggs was handled as previously described by Kohl et al. (2015). Each virus was present at defined but varying concentrations in the same sample (as determined by quantitative real-time PCR). A whole aliquot of the mix was used for each extraction.

NGS workflow
The workflow used to compare the different extraction kits’ performance on the selected viruses is shown in Fig. 4.1 and described in detail below. Compared to PCR, NGS is still

1. Nucleic acid extraction
2. Library processing
   RNA aliquot: RNA concentration, DNA digestion, cDNA synthesis, second strand synthesis
   DNA aliquot: DNA concentration
3. Virus genome quantification via real-time PCR
4. Quantification of total nucleic acid
5. Library preparation and sequencing
6. Bioinformatics analysis

Figure 4.1 Experimental workflow. After total nucleic acid extraction, each sample was divided into two aliquots. Each of the aliquots was processed to either DNA or RNA. For the latter, DNA digestion, cDNA synthesis and second strand synthesis were performed in addition. The DNA and the RNA aliquot were further treated identically. Details of each step (1–6) of the workflow are given in the text under the corresponding section title.


an expensive method, despite the steadily decreasing costs. Therefore, the NGS runs were performed in duplicate only, as a higher number of replicates would have substantially increased the cost of the study.

Nucleic acid extraction
Nucleic acid was extracted in duplicate following the manufacturers’ instructions, except that no carrier RNA/DNA was used for the reasons mentioned above (Table 4.1). The final DNA and RNA solutions were eluted in the same volume of 100 µl of AE or AVE buffer (Qiagen), respectively.

Library processing
Each nucleic acid solution was divided into two aliquots, with one of the aliquots being further subjected to RNA and the other to DNA processing for NGS. One aliquot was concentrated to a final volume of 13 µl using the RNeasy MinElute Cleanup Kit (Qiagen) and subjected to DNase digestion by Turbo DNA-free (Life Technologies, Darmstadt, Germany) at 37°C for 20 minutes according to the manufacturers’ instructions. Total purified and concentrated RNA was reverse-transcribed into cDNA using random hexamers, RNaseOUT™ and Superscript II (Life Technologies) following the manufacturers’ instructions, although an additional denaturing step of 95°C for 5 minutes

Table 4.1 Comparison of the different extraction kits (according to the manufacturer’s information)

Kit (a) | Target | Specimen type | Costs per kit (b) (€) | Required reagents (c) | Required special equipment (c, d) | Starting volume (µl) | Elution volume (µl)
VRMK | Viral RNA | Cell-free body fluids | 197 | Ethanol (e) | Heating block, vortex, microcentrifuge | 140 | 60–80 (f)
DBMK | Viral DNA | Blood and related body fluids | 145 | Ethanol (e), PBS (g) | Heating block, vortex, microcentrifuge | 200 | 50–200 (h)
CPMK | Viral RNA, viral DNA, bacterial DNA | Whole blood, serum and tissue | 176 | Ethanol (e), PBS (g) | Vortex, microcentrifuge | 200 | 50–150
MEVK | Viral RNA, viral DNA | Plasma, serum and cell-free body fluids | 224 | Ethanol (e), PBS (g) | Heating block, vortex, microcentrifuge | 200 | 20–150

(a) By PCR pre-selected kits. VRMK, QIAamp Viral RNA Mini Kit; DBMK, QIAamp DNA Blood Mini Kit; CPMK, QIAamp cador Pathogen Mini Kit; MEVK, QIAamp MinElute Virus Spin Kit.
(b) Pricing quote for small-volume orders (50), Germany, as of April 2016 (€).
(c) Not included in the kit.
(d) Additional laboratory equipment.
(e) Use 96–100% ethanol.
(f) A single elution with 60 μl of buffer AVE is sufficient to elute at least 90% of the viral RNA. Performing a double elution using 2 × 40 μl of buffer AVE will increase yield by up to 10%.
(g) Phosphate-buffered saline (PBS) may be required for some samples.
(h) If more eluate is required increase the amount of buffer AVE used in the two elution steps (2 × 50 μl instead of 2 × 30 μl).
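The kit prices in Table 4.1 can be converted into a cost per extraction with a short calculation. The sketch below is an illustration only: it assumes that the "(50)" in the pricing footnote means 50 preparations per kit, and the names `KIT_PRICE_EUR`, `PREPS_PER_KIT` and `cost_per_reaction` are hypothetical helpers, not part of the study.

```python
# Kit prices in euros as listed in Table 4.1 (quote of April 2016).
KIT_PRICE_EUR = {"VRMK": 197, "DBMK": 145, "CPMK": 176, "MEVK": 224}

# ASSUMPTION: each kit contains 50 preparations (read from the table footnote).
PREPS_PER_KIT = 50


def cost_per_reaction(price_eur, preps=PREPS_PER_KIT):
    """Return the price of a single extraction in euros."""
    return price_eur / preps


costs = {kit: cost_per_reaction(price) for kit, price in KIT_PRICE_EUR.items()}

# List the kits from cheapest to most expensive per extraction.
for kit, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{kit}: {cost:.2f} EUR per extraction")
```

Under this assumption the per-extraction prices span €2.90 to €4.48, which matches the range quoted in the Discussion.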


was used for initial RNA denaturation. Second strand synthesis was performed with the NEBNext® Second Strand Synthesis Module according to the manufacturer’s instructions (New England Biolabs, Frankfurt/Main, Germany). The resulting double-stranded cDNA (ds-cDNA) and the originally extracted DNA fraction were further purified with the MinElute PCR Purification Kit (Qiagen). In none of the experiments was the capacity limit of the columns of 5 µg exceeded (Table 4.2).

Virus genome quantification via real-time PCR
Following nucleic acid extraction by the different kits (Table 4.1), their individual performance regarding the yield of viral nucleic acids was compared by quantitative real-time PCR. RNA was reverse-transcribed into cDNA as described in the paragraph above. Specific real-time PCR protocols were applied in duplicate to each aliquot, using the Applied Biosystems 7500 Real Time PCR System (Applied Biosystems, Darmstadt, Germany), for the quantification of T3/Bat/Germany/342/08 Reovirus as previously described by Kohl et al. (2012), Orthomyxovirus/Influenza viruses by Schulze et al. (2010), Orthopoxviruses by Schröder and Nitsche (2010) and Sendai virus by Kohl et al. (2015). The PCR reaction mixture consisted of 1 U Platinum Taq DNA polymerase (Invitrogen, Darmstadt, Germany), 1× Platinum Taq Buffer (Invitrogen), 5 mM MgCl2 (Invitrogen), 300 nM of each primer, 100 nM of TaqMan probe and 100 µM mix of dNTPs (Invitrogen). All PCR reactions were performed in a final volume of 25 µl containing 3 μl of the sample. Identical conditions were used for all PCR reactions: an enzyme activation step at 95°C for 10 minutes, followed by 45 cycles with a 95°C denaturation step for 15 s and 60°C annealing for 35 s (Bustin et al., 2009).

Quantification of total nucleic acid
Prior to further library processing, the yield of transcribed RNA and extracted DNA was determined using the Qubit® 2.0 Fluorometer (Qubit® dsDNA HS Assay Kit, Invitrogen). 
The assay is designed for the accurate measurement of sample concentrations from 10 pg/µl to 100 ng/µl.

Table 4.2 Nucleic acid concentration of samples with different kits, determined by Qubit (mean concentration, n=2)

Kit (a) | RNA (b) concentration (ng/µl) | DNA (c) concentration (ng/µl)
VRMK | b/d | 2.3
DBMK | b/d | 1.0
CPMK | b/d | 1.0
MEVK | b/d | 2.2

(a) By PCR pre-selected kits. VRMK, QIAamp Viral RNA Mini Kit; DBMK, QIAamp DNA Blood Mini Kit; CPMK, QIAamp cador Pathogen Mini Kit; MEVK, QIAamp MinElute Virus Spin Kit.
(b) After RNA preparation (ds-cDNA).
(c) After DNA preparation.
b/d, below Qubit detection limit.


Library preparation and sequencing
NGS libraries were generated with the Nextera XT DNA Sample Preparation Kit according to the manufacturer’s instructions (Illumina, San Diego, CA, USA). For quantification of NGS libraries the KAPA Library Quantification Kits for Illumina sequencing (Kapa Biosystems, Wilmington, MA, USA) were used. Because the starting amount of 1 ng of nucleic acid was not reached in the RNA preparations after second strand synthesis, the entire sample volume was used for library preparation. Both the RNA and DNA libraries were sequenced. Sequencing was performed on the Illumina HiSeq 1500 system in rapid-run mode. A paired-end protocol was used with read lengths of 2 × 150 bp and a total run time of 40 h.

Bioinformatics analysis
The bioinformatics analysis was performed by mapping the reads obtained to the reference genome sequences of the four viruses present in the sample, using bowtie2 (Langmead and Salzberg, 2012). Reference sequences were: reovirus: NCBI accession numbers JQ412763.1, JQ412761.1, JQ412759.1, JQ412757.1, JQ412755.1, JQ412764.1, JQ412762.1, JQ412760.1, JQ412758.1 and JQ412756.1; influenza virus: NCBI accession numbers NC_002018.1, NC_002016.1, NC_002023.1, NC_002021.1, NC_002019.1, NC_002017.1, NC_002020.1 and CY033582.1; Orthopoxvirus: NCBI accession number NC_006998.1; and Sendai virus: NCBI accession number EF679198.1.

Comparison of extraction kit performance
For the comparison of different commercially available DNA and RNA extraction kits, four different viruses with different characteristics were chosen: Reovirus, Orthomyxovirus, Orthopoxvirus and Paramyxovirus. Following nucleic acid extraction, NGS was applied in order to determine the performance of each kit regarding viral nucleic acid detection. 
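The step from a mapping result to a percentage of reads per virus can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes the mapper output is available as SAM text, that reference names in the SAM file equal the NCBI accession numbers, and that every SAM record counts as one read. The functions `count_mapped_reads` and `percent_by_virus`, the two-entry accession table and the demo records are hypothetical.

```python
from collections import Counter

# ASSUMPTION: SAM reference names equal the accession numbers given in the text.
# Only two of the listed accessions are shown here for brevity.
VIRUS_BY_ACCESSION = {
    "NC_006998.1": "Orthopoxvirus",
    "EF679198.1": "Sendai virus",
}


def count_mapped_reads(sam_lines):
    """Count primary mapped records per reference; also return the total."""
    counts = Counter()
    total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag, reference = int(fields[1]), fields[2]
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            counts[reference] += 1
    return counts, total


def percent_by_virus(counts, total, virus_by_accession):
    """Sum counts over all segments of a virus and convert to percent."""
    if total == 0:
        return {}
    per_virus = Counter()
    for accession, n in counts.items():
        per_virus[virus_by_accession.get(accession, "other")] += n
    return {virus: 100.0 * n / total for virus, n in per_virus.items()}


# Hypothetical SAM records: one poxvirus hit, one unmapped read,
# one Sendai virus hit and one secondary alignment that is ignored.
DEMO_SAM = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tNC_006998.1\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t16\tEF679198.1\t250\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r1\t256\tNC_006998.1\t900\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

counts, total = count_mapped_reads(DEMO_SAM)
percentages = percent_by_virus(counts, total, VIRUS_BY_ACCESSION)
print(total, percentages)
```

Summing over accessions matters for the segmented viruses in the mix (reovirus and influenza virus), whose genomes are spread over several reference sequences.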
The nucleic acid concentrations following DNA preparation were close to 1 ng/µl, which is the amount theoretically required for library preparation with the Nextera XT DNA Sample Preparation Kit according to the manufacturer’s instructions (Table 4.2). The highest DNA concentration on average was found in samples extracted by the QIAamp Viral RNA Mini Kit (2.3 ng/µl), while extraction with the QIAamp DNA Blood Mini Kit and the QIAamp cador Pathogen Mini Kit resulted in the lowest concentration (1.0 ng/µl). However, all four viruses were detectable after nucleic acid preparation by quantitative real-time PCR, indicating a detectable amount of viral nucleic acid in the respective samples (Table 4.3). Interestingly, both DNA and RNA viruses were detected using the QIAamp Viral RNA Mini Kit and the QIAamp DNA Blood Mini Kit, although these kits are designed for RNA only and DNA only, respectively. According to quantitative real-time PCR results, the efficiency of Orthomyxovirus, Orthopoxvirus and Orthoreovirus genome extraction was higher when using the QIAamp cador Pathogen Mini Kit than when using the QIAamp MinElute Virus Spin Kit. The lowest CT values for Paramyxovirus were reached for samples extracted with the QIAamp Viral RNA Mini Kit (CT 15.8 ± 0.4), while the QIAamp MinElute Virus Spin Kit had the lowest detectability (CT 17.4). Illumina sequencing of the respective libraries following DNA or RNA preparation resulted in 93 million paired-end reads (150 + 150 bp), with a total of 24.5 Gbp sequence


Table 4.3 Comparison of different extraction kits based on average threshold cycle (CT) during real-time PCR (n=2, CI = 95%). The best result is presented in bold

Kit (a) | Orthomyxovirus RNA genome (b) | Orthopoxvirus DNA genome (c) | Orthoreovirus RNA genome (b) | Paramyxovirus RNA genome (b)
VRMK | 20.9 ± 0.2 | 17.6 ± 0.2 | 28.8 ± 1.1 | 15.8 ± 0.4
DBMK | 21.5 ± 0.2 | 18.0 ± 0.2 | 29.8 ± 0.2 | 16.3 ± 0.5
CPMK | 21.2 ± 0.4 | 17.5 ± 0.2 | 28.7 ± 1.1 | 16.3 ± 1.2
MEVK | 21.6 (d) | 18.9 ± 0.2 | 30.4 (d) | 17.4 (d)

(a) By PCR pre-selected kits. VRMK, QIAamp Viral RNA Mini Kit; DBMK, QIAamp DNA Blood Mini Kit; CPMK, QIAamp cador Pathogen Mini Kit; MEVK, QIAamp MinElute Virus Spin Kit.
(b) After RNA preparation (ds-cDNA).
(c) After DNA preparation.
(d) Single measurement due to insufficient sample volume.
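To put the CT differences in Table 4.3 into perspective, they can be translated into approximate fold differences in template amount. The sketch below assumes ideal amplification, i.e. a doubling of product per cycle, which the chapter does not report for these assays; `fold_below_best` is a hypothetical helper and the confidence intervals are ignored.

```python
# Mean CT values from Table 4.3 (a lower CT indicates more template).
CT = {
    "Orthomyxovirus": {"VRMK": 20.9, "DBMK": 21.5, "CPMK": 21.2, "MEVK": 21.6},
    "Orthopoxvirus": {"VRMK": 17.6, "DBMK": 18.0, "CPMK": 17.5, "MEVK": 18.9},
    "Orthoreovirus": {"VRMK": 28.8, "DBMK": 29.8, "CPMK": 28.7, "MEVK": 30.4},
    "Paramyxovirus": {"VRMK": 15.8, "DBMK": 16.3, "CPMK": 16.3, "MEVK": 17.4},
}


def fold_below_best(ct_by_kit, efficiency=2.0):
    """Fold reduction in template relative to the kit with the lowest CT.

    ASSUMPTION: product increases by `efficiency` per cycle (2.0 = ideal PCR).
    """
    best = min(ct_by_kit.values())
    return {kit: efficiency ** (ct - best) for kit, ct in ct_by_kit.items()}


fold = {virus: fold_below_best(cts) for virus, cts in CT.items()}

for virus, by_kit in fold.items():
    summary = ", ".join(f"{kit} {value:.1f}x" for kit, value in by_kit.items())
    print(f"{virus}: {summary}")
```

Under this assumption the largest gap, 1.7 cycles for Orthoreovirus between CPMK and MEVK, corresponds to a difference in template of slightly more than threefold.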

information. On average, the percentage of bases with a quality score greater than 30 was 88.15%. The average number of reads from each library and the percentage of reads mapping to the reference genomes of all viruses examined are shown in Table 4.4. In both libraries derived from DNA and RNA preparation, the number of non-viral reads exceeded the number of viral reads. The total number of sequence reads for each sample ranged from 5,827,117 to 10,537,950 reads in the RNA preparation and from 9,176,541 to 13,256,053 reads in the DNA preparation. On average, a higher proportion of non-viral reads was observed following DNA preparation (98.92%) compared to RNA preparation (71.8%). Overall, the QIAamp MinElute Virus Spin Kit allowed the recovery of the highest percentage of viral reads after RNA preparation while generating the lowest percentage of non-viral reads. The QIAamp cador Pathogen Mini Kit showed the second-best performance regarding RNA preparation read recovery and the best read recovery for the Orthopoxvirus DNA library. The largest difference in percentage of recovered reads was observed in the Orthopoxvirus library, where the highest percentage was close to twofold higher than the lowest percentage. As visualized in Fig. 4.2, the performance of the kits concerning the tested viruses generally showed only small differences, and all viruses could be detected via NGS after preparation with each kit. As shown in Table 4.3, the results obtained from real-time PCR are somewhat different. The QIAamp MinElute Virus Spin Kit, which allowed for the recovery of the highest percentage of viral reads for RNA viruses and an average percentage of viral reads for the DNA virus by NGS, produced the highest CT values for all viruses. However, the QIAamp cador Pathogen Mini Kit produced both the highest percentage of Orthopoxvirus reads in NGS and the lowest Orthopoxvirus CT of all kits tested. 
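The average proportions of non-viral reads quoted above follow directly from the per-kit values in Table 4.4. The short check below is only a worked calculation; the dictionary name `NON_VIRAL_PCT` is a hypothetical label for the two non-viral columns of that table.

```python
from statistics import mean

# Non-viral read percentages per kit from Table 4.4,
# in the order VRMK, DBMK, CPMK, MEVK.
NON_VIRAL_PCT = {
    "RNA": [75.4, 77.6, 69.6, 64.6],
    "DNA": [99.3, 99.0, 98.4, 99.0],
}

means = {prep: mean(values) for prep, values in NON_VIRAL_PCT.items()}

for prep, value in means.items():
    print(f"{prep} preparation: {value:.3f}% non-viral reads on average")
```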
Discussion
In this study we compared four pre-selected QIAGEN kits (QIAamp Viral RNA Mini Kit, QIAamp DNA Blood Mini Kit, QIAamp cador Pathogen Mini Kit and QIAamp MinElute Virus Spin Kit) by NGS in terms of their ability to recover sequence information of four different viruses (Reovirus, Orthomyxovirus, Orthopoxvirus and Paramyxovirus). These


Table 4.4 Percentage (%) of viral reads and total number of reads (#) obtained after Illumina sequencing. The best results are given in bold

Kit (a) | Orthomyxovirus RNA (b) (%) | Orthopoxvirus DNA (c) (%) | Orthoreovirus RNA (b) (%) | Paramyxovirus RNA (b) (%) | Non-viral reads RNA (b) (%) | Non-viral reads DNA (c) (%) | No. of reads (d) RNA (b) | No. of reads (d) DNA (c)
VRMK | 6.7 | 0.7 | 0.001 | 17.9 | 75.4 | 99.3 | 1.05 × 10^7 | 1.33 × 10^7
DBMK | 5.8 | 1.0 | 0.001 | 16.6 | 77.6 | 99.0 | 6.11 × 10^6 | 1.21 × 10^7
CPMK | 6.9 | 1.6 | 0.003 | 23.4 | 69.6 | 98.4 | 7.86 × 10^6 | 1.01 × 10^7
MEVK | 11.3 | 1.0 | 0.003 | 24.0 | 64.6 | 99.0 | 5.83 × 10^6 | 9.18 × 10^6

(a) By PCR pre-selected kits. VRMK, QIAamp Viral RNA Mini Kit; DBMK, QIAamp DNA Blood Mini Kit; CPMK, QIAamp cador Pathogen Mini Kit; MEVK, QIAamp MinElute Virus Spin Kit.
(b) % of reads after RNA preparation (ds-cDNA).
(c) % of reads after DNA preparation.


Figure 4.2 Number of viral reads recovered after extraction of samples using four different extraction kits. The viruses used were Reovirus (T3/Bat/Germany/342/08), Orthomyxovirus (H1N1 PR8/34), Orthopoxvirus (Vaccinia Virus) and Paramyxovirus (Sendai virus). Recovered viral reads were normalized to 10 million output reads. Normalized read numbers are shown in log scale. VRMK: QIAamp Viral RNA Mini Kit, DBMK: QIAamp DNA Blood Mini Kit, CPMK: QIAamp cador Pathogen Mini Kit and MEVK: QIAamp MinElute Virus Spin Kit. (a) after RNA preparation (ds-cDNA), (b) after DNA preparation.
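The normalization described in the caption of Figure 4.2 can be reproduced from the percentages in Table 4.4. The sketch below is an approximation under stated assumptions: it starts from the rounded percentages rather than the raw read counts, so the values only approximate those plotted, and `normalized_reads` is a hypothetical helper.

```python
import math

# Percentage of reads per virus and kit from Table 4.4
# (RNA preparation, except Orthopoxvirus, which is from the DNA preparation).
VIRAL_PCT = {
    "VRMK": {"Orthomyxovirus": 6.7, "Orthopoxvirus": 0.7, "Reovirus": 0.001, "Paramyxovirus": 17.9},
    "DBMK": {"Orthomyxovirus": 5.8, "Orthopoxvirus": 1.0, "Reovirus": 0.001, "Paramyxovirus": 16.6},
    "CPMK": {"Orthomyxovirus": 6.9, "Orthopoxvirus": 1.6, "Reovirus": 0.003, "Paramyxovirus": 23.4},
    "MEVK": {"Orthomyxovirus": 11.3, "Orthopoxvirus": 1.0, "Reovirus": 0.003, "Paramyxovirus": 24.0},
}


def normalized_reads(percent, depth=10_000_000):
    """Viral reads expected if the library had exactly `depth` output reads."""
    return percent / 100.0 * depth


normalized = {
    kit: {virus: normalized_reads(pct) for virus, pct in by_virus.items()}
    for kit, by_virus in VIRAL_PCT.items()
}

# Print the values on the log10 scale used for the figure axis.
for kit, by_virus in normalized.items():
    summary = ", ".join(f"{virus} {math.log10(n):.2f}" for virus, n in by_virus.items())
    print(f"{kit} (log10 reads per 10 million): {summary}")
```

On this scale all values fall between 2 and about 6.4, which illustrates why a logarithmic axis is needed to show Reovirus and Paramyxovirus in the same plot.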

viruses with varying genome characteristics were selected to cover a broad range of viruses possibly present in a clinical sample. In contrast to a clinical sample, the mixed aliquots were all at fairly high titres and of low complexity because the focus of this study was the comparison of different extraction kits and not the evaluation of the detection limit of NGS for metagenomics. While similar studies have shown that for PCR – the current gold standard for nucleic acid-based diagnostics – no single kit is perfect for all pathogens (Dauphin et al., 2010; Fahle and Fischer, 2000; Krause et al., 2006; Ribao et al., 2004), to our knowledge this is the first study directed at NGS-based pathogen detection.


As shown in Table 4.4, all viruses could be detected using any of the kits compared, with the highest percentage of reads recovered by the QIAamp MinElute Virus Spin Kit for Reovirus, Orthomyxovirus and Paramyxovirus and by the QIAamp cador Pathogen Mini Kit for the Orthopoxvirus. Interestingly, this is in contrast to the results (CT values) obtained by the corresponding real-time PCRs shown in Table 4.3. In the real-time PCR results, the QIAamp MinElute Virus Spin Kit has the highest CT value for all viruses tested. PCR quantifies the total number of target sequences in the sample, while NGS quantifies the number of target sequences in relation to the total number of sequences in the sample. Thus, when preparing nucleic acid for PCR-based detection, it is most important not to decrease the total amount of nucleic acid in the sample, in order to preserve as many PCR targets as possible; however, dilution of the nucleic acid prior to the PCR reaction may sometimes be required to prevent PCR inhibition by excessive amounts of DNA. In contrast, when preparing nucleic acid for NGS-based detection (i.e. metagenomics), it is most important to deplete the host nucleic acid more efficiently than the virus nucleic acid. It is to a certain degree acceptable to decrease the pathogen nucleic acid, as long as the host nucleic acid is decreased by a higher factor, since this increases the detection likelihood. Consequently, available publications comparing different preparation kits based on their performance in PCR-based detection methods alone are not directly applicable to NGS-based detection approaches. In the study presented, only slight differences in the number of recovered NGS reads were observed between all kits tested – the highest difference observed between the recoveries of two kits is just above twofold (Fig. 4.2). 
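The difference between the absolute quantity measured by PCR and the relative quantity measured by NGS can be illustrated with a small worked example. All numbers below are hypothetical and chosen only to show the effect; `viral_read_fraction` is an illustrative helper that assumes reads are sampled in proportion to the amount of nucleic acid.

```python
def viral_read_fraction(viral, host):
    """Expected fraction of reads that are viral.

    ASSUMPTION: reads are sampled in proportion to nucleic acid amount.
    """
    return viral / (viral + host)


# Hypothetical extract: 1,000 units of viral and 999,000 units of host nucleic acid.
before = viral_read_fraction(viral=1_000, host=999_000)

# Hypothetical treatment that halves the viral nucleic acid
# but reduces the host nucleic acid tenfold.
after = viral_read_fraction(viral=1_000 / 2, host=999_000 / 10)

print(f"viral fraction before: {before:.4%}")
print(f"viral fraction after:  {after:.4%}")
print(f"relative enrichment:   {after / before:.1f}-fold")
```

In this example a PCR would see half as many viral targets after the treatment, while the share of viral reads in an NGS library would rise about fivefold.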
While not conclusive due to the limited number of model organisms tested, our results suggest that NGS-based detection is not dependent on the choice of nucleic acid extraction kit. This allows kits to be chosen according to other parameters. One factor could be cost, which ranges from €2.90 to €4.48 per reaction, the most economical solution being the QIAamp Viral RNA Mini Kit. Another could be equipment: the QIAamp cador Pathogen Mini Kit is the only kit tested that does not require a heating block. All of the kits tested are compatible with the QIAcube automated extraction system. In summary, while all kits perform similarly well, the QIAamp MinElute Virus Spin Kit seems most advantageous for NGS due to the low elution volume of 20 µl. In general, detection of viruses is made difficult by their structural and genetic diversity (Hulo et al., 2011). The lack of a universally conserved region analogous to the bacterial 16S rRNA gene prohibits generic amplification and enrichment. The small size of the viral genome compared to the host genome (e.g. the porcine circoviruses, with about 1.7 kb, versus the human genome, with 3,000,000 kb) adversely affects the nucleic acid ratio in a sample (Finsterbusch and Mankertz, 2009; Hulo et al., 2011; Nagele, 2011). Background reads such as those from the human genome thus present a major problem when using NGS for pathogen detection, so that finding viral sequences depends either on very deep sequencing or on luck. This makes the methods for extraction and enrichment of viral nucleic acid from clinical samples important, because they allow better detection of pathogens using NGS, which is in turn an important factor in the adoption of NGS as a standard method for clinical diagnostics. Available research data show that virus purification and enrichment are of great importance in metagenomic and virome studies.
Optimization of the extraction process has already led to an increase in the detection likelihood of viruses from organ tissue (Kohl et al., 2015). In summary, one can use any of the compared kits for the extraction of nucleic acid from fluid samples.
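The effect of the genome size ratio mentioned above (about 1.7 kb for a porcine circovirus versus 3,000,000 kb for the human genome) on the required sequencing depth can be sketched as follows. This is an illustrative estimate under strong assumptions that are not part of the study: reads are drawn independently and in proportion to nucleic acid length, a single viral read counts as detection, and the numbers of virus copies per host genome are hypothetical.

```python
import math


def viral_fraction(virus_kb, host_kb, virus_copies_per_host_genome=1.0):
    """Share of total nucleic acid (by length) that is viral."""
    virus_total = virus_kb * virus_copies_per_host_genome
    return virus_total / (virus_total + host_kb)


def reads_for_detection(fraction, p_detect=0.95):
    """Reads needed so that P(at least one viral read) >= p_detect."""
    # P(no viral read among n reads) = (1 - fraction) ** n
    return math.ceil(math.log(1.0 - p_detect) / math.log1p(-fraction))


PCV_KB = 1.7            # porcine circovirus genome size, from the text
HUMAN_KB = 3_000_000    # human genome size, from the text

for copies in (1, 100, 10_000):   # hypothetical virus copies per host genome
    f = viral_fraction(PCV_KB, HUMAN_KB, copies)
    print(f"{copies:>6} copies: viral fraction = {f:.2e}, "
          f"reads needed = {reads_for_detection(f):,}")
```

With one virus genome per host genome, the sketch gives a requirement on the order of millions of reads for a single viral hit, which illustrates why depletion of host nucleic acid or enrichment of viral nucleic acid matters more than total yield.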

Comparing Viral Metagenomic Extraction Methods | 69

Acknowledgements

The authors are grateful to Dr Brunhilde Schweiger for providing Influenza A virus strain PR8/38, Dr Marc Hoferer and Dr Andreas Kurth for providing the Sendai virus isolate and Ursula Erikli for copy-editing.

References

Ahmadian, A., and Svahn, H.A. (2011). Massively parallel sequencing platforms using lab on a chip technologies. Lab Chip 11, 2653–2655. http://dx.doi.org/10.1039/c1lc90035h
Batty, E.M., Wong, T.H., Trebes, A., Argoud, K., Attar, M., Buck, D., Ip, C.L., Golubchik, T., Cule, M., Bowden, R., et al. (2013). A modified RNA-Seq approach for whole genome sequencing of RNA viruses from faecal and blood samples. PLOS ONE 8, e66129. http://dx.doi.org/10.1371/journal.pone.0066129
Blow, J.A., Dohm, D.J., Negley, D.L., and Mores, C.N. (2004). Virus inactivation by nucleic acid extraction reagents. J. Virol. Methods 119, 195–198. http://dx.doi.org/10.1016/j.jviromet.2004.03.015
Boom, R., Sol, C.J., Salimans, M.M., Jansen, C.L., Wertheim-van Dillen, P.M., and van der Noordaa, J. (1990). Rapid and simple method for purification of nucleic acids. J. Clin. Microbiol. 28, 495–503.
Bustin, S.A., Benes, V., Garson, J.A., Hellemans, J., Huggett, J., Kubista, M., Mueller, R., Nolan, T., Pfaffl, M.W., Shipley, G.L., et al. (2009). The MIQE guidelines: minimum information for publication of quantitative real-time PCR experiments. Clin. Chem. 55, 611–622.
Cheval, J., Sauvage, V., Frangeul, L., Dacheux, L., Guigon, G., Dumey, N., Pariente, K., Rousseaux, C., Dorange, F., Berthet, N., et al. (2011). Evaluation of high-throughput sequencing for identifying known and unknown viruses in biological samples. J. Clin. Microbiol. 49, 3268–3275. http://dx.doi.org/10.1128/JCM.00850-11
Dauphin, L.A., Stephens, K.W., Eufinger, S.C., and Bowen, M.D. (2010). Comparison of five commercial DNA extraction kits for the recovery of Yersinia pestis DNA from bacterial suspensions and spiked environmental samples. J. Appl. Microbiol. 108, 163–172. http://dx.doi.org/10.1111/j.1365-2672.2009.04404.x
Desai, A.N., and Jere, A. (2012). Next-generation sequencing: ready for the clinics? Clin. Genet. 81, 503–510. http://dx.doi.org/10.1111/j.1399-0004.2012.01865.x
Didelot, X., Bowden, R., Wilson, D.J., Peto, T.E., and Crook, D.W. (2012). Transforming clinical microbiology with bacterial genome sequencing. Nat. Rev. Genet. 13, 601–612. http://dx.doi.org/10.1038/nrg3226
Fahle, G.A., and Fischer, S.H. (2000). Comparison of six commercial DNA extraction kits for recovery of cytomegalovirus DNA from spiked human specimens. J. Clin. Microbiol. 38, 3860–3863.
Finsterbusch, T., and Mankertz, A. (2009). Porcine circoviruses – small but powerful. Virus Res. 143, 177–183. http://dx.doi.org/10.1016/j.virusres.2009.02.009
Hayden, E.C. (2014). Technology: The $1,000 genome. Nature 507, 294–295. http://dx.doi.org/10.1038/507294a
Hazelton, P.R., and Gelderblom, H.R. (2003). Electron microscopy for rapid diagnosis of infectious agents in emergent situations. Emerg. Infect. Dis. 9, 294–303.
Hulo, C., de Castro, E., Masson, P., Bougueleret, L., Bairoch, A., Xenarios, I., and Le Mercier, P. (2011). ViralZone: a knowledge resource to understand virus diversity. Nucleic Acids Res. 39, D576–82. http://dx.doi.org/10.1093/nar/gkq901
Klein, D. (2002). Quantification using real-time PCR technology: applications and limitations. Trends Mol. Med. 8, 257–260.
Kohl, C., Lesnik, R., Brinkmann, A., Ebinger, A., Radonić, A., Nitsche, A., Mühldorfer, K., Wibbelt, G., and Kurth, A. (2012). Isolation and characterization of three mammalian orthoreoviruses from European bats. PLOS ONE 7, e43106. http://dx.doi.org/10.1371/journal.pone.0043106
Kohl, C., Brinkmann, A., Dabrowski, P.W., Radonić, A., Nitsche, A., and Kurth, A. (2015). Protocol for metagenomic virus detection in clinical specimens. Emerg. Infect. Dis. 21, 48–57. http://dx.doi.org/10.3201/eid2101.140766
Krause, C.H., Eastick, K., and Ogilvie, M.M. (2006). Real-time PCR for mumps diagnosis on clinical specimens – comparison with results of conventional methods of virus detection and nested PCR. J. Clin. Virol. 37, 184–189.
Langmead, B., and Salzberg, S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359. http://dx.doi.org/10.1038/nmeth.1923


Law, J., Jovel, J., Patterson, J., Ford, G., O’Keefe, S., Wang, W., Meng, B., Song, D., Zhang, Y., Tian, Z., et al. (2013). Identification of hepatotropic viruses from plasma using deep sequencing: a next generation diagnostic tool. PLOS ONE 8, e60595. http://dx.doi.org/10.1371/journal.pone.0060595
Lecuit, M., and Eloit, M. (2014). The diagnosis of infectious diseases by whole genome next generation sequencing: a new era is opening. Front. Cell. Infect. Microbiol. 4, 25. http://dx.doi.org/10.3389/fcimb.2014.00025
Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., Lin, D., Lu, L., and Law, M. (2012). Comparison of next-generation sequencing systems. J. Biomed. Biotechnol. 2012, 251364. http://dx.doi.org/10.1155/2012/251364
Mardis, E.R. (2008). The impact of next-generation sequencing technology on genetics. Trends Genet. 24, 133–141. http://dx.doi.org/10.1016/j.tig.2007.12.007
Metzker, M.L. (2010). Sequencing technologies – the next generation. Nat. Rev. Genet. 11, 31–46. http://dx.doi.org/10.1038/nrg2626
Nagele, P. (2011). Perioperative genomics. Best. Pract. Res. Clin. Anaesthesiol. 25, 549–555. http://dx.doi.org/10.1016/j.bpa.2011.09.001
Nakamura, S., Maeda, N., Miron, I.M., Yoh, M., Izutsu, K., Kataoka, C., Honda, T., Yasunaga, T., Nakaya, T., Kawai, J., et al. (2008). Metagenomic diagnosis of bacterial infections. Emerg. Infect. Dis. 14, 1784–1786. http://dx.doi.org/10.3201/eid1411.080589
Nassirpour, R., Mathur, S., Gosink, M.M., Li, Y., Shoieb, A.M., Wood, J., O’Neil, S.P., Homer, B.L., and Whiteley, L.O. (2014). Identification of tubular injury microRNA biomarkers in urine: comparison of next-generation sequencing and qPCR-based profiling platforms. BMC Genomics 15, 485.
Pabinger, S., Rödiger, S., Kriegner, A., Vierlinger, K., and Weinhäusel, A. (2014). A survey of tools for the analysis of quantitative PCR (qPCR) data. Biomol. Detect. Quantif. 1, 23–33.
Peng, X., Yu, K.Q., Deng, G.H., Jiang, Y.X., Wang, Y., Zhang, G.X., and Zhou, H.W. (2013). Comparison of direct boiling method with commercial kits for extracting fecal microbiome DNA by Illumina sequencing of 16S rRNA tags. J. Microbiol. Methods 95, 455–462. http://dx.doi.org/10.1016/j.mimet.2013.07.015
Perandin, F., Manca, N., Calderaro, A., Piccolo, G., Galati, L., Ricci, L., Medici, M.C., Arcangeletti, M.C., Snounou, G., Dettori, G., et al. (2004). Development of a real-time PCR assay for detection of Plasmodium falciparum, Plasmodium vivax, and Plasmodium ovale for routine clinical diagnosis. J. Clin. Microbiol. 42, 1214–1219.
Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R., Bertoni, A., Swerdlow, H.P., and Gu, Y. (2012). A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics 13, 341.
Renkema, K.Y., Stokman, M.F., Giles, R.H., and Knoers, N.V. (2014). Next-generation sequencing for research and diagnostics in kidney disease. Nat. Rev. Nephrol. 10, 433–444. http://dx.doi.org/10.1038/nrneph.2014.95
Ribao, C., Torrado, I., Vilariño, M.L., and Romalde, J.L. (2004). Assessment of different commercial RNA-extraction and RT-PCR kits for detection of hepatitis A virus in mussel tissues. J. Virol. Methods 115, 177–182.
Schmittgen, T.D., Lee, E.J., Jiang, J., Sarkar, A., Yang, L., Elton, T.S., and Chen, C. (2008). Real-time PCR quantification of precursor and mature microRNA. Methods 44, 31–38.
Schröder, K., and Nitsche, A. (2010). Multicolour, multiplex real-time PCR assay for the detection of human-pathogenic poxviruses. Mol. Cell. Probes 24, 110–113. http://dx.doi.org/10.1016/j.mcp.2009.10.008
Schulze, M., Nitsche, A., Schweiger, B., and Biere, B. (2010). Diagnostic approach for the differentiation of the pandemic influenza A(H1N1)v virus from recent human influenza viruses by real-time PCR. PLOS ONE 5, e9966. http://dx.doi.org/10.1371/journal.pone.0009966
Templeton, K.E., Scheltinga, S.A., Beersma, M.F., Kroes, A.C., and Claas, E.C. (2004). Rapid and sensitive method using multiplex real-time PCR for diagnosis of infections by influenza A and influenza B viruses, respiratory syncytial virus, and parainfluenza viruses 1, 2, 3, and 4. J. Clin. Microbiol. 42, 1564–1569.
Voelkerding, K.V., Dames, S.A., and Durtschi, J.D. (2009). Next-generation sequencing: from basic research to diagnostics. Clin. Chem. 55, 641–658. http://dx.doi.org/10.1373/clinchem.2008.112789
Yamamoto, Y. (2002). PCR in diagnosis of infection: detection of bacteria in cerebrospinal fluids. Clin. Diagn. Lab. Immunol. 9, 508–514.

Spatiotemporal Variations in the Abundance and Structure of Denitrifier Communities in Sediments Differing in Nitrate Content

5

David Correa-Galeote1*, Germán Tortosa1, Silvia Moreno1, David Bru2, Laurent Philippot2 and Eulogio J. Bedmar1

1Department of Soil Microbiology and Symbiotic Systems, Estación Experimental del Zaidín, Granada, Spain.

2INRA-Université de Bourgogne, UMR 1229, Microbiologie et Géochimie des Sols, Dijon Cedex, France.

*Correspondence: [email protected]

https://doi.org/10.21775/9781910190593.05

Abstract

Spatial and temporal variations, related to hydric seasonality, in the abundance and diversity of denitrifier communities were examined in sediments taken from two sites differing in nitrate concentration along a stream in the Doñana National Park during a 3-year study. We found a positive relationship between the relative abundance of denitrifiers, determined as narG, napA, nirK, nirS and nosZ denitrification genes, and sediment nitrate content, with similar spatial and seasonal variations. However, we did not find an association between denitrification activity and the community structure of denitrifiers. Because nosZ showed the strongest correlation with the content of nitrate in sediments, we used this gene as a molecular marker to construct eight genomic libraries. Analysis of these genomic libraries revealed that diversity of the nosZ-bearing communities was higher in the site with higher nitrate content. Regardless of nitrate concentration in the sediments, the Bradyrhizobiaceae and Rhodocyclaceae were the most abundant families. In contrast, Rhizobiaceae was exclusively present in sediments with higher nitrate content. Results showed that differences in sediment nitrate concentration affect the composition and diversity of nosZ-bearing communities.

72 | Correa-Galeote et al.

Introduction

Denitrification is the biological process in the biogeochemical nitrogen (N) cycle by which nitrate (NO3–) is sequentially reduced to dinitrogen gas (N2) via the intermediate compounds nitrite (NO2–), nitric oxide (NO) and nitrous oxide (N2O) when oxygen concentrations are limiting. This reduction is carried out by the sequential activity of the enzymes nitrate reductase (NarG, NapA), nitrite reductase (NirK, NirS), nitric oxide reductase (cNor, qNor) and nitrous oxide reductase (NosZ), encoded by the narG, napA, nirK, nirS, norC and nosZ genes, respectively (Zumft, 1997; van Spanning et al., 2007; Richardson, 2011; Kraft et al., 2011; Sánchez et al., 2011; Bedmar et al., 2013). Denitrifiers constitute a taxonomically diverse group of microorganisms that includes more than 60 genera of bacteria and some archaea (Philippot et al., 2007; Hayatsu et al., 2008), fungi (Takaya, 2002; Prendergast-Miller et al., 2011), Foraminifera (Risgaard-Petersen et al., 2006) and the amoeboid Gromia (Piña-Ochoa et al., 2010). The density of denitrifiers in soils can be up to 10⁹ cells per g of soil (Babic et al., 2008; Dandie et al., 2008; Henry et al., 2008), and both cultivation-independent and cultivation-dependent methods have shown that denitrifiers represent up to 5% of the total soil microbial community (Tiedje, 1988; Henry et al., 2006; Jones et al., 2013). Several studies have shown that nitrate is the main factor controlling the activity, abundance and biodiversity of denitrifier communities in different ecosystems (Wolsing and Priemé, 2004; Enwall et al., 2005; Deiglmayr et al., 2006; Reyna et al., 2010; García-Lledó et al., 2011; Carrino-Kyker et al., 2012; Smith et al., 2015a). However, only a few studies have dealt with the effect of nitrate concentration on the spatial and temporal variations in the abundance and biodiversity of denitrifiers in ecosystems differing in nitrate content.
Significant seasonal shifts in the community structure of nirK-containing bacteria were found in agricultural soils in Denmark after the application of mineral fertilizer (Wolsing and Priemé, 2004), and variations in the diversity of the nirS and nirK genes from an agricultural field in Canada were assigned to temporal variations in the physicochemical properties of the soil (Smith et al., 2010). Temporal variations of denitrification activity were dependent on water temperature and nitrate concentration in riverine wetlands (Song et al., 2012) and constructed wetlands (Song et al., 2014). Changes in the relative abundance of the narG, napA and nirS genes followed spatiotemporal patterns similar to those of the nitrate concentrations along the estuary of the river Colne (UK) (Smith et al., 2015a), and seasonal changes in the abundance of the nirK and nirS communities were related to the seasonal variations in nitrogen dynamics in the Elkhorn (USA) tidal estuary (Smith et al., 2015b).

In a previous study (Tortosa et al., 2011), we analysed the biological and physicochemical properties of la Rocina stream, a main natural creek feeding a 60,000 ha wetland, el Rocio, within Doñana National Park. The park is in a marshy area of SW Spain, in the estuary of the Guadalquivir River. Screening of more than 25 points along the course of la Rocina stream (36 km) revealed differences in nitrate concentration in its sediments, most probably due to contamination from agricultural practices allowed in the ecotone of the Park, as no urban areas are located nearby. Thus, la Rocina stream provides a unique model system to study the effect of nitrate content on the abundance and biodiversity of denitrifying communities in sediments, as the long-term effect related to nitrate content could influence community abundance, composition

Spatiotemporal Variations in Denitrifier Communities Related to Nitrate | 73

and activity. In this study, we determined the spatial and temporal variations in the activity, abundance and biodiversity of denitrifying communities between two sites along la Rocina stream with contrasting nitrate content. Denitrification activity was examined as N2O production, and biodiversity was analysed by using the nosZ gene as a molecular marker for the construction of genomic libraries.

Materials and methods

Site description

Based on the physicochemical properties of the surface waters and sediments along la Rocina stream, which feeds el Rocio marsh in Doñana National Park (Tortosa et al., 2011), two sites, el Acebrón lagoon (S1, UTM coordinates 29S 0718632, 4114294) and la Cañada creek (S2, UTM coordinates 29S 0722653, 4111704), with the lowest and highest nitrate concentrations, respectively, were selected for this study (see Fig. 1 in Tortosa et al., 2011). Sediment samples were taken as indicated earlier (Tortosa et al., 2011) in April and October of 2008, 2009 and 2010 in order to represent the wet and dry pluvial regimes, respectively. Samples were kept on ice while being returned to the laboratory and then stored at −80°C until use.

Denitrification activity

Denitrifying enzyme activity was measured as previously described (Šimek and Hopkins, 1999; Šimek et al., 2004). Briefly, 25 g of sediment was placed in 125 ml glass bottles containing 25 ml of a solution made of 1 mM glucose, 1 mM KNO3 and 1 g/l chloramphenicol. The bottles were closed with serum caps and acetylene (10% v/v) was injected into each bottle to inhibit the nitrous oxide reductase (Yoshinari and Knowles, 1976). After incubation for 1 h at 25°C, gas samples (500 µl) were withdrawn from the headspace and injected into a gas chromatograph equipped with an electron capture detector (ECD) and a Porapak Q-packed stainless-steel column (180 × 0.32 cm) (Agilent Technologies, S.L., Madrid, Spain). N2 at 20 ml/min served as the carrier gas.
Oven, detector and injector temperatures were 60, 375 and 125°C, respectively. Concentrations of nitrous oxide in each sample were calculated from standards of pure nitrous oxide. The Bunsen coefficient for N2O dissolved in water was taken into account in the calculations.

DNA extraction

DNA was extracted from 250 mg of each subsample stored at −80°C according to the ISO standard 11063 ‘Soil quality – Method to directly extract DNA from soil samples’ (Petrić et al., 2011). Briefly, samples were homogenized in 1 ml of extraction buffer (1 M Tris-HCl, 0.5 M EDTA, 1 M NaCl, 20% PVP 40, 20% SDS) for 30 s at 1600 rpm in a mini-bead beater cell disrupter (Mikro-Dismembrator S; B. Braun Biotech International, Germany). Soil and cell debris were removed by centrifugation (14,000 × g for 1 min at 4°C). After precipitation with ice-cold isopropanol, nucleic acids were purified using both PVPP and GeneClean Turbo Kit (MP Bio, USA) spin columns. Quality and size of soil DNAs were checked by electrophoresis on 1% agarose. DNA was also quantified by spectrophotometry at 260 nm using a BioPhotometer (Eppendorf, Germany).
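The correction for dissolved N2O mentioned under 'Denitrification activity' can be sketched as follows. This is a hedged illustration rather than the authors' calculation: it assumes the commonly used form in which the total N2O in the bottle equals the headspace concentration multiplied by the gas volume plus the liquid volume weighted by the Bunsen coefficient. The default coefficient of 0.544 is a commonly tabulated value for N2O in water at 25°C and should be checked against a reference table; the headspace volume, sediment dry mass and measured concentration below are hypothetical.

```python
def total_n2o(headspace_conc, headspace_ml, liquid_ml, bunsen=0.544):
    """Total N2O in the bottle (gas phase plus dissolved), in the units of
    headspace_conc multiplied by ml. The bunsen default is an assumed value."""
    return headspace_conc * (headspace_ml + bunsen * liquid_ml)


def denitrification_rate(headspace_conc, headspace_ml, liquid_ml,
                         dry_mass_g, hours, bunsen=0.544):
    """N2O produced per g dry sediment per hour."""
    total = total_n2o(headspace_conc, headspace_ml, liquid_ml, bunsen)
    return total / (dry_mass_g * hours)


# Hypothetical example: 40 ng N2O-N per ml of headspace gas, 80 ml headspace,
# 25 ml of solution (as in the text), 20 g dry sediment, 1 h incubation.
rate = denitrification_rate(headspace_conc=40.0, headspace_ml=80.0,
                            liquid_ml=25.0, dry_mass_g=20.0, hours=1.0)
print(f"{rate:.1f} ng N-N2O per g dry sediment per hour")
```

Setting bunsen=0 in the sketch shows how much N2O would be missed if only the headspace were considered.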


Quantification of the denitrification-associated microbial community

The size of the denitrifier community was estimated by quantitative real-time PCR (qPCR) of narG, napA, nirK, nirS and nosZ gene fragments using reaction mixtures, primers and thermal cycling conditions described previously (Bru et al., 2011; Correa-Galeote et al., 2013a). The total bacterial community was quantified using the 16S rRNA gene as a molecular marker as described by Correa-Galeote et al. (2013b). Reactions were carried out in an ABI Prism 7900 Sequence Detection System (Applied Biosystems). Quantification was based on the fluorescence intensity of the SYBR Green dye during amplification. Two independent qPCR assays were performed for each gene. Standard curves were obtained using serial dilutions of linearized plasmids containing cloned narG, napA, nirK, nirS, nosZ and 16S rRNA genes amplified from bacterial strains. PCR efficiency for the different assays ranged between 90% and 99%. No-template controls gave null or negligible values. The presence of PCR inhibitors in DNA extracted from sediments was estimated by (i) diluting the sediment DNA extract and (ii) mixing a known amount of standard DNA with the sediment DNA extract prior to qPCR. Inhibition was not detected in any case. Methodological evaluation of the real-time PCR assays showed a good reproducibility of 95.0 ± 12% between two runs. Gene abundances were analysed as absolute and relative abundances (gene copy number/bacterial 16S rRNA gene copy number). As the number of 16S rRNA gene operons per cell is variable (Klappenbach et al., 2001), we did not convert the 16S rRNA gene copy data into cell numbers, and we expressed our results as gene copy numbers per g of soil.

Clone library construction and DNA sequencing

The construction of genomic libraries using the nosZ gene as a molecular marker for the April and October sampling months and for the S1 and S2 sampling sites was limited to years 2009 and 2010.
nosZ amplicons were purified using the QIAquick PCR purification kit (Qiagen, Germany) and cloned using the pGEM-T Easy cloning kit according to the manufacturer’s instructions (Promega, USA). The recombinant Escherichia coli JM109 cells were plated onto solid Luria–Bertani (LB) medium (Miller, 1972) containing ampicillin and X-Gal (5-bromo-4-chloro-3-indolyl-β-d-galactopyranoside), and grown overnight at 37°C. White colonies were screened by PCR using the vector primers Sp6 and T7 (Invitrogen). Purity of amplified products was checked by observation of a single band of the expected size in a 1% agarose gel stained with GelRed as indicated by the manufacturer (Biotium Inc., USA). Nucleotide sequences of clones containing inserts of the expected size were determined by sequencing with the vector primer Sp6 and the BigDye terminator cycle kit v3.1 (Applied Biosystems, USA) according to the manufacturer’s instructions, followed by electrophoresis on an ABI 3100 genetic analyser (Applied Biosystems, USA) at the sequencing facilities of Estación Experimental del Zaidín, CSIC, Granada, Spain.

Phylogenetic analysis

The DNA sequences of nosZ gene fragments were aligned by using the ClustalW program available in the Geneious software package (version 6.0.3, Biomatters, New Zealand). Vector sequence was removed and discrepancies in alignment were verified manually. The obtained sequences were compared against database sequences using the BLASTN program in Geneious, and those showing similarity higher than 80% to sequences previously deposited for nosZ were selected as positives. A distance matrix was calculated according to Kimura’s


two-parameter model (Kimura, 1980) using the dnadist program of the PHYLIP 3.68 package (University of Washington, USA). Richness, as the number of operational taxonomic units (OTUs), and the Chao1, Shannon–Weaver and Simpson diversity indexes were calculated using the Mothur program (Schloss et al., 2009). In this study, 3% sequence divergence was used to define OTUs and compare libraries. The Good’s coverage index was calculated according to Magurran (2004). A phylogenetic tree was constructed from a matrix of pairwise genetic distances by using the neighbour-joining method available in Geneious. Bootstrap analysis was based on 1000 resamplings.

Statistical analyses

Gene abundances were analysed as absolute or relative abundances (gene copy number/bacterial 16S rRNA gene copy number) (Correa-Galeote et al., 2013). The absolute abundance of a given denitrification gene was used to analyse changes in the total population containing the gene, while the relative abundance was used to assess specific changes of the gene with respect to the total bacterial community. Measured variables were first explored using the Shapiro–Wilk test to check whether they met the normality assumptions. We used the Mann–Whitney test to compare data between sampling sites and times of sampling, and the combined Kruskal–Wallis and Conover–Iman tests for comparisons among samples. A Spearman correlation matrix was computed to study relations between measured variables. A principal component analysis (PCA) was performed to analyse relationships among nitrate content, denitrification activity and the relative abundance of denitrification genes. A canonical correspondence analysis (CCA) was performed to determine the effect of the nitrate content on the structure of the nosZ-bearing communities. Multivariate analyses were carried out with the PC-ORD software, version 6.08 (MJM).
The analysis of molecular variance (AMOVA) to determine population-specific differences among clone libraries was run using Mothur (Schloss et al., 2009). Good’s coverage, OTU richness and diversity (using the Chao1, Shannon–Weaver and Simpson indexes) were determined. Different indexes were calculated since they are differentially sensitive to the rarity or abundance of OTUs (Morris et al., 2014).

Nucleotide sequence accession numbers

The nucleotide sequences of nosZ reported in this study have been deposited in GenBank under the accession numbers KC936294 to KC936797.

Results

Nitrate content in sediments

For the 3-year study, nitrate content in sediments from site S1 varied between 0.03 and 0.06 mg N-NO3– per kg dry weight, and between 5.73 and 10.64 mg N-NO3– per kg dry weight in those taken at site S2 (Table 5.1). According to the Mann–Whitney test, nitrate content at S1 was lower than that at S2, regardless of the sampling season and year. Also, the content of nitrate in October was always higher than that found in April for each sampling site, except for year 2008 when no differences were detected.
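Because the site comparisons above rest on Mann–Whitney tests with n = 4 per group, it may help to see how such a test behaves at this sample size. The following is a minimal sketch of an exact two-sided Mann–Whitney test by full enumeration, written with the standard library only; the replicate values are hypothetical and are not data from this study, and the authors may have used a different variant of the test (for example a normal approximation).

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """U statistic for x: pairs (xi, yj) with xi > yj count 1, ties count 0.5."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)


def exact_two_sided_p(x, y):
    """Exact two-sided p-value from all ways of splitting the pooled values."""
    pooled = list(x) + list(y)
    n, total_n = len(x), len(x) + len(y)
    centre = len(x) * len(y) / 2          # expected U under the null hypothesis
    observed = abs(mann_whitney_u(x, y) - centre)
    extreme = total = 0
    for idx in combinations(range(total_n), n):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(total_n) if i not in chosen]
        total += 1
        if abs(mann_whitney_u(gx, gy) - centre) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical replicate values (mg N per kg), completely separated groups.
site_1 = [0.02, 0.03, 0.04, 0.05]
site_2 = [5.5, 5.7, 5.8, 5.9]

print(f"U = {mann_whitney_u(site_1, site_2)}, "
      f"exact two-sided p = {exact_two_sided_p(site_1, site_2):.4f}")
```

With four replicates per group there are 70 possible splits, so the smallest attainable exact two-sided p-value is 2/70 (about 0.029). Under this exact variant, only completely separated groups reach significance at α = 0.05.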


Table 5.1 Nitrate content and potential denitrification activity in sediments from la Rocina stream. Values of nitrate concentration (mean ± SE, n = 4) are expressed as mg N-NO3– per kg dry sediment. Values of activity (mean ± SE, n = 4) are expressed as ng N-N2O per g dry sediment per hour. According to the Mann–Whitney test (α = 0.05), letters a and b indicate significant differences between sampling months for a given sampling site and year, and letters A and B show significant differences between sampling sites for a given sampling month and year.

| Year | Sampling month | Sampling site | Nitrate content (mg N-NO3– per kg dry sediment) | Potential denitrification activity (ng N-N2O per g dry sediment per hour) |
|---|---|---|---|---|
| 2008 | April | S1 | 0.04 ± 0.01 (a, B) | 164 ± 8.72 (a, B) |
| 2008 | April | S2 | 5.73 ± 0.09 (b, A) | 1393 ± 121 (a, A) |
| 2008 | October | S1 | 0.04 ± 0.01 (a, B) | 114 ± 8.08 (a, A) |
| 2008 | October | S2 | 7.31 ± 0.12 (a, A) | 130 ± 16.46 (b, A) |
| 2009 | April | S1 | 0.05 ± 0.01 (a, B) | 164 ± 7.28 (a, B) |
| 2009 | April | S2 | 7.02 ± 0.34 (b, A) | 1616 ± 122 (a, A) |
| 2009 | October | S1 | 0.03 ± 0.01 (b, B) | 126 ± 9.74 (a, A) |
| 2009 | October | S2 | 10.64 ± 0.27 (a, A) | 137 ± 9.38 (b, A) |
| 2010 | April | S1 | 0.04 ± 0.01 (b, B) | 194 ± 17.48 (a, B) |
| 2010 | April | S2 | 6.01 ± 0.33 (b, A) | 1134 ± 44.91 (a, A) |
| 2010 | October | S1 | 0.06 ± 0.02 (a, B) | 113 ± 9.88 (a, A) |
| 2010 | October | S2 | 7.70 ± 0.26 (a, A) | 134 ± 6.96 (b, A) |

Denitrification activity

Potential denitrification activity (PDA), determined as N2O emission, varied between 113 and 194 ng N-N2O per g dry sediment per hour in sediments from S1 and between 130 and 1616 ng N-N2O per g dry sediment per hour in sediments from S2 (Table 5.1). For the 3-year study, PDA at S2 was statistically higher than that at S1 for the samples taken in April, and no differences were found in samples taken during October. At S1, PDA detected in April was always similar to that observed in October, while at S2, PDA was higher in April than in October.

Quantification of 16S rRNA, narG, napA, nirS, nirK and nosZ genes

The number of 16S rRNA genes in the sediment samples ranged from 7.38 × 10⁶ to 2.91 × 10⁹ copies per g dry sediment (Table 5.2). No significant differences between sites were observed regardless of the year and sampling season, except for samples taken in October 2010, when the copy number of the 16S rRNA gene at S1 was higher than that at S2. Comparing months at the same site, at S1 the number of 16S rRNA genes was higher in October than in April, except for year 2009 when both months had similar values. At S2, however, no differences in the 16S rRNA gene copy number were detected between months within the same year, except for October 2009, when the number of target genes was higher than in April 2009. Among the denitrification genes analysed in the sediments of la Rocina stream, narG was the most abundant gene (from 2.19 × 10⁶ to 3.53 × 10⁸ copies per g dry sediment), followed by napA (from 1.57 × 10⁶ to 3.84 × 10⁷ copies per g dry sediment), nirS (from 3.91 × 10⁵ to 2.72 × 10⁸ copies per g dry sediment), nirK (from 1.17 × 10⁵ to 2.22 × 10⁷ copies per g dry sediment), and nosZ was the gene with the lowest copy number (from 1.67 × 10⁴ to

Table 5.2 Abundance of 16S rRNA and narG, napA, nirS, nirK and nosZ denitrification genes in sediments from la Rocina stream. Values (mean ± SE, n = 4) are expressed as gene copy numbers per g of dry sediment. According to the Mann–Whitney test (α = 0.05), letters a and b show statistically significant differences between sampling months for a given sampling site and year, and letters A and B represent statistically significant differences between sampling sites for a given sampling month and year.

| Year | Sampling month | Sampling site | 16S rRNA | narG | napA | nirK | nirS | nosZ |
|---|---|---|---|---|---|---|---|---|
| 2008 | April | S1 | 4.52 × 10⁷ ± 4.89 × 10⁶ (b, A) | 4.25 × 10⁶ ± 4.20 × 10⁵ (a, A) | 2.69 × 10⁶ ± 2.42 × 10⁵ (a, A) | 3.69 × 10⁶ ± 3.07 × 10⁵ (a, A) | 7.89 × 10⁵ ± 8.38 × 10⁴ (b, B) | 5.03 × 10⁴ ± 4.02 × 10³ (b, A) |
| 2008 | April | S2 | 2.87 × 10⁷ ± 4.27 × 10⁶ (a, A) | 4.76 × 10⁶ ± 3.22 × 10⁵ (b, A) | 3.03 × 10⁶ ± 5.52 × 10⁵ (b, A) | 2.05 × 10⁶ ± 1.98 × 10⁵ (b, B) | 1.51 × 10⁶ ± 9.97 × 10⁴ (b, A) | 7.01 × 10⁴ ± 8.40 × 10³ (b, A) |
| 2008 | October | S1 | 5.68 × 10⁷ ± 2.27 × 10⁶ (a, A) | 5.10 × 10⁶ ± 9.16 × 10⁴ (a, B) | 3.88 × 10⁶ ± 8.48 × 10⁵ (a, A) | 4.99 × 10⁶ ± 3.20 × 10⁵ (a, A) | 1.80 × 10⁶ ± 7.77 × 10⁴ (a, B) | 7.80 × 10⁴ ± 7.98 × 10³ (a, B) |
| 2008 | October | S2 | 4.33 × 10⁷ ± 4.62 × 10⁶ (a, A) | 1.44 × 10⁷ ± 2.27 × 10⁶ (a, A) | 9.45 × 10⁶ ± 1.64 × 10⁶ (a, A) | 3.87 × 10⁶ ± 3.27 × 10⁵ (a, A) | 4.42 × 10⁶ ± 9.09 × 10⁵ (a, A) | 2.28 × 10⁵ ± 1.27 × 10⁴ (a, A) |
| 2009 | April | S1 | 8.29 × 10⁷ ± 1.30 × 10⁷ (a, A) | 4.32 × 10⁶ ± 6.32 × 10⁵ (a, A) | 1.93 × 10⁶ ± 2.62 × 10⁵ (a, A) | 4.10 × 10⁶ ± 9.53 × 10⁵ (a, A) | 4.13 × 10⁵ ± 7.02 × 10⁴ (a, B) | 3.24 × 10⁴ ± 4.62 × 10³ (a, B) |
| 2009 | April | S2 | 7.38 × 10⁶ ± 3.44 × 10⁶ (a, A) | 7.69 × 10⁶ ± 7.45 × 10⁵ (a, A) | 3.82 × 10⁶ ± 4.65 × 10⁵ (a, A) | 2.35 × 10⁶ ± 1.24 × 10⁵ (a, A) | 1.81 × 10⁶ ± 2.51 × 10⁵ (a, A) | 2.13 × 10⁵ ± 8.47 × 10³ (a, A) |
| 2009 | October | S1 | 3.76 × 10⁷ ± 7.96 × 10⁶ (a, A) | 2.57 × 10⁶ ± 2.61 × 10⁵ (a, B) | 1.57 × 10⁶ ± 4.43 × 10⁵ (a, B) | 1.17 × 10⁵ ± 2.12 × 10⁵ (b, A) | 3.91 × 10⁵ ± 9.82 × 10⁴ (a, B) | 1.67 × 10⁴ ± 3.25 × 10³ (a, B) |
| 2009 | October | S2 | 2.70 × 10⁷ ± 3.52 × 10⁶ (b, A) | 9.29 × 10⁶ ± 1.60 × 10⁶ (a, A) | 4.60 × 10⁶ ± 5.93 × 10⁵ (a, A) | 2.13 × 10⁶ ± 2.22 × 10⁵ (a, A) | 3.33 × 10⁶ ± 3.15 × 10⁵ (a, A) | 1.35 × 10⁵ ± 1.21 × 10⁴ (b, A) |
| 2010 | April | S1 | 2.49 × 10⁸ ± 3.24 × 10⁷ (b, A) | 2.19 × 10⁶ ± 1.66 × 10⁶ (b, A) | 1.47 × 10⁷ ± 2.41 × 10⁶ (b, A) | 1.35 × 10⁶ ± 1.39 × 10⁶ (a, A) | 4.84 × 10⁶ ± 5.67 × 10⁵ (b, A) | 3.91 × 10⁵ ± 9.15 × 10⁴ (b, B) |
| 2010 | April | S2 | 2.01 × 10⁸ ± 4.53 × 10⁷ (a, A) | 2.08 × 10⁷ ± 4.87 × 10⁶ (a, A) | 1.10 × 10⁷ ± 4.25 × 10⁶ (a, A) | 1.21 × 10⁷ ± 3.56 × 10⁶ (a, A) | 4.36 × 10⁶ ± 6.96 × 10⁵ (b, A) | 5.44 × 10⁵ ± 9.45 × 10⁴ (a, A) |
| 2010 | October | S1 | 2.91 × 10⁹ ± 9.68 × 10⁸ (a, A) | 3.53 × 10⁸ ± 1.28 × 10⁸ (a, A) | 3.84 × 10⁷ ± 8.92 × 10⁶ (a, A) | 2.15 × 10⁷ ± 7.17 × 10⁶ (a, A) | 2.72 × 10⁸ ± 8.52 × 10⁷ (a, A) | 4.67 × 10⁶ ± 1.39 × 10⁶ (a, A) |
| 2010 | October | S2 | 2.40 × 10⁸ ± 9.39 × 10⁷ (a, B) | 5.61 × 10⁷ ± 2.26 × 10⁷ (a, A) | 1.52 × 10⁷ ± 5.98 × 10⁶ (a, A) | 2.22 × 10⁷ ± 7.96 × 10⁶ (a, A) | 5.12 × 10⁷ ± 2.47 × 10⁷ (a, A) | 1.80 × 10⁵ ± 1.63 × 10⁶ (a, A) |

78 | Correa-Galeote et al.

4.67 × 10^6 copies per g dry sediment) (Table 5.2). In general, spatiotemporal variations were not observed for the narG, napA and nirK genes; however, the Mann–Whitney tests showed that nirS and nosZ followed the same variations as the nitrate content at sites S1 and S2 (Table 5.2). It should be noted, however, that there were some exceptions to the general spatiotemporal patterns, mainly associated with the nirS and nosZ genes.

Relative abundance of the narG, napA, nirS, nirK and nosZ denitrification genes

On average for the 3-year study (Table 5.3), at S1, the mean relative abundances of the narG (8.09%), napA (4.79%), nirS (6.16%), nirK (1.38%) and nosZ (0.10%) genes in April were similar to those of 9.56%, 5.50%, 6.46%, 4.61% and 0.12% found for the narG, napA, nirK, nirS and nosZ genes in October, respectively. In contrast, in April at S2, relative abundances of the narG (13.21%), napA (7.44%), nirS (5.61%), nirK (3.54%) and nosZ (0.28%) genes

Table 5.3 Relative abundance of narG, napA, nirK, nirS and nosZ denitrification genes in sediments from la Rocina stream. Values (n = 4 ± SE) are expressed as the percentage ratio between a given denitrification gene copy number and the 16S rRNA gene copy number. According to the Mann–Whitney test (α = 0.05), letters a and b show significant differences between sampling months for a given sampling site and year, and letters A and B represent significant differences between sampling sites for a given sampling month and year.

| Year | Sampling month | Sampling site | narG (%) | napA (%) | nirK (%) | nirS (%) | nosZ (%) |
|---|---|---|---|---|---|---|---|
| 2008 | April | S1 | 9.71 ± 1.10 (a, B) | 6.11 ± 0.56 (a, B) | 8.29 ± 0.32 (a, A) | 1.78 ± 0.18 (b, B) | 0.11 ± 0.01 (a, B) |
| 2008 | April | S2 | 17.92 ± 1.93 (a, A) | 12.46 ± 1.22 (a, A) | 7.81 ± 1.13 (a, A) | 5.70 ± 0.61 (b, A) | 0.25 ± 0.01 (b, A) |
| 2008 | October | S1 | 9.08 ± 0.44 (a, B) | 8.19 ± 0.79 (a, B) | 8.82 ± 0.52 (a, A) | 3.23 ± 0.28 (a, B) | 0.14 ± 0.01 (a, B) |
| 2008 | October | S2 | 34.79 ± 4.73 (a, A) | 21.48 ± 2.38 (a, A) | 9.33 ± 0.76 (a, A) | 9.93 ± 1.45 (a, A) | 0.57 ± 0.07 (a, A) |
| 2009 | April | S1 | 5.25 ± 0.06 (a, B) | 1.92 ± 0.12 (b, B) | 4.69 ± 0.47 (a, B) | 0.40 ± 0.04 (b, B) | 0.04 ± 0.01 (a, B) |
| 2009 | April | S2 | 10.59 ± 1.24 (b, A) | 6.46 ± 0.58 (b, A) | 3.22 ± 0.22 (a, A) | 2.47 ± 0.34 (b, A) | 0.29 ± 0.02 (b, A) |
| 2009 | October | S1 | 8.35 ± 1.38 (a, B) | 5.74 ± 0.93 (a, B) | 3.28 ± 0.20 (a, B) | 1.42 ± 0.24 (a, B) | 0.05 ± 0.01 (a, B) |
| 2009 | October | S2 | 36.16 ± 6.17 (a, A) | 17.05 ± 0.94 (a, A) | 8.57 ± 1.35 (b, A) | 12.94 ± 1.13 (a, A) | 0.53 ± 0.07 (a, A) |
| 2010 | April | S1 | 9.31 ± 0.85 (a, B) | 3.41 ± 0.27 (a, A) | 5.51 ± 0.28 (b, A) | 1.96 ± 0.03 (b, A) | 0.15 ± 0.01 (a, B) |
| 2010 | April | S2 | 11.26 ± 2.24 (b, A) | 6.35 ± 1.13 (b, A) | 5.81 ± 0.85 (a, A) | 2.45 ± 0.35 (a, A) | 0.29 ± 0.02 (b, A) |
| 2010 | October | S1 | 11.26 ± 0.77 (a, B) | 2.57 ± 0.53 (a, B) | 7.28 ± 0.26 (a, B) | 9.18 ± 0.70 (b, B) | 0.17 ± 0.02 (a, B) |
| 2010 | October | S2 | 22.47 ± 0.51 (a, A) | 15.08 ± 3.60 (a, A) | 9.75 ± 0.31 (a, A) | 17.20 ± 1.57 (a, A) | 0.51 ± 0.09 (a, A) |


were lower than those found in October, which were 31.14%, 17.87%, 9.22%, 13.36% and 0.57% for the narG, napA, nirS, nirK and nosZ genes, respectively (Table 5.3).

Spatial and temporal variations of denitrification genes relative abundances

In general, the relative abundance of each of the narG, napA, nirS and nosZ genes in sediments taken at S2 was higher than in sediments taken at S1; some exceptions, however, were found throughout the 3-year study, mainly due to the relative abundances of the napA and nirS genes. The relative abundance of the nirK gene did not show any spatial or temporal variation over the study period (Table 5.3). During the 3-year study, the sampling month did not affect the relative abundance of the narG, napA, nirK and nosZ genes at S1, whereas the relative abundance of nirS was always higher in October (Table 5.3). At S2, the relative abundances of narG, napA, nirS and nosZ were lower in April than in October, and differences were not observed for the nirK gene (Table 5.3).

Correlation tests and multivariate analysis

A Spearman test showed that the correlation between nitrate content and the abundance of each denitrification gene was very weak (supplementary material Table 5.1S). In contrast, there were strong, significant correlations between nitrate content and the relative abundance of the narG, napA, nirS and nosZ genes, with nosZ showing the highest value (Table 5.4). There were also significant correlations among the relative abundances of the narG, napA, nirS, nirK and nosZ genes, with high values for some pairs, such as nosZ and narG (Table 5.4). There were no significant correlations between denitrification activity and either nitrate content or any of the denitrification genes, considering either gene abundances (supplementary material Table 5.1S) or relative gene abundances (Table 5.4). Fig. 5.1 shows the PCA including the variables nitrate concentration, denitrification activity and relative abundance of denitrification genes.
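As an illustration of the Spearman test mentioned above, the following is a minimal sketch of a rank correlation with average ranks for ties. It is not the authors' procedure: the function names are hypothetical, and the input is the twelve mean relative abundances of narG and nosZ listed in Table 5.3 rather than the individual replicates, so the coefficient it prints is not expected to reproduce the value given in Table 5.4.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Mean relative abundances (%) from Table 5.3, in table row order
# (illustrative input only; the published test used the replicates).
narG_pct = [9.71, 17.92, 9.08, 34.79, 5.25, 10.59,
            8.35, 36.16, 9.31, 11.26, 11.26, 22.47]
nosZ_pct = [0.11, 0.25, 0.14, 0.57, 0.04, 0.29,
            0.05, 0.53, 0.15, 0.29, 0.17, 0.51]

rho = spearman_rho(narG_pct, nosZ_pct)
print(f"Spearman rho (narG vs nosZ, table means): {rho:.3f}")
```

Computing the coefficient on ranks rather than raw values makes it insensitive to the large differences in scale between genes, which is presumably why a rank-based test suits data of this kind.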
PCA1 accounted for 61.72% of the total variance in the data. Nitrate concentration and the relative abundance of denitrification genes were positively correlated with PCA1, with the relative abundances of narG (r = 0.927) and nosZ (r = 0.929) as the main contributors. PCA2 accounted for an additional 17.91% of the variation in the data, and this variation was described almost exclusively by the

Table 5.4 Spearman coefficient values between nitrate content, relative abundance of the narG, napA, nirS, nirK and nosZ denitrification genes and potential denitrification activity in sediments from la Rocina stream. Values followed by an asterisk (*) are statistically significant at the 0.05 level; NS, not significant.

| | Nitrate content | narG | napA | nirK | nirS | nosZ |
|---|---|---|---|---|---|---|
| narG | 0.677* | | | | | |
| napA | 0.564* | 0.765* | | | | |
| nirK | 0.278NS | 0.627* | 0.406* | | | |
| nirS | 0.668* | 0.784* | 0.634* | 0.585* | | |
| nosZ | 0.856* | 0.817* | 0.690* | 0.483* | 0.793* | |
| Potential denitrification activity | 0.019NS | −0.030NS | 0.050NS | −0.127NS | −0.038NS | 0.134NS |


Figure 5.1 Principal components analysis (PCA) of nitrate content, relative abundance of the narG, napA, nirS, nirK and nosZ denitrification genes and potential denitrification activity. Percentages of PCA1 and PCA2 explaining variance are shown. Group centroids are indicated by +. Sediments were taken in April (A) and October (O) 2009 (09) and 2010 (10) at El Acebrón lagoon (S1) and la Cañada creek (S2).

denitrification activity variable (r = 0.927). No correlation was found between denitrification activity and the other six variables (supplementary material Table 5.2S). Samples taken at sites S1 and S2 were mostly separated along PCA1 regardless of the sampling year, and ranked highly negatively and positively for PCA1, respectively. Samples from S2 were also separated according to the sampling season: most of the samples taken in April were associated with denitrification activity and separated along PCA2. PCA2 also separated S1 samples from S2 samples, although the percentage of explained variability was low. S1 samples were not separated by sampling season.

Analysis of clone libraries

The eight nosZ libraries contained 504 clones grouped into 109 OTUs (supplementary material Table 5.3S). At S1, 65 and 63 clones were obtained in April 2009 and 2010, respectively, and 61 and 60 in October 2009 and 2010, respectively. At S2, 58 clones were obtained in both April 2009 and April 2010, whereas 70 and 69 clones were obtained in October 2009 and 2010, respectively. The Good’s coverage index for each library (Table 5.5) was higher than 75%, which indicates that the sampling effort was sufficient to permit extrapolation to the total nosZ biodiversity in the samples. Two libraries, corresponding to October 2009 and 2010 at S2, contained a significantly higher number of OTUs (35 and 34 OTUs, respectively), whereas the remaining six libraries contained between 25 and 29 OTUs (Table 5.5 and supplementary material Table 5.3S). The eight libraries had similar Chao1 and Shannon–Weaver


Table 5.5 Diversity indexes of nosZ clone libraries from la Rocina stream sediments as estimated with the Simpson index and the Shannon–Weaver and Chao1 richness estimators computed using Mothur. According to the Kruskal–Wallis test (α = 0.05), letters a and b show significant differences between sampling months and sites.

| Year | Sampling month | Sampling site | Number of clones | Good’s coverage (%) | Number of OTUs | Chao1 | Shannon–Weaver | Simpson |
|---|---|---|---|---|---|---|---|---|
| 2009 | April | S1 | 65 | 75.38 | 29 b | 53.0 a | 3.07 a | 0.047 a |
| 2009 | April | S2 | 58 | 77.59 | 25 b | 38.0 a | 2.92 a | 0.053 a |
| 2009 | October | S1 | 61 | 78.69 | 27 b | 36.7 a | 3.02 a | 0.050 a |
| 2009 | October | S2 | 70 | 75.29 | 35 a | 48.9 a | 3.15 a | 0.034 b |
| 2010 | April | S1 | 63 | 76.19 | 29 b | 44.0 a | 3.10 a | 0.045 a |
| 2010 | April | S2 | 58 | 79.31 | 26 b | 33.3 a | 2.96 a | 0.055 a |
| 2010 | October | S1 | 60 | 76.66 | 29 b | 31.1 a | 3.11 a | 0.044 a |
| 2010 | October | S2 | 69 | 75.36 | 34 a | 45.3 a | 3.29 a | 0.031 b |

indexes but differed in their Simpson diversity index values, which were significantly lower at S2 in October (Table 5.5). Construction of a phylogenetic tree based on the 504 nosZ sequences showed that they were distributed into 31 clusters (Fig. 5.2). Overall, members of the class Betaproteobacteria were the most abundant (59.1%), followed by those of the Alphaproteobacteria (39.5%) and the Gammaproteobacteria (1.4%). Clusters C4, C5, C6, C7 and C10 within the Alphaproteobacteria and clusters C15, C16, C17, C18, C20, C24, C25, C27, C28 and C30 within the Betaproteobacteria contained clones showing homology with unclassified nosZ gene sequences deposited in GenBank (supplementary material Table 5.4S). The proportion of unclassified clones was clearly higher in libraries from site S1 (56.63%) than in those from S2 (23.92%). The most abundant clusters were C1, which contains 52 clones of the family Bradyrhizobiaceae, C22, which includes 44 clones of the Rhodocyclaceae, and clusters C24 (43 clones) and C27 (40 clones), with unidentified families (supplementary material Table 5.4S). Only 14 clones were members of the family Pseudomonadaceae within the Gammaproteobacteria, and they were all found at S1 (supplementary material Table 5.4S). The unclassified clusters C4, C5 and C10 from the Alphaproteobacteria, C15 and C27 from the Betaproteobacteria, and members of the family Pseudomonadaceae from the Gammaproteobacteria in C14 were found only at S1 (Fig. 5.3). The C2 Bradyrhizobiaceae cluster, the C3 and C12 Rhizobiaceae clusters and the C9 Beijerinckiaceae cluster of the Alphaproteobacteria, and the C26 unclassified Burkholderiales and C31 Comamonadaceae clusters of the Betaproteobacteria, were present only at S2.
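The coverage and diversity estimators summarised in Table 5.5 can be illustrated with a short sketch. Everything below rests on stated assumptions: the formulas are the commonly used ones (Good's coverage as 1 − n1/N, bias-corrected Chao1, Shannon entropy with natural logarithms, and the finite-sample Simpson index), which may differ in detail from the software settings the authors used, and the OTU count vector is hypothetical. It was chosen only so that it has 65 clones, 29 OTUs, 16 singletons and 4 doubletons, values that are consistent with the coverage (75.38%) and Chao1 (53.0) reported for the April 2009 S1 library; the real OTU distribution is in supplementary Table 5.3S and is not reproduced here.

```python
import math


def goods_coverage(counts):
    """Good's coverage in percent: 100 * (1 - singletons / total clones)."""
    total = sum(counts)
    singletons = sum(1 for c in counts if c == 1)
    return 100.0 * (1 - singletons / total)


def chao1_bias_corrected(counts):
    """Bias-corrected Chao1: S_obs + n1 * (n1 - 1) / (2 * (n2 + 1))."""
    s_obs = sum(1 for c in counts if c > 0)
    n1 = sum(1 for c in counts if c == 1)
    n2 = sum(1 for c in counts if c == 2)
    return s_obs + n1 * (n1 - 1) / (2 * (n2 + 1))


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Finite-sample Simpson index D = sum(n_i * (n_i - 1)) / (N * (N - 1))."""
    total = sum(counts)
    return sum(c * (c - 1) for c in counts) / (total * (total - 1))


# Hypothetical clone counts per OTU (an assumption, not the published data).
counts = [1] * 16 + [2] * 4 + [3, 3, 4, 4, 4, 5, 5, 6, 7]

print(f"clones = {sum(counts)}, OTUs = {len(counts)}")
print(f"Good's coverage = {goods_coverage(counts):.2f}%")
print(f"Chao1 = {chao1_bias_corrected(counts):.1f}")
print(f"Shannon = {shannon(counts):.2f}")
print(f"Simpson = {simpson(counts):.3f}")
```

Because only the singleton and doubleton counts were matched to the table, the Shannon and Simpson values from this vector should be read as illustrative, not as a reconstruction of the published figures. Note that with this form of the Simpson index a lower value corresponds to a more even, more diverse library.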
Spatial and temporal variation of denitrifier community diversity

AMOVA of the 504 nosZ sequences indicated that 3.62% of the total sequence variation occurred among libraries and 96.38% within the clone libraries (supplementary material Table 5.5S), which indicates that most of the diversity was distributed within the libraries rather than between them. At S1, pairwise comparisons revealed that sequences from April 2009 and 2010 and October 2009 were


Figure 5.2 Neighbour-joining phylogenetic tree based on 504 nosZ DNA sequences cloned from la Rocina stream sediments and other cultured bacteria. Sediments were taken in April (A) and October (O) 2009 (09) and 2010 (10) at El Acebrón lagoon (S1) and la Cañada creek (S2). The significance of each branch is indicated by a bootstrap value calculated for 1000 subsets.

statistically the same population, and that those from October 2010 were significantly different (Table 5.6). In contrast, at S2, sequences in the clone libraries from April 2009 and 2010 and October 2009 were statistically different populations, but no differences were found between sequences in the October 2009 and 2010 clone libraries (Table 5.6). A CCA sample ordination based on the relative abundance of the nosZ clusters within each clone library showed that the eight samples were distributed in two clearly separated groups (Fig. 5.4). The first two CCA axes, with canonical coefficients of 1.01 and −0.024 for axes 1 and 2, respectively, explained 42.2% of the total variance and revealed that the nitrate concentration of the sediments could be responsible for the grouping of the clone libraries along the two axes.
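To make the among-library versus within-library partition reported by AMOVA more concrete, here is a minimal sketch of a one-way analysis of molecular variance in its standard textbook form (sums of squared deviations derived from pairwise distances). It is an assumption-laden illustration, not the authors' pipeline: the function names, the use of Hamming distance between aligned sequences, and the four toy sequences are all hypothetical, and significance testing by permutation is omitted.

```python
from itertools import combinations


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def amova(groups):
    """One-way AMOVA on a dict {library name: [aligned sequences]}."""
    all_seqs = [s for members in groups.values() for s in members]
    n_total = len(all_seqs)
    n_groups = len(groups)

    # Sums of squares from pairwise distances (total and within groups).
    ss_total = sum(hamming(a, b) for a, b in combinations(all_seqs, 2)) / n_total
    ss_within = sum(
        sum(hamming(a, b) for a, b in combinations(members, 2)) / len(members)
        for members in groups.values()
    )
    ss_among = ss_total - ss_within

    ms_among = ss_among / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Effective group size, which handles unequal library sizes.
    n0 = (n_total - sum(len(m) ** 2 for m in groups.values()) / n_total) / (n_groups - 1)

    var_among = (ms_among - ms_within) / n0
    var_within = ms_within
    total_var = var_among + var_within
    return {
        "phi_st": var_among / total_var,
        "pct_among": 100.0 * var_among / total_var,
        "pct_within": 100.0 * var_within / total_var,
    }


# Toy alignment with two hypothetical libraries (made-up sequences).
toy_libraries = {
    "libA": ["AAAA", "AAAT"],
    "libB": ["TTTT", "TTTA"],
}

result = amova(toy_libraries)
print(f"among libraries:  {result['pct_among']:.2f}%")
print(f"within libraries: {result['pct_within']:.2f}%")
```

In this toy case most of the variation lies among libraries, the opposite of the pattern described in the text, where nearly all variation was found within libraries; the contrast shows how the two percentages respond to how sequences are distributed across groups.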


Figure 5.3 Pie charts comparing the composition of the nosZ communities in sediments from la Rocina stream. To facilitate comparison between clusters, colour is used to indicate bacterial families and unidentified groups.


Table 5.6 Pairwise dissimilarity indexes (Fst) from AMOVA of nosZ clone libraries. Clones from S1 are shown in bold. An asterisk indicates statistical significance at the 0.05 level; NS, not significant. Apr and Oct stand for the months of April and October, respectively.
Clone library Clone library

Apr09

Apr09

Oct09

Apr10

Oct10

1.53

1.99NS

4.01*

1.25NS

3.02*

NS

Oct09

2.66*

Apr10

2.60*

3.16*

2.79*

1.55NS

Oct10

1.23NS 3.98*

Figure 5.4 Canonical correspondence analysis (CCA) of the composition of the 31 clusters found in the nosZ clone libraries. Crosses represent vector scores for the different clusters. Open and closed triangles represent the axes 1 and 2 scores for the clusters found in sediments taken in April (A) and October (O) 2009 (09) and 2010 (10) at El Acebrón lagoon (S1) and la Cañada creek (S2). The arrow represents the biplot vector for the nitrate concentration of the sediments.

Discussion

In this work, the denitrification genes narG, napA, nirK, nirS and nosZ were quantified using qPCR to estimate spatio-temporal variations of denitrifiers and to analyse their correlation with nitrate content and denitrification activity in sediments from la Rocina stream taken at two sites with relatively low (S1) and high (S2) nitrate concentration; in addition, the biodiversity of denitrifiers was analysed using the nosZ gene as a molecular marker.


Because of its status as a national park, Doñana is subject to special regulations and, consequently, anthropogenic effects derive mainly from the agricultural practices allowed in the ecotone of the park, where farming of rice and strawberries is common. In fact, diffuse nitrate contamination in Doñana National Park has been previously recorded (Serrano et al., 2006; Manzano et al., 2009; Espinar and Serrano, 2009; Tortosa et al., 2011). Sediment samples were taken in April and October in order to represent the dry and wet seasons, respectively. During the 3-year study, hydrological dynamics at each sampling site varied with sampling date, which was clearly visible at S2 in October, where stream waters were transformed into swampy waters and, finally, into almost dry sediments. With slight differences, the nitrate content at S1 was similar for the two sampling seasons and lower than that at S2, where nitrate content in October was always higher than in April. All those values, however, were lower than the 50 mg/l defined by the European directive 91/676/CEE as the upper limit for nitrate content in waters (European Commission, 1991). PDA values of the sediments were relatively low and remained constant at S1, but those found at S2 were greater and highly variable, and no clear relation was found between nitrate content and denitrification activity. Shifts in denitrification rates could be due to changes in water content at the end of the dry season in Doñana, when the water flow is scarce or even null as compared with that in April. Woodward et al. (2009) proposed that oxygenic conditions remaining in sediments after a drought period would result in inhibition of denitrification activity, and Tortosa et al. (2011) showed that the pluvial, seasonal regime of Doñana affected denitrification activity, with the lowest values of N2O emission found at the end of the dry season.
Temporal variations in denitrification activity have been reported in creek sediments (Rich and Myrold, 2004) and in agricultural (Dandie et al., 2008) and riparian soils (Deslippe et al., 2014). The copy number of the 16S rRNA gene was within the ranges previously determined by other authors in soil and sediment samples (Dandie et al., 2007; Bárta et al., 2010; García-Lledó et al., 2011; Keil et al., 2011). Likewise, abundances of denitrification genes were similar to those found for narG (Smith et al., 2007; Lindsay et al., 2010), napA (Marhan et al., 2011), nirK (Henry et al., 2006; Dandie et al., 2008; Su et al., 2010; Attard et al., 2011), nirS (Yoshida et al., 2009; Attard et al., 2011; Deslippe et al., 2014) and nosZ (Torrentó et al., 2011; Ma et al., 2011; Deslippe et al., 2014) genes from soils and sediments under different environmental conditions. Whereas spatial differences in gene abundances were not detected for any of the denitrification genes, the copy numbers of nirS and nosZ showed seasonal variations, with October being the month with the highest abundance of those genes. Using hybridization techniques, the abundance of the narG, nirK, nirS and nosZ genes in a forest soil was found to be higher in the autumn season (Merget et al., 2001), and so was the abundance of the narG, napA and nirS genes determined by qPCR in samples collected in October from estuarine sediments during a 1-year study (Smith et al., 2015a). Relative abundances of denitrification genes in sediments from la Rocina stream are within the range of those reported for narG (Henry et al., 2006; Čuhel et al., 2010), napA (Kandeler et al., 2009; Bru et al., 2011; Wieder et al., 2013), nirK (Chen et al., 2012a; Palmer et al., 2012), nirS (Chon et al., 2011; Chen et al., 2012a; Ligi et al., 2014a,b) and nosZ (Chen et al., 2012a; Ligi et al., 2014a,b) from different environmental samples after qPCR determination. In this study, relative abundances of narG/napA were always higher than


those of nirK/nirS which, in turn, widely exceeded those of nosZ. These results suggest that incomplete denitrifiers are more abundant than those able to carry out the complete denitrification process in sediments from la Rocina stream. They also agree with those found in constructed wetlands (García-Lledó et al., 2011), aquifer water and sediment (Torrentó et al., 2011), and paddy (Chen et al., 2012a) and riparian soils (Deslippe et al., 2014). When expressed as gene copy number, spatio-temporal variations of most denitrification genes were not clearly shown; however, both temporal and spatial changes were observed when their relative abundance was calculated. Spatio-temporal shifts in denitrifier communities have been described in constructed wetlands (Song et al., 2012), crop fields (Enwall et al., 2010; Novinscak et al., 2013) and estuarine sediments (Magalhães et al., 2008; Smith et al., 2015a). Correlations among the variables analysed in this study revealed that nitrate content correlated best with the relative abundance of the narG, napA, nirS and nosZ genes, with the highest positive correlation corresponding to the nosZ relative abundance. Similar results were published by Song et al. (2012) and Smith et al. (2015a) in their studies on the effects of nitrate content on the relative abundance of denitrification genes in constructed wetlands and estuarine sediments, respectively. However, the size of nosZ populations was independent of the nitrate content in constructed wetlands (García-Lledó et al., 2011), grassland soils (Keil et al., 2011) and river biofilms (Lyautey et al., 2013). The PCA showed a positive relationship between the nitrate content and the relative abundance of the narG, napA, nirK, nirS and nosZ genes. Because nosZ showed a strong correlation with the nitrate content of the sediments, we used it as a molecular marker to analyse the diversity of denitrifiers in the sediment samples taken in 2009 and 2010.
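The distinction drawn above between gene copy number and relative abundance can be made concrete with a small sketch. Following the definition in the caption of Table 5.3, relative abundance is taken here as the percentage ratio of a denitrification gene copy number to the 16S rRNA gene copy number. The function name is hypothetical, and the inputs are the mean copy numbers listed in Table 5.2 for S2 in October 2008. Because the sketch divides means, whereas Table 5.3 was presumably derived from the individual replicates, the results are close to, but not identical with, the published percentages.

```python
def relative_abundance(gene_copies, rrna_copies):
    """Gene copy number as a percentage of the 16S rRNA gene copy number."""
    return 100.0 * gene_copies / rrna_copies


# Mean copy numbers per g of dry sediment, S2, October 2008 (Table 5.2).
copies = {
    "16S rRNA": 4.33e7,
    "narG": 1.44e7,
    "napA": 9.45e6,
    "nirK": 3.87e6,
    "nirS": 4.42e6,
    "nosZ": 2.28e5,
}

rel = {
    gene: relative_abundance(n, copies["16S rRNA"])
    for gene, n in copies.items()
    if gene != "16S rRNA"
}

for gene, pct in rel.items():
    print(f"{gene}: {pct:.2f}%")
```

Normalising by the 16S rRNA gene removes differences in total bacterial load between samples, which may explain why spatial and temporal patterns that were unclear in the raw copy numbers became visible in the relative abundances.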
We are aware that the primers used in our study capture only part of the nosZ diversity, as other primers have been designed that amplify the so-called nosZ clade II (Sanford et al., 2012; Jones et al., 2013; Ligi et al., 2015). A total of 504 sequences showing homology with nosZ genes deposited in GenBank were obtained from the eight genomic libraries, of which 202 sequences corresponded to unclassified bacteria, which indicates the presence of hitherto uncultured bacterial groups in the sediments. Similar results were found in marine sediments by Scala and Kherkoff (1999) when analysing nosZ denitrifiers, and by Chen et al. (2010) and Smith and Ogram (2008) in studies on bacterial diversity based on the nirK and nirS gene communities in soil and sediments. Betaproteobacteria in la Rocina stream sediments dominated over the Alphaproteobacteria, and were followed far behind by the Gammaproteobacteria, whose presence was restricted to S1. Restriction of Gammaproteobacteria to specific sites has been previously reported in paddy soils (Chen et al., 2012a). In general, the dominant clusters found in this study include members of the families Bradyrhizobiaceae (cluster C1) and Rhodocyclaceae (cluster C22), and clusters C24 and C27, which contain unclassified sequences. Members of those clusters have been reported to be the most abundant in eutrophic lake sediments (Wang et al., 2013), ephemeral wetland soils (Ma et al., 2011), wastewater treatment plants (Chon et al., 2010), paddy soils (Ishii et al., 2011) and activated sludge (Srinandan et al., 2011). The AMOVA test revealed that the structure of the denitrifier communities at S1 remained relatively constant during the 2-year study, whereas at S2 it changed with the sampling dates and sampling years. OTU richness and the Simpson diversity index also indicated a seasonal effect on diversity at S2. Changes in biodiversity of denitrifier communities


in natural ecosystems following re-wetting after prolonged drought periods have already been reported (Groffman et al., 2009), and temporal and spatial variations in structure have been described for bacterial communities bearing the nirK (Wolsing and Priemé, 2004; Yoshida et al., 2009; Smith, 2010; Tatti et al., 2015), nirS (Yoshida et al., 2009; Smith et al., 2010; Hussain et al., 2011; Song et al., 2012; Tatti et al., 2015; Zheng et al., 2015; Pan et al., 2016) and nosZ (Smith et al., 2010; Tatti et al., 2015) genes. Nevertheless, other works have shown that denitrifier communities are both spatially and temporally stable (Zhou et al., 2011; Dandie et al., 2011; Chen et al., 2012b; Clark et al., 2012). Despite the small differences in the numbers of clusters and their distribution among the eight genomic libraries, the CCA confirmed the spatio-temporal variations found in the AMOVA test and suggests that the structure of the nosZ populations was affected by the nitrate content. Other studies, however, suggest that the community structure of denitrifiers is only weakly related, or not related, to the nitrate content (Wolsing and Priemé, 2004; Zhou et al., 2011; Carrino-Kyker et al., 2012; Chen et al., 2012b; Vilar-Sanz et al., 2013). Nitrate content also affected the diversity of denitrifiers when it was analysed using the narG (Reyna et al., 2010) and the nirK/nirS (Santoro et al., 2006) genes. Taken together, our results suggest the existence of spatio-temporal variations linked to seasonality in the structure (abundance and diversity) of the denitrifier communities in sediments of two locations differing in nitrate concentration within Doñana National Park. We showed that sediment nitrate content is a main factor structuring these communities, although the effect of other habitat variables cannot be ruled out.
Acknowledgements

This study was supported by the ERDF-cofinanced grants RNM-4746 and AGR2012–1968 from Consejería de Economía, Innovación y Ciencia (Junta de Andalucía, Spain).


Table 5.1S Values of Spearman coefficient between nitrate content, abundance of the 16S rRNA gene and the narG, napA, nirS, nirK and nosZ denitrification genes and PDA in sediments from la Rocina stream. Values followed by asterisk (*) are statistically significant (P-value