English Pages 254 [244] Year 2021
Methods in Molecular Biology 2396
Vladimir Shulaev Editor
Plant Metabolic Engineering Methods and Protocols
METHODS IN MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
Plant Metabolic Engineering Methods and Protocols
Edited by
Vladimir Shulaev Department of Biological Sciences, University of North Texas, Denton, TX, USA
Editor Vladimir Shulaev Department of Biological Sciences University of North Texas Denton, TX, USA
ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-1821-9 ISBN 978-1-0716-1822-6 (eBook) https://doi.org/10.1007/978-1-0716-1822-6 © Springer Science+Business Media, LLC, part of Springer Nature 2022 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface
Plants produce a vast array of specialized metabolites, many of which provide significant benefits for human health, nutrition, and lifestyle. Plant-derived products are used as pharmaceuticals, food products, health supplements, dyes, cosmetics, and fragrances. Because of low production levels, the significant cost of raw materials, and limited cultivation range, there has been a continuous effort to increase production of plant metabolites via metabolic engineering. Several strategies can be employed to increase endogenous metabolite production or produce novel metabolites in plants or plant cultures. This can be achieved by overexpressing endogenous genes or introducing novel genes encoding biosynthetic enzymes, by using transcription factors that regulate metabolic pathways, by targeting metabolites to a specific location or to the extracellular space, or by a combination of these strategies. Traditionally, metabolic engineering relied on identifying key metabolic enzymes or transcription factors and engineering plant or microbial systems for higher production of target metabolites using classical molecular biology techniques. The major limitation of this approach was a poor understanding of biochemical networks and their regulation. The emergence of “omics” disciplines and of systems and synthetic biology provided researchers with novel tools to study biochemical networks, identify new genes and pathways using advanced bioinformatics tools, follow metabolic fluxes, and build predictive mathematical models of metabolism. Lately we see a paradigm shift in which classical metabolic engineering based on molecular biology and biotechnology coexists with, and is being enhanced by, synthetic biology, “omics” approaches, mathematical modeling, and systems biology. These technological advances have led to the publication of an array of excellent methodological books on plant metabolic engineering in recent years.
The aim of this book was to complement published research and provide protocols for state-of-the-art techniques in various aspects of plant metabolic engineering. This volume presents methods for different aspects of metabolic engineering, ranging from bioinformatics tools used to discover novel genes and pathways and heterologous expression of biosynthetic genes in plant and microbial systems to omics technologies, such as transcriptomics, proteomics, and metabolomics, and bioinformatics and data analysis. It is intended for biologists, chemists, biotechnologists, students, and a broad cohort of researchers who work in the area of plant metabolism and metabolic engineering.
Denton, TX, USA
Vladimir Shulaev
Contents
Preface
Contributors
1 Methods for the Development of Recombinant Microorganisms for the Production of Natural Products
  Alexander Perl, Hunter Dalton, YeJong Yoo, and Mattheos A. G. Koffas
2 Sustainable Technological Methods for the Extraction of Phytochemicals from Citrus Byproducts
  Salvatore Multari, Fulvio Mattivi, and Stefan Martens
3 Reconstitution of Metabolic Pathway in Nicotiana benthamiana
  Chenggang Liu
4 A Protocol for Phylogenetic Reconstruction
  Soham Sengupta and Rajeev K. Azad
5 RNA-Seq Data Analysis Pipeline for Plants: Transcriptome Assembly, Alignment, and Differential Expression Analysis
  David J. Burks and Rajeev K. Azad
6 A Protocol for Horizontally Acquired Metabolic Gene Detection in Algae
  Ravi S. Pandey and Rajeev K. Azad
7 Global Comparative Label-Free Yeast Proteome Analysis by LC-MS/MS After High-pH Reversed-Phase Peptide Fractionation Using Solid-Phase Extraction Cartridges
  Khadiza Zaman, Prajita Pandey, Vladimir Shulaev, and Laszlo Prokai
8 Gas Chromatography Coupled to Atmospheric Pressure Chemical Ionization High-Resolution Mass Spectrometry for Metabolite Fingerprinting of Grape (Vitis vinifera L) Berry
  Johana S. Revel, Armando Alcázar Magaña, Jeffrey Morré, Laurent Deluc, and Claudia S. Maier
9 GC-MS/MS Profiling of Plant Metabolites
  Feroza Kaneez Choudhury, Prajita Pandey, Ron Meitei, Dwain Cardona, Amit C. Gujar, and Vladimir Shulaev
10 Analysis of Grape Volatiles Using Atmospheric Pressure Ionization Gas Chromatography Mass Spectrometry-Based Metabolomics
  Manoj Ghaste, Fulvio Mattivi, Giuseppe Astarita, and Vladimir Shulaev
11 A High-Throughput HILIC-MS-Based Metabolomic Assay for the Analysis of Polar Metabolites
  Giuseppe Paglia and Giuseppe Astarita
12 Macrolipidomic Profiling of Vegetable Oils: The Analysis of Sunflower Oils with Different Oleic Acid Content
  Juan J. Aristizabal-Henao and Ken D. Stark
13 Non-targeted Lipidomics Using a Robust and Reproducible Lipid Separation Using UPLC with Charged Surface Hybrid Technology and High-Resolution Mass Spectrometry
  Giorgis Isaac, Vladimir Shulaev, and Robert S. Plumb
14 Comprehensive Analysis of Plant Lipids Using Sub-2-μm Particle CO2-Based Chromatography Coupled to Mass Spectrometry
  Carolina Salazar, Michael D. Jones, Giorgis Isaac, and Vladimir Shulaev
15 Bioinformatics in Lipidomics: Automating Large-Scale LC-MS-Based Untargeted Lipidomics Profiling with SimLipid Software
  Ningombam Sanjib Meitei and Vladimir Shulaev
16 A Protocol for Prion Discovery in Plants
  Jamie D. Dixson and Rajeev K. Azad
17 Structural Determination of Uridine Diphosphate Glycosyltransferases Using X-Ray Crystallography
  Kasandra Alderete and Xiaoqiang Wang
Index
Contributors
ARMANDO ALCÁZAR MAGAÑA • Department of Chemistry, Oregon State University, Corvallis, OR, USA
KASANDRA ALDERETE • Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, USA
JUAN J. ARISTIZABAL-HENAO • Department of Kinesiology, University of Waterloo, Waterloo, ON, Canada
GIUSEPPE ASTARITA • Department of Biochemistry and Molecular and Cellular Biology, Georgetown University, Washington, DC, USA
RAJEEV K. AZAD • Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, USA; Department of Mathematics, University of North Texas, Denton, TX, USA
DAVID J. BURKS • Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, USA
DWAIN CARDONA • Thermo Fisher Scientific, Austin, TX, USA
FEROZA KANEEZ CHOUDHURY • Department of Biological Sciences, College of Science, University of North Texas, Denton, TX, USA; Genentech, South San Francisco, CA, USA
HUNTER DALTON • Department of Chemical and Biological Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA
LAURENT DELUC • Department of Horticulture, Oregon State University, Corvallis, OR, USA
JAMIE D. DIXSON • Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, USA
MANOJ GHASTE • Department of Biological Sciences, College of Science, University of North Texas, Denton, TX, USA; Department of Food Quality and Nutrition, Research and Innovation Centre, Fondazione Edmund Mach (FEM), San Michele all’Adige (TN), Italy; Department of Cellular, Computational and Integrative Biology—CIBIO, University of Trento, Provo (TN), Italy
AMIT C. GUJAR • Thermo Fisher Scientific, Austin, TX, USA
GIORGIS ISAAC • Waters Corporation, Milford, MA, USA
MICHAEL D. JONES • Waters Corporation, Milford, MA, USA
MATTHEOS A. G. KOFFAS • Department of Biological Sciences, Rensselaer Polytechnic Institute, Troy, NY, USA; Department of Chemical and Biological Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA
CHENGGANG LIU • Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, USA
CLAUDIA S. MAIER • Department of Chemistry, Oregon State University, Corvallis, OR, USA
STEFAN MARTENS • Fondazione Edmund Mach, Research and Innovation Centre, San Michele all’Adige (TN), Italy
FULVIO MATTIVI • Department of Food Quality and Nutrition, Research and Innovation Centre, Fondazione Edmund Mach (FEM), San Michele all’Adige (TN), Italy; Department of Cellular, Computational and Integrative Biology—CIBIO, University of Trento, Provo (TN), Italy
NINGOMBAM SANJIB MEITEI • PREMIER Biosoft, Indore, India; PREMIER Biosoft, Palo Alto, CA, USA
RON MITTLER • Division of Plant Sciences and Interdisciplinary Plant Group, College of Agriculture, Food, and Natural Resources, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
JEFFREY MORRÉ • Department of Chemistry, Oregon State University, Corvallis, OR, USA
SALVATORE MULTARI • Fondazione Edmund Mach, Research and Innovation Centre, San Michele all’Adige (TN), Italy
GIUSEPPE PAGLIA • School of Medicine and Surgery, University of Milano-Bicocca, Milan, Italy
PRAJITA PANDEY • Department of Biological Sciences, College of Arts and Sciences, University of North Texas, Denton, TX, USA
RAVI S. PANDEY • The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
ALEXANDER PERL • Department of Biological Sciences, Rensselaer Polytechnic Institute, Troy, NY, USA
ROBERT S. PLUMB • Waters Corporation, Milford, MA, USA
LASZLO PROKAI • Department of Pharmacology and Neuroscience, University of North Texas Health Science Center, Fort Worth, TX, USA
JOHANA S. REVEL • Department of Chemistry, Oregon State University, Corvallis, OR, USA
CAROLINA SALAZAR • Department of Biological Sciences and Advanced Environmental Research Institute, College of Science, University of North Texas, Denton, TX, USA
SOHAM SENGUPTA • Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, USA
VLADIMIR SHULAEV • Department of Biological Sciences and Advanced Environmental Research Institute, College of Sciences, The University of North Texas, Denton, TX, USA
KEN D. STARK • Department of Kinesiology, University of Waterloo, Waterloo, ON, Canada
XIAOQIANG WANG • Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, USA
YEJONG YOO • Department of Chemical and Biological Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA
KHADIZA ZAMAN • Department of Pharmacology and Neuroscience, University of North Texas Health Science Center, Fort Worth, TX, USA
Chapter 1
Methods for the Development of Recombinant Microorganisms for the Production of Natural Products
Alexander Perl, Hunter Dalton, YeJong Yoo, and Mattheos A. G. Koffas
Abstract
Metabolic engineering strives to develop microbial strains that are capable of producing a target chemical in a biological organism. There are still many challenges to overcome in order to achieve titers, yields, and productivities necessary for industrial production. The use of recombinant microorganisms to meet these needs is the next step for metabolic engineers. In this chapter, we aim to provide insight on both the applications of metabolic engineering for natural product biosynthesis as well as optimization methods.
Key words Metabolic engineering, Natural products, Recombinant microorganisms
1 Introduction
Metabolic engineering aims to engineer cellular phenotypes that are usually associated with the overproduction of a target chemical of interest. The need to make natural products from recombinant species is the driving force behind many of the advances in metabolic engineering because of issues with sustainability, chemistry, and environmental factors. Oftentimes, natural sources do not provide enough product in a sustainable manner, while chemical means cannot synthesize complex natural products. Recombinant species are also capable of using cheap carbon sources with minimal environmental footprints for the production of biomolecules. In this chapter, we will begin by discussing methods for static balancing. In this section, we will summarize methods for chromosomal integration, the effects of changing DNA copy number and promoter strength, and challenges associated with static balancing. The next section will focus on dynamic balancing, specifically the advantages and disadvantages of a variety of methods. A section will follow on co-culturing, with a focus on media, temperature, and induction optimization. A variety of examples are included to demonstrate the benefits of co-culturing, such as improved titer yields and decreased production of toxic intermediates. The purpose of this chapter is to provide insight on both the applications for and the pros and cons of many optimization methods.
Vladimir Shulaev (ed.), Plant Metabolic Engineering: Methods and Protocols, Methods in Molecular Biology, vol. 2396, https://doi.org/10.1007/978-1-0716-1822-6_1, © Springer Science+Business Media, LLC, part of Springer Nature 2022
2 Methods
2.1 Static Balancing
The fundamental goal of metabolic engineering is to reallocate the use of resources within a cell toward the production of a specific, target chemical [1]. This goal is accomplished through the redirection or addition of metabolic pathways within the cell, often via the introduction of transgenes and regulatory elements. Development of new genetic tools and genome-editing approaches has increased the capacity to genetically modify host systems and direct metabolic fluxes toward desired metabolites [2, 3]. Once the desired metabolic pathways for target molecules are expressed by the cell, the strains go through rounds of an optimization process to direct the maximum possible amount of metabolic flux toward the pathway of interest. The goal of this optimization process is to increase titer yields, the rate of the reaction, and the yield of the product. To that end, researchers have developed and implemented a variety of regulatory tools to fine-tune the expression level of pathway genes and enzymes (Fig. 1). Once fine-tuned, these regulatory tools are able to remove bottlenecks and keep intermediates or byproducts that diminish strain productivity from accumulating, as well as delete or insert non-pathway genes, so the cell allocates more cellular resources to synthesis of target metabolites [3]. Despite a variety of genome-editing tools and approaches being available, achieving a strain productivity that meets the economic demands for industrial-scale production remains the most resource-consuming step of converting engineered biological systems into commercially viable bioprocesses [2]. Studies have found that engineered strains that demonstrated specific levels of productivity in small-scale laboratory bioreactors show decreased productivity when they are grown at the industrial scale due to localized fluctuations in physiological conditions [4–6].
Unfortunately, improving homogeneity in physiological conditions often results in qualitative solutions, as mixing is one of the most poorly understood processes in classical chemical and biochemical engineering [7].

Fig. 1 Various genetic tools for regulation can be used to increase the overall titer

One of the more recent ways to regulate metabolic pathways is the use of CRISPR interference (CRISPRi) in the production of natural products. There have been many advances for use in the model organism Escherichia coli, for instance in the production of various organic compounds such as terpenes, polyhydroxybutyrate (PHB), and alcohols such as 1,4-butanediol (1,4-BDO) [8]. A pigment, O-methylated anthocyanin, can be synthesized by using CRISPRi to target metJ, a gene encoding the Met repressor. The Met repressor, when combined with S-adenosylmethionine (SAM), represses the methionine regulon and enzymes involved in SAM synthesis, so silencing metJ leads to increased production of the methylated anthocyanin [9]. A less rigid cell shape is beneficial for the accumulation of PHB in a cell. By using CRISPRi to target E. coli genes that code for proteins involved in the synthesis of the cell wall, it is possible to weaken cell rigidity [10]. Example proteins include FtsZ, the cell division protein, and MreB, the cell shape-determining protein. As a result, more PHB is able to accumulate in the cell [8]. For the production of alcohols, it is possible to design a CRISPRi system that suppresses the competing genes (gabD, ybgC, and tesB) that restrict the production of 1,4-BDO and at the same time reduces the production of the by-products gamma-butyrolactone and succinate, improving the total amount of 1,4-BDO produced.

The decreased strain productivity observed when strains go from the laboratory to an industrial bioreactor makes clear the potential benefit that regulatory elements may have under fluctuating conditions. A majority of the genome-editing tools and approaches that have traditionally been employed for the improvement of strain productivity are static: their effects, by design, vary neither over time nor in response to changes in intracellular or extracellular conditions. Static regulatory tools are generally optimized for a specific set of fermentation conditions, often at smaller scales. Fluctuations in extracellular physiological conditions induce changes in intracellular conditions [4, 6], which can then alter the previously balanced regulatory effects and introduce perturbation in the expression of genes and enzymes. The shift in expression levels and metabolite concentrations can reduce the efficiency of metabolite use, causing metabolic burden and decreased productivity.

To understand static regulation, one must study the genetically encoded components of a cell that help determine the flow rate of components through a pathway but do not sense cellular conditions and modulate the pathway based on the sensed information. There are many parameters that can be used to change the static control of a particular system. The strength and type of constitutive promoter, the strength of the ribosome-binding site, and the copy number of the vector are just a few examples [11]. A major challenge in statically regulating a system is specifying the appropriate level of gene expression. With too little translation, not enough product is produced, while with too much translation, the cell wastes significant amounts of energy. To solve this problem, researchers studied the folding of mRNA near the ribosome-binding site (RBS), as this can strongly impact gene expression level [11]. One of the more successful models of translation initiation supports the design of an RBS sequence that achieves a specific translation initiation rate [12–14]. By characterizing the free energy of certain interactions in translation initiation, it is possible to measure the energetic changes during RNA folding and hybridization. This computational tool works either by predicting the relative translation initiation rate of an existing RBS sequence or by using an optimization algorithm to choose an RBS capable of translating a protein-coding sequence at a rate chosen by the researcher [13]. By being able to control and predict the strength of an RBS, it is possible to optimize regulation to decrease the metabolic burden on the cell while still making enough product. The kinetics of translation initiation have been extensively studied; however, many of the existing models have an unacceptable failure rate [11].
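The link between free energy and translation initiation rate described above can be illustrated with a minimal Python sketch. It assumes the Boltzmann-type relationship commonly used in thermodynamic RBS models, in which the relative initiation rate scales as exp(-beta * dG_total). The value of beta, the candidate names, and their free energies are illustrative assumptions, not values taken from this chapter, and the selection step is a simple lookup rather than the optimization algorithm of any published tool.

```python
import math

# Assumed scaling constant (mol/kcal); treat it as a placeholder to be
# fitted for the host and reporter system in use.
BETA = 0.45

# Hypothetical RBS candidates with made-up total free energy changes
# (kcal/mol) for ribosome binding; more negative means stronger binding.
CANDIDATES = {
    "rbs_weak": 2.0,
    "rbs_medium": -3.0,
    "rbs_strong": -8.0,
}


def relative_initiation_rate(dg_total, beta=BETA):
    """Relative translation initiation rate for a total free energy change."""
    return math.exp(-beta * dg_total)


def pick_rbs(candidates, target_rate, beta=BETA):
    """Return the candidate whose predicted rate is closest to the target.

    Distances are compared on a log scale because rates span orders of
    magnitude.
    """
    def log_distance(name):
        predicted = relative_initiation_rate(candidates[name], beta)
        return abs(math.log(predicted) - math.log(target_rate))

    return min(candidates, key=log_distance)


if __name__ == "__main__":
    for name, dg in CANDIDATES.items():
        print(name, round(relative_initiation_rate(dg), 2))
    print("closest to a 4-fold target:", pick_rbs(CANDIDATES, 4.0))
```

Working on a log scale reflects the fact that a fixed change in free energy multiplies, rather than adds to, the predicted rate, which is why small sequence changes near the RBS can shift expression substantially.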
In terms of controlling gene expression, DNA copy number modulation is perhaps the most frequently applied method. Initially, metabolic engineers focused on increasing the availability of precursors and intermediates by removing competing pathways and overexpressing desired pathways with high copy number vectors, with the aim of achieving maximum production rates. However, researchers were unaware of the effects of other factors, such as promoter strength or substrate availability, so the expression of high copy number vectors led to only moderate strain improvements because of the build-up of unnecessary and toxic intermediates [15, 16]. Another example of DNA copy number modulation occurs in the biosynthesis of fatty acids. Researchers were able to modify the plasmid copy number to change the distribution of precursor molecules between the two different pathways of fatty acid biosynthesis. In other words, changing the plasmid copy number allowed for the optimal balancing of the expression of acetyl-CoA-producing pathways and malonyl-ACP-consuming pathways. Because of the toxic nature of acetyl-CoA/malonyl-ACP intermediates, a lower copy number on the plasmid controlling the acetyl-CoA pathway yields more fatty acid production [17].

In terms of transcriptional regulation, the control of promoter strength is just as important as DNA copy number modulation. Researchers have developed a library of constitutive and inducible promoters in Saccharomyces cerevisiae to increase promoter activity and control over metabolic flux. It is the combination of these factors (promoter strength, RBS strength, and the copy number of the vector) that allows a researcher to better regulate a system [16]. The hybrid promoter strategy, based on characterizing the element function of multiple promoters, is another way to control promoter strength, and it works by linking together structural components of various promoters. Researchers have used this method to add inducibility to an otherwise constitutive system [18, 19]. Another example of hybridizing a promoter is the addition of an operon to a promoter, thus reducing the toxicity and burden of metabolites to cells while optimizing the network [19]. Jiao et al. were able to take a lac operon and add it to a native promoter in order to increase the surfactin titer [20].

While transcriptional regulation has many benefits, high copy plasmid DNA can increase the metabolic burden on bacterial cells, which can become a problem when optimizing metabolite production. Chromosomal integration can decrease the copy number of the genes being overexpressed and has been shown to be capable of increasing production of heterologous metabolites [21]. Researchers were able to use chromosomal integration to easily construct and integrate entire metabolic pathways into E. coli, demonstrating that both protein expression and metabolite production are influenced by the location of their respective integrations on the genome [22]. Englaender et al. studied a single-gene pathway for the production of trans-cinnamic acid as a case study for the effects of chromosomal integration. A “landing-pad” that inhibited tetracycline production was inserted into four different loci on the gene of interest. Englaender et al. observed a 14.4% decrease in trans-cinnamic acid production between the highest and lowest producing strains in this analysis, demonstrating the impact that location on the chromosome can have on strain efficiency [22].

2.2 Dynamic Balancing
As an alternative to static control of engineered metabolic pathways, tools for the dynamic regulation of intracellular conditions have gained traction in recent years because of their ability to address optimization challenges that static control strategies cannot overcome. In particular, they are far more capable of preventing the overuse of central metabolites and the buildup of toxic intermediates [23]. Dynamic regulatory devices are able to sense changes in specific, relevant cellular conditions, which typically end up being metabolite concentrations, and induce changes in the cell to adaptively upregulate or downregulate gene expression.

At any given moment, cells undergo a variety of metabolic processes that are regulated at the transcriptional, translational, and posttranslational levels. The goal of much work in biotechnology has been to mimic these regulatory processes in order to control the production of some end product. In order to achieve these goals, novel methods have been designed to utilize the existing mechanisms in new ways or to apply novel synthetic constructs to biological systems [24]. These dynamic regulatory systems take advantage of cellular resources in order to control the production of different metabolites. While some of these methods have been studied for years, there are some newer methods that have only begun to be developed. Whether new or old, these regulatory elements are an important bridge between the artificial, specific targeting of genes and the ability of the cell to respond to intracellular signals. These dynamic regulatory approaches have significant potential to optimize the regulation of metabolic pathways far more effectively than static regulatory approaches [25]. Each dynamic approach has different advantages and disadvantages, and understanding these differences makes it possible to determine which strategy is best for a specific system.

There are two recently designed methods for regulating metabolic pathways currently under development (Fig. 2). The first involves the use of small transcription-activating RNAs (STARs), which range in size from approximately 18 to 30 base pairs. STARs function cooperatively with a target gene that contains a terminator hairpin at the beginning of its transcript. This hairpin, known as the regulator, is inserted in the 5′ untranslated region before the gene. During transcription, the regulator folds into an intrinsic terminator hairpin that halts transcription of the gene of interest [26]. When the STAR is expressed, it binds to the hairpin and disrupts its structure, thus allowing the gene to be fully transcribed. While STARs have primarily seen use in gene activation thus far, they can also be designed to act as attenuators. In this implementation, the STAR is designed to form an inhibitory hairpin when it contacts the gene of interest rather than disabling a hairpin already in place [26]. This inhibitory system allows the STAR to completely block transcription of the target gene. As such, it will be less likely to leak than a traditional aptamer [27].

Fig. 2 Mechanism of each reviewed technology. (a) STARs disable a hairpin on a gene allowing for gene expression. (b) Toehold switches are RNA in a hairpin formation that can be activated when a trigger strand (RNA complementary to the toehold switch)

One of the most distinctive attributes of STARs is their ability to act as independent RNA regulators; they control and regulate transcription as opposed to translation [26]. Altering transcription is advantageous in that it has the potential to be applied more generally in the regulation of RNA synthesis [27]. Once designed, these STARs would be able to target a specific gene. While the ability to target a specific gene is not unique to STARs, the de novo design has the potential to make them a powerful tool. NUPACK, a design tool capable of performing statistical thermodynamic calculations, allows for the prediction of RNA secondary structures. By utilizing NUPACK, it will be possible to design STARs capable of dynamic regulation with a larger dynamic range and lower metabolic burden than other methods.

The second method for regulating metabolic pathways involves the use of a riboregulator called a toehold switch (THS).
A general toehold switch is comprised of RNA in a hairpin configuration that sequesters the RNA strand’s ribosome-binding site (RBS) [28]. In order for transcription to occur, a short length of ssRNA adjacent to the hairpin binds to a complementary strand of RNA, called a trigger, and RNA changes conformation, exposing the RBS and allowing translation of the gene of interest [28]. When designing a THS, it is important that the toehold-trigger complex is more energetically favorable than the unbound THS so that the binding happens spontaneously [28]. THS has the benefit of not requiring a specific sequence in the toehold and hairpin regions, and thus can be designed for any arbitrary sequence. This means that a library of orthogonal trigger–THS pairs can be generated. Because the trigger can be an arbitrary sequence, the “trigger” can in fact be an mRNA that is already produced by the cell, and thus can allow a THS to sense transcription levels in the cell. One of the most unique attributes of a toehold switch is the lack of necessity for the RBS to bind to the actual hairpin
Alexander Perl et al.
[29]. Rather than the RBS binding to the actual hairpin, sequences of bases before and after the start codon are held in RNA duplexes. The start codon itself remains unbound, which allows multiple trigger strands to bind to one toehold. In other words, each trigger can lack specificity despite the toehold being orthogonal; therefore, toehold switches can be designed to allow a gene to be translated in the presence of a trigger RNA with an arbitrary sequence.
The production of certain metabolites, such as flavonoids, can be controlled by dynamic regulation [25, 30]. One such flavonoid is naringenin, which is formed by an enzymatic pathway starting with the precursor p-coumaric acid. Maximizing naringenin production depends on maintaining high levels of chalcone synthase (CHS) and chalcone isomerase (CHI) while decreasing tyrosine ammonia lyase (TAL) and 4-coumarate:CoA ligase (4CL) [31]. It is possible to dynamically regulate the production of TAL and 4CL while constitutively expressing CHS and CHI in order to increase naringenin titers [30].
Another benefit of incorporating dynamic regulatory methods is the ability to optimize titers during fatty acid production. One fatty acid precursor is malonyl-CoA, which is involved in the chain elongation of a variety of pharmaceuticals and biofuels [32]. Xu et al. designed a variety of genetically encoded metabolic switches capable of controlling both the malonyl-CoA source and sink pathways [33]. Two of the switches they designed were the malonyl-CoA–regulated T7 promoter and pGAP promoter. While many problems can arise when combining multiple promoters or switches, such as incoherent input signals, these two promoters could be combined into one system. This system was shown to control the expression of two reporter proteins by switching between two distinct states depending on the intracellular level of malonyl-CoA [33]. Additionally, Zhang et al.
engineered a fatty acid/acyl-CoA (FA/acyl-CoA) biosensor capable of dynamically regulating the production of fatty acyl-CoAs, the precursors to specific biofuels [34]. The researchers found that the biosensor detected fatty acids regardless of whether they were supplied exogenously or produced endogenously. Because it detects the presence of these molecules and responds to changing levels, the acyl-CoA biosensor can be used to regulate any pathway dependent on the production of the specific fatty acid [34].
Looking ahead, STARs will add new tools for the construction of transcriptional RNA repressors. One goal is to interface STARs with CRISPRi in order to create a system that could be either activated or repressed dynamically. While systems with similar functionality already exist, RNA-based transcriptional methods would likely impose less of a metabolic burden and offer faster network dynamics than traditional methods of repression [26].
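The toehold switch logic described earlier in this section (a trigger that base-pairs with the toehold and the adjacent stem arm, and a trigger-switch complex that must be lower in free energy than the unbound components) can be illustrated with a minimal Python sketch. The sequences, region lengths, and free-energy values below are hypothetical placeholders, not taken from the cited work; in practice the free energies would come from a folding tool such as NUPACK, which this sketch does not call.

```python
# Minimal sketch of toehold-switch trigger design; all sequences are hypothetical.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))


def design_trigger(toehold: str, stem_5p: str) -> str:
    """Trigger RNA that pairs with the toehold and the adjacent 5' stem arm.

    The switch region is read 5'->3' as toehold + stem_5p, so the trigger
    is the reverse complement of that concatenation.
    """
    return reverse_complement(toehold + stem_5p)


def binding_is_favorable(dg_switch: float, dg_trigger: float, dg_complex: float) -> bool:
    """True when the trigger-switch complex has a lower free energy (kcal/mol)
    than the separately folded switch and trigger, so binding is spontaneous."""
    return dg_complex < dg_switch + dg_trigger


toehold = "GGGAUCUAUUAC"        # hypothetical single-stranded toehold
stem_5p = "UACGGAUCCAUGAACUUU"  # hypothetical stem arm opened by the trigger
trigger = design_trigger(toehold, stem_5p)
print(trigger)

# Hypothetical free energies; real values would come from a folding calculation.
print(binding_is_favorable(dg_switch=-20.0, dg_trigger=-5.0, dg_complex=-45.0))
```

Because the trigger is defined only by complementarity, the same function could be applied to a stretch of an endogenous mRNA, which is the sensing scenario mentioned in the text.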
Methods for the Development of Recombinant Microorganisms. . .
2.3 Co-culturing
Co-culture fermentation is a method for the biosynthesis of chemicals that utilizes two distinct strains of microorganisms in the same fermentation medium. In this arrangement, the strains have been designed to carry out different steps in the same synthesis pathway. Co-culture fermentations have existed alongside humans for quite some time, in traditional foods such as cheese, yogurt, alcohol, and more [35]. Now, co-culture fermentations are being applied in the field of metabolic engineering as an alternative to single-strain fermentations, particularly when a larger number of transgenic reaction pathways are involved. The greatest advantage of using co-cultures over pathways expressed in a single strain is that, rather than attempting to balance the internal state of one strain to cater to several distinct stages of a synthesis pathway, one can optimize two strains separately for two different types of reactions, which ultimately can allow for a higher efficiency at both stages [36]. Not only does this approach have the potential to improve the rate or yield of the fermentation, it also allows for less modification of the strains used in the fermentation. If two existing strains are already effective at certain reactions, they can be modified slightly without adding entirely new sets of pathway genes, which reduces the amount of work required to set up the reaction system.
Alongside the benefits of co-culture fermentations, however, come new complications. While having both strains present in the medium at the same time can allow for greater yield or efficiency of fermentation, it means that the relative amount of each strain becomes a new variable that can drastically change the overall effectiveness of the system. To prevent such fluctuations in the fermentation, it becomes critical to balance not just the individual behavior of each strain in the co-culture, but also their relative growth rates [36].
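Why relative growth rates matter can be seen in a small numerical sketch. It assumes simple, unconstrained exponential growth of both strains with no interaction between them, which is a deliberate simplification; the growth rates and inoculation ratio are hypothetical values chosen only for illustration.

```python
import math


def strain_fraction(f0: float, mu_a: float, mu_b: float, hours: float) -> float:
    """Fraction of strain A after `hours` of unconstrained exponential growth.

    f0   : initial fraction of strain A in the inoculum (0..1)
    mu_a : specific growth rate of strain A (1/h)
    mu_b : specific growth rate of strain B (1/h)
    """
    a = f0 * math.exp(mu_a * hours)
    b = (1.0 - f0) * math.exp(mu_b * hours)
    return a / (a + b)


# Hypothetical case: a 1:1 inoculum in which strain A grows slightly faster.
for t in (0, 6, 12, 24):
    print(t, round(strain_fraction(0.5, 0.60, 0.50, t), 3))
```

Under these assumptions even a modest difference in growth rate lets one strain dominate within a day, which is why the inoculation ratio and the medium are tuned together.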
One of the simpler methods for keeping the desired ratio between the two strains is optimization of the medium. In many cases, the two strains naturally require different carbon sources, and thus by changing the relative amounts of the two carbon sources, it is possible to make the two strains grow in the ideal proportions [35]. Individual growth curves must be used to track how quickly each strain grows under specific conditions. Once the growth rates are determined, the ratio in which the strains are mixed determines what percentage of the culture each strain makes up, thus allowing the fermentation to be optimized.
Another challenge to overcome while co-culturing is the performance of both strains. This can be most easily studied by observing the end products of a reaction, such as ethanol in a fermentation reaction. By observing how much ethanol is produced, it is simple to track the effect that any change to an initial parameter has on the system. In order for co-culturing to be worth the trouble, there
must be a significant increase in total ethanol produced when compared to single-fermentation reactions [35].
Possibly the most difficult problem to overcome is the interaction between the different microorganisms in a co-culture. Growth of cells of one strain may be enhanced or inhibited by the activities of other cells present in the medium. The same is true for the formation of any primary or secondary metabolites [37]. In extreme cases, a metabolite produced by one organism may even be lethal to the other organism, ruining the experiment altogether. On the other hand, the interactions between the strains can also promote growth and metabolite production. The enzymatic activity of one strain in the co-culture may supply another strain with a substrate that enhances its growth rate [37]. In addition, one strain may reduce the production of a growth-inhibiting substance made by the other strain in the co-culture, which improves the growth rate of that strain.
On top of mutualistic growth, it is also possible for the biological production of renewable resources to be improved by co-culturing. By increasing the yields of renewable, organic products, it may be possible to increase the efficiency of renewable energy sources and make them a more practical solution to the energy crisis [37]. Biofuels that are currently made by single fermentation, such as ethanol, show increased yields when produced in co-cultures. Researchers have produced ethanol via a co-culture containing Zymomonas mobilis and Saccharomyces cerevisiae at both higher yields and higher production rates than with either organism alone. The highest result achieved was a conversion of 94% of the theoretical maximum [37]. Utilization of co-culturing for the production of ethanol is hampered by the inability to produce monosaccharides from cellulose.
In order to solve this, researchers are attempting to use simultaneous hydrolysis and fermentation, which prevents disaccharides from accumulating in the cell and thereby eliminates the inhibition of cellulolytic enzymes [38]. A co-culture consisting of both anaerobic and aerobic bacteria resulted in a significant increase in enzymatic activity. With an aerobic and an anaerobic organism together, enzyme production improved because the consumption of all available oxygen and the metabolic degradation of inhibiting substances by the aerobic organism created better conditions for the anaerobic organism to thrive [35–38]. Co-cultivation is also more advantageous than two-stage fermentation. Co-culturing decreases the cost of production by eliminating the need for a second sterilization step. This significantly decreases the production time, effort, and difficulty of the fermentation process without decreasing the overall yield [38]. Finally, co-culturing also decreases the chance of contamination in the bioreactor, as there is no longer a transfer from one bioreactor to another.
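The figure of 94% of the theoretical maximum quoted above for the Z. mobilis and S. cerevisiae co-culture can be put in mass terms with a short calculation. The sketch below assumes glucose as the substrate and the standard fermentation stoichiometry of two ethanol per glucose; the substrate used in the cited study is not stated here, so the numbers are illustrative only.

```python
GLUCOSE_MW = 180.16  # g/mol, C6H12O6
ETHANOL_MW = 46.07   # g/mol, C2H5OH


def theoretical_ethanol_yield() -> float:
    """Maximum g ethanol per g glucose for C6H12O6 -> 2 C2H5OH + 2 CO2."""
    return 2 * ETHANOL_MW / GLUCOSE_MW


def percent_of_theoretical(ethanol_g: float, glucose_g: float) -> float:
    """Observed mass yield expressed as a percentage of the theoretical maximum."""
    observed = ethanol_g / glucose_g
    return 100.0 * observed / theoretical_ethanol_yield()


max_yield = theoretical_ethanol_yield()
print(round(max_yield, 3))         # about 0.511 g ethanol per g glucose
print(round(0.94 * max_yield, 3))  # mass yield corresponding to 94% of the maximum
```

Under this assumption, 94% of the theoretical maximum corresponds to roughly 0.48 g of ethanol per gram of glucose consumed.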
In nature, microbes coexist in communities composed of many different interacting species. These communities cycle oxygen, carbon, and nitrogen in order to live symbiotically [37]. In such communities, each organism performs a different task to the mutual benefit of the biocenosis. While the food industry has taken advantage of this phenomenon to make a variety of dairy products, there are other ways researchers can utilize these naturally occurring mutualistic relationships. For example, herbivores’ guts are home to a host of microbial and fungal organisms. These symbionts work synergistically to produce cellulolytic enzymes that degrade plant biomass to simple sugars [38]. By utilizing the same organisms that naturally coexist in these herbivores, it is possible to create a co-culture capable of efficiently breaking down complex polysaccharides into simple monosaccharides. These monosaccharides can then be used for the production of ethanol or other biofuels by existing technologies.
Another potential use for co-culturing is the synthesis of flavonoids. Flavonoids are compounds that have important pharmaceutical applications. The synthesis of flavonoids requires a variety of cofactors and precursors. These pathway-dependent molecules must be stoichiometrically balanced in order to achieve maximal yields [38]. Whereas the synthesis of the flavonoid anthocyanin traditionally occurred in one culture, the biosynthetic pathway was split between two cultures that were then co-cultured [39]. Each plasmid contained three of the six genes necessary for the synthesis of the flavonoid. While this did produce viable yields (40.7 mg/L, almost a 1000-fold improvement over the monoculture yield), even better yields can be achieved by using a polyculture. Pyranoanthocyanins are the most chemically complex of the anthocyanins and are produced during the fermentation of red wine grapes.
Synthesis of this molecule is difficult because of the extremely low titers, even in aged wine. Co-culturing two recombinant E. coli strains to synthesize pyranoanthocyanins has been shown to improve titers up to 16 mg/L under ideal conditions [40]. Another flavonoid, apigetrin, can also be produced via a co-culture of two distinct strains of E. coli. The mix of these two strains was found to give a 2.5-fold increase in the titer of the flavonoid [41].
Researchers have designed a fungal/bacterial mixture for the biosynthesis of biofuel products such as isobutanol [38]. Isobutanol can be produced by a co-culture containing a fungal species that secretes cellulase enzymes to hydrolyze biomass into soluble saccharides, and a fermentation specialist, which metabolizes the soluble saccharides into isobutanol [42]. Minty et al. were able to produce titers of approximately 62% of the theoretical maximum, and this yield may improve as the chromosomal integration of plasmid genes is better understood [42]. Co-culturing can also be utilized to improve titers by decreasing the number of steps in a metabolic pathway and thus
decreasing the loss of product between steps [43]. One such example is the production of 2-keto-L-gulonic acid (2-KGA), the precursor of vitamin C. This precursor was originally produced in a three-step fermentation that involved two sterilizations. The second sterilization decreased the final yield of the precursor, and after switching to a single fermentation by use of a co-culture, the yields were found to improve by almost 30% [44].
Muconic acid (MA) is an important precursor in the synthesis of macromolecules such as nylon and other plastic polymers. It is currently synthesized via a complex metabolic pathway that puts a heavy burden on the organism (typically E. coli) in which it is produced [45]. To solve this problem, researchers used a co-culture of two different E. coli strains, each containing half of the total metabolic pathway. By using the same organism, problems such as strain dominance were avoided. The result was a decrease in byproducts and an overall increase in titer [45]. Additionally, similar yields were obtained when the researchers used a similar co-culture to produce benzoic acids.
Lovastatin is an important pharmaceutical used to treat hypercholesterolemia; it inhibits HMG-CoA reductase, the enzyme that catalyzes the rate-limiting step in the biosynthesis of cholesterol [46]. One of its intermediates, monacolin J, acts as an important precursor for other pharmaceuticals and is typically produced by hydrolysis of lovastatin. Lovastatin is typically produced through fungal fermentation, which can lead to several problems including long fermentation periods, multiple byproducts, and high cost. By using a Pichia pastoris co-culture, researchers were able to decrease the metabolic burden, decrease intermediate accumulation, and improve the yield of lovastatin [47].
Other pharmaceuticals can also be biosynthesized by means of co-culturing to increase titers. Paclitaxel is an important anticancer compound that can be synthesized from taxadiene.
Taxanes can be biosynthesized by a co-culture containing E. coli and S. cerevisiae. Researchers found no toxic byproducts, and the taxane titer was improved approximately threefold after co-culturing [48].
A class of compounds called curcuminoids has begun to be studied for its antioxidant, anticancer, and antitumor properties. These biomolecules were originally obtained through plant extraction. While this is effective at producing the molecules at usable titers, dependence on plants means researchers are subject to seasonal growth and must deal with byproduct production. Recently, co-culturing has managed to eliminate unnecessary byproducts, thus yielding a purer product [49].
A more recent method involved four E. coli strains collectively expressing 15 pathway genes from different plants and microbes for the production of anthocyanins [50]. This polyculture produced even larger yields of the flavonoid and demonstrated the production of anthocyanins directly from glucose. Because this system only
requires basic fermentation optimization to achieve peak production and has such high yields, there is potential for the technology to be scaled up for use in a bioreactor [50].
Microbial biosynthesis via co-culturing provides a new approach to metabolic pathway balancing. Not only can co-culturing be customized for the production of various metabolites, but it can also increase the yields of metabolites made by more traditional means. While there is a lot of promise in using co-culturing to improve dynamic regulation, there are inherent challenges that need to be addressed. Improvements in organism choice and media optimization will benefit co-cultivation engineering.
2.4 Fermentation Optimization
Fermentation optimization involves optimizing a process rather than modifying the genetic elements within a microorganism. Careful optimization of the media, temperature, and induction conditions has been shown to greatly improve pathway fluxes. While historically led by industry, this field has the potential to both rapidly and significantly improve product yield, product titer, and overall productivity of the system without the hindrances associated with genetic optimization, including DNA synthesis, cloning, and DNA sequencing. Here we discuss how to achieve significantly higher production by modifying specific parameters that can be easily changed in shake flask experiments [31].
First, designing and performing some small-scale experiments to probe the production landscape allows the researcher to determine the starting point for the complete optimization. Because these experiments exist to give a starting point for the optimization, they tend not to be reported in the literature. Therefore, these experiments do not have to be strategically planned or reproduced. Oftentimes, these initial optimization experiments are performed early in the genetic optimization process to determine initial screening conditions for genetically varied mutants [31]. Revisiting past optimizations is necessary, as this process is affected by many factors in relatively unpredictable ways. Due to the immense size of the solution space, it is impossible to screen all possible combinations of conditions, which leads to assumptions being made about less significant parameters. Some of these parameters include overnight culture growth conditions, initial inoculation level, and media preparation protocols. It is critical that methods are reproducible and properly documented in the methods section to ensure reproducibility of the process.
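How quickly the solution space grows can be seen by enumerating a small full-factorial screen. The factors and levels below are hypothetical examples chosen for illustration, not a recommended design.

```python
from itertools import product

# Hypothetical factors and levels for a shake flask screen.
factors = {
    "medium": ["M9", "LB", "TB"],
    "carbon_source": ["glucose", "glycerol", "xylose"],
    "temperature_c": [25, 30, 37],
    "induction_od600": [0.4, 0.8, 1.2],
}


def full_factorial(factors: dict) -> list:
    """Return every combination of factor levels as a list of condition dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


design = full_factorial(factors)
print(len(design))      # number of distinct conditions (3 * 3 * 3 * 3)
print(len(design) * 3)  # flasks needed if each condition is run in triplicate
print(design[0])
```

Four factors at three levels each already give 81 conditions, or 243 flasks in triplicate, which is why parameters judged less significant are usually fixed by assumption rather than screened.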
2.5 Media Optimization
Media optimization varies significantly between batch shake flask and bioreactor-based studies. We will focus on shake flask systems to avoid the numerous challenges when optimizing with bioreactor systems [51]. For preliminary data, it is necessary to select a variety
of different base media. Some examples include minimal media such as M9 or MOPS [52]; complex media such as LB, TB, or SOC; and rich defined media such as AMM [31, 50]. Not only does the type of medium matter, but the supplements added to the medium can also greatly impact cell growth. Several different carbon sources that utilize different metabolic pathways for uptake can be used, such as glucose, xylose, glycerol, or yeast extract [51, 53]. Screening all combinations of these media and carbon sources will allow for the selection of the best combination for future experiments.
2.6 Temperature Optimization
Optimizing the temperature of a fermentation allows a researcher to find the conditions under which the exogenous enzymes are most active, while still preserving the growth rate advantages of the host strain at increased temperatures. In order to find this critical temperature, a range of temperatures should be screened at 5–10 °C intervals. Peer-reviewed literature can provide a basis for the experimental design, including in vitro temperature–activity relationship data for the exogenous enzymes as well as the preferred growth temperature of the host organism. In metabolic engineering studies, a dual-temperature approach is often used to maximize growth and optimize recombinant enzyme activity [31]. In other words, there is a shift in temperature at the induction point. This temperature shift can be applied slightly prior to induction to reduce the overall stress on the cells.
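The screening grid and the dual-temperature approach described above can be written down as a simple schedule. This is a minimal sketch; the temperatures, induction time, and lead time before induction are hypothetical values, and the function names are introduced here only for illustration.

```python
def temperature_grid(low_c: int, high_c: int, step_c: int) -> list:
    """Inclusive list of screening temperatures from low_c to high_c in step_c steps."""
    if step_c <= 0:
        raise ValueError("step_c must be positive")
    return list(range(low_c, high_c + 1, step_c))


def dual_temperature_profile(growth_c: int, production_c: int,
                             induction_h: int, lead_h: int, total_h: int) -> list:
    """Hourly set points: growth temperature until lead_h hours before induction,
    production temperature from then on."""
    shift_h = induction_h - lead_h
    return [growth_c if hour < shift_h else production_c for hour in range(total_h)]


# Hypothetical screen at 5 degree C intervals.
print(temperature_grid(20, 40, 5))

# Hypothetical run: grow at 37, shift to 30 one hour before induction at hour 4.
print(dual_temperature_profile(37, 30, induction_h=4, lead_h=1, total_h=8))
```

Setting `lead_h` to zero gives a shift exactly at induction, while a small positive value models the earlier shift suggested to reduce stress on the cells.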
2.7 Induction Optimization
Induction optimization is the most important factor of fermentation optimization because of its sensitivity [54]. The use of a constitutive promoter eliminates the ability to improve titers through induction optimization but can complicate the system, so the choice between an inducible and a constitutive promoter system will always be case dependent [31]. Inducible systems allow for a rapid shift from the growth phase to the production phase of a fermentation, which allows the cell to devote resources to the formation of the desired product.
Acknowledgments
This work was supported by the NSF grant program, grant number MCB-1817631, awarded to Mattheos Koffas.
References
1. Stephanopoulos G (2012) Synthetic biology and metabolic engineering. ACS Synth Biol 1(11):514–525. https://doi.org/10.1021/sb300094q
2. Tyo KE, Alper HS, Stephanopoulos GN (2007) Expanding the metabolic engineering toolbox: more options to engineer cells. Trends Biotechnol 25(3):132–137. https://doi.org/10.1016/j.tibtech.2007.01.003
3. Cobb RE, Wang Y, Zhao H (2015) High-efficiency multiplex genome editing of Streptomyces species using an engineered CRISPR/Cas system. ACS Synth Biol 4(6):723–728. https://doi.org/10.1021/sb500351f
4. Schweder T, Krüger E, Xu B, Jürgen B, Blomsten G, Enfors S-O, Hecker M (1999) Monitoring of genes that respond to process-related stress in large-scale bioprocesses. Biotechnol Bioeng 65(2):151–159. https://doi.org/10.1002/(SICI)1097-0290(19991020)65:23.0.CO;2-V
5. George S, Larsson G, Olsson K, Enfors SO (1998) Comparison of the Baker’s yeast process performance in laboratory and production scale. Bioprocess Eng 18(2):135–142. https://doi.org/10.1007/PL00008979
6. Enfors SO, Jahic M, Rozkov A, Xu B, Hecker M, Jurgen B, Kruger E, Schweder T, Hamer G, O’Beirne D, Noisommit-Rizzi N, Reuss M, Boone L, Hewitt C, McFarlane C, Nienow A, Kovacs T, Tragardh C, Fuchs L, Revstedt J, Friberg PC, Hjertager B, Blomsten G, Skogman H, Hjort S, Hoeks F, Lin HY, Neubauer P, van der Lans R, Luyben K, Vrabel P, Manelius A (2001) Physiological responses to mixing in large scale bioreactors. J Biotechnol 85(2):175–185. https://doi.org/10.1016/s0168-1656(00)00365-5
7. Villadsen J, Nielsen J, Lidén G (2011) Bioreaction engineering principles, 3rd edn. Springer, Boston, MA
8. Schultenkamper K, Brito LF, Wendisch VF (2020) Impact of CRISPR interference on strain development in biotechnology. Biotechnol Appl Biochem 67(1):7–21. https://doi.org/10.1002/bab.1901
9. Cress BF, Leitz QD, Kim DC, Amore TD, Suzuki JY, Linhardt RJ, Koffas MA (2017) CRISPRi-mediated metabolic engineering of E. coli for O-methylated anthocyanin production. Microb Cell Fact 16(1):10. https://doi.org/10.1186/s12934-016-0623-3
10. Zhang XC, Guo Y, Liu X, Chen XG, Wu Q, Chen GQ (2018) Engineering cell wall synthesis mechanism for enhanced PHB accumulation in E. coli. Metab Eng 45:32–42. https://doi.org/10.1016/j.ymben.2017.11.010
11. Holtz WJ, Keasling JD (2010) Engineering static and dynamic control of synthetic pathways. Cell 140(1):19–23. https://doi.org/10.1016/j.cell.2009.12.029
12. Xia T, SantaLucia J Jr, Burkard ME, Kierzek R, Schroeder SJ, Jiao X, Cox C, Turner DH (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37(42):14719–14735. https://doi.org/10.1021/bi9809425
13. Salis HM, Mirsky EA, Voigt CA (2009) Automated design of synthetic ribosome binding sites to control protein expression. Nat Biotechnol 27(10):946–950. https://doi.org/10.1038/nbt.1568
14. Kierzek R, Burkard ME, Turner DH (1999) Thermodynamics of single mismatches in RNA duplexes. Biochemistry 38(43):14214–14223. https://doi.org/10.1021/bi991186l
15. Pitera DJ, Paddon CJ, Newman JD, Keasling JD (2007) Balancing a heterologous mevalonate pathway for improved isoprenoid production in Escherichia coli. Metab Eng 9(2):193–207. https://doi.org/10.1016/j.ymben.2006.11.002
16. Jones JA, Toparlak OD, Koffas MA (2015) Metabolic pathway balancing and its role in the production of biofuels and chemicals. Curr Opin Biotechnol 33:52–59. https://doi.org/10.1016/j.copbio.2014.11.013
17. Xu P, Gu Q, Wang W, Wong L, Bower AG, Collins CH, Koffas MA (2013) Modular optimization of multi-gene pathways for fatty acids production in E. coli. Nat Commun 4:1409. https://doi.org/10.1038/ncomms2425
18. Jung YK, Kim TY, Park SJ, Lee SY (2010) Metabolic engineering of Escherichia coli for the production of polylactic acid and its copolymers. Biotechnol Bioeng 105(1):161–171. https://doi.org/10.1002/bit.22548
19. Jin LQ, Jin WR, Ma ZC, Shen Q, Cai X, Liu ZQ, Zheng YG (2019) Promoter engineering strategies for the overproduction of valuable metabolites in microbes. Appl Microbiol Biotechnol 103(21–22):8725–8736. https://doi.org/10.1007/s00253-019-10172-y
20. Jiao S, Li X, Yu H, Yang H, Li X, Shen Z (2017) In situ enhancement of surfactin biosynthesis in Bacillus subtilis using novel artificial inducible promoters. Biotechnol Bioeng 114(4):832–842. https://doi.org/10.1002/bit.26197
21. Mairhofer J, Scharl T, Marisch K, Cserjan-Puschmann M, Striedner G (2013) Comparative transcription profiling and in-depth characterization of plasmid-based and plasmid-free Escherichia coli expression systems under production conditions. Appl Environ Microbiol 79(12):3802–3812. https://doi.org/10.1128/AEM.00365-13
22. Englaender JA, Jones JA, Cress BF, Kuhlman TE, Linhardt RJ, Koffas MAG (2017) Effect of genomic integration location on heterologous protein expression and metabolic engineering in E. coli. ACS Synth Biol 6(4):710–720. https://doi.org/10.1021/acssynbio.6b00350
23. Martin VJ, Pitera DJ, Withers ST, Newman JD, Keasling JD (2003) Engineering a mevalonate pathway in Escherichia coli for production of terpenoids. Nat Biotechnol 21(7):796–802. https://doi.org/10.1038/nbt833
24. Cress BF, Trantas EA, Ververidis F, Linhardt RJ, Koffas MA (2015) Sensitive cells: enabling tools for static and dynamic control of microbial metabolic pathways. Curr Opin Biotechnol 36:205–214. https://doi.org/10.1016/j.copbio.2015.09.007
25. Dahl RH, Zhang F, Alonso-Gutierrez J, Baidoo E, Batth TS, Redding-Johanson AM, Petzold CJ, Mukhopadhyay A, Lee TS, Adams PD, Keasling JD (2013) Engineering dynamic pathway regulation using stress-response promoters. Nat Biotechnol 31(11):1039–1046. https://doi.org/10.1038/nbt.2689
26. Chappell J, Westbrook A, Verosloff M, Lucks JB (2017) Computational design of small transcription activating RNAs for versatile and dynamic gene regulation. Nat Commun 8(1):1051. https://doi.org/10.1038/s41467-017-01082-6
27. Meyer S, Chappell J, Sankar S, Chew R, Lucks JB (2016) Improving fold activation of small transcription activating RNAs (STARs) with rational RNA engineering strategies. Biotechnol Bioeng 113(1):216–225. https://doi.org/10.1002/bit.25693
28. To AC-Y, Chu DH, Wang AR, Li FC, Chiu AW, Gao DY, Choi CHJ, Kong SK, Chan TF, Chan KM, Yip KY (2018) A comprehensive web tool for toehold switch design. Bioinformatics 34(16):2862–2864. https://doi.org/10.1093/bioinformatics/bty216
29. Green AA, Silver PA, Collins JJ, Yin P (2014) Toehold switches: de-novo-designed regulators of gene expression. Cell 159(4):925–939. https://doi.org/10.1016/j.cell.2014.10.002
30. Dinh CV, Prather KLJ (2019) Development of an autonomous and bifunctional quorum-sensing circuit for metabolic flux control in engineered Escherichia coli. Proc Natl Acad Sci U S A 116(51):25562–25568. https://doi.org/10.1073/pnas.1911144116
31. Jones JA, Koffas MA (2016) Optimizing metabolic pathways for the improved production of natural products. Methods Enzymol 575:179–193. https://doi.org/10.1016/bs.mie.2016.02.010
32. Daniel R, Rubens JR, Sarpeshkar R, Lu TK (2013) Synthetic analog computation in living cells. Nature 497(7451):619–623. https://doi.org/10.1038/nature12148
33. Xu P, Li L, Zhang F, Stephanopoulos G, Koffas M (2014) Improving fatty acids production by engineering dynamic pathway regulation and metabolic control. Proc Natl Acad Sci U S A 111(31):11299–11304. https://doi.org/10.1073/pnas.1406401111
34. Zhang F, Carothers JM, Keasling JD (2012) Design of a dynamic sensor-regulator system for production of chemicals and fuels derived from fatty acids. Nat Biotechnol 30(4):354–359. https://doi.org/10.1038/nbt.2149
35. Chen Y (2011) Development and application of co-culture for ethanol production by co-fermentation of glucose and xylose: a systematic review. J Ind Microbiol Biotechnol 38(5):581–597. https://doi.org/10.1007/s10295-010-0894-3
36. Li Q, Liu C-Z (2012) Co-culture of Clostridium thermocellum and Clostridium thermosaccharolyticum for enhancing hydrogen production via thermophilic fermentation of cornstalk waste. Int J Hydrog Energy 37(14):10648–10654. https://doi.org/10.1016/j.ijhydene.2012.04.115
37. Bader J, Mast-Gerlach E, Popovic MK, Bajpai R, Stahl U (2010) Relevance of microbial coculture fermentations in biotechnology. J Appl Microbiol 109(2):371–387. https://doi.org/10.1111/j.1365-2672.2009.04659.x
38. Jawed K, Yazdani SS, Koffas MA (2019) Advances in the development and application of microbial consortia for metabolic engineering. Metab Eng Commun 9:e00095. https://doi.org/10.1016/j.mec.2019.e00095
39. Chemler JA, Lock LT, Koffas MA, Tzanakakis ES (2007) Standardized biosynthesis of flavan-3-ols with effects on pancreatic beta-cell insulin secretion. Appl Microbiol Biotechnol 77(4):797–807. https://doi.org/10.1007/s00253-007-1227-y
40. Akdemir H, Silva A, Zha J, Zagorevski DV, Koffas MAG (2019) Production of pyranoanthocyanins using Escherichia coli co-cultures. Metab Eng 55:290–298. https://doi.org/10.1016/j.ymben.2019.05.008
41. Thuan NH, Chaudhary AK, Van Cuong D, Cuong NX (2018) Engineering co-culture system for production of apigetrin in Escherichia coli. J Ind Microbiol Biotechnol 45(3):175–185. https://doi.org/10.1007/s10295-018-2012-x
42. Minty JJ, Singer ME, Scholz SA, Bae CH, Ahn JH, Foster CE, Liao JC, Lin XN (2013) Design and characterization of synthetic fungal-bacterial consortia for direct production of isobutanol from cellulosic biomass. Proc Natl Acad Sci U S A 110(36):14592–14597. https://doi.org/10.1073/pnas.1218447110
43. Wang R, Zhao S, Wang Z, Koffas MA (2020) Recent advances in modular co-culture engineering for synthesis of natural products. Curr Opin Biotechnol 62:65–71. https://doi.org/10.1016/j.copbio.2019.09.004
44. Wang EX, Ding MZ, Ma Q, Dong XT, Yuan YJ (2016) Reorganization of a synthetic microbial consortium for one-step vitamin C fermentation. Microb Cell Factories 15:21. https://doi.org/10.1186/s12934-016-0418-6
45. Zhang H, Pereira B, Li Z, Stephanopoulos G (2015) Engineering Escherichia coli coculture systems for the production of biochemical products. Proc Natl Acad Sci U S A 112(27):8266–8271. https://doi.org/10.1073/pnas.1506781112
46. Evans M, Rees A (2002) Effects of HMG-CoA reductase inhibitors on skeletal muscle: are all statins the same? Drug Saf 25(9):649–663. https://doi.org/10.2165/00002018-200225090-00004
47. Liu Y, Tu X, Xu Q, Bai C, Kong C, Liu Q, Yu J, Peng Q, Zhou X, Zhang Y, Cai M (2018) Engineered monoculture and co-culture of methylotrophic yeast for de novo production of monacolin J and lovastatin from methanol. Metab Eng 45:189–199. https://doi.org/10.1016/j.ymben.2017.12.009
48. Zhou K, Qiao K, Edgar S, Stephanopoulos G (2015) Distributing a metabolic pathway among a microbial consortium enhances production of natural products. Nat Biotechnol 33(4):377–383. https://doi.org/10.1038/nbt.3095
49. Fang Z, Jones JA, Zhou J, Koffas MAG (2018) Engineering Escherichia coli co-cultures for production of curcuminoids from glucose. Biotechnol J 13(5):e1700576. https://doi.org/10.1002/biot.201700576
50. Jones JA, Vernacchio VR, Collins SM, Shirke AN, Xiu Y, Englaender JA, Cress BF, McCutcheon CC, Linhardt RJ, Gross RA, Koffas MAG (2017) Complete biosynthesis of anthocyanins using E. coli polycultures. mBio 8(3):e00621-00617. https://doi.org/10.1128/mBio.00621-17
51. Shiloach J, Fass R (2005) Growing E. coli to high cell density - a historical perspective on method development. Biotechnol Adv 23(5):345–357. https://doi.org/10.1016/j.biotechadv.2005.04.004
52. Neidhardt FC, Bloch PL, Smith DF (1974) Culture medium for enterobacteria. J Bacteriol 119(3):736–747. https://doi.org/10.1128/JB.119.3.736-747.1974
53. Bizzini A, Zhao C, Budin-Verneuil A, Sauvageot N, Giard J-C, Auffray Y, Hartke A (2010) Glycerol is metabolized in a complex and strain-dependent manner in Enterococcus faecalis. J Bacteriol 192(3):779–785. https://doi.org/10.1128/jb.00959-09
54. Donovan RS, Robinson CW, Glick BR (1996) Review: optimizing inducer and culture conditions for expression of foreign proteins under the control of the lac promoter. J Ind Microbiol 16(3):145–154. https://doi.org/10.1007/BF01569997
Chapter 2
Sustainable Technological Methods for the Extraction of Phytochemicals from Citrus Byproducts
Salvatore Multari, Fulvio Mattivi, and Stefan Martens
Abstract
Citrus fruits have great market value and are used by the juice industry in huge quantities. The juice industry processes millions of tons of citrus fruits per year, but only the pulp is utilized, whereas peels, seeds, and membrane residues are mostly discarded. This generates vast amounts of byproducts (>100 million tons/year), since the peel can make up to 50% of the weight of the fresh fruit. Phytochemical investigations have shown that citrus peels are rich sources of bioactive compounds, e.g., phenolic compounds, carotenoids, and monoterpenes. These compounds could find numerous applications in the food, cosmetics, and pharmaceutical industries, and their recovery would provide economic and environmental benefits. Researchers worldwide have developed innovative techniques to recover phytochemicals from citrus waste, in line with international waste-prevention policies. This chapter reviews advances in food technology applied to citrus chemistry and describes the available green techniques for recovering phytochemicals from citrus byproducts.
Key words Waste, Green technologies, Phenolic compounds, Carotenoids, Monoterpenes
Vladimir Shulaev (ed.), Plant Metabolic Engineering: Methods and Protocols, Methods in Molecular Biology, vol. 2396, https://doi.org/10.1007/978-1-0716-1822-6_2, © Springer Science+Business Media, LLC, part of Springer Nature 2022
1 Introduction
Ultrasound-assisted extraction (UAE) exploits the shear forces created by acoustic cavitation bubbles, which are produced by propagating ultrasonic waves (20–100 kHz) [1]. The implosion of the cavitation bubbles disrupts the plant cell wall, which allows the solvent to penetrate the cells and facilitates the extraction of bioactive compounds [2]. UAE is widely used for extracting phytochemicals from citrus byproducts because it is suitable for producing food-grade extracts, requires low-cost instrumentation, and is simple to operate [3]. When performing UAE, several factors must be considered, e.g., extraction time, temperature, and the type and concentration of the solvent. Citrus byproducts are a heterogeneous class of raw material, since they are obtained from fruits of different species, structure, and chemical composition. Therefore, the extraction parameters have to be optimized to maximize the recovery of the phytochemicals from the chosen material.
Enzyme-assisted extraction (EAE) of phytochemicals is a green technology that is suitable for producing food-grade extracts. It enhances the extractability of bioactive molecules that are bound to polymers of the plant cell wall [4]. Since citrus byproducts are inexpensive, highly abundant, and rich in phytochemicals complexed with polysaccharides, enzymes (mostly pectinase and cellulase) have been employed to recover phenolics, carotenoids, and aroma compounds [5–7].
Headspace solid-phase microextraction (HS-SPME) is a simple, solvent-free method for extracting volatile organic compounds (VOCs). The method is rapid because it requires minimal sample preparation [8]. HS-SPME is mostly coupled to GC/MS to allow the identification of the extracted compounds; for this reason, the technique is generally referred to as SPME-GC/MS. Citrus byproducts are prominent sources of aroma-active compounds and are well suited to analysis by SPME [9]. Citrus essential oil, known to provide hundreds of aroma-active compounds, is extracted from the citrus peel that is discarded by the juice industry. Among the numerous VOCs found in citrus byproducts, limonene is by far the most abundant (90% of total VOCs), followed by valencene and γ-terpinene [10].
Fermentation with lactic acid bacteria is a biotechnological approach that has been used for centuries to prolong the shelf life of plant foods, e.g., cabbage, cauliflower, ginger, and garlic, and to improve their organoleptic characteristics. As regards citrus products, it was long believed that their low pH would hamper the growth of lactic acid bacteria, and for this reason the lactic fermentation of citrus products was neglected for a long time.
Recently, lactic acid bacteria have been shown to extensively convert the sour-tasting citric and malic acids into lactic acid [11], which has a mild acidic taste and a long-lasting flavor [12]. This has given momentum to the lactic fermentation of citrus byproducts [13], which embed a wide range of bioactive molecules in their fiber-rich structure. Lactic acid bacteria can soften the food matrix and favor the release of beneficial phytochemicals, e.g., polymethoxylated flavones, and aroma-active compounds, mostly monoterpenes.
2 Materials
2.1 Ultrasound-Assisted Extraction
Prepare all solutions using distilled or deionized water (18 MΩ·cm at 25 °C), and use HPLC- or LC-MS-grade reagents when the crude extracts are to be injected into the instruments. Sonication generates heat in proportion to the length of the process.
Sample particles have to be uniform in size; therefore, mill the dry samples finely or cut the fresh samples uniformly into small squares (0.5 cm per side). The extraction mixtures are made of aqueous methanol or ethanol. The extractions are performed at T = 20–50 °C for 15–40 min.
1. An ultrasonic bath equipped with a variable transducer.
2. Thermometer and ice, in case the bath is not fitted with a thermostat.
3. Extraction mixture: aqueous methanol, ethanol, or acetone at a concentration of 20–80% v/v.
2.2 Enzyme-Assisted Extraction
1. Enzymes: cellulase, pectinase, and β-glucosidase.
2. pH meter.
3. Ethanol.
4. Acetate buffer (V = 1 L; 20 mM, pH = 5.0). Pour 800 mL of distilled water into a glass bottle, add a stir bar, and place the bottle on a magnetic stirrer. Add 1.105 g of anhydrous sodium acetate and 0.392 g of 100% acetic acid to the water. Start stirring the solution and measure the pH. Bring the pH of the solution to 5.0 with 0.5 M acetic acid or NaOH. Transfer the solution into a 1-L volumetric flask, and add distilled water up to the mark. Bring the buffer to 40 °C.
5. Enzymatic reaction medium. Place 1 g of citrus byproducts in a 50-mL tube, and add 10 mL of acetate buffer.
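The buffer recipe in item 4 and the enzyme dosage of 5 U per g of material used in Subheading 3.2 both come down to short calculations. The following is a minimal sketch, not part of the original protocol: the molar masses and the pKa of acetic acid (4.76) are textbook values assumed here, the Henderson-Hasselbalch estimate ignores ionic strength and temperature, and the specific activity of 10 U/mg is a hypothetical figure standing in for the value on the manufacturer's data sheet.

```python
import math

# Assumed constants (not stated in the protocol).
MW_SODIUM_ACETATE = 82.03  # g/mol, anhydrous sodium acetate
MW_ACETIC_ACID = 60.05     # g/mol
PKA_ACETIC_ACID = 4.76     # textbook value at 25 °C


def acetate_buffer_check(mass_salt_g, mass_acid_g, volume_l):
    """Return (total acetate in mM, Henderson-Hasselbalch pH estimate)."""
    mol_salt = mass_salt_g / MW_SODIUM_ACETATE
    mol_acid = mass_acid_g / MW_ACETIC_ACID
    total_mm = (mol_salt + mol_acid) / volume_l * 1000.0
    ph = PKA_ACETIC_ACID + math.log10(mol_salt / mol_acid)
    return total_mm, ph


def enzyme_mass_mg(dose_u_per_g, sample_g, specific_activity_u_per_mg):
    """Mass of enzyme preparation (mg) needed for a dose given in U per g of material."""
    return dose_u_per_g * sample_g / specific_activity_u_per_mg


if __name__ == "__main__":
    total_mm, ph = acetate_buffer_check(1.105, 0.392, 1.0)
    print(f"total acetate: {total_mm:.1f} mM, estimated pH: {ph:.2f}")
    # Hypothetical enzyme lot with 10 U/mg, dosed at 5 U/g on a 1 g sample.
    print(f"enzyme to weigh: {enzyme_mass_mg(5.0, 1.0, 10.0):.2f} mg")
```

With the masses from item 4, the total acetate comes out at about 20 mM and the estimated pH slightly above 5, which is consistent with the instruction to make the final adjustment to pH 5.0 with the pH meter.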
2.3 Head Space Solid Phase Microextraction (HS-SPME)
1. SPME should be performed on fresh samples that are finely milled. To avoid oxidation of the volatile compounds, samples have to be milled under liquid nitrogen or when frozen. The chemicals used to perform this technique need to be of high purity grade. Operational requirement:
2. Standard mix of saturated alkanes, e.g., C7-C40, from Supelco Inc. to determine the Kovats retention index (RI).
3. Silica fiber (2-cm long) coated with a polymeric absorbent material and a fiber holder.
4. Capillary GC column.
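The alkane standard mix in item 2 is used to compute the retention index of each unknown peak. The sketch below implements the linear interpolation form of the index given in Subheading 3.3 (the form commonly applied to temperature-programmed runs); the function name and the retention times in the example are hypothetical, and the calculation should be adapted to the chromatographic method actually used.

```python
from bisect import bisect_right


def kovats_ri(t_x, alkane_times):
    """Linear retention index of a peak eluting at t_x.

    alkane_times maps the carbon number of each n-alkane to its retention
    time, in the same units as t_x.
    """
    if len(alkane_times) < 2:
        raise ValueError("at least two n-alkanes are required")
    carbons = sorted(alkane_times)
    times = [alkane_times[c] for c in carbons]
    if any(later <= earlier for earlier, later in zip(times, times[1:])):
        raise ValueError("alkane retention times must increase with carbon number")
    if not times[0] <= t_x <= times[-1]:
        raise ValueError("peak elutes outside the alkane ladder")
    # Index of the first alkane eluting after the peak (clamped to the last one).
    i = min(bisect_right(times, t_x), len(times) - 1)
    n, big_n = carbons[i - 1], carbons[i]
    t_n, t_big_n = times[i - 1], times[i]
    return 100.0 * (n + (big_n - n) * (t_x - t_n) / (t_big_n - t_n))


if __name__ == "__main__":
    # Hypothetical retention times (min) for C10 and C11, and an unknown peak.
    print(kovats_ri(13.0, {10: 12.0, 11: 14.0}))
```

A peak that elutes exactly halfway between C10 and C11 receives an index of 1050; the value can then be compared with published indices for the same stationary phase.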
2.4 Fermentation with Lactic Acid Bacteria
1. MRS broth and PBS.
2. Pyrex glass (or autoclavable) containers.
3 Methods
3.1 Ultrasound-Assisted Extraction
When performing UAE of phytochemicals, several parameters need to be adapted to the purpose of the study (see Notes 1–5).
1. Place the samples in the tubes and add the solvent mixture.
2. Mix the samples thoroughly and place them in the bath.
3. Fix the samples in the bath, since the sonication will make them drift. A piece of polyethylene foam or a rack can be used for this purpose.
4. Add cold water to the bath up to the operating level line. The samples have to be immersed in the bath to obtain the maximum extraction efficiency.
5. Select the required temperature, apply the chosen frequency, namely 40 kHz, and start the sonication. Keep sonicating for about 20 min.
6. Centrifuge the extracts (~3000 × g), filter with PTFE membranes to obtain clear extracts, and store them in glass vials for further analysis.
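Three quantities in the steps above are usually worked out before starting: the solvent volume for a given sample mass (1:20 w/v, see Note 4), the volumes of organic solvent and water for a chosen % v/v, and the rotor speed that corresponds to ~3000 × g. The sketch below is illustrative only; the function names are hypothetical, the mixing calculation ignores volume contraction, and the rotor radius of 10 cm is an assumption to be replaced by the value for the centrifuge in use.

```python
import math


def solvent_for_sample(sample_g, ratio_ml_per_g=20.0):
    """Solvent volume (mL) for a sample/solvent ratio of 1:ratio w/v."""
    return sample_g * ratio_ml_per_g


def solvent_volumes(total_ml, organic_percent_vv):
    """Nominal (organic, water) volumes in mL, ignoring contraction on mixing."""
    organic = total_ml * organic_percent_vv / 100.0
    return organic, total_ml - organic


def rpm_for_rcf(rcf, rotor_radius_cm):
    """Rotor speed from RCF = 1.118e-5 * radius_cm * rpm**2."""
    return math.sqrt(rcf / (1.118e-5 * rotor_radius_cm))


if __name__ == "__main__":
    volume = solvent_for_sample(0.5)             # 0.5 g of sample
    organic, water = solvent_volumes(volume, 80)  # 80% v/v aqueous methanol
    print(f"{volume:.1f} mL total: {organic:.1f} mL methanol + {water:.1f} mL water")
    print(f"~3000 x g at 10 cm radius: {rpm_for_rcf(3000, 10.0):.0f} rpm")
```

Working in × g rather than rpm keeps the protocol transferable between centrifuges with different rotors.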
3.2 Enzyme-Assisted Extraction
After purchase, record all the enzyme parameters and store the enzymes at −20 °C (see Notes 6–12).
1. Add the enzyme to the reaction medium. The enzymes cellulase, pectinase, and β-glucosidase can be used at a concentration of 5 U per g of material, i.e., 5 U_enz g⁻¹ DM. Calculate the amount of enzyme to add in order to obtain this dosage.
2. Incubate the samples in the dark for 12 h (or overnight) at 40 °C under shaking (400 rpm).
3. Stop the enzymatic hydrolysis by heating the samples at 95 °C for 5 min. A heating block or a water bath can be used for this purpose.
4. Cool the samples in an ice bath.
5. Food-grade extract: Add 20 mL of ethanol to the samples, and mix on an orbital shaker for 15 min at 900 rpm. Centrifuge the samples at 6000 × g for 10 min at 18 °C. Collect the supernatant and dry it under reduced pressure at T < 40 °C. Store the extract at −20 °C.
6. Non-food-grade extract: To the samples from step 4, add 1 mL of ethanol, and mix thoroughly. Then, add 6 mL of ethyl acetate, mix on an orbital shaker for 10 min, and centrifuge at 3000 × g for 5 min at 18 °C. Collect the supernatant and repeat this step twice. Pool the supernatants and remove the organic solvent under reduced pressure at T < 40 °C. To the dry extract, add 1 mL of MeOH, filter with PTFE membranes to obtain clear extracts, and store them in glass vials at −20/−80 °C for further analysis.
3.3 Head Space Solid Phase Microextraction (HS-SPME)
1. Sample preparation. Fresh samples are the most suitable for SPME analysis, since the volatile compounds are severely diminished by dehydration treatments.
2. Internal standard. The use of an internal standard is essential in SPME-GC/MS analyses. Several internal standards can be employed, e.g., tridecene, 4-methyl-2-pentanone, 2-octanol, and 3-hexen-1-ol. It is advisable to choose a compound that is not naturally present in the samples. The internal standard should be added at a concentration that is in line with the concentrations of the other compounds.
3. Preconditioning. When samples are injected manually into the GC, the conditioning can be performed in a water bath, whereas for GC instruments equipped with an autosampler, the conditioning is performed automatically. The use of the autosampler improves the precision and reproducibility of the data.
4. SPME fiber. The SPME fiber employed should (1) provide optimum extraction and desorption, and (2) prevent sample carryover. CAR/PDMS-coated fibers efficiently extract small volatile compounds, i.e., MW < 150 g mol⁻¹, and are suitable for the analysis of citrus byproducts, which are rich in monoterpenes.
5. Reference compounds. These are dissolved in pure ethanol. The calibration curves should be built in a mixture that reflects the chemical composition of the analyzed matrix, e.g., equivalent amounts of sugars and organic acids, to avoid shifts in the retention time. This is particularly important for liquid samples, i.e., citrus juices.
6. Identification. When the reference compounds are available, the peaks of interest are identified by comparison with the standards. Since citrus byproducts provide hundreds of VOCs, several standards might not be available. In this case, the peaks are identified by searching the relevant spectral database. A high matching score indicates a high degree of mass spectrum similarity. A matching score ≥850 provides a practical degree of confidence.
7. Quantification. The absolute quantification of the analytes through SPME presents challenges, due to the operational characteristics of the fiber. The R² values of the standard curves are often <0.98. Normally, researchers perform semiquantification based on the internal standard peak.
8. Place 2 g of citrus powder into a 20-mL headspace vial.
9. Add 1 mL of deionized water.
10. Introduce the internal standard to a final concentration of ~1 mg kg⁻¹.
11. Precondition the samples at 40 °C for 15 min.
12. Expose the SPME fiber to the headspace of the samples for 40 min.
13. Inject into the GC by desorbing the fiber into the injection port at 250 °C for 5 min.
14. Run the chromatographic method.
15. Identify the peaks of interest by comparison with those of the authentic standards and the mass spectral database, e.g., NIST 2.0, Wiley 8, and FFNSC 2.
16. Calculate the Kovats RI to assist in the identification of the peaks of interest. The RI can be calculated with the following equation (to be adjusted to your chromatographic method):
\[ \mathrm{RI} = 100 \left[ n + (N - n)\,\frac{t(x) - t(n)}{t(N) - t(n)} \right] \]
where n is the number of carbon atoms of the n-alkane eluting before the unknown peak; N is the number of carbon atoms of the n-alkane eluting after the unknown peak; t is the retention time; and x is the unknown peak.
3.4 Fermentation with Lactic Acid Bacteria
Perform the fermentation with autoclaved containers, tools, and racks. If the starting raw material is in dry form, hydrate it by adding 5 mL of distilled water per gram of citrus powder. Prior to the fermentation, pasteurize the citrus material (85 °C, 30 s).
1. Control samples. These are subjected to all the steps of the treatment but without the inoculum of lactobacilli. To verify the efficacy of the initial pasteurization, perform the cell count on the controls as well.
2. Incubation temperature. Perform the fermentation at a temperature of 30–37 °C.
3. Incubation time. Fiber-rich raw materials, such as citrus byproducts, require a relatively long period of incubation (5–7 days) to soften the citrus matrix and allow the recovery of the phytochemicals. If thermosensitive compounds, e.g., anthocyanins, are targeted, the incubation period can be shortened to 24–48 h to avoid extensive degradation of the molecules.
4. Matrix. Use fresh material (not dry) to perform the fermentation.
5. The day before the experiment, rehydrate the lactobacilli in MRS broth and acidify the broth to pH 5.5 with 5 mol L⁻¹ lactic acid.
6. Activate the strains through incubation at 30 °C for 24 h.
7. Following activation, collect the lactobacilli by centrifugation (3360 × g, 10 min, 4 °C). Wash the pellet of lactobacilli twice with sterile PBS, and resuspend it in sterile PBS to a final concentration of 10⁸ CFU mL⁻¹.
8. Inoculate 1 mL of the cell suspension (lactobacilli) into the hydrated citrus byproducts to reach a final concentration of 10⁶ CFU mL⁻¹.
9. Incubate at 30 °C for 5 days.
10. Following fermentation, count the viable cells. This step is essential to verify the growth of the lactobacilli. Prepare a series of dilutions in sterile PBS or 0.8% NaCl. Finally, spread 0.2 mL of the diluted sample onto MRS agar plates (acidified to pH 5.5 with 5 mol L⁻¹ lactic acid). Incubate at 30 °C for 48 h, and count the colonies manually.
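Steps 7, 8, and 10 involve two routine calculations: the inoculum volume needed to reach the target cell density, and the viable count back-calculated from the colonies on a plate. The following is a minimal sketch under stated assumptions: the function names, the final volume of 100 mL, and the colony count are hypothetical, and the dilution factor is taken as the total fold-dilution of the plated sample.

```python
def inoculum_volume_ml(stock_cfu_per_ml, target_cfu_per_ml, final_volume_ml):
    """Volume of cell suspension needed, from C1 * V1 = C2 * V2."""
    return target_cfu_per_ml * final_volume_ml / stock_cfu_per_ml


def cfu_per_ml(colonies, dilution_factor, plated_volume_ml=0.2):
    """Viable count of the undiluted sample.

    dilution_factor is the total fold-dilution, e.g. 1e6 for a 10^-6 dilution.
    """
    return colonies * dilution_factor / plated_volume_ml


if __name__ == "__main__":
    # A 10^8 CFU/mL suspension diluted to 10^6 CFU/mL in 100 mL of material.
    print(f"inoculum: {inoculum_volume_ml(1e8, 1e6, 100.0):.2f} mL")
    # 42 colonies counted on a plate spread with 0.2 mL of a 10^-6 dilution.
    print(f"viable count: {cfu_per_ml(42, 1e6):.2e} CFU/mL")
```

The first calculation shows that the 1 mL inoculum of step 8 reaches 10⁶ CFU mL⁻¹ when the total volume of hydrated material is about 100 mL; for other volumes the inoculum scales proportionally.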
4 Notes
1. Solvent concentration. UAE is performed at solvent concentrations ranging from 20 to 80% v/v. Highly hydrophilic compounds, e.g., phenolic acids, require aqueous environments, whereas lipophilic compounds, e.g., carotenoids, are extracted in 100% organic mixtures.
2. Solvent type. Methanol covers the largest range of applications. Ethanol/aqueous ethanol should be used when extracts are intended for human consumption. Acetone and mixtures of apolar solvents, e.g., hexane, should be used for extracting lipophilic compounds.
3. Time. UAE is usually performed for 20 min; overly long procedures can damage the bioactive compounds. Nevertheless, fibrous materials, such as citrus leaves and seeds, can be sonicated for up to 45 min.
4. Sample/solvent ratio. The ratio 1:20 w/v is the most commonly employed. However, the solvent mixture should not saturate. When an excess of solvent is used, the surplus volume can be removed under vacuum.
5. Byproduct type. Citrus byproducts are diverse. The flavedo is an oil-rich tissue to be extracted at T < 20 °C to avoid damaging the heat-sensitive components. The albedo is a porous material to be extracted with high volumes of water. The pomace derived from juice production is rich in fiber and requires high power.
6. Enzyme unit (U). Manufacturers detail the activity of the purchased enzymes. One unit of activity is defined as the amount of enzyme that releases 1 μmol of product (mostly glucose) from a given substrate, e.g., cellulose, in 1 min at pH 5–7 and 37 °C.
7. Measurement of enzyme U. Manufacturers provide accurate information on the activity of the enzymes. Nevertheless, the activity can be measured with a spectrophotometer. This requires a calibration curve of a standard compound, for instance D-galacturonic acid for assessing the activity of pectinase.
8. Enzyme type. Pectinase, cellulase, and β-glucosidase are commonly used to extract phytochemicals from citrus byproducts, but other enzymes can be employed, e.g., tannase and xylanase. The selection should be based on the chemical (proximate) composition of the material employed.
9. Enzyme mix. Enzymes can be used in combination or singly. The combination of enzymes leads to extensive hydrolysis, whereas single enzymes perform selective hydrolysis.
10. Concentrations. Enzymes can be used at different concentrations, usually up to 20 U g⁻¹ DM. However, they are expensive and should be used rationally. The concentration of 5 U g⁻¹ DM is a reasonable compromise between efficiency and cost.
11. Form. Enzymes can be purchased in solution or as dry powder. The effect on the matrix is equivalent.
12. Production. Enzymes can be produced on site by microbial solid-state fermentation. Methods of enzyme production are available in the literature [14–16]. When enzymes are produced on site, the enzymatic activity must be measured.
References
1. Medina-Torres N, Espinosa-Andrews H, Trombotto S et al (2019) Ultrasound-assisted extraction optimization of phenolic compounds from Citrus latifolia waste for chitosan bioactive nanoparticles development. Molecules 24:3541. https://doi.org/10.3390/molecules24193541
2. Nipornram S, Tochampa W, Rattanatraiwong P, Singanusong R (2018) Optimization of low power ultrasound-assisted extraction of phenolic compounds from mandarin (Citrus reticulata Blanco cv. Sainampueng) peel. Food Chem 241:338–345. https://doi.org/10.1016/j.foodchem.2017.08.114
3. Montero-Calderon A, Cortes C, Zulueta A et al (2019) Green solvents and ultrasound-assisted extraction of bioactive orange (Citrus sinensis) peel compounds. Sci Rep 9:16120. https://doi.org/10.1038/s41598-019-52717-1
4. Van Hung P, Nhi NHY, Ting LY, Phi NTL (2020) Chemical composition and biological activities of extracts from pomelo peel by-products under enzyme and ultrasound-assisted extractions. J Chem 2020:1043251. https://doi.org/10.1155/2020/1043251
5. Ruviaro AR, de Paula Menezes Barbosa P, Macedo GA (2019) Enzyme-assisted biotransformation increases hesperetin content in citrus juice by-products. Food Res Int 124:213–221. https://doi.org/10.1016/j.foodres.2018.05.004
6. Li BB, Smith B, Hossain MM (2006) Extraction of phenolics from citrus peels. Sep Purif Technol 48:189–196. https://doi.org/10.1016/j.seppur.2005.07.019
7. Chavez-Gonzalez ML, Lopez-Lopez LI, Rodriguez-Herrera R et al (2016) Enzyme-assisted extraction of citrus essential oil. Chem Pap 70:412–417. https://doi.org/10.1515/chempap-2015-0234
8. Zacharis CK, Tzanavaras PD (2020) Solid-phase microextraction. Molecules 25:379. https://doi.org/10.3390/molecules25020379
9. Goh RMV, Lau H, Liu SQ et al (2019) Comparative analysis of pomelo volatiles using headspace-solid phase micro-extraction and solvent assisted flavour evaporation. LWT-Food Sci Technol 99:328–345. https://doi.org/10.1016/j.lwt.2018.09.073
10. Zhang H, Xie Y, Liu C et al (2017) Comprehensive comparative analysis of volatile compounds in citrus fruits of different species. Food Chem 230:316–326. https://doi.org/10.1016/j.foodchem.2017.03.040
11. Multari S, Carafa I, Barp L et al (2020) Effects of Lactobacillus spp. on the phytochemical composition of juices from two varieties of Citrus sinensis L. Osbeck: “Tarocco” and “Washington navel”. LWT-Food Sci Technol 125:109205. https://doi.org/10.1016/j.lwt.2020.109205
12. Chen C, Zhao S, Hao G et al (2017) Role of lactic acid bacteria on the yogurt flavour: a review. Int J Food Prop 20:S316–S330. https://doi.org/10.1080/10942912.2017.1295988
13. Kimoto-Nira H, Moriya N, Nogata Y et al (2019) Fermentation of Shiikuwasha (Citrus depressa Hayata) pomace by lactic acid bacteria to generate new functional materials. Int J Food Sci Technol 54:688–695. https://doi.org/10.1111/ijfs.13980
14. Valdo Madeira J, Rosas Ferreira L, Alves Macedo J, Alves Macedo G (2015) Efficient tannase production using Brazilian citrus residues and potential application for orange juice valorization. Biocatal Agric Biotechnol 4:91–97. https://doi.org/10.1016/j.bcab.2014.11.005
15. Ahmed I, Zia MA, Hussain MA et al (2016) Bioprocessing of citrus waste peel for induced pectinase production by Aspergillus niger; its purification and characterization. J Radiat Res Appl Sci 9:148–154. https://doi.org/10.1016/j.jrras.2015.11.003
16. Sharma D, Mahajan R (2020) Development of methodology for concurrent maximum production of alkaline xylanase-pectinase enzymes in short submerged fermentation cycle. Waste Biomass Valori 11:6065–6072. https://doi.org/10.1007/s12649-019-00853-0
Chapter 3
Reconstitution of Metabolic Pathway in Nicotiana benthamiana
Chenggang Liu
Abstract
Reconstitution of metabolite biosynthesis pathways plays a pivotal role in the functional characterization of biosynthetic enzymes and in metabolite bioengineering. Traditionally, metabolic pathways are reconstituted in bacteria or yeast because these hosts are easy to manipulate genetically and to transform. However, many plant metabolite pathways involve multiple enzyme complexes channeled on the plant endomembrane system, which is absent in bacteria and yeast. Nicotiana benthamiana is particularly suitable for reconstituting plant metabolite pathways that involve enzymes associated with the plant endomembrane system. Compared with other plants, N. benthamiana can easily be transiently transformed with multiple genes simultaneously by a procedure called leaf agroinfiltration. The results of transient transformation can be analyzed within several days, compared with several months for stable transformation procedures. In this chapter, we present a protocol for multiple-gene transformation by agroinfiltration, followed by UPLC/MS analysis.
Key words Metabolic pathway reconstruction, Plant transformation, Nicotiana benthamiana, Leaf agroinfiltration
Vladimir Shulaev (ed.), Plant Metabolic Engineering: Methods and Protocols, Methods in Molecular Biology, vol. 2396, https://doi.org/10.1007/978-1-0716-1822-6_3, © Springer Science+Business Media, LLC, part of Springer Nature 2022
1 Introduction
Reconstitution of metabolite biosynthesis pathways plays a pivotal role in the functional characterization of biosynthetic enzymes and in metabolite bioengineering. Traditionally, metabolic pathways are reconstituted in bacteria or yeast because these hosts are easy to manipulate genetically and to transform. However, many plant metabolite pathways involve multiple enzyme complexes channeled on the plant endomembrane system, which is absent in bacteria and yeast. In these scenarios, the metabolite pathways can only be reconstituted in a plant host. Nicotiana benthamiana is particularly suitable for reconstituting plant metabolite pathways that involve enzymes associated with the plant endomembrane system. Compared with other plants, N. benthamiana can be easily transformed by common Agrobacterium-mediated transformation systems. In particular, N. benthamiana can be transiently transformed with multiple genes simultaneously by a procedure called leaf agroinfiltration. The results of transient transformation can be analyzed within several days, compared with several months for stable transformation procedures.
The N. benthamiana agroinfiltration procedure has been successfully used to reconstitute the pathways of the cyanogenic metabolites 4-hydroxyindole-3-carbonyl nitrile (4-OH-ICN) of Arabidopsis [1] and dhurrin of sorghum [2]. It was also used to reconstitute the metabolic pathways of the triterpenes thalianin, thalianyl medium-chain fatty acid esters, and arabidin of Arabidopsis [3]. All of these pathways involve membrane-associated P450 enzyme complexes. In this chapter, a protocol for performing multiple-gene transformation by agroinfiltration, followed by UPLC/MS analysis, is presented.
2 Materials
2.1 Materials and Reagents
1. Nicotiana benthamiana plants, around 4 weeks old.
2. Agrobacterium tumefaciens harboring a plant binary plasmid expressing the gene of interest under the CaMV 35S promoter.
3. LB medium.
4. Antibiotics.
5. 100 mM MES (2-(N-morpholino)ethanesulfonic acid) stock, pH 5.6; adjust the pH with KOH.
6. 1 M MgCl2 stock.
7. 200 mM acetosyringone stock in DMSO.
8. Transformation buffer: 10 mM MES, pH 5.6, 10 mM MgCl2, 100 μM acetosyringone (prepare this buffer freshly from the stocks before use, see Note 1).
9. 15-mL culture tubes.
10. 1-mL Luer-slip disposable syringes.
11. 50-mL polypropylene disposable centrifuge tubes.
12. 250-mL flasks (sterilized by autoclaving).
13. One waste flask to collect bacterial waste.
14. Commercial bleach.
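The transformation buffer in item 8 is prepared by diluting the three stocks of items 5–7. A minimal sketch of the C1V1 = C2V2 arithmetic is given below; the function names and the final volume of 50 mL are hypothetical choices for illustration, and all concentrations are expressed in mM.

```python
def stock_volume_ml(stock_mm, final_mm, final_volume_ml):
    """Volume of stock needed, from C1 * V1 = C2 * V2 (concentrations in mM)."""
    return final_mm * final_volume_ml / stock_mm


def transformation_buffer(final_volume_ml):
    """Volumes (mL) of each stock and of water for the transformation buffer."""
    # name: (stock concentration in mM, final concentration in mM)
    recipe = {
        "MES pH 5.6 (100 mM stock)": (100.0, 10.0),
        "MgCl2 (1 M stock)": (1000.0, 10.0),
        "acetosyringone (200 mM stock in DMSO)": (200.0, 0.1),
    }
    volumes = {
        name: stock_volume_ml(stock, final, final_volume_ml)
        for name, (stock, final) in recipe.items()
    }
    volumes["water"] = final_volume_ml - sum(volumes.values())
    return volumes


if __name__ == "__main__":
    for name, volume in transformation_buffer(50.0).items():
        print(f"{name}: {volume:.3f} mL")
```

For 50 mL of buffer this gives 5 mL of MES stock, 0.5 mL of MgCl2 stock, and 25 μL of acetosyringone stock, with water making up the remainder.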
2.2 Equipment
1. Incubator shaker.
2. Bench-top centrifuge with a rotor accepting 50-mL centrifuge tubes.
3. Spectrophotometer.
4. Bench-top shaker.
2.3 Materials for UPLC/MS Analyses
1. 80% methanol with 2 μg/mL umbelliferone (see Note 2).
2. Liquid nitrogen.
3. Mortar and pestle.
4. 15-mL polypropylene disposable centrifuge tubes.
5. 1.5-mL microcentrifuge tubes.
6. HPLC vials.
7. Microvolume inserts.
8. ACQUITY UPLC HSS T3 Column, 100 Å, 1.8 μm, 2.1 mm × 50 mm.
9. ACQUITY UPLC HSS T3 VanGuard Pre-column, 100 Å, 1.8 μm, 2.1 mm × 5 mm.
10. UPLC grade water.
11. UPLC grade methanol.
12. UPLC grade water with 0.1% formic acid.
13. UPLC grade methanol with 0.1% formic acid.
14. Speed vacuum system.
15. Refrigerated centrifuge.
16. Refrigerated microcentrifuge.
17. Benchtop vortexer.
18. LC/MS system with accurate mass analysis capability.
3 Methods
1. On the afternoon of day 1, grow Agrobacterium overnight in 3 mL of LB with appropriate antibiotics in 15-mL culture tubes. Set the incubator shaker at 28 °C, 220 rpm.
2. On the afternoon of day 2, inoculate 50 mL of culture medium in a 250-mL flask with 0.1 mL of the freshly grown Agrobacterium from step 1. Grow the Agrobacterium overnight. Set the incubator shaker at 28 °C, 220 rpm.
3. On the morning of day 3, transfer the overnight-grown Agrobacterium into 50-mL centrifuge tubes and centrifuge the bacterial culture at 5000 × g for 15 min.
4. Pour off the supernatant into a waste flask, and resuspend the bacterial pellets in transformation buffer. Adjust the OD600 to 1.0 with an appropriate amount of transformation buffer (see Note 3). Sterilize the bacterial waste with bleach or by autoclaving.
5. Pick healthy and well-developed N. benthamiana leaves for infiltration. If multiple Agrobacterium strains need to be infiltrated, combine equal amounts of the bacterial suspensions in one tube (see Note 4). If the substrate is not present in leaves, it can be added to the bacterial suspension. Use a 1-mL Luer-slip disposable syringe to infiltrate the bacteria into the leaves from the abaxial side (lower side). Use a finger to counter the force of the syringe from the adaxial side (upper side). A steady spread of wetting liquid should be observed.
6. Leave the plants under low-light conditions for 2–5 days (see Note 5).
7. After 2–5 days, harvest the infiltrated leaves and freeze them in liquid nitrogen. The leaves can be stored in a −80 °C freezer for later analyses or analyzed immediately.
8. Grind the leaves into powder in liquid nitrogen. Transfer the powder into 80% methanol with 2 μg/mL umbelliferone. For every 200 mg of powder, use 1 mL of 80% methanol (see Note 6). Mix the extraction slurries by vortexing.
9. Leave the extraction slurries in a 4 °C refrigerator overnight (see Note 7).
10. Centrifuge the extraction slurries at 10,000 × g, 4 °C.
11. Transfer the supernatants into new sterilized tubes and store them at −80 °C for later analyses. Discard the pellets.
12. Transfer 200 μL of the supernatants into 1.5-mL microcentrifuge tubes and dry them in a speed vacuum at room temperature.
13. Resuspend the dried pellets in 60 μL of UPLC grade water (see Note 8).
14. Centrifuge the resuspension at 10,000 × g for 10 min and transfer 50 μL of the supernatants into microvolume inserts in HPLC vials. Avoid carrying over any solid particles. If necessary, centrifuge one more time.
15. Perform UPLC/MS analyses by injecting 5 μL of the supernatants.
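The OD600 adjustment in step 4, the equal-volume mixing in step 5, and the solvent ratio in step 8 can be worked out ahead of time. The sketch below is illustrative only: the function names and example readings are hypothetical, and the measured OD600 is assumed to be already corrected for any dilution made to keep the reading within the linear range of the spectrophotometer.

```python
def buffer_to_add_ml(measured_od, suspension_ml, target_od=1.0):
    """Transformation buffer (mL) to add so the suspension reaches target_od."""
    if measured_od < target_od:
        raise ValueError("suspension is already below the target OD600")
    return suspension_ml * (measured_od / target_od - 1.0)


def per_strain_od(n_strains, od_each=1.0):
    """OD600 contributed by each strain after mixing equal volumes."""
    return od_each / n_strains


def extraction_solvent_ml(powder_mg, mg_per_ml=200.0):
    """Volume of 80% methanol for a given mass of leaf powder (step 8)."""
    return powder_mg / mg_per_ml


if __name__ == "__main__":
    print(f"add {buffer_to_add_ml(2.5, 10.0):.1f} mL of buffer")
    print(f"each of 3 strains ends at OD600 {per_strain_od(3):.2f}")
    print(f"use {extraction_solvent_ml(350.0):.2f} mL of 80% methanol")
```

Because mixing equal volumes dilutes every strain, the effective density of each construct falls as more strains are combined; this may be worth keeping in mind when comparing infiltrations with different numbers of genes.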
4 Notes
1. In most cases, adding acetosyringone is optional.
2. Umbelliferone serves as an internal standard; other internal standards can be used depending on the researcher’s own purpose.
3. If acetosyringone is added to the transformation buffer, the bacterial suspensions can be left on the bench for 4 h to overnight to increase the transformation efficiency.
4. If Agrobacterium strains carrying different antibiotic resistances are mixed, make sure the antibiotics are washed off before mixing.
5. Analyze the samples from day 2 to day 5 to determine the optimal expression level.
6. Other extraction ratios can be used, depending on the production level of the target metabolite.
7. Some labile metabolites may need to be analyzed immediately.
8. This step is designed for water-soluble metabolites. Metabolites that are soluble only in methanol can be analyzed directly in 80% methanol. In these cases, the standards also need to be dissolved in the same solvent.
References
1. Rajniak J, Barco B, Clay NK, Sattely ES (2015) A new cyanogenic metabolite in Arabidopsis required for inducible pathogen defence. Nature 525:376–379
2. Laursen T et al (2016) Characterization of a dynamic metabolon producing the defense compound dhurrin in sorghum. Science 354:890–893
3. Huang AC et al (2019) A specialized metabolic network selectively modulates Arabidopsis root microbiota. Science 364:eaau6389
Chapter 4
A Protocol for Phylogenetic Reconstruction
Soham Sengupta and Rajeev K. Azad
Abstract
The similarity of biological functions and molecular mechanisms in living organisms suggests their common origin. The inference of evolutionary relationships among extant organisms is primarily based on structural, functional, and sequence data of biomolecules, such as DNA, RNA, and protein, and their relative changes over the course of time. To decipher evolutionary relationships, a variety of data can be used. The exponential growth of genomic data, spurred by advances in DNA sequencing, has enabled biologists to reconstruct the tree or network of life for a vast number of organisms living on Earth. In addition to organismal relationships, phylogenetic analysis is often performed to characterize gene families, specifically to identify the orthologs and paralogs of a gene of interest and understand their varied functions in the light of evolution. In this chapter, we describe a protocol for reconstructing a phylogenetic tree using the maximum-likelihood approach. We demonstrate the protocol using an example dataset and a suite of publicly available programs.
Key words Phylogenetic reconstruction, Phylogenetic tree, Maximum-likelihood, Phylogenetic analysis
1
Introduction The similarity of biological functions and molecular mechanisms in living organisms suggests their common origin. The inference of evolutionary relationships among the extant organisms is primarily based on structural, functional, and sequence data of biomolecules, such as DNA, RNA, and protein, and their relative changes over the course of time. The discipline of evolution and phylogenetics emerged in early nineteenth century when Darwin noticed the striking phenotypic variations in finches of Galapagos Islands. These variations inspired him to the idea of a branching process of evolution consistent with the knowledge of contemporary fossil researchers studying such variations over longer periods of time. To decipher evolutionary relationships, a variety of data can be used. Morphological data to classify organisms was first introduced by Carolus Linnaeus [1]. Even today, with the volume of molecular data for fine-scale taxonomic classification, morphological
Vladimir Shulaev (ed.), Plant Metabolic Engineering: Methods and Protocols, Methods in Molecular Biology, vol. 2396, https://doi.org/10.1007/978-1-0716-1822-6_4, © Springer Science+Business Media, LLC, part of Springer Nature 2022
characteristics remain valuable for the classification of organisms. The exponential growth of genomic data, spurred by advances in DNA sequencing, has enabled biologists to reconstruct the tree or network of life for a vast number of organisms inhabiting the earth. In addition to organismal relationships, phylogenetic analysis is often performed to characterize gene families, specifically to identify the orthologs and paralogs of a gene of interest and understand their varied functions in the light of evolution. In this chapter, we describe a protocol for reconstructing a phylogenetic tree using the maximum-likelihood approach. We demonstrate the protocol using an example dataset and a suite of publicly available programs.
2 Materials
2.1 The Following Programs Need to Be Downloaded and Installed on a Local Computer
1. PSI-BLAST [2]: https://blast.ncbi.nlm.nih.gov/Blast.cgi
2. PFAM [3]: https://pfam.xfam.org/
3. MUSCLE [4]: https://www.drive5.com/muscle/
4. trimAL [5]: http://trimal.cgenomics.org/
5. ProtTest [6]: https://github.com/ddarriba/prottest3
6. PhyML [7]: http://www.atgc-montpellier.fr/phyml/
7. FigTree [8]: http://tree.bio.ed.ac.uk/software/figtree/
8. iToL [9]: https://itol.embl.de/
2.2 Data
1. To demonstrate the procedure for reconstructing a maximum-likelihood tree, we will use UDP-glucuronosyltransferase (UGT) proteins from Medicago truncatula. These data can be obtained from the NCBI database (https://www.ncbi.nlm.nih.gov/protein/AAW56092.1) and are also provided at https://github.com/sohamsg90/Protocol-for-Phylogenetic-Reconstruction- for use with this protocol.
3 Methods
All steps of the tree reconstruction protocol are shown sequentially in Fig. 1 and described below.
3.1 Selection of Candidate Genes Using PSI-BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
1. To construct a tree of the UDP-glucuronosyltransferase (UGT) protein family, first select an organism harboring UGTs from which to obtain “seed” UGT sequences. The well-studied barrel medic, Medicago truncatula, is selected here for illustration. Retrieve the complete protein sequence of a well-characterized UGT in this organism from the NCBI protein database (https://www.ncbi.nlm.nih.gov/protein/AAW56092.1) [10] (UGT71G1 in M. truncatula is selected,
Fig. 1 Outline of the phylogenetic tree reconstruction protocol
accession number: AAW56092.1/Medtr5g070090.1). Use this sequence as the query in a PSI-BLAST search to obtain sequences of UGT homologs in the genome of M. truncatula.
2. Use an Expect threshold of 10 and a PSI-BLAST threshold of 0.005 (default parameters).
3. Using UGT71G1 as the query, conduct PSI-BLAST searches against the nonredundant (NR) BLAST database. Iterate until no new UGT homologs are found.
4. Select PSI-BLAST hits that satisfy the criteria of query coverage >70% and sequence identity >35% for further downstream analysis.
3.2 Domain Analysis Using PFAM (https://pfam.xfam.org/)
1. Examine the candidate UGT sequences obtained from the PSI-BLAST search for the presence of the signature UDP-GT domain. Use PFAM 31.0 to search for this conserved domain (PFAM uses hidden Markov models to generate sequence alignments and search for the presence of conserved domains).
2. Upload a FASTA-formatted file (https://www.ncbi.nlm.nih.gov/WebSub/html/help/fasta.html) with the sequences of the PSI-BLAST hits using the “Batch search” tab of the PFAM website (a valid email ID is required to obtain the results).
3. Manually eliminate all sequences lacking the UDP-GT domain from the list of candidate UGT genes.
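The two filtering steps (Subheading 3.1, step 4, and Subheading 3.2, step 3) can also be scripted once the results have been exported. The following is a minimal Python sketch and not part of the published protocol. It assumes that the hits are available as tab-separated rows of subject ID, percent identity, and percent query coverage, and that the IDs of sequences in which PFAM reported the UDP-GT domain have been collected separately; the column layout and all function names are hypothetical.

```python
# Minimal sketch of the two filtering steps; the column layout and all
# function names are illustrative assumptions, not output of PSI-BLAST or PFAM.
import csv
import io


def read_hit_table(handle):
    """Read a tab-separated hit table (assumed columns: ID, identity, coverage)."""
    return [tuple(row[:3]) for row in csv.reader(handle, delimiter="\t") if row]


def filter_hits(rows, min_coverage=70.0, min_identity=35.0):
    """Keep hit IDs with query coverage > min_coverage and identity > min_identity."""
    kept = []
    for subject_id, identity, coverage in rows:
        if float(coverage) > min_coverage and float(identity) > min_identity:
            if subject_id not in kept:  # a subject can be reported more than once
                kept.append(subject_id)
    return kept


def keep_with_domain(candidate_ids, ids_with_domain):
    """Drop candidates for which the domain search reported no UDP-GT domain."""
    with_domain = set(ids_with_domain)
    return [seq_id for seq_id in candidate_ids if seq_id in with_domain]


if __name__ == "__main__":
    table = io.StringIO(
        "hitA\t62.5\t95\n"  # passes both thresholds
        "hitB\t30.1\t98\n"  # identity too low
        "hitC\t48.0\t55\n"  # coverage too low
        "hitD\t41.2\t88\n"  # passes thresholds, but lacks the domain below
    )
    candidates = filter_hits(read_hit_table(table))
    print(keep_with_domain(candidates, {"hitA", "hitB"}))
```

The thresholds are strict inequalities, matching the ">70%" and ">35%" criteria stated above.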
3.3 Multiple Sequence Alignment (MSA) Using MUSCLE
Format the input data before computing the MSA with MUSCLE. MUSCLE reads protein sequences in FASTA format, whereas the programs used downstream expect the alignment in PHYLIP format (interleaved or sequential). Although MUSCLE can write PHYLIP output, PHYLIP sequence names are limited to ten characters, so the FASTA headers have to be shortened to at most ten characters. Shortening can produce duplicate headers, which complicate the analysis and must be resolved. As an example, the FASTA header beginning with ‘>’ below provides a short description of the protein sequence.
>Medtr5g066390.1 | UDP-glucosyltransferase family protein | HC | chr5:27997814-28000234
MYDGDEVSAIVIDLGSHTCKAGYAGEDAPKAVFPSRYYYFVLFVVADDSDEAGTEIEANCKKPFLWITRPNLVIGGSVVLSSEFRKEISDRGLISNWCPQEKVLNHPSISGFLTHCGWNSTTESICAGVPMLCWSFFADQPTNCRFICNEWKIGMEIDMNVKIEDLEKLINELMVGENGKKMRQKAMELKKKAEENTRPGGCSYMNLDKVIKEVLFKQN
The FASTA header can be changed to a string of at most ten characters. However, a separate tracking file needs to be maintained to map each short header to its original header. A Perl script to perform this function is provided at https://github.com/sohamsg90/Protocol-for-Phylogenetic-Reconstruction-.
perl extracthead.pl
Example of short names (in “*_short_names.fasta”):
>seq1
MYDGDEVSAIVIDLGSHTCKAGYAGEDAPKAVFPSRYYYFVLFVVADDSDEAGTEIEANCKKPFLWITRPNLVIGGSVVLSSEFRKEISDRGLISNWCPQEKVLNHPSISGFLTHCGWNSTTESICAGVPMLCWSFFADQPTNCRFICNEWKIGMEIDMNVKIEDLEKLINELMVGENGKKMRQKAMELKKKAEENTRPGGCSYMNLDKVIKEVLFKQN

Example of tracking file (in “*_input.txt”):
>Medtr5g066390.1 | UDP-glucosyltransferase family protein | HC | chr5:27997814-28000234
seq1
>Medtr . . .
seq2
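The renaming described above can be sketched in a few lines of Python. This is an illustration only and is not the authors' extracthead.pl; the function name and the tab-separated tracking layout (original header, then short name) are assumptions made for the example. The sketch also rejects duplicate headers and short names longer than ten characters.

```python
# Minimal sketch of header shortening; this is not the authors' extracthead.pl,
# and the tab-separated tracking layout is an assumption made for illustration.


def shorten_headers(fasta_lines, prefix="seq"):
    """Return (short_fasta_lines, tracking_lines) for an iterable of FASTA lines."""
    short_lines, tracking = [], []
    seen_headers = set()
    count = 0
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if line in seen_headers:
                raise ValueError(f"duplicate header: {line}")
            seen_headers.add(line)
            count += 1
            short = f"{prefix}{count}"
            if len(short) > 10:  # PHYLIP names hold at most ten characters
                raise ValueError("short name exceeds ten characters")
            short_lines.append(f">{short}")
            tracking.append(f"{line}\t{short}")
        elif line:
            short_lines.append(line)  # sequence lines are copied unchanged
    return short_lines, tracking


if __name__ == "__main__":
    example = [">Medtr5g066390.1 | family protein\n", "MYDGDEVSAIV\n"]
    short_fasta, tracking_file = shorten_headers(example)
    print("\n".join(short_fasta))
    print("\n".join(tracking_file))
```

Numbering the sequences guarantees unique short names, which truncating the original headers to ten characters would not.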
1. MUSCLE is standalone software, available for Windows, Unix/Linux, and Mac OS X. Download it from http://www.drive5.com/muscle/.
2. It is primarily a command-line program and therefore requires a system terminal (on Unix/Linux/Mac OS X) or a DOS prompt window on a Windows computer.
3. Copy the input FASTA file (containing all sequences with short names; file name ending in “*_short_names.fasta”) to the MUSCLE work folder, open a terminal/DOS prompt window, and go to the MUSCLE folder.
4. To view all options provided by MUSCLE, type and enter:
> muscle
MUSCLE v3.8.31 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
Basic usage
    muscle -in <inputfile> -out <outputfile>
Common options (for a complete list please see the User Guide):
    -in <inputfile>    Input file in FASTA format (default stdin)
    -out <outputfile>  Output alignment in FASTA format (default stdout)
    -diags             Find diagonals (faster for similar sequences)
    -maxiters <n>      Maximum number of iterations (integer, default 16)
    -maxhours <h>      Maximum time to iterate in hours (default no limit)
    -html              Write output in HTML format (default FASTA)
    -msf               Write output in GCG MSF format (default FASTA)
    -clw               Write output in CLUSTALW format (default FASTA)
    -clwstrict         As -clw, with 'CLUSTAL W (1.81)' header
    -log[a] <logfile>  Log to file (append if -loga, overwrite if -log)
    -quiet             Do not write progress messages to stderr
    -version           Display version information and exit
Without refinement (very fast, avg accuracy similar to T-Coffee): -maxiters 2
Fastest possible (amino acids): -maxiters 1 -diags -sv -distance1 kbit20_3
Fastest possible (nucleotides): -maxiters 1 -diags
5. To execute the program, type:
muscle -in <input file> -phyiout <output file>
Here the flag “-phyiout” sets the output format to interleaved PHYLIP. Note: if you are using the Windows version, write the full name of the program at the DOS prompt:
muscle.exe -in <input file> -phyiout <output file>
The output will be written to “*_muscle.phylip.”
6. Execution of MUSCLE yields a multiple sequence alignment that is written to the output file. The computation time depends on the number and lengths of the sequences to be aligned.
3.4 MSA Gap Treatment Using trimAL (http://trimal.cgenomics.org/)
Multiple sequence alignment introduces gaps and, depending on the level of conservation between the sequences, some of these gaps may be introduced erroneously. Although improved alignment methods are continually being developed, the accuracy of an alignment can be improved by careful treatment of gaps. trimAL, a tool for automated alignment trimming in large-scale phylogenetic analyses, is highly recommended. trimAL comes with a built-in function that trims an MSA using a heuristic approach to decide the most appropriate trimming mode, specifically for downstream maximum likelihood phylogenetic analysis.
1. The trimAL program is available for Unix/Linux and Windows systems. In addition, a ready-to-use online version is available at http://phylemon2.bioinfo.cipf.es/.
2. The output file format of MUSCLE is compatible with trimAL.
3. After computing the MSA using MUSCLE, copy the “*_muscle.phylip” file to the trimAL folder.
4. To see all the options provided by the program, type and enter:
> trimal
trimAl v1.4.rev15 build[2013-12-17]. 2009-2013. Salvador Capella-Gutierrez and Toni Gabaldón.
trimAl webpage: http://trimal.cgenomics.org
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, the last available version.
Please cite:
    trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.
    Salvador Capella-Gutierrez; Jose M. Silla-Martinez; Toni Gabaldon.
    Bioinformatics 2009, 25:1972-1973.
Basic usage
    trimal -in <inputfile> -out <outputfile> -(other options).
Common options (for a complete list please see the User Guide or visit http://trimal.cgenomics.org):
    -h               Print this information and show some examples.
    --version        Print the trimAl version.
    -in <inputfile>  Input file in several formats (clustal, fasta, NBRF/PIR, nexus, phylip3.2, phylip).
    -phylip          Output file in PHYLIP/PHYLIP4 format
    -automated1      Use a heuristic selection of the automatic method based on similarity statistics. (see User Guide). (Optimized for Maximum Likelihood phylogenetic tree reconstruction).
5. To execute the trimAL program, type:
trimal -in <input file> -out <output file> -automated1 -phylip
The output will be written to “*_trimal.phylip.”
6. Use Jalview, an alignment viewing and editing tool (http://www.jalview.org/), to view the alignment before and after gap treatment.
3.5 Selection of Best-Fit Models of Protein Evolution Using ProtTest (https://github.com/ddarriba/ProtTest3/releases)
To reconstruct an accurate phylogenetic tree, the best-fit model of substitution needs to be determined and employed. Models of substitution, or amino acid replacement, provide the probabilities of change from one amino acid to another. ProtTest determines the optimal model of substitution using model selection criteria, namely the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). The software selects the best-fit model from among 120 candidate models that account for rate variation among sites (+I: invariable sites; +G: gamma-distributed rates) and the observed amino acid frequencies (+F).
1. Install the ProtTest program (either the Unix/Linux or the Windows version). On a Unix/Linux system, after installation, open a terminal, go to the ProtTest* folder, and type:
> ./runXProtTestHPC.sh
On a Windows system, open a DOS prompt window, go to the ProtTest* folder, and double-click “runXProtTestHPC.bat.”
2. Click on File → Load Alignment → Select the alignment file (output of the trimAL program). Click on Analysis → Compute Likelihood Scores. A window will pop up, showing all the available substitution model matrices and distributions. By default, all options are checked. You can change the parameters as required. Click on “Compute” to begin the analysis. Another small window will pop up, showing the progress of the analysis. The computation time depends on the number of substitution model matrices selected and the number of sequences.
3. Once the analysis is complete, “Likelihood scores available” will be displayed in the bottom-left corner of the window. Click on Selection → Results. A separate window will pop up with the likelihood scores, the model-averaged parameters, and the best-fit model according to AIC and BIC.
4. Save the result by clicking on the tab “Export to main console.”
5. The main information needed for the next step is:
(a) The model of substitution selected by AIC or BIC (i.e., the best-fit model).
(b) The model-averaged parameters (i.e., values of +I, +G, +F).
3.6 Maximum Likelihood Analysis Using PhyML (http://www.atgc-montpellier.fr/PhyML/)
1. Use PhyML to build the phylogenetic tree. Install PhyML on a Windows or Unix/Linux system, or use the web server at http://www.atgc-montpellier.fr/PhyML/. PhyML uses the maximum likelihood method to infer the tree. This method determines the tree model (topology and branch lengths) that best describes the given data (i.e., the MSA), that is, the model that yields the highest likelihood (maximum probability) of generating the data. Thus, the likelihood function is examined to find the model parameters (usually the topology and branch lengths) that yield the maximum value of the function.
2. The output file of trimAL is the input file of PhyML. To check all the options provided by the program, open a terminal and type:
> PhyML --help
and enter to obtain the following description:
NAME
    - PhyML 3.2.20160531 -
    ''A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood''
    Stephane Guindon and Olivier Gascuel,
    Systematic Biology 52(5):696-704, 2003.
    Please cite this paper if you use this software in your publications.
SYNOPSIS:
    PhyML [command args]
    All the options below are optional (except '-i' if you want to use the command-line interface).
Command options:
    -i (or --input) seq_file_name
        seq_file_name is the name of the nucleotide or amino-acid sequence file in PHYLIP format.
    -d (or --datatype) data_type
        data_type is 'nt' for nucleotide (default), 'aa' for amino-acid sequences, or 'generic', (use NEXUS file format and the 'symbols' parameter here).
- - - - - - - - - - - - - - -
EXAMPLES
    DNA interleaved sequence file, default parameters : ./PhyML -i seqs1
    AA interleaved sequence file, default parameters : ./PhyML -i seqs2 -d aa
    AA sequential sequence file, with customization : ./PhyML -i seqs3 -q -d aa -m JTT -c 4 -a e
3. To execute the program, open a terminal and type:
phyML -i <input file> -d aa -m <model> -b -5 -a <gamma shape> -s BEST --no_memory_check
Here the input file is “*_trimal.phylip.”
4. Two output files are generated: one with the statistics, parameters, etc. used by the program, and the other with the phylogenetic tree in Newick format.
5. Note that in the output tree file, the taxon names are the short names, which need to be replaced before visualization. Use the Perl script provided with this tutorial to replace the short taxon names with their full names. It is highly recommended to include the accession numbers in the final taxon names for easy identification.
> perl replace_short_long.pl
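The name replacement can be illustrated with a short Python sketch. This is not the authors' replace_short_long.pl; the function name is hypothetical, the short-to-full mapping is assumed to have been read from the tracking file beforehand, and quoting the restored names with single quotes is a choice made here so that spaces and punctuation in full names remain valid Newick.

```python
# Minimal sketch of restoring full taxon names in a Newick tree; this is not
# the authors' replace_short_long.pl, and the quoting rule is an assumption.
import re


def restore_names(newick, short_to_full):
    """Replace each short taxon label in a Newick string with its full name."""

    def swap(match):
        label = match.group(0)
        full = short_to_full.get(label)
        if full is None:
            return label  # leave unknown tokens untouched
        # Single-quote the label; embedded single quotes are doubled.
        return "'" + full.replace("'", "''") + "'"

    # Match whole labels only, so "seq1" is never replaced inside "seq10", and
    # skip letters that follow a digit or dot (e.g., the "e" in 1e-05).
    return re.sub(r"(?<![\w.])[A-Za-z_][\w.]*", swap, newick)


if __name__ == "__main__":
    tree = "(seq1:0.1,(seq10:0.2,seq2:1e-05)0.98:0.3);"
    names = {"seq1": "AAW56092.1 UGT71G1", "seq2": "taxon B", "seq10": "taxon C"}
    print(restore_names(tree, names))
```

Matching whole labels rather than substrings matters once more than nine sequences are present, because a naive text replacement of "seq1" would also alter "seq10".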
3.7 Tree Visualization Using FigTree
A number of tools are available for the visualization of phylogenetic trees. One of the most frequently used is FigTree, available for all platforms from http://tree.bio.ed.ac.uk/software/figtree/. The input file for FigTree is the edited Newick tree file. Many options are available to design, color, and annotate the trees according to the user's needs.
3.8 Phylogenetic Tree of Plant UGTs and Visualization Using iToL
Using the example dataset, we constructed a phylogenetic tree of UGT proteins in plants. The web server iToL (Interactive Tree of Life: https://itol.embl.de/) was used for visualization. Readers can access this tree at https://github.com/sohamsg90/Protocol-for-Phylogenetic-Reconstruction. Readers interested in constructing a tree using the protocol in this chapter are advised to first recreate the MSA and tree using the sample files provided at https://github.com/sohamsg90/Protocol-for-Phylogenetic-Reconstruction-
4 Notes
1. For programs to be run on a Linux system, download the binaries, if available, in addition to the source code files. Whether the binaries are downloaded or compiled from source, they need to be placed on the PATH so that they can be run from any folder. There are several ways to do this:
(a) Make a symlink in the directory /usr/bin (or /usr/local/bin):
sudo cp -s /opt/toolname/tool.sh /usr/bin/[Toolname]
(b) Add the directory containing /opt/toolname/tool.sh to the $PATH variable:
export PATH=$PATH:/opt/toolname/
(c) Combine the above, but use $HOME/.local/share/bin instead of the system /usr/bin.
2. Always remember to run the programs in the folder where your input files are, or provide the path to the files.
3. Choose output file names in such a way that they do not clash with the input/output files of other programs.
4. For domain analysis, PFAM offers both a web server and a local version. If there are too many input sequences, it is recommended to download and run the program locally.
5. While visualizing the Newick file using FigTree, if the error message “Duplicate taxon name” pops up, check for and correct any duplicate taxon names.
References
1. Lack D (1983) Darwin’s finches. Cambridge University Press, Cambridge
2. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402. https://doi.org/10.1093/nar/25.17.3389
3. Finn RD (2005) Pfam: the protein families database. In: Encyclopedia of genetics, genomics, proteomics and bioinformatics. Wiley, Hoboken, New Jersey. https://doi.org/10.1002/047001153X.g306303
4. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792–1797. https://doi.org/10.1093/nar/gkh340
5. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972–1973
6. Darriba D, Taboada GL, Doallo R, Posada D (2011) ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27(8):1164–1165. https://doi.org/10.1093/bioinformatics/btr088
7. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59(3):307–321. https://doi.org/10.1093/sysbio/syq010
8. Rambaut A (2012). http://tree.bio.ed.ac.uk/software/figtree/
9. Letunic I, Bork P (2007) Interactive tree of life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23(1):127–128. https://doi.org/10.1093/bioinformatics/btl529
10. Achnine L, Huhman DV, Farag MA, Sumner LW, Blount JW, Dixon RA (2005) Genomics-based selection and functional characterization of triterpene glycosyltransferases from the model legume Medicago truncatula. Plant J 41(6):875–887
Chapter 5
RNA-Seq Data Analysis Pipeline for Plants: Transcriptome Assembly, Alignment, and Differential Expression Analysis
David J. Burks and Rajeev K. Azad
Abstract
In this chapter, we describe methods for analyzing RNA-Seq data, presented as a flow along a pipeline beginning with raw data from a sequencer and ending with an output of differentially expressed genes and their functional characterization. The first section covers de novo transcriptome assembly for organisms lacking reference genomes or for those interested in probing against the background of organism-specific transcriptomes assembled from RNA-Seq data. Section 2 covers both gene- and transcript-level quantifications, leading to the third and final section on differential expression analysis between two or more conditions. The pipeline starts with raw sequence reads, followed by quality assessment and preprocessing of the input data to ensure a robust estimate of the transcripts and their differential regulation. The preprocessed data can be inputted into the de novo transcriptome flow to assemble transcripts, functionally annotated using tools such as InterProScan or Blast2Go and then forwarded to differential expression analysis flow, or directly inputted into the differential expression analysis flow if a reference genome is available. An online repository containing sample data has also been made available, as well as custom Python scripts to modify the output of the programs within the pipeline for various downstream analyses.
Key words Transcriptomics, RNA-Seq data analysis, Transcriptome assembly, Alignment, Differential expression analysis
1 Introduction
Advances in technology coupled with decreasing costs have brought high-throughput sequencing to the frontlines of biological research. In particular, RNA sequencing (RNA-Seq) has become the primary choice for understanding the transcriptional processes in organisms [1]. RNA-Seq allows researchers to analyze the transcriptional changes in organisms without any prior information regarding their genomic identities or gene information, a limitation inherent to prior technologies such as DNA microarrays. Furthermore, de novo transcriptome assembly has matured into a practical, reliable alternative to reference genomes, enabling RNA-Seq for organisms lacking reference genome sequences [2]. This is of great interest in plant biology, where de
Vladimir Shulaev (ed.), Plant Metabolic Engineering: Methods and Protocols, Methods in Molecular Biology, vol. 2396, https://doi.org/10.1007/978-1-0716-1822-6_5, © Springer Science+Business Media, LLC, part of Springer Nature 2022
novo transcriptomics affords new insights into niche organisms that harbor genes or pathways not present in their close relatives [2–5]. In this protocol, we describe methods for analyzing RNA-Seq data, presented as a flow along a pipeline beginning with raw data from a sequencer and ending with an output of differentially expressed genes and their functional characterization. The first section covers de novo transcriptome assembly for organisms lacking reference genomes or for those interested in probing against the background of organism-specific transcriptomes assembled from RNA-Seq data. Section 2 covers both gene- and transcriptlevel quantifications, leading to the third and final section on differential expression analysis between two or more conditions. The pipeline starts with raw sequence reads, followed by quality assessment and preprocessing of the input data to ensure a robust estimate of the transcripts and their differential regulation. The preprocessed data can be inputted into the de novo transcriptome flow (section 1) to assemble transcripts, functionally annotated using tools such as InterProScan or Blast2Go [6, 7] and then forwarded to differential expression analysis flow (section 2), or directly inputted into the differential expression analysis flow if a reference genome is available (section 2). An online repository containing sample data has also been made available, as well as custom Python scripts to modify the output of the programs within the pipeline for various downstream analyses. Despite the cohesiveness of the sections to gain a comprehensive and thorough understanding of the RNA-Seq data, each section can be used independently for specific goals, offering a great starting point for more advanced bioinformatics analyses.
2 Materials
2.1 Software and Data
This protocol requires several steps to be completed before undertaking the tasks presented in each section. These include procuring the data from the supplied repository and installing the software listed in Table 1. The tools for this pipeline require a GNU/Linux command-line environment. This can be a standalone installation or a virtual machine (VM), as long as adequate resources are allocated to the VM. Windows users have the option of installing the Windows Subsystem for Linux, which provides a full command-line GNU/Linux environment within Windows without the need for virtualization and resource allocation. Popular distributions such as Ubuntu and Debian may have many of the tools available in their repositories for easy installation, though the packaged software may not be the current release version, which can introduce unnecessary problems into the workflow. Many of the programs are also available natively for Mac OS X, but may require the end-user to compile from source
Table 1 Software required to build the RNA-Seq data analysis pipeline

Software                       URL
FastQC                         https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Trimmomatic                    http://www.usadellab.org/cms/?page=trimmomatic
Trinity                        https://github.com/trinityrnaseq/trinityrnaseq/wiki
STAR                           https://github.com/alexdobin/STAR
Salmon                         https://combine-lab.github.io/salmon/
DESeq2                         https://bioconductor.org/packages/release/bioc/html/DESeq2.html
pcaExplorer                    http://bioconductor.org/packages/release/bioc/html/pcaExplorer.html
SRA toolkit                    https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/
Jellyfish                      http://www.genome.umd.edu/jellyfish.html
Bowtie2                        http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

Data                           URL
Tutorial repository            https://github.com/djburks/RNA-Seq-Data-Analysis-Pipeline-for-Plants
Ensembl Plants FTP portal      https://plants.ensembl.org/info/website/ftp/index.html

Information                    URL
SAM format specification guide https://samtools.github.io/hts-specs/SAMv1.pdf
Windows Subsystem for Linux    https://docs.microsoft.com/en-us/windows/wsl/install-win10
Compiling software on Linux    https://itsfoss.com/install-software-from-source-code/
packages, whereas binaries are typically available for GNU/Linux. A general primer on software installation and modifying $PATH has been added to Table 1, though the steps required for compiling software from source can vary from one program to another. Wherever possible, it is recommended that users download the latest available binary and add it to the /usr/local/bin directory. The commands for this pipeline are suited to a standard desktop with 4 CPU cores and 8 GB of RAM, with the exception of the Trinity step. Larger datasets will require more RAM, and users with more cores can adjust the appropriate parameters by referencing the manual for each particular program. The paired-end reads provided in the supplied repository were artificially constructed using the RNA-Seq read simulator Polyester [8]. A subset of 500 transcripts was selected from the TAIR10 cDNA file for Arabidopsis thaliana and used to construct three individual experimental sets (WT, MU, CN) with randomly generated fold-changes using the random exponential distribution
function of R (rexp) [9]. Three replicates per experiment were simulated. A custom Python script was used to convert the reads into FASTQ format with preset quality scores per base, as well as to append a variable amount of adapter sequence to the beginning of each read. All FASTQ files are compressed using the GZIP function in order to save space; however, the tools used in this tutorial are fully compatible with both compressed and uncompressed formats (see Note 1).
3 Methods
3.1 Read Preparation and Quality Assessment
3.1.1 Initial Quality Check
All raw read files should be subjected to quality assessment through FastQC. Adapter contamination can cause significant problems during transcriptome assembly, including chimeric read production. Low-quality read ends can significantly diminish the mapping rate during alignment. The standard format for raw reads is FASTQ, which contains both sequence and quality information for each base of every read. The files in this tutorial are compressed FASTQ, which saves space. First, move to the Raw Reads directory and invoke FastQC from the command line for the FASTQ files to be analyzed:
$ mkdir FastQC
$ fastqc *.fq.gz -o FastQC/
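To make the FASTQ layout concrete, the following minimal Python sketch reads records and converts the quality string into per-base scores. It is an illustration only and is not used by the pipeline. It assumes four-line records and Phred+33 quality encoding, neither of which is stated in this protocol, and the function names are hypothetical.

```python
# Minimal sketch of reading FASTQ records; it assumes four-line records and
# Phred+33 quality encoding, neither of which is stated in the protocol.
import gzip
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a text handle of 4-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) ends the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line beginning with '+'
        quality = handle.readline().rstrip("\n")
        scores = [ord(char) - 33 for char in quality]  # Phred+33 assumption
        yield header[1:], sequence, scores


def mean_quality(path):
    """Return the mean base quality over all reads in a gzip-compressed FASTQ."""
    total = count = 0
    with gzip.open(path, "rt") as handle:
        for _, _, scores in read_fastq(handle):
            total += sum(scores)
            count += len(scores)
    return total / count if count else 0.0


if __name__ == "__main__":
    demo = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for name, sequence, scores in read_fastq(demo):
        print(name, sequence, scores)
```

Opening the file with gzip in text mode mirrors the point made above that compressed FASTQ files can be read directly without first decompressing them.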
Each FastQC run builds an HTML report with associated data in ZIP format. These HTML reports can be viewed with any standard web browser, such as Google Chrome or Mozilla Firefox. Each category is assigned pass, caution, or fail through green, yellow, or red symbols, respectively. The manual provided with FastQC gives insights into each category and information on how to remedy any yellow or red flag. A video tutorial is also provided at the FastQC download page, as well as example reports for “good” and “bad” RNA-Seq data. For this protocol, we will focus primarily on the “Per Base Sequence Quality” and “Adapter Content” categories. The “Per Base Sequence Quality” evaluation issues a warning if the lower quartile score for any base position across all reads in the sample is less than 10, or if the median for a base position is less than 25. If the lower quartile score falls below 5, or the median below 20, FastQC reports a failure. It is common for quality to degrade over the duration of a run, and longer reads show a more noticeable drop in quality along each read. Problems encountered during sequencing, such as interference in the flow cell, affect base quality earlier in the read. The “Adapter Content” category is useful for detecting adapter sequence contamination, as well as identifying the adapter sequence so that it can be properly removed from each read where it is present.
3.1.2 Adapter Trimming and Adapter Removal
Trimmomatic uses a sliding window to identify the proper point at which to trim poor-quality ends from raw reads. Trimmomatic can also detect and remove adapter sequence remnants from the read when supplied with a FASTA file containing the adapter sequence [10]. For most cases, as well as this tutorial, the default parameters will suffice for trimming reads. An adapter FASTA file is also included in our repository, containing many of the more common adapter sequences. Single and paired-end reads require different commands. As our tutorial data is paired-end, an example command for trimming and clipping adapter sequence from reads in our first wild-type replicate (WT1) file is shown below. Repeat this for reads in the remaining files.
$ java -jar trimmomatic-0.38.jar PE WT1_1.fq.gz WT1_2.fq.gz WT1_1_Trimmed.fq.gz WT1_1_Orphan.fq.gz WT1_2_Trimmed.fq.gz WT1_2_Orphan.fq.gz ILLUMINACLIP:adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 HEADCROP:14
The above command produces four output files. The *_Trimmed.fq.gz files will be used for downstream analysis. The *_Orphan.fq.gz files contain reads whose mate did not survive the cutoffs. These reads have little use for our purposes, but in cases where data is sparse, they may provide valuable information to the researcher despite lacking their mate. The ILLUMINACLIP parameter looks for the adapter sequences in the adapter.fa FASTA file, with a modifiable set of values that determine how stringent a match needs to be for detection. The LEADING and TRAILING parameters specify the minimum quality required to keep a base at the start and end of a read, respectively. The SLIDINGWINDOW parameter specifies the number of consecutive base scores to average as the read is scanned, as well as the minimum average quality required to retain that segment of the read. The HEADCROP parameter removes a preset number of bases from the start of reads, which is being performed in response to the “Per base sequence content” flag in our preliminary FastQC report. The MINLEN parameter is used to discard any reads that, following trimming and adapter removal, are less than 36 bases long (preset minimum length), as such short reads can lead to spurious alignments and chimeric transcript production. Following read processing, it is important to reassess and note the changes in quality metrics by re-running FastQC.
$ fastqc *_Trimmed.fq.gz -o FastQC/
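The sliding-window and minimum-length rules described above can be illustrated with a short Python sketch. This is a simplified illustration and not Trimmomatic's implementation; the function names are hypothetical, the defaults mirror SLIDINGWINDOW:4:15 and MINLEN:36 from the command above, and details such as the treatment of the final window are ignored.

```python
# Simplified illustration of sliding-window quality trimming; it is not
# Trimmomatic's implementation and ignores its handling of the final window.


def sliding_window_cut(scores, window=4, min_mean=15):
    """Return the number of leading bases to keep.

    Scans 5' to 3' and cuts at the start of the first window whose mean
    quality falls below min_mean; reads shorter than one window are kept whole.
    """
    for start in range(0, len(scores) - window + 1):
        chunk = scores[start:start + window]
        if sum(chunk) / window < min_mean:
            return start
    return len(scores)


def trim_read(sequence, scores, window=4, min_mean=15, min_len=36):
    """Trim one read; return (sequence, scores), or None if shorter than min_len."""
    keep = sliding_window_cut(scores, window, min_mean)
    if keep < min_len:
        return None  # the read is discarded, as with MINLEN
    return sequence[:keep], scores[:keep]


if __name__ == "__main__":
    qualities = [30, 30, 30, 30, 2, 2, 2, 2]
    print(sliding_window_cut(qualities))
    print(trim_read("ACGTACGT", qualities, min_len=2))
```

Averaging over a window rather than testing single bases means that one isolated low-quality base does not by itself truncate an otherwise good read.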
David J. Burks and Rajeev K. Azad
3.2 De Novo Transcriptome Assembly Using Trinity
Trinity was selected for transcriptome assembly due to its performance on eukaryotic plant assemblies in particular [11], as well as its ease of use and available documentation [12]. While other assemblers can be viable, possibly superior choices, we feel that Trinity offers the best balance between quality and complexity. Despite its ease of use, Trinity can be particularly demanding of computational resources. In most cases involving plant transcriptomes, 32 GB of RAM should suffice, although the requirement depends heavily on the number of reads passed to Trinity to build the transcriptome. Trinity consists of three independent programs: Inchworm, Chrysalis, and Butterfly. As such, there are a great number of configurations and parameters that can be adjusted by the end user. Optimizing a de novo assembly is an arduous task and is beyond the scope of this tutorial. Users interested in exploring these options are encouraged to use quality metrics and to experiment with the trimming stringency of the first section, as more aggressive trimming can remove vital information that would otherwise benefit the assembler. Programs such as Transrate can be used to score the quality of transcriptomes without using a reference [13] and can be useful in testing the various parameters available within Trinity. We will be using default parameters and a publicly available A. thaliana RNA-Seq dataset from NCBI SRA [14]. The fastq-dump utility of the SRA toolkit allows us to download the files in the proper format for Trinity assembly. This is a rather large set of files (2 × ~10 GB), so make sure that adequate space is available. Although this example assembly uses data directly from NCBI SRA, it is important to perform proper trimming and adapter removal on all data used for building de novo transcriptomes.
$ fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR6914594
$ Trinity --seqType fq --max_memory 32G --left SRR6914594_1.fastq --right SRR6914594_2.fastq --CPU 4
A test run of Trinity on less powerful workstations can be done using only a subset of the reads available in the FASTQ files. The following example needs less than 1 GB of RAM, but the resulting transcriptome will not be suitable for downstream alignment and should serve only to test the Trinity installation.
$ fastq-dump --defline-seq '@$sn[_$rn]/$ri' -X 10000 --split-files SRR6914594
$ Trinity --seqType fq --max_memory 1G --left SRR6914594_1.fastq --right SRR6914594_2.fastq --CPU 4
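For users who want to test Trinity on their own FASTQ files rather than on an SRA download, a similar subset can be taken locally. The Python sketch below writes the first N records of each mate file to new files; the function names and the file names in the example call are hypothetical, and it assumes standard four-line FASTQ records with both mate files listing reads in the same order.

```python
import gzip
from itertools import islice


def take_reads(lines, n_reads):
    """Return the lines of the first n_reads FASTQ records (4 lines each)."""
    return list(islice(lines, 4 * n_reads))


def subsample_pair(left_in, right_in, left_out, right_out, n_reads):
    """Write the first n_reads records of each mate file to new files.

    Taking the same number of records from the top of both files keeps
    mates paired, provided both files list reads in the same order.
    """
    for src_path, dst_path in ((left_in, left_out), (right_in, right_out)):
        opener = gzip.open if src_path.endswith(".gz") else open
        with opener(src_path, "rt") as src, open(dst_path, "w") as dst:
            dst.writelines(take_reads(src, n_reads))


# Example call (file names are placeholders for the user's own data):
# subsample_pair("reads_1.fq.gz", "reads_2.fq.gz",
#                "subset_1.fq", "subset_2.fq", n_reads=10000)
```

As with the fastq-dump example, a transcriptome built from such a subset is only useful for checking that the installation works.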
RNA-Seq Data Analysis Pipeline for Plants: Transcriptome Assembly. . .
The CPU parameter can be adjusted depending on the user’s available resources. The resultant transcriptome will be written to the “trinity_out_dir” directory in FASTA format as Trinity.fasta. The transcripts within this file can be used for transcript-level alignment using Salmon as detailed in the next section, and functionally annotated using the programs referenced in the last section. A de novo transcriptome will never be of reference quality, though it can be seen as an invaluable substitute when a reference genome is not available. The identifiers for each transcript contain several pieces of information regarding its construction. For example, the identifier “TRINITY_DN1000|c115_g5_i1” indicates that this particular transcript was clustered with other transcripts in the TRINITY_DN1000|c115 group. From this cluster, several genes with multiple isoforms were identified. Other identifiers from the same cluster with the same gene number (e.g., g5) but different isoform numbers (e.g., i1, i2) correspond to different isoforms of the gene.
3.3 RNA-Seq Alignment at the Gene and Transcript Level
3.3.1 Genome Indexing and Alignment Using STAR
Due to the increasing use of RNA-Seq and the need for more robust quantification of transcript abundance, a number of RNA-Seq alignment programs have been developed in recent years. We have chosen STAR as our gene-level aligner for this tutorial based on a number of comparative studies that have found it to be the most accurate freely available program overall for RNA-Seq read alignment at the default parameter setting [15–17]. As of this writing, STAR also provides raw counts without the need for an intermediary program to analyze the BAM files. This makes STAR an attractive option for inexperienced users. The GeneCounts option of the --quantMode parameter, which did not exist in early versions of STAR, is essential to this tutorial. Users of the Windows Subsystem for Linux should note that the --readFilesCommand zcat parameter introduces problems, and the files will need to be decompressed prior to running STAR.
$ gunzip *.fq.gz
First, it is necessary to construct a genome index for STAR to align each of our read sets. This requires two organism-specific files: the genomic DNA FASTA file and the associated GTF annotation file. The DNA FASTA file contains the genomic sequence for all chromosomes within a species. GTF files are tab-delimited text files containing information regarding the gene structure of an organism, including the coordinates/indices of exons and introns in the chromosomes as organized in the DNA FASTA file. Ensembl Plants offers many of these files through its FTP site, and a portal for easy navigation is linked in Table 1. As our objective is to align A. thaliana read data in this tutorial, we will
first download the top-level DNA FASTA and associated GTF files using the command line.
$ wget ftp://ftp.ensemblgenomes.org/pub/release-42/plants/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
$ wget ftp://ftp.ensemblgenomes.org/pub/release-42/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.42.gtf.gz
It is imperative that the most current version of the DNA FASTA file and the corresponding GTF file be used for alignment. The sequence names in the FASTA and GTF files must also match, as one file referring to chromosome 1 as “1” and the other referring to the same chromosome as “chr1” will lead to indexing failure. Using a GTF file built for a different DNA FASTA release or version can also introduce coordinate mismatches that give erroneous gene counts. In some cases, the annotation may only be available in GFF format. Instructions on how to use GFF files in lieu of GTF files are given in the STAR manual. With the files procured, we can generate the genome index. First, unzip the two files from Ensembl Plants and make a directory for the index, then build the index.
$ gunzip Arabidopsis*gz
$ mkdir STAR_Index
$ STAR --runThreadN 4 --runMode genomeGenerate --genomeDir STAR_Index/ --genomeFastaFiles Arabidopsis_thaliana.TAIR10.dna.toplevel.fa --sjdbGTFfile Arabidopsis_thaliana.TAIR10.42.gtf
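The naming requirement described above (“1” versus “chr1”) can be checked before indexing. The following Python sketch compares the sequence names in FASTA header lines with the names in the first column of a GTF file. It is an illustrative check only; the function names are hypothetical and the toy records are made up for the example.

```python
def fasta_names(lines):
    """Collect sequence names: the first word after '>' on each header line."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_names(lines):
    """Collect the sequence names in column 1 of a GTF, skipping '#' comments."""
    names = set()
    for line in lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def names_missing_from_fasta(fasta_lines, gtf_lines):
    """Return GTF sequence names that have no matching FASTA record."""
    return gtf_names(gtf_lines) - fasta_names(fasta_lines)


# Toy records (made up) showing a matching and a mismatching annotation.
fasta = [">1 dna:chromosome\n", "ACGT\n", ">2 dna:chromosome\n", "GGCC\n"]
gtf_ok = ["#!genome-build TAIR10\n", "1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id \"g1\";\n"]
gtf_bad = ["chr1\tsrc\tgene\t100\t200\t.\t+\t.\tgene_id \"g1\";\n"]

print(names_missing_from_fasta(fasta, gtf_ok))   # empty set: names agree
print(names_missing_from_fasta(fasta, gtf_bad))  # {'chr1'}: names disagree
```

With real data, the same functions can be given open file handles for the unzipped FASTA and GTF files in place of the lists. An empty result only shows that the names agree; it does not confirm that the two files come from the same release.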
With the genome index built, align reads in the trimmed FASTQ files using STAR. Make sure to align each mate pair together, using the files containing the “left” reads and the corresponding “right” reads, with the --readFilesCommand and --quantMode parameters to decompress the FASTQ files and to export the gene counts, respectively. This can be done easily with a bash for loop, but for clarity, we will only show the commands to align the WT1 experiment set.
$ mkdir WT1
$ STAR --runThreadN 10 --genomeDir STAR_Index --readFilesCommand zcat --quantMode GeneCounts --readFilesIn WT1_1_Trimmed.fq.gz WT1_2_Trimmed.fq.gz --outFileNamePrefix WT1/
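As an alternative to a bash for loop, the per-sample commands can be generated with a few lines of Python. The sketch below only prints the commands so they can be inspected first. Apart from WT1, the sample names are placeholders that are not taken from the tutorial data, and the helper name star_command is hypothetical.

```python
import shlex


def star_command(sample, index_dir="STAR_Index", threads=10):
    """Build the STAR alignment command for one sample as an argument list."""
    return [
        "STAR", "--runThreadN", str(threads),
        "--genomeDir", index_dir,
        "--readFilesCommand", "zcat",
        "--quantMode", "GeneCounts",
        "--readFilesIn",
        f"{sample}_1_Trimmed.fq.gz", f"{sample}_2_Trimmed.fq.gz",
        "--outFileNamePrefix", f"{sample}/",
    ]


# Only WT1 is named in the text; the other names are placeholders.
samples = ["WT1", "WT2", "WT3"]

for sample in samples:
    print(f"mkdir -p {sample}")
    print(shlex.join(star_command(sample)))

# To execute instead of print, one option is:
#   import os, subprocess
#   os.makedirs(sample, exist_ok=True)
#   subprocess.run(star_command(sample), check=True)
```

Building each command as an argument list, rather than a single string, avoids quoting problems if a sample name ever contains spaces or shell metacharacters.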
Six output files are generated in the specified directory. Three log files give valuable information regarding alignments and offer insight into failed runs should the event arise. For example, the Log.final.out file for this dataset states that ~85% of the reads aligned, with the remaining ~15% mapping to too many loci. This is expected because our synthetic dataset was built from only 500 transcripts, many of which were isoforms of the same gene. The SJ.out.tab file contains filtered splice junctions detected during mapping, but these are not covered in this protocol. The Sequence Alignment/Map (SAM) file is the most informative output file from STAR, containing information regarding every read in our dataset and its alignment to the indexed genome. The SAM format specification guide has been linked in Table 1 should the user desire additional information. For this tutorial, we are primarily focused on the ReadsPerGene.out.tab file. This tab-delimited text file contains four columns for every gene in the genome. The first column is the gene ID, followed by columns of counts for unstranded data, counts for data in which the first read of each pair is on the same strand as the RNA, and finally counts for data in which the second read of each pair is on the same strand as the RNA. The first four rows of the file (N_unmapped, N_multimapping, N_noFeature, and N_ambiguous) summarize reads that were not assigned to any gene. More information regarding stranded alignment is provided in the STAR manual, and it is not recommended to use the counts of this pipeline for stranded RNA-Seq data. The tutorial dataset is very small, built from just 500 transcripts, and will produce many zero counts. Actual datasets from experimentally derived total mRNA extractions, however, will produce much higher genomic coverage.
3.4 Transcript Indexing and Alignment Using Salmon
While genome alignment programs are capable of quantifying transcript-level abundances, modern alignment-free strategies can quantify datasets in minutes rather than hours, requiring a fraction of the RAM on older hardware [18, 19]. A dataset requiring over 16 h to align and quantify in a traditional pipeline such as TopHat2 and Cufflinks can be aligned and quantified using a process called pseudoalignment in ~10 min on identical hardware [18]. There are limitations to using lightweight and alignment-free strategies, particularly for lowly expressed and small RNA datasets, but accuracy for aligning to protein-coding genes rivals that of traditional full alignment pipelines [20]. For this pipeline, we have chosen Salmon for its consistency in estimating abundances across experiments, even to the gene level when isoform counts are combined [19]. Furthermore, Salmon exports both raw and normalized counts (the latter in transcripts per million, TPM), which can be used to compare expression across independent experimental conditions. Technical benefits aside, transcript-level quantification affords additional detail not present in gene-level quantification. Researchers interested in examining isoform-level differential expression will find that tools such as Salmon offer a straightforward method for obtaining the counts for abundance quantification. Similar to STAR, Salmon builds an index for its lightweight alignment
algorithm. Unlike STAR, Salmon only needs a FASTA file containing all transcripts in a transcriptome of interest. In most cases, such as with the Ensembl Plants FTP directory, these files are labeled as cDNA FASTA files. Unlike CDS FASTA files, these also contain sequence information for the untranslated regions (UTRs) of genes, which would be expected in a typical mRNA-derived RNA-Seq experiment. For the tutorial data, download the cDNA FASTA file for A. thaliana from Ensembl Plants and build the index using Salmon.
$ wget ftp://ftp.ensemblgenomes.org/pub/release-42/plants/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
$ salmon index -t Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz -i salmon_index
For users with de novo transcriptomes, e.g., assembled with Trinity, use Salmon’s index function on the Trinity.fasta file instead.
$ salmon index -t Trinity.fasta -i salmon_denovo
To avoid mixing up STAR and Salmon output, make a new directory and process the read pairs in subdirectories of it. Alignment for the first dataset (WT1) is shown below and will need to be repeated for each remaining dataset, making sure to change the output (-o) parameter to a new subdirectory each time. To use the de novo transcriptome index, adjust the index (-i) parameter accordingly.
$ mkdir quants
$ salmon quant -i salmon_index -l A -1 WT1_1_Trimmed.fq.gz -2 WT1_2_Trimmed.fq.gz -p 4 -o quants/WT1
Each run will produce a corresponding set of data files. The quant.sf file, similar to STAR’s ReadsPerGene file, contains the transcript abundances in the fourth and fifth columns. For direct comparisons to other experiments, such as through heatmap and correlation analysis, the normalized TPM abundances of the fourth column can be used. For differential expression using software such as DESeq2 and edgeR, the fifth column containing the raw read counts should be used after rounding the values to whole numbers. For users interested in gene-level abundances, the individual counts for each isoform of a gene can be summed.
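The summing and rounding steps described above can be sketched in Python as follows. This is a minimal illustration and not the script shipped with the tutorial. It assumes the standard quant.sf header (Name, Length, EffectiveLength, TPM, NumReads) and assumes that transcript IDs follow the pattern gene ID plus a “.N” isoform suffix; the function names and the toy values are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_quant(handle):
    """Map transcript name -> (TPM, NumReads) from a Salmon quant.sf file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["Name"]: (float(row["TPM"]), float(row["NumReads"]))
        for row in reader
    }


def gene_of(transcript_id):
    """Assumed naming: 'AT1G01010.1' -> 'AT1G01010' (drop the isoform suffix)."""
    return transcript_id.rsplit(".", 1)[0]


def gene_counts(quant):
    """Sum NumReads over the isoforms of each gene, then round to integers."""
    totals = defaultdict(float)
    for transcript, (_tpm, reads) in quant.items():
        totals[gene_of(transcript)] += reads
    return {gene: int(round(total)) for gene, total in totals.items()}


# Toy quant.sf content (made-up values) standing in for quants/WT1/quant.sf.
EXAMPLE_QUANT = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "AT1G01010.1\t1688\t1500.0\t12.5\t10.4\n"
    "AT1G01010.2\t1500\t1320.0\t4.0\t3.7\n"
    "AT1G01020.1\t1623\t1440.0\t0.0\t0.0\n"
)

print(gene_counts(read_quant(io.StringIO(EXAMPLE_QUANT))))
```

Summing before rounding avoids accumulating a rounding error for every isoform. For Trinity assemblies, gene_of would need to strip the isoform suffix of the Trinity identifier instead.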
3.5 Data Preparation for DESeq2 Input
A number of methods for importing data into DESeq2 exist, even using the same upstream pipeline. Users familiar with R can build the required data tables using native functions alone, or use software packages built specifically for the task such as tximport [21]. Others may find it more convenient to use in-house scripts written in scripting languages such as Perl or Python, or to use graphical spreadsheet applications such as those included with LibreOffice and Microsoft Office. We feel that the method is irrelevant, as long as the data can be arranged in such a way that it is easily imported into the differential expression tool of our pipeline. For convenience, Python scripts are included in the sample directory that can take a list of abundance files from STAR or Salmon, and their associated experimental conditions, and generate the files necessary for import into DESeq2. Example files containing the location of each STAR and Salmon alignment, as well as their experimental label, have been provided and can be used to build the tables using the table builder script. These files (STAR2DESEQ.txt and Salmon2DESEQ.txt) can be modified for different directory structures and experiment labels as necessary.
$ python3 table_builder.py
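For readers who prefer to see what such a table-building step involves, the Python sketch below combines the unstranded counts from several ReadsPerGene.out.tab files into one gene-by-sample table. It is not the table_builder.py script from the repository, whose internals are not described here. The function names, the second sample name, and the toy counts are hypothetical. Rows whose first field starts with “N_” are the summary rows that STAR writes at the top of the file and are skipped.

```python
def read_star_counts(lines, column=1):
    """Map gene ID -> count from ReadsPerGene.out.tab lines.

    column=1 selects the unstranded counts (the second column).
    """
    counts = {}
    for line in lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith("N_"):  # summary rows, not genes
            continue
        counts[fields[0]] = int(fields[column])
    return counts


def build_table(per_sample):
    """Combine {sample: {gene: count}} into a header row plus one row per gene."""
    samples = list(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    rows = [["gene"] + samples]
    for gene in genes:
        rows.append([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return rows


# Toy input (made-up counts) standing in for WT1/ReadsPerGene.out.tab etc.
wt1 = [
    "N_unmapped\t5\t5\t5\n", "N_multimapping\t2\t2\t2\n",
    "N_noFeature\t1\t3\t3\n", "N_ambiguous\t0\t0\t0\n",
    "AT1G01010\t12\t6\t6\n", "AT1G01020\t0\t0\t0\n",
]
other = ["AT1G01010\t7\t4\t3\n"]  # hypothetical second sample

table = build_table({"WT1": read_star_counts(wt1), "S2": read_star_counts(other)})
for row in table:
    print("\t".join(str(value) for value in row))
```

With real data, each list would be replaced by an open file handle for the corresponding ReadsPerGene.out.tab file, and the printed rows could be redirected to a tab-delimited text file for reading into R.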
The script will generate two tables containing counts and experimental groups for the STAR and Salmon data. It will also round the raw counts from Salmon, as DESeq2 is only compatible with whole integers. Open an R environment in the directory containing the tables and read them into R. $ starcount