English Pages X, 255 [254] Year 2021
Methods in Molecular Biology 2189
Mario Andrea Marchisio Editor
Computational Methods in Synthetic Biology Second Edition
METHODS IN MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
Computational Methods in Synthetic Biology Second Edition
Edited by
Mario Andrea Marchisio School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, China
ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-0821-0 ISBN 978-1-0716-0822-7 (eBook) https://doi.org/10.1007/978-1-0716-0822-7 © Springer Science+Business Media, LLC, part of Springer Nature 2015, 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface
The second edition of the book Computational Methods in Synthetic Biology can be seen as complementary to the first edition. Almost all chapter authors are new, and the overlap between the topics covered in the two editions is also relatively small. The continuity with the first edition lies mainly in the first part of the book, which contains chapters on foundational themes in Synthetic Biology such as biological parts and gene circuit design. As a novelty with respect to the first edition, this book hosts a part dedicated to computational tools for the design and analysis of CRISPR-Cas systems, now largely adopted in synthetic biology. Another part that is completely new with respect to the first edition is on “synthetic genomics”: it presents two chapters, one about the assembly of synthetic chromosomes (great progress has been made, in recent years, in the construction of artificial S. cerevisiae chromosomes) and the other about minimal genome design (another fundamental goal of synthetic biology). Moreover, in this second edition, more emphasis is given to methods for the analysis of metabolic pathways and to techniques borrowed from Systems Biology and Electrical Engineering to study and optimize various kinds of synthetic networks. We think that, on the whole, the two editions of our book give a very broad overview of the research areas encountered in in silico synthetic biology.
Tianjin, China
Mario Andrea Marchisio
Contents
Preface
Contributors
1 Using a Design of Experiments Approach to Inform the Design of Hybrid Synthetic Yeast Promoters
James Gilman, Valentin Zulkower, and Filippo Menolascina
2 Computational Design of Multiplex Oligonucleotide-Based Assays
Michaela Hendling and Ivan Barišić
3 Computational Methods for the Design of Recombinase Logic Circuits
Sarah Guiziou and Jerome Bonnet
4 Modular Modeling of Genetic Circuits in SBML Level 3
Mario Andrea Marchisio
5 CRISPR-ERA: A Webserver for Guide RNA Design of Gene Editing and Regulation
Honglei Liu and Xiaowo Wang
6 iGUIDE Method for CRISPR Off-Target Detection
Christopher L. Nobles
7 Web-Based Base Editing Toolkits: BE-Designer and BE-Analyzer
Gue-Ho Hwang and Sangsu Bae
8 Synthetic Gene Circuit Analysis and Optimization
Irene Otero-Muras and Julio R. Banga
9 Monitoring Single S. cerevisiae Cells with Multifrequency Electrical Impedance Spectroscopy in an Electrode-Integrated Microfluidic Device
Zhen Zhu, Yangye Geng, and Yingying Wang
10 Construction of Protein Expression Network
Nor Afiqah-Aleng and Zeti-Azura Mohamed-Hussein
11 Systems-Theoretic Approaches to Design Biological Networks with Desired Functionalities
Priyan Bhattacharya, Karthik Raman, and Arun K. Tangirala
12 A Deterministic Compartmental Modeling Framework for Disease Transmission
King James B. Villasin, Eva M. Rodriguez, and Angelyn R. Lao
13 Computer Aided Assembly and Verification of Synthetic Chromosomes
Giovanni Stracquadanio and Valentin Zulkower
14 Minimal Genome Design Algorithms Using Whole-Cell Models
Joshua Rees-Garbutt, Oliver Chalkley, Claire Grierson, and Lucia Marucci
15 Tn-Core: Functionally Interpreting Transposon-Sequencing Data with Metabolic Network Analysis
George C. diCenzo, Marco Galardini, and Marco Fondi
16 Genome-Scale Metabolic Modeling of Escherichia coli and Its Chassis Design for Synthetic Biology Applications
Bashir Sajo Mienda and Andreas Dräger
17 A Systems Bioinformatics Approach to Interconnect Biological Pathways
George Minadakis and George M. Spyrou
Index
Contributors
NOR AFIQAH-ALENG • Institute of Systems Biology (INBIOSIS), Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia; Institute of Marine Biotechnology, Universiti Malaysia Terengganu, Kuala Nerus, Terengganu, Malaysia
SANGSU BAE • Department of Chemistry, Hanyang University, Seoul, South Korea; Research Institute for Convergence of Basic Sciences, Hanyang University, Seoul, South Korea
JULIO R. BANGA • BioProcess Engineering Group, IIM-CSIC, Spanish National Research Council, Vigo, Spain
IVAN BARIŠIĆ • Center for Health & Bioresources, Molecular Diagnostics, Austrian Institute of Technology GmbH, Vienna, Austria
PRIYAN BHATTACHARYA • Department of Chemical Engineering, Indian Institute of Technology Madras, Chennai, India
JEROME BONNET • Centre de Biochimie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France
OLIVER CHALKLEY • Department of Engineering Mathematics, University of Bristol, Bristol, UK; Bristol Centre for Complexity Science, University of Bristol, Bristol, UK
GEORGE C. DICENZO • Department of Biology, Queen’s University, Kingston, ON, Canada
ANDREAS DRÄGER • Computational Systems Biology of Infection and Antimicrobial-Resistant Pathogens, Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, Tübingen, Germany; Department of Computer Science, University of Tübingen, Tübingen, Germany; German Center for Infection Research (DZIF), partner site Tübingen, Tübingen, Germany
MARCO FONDI • Department of Biology, University of Florence, Sesto Fiorentino, FI, Italy
MARCO GALARDINI • Biological Design Center, Boston University, Boston, MA, USA; Department of Biomedical Engineering, Boston University, Boston, MA, USA
YANGYE GENG • Key Laboratory of MEMS of Ministry of Education, Southeast University, Nanjing, China
JAMES GILMAN • Institute for Bioengineering, School of Engineering, University of Edinburgh, Edinburgh, UK
CLAIRE GRIERSON • BrisSynBio, University of Bristol, Bristol, UK; School of Biological Sciences, University of Bristol, Bristol, UK
SARAH GUIZIOU • Centre de Biochimie Structurale (CBS), INSERM U1054, CNRS UMR5048, University of Montpellier, Montpellier, France; Department of Biology, University of Washington, Seattle, WA, USA
MICHAELA HENDLING • Center for Health & Bioresources, Molecular Diagnostics, Austrian Institute of Technology GmbH, Vienna, Austria
GUE-HO HWANG • Department of Chemistry, Hanyang University, Seoul, South Korea; Research Institute for Convergence of Basic Sciences, Hanyang University, Seoul, South Korea
ANGELYN R. LAO • Mathematics and Statistics Department, De La Salle University, Manila, Philippines
HONGLEI LIU • School of Biomedical Engineering, Capital Medical University, Beijing, China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, Beijing, China
MARIO ANDREA MARCHISIO • School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, People’s Republic of China
LUCIA MARUCCI • BrisSynBio, University of Bristol, Bristol, UK; Department of Engineering Mathematics, University of Bristol, Bristol, UK; School of Cellular and Molecular Medicine, University of Bristol, Bristol, UK
FILIPPO MENOLASCINA • Institute for Bioengineering, School of Engineering, University of Edinburgh, Edinburgh, UK
BASHIR SAJO MIENDA • Department of Microbiology & Biotechnology, Faculty of Science, Federal University Dutse, Dutse, Jigawa, Nigeria
GEORGE MINADAKIS • Department of Bioinformatics, The Cyprus School of Molecular Medicine, The Cyprus Institute of Neurology & Genetics, Nicosia, Cyprus
ZETI-AZURA MOHAMED-HUSSEIN • Institute of Systems Biology (INBIOSIS), Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia; Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
CHRISTOPHER L. NOBLES • Department of Microbiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
IRENE OTERO-MURAS • BioProcess Engineering Group, IIM-CSIC, Spanish National Research Council, Vigo, Spain
KARTHIK RAMAN • Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India; Initiative for Biological Systems Engineering, Indian Institute of Technology Madras, Chennai, India; Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI), Indian Institute of Technology Madras, Chennai, India
JOSHUA REES-GARBUTT • BrisSynBio, University of Bristol, Bristol, UK; School of Biological Sciences, University of Bristol, Bristol, UK
EVA M. RODRIGUEZ • Department of Mathematics, School of Sciences and Engineering, University of Asia and the Pacific, Pasig City, Philippines
GEORGE M. SPYROU • Department of Bioinformatics, The Cyprus School of Molecular Medicine, The Cyprus Institute of Neurology & Genetics, Nicosia, Cyprus
GIOVANNI STRACQUADANIO • School of Biological Sciences, The University of Edinburgh, Edinburgh, UK
ARUN K. TANGIRALA • Department of Chemical Engineering, Indian Institute of Technology Madras, Chennai, India; Robert Bosch Centre for Data Science and Artificial Intelligence (RBC-DSAI), Indian Institute of Technology Madras, Chennai, India
KING JAMES B. VILLASIN • Department of Mathematics, School of Sciences and Engineering, University of Asia and the Pacific, Pasig City, Philippines
XIAOWO WANG • Ministry of Education Key Laboratory of Bioinformatics, Tsinghua University, Beijing, China; Center for Synthetic and Systems Biology, Tsinghua University, Beijing, China; Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China; Department of Automation, Tsinghua University, Beijing, China
YINGYING WANG • Key Laboratory of MEMS of Ministry of Education, Southeast University, Nanjing, China
ZHEN ZHU • Key Laboratory of MEMS of Ministry of Education, Southeast University, Nanjing, China
VALENTIN ZULKOWER • Edinburgh Genome Foundry, The University of Edinburgh, Edinburgh, UK
Chapter 1
Using a Design of Experiments Approach to Inform the Design of Hybrid Synthetic Yeast Promoters
James Gilman, Valentin Zulkower, and Filippo Menolascina
Abstract
Hybrid promoter engineering takes advantage of the modular nature of eukaryotic promoters by combining discrete promoter motifs to confer novel regulatory function. By combinatorially screening sequence libraries for trans-acting transcriptional operators, activators, repressors and core promoter sequences, it is possible to derive constitutive or inducible promoter collections covering a broad range of expression strengths. However, combinatorial approaches to promoter design can result in highly complex, multidimensional design spaces, which can be experimentally costly to thoroughly explore in vivo. Here, we describe an in silico pipeline for the design of hybrid promoter libraries that employs a Design of Experiments (DoE) approach to reduce experimental burden and efficiently explore the promoter fitness landscape. We also describe a software pipeline to ensure that the designed promoter sequences are compatible with the YTK assembly standard.
Key words Design of Experiments, Hybrid promoter engineering, JMP, Synthetic promoter, Yeast
1 Introduction
Robust, predictable genetic parts are a fundamental requirement if scalable, bottom-up engineering of complex biological systems is to be achieved [1, 2]. Multiple decision variables are available to tune synthetic gene circuits [3], but on a practical level the control of transcription activation using well-characterized promoters with defined activation and output functions (or “strengths”) offers a simple, widely used method by which the expression of transgenes or pathways can be balanced [1, 3]. Numerous approaches are available for the identification and design of promoters with desirable characteristics. For example, endogenous promoter sequences may be identified using genomic or transcriptomic analyses of the host organism of interest, followed by extensive in vivo characterization of promoter activity in multiple genetic and environmental contexts [3, 4]. However, natural promoter activity is often highly context specific [5] and
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_1, © Springer Science+Business Media, LLC, part of Springer Nature 2021
subject to interaction with a multitude of endogenous regulatory systems [6], complicating prediction of activity levels under varying conditions. Furthermore, the repeated use of identical promoter sequences is complicated in eukaryotes as a result of the efficient homologous recombination systems that are present in many species, which can decrease the stability of recombinant DNA constructs [7, 8]. Given the inherent limitations of endogenous promoters, the rational design of synthetic sequences with defined transcriptional activities has become a widely used method for the development of synthetic biology parts [3, 5, 9]. Although the regulation of eukaryotic transcription is a complex process, involving DNA–protein interactions between promoters and highly specific transcription factors, enhancers, and suppressors [10], promoter elements themselves can be abstracted to somewhat modular combinations of key motifs [9] that provide binding sites for trans-acting regulatory proteins [8]. As a result of this inherent modularity, eukaryotic promoters are particularly suited to hybrid engineering approaches, in which well-characterized promoter sequence motifs are combined to confer novel functionality [11–15]. For example, a 2018 study by Dossani et al. used a combinatorial approach to design a library of hybrid promoters that are regulated by a synthetic transcription factor [16]. The authors used a full-factorial approach to hybrid promoter design, in which all possible combinations of the sequence motifs of interest were characterized (Fig. 1). Short (100 bp) or long (250 bp) variants of 10 core promoters were combined with either 1, 2, or 3 tandem repeats of 1 of 4 operator sequences of the LexA DNA binding domain. A library of 240 putative synthetic promoters was therefore designed in silico, of which 154 sequences were successfully characterized in vivo [16]. 
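The size of the full-factorial library described above can be checked with a short enumeration. The following is a minimal Python sketch, not part of the original protocol (which uses JMP for this step); the factor levels are those named for the Dossani et al. design space, and the function name `full_factorial` is introduced here purely for illustration.

```python
from itertools import product

# Factor levels named for the Dossani et al. design space.
core_promoters = ["GAL1p", "LEU2p", "SPO13p", "TEF1p", "HHF2p",
                  "GCN4p", "CUP1p", "HEM13p", "ZRT1p", "SSL1p"]
core_lengths = ["Short", "Long"]          # 100 bp or 250 bp variants
operators = ["consensus", "colE1", "uvrA", "umuDC"]
operator_counts = [1, 2, 3]               # tandem repeats of one operator type

def full_factorial(*factors):
    """Return every combination of one level per factor (illustrative helper)."""
    return list(product(*factors))

library = full_factorial(core_promoters, core_lengths, operators, operator_counts)
print(len(library))   # 10 * 2 * 4 * 3 = 240 putative promoters
print(library[0])     # first design in the enumeration
```

Because only one operator type is allowed per promoter, the library size is simply the product of the number of levels of each factor; relaxing that restriction would multiply the count further.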
A full-factorial approach to hybrid promoter design rapidly becomes an inefficient, experimentally expensive way to explore the promoter fitness landscape as the complexity of the promoter structure increases. To restrict the complexity of their design space, Dossani et al. specified that only one type of operator could be present in any single promoter sequence [16]. However, such a restriction prevents exploration of regions of the design space that may encode promoters with desirable characteristics [15, 17], and would not be possible when designing promoters that respond to multiple input signals, which by their nature require multiple operators [18, 19]. For a hypothetical hybrid promoter containing x possible motifs, each with y possible variants, y^x experiments would be required to satisfy a full-factorial design [20]. In lieu of designing full-factorial hybrid promoter libraries, a multifactorial Design of Experiments (DoE) approach to promoter design offers a method by which a rationally selected subset of putative promoter sequences can be identified in silico to maximize
Fig. 1 Diagrammatic representation of the promoter structure characterized by Dossani et al. [Figure labels. Operator: consensus, colE1, umuDC, or uvrA, present as 1x, 2x, or 3x repeats. Core promoter: short (100 bp) or long (250 bp) variants of CUP1p, LEU2p, GAL1p, SPO13p, GCN4p, SSL1p, HEM13p, TEF1p, HHF2p, or ZRT1p]
Fig. 2 Exploring a design space using (a) a full-factorial experimental design or (b) a Taguchi array. A Taguchi array is a type of fractional factorial experimental design that is balanced, so that all levels of the experimental factors of interest are evaluated equally [21]. [Axes in both panels: Factor 1, Factor 2, Factor 3]
the proportion of the design space that is explored whilst experimental burden is minimized (Fig. 2) [22]. Subsequent in vivo empirical characterization of the designed sequences can provide well-structured data sets against which statistical models can be trained [23], allowing the promoter fitness landscape to be mapped [24] and promoter sequences to be optimized to provide defined strengths. This design-build-test cycle can be further expedited using BioFoundries to facilitate the high-throughput in vitro
assembly of the designed constructs [25]. Here, we describe a generally applicable protocol for the in silico design of a hybrid promoter library using DoE methodologies and the preparation of the resulting sequences for DNA synthesis and assembly on a BioFoundry platform, using the Saccharomyces cerevisiae hybrid promoter library described by Dossani et al. [16] as an example.
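The contrast drawn in Fig. 2 between a full-factorial design and a balanced fractional design can be illustrated with a small sketch. The Python below assumes three factors with three levels each and uses a Latin-square construction to pick a balanced subset of runs; this is one simple way to obtain an orthogonal array of the kind tabulated as Taguchi arrays, and it is not the algorithm used by JMP. The function name `latin_square_design` is hypothetical.

```python
from collections import Counter
from itertools import combinations, product

def latin_square_design(levels):
    """Balanced fraction for three factors with `levels` levels each.

    The third factor is set to (factor1 + factor2) mod levels, so the design
    needs levels**2 runs instead of the levels**3 runs of a full factorial.
    """
    return [(a, b, (a + b) % levels) for a, b in product(range(levels), repeat=2)]

levels = 3
full = list(product(range(levels), repeat=3))   # y**x = 3**3 runs
fraction = latin_square_design(levels)          # 3**2 runs
print(len(full), len(fraction))

# Balance check: for every pair of factors, each pair of levels occurs once.
for i, j in combinations(range(3), 2):
    pair_counts = Counter((run[i], run[j]) for run in fraction)
    assert len(pair_counts) == levels ** 2
    assert set(pair_counts.values()) == {1}
```

The fraction covers every level of every factor equally often while requiring a third of the runs, which is the trade-off a DoE approach exploits; the price is that higher-order interactions can no longer be estimated independently.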
2 Materials
1. JMP Pro statistical analysis software (SAS Institute Inc., www.jmp.com).
2. Joint Genome Institute build-optimization software tools (BOOST) [26]; see Note 1.
3. Edinburgh Genome Foundry Collection of Useful Biological Apps (CUBA; https://cuba.genomefoundry.org/).
3 Methods
An overview of the Design of Experiments approach to promoter design, in both general terms and as applied to the promoter design space identified by Dossani et al., is provided in Fig. 3.
3.1 Identifying Candidate Promoter Motifs
Hybrid promoter engineering requires the careful selection of candidate sequence motifs so that the final promoter library contains sequences with the desired functionality. However, the number and type of motifs that are included in an experimental design will be heavily dependent on the requirements of individual studies. In particular, a library of constitutive promoters will comprise a different selection of functional motifs from a library of inducible promoter sequences. The promoter sequence motifs considered in the library designed by Dossani et al. are summarized in Fig. 1. Regardless of whether a hybrid promoter library is intended to be constitutive or inducible, core promoter sequences are required to determine the start site and direction of transcription [5], and to provide a “backbone” to which upstream enhancers, operators, or suppressors can be added. Synthetic minimal core promoters with a range of strengths have been previously described and could be applied in hybrid promoter design [27]. Alternatively, putative core promoters can be bioinformatically identified from upstream of well-characterized genes in the host organism of interest [16], or previously characterized core promoters can be extracted from synthetic biology part registries such as the iGEM Registry of Standard Biological Parts [28] or SynBioHub [29]. The endogenous function of candidate core promoter sequences should be carefully considered, as this is likely to impact the activity of the
Fig. 3 Flow chart outlining the Design of Experiments approach to hybrid promoter design. The process is described (a) in general terms and (b) as applied to the promoter design space identified by Dossani et al.
hybrid promoter library. For example, when designing inducible promoters, it may be desirable to select core promoters from upstream of genes that are known to be natively differentially expressed in the presence or absence of the inducer ligand or environmental conditions of interest [16]. Alternatively, promoters whose function remains constant across a wide range of environmental or genetic contexts may be preferable [30]. In the case of the hybrid promoter library designed by Dossani et al., the 10 core promoter sequences that were chosen were isolated from upstream of well-characterized genes from the S. cerevisiae genome, with the stated aim of “representing both constitutive and inducible profiles at both low and high expression levels” [16]. Upstream of the core promoter, enhancer, suppressor and operator motifs can be specified to confer various functions. If, for example, the aim of a study is to develop a library of promoters that cause high levels of transcription activation, the hybrid promoter design could include natural or synthetic Upstream Activation Sequences (UAS), as tandem repeats or combinations of UAS can be used to tune the dynamic range of promoter activity [15]. To confer inducibility to a synthetic promoter, operator sequences are required that facilitate DNA–protein interactions between the promoter sequence and the relevant transcription factor, and promoter function can be fine-tuned by combining multiple operators or by modifying the operator sequence [31–33]. Native transcription factor binding sites can be identified individually from endogenous promoters with well-characterized activity [11, 13], or isolated en masse from ChIP-seq-derived data sets [33, 34]. As an alternative to co-opting endogenous regulatory machinery and to reduce cross-talk between native and synthetic genes or circuits, synthetic transcription factors and operator sequences can be incorporated into hybrid promoter design.
Synthetic transcription factors can be targeted to promoters using zinc fingers and their associated binding sequences [31], Transcription Activator-Like Effectors (TALEs) [30], catalytically inactive Cas9 (dCas9) [32, 35], or, as in the study performed by Dossani et al., using prokaryotic DNA binding domains like LexA [16].
3.2 Defining Experimental Parameters
Once identified, promoter motifs can be used to define the factors and responses of an experimental design. In the case of the hybrid promoter library designed by Dossani et al., the experimental factors of interest are the identity and length of the core promoter and the identity and copy number of the operator sequence (Fig. 1, Table 1). Dossani et al. defined two responses of interest in their experimental design. The “Inducibility” of a promoter was defined as the fold-change in promoter induction from 0 to 100 nM Estradiol, and the promoter “Response” was defined as the maximum level of
Table 1 Experimental factors for the hybrid promoter design specified by Dossani et al.
Pro_core    Pro_core_length    Operator     Operator_count
GAL1p       Short              consensus    1
LEU2p       Long               colE1        2
SPO13p                         uvrA         3
TEF1p                          umuDC
HHF2p
GCN4p
CUP1p
HEM13p
ZRT1p
SSL1p
induction at 10 nM Estradiol [16]. When considering which responses to characterize for a designed inducible system, it may also be beneficial to consider the “Leakiness” of a promoter in the absence of inducer [36] as a potential experimental response, as promoters with minimal levels of background activity are often desirable.
3.3 Screening the Promoter Design Space for Synthesis Constraint Violations and Incompatible Restriction Sites
Prior to DNA synthesis, commercial vendors screen submitted sequences for the presence of features that are known to be predictive of synthesis failure [26]. Violation of these synthesis constraints can preclude the synthesis of certain DNA sequences. The inclusion of one or more of these hard-to-synthesize sequences in an experimental design could result in missing data when the hybrid promoter library is characterized in vivo, reducing the amount of the promoter design space that is empirically explored and potentially reducing the predictive power of any downstream models. To prevent the inclusion of promoter sequences that violate synthesis constraints in an experimental design, the design space of interest should be screened for any problematic sequences, which can be excluded in silico from subsequent experimental designs. Additionally, the promoter sequences should be screened for the presence of any restriction sites that would render them incompatible with whichever DNA assembly strategy is to be used, e.g., BsaI restriction sites in the case of the method discussed in Subheading 3.5. Any illegal restriction sites can be removed from promoter sequences in silico using silent mutations, or the promoter can be included on the list of sequences to be excluded from the final experimental design.
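The restriction-site part of this screen amounts to a search for the enzyme recognition sequence on both strands of each candidate promoter. The following Python sketch illustrates the idea for BsaI (recognition sequence GGTCTC); it is not the CUBA tool used in the protocol, and the function names and example sequences are hypothetical.

```python
BSAI_SITE = "GGTCTC"  # BsaI recognition sequence

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))

def find_sites(seq, site=BSAI_SITE):
    """Return sorted 0-based start positions of `site` on either strand."""
    seq = seq.upper()
    hits = []
    # A site on the reverse strand appears as its reverse complement
    # when reading the forward strand.
    for target in {site, reverse_complement(site)}:
        start = seq.find(target)
        while start != -1:
            hits.append(start)
            start = seq.find(target, start + 1)
    return sorted(hits)

# Hypothetical sequences for illustration, not taken from Dossani et al.
candidates = {
    "promoter_A": "ATGCGGTCTCAATTGC",  # forward-strand site
    "promoter_B": "TTGAGACCATATAT",    # reverse-strand site
    "promoter_C": "ATATATATGCGCGC",    # no site
}
flagged = {name: find_sites(seq) for name, seq in candidates.items() if find_sites(seq)}
print(flagged)
```

Sequences that are flagged in this way can either be edited to remove the site or added to the list of combinations to exclude from the design, as described in the steps that follow.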
1. Using the Full-Factorial Design function of JMP software, generate a full combinatorial promoter library from the identified sequence motifs. Once the design is generated in JMP, the find and replace function of any text editing software can be used to convert the categorical motif labels into DNA sequence to generate the complete promoter sequences.
2. Identify and remove any incompatible restriction sites from the promoter sequences using the CUBA “Sculpt a Sequence” tool (see Note 2). For example, MoClo-based assembly strategies require that any candidate DNA parts do not contain BsaI restriction sites.
3. Analyze the designed promoters using the “Polisher” function of BOOST to identify sequences that violate DNA synthesis constraints of the chosen synthesis vendor (see Note 3). Dossani et al. identified 60 putative promoters that violated the synthesis constraints specified by IDT and that were therefore likely to be problematic to synthesize.
4. Once screening is complete, JMP scripting language (JSL) should be used to write a Disallowed Combinations script to prevent the inclusion of any problematic sequences in downstream experimental designs. The script should take the form:
   Pro_core = 5 & Pro_core_length = 2 & Operator = 2 & Operator_count = 1 |
   Pro_core = 4 & Pro_core_length = 2 & Operator = 2 & Operator_count = 1 |
   Pro_core = 7 & Pro_core_length = 2 & Operator = 1 & Operator_count = 1 |
   Pro_core = 7 & Pro_core_length = 1 & Operator = 1 & Operator_count = 1
where the numbers refer to the ordinal value of the level being excluded (Table 1). For example, the first line of the above script refers to a promoter consisting of the HHF2p core promoter, in its short form, combined with a single copy of the colE1 operator sequence.
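When many sequences are flagged, writing the Disallowed Combinations expression by hand becomes error-prone, and it may be convenient to generate the text programmatically before pasting it into JMP. The sketch below is one possible way to do this in Python. The level order in `LEVELS` is copied from Table 1 and is an assumption: the ordinal coding of each factor should be confirmed in the JMP factor definitions before the output is used. The function names and the flagged designs are hypothetical.

```python
# Level order copied from Table 1. ASSUMPTION: this must match the ordinal
# coding JMP uses for each factor; verify it in the Custom Design window.
LEVELS = {
    "Pro_core": ["GAL1p", "LEU2p", "SPO13p", "TEF1p", "HHF2p",
                 "GCN4p", "CUP1p", "HEM13p", "ZRT1p", "SSL1p"],
    "Pro_core_length": ["Short", "Long"],
    "Operator": ["consensus", "colE1", "uvrA", "umuDC"],
    "Operator_count": [1, 2, 3],
}

def clause(design):
    """Express one design (factor -> level) as a clause of 1-based ordinals."""
    terms = []
    for factor, levels in LEVELS.items():
        ordinal = levels.index(design[factor]) + 1
        terms.append(f"{factor} = {ordinal}")
    return " & ".join(terms)

def disallowed_script(designs):
    """Join one clause per flagged design with the '|' (or) operator."""
    return " |\n".join(clause(design) for design in designs)

# Hypothetical flagged designs, for illustration only.
flagged = [
    {"Pro_core": "GAL1p", "Pro_core_length": "Long",
     "Operator": "uvrA", "Operator_count": 3},
    {"Pro_core": "CUP1p", "Pro_core_length": "Short",
     "Operator": "consensus", "Operator_count": 2},
]
print(disallowed_script(flagged))
```

Keeping the level order in a single dictionary means that a change to the factor definitions only has to be made in one place, and an unknown level name raises an error rather than silently producing a wrong ordinal.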
3.4 Designing a Hybrid Promoter Library
The Custom Design platform of JMP Pro can be used to design a hybrid promoter library according to the following protocol. The Custom Design window, populated with settings to generate an experimental design based on the promoter structure described by Dossani et al. [16], is shown in Fig. 4.
Fig. 4 JMP Pro version 14 DoE Custom Design window
1. Specify the responses to be used in the experimental design, and whether the goal of the experiment is to maximize or minimize the value of each response. It is also possible to aim to match a specified target value, or to have no goal for a response if the purpose of the experiment is to explore the design space of interest rather than to optimize promoter performance.
2. Specify the factors to be used in the experimental design. The factor type must also be specified; the majority of the factors included in hybrid promoter design are likely to be categorical. In the case of the hybrid promoter library designed by Dossani et al., Pro_Core, Pro_core_length, and Operator were set as categorical factors. Operator_count was defined as a Discrete Numeric factor. It is also possible to specify the difficulty of changing a factor. Setting a factor as “Hard” to change will impose blocking on the final experimental design. However, as the promoter sequences designed by Dossani et al. were synthesized, there were no constraints on changing factors between runs. As such, all four experimental factors were set as “Easy” to change. Once the experimental responses and factors have been defined, they can be saved as data tables. This option can be found in the “Red Arrow” menu, which is located at the top left of the Custom Design window (Fig. 4). The resulting tables can subsequently be reloaded in any of the JMP DoE platforms, expediting the process of generating multiple experimental designs from the same responses and factors.
3. Define the factor constraints to be applied to the experimental design by inserting the Disallowed Combinations Script (see Subheading 3.3).
4. Specify the effects that are to be estimated in the assumed model (i.e., the model that will be trained on the experimental data). Main effects are added by default, as are polynomial terms if a Discrete Numeric factor has been specified. Interaction effects up to the fifth order can be specified, as can polynomial terms for Continuous or Discrete Numeric factors, again up to the fifth order. The Cross option can be used to add specific interaction terms. As one of the aims of a Design of Experiments approach to hybrid promoter design is a reduction in experimental burden as compared to a full-factorial approach, a trade-off is required between the complexity of the specified model and the number of runs that can be experimentally budgeted. A more complex model can potentially yield greater insights into the system under investigation, but generally requires a greater number of experimental runs to resolve than a more parsimonious alternative. To prevent designs becoming too experimentally costly, the Estimability of model terms can be changed. The Estimability of a term is a designation of the importance of estimating that term in the final design. Setting the Estimability of a term as “Necessary” will force the JMP algorithm to generate an experimental design in which that term can be estimated, whereas setting Estimability to “If Possible” will result in a design that estimates the model term only if there is sufficient space in the design, as permitted by the number of runs specified by the user (see step 6, below). In the case of the example shown in Fig. 4, main effects and second-order interactions were specified, and the Estimability of all model terms was set to “Necessary”.
5. If applicable, specify any Alias Terms. Alias terms are effects that are not included in the assumed model but that may affect the performance of the system under investigation, and may therefore bias the estimates of the model terms. Once the experimental design is generated, the Alias Matrix, which can be accessed under the Design Evaluation option in JMP, can be used to assess the extent to which the effects that have not been included in the assumed model are predicted to alias with the model terms. As an example, if one were to specify an assumed model that consisted solely of first-order effects (e.g., Prom_core, Prom_core_length, Operator, and Operator_count), it may be beneficial to specify second-order interactions as alias terms.
If one or more of the second-order effects were shown to be strongly aliased with any of the main effects, a second experimental design should be created that contains the aliased term [37].
6. Select Design Generation settings. For a given experimental design, JMP will recommend a Minimum and Default number of runs (in this case, a run represents a single promoter sequence). Selecting the minimum number of runs results in a design with no error term for testing, and is only appropriate when the cost of additional runs is prohibitive [37]. It may be beneficial to generate multiple experimental designs that vary in number of runs and to subsequently use the Compare Designs tool of the JMP software to evaluate whether the benefits of including additional runs are worth the increase in experimental burden (see Note 4). It is not necessary to specify replicate runs, as we do not wish to include multiple copies of the same promoter sequence in the experimental design; biological or technical replicates can be obtained at the in vivo characterization stage without the need to synthesize multiple copies of an individual promoter. Randomization of the run order is not required at the design and synthesis stage, but should be considered at the in vivo characterization stage to minimize the effect of bias and technical or biological sources of error on the experimental outcome [38]. Likewise, it may be necessary to consider blocking at the characterization stage if experimental constraints require promoters to be characterized in batches rather than en masse.
7. Make Design. Generating a hybrid promoter library using the settings in Fig. 4 resulted in the output shown in Fig. 5. The find and replace feature of any text processing software can be used to convert the categorical feature labels to DNA sequences, and the resulting promoters should be flanked with cloning affixes to facilitate their post-synthesis assembly using a relevant assembly standard.
Fig. 5 Output of the JMP Custom Design platform as applied to hybrid promoter design
3.5 Synthesis and Assembly of Designed Promoter Sequences
Promoter sequences can be ordered as linear parts from a synthesis company (e.g., Integrated DNA Technologies or Twist Bioscience) and assembled upstream of a reporter gene in a yeast expression plasmid using the Yeast Toolkit (YTK) assembly standard [39]. Here, we describe a software pipeline to ensure that the synthesized promoter sequences are compatible with this assembly standard.
1. Ensure that the sequence of each promoter is free of BsaI restriction sites, using the tools discussed in Subheading 3.3.
2. Add the flanking sequences “atgcGGTCTCaAACG” and “ATACtGAGACCatgc” to the left- and right-hand side of each promoter, respectively, to create a sequence that is compatible with position 2 of the YTK standard (GGTCTC and its reverse complement GAGACC are BsaI restriction sites, and AACG and ATAC are the assembly overhangs). This operation can be performed manually using a text editor, or automatically via the “Domesticate Part Batches” CUBA application.
3. To verify that the resulting sequences assemble correctly with other YTK parts, and to obtain the final DNA sequences of the resulting constructs, first download the part sequences of the Yeast Toolkit from AddGene (https://www.addgene.org/kits/moclo-ytk, section Protocols & Resource). For illustration, here we simulate the cloning of the different promoters upstream of the mRuby2 coding sequence (part YTK034) and tENO1 terminator (YTK061), along with connectors conL1 and conR1 (YTK003 and YTK067, respectively) and a receptor vector (YTK095) (Fig. 6).
4. Connect to the CUBA application for simulated Golden Gate cloning (Fig. 6a). Drag and drop the YTK parts in the upload box (Fig. 6a i). The sequences of all the designed promoters, in either GenBank or FASTA format, should also be uploaded (Fig. 6a ii), after which the cloning simulation can be started (Fig. 6a iii). The application will return GenBank records for all of the resulting constructs, which can subsequently be inspected to verify that the promoters assemble with the YTK parts as intended. The GenBank files can be stored in a sequence manager (e.g., Benchling or JBEI-ICE [40]) for later use, e.g., in quality control operations.
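Steps 1 and 2 above can also be scripted. The following Python sketch is only an illustration of the idea, not the CUBA implementation: the function names and the example promoter fragment are hypothetical, and the flanks are the two sequences quoted in step 2.

```python
# Illustrative sketch: screen promoters for BsaI sites and add YTK position-2 flanks.
LEFT_FLANK = "atgcGGTCTCaAACG"     # BsaI site (GGTCTC) followed by the AACG overhang
RIGHT_FLANK = "ATACtGAGACCatgc"    # ATAC overhang followed by the reverse-complement site (GAGACC)
BSAI_SITES = ("GGTCTC", "GAGACC")  # recognition site as read on either strand


def count_bsai_sites(seq: str) -> int:
    """Count BsaI recognition sites on both strands (case-insensitive)."""
    s = seq.upper()
    return sum(
        1
        for site in BSAI_SITES
        for i in range(len(s) - len(site) + 1)
        if s[i:i + len(site)] == site
    )


def domesticate(promoter: str) -> str:
    """Add the flanks, refusing promoters that already contain a BsaI site."""
    if count_bsai_sites(promoter) != 0:
        raise ValueError("promoter contains an internal BsaI site")
    return LEFT_FLANK + promoter + RIGHT_FLANK


example = "TATAAAAGCACGTGCATGCC"  # hypothetical promoter fragment, for illustration only
flanked = domesticate(example)
```

A correctly flanked part should contain exactly two BsaI sites, one in each flank, so `count_bsai_sites(flanked)` can serve as a quick sanity check before the sequences are submitted to the cloning simulation.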
Once the final promoter sequences are verified and synthesized, the assembly itself can be performed at the bench or outsourced to a specialized facility that uses robotic equipment to automate liquid handling and quality-control operations [41].
3.6 Suggestions for Future Work
Once designed and synthesized, the promoter library should be characterized in vivo in the host organism of interest. Ideally, part characterization should be carried out in multiple contexts. If part characterization does not consider the potentially synergistic, antagonistic, or neutral effects of environmental and genetic context [42], the performance of individual parts cannot be generalized, necessitating time-consuming and potentially prohibitive empirical testing and optimization when using said parts to design synthetic pathways or circuits [43]. It may also be beneficial to consider including insulator mechanisms in the promoter characterization construct. Insulators can take the form of either physical separation of genetic parts [44, 45] or molecular transcript processing [46, 47] to disrupt context-specific mRNA secondary structures and increase the modularity of the designed promoters.
Fig. 6 Interface of the CUBA Golden Gate cloning simulation application. (a) Screenshot of the web application, with red letters added for reference in the main text. (b) Example of construct record returned by the application (map rendered with SnapGene viewer)
When in vivo characterization is complete, the resulting empirical data can be analyzed using the statistical modeling capabilities of the JMP software, or another statistical analysis package of choice. Once trained and validated, these statistical models can be used to explore the regions of the promoter design space that were not empirically characterized to identify promoters with desirable characteristics.
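As a rough illustration of what exploring the uncharacterized design space involves, the sketch below enumerates a full-factorial space and lists the combinations that were not part of the characterized design. The factor names follow those used in Subheading 3.4, but the levels, the function names, and the choice of characterized runs are hypothetical placeholders; a trained model would then be applied to the uncharacterized combinations.

```python
from itertools import product

# Hypothetical factor levels; real levels depend on the parts library in use.
FACTORS = {
    "Prom_core": ["core_A", "core_B", "core_C"],
    "Prom_core_length": ["short", "long"],
    "Operator": ["op_1", "op_2"],
    "Operator_count": [1, 2, 3],
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def uncharacterized(design_space, characterized):
    """Return the combinations that are absent from the characterized runs."""
    seen = {tuple(sorted(run.items())) for run in characterized}
    return [run for run in design_space if tuple(sorted(run.items())) not in seen]


space = full_factorial(FACTORS)
tested = space[:12]  # placeholder for the runs selected by the DoE software
untested = uncharacterized(space, tested)
```

With these placeholder levels the full-factorial space has 3 x 2 x 2 x 3 = 36 combinations, which shows how quickly the experimental burden grows compared with a reduced design.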
4 Notes
1. The “Polisher” function of the Joint Genome Institute’s BOOST build optimization software tools [26] was used to identify putative promoter sequences that violated DNA synthesis constraints. At the time of writing, BOOST is capable of identifying sequences that violate the synthesis constraints of six vendors: Gen9, GenScript, Integrated DNA Technologies, SGI-DNA, Thermo Fisher Scientific (Life Technologies), and Twist Bioscience. Alternatively, users may define their own synthesis constraints, or use the online tools for sequence analysis provided on the websites of the DNA synthesis vendors themselves.
2. Documentation for the Sculpt a Sequence application is available via the Edinburgh Genome Foundry GitHub [48].
3. One of the features for which BOOST screens is the presence of sequence repeats. However, as hybrid promoter libraries may by design contain tandem repeats of DNA sequence, using the default BOOST settings to analyze a hybrid promoter library can result in a high proportion of the analyzed sequences being flagged as problematic. Users can therefore either define a unique set of synthesis constraints in BOOST such that sequence repeats are not flagged in the analysis, or manually examine the BOOST output to identify those sequences with multiple sequence constraint violations [16].
4. For a thorough overview of the Compare Designs platform of JMP, see [37].
References
1. Gilman J, Singleton C, Tennant RK, James P, Howard TP, Lux T, Parker DA, Love J (2019) Rapid, heuristic discovery and design of promoter collections in non-model microbes for industrial applications. ACS Synth Biol 8:1175–1186. https://doi.org/10.1021/acssynbio.9b00061
2. Nielsen AAK, Der BS, Shin J, Vaidyanathan P, Paralanov V, Strychalski EA, Ross D, Densmore D, Voigt CA (2016) Genetic circuit design automation. Science 352:aac7341. https://doi.org/10.1126/science.aac7341
3. Gilman J, Love J (2016) Synthetic promoter design for new microbial chassis. Biochem Soc Trans 44:731–737. https://doi.org/10.1042/BST20160042
4. Johns NI, Gomes ALC, Yim SS, Yang A, Blazejewski T, Smillie CS, Smith MB, Alm EJ, Kosuri S, Wang HH (2018) Metagenomic mining of regulatory elements enables programmable species-selective gene expression. Nat Methods 15:323–329. https://doi.org/10.1038/nmeth.4633
5. Blazeck J, Alper HS (2013) Promoter engineering: recent advances in controlling transcription at the most fundamental level. Biotechnol J 8:46–58. https://doi.org/10.1002/biot.201200120
6. Collado-Vides J, Magasanik B, Gralla JD (1991) Control site location and transcriptional regulation in Escherichia coli. Microbiol Rev 55(3):371–394
7. Gibson DG, Benders GA, Axelrod KC, Zaveri J, Algire MA, Moodie M, Montague MG, Venter JC, Smith HO, Hutchison CA (2008) One-step assembly in yeast of 25 overlapping DNA fragments to form a complete synthetic Mycoplasma genitalium genome. Proc Natl Acad Sci 105:20404–20409. https://doi.org/10.1073/pnas.0811011106
8. Mehrotra R, Renganaath K, Kanodia H, Loake GJ, Mehrotra S (2017) Towards combinatorial transcriptional engineering. Biotechnol Adv 35:390–405. https://doi.org/10.1016/j.biotechadv.2017.03.006
9. Shabbir Hussain M, Gambill L, Smith S, Blenner MA (2016) Engineering promoter architecture in oleaginous yeast Yarrowia lipolytica. ACS Synth Biol 5:213–223. https://doi.org/10.1021/acssynbio.5b00100
10. Hahn S, Young ET (2011) Transcriptional regulation in Saccharomyces cerevisiae: transcription factor regulation and function, mechanisms of initiation, and roles of activators and coactivators. Genetics 189:705–736. https://doi.org/10.1534/genetics.111.127019
11. Hector RE, Mertens JA (2017) A synthetic hybrid promoter for xylose-regulated control of gene expression in Saccharomyces yeasts. Mol Biotechnol 59:24–33. https://doi.org/10.1007/s12033-016-9991-5
12. Ji H, Lu X, Zong H, Zhuge B (2017) A synthetic hybrid promoter for D-xylonate production at low pH in the tolerant yeast Candida glycerinogenes. Bioengineered 8:700–706. https://doi.org/10.1080/21655979.2017.1312229
13. Pothoulakis G, Ellis T (2018) Construction of hybrid regulated mother-specific yeast promoters for inducible differential gene expression. PLoS One 13:e0194588. https://doi.org/10.1371/journal.pone.0194588
14. Trassaert M, Vandermies M, Carly F, Denies O, Thomas S, Fickers P, Nicaud J-M (2017) New inducible promoter for gene expression and synthetic biology in Yarrowia lipolytica. Microb Cell Factories 16:141. https://doi.org/10.1186/s12934-017-0755-0
15. Blazeck J, Garg R, Reed B, Alper HS (2012) Controlling promoter strength and regulation in Saccharomyces cerevisiae using synthetic hybrid promoters. Biotechnol Bioeng 109:2884–2895. https://doi.org/10.1002/bit.24552
16. Dossani ZY, Reider Apel A, Szmidt-Middleton H, Hillson NJ, Deutsch S, Keasling JD, Mukhopadhyay A (2018) A combinatorial approach to synthetic transcription factor-promoter combinations for yeast strain engineering. Yeast 35:273–280. https://doi.org/10.1002/yea.3292
17. Jonsson J, Norberg T, Carlsson L, Gustafsson C, Wold S (1993) Quantitative sequence-activity models (QSAM)—tools for sequence design. Nucl Acids Res 21:733–739. https://doi.org/10.1093/nar/21.3.733
18. Chen X, Li T, Wang X, Du Z, Liu R, Yang Y (2016) Synthetic dual-input mammalian genetic circuits enable tunable and stringent transcription control by chemical and light. Nucleic Acids Res 44:2677–2690. https://doi.org/10.1093/nar/gkv1343
19. Mazumder M, McMillen DR (2014) Design and characterization of a dual-mode promoter with activation and repression capability for tuning gene expression in yeast. Nucleic Acids Res 42:9514–9522. https://doi.org/10.1093/nar/gku651
20. Owen M, Cox I (2018) Design of Experiments. In: Schlindwein WS, Gibson M (eds) Pharmaceutical quality by design. Wiley, Chichester, pp 157–199
21. Fernández-López JA, Angosto JM, Roca MJ, Doval Miñarro M (2019) Taguchi design-based enhancement of heavy metals bioremoval by agroindustrial waste biomass from artichoke. Sci Total Environ 653:55–63. https://doi.org/10.1016/j.scitotenv.2018.10.343
22. Fellermann H, Shirt-Ediss B, Kozyra J, Linsley M, Lendrem D, Isaacs J, Howard T (2019) Design of experiments and the virtual PCR simulator: an online game for pharmaceutical scientists and biotechnologists. Pharm Stat 18:402–406. https://doi.org/10.1002/pst.1932
23. Brown SR, Staff M, Lee R, Love J, Parker DA, Aves SJ, Howard TP (2018) Design of experiments methodology to build a multifactorial statistical model describing the metabolic interactions of alcohol dehydrogenase isozymes in the ethanol biosynthetic pathway of the yeast Saccharomyces cerevisiae. ACS Synth Biol 7:1676–1684. https://doi.org/10.1021/acssynbio.8b00112
24. Kumar V, Bhalla A, Rathore AS (2014) Design of experiments applications in bioprocessing: concepts and approach. Biotechnol Progress 30:86–99. https://doi.org/10.1002/btpr.1821
25. Hillson N, Caddick M, Cai Y, Carrasco JA, Chang MW, Curach NC, Bell DJ, Le Feuvre R, Friedman DC, Fu X, Gold ND, Herrgård MJ, Holowko MB, Johnson JR, Johnson RA, Keasling JD, Kitney RI, Kondo A, Liu C, Martin VJJ, Menolascina F, Ogino C, Patron NJ, Pavan M, Poh CL, Pretorius IS, Rosser SJ, Scrutton NS, Storch M, Tekotte H, Travnik E, Vickers CE, Yew WS, Yuan Y, Zhao H, Freemont PS (2019) Building a global alliance of biofoundries. Nat Commun 10:2040. https://doi.org/10.1038/s41467-019-10079-2
26. Oberortner E, Cheng J-F, Hillson NJ, Deutsch S (2017) Streamlining the design-to-build transition with build-optimization software tools. ACS Synth Biol 6:485–496. https://doi.org/10.1021/acssynbio.6b00200
27. Redden H, Alper HS (2015) The development and characterization of synthetic minimal yeast promoters. Nat Commun 6:7810. https://doi.org/10.1038/ncomms8810
28. parts.igem.org. http://parts.igem.org/Main_Page. Accessed 9 Aug 2019
29. McLaughlin JA, Myers CJ, Zundel Z, Mısırlı G, Zhang M, Ofiteru ID, Goñi-Moreno A, Wipat A (2018) SynBioHub: a standards-enabled design repository for synthetic biology. ACS Synth Biol 7:682–688. https://doi.org/10.1021/acssynbio.7b00403
30. Blount BA, Weenink T, Vasylechko S, Ellis T (2012) Rational diversification of a promoter providing fine-tuned expression and orthogonal regulation for synthetic biology. PLoS One 7:e33279. https://doi.org/10.1371/journal.pone.0033279
31. Khalil AS, Lu TK, Bashor CJ, Ramirez CL, Pyenson NC, Joung JK, Collins JJ (2012) A synthetic biology framework for programming eukaryotic transcription functions. Cell 150:647–658. https://doi.org/10.1016/j.cell.2012.05.045
32. Farzadfard F, Perli SD, Lu TK (2013) Tunable and multifunctional eukaryotic transcription factors based on CRISPR/Cas. ACS Synth Biol 2:604–613. https://doi.org/10.1021/sb400081r
33. Rajkumar AS, Liu G, Bergenholm D, Arsovska D, Kristensen M, Nielsen J, Jensen MK, Keasling JD (2016) Engineering of synthetic, stress-responsive yeast promoters. Nucleic Acids Res 44:e136–e136. https://doi.org/10.1093/nar/gkw553
34. MacIsaac KD, Wang T, Gordon DB, Gifford DK, Stormo GD, Fraenkel E (2006) An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 7:113. https://doi.org/10.1186/1471-2105-7-113
35. Chavez A, Scheiman J, Vora S, Pruitt BW, Tuttle M, P R Iyer E, Lin S, Kiani S, Guzman CD, Wiegand DJ, Ter-Ovanesyan D, Braff JL, Davidsohn N, Housden BE, Perrimon N, Weiss R, Aach J, Collins JJ, Church GM (2015) Highly efficient Cas9-mediated transcriptional programming. Nat Methods 12:326–328. https://doi.org/10.1038/nmeth.3312
36. Chen Y, Ho JML, Shis DL, Gupta C, Long J, Wagner DS, Ott W, Josić K, Bennett MR (2018) Tuning the dynamic range of bacterial promoters regulated by ligand-inducible transcription factors. Nat Commun 9:64. https://doi.org/10.1038/s41467-017-02473-5
37. SAS Institute Inc (2018) JMP 14 Design of Experiments Guide. SAS Institute Inc., Cary, NC
38. Urbach P (1985) Randomization and the Design of Experiments. Philos Sci 52:256–273. https://doi.org/10.1086/289243
39. Lee ME, DeLoache WC, Cervantes B, Dueber JE (2015) A highly characterized yeast toolkit for modular, multipart assembly. ACS Synth Biol 4:975–986. https://doi.org/10.1021/sb500366v
40. Ham TS, Dmytriv Z, Plahar H, Chen J, Hillson NJ, Keasling JD (2012) Design, implementation and practice of JBEI-ICE: an open source biological part registry platform and tools. Nucleic Acids Res 40:e141–e141. https://doi.org/10.1093/nar/gks531
41. Chao R, Mishra S, Si T, Zhao H (2017) Engineering biological systems using automated biofoundries. Metab Eng 42:98–108. https://doi.org/10.1016/j.ymben.2017.06.003
42. Del Vecchio D (2015) Modularity, context-dependence, and insulation in engineered biological circuits. Trends Biotechnol 33:111–119. https://doi.org/10.1016/j.tibtech.2014.11.009
43. Rudge TJ, Brown JR, Federici F, Dalchau N, Phillips A, Ajioka JW, Haseloff J (2016) Characterization of intrinsic properties of promoters. ACS Synth Biol 5:89–98. https://doi.org/10.1021/acssynbio.5b00116
44. Davis JH, Rubin AJ, Sauer RT (2011) Design, construction and characterization of a set of insulated bacterial promoters. Nucleic Acids Res 39:1131–1141. https://doi.org/10.1093/nar/gkq810
45. Mutalik VK, Guimaraes JC, Cambray G, Lam C, Christoffersen MJ, Mai Q-A, Tran AB, Paull M, Keasling JD, Arkin AP, Endy D (2013) Precise and reliable gene expression via standard transcription and translation initiation elements. Nat Methods 10:354–360. https://doi.org/10.1038/nmeth.2404
46. Qi L, Haurwitz RE, Shao W, Doudna JA, Arkin AP (2012) RNA processing enables predictable programming of gene expression. Nat Biotechnol 30:1002–1006. https://doi.org/10.1038/nbt.2355
47. Lou C, Stanton B, Chen Y-J, Munsky B, Voigt CA (2012) Ribozyme-based insulator parts buffer synthetic circuits from genetic context. Nat Biotechnol 30:1137–1142. https://doi.org/10.1038/nbt.2401
48. Edinburgh Genome Foundry. In: GitHub. https://github.com/Edinburgh-Genome-Foundry. Accessed 16 Sept 2019
Chapter 2
Computational Design of Multiplex Oligonucleotide-Based Assays
Michaela Hendling and Ivan Barišić
Abstract
The success of any oligonucleotide-based experiment strongly depends on the accurate design of the components. Oli2go is a user-friendly web tool that provides efficient multiplex oligonucleotide design including specificity and primer dimer checks. Its fully automated workflow involves important design steps that use specific parameters to produce high-quality oligonucleotides. This chapter describes how these steps are computationally implemented by oli2go.
Key words Multiplex oligonucleotide design, Primer dimers, Probe specificity
1 Introduction
Polymerase chain reaction (PCR), microarrays, and fluorescence in situ hybridization (FISH) are some of the most common oligonucleotide-based experiments. The key event of these experiments is the specific binding between oligonucleotides and target DNA. However, these single-stranded DNA molecules also tend to bind to unintended targets or themselves. Therefore, the success of any oligonucleotide-based experiment strongly depends on the accurate design of these components. The selection of single oligonucleotides can usually be performed without much computational effort. However, the construction of complex multiplex assays, involving multiple primer sets, represents a considerable challenge [1]. Besides comprehensive data sets, comprising both target and nontarget DNA, it is necessary to apply a design workflow that accurately simulates the experimental behavior of oligonucleotides [2]. In this chapter, we present the computational methods applied by oli2go [3]. This tool is a fully automated multiplex oligonucleotide design tool, which performs primer and different hybridization probe designs as well as specificity and cross dimer checks in a single run. It is freely available through http://oli2go.ait.ac.at.
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_2, © Springer Science+Business Media, LLC, part of Springer Nature 2021
2 Implementation
The software workflow runs on a Linux server (64 CPUs, 256 GB RAM) with an Ubuntu distribution (release 14.04). The main pipeline is managed via a Python script that consecutively calls other Python scripts and third-party software. This pipeline and all Python scripts are implemented using Python 2.7. The web service processes one job at a time so that each design has access to all available resources. The webserver is implemented using Apache 2.4.18. The highly responsive web interface is implemented using Bootstrap 3.3.7, JavaScript, and PHP.
3 Databases
Oli2go implements a specificity check for the designed probes targeting sequences from bacteria, fungi, protozoa, viruses, invertebrates, plants, patented sequences, whole genome shotgun (WGS) projects, archaea, and environmental samples. This check is based on large datasets of DNA sequences downloaded from the NCBI File Transfer Protocol (FTP) server. The sequence data were retrieved from three different directories (see Table 1):
1. GenBank: ftp://ftp.ncbi.nlm.nih.gov/genbank/
2. Genomes: ftp://ftp.ncbi.nlm.nih.gov/genomes/
3. WGS: ftp://ftp.ncbi.nlm.nih.gov/genbank/wgs
Afterwards, the sequences were saved as BLAST databases using the makeblastdb command of BLAST 2.7.0+ [4]. The databases are stored directly on a RAM drive to obtain optimal performance.
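The database build can be sketched as a thin Python wrapper around makeblastdb. This is an assumption about how such a call might look, not the command line used by oli2go: the file names, the RAM-drive path, and the function names are hypothetical, and only the standard -in, -dbtype, and -out options are used.

```python
import subprocess


def makeblastdb_command(fasta_path: str, db_name: str) -> list:
    """Build the argument list for a nucleotide BLAST database."""
    return [
        "makeblastdb",
        "-in", fasta_path,   # input FASTA file
        "-dbtype", "nucl",   # nucleotide database
        "-out", db_name,     # database path prefix, e.g. on a RAM drive
    ]


def build_database(fasta_path: str, db_name: str) -> None:
    """Run makeblastdb; requires the BLAST+ binaries on the PATH."""
    subprocess.run(makeblastdb_command(fasta_path, db_name), check=True)


# Hypothetical paths, shown only to illustrate the call.
cmd = makeblastdb_command("bacteria.fasta", "/mnt/ramdisk/bacteria")
```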
4 The Software Workflow
Oli2go consists of a concatenation of several tools and scripts implementing essential design steps for multiplex oligonucleotide design, managed by one main Python workflow script. The following sections describe each workflow step and its implemented computational methods. The workflow starts after a user inputs at least two different DNA sequences in FASTA format and all necessary design parameters (see Table 2). A separate status page informs the user about the currently processing workflow step, if any failure
Computational Methods of Oli2go
Table 1 NCBI database sources used for the probe specificity check
Source                       Number of sequences    NCBI directory
Bacteria                     7,658,345              GenBank, genomes
Environmental samples        7,276,975              GenBank
Invertebrates                27,651,271             GenBank, genomes
Patented sequences           31,140,928             GenBank
Plants                       3,798,824              GenBank
Viruses                      1,837,439              GenBank, genomes
Archaea                      38,310                 Genomes
Fungi                        3,889,143              Genomes
Protozoa                     3,880,518              Genomes
WGS project sequences        14,220,046             WGS
Total amount of sequences    101,391,799
occurred during the job submission, and presents the results after the design has completed successfully. If another job is already running on the server, each subsequent job is queued, because oli2go processes only one job at a time. This gives each job the full computational power of the server and prevents server overloading. The following sections describe the implementation of each workflow step. Figure 1 shows a schematic illustration of the software workflow.
4.1 Data Preparation
The data preparation consists of two steps:
1. Gathering of the user-defined input parameters and input sequences from the web interface (see Table 2).
2. Aligning the input sequences to the user-defined background databases for the specificity check using the blastn command of BLAST 2.7.0+ with 90% sequence identity and tabular output (format 6) returning the query ID, subject ID, alignment start and end positions, and alignment strand (‘blastn -perc_identity 90 -outfmt "6 qacc sacc sstart send sstrand" -num_threads 10 -max_target_seqs 1000000’). Each BLAST run uses ten threads. The hit IDs and alignment locations of the input sequences within the hit sequences are then saved to one text file per database and target sequence. These files form the basis for the subsequent probe specificity check.
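The background search and the handling of its tabular output can be sketched as follows. The blastn options are those quoted in step 2; everything else is an assumption made for illustration, including the function names, the -query and -db arguments (which the text does not show), the in-memory dictionary, and the made-up sample hits.

```python
from collections import defaultdict


def blastn_command(query_fasta: str, database: str) -> list:
    """Argument list mirroring the background search quoted in the text."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-perc_identity", "90",
        "-outfmt", "6 qacc sacc sstart send sstrand",
        "-num_threads", "10",
        "-max_target_seqs", "1000000",
    ]


def parse_hits(tabular_text: str) -> dict:
    """Group hits by query: {query_id: [(subject_id, start, end, strand), ...]}."""
    hits = defaultdict(list)
    for line in tabular_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        qacc, sacc, sstart, send, sstrand = line.split("\t")
        hits[qacc].append((sacc, int(sstart), int(send), sstrand))
    return dict(hits)


# Made-up example output with two hits for one input sequence.
sample = "target1\tsubject_A\t100\t250\tplus\ntarget1\tsubject_B\t900\t750\tminus\n"
hits = parse_hits(sample)
```

Keeping the hits keyed by input sequence makes the later comparison with the probe hits a simple lookup; oli2go stores the same information as one text file per database and target.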
Table 2 User-defined parameters of oli2go
Parameters                               Description
Application                              Conventional hybridization probe design, ligation assay probe design, primer design (y/n)
Oligonucleotide length*                  Minimum and maximum allowed probe/primer length
Oligonucleotide Tm*                      Minimum and maximum allowed probe/primer Tm in °C
Monovalent cations*                      Millimolar concentration of monovalent salt cations
Divalent cations*                        Millimolar concentration of divalent salt cations
dNTPs*                                   Millimolar concentration of the sum of all deoxyribonucleotide triphosphates (dNTPs)
Oligonucleotide DNA concentration*       Nanomolar concentration of each annealing oligonucleotide during the reaction
Maximum number of IUPAC codes*           Maximum allowed number of ambiguous bases within the oligonucleotides
Product size                             Minimum and maximum allowed product size
Minimum product size difference          Minimum allowed difference in size between the final amplicons
Databases for probe specificity check    Organisms, sample types of the input sequences
Hairpin Tm threshold*                    Maximum allowed Tm of hairpin structures
Hairpin ΔG threshold*                    Minimum allowed ΔG value of hairpin structures
Cross dimer Tm threshold                 Maximum allowed Tm of primer dimers
Cross dimer ΔG threshold                 Minimum allowed ΔG value of primer dimers
Parameters with an asterisk (*) need to be defined separately for primers and probes
4.2 Oligonucleotide Selection
Users can choose between hybridization and ligation probe design. Furthermore, the primer design can be skipped, for example, if users aim to design probes for a microarray only. Depending on the user selection, a Python script that selects oligonucleotides based on the user-defined input parameters is called separately for primers and probes, in parallel. The script processes the following steps for each sequence:
1. The sequence is incrementally sliced into k-mers using a sliding window. At the beginning, the window size matches the minimum defined primer or probe length and is increased by 1 until the maximum defined length is reached. Sequence fragments for each window size are saved to an array. After the maximum length is reached, the window moves one position forward, the size is reset to the minimum primer or probe length, and the fragmentation starts again (Fig. 2).
Fig. 1 oli2go software. Schematic illustration of the oli2go software workflow performed by a main Python script
Fig. 2 Sliding window for the selection of k-mers from the target sequence. After each possible k-mer is chosen within the user-defined oligonucleotide length range (17–20 bp in this example), the window (blue rectangle) moves one position forward
2. Each fragment is checked against the user-defined maximum number of ambiguous bases; fragments that exceed it are discarded.
3. The melting temperature (Tm) is calculated for each remaining oligonucleotide candidate using the table of thermodynamic values, the salt correction method, and the Tm calculation formula suggested by SantaLucia et al. [5]. The calculation is performed by a separate Python script.
4. Hairpin structure and self-dimer checks are performed using Primer3’s nucleotide thermodynamic alignment tool ntthal and Python’s subprocess library [6]. Primer and probe candidates are accepted if they do not form any secondary structures or if the structure’s Tm and ΔG values stay within the user-defined thresholds (see Table 2).
5. The resulting oligonucleotides are saved in one FASTA file per input sequence. The IDs of the sequences contain the target name, the start and end positions within the sequences, as well as the strand information.
6. If primer design is included, an additional Python script checks whether primers and probes resulting from the previous design steps can form primer and probe packages. This script performs a short test to check if at least five primer-probe combinations per target are present within the defined product size range without forming primer dimers.
4.3 Ligation Probe Design
Ligation assay probes such as padlock probes consist of two neighboring oligonucleotide sequences [7]. The creation of these probes is based on the previously described probe selection process. However, no gaps are allowed between the two probes. For this, neighboring oligonucleotides with a minimum difference in ΔG and Tm values are selected based on the formulas published by SantaLucia et al. [5]. This workflow step is performed in parallel for all targets.
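The window enumeration of Section 4.2 and the gap-free pairing described above can be sketched together as follows. This is a minimal illustration under stated assumptions, not oli2go's code: the function names and the short target are hypothetical, coordinates are 0-based with exclusive ends, and the ranking of pairs by their difference in ΔG and Tm is omitted.

```python
def enumerate_kmers(sequence: str, min_len: int, max_len: int):
    """Yield (start, end, kmer): at each start, the window grows from min_len to max_len."""
    for start in range(len(sequence)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(sequence):
                break  # window would run past the end of the target
            yield start, end, sequence[start:end]


def adjacent_pairs(candidates):
    """Pair candidates so that the second starts exactly where the first ends (no gap)."""
    by_start = {}
    for cand in candidates:
        by_start.setdefault(cand[0], []).append(cand)
    return [
        (left, right)
        for left in candidates
        for right in by_start.get(left[1], [])
    ]


target = "ACGTACGTAC"  # hypothetical 10-nt target, far shorter than a real one
kmers = list(enumerate_kmers(target, min_len=4, max_len=5))
pairs = adjacent_pairs(kmers)
```

Indexing the candidates by start position keeps the pairing step close to linear in the number of candidates, which matters because the sliding window produces several fragments per position.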
4.4 Data Reduction Steps
Oli2go’s specificity check is the most computationally intensive task. Therefore, we implemented data reduction steps to decrease the load on the server and improve performance. The data reduction is divided into two steps:
1. A general data reduction of primers and/or probes is applied regardless of how many targets are involved in the multiplex assay. If the number of oligonucleotide candidates generated and filtered in the previous steps exceeds 80 per target, the oligonucleotides are sorted by their start position in the respective target sequence and thinned out so that the start coordinates of the remaining oligonucleotides differ by at least five base pairs.
2. If the number of probes per target is still above 80, another data reduction step is applied. Here, only probes whose start positions are separated by a minimum distance are saved: 15 base pairs if the number of targets is below 11, or 50 base pairs if the number of targets is above ten.
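One way to realize this thinning is a greedy pass over the candidates sorted by start position. The text does not state which candidates are retained, so the greedy rule below is an assumption, as are the function names and the placeholder candidates; the limit of 80 and the spacings of 5, 15, and 50 base pairs are taken from the two steps above.

```python
def thin_by_start(candidates, min_gap: int, max_per_target: int = 80):
    """Keep candidates whose start positions are at least min_gap apart.

    candidates is a list of (start, sequence) tuples for one target.
    Thinning is applied only when the list exceeds max_per_target.
    """
    if len(candidates) <= max_per_target:
        return list(candidates)
    kept = []
    for cand in sorted(candidates, key=lambda c: c[0]):
        if not kept or cand[0] - kept[-1][0] >= min_gap:
            kept.append(cand)
    return kept


def second_stage_gap(n_targets: int) -> int:
    """Spacing for the second step: 15 bp for up to ten targets, 50 bp above ten."""
    return 15 if n_targets <= 10 else 50


# Placeholder input: one candidate at every start position of a 1000-nt target.
candidates = [(i, "N" * 20) for i in range(1000)]
first_pass = thin_by_start(candidates, min_gap=5)
second_pass = thin_by_start(first_pass, min_gap=second_stage_gap(4))
```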
4.5 Probe Specificity Check
The probe specificity check is applied to hybridization and ligation probes (both parts together) using the database hits of the input sequences (generated during the data preparation step), the probe candidates resulting from the data reduction, and the databases selected by the user.
1. The hit IDs and location parameters of the input sequences’ BLAST results are loaded and saved in the form of a Python dictionary.
Fig. 3 Example of IUPAC decoding. The example oligonucleotide on the left includes three ambiguous bases (or IUPAC codes): R (= A, G), Y (= C, T), B (= C, G, T). The decoding step leads to 12 possible oligonucleotides on the right side of the figure
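The decoding illustrated in Fig. 3 can be reproduced with a short sketch. The function name `decode_iupac` is hypothetical and the code is not taken from Oli2go; the lookup table follows the standard IUPAC nucleotide codes.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def decode_iupac(oligo):
    """Expand an oligonucleotide with ambiguous bases into all explicit variants."""
    choices = [IUPAC[base] for base in oligo.upper()]
    return ["".join(combo) for combo in product(*choices)]
```

An oligonucleotide containing one R, one Y, and one B expands into 2 × 2 × 3 = 12 variants, matching the count given in the figure caption.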
2. The probe candidates are scanned for IUPAC codes. These codes are decoded into the set of bases represented by the IUPAC code. Finally, each probe with ambiguous bases is subdivided into a set of probes, each representing a possible combination of bases (see example oligonucleotide in Fig. 3).
3. All decoded probe candidates of all targets are blasted against the user-defined databases. These BLAST searches run in parallel for each database using the multiprocessing package of Python. The specificity check is accomplished using ungapped blastn with 100% sequence identity and 100% query coverage (‘blastn -outfmt "6 qacc sacc sstart send" -evalue 60000 -ungapped -perc_identity 100 -max_target_seqs 100000 -word_size 7 -qcov_hsp_perc 100’). The e-value is 6000 times higher than the default value used by standard BLAST, to apply a more sensitive search [8]. This BLAST search also applies multithreading, and the number of threads depends on the size of the database. For example, the biggest database, the bacterial genomes, uses 20 threads for one BLAST run. The smallest databases, like the viral database, need only one thread per BLAST run.
4. The tab-delimited BLAST results are then divided into smaller files by the target sequence of the oligonucleotides and the BLAST database. This results in one text file per probe target and database (e.g., ten target sequences and two query databases would result in 20 result files). This file division leads to a reduction in I/O memory load during the following step.
5. The content of the divided result files is compared to the content of the Python dictionary created from the input sequences’ BLAST results in the first step of the specificity
check. This comparison runs in parallel for all files. First, the hit IDs are compared. If the probe hit list contains at least one hit ID that does not exist in the input sequence’s dictionary, the probe is dismissed. The second control checks whether the probe is located in the same region as the input sequence. If not, the probe is excluded from further processing. The specificity check is successful if the comparison identifies at least one probe per target.
4.6 Primer Package Creation
A Python script defines forward and reverse primers for the specific probes selected in the previous workflow step. This script utilizes the primers’ salt and DNA concentrations, IUPAC settings, the product size range, and primer dimer thresholds. This script is called consecutively for each target. Regardless of whether ambiguous bases are allowed or not, the workflow aims to find primer pairs with the fewest ambiguous bases. Only if no forward and reverse primers can be found for the specific probes is the number of allowed ambiguous bases increased, until the user-defined threshold is reached. The primer package creation works as follows:
1. All primer candidates within the currently allowed IUPAC code range are fetched from the reduced primer set generated in the previous steps.
2. In a nested for loop, forward primer candidates are matched with possible reverse primer candidates by comparing their start and end values within the target sequence. This comparison checks whether the current primer pair is located within the defined product size range.
3. If a primer pair matches the defined product size range, it is further checked for primer dimer formation by applying ntthal with the user-defined salt concentrations in the ANY mode. The Tm and ΔG values of the resulting alignment need to be below the primer dimer thresholds defined by the user. If the sequences include ambiguous bases, each possible decoded variant of the oligonucleotides is checked for primer dimer formation.
4. The ΔG values of both primers are calculated and compared. The difference should not be higher than 5 kcal/mol. This comparison is also applied to all possible variants of decoded ambiguous sequences.
5. A maximum number of three reverse primers is saved per forward primer.
6. One FASTA file is created per primer pair and the corresponding probe.
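The pairing logic of steps 2 to 5 can be sketched as a nested loop. This is an illustrative outline only: the tuple layout `(start, end, sequence)`, the product size definition (reverse primer end minus forward primer start plus one, in target coordinates), and the callables `forms_dimer` and `delta_g`, which stand in for the ntthal-based calculations, are all assumptions.

```python
def pair_primers(forward, reverse, min_size, max_size,
                 forms_dimer, delta_g, max_dg_diff=5.0, max_reverse=3):
    """Match forward and reverse primer candidates for one target.

    forward, reverse: lists of (start, end, sequence) tuples (assumed layout).
    forms_dimer(seq1, seq2) -> bool and delta_g(seq) -> float are placeholders
    for the thermodynamic calculations.
    Returns {forward_sequence: [reverse_sequence, ...]}.
    """
    packages = {}
    for f_start, f_end, f_seq in forward:
        partners = []
        for r_start, r_end, r_seq in reverse:
            # Step 2: the pair must span a product within the allowed size range.
            product_size = r_end - f_start + 1
            if not (min_size <= product_size <= max_size):
                continue
            # Step 3: discard pairs that form primer dimers.
            if forms_dimer(f_seq, r_seq):
                continue
            # Step 4: the dG values of both primers must differ by at most 5 kcal/mol.
            if abs(delta_g(f_seq) - delta_g(r_seq)) > max_dg_diff:
                continue
            partners.append(r_seq)
            # Step 5: keep at most three reverse primers per forward primer.
            if len(partners) == max_reverse:
                break
        if partners:
            packages[f_seq] = partners
    return packages
```

Passing the thermodynamic checks in as functions keeps the sketch independent of any particular way of calling ntthal.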
4.7 Primer Specificity Check
The primer specificity check is applied to reduce the risk of primers binding to human DNA. This feature is especially useful for applications where different organisms should be detected in human samples or in samples contaminated with human DNA. For this, primer candidates are aligned to the human reference genome, downloaded from the NCBI FTP server, using the Burrows-Wheeler Aligner (BWA) [9]. Depending on the number of specific probes, a defined number of primer pairs is selected for the alignment. This selection is based on a data preparation step performed by a Python script, applying the following two steps:
1. The number of probes from the previously created primer packages is collected.
2. The more different probes are available for a target, the lower the number of primers selected for the specificity check, to increase the performance of the workflow. For example, if a target has more than ten different specific probes, only two primer pairs per probe are selected for the specificity check. In contrast, if fewer than five specific probes could be designed for a gene, up to 20 different primer pairs are added to the primer specificity check.
BWA is called using the subprocess library in the respective Python script. The command aln is called with zero allowed gaps, no mismatches, and 30 threads (‘bwa aln -o 0 -N -t 30 -n 0’) to get the suffix array coordinates of the primers within the indexed human genome. Afterwards, samse is called to generate the alignments in SAM format with a maximum number of 1000 alignments (‘bwa samse -n 1000’). The output is parsed by the Python script and only primers with no alignment hits are selected for further processing.
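The final parsing step can be sketched as follows, assuming the SAM output is available as text and that each read name identifies one primer. In the SAM format, bit 0x4 of the FLAG field marks an unmapped read; the function name is hypothetical and the sketch is not the Oli2go parser.

```python
def primers_without_hits(sam_text):
    """Return the names of primers that did not align (SAM FLAG bit 0x4 set)."""
    unmapped = []
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip empty lines and SAM header lines
        fields = line.split("\t")
        qname, flag = fields[0], int(fields[1])
        if flag & 0x4:  # read is unmapped, i.e., no hit in the human genome
            unmapped.append(qname)
    return unmapped
```

Only the primers returned by this function would be passed on to the primer dimer check.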
4.8 Primer Dimer Check
The primer dimer check uses Primer3’s thermodynamic alignment tool ntthal to check the previously designed, specific primers for cross dimerization. The user-defined salt concentrations, minimum product size distance, and Tm and ΔG thresholds are used for the calculations. The primer dimer check is divided into two Python scripts. The first script involves the following steps and aims to find at least one primer pair for each target that forms no primer dimers with the target having the smallest number of primer pairs:
1. The target with the smallest number of specific primers is selected.
2. A for loop iterates through all primer pairs of the selected target sequence and performs the secondary structure calculations. These calculations run in parallel for each other target folder, where each folder contains at least one primer pair.
3. The forward and reverse primers of the selected target are aligned to the forward and reverse primers of the other targets applying ntthal with the user-defined salt concentrations in the ANY mode. The Tm and ΔG values of the resulting alignments need to be below the primer dimer thresholds defined by the user. If the sequences include ambiguous bases, each possible decoded variant of the oligonucleotide is checked for primer dimer formation.
4. The ΔG values of both primers are calculated and compared. The difference should not be higher than 5 kcal/mol. This comparison is also applied to all possible variants of decoded ambiguous sequences.
5. The primer pairs of the selected target sequence, as well as the ones which form no primer dimers with them, are written to a text file.
6. Each result file that contains at least one primer pair for each target is forwarded to the second primer dimer Python script.
The second Python script uses the resulting text files from the first cross dimer check to perform an all-vs-all primer dimer check. It performs the following steps:
1. The list of primer pairs resulting from the first primer dimer check is sorted by the number of pairs per gene.
2. A for loop iterates through all target sequences, first selecting the one with the smallest number of primer pairs resulting from the first cross dimer check.
3. The forward and reverse primers of the selected sequence are aligned to the forward and reverse primers of all other targets in parallel. This cross dimer check also uses ntthal in the ANY mode together with the user-defined primer salt concentrations.
4. The same primer dimer, product size, and ΔG calculation procedures as in the first cross dimer check are applied.
5. Primer pairs with no interaction are saved in one Python dictionary per target gene primer pair (e.g., if the target gene with the smallest number of primer pairs has two pairs, two Python dictionaries result from the first round).
6. The resulting dictionaries are checked to see whether they contain at least one primer pair for each target. If not, the primer dimer check stops here.
7. If one of the resulting dictionaries contains at least one primer pair per target gene, the workflow starts again at step two with the next gene in the list. Here, the newly generated dictionaries are used as input.
Fig. 4 Schematic illustration of the primer dimer check workflow. This example shows a primer dimer check for a four-plex assay (Target 1, Target 2, Target 3, Target 4). The first script starts with Target 1, as it has the smallest number of primer pairs (pp). Its primer pair T1_pp1 is aligned to the primer pairs of all other targets in parallel (marked with black arrows). Primer dimers are marked with a red cross. The result of the first script on top of the second column (violet box) contains T1_pp1 and all primer pairs that showed no interaction with this primer pair. This list is the input for the second script and gets sorted by the number of primer pairs per target. The script starts with the alignment of T2_pp1 with all other primer pairs in the list. In this example, T2_pp1 shows no interaction with other primers. Therefore, the next list named T2_pp1 (second box in the middle column) contains the same primer pairs as the previous list. The next primer in the list, T3_pp2, is aligned to the remaining primer pairs of Target 4. Here, T3_pp2 forms a primer dimer with T4_pp4, but not with T4_pp3. Therefore, the primer dimer check results successfully with one primer pair per target (see result of script 2 on the right side of the figure)
8. The workflow succeeds if the result dictionary contains at least one primer pair per target sequence at the end of the for loop.
Figure 4 shows a schematic illustration of the primer dimer check workflow on an example four-plex assay.
4.9 Output
The resulting primers and probes are presented on the status page in the form of a table. The table contains the oligonucleotides’ sequences, their lengths, product sizes, Tm values, hairpin Tm values, and hairpin ΔG values. The last column of the result table consists of one button per row providing online testing options for each oligonucleotide. The buttons within probe rows lead to the online BLAST service where the user can align the corresponding probe to a BLAST database or any other user-defined sequence. Forward and reverse primers can be checked with the web primer check tool Primer-BLAST by pressing the button in either row [8]. The result table can be downloaded in CSV format. Additionally, the sequences are available as a FASTA file.
Acknowledgments
This work was supported by the European Union’s Horizon 2020 research and innovation program [634137]. Funding for open access charge: H2020 [634137].
References
1. Dieffenbach CW, Lowe TM, Dveksler GS (1993) General concepts for PCR primer design. PCR Methods Appl 3:30–37
2. Hendling M, Barišić I (2019) In-silico design of DNA oligonucleotides: challenges and approaches. Comput Struct Biotechnol J 17:1056–1065
3. Hendling M, Pabinger S, Peters K et al (2018) Oli2go: an automated multiplex oligonucleotide design tool. Nucleic Acids Res 46:W252–W256
4. Camacho C, Coulouris G, Avagyan V et al (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421
5. SantaLucia J (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci U S A 95:1460–1465
6. Untergasser A, Cutcutache I, Koressaar T et al (2012) Primer3-new capabilities and interfaces. Nucleic Acids Res 40:e115
7. Barišić I, Schoenthaler S, Ke R et al (2013) Multiplex detection of antibiotic resistance genes using padlock probes. Diagn Microbiol Infect Dis 77:118–125
8. Ye J, Coulouris G, Zaretskaya I et al (2012) Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC Bioinf 13:134
9. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:1754–1760
Chapter 3
Computational Methods for the Design of Recombinase Logic Circuits
Sarah Guiziou and Jerome Bonnet
Abstract
Synthetic biology aims at engineering new biological systems and functions that can be used to provide new technological solutions to worldwide challenges. Detection and processing of multiple signals are crucial for many synthetic biology applications. A variety of logic circuits operating in living cells have been implemented. One particular class of logic circuits uses site-specific recombinases mediating specific DNA inversion or excision. Recombinase logic offers many interesting features, including single-layer architectures, memory, low metabolic footprint, and portability in many species. Here, we present two automated design strategies for recombinase-based logic circuits, one based on the distribution of computation within a multicellular consortium and the other being a single-cell design. The two design strategies are complementary and are both accessible to non-experts, as a design web-interface exists for each strategy: CALIN for the multicellular design strategy and RECOMBINATOR for the single-cell design strategy. In this book chapter, we guide the reader step by step through recombinase-logic circuit design, from selecting the design strategy that fits the final system of interest to obtaining the final design using one of our design web-interfaces.
Key words Synthetic biology, Automatized design, Logic circuit, Recombinase
1 Introduction
Synthetic biology consists of engineering new biological systems with the aim of (a) further understanding biology and (b) providing new technological solutions to worldwide challenges such as climate change and healthcare. Examples of such engineered biological systems include building synthetic metabolic pathways in yeast to produce drugs in a more affordable manner [1, 2], cellular therapies for in vivo detection and destruction of cancer cells [3], and functional living biomaterials providing new solutions to healthcare challenges [4, 5]. In nature, cells adapt to their environment by sensing and processing myriad signals and performing actions accordingly. Similarly, synthetic biological systems rely on the detection and
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_3, © Springer Science+Business Media, LLC, part of Springer Nature 2021
integration of multiple endogenous or exogenous signals for multiplexed biosensing [6], bioproduction of complex chemical compounds [7, 8], or production of biopolymers that can respond to changes in their environment [5]. Synthetic biologists have mimicked electronic circuits to implement cellular devices built from biological molecules that can process multiple signals (Fig. 1a). In that context, the main approach treats molecular or physical signals as binary inputs (as in electronics), and cellular processing devices are assimilated to logic circuits. While we focus in this chapter on digital logic circuits, analog circuits have also been implemented in living organisms [9]. To implement logic circuits, numerous molecular mechanisms have been used, such as transcription regulators [10–12], RNA molecules [13, 14], and site-specific recombinases [15–17]. Here, we focus on the implementation of logic circuits using recombinases, more specifically serine integrases [18]. Serine integrases are a tool of choice for large logic circuit implementation, as numerous orthogonal serine integrases have been characterized [19] and have already been implemented in numerous organisms such as bacteria, plants, or mice [20]. Serine integrases recognize two DNA sites and recombine DNA between these two sites depending on their relative orientations, leading to an inversion of the DNA if the sites are in opposite orientation and to an excision if the sites are in the same orientation (Fig. 1b). In recombinase logic circuits, each input induces the expression of a recombinase, while the circuit output is the expression of a reporter or the production of a compound of interest. To implement logic functions using recombinases, promoters, terminators, and output genes are combined in a specific manner with integrase sites to condition the expression of the output gene to a particular combination of inputs (Fig. 1c) [15].
Recombinase logic circuits with up to six inputs have been implemented [17], and various design strategies have been used [15, 17, 19, 21]. Integrase-mediated recombination is irreversible in the absence of cofactor, and recombinase logic devices exhibit permanent memory. Consequently, inputs are considered ON if they have been present at any time of the circuit history, and recombinase logic devices thus implement asynchronous logic. This type of logic circuit is of interest for biosensing in which delayed readout might sometimes be necessary [22]. The design of recombinase logic circuits is challenging, as it does not follow electronic logic standards. Interestingly, complex logic functions can be implemented within a single layer; for example, an XOR logic gate can be built using a terminator surrounded by two pairs of integrase sites in inversion orientation [15]. While circuits can be designed by hand for a small number of inputs, the task becomes daunting as the number of inputs and possible part combinations increases. Thus, accessible software tools for the design of recombinase-based logic circuits are needed. Similar
Fig. 1 Recombinase-based logic circuits. (a) Biological logic system. Multiple signals (either environmental or endogenous) are detected by a cell. Each analog signal is converted into a binary signal. In this example, signals B and C are considered 1, as they are above a defined threshold, and signal A is considered 0. Then, a logic circuit (here implementing the logic function (A OR B) AND NOT C) processes these signals and produces a specific output. Biological logic systems are used for engineering biomaterials and biosensors and for controlling protein and metabolite production. (b) Recombinase switch. Expression of a serine integrase is controlled by the input signal. The integrase recognizes two integrase sites: attB and attP. If the sites are in opposite orientation (left side), the DNA between the sites (here the promoter) is inverted, which leads to two new sites: attL and attR. The integrase alone cannot mediate recombination between attL and attR sites. If the sites are in the same orientation (right side), the DNA between the sites is excised, leaving a single integrase site, either attL or attR. (c) Example of a recombinase AND gate [15]. The AND logic device is composed of one promoter, two asymmetric terminators surrounded by integrase sites in inversion orientation, and a gene. In the absence of input, the output gene cannot be expressed, as the RNA polymerase is blocked by the two terminators. In the presence of input 1 or input 2, integrase 1 (turquoise) or integrase 2 (orange) is expressed and the terminator surrounded by the corresponding sites is inverted. The output gene is still not expressed, as one asymmetric terminator is still blocking transcription. Both inputs need to have been present for both terminators to be inverted and the output gene to be expressed, thereby implementing an AND gate
efforts have been made for repressor-based logic circuits (CELLO) [10]. We developed two computational methods for designing recombinase-logic circuits. Each of them provides a different approach to systematize recombinase circuit design. The first design strategy, called CALIN (Composable Asynchronous Logic Integrase Networks), allows the implementation of logic circuits by distributing the computational labor across a multicellular consortium, using a limited number of standardized logic devices that can be mixed and matched [21]. The second design strategy, called RECOMBINATOR, uses a database of devices generated in a combinatorial manner within which architectures implementing a particular function can be found. The RECOMBINATOR strategy aims at implementing logic within a compact and single-layer
Table 1 Type of design required according to the application type

Fields | Applications | Time of usage | Physical confinement | # strains | Type of design
--- | --- | --- | --- | --- | ---
Biosensing | In vitro medical diagnostic; environmental diagnostic | Short use, one shot | Confined | Unlimited | Multicell
 | On-site environmental diagnostic | Long term | Free | Medium | Single or multicell depending on input number
Therapy | Therapeutic bacteria | Long term | Free | Low, better = 1 | Single cell
 | Environmental bioremediation | Long term | Free | Reduce to 1–3 strains | Single cell
Metabolic engineering | Production by fermentation | Medium term | Medium confinement | Medium | Single or multicell depending on input number
device operating in single cells, here using an ad hoc design for each case [23]. The two design strategies, with their automated computational design methods, are complementary: they have different properties that can be advantageous depending on the context of implementation. Therefore, choosing between one or the other will depend on the particular specifications determined by the user (see Table 1 for a few examples). The objective of this book chapter is to provide guidelines on how to design recombinase-based logic circuits using multicellular or single-cell designs, following the CALIN or RECOMBINATOR strategies, respectively (Fig. 2). First, we describe how to choose which design strategy to use according to the device specification; then we explain how to define and write down the logic function to implement. Finally, we show how to use the two web-interfaces to obtain the final logic design.
2 Definition of the System Specification According to the Application
Depending on the application, the user, and the complexity of the logic function, one design should be preferred over the other. The multicellular approach allows for a systematic and modular design using a reduced number of already characterized biological components (Fig. 2). However, this approach can lead to cellular consortia composed of a high number of different cells, which raises stability issues. Additionally, to maintain a consortium composed of different strains,
Fig. 2 Comparison of the two logic-circuit design strategies. In recombinase logic-circuit design, modularity is usually inversely proportional to compactness. The multicellular design strategy leads to highly modular circuits that are composable, but these circuits have low compactness as they require the assembly of a multicellular consortium. The single-cell design strategy leads to compact designs that can be implemented in a single cell but have low modularity, as each device can be used in a very restricted set of situations; these designs are also more challenging to engineer, as they do not follow any design rules
a confined environment is required. The single-cell approach enables a compact design and avoids competition problems between strains, but leads to more ad hoc designs that require more expertise and heavier engineering work to obtain devices operating as expected (Fig. 2). Table 1 lists some logic-circuit applications with their specifications and the favorable type of logic-circuit design to use. For users without much synthetic biology expertise, a multicellular design is favorable, as the optimization process is straightforward and existing engineered devices can be used. Similarly, for users requiring the implementation of numerous logic circuits, the composable multicellular design is more advantageous, as the majority of functions can be implemented by mixing and matching the existing logic devices. Of note, depending on the logic function of interest, the multicellular design computational method can lead to a single-cell system. The two computational methods can therefore be run in parallel and the resulting logic circuits compared to choose the final design.
3 Definition of the Logic Function
To use the CALIN or RECOMBINATOR interfaces (for multicellular or single-cell designs, respectively), the user first needs to determine the logic program to implement. Logic programs can then be written as a Boolean equation (e.g., f(A,B) = A.B) encoding the output state. Since the establishment of logic by Aristotle and of Boolean algebra by George Boole, various terms and notations have been used to discuss logic reasoning and to write down Boolean equations (Table 2). As an example, to express a gene only if signal A is present and signal B is not, the Boolean function f(A,B) = A AND NOT(B) (also written as A.!B) has to be implemented.
Table 2 Conversion from truth function to Boolean function

Classical language | Mathematical language | Truth function | Boolean function
--- | --- | --- | ---
True | True | | 1
False | False | | 0
OR | Disjunction | ∨ | +
AND | Conjunction | ∧ | ·
NOT | Negation | ¬ | ¯ or !
While there might sometimes be debate about which notation is “correct,” the choice of notation mostly reflects the habits and usages of particular scientific communities (e.g., mathematics, computer science). Therefore, for a given application, the only important guideline is to choose one notation, be consistent, and not mix the different notations together. In the RECOMBINATOR design interface, the Boolean function has to be written using + for OR, . for AND, and ! for negation. In the CALIN web-interface, the logic function has to be written down as a truth table specifying the output state (either 0 for OFF or 1 for ON) for each input state. Of note, in recombinase-based devices, inputs are decoupled from logic implementation. Indeed, the identity of an input is defined by the conditional expression of an integrase, e.g., by connecting an inducible promoter that specifically responds to a signal of interest to an integrase. Therefore, by using a single logic device and switching the connection between inputs and integrases, various logic functions can be implemented in a very straightforward manner and without further optimization. Logic functions implementable using the same logic device are equivalent when inputs are permuted, and therefore belong to the same P-class (where P stands for permutation) (Fig. 3). As an example, the function A.not(B) is P-equivalent to the function not(A).B. We widely use this property in the CALIN and RECOMBINATOR web-interfaces to reduce the number of logic devices required or generated.
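The relation between the two notations and the notion of P-equivalence can be made concrete with a small sketch. It converts a Boolean function written with +, ., and ! into a truth table and tests whether two functions are equal up to a permutation of their inputs. The function names are hypothetical, the parser is deliberately naive (it simply maps the three operators onto Python's `or`, `and`, and `not`), and none of this is taken from the CALIN or RECOMBINATOR code.

```python
from itertools import permutations, product


def truth_table(expr, names):
    """Evaluate a Boolean expression written with + (OR), . (AND), ! (NOT).

    names: tuple of input names; rows follow binary counting order over names.
    Returns a tuple of 0/1 output values, one per input state.
    """
    py = expr.replace("!", " not ").replace(".", " and ").replace("+", " or ")
    code = compile(py.strip(), "<boolean>", "eval")
    rows = []
    for values in product([False, True], repeat=len(names)):
        env = dict(zip(names, values))
        rows.append(int(bool(eval(code, {"__builtins__": {}}, env))))
    return tuple(rows)


def p_equivalent(expr1, expr2, names):
    """True if expr1 equals expr2 after some permutation of the inputs."""
    target = truth_table(expr2, names)
    return any(truth_table(expr1, perm) == target
               for perm in permutations(names))
```

For instance, `truth_table("A.!B", ("A", "B"))` lists the output for the input states 00, 01, 10, and 11, and `p_equivalent("A.!B", "!A.B", ("A", "B"))` checks the P-equivalence example given above.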
4 Using CALIN or RECOMBINATOR to Obtain the Theoretical Biological Design
4.1 The CALIN Web-Interface: Multicellular Recombinase Logic
The CALIN web-interface allows for the systematic design of logic circuits operating as a multicellular system (Fig. 4a). Each logic function is decomposed as a sum of products of NOT or IMPLY functions, called subfunctions; an IMPLY function corresponds to f(X) = X and a NOT function to f(X) = NOT(X). Each subfunction
Fig. 3 Implementation of two logic functions belonging to the same permutation class (P-class) using one logic device and permuting the connections between integrases and input
is implemented in a single cell using a combination of IMPLY elements in series and NOT elements in parallel. IMPLY elements are composed of a terminator surrounded by integrase sites and NOT elements of a promoter surrounded by integrase sites (Fig. 4b). After the user enters the number of inputs and the truth table corresponding to the logic function of interest, the web-interface generates the biological logic design: the number of strains required and the genetic circuit layout for each strain, i.e., the connection of integrase genes with inducible promoters corresponding to each input, plus the logic device (Fig. 4c). For each logic device, a DNA sequence corresponding to an optimized design for E. coli is also available. The web-interface itself is based on a Python script that converts a Boolean logic function into a genetic logic design in an automated manner. Here, we detail the algorithm of this Python script.
1. The first step is to decompose the input Boolean function into independent subfunctions. To do so, we write the logic function in its disjunctive normal form, corresponding to a sum of products of input variables or their negations, using the McCluskey algorithm [24]:
$$f(x_1, \ldots, x_N) = \sum_{j=1}^{M} \prod_{i=1}^{N} \phi_{i,j}(x_i)$$
N corresponds to the number of inputs and M to the number of terms in the disjunction. $\phi_{i,j}$ is either the IMPLY or the NOT function, such that $\phi_{i,j}(x_i)$ is equal to $x_i$ or to not($x_i$).
Fig. 4 CALIN automated design strategy. (a) Workflow of logic circuit design. The input of the CALIN web-interface is a truth table corresponding to the logic function of interest. The function is decomposed as a sum of products of IMPLY (such as f(X) = X) and NOT (such as f(X) = NOT(X)) functions, here: f = f1 + f2 with f1 = NOT(A).NOT(B).C and f2 = A.B.C. Each subfunction is implemented in a single cell, and the combination of the f1 and f2 cells allows the implementation of the full logic function in a multicellular logic system. (b) Implementation of IMPLY and NOT functions using recombinase-based excision elements. IMPLY functions are implemented by surrounding a terminator placed between a promoter and the output gene by integrase sites in excision orientation. In the absence of input, the terminator blocks the expression of the output gene. In the presence of the input, the integrase is expressed and the terminator is excised, leading to expression of the output gene. The IMPLY logic element therefore switches from state 0 to state 1. NOT functions are implemented by surrounding a promoter by integrase sites in excision orientation. The output gene is expressed in the absence of the input; in the presence of the input, the integrase mediates the excision of the promoter and the output gene is no longer expressed. The NOT logic element switches from state 1 to state 0 in the presence of the input. (c, d) Output of the CALIN web-interface: the logic device and integrase/inducible promoter cassette for each cell. The design of the logic devices computing the logic subfunctions is based on the composition of IMPLY and NOT logic elements. IMPLY logic elements are placed in series while NOT logic elements are placed in parallel. (c) The subfunction f1 (NOT(A).NOT(B).C) is composed of two NOT elements in parallel corresponding to the NOT(A).NOT(B) function (nested integrase sites in excision orientation surrounding the promoter) and an IMPLY element placed between the promoter and the gene corresponding to the C function. (d) The subfunction f2 is composed of three IMPLY elements in series
The McCluskey algorithm takes as input the ON and OFF output states, i.e., the input states that give a one or a zero as output, respectively. It returns an array of strings, each corresponding to a subfunction.

2. Each subfunction is translated into the corresponding logic and integrase devices using our Python algorithm. In this design, the number of logic devices is minimized. Indeed, functions belonging to the same P-class are implemented with the same logic device, and only the connections between inputs and integrases are swapped. The logic device of each subfunction is obtained by extracting the number of IMPLY and NOT functions in the subfunction and by following the design rules detailed briefly above
Design of Recombinase Logic Circuits
and described in detail in Guiziou et al. [21]. The integrase device is obtained by associating each integrase with an input so that the desired logic function is implemented.

3. The DNA sequence of the logic devices for E. coli is generated. The generated DNA sequence results from a hierarchical composition of optimized logic elements [16]. Various permutations of integrase sites have been characterized for each logic element corresponding to IMPLY and NOT functions with different integrases. Well-behaving IMPLY and NOT functions were selected and composed to obtain the 16 well-behaving logic devices permitting the implementation of all 4-input logic functions [16]. The same design strategy can be used to optimize logic devices for other organisms.

4.2 RECOMBINATOR: A Database of Single-Layer Recombinase Logic Device
RECOMBINATOR is a database of ~19 million devices allowing single-cell implementation of all 2- and 3-input logic functions and up to 92% of 4-input logic functions. This database was generated by permutation of recombinase sites, promoters, genes, and terminators (Fig. 5) [23]. A web-interface allows the database to be searched: http://recombinator.lirmm.fr. The user enters the logic function of interest either as a well-formed formula using the logic operators ".", "+", and "!" (AND, OR, and NOT, respectively), or as a binary number corresponding to the output state in each input state. Using the same example as earlier, to express a gene only if signal A is present and signal B is absent:
– the well-formed formula is A.!B,
– the binary number is 0010, but it can also be written as 0100, since the logic device design is agnostic to input identity, as explained previously.
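The correspondence between the two notations can be illustrated with a short sketch. It assumes that input states are enumerated in ascending binary order (00, 01, 10, 11 for two inputs), which agrees with the 0010 example above; the function name truth_string is hypothetical and is not part of the RECOMBINATOR software.

```python
from itertools import product


def truth_string(formula, inputs):
    """Evaluate a formula written with '.', '+', '!' over all input states.

    Assumption: states are listed in ascending binary order, with the first
    name in `inputs` as the most significant bit.
    """
    # Translate the operators into Python's boolean operators
    # (precedence: not > and > or, as in Boolean algebra).
    expr = formula.replace("!", " not ").replace(".", " and ").replace("+", " or ")
    bits = []
    for state in product((0, 1), repeat=len(inputs)):
        env = dict(zip(inputs, map(bool, state)))
        # eval is acceptable in a sketch; never use it on untrusted input.
        bits.append("1" if eval(expr, {"__builtins__": {}}, env) else "0")
    return "".join(bits)


print(truth_string("A.!B", ["A", "B"]))  # 0010
print(truth_string("A.!B", ["B", "A"]))  # 0100 (same device, inputs swapped)
```

Swapping the order of the inputs turns 0010 into 0100, which is why both binary numbers describe the same device.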
Fig. 5 RECOMBINATOR database and web-interface. The RECOMBINATOR database was generated by combination and permutation of integrase sites, promoters, terminators, and genes. ~19 million architectures were obtained, each associated with the logic function that it computes. The web-interface allows this database to be searched using a logic function as input, and provides as output a list of architectures with their specifications that can be sorted according to various biological constraints.
Fig. 6 Example of a list of architectures generated by the RECOMBINATOR web-interface for the logic function A AND NOT B (A.!B). The screenshot corresponds to the first 10 listed architectures without applying any constraint or sorting criterion. For better visualization, the table has been truncated on the right and shows only 7 of the 12 criteria.

Table 3 Symbols used in the RECOMBINATOR web-interface to represent each part in the different possible orientations

Part                                  Symbol
Promoter in forward orientation       蘥
Promoter in reverse orientation       蘦
Terminator in forward orientation     ⊤
Terminator in reverse orientation     ⊥
Gene in forward orientation           G
Gene in reverse orientation           蔓
Sites in excision orientation         []
Sites in inversion orientation        ()
After submitting the logic function of interest, a table is generated with various designs, all theoretically allowing the implementation of the input logic function, and their characteristics (Fig. 6). These designs are called architectures, and are abstract versions of the final biological devices. Indeed, in an architecture, the identity of parts is not defined, only the type of functions encoded by the part is defined. Each line corresponds to one architecture represented by symbols (see Table 3 for correspondence between parts and symbols). The characteristics of each architecture are specified in the table generated by the web-interface (Fig. 5). Each column corresponds
to a particular feature, such as the number of genes, promoters, terminators, asymmetric terminators, or parts, or whether the gene is positioned at the extreme segment of the device. Architectures can be sorted according to any of these criteria. They can also be filtered by applying constraints: maximum and/or minimum constraints for numbers and lengths, and on/off constraints for the Boolean criteria, which are cross promotion (promoters facing each other) and gene at the end. For more details on an architecture, the view button at the extreme right of each line leads to a new page with all the characteristics of that architecture and its recombination state for each input state (Fig. 7). From this page, logic functions belonging to the same P-class are also accessible with their corresponding architectures. Indeed, to go from the implementation of one logic function to that of another logic function in the same P-class, only the identity of the integrase sites has to be changed (i.e., in the RECOMBINATOR web-interface, the color of the integrase site pairs).
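As an illustration of what belonging to the same P-class means, the sketch below enumerates the truth tables reachable from a given function by permuting its inputs. It assumes that a P-class groups functions that differ only by a permutation of inputs and that truth tables are written in ascending binary order of the input states; the function name p_class is hypothetical.

```python
from itertools import permutations, product


def p_class(truth, n_inputs):
    """Return the truth strings obtained from `truth` by permuting the inputs."""
    states = list(product((0, 1), repeat=n_inputs))
    index = {state: i for i, state in enumerate(states)}
    members = set()
    for perm in permutations(range(n_inputs)):
        # The permuted function evaluated on state s equals the original
        # function evaluated on s with its inputs rearranged by perm.
        members.add(
            "".join(truth[index[tuple(s[p] for p in perm)]] for s in states)
        )
    return members


print(sorted(p_class("0010", 2)))  # ['0010', '0100']
print(sorted(p_class("0001", 2)))  # ['0001'] (AND is symmetric in its inputs)
```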
Fig. 7 Detailed description of the properties of one architecture and of its recombination intermediates in the RECOMBINATOR web-interface. Screenshot of the web page obtained from the view button of the first architecture in the architecture list for a.!b
A lot of information is provided to the user, and for most logic functions a large number of architectures are available for their implementation. Passing from an architecture to a biological implementation can be challenging and will require optimization. Choosing the simplest architecture, and the one best suited to the final chassis, will increase the probability of successful biological implementation.
5 Conclusion

We presented two design strategies to implement asynchronous Boolean logic programs in living organisms using serine integrases. These two design strategies are complementary: one is modular and scalable but requires a multicellular system, while the other is ad hoc, and can therefore be more complex to engineer, but operates in a single cell. For now, the RECOMBINATOR database allows the design of logic circuits with up to 4 inputs and the CALIN web-interface with up to 5 inputs. For CALIN, the Python software supports designs with a larger number of inputs and is available on GitHub (https://github.com/synthetic-biology-group-cbs-montpellier/calin); we limited the web-interface to 5 inputs to reduce lagging of the service. For RECOMBINATOR, the existing database is composed of architectures for up to 4 inputs. Of note, we have experimentally validated the CALIN framework, while the architectures provided by RECOMBINATOR are for now only theoretical. The large diversity and peculiarities of some of the designs will probably require the user to test several different architectures and optimize their behavior on a case-by-case basis. We hope that this book chapter will guide synthetic biologists, as well as scientists from other fields, in choosing the most suitable design strategy for their specific application, and will facilitate the design of their logic devices using our design web-interfaces.
Glossary

Automatic design    Theoretical design performed via software, sometimes through a web-interface.
Compact             A design in which the number of parts needed to perform a function is reduced to its minimum.
Complete            Capable of implementing all logic functions.
Portable            Implementable in various organisms.
Reusable            The parts developed can be used for the construction of other circuits.
Scalable            The design principles developed at a given scale (e.g. a certain number of inputs) are applicable to a larger scale (here for an increasing number of inputs).
References

1. Galanie S, Thodey K, Trenchard IJ et al (2015) Complete biosynthesis of opioids in yeast. Science 349:1095–1100
2. Paddon CJ, Westfall PJ, Pitera DJ et al (2013) High-level semi-synthetic production of the potent antimalarial artemisinin. Nature 496:528–532
3. Kalos M, June CH (2013) Adoptive T cell transfer for cancer immunotherapy in the era of synthetic biology. Immunity 39:49–60
4. Bryksin AV, Brown AC, Baksh MM et al (2014) Learning from nature - novel synthetic biology approaches for biomaterial design. Acta Biomater 10:1761–1769
5. Kalyoncu E, Ahan RE, Ozcelik CE, Seker UOS (2019) Genetic logic gates enable patterning of amyloid nanofibers. Adv Mater 31(39):e1902888
6. Chang H-J, Voyvodic PL, Zuniga A, Bonnet J (2017) Microbially derived biosensors for diagnosis, monitoring and epidemiology. Microb Biotechnol 10(5):1031–1035. https://doi.org/10.1111/1751-7915.12791
7. Kim SG, Noh MH, Lim HG et al (2018) Molecular parts and genetic circuits for metabolic engineering of microorganisms. FEMS Microbiol Lett 365. https://doi.org/10.1093/femsle/fny187
8. Pham HL, Wong A, Chua N et al (2017) Engineering a riboswitch-based genetic platform for the self-directed evolution of acid-tolerant phenotypes. Nat Commun 8:411
9. Sarpeshkar R (2014) Analog synthetic biology. Philos Trans A Math Phys Eng Sci 372:20130110
10. Nielsen AAK, Der BS, Shin J et al (2016) Genetic circuit design automation. Science 352(6281):aac7341. https://doi.org/10.1126/science.aac7341
11. Macia J, Manzoni R, Conde N et al (2016) Implementation of complex biological logic circuits using spatially distributed multicellular consortia. PLoS Comput Biol 12:e1004685
12. Gander MW, Vrana JD, Voje WE et al (2017) Digital logic circuits in yeast with CRISPR-dCas9 NOR gates. Nat Commun 8:15459
13. Green AA, Kim J, Ma D et al (2017) Complex cellular logic computation using ribocomputing devices. Nature 548(7665):117–121. https://doi.org/10.1038/nature23271
14. Win MN, Smolke CD (2007) A modular and extensible RNA-based gene-regulatory platform for engineering cellular function. Proc Natl Acad Sci U S A 104:14283–14288
15. Bonnet J, Yin P, Ortiz ME et al (2013) Amplifying genetic logic gates. Science 340:599–603
16. Guiziou S, Mayonove P, Bonnet J (2019) Hierarchical composition of reliable recombinase logic devices. Nat Commun 10:456
17. Weinberg BH, Pham NTH, Caraballo LD et al (2017) Large-scale design of robust genetic circuits with multiple inputs and outputs for mammalian cells. Nat Biotechnol 35:453–462
18. Merrick CA, Zhao J, Rosser SJ (2018) Serine integrases: advancing synthetic biology. ACS Synth Biol 7:299–310
19. Yang L, Nielsen AAK, Fernandez-Rodriguez J et al (2014) Permanent genetic memory with >1-byte capacity. Nat Methods 11:1261–1266
20. Fogg PCM, Colloms S, Rosser S et al (2014) New applications for phage integrases. J Mol Biol 426:2703–2716
21. Guiziou S, Ulliana F, Moreau V et al (2018) An automated design framework for multicellular recombinase logic. ACS Synth Biol 7:1406–1412
22. Courbet A, Endy D, Renard E et al (2015) Detection of pathological biomarkers in human clinical samples via amplifying genetic switches and logic gates. Sci Transl Med 7(289):289ra83
23. Guiziou S, Pérution-Kihli G, Ulliana F, Leclère M (2019) Exploring the design space of recombinase logic circuits. bioRxiv
24. Enderton H, Enderton HB (2001) A mathematical introduction to logic. Academic Press, Oxford
Chapter 4

Modular Modeling of Genetic Circuits in SBML Level 3

Mario Andrea Marchisio

Abstract

The Systems Biology Markup Language (SBML) Level 2 has been used extensively to build models of biological systems of different complexity. However, its lack of modularity was a serious hurdle for its application to Synthetic Biology, where genetic circuits are preferably modeled by putting together the models of their components. SBML Level 3 with the Hierarchical Model Composition package overcame this limit. Here, we describe how to realize a modular model for a eukaryotic AND gate in SBML Level 3. Circuit modules, such as transcription units and pools of molecules, are modeled separately and connected, to close the circuit, via Python scripts that use the libSBML API. Circuit simulations with COPASI confirm the validity of this modeling approach.

Key words SBML Level 3, Modules, Hierarchy, Composition
1 Introduction

The “Parts & Pools” framework represents a computational method for the modular design and modeling of both prokaryotic and eukaryotic genetic circuits [1, 2]. Parts are DNA sequences such as promoters, coding regions, and terminators. Pools either store molecules of species referred to as signal carriers (e.g., RNA polymerases and ribosomes) or host interactions that do not take place at the DNA or the mRNA (e.g., protein dimerization). Parts are put together to give rise to transcription units (TUs), the simplest genetic devices. TUs and pools are connected to close a circuit. As a piece of software, “Parts & Pools” is a collection of Python and Perl scripts that form an add-on of ProMoT [3]. A synthetic gene circuit is realized by, first, running the scripts corresponding to the parts and the pools present in the circuit. Then, parts and pools are displayed on the graphical user interface of ProMoT and wired together. Overall, a genetic circuit is designed in the same way as an electronic one. A bio-chemical model for the complete circuit arises from the composition of the models of the circuit
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_4, © Springer Science+Business Media, LLC, part of Springer Nature 2021
components and it is encoded in MDL (Model Definition Language). ProMoT offers an export function from MDL to SBML (Systems Biology Markup Language) [4] Level 2. Unlike MDL, SBML Level 2 does not support modularity, i.e., it does not allow a model to be generated from separate modules corresponding to the circuit components (unless the model files are first parsed and modified, as in Asmparts [5]). With SBML Level 3 [6], in contrast, a model can be constructed as a composition of multiple submodels by using the Hierarchical Model Composition (comp) package [7]. In this chapter we illustrate how to build a modular model of a genetic network that is encoded in SBML Level 3 and follows the main ideas on which “Parts & Pools” is based. The circuit we have chosen is a yeast logic AND gate based on both transcriptional and translational controls [8]. The SBML Level 3 model is obtained by writing a set of Python scripts that make use of the libSBML Python API (Application Programming Interface) [9].
2 Methods
2.1 The AND Gate Circuit
The AND gate we consider in this chapter is organized into four transcription units and thirteen pools: three containing mRNA sequences, two storing input chemicals (one inducer and one corepressor), two hosting transcription factors (repressors), one for siRNAs (small interfering RNAs), one associated with the circuit output, i.e., the green fluorescent protein (GFP), and the remaining four enclosing RNA polymerase, ribosome, argonaute, and dicer molecules (see Fig. 1). The circuit is supposed to be constructed in the yeast S. cerevisiae, where the RNAi (RNA interference) pathway is missing but can be restored by expressing the dicer and the argonaute proteins, as shown in [10]. Introns are generally removed from genes used in eukaryotic synthetic circuits. Hence, we can omit from our model the action of the spliceosome on the pre-mRNA. Unlike in the “Parts & Pools” approach, here we consider transcription units, and not standard biological parts, as the basic DNA components. This choice is made only to simplify the overall AND gate model. Numeric values for the kinetic parameters present in the model are taken from our previous work [11] or tuned ad hoc.
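Before turning to the kinetic model, the logic of the circuit can be checked with a purely Boolean abstraction of the regulatory steps described in the caption of Fig. 1. This is only a sketch at steady state with all-or-none signals; the function and variable names are hypothetical and do not appear in the SBML model.

```python
from itertools import product


def and_gate(A, B):
    """Boolean abstraction of the AND gate (1 = present, 0 = absent)."""
    RaA_on_promoter = not A   # inducer A inactivates the repressor RaA
    RaB_active = bool(B)      # corepressor B activates the repressor RB
    siRNA_B = not RaB_active  # active RaB blocks siRNA synthesis by TU_YES_B
    gfp_mRNA = not RaA_on_promoter      # TU_AND is transcribed if RaA is off
    gfp = gfp_mRNA and not siRNA_B      # siRNA (RISC) cleaves the GFP mRNA
    return int(gfp)


for A, B in product((0, 1), repeat=2):
    print(A, B, and_gate(A, B))
```

GFP is returned only for A = 1 and B = 1, as expected for an AND gate.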
2.2 The comp Package and the libSBML Python API
In an SBML Level 3 document that makes use of the comp package, an object of the class Model is used, mainly, as a container of Submodels (see Note 1). Submodels, in their turn, refer to either ModelDefinition or ExternalModelDefinition objects, where actual models are specified. Submodels represent modules that are
Modeling in SBML Level 3
Fig. 1 AND gate general scheme. An AND gate carries out a logic multiplication, i.e., it gives 1 only in the presence of all its inputs and 0 otherwise. Our AND gate receives two input chemicals (inducer A and corepressor B) and returns GFP as an output. Inducer A controls GFP production by binding and inactivating the repressor RaA. This transcription factor is constitutively expressed in its active configuration (i.e., able to bind the DNA) by the transcription unit TU_YES_A and targets the promoter of TU_AND, the device that encodes GFP. Therefore, in the absence of A, RaA binds the promoter of TU_AND and prevents RNA polymerase II from transcribing the GFP gene. Upon binding to A, RaA is no longer able to bind the DNA, and the mRNA corresponding to GFP is produced and stored in its corresponding pool in the cytoplasm (pool_mRNA_AND). The input signal B determines the synthesis of siRNA molecules (siRNAB) that induce rapid degradation of the mRNA corresponding to the green fluorescent protein. Corepressor B binds the inactive repressor RiB (constitutively expressed by TU_NOT_B) and turns it into its active configuration, referred to as RaB. RaB binds the promoter belonging to TU_YES_B and precludes the synthesis of siRNAB. Therefore, only in the presence of both A and B is the GFP gene first transcribed and then translated. In the absence of B, double-stranded RNA sequences become siRNAB molecules under the action of the dicer in the nucleus. siRNAB binds argonaute proteins in the cytoplasm to form the RNA-induced silencing complex (RISC) that binds and cleaves the mRNA synthesized by TU_AND. Dashed arrows symbolize transcription or translation, a green arrow represents transcription activation, hammer-like red lines represent repression of either transcription or translation, and other colored lines indicate exchange of molecules among different circuit components
hierarchically composed. The class ExternalModelDefinition, furthermore, allows submodels to be distributed among separate SBML files (see Note 2). The class Model is a default SBML (Level 3) class, whereas Submodel, ModelDefinition, and ExternalModelDefinition are
defined in the comp package. Objects of default SBML classes do not have direct access to features of the comp package. A connection between standard SBML objects and the comp package is made possible by the libSBML function getPlugin("comp"), which allows an SBML object to instantiate an object of a class defined in comp (see Note 3). Objects of the class Port are the interfaces that allow separate modules to be connected. A port can be seen as a component of a submodel that “points to” input or output species (see Note 4). However, two ports of different submodels cannot be connected directly. They require the presence, in the model that contains the two submodels, of an object (a species in our case) that establishes the link between the two ports by creating two more objects: one belonging to the class ReplacedElement, the other to the class ReplacedBy, as illustrated in Fig. 2.

2.3 Modeling the AND Gate with libSBML Python
The model for the AND gate is organized into five Python scripts. Four are named after the transcription units in the circuit and contain the ModelDefinition objects that correspond not only to the TUs but also to the pools of mRNA, repressor proteins, siRNAs, and GFP. The fifth file, “cell.py”, instantiates the Model object (called cell) for the whole circuit and contains the Species objects that are exchanged among TUs and pools. These species are the inputs/outputs of the different modules in the circuit and allow the creation of the ReplacedElement and ReplacedBy objects that are required to connect modules to each other. RNA polymerase II, ribosomes, dicer, argonaute, and the two input chemicals do not need to be stored in separate pools. They are instantiated directly in the cell model.
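The connection mechanism can be sketched with the libSBML Python API, using the names given in the caption of Fig. 2 (Am, portA, portA2, submodel1, submodel2). This is a minimal sketch rather than the chapter's “cell.py”: the model definition ids "module1" and "module2" are hypothetical, the corresponding definitions and ports are assumed to be declared elsewhere, and the import guard is only there so that the snippet does nothing when python-libsbml is not installed.

```python
try:
    import libsbml
except ImportError:  # python-libsbml is not installed; the sketch is skipped
    libsbml = None


def build_cell():
    """Build a 'cell' model whose species Am links portA and portA2."""
    ns = libsbml.SBMLNamespaces(3, 1, "comp", 1)
    document = libsbml.SBMLDocument(ns)
    document.setPackageRequired("comp", True)

    cell = document.createModel()
    cell.setId("cell")

    # Declare the two submodels. "module1" and "module2" are hypothetical ids
    # of model definitions that are assumed to be provided elsewhere.
    cell_plugin = cell.getPlugin("comp")
    for submodel_id, model_ref in (("submodel1", "module1"),
                                   ("submodel2", "module2")):
        submodel = cell_plugin.createSubmodel()
        submodel.setId(submodel_id)
        submodel.setModelRef(model_ref)

    nucleus = cell.createCompartment()
    nucleus.setId("nucleus")
    nucleus.setConstant(True)

    # Am only establishes the link between the two ports.
    Am = cell.createSpecies()
    Am.setId("Am")
    Am.setCompartment("nucleus")
    Am.setConstant(False)
    Am.setBoundaryCondition(False)
    Am.setHasOnlySubstanceUnits(False)

    Am_plugin = Am.getPlugin("comp")
    re_Am = Am_plugin.createReplacedElement()  # Am replaces A2 in submodel2
    re_Am.setSubmodelRef("submodel2")
    re_Am.setPortRef("portA2")
    rb_Am = Am_plugin.createReplacedBy()       # Am is replaced by A of submodel1
    rb_Am.setSubmodelRef("submodel1")
    rb_Am.setPortRef("portA")
    return document, Am_plugin


if libsbml is not None:
    document, Am_plugin = build_cell()
    print(libsbml.writeSBMLToString(document))
```

The printed XML shows a comp:listOfReplacedElements and a comp:replacedBy element nested inside the species Am, which is how the two ports end up linked.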
2.3.1 TU_YES_A
The transcription unit corresponding to the YES_A gate contains a constitutive promoter (pc) that drives the synthesis of the pre-mRNA (pm) associated with the active repressor RaA. mRNA maturation and transport into the cytoplasm are described by a single reaction

$$pm \xrightarrow{k_{mt}} mRNA_{YES\_A}$$

where mRNA_YES_A represents the mature mRNA and kmt is the “maturation and transport” rate-constant. pm is degraded in this module with rate-constant kd

$$pm \xrightarrow{k_d} \varnothing$$

TU_YES_A has a single input, RNA polymerase II (PolII), and a single output, the mature mRNA. The pre-mRNA dynamics is described by the following ordinary differential equation (ODE):
Fig. 2 Connecting two submodels. Species A and B are created in submodel1 and submodel2, respectively. The blue circle indicates that they are fully defined in the submodel they belong to (e.g., the initial concentration and other attributes are assigned to them). Species Am and A2 (green circles) are defined only to establish a connection between submodel1 and submodel2 such that A and B can interact. Orange squares symbolize ports. Once created, Am instantiates a “plugin” species Am_plugin to get access to the comp package. By using Am_plugin, Am is replaced by A (as it is defined in submodel1) and replaces A2 in submodel2. As a result, A can interact with B. The Python code shows how Am_plugin creates the ReplacedElement object re_Am that refers to portA2 in submodel2. Inside submodel2, portA2 points to A2. Similarly, Am_plugin also creates the ReplacedBy object rb_Am that refers to portA in submodel1. Inside submodel1, portA refers to species A
$$\frac{d\,pm}{dt} = k_2\, pc\, \frac{PolII/K_H}{1 + PolII/K_H} - (k_{mt} + k_d)\, pm, \tag{1}$$

where k2 is the transcription initiation rate and KH is the Hill constant (see Note 5). We suppose that pc (as every other promoter in our circuit) is present in a single copy (see Note 6). RNA polymerase II is here treated as an “activator” of the constitutive promoter. Hence, transcription is proportional to a Hill function with Hill coefficient n = 1 (i.e., no cooperativity among RNA polymerase II molecules [12]).
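As a quick numerical check of Eq. 1, the sketch below integrates the pre-mRNA ODE with the explicit Euler method and compares the result with the analytical steady state pm = k2 pc H(PolII) / (kmt + kd), where H is the Hill function with n = 1. All parameter values are illustrative and dimensionless; only k2 = 0.5 appears in the chapter, and the function names are hypothetical.

```python
def hill_activation(x, K):
    """Hill function with coefficient n = 1: (x/K) / (1 + x/K)."""
    return (x / K) / (1.0 + x / K)


def simulate_pm(k2, pc, PolII, KH, kmt, kd, t_end, dt=0.01):
    """Integrate Eq. 1 with the explicit Euler method, starting from pm = 0."""
    pm = 0.0
    for _ in range(int(round(t_end / dt))):
        synthesis = k2 * pc * hill_activation(PolII, KH)
        pm += dt * (synthesis - (kmt + kd) * pm)
    return pm


# Illustrative values (assumptions), with PolII held constant.
params = dict(k2=0.5, pc=1.0, PolII=2.0, KH=1.0, kmt=0.3, kd=0.2)

pm_steady = (params["k2"] * params["pc"]
             * hill_activation(params["PolII"], params["KH"])
             / (params["kmt"] + params["kd"]))
pm_final = simulate_pm(t_end=50.0, **params)
print(pm_steady, pm_final)
```

With these values the simulated pm approaches the analytical steady state of 2/3 within the simulated time.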
A Python script to generate a model in SBML Level 3 for TU_YES_A starts with the lines

    from libsbml import *  # importing libSBML
    # setting the XML namespace URI associated with SBML Level 3 version 1
    # and the package comp version 1
    ns = SBMLNamespaces(3, 1, "comp", 1)
    document_tu = SBMLDocument(ns)
    plugin_document_tu = document_tu.getPlugin("comp")
    # "True" means that the package comp is used to change the mathematical
    # interpretation of the model
    plugin_document_tu.setRequired(True)
    TU_YES_A = plugin_document_tu.createModelDefinition()
    TU_YES_A.setId("TU_YES_A")
    TU_YES_A.setName("TU_YES_A")  # assigning a name to the ModelDefinition object

document_tu, an object of the class SBMLDocument, is used here to instantiate the “plugin” SBMLDocument object (plugin_document_tu) that is required to get access to the comp package and create the ModelDefinition object TU_YES_A. The latter represents the actual model for the YES_A transcription unit and will become a submodel of the cell model. The Id assigned to a ModelDefinition object is used to unequivocally identify ports when establishing connections between modules. The specification of the ModelDefinition object TU_YES_A proceeds with the creation of the compartment nucleus
    nucleus = TU_YES_A.createCompartment()
    nucleus.setId("nucleus")
    nucleus.setSpatialDimensions(3)
    nucleus.setConstant(True)
    nucleus.setSize(2.9e-15)  # the volume of the yeast S. cerevisiae nucleus
    nucleus.setUnits('litre')  # the unit should be consistent with the spatial dimension
The species that interact within TU_YES_A are then created. The first species we consider is RNA polymerase II

    PolII = TU_YES_A.createSpecies()
    PolII.setId('PolII')
    PolII.setCompartment('nucleus')
    PolII.setConstant(False)
    PolII.setSubstanceUnits('mole')
    PolII.setInitialConcentration(0)
    PolII.setBoundaryCondition(False)
    # "False" is required to deal properly with the species concentration
    PolII.setHasOnlySubstanceUnits(False)
PolII inside TU_YES_A will be replaced by the RNA polymerase II species instantiated into the cell model. Therefore, there is no
need to set here the exact initial concentration of PolII, which is left equal to 0. The other three species present in this module are created in an analogous way. It should be noted, though, that the compartment associated with the mature mRNA must be set to cytoplasm even though it has not been defined in this module. Once the creation of the species is completed, reactions can be added to the ModelDefinition object. Here, the most complex reaction is the pre-mRNA synthesis, which we indicate as reaction1

    reaction1 = TU_YES_A.createReaction()
    reaction1.setReversible(False)
    # "False" specifies that there are no different time scales in the model
    reaction1.setFast(False)
    reaction1.setId('reaction1')
This reaction has two reactants (PolII and pc) and three products (PolII, pc, and pm). For instance, the creation of the first reactant, RNA polymerase II, has the form

    reaction1_r1 = reaction1.createReactant()
    reaction1_r1.setSpecies('PolII')
    reaction1_r1.setStoichiometry(1)
    # in SBML Level 3 the "constant" attribute of a species reference applies
    # to its stoichiometry: "False" means that it is not declared constant
    reaction1_r1.setConstant(False)
and the third product, the pre-mRNA, is created in a similar way

    reaction1_p3 = reaction1.createProduct()
    reaction1_p3.setSpecies('pm')
    reaction1_p3.setStoichiometry(1)
    reaction1_p3.setConstant(False)
Two kinetic parameters are involved in the synthesis of pre-mRNA: the transcription initiation rate, k2, and the Hill constant, KH. k2, for example, is determined by the instructions

    k2 = TU_YES_A.createParameter()
    k2.setId('k2')
    k2.setConstant(True)
    k2.setValue(0.5)  # units are not declared
The kinetic law corresponding to reaction1, which contains a Hill function, is defined as

    kinetics_reaction1 = reaction1.createKineticLaw()
    math_reaction1 = parseL3Formula('nucleus * k2 * pc * (PolII/KH)/(1 + (PolII/KH))')
    kinetics_reaction1.setMath(math_reaction1)
As we have seen above, TU_YES_A contains three reactions overall. The last objects to be declared inside any ModelDefinition object are the ports, i.e., the interfaces with other modules. TU_YES_A requires two ports, one associated with the “input” species PolII and the other with the “output” species mRNA_YES_A

    plugin_TU_YES_A = TU_YES_A.getPlugin("comp")
    port_PolII = plugin_TU_YES_A.createPort()
    port_PolII.setId("port_PolII")
    port_PolII.setIdRef("PolII")
    port_mRNA_YES_A = plugin_TU_YES_A.createPort()
    port_mRNA_YES_A.setId("port_mRNA_YES_A")
    port_mRNA_YES_A.setIdRef("mRNA_YES_A")
The “plugin” object plugin_TU_YES_A grants access to the comp package and lets us instantiate the two ports that refer to the species exchanged with other modules. Finally, the SBML document is written to a file, which can be named “TU_YES_A.xml”

    writeSBMLToFile(document_tu, 'TU_YES_A.xml')
The other modules in the circuit are encoded in a similar way. For the remaining circuit components we will describe only the main differences with respect to the code for TU_YES_A and illustrate their models.

2.3.2 pool_mRNA_YES_A

pool_mRNA_YES_A is placed in the cytoplasm and receives two input species: the mature mRNA mRNA_YES_A, produced by TU_YES_A and re-instantiated in the cell model, and the ribosomes (rib). The active RaA repressor is synthesized in this module and sent, as an output, to its corresponding pool in the nucleus. pool_mRNA_YES_A is a ModelDefinition object that requires the creation of the cytoplasm as a new compartment

    cytoplasm = pool_mRNA_YES_A.createCompartment()
    cytoplasm.setId("cytoplasm")
    cytoplasm.setSpatialDimensions(3)
    cytoplasm.setConstant(True)
    cytoplasm.setSize(39.1e-15)  # the volume of the cytoplasm in the yeast S. cerevisiae
    cytoplasm.setUnits('litre')
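Because the nucleus and the cytoplasm are declared with different sizes, the same number of molecules corresponds to different concentrations in the two compartments. The sketch below converts a molecule count into a molar concentration using the two volumes set above, for instance for a promoter present in a single copy in the nucleus. The helper name molar_concentration is hypothetical, and the conversion assumes amounts in moles and volumes in litres, as declared for the species and compartments.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

NUCLEUS_VOLUME_L = 2.9e-15     # size assigned to the nucleus compartment
CYTOPLASM_VOLUME_L = 39.1e-15  # size assigned to the cytoplasm compartment


def molar_concentration(n_molecules, volume_litre):
    """Convert a number of molecules into a concentration in mol/L."""
    return n_molecules / (AVOGADRO * volume_litre)


single_copy_nucleus = molar_concentration(1, NUCLEUS_VOLUME_L)
single_copy_cytoplasm = molar_concentration(1, CYTOPLASM_VOLUME_L)
print(f"one molecule in the nucleus:   {single_copy_nucleus:.3e} M")
print(f"one molecule in the cytoplasm: {single_copy_cytoplasm:.3e} M")
```

One molecule in the nucleus corresponds to roughly 0.57 nM, about 13.5 times the concentration of one molecule in the cytoplasm, which is the ratio of the two volumes.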
Besides the input/output species listed above, this module contains the species RaAC that represents the molecules of RaA located inside the cytoplasm before their transfer to the nucleus. mRNA translation is described in a similar way to DNA transcription inside TU_YES_A, i.e., ribosomes play the role of “activators” of mRNA_YES_A. Therefore, mRNA translation is proportional to a Hill function without cooperativity among ribosomes, the concentration of mRNA_YES_A, and the translation initiation rate k2r (see Note 7). mRNA_YES_A is degraded within this module with decay rate kd

$$mRNA_{YES\_A} \xrightarrow{k_d} \varnothing$$

After its synthesis, RaAC is either transported into the nucleus with rate-constant kt (see Note 8)

$$R^a_{AC} \xrightarrow{k_t} R^a_A$$

or degraded with rate-constant kdp (see Note 9)

$$R^a_{AC} \xrightarrow{k_{dp}} \varnothing$$

Overall, the dynamics of RaAC results from the solution of the following ODE (see Note 10):

$$\frac{d\,R^a_{AC}}{dt} = k_{2r}\, mRNA_{YES\_A}\, \frac{rib/K_H}{1 + rib/K_H} - (k_t + k_{dp})\, R^a_{AC}. \tag{2}$$
The three ports of pool_mRNA_YES_A are encoded as

    plugin_pool_mRNA_YES_A = pool_mRNA_YES_A.getPlugin("comp")
    port_rib = plugin_pool_mRNA_YES_A.createPort()
    port_rib.setId("port_rib")
    port_rib.setIdRef("rib")  # this port allows the exchange of ribosomes with the cell
    port_mRNA_YES_A = plugin_pool_mRNA_YES_A.createPort()
    port_mRNA_YES_A.setId("port_mRNA_YES_A")
    port_mRNA_YES_A.setIdRef("mRNA_YES_A")  # this port permits the incoming of mature mRNA
    port_Ra_A = plugin_pool_mRNA_YES_A.createPort()
    port_Ra_A.setId("port_Ra_A")
    port_Ra_A.setIdRef("Ra_A")  # this port lets RaA leave the cytoplasm and reach the nucleus
2.3.3 pool_Ra_A
The pool of RaA (see Note 11) is placed in the nucleus and takes two inputs: RaA and the inducer A (from now on referred to as iA). In the “Parts & Pools” framework, this pool would be wired to the TU_AND module, with which it would exchange RaA proteins. In SBML Level 3, fluxes of proteins or other molecules are no longer necessary. TU_AND sees the concentration of the species RaA instantiated in the cell model, and pool_Ra_A just hosts the reactions involving RaA that take place far from the DNA. The chemical iA interacts with RaA via a reversible reaction

$$R^a_A + i_A \underset{\mu}{\overset{\lambda}{\rightleftharpoons}} R^i_A$$

where RiA is the inactive configuration of RA; λ and μ are the RiA formation and dissociation rate-constants, respectively (see Note 12). Both active and inactive RA are degraded inside this module with the same rate-constant kdp

$$R^a_A \xrightarrow{k_{dp}} \varnothing, \qquad R^i_A \xrightarrow{k_{dp}} \varnothing$$
Two ports are required by this module

    plugin_pool_Ra_A = pool_Ra_A.getPlugin("comp")
    port_Ra_A = plugin_pool_Ra_A.createPort()
    port_Ra_A.setId("port_Ra_A")
    port_Ra_A.setIdRef("Ra_A")  # this port allows the arrival of Ra_A molecules
    port_ia = plugin_pool_Ra_A.createPort()
    port_ia.setId("port_ia")
    port_ia.setIdRef("ia")  # this port grants access to the inducer ia
2.3.4 TU_NOT_B, pool_mRNA_NOT_B, and pool_Ri_B
These three modules are modeled in an almost identical way to TU_YES_A, pool_mRNA_YES_A, and pool_Ra_A, respectively. The only notable difference is that the repressor RB is inactive in its wild-type configuration, RiB, and becomes active (i.e., able to bind the DNA) upon binding the corepressor B (cB). Hence, inside pool_Ri_B we have the following reactions:
2.3.5 TU_YES_B
TU_YES_B is different from the two TUs previously examined since it contains a promoter regulated by a repressor protein, RaB, and encodes siRNA molecules instead of proteins. TU_YES_B gets three input species: RNA polymerase II, RaB, and the dicer (D),
whereas it returns a single output species, siRNAB, that goes into its corresponding pool in the cytoplasm. The repressed promoter pR1 drives the synthesis of double-stranded RNA molecules, dsRNA. Transcription is activated by RNA polymerase II and repressed (without cooperativity) by RaB. Therefore, dsRNA synthesis is proportional to the product of two Hill functions. dsRNA is then turned by the dicer into siRNAn, i.e., siRNA molecules located in the cell nucleus. In order to limit the number of kinetic parameters in our model, we describe the conversion of dsRNA into siRNAn with a Hill function as well (see Note 13). Finally, dsRNA is degraded within this module with rate-constant kd.
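The temporal behavior of dsRNA is determined by the following ODE:

$$\frac{d\,\mathit{dsRNA}}{dt} = k_2\,\mathit{pR1}\,\frac{\frac{\mathit{PolII}}{K_H}}{1+\frac{\mathit{PolII}}{K_H}}\,\frac{1}{1+\frac{R^a_B}{K_H}} - k_{2d}\,\mathit{dsRNA}\,\frac{1}{1+\frac{D}{K_H}} - k_d\,\mathit{dsRNA}, \tag{3}$$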
where k2d is the siRNAn production rate (see Note 14). siRNAn molecules either undergo maturation and transport into the cytoplasm with rate-constant kmt or are degraded with decay rate kd.
Therefore, the dynamics of siRNAn obeys the ODE

$$\frac{d\,\mathit{siRNA}^n}{dt} = k_{2d}\,\mathit{dsRNA}\,\frac{1}{1+\frac{D}{K_H}} - (k_{mt}+k_d)\,\mathit{siRNA}^n. \tag{4}$$
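As a rough numerical check of Eqs. 3 and 4, the sketch below integrates the two ODEs with a forward-Euler scheme and compares the result with the closed-form steady state. It is not part of the SBML model: the rate constants are those of Notes 5, 14, and 21, while the promoter concentration (one copy in the nucleus, Note 6), the function names, and the step size are assumptions made only for illustration.

```python
# Parameter values from Notes 5, 6, 14 and 21 of this chapter.
K2 = 0.5            # s^-1, transcription rate constant k2
K2D = 0.033         # s^-1, siRNAn production rate k2d
KD = 5.7e-4         # s^-1, RNA degradation rate constant kd
KMT = 5.5e-4        # s^-1, maturation and transport rate constant kmt
KH = 1e-7           # M, Hill constant
POL_II = 2.864e-6   # M, RNA polymerase II
DICER = 1.432e-6    # M, dicer
PR1 = 5.73e-10      # M, ASSUMPTION: one promoter copy in the nucleus (Note 6)

# Hill term through which the dicer converts dsRNA into siRNAn
DICER_TERM = 1.0 / (1.0 + DICER / KH)


def production(ra_b):
    """dsRNA synthesis term of Eq. 3 for a given RaB concentration (M)."""
    activation = (POL_II / KH) / (1.0 + POL_II / KH)
    repression = 1.0 / (1.0 + ra_b / KH)
    return K2 * PR1 * activation * repression


def rhs(ds_rna, si_rna_n, ra_b):
    """Right-hand sides of Eqs. 3 and 4 (all concentrations in M)."""
    processing = K2D * ds_rna * DICER_TERM
    d_ds = production(ra_b) - processing - KD * ds_rna
    d_si = processing - (KMT + KD) * si_rna_n
    return d_ds, d_si


def steady_state(ra_b):
    """Steady state obtained by setting both derivatives to zero."""
    ds = production(ra_b) / (K2D * DICER_TERM + KD)
    si = K2D * DICER_TERM * ds / (KMT + KD)
    return ds, si


def simulate(ra_b, t_end=40000.0, dt=1.0):
    """Forward-Euler integration starting from zero dsRNA and siRNAn."""
    ds, si = 0.0, 0.0
    for _ in range(int(t_end / dt)):
        d_ds, d_si = rhs(ds, si, ra_b)
        ds += dt * d_ds
        si += dt * d_si
    return ds, si


if __name__ == "__main__":
    for ra_b in (0.0, 1e-6):  # without and with repressor
        print(ra_b, simulate(ra_b), steady_state(ra_b))
```

A fixed-step Euler scheme is adequate here only because the fastest rate in the system is far below 1/dt; a stiff solver, such as the one used by a dedicated simulator, would be the safer choice for the full circuit.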
We suppose that in Eqs. 3, 4 we can use the same value of KH as in Eqs. 1, 2 (see Note 10).
2.3.6 pool_siRNA_B
The module pool_siRNA_B hosts the interactions that lead to the formation of the RNA-induced silencing complex (RISC). It takes siRNAB and the argonaute (ago) as inputs and gives the RISC as an output. Argonaute and siRNAB interact through a reversible reaction

$$\mathit{siRNA}_B + \mathit{ago} \underset{\zeta}{\overset{\theta}{\rightleftharpoons}} \mathit{RISC}$$
where θ and ζ are the RISC formation and dissociation rate-constants, respectively (see Note 15). In this pool, siRNAB is degraded with rate-constant kd.
2.3.7 TU_AND
The last transcription unit in our circuit, TU_AND, encodes the output of the whole AND gate, i.e., the green fluorescent protein. RNA polymerase II and the active repressor RaA are the two inputs for the module. They bind the regulated promoter pR2 and modulate the transcription of the pre-mRNA associated with GFP (pmG). pmG is either degraded with rate-constant kd or undergoes maturation (becoming mRNAAND) before being transported into the cytoplasm with rate-constant kmt.
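The dynamics of pmG obeys an ODE that contains as many Hill functions as there are molecules that intervene in pmG transcription:

$$\frac{d\,\mathit{pmG}}{dt} = k_2\,\mathit{pR2}\,\frac{\frac{\mathit{PolII}}{K_H}}{1+\frac{\mathit{PolII}}{K_H}}\,\frac{1}{1+\left(\frac{R^a_A}{K_H}\right)^2} - (k_d+k_{mt})\,\mathit{pmG}. \tag{5}$$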
Here, RaA dimerizes before binding pR2. Hence, this repressor protein is associated with a Hill cooperativity coefficient n = 2. mRNAAND is the only output of this module.
2.3.8 pool_mRNA_AND
This mRNA pool gets, as inputs, the mature mRNA synthesized by TU_AND, the ribosomes, and the RISC. The output is the green fluorescent protein (GFP). Ribosomes translate mRNAAND into GFPu (u stands for unfolded). GFPu is either degraded with rate-constant kdp or folds into its final configuration with rate kf, where kf represents the fluorophore maturation rate (see Note 16). The ODE that determines GFPu dynamics has the form
$$\frac{d\,\mathit{GFP}^u}{dt} = k_{2r}\,\mathit{mRNA}_{AND}\,\frac{\frac{\mathit{rib}}{K_H}}{1+\frac{\mathit{rib}}{K_H}} - (k_{dp}+k_f)\,\mathit{GFP}^u. \tag{6}$$
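The Hill terms that appear in Eqs. 5 and 6 can be evaluated with two small helper functions. The following is a minimal sketch, assuming the parameter values of Notes 5, 7, 9, 16, and 21; the function names are hypothetical and do not belong to libSBML or to the chapter's scripts.

```python
def hill_activation(x, k_h, n=1):
    """Activating Hill function (x/K_H)^n / (1 + (x/K_H)^n)."""
    ratio = (x / k_h) ** n
    return ratio / (1.0 + ratio)


def hill_repression(x, k_h, n=1):
    """Repressing Hill function 1 / (1 + (x/K_H)^n)."""
    return 1.0 / (1.0 + (x / k_h) ** n)


def gfp_u_rate(gfp_u, mrna_and, rib,
               k2r=0.02, kdp=2.7e-4, kf=1.39e-4, k_h=1e-7):
    """Right-hand side of Eq. 6 (concentrations in M, rates in s^-1)."""
    return k2r * mrna_and * hill_activation(rib, k_h) - (kdp + kf) * gfp_u


if __name__ == "__main__":
    k_h = 1e-7
    # Ribosome saturation term of Eq. 6 at the ribosome level of Note 21
    print(hill_activation(8.5e-7, k_h))
    # Repression by RaA at ten times K_H: monomer (n = 1) versus dimer (n = 2)
    print(hill_repression(1e-6, k_h, n=1), hill_repression(1e-6, k_h, n=2))
```

With n = 2 the repression term falls off much more steeply above KH than with n = 1, which is the practical consequence of the dimerization assumed for RaA in Eq. 5.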
mRNAAND is a “regulated” mRNA since it binds to and gets cleaved by the RISC. The binding of the RISC to the mRNA is due to the siRNA, the cleavage to the argonaute. RISC and mRNAAND form a complex, rm, that obeys the reversible reaction

$$\mathit{RISC} + \mathit{mRNA}_{AND} \underset{k_{-1risc}}{\overset{k_{1risc}}{\rightleftharpoons}} \mathit{rm}$$

where k1risc and k−1risc are the formation and dissociation rate-constants of rm, respectively (see Note 17). Upon mRNA cleavage, the RISC leaves the mRNA, which is degraded quickly:

$$\mathit{rm} \xrightarrow{k_{df}} \mathit{RISC}$$

In the above reaction, kdf represents the mRNA fast degradation rate (see Note 18). mRNAAND molecules that are not cleaved by the argonaute decay with rate-constant kd.
2.3.9 pool_GFP
The last pool in our circuit gets a single input, the green fluorescent protein, and has no output. Hence, it requires only one port. This pool hosts a single reaction, the degradation of GFP, with rate-constant kdG (see Note 19).
2.3.10 The Cell Model
In order to put together the models of the circuit components (submodels) into a unique model for the whole AND gate, we shall instantiate an object of the class Model. As mentioned above, our model is called cell. cell does not contain any reactions since we decided to assign them either to transcription units or pools. In contrast, cell shall contain those species that are the input and output of the circuit modules, i.e., the species that, in the ModelDefinition objects, are associated with ports. The Python script to generate the SBML Level 3 description of cell requires the instructions to create an object of the class Model and define the units adopted in the model:

    # "cell" is created as a model; to this aim the SBMLDocument object "document" is used
    cell = document.createModel()
    cell.setId("cell")
    cell.setName("cell")
    cell.setSubstanceUnits("mole")
    cell.setTimeUnits("second")
    cell.setVolumeUnits("litre")
    cell.setLengthUnits("dimensionless")
    # the units of kinetic laws are defined as the ratio between ExtentUnits and TimeUnits
    cell.setExtentUnits("mole")
    plugin_cell = cell.getPlugin("comp")
The ModelDefinition objects, corresponding to the 11 modules defined above, become submodels of the cell model. In our circuit, each ModelDefinition object is instantiated into a separate “XML” file. Therefore, ModelDefinition objects have to be associated with ExternalModelDefinition objects that “point to” the files where the ModelDefinition objects are defined. As a result, inside the “cell.xml” file every submodel is accompanied by a different ExternalModelDefinition object: both of them refer to the same ModelDefinition object, i.e., the same module. For instance, the submodel for TU_YES_A is defined as

    # Submodels are created from a "plugin" Model object
    sub_mod_TU_YES_A = plugin_cell.createSubmodel()
    # The ID of a submodel can be the same as that of a ModelDefinition object
    sub_mod_TU_YES_A.setId("TU_YES_A")
    # sub_mod_TU_YES_A points to the ModelDefinition object "TU_YES_A"
    sub_mod_TU_YES_A.setModelRef("TU_YES_A")
Similarly, the definition of the ExternalModelDefinition object associated with TU_YES_A requires the following lines of Python code:

    # ExternalModelDefinition objects are created from a "plugin" SBMLDocument object
    ext_TU_YES_A = plugin_document.createExternalModelDefinition()
    ext_TU_YES_A.setId("TU_YES_A")
    # ext_TU_YES_A points to the ModelDefinition object TU_YES_A
    ext_TU_YES_A.setModelRef("TU_YES_A")
    # The ModelDefinition object TU_YES_A is written into the XML file TU_YES_A.xml
    ext_TU_YES_A.setSource("TU_YES_A.xml")
Once all submodels and ExternalModelDefinition objects have been instantiated, the nucleus and cytoplasm compartments are created together with their “plugin” version. For the nucleus, for instance, we shall write

    nucleus = cell.createCompartment()
    nucleus.setId("nucleus")
    nucleus.setSpatialDimensions(3)
    nucleus.setConstant(True)
    nucleus.setSize(2.9e-15)
    nucleus.setUnits('litre')
    nucleus_plugin = nucleus.getPlugin("comp")

nucleus_plugin is required to instantiate a ReplacedElement object such that the nucleus created inside the cell model will replace the nucleus definition in the submodels that lie in the nucleus (see Note 20). Considering, for instance, TU_YES_A, one shall write

    re_nucleus = nucleus_plugin.createReplacedElement()
    re_nucleus.setSubmodelRef("TU_YES_A")
    re_nucleus.setIdRef("nucleus")
At this point it is necessary to create the species that represent the input/output of the different submodels. Each species is associated with a “plugin” species in order to instantiate the ReplacedElement and ReplacedBy objects that determine the connection between two (or more) modules. Fifteen species are created in the cell model: seven lie in the nucleus (RNA polymerase II, the dicer, the inducer iA, the corepressor cB, RaA, RiB, and RaB) and eight in the cytoplasm (the ribosomes, the argonaute, mRNA_YES_A, mRNA_NOT_B, siRNAB, RISC, mRNAAND, and the green fluorescent protein). For instance, RNA polymerase II is instantiated as

    PolII = cell.createSpecies()
    PolII.setId("PolII")
    PolII.setConstant(False)
    PolII.setCompartment("nucleus")
    PolII.setSubstanceUnits("mole")
    PolII.setInitialConcentration(2.864e-6)
    PolII.setBoundaryCondition(False)
    PolII.setHasOnlySubstanceUnits(False)
    PolII_plugin = PolII.getPlugin("comp")
Here the initial concentration of RNA polymerase II is specified (see Note 21). RNA polymerase II is not an output of any module but represents an input for the four transcription units. Therefore, we shall instantiate a ReplacedElement object associated with PolII that points to the port that refers to the RNA polymerase II species in each of the four TU submodels. This is achieved by writing

    TU_modules = ["TU_YES_A", "TU_NOT_B", "TU_YES_B", "TU_AND"]
    # one ReplacedElement per transcription unit, each pointing to that unit's PolII port
    for module_id in TU_modules:
        PolII_re = PolII_plugin.createReplacedElement()
        PolII_re.setSubmodelRef(module_id)
        PolII_re.setPortRef("port_PolII")
mRNA_YES_A, in contrast, is the output of TU_YES_A and an input for pool_mRNA_YES_A. The “plugin” species associated with mRNA_YES_A in the cell model (mRNA_YES_A_plugin) shall instantiate a ReplacedElement object to replace mRNA_YES_A in pool_mRNA_YES_A and a ReplacedBy object to be replaced by mRNA_YES_A from TU_YES_A. Hence, we have

    mRNA_YES_A_re = mRNA_YES_A_plugin.createReplacedElement()
    mRNA_YES_A_re.setSubmodelRef("pool_mRNA_YES_A")
    mRNA_YES_A_re.setPortRef("port_mRNA_YES_A")
    mRNA_YES_A_rb = mRNA_YES_A_plugin.createReplacedBy()
    mRNA_YES_A_rb.setSubmodelRef("TU_YES_A")
    mRNA_YES_A_rb.setPortRef("port_mRNA_YES_A")
No port shall be instantiated in the cell model. The last instruction in the Python script is for writing the cell model into the file “cell.xml”:

    writeSBMLToFile(document, 'cell.xml')
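Because every ExternalModelDefinition object points to a separate file, it can be useful to check that all referenced files are present before the composite model is used. The sketch below is one possible way to do this with the Python standard library only. It assumes that the written file carries the `comp` attributes (`comp:id`, `comp:source`) in the namespace of version 1 of the Hierarchical Model Composition package; the function names and the embedded example document are hypothetical and are not output of the chapter's script.

```python
import os
import xml.etree.ElementTree as ET

# ASSUMPTION: namespace URI of the SBML Level 3 "comp" package, version 1
COMP_NS = "http://www.sbml.org/sbml/level3/version1/comp/version1"


def external_sources(sbml_text):
    """Return (id, source) for every externalModelDefinition in the document."""
    root = ET.fromstring(sbml_text.strip())
    tag = "{%s}externalModelDefinition" % COMP_NS
    return [
        (el.get("{%s}id" % COMP_NS), el.get("{%s}source" % COMP_NS))
        for el in root.iter(tag)
    ]


def missing_files(sbml_text, directory):
    """List the referenced source files that are absent from `directory`."""
    return [
        source
        for _id, source in external_sources(sbml_text)
        if not os.path.isfile(os.path.join(directory, source))
    ]


# Hand-written fragment that mimics the structure of "cell.xml" (illustration only)
EXAMPLE = """
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core"
      xmlns:comp="http://www.sbml.org/sbml/level3/version1/comp/version1"
      level="3" version="1" comp:required="true">
  <model id="cell"/>
  <comp:listOfExternalModelDefinitions>
    <comp:externalModelDefinition comp:id="TU_YES_A" comp:modelRef="TU_YES_A"
                                  comp:source="TU_YES_A.xml"/>
    <comp:externalModelDefinition comp:id="pool_mRNA_YES_A"
                                  comp:modelRef="pool_mRNA_YES_A"
                                  comp:source="pool_mRNA_YES_A.xml"/>
  </comp:listOfExternalModelDefinitions>
</sbml>
"""

if __name__ == "__main__":
    print(external_sources(EXAMPLE))
    print(missing_files(EXAMPLE, "."))
```

For a real run, the text of “cell.xml” would be read from disk and passed to `missing_files` together with the directory that holds the module files.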
2.4 Simulations and Results
We ran simulations of the AND gate with COPASI [13] (see Note 22). All “XML” files shall be placed in the same directory and only “cell.xml” shall be imported into COPASI. COPASI automatically retrieves the submodels from the files associated with each ExternalModelDefinition object and flattens the composite model (i.e., creates a unique model for the circuit). In our simulations, we assumed that yeast cells were grown upon induction with none, one, or both input signals (see Note 23) and that fluorescence was measured at steady state. The results in Fig. 3 indicate that our model (with the numerical values we chose for the kinetic parameters) faithfully reproduces the truth table of an AND gate.
3 Notes
1. A model can also contain other kinds of objects such as species and reactions.
2. As for the AND gate, the class Model object represents the whole yeast cell, whereas submodels are the transcription units and the pools present in the circuit.
Fig. 3 AND gate simulations. Fluorescence values corresponding to the four possible combinations of the input signal concentrations are calculated at steady state and normalized to the fluorescent level “detected” in the presence of both inputs
3. If mainModel is an object of the class Model, an object submodel1 that represents a submodel of mainModel is created through a plugin object, pluginModel, in the following way:

    pluginModel = mainModel.getPlugin("comp")
    submodel1 = pluginModel.createSubmodel()
    # the Id attribute is used to establish connections among modules
    submodel1.setId("submodel1")
    # this is a reference to a ModelDefinition object
    submodel1.setModelRef("submodel1")
4. The creation of portA as an object of the class Port that points to a species A within the module submodel1 demands the following instructions:

    A = submodel1.createSpecies()
    # the Id attribute is necessary to establish a connection between a port and a species
    A.setId("A")
    # submodel1 cannot create a port directly
    plugin_submodel1 = submodel1.getPlugin("comp")
    portA = plugin_submodel1.createPort()
    # the Id of a port is also used to connect modules
    portA.setId("portA")
    # portA now points to the species A
    portA.setIdRef("A")
5. For circuit simulations, we used the following parameter values: kmt = 5.5 × 10⁻⁴ s⁻¹, which corresponds to a time interval of about 30 min; kd = 5.7 × 10⁻⁴ s⁻¹, for an mRNA half-life of about 20 min; k2 = 0.5 s⁻¹; KH = 10⁻⁷ M.
6. A single molecule in the nucleus has the concentration of 5.73 × 10⁻¹⁰ M.
7. k2r = 0.02 s⁻¹.
8. The nuclear import rate-constant kt is assigned the value of 8.3 × 10⁻³ s⁻¹, which corresponds to a time interval of about 2 min.
9. Protein degradation rate kdp is set to 2.7 × 10⁻⁴ s⁻¹, which corresponds to a half-life of about 40 min.
10. For the sake of simplicity, we use the same value of the Hill constant KH in describing both transcription and translation.
11. We wrote Python scripts encoding two or three modules together, i.e., a transcription unit on the DNA, its pool of (m)RNA product and, if present, the pool of the corresponding protein. This facilitated the model drafting.
12. λ = 10⁶ M⁻¹ s⁻¹; μ = 10⁻³ s⁻¹.
13. We use a Hill function similar to the one generally adopted to model repression of transcription (without any cooperativity among dicer molecules). Since the dicer concentration does not vary considerably, this Hill function could be approximated with a constant.
14. k2d = 0.033 s⁻¹.
15. θ = 10⁷ M⁻¹ s⁻¹; ζ = 0.01 s⁻¹.
16. kf = 1.39 × 10⁻⁴ s⁻¹, which corresponds to a maturation time of about 2 h.
17. k1risc = 3 × 10⁷ M⁻¹ s⁻¹; k−1risc = 0.017 s⁻¹.
18. kdf = 2 × 10⁻³ s⁻¹, which corresponds to an mRNA half-life of about 6 min.
19. kdG = 8.25 × 10⁻⁵ s⁻¹, which corresponds to a half-life of about 140 min. GFP is a very stable protein. The decay rate-constant used here is derived from the duplication time of S. cerevisiae in a synthetic medium.
20. Obviously, the same holds for the cytoplasm.
21. In our model, RNA polymerase II, ribosome, dicer, and argonaute are not supposed to be degraded. Their initial concentrations are
– RNA polymerase II: 2.864 × 10⁻⁶ M, which corresponds to about 5000 molecules;
– dicer: 1.432 × 10⁻⁶ M (2500 molecules);
– ribosome: 8.5 × 10⁻⁷ M (20,000 molecules);
– argonaute: 1.05 × 10⁻⁷ M (2500 molecules).
22. We used COPASI 4.26 (Build 213).
23. A 0 logical value corresponds to the absence of the input signal. A 1 logical value corresponds to a concentration of 0.01 M. In our model, neither the inducer iA nor the corepressor cB is supposed to be degraded.
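The time scales and concentrations quoted in the Notes follow from the rate-constants and from the nucleus volume through elementary conversions. The sketch below reproduces a few of them as a sanity check. The helper names are hypothetical, the nucleus volume is the one assigned to the nucleus compartment in the cell model, and the “time interval” of Notes 5, 8, and 16 is interpreted here as the mean time 1/k, which is an assumption.

```python
import math

AVOGADRO = 6.02214076e23   # mol^-1
NUCLEUS_VOLUME = 2.9e-15   # litre, size assigned to the nucleus compartment


def half_life_min(rate_constant):
    """Half-life (min) of first-order decay with the given rate constant (s^-1)."""
    return math.log(2) / rate_constant / 60.0


def mean_time_min(rate_constant):
    """Mean waiting time 1/k (min) of a first-order process (s^-1)."""
    return 1.0 / rate_constant / 60.0


def molar_concentration(molecules, volume_litre):
    """Molar concentration of `molecules` copies in the given volume."""
    return molecules / (AVOGADRO * volume_litre)


if __name__ == "__main__":
    print("mRNA half-life (Note 5):", half_life_min(5.7e-4), "min")
    print("maturation/transport time (Note 5):", mean_time_min(5.5e-4), "min")
    print("GFP half-life (Note 19):", half_life_min(8.25e-5), "min")
    print("one molecule in the nucleus (Note 6):",
          molar_concentration(1, NUCLEUS_VOLUME), "M")
    print("5000 PolII molecules (Note 21):",
          molar_concentration(5000, NUCLEUS_VOLUME), "M")
```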
Acknowledgements
We acknowledge financial support by the National Natural Science Foundation of China, grant number 31571373.
References
1. Marchisio MA, Stelling J (2008) Computational design of synthetic gene circuits with composable parts. Bioinformatics 24(17):1903–1910
2. Marchisio MA (2014) Parts & Pools: a framework for modular design of synthetic gene circuits. Front Bioeng Biotechnol 2:42
3. Mirschel S, Steinmetz K, Rempel M et al (2009) PROMOT: modular modeling for systems biology. Bioinformatics 25(5):687–689
4. Hucka M, Finney A, Sauro HM et al (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19(4):524–531
5. Rodrigo G, Carrera J, Jaramillo A (2007) Asmparts: assembly of biological model parts. Syst Synth Biol 1(4):167–170
6. Hucka M, Bergmann FT, Dräger A et al (2016) The Systems Biology Markup Language (SBML): language specification for Level 3 Version 1 Core. http://sbml.org/specifications/sbml-level-3/version-1/core/release-2-rc2. Accessed 28 Dec 2016
7. Smith LP, Hucka M, Hoops S et al (2013) Hierarchical model composition. http://identifiers.org/combine.specifications/sbml.level-3.version-1.comp.version-1.release-3. Accessed 14 Nov 2013
8. Marchisio MA (2015) Modular design of synthetic gene circuits with biological parts and pools. In: Marchisio MA (ed) Computational methods in synthetic biology. Methods in molecular biology, vol 1244. Humana Press, New York, pp 137–165
9. Bornstein BJ, Keating SM, Jouraku A et al (2008) LibSBML: an API library for SBML. Bioinformatics 24(6):880–881
10. Drinnenberg IA, Weinberg DE, Xie KT et al (2009) RNAi in budding yeast. Science 326(5952):544–550
11. Marchisio MA (2014) In silico design and in vivo implementation of yeast gene Boolean gates. J Biol Eng 8:6
12. Alon U (2006) An introduction to systems biology. Chapman & Hall/CRC Press, Boca Raton
13. Hoops S, Sahle S, Gauges R et al (2006) COPASI–a COmplex PAthway SImulator. Bioinformatics 22(24):3067–3074
Chapter 5
CRISPR-ERA: A Webserver for Guide RNA Design of Gene Editing and Regulation
Honglei Liu and Xiaowo Wang
Abstract
The CRISPR/Cas9 system has been developed as a powerful technology for both targeted genome editing and gene regulation. However, the design of efficient single-guide RNAs (sgRNAs) remains challenging because many criteria have to be considered. In this section, we introduce how to design sgRNA sequences and build a genome-wide sgRNA library using CRISPR-ERA, one of the state-of-the-art webserver tools for sgRNA design, which is based on a set of sgRNA design rules summarized from published reports.
Key words sgRNA design, CRISPR-Cas9 system, Gene editing, Gene regulation
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_5, © Springer Science+Business Media, LLC, part of Springer Nature 2021
1 Introduction
The bacterial CRISPR/Cas9 system provides a powerful technology for diverse purposes of genome engineering, including gene editing (modifying the genome sequence) and regulation (repression or activation, without modifying the genome sequence) [1, 2]. The CRISPR-Cas9 system is highly programmable; it consists of the Cas9 endonuclease and a target-identifying CRISPR RNA duplex that can be simplified into a single-guide RNA (sgRNA). The sgRNA sequence can match and target an 18- to 25-bp DNA sequence that has a protospacer-adjacent motif (PAM) adjacent to the binding site. In most applications of the CRISPR/Cas9 system, designing functional sgRNAs remains a limiting step for effective editing or regulation. Given the biophysical/biochemical properties of the CRISPR system and the large genome sequences of many organisms, designing effective sgRNAs is not a straightforward task. A number of designer tools have been developed for sgRNA design. The whole field of CRISPR genome engineering is in need of a predictive and automated tool for designing sgRNAs, both for individual gene expression control and for genome-wide sgRNA library design for genetic screening.
CRISPR-ERA (http://crispr-era.stanford.edu/) [3, 4] is among the first computational tools that can design sgRNAs with potentially better efficiency for gene regulation (repression or activation) in different host organisms, including Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, rat, mouse, and human cells. Several input formats are allowed according to the type of gene manipulation. For gene editing, the input can be a gene name, a gene sequence, or a location region. For gene regulation, a gene name, a gene sequence, or a transcription start site (TSS) location can be entered. CRISPR-ERA provides two options for a user-defined target sequence: entering a genomic region in the webpage textbox or uploading DNA sequences.
2 Methods
CRISPR-ERA applies criteria summarized from published data and papers [5–7], and computes an efficacy score (E-score) and a specificity score (S-score) to evaluate the efficiency and specificity of each sgRNA. The E-score is based on sequence features such as GC content, presence of poly-thymidine, and location information. The S-score is based on the genome-wide off-target binding sites. The off-target binding information within two mismatches is derived with Bowtie. E-score and S-score are calculated differently for different applications and organisms. An example of E-score and S-score computation is shown in Fig. 1. A more detailed description of the scoring system can be found on the “Help” webpage of CRISPR-ERA (http://crispr.stanford.edu/help.jsp).
Fig. 1 An example of E-score and S-score computation for sgRNA sequence TCGCAAGCCCTCATTTCACC
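Two of the sequence features that enter the E-score, GC content and the presence of poly-thymidine stretches, are simple to compute. The sketch below extracts them for the sgRNA sequence of Fig. 1. It does not reproduce the CRISPR-ERA scoring rules, whose weights and thresholds are not given here; the function names and the run length of four thymidines used as a flag are assumptions for illustration.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_run(seq, base="T"):
    """Length of the longest uninterrupted run of `base` in the sequence."""
    best = current = 0
    for char in seq.upper():
        current = current + 1 if char == base else 0
        best = max(best, current)
    return best


def has_poly_t(seq, min_run=4):
    """Flag sequences with at least `min_run` consecutive T (ASSUMED threshold)."""
    return longest_run(seq, "T") >= min_run


if __name__ == "__main__":
    sgrna = "TCGCAAGCCCTCATTTCACC"  # sgRNA sequence shown in Fig. 1
    print("GC content:", gc_content(sgrna))
    print("longest T run:", longest_run(sgrna))
    print("poly-T flag:", has_poly_t(sgrna))
```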
3 Procedures
3.1 CRISPR-ERA Input (Fig. 2)
Step 1. Choose the type of gene manipulation: gene editing using nuclease, gene editing using nickase, gene repression, or gene activation.
Step 2. Choose the host organism. The organisms available depend on the choice made in step 1.
Step 3. Choose the input format: gene name, gene location (target region for gene editing or transcriptional start site (TSS) location for gene regulation), or gene sequence.
3.2 CRISPR-ERA Output (Fig. 3)
The output webpage contains two parts, “See results in UCSC Genome Browser” and “Results.” Clicking the button “click here to see result in UCSC Genome Browser” displays the sgRNA sequences in the UCSC Genome Browser, where the sum of E-score and S-score is represented by a color bar. The result table lists properties of each sgRNA sequence, such as target gene, transcript ID, distance to TSS, location, and strand. Detailed information on E-score, S-score, GC content, and off-targets is also included.
Fig. 2 CRISPR-ERA input webpage
Fig. 3 sgRNA designs for human gene Pou5f1 for gene repression. (a) Output webpage. (b) Top 50 sgRNA designs in UCSC genome browser
3.3 Genome-Wide sgRNA Library Building Pipeline
The source code for generating a genome-wide sgRNA library can be downloaded. To use the source code, genome sequence files in FASTA format and genome annotation files in RefFlat or GFF format should be prepared. Once the sgRNA sequences have been derived, the next step is to find all possible off-target sequences (both PAM = NGG and PAM = NAG are considered) using Bowtie. The E-score is derived from the sgRNA sequence features and the S-score from the Bowtie result. Criteria can differ among organisms and gene manipulations.
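The idea of enumerating candidate target sites next to a PAM can be illustrated with a naive scan of both strands. This is only a sketch under stated assumptions: the actual pipeline relies on Bowtie for the genome-wide off-target search, and the function names, the 20-nt spacer length, and the coordinate convention (0-based start on the forward strand) are hypothetical choices made here.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_targets(seq, spacer_len=20, pam_suffixes=("GG", "AG")):
    """List (strand, start, spacer, pam) for spacers followed by NGG or NAG.

    `start` is the 0-based position, on the forward strand, of the leftmost
    base covered by the spacer.
    """
    seq = seq.upper()
    length = len(seq)
    hits = []
    for strand, text in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(length - spacer_len - 2):
            pam = text[i + spacer_len:i + spacer_len + 3]
            if pam[1:] in pam_suffixes:
                start = i if strand == "+" else length - i - spacer_len
                hits.append((strand, start, text[i:i + spacer_len], pam))
    return hits


def mismatches(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a.upper(), b.upper()) if x != y)


if __name__ == "__main__":
    # Toy sequence: the sgRNA of Fig. 1 followed by a TGG PAM and a short tail
    toy = "TCGCAAGCCCTCATTTCACC" + "TGG" + "AAAA"
    for hit in find_targets(toy):
        print(hit)
    print(mismatches("TCGCAAGCCCTCATTTCACC", "TCGCAAGCCCTCATTTCAGG"))
```

A candidate off-target would then be any genomic site next to a PAM whose spacer differs from the sgRNA by at most two positions according to `mismatches`.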
4 Conclusion
CRISPR-ERA can also be applied to other types of CRISPR applications and expanded to other organisms. CRISPR-ERA has addressed the current need for computational tools for CRISPR-based genome engineering. Such computational tools are part of the CRISPR toolset for genome engineering and facilitate broad applications.
References
1. Qi LS, Larson MH, Gilbert LA et al (2013) Repurposing CRISPR as an RNA-guided platform for sequence-specific control of gene expression. Cell 152:1173–1183
2. Nakamura M, Srinivasan P, Chavez M et al (2019) Anti-CRISPR-mediated control of gene editing and synthetic circuits in eukaryotic cells. Nat Commun 10:194–204
3. Liu H, Wei Z, Dominguez A et al (2015) CRISPR-ERA: a comprehensive design tool for CRISPR-mediated gene editing, repression and activation. Bioinformatics 31:3676–3678
4. Liu H, Wang X, Qi LS (2017) Using CRISPR-ERA webserver for sgRNA design. Bio-protocol 7:e2522
5. Doudna JA, Charpentier E (2014) Genome editing. The new frontier of genome engineering with CRISPR-Cas9. Science 346:1258096
6. La Russa MF, Qi LS (2015) The new state of the art: Cas9 for gene activation and repression. Mol Cell Biol 35:3800–3809
7. Liu Y, Yu C, Daley TP et al (2018) CRISPR activation screens systematically identify factors that drive neuronal fate and reprogramming. Cell Stem Cell 23:758–771.e8
Chapter 6
iGUIDE Method for CRISPR Off-Target Detection
Christopher L. Nobles
Abstract
With the advent of genome editing technologies, scientists have recognized that these technologies can be prone to nonspecific or off-target activity. As many areas of the genome are sensitive and can give rise to abnormalities if mutated, it is imperative that scientists identify regions of off-target activity in order to utilize these new technologies for medical benefits. GUIDE-seq and iGUIDE both use an oligo-based marker method to identify regions of DNA double-strand breaks in an unbiased manner. The repeated observation of these double-strand breaks across the genome in comparison with target sequences (such as guide RNAs) has allowed researchers to identify on- and off-target sites related to their targeted-nuclease technologies.
Key words Off-target analysis, Specificity, CRISPR, DNA double-strand breaks
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_6, © Springer Science+Business Media, LLC, part of Springer Nature 2021
1 Introduction
Following the development of CRISPR-based technologies and other designer nucleases, a number of methods were developed to measure off-target activity, or the amount and location of DNA editing not intended by design [1–7]. These methods centered around quantifying the distributions of DNA double-strand breaks (DSBs) by different means, leading to each having its own strengths and limitations. Initial methods to identify DSBs either used whole genome sequencing approaches or recognized that marking by large foreign DNA could indicate regions of DSBs (such as with integration-defective retroviral vectors or adeno-associated viral vectors) [1, 3]. Other methods identify DSBs by fixing cells and their DNA to isolate factors associated with DSBs or DNA repair [2, 5, 6]. Lastly, and similarly to previous work, small protected oligonucleotides were found to incorporate into DSBs at a higher frequency than large DNA, providing a high-throughput targeted approach for identifying DSBs over the course of exposure to designer nucleases [3, 7]. The GUIDE-seq method was originally developed by the laboratory of J. Keith Joung, MD, Ph.D., and the protocol was further improved upon by the laboratory of Frederic D. Bushman, Ph.D., by applying practices developed for unbiased targeted sequencing, resulting in iGUIDE [3, 7].
While experimental off-target analysis is commonly applied to CRISPR-gRNA selection at later stages of research, it can be valuable much earlier in gRNA selection within the appropriate model system. Discordance can often be observed between in silico methods of off-target analysis and experimental or in vivo methods [3]. The latter typically reveal fewer off-target sites associated with a specific gRNA, likely due to cellular restrictive factors associated with the DNA of a cell within a eukaryotic organism [8, 9]. Therefore, experimental off-target analyses typically carry more relevant results that could have a significant impact on the decision to use a gRNA.
Both GUIDE-seq and iGUIDE share similar protocols. GUIDE-seq uses a 34-bp double-stranded oligodeoxynucleotide (dsODN) while iGUIDE uses a 46-bp dsODN to mark DSBs, and both methods use a nested-PCR approach to enrich for incorporated dsODN and identify the flanking DNA by paired-end sequencing. The ODN marker is included during delivery of targeted nucleases (plasmid, mRNA, or enzyme-gRNA complex) to cells or tissue. The marker incorporates into DSBs induced by nucleases (and other sources of DSBs) and will persist within cells when it has terminal phosphorothioate bonds, which prevent exonuclease digestion [3, 9, 10]. Genomic DNA is harvested from the cells or tissue after a recovery period, typically hours to days. This genomic DNA is then sheared into small fragments (400–800 bp) through ultra-sonication, end-repaired, and ligated to Y-adapters. These adapters are part single-stranded, part double-stranded DNA with a 3′ T-overhang, making TA-based ligation the preferred method for joining the Y-adapters to the repaired DNA fragments.
DNA is then subjected to a nested PCR in which primers must first amplify from the incorporated oligo before creating the template for the adapter primer to amplify. This selectively amplifies the incorporated dsODN rather than all DNA present. During the second phase of the amplification, segments of DNA are added to the amplicons through the primers; these segments allow for indexing and sequencing of the amplicons on Illumina-based paired-end sequencers (see Note 1).
The bioinformatic processing of the obtained data follows a typical workflow. Primary processing is conducted by Illumina-based software to convert image files into sequence-based data. The secondary processing focuses on manipulating the sequence data and obtaining usable information. The tertiary portion interprets the data, and then the data is reported in a concise manner to convey the results to the user, commonly a bioinformatician or scientist. Here, we will go into more detail as to how the secondary, tertiary, and reporting portions of the pipeline work, as these portions of the analysis are handled by the GUIDE-seq and iGUIDE software.
2 Materials
1. Illumina-based paired-end sequencer.
2. Computational resources, such as a computational cluster, server, or cloud computing platform.
3. A minimum of 8 GB of memory (but preferably much more).
4. A Linux-based operating system.
5. Proficiency in a programming language commonly used in bioinformatics, such as R or Python, is highly useful.
3 Methods
3.1 Primary Processing
The primary processing for many sequence-based analyses is typically handled by software provided by the manufacturer. For instance, with Illumina-based paired-end sequencing platforms, Illumina provides a utility for converting their standard output files into FASTQ format (“bcl2fastq”). The FASTQ format can vary slightly, but typically is broken down into one read every four lines. The first line displays the read name, the second line the nucleotide sequence, the third line can contain the read name again or a placeholder (“+”), and the fourth line contains the PHRED quality scores. Illumina-based sequencing platforms can produce millions to billions of reads from a single sequencing run, therefore these files are not trivial to manage. The user needs to consider how they will keep track of sequencing files and runs, how they will archive these files, and how they will process these files. The latter relates to the next sections, though some advice is included in the notes section regarding the former points (see Notes 2–4).
Paired-end sequencing with GUIDE-seq and iGUIDE protocols will return the four different types of reads typical of an Illumina-based sequencer. These include Read 1 (R1), Index 1 (I1), Index 2 (I2), and Read 2 (R2). Each read contains different information. R1 will contain information from the genomic sequence, starting from the Y-adapter. I1 and I2 will both contain indexing sequences, though half of I2 will also contain a unique molecular index (UMI) sequence (8 nts). R2 will contain sequence that starts in the dsODN and then reads through the junction into genomic sequence.
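The four-line record layout described above can be read with a few lines of code. The following is a minimal sketch that is unrelated to the GUIDE-seq or iGUIDE software: the function names and the example reads are hypothetical, and the quality offset of 33 is assumed, since it is the encoding commonly used in current Illumina FASTQ files.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or trailing blank line)
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record: %r" % header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header.strip()[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer PHRED scores (ASSUMED offset 33)."""
    return [ord(char) - offset for char in quality]


# Two made-up reads, used only to exercise the parser
EXAMPLE = "@read1\nACGT\n+\nII!#\n@read2\nGGCA\n+read2\n####\n"

if __name__ == "__main__":
    for name, sequence, quality in read_fastq(io.StringIO(EXAMPLE)):
        print(name, sequence, phred_scores(quality))
```

Reading the records lazily with a generator keeps memory use flat, which matters when a single run yields millions of reads.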
3.2 Secondary Processing
Many bioinformatic software utilities or pipelines, such as GUIDEseq and iGUIDE, are built around using FASTQ files as input and cover the remaining steps in processing (secondary, tertiary, and reporting). These pipelines will typically cover all sections in a single command, such as the command “iguide run” [7]. Here, the steps for each processing section will be described.
1. Demultiplexing. Sequencing runs commonly include several samples “multiplexed” together into a single run. This first step in the secondary processing separates reads by sample. This serves two purposes: (1) to make sure reads from one sample do not interfere with the analysis of another sample, and (2) to distribute the workload, as FASTQ files may contain millions to hundreds of millions of sequencing reads. Reads are separated by isolating and identifying the index sequences, or barcodes. Single-barcoding schemes are simple to execute but can lead to artifacts during amplification; therefore, dual-indexing (or dual-barcoding) is the preferred method. Index sequences may be captured by the Illumina instrument in the I1 and I2 reads, but sometimes users may also capture barcodes within the R1 or R2 reads. GUIDE-seq and iGUIDE both use methods that generate I1 and I2 sequences for dual-barcoding.
2. Primer Trimming. Once reads are separated by sample, the next step in the pipeline is to remove synthetic sequences that were appended during amplification (Fig. 1a). In particular, GUIDE-seq and iGUIDE process the beginning of the R2 reads to identify a sequence related to the PCR2 primer targeting the dsODN. Additionally, the R1 and R2 reads need to be scanned for “overread” trimming. Overreading occurs because, in many reads, the fragment of DNA between the dsODN and the adapter sequence is shorter than the number of sequencing cycles. The reverse complement of the adapter should be trimmed from the 3′ end of the R2 sequences. In addition, the reverse complement of the dsODN should be trimmed from the 3′ end of the R1 sequences. These reverse complement sequences are not always in the same location in each read. Therefore, the software programmatically works through the reads to identify these sequences and trim them off. Trimming the primer section from the 5′ end of the R2 reads also serves to identify which reads contain the correct sequence. This is an initial method to filter out mispriming artifacts from the previous PCRs.

A.
    Genomic                              dsODN(rc)
    |----------------------------------| |-----…
R1: GATTTGTTGCTCCAGGCCACAGCACTGTTGCTCTTG TCGCGT…

R2: AACGGTAT ACGCGA CAAGAGCAACAGTGCTGTGGCCTGGAGCAACAAATC AGATCG…
    |------| |----| |----------------------------------| |-----…
    Primer   dsODN  Genomic                              Adapter
    (PCR2)   (bit)                                       (rc)

B. (hg38) Chromosome 14 : 22547647-22547727
Ref: CTGTGCTAGACATGAGGTCTATGGACTTCAAGAGCAACAGTGCTGTGGCCTGGAGCAACAAATCTGACTTTGCATGTGCAA
   Edit Site Position ^          CAAGAGCAACAGTGCTGTGGCCTGGAGCAACAAATC (R2 Alignment)
    (chr14:+:22547664)           ^ Position of incorporation site (chr14:+:22547675)

Fig. 1 Example sequence captured during sequencing for an incorporation site in the TRAC locus. (a) A potential sequence captured by Read 1 (R1) and Read 2 (R2) of an Illumina-based paired-end sequence. Spaces have been inserted into the sequences to help distinguish the different sections. Adapter, primer, and dsODN-based sequences depend on the sequences used during marking and amplification of the treated sample DNA. During processing, adapter, primer, and dsODN-related sequences are identified and removed, leaving genomic sequence to align to a reference (ref) genome. (b) After alignment, the captured sequence is used to identify the incorporation site, which is recognized as the first base (in order of 5′ to 3′). The target pattern is then searched for upstream of the incorporation site, represented by the blue sequence. The target sequence is broken down into the target sequence (underlined) and the PAM sequence (not underlined). The edit site is identified based on an input configuration that should be representative of the nuclease’s enzymatic activity, so as to identify the base where editing occurs. Locational identification is represented by the chromosome number (chr14), the strand on which the target sequence matches (+, for positive strand), and then the numerical index of the base (22,547,664)

3. dsODN Trimming. A major difference between GUIDE-seq and iGUIDE is that iGUIDE implements an extended dsODN design, of which only 34 bps are used for priming during the nested PCR. This leaves an additional 6 bps on the 5′ and 3′ ends of the dsODN that are untouched by the PCR priming scheme. Why does this matter? The additional six nucleotides can be used informatically to distinguish the dsODN from genomic mispriming events (Fig. 1a). This is done in the same manner as looking for the primer sequences but is conducted independently and under strict matching conditions.
4. Filtering. During the previous trimming steps, many reads may have been removed from the analysis, and any read left without its paired-end mate (R1 and R2 mated pairs) would only increase the workload if processed further. A filtering step is used to make sure that for every R2 sequence that continues to be processed, there is a corresponding R1 sequence, and vice versa.
5. Consolidating. Depending on the preferred sequence aligner (BWA, BLAT, or other programs), it may help to reduce the number of sequences aligned to only unique sequences.
For BLAT, a very computationally intensive aligner, fewer sequences needing alignment means faster processing times. Additionally, for BLAT, each read (R1 and R2) needs to be aligned independently. For other aligners, both reads from a paired-end sequencing run can be provided at the same time, meaning that unique pairs, rather than unique sequences, are what matter.
6. Aligning. The trimming steps removed synthetic DNA included in the amplicons, leaving the flanking genomic DNA. This sequence is aligned against a reference genome, such as hg38 for Homo sapiens, using either BLAT or BWA for iGUIDE (Fig. 1b, see Note 5). Outputs from these different aligners vary, but a quality control step has been included after each option to make sure the results are consistent with good practice (outlined in the next step). As a recommendation, care should be taken with the reference genome regarding alternative or repetitive sequences. These regions can lead to multiple alignments that may confound the downstream analysis. For hg38, removing the alternative or random contigs excludes alignments to these regions.
7. Quality Control. Independent of the alignment software used to match sequence data to a reference genome location, several features of the alignments should be used for quality control filtering of the results. All alignments should be inward facing, meaning that the paired sequences should extend toward one another when read 5′ to 3′ (the direction of DNA and RNA polymerization). If this is true, then they should also map to opposite strands of the genome. A minimum sequence length should also be considered, and while this is adjustable through iGUIDE configurations, it is recommended to have a minimum of 30 nts of sequence information for a dependable alignment. Alignments should also start within a predetermined distance of the first nucleotide of the identified sequence. If the first nucleotide or the first few nucleotides do not align, this may be due to PCR or sequencing error. Additionally, the genomic content may diverge slightly from the reference. Yet, if the first half of the sequence does not align to the reference, it is not likely a good alignment. Therefore, the analyst should decide before the experiment which alignments to accept. It is recommended to consider alignments that start in the first 5 nts and have a global identity score of greater than 95%. Alignments should also reflect the profile of DNA sizes that were used in the experiment. If genomic DNA was sheared to an average size of 400–500 bps, then an alignment with a size of 3 megabases (3,000,000 bps) should not be considered. It is unlikely to find many alignments greater than two times the average size of the sequencing library. Therefore, it is recommended to double the size and add a little for error, such as a 2500-bp cutoff. Another point to consider is whether to exclude alignments that legitimately align to multiple locations of the genome, or multihits. If so, the analyst should only accept reads with one alignment that passes the above criteria. This is an important point, as some aligners only yield the “best match.” Care should be taken to understand the sequence aligner’s methods and parameters so that appropriate alignments are returned for additional quality control filtering.
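The quality control criteria of step 7 can be collected into a single filter. The following Python sketch is only an illustration under stated assumptions: the class `ReadAlignment`, the functions `fragment_size` and `passes_qc`, and their parameter names are hypothetical and do not correspond to iGUIDE's internal code or configuration keys. The default thresholds are the values recommended in the text (30 nts, first 5 nts, greater than 95% identity, 2500 bp).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadAlignment:
    chrom: str
    strand: str        # "+" or "-"
    start: int         # leftmost genomic coordinate, 1-based inclusive
    end: int           # rightmost genomic coordinate, 1-based inclusive
    query_start: int   # 1-based read position where the alignment begins
    identity: float    # percent identity, 0-100
    hits: int = 1      # number of genomic locations reported for this read


def fragment_size(r1: ReadAlignment, r2: ReadAlignment) -> Optional[int]:
    """Size of the fragment spanned by an inward-facing pair, else None."""
    if r1.chrom != r2.chrom or r1.strand == r2.strand:
        return None    # must map to opposite strands of the same chromosome
    plus, minus = (r1, r2) if r1.strand == "+" else (r2, r1)
    if plus.start > minus.start or plus.end > minus.end:
        return None    # reads point away from each other
    return minus.end - plus.start + 1


def passes_qc(r1: ReadAlignment, r2: ReadAlignment, min_length: int = 30,
              max_start: int = 5, min_identity: float = 95.0,
              max_size: int = 2500, allow_multihits: bool = False) -> bool:
    size = fragment_size(r1, r2)
    if size is None or size > max_size:
        return False
    for read in (r1, r2):
        if read.end - read.start + 1 < min_length:
            return False
        if read.query_start > max_start:
            return False
        if read.identity <= min_identity:   # must be greater than the cutoff
            return False
        if not allow_multihits and read.hits != 1:
            return False
    return True
```

Keeping every threshold as a named parameter makes it straightforward to fix the criteria before the experiment, as recommended above.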
iGUIDE for CRISPR Off-Target Detection
3.3 Tertiary Processing
Secondary processing produces valid data from the analysis. For iGUIDE, at this point the pipeline has returned all locations of identified DSBs given the analyst’s criteria. The next step is to interpret these DSBs to achieve the goals of the experiment. In iGUIDE, the focus is on identifying the on- and off-target genomic locations of targeted nucleases. First, the pipeline identifies specific phenomena that may be indicative of nuclease activity. Next, these locations are assessed for sequence similarity within a specified region, yielding predicted edit sites. Lastly, these sites are analyzed for their sequence divergence from the expected target sequence, such as a guide RNA.
1. Mapping phenomena. Analysis of on-target nuclease activity shows that alignments from DSBs can have several distinct features, and these phenomena could indicate nuclease activity. If multiple alignments appear near each other on the same strand, such that they overlap, they are termed a “pileup.” Pileups indicate repeated DSBs in the same region. Similarly, if alignments flank another alignment on the opposite strand, they are termed “flanking pairs.” The criterion that controls the identification of flanking pairs is the maximum distance to consider. The phenomenon that receives the most attention is proximity to a sequence similar to the target sequence (or gRNA), termed “target matched” in iGUIDE. This is a strong indicator, and often considered a requirement, for a genomic location to be identified as a potential off-target. The site of sequence similarity is required to be upstream of (5′ of the identified location), or overlapping with, the identified genomic location. How far upstream to search is dictated by the analyst. If the analyst chooses parameters that are too flexible, specificity will likely decrease because the false-positive rate of detecting off-target sites increases.
Conversely, criteria that are too strict may decrease sensitivity. It is advisable to set these parameters before running the experiment, with recommendations based on similar data.
2. Identifying Target Matched Sequences. This task searches for potential sequences that match the target sequence used by the nuclease and requires the user to have a reference genome to work with. Given the alignment, sequences are extracted from the reference from the incorporation site to a predetermined distance upstream (or 5′, see Fig. 1b). The target sequence is then aligned to the queried reference sequences using a Smith-Waterman alignment to identify locations within a mismatch cutoff. If multiple locations exist, the top priority is the best match (the fewest mismatches), and locations closer to the incorporation site are then preferred over others. If multiple target sequences are used at the same time (for example, where the experiment edits multiple genes with different gRNAs used simultaneously), the same criteria are used: best match first, followed by closest proximity. These sites are referred to in iGUIDE as “edit sites.” An incorporation site may coincide exactly with an edit site, yet the difference is that an incorporation site was observed, while an edit site is inferred to be the location of nuclease activity.
3. Edit Site Sequence Analysis. With experimental evidence suggesting the locations of nuclease editing sites, the remainder of the analysis focuses on understanding these locations. Abundances can be calculated by summing unique alignments per sample or replicate, such as with the SonicAbundance method [11], or by using UMI data as in GUIDE-seq. These metrics could indicate the frequency of specific off-target edit sites as compared to others, though this should be validated through other approaches such as targeted sequencing. The sequences around the edit sites are compared to develop a sequence profile indicating the target requirements or preferences.
3.4 Reporting
Data are often summarized in either plots or reports (typically in pdf or html format) after a completed analysis. It is good practice to keep the results and reports under version control and to document their versions. This allows for reproducible reporting of results and an understanding of how the results were processed. Analysts should strive to present data in a clear and consistent manner; for example, if count-based summaries are presented in tables for on-target results, then count-based summaries for off-target results should likely be presented in tables rather than plots. For iGUIDE, the standard analysis report has four major sections: (1) Summary, (2) Specimen Overview, (3) On-Target Analysis, and (4) Off-Target Analysis. The latter two sections include additional subsections that further explore the data, yet keep the focus on incorporation sites identified near regions similar to the target sequence. When developing a reproducible report, care should be taken to understand the questions the report should try to answer. For iGUIDE and other off-target analyses, the major question tends to be the identification of unintended nuclease activity. GUIDE-seq and iGUIDE, as well as other protocols, are able to return a vast number of genomic locations, but each relies on the identification of DSBs or DNA-repair mechanisms. It is important to remember that the off-target sites are inferred from repeated detection of DSBs around a sequence similar to the target sequence, and, at some frequency, a nuclease-independent DSB could occur near such a site. Additional work, such as targeted amplicon sequencing [12], is still needed to validate the suspected sites for nuclease-dependent activity.
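The inference described above rests on the target-matching step of the tertiary processing, where candidate sites upstream of an incorporation site are ranked first by best match and then by proximity. The following Python sketch illustrates that ranking only. It deliberately swaps in a simple ungapped mismatch (Hamming) scan for the Smith-Waterman alignment used by iGUIDE, it searches only upstream (not overlapping) sequence, and it ignores ambiguous bases such as the N of a PAM. The names `TargetMatch` and `find_target_match` and the example target are hypothetical.

```python
from typing import NamedTuple, Optional


class TargetMatch(NamedTuple):
    offset: int       # 0-based start of the match within the upstream window
    distance: int     # bases between the match end and the incorporation site
    mismatches: int
    sequence: str


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def find_target_match(upstream: str, target: str,
                      max_mismatches: int) -> Optional[TargetMatch]:
    """Best ungapped match of `target` within `upstream`.

    `upstream` is the reference sequence 5' of the incorporation site, written
    5'->3', so its last base lies immediately upstream of the site.
    Candidates are ranked by fewest mismatches, then by closest proximity.
    """
    best = None
    for offset in range(len(upstream) - len(target) + 1):
        window = upstream[offset:offset + len(target)]
        mismatches = hamming(window, target)
        if mismatches > max_mismatches:
            continue
        distance = len(upstream) - (offset + len(target))
        candidate = TargetMatch(offset, distance, mismatches, window)
        if best is None or (candidate.mismatches, candidate.distance) < (
                best.mismatches, best.distance):
            best = candidate
    return best


# The 28 reference bases upstream of the incorporation site in Fig. 1b,
# searched with a made-up 10-nt target (not the guide used in the experiment).
upstream_window = "CTGTGCTAGACATGAGGTCTATGGACTT"
print(find_target_match(upstream_window, "GACATGAGGT", max_mismatches=1))
```

A gapped local alignment would additionally tolerate bulges between target and genome, which this scan cannot detect.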
4 Notes
1. Sequencing-based experiments can become quite expensive. To make the most of an experiment, it is important to characterize sequencing libraries extensively beforehand. Assays to measure library size and concentration are common, yet an assay to measure incorporation efficiency can also indicate the level of marking within a sample. This can be accomplished using PCR and restriction enzyme digest or qPCR. In both the GUIDE-seq and iGUIDE dsODN, an NdeI restriction site is present in the middle of the sequence. A quantitative assay to measure the amount of NdeI activity at on-target locations within a sample can give the experimentalist an estimate of whether there was sufficient marking in the initial sample. This can save time and money and avoid confusing results later by identifying nonoptimal samples.
2. It is highly recommended to take enough time to consider a workflow system for managing sequencing-based experiments. This includes determining answers to questions such as: How will sequencing files be archived? How will experiments be tracked? How will experiment-related data be associated with the analysis? Would the designed system work for 5, 50, or 100+ experiments?
3. Many Illumina sequencers are capable of converting the raw output from sequencing (BCL files) to FASTQ files at the end of sequencing, often through a streaming analysis with Illumina’s cloud platform. If information is provided in the sample sheet, then the machine may also demultiplex the sequencing files. While this may work well for some workflows, iGUIDE is equipped to handle demultiplexing, and the workflow is simpler if the files are not demultiplexed after sequencing completion. To adjust these settings, please contact your Illumina representative or adjust the configuration of your sequencer so that it does not demultiplex the output files. iGUIDE does have an alternative workflow that accepts demultiplexed files if the above is not an option.
4. Single-index workflows can suffer from a PCR artifact called template hopping. This artifact can be almost completely removed by incorporating a dual-indexing strategy with unique barcodes for each sample for both index 1 and index 2.
5. Regarding alignment software, iGUIDE is compatible with both the BLAT and BWA aligners as of version 1.0.0. Users who would rather use different alignment software can contact the current maintainer or try to update their own copy of the code. If the preferred aligner outputs BAM/SAM formats, then the quality control currently implemented for iGUIDE should be compatible.
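Notes 3 and 4 concern demultiplexing with dual indices. The following Python sketch shows why unique barcode pairs help: a read is assigned only when both index reads match the same sample, so a read whose two indices point to different samples (as expected after template hopping) is left unassigned. This is not iGUIDE's implementation. The function names, the sample names, the barcodes, and the tolerance of one mismatch are hypothetical, and the index reads are assumed to hold only the barcode portion (any UMI bases already removed).

```python
from typing import Dict, Iterable, List, Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Mismatch count; sequences of unequal length are treated as non-matching."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def assign_sample(i1: str, i2: str, samples: Dict[str, Tuple[str, str]],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the single sample whose two barcodes both match, else None."""
    hits = [name for name, (barcode1, barcode2) in samples.items()
            if hamming(i1, barcode1) <= max_mismatches
            and hamming(i2, barcode2) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads: Iterable[Tuple[str, str, str]],
                samples: Dict[str, Tuple[str, str]],
                max_mismatches: int = 1) -> Dict[str, List[str]]:
    """Bin (read_name, I1, I2) tuples by sample; the rest go to 'unassigned'."""
    bins: Dict[str, List[str]] = {name: [] for name in samples}
    bins["unassigned"] = []
    for read_name, i1, i2 in reads:
        sample = assign_sample(i1, i2, samples, max_mismatches)
        bins[sample if sample is not None else "unassigned"].append(read_name)
    return bins


# Hypothetical sample sheet: sample name -> (index 1 barcode, index 2 barcode).
sample_sheet = {"s1": ("ACGTACGT", "TTGGCCAA"), "s2": ("GGGTTTAA", "CCAATTGG")}
index_reads = [
    ("r1", "ACGTACGT", "TTGGCCAA"),   # both indices match s1
    ("r2", "ACGTACGT", "CCAATTGG"),   # I1 from s1, I2 from s2: hopped
]
print(demultiplex(index_reads, sample_sheet))
```

With a single index, the second read above would have been assigned to one of the two samples instead of being set aside.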
References
1. Li H, Haurigot V, Doyon Y et al (2011) In vivo genome editing restores haemostasis in a mouse model for haemophilia. Nature 475(7355):217–221. https://doi.org/10.1038/nature10177
2. Crosetto N, Mitra A, Silva MJ et al (2013) Nucleotide-resolution DNA double-strand break mapping by next generation sequencing. Nat Methods 10(4):361–365. https://doi.org/10.1038/nmeth.2408
3. Kim D, Bae S, Park J et al (2015) Digenome-seq: genome-wide profiling of CRISPR-Cas9 off-target effects in human cells. Nat Methods 12(3):237–243. https://doi.org/10.1038/nmeth.3284
4. Tsai SQ, Zheng Z, Nguyen NT et al (2015) GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat Biotechnol 33(2):187–197. https://doi.org/10.1038/nbt.3117
5. Canela A, Sridharan S, Sciascia N et al (2016) DNA breaks and end resection measured genome-wide by end sequencing. Mol Cell 63(5):898–911. https://doi.org/10.1016/j.molcel.2016.06.034
6. Yan WX, Mirzazadeh R, Garnerone S et al (2017) BLISS is a versatile and quantitative method for genome-wide profiling of DNA double-strand breaks. Nat Commun 8:15058. https://doi.org/10.1038/ncomms15058
7. Nobles CL, Reddy S, Salas-McKee J et al (2019) iGUIDE: an improved pipeline for analyzing CRISPR cleavage specificity. Genome Biol 20(1):14. https://doi.org/10.1186/s13059-019-1625-3
8. Lin S, Staahl BT, Alla RK, Doudna JA (2014) Enhanced homology-directed human genome engineering by controlled timing of CRISPR/Cas9 delivery. Elife 3:e04766. https://doi.org/10.7554/eLife.04766
9. Sternberg SH, LaFrance B, Kaplan M, Doudna JA (2015) Conformational control of DNA target cleavage by CRISPR-Cas9. Nature 527(7576):110–113. https://doi.org/10.1038/nature15544
10. Kleinstiver BP, Tsai SQ, Prew MS et al (2016) Genome-wide specificities of CRISPR-Cas Cpf1 nucleases in human cells. Nat Biotechnol 34(8):869–874. https://doi.org/10.1038/nbt.3620
11. Berry CC, Gillet NA, Melamed A et al (2012) Estimating abundances of retroviral insertion sites from DNA fragment length data. Bioinformatics 28(6):755–762. https://doi.org/10.1093/bioinformatics/bts004
12. Akcakaya P, Bobbin ML, Guo JA et al (2018) In vivo CRISPR editing with no detectable genome-wide off-target mutations. Nature 561(7723):416–419. https://doi.org/10.1038/s41586-018-0500-9
Chapter 7
Web-Based Base Editing Toolkits: BE-Designer and BE-Analyzer
Gue-Ho Hwang and Sangsu Bae
Abstract
The CRISPR-Cas system is broadly used for genome editing because of its convenience and relatively low cost. However, the use of CRISPR nucleases to induce specific nucleotide changes in target DNA requires complex procedures and additional donor DNAs. Furthermore, CRISPR nuclease-mediated DNA cleavage at target sites frequently causes large deletions or genomic rearrangements. In contrast, base editors that consist of catalytically dead Cas9 (dCas9) or Cas9 nickase (nCas9) connected to a cytidine or an adenine deaminase can correct point mutations in the absence of additional donor DNA and without generating double-strand breaks (DSBs) in the target region. To design target sites and assess mutation ratios for cytosine and adenine base editors (CBEs and ABEs), we have developed web tools named BE-Designer and BE-Analyzer. These tools are easy to use (tasks are accomplished by clicking on the relevant buttons) and do not require a deep knowledge of bioinformatics.
Key words Cytosine base editors (CBEs), Adenine base editors (ABEs), Web-based tool, BE-Designer, BE-Analyzer
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_7, © Springer Science+Business Media, LLC, part of Springer Nature 2021
1 Introduction
CRISPR-Cas (clustered regularly interspaced short palindromic repeats and CRISPR associated) effectors, naturally used as an adaptive immune system in bacteria and archaea for targeting viral nucleic acid sequences, have been applied to genome editing because of their convenience and high efficacy [1–5]. CRISPR-Cas nucleases recognize protospacer-adjacent motif (PAM) sequences and induce DSBs in target regions in a guide RNA-dependent manner [6]. The DNA DSBs are typically repaired by the cell’s own repair pathways: nonhomologous end joining (NHEJ) or homology-directed repair (HDR) [7]. NHEJ occurs throughout the cell cycle, but the repair process is frequently accompanied by errors such as small insertions or deletions (indels) [8]. In contrast, HDR occurs mostly during the S and G2 phases and corrects DNA DSBs precisely without causing mutations, but with relatively lower efficiency [9]. For these reasons, NHEJ and HDR have predominantly been used for introducing target gene knockouts and knock-ins, respectively [10–12]. Hence, for inducing a specific nucleotide correction in a target region, the applicability of CRISPR-Cas nucleases is limited: additional donor DNAs are necessary, the editing efficiency is not high enough, and DNA in nondividing cells is not editable.
The advent of cytosine and adenine base editors, respectively called CBEs (for C-to-T conversions) and ABEs (for A-to-G conversions), has enabled the correction of many pathogenic genetic variants with high efficacy, in the absence of any donor DNA and without the generation of DNA DSBs. CBEs, the first to be developed, were initially created by combining dCas9 or nCas9 with a cytidine deaminase such as rAPOBEC1, PmCDA1, or hAPOBEC3. ABEs were later constructed in a similar manner, by fusing an adenine deaminase to dCas9 or nCas9. In this case, because no adenine deaminase that accepts single-stranded DNA (ssDNA) as a substrate is known in nature, ssDNA-targetable enzymes were obtained by evolving an Escherichia coli adenine deaminase, TadA, from its natural function of targeting transfer RNAs (tRNAs) to targeting ssDNAs. Recently, side effects of CBEs and ABEs, such as off-target single-nucleotide conversions, have been reported [13–17]. Several newer versions of both CBEs and ABEs have been developed to improve specificity or efficacy and to decrease the size of the target window; these include BE4 [18], BE4max [19], EvoFERNY-BE4max [20], ABEmax [21], ABE7.10 [22], and miniABEmax [23]. In addition to DNA base editors, RNA base editors, which are constructed using CRISPR-Cas13b nucleases that target RNAs instead of DNAs, have been reported [24].
In contrast to the original CRISPR-Cas nucleases, CBEs and ABEs can respectively convert multiple cytidines and adenosines within defined target windows in target sites, resulting in specific amino acid mutations according to the RNA codon table. Therefore, dedicated toolkits for the design and analysis of base editors are needed, in addition to the toolkits for CRISPR nucleases. To address this issue, we implemented web-based tools for target sequence design and for next generation sequencing (NGS) data analysis for base editors, respectively named BE-Designer and BE-Analyzer [25]. BE-Designer shows all possible target sequences in a submitted sequence, along with useful information including potential off-target sites and the expected amino acid sequences after editing. Currently, BE-Designer supports 418 organisms and 13 PAM sequences, and both categories are continuously updated; thus, researchers can choose organisms and PAMs appropriate for their specific interests. BE-Analyzer analyzes NGS data and displays the results in a web browser. Using JavaScript, BE-Analyzer can perform the analysis without the NGS data being uploaded to a server, reducing the analysis running time. BE-Analyzer can optionally receive NGS experimental data together with control data so that users can compare the mutation patterns in data from base editor-treated and -untreated samples. These two programs are available on our website so that users can freely access them without a login process.
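The link between base conversions in a target window and the resulting amino acid changes can be made concrete with a small example. The following Python sketch is an illustration only and is not code from BE-Designer. It assumes that the protospacer lies on the coding strand, that every editable base inside the window is converted, and that positions 4 to 8 form the window; the function names and the 20-nt sequence are hypothetical. The codon table is the standard genetic code written with DNA letters (T in place of U).

```python
from itertools import product
from typing import Tuple

_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, "*" marks a stop.
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)}

# Source base and converted base for each editor type.
CONVERSIONS = {"CBE": ("C", "T"), "ABE": ("A", "G")}


def edit_window(protospacer: str, editor: str, window: Tuple[int, int]) -> str:
    """Convert every editable base inside the 1-based, inclusive window."""
    source, converted = CONVERSIONS[editor]
    first, last = window
    return "".join(
        converted if base == source and first <= position <= last else base
        for position, base in enumerate(protospacer, start=1)
    )


def translate(dna: str, frame: int = 0) -> str:
    """Translate complete codons starting at the given reading frame."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(frame, len(dna) - 2, 3))


protospacer = "ATGCAGCGACTGAAGGCCTA"   # made-up sequence
edited = edit_window(protospacer, "CBE", (4, 8))
print(protospacer, translate(protospacer))
print(edited, translate(edited))
```

In this made-up case the C-to-T conversions turn the CAG and CGA codons into stop codons, which is the kind of consequence that must be checked in every reading frame of interest.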
2 Software
All programs have been constructed to run on the CRISPR RGEN Tools website (http://www.rgenome.net). BE-Designer and BE-Analyzer can operate in the Chrome and Edge browsers.
3 Methods
3.1 BE-Designer
BE-Designer provides a list of potential base editor target sequences along with useful information. This section introduces the method for running BE-Designer.
1. Access BE-Designer (http://www.rgenome.net/be-designer/) from the CRISPR RGEN Tools site (see Note 1).
2. Select the type of PAM sequence recognized by the CRISPR endonuclease and the reference genome by selecting the type of organism (see Note 2).
3. Paste the sequence to be searched in the Target Sequence box (see Note 3).
4. Select the type of base editor and confirm the base editing window (see Note 4).
5. Click the submit button.
6. BE-Designer provides the reference sequence with the corresponding amino acid sequence in a box, together with a list of possible target sequences (Fig. 1). The appropriate reading frame can be selected by clicking the “Codon n” button (n = 0, 1, 2). The table provided includes the target’s editing window sequence with the associated amino acid sequence, position, direction, GC content, and predicted off-target sites (see Note 5). More information about predicted off-target sites can be accessed by clicking the number of off-target sites (see Note 6). The results can be filtered as desired and downloaded by clicking the appropriate buttons.
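The kind of table produced in step 6 can be approximated offline for a quick check. The following Python sketch is a simplified illustration and not BE-Designer's algorithm. It assumes an NGG PAM located 3′ of a 20-nt protospacer, an editing window at positions 4 to 8, and an uppercase A/C/G/T input, and it omits the off-target search that BE-Designer delegates to Cas-OFFinder. The names `Target` and `find_targets` and the input sequence are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple


class Target(NamedTuple):
    strand: str         # "+" for the given sequence, "-" for its reverse complement
    position: int       # 1-based protospacer start on that strand, read 5'->3'
    protospacer: str
    pam: str
    gc_content: float   # percent GC of the protospacer


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_targets(seq: str, editor: str = "CBE",
                 window: Tuple[int, int] = (4, 8),
                 spacer_len: int = 20) -> List[Target]:
    """List protospacers followed by NGG that hold an editable base in the window."""
    source = {"CBE": "C", "ABE": "A"}[editor]
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{spacer_len}}})([ACGT]GG))")
    targets = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for match in pattern.finditer(strand_seq):
            spacer, pam = match.group(1), match.group(2)
            if source not in spacer[window[0] - 1:window[1]]:
                continue   # nothing for this editor to convert in the window
            gc = 100.0 * sum(base in "GC" for base in spacer) / spacer_len
            targets.append(Target(strand, match.start() + 1, spacer, pam, round(gc, 1)))
    return targets


for target in find_targets("ATGCAGCGACTGAAGGCCTATGGAT"):   # made-up input
    print(target)
```

Other PAMs or window definitions would require changing the pattern and the `window` argument accordingly.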
Fig. 1 A sample BE-Designer results page. BE-Designer shows possible target sequences with associated amino acid sequences and other useful information
3.2 BE-Analyzer
This section describes the protocol for running BE-Analyzer, a tool for the analysis of NGS results from base editing experiments. BE-Analyzer can finish the analysis of the FASTQ files (Sample_R1: 904 Mb, Sample_R2: 904 Mb, Control_R1: 1.1 GB, Control_R2: 1.1 GB) in at most 257.41 s on a CPU (Ryzen3 3800X).
1. Access BE-Analyzer (http://www.rgenome.net/be-analyzer/) from the CRISPR RGEN Tools site.
2. Select the file type and choose single-end reads, paired-end reads, or merged NGS data (see Note 7).
3. Input the full reference sequence and the target DNA sequence and select the type of PAM sequence and base editor (see Notes 4 and 8–10).
4. Click the Submit button.
5. After the analysis is complete, BE-Analyzer displays useful results: (1) the count of each mutation in the received data, (2) a table showing nucleotide mutations with the associated expected amino acid sequences (see Note 11), (3) graphs showing the rate of nucleotide mutations, and (4) a table of alignment results (Fig. 2).
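The per-position mutation rates shown in the result graphs of step 5 can be understood through a minimal calculation. The following Python sketch is not BE-Analyzer's algorithm, which aligns reads in the browser and also reports indels. It assumes that reads are already trimmed to the reference span and considers substitutions only, so any read whose length differs from the reference is skipped. The function name and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def substitution_rates(reference: str, reads: Iterable[str],
                       source: str = "C", converted: str = "T") -> Dict[int, float]:
    """Percent of usable reads showing source->converted at each source position.

    Keys are 1-based reference positions that hold the `source` base.
    """
    counts: Counter = Counter()
    usable = 0
    for read in reads:
        if len(read) != len(reference):
            continue   # substitution-only sketch: skip reads with length changes
        usable += 1
        for position, (ref_base, read_base) in enumerate(zip(reference, read), start=1):
            if ref_base == source and read_base == converted:
                counts[position] += 1
    if usable == 0:
        return {}
    return {position: 100.0 * counts[position] / usable
            for position, base in enumerate(reference, start=1) if base == source}


# Toy data: four full-length reads and one truncated read that is skipped.
treated = ["ACGTCA", "ATGTCA", "ATGTTA", "ATGTCA", "ACGT"]
print(substitution_rates("ACGTCA", treated))
```

Running the same function on reads from an untreated control gives a background rate against which the treated sample can be compared.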
Fig. 2 A sample BE-Analyzer results page. BE-Analyzer shows count information with graphs and aligned results
4 Notes
1. The CRISPR RGEN Tools site (http://www.rgenome.net) provides several web tools for researchers who want to find targets and analyze NGS data related to CRISPR experiments. Along with BE-Designer and BE-Analyzer, the site includes tools for identifying CRISPR off-target sites (Cas-OFFinder [26]), microhomology calculations (Microhomology-Predictor [27]), designing and analyzing CRISPR experiments (Cas-Designer [28] and Cas-Analyzer [29]), designing guide RNA libraries for gene knockout screens using Cas9 (Cas-Database [30]) and Cpf1 (Cpf1-Database [31]), and profiling CRISPR-Cas9 specificity (Digenome-sequencing [32, 33]).
2. BE-Designer currently offers 13 choices of PAM sequences and 418 choices of organisms. If the desired PAM sequence or organism is not available, click the sentence “Send request for a new organism” to send an email to the CRISPR RGEN Tools administrator.
3. The sequence is limited to 1000 characters, because the time required for running Cas-OFFinder increases as the sequence length increases.
4. The base editing window has a default value according to the type of base editor. The value of the base editing window can be modified as desired.
5. Potential off-target sites are analyzed by Cas-OFFinder. BE-Designer shows the target results and requests that the server run Cas-OFFinder. Until Cas-OFFinder finishes the analysis, BE-Designer shows “Running. . .” in the status column of the table and “Available After Job is Done” in the mismatches column.
6. When a row in the table is clicked, BE-Designer highlights the target sequence with the PAM and the mutated amino acid sequence in the reference sequence box.
7. BE-Analyzer also receives control data files so that NGS and PCR errors can be distinguished from editing results.
8. BE-Analyzer analyzes and shows more information in an additional flanking window. If a broader view is desired, the length of the additional flanking window can be modified by the user.
9. If the number of reads of a particular sequence is lower than the set minimum frequency, the sequence is considered to contain a PCR or NGS error. A user can change the minimum frequency to see more information.
10. When the default value of the comparison range is used, BE-Analyzer reads the full length of the reference sequence. With the default setting, BE-Analyzer treats a primer dimer sequence as a deletion sequence. Therefore, if the results show many large deletions, consider decreasing the comparison range.
11. The reading frame can be changed by selecting the appropriate codon button.
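The minimum frequency filter of Note 9 amounts to counting identical sequences and setting aside those seen too rarely. The following Python sketch shows the idea only; the function name, the default threshold of 2, and the toy reads are hypothetical and are not taken from BE-Analyzer.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def filter_by_min_frequency(reads: Iterable[str],
                            min_count: int = 2) -> Tuple[Dict[str, int], int]:
    """Keep sequences seen at least `min_count` times.

    Returns the kept sequences with their read counts and the number of reads
    discarded as likely PCR or NGS errors.
    """
    counts = Counter(reads)
    kept = {sequence: n for sequence, n in counts.items() if n >= min_count}
    discarded = sum(n for n in counts.values() if n < min_count)
    return kept, discarded


toy_reads = ["ACGT", "ACGT", "ATGT", "ATGT", "ATGT", "AGGT"]
print(filter_by_min_frequency(toy_reads))
```

Lowering `min_count` retains rarer sequences at the cost of admitting more sequencing noise, which mirrors the trade-off described in the note.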
Acknowledgments
This work was supported by grants from the National Research Foundation of Korea (NRF) (no. 2018M3A9H3022412), the Next Generation BioGreen 21 Program (no. PJ01319301), Korea Healthcare Technology R&D Project (no. HI16C1012), and Technology Innovation Program (no. 20000158) to S.B.
References
1. Horvath P, Barrangou R (2010) CRISPR/Cas, the immune system of bacteria and Archaea. Science 327(5962):167–170
2. Doudna JA, Charpentier E (2014) The new frontier of genome engineering with CRISPR-Cas9. Science 346(6213):1258096
3. Kim H, Kim JS (2014) A guide to genome engineering with programmable nucleases. Nat Rev Genet 15(5):321–334
4. Sander JD, Joung JK (2014) CRISPR-Cas systems for editing, regulating and targeting genomes. Nat Biotechnol 32(4):347–355
5. Shalem O, Sanjana NE, Zhang F (2015) High-throughput functional genomics using CRISPR-Cas9. Nat Rev Genet 16(5):299–311
6. Jiang F, Doudna JA (2017) CRISPR–Cas9 structures and mechanisms. Annu Rev Biophys 46:505–529
7. Lieber MR, Ma Y, Pannicke U, Schwarz K (2003) Mechanism and regulation of human non-homologous DNA end-joining. Nat Rev Mol Cell Biol 4(9):712–720
8. Mao Z, Bozzella M, Seluanov A et al (2008) DNA repair by nonhomologous end joining and homologous recombination during cell cycle in human cells. Cell Cycle 7(18):2902–2906
9. Rothkamm K, Krüger I, Thompson LH et al (2003) Pathways of DNA double-strand break repair during the mammalian cell cycle. Mol Cell Biol 23(16):5706–5715
10. Baek K, Kim DH, Jeong J et al (2016) DNA-free two-gene knockout in Chlamydomonas reinhardtii via CRISPR-Cas9 ribonucleoproteins. Sci Rep 6(1):30620
11. Auer TO, Duroure K, De Cian A et al (2014) Highly efficient CRISPR/Cas9-mediated knock-in in zebrafish by homology-independent DNA repair. Genome Res 24(1):142–153
12. Lin CH, Tallaksen-Greene S, Chien WM et al (2001) Neurological abnormalities in a knock-in mouse model of Huntington’s disease. Hum Mol Genet 10(2):137–144
13. Zhou C, Sun Y, Yan R et al (2019) Off-target RNA mutation induced by DNA base editing and its elimination by mutagenesis. Nature 571(7764):275–278
14. Grünewald J, Zhou R, Garcia SP et al (2019) Transcriptome-wide off-target RNA editing induced by CRISPR-guided DNA base editors. Nature 569(7756):433–437
15. Jin S, Zong Y, Gao Q et al (2019) Cytosine, but not adenine, base editors induce genome-wide off-target mutations in rice. Science 364(6437):292–295
16. Zuo E, Sun Y, Wei W et al (2019) Cytosine base editor generates substantial off-target single-nucleotide variants in mouse embryos. Science 364(6437):289–292
17. Kim HS, Jeong YK, Hur JK et al (2019) Adenine base editors catalyze cytosine conversions in human cells. Nat Biotechnol 37:1145–1148
18. Komor AC, Zhao KT, Packer MS et al (2017) Improved base excision repair inhibition and bacteriophage Mu Gam protein yields C:G-to-T:A base editors with higher efficiency and product purity. Sci Adv 3(8):eaao4774
19. Koblan LW, Doman JL, Wilson C et al (2018) Improving cytidine and adenine base editors by expression optimization and ancestral reconstruction. Nat Biotechnol 36(9):843–846
20. Thuronyi BW, Koblan LW, Levy JM et al (2019) Continuous evolution of base editors with expanded target compatibility and improved activity. Nat Biotechnol 37(9):1070–1079
21. Rees HA, Wilson C, Doman JL et al (2019) Analysis and minimization of cellular RNA editing by DNA adenine base editors. Sci Adv 5(5):eaax5717
22. Zhou C, Sun Y, Yan R et al (2019) Off-target RNA mutation induced by DNA base editing and its elimination by mutagenesis. Nature 571(7764):275–278
23. Grünewald J, Zhou R, Iyer S et al (2019) CRISPR DNA base editors with reduced RNA off-target and self-editing activities. Nat Biotechnol 37(9):1041–1048
88
Gue-Ho Hwang and Sangsu Bae
24. Abudayyeh OO, Gootenberg JS, Franklin B et al (2019) A cytosine deaminase for programmable single-base RNA editing. Science 365 (6451):382–386 25. Hwang GH, Park J, Lim K et al (2018) Web-based design and analysis tools for CRISPR base editing. BMC Bioinformatics 19(1):542 26. Bae S, Park J, Kim JS (2014) Cas-OFFinder: a fast and versatile algorithm that searches for potential off-target sites of Cas9 RNA-guided endonucleases. Bioinformatics 30 (10):1473–1475 27. Bae S, Kweon J, Kim HS et al (2014) Microhomology-based choice of Cas9 nuclease target sites. Nat Methods 11(7):705–706 28. Park J, Bae S, Kim JS (2015) Cas-Designer: a web-based tool for choice of CRISPR-Cas9 target sites. Bioinformatics 31(24):4014–4016
29. Park J, Lim K, Kim JS et al (2017) Cas-analyzer: an online tool for assessing genome editing results using NGS data. Bioinformatics 33(2):286–288 30. Park J, Kim JS, Bae S (2016) Cas-Database: web-based genome-wide guide RNA library design for gene knockout screens using CRISPR-Cas9. Bioinformatics 32 (13):2017–2023 31. Park J, Bae S (2018) Cpf1-Database: web-based genome-wide guide RNA library design for gene knockout screens using CRISPRCpf1. Bioinformatics 34(6):1077–1079 32. Park J, Childs L, Kim D et al (2017) Digenome-seq web tool for profiling CRISPR specificity. Nat Methods 14(6):548–549 33. Kim D, Bae S, Park J et al (2015) Digenomeseq: genome-wide profiling of CRISPR-Cas9 off-target effects in human cells. Nat Methods 12(3):237–243
Chapter 8

Synthetic Gene Circuit Analysis and Optimization

Irene Otero-Muras and Julio R. Banga

Abstract

Synthetic biology aims at engineering synthetic circuits with pre-defined target functions. From a systems (model-based) perspective, the following problems are of central importance: (1) given the model of a biomolecular circuit, elucidate whether it is capable of a certain behavior/functionality; and (2) starting from a pre-defined required functionality and a library of biological parts, find the biomolecular circuit that, built as a combination of such parts, achieves the desired behavior. These two problems, framed, respectively, as nonlinear analysis and automated design problems, are tackled here by efficient optimization methods. We illustrate these methods with case studies considering the analysis and design of biocircuits capable of bistability (bistable switches). Bistability is of particular interest in the context of systems and synthetic biology because it endows cells with the capacity to make decisions.

Key words: Synthetic biology, Automated design, Bistability, Cell decision making, Global optimization, Mixed integer nonlinear programming, Multiobjective optimization
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_8, © Springer Science+Business Media, LLC, part of Springer Nature 2021

1 Introduction

Computational analysis and design are fundamental tools for turning the construction of synthetic circuits into a more efficient, rational, and automated process. Many groups have contributed to the automated design of biomolecular networks [1–9]. A key goal is to map DNA sequence to function/cellular behavior (once the DNA sequence is expressed inside the cell). From the perspective of design (forward mapping), we aim to find DNA sequences that achieve a function or behavior of interest. From the perspective of analysis (backward mapping), we are interested in elucidating the behavior of a particular DNA sequence (once expressed in the host, and under particular conditions). To establish these maps (see Fig. 1), we follow a systems (model-based) approach where mathematical models describe the behavior of biomolecular circuits, i.e., the gene regulatory networks encoded by these DNA sequences. Assuming a deterministic setting, we use models of ordinary differential equations (ODEs) to encode the dynamics of biomolecular circuits. If the effect of stochastic noise is relevant, the dynamics are better described by stochastic models of gene regulatory networks [10, 11]. Unfortunately, efficient methods for design and analysis of dynamic stochastic circuits are still lacking. Here we illustrate methods for analysis and design that combine ODE models with global optimization methods. A more precise definition of each of the problems is provided below.

Fig. 1 Scheme of important mappings in analysis and design of synthetic gene circuits (elements shown: DNA sequence, biomolecular circuit model, behaviour in the host cell)

Analysis of Biomolecular Networks: Starting from the model of a biomolecular circuit, we aim to elucidate whether it is capable of a particular target behavior, and under which operational conditions. The target behavior might be, for example, a type of dynamic response with respect to an input (for example, biochemical adaptation), or the capacity to undergo a bifurcation of interest [12] (leading, for example, to bistability). Here we determine whether a particular circuit (with some tunable parameters) is capable of bistability and, if so, find values for these tunable parameters that lead to bistable behavior.

Design of Biomolecular Networks: Starting from a library of standard parts, the aim is to find the biomolecular circuit (corresponding to a particular combination of the available parts) that achieves a pre-defined target performance. The method in [8], based on global optimization, can handle any type of target performance. In contrast to other approaches based on, e.g., Boolean logic and digital responses [7], our method relies on dynamic ODE models and analogue responses. Here, we illustrate in detail the approach presented in [8], describing step by step how to find circuit(s) capable of bistable behavior starting from a particular library of standard parts.

Bistability in Synthetic Biology: Bistability endows cells with the capacity to make decisions, for example, changing phenotype in response to microenvironmental signals. Therefore, the detection of bistability (deciding whether a biomolecular network has the capacity to undergo bistability) and the design of bistable circuits are of great importance in synthetic biology [13, 14].
2 Methods

In a noise-free deterministic setting, the dynamics of a biomolecular circuit are described by a set of ordinary differential equations of the form:

$$\dot{z} = f(z, k) \qquad (1)$$

where $z \in \mathbb{R}^n$ is the vector with the concentrations of all the species involved (depending on the granularity and kinetics, the vector of species might contain DNA, mRNA, proteins, etc.). The function $f: \mathbb{R}^n \to \mathbb{R}^n$ is the structure of the model and $k$ is the vector of parameters.

2.1 Analysis of Synthetic Gene Circuits

2.1.1 Optimization Framework and General Procedure

Starting from the model of a biomolecular circuit of the form (1), we aim to elucidate whether it is capable of a particular target behavior, and under which parameters/operational conditions.

1. We partition the parameters into two different vectors: a vector $k$ containing the parameters that cannot be tuned or modified (and therefore remain fixed during the search), and a vector $w$ containing the parameters that can be tuned.

2. We write (1) in the form:

$$\dot{z} = \hat{f}(z, k, w) \qquad (2)$$

3. We encode the target behavior in a cost function:

$$J(\dot{z}, z, k, w) \qquad (3)$$

such that, when the function reaches the minimum, the system achieves the target response. In some cases we need additional equations, added as constraints in the optimization problem, in order to define the behavior of interest.

4. We formulate the search as an optimization problem:

$$\min_{w} J(\dot{z}, z, k, w) \quad \text{subject to:} \quad \dot{z} = \hat{f}(z, k, w), \quad w_L \le w \le w_U \qquad (4)$$

(where $w_L$ and $w_U$ indicate lower and upper bounds for the decision variables), and to additional equality and/or inequality constraints in case they are needed.

5. We solve the optimization problem (4). The problem is nonconvex and multimodal and, therefore, global optimization is required. We use the hybrid solver eSS [15] with fmincon as a local solver (since all the decision variables are real) to solve this problem efficiently.
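The five steps above can be sketched on a toy problem. The following Python example is only an illustration under stated assumptions: the one-species model, the fixed values H and D, the target concentration, and the bounds are all hypothetical, and a crude multistart pattern search stands in for the eSS/fmincon combination used in the chapter (it is not that solver).

```python
import random

# Step 1: fixed parameters k = (H, D) (hypothetical values); tunable parameter w.
H, D = 2.0, 1.0


def steady_state(w, lo=0.0, hi=1.0e3, tol=1.0e-12):
    """Step 2: steady state of dz/dt = w / (1 + z**H) - D*z, by bisection.

    The right-hand side decreases monotonically in z for z >= 0,
    so the positive root is unique.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if w / (1.0 + mid**H) - D * mid > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def cost(w, z_target=2.0):
    """Step 3: cost J is zero when the steady state hits the (hypothetical) target."""
    return (steady_state(w) - z_target) ** 2


def local_search(J, w0, w_lo, w_hi, step=1.0, tol=1.0e-10):
    """Bounded 1-D pattern search: try w +/- step, halve the step on failure."""
    w, best = w0, J(w0)
    while step > tol:
        improved = False
        for cand in (w - step, w + step):
            cand = min(max(cand, w_lo), w_hi)  # enforce w_L <= w <= w_U
            val = J(cand)
            if val < best:
                w, best, improved = cand, val, True
        if not improved:
            step *= 0.5
    return w, best


def multistart(J, w_lo, w_hi, n_starts=20, seed=0):
    """Steps 4-5: run the local search from random starts and keep the best result."""
    rng = random.Random(seed)
    runs = [local_search(J, rng.uniform(w_lo, w_hi), w_lo, w_hi)
            for _ in range(n_starts)]
    return min(runs, key=lambda r: r[1])


w_best, J_best = multistart(cost, 0.1, 50.0)
print(w_best, J_best)
```

For this toy model the answer can be checked by hand: a steady state of 2 requires w = D·z·(1 + z^H) = 10. Real circuit models are multimodal, which is why the chapter relies on a dedicated global solver rather than a simple multistart.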
2.1.2 Detection of Bistability in Biomolecular Circuits

The detection of bistability in biochemical reaction networks is a recurrent problem in systems and synthetic biology. In a recent paper, Yordanov et al. [16] present an efficient method for bistability detection in biomolecular networks with mass action kinetics. The method exploits the inherent structure of mass action kinetic networks and uses exact algebraic conditions to check for bistability, allowing for very accurate and efficient detection. However, in the context of synthetic biology it is very common to adopt other types of kinetics (Hill, Shea–Ackers, Michaelis–Menten). Here we illustrate a method for bistability detection in networks of arbitrary kinetics, which follows [17].

Fig. 2 Bifurcation diagram of a bistable system: the steady state concentration of one species is represented versus the bifurcation parameter. The bistability region is enclosed by two fold bifurcations. A bistable system is also denoted as a bistable switch, with an ON and an OFF state. In the right part of the figure we represent the two types of limit points (a: fold bifurcation; b: hysteresis point)

1. We start from a model of a biomolecular circuit (2), and we express the model equations in the form:

$$\dot{z} = \tilde{f}(z, k, w, \beta) \qquad (5)$$

where $z$ is the vector of states (the number of states is $n$, and, therefore, $z$ is an $n$-dimensional vector), $k$ is the vector of fixed parameters, $w$ is the vector of parameters that we can modify (also denoted as decision parameters), and $\beta$ is a scalar that represents the bifurcation parameter (see Fig. 2).

2. We denote by $Q$ the extended Jacobian of the system, defined as:

$$Q(z, \beta) = [\, D_z \tilde{f} \;\; D_\beta \tilde{f} \,] \qquad (6)$$

where $D_z \tilde{f}$ and $D_\beta \tilde{f}$ are the Jacobians with respect to the states and the bifurcation parameter, respectively. The Jacobians can be computed analytically or numerically by finite differences.

3. We denote by $v$ the normalized tangent vector to the equilibrium curve at $(z, \beta)^T$ (see Fig. 2), i.e.:

$$v = \operatorname{null}(Q) \qquad (7)$$

where null is the nullspace or kernel of the matrix.

4. We denote by $v_{n+1}$ the last component, i.e., the $(n+1)$-th component, of the vector $v$. Recall that $n$ is the number of species.

5. We define the function $J = (v_{n+1})^2$ as the objective function to minimize.

6. We formulate the following optimization problem:

$$\min_{w, \beta} J \qquad (8)$$

subject to:

$$\tilde{f}(z, k, w, \beta) = 0 \qquad (9)$$

$$\operatorname{rank}(Q(z^*, \beta^*)) = n \qquad (10)$$

$$w_L \le w \le w_U \qquad (11)$$

$$z_L \le z \le z_U \qquad (12)$$

$$\beta_L \le \beta \le \beta_U \qquad (13)$$

where Eq. 9 is an equality constraint ensuring that $(z, \beta)^T$ is a steady state, Eq. 10 is an equality constraint ensuring that $(z, \beta)^T$ is a regular point [12], Eq. 11 is the set of bounds for the decision parameters, Eq. 12 is the (optional) set of inequality constraints imposing ranges for the steady state concentrations, and Eq. 13 sets the lower and upper bounds for the bifurcation parameter.

7. We solve the optimization problem, which is nonconvex and multimodal, using a global optimization algorithm. We use the efficient hybrid solver eSS (enhanced scatter search) [15], with fmincon as local solver.
8. If we find an optimum $(z^*, \beta^*)^T$ such that $J^* = 0$, then $(z^*, \beta^*)^T$ is a limit point (a proof can be found in [17]). This is a necessary (but not sufficient) condition for a fold bifurcation (see Fig. 2). In order to ensure that the system undergoes a fold bifurcation and has the capacity for bistability, we go to step 9. If the optimum obtained $(z^*, \beta^*)^T$ is such that $J^* > 0$, bistability is not found. It is important to remark that, due to numerical issues, we do not expect to find an exact zero for a limit point, but a real number that is small enough.

9. If a limit point $(z^*, \beta^*)^T$ is found as a result of the optimization, we perform a continuation analysis (in forward and backward directions) starting from $(z^*, \beta^*)^T$ using the continuation algorithm Cl_Matcont [18]. The continuation algorithm computes the bifurcation diagram, allowing us to elucidate whether the limit point is a fold bifurcation point or not (see Fig. 2).

In contrast to [16], the method does not use exact algebraic expressions to obtain the objective function, and relies on numerical computations. The disadvantage is that the efficacy of the algorithm depends on the precision and accuracy of the numerical methods; using the Matlab functions fsolve for the steady state computations and null for the nullspace, we obtain in general very good results. The advantage is that the method can deal with very large systems, both in terms of number of states and number of decision parameters. In Subheading 3, we illustrate the use of the method through a practical example, finding conditions for bistability in the synthetic toggle switch of [13].

2.2 Design of Synthetic Gene Circuits

2.2.1 Optimization Framework and General Procedure
Starting from a biomolecular circuit superstructure, we aim to find circuits with a pre-defined target behavior. Here we describe the procedures to design circuits with a single design objective and with two opposing design objectives. The model superstructure:

$$\dot{z} = f(z, k, w, y) \qquad (15)$$

contains all the possible circuit options in terms of structure and parameters. The integer vector $y$ encodes the topology of the network, the real vector $w$ contains all the real parameters that can be tuned, and the vector $k$ contains the parameters that remain fixed during the search.

We start with the single objective design problem. In this case:

1. The target behavior is encoded in one objective function:

$$J(\dot{z}, z, k, w, y) \qquad (16)$$

2. The design is formulated as an optimization problem:

$$\min_{w, y} J(\dot{z}, z, k, w, y) \quad \text{subject to:} \quad \dot{z} = f(z, k, w, y), \quad w_L \le w \le w_U, \quad y_L \le y \le y_U \qquad (17)$$

and to additional equality and/or inequality constraints in case they are needed.

3. The optimization problem (17) is a mixed integer nonlinear dynamic optimization problem. The hybrid solver eSS [15] in combination with misqp [19] is used to solve this problem efficiently.

In design problems with multiple design criteria, the target behavior is the best trade-off between competing optimization objectives. Here we illustrate the design method implemented in [8], based on multiobjective optimization.

1. The target behavior is encoded in a set of opposing objective functions. Here we consider a biobjective problem, which is one of the most common set-ups in design. In this case, we have two competing objectives:

$$J_1(\dot{z}, z, k, w, y), \quad J_2(\dot{z}, z, k, w, y) \qquad (18)$$

2. The design is formulated as a biobjective optimization problem:

$$\min_{w, y} \; J_1(\dot{z}, z, k, w, y), \; J_2(\dot{z}, z, k, w, y) \quad \text{subject to:} \quad \dot{z} = f(z, k, w, y), \quad w_L \le w \le w_U, \quad y_L \le y \le y_U \qquad (19)$$

3. The optimization problem (19) is a multiobjective mixed integer nonlinear dynamic optimization problem. The ε-constraint strategy is used to convert the multiobjective problem into a set of single objective problems. The hybrid solver eSS (enhanced scatter search) [15] in combination with misqp [19] is used to solve each of the single objective problems efficiently.

4. As a result of the optimization, we obtain the Pareto front of solutions containing the circuits that best trade off both design objectives.
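The conversion of a biobjective problem into a set of single objective problems can be sketched as follows. This is a minimal Python illustration, assuming the ε-constraint form (minimize J1 subject to J2 ≤ ε); the two objectives, the grid for w, and the values of y are hypothetical, and exhaustive enumeration stands in for the eSS/misqp solver.

```python
def j1(w, y):
    """Hypothetical objective 1 (e.g. a cost that grows with w and y)."""
    return w**2 + y


def j2(w, y):
    """Hypothetical objective 2 (e.g. a performance error that shrinks with w and y)."""
    return (w - 2.0) ** 2 / (1 + y)


W_GRID = [0.05 * i for i in range(41)]  # real decision variable, w in [0, 2]
Y_VALUES = [0, 1]                       # integer decision variable (topology switch)


def solve_single_objective(eps):
    """Minimise J1 subject to J2 <= eps by exhaustive enumeration.

    Ties in J1 are broken by the smaller J2, so the returned point is
    Pareto optimal on the grid.
    """
    feasible = [(j1(w, y), j2(w, y), w, y)
                for w in W_GRID for y in Y_VALUES
                if j2(w, y) <= eps]
    return min(feasible) if feasible else None


def pareto_front(eps_values):
    """Sweep eps and collect the distinct single objective optima."""
    front = []
    for eps in eps_values:
        sol = solve_single_objective(eps)
        if sol is not None and sol not in front:
            front.append(sol)
    return front


front = pareto_front([0.1, 0.5, 1.0, 2.0, 4.0])
for f1, f2, w, y in front:
    print(f"J1={f1:.3f}  J2={f2:.3f}  w={w:.2f}  y={y}")
```

Each value of ε yields one point of the front; a finer sweep of ε gives a denser approximation of the trade-off curve.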
2.2.2 Design of Synthetic Bistable Switches Starting from a Library of Parts

In this section we use the procedure previously described to design bistable switches from a library of biological components. Biological components are sequences of genetic parts, such as ribosome binding sites, promoters, and protein coding regions.

1. The first step is to obtain the model superstructure (15) starting from the library of parts, such that the structure is encoded in a vector $y$ of integer variables. For kinetics of Hill and mass action type, SYNBADm automatically generates the superstructure model from libraries of components.

2. The second step is choosing the bifurcation parameter $\beta$.

3. The third step is to encode the desired behavior (capacity for bistability) in a cost function and a set of constraints. We use the same rationale as in the detection problem. The optimization problem reads in this case:

$$\min_{w, y, \beta} J \quad \text{subject to:} \quad f(z, k, w, y, \beta) = 0, \quad \operatorname{rank}(Q(z^*, \beta^*)) = n, \quad w_L \le w \le w_U, \quad y_L \le y \le y_U, \quad z_L \le z \le z_U, \quad \beta_L \le \beta \le \beta_U \qquad (20)$$

This optimization problem can be easily implemented in SYNBADm through a user-defined objective function file.

4. The optimization problem is solved with the hybrid solver eSS with misqp as a local solver. This can be directly executed from SYNBADm by calling the single objective design function.

5. If the value of the objective function at the minimum obtained is $J = 0$, the optimum is a limit point. Otherwise, no circuits leading to bistability are found.

6. Finally, if a limit point is found as a result of the optimization, we perform a continuation analysis (in forward and backward directions) starting from the limit point using the continuation algorithm Cl_Matcont [18]. The bifurcation diagram obtained allows us to elucidate whether the system is bistable and, if so, the bistability range (see Fig. 2).
3 Results

3.1 Detection of Bistability in Biomolecular Circuits

Next, we illustrate the method through a practical example. We take the model of the synthetic toggle switch of [13], where two genes mutually repress each other (see Fig. 3):

$$\dot{z}_1 = \frac{\alpha_1}{1 + z_2^{h_2}} - z_1 \qquad (21)$$

$$\dot{z}_2 = \frac{\alpha_2}{1 + z_1^{h_1}} - z_2 \qquad (22)$$

where $z_1$ is the dimensionless concentration of repressor 1, $z_2$ is the dimensionless concentration of repressor 2, $\alpha_1$ and $\alpha_2$ are, respectively, the effective rates of synthesis of repressor 1 and 2, and $h_1$ and $h_2$ are the cooperativities of repression of promoters 1 and 2. We want to assess the capacity for bistability of the network with $h_1 = h_2 = 2$, and within ranges for the steady state dimensionless concentrations of (0.1, 10), taking $\alpha_1$ as bifurcation parameter. In order to accommodate the model equations (21)–(22) to the form (5), we set $k = [h_1, h_2]^T$, $w = \alpha_2$, and $\beta = \alpha_1$. The extended Jacobian (6), in this case, can be computed symbolically:

$$Q = \begin{pmatrix} -1 & -\dfrac{\alpha_1 h_2 z_2^{(h_2 - 1)}}{(1 + z_2^{h_2})^2} & \dfrac{1}{1 + z_2^{h_2}} \\[2ex] -\dfrac{\alpha_2 h_1 z_1^{(h_1 - 1)}}{(1 + z_1^{h_1})^2} & -1 & 0 \end{pmatrix} \qquad (23)$$
The objective function is defined starting from the matrix $Q$ as indicated in Subheading 2 (see Eqs. 7 and 8), and the search is formulated as the optimization problem (8)–(13). We set the following bounds for the parameters $\alpha_1$ and $\alpha_2$:

$$0.1 \le \alpha_1 \le 10, \quad 0.1 \le \alpha_2 \le 10$$

Fig. 3 Synthetic toggle switch based on two genes that mutually repress each other [13] (elements shown: Promoter 1, Repressor 1, Inducer 1, Promoter 2, Repressor 2, Inducer 2)

At each iteration of the optimization, the values of $\alpha_1$ and $\alpha_2$ are substituted in Eqs. 21–22 to compute the steady state concentrations $z_1$ and $z_2$. Then, the extended Jacobian (23) is computed, and its nullspace $v$ obtained. The objective function (8) is then computed from the $(n+1)$-th component of the vector $v$. In order to ensure that the steady state concentrations are within the desired ranges, we impose the following inequality constraints (12):

$$0.1 \le z_1 \le 10, \quad 0.1 \le z_2 \le 10$$

The nonconvex multimodal problem is solved with eSS, with the local solver fmincon (the decision variables are all real numbers). A minimum corresponding to $v_{n+1} = 0$ (i.e., a limit point) is found in less than 10 s. The values obtained for the parameters are $\alpha_1 = 4.0098$ and $\alpha_2 = 5.8798$. The corresponding steady state concentrations of the repressors are $z_1 = 2.8991$ and $z_2 = 0.6198$. Starting from this limit point, we compute the bifurcation diagram with Cl_Matcont, depicted in Fig. 4. Note that in this case we have fixed the cooperativity indices. If we want to consider the cooperativity indices (integer numbers) as decision parameters, the resulting problem is mixed integer nonlinear and we should use eSS with the local solver misqp [19].
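The evaluation of the objective function for the toggle switch can be sketched in a few lines. The Python example below is an illustration only, under these assumptions: it replaces Matlab's null with the cross product of the two rows of Q (valid for a full-rank 2 × 3 matrix), and it replaces the eSS/fsolve search with a bisection along the equilibrium curve, which for this model can be parametrized in closed form by z2 once α2 is fixed. The value α2 = 5.8798 is taken from the text; the bracketing interval for z2 is a hypothetical choice.

```python
import math

H1 = H2 = 2  # fixed cooperativities, k = [h1, h2]


def extended_jacobian(z1, z2, alpha1, alpha2):
    """Q = [D_z f, D_beta f] for Eqs. 21-22 with beta = alpha1 (Eq. 23)."""
    row1 = [-1.0,
            -alpha1 * H2 * z2 ** (H2 - 1) / (1.0 + z2**H2) ** 2,
            1.0 / (1.0 + z2**H2)]
    row2 = [-alpha2 * H1 * z1 ** (H1 - 1) / (1.0 + z1**H1) ** 2,
            -1.0,
            0.0]
    return [row1, row2]


def tangent(Q):
    """Unit null vector of a full-rank 2 x 3 matrix (cross product of its rows)."""
    a, b = Q
    v = [a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0]]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def objective(z1, z2, alpha1, alpha2):
    """J = (v_{n+1})^2, the squared last component of the tangent vector."""
    return tangent(extended_jacobian(z1, z2, alpha1, alpha2))[-1] ** 2


def equilibrium(z2, alpha2):
    """Steady state (z1, alpha1) for given z2 and alpha2, from Eqs. 21-22."""
    z1 = (alpha2 / z2 - 1.0) ** (1.0 / H1)  # Eq. 22 at steady state
    alpha1 = z1 * (1.0 + z2**H2)            # Eq. 21 at steady state
    return z1, alpha1


def last_component(z2, alpha2):
    z1, alpha1 = equilibrium(z2, alpha2)
    return tangent(extended_jacobian(z1, z2, alpha1, alpha2))[-1]


def find_limit_point(alpha2, lo, hi, tol=1.0e-12):
    """Bisection on v_{n+1} along the equilibrium curve (needs a sign change)."""
    f_lo = last_component(lo, alpha2)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = last_component(mid, alpha2)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
        if hi - lo < tol:
            break
    z2 = 0.5 * (lo + hi)
    z1, alpha1 = equilibrium(z2, alpha2)
    return z1, z2, alpha1


z1_lp, z2_lp, alpha1_lp = find_limit_point(5.8798, 0.60, 0.65)
print(z1_lp, z2_lp, alpha1_lp, objective(z1_lp, z2_lp, alpha1_lp, 5.8798))
```

A quick sanity check is the symmetric point z1 = z2 = 1 with α1 = α2 = 2, where the last component of the tangent vector vanishes exactly. As in the chapter, a point with J = 0 is only a candidate: continuation is still needed to confirm a fold bifurcation.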
Fig. 4 Bifurcation diagram for the toggle switch, starting from the solution (limit point) of the optimization problem. The limit point is indicated in black
3.2 Design of Synthetic Bistable Switches Starting from a Library of Parts

In this example, we start from a library containing eight different promoters (Plac1, Plac2, Plac3, Plac4, Pλ, Ptet1, Ptet2, Para), four transcripts (tetR, lacI, cI, and araC), and two different inducers (IPTG and aTc). The promoters and transcripts are denoted, respectively, by $P_1, \ldots, P_8$ and $R_1, \ldots, R_4$. This library is contained in SYNBADm, the kinetic parameters can be found elsewhere [1, 8], and the corresponding biocircuit model superstructure is of the form:

$$\frac{dz_j(t)}{dt} = E_j(t) + \Gamma_j(t) - K_j^{decay} z_j(t) \quad \forall j \qquad (24)$$

where $E_j$ is the expression term for the transcripts, $K_j^{decay} z_j$ is the degradation rate, and $\Gamma_j$ is the production/consumption rate of $z_j$ due to other reactions. The expression rates for the transcripts are:

$$E_j(t) = \sum_i Y_{ij} V_{ji}(t) \qquad (25)$$

where $V_{ji}$ is the rate of production of $R_j$ from $P_i$, and $Y_{ij}$ is a binary variable such that $Y_{ij} = 1$ if the production of protein $R_j$ from promoter $P_i$ is turned on and $Y_{ij} = 0$ otherwise. In this way, the topology of the circuit is given by an 8 × 4 superstructure matrix $Y$ containing 32 binary variables. We compute the vector of binary variables $y$ in Eq. 15 by converting the matrix $Y$ to a vector by columns.

Our goal is to find circuits (with a maximum of two devices) leading to bistable behavior. We choose the concentration of IPTG as bifurcation parameter, and we set $\beta_L = 1$, $\beta_U = 100$ as its lower and upper bounds in the optimization problem (20). All the remaining parameters are fixed (not tunable). Importantly, we set an additional constraint imposing the maximum number of devices that we allow in the final circuit (in this case 2). We run the optimization algorithm in a multistart strategy. We find two circuits leading to reversible switches, depicted in Fig. 5. Interestingly, the algorithm also finds a number of irreversible switches. Two of them are depicted in Fig. 6.
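The encoding of the circuit topology can be sketched as follows. This Python example is an illustration under stated assumptions: the helper names are hypothetical (they are not SYNBADm functions), each active promoter/transcript pair is assumed to count as one device, and the two-device circuit used here is an arbitrary example rather than one of the solutions reported in the chapter.

```python
PROMOTERS = ["Plac1", "Plac2", "Plac3", "Plac4", "Plambda", "Ptet1", "Ptet2", "Para"]
TRANSCRIPTS = ["tetR", "lacI", "cI", "araC"]
MAX_DEVICES = 2  # additional constraint used in the example of the text


def empty_superstructure():
    """8 x 4 binary matrix Y; Y[i][j] = 1 if promoter i drives transcript j."""
    return [[0] * len(TRANSCRIPTS) for _ in PROMOTERS]


def flatten_by_columns(Y):
    """Stack the columns of Y into the vector y (32 binary variables)."""
    return [Y[i][j] for j in range(len(TRANSCRIPTS)) for i in range(len(PROMOTERS))]


def unflatten(y):
    """Inverse of flatten_by_columns."""
    n = len(PROMOTERS)
    return [[y[j * n + i] for j in range(len(TRANSCRIPTS))] for i in range(n)]


def satisfies_device_limit(y, max_devices=MAX_DEVICES):
    """Assumed form of the device constraint: at most max_devices active pairs."""
    return sum(y) <= max_devices


def describe(Y):
    """List the active (promoter, transcript) pairs encoded by Y."""
    return [(PROMOTERS[i], TRANSCRIPTS[j])
            for i in range(len(PROMOTERS))
            for j in range(len(TRANSCRIPTS)) if Y[i][j] == 1]


# Hypothetical two-device circuit: Plac1 drives cI and Plambda drives lacI.
Y = empty_superstructure()
Y[PROMOTERS.index("Plac1")][TRANSCRIPTS.index("cI")] = 1
Y[PROMOTERS.index("Plambda")][TRANSCRIPTS.index("lacI")] = 1

y = flatten_by_columns(Y)
print(describe(Y), satisfies_device_limit(y))
```

With column stacking, the entry for promoter i and transcript j sits at position 8·j + i of y (zero-based), which is what a mixed integer solver would treat as its integer decision vector.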
4 Implementation Notes

Here we provide information about the software needed to reproduce our results. The scripts used to solve our examples above are available at Zenodo: https://doi.org/10.5281/zenodo.3466541

The additional software requirements are:

• Matlab version R2017 or later (https://www.mathworks.com)
• SYNBADm toolbox [8]
• Cl_Matcont [18]
Fig. 5 Reversible switches: circuits leading to bistable behavior (panels a and b) and the corresponding bifurcation diagrams, plotted against IPTG. The switch is reversible because both fold bifurcations lie in the positive orthant
We have performed all the computations using Matlab under Windows 7. The use of Matlab under other operating systems might require adapting the scripts.

SYNBADm is available at https://sites.google.com/site/synbadm. For both design and detection we make use of the optimization solvers contained in this toolbox. To install SYNBADm:

1. Unzip and copy the folder SYNBAD to the directory of your choice (please keep the original name of the SYNBAD folder).

2. From Matlab, change to the SYNBAD directory and run SYNBAD_Startup, which adds all the relevant files to the path. Remember to run SYNBAD_Startup in every new Matlab session.

3. Test the installation by executing any of the available examples contained in the Examples folder. If the installation is correct, the optimization runs and the corresponding results are stored in the file RESULTS_DESIGN.mat.

Cl_Matcont [18] is used to check that the solutions found by optimization are fold bifurcation points, and to compute the bifurcation diagram. To install Cl_Matcont:

1. Download the Cl_Matcont package (version cl_matcont5p4), freely available at https://sourceforge.net/projects/matcont/files/Oldcl_matcont/cl_matcont5p4/

2. Copy the folder cl_matcont5p4 to the directory of your choice, and add the folder and its subfolders to the MATLAB path.

Fig. 6 Irreversible switches: circuits leading to bistable behavior (panels a and b) and the corresponding bifurcation diagrams, plotted against IPTG. The switch is irreversible because one of the fold bifurcations occurs at a negative (unfeasible) value of the bifurcation parameter

4.1 Bistable Switch Detection

The files used to solve the example in this chapter are available in the folder Switch_Detection_Files. To solve the detection problem in Subheading 3.1:

1. Go to the subfolder Optimization and run Main_optim to detect bistability in the model given by Eqs. 21–22 under the specifications indicated in Subheading 3.1. The results of the optimization are stored in the file ess_report. The structure results contains, among others, the best solution found (results.xbest) and the corresponding value of the objective function (results.fbest).

2. Go to the subfolder Continuation, and open the file Main_cont. Introduce the solution found in the optimization step, such that par = results.xbest. The continuation algorithm is invoked and the bifurcation diagrams are automatically plotted.
4.2 Bistable Switch Design

The files used to solve the example in this chapter are available in the folder Switch_Design_Files.

1. Go to the subfolder Switch_Design_Files.

2. Copy the file OF_LimitPoint into the SYNBADm folder USR_ObjFuns.

3. Copy the file Switch_Design into the SYNBADm folder USR_Inputs.

4. From the SYNBADm main directory, call SYNBAD_Design_SO('Switch_Design'). The results are stored in RESULTS_DESIGN. The structure results contains, among others, the best solution found (results.xbest) and the corresponding value of the objective function (results.fbest).

5. Go to the subfolder Continuation, and open the file Main_cont. Introduce the solution found in the optimization step, such that sol = results.xbest. The continuation algorithm is invoked and the bifurcation diagrams are automatically plotted.

Funding: This research was funded by the Spanish Ministry of Science, Innovation and Universities and the FEDER/ERDF (project SYNBIOCONTROL, ref. DPI2017-82896-C2-2-R).

References

1. Dasika MS, Maranas CD (2008) OptCircuit: an optimization based method for computational design of genetic circuits. BMC Syst Biol 2:24
2. Pedersen M, Phillips A (2009) Towards programming languages for genetic engineering of living cells. J R Soc Interface 6:S437–S450
3. Marchisio MA, Stelling J (2011) Automatic design of digital synthetic gene circuits. PLOS Comp Biol 7:e1001083
4. Rodrigo G, Jaramillo A (2012) AutoBioCAD: full biodesign automation of genetic circuits. ACS Synth Biol 2:230–236
5. Marchisio MA (2014) Parts & Pools: a framework for modular design of synthetic gene circuits. Front Bioeng Biotechnol 2:42
6. Huynh L, Tagkopoulos I (2015) Fast and accurate circuit design automation through hierarchical model switching. ACS Synth Biol 4(8):890–897
7. Nielsen AAK, Der BS, Shin J et al (2016) Genetic circuit design automation. Science 352:aac7341
8. Otero-Muras I, Henriques D, Banga JR (2016) SYNBADm: a tool for optimization-based automated design of synthetic gene circuits. Bioinformatics 32(21):3360–3362
9. Watanabe L, Nguyen T, Zhang M et al (2019) iBioSim 3: a tool for model-based genetic circuit design. ACS Synth Biol 8(7):1560–1563
10. Pajaro M, Otero-Muras I, Vazquez C et al (2018) SELANSI: a toolbox for simulation of stochastic gene regulatory networks. Bioinformatics 34(5):893–895
11. Pajaro M, Otero-Muras I, Vazquez C et al (2019) Transient hysteresis and inherent stochasticity in gene regulatory networks. Nat Commun 10:4581
12. Kuznetsov YA (1998) Elements of applied bifurcation theory. Springer, New York
13. Gardner TS, Cantor CR, Collins JJ (2000) Construction of a genetic toggle switch in Escherichia coli. Nature 403:339–342
14. Gnügge R, Dharmarajan L, Lang M et al (2016) An orthogonal permease-inducer-repressor feedback loop shows bistability. ACS Synth Biol 5(10):1098
15. Egea JA, Henriques D, Cokelaer T et al (2014) MEIGO: an open-source software suite based on metaheuristics for global optimization in systems biology and bioinformatics. BMC Bioinf 15:136
16. Yordanov P, Stelling J, Otero-Muras I (2020) BioSwitch: a tool for the detection of bistability and multi-steady state behaviour in signalling and gene regulatory networks. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz746
17. Otero-Muras I, Banga JR (2018) Optimization-based prediction of fold bifurcations in nonlinear ODE models. IFAC-PapersOnLine 51(15):485–490
18. Govaerts W, Dhooge A, Kuznetsov Y et al (2003) Cl_MatCont: a continuation toolbox in Matlab. In: Proceedings of the 2003 ACM Symposium on Applied Computing, pp 161–166
19. Exler O, Schittkowski K (2007) A trust region SQP algorithm for mixed integer nonlinear programming. Optim Lett 1(3):269–280
Chapter 9

Monitoring Single S. cerevisiae Cells with Multifrequency Electrical Impedance Spectroscopy in an Electrode-Integrated Microfluidic Device

Zhen Zhu, Yangye Geng, and Yingying Wang

Abstract

This chapter describes an electrode-integrated microfluidic system with multiple functions of manipulating and monitoring single S. cerevisiae cells. In this system, hydrodynamic trapping and negative dielectrophoretic (nDEP) releasing of S. cerevisiae cells are implemented, providing a flexible method for single-cell manipulation. The multiplexing microelectrodes also enable sensitive electrical impedance spectroscopy (EIS) to discern the number of immobilized cells, classify different orientations of captured cells, as well as detect potential movements of immobilized single yeast cells during the overall recording duration by using principal component analysis (PCA) in data mining. The multifrequency EIS measurements can, therefore, obtain sufficient information of S. cerevisiae cells at single-cell level.

Key words: Saccharomyces cerevisiae, Negative dielectrophoresis, Electrical impedance spectroscopy, Single-cell analysis, Principal component analysis
1
Introduction Single-cell analysis, as an important method of studying cell heterogeneity [1], can reveal differences between individual cells and even obtain intracellular information. Flow cytometry [2] and microscopy are two commonly used methods for single-cell analysis. However, momentary measurements of flow cytometry limit its applications in long-term monitoring of single-cell dynamics. The labeling procedure of cells for microscopic imaging may interfere with cellular functions and processes. Microfluidics-based devices integrated multiple functions can provide a desirable platform for manipulation and subsequent analysis of single cells [3–5]. Electrical impedance spectroscopy (EIS), which is a noninvasive and label-free approach, can detect critical features, such as cell size and dielectric properties of single cells [6–11]. Cellular and even intracellular activities can be recorded by multifrequency EIS. At lower frequencies (from hundreds of kHz to MHz), cell size can be
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_9, © Springer Science+Business Media, LLC, part of Springer Nature 2021
detected, while information related to cell membrane capacitance can be gained at higher frequencies (several MHz). Moreover, intracellular information can be extracted from multifrequency EIS data at even higher frequencies. Therefore, EIS-integrated microfluidic systems can be used to measure flow-through or immobilized biological samples. Cell types [12–14], cell disease states [15], and differentiation states of single stem cells [16] can be characterized by microfluidic impedance cytometry. Combined with cell seeding or cell immobilization, EIS can be used to record the cellular response to drug treatment [17] or dynamic chemical perturbations [18] over an extended time period in a microfluidic device.
Modeling and analytical methods for single-cell EIS have been proposed accordingly [6, 19–21]. A single cell in suspension can be modeled as a single-shelled sphere consisting of cytoplasm and cell membrane, which are equivalent to a resistance and a capacitance, respectively [19]. Based on the equivalent circuit model and the application scenario, various methods have been used to analyze impedance spectra of single cells [22–26]. Maximum length sequence (MLS), a computational algorithm commonly used in the field of acoustics, has been used to perform data analysis of high-speed multifrequency EIS measurements in impedance cytometry [27]. Recently, artificial intelligence (AI) algorithms, such as support vector machines (SVMs) [16, 28] and neural networks [29], have emerged in EIS data analysis of single cells. Advances in computational algorithms in AI and data mining facilitate the comprehensive analysis of multifrequency EIS data, which is crucial for extracting cellular or subcellular information in single-cell analysis.
In this chapter, we present an electrode-integrated microfluidic system that can be used to manipulate and monitor single S. cerevisiae cells.
Single cells were reliably captured at the trapping orifices by hydrodynamic forces. nDEP and EIS functions were implemented in the microfluidic device by integrating perpendicular microelectrodes. Immobilized cells were selectively released by nDEP forces with an AC voltage applied to the respective microelectrode pairs. Moreover, we successfully recognized the number of immobilized cells at each trapping site by using EIS signals at 1 MHz after calibration measurements. By performing PCA of multifrequency EIS data, accurate classification of four different cell growth states and discrimination between cell motion and cell growth were achieved.
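To make the single-shell picture from the Introduction concrete, the following minimal Python sketch evaluates the impedance of a simplified equivalent circuit over frequency. The topology (membrane capacitance in series with cytoplasm resistance, both in parallel with a medium resistance) and all component values are assumptions chosen for illustration only; they are not parameters reported in this chapter, and electrode double-layer effects are ignored.

```python
import cmath
import math

def cell_impedance(freq_hz, r_medium=100e3, r_cytoplasm=50e3, c_membrane=1e-12):
    """Complex impedance (ohm) of a simplified single-shell cell circuit.

    Assumed topology: membrane capacitance in series with cytoplasm
    resistance, this branch in parallel with the medium resistance.
    All default component values are hypothetical.
    """
    omega = 2.0 * math.pi * freq_hz
    z_cell = r_cytoplasm + 1.0 / (1j * omega * c_membrane)
    return (z_cell * r_medium) / (z_cell + r_medium)

# Print magnitude and phase at a few frequencies inside the swept range.
for f in (1e4, 1e5, 1e6, 1e7):
    z = cell_impedance(f)
    print(f"{f:10.0f} Hz  |Z| = {abs(z):10.1f} ohm  "
          f"phase = {math.degrees(cmath.phase(z)):7.2f} deg")
```

In this toy circuit the magnitude falls from the medium resistance at low frequency toward the parallel combination of medium and cytoplasm resistances at high frequency, as the membrane capacitance is progressively short-circuited.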
2 Materials
2.1 Materials for Device Fabrication and Experiments
The materials used for device fabrication and experiments are described in our previous studies [30–33]. The components of the cell-culturing medium are as follows.
Cell-culturing medium: 0.17% w/v yeast nitrogen base (YNB) (Difco™), 0.5% w/v ammonium sulfate (Sigma-Aldrich), 2% w/v glucose sulfate (Sigma-Aldrich), and 0.5% w/v Pluronic F127 (Sigma-Aldrich) (see Note 1 for detailed steps of medium preparation).
2.2 Software
1. neMESYS software: neMESYS User Interface is used to precisely control flow rates of syringes fixed on the syringe pump (neMESYS, Cetoni GmbH, Germany). Download the neMESYS UserInterface from https://www.cetoni.com/service/software and then follow the recommended instructions to complete the installation.
2. ziControl software: ziControl software is used to communicate with the impedance spectroscope (HF2IS, Zurich Instruments AG, Switzerland) and the transimpedance amplifier (HF2TA, Zurich Instruments AG, Switzerland), set parameters of EIS measurements, and save EIS data. Download ziControl software (version 16.12 for Windows or Linux) from https://www.zhinst.com/downloads and then follow the recommended instructions to complete the installation.
3. Elveflow Smart Interface (ESI) software: ESI software is used to set pressure values via the pressure controller (OB1 MK3+, Elveflow, France). Download ESI software from https://www.elveflow.com/microfluidic-flow-control-products/flow-control-system/elveflow-software and then follow the recommended instructions to complete the installation.
4. MATLAB software: MATLAB software is used to load, process, and analyze the multifrequency EIS data.
3 Methods
3.1 Device Fabrication
Platinum (Pt) microelectrodes are first patterned on a glass substrate. Then, SiNx is deposited on the electrodes to form an insulation layer, on top of which the fluidic network is fabricated in SU-8 photoresist. Finally, a PDMS cover is reversibly bonded onto the SU-8 surface. The detailed fabrication procedure of the electrode-integrated microfluidic device (Fig. 1) is described in our previous work [30–33].
3.2 Experimental Setup
To set up the experimental system, the microfluidic device is first assembled with a custom-made aluminum (Al) holder, a poly(methyl methacrylate) (PMMA) cover, and a printed circuit board (PCB) (see Note 2). Through this setup, the device is connected to peripheral instruments (see Note 3) for optical imaging, cell trapping, cell releasing, cell culturing, and EIS measurements. Afterwards, two syringes, one with cell-culturing medium and one with cell suspension, are affixed on the syringe pump (see Note 4). Finally, tubing is used to connect the inlets, outlet, and pressure port on the device for fluidic access. Detailed steps are described in our previous work [30–33].
Fig. 1 Photograph and schematics of the microfluidic device. (a) Photograph of the microfluidic device with fluidic inlets and outlets indicated. (b) 3D exploded schematic of the five-layer device. (c) Schematic top view of the microfluidic device. Only a local area with two trapping sites is presented for illustration. Inset shows a trap with a single immobilized budding yeast cell
3.3 Trap Single S. cerevisiae Cells
1. Switch on the syringe pump and run its user interface. Set the flow rates of both cell-culturing medium and cell suspension to 5 μL/min. Perfuse the whole fluidic network with liquid to remove any air in the channels.
2. Switch on the pressure controller and run its user interface. Set the pressure to about 2000 Pa (negative and positive pressures are relative to atmospheric pressure if not specified otherwise) (see Note 5). Hold for 10 min to stabilize the system.
3. Set the flow rates of both cell-culturing medium and cell suspension to 0.5 μL/min. Position the field of view and focus on the bottom of the fluidic network (Fig. 2a). If any cell is immobilized at the trapping sites, release it by raising the pressure gradually.
4. Apply an underpressure of 6000 Pa (see Note 6). Hold until a cell is observed to be captured at the trapping site.
5. Once a single S. cerevisiae cell is captured at each trapping site, immediately raise the pressure to a weaker underpressure of around 5000 Pa (see Note 7 and Fig. 2b). Hold this pressure to keep the single S. cerevisiae cells immobilized.
6. If more than one cell or an unwanted cell is captured at some trapping site, carry out steps 7–10 (see Note 8).
7. Switch the operation mode to nDEP-release mode, and ground the common electrode through the PCB.
8. Switch on the tip electrode corresponding to the unwanted cell and apply a 10 MHz, 20 Vpp alternating current (AC) voltage from the signal generator to selectively release the corresponding S. cerevisiae cell via nDEP forces (see Note 9 and Fig. 2c, d).
9. Once the unwanted S. cerevisiae cell is released, switch off the applied AC voltage.
10. Repeat steps 4–6 to trap another single S. cerevisiae cell at the empty trapping site.
Fig. 2 Trapping and releasing of single S. cerevisiae cells. (a) Empty sites before capturing cells. (b) Single cells were immobilized at each trapping site by applying an underpressure of 6000 Pa. (c) The cell at the right site was released by nDEP force activated by a 10 MHz, 20 Vpp AC voltage. (d) The cell at the left site was released by nDEP force activated by a 10 MHz, 20 Vpp AC voltage. White arrows indicate the immobilized single yeast cells, while red arrows indicate empty traps. Scale bar is 30 μm [32]
3.4 Electrical Impedance Measurements
To perform EIS measurements of immobilized S. cerevisiae cells, the long common electrode is used as a stimulus electrode, on which a stimulus AC voltage is applied, and the individual tip electrode serves as a recording electrode, through which the response current is amplified and transformed to a voltage using the HF2TA transimpedance amplifier.
1. Switch on the impedance spectroscope and run its user interface.
2. Set parameters on the software interface. The stimulus AC voltage with an amplitude of 1 V sweeps over a frequency range from 10 kHz to 10 MHz including 92 sampling frequencies.
3. Switch the operation mode to EIS-detection mode, and apply the stimulus AC voltage to the common electrode through the PCB.
4. Select and activate the corresponding tip electrode.
5. Run a single sweep on the software interface (see Note 10).
6. Save the EIS data.
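For readers who want to reproduce the frequency axis when post-processing saved sweeps, the following Python sketch generates 92 logarithmically spaced frequencies between 10 kHz and 10 MHz. It assumes that both end points are included and that the spacing is uniform on a log10 axis; the exact grid produced by the instrument software is not specified in this chapter, so the frequencies stored with the data should take precedence.

```python
import math

def log_sweep(f_start_hz, f_stop_hz, n_points):
    """Return n_points frequencies spaced evenly on a log10 axis."""
    if n_points < 2:
        raise ValueError("need at least two sweep points")
    log_start = math.log10(f_start_hz)
    step = (math.log10(f_stop_hz) - log_start) / (n_points - 1)
    return [10.0 ** (log_start + i * step) for i in range(n_points)]

# Assumed sweep: 10 kHz to 10 MHz, 92 points, end points included.
frequencies = log_sweep(10e3, 10e6, 92)
print(len(frequencies), frequencies[0], frequencies[-1])
```

With this spacing, neighboring frequencies differ by a constant ratio rather than a constant offset, so each decade receives about the same number of points.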
3.5 Distinguish the Number of S. cerevisiae Cells at Each Trapping Site
Two vertically stacked cells obstruct more of the cross-sectional opening of the trapping orifice than a single cell. The number of S. cerevisiae cells at each trapping site can therefore be distinguished after calibration measurements.
1. Initialize the microfluidic system as in steps 1–3 of Subheading 3.3.
2. Initialize and set the impedance spectroscope as in steps 1–3 of Subheading 3.4.
3. Perform an EIS measurement of the specific empty trap as in steps 4–6 of Subheading 3.4.
4. Capture a single cell at the trapping site as in steps 4 and 5 of Subheading 3.3.
5. Perform an EIS measurement of the trap with a single immobilized cell as in steps 4–6 of Subheading 3.4.
6. Release the cell by raising the pressure gradually.
7. Perform an EIS measurement of the same empty trap as in steps 4–6 of Subheading 3.4.
8. Capture two vertically stacked yeast cells at the same trapping site as in steps 4 and 5 of Subheading 3.3.
9. Perform an EIS measurement of the trap with two vertically stacked yeast cells as in steps 4–6 of Subheading 3.4.
10. Release the cells by raising the pressure gradually.
11. Repeat steps 3–10 to measure an adequate number of samples.
12. Repeat the calibration procedure (steps 3–11) for each trapping site.
13. Run the custom-made MATLAB GUI software to plot the bar chart of relative amplitudes Ar (see Note 11 for the procedure of normalizing EIS data) at 1 MHz from the EIS calibration measurements, and plot the base line as the criterion for discriminating between the two cases of budding yeast cells for all traps (see Note 12 and Fig. 3).
14. After capturing yeast cells at several traps and performing the respective EIS measurements, press the Measuring Buttons to discriminate the number of immobilized cells at each trapping site (Fig. 3).
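The decision rule behind steps 13 and 14 can be sketched in a few lines of Python. This is not the authors' MATLAB GUI; it only applies the relative amplitude of Note 11 and the mid-point base value of Note 12. The calibration numbers are hypothetical, and the side of the base value that corresponds to a single cell is inferred from the calibration means rather than assumed.

```python
from statistics import mean

def relative_amplitude(amplitude_cell, amplitude_empty):
    """Ar = A / Ae, the amplitude relative to the empty trap (Note 11)."""
    return amplitude_cell / amplitude_empty

def base_value(ar_single, ar_double):
    """Mid-point of the mean Ar for one cell and two stacked cells (Note 12)."""
    return 0.5 * (mean(ar_single) + mean(ar_double))

def is_single_cell(ar_measured, ar_single, ar_double):
    """True if Ar lies on the single-cell side of the base value."""
    threshold = base_value(ar_single, ar_double)
    single_is_higher = mean(ar_single) > mean(ar_double)
    return (ar_measured > threshold) == single_is_higher

# Hypothetical calibration amplitudes at 1 MHz for one trap (arbitrary units).
empty = 1.00
cal_single = [relative_amplitude(a, empty) for a in (0.90, 0.91, 0.89)]
cal_double = [relative_amplitude(a, empty) for a in (0.80, 0.82, 0.81)]

print(base_value(cal_single, cal_double))
print(is_single_cell(relative_amplitude(0.88, empty), cal_single, cal_double))
```

Because the base value is calibrated per trap, the same functions would be called with a separate pair of calibration lists for each trapping site.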
Fig. 3 Custom-made MATLAB GUI software. The Plotting Button is used to plot a bar chart of relative EIS amplitudes Ar at all traps with a single cell and two vertically immobilized cells after calibration measurements. The base value of relative amplitude at each site is then displayed in the Base value tab. After cells are captured at specific sites and subsequent EIS measurements are performed, the corresponding Measuring Buttons are pressed to obtain the relative amplitudes of EIS signals at 1 MHz. By comparing Ar with base value, the status of each orifice (Y: a single cell, N: two or more immobilized cells) can be shown in Single tab [32]
3.6 Classify Cell Growth States of S. cerevisiae Cells
The sensitive EIS measurement can be used to distinguish the number of immobilized S. cerevisiae cells and classify cell growth states. The morphology of budding yeast cells changes with cell growth. Budding yeast cells can be simply classified into unbudded and budded cells. Cells with small buds tend to lie down flat on the horizontal substrate with the bud pointing inside or outside the trapping orifice under hydrodynamic forces. If the diameter of a bud is larger than the width of the trapping orifice, cells can be retained at the trap in a vertical position. Therefore, four typical orientations of immobilized single yeast cells (unbudded cell (UB), horizontally immobilized cell with bud inside the trap (HBI), horizontally immobilized cell with bud outside the trap (HBO), vertically immobilized cell with mother cell and bud stacked vertically (VB)) can be observed in our microfluidic device. To classify the four orientations of immobilized cells, EIS is used to measure cell growth states of S. cerevisiae cells.
3.6.1 Experimental Procedure
1. Initialize the microfluidic system as in steps 1–3 of Subheading 3.3.
2. Initialize and set the impedance spectroscope as in steps 1–3 of Subheading 3.4.
3. Perform an EIS measurement of the first empty trap as in steps 4–6 of Subheading 3.4.
4. Capture a single budding yeast cell at the first trapping site as in steps 4 and 5 of Subheading 3.3.
5. Perform an EIS measurement of the first trap with a single immobilized S. cerevisiae cell as in steps 4–6 of Subheading 3.4.
6. Capture the microscopic image of the immobilized cell.
7. Release the cell by raising the pressure gradually.
8. Repeat steps 3–7 until an adequate number of yeast cells with different orientations are measured.
3.6.2 Data Analysis
1. Plot the raw amplitude and phase versus frequency on a logarithmic axis (Fig. 4a).
2. Plot the relative amplitude and phase versus frequency on a logarithmic axis (Fig. 4b, c).
3. Discriminate VB cells: Select the frequencies at which the relative amplitude and phase vary most between the VB group and the other three groups. Then plot the relative amplitude at 1 MHz versus the relative phase at 4 MHz as a scatter plot to discriminate the VB cells from the complete data set (Fig. 4d).
4. Discriminate HBI cells: Select the frequencies at which the relative amplitude and phase vary most between the HBI group and the other two groups. Then plot the relative amplitude at 100 kHz versus the relative phase at 900 kHz as a scatter plot to discriminate the HBI cells from the complete data set (Fig. 4e).
Perform steps 5–10 to classify UB, HBI, and HBO cells by using PCA (see Note 13).
5. Represent each EIS measurement of a cell as a vector in a 184-dimensional feature space (92 dimensions for the relative amplitude and 92 dimensions for the relative phase at 92 frequencies).
6. Build an input matrix with column vectors of all measurements and subtract the mean from the data for each feature.
7. Compute the covariance matrix of all data points.
8. Compute the eigenvalues and eigenvectors of the covariance matrix.
9. Project each data point onto the two eigenvectors (i.e., principal components, PCs) corresponding to the two largest eigenvalues (see Note 14).
10. Classify the three groups of cells using linear discriminant analysis (LDA) and leave-one-out cross validation (LOOCV) (see Note 15 and Fig. 4f).
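Steps 6–9 above can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB code: measurements are stored as rows rather than columns, the eigenvectors are obtained by power iteration with deflation so that the sketch needs only the standard library (a linear-algebra library routine would normally be used instead), and the toy data set is hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def center_columns(rows):
    """Subtract the mean of every feature (rows = measurements)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance_matrix(centered):
    """Sample covariance matrix (d x d) of mean-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_eigenpairs(cov, k=2, iterations=500, seed=0):
    """Largest k eigenvalues/eigenvectors by power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    work = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        v = unit([rng.gauss(0.0, 1.0) for _ in range(d)])
        for _ in range(iterations):
            w = [dot(row, v) for row in work]
            if dot(w, w) == 0.0:
                break
            v = unit(w)
        eigenvalue = dot(v, [dot(row, v) for row in work])
        pairs.append((eigenvalue, v))
        # Deflation: remove the component just found before the next search.
        for i in range(d):
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs

def pca(rows, k=2):
    """Return PC scores, (eigenvalue, eigenvector) pairs, explained variance."""
    centered = center_columns(rows)
    cov = covariance_matrix(centered)
    pairs = leading_eigenpairs(cov, k)
    scores = [[dot(r, v) for _, v in pairs] for r in centered]
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    explained = sum(lam for lam, _ in pairs) / total_variance
    return scores, pairs, explained

# Hypothetical toy data: 5 measurements x 3 features (real data: N x 184).
rows = [[-1.5, -2.5, 0.0], [-2.0, 0.0, 0.0], [1.0, -1.0, 0.0],
        [0.0, 2.0, 0.0], [2.5, 1.5, 0.0]]
scores, pairs, explained = pca(rows)
print([round(lam, 3) for lam, _ in pairs], round(explained, 3))
```

The returned `explained` value corresponds to the fraction of variance captured by the retained PCs, which is the quantity quoted in Note 14. The sign of each eigenvector is arbitrary, so scores may appear mirrored between runs or implementations.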
3.7 Monitor Cell Growth/Motion
The sensitivity and stability of the microfluidic system enable real-time EIS monitoring of cell growth/motion. By performing PCA of multifrequency EIS data, cell growth and cell motion can be discerned.
Fig. 4 Classification of cell growth states of S. cerevisiae cells: UB (n = 34), HBI (n = 10), HBO (n = 13), and VB (n = 11). (a) Amplitude and phase spectra of raw signals over the swept frequency range from 10 kHz to 10 MHz. (b), (c) Relative amplitude and phase spectra. The colored curves and shaded regions in (a), (b), and (c) represent the mean values and standard deviations of the measurements, respectively. (d) Separation of VB cells by using the relative amplitude at 1 MHz versus the relative phase at 4 MHz shown in a scatter plot. (e) Separation of HBI cells by using the relative amplitude at 100 kHz versus the relative phase at 900 kHz shown in a scatter plot. Insets show images of cell samples. Buds are marked with arrowheads. Scale bar is 5 μm. (f) Classification of UB, HBI, and HBO cells by means of LDA on the full multifrequency data set (projections on first two PCs shown). Colored areas depict the regions in which a cell sample is assigned to the respective group, and dashed lines show classification boundaries [33]
3.7.1 Experimental Procedure
1. Initialize the microfluidic system as in steps 1–3 of Subheading 3.3.
2. Initialize and set the impedance spectroscope as in steps 1–3 of Subheading 3.4.
3. Perform an EIS measurement of the first empty trap as in steps 4–6 of Subheading 3.4.
4. Capture a single yeast cell at the first trapping site as in steps 4 and 5 of Subheading 3.3.
5. Perform an EIS measurement and microscopic imaging of the immobilized cell every 1 min, as in steps 5 and 6 of the Experimental Procedure in Subheading 3.6.1.
6. Stop monitoring the captured cell after a desired period.
7. Release the cell by raising the pressure gradually.
8. Repeat steps 3–7 to monitor and record more cells.
3.7.2 Data Analysis
1. Compute the first PC over all time-lapse measurements for each cell individually, following the steps described under Data Analysis in Subheading 3.6.2.
2. Extract from the complete data set the cells that are visually identified as either growing or moving. Average the first PCs of the samples in each group to obtain two vectors representing cell growth and cell motion.
3. Normalize the two vectors to unit length and orthogonalize them to obtain the motion vector and the growth vector (see Note 16).
4. Project all recorded EIS data onto the space spanned by the motion vector and the growth vector to identify cell growth and cell motion (see Note 17 and Fig. 5).
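Steps 3 and 4 can be illustrated with the short Python sketch below. The chapter does not state which orthogonalization scheme was used, so Gram-Schmidt with the motion direction kept fixed is an assumption here, and the three-dimensional vectors stand in for the real 184-dimensional ones. The shift of the first time point to (0, 0) follows Note 17.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def orthonormal_pair(motion_raw, growth_raw):
    """Gram-Schmidt (assumed): keep motion, remove its part from growth."""
    motion = normalize(motion_raw)
    growth_unit = normalize(growth_raw)
    overlap = dot(growth_unit, motion)
    growth = normalize([g - overlap * m for g, m in zip(growth_unit, motion)])
    return motion, growth

def project_recording(recording, motion, growth):
    """Project every time point and shift the first point to (0, 0)."""
    coords = [(dot(x, motion), dot(x, growth)) for x in recording]
    m0, g0 = coords[0]
    return [(m - m0, g - g0) for m, g in coords]

# Hypothetical averaged first PCs for moving and growing cells.
motion, growth = orthonormal_pair([2.0, 0.0, 0.0], [1.0, 1.0, 0.0])

# Hypothetical time-lapse recording of one cell (three time points).
recording = [[1.0, 1.0, 5.0], [2.0, 1.0, 5.0], [2.0, 3.0, 5.0]]
trajectory = project_recording(recording, motion, growth)
print(trajectory)
```

One practical caveat: eigenvector signs are arbitrary, so the first PCs of individual cells may need to be sign-aligned before they are averaged in step 2; how this was handled is not described in the chapter.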
Fig. 5 Identification of cell motion or cell growth by projecting multifrequency EIS data to cell growth and cell motion vectors. Rec 1 to Rec 5 show the bud growth of five budding yeast cells. Mot 1 and Mot 2 represent the cell motion/growth of two cells [33]
4 Notes
1. Steps of preparing cell-culturing medium: Dissolve 1.7 g YNB, 5 g ammonium sulfate, 20 g glucose sulfate, and 5 g F127 in 1 L deionized (DI) water. Filter through a 0.22-μm syringe filter for purification and sterilization. Store at 4 °C.
2. The custom-made PCB is used to switch the operation mode and select electrode pairs. It has two operation modes: nDEP-release mode and EIS-detection mode. In nDEP-release mode, the tip electrodes can be selectively connected to a signal generator (RIGOL DG1062Z, RIGOL Technologies, China) or ground, and the common electrode is grounded. In EIS-detection mode, the tip electrodes can be selectively connected to the input of the HF2TA transimpedance amplifier or ground, and the common electrode is connected to the output of the HF2IS impedance spectroscope.
3. We used an IX83 inverted microscope for imaging, a neMESYS syringe pump for cell perfusion, an OB1 MK3+ pressure controller for cell trapping, a RIGOL DG1062Z signal generator for cell releasing, and an HF2IS impedance spectroscope for EIS detection.
4. Dilute the cell suspension to a concentration of approximately 1 × 10^6 cells/mL in the cell-culturing medium.
5. If there is any air bubble in the fluidic network, set the pressure to a higher value to remove bubbles, relying on the gas permeability of the PDMS material.
6. Users should first roughly determine the range of applied underpressure needed to capture cells at the given flow rate. Since slight differences in channel geometry may affect the required pressure, users have to optimize the value of the pressure around 6000 Pa at the beginning of each experiment.
7. Under the laminar flow regime, cells in suspension can be kept away from the side of the channel with the traps. This prevents the trapping sites from capturing redundant cells. Users have to optimize the retaining pressure around 5000 Pa at the beginning of every experiment because of slight geometric differences between microfluidic devices.
8. To capture a single S. cerevisiae cell of a desired type (e.g., an unbudded cell or a cell with its bud in a particular orientation), nDEP releasing can be performed to selectively release unwanted samples.
9. To find the proper voltage for nDEP, users should adjust the amplitude of the AC voltage from the signal generator at different frequencies until the immobilized cells are released against the hydrodynamic forces. Since the hydrodynamic forces acting on the immobilized cells vary between traps, the parameters of the AC voltage may be slightly different for different traps.
10. The distribution of swept frequencies is logarithmic. The recorded EIS data is a complex signal displayed in the format of amplitude (A) and phase (θ).
11. The normalization of EIS data: Divide the amplitude of the EIS signal when a cell is trapped (A) by the amplitude signal of the empty trap (Ae) to obtain the relative amplitude (Ar = A/Ae). Subtract the phase signal of the empty trap (θe) from that of the trap with an immobilized cell (θ) to obtain the relative phase (θr = θ − θe).
12. The base value for each site is calculated as the mid-point of the mean Ar acquired from one cell and from two vertically stacked cells.
13. PCA is a commonly used method of dimensionality reduction, which can reduce the dimensions of a dataset to a manageable number by transforming correlated variables to uncorrelated ones without losing critical information [34, 35].
14. We choose the first two PCs so that 89% of the variance of the original data is represented in the projected subspace.
15. We use LDA to separate three classes of data points by maximizing the between-class variance and minimizing the within-class variance. Then, we use leave-one-out cross validation to estimate the classification accuracy of cell growth states.
16. The cell motion and bud growth can cause the largest variations in the EIS data along the respective vectors.
17. For each recording, we set the temporally first data point to (0, 0) for better illustration.
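The combination of LDA and leave-one-out cross validation described in Note 15 can be sketched as follows. This is a simplified illustration under stated assumptions: it works on two-dimensional points (for example, scores on the first two PCs), uses a pooled within-class covariance, and all data points are hypothetical. The caption of Fig. 4f indicates that the authors ran LDA on the full multifrequency data set, so this sketch should not be read as their implementation.

```python
import math

def fit_lda(points, labels):
    """Fit 2-D LDA: class means, priors, and inverse pooled covariance."""
    classes = sorted(set(labels))
    n = len(points)
    means, priors = {}, {}
    sxx = sxy = syy = 0.0
    for c in classes:
        members = [p for p, lab in zip(points, labels) if lab == c]
        mx = sum(p[0] for p in members) / len(members)
        my = sum(p[1] for p in members) / len(members)
        means[c] = (mx, my)
        priors[c] = len(members) / n
        for x, y in members:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    dof = n - len(classes)
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy * sxy
    inverse = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return means, priors, inverse

def predict_lda(model, point):
    """Assign the class with the largest linear discriminant score."""
    means, priors, inv = model

    def score(c):
        mx, my = means[c]
        ax = inv[0][0] * mx + inv[0][1] * my  # first entry of Sigma^-1 mu
        ay = inv[1][0] * mx + inv[1][1] * my  # second entry of Sigma^-1 mu
        return (point[0] * ax + point[1] * ay
                - 0.5 * (mx * ax + my * ay) + math.log(priors[c]))

    return max(means, key=score)

def loocv_accuracy(points, labels):
    """Leave one sample out, train on the rest, test on the held-out sample."""
    correct = 0
    for i in range(len(points)):
        model = fit_lda(points[:i] + points[i + 1:], labels[:i] + labels[i + 1:])
        if predict_lda(model, points[i]) == labels[i]:
            correct += 1
    return correct / len(points)

# Hypothetical 2-D scores for three groups of cells.
points = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2),
          (3.0, 0.0), (3.2, 0.2), (2.9, -0.1), (3.1, 0.1),
          (0.0, 3.0), (0.1, 3.2), (-0.2, 2.9), (0.2, 3.1)]
labels = ["UB"] * 4 + ["HBI"] * 4 + ["HBO"] * 4
print(loocv_accuracy(points, labels))
```

Refitting the model inside the loop is what keeps the held-out sample from influencing the class means and covariance used to classify it.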
Acknowledgments
This work was financially supported by the National Natural Science Foundation of China (No. 61774036), the National Key R&D Program of China (No. 2018YFF01012100), and the Fundamental Research Funds for the Central Universities.
References
1. Svahn HA, van den Berg A (2007) Single cells or large populations? Lab Chip 7:544–546
2. Macey MG (2007) Flow cytometry: principles and applications. Humana, Totowa
3. Murphy TW, Zhang Q, Naler LB, Ma S, Lu C (2017) Recent advances in the use of microfluidic technologies for single cell analysis. Analyst 143:60–80
4. Lecault V, White AK, Singhal A, Hansen CL (2012) Microfluidic single cell analysis: from promise to practice. Curr Opin Chem Biol 16:381–390
5. Yin H, Marshall D (2012) Microfluidics for single cell analysis. Curr Opin Biotech 23:110–119
6. Morgan H, Sun T, Holmes D, Gawad S, Green NG (2007) Single cell dielectric spectroscopy. J Phys D Appl Phys 40:61–70
7. Heileman K, Daoud J, Tabrizian M (2013) Dielectric spectroscopy as a viable biosensing tool for cell and tissue characterization and analysis. Biosens Bioelectron 49:348–359
8. Park H, Kim D, Yun KS (2010) Single-cell manipulation on microfluidic chip by dielectrophoretic actuation and impedance detection. Sensors Actuators B Chem 150:167–173
9. Haandbæk N, Bürgel SC, Heer F, Hierlemann A (2013) Characterization of subcellular morphology of single yeast cells using high frequency microfluidic impedance cytometer. Lab Chip 14:369–377
10. Schade-Kampmann G, Huwiler A, Hebeisen M, Hessler T, Berardino MD (2008) On-chip non-invasive and label-free cell discrimination by impedance spectroscopy. Cell Prolif 41:830–840
11. Sarró E, Lecina M, Fontova A, Gòdia F, Bragós R, Cairó JJ (2016) Real-time and on-line monitoring of morphological cell parameters using electrical impedance spectroscopy measurements. J Chem Technol Biotechnol 91:1755–1762
12. Holmes D, Pettigrew D, Reccius CH, Gwyer JD, van Berkel C, Holloway J, Davies DE, Morgan H (2009) Leukocyte analysis and differentiation using high speed microfluidic single cell impedance cytometry. Lab Chip 9:2881–2889
13. Holmes D, Morgan H (2010) Single cell impedance cytometry for identification and counting of CD4 T-cells in human blood using impedance labels. Anal Chem 82:1455–1461
14. Chen J, Zheng Y, Tan Q, Shojaei-Baghini E, Zhang YL, Li J, Prasad P, You L, Wu XY, Sun Y (2011) Classification of cell types using a microfluidic device for mechanical and electrical measurement on single cells. Lab Chip 11:3174–3181
15. Du E, Ha S, Diez-Silva M, Dao M, Suresh S, Chandrakasan AP (2013) Electric impedance microflow cytometry for characterization of cell disease states. Lab Chip 13:3903–3909
16. Song H, Wang Y, Rosano JM, Prabhakarpandian B, Garson C, Pant K, Lai E (2013) A microfluidic impedance flow cytometer for identification of differentiation state of stem cells. Lab Chip 13:2300–2310
17. Asphahani F, Wang K, Thein M, Veiseh O, Yung S, Xu J, Zhang MQ (2011) Single-cell bioelectrical impedance platform for monitoring cellular response to drug treatment. Phys Biol 8(1):015006
18. Malleo D, Nevill JT, Lee LP, Morgan H (2010) Continuous differential impedance spectroscopy of single cells. Microfluid Nanofluid 9:191–198
19. Sun T, Green NG, Morgan H (2008) Analytical and numerical modeling methods for impedance analysis of single cells on-chip. Nano 3:55–63
20. Tsai SL, Wang MH, Chen MK, Jang LS (2014) Analytical and numerical modeling methods for electrochemical impedance analysis of single cells on coplanar electrodes. Electroanalysis 26:389–398
21. Claudel J, Nadi M, Mazria OE, Kourtiche D (2016) An electrical model optimization for single cell flow impedance spectroscopy. Int J Smart Sensing Intelligent Syst 9:526–536
22. Caselli F, Bisegna P (2015) A simple and robust event-detection algorithm for single-cell impedance cytometry. IEEE Trans Biomed Eng 63:415–422
23. Sun T, Gawad S, Green NG, Morgan H (2007) Dielectric spectroscopy of single cells: time domain analysis using Maxwell's mixture equation. J Phys D Appl Phys 40:1–8
24. Gawad S, Sun T, Green NG, Morgan H (2007) Impedance spectroscopy using maximum length sequences: application to single cell analysis. Rev Sci Instrum 78:054301
25. Sun T, Gawad S, Bernabini C, Green NG, Morgan H (2007) Broadband single cell impedance spectroscopy using maximum length sequences: theoretical analysis and practical considerations. Meas Sci Technol 18:2859–2868
26. Zhao Y, Chen D, Luo Y, Li H, Deng B, Huang SB, Chiu TK, Wu MH, Long R, Hu H, Zhao X, Yue W, Wang J, Chen J (2013) A microfluidic system for cell type classification based on cellular size-independent electrical properties. Lab Chip 13:2272–2277
27. Sun T, Holmes D, Gawad S, Green NG, Morgan H (2007) High speed multifrequency impedance analysis of single particles in a microfluidic cytometer using maximum length sequences. Lab Chip 7:1034–1040
28. Ahuja K, Rather GM, Lin Z, Sui J, Xie P, Le T, Bertino JR, Javanmard M (2019) Toward point-of-care assessment of patient response: a portable tool for rapidly assessing cancer drug efficacy using multifrequency impedance cytometry and supervised machine learning. Microsyst Nanoeng 5:34
29. Zhao Y, Wang K, Chen D, Fan B, Huang C (2018) Development of microfluidic impedance cytometry enabling the quantification of specific membrane capacitance and cytoplasm conductivity from 100,000 single cells. Biosens Bioelectron 111:138–143
30. Zhu Z, Frey O, Haandbaek N, Franke F, Rudolf F, Hierlemann A (2015) Time-lapse electrical impedance spectroscopy for monitoring the cell cycle of single immobilized S. pombe cells. Sci Rep 5:17180
31. Zhu Z, Frey O, Hierlemann A (2018) Wideband electrical impedance spectroscopy (EIS) measures S. pombe cell growth in vivo. Methods Mol Biol 1721:135–153. https://doi.org/10.1007/978-1-4939-7546-4_13
32. Geng Y, Zhu Z, Wang Y, Wang Y, Ouyang S, Zheng K, Ye W, Fan Y, Wang Z, Pan D (2019) Multiplexing microelectrodes for dielectrophoretic manipulation and electrical impedance measurement of single particles and cells in a microfluidic device. Electrophoresis 40:1436–1445
33. Zhu Z, Frey O, Franke F, Haandbæk N, Hierlemann A (2014) Real-time monitoring of immobilized single yeast cells through multifrequency electrical impedance spectroscopy. Anal Bioanal Chem 406:7015–7025
34. Unpingco J (2016) Python for probability, statistics, and machine learning. Springer, Cham. https://doi.org/10.1007/978-3-319-30717-6
35. Sammut C, Webb GI (2017) Encyclopedia of machine learning and data mining. Springer, New York. https://doi.org/10.1007/978-1-4899-7687-1
Chapter 10
Construction of Protein Expression Network
Nor Afiqah-Aleng and Zeti-Azura Mohamed-Hussein
Abstract
In this post-genomic era, protein network can be used as a complementary way to shed light on the growing amount of data generated from current high-throughput technologies. Protein network is a powerful approach to describe the molecular mechanisms of the biological events through protein–protein interactions. Here, we describe the computational methods used to construct the protein network using expression data. We provide a list of available tools and databases that can be used in constructing the network.
Key words Protein network, Protein–protein interaction, PPI network, Gene expression, Protein expression
1 Introduction
Expression data are generated in order to identify the genes expressed under a particular condition. These data can be generated using approaches such as microarrays and RNA-seq, which can identify up to thousands of genes associated with specific phenotypes. They can also be used to screen gene expression patterns under various conditions of interest and to obtain biological information through Gene Ontology (GO) analysis [1]. Currently, the amount of expression data is increasing, and several datasets often refer to the same condition. Integrating expression data with protein–protein interaction (PPI) data is one approach for combining and analyzing the accumulated expression data. A PPI network is a biological network that is useful for combining different sets of expression data for more in-depth analysis of their biological meaning. A PPI network is an extensive collection of physical interactions between proteins. Currently available PPI networks have a high technical error rate, including incompleteness, false-positive and false-negative interactions, and missing interactions, which is caused by the transient and dynamic nature of the interactions [2]. Nevertheless, the application of PPI networks in big data
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_10, © Springer Science+Business Media, LLC, part of Springer Nature 2021
analysis remains a powerful approach to identify the functional features of the data. For example, in medical biology, the integration of expression data with PPI networks has been used to identify new disease-causal genes [3], prioritize genes [4], identify disease associations [5], develop disease modules or clusters [6], and predict new therapeutic targets [7]. Several studies have used the PPI network approach to analyze expression data. For example, in humans, PPI networks were constructed to understand the pathobiology of polycystic ovarian syndrome (PCOS) [5, 8], diabetes mellitus [9], endometrial cancer [10], breast cancer [7], Alzheimer's disease [11], and Parkinson's disease [12]. The same approach was also used to understand the biological processes in rice [13], maize [14], and Arabidopsis [15]. Here, we provide materials and methods to construct the PPI network using co-expression data. A PPI network can provide a comprehensive understanding of biological processes, which in turn is very useful for identifying genetic markers or biomarkers that can be further studied using bioengineering or synthetic biology in the future.
2
Materials
All the materials described below are publicly available and compatible with Windows and Mac operating systems.
2.1 Source of the Expression Data
There are two main sources from which gene expression datasets can be retrieved: (1) annotated gene expression databases and (2) literature databases.
1. ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) [16], a public database from the European Bioinformatics Institute (EBI) that can be used to search and submit gene expression datasets from a variety of organisms and platforms.
2. Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/) [17], a functional genomics repository from the National Center for Biotechnology Information (NCBI) that can be used to search and submit gene expression datasets from a variety of organisms and platforms.
3. Google Scholar (https://scholar.google.com/), a web search engine that contains scholarly articles, including functional genomics articles.
4. PubMed (https://www.ncbi.nlm.nih.gov/pubmed/), a literature database from NCBI that can be used to retrieve functional genomics studies.
Protein Expression Network Construction
2.2
Data Processing
1. BioMart (https://asia.ensembl.org/info/data/biomart/index.html) [18], a web-based data mining tool from EBI used to unify the annotation identification (ID) and extract other biological data.
2. UniProt Mapping (https://www.uniprot.org/uploadlists/) [19], a web-based tool to retrieve gene/protein IDs and other biological data.
3. AnnotationDbi [20], an R package to obtain gene/protein IDs and functional annotation (see Note 1).
4. DAVID (https://david.ncifcrf.gov/) [21], an online tool for gene/protein functional annotation.
2.3 Biological Data Annotation
1. Gene and protein information: NCBI (https://www.ncbi.nlm.nih.gov) [22], Ensembl (https://ensembl.org) [23], UniProt (https://www.uniprot.org/) [24], and GeneCards (https://www.genecards.org/) [25].
2. Gene ontology (GO), which includes biological process, molecular function, and cellular component: GO Consortium (http://geneontology.org/) [26].
3. Pathway: Kyoto Encyclopedia of Genes and Genomes (KEGG) (https://www.kegg.jp/) [27], BioCyc (https://biocyc.org/) [28], BioCarta (https://cgap.nci.nih.gov/Pathways/BioCarta) [29], and WikiPathways (https://www.wikipathways.org) [30].
4. Protein domain: InterPro (https://www.ebi.ac.uk/interpro) [31].
5. PPI information: Biological General Repository for Interaction Datasets (BioGRID) [32], Database of Interacting Proteins (DIP) [33], GeneMANIA [34], Human Integrated Protein–Protein Interaction Reference (HIPPIE) [35], Human Protein Reference Database (HPRD) [36], Interologous Interaction Database (I2D) [37], IntAct [38], Mammalian Protein–Protein Interaction Database (MIPS) [39], Molecular Interaction database (MINT) [40], and STRING [41] (see Note 2).
6. Tissue and cell localization: Human Protein Atlas (https://www.proteinatlas.org) [42].
7. Fuzzy lookup (https://www.microsoft.com/en-my/download/details.aspx?id=15011), an add-in for Microsoft Excel that can be used to match data from two different datasets. This add-in has to be downloaded from the given link and installed in Microsoft Excel. It can process a dataset with up to 1.2 million rows.
8. Power Query, an add-in for Microsoft Excel with a function similar to that of Fuzzy lookup (matching data between two different datasets). However, Power Query can process a dataset with more than 1.2 million rows (see Note 3).
2.4 Construction and Analysis of a Protein Network
1. Cytoscape [43], free software for network construction, analysis, integration, and visualization (see Note 4).
2. MCODE [44], a network clustering algorithm that can be installed from Cytoscape apps.
3. ClusterMaker [45], a Cytoscape app that contains multiple network clustering algorithms.
4. ClusterONE [46], a Cytoscape app to cluster the PPI network.
5. ClueGO [47], a Cytoscape app for functional (GO, pathway, and domain) enrichment analysis of the network.
6. BiNGO [48], a Cytoscape app for GO enrichment analysis of a network.
7. PesCa [49], a Cytoscape app used to find the shortest paths between proteins in the network.
3
Methods
3.1 Identification of Expression Datasets
Gene expression datasets are identified using keyword searches. Here we use examples from a PCOS–disease association study, in which expression data for polycystic ovarian syndrome (PCOS) were searched using keywords related to PCOS, e.g., “polycystic ovary syndrome,” “PCOS,” “polycystic ovary syndrome 1,” “polycystic ovarian syndrome,” “polycystic ovaries,” “PCO,” “PCO1,” “Stein-Leventhal,” “Stein Leventhal,” “Stein-Leventhal syndrome,” “polycystic ovary disease,” “polycystic ovarian disease,” “PCOD,” “multicystic ovaries,” “sclerocystic ovarian degeneration,” and “sclerocystic ovary syndrome.” The keywords are entered into the search box or search menu provided by the gene expression databases (ArrayExpress and GEO) in order to search for expression datasets (Fig. 1a). The returned gene expression data are selected for further analysis. Expression datasets are also searched for in literature databases such as Google Scholar and PubMed using combinations of keywords such as “gene expression,” “expression,” “transcriptomics,” “functional genomics,” “transcriptional profiling,” “profiling,” and “microarray” with the phenotype of interest, in this case, PCOS. The selected keywords are entered into the search box of the literature databases (Fig. 1b). The lists of genes from the expression datasets are provided in the articles or their supplementary files (see Note 5).
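The keyword combinations described above can be assembled programmatically before they are pasted into a search box. The following is a minimal sketch, assuming that the target database accepts quoted phrases joined by the Boolean operators OR and AND; the helper name build_query is hypothetical and is not part of any database interface.

```python
# Hypothetical helper: combine phenotype synonyms with expression-related
# terms into one Boolean query string for a literature search box.

def build_query(phenotype_terms, expression_terms):
    """Return '(p1 OR p2 ...) AND (e1 OR e2 ...)' with each phrase quoted."""
    def or_group(terms):
        # Quote every phrase so that multi-word terms are searched as a unit.
        return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"
    return or_group(phenotype_terms) + " AND " + or_group(expression_terms)

# A subset of the keywords listed in the text, used only for illustration.
phenotype = ["polycystic ovary syndrome", "PCOS", "Stein-Leventhal syndrome"]
expression = ["gene expression", "transcriptomics", "microarray"]

query = build_query(phenotype, expression)
print(query)
```

Keeping the two keyword groups in separate lists makes it easy to reuse the expression-related terms with a different phenotype of interest.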
Fig. 1 Search box in (a) expression databases and (b) literature databases. The keywords are entered into this box to retrieve the expression datasets
Most gene expression datasets in these databases (gene expression databases and literature databases) list only the differentially expressed genes (DEGs) with their fold changes, without annotation or analysis.
3.2
Data Processing
Gene expression datasets record DEGs with a variety of IDs, including NCBI Gene ID/Entrez ID, UniProt ID, gene symbol, gene name or protein name, UniGene ID, RefSeq ID, and/or array probe ID. Some datasets record the DEGs with only one type of ID, whereas others provide several (Fig. 2). To unify the IDs of the DEGs, we mapped the IDs from the different datasets onto UniProt IDs using UniProt Mapping, BioMart, AnnotationDbi, or DAVID. Several ID mapping tools are required because each tool draws on different data and none provides comprehensive gene information. Some genes can be mapped using UniProt Mapping and BioMart but not DAVID and AnnotationDbi, or vice versa. Hence, a combination of tools can increase the number of matched DEGs. After the gene IDs were unified into UniProt IDs, all genes were organized in a single dataset. Overlapping genes and genes without a UniProt ID were removed prior to constructing the PPI network.
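The unification step can be pictured as a lookup that tries several mapping tables in turn. The sketch below is illustrative only: the two dictionaries stand in for results exported from two ID mapping tools, the function names are hypothetical, and it assumes that the entries dropped are duplicates and IDs that no table can resolve to a UniProt accession.

```python
# Toy mapping tables standing in for the exported output of two ID-mapping
# tools. The entries are illustrative placeholders, not a curated mapping.
MAP_TOOL_A = {"INS": "P01308", "3630": "P01308"}
MAP_TOOL_B = {"AR": "P10275"}

def to_uniprot(gene_id, mapping_tables):
    """Return the first UniProt accession found for gene_id, else None."""
    for table in mapping_tables:
        if gene_id in table:
            return table[gene_id]
    return None

def unify(datasets, mapping_tables):
    """Merge DEG lists, drop unmapped IDs and duplicates, keep first-seen order."""
    unified, unmapped = [], []
    seen = set()
    for deg_list in datasets:
        for gene_id in deg_list:
            accession = to_uniprot(gene_id, mapping_tables)
            if accession is None:
                unmapped.append(gene_id)      # no tool could resolve this ID
            elif accession not in seen:
                seen.add(accession)           # first time this protein is seen
                unified.append(accession)
    return unified, unmapped

# Two hypothetical datasets that use different ID types for the same gene.
datasets = [["INS", "AR"], ["3630", "UNKNOWN_PROBE"]]
unified, unmapped = unify(datasets, [MAP_TOOL_A, MAP_TOOL_B])
print(unified)
print(unmapped)
```

Returning the unmapped IDs separately makes it easy to pass them to another mapping tool, which mirrors the advice to combine several tools.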
Fig. 2 Examples of accession identifiers (IDs) used in gene expression datasets. Each dataset uses a different accession ID, and the IDs need to be unified to ease subsequent analysis
3.3 Biological Data Annotation
The biological information (e.g., chromosomal location, gene ontologies, pathway, protein domain, PPI, and tissue localization) of all collected genes is annotated in order to understand the function of the genes. This information is obtained from the databases listed in the Materials. Genes in biological information databases such as KEGG, GO Consortium, and InterPro are recorded with different types of ID. Thus, the genes are mapped to UniProt IDs before the DEGs are annotated with their biological information. The DEGs and the biological information are then matched using Fuzzy lookup or Power Query. Figure 3 describes the steps in collecting the gene expression datasets and their biological information.
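The matching step amounts to joining two tables on a shared key. The following is a minimal sketch of such a join in plain Python, assuming both tables are already keyed by UniProt ID; it performs exact matching only (not the approximate matching that Fuzzy lookup also offers), and the field names and annotation values are hypothetical.

```python
# DEG table: one row per protein, keyed by UniProt accession.
degs = [
    {"uniprot": "P01308", "expression": "up"},
    {"uniprot": "P10275", "expression": "down"},
]

# Annotation table keyed by UniProt accession (values are placeholders).
annotation = {
    "P01308": {"symbol": "INS", "pathway": "example pathway A"},
}

def annotate(deg_rows, annotation_by_id, fields):
    """Left join: keep every DEG row, fill missing annotation with ''."""
    merged = []
    for row in deg_rows:
        info = annotation_by_id.get(row["uniprot"], {})
        new_row = dict(row)                  # copy so the input stays unchanged
        for field in fields:
            new_row[field] = info.get(field, "")
        merged.append(new_row)
    return merged

table = annotate(degs, annotation, ["symbol", "pathway"])
for row in table:
    print(row)
```

A left join is used here so that DEGs without annotation remain visible in the output instead of silently disappearing.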
3.4 Construction of PPI Network
This method provides two approaches to construct a protein network (see Note 6). Before choosing an approach, it is essential to select a PPI database from which to retrieve the appropriate PPI information. Some PPI databases compile PPI information from wet-lab experiments, whereas others provide integrated and curated PPI information from other databases and PPI experiments. The majority of PPI databases list several types of interactions
Fig. 3 Example steps to obtain gene expression datasets and the biological information of differentially expressed genes
such as physical PPI, predicted PPI, and co-expression interactions. Several criteria must be considered when selecting suitable PPI databases for expression data (see Note 7): the organism; the source of the interactions (known physical interactions, either from curated PPI databases or experimentally determined; predicted interactions from neighborhood, fusion, or co-occurrence; co-expression; protein homology; or text mining); and the interaction score (each PPI database provides a score for each PPI that represents the strength of the evidence for it). The first way to construct a protein network is to integrate the obtained DEGs with the interactome of the organism. The PPI information is first downloaded from a PPI database. For example, to construct a PPI network for human diseases, the human interactome can be downloaded from HIPPIE, which integrates human PPI information from curated PPI databases and PPI studies. The IDs of the proteins in the downloaded PPI data are mapped to UniProt IDs so that they match those of the obtained DEGs. A PPI dataset must have at least two columns, (1) source and (2) target, both of which should contain the UniProt IDs of proteins. The organized PPI dataset, in .xls, .csv, or .txt format, is then imported into Cytoscape by clicking File → Import → Network → File (Fig. 4a). The imported
Fig. 4 View of Cytoscape on (a) how to construct a network using downloaded PPI information, (b) how to add other information to the network, (c) the columns in table panel and (d) how to construct a network using PPI databases in Cytoscape
PPI network was named the human interactome. The Cytoscape dashboard has three panels: control, table, and result. The control panel can be used to edit the features of a network. The table panel consists of three tabs: node, edge, and network. In this case, the UniProt IDs of the proteins in the human interactome can be seen in the node tab of the table panel. The result panel contains the results of analyses from installed Cytoscape apps. To integrate the human interactome with the obtained DEGs, a dataset that contains the UniProt IDs of the DEGs, e.g., DEGs of PCOS, is imported into the human interactome by clicking File → Import → Table (Fig. 4b). The DEG dataset should be composed of two columns: (1) UniProt ID and (2) expression (either up- or downregulated). There are now two columns in the node tab of the table panel: (1) shared name, which is the UniProt ID of a protein in the network, and (2) expression of the DEGs (Fig. 4c). The human interactome integrated with the DEGs of PCOS is referred to as the PCOS PPI network. The PCOS PPI network consists of two types of protein: (1) PCOS proteins and (2) non-PCOS proteins. Generally, the number of non-PCOS
proteins in the network is much higher than the number of PCOS proteins, which can add noise to the network. Thus, PesCa, a Cytoscape app, is used to reduce the noise by finding the shortest paths between PCOS proteins in the network. The second approach to constructing a network is to submit the obtained DEGs to a network database from within Cytoscape by clicking File → Import → Network → Public Database (Fig. 4d). There are two main network databases in Cytoscape: (1) Universal Interaction Database Client, which is provided with Cytoscape and gives access to several curated PPI databases such as IntAct [38], GeneMANIA [34], BioGRID [32], and the Molecular Interaction database (MINT) [40], and (2) STRING, which first has to be installed via Cytoscape apps. Both databases search for interaction information on the submitted DEGs, and the PPI network of the obtained DEGs is then constructed. The expression of the DEGs should be added to the node tab of the table panel by importing a two-column DEG dataset by clicking File → Import → Table. The network constructed using this approach consists of one type of protein, e.g., PCOS proteins. It can contain many unconnected components, duplicated edges, self-loops, and isolated nodes, all of which are removed to ease further network analysis (Fig. 5) (see Note 8).
3.5 Network Visualization
After constructing the network, several of its components should be edited so that it conveys the appropriate information. Two main things need to be considered: (1) the information in the table panel and (2) the node and edge styles. In the table panel, more information can be added to the node, edge, and network tabs. Each tab has a column, shared name, which holds the unique ID of the node, edge, or network. As mentioned, the node tab of the network has two columns: shared name (UniProt ID) and expression. Other information, such as the gene symbol, can be added to the node tab to serve as the label of each node in the network. The biological information annotated to the DEG datasets can also be added to the node tab in order to group the nodes by that information. These datasets are added in the table panel, and they should contain a column of shared names or unique IDs that match the shared names in the network. In the control panel of Cytoscape, the Style tab consists of three tabs: (1) node, (2) edge, and (3) network. The Style tab allows users to edit features such as the color, size, shape, label font, label font size, label color, and location of nodes, edges, and the network. Under the node and edge tabs in the Style tab, there are three mapping types: passthrough, discrete, and continuous. Passthrough mapping passes the value of a table column directly to a style property, so it is usually used to specify the node label. For instance, all nodes in the network can be labeled
Fig. 5 Types of interactions that are removed before analyzing the network
with the gene symbol using a passthrough mapping. Discrete mapping assigns node or edge styles according to the categorical values of a particular column in the table panel. This mapping allows users to differentiate up- and downregulated DEGs by color or shape. Discrete mapping also helps users represent biological information with a variety of styles. Continuous mapping applies to table columns with numerical values, such as the fold change of expression or the number of proteins involved in a certain GO term or pathway. Users can set the size or the color brightness of nodes or edges according to the numerical values in a given column of the table panel. Cytoscape also provides a Select tab (beside the Style tab) in the control panel to ease the selection of a group of nodes or edges according to information in the table panel. Users can then use discrete or continuous mapping to set specific styles for the selected nodes or edges.
3.6 Network Analysis
The PPI network is analyzed using several Cytoscape apps. Network clustering and enrichment analysis are examples of network analysis approaches. Network clustering uses topological parameters such as the number of interactions, density, or shortest
path to cluster the protein network. Network clustering is performed using clustering tools such as MCODE, ClusterONE, and clusterMaker (see Note 9), which cluster the network based on specific topological parameters. Enrichment analysis is performed on the PPI network or on PPI clusters to determine their biological function. We used BiNGO and ClueGO, both of which are Cytoscape apps, to perform the enrichment analysis. Both apps can identify significant GO terms (biological process, molecular function, and cellular component) that relate to the network or clusters. ClueGO has additional features that enable users to identify significant pathways or domains that might be involved in the network or clusters. ClueGO and BiNGO provide several statistical tests (hypergeometric test and binomial test) and multiple-testing corrections (Bonferroni and Benjamini-Hochberg false discovery rate) for the enrichment analysis. The enriched GO terms and pathways can be used to describe the biological significance of the network or the generated clusters (see Note 10).
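To make the statistics behind such an enrichment analysis concrete, the sketch below computes a one-sided hypergeometric p-value for a single GO term and then applies the Benjamini-Hochberg adjustment to a list of p-values. It is a simplified illustration written for this text, not the code used by BiNGO or ClueGO; the function names and the toy counts are assumptions.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated proteins in a
    network of n proteins, when K of the N background proteins carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity is enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy numbers (hypothetical): 200 background proteins, 20 network proteins,
# and three GO terms given as (K annotated in background, k in network).
N, n = 200, 20
terms = {"term_A": (40, 10), "term_B": (50, 6), "term_C": (10, 1)}
pvals = [hypergeom_pvalue(k, n, K, N) for K, k in terms.values()]
adj = benjamini_hochberg(pvals)
for name, p, q in zip(terms, pvals, adj):
    print(name, p, q)
```

The adjustment matters because hundreds of GO terms are typically tested at once, so unadjusted p-values would overstate the number of significant terms.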
4
Notes
1. The latest version of R and several other packages need to be installed in order to use AnnotationDbi. R will ask to install the specific packages that AnnotationDbi requires.
2. This method provides several examples of databases and tools that can be used to construct the protein network of expression data. There might be several other databases that are suitable for your data. Users can search for other databases and combine several different databases to obtain complete information.
3. Fuzzy lookup and Power Query have specific operating system requirements and require preinstalled software. Users can refer to Microsoft’s website for details.
4. It is compulsory to install Java in order to use Cytoscape. Some users might have problems with Java or other systems while using the latest Cytoscape. If such problems occur, users can install an older version of Cytoscape.
5. Each expression dataset uses different parameters, such as biological samples and platforms, that influence the gene expression. Hence, several selection criteria have to be determined before compiling the expression data.
6. This method provides several approaches to construct and analyze the PPI network. It is crucial to have a complete understanding of the obtained data before choosing the approach, database, and tool.
7. A specific range of scores is required to reduce the size of the protein network and the number of false-positive PPIs.
8. It is very important to double-check the operating system requirements in order to avoid system crashes or freezing when a protein network contains a massive number of interactions.
9. Several clustering algorithms, such as Markov Cluster (MCL) (provided in the clusterMaker Cytoscape app), require high computer specifications to cluster the network.
10. The enrichment analysis identifies lists of significant GO terms and pathways. These lists can contain up to hundreds of entries. Users need prior knowledge of the research area in order to analyze them further.
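Notes 7 and 8 both concern the size of the network, and the clean-up described in Subheading 3.4 can also be done on the edge list before it is imported into Cytoscape. The following is a minimal sketch under stated assumptions: the edge list is a sequence of (source, target, score) tuples, interactions are undirected, and the threshold of 0.5 is an arbitrary placeholder, not a recommended cutoff for any particular database. Unconnected components are not handled here.

```python
def clean_network(edges, min_score=0.0):
    """Drop self-loops, low-score edges, and duplicated edges.

    edges: iterable of (source, target, score) tuples.
    Returns (cleaned_edges, nodes); nodes that lose all their edges
    disappear because the node list is rebuilt from the kept edges.
    """
    kept = {}
    for source, target, score in edges:
        if source == target:           # self-loop
            continue
        if score < min_score:          # interaction below the chosen cutoff
            continue
        key = frozenset((source, target))   # undirected: A-B equals B-A
        # Among duplicated edges, keep the highest score.
        if key not in kept or score > kept[key]:
            kept[key] = score
    cleaned = sorted(tuple(sorted(pair)) + (score,) for pair, score in kept.items())
    nodes = sorted({node for a, b, _ in cleaned for node in (a, b)})
    return cleaned, nodes

# Hypothetical edge list with a duplicate, a self-loop, and a weak edge.
raw_edges = [
    ("A", "B", 0.9),
    ("B", "A", 0.7),
    ("A", "A", 1.0),
    ("C", "D", 0.2),
    ("B", "C", 0.8),
]
cleaned, nodes = clean_network(raw_edges, min_score=0.5)
print(cleaned)
print(nodes)
```

Filtering before import keeps the file passed to Cytoscape small, which is the practical concern raised in Note 8.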
Acknowledgments
This work was supported by Malaysia Ministry of Higher Education (FRGS/1/2014/SGD5/UKM/02/6 and FRGS/1/2018/STG05/UKM/02/7) and Universiti Kebangsaan Malaysia (DIP-2018-004), awarded to Zeti-Azura Mohamed-Hussein.
References
1. Zhu Z, Jin Z, Deng Y et al (2019) Co-expression network analysis identifies four hub genes associated with prognosis in soft tissue sarcoma. Front Genet 10:37
2. Robinson JL, Nielsen J (2016) Integrative analysis of human omics data using biomolecular networks. Mol BioSyst 12:2953–2964
3. Eguchi R, Karim MB, Hu P et al (2018) An integrative network-based approach to identify novel disease genes and pathways: a case study in the context of inflammatory bowel disease. BMC Bioinformatics 19:264
4. Tang X, Hu X, Yang X et al (2016) Predicting diabetes mellitus genes via protein-protein interaction and protein subcellular localization information. BMC Genomics 17:433
5. Ramly B, Afiqah-Aleng N, Mohamed-Hussein Z-A (2019) Protein–protein interaction network analysis reveals several diseases highly associated with polycystic ovarian syndrome. Int J Mol Sci 20:2959
6. Barrenäs F, Chavali S, Alves AC et al (2012) Highly interconnected genes in disease-specific networks are enriched for disease-associated polymorphisms. Genome Biol 13:R46
7. Alshabi AM, Vastrad B, Shaikh IA, Vastrad C (2019) Exploring the molecular mechanism of the drug-treated breast cancer based on gene expression microarray. Biomol Ther 9:282
8. Afiqah-Aleng N, Altaf-Ul-Amin M, Kanaya S et al (2019) Polycystic ovarian syndrome novel proteins and significant pathways identified using graph clustering approach. Reprod Biomed Online 40(2):319–330
9. Ding L, Fan L, Xu X et al (2019) Identification of core genes and pathways in type 2 diabetes mellitus by bioinformatics analysis. Mol Med Rep 20:2597–2608
10. Li W, Wang S, Qiu C et al (2019) Comprehensive bioinformatics analysis of acquired progesterone resistance in endometrial cancer cell line. J Transl Med 17:58
11. Wu M, Fang K, Wang W et al (2019) Identification of key genes and pathways for Alzheimer’s disease via combined analysis of genome-wide expression profiling in the hippocampus. Biophys Rep 5:98–109
12. Tan C, Liu X, Chen J (2018) Microarray analysis of the molecular mechanism involved in Parkinson’s disease. Parkinsons Dis 2018:1590465
13. Zinati Z, Delavari A (2019) Identification of candidate genes related to aroma in rice by analyzing the microarray data of highly aromatic and nonaromatic recombinant inbred line bulks. Biotechnologia 100:227–240
14. Zhu G, Wu A, Xu XJ et al (2016) PPIM: a protein-protein interaction database for maize. Plant Physiol 170:618–626
15. Wang Y, Thilmony R, Zhao Y et al (2014) AIM: a comprehensive Arabidopsis interactome module database and related interologs in plants. Database 2014:bau117
16. Athar A, Füllgrabe A, George N et al (2019) ArrayExpress update - from bulk to single-cell expression data. Nucleic Acids Res 47:D711–D715
17. Barrett T, Wilhite SE, Ledoux P et al (2013) NCBI GEO: archive for functional genomics data sets - update. Nucleic Acids Res 41:991–995
18. Smedley D, Haider S, Ballester B et al (2009) BioMart - biological queries made easy. BMC Genomics 10:22
19. Bateman A, Martin MJ, O’Donovan C et al (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45:D158–D169
20. Pages H, Carlson M, Falcon S et al (2018) AnnotationDbi: annotation database interface. R Packag. version 1.42.1 1471–2164
21. Huang DW, Sherman BT, Lempicki RA (2008) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4:44–57
22. Brown GR, Hem V, Katz KS et al (2015) Gene: a gene-centered information resource at NCBI. Nucleic Acids Res 43:D36–D42
23. Hubbard T, Barker D, Birney E et al (2002) The Ensembl genome database project. Nucleic Acids Res 30:38–41
24. The UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–D515
25. Stelzer G, Rosen N, Plaschkes I et al (2016) The GeneCards suite: from gene data mining to disease genome sequence analyses. Curr Protoc Bioinformatics 2016:1.30.1–1.30.33
26. Gene Ontology Consortium (2015) Gene ontology consortium: going forward. Nucleic Acids Res 43:D1049–D1056
27. Kanehisa M, Furumichi M, Tanabe M et al (2017) KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res 45:353–361
28. Karp PD, Billington R, Caspi R et al (2017) The BioCyc collection of microbial genomes and metabolic pathways. Brief Bioinform 20:1085–1093
29. Nishimura D (2001) A view from the web, BioCarta. Biotech Softw Internet Rep 2:117–120
30. Pico AR, Kelder T, Van Iersel MP et al (2008) WikiPathways: pathway editing for the people. PLoS Biol 6:e184
31. Finn RD, Attwood TK, Babbitt PC et al (2017) InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res 45:D190–D199
32. Stark C, Breitkreutz B-J, Reguly T et al (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res 34:D535–D539
33. Salwinski L, Miller CS, Smith AJ et al (2004) The database of interacting proteins: 2004 update. Nucleic Acids Res 32:D449–D451
34. Franz M, Rodriguez H, Lopes C et al (2018) GeneMANIA update 2018. Nucleic Acids Res 46:W60–W64
35. Alanis-Lobato G, Andrade-Navarro MA, Schaefer MH (2017) HIPPIE v2.0: enhancing meaningfulness and reliability of protein-protein interaction networks. Nucleic Acids Res 45:D408–D414
36. Prasad KS, Goel R, Kandasamy K et al (2009) Human protein reference database - 2009 update. Nucleic Acids Res 37:D767–D772
37. Kotlyar M, Pastrello C, Sheahan N et al (2016) Integrated interactions database: tissue-specific view of the human and model organism interactomes. Nucleic Acids Res 44:D536–D541
38. Orchard S, Ammari M, Aranda B et al (2014) The MIntAct project - IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 42:D358–D363
39. Pagel P, Kovac S, Oesterheld M et al (2005) The MIPS mammalian protein-protein interaction database. Bioinformatics 21:832–834
40. Licata L, Briganti L, Peluso D et al (2012) MINT, the molecular interaction database: 2012 update. Nucleic Acids Res 40:D857–D861
41. Szklarczyk D, Morris JH, Cook H et al (2017) The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res 45:D362–D368
42. Uhlen M, Fagerberg L, Hallstrom BM et al (2015) Tissue-based map of the human proteome. Science 347:1260419
43. Shannon P, Markiel A, Owen O et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504
44. Bader GD, Hogue CWV (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4:2
45. Morris JH, Apeltsin L, Newman AM et al (2011) ClusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics 12:436
46. Nepusz T, Yu H, Paccanaro A (2012) Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods 9:471–472
47. Bindea G, Mlecnik B, Hackl H et al (2009) ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics 25:1091–1093
48. Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21:3448–3449
49. Scardoni G, Tosadori G, Pratap S et al (2016) Finding the shortest path with PesCa: a tool for network reconstruction. F1000Res 4:484
Chapter 11
Systems-Theoretic Approaches to Design Biological Networks with Desired Functionalities
Priyan Bhattacharya, Karthik Raman, and Arun K. Tangirala
Abstract
The deduction of design principles for complex biological functionalities has been a source of constant interest in the fields of systems and synthetic biology. A number of approaches, ranging from brute force to systems theory-based methodologies, have been adopted to identify the space of network structures or topologies that can demonstrate a specific desired functionality. The former approach involves performing a search among all possible combinations of network structures, as well as the parameters underlying the rate kinetics for a given form of network. In contrast to the search-oriented approach in brute force studies, the present chapter introduces a generic approach inspired by systems theory to deduce the network structures for a particular biological functionality. As a first step, depending on the functionality and the type of network under consideration, a measure of goodness of attainment is deduced by defining performance parameters. These parameters are computed for the most ideal case to obtain the necessary conditions for the given functionality. The necessary conditions are then mapped to specific requirements on the parameters of the dynamical system underlying the network. Following this, admissible minimal structures are deduced. The proposed methodology does not assume any particular rate kinetics for deducing the admissible network structures, beyond a minimal set of assumptions on the rate kinetics. The problem of computing the ideal set of parameters or rate constants, unlike the problem of topology identification, depends on the particular rate kinetics assumed for the given network. In this case, instead of a computationally exhaustive brute force search of the parameter space, a topology–functionality specific optimization problem can be solved. The objective function, along with the feasible region bounded by the motif-specific constraints, amounts to a non-convex optimization program, leading to non-unique parameter sets. To exemplify our approach, we adopt the functionality of adaptation and demonstrate how network topologies that can achieve adaptation can be identified using such a systems-theoretic approach. The outcomes in this case, i.e., minimum network structures for adaptation, are in agreement with the brute force results and other studies in the literature.
Key words Systems biology, Systems theory, Design principles, Stability, Adaptation
1
Introduction An important challenge in synthetic biology is the design of networks, or circuits, that perform very specific functions [1, 2]. A classic approach to the de novo design of such networks is through a brute force search of the space of all possible networks (of a given
Mario Andrea Marchisio (ed.), Computational Methods in Synthetic Biology, Methods in Molecular Biology, vol. 2189, https://doi.org/10.1007/978-1-0716-0822-7_11, © Springer Science+Business Media, LLC, part of Springer Nature 2021
Priyan Bhattacharya et al.
size) that can achieve this functionality. This approach was beautifully demonstrated in the seminal work of Ma and co-workers [3], who identified a small number of topologies that could achieve perfect adaptation, from a space of over 16,000 possibilities. Similar approaches are also carried out for oscillation networks [4]. A contrasting theoretical approach to solve this problem has its roots in systems theory, which offers very important insights to analyze the functionality and the network [5–9]. Interestingly, the connection pattern of a network for a given functionality is shown to be conserved across the organism space [3]. For instance, adaptation involved in bacterial chemotaxis of E. coli involves a similar strategy of negative feedback as in the process of homoeostasis [10]. Furthermore, it is possible to argue that any functionality arising at any stage of the central dogma of any organism retains its design principles—although the dynamics at different levels of biological networks are decidedly different. For instance, the reaction dynamics of a protein network is quite different from that of a metabolic network. According to the above hypotheses, adaptation also employs some basic design principles, which are functionally independent of the organism and the network dynamics to a large extent. Previously, it was shown that adaptation can be performed by a protein network where each node is a protein or enzyme, and the edges represent interactions between the proteins [3]. Although the challenges of retroactivity and context dependence exist in many networks [11], a number of networks do exhibit modularity in the context of a given functionality. Every network at any level of the “central dogma” comprises two distinct kinds of properties: (1) static properties [12] and (2) dynamical properties [6, 13]. 
Static properties such as degree distribution, assortativity, mean path length, and structural controllability help in understanding structural features of a network, for example its robustness to random or targeted attacks, and in identifying the most suitable nodes through which to control it [2, 14, 15]. Dynamic properties, such as steady states and their stability, follow from the underlying dynamical system constituted by the interactions between the various molecules. Systems theory has contributed to formalizing some of these concepts, such as structural controllability [15], and to measuring controllability in realistic biological networks. Several efforts have applied matching theory to analyze large and complex networks [14, 16]. In this chapter, we illustrate the use of systems theory to draw insights about the structural properties of a network from partial or full knowledge of the dynamic behavior of the system. The chapter is organized as follows: Subheading 2 discusses some foundational concepts of systems theory, and Subheading 3 deals with the proposed methodology
Systems Theoretic Approaches to Circuit Design
and its application on adaptation networks. Simulation studies are performed to illustrate the theoretical arguments.
2 Useful Systems Theory Concepts

Systems theory is a well-established field that uses sophisticated mathematical tools to analyze complex systems [17]. Given the scope of this chapter, we select from it only the few concepts and mathematical tools needed to apply the methodology proposed in the following sections.
2.1 Linearization

Linear systems obey the principles of superposition and scaling [17]. A linear system is time-invariant if its response depends only on the time interval from the point of excitation to the present, not on the absolute time of excitation. Given a biological network whose nodes are physical constituents, e.g., proteins or metabolites, and whose edges are the reactions between the nodes, the rate equations for the node concentrations $x$ can be written as

$$\dot{x}(t) = f(x(t), u(t)), \qquad y(t) = g(x(t), u(t)) \tag{1}$$

where $x(t) \in \mathbb{R}^n$, $u(t) \in \mathbb{R}^m$, and $y(t) \in \mathbb{R}^p$ are the states, inputs, and outputs, respectively. For most natural systems, $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ and $g: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^p$ are non-linear with respect to the states and inputs. In these cases, under some specific assumptions, linear systems theory can be used to study the local behavior of the system around a point of interest. A non-linear system can be linearized by evaluating the Jacobians of the vector field $f(x, u)$ and the measurement function $g(x, u)$ with respect to the state variables and inputs. For a differentiable vector function $\phi$, the Jacobian of $\phi$ with respect to a vector $\varepsilon$ is defined as

$$J_{ij} = \frac{\partial \phi_i}{\partial \varepsilon_j} \tag{2}$$

For this setup, the linearized state space model is

$$\delta\dot{x}(t) = A\,\delta x(t) + B\,\delta u(t), \qquad \delta y(t) = C\,\delta x(t) + D\,\delta u(t) \tag{3}$$
where $A \in \mathbb{R}^{n \times n}$ is the Jacobian of $f(x, u)$ with respect to the states $x(t)$, evaluated at the operating condition $(x^*, u^*)$, and $B \in \mathbb{R}^{n \times m}$ is the Jacobian of $f(x, u)$ with respect to the inputs $u$, evaluated at the same operating condition. Similarly, $C \in \mathbb{R}^{p \times n}$ and $D \in \mathbb{R}^{p \times m}$ are the Jacobians of $g(x, u)$ with respect to the states $x$ and the inputs $u$, respectively. The steady state of the system is computed by equating the vector field $f(x, u)$ to zero. The deviation variables, prefixed by $\delta$, denote the perturbation of the system from the operating condition. A non-linear system can have multiple isolated steady states (a multi-state system). In contrast, depending on the rank of $A$, a linear system has either a unique steady state or uncountably many.

The representation of the linear dynamical system in Eq. 3 employs ordinary differential equations (ODEs) involving the states, together with an output that is a function of the states. This is called the state space representation. Alternatively, a linear dynamical system can be represented by a system of algebraic equations involving the frequency domain representation of the states, inputs, and outputs. The transformation from time ($t$) to frequency ($s$) is carried out by the Laplace transformation. The Laplace operator $\mathcal{L}$ is defined as

$$\mathcal{L}(x(t)) = \int_{-\infty}^{\infty} x(\tau)\, e^{-s\tau}\, d\tau$$
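To make the linearization in Eq. 3 concrete, the following is a minimal numerical sketch. The two-node rate law `f`, its parameter values, and the helper `jacobian` are hypothetical and chosen only for illustration; they are not taken from the chapter. The Jacobians are approximated by central differences at a steady state that was found by hand.

```python
def f(x, u):
    # Hypothetical two-node rate equations (illustrative assumption):
    # x1 is produced by the input u and repressed by x2; x2 is activated by x1.
    x1, x2 = x
    return [u / (1.0 + x2) - x1,
            x1 - 0.5 * x2]

def jacobian(func, point, eps=1e-6):
    """Central-difference Jacobian J[i][j] = d func_i / d point_j."""
    n_out = len(func(point))
    columns = []
    for j in range(len(point)):
        up = list(point)
        dn = list(point)
        up[j] += eps
        dn[j] -= eps
        f_up, f_dn = func(up), func(dn)
        columns.append([(f_up[i] - f_dn[i]) / (2.0 * eps) for i in range(n_out)])
    # Transpose the list of columns into row-major form.
    return [[columns[j][i] for j in range(len(point))] for i in range(n_out)]

# Operating condition: for u* = 1 the steady state is x* = (0.5, 1.0),
# because f(x*, u*) = 0 there.
x_star, u_star = [0.5, 1.0], 1.0
residual = f(x_star, u_star)

A = jacobian(lambda x: f(x, u_star), x_star)      # n x n, states
B = jacobian(lambda u: f(x_star, u[0]), [u_star])  # n x m, inputs

print("f(x*, u*) =", residual)
print("A =", A)
print("B =", B)
```

For this assumed rate law the analytic Jacobians are A = [[-1, -0.25], [1, -0.5]] and B = [[0.5], [0]], which the finite-difference estimate should reproduce to within rounding error.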
Assuming that every state of the system is initially relaxed, the ratio of the Laplace transform of the output to that of the input can always be computed explicitly. In a SISO (single-input-single-output) scenario, this ratio is called the transfer function $G(s)$ of the system. For a given MIMO (multiple-input-multiple-output) state space system (Eq. 3), the corresponding transfer function takes a matrix form, $G(s) \in \mathbb{C}^{p \times m}$, and can be directly evaluated as [18]

$$Y(s) = G(s)\,U(s) \tag{4}$$

$$G(s) = C(sI - A)^{-1}B + D \tag{5}$$

Here, consistent with the notation introduced previously, $Y(s)$ and $U(s)$ are the Laplace transforms of the outputs and inputs, respectively. The Laplace variable $s$ is a complex number whose real part, $\mathrm{Re}(s)$, must make $x(\tau)\,e^{-\mathrm{Re}(s)\tau}$ absolutely integrable; this ensures that the Laplace transform of $x(t)$ exists and is unique in the region of convergence. Similarly, the transfer function matrix $G(s)$ is a complex-valued function, which carries useful information about the stability and performance of the system. The inverse Laplace transform of $G(s)$, denoted $g(t)$, is the impulse response. For a linear system, the response $y(t)$ can be expressed as the convolution of the impulse response $g(t)$ with the input $u(t)$ applied to the system.

2.2 Stability
Stability in systems theory refers to the safety of operation and can be related to the notion of convergence. Since a non-linear system can have multiple isolated steady states, the stability of each steady state must be assessed to characterize the overall stability of the system. A steady state $x^*$ of Eq. 1 is stable if, for every $\epsilon > 0$, $\exists\, \delta > 0$ s.t. $\|x_0 - x^*\| \leq \delta \Rightarrow \|x(t) - x^*\| \leq \epsilon$, $\forall t \geq 0$, where $x_0$ is the initial state. Depending on the trajectory $x(t)$, the stability of a steady state can be of two types: (1) internal stability, when the system is autonomous, and (2) input/output stability. Internal stability is further divided into three types:

1. Asymptotically stable: a stable steady state $x^*$ of Eq. 1 is asymptotically stable if $\exists\, \delta > 0$ s.t. $\|x_0 - x^*\| \leq \delta \Rightarrow \|x(t) - x^*\| \to 0$ as $t \to \infty$.
2. Exponentially stable: a steady state $x^*$ of Eq. 1 is exponentially stable if $\exists\, \delta, \lambda > 0$ s.t. $\|x_0 - x^*\| \leq \delta \Rightarrow \|x(t) - x^*\| \leq \|x_0 - x^*\|\, e^{-\lambda t}$, $\forall t \geq 0$. In both cases, $x_0$ is the initial condition.
3. Marginally stable: a stable steady state $x^*$ of Eq. 1 is marginally stable if it is neither of the above.

For linear systems, the conditions for exponential and asymptotic stability coincide: all eigenvalues of the system matrix $A$ must have negative real parts. For systems driven by exogenous inputs, the notion of input-output stability for linear systems is called BIBO (bounded-input-bounded-output) stability. As the term suggests, the response to every bounded input must be bounded for a system to be BIBO-stable. Interestingly, it can be shown that, for linear systems, the conditions for internal stability are sufficient for BIBO stability. The stability of a steady state of a non-linear system can in many cases be assessed from the stability of the linearized system around that steady state. For instance, the Hartman–Grobman theorem [18] guarantees the stability of the non-linear system around a steady state if the linearized system is asymptotically stable. Inconclusive situations arise when $A$ is singular or has purely imaginary pairs of eigenvalues.

2.3 Characterizing Parameters in Linear Time-Invariant (LTI) System Framework
The transfer function representation of an LTI system (Eq. 4) can be characterized by different parameters, such as poles, zeros, and DC gain. For a MIMO system, each element of the transfer function matrix can be represented as

$$G(s)_{i,j} = \left(C(sI - A)^{-1}B + D\right)_{i,j} = K\,\frac{\prod_{k=1}^{Z}(s - z_k)}{\prod_{j=1}^{P}(s - p_j)} \tag{6}$$

where $p_j$, $z_k$, and $K$ are called the poles, zeros, and gain of the system, respectively. $Z$ and $P$ denote the number of zeros and poles, respectively. Note that the zeros and poles are the roots of the numerator and denominator polynomials, respectively. It can be inferred from Eq. 5 that the poles are eigenvalues of the system matrix $A$; therefore, for stability, the poles have to reside in the left half of the s-plane [18]. The zeros influence the overshoot and the final value of the response; for instance, a single zero, or more generally an odd number of zeros, in the right half of the s-plane leads to an inverse response, where the initial slope of the response to a step input is of opposite sign to the final value.
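The role of poles and zeros described above can be illustrated with a small worked case. The transfer function below, G(s) = (1 - s)/((s + 1)(s + 2)), is a hypothetical example chosen for this sketch (it does not appear in the chapter). Its poles lie in the left half-plane, so it is stable, and its single right-half-plane zero should produce an inverse response to a unit step.

```python
import math

# Hypothetical example: G(s) = (1 - s) / ((s + 1)(s + 2)).
poles = [-1.0, -2.0]   # roots of the denominator
zero = 1.0             # root of the numerator, in the right half-plane

def step_response(t):
    # Partial fractions of G(s)/s: 0.5/s - 2/(s + 1) + 1.5/(s + 2),
    # hence y(t) = 0.5 - 2 exp(-t) + 1.5 exp(-2t).
    return 0.5 - 2.0 * math.exp(-t) + 1.5 * math.exp(-2.0 * t)

# Stability: all poles must have negative real part (they are real here).
stable = all(p < 0 for p in poles)

# Initial slope (forward difference) versus final value of the step response.
dt = 1e-3
initial_slope = (step_response(dt) - step_response(0.0)) / dt
final_value = step_response(50.0)
inverse_response = initial_slope * final_value < 0

print("stable:", stable)
print("initial slope:", initial_slope)
print("final value:", final_value)
print("inverse response:", inverse_response)
```

The response starts at zero, initially moves downward, and then settles at the positive DC gain G(0) = 0.5, which is the opposite-sign behavior described in the text.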
3 Methodology and Application

In this section, we discuss the proposed procedure for determining the admissible structures for a given functionality. The first check is whether the functionality can be executed by a well-posed dynamical system, i.e., whether the solution is unique, depends continuously on the initial condition, and does not exhibit finite-time blow-up. As a second step, the functionality is characterized by a set of performance parameters, whose values are evaluated in the ideal scenario, i.e., when the system attains the desired functionality perfectly. These ideal values can then be mapped to constraints on the system parameters, such as pole-zero positions for linear systems or bifurcation points in the case of non-linear systems. The system parameters encode the information about the desired network structure.
3.1 Generalized Algorithm for Finding Design Principles

Algorithm 1 Framework for finding all possible network structures
1: Check whether the functionality can be achieved by well-posed dynamics.
2: Analyze the given functionality with some parameters which can later be used as performance indices.
3: Evaluate the values of the performance parameters for the ideal case.
4: Translate these conditions to constraints on the parameters (pole, zero position, gain, etc.) of the system.
5: Map these conditions on to the physical entities, i.e., connections and types of connections in the underlying network.
6: In the case of linearized models, ensure that the above-derived conditions for the concerned functionality are satisfied at the modified (due to the change in the input) fixed point.

To illustrate the efficacy of the above method, we use adaptation as an exemplar functionality. Adaptation is defined as the ability of living organisms to sense and respond to a change in the environment, yet (ideally) revert to the pre-stimulus steady state. This section sets up a design-oriented problem for obtaining potential network structures capable of adaptation, instead of employing an exhaustive search across a topology-parameter space, as has been done before [3]. As a first step, the conditions for a specific biological functionality are determined from the perspective of LTI systems theory. Imposing these specifications on the linearized version of the rate equations resolves the problem of determining design principles [5, 6, 19, 20]. The extent of adaptation can be measured by two parameters, namely sensitivity and precision [3]. Sensitivity is the ratio of the relative difference between the peak value of the output, $O_{\text{peak}}$, and its initial steady state, $O_1$, to the relative difference between the two input levels, $I_1$ and $I_2$:

$$\text{Sensitivity} = \frac{(O_{\text{peak}} - O_1)/O_1}{(I_2 - I_1)/I_1} \tag{7}$$
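Precision, on the other hand, is a steady-state measure, calculated as the inverse of the ratio of the relative difference between the two output steady-state levels, $O_1$ and $O_2$, to that of the input levels [3], so that $O_2 = O_1$ corresponds to infinite precision:

$$\text{Precision} = \left(\frac{(O_2 - O_1)/O_1}{(I_2 - I_1)/I_1}\right)^{-1} \tag{8}$$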
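As a minimal sketch of how Eqs. 7 and 8 might be evaluated on a sampled response, consider the helper below. The function name `adaptation_metrics`, the convention that the first and last samples stand for the pre- and post-stimulus steady states, and the use of absolute values are assumptions of this sketch rather than part of the chapter.

```python
import math

def adaptation_metrics(output, i1, i2):
    """Return (sensitivity, precision) in the spirit of Eqs. 7 and 8.

    Assumptions of this sketch: output[0] is the pre-stimulus steady
    state O1, output[-1] approximates the post-stimulus steady state O2,
    and magnitudes are used so that both measures are non-negative.
    """
    o1, o2 = output[0], output[-1]
    # Peak = sample with the largest excursion from the initial steady state.
    o_peak = max(output, key=lambda v: abs(v - o1))
    rel_input = (i2 - i1) / i1
    sensitivity = abs(((o_peak - o1) / o1) / rel_input)
    rel_output = abs((o2 - o1) / o1)
    # Perfect adaptation (O2 == O1) corresponds to infinite precision.
    precision = math.inf if rel_output == 0 else abs(rel_input) / rel_output
    return sensitivity, precision

# Made-up trace: the output rises from 1.0 to a peak of 1.5 and settles at 1.1
# after the input is doubled from 1 to 2 (imperfect adaptation).
trace = [1.0, 1.3, 1.5, 1.2, 1.1]
sens, prec = adaptation_metrics(trace, i1=1.0, i2=2.0)
print("sensitivity:", sens)
print("precision:", prec)

# A trace that returns exactly to its initial value has infinite precision.
_, prec_perfect = adaptation_metrics([1.0, 1.4, 1.0], i1=1.0, i2=2.0)
print("precision (perfect):", prec_perfect)
```

In practice the steady-state levels would be estimated from the tail of a sufficiently long simulation rather than from a single sample.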
The case of perfect adaptation can be characterized as “infinite precision, non-zero finite sensitivity, and minimum peak time.”

3.2 Translation of Perfect Adaptation to Systems Theory Requirements
A perfectly adapting system, i.e., one with infinite precision, should be sensitive to changes in the input and be able to drive the response back to its previous steady-state value. Using LTI systems theory, these conditions translate to restrictions on $G(s)$: (1) zero final value of the output and (2) non-zero peak value. Since the system of rate equations is linearized around a stable fixed point, the initial value of the output (the deviation from the fixed point) of the linearized system should also be zero. For a step change in the input, the output of the LTI system is

$$Y(s) = \frac{G(s)}{s} \tag{9}$$

where $Y(s)$ is the Laplace transform of the output $y(t)$. By the initial value theorem, the initial value of $y(t)$ equals the limit of $sY(s) = G(s)$ as $s \to \infty$, so the requirement of zero initial value becomes

$$\lim_{s \to \infty} G(s) = 0 \tag{10}$$
The above equality holds only when $G(s)$ is a strictly proper transfer function, i.e., the number of zeros is less than the number of poles. The condition of zero final value can only be attained by a zero-gain system:

$$\lim_{s \to 0} G(s) = 0 \tag{11}$$
The above-derived condition can be attained if and only if the system has at least one zero at the origin, which leads to an output with zero final value. The condition of non-zero peak value can easily be ensured by any dynamical system. The more important demand here is to minimize the peak time since, in reality, the system should sense the change in input as soon as possible. It can be shown that the peak time of a system is minimized if the zero is placed at the origin. To establish this, consider a proper LTI system $G(s)$ and another system $H(s)$ with the same singularities (all real), except that one zero of $H(s)$ lies at the origin. Let $y_1(t)$ ($Y_1(s)$) and $y_2(t)$ ($Y_2(s)$) be the step responses, and $t_{p1}$ and $t_{p2}$ the peak times, of $G(s)$ and $H(s)$, respectively.

$$G(s) = K\,\frac{(s + z_1)\prod_{k=2}^{n}(s + z_k)}{\prod_{i=1}^{m}(s + p_i)}, \qquad H(s) = K\,\frac{s\prod_{k=2}^{n}(s + z_k)}{\prod_{i=1}^{m}(s + p_i)} \tag{12}$$

$$G(s) = H(s) + z_1\,\frac{H(s)}{s} \tag{13}$$

$$Y_1(s) = Y_2(s) + z_1\,\frac{Y_2(s)}{s} \tag{14}$$

$$\dot{y}_1(t) = \dot{y}_2(t) + z_1\,\frac{d}{dt}\int_0^t y_2(\tau)\,d\tau = \dot{y}_2(t) + z_1\,y_2(t) \tag{15}$$

$$\text{Putting } t = t_{p2} \tag{16}$$

$$\dot{y}_1(t)\big|_{t = t_{p2}} = 0 + z_1 \max\left(y_2(t)\right) > 0 \tag{17}$$

$$t_{p1} > t_{p2} \tag{18}$$

Here Eq. 17 uses the fact that $\dot{y}_2$ vanishes at the peak of $y_2$, and the inequality assumes $z_1 > 0$ and a positive peak. Since $y_1(t)$ is still increasing at $t_{p2}$, its peak occurs later, which gives Eq. 18.

The above result can be extended to damped oscillatory systems as well. From Eq. 14 it can be seen that

$$y_1(t) = y_2(t) + z_1 \int_0^t y_2(\tau)\,d\tau \tag{19}$$

The peak time of $y_2(t)$ is always less than or equal to that of its integral $\int_0^t y_2(\tau)\,d\tau$; therefore, their combination $y_1(t)$ has a peak time always greater than or equal to that of $y_2(t)$.
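The peak-time argument of Eqs. 12 to 19 can be checked numerically on a small example. The sketch below assumes two hypothetical systems that share the poles -1 and -2: H(s) = s/((s + 1)(s + 2)), with its zero at the origin, and G(s) = (s + z1)/((s + 1)(s + 2)) with z1 = 0.5. These particular values are illustrative choices, not taken from the chapter.

```python
import math

Z1 = 0.5  # hypothetical zero of G(s) at s = -Z1

def y2(t):
    # Step response of H(s) = s / ((s + 1)(s + 2)):
    # Y2(s) = 1 / ((s + 1)(s + 2))  ->  y2(t) = exp(-t) - exp(-2t).
    return math.exp(-t) - math.exp(-2.0 * t)

def y1(t):
    # Step response of G(s), built from Eq. 19:
    # y1(t) = y2(t) + Z1 * integral_0^t y2(tau) dtau (closed form below).
    integral = 0.5 - math.exp(-t) + 0.5 * math.exp(-2.0 * t)
    return y2(t) + Z1 * integral

def peak_time(y, t_end=20.0, dt=1e-3):
    # Time of the largest sample on a uniform grid.
    times = [k * dt for k in range(int(t_end / dt) + 1)]
    return max(times, key=y)

tp1, tp2 = peak_time(y1), peak_time(y2)
print("peak time of G(s) step response:", tp1)
print("peak time of H(s) step response:", tp2)
print("final values:", y1(20.0), y2(20.0))
```

For these values the peaks fall at ln 3 for G(s) and ln 2 for H(s), so the system with the zero at the origin peaks earlier. It is also the only one of the two whose step response returns to zero; the response of G(s) settles at G(0) = 0.25.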
3.3 Adaptation in Biological Networks
In this chapter, we first consider a protein network [3] as our system, and then extend our method to other networks in the literature that are reported to provide adaptation, e.g., voltage-gated sodium ion channels and gene regulatory networks [21, 22]. A protein network can be characterized by proteins as nodes and directed chemical reactions as edges. We first consider a protein network comprising two proteins, with concentrations $x_1$ and $x_2$ (Fig. 1). The input is applied to the first protein, and the output is taken as the concentration of the second protein, $x_2$. According to the framework, the system of non-linear rate equations is linearized with respect to a stable fixed point. The $A$ matrix of the state space representation of the linearized system is obtained as

$$A_{ij} = \alpha_{ij} = \left.\frac{\partial f_i}{\partial x_j}\right|_{(x_1^*,\, x_2^*)}$$

As there is no direct link between the input $I$ and the output protein, the Jacobian of $g(x, u)$ with respect to $u$ is zero, i.e., $D = 0$ (recall Eq. 3). The transfer function is obtained as

$$G(s) = \frac{\beta\,\alpha_{21}}{\prod_{k=1}^{2}(s - \lambda_k)} \tag{20}$$

where

$$\beta = \left.\frac{\partial f_1}{\partial I}\right|_{(x_1^*,\, x_2^*)}, \quad \text{and } \lambda_k \text{ are the eigenvalues of } A \tag{21}$$

The above transfer function satisfies the condition of zero initial value because it is strictly proper, i.e., the number of poles is greater than the number of zeros. However, because it has no zero, it cannot ensure a zero final value, irrespective of the positions of the poles. Hence, this structure is incapable of providing perfect adaptation. Cases with finite precision and sensitivity (imperfect adaptation) can nevertheless be considered for this system, and they merit exploration for the purpose of deducing constraints in semi-perfect adaptation scenarios.
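The final-value argument for the two-protein network can be written out explicitly. As a short worked step, assuming a unit step in the input and a stable linearized system (both eigenvalues of $A$ in the open left half-plane), the final value theorem applied to Eqs. 9 and 20 gives:

```latex
\lim_{t \to \infty} y(t)
  = \lim_{s \to 0} s\,Y(s)
  = \lim_{s \to 0} G(s)
  = \frac{\beta\,\alpha_{21}}{\lambda_1 \lambda_2}
  = \frac{\beta\,\alpha_{21}}{\det A}
  \neq 0
  \quad \text{whenever } \beta\,\alpha_{21} \neq 0 .
```

Since the input must reach the output for the network to respond at all, $\beta\,\alpha_{21}$ cannot vanish, and the output therefore settles at a non-zero deviation from its pre-stimulus value.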
Fig. 1 Schematic representation of a two-protein network. Blue edge represents activation, while a red edge signifies repression
For adaptation, the system should sense the change in the input, and the output $y(t)$ should increase until it reaches its peak. Therefore, the slope of the output should be positive (ideally monotonically decreasing) until the peak time. Later, as the output decreases, the slope should be negative (ideally monotonically increasing) after the peak time, and at large times the slope returns to zero. For the linearized system in Eq. 20,

$$Y(s) = \frac{G(s)}{s}, \qquad \dot{y}(t) = g(t)$$

$$G(s) = \frac{\beta\,\alpha_{21}}{\prod_{k=1}^{2}(s - \lambda_k)}, \qquad g(t) = \beta\,\alpha_{21}\left(\frac{e^{\lambda_1 t}}{\lambda_1 - \lambda_2} - \frac{e^{\lambda_2 t}}{\lambda_1 - \lambda_2}\right)$$

Without loss of generality, assume $|\lambda_2| > |\lambda_1|$ with both poles real and negative; then $\lambda_1 - \lambda_2 > 0$ and $e^{\lambda_1 t} > e^{\lambda_2 t}$ for $t > 0$. In this case $g(t)$ never changes sign; in particular, for $\beta\,\alpha_{21} > 0$ it can never be negative if the poles are real, so the output cannot turn back towards its initial value. If the system is under-damped, these networks can show damped oscillation, which in some restricted scenarios can provide adaptation, as discussed below. The linearized state equations are

$$\begin{bmatrix} \delta\dot{x}_1 \\ \delta\dot{x}_2 \end{bmatrix} = \begin{bmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{bmatrix} \begin{bmatrix} \delta x_1 \\ \delta x_2 \end{bmatrix} + \begin{bmatrix} \beta_1 \\ 0 \end{bmatrix} \delta I$$

The eigenvalues of the $A$ matrix are

$$\lambda_{1,2} = \frac{\alpha_{11} + \alpha_{22}}{2} \pm \frac{\sqrt{(\alpha_{11} - \alpha_{22})^2 + 4\,\alpha_{12}\alpha_{21}}}{2} \tag{22}$$

For damped oscillation, the conditions are
1. $\alpha_{ii} < 0$
2. $|4\,\alpha_{12}\alpha_{21}| > (\alpha_{11} - \alpha_{22})^2$
3. $4\,\alpha_{12}\alpha_{21} < 0$

The last condition for imperfect adaptation translates to a requirement of negative feedback between the two proteins. To overcome the inability of a two-node network to provide perfect adaptation, another intermediate protein, with concentration $x_3$, is considered along with the input and output proteins, as shown in Fig. 2. The associated transfer function for the system is computed as

$$G(s) = \beta_1\,\alpha_{21}\,\frac{s + \dfrac{\alpha_{31}\alpha_{23} - \alpha_{21}\alpha_{33}}{\alpha_{21}}}{\prod_{i=1}^{3}\left(s - \lambda_i(A)\right)} \tag{23}$$
As derived in the previous section, for perfect adaptation, the zero of the linearized system should be positioned at the origin. Additionally, the system should also satisfy the stability criteria.
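The effect of placing the zero of the three-protein transfer function at the origin can be sketched with a short simulation. The Jacobian entries below are hypothetical values chosen so that the linearized system is stable and forms a negative feedback loop from node 1 to node 2 to node 3 and back to node 1; they are not taken from the chapter. The output is the deviation of node 2, the step enters through node 1, and a fixed-step fourth-order Runge-Kutta scheme is used for integration.

```python
def simulate_step(alpha33, t_end=100.0, dt=0.01):
    """Unit-step response (delta x2) of an assumed linearized three-node model."""
    # Hypothetical Jacobian: alpha21 = 1, alpha32 = 1, alpha13 = -1 (repression),
    # alpha31 = alpha23 = 0, self-degradation on nodes 1 and 2.
    A = [[-1.0, 0.0, -1.0],
         [1.0, -1.0, 0.0],
         [0.0, 1.0, alpha33]]
    B = [1.0, 0.0, 0.0]  # beta_1 = 1, input acts on node 1 only

    def rhs(x):
        return [sum(A[i][j] * x[j] for j in range(3)) + B[i] for i in range(3)]

    x = [0.0, 0.0, 0.0]  # deviations start at the fixed point
    trace = [x[1]]
    for _ in range(int(t_end / dt)):
        k1 = rhs(x)
        k2 = rhs([x[i] + 0.5 * dt * k1[i] for i in range(3)])
        k3 = rhs([x[i] + 0.5 * dt * k2[i] for i in range(3)])
        k4 = rhs([x[i] + dt * k3[i] for i in range(3)])
        x = [x[i] + dt * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) / 6.0
             for i in range(3)]
        trace.append(x[1])
    return trace

# alpha33 = 0 gives alpha31*alpha23 - alpha21*alpha33 = 0: zero at the origin.
perfect = simulate_step(alpha33=0.0)
# alpha33 = -0.5 moves the zero away from the origin.
leaky = simulate_step(alpha33=-0.5)

print("peak (zero at origin):", max(perfect))
print("final value (zero at origin):", perfect[-1])
print("final value (zero off origin):", leaky[-1])
```

With the zero at the origin the output shows a transient peak and then returns to its pre-stimulus level, whereas for `alpha33 = -0.5` it settles at a non-zero offset. For these assumed values that offset is 1/3, which can be confirmed by solving the steady-state equations by hand.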
Fig. 2 A possible network structure of three-protein network. Blue represents activation, while red signifies repression
Table 1 Possible motifs for adaptation Possibilities
Final condition
α31α32α31 < 0
Gross ve feedback
α32α23 < 0
ve feedback between and
α13α31 < 0
ve feedback between and
α31 αα23 21